[Qemu-block] [PATCH 1/2] qcow2: Prefer 'entries' over 'size' for non-byte values in spec
We want to limit the use of the term 'size' to only those values that
count bytes. Renaming fields in the spec does not invalidate any
existing implementation, but may make future implementations easier to
write. A reasonable followup would be to rename internal qemu code that
operates on qcow2 images to also use the distinction between size and
entries in variable names.

Signed-off-by: Eric Blake

---
 docs/interop/qcow2.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
index d7fdb1fee31..597d3f261d5 100644
--- a/docs/interop/qcow2.txt
+++ b/docs/interop/qcow2.txt
@@ -47,7 +47,7 @@ The first cluster of a qcow2 image contains the file header:
                     1 for AES encryption
                     2 for LUKS encryption

-         36 - 39:   l1_size
+         36 - 39:   l1_entries
                     Number of entries in the active L1 table

          40 - 47:   l1_table_offset
@@ -538,7 +538,7 @@ Structure of a bitmap directory entry:
                     (described below) for the bitmap starts. Must be aligned
                     to a cluster boundary.

-         8 - 11:    bitmap_table_size
+         8 - 11:    bitmap_table_entries
                     Number of entries in the bitmap table of the bitmap.

          12 - 15:   flags
-- 
2.14.3
[Qemu-block] [RFC PATCH 0/2] s/size/entries/ when dealing with non-byte units
I mentioned this while reviewing Berto's series on L2 slice handling;
this is a first cut at patches that I think are worth doing throughout
the qcow2 code base, if we like the idea.

Eric Blake (2):
  qcow2: Prefer 'entries' over 'size' for non-byte values in spec
  qcow2: Prefer 'entries' over 'size' during cache creation

 docs/interop/qcow2.txt |  4 ++--
 block/qcow2.h          |  4 ++--
 block/qcow2.c          | 21 +++++++++++----------
 3 files changed, 15 insertions(+), 14 deletions(-)

-- 
2.14.3
[Qemu-block] [PATCH 2/2] qcow2: Prefer 'entries' over 'size' during cache creation
Using 'size' for anything other than bytes is difficult to reason
about; let's rename the variables related to the number of entries in
a cache accordingly.

Signed-off-by: Eric Blake

---
 block/qcow2.h |  4 ++--
 block/qcow2.c | 21 +++++++++++----------
 2 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 883802241fb..0daf8e6d6f8 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -68,10 +68,10 @@
 #define MAX_CLUSTER_BITS 21

 /* Must be at least 2 to cover COW */
-#define MIN_L2_CACHE_SIZE 2 /* cache entries */
+#define MIN_L2_CACHE_ENTRIES 2

 /* Must be at least 4 to cover all cases of refcount table growth */
-#define MIN_REFCOUNT_CACHE_SIZE 4 /* clusters */
+#define MIN_REFCOUNT_CACHE_ENTRIES 4

 /* Whichever is more */
 #define DEFAULT_L2_CACHE_CLUSTERS 8 /* clusters */

diff --git a/block/qcow2.c b/block/qcow2.c
index 288b5299d80..f25c33df1d1 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -843,6 +843,7 @@ static int qcow2_update_options_prepare(BlockDriverState *bs,
     const char *opt_overlap_check, *opt_overlap_check_template;
     int overlap_check_template = 0;
     uint64_t l2_cache_size, l2_cache_entry_size, refcount_cache_size;
+    uint64_t l2_cache_entries, refcount_cache_entries;
     int i;
     const char *encryptfmt;
     QDict *encryptopts = NULL;
@@ -869,21 +870,21 @@ static int qcow2_update_options_prepare(BlockDriverState *bs,
         goto fail;
     }

-    l2_cache_size /= l2_cache_entry_size;
-    if (l2_cache_size < MIN_L2_CACHE_SIZE) {
-        l2_cache_size = MIN_L2_CACHE_SIZE;
+    l2_cache_entries = l2_cache_size / l2_cache_entry_size;
+    if (l2_cache_entries < MIN_L2_CACHE_ENTRIES) {
+        l2_cache_entries = MIN_L2_CACHE_ENTRIES;
     }
-    if (l2_cache_size > INT_MAX) {
+    if (l2_cache_entries > INT_MAX) {
         error_setg(errp, "L2 cache size too big");
         ret = -EINVAL;
         goto fail;
     }

-    refcount_cache_size /= s->cluster_size;
-    if (refcount_cache_size < MIN_REFCOUNT_CACHE_SIZE) {
-        refcount_cache_size = MIN_REFCOUNT_CACHE_SIZE;
+    refcount_cache_entries = refcount_cache_size / s->cluster_size;
+    if (refcount_cache_entries < MIN_REFCOUNT_CACHE_ENTRIES) {
+        refcount_cache_entries = MIN_REFCOUNT_CACHE_ENTRIES;
     }
-    if (refcount_cache_size > INT_MAX) {
+    if (refcount_cache_entries > INT_MAX) {
         error_setg(errp, "Refcount cache size too big");
         ret = -EINVAL;
         goto fail;
@@ -908,9 +909,9 @@ static int qcow2_update_options_prepare(BlockDriverState *bs,
     }

     r->l2_slice_size = l2_cache_entry_size / sizeof(uint64_t);
-    r->l2_table_cache = qcow2_cache_create(bs, l2_cache_size,
+    r->l2_table_cache = qcow2_cache_create(bs, l2_cache_entries,
                                            l2_cache_entry_size);
-    r->refcount_block_cache = qcow2_cache_create(bs, refcount_cache_size,
+    r->refcount_block_cache = qcow2_cache_create(bs, refcount_cache_entries,
                                                  s->cluster_size);
     if (r->l2_table_cache == NULL || r->refcount_block_cache == NULL) {
         error_setg(errp, "Could not allocate metadata caches");
-- 
2.14.3
Re: [Qemu-block] qcow2 images corruption
On 02/13/2018 04:41 AM, Kevin Wolf wrote:
> Am 07.02.2018 um 18:06 hat Nicolas Ecarnot geschrieben:
>> TL;DR: qcow2 images keep getting corrupted. Any workaround?
>
> Not without knowing the cause.
>
> The first thing to make sure is that the image isn't touched by a second
> process while QEMU is running a VM. The classic one is using 'qemu-img
> snapshot' on the image of a running VM, which is instant corruption (and
> newer QEMU versions have locking in place to prevent this), but we have
> seen more absurd cases of things outside QEMU tampering with the image
> when we were investigating previous corruption reports.
>
> This covers the majority of all reports; we haven't had a real
> corruption caused by a QEMU bug in ages.
>
>> After having found (https://access.redhat.com/solutions/1173623) the right
>> logical volume hosting the qcow2 image, I can run qemu-img check on it.
>> - On 80% of my VMs, I find no errors.
>> - On 15% of them, I find leaked cluster errors that I can correct using
>>   "qemu-img check -r all".
>> - On 5% of them, I find leaked cluster errors and further fatal errors,
>>   which can not be corrected with qemu-img. In rare cases, qemu-img can
>>   correct them but destroys large parts of the image (it becomes
>>   unusable), and in other cases it can not correct them at all.
>
> It would be good if you could make the 'qemu-img check' output available
> somewhere.
>
> It would be even better if we could have a look at the respective image.
> I seem to remember that John (CCed) had a few scripts to analyse
> corrupted qcow2 images; maybe we would be able to see something there.

Hi! I did write a pretty simplistic tool for trying to tell the shape
of a corruption at a glance. It seems to work pretty similarly to the
other tool you already found, but it won't hurt anything to run it:

https://github.com/jnsnow/qcheck

(Actually, that other tool looks like it has an awful lot of options.
I'll have to check it out.)

It can print a really upsetting amount of data (especially for very
corrupt images), but in the default case the simple setting should do
the trick just fine. You could always put the output from this tool in
a pastebin too; it might help me visualize the problem a bit more -- I
find seeing the exact offsets and locations of all the various tables
and things to be pretty helpful.

You can also always use the "deluge" option and compress it if you
want; just don't let it print to your terminal:

jsnow@probe (dev) ~/s/qcheck> ./qcheck -xd
/home/bos/jsnow/src/qemu/bin/git/install_test_f26.qcow2 > deluge.log;
and ls -sh deluge.log
4.3M deluge.log

but it compresses down very well:

jsnow@probe (dev) ~/s/qcheck> 7z a -t7z -m0=ppmd deluge.ppmd.7z deluge.log
jsnow@probe (dev) ~/s/qcheck> ls -s deluge.ppmd.7z
316 deluge.ppmd.7z

So I suppose if you want to send along:
(1) The basic output without any flags, in a pastebin
(2) The zipped deluge output, just in case

and I will try my hand at guessing what went wrong.

(Also, maybe my tool will totally choke on your image, who knows. It
hasn't received an overwhelming amount of testing apart from when I go
to use it personally and inevitably wind up displeased with how it
handles certain situations, so...)

>> What I read similar to my case is:
>> - usage of qcow2
>> - heavy disk I/O
>> - using the virtio-blk driver
>>
>> In the proxmox thread, they tend to say that using virtio-scsi is the
>> solution. Having asked this question to oVirt experts
>> (https://lists.ovirt.org/pipermail/users/2018-February/086753.html),
>> it's not clear the driver is to blame.
>
> This seems very unlikely. The corruption you're seeing is in the qcow2
> metadata, not only in the guest data. If anything, virtio-scsi exercises
> more qcow2 code paths than virtio-blk, so any potential bug that affects
> virtio-blk should also affect virtio-scsi, but not the other way around.
>
>> I agree with the answer Yaniv Kaul gave to me, saying I have to
>> properly report the issue, so I'm longing to know which particular
>> information I can give you now.
>
> To be honest, debugging corruption after the fact is pretty hard. We'd
> need the 'qemu-img check' output and ideally the image to do anything,
> but I can't promise that anything would come out of this.
>
> Best would be a reproducer, or at least some operation that you can link
> to the appearance of the corruption. Then we could take a more targeted
> look at the respective code.
>
>> As you can imagine, all this setup is in production, and for most of
>> the VMs I can not "play" with them. Moreover, we launched a campaign
>> of nightly stopping every VM, running qemu-img check on them one by
>> one, then booting.
>> So it might take some time before I find another corrupted image
>> (which I'll preciously store for debugging).
>>
>> Other information: we very rarely do snapshots, but I'm close to
>> imagining that automated migrations of VMs could trigger similar
>> behaviors on qcow2 images.
>
> To my
[Qemu-block] [PATCH v8 21/21] block: Drop unused .bdrv_co_get_block_status()
We are gradually moving away from sector-based interfaces, towards
byte-based. Now that all drivers have been updated to provide the
byte-based .bdrv_co_block_status(), we can delete the sector-based
interface.

Signed-off-by: Eric Blake
Reviewed-by: Vladimir Sementsov-Ogievskiy
Reviewed-by: Fam Zheng

---
v7: no change
v6: rebase to changes in patch 1, drop R-b
v5: rebase to master
v4: rebase to interface tweak
v3: no change
v2: rebase to earlier changes
---
 include/block/block_int.h |  3 ---
 block/io.c                | 50 ++++++++++----------------------------------------
 2 files changed, 10 insertions(+), 43 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index bf2598856cf..5ae7738cf8d 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -215,9 +215,6 @@ struct BlockDriver {
      * as well as non-NULL pnum, map, and file; in turn, the driver
      * must return an error or set pnum to an aligned non-zero value.
      */
-    int64_t coroutine_fn (*bdrv_co_get_block_status)(BlockDriverState *bs,
-        int64_t sector_num, int nb_sectors, int *pnum,
-        BlockDriverState **file);
     int coroutine_fn (*bdrv_co_block_status)(BlockDriverState *bs,
         bool want_zero, int64_t offset, int64_t bytes, int64_t *pnum,
         int64_t *map, BlockDriverState **file);

diff --git a/block/io.c b/block/io.c
index 5bae79f282e..4c3dba09730 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1963,7 +1963,7 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,

     /* Must be non-NULL or bdrv_getlength() would have failed */
     assert(bs->drv);
-    if (!bs->drv->bdrv_co_get_block_status && !bs->drv->bdrv_co_block_status) {
+    if (!bs->drv->bdrv_co_block_status) {
         *pnum = bytes;
         ret = BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED;
         if (offset + bytes == total_size) {
@@ -1981,53 +1981,23 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,

     /* Round out to request_alignment boundaries */
     align = bs->bl.request_alignment;
-    if (bs->drv->bdrv_co_get_block_status && align < BDRV_SECTOR_SIZE) {
-        align = BDRV_SECTOR_SIZE;
-    }
     aligned_offset = QEMU_ALIGN_DOWN(offset, align);
     aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;

-    if (bs->drv->bdrv_co_get_block_status) {
-        int count; /* sectors */
-        int64_t longret;
-
-        assert(QEMU_IS_ALIGNED(aligned_offset | aligned_bytes,
-                               BDRV_SECTOR_SIZE));
-        /*
-         * The contract allows us to return pnum smaller than bytes, even
-         * if the next query would see the same status; we truncate the
-         * request to avoid overflowing the driver's 32-bit interface.
-         */
-        longret = bs->drv->bdrv_co_get_block_status(
-            bs, aligned_offset >> BDRV_SECTOR_BITS,
-            MIN(INT_MAX, aligned_bytes) >> BDRV_SECTOR_BITS, &count,
-            &local_file);
-        if (longret < 0) {
-            assert(INT_MIN <= longret);
-            ret = longret;
-            goto out;
-        }
-        if (longret & BDRV_BLOCK_OFFSET_VALID) {
-            local_map = longret & BDRV_BLOCK_OFFSET_MASK;
-        }
-        ret = longret & ~BDRV_BLOCK_OFFSET_MASK;
-        *pnum = count * BDRV_SECTOR_SIZE;
-    } else {
-        ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
-                                            aligned_bytes, pnum, &local_map,
-                                            &local_file);
-        if (ret < 0) {
-            *pnum = 0;
-            goto out;
-        }
-        assert(*pnum); /* The block driver must make progress */
+    ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
+                                        aligned_bytes, pnum, &local_map,
+                                        &local_file);
+    if (ret < 0) {
+        *pnum = 0;
+        goto out;
     }

     /*
-     * The driver's result must be a multiple of request_alignment.
+     * The driver's result must be a non-zero multiple of request_alignment.
      * Clamp pnum and adjust map to original request.
      */
-    assert(QEMU_IS_ALIGNED(*pnum, align) && align > offset - aligned_offset);
+    assert(*pnum && QEMU_IS_ALIGNED(*pnum, align) &&
+           align > offset - aligned_offset);
     *pnum -= offset - aligned_offset;
     if (*pnum > bytes) {
         *pnum = bytes;
-- 
2.14.3
[Qemu-block] [PATCH v8 19/21] vpc: Switch to .bdrv_co_block_status()
We are gradually moving away from sector-based interfaces, towards
byte-based. Update the vpc driver accordingly.

Signed-off-by: Eric Blake
Reviewed-by: Vladimir Sementsov-Ogievskiy
Reviewed-by: Fam Zheng

---
v7: tweak commit message and type of 'n' [Fam]
v6: no change
v5: fix incorrect rounding in 'map' and bad loop condition [Vladimir]
v4: rebase to interface tweak
v3: rebase to master
v2: drop get_sector_offset() [Kevin], rebase to mapping flag
---
 block/vpc.c | 45 +++++++++++++++++++++++----------------------
 1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/block/vpc.c b/block/vpc.c
index cfa5144e867..fba4492fd7b 100644
--- a/block/vpc.c
+++ b/block/vpc.c
@@ -706,53 +706,54 @@ fail:
     return ret;
 }

-static int64_t coroutine_fn vpc_co_get_block_status(BlockDriverState *bs,
-        int64_t sector_num, int nb_sectors, int *pnum,
-        BlockDriverState **file)
+static int coroutine_fn vpc_co_block_status(BlockDriverState *bs,
+                                            bool want_zero,
+                                            int64_t offset, int64_t bytes,
+                                            int64_t *pnum, int64_t *map,
+                                            BlockDriverState **file)
 {
     BDRVVPCState *s = bs->opaque;
     VHDFooter *footer = (VHDFooter*) s->footer_buf;
-    int64_t start, offset;
+    int64_t image_offset;
     bool allocated;
-    int64_t ret;
-    int n;
+    int ret;
+    int64_t n;

     if (be32_to_cpu(footer->type) == VHD_FIXED) {
-        *pnum = nb_sectors;
+        *pnum = bytes;
+        *map = offset;
         *file = bs->file->bs;
-        return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID |
-               (sector_num << BDRV_SECTOR_BITS);
+        return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID;
     }

     qemu_co_mutex_lock(&s->lock);

-    offset = get_image_offset(bs, sector_num << BDRV_SECTOR_BITS, false, NULL);
-    start = offset;
-    allocated = (offset != -1);
+    image_offset = get_image_offset(bs, offset, false, NULL);
+    allocated = (image_offset != -1);
     *pnum = 0;
     ret = 0;

     do {
         /* All sectors in a block are contiguous (without using the bitmap) */
-        n = ROUND_UP(sector_num + 1, s->block_size / BDRV_SECTOR_SIZE)
-          - sector_num;
-        n = MIN(n, nb_sectors);
+        n = ROUND_UP(offset + 1, s->block_size) - offset;
+        n = MIN(n, bytes);

         *pnum += n;
-        sector_num += n;
-        nb_sectors -= n;
+        offset += n;
+        bytes -= n;
         /* *pnum can't be greater than one block for allocated
          * sectors since there is always a bitmap in between. */
         if (allocated) {
             *file = bs->file->bs;
-            ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID | start;
+            *map = image_offset;
+            ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
             break;
         }
-        if (nb_sectors == 0) {
+        if (bytes == 0) {
             break;
         }
-        offset = get_image_offset(bs, sector_num << BDRV_SECTOR_BITS, false,
-                                  NULL);
-    } while (offset == -1);
+        image_offset = get_image_offset(bs, offset, false, NULL);
+    } while (image_offset == -1);

     qemu_co_mutex_unlock(&s->lock);
     return ret;
@@ -1098,7 +1099,7 @@ static BlockDriver bdrv_vpc = {
     .bdrv_co_preadv         = vpc_co_preadv,
     .bdrv_co_pwritev        = vpc_co_pwritev,
-    .bdrv_co_get_block_status = vpc_co_get_block_status,
+    .bdrv_co_block_status   = vpc_co_block_status,

     .bdrv_get_info          = vpc_get_info,
-- 
2.14.3
[Qemu-block] [PATCH v8 20/21] vvfat: Switch to .bdrv_co_block_status()
We are gradually moving away from sector-based interfaces, towards
byte-based. Update the vvfat driver accordingly. Note that we can
rely on the block driver having already clamped limits to our block
size, and simplify accordingly.

Signed-off-by: Eric Blake
Reviewed-by: Vladimir Sementsov-Ogievskiy
Reviewed-by: Fam Zheng

---
v5-v7: no change
v4: rebase to interface tweak
v3: no change
v2: rebase to earlier changes, simplify
---
 block/vvfat.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/block/vvfat.c b/block/vvfat.c
index 7e06ebacf61..4a17a49e128 100644
--- a/block/vvfat.c
+++ b/block/vvfat.c
@@ -3088,15 +3088,13 @@ vvfat_co_pwritev(BlockDriverState *bs, uint64_t offset, uint64_t bytes,
     return ret;
 }

-static int64_t coroutine_fn vvfat_co_get_block_status(BlockDriverState *bs,
-        int64_t sector_num, int nb_sectors, int *n,
-        BlockDriverState **file)
+static int coroutine_fn vvfat_co_block_status(BlockDriverState *bs,
+                                              bool want_zero, int64_t offset,
+                                              int64_t bytes, int64_t *n,
+                                              int64_t *map,
+                                              BlockDriverState **file)
 {
-    *n = bs->total_sectors - sector_num;
-    if (*n > nb_sectors) {
-        *n = nb_sectors;
-    } else if (*n < 0) {
-        return 0;
-    }
+    *n = bytes;
     return BDRV_BLOCK_DATA;
 }

@@ -3257,7 +3255,7 @@ static BlockDriver bdrv_vvfat = {
     .bdrv_co_preadv         = vvfat_co_preadv,
     .bdrv_co_pwritev        = vvfat_co_pwritev,
-    .bdrv_co_get_block_status = vvfat_co_get_block_status,
+    .bdrv_co_block_status   = vvfat_co_block_status,
 };

 static void bdrv_vvfat_init(void)
-- 
2.14.3
[Qemu-block] [PATCH v8 18/21] vmdk: Switch to .bdrv_co_block_status()
We are gradually moving away from sector-based interfaces, towards
byte-based. Update the vmdk driver accordingly. Drop the now-unused
vmdk_find_index_in_cluster().

Also, fix a pre-existing bug: if find_extent() fails (unlikely, since
the block layer did a bounds check), then we must return a failure,
rather than 0.

Signed-off-by: Eric Blake
Reviewed-by: Vladimir Sementsov-Ogievskiy
Reviewed-by: Fam Zheng

---
v6-v7: no change
v5: drop dead code [Vladimir], return error on find_extent() failure
v4: rebase to interface tweak
v3: no change
v2: rebase to mapping flag
---
 block/vmdk.c | 38 ++++++++++++++------------------------
 1 file changed, 14 insertions(+), 24 deletions(-)

diff --git a/block/vmdk.c b/block/vmdk.c
index ef15ddbfd3d..75f84213e6f 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -1304,33 +1304,27 @@ static inline uint64_t vmdk_find_offset_in_cluster(VmdkExtent *extent,
     return extent_relative_offset % cluster_size;
 }

-static inline uint64_t vmdk_find_index_in_cluster(VmdkExtent *extent,
-                                                  int64_t sector_num)
-{
-    uint64_t offset;
-    offset = vmdk_find_offset_in_cluster(extent, sector_num * BDRV_SECTOR_SIZE);
-    return offset / BDRV_SECTOR_SIZE;
-}
-
-static int64_t coroutine_fn vmdk_co_get_block_status(BlockDriverState *bs,
-        int64_t sector_num, int nb_sectors, int *pnum,
-        BlockDriverState **file)
+static int coroutine_fn vmdk_co_block_status(BlockDriverState *bs,
+                                             bool want_zero,
+                                             int64_t offset, int64_t bytes,
+                                             int64_t *pnum, int64_t *map,
+                                             BlockDriverState **file)
 {
     BDRVVmdkState *s = bs->opaque;
     int64_t index_in_cluster, n, ret;
-    uint64_t offset;
+    uint64_t cluster_offset;
     VmdkExtent *extent;

-    extent = find_extent(s, sector_num, NULL);
+    extent = find_extent(s, offset >> BDRV_SECTOR_BITS, NULL);
     if (!extent) {
-        return 0;
+        return -EIO;
     }
     qemu_co_mutex_lock(&s->lock);
-    ret = get_cluster_offset(bs, extent, NULL,
-                             sector_num * 512, false, &offset,
+    ret = get_cluster_offset(bs, extent, NULL, offset, false, &cluster_offset,
                              0, 0);
     qemu_co_mutex_unlock(&s->lock);

-    index_in_cluster = vmdk_find_index_in_cluster(extent, sector_num);
+    index_in_cluster = vmdk_find_offset_in_cluster(extent, offset);
     switch (ret) {
     case VMDK_ERROR:
         ret = -EIO;
@@ -1345,18 +1339,14 @@ static int64_t coroutine_fn vmdk_co_get_block_status(BlockDriverState *bs,
         ret = BDRV_BLOCK_DATA;
         if (!extent->compressed) {
             ret |= BDRV_BLOCK_OFFSET_VALID;
-            ret |= (offset + (index_in_cluster << BDRV_SECTOR_BITS))
-                    & BDRV_BLOCK_OFFSET_MASK;
+            *map = cluster_offset + index_in_cluster;
         }
         *file = extent->file->bs;
         break;
     }

-    n = extent->cluster_sectors - index_in_cluster;
-    if (n > nb_sectors) {
-        n = nb_sectors;
-    }
-    *pnum = n;
+    n = extent->cluster_sectors * BDRV_SECTOR_SIZE - index_in_cluster;
+    *pnum = MIN(n, bytes);
     return ret;
 }

@@ -2410,7 +2400,7 @@ static BlockDriver bdrv_vmdk = {
     .bdrv_close                   = vmdk_close,
     .bdrv_create                  = vmdk_create,
     .bdrv_co_flush_to_disk        = vmdk_co_flush,
-    .bdrv_co_get_block_status     = vmdk_co_get_block_status,
+    .bdrv_co_block_status         = vmdk_co_block_status,
     .bdrv_get_allocated_file_size = vmdk_get_allocated_file_size,
     .bdrv_has_zero_init           = vmdk_has_zero_init,
     .bdrv_get_specific_info       = vmdk_get_specific_info,
-- 
2.14.3
[Qemu-block] [PATCH v8 17/21] vdi: Switch to .bdrv_co_block_status()
We are gradually moving away from sector-based interfaces, towards
byte-based. Update the vdi driver accordingly. Note that the TODO is
already covered (the block layer guarantees bounds of its requests),
and that we can remove the now-unused s->block_sectors.

Signed-off-by: Eric Blake
Reviewed-by: Vladimir Sementsov-Ogievskiy
Reviewed-by: Fam Zheng

---
v6-v7: no change
v5: fix pnum when offset rounded down to block_size [Vladimir]
v4: rebase to interface tweak
v3: no change
v2: rebase to mapping flag
---
 block/vdi.c | 33 +++++++++++++--------------------
 1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/block/vdi.c b/block/vdi.c
index 32b1763cde0..0780c82d829 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -172,8 +172,6 @@ typedef struct {
     uint32_t *bmap;
     /* Size of block (bytes). */
     uint32_t block_size;
-    /* Size of block (sectors). */
-    uint32_t block_sectors;
     /* First sector of block map. */
     uint32_t bmap_sector;
     /* VDI header (converted to host endianness). */
@@ -463,7 +461,6 @@ static int vdi_open(BlockDriverState *bs, QDict *options, int flags,
     bs->total_sectors = header.disk_size / SECTOR_SIZE;

     s->block_size = header.block_size;
-    s->block_sectors = header.block_size / SECTOR_SIZE;
     s->bmap_sector = header.offset_bmap / SECTOR_SIZE;
     s->header = header;

@@ -509,33 +506,29 @@ static int vdi_reopen_prepare(BDRVReopenState *state,
     return 0;
 }

-static int64_t coroutine_fn vdi_co_get_block_status(BlockDriverState *bs,
-        int64_t sector_num, int nb_sectors, int *pnum,
-        BlockDriverState **file)
+static int coroutine_fn vdi_co_block_status(BlockDriverState *bs,
+                                            bool want_zero,
+                                            int64_t offset, int64_t bytes,
+                                            int64_t *pnum, int64_t *map,
+                                            BlockDriverState **file)
 {
-    /* TODO: Check for too large sector_num (in bdrv_is_allocated or here). */
     BDRVVdiState *s = (BDRVVdiState *)bs->opaque;
-    size_t bmap_index = sector_num / s->block_sectors;
-    size_t sector_in_block = sector_num % s->block_sectors;
-    int n_sectors = s->block_sectors - sector_in_block;
+    size_t bmap_index = offset / s->block_size;
+    size_t index_in_block = offset % s->block_size;
     uint32_t bmap_entry = le32_to_cpu(s->bmap[bmap_index]);
-    uint64_t offset;
     int result;

-    logout("%p, %" PRId64 ", %d, %p\n", bs, sector_num, nb_sectors, pnum);
-    if (n_sectors > nb_sectors) {
-        n_sectors = nb_sectors;
-    }
-    *pnum = n_sectors;
+    logout("%p, %" PRId64 ", %" PRId64 ", %p\n", bs, offset, bytes, pnum);
+    *pnum = MIN(s->block_size - index_in_block, bytes);
     result = VDI_IS_ALLOCATED(bmap_entry);
     if (!result) {
         return 0;
     }

-    offset = s->header.offset_data +
-        (uint64_t)bmap_entry * s->block_size +
-        sector_in_block * SECTOR_SIZE;
+    *map = s->header.offset_data + (uint64_t)bmap_entry * s->block_size +
+        index_in_block;
     *file = bs->file->bs;
-    return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID | offset;
+    return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
 }

 static int coroutine_fn
@@ -903,7 +896,7 @@ static BlockDriver bdrv_vdi = {
     .bdrv_child_perm          = bdrv_format_default_perms,
     .bdrv_create              = vdi_create,
     .bdrv_has_zero_init       = bdrv_has_zero_init_1,
-    .bdrv_co_get_block_status = vdi_co_get_block_status,
+    .bdrv_co_block_status     = vdi_co_block_status,
     .bdrv_make_empty          = vdi_make_empty,

     .bdrv_co_preadv           = vdi_co_preadv,
-- 
2.14.3
[Qemu-block] [PATCH v8 13/21] qed: Switch to .bdrv_co_block_status()
We are gradually moving away from sector-based interfaces, towards
byte-based. Update the qed driver accordingly, taking the opportunity
to inline qed_is_allocated_cb() into its lone caller (the callback
used to be important, until we switched qed to coroutines). There is
no intent to optimize based on the want_zero flag for this format.

Signed-off-by: Eric Blake
Reviewed-by: Vladimir Sementsov-Ogievskiy
Reviewed-by: Fam Zheng

---
v6-v7: no change
v5: initialize len before qed_find_cluster() [Vladimir]
v4: rebase to interface change, inline pointless callback
v3: no change
v2: rebase to mapping flag, fix mask in qed_is_allocated_cb
---
 block/qed.c | 84 ++++++++++++++++++++++++--------------------------------------------------------
 1 file changed, 28 insertions(+), 56 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index c6ff3ab015d..a5952209261 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -688,74 +688,46 @@ finish:
     return ret;
 }

-typedef struct {
-    BlockDriverState *bs;
-    Coroutine *co;
-    uint64_t pos;
-    int64_t status;
-    int *pnum;
-    BlockDriverState **file;
-} QEDIsAllocatedCB;
-
-/* Called with table_lock held. */
-static void qed_is_allocated_cb(void *opaque, int ret, uint64_t offset, size_t len)
-{
-    QEDIsAllocatedCB *cb = opaque;
-    BDRVQEDState *s = cb->bs->opaque;
-    *cb->pnum = len / BDRV_SECTOR_SIZE;
-    switch (ret) {
-    case QED_CLUSTER_FOUND:
-        offset |= qed_offset_into_cluster(s, cb->pos);
-        cb->status = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID | offset;
-        *cb->file = cb->bs->file->bs;
-        break;
-    case QED_CLUSTER_ZERO:
-        cb->status = BDRV_BLOCK_ZERO;
-        break;
-    case QED_CLUSTER_L2:
-    case QED_CLUSTER_L1:
-        cb->status = 0;
-        break;
-    default:
-        assert(ret < 0);
-        cb->status = ret;
-        break;
-    }
-
-    if (cb->co) {
-        aio_co_wake(cb->co);
-    }
-}
-
-static int64_t coroutine_fn bdrv_qed_co_get_block_status(BlockDriverState *bs,
-                                                         int64_t sector_num,
-                                                         int nb_sectors, int *pnum,
-                                                         BlockDriverState **file)
+static int coroutine_fn bdrv_qed_co_block_status(BlockDriverState *bs,
+                                                 bool want_zero,
+                                                 int64_t pos, int64_t bytes,
+                                                 int64_t *pnum, int64_t *map,
+                                                 BlockDriverState **file)
 {
     BDRVQEDState *s = bs->opaque;
-    size_t len = (size_t)nb_sectors * BDRV_SECTOR_SIZE;
-    QEDIsAllocatedCB cb = {
-        .bs = bs,
-        .pos = (uint64_t)sector_num * BDRV_SECTOR_SIZE,
-        .status = BDRV_BLOCK_OFFSET_MASK,
-        .pnum = pnum,
-        .file = file,
-    };
+    size_t len = MIN(bytes, SIZE_MAX);
+    int status;
     QEDRequest request = { .l2_table = NULL };
     uint64_t offset;
     int ret;

     qemu_co_mutex_lock(&s->table_lock);
-    ret = qed_find_cluster(s, &request, cb.pos, &len, &offset);
-    qed_is_allocated_cb(&cb, ret, offset, len);
+    ret = qed_find_cluster(s, &request, pos, &len, &offset);

-    /* The callback was invoked immediately */
-    assert(cb.status != BDRV_BLOCK_OFFSET_MASK);
+    *pnum = len;
+    switch (ret) {
+    case QED_CLUSTER_FOUND:
+        *map = offset | qed_offset_into_cluster(s, pos);
+        status = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
+        *file = bs->file->bs;
+        break;
+    case QED_CLUSTER_ZERO:
+        status = BDRV_BLOCK_ZERO;
+        break;
+    case QED_CLUSTER_L2:
+    case QED_CLUSTER_L1:
+        status = 0;
+        break;
+    default:
+        assert(ret < 0);
+        status = ret;
+        break;
+    }

     qed_unref_l2_cache_entry(request.l2_table);
     qemu_co_mutex_unlock(&s->table_lock);

-    return cb.status;
+    return status;
 }

 static BDRVQEDState *acb_to_s(QEDAIOCB *acb)
@@ -1594,7 +1566,7 @@ static BlockDriver bdrv_qed = {
     .bdrv_child_perm          = bdrv_format_default_perms,
     .bdrv_create              = bdrv_qed_create,
     .bdrv_has_zero_init       = bdrv_has_zero_init_1,
-    .bdrv_co_get_block_status = bdrv_qed_co_get_block_status,
+    .bdrv_co_block_status     = bdrv_qed_co_block_status,
     .bdrv_co_readv            = bdrv_qed_co_readv,
     .bdrv_co_writev           = bdrv_qed_co_writev,
     .bdrv_co_pwrite_zeroes    = bdrv_qed_co_pwrite_zeroes,
-- 
2.14.3
[Qemu-block] [PATCH v8 15/21] sheepdog: Switch to .bdrv_co_block_status()
We are gradually moving away from sector-based interfaces, towards
byte-based. Update the sheepdog driver accordingly.

Signed-off-by: Eric Blake
Reviewed-by: Vladimir Sementsov-Ogievskiy
Reviewed-by: Fam Zheng
Reviewed-by: Jeff Cody

---
v7: rebase to minor spacing changes in master
v5-v6: no change
v4: update to interface tweak
v3: no change
v2: rebase to mapping flag
---
 block/sheepdog.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index ac02b10fe03..3c3becf94df 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -3004,19 +3004,19 @@ static coroutine_fn int sd_co_pdiscard(BlockDriverState *bs, int64_t offset,
     return acb.ret;
 }

-static coroutine_fn int64_t
-sd_co_get_block_status(BlockDriverState *bs, int64_t sector_num, int nb_sectors,
-                       int *pnum, BlockDriverState **file)
+static coroutine_fn int
+sd_co_block_status(BlockDriverState *bs, bool want_zero, int64_t offset,
+                   int64_t bytes, int64_t *pnum, int64_t *map,
+                   BlockDriverState **file)
 {
     BDRVSheepdogState *s = bs->opaque;
     SheepdogInode *inode = &s->inode;
     uint32_t object_size = (UINT32_C(1) << inode->block_size_shift);
-    uint64_t offset = sector_num * BDRV_SECTOR_SIZE;
     unsigned long start = offset / object_size,
-                  end = DIV_ROUND_UP((sector_num + nb_sectors) *
-                                     BDRV_SECTOR_SIZE, object_size);
+                  end = DIV_ROUND_UP(offset + bytes, object_size);
     unsigned long idx;
-    int64_t ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID | offset;
+    *map = offset;
+    int ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;

     for (idx = start; idx < end; idx++) {
         if (inode->data_vdi_id[idx] == 0) {
@@ -3033,9 +3033,9 @@ sd_co_get_block_status(BlockDriverState *bs, int64_t sector_num, int nb_sectors,
         }
     }

-    *pnum = (idx - start) * object_size / BDRV_SECTOR_SIZE;
-    if (*pnum > nb_sectors) {
-        *pnum = nb_sectors;
+    *pnum = (idx - start) * object_size;
+    if (*pnum > bytes) {
+        *pnum = bytes;
     }
     if (ret > 0 && ret & BDRV_BLOCK_OFFSET_VALID) {
         *file = bs;
@@ -3113,7 +3113,7 @@ static BlockDriver bdrv_sheepdog = {
     .bdrv_co_writev               = sd_co_writev,
     .bdrv_co_flush_to_disk        = sd_co_flush_to_disk,
     .bdrv_co_pdiscard             = sd_co_pdiscard,
-    .bdrv_co_get_block_status     = sd_co_get_block_status,
+    .bdrv_co_block_status         = sd_co_block_status,

     .bdrv_snapshot_create         = sd_snapshot_create,
     .bdrv_snapshot_goto           = sd_snapshot_goto,
@@ -3149,7 +3149,7 @@ static BlockDriver bdrv_sheepdog_tcp = {
     .bdrv_co_writev               = sd_co_writev,
     .bdrv_co_flush_to_disk        = sd_co_flush_to_disk,
     .bdrv_co_pdiscard             = sd_co_pdiscard,
-    .bdrv_co_get_block_status     = sd_co_get_block_status,
+    .bdrv_co_block_status         = sd_co_block_status,

     .bdrv_snapshot_create         = sd_snapshot_create,
     .bdrv_snapshot_goto           = sd_snapshot_goto,
@@ -3185,7 +3185,7 @@ static BlockDriver bdrv_sheepdog_unix = {
     .bdrv_co_writev               = sd_co_writev,
     .bdrv_co_flush_to_disk        = sd_co_flush_to_disk,
     .bdrv_co_pdiscard             = sd_co_pdiscard,
-    .bdrv_co_get_block_status     = sd_co_get_block_status,
+    .bdrv_co_block_status         = sd_co_block_status,

     .bdrv_snapshot_create         = sd_snapshot_create,
     .bdrv_snapshot_goto           = sd_snapshot_goto,
-- 
2.14.3
[Qemu-block] [PATCH v8 05/21] gluster: Switch to .bdrv_co_block_status()
We are gradually moving away from sector-based interfaces, towards byte-based. Update the gluster driver accordingly. In want_zero mode, we continue to report fine-grained hole information (the caller wants as much mapping detail as possible); but when not in that mode, the caller prefers larger *pnum and merely cares about what offsets are allocated at this layer, rather than where the holes live. Since holes still read as zeroes at this layer (rather than deferring to a backing layer), we can take the shortcut of skipping find_allocation(), and merely state that all bytes are allocated. We can also drop redundant bounds checks that are already guaranteed by the block layer. Signed-off-by: Eric BlakeReviewed-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Fam Zheng --- v6-v7: no change v5: drop redundant code v4: tweak commit message [Fam], rebase to interface tweak v3: no change v2: tweak comments [Prasanna], add mapping, drop R-b --- block/gluster.c | 70 - 1 file changed, 34 insertions(+), 36 deletions(-) diff --git a/block/gluster.c b/block/gluster.c index 3f17b7819d2..1a07d221d17 100644 --- a/block/gluster.c +++ b/block/gluster.c @@ -1362,68 +1362,66 @@ exit: } /* - * Returns the allocation status of the specified sectors. + * Returns the allocation status of the specified offset. * - * If 'sector_num' is beyond the end of the disk image the return value is 0 - * and 'pnum' is set to 0. + * The block layer guarantees 'offset' and 'bytes' are within bounds. * - * 'pnum' is set to the number of sectors (including and immediately following - * the specified sector) that are known to be in the same + * 'pnum' is set to the number of bytes (including and immediately following + * the specified offset) that are known to be in the same * allocated/unallocated state. * - * 'nb_sectors' is the max value 'pnum' should be set to. If nb_sectors goes - * beyond the end of the disk image it will be clamped. + * 'bytes' is the max value 'pnum' should be set to. 
* - * (Based on raw_co_get_block_status() from file-posix.c.) + * (Based on raw_co_block_status() from file-posix.c.) */ -static int64_t coroutine_fn qemu_gluster_co_get_block_status( -BlockDriverState *bs, int64_t sector_num, int nb_sectors, int *pnum, -BlockDriverState **file) +static int coroutine_fn qemu_gluster_co_block_status(BlockDriverState *bs, + bool want_zero, + int64_t offset, + int64_t bytes, + int64_t *pnum, + int64_t *map, + BlockDriverState **file) { BDRVGlusterState *s = bs->opaque; -off_t start, data = 0, hole = 0; -int64_t total_size; +off_t data = 0, hole = 0; int ret = -EINVAL; if (!s->fd) { return ret; } -start = sector_num * BDRV_SECTOR_SIZE; -total_size = bdrv_getlength(bs); -if (total_size < 0) { -return total_size; -} else if (start >= total_size) { -*pnum = 0; -return 0; -} else if (start + nb_sectors * BDRV_SECTOR_SIZE > total_size) { -nb_sectors = DIV_ROUND_UP(total_size - start, BDRV_SECTOR_SIZE); +if (!want_zero) { +*pnum = bytes; +*map = offset; +*file = bs; +return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID; } -ret = find_allocation(bs, start, , ); +ret = find_allocation(bs, offset, , ); if (ret == -ENXIO) { /* Trailing hole */ -*pnum = nb_sectors; +*pnum = bytes; ret = BDRV_BLOCK_ZERO; } else if (ret < 0) { /* No info available, so pretend there are no holes */ -*pnum = nb_sectors; +*pnum = bytes; ret = BDRV_BLOCK_DATA; -} else if (data == start) { -/* On a data extent, compute sectors to the end of the extent, +} else if (data == offset) { +/* On a data extent, compute bytes to the end of the extent, * possibly including a partial sector at EOF. */ -*pnum = MIN(nb_sectors, DIV_ROUND_UP(hole - start, BDRV_SECTOR_SIZE)); +*pnum = MIN(bytes, hole - offset); ret = BDRV_BLOCK_DATA; } else { -/* On a hole, compute sectors to the beginning of the next extent. */ -assert(hole == start); -*pnum = MIN(nb_sectors, (data - start) / BDRV_SECTOR_SIZE); +/* On a hole, compute bytes to the beginning of the next extent. 
*/
+        assert(hole == offset);
+        *pnum = MIN(bytes, data - offset);
         ret = BDRV_BLOCK_ZERO;
     }
+    *map = offset;
     *file = bs;
-    return ret | BDRV_BLOCK_OFFSET_VALID | start;
+    return ret | BDRV_BLOCK_OFFSET_VALID;
 }
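The data/hole classification that follows find_allocation() can be sketched in isolation. In this stand-alone version (DEMO_BLOCK_* values and the function name are illustrative stand-ins, not QEMU's definitions), 'data' and 'hole' are the next data/hole offsets at or after 'offset', exactly as find_allocation() reports them:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in flag values for illustration; QEMU's real BDRV_BLOCK_*
 * constants live in include/block/block.h. */
#define DEMO_BLOCK_DATA 1
#define DEMO_BLOCK_ZERO 2

#define MIN(a, b) ((a) < (b) ? (a) : (b))

/* Mirror of the post-find_allocation() logic: classify 'offset' as
 * sitting on a data extent or in a hole, and set *pnum to the number
 * of bytes (capped at 'bytes') sharing that state. */
static int classify_extent(int64_t offset, int64_t bytes,
                           int64_t data, int64_t hole, int64_t *pnum)
{
    if (data == offset) {
        /* On a data extent: data runs until the next hole. */
        *pnum = MIN(bytes, hole - offset);
        return DEMO_BLOCK_DATA;
    }
    /* Otherwise we sit in a hole that runs until the next data. */
    assert(hole == offset);
    *pnum = MIN(bytes, data - offset);
    return DEMO_BLOCK_ZERO;
}
```

Note how the byte-based form drops the old DIV_ROUND_UP/division by BDRV_SECTOR_SIZE entirely; the clamping is a plain MIN.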
[Qemu-block] [PATCH v8 16/21] vdi: Avoid bitrot of debugging code
Rework the debug define so that we always get -Wformat checking, even when debugging is disabled.

Signed-off-by: Eric Blake
Reviewed-by: Stefan Weil
Reviewed-by: Philippe Mathieu-Daudé
Reviewed-by: Vladimir Sementsov-Ogievskiy
Reviewed-by: Fam Zheng
---
v2-v7: no change
---
 block/vdi.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/block/vdi.c b/block/vdi.c
index fc1c614cb12..32b1763cde0 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -87,12 +87,18 @@
 #define DEFAULT_CLUSTER_SIZE (1 * MiB)

 #if defined(CONFIG_VDI_DEBUG)
-#define logout(fmt, ...) \
-                fprintf(stderr, "vdi\t%-24s" fmt, __func__, ##__VA_ARGS__)
+#define VDI_DEBUG 1
 #else
-#define logout(fmt, ...) ((void)0)
+#define VDI_DEBUG 0
 #endif

+#define logout(fmt, ...) \
+    do {                                                                \
+        if (VDI_DEBUG) {                                                \
+            fprintf(stderr, "vdi\t%-24s" fmt, __func__, ##__VA_ARGS__); \
+        }                                                               \
+    } while (0)
+
 /* Image signature. */
 #define VDI_SIGNATURE 0xbeda107f
--
2.14.3
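The trick generalizes beyond vdi: by turning the conditional into ordinary `if (0)` code instead of an empty macro, the compiler still parses and -Wformat-checks the format string in every build, and dead-code elimination removes the call in release builds. A minimal stand-alone sketch (DEMO_* names are illustrative, not QEMU's):

```c
#include <assert.h>
#include <stdio.h>

/* Set to 1 for a debug build; even at 0 the fprintf() is parsed and
 * format-checked, then eliminated as dead code by the optimizer. */
#ifndef DEMO_DEBUG
#define DEMO_DEBUG 0
#endif

#define demo_log(fmt, ...)                                               \
    do {                                                                 \
        if (DEMO_DEBUG) {                                                \
            fprintf(stderr, "demo\t%-24s" fmt, __func__, ##__VA_ARGS__); \
        }                                                                \
    } while (0)

/* Hypothetical caller: compiles warning-free whether DEMO_DEBUG is 0 or 1,
 * and a "%s" vs int mismatch would be diagnosed in both configurations. */
static int demo_open(int id)
{
    demo_log("opening id=%d\n", id);
    return id * 2;
}
```

With the old-style `((void)0)` stub, a format-string bug introduced while debugging is off goes unnoticed until someone next enables CONFIG_VDI_DEBUG; this pattern catches it immediately.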
[Qemu-block] [PATCH v8 09/21] null: Switch to .bdrv_co_block_status()
We are gradually moving away from sector-based interfaces, towards byte-based.  Update the null driver accordingly.

Signed-off-by: Eric Blake
Reviewed-by: Vladimir Sementsov-Ogievskiy
Reviewed-by: Fam Zheng
---
v6-v7: no change
v5: minor fix to type of 'ret'
v4: rebase to interface tweak
v3: no change
v2: rebase to mapping parameter
---
 block/null.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/block/null.c b/block/null.c
index 214d394fff4..806a8631e4d 100644
--- a/block/null.c
+++ b/block/null.c
@@ -223,22 +223,23 @@ static int null_reopen_prepare(BDRVReopenState *reopen_state,
     return 0;
 }

-static int64_t coroutine_fn null_co_get_block_status(BlockDriverState *bs,
-                                                     int64_t sector_num,
-                                                     int nb_sectors, int *pnum,
-                                                     BlockDriverState **file)
+static int coroutine_fn null_co_block_status(BlockDriverState *bs,
+                                             bool want_zero, int64_t offset,
+                                             int64_t bytes, int64_t *pnum,
+                                             int64_t *map,
+                                             BlockDriverState **file)
 {
     BDRVNullState *s = bs->opaque;
-    off_t start = sector_num * BDRV_SECTOR_SIZE;
+    int ret = BDRV_BLOCK_OFFSET_VALID;

-    *pnum = nb_sectors;
+    *pnum = bytes;
+    *map = offset;
     *file = bs;

     if (s->read_zeroes) {
-        return BDRV_BLOCK_OFFSET_VALID | start | BDRV_BLOCK_ZERO;
-    } else {
-        return BDRV_BLOCK_OFFSET_VALID | start;
+        ret |= BDRV_BLOCK_ZERO;
     }
+    return ret;
 }

 static void null_refresh_filename(BlockDriverState *bs, QDict *opts)
@@ -270,7 +271,7 @@ static BlockDriver bdrv_null_co = {
     .bdrv_co_flush_to_disk  = null_co_flush,
     .bdrv_reopen_prepare    = null_reopen_prepare,

-    .bdrv_co_get_block_status   = null_co_get_block_status,
+    .bdrv_co_block_status   = null_co_block_status,

     .bdrv_refresh_filename  = null_refresh_filename,
 };

@@ -290,7 +291,7 @@ static BlockDriver bdrv_null_aio = {
     .bdrv_aio_flush         = null_aio_flush,
     .bdrv_reopen_prepare    = null_reopen_prepare,

-    .bdrv_co_get_block_status   = null_co_get_block_status,
+    .bdrv_co_block_status   = null_co_block_status,

     .bdrv_refresh_filename  = null_refresh_filename,
 };
--
2.14.3
[Qemu-block] [PATCH v8 14/21] raw: Switch to .bdrv_co_block_status()
We are gradually moving away from sector-based interfaces, towards byte-based.  Update the raw driver accordingly.

Signed-off-by: Eric Blake
Reviewed-by: Vladimir Sementsov-Ogievskiy
Reviewed-by: Fam Zheng
---
v5-v7: no change
v4: rebase to interface tweak
v3: no change
v2: rebase to mapping
---
 block/raw-format.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/block/raw-format.c b/block/raw-format.c
index ab552c09541..830243a8e48 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -250,17 +250,17 @@ fail:
     return ret;
 }

-static int64_t coroutine_fn raw_co_get_block_status(BlockDriverState *bs,
-                                                    int64_t sector_num,
-                                                    int nb_sectors, int *pnum,
+static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
+                                            bool want_zero, int64_t offset,
+                                            int64_t bytes, int64_t *pnum,
+                                            int64_t *map,
                                             BlockDriverState **file)
 {
     BDRVRawState *s = bs->opaque;
-    *pnum = nb_sectors;
+    *pnum = bytes;
     *file = bs->file->bs;
-    sector_num += s->offset / BDRV_SECTOR_SIZE;
-    return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID |
-           (sector_num << BDRV_SECTOR_BITS);
+    *map = offset + s->offset;
+    return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID;
 }

 static int coroutine_fn raw_co_pwrite_zeroes(BlockDriverState *bs,
@@ -496,7 +496,7 @@ BlockDriver bdrv_raw = {
     .bdrv_co_pwritev      = &raw_co_pwritev,
     .bdrv_co_pwrite_zeroes = &raw_co_pwrite_zeroes,
     .bdrv_co_pdiscard     = &raw_co_pdiscard,
-    .bdrv_co_get_block_status = &raw_co_get_block_status,
+    .bdrv_co_block_status = &raw_co_block_status,
     .bdrv_truncate        = &raw_truncate,
     .bdrv_getlength       = &raw_getlength,
     .has_variable_length  = true,
--
2.14.3
[Qemu-block] [PATCH v8 08/21] iscsi: Switch to .bdrv_co_block_status()
We are gradually moving away from sector-based interfaces, towards byte-based. Update the iscsi driver accordingly. In this case, it is handy to teach iscsi_co_block_status() to handle a NULL map and file parameter, even though the block layer passes non-NULL values, because we also call the function directly. For now, there are no optimizations done based on the want_zero flag. We can also make the simplification of asserting that the block layer passed in aligned values. Signed-off-by: Eric BlakeReviewed-by: Fam Zheng --- v8: rebase to master v7: rebase to master v6: no change v5: assert rather than check for alignment v4: rebase to interface tweaks v3: no change v2: rebase to mapping parameter --- block/iscsi.c | 67 --- 1 file changed, 32 insertions(+), 35 deletions(-) diff --git a/block/iscsi.c b/block/iscsi.c index d2b0466775c..4842519fdad 100644 --- a/block/iscsi.c +++ b/block/iscsi.c @@ -653,36 +653,36 @@ out_unlock: -static int64_t coroutine_fn iscsi_co_get_block_status(BlockDriverState *bs, - int64_t sector_num, - int nb_sectors, int *pnum, - BlockDriverState **file) +static int coroutine_fn iscsi_co_block_status(BlockDriverState *bs, + bool want_zero, int64_t offset, + int64_t bytes, int64_t *pnum, + int64_t *map, + BlockDriverState **file) { IscsiLun *iscsilun = bs->opaque; struct scsi_get_lba_status *lbas = NULL; struct scsi_lba_status_descriptor *lbasd = NULL; struct IscsiTask iTask; uint64_t lba; -int64_t ret; +int ret; iscsi_co_init_iscsitask(iscsilun, ); -if (!is_sector_request_lun_aligned(sector_num, nb_sectors, iscsilun)) { -ret = -EINVAL; -goto out; -} +assert(QEMU_IS_ALIGNED(offset | bytes, iscsilun->block_size)); /* default to all sectors allocated */ -ret = BDRV_BLOCK_DATA; -ret |= (sector_num << BDRV_SECTOR_BITS) | BDRV_BLOCK_OFFSET_VALID; -*pnum = nb_sectors; +ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID; +if (map) { +*map = offset; +} +*pnum = bytes; /* LUN does not support logical block provisioning */ if (!iscsilun->lbpme) { goto out; 
} -lba = sector_qemu2lun(sector_num, iscsilun); +lba = offset / iscsilun->block_size; qemu_mutex_lock(>mutex); retry: @@ -727,12 +727,12 @@ retry: lbasd = >descriptors[0]; -if (sector_qemu2lun(sector_num, iscsilun) != lbasd->lba) { +if (lba != lbasd->lba) { ret = -EIO; goto out_unlock; } -*pnum = sector_lun2qemu(lbasd->num_blocks, iscsilun); +*pnum = lbasd->num_blocks * iscsilun->block_size; if (lbasd->provisioning == SCSI_PROVISIONING_TYPE_DEALLOCATED || lbasd->provisioning == SCSI_PROVISIONING_TYPE_ANCHORED) { @@ -743,15 +743,13 @@ retry: } if (ret & BDRV_BLOCK_ZERO) { -iscsi_allocmap_set_unallocated(iscsilun, sector_num * BDRV_SECTOR_SIZE, - *pnum * BDRV_SECTOR_SIZE); +iscsi_allocmap_set_unallocated(iscsilun, offset, *pnum); } else { -iscsi_allocmap_set_allocated(iscsilun, sector_num * BDRV_SECTOR_SIZE, - *pnum * BDRV_SECTOR_SIZE); +iscsi_allocmap_set_allocated(iscsilun, offset, *pnum); } -if (*pnum > nb_sectors) { -*pnum = nb_sectors; +if (*pnum > bytes) { +*pnum = bytes; } out_unlock: qemu_mutex_unlock(>mutex); @@ -760,7 +758,7 @@ out: if (iTask.task != NULL) { scsi_free_scsi_task(iTask.task); } -if (ret > 0 && ret & BDRV_BLOCK_OFFSET_VALID) { +if (ret > 0 && ret & BDRV_BLOCK_OFFSET_VALID && file) { *file = bs; } return ret; @@ -800,25 +798,24 @@ static int coroutine_fn iscsi_co_readv(BlockDriverState *bs, nb_sectors * BDRV_SECTOR_SIZE) && !iscsi_allocmap_is_allocated(iscsilun, sector_num * BDRV_SECTOR_SIZE, nb_sectors * BDRV_SECTOR_SIZE)) { -int pnum; -BlockDriverState *file; +int64_t pnum; /* check the block status from the beginning of the cluster * containing the start sector */ -int cluster_sectors = iscsilun->cluster_size >> BDRV_SECTOR_BITS; -int head; -int64_t ret; +int64_t head; +int ret; -assert(cluster_sectors); -head = sector_num % cluster_sectors; -ret = iscsi_co_get_block_status(bs, sector_num - head, -BDRV_REQUEST_MAX_SECTORS, , -
[Qemu-block] [PATCH v8 11/21] qcow: Switch to .bdrv_co_block_status()
We are gradually moving away from sector-based interfaces, towards byte-based. Update the qcow driver accordingly. There is no intent to optimize based on the want_zero flag for this format. Signed-off-by: Eric BlakeReviewed-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Fam Zheng --- v5-v7: no change v4: rebase to interface tweak v3: rebase to master v2: rebase to mapping flag --- block/qcow.c | 27 --- 1 file changed, 16 insertions(+), 11 deletions(-) diff --git a/block/qcow.c b/block/qcow.c index 8631155ac81..dead5029c67 100644 --- a/block/qcow.c +++ b/block/qcow.c @@ -524,23 +524,28 @@ static int get_cluster_offset(BlockDriverState *bs, return 1; } -static int64_t coroutine_fn qcow_co_get_block_status(BlockDriverState *bs, -int64_t sector_num, int nb_sectors, int *pnum, BlockDriverState **file) +static int coroutine_fn qcow_co_block_status(BlockDriverState *bs, + bool want_zero, + int64_t offset, int64_t bytes, + int64_t *pnum, int64_t *map, + BlockDriverState **file) { BDRVQcowState *s = bs->opaque; -int index_in_cluster, n, ret; +int index_in_cluster, ret; +int64_t n; uint64_t cluster_offset; qemu_co_mutex_lock(>lock); -ret = get_cluster_offset(bs, sector_num << 9, 0, 0, 0, 0, _offset); +ret = get_cluster_offset(bs, offset, 0, 0, 0, 0, _offset); qemu_co_mutex_unlock(>lock); if (ret < 0) { return ret; } -index_in_cluster = sector_num & (s->cluster_sectors - 1); -n = s->cluster_sectors - index_in_cluster; -if (n > nb_sectors) -n = nb_sectors; +index_in_cluster = offset & (s->cluster_size - 1); +n = s->cluster_size - index_in_cluster; +if (n > bytes) { +n = bytes; +} *pnum = n; if (!cluster_offset) { return 0; @@ -548,9 +553,9 @@ static int64_t coroutine_fn qcow_co_get_block_status(BlockDriverState *bs, if ((cluster_offset & QCOW_OFLAG_COMPRESSED) || s->crypto) { return BDRV_BLOCK_DATA; } -cluster_offset |= (index_in_cluster << BDRV_SECTOR_BITS); +*map = cluster_offset | index_in_cluster; *file = bs->file->bs; -return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID | 
cluster_offset;
+    return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
 }

 static int decompress_buffer(uint8_t *out_buf, int out_buf_size,
@@ -1128,7 +1133,7 @@ static BlockDriver bdrv_qcow = {

     .bdrv_co_readv          = qcow_co_readv,
     .bdrv_co_writev         = qcow_co_writev,
-    .bdrv_co_get_block_status   = qcow_co_get_block_status,
+    .bdrv_co_block_status   = qcow_co_block_status,

     .bdrv_make_empty        = qcow_make_empty,
     .bdrv_co_pwritev_compressed = qcow_co_pwritev_compressed,
--
2.14.3
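The byte-based clamping this patch introduces (`index_in_cluster = offset & (s->cluster_size - 1)` and so on) is worth seeing on its own: it guarantees a single block-status answer never straddles a cluster boundary. A self-contained sketch, assuming a hypothetical power-of-two cluster size in place of qcow's header-derived `s->cluster_size`:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cluster size; qcow reads the real one from the image
 * header.  Must be a power of two for the mask trick to work. */
#define DEMO_CLUSTER_SIZE 4096

/* Byte-based equivalent of the old sector math: how many bytes,
 * starting at 'offset', can be covered without crossing a cluster
 * boundary, capped at the request length 'bytes'? */
static int64_t demo_bytes_to_cluster_end(int64_t offset, int64_t bytes)
{
    int64_t index_in_cluster = offset & (DEMO_CLUSTER_SIZE - 1);
    int64_t n = DEMO_CLUSTER_SIZE - index_in_cluster;

    return n > bytes ? bytes : n;
}
```

Because each L2 entry maps exactly one cluster, clamping *pnum this way means a single get_cluster_offset() lookup is always sufficient for the returned range.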
[Qemu-block] [PATCH v8 12/21] qcow2: Switch to .bdrv_co_block_status()
We are gradually moving away from sector-based interfaces, towards byte-based. Update the qcow2 driver accordingly. For now, we are ignoring the 'want_zero' hint. However, it should be relatively straightforward to honor the hint as a way to return larger *pnum values when we have consecutive clusters with the same data/zero status but which differ only in having non-consecutive mappings. Signed-off-by: Eric BlakeReviewed-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Fam Zheng --- v5-v7: no change v4: update to interface tweak v3: no change v2: rebase to mapping flag --- block/qcow2.c | 24 +--- 1 file changed, 13 insertions(+), 11 deletions(-) diff --git a/block/qcow2.c b/block/qcow2.c index 57a517e2bdd..288b5299d80 100644 --- a/block/qcow2.c +++ b/block/qcow2.c @@ -1670,32 +1670,34 @@ static void qcow2_join_options(QDict *options, QDict *old_options) } } -static int64_t coroutine_fn qcow2_co_get_block_status(BlockDriverState *bs, -int64_t sector_num, int nb_sectors, int *pnum, BlockDriverState **file) +static int coroutine_fn qcow2_co_block_status(BlockDriverState *bs, + bool want_zero, + int64_t offset, int64_t count, + int64_t *pnum, int64_t *map, + BlockDriverState **file) { BDRVQcow2State *s = bs->opaque; uint64_t cluster_offset; int index_in_cluster, ret; unsigned int bytes; -int64_t status = 0; +int status = 0; -bytes = MIN(INT_MAX, nb_sectors * BDRV_SECTOR_SIZE); +bytes = MIN(INT_MAX, count); qemu_co_mutex_lock(>lock); -ret = qcow2_get_cluster_offset(bs, sector_num << BDRV_SECTOR_BITS, , - _offset); +ret = qcow2_get_cluster_offset(bs, offset, , _offset); qemu_co_mutex_unlock(>lock); if (ret < 0) { return ret; } -*pnum = bytes >> BDRV_SECTOR_BITS; +*pnum = bytes; if (cluster_offset != 0 && ret != QCOW2_CLUSTER_COMPRESSED && !s->crypto) { -index_in_cluster = sector_num & (s->cluster_sectors - 1); -cluster_offset |= (index_in_cluster << BDRV_SECTOR_BITS); +index_in_cluster = offset & (s->cluster_size - 1); +*map = cluster_offset | index_in_cluster; *file = 
bs->file->bs;
-        status |= BDRV_BLOCK_OFFSET_VALID | cluster_offset;
+        status |= BDRV_BLOCK_OFFSET_VALID;
     }
     if (ret == QCOW2_CLUSTER_ZERO_PLAIN || ret == QCOW2_CLUSTER_ZERO_ALLOC) {
         status |= BDRV_BLOCK_ZERO;
@@ -4352,7 +4354,7 @@ BlockDriver bdrv_qcow2 = {
     .bdrv_child_perm    = bdrv_format_default_perms,
     .bdrv_create        = qcow2_create,
     .bdrv_has_zero_init = bdrv_has_zero_init_1,
-    .bdrv_co_get_block_status = qcow2_co_get_block_status,
+    .bdrv_co_block_status = qcow2_co_block_status,

     .bdrv_co_preadv         = qcow2_co_preadv,
     .bdrv_co_pwritev        = qcow2_co_pwritev,
--
2.14.3
[Qemu-block] [PATCH v8 10/21] parallels: Switch to .bdrv_co_block_status()
We are gradually moving away from sector-based interfaces, towards byte-based. Update the parallels driver accordingly. Note that the internal function block_status() is still sector-based, because it is still in use by other sector-based functions; but that's okay because request_alignment is 512 as a result of those functions. For now, no optimizations are added based on the mapping hint. Signed-off-by: Eric BlakeReviewed-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Fam Zheng --- v7: fix bug in *map [Vladimir] v6: no change v5: fix pnum when return is 0 v4: rebase to interface tweak, R-b dropped v3: no change v2: rebase to mapping parameter; it is ignored, so R-b kept --- block/parallels.c | 22 +++--- 1 file changed, 15 insertions(+), 7 deletions(-) diff --git a/block/parallels.c b/block/parallels.c index e1e3d80c887..3e952a9c147 100644 --- a/block/parallels.c +++ b/block/parallels.c @@ -261,23 +261,31 @@ static coroutine_fn int parallels_co_flush_to_os(BlockDriverState *bs) } -static int64_t coroutine_fn parallels_co_get_block_status(BlockDriverState *bs, -int64_t sector_num, int nb_sectors, int *pnum, BlockDriverState **file) +static int coroutine_fn parallels_co_block_status(BlockDriverState *bs, + bool want_zero, + int64_t offset, + int64_t bytes, + int64_t *pnum, + int64_t *map, + BlockDriverState **file) { BDRVParallelsState *s = bs->opaque; -int64_t offset; +int count; +assert(QEMU_IS_ALIGNED(offset | bytes, BDRV_SECTOR_SIZE)); qemu_co_mutex_lock(>lock); -offset = block_status(s, sector_num, nb_sectors, pnum); +offset = block_status(s, offset >> BDRV_SECTOR_BITS, + bytes >> BDRV_SECTOR_BITS, ); qemu_co_mutex_unlock(>lock); +*pnum = count * BDRV_SECTOR_SIZE; if (offset < 0) { return 0; } +*map = offset * BDRV_SECTOR_SIZE; *file = bs->file->bs; -return (offset << BDRV_SECTOR_BITS) | -BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID; +return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID; } static coroutine_fn int parallels_co_writev(BlockDriverState *bs, @@ -782,7 
+790,7 @@ static BlockDriver bdrv_parallels = {
     .bdrv_open                = parallels_open,
     .bdrv_close               = parallels_close,
     .bdrv_child_perm          = bdrv_format_default_perms,
-    .bdrv_co_get_block_status = parallels_co_get_block_status,
+    .bdrv_co_block_status     = parallels_co_block_status,
     .bdrv_has_zero_init       = bdrv_has_zero_init_1,
     .bdrv_co_flush_to_os      = parallels_co_flush_to_os,
     .bdrv_co_readv            = parallels_co_readv,
--
2.14.3
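The parallels patch shows the wrapping pattern used when an internal helper stays sector-based: scale the byte arguments down on the way in and the results back up on the way out, asserting the alignment the block layer guarantees. A stand-alone sketch of that shape (the `demo_` helper is hypothetical, standing in for parallels' sector-based block_status()):

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SECTOR_BITS 9
#define DEMO_SECTOR_SIZE (1 << DEMO_SECTOR_BITS)   /* 512, as in QEMU */

/* Hypothetical sector-based internal helper: reports a fake host
 * sector and claims the whole request shares one state. */
static int64_t demo_block_status_sectors(int64_t sector_num, int nb_sectors,
                                         int *pnum_sectors)
{
    *pnum_sectors = nb_sectors;
    return sector_num + 100;          /* pretend guest->host mapping */
}

/* Byte-based wrapper: returns nonzero if mapped, filling *pnum and
 * *map in bytes. */
static int demo_block_status_bytes(int64_t offset, int64_t bytes,
                                   int64_t *pnum, int64_t *map)
{
    int count;
    int64_t host;

    /* The block layer guarantees request_alignment; assert it. */
    assert(!(offset % DEMO_SECTOR_SIZE) && !(bytes % DEMO_SECTOR_SIZE));
    host = demo_block_status_sectors(offset >> DEMO_SECTOR_BITS,
                                     bytes >> DEMO_SECTOR_BITS, &count);
    *pnum = (int64_t)count * DEMO_SECTOR_SIZE;
    if (host < 0) {
        return 0;
    }
    *map = host * DEMO_SECTOR_SIZE;
    return 1;
}
```

This mirrors the v5 fix noted in the changelog: *pnum must be set from the helper's count even on the "unmapped" return path.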
[Qemu-block] [PATCH v8 03/21] block: Switch passthrough drivers to .bdrv_co_block_status()
We are gradually moving away from sector-based interfaces, towards byte-based. Update the generic helpers, and all passthrough clients (blkdebug, commit, mirror, throttle) accordingly. Signed-off-by: Eric BlakeReviewed-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Fam Zheng --- v6-v7: no change v5: rebase to master v4: rebase to interface tweak v3: rebase to addition of throttle driver v2: rebase to master, retitle while merging blkdebug, commit, and mirror --- include/block/block_int.h | 28 block/io.c| 36 block/blkdebug.c | 20 +++- block/commit.c| 2 +- block/mirror.c| 2 +- block/throttle.c | 2 +- 6 files changed, 50 insertions(+), 40 deletions(-) diff --git a/include/block/block_int.h b/include/block/block_int.h index c93722b43a4..bf2598856cf 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -1041,23 +1041,27 @@ void bdrv_format_default_perms(BlockDriverState *bs, BdrvChild *c, uint64_t *nperm, uint64_t *nshared); /* - * Default implementation for drivers to pass bdrv_co_get_block_status() to + * Default implementation for drivers to pass bdrv_co_block_status() to * their file. */ -int64_t coroutine_fn bdrv_co_get_block_status_from_file(BlockDriverState *bs, -int64_t sector_num, -int nb_sectors, -int *pnum, -BlockDriverState **file); +int coroutine_fn bdrv_co_block_status_from_file(BlockDriverState *bs, +bool want_zero, +int64_t offset, +int64_t bytes, +int64_t *pnum, +int64_t *map, +BlockDriverState **file); /* - * Default implementation for drivers to pass bdrv_co_get_block_status() to + * Default implementation for drivers to pass bdrv_co_block_status() to * their backing file. 
*/ -int64_t coroutine_fn bdrv_co_get_block_status_from_backing(BlockDriverState *bs, - int64_t sector_num, - int nb_sectors, - int *pnum, - BlockDriverState **file); +int coroutine_fn bdrv_co_block_status_from_backing(BlockDriverState *bs, + bool want_zero, + int64_t offset, + int64_t bytes, + int64_t *pnum, + int64_t *map, + BlockDriverState **file); const char *bdrv_get_parent_name(const BlockDriverState *bs); void blk_dev_change_media_cb(BlockBackend *blk, bool load, Error **errp); bool blk_dev_has_removable_media(BlockBackend *blk); diff --git a/block/io.c b/block/io.c index b00c7e2e2c0..5bae79f282e 100644 --- a/block/io.c +++ b/block/io.c @@ -1868,30 +1868,34 @@ typedef struct BdrvCoBlockStatusData { bool done; } BdrvCoBlockStatusData; -int64_t coroutine_fn bdrv_co_get_block_status_from_file(BlockDriverState *bs, -int64_t sector_num, -int nb_sectors, -int *pnum, -BlockDriverState **file) +int coroutine_fn bdrv_co_block_status_from_file(BlockDriverState *bs, +bool want_zero, +int64_t offset, +int64_t bytes, +int64_t *pnum, +int64_t *map, +BlockDriverState **file) { assert(bs->file && bs->file->bs); -*pnum = nb_sectors; +*pnum = bytes; +*map = offset; *file = bs->file->bs; -return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID | - (sector_num << BDRV_SECTOR_BITS); +return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID; } -int64_t coroutine_fn bdrv_co_get_block_status_from_backing(BlockDriverState *bs, - int64_t
[Qemu-block] [PATCH v8 00/21] add byte-based block_status driver callbacks
There are patches floating around to add NBD_CMD_BLOCK_STATUS, but NBD wants to report status on byte granularity (even if the reporting will probably be naturally aligned to sectors or even much higher levels). I've therefore started the task of converting our block status code to report at a byte granularity rather than sectors. These patches have been around for a while, but it's time to finish it now that 2.12 is open for patches. Based-on: <20180213170529.10858-1-kw...@redhat.com> (Kevin's [PULL 00/55] Block layer patches) The overall conversion currently looks like: part 1: bdrv_is_allocated (merged, commit 51b0a488, 2.10) part 2: dirty-bitmap (merged, commit ca759622, 2.11) part 3: bdrv_get_block_status (merged, commit f0a9c18f, 2.11) part 4: .bdrv_co_block_status (this series, v7 was here [1]) [1] https://lists.gnu.org/archive/html/qemu-devel/2018-01/msg00954.html Available as a tag at: git fetch git://repo.or.cz/qemu/ericb.git nbd-byte-callback-v8 Since v7: - rebase to master (more iscsi context changes, nvme driver is new) - add a few more R-bys Eric Blake (21): block: Add .bdrv_co_block_status() callback nvme: Drop pointless .bdrv_co_get_block_status() block: Switch passthrough drivers to .bdrv_co_block_status() file-posix: Switch to .bdrv_co_block_status() gluster: Switch to .bdrv_co_block_status() iscsi: Switch cluster_sectors to byte-based iscsi: Switch iscsi_allocmap_update() to byte-based iscsi: Switch to .bdrv_co_block_status() null: Switch to .bdrv_co_block_status() parallels: Switch to .bdrv_co_block_status() qcow: Switch to .bdrv_co_block_status() qcow2: Switch to .bdrv_co_block_status() qed: Switch to .bdrv_co_block_status() raw: Switch to .bdrv_co_block_status() sheepdog: Switch to .bdrv_co_block_status() vdi: Avoid bitrot of debugging code vdi: Switch to .bdrv_co_block_status() vmdk: Switch to .bdrv_co_block_status() vpc: Switch to .bdrv_co_block_status() vvfat: Switch to .bdrv_co_block_status() block: Drop unused .bdrv_co_get_block_status() 
include/block/block.h | 14 ++--- include/block/block_int.h | 51 +-- block/io.c| 86 +++-- block/blkdebug.c | 20 +++--- block/commit.c| 2 +- block/file-posix.c| 62 +- block/gluster.c | 70 ++--- block/iscsi.c | 157 -- block/mirror.c| 2 +- block/null.c | 23 +++ block/nvme.c | 14 - block/parallels.c | 22 --- block/qcow.c | 27 block/qcow2.c | 24 +++ block/qed.c | 84 + block/raw-format.c| 16 ++--- block/sheepdog.c | 26 block/throttle.c | 2 +- block/vdi.c | 45 +++-- block/vmdk.c | 38 +-- block/vpc.c | 45 ++--- block/vvfat.c | 16 +++-- 22 files changed, 404 insertions(+), 442 deletions(-) -- 2.14.3
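Mechanically, every driver conversion in this series replaces `sector_num`/`nb_sectors` with `offset`/`bytes` and pushes the 512-byte-sector scaling out of the drivers. The two unit conversions involved can be sketched stand-alone (DEMO_* names are illustrative; QEMU's constants are BDRV_SECTOR_BITS and BDRV_SECTOR_SIZE):

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SECTOR_BITS 9
#define DEMO_SECTOR_SIZE (1 << DEMO_SECTOR_BITS)   /* 512 bytes */

/* Old sector-based interface units -> new byte-based units. */
static int64_t sectors_to_bytes(int64_t sectors)
{
    return sectors << DEMO_SECTOR_BITS;
}

/* New -> old, rounding up so a partial trailing sector (for example a
 * file whose length is not sector-aligned) still counts as a sector. */
static int64_t bytes_to_sectors_ceil(int64_t bytes)
{
    return (bytes + DEMO_SECTOR_SIZE - 1) >> DEMO_SECTOR_BITS;
}
```

The round-up direction is the source of the DIV_ROUND_UP calls that the byte-based callbacks get to delete.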
[Qemu-block] [PATCH v8 04/21] file-posix: Switch to .bdrv_co_block_status()
We are gradually moving away from sector-based interfaces, towards byte-based. Update the file protocol driver accordingly. In want_zero mode, we continue to report fine-grained hole information (the caller wants as much mapping detail as possible); but when not in that mode, the caller prefers larger *pnum and merely cares about what offsets are allocated at this layer, rather than where the holes live. Since holes still read as zeroes at this layer (rather than deferring to a backing layer), we can take the shortcut of skipping lseek(), and merely state that all bytes are allocated. We can also drop redundant bounds checks that are already guaranteed by the block layer. Signed-off-by: Eric BlakeReviewed-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Fam Zheng --- v6-v7: no change v5: drop redundant code v4: tweak commit message [Fam], rebase to interface tweak v3: no change v2: tweak comment, add mapping support --- block/file-posix.c | 62 +- 1 file changed, 29 insertions(+), 33 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index ca49c1a98ae..f1591c38490 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -2131,25 +2131,24 @@ static int find_allocation(BlockDriverState *bs, off_t start, } /* - * Returns the allocation status of the specified sectors. + * Returns the allocation status of the specified offset. * - * If 'sector_num' is beyond the end of the disk image the return value is 0 - * and 'pnum' is set to 0. + * The block layer guarantees 'offset' and 'bytes' are within bounds. * - * 'pnum' is set to the number of sectors (including and immediately following - * the specified sector) that are known to be in the same + * 'pnum' is set to the number of bytes (including and immediately following + * the specified offset) that are known to be in the same * allocated/unallocated state. * - * 'nb_sectors' is the max value 'pnum' should be set to. If nb_sectors goes - * beyond the end of the disk image it will be clamped. 
+ * 'bytes' is the max value 'pnum' should be set to. */ -static int64_t coroutine_fn raw_co_get_block_status(BlockDriverState *bs, -int64_t sector_num, -int nb_sectors, int *pnum, -BlockDriverState **file) +static int coroutine_fn raw_co_block_status(BlockDriverState *bs, +bool want_zero, +int64_t offset, +int64_t bytes, int64_t *pnum, +int64_t *map, +BlockDriverState **file) { -off_t start, data = 0, hole = 0; -int64_t total_size; +off_t data = 0, hole = 0; int ret; ret = fd_open(bs); @@ -2157,39 +2156,36 @@ static int64_t coroutine_fn raw_co_get_block_status(BlockDriverState *bs, return ret; } -start = sector_num * BDRV_SECTOR_SIZE; -total_size = bdrv_getlength(bs); -if (total_size < 0) { -return total_size; -} else if (start >= total_size) { -*pnum = 0; -return 0; -} else if (start + nb_sectors * BDRV_SECTOR_SIZE > total_size) { -nb_sectors = DIV_ROUND_UP(total_size - start, BDRV_SECTOR_SIZE); +if (!want_zero) { +*pnum = bytes; +*map = offset; +*file = bs; +return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID; } -ret = find_allocation(bs, start, , ); +ret = find_allocation(bs, offset, , ); if (ret == -ENXIO) { /* Trailing hole */ -*pnum = nb_sectors; +*pnum = bytes; ret = BDRV_BLOCK_ZERO; } else if (ret < 0) { /* No info available, so pretend there are no holes */ -*pnum = nb_sectors; +*pnum = bytes; ret = BDRV_BLOCK_DATA; -} else if (data == start) { -/* On a data extent, compute sectors to the end of the extent, +} else if (data == offset) { +/* On a data extent, compute bytes to the end of the extent, * possibly including a partial sector at EOF. */ -*pnum = MIN(nb_sectors, DIV_ROUND_UP(hole - start, BDRV_SECTOR_SIZE)); +*pnum = MIN(bytes, hole - offset); ret = BDRV_BLOCK_DATA; } else { -/* On a hole, compute sectors to the beginning of the next extent. */ -assert(hole == start); -*pnum = MIN(nb_sectors, (data - start) / BDRV_SECTOR_SIZE); +/* On a hole, compute bytes to the beginning of the next extent. 
*/
+        assert(hole == offset);
+        *pnum = MIN(bytes, data - offset);
         ret = BDRV_BLOCK_ZERO;
     }
+    *map = offset;
     *file = bs;
-    return ret | BDRV_BLOCK_OFFSET_VALID | start;
+    return ret | BDRV_BLOCK_OFFSET_VALID;
 }
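The want_zero shortcut added here is the interesting part of the patch: when the caller only asks "is this allocated at this layer?", the driver can skip the lseek(SEEK_DATA/SEEK_HOLE) probe entirely, because holes in a protocol file still read as zeroes from this layer. A simplified sketch of just that fast path (flag values are stand-ins for QEMU's BDRV_BLOCK_* bits):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_BLOCK_DATA         0x01
#define DEMO_BLOCK_OFFSET_VALID 0x04   /* stand-ins, not QEMU's values */

/* Without the want_zero hint, claim the whole request is allocated
 * data at the identity-mapped host offset; return -1 to signal that
 * the caller must fall through to the precise (and slower) probe. */
static int demo_status_fast_path(bool want_zero, int64_t offset,
                                 int64_t bytes, int64_t *pnum, int64_t *map)
{
    if (!want_zero) {
        *pnum = bytes;
        *map = offset;
        return DEMO_BLOCK_DATA | DEMO_BLOCK_OFFSET_VALID;
    }
    return -1;
}
```

The payoff is larger *pnum values for callers like bdrv_is_allocated(), which would otherwise be chopped up at every data/hole transition.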
[Qemu-block] [PATCH v8 01/21] block: Add .bdrv_co_block_status() callback
We are gradually moving away from sector-based interfaces, towards byte-based. Now that the block layer exposes byte-based allocation, it's time to tackle the drivers. Add a new callback that operates on as small as byte boundaries. Subsequent patches will then update individual drivers, then finally remove .bdrv_co_get_block_status(). The new code also passes through the 'want_zero' hint, which will allow subsequent patches to further optimize callers that only care about how much of the image is allocated (want_zero is false), rather than full details about runs of zeroes and which offsets the allocation actually maps to (want_zero is true). As part of this effort, fix another part of the documentation: the claim in commit 4c41cb4 that BDRV_BLOCK_ALLOCATED is short for 'DATA || ZERO' is a lie at the block layer (see commit e88ae2264), even though it is how the bit is computed from the driver layer. After all, there are intentionally cases where we return ZERO but not ALLOCATED at the block layer, when we know that a read sees zero because the backing file is too short. Note that the driver interface is thus slightly different than the public interface with regards to which bits will be set, and what guarantees are provided on input. We also add an assertion that any driver using the new callback will make progress (the only time pnum will be 0 is if the block layer already handled an out-of-bounds request, or if there is an error); the old driver interface did not provide this guarantee, which could lead to some inf-loops in drastic corner-case failures. 
Signed-off-by: Eric BlakeReviewed-by: Vladimir Sementsov-Ogievskiy Reviewed-by: Fam Zheng --- v7: add R-b v6: drop now-useless rounding of mid-sector end-of-file hole [Kevin], better documentation of 'want_zero' [Kevin] v5: rebase to master, typo fix, document more block layer guarantees v4: rebase to master v3: no change v2: improve alignment handling, ensure all iotests still pass --- include/block/block.h | 14 +++--- include/block/block_int.h | 20 +++- block/io.c| 28 +++- 3 files changed, 41 insertions(+), 21 deletions(-) diff --git a/include/block/block.h b/include/block/block.h index 19b3ab9cb5e..947e8876cdd 100644 --- a/include/block/block.h +++ b/include/block/block.h @@ -115,19 +115,19 @@ typedef struct HDGeometry { * BDRV_BLOCK_ZERO: offset reads as zero * BDRV_BLOCK_OFFSET_VALID: an associated offset exists for accessing raw data * BDRV_BLOCK_ALLOCATED: the content of the block is determined by this - * layer (short for DATA || ZERO), set by block layer - * BDRV_BLOCK_EOF: the returned pnum covers through end of file for this layer + * layer rather than any backing, set by block layer + * BDRV_BLOCK_EOF: the returned pnum covers through end of file for this + * layer, set by block layer * * Internal flag: * BDRV_BLOCK_RAW: for use by passthrough drivers, such as raw, to request * that the block layer recompute the answer from the returned * BDS; must be accompanied by just BDRV_BLOCK_OFFSET_VALID. * - * If BDRV_BLOCK_OFFSET_VALID is set, bits 9-62 (BDRV_BLOCK_OFFSET_MASK) of - * the return value (old interface) or the entire map parameter (new - * interface) represent the offset in the returned BDS that is allocated for - * the corresponding raw data. However, whether that offset actually - * contains data also depends on BDRV_BLOCK_DATA, as follows: + * If BDRV_BLOCK_OFFSET_VALID is set, the map parameter represents the + * host offset within the returned BDS that is allocated for the + * corresponding raw guest data. 
However, whether that offset + * actually contains data also depends on BDRV_BLOCK_DATA, as follows: * * DATA ZERO OFFSET_VALID * ttt sectors read as zero, returned file is zero at offset diff --git a/include/block/block_int.h b/include/block/block_int.h index 5ea63f8fa8a..c93722b43a4 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -202,15 +202,25 @@ struct BlockDriver { /* * Building block for bdrv_block_status[_above] and * bdrv_is_allocated[_above]. The driver should answer only - * according to the current layer, and should not set - * BDRV_BLOCK_ALLOCATED, but may set BDRV_BLOCK_RAW. See block.h - * for the meaning of _DATA, _ZERO, and _OFFSET_VALID. The block - * layer guarantees input aligned to request_alignment, as well as - * non-NULL pnum and file. + * according to the current layer, and should only need to set + * BDRV_BLOCK_DATA, BDRV_BLOCK_ZERO, BDRV_BLOCK_OFFSET_VALID, + * and/or BDRV_BLOCK_RAW; if the current layer defers to a backing + * layer, the result should be 0 (and not BDRV_BLOCK_ZERO). See + * block.h for the overall meaning of the bits.
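For reference, the driver-layer derivation the commit message describes ("how the bit is computed from the driver layer") can be sketched as follows. This is a simplified illustration with illustrative flag values, not the actual io.c code; the real definitions live in include/block/block.h:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag values, mirroring the names discussed above. */
enum {
    BDRV_BLOCK_DATA         = 1 << 0,
    BDRV_BLOCK_ZERO         = 1 << 1,
    BDRV_BLOCK_OFFSET_VALID = 1 << 2,
    BDRV_BLOCK_ALLOCATED    = 1 << 4,
};

/* Sketch of the block layer's post-processing of a driver answer: the
 * driver reports only DATA/ZERO/OFFSET_VALID (plus RAW), and io.c
 * derives ALLOCATED from DATA || ZERO.  Note the block layer may still
 * report ZERO *without* ALLOCATED on its own, e.g. when a read beyond a
 * short backing file is known to see zeroes - which is exactly why
 * ALLOCATED is not "short for DATA || ZERO" at the public interface. */
static uint32_t block_layer_derive(uint32_t driver_flags)
{
    if (driver_flags & (BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO)) {
        driver_flags |= BDRV_BLOCK_ALLOCATED;
    }
    return driver_flags;
}
```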
[Qemu-block] [PATCH v8 02/21] nvme: Drop pointless .bdrv_co_get_block_status()
Commit bdd6a90 has a bug: drivers should never directly set BDRV_BLOCK_ALLOCATED, but only io.c should do that (as needed). Instead, drivers should report BDRV_BLOCK_DATA if they know that data comes from this BDS. But let's look at the bigger picture: semantically, the nvme driver is similar to the nbd, null, and raw drivers (no backing file, all data comes from this BDS). But while two of those other drivers have to supply the callback (null because it can special-case BDRV_BLOCK_ZERO, raw because it can special-case a different offset), in this case the block layer defaults are good enough without the callback at all (similar to nbd). So, fix the bug by deletion ;) Signed-off-by: Eric Blake--- v8: new patch --- block/nvme.c | 14 -- 1 file changed, 14 deletions(-) diff --git a/block/nvme.c b/block/nvme.c index 10bffbbf2f4..4e561b08df3 100644 --- a/block/nvme.c +++ b/block/nvme.c @@ -1068,18 +1068,6 @@ static int nvme_reopen_prepare(BDRVReopenState *reopen_state, return 0; } -static int64_t coroutine_fn nvme_co_get_block_status(BlockDriverState *bs, - int64_t sector_num, - int nb_sectors, int *pnum, - BlockDriverState **file) -{ -*pnum = nb_sectors; -*file = bs; - -return BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_OFFSET_VALID | - (sector_num << BDRV_SECTOR_BITS); -} - static void nvme_refresh_filename(BlockDriverState *bs, QDict *opts) { QINCREF(opts); @@ -1179,8 +1167,6 @@ static BlockDriver bdrv_nvme = { .bdrv_co_flush_to_disk= nvme_co_flush, .bdrv_reopen_prepare = nvme_reopen_prepare, -.bdrv_co_get_block_status = nvme_co_get_block_status, - .bdrv_refresh_filename= nvme_refresh_filename, .bdrv_refresh_limits = nvme_refresh_limits, -- 2.14.3
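A minimal sketch of why the deleted callback was redundant: for a protocol driver with no backing file and no block-status callback, the block layer's fallback amounts to reporting the whole request as data mapped 1:1 at the same offset, which is what the nvme callback computed by hand (except that it also wrongly set BDRV_BLOCK_ALLOCATED). The helper name and flag values below are illustrative, not QEMU's actual fallback code:

```c
#include <assert.h>
#include <stdint.h>

enum {
    BDRV_BLOCK_DATA         = 1 << 0,   /* illustrative values */
    BDRV_BLOCK_OFFSET_VALID = 1 << 2,
};

/* Hypothetical model of the block-layer default for a protocol driver
 * without a block-status callback: every byte is data, mapped 1:1.
 * ALLOCATED is deliberately absent - io.c adds it, drivers must not. */
static uint32_t default_status(int64_t offset, int64_t bytes,
                               int64_t *pnum, int64_t *map)
{
    *pnum = bytes;    /* whole request answered in one go */
    *map = offset;    /* raw data lives at the same offset */
    return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
}
```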
[Qemu-block] [PATCH v8 06/21] iscsi: Switch cluster_sectors to byte-based
We are gradually converting to byte-based interfaces, as they are easier to reason about than sector-based. Convert all uses of the cluster size in sectors, along with adding assertions that we are not dividing by zero. Improve some comment grammar while in the area. Signed-off-by: Eric BlakeAcked-by: Paolo Bonzini Reviewed-by: Fam Zheng --- v8: rebase to master v2-v7: no change --- block/iscsi.c | 56 +++- 1 file changed, 35 insertions(+), 21 deletions(-) diff --git a/block/iscsi.c b/block/iscsi.c index 421983dd6ff..3414c21c7f5 100644 --- a/block/iscsi.c +++ b/block/iscsi.c @@ -86,7 +86,7 @@ typedef struct IscsiLun { unsigned long *allocmap; unsigned long *allocmap_valid; long allocmap_size; -int cluster_sectors; +int cluster_size; bool use_16_for_rw; bool write_protected; bool lbpme; @@ -430,9 +430,10 @@ static int iscsi_allocmap_init(IscsiLun *iscsilun, int open_flags) { iscsi_allocmap_free(iscsilun); +assert(iscsilun->cluster_size); iscsilun->allocmap_size = -DIV_ROUND_UP(sector_lun2qemu(iscsilun->num_blocks, iscsilun), - iscsilun->cluster_sectors); +DIV_ROUND_UP(iscsilun->num_blocks * iscsilun->block_size, + iscsilun->cluster_size); iscsilun->allocmap = bitmap_try_new(iscsilun->allocmap_size); if (!iscsilun->allocmap) { @@ -440,7 +441,7 @@ static int iscsi_allocmap_init(IscsiLun *iscsilun, int open_flags) } if (open_flags & BDRV_O_NOCACHE) { -/* in case that cache.direct = on all allocmap entries are +/* when cache.direct = on all allocmap entries are * treated as invalid to force a relookup of the block * status on every read request */ return 0; @@ -461,17 +462,19 @@ iscsi_allocmap_update(IscsiLun *iscsilun, int64_t sector_num, int nb_sectors, bool allocated, bool valid) { int64_t cl_num_expanded, nb_cls_expanded, cl_num_shrunk, nb_cls_shrunk; +int cluster_sectors = iscsilun->cluster_size >> BDRV_SECTOR_BITS; if (iscsilun->allocmap == NULL) { return; } /* expand to entirely contain all affected clusters */ -cl_num_expanded = sector_num / 
iscsilun->cluster_sectors; +assert(cluster_sectors); +cl_num_expanded = sector_num / cluster_sectors; nb_cls_expanded = DIV_ROUND_UP(sector_num + nb_sectors, - iscsilun->cluster_sectors) - cl_num_expanded; + cluster_sectors) - cl_num_expanded; /* shrink to touch only completely contained clusters */ -cl_num_shrunk = DIV_ROUND_UP(sector_num, iscsilun->cluster_sectors); -nb_cls_shrunk = (sector_num + nb_sectors) / iscsilun->cluster_sectors +cl_num_shrunk = DIV_ROUND_UP(sector_num, cluster_sectors); +nb_cls_shrunk = (sector_num + nb_sectors) / cluster_sectors - cl_num_shrunk; if (allocated) { bitmap_set(iscsilun->allocmap, cl_num_expanded, nb_cls_expanded); @@ -535,9 +538,12 @@ iscsi_allocmap_is_allocated(IscsiLun *iscsilun, int64_t sector_num, if (iscsilun->allocmap == NULL) { return true; } -size = DIV_ROUND_UP(sector_num + nb_sectors, iscsilun->cluster_sectors); +assert(iscsilun->cluster_size); +size = DIV_ROUND_UP(sector_num + nb_sectors, +iscsilun->cluster_size >> BDRV_SECTOR_BITS); return !(find_next_bit(iscsilun->allocmap, size, - sector_num / iscsilun->cluster_sectors) == size); + sector_num * BDRV_SECTOR_SIZE / + iscsilun->cluster_size) == size); } static inline bool iscsi_allocmap_is_valid(IscsiLun *iscsilun, @@ -547,9 +553,12 @@ static inline bool iscsi_allocmap_is_valid(IscsiLun *iscsilun, if (iscsilun->allocmap_valid == NULL) { return false; } -size = DIV_ROUND_UP(sector_num + nb_sectors, iscsilun->cluster_sectors); +assert(iscsilun->cluster_size); +size = DIV_ROUND_UP(sector_num + nb_sectors, +iscsilun->cluster_size >> BDRV_SECTOR_BITS); return (find_next_zero_bit(iscsilun->allocmap_valid, size, - sector_num / iscsilun->cluster_sectors) == size); + sector_num * BDRV_SECTOR_SIZE / + iscsilun->cluster_size) == size); } static int coroutine_fn @@ -793,16 +802,21 @@ static int coroutine_fn iscsi_co_readv(BlockDriverState *bs, BlockDriverState *file; /* check the block status from the beginning of the cluster * containing the start sector */ -int64_t ret = 
iscsi_co_get_block_status(bs, - sector_num - sector_num % iscsilun->cluster_sectors, - BDRV_REQUEST_MAX_SECTORS, , ); +int
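The sector-to-byte conversion in this patch preserves the computed allocmap size. A sketch of both computations side by side (hypothetical helper names, assuming block_size and cluster_size are both multiples of 512, as iSCSI and qcow2 guarantee in practice):

```c
#include <assert.h>
#include <stdint.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
#define BDRV_SECTOR_SIZE 512

/* Old, sector-based allocmap sizing (cluster_sectors = cluster_size / 512). */
static int64_t allocmap_size_sectors(int64_t num_blocks, int block_size,
                                     int cluster_size)
{
    int64_t lun_sectors = num_blocks * block_size / BDRV_SECTOR_SIZE;
    return DIV_ROUND_UP(lun_sectors, cluster_size / BDRV_SECTOR_SIZE);
}

/* New, byte-based sizing from the patch: same answer, easier to read. */
static int64_t allocmap_size_bytes(int64_t num_blocks, int block_size,
                                   int cluster_size)
{
    return DIV_ROUND_UP(num_blocks * block_size, (int64_t)cluster_size);
}
```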
[Qemu-block] [PATCH v8 07/21] iscsi: Switch iscsi_allocmap_update() to byte-based
We are gradually converting to byte-based interfaces, as they are easier to reason about than sector-based. Convert all uses of the allocmap (no semantic change). Callers that already had bytes available are simpler, and callers that now scale to bytes will be easier to switch to byte-based in the future. Signed-off-by: Eric BlakeAcked-by: Paolo Bonzini Reviewed-by: Fam Zheng --- v7: rebase to master, simple enough to keep ack v3-v6: no change v2: rebase to count/bytes rename --- block/iscsi.c | 90 +-- 1 file changed, 44 insertions(+), 46 deletions(-) diff --git a/block/iscsi.c b/block/iscsi.c index 3414c21c7f5..d2b0466775c 100644 --- a/block/iscsi.c +++ b/block/iscsi.c @@ -458,24 +458,22 @@ static int iscsi_allocmap_init(IscsiLun *iscsilun, int open_flags) } static void -iscsi_allocmap_update(IscsiLun *iscsilun, int64_t sector_num, - int nb_sectors, bool allocated, bool valid) +iscsi_allocmap_update(IscsiLun *iscsilun, int64_t offset, + int64_t bytes, bool allocated, bool valid) { int64_t cl_num_expanded, nb_cls_expanded, cl_num_shrunk, nb_cls_shrunk; -int cluster_sectors = iscsilun->cluster_size >> BDRV_SECTOR_BITS; if (iscsilun->allocmap == NULL) { return; } /* expand to entirely contain all affected clusters */ -assert(cluster_sectors); -cl_num_expanded = sector_num / cluster_sectors; -nb_cls_expanded = DIV_ROUND_UP(sector_num + nb_sectors, - cluster_sectors) - cl_num_expanded; +assert(iscsilun->cluster_size); +cl_num_expanded = offset / iscsilun->cluster_size; +nb_cls_expanded = DIV_ROUND_UP(offset + bytes, + iscsilun->cluster_size) - cl_num_expanded; /* shrink to touch only completely contained clusters */ -cl_num_shrunk = DIV_ROUND_UP(sector_num, cluster_sectors); -nb_cls_shrunk = (sector_num + nb_sectors) / cluster_sectors - - cl_num_shrunk; +cl_num_shrunk = DIV_ROUND_UP(offset, iscsilun->cluster_size); +nb_cls_shrunk = (offset + bytes) / iscsilun->cluster_size - cl_num_shrunk; if (allocated) { bitmap_set(iscsilun->allocmap, cl_num_expanded, 
nb_cls_expanded); } else { @@ -498,26 +496,26 @@ iscsi_allocmap_update(IscsiLun *iscsilun, int64_t sector_num, } static void -iscsi_allocmap_set_allocated(IscsiLun *iscsilun, int64_t sector_num, - int nb_sectors) +iscsi_allocmap_set_allocated(IscsiLun *iscsilun, int64_t offset, + int64_t bytes) { -iscsi_allocmap_update(iscsilun, sector_num, nb_sectors, true, true); +iscsi_allocmap_update(iscsilun, offset, bytes, true, true); } static void -iscsi_allocmap_set_unallocated(IscsiLun *iscsilun, int64_t sector_num, - int nb_sectors) +iscsi_allocmap_set_unallocated(IscsiLun *iscsilun, int64_t offset, + int64_t bytes) { /* Note: if cache.direct=on the fifth argument to iscsi_allocmap_update * is ignored, so this will in effect be an iscsi_allocmap_set_invalid. */ -iscsi_allocmap_update(iscsilun, sector_num, nb_sectors, false, true); +iscsi_allocmap_update(iscsilun, offset, bytes, false, true); } -static void iscsi_allocmap_set_invalid(IscsiLun *iscsilun, int64_t sector_num, - int nb_sectors) +static void iscsi_allocmap_set_invalid(IscsiLun *iscsilun, int64_t offset, + int64_t bytes) { -iscsi_allocmap_update(iscsilun, sector_num, nb_sectors, false, false); +iscsi_allocmap_update(iscsilun, offset, bytes, false, false); } static void iscsi_allocmap_invalidate(IscsiLun *iscsilun) @@ -531,34 +529,30 @@ static void iscsi_allocmap_invalidate(IscsiLun *iscsilun) } static inline bool -iscsi_allocmap_is_allocated(IscsiLun *iscsilun, int64_t sector_num, -int nb_sectors) +iscsi_allocmap_is_allocated(IscsiLun *iscsilun, int64_t offset, +int64_t bytes) { unsigned long size; if (iscsilun->allocmap == NULL) { return true; } assert(iscsilun->cluster_size); -size = DIV_ROUND_UP(sector_num + nb_sectors, -iscsilun->cluster_size >> BDRV_SECTOR_BITS); +size = DIV_ROUND_UP(offset + bytes, iscsilun->cluster_size); return !(find_next_bit(iscsilun->allocmap, size, - sector_num * BDRV_SECTOR_SIZE / - iscsilun->cluster_size) == size); + offset / iscsilun->cluster_size) == size); } static inline bool 
iscsi_allocmap_is_valid(IscsiLun *iscsilun, - int64_t sector_num, int nb_sectors) + int64_t offset, int64_t bytes) {
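The cluster-range arithmetic above, now byte-based, can be sketched as a standalone helper (hypothetical name, mirroring the patch's math rather than copying block/iscsi.c): "expanded" covers every cluster the request touches, suitable for setting bits, while "shrunk" covers only clusters the request fully contains, suitable for safely clearing bits.

```c
#include <assert.h>
#include <stdint.h>

#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Byte-based version of the range math in iscsi_allocmap_update(). */
static void cluster_ranges(int64_t offset, int64_t bytes, int cluster_size,
                           int64_t *cl_expanded, int64_t *nb_expanded,
                           int64_t *cl_shrunk, int64_t *nb_shrunk)
{
    /* expand to entirely contain all affected clusters */
    *cl_expanded = offset / cluster_size;
    *nb_expanded = DIV_ROUND_UP(offset + bytes, cluster_size) - *cl_expanded;
    /* shrink to touch only completely contained clusters */
    *cl_shrunk = DIV_ROUND_UP(offset, cluster_size);
    *nb_shrunk = (offset + bytes) / cluster_size - *cl_shrunk;
}
```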
Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI
On 02/13/2018 07:46 PM, Eric Blake wrote: > On 02/13/2018 08:48 AM, Daniel P. Berrangé wrote: No, that's policy decision that doesn't matter from QMP pov. If the mgmt app wants the snapshot to be wrt to the initial time, it can simply invoke the "stop" QMP command before doing the live migration and "cont" afterwards. >>> >>> That would be non-live. I think Roman means a live snapshot that saves >>> the state at the beginning of the operation. Basically the difference >>> between blockdev-backup (state at the beginning) and blockdev-mirror >>> (state at the end), except for a whole VM. >> >> That doesn't seem practical unless you can instantaneously write out >> the entire guest RAM to disk without blocking, or can somehow snapshot >> the RAM so you can write out a consistent view of the original RAM, >> while the guest continues to dirty RAM pages. > > One idea for that is via fork()'s copy-on-write semantics; the parent > continues processing the guest, while the child writes out RAM pages. > Pages touched by the guest in the parent are now cleanly copied by the > OS so that the child can take all the time it wants, but still writes > out the state of the guest at the time of the fork(). It may be > possible to use userfaultfd to achieve the same effects without a fork(). > this would be problematic as we will surely face memory problems. Guest stalls are IMHO much less problematic (as they are expected by the end-user) than memory shortage. Anyway, this is debatable. Den
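The fork()-based copy-on-write idea Eric sketches can be demonstrated in miniature: after fork(), the child's view of memory stays frozen at fork time even while the parent keeps dirtying it. This is a POSIX-only toy, not QEMU code; it uses pipes to order the steps deterministically so the assertion is race-free.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Parent plays the running guest; child plays the snapshot writer.
 * Returns the value the child observed after the parent dirtied the
 * "page" - i.e. the fork-time value, courtesy of COW. */
static int snapshot_demo(void)
{
    int guest_ram = 1;                  /* stand-in for one guest RAM page */
    int to_child[2], to_parent[2];
    if (pipe(to_child) != 0 || pipe(to_parent) != 0) {
        return -1;
    }
    pid_t pid = fork();
    if (pid < 0) {
        return -1;
    }
    if (pid == 0) {                     /* child: the snapshot writer */
        char go;
        if (read(to_child[0], &go, 1) != 1 ||
            write(to_parent[1], &guest_ram, sizeof(guest_ram)) < 0) {
            _exit(1);
        }
        _exit(0);                       /* reported the fork-time value */
    }
    guest_ram = 2;                      /* "guest" dirties the page post-fork */
    int seen = 0;
    if (write(to_child[1], "x", 1) != 1 ||
        read(to_parent[0], &seen, sizeof(seen)) != (ssize_t)sizeof(seen)) {
        seen = -1;
    }
    waitpid(pid, NULL, 0);
    return seen;
}
```

Den's memory-pressure objection shows up here too: every page the parent dirties before the child finishes costs a duplicated page, so worst case the snapshot temporarily doubles RAM usage.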
Re: [Qemu-block] [PATCH v6 3/9] block: Add VFIO based NVMe driver
On 01/16/2018 12:08 AM, Fam Zheng wrote: This is a new protocol driver that exclusively opens a host NVMe controller through VFIO. It achieves better latency than linux-aio by completely bypassing host kernel vfs/block layer.

$rw-$bs-$iodepth     linux-aio   nvme://
randread-4k-1        10.5k       21.6k
randread-512k-1      745         1591
randwrite-4k-1       30.7k       37.0k
randwrite-512k-1     1945        1980
(unit: IOPS)

The driver also integrates with the polling mechanism of iothread. This patch is co-authored by Paolo and me. Signed-off-by: Paolo Bonzini Signed-off-by: Fam Zheng Message-Id: <20180110091846.10699-4-f...@redhat.com> --- Sorry for not noticing sooner, but +static int64_t coroutine_fn nvme_co_get_block_status(BlockDriverState *bs, + int64_t sector_num, + int nb_sectors, int *pnum, + BlockDriverState **file) +{ +*pnum = nb_sectors; +*file = bs; + +return BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_OFFSET_VALID | + (sector_num << BDRV_SECTOR_BITS); This is wrong. Drivers should only ever return BDRV_BLOCK_DATA (which io.c then _adds_ BDRV_BLOCK_ALLOCATED to, as needed). I'll fix it up as part of my byte-based block status series (v8 coming up soon). -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [PATCH v4] ssh: switch from libssh2 to libssh
On 02/13/2018 12:49 PM, Max Reitz wrote: On 2018-01-18 17:44, Pino Toscano wrote: Rewrite the implementation of the ssh block driver to use libssh instead of libssh2. The libssh library has various advantages over libssh2: - easier API for authentication (for example for using ssh-agent) - easier API for known_hosts handling - supports newer types of keys in known_hosts @@ -628,6 +570,8 @@ static int connect_to_ssh(BDRVSSHState *s, QDict *options, Error *local_err = NULL; const char *user, *path, *host_key_check; long port = 0; +unsigned long portU = 0; I was about to say: How about making port an unsigned long and swapping the qemu_strtol() for a qemu_strtoul()? But I think you'd rather want an unsigned int instead (and that won't work with qemu_strtoul()). Dan has a pending patch that adds qemu_strtoi() and qemu_strtoui(), when we want to deal with parsing to ints. -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [PATCH v4] ssh: switch from libssh2 to libssh
On 2018-01-18 17:44, Pino Toscano wrote: > Rewrite the implementation of the ssh block driver to use libssh instead > of libssh2. The libssh library has various advantages over libssh2: > - easier API for authentication (for example for using ssh-agent) > - easier API for known_hosts handling > - supports newer types of keys in known_hosts > > Kerberos authentication can be enabled once the libssh bug for it [1] is > fixed. > > The development version of libssh (i.e. the future 0.8.x) supports > fsync, so reuse the build time check for this. > > [1] https://red.libssh.org/issues/242 > > Signed-off-by: Pino Toscano> --- > > Changes from v3: > - fix socket cleanup in connect_to_ssh() > - add comments about the socket cleanup > - improve the error reporting (closer to what was with libssh2) > - improve EOF detection on sftp_read() > > Changes from v2: > - used again an own fd > - fixed co_yield() implementation > > Changes from v1: > - fixed jumbo packets writing > - fixed missing 'err' assignment > - fixed commit message One thing: The performance seems to have dropped hugely, from what I can tell. Before this patch, running the iotests 1-10 over ssh (raw/ssh) took 12.6 s. With this patch, they take 59.3 s. Perhaps the starkest contrast can be seen in test 1, which took 4 s before and 27 s after -- this test simply reads and writes 128 MB of continuous data. I like having elliptic curves, but I think this patch needs optimization work before we can replace libssh2. > block/Makefile.objs | 6 +- > block/ssh.c | 522 > > configure | 65 --- > 3 files changed, 278 insertions(+), 315 deletions(-) [...] > diff --git a/block/ssh.c b/block/ssh.c > index b049a16eb9..2975fc27d8 100644 > --- a/block/ssh.c > +++ b/block/ssh.c [...] 
> @@ -87,27 +81,25 @@ static void ssh_state_init(BDRVSSHState *s) > { > memset(s, 0, sizeof *s); > s->sock = -1; > -s->offset = -1; > qemu_co_mutex_init(>lock); > } > > static void ssh_state_free(BDRVSSHState *s) > { > +if (s->attrs) { > +sftp_attributes_free(s->attrs); > +} > if (s->sftp_handle) { > -libssh2_sftp_close(s->sftp_handle); > +sftp_close(s->sftp_handle); > } > if (s->sftp) { > -libssh2_sftp_shutdown(s->sftp); > +sftp_free(s->sftp); > } > if (s->session) { > -libssh2_session_disconnect(s->session, > - "from qemu ssh client: " > - "user closed the connection"); > -libssh2_session_free(s->session); > -} > -if (s->sock >= 0) { > -close(s->sock); > +ssh_disconnect(s->session); > +ssh_free(s->session); > } > +/* s->sock is owned by the ssh_session, which free's it. */ s/free's/frees/ > } > > static void GCC_FMT_ATTR(3, 4) > @@ -121,13 +113,13 @@ session_error_setg(Error **errp, BDRVSSHState *s, const > char *fs, ...) > va_end(args); > > if (s->session) { > -char *ssh_err; > +const char *ssh_err; > int ssh_err_code; > > -/* This is not an errno. See . */ > -ssh_err_code = libssh2_session_last_error(s->session, > - _err, NULL, 0); > -error_setg(errp, "%s: %s (libssh2 error code: %d)", > +/* This is not an errno. See . */ > +ssh_err = ssh_get_error(s->session); > +ssh_err_code = ssh_get_error_code(s->session); > +error_setg(errp, "%s: %s (libssh error code: %d)", > msg, ssh_err, ssh_err_code); Maybe we should not append the error info if there is no error. (Example: $ ./qemu-img info ssh://localhost/tmp/foo qemu-img: Could not open 'ssh://localhost/tmp/foo': no host key was found in known_hosts: (libssh error code: 0) ) > } else { > error_setg(errp, "%s", msg); [...] 
> @@ -291,68 +283,41 @@ static void ssh_parse_filename(const char *filename, > QDict *options, > static int check_host_key_knownhosts(BDRVSSHState *s, > const char *host, int port, Error > **errp) > { > -const char *home; > -char *knh_file = NULL; > -LIBSSH2_KNOWNHOSTS *knh = NULL; > -struct libssh2_knownhost *found; > -int ret, r; > -const char *hostkey; > -size_t len; > -int type; > +int ret; > +int state; > > -hostkey = libssh2_session_hostkey(s->session, , ); > -if (!hostkey) { > -ret = -EINVAL; > -session_error_setg(errp, s, "failed to read remote host key"); > -goto out; > -} > +state = ssh_is_server_known(s->session); > > -knh = libssh2_knownhost_init(s->session); > -if (!knh) { > -ret = -EINVAL; > -session_error_setg(errp, s, > - "failed to initialize known hosts support"); > -goto out; > -} > - > -
Re: [Qemu-block] block_status automatically added flags
On 02/13/2018 11:36 AM, Vladimir Sementsov-Ogievskiy wrote: Hi Eric! I'm now testing my nbd block status implementation (the block_status part, not about dirty bitmaps), and ran into the following effect. I created an empty qcow2 image and wrote to the first sector, so qemu-io -c map x reports: 64 KiB (0x10000) bytes allocated at offset 0 bytes (0x0) 9.938 MiB (0x9f0000) bytes not allocated at offset 64 KiB (0x10000) But I can't get the same results when connecting to an nbd server exporting the same qcow2 file; there I get 10 MiB (0xa00000) bytes allocated at offset 0 bytes (0x0) Is this with or without your NBD_CMD_BLOCK_STATUS patches applied? And are you exposing the data over NBD as raw ('qemu-nbd -f qcow2'/'qemu-io -f raw') or as qcow2 ('qemu-nbd -f raw'/'qemu-io -f qcow2')? /me does a quick reproduction Yes, I definitely see that behavior without any NBD_CMD_BLOCK_STATUS patches and when the image is exposed over NBD as raw, but not when exposed as qcow2, when testing the 2.11 release: $ qemu-img create -f qcow2 file3 10M Formatting 'file3', fmt=qcow2 size=10485760 cluster_size=65536 lazy_refcounts=off refcount_bits=16 $ qemu-io -c 'w 0 64k' -c map -f qcow2 file3 wrote 65536/65536 bytes at offset 0 64 KiB, 1 ops; 0.0579 sec (1.079 MiB/sec and 17.2601 ops/sec) 64 KiB (0x10000) bytes allocated at offset 0 bytes (0x0) 9.938 MiB (0x9f0000) bytes not allocated at offset 64 KiB (0x10000) $ qemu-nbd -f qcow2 -x foo file3 $ qemu-io -f raw -c map nbd://localhost:10809/foo 10 MiB (0xa00000) bytes allocated at offset 0 bytes (0x0) $ qemu-nbd -f raw -x foo file3 $ qemu-io -f qcow2 -c map nbd://localhost:10809/foo 64 KiB (0x10000) bytes allocated at offset 0 bytes (0x0) 9.938 MiB (0x9f0000) bytes not allocated at offset 64 KiB (0x10000) Right now, without NBD block status, the NBD driver reports the entire file as allocated, as it can't do any better (NBD has no backing file, and all data comes from the server). Presumably, once NBD_CMD_BLOCK_STATUS is implemented, we can then use that for more accurate information. 
Finally, I understand the reason: for local file, qemu-io calls bdrv_is_allocated, which calls bdrv_common_block_status_above with want_zero=false. So, it doesn't set BDRV_BLOCK_ZERO, and doesn't set BDRV_BLOCK_ALLOCATED. 'qemu-io map' is a bit unusual; it is the only UI that easily exposes bdrv_is_allocated() to the outside world ('qemu-img map' does not). (The fact that both operations are named 'map' but do something different is annoying; for back-compat reasons, we can't change qemu-img, and I don't know if changing qemu-io is worth it.) And, even if we change want_zero to true, Well, you'd do that by invoking bdrv_block_status() (via 'qemu-img map', for example). here, it will set BDRV_BLOCK_ZERO, but will not set BDRV_BLOCK_ALLOCATED, which contradicts with it's definition: BDRV_BLOCK_ALLOCATED: the content of the block is determined by this layer (short for DATA || ZERO), set by block layer This text is wrong; it gets fixed in my still-pending concluding series for byte-based block status: https://lists.gnu.org/archive/html/qemu-devel/2018-01/msg00955.html Conceptually, BDRV_BLOCK_ALLOCATED means "is THIS layer of the backing chain responsible for the contents at this guest offset"; and there are cases where we know that we read zeroes but where the current layer is not responsible for the contents (such as a qcow2 that has a backing file with shorter length, where we return BDRV_BLOCK_ZERO but not BDRV_BLOCK_ALLOCATED). But since NBD has no backing chain, the entire image is considered allocated. Meanwhile, asking whether something is allocated ('qemu-io -c map') is not the usual question you want to ask when determining what portions of a file are zero. for nbd, we go through the similar way on server (but with want_zero = true), and we finally have BDRV_BLOCK_ZERO without BDRV_BLOCK_ALLOCATED, which maps to NBD_STATE_HOLE+NBD_STATE_ZERO. 
But then, in the client we have BDRV_BLOCK_ZERO not automatically added by block layer but directly from nbd driver, therefor BDRV_BLOCK_ALLOCATED is set and I get different result. Drivers should never set BDRV_BLOCK_ALLOCATED; only the code in io.c should set it; and output based on BDRV_BLOCK_ALLOCATED is only useful in backing chain scenarios (which NBD does not have). this all looks weird for me. BDRV_BLOCK_ALLOCATED definition should be fixed, to show that this flag show only reported by driver flags, not automatically added. If we need to report yet more flags, where the driver can report additional information, then that's different. But changing the BDRV_BLOCK_ALLOCATED semantics would probably have knock-on effects that I'm not prepared to audit for (that is, we'd rather fix the documentation to match reality, which my pending patch does, and NOT change the code to match the current incorrect documentation). And then the situation
[Qemu-block] block_status automatically added flags
Hi Eric! I'm now testing my nbd block status implementation (the block_status part, not about dirty bitmaps), and ran into the following effect. I created an empty qcow2 image and wrote to the first sector, so qemu-io -c map x reports: 64 KiB (0x10000) bytes allocated at offset 0 bytes (0x0) 9.938 MiB (0x9f0000) bytes not allocated at offset 64 KiB (0x10000) But I can't get the same results when connecting to an nbd server exporting the same qcow2 file; there I get 10 MiB (0xa00000) bytes allocated at offset 0 bytes (0x0) Finally, I understand the reason: for a local file, qemu-io calls bdrv_is_allocated, which calls bdrv_common_block_status_above with want_zero=false. So it sets neither BDRV_BLOCK_ZERO nor BDRV_BLOCK_ALLOCATED. And even if we change want_zero to true here, it will set BDRV_BLOCK_ZERO but not BDRV_BLOCK_ALLOCATED, which contradicts its definition: BDRV_BLOCK_ALLOCATED: the content of the block is determined by this layer (short for DATA || ZERO), set by block layer For nbd, we go through a similar path on the server (but with want_zero = true), and we finally have BDRV_BLOCK_ZERO without BDRV_BLOCK_ALLOCATED, which maps to NBD_STATE_HOLE+NBD_STATE_ZERO. But then, in the client, we have BDRV_BLOCK_ZERO not automatically added by the block layer but reported directly by the nbd driver, therefore BDRV_BLOCK_ALLOCATED is set and I get a different result. This all looks weird to me. The BDRV_BLOCK_ALLOCATED definition should be fixed to show that this flag reflects only flags reported by the driver, not automatically added ones. And then the situation with nbd looks strange, because an automatically added flag becomes a "flag reported by the block driver". On the other hand, it is not wrong. I think, to test the feature, I'll improve qemu-io map to show zeroes (or is it better to add a separate command?) and leave the other things as they are. What do you think? -- Best regards, Vladimir
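The ALLOCATED semantics debated in this thread ("is THIS layer of the backing chain responsible for the contents") can be modeled with a toy function. The flag values and helper are illustrative, not QEMU code; the second branch is the short-backing-file case where a read sees zero yet ALLOCATED stays clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum {
    BDRV_BLOCK_DATA      = 1 << 0,   /* illustrative values */
    BDRV_BLOCK_ZERO      = 1 << 1,
    BDRV_BLOCK_ALLOCATED = 1 << 4,
};

/* Toy model: ALLOCATED means "this layer determines the contents".
 * A qcow2 layer over a shorter backing file reads as zero beyond the
 * backing file's EOF, yet that range is ZERO without ALLOCATED,
 * because the current layer has no cluster there. */
static uint32_t layer_status(bool mapped_in_this_layer, bool past_backing_eof)
{
    if (mapped_in_this_layer) {
        return BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED;
    }
    if (past_backing_eof) {
        return BDRV_BLOCK_ZERO;   /* reads as zero, but not our territory */
    }
    return 0;                     /* defer entirely to the backing file */
}
```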
Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI
On 13.02.2018 17:43, Kevin Wolf wrote: Am 13.02.2018 um 15:30 hat Roman Kagan geschrieben: On Tue, Feb 13, 2018 at 11:50:24AM +0100, Kevin Wolf wrote: Am 11.01.2018 um 14:04 hat Daniel P. Berrange geschrieben: Then you could just use the regular migrate QMP commands for loading and saving snapshots. Yes, you could. I think for a proper implementation you would want to do better, though. Live migration provides just a stream, but that's not really well suited for snapshots. When a RAM page is dirtied, you just want to overwrite the old version of it in a snapshot [...] This means the point in time where the guest state is snapshotted is not when the command is issued, but any unpredictable amount of time later. I'm not sure this is what a user expects. I don't think it's necessarily a big problem as long as you set the expectations right, but good point anyway. A better approach for the save part appears to be to stop the vcpus, dump the device state, resume the vcpus, and save the memory contents in the background, prioritizing the old copies of the pages that change. So basically you would let the guest fault whenever it writes to a page that is not saved yet, and then save it first before you make the page writable again? Essentially blockdev-backup, except for RAM. No multiple copies of the same page would have to be saved so the stream format would be fine. For the load part the usual inmigrate should work. This is true. I think it will require changes to the qcow2 driver, though. Currently, you write the VM state into the active layer and then take the disk snapshot so that the VM state is automatically snapshotted as well. Afterwards, the VM state is discarded again in the active layer. If you want to take the snapshot at the point in time when the command was issued, you first need to create a qcow2 snapshot to save the disk state, but then continue to write the VM state into that snapshot, even though it's not the active layer. 
In fact I think this would be more elegant than writing the VM state into the active layer and then discarding it again, but inactive snapshots haven't been writable so far, so that will require some changes. I'm sure that Denis already has some thoughts on this, though. Yes, I stick to the scheme where the snapshot is written as it is now, but instead of writing RAM I just preallocate the space for it and save the actual file offsets of the clusters forming that space; then I resume the vcpus and write the RAM to those cluster offsets in response to vcpu page faults. Kevin -- Best, Denis
[Qemu-block] [PULL 52/55] qcow2: Allow configuring the L2 slice size
From: Alberto GarciaNow that the code is ready to handle L2 slices we can finally add an option to allow configuring their size. An L2 slice is the portion of an L2 table that is read by the qcow2 cache. Until now the cache was always reading full L2 tables, and since the L2 table size is equal to the cluster size this was not very efficient with large clusters. Here's a more detailed explanation of why it makes sense to have smaller cache entries in order to load L2 data: https://lists.gnu.org/archive/html/qemu-block/2017-09/msg00635.html This patch introduces a new command-line option to the qcow2 driver named l2-cache-entry-size (cf. l2-cache-size). The cache entry size has the same restrictions as the cluster size: it must be a power of two and it has the same range of allowed values, with the additional requirement that it must not be larger than the cluster size. The L2 cache entry size (L2 slice size) remains equal to the cluster size for now by default, so this feature must be explicitly enabled. Although my tests show that 4KB slices consistently improve performance and give the best results, let's wait and make more tests with different cluster sizes before deciding on an optimal default. Now that the cache entry size is not necessarily equal to the cluster size we need to reflect that in the MIN_L2_CACHE_SIZE documentation. That minimum value is a requirement of the COW algorithm: we need to read two L2 slices (and not two L2 tables) in order to do COW, see l2_allocate() for the actual code. 
Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Reviewed-by: Max Reitz Message-id: c73e5611ff4a9ec5d20de68a6c289553a13d2354.1517840877.git.be...@igalia.com Signed-off-by: Max Reitz --- qapi/block-core.json | 6 ++ block/qcow2.h| 6 -- block/qcow2-cache.c | 10 -- block/qcow2.c| 34 +++--- 4 files changed, 45 insertions(+), 11 deletions(-) diff --git a/qapi/block-core.json b/qapi/block-core.json index 2c107823fe..5c5921bfb7 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -2521,6 +2521,11 @@ # @l2-cache-size: the maximum size of the L2 table cache in # bytes (since 2.2) # +# @l2-cache-entry-size: the size of each entry in the L2 cache in +# bytes. It must be a power of two between 512 +# and the cluster size. The default value is +# the cluster size (since 2.12) +# # @refcount-cache-size: the maximum size of the refcount block cache # in bytes (since 2.2) # @@ -2542,6 +2547,7 @@ '*overlap-check': 'Qcow2OverlapChecks', '*cache-size': 'int', '*l2-cache-size': 'int', +'*l2-cache-entry-size': 'int', '*refcount-cache-size': 'int', '*cache-clean-interval': 'int', '*encrypt': 'BlockdevQcow2Encryption' } } diff --git a/block/qcow2.h b/block/qcow2.h index d956a6cdd2..883802241f 100644 --- a/block/qcow2.h +++ b/block/qcow2.h @@ -68,7 +68,7 @@ #define MAX_CLUSTER_BITS 21 /* Must be at least 2 to cover COW */ -#define MIN_L2_CACHE_SIZE 2 /* clusters */ +#define MIN_L2_CACHE_SIZE 2 /* cache entries */ /* Must be at least 4 to cover all cases of refcount table growth */ #define MIN_REFCOUNT_CACHE_SIZE 4 /* clusters */ @@ -100,6 +100,7 @@ #define QCOW2_OPT_OVERLAP_INACTIVE_L2 "overlap-check.inactive-l2" #define QCOW2_OPT_CACHE_SIZE "cache-size" #define QCOW2_OPT_L2_CACHE_SIZE "l2-cache-size" +#define QCOW2_OPT_L2_CACHE_ENTRY_SIZE "l2-cache-entry-size" #define QCOW2_OPT_REFCOUNT_CACHE_SIZE "refcount-cache-size" #define QCOW2_OPT_CACHE_CLEAN_INTERVAL "cache-clean-interval" @@ -647,7 +648,8 @@ void qcow2_free_snapshots(BlockDriverState *bs); int 
qcow2_read_snapshots(BlockDriverState *bs); /* qcow2-cache.c functions */ -Qcow2Cache *qcow2_cache_create(BlockDriverState *bs, int num_tables); +Qcow2Cache *qcow2_cache_create(BlockDriverState *bs, int num_tables, + unsigned table_size); int qcow2_cache_destroy(Qcow2Cache *c); void qcow2_cache_entry_mark_dirty(Qcow2Cache *c, void *table); diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c index b1aa42477e..d9dafa31e5 100644 --- a/block/qcow2-cache.c +++ b/block/qcow2-cache.c @@ -120,14 +120,20 @@ void qcow2_cache_clean_unused(Qcow2Cache *c) c->cache_clean_lru_counter = c->lru_counter; } -Qcow2Cache *qcow2_cache_create(BlockDriverState *bs, int num_tables) +Qcow2Cache *qcow2_cache_create(BlockDriverState *bs, int num_tables, + unsigned table_size) { BDRVQcow2State *s = bs->opaque; Qcow2Cache *c; +assert(num_tables > 0); +assert(is_power_of_2(table_size)); +assert(table_size >= (1 << MIN_CLUSTER_BITS)); +assert(table_size <= s->cluster_size); +
[Qemu-block] [PULL 55/55] iotests: Add l2-cache-entry-size to iotest 137
From: Alberto GarciaThis test tries reopening a qcow2 image with valid and invalid options. This patch adds l2-cache-entry-size to the set. Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Reviewed-by: Max Reitz Message-id: 3d3b7d2dbfc020deaef60fb58739b0801eb9517c.1517840877.git.be...@igalia.com Signed-off-by: Max Reitz --- tests/qemu-iotests/137 | 5 + tests/qemu-iotests/137.out | 2 ++ 2 files changed, 7 insertions(+) diff --git a/tests/qemu-iotests/137 b/tests/qemu-iotests/137 index 5a01250005..87965625d8 100755 --- a/tests/qemu-iotests/137 +++ b/tests/qemu-iotests/137 @@ -83,6 +83,9 @@ $QEMU_IO \ -c "reopen -o overlap-check.inactive-l2=off" \ -c "reopen -o cache-size=1M" \ -c "reopen -o l2-cache-size=512k" \ +-c "reopen -o l2-cache-entry-size=512" \ +-c "reopen -o l2-cache-entry-size=4k" \ +-c "reopen -o l2-cache-entry-size=64k" \ -c "reopen -o refcount-cache-size=128k" \ -c "reopen -o cache-clean-interval=5" \ -c "reopen -o cache-clean-interval=0" \ @@ -107,6 +110,8 @@ $QEMU_IO \ -c "reopen -o cache-size=1M,l2-cache-size=2M" \ -c "reopen -o cache-size=1M,refcount-cache-size=2M" \ -c "reopen -o l2-cache-size=256T" \ +-c "reopen -o l2-cache-entry-size=33k" \ +-c "reopen -o l2-cache-entry-size=128k" \ -c "reopen -o refcount-cache-size=256T" \ -c "reopen -o overlap-check=constant,overlap-check.template=all" \ -c "reopen -o overlap-check=blubb" \ diff --git a/tests/qemu-iotests/137.out b/tests/qemu-iotests/137.out index 05efd74d17..e28e1eadba 100644 --- a/tests/qemu-iotests/137.out +++ b/tests/qemu-iotests/137.out @@ -20,6 +20,8 @@ cache-size, l2-cache-size and refcount-cache-size may not be set the same time l2-cache-size may not exceed cache-size refcount-cache-size may not exceed cache-size L2 cache size too big +L2 cache entry size must be a power of two between 512 and the cluster size (65536) +L2 cache entry size must be a power of two between 512 and the cluster size (65536) L2 cache size too big Conflicting values for qcow2 options 'overlap-check' 
('constant') and 'overlap-check.template' ('all') Unsupported value 'blubb' for qcow2 option 'overlap-check'. Allowed are any of the following: none, constant, cached, all -- 2.13.6
Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI
On 13.02.2018 18:05, Dr. David Alan Gilbert wrote: * Denis V. Lunev (d...@virtuozzo.com) wrote: On 02/13/2018 05:59 PM, Dr. David Alan Gilbert wrote: * Daniel P. Berrangé (berra...@redhat.com) wrote: On Tue, Feb 13, 2018 at 03:45:21PM +0100, Kevin Wolf wrote: Am 13.02.2018 um 15:36 hat Daniel P. Berrangé geschrieben: On Tue, Feb 13, 2018 at 05:30:02PM +0300, Roman Kagan wrote: On Tue, Feb 13, 2018 at 11:50:24AM +0100, Kevin Wolf wrote: Am 11.01.2018 um 14:04 hat Daniel P. Berrange geschrieben: Then you could just use the regular migrate QMP commands for loading and saving snapshots. Yes, you could. I think for a proper implementation you would want to do better, though. Live migration provides just a stream, but that's not really well suited for snapshots. When a RAM page is dirtied, you just want to overwrite the old version of it in a snapshot [...] This means the point in time where the guest state is snapshotted is not when the command is issued, but any unpredictable amount of time later. I'm not sure this is what a user expects. A better approach for the save part appears to be to stop the vcpus, dump the device state, resume the vcpus, and save the memory contents in the background, prioritizing the old copies of the pages that change. No multiple copies of the same page would have to be saved so the stream format would be fine. For the load part the usual inmigrate should work. No, that's policy decision that doesn't matter from QMP pov. If the mgmt app wants the snapshot to be wrt to the initial time, it can simply invoke the "stop" QMP command before doing the live migration and "cont" afterwards. That would be non-live. I think Roman means a live snapshot that saves the state at the beginning of the operation. Basically the difference between blockdev-backup (state at the beginning) and blockdev-mirror (state at the end), except for a whole VM. 
That doesn't seem practical unless you can instantaneously write out the entire guest RAM to disk without blocking, or can somehow snapshot the RAM so you can write out a consistent view of the original RAM, while the guest continues to dirty RAM pages. People have suggested doing something like that with userfault write mode; but the same would also be doable just by write protecting the whole of RAM and then following the faults. nope, userfault fd does not help :( We have tried, the functionality is not enough. Better to have small extension to KVM to protect all memory and notify QEMU with accessed address. Can you explain why? I thought the write-protect mode of userfaultfd was supposed to be able to do that; cc'ing in Andrea Hi everybody Yes, that's true but it isn't implemented yet in the kernel ... userfaultfd_register if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) { vm_flags |= VM_UFFD_WP; /* * FIXME: remove the below error constraint by * implementing the wprotect tracking mode. */ ret = -EINVAL; goto out; } and I don't feel like doing that taking into account that somebody has it done already but when it's done I'll use it with pleasure. KVM need to tiny wini modification to be made in its mmu part which would tell to the userspace the fault address. This is a simple solution which can be used while we're living without userfaultfd. Dave Den -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK -- Best, Denis
[Qemu-block] [PULL 45/55] qcow2: Prepare expand_zero_clusters_in_l1() for adding L2 slice support
From: Alberto GarciaAdding support for L2 slices to expand_zero_clusters_in_l1() needs (among other things) an extra loop that iterates over all slices of each L2 table. Putting all changes in one patch would make it hard to read because all semantic changes would be mixed with pure indentation changes. To make things easier this patch simply creates a new block and changes the indentation of all lines of code inside it. Thus, all modifications in this patch are cosmetic. There are no semantic changes and no variables are renamed yet. The next patch will take care of that. Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Reviewed-by: Max Reitz Message-id: c2ae9f31ed5b6e591477ad4654448badd1c89d73.1517840877.git.be...@igalia.com Signed-off-by: Max Reitz --- block/qcow2-cluster.c | 187 ++ 1 file changed, 96 insertions(+), 91 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index 1041dcb6dd..ccb2944eda 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -1904,118 +1904,123 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table, goto fail; } -if (is_active_l1) { -/* get active L2 tables from cache */ -ret = qcow2_cache_get(bs, s->l2_table_cache, l2_offset, -(void **)_table); -} else { -/* load inactive L2 tables from disk */ -ret = bdrv_read(bs->file, l2_offset / BDRV_SECTOR_SIZE, -(void *)l2_table, s->cluster_sectors); -} -if (ret < 0) { -goto fail; -} - -for (j = 0; j < s->l2_size; j++) { -uint64_t l2_entry = be64_to_cpu(l2_table[j]); -int64_t offset = l2_entry & L2E_OFFSET_MASK; -QCow2ClusterType cluster_type = qcow2_get_cluster_type(l2_entry); - -if (cluster_type != QCOW2_CLUSTER_ZERO_PLAIN && -cluster_type != QCOW2_CLUSTER_ZERO_ALLOC) { -continue; +{ +if (is_active_l1) { +/* get active L2 tables from cache */ +ret = qcow2_cache_get(bs, s->l2_table_cache, l2_offset, + (void **)_table); +} else { +/* load inactive L2 tables from disk */ +ret = bdrv_read(bs->file, l2_offset / BDRV_SECTOR_SIZE, +(void 
*)l2_table, s->cluster_sectors); +} +if (ret < 0) { +goto fail; } -if (cluster_type == QCOW2_CLUSTER_ZERO_PLAIN) { -if (!bs->backing) { -/* not backed; therefore we can simply deallocate the - * cluster */ -l2_table[j] = 0; -l2_dirty = true; +for (j = 0; j < s->l2_size; j++) { +uint64_t l2_entry = be64_to_cpu(l2_table[j]); +int64_t offset = l2_entry & L2E_OFFSET_MASK; +QCow2ClusterType cluster_type = +qcow2_get_cluster_type(l2_entry); + +if (cluster_type != QCOW2_CLUSTER_ZERO_PLAIN && +cluster_type != QCOW2_CLUSTER_ZERO_ALLOC) { continue; } -offset = qcow2_alloc_clusters(bs, s->cluster_size); -if (offset < 0) { -ret = offset; -goto fail; -} +if (cluster_type == QCOW2_CLUSTER_ZERO_PLAIN) { +if (!bs->backing) { +/* not backed; therefore we can simply deallocate the + * cluster */ +l2_table[j] = 0; +l2_dirty = true; +continue; +} + +offset = qcow2_alloc_clusters(bs, s->cluster_size); +if (offset < 0) { +ret = offset; +goto fail; +} -if (l2_refcount > 1) { -/* For shared L2 tables, set the refcount accordingly (it is - * already 1 and needs to be l2_refcount) */ -ret = qcow2_update_cluster_refcount(bs, -offset >> s->cluster_bits, +if (l2_refcount > 1) { +/* For shared L2 tables, set the refcount accordingly + * (it is already 1 and needs to be l2_refcount) */ +ret = qcow2_update_cluster_refcount( +bs, offset >> s->cluster_bits, refcount_diff(1, l2_refcount), false, QCOW2_DISCARD_OTHER); -if (ret < 0) {
[Qemu-block] [PULL 54/55] iotests: Test downgrading an image using a small L2 slice size
From: Alberto Garciaexpand_zero_clusters_in_l1() is used when downgrading qcow2 images from v3 to v2 (compat=0.10). This is one of the functions that needed more changes to support L2 slices, so this patch extends iotest 061 to test downgrading a qcow2 image using a smaller slice size. Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Reviewed-by: Max Reitz Message-id: 3e5662dce5e4926c8fabbad4c0b9142b2a506dd4.1517840877.git.be...@igalia.com Signed-off-by: Max Reitz --- tests/qemu-iotests/061 | 16 tests/qemu-iotests/061.out | 61 ++ 2 files changed, 77 insertions(+) diff --git a/tests/qemu-iotests/061 b/tests/qemu-iotests/061 index f5678b10c9..911b6f2894 100755 --- a/tests/qemu-iotests/061 +++ b/tests/qemu-iotests/061 @@ -54,6 +54,22 @@ $QEMU_IO -c "read -P 0 0 128k" "$TEST_IMG" | _filter_qemu_io _check_test_img echo +echo "=== Testing version downgrade with zero expansion and 4K cache entries ===" +echo +IMGOPTS="compat=1.1,lazy_refcounts=on" _make_test_img 64M +$QEMU_IO -c "write -z 0 128k" "$TEST_IMG" | _filter_qemu_io +$QEMU_IO -c "write -z 32M 128k" "$TEST_IMG" | _filter_qemu_io +$QEMU_IO -c map "$TEST_IMG" | _filter_qemu_io +$PYTHON qcow2.py "$TEST_IMG" dump-header +$QEMU_IMG amend -o "compat=0.10" --image-opts \ + driver=qcow2,file.filename=$TEST_IMG,l2-cache-entry-size=4096 +$PYTHON qcow2.py "$TEST_IMG" dump-header +$QEMU_IO -c "read -P 0 0 128k" "$TEST_IMG" | _filter_qemu_io +$QEMU_IO -c "read -P 0 32M 128k" "$TEST_IMG" | _filter_qemu_io +$QEMU_IO -c map "$TEST_IMG" | _filter_qemu_io +_check_test_img + +echo echo "=== Testing dirty version downgrade ===" echo IMGOPTS="compat=1.1,lazy_refcounts=on" _make_test_img 64M diff --git a/tests/qemu-iotests/061.out b/tests/qemu-iotests/061.out index 942485de99..e857ef9a7d 100644 --- a/tests/qemu-iotests/061.out +++ b/tests/qemu-iotests/061.out @@ -52,6 +52,67 @@ read 131072/131072 bytes at offset 0 128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) No errors were found on the image. 
+=== Testing version downgrade with zero expansion and 4K cache entries === + +Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 +wrote 131072/131072 bytes at offset 0 +128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) +wrote 131072/131072 bytes at offset 33554432 +128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) +128 KiB (0x2) bytes allocated at offset 0 bytes (0x0) +31.875 MiB (0x1fe) bytes not allocated at offset 128 KiB (0x2) +128 KiB (0x2) bytes allocated at offset 32 MiB (0x200) +31.875 MiB (0x1fe) bytes not allocated at offset 32.125 MiB (0x202) +magic 0x514649fb +version 3 +backing_file_offset 0x0 +backing_file_size 0x0 +cluster_bits 16 +size 67108864 +crypt_method 0 +l1_size 1 +l1_table_offset 0x3 +refcount_table_offset 0x1 +refcount_table_clusters 1 +nb_snapshots 0 +snapshot_offset 0x0 +incompatible_features 0x0 +compatible_features 0x1 +autoclear_features0x0 +refcount_order4 +header_length 104 + +Header extension: +magic 0x6803f857 +length144 +data + +magic 0x514649fb +version 2 +backing_file_offset 0x0 +backing_file_size 0x0 +cluster_bits 16 +size 67108864 +crypt_method 0 +l1_size 1 +l1_table_offset 0x3 +refcount_table_offset 0x1 +refcount_table_clusters 1 +nb_snapshots 0 +snapshot_offset 0x0 +incompatible_features 0x0 +compatible_features 0x0 +autoclear_features0x0 +refcount_order4 +header_length 72 + +read 131072/131072 bytes at offset 0 +128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) +read 131072/131072 bytes at offset 33554432 +128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec) +64 MiB (0x400) bytes not allocated at offset 0 bytes (0x0) +No errors were found on the image. + === Testing dirty version downgrade === Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864 -- 2.13.6
[Qemu-block] [PULL 53/55] iotests: Test valid values of l2-cache-entry-size
From: Alberto GarciaThe l2-cache-entry-size setting can only contain values that are powers of two between 512 and the cluster size. Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Reviewed-by: Max Reitz Message-id: bd3547b670b8d0af11480c760991a22bcae5b48c.1517840877.git.be...@igalia.com [mreitz: Changed non-power-of-two test value from 300 to 4242] Signed-off-by: Max Reitz --- tests/qemu-iotests/103 | 17 + tests/qemu-iotests/103.out | 3 +++ 2 files changed, 20 insertions(+) diff --git a/tests/qemu-iotests/103 b/tests/qemu-iotests/103 index d0cfab8844..2841318492 100755 --- a/tests/qemu-iotests/103 +++ b/tests/qemu-iotests/103 @@ -66,6 +66,14 @@ $QEMU_IO -c "open -o cache-size=1M,refcount-cache-size=2M $TEST_IMG" 2>&1 \ $QEMU_IO -c "open -o cache-size=0,l2-cache-size=0,refcount-cache-size=0 $TEST_IMG" \ 2>&1 | _filter_testdir | _filter_imgfmt +# Invalid cache entry sizes +$QEMU_IO -c "open -o l2-cache-entry-size=256 $TEST_IMG" \ +2>&1 | _filter_testdir | _filter_imgfmt +$QEMU_IO -c "open -o l2-cache-entry-size=4242 $TEST_IMG" \ +2>&1 | _filter_testdir | _filter_imgfmt +$QEMU_IO -c "open -o l2-cache-entry-size=128k $TEST_IMG" \ +2>&1 | _filter_testdir | _filter_imgfmt + echo echo '=== Testing valid option combinations ===' echo @@ -94,6 +102,15 @@ $QEMU_IO -c "open -o l2-cache-size=1M,refcount-cache-size=0.25M $TEST_IMG" \ -c 'read -P 42 0 64k' \ | _filter_qemu_io +# Valid cache entry sizes +$QEMU_IO -c "open -o l2-cache-entry-size=512 $TEST_IMG" \ +2>&1 | _filter_testdir | _filter_imgfmt +$QEMU_IO -c "open -o l2-cache-entry-size=16k $TEST_IMG" \ +2>&1 | _filter_testdir | _filter_imgfmt +$QEMU_IO -c "open -o l2-cache-entry-size=64k $TEST_IMG" \ +2>&1 | _filter_testdir | _filter_imgfmt + + echo echo '=== Testing minimal L2 cache and COW ===' echo diff --git a/tests/qemu-iotests/103.out b/tests/qemu-iotests/103.out index b7aaadf89a..bd45d3875a 100644 --- a/tests/qemu-iotests/103.out +++ b/tests/qemu-iotests/103.out @@ -9,6 +9,9 @@ can't open device 
TEST_DIR/t.IMGFMT: cache-size, l2-cache-size and refcount-cach can't open device TEST_DIR/t.IMGFMT: l2-cache-size may not exceed cache-size can't open device TEST_DIR/t.IMGFMT: refcount-cache-size may not exceed cache-size can't open device TEST_DIR/t.IMGFMT: cache-size, l2-cache-size and refcount-cache-size may not be set the same time +can't open device TEST_DIR/t.IMGFMT: L2 cache entry size must be a power of two between 512 and the cluster size (65536) +can't open device TEST_DIR/t.IMGFMT: L2 cache entry size must be a power of two between 512 and the cluster size (65536) +can't open device TEST_DIR/t.IMGFMT: L2 cache entry size must be a power of two between 512 and the cluster size (65536) === Testing valid option combinations === -- 2.13.6
[Qemu-block] [PULL 48/55] qcow2: Rename l2_table in qcow2_alloc_compressed_cluster_offset()
From: Alberto GarciaThis function doesn't need any changes to support L2 slices, but since it's now dealing with slices instead of full tables, the l2_table variable is renamed for clarity. Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Reviewed-by: Max Reitz Message-id: 0c5d4b9bf163aa3b49ec19cc512a50d83563f2ad.1517840877.git.be...@igalia.com Signed-off-by: Max Reitz --- block/qcow2-cluster.c | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index 8fbaba008b..24055e19a1 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -747,26 +747,26 @@ uint64_t qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs, { BDRVQcow2State *s = bs->opaque; int l2_index, ret; -uint64_t *l2_table; +uint64_t *l2_slice; int64_t cluster_offset; int nb_csectors; -ret = get_cluster_table(bs, offset, _table, _index); +ret = get_cluster_table(bs, offset, _slice, _index); if (ret < 0) { return 0; } /* Compression can't overwrite anything. Fail if the cluster was already * allocated. 
*/ -cluster_offset = be64_to_cpu(l2_table[l2_index]); +cluster_offset = be64_to_cpu(l2_slice[l2_index]); if (cluster_offset & L2E_OFFSET_MASK) { -qcow2_cache_put(s->l2_table_cache, (void **) _table); +qcow2_cache_put(s->l2_table_cache, (void **) _slice); return 0; } cluster_offset = qcow2_alloc_bytes(bs, compressed_size); if (cluster_offset < 0) { -qcow2_cache_put(s->l2_table_cache, (void **) _table); +qcow2_cache_put(s->l2_table_cache, (void **) _slice); return 0; } @@ -781,9 +781,9 @@ uint64_t qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs, /* compressed clusters never have the copied flag */ BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE_COMPRESSED); -qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table); -l2_table[l2_index] = cpu_to_be64(cluster_offset); -qcow2_cache_put(s->l2_table_cache, (void **) _table); +qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice); +l2_slice[l2_index] = cpu_to_be64(cluster_offset); +qcow2_cache_put(s->l2_table_cache, (void **) _slice); return cluster_offset; } -- 2.13.6
[Qemu-block] [PULL 51/55] qcow2: Rename l2_table in count_cow_clusters()
From: Alberto Garcia

This function doesn't need any changes to support L2 slices, but since it's now dealing with slices instead of full tables, the l2_table variable is renamed for clarity.

Signed-off-by: Alberto Garcia
Reviewed-by: Eric Blake
Reviewed-by: Max Reitz
Message-id: 6107001fc79e6739242f1de7d191375e4f130aac.1517840877.git.be...@igalia.com
Signed-off-by: Max Reitz
---
 block/qcow2-cluster.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 074a4aaf1e..e406b0f3b9 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -999,12 +999,12 @@ err:
  * which must copy from the backing file) */
 static int count_cow_clusters(BDRVQcow2State *s, int nb_clusters,
-                              uint64_t *l2_table, int l2_index)
+                              uint64_t *l2_slice, int l2_index)
 {
     int i;

     for (i = 0; i < nb_clusters; i++) {
-        uint64_t l2_entry = be64_to_cpu(l2_table[l2_index + i]);
+        uint64_t l2_entry = be64_to_cpu(l2_slice[l2_index + i]);
         QCow2ClusterType cluster_type = qcow2_get_cluster_type(l2_entry);

         switch(cluster_type) {
-- 
2.13.6
[Qemu-block] [PULL 50/55] qcow2: Rename l2_table in count_contiguous_clusters_unallocated()
From: Alberto GarciaThis function doesn't need any changes to support L2 slices, but since it's now dealing with slices instead of full tables, the l2_table variable is renamed for clarity. Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Reviewed-by: Max Reitz Message-id: 78bcc54bc632574dd0b900a77a00a1b6ffc359e6.1517840877.git.be...@igalia.com Signed-off-by: Max Reitz --- block/qcow2-cluster.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index 54ae210b43..074a4aaf1e 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -406,10 +406,10 @@ static int count_contiguous_clusters(int nb_clusters, int cluster_size, /* * Checks how many consecutive unallocated clusters in a given L2 - * table have the same cluster type. + * slice have the same cluster type. */ static int count_contiguous_clusters_unallocated(int nb_clusters, - uint64_t *l2_table, + uint64_t *l2_slice, QCow2ClusterType wanted_type) { int i; @@ -417,7 +417,7 @@ static int count_contiguous_clusters_unallocated(int nb_clusters, assert(wanted_type == QCOW2_CLUSTER_ZERO_PLAIN || wanted_type == QCOW2_CLUSTER_UNALLOCATED); for (i = 0; i < nb_clusters; i++) { -uint64_t entry = be64_to_cpu(l2_table[i]); +uint64_t entry = be64_to_cpu(l2_slice[i]); QCow2ClusterType type = qcow2_get_cluster_type(entry); if (type != wanted_type) { -- 2.13.6
[Qemu-block] [PULL 49/55] qcow2: Rename l2_table in count_contiguous_clusters()
From: Alberto Garcia

This function doesn't need any changes to support L2 slices, but since it's now dealing with slices instead of full tables, the l2_table variable is renamed for clarity.

Signed-off-by: Alberto Garcia
Reviewed-by: Eric Blake
Reviewed-by: Max Reitz
Message-id: 812b0c3505bb1687e51285dccf1a94f0cecb1f74.1517840877.git.be...@igalia.com
Signed-off-by: Max Reitz
---
 block/qcow2-cluster.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 24055e19a1..54ae210b43 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -370,19 +370,19 @@ fail:
 }

 /*
- * Checks how many clusters in a given L2 table are contiguous in the image
+ * Checks how many clusters in a given L2 slice are contiguous in the image
  * file. As soon as one of the flags in the bitmask stop_flags changes compared
  * to the first cluster, the search is stopped and the cluster is not counted
  * as contiguous. (This allows it, for example, to stop at the first compressed
  * cluster which may require a different handling)
  */
 static int count_contiguous_clusters(int nb_clusters, int cluster_size,
-        uint64_t *l2_table, uint64_t stop_flags)
+        uint64_t *l2_slice, uint64_t stop_flags)
 {
     int i;
     QCow2ClusterType first_cluster_type;
     uint64_t mask = stop_flags | L2E_OFFSET_MASK | QCOW_OFLAG_COMPRESSED;
-    uint64_t first_entry = be64_to_cpu(l2_table[0]);
+    uint64_t first_entry = be64_to_cpu(l2_slice[0]);
     uint64_t offset = first_entry & mask;

     if (!offset) {
@@ -395,7 +395,7 @@ static int count_contiguous_clusters(int nb_clusters, int cluster_size,
     first_cluster_type == QCOW2_CLUSTER_ZERO_ALLOC);

     for (i = 0; i < nb_clusters; i++) {
-        uint64_t l2_entry = be64_to_cpu(l2_table[i]) & mask;
+        uint64_t l2_entry = be64_to_cpu(l2_slice[i]) & mask;
         if (offset + (uint64_t) i * cluster_size != l2_entry) {
             break;
         }
-- 
2.13.6
[Qemu-block] [PULL 46/55] qcow2: Update expand_zero_clusters_in_l1() to support L2 slices
From: Alberto Garciaexpand_zero_clusters_in_l1() expands zero clusters as a necessary step to downgrade qcow2 images to a version that doesn't support metadata zero clusters. This function takes an L1 table (which may or may not be active) and iterates over all its L2 tables looking for zero clusters. Since we'll be loading L2 slices instead of full tables we need to add an extra loop that iterates over all slices of each L2 table, and we should also use the slice size when allocating the buffer used when the L1 table is not active. This function doesn't need any additional changes so apart from that this patch simply updates the variable name from l2_table to l2_slice. Finally, and since we have to touch the bdrv_read() / bdrv_write() calls anyway, this patch takes the opportunity to replace them with the byte-based bdrv_pread() / bdrv_pwrite(). Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Reviewed-by: Max Reitz Message-id: 43590976f730501688096cff103f2923b72b0f32.1517840877.git.be...@igalia.com Signed-off-by: Max Reitz --- block/qcow2-cluster.c | 51 --- 1 file changed, 28 insertions(+), 23 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index ccb2944eda..8fbaba008b 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -1863,22 +1863,25 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table, { BDRVQcow2State *s = bs->opaque; bool is_active_l1 = (l1_table == s->l1_table); -uint64_t *l2_table = NULL; +uint64_t *l2_slice = NULL; +unsigned slice, slice_size2, n_slices; int ret; int i, j; +slice_size2 = s->l2_slice_size * sizeof(uint64_t); +n_slices = s->cluster_size / slice_size2; + if (!is_active_l1) { /* inactive L2 tables require a buffer to be stored in when loading * them from disk */ -l2_table = qemu_try_blockalign(bs->file->bs, s->cluster_size); -if (l2_table == NULL) { +l2_slice = qemu_try_blockalign(bs->file->bs, slice_size2); +if (l2_slice == NULL) { return -ENOMEM; } } for (i = 0; i < 
l1_size; i++) { uint64_t l2_offset = l1_table[i] & L1E_OFFSET_MASK; -bool l2_dirty = false; uint64_t l2_refcount; if (!l2_offset) { @@ -1904,22 +1907,23 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table, goto fail; } -{ +for (slice = 0; slice < n_slices; slice++) { +uint64_t slice_offset = l2_offset + slice * slice_size2; +bool l2_dirty = false; if (is_active_l1) { /* get active L2 tables from cache */ -ret = qcow2_cache_get(bs, s->l2_table_cache, l2_offset, - (void **)_table); +ret = qcow2_cache_get(bs, s->l2_table_cache, slice_offset, + (void **)_slice); } else { /* load inactive L2 tables from disk */ -ret = bdrv_read(bs->file, l2_offset / BDRV_SECTOR_SIZE, -(void *)l2_table, s->cluster_sectors); +ret = bdrv_pread(bs->file, slice_offset, l2_slice, slice_size2); } if (ret < 0) { goto fail; } -for (j = 0; j < s->l2_size; j++) { -uint64_t l2_entry = be64_to_cpu(l2_table[j]); +for (j = 0; j < s->l2_slice_size; j++) { +uint64_t l2_entry = be64_to_cpu(l2_slice[j]); int64_t offset = l2_entry & L2E_OFFSET_MASK; QCow2ClusterType cluster_type = qcow2_get_cluster_type(l2_entry); @@ -1933,7 +1937,7 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table, if (!bs->backing) { /* not backed; therefore we can simply deallocate the * cluster */ -l2_table[j] = 0; +l2_slice[j] = 0; l2_dirty = true; continue; } @@ -1960,12 +1964,13 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table, } if (offset_into_cluster(s, offset)) { +int l2_index = slice * s->l2_slice_size + j; qcow2_signal_corruption( bs, true, -1, -1, "Cluster allocation offset " "%#" PRIx64 " unaligned (L2 offset: %#" PRIx64 ", L2 index: %#x)", offset, -l2_offset, j); +l2_offset, l2_index); if (cluster_type ==
[Qemu-block] [PULL 47/55] qcow2: Update qcow2_truncate() to support L2 slices
From: Alberto GarciaThe qcow2_truncate() code is mostly independent from whether we're using L2 slices or full L2 tables, but in full and falloc preallocation modes new L2 tables are allocated using qcow2_alloc_cluster_link_l2(). Therefore the code needs to be modified to ensure that all nb_clusters that are processed in each call can be allocated with just one L2 slice. Signed-off-by: Alberto Garcia Message-id: 1fd7d272b5e7b66254a090b74cf2bed1cc334c0e.1517840877.git.be...@igalia.com Reviewed-by: Max Reitz Reviewed-by: Eric Blake Signed-off-by: Max Reitz --- block/qcow2.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/block/qcow2.c b/block/qcow2.c index ba8d71c72d..953254a004 100644 --- a/block/qcow2.c +++ b/block/qcow2.c @@ -3262,9 +3262,9 @@ static int qcow2_truncate(BlockDriverState *bs, int64_t offset, host_offset = allocation_start; guest_offset = old_length; while (nb_new_data_clusters) { -int64_t guest_cluster = guest_offset >> s->cluster_bits; -int64_t nb_clusters = MIN(nb_new_data_clusters, - s->l2_size - guest_cluster % s->l2_size); +int64_t nb_clusters = MIN( +nb_new_data_clusters, +s->l2_slice_size - offset_to_l2_slice_index(s, guest_offset)); QCowL2Meta allocation = { .offset = guest_offset, .alloc_offset = host_offset, -- 2.13.6
[Qemu-block] [PULL 43/55] qcow2: Update qcow2_update_snapshot_refcount() to support L2 slices
From: Alberto Garciaqcow2_update_snapshot_refcount() increases the refcount of all clusters of a given snapshot. In order to do that it needs to load all its L2 tables and iterate over their entries. Since we'll be loading L2 slices instead of full tables we need to add an extra loop that iterates over all slices of each L2 table. This function doesn't need any additional changes so apart from that this patch simply updates the variable name from l2_table to l2_slice. Signed-off-by: Alberto Garcia Message-id: 5f4db199b9637f4833b58487135124d70add8cf0.1517840877.git.be...@igalia.com Reviewed-by: Max Reitz Reviewed-by: Eric Blake Signed-off-by: Max Reitz --- block/qcow2-refcount.c | 32 ++-- 1 file changed, 18 insertions(+), 14 deletions(-) diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c index dfa28301c4..d46b69d7f3 100644 --- a/block/qcow2-refcount.c +++ b/block/qcow2-refcount.c @@ -1183,17 +1183,20 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs, int64_t l1_table_offset, int l1_size, int addend) { BDRVQcow2State *s = bs->opaque; -uint64_t *l1_table, *l2_table, l2_offset, entry, l1_size2, refcount; +uint64_t *l1_table, *l2_slice, l2_offset, entry, l1_size2, refcount; bool l1_allocated = false; int64_t old_entry, old_l2_offset; +unsigned slice, slice_size2, n_slices; int i, j, l1_modified = 0, nb_csectors; int ret; assert(addend >= -1 && addend <= 1); -l2_table = NULL; +l2_slice = NULL; l1_table = NULL; l1_size2 = l1_size * sizeof(uint64_t); +slice_size2 = s->l2_slice_size * sizeof(uint64_t); +n_slices = s->cluster_size / slice_size2; s->cache_discards = true; @@ -1236,19 +1239,19 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs, goto fail; } -{ +for (slice = 0; slice < n_slices; slice++) { ret = qcow2_cache_get(bs, s->l2_table_cache, - l2_offset, - (void **) _table); + l2_offset + slice * slice_size2, + (void **) _slice); if (ret < 0) { goto fail; } -for (j = 0; j < s->l2_size; j++) { +for (j = 0; j < s->l2_slice_size; j++) { uint64_t 
cluster_index; uint64_t offset; -entry = be64_to_cpu(l2_table[j]); +entry = be64_to_cpu(l2_slice[j]); old_entry = entry; entry &= ~QCOW_OFLAG_COPIED; offset = entry & L2E_OFFSET_MASK; @@ -1273,12 +1276,14 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs, case QCOW2_CLUSTER_NORMAL: case QCOW2_CLUSTER_ZERO_ALLOC: if (offset_into_cluster(s, offset)) { +/* Here l2_index means table (not slice) index */ +int l2_index = slice * s->l2_slice_size + j; qcow2_signal_corruption( bs, true, -1, -1, "Cluster " "allocation offset %#" PRIx64 " unaligned (L2 offset: %#" PRIx64 ", L2 index: %#x)", -offset, l2_offset, j); +offset, l2_offset, l2_index); ret = -EIO; goto fail; } @@ -1317,14 +1322,13 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs, qcow2_cache_set_dependency(bs, s->l2_table_cache, s->refcount_block_cache); } -l2_table[j] = cpu_to_be64(entry); +l2_slice[j] = cpu_to_be64(entry); qcow2_cache_entry_mark_dirty(s->l2_table_cache, - l2_table); + l2_slice); } } -qcow2_cache_put(s->l2_table_cache, (void **) _table); - +qcow2_cache_put(s->l2_table_cache, (void **) _slice); } if (addend != 0) { @@ -1352,8 +1356,8 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs, ret = bdrv_flush(bs); fail: -if (l2_table) { -qcow2_cache_put(s->l2_table_cache, (void **) _table); +if (l2_slice) { +qcow2_cache_put(s->l2_table_cache, (void **) _slice); } s->cache_discards = false; -- 2.13.6
[Qemu-block] [PULL 36/55] qcow2: Update qcow2_get_cluster_offset() to support L2 slices
From: Alberto Garciaqcow2_get_cluster_offset() checks how many contiguous bytes are available at a given offset. The returned number of bytes is limited by the amount that can be addressed without having to load more than one L2 table. Since we'll be loading L2 slices instead of full tables this patch changes the limit accordingly using the size of the L2 slice for the calculations instead of the full table size. One consequence of this is that with small L2 slices operations such as 'qemu-img map' will need to iterate in more steps because each qcow2_get_cluster_offset() call will potentially return a smaller number. However the code is already prepared for that so this doesn't break semantics. The l2_table variable is also renamed to l2_slice to reflect this, and offset_to_l2_index() is replaced with offset_to_l2_slice_index(). Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Reviewed-by: Max Reitz Message-id: 6b602260acb33da56ed6af9611731cb7acd110eb.1517840877.git.be...@igalia.com Signed-off-by: Max Reitz --- block/qcow2-cluster.c | 30 +++--- 1 file changed, 15 insertions(+), 15 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index cd1ba40f74..e0eea73297 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -529,8 +529,8 @@ int qcow2_get_cluster_offset(BlockDriverState *bs, uint64_t offset, { BDRVQcow2State *s = bs->opaque; unsigned int l2_index; -uint64_t l1_index, l2_offset, *l2_table; -int l1_bits, c; +uint64_t l1_index, l2_offset, *l2_slice; +int c; unsigned int offset_in_cluster; uint64_t bytes_available, bytes_needed, nb_clusters; QCow2ClusterType type; @@ -539,12 +539,12 @@ int qcow2_get_cluster_offset(BlockDriverState *bs, uint64_t offset, offset_in_cluster = offset_into_cluster(s, offset); bytes_needed = (uint64_t) *bytes + offset_in_cluster; -l1_bits = s->l2_bits + s->cluster_bits; - /* compute how many bytes there are between the start of the cluster - * containing offset and the end of the l1 entry */ 
-bytes_available = (1ULL << l1_bits) - (offset & ((1ULL << l1_bits) - 1)) -+ offset_in_cluster; + * containing offset and the end of the l2 slice that contains + * the entry pointing to it */ +bytes_available = +((uint64_t) (s->l2_slice_size - offset_to_l2_slice_index(s, offset))) +<< s->cluster_bits; if (bytes_needed > bytes_available) { bytes_needed = bytes_available; @@ -573,17 +573,17 @@ int qcow2_get_cluster_offset(BlockDriverState *bs, uint64_t offset, return -EIO; } -/* load the l2 table in memory */ +/* load the l2 slice in memory */ -ret = l2_load(bs, offset, l2_offset, _table); +ret = l2_load(bs, offset, l2_offset, _slice); if (ret < 0) { return ret; } /* find the cluster offset for the given disk offset */ -l2_index = offset_to_l2_index(s, offset); -*cluster_offset = be64_to_cpu(l2_table[l2_index]); +l2_index = offset_to_l2_slice_index(s, offset); +*cluster_offset = be64_to_cpu(l2_slice[l2_index]); nb_clusters = size_to_clusters(s, bytes_needed); /* bytes_needed <= *bytes + offset_in_cluster, both of which are unsigned @@ -610,14 +610,14 @@ int qcow2_get_cluster_offset(BlockDriverState *bs, uint64_t offset, case QCOW2_CLUSTER_UNALLOCATED: /* how many empty clusters ? */ c = count_contiguous_clusters_unallocated(nb_clusters, - _table[l2_index], type); + _slice[l2_index], type); *cluster_offset = 0; break; case QCOW2_CLUSTER_ZERO_ALLOC: case QCOW2_CLUSTER_NORMAL: /* how many allocated clusters ? 
*/ c = count_contiguous_clusters(nb_clusters, s->cluster_size, - _table[l2_index], QCOW_OFLAG_ZERO); + _slice[l2_index], QCOW_OFLAG_ZERO); *cluster_offset &= L2E_OFFSET_MASK; if (offset_into_cluster(s, *cluster_offset)) { qcow2_signal_corruption(bs, true, -1, -1, @@ -633,7 +633,7 @@ int qcow2_get_cluster_offset(BlockDriverState *bs, uint64_t offset, abort(); } -qcow2_cache_put(s->l2_table_cache, (void **) _table); +qcow2_cache_put(s->l2_table_cache, (void **) _slice); bytes_available = (int64_t)c * s->cluster_size; @@ -651,7 +651,7 @@ out: return type; fail: -qcow2_cache_put(s->l2_table_cache, (void **)_table); +qcow2_cache_put(s->l2_table_cache, (void **)_slice); return ret; } -- 2.13.6
[Qemu-block] [PULL 33/55] qcow2: Update l2_allocate() to support L2 slices
From: Alberto GarciaThis patch updates l2_allocate() to support the qcow2 cache returning L2 slices instead of full L2 tables. The old code simply gets an L2 table from the cache and initializes it with zeroes or with the contents of an existing table. With a cache that returns slices instead of tables the idea remains the same, but the code must now iterate over all the slices that are contained in an L2 table. Since now we're operating with slices the function can no longer return the newly-allocated table, so it's up to the caller to retrieve the appropriate L2 slice after calling l2_allocate() (note that with this patch the caller is still loading full L2 tables, but we'll deal with that in a separate patch). Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Message-id: 20fc0415bf0e011e29f6487ec86eb06a11f37445.1517840877.git.be...@igalia.com Reviewed-by: Max Reitz Signed-off-by: Max Reitz --- block/qcow2-cluster.c | 56 +++ 1 file changed, 34 insertions(+), 22 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index 3f7272441d..4aa9ea7b29 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -263,11 +263,12 @@ int qcow2_write_l1_entry(BlockDriverState *bs, int l1_index) * */ -static int l2_allocate(BlockDriverState *bs, int l1_index, uint64_t **table) +static int l2_allocate(BlockDriverState *bs, int l1_index) { BDRVQcow2State *s = bs->opaque; uint64_t old_l2_offset; -uint64_t *l2_table = NULL; +uint64_t *l2_slice = NULL; +unsigned slice, slice_size2, n_slices; int64_t l2_offset; int ret; @@ -298,42 +299,45 @@ static int l2_allocate(BlockDriverState *bs, int l1_index, uint64_t **table) /* allocate a new entry in the l2 cache */ +slice_size2 = s->l2_slice_size * sizeof(uint64_t); +n_slices = s->cluster_size / slice_size2; + trace_qcow2_l2_allocate_get_empty(bs, l1_index); -{ +for (slice = 0; slice < n_slices; slice++) { ret = qcow2_cache_get_empty(bs, s->l2_table_cache, -l2_offset, -(void **) table); +l2_offset + slice * 
slice_size2, +(void **) _slice); if (ret < 0) { goto fail; } -l2_table = *table; - if ((old_l2_offset & L1E_OFFSET_MASK) == 0) { -/* if there was no old l2 table, clear the new table */ -memset(l2_table, 0, s->l2_size * sizeof(uint64_t)); +/* if there was no old l2 table, clear the new slice */ +memset(l2_slice, 0, slice_size2); } else { -uint64_t *old_table; +uint64_t *old_slice; +uint64_t old_l2_slice_offset = +(old_l2_offset & L1E_OFFSET_MASK) + slice * slice_size2; -/* if there was an old l2 table, read it from the disk */ +/* if there was an old l2 table, read a slice from the disk */ BLKDBG_EVENT(bs->file, BLKDBG_L2_ALLOC_COW_READ); -ret = qcow2_cache_get(bs, s->l2_table_cache, - old_l2_offset & L1E_OFFSET_MASK, - (void **) _table); +ret = qcow2_cache_get(bs, s->l2_table_cache, old_l2_slice_offset, + (void **) _slice); if (ret < 0) { goto fail; } -memcpy(l2_table, old_table, s->cluster_size); +memcpy(l2_slice, old_slice, slice_size2); -qcow2_cache_put(s->l2_table_cache, (void **) _table); +qcow2_cache_put(s->l2_table_cache, (void **) _slice); } -/* write the l2 table to the file */ +/* write the l2 slice to the file */ BLKDBG_EVENT(bs->file, BLKDBG_L2_ALLOC_WRITE); trace_qcow2_l2_allocate_write_l2(bs, l1_index); -qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table); +qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice); +qcow2_cache_put(s->l2_table_cache, (void **) _slice); } ret = qcow2_cache_flush(bs, s->l2_table_cache); @@ -349,14 +353,13 @@ static int l2_allocate(BlockDriverState *bs, int l1_index, uint64_t **table) goto fail; } -*table = l2_table; trace_qcow2_l2_allocate_done(bs, l1_index, 0); return 0; fail: trace_qcow2_l2_allocate_done(bs, l1_index, ret); -if (l2_table != NULL) { -qcow2_cache_put(s->l2_table_cache, (void **) table); +if (l2_slice != NULL) { +qcow2_cache_put(s->l2_table_cache, (void **) _slice); } s->l1_table[l1_index] = old_l2_offset; if (l2_offset > 0) { @@ -701,7 +704,7 @@ static int get_cluster_table(BlockDriverState *bs, 
uint64_t offset, } } else {
[Qemu-block] [PULL 38/55] qcow2: Update handle_copied() to support L2 slices
From: Alberto Garciahandle_copied() loads an L2 table and limits the number of checked clusters to the amount that fits inside that table. Since we'll be loading L2 slices instead of full tables we need to update that limit. Apart from that, this function doesn't need any additional changes, so this patch simply updates the variable name from l2_table to l2_slice. Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Reviewed-by: Max Reitz Message-id: 541ac001a7d6b86bab2392554bee53c2b312148c.1517840877.git.be...@igalia.com Signed-off-by: Max Reitz --- block/qcow2-cluster.c | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index a01abb3b18..7699544358 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -1119,7 +1119,7 @@ static int handle_copied(BlockDriverState *bs, uint64_t guest_offset, BDRVQcow2State *s = bs->opaque; int l2_index; uint64_t cluster_offset; -uint64_t *l2_table; +uint64_t *l2_slice; uint64_t nb_clusters; unsigned int keep_clusters; int ret; @@ -1131,23 +1131,23 @@ static int handle_copied(BlockDriverState *bs, uint64_t guest_offset, == offset_into_cluster(s, *host_offset)); /* - * Calculate the number of clusters to look for. We stop at L2 table + * Calculate the number of clusters to look for. We stop at L2 slice * boundaries to keep things simple. 
*/ nb_clusters = size_to_clusters(s, offset_into_cluster(s, guest_offset) + *bytes); -l2_index = offset_to_l2_index(s, guest_offset); -nb_clusters = MIN(nb_clusters, s->l2_size - l2_index); +l2_index = offset_to_l2_slice_index(s, guest_offset); +nb_clusters = MIN(nb_clusters, s->l2_slice_size - l2_index); assert(nb_clusters <= INT_MAX); /* Find L2 entry for the first involved cluster */ -ret = get_cluster_table(bs, guest_offset, _table, _index); +ret = get_cluster_table(bs, guest_offset, _slice, _index); if (ret < 0) { return ret; } -cluster_offset = be64_to_cpu(l2_table[l2_index]); +cluster_offset = be64_to_cpu(l2_slice[l2_index]); /* Check how many clusters are already allocated and don't need COW */ if (qcow2_get_cluster_type(cluster_offset) == QCOW2_CLUSTER_NORMAL @@ -1175,7 +1175,7 @@ static int handle_copied(BlockDriverState *bs, uint64_t guest_offset, /* We keep all QCOW_OFLAG_COPIED clusters */ keep_clusters = count_contiguous_clusters(nb_clusters, s->cluster_size, - _table[l2_index], + _slice[l2_index], QCOW_OFLAG_COPIED | QCOW_OFLAG_ZERO); assert(keep_clusters <= nb_clusters); @@ -1190,7 +1190,7 @@ static int handle_copied(BlockDriverState *bs, uint64_t guest_offset, /* Cleanup */ out: -qcow2_cache_put(s->l2_table_cache, (void **) _table); +qcow2_cache_put(s->l2_table_cache, (void **) _slice); /* Only return a host offset if we actually made progress. Otherwise we * would make requirements for handle_alloc() that it can't fulfill */ -- 2.13.6
[Qemu-block] [PULL 35/55] qcow2: Update get_cluster_table() to support L2 slices
From: Alberto Garcia

This patch updates get_cluster_table() to return L2 slices instead of full L2 tables. The code itself needs almost no changes; it only needs to call offset_to_l2_slice_index() instead of offset_to_l2_index(). This patch also renames all the relevant variables and the documentation.

Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Reviewed-by: Max Reitz Message-id: 64cf064c0021ba315d3f3032da0f95db1b615f33.1517840877.git.be...@igalia.com Signed-off-by: Max Reitz --- block/qcow2-cluster.c | 16 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index 044eccda72..cd1ba40f74 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -659,20 +659,20 @@ fail: * get_cluster_table * * for a given disk offset, load (and allocate if needed) - * the l2 table. + * the appropriate slice of its l2 table. * - * the cluster index in the l2 table is given to the caller. + * the cluster index in the l2 slice is given to the caller. * * Returns 0 on success, -errno in failure case */ static int get_cluster_table(BlockDriverState *bs, uint64_t offset, - uint64_t **new_l2_table, + uint64_t **new_l2_slice, int *new_l2_index) { BDRVQcow2State *s = bs->opaque; unsigned int l2_index; uint64_t l1_index, l2_offset; -uint64_t *l2_table = NULL; +uint64_t *l2_slice = NULL; int ret; /* seek to the l2 offset in the l1 table */ @@ -712,17 +712,17 @@ static int get_cluster_table(BlockDriverState *bs, uint64_t offset, assert(offset_into_cluster(s, l2_offset) == 0); } -/* load the l2 table in memory */ -ret = l2_load(bs, offset, l2_offset, &l2_table); +/* load the l2 slice in memory */ +ret = l2_load(bs, offset, l2_offset, &l2_slice); if (ret < 0) { return ret; } /* find the cluster offset for the given disk offset */ -l2_index = offset_to_l2_index(s, offset); +l2_index = offset_to_l2_slice_index(s, offset); -*new_l2_table = l2_table; +*new_l2_slice = l2_slice; *new_l2_index = l2_index; return 0; -- 2.13.6
[Qemu-block] [PULL 44/55] qcow2: Read refcount before L2 table in expand_zero_clusters_in_l1()
From: Alberto Garcia

At the moment it doesn't really make a difference whether we call qcow2_get_refcount() before or after reading the L2 table, but if we want to support L2 slices we'll need to read the refcount first. This patch simply changes the order of those two operations to prepare for that. The patch with the actual semantic changes will be easier to read because of this.

Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Reviewed-by: Max Reitz Message-id: 947a91d934053a2dbfef979aeb9568f57ef57c5d.1517840877.git.be...@igalia.com Signed-off-by: Max Reitz --- block/qcow2-cluster.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index f440105f6c..1041dcb6dd 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -1898,6 +1898,12 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table, goto fail; } +ret = qcow2_get_refcount(bs, l2_offset >> s->cluster_bits, + &l2_refcount); +if (ret < 0) { +goto fail; +} + if (is_active_l1) { /* get active L2 tables from cache */ ret = qcow2_cache_get(bs, s->l2_table_cache, l2_offset, @@ -1911,12 +1917,6 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table, goto fail; } -ret = qcow2_get_refcount(bs, l2_offset >> s->cluster_bits, - &l2_refcount); -if (ret < 0) { -goto fail; -} - for (j = 0; j < s->l2_size; j++) { uint64_t l2_entry = be64_to_cpu(l2_table[j]); int64_t offset = l2_entry & L2E_OFFSET_MASK; -- 2.13.6
[Qemu-block] [PULL 39/55] qcow2: Update handle_alloc() to support L2 slices
From: Alberto Garciahandle_alloc() loads an L2 table and limits the number of checked clusters to the amount that fits inside that table. Since we'll be loading L2 slices instead of full tables we need to update that limit. Apart from that, this function doesn't need any additional changes, so this patch simply updates the variable name from l2_table to l2_slice. Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Reviewed-by: Max Reitz Message-id: b243299c7136f7014c5af51665431ddbf5e99afd.1517840877.git.be...@igalia.com Signed-off-by: Max Reitz --- block/qcow2-cluster.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index 7699544358..dbd57cd35f 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -1274,7 +1274,7 @@ static int handle_alloc(BlockDriverState *bs, uint64_t guest_offset, { BDRVQcow2State *s = bs->opaque; int l2_index; -uint64_t *l2_table; +uint64_t *l2_slice; uint64_t entry; uint64_t nb_clusters; int ret; @@ -1287,29 +1287,29 @@ static int handle_alloc(BlockDriverState *bs, uint64_t guest_offset, assert(*bytes > 0); /* - * Calculate the number of clusters to look for. We stop at L2 table + * Calculate the number of clusters to look for. We stop at L2 slice * boundaries to keep things simple. 
*/ nb_clusters = size_to_clusters(s, offset_into_cluster(s, guest_offset) + *bytes); -l2_index = offset_to_l2_index(s, guest_offset); -nb_clusters = MIN(nb_clusters, s->l2_size - l2_index); +l2_index = offset_to_l2_slice_index(s, guest_offset); +nb_clusters = MIN(nb_clusters, s->l2_slice_size - l2_index); assert(nb_clusters <= INT_MAX); /* Find L2 entry for the first involved cluster */ -ret = get_cluster_table(bs, guest_offset, _table, _index); +ret = get_cluster_table(bs, guest_offset, _slice, _index); if (ret < 0) { return ret; } -entry = be64_to_cpu(l2_table[l2_index]); +entry = be64_to_cpu(l2_slice[l2_index]); /* For the moment, overwrite compressed clusters one by one */ if (entry & QCOW_OFLAG_COMPRESSED) { nb_clusters = 1; } else { -nb_clusters = count_cow_clusters(s, nb_clusters, l2_table, l2_index); +nb_clusters = count_cow_clusters(s, nb_clusters, l2_slice, l2_index); } /* This function is only called when there were no non-COW clusters, so if @@ -1338,7 +1338,7 @@ static int handle_alloc(BlockDriverState *bs, uint64_t guest_offset, * nb_clusters already to a range of COW clusters */ preallocated_nb_clusters = count_contiguous_clusters(nb_clusters, s->cluster_size, - _table[l2_index], QCOW_OFLAG_COPIED); + _slice[l2_index], QCOW_OFLAG_COPIED); assert(preallocated_nb_clusters > 0); nb_clusters = preallocated_nb_clusters; @@ -1349,7 +1349,7 @@ static int handle_alloc(BlockDriverState *bs, uint64_t guest_offset, keep_old_clusters = true; } -qcow2_cache_put(s->l2_table_cache, (void **) _table); +qcow2_cache_put(s->l2_table_cache, (void **) _slice); if (!alloc_cluster_offset) { /* Allocate, if necessary at a given offset in the image file */ -- 2.13.6
[Qemu-block] [PULL 34/55] qcow2: Refactor get_cluster_table()
From: Alberto GarciaAfter the previous patch we're now always using l2_load() in get_cluster_table() regardless of whether a new L2 table has to be allocated or not. This patch refactors that part of the code to use one single l2_load() call. Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Reviewed-by: Max Reitz Message-id: ce31758c4a1fadccea7a6ccb93951eb01d95fd4c.1517840877.git.be...@igalia.com Signed-off-by: Max Reitz --- block/qcow2-cluster.c | 21 +++-- 1 file changed, 7 insertions(+), 14 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index 4aa9ea7b29..044eccda72 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -694,15 +694,7 @@ static int get_cluster_table(BlockDriverState *bs, uint64_t offset, return -EIO; } -/* seek the l2 table of the given l2 offset */ - -if (s->l1_table[l1_index] & QCOW_OFLAG_COPIED) { -/* load the l2 table in memory */ -ret = l2_load(bs, offset, l2_offset, _table); -if (ret < 0) { -return ret; -} -} else { +if (!(s->l1_table[l1_index] & QCOW_OFLAG_COPIED)) { /* First allocate a new L2 table (and do COW if needed) */ ret = l2_allocate(bs, l1_index); if (ret < 0) { @@ -718,11 +710,12 @@ static int get_cluster_table(BlockDriverState *bs, uint64_t offset, /* Get the offset of the newly-allocated l2 table */ l2_offset = s->l1_table[l1_index] & L1E_OFFSET_MASK; assert(offset_into_cluster(s, l2_offset) == 0); -/* Load the l2 table in memory */ -ret = l2_load(bs, offset, l2_offset, _table); -if (ret < 0) { -return ret; -} +} + +/* load the l2 table in memory */ +ret = l2_load(bs, offset, l2_offset, _table); +if (ret < 0) { +return ret; } /* find the cluster offset for the given disk offset */ -- 2.13.6
[Qemu-block] [PULL 32/55] qcow2: Prepare l2_allocate() for adding L2 slice support
From: Alberto GarciaAdding support for L2 slices to l2_allocate() needs (among other things) an extra loop that iterates over all slices of a new L2 table. Putting all changes in one patch would make it hard to read because all semantic changes would be mixed with pure indentation changes. To make things easier this patch simply creates a new block and changes the indentation of all lines of code inside it. Thus, all modifications in this patch are cosmetic. There are no semantic changes and no variables are renamed yet. The next patch will take care of that. Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Reviewed-by: Max Reitz Message-id: d0d7dca8520db304524f52f49d8157595a707a35.1517840877.git.be...@igalia.com Signed-off-by: Max Reitz --- block/qcow2-cluster.c | 53 --- 1 file changed, 29 insertions(+), 24 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index 542080aed0..3f7272441d 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -299,38 +299,43 @@ static int l2_allocate(BlockDriverState *bs, int l1_index, uint64_t **table) /* allocate a new entry in the l2 cache */ trace_qcow2_l2_allocate_get_empty(bs, l1_index); -ret = qcow2_cache_get_empty(bs, s->l2_table_cache, l2_offset, (void**) table); -if (ret < 0) { -goto fail; -} +{ +ret = qcow2_cache_get_empty(bs, s->l2_table_cache, +l2_offset, +(void **) table); +if (ret < 0) { +goto fail; +} -l2_table = *table; +l2_table = *table; -if ((old_l2_offset & L1E_OFFSET_MASK) == 0) { -/* if there was no old l2 table, clear the new table */ -memset(l2_table, 0, s->l2_size * sizeof(uint64_t)); -} else { -uint64_t* old_table; +if ((old_l2_offset & L1E_OFFSET_MASK) == 0) { +/* if there was no old l2 table, clear the new table */ +memset(l2_table, 0, s->l2_size * sizeof(uint64_t)); +} else { +uint64_t *old_table; -/* if there was an old l2 table, read it from the disk */ -BLKDBG_EVENT(bs->file, BLKDBG_L2_ALLOC_COW_READ); -ret = qcow2_cache_get(bs, s->l2_table_cache, -old_l2_offset & 
L1E_OFFSET_MASK, -(void**) _table); -if (ret < 0) { -goto fail; +/* if there was an old l2 table, read it from the disk */ +BLKDBG_EVENT(bs->file, BLKDBG_L2_ALLOC_COW_READ); +ret = qcow2_cache_get(bs, s->l2_table_cache, + old_l2_offset & L1E_OFFSET_MASK, + (void **) _table); +if (ret < 0) { +goto fail; +} + +memcpy(l2_table, old_table, s->cluster_size); + +qcow2_cache_put(s->l2_table_cache, (void **) _table); } -memcpy(l2_table, old_table, s->cluster_size); +/* write the l2 table to the file */ +BLKDBG_EVENT(bs->file, BLKDBG_L2_ALLOC_WRITE); -qcow2_cache_put(s->l2_table_cache, (void **) _table); +trace_qcow2_l2_allocate_write_l2(bs, l1_index); +qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table); } -/* write the l2 table to the file */ -BLKDBG_EVENT(bs->file, BLKDBG_L2_ALLOC_WRITE); - -trace_qcow2_l2_allocate_write_l2(bs, l1_index); -qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table); ret = qcow2_cache_flush(bs, s->l2_table_cache); if (ret < 0) { goto fail; -- 2.13.6
[Qemu-block] [PULL 40/55] qcow2: Update discard_single_l2() to support L2 slices
From: Alberto Garciadiscard_single_l2() limits the number of clusters to be discarded to the amount that fits inside an L2 table. Since we'll be loading L2 slices instead of full tables we need to update that limit. The function is renamed to discard_in_l2_slice() for clarity. Apart from that, this function doesn't need any additional changes, so this patch simply updates the variable name from l2_table to l2_slice. Signed-off-by: Alberto Garcia Message-id: 1cb44a5b68be5334cb01b97a3db3a3c5a43396e5.1517840877.git.be...@igalia.com Reviewed-by: Eric Blake Signed-off-by: Max Reitz --- block/qcow2-cluster.c | 32 1 file changed, 16 insertions(+), 16 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index dbd57cd35f..62336960a5 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -1631,32 +1631,32 @@ int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset) /* * This discards as many clusters of nb_clusters as possible at once (i.e. - * all clusters in the same L2 table) and returns the number of discarded + * all clusters in the same L2 slice) and returns the number of discarded * clusters. 
*/ -static int discard_single_l2(BlockDriverState *bs, uint64_t offset, - uint64_t nb_clusters, enum qcow2_discard_type type, - bool full_discard) +static int discard_in_l2_slice(BlockDriverState *bs, uint64_t offset, + uint64_t nb_clusters, + enum qcow2_discard_type type, bool full_discard) { BDRVQcow2State *s = bs->opaque; -uint64_t *l2_table; +uint64_t *l2_slice; int l2_index; int ret; int i; -ret = get_cluster_table(bs, offset, _table, _index); +ret = get_cluster_table(bs, offset, _slice, _index); if (ret < 0) { return ret; } -/* Limit nb_clusters to one L2 table */ -nb_clusters = MIN(nb_clusters, s->l2_size - l2_index); +/* Limit nb_clusters to one L2 slice */ +nb_clusters = MIN(nb_clusters, s->l2_slice_size - l2_index); assert(nb_clusters <= INT_MAX); for (i = 0; i < nb_clusters; i++) { uint64_t old_l2_entry; -old_l2_entry = be64_to_cpu(l2_table[l2_index + i]); +old_l2_entry = be64_to_cpu(l2_slice[l2_index + i]); /* * If full_discard is false, make sure that a discarded area reads back @@ -1694,18 +1694,18 @@ static int discard_single_l2(BlockDriverState *bs, uint64_t offset, } /* First remove L2 entries */ -qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table); +qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice); if (!full_discard && s->qcow_version >= 3) { -l2_table[l2_index + i] = cpu_to_be64(QCOW_OFLAG_ZERO); +l2_slice[l2_index + i] = cpu_to_be64(QCOW_OFLAG_ZERO); } else { -l2_table[l2_index + i] = cpu_to_be64(0); +l2_slice[l2_index + i] = cpu_to_be64(0); } /* Then decrease the refcount */ qcow2_free_any_clusters(bs, old_l2_entry, 1, type); } -qcow2_cache_put(s->l2_table_cache, (void **) _table); +qcow2_cache_put(s->l2_table_cache, (void **) _slice); return nb_clusters; } @@ -1729,10 +1729,10 @@ int qcow2_cluster_discard(BlockDriverState *bs, uint64_t offset, s->cache_discards = true; -/* Each L2 table is handled by its own loop iteration */ +/* Each L2 slice is handled by its own loop iteration */ while (nb_clusters > 0) { -cleared = 
discard_single_l2(bs, offset, nb_clusters, type, -full_discard); +cleared = discard_in_l2_slice(bs, offset, nb_clusters, type, + full_discard); if (cleared < 0) { ret = cleared; goto fail; -- 2.13.6
[Qemu-block] [PULL 42/55] qcow2: Prepare qcow2_update_snapshot_refcount() for adding L2 slice support
From: Alberto GarciaAdding support for L2 slices to qcow2_update_snapshot_refcount() needs (among other things) an extra loop that iterates over all slices of each L2 table. Putting all changes in one patch would make it hard to read because all semantic changes would be mixed with pure indentation changes. To make things easier this patch simply creates a new block and changes the indentation of all lines of code inside it. Thus, all modifications in this patch are cosmetic. There are no semantic changes and no variables are renamed yet. The next patch will take care of that. Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Reviewed-by: Max Reitz Message-id: 8ffaa5e55bd51121f80e498f4045b64902a94293.1517840877.git.be...@igalia.com Signed-off-by: Max Reitz --- block/qcow2-refcount.c | 144 + 1 file changed, 75 insertions(+), 69 deletions(-) diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c index 9df380d52c..dfa28301c4 100644 --- a/block/qcow2-refcount.c +++ b/block/qcow2-refcount.c @@ -1236,90 +1236,96 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs, goto fail; } -ret = qcow2_cache_get(bs, s->l2_table_cache, l2_offset, -(void**) _table); -if (ret < 0) { -goto fail; -} +{ +ret = qcow2_cache_get(bs, s->l2_table_cache, + l2_offset, + (void **) _table); +if (ret < 0) { +goto fail; +} -for (j = 0; j < s->l2_size; j++) { -uint64_t cluster_index; -uint64_t offset; - -entry = be64_to_cpu(l2_table[j]); -old_entry = entry; -entry &= ~QCOW_OFLAG_COPIED; -offset = entry & L2E_OFFSET_MASK; - -switch (qcow2_get_cluster_type(entry)) { -case QCOW2_CLUSTER_COMPRESSED: -nb_csectors = ((entry >> s->csize_shift) & - s->csize_mask) + 1; -if (addend != 0) { -ret = update_refcount(bs, -(entry & s->cluster_offset_mask) & ~511, +for (j = 0; j < s->l2_size; j++) { +uint64_t cluster_index; +uint64_t offset; + +entry = be64_to_cpu(l2_table[j]); +old_entry = entry; +entry &= ~QCOW_OFLAG_COPIED; +offset = entry & L2E_OFFSET_MASK; + +switch 
(qcow2_get_cluster_type(entry)) { +case QCOW2_CLUSTER_COMPRESSED: +nb_csectors = ((entry >> s->csize_shift) & + s->csize_mask) + 1; +if (addend != 0) { +ret = update_refcount( +bs, (entry & s->cluster_offset_mask) & ~511, nb_csectors * 512, abs(addend), addend < 0, QCOW2_DISCARD_SNAPSHOT); -if (ret < 0) { +if (ret < 0) { +goto fail; +} +} +/* compressed clusters are never modified */ +refcount = 2; +break; + +case QCOW2_CLUSTER_NORMAL: +case QCOW2_CLUSTER_ZERO_ALLOC: +if (offset_into_cluster(s, offset)) { +qcow2_signal_corruption( +bs, true, -1, -1, "Cluster " +"allocation offset %#" PRIx64 +" unaligned (L2 offset: %#" +PRIx64 ", L2 index: %#x)", +offset, l2_offset, j); +ret = -EIO; goto fail; } -} -/* compressed clusters are never modified */ -refcount = 2; -break; - -case QCOW2_CLUSTER_NORMAL: -case QCOW2_CLUSTER_ZERO_ALLOC: -if (offset_into_cluster(s, offset)) { -qcow2_signal_corruption(bs, true, -1, -1, "Cluster " -"allocation offset %#" PRIx64 -" unaligned (L2 offset: %#" -PRIx64 ", L2 index: %#x)", -offset, l2_offset, j); -
[Qemu-block] [PULL 41/55] qcow2: Update zero_single_l2() to support L2 slices
From: Alberto Garciazero_single_l2() limits the number of clusters to be zeroed to the amount that fits inside an L2 table. Since we'll be loading L2 slices instead of full tables we need to update that limit. The function is renamed to zero_in_l2_slice() for clarity. Apart from that, this function doesn't need any additional changes, so this patch simply updates the variable name from l2_table to l2_slice. Signed-off-by: Alberto Garcia Message-id: ebc16e7e79fa6969d8975ef487d679794de4fbcc.1517840877.git.be...@igalia.com Reviewed-by: Eric Blake Signed-off-by: Max Reitz --- block/qcow2-cluster.c | 28 ++-- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index 62336960a5..f440105f6c 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -1752,33 +1752,33 @@ fail: /* * This zeroes as many clusters of nb_clusters as possible at once (i.e. - * all clusters in the same L2 table) and returns the number of zeroed + * all clusters in the same L2 slice) and returns the number of zeroed * clusters. 
*/ -static int zero_single_l2(BlockDriverState *bs, uint64_t offset, - uint64_t nb_clusters, int flags) +static int zero_in_l2_slice(BlockDriverState *bs, uint64_t offset, +uint64_t nb_clusters, int flags) { BDRVQcow2State *s = bs->opaque; -uint64_t *l2_table; +uint64_t *l2_slice; int l2_index; int ret; int i; bool unmap = !!(flags & BDRV_REQ_MAY_UNMAP); -ret = get_cluster_table(bs, offset, _table, _index); +ret = get_cluster_table(bs, offset, _slice, _index); if (ret < 0) { return ret; } -/* Limit nb_clusters to one L2 table */ -nb_clusters = MIN(nb_clusters, s->l2_size - l2_index); +/* Limit nb_clusters to one L2 slice */ +nb_clusters = MIN(nb_clusters, s->l2_slice_size - l2_index); assert(nb_clusters <= INT_MAX); for (i = 0; i < nb_clusters; i++) { uint64_t old_offset; QCow2ClusterType cluster_type; -old_offset = be64_to_cpu(l2_table[l2_index + i]); +old_offset = be64_to_cpu(l2_slice[l2_index + i]); /* * Minimize L2 changes if the cluster already reads back as @@ -1790,16 +1790,16 @@ static int zero_single_l2(BlockDriverState *bs, uint64_t offset, continue; } -qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table); +qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice); if (cluster_type == QCOW2_CLUSTER_COMPRESSED || unmap) { -l2_table[l2_index + i] = cpu_to_be64(QCOW_OFLAG_ZERO); +l2_slice[l2_index + i] = cpu_to_be64(QCOW_OFLAG_ZERO); qcow2_free_any_clusters(bs, old_offset, 1, QCOW2_DISCARD_REQUEST); } else { -l2_table[l2_index + i] |= cpu_to_be64(QCOW_OFLAG_ZERO); +l2_slice[l2_index + i] |= cpu_to_be64(QCOW_OFLAG_ZERO); } } -qcow2_cache_put(s->l2_table_cache, (void **) _table); +qcow2_cache_put(s->l2_table_cache, (void **) _slice); return nb_clusters; } @@ -1823,13 +1823,13 @@ int qcow2_cluster_zeroize(BlockDriverState *bs, uint64_t offset, return -ENOTSUP; } -/* Each L2 table is handled by its own loop iteration */ +/* Each L2 slice is handled by its own loop iteration */ nb_clusters = size_to_clusters(s, bytes); s->cache_discards = true; while 
(nb_clusters > 0) { -cleared = zero_single_l2(bs, offset, nb_clusters, flags); +cleared = zero_in_l2_slice(bs, offset, nb_clusters, flags); if (cleared < 0) { ret = cleared; goto fail; -- 2.13.6
[Qemu-block] [PULL 27/55] qcow2: Remove BDS parameter from qcow2_cache_is_table_offset()
From: Alberto Garcia

This function was only using the BlockDriverState parameter to pass it to qcow2_cache_get_table_addr(). This is no longer necessary so this parameter can be removed. Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Reviewed-by: Max Reitz Message-id: eb0ed90affcf302e5a954bafb5931b5215483d3a.1517840877.git.be...@igalia.com Signed-off-by: Max Reitz --- block/qcow2.h | 3 +-- block/qcow2-cache.c| 3 +-- block/qcow2-refcount.c | 6 +++--- 3 files changed, 5 insertions(+), 7 deletions(-) diff --git a/block/qcow2.h b/block/qcow2.h index 7296784353..edc5d8d57d 100644 --- a/block/qcow2.h +++ b/block/qcow2.h @@ -654,8 +654,7 @@ int qcow2_cache_get(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset, int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset, void **table); void qcow2_cache_put(Qcow2Cache *c, void **table); -void *qcow2_cache_is_table_offset(BlockDriverState *bs, Qcow2Cache *c, - uint64_t offset); +void *qcow2_cache_is_table_offset(Qcow2Cache *c, uint64_t offset); void qcow2_cache_discard(Qcow2Cache *c, void *table); /* qcow2-bitmap.c functions */ diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c index 8d0b65e671..b1aa42477e 100644 --- a/block/qcow2-cache.c +++ b/block/qcow2-cache.c @@ -428,8 +428,7 @@ void qcow2_cache_entry_mark_dirty(Qcow2Cache *c, void *table) c->entries[i].dirty = true; } -void *qcow2_cache_is_table_offset(BlockDriverState *bs, Qcow2Cache *c, - uint64_t offset) +void *qcow2_cache_is_table_offset(Qcow2Cache *c, uint64_t offset) { int i; diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c index 361b39d5cc..9df380d52c 100644 --- a/block/qcow2-refcount.c +++ b/block/qcow2-refcount.c @@ -871,14 +871,14 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs, if (refcount == 0) { void *table; -table = qcow2_cache_is_table_offset(bs, s->refcount_block_cache, +table = qcow2_cache_is_table_offset(s->refcount_block_cache, offset); if (table != NULL) {
qcow2_cache_put(s->refcount_block_cache, &refcount_block); qcow2_cache_discard(s->refcount_block_cache, table); } -table = qcow2_cache_is_table_offset(bs, s->l2_table_cache, offset); +table = qcow2_cache_is_table_offset(s->l2_table_cache, offset); if (table != NULL) { qcow2_cache_discard(s->l2_table_cache, table); } @@ -3186,7 +3186,7 @@ static int qcow2_discard_refcount_block(BlockDriverState *bs, s->free_cluster_index = cluster_index; } -refblock = qcow2_cache_is_table_offset(bs, s->refcount_block_cache, +refblock = qcow2_cache_is_table_offset(s->refcount_block_cache, discard_block_offs); if (refblock) { /* discard refblock from the cache if refblock is cached */ -- 2.13.6
[Qemu-block] [PULL 28/55] qcow2: Add offset_to_l1_index()
From: Alberto Garcia

Similar to offset_to_l2_index(), this function returns the index in the L1 table for a given guest offset. This is only used in a couple of places and it's not a particularly complex calculation, but it makes the code a bit more readable. Although in the qcow2_get_cluster_offset() case the old code was taking advantage of the l1_bits variable, we're going to get rid of the other uses of l1_bits in a later patch anyway, so it doesn't make sense to keep it just for this. Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Reviewed-by: Max Reitz Message-id: a5f626fed526b7459a0425fad06d823d18df8522.1517840877.git.be...@igalia.com Signed-off-by: Max Reitz --- block/qcow2.h | 5 + block/qcow2-cluster.c | 4 ++-- 2 files changed, 7 insertions(+), 2 deletions(-) diff --git a/block/qcow2.h b/block/qcow2.h index edc5d8d57d..d9ba57c030 100644 --- a/block/qcow2.h +++ b/block/qcow2.h @@ -463,6 +463,11 @@ static inline int64_t size_to_l1(BDRVQcow2State *s, int64_t size) return (size + (1ULL << shift) - 1) >> shift; } +static inline int offset_to_l1_index(BDRVQcow2State *s, uint64_t offset) +{ +return offset >> (s->l2_bits + s->cluster_bits); +} + static inline int offset_to_l2_index(BDRVQcow2State *s, int64_t offset) { return (offset >> s->cluster_bits) & (s->l2_size - 1); diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index ab5715c97e..ce7591dc3d 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -540,7 +540,7 @@ int qcow2_get_cluster_offset(BlockDriverState *bs, uint64_t offset, /* seek to the l2 offset in the l1 table */ -l1_index = offset >> l1_bits; +l1_index = offset_to_l1_index(s, offset); if (l1_index >= s->l1_size) { type = QCOW2_CLUSTER_UNALLOCATED; goto out; @@ -663,7 +663,7 @@ static int get_cluster_table(BlockDriverState *bs, uint64_t offset, /* seek to the l2 offset in the l1 table */ -l1_index = offset >> (s->l2_bits + s->cluster_bits); +l1_index = offset_to_l1_index(s, offset); if (l1_index >= s->l1_size) { ret =
qcow2_grow_l1_table(bs, l1_index + 1, false); if (ret < 0) { -- 2.13.6
[Qemu-block] [PULL 29/55] qcow2: Add l2_slice_size field to BDRVQcow2State
From: Alberto Garcia

The BDRVQcow2State structure contains an l2_size field, which stores the number of 64-bit entries in an L2 table. For efficiency reasons we want to be able to load slices instead of full L2 tables, so we need to know how many entries an L2 slice can hold. An L2 slice is the portion of an L2 table that is loaded by the qcow2 cache. At the moment that cache can only load complete tables, therefore an L2 slice has the same size as an L2 table (one cluster) and l2_size == l2_slice_size. Later we'll allow smaller slices, but until then we have to use this new l2_slice_size field to make the rest of the code ready for that. Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Reviewed-by: Max Reitz Message-id: adb048595f9fb5dfb110c802a8b3c3be3b937f37.1517840877.git.be...@igalia.com Signed-off-by: Max Reitz --- block/qcow2.h | 1 + block/qcow2.c | 3 +++ 2 files changed, 4 insertions(+) diff --git a/block/qcow2.h b/block/qcow2.h index d9ba57c030..45fb7f4299 100644 --- a/block/qcow2.h +++ b/block/qcow2.h @@ -251,6 +251,7 @@ typedef struct BDRVQcow2State { int cluster_bits; int cluster_size; int cluster_sectors; +int l2_slice_size; int l2_bits; int l2_size; int l1_size; diff --git a/block/qcow2.c b/block/qcow2.c index 461cc7ab1c..ba8d71c72d 100644 --- a/block/qcow2.c +++ b/block/qcow2.c @@ -807,6 +807,7 @@ static void read_cache_sizes(BlockDriverState *bs, QemuOpts *opts, typedef struct Qcow2ReopenState { Qcow2Cache *l2_table_cache; Qcow2Cache *refcount_block_cache; +int l2_slice_size; /* Number of entries in a slice of the L2 table */ bool use_lazy_refcounts; int overlap_check; bool discard_passthrough[QCOW2_DISCARD_MAX]; @@ -888,6 +889,7 @@ static int qcow2_update_options_prepare(BlockDriverState *bs, } } +r->l2_slice_size = s->cluster_size / sizeof(uint64_t); r->l2_table_cache = qcow2_cache_create(bs, l2_cache_size); r->refcount_block_cache = qcow2_cache_create(bs, refcount_cache_size); if (r->l2_table_cache == NULL || r->refcount_block_cache == NULL)
{ @@ -1051,6 +1053,7 @@ static void qcow2_update_options_commit(BlockDriverState *bs, } s->l2_table_cache = r->l2_table_cache; s->refcount_block_cache = r->refcount_block_cache; +s->l2_slice_size = r->l2_slice_size; s->overlap_check = r->overlap_check; s->use_lazy_refcounts = r->use_lazy_refcounts; -- 2.13.6
[Qemu-block] [PULL 23/55] qcow2: Remove BDS parameter from qcow2_cache_put()
From: Alberto Garcia

This function was only using the BlockDriverState parameter to pass it to qcow2_cache_get_table_idx(). This is no longer necessary so this parameter can be removed. Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Reviewed-by: Max Reitz Message-id: 6f98155489054a457563da77cdad1a66ebb3e896.1517840876.git.be...@igalia.com Signed-off-by: Max Reitz --- block/qcow2.h | 2 +- block/qcow2-cache.c| 2 +- block/qcow2-cluster.c | 28 ++-- block/qcow2-refcount.c | 30 +++--- 4 files changed, 31 insertions(+), 31 deletions(-) diff --git a/block/qcow2.h b/block/qcow2.h index 73e0e0133d..606dcdd47f 100644 --- a/block/qcow2.h +++ b/block/qcow2.h @@ -653,7 +653,7 @@ int qcow2_cache_get(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset, void **table); int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset, void **table); -void qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table); +void qcow2_cache_put(Qcow2Cache *c, void **table); void *qcow2_cache_is_table_offset(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset); void qcow2_cache_discard(BlockDriverState *bs, Qcow2Cache *c, void *table); diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c index 07603e6b15..3b55f39afb 100644 --- a/block/qcow2-cache.c +++ b/block/qcow2-cache.c @@ -407,7 +407,7 @@ int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset, return qcow2_cache_do_get(bs, c, offset, table, false); } -void qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table) +void qcow2_cache_put(Qcow2Cache *c, void **table) { int i = qcow2_cache_get_table_idx(c, *table); diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index 08af5c5995..ab5715c97e 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -317,7 +317,7 @@ static int l2_allocate(BlockDriverState *bs, int l1_index, uint64_t **table) memcpy(l2_table, old_table, s->cluster_size); -qcow2_cache_put(bs, s->l2_table_cache, (void **) &old_table);
+qcow2_cache_put(s->l2_table_cache, (void **) &old_table); } /* write the l2 table to the file */ @@ -345,7 +345,7 @@ static int l2_allocate(BlockDriverState *bs, int l1_index, uint64_t **table) fail: trace_qcow2_l2_allocate_done(bs, l1_index, ret); if (l2_table != NULL) { -qcow2_cache_put(bs, s->l2_table_cache, (void**) table); +qcow2_cache_put(s->l2_table_cache, (void **) table); } s->l1_table[l1_index] = old_l2_offset; if (l2_offset > 0) { @@ -619,7 +619,7 @@ int qcow2_get_cluster_offset(BlockDriverState *bs, uint64_t offset, abort(); } -qcow2_cache_put(bs, s->l2_table_cache, (void**) &l2_table); +qcow2_cache_put(s->l2_table_cache, (void **) &l2_table); bytes_available = (int64_t)c * s->cluster_size; @@ -637,7 +637,7 @@ out: return type; fail: -qcow2_cache_put(bs, s->l2_table_cache, (void **)&l2_table); +qcow2_cache_put(s->l2_table_cache, (void **)&l2_table); return ret; } @@ -744,13 +744,13 @@ uint64_t qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs, * allocated. */ cluster_offset = be64_to_cpu(l2_table[l2_index]); if (cluster_offset & L2E_OFFSET_MASK) { -qcow2_cache_put(bs, s->l2_table_cache, (void**) &l2_table); +qcow2_cache_put(s->l2_table_cache, (void **) &l2_table); return 0; } cluster_offset = qcow2_alloc_bytes(bs, compressed_size); if (cluster_offset < 0) { -qcow2_cache_put(bs, s->l2_table_cache, (void**) &l2_table); +qcow2_cache_put(s->l2_table_cache, (void **) &l2_table); return 0; } @@ -767,7 +767,7 @@ uint64_t qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs, BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE_COMPRESSED); qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table); l2_table[l2_index] = cpu_to_be64(cluster_offset); -qcow2_cache_put(bs, s->l2_table_cache, (void **) &l2_table); +qcow2_cache_put(s->l2_table_cache, (void **) &l2_table); return cluster_offset; } @@ -956,7 +956,7 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m) } -qcow2_cache_put(bs, s->l2_table_cache, (void **) &l2_table); +qcow2_cache_put(s->l2_table_cache, (void **) &l2_table);
/* * If this was a COW, we need to decrease the refcount of the old cluster. @@ -1174,7 +1174,7 @@ static int handle_copied(BlockDriverState *bs, uint64_t guest_offset, /* Cleanup */ out: -qcow2_cache_put(bs, s->l2_table_cache, (void **) &l2_table); +qcow2_cache_put(s->l2_table_cache, (void **) &l2_table); /* Only return a host offset if we actually made progress. Otherwise we * would make requirements for handle_alloc() that it can't fulfill */ @@
[Qemu-block] [PULL 26/55] qcow2: Remove BDS parameter from qcow2_cache_discard()
From: Alberto Garcia

This function was only using the BlockDriverState parameter to pass it to qcow2_cache_get_table_idx() and qcow2_cache_table_release(). This is no longer necessary so this parameter can be removed. Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Reviewed-by: Max Reitz Message-id: 9724f7e38e763ad3be32627c6b7fe8df9edb1476.1517840877.git.be...@igalia.com Signed-off-by: Max Reitz --- block/qcow2.h | 2 +- block/qcow2-cache.c| 2 +- block/qcow2-refcount.c | 6 +++--- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/block/qcow2.h b/block/qcow2.h index 867f2e8828..7296784353 100644 --- a/block/qcow2.h +++ b/block/qcow2.h @@ -656,7 +656,7 @@ int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset, void qcow2_cache_put(Qcow2Cache *c, void **table); void *qcow2_cache_is_table_offset(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset); -void qcow2_cache_discard(BlockDriverState *bs, Qcow2Cache *c, void *table); +void qcow2_cache_discard(Qcow2Cache *c, void *table); /* qcow2-bitmap.c functions */ int qcow2_check_bitmaps_refcounts(BlockDriverState *bs, BdrvCheckResult *res, diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c index 03b3e03c6c..8d0b65e671 100644 --- a/block/qcow2-cache.c +++ b/block/qcow2-cache.c @@ -441,7 +441,7 @@ void *qcow2_cache_is_table_offset(BlockDriverState *bs, Qcow2Cache *c, return NULL; } -void qcow2_cache_discard(BlockDriverState *bs, Qcow2Cache *c, void *table) +void qcow2_cache_discard(Qcow2Cache *c, void *table) { int i = qcow2_cache_get_table_idx(c, table); diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c index 65af05dd23..361b39d5cc 100644 --- a/block/qcow2-refcount.c +++ b/block/qcow2-refcount.c @@ -875,12 +875,12 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs, offset); if (table != NULL) { qcow2_cache_put(s->refcount_block_cache, &refcount_block); -qcow2_cache_discard(bs, s->refcount_block_cache, table); +qcow2_cache_discard(s->refcount_block_cache,
table); } table = qcow2_cache_is_table_offset(bs, s->l2_table_cache, offset); if (table != NULL) { -qcow2_cache_discard(bs, s->l2_table_cache, table); +qcow2_cache_discard(s->l2_table_cache, table); } if (s->discard_passthrough[type]) { @@ -3190,7 +3190,7 @@ static int qcow2_discard_refcount_block(BlockDriverState *bs, discard_block_offs); if (refblock) { /* discard refblock from the cache if refblock is cached */ -qcow2_cache_discard(bs, s->refcount_block_cache, refblock); +qcow2_cache_discard(s->refcount_block_cache, refblock); } update_refcount_discard(bs, discard_block_offs, s->cluster_size); -- 2.13.6
[Qemu-block] [PULL 19/55] qcow2: Remove BDS parameter from qcow2_cache_get_table_addr()
From: Alberto Garcia

This function was only using the BlockDriverState parameter to get the cache table size (since it was equal to the cluster size). This is no longer necessary so this parameter can be removed. Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Reviewed-by: Max Reitz Message-id: e1f943a9e89e1deb876f45de1bb22419ccdb6ad3.1517840876.git.be...@igalia.com Signed-off-by: Max Reitz --- block/qcow2-cache.c | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c index 38c03770b4..a481ef499a 100644 --- a/block/qcow2-cache.c +++ b/block/qcow2-cache.c @@ -46,8 +46,7 @@ struct Qcow2Cache { uint64_tcache_clean_lru_counter; }; -static inline void *qcow2_cache_get_table_addr(BlockDriverState *bs, -Qcow2Cache *c, int table) +static inline void *qcow2_cache_get_table_addr(Qcow2Cache *c, int table) { return (uint8_t *) c->table_array + (size_t) table * c->table_size; } @@ -78,7 +77,7 @@ static void qcow2_cache_table_release(BlockDriverState *bs, Qcow2Cache *c, { /* Using MADV_DONTNEED to discard memory is a Linux-specific feature */ #ifdef CONFIG_LINUX -void *t = qcow2_cache_get_table_addr(bs, c, i); +void *t = qcow2_cache_get_table_addr(c, i); int align = getpagesize(); size_t mem_size = (size_t) c->table_size * num_tables; size_t offset = QEMU_ALIGN_UP((uintptr_t) t, align) - (uintptr_t) t; @@ -222,7 +221,7 @@ static int qcow2_cache_entry_flush(BlockDriverState *bs, Qcow2Cache *c, int i) } ret = bdrv_pwrite(bs->file, c->entries[i].offset, - qcow2_cache_get_table_addr(bs, c, i), c->table_size); + qcow2_cache_get_table_addr(c, i), c->table_size); if (ret < 0) { return ret; } @@ -378,7 +377,7 @@ static int qcow2_cache_do_get(BlockDriverState *bs, Qcow2Cache *c, } ret = bdrv_pread(bs->file, offset, - qcow2_cache_get_table_addr(bs, c, i), + qcow2_cache_get_table_addr(c, i), c->table_size); if (ret < 0) { return ret; @@ -390,7 +389,7 @@ static int qcow2_cache_do_get(BlockDriverState *bs, Qcow2Cache *c, /* And
return the right table */ found: c->entries[i].ref++; -*table = qcow2_cache_get_table_addr(bs, c, i); +*table = qcow2_cache_get_table_addr(c, i); trace_qcow2_cache_get_done(qemu_coroutine_self(), c == s->l2_table_cache, i); @@ -439,7 +438,7 @@ void *qcow2_cache_is_table_offset(BlockDriverState *bs, Qcow2Cache *c, for (i = 0; i < c->size; i++) { if (c->entries[i].offset == offset) { -return qcow2_cache_get_table_addr(bs, c, i); +return qcow2_cache_get_table_addr(c, i); } } return NULL; -- 2.13.6
[Qemu-block] [PULL 25/55] qcow2: Remove BDS parameter from qcow2_cache_clean_unused()
From: Alberto Garcia

This function was only using the BlockDriverState parameter to pass it to qcow2_cache_table_release(). This is no longer necessary so this parameter can be removed. Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Reviewed-by: Max Reitz Message-id: b74f17591af52f201de0ea3a3b2dd0a81932334d.1517840876.git.be...@igalia.com Signed-off-by: Max Reitz --- block/qcow2.h | 2 +- block/qcow2-cache.c | 2 +- block/qcow2.c | 4 ++-- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/block/qcow2.h b/block/qcow2.h index 9936cd0700..867f2e8828 100644 --- a/block/qcow2.h +++ b/block/qcow2.h @@ -646,7 +646,7 @@ int qcow2_cache_set_dependency(BlockDriverState *bs, Qcow2Cache *c, Qcow2Cache *dependency); void qcow2_cache_depends_on_flush(Qcow2Cache *c); -void qcow2_cache_clean_unused(BlockDriverState *bs, Qcow2Cache *c); +void qcow2_cache_clean_unused(Qcow2Cache *c); int qcow2_cache_empty(BlockDriverState *bs, Qcow2Cache *c); int qcow2_cache_get(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset, diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c index 6f17a28635..03b3e03c6c 100644 --- a/block/qcow2-cache.c +++ b/block/qcow2-cache.c @@ -93,7 +93,7 @@ static inline bool can_clean_entry(Qcow2Cache *c, int i) t->lru_counter <= c->cache_clean_lru_counter; } -void qcow2_cache_clean_unused(BlockDriverState *bs, Qcow2Cache *c) +void qcow2_cache_clean_unused(Qcow2Cache *c) { int i = 0; while (i < c->size) { diff --git a/block/qcow2.c b/block/qcow2.c index 38e7fdc4b0..461cc7ab1c 100644 --- a/block/qcow2.c +++ b/block/qcow2.c @@ -706,8 +706,8 @@ static void cache_clean_timer_cb(void *opaque) { BlockDriverState *bs = opaque; BDRVQcow2State *s = bs->opaque; -qcow2_cache_clean_unused(bs, s->l2_table_cache); -qcow2_cache_clean_unused(bs, s->refcount_block_cache); +qcow2_cache_clean_unused(s->l2_table_cache); +qcow2_cache_clean_unused(s->refcount_block_cache); timer_mod(s->cache_clean_timer, qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + (int64_t)
s->cache_clean_interval * 1000); } -- 2.13.6
[Qemu-block] [PULL 22/55] qcow2: Remove BDS parameter from qcow2_cache_entry_mark_dirty()
From: Alberto Garcia

This function was only using the BlockDriverState parameter to pass it to qcow2_cache_get_table_idx(). This is no longer necessary so this parameter can be removed. Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Reviewed-by: Max Reitz Message-id: 5c40516a91782b083c1428b7b6a41bb9e2679bfb.1517840876.git.be...@igalia.com Signed-off-by: Max Reitz --- block/qcow2.h | 3 +-- block/qcow2-cache.c| 3 +-- block/qcow2-cluster.c | 12 ++-- block/qcow2-refcount.c | 14 ++ 4 files changed, 14 insertions(+), 18 deletions(-) diff --git a/block/qcow2.h b/block/qcow2.h index 016e87c81a..73e0e0133d 100644 --- a/block/qcow2.h +++ b/block/qcow2.h @@ -639,8 +639,7 @@ int qcow2_read_snapshots(BlockDriverState *bs); Qcow2Cache *qcow2_cache_create(BlockDriverState *bs, int num_tables); int qcow2_cache_destroy(BlockDriverState* bs, Qcow2Cache *c); -void qcow2_cache_entry_mark_dirty(BlockDriverState *bs, Qcow2Cache *c, - void *table); +void qcow2_cache_entry_mark_dirty(Qcow2Cache *c, void *table); int qcow2_cache_flush(BlockDriverState *bs, Qcow2Cache *c); int qcow2_cache_write(BlockDriverState *bs, Qcow2Cache *c); int qcow2_cache_set_dependency(BlockDriverState *bs, Qcow2Cache *c, diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c index 5ff2cbf5c5..07603e6b15 100644 --- a/block/qcow2-cache.c +++ b/block/qcow2-cache.c @@ -421,8 +421,7 @@ void qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table) assert(c->entries[i].ref >= 0); } -void qcow2_cache_entry_mark_dirty(BlockDriverState *bs, Qcow2Cache *c, - void *table) +void qcow2_cache_entry_mark_dirty(Qcow2Cache *c, void *table) { int i = qcow2_cache_get_table_idx(c, table); assert(c->entries[i].offset != 0); diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index e5ab102c29..08af5c5995 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -324,7 +324,7 @@ static int l2_allocate(BlockDriverState *bs, int l1_index, uint64_t **table) BLKDBG_EVENT(bs->file, BLKDBG_L2_ALLOC_WRITE);
trace_qcow2_l2_allocate_write_l2(bs, l1_index); -qcow2_cache_entry_mark_dirty(bs, s->l2_table_cache, l2_table); +qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table); ret = qcow2_cache_flush(bs, s->l2_table_cache); if (ret < 0) { goto fail; @@ -765,7 +765,7 @@ uint64_t qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs, /* compressed clusters never have the copied flag */ BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE_COMPRESSED); -qcow2_cache_entry_mark_dirty(bs, s->l2_table_cache, l2_table); +qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table); l2_table[l2_index] = cpu_to_be64(cluster_offset); qcow2_cache_put(bs, s->l2_table_cache, (void **) _table); @@ -937,7 +937,7 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m) if (ret < 0) { goto err; } -qcow2_cache_entry_mark_dirty(bs, s->l2_table_cache, l2_table); +qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table); assert(l2_index + m->nb_clusters <= s->l2_size); for (i = 0; i < m->nb_clusters; i++) { @@ -1678,7 +1678,7 @@ static int discard_single_l2(BlockDriverState *bs, uint64_t offset, } /* First remove L2 entries */ -qcow2_cache_entry_mark_dirty(bs, s->l2_table_cache, l2_table); +qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table); if (!full_discard && s->qcow_version >= 3) { l2_table[l2_index + i] = cpu_to_be64(QCOW_OFLAG_ZERO); } else { @@ -1774,7 +1774,7 @@ static int zero_single_l2(BlockDriverState *bs, uint64_t offset, continue; } -qcow2_cache_entry_mark_dirty(bs, s->l2_table_cache, l2_table); +qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table); if (cluster_type == QCOW2_CLUSTER_COMPRESSED || unmap) { l2_table[l2_index + i] = cpu_to_be64(QCOW_OFLAG_ZERO); qcow2_free_any_clusters(bs, old_offset, 1, QCOW2_DISCARD_REQUEST); @@ -1983,7 +1983,7 @@ static int expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table, if (is_active_l1) { if (l2_dirty) { -qcow2_cache_entry_mark_dirty(bs, s->l2_table_cache, l2_table); 
+qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table); qcow2_cache_depends_on_flush(s->l2_table_cache); } qcow2_cache_put(bs, s->l2_table_cache, (void **) &l2_table); diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c index 92701ab7af..5434e7d4c8 100644 --- a/block/qcow2-refcount.c +++ b/block/qcow2-refcount.c @@ -421,7 +421,7 @@ static int alloc_refcount_block(BlockDriverState *bs, /* Now the new refcount block needs to be written to disk */ BLKDBG_EVENT(bs->file,
[Qemu-block] [PULL 12/55] gluster: Add preallocated truncation
From: Max Reitz

By using qemu_gluster_do_truncate() in qemu_gluster_truncate(), we now automatically have preallocated truncation. Signed-off-by: Max Reitz Reviewed-by: Eric Blake Signed-off-by: Kevin Wolf --- block/gluster.c | 17 + 1 file changed, 1 insertion(+), 16 deletions(-) diff --git a/block/gluster.c b/block/gluster.c index 806b894bc8..3f17b7819d 100644 --- a/block/gluster.c +++ b/block/gluster.c @@ -1122,23 +1122,8 @@ static coroutine_fn int qemu_gluster_co_rw(BlockDriverState *bs, static int qemu_gluster_truncate(BlockDriverState *bs, int64_t offset, PreallocMode prealloc, Error **errp) { -int ret; BDRVGlusterState *s = bs->opaque; - -if (prealloc != PREALLOC_MODE_OFF) { -error_setg(errp, "Unsupported preallocation mode '%s'", - PreallocMode_str(prealloc)); -return -ENOTSUP; -} - -ret = glfs_ftruncate(s->fd, offset); -if (ret < 0) { -ret = -errno; -error_setg_errno(errp, -ret, "Failed to truncate file"); -return ret; -} - -return 0; +return qemu_gluster_do_truncate(s->fd, offset, prealloc, errp); } static coroutine_fn int qemu_gluster_co_readv(BlockDriverState *bs, -- 2.13.6
[Qemu-block] [PULL 17/55] qcow2: Fix documentation of get_cluster_table()
From: Alberto Garcia

This function has not been returning the offset of the L2 table since commit 3948d1d4876065160583e79533bf604481063833 Signed-off-by: Alberto Garcia Reviewed-by: Eric Blake Reviewed-by: Max Reitz Message-id: b498733b6706a859a03678d74ecbd26aeba129aa.1517840876.git.be...@igalia.com Signed-off-by: Max Reitz --- block/qcow2-cluster.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index f077cd3ac5..e5ab102c29 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -647,8 +647,7 @@ fail: * for a given disk offset, load (and allocate if needed) * the l2 table. * - * the l2 table offset in the qcow2 file and the cluster index - * in the l2 table are given to the caller. + * the cluster index in the l2 table is given to the caller. * * Returns 0 on success, -errno in failure case */ -- 2.13.6
[Qemu-block] [PULL 13/55] sheepdog: Make sd_prealloc() take a BDS
From: Max Reitz

We want to use this function in sd_truncate() later on, so taking a filename is not exactly ideal. Signed-off-by: Max Reitz Reviewed-by: Eric Blake Signed-off-by: Kevin Wolf --- block/sheepdog.c | 29 + 1 file changed, 21 insertions(+), 8 deletions(-) diff --git a/block/sheepdog.c b/block/sheepdog.c index af125a2c8d..cc1d37b3da 100644 --- a/block/sheepdog.c +++ b/block/sheepdog.c @@ -1826,10 +1826,10 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t *vdi_id, int snapshot, return 0; } -static int sd_prealloc(const char *filename, Error **errp) +static int sd_prealloc(BlockDriverState *bs, Error **errp) { BlockBackend *blk = NULL; -BDRVSheepdogState *base = NULL; +BDRVSheepdogState *base = bs->opaque; unsigned long buf_size; uint32_t idx, max_idx; uint32_t object_size; @@ -1837,10 +1837,11 @@ static int sd_prealloc(const char *filename, Error **errp) void *buf = NULL; int ret; -blk = blk_new_open(filename, NULL, NULL, - BDRV_O_RDWR | BDRV_O_RESIZE | BDRV_O_PROTOCOL, errp); -if (blk == NULL) { -ret = -EIO; +blk = blk_new(BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE | BLK_PERM_RESIZE, + BLK_PERM_ALL); + +ret = blk_insert_bs(blk, bs, errp); +if (ret < 0) { goto out_with_err_set; } @@ -1852,7 +1853,6 @@ static int sd_prealloc(const char *filename, Error **errp) goto out; } -base = blk_bs(blk)->opaque; object_size = (UINT32_C(1) << base->inode.block_size_shift); buf_size = MIN(object_size, SD_DATA_OBJ_SIZE); buf = g_malloc0(buf_size); @@ -2108,7 +2108,20 @@ static int sd_create(const char *filename, QemuOpts *opts, } if (prealloc) { -ret = sd_prealloc(filename, errp); +BlockDriverState *bs; +QDict *opts; + +opts = qdict_new(); +qdict_put_str(opts, "driver", "sheepdog"); +bs = bdrv_open(filename, NULL, opts, BDRV_O_PROTOCOL | BDRV_O_RDWR, + errp); +if (!bs) { +goto out; +} + +ret = sd_prealloc(bs, errp); + +bdrv_unref(bs); } out: g_free(backing_file); -- 2.13.6
[Qemu-block] [PULL 04/55] docs: Document share-rw property more thoroughly
From: Fam Zheng

Suggested-by: Stefan Hajnoczi Signed-off-by: Fam Zheng Reviewed-by: Stefan Hajnoczi Reviewed-by: Kashyap Chamarthy Signed-off-by: Kevin Wolf --- docs/qemu-block-drivers.texi | 10 ++ 1 file changed, 10 insertions(+) diff --git a/docs/qemu-block-drivers.texi b/docs/qemu-block-drivers.texi index cd74767ed3..f1793692bb 100644 --- a/docs/qemu-block-drivers.texi +++ b/docs/qemu-block-drivers.texi @@ -845,6 +845,16 @@ QEMU transparently handles lock handover during shared storage migration. For shared virtual disk images between multiple VMs, the "share-rw" device option should be used. +By default, the guest has exclusive write access to its disk image. If the +guest can safely share the disk image with other writers the @code{-device +...,share-rw=on} parameter can be used. This is only safe if the guest is +running software, such as a cluster file system, that coordinates disk accesses +to avoid corruption. + +Note that share-rw=on only declares the guest's ability to share the disk. +Some QEMU features, such as image file formats, require exclusive write access +to the disk image and this is unaffected by the share-rw=on option. + Alternatively, locking can be fully disabled by "locking=off" block device option. In the command line, the option is usually in the form of "file.locking=off" as the protocol driver is normally placed as a "file" child -- 2.13.6
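As a concrete illustration of the documentation being added above, a command line for one of two guests sharing the same disk might look like this (the image path, drive ID, and device model are made-up examples; share-rw=on is the property being documented):

```shell
# Guest A (guest B would use the same -drive/-device pair).
# Only safe because the guests coordinate access, e.g. via a
# cluster file system, as the caveat in the docs explains.
qemu-system-x86_64 \
  -drive file=/srv/shared.img,format=raw,if=none,id=disk0 \
  -device virtio-blk-pci,drive=disk0,share-rw=on

# Without share-rw=on, the second VM to open the image would be
# rejected by QEMU's image locking.
```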
[Qemu-block] [PULL 00/55] Block layer patches
The following changes since commit fb68096da3d35e64c88cd610c1fa42766c58e92a: Revert "tests: use memfd in vhost-user-test" (2018-02-13 09:51:52 +) are available in the git repository at: git://repo.or.cz/qemu/kevin.git tags/for-upstream for you to fetch changes up to 0a4dc980e6c935e9be745ce3ee1a4c71629ecd00: Merge remote-tracking branch 'mreitz/tags/pull-block-2018-02-13' into queue-block (2018-02-13 17:01:13 +0100) Block layer patches Alberto Garcia (40): qcow2: Use g_try_realloc() in qcow2_expand_zero_clusters() qcow2: Fix documentation of get_cluster_table() qcow2: Add table size field to Qcow2Cache qcow2: Remove BDS parameter from qcow2_cache_get_table_addr() qcow2: Remove BDS parameter from qcow2_cache_get_table_idx() qcow2: Remove BDS parameter from qcow2_cache_table_release() qcow2: Remove BDS parameter from qcow2_cache_entry_mark_dirty() qcow2: Remove BDS parameter from qcow2_cache_put() qcow2: Remove BDS parameter from qcow2_cache_destroy() qcow2: Remove BDS parameter from qcow2_cache_clean_unused() qcow2: Remove BDS parameter from qcow2_cache_discard() qcow2: Remove BDS parameter from qcow2_cache_is_table_offset() qcow2: Add offset_to_l1_index() qcow2: Add l2_slice_size field to BDRVQcow2State qcow2: Add offset_to_l2_slice_index() qcow2: Update l2_load() to support L2 slices qcow2: Prepare l2_allocate() for adding L2 slice support qcow2: Update l2_allocate() to support L2 slices qcow2: Refactor get_cluster_table() qcow2: Update get_cluster_table() to support L2 slices qcow2: Update qcow2_get_cluster_offset() to support L2 slices qcow2: Update qcow2_alloc_cluster_link_l2() to support L2 slices qcow2: Update handle_copied() to support L2 slices qcow2: Update handle_alloc() to support L2 slices qcow2: Update discard_single_l2() to support L2 slices qcow2: Update zero_single_l2() to support L2 slices qcow2: Prepare qcow2_update_snapshot_refcount() for adding L2 slice support qcow2: Update qcow2_update_snapshot_refcount() to support L2 slices qcow2: Read 
refcount before L2 table in expand_zero_clusters_in_l1() qcow2: Prepare expand_zero_clusters_in_l1() for adding L2 slice support qcow2: Update expand_zero_clusters_in_l1() to support L2 slices qcow2: Update qcow2_truncate() to support L2 slices qcow2: Rename l2_table in qcow2_alloc_compressed_cluster_offset() qcow2: Rename l2_table in count_contiguous_clusters() qcow2: Rename l2_table in count_contiguous_clusters_unallocated() qcow2: Rename l2_table in count_cow_clusters() qcow2: Allow configuring the L2 slice size iotests: Test valid values of l2-cache-entry-size iotests: Test downgrading an image using a small L2 slice size iotests: Add l2-cache-entry-size to iotest 137 Daniel P. Berrangé (1): qemu-io: fix EOF Ctrl-D handling in qemu-io readline code Fam Zheng (4): iotests: Fix CID for VMDK afl image qemu-img.texi: Clean up parameter list qemu-img: Document --force-share / -U docs: Document share-rw property more thoroughly Kevin Wolf (1): Merge remote-tracking branch 'mreitz/tags/pull-block-2018-02-13' into queue-block Max Reitz (8): iotests: Use virtio-blk in 155 gluster: Move glfs_close() to create's clean-up gluster: Pull truncation from qemu_gluster_create gluster: Query current size in do_truncate() gluster: Add preallocated truncation sheepdog: Make sd_prealloc() take a BDS sheepdog: Pass old and new size to sd_prealloc() sheepdog: Allow fully preallocated truncation Paolo Bonzini (1): block: early check for blockers on drive-mirror Vladimir Sementsov-Ogievskiy (1): block: maintain persistent disabled bitmaps qapi/block-core.json | 12 +- block/qcow2.h | 33 +- include/block/dirty-bitmap.h | 1 - block/dirty-bitmap.c | 18 - block/gluster.c| 116 +++--- block/qcow2-bitmap.c | 12 +- block/qcow2-cache.c| 80 ++-- block/qcow2-cluster.c | 519 + block/qcow2-refcount.c | 206 +- block/qcow2.c | 63 ++- block/sheepdog.c | 56 ++- blockdev.c | 15 +- qemu-io.c | 27 +- docs/qemu-block-drivers.texi | 10 + qemu-doc.texi | 7 + qemu-img.texi | 74 ++--
[Qemu-block] [PULL 18/55] qcow2: Add table size field to Qcow2Cache
From: Alberto Garcia

The table size in the qcow2 cache is currently equal to the cluster
size. This doesn't allow us to use the cache memory efficiently,
particularly with large cluster sizes, so we need to be able to have
smaller cache tables that are independent from the cluster size. This
patch adds a new field to Qcow2Cache that we can use instead of the
cluster size.

The current table size is still being initialized to the cluster size,
so there are no semantic changes yet, but this patch will allow us to
prepare the rest of the code and simplify a few function calls.

Signed-off-by: Alberto Garcia
Reviewed-by: Eric Blake
Reviewed-by: Max Reitz
Message-id: 67a1bf9e55f417005c567bead95a018dc34bc687.1517840876.git.be...@igalia.com
Signed-off-by: Max Reitz
---
 block/qcow2-cache.c | 29 ++---
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index c48ffebd8f..38c03770b4 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -39,6 +39,7 @@ struct Qcow2Cache {
     Qcow2CachedTable       *entries;
     struct Qcow2Cache      *depends;
     int                     size;
+    int                     table_size;
     bool                    depends_on_flush;
     void                   *table_array;
     uint64_t                lru_counter;
@@ -48,17 +49,15 @@ struct Qcow2Cache {
 static inline void *qcow2_cache_get_table_addr(BlockDriverState *bs,
                     Qcow2Cache *c, int table)
 {
-    BDRVQcow2State *s = bs->opaque;
-    return (uint8_t *) c->table_array + (size_t) table * s->cluster_size;
+    return (uint8_t *) c->table_array + (size_t) table * c->table_size;
 }

 static inline int qcow2_cache_get_table_idx(BlockDriverState *bs,
                   Qcow2Cache *c, void *table)
 {
-    BDRVQcow2State *s = bs->opaque;
     ptrdiff_t table_offset = (uint8_t *) table - (uint8_t *) c->table_array;
-    int idx = table_offset / s->cluster_size;
-    assert(idx >= 0 && idx < c->size && table_offset % s->cluster_size == 0);
+    int idx = table_offset / c->table_size;
+    assert(idx >= 0 && idx < c->size && table_offset % c->table_size == 0);
     return idx;
 }

@@ -79,10 +78,9 @@ static void qcow2_cache_table_release(BlockDriverState *bs, Qcow2Cache *c,
 {
 /* Using MADV_DONTNEED to discard memory is a Linux-specific feature */
 #ifdef CONFIG_LINUX
-    BDRVQcow2State *s = bs->opaque;
     void *t = qcow2_cache_get_table_addr(bs, c, i);
     int align = getpagesize();
-    size_t mem_size = (size_t) s->cluster_size * num_tables;
+    size_t mem_size = (size_t) c->table_size * num_tables;
     size_t offset = QEMU_ALIGN_UP((uintptr_t) t, align) - (uintptr_t) t;
     size_t length = QEMU_ALIGN_DOWN(mem_size - offset, align);
     if (mem_size > offset && length > 0) {
@@ -132,9 +130,10 @@ Qcow2Cache *qcow2_cache_create(BlockDriverState *bs, int num_tables)

     c = g_new0(Qcow2Cache, 1);
     c->size = num_tables;
+    c->table_size = s->cluster_size;
     c->entries = g_try_new0(Qcow2CachedTable, num_tables);
     c->table_array = qemu_try_blockalign(bs->file->bs,
-                                         (size_t) num_tables * s->cluster_size);
+                                         (size_t) num_tables * c->table_size);

     if (!c->entries || !c->table_array) {
         qemu_vfree(c->table_array);
@@ -203,13 +202,13 @@ static int qcow2_cache_entry_flush(BlockDriverState *bs, Qcow2Cache *c, int i)
     if (c == s->refcount_block_cache) {
         ret = qcow2_pre_write_overlap_check(bs, QCOW2_OL_REFCOUNT_BLOCK,
-                c->entries[i].offset, s->cluster_size);
+                c->entries[i].offset, c->table_size);
     } else if (c == s->l2_table_cache) {
         ret = qcow2_pre_write_overlap_check(bs, QCOW2_OL_ACTIVE_L2,
-                c->entries[i].offset, s->cluster_size);
+                c->entries[i].offset, c->table_size);
     } else {
         ret = qcow2_pre_write_overlap_check(bs, 0,
-                c->entries[i].offset, s->cluster_size);
+                c->entries[i].offset, c->table_size);
     }

     if (ret < 0) {
@@ -223,7 +222,7 @@ static int qcow2_cache_entry_flush(BlockDriverState *bs, Qcow2Cache *c, int i)
     }

     ret = bdrv_pwrite(bs->file, c->entries[i].offset,
-                      qcow2_cache_get_table_addr(bs, c, i), s->cluster_size);
+                      qcow2_cache_get_table_addr(bs, c, i), c->table_size);
     if (ret < 0) {
         return ret;
     }
@@ -331,7 +330,7 @@ static int qcow2_cache_do_get(BlockDriverState *bs, Qcow2Cache *c,
     trace_qcow2_cache_get(qemu_coroutine_self(), c == s->l2_table_cache,
                           offset, read_from_disk);

-    if (offset_into_cluster(s, offset)) {
+    if (!QEMU_IS_ALIGNED(offset, c->table_size)) {
         qcow2_signal_corruption(bs, true, -1, -1, "Cannot get entry from %s "
[Qemu-block] [PULL 02/55] qemu-img.texi: Clean up parameter list
From: Fam Zheng

Split options out of the "@table @var" section and create a
"@table @option", then use whitespace and blank lines consistently.

Suggested-by: Kevin Wolf
Signed-off-by: Fam Zheng
Reviewed-by: Stefan Hajnoczi
Reviewed-by: Kashyap Chamarthy
Signed-off-by: Kevin Wolf
---
 qemu-img.texi | 66 +++
 1 file changed, 39 insertions(+), 27 deletions(-)

diff --git a/qemu-img.texi b/qemu-img.texi
index fdcf120f36..60a0e080c6 100644
--- a/qemu-img.texi
+++ b/qemu-img.texi
@@ -33,38 +33,14 @@ The following commands are supported:

 Command parameters:
 @table @var
-@item filename
- is a disk image filename
-
-@item --object @var{objectdef}
-
-is a QEMU user creatable object definition. See the @code{qemu(1)} manual
-page for a description of the object properties. The most common object
-type is a @code{secret}, which is used to supply passwords and/or encryption
-keys.
-
-@item --image-opts
-
-Indicates that the source @var(unknown) parameter is to be interpreted as a
-full option string, not a plain filename. This parameter is mutually
-exclusive with the @var{-f} parameter.
-
-@item --target-image-opts
-Indicates that the @var{output_filename} parameter(s) are to be interpreted as
-a full option string, not a plain filename. This parameter is mutually
-exclusive with the @var{-O} parameters. It is currently required to also use
-the @var{-n} parameter to skip image creation. This restriction may be relaxed
-in a future release.
+@item filename
+is a disk image filename

 @item fmt
 is the disk image format. It is guessed automatically in most cases. See below
 for a description of the supported disk formats.

-@item --backing-chain
-will enumerate information about backing files in a disk image chain. Refer
-below for further description.
-
 @item size
 is the disk image size in bytes. Optional suffixes @code{k} or @code{K}
 (kilobyte, 1024) @code{M} (megabyte, 1024k) and @code{G} (gigabyte, 1024M)
@@ -74,42 +50,78 @@ and T (terabyte, 1024G) are supported. @code{b} is ignored.

 @item output_filename
 is the destination disk image filename

 @item output_fmt
- is the destination format
+is the destination format
+
 @item options
 is a comma separated list of format specific options in a
 name=value format. Use @code{-o ?} for an overview of the options supported
 by the used format or see the format descriptions below for details.
+
 @item snapshot_param
 is param used for internal snapshot, format is
 'snapshot.id=[ID],snapshot.name=[NAME]' or '[ID_OR_NAME]'
+
 @item snapshot_id_or_name
 is deprecated, use snapshot_param instead

+@end table
+
+@table @option
+
+@item --object @var{objectdef}
+is a QEMU user creatable object definition. See the @code{qemu(1)} manual
+page for a description of the object properties. The most common object
+type is a @code{secret}, which is used to supply passwords and/or encryption
+keys.
+
+@item --image-opts
+Indicates that the source @var(unknown) parameter is to be interpreted as a
+full option string, not a plain filename. This parameter is mutually
+exclusive with the @var{-f} parameter.
+
+@item --target-image-opts
+Indicates that the @var{output_filename} parameter(s) are to be interpreted as
+a full option string, not a plain filename. This parameter is mutually
+exclusive with the @var{-O} parameters. It is currently required to also use
+the @var{-n} parameter to skip image creation. This restriction may be relaxed
+in a future release.
+
+@item --backing-chain
+will enumerate information about backing files in a disk image chain. Refer
+below for further description.
+
 @item -c
 indicates that target image must be compressed (qcow format only)
+
 @item -h
 with or without a command shows help and lists the supported formats
+
 @item -p
 display progress bar (compare, convert and rebase commands only).
 If the @var{-p} option is not used for a command that supports it, the
 progress is reported when the process receives a @code{SIGUSR1} or
 @code{SIGINFO} signal.
+
 @item -q
 Quiet mode - do not print any output (except errors). There's no progress bar
 in case both @var{-q} and @var{-p} options are used.
+
 @item -S @var{size}
 indicates the consecutive number of bytes that must contain only zeros
 for qemu-img to create a sparse image during conversion. This value is rounded
 down to the nearest 512 bytes. You may use the common size suffixes like
 @code{k} for kilobytes.
+
 @item -t @var{cache}
 specifies the cache mode that should be used with the (destination) file. See
 the documentation of the emulator's @code{-drive cache=...} option for allowed
 values.
+
 @item -T @var{src_cache}
 specifies the cache mode that should be used with the source file(s). See
 the documentation of the emulator's @code{-drive cache=...} option for allowed
 values.
+
 @end table

 Parameters to snapshot
[Qemu-block] [PULL 07/55] iotests: Use virtio-blk in 155
From: Max Reitz

Only a few select machine types support floppy drives and there is
actually nothing preventing us from using virtio here, so let's do it.

Reported-by: Christian Borntraeger
Signed-off-by: Max Reitz
Tested-by: Christian Borntraeger
Signed-off-by: Kevin Wolf
---
 tests/qemu-iotests/155 | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/tests/qemu-iotests/155 b/tests/qemu-iotests/155
index fc9fa975be..42dae04c83 100755
--- a/tests/qemu-iotests/155
+++ b/tests/qemu-iotests/155
@@ -64,7 +64,7 @@ class BaseClass(iotests.QMPTestCase):
                      'file': {'driver': 'file',
                               'filename': source_img}}
         self.vm.add_blockdev(self.qmp_to_opts(blockdev))
-        self.vm.add_device('floppy,id=qdev0,drive=source')
+        self.vm.add_device('virtio-blk,id=qdev0,drive=source')
         self.vm.launch()

         self.assertIntactSourceBackingChain()
@@ -173,21 +173,24 @@ class MirrorBaseClass(BaseClass):
     def testFull(self):
         self.runMirror('full')

-        node = self.findBlockNode('target', 'qdev0')
+        node = self.findBlockNode('target',
+                                  '/machine/peripheral/qdev0/virtio-backend')
         self.assertCorrectBackingImage(node, None)
         self.assertIntactSourceBackingChain()

     def testTop(self):
         self.runMirror('top')

-        node = self.findBlockNode('target', 'qdev0')
+        node = self.findBlockNode('target',
+                                  '/machine/peripheral/qdev0/virtio-backend')
         self.assertCorrectBackingImage(node, back2_img)
         self.assertIntactSourceBackingChain()

     def testNone(self):
         self.runMirror('none')

-        node = self.findBlockNode('target', 'qdev0')
+        node = self.findBlockNode('target',
+                                  '/machine/peripheral/qdev0/virtio-backend')
         self.assertCorrectBackingImage(node, source_img)
         self.assertIntactSourceBackingChain()

@@ -239,7 +242,8 @@ class TestCommit(BaseClass):

         self.vm.event_wait('BLOCK_JOB_COMPLETED')

-        node = self.findBlockNode(None, 'qdev0')
+        node = self.findBlockNode(None,
+                                  '/machine/peripheral/qdev0/virtio-backend')
         self.assert_qmp(node, 'image' + '/backing-image' * 0 + '/filename',
                         back1_img)
         self.assert_qmp(node, 'image' + '/backing-image' * 1 + '/filename',
-- 
2.13.6
[Qemu-block] [PULL 10/55] gluster: Pull truncation from qemu_gluster_create
From: Max Reitz

Pull out the truncation code from the qemu_gluster_create() function so
we can later reuse it in qemu_gluster_truncate().

Signed-off-by: Max Reitz
Reviewed-by: Eric Blake
Signed-off-by: Kevin Wolf
---
 block/gluster.c | 74 +++--
 1 file changed, 40 insertions(+), 34 deletions(-)

diff --git a/block/gluster.c b/block/gluster.c
index 7fab2dfa12..8178541416 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -965,6 +965,45 @@ static coroutine_fn int qemu_gluster_co_pwrite_zeroes(BlockDriverState *bs,
 }
 #endif

+static int qemu_gluster_do_truncate(struct glfs_fd *fd, int64_t offset,
+                                    PreallocMode prealloc, Error **errp)
+{
+    switch (prealloc) {
+#ifdef CONFIG_GLUSTERFS_FALLOCATE
+    case PREALLOC_MODE_FALLOC:
+        if (glfs_fallocate(fd, 0, 0, offset)) {
+            error_setg_errno(errp, errno, "Could not preallocate data");
+            return -errno;
+        }
+        break;
+#endif /* CONFIG_GLUSTERFS_FALLOCATE */
+#ifdef CONFIG_GLUSTERFS_ZEROFILL
+    case PREALLOC_MODE_FULL:
+        if (glfs_ftruncate(fd, offset)) {
+            error_setg_errno(errp, errno, "Could not resize file");
+            return -errno;
+        }
+        if (glfs_zerofill(fd, 0, offset)) {
+            error_setg_errno(errp, errno, "Could not zerofill the new area");
+            return -errno;
+        }
+        break;
+#endif /* CONFIG_GLUSTERFS_ZEROFILL */
+    case PREALLOC_MODE_OFF:
+        if (glfs_ftruncate(fd, offset)) {
+            error_setg_errno(errp, errno, "Could not resize file");
+            return -errno;
+        }
+        break;
+    default:
+        error_setg(errp, "Unsupported preallocation mode: %s",
+                   PreallocMode_str(prealloc));
+        return -EINVAL;
+    }
+
+    return 0;
+}
+
 static int qemu_gluster_create(const char *filename,
                                QemuOpts *opts,
                                Error **errp)
 {
@@ -1019,40 +1058,7 @@ static int qemu_gluster_create(const char *filename,
         goto out;
     }

-    switch (prealloc) {
-#ifdef CONFIG_GLUSTERFS_FALLOCATE
-    case PREALLOC_MODE_FALLOC:
-        if (glfs_fallocate(fd, 0, 0, total_size)) {
-            error_setg(errp, "Could not preallocate data for the new file");
-            ret = -errno;
-        }
-        break;
-#endif /* CONFIG_GLUSTERFS_FALLOCATE */
-#ifdef CONFIG_GLUSTERFS_ZEROFILL
-    case PREALLOC_MODE_FULL:
-        if (!glfs_ftruncate(fd, total_size)) {
-            if (glfs_zerofill(fd, 0, total_size)) {
-                error_setg(errp, "Could not zerofill the new file");
-                ret = -errno;
-            }
-        } else {
-            error_setg(errp, "Could not resize file");
-            ret = -errno;
-        }
-        break;
-#endif /* CONFIG_GLUSTERFS_ZEROFILL */
-    case PREALLOC_MODE_OFF:
-        if (glfs_ftruncate(fd, total_size) != 0) {
-            ret = -errno;
-            error_setg(errp, "Could not resize file");
-        }
-        break;
-    default:
-        ret = -EINVAL;
-        error_setg(errp, "Unsupported preallocation mode: %s",
-                   PreallocMode_str(prealloc));
-        break;
-    }
+    ret = qemu_gluster_do_truncate(fd, total_size, prealloc, errp);

 out:
     if (fd) {
-- 
2.13.6
[Qemu-block] [PULL 06/55] block: early check for blockers on drive-mirror
From: Paolo Bonzini

Even if an op blocker is present for BLOCK_OP_TYPE_MIRROR_SOURCE, it is
checked a bit late and the result is that the target is created even if
drive-mirror subsequently fails. Add an early check to avoid this.

Signed-off-by: Paolo Bonzini
Reviewed-by: Fam Zheng
Reviewed-by: Alberto Garcia
Signed-off-by: Kevin Wolf
---
 blockdev.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/blockdev.c b/blockdev.c
index bdbdeae7e4..7423c5317b 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3569,6 +3569,11 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
         return;
     }

+    /* Early check to avoid creating target */
+    if (bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_MIRROR_SOURCE, errp)) {
+        return;
+    }
+
     aio_context = bdrv_get_aio_context(bs);
     aio_context_acquire(aio_context);
-- 
2.13.6
[Qemu-block] [PULL 01/55] iotests: Fix CID for VMDK afl image
From: Fam Zheng

This reverts commit 76bf133c4, which updated the reference output, and
fixes the reference image, because the code path we want to exercise is
actually the invalid image size.

The descriptor block in the image, which includes the CID to verify,
has been invalid since the reference image was added. Since commit
9877860e7bd we report this error earlier than the "file too large", so
059.out mismatches.

The binary change is generated along the operations of:

    $ bunzip2 afl9.vmdk.bz2
    $ qemu-img create -f vmdk fix.vmdk 1G
    $ dd if=afl9.vmdk of=fix.vmdk bs=512 count=1 conv=notrunc
    $ mv fix.vmdk afl9.vmdk
    $ bzip2 afl9.vmdk

Signed-off-by: Fam Zheng
Reviewed-by: Eric Blake
Signed-off-by: Kevin Wolf
---
 tests/qemu-iotests/059.out                     |   2 +-
 tests/qemu-iotests/sample_images/afl9.vmdk.bz2 | Bin 178 -> 618 bytes
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/059.out b/tests/qemu-iotests/059.out
index 1ac5d56233..f6dce7947c 100644
--- a/tests/qemu-iotests/059.out
+++ b/tests/qemu-iotests/059.out
@@ -2358,5 +2358,5 @@ Offset Length Mapped to File
 0x14000 0x1 0x5 TEST_DIR/t-s003.vmdk
 === Testing afl image with a very large capacity ===
-qemu-img: Could not open 'TEST_DIR/afl9.IMGFMT': Could not open 'TEST_DIR/afl9.IMGFMT': Invalid argument
+qemu-img: Can't get image size 'TEST_DIR/afl9.IMGFMT': File too large
 *** done
diff --git a/tests/qemu-iotests/sample_images/afl9.vmdk.bz2 b/tests/qemu-iotests/sample_images/afl9.vmdk.bz2
index 03615d36a12425cf4240bab86f4cfe648db14572..9fcd0af45a815431acf4689e0845ecf2d333cd58 100644
GIT binary patch
literal 618
zcmV-w0+szjT4*^jL0KkKSvgW7ssIN3|NsBH-Q9UpfAhclU70`s-*NE~5QvC~h=_=Y
zh>D2n*q*=vygR634445h35k;?00h9835kMW4$iPepVE{Bqk)uhJ^wfGLr=)3s
zhM5CR88jLh7)B;cA*K)*6GmuECPU3o4NWG5O#pg>Ak#xY8Z^CrMt}oD38Ns$
z02n}M0LdjZ&}cLPqd+nPKmn$j0iXe(02%-d27nnJriN-uE+X@Bj4BBfd|yV!NB
zwqkL}nW3AI5x^jp=t%^F1pxqp)v#n#)j$zcm1xqv(!$2d*5%vF{5RPWnOV8-^tE<(
zU~%&}Y0uNu*9Wt=yS^8PkC%IG;aD{l#sG`
m4Ho*fsHXdM
Re: [Qemu-block] qcow2 images corruption
On 13/02/2018 at 16:26, Nicolas Ecarnot wrote:
>> It would be good if you could make the 'qemu-img check' output available
>> somewhere.
> I found this: https://github.com/ShijunDeng/qcow2-dump and the transcript
> (beautiful colors when viewed with "more") is attached:

-- 
Nicolas ECARNOT

Script started on Tue 13 Feb 2018 17:31:05 CET

root@serv-hv-adm13:/home# /root/qcow2-dump -m check serv-term-adm4-corr.qcow2.img

File: serv-term-adm4-corr.qcow2.img

magic: 0x514649fb
version: 2
backing_file_offset: 0x0
backing_file_size: 0
fs_type: xfs
virtual_size: 64424509440 / 61440M / 60G
disk_size: 36507222016 / 34816M / 34G
seek_end: 36507222016 [0x88000] / 34816M / 34G
cluster_bits: 16
cluster_size: 65536
crypt_method: 0
csize_shift: 54
csize_mask: 255
cluster_offset_mask: 0x3f
l1_table_offset: 0x76a46
l1_size: 120
l1_vm_state_index: 120
l2_size: 8192
refcount_order: 4
refcount_bits: 16
refcount_block_bits: 15
refcount_block_size: 32768
refcount_table_offset: 0x1
refcount_table_clusters: 1
snapshots_offset: 0x0
nb_snapshots: 0
incompatible_features:
compatible_features:
autoclear_features:

Active Snapshot:
L1 Table: [offset: 0x76a46, len: 120]
Result: L1 Table: unaligned: 0, invalid: 0, unused: 53, used: 67
        L2 Table: unaligned: 0, invalid: 0, unused: 20304, used: 528560

Refcount Table:
Refcount Table: [offset: 0x1, len: 8192]
Result: Refcount Table: unaligned: 0, invalid: 0, unused: 8175, used: 17
        Refcount: error: 4342, leak: 0, unused: 28426, used: 524288

COPIED OFLAG:
Result: L1 Table ERROR OFLAG_COPIED: 1
        L2 Table ERROR OFLAG_COPIED: 4323
        Active L2 COPIED: 528560 [34639708160 / 33035M / 32G]

Active Cluster:
Result: Active Cluster: reuse: 17

Summary:
preallocation: off
Active Cluster: reuse: 17
Refcount Table: unaligned: 0, invalid: 0, unused: 8175, used: 17
Refcount: error: 4342, leak: 0, rebuild: 4325, unused: 28426, used: 524288
L1 Table: unaligned: 0, invalid: 0, unused: 53, used: 67, oflag copied: 1
L2 Table: unaligned: 0, invalid: 0, unused: 20304, used: 528560, oflag copied: 4323

### qcow2 image has refcount errors! (=_=#) ###
### and qcow2 image has copied errors! (o_0)? ###
### Sadly: refcount error cause active cluster reused! Orz ###
### Please backup this image and contact the author! ###

root@serv-hv-adm13:/home# exit

Script done on Tue 13 Feb 2018 17:31:13 CET
Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI
On 02/13/2018 08:48 AM, Daniel P. Berrangé wrote:
>>> No, that's policy decision that doesn't matter from QMP pov. If the
>>> mgmt app wants the snapshot to be wrt to the initial time, it can
>>> simply invoke the "stop" QMP command before doing the live migration
>>> and "cont" afterwards.
>> That would be non-live. I think Roman means a live snapshot that saves
>> the state at the beginning of the operation. Basically the difference
>> between blockdev-backup (state at the beginning) and blockdev-mirror
>> (state at the end), except for a whole VM.
> That doesn't seem practical unless you can instantaneously write out
> the entire guest RAM to disk without blocking, or can somehow snapshot
> the RAM so you can write out a consistent view of the original RAM,
> while the guest continues to dirty RAM pages.

One idea for that is via fork()'s copy-on-write semantics; the parent
continues processing the guest, while the child writes out RAM pages.
Pages touched by the guest in the parent are now cleanly copied by the
OS so that the child can take all the time it wants, but still writes
out the state of the guest at the time of the fork(). It may be
possible to use userfaultfd to achieve the same effects without a
fork().

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI
On 13.02.2018 at 16:30, Daniel P. Berrangé wrote:
> On Tue, Feb 13, 2018 at 04:23:21PM +0100, Kevin Wolf wrote:
> > On 13.02.2018 at 15:58, Daniel P. Berrangé wrote:
> > > On Tue, Feb 13, 2018 at 03:43:10PM +0100, Kevin Wolf wrote:
> > > > On 13.02.2018 at 15:30, Roman Kagan wrote:
> > > > > On Tue, Feb 13, 2018 at 11:50:24AM +0100, Kevin Wolf wrote:
> > > > > > On 11.01.2018 at 14:04, Daniel P. Berrange wrote:
> > > > > > > Then you could just use the regular migrate QMP commands for
> > > > > > > loading and saving snapshots.
> > > > > >
> > > > > > Yes, you could. I think for a proper implementation you would
> > > > > > want to do better, though. Live migration provides just a stream,
> > > > > > but that's not really well suited for snapshots. When a RAM page
> > > > > > is dirtied, you just want to overwrite the old version of it in a
> > > > > > snapshot [...]
> > > > >
> > > > > This means the point in time where the guest state is snapshotted
> > > > > is not when the command is issued, but any unpredictable amount of
> > > > > time later.
> > > > >
> > > > > I'm not sure this is what a user expects.
> > > >
> > > > I don't think it's necessarily a big problem as long as you set the
> > > > expectations right, but good point anyway.
> > > >
> > > > > A better approach for the save part appears to be to stop the
> > > > > vcpus, dump the device state, resume the vcpus, and save the memory
> > > > > contents in the background, prioritizing the old copies of the
> > > > > pages that change.
> > > >
> > > > So basically you would let the guest fault whenever it writes to a
> > > > page that is not saved yet, and then save it first before you make
> > > > the page writable again? Essentially blockdev-backup, except for RAM.
> > >
> > > The page fault servicing will be delayed by however long it takes to
> > > write the page to underlying storage, which could be considerable with
> > > non-SSD. So guest performance could be significantly impacted on slow
> > > storage with a high dirtying rate. On the flip side it guarantees a
> > > live snapshot would complete in finite time, which is good.
> >
> > You can just use a bounce buffer for writing out the old page. Then the
> > VM is only stopped for the duration of a malloc() + memcpy().
>
> That would allow QEMU memory usage to balloon up to 2x RAM, if there was
> slow storage and a very fast dirtying rate. I don't think that's viable
> unless there was a cap on how much bounce buffering we would allow before
> just blocking the page faults.

Yes, you'd probably want to do this. But anyway, unless the guest is
under really heavy load, allowing just a few MB to be dirtied without
waiting for I/O will make the guest a lot more responsive immediately
after taking a snapshot.

Kevin
Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI
On 02/13/2018 06:27 PM, Dr. David Alan Gilbert wrote:
> * Roman Kagan (rka...@virtuozzo.com) wrote:
>> On Tue, Feb 13, 2018 at 03:05:03PM +, Dr. David Alan Gilbert wrote:
>>> * Denis V. Lunev (d...@virtuozzo.com) wrote:
>>>> On 02/13/2018 05:59 PM, Dr. David Alan Gilbert wrote:
>>>>> * Daniel P. Berrangé (berra...@redhat.com) wrote:
>>>>>> That doesn't seem practical unless you can instantaneously write out
>>>>>> the entire guest RAM to disk without blocking, or can somehow
>>>>>> snapshot the RAM so you can write out a consistent view of the
>>>>>> original RAM, while the guest continues to dirty RAM pages.
>>>>> People have suggested doing something like that with userfault write
>>>>> mode; but the same would also be doable just by write protecting the
>>>>> whole of RAM and then following the faults.
>>>> nope, userfault fd does not help :( We have tried, the functionality
>>>> is not enough. Better to have a small extension to KVM to protect all
>>>> memory and notify QEMU with the accessed address.
>>> Can you explain why? I thought the write-protect mode of userfaultfd
>>> was supposed to be able to do that; cc'ing in Andrea
>> IIRC it would if it worked; it just didn't when we tried.
> Hmm that doesn't seem to be the ideal reason to create new KVM
> functionality, especially since there were a bunch of people wanting the
> userfaultfd-write mode for other uses.

This does not seem to work in practice. The status has been the same
for more than a year and a half, AFAIR. Thus there is a lot of sense in
creating something feasible, as async snapshots are badly wanted.

Den
Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI
On Tue, Feb 13, 2018 at 04:23:21PM +0100, Kevin Wolf wrote:
> On 13.02.2018 at 15:58, Daniel P. Berrangé wrote:
> > On Tue, Feb 13, 2018 at 03:43:10PM +0100, Kevin Wolf wrote:
> > > On 13.02.2018 at 15:30, Roman Kagan wrote:
> > > > On Tue, Feb 13, 2018 at 11:50:24AM +0100, Kevin Wolf wrote:
> > > > > On 11.01.2018 at 14:04, Daniel P. Berrange wrote:
> > > > > > Then you could just use the regular migrate QMP commands for
> > > > > > loading and saving snapshots.
> > > > >
> > > > > Yes, you could. I think for a proper implementation you would want
> > > > > to do better, though. Live migration provides just a stream, but
> > > > > that's not really well suited for snapshots. When a RAM page is
> > > > > dirtied, you just want to overwrite the old version of it in a
> > > > > snapshot [...]
> > > >
> > > > This means the point in time where the guest state is snapshotted is
> > > > not when the command is issued, but any unpredictable amount of time
> > > > later.
> > > >
> > > > I'm not sure this is what a user expects.
> > >
> > > I don't think it's necessarily a big problem as long as you set the
> > > expectations right, but good point anyway.
> > >
> > > > A better approach for the save part appears to be to stop the vcpus,
> > > > dump the device state, resume the vcpus, and save the memory
> > > > contents in the background, prioritizing the old copies of the pages
> > > > that change.
> > >
> > > So basically you would let the guest fault whenever it writes to a
> > > page that is not saved yet, and then save it first before you make the
> > > page writable again? Essentially blockdev-backup, except for RAM.
> >
> > The page fault servicing will be delayed by however long it takes to
> > write the page to underlying storage, which could be considerable with
> > non-SSD. So guest performance could be significantly impacted on slow
> > storage with a high dirtying rate. On the flip side it guarantees a
> > live snapshot would complete in finite time, which is good.
>
> You can just use a bounce buffer for writing out the old page. Then the
> VM is only stopped for the duration of a malloc() + memcpy().

That would allow QEMU memory usage to balloon up to 2x RAM, if there was
slow storage and a very fast dirtying rate. I don't think that's viable
unless there was a cap on how much bounce buffering we would allow before
just blocking the page faults.

Regards,
Daniel
-- 
|: https://berrange.com      -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org       -o- https://fstop138.berrange.com           :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange    :|
Re: [Qemu-block] [PATCH 0/7] block: Preallocated truncation for gluster and sheepdog
On 13.02.2018 at 14:03, Max Reitz wrote:
> As far as I can see, these are the only protocols beside file-posix that
> support preallocated creation. In contrast to file-posix, however, they
> have not supported preallocated truncation so far. This series brings
> their truncation code to feature parity with their creation code in this
> regard.
>
> Note that I do not have a test setup for either of the two drivers, so I
> do not actually know whether this works. Anyone with a setup is more
> than welcome to test this series.

Thanks, applied to the block branch. We'll see if someone screams once
this hits master.

Kevin
Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI
* Roman Kagan (rka...@virtuozzo.com) wrote: > On Tue, Feb 13, 2018 at 03:05:03PM +, Dr. David Alan Gilbert wrote: > > * Denis V. Lunev (d...@virtuozzo.com) wrote: > > > On 02/13/2018 05:59 PM, Dr. David Alan Gilbert wrote: > > > > * Daniel P. Berrangé (berra...@redhat.com) wrote: > > > >> That doesn't seem practical unless you can instantaneously write out > > > >> the entire guest RAM to disk without blocking, or can somehow snapshot > > > >> the RAM so you can write out a consistent view of the original RAM, > > > >> while the guest continues to dirty RAM pages. > > > > People have suggested doing something like that with userfault write > > > > mode; but the same would also be doable just by write protecting the > > > > whole of RAM and then following the faults. > > > > > > nope, userfault fd does not help :( We have tried, the functionality is > > > not > > > enough. Better to have small extension to KVM to protect all memory > > > and notify QEMU with accessed address. > > > > Can you explain why? I thought the write-protect mode of userfaultfd was > > supposed to be able to do that; cc'ing in Andrea > > IIRC it would if it worked; it just didn't when we tried. Hmm that doesn't seem to be the ideal reason to create new KVM functionality, especially since there were a bunch of people wanting the userfaultfd-write mode for other uses. Dave > Roman. -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [Qemu-block] qcow2 images corruption
Hello Kevin, Le 13/02/2018 à 10:41, Kevin Wolf a écrit : Am 07.02.2018 um 18:06 hat Nicolas Ecarnot geschrieben: TL; DR : qcow2 images keep getting corrupted. Any workaround? Not without knowing the cause. Actually, my main concern is mostly about finding the cause rather than correcting my corrupted VMs. Another way to say it: I prefer to help oVirt than to help myself. The first thing to make sure is that the image isn't touched by a second process while QEMU is running a VM. Indeed, I read some BZ about this issue: they were raised by a user who ran some qemu-img commands on a "mounted" image, thus leading to some corruption. In my case, I'm not playing with this, and the corrupted VMs were only touched by classical oVirt actions. The classic one is using 'qemu-img snapshot' on the image of a running VM, which is instant corruption (and newer QEMU versions have locking in place to prevent this), but we have seen more absurd cases of things outside QEMU tampering with the image when we were investigating previous corruption reports. This covers the majority of all reports; we haven't had a real corruption caused by a QEMU bug in ages. May I ask in which QEMU version this kind of locking was added? As I wrote, our oVirt setup is 3.6, so not recent. After having found (https://access.redhat.com/solutions/1173623) the right logical volume hosting the qcow2 image, I can run qemu-img check on it.
- On 80% of my VMs, I find no errors.
- On 15% of them, I find Leaked cluster errors that I can correct using "qemu-img check -r all"
- On 5% of them, I find Leaked cluster errors and further fatal errors, which cannot be corrected with qemu-img.
In rare cases, qemu-img can correct them, but destroys large parts of the image (becomes unusable), and in other cases it cannot correct them at all. It would be good if you could make the 'qemu-img check' output available somewhere. See attachment. It would be even better if we could have a look at the respective image.
I seem to remember that John (CCed) had a few scripts to analyse corrupted qcow2 images, maybe we would be able to see something there. I just exported it like this: qemu-img convert /dev/the_correct_path /home/blablah.qcow2.img The resulting file is 32G and I need an idea to transfer this img to you. What I read similar to my case is:
- usage of qcow2
- heavy disk I/O
- using the virtio-blk driver
In the proxmox thread, they tend to say that using virtio-scsi is the solution. I asked this question to oVirt experts (https://lists.ovirt.org/pipermail/users/2018-February/086753.html) but it's not clear the driver is to blame. This seems very unlikely. The corruption you're seeing is in the qcow2 metadata, not only in the guest data. Are you saying:
- the corruption is in the metadata and in the guest data, OR
- the corruption is only in the metadata?
If anything, virtio-scsi exercises more qcow2 code paths than virtio-blk, so any potential bug that affects virtio-blk should also affect virtio-scsi, but not the other way around. I get that. I agree with the answer Yaniv Kaul gave to me, saying I have to properly report the issue, so I'm longing to know which particular information I can give you now. To be honest, debugging corruption after the fact is pretty hard. We'd need the 'qemu-img check' output Done. and ideally the image to do anything, I remember some Red Hat people once gave me temporary access to put a heavy file on some dedicated server. Is it still possible? but I can't promise that anything would come out of this. Best would be a reproducer, or at least some operation that you can link to the appearance of the corruption. Then we could take a more targeted look at the respective code. Sure. Alas I find no obvious pattern leading to corruption: From the guest side, it appeared with windows 2003, 2008, 2012, linux CentOS 6 and 7.
It appeared with virtio-blk; and I changed some VMs to use virtio-scsi but it's too soon to see the appearance of corruption in that case. As I said, I'm using snapshots VERY rarely, and our versions are too old so we do them the cold way only (VM shutdown). So very safely. The "weirdest" thing we do is to migrate VMs: you see how conservative we are! As you can imagine, all this setup is in production, and for most of the VMs, I can not "play" with them. Moreover, we launched a campaign of stopping every VM nightly, running qemu-img check on them one by one, then booting them again. So it might take some time before I find another corrupted image. (which I'll carefully store for debugging) Other information: We very rarely do snapshots, but I'm inclined to think that automated migrations of VMs could trigger similar behaviors on qcow2 images. To my knowledge, oVirt only uses external snapshots and creates them with QMP. This should be perfectly safe because from the perspective of the qcow2 image being snapshotted, it just means that it gets no new write
Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI
Am 13.02.2018 um 15:58 hat Daniel P. Berrangé geschrieben: > On Tue, Feb 13, 2018 at 03:43:10PM +0100, Kevin Wolf wrote: > > Am 13.02.2018 um 15:30 hat Roman Kagan geschrieben: > > > On Tue, Feb 13, 2018 at 11:50:24AM +0100, Kevin Wolf wrote: > > > > Am 11.01.2018 um 14:04 hat Daniel P. Berrange geschrieben: > > > > > Then you could just use the regular migrate QMP commands for loading > > > > > and saving snapshots. > > > > > > > > Yes, you could. I think for a proper implementation you would want to do > > > > better, though. Live migration provides just a stream, but that's not > > > > really well suited for snapshots. When a RAM page is dirtied, you just > > > > want to overwrite the old version of it in a snapshot [...] > > > > > > This means the point in time where the guest state is snapshotted is not > > > when the command is issued, but any unpredictable amount of time later. > > > > > > I'm not sure this is what a user expects. > > > > I don't think it's necessarily a big problem as long as you set the > > expectations right, but good point anyway. > > > > > A better approach for the save part appears to be to stop the vcpus, > > > dump the device state, resume the vcpus, and save the memory contents > > > in the background, prioritizing the old copies of the pages that > > > change. > > > > So basically you would let the guest fault whenever it writes to a page > > that is not saved yet, and then save it first before you make the page > > writable again? Essentially blockdev-backup, except for RAM. > > The page fault servicing will be delayed by however long it takes to > write the page to underlying storage, which could be considerable with > non-SSD. So guest performance could be significantly impacted on slow > storage with high dirtying rate. On the flip side it guarantees a live > snapshot would complete in finite time which is good. You can just use a bounce buffer for writing out the old page.
Then the VM is only stopped for the duration of a malloc() + memcpy(). Kevin
Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI
* Denis V. Lunev (d...@virtuozzo.com) wrote: > On 02/13/2018 05:59 PM, Dr. David Alan Gilbert wrote: > > * Daniel P. Berrangé (berra...@redhat.com) wrote: > >> On Tue, Feb 13, 2018 at 03:45:21PM +0100, Kevin Wolf wrote: > >>> Am 13.02.2018 um 15:36 hat Daniel P. Berrangé geschrieben: > On Tue, Feb 13, 2018 at 05:30:02PM +0300, Roman Kagan wrote: > > On Tue, Feb 13, 2018 at 11:50:24AM +0100, Kevin Wolf wrote: > >> Am 11.01.2018 um 14:04 hat Daniel P. Berrange geschrieben: > >>> Then you could just use the regular migrate QMP commands for loading > >>> and saving snapshots. > >> Yes, you could. I think for a proper implementation you would want to > >> do > >> better, though. Live migration provides just a stream, but that's not > >> really well suited for snapshots. When a RAM page is dirtied, you just > >> want to overwrite the old version of it in a snapshot [...] > > This means the point in time where the guest state is snapshotted is not > > when the command is issued, but any unpredictable amount of time later. > > > > I'm not sure this is what a user expects. > > > > A better approach for the save part appears to be to stop the vcpus, > > dump the device state, resume the vcpus, and save the memory contents in > > the background, prioritizing the old copies of the pages that change. > > No multiple copies of the same page would have to be saved so the stream > > format would be fine. For the load part the usual inmigrate should > > work. > No, that's policy decision that doesn't matter from QMP pov. If the mgmt > app wants the snapshot to be wrt to the initial time, it can simply > invoke the "stop" QMP command before doing the live migration and > "cont" afterwards. > >>> That would be non-live. I think Roman means a live snapshot that saves > >>> the state at the beginning of the operation. Basically the difference > >>> between blockdev-backup (state at the beginning) and blockdev-mirror > >>> (state at the end), except for a whole VM. 
> >> That doesn't seem practical unless you can instantaneously write out > >> the entire guest RAM to disk without blocking, or can somehow snapshot > >> the RAM so you can write out a consistent view of the original RAM, > >> while the guest continues to dirty RAM pages. > > People have suggested doing something like that with userfault write > > mode; but the same would also be doable just by write protecting the > > whole of RAM and then following the faults. > > nope, userfault fd does not help :( We have tried, the functionality is not > enough. Better to have small extension to KVM to protect all memory > and notify QEMU with accessed address. Can you explain why? I thought the write-protect mode of userfaultfd was supposed to be able to do that; cc'ing in Andrea Dave > Den -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [Qemu-block] [Qemu-devel] [PATCH 5/7] sheepdog: Make sd_prealloc() take a BDS
On 02/13/2018 07:03 AM, Max Reitz wrote:
> We want to use this function in sd_truncate() later on, so taking a
> filename is not exactly ideal.
>
> Signed-off-by: Max Reitz
> ---
>  block/sheepdog.c | 29 +
>  1 file changed, 21 insertions(+), 8 deletions(-)

> @@ -1837,10 +1837,11 @@ static int sd_prealloc(const char *filename, Error **errp)
>      void *buf = NULL;
>      int ret;
>
> -    blk = blk_new_open(filename, NULL, NULL,
> -                       BDRV_O_RDWR | BDRV_O_RESIZE | BDRV_O_PROTOCOL, errp);
> -    if (blk == NULL) {
> -        ret = -EIO;
> +    blk = blk_new(BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE | BLK_PERM_RESIZE,
> +                  BLK_PERM_ALL);
> +
> +    ret = blk_insert_bs(blk, bs, errp);
> +    if (ret < 0) {

>      if (prealloc) {
> -        ret = sd_prealloc(filename, errp);
> +        BlockDriverState *bs;
> +        QDict *opts;
> +
> +        opts = qdict_new();
> +        qdict_put_str(opts, "driver", "sheepdog");
> +        bs = bdrv_open(filename, NULL, opts, BDRV_O_PROTOCOL | BDRV_O_RDWR,
> +                       errp);

I'm a bit weak on ensuring the same ending flags result after the new
bdrv_open()/blk_new() as what you used to get on blk_new_open(), so
someone that can actually test this is appreciated.

Reviewed-by: Eric Blake

--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI
On 02/13/2018 05:59 PM, Dr. David Alan Gilbert wrote: > * Daniel P. Berrangé (berra...@redhat.com) wrote: >> On Tue, Feb 13, 2018 at 03:45:21PM +0100, Kevin Wolf wrote: >>> Am 13.02.2018 um 15:36 hat Daniel P. Berrangé geschrieben: On Tue, Feb 13, 2018 at 05:30:02PM +0300, Roman Kagan wrote: > On Tue, Feb 13, 2018 at 11:50:24AM +0100, Kevin Wolf wrote: >> Am 11.01.2018 um 14:04 hat Daniel P. Berrange geschrieben: >>> Then you could just use the regular migrate QMP commands for loading >>> and saving snapshots. >> Yes, you could. I think for a proper implementation you would want to do >> better, though. Live migration provides just a stream, but that's not >> really well suited for snapshots. When a RAM page is dirtied, you just >> want to overwrite the old version of it in a snapshot [...] > This means the point in time where the guest state is snapshotted is not > when the command is issued, but any unpredictable amount of time later. > > I'm not sure this is what a user expects. > > A better approach for the save part appears to be to stop the vcpus, > dump the device state, resume the vcpus, and save the memory contents in > the background, prioritizing the old copies of the pages that change. > No multiple copies of the same page would have to be saved so the stream > format would be fine. For the load part the usual inmigrate should > work. No, that's policy decision that doesn't matter from QMP pov. If the mgmt app wants the snapshot to be wrt to the initial time, it can simply invoke the "stop" QMP command before doing the live migration and "cont" afterwards. >>> That would be non-live. I think Roman means a live snapshot that saves >>> the state at the beginning of the operation. Basically the difference >>> between blockdev-backup (state at the beginning) and blockdev-mirror >>> (state at the end), except for a whole VM. 
>> That doesn't seem practical unless you can instantaneously write out >> the entire guest RAM to disk without blocking, or can somehow snapshot >> the RAM so you can write out a consistent view of the original RAM, >> while the guest continues to dirty RAM pages. > People have suggested doing something like that with userfault write > mode; but the same would also be doable just by write protecting the > whole of RAM and then following the faults. nope, userfault fd does not help :( We have tried, the functionality is not enough. Better to have small extension to KVM to protect all memory and notify QEMU with accessed address. Den
Re: [Qemu-block] [Qemu-devel] [PATCH 6/7] sheepdog: Pass old and new size to sd_prealloc()
On 02/13/2018 07:03 AM, Max Reitz wrote:
> sd_prealloc() will now preallocate the area [old_size, new_size). As
> before, it rounds to buf_size and may thus overshoot and preallocate
> areas that were not requested to be preallocated. For image creation,
> this is no change in behavior. For truncation, this is in accordance
> with the documentation for preallocated truncation.
>
> Signed-off-by: Max Reitz
> ---
>  block/sheepdog.c | 16 +---
>  1 file changed, 5 insertions(+), 11 deletions(-)

> @@ -1847,19 +1847,13 @@ static int sd_prealloc(BlockDriverState *bs, Error **errp)
>      blk_set_allow_write_beyond_eof(blk, true);
>
> -    vdi_size = blk_getlength(blk);
> -    if (vdi_size < 0) {
> -        ret = vdi_size;
> -        goto out;
> -    }
> -
> @@ -2119,7 +2113,7 @@ static int sd_create(const char *filename, QemuOpts *opts,
>          goto out;
>      }
>
> -    ret = sd_prealloc(bs, errp);
> +    ret = sd_prealloc(bs, 0, s->inode.vdi_size, errp);

Nice - you also got rid of a potential failure in blk_getlength().

Reviewed-by: Eric Blake

--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI
* Daniel P. Berrangé (berra...@redhat.com) wrote: > On Tue, Feb 13, 2018 at 03:45:21PM +0100, Kevin Wolf wrote: > > Am 13.02.2018 um 15:36 hat Daniel P. Berrangé geschrieben: > > > On Tue, Feb 13, 2018 at 05:30:02PM +0300, Roman Kagan wrote: > > > > On Tue, Feb 13, 2018 at 11:50:24AM +0100, Kevin Wolf wrote: > > > > > Am 11.01.2018 um 14:04 hat Daniel P. Berrange geschrieben: > > > > > > Then you could just use the regular migrate QMP commands for loading > > > > > > and saving snapshots. > > > > > > > > > > Yes, you could. I think for a proper implementation you would want to > > > > > do > > > > > better, though. Live migration provides just a stream, but that's not > > > > > really well suited for snapshots. When a RAM page is dirtied, you just > > > > > want to overwrite the old version of it in a snapshot [...] > > > > > > > > This means the point in time where the guest state is snapshotted is not > > > > when the command is issued, but any unpredictable amount of time later. > > > > > > > > I'm not sure this is what a user expects. > > > > > > > > A better approach for the save part appears to be to stop the vcpus, > > > > dump the device state, resume the vcpus, and save the memory contents in > > > > the background, prioritizing the old copies of the pages that change. > > > > No multiple copies of the same page would have to be saved so the stream > > > > format would be fine. For the load part the usual inmigrate should > > > > work. > > > > > > No, that's policy decision that doesn't matter from QMP pov. If the mgmt > > > app wants the snapshot to be wrt to the initial time, it can simply > > > invoke the "stop" QMP command before doing the live migration and > > > "cont" afterwards. > > > > That would be non-live. I think Roman means a live snapshot that saves > > the state at the beginning of the operation. Basically the difference > > between blockdev-backup (state at the beginning) and blockdev-mirror > > (state at the end), except for a whole VM. 
> > That doesn't seem practical unless you can instantaneously write out > the entire guest RAM to disk without blocking, or can somehow snapshot > the RAM so you can write out a consistent view of the original RAM, > while the guest continues to dirty RAM pages. People have suggested doing something like that with userfault write mode; but the same would also be doable just by write protecting the whole of RAM and then following the faults. Dave > Regards, > Daniel > -- > |: https://berrange.com -o-https://www.flickr.com/photos/dberrange :| > |: https://libvirt.org -o-https://fstop138.berrange.com :| > |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :| -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI
On Tue, Feb 13, 2018 at 03:43:10PM +0100, Kevin Wolf wrote: > Am 13.02.2018 um 15:30 hat Roman Kagan geschrieben: > > On Tue, Feb 13, 2018 at 11:50:24AM +0100, Kevin Wolf wrote: > > > Am 11.01.2018 um 14:04 hat Daniel P. Berrange geschrieben: > > > > Then you could just use the regular migrate QMP commands for loading > > > > and saving snapshots. > > > > > > Yes, you could. I think for a proper implementation you would want to do > > > better, though. Live migration provides just a stream, but that's not > > > really well suited for snapshots. When a RAM page is dirtied, you just > > > want to overwrite the old version of it in a snapshot [...] > > > > This means the point in time where the guest state is snapshotted is not > > when the command is issued, but any unpredictable amount of time later. > > > > I'm not sure this is what a user expects. > > I don't think it's necessarily a big problem as long as you set the > expectations right, but good point anyway. > > > A better approach for the save part appears to be to stop the vcpus, > > dump the device state, resume the vcpus, and save the memory contents > > in the background, prioritizing the old copies of the pages that > > change. > > So basically you would let the guest fault whenever it writes to a page > that is not saved yet, and then save it first before you make the page > writable again? Essentially blockdev-backup, except for RAM. The page fault servicing will be delayed by however long it takes to write the page to underlying storage, which could be considerable with non-SSD. So guest performance could be significantly impacted on slow storage with a high dirtying rate. On the flip side it guarantees a live snapshot would complete in finite time, which is good. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
Re: [Qemu-block] [Qemu-devel] [PATCH 7/7] sheepdog: Allow fully preallocated truncation
On 02/13/2018 07:03 AM, Max Reitz wrote: Signed-off-by: Max Reitz --- block/sheepdog.c | 15 --- 1 file changed, 12 insertions(+), 3 deletions(-) Reviewed-by: Eric Blake -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [PATCH 1/7] gluster: Move glfs_close() to create's clean-up
On 02/13/2018 07:03 AM, Max Reitz wrote: glfs_close() is a classical clean-up operation, as can be seen by the fact that it is executed even if the truncation before it failed. Also, moving it to clean-up makes it more clear that if it fails, we do not want it to overwrite the current ret value if that signifies an error already. Signed-off-by: Max Reitz --- block/gluster.c | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) Reviewed-by: Eric Blake -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [PATCH 4/7] gluster: Add preallocated truncation
On 02/13/2018 07:03 AM, Max Reitz wrote: By using qemu_gluster_do_truncate() in qemu_gluster_truncate(), we now automatically have preallocated truncation. Signed-off-by: Max Reitz --- block/gluster.c | 17 + 1 file changed, 1 insertion(+), 16 deletions(-) Reviewed-by: Eric Blake -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [PATCH 3/7] gluster: Query current size in do_truncate()
On 02/13/2018 07:03 AM, Max Reitz wrote: Instead of expecting the current size to be 0, query it and allocate only the area [current_size, offset) if preallocation is requested. Signed-off-by: Max Reitz --- block/gluster.c | 21 +++-- 1 file changed, 19 insertions(+), 2 deletions(-) I don't have a gluster setup handy to test this, but the concept matches file-posix and the change looks reasonable. Reviewed-by: Eric Blake -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [PATCH 2/7] gluster: Pull truncation from qemu_gluster_create
On 02/13/2018 07:03 AM, Max Reitz wrote: Pull out the truncation code from the qemu_gluster_create() function so we can later reuse it in qemu_gluster_truncate(). Signed-off-by: Max Reitz --- block/gluster.c | 74 +++-- 1 file changed, 40 insertions(+), 34 deletions(-) Reviewed-by: Eric Blake -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [PATCH v4 31/39] qcow2: Update qcow2_truncate() to support L2 slices
On 02/05/2018 08:33 AM, Alberto Garcia wrote: The qcow2_truncate() code is mostly independent from whether we're using L2 slices or full L2 tables, but in full and falloc preallocation modes new L2 tables are allocated using qcow2_alloc_cluster_link_l2(). Therefore the code needs to be modified to ensure that all nb_clusters that are processed in each call can be allocated with just one L2 slice. Signed-off-by: Alberto Garcia --- block/qcow2.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) Reviewed-by: Eric Blake -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3266 Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI
On 02/13/2018 05:48 PM, Daniel P. Berrangé wrote: > On Tue, Feb 13, 2018 at 03:45:21PM +0100, Kevin Wolf wrote: >> Am 13.02.2018 um 15:36 hat Daniel P. Berrangé geschrieben: >>> On Tue, Feb 13, 2018 at 05:30:02PM +0300, Roman Kagan wrote: On Tue, Feb 13, 2018 at 11:50:24AM +0100, Kevin Wolf wrote: > Am 11.01.2018 um 14:04 hat Daniel P. Berrange geschrieben: >> Then you could just use the regular migrate QMP commands for loading >> and saving snapshots. > Yes, you could. I think for a proper implementation you would want to do > better, though. Live migration provides just a stream, but that's not > really well suited for snapshots. When a RAM page is dirtied, you just > want to overwrite the old version of it in a snapshot [...] This means the point in time where the guest state is snapshotted is not when the command is issued, but any unpredictable amount of time later. I'm not sure this is what a user expects. A better approach for the save part appears to be to stop the vcpus, dump the device state, resume the vcpus, and save the memory contents in the background, prioritizing the old copies of the pages that change. No multiple copies of the same page would have to be saved so the stream format would be fine. For the load part the usual inmigrate should work. >>> No, that's policy decision that doesn't matter from QMP pov. If the mgmt >>> app wants the snapshot to be wrt to the initial time, it can simply >>> invoke the "stop" QMP command before doing the live migration and >>> "cont" afterwards. >> That would be non-live. I think Roman means a live snapshot that saves >> the state at the beginning of the operation. Basically the difference >> between blockdev-backup (state at the beginning) and blockdev-mirror >> (state at the end), except for a whole VM. 
> That doesn't seem practical unless you can instantaneously write out > the entire guest RAM to disk without blocking, or can somehow snapshot > the RAM so you can write out a consistent view of the original RAM, > while the guest continues to dirty RAM pages. > > Regards, > Daniel We have a working prototype of that right now, not yet ready to be submitted even as an RFC. In practice we are speaking here about snapshots with memory, as in "The equivalent HMP command is savevm." Thus all of Roman's concerns apply here. Den
Re: [Qemu-block] [PATCH v4 00/39] Allow configuring the qcow2 L2 cache entry size
On 2018-02-05 15:33, Alberto Garcia wrote: > this is the new revision of the patch series to allow configuring the > entry size of the qcow2 L2 cache. Follow this link for the full > description from the first version: > >https://lists.gnu.org/archive/html/qemu-block/2017-10/msg00458.html > > And here are some numbers showing the performance improvements: > >https://lists.gnu.org/archive/html/qemu-block/2017-12/msg00507.html > > Regards, > > Berto Thanks, applied to my block branch (with s/intead/instead/ and s/=300/=4242/): https://github.com/XanClic/qemu/commits/block Max