Re: [Qemu-block] [PATCH 1/4] block: Fix dirty bitmap in bdrv_co_discard
On 06/05/2015 03:45, Fam Zheng wrote:

>> This is not enough; you also have to do the discard in block/mirror.c, otherwise the destination image could even become fully provisioned!
>
> I wasn't sure what to do if src and dest have different can_write_zeroes_with_unmap values, but your argument is stronger. I will add discard in mirror. Besides that, do we need to compare can_write_zeroes_with_unmap?

Hmm, if can_write_zeroes_with_unmap is set, it's probably better to write zeroes instead of discarding, in case the guest is relying on it.

Paolo
Re: [Qemu-block] [Qemu-devel] [RFC] Differential Backups
On Tue, May 05, 2015 at 11:55:49AM -0400, John Snow wrote:
> On 05/05/2015 06:25 AM, Stefan Hajnoczi wrote:
>> On Wed, Apr 29, 2015 at 06:51:08PM -0400, John Snow wrote:
>>> This is a feature that should be very easy to add on top of the existing incremental feature, since it's just a difference in how the bitmap is treated:
>>>
>>> Incremental:
>>> - Links to the last incremental (managed by libvirt)
>>> - Clears the bitmap after creation
>>>
>>> Differential:
>>> - Links to the last full backup always (managed by libvirt)
>>> - Does not clear the bitmap after creation
>>>
>>> No biggie.
>>
>> Differential backups can be done using incremental backup functionality in QEMU: the client application points QEMU to the same target repeatedly instead of keeping separate incremental backups.
>>
>> Stefan
>
> Oh, so you're saying:
>
> [anchor]--[diff1]
>
> And then when making a new incremental, we re-use diff1 as a target and overwrite it so that it becomes:
>
> [anchor]--[diff2]
>
> In effect giving us a differential. OK, so it's possible, but we still lose out on some flexibility that a slightly different mode would provide us, like the ability to keep multiple differentials if desired. (Well, I suppose we *can* create those by manually copying differentials after we create them, but that seems hackier than necessary.)
>
> Still, it would be such a paltry few lines of code and introduce no real complexity to the subsystem, and it might make libvirt's life a little easier for managing such things.

I have CCed Eric Blake and the libvirt mailing list. This is a good time to discuss libvirt requirements for backup APIs.

Current work for QEMU 2.4:
* QMP command to take incremental backup of guest disks in a single atomic operation
* Dirty bitmap persistence across live migration and QEMU restart

Eric: Do you have any opinion on a differential backup feature in addition to incremental backup?

My thoughts are that QEMU should only copy out changed blocks and let the backup application worry about concepts like incremental vs. differential.
From a host performance perspective, copying out the same blocks that have already been sent in a previous backup is a waste of I/O bandwidth. Even backup applications that want differential backup may not actually use the QEMU feature for this reason.

Regarding the size of the patch, there's a cost to adding the tests, documentation, and committing to QMP APIs. These costs dwarf the small code change that preserves the dirty bitmap. If there's a concrete requirement for this feature then I'm happy with it, but let's not add bells and whistles just because we can.

Stefan
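The distinction John describes is entirely in how the dirty bitmap is treated once a backup completes. A minimal standalone model of that difference (hypothetical names and structure, not QEMU's actual bitmap API):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NUM_BLOCKS 8

/* Toy dirty bitmap: one flag per block. */
typedef struct {
    bool dirty[NUM_BLOCKS];
} DirtyBitmap;

static void guest_write(DirtyBitmap *bm, int block)
{
    bm->dirty[block] = true;
}

/* Copy out dirty blocks; returns the number copied.  An incremental
 * backup clears the bitmap afterwards; a differential backup leaves it
 * untouched, so each run copies everything changed since the last
 * *full* backup. */
static int do_backup(DirtyBitmap *bm, bool clear_after)
{
    int copied = 0;
    for (int i = 0; i < NUM_BLOCKS; i++) {
        if (bm->dirty[i]) {
            copied++;                 /* stand-in for the actual data copy */
            if (clear_after) {
                bm->dirty[i] = false; /* incremental semantics */
            }
        }
    }
    return copied;
}
```

With clear_after=true, two backups after a single write copy 1 and then 0 blocks; with clear_after=false they copy 1 and 1 — which is exactly why Stefan notes the differential mode re-sends blocks that were already copied.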
Re: [Qemu-block] [Qemu-devel] [PATCH 1/4] block: Fix dirty bitmap in bdrv_co_discard
On 06/05/2015 11:50, Fam Zheng wrote:

>   #  src can_write_zeroes_with_unmap   target can_write_zeroes_with_unmap
>   1  true                              true
>   2  true                              false
>   3  false                             true
>   4  false                             false
>
> For cases 2 and 3 it's probably better to mirror the actual reading of the source. I'm not sure about 4. Even in case 1, discard could be UNMAP and write zeroes could be WRITE SAME. For unaligned sectors, UNMAP might leave some sectors aside while WRITE SAME will write them with zeroes.

1 should replicate WRITE SAME, in case the unmap granularity of the target is different from that of the source. In that case, a discard on the target might leave some sectors untouched; writing zeroes would ensure matching data between the source and the target.

2 should _not_ discard: it should write zeroes, even at the cost of making the target fully provisioned. Perhaps you can optimize it by looking at bdrv_get_block_status for the target and checking the answer for BDRV_ZERO.

3 and 4 can use discard on the target.

So it looks like only the source setting matters. We need to check the cost of bdrv_co_get_block_status for raw, too. If it's too expensive, that can be a problem.

Paolo
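Paolo's four-case table collapses to a decision on the source flag alone. A sketch of that decision (hypothetical enum and function names, not QEMU code):

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    MIRROR_WRITE_ZEROES,  /* replicate WRITE SAME semantics on the target */
    MIRROR_DISCARD        /* a plain discard (UNMAP) on the target is enough */
} MirrorUnmapMethod;

/* How to propagate a region that is unallocated/discarded on the source.
 * Per the discussion, only the source setting matters: if the guest may
 * rely on write-zeroes-with-unmap semantics on the source, the target
 * must actually read back zeroes, so write zeroes there; otherwise a
 * discard suffices (cases 3 and 4). */
static MirrorUnmapMethod pick_method(bool src_can_write_zeroes_with_unmap)
{
    return src_can_write_zeroes_with_unmap ? MIRROR_WRITE_ZEROES
                                           : MIRROR_DISCARD;
}
```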
Re: [Qemu-block] [Qemu-devel] qemu-img convert (vmdk)
Am 05.05.2015 um 19:01 hat Antoni Villalonga geschrieben:
> Hi, this is my first email to this list ;)
>
> I can reproduce this bug with v2.2 and v2.3. I'm not sure about the results after testing with v2.1 (it doesn't show errors but seems to be still broken).
>
> % qemu-img convert -f raw -O vmdk -o subformat=streamOptimized 100GB_inputfile.img outputfile0.vmdk
> (no errors/warnings/info shown)
> % qemu-img convert -f vmdk -O qcow2 outputfile0.vmdk outputfile11.qcow2
> qemu-img: error while reading sector 278528: Invalid argument
>
> Let me know if I can provide you more info.

Fam, any idea?

Kevin
Re: [Qemu-block] [PATCH v2 1/6] mirror: Discard target sectors if not allocated at source side
I just wonder if bdrv_is_allocated_above works as intended; migrating an image in qcow2 format runs into this backtrace:

#0  0x7f9e73822c6d in lseek64 () at ../sysdeps/unix/syscall-template.S:82
#1  0x7f9e765f08e4 in find_allocation (bs=<value optimized out>, sector_num=<value optimized out>, nb_sectors=20480, pnum=0x7f9e7bab00fc) at block/raw-posix.c:1285
#2  raw_co_get_block_status (bs=<value optimized out>, sector_num=<value optimized out>, nb_sectors=20480, pnum=0x7f9e7bab00fc) at block/raw-posix.c:1381
#3  0x7f9e765cfe61 in bdrv_co_get_block_status (bs=0x7f9e790e4300, sector_num=22470528, nb_sectors=20480, pnum=0x7f9e7bab00fc) at block.c:3593
#4  0x7f9e765cffa8 in bdrv_co_get_block_status (bs=0x7f9e790e00a0, sector_num=22466560, nb_sectors=<value optimized out>, pnum=0x7f9e7bab00fc) at block.c:3620
#5  0x7f9e765cfff7 in bdrv_get_block_status_co_entry (opaque=0x7f9e7bab0080) at block.c:3639
#6  0x7f9e765d0088 in bdrv_get_block_status (bs=<value optimized out>, sector_num=<value optimized out>, nb_sectors=<value optimized out>, pnum=<value optimized out>) at block.c:3663
#7  0x7f9e765d00a9 in bdrv_is_allocated (bs=0x7f9e790e00a0, sector_num=<value optimized out>, nb_sectors=<value optimized out>, pnum=<value optimized out>) at block.c:3677
#8  0x7f9e765d0166 in bdrv_is_allocated_above (top=0x7f9e790e00a0, base=0x0, sector_num=22466560, nb_sectors=20480, pnum=0x7f9e7bab020c) at block.c:3709
#9  0x7f9e765dcf9e in mirror_iteration (opaque=0x7f9e79486110) at block/mirror.c:305
#10 mirror_run (opaque=0x7f9e79486110) at block/mirror.c:430
#11 0x7f9e766120eb in coroutine_trampoline (i0=<value optimized out>, i1=<value optimized out>) at coroutine-ucontext.c:118
#12 0x7f9e734c4b70 in ?? () from /lib64/libc.so.6
#13 0x7a897850 in ?? ()
#14 0x in ??
()

It is too slow to tolerate...

---------- Original Message ----------
Sender: Fam Zheng <f...@redhat.com>
Recipient: qemu-devel <qemu-de...@nongnu.org>
Cc: Kevin Wolf <kw...@redhat.com>; Stefan Hajnoczi <stefa...@redhat.com>; qemu-block <qemu-block@nongnu.org>; pbonzini <pbonz...@redhat.com>; jsnow <js...@redhat.com>; wangxiaolong <wangxiaol...@ucloud.cn>
Date: Wednesday, May 6, 2015 12:52
Subject: [PATCH v2 1/6] mirror: Discard target sectors if not allocated at source side

If the guest discards a source cluster during mirror, we want to discard the target side as well.

Signed-off-by: Fam Zheng <f...@redhat.com>
---
 block/mirror.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 58f391a..37a5b61 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -163,6 +163,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
     int64_t end, sector_num, next_chunk, next_sector, hbitmap_next_sector;
     uint64_t delay_ns = 0;
     MirrorOp *op;
+    int pnum;
 
     s->sector_num = hbitmap_iter_next(&s->hbi);
     if (s->sector_num < 0) {
@@ -289,8 +290,15 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
     s->in_flight++;
     s->sectors_in_flight += nb_sectors;
     trace_mirror_one_iteration(s, sector_num, nb_sectors);
-    bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,
-                   mirror_read_complete, op);
+
+    if (!bdrv_is_allocated_above(source, NULL, sector_num,
+                                 nb_sectors, &pnum)) {
+        bdrv_aio_discard(s->target, sector_num, nb_sectors,
+                         mirror_write_complete, op);
+    } else {
+        bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,
+                       mirror_read_complete, op);
+    }
     return delay_ns;
 }
-- 
1.9.3
Re: [Qemu-block] [Qemu-devel] qemu-img convert (vmdk)
On Wed, 05/06 12:01, Kevin Wolf wrote:
> Am 05.05.2015 um 19:01 hat Antoni Villalonga geschrieben:
>> Hi, this is my first email to this list ;)
>>
>> I can reproduce this bug with v2.2 and v2.3. I'm not sure about the results after testing with v2.1 (it doesn't show errors but seems to be still broken).
>>
>> % qemu-img convert -f raw -O vmdk -o subformat=streamOptimized 100GB_inputfile.img outputfile0.vmdk
>> (no errors/warnings/info shown)
>> % qemu-img convert -f vmdk -O qcow2 outputfile0.vmdk outputfile11.qcow2
>> qemu-img: error while reading sector 278528: Invalid argument
>>
>> Let me know if I can provide you more info.
>
> Fam, any idea?

It's a bug. I can reproduce it with my 21G guest image. I'll take a look.

Fam
[Qemu-block] [RFC PATCH 1/7] block: Add op blocker type device IO
Preventing a device from submitting IO is useful around various nested polls. The op blocker is a good place to put this flag. Devices submit IO requests through the blk_* block backend interface, which calls blk_check_request to check their validity. Return -EBUSY if the operation is blocked, in which case device IO is temporarily disabled by another BDS user.

Signed-off-by: Fam Zheng <f...@redhat.com>
---
 block/block-backend.c | 4
 include/block/block.h | 1 +
 2 files changed, 5 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index 93e46f3..71fc695 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -478,6 +478,10 @@ static int blk_check_request(BlockBackend *blk, int64_t sector_num,
         return -EIO;
     }
 
+    if (bdrv_op_is_blocked(blk->bs, BLOCK_OP_TYPE_DEVICE_IO, NULL)) {
+        return -EBUSY;
+    }
+
     return blk_check_byte_request(blk, sector_num * BDRV_SECTOR_SIZE,
                                   nb_sectors * BDRV_SECTOR_SIZE);
 }
diff --git a/include/block/block.h b/include/block/block.h
index 7d1a717..906fb31 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -159,6 +159,7 @@ typedef enum BlockOpType {
     BLOCK_OP_TYPE_RESIZE,
     BLOCK_OP_TYPE_STREAM,
     BLOCK_OP_TYPE_REPLACE,
+    BLOCK_OP_TYPE_DEVICE_IO,
     BLOCK_OP_TYPE_MAX,
 } BlockOpType;
-- 
1.9.3
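The check this patch adds can be modeled with a small standalone sketch (hypothetical names; QEMU's real op blockers hang an Error list off each BlockDriverState rather than using a counter):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum { BLOCK_OP_TYPE_DEVICE_IO, BLOCK_OP_TYPE_MAX };

/* Toy stand-in for a BlockDriverState: one blocker count per op type. */
typedef struct {
    int blockers[BLOCK_OP_TYPE_MAX];
} BDS;

static void op_block(BDS *bs, int op)      { bs->blockers[op]++; }
static void op_unblock(BDS *bs, int op)    { bs->blockers[op]--; }
static bool op_is_blocked(BDS *bs, int op) { return bs->blockers[op] > 0; }

/* Mirrors the idea of the patch: refuse device IO with -EBUSY while the
 * DEVICE_IO op type is blocked by another BDS user. */
static int blk_check_request(BDS *bs)
{
    if (op_is_blocked(bs, BLOCK_OP_TYPE_DEVICE_IO)) {
        return -EBUSY;
    }
    return 0;
}
```

The point of using the op blocker rather than an ad-hoc flag is that block/unblock calls from different users nest naturally, as they do in the later patches of this series.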
[Qemu-block] [RFC PATCH 0/7] Fix transactional snapshot with virtio-blk dataplane
Reported by Paolo. Unlike the iohandler in the main loop, iothreads currently process the event notifier used as the virtio-blk ioeventfd in all nested aio_poll calls. This is dangerous without proper protection, because guest requests could sneak into the block layer where they mustn't.

For example, a QMP transaction may involve multiple bdrv_drain_all() calls while handling the list of AioContexts it works on. If an aio_poll in one of those bdrv_drain_all() calls happens to process a guest VQ kick by dispatching the ioeventfd event, a new guest write is submitted, and voila, the transaction semantics are violated.

This series avoids the problem by disabling virtio-blk handlers during bdrv_drain_all() and transactions.

Notes:

If the general approach is right, other transaction types could get the blockers similarly in the next revision, and some related bdrv_drain_all() calls could also be changed to bdrv_drain().

virtio-scsi-dataplane will be a bit more complicated, but still doable. It would probably need one more interface abstraction between scsi-disk, scsi-bus and virtio-scsi.

Although other devices don't have a pause/resume callback yet, blk_check_request, which returns -EBUSY if the device IO op blocker is set, could hopefully cover most cases already.

Timers and block jobs also generate IO, but that should be fine as long as they don't change guest-visible data, which is true AFAICT.
Fam Zheng (7):
  block: Add op blocker type device IO
  block: Block device IO during bdrv_drain and bdrv_drain_all
  block: Add op blocker notifier list
  block-backend: Add blk_op_blocker_add_notifier
  virtio-blk: Move complete_request to 'ops' structure
  virtio-blk: Don't handle output when there is device IO op blocker
  blockdev: Add device IO op blocker during snapshot transaction

 block.c                         | 20
 block/block-backend.c           | 10 ++
 block/io.c                      | 12 +++
 blockdev.c                      |  7 +
 hw/block/dataplane/virtio-blk.c | 36 ++---
 hw/block/virtio-blk.c           | 69 +++--
 include/block/block.h           |  9 ++
 include/block/block_int.h       |  3 ++
 include/hw/virtio/virtio-blk.h  | 17 --
 include/sysemu/block-backend.h  |  2 ++
 10 files changed, 174 insertions(+), 11 deletions(-)

-- 
1.9.3
[Qemu-block] [RFC PATCH 2/7] block: Block device IO during bdrv_drain and bdrv_drain_all
We don't want new requests from the guest, so block the operation around the nested poll.

Signed-off-by: Fam Zheng <f...@redhat.com>
---
 block/io.c | 12
 1 file changed, 12 insertions(+)

diff --git a/block/io.c b/block/io.c
index 1ce62c4..d369de3 100644
--- a/block/io.c
+++ b/block/io.c
@@ -289,9 +289,15 @@ static bool bdrv_drain_one(BlockDriverState *bs)
  */
 void bdrv_drain(BlockDriverState *bs)
 {
+    Error *blocker = NULL;
+
+    error_setg(&blocker, "bdrv_drain in progress");
+    bdrv_op_block(bs, BLOCK_OP_TYPE_DEVICE_IO, blocker);
     while (bdrv_drain_one(bs)) {
         /* Keep iterating */
     }
+    bdrv_op_unblock(bs, BLOCK_OP_TYPE_DEVICE_IO, blocker);
+    error_free(blocker);
 }
 
 /*
@@ -311,6 +317,9 @@ void bdrv_drain_all(void)
     /* Always run first iteration so any pending completion BHs run */
     bool busy = true;
     BlockDriverState *bs = NULL;
+    Error *blocker = NULL;
+
+    error_setg(&blocker, "bdrv_drain_all in progress");
 
     while ((bs = bdrv_next(bs))) {
         AioContext *aio_context = bdrv_get_aio_context(bs);
@@ -319,6 +328,7 @@ void bdrv_drain_all(void)
         if (bs->job) {
             block_job_pause(bs->job);
         }
+        bdrv_op_block(bs, BLOCK_OP_TYPE_DEVICE_IO, blocker);
         aio_context_release(aio_context);
     }
 
@@ -343,8 +353,10 @@ void bdrv_drain_all(void)
         if (bs->job) {
             block_job_resume(bs->job);
         }
+        bdrv_op_unblock(bs, BLOCK_OP_TYPE_DEVICE_IO, blocker);
         aio_context_release(aio_context);
     }
+    error_free(blocker);
 }
 
 /**
-- 
1.9.3
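The shape of the patch is a simple bracket: block device IO, run the nested drain loop, unblock. A toy version of that bracket (hypothetical names, not the real bdrv_drain):

```c
#include <assert.h>
#include <errno.h>

typedef struct {
    int io_blockers;      /* active DEVICE_IO blockers */
    int pending_requests; /* requests already in flight */
} BDS;

/* Completes one in-flight request per call; returns true while busy. */
static int drain_one(BDS *bs)
{
    if (bs->pending_requests > 0) {
        bs->pending_requests--;   /* stand-in for completing a request */
        return 1;
    }
    return 0;
}

/* New guest requests are refused while any blocker is active. */
static int submit_request(BDS *bs)
{
    if (bs->io_blockers > 0) {
        return -EBUSY;
    }
    bs->pending_requests++;
    return 0;
}

/* Mirrors the bracket in the patch: no new request can sneak in
 * between iterations of the drain loop. */
static void drain(BDS *bs)
{
    bs->io_blockers++;
    while (drain_one(bs)) {
        /* Keep iterating */
    }
    bs->io_blockers--;
}
```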
[Qemu-block] [PATCH 1/7] qcow2: use one single memory block for the L2/refcount cache tables
The qcow2 L2/refcount cache contains one separate table for each cache entry. Doing one allocation per table adds unnecessary overhead, and it also requires us to store the address of each table separately.

Since the size of the cache is constant during its lifetime, it's better to have an array that contains all the tables, using one single allocation. In my tests, measuring freshly created caches with sizes 128MB (L2) and 32MB (refcount), this uses around 10MB of RAM less.

Signed-off-by: Alberto Garcia <be...@igalia.com>
---
 block/qcow2-cache.c    | 55 --
 block/qcow2-cluster.c  | 12 +--
 block/qcow2-refcount.c |  8 +---
 block/qcow2.h          |  3 ++-
 4 files changed, 39 insertions(+), 39 deletions(-)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index b115549..f0dfb69 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -28,7 +28,6 @@
 #include "trace.h"
 
 typedef struct Qcow2CachedTable {
-    void*   table;
     int64_t offset;
     bool    dirty;
     int     cache_hits;
@@ -40,39 +39,35 @@ struct Qcow2Cache {
     struct Qcow2Cache*      depends;
     int                     size;
     bool                    depends_on_flush;
+    void                   *table_array;
 };
 
+static inline void *qcow2_cache_get_table_addr(BlockDriverState *bs,
+                                               Qcow2Cache *c, int table)
+{
+    BDRVQcowState *s = bs->opaque;
+    return (uint8_t *) c->table_array + (size_t) table * s->cluster_size;
+}
+
 Qcow2Cache *qcow2_cache_create(BlockDriverState *bs, int num_tables)
 {
     BDRVQcowState *s = bs->opaque;
     Qcow2Cache *c;
-    int i;
 
     c = g_new0(Qcow2Cache, 1);
     c->size = num_tables;
     c->entries = g_try_new0(Qcow2CachedTable, num_tables);
-    if (!c->entries) {
-        goto fail;
-    }
-
-    for (i = 0; i < c->size; i++) {
-        c->entries[i].table = qemu_try_blockalign(bs->file, s->cluster_size);
-        if (c->entries[i].table == NULL) {
-            goto fail;
-        }
+    c->table_array = qemu_try_blockalign(bs->file,
+                                         (size_t) num_tables * s->cluster_size);
+
+    if (!c->entries || !c->table_array) {
+        qemu_vfree(c->table_array);
+        g_free(c->entries);
+        g_free(c);
+        c = NULL;
     }
 
     return c;
-
-fail:
-    if (c->entries) {
-        for (i = 0; i < c->size; i++) {
-            qemu_vfree(c->entries[i].table);
-        }
-    }
-    g_free(c->entries);
-    g_free(c);
-    return NULL;
 }
 
 int qcow2_cache_destroy(BlockDriverState* bs, Qcow2Cache *c)
@@ -81,9 +76,9 @@ int qcow2_cache_destroy(BlockDriverState* bs, Qcow2Cache *c)
 
     for (i = 0; i < c->size; i++) {
         assert(c->entries[i].ref == 0);
-        qemu_vfree(c->entries[i].table);
     }
 
+    qemu_vfree(c->table_array);
     g_free(c->entries);
     g_free(c);
 
@@ -151,8 +146,8 @@ static int qcow2_cache_entry_flush(BlockDriverState *bs, Qcow2Cache *c, int i)
         BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE);
     }
 
-    ret = bdrv_pwrite(bs->file, c->entries[i].offset, c->entries[i].table,
-                      s->cluster_size);
+    ret = bdrv_pwrite(bs->file, c->entries[i].offset,
+                      qcow2_cache_get_table_addr(bs, c, i), s->cluster_size);
     if (ret < 0) {
         return ret;
     }
@@ -304,7 +299,8 @@ static int qcow2_cache_do_get(BlockDriverState *bs, Qcow2Cache *c,
             BLKDBG_EVENT(bs->file, BLKDBG_L2_LOAD);
         }
 
-        ret = bdrv_pread(bs->file, offset, c->entries[i].table, s->cluster_size);
+        ret = bdrv_pread(bs->file, offset, qcow2_cache_get_table_addr(bs, c, i),
+                         s->cluster_size);
         if (ret < 0) {
             return ret;
         }
@@ -319,7 +315,7 @@ static int qcow2_cache_do_get(BlockDriverState *bs, Qcow2Cache *c,
 found:
     c->entries[i].cache_hits++;
     c->entries[i].ref++;
-    *table = c->entries[i].table;
+    *table = qcow2_cache_get_table_addr(bs, c, i);
 
     trace_qcow2_cache_get_done(qemu_coroutine_self(),
                                c == s->l2_table_cache, i);
@@ -344,7 +340,7 @@ int qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table)
     int i;
 
     for (i = 0; i < c->size; i++) {
-        if (c->entries[i].table == *table) {
+        if (qcow2_cache_get_table_addr(bs, c, i) == *table) {
             goto found;
         }
     }
@@ -358,12 +354,13 @@ found:
     return 0;
 }
 
-void qcow2_cache_entry_mark_dirty(Qcow2Cache *c, void *table)
+void qcow2_cache_entry_mark_dirty(BlockDriverState *bs, Qcow2Cache *c,
+                                  void *table)
 {
     int i;
 
     for (i = 0; i < c->size; i++) {
-        if (c->entries[i].table == table) {
+        if (qcow2_cache_get_table_addr(bs, c, i) == table) {
             goto found;
         }
     }
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index ed2b44d..5cd418a 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -263,7 +263,7 @@ static int l2_allocate(BlockDriverState *bs, int l1_index, uint64_t
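The core of this patch is replacing per-table pointers with offset arithmetic into one contiguous allocation. A standalone sketch of that arithmetic (hypothetical names, simplified from the patch):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CLUSTER_SIZE 65536

typedef struct {
    int   size;        /* number of tables in the cache */
    void *table_array; /* one allocation holding all the tables */
} Cache;

static Cache *cache_create(int num_tables)
{
    Cache *c = calloc(1, sizeof(*c));
    c->size = num_tables;
    /* One allocation instead of num_tables separate ones. */
    c->table_array = calloc((size_t) num_tables, CLUSTER_SIZE);
    return c;
}

/* O(1): table index -> address inside the single block. */
static void *get_table_addr(Cache *c, int table)
{
    return (uint8_t *) c->table_array + (size_t) table * CLUSTER_SIZE;
}
```

Besides the allocation overhead, the single block also removes the per-entry `table` pointer, which is where the RAM savings mentioned in the commit message come from.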
[Qemu-block] [PATCH 4/7] qcow2: remove qcow2_cache_find_entry_to_replace()
A cache miss means that the whole array was traversed and the entry we were looking for was not found, so there's no need to traverse it again in order to select an entry to replace.

Signed-off-by: Alberto Garcia <be...@igalia.com>
---
 block/qcow2-cache.c | 45 -
 1 file changed, 16 insertions(+), 29 deletions(-)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index 786c10a..dffb887 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -242,51 +242,38 @@ int qcow2_cache_empty(BlockDriverState *bs, Qcow2Cache *c)
     return 0;
 }
 
-static int qcow2_cache_find_entry_to_replace(Qcow2Cache *c)
-{
-    int i;
-    uint64_t min_lru_counter = UINT64_MAX;
-    int min_index = -1;
-
-    for (i = 0; i < c->size; i++) {
-        if (c->entries[i].ref) {
-            continue;
-        }
-
-        if (c->entries[i].lru_counter < min_lru_counter) {
-            min_index = i;
-            min_lru_counter = c->entries[i].lru_counter;
-        }
-    }
-
-    if (min_index == -1) {
-        /* This can't happen in current synchronous code, but leave the check
-         * here as a reminder for whoever starts using AIO with the cache */
-        abort();
-    }
-    return min_index;
-}
-
 static int qcow2_cache_do_get(BlockDriverState *bs, Qcow2Cache *c,
                               uint64_t offset, void **table, bool read_from_disk)
 {
     BDRVQcowState *s = bs->opaque;
     int i;
     int ret;
+    uint64_t min_lru_counter = UINT64_MAX;
+    int min_lru_index = -1;
 
     trace_qcow2_cache_get(qemu_coroutine_self(), c == s->l2_table_cache,
                           offset, read_from_disk);
 
     /* Check if the table is already cached */
     for (i = 0; i < c->size; i++) {
-        if (c->entries[i].offset == offset) {
+        const Qcow2CachedTable *t = &c->entries[i];
+        if (t->offset == offset) {
             goto found;
         }
+        if (t->ref == 0 && t->lru_counter < min_lru_counter) {
+            min_lru_counter = t->lru_counter;
+            min_lru_index = i;
+        }
+    }
+
+    if (min_lru_index == -1) {
+        /* This can't happen in current synchronous code, but leave the check
+         * here as a reminder for whoever starts using AIO with the cache */
+        abort();
     }
 
-    /* If not, write a table back and replace it */
-    i = qcow2_cache_find_entry_to_replace(c);
+    /* Cache miss: write a table back and replace it */
+    i = min_lru_index;
     trace_qcow2_cache_get_replace_entry(qemu_coroutine_self(),
                                         c == s->l2_table_cache, i);
     if (i < 0) {
-- 
2.1.4
Re: [Qemu-block] [PATCH v4 16/17] qapi: Expose new qcow2 overlap check options
On 04.05.2015 21:39, Eric Blake wrote:
> On 05/04/2015 01:15 PM, Max Reitz wrote:
>> Expose the two new options for controlling the memory usage of the
>> overlap check implementation via QAPI.
>>
>> Signed-off-by: Max Reitz <mre...@redhat.com>
>> ---
>>  qapi/block-core.json | 37 +
>>  1 file changed, 37 insertions(+)
>>
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index 1c17224..99456e6 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -1509,6 +1509,39 @@
>>              'mode': 'Qcow2OverlapCheckMode' } }
>>
>>  ##
>> +# @Qcow2OverlapStructures
>> +#
>> +# Contains options for controlling the behavior of the metadata overlap
>> +# prevention structures.
>> +#
>> +# The primary structure used for overlap check and prevention is a bitmap
>> +# (actually a bytemap) which has one entry per cluster designating the
>> +# type(s) of metadata it contains. In order to save memory, there is an
>> +# RLE-like representation of this bitmap, too.
>> +#
>> +# The whole array of clusters is chunked. The RLE representation of one
>> +# chunk is converted to the equivalent bitmap whenever a cluster in the
>> +# chunk is accessed. The bitmaps are kept for a limited number of chunks
>> +# (as a cache). On cache eviction, the bitmap is converted back to RLE
>> +# again.
>
> Feels a bit verbose at describing the internal representation, especially
> if we change the internals at some point in the future. I'm more interested
> that there are two limits at play, and that tuning them larger may consume
> more memory but offer better protection on large images.

Well, that works for total-size-limit, but it won't work as an explanation for bitmap-size. I could rename it to cache-size and just say that it's the size of some (opaque) internal cache, though.

>> +#
>> +# @bitmap-size: #optional size (in bytes) of the bitmap cache, defaults to
>> +#               64 kB
>
> Any requirement on it being a power of 2, or even a recommendation on how
> large or small to make it in relation to an image size?

No requirements, and I can't give recommendations.
We do have a fixed default for the metadata caches (refcount blocks and L2 tables), too, and this is basically the same thing when it comes to choosing the right size. For a large image, you have multiple e.g. L2 tables and also multiple windows concerning the overlap checks; therefore, if you have a lot of totally random accesses, both caches will fail. Considering that we seem to have accepted this for the metadata caches, I think it should be fine here, too.

>> +# @total-size-limit: #optional maximum total size (in bytes) of all the metadata
>
> What happens if bitmap-size is larger than total-size-limit?

Interesting question. Well, what does happen is that at some point it tries to create a bitmap for the cache, and that will fail due to the size limit, so the event will get thrown. The problem is that accessing any area of the qcow2 file which does not yet have an entry in the bitmap cache will result in hitting the limit, so there is no protection for these areas.

The problem is that it's difficult to give a limit for the ratio of bitmap-size to total-size-limit which always works. A fragments list (the RLE representation) will never exceed the bitmap in size, so using 0.5 as the maximum ratio seems to make sense. However, there are other structures as well, such as the array of windows which depends on the size of the qcow2 file, so depending on the file, the maximum ratio may get infinitely small.

So I guess the best thing to do would be to detect the limit excess condition when creating a new bitmap, and then instead of trying to occupy a previously empty cache entry, evict a non-empty one (repeat until there is enough memory available, or the cache is empty, in which case it is justified to complain about hitting the memory limit).
>> +#                    overlap prevention data structures combined; if this
>> +#                    limit is exceeded, a QCOW2_OVERLAP_CHECK_MEMORY_LIMIT_REACHED
>> +#                    event will be emitted and some parts of the image may
>> +#                    no longer be protected against erroneous overwriting
>> +#                    of metadata by (meta)data of a different kind.
>> +#                    Defaults to SIZE_MAX.
>> +#
>> +# Since: 2.4
>> +##
>> +{ 'type': 'Qcow2OverlapStructures',
>
> Rebase conflict with my series that does s/type/struct/

Will fix.

Thanks for reviewing,

Max

>> +  'data': { '*bitmap-size': 'int',
>> +            '*total-size-limit': 'int' } }
>> +
>> +##
>>  # @BlockdevOptionsQcow2
>>  #
>>  # Driver specific block device options for qcow2.
>> @@ -1530,6 +1563,9 @@
>>  # @overlap-check: #optional which overlap checks to perform for writes
>>  #                 to the image, defaults to 'cached' (since 2.2)
>>  #
>> +# @overlap-structures: #optional options for controlling the behavior of the
>> +#                      metadata overlap prevention structures (since 2.4)
>> +#
>>  # @cache-size: #optional the
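The bitmap/RLE duality described in the QAPI docs above can be illustrated with a tiny standalone sketch (hypothetical encoding, not the actual qcow2 implementation): each run stores a metadata type and a length, a chunk's RLE form expands to a per-cluster bytemap on access, and the bytemap compresses back to RLE on cache eviction.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CHUNK_CLUSTERS 16
#define MAX_RUNS       CHUNK_CLUSTERS

/* One RLE run: `len` consecutive clusters of metadata type `type`. */
typedef struct { uint8_t type; uint8_t len; } Run;

/* Expand an RLE chunk into a per-cluster bytemap; returns clusters written. */
static int rle_to_bytemap(const Run *runs, int nruns, uint8_t *map)
{
    int pos = 0;
    for (int r = 0; r < nruns; r++) {
        for (int j = 0; j < runs[r].len; j++) {
            map[pos++] = runs[r].type;
        }
    }
    return pos;
}

/* Compress a bytemap back into RLE; returns the number of runs. */
static int bytemap_to_rle(const uint8_t *map, int n, Run *runs)
{
    int nruns = 0;
    for (int i = 0; i < n; i++) {
        if (nruns > 0 && runs[nruns - 1].type == map[i] &&
            runs[nruns - 1].len < UINT8_MAX) {
            runs[nruns - 1].len++;    /* extend the current run */
        } else {
            runs[nruns++] = (Run){ .type = map[i], .len = 1 };
        }
    }
    return nruns;
}
```

This also makes Max's sizing remark concrete: the run list can never have more entries than the bytemap has clusters, which is why a 0.5 ratio of bitmap-size to total-size-limit is the natural upper bound for that particular pair of structures.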
[Qemu-block] [PATCH 2/7] qcow2: simplify qcow2_cache_put() and qcow2_cache_entry_mark_dirty()
Since all tables are now stored together, it is possible to obtain the position of a particular table directly from its address, so the operation becomes O(1).

Signed-off-by: Alberto Garcia <be...@igalia.com>
---
 block/qcow2-cache.c | 32 +++-
 1 file changed, 15 insertions(+), 17 deletions(-)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index f0dfb69..6e78c8f 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -49,6 +49,16 @@ static inline void *qcow2_cache_get_table_addr(BlockDriverState *bs,
     return (uint8_t *) c->table_array + (size_t) table * s->cluster_size;
 }
 
+static inline int qcow2_cache_get_table_idx(BlockDriverState *bs,
+                                            Qcow2Cache *c, void *table)
+{
+    BDRVQcowState *s = bs->opaque;
+    ptrdiff_t table_offset = (uint8_t *) table - (uint8_t *) c->table_array;
+    int idx = table_offset / s->cluster_size;
+    assert(idx >= 0 && idx < c->size && table_offset % s->cluster_size == 0);
+    return idx;
+}
+
 Qcow2Cache *qcow2_cache_create(BlockDriverState *bs, int num_tables)
 {
     BDRVQcowState *s = bs->opaque;
@@ -337,16 +347,12 @@ int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset,
 
 int qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table)
 {
-    int i;
+    int i = qcow2_cache_get_table_idx(bs, c, *table);
 
-    for (i = 0; i < c->size; i++) {
-        if (qcow2_cache_get_table_addr(bs, c, i) == *table) {
-            goto found;
-        }
+    if (c->entries[i].offset == 0) {
+        return -ENOENT;
     }
-    return -ENOENT;
 
-found:
     c->entries[i].ref--;
     *table = NULL;
 
@@ -357,15 +363,7 @@ found:
 void qcow2_cache_entry_mark_dirty(BlockDriverState *bs, Qcow2Cache *c,
                                   void *table)
 {
-    int i;
-
-    for (i = 0; i < c->size; i++) {
-        if (qcow2_cache_get_table_addr(bs, c, i) == table) {
-            goto found;
-        }
-    }
-    abort();
-
-found:
+    int i = qcow2_cache_get_table_idx(bs, c, table);
+    assert(c->entries[i].offset != 0);
     c->entries[i].dirty = true;
 }
-- 
2.1.4
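The O(1) inverse mapping relies on plain pointer arithmetic: since every table lives at `base + idx * cluster_size`, the index is recoverable by division. A standalone sketch (hypothetical names, fixed sizes for simplicity):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define CLUSTER_SIZE 4096
#define NUM_TABLES   8

/* address -> table index, the inverse of `base + idx * CLUSTER_SIZE`.
 * The modulo check catches pointers that don't sit on a table boundary. */
static int get_table_idx(const uint8_t *table_array, const void *table)
{
    ptrdiff_t off = (const uint8_t *) table - table_array;
    assert(off >= 0 && off % CLUSTER_SIZE == 0);
    return (int)(off / CLUSTER_SIZE);
}
```

This is what lets qcow2_cache_put and qcow2_cache_entry_mark_dirty drop their linear scans over the entry array.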
[Qemu-block] [PATCH v2 0/7] qcow2 L2/refcount cache improvements
New version of the qcow2 cache patches:

v2:
- Don't do pointer arithmetic on void *
- Rename table_addr() to qcow2_cache_get_table_addr()
- Add qcow2_cache_get_table_idx()
- Cast cache size to size_t to prevent overflows
- Make qcow2_cache_put() a void function
- Don't store the cluster size in the cache, get it from the BDS instead

v1: https://lists.nongnu.org/archive/html/qemu-devel/2015-04/msg04355.html

Regards,

Berto

Alberto Garcia (7):
  qcow2: use one single memory block for the L2/refcount cache tables
  qcow2: simplify qcow2_cache_put() and qcow2_cache_entry_mark_dirty()
  qcow2: use an LRU algorithm to replace entries from the L2 cache
  qcow2: remove qcow2_cache_find_entry_to_replace()
  qcow2: use a hash to look for entries in the L2 cache
  qcow2: make qcow2_cache_put() a void function
  qcow2: style fixes in qcow2-cache.c

 block/qcow2-cache.c    | 169 ++---
 block/qcow2-cluster.c  |  62 +-
 block/qcow2-refcount.c |  37 +++
 block/qcow2.h          |   5 +-
 4 files changed, 104 insertions(+), 169 deletions(-)

-- 
2.1.4
[Qemu-block] [PATCH 7/7] qcow2: style fixes in qcow2-cache.c
Fix pointer declarations to make them consistent with the rest of the code.

Signed-off-by: Alberto Garcia <be...@igalia.com>
---
 block/qcow2-cache.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index 0f629c4..bea43c1 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -35,8 +35,8 @@ typedef struct Qcow2CachedTable {
 } Qcow2CachedTable;
 
 struct Qcow2Cache {
-    Qcow2CachedTable*       entries;
-    struct Qcow2Cache*      depends;
+    Qcow2CachedTable       *entries;
+    struct Qcow2Cache      *depends;
     int                     size;
     bool                    depends_on_flush;
     void                   *table_array;
 };
@@ -81,7 +81,7 @@ Qcow2Cache *qcow2_cache_create(BlockDriverState *bs, int num_tables)
     return c;
 }
 
-int qcow2_cache_destroy(BlockDriverState* bs, Qcow2Cache *c)
+int qcow2_cache_destroy(BlockDriverState *bs, Qcow2Cache *c)
 {
     int i;
-- 
2.1.4
Re: [Qemu-block] [RFC PATCH 6/7] virtio-blk: Don't handle output when there is device IO op blocker
On 06/05/2015 14:20, Fam Zheng wrote:

>> Does non-dataplane need to do anything, since it uses iohandlers rather than aio_set_event_notifier_handler?
>
> I guess it's not needed for this specific bug. See it as an attempt at a general-purpose pause mechanism for the device, an investment for the future, for example bdrv_aio_drain. ;-) I can drop it in v2 if you think the idea is not very helpful.

It's slightly more complicated, but I think it's worth the extra coverage testing that this provides.

Paolo
Re: [Qemu-block] [PATCH v4 13/17] qcow2/overlaps: Add memory limit reached event
On 04.05.2015 21:32, Eric Blake wrote:
> On 05/04/2015 01:15 PM, Max Reitz wrote:
>> Later, a mechanism to set a limit on how much memory may be used for the
>> overlap prevention structures will be introduced. If that limit is about
>> to be exceeded, a QMP event should be emitted. This very event is
>> specified by this patch.
>>
>> Signed-off-by: Max Reitz <mre...@redhat.com>
>> ---
>>  docs/qmp/qmp-events.txt | 28
>>  qapi/event.json         | 27 +++
>>  2 files changed, 55 insertions(+)
>>
>> +Data:
>> +- "reference": Device name if set; node name otherwise. (json-string)
>> +- "start":     Offset of the range of clusters (possibly) no longer being
>> +               checked for writes overlapping with existing metadata.
>> +               (json-int, optional)
>> +- "length":    Length of that range in bytes. (json-int, optional)
>> +
>> +Example:
>> +
>> +{ "event": "QCOW2_OVERLAP_CHECK_MEMORY_LIMIT_REACHED",
>> +  "data": { "reference": "virtio0", "start": "805306368",
>> +            "length": "268435456" },
>
> s/"805306368"/805306368/ and likewise for length (a json-int does not use
> quotes).

Oops, right. Will fix.

> Otherwise seems okay.

Good, thanks :-)

Max
[Qemu-block] [PATCH 5/7] qcow2: use a hash to look for entries in the L2 cache
The current cache algorithm always traverses the array starting from the beginning, so the average number of comparisons needed to perform a lookup is proportional to the size of the array.

By using a hash of the offset as the starting point, lookups are faster and independent from the array size. The hash is computed using the cluster number of the table, multiplied by 4 to make it perform better when there are collisions.

In my tests, using a cache with 2048 entries, this reduces the average number of comparisons per lookup from 430 to 2.5.

Signed-off-by: Alberto Garcia <be...@igalia.com>
---
 block/qcow2-cache.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index dffb887..3137ba8 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -248,6 +248,7 @@ static int qcow2_cache_do_get(BlockDriverState *bs, Qcow2Cache *c,
     BDRVQcowState *s = bs->opaque;
     int i;
     int ret;
+    int lookup_index;
     uint64_t min_lru_counter = UINT64_MAX;
     int min_lru_index = -1;
 
@@ -255,7 +256,8 @@ static int qcow2_cache_do_get(BlockDriverState *bs, Qcow2Cache *c,
                           offset, read_from_disk);
 
     /* Check if the table is already cached */
-    for (i = 0; i < c->size; i++) {
+    i = lookup_index = (offset / s->cluster_size * 4) % c->size;
+    do {
         const Qcow2CachedTable *t = &c->entries[i];
         if (t->offset == offset) {
             goto found;
@@ -264,7 +266,10 @@ static int qcow2_cache_do_get(BlockDriverState *bs, Qcow2Cache *c,
             min_lru_counter = t->lru_counter;
             min_lru_index = i;
         }
-    }
+        if (++i == c->size) {
+            i = 0;
+        }
+    } while (i != lookup_index);
 
     if (min_lru_index == -1) {
         /* This can't happen in current synchronous code, but leave the check
-- 
2.1.4
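The lookup is a circular scan that starts at a hash of the table's cluster number instead of index 0, wrapping around at most once. A standalone sketch of just the probing (hypothetical names, LRU bookkeeping omitted):

```c
#include <assert.h>
#include <stdint.h>

#define CLUSTER_SIZE 4096
#define CACHE_SIZE   16

typedef struct { uint64_t offset; } Entry;

/* Start the scan at a hash of the offset (cluster number * 4, as in the
 * patch); wrap around once.  Returns the matching index, or -1 on a
 * cache miss (the full circle was completed). */
static int cache_lookup(const Entry *entries, uint64_t offset)
{
    int start = (int)((offset / CLUSTER_SIZE * 4) % CACHE_SIZE);
    int i = start;
    do {
        if (entries[i].offset == offset) {
            return i;
        }
        if (++i == CACHE_SIZE) {
            i = 0;
        }
    } while (i != start);
    return -1;
}
```

When entries are also inserted near their hash slot, a hit is usually found within a couple of probes, which matches the 430 → 2.5 comparison drop reported in the commit message.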
[Qemu-block] [PATCH] qcow2: Flush pending discards before allocating cluster
Before a freed cluster can be reused, pending discards for this cluster must be processed. The original assumption was that this was not a problem because discards are only cached during discard/write zeroes operations, which are synchronous so that no concurrent write requests can cause cluster allocations. However, the discard/write zeroes operation itself can allocate a new L2 table (and it has to in order to put zero flags there), so make sure we can cope with the situation. This fixes https://bugs.launchpad.net/bugs/1349972. Signed-off-by: Kevin Wolf kw...@redhat.com --- block/qcow2-refcount.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c index f47260b..83467c3 100644 --- a/block/qcow2-refcount.c +++ b/block/qcow2-refcount.c @@ -833,6 +833,11 @@ static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size) uint64_t i, nb_clusters, refcount; int ret; +/* We can't allocate clusters if they may still be queued for discard. */ +if (s->cache_discards) { +qcow2_process_discards(bs, 0); +} + nb_clusters = size_to_clusters(s, size); retry: for (i = 0; i < nb_clusters; i++) { -- 1.8.3.1
[Qemu-block] [RFC PATCH 7/7] blockdev: Add device IO op blocker during snapshot transaction
Signed-off-by: Fam Zheng f...@redhat.com --- blockdev.c | 7 +++++++ 1 file changed, 7 insertions(+) diff --git a/blockdev.c b/blockdev.c index 5eaf77e..859fa2e 100644 --- a/blockdev.c +++ b/blockdev.c @@ -1398,6 +1398,7 @@ typedef struct ExternalSnapshotState { BlockDriverState *old_bs; BlockDriverState *new_bs; AioContext *aio_context; +Error *blocker; } ExternalSnapshotState; static void external_snapshot_prepare(BlkTransactionState *common, @@ -1519,6 +1520,9 @@ static void external_snapshot_prepare(BlkTransactionState *common, if (ret != 0) { error_propagate(errp, local_err); } + +error_setg(&state->blocker, "snapshot in progress"); +bdrv_op_block(state->old_bs, BLOCK_OP_TYPE_DEVICE_IO, state->blocker); } static void external_snapshot_commit(BlkTransactionState *common) @@ -1536,6 +1540,9 @@ static void external_snapshot_commit(BlkTransactionState *common) bdrv_reopen(state->new_bs, state->new_bs->open_flags & ~BDRV_O_RDWR, NULL); +bdrv_op_unblock(state->old_bs, BLOCK_OP_TYPE_DEVICE_IO, state->blocker); +error_free(state->blocker); + aio_context_release(state->aio_context); } -- 1.9.3
[Qemu-block] [RFC PATCH 4/7] block-backend: Add blk_op_blocker_add_notifier
Forward the call to bdrv_op_blocker_add_notifier. Signed-off-by: Fam Zheng f...@redhat.com --- block.c | 4 ++-- block/block-backend.c | 6 ++++++ include/sysemu/block-backend.h | 2 ++ 3 files changed, 10 insertions(+), 2 deletions(-) diff --git a/block.c b/block.c index 054ddb4..d98a304 100644 --- a/block.c +++ b/block.c @@ -3414,23 +3414,23 @@ void bdrv_op_block(BlockDriverState *bs, BlockOpType op, Error *reason) BdrvOpBlocker *blocker; assert((int) op >= 0 && op < BLOCK_OP_TYPE_MAX); -bdrv_op_blocker_notify(bs, op, reason, true); blocker = g_new0(BdrvOpBlocker, 1); blocker->reason = reason; QLIST_INSERT_HEAD(&bs->op_blockers[op], blocker, list); +bdrv_op_blocker_notify(bs, op, reason, true); } void bdrv_op_unblock(BlockDriverState *bs, BlockOpType op, Error *reason) { BdrvOpBlocker *blocker, *next; assert((int) op >= 0 && op < BLOCK_OP_TYPE_MAX); -bdrv_op_blocker_notify(bs, op, reason, false); QLIST_FOREACH_SAFE(blocker, &bs->op_blockers[op], list, next) { if (blocker->reason == reason) { QLIST_REMOVE(blocker, list); g_free(blocker); } } +bdrv_op_blocker_notify(bs, op, reason, false); } void bdrv_op_block_all(BlockDriverState *bs, Error *reason) diff --git a/block/block-backend.c b/block/block-backend.c index 71fc695..90d7476 100644 --- a/block/block-backend.c +++ b/block/block-backend.c @@ -806,6 +806,12 @@ void blk_op_unblock_all(BlockBackend *blk, Error *reason) bdrv_op_unblock_all(blk->bs, reason); } +void blk_op_blocker_add_notifier(BlockBackend *blk, + Notifier *notifier) +{ +bdrv_op_blocker_add_notifier(blk->bs, notifier); +} + AioContext *blk_get_aio_context(BlockBackend *blk) { return bdrv_get_aio_context(blk->bs); diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h index b4a4d5e..cde9651 100644 --- a/include/sysemu/block-backend.h +++ b/include/sysemu/block-backend.h @@ -136,6 +136,8 @@ int blk_get_flags(BlockBackend *blk); int blk_get_max_transfer_length(BlockBackend *blk); void blk_set_guest_block_size(BlockBackend *blk, int align); void *blk_blockalign(BlockBackend *blk, size_t size); +void blk_op_blocker_add_notifier(BlockBackend *blk, + Notifier *notifier); bool blk_op_is_blocked(BlockBackend *blk, BlockOpType op, Error **errp); void blk_op_unblock(BlockBackend *blk, BlockOpType op, Error *reason); void blk_op_block_all(BlockBackend *blk, Error *reason); -- 1.9.3
[Qemu-block] [PATCH 3/7] qcow2: use an LRU algorithm to replace entries from the L2 cache
The current algorithm to evict entries from the cache always gives preference to those in the lowest positions. As the size of the cache increases, the chances of the later elements being removed decrease exponentially. In a scenario with random I/O and lots of cache misses, entries in positions 8 and higher are rarely (if ever) evicted. This can be seen even with the default cache size, but with larger caches the problem becomes more obvious. Using an LRU algorithm makes the chances of being removed from the cache independent of the position. Signed-off-by: Alberto Garcia be...@igalia.com --- block/qcow2-cache.c | 31 +++++++++++++++---------------- 1 file changed, 15 insertions(+), 16 deletions(-) diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c index 6e78c8f..786c10a 100644 --- a/block/qcow2-cache.c +++ b/block/qcow2-cache.c @@ -28,10 +28,10 @@ #include "trace.h" typedef struct Qcow2CachedTable { -int64_t offset; -bool dirty; -int cache_hits; -int ref; +int64_t offset; +bool dirty; +uint64_t lru_counter; +int ref; } Qcow2CachedTable; struct Qcow2Cache { @@ -40,6 +40,7 @@ struct Qcow2Cache { int size; bool depends_on_flush; void *table_array; +uint64_t lru_counter; }; static inline void *qcow2_cache_get_table_addr(BlockDriverState *bs, @@ -233,16 +234,18 @@ int qcow2_cache_empty(BlockDriverState *bs, Qcow2Cache *c) for (i = 0; i < c->size; i++) { assert(c->entries[i].ref == 0); c->entries[i].offset = 0; -c->entries[i].cache_hits = 0; +c->entries[i].lru_counter = 0; } +c->lru_counter = 0; + return 0; } static int qcow2_cache_find_entry_to_replace(Qcow2Cache *c) { int i; -int min_count = INT_MAX; +uint64_t min_lru_counter = UINT64_MAX; int min_index = -1; @@ -251,15 +254,9 @@ static int qcow2_cache_find_entry_to_replace(Qcow2Cache *c) continue; } -if (c->entries[i].cache_hits < min_count) { +if (c->entries[i].lru_counter < min_lru_counter) { min_index = i; -min_count = c->entries[i].cache_hits; -} - -/* Give newer hits priority */ -/* TODO Check how to optimize the replacement strategy */ -if (c->entries[i].cache_hits > 1) { -c->entries[i].cache_hits /= 2; +min_lru_counter = c->entries[i].lru_counter; } } @@ -318,12 +315,10 @@ static int qcow2_cache_do_get(BlockDriverState *bs, Qcow2Cache *c, /* Give the table some hits for the start so that it won't be replaced * immediately. The number 32 is completely arbitrary. */ -c->entries[i].cache_hits = 32; c->entries[i].offset = offset; /* And return the right table */ found: -c->entries[i].cache_hits++; c->entries[i].ref++; *table = qcow2_cache_get_table_addr(bs, c, i); @@ -356,6 +351,10 @@ int qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table) c->entries[i].ref--; *table = NULL; +if (c->entries[i].ref == 0) { +c->entries[i].lru_counter = ++c->lru_counter; +} + assert(c->entries[i].ref >= 0); return 0; } -- 2.1.4
Re: [Qemu-block] [Qemu-devel] [PATCH 6/6] qcow2: style fixes in qcow2-cache.c
On Thu, Apr 30, 2015 at 01:11:45PM +0300, Alberto Garcia wrote: Fix pointer declaration to make it consistent with the rest of the code. Signed-off-by: Alberto Garcia be...@igalia.com --- block/qcow2-cache.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) :) Reviewed-by: Stefan Hajnoczi stefa...@redhat.com
Re: [Qemu-block] [Qemu-devel] [PATCH v2 0/3] block: Warn about usage of growing formats over non-growable protocols
On 06/05/2015 18:12, Max Reitz wrote: I very much think it would be worth fixing, if there wasn't the problem with legitimate use cases throwing unnecessary warnings. Right. I remember having a discussion with Kevin about this series (v1) regarding qcow2 on LVM; I think my point was that the warning is basically still correct, or only needs rewording (oops, I guess I did forget that in v2). If you are using qcow2 on LVM, you need to know exactly what you are doing, so a warning about this is indeed appropriate (in my opinion, that is). There's another thing to check. In the BZ you linked you got an EINVAL or EIO. Why didn't you get an ENOSPC? Can you check if virtio-scsi gives ENOSPC? If so, you could perhaps only warn for werror=report. But even then, there are legitimate cases where you want the guest to see the ENOSPC. In fact, that's the reason why virtio-scsi converts ENOSPC to a SCSI SPACE ALLOCATION FAILED sense code. :) So I think if we can word the warning in a way to make it clear that there are legitimate use cases, but you need to know what you are doing, I think it's worth having this warning. Users who know what they're doing won't be surprised or at least will know what it means, while users who don't know what it means most probably don't know what they're doing and thus the warning is appropriate for them. I don't know... But then, I'm not a maintainer of this code. :) Paolo
Re: [Qemu-block] [Qemu-devel] [PATCH v2 0/3] block: Warn about usage of growing formats over non-growable protocols
On 06.05.2015 18:20, Paolo Bonzini wrote: On 06/05/2015 18:12, Max Reitz wrote: I very much think it would be worth fixing, if there wasn't the problem with legitimate use cases throwing unnecessary warnings. Right. I remember having a discussion with Kevin about this series (v1) regarding qcow2 on LVM; I think my point was that the warning is basically still correct, or only needs rewording (oops, I guess I did forget that in v2). If you are using qcow2 on LVM, you need to know exactly what you are doing, so a warning about this is indeed appropriate (in my opinion, that is). There's another thing to check. In the BZ you linked you got an EINVAL or EIO. Why didn't you get an ENOSPC? Because qcow2 tries to write beyond the end of the file; the NBD client implementation passes that on to the server, and the server simply reports an error (which the NBD client turns into EIO). We could make the NBD client detect this condition and report ENOSPC immediately. But I don't think this would improve matters, for people would then complain Linux reports 'no space left on device' in qemu, while df reports that there is enough space available. It's the same thing, people don't know what they're doing and nobody warned them that what they are doing might be wrong. Can you check if virtio-scsi gives ENOSPC? In which configuration? Using virtio-scsi on top of qcow2 on top of some SCSI passthrough block driver? If so, you could perhaps only warn for werror=report. But even then, there are legitimate cases where you want the guest to see the ENOSPC. In fact, that's the reason why virtio-scsi converts ENOSPC to a SCSI SPACE ALLOCATION FAILED sense code. :) Sounds like we ought to make NBD return ENOSPC no matter the fate of this series. The problem with only warning for a certain non-default configuration is that people who don't know what they are doing are more likely to use the default configuration, so I'd like the warning to appear then. 
So I think if we can word the warning in a way to make it clear that there are legitimate use cases, but you need to know what you are doing, I think it's worth having this warning. Users who know what they're doing won't be surprised or at least will know what it means, while users who don't know what it means most probably don't know what they're doing and thus the warning is appropriate for them. I don't know... But then, I'm not a maintainer of this code. :) Well, this is not about this code in particular, but more about qemu's interface design in general, so I'm grateful about any opinion on it. :-) Max
Re: [Qemu-block] [Qemu-devel] [PATCH v2 0/3] block: Warn about usage of growing formats over non-growable protocols
On 06.05.2015 17:30, Paolo Bonzini wrote: On 06/05/2015 15:04, Max Reitz wrote: Introducing a warning for a normal QEMU invocation is a bit weird. What is the point of this series? Were users confused that they hit ENOSPC? Users were confused when exporting a qcow2 image using nbd-server instead of qemu-img, and then accessing that NBD export with qemu (subsequently getting I/O errors on guest writes, if the image is not yet fully allocated): http://bugzilla.redhat.com/show_bug.cgi?id=1090713 I think NBD exports of non-raw images falls within the case of user should know what they're doing. In particular, you don't even need metadata preallocation, you just need a truncate -s10G file.qcow2 before invoking nbd-server. Well, actually, you need to export it with qemu-nbd instead of nbd-server; which this series is trying to tell the users. So I think it's not worth fixing this, even though I see how it can be a minor UI/UX issue. I very much think it would be worth fixing, if there wasn't the problem with legitimate use cases throwing unnecessary warnings. I remember having a discussion with Kevin about this series (v1) regarding qcow2 on LVM; I think my point was that the warning is basically still correct, or only needs rewording (oops, I guess I did forget that in v2). If you are using qcow2 on LVM, you need to know exactly what you are doing, so a warning about this is indeed appropriate (in my opinion, that is). So I think if we can word the warning in a way to make it clear that there are legitimate use cases, but you need to know what you are doing, I think it's worth having this warning. Users who know what they're doing won't be surprised or at least will know what it means, while users who don't know what it means most probably don't know what they're doing and thus the warning is appropriate for them. 
And if you're using management software (which hopefully does know what it's doing), the warning shouldn't be prominently visible anyway (in case of using libvirt, stderr is written to a log file, right?). Max
Re: [Qemu-block] [Qemu-devel] [PATCH] use bdrv_flush to provide barrier semantic in block/vdi.c for metadata updates
CC-ing qemu-block and Stefan Weil (maintainer of vdi). On 06.05.2015 19:23, phoeagon wrote: Thanks for your input. So I changed it to: 1. Only call bdrv_flush when bdrv_pwrite was successful 2. Only if bdrv_flush was unsuccessful is the return value of vdi_co_write updated. One of both is enough. Both are too much. :-) It is indeed correct, technically (because ret is 0 before the bdrv_write()), but it's too verbose. (See below) In this way we try to avoid messing up any potential return value checks while still propagating bdrv_flush errors. That return value was a catch and I admit I'm no pro with the return value convention in QEMU. bdrv_pwrite doesn't return the same value as bdrv_pwrite_sync, I assume (they do return negative values when they fail, but different values when successful). It doesn't really matter, I think. Returning any non-negative value from vdi_co_write() should be enough to signal success. --- Signed-off-by: Zhe Qiu address@hidden From 19b2fabbe00765b418362d8c1891f266091621f3 Mon Sep 17 00:00:00 2001 From: phoeagon address-hidden Date: Thu, 7 May 2015 01:09:38 +0800 Subject: [PATCH] block/vdi: Use bdrv_flush after metadata updates In reference to b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2, metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes. --- block/vdi.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/block/vdi.c b/block/vdi.c index 5d09b36..54a5fa8 100644 --- a/block/vdi.c +++ b/block/vdi.c @@ -713,7 +713,11 @@ static int vdi_co_write(BlockDriverState *bs, logout("will write %u block map sectors starting from entry %u\n", n_sectors, bmap_first); ret = bdrv_write(bs->file, offset, base, n_sectors); +if (!(ret < 0)) { +int flush_ret = bdrv_flush(bs->file); +if (flush_ret < 0) +ret = flush_ret; +} I think bdrv_write() always returns 0 on success.
In any case, it's fine for vdi_co_write() to return 0 on success (which is what bdrv_flush() returns), so shortening these four lines to ret = bdrv_flush(bs->file); is enough. The patch is correct, though, so if you want to leave it as it is, all you need to do is bring it into proper form (http://wiki.qemu.org/Contribute/SubmitAPatch). The previous version was nearly right, except for the things I mentioned: The subject needs to start with the part of qemu the patch is targeting (in this case "block/vdi: " or simply "vdi: "), the Signed-off-by needs to contain your name (or any alias you desire) and your email address, and comments for the patch should be separated from the actual commit message by "---". Finally, for sending the next version, please change the [PATCH] in the subject to [PATCH v3] in order to indicate that it will be version 3 of this patch. Thanks! Max } return ret; -- 2.4.0
Re: [Qemu-block] [PATCH COLO v3 10/14] util/hbitmap: Add an API to reset all set bits in hbitmap
On 05/02/2015 12:47 AM, John Snow wrote: On 04/03/2015 07:05 AM, Paolo Bonzini wrote: On 03/04/2015 12:01, Wen Congyang wrote: Signed-off-by: Wen Congyang we...@cn.fujitsu.com Signed-off-by: zhanghailiang zhang.zhanghaili...@huawei.com Signed-off-by: Gonglei arei.gong...@huawei.com --- include/qemu/hbitmap.h | 8 ++++++++ tests/test-hbitmap.c | 39 +++++++++++++++++++++++++++++++++++++++ util/hbitmap.c | 16 ++++++++++++++++ 3 files changed, 63 insertions(+) diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h index 550d7ce..95a55e4 100644 --- a/include/qemu/hbitmap.h +++ b/include/qemu/hbitmap.h @@ -109,6 +109,14 @@ void hbitmap_set(HBitmap *hb, uint64_t start, uint64_t count); void hbitmap_reset(HBitmap *hb, uint64_t start, uint64_t count); /** + * hbitmap_reset_all: + * @hb: HBitmap to operate on. + * + * Reset all bits in an HBitmap. + */ +void hbitmap_reset_all(HBitmap *hb); + +/** * hbitmap_get: + * @hb: HBitmap to operate on. * @item: Bit to query (0-based). diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c index 8c902f2..1f0078a 100644 --- a/tests/test-hbitmap.c +++ b/tests/test-hbitmap.c @@ -11,6 +11,7 @@ #include <glib.h> #include <stdarg.h> +#include <string.h> #include "qemu/hbitmap.h" #define LOG_BITS_PER_LONG (BITS_PER_LONG == 32 ? 5 : 6) @@ -143,6 +144,23 @@ static void hbitmap_test_reset(TestHBitmapData *data, } } +static void hbitmap_test_reset_all(TestHBitmapData *data) +{ +size_t n; + +hbitmap_reset_all(data->hb); + +n = (data->size + BITS_PER_LONG - 1) / BITS_PER_LONG; +if (n == 0) { +n = 1; +} +memset(data->bits, 0, n * sizeof(unsigned long)); + +if (data->granularity == 0) { +hbitmap_test_check(data, 0); +} +} + static void hbitmap_test_check_get(TestHBitmapData *data) { uint64_t count = 0; @@ -323,6 +341,26 @@ static void test_hbitmap_reset(TestHBitmapData *data, hbitmap_test_set(data, L3 / 2, L3); } +static void test_hbitmap_reset_all(TestHBitmapData *data, + const void *unused) +{ +hbitmap_test_init(data, L3 * 2, 0); +hbitmap_test_set(data, L1 - 1, L1 + 2); +hbitmap_test_reset_all(data); +hbitmap_test_set(data, 0, L1 * 3); +hbitmap_test_reset_all(data); +hbitmap_test_set(data, L2, L1); +hbitmap_test_reset_all(data); +hbitmap_test_set(data, L2, L3 - L2 + 1); +hbitmap_test_reset_all(data); +hbitmap_test_set(data, L3 - 1, 3); +hbitmap_test_reset_all(data); +hbitmap_test_set(data, 0, L3 * 2); +hbitmap_test_reset_all(data); +hbitmap_test_set(data, L3 / 2, L3); +hbitmap_test_reset_all(data); +} + static void test_hbitmap_granularity(TestHBitmapData *data, const void *unused) { @@ -394,6 +432,7 @@ int main(int argc, char **argv) hbitmap_test_add("/hbitmap/set/overlap", test_hbitmap_set_overlap); hbitmap_test_add("/hbitmap/reset/empty", test_hbitmap_reset_empty); hbitmap_test_add("/hbitmap/reset/general", test_hbitmap_reset); +hbitmap_test_add("/hbitmap/reset/all", test_hbitmap_reset_all); hbitmap_test_add("/hbitmap/granularity", test_hbitmap_granularity); g_test_run(); diff --git a/util/hbitmap.c b/util/hbitmap.c index ab13971..acce93c 100644 --- a/util/hbitmap.c +++ b/util/hbitmap.c @@ -353,6 +353,22 @@ void hbitmap_reset(HBitmap *hb, uint64_t start, uint64_t count) hb_reset_between(hb, HBITMAP_LEVELS - 1, start, last); } +void hbitmap_reset_all(HBitmap *hb) +{ +uint64_t size = hb->size; +unsigned int i; + +/* Same as hbitmap_alloc() except memset() */ +for (i = HBITMAP_LEVELS; --i >= 1; ) { +size = MAX((size + BITS_PER_LONG - 1) >> BITS_PER_LEVEL, 1); +memset(hb->levels[i], 0, size * sizeof(unsigned long)); +} + For what it's worth, I recently added in a hb->sizes[i] cache to store the size of each array so you don't have to recompute this all the time. Yes, will fix it in the next version. Thanks, Wen Congyang +assert(size == 1); +hb->levels[0][0] = 1UL << (BITS_PER_LONG - 1); +hb->count = 0; +} + bool hbitmap_get(const HBitmap *hb, uint64_t item) { /* Compute position and bit in the last layer. */ Acked-by: Paolo Bonzini pbonz...@redhat.com
[Qemu-block] [PATCH v3] block/vdi: Use bdrv_flush after metadata updates
From: phoeagon phoea...@gmail.com In reference to b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2, metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding writes. bdrv_flush is only called when the write is successful. Signed-off-by: Zhe Qiu phoea...@gmail.com --- block/vdi.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/block/vdi.c b/block/vdi.c index 7642ef3..dfe8ade 100644 --- a/block/vdi.c +++ b/block/vdi.c @@ -713,6 +713,9 @@ static int vdi_co_write(BlockDriverState *bs, logout("will write %u block map sectors starting from entry %u\n", n_sectors, bmap_first); ret = bdrv_write(bs->file, offset, base, n_sectors); +if (ret >= 0) { +ret = bdrv_flush(bs->file); +} } return ret; -- 2.4.0
Re: [Qemu-block] [Qemu-devel] [PATCH v2 0/3] block: Warn about usage of growing formats over non-growable protocols
On 06/05/2015 15:04, Max Reitz wrote: Introducing a warning for a normal QEMU invocation is a bit weird. What is the point of this series? Were users confused that they hit ENOSPC? Users were confused when exporting a qcow2 image using nbd-server instead of qemu-img, and then accessing that NBD export with qemu (subsequently getting I/O errors on guest writes, if the image is not yet fully allocated): http://bugzilla.redhat.com/show_bug.cgi?id=1090713 I think NBD exports of non-raw images falls within the case of user should know what they're doing. In particular, you don't even need metadata preallocation, you just need a truncate -s10G file.qcow2 before invoking nbd-server. So I think it's not worth fixing this, even though I see how it can be a minor UI/UX issue. Paolo
Re: [Qemu-block] [Qemu-devel] [PATCH 2/6] qcow2: simplify qcow2_cache_put() and qcow2_cache_entry_mark_dirty()
On Tue, May 05, 2015 at 03:06:52PM +0200, Alberto Garcia wrote: On Fri 01 May 2015 04:31:52 PM CEST, Stefan Hajnoczi wrote: int qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table) { -int i; +int i = (*table - c->table_array) / c->table_size; -for (i = 0; i < c->size; i++) { -if (table_addr(c, i) == *table) { -goto found; -} +if (c->entries[i].offset == 0) { +return -ENOENT; } -return -ENOENT; -found: c->entries[i].ref--; *table = NULL; When is this function called with a bogus table pointer? I also could not imagine any such scenario, but I decided to be conservative and keep the error handling code. I'll double-check all places where it's used and remove the relevant code. The reason why I'd think this is worth looking into is: The new qcow2_cache_put() code indexes outside the array bounds if table is a bogus value. The old code would return -ENOENT. So I am a little nervous about the change, although I suspect the function is never called with invalid inputs at all. Stefan
Re: [Qemu-block] [RFC PATCH 3/7] block: Add op blocker notifier list
On Wed, 05/06 16:22, Paolo Bonzini wrote: On 06/05/2015 13:23, Fam Zheng wrote: void bdrv_op_block(BlockDriverState *bs, BlockOpType op, Error *reason) { BdrvOpBlocker *blocker; assert((int) op >= 0 && op < BLOCK_OP_TYPE_MAX); +bdrv_op_blocker_notify(bs, op, reason, true); blocker = g_new0(BdrvOpBlocker, 1); blocker->reason = reason; QLIST_INSERT_HEAD(&bs->op_blockers[op], blocker, list); @@ -3405,6 +3424,7 @@ void bdrv_op_unblock(BlockDriverState *bs, BlockOpType op, Error *reason) { BdrvOpBlocker *blocker, *next; assert((int) op >= 0 && op < BLOCK_OP_TYPE_MAX); +bdrv_op_blocker_notify(bs, op, reason, false); QLIST_FOREACH_SAFE(blocker, &bs->op_blockers[op], list, next) { if (blocker->reason == reason) { QLIST_REMOVE(blocker, list); Changed in the following patch. Bad git rebase -i, will fix. Also, should we only invoke the notifier if the list becomes empty/non-empty? Is there any reason why the notifier would like to know the particular Error that was added? Good point, this can be simplified. Fam
Re: [Qemu-block] [PATCH 1/6] qcow2: use one single memory block for the L2/refcount cache tables
On Tue, May 05, 2015 at 01:20:19PM +0200, Kevin Wolf wrote: Am 05.05.2015 um 12:28 hat Stefan Hajnoczi geschrieben: On Mon, May 04, 2015 at 12:58:13PM +0200, Kevin Wolf wrote: Am 01.05.2015 um 16:23 hat Stefan Hajnoczi geschrieben: On Thu, Apr 30, 2015 at 01:11:40PM +0300, Alberto Garcia wrote: Qcow2Cache *qcow2_cache_create(BlockDriverState *bs, int num_tables) { BDRVQcowState *s = bs->opaque; Qcow2Cache *c; -int i; c = g_new0(Qcow2Cache, 1); c->size = num_tables; +c->table_size = s->cluster_size; This assumes c->table_size meets bs' memory alignment requirements. The following would be safer: c->table_size = QEMU_ALIGN_UP(c->table_size, bdrv_opt_mem_align(bs->file)); You can't just access more than one cluster. You might be caching data and later write it back when it's stale. I don't mean I/O alignment, just memory alignment (i.e. the start address of the data buffer in memory). The start address is already taken care of by qemu_blockalign(). With rounding c->table_size, you'd align the length, and that would be wrong. It wasn't clear to me that c->table + n * c->table_size for all n is aligned, but I agree with you now: bdrv_qiov_is_aligned() checks both address and size against memory alignment. This means that if the I/O is memory aligned for table_size, then all multiples of table_size are also aligned. Stefan