Re: [Qemu-devel] QCOW2 cryptography and secure key handling
Do you (the block maintainers) have an idea on how the code could be improved to securely pass the crypto key to the QCOW2 code ? More generally, QCow2's current encryption support is woefully inadequate from a design POV. If we wanted better encryption built-in to QEMU it is best to just deprecate the current encryption support and define a new qcow2 extension based around something like the LUKS data format. Using the LUKS data format precisely would be good from a data portability POV, since then you can easily switch your images between LUKS encrypted block device qcow2-with-luks image file, without needing to re-encrypt the data. Thanks I will read the LUKS specification. Best regards Benoît
Re: [Qemu-devel] QCOW2 cryptography and secure key handling
More generally, QCow2's current encryption support is woefully inadequate from a design POV. If we wanted better encryption built-in to QEMU it is best to just deprecate the current encryption support and define a new qcow2 extension based around something like the LUKS data format. Using the LUKS data format precisely would be good from a data portability POV, since then you can easily switch your images between LUKS encrypted block device qcow2-with-luks image file, without needing to re-encrypt the data. I read the LUKS specification and undestood enough part of it to understand the potentials benefits (stronger encryption key, multiple user keys, possibility to change users keys). Kevin Stefan: What do you think about implementing LUKS in QCOW2 ? Best regards Benoît
Re: [Qemu-devel] [PATCH V2 for-1.6 2/5] block: Modify the throttling code to implement the leaky bucket algorithm.
1. Can we activate the timer only when requests are actually pending? Imagine a host with 1000 guests, even a 1 second timer becomes wasteful. I will try to do this. 2. You don't vary the wait time, does this mean a throttled request must wait for max 1 second? If yes, then it introduces a big variance on request latency. If iops or bps request where done on a negligible time (very fast storage backend) yes. We could make the timer frequency higher though to mitigate this. Best regards Benoît
Re: [Qemu-devel] [PATCH V2 for-1.6 2/5] block: Modify the throttling code to implement the leaky bucket algorithm.
I don't fully understand the patch yet but: The patch is a variant of the following algorithm. http://en.wikipedia.org/wiki/Leaky_bucket Best regards Benoît
[Qemu-devel] [PATCH V3 for-1.6 1/5] block: Repair the throttling code.
The throttling code was segfaulting since commit 02ffb504485f0920cfc75a0982a602f824a9a4f4 because some qemu_co_queue_next caller does not run in a coroutine. qemu_co_queue_do_restart assume that the caller is a coroutinne. As suggested by Stefan fix this by entering the coroutine directly. Also make sure like suggested that qemu_co_queue_next() and qemu_co_queue_restart_all() can be called only in coroutines. Signed-off-by: Benoit Canet ben...@irqsave.net --- block.c |8 include/block/coroutine.h |9 +++-- qemu-coroutine-lock.c | 20 ++-- 3 files changed, 29 insertions(+), 8 deletions(-) diff --git a/block.c b/block.c index b560241..dc72643 100644 --- a/block.c +++ b/block.c @@ -127,7 +127,8 @@ void bdrv_io_limits_disable(BlockDriverState *bs) { bs-io_limits_enabled = false; -while (qemu_co_queue_next(bs-throttled_reqs)); +while (qemu_co_enter_next(bs-throttled_reqs)) { +} if (bs-block_timer) { qemu_del_timer(bs-block_timer); @@ -143,7 +144,7 @@ static void bdrv_block_timer(void *opaque) { BlockDriverState *bs = opaque; -qemu_co_queue_next(bs-throttled_reqs); +qemu_co_enter_next(bs-throttled_reqs); } void bdrv_io_limits_enable(BlockDriverState *bs) @@ -1445,8 +1446,7 @@ void bdrv_drain_all(void) * a busy wait. */ QTAILQ_FOREACH(bs, bdrv_states, list) { -if (!qemu_co_queue_empty(bs-throttled_reqs)) { -qemu_co_queue_restart_all(bs-throttled_reqs); +while (qemu_co_enter_next(bs-throttled_reqs)) { busy = true; } } diff --git a/include/block/coroutine.h b/include/block/coroutine.h index 377805a..1f2db3e 100644 --- a/include/block/coroutine.h +++ b/include/block/coroutine.h @@ -130,12 +130,17 @@ void coroutine_fn qemu_co_queue_wait_insert_head(CoQueue *queue); * * Returns true if a coroutine was restarted, false if the queue is empty. */ -bool qemu_co_queue_next(CoQueue *queue); +bool coroutine_fn qemu_co_queue_next(CoQueue *queue); /** * Restarts all coroutines in the CoQueue and leaves the queue empty. */ -void qemu_co_queue_restart_all(CoQueue *queue); +void coroutine_fn qemu_co_queue_restart_all(CoQueue *queue); + +/** + * Enter the next coroutine in the queue + */ +bool qemu_co_enter_next(CoQueue *queue); /** * Checks if the CoQueue is empty. diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c index d9fea49..aeb33b9 100644 --- a/qemu-coroutine-lock.c +++ b/qemu-coroutine-lock.c @@ -88,16 +88,32 @@ static bool qemu_co_queue_do_restart(CoQueue *queue, bool single) return true; } -bool qemu_co_queue_next(CoQueue *queue) +bool coroutine_fn qemu_co_queue_next(CoQueue *queue) { +assert(qemu_in_coroutine()); return qemu_co_queue_do_restart(queue, true); } -void qemu_co_queue_restart_all(CoQueue *queue) +void coroutine_fn qemu_co_queue_restart_all(CoQueue *queue) { +assert(qemu_in_coroutine()); qemu_co_queue_do_restart(queue, false); } +bool qemu_co_enter_next(CoQueue *queue) +{ +Coroutine *next; + +next = QTAILQ_FIRST(queue-entries); +if (!next) { +return false; +} + +QTAILQ_REMOVE(queue-entries, next, co_queue_next); +qemu_coroutine_enter(next, NULL); +return true; +} + bool qemu_co_queue_empty(CoQueue *queue) { return (QTAILQ_FIRST(queue-entries) == NULL); -- 1.7.10.4
[Qemu-devel] [PATCH V3 for-1.6 3/5] block: Add support for throttling burst threshold in QMP and the command line.
The thresholds of the leaky bucket algorithm can be used to allow some burstiness. Signed-off-by: Benoit Canet ben...@irqsave.net --- block/qapi.c | 24 + blockdev.c | 105 +++--- hmp.c| 32 +++-- qapi-schema.json | 34 -- qemu-options.hx |2 +- qmp-commands.hx | 30 ++-- 6 files changed, 205 insertions(+), 22 deletions(-) diff --git a/block/qapi.c b/block/qapi.c index a4bc411..03f1604 100644 --- a/block/qapi.c +++ b/block/qapi.c @@ -235,6 +235,30 @@ void bdrv_query_info(BlockDriverState *bs, bs-io_limits.iops[BLOCK_IO_LIMIT_READ]; info-inserted-iops_wr = bs-io_limits.iops[BLOCK_IO_LIMIT_WRITE]; +info-inserted-has_bps_threshold = + bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL]; +info-inserted-bps_threshold = + bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL]; +info-inserted-has_bps_rd_threshold = + bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_READ]; +info-inserted-bps_rd_threshold = + bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_READ]; +info-inserted-has_bps_wr_threshold = + bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_WRITE]; +info-inserted-bps_wr_threshold = + bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_WRITE]; +info-inserted-has_iops_threshold = + bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL]; +info-inserted-iops_threshold = + bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL]; +info-inserted-iops_rd_threshold = + bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_READ]; +info-inserted-has_iops_rd_threshold = + bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_READ]; +info-inserted-has_iops_wr_threshold = + bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE]; +info-inserted-iops_wr_threshold = + bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE]; } bs0 = bs; diff --git a/blockdev.c b/blockdev.c index 491e4d0..241ebe8 100644 --- a/blockdev.c +++ b/blockdev.c @@ -341,6 +341,17 @@ static bool do_check_io_limits(BlockIOLimit *io_limits, Error **errp) return false; } +if (io_limits-bps_threshold[BLOCK_IO_LIMIT_TOTAL] 0 || +io_limits-bps_threshold[BLOCK_IO_LIMIT_WRITE] 0 || +io_limits-bps_threshold[BLOCK_IO_LIMIT_READ] 0 || +io_limits-iops_threshold[BLOCK_IO_LIMIT_TOTAL] 0 || +io_limits-iops_threshold[BLOCK_IO_LIMIT_WRITE] 0 || +io_limits-iops_threshold[BLOCK_IO_LIMIT_READ] 0) { +error_setg(errp, + threshold values must be 0 or greater); +return false; +} + return true; } @@ -523,24 +534,34 @@ DriveInfo *drive_init(QemuOpts *all_opts, BlockInterfaceType block_default_type) qemu_opt_get_number(opts, bps_rd, 0); io_limits.bps[BLOCK_IO_LIMIT_WRITE] = qemu_opt_get_number(opts, bps_wr, 0); +/* bps thresholds */ +io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL] = +qemu_opt_get_number(opts, bps_threshold, +io_limits.bps[BLOCK_IO_LIMIT_TOTAL] / THROTTLE_HZ); +io_limits.bps_threshold[BLOCK_IO_LIMIT_READ] = +qemu_opt_get_number(opts, bps_rd_threshold, +io_limits.bps[BLOCK_IO_LIMIT_READ] / THROTTLE_HZ); +io_limits.bps_threshold[BLOCK_IO_LIMIT_WRITE] = +qemu_opt_get_number(opts, bps_wr_threshold, +io_limits.bps[BLOCK_IO_LIMIT_WRITE] / THROTTLE_HZ); + io_limits.iops[BLOCK_IO_LIMIT_TOTAL] = qemu_opt_get_number(opts, iops, 0); io_limits.iops[BLOCK_IO_LIMIT_READ] = qemu_opt_get_number(opts, iops_rd, 0); io_limits.iops[BLOCK_IO_LIMIT_WRITE] = qemu_opt_get_number(opts, iops_wr, 0); -io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL] = - io_limits.bps[BLOCK_IO_LIMIT_TOTAL] / THROTTLE_HZ; -io_limits.bps_threshold[BLOCK_IO_LIMIT_READ] = - io_limits.bps[BLOCK_IO_LIMIT_READ] / THROTTLE_HZ; -io_limits.bps_threshold[BLOCK_IO_LIMIT_WRITE] = - io_limits.bps[BLOCK_IO_LIMIT_WRITE] / THROTTLE_HZ; -io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL] = - io_limits.iops[BLOCK_IO_LIMIT_TOTAL] / THROTTLE_HZ; + +/* iops thresholds */ +io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL] = +qemu_opt_get_number(opts, iops_threshold, +
[Qemu-devel] [PATCH V3 for-1.6 0/5] Leaky bucket throttling and features
The first patch fixes the throttling which was broken by a previous commit. The next patch replace the existing throttling algorithm by the well described leaky bucket algorithm. Third patch implement bursting by adding *_threshold parameters to qmp_block_set_io_throttle. The last one allow to define the max size of an io when throttling by iops via iops_sector_count to avoid vm users cheating on the iops limit. The last patch adds a metric reflecting how much the I/O are throttled. since v1: Add throttling percentage metric [Benoît] since v2: Enable timer only during I/O activity [Stefan] Mark function as coroutine_fn [Stefan] Guard these function checking they are in a coroutine [Stefan] Use defines to access buckets [Stefan] Fix typo [Stefan] reset throttling metric on iddle [Benoît] rename invalid to check_io_limit [Stefan] Benoît Canet (5): block: Repair the throttling code. block: Modify the throttling code to implement the leaky bucket algorithm. block: Add support for throttling burst threshold in QMP and the command line. block: Add iops_sector_count to do the iops accounting for a given io size. block: Add throttling percentage metrics. block.c | 493 - block/qapi.c | 32 +++ blockdev.c| 174 ++-- hmp.c | 38 +++- include/block/block_int.h | 18 +- include/block/coroutine.h |9 +- qapi-schema.json | 42 +++- qemu-coroutine-lock.c | 20 +- qemu-options.hx |2 +- qmp-commands.hx | 34 +++- 10 files changed, 642 insertions(+), 220 deletions(-) -- 1.7.10.4
[Qemu-devel] [PATCH V3 for-1.6 2/5] block: Modify the throttling code to implement the leaky bucket algorithm.
This patch replace the previous algorithm by the well described leaky bucket algorithm: A bucket is filled by the incoming IOs and a periodic timer decrement the counter to make the bucket leak. When a given threshold is reached the bucket is full and the IOs are hold. In this patch the threshold is set to a default value to make the code behave like the previous implementation. In the next patch the threshold will be exposed in QMP to let the user control the burstiness of the throttling. Signed-off-by: Benoit Canet ben...@irqsave.net --- block.c | 454 +++-- blockdev.c| 71 +-- include/block/block_int.h | 15 +- 3 files changed, 339 insertions(+), 201 deletions(-) diff --git a/block.c b/block.c index dc72643..f1cd9c0 100644 --- a/block.c +++ b/block.c @@ -86,13 +86,6 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque); static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, int64_t sector_num, int nb_sectors); -static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors, -bool is_write, double elapsed_time, uint64_t *wait); -static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write, -double elapsed_time, uint64_t *wait); -static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors, -bool is_write, int64_t *wait); - static QTAILQ_HEAD(, BlockDriverState) bdrv_states = QTAILQ_HEAD_INITIALIZER(bdrv_states); @@ -101,6 +94,8 @@ static QLIST_HEAD(, BlockDriver) bdrv_drivers = /* If non-zero, use only whitelisted block drivers */ static int use_bdrv_whitelist; +/* boolean used to inform the throttling code that a bdrv_drain_all is issued */ +static bool draining; #ifdef _WIN32 static int is_windows_drive_prefix(const char *filename) @@ -129,28 +124,170 @@ void bdrv_io_limits_disable(BlockDriverState *bs) while (qemu_co_enter_next(bs-throttled_reqs)) { } +} -if (bs-block_timer) { -qemu_del_timer(bs-block_timer); -qemu_free_timer(bs-block_timer); -bs-block_timer = NULL; +static void bdrv_make_bps_buckets_leak(BlockDriverState *bs, int64_t delta) +{ +int64_t *bytes = bs-leaky_buckets.bytes; +int64_t read_leak, write_leak; + +/* the limit apply to both reads and writes */ +if (bs-io_limits.bps[BLOCK_IO_LIMIT_TOTAL]) { +/* compute half the total leak */ +int64_t leak = ((bs-io_limits.bps[BLOCK_IO_LIMIT_TOTAL] * delta) / + NANOSECONDS_PER_SECOND); +int remain = leak % 2; +leak /= 2; + +/* the read bucket is smaller than half the quantity to leak so take + * care adding the leak difference to write leak + */ +if (bytes[BLOCK_IO_LIMIT_READ] = leak) { +read_leak = bytes[BLOCK_IO_LIMIT_READ]; +write_leak = 2 * leak + remain - bytes[BLOCK_IO_LIMIT_READ]; +/* symetric case */ +} else if (bytes[BLOCK_IO_LIMIT_WRITE] = leak) { +write_leak = bytes[BLOCK_IO_LIMIT_WRITE]; +read_leak = 2 * leak + remain - bytes[BLOCK_IO_LIMIT_WRITE]; +/* both bucket above leak count use half the total leak for both */ +} else { +write_leak = leak; +read_leak = leak + remain; +} +/* else we consider that limits are separated */ +} else { +read_leak = (bs-io_limits.bps[BLOCK_IO_LIMIT_READ] * delta) / +NANOSECONDS_PER_SECOND; +write_leak = (bs-io_limits.bps[BLOCK_IO_LIMIT_WRITE] * delta) / + NANOSECONDS_PER_SECOND; +} + +/* make the buckets leak */ +bytes[BLOCK_IO_LIMIT_READ] = MAX(bytes[BLOCK_IO_LIMIT_READ] - read_leak, + 0); +bytes[BLOCK_IO_LIMIT_WRITE] = MAX(bytes[BLOCK_IO_LIMIT_WRITE] - write_leak, + 0); +} + +static void bdrv_make_iops_buckets_leak(BlockDriverState *bs, int64_t delta) +{ +double *ios = bs-leaky_buckets.ios; +int64_t read_leak, write_leak; + +/* the limit apply to both reads and writes */ +if (bs-io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) { +/* compute half the total leak */ +int64_t leak = ((bs-io_limits.iops[BLOCK_IO_LIMIT_TOTAL] * delta) / + NANOSECONDS_PER_SECOND); +int remain = leak % 2; +leak /= 2; + +/* the read bucket is smaller than half the quantity to leak so take + * care adding the leak difference to write leak + */ +if (ios[BLOCK_IO_LIMIT_READ] = leak) { +read_leak = ios[BLOCK_IO_LIMIT_READ]; +write_leak = 2 * leak + remain - ios[BLOCK_IO_LIMIT_READ]; +/* symetric case */ +} else if (ios[BLOCK_IO_LIMIT_WRITE] = leak) { +write_leak = ios[BLOCK_IO_LIMIT_WRITE]; +read_leak = 2 * leak + remain - ios[BLOCK_IO_LIMIT_WRITE]; +/* both bucket above leak count use
[Qemu-devel] [PATCH V3 for-1.6 5/5] block: Add throttling percentage metrics.
This metrics show how many percent of the time I/Os are put on hold by the throttling algorithm. This metric could be used by system administrators or cloud stack developpers to decide when the throttling settings must be changed. Signed-off-by: Benoit Canet ben...@irqsave.net --- block.c | 27 ++- block/qapi.c |4 hmp.c |6 -- include/block/block_int.h |2 ++ qapi-schema.json |4 +++- 5 files changed, 39 insertions(+), 4 deletions(-) diff --git a/block.c b/block.c index bb4f8e4..6bb8570 100644 --- a/block.c +++ b/block.c @@ -118,12 +118,21 @@ int is_windows_drive(const char *filename) #endif /* throttling disk I/O limits */ +static void bdrv_reset_throttling_metrics(BlockDriverState *bs) +{ +/* iddle - reset values */ +bs-throttling_percentage = 0; +bs-full_since = 0; +} + void bdrv_io_limits_disable(BlockDriverState *bs) { bs-io_limits_enabled = false; while (qemu_co_enter_next(bs-throttled_reqs)) { } + +bdrv_reset_throttling_metrics(bs); } static void bdrv_make_bps_buckets_leak(BlockDriverState *bs, int64_t delta) @@ -213,7 +222,8 @@ static void bdrv_make_iops_buckets_leak(BlockDriverState *bs, int64_t delta) static void bdrv_leak_if_needed(BlockDriverState *bs) { int64_t now; -int64_t delta; +int64_t delta; /* the delta that would be ideally the timer period */ +int64_t delta_full; /* the delta where the bucket is full */ if (!bs-must_leak) { return; @@ -223,6 +233,14 @@ static void bdrv_leak_if_needed(BlockDriverState *bs) now = qemu_get_clock_ns(rt_clock); delta = now - bs-previous_leak; +/* compute throttle percentage reflecting how long IO are hold on average */ +if (bs-full_since) { +delta_full = now - bs-full_since; +bs-throttling_percentage = (delta_full * 100) / delta; +bs-full_since = 0; +} else { +bs-throttling_percentage = 0; +} bs-previous_leak = now; bdrv_make_bps_buckets_leak(bs, delta); @@ -260,6 +278,7 @@ static void bdrv_block_timer(void *opaque) /* disable throttling time on iddle for economy purpose */ if (bdrv_throttling_is_iddle(bs)) { bdrv_block_timer_disable(bs); +bdrv_reset_throttling_metrics(bs); return; } @@ -280,6 +299,7 @@ static void bdrv_block_timer_enable(BlockDriverState *bs) bs-block_timer = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs); bs-previous_leak = qemu_get_clock_ns(rt_clock); +bdrv_reset_throttling_metrics(bs); qemu_mod_timer(bs-block_timer, qemu_get_clock_ns(vm_clock) + BLOCK_IO_THROTTLE_PERIOD); @@ -432,6 +452,11 @@ static void bdrv_io_limits_intercept(BlockDriverState *bs, * not full */ while (bdrv_is_any_threshold_exceeded(bs, nb_sectors, is_write)) { +/* remember since when the code decided to block the first I/O */ +if (qemu_co_queue_empty(bs-throttled_reqs)) { +bs-full_since = qemu_get_clock_ns(rt_clock); +} + bdrv_leak_if_needed(bs); qemu_co_queue_wait_insert_head(bs-throttled_reqs); bdrv_leak_if_needed(bs); diff --git a/block/qapi.c b/block/qapi.c index f81081c..bd1c6af 100644 --- a/block/qapi.c +++ b/block/qapi.c @@ -263,6 +263,10 @@ void bdrv_query_info(BlockDriverState *bs, bs-io_limits.iops_sector_count; info-inserted-iops_sector_count = bs-io_limits.iops_sector_count; +info-inserted-has_throttling_percentage = + bs-throttling_percentage; +info-inserted-throttling_percentage = + bs-throttling_percentage; } bs0 = bs; diff --git a/hmp.c b/hmp.c index 3912305..9dc4862 100644 --- a/hmp.c +++ b/hmp.c @@ -348,7 +348,8 @@ void hmp_info_block(Monitor *mon, const QDict *qdict) iops_threshold=% PRId64 iops_rd_threshold=% PRId64 iops_wr_threshold=% PRId64 - iops_sector_count=% PRId64 \n, + iops_sector_count=% PRId64 + throttling_percentage=% PRId64 \n, info-value-inserted-bps, info-value-inserted-bps_rd, info-value-inserted-bps_wr, @@ -361,7 +362,8 @@ void hmp_info_block(Monitor *mon, const QDict *qdict) info-value-inserted-iops_threshold, info-value-inserted-iops_rd_threshold, info-value-inserted-iops_wr_threshold, -info-value-inserted-iops_sector_count); +info-value-inserted-iops_sector_count, +
[Qemu-devel] [PATCH V3 for-1.6 4/5] block: Add iops_sector_count to do the iops accounting for a given io size.
This feature can be used in case where users are avoiding the iops limit by doing jumbo I/Os hammering the storage backend. Signed-off-by: Benoit Canet ben...@irqsave.net --- block.c |8 +++- block/qapi.c |4 blockdev.c| 22 +- hmp.c |8 ++-- include/block/block_int.h |1 + qapi-schema.json | 10 -- qemu-options.hx |2 +- qmp-commands.hx |8 ++-- 8 files changed, 54 insertions(+), 9 deletions(-) diff --git a/block.c b/block.c index f1cd9c0..bb4f8e4 100644 --- a/block.c +++ b/block.c @@ -368,6 +368,12 @@ static bool bdrv_is_any_threshold_exceeded(BlockDriverState *bs, int nb_sectors, bool is_write) { bool bps_ret, iops_ret; +double ios = 1.0; + +if (bs-io_limits.iops_sector_count) { +ios = ((double) nb_sectors) / bs-io_limits.iops_sector_count; +ios = MAX(ios, 1.0); +} /* check if any bandwith or per IO threshold has been exceeded */ bps_ret = bdrv_is_bps_threshold_exceeded(bs, is_write); @@ -391,7 +397,7 @@ static bool bdrv_is_any_threshold_exceeded(BlockDriverState *bs, int nb_sectors, /* the IO is authorized so do the accounting and return false */ bs-leaky_buckets.bytes[is_write] += (int64_t)nb_sectors * BDRV_SECTOR_SIZE; -bs-leaky_buckets.ios[is_write]++; +bs-leaky_buckets.ios[is_write] += ios; return false; } diff --git a/block/qapi.c b/block/qapi.c index 03f1604..f81081c 100644 --- a/block/qapi.c +++ b/block/qapi.c @@ -259,6 +259,10 @@ void bdrv_query_info(BlockDriverState *bs, bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE]; info-inserted-iops_wr_threshold = bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE]; +info-inserted-has_iops_sector_count = + bs-io_limits.iops_sector_count; +info-inserted-iops_sector_count = + bs-io_limits.iops_sector_count; } bs0 = bs; diff --git a/blockdev.c b/blockdev.c index 241ebe8..d56e8a1 100644 --- a/blockdev.c +++ b/blockdev.c @@ -352,6 +352,11 @@ static bool do_check_io_limits(BlockIOLimit *io_limits, Error **errp) return false; } +if (io_limits-iops_sector_count 0) { +error_setg(errp, iops_sector_count must be 0 or greater); +return false; +} + return true; } @@ -563,6 +568,10 @@ DriveInfo *drive_init(QemuOpts *all_opts, BlockInterfaceType block_default_type) qemu_opt_get_number(opts, iops_wr_threshold, io_limits.iops[BLOCK_IO_LIMIT_WRITE] / THROTTLE_HZ); +io_limits.iops_sector_count = + qemu_opt_get_number(opts, iops_sector_count, 0); + + if (!do_check_io_limits(io_limits, error)) { error_report(%s, error_get_pretty(error)); error_free(error); @@ -1260,7 +1269,9 @@ void qmp_block_set_io_throttle(const char *device, int64_t bps, int64_t bps_rd, bool has_iops_rd_threshold, int64_t iops_rd_threshold, bool has_iops_wr_threshold, - int64_t iops_wr_threshold, Error **errp) + int64_t iops_wr_threshold, + bool has_iops_sector_count, + int64_t iops_sector_count, Error **errp) { BlockIOLimit io_limits; BlockDriverState *bs; @@ -1283,6 +1294,7 @@ void qmp_block_set_io_throttle(const char *device, int64_t bps, int64_t bps_rd, io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL] = iops / THROTTLE_HZ; io_limits.iops_threshold[BLOCK_IO_LIMIT_READ] = iops_rd / THROTTLE_HZ; io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE] = iops_wr / THROTTLE_HZ; +io_limits.iops_sector_count = 0; /* override them with givens values if present */ if (has_bps_threshold) { @@ -1304,6 +1316,10 @@ void qmp_block_set_io_throttle(const char *device, int64_t bps, int64_t bps_rd, io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE] = iops_wr_threshold; } +if (has_iops_sector_count) { +io_limits.iops_sector_count = iops_sector_count; +} + if (!do_check_io_limits(io_limits, errp)) { return; } @@ -2007,6 +2023,10 @@ QemuOptsList qemu_common_drive_opts = { .type = QEMU_OPT_NUMBER, .help = write bytes threshold, },{ +.name = iops_sector_count, +.type = QEMU_OPT_NUMBER, +.help = when limiting by iops max size of an I/O in sector, +},{ .name = copy-on-read, .type = QEMU_OPT_BOOL, .help = copy read data from backing file into image file, diff --git a/hmp.c b/hmp.c index
Re: [Qemu-devel] QCOW2 cryptography and secure key handling
There are two ways I could see it happening. Either integrate directly into the qcow2 file format, by mapping LUKS headers key material blocks into the qcow2 header region in some manner. Alternatively do it in a completely generic block driver, that qcow2 (or any other qemu bdrv) calls into instead of the file bdrv. That way the entire LUKS format becomes the image file data payload. A separate block driver, could also allow LUKS to be layered ontop, so that metadata is encrypted too. eg so you could end up with either layering QCow2 bdrv - LUKS bdrv - file bdrv LUKS bdrv - QCow2 bdrv - file bdrv I already tried the generic block driver approach on other project. (Quorum) The problem is that it result in complex issues to make the driver works with all QEMU features (think snapshots) and that no one has the funding to tackle the infrastructure work required to solve this: writing BlockBackend and block filters. Best regards Benoît
Re: [Qemu-devel] [PATCH V3 for-1.6 0/5] Leaky bucket throttling and features
Le Thursday 25 Jul 2013 à 20:08:22 (+0800), Fam Zheng a écrit : My ignorant question: Besides the fix, how is io throttling improved from previously, what are the advantages of leaky bucket algorithm? The advantage is simplicity and the ability to allow controlled bursts. The disavantage is it add stutters to the iops flow. Best Regards Benoît
Re: [Qemu-devel] [PATCH V3 for-1.6 2/5] block: Modify the throttling code to implement the leaky bucket algorithm.
I think it is easier to understand written like this: int64_t total_leak = ((bs-io_limits.iops[BLOCK_IO_LIMIT_TOTAL] * delta) / NANOSECONDS_PER_SECOND); if (ios[BLOCK_IO_LIMIT_READ] = total_leak / 2) { read_leak = ios[BLOCK_IO_LIMIT_READ]; write_leak = total_leak - read_leak; /* symetric case */ } else if (ios[BLOCK_IO_LIMIT_WRITE] = total_leak / 2) { write_leak = ios[BLOCK_IO_LIMIT_WRITE]; read_leak = total_leak - write_leak; /* both bucket above leak count use half the total leak for both */ } else { write_leak = total_leak / 2; read_leak = (total_leak + 1) / 2; } Thanks, I will propagate these changes in the new infinite HZ algorithm I am currently writing. +/* else we consider that limits are separated */ +} else { +read_leak = (bs-io_limits.iops[BLOCK_IO_LIMIT_READ] * delta) / +NANOSECONDS_PER_SECOND; +write_leak = (bs-io_limits.iops[BLOCK_IO_LIMIT_WRITE] * delta) / + NANOSECONDS_PER_SECOND; +} + +/* make the buckets leak */ +ios[BLOCK_IO_LIMIT_READ] = MAX(ios[BLOCK_IO_LIMIT_READ] - read_leak, 0); +ios[BLOCK_IO_LIMIT_WRITE] = MAX(ios[BLOCK_IO_LIMIT_WRITE] - write_leak, 0); +} + +static void bdrv_leak_if_needed(BlockDriverState *bs) +{ +int64_t now; +int64_t delta; + +if (!bs-must_leak) { +return; +} + +bs-must_leak = false; + +now = qemu_get_clock_ns(rt_clock); +delta = now - bs-previous_leak; +bs-previous_leak = now; + +bdrv_make_bps_buckets_leak(bs, delta); +bdrv_make_iops_buckets_leak(bs, delta); +} + +static void bdrv_block_timer_disable(BlockDriverState *bs) +{ +if (!bs-block_timer) { +return; } -bs-slice_start = 0; -bs-slice_end = 0; +qemu_del_timer(bs-block_timer); +qemu_free_timer(bs-block_timer); +bs-block_timer = NULL; +} + +static bool bdrv_throttling_is_iddle(BlockDriverState *bs) I don't quite understad the wording here, is iddle equivalent to idle? +{ +int64_t delta = qemu_get_clock_ns(rt_clock) - bs-previous_leak; + +if (delta BLOCK_IO_THROTTLE_PERIOD * 2) { +return false; +} + +/* iddle */ +return true; } +/* This callback is the timer in charge of making the leaky buckets leak */ static void bdrv_block_timer(void *opaque) Will be more readable for me if you could rename it to bdrv_clock_timer_cb. { BlockDriverState *bs = opaque; +/* disable throttling time on iddle for economy purpose */ +if (bdrv_throttling_is_iddle(bs)) { +bdrv_block_timer_disable(bs); +return; +} + +/* rearm the timer */ +qemu_mod_timer(bs-block_timer, + qemu_get_clock_ns(vm_clock) + + BLOCK_IO_THROTTLE_PERIOD); + +bs-must_leak = true; qemu_co_enter_next(bs-throttled_reqs); } +static void bdrv_block_timer_enable(BlockDriverState *bs) +{ +if (bs-block_timer) { +return; +} + +bs-block_timer = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs); +bs-previous_leak = qemu_get_clock_ns(rt_clock); +qemu_mod_timer(bs-block_timer, + qemu_get_clock_ns(vm_clock) + + BLOCK_IO_THROTTLE_PERIOD); +} + void bdrv_io_limits_enable(BlockDriverState *bs) { qemu_co_queue_init(bs-throttled_reqs); -bs-block_timer = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs); bs-io_limits_enabled = true; } @@ -165,15 +302,118 @@ bool bdrv_io_limits_enabled(BlockDriverState *bs) || io_limits-iops[BLOCK_IO_LIMIT_TOTAL]; } +/* This function check if the correct bandwith threshold has been exceeded What does the correct bandwidth threshold mean? And s/bandwith/bandwidth/, series wide. + * + * @is_write: true if the current IO is a write, false if it's a read + * @ret: true if threshold has been exceeded else false + */ +static bool bdrv_is_bps_threshold_exceeded(BlockDriverState *bs, bool is_write) +{ +/* limit is on total read + write bps : do the sum and compare with total + * threshold + */ +if (bs-io_limits.bps[BLOCK_IO_LIMIT_TOTAL]) { +int64_t bytes = bs-leaky_buckets.bytes[BLOCK_IO_LIMIT_READ] + +bs-leaky_buckets.bytes[BLOCK_IO_LIMIT_WRITE]; +return bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL] bytes; +} + +/* check wether the threshold corresponding to the current io type (read, + * write) has been exceeded + */ +if (bs-io_limits.bps[is_write]) { It looks dangerous to use is_write as index of the
Re: [Qemu-devel] [PATCH V3 for-1.6 5/5] block: Add throttling percentage metrics.
If I understand it, the percentage is recalculated every leak check. So it only reflects the instant io flow, instead of historical statistics? But I think for system admin purpose, it's good to know a longer range io activity character. Or do you think management tool should sample it? Yes I wrote this with management polling in mind. +} else { +bs-throttling_percentage = 0; +} bs-previous_leak = now; bdrv_make_bps_buckets_leak(bs, delta); @@ -260,6 +278,7 @@ static void bdrv_block_timer(void *opaque) /* disable throttling time on iddle for economy purpose */ if (bdrv_throttling_is_iddle(bs)) { bdrv_block_timer_disable(bs); +bdrv_reset_throttling_metrics(bs); return; } @@ -280,6 +299,7 @@ static void bdrv_block_timer_enable(BlockDriverState *bs) bs-block_timer = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs); bs-previous_leak = qemu_get_clock_ns(rt_clock); +bdrv_reset_throttling_metrics(bs); qemu_mod_timer(bs-block_timer, qemu_get_clock_ns(vm_clock) + BLOCK_IO_THROTTLE_PERIOD); @@ -432,6 +452,11 @@ static void bdrv_io_limits_intercept(BlockDriverState *bs, * not full */ while (bdrv_is_any_threshold_exceeded(bs, nb_sectors, is_write)) { +/* remember since when the code decided to block the first I/O */ +if (qemu_co_queue_empty(bs-throttled_reqs)) { +bs-full_since = qemu_get_clock_ns(rt_clock); +} + bdrv_leak_if_needed(bs); qemu_co_queue_wait_insert_head(bs-throttled_reqs); bdrv_leak_if_needed(bs); diff --git a/block/qapi.c b/block/qapi.c index f81081c..bd1c6af 100644 --- a/block/qapi.c +++ b/block/qapi.c @@ -263,6 +263,10 @@ void bdrv_query_info(BlockDriverState *bs, bs-io_limits.iops_sector_count; info-inserted-iops_sector_count = bs-io_limits.iops_sector_count; +info-inserted-has_throttling_percentage = + bs-throttling_percentage; +info-inserted-throttling_percentage = + bs-throttling_percentage; } bs0 = bs; diff --git a/hmp.c b/hmp.c index 3912305..9dc4862 100644 --- a/hmp.c +++ b/hmp.c @@ -348,7 +348,8 @@ void hmp_info_block(Monitor *mon, const QDict *qdict) iops_threshold=% PRId64 iops_rd_threshold=% PRId64 iops_wr_threshold=% PRId64 - iops_sector_count=% PRId64 \n, + iops_sector_count=% PRId64 + throttling_percentage=% PRId64 \n, info-value-inserted-bps, info-value-inserted-bps_rd, info-value-inserted-bps_wr, @@ -361,7 +362,8 @@ void hmp_info_block(Monitor *mon, const QDict *qdict) info-value-inserted-iops_threshold, info-value-inserted-iops_rd_threshold, info-value-inserted-iops_wr_threshold, -info-value-inserted-iops_sector_count); +info-value-inserted-iops_sector_count, +info-value-inserted-throttling_percentage); } else { monitor_printf(mon, [not inserted]); } diff --git a/include/block/block_int.h b/include/block/block_int.h index 74d7503..4487cd9 100644 --- a/include/block/block_int.h +++ b/include/block/block_int.h @@ -271,6 +271,8 @@ struct BlockDriverState { BlockIOLimit io_limits; BlockIOBaseValue leaky_buckets; int64_t previous_leak; +int64_t full_since; +int throttling_percentage; bool must_leak; CoQueue throttled_reqs; QEMUTimer*block_timer; diff --git a/qapi-schema.json b/qapi-schema.json index d579fda..14a02e7 100644 --- a/qapi-schema.json +++ b/qapi-schema.json @@ -783,6 +783,8 @@ # # @iops_sector_count: #optional an I/O size in sector (Since 1.6) # +# @throttling_percentage: #optional reflect throttling activity (Since 1.6) +# # Since: 0.14.0 # # Notes: This interface is only found in @BlockInfo. @@ -797,7 +799,7 @@ '*bps_threshold': 'int', '*bps_rd_threshold': 'int', '*bps_wr_threshold': 'int', '*iops_threshold': 'int', '*iops_rd_threshold': 'int', '*iops_wr_threshold': 'int', -'*iops_sector_count': 'int' } } +'*iops_sector_count': 'int', '*throttling_percentage': 'int' } } ## # @BlockDeviceIoStatus: -- 1.7.10.4 -- Fam
Re: [Qemu-devel] [PATCH 13/18] blockdev: Rename I/O throttling options for QMP
This patch will probably conflict with Benoît's work on leaky bucket throttling; can the two of you decide which one should go in first? Are we trying to target both this series and leaky bucket throttling for 1.6? I will to rebase my serie on top of this. However if anyone has suggestions for the names of the new options the leaky bucket serie add it would probably avoid an extra code review. Best regards Benoît
[Qemu-devel] [PATCH/FIX FOR 1.6] Repair the throttling code
The throttling code was segfaulting repair it for 1.6. Benoît Canet (1): block: Repair the throttling code. block.c |8 include/block/coroutine.h |9 +++-- qemu-coroutine-lock.c | 20 ++-- 3 files changed, 29 insertions(+), 8 deletions(-) -- 1.7.10.4
[Qemu-devel] [PATCH/FIX FOR 1.6] block: Repair the throttling code.
The throttling code was segfaulting since commit 02ffb504485f0920cfc75a0982a602f824a9a4f4 because some qemu_co_queue_next caller does not run in a coroutine. qemu_co_queue_do_restart assume that the caller is a coroutinne. As suggested by Stefan fix this by entering the coroutine directly. Also make sure like suggested that qemu_co_queue_next() and qemu_co_queue_restart_all() can be called only in coroutines. Signed-off-by: Benoit Canet ben...@irqsave.net --- block.c |8 include/block/coroutine.h |9 +++-- qemu-coroutine-lock.c | 20 ++-- 3 files changed, 29 insertions(+), 8 deletions(-) diff --git a/block.c b/block.c index 6cd39fa..5a2240f 100644 --- a/block.c +++ b/block.c @@ -127,7 +127,8 @@ void bdrv_io_limits_disable(BlockDriverState *bs) { bs-io_limits_enabled = false; -while (qemu_co_queue_next(bs-throttled_reqs)); +while (qemu_co_enter_next(bs-throttled_reqs)) { +} if (bs-block_timer) { qemu_del_timer(bs-block_timer); @@ -143,7 +144,7 @@ static void bdrv_block_timer(void *opaque) { BlockDriverState *bs = opaque; -qemu_co_queue_next(bs-throttled_reqs); +qemu_co_enter_next(bs-throttled_reqs); } void bdrv_io_limits_enable(BlockDriverState *bs) @@ -1445,8 +1446,7 @@ void bdrv_drain_all(void) * a busy wait. */ QTAILQ_FOREACH(bs, bdrv_states, list) { -if (!qemu_co_queue_empty(bs-throttled_reqs)) { -qemu_co_queue_restart_all(bs-throttled_reqs); +while (qemu_co_enter_next(bs-throttled_reqs)) { busy = true; } } diff --git a/include/block/coroutine.h b/include/block/coroutine.h index 377805a..1f2db3e 100644 --- a/include/block/coroutine.h +++ b/include/block/coroutine.h @@ -130,12 +130,17 @@ void coroutine_fn qemu_co_queue_wait_insert_head(CoQueue *queue); * * Returns true if a coroutine was restarted, false if the queue is empty. */ -bool qemu_co_queue_next(CoQueue *queue); +bool coroutine_fn qemu_co_queue_next(CoQueue *queue); /** * Restarts all coroutines in the CoQueue and leaves the queue empty. */ -void qemu_co_queue_restart_all(CoQueue *queue); +void coroutine_fn qemu_co_queue_restart_all(CoQueue *queue); + +/** + * Enter the next coroutine in the queue + */ +bool qemu_co_enter_next(CoQueue *queue); /** * Checks if the CoQueue is empty. diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c index d9fea49..aeb33b9 100644 --- a/qemu-coroutine-lock.c +++ b/qemu-coroutine-lock.c @@ -88,16 +88,32 @@ static bool qemu_co_queue_do_restart(CoQueue *queue, bool single) return true; } -bool qemu_co_queue_next(CoQueue *queue) +bool coroutine_fn qemu_co_queue_next(CoQueue *queue) { +assert(qemu_in_coroutine()); return qemu_co_queue_do_restart(queue, true); } -void qemu_co_queue_restart_all(CoQueue *queue) +void coroutine_fn qemu_co_queue_restart_all(CoQueue *queue) { +assert(qemu_in_coroutine()); qemu_co_queue_do_restart(queue, false); } +bool qemu_co_enter_next(CoQueue *queue) +{ +Coroutine *next; + +next = QTAILQ_FIRST(queue-entries); +if (!next) { +return false; +} + +QTAILQ_REMOVE(queue-entries, next, co_queue_next); +qemu_coroutine_enter(next, NULL); +return true; +} + bool qemu_co_queue_empty(CoQueue *queue) { return (QTAILQ_FIRST(queue-entries) == NULL); -- 1.7.10.4
[Qemu-devel] [PATCH V4/FIX FOR 1.6] repair throttling code
since v3: silence bogus checkpatch warning [abligh] Benoît Canet (1): block: Repair the throttling code. block.c |7 +++ include/block/coroutine.h |9 +++-- qemu-coroutine-lock.c | 20 ++-- 3 files changed, 28 insertions(+), 8 deletions(-) -- 1.7.10.4
[Qemu-devel] [PATCH V4/FIX FOR 1.6] block: Repair the throttling code.
The throttling code was segfaulting since commit 02ffb504485f0920cfc75a0982a602f824a9a4f4 because some qemu_co_queue_next caller does not run in a coroutine. qemu_co_queue_do_restart assume that the caller is a coroutinne. As suggested by Stefan fix this by entering the coroutine directly. Also make sure like suggested that qemu_co_queue_next() and qemu_co_queue_restart_all() can be called only in coroutines. Signed-off-by: Benoit Canet ben...@irqsave.net --- block.c |7 +++ include/block/coroutine.h |9 +++-- qemu-coroutine-lock.c | 20 ++-- 3 files changed, 28 insertions(+), 8 deletions(-) diff --git a/block.c b/block.c index 6cd39fa..35d71a4 100644 --- a/block.c +++ b/block.c @@ -127,7 +127,7 @@ void bdrv_io_limits_disable(BlockDriverState *bs) { bs-io_limits_enabled = false; -while (qemu_co_queue_next(bs-throttled_reqs)); +do {} while (qemu_co_enter_next(bs-throttled_reqs)); if (bs-block_timer) { qemu_del_timer(bs-block_timer); @@ -143,7 +143,7 @@ static void bdrv_block_timer(void *opaque) { BlockDriverState *bs = opaque; -qemu_co_queue_next(bs-throttled_reqs); +qemu_co_enter_next(bs-throttled_reqs); } void bdrv_io_limits_enable(BlockDriverState *bs) @@ -1445,8 +1445,7 @@ void bdrv_drain_all(void) * a busy wait. */ QTAILQ_FOREACH(bs, bdrv_states, list) { -if (!qemu_co_queue_empty(bs-throttled_reqs)) { -qemu_co_queue_restart_all(bs-throttled_reqs); +while (qemu_co_enter_next(bs-throttled_reqs)) { busy = true; } } diff --git a/include/block/coroutine.h b/include/block/coroutine.h index 377805a..1f2db3e 100644 --- a/include/block/coroutine.h +++ b/include/block/coroutine.h @@ -130,12 +130,17 @@ void coroutine_fn qemu_co_queue_wait_insert_head(CoQueue *queue); * * Returns true if a coroutine was restarted, false if the queue is empty. */ -bool qemu_co_queue_next(CoQueue *queue); +bool coroutine_fn qemu_co_queue_next(CoQueue *queue); /** * Restarts all coroutines in the CoQueue and leaves the queue empty. */ -void qemu_co_queue_restart_all(CoQueue *queue); +void coroutine_fn qemu_co_queue_restart_all(CoQueue *queue); + +/** + * Enter the next coroutine in the queue + */ +bool qemu_co_enter_next(CoQueue *queue); /** * Checks if the CoQueue is empty. diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c index d9fea49..aeb33b9 100644 --- a/qemu-coroutine-lock.c +++ b/qemu-coroutine-lock.c @@ -88,16 +88,32 @@ static bool qemu_co_queue_do_restart(CoQueue *queue, bool single) return true; } -bool qemu_co_queue_next(CoQueue *queue) +bool coroutine_fn qemu_co_queue_next(CoQueue *queue) { +assert(qemu_in_coroutine()); return qemu_co_queue_do_restart(queue, true); } -void qemu_co_queue_restart_all(CoQueue *queue) +void coroutine_fn qemu_co_queue_restart_all(CoQueue *queue) { +assert(qemu_in_coroutine()); qemu_co_queue_do_restart(queue, false); } +bool qemu_co_enter_next(CoQueue *queue) +{ +Coroutine *next; + +next = QTAILQ_FIRST(queue-entries); +if (!next) { +return false; +} + +QTAILQ_REMOVE(queue-entries, next, co_queue_next); +qemu_coroutine_enter(next, NULL); +return true; +} + bool qemu_co_queue_empty(CoQueue *queue) { return (QTAILQ_FIRST(queue-entries) == NULL); -- 1.7.10.4
Re: [Qemu-devel] [PATCH V3 for-1.6 3/5] block: Add support for throttling burst threshold in QMP and the command line.
Kevin's series renamed these to have a dash in the name, and also moved all the throttling parameters into a sub-struct. Does it make more sense to have just '*throttling' with that sub-struct containing 12 parameters, 6 for limits and 6 for thresholds, or would it be better to have '*throttling' with 6 members for limits, as well as '*throttling-threshold' with the other 6 members? Naming-wise, throttling.bps-total and throttling-threshold.bps-total convey as much information as throttling.bps-total and throttling.bps-total-threshold. In fact my series add up to 13 parameters. The last one is iops_sector_count so maybe I'll go the big sub-struct way. Best regards Benoît
Re: [Qemu-devel] VFIO and scheduled SR-IOV cards
Confused. You have one VF accessing BAR of another VF? Why? The VFs are scheduled and the board have only one microcontroller responsible to give the read results for some memory mapped registers. So it could respond the value of the active VF when a bar of an inactive VF is read. I handed over the contract as it's seemed too tricky. Best regards Benoît
Re: [Qemu-devel] QCOW2 cryptography and secure key handling
Le Monday 29 Jul 2013 à 12:32:35 (+0100), Daniel P. Berrange a écrit : On Mon, Jul 29, 2013 at 01:25:24PM +0200, Kevin Wolf wrote: Am 29.07.2013 um 13:21 hat Markus Armbruster geschrieben: Paolo Bonzini pbonz...@redhat.com writes: Il 23/07/2013 17:57, Daniel P. Berrange ha scritto: On Tue, Jul 23, 2013 at 05:38:00PM +0200, Kevin Wolf wrote: Am 23.07.2013 um 17:22 hat Stefan Hajnoczi geschrieben: On Tue, Jul 23, 2013 at 04:40:34PM +0200, Benoît Canet wrote: More generally, QCow2's current encryption support is woefully inadequate from a design POV. If we wanted better encryption built-in to QEMU it is best to just deprecate the current encryption support and define a new qcow2 extension based around something like the LUKS data format. Using the LUKS data format precisely would be good from a data portability POV, since then you can easily switch your images between LUKS encrypted block device qcow2-with-luks image file, without needing to re-encrypt the data. I read the LUKS specification and undestood enough part of it to understand the potentials benefits (stronger encryption key, multiple user keys, possibility to change users keys). Kevin Stefan: What do you think about implementing LUKS in QCOW2 ? Using standard or proven approachs in crypto is a good thing. I think the question is how much of a standard approach you take and what sense it makes in the context where it's used. The actual encryption algorithm is standard, as far as I can tell, but some people have repeatedly been arguing that it still results in bad crypto. Are they right? I don't know, I know too little of this stuff. One reason that QCow2 is bad, despite using a standard algorithm, is that the user passphrase is directly used encrypt/decrypt the data. Thus a weak passphrase leads to weak data encryption. With the LUKS format, the passphrase is only used to unlock the master key, which is cryptographically strong. LUKS applies multiple rounds of hashing to the user passphrase based on the speed of the machine CPUs, to make it less practical to brute force weak user passphrases and thus recover the master key. Another reason that QCow2 is bad is that disk encryption is Complicated. Even if you do not do any horrible mistakes such as using ECB encryption, a disk encrypted sector-by-sector has a lot of small separate cyphertexts in it and is susceptible to a special range of attacks. For example, current qcow2 encryption is vulnerable to a watermarking attack. http://en.wikipedia.org/wiki/Disk_encryption_theory#Cipher-block_chaining_.28CBC.29 Fine example of why the we use a standard, strong cypher (AES), therefore our crypto must be good argument is about as convincing as I built this sandcastle from the finest quartz sand, so it must be strong. Crypto should be done by trained professionals[*]. [...] [*] I studied crypto deeply enough to know I'm not. The point is, how do you know that you end up with good crypto when you add LUKS-like features? You still use them in a different context, and that may or may not break it. I can't really say. If we're not sufficiently confident in what we're doing, then we ought to find suitable people to advise us / review what we'd propose. I know Red Hat has people on its security team who we might be able to get to review any proposals in this area, if we wanted further crypto advise. If we went with an approach of incorporating LUKS, then we should also connect with the dm-crypt maintainers / LUKS designers to ask them to review what we're proposing to do. http://www.spinics.net/lists/dm-crypt/msg05277.html Best regards Benoît Regards, Daniel -- |: http://berrange.com -o-http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
Re: [Qemu-devel] QCOW2 cryptography and secure key handling
Crypto should be done by trained professionals[*]. I agree I feel uneasy to write cryptographic code. Best regards Benoît
Re: [Qemu-devel] QCOW2 cryptography and secure key handling
For example, current qcow2 encryption is vulnerable to a watermarking attack. http://en.wikipedia.org/wiki/Disk_encryption_theory#Cipher-block_chaining_.28CBC.29 void qcow2_encrypt_sectors(BDRVQcowState *s, int64_t sector_num, uint8_t *out_buf, const uint8_t *in_buf, int nb_sectors, int enc, const AES_KEY *key) { union { uint64_t ll[2]; uint8_t b[16]; } ivec; int i; for(i = 0; i nb_sectors; i++) { ivec.ll[0] = cpu_to_le64(sector_num); ivec.ll[1] = 0; AES_cbc_encrypt(in_buf, out_buf, 512, key, ivec.b, enc); sector_num++; in_buf += 512; out_buf += 512; } } CBC mode would imply that each sector would be crypted by combining the plaintext with the previous sector. It's does not look to be the case as the IV is reset to sector_num for each sector. It look like CTR mode. Best regards Benoît dm-crypt or other disk encryption programs use more complicated schemes, do we need to go there? Paolo
[Qemu-devel] [RFC V3 1/2] throttle: Add a new throttling API implementing continuus leaky bucket.
Implement the continuous leaky bucket algorithm devised on IRC as a separate module. Signed-off-by: Benoit Canet ben...@irqsave.net --- include/qemu/throttle.h | 111 util/Makefile.objs |1 + util/throttle.c | 436 +++ 3 files changed, 548 insertions(+) create mode 100644 include/qemu/throttle.h create mode 100644 util/throttle.c diff --git a/include/qemu/throttle.h b/include/qemu/throttle.h new file mode 100644 index 000..328c782 --- /dev/null +++ b/include/qemu/throttle.h @@ -0,0 +1,111 @@ +/* + * QEMU throttling infrastructure + * + * Copyright (C) Nodalink, SARL. 2013 + * + * Author: + * Benoît Canet benoit.ca...@irqsave.net + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation; either version 2 or + * (at your option) version 3 of the License. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, see http://www.gnu.org/licenses/. + */ + +#ifndef THROTTLING_H +#define THROTTLING_H + +#include stdint.h +#include qemu-common.h +#include qemu/timer.h + +#define NANOSECONDS_PER_SECOND 10.0 + +#define BUCKETS_COUNT 6 + +typedef enum { +THROTTLE_BPS_TOTAL = 0, +THROTTLE_BPS_READ = 1, +THROTTLE_BPS_WRITE = 2, +THROTTLE_OPS_TOTAL = 3, +THROTTLE_OPS_READ = 4, +THROTTLE_OPS_WRITE = 5, +} BucketType; + +typedef struct LeakyBucket { +int64_t ups; /* units per second */ +int64_t max; /* leaky bucket max in units */ +double bucket;/* bucket in units */ +int64_t previous_leak; /* timestamp of the last leak done */ +} LeakyBucket; + +/* The following structure is used to configure a ThrottleState + * It contains a bit of state: the bucket field of the LeakyBucket structure. + * However it allows to keep the code clean and the bucket field is reset to + * zero at the right time. + */ +typedef struct ThrottleConfig { +LeakyBucket buckets[6];/* leaky buckets */ +int64_t unit_size; /* size of an unit in bytes */ +int64_t op_size; /* size of an operation in units */ +} ThrottleConfig; + +typedef struct ThrottleState { +ThrottleConfig cfg; +bool timer_is_throttling_write; /* is the timer throttling a write */ +QEMUTimer *timer;/* timer used to do the throttling */ +QEMUClock *clock;/* the clock used */ +} ThrottleState; + +/* following 3 function exposed for tests */ +bool throttle_do_start(ThrottleState *ts, + bool is_write, + int64_t size, + int64_t now, + int64_t *next_timer); + +bool throttle_do_end(ThrottleState *ts, + bool is_write, + int64_t now, + int64_t *next_timer); + +bool throttle_do_timer(ThrottleState *ts, + bool is_write, + int64_t now, + int64_t *next_timer); + +/* user API functions */ +void throttle_init(ThrottleState *ts, + QEMUClock *clock, + void (timer)(void *), + void *timer_opaque); + +void throttle_destroy(ThrottleState *ts); + +bool throttle_start(ThrottleState *ts, bool is_write, int64_t size); + +void throttle_end(ThrottleState *ts, bool is_write); + +void throttle_timer(ThrottleState *ts, int64_t now, bool *must_wait); + +void throttle_config(ThrottleState *ts, ThrottleConfig *cfg); + +void throttle_get_config(ThrottleState *ts, ThrottleConfig *cfg); + +bool throttle_enabled(ThrottleConfig *cfg); + +bool throttle_conflicting(ThrottleConfig *cfg); + +bool throttle_is_valid(ThrottleConfig *cfg); + +bool throttle_have_timer(ThrottleState *ts); + +#endif diff --git a/util/Makefile.objs b/util/Makefile.objs index dc72ab0..2bb13a2 100644 --- a/util/Makefile.objs +++ b/util/Makefile.objs @@ -11,3 +11,4 @@ util-obj-y += iov.o aes.o qemu-config.o qemu-sockets.o uri.o notify.o util-obj-y += qemu-option.o qemu-progress.o util-obj-y += hexdump.o util-obj-y += crc32c.o +util-obj-y += throttle.o diff --git a/util/throttle.c b/util/throttle.c new file mode 100644 index 000..4afc407 --- /dev/null +++ b/util/throttle.c @@ -0,0 +1,436 @@ +/* + * QEMU throttling infrastructure + * + * Copyright (C) Nodalink, SARL. 2013 + * + * Author: + * Benoît Canet benoit.ca...@irqsave.net + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation; either version
[Qemu-devel] [RFC V3 2/2] block: Enable the new throttling code in the block layer.
Signed-off-by: Benoit Canet ben...@irqsave.net --- block.c | 316 +++-- block/qapi.c | 21 ++- blockdev.c| 115 + include/block/block.h |1 - include/block/block_int.h | 33 + 5 files changed, 150 insertions(+), 336 deletions(-) diff --git a/block.c b/block.c index 5a2240f..efa64fa 100644 --- a/block.c +++ b/block.c @@ -86,13 +86,6 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque); static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, int64_t sector_num, int nb_sectors); -static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors, -bool is_write, double elapsed_time, uint64_t *wait); -static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write, -double elapsed_time, uint64_t *wait); -static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors, -bool is_write, int64_t *wait); - static QTAILQ_HEAD(, BlockDriverState) bdrv_states = QTAILQ_HEAD_INITIALIZER(bdrv_states); @@ -123,55 +116,68 @@ int is_windows_drive(const char *filename) #endif /* throttling disk I/O limits */ -void bdrv_io_limits_disable(BlockDriverState *bs) +void bdrv_set_io_limits(BlockDriverState *bs, +ThrottleConfig *cfg) { -bs-io_limits_enabled = false; +throttle_config(bs-throttle_state, cfg); +} -while (qemu_co_enter_next(bs-throttled_reqs)) { -} +static bool bdrv_drain_throttled(BlockDriverState *bs) +{ +bool drained = false; +int i; -if (bs-block_timer) { -qemu_del_timer(bs-block_timer); -qemu_free_timer(bs-block_timer); -bs-block_timer = NULL; +for (i = 0; i 2; i++) { +while (qemu_co_enter_next(bs-throttled_reqs[i]) { +drained = true; +} } -bs-slice_start = 0; -bs-slice_end = 0; +return drained; } -static void bdrv_block_timer(void *opaque) +void bdrv_io_limits_disable(BlockDriverState *bs) { -BlockDriverState *bs = opaque; +bs-io_limits_enabled = false; + +bdrv_drain_throttled(bs); -qemu_co_enter_next(bs-throttled_reqs); +throttle_destroy(bs-throttle_state); } -void bdrv_io_limits_enable(BlockDriverState *bs) +static void bdrv_throttle_timer_cb(void *opaque) { -qemu_co_queue_init(bs-throttled_reqs); -bs-block_timer = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs); -bs-io_limits_enabled = true; +BlockDriverState *bs = opaque; +int now = qemu_get_clock_ns(vm_clock); +bool must_wait[2]; /* does throttled reads or writes must wait */ +int i; + +throttle_timer(bs-throttle_state, now, must_wait); + +/* execute not throttled requests */ +for (i = 0; i 2; i++) { +if (must_wait[i]) { +continue; +} +qemu_co_enter_next(bs-throttled_reqs[i]); +} } -bool bdrv_io_limits_enabled(BlockDriverState *bs) +/* should be called before bdrv_set_io_limits if a limit is set */ +void bdrv_io_limits_enable(BlockDriverState *bs) { -BlockIOLimit *io_limits = bs-io_limits; -return io_limits-bps[BLOCK_IO_LIMIT_READ] - || io_limits-bps[BLOCK_IO_LIMIT_WRITE] - || io_limits-bps[BLOCK_IO_LIMIT_TOTAL] - || io_limits-iops[BLOCK_IO_LIMIT_READ] - || io_limits-iops[BLOCK_IO_LIMIT_WRITE] - || io_limits-iops[BLOCK_IO_LIMIT_TOTAL]; +throttle_init(bs-throttle_state, vm_clock, bdrv_throttle_timer_cb, bs); +qemu_co_queue_init(bs-throttled_reqs[0]); +qemu_co_queue_init(bs-throttled_reqs[1]); +bs-io_limits_enabled = true; } static void bdrv_io_limits_intercept(BlockDriverState *bs, - bool is_write, int nb_sectors) + bool is_write, + int nb_sectors) { -int64_t wait_time = -1; - -if (!qemu_co_queue_empty(bs-throttled_reqs)) { -qemu_co_queue_wait(bs-throttled_reqs); +if (!qemu_co_queue_empty(bs-throttled_reqs[is_write])) { +qemu_co_queue_wait(bs-throttled_reqs[is_write]); } /* In fact, we hope to keep each request's timing, in FIFO mode. The next @@ -181,13 +187,11 @@ static void bdrv_io_limits_intercept(BlockDriverState *bs, * be still in throttled_reqs queue. */ -while (bdrv_exceed_io_limits(bs, nb_sectors, is_write, wait_time)) { -qemu_mod_timer(bs-block_timer, - wait_time + qemu_get_clock_ns(vm_clock)); -qemu_co_queue_wait_insert_head(bs-throttled_reqs); +while (throttle_start(bs-throttle_state, is_write, nb_sectors)) { +qemu_co_queue_wait_insert_head(bs-throttled_reqs[is_write]); } -qemu_co_queue_next(bs-throttled_reqs); +qemu_co_queue_next(bs-throttled_reqs[is_write]); } /* check if the path starts with protocol: */ @@ -1106,11 +1110,6 @@ int bdrv_open(BlockDriverState *bs, const char *filename, QDict
[Qemu-devel] [RFC V3 0/2] continuous leaky bucket throttling
This patchset implement continous leaky bucket throttling. It works mostly on the general case. The exception is where the load is composed of both reads and writes and two limits iops_rd and iops_wr are set. The resulting iops are a little above half of the given limits. I tried various strategies to avoid this: two timer, two throttled request queues or even a different algorithm using a priority queue. The problem is still the same in every version of the code: reads and writes operation seems entangled. Benoît Canet (2): throttle: Add a new throttling API implementing continuus leaky bucket. block: Enable the new throttling code in the block layer. block.c | 316 block/qapi.c | 21 +-- blockdev.c| 115 ++-- include/block/block.h |1 - include/block/block_int.h | 33 +--- include/qemu/throttle.h | 111 util/Makefile.objs|1 + util/throttle.c | 436 + 8 files changed, 698 insertions(+), 336 deletions(-) create mode 100644 include/qemu/throttle.h create mode 100644 util/throttle.c -- 1.7.10.4
Re: [Qemu-devel] [PATCH] KVM: always use MADV_DONTFORK
Le Thursday 25 Jul 2013 à 12:11:15 (+0200), Andrea Arcangeli a écrit : MADV_DONTFORK prevents fork to fail with -ENOMEM if the default overcommit heuristics decides there's too much anonymous virtual memory allocated. If the KVM secondary MMU is synchronized with MMU notifiers or not, doesn't make a difference in that regard. Secondly it's always more efficient to avoid copying the guest physical address space in the fork child (so we avoid to mark all the guest memory readonly in the parent and so we skip the establishment and teardown of lots of pagetables in the child). In the common case we can ignore the error if MADV_DONTFORK is not available. Leave a second invocation that errors out in the KVM path if MMU notifiers are missing and KVM is enabled, to abort in such case. Signed-off-by: Andrea Arcangeli aarca...@redhat.com --- exec.c | 1 + 1 file changed, 1 insertion(+) diff --git a/exec.c b/exec.c index c99a883..d3bb58d 100644 --- a/exec.c +++ b/exec.c @@ -1162,6 +1162,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(ram_addr_t size, void *host, qemu_ram_setup_dump(new_block-host, size); qemu_madvise(new_block-host, size, QEMU_MADV_HUGEPAGE); +qemu_madvise(new_block-host, size, QEMU_MADV_DONTFORK); if (kvm_enabled()) kvm_setup_guest_memory(new_block-host, size); This patch solve a bug where the network down script of a regular tap interface is not executed because fork fail when a pci-assigment is done on a large guest. Tested-By: Benoit Canet ben...@irqsave.net
Re: [Qemu-devel] [RFC V3 0/2] continuous leaky bucket throttling
I saw more discussion on IRC. Does this mean you will send another revision to address outstanding issues? Just wanted to check if you are waiting for code review or if you are still developing the next patch revision. I am currently finishing to write unit tests for the next patch revision. Once unit tests are done I will rebase the previous QMP throttling patches on top of the new code then post the new series. I think that the result will be a net improvement. Thanks for asking Best regards Benoît
Re: [Qemu-devel] [PATCH V3 for-1.6 3/5] block: Add support for throttling burst threshold in QMP and the command line.
Kevin's series renamed these to have a dash in the name, and also moved all the throttling parameters into a sub-struct. Does it make more sense to have just '*throttling' with that sub-struct containing 12 parameters, 6 for limits and 6 for thresholds, or would it be better to have '*throttling' with 6 members for limits, as well as '*throttling-threshold' with the other 6 members? Naming-wise, throttling.bps-total and throttling-threshold.bps-total convey as much information as throttling.bps-total and throttling.bps-total-threshold. Eric Kevin: Should compatible old style values be added for the new throttling options in order to avoid a mixed style mess in qemu-options.hx ? Best regards Benoît
[Qemu-devel] [PATCH V4 0/5] Continuous Leaky Bucket Throttling
This patchset implement continous leaky bucket throttling. It use two requests queue to enable to do silly unbalanced throttling like block_set_io_throttle 0 0 0 0 6000 1 It use two timer to get the timer callbacks and the throttle.c code simple It add unit tests. It fail with the following error message at exit and I don't know why yet. qemu-system-x86_64: block.c:1489: bdrv_drain_all: Assertion `((bs-tracked_requests)-lh_first == ((void *)0))' failed. The throttling core is pretty solid and the surrouding of the patchset needs polish. (new options ...) since previous version: wrap qemu-option.hx declararation [Eric] continuus - continuous [Fam] unit test [Paolo] Benoît Canet (5): throttle: Add a new throttling API implementing continuous leaky bucket. throttle: Add units tests block: Enable the new throttling code in the block layer. block: Add support for throttling burst max in QMP and the command line. block: Add iops_sector_count to do the iops accounting for a given io size. block.c | 351 ++-- block/qapi.c | 50 +++-- blockdev.c| 207 ++- hmp.c | 36 +++- include/block/block.h |1 - include/block/block_int.h | 32 +-- include/qemu/throttle.h | 105 ++ qapi-schema.json | 40 +++- qemu-options.hx |4 +- qmp-commands.hx | 34 +++- tests/Makefile|2 + tests/test-throttle.c | 494 + util/Makefile.objs|1 + util/throttle.c | 391 +++ 14 files changed, 1405 insertions(+), 343 deletions(-) create mode 100644 include/qemu/throttle.h create mode 100644 tests/test-throttle.c create mode 100644 util/throttle.c -- 1.7.10.4
[Qemu-devel] [PATCH V4 1/5] throttle: Add a new throttling API implementing continuous leaky bucket.
Implement the continuous leaky bucket algorithm devised on IRC as a separate module. Signed-off-by: Benoit Canet ben...@irqsave.net --- include/qemu/throttle.h | 105 + util/Makefile.objs |1 + util/throttle.c | 391 +++ 3 files changed, 497 insertions(+) create mode 100644 include/qemu/throttle.h create mode 100644 util/throttle.c diff --git a/include/qemu/throttle.h b/include/qemu/throttle.h new file mode 100644 index 000..e03bc3e --- /dev/null +++ b/include/qemu/throttle.h @@ -0,0 +1,105 @@ +/* + * QEMU throttling infrastructure + * + * Copyright (C) Nodalink, SARL. 2013 + * + * Author: + * Benoît Canet benoit.ca...@irqsave.net + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation; either version 2 or + * (at your option) version 3 of the License. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, see http://www.gnu.org/licenses/. + */ + +#ifndef THROTTLING_H +#define THROTTLING_H + +#include stdint.h +#include qemu-common.h +#include qemu/timer.h + +#define NANOSECONDS_PER_SECOND 10.0 + +#define BUCKETS_COUNT 6 + +typedef enum { +THROTTLE_BPS_TOTAL = 0, +THROTTLE_BPS_READ = 1, +THROTTLE_BPS_WRITE = 2, +THROTTLE_OPS_TOTAL = 3, +THROTTLE_OPS_READ = 4, +THROTTLE_OPS_WRITE = 5, +} BucketType; + +typedef struct LeakyBucket { +double ups;/* units per second */ +double max;/* leaky bucket max in units */ +double bucket; /* bucket in units */ +} LeakyBucket; + +/* The following structure is used to configure a ThrottleState + * It contains a bit of state: the bucket field of the LeakyBucket structure. + * However it allows to keep the code clean and the bucket field is reset to + * zero at the right time. + */ +typedef struct ThrottleConfig { +LeakyBucket buckets[6]; /* leaky buckets */ +uint64_t unit_size; /* size of an unit in bytes */ +uint64_t op_size; /* size of an operation in units */ +} ThrottleConfig; + +typedef struct ThrottleState { +ThrottleConfig cfg; /* configuration */ +int64_t previous_leak; /* timestamp of the last leak done */ +QEMUTimer * timers[2]; /* timers used to do the throttling */ +QEMUClock *clock; /* the clock used */ +} ThrottleState; + +/* operations on single leaky buckets */ +void throttle_leak_bucket(LeakyBucket *bkt, int64_t delta); + +int64_t throttle_compute_wait(LeakyBucket *bkt); + +/* expose timer computation function for unit tests */ +bool throttle_compute_timer(ThrottleState *ts, +bool is_write, +int64_t now, +int64_t *next_timer); + +/* init/destroy cycle */ +void throttle_init(ThrottleState *ts, + QEMUClock *clock, + void (read_timer)(void *), + void (write_timer)(void *), + void *timer_opaque); + +void throttle_destroy(ThrottleState *ts); + +bool throttle_have_timer(ThrottleState *ts); + +/* configuration */ +bool throttle_enabled(ThrottleConfig *cfg); + +bool throttle_conflicting(ThrottleConfig *cfg); + +bool throttle_is_valid(ThrottleConfig *cfg); + +void throttle_config(ThrottleState *ts, ThrottleConfig *cfg); + +void throttle_get_config(ThrottleState *ts, ThrottleConfig *cfg); + +/* usage */ +bool throttle_allowed(ThrottleState *ts, bool is_write); + +void throttle_account(ThrottleState *ts, bool is_write, uint64_t size); + +#endif diff --git a/util/Makefile.objs b/util/Makefile.objs index dc72ab0..2bb13a2 100644 --- a/util/Makefile.objs +++ b/util/Makefile.objs @@ -11,3 +11,4 @@ util-obj-y += iov.o aes.o qemu-config.o qemu-sockets.o uri.o notify.o util-obj-y += qemu-option.o qemu-progress.o util-obj-y += hexdump.o util-obj-y += crc32c.o +util-obj-y += throttle.o diff --git a/util/throttle.c b/util/throttle.c new file mode 100644 index 000..2f25d44 --- /dev/null +++ b/util/throttle.c @@ -0,0 +1,391 @@ +/* + * QEMU throttling infrastructure + * + * Copyright (C) Nodalink, SARL. 2013 + * + * Author: + * Benoît Canet benoit.ca...@irqsave.net + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License as + * published by the Free Software Foundation; either version 2 or + * (at your option) version 3 of the License. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY
[Qemu-devel] [PATCH V4 3/5] block: Enable the new throttling code in the block layer.
tip: Do not ever use the cfg scheduler in the guest with this code. It gives incorrect throttling. Signed-off-by: Benoit Canet ben...@irqsave.net --- block.c | 351 ++--- block/qapi.c | 21 ++- blockdev.c| 100 +++-- include/block/block.h |1 - include/block/block_int.h | 32 + 5 files changed, 173 insertions(+), 332 deletions(-) diff --git a/block.c b/block.c index 01b66d8..92f0aa1 100644 --- a/block.c +++ b/block.c @@ -86,13 +86,6 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque); static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, int64_t sector_num, int nb_sectors); -static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors, -bool is_write, double elapsed_time, uint64_t *wait); -static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write, -double elapsed_time, uint64_t *wait); -static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors, -bool is_write, int64_t *wait); - static QTAILQ_HEAD(, BlockDriverState) bdrv_states = QTAILQ_HEAD_INITIALIZER(bdrv_states); @@ -123,70 +116,110 @@ int is_windows_drive(const char *filename) #endif /* throttling disk I/O limits */ -void bdrv_io_limits_disable(BlockDriverState *bs) +void bdrv_set_io_limits(BlockDriverState *bs, +ThrottleConfig *cfg) { -bs-io_limits_enabled = false; +int i; + +throttle_config(bs-throttle_state, cfg); + +for (i = 0; i 2; i++) { +qemu_co_enter_next(bs-throttled_reqs[i]); +} +} + +/* this function drain all the throttled IOs */ +static bool bdrv_drain_throttled(BlockDriverState *bs) +{ +bool drained = false; +bool enabled = bs-io_limits_enabled; +int i; -do {} while (qemu_co_enter_next(bs-throttled_reqs)); +bs-io_limits_enabled = false; -if (bs-block_timer) { -qemu_del_timer(bs-block_timer); -qemu_free_timer(bs-block_timer); -bs-block_timer = NULL; +for (i = 0; i 2; i++) { +while (qemu_co_enter_next(bs-throttled_reqs[i])) { +drained = true; +} } -bs-slice_start = 0; -bs-slice_end = 0; +bs-io_limits_enabled = enabled; + +return drained; } -static void bdrv_block_timer(void *opaque) +void bdrv_io_limits_disable(BlockDriverState *bs) +{ +bs-io_limits_enabled = false; + +bdrv_drain_throttled(bs); + +throttle_destroy(bs-throttle_state); +} + +static void bdrv_throttle_read_timer_cb(void *opaque) { BlockDriverState *bs = opaque; +qemu_co_enter_next(bs-throttled_reqs[0]); +} -qemu_co_enter_next(bs-throttled_reqs); +static void bdrv_throttle_write_timer_cb(void *opaque) +{ +BlockDriverState *bs = opaque; +qemu_co_enter_next(bs-throttled_reqs[1]); } +/* should be called before bdrv_set_io_limits if a limit is set */ void bdrv_io_limits_enable(BlockDriverState *bs) { -qemu_co_queue_init(bs-throttled_reqs); -bs-block_timer = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs); +throttle_init(bs-throttle_state, + vm_clock, + bdrv_throttle_read_timer_cb, + bdrv_throttle_write_timer_cb, + bs); +qemu_co_queue_init(bs-throttled_reqs[0]); +qemu_co_queue_init(bs-throttled_reqs[1]); bs-io_limits_enabled = true; } -bool bdrv_io_limits_enabled(BlockDriverState *bs) +/* This function make an IO wait if needed + * + * @is_write: is the IO a write + * @nb_sectors: the number of sectors of the IO + */ +static void bdrv_io_limits_intercept(BlockDriverState *bs, + bool is_write, + int nb_sectors) { -BlockIOLimit *io_limits = bs-io_limits; -return io_limits-bps[BLOCK_IO_LIMIT_READ] - || io_limits-bps[BLOCK_IO_LIMIT_WRITE] - || io_limits-bps[BLOCK_IO_LIMIT_TOTAL] - || io_limits-iops[BLOCK_IO_LIMIT_READ] - || io_limits-iops[BLOCK_IO_LIMIT_WRITE] - || io_limits-iops[BLOCK_IO_LIMIT_TOTAL]; +/* does this io must wait */ +bool must_wait = !throttle_allowed(bs-throttle_state, is_write); + +/* if must wait or any request of this type throttled queue the IO */ +if (must_wait || +!qemu_co_queue_empty(bs-throttled_reqs[is_write])) { +qemu_co_queue_wait(bs-throttled_reqs[is_write]); +} + +/* the IO will be executed do the accounting */ +throttle_account(bs-throttle_state, is_write, nb_sectors); } -static void bdrv_io_limits_intercept(BlockDriverState *bs, - bool is_write, int nb_sectors) +/* This function will schedule the next IO throttled IO of this type if needed + * + * @is_write: is the IO a write + */ +static void bdrv_io_limits_resched(BlockDriverState *bs, bool is_write) { -int64_t wait_time = -1; -if
[Qemu-devel] [PATCH V4 2/5] throttle: Add units tests
Signed-off-by: Benoit Canet ben...@irqsave.net --- tests/Makefile|2 + tests/test-throttle.c | 494 + 2 files changed, 496 insertions(+) create mode 100644 tests/test-throttle.c diff --git a/tests/Makefile b/tests/Makefile index d044908..fb1e06a 100644 --- a/tests/Makefile +++ b/tests/Makefile @@ -29,6 +29,7 @@ check-unit-y += tests/test-visitor-serialization$(EXESUF) check-unit-y += tests/test-iov$(EXESUF) gcov-files-test-iov-y = util/iov.c check-unit-y += tests/test-aio$(EXESUF) +check-unit-y += tests/test-throttle$(EXESUF) gcov-files-test-aio-$(CONFIG_WIN32) = aio-win32.c gcov-files-test-aio-$(CONFIG_POSIX) = aio-posix.c check-unit-y += tests/test-thread-pool$(EXESUF) @@ -116,6 +117,7 @@ tests/check-qfloat$(EXESUF): tests/check-qfloat.o libqemuutil.a tests/check-qjson$(EXESUF): tests/check-qjson.o libqemuutil.a libqemustub.a tests/test-coroutine$(EXESUF): tests/test-coroutine.o $(block-obj-y) libqemuutil.a libqemustub.a tests/test-aio$(EXESUF): tests/test-aio.o $(block-obj-y) libqemuutil.a libqemustub.a +tests/test-throttle$(EXESUF): tests/test-throttle.o $(block-obj-y) libqemuutil.a libqemustub.a tests/test-thread-pool$(EXESUF): tests/test-thread-pool.o $(block-obj-y) libqemuutil.a libqemustub.a tests/test-iov$(EXESUF): tests/test-iov.o libqemuutil.a tests/test-hbitmap$(EXESUF): tests/test-hbitmap.o libqemuutil.a libqemustub.a diff --git a/tests/test-throttle.c b/tests/test-throttle.c new file mode 100644 index 000..c687f5f --- /dev/null +++ b/tests/test-throttle.c @@ -0,0 +1,494 @@ +/* + * Throttle infrastructure tests + * + * Copyright Nodalink, SARL. 2013 + * + * Authors: + * Benoît Canet benoit.ca...@irqsave.net + * + * This work is licensed under the terms of the GNU LGPL, version 2 or later. + * See the COPYING.LIB file in the top-level directory. + */ + +#include glib.h +#include math.h +#include qemu/throttle.h + +LeakyBucketbkt; +ThrottleConfig cfg; +ThrottleState ts; + +/* usefull function */ +static bool double_cmp(double x, double y) +{ +return fabsl(x - y) 1e-6; +} + +/* tests for single bucket operations */ +static void test_leak_bucket(void) +{ +/* set initial value */ +bkt.ups = 150; +bkt.max = 15; +bkt.bucket = 1.5; + +/* leak an op work of time */ +throttle_leak_bucket(bkt, NANOSECONDS_PER_SECOND / 150); +g_assert(bkt.ups == 150); +g_assert(bkt.max == 15); +g_assert(double_cmp(bkt.bucket, 0.5)); + +/* leak again emptying the bucket */ +throttle_leak_bucket(bkt, NANOSECONDS_PER_SECOND / 150); +g_assert(bkt.ups == 150); +g_assert(bkt.max == 15); +g_assert(double_cmp(bkt.bucket, 0)); + +/* check that the bucket level won't go lower */ +throttle_leak_bucket(bkt, NANOSECONDS_PER_SECOND / 150); +g_assert(bkt.ups == 150); +g_assert(bkt.max == 15); +g_assert(double_cmp(bkt.bucket, 0)); +} + +static void test_compute_wait(void) +{ +int64_t wait; +int64_t result; + +/* no operation limit set */ +bkt.ups = 0; +bkt.max = 15; +bkt.bucket = 1.5; +wait = throttle_compute_wait(bkt); +g_assert(!wait); + +/* zero delta */ +bkt.ups = 150; +bkt.max = 15; +bkt.bucket = 15; +wait = throttle_compute_wait(bkt); +g_assert(!wait); + +/* below zero delta */ +bkt.ups = 150; +bkt.max = 15; +bkt.bucket = 9; +wait = throttle_compute_wait(bkt); +g_assert(!wait); + +/* half an operation above max */ +bkt.ups = 150; +bkt.max = 15; +bkt.bucket = 15.5; +wait = throttle_compute_wait(bkt); +/* time required to do half an operation */ +result = (int64_t) NANOSECONDS_PER_SECOND / 150 / 2; +g_assert(wait == result); +} + +/* functions to test ThrottleState initialization/destroy methods */ +static void read_timer_cb(void *opaque) +{ +} + +static void write_timer_cb(void *opaque) +{ +} + +static void test_init(void) +{ +int i; + +/* fill the structure with crap */ +memset(ts, 1, sizeof(ts)); + +/* init the structure */ +throttle_init(ts, vm_clock, read_timer_cb, write_timer_cb, ts); + +/* check initialized fields */ +g_assert(ts.clock == vm_clock); +g_assert(ts.timers[0]); +g_assert(ts.timers[1]); + +/* check other fields where cleared */ +g_assert(!ts.previous_leak); +g_assert(!ts.cfg.op_size); +g_assert(!ts.cfg.unit_size); +for (i = 0; i BUCKETS_COUNT; i++) { +g_assert(!ts.cfg.buckets[i].ups); +g_assert(!ts.cfg.buckets[i].max); +g_assert(!ts.cfg.buckets[i].bucket); +} + +throttle_destroy(ts); +} + +static void test_destroy(void) +{ +int i; +throttle_init(ts, vm_clock, read_timer_cb, write_timer_cb, ts); +throttle_destroy(ts); +for (i = 0; i 2; i++) { +g_assert(!ts.timers[i]); +} +} + +/* function to test throttle_config and throttle_get_config */ +static void test_config_functions(void) +{ +int i; +ThrottleConfig
[Qemu-devel] [PATCH V4 4/5] block: Add support for throttling burst max in QMP and the command line.
The max parameter of the leaky bycket throttling algoritm can be used to allow the guest to do bursts. The max value is a pool of I/O that the guest can use without being throttled at all. Throttling is triggered once this pool is empty. Signed-off-by: Benoit Canet ben...@irqsave.net --- block/qapi.c | 26 blockdev.c | 119 ++ hmp.c| 32 +-- qapi-schema.json | 34 +++- qemu-options.hx |4 +- qmp-commands.hx | 30 -- 6 files changed, 220 insertions(+), 25 deletions(-) diff --git a/block/qapi.c b/block/qapi.c index 45f806b..5ba10f4 100644 --- a/block/qapi.c +++ b/block/qapi.c @@ -232,6 +232,32 @@ void bdrv_query_info(BlockDriverState *bs, info-inserted-iops= cfg.buckets[THROTTLE_OPS_TOTAL].ups; info-inserted-iops_rd = cfg.buckets[THROTTLE_OPS_READ].ups; info-inserted-iops_wr = cfg.buckets[THROTTLE_OPS_WRITE].ups; + +info-inserted-has_bps_max = +cfg.buckets[THROTTLE_BPS_TOTAL].max; +info-inserted-bps_max = +cfg.buckets[THROTTLE_BPS_TOTAL].max; +info-inserted-has_bps_rd_max = +cfg.buckets[THROTTLE_BPS_READ].max; +info-inserted-bps_rd_max = +cfg.buckets[THROTTLE_BPS_READ].max; +info-inserted-has_bps_wr_max = +cfg.buckets[THROTTLE_BPS_WRITE].max; +info-inserted-bps_wr_max = +cfg.buckets[THROTTLE_BPS_WRITE].max; + +info-inserted-has_iops_max= +cfg.buckets[THROTTLE_OPS_TOTAL].max; +info-inserted-iops_max= +cfg.buckets[THROTTLE_OPS_TOTAL].max; +info-inserted-has_iops_rd_max = +cfg.buckets[THROTTLE_OPS_READ].max; +info-inserted-iops_rd_max = +cfg.buckets[THROTTLE_OPS_READ].max; +info-inserted-has_iops_wr_max = +cfg.buckets[THROTTLE_OPS_WRITE].max; +info-inserted-iops_wr_max = +cfg.buckets[THROTTLE_OPS_WRITE].max; } bs0 = bs; diff --git a/blockdev.c b/blockdev.c index 7559ea5..22016a2 100644 --- a/blockdev.c +++ b/blockdev.c @@ -486,13 +486,18 @@ static DriveInfo *blockdev_init(QemuOpts *all_opts, cfg.buckets[THROTTLE_OPS_WRITE].ups = qemu_opt_get_number(opts, throttling.iops-write, 0); -cfg.buckets[THROTTLE_BPS_TOTAL].max = 0; -cfg.buckets[THROTTLE_BPS_READ].max = 0; -cfg.buckets[THROTTLE_BPS_WRITE].max = 0; - -cfg.buckets[THROTTLE_OPS_TOTAL].max = 0; -cfg.buckets[THROTTLE_OPS_READ].max = 0; -cfg.buckets[THROTTLE_OPS_WRITE].max = 0; +cfg.buckets[THROTTLE_BPS_TOTAL].max = +qemu_opt_get_number(opts, throttling.bps-total-max, 0); +cfg.buckets[THROTTLE_BPS_READ].max = +qemu_opt_get_number(opts, throttling.bps-read-max, 0); +cfg.buckets[THROTTLE_BPS_WRITE].max = +qemu_opt_get_number(opts, throttling.bps-write-max, 0); +cfg.buckets[THROTTLE_OPS_TOTAL].max = +qemu_opt_get_number(opts, throttling.iops-total-max, 0); +cfg.buckets[THROTTLE_OPS_READ].max = +qemu_opt_get_number(opts, throttling.iops-read-max, 0); +cfg.buckets[THROTTLE_OPS_WRITE].max = +qemu_opt_get_number(opts, throttling.iops-write-max, 0); cfg.unit_size = BDRV_SECTOR_SIZE; cfg.op_size = 0; @@ -773,6 +778,14 @@ DriveInfo *drive_init(QemuOpts *all_opts, BlockInterfaceType block_default_type) qemu_opt_rename(all_opts, bps_rd, throttling.bps-read); qemu_opt_rename(all_opts, bps_wr, throttling.bps-write); +qemu_opt_rename(all_opts, iops_max, throttling.iops-total-max); +qemu_opt_rename(all_opts, iops_rd_max, throttling.iops-read-max); +qemu_opt_rename(all_opts, iops_wr_max, throttling.iops-write-max); + +qemu_opt_rename(all_opts, bps_max, throttling.bps-total-max); +qemu_opt_rename(all_opts, bps_rd_max, throttling.bps-read-max); +qemu_opt_rename(all_opts, bps_wr_max, throttling.bps-write-max); + qemu_opt_rename(all_opts, readonly, read-only); value = qemu_opt_get(all_opts, cache); @@ -1257,8 +1270,22 @@ void qmp_change_blockdev(const char *device, const char *filename, /* throttling disk I/O limits */ void qmp_block_set_io_throttle(const char *device, int64_t bps, int64_t bps_rd, - int64_t bps_wr, int64_t iops, int64_t iops_rd, - int64_t iops_wr, Error **errp) + int64_t bps_wr, + int64_t iops, + int64_t iops_rd, + int64_t iops_wr, + bool has_bps_max, + int64_t bps_max, + bool has_bps_rd_max, + int64_t bps_rd_max, +
[Qemu-devel] [PATCH V4 5/5] block: Add iops_sector_count to do the iops accounting for a given io size.
This feature can be used in case where users are avoiding the iops limit by doing jumbo I/Os hammering the storage backend. Signed-off-by: Benoit Canet ben...@irqsave.net --- block/qapi.c |3 +++ blockdev.c | 22 +++--- hmp.c|8 ++-- qapi-schema.json | 10 -- qemu-options.hx |2 +- qmp-commands.hx |8 ++-- 6 files changed, 43 insertions(+), 10 deletions(-) diff --git a/block/qapi.c b/block/qapi.c index 5ba10f4..f98ff64 100644 --- a/block/qapi.c +++ b/block/qapi.c @@ -258,6 +258,9 @@ void bdrv_query_info(BlockDriverState *bs, cfg.buckets[THROTTLE_OPS_WRITE].max; info-inserted-iops_wr_max = cfg.buckets[THROTTLE_OPS_WRITE].max; + +info-inserted-has_iops_sector_count = cfg.op_size; +info-inserted-iops_sector_count = cfg.op_size; } bs0 = bs; diff --git a/blockdev.c b/blockdev.c index 22016a2..e2b0ee0 100644 --- a/blockdev.c +++ b/blockdev.c @@ -500,7 +500,7 @@ static DriveInfo *blockdev_init(QemuOpts *all_opts, qemu_opt_get_number(opts, throttling.iops-write-max, 0); cfg.unit_size = BDRV_SECTOR_SIZE; -cfg.op_size = 0; +cfg.op_size = qemu_opt_get_number(opts, throttling.iops-sector-count, 0); if (!check_throttle_config(cfg, error)) { error_report(%s, error_get_pretty(error)); @@ -786,6 +786,9 @@ DriveInfo *drive_init(QemuOpts *all_opts, BlockInterfaceType block_default_type) qemu_opt_rename(all_opts, bps_rd_max, throttling.bps-read-max); qemu_opt_rename(all_opts, bps_wr_max, throttling.bps-write-max); +qemu_opt_rename(all_opts, +iops_sector_count, throttling.iops-sector-count); + qemu_opt_rename(all_opts, readonly, read-only); value = qemu_opt_get(all_opts, cache); @@ -1285,7 +1288,9 @@ void qmp_block_set_io_throttle(const char *device, int64_t bps, int64_t bps_rd, bool has_iops_rd_max, int64_t iops_rd_max, bool has_iops_wr_max, - int64_t iops_wr_max, Error **errp) + int64_t iops_wr_max, + bool has_iops_sector_count, + int64_t iops_sector_count, Error **errp) { ThrottleConfig cfg; BlockDriverState *bs; @@ -1325,7 +1330,10 @@ void qmp_block_set_io_throttle(const char *device, int64_t bps, int64_t bps_rd, } cfg.unit_size = BDRV_SECTOR_SIZE; -cfg.op_size = 0; + +if (has_iops_sector_count) { +cfg.op_size = iops_sector_count; +} if (!check_throttle_config(cfg, errp)) { return; @@ -2051,6 +2059,10 @@ QemuOptsList qemu_common_drive_opts = { .type = QEMU_OPT_NUMBER, .help = total bytes write burst, },{ +.name = throttling.iops-sector-count, +.type = QEMU_OPT_NUMBER, +.help = when limiting by iops max size of an I/O in sector, +},{ .name = copy-on-read, .type = QEMU_OPT_BOOL, .help = copy read data from backing file into image file, @@ -2197,6 +2209,10 @@ QemuOptsList qemu_old_drive_opts = { .type = QEMU_OPT_NUMBER, .help = total bytes write burst, },{ +.name = iops_sector_count, +.type = QEMU_OPT_NUMBER, +.help = when limiting by iops max size of an I/O in sector, +},{ .name = copy-on-read, .type = QEMU_OPT_BOOL, .help = copy read data from backing file into image file, diff --git a/hmp.c b/hmp.c index e0c387c..01f7685 100644 --- a/hmp.c +++ b/hmp.c @@ -351,7 +351,8 @@ void hmp_info_block(Monitor *mon, const QDict *qdict) iops_wr=% PRId64 iops_max=% PRId64 iops_rd_max=% PRId64 - iops_wr_max=% PRId64 \n, + iops_wr_max=% PRId64 + iops_sector_count=% PRId64 \n, info-value-inserted-bps, info-value-inserted-bps_rd, info-value-inserted-bps_wr, @@ -363,7 +364,8 @@ void hmp_info_block(Monitor *mon, const QDict *qdict) info-value-inserted-iops_wr, info-value-inserted-iops_max, info-value-inserted-iops_rd_max, -info-value-inserted-iops_wr_max); +info-value-inserted-iops_wr_max, +info-value-inserted-iops_sector_count); } else { monitor_printf(mon, [not inserted]); } @@ -1124,6 +1126,8 @@ void hmp_block_set_io_throttle(Monitor *mon, const QDict *qdict) false,
Re: [Qemu-devel] [PATCH V4 3/5] block: Enable the new throttling code in the block layer.o
This is not really accurate; the cfq scheduler reorders reads and writes to have longer bursts, and these sometimes exceed the rate you set. I understood this is mostly for separate rd/wr settings, or did you reproduce it with total rates too? I'll drop the comment. It's only with read/write settings and not with total. Also, it would be better to have a workaround for this. Perhaps we could simply make the default value of max nonzero? In the old throttling code the slice time is 0.1s, so perhaps we could see what happens with max=0.1*avg. I already set max to 0.1*avg if needed in the code.
Re: [Qemu-devel] [PATCH V4 3/5] block: Enable the new throttling code in the block layer.
Also, it would be better to have a workaround for this. Perhaps we could simply make the default value of max nonzero? In the old throttling code the slice time is 0.1s, so perhaps we could see what happens with max=0.1*avg. This gives correct results with cfg with some little auto compensated oscillations. I will definitelly drop the bogus comment. Best regards Benoît
Re: [Qemu-devel] [PATCH V4 0/5] Continuous Leaky Bucket Throttling*
It fail with the following error message at exit and I don't know why yet. qemu-system-x86_64: block.c:1489: bdrv_drain_all: Assertion `((bs-tracked_requests)-lh_first == ((void *)0))' failed. I solved this issue: bdrv_drain_all was bogus. Best regards Benoît block.c | 351 ++-- block/qapi.c | 50 +++-- blockdev.c| 207 ++- hmp.c | 36 +++- include/block/block.h |1 - include/block/block_int.h | 32 +-- include/qemu/throttle.h | 105 ++ qapi-schema.json | 40 +++- qemu-options.hx |4 +- qmp-commands.hx | 34 +++- tests/Makefile|2 + tests/test-throttle.c | 494 + util/Makefile.objs|1 + util/throttle.c | 391 +++ 14 files changed, 1405 insertions(+), 343 deletions(-) create mode 100644 include/qemu/throttle.h create mode 100644 tests/test-throttle.c create mode 100644 util/throttle.c -- 1.7.10.4
Re: [Qemu-devel] [PATCH] qemu-iotests: Filter out actual image size in 067
Le Saturday 02 Nov 2013 à 14:52:11 (+0100), Max Reitz a écrit : The actual size of the image file may differ depending on the Linux kernel currently running on the host. Filtering out this value makes this test pass in such cases. Signed-off-by: Max Reitz mre...@redhat.com --- tests/qemu-iotests/067 | 2 +- tests/qemu-iotests/067.out | 10 +- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/tests/qemu-iotests/067 b/tests/qemu-iotests/067 index 79dc38b..d025192 100755 --- a/tests/qemu-iotests/067 +++ b/tests/qemu-iotests/067 @@ -45,7 +45,7 @@ function do_run_qemu() function run_qemu() { -do_run_qemu $@ 21 | _filter_testdir | _filter_qmp +do_run_qemu $@ 21 | _filter_testdir | _filter_qmp | sed -e 's/\(actual-size:\s*\)[0-9]\+/\1SIZE/g' } size=128M diff --git a/tests/qemu-iotests/067.out b/tests/qemu-iotests/067.out index 4bb9ff9..8d271cc 100644 --- a/tests/qemu-iotests/067.out +++ b/tests/qemu-iotests/067.out @@ -6,7 +6,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728 Testing: -drive file=TEST_DIR/t.qcow2,format=qcow2,if=none,id=disk -device virtio-blk-pci,drive=disk,id=virtio0 QMP_VERSION {return: {}} -{return: [{io-status: ok, device: disk, locked: false, removable: false, inserted: {iops_rd: 0, image: {virtual-size: 134217728, filename: TEST_DIR/t.qcow2, cluster-size: 65536, format: qcow2, actual-size: 139264, format-specific: {type: qcow2, data: {compat: 1.1, lazy-refcounts: false}}, dirty-flag: false}, iops_wr: 0, ro: false, backing_file_depth: 0, drv: qcow2, iops: 0, bps_wr: 0, encrypted: false, bps: 0, bps_rd: 0, file: TEST_DIR/t.qcow2, encryption_key_missing: false}, type: unknown}, {io-status: ok, device: ide1-cd0, locked: false, removable: true, tray_open: false, type: unknown}, {device: floppy0, locked: false, removable: true, tray_open: false, type: unknown}, {device: sd0, locked: false, removable: true, tray_open: false, type: unknown}]} +{return: [{io-status: ok, device: disk, locked: false, removable: false, inserted: {iops_rd: 0, image: {virtual-size: 134217728, filename: TEST_DIR/t.qcow2, cluster-size: 65536, format: qcow2, actual-size: SIZE, format-specific: {type: qcow2, data: {compat: 1.1, lazy-refcounts: false}}, dirty-flag: false}, iops_wr: 0, ro: false, backing_file_depth: 0, drv: qcow2, iops: 0, bps_wr: 0, encrypted: false, bps: 0, bps_rd: 0, file: TEST_DIR/t.qcow2, encryption_key_missing: false}, type: unknown}, {io-status: ok, device: ide1-cd0, locked: false, removable: true, tray_open: false, type: unknown}, {device: floppy0, locked: false, removable: true, tray_open: false, type: unknown}, {device: sd0, locked: false, removable: true, tray_open: false, type: unknown}]} {return: {}} {return: {}} {timestamp: {seconds: TIMESTAMP, microseconds: TIMESTAMP}, event: DEVICE_DELETED, data: {path: /machine/peripheral/virtio0/virtio-backend}} @@ -24,7 +24,7 @@ QMP_VERSION Testing: -drive file=TEST_DIR/t.qcow2,format=qcow2,if=none,id=disk QMP_VERSION {return: {}} -{return: [{device: disk, locked: false, removable: true, inserted: {iops_rd: 0, image: {virtual-size: 134217728, filename: TEST_DIR/t.qcow2, cluster-size: 65536, format: qcow2, actual-size: 139264, format-specific: {type: qcow2, data: {compat: 1.1, lazy-refcounts: false}}, dirty-flag: false}, iops_wr: 0, ro: false, backing_file_depth: 0, drv: qcow2, iops: 0, bps_wr: 0, encrypted: false, bps: 0, bps_rd: 0, file: TEST_DIR/t.qcow2, encryption_key_missing: false}, tray_open: false, type: unknown}, {io-status: ok, device: ide1-cd0, locked: false, removable: true, tray_open: false, type: unknown}, {device: floppy0, locked: false, removable: true, tray_open: false, type: unknown}, {device: sd0, locked: false, removable: true, tray_open: false, type: unknown}]} +{return: [{device: disk, locked: false, removable: true, inserted: {iops_rd: 0, image: {virtual-size: 134217728, filename: TEST_DIR/t.qcow2, cluster-size: 65536, format: qcow2, actual-size: SIZE, format-specific: {type: qcow2, data: {compat: 1.1, lazy-refcounts: false}}, dirty-flag: false}, iops_wr: 0, ro: false, backing_file_depth: 0, drv: qcow2, iops: 0, bps_wr: 0, encrypted: false, bps: 0, bps_rd: 0, file: TEST_DIR/t.qcow2, encryption_key_missing: false}, tray_open: false, type: unknown}, {io-status: ok, device: ide1-cd0, locked: false, removable: true, tray_open: false, type: unknown}, {device: floppy0, locked: false, removable: true, tray_open: false, type: unknown}, {device: sd0, locked: false, removable: true, tray_open: false, type: unknown}]} {return: {}} {return: {}} {return: {}} @@ -44,7 +44,7 @@ Testing: QMP_VERSION {return: {}} {return: OK\r\n} -{return: [{io-status: ok, device: ide1-cd0, locked: false, removable: true, tray_open: false, type: unknown}, {device: floppy0, locked: false, removable: true, tray_open: false, type: unknown},
Re: [Qemu-devel] [PATCH v2] block: per caller dirty bitmap
Le Monday 04 Nov 2013 à 17:30:10 (+0800), Fam Zheng a écrit : Previously a BlockDriverState has only one dirty bitmap, so only one caller (e.g. a block job) can keep track of writing. This changes the dirty bitmap to a list and creates a BdrvDirtyBitmap for each caller, the lifecycle is managed with these new functions: bdrv_create_dirty_bitmap bdrv_release_dirty_bitmap Where BdrvDirtyBitmap is a linked list wrapper structure of HBitmap. In place of bdrv_set_dirty_tracking, a BdrvDirtyBitmap pointer argument is added to these functions, since each caller has its own dirty bitmap: bdrv_get_dirty bdrv_dirty_iter_init bdrv_get_dirty_count bdrv_set_dirty and bdrv_reset_dirty prototypes are unchanged but will internally walk the list of all dirty bitmaps and set them one by one. Signed-off-by: Fam Zheng f...@redhat.com --- v2: Added BdrvDirtyBitmap [Paolo] Signed-off-by: Fam Zheng f...@redhat.com --- block-migration.c | 22 + block.c | 81 --- block/mirror.c| 23 -- block/qapi.c | 8 - include/block/block.h | 11 --- include/block/block_int.h | 2 +- 6 files changed, 85 insertions(+), 62 deletions(-) diff --git a/block-migration.c b/block-migration.c index daf9ec1..8f4e826 100644 --- a/block-migration.c +++ b/block-migration.c @@ -58,6 +58,7 @@ typedef struct BlkMigDevState { /* Protected by block migration lock. */ unsigned long *aio_bitmap; int64_t completed_sectors; +BdrvDirtyBitmap *dirty_bitmap; } BlkMigDevState; typedef struct BlkMigBlock { @@ -309,12 +310,21 @@ static int mig_save_device_bulk(QEMUFile *f, BlkMigDevState *bmds) /* Called with iothread lock taken. */ -static void set_dirty_tracking(int enable) +static void set_dirty_tracking(void) { BlkMigDevState *bmds; QSIMPLEQ_FOREACH(bmds, block_mig_state.bmds_list, entry) { -bdrv_set_dirty_tracking(bmds-bs, enable ? BLOCK_SIZE : 0); +bmds-dirty_bitmap = bdrv_create_dirty_bitmap(bmds-bs, BLOCK_SIZE); +} +} + +static void unset_dirty_tracking(void) +{ +BlkMigDevState *bmds; + +QSIMPLEQ_FOREACH(bmds, block_mig_state.bmds_list, entry) { +bdrv_release_dirty_bitmap(bmds-bs, bmds-dirty_bitmap); } } @@ -432,7 +442,7 @@ static int mig_save_device_dirty(QEMUFile *f, BlkMigDevState *bmds, } else { blk_mig_unlock(); } -if (bdrv_get_dirty(bmds-bs, sector)) { +if (bdrv_get_dirty(bmds-bs, bmds-dirty_bitmap, sector)) { if (total_sectors - sector BDRV_SECTORS_PER_DIRTY_CHUNK) { nr_sectors = total_sectors - sector; @@ -554,7 +564,7 @@ static int64_t get_remaining_dirty(void) int64_t dirty = 0; QSIMPLEQ_FOREACH(bmds, block_mig_state.bmds_list, entry) { -dirty += bdrv_get_dirty_count(bmds-bs); +dirty += bdrv_get_dirty_count(bmds-bs, bmds-dirty_bitmap); } return dirty BDRV_SECTOR_BITS; @@ -569,7 +579,7 @@ static void blk_mig_cleanup(void) bdrv_drain_all(); -set_dirty_tracking(0); +unset_dirty_tracking(); blk_mig_lock(); while ((bmds = QSIMPLEQ_FIRST(block_mig_state.bmds_list)) != NULL) { @@ -604,7 +614,7 @@ static int block_save_setup(QEMUFile *f, void *opaque) init_blk_migration(f); /* start track dirty blocks */ -set_dirty_tracking(1); +set_dirty_tracking(); qemu_mutex_unlock_iothread(); ret = flush_blks(f); diff --git a/block.c b/block.c index 58efb5b..ef7a55f 100644 --- a/block.c +++ b/block.c @@ -49,6 +49,11 @@ #include windows.h #endif +typedef struct BdrvDirtyBitmap { +HBitmap *bitmap; +QLIST_ENTRY(BdrvDirtyBitmap) list; +} BdrvDirtyBitmap; + #define NOT_DONE 0x7fff /* used while emulated sync operation in progress */ typedef enum { @@ -323,6 +328,7 @@ BlockDriverState *bdrv_new(const char *device_name) BlockDriverState *bs; bs = g_malloc0(sizeof(BlockDriverState)); +QLIST_INIT(bs-dirty_bitmaps); pstrcpy(bs-device_name, sizeof(bs-device_name), device_name); if (device_name[0] != '\0') { QTAILQ_INSERT_TAIL(bdrv_states, bs, list); @@ -1615,7 +1621,7 @@ static void bdrv_move_feature_fields(BlockDriverState *bs_dest, bs_dest-iostatus = bs_src-iostatus; /* dirty bitmap */ -bs_dest-dirty_bitmap = bs_src-dirty_bitmap; +bs_dest-dirty_bitmaps = bs_src-dirty_bitmaps; /* reference count */ bs_dest-refcnt = bs_src-refcnt; @@ -1648,7 +1654,7 @@ void bdrv_swap(BlockDriverState *bs_new, BlockDriverState *bs_old) /* bs_new must be anonymous and shouldn't have anything fancy enabled */ assert(bs_new-device_name[0] == '\0'); -assert(bs_new-dirty_bitmap == NULL); +
Re: [Qemu-devel] [PATCH] block: Save errno before error_setg_errno
Le Tuesday 05 Nov 2013 à 20:03:33 (+0100), Max Reitz a écrit : error_setg_errno() may overwrite errno; therefore, its value should be read before calling that function and not afterwards. Signed-off-by: Max Reitz mre...@redhat.com --- block.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block.c b/block.c index 58efb5b..0e96a22 100644 --- a/block.c +++ b/block.c @@ -1084,8 +1084,8 @@ int bdrv_open(BlockDriverState *bs, const char *filename, QDict *options, snprintf(backing_filename, sizeof(backing_filename), %s, filename); } else if (!realpath(filename, backing_filename)) { -error_setg_errno(errp, errno, Could not resolve path '%s', filename); ret = -errno; +error_setg_errno(errp, errno, Could not resolve path '%s', filename); goto fail; } -- 1.8.4.2 Reviewed-by: Benoit Canet ben...@irqsave.net
Re: [Qemu-devel] [PATCH] block: Round up total_sectors
Le Wednesday 06 Nov 2013 à 19:48:06 (+0800), Fam Zheng a écrit : Since b94a2610, bdrv_getlength() is omitted when probing image. VMDK monolithicFlat is broken by that because a file 512 bytes can't be read with its total_sectors truncated to 0. This patch round up the size to BDRV_SECTOR_SIZE, when a image size is not sector aligned. Signed-off-by: Fam Zheng f...@redhat.com --- block.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block.c b/block.c index 58efb5b..f706634 100644 --- a/block.c +++ b/block.c @@ -640,7 +640,7 @@ static int refresh_total_sectors(BlockDriverState *bs, int64_t hint) if (length 0) { return length; } -hint = length BDRV_SECTOR_BITS; +hint = DIV_ROUND_UP(length, BDRV_SECTOR_SIZE); } bs-total_sectors = hint; -- 1.8.3.1 Reviewed-by: Benoit Canet ben...@irqsave.net
Re: [Qemu-devel] [PATCH] block: Print its file name if backing file opening failed
Le Thursday 07 Nov 2013 à 15:34:52 (+0800), Fam Zheng a écrit : If backing file doesn't exist, the error message is confusing and misleading: $ qemu /tmp/a.qcow2 qemu: could not open disk image /tmp/a.qcow2: Could not open file: No such file or directory But... $ ls /tmp/a.qcow2 /tmp/a.qcow2 $ qemu-img info /tmp/a.qcow2 image: /tmp/a.qcow2 file format: qcow2 virtual size: 8.0G (8589934592 bytes) disk size: 196K cluster_size: 65536 backing file: /tmp/b.qcow2 Because... $ ls /tmp/b.qcow2 ls: cannot access /tmp/b.qcow2: No such file or directory This is not intuitive. It's better to have the missing file's name in the error message. With this patch: $ qemu-io -c 'read 0 512' /tmp/a.qcow2 qemu-io: can't open device /tmp/a.qcow2: Could not open backing file: Could not open '/stor/vm/arch.raw': No such file or directory no file open, try 'help open' Which is a little bit better. Signed-off-by: Fam Zheng f...@redhat.com --- block.c| 3 ++- block/raw-posix.c | 1 - block/raw-win32.c | 1 - tests/qemu-iotests/051.out | 2 +- tests/qemu-iotests/069.out | 2 +- 5 files changed, 4 insertions(+), 5 deletions(-) diff --git a/block.c b/block.c index f706634..a8dbcfc 100644 --- a/block.c +++ b/block.c @@ -1009,7 +1009,8 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp) bdrv_unref(bs-backing_hd); bs-backing_hd = NULL; bs-open_flags |= BDRV_O_NO_BACKING; -error_propagate(errp, local_err); +error_setg(errp, Could not open backing file: %s, + error_get_pretty(local_err)); return ret; } pstrcpy(bs-backing_file, sizeof(bs-backing_file), diff --git a/block/raw-posix.c b/block/raw-posix.c index f6d48bb..e1b1ba2 100644 --- a/block/raw-posix.c +++ b/block/raw-posix.c @@ -310,7 +310,6 @@ static int raw_open_common(BlockDriverState *bs, QDict *options, if (ret == -EROFS) { ret = -EACCES; } -error_setg_errno(errp, -ret, Could not open file); goto fail; } s-fd = fd; diff --git a/block/raw-win32.c b/block/raw-win32.c index 2741e4d..2bad5a3 100644 --- a/block/raw-win32.c +++ b/block/raw-win32.c @@ -280,7 +280,6 @@ static int raw_open(BlockDriverState *bs, QDict *options, int flags, } else { ret = -EINVAL; } -error_setg_errno(errp, -ret, Could not open file); goto fail; } diff --git a/tests/qemu-iotests/051.out b/tests/qemu-iotests/051.out index 15deef6..d351935 100644 --- a/tests/qemu-iotests/051.out +++ b/tests/qemu-iotests/051.out @@ -226,6 +226,6 @@ Testing: -drive file=foo:bar QEMU_PROG: -drive file=foo:bar: could not open disk image foo:bar: Unknown protocol Testing: -drive file.filename=foo:bar -QEMU_PROG: -drive file.filename=foo:bar: could not open disk image ide0-hd0: Could not open file: No such file or directory +QEMU_PROG: -drive file.filename=foo:bar: could not open disk image ide0-hd0: Could not open 'foo:bar': No such file or directory *** done diff --git a/tests/qemu-iotests/069.out b/tests/qemu-iotests/069.out index 3648814..b48306d 100644 --- a/tests/qemu-iotests/069.out +++ b/tests/qemu-iotests/069.out @@ -4,5 +4,5 @@ QA output created by 069 Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=131072 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=131072 backing_file='TEST_DIR/t.IMGFMT.base' -qemu-io: can't open device TEST_DIR/t.IMGFMT: Could not open file: No such file or directory +qemu-io: can't open device TEST_DIR/t.IMGFMT: Could not open backing file: Could not open 'TEST_DIR/t.IMGFMT.base': No such file or directory *** done -- 1.8.3.1 Reviewed-by: Benoit Canet ben...@irqsave.net
[Qemu-devel] [PATCH 0/2] Giving names to graph's BlockDriverState
The following series introduce a new file.node-name property in order to be able to give a name to each BlockDriverState of the graph. It also define undefined as a special value for node-name; a value that will be used to indicate to the management that it can not manipulate a node because it was not named. After this patchset is merged I would like to take care of presenting the graph to the management. (HMP /|| QMP) Eric: Do you have some ideas on this topic ? Best regards Benoît Benoît Canet (2): block: Add bs-node_name to hold the name of a bs node of the bs graph. block: Allow the user to define node-name option. block.c | 72 +++ block/blkverify.c | 2 +- block/iscsi.c | 2 +- block/vmdk.c | 2 +- block/vvfat.c | 4 +-- blockdev.c| 8 +++--- hw/block/xen_disk.c | 2 +- include/block/block.h | 3 +- include/block/block_int.h | 9 +- qemu-img.c| 6 ++-- qemu-io.c | 2 +- qemu-nbd.c| 2 +- 12 files changed, 79 insertions(+), 35 deletions(-) -- 1.8.3.2
[Qemu-devel] [PATCH 2/2] block: Allow the user to define node-name option.
As node-name is a separate name space as device-name we can enable it's definition right now: nobody will use it so no harm involved. Signed-off-by: Benoit Canet ben...@irqsave.net --- block.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/block.c b/block.c index 230e71a..132981f 100644 --- a/block.c +++ b/block.c @@ -885,7 +885,8 @@ int bdrv_file_open(BlockDriverState **pbs, const char *filename, options = qdict_new(); } -bs = bdrv_new(, NULL); +bs = bdrv_new(, qdict_get_try_str(options, node-name)); +qdict_del(options, node-name); bs-options = options; options = qdict_clone_shallow(options); @@ -1007,7 +1008,8 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp) sizeof(backing_filename)); } -bs-backing_hd = bdrv_new(, NULL); +bs-backing_hd = bdrv_new(, qdict_get_try_str(options, node-name)); +qdict_del(options, node-name); if (bs-backing_format[0] != '\0') { back_drv = bdrv_find_format(bs-backing_format); -- 1.8.3.2
[Qemu-devel] [PATCH 1/2] block: Add bs-node_name to hold the name of a bs node of the bs graph.
Add the minimum of code to prepare the followings patches. If no node_name is provided to bdrv_new the bs-node_name is set to undefined. This will allow to have some default string to communicate in QMP and HMP. This also make undefined a reserved string for bs-node_name. Signed-off-by: Benoit Canet ben...@irqsave.net --- block.c | 70 +++ block/blkverify.c | 2 +- block/iscsi.c | 2 +- block/vmdk.c | 2 +- block/vvfat.c | 4 +-- blockdev.c| 8 +++--- hw/block/xen_disk.c | 2 +- include/block/block.h | 3 +- include/block/block_int.h | 9 +- qemu-img.c| 6 ++-- qemu-io.c | 2 +- qemu-nbd.c| 2 +- 12 files changed, 77 insertions(+), 35 deletions(-) diff --git a/block.c b/block.c index fd05a80..230e71a 100644 --- a/block.c +++ b/block.c @@ -89,6 +89,9 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, static QTAILQ_HEAD(, BlockDriverState) bdrv_states = QTAILQ_HEAD_INITIALIZER(bdrv_states); +static QTAILQ_HEAD(, BlockDriverState) graph_bdrv_states = +QTAILQ_HEAD_INITIALIZER(graph_bdrv_states); + static QLIST_HEAD(, BlockDriver) bdrv_drivers = QLIST_HEAD_INITIALIZER(bdrv_drivers); @@ -318,14 +321,26 @@ void bdrv_register(BlockDriver *bdrv) } /* create a new block device (by default it is empty) */ -BlockDriverState *bdrv_new(const char *device_name) +BlockDriverState *bdrv_new(const char *device_name, const char *node_name) { BlockDriverState *bs; bs = g_malloc0(sizeof(BlockDriverState)); pstrcpy(bs-device_name, sizeof(bs-device_name), device_name); if (device_name[0] != '\0') { -QTAILQ_INSERT_TAIL(bdrv_states, bs, list); +QTAILQ_INSERT_TAIL(bdrv_states, bs, device_list); +} +/* if node name is given store it in bs and insert bs in the graph bs list + * note: undefined is a reserved node name + */ +if (node_name +node_name[0] != '\0' +strcmp(node_name, undefined)) { +pstrcpy(bs-node_name, sizeof(bs-node_name), node_name); +QTAILQ_INSERT_TAIL(graph_bdrv_states, bs, node_list); +/* else set the bs node name to undefined for QMP and HMP */ +} else { +sprintf(bs-node_name, undefined); } bdrv_iostatus_disable(bs); notifier_list_init(bs-close_notifiers); @@ -870,7 +885,7 @@ int bdrv_file_open(BlockDriverState **pbs, const char *filename, options = qdict_new(); } -bs = bdrv_new(); +bs = bdrv_new(, NULL); bs-options = options; options = qdict_clone_shallow(options); @@ -992,7 +1007,7 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp) sizeof(backing_filename)); } -bs-backing_hd = bdrv_new(); +bs-backing_hd = bdrv_new(, NULL); if (bs-backing_format[0] != '\0') { back_drv = bdrv_find_format(bs-backing_format); @@ -1062,7 +1077,7 @@ int bdrv_open(BlockDriverState *bs, const char *filename, QDict *options, instead of opening 'filename' directly */ /* if there is a backing file, use it */ -bs1 = bdrv_new(); +bs1 = bdrv_new(, NULL); ret = bdrv_open(bs1, filename, NULL, 0, drv, local_err); if (ret 0) { bdrv_unref(bs1); @@ -1495,7 +1510,7 @@ void bdrv_close_all(void) { BlockDriverState *bs; -QTAILQ_FOREACH(bs, bdrv_states, list) { +QTAILQ_FOREACH(bs, bdrv_states, device_list) { bdrv_close(bs); } } @@ -1524,7 +1539,7 @@ static bool bdrv_requests_pending(BlockDriverState *bs) static bool bdrv_requests_pending_all(void) { BlockDriverState *bs; -QTAILQ_FOREACH(bs, bdrv_states, list) { +QTAILQ_FOREACH(bs, bdrv_states, device_list) { if (bdrv_requests_pending(bs)) { return true; } @@ -1554,7 +1569,7 @@ void bdrv_drain_all(void) /* FIXME: We do not have timer support here, so this is effectively * a busy wait. */ -QTAILQ_FOREACH(bs, bdrv_states, list) { +QTAILQ_FOREACH(bs, bdrv_states, device_list) { if (bdrv_start_throttled_reqs(bs)) { busy = true; } @@ -1570,7 +1585,7 @@ void bdrv_drain_all(void) void bdrv_make_anon(BlockDriverState *bs) { if (bs-device_name[0] != '\0') { -QTAILQ_REMOVE(bdrv_states, bs, list); +QTAILQ_REMOVE(bdrv_states, bs, device_list); } bs-device_name[0] = '\0'; } @@ -1626,7 +1641,12 @@ static void bdrv_move_feature_fields(BlockDriverState *bs_dest, /* keep the same entry in bdrv_states */ pstrcpy(bs_dest-device_name, sizeof(bs_dest-device_name), bs_src-device_name); -bs_dest-list = bs_src-list; +bs_dest-device_list = bs_src-device_list; + +/* keep the same entry in graph_bdrv_states */ +
Re: [Qemu-devel] How to introduce bs-node_name ?
Le Monday 04 Nov 2013 à 19:33:21 (+0800), Fam Zheng a écrit : On 11/04/2013 07:06 PM, Kevin Wolf wrote: Am 04.11.2013 um 10:48 hat Fam Zheng geschrieben: On 11/04/2013 05:31 PM, Stefan Hajnoczi wrote: On Wed, Oct 30, 2013 at 02:49:32PM +0100, Markus Armbruster wrote: The first proposal is to add another parameter, say id. Users can then refer either to an arbitrary BDS by id, or (for backward compatibility) to the root BDS by device. When the code sees device, it'll look up the BB, then fetch its root BDS. CON: Existing parameter device becomes compatibility cruft. PRO: Clean and obvious semantics (in my opinion). This proposal gets my vote. The second proposal is to press the existing parameter device into service for referring to BDS node_name. To keep backward compatibility, we obviously need to ensure that whenever the old code accepts a value of device, the new code accepts it as well, and both resolve it to the same BDS. Different legacy commands given the same device name might need to operate on different nodes. Could you give an example for this? Dynamic renaming does not solve this problem, so I'm not convinced we can always choose a device name matching a node name. Device name commands are higher-level than graph node commands. For example, block_set_io_throttle makes sense on a device but less sense on a graph node, unless we add the implicit assumption that the new throttling node is created on top of the given node or updated in place if the throttling node already exists (!!). Throttling a node could be useful too, for example if we want to throttle backing_hd which is on shared storage, but not to throttle on the local image. My ignorant question is: Why can't we just use one namespace, make sure no name collision between node_name and device_name, or even just drop device_name, so we treat the root node's node_name as device_name? For commands that only accept a device, this can be enforced in its implementation by checking against the whole graph to verify this. Markus described it somewhere in this thread: Live snapshots. Currently, the device_name moves to the new BDS on the top (and compatibility requires us to keep it that way), whereas a node name should, of course, stay at its node. When you consider this, the single namespace, as much as I would have loved it, is pretty much dead. Good everyone agree on the direction to take. I'll write some code. Best regards Benoît Thanks for explaining (again). I get the reason now. Fam
Re: [Qemu-devel] [PATCH] qapi-schema: Update description for NewImageMode
Le Thursday 07 Nov 2013 à 19:47:48 (+0100), Max Reitz a écrit : If the NewImageMode is absolute-paths but no backing file is available (e.g., when mirroring a device with an unbacked image), the target image will not be backed either. This patch updates the documentation in qapi-schema.json accordingly. Signed-off-by: Max Reitz mre...@redhat.com --- Follow-up to: - block/drive-mirror: Check for NULL backing_hd --- qapi-schema.json | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/qapi-schema.json b/qapi-schema.json index 81a375b..dde8e45 100644 --- a/qapi-schema.json +++ b/qapi-schema.json @@ -1736,7 +1736,8 @@ # @existing: QEMU should look for an existing image file. # # @absolute-paths: QEMU should create a new image with absolute paths -# for the backing file. +# for the backing file. If there is no backing file available, the new +# image will not be backed either. # # Since: 1.1 ## -- 1.8.4.2 Reviewed-by: Benoit Canet ben...@irqsave.net
Re: [Qemu-devel] [PATCH] util/error: Save errno from clobbering
Le Thursday 07 Nov 2013 à 20:10:29 (+0100), Max Reitz a écrit : There may be calls to error_setg() and especially error_setg_errno() which blindly (and until now wrongly) assume these functions not to clobber errno (e.g., they pass errno to error_setg_errno() and return -errno afterwards). Instead of trying to find and fix all of these constructs, just make sure error_setg() and error_setg_errno() indeed do not clobber errno. Suggested-by: Eric Blake ebl...@redhat.com Signed-off-by: Max Reitz mre...@redhat.com --- util/error.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/util/error.c b/util/error.c index ec0faa6..3ee362a 100644 --- a/util/error.c +++ b/util/error.c @@ -27,6 +27,7 @@ void error_set(Error **errp, ErrorClass err_class, const char *fmt, ...) { Error *err; va_list ap; +int saved_errno = errno; if (errp == NULL) { return; @@ -41,6 +42,8 @@ void error_set(Error **errp, ErrorClass err_class, const char *fmt, ...) err-err_class = err_class; *errp = err; + +errno = saved_errno; } void error_set_errno(Error **errp, int os_errno, ErrorClass err_class, @@ -49,6 +52,7 @@ void error_set_errno(Error **errp, int os_errno, ErrorClass err_class, Error *err; char *msg1; va_list ap; +int saved_errno = errno; if (errp == NULL) { return; @@ -69,6 +73,8 @@ void error_set_errno(Error **errp, int os_errno, ErrorClass err_class, err-err_class = err_class; *errp = err; + +errno = saved_errno; } void error_setg_file_open(Error **errp, int os_errno, const char *filename) -- 1.8.4.2 Yes this look better than trying to fix all callers. Reviewed-by: Benoit Canet ben...@irqsave.net
Re: [Qemu-devel] [PATCH 1/2] block: Add bs-node_name to hold the name of a bs node of the bs graph.
Le Thursday 07 Nov 2013 à 13:23:43 (-0700), Eric Blake a écrit : On 11/07/2013 08:01 AM, Benoît Canet wrote: Add the minimum of code to prepare the followings patches. If no node_name is provided to bdrv_new the bs-node_name is set to undefined. This will allow to have some default string to communicate in QMP and HMP. This also make undefined a reserved string for bs-node_name. Signed-off-by: Benoit Canet ben...@irqsave.net --- +/* if node name is given store it in bs and insert bs in the graph bs list + * note: undefined is a reserved node name + */ +if (node_name +node_name[0] != '\0' +strcmp(node_name, undefined)) { +pstrcpy(bs-node_name, sizeof(bs-node_name), node_name); +QTAILQ_INSERT_TAIL(graph_bdrv_states, bs, node_list); +/* else set the bs node name to undefined for QMP and HMP */ +} else { +sprintf(bs-node_name, undefined); strcpy() is more efficient than sprintf() with no % in the format string. +/* This function is to find a node in the bs graph */ +BlockDriverState *bdrv_find_node(const char *node_name) +{ +BlockDriverState *bs; + Should this function assert that node_name is not undefined? +++ b/include/block/block_int.h @@ -297,11 +297,18 @@ struct BlockDriverState { BlockdevOnError on_read_error, on_write_error; bool iostatus_enabled; BlockDeviceIoStatus iostatus; + +/* the following member give a name to every node on the BlockDriverState s/give/gives/ + * graph. + */ +char node_name[32]; +QTAILQ_ENTRY(BlockDriverState) node_list; +/* Device name is the name associated with the drive the guest see */ s/see/sees/ char device_name[32]; +QTAILQ_ENTRY(BlockDriverState) device_list; Maybe I missed it, but is there code that ensures there are no duplicate node names (other than the magic undefined)? Your are right I will add some checkings. Seems like it's moving in the right direction, although I'm not sure it's worth applying this until we know the qapi for working with node names makes sense. I am not seeking for fast commit of these patch, I am only trying to avoid posting a long series at once. Best regards Benoît -- Eric Blake eblake redhat com+1-919-301-3266 Libvirt virtualization library http://libvirt.org
Re: [Qemu-devel] [PATCH 1/2] block: Add bs-node_name to hold the name of a bs node of the bs graph.
Le Thursday 07 Nov 2013 à 15:54:09 (-0500), Jeff Cody a écrit : On Thu, Nov 07, 2013 at 04:01:42PM +0100, Benoît Canet wrote: Add the minimum of code to prepare the followings patches. If no node_name is provided to bdrv_new the bs-node_name is set to undefined. This will allow to have some default string to communicate in QMP and HMP. This also make undefined a reserved string for bs-node_name. Hi Benoît, Is it necessary to have a reserved string, or would an empty null-terminated string be able to implicitly denote the name as undefined? This would works too and just report the undefined logic to hmp printing code. I'll do this. Signed-off-by: Benoit Canet ben...@irqsave.net --- block.c | 70 +++ block/blkverify.c | 2 +- block/iscsi.c | 2 +- block/vmdk.c | 2 +- block/vvfat.c | 4 +-- blockdev.c| 8 +++--- hw/block/xen_disk.c | 2 +- include/block/block.h | 3 +- include/block/block_int.h | 9 +- qemu-img.c| 6 ++-- qemu-io.c | 2 +- qemu-nbd.c| 2 +- 12 files changed, 77 insertions(+), 35 deletions(-) diff --git a/block.c b/block.c index fd05a80..230e71a 100644 --- a/block.c +++ b/block.c @@ -89,6 +89,9 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, static QTAILQ_HEAD(, BlockDriverState) bdrv_states = QTAILQ_HEAD_INITIALIZER(bdrv_states); +static QTAILQ_HEAD(, BlockDriverState) graph_bdrv_states = +QTAILQ_HEAD_INITIALIZER(graph_bdrv_states); + static QLIST_HEAD(, BlockDriver) bdrv_drivers = QLIST_HEAD_INITIALIZER(bdrv_drivers); @@ -318,14 +321,26 @@ void bdrv_register(BlockDriver *bdrv) } /* create a new block device (by default it is empty) */ -BlockDriverState *bdrv_new(const char *device_name) +BlockDriverState *bdrv_new(const char *device_name, const char *node_name) { BlockDriverState *bs; bs = g_malloc0(sizeof(BlockDriverState)); pstrcpy(bs-device_name, sizeof(bs-device_name), device_name); if (device_name[0] != '\0') { -QTAILQ_INSERT_TAIL(bdrv_states, bs, list); +QTAILQ_INSERT_TAIL(bdrv_states, bs, device_list); +} +/* if node name is given store it in bs and insert bs in the graph bs list + * note: undefined is a reserved node name + */ +if (node_name +node_name[0] != '\0' +strcmp(node_name, undefined)) { +pstrcpy(bs-node_name, sizeof(bs-node_name), node_name); +QTAILQ_INSERT_TAIL(graph_bdrv_states, bs, node_list); +/* else set the bs node name to undefined for QMP and HMP */ +} else { +sprintf(bs-node_name, undefined); } bdrv_iostatus_disable(bs); notifier_list_init(bs-close_notifiers); @@ -870,7 +885,7 @@ int bdrv_file_open(BlockDriverState **pbs, const char *filename, options = qdict_new(); } -bs = bdrv_new(); +bs = bdrv_new(, NULL); bs-options = options; options = qdict_clone_shallow(options); @@ -992,7 +1007,7 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp) sizeof(backing_filename)); } -bs-backing_hd = bdrv_new(); +bs-backing_hd = bdrv_new(, NULL); if (bs-backing_format[0] != '\0') { back_drv = bdrv_find_format(bs-backing_format); @@ -1062,7 +1077,7 @@ int bdrv_open(BlockDriverState *bs, const char *filename, QDict *options, instead of opening 'filename' directly */ /* if there is a backing file, use it */ -bs1 = bdrv_new(); +bs1 = bdrv_new(, NULL); ret = bdrv_open(bs1, filename, NULL, 0, drv, local_err); if (ret 0) { bdrv_unref(bs1); @@ -1495,7 +1510,7 @@ void bdrv_close_all(void) { BlockDriverState *bs; -QTAILQ_FOREACH(bs, bdrv_states, list) { +QTAILQ_FOREACH(bs, bdrv_states, device_list) { bdrv_close(bs); } } @@ -1524,7 +1539,7 @@ static bool bdrv_requests_pending(BlockDriverState *bs) static bool bdrv_requests_pending_all(void) { BlockDriverState *bs; -QTAILQ_FOREACH(bs, bdrv_states, list) { +QTAILQ_FOREACH(bs, bdrv_states, device_list) { if (bdrv_requests_pending(bs)) { return true; } @@ -1554,7 +1569,7 @@ void bdrv_drain_all(void) /* FIXME: We do not have timer support here, so this is effectively * a busy wait. */ -QTAILQ_FOREACH(bs, bdrv_states, list) { +QTAILQ_FOREACH(bs, bdrv_states, device_list) { if (bdrv_start_throttled_reqs(bs)) { busy = true; } @@ -1570,7 +1585,7
[Qemu-devel] [RFC V2 0/3] Giving names to BlockDriverState graph nodes
This series fix the issues of V1 and add the skeletton of a new QMP command useful to communicate a bs graph to libvirt. The minimum is exposed to the management to avoid binding it too much with QEMU. Tell me if you think it's worth implementing. Best regards Benoît v2: s/give/gives/ [Eric] s/see/sees/ [Eric] prevent duplicate node_name [Eric] drop undefined and use [Eric, Kevin, Jeff] remove from graph list on bdrv_make_anon [Jeff] comment the two chains [Fam] Add new command stub to retrieve the graph from libvirt [Benoît] Benoît Canet (3): block: Add bs-node_name to hold the name of a bs node of the bs graph. block: Allow the user to define node-name option. qapi: Add skeletton of command to query a drive bs graph. block.c | 88 +-- block/blkverify.c | 2 +- block/iscsi.c | 2 +- block/vmdk.c | 2 +- block/vvfat.c | 4 +-- blockdev.c| 16 ++--- hw/block/xen_disk.c | 2 +- include/block/block.h | 3 +- include/block/block_int.h | 9 - qapi-schema.json | 32 + qemu-img.c| 6 ++-- qemu-io.c | 2 +- qemu-nbd.c| 2 +- 13 files changed, 134 insertions(+), 36 deletions(-) -- 1.8.3.2
[Qemu-devel] [RFC V2 2/3] block: Allow the user to define node-name option.
Signed-off-by: Benoit Canet ben...@irqsave.net --- block.c | 20 ++-- 1 file changed, 18 insertions(+), 2 deletions(-) diff --git a/block.c b/block.c index c397ee9..6690e3d 100644 --- a/block.c +++ b/block.c @@ -872,6 +872,7 @@ int bdrv_file_open(BlockDriverState **pbs, const char *filename, const char *drvname; bool allow_protocol_prefix = false; Error *local_err = NULL; +const char *node_name = NULL; int ret; /* NULL means an empty set of options */ @@ -879,7 +880,14 @@ int bdrv_file_open(BlockDriverState **pbs, const char *filename, options = qdict_new(); } -bs = bdrv_new(, ); +node_name = qdict_get_try_str(options, node-name); +if (node_name bdrv_find_node(node_name)) { +error_setg(errp, Duplicate node name); +return -EINVAL; +} +bs = bdrv_new(, node_name ? node_name : ); +qdict_del(options, node-name); + bs-options = options; options = qdict_clone_shallow(options); @@ -979,6 +987,7 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp) int back_flags, ret; BlockDriver *back_drv = NULL; Error *local_err = NULL; +const char *node_name = NULL; if (bs-backing_hd != NULL) { QDECREF(options); @@ -1001,7 +1010,14 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp) sizeof(backing_filename)); } -bs-backing_hd = bdrv_new(, ); +node_name = qdict_get_try_str(options, node-name); +if (node_name bdrv_find_node(node_name)) { +error_setg(errp, Duplicate node name); +QDECREF(options); +return -EINVAL; +} +bs-backing_hd = bdrv_new(, node_name ? node_name : ); +qdict_del(options, node-name); if (bs-backing_format[0] != '\0') { back_drv = bdrv_find_format(bs-backing_format); -- 1.8.3.2
[Qemu-devel] [RFC V2 1/3] block: Add bs-node_name to hold the name of a bs node of the bs graph.
Add the minimum of code to prepare the followings patches. Signed-off-by: Benoit Canet ben...@irqsave.net --- block.c | 72 ++- block/blkverify.c | 2 +- block/iscsi.c | 2 +- block/vmdk.c | 2 +- block/vvfat.c | 4 +-- blockdev.c| 8 +++--- hw/block/xen_disk.c | 2 +- include/block/block.h | 3 +- include/block/block_int.h | 9 +- qemu-img.c| 6 ++-- qemu-io.c | 2 +- qemu-nbd.c| 2 +- 12 files changed, 78 insertions(+), 36 deletions(-) diff --git a/block.c b/block.c index fd05a80..c397ee9 100644 --- a/block.c +++ b/block.c @@ -89,6 +89,9 @@ static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, static QTAILQ_HEAD(, BlockDriverState) bdrv_states = QTAILQ_HEAD_INITIALIZER(bdrv_states); +static QTAILQ_HEAD(, BlockDriverState) graph_bdrv_states = +QTAILQ_HEAD_INITIALIZER(graph_bdrv_states); + static QLIST_HEAD(, BlockDriver) bdrv_drivers = QLIST_HEAD_INITIALIZER(bdrv_drivers); @@ -318,14 +321,20 @@ void bdrv_register(BlockDriver *bdrv) } /* create a new block device (by default it is empty) */ -BlockDriverState *bdrv_new(const char *device_name) +BlockDriverState *bdrv_new(const char *device_name, const char *node_name) { BlockDriverState *bs; +assert(node_name); + bs = g_malloc0(sizeof(BlockDriverState)); pstrcpy(bs-device_name, sizeof(bs-device_name), device_name); if (device_name[0] != '\0') { -QTAILQ_INSERT_TAIL(bdrv_states, bs, list); +QTAILQ_INSERT_TAIL(bdrv_states, bs, device_list); +} +pstrcpy(bs-node_name, sizeof(bs-node_name), node_name); +if (node_name[0] != '\0') { +QTAILQ_INSERT_TAIL(graph_bdrv_states, bs, node_list); } bdrv_iostatus_disable(bs); notifier_list_init(bs-close_notifiers); @@ -870,7 +879,7 @@ int bdrv_file_open(BlockDriverState **pbs, const char *filename, options = qdict_new(); } -bs = bdrv_new(); +bs = bdrv_new(, ); bs-options = options; options = qdict_clone_shallow(options); @@ -992,7 +1001,7 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict *options, Error **errp) sizeof(backing_filename)); } -bs-backing_hd = bdrv_new(); +bs-backing_hd = bdrv_new(, ); if (bs-backing_format[0] != '\0') { back_drv = bdrv_find_format(bs-backing_format); @@ -1062,7 +1071,7 @@ int bdrv_open(BlockDriverState *bs, const char *filename, QDict *options, instead of opening 'filename' directly */ /* if there is a backing file, use it */ -bs1 = bdrv_new(); +bs1 = bdrv_new(, ); ret = bdrv_open(bs1, filename, NULL, 0, drv, local_err); if (ret 0) { bdrv_unref(bs1); @@ -1495,7 +1504,7 @@ void bdrv_close_all(void) { BlockDriverState *bs; -QTAILQ_FOREACH(bs, bdrv_states, list) { +QTAILQ_FOREACH(bs, bdrv_states, device_list) { bdrv_close(bs); } } @@ -1524,7 +1533,7 @@ static bool bdrv_requests_pending(BlockDriverState *bs) static bool bdrv_requests_pending_all(void) { BlockDriverState *bs; -QTAILQ_FOREACH(bs, bdrv_states, list) { +QTAILQ_FOREACH(bs, bdrv_states, device_list) { if (bdrv_requests_pending(bs)) { return true; } @@ -1554,7 +1563,7 @@ void bdrv_drain_all(void) /* FIXME: We do not have timer support here, so this is effectively * a busy wait. */ -QTAILQ_FOREACH(bs, bdrv_states, list) { +QTAILQ_FOREACH(bs, bdrv_states, device_list) { if (bdrv_start_throttled_reqs(bs)) { busy = true; } @@ -1565,14 +1574,18 @@ void bdrv_drain_all(void) } } -/* make a BlockDriverState anonymous by removing from bdrv_state list. +/* make a BlockDriverState anonymous by removing from bdrv_state and + * graph_bdrv_state list. Also, NULL terminate the device_name to prevent double remove */ void bdrv_make_anon(BlockDriverState *bs) { if (bs-device_name[0] != '\0') { -QTAILQ_REMOVE(bdrv_states, bs, list); +QTAILQ_REMOVE(bdrv_states, bs, device_list); } bs-device_name[0] = '\0'; +if (bs-node_name[0] != '\0') { +QTAILQ_REMOVE(graph_bdrv_states, bs, node_list); +} } static void bdrv_rebind(BlockDriverState *bs) @@ -1626,7 +1639,12 @@ static void bdrv_move_feature_fields(BlockDriverState *bs_dest, /* keep the same entry in bdrv_states */ pstrcpy(bs_dest-device_name, sizeof(bs_dest-device_name), bs_src-device_name); -bs_dest-list = bs_src-list; +bs_dest-device_list = bs_src-device_list; + +/* keep the same entry in graph_bdrv_states */ +pstrcpy(bs_dest-node_name, sizeof(bs_dest-node_name), +bs_src-node_name); +bs_dest-node_list = bs_src-node_list; }
[Qemu-devel] [RFC V2 3/3] qapi: Add skeletton of command to query a drive bs graph.
--- blockdev.c | 8 qapi-schema.json | 32 2 files changed, 40 insertions(+) diff --git a/blockdev.c b/blockdev.c index 911ee7e..bfaeda0 100644 --- a/blockdev.c +++ b/blockdev.c @@ -1938,6 +1938,14 @@ void qmp_drive_backup(const char *device, const char *target, } } +BlockGraphNode *qmp_query_drive_graph(const char *device, Error **errp) +{ +/* the implementation of this function would recurse through the + * BlockDriverState graph to build it's result + */ +return NULL; +} + #define DEFAULT_MIRROR_BUF_SIZE (10 20) void qmp_drive_mirror(const char *device, const char *target, diff --git a/qapi-schema.json b/qapi-schema.json index 60f3fd1..bca3d5a 100644 --- a/qapi-schema.json +++ b/qapi-schema.json @@ -1983,6 +1983,38 @@ { 'command': 'drive-backup', 'data': 'DriveBackup' } ## +# @BlockGraphNode +# +# Information about a node of the block driver state graph +# +# @node-name: the name of the node in the graph +# +# @drv: the name of the block format used by this node as described in +# @BlockDeviceInfo. +# +# @children: a list of @BlockGraphNode being the children of this node +# +# Since 1.8 +## +{ 'type': 'BlockGraphNode', + 'data': { 'node-name': 'str', 'drv': 'str', 'children': ['BlockGraphNode'] } } + +## +# @query-drive-graph +# +# Get the block driver states graph for a given drive +# +# @device: the name of the device to get the graph from +# +# Returns: the root @BlockGraphNode +# +# Since 1.8 +## +{ 'command': 'query-drive-graph', + 'data': { 'device': 'str' }, + 'returns': 'BlockGraphNode' } + +## # @drive-mirror # # Start mirroring a block device's writes to a new destination. -- 1.8.3.2
Re: [Qemu-devel] [PATCH for 1.7] target-i386: do not override nr_cores for -cpu host
Le Tuesday 19 Nov 2013 à 17:49:46 (+0100), Paolo Bonzini a écrit : Commit 787aaf5 (target-i386: forward CPUID cache leaves when -cpu host is used, 2013-09-02) brings bits 31..26 of CPUID leaf 04h out of sync with the APIC IDs that QEMU reserves for each package. This number must come from -smp options rather than from the host CPUID. It also turns out that this unsyncing makes Windows Server 2012R2 fail to boot. Tested-by: Peter Lieven p...@kamp.de Signed-off-by: Paolo Bonzini pbonz...@redhat.com --- Resending because the mailing list ate the message. target-i386/cpu.c | 14 -- 1 file changed, 8 insertions(+), 6 deletions(-) diff --git a/target-i386/cpu.c b/target-i386/cpu.c index 864c80e..8df6747 100644 --- a/target-i386/cpu.c +++ b/target-i386/cpu.c @@ -2086,14 +2086,10 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, /* cache info: needed for Core compatibility */ if (cpu-cache_info_passthrough) { host_cpuid(index, count, eax, ebx, ecx, edx); -break; -} -if (cs-nr_cores 1) { -*eax = (cs-nr_cores - 1) 26; +*eax = ~0xFC00; } else { *eax = 0; -} -switch (count) { +switch (count) { case 0: /* L1 dcache info */ *eax |= CPUID_4_TYPE_DCACHE | \ CPUID_4_LEVEL(1) | \ @@ -2133,6 +2129,12 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, uint32_t count, *ecx = 0; *edx = 0; break; +} +} + +/* QEMU gives out its own APIC IDs, never pass down bits 31..26. */ Maybe a comment about the eax 31 would help the next sould who will read this code. +if ((*eax 31) cs-nr_cores 1) { +*eax |= (cs-nr_cores - 1) 26; } break; case 5: -- 1.8.4.2 Apart from this Reviewed-by: Benoit Canet ben...@irqsave.net
Re: [Qemu-devel] [RFC V8 01/24] qcow2: Add journal specification.
+QCOW2 can use one or more instance of a metadata journal. s/instance/instances/ Is there a reason to use multiple journals rather than a single journal for all entry types? The single journal area avoids seeks. Here are the main reason for this: For the deduplication some patterns like cycles of insertion/deletion could leave the hash table almost empty while filling the journal. If the journal is full and the hash table is empty a packing operation is started. Basically a new journal is created and only the entry presents in the hash table are reinserted. This is why I want to keep the deduplication journal appart from regular qcow2 journal: to avoid interferences between a pack operation and regular qcow2 journal entries. The other thing is that freezing the log store would need a replay of regular qcow2 entries as it trigger a reset of the journal. Also since deduplication will not work on spinning disk I discarded the seek time factor. Maybe commiting the dedupe journal by erase block sized chunk would be a good idea to reduce random writes to the SSD. The additional reason for having multiple journals is that the SILT paper propose a mode where prefix of the hash is used to dispatch insertions in multiples store and it easier to do with multiple journals. + +A journal is a sequential log of journal entries appended on a previously +allocated and reseted area. I think you say previously reset area instead of reseted. Another option is initialized area. +A journal is designed like a linked list with each entry pointing to the next +so it's easy to iterate over entries. + +A journal uses the following constants to denote the type of each entry + +TYPE_NONE = 0xFF default value of any bytes in a reseted journal +TYPE_END = 1 the entry ends a journal cluster and point to the next + cluster +TYPE_HASH = 2 the entry contains a deduplication hash + +QCOW2 journal entry: + +Byte 0 :Size of the entry: size = 2 + n with size = 254 This is not clear. I'm wondering if the +2 is included in the byte value or not. I'm also wondering what a byte value of zero means and what a byte value of 255 means. I am counting the journal entry header in the size. So yes the +2 is in the byte value. A byte value of zero, 1 or 255 is an error. Maybe this design is bogus and I should only count the payload size in the size field. It would make less tricky cases. Please include an example to illustrate how this field works. + + 1 :Type of the entry + + 2 - size :The optional n bytes structure carried by entry + +A journal is divided into clusters and no journal entry can be spilled on two +clusters. This avoid having to read more than one cluster to get a single entry. + +For this purpose an entry with the end type is added at the end of a journal +cluster before starting to write in the next cluster. +The size of such an entry is set so the entry points to the next cluster. + +As any journal cluster must be ended with an end entry the size of regular +journal entries is limited to 254 bytes in order to always left room for an end +entry which mimimal size is two bytes. + +The only cases where size 254 are none entries where size = 255. + +The replay of a journal stop when the first end none entry is reached. s/stop/stops/ +The journal cluster size is 4096 bytes. Questions about this layout: 1. Journal entries have no integrity mechanism, which is especially important if they span physical sectors where cheap disks may perform a partial write. This would leave a corrupt journal. If the last bytes are a checksum then you can get some confidence that the entry was fully written and is valid. I will add a checksum mecanism. Do you have any preferences regarding the checksum function ? Did I miss something? 2. Byte-granularity means that read-modify-write is necessary to append entries to the journal. Therefore a failure could destroy previously committed entries. It's designed to be committed by 4KB blocks. Any ideas how existing journals handle this?
Re: [Qemu-devel] [RFC V8 01/24] qcow2: Add journal specification.
2. Byte-granularity means that read-modify-write is necessary to append entries to the journal. Therefore a failure could destroy previously committed entries. Any ideas how existing journals handle this? You commit only whole blocks. So in this case we can consider a block only committed as soon as a TYPE_END entry has been written (and after that we won't touch it any more until the journalled changes have been flushed to disk). There's one interesting case: cache=writethrough. I'm not entirely sure yet what to do with it, but it's slow anyway, so using one block per entry and therefore flushing the journal very often might actually be not totally unreasonable. This sure would finish to kill the performance because this would be an io per metadata written to disk. Another thing I'm not sure about is whether a fixed 4k block is good or if we should leave it configurable. I don't think making it an option would hurt (not necessarily modifyable with qemu-img, but as a field in the file format). I agree. I also think about make the number of block to be flushed at once configurable. Benoît
Re: [Qemu-devel] [RFC V8 01/24] qcow2: Add journal specification.
Care to explain that in more detail? Why shouldn't it work on spinning disks? Hash are random they introduce random read access. With a QCOW2 cluster size of 4KB the deduplication code when writting duplicated data will do one random read per 4KB block to deduplicate. A server grade hardisk is rated for 250 iops. This traduce in 1MB/s of deduplicated data. Not very usable. On the contrary a samsung 840 pro SSD is rated for 80k iops of random read. That should traduce in 320MB/s of potentially deduplicated data. Havind dedup metadata on SSD and actual data on disk would solve the problem but it would need block backend. Benoît
Re: [Qemu-devel] [RFC V8 01/24] qcow2: Add journal specification.
Does this mean the journal forms the first-stage data structure for deduplication? Dedup records will accumulate in the journal until it becomes time to convert them in bulk into a more compact representation? The journal is mainly used to persist the last inserted dedup metadata across QEMU stop and restart. I replay it at startup to rebuild the hash table. So yes it's the first stage even it's never used for regular queries. When I read this specification I was thinking of a journal purely for logging operations. You could use a commit record to mark previous records applied. Upon startup, qcow2 would inspect uncommitted records and deal with them. Maybe that could help regular QCOW2 usage. I don't know.
Re: [Qemu-devel] [RFC V8 01/24] qcow2: Add journal specification.
By the way, I don't know much about journalling techniques. So I'm asking you these questions so that either you can answer them straight away or because they might warrant a look at existing journal implementations like: I tried to so something simple and performing for the deduplication usage. That explain that there is no concept of transaction and that the journal's block are flushed asynchronously in order to have an high insertion rate. I agree with your previous comment is more a log than a journal. https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/fs/jbd2 http://www.sqlite.org/cgi/src/dir?name=src http://blitiri.com.ar/p/libjio/ I will try to find a paper on journal design. Benoît
Re: [Qemu-devel] [RFC V8 01/24] qcow2: Add journal specification.
Simple is good. Even for deduplication alone, I think data integrity is critical - otherwise we risk stale dedup metadata pointing to clusters that are unallocated or do not contain the right data. So the journal will probably need to follow techniques for commits/checksums. I agree that checksums are missing for the dedup. Maybe we could even use some kind of error correcting code instead of a checksum. Concerning data integrity the events that the deduplication code cannot loose are hash deletions because they mark a previously inserted hash as obsolete. The problem with a commit/flush mechanism on hash deletion is that it will slow down the store insertion speed and also create some extra SSD wear out. To solve this I considered the fact that the dedup metadata as a whole is disposable. So I implemented a dedup dirty bit. When QEMU stop the journal is flushed and the dirty bit is cleared. When QEMU start and the dirty bit is set a crash is detected and _all_ the deduplication metadata is dropped. The QCOW2 data integrity won't suffer only the dedup ratio will be lower. As you said once on irc crashes don't happen often. Benoît
[Qemu-devel] Alternative mac address setting
Hello, I want to implement an alternative mac address setting to allow a guest using virtio-net to change it's mac address by itself. The main use case is high availability setups where a slave machine take the lead when the master is failing. (heartbeat) The thing that an alternate mac address setting would allow to do is to make the switch _without_ requiring a gratuitous arp. The slave guest would simply take the master guest mac address and resume operations. I initially though about implementing a list of alternate mac address but I fear that in this context lists are frowed upon. Do you have any advice on how to implement this feature ? Should I support old net device declaration ? What is the strict minimum changes that would need to be done to implement the feature ? Best regards Benoît
Re: [Qemu-devel] Alternative mac address setting
Sorry for the gibberish I misread the code. Please ignore this mail. Best regards Benoît Le Tuesday 16 Jul 2013 à 13:36:49 (+0200), Benoît Canet a écrit : Hello, I want to implement an alternative mac address setting to allow a guest using virtio-net to change it's mac address by itself. The main use case is high availability setups where a slave machine take the lead when the master is failing. (heartbeat) The thing that an alternate mac address setting would allow to do is to make the switch _without_ requiring a gratuitous arp. The slave guest would simply take the master guest mac address and resume operations. I initially though about implementing a list of alternate mac address but I fear that in this context lists are frowed upon. Do you have any advice on how to implement this feature ? Should I support old net device declaration ? What is the strict minimum changes that would need to be done to implement the feature ? Best regards Benoît
Re: [Qemu-devel] [RFC V8 01/24] qcow2: Add journal specification.
Simple is good. Even for deduplication alone, I think data integrity is critical - otherwise we risk stale dedup metadata pointing to clusters that are unallocated or do not contain the right data. So the journal will probably need to follow techniques for commits/checksums. I'll add checksums to the journal and clean the journal entry size mess soon. For the transactional/commits aspect of the journal I think that we need Kevin's point of view on the subject. Best regards Benoît
[Qemu-devel] [PATCH 0/4] Leaky bucket throttling and features
The first patch fixes the throttling which was broken by a previous commit. The next patch replace the existing throttling algorithm by the well described leaky bucket algorithm. Third patch implement bursting by adding *_threshold parameters to qmp_block_set_io_throttle. The last one allow to define the max size of an io when throttling by iops via iops_sector_count. Benoît Canet (4): block: Repair the throttling code. block: Modify the throttling code to implement the leaky bucket algorithm. block: Add support for throttling burst threshold in QMP and the command line. block: Add iops_sector_count to do the iops accounting for a given io size. block.c | 424 + block/qapi.c | 28 +++ blockdev.c| 174 +-- hmp.c | 36 +++- include/block/block_int.h | 16 +- include/block/coroutine.h |5 + qapi-schema.json | 40 - qemu-coroutine-lock.c | 14 ++ qemu-options.hx |2 +- qmp-commands.hx | 34 +++- 10 files changed, 561 insertions(+), 212 deletions(-) -- 1.7.10.4
[Qemu-devel] [PATCH 1/4] block: Repair the throttling code.
The throttling code was segfaulting since commit 02ffb504485f0920cfc75a0982a602f824a9a4f4 because some qemu_co_queue_next caller does not run in a coroutine. qemu_co_queue_do_restart assume that the caller is a coroutinne. As sugested by Stefan fix this by entering the coroutine directly. Signed-off-by: Benoit Canet ben...@irqsave.net --- block.c |8 include/block/coroutine.h |5 + qemu-coroutine-lock.c | 14 ++ 3 files changed, 23 insertions(+), 4 deletions(-) diff --git a/block.c b/block.c index b560241..dc72643 100644 --- a/block.c +++ b/block.c @@ -127,7 +127,8 @@ void bdrv_io_limits_disable(BlockDriverState *bs) { bs-io_limits_enabled = false; -while (qemu_co_queue_next(bs-throttled_reqs)); +while (qemu_co_enter_next(bs-throttled_reqs)) { +} if (bs-block_timer) { qemu_del_timer(bs-block_timer); @@ -143,7 +144,7 @@ static void bdrv_block_timer(void *opaque) { BlockDriverState *bs = opaque; -qemu_co_queue_next(bs-throttled_reqs); +qemu_co_enter_next(bs-throttled_reqs); } void bdrv_io_limits_enable(BlockDriverState *bs) @@ -1445,8 +1446,7 @@ void bdrv_drain_all(void) * a busy wait. */ QTAILQ_FOREACH(bs, bdrv_states, list) { -if (!qemu_co_queue_empty(bs-throttled_reqs)) { -qemu_co_queue_restart_all(bs-throttled_reqs); +while (qemu_co_enter_next(bs-throttled_reqs)) { busy = true; } } diff --git a/include/block/coroutine.h b/include/block/coroutine.h index 377805a..66e331c 100644 --- a/include/block/coroutine.h +++ b/include/block/coroutine.h @@ -138,6 +138,11 @@ bool qemu_co_queue_next(CoQueue *queue); void qemu_co_queue_restart_all(CoQueue *queue); /** + * Enter the next coroutine in the queue + */ +bool qemu_co_enter_next(CoQueue *queue); + +/** * Checks if the CoQueue is empty. */ bool qemu_co_queue_empty(CoQueue *queue); diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c index d9fea49..c61d300 100644 --- a/qemu-coroutine-lock.c +++ b/qemu-coroutine-lock.c @@ -98,6 +98,20 @@ void qemu_co_queue_restart_all(CoQueue *queue) qemu_co_queue_do_restart(queue, false); } +bool qemu_co_enter_next(CoQueue *queue) +{ +Coroutine *next; + +next = QTAILQ_FIRST(queue-entries); +if (!next) { +return false; +} + +QTAILQ_REMOVE(queue-entries, next, co_queue_next); +qemu_coroutine_enter(next, NULL); +return true; +} + bool qemu_co_queue_empty(CoQueue *queue) { return (QTAILQ_FIRST(queue-entries) == NULL); -- 1.7.10.4
[Qemu-devel] [PATCH 2/4] block: Modify the throttling code to implement the leaky bucket algorithm.
This patch replace the previous algorithm by the well described leaky bucket algorithm: A bucket is filled by the incoming IOs and a periodic timer decrement the counter to make the bucket leak. When a given threshold is reached the bucket is full and the IOs are hold. In this patch the threshold is set to a default value to make the code behave like the previous implementation. In the next patch the threshold will be exposed in QMP to let the user control the burstiness of the throttling. Signed-off-by: Benoit Canet ben...@irqsave.net --- block.c | 410 + blockdev.c| 71 ++-- include/block/block_int.h | 15 +- 3 files changed, 299 insertions(+), 197 deletions(-) diff --git a/block.c b/block.c index dc72643..2d6e9b4 100644 --- a/block.c +++ b/block.c @@ -86,13 +86,6 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque); static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, int64_t sector_num, int nb_sectors); -static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors, -bool is_write, double elapsed_time, uint64_t *wait); -static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write, -double elapsed_time, uint64_t *wait); -static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors, -bool is_write, int64_t *wait); - static QTAILQ_HEAD(, BlockDriverState) bdrv_states = QTAILQ_HEAD_INITIALIZER(bdrv_states); @@ -101,6 +94,8 @@ static QLIST_HEAD(, BlockDriver) bdrv_drivers = /* If non-zero, use only whitelisted block drivers */ static int use_bdrv_whitelist; +/* boolean used to inform the throttling code that a bdrv_drain_all is issued */ +static bool draining; #ifdef _WIN32 static int is_windows_drive_prefix(const char *filename) @@ -135,15 +130,122 @@ void bdrv_io_limits_disable(BlockDriverState *bs) qemu_free_timer(bs-block_timer); bs-block_timer = NULL; } +} -bs-slice_start = 0; -bs-slice_end = 0; +static void bdrv_make_bps_buckets_leak(BlockDriverState *bs, int64_t delta) +{ +int64_t *bytes = bs-leaky_buckets.bytes; +int64_t read_leak, write_leak; + +/* the limit apply to both reads and writes */ +if (bs-io_limits.bps[BLOCK_IO_LIMIT_TOTAL]) { +/* compute half the total leak */ +int64_t leak = ((bs-io_limits.bps[BLOCK_IO_LIMIT_TOTAL] * delta) / + NANOSECONDS_PER_SECOND); +int remain = leak % 2; +leak /= 2; + +/* the read bucket is smaller than half the quantity to leak so take + * care adding the leak difference to write leak + */ +if (bytes[BLOCK_IO_LIMIT_READ] = leak) { +read_leak = bytes[BLOCK_IO_LIMIT_READ]; +write_leak = 2 * leak + remain - bytes[BLOCK_IO_LIMIT_READ]; +/* symetric case */ +} else if (bytes[BLOCK_IO_LIMIT_WRITE] = leak) { +write_leak = bytes[BLOCK_IO_LIMIT_WRITE]; +read_leak = 2 * leak + remain - bytes[BLOCK_IO_LIMIT_WRITE]; +/* both bucket above leak count use half the total leak for both */ +} else { +write_leak = leak; +read_leak = leak + remain; +} +/* else we consider that limits are separated */ +} else { +read_leak = (bs-io_limits.bps[BLOCK_IO_LIMIT_READ] * delta) / +NANOSECONDS_PER_SECOND; +write_leak = (bs-io_limits.bps[BLOCK_IO_LIMIT_WRITE] * delta) / + NANOSECONDS_PER_SECOND; +} + +/* make the buckets leak */ +bytes[BLOCK_IO_LIMIT_READ] = MAX(bytes[BLOCK_IO_LIMIT_READ] - read_leak, + 0); +bytes[BLOCK_IO_LIMIT_WRITE] = MAX(bytes[BLOCK_IO_LIMIT_WRITE] - write_leak, + 0); } +static void bdrv_make_iops_buckets_leak(BlockDriverState *bs, int64_t delta) +{ +double *ios = bs-leaky_buckets.ios; +int64_t read_leak, write_leak; + +/* the limit apply to both reads and writes */ +if (bs-io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) { +/* compute half the total leak */ +int64_t leak = ((bs-io_limits.iops[BLOCK_IO_LIMIT_TOTAL] * delta) / + NANOSECONDS_PER_SECOND); +int remain = leak % 2; +leak /= 2; + +/* the read bucket is smaller than half the quantity to leak so take + * care adding the leak difference to write leak + */ +if (ios[BLOCK_IO_LIMIT_READ] = leak) { +read_leak = ios[BLOCK_IO_LIMIT_READ]; +write_leak = 2 * leak + remain - ios[BLOCK_IO_LIMIT_READ]; +/* symetric case */ +} else if (ios[BLOCK_IO_LIMIT_WRITE] = leak) { +write_leak = ios[BLOCK_IO_LIMIT_WRITE]; +read_leak = 2 * leak + remain - ios[BLOCK_IO_LIMIT_WRITE]; +/* both bucket above leak count use half the total leak for both */ +} else { +write_leak
[Qemu-devel] [PATCH 3/4] block: Add support for throttling burst threshold in QMP and the command line.
The thresholds of the leaky bucket algorithm can be used to allow some burstiness. Signed-off-by: Benoit Canet ben...@irqsave.net --- block/qapi.c | 24 + blockdev.c | 105 +++--- hmp.c| 32 +++-- qapi-schema.json | 34 -- qemu-options.hx |2 +- qmp-commands.hx | 30 ++-- 6 files changed, 205 insertions(+), 22 deletions(-) diff --git a/block/qapi.c b/block/qapi.c index a4bc411..03f1604 100644 --- a/block/qapi.c +++ b/block/qapi.c @@ -235,6 +235,30 @@ void bdrv_query_info(BlockDriverState *bs, bs-io_limits.iops[BLOCK_IO_LIMIT_READ]; info-inserted-iops_wr = bs-io_limits.iops[BLOCK_IO_LIMIT_WRITE]; +info-inserted-has_bps_threshold = + bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL]; +info-inserted-bps_threshold = + bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL]; +info-inserted-has_bps_rd_threshold = + bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_READ]; +info-inserted-bps_rd_threshold = + bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_READ]; +info-inserted-has_bps_wr_threshold = + bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_WRITE]; +info-inserted-bps_wr_threshold = + bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_WRITE]; +info-inserted-has_iops_threshold = + bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL]; +info-inserted-iops_threshold = + bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL]; +info-inserted-iops_rd_threshold = + bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_READ]; +info-inserted-has_iops_rd_threshold = + bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_READ]; +info-inserted-has_iops_wr_threshold = + bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE]; +info-inserted-iops_wr_threshold = + bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE]; } bs0 = bs; diff --git a/blockdev.c b/blockdev.c index a78fba4..9bda359 100644 --- a/blockdev.c +++ b/blockdev.c @@ -341,6 +341,17 @@ static bool do_check_io_limits(BlockIOLimit *io_limits, Error **errp) return false; } +if (io_limits-bps_threshold[BLOCK_IO_LIMIT_TOTAL] 0 || +io_limits-bps_threshold[BLOCK_IO_LIMIT_WRITE] 0 || +io_limits-bps_threshold[BLOCK_IO_LIMIT_READ] 0 || +io_limits-iops_threshold[BLOCK_IO_LIMIT_TOTAL] 0 || +io_limits-iops_threshold[BLOCK_IO_LIMIT_WRITE] 0 || +io_limits-iops_threshold[BLOCK_IO_LIMIT_READ] 0) { +error_setg(errp, + threshold values must be 0 or greater); +return false; +} + return true; } @@ -523,24 +534,34 @@ DriveInfo *drive_init(QemuOpts *all_opts, BlockInterfaceType block_default_type) qemu_opt_get_number(opts, bps_rd, 0); io_limits.bps[BLOCK_IO_LIMIT_WRITE] = qemu_opt_get_number(opts, bps_wr, 0); +/* bps thresholds */ +io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL] = +qemu_opt_get_number(opts, bps_threshold, +io_limits.bps[BLOCK_IO_LIMIT_TOTAL] / THROTTLE_HZ); +io_limits.bps_threshold[BLOCK_IO_LIMIT_READ] = +qemu_opt_get_number(opts, bps_rd_threshold, +io_limits.bps[BLOCK_IO_LIMIT_READ] / THROTTLE_HZ); +io_limits.bps_threshold[BLOCK_IO_LIMIT_WRITE] = +qemu_opt_get_number(opts, bps_wr_threshold, +io_limits.bps[BLOCK_IO_LIMIT_WRITE] / THROTTLE_HZ); + io_limits.iops[BLOCK_IO_LIMIT_TOTAL] = qemu_opt_get_number(opts, iops, 0); io_limits.iops[BLOCK_IO_LIMIT_READ] = qemu_opt_get_number(opts, iops_rd, 0); io_limits.iops[BLOCK_IO_LIMIT_WRITE] = qemu_opt_get_number(opts, iops_wr, 0); -io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL] = - io_limits.bps[BLOCK_IO_LIMIT_TOTAL] / THROTTLE_HZ; -io_limits.bps_threshold[BLOCK_IO_LIMIT_READ] = - io_limits.bps[BLOCK_IO_LIMIT_READ] / THROTTLE_HZ; -io_limits.bps_threshold[BLOCK_IO_LIMIT_WRITE] = - io_limits.bps[BLOCK_IO_LIMIT_WRITE] / THROTTLE_HZ; -io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL] = - io_limits.iops[BLOCK_IO_LIMIT_TOTAL] / THROTTLE_HZ; + +/* iops thresholds */ +io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL] = +qemu_opt_get_number(opts, iops_threshold, +
[Qemu-devel] [PATCH 4/4] block: Add iops_sector_count to do the iops accounting for a given io size.
This feature can be used in case where users are avoiding the iops limit by doing jumbo I/Os hammering the storage backend. Signed-off-by: Benoit Canet ben...@irqsave.net --- block.c |8 +++- block/qapi.c |4 blockdev.c| 22 +- hmp.c |8 ++-- include/block/block_int.h |1 + qapi-schema.json | 10 -- qemu-options.hx |2 +- qmp-commands.hx |8 ++-- 8 files changed, 54 insertions(+), 9 deletions(-) diff --git a/block.c b/block.c index 2d6e9b4..cfa890d 100644 --- a/block.c +++ b/block.c @@ -337,6 +337,12 @@ static bool bdrv_is_any_threshold_exceeded(BlockDriverState *bs, int nb_sectors, bool is_write) { bool bps_ret, iops_ret; +double ios = 1.0; + +if (bs-io_limits.iops_sector_count) { +ios = ((double) nb_sectors) / bs-io_limits.iops_sector_count; +ios = MAX(ios, 1.0); +} /* check if any bandwith or per IO threshold has been exceeded */ bps_ret = bdrv_is_bps_threshold_exceeded(bs, is_write); @@ -360,7 +366,7 @@ static bool bdrv_is_any_threshold_exceeded(BlockDriverState *bs, int nb_sectors, /* the IO is authorized so do the accounting and return false */ bs-leaky_buckets.bytes[is_write] += (int64_t)nb_sectors * BDRV_SECTOR_SIZE; -bs-leaky_buckets.ios[is_write]++; +bs-leaky_buckets.ios[is_write] += ios; return false; } diff --git a/block/qapi.c b/block/qapi.c index 03f1604..f81081c 100644 --- a/block/qapi.c +++ b/block/qapi.c @@ -259,6 +259,10 @@ void bdrv_query_info(BlockDriverState *bs, bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE]; info-inserted-iops_wr_threshold = bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE]; +info-inserted-has_iops_sector_count = + bs-io_limits.iops_sector_count; +info-inserted-iops_sector_count = + bs-io_limits.iops_sector_count; } bs0 = bs; diff --git a/blockdev.c b/blockdev.c index 9bda359..8ddd710 100644 --- a/blockdev.c +++ b/blockdev.c @@ -352,6 +352,11 @@ static bool do_check_io_limits(BlockIOLimit *io_limits, Error **errp) return false; } +if (io_limits-iops_sector_count 0) { +error_setg(errp, iops_sector_count must be 0 or greater); +return false; +} + return true; } @@ -563,6 +568,10 @@ DriveInfo *drive_init(QemuOpts *all_opts, BlockInterfaceType block_default_type) qemu_opt_get_number(opts, iops_wr_threshold, io_limits.iops[BLOCK_IO_LIMIT_WRITE] / THROTTLE_HZ); +io_limits.iops_sector_count = + qemu_opt_get_number(opts, iops_sector_count, 0); + + if (!do_check_io_limits(io_limits, error)) { error_report(%s, error_get_pretty(error)); error_free(error); @@ -1260,7 +1269,9 @@ void qmp_block_set_io_throttle(const char *device, int64_t bps, int64_t bps_rd, bool has_iops_rd_threshold, int64_t iops_rd_threshold, bool has_iops_wr_threshold, - int64_t iops_wr_threshold, Error **errp) + int64_t iops_wr_threshold, + bool has_iops_sector_count, + int64_t iops_sector_count, Error **errp) { BlockIOLimit io_limits; BlockDriverState *bs; @@ -1283,6 +1294,7 @@ void qmp_block_set_io_throttle(const char *device, int64_t bps, int64_t bps_rd, io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL] = iops / THROTTLE_HZ; io_limits.iops_threshold[BLOCK_IO_LIMIT_READ] = iops_rd / THROTTLE_HZ; io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE] = iops_wr / THROTTLE_HZ; +io_limits.iops_sector_count = 0; /* override them with givens values if present */ if (has_bps_threshold) { @@ -1304,6 +1316,10 @@ void qmp_block_set_io_throttle(const char *device, int64_t bps, int64_t bps_rd, io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE] = iops_wr_threshold; } +if (has_iops_sector_count) { +io_limits.iops_sector_count = iops_sector_count; +} + if (!do_check_io_limits(io_limits, errp)) { return; } @@ -2007,6 +2023,10 @@ QemuOptsList qemu_common_drive_opts = { .type = QEMU_OPT_NUMBER, .help = write bytes threshold, },{ +.name = iops_sector_count, +.type = QEMU_OPT_NUMBER, +.help = when limiting by iops max size of an I/O in sector, +},{ .name = copy-on-read, .type = QEMU_OPT_BOOL, .help = copy read data from backing file into image file, diff --git a/hmp.c b/hmp.c index
[Qemu-devel] [PATCH for-1.6 4/4] block: Add iops_sector_count to do the iops accounting for a given io size.
This feature can be used in case where users are avoiding the iops limit by doing jumbo I/Os hammering the storage backend. Signed-off-by: Benoit Canet ben...@irqsave.net --- block.c |8 +++- block/qapi.c |4 blockdev.c| 22 +- hmp.c |8 ++-- include/block/block_int.h |1 + qapi-schema.json | 10 -- qemu-options.hx |2 +- qmp-commands.hx |8 ++-- 8 files changed, 54 insertions(+), 9 deletions(-) diff --git a/block.c b/block.c index 2d6e9b4..cfa890d 100644 --- a/block.c +++ b/block.c @@ -337,6 +337,12 @@ static bool bdrv_is_any_threshold_exceeded(BlockDriverState *bs, int nb_sectors, bool is_write) { bool bps_ret, iops_ret; +double ios = 1.0; + +if (bs-io_limits.iops_sector_count) { +ios = ((double) nb_sectors) / bs-io_limits.iops_sector_count; +ios = MAX(ios, 1.0); +} /* check if any bandwith or per IO threshold has been exceeded */ bps_ret = bdrv_is_bps_threshold_exceeded(bs, is_write); @@ -360,7 +366,7 @@ static bool bdrv_is_any_threshold_exceeded(BlockDriverState *bs, int nb_sectors, /* the IO is authorized so do the accounting and return false */ bs-leaky_buckets.bytes[is_write] += (int64_t)nb_sectors * BDRV_SECTOR_SIZE; -bs-leaky_buckets.ios[is_write]++; +bs-leaky_buckets.ios[is_write] += ios; return false; } diff --git a/block/qapi.c b/block/qapi.c index 03f1604..f81081c 100644 --- a/block/qapi.c +++ b/block/qapi.c @@ -259,6 +259,10 @@ void bdrv_query_info(BlockDriverState *bs, bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE]; info-inserted-iops_wr_threshold = bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE]; +info-inserted-has_iops_sector_count = + bs-io_limits.iops_sector_count; +info-inserted-iops_sector_count = + bs-io_limits.iops_sector_count; } bs0 = bs; diff --git a/blockdev.c b/blockdev.c index 9bda359..8ddd710 100644 --- a/blockdev.c +++ b/blockdev.c @@ -352,6 +352,11 @@ static bool do_check_io_limits(BlockIOLimit *io_limits, Error **errp) return false; } +if (io_limits-iops_sector_count 0) { +error_setg(errp, iops_sector_count must be 0 or greater); +return false; +} + return true; } @@ -563,6 +568,10 @@ DriveInfo *drive_init(QemuOpts *all_opts, BlockInterfaceType block_default_type) qemu_opt_get_number(opts, iops_wr_threshold, io_limits.iops[BLOCK_IO_LIMIT_WRITE] / THROTTLE_HZ); +io_limits.iops_sector_count = + qemu_opt_get_number(opts, iops_sector_count, 0); + + if (!do_check_io_limits(io_limits, error)) { error_report(%s, error_get_pretty(error)); error_free(error); @@ -1260,7 +1269,9 @@ void qmp_block_set_io_throttle(const char *device, int64_t bps, int64_t bps_rd, bool has_iops_rd_threshold, int64_t iops_rd_threshold, bool has_iops_wr_threshold, - int64_t iops_wr_threshold, Error **errp) + int64_t iops_wr_threshold, + bool has_iops_sector_count, + int64_t iops_sector_count, Error **errp) { BlockIOLimit io_limits; BlockDriverState *bs; @@ -1283,6 +1294,7 @@ void qmp_block_set_io_throttle(const char *device, int64_t bps, int64_t bps_rd, io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL] = iops / THROTTLE_HZ; io_limits.iops_threshold[BLOCK_IO_LIMIT_READ] = iops_rd / THROTTLE_HZ; io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE] = iops_wr / THROTTLE_HZ; +io_limits.iops_sector_count = 0; /* override them with givens values if present */ if (has_bps_threshold) { @@ -1304,6 +1316,10 @@ void qmp_block_set_io_throttle(const char *device, int64_t bps, int64_t bps_rd, io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE] = iops_wr_threshold; } +if (has_iops_sector_count) { +io_limits.iops_sector_count = iops_sector_count; +} + if (!do_check_io_limits(io_limits, errp)) { return; } @@ -2007,6 +2023,10 @@ QemuOptsList qemu_common_drive_opts = { .type = QEMU_OPT_NUMBER, .help = write bytes threshold, },{ +.name = iops_sector_count, +.type = QEMU_OPT_NUMBER, +.help = when limiting by iops max size of an I/O in sector, +},{ .name = copy-on-read, .type = QEMU_OPT_BOOL, .help = copy read data from backing file into image file, diff --git a/hmp.c b/hmp.c index
[Qemu-devel] [PATCH for-1.6 1/4] block: Repair the throttling code.
The throttling code was segfaulting since commit 02ffb504485f0920cfc75a0982a602f824a9a4f4 because some qemu_co_queue_next caller does not run in a coroutine. qemu_co_queue_do_restart assume that the caller is a coroutinne. As sugested by Stefan fix this by entering the coroutine directly. Signed-off-by: Benoit Canet ben...@irqsave.net --- block.c |8 include/block/coroutine.h |5 + qemu-coroutine-lock.c | 14 ++ 3 files changed, 23 insertions(+), 4 deletions(-) diff --git a/block.c b/block.c index b560241..dc72643 100644 --- a/block.c +++ b/block.c @@ -127,7 +127,8 @@ void bdrv_io_limits_disable(BlockDriverState *bs) { bs-io_limits_enabled = false; -while (qemu_co_queue_next(bs-throttled_reqs)); +while (qemu_co_enter_next(bs-throttled_reqs)) { +} if (bs-block_timer) { qemu_del_timer(bs-block_timer); @@ -143,7 +144,7 @@ static void bdrv_block_timer(void *opaque) { BlockDriverState *bs = opaque; -qemu_co_queue_next(bs-throttled_reqs); +qemu_co_enter_next(bs-throttled_reqs); } void bdrv_io_limits_enable(BlockDriverState *bs) @@ -1445,8 +1446,7 @@ void bdrv_drain_all(void) * a busy wait. */ QTAILQ_FOREACH(bs, bdrv_states, list) { -if (!qemu_co_queue_empty(bs-throttled_reqs)) { -qemu_co_queue_restart_all(bs-throttled_reqs); +while (qemu_co_enter_next(bs-throttled_reqs)) { busy = true; } } diff --git a/include/block/coroutine.h b/include/block/coroutine.h index 377805a..66e331c 100644 --- a/include/block/coroutine.h +++ b/include/block/coroutine.h @@ -138,6 +138,11 @@ bool qemu_co_queue_next(CoQueue *queue); void qemu_co_queue_restart_all(CoQueue *queue); /** + * Enter the next coroutine in the queue + */ +bool qemu_co_enter_next(CoQueue *queue); + +/** * Checks if the CoQueue is empty. */ bool qemu_co_queue_empty(CoQueue *queue); diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c index d9fea49..c61d300 100644 --- a/qemu-coroutine-lock.c +++ b/qemu-coroutine-lock.c @@ -98,6 +98,20 @@ void qemu_co_queue_restart_all(CoQueue *queue) qemu_co_queue_do_restart(queue, false); } +bool qemu_co_enter_next(CoQueue *queue) +{ +Coroutine *next; + +next = QTAILQ_FIRST(queue-entries); +if (!next) { +return false; +} + +QTAILQ_REMOVE(queue-entries, next, co_queue_next); +qemu_coroutine_enter(next, NULL); +return true; +} + bool qemu_co_queue_empty(CoQueue *queue) { return (QTAILQ_FIRST(queue-entries) == NULL); -- 1.7.10.4
[Qemu-devel] [PATCH for-1.6 3/4] block: Add support for throttling burst threshold in QMP and the command line.
The thresholds of the leaky bucket algorithm can be used to allow some burstiness. Signed-off-by: Benoit Canet ben...@irqsave.net --- block/qapi.c | 24 + blockdev.c | 105 +++--- hmp.c| 32 +++-- qapi-schema.json | 34 -- qemu-options.hx |2 +- qmp-commands.hx | 30 ++-- 6 files changed, 205 insertions(+), 22 deletions(-) diff --git a/block/qapi.c b/block/qapi.c index a4bc411..03f1604 100644 --- a/block/qapi.c +++ b/block/qapi.c @@ -235,6 +235,30 @@ void bdrv_query_info(BlockDriverState *bs, bs-io_limits.iops[BLOCK_IO_LIMIT_READ]; info-inserted-iops_wr = bs-io_limits.iops[BLOCK_IO_LIMIT_WRITE]; +info-inserted-has_bps_threshold = + bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL]; +info-inserted-bps_threshold = + bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL]; +info-inserted-has_bps_rd_threshold = + bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_READ]; +info-inserted-bps_rd_threshold = + bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_READ]; +info-inserted-has_bps_wr_threshold = + bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_WRITE]; +info-inserted-bps_wr_threshold = + bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_WRITE]; +info-inserted-has_iops_threshold = + bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL]; +info-inserted-iops_threshold = + bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL]; +info-inserted-iops_rd_threshold = + bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_READ]; +info-inserted-has_iops_rd_threshold = + bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_READ]; +info-inserted-has_iops_wr_threshold = + bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE]; +info-inserted-iops_wr_threshold = + bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE]; } bs0 = bs; diff --git a/blockdev.c b/blockdev.c index a78fba4..9bda359 100644 --- a/blockdev.c +++ b/blockdev.c @@ -341,6 +341,17 @@ static bool do_check_io_limits(BlockIOLimit *io_limits, Error **errp) return false; } +if (io_limits-bps_threshold[BLOCK_IO_LIMIT_TOTAL] 0 || +io_limits-bps_threshold[BLOCK_IO_LIMIT_WRITE] 0 || +io_limits-bps_threshold[BLOCK_IO_LIMIT_READ] 0 || +io_limits-iops_threshold[BLOCK_IO_LIMIT_TOTAL] 0 || +io_limits-iops_threshold[BLOCK_IO_LIMIT_WRITE] 0 || +io_limits-iops_threshold[BLOCK_IO_LIMIT_READ] 0) { +error_setg(errp, + threshold values must be 0 or greater); +return false; +} + return true; } @@ -523,24 +534,34 @@ DriveInfo *drive_init(QemuOpts *all_opts, BlockInterfaceType block_default_type) qemu_opt_get_number(opts, bps_rd, 0); io_limits.bps[BLOCK_IO_LIMIT_WRITE] = qemu_opt_get_number(opts, bps_wr, 0); +/* bps thresholds */ +io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL] = +qemu_opt_get_number(opts, bps_threshold, +io_limits.bps[BLOCK_IO_LIMIT_TOTAL] / THROTTLE_HZ); +io_limits.bps_threshold[BLOCK_IO_LIMIT_READ] = +qemu_opt_get_number(opts, bps_rd_threshold, +io_limits.bps[BLOCK_IO_LIMIT_READ] / THROTTLE_HZ); +io_limits.bps_threshold[BLOCK_IO_LIMIT_WRITE] = +qemu_opt_get_number(opts, bps_wr_threshold, +io_limits.bps[BLOCK_IO_LIMIT_WRITE] / THROTTLE_HZ); + io_limits.iops[BLOCK_IO_LIMIT_TOTAL] = qemu_opt_get_number(opts, iops, 0); io_limits.iops[BLOCK_IO_LIMIT_READ] = qemu_opt_get_number(opts, iops_rd, 0); io_limits.iops[BLOCK_IO_LIMIT_WRITE] = qemu_opt_get_number(opts, iops_wr, 0); -io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL] = - io_limits.bps[BLOCK_IO_LIMIT_TOTAL] / THROTTLE_HZ; -io_limits.bps_threshold[BLOCK_IO_LIMIT_READ] = - io_limits.bps[BLOCK_IO_LIMIT_READ] / THROTTLE_HZ; -io_limits.bps_threshold[BLOCK_IO_LIMIT_WRITE] = - io_limits.bps[BLOCK_IO_LIMIT_WRITE] / THROTTLE_HZ; -io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL] = - io_limits.iops[BLOCK_IO_LIMIT_TOTAL] / THROTTLE_HZ; + +/* iops thresholds */ +io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL] = +qemu_opt_get_number(opts, iops_threshold, +
[Qemu-devel] [PATCH for-1.6 0/4] Leaky bucket throttling and features
The first patch fixes the throttling which was broken by a previous commit. The next patch replace the existing throttling algorithm by the well described leaky bucket algorithm. Third patch implement bursting by adding *_threshold parameters to qmp_block_set_io_throttle. The last one allow to define the max size of an io when throttling by iops via iops_sector_count to avoid vm users cheating on the iops limit. Benoît Canet (4): block: Repair the throttling code. block: Modify the throttling code to implement the leaky bucket algorithm. block: Add support for throttling burst threshold in QMP and the command line. block: Add iops_sector_count to do the iops accounting for a given io size. block.c | 424 + block/qapi.c | 28 +++ blockdev.c| 174 +-- hmp.c | 36 +++- include/block/block_int.h | 16 +- include/block/coroutine.h |5 + qapi-schema.json | 40 - qemu-coroutine-lock.c | 14 ++ qemu-options.hx |2 +- qmp-commands.hx | 34 +++- 10 files changed, 561 insertions(+), 212 deletions(-) -- 1.7.10.4
[Qemu-devel] [PATCH for-1.6 2/4] block: Modify the throttling code to implement the leaky bucket algorithm.
This patch replace the previous algorithm by the well described leaky bucket algorithm: A bucket is filled by the incoming IOs and a periodic timer decrement the counter to make the bucket leak. When a given threshold is reached the bucket is full and the IOs are hold. In this patch the threshold is set to a default value to make the code behave like the previous implementation. In the next patch the threshold will be exposed in QMP to let the user control the burstiness of the throttling. Signed-off-by: Benoit Canet ben...@irqsave.net --- block.c | 410 + blockdev.c| 71 ++-- include/block/block_int.h | 15 +- 3 files changed, 299 insertions(+), 197 deletions(-) diff --git a/block.c b/block.c index dc72643..2d6e9b4 100644 --- a/block.c +++ b/block.c @@ -86,13 +86,6 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque); static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs, int64_t sector_num, int nb_sectors); -static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors, -bool is_write, double elapsed_time, uint64_t *wait); -static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write, -double elapsed_time, uint64_t *wait); -static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors, -bool is_write, int64_t *wait); - static QTAILQ_HEAD(, BlockDriverState) bdrv_states = QTAILQ_HEAD_INITIALIZER(bdrv_states); @@ -101,6 +94,8 @@ static QLIST_HEAD(, BlockDriver) bdrv_drivers = /* If non-zero, use only whitelisted block drivers */ static int use_bdrv_whitelist; +/* boolean used to inform the throttling code that a bdrv_drain_all is issued */ +static bool draining; #ifdef _WIN32 static int is_windows_drive_prefix(const char *filename) @@ -135,15 +130,122 @@ void bdrv_io_limits_disable(BlockDriverState *bs) qemu_free_timer(bs-block_timer); bs-block_timer = NULL; } +} -bs-slice_start = 0; -bs-slice_end = 0; +static void bdrv_make_bps_buckets_leak(BlockDriverState *bs, int64_t delta) +{ +int64_t *bytes = bs-leaky_buckets.bytes; +int64_t read_leak, write_leak; + +/* the limit apply to both reads and writes */ +if (bs-io_limits.bps[BLOCK_IO_LIMIT_TOTAL]) { +/* compute half the total leak */ +int64_t leak = ((bs-io_limits.bps[BLOCK_IO_LIMIT_TOTAL] * delta) / + NANOSECONDS_PER_SECOND); +int remain = leak % 2; +leak /= 2; + +/* the read bucket is smaller than half the quantity to leak so take + * care adding the leak difference to write leak + */ +if (bytes[BLOCK_IO_LIMIT_READ] = leak) { +read_leak = bytes[BLOCK_IO_LIMIT_READ]; +write_leak = 2 * leak + remain - bytes[BLOCK_IO_LIMIT_READ]; +/* symetric case */ +} else if (bytes[BLOCK_IO_LIMIT_WRITE] = leak) { +write_leak = bytes[BLOCK_IO_LIMIT_WRITE]; +read_leak = 2 * leak + remain - bytes[BLOCK_IO_LIMIT_WRITE]; +/* both bucket above leak count use half the total leak for both */ +} else { +write_leak = leak; +read_leak = leak + remain; +} +/* else we consider that limits are separated */ +} else { +read_leak = (bs-io_limits.bps[BLOCK_IO_LIMIT_READ] * delta) / +NANOSECONDS_PER_SECOND; +write_leak = (bs-io_limits.bps[BLOCK_IO_LIMIT_WRITE] * delta) / + NANOSECONDS_PER_SECOND; +} + +/* make the buckets leak */ +bytes[BLOCK_IO_LIMIT_READ] = MAX(bytes[BLOCK_IO_LIMIT_READ] - read_leak, + 0); +bytes[BLOCK_IO_LIMIT_WRITE] = MAX(bytes[BLOCK_IO_LIMIT_WRITE] - write_leak, + 0); } +static void bdrv_make_iops_buckets_leak(BlockDriverState *bs, int64_t delta) +{ +double *ios = bs-leaky_buckets.ios; +int64_t read_leak, write_leak; + +/* the limit apply to both reads and writes */ +if (bs-io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) { +/* compute half the total leak */ +int64_t leak = ((bs-io_limits.iops[BLOCK_IO_LIMIT_TOTAL] * delta) / + NANOSECONDS_PER_SECOND); +int remain = leak % 2; +leak /= 2; + +/* the read bucket is smaller than half the quantity to leak so take + * care adding the leak difference to write leak + */ +if (ios[BLOCK_IO_LIMIT_READ] = leak) { +read_leak = ios[BLOCK_IO_LIMIT_READ]; +write_leak = 2 * leak + remain - ios[BLOCK_IO_LIMIT_READ]; +/* symetric case */ +} else if (ios[BLOCK_IO_LIMIT_WRITE] = leak) { +write_leak = ios[BLOCK_IO_LIMIT_WRITE]; +read_leak = 2 * leak + remain - ios[BLOCK_IO_LIMIT_WRITE]; +/* both bucket above leak count use half the total leak for both */ +} else { +write_leak
[Qemu-devel] [RFC V8 01/24] qcow2: Add journal specification.
--- docs/specs/qcow2.txt | 42 ++ 1 file changed, 42 insertions(+) diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt index 36a559d..a4ffc85 100644 --- a/docs/specs/qcow2.txt +++ b/docs/specs/qcow2.txt @@ -350,3 +350,45 @@ Snapshot table entry: variable: Unique ID string for the snapshot (not null terminated) variable: Name of the snapshot (not null terminated) + +== Journal == + +QCOW2 can use one or more instance of a metadata journal. + +A journal is a sequential log of journal entries appended on a previously +allocated and reseted area. +A journal is designed like a linked list with each entry pointing to the next +so it's easy to iterate over entries. + +A journal uses the following constants to denote the type of each entry + +TYPE_NONE = 0xFF default value of any bytes in a reseted journal +TYPE_END = 1 the entry ends a journal cluster and point to the next + cluster +TYPE_HASH = 2 the entry contains a deduplication hash + +QCOW2 journal entry: + +Byte 0 :Size of the entry: size = 2 + n with size = 254 + + 1 :Type of the entry + + 2 - size :The optional n bytes structure carried by entry + +A journal is divided into clusters and no journal entry can be spilled on two +clusters. This avoid having to read more than one cluster to get a single entry. + +For this purpose an entry with the end type is added at the end of a journal +cluster before starting to write in the next cluster. +The size of such an entry is set so the entry points to the next cluster. + +As any journal cluster must be ended with an end entry the size of regular +journal entries is limited to 254 bytes in order to always left room for an end +entry which mimimal size is two bytes. + +The only cases where size 254 are none entries where size = 255. + +The replay of a journal stop when the first end none entry is reached. + +The journal cluster size is 4096 bytes. + -- 1.7.10.4
[Qemu-devel] [RFC V8 00/24] QCOW2 deduplication core functionality
Hello, This is the new patchset of the QCOW2 deduplication feature using an on disk key value store modelled after the SILT one. This is more a millestone reached post than a for review post. Performance is slow because the code is issuing a lot of read and write to the SSD in a sequential ways. The next revision will focus on parallelising the IOs. Kevin: the first patch add the description of the QCOW2 journal structure. The code make it easy to have multiple journal in QCOW2. Do you think it could fit others QCOW2 usages. Patches describing the others data structure are comming. Best regards Benoît Benoît Canet (24): qcow2: Add journal specification. qcow2: Add deduplication structures and fields. qcow2: Add journal. qcow2: Create the log store. qcow2: Add the hash store. qcow2: Add the deduplication store. qcow2: Add qcow2_dedup_read_missing_and_concatenate qcow2: Create a way to link to l2 tables when deduplicating. qcow2: Make qcow2_update_cluster_refcount public. qcow2: Add qcow2_dedup and related functions qcow2: Add qcow2_dedup_store_new_hashes. qcow2: Do allocate on rewrite on the dedup case. qcow2: Implement qcow2_compute_cluster_hash. qcow2: Load and save deduplication table header extension. qcow2: Extract qcow2_set_incompat_feature and qcow2_clear_incompat_feature. block: Add qcow2_dedup format and image creation code. qcow2: Drop hash for a given cluster when dedup makes refcount 2^16/2. qcow2: Remove hash when cluster is deleted. qcow2: Integrate deduplication in qcow2_co_writev loop. qcow2: Serialize write requests when deduplication is activated. qcow2: Integrate SKEIN hash algorithm in deduplication. qcow2: Add qcow2_dedup_init and qcow2_dedup_close. qcow2: Enable the deduplication feature. qcow2: Enable deduplication tests block/Makefile.objs |1 + block/qcow2-cluster.c| 13 +- block/qcow2-dedup.c | 778 + block/qcow2-hash-store.c | 802 ++ block/qcow2-journal.c| 587 +++ block/qcow2-log-store.c | 1281 ++ block/qcow2-refcount.c | 20 +- block/qcow2-store.c | 771 + block/qcow2.c| 431 +- block/qcow2.h| 357 +++- configure| 121 +++- docs/specs/qcow2.txt | 42 ++ include/block/block_int.h|1 + tests/qemu-iotests/common|6 + tests/qemu-iotests/common.rc |3 +- 15 files changed, 5144 insertions(+), 70 deletions(-) create mode 100644 block/qcow2-dedup.c create mode 100644 block/qcow2-hash-store.c create mode 100644 block/qcow2-journal.c create mode 100644 block/qcow2-log-store.c create mode 100644 block/qcow2-store.c -- 1.7.10.4
[Qemu-devel] [RFC V8 04/24] qcow2: Create the log store.
The log store is an in RAM hash table used to index QCowJournalEntries. Combining the log store with a journal allows to get the benefits of the journal (fewer random write and minimization of SSD wear out) with the advantages of a hash table (being good at random access). Once the journal is full and not packable or if the log store hash table is full the journal is frozen and converted into an on disk hash table used by the hash store. Signed-off-by: Benoit Canet ben...@irqsave.net --- block/Makefile.objs |2 +- block/qcow2-log-store.c | 1039 +++ block/qcow2.h | 26 ++ 3 files changed, 1066 insertions(+), 1 deletion(-) create mode 100644 block/qcow2-log-store.c diff --git a/block/Makefile.objs b/block/Makefile.objs index ee894b5..8c0b646 100644 --- a/block/Makefile.objs +++ b/block/Makefile.objs @@ -1,6 +1,6 @@ block-obj-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o -block-obj-y += qcow2-journal.o +block-obj-y += qcow2-journal.o qcow2-log-store.o block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o block-obj-y += qed-check.o block-obj-y += vhdx.o diff --git a/block/qcow2-log-store.c b/block/qcow2-log-store.c new file mode 100644 index 000..24ae310 --- /dev/null +++ b/block/qcow2-log-store.c @@ -0,0 +1,1039 @@ +/* + * QCOW2 log store + * + * Copyright (C) Nodalink, SARL. 2013 + * + * Author: + * Benoît Canet benoit.ca...@irqsave.net + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the Software), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + */ + +#include qemu-common.h +#include block/block_int.h +#include block/qcow2.h + +static uint64_t qcow2_dedup_get_h(QCowHashInfo *hash_info, int index, + uint32_t order) +{ +uint64_t *array = (uint64_t *) hash_info-hash.data; +uint64_t result = array[index]; +return result ((1 order) - 1); +} + +uint64_t qcow2_dedup_h1(QCowHashInfo *hash_info, uint32_t order) +{ +return qcow2_dedup_get_h(hash_info, 0, order); +} + +uint64_t qcow2_dedup_h2(QCowHashInfo *hash_info, uint32_t order) +{ +return qcow2_dedup_get_h(hash_info, 1, order); +} + +static bool qcow2_dedup_h_equals(QCowHashInfo *hash_info, uint32_t order) +{ +return qcow2_dedup_h1(hash_info, order) == + qcow2_dedup_h2(hash_info, order); +} + +static void qcow2_log_store_reset_kick_map(QCowLogStore *store) +{ +uint64_t i; +for (i = 0; i qcow2_log_store_nb_entries(store-order); i++) { +store-kick_map[i] = false; +} +} + +static bool qcow2_log_store_is_in_kick_path(QCowLogStore *store, +uint64_t in_table_idx) +{ +return store-kick_map[in_table_idx]; +} + +static void qcow2_log_store_set_in_kick_path(QCowLogStore *store, + uint64_t in_table_idx) +{ +store-kick_map[in_table_idx] = true; +} + +static void qcow2_log_store_unset_from_kick_path(QCowLogStore *store, + uint64_t in_table_idx) +{ +store-kick_map[in_table_idx] = false; +} + +static void qcow2_log_store_set_perm_element(int *perm, int value, int rnd) +{ +for (;perm[rnd % QCOW_LOG_STORE_BUCKET_SIZE] != -1; rnd++); +perm[rnd % QCOW_LOG_STORE_BUCKET_SIZE] = value; +} + +/* This function create a random permutation of the set + * [0, QCOW_LOG_STORE_BUCKET_SIZE] + */ +static void qcow2_log_store_make_perm(int *perm) +{ +int i, rnd; + +/* reset */ +for (i = 0; i QCOW_LOG_STORE_BUCKET_SIZE; i++) { +perm[i] = -1; +} + +/* create permutation */ +for (i = 0; i QCOW_LOG_STORE_BUCKET_SIZE; i++) { +rnd = rand(); +qcow2_log_store_set_perm_element(perm, i, rnd); +} +} + +/* This function is used to make sure we don't create path cycle while kicking + * + * The function uses a random permutation to compute
[Qemu-devel] [RFC V8 07/24] qcow2: Add qcow2_dedup_read_missing_and_concatenate
This function is used to read missing data when unaligned writes are done. This function also concatenate missing data with the given qiov data in order to prepare a buffer used to look for duplicated clusters. Signed-off-by: Benoit Canet ben...@irqsave.net --- block/Makefile.objs |2 +- block/qcow2-dedup.c | 121 +++ block/qcow2.c | 35 +++ block/qcow2.h | 13 ++ 4 files changed, 170 insertions(+), 1 deletion(-) create mode 100644 block/qcow2-dedup.c diff --git a/block/Makefile.objs b/block/Makefile.objs index 2d1a269..088407a 100644 --- a/block/Makefile.objs +++ b/block/Makefile.objs @@ -1,6 +1,6 @@ block-obj-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o -block-obj-y += qcow2-journal.o qcow2-log-store.o qcow2-hash-store.o qcow2-store.o +block-obj-y += qcow2-journal.o qcow2-log-store.o qcow2-hash-store.o qcow2-store.o qcow2-dedup.o block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o block-obj-y += qed-check.o block-obj-y += vhdx.o diff --git a/block/qcow2-dedup.c b/block/qcow2-dedup.c new file mode 100644 index 000..bc6e2c2 --- /dev/null +++ b/block/qcow2-dedup.c @@ -0,0 +1,121 @@ +/* + * Deduplication for the QCOW2 format + * + * Copyright (C) Nodalink, SARL. 2012-2013 + * + * Author: + * Benoît Canet benoit.ca...@irqsave.net + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the Software), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + */ + +#include block/block_int.h +#include qemu-common.h +#include qcow2.h + +/* + * Prepare a buffer containing everything required to compute cluster + * sized deduplication hashes. + * If sector_num or nb_sectors are not cluster-aligned, missing data + * before/after the qiov will be read. + * + * @qiov: the qiov for which missing data must be read + * @sector_num: the first sectors that must be read into the qiov + * @nb_sectors: the number of sectors to read into the qiov + * @data: the place where the data will be concatenated and stored + * the caller is responsible to use qemu_vfree() to + * data on success. + * @nb_data_sectors:the resulting size of the contatenated data (in sectors) + * @ret:negative on error + */ +int qcow2_dedup_read_missing_and_concatenate(BlockDriverState *bs, + QEMUIOVector *qiov, + uint64_t sector_num, + int nb_sectors, + uint8_t **data, + int *nb_data_sectors) +{ +BDRVQcowState *s = bs-opaque; +int ret = 0; +uint64_t cluster_beginning_sector; +uint64_t first_sector_after_qiov; +int cluster_beginning_nr; +int cluster_ending_nr; +int unaligned_ending_nr; +uint64_t max_cluster_ending_nr; + +/* compute how much and where to read at the beginning */ +cluster_beginning_nr = sector_num (s-cluster_sectors - 1); +cluster_beginning_sector = sector_num - cluster_beginning_nr; + +/* for the ending */ +first_sector_after_qiov = sector_num + nb_sectors; +unaligned_ending_nr = first_sector_after_qiov (s-cluster_sectors - 1); +cluster_ending_nr = unaligned_ending_nr ? +s-cluster_sectors - unaligned_ending_nr : 0; + +/* compute total size in sectors and allocate memory */ +*nb_data_sectors = cluster_beginning_nr + nb_sectors + cluster_ending_nr; +*data = qemu_blockalign(bs, *nb_data_sectors * BDRV_SECTOR_SIZE); + +/* read beginning */ +if (cluster_beginning_nr) { +ret = qcow2_read_cluster_data(bs, + *data, + cluster_beginning_sector
[Qemu-devel] [RFC V8 09/24] qcow2: Make qcow2_update_cluster_refcount public.
Signed-off-by: Benoit Canet ben...@irqsave.net --- block/qcow2-refcount.c | 17 ++--- block/qcow2.h |3 +++ 2 files changed, 13 insertions(+), 7 deletions(-) diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c index b32738f..3bd8f37 100644 --- a/block/qcow2-refcount.c +++ b/block/qcow2-refcount.c @@ -520,9 +520,9 @@ fail: * If the return value is non-negative, it is the new refcount of the cluster. * If it is negative, it is -errno and indicates an error. */ -static int update_cluster_refcount(BlockDriverState *bs, - int64_t cluster_index, - int addend) +int qcow2_update_cluster_refcount(BlockDriverState *bs, + int64_t cluster_index, + int addend) { BDRVQcowState *s = bs-opaque; int ret; @@ -649,7 +649,7 @@ int64_t qcow2_alloc_bytes(BlockDriverState *bs, int size) if (free_in_cluster == 0) s-free_byte_offset = 0; if ((offset (s-cluster_size - 1)) != 0) -update_cluster_refcount(bs, offset s-cluster_bits, 1); +qcow2_update_cluster_refcount(bs, offset s-cluster_bits, 1); } else { offset = qcow2_alloc_clusters(bs, s-cluster_size); if (offset 0) { @@ -659,7 +659,7 @@ int64_t qcow2_alloc_bytes(BlockDriverState *bs, int size) if ((cluster_offset + s-cluster_size) == offset) { /* we are lucky: contiguous data */ offset = s-free_byte_offset; -update_cluster_refcount(bs, offset s-cluster_bits, 1); +qcow2_update_cluster_refcount(bs, offset s-cluster_bits, 1); s-free_byte_offset += size; } else { s-free_byte_offset = offset; @@ -795,7 +795,9 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs, } else { uint64_t cluster_index = (offset L2E_OFFSET_MASK) s-cluster_bits; if (addend != 0) { -refcount = update_cluster_refcount(bs, cluster_index, addend); +refcount = qcow2_update_cluster_refcount(bs, + cluster_index, + addend); } else { refcount = get_refcount(bs, cluster_index); } @@ -827,7 +829,8 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs, if (addend != 0) { -refcount = update_cluster_refcount(bs, l2_offset s-cluster_bits, addend); +refcount = qcow2_update_cluster_refcount(bs, + l2_offset s-cluster_bits, addend); } else { refcount = get_refcount(bs, l2_offset s-cluster_bits); } diff --git a/block/qcow2.h b/block/qcow2.h index 7754065..3238083 100644 --- a/block/qcow2.h +++ b/block/qcow2.h @@ -568,6 +568,9 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs, int qcow2_check_refcounts(BlockDriverState *bs, BdrvCheckResult *res, BdrvCheckMode fix); +int qcow2_update_cluster_refcount(BlockDriverState *bs, + int64_t cluster_index, + int addend); /* qcow2-cluster.c functions */ int qcow2_grow_l1_table(BlockDriverState *bs, int min_size, bool exact_size); -- 1.7.10.4
[Qemu-devel] [RFC V8 03/24] qcow2: Add journal.
This commit add the code required to manage one or more journals in a qcow2 file. The primary user of this journal will be the qcow2-log-store.c. The journal is asynchronous and will require it's users to issue flushs in order to make sure that entries had reached stable storage. Signed-off-by: Benoit Canet ben...@irqsave.net --- block/Makefile.objs |1 + block/qcow2-journal.c | 587 + block/qcow2.h | 29 +++ 3 files changed, 617 insertions(+) create mode 100644 block/qcow2-journal.c diff --git a/block/Makefile.objs b/block/Makefile.objs index 5f0358a..ee894b5 100644 --- a/block/Makefile.objs +++ b/block/Makefile.objs @@ -1,5 +1,6 @@ block-obj-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o +block-obj-y += qcow2-journal.o block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o block-obj-y += qed-check.o block-obj-y += vhdx.o diff --git a/block/qcow2-journal.c b/block/qcow2-journal.c new file mode 100644 index 000..693de37 --- /dev/null +++ b/block/qcow2-journal.c @@ -0,0 +1,587 @@ +/* + * QCOW2 journal + * + * Copyright (C) Nodalink, SARL. 2013 + * + * Author: + * Benoît Canet benoit.ca...@irqsave.net + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the Software), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + */ + +#include qemu-common.h +#include block/block_int.h +#include block/qcow2.h + +/* This function reset the state of a journal + * + * @journal: the journal to initialize + * @sector: the on disk sector where the journal start + * @size:the size of the journal + */ +static void qcow2_journal_reset(QCowJournal *journal, +uint64_t sector, +uint64_t size) +{ +journal-sector = sector; +journal-size = size; +journal-index = 0; +/* clear write buffer */ +memset(journal-write_buf, QCOW_LOG_NONE, JOURNAL_CLUSTER_SIZE * 2); +journal-offset_in_buf = 0; +journal-flushed = true; +/* clear read cache */ +memset(journal-read_cache, QCOW_LOG_NONE, JOURNAL_CLUSTER_SIZE * 2); +/* so we will load the cache at first hit */ +journal-read_index = -1; +/* journal need to be resumed */ +journal-started = false; +} + +/* This function set the journal to the correct state after having replaying it + * + * @journal: the journal to resume + * @offset: the offset inside the journal + * @ret: 0 on success, -errno on error + */ +int qcow2_journal_resume(BlockDriverState *bs, + QCowJournal *journal, + uint64_t offset) +{ +uint64_t disk_offset; + +/* flush read cache */ +journal-read_index = -1; + +/* mark the journal as resumed */ +journal-started = true; + +/* reset state */ +journal-index = offset / JOURNAL_CLUSTER_SIZE; +journal-offset_in_buf = offset % JOURNAL_CLUSTER_SIZE; + +/* compute on disk offset of the current cluster */ +disk_offset = journal-sector * BDRV_SECTOR_SIZE + + journal-index * JOURNAL_CLUSTER_SIZE; + +/* read the current cluster in buffer */ +return bdrv_pread(bs-file, + disk_offset, + journal-write_buf, + JOURNAL_CLUSTER_SIZE); + +} + +uint64_t qcow2_journal_round_to_erase_blocks(uint64_t size) +{ +return ((size / SSD_ERASE_BLOCK_SIZE) + 1) * SSD_ERASE_BLOCK_SIZE; +} + +/* This function reset a journal content to none + * + * To be called at the end of the incarnate/freeze coroutine + * + * @journal: the QCowJournal to reset + * @in_coroutine: true if called from a coroutine + * @ret: 0 on success, -errno on error + */ +static int qcow2_journal_set_to_none(BlockDriverState *bs, + QCowJournal *journal, + bool in_coroutine) +{ +int ret = 0; +uint8_t
[Qemu-devel] [RFC V8 08/24] qcow2: Create a way to link to l2 tables when deduplicating.
Signed-off-by: Benoit Canet ben...@irqsave.net --- block/qcow2-cluster.c |6 -- block/qcow2.h |6 ++ 2 files changed, 10 insertions(+), 2 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index c71470a..d6db0b9 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -696,7 +696,7 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m) old_cluster[j++] = l2_table[l2_index + i]; l2_table[l2_index + i] = cpu_to_be64((cluster_offset + -(i s-cluster_bits)) | QCOW_OFLAG_COPIED); +(i s-cluster_bits)) | m-l2_entry_flags); } @@ -709,7 +709,7 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m) * If this was a COW, we need to decrease the refcount of the old cluster. * Also flush bs-file to get the right order for L2 and refcount update. */ -if (j != 0) { +if (!m-overwrite j != 0) { for (i = 0; i j; i++) { qcow2_free_any_clusters(bs, be64_to_cpu(old_cluster[i]), 1); } @@ -1098,6 +1098,8 @@ static int handle_alloc(BlockDriverState *bs, uint64_t guest_offset, .offset = nb_sectors * BDRV_SECTOR_SIZE, .nb_sectors = avail_sectors - nb_sectors, }, +.l2_entry_flags = QCOW_OFLAG_COPIED, +.overwrite = false, }; qemu_co_queue_init((*m)-dependent_requests); QLIST_INSERT_HEAD(s-cluster_allocs, *m, next_in_flight); diff --git a/block/qcow2.h b/block/qcow2.h index 9a9abd3..7754065 100644 --- a/block/qcow2.h +++ b/block/qcow2.h @@ -438,6 +438,12 @@ typedef struct QCowL2Meta */ CoQueue dependent_requests; +/* contains the flags to apply to the l2 entry */ +uint64_t l2_entry_flags; + +/* set to true if we are overwriting an L2 table entry */ +bool overwrite; + /** * The COW Region between the start of the first allocated cluster and the * area the guest actually writes to. -- 1.7.10.4
[Qemu-devel] [RFC V8 05/24] qcow2: Add the hash store.
The hash store allow to do lookup in the set of previously frozen log stores. Each frozen log store is called an incarnation. When the number of incarnation reach a user defined setting oldest incarnations are dropped to avoid that the dedup metadata grow without limit. Signed-off-by: Benoit Canet ben...@irqsave.net --- block/Makefile.objs |2 +- block/qcow2-hash-store.c | 802 ++ block/qcow2-log-store.c | 242 ++ block/qcow2.h| 38 +++ 4 files changed, 1083 insertions(+), 1 deletion(-) create mode 100644 block/qcow2-hash-store.c diff --git a/block/Makefile.objs b/block/Makefile.objs index 8c0b646..cafd4f8 100644 --- a/block/Makefile.objs +++ b/block/Makefile.objs @@ -1,6 +1,6 @@ block-obj-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o -block-obj-y += qcow2-journal.o qcow2-log-store.o +block-obj-y += qcow2-journal.o qcow2-log-store.o qcow2-hash-store.o block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o block-obj-y += qed-check.o block-obj-y += vhdx.o diff --git a/block/qcow2-hash-store.c b/block/qcow2-hash-store.c new file mode 100644 index 000..5284740 --- /dev/null +++ b/block/qcow2-hash-store.c @@ -0,0 +1,802 @@ +/* + * QCOW2 hash store + * + * Copyright (C) Nodalink, SARL. 2013 + * + * Author: + * Benoît Canet benoit.ca...@irqsave.net + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the Software), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + */ + +#include qemu-common.h +#include block/block_int.h +#include block/qcow2.h + +/* This function compute the number of bytes required to store a tag in hash + * store filters + * + * This function will only return values in the set [2, 4, 8] matching + * convenient compiler type sizes. + * + * @order: order of the hash store + * @ret: the number of byte required to store a tag + */ +int qcow2_hash_store_get_tag_size(uint32_t order) +{ +if (order 16) { +return 2; +} + +if (order 32) { +return 4; +} + +return 8; +} + +static uint64_t qcow2_hash_store_round_to_cluster(uint64_t size) +{ +return ((size / HASH_STORE_CLUSTER_SIZE) + 1) * HASH_STORE_CLUSTER_SIZE; +} + +/* This function compute the size of an in ram filter + * + * @order: the order of the store + * @ret: the size of the in ram filter in bytes + */ +uint64_t qcow2_hash_store_filter_size(uint32_t order) +{ +uint64_t size = qcow2_log_store_nb_entries(order) * +qcow2_hash_store_get_tag_size(order); +return qcow2_hash_store_round_to_cluster(size); +} + +/* this function make sure that no buckets are splitted between two clusters */ +static uint64_t qcow2_hash_store_round_to_bucket(uint64_t count) +{ +return (count / QCOW_LOG_STORE_BUCKET_SIZE) * QCOW_LOG_STORE_BUCKET_SIZE; +} + +uint64_t qcow2_hash_store_nb_hash_info_per_cluster(void) +{ +uint64_t raw_nb = HASH_STORE_CLUSTER_SIZE / sizeof(QCowHashInfo); +return qcow2_hash_store_round_to_bucket(raw_nb); +} + +/* This function compute the size of a frozen hash table that is dumped to disk + * + * @order: the order of the store + * @ret: the size in bytes + */ +static uint64_t qcow2_hash_store_table_size(uint32_t order) +{ +uint32_t nb_per_cluster = qcow2_hash_store_nb_hash_info_per_cluster(); +uint64_t count = (qcow2_log_store_nb_entries(order) / nb_per_cluster) + 1; +return count * HASH_STORE_CLUSTER_SIZE; +} + +/* This function compute the size of the result of a log store incarnation + * + * @order: the order of the store + * @ret: the size in bytes + */ +uint64_t qcow2_hash_store_incarnation_size(uint32_t order) +{ +return qcow2_hash_store_filter_size(order) + + qcow2_hash_store_table_size(order); +} + +/* This function reset a hash store + * + * @store: the hash store to initialize + * @order: the bit order of the hash store
[Qemu-devel] [RFC V8 11/24] qcow2: Add qcow2_dedup_store_new_hashes.
Signed-off-by: Benoit Canet ben...@irqsave.net --- block/qcow2-dedup.c | 97 ++- block/qcow2.h |6 +++- 2 files changed, 101 insertions(+), 2 deletions(-) diff --git a/block/qcow2-dedup.c b/block/qcow2-dedup.c index 0daf77e..ffbf866 100644 --- a/block/qcow2-dedup.c +++ b/block/qcow2-dedup.c @@ -251,6 +251,7 @@ static int qcow2_dedup_link_l2(BlockDriverState *bs, static int qcow2_clear_l2_copied_flag_if_needed(BlockDriverState *bs, QCowHashInfo *hash_info) { +BDRVQcowState *s = bs-opaque; int ret = 0; uint64_t first_logical_sect = hash_info-first_logical_sect; @@ -273,7 +274,8 @@ static int qcow2_clear_l2_copied_flag_if_needed(BlockDriverState *bs, /* remember that we don't need to clear QCOW_OFLAG_COPIED again */ hash_info-first_logical_sect = first_logical_sect; -return 0; +/* clear the QCOW_OFLAG_COPIED flag from disk */ +return qcow2_store_insert(bs, s-key_value_store, hash_info); } /* This function deduplicate a cluster @@ -534,3 +536,96 @@ exit: return deduped_clusters_nr * s-cluster_sectors - start_index; } + +static inline bool is_hash_info_empty(QCowHashInfo *hash_info) +{ +return hash_info-physical_sect QCOW_DEDUP_FLAG_EMPTY; +} + +/* This function store a hash information to disk and RAM + * + * @hash_info: the QCowHashInfo to process + * @logical_sect: the logical sector of the cluster seen by the guest + * @physical_sect: the physical sector of the stored cluster + * @ret:0 on success, negative on error + */ +static int qcow2_store_hash(BlockDriverState *bs, +QCowHashInfo *hash_info, +uint64_t logical_sect, +uint64_t physical_sect) +{ +BDRVQcowState *s = bs-opaque; +int ret = 0; + +ret = qcow2_store_get(bs, s-key_value_store, hash_info); + +/* no hash info found in store or error */ +if (ret = 0) { +return ret; +} + +/* the hash info information are already completed */ +if (!is_hash_info_empty(hash_info)) { +return 0; +} + +/* Remember that this QCowHashInfo represents the first occurrence of the + * cluster so we will be able to clear QCOW_OFLAG_COPIED from the L2 table + * entry when refcount will go 1. + */ +logical_sect = logical_sect | QCOW_OFLAG_COPIED; + +/* fill the missing fields of the hash info */ +hash_info-physical_sect = physical_sect; +hash_info-first_logical_sect = logical_sect; + +/* write the hash into the key value store */ +return qcow2_store_insert(bs, s-key_value_store, hash_info); +} + +/* This function store the hashes of the clusters which are not duplicated + * + * @ds:The deduplication state + * @count: the number of dedup hash to process + * @logical_sect: logical offset of the first cluster (in sectors) + * @physical_sect: offset of the first cluster (in sectors) + * @ret: 0 on succes, errno on error + */ +int qcow2_dedup_store_new_hashes(BlockDriverState *bs, + QCowDedupState *ds, + int count, + uint64_t logical_sect, + uint64_t physical_sect) +{ +int ret = 0; +int i = 0; +BDRVQcowState *s = bs-opaque; +QCowHashElement *dedup_hash, *next_dedup_hash; + +/* round values on cluster boundaries for easier cluster deletion */ +logical_sect = logical_sect ~(s-cluster_sectors - 1); +physical_sect = physical_sect ~(s-cluster_sectors - 1); + +QTAILQ_FOREACH_SAFE(dedup_hash, ds-undedupables, next, next_dedup_hash) { + +ret = qcow2_store_hash(bs, + dedup_hash-hash_info, + logical_sect + i * s-cluster_sectors, + physical_sect + i * s-cluster_sectors); + +QTAILQ_REMOVE(ds-undedupables, dedup_hash, next); +g_free(dedup_hash); + +if (ret 0) { +break; +} + +i++; + +if (i == count) { +break; +} +} + +return ret; +} diff --git a/block/qcow2.h b/block/qcow2.h index 3346842..ff26e81 100644 --- a/block/qcow2.h +++ b/block/qcow2.h @@ -742,6 +742,10 @@ int qcow2_dedup(BlockDriverState *bs, uint64_t sector_num, uint8_t *data, int data_nr); - +int qcow2_dedup_store_new_hashes(BlockDriverState *bs, + QCowDedupState *ds, + int count, + uint64_t logical_sect, + uint64_t physical_sect); #endif -- 1.7.10.4
[Qemu-devel] [RFC V8 06/24] qcow2: Add the deduplication store.
The key value store is an SSD optimised store used to store deduplication's QCowHashInfo. It binds together the QCowJournal, the QCowLogStore and the QCowHashStore and must be called be the deduplication code. Signed-off-by: Benoit Canet ben...@irqsave.net --- block/Makefile.objs |2 +- block/qcow2-store.c | 771 +++ block/qcow2.h | 20 +- 3 files changed, 791 insertions(+), 2 deletions(-) create mode 100644 block/qcow2-store.c diff --git a/block/Makefile.objs b/block/Makefile.objs index cafd4f8..2d1a269 100644 --- a/block/Makefile.objs +++ b/block/Makefile.objs @@ -1,6 +1,6 @@ block-obj-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o vvfat.o block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o qcow2-cache.o -block-obj-y += qcow2-journal.o qcow2-log-store.o qcow2-hash-store.o +block-obj-y += qcow2-journal.o qcow2-log-store.o qcow2-hash-store.o qcow2-store.o block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o block-obj-y += qed-check.o block-obj-y += vhdx.o diff --git a/block/qcow2-store.c b/block/qcow2-store.c new file mode 100644 index 000..8e3fad5 --- /dev/null +++ b/block/qcow2-store.c @@ -0,0 +1,771 @@ +/* + * QCOW2 key value store for SSD storage + * + * Copyright (C) Nodalink, SARL. 2013 + * + * Author: + * Benoît Canet benoit.ca...@irqsave.net + * + * Permission is hereby granted, free of charge, to any person obtaining a copy + * of this software and associated documentation files (the Software), to deal + * in the Software without restriction, including without limitation the rights + * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell + * copies of the Software, and to permit persons to whom the Software is + * furnished to do so, subject to the following conditions: + * + * The above copyright notice and this permission notice shall be included in + * all copies or substantial portions of the Software. + * + * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN + * THE SOFTWARE. + */ + +#include qemu-common.h +#include block/block_int.h +#include block/qcow2.h + +/* The qcow2_store combine the qcow2_log_store with the qcow2_hash_store- + * + * insertions: + * Entries are inserted in the qcow2_log_store which delegate writing to disk + * to the qcow2_journal. + * When full the qcow2_log_store is frozen into an incarnation and given to the + * qcow2_hash_store- A new qcow2_log store is then created + * + * deletions: + * Deletions are insertions with an additional flag set and a mandatory + * write flush of the journal: loosing an insertion is not a problem whereas + * loosing a deletion is a problem. + * + * queries: + * The log store is first queried and if the entry is not found in it the hash + * store is queried. + * When queried the hash store it will check for the presence of the + * QCowHashInfo in every incarnation starting from the youngest to the oldest. + * + * fifo: + * A user configurable setting allow the qcow2_store to act as a fifo. + * the oldest QCowHashInfos stored in it will be dropped an incarnation at once. + * Loosing the oldest QCowHashInfo won't corrupt the dedup metadata the only + * risk if to make to dedup ratio lower. + * + * algorithm: + * This code is an implementation of the first two stages of the SILT paper. + * The third stage was dropped to lower the complexity of the code + * The idea of behaving as a FIFO and the notion of incarnations comes from + * BufferHash and is handy to help reducing RAM consumption. + * + * SILT: www.cs.cmu.edu/~dga/papers/silt-sosp2011.pdf + * BufferHash: research.microsoft.com/en-us/people/sumann/bufferhash-nsdi10.pdf + */ + +/* This function initialize a key value store + * + * @store: the store to initialize + * @order: the bit order + */ +static void qcow2_store_init(BlockDriverState *bs, + QCowStore *store, + uint32_t order) +{ +store-order = order; +qemu_co_mutex_init(store-insert_lock); +qcow2_log_store_init(bs, store-log_store, order); +qcow2_log_store_init(bs, store-frozen_log_store, order); +qcow2_hash_store_init(store-hash_store, order); +} + +/* This function cleanup a key value store + * + * @store: the store to cleanup + */ +void qcow2_store_cleanup(QCowStore *store) +{ +qcow2_log_store_cleanup(store-log_store); +qcow2_log_store_cleanup(store-frozen_log_store); +qcow2_hash_store_cleanup(store-hash_store); +} + +/* This function look for a given hash info
[Qemu-devel] [RFC V8 16/24] block: Add qcow2_dedup format and image creation code.
Also modify qemu-io-test. Signed-off-by: Benoit Canet ben...@irqsave.net --- block/qcow2.c| 176 -- include/block/block_int.h|1 + tests/qemu-iotests/common.rc |3 +- 3 files changed, 173 insertions(+), 7 deletions(-) diff --git a/block/qcow2.c b/block/qcow2.c index 58e6236..e1265a2 100644 --- a/block/qcow2.c +++ b/block/qcow2.c @@ -1313,7 +1313,8 @@ static int preallocate(BlockDriverState *bs) static int qcow2_create2(const char *filename, int64_t total_size, const char *backing_file, const char *backing_format, int flags, size_t cluster_size, int prealloc, - QEMUOptionParameter *options, int version) + QEMUOptionParameter *options, int version, + bool dedup, uint8_t hash_algo) { /* Calculate cluster_bits */ int cluster_bits; @@ -1417,11 +1418,39 @@ static int qcow2_create2(const char *filename, int64_t total_size, } /* Okay, now that we have a valid image, let's give it the right size */ +BDRVQcowState *s = bs-opaque; ret = bdrv_truncate(bs, total_size * BDRV_SECTOR_SIZE); if (ret 0) { goto out; } +if (dedup) { +s-has_dedup = true; +/* fill cluster size for qcow2_store_create */ +s-cluster_size = cluster_size; +/* unlimited incarnation number */ +s-dedup_max_incarnations = 0; +s-dedup_hash_algo = hash_algo; + +ret = qcow2_set_incompat_feature(bs, QCOW2_INCOMPAT_DEDUP); +if (ret 0) { +goto out; +} + +/* this will create the journal and save the conf to disk then + * save header extension + */ +ret = qcow2_store_create(bs, + s-key_value_store, + total_size * BDRV_SECTOR_SIZE); + +if (ret 0) { +goto out; +} + +qcow2_store_cleanup(bs, s-key_value_store); +} + /* Want a backing file? There you go.*/ if (backing_file) { ret = bdrv_change_backing_file(bs, backing_file, backing_format); @@ -1447,15 +1476,41 @@ out: return ret; } +static int qcow2_warn_if_version_3_is_needed(int version, + bool has_feature, + const char *feature) +{ +if (version 3 has_feature) { +fprintf(stderr, %s only supported with compatibility +level 1.1 and above (use compat=1.1 or greater)\n, +feature); +return -EINVAL; +} +return 0; +} + +static int8_t qcow2_get_dedup_hash_algo(char *value) +{ +if (!value || !strcmp(value, sha256)) { +return QCOW_HASH_SHA256; +} + +error_printf(Unsupported deduplication hash algorithm.\n); +return -EINVAL; +} + static int qcow2_create(const char *filename, QEMUOptionParameter *options) { const char *backing_file = NULL; const char *backing_fmt = NULL; uint64_t sectors = 0; int flags = 0; +int ret; size_t cluster_size = DEFAULT_CLUSTER_SIZE; int prealloc = 0; int version = 2; +bool dedup = false; +int8_t hash_algo = 0; /* Read out options */ while (options options-name) { @@ -1493,6 +1548,13 @@ static int qcow2_create(const char *filename, QEMUOptionParameter *options) } } else if (!strcmp(options-name, BLOCK_OPT_LAZY_REFCOUNTS)) { flags |= options-value.n ? BLOCK_FLAG_LAZY_REFCOUNTS : 0; +} else if (!strcmp(options-name, BLOCK_OPT_DEDUP)) { +hash_algo = qcow2_get_dedup_hash_algo(options-value.s); +if (hash_algo 0) { +return hash_algo; +} +dedup = true; +version = 3; } options++; } @@ -1503,14 +1565,22 @@ static int qcow2_create(const char *filename, QEMUOptionParameter *options) return -EINVAL; } -if (version 3 (flags BLOCK_FLAG_LAZY_REFCOUNTS)) { -fprintf(stderr, Lazy refcounts only supported with compatibility -level 1.1 and above (use compat=1.1 or greater)\n); -return -EINVAL; +ret = qcow2_warn_if_version_3_is_needed(version, +flags BLOCK_FLAG_LAZY_REFCOUNTS, +Lazy refcounts); +if (ret 0) { +return ret; +} +ret = qcow2_warn_if_version_3_is_needed(version, +dedup, +Deduplication); +if (ret 0) { +return ret; } return qcow2_create2(filename, sectors, backing_file, backing_fmt, flags, - cluster_size, prealloc, options, version); + cluster_size, prealloc, options, version, + dedup,
[Qemu-devel] [RFC V8 13/24] qcow2: Implement qcow2_compute_cluster_hash.
Also factorize detection of libgnutls with vnc tls. Signed-off-by: Benoit Canet ben...@irqsave.net --- block/qcow2-dedup.c | 17 +- configure | 86 +-- 2 files changed, 79 insertions(+), 24 deletions(-) diff --git a/block/qcow2-dedup.c b/block/qcow2-dedup.c index ffbf866..2d2c15c 100644 --- a/block/qcow2-dedup.c +++ b/block/qcow2-dedup.c @@ -28,6 +28,10 @@ #include block/block_int.h #include qemu-common.h #include qcow2.h +#ifdef CONFIG_SHA256_DEDUP +#include gnutls/gnutls.h +#include gnutls/crypto.h +#endif /* * Prepare a buffer containing everything required to compute cluster @@ -142,7 +146,18 @@ static int qcow2_compute_cluster_hash(BlockDriverState *bs, QCowHash *hash, uint8_t *data) { -return 0; +BDRVQcowState *s = bs-opaque; +switch (s-dedup_hash_algo) { +#ifdef CONFIG_SHA256_DEDUP +case QCOW_HASH_SHA256: +return gnutls_hash_fast(GNUTLS_DIG_SHA256, data, +s-cluster_size, hash-data); +#endif +default: +error_report(Invalid deduplication hash algorithm %i, + s-dedup_hash_algo); +abort(); +} } /* diff --git a/configure b/configure index e818e8b..34b26b2 100755 --- a/configure +++ b/configure @@ -240,6 +240,7 @@ gtk= gtkabi=2.0 tpm=no libssh2= +sha256_dedup=yes # parse CC options first for opt do @@ -930,6 +931,10 @@ for opt do ;; --enable-libssh2) libssh2=yes ;; + --disable-sha256-dedup) sha256_dedup=no + ;; + --enable-sha256-dedup) sha256_dedup=yes + ;; *) echo ERROR: unknown option $opt; show_help=yes ;; esac @@ -1190,6 +1195,8 @@ echo --with-coroutine=BACKEND coroutine backend. Supported options: echogthread, ucontext, sigaltstack, windows echo --enable-glusterfs enable GlusterFS backend echo --disable-glusterfs disable GlusterFS backend +echo --disable-sha256-dedup disable sha256 dedup +echo --enable-sha256-dedupenables sha256 dedup echo --enable-gcovenable test coverage analysis with gcov echo --gcov=GCOV use specified gcov [$gcov_tool] echo --enable-tpm enable TPM support @@ -1812,32 +1819,60 @@ EOF fi ## -# VNC TLS/WS detection -if test $vnc = yes -a \( $vnc_tls != no -o $vnc_ws != no \) ; then - cat $TMPC EOF +# gnutls detection (factorize the VNC TLS and SHA256 deduplication test) +cat $TMPC EOF #include gnutls/gnutls.h -int main(void) { gnutls_session_t s; gnutls_init(s, GNUTLS_SERVER); return 0; } +#include gnutls/crypto.h +int main(void) {char data[4096], digest[32]; +gnutls_hash_fast(GNUTLS_DIG_SHA256, data, 4096, digest); +return 0; +} EOF - vnc_tls_cflags=`$pkg_config --cflags gnutls 2 /dev/null` - vnc_tls_libs=`$pkg_config --libs gnutls 2 /dev/null` - if compile_prog $vnc_tls_cflags $vnc_tls_libs ; then -if test $vnc_tls != no ; then - vnc_tls=yes -fi -if test $vnc_ws != no ; then - vnc_ws=yes -fi -libs_softmmu=$vnc_tls_libs $libs_softmmu -QEMU_CFLAGS=$QEMU_CFLAGS $vnc_tls_cflags +gnu_tls_cflags=`$pkg_config --cflags gnutls 2 /dev/null` +gnu_tls_libs=`$pkg_config --libs gnutls 2 /dev/null` +if compile_prog $gnu_tls_cflags $gnu_tls_libs ; then + gnu_tls=yes +else + gnu_tls=no +fi + +## +# VNC TLS/WS +if test $vnc = yes -a $gnu_tls != no; then + if test $vnc_tls != no ; then +libs_softmmu=$gnu_tls_libs $libs_softmmu +libs_tools=$gnu_tls_libs $libs_softmmu +QEMU_CFLAGS=$QEMU_CFLAGS $gnu_tls_cflags +vnc_tls=yes + fi + if test $vnc_ws != no ; then +libs_softmmu=$gnu_tls_libs $libs_softmmu +libs_tools=$gnu_tls_libs $libs_softmmu +QEMU_CFLAGS=$QEMU_CFLAGS $gnu_tls_cflags +vnc_ws=yes + fi +else + if test $vnc_tls = yes ; then +feature_not_found vnc-tls + fi + if test $vnc_ws = yes ; then +feature_not_found vnc-ws + fi + vnc_tls=no + vnc_ws=no +fi + +## +# SHA256 deduplication +if test $sha256_dedup = yes; then + if test $gnu_tls = yes; then +libs_softmmu=$gnu_tls_libs $libs_softmmu +libs_tools=$gnu_tls_libs $libs_softmmu +QEMU_CFLAGS=$QEMU_CFLAGS $gnu_tls_cflags +sha256_dedup=yes else -if test $vnc_tls = yes ; then - feature_not_found vnc-tls -fi -if test $vnc_ws = yes ; then - feature_not_found vnc-ws -fi -vnc_tls=no -vnc_ws=no +echo gnutls 2.10.0 required to compile QEMU with sha256 deduplication +exit 1 fi fi @@ -3562,6 +3597,7 @@ echo seccomp support $seccomp echo coroutine backend $coroutine echo GlusterFS support $glusterfs echo virtio-blk-data-plane $virtio_blk_data_plane +echo sha256-dedup $sha256_dedup echo gcov $gcov_tool echo gcov enabled $gcov echo TPM support $tpm @@
[Qemu-devel] [RFC V8 15/24] qcow2: Extract qcow2_set_incompat_feature and qcow2_clear_incompat_feature.
Also change callers. Signed-off-by: Benoit Canet ben...@irqsave.net --- block/qcow2-cluster.c |2 +- block/qcow2.c | 43 ++- block/qcow2.h |7 --- 3 files changed, 27 insertions(+), 25 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index 41c4bc2..7cd6e75 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -672,7 +672,7 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, QCowL2Meta *m) /* Update L2 table. */ if (s-use_lazy_refcounts) { -qcow2_mark_dirty(bs); +qcow2_set_incompat_feature(bs, QCOW2_INCOMPAT_DIRTY); } if (qcow2_need_accurate_refcounts(s)) { qcow2_cache_set_dependency(bs, s-l2_table_cache, diff --git a/block/qcow2.c b/block/qcow2.c index 3cd1051..58e6236 100644 --- a/block/qcow2.c +++ b/block/qcow2.c @@ -250,56 +250,57 @@ static void report_unsupported_feature(BlockDriverState *bs, } /* - * Sets the dirty bit and flushes afterwards if necessary. + * Sets the an incompatible feature bit and flushes afterwards if necessary. * * The incompatible_features bit is only set if the image file header was * updated successfully. Therefore it is not required to check the return * value of this function. */ -int qcow2_mark_dirty(BlockDriverState *bs) +int qcow2_set_incompat_feature(BlockDriverState *bs, + QCow2IncompatibleFeature feature) { BDRVQcowState *s = bs-opaque; uint64_t val; -int ret; +int ret = 0; assert(s-qcow_version = 3); -if (s-incompatible_features QCOW2_INCOMPAT_DIRTY) { -return 0; /* already dirty */ +if (s-incompatible_features feature) { +return 0; /* already added */ } -val = cpu_to_be64(s-incompatible_features | QCOW2_INCOMPAT_DIRTY); +val = cpu_to_be64(s-incompatible_features | feature); ret = bdrv_pwrite(bs-file, offsetof(QCowHeader, incompatible_features), val, sizeof(val)); if (ret 0) { return ret; } -ret = bdrv_flush(bs-file); -if (ret 0) { -return ret; -} -/* Only treat image as dirty if the header was updated successfully */ -s-incompatible_features |= QCOW2_INCOMPAT_DIRTY; +/* Only treat image as having the feature if the header was updated + * successfully + */ +s-incompatible_features |= feature; return 0; } /* - * Clears the dirty bit and flushes before if necessary. Only call this - * function when there are no pending requests, it does not guard against - * concurrent requests dirtying the image. + * Clears an incompatible feature bit and flushes before if necessary. + * Only call this function when there are no pending requests, it does not + * guard against concurrent requests adding a feature to the image. */ -static int qcow2_mark_clean(BlockDriverState *bs) +static int qcow2_clear_incompat_feature(BlockDriverState *bs, +QCow2IncompatibleFeature feature) { BDRVQcowState *s = bs-opaque; +int ret = 0; -if (s-incompatible_features QCOW2_INCOMPAT_DIRTY) { -int ret = bdrv_flush(bs); +if (s-incompatible_features feature) { +ret = bdrv_flush(bs); if (ret 0) { return ret; } -s-incompatible_features = ~QCOW2_INCOMPAT_DIRTY; +s-incompatible_features = ~feature; return qcow2_update_header(bs); } return 0; @@ -314,7 +315,7 @@ static int qcow2_check(BlockDriverState *bs, BdrvCheckResult *result, } if (fix result-check_errors == 0 result-corruptions == 0) { -return qcow2_mark_clean(bs); +return qcow2_clear_incompat_feature(bs, QCOW2_INCOMPAT_DIRTY); } return ret; } @@ -949,7 +950,7 @@ static void qcow2_close(BlockDriverState *bs) qcow2_cache_flush(bs, s-l2_table_cache); qcow2_cache_flush(bs, s-refcount_block_cache); -qcow2_mark_clean(bs); +qcow2_clear_incompat_feature(bs, QCOW2_INCOMPAT_DIRTY); qcow2_cache_destroy(bs, s-l2_table_cache); qcow2_cache_destroy(bs, s-refcount_block_cache); diff --git a/block/qcow2.h b/block/qcow2.h index ff26e81..720131d 100644 --- a/block/qcow2.h +++ b/block/qcow2.h @@ -290,14 +290,14 @@ enum { }; /* Incompatible feature bits */ -enum { +typedef enum { QCOW2_INCOMPAT_DIRTY_BITNR = 0, QCOW2_INCOMPAT_DIRTY = 1 QCOW2_INCOMPAT_DIRTY_BITNR, QCOW2_INCOMPAT_DEDUP_BITNR = 1, QCOW2_INCOMPAT_DEDUP = 1 QCOW2_INCOMPAT_DEDUP_BITNR, QCOW2_INCOMPAT_MASK = QCOW2_INCOMPAT_DIRTY | QCOW2_INCOMPAT_DEDUP, -}; +} QCow2IncompatibleFeature; /* Compatible feature bits */ enum { @@ -543,7 +543,8 @@ static inline uint64_t l2meta_cow_end(QCowL2Meta *m) int qcow2_backing_read1(BlockDriverState *bs, QEMUIOVector *qiov, int64_t sector_num, int nb_sectors); -int qcow2_mark_dirty(BlockDriverState *bs);
[Qemu-devel] [RFC V8 21/24] qcow2: Integrate SKEIN hash algorithm in deduplication.
Signed-off-by: Benoit Canet ben...@irqsave.net --- block/qcow2-dedup.c | 15 +++ block/qcow2.c |5 + configure | 35 +++ 3 files changed, 55 insertions(+) diff --git a/block/qcow2-dedup.c b/block/qcow2-dedup.c index 599cb2e..606459f 100644 --- a/block/qcow2-dedup.c +++ b/block/qcow2-dedup.c @@ -33,6 +33,10 @@ #include gnutls/crypto.h #endif +#ifdef CONFIG_SKEIN_DEDUP +#include skeinApi.h +#endif + /* * Prepare a buffer containing everything required to compute cluster * sized deduplication hashes. @@ -153,6 +157,17 @@ static int qcow2_compute_cluster_hash(BlockDriverState *bs, return gnutls_hash_fast(GNUTLS_DIG_SHA256, data, s-cluster_size, hash-data); #endif +#if defined(CONFIG_SKEIN_DEDUP) +case QCOW_HASH_SKEIN: +{ +SkeinCtx_t ctx; +skeinCtxPrepare(ctx, Skein256); +skeinInit(ctx, Skein256); +skeinUpdate(ctx, data, s-cluster_size); +skeinFinal(ctx, hash-data); +} +return 0; +#endif default: error_report(Invalid deduplication hash algorithm %i, s-dedup_hash_algo); diff --git a/block/qcow2.c b/block/qcow2.c index 11c115f..ea2f0f2 100644 --- a/block/qcow2.c +++ b/block/qcow2.c @@ -1592,6 +1592,11 @@ static int8_t qcow2_get_dedup_hash_algo(char *value) if (!value || !strcmp(value, sha256)) { return QCOW_HASH_SHA256; } +#if defined(CONFIG_SKEIN_DEDUP) +if (!strcmp(value, skein)) { +return QCOW_HASH_SKEIN; +} +#endif error_printf(Unsupported deduplication hash algorithm.\n); return -EINVAL; diff --git a/configure b/configure index 34b26b2..902c3c1 100755 --- a/configure +++ b/configure @@ -241,6 +241,7 @@ gtkabi=2.0 tpm=no libssh2= sha256_dedup=yes +skein_dedup=no # parse CC options first for opt do @@ -935,6 +936,10 @@ for opt do ;; --enable-sha256-dedup) sha256_dedup=yes ;; + --disable-skein-dedup) skein_dedup=no + ;; + --enable-skein-dedup) skein_dedup=yes + ;; *) echo ERROR: unknown option $opt; show_help=yes ;; esac @@ -1202,6 +1207,7 @@ echo --gcov=GCOV use specified gcov [$gcov_tool] echo --enable-tpm enable TPM support echo --disable-libssh2disable ssh block device support echo --enable-libssh2 enable ssh block device support +echo --enable-skein-dedup enable computing dedup hashes with SKEIN echo echo NOTE: The object files are built at the place where configure is launched exit 1 @@ -2613,6 +2619,30 @@ EOF fi fi +## +# SKEIN dedup hash function probe +if test $skein_dedup != no ; then + cat $TMPC EOF +#include skeinApi.h +int main(void) { +SkeinCtx_t ctx; +skeinCtxPrepare(ctx, 512); +return 0; +} +EOF + skein_libs=-lskein3fish + if compile_prog $skein_libs ; then +skein_dedup=yes +libs_tools=$skein_libs $libs_tools +libs_softmmu=$skein_libs $libs_softmmu + else +if test $skein_dedup = yes ; then + feature_not_found libskein3fish not found +fi +skein_dedup=no + fi +fi + # # Check for xxxat() functions when we are building linux-user # emulator. This is done because older glibc versions don't @@ -3603,6 +3633,7 @@ echo gcov enabled $gcov echo TPM support $tpm echo libssh2 support $libssh2 echo TPM passthrough $tpm_passthrough +echo SKEIN support $skein_dedup if test $sdl_too_old = yes; then echo - Your SDL version is too old - please upgrade to have SDL support @@ -3991,6 +4022,10 @@ if test $sha256_dedup = yes ; then echo CONFIG_SHA256_DEDUP=y $config_host_mak fi +if test $skein_dedup = yes ; then + echo CONFIG_SKEIN_DEDUP=y $config_host_mak +fi + # USB host support case $usb in linux) -- 1.7.10.4
[Qemu-devel] [RFC V8 23/24] qcow2: Enable the deduplication feature.
--- block/qcow2.c | 17 + 1 file changed, 17 insertions(+) diff --git a/block/qcow2.c b/block/qcow2.c index f7b94dd..bb7bf74 100644 --- a/block/qcow2.c +++ b/block/qcow2.c @@ -569,6 +569,13 @@ static int qcow2_open(BlockDriverState *bs, QDict *options, int flags) goto fail; } +if (s-incompatible_features QCOW2_INCOMPAT_DEDUP) { +ret = qcow2_dedup_init(bs); +if (ret 0) { +goto fail; +} +} + #ifdef DEBUG_ALLOC { BdrvCheckResult result = {0}; @@ -1043,8 +1050,13 @@ fail: static void qcow2_close(BlockDriverState *bs) { BDRVQcowState *s = bs-opaque; + g_free(s-l1_table); +if (s-has_dedup) { +qcow2_dedup_close(bs); +} + qcow2_cache_flush(bs, s-l2_table_cache); qcow2_cache_flush(bs, s-refcount_block_cache); @@ -1547,6 +1559,11 @@ static int qcow2_create2(const char *filename, int64_t total_size, } qcow2_store_cleanup(s-key_value_store); + +ret = qcow2_dedup_init(bs); +if (ret 0) { +goto out; +} } /* Want a backing file? There you go.*/ -- 1.7.10.4
[Qemu-devel] [RFC V8 12/24] qcow2: Do allocate on rewrite on the dedup case.
This patch does allocate on rewrite when deduplication is on. This get rid of the need of removing the old hash of the lookup structure when a cluster get rewritten. The old data is left in place and will be collected/deleted when it's cluster will reach 0. Signed-off-by: Benoit Canet ben...@irqsave.net --- block/qcow2-cluster.c |5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c index d6db0b9..41c4bc2 100644 --- a/block/qcow2-cluster.c +++ b/block/qcow2-cluster.c @@ -878,7 +878,8 @@ static int handle_copied(BlockDriverState *bs, uint64_t guest_offset, cluster_offset = be64_to_cpu(l2_table[l2_index]); /* Check how many clusters are already allocated and don't need COW */ -if (qcow2_get_cluster_type(cluster_offset) == QCOW2_CLUSTER_NORMAL +if (!s-has_dedup +qcow2_get_cluster_type(cluster_offset) == QCOW2_CLUSTER_NORMAL (cluster_offset QCOW_OFLAG_COPIED)) { /* If a specific host_offset is required, check it */ @@ -1028,7 +1029,7 @@ static int handle_alloc(BlockDriverState *bs, uint64_t guest_offset, /* For the moment, overwrite compressed clusters one by one */ if (entry QCOW_OFLAG_COMPRESSED) { nb_clusters = 1; -} else { +} else if (!s-has_dedup) { nb_clusters = count_cow_clusters(s, nb_clusters, l2_table, l2_index); } -- 1.7.10.4
[Qemu-devel] [RFC V8 10/24] qcow2: Add qcow2_dedup and related functions
Signed-off-by: Benoit Canet ben...@irqsave.net --- block/qcow2-dedup.c | 415 +++ block/qcow2.h |5 + 2 files changed, 420 insertions(+) diff --git a/block/qcow2-dedup.c b/block/qcow2-dedup.c index bc6e2c2..0daf77e 100644 --- a/block/qcow2-dedup.c +++ b/block/qcow2-dedup.c @@ -119,3 +119,418 @@ fail: *data = NULL; return ret; } + +/* + * Set a QCowHashInfo into the new state + * + * @hash_info: the hash_info to set to new + */ +static void qcow2_set_hash_info_new(QCowHashInfo *hash_info) +{ +hash_info-physical_sect = QCOW_DEDUP_FLAG_EMPTY; +hash_info-first_logical_sect = QCOW_DEDUP_FLAG_EMPTY; +} + +/* + * Compute the hash of a given cluster + * + * @data: a buffer containing the cluster data + * @hash: a QCowHash where to store the computed hash + * @ret: 0 on success, negative on error + */ +static int qcow2_compute_cluster_hash(BlockDriverState *bs, + QCowHash *hash, + uint8_t *data) +{ +return 0; +} + +/* + * Get a QCowHashInfo corresponding to a cluster data + * + * @phash: if phash can be used no hash is computed + * @data:a buffer containing the cluster + * @ret: 1 if if hash found, 0 if not found, -errno on error + */ +static int qcow2_get_persistent_hash_for_cluster(BlockDriverState *bs, + QcowPersistentHash *phash, + uint8_t *data) +{ +BDRVQcowState *s = bs-opaque; +int ret = 0; + +/* no hash has been provided compute it and store it for later usage */ +if (!phash-reuse) { +ret = qcow2_compute_cluster_hash(bs, + phash-hash_info.hash, + data); +} + +/* do not reuse the hash anymore if it was precomputed */ +phash-reuse = false; + +/* error */ +if (ret 0) { +return ret; +} + +/* ask the store if it knows this QCowHashInfo */ +return qcow2_store_get(bs, s-key_value_store, phash-hash_info); +} + +/* + * Insert a QCowHashInfo into the store as new + * + * @hash_info: the QCowHashInfo to set to new and insert + * @ret: 0 on succes, -errno on error + */ +static int qcow2_add_new_hash_info_to_store(BlockDriverState *bs, +QCowHashInfo *hash_info) +{ +BDRVQcowState *s = bs-opaque; + +/* set the QCowHashInfo to the new state so we will remember to fill these + * field later with real values + */ +qcow2_set_hash_info_new(hash_info); +return qcow2_store_insert(bs, s-key_value_store, hash_info); +} + +/* + * Helper used to build a QCowHashElement + * + * @hash: the QCowHash to use + * @ret: a newly allocated QCowHashElement containing the given hash + */ +static QCowHashElement *qcow2_dedup_hash_new(QCowHash *hash) +{ +QCowHashElement *dedup_hash; +dedup_hash = g_new0(QCowHashElement, 1); +memcpy(dedup_hash-hash_info.hash.data, hash-data, HASH_LENGTH); +return dedup_hash; +} + +/* + * Helper used to link a deduplicated cluster in the l2 + * + * @logical_sect: the cluster sector seen by the guest + * @physical_sect: the cluster sector in the QCOW2 file + * @overwrite: true if we must overwrite the L2 table entry + * @ret: + */ +static int qcow2_dedup_link_l2(BlockDriverState *bs, + uint64_t logical_sect, + uint64_t physical_sect, + bool overwrite) +{ +QCowL2Meta m = { +.alloc_offset = physical_sect 9, +.offset = logical_sect 9, +.nb_clusters= 1, +.nb_available = 0, +.cow_start = { +.offset = 0, +.nb_sectors = 0, +}, +.cow_end = { +.offset = 0, +.nb_sectors = 0, +}, +.l2_entry_flags = 0, +.overwrite = overwrite, +}; +return qcow2_alloc_cluster_link_l2(bs, m); +} + +/* Clear the QCOW_OFLAG_COPIED from the first L2 entry written for a physical + * cluster. + * + * @hash_info: the duplicated hash info + * @ret: 0 on success, negative on error + */ +static int qcow2_clear_l2_copied_flag_if_needed(BlockDriverState *bs, +QCowHashInfo *hash_info) +{ +int ret = 0; +uint64_t first_logical_sect = hash_info-first_logical_sect; + +/* QCOW_OFLAG_COPIED already cleared - do nothing */ +if (!(first_logical_sect QCOW_OFLAG_COPIED)) { +return 0; +} + +first_logical_sect = ~QCOW_OFLAG_COPIED; + +/* overwrite first L2 entry to clear QCOW_FLAG_COPIED */ +ret = qcow2_dedup_link_l2(bs, first_logical_sect, + hash_info-physical_sect, + true); + +if (ret 0) { +return ret; +} + +/*
[Qemu-devel] [RFC V8 14/24] qcow2: Load and save deduplication table header extension.
Signed-off-by: Benoit Canet ben...@irqsave.net --- block/qcow2.c | 53 + 1 file changed, 53 insertions(+) diff --git a/block/qcow2.c b/block/qcow2.c index 34b2a87..3cd1051 100644 --- a/block/qcow2.c +++ b/block/qcow2.c @@ -54,9 +54,19 @@ typedef struct { uint32_t len; } QCowExtension; +typedef struct { +uint64_t offset; +uint32_t size; +uint32_t max; +uint32_t flags; +uint8_t hash_algo; +char reserved[56]; +} QCowDedupConfExtension; + #define QCOW2_EXT_MAGIC_END 0 #define QCOW2_EXT_MAGIC_BACKING_FORMAT 0xE2792ACA #define QCOW2_EXT_MAGIC_FEATURE_TABLE 0x6803f857 +#define QCOW2_EXT_MAGIC_DEDUP_TABLE 0xCD8E819B static int qcow2_probe(const uint8_t *buf, int buf_size, const char *filename) { @@ -85,6 +95,7 @@ static int qcow2_read_extensions(BlockDriverState *bs, uint64_t start_offset, QCowExtension ext; uint64_t offset; int ret; +QCowDedupConfExtension dedup_conf_ext; #ifdef DEBUG_EXT printf(qcow2_read_extensions: start=%ld end=%ld\n, start_offset, end_offset); @@ -149,6 +160,28 @@ static int qcow2_read_extensions(BlockDriverState *bs, uint64_t start_offset, } break; +case QCOW2_EXT_MAGIC_DEDUP_TABLE: +if (ext.len sizeof(dedup_conf_ext)) { +fprintf(stderr, ERROR: dedup_conf_ext: len=%u too large + (=%zu)\n, +ext.len, sizeof(dedup_conf_ext)); +return 2; +} +ret = bdrv_pread(bs-file, offset, + dedup_conf_ext, ext.len); +if (ret 0) { +return ret; +} +s-dedup_conf_offset = +be64_to_cpu(dedup_conf_ext.offset); +s-dedup_conf_size = +be32_to_cpu(dedup_conf_ext.size); +s-dedup_max_incarnations = +be32_to_cpu(dedup_conf_ext.max); +s-dedup_hash_algo = dedup_conf_ext.hash_algo; +s-dedup_dirty = dedup_conf_ext.flags QCOW_DEDUP_DIRTY; +break; + default: /* unknown magic - save it in case we need to rewrite the header */ { @@ -1006,6 +1039,7 @@ int qcow2_update_header(BlockDriverState *bs) uint32_t refcount_table_clusters; size_t header_length; Qcow2UnknownHeaderExtension *uext; +QCowDedupConfExtension dedup_conf_ext; buf = qemu_blockalign(bs, buflen); @@ -1109,6 +1143,25 @@ int qcow2_update_header(BlockDriverState *bs) buf += ret; buflen -= ret; +if (s-has_dedup) { +memset(dedup_conf_ext, 0, sizeof(dedup_conf_ext)); +dedup_conf_ext.offset= cpu_to_be64(s-dedup_conf_offset); +dedup_conf_ext.size = cpu_to_be32(s-dedup_conf_size); +dedup_conf_ext.max = cpu_to_be32(s-dedup_max_incarnations); +dedup_conf_ext.hash_algo = s-dedup_hash_algo; +dedup_conf_ext.flags = s-dedup_dirty ? QCOW_DEDUP_DIRTY : 0; +ret = header_ext_add(buf, + QCOW2_EXT_MAGIC_DEDUP_TABLE, + dedup_conf_ext, + sizeof(dedup_conf_ext), + buflen); +if (ret 0) { +goto fail; +} +buf += ret; +buflen -= ret; +} + /* Keep unknown header extensions */ QLIST_FOREACH(uext, s-unknown_header_ext, next) { ret = header_ext_add(buf, uext-magic, uext-data, uext-len, buflen); -- 1.7.10.4
[Qemu-devel] [RFC V8 17/24] qcow2: Drop hash for a given cluster when dedup makes refcount 2^16/2.
A new physical cluster with the same hash value will be used for further occurrence of this hash. Signed-off-by: Benoit Canet ben...@irqsave.net --- block/qcow2-dedup.c | 20 1 file changed, 16 insertions(+), 4 deletions(-) diff --git a/block/qcow2-dedup.c b/block/qcow2-dedup.c index 2d2c15c..da4ad5c 100644 --- a/block/qcow2-dedup.c +++ b/block/qcow2-dedup.c @@ -305,12 +305,24 @@ static int qcow2_deduplicate_cluster(BlockDriverState *bs, { BDRVQcowState *s = bs-opaque; uint64_t cluster_index = hash_info-physical_sect / s-cluster_sectors; -int ret = 0; +int refcount, ret = 0; /* Increment the refcount of the cluster */ -ret = qcow2_update_cluster_refcount(bs, -cluster_index, -1); +refcount = qcow2_update_cluster_refcount(bs, + cluster_index, + 1); + +if (refcount 0) { +return refcount; +} + +/* if we reached half the max refcount delete the QCowHashInfo from the + * store so the next time this cluster will be seen it will be handled as + * new + */ +if (refcount = 0x/2) { +ret = qcow2_store_delete(bs, s-key_value_store, hash_info); +} if (ret 0) { return ret; -- 1.7.10.4
[Qemu-devel] [RFC V8 19/24] qcow2: Integrate deduplication in qcow2_co_writev loop.
Signed-off-by: Benoit Canet ben...@irqsave.net --- block/qcow2.c | 88 +++-- 1 file changed, 86 insertions(+), 2 deletions(-) diff --git a/block/qcow2.c b/block/qcow2.c index e1265a2..8eb63f1 100644 --- a/block/qcow2.c +++ b/block/qcow2.c @@ -342,6 +342,7 @@ static int qcow2_open(BlockDriverState *bs, QDict *options, int flags) Error *local_err = NULL; uint64_t ext_end; +s-has_dedup = false; ret = bdrv_pread(bs-file, 0, header, sizeof(header)); if (ret 0) { goto fail; @@ -821,13 +822,17 @@ static coroutine_fn int qcow2_co_writev(BlockDriverState *bs, BDRVQcowState *s = bs-opaque; int index_in_cluster; int n_end; -int ret; +int ret = 0; int cur_nr_sectors; /* number of sectors in current iteration */ uint64_t cluster_offset; QEMUIOVector hd_qiov; uint64_t bytes_done = 0; uint8_t *cluster_data = NULL; QCowL2Meta *l2meta = NULL; +uint8_t *dedup_cluster_data = NULL; +int dedup_cluster_data_nr; +int deduped_sectors_nr; +QCowDedupState ds; trace_qcow2_writev_start_req(qemu_coroutine_self(), sector_num, remaining_sectors); @@ -838,13 +843,69 @@ static coroutine_fn int qcow2_co_writev(BlockDriverState *bs, qemu_co_mutex_lock(s-lock); +if (s-has_dedup) { +QTAILQ_INIT(ds.undedupables); +ds.phash.reuse = false; +ds.nb_undedupable_sectors = 0; +ds.nb_clusters_processed = 0; + +/* if deduplication is on we make sure dedup_cluster_data + * contains a multiple of cluster size of data in order + * to compute the hashes + */ +ret = qcow2_dedup_read_missing_and_concatenate(bs, + qiov, + sector_num, + remaining_sectors, + dedup_cluster_data, + dedup_cluster_data_nr); + +if (ret 0) { +goto fail; +} +} + while (remaining_sectors != 0) { l2meta = NULL; trace_qcow2_writev_start_part(qemu_coroutine_self()); + +if (s-has_dedup ds.nb_undedupable_sectors == 0) { +/* Try to deduplicate as much clusters as possible */ +deduped_sectors_nr = qcow2_dedup(bs, + ds, + sector_num, + dedup_cluster_data, + dedup_cluster_data_nr); + +if (deduped_sectors_nr 0) { +goto fail; +} + +remaining_sectors -= deduped_sectors_nr; +sector_num += deduped_sectors_nr; +bytes_done += deduped_sectors_nr * 512; + +/* no more data to write - exit */ +if (remaining_sectors = 0) { +break; +} + +/* if we deduped something trace it */ +if (deduped_sectors_nr) { +trace_qcow2_writev_done_part(qemu_coroutine_self(), + deduped_sectors_nr); +trace_qcow2_writev_start_part(qemu_coroutine_self()); +} +} + index_in_cluster = sector_num (s-cluster_sectors - 1); -n_end = index_in_cluster + remaining_sectors; +n_end = s-has_dedup +ds.nb_undedupable_sectors remaining_sectors ? +index_in_cluster + ds.nb_undedupable_sectors : +index_in_cluster + remaining_sectors; + if (s-crypt_method n_end QCOW_MAX_CRYPT_CLUSTERS * s-cluster_sectors) { n_end = QCOW_MAX_CRYPT_CLUSTERS * s-cluster_sectors; @@ -912,6 +973,28 @@ static coroutine_fn int qcow2_co_writev(BlockDriverState *bs, l2meta = next; } +/* Write the non duplicated clusters hashes to disk */ +if (s-has_dedup) { +int count = cur_nr_sectors / s-cluster_sectors; +int has_ending = ((cluster_offset 9) + index_in_cluster + + cur_nr_sectors) (s-cluster_sectors - 1); +if (index_in_cluster) { +count++; +} +if (has_ending) { +count++; +} +ret = qcow2_dedup_store_new_hashes(bs, + ds, + count, + sector_num, + (cluster_offset 9)); +if (ret 0) { +goto fail; +} +} + +ds.nb_undedupable_sectors -= cur_nr_sectors; remaining_sectors -= cur_nr_sectors;
[Qemu-devel] [RFC V8 22/24] qcow2: Add qcow2_dedup_init and qcow2_dedup_close.
Signed-off-by: Benoit Canet ben...@irqsave.net --- block/qcow2-dedup.c | 60 +++ block/qcow2.c |2 +- block/qcow2.h |2 ++ 3 files changed, 63 insertions(+), 1 deletion(-) diff --git a/block/qcow2-dedup.c b/block/qcow2-dedup.c index 606459f..5eeea38 100644 --- a/block/qcow2-dedup.c +++ b/block/qcow2-dedup.c @@ -716,3 +716,63 @@ void qcow2_dedup_destroy_hash(BlockDriverState *bs, free_exit: qemu_vfree(buf); } + +int qcow2_dedup_init(BlockDriverState *bs) +{ +BDRVQcowState *s = bs-opaque; +int ret = 0; + +s-has_dedup = true; + +/* if we are read-only we don't init anything */ +if (bs-read_only) { +return 0; +} + +/* no need to allocate the various store's buffers since qcow2_store_load + * will do it + */ + +/* load and parse the configuration from disk */ +ret = qcow2_store_load(bs, s-key_value_store); + +if (ret 0) { +return ret; +} + +/* if QEMU crashed forget everyting that was stored in the store */ +if(s-dedup_dirty) { +qcow2_store_forget(bs, s-key_value_store); +} + +/* set the dirty bit */ +s-dedup_dirty = true; +qcow2_update_header(bs); + +/* load the journal and the incarnations in a coroutine */ +return qcow2_store_start(bs, s-key_value_store); +} + +void qcow2_dedup_close(BlockDriverState *bs) +{ +BDRVQcowState *s = bs-opaque; +int ret = 0; + +/* if we are read-only we don't need to cleanup */ +if (bs-read_only) { +return; +} + +ret = qcow2_store_flush(bs, s-key_value_store); + +qcow2_store_cleanup(s-key_value_store); + +/* flush failed - leave the store dirty so it will be discard at restart */ +if (ret 0) { +return; +} + +/* clear the dirty bit */ +s-dedup_dirty = false; +qcow2_update_header(bs); +} diff --git a/block/qcow2.c b/block/qcow2.c index ea2f0f2..f7b94dd 100644 --- a/block/qcow2.c +++ b/block/qcow2.c @@ -1546,7 +1546,7 @@ static int qcow2_create2(const char *filename, int64_t total_size, goto out; } -qcow2_store_cleanup(bs, s-key_value_store); +qcow2_store_cleanup(s-key_value_store); } /* Want a backing file? There you go.*/ diff --git a/block/qcow2.h b/block/qcow2.h index 3c6e685..b293be7 100644 --- a/block/qcow2.h +++ b/block/qcow2.h @@ -751,5 +751,7 @@ int qcow2_dedup_store_new_hashes(BlockDriverState *bs, uint64_t physical_sect); void qcow2_dedup_destroy_hash(BlockDriverState *bs, uint64_t cluster_index); +int qcow2_dedup_init(BlockDriverState *bs); +void qcow2_dedup_close(BlockDriverState *bs); #endif -- 1.7.10.4
[Qemu-devel] [RFC V8 24/24] qcow2: Enable deduplication tests
--- tests/qemu-iotests/common |6 ++ 1 file changed, 6 insertions(+) diff --git a/tests/qemu-iotests/common b/tests/qemu-iotests/common index 6826ea7..2483742 100644 --- a/tests/qemu-iotests/common +++ b/tests/qemu-iotests/common @@ -130,6 +130,7 @@ check options -cowtest cow -qcow test qcow -qcow2 test qcow2 +-qcow2_deduptest qcow2 deduplication -qedtest qed -vditest vdi -vpctest vpc @@ -175,6 +176,11 @@ testlist options xpand=false ;; + -qcow2_dedup) + IMGFMT=qcow2_dedup + xpand=false + ;; + -qed) IMGFMT=qed xpand=false -- 1.7.10.4
[Qemu-devel] [RFC V8 02/24] qcow2: Add deduplication structures and fields.
Signed-off-by: Benoit Canet ben...@irqsave.net --- block/qcow2.h | 203 - 1 file changed, 201 insertions(+), 2 deletions(-) diff --git a/block/qcow2.h b/block/qcow2.h index 9421843..953edfe 100644 --- a/block/qcow2.h +++ b/block/qcow2.h @@ -57,7 +57,182 @@ #define REFCOUNT_CACHE_SIZE 4 #define DEFAULT_CLUSTER_SIZE 65536 - +#define DEFAULT_DEDUP_CLUSTER_SIZE 4096 + +#define HASH_LENGTH 32 + +/* indicate that this cluster hash has been deleted from the key value store */ +#define QCOW_DEDUP_DELETED (1LL 61) +/* indicate that the hash structure is empty and miss offset */ +#define QCOW_DEDUP_FLAG_EMPTY (1LL 62) + +#define SSD_ERASE_BLOCK_SIZE (512 * 1024) /* match SSD erase block size */ +#define JOURNAL_CLUSTER_SIZE 4096 /* used to read entries */ +#define HASH_STORE_CLUSTER_SIZE 4096 + +#define QCOW_LOG_END_SIZE 2/* size of a end block journal entry */ +#define QCOW_LOG_STORE_ENTRY_USED (1LL 60) /* mark used entry in table */ +#define QCOW_LOG_STORE_BUCKET_SIZE 4 /* size of a cuckoo hash bucket */ +#define QCOW_LOG_STORE_MAX_KICKS 128 /* max numbers of cuckoo hash kicks */ +#define QCOW_LOG_STORE_JOURNAL_RATIO 2 /* the ratio to compute the extra +* room the journal will take based +* on the log store size +*/ +#define QCOW2_NB_INCARNATION_GOAL 128 /* targeted number of incarnation */ + +#define QCOW_DEDUP_DIRTY 1 /* dirty flag in the qcow2 header extension */ + +typedef enum { +QCOW_LOG_NONE = 0xFF, /* on SSD erased clusters will mark none */ +QCOW_LOG_END = 1, /* end a block and point to the next */ +QCOW_LOG_HASH = 2,/* used to journalize a QCowHashInfo */ +} QCowLogEntryType; + +typedef enum { +QCOW_HASH_SHA256 = 0, +QCOW_HASH_SHA3 = 1, +QCOW_HASH_SKEIN = 2, +} QCowHashAlgo; + +typedef struct { +uint8_t data[HASH_LENGTH]; /* 32 bytes hash of a given cluster */ +} __attribute__((packed)) QCowHash; + +/* deduplication info */ +typedef struct { +QCowHash hash; +uint64_t physical_sect; /* where the cluster is stored on disk */ +uint64_t first_logical_sect; /* logical sector of the first occurrence of + * this cluster + */ +} __attribute__((packed)) QCowHashInfo; + +/* Used to keep a single precomputed hash between the calls of the dedup + * function + */ +typedef struct { +QCowHashInfo hash_info; +bool reuse; /* The main deduplication function can set this field to + * true before exiting to avoid computing the same hash + * twice. It's a speed optimization. + */ +} QcowPersistentHash; + +/* Undedupable hashes that must be written later to disk */ +typedef struct QCowHashElement { +QCowHashInfo hash_info; +QTAILQ_ENTRY(QCowHashElement) next; +} QCowHashElement; + +typedef struct { +QcowPersistentHash phash; /* contains a hash persisting between calls of +* qcow2_dedup() +*/ +QTAILQ_HEAD(, QCowHashElement) undedupables; +uint64_t nb_clusters_processed; +uint64_t nb_undedupable_sectors; +} QCowDedupState; + +/* The code must take care that the maximum size field of a QCowJournalEntry + * will be no more than 254 bytes. + * It's required to save the 2 bytes of room for QCOW_LOG_END entries + * in every cases + */ +typedef union { +QCowHashInfo hash_info; +uint8_t padding[254]; /* note the extra two bytes of padding to avoid +* read overflow. +*/ +} QCowJournalEntryUnion; + +typedef struct { +uint8_t size;/* maximum size of a journal entry is 254 bytes */ +uint8_t type;/* contains a QCowLogEntryType for future usage */ +QCowJournalEntryUnion u; +} __attribute__((packed)) QCowJournalEntry; + +typedef struct { +uint64_t sector; /* the journal physical on disk sector */ +uint64_t size;/* the size of the journal in bytes */ +uint64_t index; /* index of next buf cluster to write */ +uint8_t *write_buf; /* used to buffer written data */ +uint64_t offset_in_buf; /* the offset in the write buffer */ +bool flushed; /* true if the buffer reached disk*/ +uint8_t *read_cache; /* used to cache read data */ +int64_t read_index; /* index the cached read cluster */ +bool started; /* has the journal been resumed */ +} QCowJournal; + +typedef struct { +QCowJournal journal; /* the journal this log store will use */ +uint32_t order; /* the number of bits used for the sub hashes +
[Qemu-devel] [RFC V8 18/24] qcow2: Remove hash when cluster is deleted.
Signed-off-by: Benoit Canet ben...@irqsave.net --- block/qcow2-dedup.c| 45 + block/qcow2-refcount.c |3 +++ block/qcow2.h |2 ++ 3 files changed, 50 insertions(+) diff --git a/block/qcow2-dedup.c b/block/qcow2-dedup.c index da4ad5c..599cb2e 100644 --- a/block/qcow2-dedup.c +++ b/block/qcow2-dedup.c @@ -656,3 +656,48 @@ int qcow2_dedup_store_new_hashes(BlockDriverState *bs, return ret; } + +/* Clean the last reference to a given cluster when its refcount is zero + * + * @cluster_index: the index of the physical cluster + */ +void qcow2_dedup_destroy_hash(BlockDriverState *bs, + uint64_t cluster_index) +{ +BDRVQcowState *s = bs-opaque; +uint64_t offset = cluster_index * s-cluster_size; +QCowHashInfo hash_info; +uint8_t *buf; +int ret = 0; + +/* allocate buffer */ +buf = qemu_blockalign(bs, s-cluster_size); + +/* read cluster from disk */ +ret = bdrv_pread(bs-file, offset, buf, s-cluster_size); + +/* error */ +if (ret 0) { +goto free_exit; +} + +/* clear hash info */ +memset(hash_info, 0, sizeof(QCowHashInfo)); + +/* compute hash for the cluster */ +ret = qcow2_compute_cluster_hash(bs, + hash_info.hash, + buf); + + +/* error */ +if (ret 0) { +goto free_exit; +} + +/* delete hash from key value store. It will not be deduplicated anymore */ +qcow2_store_delete(bs, s-key_value_store, hash_info); + +free_exit: + qemu_vfree(buf); +} diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c index 3bd8f37..2734cd9 100644 --- a/block/qcow2-refcount.c +++ b/block/qcow2-refcount.c @@ -482,6 +482,9 @@ static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs, ret = -EINVAL; goto fail; } +if (s-has_dedup refcount == 0) { +qcow2_dedup_destroy_hash(bs, cluster_index); +} if (refcount == 0 cluster_index s-free_cluster_index) { s-free_cluster_index = cluster_index; } diff --git a/block/qcow2.h b/block/qcow2.h index 720131d..6f85e03 100644 --- a/block/qcow2.h +++ b/block/qcow2.h @@ -748,5 +748,7 @@ int qcow2_dedup_store_new_hashes(BlockDriverState *bs, int count, uint64_t logical_sect, uint64_t physical_sect); +void qcow2_dedup_destroy_hash(BlockDriverState *bs, + uint64_t cluster_index); #endif -- 1.7.10.4