from:"Benoît Canet"

Re: [Qemu-devel] QCOW2 cryptography and secure key handling

2013-07-23 Thread Benoît Canet

  Do you (the block maintainers) have an idea on how the code could be 
  improved
  to securely pass the crypto key to the QCOW2 code ?
 
 More generally, QCow2's current encryption support is woefully inadequate
 from a design POV. If we wanted better encryption built-in to QEMU it is
 best to just deprecate the current encryption support and define a new
 qcow2 extension based around something like the LUKS data format. Using
 the LUKS data format precisely would be good from a data portability
 POV, since then you can easily switch your images between LUKS encrypted
 block device  qcow2-with-luks image file, without needing to re-encrypt
 the data.

Thanks I will read the LUKS specification.

Best regards

Benoît

Re: [Qemu-devel] QCOW2 cryptography and secure key handling

2013-07-23 Thread Benoît Canet

 More generally, QCow2's current encryption support is woefully inadequate
 from a design POV. If we wanted better encryption built-in to QEMU it is
 best to just deprecate the current encryption support and define a new
 qcow2 extension based around something like the LUKS data format. Using
 the LUKS data format precisely would be good from a data portability
 POV, since then you can easily switch your images between LUKS encrypted
 block device  qcow2-with-luks image file, without needing to re-encrypt
 the data.

I read the LUKS specification and undestood enough part of it to understand the
potentials benefits (stronger encryption key, multiple user keys, possibility to
change users keys).

Kevin  Stefan: What do you think about implementing LUKS in QCOW2 ?

Best regards

Benoît

Re: [Qemu-devel] [PATCH V2 for-1.6 2/5] block: Modify the throttling code to implement the leaky bucket algorithm.

2013-07-23 Thread Benoît Canet

 1. Can we activate the timer only when requests are actually pending?
Imagine a host with 1000 guests, even a 1 second timer becomes
wasteful.

I will try to do this.

 2. You don't vary the wait time, does this mean a throttled request must
wait for max 1 second?  If yes, then it introduces a big variance on
request latency.

If iops or bps request where done on a negligible time (very fast storage
backend) yes.
We could make the timer frequency higher though to mitigate this.

Best regards

Benoît

Re: [Qemu-devel] [PATCH V2 for-1.6 2/5] block: Modify the throttling code to implement the leaky bucket algorithm.

2013-07-23 Thread Benoît Canet

 I don't fully understand the patch yet but:

The patch is a variant of the following algorithm.
http://en.wikipedia.org/wiki/Leaky_bucket

Best regards

Benoît

[Qemu-devel] [PATCH V3 for-1.6 1/5] block: Repair the throttling code.

2013-07-23 Thread Benoît Canet

The throttling code was segfaulting since commit
02ffb504485f0920cfc75a0982a602f824a9a4f4 because some qemu_co_queue_next caller
does not run in a coroutine.
qemu_co_queue_do_restart assume that the caller is a coroutinne.
As suggested by Stefan fix this by entering the coroutine directly.
Also make sure like suggested that qemu_co_queue_next() and
qemu_co_queue_restart_all() can be called only in coroutines.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block.c   |8 
 include/block/coroutine.h |9 +++--
 qemu-coroutine-lock.c |   20 ++--
 3 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/block.c b/block.c
index b560241..dc72643 100644
--- a/block.c
+++ b/block.c
@@ -127,7 +127,8 @@ void bdrv_io_limits_disable(BlockDriverState *bs)
 {
 bs-io_limits_enabled = false;
 
-while (qemu_co_queue_next(bs-throttled_reqs));
+while (qemu_co_enter_next(bs-throttled_reqs)) {
+}
 
 if (bs-block_timer) {
 qemu_del_timer(bs-block_timer);
@@ -143,7 +144,7 @@ static void bdrv_block_timer(void *opaque)
 {
 BlockDriverState *bs = opaque;
 
-qemu_co_queue_next(bs-throttled_reqs);
+qemu_co_enter_next(bs-throttled_reqs);
 }
 
 void bdrv_io_limits_enable(BlockDriverState *bs)
@@ -1445,8 +1446,7 @@ void bdrv_drain_all(void)
  * a busy wait.
  */
 QTAILQ_FOREACH(bs, bdrv_states, list) {
-if (!qemu_co_queue_empty(bs-throttled_reqs)) {
-qemu_co_queue_restart_all(bs-throttled_reqs);
+while (qemu_co_enter_next(bs-throttled_reqs)) {
 busy = true;
 }
 }
diff --git a/include/block/coroutine.h b/include/block/coroutine.h
index 377805a..1f2db3e 100644
--- a/include/block/coroutine.h
+++ b/include/block/coroutine.h
@@ -130,12 +130,17 @@ void coroutine_fn qemu_co_queue_wait_insert_head(CoQueue 
*queue);
  *
  * Returns true if a coroutine was restarted, false if the queue is empty.
  */
-bool qemu_co_queue_next(CoQueue *queue);
+bool coroutine_fn qemu_co_queue_next(CoQueue *queue);
 
 /**
  * Restarts all coroutines in the CoQueue and leaves the queue empty.
  */
-void qemu_co_queue_restart_all(CoQueue *queue);
+void coroutine_fn qemu_co_queue_restart_all(CoQueue *queue);
+
+/**
+ * Enter the next coroutine in the queue
+ */
+bool qemu_co_enter_next(CoQueue *queue);
 
 /**
  * Checks if the CoQueue is empty.
diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
index d9fea49..aeb33b9 100644
--- a/qemu-coroutine-lock.c
+++ b/qemu-coroutine-lock.c
@@ -88,16 +88,32 @@ static bool qemu_co_queue_do_restart(CoQueue *queue, bool 
single)
 return true;
 }
 
-bool qemu_co_queue_next(CoQueue *queue)
+bool coroutine_fn qemu_co_queue_next(CoQueue *queue)
 {
+assert(qemu_in_coroutine());
 return qemu_co_queue_do_restart(queue, true);
 }
 
-void qemu_co_queue_restart_all(CoQueue *queue)
+void coroutine_fn qemu_co_queue_restart_all(CoQueue *queue)
 {
+assert(qemu_in_coroutine());
 qemu_co_queue_do_restart(queue, false);
 }
 
+bool qemu_co_enter_next(CoQueue *queue)
+{
+Coroutine *next;
+
+next = QTAILQ_FIRST(queue-entries);
+if (!next) {
+return false;
+}
+
+QTAILQ_REMOVE(queue-entries, next, co_queue_next);
+qemu_coroutine_enter(next, NULL);
+return true;
+}
+
 bool qemu_co_queue_empty(CoQueue *queue)
 {
 return (QTAILQ_FIRST(queue-entries) == NULL);
-- 
1.7.10.4

[Qemu-devel] [PATCH V3 for-1.6 3/5] block: Add support for throttling burst threshold in QMP and the command line.

2013-07-23 Thread Benoît Canet

The thresholds of the leaky bucket algorithm can be used to allow some
burstiness.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block/qapi.c |   24 +
 blockdev.c   |  105 +++---
 hmp.c|   32 +++--
 qapi-schema.json |   34 --
 qemu-options.hx  |2 +-
 qmp-commands.hx  |   30 ++--
 6 files changed, 205 insertions(+), 22 deletions(-)

diff --git a/block/qapi.c b/block/qapi.c
index a4bc411..03f1604 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -235,6 +235,30 @@ void bdrv_query_info(BlockDriverState *bs,
bs-io_limits.iops[BLOCK_IO_LIMIT_READ];
 info-inserted-iops_wr =
bs-io_limits.iops[BLOCK_IO_LIMIT_WRITE];
+info-inserted-has_bps_threshold =
+   bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL];
+info-inserted-bps_threshold =
+   bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL];
+info-inserted-has_bps_rd_threshold =
+   bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_READ];
+info-inserted-bps_rd_threshold =
+   bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_READ];
+info-inserted-has_bps_wr_threshold =
+   bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_WRITE];
+info-inserted-bps_wr_threshold =
+   bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_WRITE];
+info-inserted-has_iops_threshold =
+   bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL];
+info-inserted-iops_threshold =
+   bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL];
+info-inserted-iops_rd_threshold =
+   bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_READ];
+info-inserted-has_iops_rd_threshold =
+   bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_READ];
+info-inserted-has_iops_wr_threshold =
+   bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE];
+info-inserted-iops_wr_threshold =
+   bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE];
 }
 
 bs0 = bs;
diff --git a/blockdev.c b/blockdev.c
index 491e4d0..241ebe8 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -341,6 +341,17 @@ static bool do_check_io_limits(BlockIOLimit *io_limits, 
Error **errp)
 return false;
 }
 
+if (io_limits-bps_threshold[BLOCK_IO_LIMIT_TOTAL]  0 ||
+io_limits-bps_threshold[BLOCK_IO_LIMIT_WRITE]  0 ||
+io_limits-bps_threshold[BLOCK_IO_LIMIT_READ]  0 ||
+io_limits-iops_threshold[BLOCK_IO_LIMIT_TOTAL]  0 ||
+io_limits-iops_threshold[BLOCK_IO_LIMIT_WRITE]  0 ||
+io_limits-iops_threshold[BLOCK_IO_LIMIT_READ]  0) {
+error_setg(errp,
+   threshold values must be 0 or greater);
+return false;
+}
+
 return true;
 }
 
@@ -523,24 +534,34 @@ DriveInfo *drive_init(QemuOpts *all_opts, 
BlockInterfaceType block_default_type)
qemu_opt_get_number(opts, bps_rd, 0);
 io_limits.bps[BLOCK_IO_LIMIT_WRITE]  =
qemu_opt_get_number(opts, bps_wr, 0);
+/* bps thresholds */
+io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL]  =
+qemu_opt_get_number(opts, bps_threshold,
+io_limits.bps[BLOCK_IO_LIMIT_TOTAL] / THROTTLE_HZ);
+io_limits.bps_threshold[BLOCK_IO_LIMIT_READ]  =
+qemu_opt_get_number(opts, bps_rd_threshold,
+io_limits.bps[BLOCK_IO_LIMIT_READ] / THROTTLE_HZ);
+io_limits.bps_threshold[BLOCK_IO_LIMIT_WRITE]  =
+qemu_opt_get_number(opts, bps_wr_threshold,
+io_limits.bps[BLOCK_IO_LIMIT_WRITE] / THROTTLE_HZ);
+
 io_limits.iops[BLOCK_IO_LIMIT_TOTAL] =
qemu_opt_get_number(opts, iops, 0);
 io_limits.iops[BLOCK_IO_LIMIT_READ]  =
qemu_opt_get_number(opts, iops_rd, 0);
 io_limits.iops[BLOCK_IO_LIMIT_WRITE] =
qemu_opt_get_number(opts, iops_wr, 0);
-io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL] =
-   io_limits.bps[BLOCK_IO_LIMIT_TOTAL] / THROTTLE_HZ;
-io_limits.bps_threshold[BLOCK_IO_LIMIT_READ]  =
-   io_limits.bps[BLOCK_IO_LIMIT_READ] / THROTTLE_HZ;
-io_limits.bps_threshold[BLOCK_IO_LIMIT_WRITE] =
-   io_limits.bps[BLOCK_IO_LIMIT_WRITE] / THROTTLE_HZ;
-io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL] =
-   io_limits.iops[BLOCK_IO_LIMIT_TOTAL] / THROTTLE_HZ;
+
+/* iops thresholds */
+io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL]  =
+qemu_opt_get_number(opts, iops_threshold,
+

[Qemu-devel] [PATCH V3 for-1.6 0/5] Leaky bucket throttling and features

2013-07-23 Thread Benoît Canet

The first patch fixes the throttling which was broken by a previous commit.

The next patch replace the existing throttling algorithm by the well described
leaky bucket algorithm.

Third patch implement bursting by adding *_threshold parameters to
qmp_block_set_io_throttle.

The last one allow to define the max size of an io when throttling by iops via
iops_sector_count to avoid vm users cheating on the iops limit.

The last patch adds a metric reflecting how much the I/O are throttled.

since v1:
Add throttling percentage metric [Benoît]

since v2:
Enable timer only during I/O activity [Stefan]
Mark function as coroutine_fn [Stefan]
Guard these function checking they are in a coroutine [Stefan]
Use defines to access buckets [Stefan]
Fix typo [Stefan]
reset throttling metric on iddle [Benoît]
rename invalid to check_io_limit [Stefan]

Benoît Canet (5):
  block: Repair the throttling code.
  block: Modify the throttling code to implement the leaky bucket
algorithm.
  block: Add support for throttling burst threshold in QMP and the
command line.
  block: Add iops_sector_count to do the iops accounting for a given io
size.
  block: Add throttling percentage metrics.

 block.c   |  493 -
 block/qapi.c  |   32 +++
 blockdev.c|  174 ++--
 hmp.c |   38 +++-
 include/block/block_int.h |   18 +-
 include/block/coroutine.h |9 +-
 qapi-schema.json  |   42 +++-
 qemu-coroutine-lock.c |   20 +-
 qemu-options.hx   |2 +-
 qmp-commands.hx   |   34 +++-
 10 files changed, 642 insertions(+), 220 deletions(-)

-- 
1.7.10.4

[Qemu-devel] [PATCH V3 for-1.6 2/5] block: Modify the throttling code to implement the leaky bucket algorithm.

2013-07-23 Thread Benoît Canet

This patch replace the previous algorithm by the well described leaky bucket
algorithm: A bucket is filled by the incoming IOs and a periodic timer decrement
the counter to make the bucket leak. When a given threshold is reached the
bucket is full and the IOs are hold.

In this patch the threshold is set to a default value to make the code behave
like the previous implementation.

In the next patch the threshold will be exposed in QMP to let the user control
the burstiness of the throttling.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block.c   |  454 +++--
 blockdev.c|   71 +--
 include/block/block_int.h |   15 +-
 3 files changed, 339 insertions(+), 201 deletions(-)

diff --git a/block.c b/block.c
index dc72643..f1cd9c0 100644
--- a/block.c
+++ b/block.c
@@ -86,13 +86,6 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque);
 static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
 int64_t sector_num, int nb_sectors);
 
-static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
-bool is_write, double elapsed_time, uint64_t *wait);
-static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
-double elapsed_time, uint64_t *wait);
-static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
-bool is_write, int64_t *wait);
-
 static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
 QTAILQ_HEAD_INITIALIZER(bdrv_states);
 
@@ -101,6 +94,8 @@ static QLIST_HEAD(, BlockDriver) bdrv_drivers =
 
 /* If non-zero, use only whitelisted block drivers */
 static int use_bdrv_whitelist;
+/* boolean used to inform the throttling code that a bdrv_drain_all is issued 
*/
+static bool draining;
 
 #ifdef _WIN32
 static int is_windows_drive_prefix(const char *filename)
@@ -129,28 +124,170 @@ void bdrv_io_limits_disable(BlockDriverState *bs)
 
 while (qemu_co_enter_next(bs-throttled_reqs)) {
 }
+}
 
-if (bs-block_timer) {
-qemu_del_timer(bs-block_timer);
-qemu_free_timer(bs-block_timer);
-bs-block_timer = NULL;
+static void bdrv_make_bps_buckets_leak(BlockDriverState *bs, int64_t delta)
+{
+int64_t *bytes = bs-leaky_buckets.bytes;
+int64_t read_leak, write_leak;
+
+/* the limit apply to both reads and writes */
+if (bs-io_limits.bps[BLOCK_IO_LIMIT_TOTAL]) {
+/* compute half the total leak */
+int64_t leak = ((bs-io_limits.bps[BLOCK_IO_LIMIT_TOTAL] * delta) /
+   NANOSECONDS_PER_SECOND);
+int remain = leak % 2;
+leak /= 2;
+
+/* the read bucket is smaller than half the quantity to leak so take
+ * care adding the leak difference to write leak
+ */
+if (bytes[BLOCK_IO_LIMIT_READ] = leak) {
+read_leak = bytes[BLOCK_IO_LIMIT_READ];
+write_leak = 2 * leak + remain - bytes[BLOCK_IO_LIMIT_READ];
+/* symetric case */
+} else if (bytes[BLOCK_IO_LIMIT_WRITE] = leak) {
+write_leak = bytes[BLOCK_IO_LIMIT_WRITE];
+read_leak = 2 * leak + remain - bytes[BLOCK_IO_LIMIT_WRITE];
+/* both bucket above leak count use half the total leak for both */
+} else {
+write_leak = leak;
+read_leak = leak + remain;
+}
+/* else we consider that limits are separated */
+} else {
+read_leak = (bs-io_limits.bps[BLOCK_IO_LIMIT_READ] * delta) /
+NANOSECONDS_PER_SECOND;
+write_leak = (bs-io_limits.bps[BLOCK_IO_LIMIT_WRITE] * delta) /
+ NANOSECONDS_PER_SECOND;
+}
+
+/* make the buckets leak */
+bytes[BLOCK_IO_LIMIT_READ]  = MAX(bytes[BLOCK_IO_LIMIT_READ] - read_leak,
+  0);
+bytes[BLOCK_IO_LIMIT_WRITE] = MAX(bytes[BLOCK_IO_LIMIT_WRITE] - write_leak,
+  0);
+}
+
+static void bdrv_make_iops_buckets_leak(BlockDriverState *bs, int64_t delta)
+{
+double *ios = bs-leaky_buckets.ios;
+int64_t read_leak, write_leak;
+
+/* the limit apply to both reads and writes */
+if (bs-io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) {
+/* compute half the total leak */
+int64_t leak = ((bs-io_limits.iops[BLOCK_IO_LIMIT_TOTAL] * delta) /
+   NANOSECONDS_PER_SECOND);
+int remain = leak % 2;
+leak /= 2;
+
+/* the read bucket is smaller than half the quantity to leak so take
+ * care adding the leak difference to write leak
+ */
+if (ios[BLOCK_IO_LIMIT_READ] = leak) {
+read_leak = ios[BLOCK_IO_LIMIT_READ];
+write_leak = 2 * leak + remain - ios[BLOCK_IO_LIMIT_READ];
+/* symetric case */
+} else if (ios[BLOCK_IO_LIMIT_WRITE] = leak) {
+write_leak = ios[BLOCK_IO_LIMIT_WRITE];
+read_leak = 2 * leak + remain - ios[BLOCK_IO_LIMIT_WRITE];
+/* both bucket above leak count use

[Qemu-devel] [PATCH V3 for-1.6 5/5] block: Add throttling percentage metrics.

2013-07-23 Thread Benoît Canet

This metrics show how many percent of the time I/Os are put on hold by the
throttling algorithm.
This metric could be used by system administrators or cloud stack developpers
to decide when the throttling settings must be changed.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block.c   |   27 ++-
 block/qapi.c  |4 
 hmp.c |6 --
 include/block/block_int.h |2 ++
 qapi-schema.json  |4 +++-
 5 files changed, 39 insertions(+), 4 deletions(-)

diff --git a/block.c b/block.c
index bb4f8e4..6bb8570 100644
--- a/block.c
+++ b/block.c
@@ -118,12 +118,21 @@ int is_windows_drive(const char *filename)
 #endif
 
 /* throttling disk I/O limits */
+static void bdrv_reset_throttling_metrics(BlockDriverState *bs)
+{
+/* iddle - reset values */
+bs-throttling_percentage = 0;
+bs-full_since = 0;
+}
+
 void bdrv_io_limits_disable(BlockDriverState *bs)
 {
 bs-io_limits_enabled = false;
 
 while (qemu_co_enter_next(bs-throttled_reqs)) {
 }
+
+bdrv_reset_throttling_metrics(bs);
 }
 
 static void bdrv_make_bps_buckets_leak(BlockDriverState *bs, int64_t delta)
@@ -213,7 +222,8 @@ static void bdrv_make_iops_buckets_leak(BlockDriverState 
*bs, int64_t delta)
 static void bdrv_leak_if_needed(BlockDriverState *bs)
 {
 int64_t now;
-int64_t delta;
+int64_t delta; /* the delta that would be ideally the timer period */
+int64_t delta_full; /* the delta where the bucket is full */
 
 if (!bs-must_leak) {
 return;
@@ -223,6 +233,14 @@ static void bdrv_leak_if_needed(BlockDriverState *bs)
 
 now = qemu_get_clock_ns(rt_clock);
 delta = now - bs-previous_leak;
+/* compute throttle percentage reflecting how long IO are hold on average 
*/
+if (bs-full_since) {
+delta_full = now - bs-full_since;
+bs-throttling_percentage = (delta_full * 100) / delta;
+bs-full_since = 0;
+} else {
+bs-throttling_percentage = 0;
+}
 bs-previous_leak = now;
 
 bdrv_make_bps_buckets_leak(bs, delta);
@@ -260,6 +278,7 @@ static void bdrv_block_timer(void *opaque)
 /* disable throttling time on iddle for economy purpose */
 if (bdrv_throttling_is_iddle(bs)) {
 bdrv_block_timer_disable(bs);
+bdrv_reset_throttling_metrics(bs);
 return;
 }
 
@@ -280,6 +299,7 @@ static void bdrv_block_timer_enable(BlockDriverState *bs)
 
 bs-block_timer = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
 bs-previous_leak = qemu_get_clock_ns(rt_clock);
+bdrv_reset_throttling_metrics(bs);
 qemu_mod_timer(bs-block_timer,
qemu_get_clock_ns(vm_clock) +
BLOCK_IO_THROTTLE_PERIOD);
@@ -432,6 +452,11 @@ static void bdrv_io_limits_intercept(BlockDriverState *bs,
  * not full
  */
 while (bdrv_is_any_threshold_exceeded(bs, nb_sectors, is_write)) {
+/* remember since when the code decided to block the first I/O */
+if (qemu_co_queue_empty(bs-throttled_reqs)) {
+bs-full_since = qemu_get_clock_ns(rt_clock);
+}
+
 bdrv_leak_if_needed(bs);
 qemu_co_queue_wait_insert_head(bs-throttled_reqs);
 bdrv_leak_if_needed(bs);
diff --git a/block/qapi.c b/block/qapi.c
index f81081c..bd1c6af 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -263,6 +263,10 @@ void bdrv_query_info(BlockDriverState *bs,
bs-io_limits.iops_sector_count;
 info-inserted-iops_sector_count =
bs-io_limits.iops_sector_count;
+info-inserted-has_throttling_percentage =
+   bs-throttling_percentage;
+info-inserted-throttling_percentage =
+   bs-throttling_percentage;
 }
 
 bs0 = bs;
diff --git a/hmp.c b/hmp.c
index 3912305..9dc4862 100644
--- a/hmp.c
+++ b/hmp.c
@@ -348,7 +348,8 @@ void hmp_info_block(Monitor *mon, const QDict *qdict)
  iops_threshold=% PRId64
  iops_rd_threshold=% PRId64
  iops_wr_threshold=% PRId64
- iops_sector_count=% PRId64 \n,
+ iops_sector_count=% PRId64
+ throttling_percentage=% PRId64 \n,
 info-value-inserted-bps,
 info-value-inserted-bps_rd,
 info-value-inserted-bps_wr,
@@ -361,7 +362,8 @@ void hmp_info_block(Monitor *mon, const QDict *qdict)
 info-value-inserted-iops_threshold,
 info-value-inserted-iops_rd_threshold,
 info-value-inserted-iops_wr_threshold,
-info-value-inserted-iops_sector_count);
+info-value-inserted-iops_sector_count,
+

[Qemu-devel] [PATCH V3 for-1.6 4/5] block: Add iops_sector_count to do the iops accounting for a given io size.

2013-07-23 Thread Benoît Canet

This feature can be used in case where users are avoiding the iops limit by
doing jumbo I/Os hammering the storage backend.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block.c   |8 +++-
 block/qapi.c  |4 
 blockdev.c|   22 +-
 hmp.c |8 ++--
 include/block/block_int.h |1 +
 qapi-schema.json  |   10 --
 qemu-options.hx   |2 +-
 qmp-commands.hx   |8 ++--
 8 files changed, 54 insertions(+), 9 deletions(-)

diff --git a/block.c b/block.c
index f1cd9c0..bb4f8e4 100644
--- a/block.c
+++ b/block.c
@@ -368,6 +368,12 @@ static bool 
bdrv_is_any_threshold_exceeded(BlockDriverState *bs, int nb_sectors,
bool is_write)
 {
 bool bps_ret, iops_ret;
+double   ios = 1.0;
+
+if (bs-io_limits.iops_sector_count) {
+ios = ((double) nb_sectors) / bs-io_limits.iops_sector_count;
+ios = MAX(ios, 1.0);
+}
 
 /* check if any bandwith or per IO threshold has been exceeded */
 bps_ret = bdrv_is_bps_threshold_exceeded(bs, is_write);
@@ -391,7 +397,7 @@ static bool bdrv_is_any_threshold_exceeded(BlockDriverState 
*bs, int nb_sectors,
 /* the IO is authorized so do the accounting and return false */
 bs-leaky_buckets.bytes[is_write] += (int64_t)nb_sectors *
  BDRV_SECTOR_SIZE;
-bs-leaky_buckets.ios[is_write]++;
+bs-leaky_buckets.ios[is_write] += ios;
 
 return false;
 }
diff --git a/block/qapi.c b/block/qapi.c
index 03f1604..f81081c 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -259,6 +259,10 @@ void bdrv_query_info(BlockDriverState *bs,
bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE];
 info-inserted-iops_wr_threshold =
bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE];
+info-inserted-has_iops_sector_count =
+   bs-io_limits.iops_sector_count;
+info-inserted-iops_sector_count =
+   bs-io_limits.iops_sector_count;
 }
 
 bs0 = bs;
diff --git a/blockdev.c b/blockdev.c
index 241ebe8..d56e8a1 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -352,6 +352,11 @@ static bool do_check_io_limits(BlockIOLimit *io_limits, 
Error **errp)
 return false;
 }
 
+if (io_limits-iops_sector_count  0) {
+error_setg(errp, iops_sector_count must be 0 or greater);
+return false;
+}
+
 return true;
 }
 
@@ -563,6 +568,10 @@ DriveInfo *drive_init(QemuOpts *all_opts, 
BlockInterfaceType block_default_type)
 qemu_opt_get_number(opts, iops_wr_threshold,
 io_limits.iops[BLOCK_IO_LIMIT_WRITE] / 
THROTTLE_HZ);
 
+io_limits.iops_sector_count =
+   qemu_opt_get_number(opts, iops_sector_count, 0);
+ 
+
 if (!do_check_io_limits(io_limits, error)) {
 error_report(%s, error_get_pretty(error));
 error_free(error);
@@ -1260,7 +1269,9 @@ void qmp_block_set_io_throttle(const char *device, 
int64_t bps, int64_t bps_rd,
bool has_iops_rd_threshold,
int64_t iops_rd_threshold,
bool has_iops_wr_threshold,
-   int64_t iops_wr_threshold, Error **errp)
+   int64_t iops_wr_threshold,
+   bool has_iops_sector_count,
+   int64_t iops_sector_count, Error **errp)
 {
 BlockIOLimit io_limits;
 BlockDriverState *bs;
@@ -1283,6 +1294,7 @@ void qmp_block_set_io_throttle(const char *device, 
int64_t bps, int64_t bps_rd,
 io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL] = iops / THROTTLE_HZ;
 io_limits.iops_threshold[BLOCK_IO_LIMIT_READ]  = iops_rd / THROTTLE_HZ;
 io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE] = iops_wr / THROTTLE_HZ;
+io_limits.iops_sector_count = 0;
 
 /* override them with givens values if present */
 if (has_bps_threshold) {
@@ -1304,6 +1316,10 @@ void qmp_block_set_io_throttle(const char *device, 
int64_t bps, int64_t bps_rd,
 io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE] = iops_wr_threshold;
 }
 
+if (has_iops_sector_count) {
+io_limits.iops_sector_count = iops_sector_count;
+}
+
 if (!do_check_io_limits(io_limits, errp)) {
 return;
 }
@@ -2007,6 +2023,10 @@ QemuOptsList qemu_common_drive_opts = {
 .type = QEMU_OPT_NUMBER,
 .help = write bytes threshold,
 },{
+.name = iops_sector_count,
+.type = QEMU_OPT_NUMBER,
+.help = when limiting by iops max size of an I/O in sector,
+},{
 .name = copy-on-read,
 .type = QEMU_OPT_BOOL,
 .help = copy read data from backing file into image file,
diff --git a/hmp.c b/hmp.c
index

Re: [Qemu-devel] QCOW2 cryptography and secure key handling

2013-07-24 Thread Benoît Canet

 There are two ways I could see it happening. Either integrate directly
 into the qcow2 file format, by mapping LUKS headers  key material
 blocks into the qcow2 header region in some manner.
 
 Alternatively do it in a completely generic block driver, that qcow2
 (or any other qemu bdrv) calls into instead of the file bdrv. That
 way the entire LUKS format becomes the image file data payload. A
 separate block driver, could also allow LUKS to be layered ontop,
 so that metadata is encrypted too.  eg so you could end up with
 either layering
 
QCow2 bdrv - LUKS bdrv - file bdrv
LUKS bdrv - QCow2 bdrv - file bdrv

I already tried the generic block driver approach on other project. (Quorum)
The problem is that it result in complex issues to make the driver works with
all QEMU features (think snapshots) and that no one has the funding to tackle
the infrastructure work required to solve this: writing BlockBackend and block
filters.

Best regards

Benoît

Re: [Qemu-devel] [PATCH V3 for-1.6 0/5] Leaky bucket throttling and features

2013-07-25 Thread Benoît Canet

Le Thursday 25 Jul 2013 à 20:08:22 (+0800), Fam Zheng a écrit :
 My ignorant question: Besides the fix, how is io throttling improved
 from previously, what are the advantages of leaky bucket algorithm?

The advantage is simplicity and the ability to allow controlled bursts.

The disavantage is it add stutters to the iops flow.

Best Regards

Benoît

Re: [Qemu-devel] [PATCH V3 for-1.6 2/5] block: Modify the throttling code to implement the leaky bucket algorithm.

2013-07-26 Thread Benoît Canet

 I think it is easier to understand written like this:
 
int64_t total_leak = ((bs-io_limits.iops[BLOCK_IO_LIMIT_TOTAL] * 
 delta) /
   NANOSECONDS_PER_SECOND);
if (ios[BLOCK_IO_LIMIT_READ] = total_leak / 2) {
   read_leak = ios[BLOCK_IO_LIMIT_READ];
   write_leak = total_leak - read_leak;
/* symetric case */
} else if (ios[BLOCK_IO_LIMIT_WRITE] = total_leak / 2) {
   write_leak = ios[BLOCK_IO_LIMIT_WRITE];
   read_leak = total_leak - write_leak;
/* both bucket above leak count use half the total leak for both */
} else {
   write_leak = total_leak / 2;
   read_leak = (total_leak + 1) / 2;
}

Thanks,
I will propagate these changes in the new infinite HZ algorithm I am currently
writing.

 
  +/* else we consider that limits are separated */
  +} else {
  +read_leak = (bs-io_limits.iops[BLOCK_IO_LIMIT_READ] * delta) /
  +NANOSECONDS_PER_SECOND;
  +write_leak = (bs-io_limits.iops[BLOCK_IO_LIMIT_WRITE] * delta) /
  + NANOSECONDS_PER_SECOND;
  +}
  +
  +/* make the buckets leak */
  +ios[BLOCK_IO_LIMIT_READ]  = MAX(ios[BLOCK_IO_LIMIT_READ] - read_leak, 
  0);
  +ios[BLOCK_IO_LIMIT_WRITE] = MAX(ios[BLOCK_IO_LIMIT_WRITE] - 
  write_leak, 0);
  +}
  +
  +static void bdrv_leak_if_needed(BlockDriverState *bs)
  +{
  +int64_t now;
  +int64_t delta;
  +
  +if (!bs-must_leak) {
  +return;
  +}
  +
  +bs-must_leak = false;
  +
  +now = qemu_get_clock_ns(rt_clock);
  +delta = now - bs-previous_leak;
  +bs-previous_leak = now;
  +
  +bdrv_make_bps_buckets_leak(bs, delta);
  +bdrv_make_iops_buckets_leak(bs, delta);
  +}
  +
  +static void bdrv_block_timer_disable(BlockDriverState *bs)
  +{
  +if (!bs-block_timer) {
  +return;
   }
   
  -bs-slice_start = 0;
  -bs-slice_end   = 0;
  +qemu_del_timer(bs-block_timer);
  +qemu_free_timer(bs-block_timer);
  +bs-block_timer = NULL;
  +}
  +
  +static bool bdrv_throttling_is_iddle(BlockDriverState *bs)
 
 I don't quite understad the wording here, is iddle equivalent to idle?
 
  +{
  +int64_t delta = qemu_get_clock_ns(rt_clock) - bs-previous_leak;
  +
  +if (delta  BLOCK_IO_THROTTLE_PERIOD * 2) {
  +return false;
  +}
  +
  +/* iddle */
  +return true;
   }
   
  +/* This callback is the timer in charge of making the leaky buckets leak */
   static void bdrv_block_timer(void *opaque)
 
 Will be more readable for me if you could rename it to
 bdrv_clock_timer_cb.

 
   {
   BlockDriverState *bs = opaque;
   
  +/* disable throttling time on iddle for economy purpose */
  +if (bdrv_throttling_is_iddle(bs)) {
  +bdrv_block_timer_disable(bs);
  +return;
  +}
  +
  +/* rearm the timer */
  +qemu_mod_timer(bs-block_timer,
  +   qemu_get_clock_ns(vm_clock) +
  +   BLOCK_IO_THROTTLE_PERIOD);
  +
  +bs-must_leak = true;
   qemu_co_enter_next(bs-throttled_reqs);
   }
   
  +static void bdrv_block_timer_enable(BlockDriverState *bs)
  +{
  +if (bs-block_timer) {
  +return;
  +}
  +
  +bs-block_timer = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
  +bs-previous_leak = qemu_get_clock_ns(rt_clock);
  +qemu_mod_timer(bs-block_timer,
  +   qemu_get_clock_ns(vm_clock) +
  +   BLOCK_IO_THROTTLE_PERIOD);
  +}
  +
   void bdrv_io_limits_enable(BlockDriverState *bs)
   {
   qemu_co_queue_init(bs-throttled_reqs);
  -bs-block_timer = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
   bs-io_limits_enabled = true;
   }
   
  @@ -165,15 +302,118 @@ bool bdrv_io_limits_enabled(BlockDriverState *bs)
|| io_limits-iops[BLOCK_IO_LIMIT_TOTAL];
   }
   
  +/* This function check if the correct bandwith threshold has been exceeded
 
 What does the correct bandwidth threshold mean?
 
 And s/bandwith/bandwidth/, series wide.
 
  + *
  + * @is_write: true if the current IO is a write, false if it's a read
  + * @ret:  true if threshold has been exceeded else false
  + */
  +static bool bdrv_is_bps_threshold_exceeded(BlockDriverState *bs, bool 
  is_write)
  +{
  +/* limit is on total read + write bps : do the sum and compare with 
  total
  + * threshold
  + */
  +if (bs-io_limits.bps[BLOCK_IO_LIMIT_TOTAL]) {
  +int64_t bytes = bs-leaky_buckets.bytes[BLOCK_IO_LIMIT_READ] +
  +bs-leaky_buckets.bytes[BLOCK_IO_LIMIT_WRITE];
  +return bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL]  bytes;
  +}
  +
  +/* check wether the threshold corresponding to the current io type 
  (read,
  + * write) has been exceeded
  + */
  +if (bs-io_limits.bps[is_write]) {
 
 It looks dangerous to use is_write as index of the

Re: [Qemu-devel] [PATCH V3 for-1.6 5/5] block: Add throttling percentage metrics.

2013-07-26 Thread Benoît Canet

 If I understand it, the percentage is recalculated every leak check. So
 it only reflects the instant io flow, instead of historical statistics?
 But I think for system admin purpose, it's good to know a longer range
 io activity character. Or do you think management tool should sample it?

Yes I wrote this with management polling in mind.

 
  +} else {
  +bs-throttling_percentage = 0;
  +}
   bs-previous_leak = now;
   
   bdrv_make_bps_buckets_leak(bs, delta);
  @@ -260,6 +278,7 @@ static void bdrv_block_timer(void *opaque)
   /* disable throttling time on iddle for economy purpose */
   if (bdrv_throttling_is_iddle(bs)) {
   bdrv_block_timer_disable(bs);
  +bdrv_reset_throttling_metrics(bs);
   return;
   }
   
  @@ -280,6 +299,7 @@ static void bdrv_block_timer_enable(BlockDriverState 
  *bs)
   
   bs-block_timer = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
   bs-previous_leak = qemu_get_clock_ns(rt_clock);
  +bdrv_reset_throttling_metrics(bs);
   qemu_mod_timer(bs-block_timer,
  qemu_get_clock_ns(vm_clock) +
  BLOCK_IO_THROTTLE_PERIOD);
  @@ -432,6 +452,11 @@ static void bdrv_io_limits_intercept(BlockDriverState 
  *bs,
* not full
*/
   while (bdrv_is_any_threshold_exceeded(bs, nb_sectors, is_write)) {
  +/* remember since when the code decided to block the first I/O */
  +if (qemu_co_queue_empty(bs-throttled_reqs)) {
  +bs-full_since = qemu_get_clock_ns(rt_clock);
  +}
  +
   bdrv_leak_if_needed(bs);
   qemu_co_queue_wait_insert_head(bs-throttled_reqs);
   bdrv_leak_if_needed(bs);
  diff --git a/block/qapi.c b/block/qapi.c
  index f81081c..bd1c6af 100644
  --- a/block/qapi.c
  +++ b/block/qapi.c
  @@ -263,6 +263,10 @@ void bdrv_query_info(BlockDriverState *bs,
  bs-io_limits.iops_sector_count;
   info-inserted-iops_sector_count =
  bs-io_limits.iops_sector_count;
  +info-inserted-has_throttling_percentage =
  +   bs-throttling_percentage;
  +info-inserted-throttling_percentage =
  +   bs-throttling_percentage;
   }
   
   bs0 = bs;
  diff --git a/hmp.c b/hmp.c
  index 3912305..9dc4862 100644
  --- a/hmp.c
  +++ b/hmp.c
  @@ -348,7 +348,8 @@ void hmp_info_block(Monitor *mon, const QDict *qdict)
iops_threshold=% PRId64
iops_rd_threshold=% PRId64
iops_wr_threshold=% PRId64
  - iops_sector_count=% PRId64 \n,
  + iops_sector_count=% PRId64
  + throttling_percentage=% PRId64 \n,
   info-value-inserted-bps,
   info-value-inserted-bps_rd,
   info-value-inserted-bps_wr,
  @@ -361,7 +362,8 @@ void hmp_info_block(Monitor *mon, const QDict *qdict)
   info-value-inserted-iops_threshold,
   info-value-inserted-iops_rd_threshold,
   info-value-inserted-iops_wr_threshold,
  -info-value-inserted-iops_sector_count);
  +info-value-inserted-iops_sector_count,
  +info-value-inserted-throttling_percentage);
   } else {
   monitor_printf(mon,  [not inserted]);
   }
  diff --git a/include/block/block_int.h b/include/block/block_int.h
  index 74d7503..4487cd9 100644
  --- a/include/block/block_int.h
  +++ b/include/block/block_int.h
  @@ -271,6 +271,8 @@ struct BlockDriverState {
   BlockIOLimit io_limits;
   BlockIOBaseValue leaky_buckets;
   int64_t  previous_leak;
  +int64_t  full_since;
  +int  throttling_percentage;
   bool must_leak;
   CoQueue  throttled_reqs;
   QEMUTimer*block_timer;
  diff --git a/qapi-schema.json b/qapi-schema.json
  index d579fda..14a02e7 100644
  --- a/qapi-schema.json
  +++ b/qapi-schema.json
  @@ -783,6 +783,8 @@
   #
   # @iops_sector_count: #optional an I/O size in sector (Since 1.6)
   #
  +# @throttling_percentage: #optional reflect throttling activity (Since 1.6)
  +#
   # Since: 0.14.0
   #
   # Notes: This interface is only found in @BlockInfo.
  @@ -797,7 +799,7 @@
   '*bps_threshold': 'int', '*bps_rd_threshold': 'int',
   '*bps_wr_threshold': 'int', '*iops_threshold': 'int',
   '*iops_rd_threshold': 'int', '*iops_wr_threshold': 'int',
  -'*iops_sector_count': 'int' } }
  +'*iops_sector_count': 'int', '*throttling_percentage': 'int' } 
  }
   
   ##
   # @BlockDeviceIoStatus:
  -- 
  1.7.10.4
  
  
 
 -- 
 Fam

Re: [Qemu-devel] [PATCH 13/18] blockdev: Rename I/O throttling options for QMP

2013-07-26 Thread Benoît Canet

 This patch will probably conflict with Benoît's work on leaky bucket
 throttling; can the two of you decide which one should go in first?  Are
 we trying to target both this series and leaky bucket throttling for 1.6?

I will to rebase my serie on top of this.

However if anyone has suggestions for the names of the new options the leaky
bucket serie add it would probably avoid an extra code review.

Best regards

Benoît

[Qemu-devel] [PATCH/FIX FOR 1.6] Repair the throttling code

2013-07-26 Thread Benoît Canet

The throttling code was segfaulting repair it for 1.6.

Benoît Canet (1):
  block: Repair the throttling code.

 block.c   |8 
 include/block/coroutine.h |9 +++--
 qemu-coroutine-lock.c |   20 ++--
 3 files changed, 29 insertions(+), 8 deletions(-)

-- 
1.7.10.4

[Qemu-devel] [PATCH/FIX FOR 1.6] block: Repair the throttling code.

2013-07-26 Thread Benoît Canet

The throttling code was segfaulting since commit
02ffb504485f0920cfc75a0982a602f824a9a4f4 because some qemu_co_queue_next caller
does not run in a coroutine.
qemu_co_queue_do_restart assume that the caller is a coroutinne.
As suggested by Stefan fix this by entering the coroutine directly.
Also make sure like suggested that qemu_co_queue_next() and
qemu_co_queue_restart_all() can be called only in coroutines.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block.c   |8 
 include/block/coroutine.h |9 +++--
 qemu-coroutine-lock.c |   20 ++--
 3 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/block.c b/block.c
index 6cd39fa..5a2240f 100644
--- a/block.c
+++ b/block.c
@@ -127,7 +127,8 @@ void bdrv_io_limits_disable(BlockDriverState *bs)
 {
 bs-io_limits_enabled = false;
 
-while (qemu_co_queue_next(bs-throttled_reqs));
+while (qemu_co_enter_next(bs-throttled_reqs)) {
+}
 
 if (bs-block_timer) {
 qemu_del_timer(bs-block_timer);
@@ -143,7 +144,7 @@ static void bdrv_block_timer(void *opaque)
 {
 BlockDriverState *bs = opaque;
 
-qemu_co_queue_next(bs-throttled_reqs);
+qemu_co_enter_next(bs-throttled_reqs);
 }
 
 void bdrv_io_limits_enable(BlockDriverState *bs)
@@ -1445,8 +1446,7 @@ void bdrv_drain_all(void)
  * a busy wait.
  */
 QTAILQ_FOREACH(bs, bdrv_states, list) {
-if (!qemu_co_queue_empty(bs-throttled_reqs)) {
-qemu_co_queue_restart_all(bs-throttled_reqs);
+while (qemu_co_enter_next(bs-throttled_reqs)) {
 busy = true;
 }
 }
diff --git a/include/block/coroutine.h b/include/block/coroutine.h
index 377805a..1f2db3e 100644
--- a/include/block/coroutine.h
+++ b/include/block/coroutine.h
@@ -130,12 +130,17 @@ void coroutine_fn qemu_co_queue_wait_insert_head(CoQueue 
*queue);
  *
  * Returns true if a coroutine was restarted, false if the queue is empty.
  */
-bool qemu_co_queue_next(CoQueue *queue);
+bool coroutine_fn qemu_co_queue_next(CoQueue *queue);
 
 /**
  * Restarts all coroutines in the CoQueue and leaves the queue empty.
  */
-void qemu_co_queue_restart_all(CoQueue *queue);
+void coroutine_fn qemu_co_queue_restart_all(CoQueue *queue);
+
+/**
+ * Enter the next coroutine in the queue
+ */
+bool qemu_co_enter_next(CoQueue *queue);
 
 /**
  * Checks if the CoQueue is empty.
diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
index d9fea49..aeb33b9 100644
--- a/qemu-coroutine-lock.c
+++ b/qemu-coroutine-lock.c
@@ -88,16 +88,32 @@ static bool qemu_co_queue_do_restart(CoQueue *queue, bool 
single)
 return true;
 }
 
-bool qemu_co_queue_next(CoQueue *queue)
+bool coroutine_fn qemu_co_queue_next(CoQueue *queue)
 {
+assert(qemu_in_coroutine());
 return qemu_co_queue_do_restart(queue, true);
 }
 
-void qemu_co_queue_restart_all(CoQueue *queue)
+void coroutine_fn qemu_co_queue_restart_all(CoQueue *queue)
 {
+assert(qemu_in_coroutine());
 qemu_co_queue_do_restart(queue, false);
 }
 
+bool qemu_co_enter_next(CoQueue *queue)
+{
+Coroutine *next;
+
+next = QTAILQ_FIRST(queue-entries);
+if (!next) {
+return false;
+}
+
+QTAILQ_REMOVE(queue-entries, next, co_queue_next);
+qemu_coroutine_enter(next, NULL);
+return true;
+}
+
 bool qemu_co_queue_empty(CoQueue *queue)
 {
 return (QTAILQ_FIRST(queue-entries) == NULL);
-- 
1.7.10.4

[Qemu-devel] [PATCH V4/FIX FOR 1.6] repair throttling code

2013-07-26 Thread Benoît Canet

since v3:

silence bogus checkpatch warning [abligh]

Benoît Canet (1):
  block: Repair the throttling code.

 block.c   |7 +++
 include/block/coroutine.h |9 +++--
 qemu-coroutine-lock.c |   20 ++--
 3 files changed, 28 insertions(+), 8 deletions(-)

-- 
1.7.10.4

[Qemu-devel] [PATCH V4/FIX FOR 1.6] block: Repair the throttling code.

2013-07-26 Thread Benoît Canet

The throttling code was segfaulting since commit
02ffb504485f0920cfc75a0982a602f824a9a4f4 because some qemu_co_queue_next caller
does not run in a coroutine.
qemu_co_queue_do_restart assume that the caller is a coroutinne.
As suggested by Stefan fix this by entering the coroutine directly.
Also make sure like suggested that qemu_co_queue_next() and
qemu_co_queue_restart_all() can be called only in coroutines.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block.c   |7 +++
 include/block/coroutine.h |9 +++--
 qemu-coroutine-lock.c |   20 ++--
 3 files changed, 28 insertions(+), 8 deletions(-)

diff --git a/block.c b/block.c
index 6cd39fa..35d71a4 100644
--- a/block.c
+++ b/block.c
@@ -127,7 +127,7 @@ void bdrv_io_limits_disable(BlockDriverState *bs)
 {
 bs-io_limits_enabled = false;
 
-while (qemu_co_queue_next(bs-throttled_reqs));
+do {} while (qemu_co_enter_next(bs-throttled_reqs));
 
 if (bs-block_timer) {
 qemu_del_timer(bs-block_timer);
@@ -143,7 +143,7 @@ static void bdrv_block_timer(void *opaque)
 {
 BlockDriverState *bs = opaque;
 
-qemu_co_queue_next(bs-throttled_reqs);
+qemu_co_enter_next(bs-throttled_reqs);
 }
 
 void bdrv_io_limits_enable(BlockDriverState *bs)
@@ -1445,8 +1445,7 @@ void bdrv_drain_all(void)
  * a busy wait.
  */
 QTAILQ_FOREACH(bs, bdrv_states, list) {
-if (!qemu_co_queue_empty(bs-throttled_reqs)) {
-qemu_co_queue_restart_all(bs-throttled_reqs);
+while (qemu_co_enter_next(bs-throttled_reqs)) {
 busy = true;
 }
 }
diff --git a/include/block/coroutine.h b/include/block/coroutine.h
index 377805a..1f2db3e 100644
--- a/include/block/coroutine.h
+++ b/include/block/coroutine.h
@@ -130,12 +130,17 @@ void coroutine_fn qemu_co_queue_wait_insert_head(CoQueue 
*queue);
  *
  * Returns true if a coroutine was restarted, false if the queue is empty.
  */
-bool qemu_co_queue_next(CoQueue *queue);
+bool coroutine_fn qemu_co_queue_next(CoQueue *queue);
 
 /**
  * Restarts all coroutines in the CoQueue and leaves the queue empty.
  */
-void qemu_co_queue_restart_all(CoQueue *queue);
+void coroutine_fn qemu_co_queue_restart_all(CoQueue *queue);
+
+/**
+ * Enter the next coroutine in the queue
+ */
+bool qemu_co_enter_next(CoQueue *queue);
 
 /**
  * Checks if the CoQueue is empty.
diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
index d9fea49..aeb33b9 100644
--- a/qemu-coroutine-lock.c
+++ b/qemu-coroutine-lock.c
@@ -88,16 +88,32 @@ static bool qemu_co_queue_do_restart(CoQueue *queue, bool 
single)
 return true;
 }
 
-bool qemu_co_queue_next(CoQueue *queue)
+bool coroutine_fn qemu_co_queue_next(CoQueue *queue)
 {
+assert(qemu_in_coroutine());
 return qemu_co_queue_do_restart(queue, true);
 }
 
-void qemu_co_queue_restart_all(CoQueue *queue)
+void coroutine_fn qemu_co_queue_restart_all(CoQueue *queue)
 {
+assert(qemu_in_coroutine());
 qemu_co_queue_do_restart(queue, false);
 }
 
+bool qemu_co_enter_next(CoQueue *queue)
+{
+Coroutine *next;
+
+next = QTAILQ_FIRST(queue-entries);
+if (!next) {
+return false;
+}
+
+QTAILQ_REMOVE(queue-entries, next, co_queue_next);
+qemu_coroutine_enter(next, NULL);
+return true;
+}
+
 bool qemu_co_queue_empty(CoQueue *queue)
 {
 return (QTAILQ_FIRST(queue-entries) == NULL);
-- 
1.7.10.4

Re: [Qemu-devel] [PATCH V3 for-1.6 3/5] block: Add support for throttling burst threshold in QMP and the command line.

2013-07-26 Thread Benoît Canet

 Kevin's series renamed these to have a dash in the name, and also moved
 all the throttling parameters into a sub-struct.  Does it make more
 sense to have just '*throttling' with that sub-struct containing 12
 parameters, 6 for limits and 6 for thresholds, or would it be better to
 have '*throttling' with 6 members for limits, as well as
 '*throttling-threshold' with the other 6 members?  Naming-wise,
 throttling.bps-total and throttling-threshold.bps-total convey as much
 information as throttling.bps-total and throttling.bps-total-threshold.

In fact my series add up to 13 parameters.
The last one is iops_sector_count so maybe I'll go the big sub-struct way.

Best regards

Benoît

Re: [Qemu-devel] VFIO and scheduled SR-IOV cards

2013-07-28 Thread Benoît Canet

 Confused.
 You have one VF accessing BAR of another VF?
 Why?

The VFs are scheduled and the board have only one microcontroller responsible
to give the read results for some memory mapped registers.
So it could respond the value of the active VF when a bar of an inactive VF is
read.

I handed over the contract as it's seemed too tricky.

Best regards

Benoît

Re: [Qemu-devel] QCOW2 cryptography and secure key handling

2013-07-29 Thread Benoît Canet

Le Monday 29 Jul 2013 à 12:32:35 (+0100), Daniel P. Berrange a écrit :
 On Mon, Jul 29, 2013 at 01:25:24PM +0200, Kevin Wolf wrote:
  Am 29.07.2013 um 13:21 hat Markus Armbruster geschrieben:
   Paolo Bonzini pbonz...@redhat.com writes:
   
Il 23/07/2013 17:57, Daniel P. Berrange ha scritto:
On Tue, Jul 23, 2013 at 05:38:00PM +0200, Kevin Wolf wrote:
Am 23.07.2013 um 17:22 hat Stefan Hajnoczi geschrieben:
On Tue, Jul 23, 2013 at 04:40:34PM +0200, Benoît Canet wrote:
More generally, QCow2's current encryption support is woefully 
inadequate
from a design POV. If we wanted better encryption built-in to QEMU 
it is
best to just deprecate the current encryption support and define a 
new
qcow2 extension based around something like the LUKS data format. 
Using
the LUKS data format precisely would be good from a data 
portability
POV, since then you can easily switch your images between LUKS 
encrypted
block device  qcow2-with-luks image file, without needing to 
re-encrypt
the data.
   
I read the LUKS specification and undestood enough part of it to
understand the
potentials benefits (stronger encryption key, multiple user keys,
possibility to
change users keys).
   
Kevin  Stefan: What do you think about implementing LUKS in QCOW2 ?
   
Using standard or proven approachs in crypto is a good thing.
   
I think the question is how much of a standard approach you take and
what sense it makes in the context where it's used. The actual
encryption algorithm is standard, as far as I can tell, but some 
people
have repeatedly been arguing that it still results in bad crypto. Are
they right? I don't know, I know too little of this stuff.

One reason that QCow2 is bad, despite using a standard algorithm, is
that the user passphrase is directly used encrypt/decrypt the data.
Thus a weak passphrase leads to weak data encryption. With the LUKS
format, the passphrase is only used to unlock the master key, which
is cryptographically strong. LUKS applies multiple rounds of hashing
to the user passphrase based on the speed of the machine CPUs, to
make it less practical to brute force weak user passphrases and thus
recover the master key.
   
Another reason that QCow2 is bad is that disk encryption is Complicated.
 Even if you do not do any horrible mistakes such as using ECB
encryption, a disk encrypted sector-by-sector has a lot of small
separate cyphertexts in it and is susceptible to a special range of 
attacks.
   
For example, current qcow2 encryption is vulnerable to a watermarking
attack.
http://en.wikipedia.org/wiki/Disk_encryption_theory#Cipher-block_chaining_.28CBC.29
   
   Fine example of why the we use a standard, strong cypher (AES),
   therefore our crypto must be good argument is about as convincing as I
   built this sandcastle from the finest quartz sand, so it must be
   strong.
   
   Crypto should be done by trained professionals[*].
   
   [...]
   
   
   [*] I studied crypto deeply enough to know I'm not.
  
  The point is, how do you know that you end up with good crypto when you
  add LUKS-like features? You still use them in a different context, and
  that may or may not break it. I can't really say.
 
 If we're not sufficiently confident in what we're doing, then we ought to
 find suitable people to advise us / review what we'd propose. I know Red
 Hat has people on its security team who we might be able to get to review
 any proposals in this area, if we wanted further crypto advise. If we went
 with an approach of incorporating LUKS, then we should also connect with
 the dm-crypt maintainers  / LUKS designers to ask them to review what we're
 proposing to do.

http://www.spinics.net/lists/dm-crypt/msg05277.html

Best regards

Benoît

 
 Regards,
 Daniel
 -- 
 |: http://berrange.com  -o-http://www.flickr.com/photos/dberrange/ :|
 |: http://libvirt.org  -o- http://virt-manager.org :|
 |: http://autobuild.org   -o- http://search.cpan.org/~danberr/ :|
 |: http://entangle-photo.org   -o-   http://live.gnome.org/gtk-vnc :|

Re: [Qemu-devel] QCOW2 cryptography and secure key handling

2013-07-31 Thread Benoît Canet

 Crypto should be done by trained professionals[*].

I agree I feel uneasy to write cryptographic code.

Best regards

Benoît

Re: [Qemu-devel] QCOW2 cryptography and secure key handling

2013-07-31 Thread Benoît Canet

 For example, current qcow2 encryption is vulnerable to a watermarking
 attack.
 http://en.wikipedia.org/wiki/Disk_encryption_theory#Cipher-block_chaining_.28CBC.29

void qcow2_encrypt_sectors(BDRVQcowState *s, int64_t sector_num,
   uint8_t *out_buf, const uint8_t *in_buf,
   int nb_sectors, int enc,
   const AES_KEY *key)
{
union {
uint64_t ll[2];
uint8_t b[16];
} ivec;
int i;

for(i = 0; i  nb_sectors; i++) {
ivec.ll[0] = cpu_to_le64(sector_num);
ivec.ll[1] = 0;
AES_cbc_encrypt(in_buf, out_buf, 512, key,
ivec.b, enc);
sector_num++;
in_buf += 512;
out_buf += 512;
}
}

CBC mode would imply that each sector would be crypted by combining the
plaintext with the previous sector.
It's does not look to be the case as the IV is reset to sector_num for each
sector.
It look like CTR mode.

Best regards

Benoît


 dm-crypt or other disk encryption programs use more complicated schemes,
 do we need to go there?

 Paolo

[Qemu-devel] [RFC V3 1/2] throttle: Add a new throttling API implementing continuus leaky bucket.

2013-08-02 Thread Benoît Canet

Implement the continuous leaky bucket algorithm devised on IRC as a separate
module.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 include/qemu/throttle.h |  111 
 util/Makefile.objs  |1 +
 util/throttle.c |  436 +++
 3 files changed, 548 insertions(+)
 create mode 100644 include/qemu/throttle.h
 create mode 100644 util/throttle.c

diff --git a/include/qemu/throttle.h b/include/qemu/throttle.h
new file mode 100644
index 000..328c782
--- /dev/null
+++ b/include/qemu/throttle.h
@@ -0,0 +1,111 @@
+/*
+ * QEMU throttling infrastructure
+ *
+ * Copyright (C) Nodalink, SARL. 2013
+ *
+ * Author:
+ *   Benoît Canet benoit.ca...@irqsave.net
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 or
+ * (at your option) version 3 of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see http://www.gnu.org/licenses/.
+ */
+
+#ifndef THROTTLING_H
+#define THROTTLING_H
+
+#include stdint.h
+#include qemu-common.h
+#include qemu/timer.h
+
+#define NANOSECONDS_PER_SECOND  10.0
+
+#define BUCKETS_COUNT 6
+
+typedef enum {
+THROTTLE_BPS_TOTAL = 0,
+THROTTLE_BPS_READ  = 1,
+THROTTLE_BPS_WRITE = 2,
+THROTTLE_OPS_TOTAL = 3,
+THROTTLE_OPS_READ  = 4,
+THROTTLE_OPS_WRITE = 5,
+} BucketType;
+
+typedef struct LeakyBucket {
+int64_t ups;   /* units per second */
+int64_t max;   /* leaky bucket max in units */
+double  bucket;/* bucket in units */
+int64_t previous_leak; /* timestamp of the last leak done */
+} LeakyBucket;
+
+/* The following structure is used to configure a ThrottleState
+ * It contains a bit of state: the bucket field of the LeakyBucket structure.
+ * However it allows to keep the code clean and the bucket field is reset to
+ * zero at the right time.
+ */
+typedef struct ThrottleConfig {
+LeakyBucket buckets[6];/* leaky buckets */
+int64_t unit_size; /* size of an unit in bytes */
+int64_t op_size;   /* size of an operation in units */
+} ThrottleConfig;
+
+typedef struct ThrottleState {
+ThrottleConfig cfg;
+bool timer_is_throttling_write; /* is the timer throttling a write */
+QEMUTimer *timer;/* timer used to do the throttling */
+QEMUClock *clock;/* the clock used */
+} ThrottleState;
+
+/* following 3 function exposed for tests */
+bool throttle_do_start(ThrottleState *ts,
+   bool is_write,
+   int64_t size,
+   int64_t now,
+   int64_t *next_timer);
+
+bool throttle_do_end(ThrottleState *ts,
+ bool is_write,
+ int64_t now,
+ int64_t *next_timer);
+
+bool throttle_do_timer(ThrottleState *ts,
+   bool is_write,
+   int64_t now,
+   int64_t *next_timer);
+
+/* user API functions */
+void throttle_init(ThrottleState *ts,
+   QEMUClock *clock,
+   void (timer)(void *),
+   void *timer_opaque);
+
+void throttle_destroy(ThrottleState *ts);
+
+bool throttle_start(ThrottleState *ts, bool is_write, int64_t size);
+
+void throttle_end(ThrottleState *ts, bool is_write);
+
+void throttle_timer(ThrottleState *ts, int64_t now, bool *must_wait);
+
+void throttle_config(ThrottleState *ts, ThrottleConfig *cfg);
+
+void throttle_get_config(ThrottleState *ts, ThrottleConfig *cfg);
+
+bool throttle_enabled(ThrottleConfig *cfg);
+
+bool throttle_conflicting(ThrottleConfig *cfg);
+
+bool throttle_is_valid(ThrottleConfig *cfg);
+
+bool throttle_have_timer(ThrottleState *ts);
+
+#endif
diff --git a/util/Makefile.objs b/util/Makefile.objs
index dc72ab0..2bb13a2 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -11,3 +11,4 @@ util-obj-y += iov.o aes.o qemu-config.o qemu-sockets.o uri.o 
notify.o
 util-obj-y += qemu-option.o qemu-progress.o
 util-obj-y += hexdump.o
 util-obj-y += crc32c.o
+util-obj-y += throttle.o
diff --git a/util/throttle.c b/util/throttle.c
new file mode 100644
index 000..4afc407
--- /dev/null
+++ b/util/throttle.c
@@ -0,0 +1,436 @@
+/*
+ * QEMU throttling infrastructure
+ *
+ * Copyright (C) Nodalink, SARL. 2013
+ *
+ * Author:
+ *   Benoît Canet benoit.ca...@irqsave.net
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version

[Qemu-devel] [RFC V3 2/2] block: Enable the new throttling code in the block layer.

2013-08-02 Thread Benoît Canet

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block.c   |  316 +++--
 block/qapi.c  |   21 ++-
 blockdev.c|  115 +
 include/block/block.h |1 -
 include/block/block_int.h |   33 +
 5 files changed, 150 insertions(+), 336 deletions(-)

diff --git a/block.c b/block.c
index 5a2240f..efa64fa 100644
--- a/block.c
+++ b/block.c
@@ -86,13 +86,6 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque);
 static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
 int64_t sector_num, int nb_sectors);
 
-static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
-bool is_write, double elapsed_time, uint64_t *wait);
-static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
-double elapsed_time, uint64_t *wait);
-static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
-bool is_write, int64_t *wait);
-
 static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
 QTAILQ_HEAD_INITIALIZER(bdrv_states);
 
@@ -123,55 +116,68 @@ int is_windows_drive(const char *filename)
 #endif
 
 /* throttling disk I/O limits */
-void bdrv_io_limits_disable(BlockDriverState *bs)
+void bdrv_set_io_limits(BlockDriverState *bs,
+ThrottleConfig *cfg)
 {
-bs-io_limits_enabled = false;
+throttle_config(bs-throttle_state, cfg);
+}
 
-while (qemu_co_enter_next(bs-throttled_reqs)) {
-}
+static bool bdrv_drain_throttled(BlockDriverState *bs)
+{
+bool drained = false;
+int i;
 
-if (bs-block_timer) {
-qemu_del_timer(bs-block_timer);
-qemu_free_timer(bs-block_timer);
-bs-block_timer = NULL;
+for (i = 0; i  2; i++) {
+while (qemu_co_enter_next(bs-throttled_reqs[i]) {
+drained = true;
+}
 }
 
-bs-slice_start = 0;
-bs-slice_end   = 0;
+return drained;
 }
 
-static void bdrv_block_timer(void *opaque)
+void bdrv_io_limits_disable(BlockDriverState *bs)
 {
-BlockDriverState *bs = opaque;
+bs-io_limits_enabled = false;
+
+bdrv_drain_throttled(bs);
 
-qemu_co_enter_next(bs-throttled_reqs);
+throttle_destroy(bs-throttle_state);
 }
 
-void bdrv_io_limits_enable(BlockDriverState *bs)
+static void bdrv_throttle_timer_cb(void *opaque)
 {
-qemu_co_queue_init(bs-throttled_reqs);
-bs-block_timer = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
-bs-io_limits_enabled = true;
+BlockDriverState *bs = opaque;
+int now = qemu_get_clock_ns(vm_clock);
+bool must_wait[2]; /* does throttled reads or writes must wait */
+int i;
+
+throttle_timer(bs-throttle_state, now, must_wait);
+
+/* execute not throttled requests */
+for (i = 0; i  2; i++)  {
+if (must_wait[i]) {
+continue;
+}
+qemu_co_enter_next(bs-throttled_reqs[i]);
+}
 }
 
-bool bdrv_io_limits_enabled(BlockDriverState *bs)
+/* should be called before bdrv_set_io_limits if a limit is set */
+void bdrv_io_limits_enable(BlockDriverState *bs)
 {
-BlockIOLimit *io_limits = bs-io_limits;
-return io_limits-bps[BLOCK_IO_LIMIT_READ]
- || io_limits-bps[BLOCK_IO_LIMIT_WRITE]
- || io_limits-bps[BLOCK_IO_LIMIT_TOTAL]
- || io_limits-iops[BLOCK_IO_LIMIT_READ]
- || io_limits-iops[BLOCK_IO_LIMIT_WRITE]
- || io_limits-iops[BLOCK_IO_LIMIT_TOTAL];
+throttle_init(bs-throttle_state, vm_clock, bdrv_throttle_timer_cb, bs);
+qemu_co_queue_init(bs-throttled_reqs[0]);
+qemu_co_queue_init(bs-throttled_reqs[1]);
+bs-io_limits_enabled = true;
 }
 
 static void bdrv_io_limits_intercept(BlockDriverState *bs,
- bool is_write, int nb_sectors)
+ bool is_write,
+ int nb_sectors)
 {
-int64_t wait_time = -1;
-
-if (!qemu_co_queue_empty(bs-throttled_reqs)) {
-qemu_co_queue_wait(bs-throttled_reqs);
+if (!qemu_co_queue_empty(bs-throttled_reqs[is_write])) {
+qemu_co_queue_wait(bs-throttled_reqs[is_write]);
 }
 
 /* In fact, we hope to keep each request's timing, in FIFO mode. The next
@@ -181,13 +187,11 @@ static void bdrv_io_limits_intercept(BlockDriverState *bs,
  * be still in throttled_reqs queue.
  */
 
-while (bdrv_exceed_io_limits(bs, nb_sectors, is_write, wait_time)) {
-qemu_mod_timer(bs-block_timer,
-   wait_time + qemu_get_clock_ns(vm_clock));
-qemu_co_queue_wait_insert_head(bs-throttled_reqs);
+while (throttle_start(bs-throttle_state, is_write, nb_sectors)) {
+qemu_co_queue_wait_insert_head(bs-throttled_reqs[is_write]);
 }
 
-qemu_co_queue_next(bs-throttled_reqs);
+qemu_co_queue_next(bs-throttled_reqs[is_write]);
 }
 
 /* check if the path starts with protocol: */
@@ -1106,11 +1110,6 @@ int bdrv_open(BlockDriverState *bs, const char 
*filename, QDict

[Qemu-devel] [RFC V3 0/2] continuous leaky bucket throttling

2013-08-02 Thread Benoît Canet

This patchset implement continous leaky bucket throttling.

It works mostly on the general case.
The exception is where the load is composed of both reads and writes and two
limits iops_rd and iops_wr are set.
The resulting iops are a little above half of the given limits.
I tried various strategies to avoid this: two timer, two throttled request
queues or even a different algorithm using a priority queue.
The problem is still the same in every version of the code: reads and writes
operation seems entangled.

Benoît Canet (2):
  throttle: Add a new throttling API implementing continuus leaky
bucket.
  block: Enable the new throttling code in the block layer.

 block.c   |  316 
 block/qapi.c  |   21 +--
 blockdev.c|  115 ++--
 include/block/block.h |1 -
 include/block/block_int.h |   33 +---
 include/qemu/throttle.h   |  111 
 util/Makefile.objs|1 +
 util/throttle.c   |  436 +
 8 files changed, 698 insertions(+), 336 deletions(-)
 create mode 100644 include/qemu/throttle.h
 create mode 100644 util/throttle.c

-- 
1.7.10.4

Re: [Qemu-devel] [PATCH] KVM: always use MADV_DONTFORK

2013-08-06 Thread Benoît Canet

Le Thursday 25 Jul 2013 à 12:11:15 (+0200), Andrea Arcangeli a écrit :
 MADV_DONTFORK prevents fork to fail with -ENOMEM if the default
 overcommit heuristics decides there's too much anonymous virtual
 memory allocated. If the KVM secondary MMU is synchronized with MMU
 notifiers or not, doesn't make a difference in that regard.
 
 Secondly it's always more efficient to avoid copying the guest
 physical address space in the fork child (so we avoid to mark all the
 guest memory readonly in the parent and so we skip the establishment
 and teardown of lots of pagetables in the child).
 
 In the common case we can ignore the error if MADV_DONTFORK is not
 available. Leave a second invocation that errors out in the KVM path
 if MMU notifiers are missing and KVM is enabled, to abort in such
 case.
 
 Signed-off-by: Andrea Arcangeli aarca...@redhat.com
 ---
  exec.c | 1 +
  1 file changed, 1 insertion(+)
 
 diff --git a/exec.c b/exec.c
 index c99a883..d3bb58d 100644
 --- a/exec.c
 +++ b/exec.c
 @@ -1162,6 +1162,7 @@ ram_addr_t qemu_ram_alloc_from_ptr(ram_addr_t size, 
 void *host,
  
  qemu_ram_setup_dump(new_block-host, size);
  qemu_madvise(new_block-host, size, QEMU_MADV_HUGEPAGE);
 +qemu_madvise(new_block-host, size, QEMU_MADV_DONTFORK);
  
  if (kvm_enabled())
  kvm_setup_guest_memory(new_block-host, size);
 

This patch solve a bug where the network down script of a regular tap interface
is not executed because fork fail when a pci-assigment is done on a large guest.

Tested-By: Benoit Canet ben...@irqsave.net

Re: [Qemu-devel] [RFC V3 0/2] continuous leaky bucket throttling

2013-08-07 Thread Benoît Canet

 I saw more discussion on IRC.  Does this mean you will send another
 revision to address outstanding issues?
 
 Just wanted to check if you are waiting for code review or if you are
 still developing the next patch revision.

I am currently finishing to write unit tests for the next patch revision.
Once unit tests are done I will rebase the previous QMP throttling patches on
top of the new code then post the new series.

I think that the result will be a net improvement.

Thanks for asking

Best regards

Benoît

Re: [Qemu-devel] [PATCH V3 for-1.6 3/5] block: Add support for throttling burst threshold in QMP and the command line.

2013-08-08 Thread Benoît Canet

 Kevin's series renamed these to have a dash in the name, and also moved
 all the throttling parameters into a sub-struct.  Does it make more
 sense to have just '*throttling' with that sub-struct containing 12
 parameters, 6 for limits and 6 for thresholds, or would it be better to
 have '*throttling' with 6 members for limits, as well as
 '*throttling-threshold' with the other 6 members?  Naming-wise,
 throttling.bps-total and throttling-threshold.bps-total convey as much
 information as throttling.bps-total and throttling.bps-total-threshold.

Eric  Kevin:

Should compatible old style values be added for the new throttling options in
order to avoid a mixed style mess in qemu-options.hx ?

Best regards

Benoît

[Qemu-devel] [PATCH V4 0/5] Continuous Leaky Bucket Throttling

2013-08-08 Thread Benoît Canet

This patchset implement continous leaky bucket throttling.

It use two requests queue to enable to do silly unbalanced throttling like
block_set_io_throttle 0 0 0 0 6000 1

It use two timer to get the timer callbacks and the throttle.c code simple

It add unit tests.

It fail with the following error message at exit and I don't know why yet.
qemu-system-x86_64: block.c:1489: bdrv_drain_all: Assertion
`((bs-tracked_requests)-lh_first == ((void *)0))' failed.

The throttling core is pretty solid and the surrouding of the patchset needs
polish. (new options ...)

since previous version:
wrap qemu-option.hx declararation [Eric]
continuus - continuous [Fam]
unit test [Paolo]

Benoît Canet (5):
  throttle: Add a new throttling API implementing continuous leaky
bucket.
  throttle: Add units tests
  block: Enable the new throttling code in the block layer.
  block: Add support for throttling burst max in QMP and the command
line.
  block: Add iops_sector_count to do the iops accounting for a given io
size.

 block.c   |  351 ++--
 block/qapi.c  |   50 +++--
 blockdev.c|  207 ++-
 hmp.c |   36 +++-
 include/block/block.h |1 -
 include/block/block_int.h |   32 +--
 include/qemu/throttle.h   |  105 ++
 qapi-schema.json  |   40 +++-
 qemu-options.hx   |4 +-
 qmp-commands.hx   |   34 +++-
 tests/Makefile|2 +
 tests/test-throttle.c |  494 +
 util/Makefile.objs|1 +
 util/throttle.c   |  391 +++
 14 files changed, 1405 insertions(+), 343 deletions(-)
 create mode 100644 include/qemu/throttle.h
 create mode 100644 tests/test-throttle.c
 create mode 100644 util/throttle.c

-- 
1.7.10.4

[Qemu-devel] [PATCH V4 1/5] throttle: Add a new throttling API implementing continuous leaky bucket.

2013-08-08 Thread Benoît Canet

Implement the continuous leaky bucket algorithm devised on IRC as a separate
module.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 include/qemu/throttle.h |  105 +
 util/Makefile.objs  |1 +
 util/throttle.c |  391 +++
 3 files changed, 497 insertions(+)
 create mode 100644 include/qemu/throttle.h
 create mode 100644 util/throttle.c

diff --git a/include/qemu/throttle.h b/include/qemu/throttle.h
new file mode 100644
index 000..e03bc3e
--- /dev/null
+++ b/include/qemu/throttle.h
@@ -0,0 +1,105 @@
+/*
+ * QEMU throttling infrastructure
+ *
+ * Copyright (C) Nodalink, SARL. 2013
+ *
+ * Author:
+ *   Benoît Canet benoit.ca...@irqsave.net
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 or
+ * (at your option) version 3 of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, see http://www.gnu.org/licenses/.
+ */
+
+#ifndef THROTTLING_H
+#define THROTTLING_H
+
+#include stdint.h
+#include qemu-common.h
+#include qemu/timer.h
+
+#define NANOSECONDS_PER_SECOND  10.0
+
+#define BUCKETS_COUNT 6
+
+typedef enum {
+THROTTLE_BPS_TOTAL = 0,
+THROTTLE_BPS_READ  = 1,
+THROTTLE_BPS_WRITE = 2,
+THROTTLE_OPS_TOTAL = 3,
+THROTTLE_OPS_READ  = 4,
+THROTTLE_OPS_WRITE = 5,
+} BucketType;
+
+typedef struct LeakyBucket {
+double  ups;/* units per second */
+double  max;/* leaky bucket max in units */
+double  bucket; /* bucket in units */
+} LeakyBucket;
+
+/* The following structure is used to configure a ThrottleState
+ * It contains a bit of state: the bucket field of the LeakyBucket structure.
+ * However it allows to keep the code clean and the bucket field is reset to
+ * zero at the right time.
+ */
+typedef struct ThrottleConfig {
+LeakyBucket buckets[6]; /* leaky buckets */
+uint64_t unit_size; /* size of an unit in bytes */
+uint64_t op_size;   /* size of an operation in units */
+} ThrottleConfig;
+
+typedef struct ThrottleState {
+ThrottleConfig cfg; /* configuration */
+int64_t previous_leak;  /* timestamp of the last leak done */
+QEMUTimer * timers[2];  /* timers used to do the throttling */
+QEMUClock *clock;   /* the clock used */
+} ThrottleState;
+
+/* operations on single leaky buckets */
+void throttle_leak_bucket(LeakyBucket *bkt, int64_t delta);
+
+int64_t throttle_compute_wait(LeakyBucket *bkt);
+
+/* expose timer computation function for unit tests */
+bool throttle_compute_timer(ThrottleState *ts,
+bool is_write,
+int64_t now,
+int64_t *next_timer);
+
+/* init/destroy cycle */
+void throttle_init(ThrottleState *ts,
+   QEMUClock *clock,
+   void (read_timer)(void *),
+   void (write_timer)(void *),
+   void *timer_opaque);
+
+void throttle_destroy(ThrottleState *ts);
+
+bool throttle_have_timer(ThrottleState *ts);
+
+/* configuration */
+bool throttle_enabled(ThrottleConfig *cfg);
+
+bool throttle_conflicting(ThrottleConfig *cfg);
+
+bool throttle_is_valid(ThrottleConfig *cfg);
+
+void throttle_config(ThrottleState *ts, ThrottleConfig *cfg);
+
+void throttle_get_config(ThrottleState *ts, ThrottleConfig *cfg);
+
+/* usage */
+bool throttle_allowed(ThrottleState *ts, bool is_write);
+
+void throttle_account(ThrottleState *ts, bool is_write, uint64_t size);
+
+#endif
diff --git a/util/Makefile.objs b/util/Makefile.objs
index dc72ab0..2bb13a2 100644
--- a/util/Makefile.objs
+++ b/util/Makefile.objs
@@ -11,3 +11,4 @@ util-obj-y += iov.o aes.o qemu-config.o qemu-sockets.o uri.o 
notify.o
 util-obj-y += qemu-option.o qemu-progress.o
 util-obj-y += hexdump.o
 util-obj-y += crc32c.o
+util-obj-y += throttle.o
diff --git a/util/throttle.c b/util/throttle.c
new file mode 100644
index 000..2f25d44
--- /dev/null
+++ b/util/throttle.c
@@ -0,0 +1,391 @@
+/*
+ * QEMU throttling infrastructure
+ *
+ * Copyright (C) Nodalink, SARL. 2013
+ *
+ * Author:
+ *   Benoît Canet benoit.ca...@irqsave.net
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License as
+ * published by the Free Software Foundation; either version 2 or
+ * (at your option) version 3 of the License.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY

[Qemu-devel] [PATCH V4 3/5] block: Enable the new throttling code in the block layer.

2013-08-08 Thread Benoît Canet

tip: Do not ever use the cfg scheduler in the guest with this code.
 It gives incorrect throttling.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block.c   |  351 ++---
 block/qapi.c  |   21 ++-
 blockdev.c|  100 +++--
 include/block/block.h |1 -
 include/block/block_int.h |   32 +
 5 files changed, 173 insertions(+), 332 deletions(-)

diff --git a/block.c b/block.c
index 01b66d8..92f0aa1 100644
--- a/block.c
+++ b/block.c
@@ -86,13 +86,6 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque);
 static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
 int64_t sector_num, int nb_sectors);
 
-static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
-bool is_write, double elapsed_time, uint64_t *wait);
-static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
-double elapsed_time, uint64_t *wait);
-static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
-bool is_write, int64_t *wait);
-
 static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
 QTAILQ_HEAD_INITIALIZER(bdrv_states);
 
@@ -123,70 +116,110 @@ int is_windows_drive(const char *filename)
 #endif
 
 /* throttling disk I/O limits */
-void bdrv_io_limits_disable(BlockDriverState *bs)
+void bdrv_set_io_limits(BlockDriverState *bs,
+ThrottleConfig *cfg)
 {
-bs-io_limits_enabled = false;
+int i;
+
+throttle_config(bs-throttle_state, cfg);
+
+for (i = 0; i  2; i++) {
+qemu_co_enter_next(bs-throttled_reqs[i]);
+}
+}
+
+/* this function drain all the throttled IOs */
+static bool bdrv_drain_throttled(BlockDriverState *bs)
+{
+bool drained = false;
+bool enabled = bs-io_limits_enabled;
+int i;
 
-do {} while (qemu_co_enter_next(bs-throttled_reqs));
+bs-io_limits_enabled = false;
 
-if (bs-block_timer) {
-qemu_del_timer(bs-block_timer);
-qemu_free_timer(bs-block_timer);
-bs-block_timer = NULL;
+for (i = 0; i  2; i++) {
+while (qemu_co_enter_next(bs-throttled_reqs[i])) {
+drained = true;
+}
 }
 
-bs-slice_start = 0;
-bs-slice_end   = 0;
+bs-io_limits_enabled = enabled;
+
+return drained;
 }
 
-static void bdrv_block_timer(void *opaque)
+void bdrv_io_limits_disable(BlockDriverState *bs)
+{
+bs-io_limits_enabled = false;
+
+bdrv_drain_throttled(bs);
+
+throttle_destroy(bs-throttle_state);
+}
+
+static void bdrv_throttle_read_timer_cb(void *opaque)
 {
 BlockDriverState *bs = opaque;
+qemu_co_enter_next(bs-throttled_reqs[0]);
+}
 
-qemu_co_enter_next(bs-throttled_reqs);
+static void bdrv_throttle_write_timer_cb(void *opaque)
+{
+BlockDriverState *bs = opaque;
+qemu_co_enter_next(bs-throttled_reqs[1]);
 }
 
+/* should be called before bdrv_set_io_limits if a limit is set */
 void bdrv_io_limits_enable(BlockDriverState *bs)
 {
-qemu_co_queue_init(bs-throttled_reqs);
-bs-block_timer = qemu_new_timer_ns(vm_clock, bdrv_block_timer, bs);
+throttle_init(bs-throttle_state,
+  vm_clock,
+  bdrv_throttle_read_timer_cb,
+  bdrv_throttle_write_timer_cb,
+  bs);
+qemu_co_queue_init(bs-throttled_reqs[0]);
+qemu_co_queue_init(bs-throttled_reqs[1]);
 bs-io_limits_enabled = true;
 }
 
-bool bdrv_io_limits_enabled(BlockDriverState *bs)
+/* This function make an IO wait if needed
+ *
+ * @is_write:   is the IO a write
+ * @nb_sectors: the number of sectors of the IO
+ */
+static void bdrv_io_limits_intercept(BlockDriverState *bs,
+ bool is_write,
+ int nb_sectors)
 {
-BlockIOLimit *io_limits = bs-io_limits;
-return io_limits-bps[BLOCK_IO_LIMIT_READ]
- || io_limits-bps[BLOCK_IO_LIMIT_WRITE]
- || io_limits-bps[BLOCK_IO_LIMIT_TOTAL]
- || io_limits-iops[BLOCK_IO_LIMIT_READ]
- || io_limits-iops[BLOCK_IO_LIMIT_WRITE]
- || io_limits-iops[BLOCK_IO_LIMIT_TOTAL];
+/* does this io must wait */
+bool must_wait = !throttle_allowed(bs-throttle_state, is_write);
+
+/* if must wait or any request of this type throttled queue the IO */
+if (must_wait ||
+!qemu_co_queue_empty(bs-throttled_reqs[is_write])) {
+qemu_co_queue_wait(bs-throttled_reqs[is_write]);
+}
+
+/* the IO will be executed do the accounting */
+throttle_account(bs-throttle_state, is_write, nb_sectors);
 }
 
-static void bdrv_io_limits_intercept(BlockDriverState *bs,
- bool is_write, int nb_sectors)
+/* This function will schedule the next IO throttled IO of this type if needed
+ *
+ * @is_write: is the IO a write
+ */
+static void bdrv_io_limits_resched(BlockDriverState *bs, bool is_write)
 {
-int64_t wait_time = -1;
 
-if

[Qemu-devel] [PATCH V4 2/5] throttle: Add units tests

2013-08-08 Thread Benoît Canet

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 tests/Makefile|2 +
 tests/test-throttle.c |  494 +
 2 files changed, 496 insertions(+)
 create mode 100644 tests/test-throttle.c

diff --git a/tests/Makefile b/tests/Makefile
index d044908..fb1e06a 100644
--- a/tests/Makefile
+++ b/tests/Makefile
@@ -29,6 +29,7 @@ check-unit-y += tests/test-visitor-serialization$(EXESUF)
 check-unit-y += tests/test-iov$(EXESUF)
 gcov-files-test-iov-y = util/iov.c
 check-unit-y += tests/test-aio$(EXESUF)
+check-unit-y += tests/test-throttle$(EXESUF)
 gcov-files-test-aio-$(CONFIG_WIN32) = aio-win32.c
 gcov-files-test-aio-$(CONFIG_POSIX) = aio-posix.c
 check-unit-y += tests/test-thread-pool$(EXESUF)
@@ -116,6 +117,7 @@ tests/check-qfloat$(EXESUF): tests/check-qfloat.o 
libqemuutil.a
 tests/check-qjson$(EXESUF): tests/check-qjson.o libqemuutil.a libqemustub.a
 tests/test-coroutine$(EXESUF): tests/test-coroutine.o $(block-obj-y) 
libqemuutil.a libqemustub.a
 tests/test-aio$(EXESUF): tests/test-aio.o $(block-obj-y) libqemuutil.a 
libqemustub.a
+tests/test-throttle$(EXESUF): tests/test-throttle.o $(block-obj-y) 
libqemuutil.a libqemustub.a
 tests/test-thread-pool$(EXESUF): tests/test-thread-pool.o $(block-obj-y) 
libqemuutil.a libqemustub.a
 tests/test-iov$(EXESUF): tests/test-iov.o libqemuutil.a
 tests/test-hbitmap$(EXESUF): tests/test-hbitmap.o libqemuutil.a libqemustub.a
diff --git a/tests/test-throttle.c b/tests/test-throttle.c
new file mode 100644
index 000..c687f5f
--- /dev/null
+++ b/tests/test-throttle.c
@@ -0,0 +1,494 @@
+/*
+ * Throttle infrastructure tests
+ *
+ * Copyright Nodalink, SARL. 2013
+ *
+ * Authors:
+ *  Benoît Canet benoit.ca...@irqsave.net
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2 or later.
+ * See the COPYING.LIB file in the top-level directory.
+ */
+
+#include glib.h
+#include math.h
+#include qemu/throttle.h
+
+LeakyBucketbkt;
+ThrottleConfig cfg;
+ThrottleState  ts;
+
+/* usefull function */
+static bool double_cmp(double x, double y)
+{
+return fabsl(x - y)  1e-6;
+}
+
+/* tests for single bucket operations */
+static void test_leak_bucket(void)
+{
+/* set initial value */
+bkt.ups = 150;
+bkt.max = 15;
+bkt.bucket = 1.5;
+
+/* leak an op work of time */
+throttle_leak_bucket(bkt, NANOSECONDS_PER_SECOND / 150);
+g_assert(bkt.ups == 150);
+g_assert(bkt.max == 15);
+g_assert(double_cmp(bkt.bucket, 0.5));
+
+/* leak again emptying the bucket */
+throttle_leak_bucket(bkt, NANOSECONDS_PER_SECOND / 150);
+g_assert(bkt.ups == 150);
+g_assert(bkt.max == 15);
+g_assert(double_cmp(bkt.bucket, 0));
+
+/* check that the bucket level won't go lower */
+throttle_leak_bucket(bkt, NANOSECONDS_PER_SECOND / 150);
+g_assert(bkt.ups == 150);
+g_assert(bkt.max == 15);
+g_assert(double_cmp(bkt.bucket, 0));
+}
+
+static void test_compute_wait(void)
+{
+int64_t wait;
+int64_t result;
+
+/* no operation limit set */
+bkt.ups = 0;
+bkt.max = 15;
+bkt.bucket = 1.5;
+wait = throttle_compute_wait(bkt);
+g_assert(!wait);
+
+/* zero delta */
+bkt.ups = 150;
+bkt.max = 15;
+bkt.bucket = 15;
+wait = throttle_compute_wait(bkt);
+g_assert(!wait);
+
+/* below zero delta */
+bkt.ups = 150;
+bkt.max = 15;
+bkt.bucket = 9;
+wait = throttle_compute_wait(bkt);
+g_assert(!wait);
+
+/* half an operation above max */
+bkt.ups = 150;
+bkt.max = 15;
+bkt.bucket = 15.5;
+wait = throttle_compute_wait(bkt);
+/* time required to do half an operation */
+result = (int64_t)  NANOSECONDS_PER_SECOND / 150 / 2;
+g_assert(wait == result);
+}
+
+/* functions to test ThrottleState initialization/destroy methods */
+static void read_timer_cb(void *opaque)
+{
+}
+
+static void write_timer_cb(void *opaque)
+{
+}
+
+static void test_init(void)
+{
+int i;
+
+/* fill the structure with crap */
+memset(ts, 1, sizeof(ts));
+
+/* init the structure */
+throttle_init(ts, vm_clock, read_timer_cb, write_timer_cb, ts);
+
+/* check initialized fields */
+g_assert(ts.clock == vm_clock);
+g_assert(ts.timers[0]);
+g_assert(ts.timers[1]);
+
+/* check other fields where cleared */
+g_assert(!ts.previous_leak);
+g_assert(!ts.cfg.op_size);
+g_assert(!ts.cfg.unit_size);
+for (i = 0; i  BUCKETS_COUNT; i++) {
+g_assert(!ts.cfg.buckets[i].ups);
+g_assert(!ts.cfg.buckets[i].max);
+g_assert(!ts.cfg.buckets[i].bucket);
+}
+
+throttle_destroy(ts);
+}
+
+static void test_destroy(void)
+{
+int i;
+throttle_init(ts, vm_clock, read_timer_cb, write_timer_cb, ts);
+throttle_destroy(ts);
+for (i = 0; i  2; i++) {
+g_assert(!ts.timers[i]);
+}
+}
+
+/* function to test throttle_config and throttle_get_config */
+static void test_config_functions(void)
+{
+int i;
+ThrottleConfig

[Qemu-devel] [PATCH V4 4/5] block: Add support for throttling burst max in QMP and the command line.

2013-08-08 Thread Benoît Canet

The max parameter of the leaky bycket throttling algoritm can be used to
allow the guest to do bursts.
The max value is a pool of I/O that the guest can use without being throttled
at all. Throttling is triggered once this pool is empty.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block/qapi.c |   26 
 blockdev.c   |  119 ++
 hmp.c|   32 +--
 qapi-schema.json |   34 +++-
 qemu-options.hx  |4 +-
 qmp-commands.hx  |   30 --
 6 files changed, 220 insertions(+), 25 deletions(-)

diff --git a/block/qapi.c b/block/qapi.c
index 45f806b..5ba10f4 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -232,6 +232,32 @@ void bdrv_query_info(BlockDriverState *bs,
 info-inserted-iops= cfg.buckets[THROTTLE_OPS_TOTAL].ups;
 info-inserted-iops_rd = cfg.buckets[THROTTLE_OPS_READ].ups;
 info-inserted-iops_wr = cfg.buckets[THROTTLE_OPS_WRITE].ups;
+
+info-inserted-has_bps_max =
+cfg.buckets[THROTTLE_BPS_TOTAL].max;
+info-inserted-bps_max =
+cfg.buckets[THROTTLE_BPS_TOTAL].max;
+info-inserted-has_bps_rd_max  =
+cfg.buckets[THROTTLE_BPS_READ].max;
+info-inserted-bps_rd_max  =
+cfg.buckets[THROTTLE_BPS_READ].max;
+info-inserted-has_bps_wr_max  =
+cfg.buckets[THROTTLE_BPS_WRITE].max;
+info-inserted-bps_wr_max  =
+cfg.buckets[THROTTLE_BPS_WRITE].max;
+
+info-inserted-has_iops_max=
+cfg.buckets[THROTTLE_OPS_TOTAL].max;
+info-inserted-iops_max=
+cfg.buckets[THROTTLE_OPS_TOTAL].max;
+info-inserted-has_iops_rd_max =
+cfg.buckets[THROTTLE_OPS_READ].max;
+info-inserted-iops_rd_max =
+cfg.buckets[THROTTLE_OPS_READ].max;
+info-inserted-has_iops_wr_max =
+cfg.buckets[THROTTLE_OPS_WRITE].max;
+info-inserted-iops_wr_max =
+cfg.buckets[THROTTLE_OPS_WRITE].max;
 }
 
 bs0 = bs;
diff --git a/blockdev.c b/blockdev.c
index 7559ea5..22016a2 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -486,13 +486,18 @@ static DriveInfo *blockdev_init(QemuOpts *all_opts,
 cfg.buckets[THROTTLE_OPS_WRITE].ups =
 qemu_opt_get_number(opts, throttling.iops-write, 0);
 
-cfg.buckets[THROTTLE_BPS_TOTAL].max = 0;
-cfg.buckets[THROTTLE_BPS_READ].max  = 0;
-cfg.buckets[THROTTLE_BPS_WRITE].max = 0;
-
-cfg.buckets[THROTTLE_OPS_TOTAL].max = 0;
-cfg.buckets[THROTTLE_OPS_READ].max  = 0;
-cfg.buckets[THROTTLE_OPS_WRITE].max = 0;
+cfg.buckets[THROTTLE_BPS_TOTAL].max =
+qemu_opt_get_number(opts, throttling.bps-total-max, 0);
+cfg.buckets[THROTTLE_BPS_READ].max  =
+qemu_opt_get_number(opts, throttling.bps-read-max, 0);
+cfg.buckets[THROTTLE_BPS_WRITE].max =
+qemu_opt_get_number(opts, throttling.bps-write-max, 0);
+cfg.buckets[THROTTLE_OPS_TOTAL].max =
+qemu_opt_get_number(opts, throttling.iops-total-max, 0);
+cfg.buckets[THROTTLE_OPS_READ].max =
+qemu_opt_get_number(opts, throttling.iops-read-max, 0);
+cfg.buckets[THROTTLE_OPS_WRITE].max =
+qemu_opt_get_number(opts, throttling.iops-write-max, 0);
 
 cfg.unit_size = BDRV_SECTOR_SIZE;
 cfg.op_size = 0;
@@ -773,6 +778,14 @@ DriveInfo *drive_init(QemuOpts *all_opts, 
BlockInterfaceType block_default_type)
 qemu_opt_rename(all_opts, bps_rd, throttling.bps-read);
 qemu_opt_rename(all_opts, bps_wr, throttling.bps-write);
 
+qemu_opt_rename(all_opts, iops_max, throttling.iops-total-max);
+qemu_opt_rename(all_opts, iops_rd_max, throttling.iops-read-max);
+qemu_opt_rename(all_opts, iops_wr_max, throttling.iops-write-max);
+
+qemu_opt_rename(all_opts, bps_max, throttling.bps-total-max);
+qemu_opt_rename(all_opts, bps_rd_max, throttling.bps-read-max);
+qemu_opt_rename(all_opts, bps_wr_max, throttling.bps-write-max);
+
 qemu_opt_rename(all_opts, readonly, read-only);
 
 value = qemu_opt_get(all_opts, cache);
@@ -1257,8 +1270,22 @@ void qmp_change_blockdev(const char *device, const char 
*filename,
 
 /* throttling disk I/O limits */
 void qmp_block_set_io_throttle(const char *device, int64_t bps, int64_t bps_rd,
-   int64_t bps_wr, int64_t iops, int64_t iops_rd,
-   int64_t iops_wr, Error **errp)
+   int64_t bps_wr,
+   int64_t iops,
+   int64_t iops_rd,
+   int64_t iops_wr,
+   bool has_bps_max,
+   int64_t bps_max,
+   bool has_bps_rd_max,
+   int64_t bps_rd_max,
+

[Qemu-devel] [PATCH V4 5/5] block: Add iops_sector_count to do the iops accounting for a given io size.

2013-08-08 Thread Benoît Canet

This feature can be used in case where users are avoiding the iops limit by
doing jumbo I/Os hammering the storage backend.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block/qapi.c |3 +++
 blockdev.c   |   22 +++---
 hmp.c|8 ++--
 qapi-schema.json |   10 --
 qemu-options.hx  |2 +-
 qmp-commands.hx  |8 ++--
 6 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/block/qapi.c b/block/qapi.c
index 5ba10f4..f98ff64 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -258,6 +258,9 @@ void bdrv_query_info(BlockDriverState *bs,
 cfg.buckets[THROTTLE_OPS_WRITE].max;
 info-inserted-iops_wr_max =
 cfg.buckets[THROTTLE_OPS_WRITE].max;
+
+info-inserted-has_iops_sector_count = cfg.op_size;
+info-inserted-iops_sector_count = cfg.op_size;
 }
 
 bs0 = bs;
diff --git a/blockdev.c b/blockdev.c
index 22016a2..e2b0ee0 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -500,7 +500,7 @@ static DriveInfo *blockdev_init(QemuOpts *all_opts,
 qemu_opt_get_number(opts, throttling.iops-write-max, 0);
 
 cfg.unit_size = BDRV_SECTOR_SIZE;
-cfg.op_size = 0;
+cfg.op_size = qemu_opt_get_number(opts, throttling.iops-sector-count, 0);
 
 if (!check_throttle_config(cfg, error)) {
 error_report(%s, error_get_pretty(error));
@@ -786,6 +786,9 @@ DriveInfo *drive_init(QemuOpts *all_opts, 
BlockInterfaceType block_default_type)
 qemu_opt_rename(all_opts, bps_rd_max, throttling.bps-read-max);
 qemu_opt_rename(all_opts, bps_wr_max, throttling.bps-write-max);
 
+qemu_opt_rename(all_opts,
+iops_sector_count, throttling.iops-sector-count);
+
 qemu_opt_rename(all_opts, readonly, read-only);
 
 value = qemu_opt_get(all_opts, cache);
@@ -1285,7 +1288,9 @@ void qmp_block_set_io_throttle(const char *device, 
int64_t bps, int64_t bps_rd,
bool has_iops_rd_max,
int64_t iops_rd_max,
bool has_iops_wr_max,
-   int64_t iops_wr_max, Error **errp)
+   int64_t iops_wr_max,
+   bool has_iops_sector_count,
+   int64_t iops_sector_count, Error **errp)
 {
 ThrottleConfig cfg;
 BlockDriverState *bs;
@@ -1325,7 +1330,10 @@ void qmp_block_set_io_throttle(const char *device, 
int64_t bps, int64_t bps_rd,
 }
 
 cfg.unit_size = BDRV_SECTOR_SIZE;
-cfg.op_size = 0;
+
+if (has_iops_sector_count) {
+cfg.op_size = iops_sector_count;
+}
 
 if (!check_throttle_config(cfg, errp)) {
 return;
@@ -2051,6 +2059,10 @@ QemuOptsList qemu_common_drive_opts = {
 .type = QEMU_OPT_NUMBER,
 .help = total bytes write burst,
 },{
+.name = throttling.iops-sector-count,
+.type = QEMU_OPT_NUMBER,
+.help = when limiting by iops max size of an I/O in sector,
+},{
 .name = copy-on-read,
 .type = QEMU_OPT_BOOL,
 .help = copy read data from backing file into image file,
@@ -2197,6 +2209,10 @@ QemuOptsList qemu_old_drive_opts = {
 .type = QEMU_OPT_NUMBER,
 .help = total bytes write burst,
 },{
+.name = iops_sector_count,
+.type = QEMU_OPT_NUMBER,
+.help = when limiting by iops max size of an I/O in sector,
+},{
 .name = copy-on-read,
 .type = QEMU_OPT_BOOL,
 .help = copy read data from backing file into image file,
diff --git a/hmp.c b/hmp.c
index e0c387c..01f7685 100644
--- a/hmp.c
+++ b/hmp.c
@@ -351,7 +351,8 @@ void hmp_info_block(Monitor *mon, const QDict *qdict)
  iops_wr=% PRId64
  iops_max=% PRId64
  iops_rd_max=% PRId64
- iops_wr_max=% PRId64 \n,
+ iops_wr_max=% PRId64
+ iops_sector_count=% PRId64 \n,
 info-value-inserted-bps,
 info-value-inserted-bps_rd,
 info-value-inserted-bps_wr,
@@ -363,7 +364,8 @@ void hmp_info_block(Monitor *mon, const QDict *qdict)
 info-value-inserted-iops_wr,
 info-value-inserted-iops_max,
 info-value-inserted-iops_rd_max,
-info-value-inserted-iops_wr_max);
+info-value-inserted-iops_wr_max,
+info-value-inserted-iops_sector_count);
 } else {
 monitor_printf(mon,  [not inserted]);
 }
@@ -1124,6 +1126,8 @@ void hmp_block_set_io_throttle(Monitor *mon, const QDict 
*qdict)
   false,

Re: [Qemu-devel] [PATCH V4 3/5] block: Enable the new throttling code in the block layer.o

2013-08-08 Thread Benoît Canet

 This is not really accurate; the cfq scheduler reorders reads and writes
 to have longer bursts, and these sometimes exceed the rate you set.  I
 understood this is mostly for separate rd/wr settings, or did you
 reproduce it with total rates too?

I'll drop the comment.

It's only with read/write settings and not with total.

 
 Also, it would be better to have a workaround for this.  Perhaps we
 could simply make the default value of max nonzero?  In the old
 throttling code the slice time is 0.1s, so perhaps we could see what
 happens with max=0.1*avg.

I already set max to 0.1*avg if needed in the code.

Re: [Qemu-devel] [PATCH V4 3/5] block: Enable the new throttling code in the block layer.

2013-08-08 Thread Benoît Canet

 Also, it would be better to have a workaround for this.  Perhaps we
 could simply make the default value of max nonzero?  In the old
 throttling code the slice time is 0.1s, so perhaps we could see what
 happens with max=0.1*avg.

This gives correct results with cfg with some little auto compensated 
oscillations.

I will definitelly drop the bogus comment.

Best regards

Benoît

Re: [Qemu-devel] [PATCH V4 0/5] Continuous Leaky Bucket Throttling*

2013-08-09 Thread Benoît Canet

 It fail with the following error message at exit and I don't know why yet.
 qemu-system-x86_64: block.c:1489: bdrv_drain_all: Assertion
 `((bs-tracked_requests)-lh_first == ((void *)0))' failed.

I solved this issue: bdrv_drain_all was bogus.

Best regards

Benoît

 
  block.c   |  351 ++--
  block/qapi.c  |   50 +++--
  blockdev.c|  207 ++-
  hmp.c |   36 +++-
  include/block/block.h |1 -
  include/block/block_int.h |   32 +--
  include/qemu/throttle.h   |  105 ++
  qapi-schema.json  |   40 +++-
  qemu-options.hx   |4 +-
  qmp-commands.hx   |   34 +++-
  tests/Makefile|2 +
  tests/test-throttle.c |  494 
 +
  util/Makefile.objs|1 +
  util/throttle.c   |  391 +++
  14 files changed, 1405 insertions(+), 343 deletions(-)
  create mode 100644 include/qemu/throttle.h
  create mode 100644 tests/test-throttle.c
  create mode 100644 util/throttle.c
 
 -- 
 1.7.10.4

Re: [Qemu-devel] [PATCH] qemu-iotests: Filter out actual image size in 067

2013-11-02 Thread Benoît Canet

Le Saturday 02 Nov 2013 à 14:52:11 (+0100), Max Reitz a écrit :
 The actual size of the image file may differ depending on the Linux
 kernel currently running on the host. Filtering out this value makes
 this test pass in such cases.
 
 Signed-off-by: Max Reitz mre...@redhat.com
 ---
  tests/qemu-iotests/067 |  2 +-
  tests/qemu-iotests/067.out | 10 +-
  2 files changed, 6 insertions(+), 6 deletions(-)
 
 diff --git a/tests/qemu-iotests/067 b/tests/qemu-iotests/067
 index 79dc38b..d025192 100755
 --- a/tests/qemu-iotests/067
 +++ b/tests/qemu-iotests/067
 @@ -45,7 +45,7 @@ function do_run_qemu()
  
  function run_qemu()
  {
 -do_run_qemu $@ 21 | _filter_testdir | _filter_qmp
 +do_run_qemu $@ 21 | _filter_testdir | _filter_qmp | sed -e 
 's/\(actual-size:\s*\)[0-9]\+/\1SIZE/g'
  }
  
  size=128M
 diff --git a/tests/qemu-iotests/067.out b/tests/qemu-iotests/067.out
 index 4bb9ff9..8d271cc 100644
 --- a/tests/qemu-iotests/067.out
 +++ b/tests/qemu-iotests/067.out
 @@ -6,7 +6,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728
  Testing: -drive file=TEST_DIR/t.qcow2,format=qcow2,if=none,id=disk -device 
 virtio-blk-pci,drive=disk,id=virtio0
  QMP_VERSION
  {return: {}}
 -{return: [{io-status: ok, device: disk, locked: false, 
 removable: false, inserted: {iops_rd: 0, image: {virtual-size: 
 134217728, filename: TEST_DIR/t.qcow2, cluster-size: 65536, format: 
 qcow2, actual-size: 139264, format-specific: {type: qcow2, data: 
 {compat: 1.1, lazy-refcounts: false}}, dirty-flag: false}, iops_wr: 
 0, ro: false, backing_file_depth: 0, drv: qcow2, iops: 0, bps_wr: 
 0, encrypted: false, bps: 0, bps_rd: 0, file: TEST_DIR/t.qcow2, 
 encryption_key_missing: false}, type: unknown}, {io-status: ok, 
 device: ide1-cd0, locked: false, removable: true, tray_open: false, 
 type: unknown}, {device: floppy0, locked: false, removable: true, 
 tray_open: false, type: unknown}, {device: sd0, locked: false, 
 removable: true, tray_open: false, type: unknown}]}
 +{return: [{io-status: ok, device: disk, locked: false, 
 removable: false, inserted: {iops_rd: 0, image: {virtual-size: 
 134217728, filename: TEST_DIR/t.qcow2, cluster-size: 65536, format: 
 qcow2, actual-size: SIZE, format-specific: {type: qcow2, data: 
 {compat: 1.1, lazy-refcounts: false}}, dirty-flag: false}, iops_wr: 
 0, ro: false, backing_file_depth: 0, drv: qcow2, iops: 0, bps_wr: 
 0, encrypted: false, bps: 0, bps_rd: 0, file: TEST_DIR/t.qcow2, 
 encryption_key_missing: false}, type: unknown}, {io-status: ok, 
 device: ide1-cd0, locked: false, removable: true, tray_open: false, 
 type: unknown}, {device: floppy0, locked: false, removable: true, 
 tray_open: false, type: unknown}, {device: sd0, locked: false, 
 removable: true, tray_open: false, type: unknown}]}
  {return: {}}
  {return: {}}
  {timestamp: {seconds:  TIMESTAMP, microseconds:  TIMESTAMP}, event: 
 DEVICE_DELETED, data: {path: 
 /machine/peripheral/virtio0/virtio-backend}}
 @@ -24,7 +24,7 @@ QMP_VERSION
  Testing: -drive file=TEST_DIR/t.qcow2,format=qcow2,if=none,id=disk
  QMP_VERSION
  {return: {}}
 -{return: [{device: disk, locked: false, removable: true, 
 inserted: {iops_rd: 0, image: {virtual-size: 134217728, filename: 
 TEST_DIR/t.qcow2, cluster-size: 65536, format: qcow2, actual-size: 
 139264, format-specific: {type: qcow2, data: {compat: 1.1, 
 lazy-refcounts: false}}, dirty-flag: false}, iops_wr: 0, ro: false, 
 backing_file_depth: 0, drv: qcow2, iops: 0, bps_wr: 0, encrypted: 
 false, bps: 0, bps_rd: 0, file: TEST_DIR/t.qcow2, 
 encryption_key_missing: false}, tray_open: false, type: unknown}, 
 {io-status: ok, device: ide1-cd0, locked: false, removable: true, 
 tray_open: false, type: unknown}, {device: floppy0, locked: 
 false, removable: true, tray_open: false, type: unknown}, {device: 
 sd0, locked: false, removable: true, tray_open: false, type: 
 unknown}]}
 +{return: [{device: disk, locked: false, removable: true, 
 inserted: {iops_rd: 0, image: {virtual-size: 134217728, filename: 
 TEST_DIR/t.qcow2, cluster-size: 65536, format: qcow2, actual-size: 
 SIZE, format-specific: {type: qcow2, data: {compat: 1.1, 
 lazy-refcounts: false}}, dirty-flag: false}, iops_wr: 0, ro: false, 
 backing_file_depth: 0, drv: qcow2, iops: 0, bps_wr: 0, encrypted: 
 false, bps: 0, bps_rd: 0, file: TEST_DIR/t.qcow2, 
 encryption_key_missing: false}, tray_open: false, type: unknown}, 
 {io-status: ok, device: ide1-cd0, locked: false, removable: true, 
 tray_open: false, type: unknown}, {device: floppy0, locked: 
 false, removable: true, tray_open: false, type: unknown}, {device: 
 sd0, locked: false, removable: true, tray_open: false, type: 
 unknown}]}
  {return: {}}
  {return: {}}
  {return: {}}
 @@ -44,7 +44,7 @@ Testing:
  QMP_VERSION
  {return: {}}
  {return: OK\r\n}
 -{return: [{io-status: ok, device: ide1-cd0, locked: false, 
 removable: true, tray_open: false, type: unknown}, {device: 
 floppy0, locked: false, removable: true, tray_open: false, type: 
 unknown},

Re: [Qemu-devel] [PATCH v2] block: per caller dirty bitmap

2013-11-04 Thread Benoît Canet

Le Monday 04 Nov 2013 à 17:30:10 (+0800), Fam Zheng a écrit :
 Previously a BlockDriverState has only one dirty bitmap, so only one
 caller (e.g. a block job) can keep track of writing. This changes the
 dirty bitmap to a list and creates a BdrvDirtyBitmap for each caller, the
 lifecycle is managed with these new functions:
 
 bdrv_create_dirty_bitmap
 bdrv_release_dirty_bitmap
 
 Where BdrvDirtyBitmap is a linked list wrapper structure of HBitmap.
 
 In place of bdrv_set_dirty_tracking, a BdrvDirtyBitmap pointer argument
 is added to these functions, since each caller has its own dirty bitmap:
 
 bdrv_get_dirty
 bdrv_dirty_iter_init
 bdrv_get_dirty_count
 
 bdrv_set_dirty and bdrv_reset_dirty prototypes are unchanged but will
 internally walk the list of all dirty bitmaps and set them one by one.
 
 Signed-off-by: Fam Zheng f...@redhat.com
 
 ---
 v2: Added BdrvDirtyBitmap [Paolo]
 
 Signed-off-by: Fam Zheng f...@redhat.com
 ---
  block-migration.c | 22 +
  block.c   | 81 
 ---
  block/mirror.c| 23 --
  block/qapi.c  |  8 -
  include/block/block.h | 11 ---
  include/block/block_int.h |  2 +-
  6 files changed, 85 insertions(+), 62 deletions(-)
 
 diff --git a/block-migration.c b/block-migration.c
 index daf9ec1..8f4e826 100644
 --- a/block-migration.c
 +++ b/block-migration.c
 @@ -58,6 +58,7 @@ typedef struct BlkMigDevState {
  /* Protected by block migration lock.  */
  unsigned long *aio_bitmap;
  int64_t completed_sectors;
 +BdrvDirtyBitmap *dirty_bitmap;
  } BlkMigDevState;
  
  typedef struct BlkMigBlock {
 @@ -309,12 +310,21 @@ static int mig_save_device_bulk(QEMUFile *f, 
 BlkMigDevState *bmds)
  
  /* Called with iothread lock taken.  */
  
 -static void set_dirty_tracking(int enable)
 +static void set_dirty_tracking(void)
  {
  BlkMigDevState *bmds;
  
  QSIMPLEQ_FOREACH(bmds, block_mig_state.bmds_list, entry) {
 -bdrv_set_dirty_tracking(bmds-bs, enable ? BLOCK_SIZE : 0);
 +bmds-dirty_bitmap = bdrv_create_dirty_bitmap(bmds-bs, BLOCK_SIZE);
 +}
 +}
 +
 +static void unset_dirty_tracking(void)
 +{
 +BlkMigDevState *bmds;
 +
 +QSIMPLEQ_FOREACH(bmds, block_mig_state.bmds_list, entry) {
 +bdrv_release_dirty_bitmap(bmds-bs, bmds-dirty_bitmap);
  }
  }
  
 @@ -432,7 +442,7 @@ static int mig_save_device_dirty(QEMUFile *f, 
 BlkMigDevState *bmds,
  } else {
  blk_mig_unlock();
  }
 -if (bdrv_get_dirty(bmds-bs, sector)) {
 +if (bdrv_get_dirty(bmds-bs, bmds-dirty_bitmap, sector)) {
  
  if (total_sectors - sector  BDRV_SECTORS_PER_DIRTY_CHUNK) {
  nr_sectors = total_sectors - sector;
 @@ -554,7 +564,7 @@ static int64_t get_remaining_dirty(void)
  int64_t dirty = 0;
  
  QSIMPLEQ_FOREACH(bmds, block_mig_state.bmds_list, entry) {
 -dirty += bdrv_get_dirty_count(bmds-bs);
 +dirty += bdrv_get_dirty_count(bmds-bs, bmds-dirty_bitmap);
  }
  
  return dirty  BDRV_SECTOR_BITS;
 @@ -569,7 +579,7 @@ static void blk_mig_cleanup(void)
  
  bdrv_drain_all();
  
 -set_dirty_tracking(0);
 +unset_dirty_tracking();
  
  blk_mig_lock();
  while ((bmds = QSIMPLEQ_FIRST(block_mig_state.bmds_list)) != NULL) {
 @@ -604,7 +614,7 @@ static int block_save_setup(QEMUFile *f, void *opaque)
  init_blk_migration(f);
  
  /* start track dirty blocks */
 -set_dirty_tracking(1);
 +set_dirty_tracking();
  qemu_mutex_unlock_iothread();
  
  ret = flush_blks(f);
 diff --git a/block.c b/block.c
 index 58efb5b..ef7a55f 100644
 --- a/block.c
 +++ b/block.c
 @@ -49,6 +49,11 @@
  #include windows.h
  #endif
  
 +typedef struct BdrvDirtyBitmap {
 +HBitmap *bitmap;
 +QLIST_ENTRY(BdrvDirtyBitmap) list;
 +} BdrvDirtyBitmap;
 +
  #define NOT_DONE 0x7fff /* used while emulated sync operation in 
 progress */
  
  typedef enum {
 @@ -323,6 +328,7 @@ BlockDriverState *bdrv_new(const char *device_name)
  BlockDriverState *bs;
  
  bs = g_malloc0(sizeof(BlockDriverState));
 +QLIST_INIT(bs-dirty_bitmaps);
  pstrcpy(bs-device_name, sizeof(bs-device_name), device_name);
  if (device_name[0] != '\0') {
  QTAILQ_INSERT_TAIL(bdrv_states, bs, list);
 @@ -1615,7 +1621,7 @@ static void bdrv_move_feature_fields(BlockDriverState 
 *bs_dest,
  bs_dest-iostatus   = bs_src-iostatus;
  
  /* dirty bitmap */
 -bs_dest-dirty_bitmap   = bs_src-dirty_bitmap;
 +bs_dest-dirty_bitmaps  = bs_src-dirty_bitmaps;
  
  /* reference count */
  bs_dest-refcnt = bs_src-refcnt;
 @@ -1648,7 +1654,7 @@ void bdrv_swap(BlockDriverState *bs_new, 
 BlockDriverState *bs_old)
  
  /* bs_new must be anonymous and shouldn't have anything fancy enabled */
  assert(bs_new-device_name[0] == '\0');
 -assert(bs_new-dirty_bitmap == NULL);
 +

Re: [Qemu-devel] [PATCH] block: Save errno before error_setg_errno

2013-11-05 Thread Benoît Canet

Le Tuesday 05 Nov 2013 à 20:03:33 (+0100), Max Reitz a écrit :
 error_setg_errno() may overwrite errno; therefore, its value should be
 read before calling that function and not afterwards.
 
 Signed-off-by: Max Reitz mre...@redhat.com
 ---
  block.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/block.c b/block.c
 index 58efb5b..0e96a22 100644
 --- a/block.c
 +++ b/block.c
 @@ -1084,8 +1084,8 @@ int bdrv_open(BlockDriverState *bs, const char 
 *filename, QDict *options,
  snprintf(backing_filename, sizeof(backing_filename),
   %s, filename);
  } else if (!realpath(filename, backing_filename)) {
 -error_setg_errno(errp, errno, Could not resolve path '%s', 
 filename);
  ret = -errno;
 +error_setg_errno(errp, errno, Could not resolve path '%s', 
 filename);
  goto fail;
  }
  
 -- 
 1.8.4.2
 

Reviewed-by: Benoit Canet ben...@irqsave.net

Re: [Qemu-devel] [PATCH] block: Round up total_sectors

2013-11-06 Thread Benoît Canet

Le Wednesday 06 Nov 2013 à 19:48:06 (+0800), Fam Zheng a écrit :
 Since b94a2610, bdrv_getlength() is omitted when probing image. VMDK
 monolithicFlat is broken by that because a file  512 bytes can't be
 read with its total_sectors truncated to 0. This patch round up the size
 to BDRV_SECTOR_SIZE, when a image size is not sector aligned.
 
 Signed-off-by: Fam Zheng f...@redhat.com
 ---
  block.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/block.c b/block.c
 index 58efb5b..f706634 100644
 --- a/block.c
 +++ b/block.c
 @@ -640,7 +640,7 @@ static int refresh_total_sectors(BlockDriverState *bs, 
 int64_t hint)
  if (length  0) {
  return length;
  }
 -hint = length  BDRV_SECTOR_BITS;
 +hint = DIV_ROUND_UP(length, BDRV_SECTOR_SIZE);
  }
  
  bs-total_sectors = hint;
 -- 
 1.8.3.1
 
 

Reviewed-by: Benoit Canet ben...@irqsave.net

Re: [Qemu-devel] [PATCH] block: Print its file name if backing file opening failed

2013-11-07 Thread Benoît Canet

Le Thursday 07 Nov 2013 à 15:34:52 (+0800), Fam Zheng a écrit :
 If backing file doesn't exist, the error message is confusing and
 misleading:
 
 $ qemu /tmp/a.qcow2
 qemu: could not open disk image /tmp/a.qcow2: Could not open file: No
 such file or directory
 
 But...
 
 $ ls /tmp/a.qcow2
 /tmp/a.qcow2
 
 $ qemu-img info /tmp/a.qcow2
 image: /tmp/a.qcow2
 file format: qcow2
 virtual size: 8.0G (8589934592 bytes)
 disk size: 196K
 cluster_size: 65536
 backing file: /tmp/b.qcow2
 
 Because...
 
 $ ls /tmp/b.qcow2
 ls: cannot access /tmp/b.qcow2: No such file or directory
 
 This is not intuitive. It's better to have the missing file's name in
 the error message. With this patch:
 
 $ qemu-io -c 'read 0 512' /tmp/a.qcow2
 qemu-io: can't open device /tmp/a.qcow2: Could not open backing
 file: Could not open '/stor/vm/arch.raw': No such file or directory
 no file open, try 'help open'
 
 Which is a little bit better.
 
 Signed-off-by: Fam Zheng f...@redhat.com
 ---
  block.c| 3 ++-
  block/raw-posix.c  | 1 -
  block/raw-win32.c  | 1 -
  tests/qemu-iotests/051.out | 2 +-
  tests/qemu-iotests/069.out | 2 +-
  5 files changed, 4 insertions(+), 5 deletions(-)
 
 diff --git a/block.c b/block.c
 index f706634..a8dbcfc 100644
 --- a/block.c
 +++ b/block.c
 @@ -1009,7 +1009,8 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict 
 *options, Error **errp)
  bdrv_unref(bs-backing_hd);
  bs-backing_hd = NULL;
  bs-open_flags |= BDRV_O_NO_BACKING;
 -error_propagate(errp, local_err);
 +error_setg(errp, Could not open backing file: %s,
 +   error_get_pretty(local_err));
  return ret;
  }
  pstrcpy(bs-backing_file, sizeof(bs-backing_file),
 diff --git a/block/raw-posix.c b/block/raw-posix.c
 index f6d48bb..e1b1ba2 100644
 --- a/block/raw-posix.c
 +++ b/block/raw-posix.c
 @@ -310,7 +310,6 @@ static int raw_open_common(BlockDriverState *bs, QDict 
 *options,
  if (ret == -EROFS) {
  ret = -EACCES;
  }
 -error_setg_errno(errp, -ret, Could not open file);
  goto fail;
  }
  s-fd = fd;
 diff --git a/block/raw-win32.c b/block/raw-win32.c
 index 2741e4d..2bad5a3 100644
 --- a/block/raw-win32.c
 +++ b/block/raw-win32.c
 @@ -280,7 +280,6 @@ static int raw_open(BlockDriverState *bs, QDict *options, 
 int flags,
  } else {
  ret = -EINVAL;
  }
 -error_setg_errno(errp, -ret, Could not open file);
  goto fail;
  }
  
 diff --git a/tests/qemu-iotests/051.out b/tests/qemu-iotests/051.out
 index 15deef6..d351935 100644
 --- a/tests/qemu-iotests/051.out
 +++ b/tests/qemu-iotests/051.out
 @@ -226,6 +226,6 @@ Testing: -drive file=foo:bar
  QEMU_PROG: -drive file=foo:bar: could not open disk image foo:bar: Unknown 
 protocol
  
  Testing: -drive file.filename=foo:bar
 -QEMU_PROG: -drive file.filename=foo:bar: could not open disk image ide0-hd0: 
 Could not open file: No such file or directory
 +QEMU_PROG: -drive file.filename=foo:bar: could not open disk image ide0-hd0: 
 Could not open 'foo:bar': No such file or directory
  
  *** done
 diff --git a/tests/qemu-iotests/069.out b/tests/qemu-iotests/069.out
 index 3648814..b48306d 100644
 --- a/tests/qemu-iotests/069.out
 +++ b/tests/qemu-iotests/069.out
 @@ -4,5 +4,5 @@ QA output created by 069
  
  Formatting 'TEST_DIR/t.IMGFMT.base', fmt=IMGFMT size=131072 
  Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=131072 
 backing_file='TEST_DIR/t.IMGFMT.base' 
 -qemu-io: can't open device TEST_DIR/t.IMGFMT: Could not open file: No such 
 file or directory
 +qemu-io: can't open device TEST_DIR/t.IMGFMT: Could not open backing file: 
 Could not open 'TEST_DIR/t.IMGFMT.base': No such file or directory
  *** done
 -- 
 1.8.3.1
 
 
Reviewed-by: Benoit Canet ben...@irqsave.net

[Qemu-devel] [PATCH 0/2] Giving names to graph's BlockDriverState

2013-11-07 Thread Benoît Canet

The following series introduce a new file.node-name property in order to be
able to give a name to each BlockDriverState of the graph.

It also define undefined as a special value for node-name; a value that will 
be
used to indicate to the management that it can not manipulate a node because it
was not named.

After this patchset is merged I would like to take care of presenting the graph
to the management. (HMP /|| QMP)

Eric: Do you have some ideas on this topic ?

Best regards

Benoît


Benoît Canet (2):
  block: Add bs-node_name to hold the name of a bs node of the bs
graph.
  block: Allow the user to define node-name option.

 block.c   | 72 +++
 block/blkverify.c |  2 +-
 block/iscsi.c |  2 +-
 block/vmdk.c  |  2 +-
 block/vvfat.c |  4 +--
 blockdev.c|  8 +++---
 hw/block/xen_disk.c   |  2 +-
 include/block/block.h |  3 +-
 include/block/block_int.h |  9 +-
 qemu-img.c|  6 ++--
 qemu-io.c |  2 +-
 qemu-nbd.c|  2 +-
 12 files changed, 79 insertions(+), 35 deletions(-)

-- 
1.8.3.2

[Qemu-devel] [PATCH 2/2] block: Allow the user to define node-name option.

2013-11-07 Thread Benoît Canet

As node-name is a separate name space as device-name we can enable it's
definition right now: nobody will use it so no harm involved.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index 230e71a..132981f 100644
--- a/block.c
+++ b/block.c
@@ -885,7 +885,8 @@ int bdrv_file_open(BlockDriverState **pbs, const char 
*filename,
 options = qdict_new();
 }
 
-bs = bdrv_new(, NULL);
+bs = bdrv_new(, qdict_get_try_str(options, node-name));
+qdict_del(options, node-name);
 bs-options = options;
 options = qdict_clone_shallow(options);
 
@@ -1007,7 +1008,8 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict 
*options, Error **errp)
sizeof(backing_filename));
 }
 
-bs-backing_hd = bdrv_new(, NULL);
+bs-backing_hd = bdrv_new(, qdict_get_try_str(options, node-name));
+qdict_del(options, node-name);
 
 if (bs-backing_format[0] != '\0') {
 back_drv = bdrv_find_format(bs-backing_format);
-- 
1.8.3.2

[Qemu-devel] [PATCH 1/2] block: Add bs-node_name to hold the name of a bs node of the bs graph.

2013-11-07 Thread Benoît Canet

Add the minimum of code to prepare the followings patches.

If no node_name is provided to bdrv_new the bs-node_name is set to undefined.
This will allow to have some default string to communicate in QMP and HMP.
This also make undefined a reserved string for bs-node_name.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block.c   | 70 +++
 block/blkverify.c |  2 +-
 block/iscsi.c |  2 +-
 block/vmdk.c  |  2 +-
 block/vvfat.c |  4 +--
 blockdev.c|  8 +++---
 hw/block/xen_disk.c   |  2 +-
 include/block/block.h |  3 +-
 include/block/block_int.h |  9 +-
 qemu-img.c|  6 ++--
 qemu-io.c |  2 +-
 qemu-nbd.c|  2 +-
 12 files changed, 77 insertions(+), 35 deletions(-)

diff --git a/block.c b/block.c
index fd05a80..230e71a 100644
--- a/block.c
+++ b/block.c
@@ -89,6 +89,9 @@ static int coroutine_fn 
bdrv_co_do_write_zeroes(BlockDriverState *bs,
 static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
 QTAILQ_HEAD_INITIALIZER(bdrv_states);
 
+static QTAILQ_HEAD(, BlockDriverState) graph_bdrv_states =
+QTAILQ_HEAD_INITIALIZER(graph_bdrv_states);
+
 static QLIST_HEAD(, BlockDriver) bdrv_drivers =
 QLIST_HEAD_INITIALIZER(bdrv_drivers);
 
@@ -318,14 +321,26 @@ void bdrv_register(BlockDriver *bdrv)
 }
 
 /* create a new block device (by default it is empty) */
-BlockDriverState *bdrv_new(const char *device_name)
+BlockDriverState *bdrv_new(const char *device_name, const char *node_name)
 {
 BlockDriverState *bs;
 
 bs = g_malloc0(sizeof(BlockDriverState));
 pstrcpy(bs-device_name, sizeof(bs-device_name), device_name);
 if (device_name[0] != '\0') {
-QTAILQ_INSERT_TAIL(bdrv_states, bs, list);
+QTAILQ_INSERT_TAIL(bdrv_states, bs, device_list);
+}
+/* if node name is given store it in bs and insert bs in the graph bs list
+ * note: undefined is a reserved node name
+ */
+if (node_name 
+node_name[0] != '\0' 
+strcmp(node_name, undefined)) {
+pstrcpy(bs-node_name, sizeof(bs-node_name), node_name);
+QTAILQ_INSERT_TAIL(graph_bdrv_states, bs, node_list);
+/* else set the bs node name to undefined for QMP and HMP */
+} else {
+sprintf(bs-node_name, undefined);
 }
 bdrv_iostatus_disable(bs);
 notifier_list_init(bs-close_notifiers);
@@ -870,7 +885,7 @@ int bdrv_file_open(BlockDriverState **pbs, const char 
*filename,
 options = qdict_new();
 }
 
-bs = bdrv_new();
+bs = bdrv_new(, NULL);
 bs-options = options;
 options = qdict_clone_shallow(options);
 
@@ -992,7 +1007,7 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict 
*options, Error **errp)
sizeof(backing_filename));
 }
 
-bs-backing_hd = bdrv_new();
+bs-backing_hd = bdrv_new(, NULL);
 
 if (bs-backing_format[0] != '\0') {
 back_drv = bdrv_find_format(bs-backing_format);
@@ -1062,7 +1077,7 @@ int bdrv_open(BlockDriverState *bs, const char *filename, 
QDict *options,
instead of opening 'filename' directly */
 
 /* if there is a backing file, use it */
-bs1 = bdrv_new();
+bs1 = bdrv_new(, NULL);
 ret = bdrv_open(bs1, filename, NULL, 0, drv, local_err);
 if (ret  0) {
 bdrv_unref(bs1);
@@ -1495,7 +1510,7 @@ void bdrv_close_all(void)
 {
 BlockDriverState *bs;
 
-QTAILQ_FOREACH(bs, bdrv_states, list) {
+QTAILQ_FOREACH(bs, bdrv_states, device_list) {
 bdrv_close(bs);
 }
 }
@@ -1524,7 +1539,7 @@ static bool bdrv_requests_pending(BlockDriverState *bs)
 static bool bdrv_requests_pending_all(void)
 {
 BlockDriverState *bs;
-QTAILQ_FOREACH(bs, bdrv_states, list) {
+QTAILQ_FOREACH(bs, bdrv_states, device_list) {
 if (bdrv_requests_pending(bs)) {
 return true;
 }
@@ -1554,7 +1569,7 @@ void bdrv_drain_all(void)
 /* FIXME: We do not have timer support here, so this is effectively
  * a busy wait.
  */
-QTAILQ_FOREACH(bs, bdrv_states, list) {
+QTAILQ_FOREACH(bs, bdrv_states, device_list) {
 if (bdrv_start_throttled_reqs(bs)) {
 busy = true;
 }
@@ -1570,7 +1585,7 @@ void bdrv_drain_all(void)
 void bdrv_make_anon(BlockDriverState *bs)
 {
 if (bs-device_name[0] != '\0') {
-QTAILQ_REMOVE(bdrv_states, bs, list);
+QTAILQ_REMOVE(bdrv_states, bs, device_list);
 }
 bs-device_name[0] = '\0';
 }
@@ -1626,7 +1641,12 @@ static void bdrv_move_feature_fields(BlockDriverState 
*bs_dest,
 /* keep the same entry in bdrv_states */
 pstrcpy(bs_dest-device_name, sizeof(bs_dest-device_name),
 bs_src-device_name);
-bs_dest-list = bs_src-list;
+bs_dest-device_list = bs_src-device_list;
+
+/* keep the same entry in graph_bdrv_states */
+

Re: [Qemu-devel] How to introduce bs-node_name ?

2013-11-07 Thread Benoît Canet

Le Monday 04 Nov 2013 à 19:33:21 (+0800), Fam Zheng a écrit :
 
 On 11/04/2013 07:06 PM, Kevin Wolf wrote:
 Am 04.11.2013 um 10:48 hat Fam Zheng geschrieben:
 On 11/04/2013 05:31 PM, Stefan Hajnoczi wrote:
 On Wed, Oct 30, 2013 at 02:49:32PM +0100, Markus Armbruster wrote:
 The first proposal is to add another parameter, say id.  Users can
 then refer either to an arbitrary BDS by id, or (for backward
 compatibility) to the root BDS by device.  When the code sees
 device, it'll look up the BB, then fetch its root BDS.
 
 CON: Existing parameter device becomes compatibility cruft.
 
 PRO: Clean and obvious semantics (in my opinion).
 This proposal gets my vote.
 
 The second proposal is to press the existing parameter device into
 service for referring to BDS node_name.
 
 To keep backward compatibility, we obviously need to ensure that
 whenever the old code accepts a value of device, the new code accepts
 it as well, and both resolve it to the same BDS.
 Different legacy commands given the same device name might need to
 operate on different nodes.
 Could you give an example for this?
 
 
 Dynamic renaming does not solve this
 problem, so I'm not convinced we can always choose a device name
 matching a node name.
 
 Device name commands are higher-level than graph node commands.  For
 example, block_set_io_throttle makes sense on a device but less sense on
 a graph node, unless we add the implicit assumption that the new
 throttling node is created on top of the given node or updated in place
 if the throttling node already exists (!!).
 Throttling a node could be useful too, for example if we want to
 throttle backing_hd which is on shared storage, but not to throttle
 on the local image.
 
 My ignorant question is: Why can't we just use one namespace, make
 sure no name collision between node_name and device_name, or even
 just drop device_name, so we treat the root node's node_name as
 device_name? For commands that only accept a device, this can be
 enforced in its implementation by checking against the whole graph
 to verify this.
 Markus described it somewhere in this thread: Live snapshots.
 Currently, the device_name moves to the new BDS on the top (and
 compatibility requires us to keep it that way), whereas a node name
 should, of course, stay at its node.
 
 When you consider this, the single namespace, as much as I would have
 loved it, is pretty much dead.

Good everyone agree on the direction to take.
I'll write some code.

Best regards

Benoît

 
 Thanks for explaining (again). I get the reason now.
 
 Fam

Re: [Qemu-devel] [PATCH] qapi-schema: Update description for NewImageMode

2013-11-07 Thread Benoît Canet

Le Thursday 07 Nov 2013 à 19:47:48 (+0100), Max Reitz a écrit :
 If the NewImageMode is absolute-paths but no backing file is available
 (e.g., when mirroring a device with an unbacked image), the target image
 will not be backed either. This patch updates the documentation in
 qapi-schema.json accordingly.
 
 Signed-off-by: Max Reitz mre...@redhat.com
 ---
 Follow-up to:
  - block/drive-mirror: Check for NULL backing_hd
 ---
  qapi-schema.json | 3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)
 
 diff --git a/qapi-schema.json b/qapi-schema.json
 index 81a375b..dde8e45 100644
 --- a/qapi-schema.json
 +++ b/qapi-schema.json
 @@ -1736,7 +1736,8 @@
  # @existing: QEMU should look for an existing image file.
  #
  # @absolute-paths: QEMU should create a new image with absolute paths
 -# for the backing file.
 +# for the backing file. If there is no backing file available, the new
 +# image will not be backed either.
  #
  # Since: 1.1
  ##
 -- 
 1.8.4.2
 
 
Reviewed-by: Benoit Canet ben...@irqsave.net

Re: [Qemu-devel] [PATCH] util/error: Save errno from clobbering

2013-11-07 Thread Benoît Canet

Le Thursday 07 Nov 2013 à 20:10:29 (+0100), Max Reitz a écrit :
 There may be calls to error_setg() and especially error_setg_errno()
 which blindly (and until now wrongly) assume these functions not to
 clobber errno (e.g., they pass errno to error_setg_errno() and return
 -errno afterwards). Instead of trying to find and fix all of these
 constructs, just make sure error_setg() and error_setg_errno() indeed do
 not clobber errno.
 
 Suggested-by: Eric Blake ebl...@redhat.com
 Signed-off-by: Max Reitz mre...@redhat.com
 ---
  util/error.c | 6 ++
  1 file changed, 6 insertions(+)
 
 diff --git a/util/error.c b/util/error.c
 index ec0faa6..3ee362a 100644
 --- a/util/error.c
 +++ b/util/error.c
 @@ -27,6 +27,7 @@ void error_set(Error **errp, ErrorClass err_class, const 
 char *fmt, ...)
  {
  Error *err;
  va_list ap;
 +int saved_errno = errno;
  
  if (errp == NULL) {
  return;
 @@ -41,6 +42,8 @@ void error_set(Error **errp, ErrorClass err_class, const 
 char *fmt, ...)
  err-err_class = err_class;
  
  *errp = err;
 +
 +errno = saved_errno;
  }
  
  void error_set_errno(Error **errp, int os_errno, ErrorClass err_class,
 @@ -49,6 +52,7 @@ void error_set_errno(Error **errp, int os_errno, ErrorClass 
 err_class,
  Error *err;
  char *msg1;
  va_list ap;
 +int saved_errno = errno;
  
  if (errp == NULL) {
  return;
 @@ -69,6 +73,8 @@ void error_set_errno(Error **errp, int os_errno, ErrorClass 
 err_class,
  err-err_class = err_class;
  
  *errp = err;
 +
 +errno = saved_errno;
  }
  
  void error_setg_file_open(Error **errp, int os_errno, const char *filename)
 -- 
 1.8.4.2
 
 

Yes this look better than trying to fix all callers.

Reviewed-by: Benoit Canet ben...@irqsave.net

Re: [Qemu-devel] [PATCH 1/2] block: Add bs-node_name to hold the name of a bs node of the bs graph.

2013-11-07 Thread Benoît Canet

Le Thursday 07 Nov 2013 à 13:23:43 (-0700), Eric Blake a écrit :
 On 11/07/2013 08:01 AM, Benoît Canet wrote:
  Add the minimum of code to prepare the followings patches.
  
  If no node_name is provided to bdrv_new the bs-node_name is set to 
  undefined.
  This will allow to have some default string to communicate in QMP and HMP.
  This also make undefined a reserved string for bs-node_name.
  
  Signed-off-by: Benoit Canet ben...@irqsave.net
  ---
 
  +/* if node name is given store it in bs and insert bs in the graph bs 
  list
  + * note: undefined is a reserved node name
  + */
  +if (node_name 
  +node_name[0] != '\0' 
  +strcmp(node_name, undefined)) {
  +pstrcpy(bs-node_name, sizeof(bs-node_name), node_name);
  +QTAILQ_INSERT_TAIL(graph_bdrv_states, bs, node_list);
  +/* else set the bs node name to undefined for QMP and HMP */
  +} else {
  +sprintf(bs-node_name, undefined);
 
 strcpy() is more efficient than sprintf() with no % in the format string.
 
 
   
  +/* This function is to find a node in the bs graph */
  +BlockDriverState *bdrv_find_node(const char *node_name)
  +{
  +BlockDriverState *bs;
  +
 
 Should this function assert that node_name is not undefined?
 
 
  +++ b/include/block/block_int.h
  @@ -297,11 +297,18 @@ struct BlockDriverState {
   BlockdevOnError on_read_error, on_write_error;
   bool iostatus_enabled;
   BlockDeviceIoStatus iostatus;
  +
  +/* the following member give a name to every node on the 
  BlockDriverState
 
 s/give/gives/
 
  + * graph.
  + */
  +char node_name[32];
  +QTAILQ_ENTRY(BlockDriverState) node_list;
  +/* Device name is the name associated with the drive the guest see */
 
 s/see/sees/
 
   char device_name[32];
  +QTAILQ_ENTRY(BlockDriverState) device_list;
 
 Maybe I missed it, but is there code that ensures there are no duplicate
 node names (other than the magic undefined)?
 
Your are right I will add some checkings.

 Seems like it's moving in the right direction, although I'm not sure
 it's worth applying this until we know the qapi for working with node
 names makes sense.

I am not seeking for fast commit of these patch, I am only trying to avoid
posting a long series at once.

Best regards

Benoît

 
 -- 
 Eric Blake   eblake redhat com+1-919-301-3266
 Libvirt virtualization library http://libvirt.org

Re: [Qemu-devel] [PATCH 1/2] block: Add bs-node_name to hold the name of a bs node of the bs graph.

2013-11-07 Thread Benoît Canet

Le Thursday 07 Nov 2013 à 15:54:09 (-0500), Jeff Cody a écrit :
 On Thu, Nov 07, 2013 at 04:01:42PM +0100, Benoît Canet wrote:
  Add the minimum of code to prepare the followings patches.
  
  If no node_name is provided to bdrv_new the bs-node_name is set to 
  undefined.
  This will allow to have some default string to communicate in QMP and HMP.
  This also make undefined a reserved string for bs-node_name.
 
 Hi Benoît,
 
 Is it necessary to have a reserved string, or would an empty
 null-terminated string be able to implicitly denote the name as
 undefined?

This would works too and just report the undefined logic to hmp printing code.
I'll do this.

 
  
  Signed-off-by: Benoit Canet ben...@irqsave.net
  ---
   block.c   | 70 
  +++
   block/blkverify.c |  2 +-
   block/iscsi.c |  2 +-
   block/vmdk.c  |  2 +-
   block/vvfat.c |  4 +--
   blockdev.c|  8 +++---
   hw/block/xen_disk.c   |  2 +-
   include/block/block.h |  3 +-
   include/block/block_int.h |  9 +-
   qemu-img.c|  6 ++--
   qemu-io.c |  2 +-
   qemu-nbd.c|  2 +-
   12 files changed, 77 insertions(+), 35 deletions(-)
  
  diff --git a/block.c b/block.c
  index fd05a80..230e71a 100644
  --- a/block.c
  +++ b/block.c
  @@ -89,6 +89,9 @@ static int coroutine_fn 
  bdrv_co_do_write_zeroes(BlockDriverState *bs,
   static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
   QTAILQ_HEAD_INITIALIZER(bdrv_states);
   
  +static QTAILQ_HEAD(, BlockDriverState) graph_bdrv_states =
  +QTAILQ_HEAD_INITIALIZER(graph_bdrv_states);
  +
   static QLIST_HEAD(, BlockDriver) bdrv_drivers =
   QLIST_HEAD_INITIALIZER(bdrv_drivers);
   
  @@ -318,14 +321,26 @@ void bdrv_register(BlockDriver *bdrv)
   }
   
   /* create a new block device (by default it is empty) */
  -BlockDriverState *bdrv_new(const char *device_name)
  +BlockDriverState *bdrv_new(const char *device_name, const char *node_name)
   {
   BlockDriverState *bs;
   
   bs = g_malloc0(sizeof(BlockDriverState));
   pstrcpy(bs-device_name, sizeof(bs-device_name), device_name);
   if (device_name[0] != '\0') {
  -QTAILQ_INSERT_TAIL(bdrv_states, bs, list);
  +QTAILQ_INSERT_TAIL(bdrv_states, bs, device_list);
  +}
  +/* if node name is given store it in bs and insert bs in the graph bs 
  list
  + * note: undefined is a reserved node name
  + */
  +if (node_name 
  +node_name[0] != '\0' 
  +strcmp(node_name, undefined)) {
  +pstrcpy(bs-node_name, sizeof(bs-node_name), node_name);
  +QTAILQ_INSERT_TAIL(graph_bdrv_states, bs, node_list);
  +/* else set the bs node name to undefined for QMP and HMP */
  +} else {
  +sprintf(bs-node_name, undefined);
   }
   bdrv_iostatus_disable(bs);
   notifier_list_init(bs-close_notifiers);
  @@ -870,7 +885,7 @@ int bdrv_file_open(BlockDriverState **pbs, const char 
  *filename,
   options = qdict_new();
   }
   
  -bs = bdrv_new();
  +bs = bdrv_new(, NULL);
   bs-options = options;
   options = qdict_clone_shallow(options);
   
  @@ -992,7 +1007,7 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict 
  *options, Error **errp)
  sizeof(backing_filename));
   }
   
  -bs-backing_hd = bdrv_new();
  +bs-backing_hd = bdrv_new(, NULL);
   
   if (bs-backing_format[0] != '\0') {
   back_drv = bdrv_find_format(bs-backing_format);
  @@ -1062,7 +1077,7 @@ int bdrv_open(BlockDriverState *bs, const char 
  *filename, QDict *options,
  instead of opening 'filename' directly */
   
   /* if there is a backing file, use it */
  -bs1 = bdrv_new();
  +bs1 = bdrv_new(, NULL);
   ret = bdrv_open(bs1, filename, NULL, 0, drv, local_err);
   if (ret  0) {
   bdrv_unref(bs1);
  @@ -1495,7 +1510,7 @@ void bdrv_close_all(void)
   {
   BlockDriverState *bs;
   
  -QTAILQ_FOREACH(bs, bdrv_states, list) {
  +QTAILQ_FOREACH(bs, bdrv_states, device_list) {
   bdrv_close(bs);
   }
   }
  @@ -1524,7 +1539,7 @@ static bool bdrv_requests_pending(BlockDriverState 
  *bs)
   static bool bdrv_requests_pending_all(void)
   {
   BlockDriverState *bs;
  -QTAILQ_FOREACH(bs, bdrv_states, list) {
  +QTAILQ_FOREACH(bs, bdrv_states, device_list) {
   if (bdrv_requests_pending(bs)) {
   return true;
   }
  @@ -1554,7 +1569,7 @@ void bdrv_drain_all(void)
   /* FIXME: We do not have timer support here, so this is effectively
* a busy wait.
*/
  -QTAILQ_FOREACH(bs, bdrv_states, list) {
  +QTAILQ_FOREACH(bs, bdrv_states, device_list) {
   if (bdrv_start_throttled_reqs(bs)) {
   busy = true;
   }
  @@ -1570,7 +1585,7

[Qemu-devel] [RFC V2 0/3] Giving names to BlockDriverState graph nodes

2013-11-12 Thread Benoît Canet

This series fix the issues of V1 and add the skeletton of a new QMP command
useful to communicate a bs graph to libvirt.
The minimum is exposed to the management to avoid binding it too much with QEMU.

Tell me if you think it's worth implementing.

Best regards

Benoît

v2:
   s/give/gives/ [Eric]
   s/see/sees/   [Eric]
   prevent duplicate node_name [Eric]
   drop undefined and use  [Eric, Kevin, Jeff]
   remove from graph list on bdrv_make_anon [Jeff]
   comment the two chains [Fam]
   Add new command stub to retrieve the graph from libvirt [Benoît]

Benoît Canet (3):
  block: Add bs-node_name to hold the name of a bs node of the bs
graph.
  block: Allow the user to define node-name option.
  qapi: Add skeletton of command to query a drive bs graph.

 block.c   | 88 +--
 block/blkverify.c |  2 +-
 block/iscsi.c |  2 +-
 block/vmdk.c  |  2 +-
 block/vvfat.c |  4 +--
 blockdev.c| 16 ++---
 hw/block/xen_disk.c   |  2 +-
 include/block/block.h |  3 +-
 include/block/block_int.h |  9 -
 qapi-schema.json  | 32 +
 qemu-img.c|  6 ++--
 qemu-io.c |  2 +-
 qemu-nbd.c|  2 +-
 13 files changed, 134 insertions(+), 36 deletions(-)

-- 
1.8.3.2

[Qemu-devel] [RFC V2 2/3] block: Allow the user to define node-name option.

2013-11-12 Thread Benoît Canet

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block.c | 20 ++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index c397ee9..6690e3d 100644
--- a/block.c
+++ b/block.c
@@ -872,6 +872,7 @@ int bdrv_file_open(BlockDriverState **pbs, const char 
*filename,
 const char *drvname;
 bool allow_protocol_prefix = false;
 Error *local_err = NULL;
+const char *node_name = NULL;
 int ret;
 
 /* NULL means an empty set of options */
@@ -879,7 +880,14 @@ int bdrv_file_open(BlockDriverState **pbs, const char 
*filename,
 options = qdict_new();
 }
 
-bs = bdrv_new(, );
+node_name = qdict_get_try_str(options, node-name);
+if (node_name  bdrv_find_node(node_name)) {
+error_setg(errp, Duplicate node name);
+return -EINVAL;
+}
+bs = bdrv_new(, node_name ? node_name : );
+qdict_del(options, node-name);
+
 bs-options = options;
 options = qdict_clone_shallow(options);
 
@@ -979,6 +987,7 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict 
*options, Error **errp)
 int back_flags, ret;
 BlockDriver *back_drv = NULL;
 Error *local_err = NULL;
+const char *node_name = NULL;
 
 if (bs-backing_hd != NULL) {
 QDECREF(options);
@@ -1001,7 +1010,14 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict 
*options, Error **errp)
sizeof(backing_filename));
 }
 
-bs-backing_hd = bdrv_new(, );
+node_name = qdict_get_try_str(options, node-name);
+if (node_name  bdrv_find_node(node_name)) {
+error_setg(errp, Duplicate node name);
+QDECREF(options);
+return -EINVAL;
+}
+bs-backing_hd = bdrv_new(, node_name ? node_name : );
+qdict_del(options, node-name);
 
 if (bs-backing_format[0] != '\0') {
 back_drv = bdrv_find_format(bs-backing_format);
-- 
1.8.3.2

[Qemu-devel] [RFC V2 1/3] block: Add bs-node_name to hold the name of a bs node of the bs graph.

2013-11-12 Thread Benoît Canet

Add the minimum of code to prepare the followings patches.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block.c   | 72 ++-
 block/blkverify.c |  2 +-
 block/iscsi.c |  2 +-
 block/vmdk.c  |  2 +-
 block/vvfat.c |  4 +--
 blockdev.c|  8 +++---
 hw/block/xen_disk.c   |  2 +-
 include/block/block.h |  3 +-
 include/block/block_int.h |  9 +-
 qemu-img.c|  6 ++--
 qemu-io.c |  2 +-
 qemu-nbd.c|  2 +-
 12 files changed, 78 insertions(+), 36 deletions(-)

diff --git a/block.c b/block.c
index fd05a80..c397ee9 100644
--- a/block.c
+++ b/block.c
@@ -89,6 +89,9 @@ static int coroutine_fn 
bdrv_co_do_write_zeroes(BlockDriverState *bs,
 static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
 QTAILQ_HEAD_INITIALIZER(bdrv_states);
 
+static QTAILQ_HEAD(, BlockDriverState) graph_bdrv_states =
+QTAILQ_HEAD_INITIALIZER(graph_bdrv_states);
+
 static QLIST_HEAD(, BlockDriver) bdrv_drivers =
 QLIST_HEAD_INITIALIZER(bdrv_drivers);
 
@@ -318,14 +321,20 @@ void bdrv_register(BlockDriver *bdrv)
 }
 
 /* create a new block device (by default it is empty) */
-BlockDriverState *bdrv_new(const char *device_name)
+BlockDriverState *bdrv_new(const char *device_name, const char *node_name)
 {
 BlockDriverState *bs;
 
+assert(node_name);
+
 bs = g_malloc0(sizeof(BlockDriverState));
 pstrcpy(bs-device_name, sizeof(bs-device_name), device_name);
 if (device_name[0] != '\0') {
-QTAILQ_INSERT_TAIL(bdrv_states, bs, list);
+QTAILQ_INSERT_TAIL(bdrv_states, bs, device_list);
+}
+pstrcpy(bs-node_name, sizeof(bs-node_name), node_name);
+if (node_name[0] != '\0') {
+QTAILQ_INSERT_TAIL(graph_bdrv_states, bs, node_list);
 }
 bdrv_iostatus_disable(bs);
 notifier_list_init(bs-close_notifiers);
@@ -870,7 +879,7 @@ int bdrv_file_open(BlockDriverState **pbs, const char 
*filename,
 options = qdict_new();
 }
 
-bs = bdrv_new();
+bs = bdrv_new(, );
 bs-options = options;
 options = qdict_clone_shallow(options);
 
@@ -992,7 +1001,7 @@ int bdrv_open_backing_file(BlockDriverState *bs, QDict 
*options, Error **errp)
sizeof(backing_filename));
 }
 
-bs-backing_hd = bdrv_new();
+bs-backing_hd = bdrv_new(, );
 
 if (bs-backing_format[0] != '\0') {
 back_drv = bdrv_find_format(bs-backing_format);
@@ -1062,7 +1071,7 @@ int bdrv_open(BlockDriverState *bs, const char *filename, 
QDict *options,
instead of opening 'filename' directly */
 
 /* if there is a backing file, use it */
-bs1 = bdrv_new();
+bs1 = bdrv_new(, );
 ret = bdrv_open(bs1, filename, NULL, 0, drv, local_err);
 if (ret  0) {
 bdrv_unref(bs1);
@@ -1495,7 +1504,7 @@ void bdrv_close_all(void)
 {
 BlockDriverState *bs;
 
-QTAILQ_FOREACH(bs, bdrv_states, list) {
+QTAILQ_FOREACH(bs, bdrv_states, device_list) {
 bdrv_close(bs);
 }
 }
@@ -1524,7 +1533,7 @@ static bool bdrv_requests_pending(BlockDriverState *bs)
 static bool bdrv_requests_pending_all(void)
 {
 BlockDriverState *bs;
-QTAILQ_FOREACH(bs, bdrv_states, list) {
+QTAILQ_FOREACH(bs, bdrv_states, device_list) {
 if (bdrv_requests_pending(bs)) {
 return true;
 }
@@ -1554,7 +1563,7 @@ void bdrv_drain_all(void)
 /* FIXME: We do not have timer support here, so this is effectively
  * a busy wait.
  */
-QTAILQ_FOREACH(bs, bdrv_states, list) {
+QTAILQ_FOREACH(bs, bdrv_states, device_list) {
 if (bdrv_start_throttled_reqs(bs)) {
 busy = true;
 }
@@ -1565,14 +1574,18 @@ void bdrv_drain_all(void)
 }
 }
 
-/* make a BlockDriverState anonymous by removing from bdrv_state list.
+/* make a BlockDriverState anonymous by removing from bdrv_state and
+ * graph_bdrv_state list.
Also, NULL terminate the device_name to prevent double remove */
 void bdrv_make_anon(BlockDriverState *bs)
 {
 if (bs-device_name[0] != '\0') {
-QTAILQ_REMOVE(bdrv_states, bs, list);
+QTAILQ_REMOVE(bdrv_states, bs, device_list);
 }
 bs-device_name[0] = '\0';
+if (bs-node_name[0] != '\0') {
+QTAILQ_REMOVE(graph_bdrv_states, bs, node_list);
+}
 }
 
 static void bdrv_rebind(BlockDriverState *bs)
@@ -1626,7 +1639,12 @@ static void bdrv_move_feature_fields(BlockDriverState 
*bs_dest,
 /* keep the same entry in bdrv_states */
 pstrcpy(bs_dest-device_name, sizeof(bs_dest-device_name),
 bs_src-device_name);
-bs_dest-list = bs_src-list;
+bs_dest-device_list = bs_src-device_list;
+
+/* keep the same entry in graph_bdrv_states */
+pstrcpy(bs_dest-node_name, sizeof(bs_dest-node_name),
+bs_src-node_name);
+bs_dest-node_list   = bs_src-node_list;
 }

[Qemu-devel] [RFC V2 3/3] qapi: Add skeletton of command to query a drive bs graph.

2013-11-12 Thread Benoît Canet

---
 blockdev.c   |  8 
 qapi-schema.json | 32 
 2 files changed, 40 insertions(+)

diff --git a/blockdev.c b/blockdev.c
index 911ee7e..bfaeda0 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1938,6 +1938,14 @@ void qmp_drive_backup(const char *device, const char 
*target,
 }
 }
 
+BlockGraphNode *qmp_query_drive_graph(const char *device, Error **errp)
+{
+/* the implementation of this function would recurse through the
+ * BlockDriverState graph to build it's result
+ */
+return NULL;
+}
+
 #define DEFAULT_MIRROR_BUF_SIZE   (10  20)
 
 void qmp_drive_mirror(const char *device, const char *target,
diff --git a/qapi-schema.json b/qapi-schema.json
index 60f3fd1..bca3d5a 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -1983,6 +1983,38 @@
 { 'command': 'drive-backup', 'data': 'DriveBackup' }
 
 ##
+# @BlockGraphNode
+#
+# Information about a node of the block driver state graph
+#
+# @node-name: the name of the node in the graph
+#
+# @drv: the name of the block format used by this node as described in
+#   @BlockDeviceInfo.
+#
+# @children: a list of @BlockGraphNode being the children of this node
+#
+# Since 1.8
+##
+{ 'type': 'BlockGraphNode',
+  'data': { 'node-name': 'str', 'drv': 'str', 'children': ['BlockGraphNode'] } 
}
+
+##
+# @query-drive-graph
+#
+# Get the block driver states graph for a given drive
+#
+# @device: the name of the device to get the graph from
+#
+# Returns: the root @BlockGraphNode
+#
+# Since 1.8
+##
+{ 'command': 'query-drive-graph',
+  'data': { 'device': 'str' },
+  'returns': 'BlockGraphNode' }
+
+##
 # @drive-mirror
 #
 # Start mirroring a block device's writes to a new destination.
-- 
1.8.3.2

Re: [Qemu-devel] [PATCH for 1.7] target-i386: do not override nr_cores for -cpu host

2013-11-19 Thread Benoît Canet

Le Tuesday 19 Nov 2013 à 17:49:46 (+0100), Paolo Bonzini a écrit :
 Commit 787aaf5 (target-i386: forward CPUID cache leaves when -cpu host is
 used, 2013-09-02) brings bits 31..26 of CPUID leaf 04h out of sync with
 the APIC IDs that QEMU reserves for each package.  This number must come
 from -smp options rather than from the host CPUID.
 
 It also turns out that this unsyncing makes Windows Server 2012R2 fail
 to boot.
 
 Tested-by: Peter Lieven p...@kamp.de
 Signed-off-by: Paolo Bonzini pbonz...@redhat.com
 ---
 Resending because the mailing list ate the message.
 
  target-i386/cpu.c | 14 --
  1 file changed, 8 insertions(+), 6 deletions(-)
 
 diff --git a/target-i386/cpu.c b/target-i386/cpu.c
 index 864c80e..8df6747 100644
 --- a/target-i386/cpu.c
 +++ b/target-i386/cpu.c
 @@ -2086,14 +2086,10 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
 uint32_t count,
  /* cache info: needed for Core compatibility */
  if (cpu-cache_info_passthrough) {
  host_cpuid(index, count, eax, ebx, ecx, edx);
 -break;
 -}
 -if (cs-nr_cores  1) {
 -*eax = (cs-nr_cores - 1)  26;
 +*eax = ~0xFC00;
  } else {
  *eax = 0;
 -}
 -switch (count) {
 +switch (count) {
  case 0: /* L1 dcache info */
  *eax |= CPUID_4_TYPE_DCACHE | \
  CPUID_4_LEVEL(1) | \
 @@ -2133,6 +2129,12 @@ void cpu_x86_cpuid(CPUX86State *env, uint32_t index, 
 uint32_t count,
  *ecx = 0;
  *edx = 0;
  break;
 +}
 +}
 +
 +/* QEMU gives out its own APIC IDs, never pass down bits 31..26.  */
Maybe a comment about the eax  31 would help the next sould who will read this
code.

 +if ((*eax  31)  cs-nr_cores  1) {
 +*eax |= (cs-nr_cores - 1)  26;
  }
  break;
  case 5:
 -- 
 1.8.4.2
 

Apart from this
Reviewed-by: Benoit Canet ben...@irqsave.net

Re: [Qemu-devel] [RFC V8 01/24] qcow2: Add journal specification.

2013-07-02 Thread Benoît Canet

  +QCOW2 can use one or more instance of a metadata journal.
 
 s/instance/instances/
 
 Is there a reason to use multiple journals rather than a single journal
 for all entry types?  The single journal area avoids seeks.

Here are the main reason for this:

For the deduplication some patterns like cycles of insertion/deletion could
leave the hash table almost empty while filling the journal.

If the journal is full and the hash table is empty a packing operation is
started.

Basically a new journal is created and only the entry presents in the hash table
are reinserted.

This is why I want to keep the deduplication journal appart from regular qcow2
journal: to avoid interferences between a pack operation and regular qcow2
journal entries.

The other thing is that freezing the log store would need a replay of regular
qcow2 entries as it trigger a reset of the journal.

Also since deduplication will not work on spinning disk I discarded the seek
time factor.

Maybe commiting the dedupe journal by erase block sized chunk would be a good
idea to reduce random writes to the SSD.

The additional reason for having multiple journals is that the SILT paper
propose a mode where prefix of the hash is used to dispatch insertions in
multiples store and it easier to do with multiple journals.

 
  +
  +A journal is a sequential log of journal entries appended on a previously
  +allocated and reseted area.
 
 I think you say previously reset area instead of reseted.  Another
 option is initialized area.
 
  +A journal is designed like a linked list with each entry pointing to the 
  next
  +so it's easy to iterate over entries.
  +
  +A journal uses the following constants to denote the type of each entry
  +
  +TYPE_NONE = 0xFF  default value of any bytes in a reseted journal
  +TYPE_END  = 1 the entry ends a journal cluster and point to the 
  next
  +  cluster
  +TYPE_HASH = 2 the entry contains a deduplication hash
  +
  +QCOW2 journal entry:
  +
  +Byte 0 :Size of the entry: size = 2 + n with size = 254
 
 This is not clear.  I'm wondering if the +2 is included in the byte
 value or not.  I'm also wondering what a byte value of zero means and
 what a byte value of 255 means.

I am counting the journal entry header in the size. So yes the +2 is in the byte
value.
A byte value of zero, 1 or 255  is an error.

Maybe this design is bogus and I should only count the payload size in the size
field. It would make less tricky cases.

 
 Please include an example to illustrate how this field works.
 
  +
  + 1 :Type of the entry
  +
  + 2 - size  :The optional n bytes structure carried by entry
  +
  +A journal is divided into clusters and no journal entry can be spilled on 
  two
  +clusters. This avoid having to read more than one cluster to get a single 
  entry.
  +
  +For this purpose an entry with the end type is added at the end of a 
  journal
  +cluster before starting to write in the next cluster.
  +The size of such an entry is set so the entry points to the next cluster.
  +
  +As any journal cluster must be ended with an end entry the size of regular
  +journal entries is limited to 254 bytes in order to always left room for 
  an end
  +entry which mimimal size is two bytes.
  +
  +The only cases where size  254 are none entries where size = 255.
  +
  +The replay of a journal stop when the first end none entry is reached.
 
 s/stop/stops/
 
  +The journal cluster size is 4096 bytes.
 
 Questions about this layout:
 
 1. Journal entries have no integrity mechanism, which is especially
important if they span physical sectors where cheap disks may perform
a partial write.  This would leave a corrupt journal.  If the last
bytes are a checksum then you can get some confidence that the entry
was fully written and is valid.

I will add a checksum mecanism.

Do you have any preferences regarding the checksum function ?

 
Did I miss something?
 
 2. Byte-granularity means that read-modify-write is necessary to append
entries to the journal.  Therefore a failure could destroy previously
committed entries.

It's designed to be committed by 4KB blocks.

 
Any ideas how existing journals handle this?

Re: [Qemu-devel] [RFC V8 01/24] qcow2: Add journal specification.

2013-07-02 Thread Benoît Canet

  2. Byte-granularity means that read-modify-write is necessary to append
 entries to the journal.  Therefore a failure could destroy previously
 committed entries.
  
 Any ideas how existing journals handle this?
 
 You commit only whole blocks. So in this case we can consider a block
 only committed as soon as a TYPE_END entry has been written (and after
 that we won't touch it any more until the journalled changes have been
 flushed to disk).
 
 There's one interesting case: cache=writethrough. I'm not entirely
 sure yet what to do with it, but it's slow anyway, so using one block
 per entry and therefore flushing the journal very often might actually
 be not totally unreasonable.

This sure would finish to kill the performance because this would be an io
per metadata written to disk.

 
 Another thing I'm not sure about is whether a fixed 4k block is good or
 if we should leave it configurable. I don't think making it an option
 would hurt (not necessarily modifyable with qemu-img, but as a field
 in the file format).

I agree.
I also think about make the number of block to be flushed at once configurable.

Benoît

Re: [Qemu-devel] [RFC V8 01/24] qcow2: Add journal specification.

2013-07-03 Thread Benoît Canet

 Care to explain that in more detail? Why shouldn't it work on spinning
 disks?

Hash are random they introduce random read access.

With a QCOW2 cluster size of 4KB the deduplication code when writting duplicated
data will do one random read per 4KB block to deduplicate.

A server grade hardisk is rated for 250 iops. This traduce in 1MB/s of 
deduplicated
data. Not very usable.

On the contrary a samsung 840 pro SSD is rated for 80k iops of random read.
That should traduce in 320MB/s of potentially deduplicated data.

Havind dedup metadata on SSD and actual data on disk would solve the problem but
it would need block backend.

Benoît

Re: [Qemu-devel] [RFC V8 01/24] qcow2: Add journal specification.

2013-07-03 Thread Benoît Canet

 Does this mean the journal forms the first-stage data structure for
 deduplication?  Dedup records will accumulate in the journal until it
 becomes time to convert them in bulk into a more compact representation?

The journal is mainly used to persist the last inserted dedup metadata across
QEMU stop and restart. I replay it at startup to rebuild the hash table.
So yes it's the first stage even it's never used for regular queries.

 
 When I read this specification I was thinking of a journal purely for
 logging operations.  You could use a commit record to mark previous
 records applied.  Upon startup, qcow2 would inspect uncommitted records
 and deal with them.

Maybe that could help regular QCOW2 usage. I don't know.

Re: [Qemu-devel] [RFC V8 01/24] qcow2: Add journal specification.

2013-07-03 Thread Benoît Canet

 By the way, I don't know much about journalling techniques.  So I'm
 asking you these questions so that either you can answer them straight
 away or because they might warrant a look at existing journal
 implementations like:

I tried to so something simple and performing for the deduplication usage.

That explain that there is no concept of transaction and that the journal's
block are flushed asynchronously in order to have an high insertion rate.

I agree with your previous comment is more a log than a journal.

 
 https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/fs/jbd2
 http://www.sqlite.org/cgi/src/dir?name=src
 http://blitiri.com.ar/p/libjio/

I will try to find a paper on journal design.

Benoît

Re: [Qemu-devel] [RFC V8 01/24] qcow2: Add journal specification.

2013-07-04 Thread Benoît Canet

 Simple is good.  Even for deduplication alone, I think data integrity is
 critical - otherwise we risk stale dedup metadata pointing to clusters
 that are unallocated or do not contain the right data.  So the journal
 will probably need to follow techniques for commits/checksums.

I agree that checksums are missing for the dedup.
Maybe we could even use some kind of error correcting code instead of a 
checksum.

Concerning data integrity the events that the deduplication code cannot loose
are hash deletions because they mark a previously inserted hash as obsolete.

The problem with a commit/flush mechanism on hash deletion is that it will slow
down the store insertion speed and also create some extra SSD wear out.

To solve this I considered the fact that the dedup metadata as a whole is
disposable.

So I implemented a dedup dirty bit.

When QEMU stop the journal is flushed and the dirty bit is cleared.
When QEMU start and the dirty bit is set a crash is detected and _all_ the
deduplication metadata is dropped.
The QCOW2 data integrity won't suffer only the dedup ratio will be lower.

As you said once on irc crashes don't happen often.

Benoît

[Qemu-devel] Alternative mac address setting

2013-07-16 Thread Benoît Canet


Hello,

I want to implement an alternative mac address setting to allow a guest using
virtio-net to change it's mac address by itself.

The main use case is high availability setups where a slave machine take the
lead when the master is failing. (heartbeat)

The thing that an alternate mac address setting would allow to do is to make
the switch _without_ requiring a gratuitous arp.

The slave guest would simply take the master guest mac address and resume
operations.

I initially though about implementing a list of alternate mac address but I fear
that in this context lists are frowed upon.

Do you have any advice on how to implement this feature ?

Should I support old net device declaration ?

What is the strict minimum changes that would need to be done to implement the
feature ?

Best regards

Benoît

Re: [Qemu-devel] Alternative mac address setting

2013-07-16 Thread Benoît Canet


Sorry for the gibberish I misread the code.
Please ignore this mail.

Best regards

Benoît

Le Tuesday 16 Jul 2013 à 13:36:49 (+0200), Benoît Canet a écrit :
 
 Hello,
 
 I want to implement an alternative mac address setting to allow a guest using
 virtio-net to change it's mac address by itself.
 
 The main use case is high availability setups where a slave machine take the
 lead when the master is failing. (heartbeat)
 
 The thing that an alternate mac address setting would allow to do is to make
 the switch _without_ requiring a gratuitous arp.
 
 The slave guest would simply take the master guest mac address and resume
 operations.
 
 I initially though about implementing a list of alternate mac address but I 
 fear
 that in this context lists are frowed upon.
 
 Do you have any advice on how to implement this feature ?
 
 Should I support old net device declaration ?
 
 What is the strict minimum changes that would need to be done to implement the
 feature ?
 
 Best regards
 
 Benoît

Re: [Qemu-devel] [RFC V8 01/24] qcow2: Add journal specification.

2013-07-16 Thread Benoît Canet

  Simple is good.  Even for deduplication alone, I think data integrity is
  critical - otherwise we risk stale dedup metadata pointing to clusters
  that are unallocated or do not contain the right data.  So the journal
  will probably need to follow techniques for commits/checksums.


I'll add checksums to the journal and clean the journal entry size mess soon.

For the transactional/commits aspect of the journal I think that we need Kevin's
point of view on the subject.

Best regards

Benoît

[Qemu-devel] [PATCH 0/4] Leaky bucket throttling and features

2013-07-22 Thread Benoît Canet

The first patch fixes the throttling which was broken by a previous commit.

The next patch replace the existing throttling algorithm by the well described
leaky bucket algorithm.

Third patch implement bursting by adding *_threshold parameters to
qmp_block_set_io_throttle.

The last one allow to define the max size of an io when throttling by iops via
iops_sector_count.

Benoît Canet (4):
  block: Repair the throttling code.
  block: Modify the throttling code to implement the leaky bucket
algorithm.
  block: Add support for throttling burst threshold in QMP and the
command line.
  block: Add iops_sector_count to do the iops accounting for a given io
size.

 block.c   |  424 +
 block/qapi.c  |   28 +++
 blockdev.c|  174 +--
 hmp.c |   36 +++-
 include/block/block_int.h |   16 +-
 include/block/coroutine.h |5 +
 qapi-schema.json  |   40 -
 qemu-coroutine-lock.c |   14 ++
 qemu-options.hx   |2 +-
 qmp-commands.hx   |   34 +++-
 10 files changed, 561 insertions(+), 212 deletions(-)

-- 
1.7.10.4

[Qemu-devel] [PATCH 1/4] block: Repair the throttling code.

2013-07-22 Thread Benoît Canet

The throttling code was segfaulting since commit
02ffb504485f0920cfc75a0982a602f824a9a4f4 because some qemu_co_queue_next caller
does not run in a coroutine.
qemu_co_queue_do_restart assume that the caller is a coroutinne.
As sugested by Stefan fix this by entering the coroutine directly.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block.c   |8 
 include/block/coroutine.h |5 +
 qemu-coroutine-lock.c |   14 ++
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/block.c b/block.c
index b560241..dc72643 100644
--- a/block.c
+++ b/block.c
@@ -127,7 +127,8 @@ void bdrv_io_limits_disable(BlockDriverState *bs)
 {
 bs-io_limits_enabled = false;
 
-while (qemu_co_queue_next(bs-throttled_reqs));
+while (qemu_co_enter_next(bs-throttled_reqs)) {
+}
 
 if (bs-block_timer) {
 qemu_del_timer(bs-block_timer);
@@ -143,7 +144,7 @@ static void bdrv_block_timer(void *opaque)
 {
 BlockDriverState *bs = opaque;
 
-qemu_co_queue_next(bs-throttled_reqs);
+qemu_co_enter_next(bs-throttled_reqs);
 }
 
 void bdrv_io_limits_enable(BlockDriverState *bs)
@@ -1445,8 +1446,7 @@ void bdrv_drain_all(void)
  * a busy wait.
  */
 QTAILQ_FOREACH(bs, bdrv_states, list) {
-if (!qemu_co_queue_empty(bs-throttled_reqs)) {
-qemu_co_queue_restart_all(bs-throttled_reqs);
+while (qemu_co_enter_next(bs-throttled_reqs)) {
 busy = true;
 }
 }
diff --git a/include/block/coroutine.h b/include/block/coroutine.h
index 377805a..66e331c 100644
--- a/include/block/coroutine.h
+++ b/include/block/coroutine.h
@@ -138,6 +138,11 @@ bool qemu_co_queue_next(CoQueue *queue);
 void qemu_co_queue_restart_all(CoQueue *queue);
 
 /**
+ * Enter the next coroutine in the queue
+ */
+bool qemu_co_enter_next(CoQueue *queue);
+
+/**
  * Checks if the CoQueue is empty.
  */
 bool qemu_co_queue_empty(CoQueue *queue);
diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
index d9fea49..c61d300 100644
--- a/qemu-coroutine-lock.c
+++ b/qemu-coroutine-lock.c
@@ -98,6 +98,20 @@ void qemu_co_queue_restart_all(CoQueue *queue)
 qemu_co_queue_do_restart(queue, false);
 }
 
+bool qemu_co_enter_next(CoQueue *queue)
+{
+Coroutine *next;
+
+next = QTAILQ_FIRST(queue-entries);
+if (!next) {
+return false;
+}
+
+QTAILQ_REMOVE(queue-entries, next, co_queue_next);
+qemu_coroutine_enter(next, NULL);
+return true;
+}
+
 bool qemu_co_queue_empty(CoQueue *queue)
 {
 return (QTAILQ_FIRST(queue-entries) == NULL);
-- 
1.7.10.4

[Qemu-devel] [PATCH 2/4] block: Modify the throttling code to implement the leaky bucket algorithm.

2013-07-22 Thread Benoît Canet

This patch replace the previous algorithm by the well described leaky bucket
algorithm: A bucket is filled by the incoming IOs and a periodic timer decrement
the counter to make the bucket leak. When a given threshold is reached the
bucket is full and the IOs are hold.

In this patch the threshold is set to a default value to make the code behave
like the previous implementation.

In the next patch the threshold will be exposed in QMP to let the user control
the burstiness of the throttling.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block.c   |  410 +
 blockdev.c|   71 ++--
 include/block/block_int.h |   15 +-
 3 files changed, 299 insertions(+), 197 deletions(-)

diff --git a/block.c b/block.c
index dc72643..2d6e9b4 100644
--- a/block.c
+++ b/block.c
@@ -86,13 +86,6 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque);
 static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
 int64_t sector_num, int nb_sectors);
 
-static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
-bool is_write, double elapsed_time, uint64_t *wait);
-static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
-double elapsed_time, uint64_t *wait);
-static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
-bool is_write, int64_t *wait);
-
 static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
 QTAILQ_HEAD_INITIALIZER(bdrv_states);
 
@@ -101,6 +94,8 @@ static QLIST_HEAD(, BlockDriver) bdrv_drivers =
 
 /* If non-zero, use only whitelisted block drivers */
 static int use_bdrv_whitelist;
+/* boolean used to inform the throttling code that a bdrv_drain_all is issued 
*/
+static bool draining;
 
 #ifdef _WIN32
 static int is_windows_drive_prefix(const char *filename)
@@ -135,15 +130,122 @@ void bdrv_io_limits_disable(BlockDriverState *bs)
 qemu_free_timer(bs-block_timer);
 bs-block_timer = NULL;
 }
+}
 
-bs-slice_start = 0;
-bs-slice_end   = 0;
+static void bdrv_make_bps_buckets_leak(BlockDriverState *bs, int64_t delta)
+{
+int64_t *bytes = bs-leaky_buckets.bytes;
+int64_t read_leak, write_leak;
+
+/* the limit apply to both reads and writes */
+if (bs-io_limits.bps[BLOCK_IO_LIMIT_TOTAL]) {
+/* compute half the total leak */
+int64_t leak = ((bs-io_limits.bps[BLOCK_IO_LIMIT_TOTAL] * delta) /
+   NANOSECONDS_PER_SECOND);
+int remain = leak % 2;
+leak /= 2;
+
+/* the read bucket is smaller than half the quantity to leak so take
+ * care adding the leak difference to write leak
+ */
+if (bytes[BLOCK_IO_LIMIT_READ] = leak) {
+read_leak = bytes[BLOCK_IO_LIMIT_READ];
+write_leak = 2 * leak + remain - bytes[BLOCK_IO_LIMIT_READ];
+/* symetric case */
+} else if (bytes[BLOCK_IO_LIMIT_WRITE] = leak) {
+write_leak = bytes[BLOCK_IO_LIMIT_WRITE];
+read_leak = 2 * leak + remain - bytes[BLOCK_IO_LIMIT_WRITE];
+/* both bucket above leak count use half the total leak for both */
+} else {
+write_leak = leak;
+read_leak = leak + remain;
+}
+/* else we consider that limits are separated */
+} else {
+read_leak = (bs-io_limits.bps[BLOCK_IO_LIMIT_READ] * delta) /
+NANOSECONDS_PER_SECOND;
+write_leak = (bs-io_limits.bps[BLOCK_IO_LIMIT_WRITE] * delta) /
+ NANOSECONDS_PER_SECOND;
+}
+
+/* make the buckets leak */
+bytes[BLOCK_IO_LIMIT_READ]  = MAX(bytes[BLOCK_IO_LIMIT_READ] - read_leak,
+  0);
+bytes[BLOCK_IO_LIMIT_WRITE] = MAX(bytes[BLOCK_IO_LIMIT_WRITE] - write_leak,
+  0);
 }
 
+static void bdrv_make_iops_buckets_leak(BlockDriverState *bs, int64_t delta)
+{
+double *ios = bs-leaky_buckets.ios;
+int64_t read_leak, write_leak;
+
+/* the limit apply to both reads and writes */
+if (bs-io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) {
+/* compute half the total leak */
+int64_t leak = ((bs-io_limits.iops[BLOCK_IO_LIMIT_TOTAL] * delta) /
+   NANOSECONDS_PER_SECOND);
+int remain = leak % 2;
+leak /= 2;
+
+/* the read bucket is smaller than half the quantity to leak so take
+ * care adding the leak difference to write leak
+ */
+if (ios[BLOCK_IO_LIMIT_READ] = leak) {
+read_leak = ios[BLOCK_IO_LIMIT_READ];
+write_leak = 2 * leak + remain - ios[BLOCK_IO_LIMIT_READ];
+/* symetric case */
+} else if (ios[BLOCK_IO_LIMIT_WRITE] = leak) {
+write_leak = ios[BLOCK_IO_LIMIT_WRITE];
+read_leak = 2 * leak + remain - ios[BLOCK_IO_LIMIT_WRITE];
+/* both bucket above leak count use half the total leak for both */
+} else {
+write_leak

[Qemu-devel] [PATCH 3/4] block: Add support for throttling burst threshold in QMP and the command line.

2013-07-22 Thread Benoît Canet

The thresholds of the leaky bucket algorithm can be used to allow some
burstiness.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block/qapi.c |   24 +
 blockdev.c   |  105 +++---
 hmp.c|   32 +++--
 qapi-schema.json |   34 --
 qemu-options.hx  |2 +-
 qmp-commands.hx  |   30 ++--
 6 files changed, 205 insertions(+), 22 deletions(-)

diff --git a/block/qapi.c b/block/qapi.c
index a4bc411..03f1604 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -235,6 +235,30 @@ void bdrv_query_info(BlockDriverState *bs,
bs-io_limits.iops[BLOCK_IO_LIMIT_READ];
 info-inserted-iops_wr =
bs-io_limits.iops[BLOCK_IO_LIMIT_WRITE];
+info-inserted-has_bps_threshold =
+   bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL];
+info-inserted-bps_threshold =
+   bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL];
+info-inserted-has_bps_rd_threshold =
+   bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_READ];
+info-inserted-bps_rd_threshold =
+   bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_READ];
+info-inserted-has_bps_wr_threshold =
+   bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_WRITE];
+info-inserted-bps_wr_threshold =
+   bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_WRITE];
+info-inserted-has_iops_threshold =
+   bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL];
+info-inserted-iops_threshold =
+   bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL];
+info-inserted-iops_rd_threshold =
+   bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_READ];
+info-inserted-has_iops_rd_threshold =
+   bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_READ];
+info-inserted-has_iops_wr_threshold =
+   bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE];
+info-inserted-iops_wr_threshold =
+   bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE];
 }
 
 bs0 = bs;
diff --git a/blockdev.c b/blockdev.c
index a78fba4..9bda359 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -341,6 +341,17 @@ static bool do_check_io_limits(BlockIOLimit *io_limits, 
Error **errp)
 return false;
 }
 
+if (io_limits-bps_threshold[BLOCK_IO_LIMIT_TOTAL]  0 ||
+io_limits-bps_threshold[BLOCK_IO_LIMIT_WRITE]  0 ||
+io_limits-bps_threshold[BLOCK_IO_LIMIT_READ]  0 ||
+io_limits-iops_threshold[BLOCK_IO_LIMIT_TOTAL]  0 ||
+io_limits-iops_threshold[BLOCK_IO_LIMIT_WRITE]  0 ||
+io_limits-iops_threshold[BLOCK_IO_LIMIT_READ]  0) {
+error_setg(errp,
+   threshold values must be 0 or greater);
+return false;
+}
+
 return true;
 }
 
@@ -523,24 +534,34 @@ DriveInfo *drive_init(QemuOpts *all_opts, 
BlockInterfaceType block_default_type)
qemu_opt_get_number(opts, bps_rd, 0);
 io_limits.bps[BLOCK_IO_LIMIT_WRITE]  =
qemu_opt_get_number(opts, bps_wr, 0);
+/* bps thresholds */
+io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL]  =
+qemu_opt_get_number(opts, bps_threshold,
+io_limits.bps[BLOCK_IO_LIMIT_TOTAL] / THROTTLE_HZ);
+io_limits.bps_threshold[BLOCK_IO_LIMIT_READ]  =
+qemu_opt_get_number(opts, bps_rd_threshold,
+io_limits.bps[BLOCK_IO_LIMIT_READ] / THROTTLE_HZ);
+io_limits.bps_threshold[BLOCK_IO_LIMIT_WRITE]  =
+qemu_opt_get_number(opts, bps_wr_threshold,
+io_limits.bps[BLOCK_IO_LIMIT_WRITE] / THROTTLE_HZ);
+
 io_limits.iops[BLOCK_IO_LIMIT_TOTAL] =
qemu_opt_get_number(opts, iops, 0);
 io_limits.iops[BLOCK_IO_LIMIT_READ]  =
qemu_opt_get_number(opts, iops_rd, 0);
 io_limits.iops[BLOCK_IO_LIMIT_WRITE] =
qemu_opt_get_number(opts, iops_wr, 0);
-io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL] =
-   io_limits.bps[BLOCK_IO_LIMIT_TOTAL] / THROTTLE_HZ;
-io_limits.bps_threshold[BLOCK_IO_LIMIT_READ]  =
-   io_limits.bps[BLOCK_IO_LIMIT_READ] / THROTTLE_HZ;
-io_limits.bps_threshold[BLOCK_IO_LIMIT_WRITE] =
-   io_limits.bps[BLOCK_IO_LIMIT_WRITE] / THROTTLE_HZ;
-io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL] =
-   io_limits.iops[BLOCK_IO_LIMIT_TOTAL] / THROTTLE_HZ;
+
+/* iops thresholds */
+io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL]  =
+qemu_opt_get_number(opts, iops_threshold,
+

[Qemu-devel] [PATCH 4/4] block: Add iops_sector_count to do the iops accounting for a given io size.

2013-07-22 Thread Benoît Canet

This feature can be used in case where users are avoiding the iops limit by
doing jumbo I/Os hammering the storage backend.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block.c   |8 +++-
 block/qapi.c  |4 
 blockdev.c|   22 +-
 hmp.c |8 ++--
 include/block/block_int.h |1 +
 qapi-schema.json  |   10 --
 qemu-options.hx   |2 +-
 qmp-commands.hx   |8 ++--
 8 files changed, 54 insertions(+), 9 deletions(-)

diff --git a/block.c b/block.c
index 2d6e9b4..cfa890d 100644
--- a/block.c
+++ b/block.c
@@ -337,6 +337,12 @@ static bool 
bdrv_is_any_threshold_exceeded(BlockDriverState *bs, int nb_sectors,
bool is_write)
 {
 bool bps_ret, iops_ret;
+double   ios = 1.0;
+
+if (bs-io_limits.iops_sector_count) {
+ios = ((double) nb_sectors) / bs-io_limits.iops_sector_count;
+ios = MAX(ios, 1.0);
+}
 
 /* check if any bandwith or per IO threshold has been exceeded */
 bps_ret = bdrv_is_bps_threshold_exceeded(bs, is_write);
@@ -360,7 +366,7 @@ static bool bdrv_is_any_threshold_exceeded(BlockDriverState 
*bs, int nb_sectors,
 /* the IO is authorized so do the accounting and return false */
 bs-leaky_buckets.bytes[is_write] += (int64_t)nb_sectors *
  BDRV_SECTOR_SIZE;
-bs-leaky_buckets.ios[is_write]++;
+bs-leaky_buckets.ios[is_write] += ios;
 
 return false;
 }
diff --git a/block/qapi.c b/block/qapi.c
index 03f1604..f81081c 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -259,6 +259,10 @@ void bdrv_query_info(BlockDriverState *bs,
bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE];
 info-inserted-iops_wr_threshold =
bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE];
+info-inserted-has_iops_sector_count =
+   bs-io_limits.iops_sector_count;
+info-inserted-iops_sector_count =
+   bs-io_limits.iops_sector_count;
 }
 
 bs0 = bs;
diff --git a/blockdev.c b/blockdev.c
index 9bda359..8ddd710 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -352,6 +352,11 @@ static bool do_check_io_limits(BlockIOLimit *io_limits, 
Error **errp)
 return false;
 }
 
+if (io_limits-iops_sector_count  0) {
+error_setg(errp, iops_sector_count must be 0 or greater);
+return false;
+}
+
 return true;
 }
 
@@ -563,6 +568,10 @@ DriveInfo *drive_init(QemuOpts *all_opts, 
BlockInterfaceType block_default_type)
 qemu_opt_get_number(opts, iops_wr_threshold,
 io_limits.iops[BLOCK_IO_LIMIT_WRITE] / 
THROTTLE_HZ);
 
+io_limits.iops_sector_count =
+   qemu_opt_get_number(opts, iops_sector_count, 0);
+ 
+
 if (!do_check_io_limits(io_limits, error)) {
 error_report(%s, error_get_pretty(error));
 error_free(error);
@@ -1260,7 +1269,9 @@ void qmp_block_set_io_throttle(const char *device, 
int64_t bps, int64_t bps_rd,
bool has_iops_rd_threshold,
int64_t iops_rd_threshold,
bool has_iops_wr_threshold,
-   int64_t iops_wr_threshold, Error **errp)
+   int64_t iops_wr_threshold,
+   bool has_iops_sector_count,
+   int64_t iops_sector_count, Error **errp)
 {
 BlockIOLimit io_limits;
 BlockDriverState *bs;
@@ -1283,6 +1294,7 @@ void qmp_block_set_io_throttle(const char *device, 
int64_t bps, int64_t bps_rd,
 io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL] = iops / THROTTLE_HZ;
 io_limits.iops_threshold[BLOCK_IO_LIMIT_READ]  = iops_rd / THROTTLE_HZ;
 io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE] = iops_wr / THROTTLE_HZ;
+io_limits.iops_sector_count = 0;
 
 /* override them with givens values if present */
 if (has_bps_threshold) {
@@ -1304,6 +1316,10 @@ void qmp_block_set_io_throttle(const char *device, 
int64_t bps, int64_t bps_rd,
 io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE] = iops_wr_threshold;
 }
 
+if (has_iops_sector_count) {
+io_limits.iops_sector_count = iops_sector_count;
+}
+
 if (!do_check_io_limits(io_limits, errp)) {
 return;
 }
@@ -2007,6 +2023,10 @@ QemuOptsList qemu_common_drive_opts = {
 .type = QEMU_OPT_NUMBER,
 .help = write bytes threshold,
 },{
+.name = iops_sector_count,
+.type = QEMU_OPT_NUMBER,
+.help = when limiting by iops max size of an I/O in sector,
+},{
 .name = copy-on-read,
 .type = QEMU_OPT_BOOL,
 .help = copy read data from backing file into image file,
diff --git a/hmp.c b/hmp.c
index

[Qemu-devel] [PATCH for-1.6 4/4] block: Add iops_sector_count to do the iops accounting for a given io size.

2013-07-22 Thread Benoît Canet

This feature can be used in case where users are avoiding the iops limit by
doing jumbo I/Os hammering the storage backend.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block.c   |8 +++-
 block/qapi.c  |4 
 blockdev.c|   22 +-
 hmp.c |8 ++--
 include/block/block_int.h |1 +
 qapi-schema.json  |   10 --
 qemu-options.hx   |2 +-
 qmp-commands.hx   |8 ++--
 8 files changed, 54 insertions(+), 9 deletions(-)

diff --git a/block.c b/block.c
index 2d6e9b4..cfa890d 100644
--- a/block.c
+++ b/block.c
@@ -337,6 +337,12 @@ static bool 
bdrv_is_any_threshold_exceeded(BlockDriverState *bs, int nb_sectors,
bool is_write)
 {
 bool bps_ret, iops_ret;
+double   ios = 1.0;
+
+if (bs-io_limits.iops_sector_count) {
+ios = ((double) nb_sectors) / bs-io_limits.iops_sector_count;
+ios = MAX(ios, 1.0);
+}
 
 /* check if any bandwith or per IO threshold has been exceeded */
 bps_ret = bdrv_is_bps_threshold_exceeded(bs, is_write);
@@ -360,7 +366,7 @@ static bool bdrv_is_any_threshold_exceeded(BlockDriverState 
*bs, int nb_sectors,
 /* the IO is authorized so do the accounting and return false */
 bs-leaky_buckets.bytes[is_write] += (int64_t)nb_sectors *
  BDRV_SECTOR_SIZE;
-bs-leaky_buckets.ios[is_write]++;
+bs-leaky_buckets.ios[is_write] += ios;
 
 return false;
 }
diff --git a/block/qapi.c b/block/qapi.c
index 03f1604..f81081c 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -259,6 +259,10 @@ void bdrv_query_info(BlockDriverState *bs,
bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE];
 info-inserted-iops_wr_threshold =
bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE];
+info-inserted-has_iops_sector_count =
+   bs-io_limits.iops_sector_count;
+info-inserted-iops_sector_count =
+   bs-io_limits.iops_sector_count;
 }
 
 bs0 = bs;
diff --git a/blockdev.c b/blockdev.c
index 9bda359..8ddd710 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -352,6 +352,11 @@ static bool do_check_io_limits(BlockIOLimit *io_limits, 
Error **errp)
 return false;
 }
 
+if (io_limits-iops_sector_count  0) {
+error_setg(errp, iops_sector_count must be 0 or greater);
+return false;
+}
+
 return true;
 }
 
@@ -563,6 +568,10 @@ DriveInfo *drive_init(QemuOpts *all_opts, 
BlockInterfaceType block_default_type)
 qemu_opt_get_number(opts, iops_wr_threshold,
 io_limits.iops[BLOCK_IO_LIMIT_WRITE] / 
THROTTLE_HZ);
 
+io_limits.iops_sector_count =
+   qemu_opt_get_number(opts, iops_sector_count, 0);
+ 
+
 if (!do_check_io_limits(io_limits, error)) {
 error_report(%s, error_get_pretty(error));
 error_free(error);
@@ -1260,7 +1269,9 @@ void qmp_block_set_io_throttle(const char *device, 
int64_t bps, int64_t bps_rd,
bool has_iops_rd_threshold,
int64_t iops_rd_threshold,
bool has_iops_wr_threshold,
-   int64_t iops_wr_threshold, Error **errp)
+   int64_t iops_wr_threshold,
+   bool has_iops_sector_count,
+   int64_t iops_sector_count, Error **errp)
 {
 BlockIOLimit io_limits;
 BlockDriverState *bs;
@@ -1283,6 +1294,7 @@ void qmp_block_set_io_throttle(const char *device, 
int64_t bps, int64_t bps_rd,
 io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL] = iops / THROTTLE_HZ;
 io_limits.iops_threshold[BLOCK_IO_LIMIT_READ]  = iops_rd / THROTTLE_HZ;
 io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE] = iops_wr / THROTTLE_HZ;
+io_limits.iops_sector_count = 0;
 
 /* override them with givens values if present */
 if (has_bps_threshold) {
@@ -1304,6 +1316,10 @@ void qmp_block_set_io_throttle(const char *device, 
int64_t bps, int64_t bps_rd,
 io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE] = iops_wr_threshold;
 }
 
+if (has_iops_sector_count) {
+io_limits.iops_sector_count = iops_sector_count;
+}
+
 if (!do_check_io_limits(io_limits, errp)) {
 return;
 }
@@ -2007,6 +2023,10 @@ QemuOptsList qemu_common_drive_opts = {
 .type = QEMU_OPT_NUMBER,
 .help = write bytes threshold,
 },{
+.name = iops_sector_count,
+.type = QEMU_OPT_NUMBER,
+.help = when limiting by iops max size of an I/O in sector,
+},{
 .name = copy-on-read,
 .type = QEMU_OPT_BOOL,
 .help = copy read data from backing file into image file,
diff --git a/hmp.c b/hmp.c
index

[Qemu-devel] [PATCH for-1.6 1/4] block: Repair the throttling code.

2013-07-22 Thread Benoît Canet

The throttling code was segfaulting since commit
02ffb504485f0920cfc75a0982a602f824a9a4f4 because some qemu_co_queue_next caller
does not run in a coroutine.
qemu_co_queue_do_restart assume that the caller is a coroutinne.
As sugested by Stefan fix this by entering the coroutine directly.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block.c   |8 
 include/block/coroutine.h |5 +
 qemu-coroutine-lock.c |   14 ++
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/block.c b/block.c
index b560241..dc72643 100644
--- a/block.c
+++ b/block.c
@@ -127,7 +127,8 @@ void bdrv_io_limits_disable(BlockDriverState *bs)
 {
 bs-io_limits_enabled = false;
 
-while (qemu_co_queue_next(bs-throttled_reqs));
+while (qemu_co_enter_next(bs-throttled_reqs)) {
+}
 
 if (bs-block_timer) {
 qemu_del_timer(bs-block_timer);
@@ -143,7 +144,7 @@ static void bdrv_block_timer(void *opaque)
 {
 BlockDriverState *bs = opaque;
 
-qemu_co_queue_next(bs-throttled_reqs);
+qemu_co_enter_next(bs-throttled_reqs);
 }
 
 void bdrv_io_limits_enable(BlockDriverState *bs)
@@ -1445,8 +1446,7 @@ void bdrv_drain_all(void)
  * a busy wait.
  */
 QTAILQ_FOREACH(bs, bdrv_states, list) {
-if (!qemu_co_queue_empty(bs-throttled_reqs)) {
-qemu_co_queue_restart_all(bs-throttled_reqs);
+while (qemu_co_enter_next(bs-throttled_reqs)) {
 busy = true;
 }
 }
diff --git a/include/block/coroutine.h b/include/block/coroutine.h
index 377805a..66e331c 100644
--- a/include/block/coroutine.h
+++ b/include/block/coroutine.h
@@ -138,6 +138,11 @@ bool qemu_co_queue_next(CoQueue *queue);
 void qemu_co_queue_restart_all(CoQueue *queue);
 
 /**
+ * Enter the next coroutine in the queue
+ */
+bool qemu_co_enter_next(CoQueue *queue);
+
+/**
  * Checks if the CoQueue is empty.
  */
 bool qemu_co_queue_empty(CoQueue *queue);
diff --git a/qemu-coroutine-lock.c b/qemu-coroutine-lock.c
index d9fea49..c61d300 100644
--- a/qemu-coroutine-lock.c
+++ b/qemu-coroutine-lock.c
@@ -98,6 +98,20 @@ void qemu_co_queue_restart_all(CoQueue *queue)
 qemu_co_queue_do_restart(queue, false);
 }
 
+bool qemu_co_enter_next(CoQueue *queue)
+{
+Coroutine *next;
+
+next = QTAILQ_FIRST(queue-entries);
+if (!next) {
+return false;
+}
+
+QTAILQ_REMOVE(queue-entries, next, co_queue_next);
+qemu_coroutine_enter(next, NULL);
+return true;
+}
+
 bool qemu_co_queue_empty(CoQueue *queue)
 {
 return (QTAILQ_FIRST(queue-entries) == NULL);
-- 
1.7.10.4

[Qemu-devel] [PATCH for-1.6 3/4] block: Add support for throttling burst threshold in QMP and the command line.

2013-07-22 Thread Benoît Canet

The thresholds of the leaky bucket algorithm can be used to allow some
burstiness.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block/qapi.c |   24 +
 blockdev.c   |  105 +++---
 hmp.c|   32 +++--
 qapi-schema.json |   34 --
 qemu-options.hx  |2 +-
 qmp-commands.hx  |   30 ++--
 6 files changed, 205 insertions(+), 22 deletions(-)

diff --git a/block/qapi.c b/block/qapi.c
index a4bc411..03f1604 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -235,6 +235,30 @@ void bdrv_query_info(BlockDriverState *bs,
bs-io_limits.iops[BLOCK_IO_LIMIT_READ];
 info-inserted-iops_wr =
bs-io_limits.iops[BLOCK_IO_LIMIT_WRITE];
+info-inserted-has_bps_threshold =
+   bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL];
+info-inserted-bps_threshold =
+   bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL];
+info-inserted-has_bps_rd_threshold =
+   bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_READ];
+info-inserted-bps_rd_threshold =
+   bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_READ];
+info-inserted-has_bps_wr_threshold =
+   bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_WRITE];
+info-inserted-bps_wr_threshold =
+   bs-io_limits.bps_threshold[BLOCK_IO_LIMIT_WRITE];
+info-inserted-has_iops_threshold =
+   bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL];
+info-inserted-iops_threshold =
+   bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL];
+info-inserted-iops_rd_threshold =
+   bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_READ];
+info-inserted-has_iops_rd_threshold =
+   bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_READ];
+info-inserted-has_iops_wr_threshold =
+   bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE];
+info-inserted-iops_wr_threshold =
+   bs-io_limits.iops_threshold[BLOCK_IO_LIMIT_WRITE];
 }
 
 bs0 = bs;
diff --git a/blockdev.c b/blockdev.c
index a78fba4..9bda359 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -341,6 +341,17 @@ static bool do_check_io_limits(BlockIOLimit *io_limits, 
Error **errp)
 return false;
 }
 
+if (io_limits-bps_threshold[BLOCK_IO_LIMIT_TOTAL]  0 ||
+io_limits-bps_threshold[BLOCK_IO_LIMIT_WRITE]  0 ||
+io_limits-bps_threshold[BLOCK_IO_LIMIT_READ]  0 ||
+io_limits-iops_threshold[BLOCK_IO_LIMIT_TOTAL]  0 ||
+io_limits-iops_threshold[BLOCK_IO_LIMIT_WRITE]  0 ||
+io_limits-iops_threshold[BLOCK_IO_LIMIT_READ]  0) {
+error_setg(errp,
+   threshold values must be 0 or greater);
+return false;
+}
+
 return true;
 }
 
@@ -523,24 +534,34 @@ DriveInfo *drive_init(QemuOpts *all_opts, 
BlockInterfaceType block_default_type)
qemu_opt_get_number(opts, bps_rd, 0);
 io_limits.bps[BLOCK_IO_LIMIT_WRITE]  =
qemu_opt_get_number(opts, bps_wr, 0);
+/* bps thresholds */
+io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL]  =
+qemu_opt_get_number(opts, bps_threshold,
+io_limits.bps[BLOCK_IO_LIMIT_TOTAL] / THROTTLE_HZ);
+io_limits.bps_threshold[BLOCK_IO_LIMIT_READ]  =
+qemu_opt_get_number(opts, bps_rd_threshold,
+io_limits.bps[BLOCK_IO_LIMIT_READ] / THROTTLE_HZ);
+io_limits.bps_threshold[BLOCK_IO_LIMIT_WRITE]  =
+qemu_opt_get_number(opts, bps_wr_threshold,
+io_limits.bps[BLOCK_IO_LIMIT_WRITE] / THROTTLE_HZ);
+
 io_limits.iops[BLOCK_IO_LIMIT_TOTAL] =
qemu_opt_get_number(opts, iops, 0);
 io_limits.iops[BLOCK_IO_LIMIT_READ]  =
qemu_opt_get_number(opts, iops_rd, 0);
 io_limits.iops[BLOCK_IO_LIMIT_WRITE] =
qemu_opt_get_number(opts, iops_wr, 0);
-io_limits.bps_threshold[BLOCK_IO_LIMIT_TOTAL] =
-   io_limits.bps[BLOCK_IO_LIMIT_TOTAL] / THROTTLE_HZ;
-io_limits.bps_threshold[BLOCK_IO_LIMIT_READ]  =
-   io_limits.bps[BLOCK_IO_LIMIT_READ] / THROTTLE_HZ;
-io_limits.bps_threshold[BLOCK_IO_LIMIT_WRITE] =
-   io_limits.bps[BLOCK_IO_LIMIT_WRITE] / THROTTLE_HZ;
-io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL] =
-   io_limits.iops[BLOCK_IO_LIMIT_TOTAL] / THROTTLE_HZ;
+
+/* iops thresholds */
+io_limits.iops_threshold[BLOCK_IO_LIMIT_TOTAL]  =
+qemu_opt_get_number(opts, iops_threshold,
+

[Qemu-devel] [PATCH for-1.6 0/4] Leaky bucket throttling and features

2013-07-22 Thread Benoît Canet

The first patch fixes the throttling which was broken by a previous commit.

The next patch replace the existing throttling algorithm by the well described
leaky bucket algorithm.

Third patch implement bursting by adding *_threshold parameters to
qmp_block_set_io_throttle.

The last one allow to define the max size of an io when throttling by iops via
iops_sector_count to avoid vm users cheating on the iops limit.

Benoît Canet (4):
  block: Repair the throttling code.
  block: Modify the throttling code to implement the leaky bucket
algorithm.
  block: Add support for throttling burst threshold in QMP and the
command line.
  block: Add iops_sector_count to do the iops accounting for a given io
size.

 block.c   |  424 +
 block/qapi.c  |   28 +++
 blockdev.c|  174 +--
 hmp.c |   36 +++-
 include/block/block_int.h |   16 +-
 include/block/coroutine.h |5 +
 qapi-schema.json  |   40 -
 qemu-coroutine-lock.c |   14 ++
 qemu-options.hx   |2 +-
 qmp-commands.hx   |   34 +++-
 10 files changed, 561 insertions(+), 212 deletions(-)

-- 
1.7.10.4

[Qemu-devel] [PATCH for-1.6 2/4] block: Modify the throttling code to implement the leaky bucket algorithm.

2013-07-22 Thread Benoît Canet

This patch replace the previous algorithm by the well described leaky bucket
algorithm: A bucket is filled by the incoming IOs and a periodic timer decrement
the counter to make the bucket leak. When a given threshold is reached the
bucket is full and the IOs are hold.

In this patch the threshold is set to a default value to make the code behave
like the previous implementation.

In the next patch the threshold will be exposed in QMP to let the user control
the burstiness of the throttling.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block.c   |  410 +
 blockdev.c|   71 ++--
 include/block/block_int.h |   15 +-
 3 files changed, 299 insertions(+), 197 deletions(-)

diff --git a/block.c b/block.c
index dc72643..2d6e9b4 100644
--- a/block.c
+++ b/block.c
@@ -86,13 +86,6 @@ static void coroutine_fn bdrv_co_do_rw(void *opaque);
 static int coroutine_fn bdrv_co_do_write_zeroes(BlockDriverState *bs,
 int64_t sector_num, int nb_sectors);
 
-static bool bdrv_exceed_bps_limits(BlockDriverState *bs, int nb_sectors,
-bool is_write, double elapsed_time, uint64_t *wait);
-static bool bdrv_exceed_iops_limits(BlockDriverState *bs, bool is_write,
-double elapsed_time, uint64_t *wait);
-static bool bdrv_exceed_io_limits(BlockDriverState *bs, int nb_sectors,
-bool is_write, int64_t *wait);
-
 static QTAILQ_HEAD(, BlockDriverState) bdrv_states =
 QTAILQ_HEAD_INITIALIZER(bdrv_states);
 
@@ -101,6 +94,8 @@ static QLIST_HEAD(, BlockDriver) bdrv_drivers =
 
 /* If non-zero, use only whitelisted block drivers */
 static int use_bdrv_whitelist;
+/* boolean used to inform the throttling code that a bdrv_drain_all is issued 
*/
+static bool draining;
 
 #ifdef _WIN32
 static int is_windows_drive_prefix(const char *filename)
@@ -135,15 +130,122 @@ void bdrv_io_limits_disable(BlockDriverState *bs)
 qemu_free_timer(bs-block_timer);
 bs-block_timer = NULL;
 }
+}
 
-bs-slice_start = 0;
-bs-slice_end   = 0;
+static void bdrv_make_bps_buckets_leak(BlockDriverState *bs, int64_t delta)
+{
+int64_t *bytes = bs-leaky_buckets.bytes;
+int64_t read_leak, write_leak;
+
+/* the limit apply to both reads and writes */
+if (bs-io_limits.bps[BLOCK_IO_LIMIT_TOTAL]) {
+/* compute half the total leak */
+int64_t leak = ((bs-io_limits.bps[BLOCK_IO_LIMIT_TOTAL] * delta) /
+   NANOSECONDS_PER_SECOND);
+int remain = leak % 2;
+leak /= 2;
+
+/* the read bucket is smaller than half the quantity to leak so take
+ * care adding the leak difference to write leak
+ */
+if (bytes[BLOCK_IO_LIMIT_READ] = leak) {
+read_leak = bytes[BLOCK_IO_LIMIT_READ];
+write_leak = 2 * leak + remain - bytes[BLOCK_IO_LIMIT_READ];
+/* symetric case */
+} else if (bytes[BLOCK_IO_LIMIT_WRITE] = leak) {
+write_leak = bytes[BLOCK_IO_LIMIT_WRITE];
+read_leak = 2 * leak + remain - bytes[BLOCK_IO_LIMIT_WRITE];
+/* both bucket above leak count use half the total leak for both */
+} else {
+write_leak = leak;
+read_leak = leak + remain;
+}
+/* else we consider that limits are separated */
+} else {
+read_leak = (bs-io_limits.bps[BLOCK_IO_LIMIT_READ] * delta) /
+NANOSECONDS_PER_SECOND;
+write_leak = (bs-io_limits.bps[BLOCK_IO_LIMIT_WRITE] * delta) /
+ NANOSECONDS_PER_SECOND;
+}
+
+/* make the buckets leak */
+bytes[BLOCK_IO_LIMIT_READ]  = MAX(bytes[BLOCK_IO_LIMIT_READ] - read_leak,
+  0);
+bytes[BLOCK_IO_LIMIT_WRITE] = MAX(bytes[BLOCK_IO_LIMIT_WRITE] - write_leak,
+  0);
 }
 
+static void bdrv_make_iops_buckets_leak(BlockDriverState *bs, int64_t delta)
+{
+double *ios = bs-leaky_buckets.ios;
+int64_t read_leak, write_leak;
+
+/* the limit apply to both reads and writes */
+if (bs-io_limits.iops[BLOCK_IO_LIMIT_TOTAL]) {
+/* compute half the total leak */
+int64_t leak = ((bs-io_limits.iops[BLOCK_IO_LIMIT_TOTAL] * delta) /
+   NANOSECONDS_PER_SECOND);
+int remain = leak % 2;
+leak /= 2;
+
+/* the read bucket is smaller than half the quantity to leak so take
+ * care adding the leak difference to write leak
+ */
+if (ios[BLOCK_IO_LIMIT_READ] = leak) {
+read_leak = ios[BLOCK_IO_LIMIT_READ];
+write_leak = 2 * leak + remain - ios[BLOCK_IO_LIMIT_READ];
+/* symetric case */
+} else if (ios[BLOCK_IO_LIMIT_WRITE] = leak) {
+write_leak = ios[BLOCK_IO_LIMIT_WRITE];
+read_leak = 2 * leak + remain - ios[BLOCK_IO_LIMIT_WRITE];
+/* both bucket above leak count use half the total leak for both */
+} else {
+write_leak

[Qemu-devel] [RFC V8 01/24] qcow2: Add journal specification.

2013-06-20 Thread Benoît Canet

---
 docs/specs/qcow2.txt |   42 ++
 1 file changed, 42 insertions(+)

diff --git a/docs/specs/qcow2.txt b/docs/specs/qcow2.txt
index 36a559d..a4ffc85 100644
--- a/docs/specs/qcow2.txt
+++ b/docs/specs/qcow2.txt
@@ -350,3 +350,45 @@ Snapshot table entry:
 variable:   Unique ID string for the snapshot (not null terminated)
 
 variable:   Name of the snapshot (not null terminated)
+
+== Journal ==
+
+QCOW2 can use one or more instance of a metadata journal.
+
+A journal is a sequential log of journal entries appended on a previously
+allocated and reseted area.
+A journal is designed like a linked list with each entry pointing to the next
+so it's easy to iterate over entries.
+
+A journal uses the following constants to denote the type of each entry
+
+TYPE_NONE = 0xFF  default value of any bytes in a reseted journal
+TYPE_END  = 1 the entry ends a journal cluster and point to the next
+  cluster
+TYPE_HASH = 2 the entry contains a deduplication hash
+
+QCOW2 journal entry:
+
+Byte 0 :Size of the entry: size = 2 + n with size = 254
+
+ 1 :Type of the entry
+
+ 2 - size  :The optional n bytes structure carried by entry
+
+A journal is divided into clusters and no journal entry can be spilled on two
+clusters. This avoid having to read more than one cluster to get a single 
entry.
+
+For this purpose an entry with the end type is added at the end of a journal
+cluster before starting to write in the next cluster.
+The size of such an entry is set so the entry points to the next cluster.
+
+As any journal cluster must be ended with an end entry the size of regular
+journal entries is limited to 254 bytes in order to always left room for an end
+entry which mimimal size is two bytes.
+
+The only cases where size  254 are none entries where size = 255.
+
+The replay of a journal stop when the first end none entry is reached.
+
+The journal cluster size is 4096 bytes.
+
-- 
1.7.10.4

[Qemu-devel] [RFC V8 00/24] QCOW2 deduplication core functionality

2013-06-20 Thread Benoît Canet

Hello,

This is the new patchset of the QCOW2 deduplication feature using an on disk
key value store modelled after the SILT one.

This is more a millestone reached post than a for review post.

Performance is slow because the code is issuing a lot of read and write to the
SSD in a sequential ways.
The next revision will focus on parallelising the IOs.

Kevin: the first patch add the description of the QCOW2 journal structure.
   The code make it easy to have multiple journal in QCOW2.
   Do you think it could fit others QCOW2 usages.
   Patches describing the others data structure are comming.

Best regards

Benoît

Benoît Canet (24):
  qcow2: Add journal specification.
  qcow2: Add deduplication structures and fields.
  qcow2: Add journal.
  qcow2: Create the log store.
  qcow2: Add the hash store.
  qcow2: Add the deduplication store.
  qcow2: Add qcow2_dedup_read_missing_and_concatenate
  qcow2: Create a way to link to l2 tables when deduplicating.
  qcow2: Make qcow2_update_cluster_refcount public.
  qcow2: Add qcow2_dedup and related functions
  qcow2: Add qcow2_dedup_store_new_hashes.
  qcow2: Do allocate on rewrite on the dedup case.
  qcow2: Implement qcow2_compute_cluster_hash.
  qcow2: Load and save deduplication table header extension.
  qcow2: Extract qcow2_set_incompat_feature and
qcow2_clear_incompat_feature.
  block: Add qcow2_dedup format and image creation code.
  qcow2: Drop hash for a given cluster when dedup makes refcount 
2^16/2.
  qcow2: Remove hash when cluster is deleted.
  qcow2: Integrate deduplication in qcow2_co_writev loop.
  qcow2: Serialize write requests when deduplication is activated.
  qcow2: Integrate SKEIN hash algorithm in deduplication.
  qcow2: Add qcow2_dedup_init and qcow2_dedup_close.
  qcow2: Enable the deduplication feature.
  qcow2: Enable deduplication tests

 block/Makefile.objs  |1 +
 block/qcow2-cluster.c|   13 +-
 block/qcow2-dedup.c  |  778 +
 block/qcow2-hash-store.c |  802 ++
 block/qcow2-journal.c|  587 +++
 block/qcow2-log-store.c  | 1281 ++
 block/qcow2-refcount.c   |   20 +-
 block/qcow2-store.c  |  771 +
 block/qcow2.c|  431 +-
 block/qcow2.h|  357 +++-
 configure|  121 +++-
 docs/specs/qcow2.txt |   42 ++
 include/block/block_int.h|1 +
 tests/qemu-iotests/common|6 +
 tests/qemu-iotests/common.rc |3 +-
 15 files changed, 5144 insertions(+), 70 deletions(-)
 create mode 100644 block/qcow2-dedup.c
 create mode 100644 block/qcow2-hash-store.c
 create mode 100644 block/qcow2-journal.c
 create mode 100644 block/qcow2-log-store.c
 create mode 100644 block/qcow2-store.c

-- 
1.7.10.4

[Qemu-devel] [RFC V8 04/24] qcow2: Create the log store.

2013-06-20 Thread Benoît Canet

The log store is an in RAM hash table used to index QCowJournalEntries.
Combining the log store with a journal allows to get the benefits of the journal
(fewer random write and minimization of SSD wear out) with the advantages of a
hash table (being good at random access).

Once the journal is full and not packable or if the log store hash table is full
the journal is frozen and converted into an on disk hash table used by the hash
store.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block/Makefile.objs |2 +-
 block/qcow2-log-store.c | 1039 +++
 block/qcow2.h   |   26 ++
 3 files changed, 1066 insertions(+), 1 deletion(-)
 create mode 100644 block/qcow2-log-store.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index ee894b5..8c0b646 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -1,6 +1,6 @@
 block-obj-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o 
vvfat.o
 block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o 
qcow2-cache.o
-block-obj-y += qcow2-journal.o
+block-obj-y += qcow2-journal.o qcow2-log-store.o
 block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
 block-obj-y += qed-check.o
 block-obj-y += vhdx.o
diff --git a/block/qcow2-log-store.c b/block/qcow2-log-store.c
new file mode 100644
index 000..24ae310
--- /dev/null
+++ b/block/qcow2-log-store.c
@@ -0,0 +1,1039 @@
+/*
+ * QCOW2 log store
+ *
+ * Copyright (C) Nodalink, SARL. 2013
+ *
+ * Author:
+ *   Benoît Canet benoit.ca...@irqsave.net
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the Software), to 
deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include qemu-common.h
+#include block/block_int.h
+#include block/qcow2.h
+
+static uint64_t qcow2_dedup_get_h(QCowHashInfo *hash_info, int index,
+  uint32_t order)
+{
+uint64_t *array = (uint64_t *) hash_info-hash.data;
+uint64_t result = array[index];
+return result  ((1  order) - 1);
+}
+
+uint64_t qcow2_dedup_h1(QCowHashInfo *hash_info, uint32_t order)
+{
+return qcow2_dedup_get_h(hash_info, 0, order);
+}
+
+uint64_t qcow2_dedup_h2(QCowHashInfo *hash_info, uint32_t order)
+{
+return qcow2_dedup_get_h(hash_info, 1, order);
+}
+
+static bool qcow2_dedup_h_equals(QCowHashInfo *hash_info, uint32_t order)
+{
+return qcow2_dedup_h1(hash_info, order) ==
+   qcow2_dedup_h2(hash_info, order);
+}
+
+static void qcow2_log_store_reset_kick_map(QCowLogStore *store)
+{
+uint64_t i;
+for (i = 0; i  qcow2_log_store_nb_entries(store-order); i++) {
+store-kick_map[i] = false;
+}
+}
+
+static bool qcow2_log_store_is_in_kick_path(QCowLogStore *store,
+uint64_t in_table_idx)
+{
+return store-kick_map[in_table_idx];
+}
+
+static void qcow2_log_store_set_in_kick_path(QCowLogStore *store,
+ uint64_t in_table_idx)
+{
+store-kick_map[in_table_idx] = true;
+}
+
+static void qcow2_log_store_unset_from_kick_path(QCowLogStore *store,
+ uint64_t in_table_idx)
+{
+store-kick_map[in_table_idx] = false;
+}
+
+static void qcow2_log_store_set_perm_element(int *perm, int value, int rnd)
+{
+for (;perm[rnd % QCOW_LOG_STORE_BUCKET_SIZE] != -1; rnd++);
+perm[rnd % QCOW_LOG_STORE_BUCKET_SIZE] = value;
+}
+
+/* This function create a random permutation of the set
+ * [0, QCOW_LOG_STORE_BUCKET_SIZE]
+ */
+static void qcow2_log_store_make_perm(int *perm)
+{
+int i, rnd;
+
+/* reset */
+for (i = 0; i  QCOW_LOG_STORE_BUCKET_SIZE; i++) {
+perm[i] = -1;
+}
+
+/* create permutation */
+for (i = 0; i  QCOW_LOG_STORE_BUCKET_SIZE; i++) {
+rnd = rand();
+qcow2_log_store_set_perm_element(perm, i, rnd);
+}
+}
+
+/* This function is used to make sure we don't create path cycle while kicking
+ *
+ * The function uses a random permutation to compute

[Qemu-devel] [RFC V8 07/24] qcow2: Add qcow2_dedup_read_missing_and_concatenate

2013-06-20 Thread Benoît Canet

This function is used to read missing data when unaligned writes are
done. This function also concatenate missing data with the given
qiov data in order to prepare a buffer used to look for duplicated
clusters.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block/Makefile.objs |2 +-
 block/qcow2-dedup.c |  121 +++
 block/qcow2.c   |   35 +++
 block/qcow2.h   |   13 ++
 4 files changed, 170 insertions(+), 1 deletion(-)
 create mode 100644 block/qcow2-dedup.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 2d1a269..088407a 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -1,6 +1,6 @@
 block-obj-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o 
vvfat.o
 block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o 
qcow2-cache.o
-block-obj-y += qcow2-journal.o qcow2-log-store.o qcow2-hash-store.o 
qcow2-store.o
+block-obj-y += qcow2-journal.o qcow2-log-store.o qcow2-hash-store.o 
qcow2-store.o qcow2-dedup.o
 block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
 block-obj-y += qed-check.o
 block-obj-y += vhdx.o
diff --git a/block/qcow2-dedup.c b/block/qcow2-dedup.c
new file mode 100644
index 000..bc6e2c2
--- /dev/null
+++ b/block/qcow2-dedup.c
@@ -0,0 +1,121 @@
+/*
+ * Deduplication for the QCOW2 format
+ *
+ * Copyright (C) Nodalink, SARL. 2012-2013
+ *
+ * Author:
+ *   Benoît Canet benoit.ca...@irqsave.net
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the Software), to 
deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include block/block_int.h
+#include qemu-common.h
+#include qcow2.h
+
+/*
+ * Prepare a buffer containing everything required to compute cluster
+ * sized deduplication hashes.
+ * If sector_num or nb_sectors are not cluster-aligned, missing data
+ * before/after the qiov will be read.
+ *
+ * @qiov:   the qiov for which missing data must be read
+ * @sector_num: the first sectors that must be read into the qiov
+ * @nb_sectors: the number of sectors to read into the qiov
+ * @data:   the place where the data will be concatenated and 
stored
+ *   the caller is responsible to use qemu_vfree() to
+ *   data on success.
+ * @nb_data_sectors:the resulting size of the contatenated data (in 
sectors)
+ * @ret:negative on error
+ */
+int qcow2_dedup_read_missing_and_concatenate(BlockDriverState *bs,
+ QEMUIOVector *qiov,
+ uint64_t sector_num,
+ int nb_sectors,
+ uint8_t **data,
+ int *nb_data_sectors)
+{
+BDRVQcowState *s = bs-opaque;
+int ret = 0;
+uint64_t cluster_beginning_sector;
+uint64_t first_sector_after_qiov;
+int cluster_beginning_nr;
+int cluster_ending_nr;
+int unaligned_ending_nr;
+uint64_t max_cluster_ending_nr;
+
+/* compute how much and where to read at the beginning */
+cluster_beginning_nr = sector_num  (s-cluster_sectors - 1);
+cluster_beginning_sector = sector_num - cluster_beginning_nr;
+
+/* for the ending */
+first_sector_after_qiov = sector_num + nb_sectors;
+unaligned_ending_nr = first_sector_after_qiov  (s-cluster_sectors - 1);
+cluster_ending_nr = unaligned_ending_nr ?
+s-cluster_sectors - unaligned_ending_nr : 0;
+
+/* compute total size in sectors and allocate memory */
+*nb_data_sectors = cluster_beginning_nr + nb_sectors + cluster_ending_nr;
+*data = qemu_blockalign(bs, *nb_data_sectors * BDRV_SECTOR_SIZE);
+
+/* read beginning */
+if (cluster_beginning_nr) {
+ret = qcow2_read_cluster_data(bs,
+  *data,
+  cluster_beginning_sector

[Qemu-devel] [RFC V8 09/24] qcow2: Make qcow2_update_cluster_refcount public.

2013-06-20 Thread Benoît Canet

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block/qcow2-refcount.c |   17 ++---
 block/qcow2.h  |3 +++
 2 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index b32738f..3bd8f37 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -520,9 +520,9 @@ fail:
  * If the return value is non-negative, it is the new refcount of the cluster.
  * If it is negative, it is -errno and indicates an error.
  */
-static int update_cluster_refcount(BlockDriverState *bs,
-   int64_t cluster_index,
-   int addend)
+int qcow2_update_cluster_refcount(BlockDriverState *bs,
+  int64_t cluster_index,
+  int addend)
 {
 BDRVQcowState *s = bs-opaque;
 int ret;
@@ -649,7 +649,7 @@ int64_t qcow2_alloc_bytes(BlockDriverState *bs, int size)
 if (free_in_cluster == 0)
 s-free_byte_offset = 0;
 if ((offset  (s-cluster_size - 1)) != 0)
-update_cluster_refcount(bs, offset  s-cluster_bits, 1);
+qcow2_update_cluster_refcount(bs, offset  s-cluster_bits, 1);
 } else {
 offset = qcow2_alloc_clusters(bs, s-cluster_size);
 if (offset  0) {
@@ -659,7 +659,7 @@ int64_t qcow2_alloc_bytes(BlockDriverState *bs, int size)
 if ((cluster_offset + s-cluster_size) == offset) {
 /* we are lucky: contiguous data */
 offset = s-free_byte_offset;
-update_cluster_refcount(bs, offset  s-cluster_bits, 1);
+qcow2_update_cluster_refcount(bs, offset  s-cluster_bits, 1);
 s-free_byte_offset += size;
 } else {
 s-free_byte_offset = offset;
@@ -795,7 +795,9 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
 } else {
 uint64_t cluster_index = (offset  L2E_OFFSET_MASK)  
s-cluster_bits;
 if (addend != 0) {
-refcount = update_cluster_refcount(bs, 
cluster_index, addend);
+refcount = qcow2_update_cluster_refcount(bs,
+   cluster_index,
+   addend);
 } else {
 refcount = get_refcount(bs, cluster_index);
 }
@@ -827,7 +829,8 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
 
 
 if (addend != 0) {
-refcount = update_cluster_refcount(bs, l2_offset  
s-cluster_bits, addend);
+refcount = qcow2_update_cluster_refcount(bs,
+   l2_offset  s-cluster_bits, addend);
 } else {
 refcount = get_refcount(bs, l2_offset  s-cluster_bits);
 }
diff --git a/block/qcow2.h b/block/qcow2.h
index 7754065..3238083 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -568,6 +568,9 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
 
 int qcow2_check_refcounts(BlockDriverState *bs, BdrvCheckResult *res,
   BdrvCheckMode fix);
+int qcow2_update_cluster_refcount(BlockDriverState *bs,
+  int64_t cluster_index,
+  int addend);
 
 /* qcow2-cluster.c functions */
 int qcow2_grow_l1_table(BlockDriverState *bs, int min_size, bool exact_size);
-- 
1.7.10.4

[Qemu-devel] [RFC V8 03/24] qcow2: Add journal.

2013-06-20 Thread Benoît Canet

This commit add the code required to manage one or more journals in a qcow2 
file.
The primary user of this journal will be the qcow2-log-store.c.
The journal is asynchronous and will require it's users to issue flushs in order
to make sure that entries had reached stable storage.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block/Makefile.objs   |1 +
 block/qcow2-journal.c |  587 +
 block/qcow2.h |   29 +++
 3 files changed, 617 insertions(+)
 create mode 100644 block/qcow2-journal.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 5f0358a..ee894b5 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -1,5 +1,6 @@
 block-obj-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o 
vvfat.o
 block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o 
qcow2-cache.o
+block-obj-y += qcow2-journal.o
 block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
 block-obj-y += qed-check.o
 block-obj-y += vhdx.o
diff --git a/block/qcow2-journal.c b/block/qcow2-journal.c
new file mode 100644
index 000..693de37
--- /dev/null
+++ b/block/qcow2-journal.c
@@ -0,0 +1,587 @@
+/*
+ * QCOW2 journal
+ *
+ * Copyright (C) Nodalink, SARL. 2013
+ *
+ * Author:
+ *   Benoît Canet benoit.ca...@irqsave.net
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the Software), to 
deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include qemu-common.h
+#include block/block_int.h
+#include block/qcow2.h
+
+/* This function reset the state of a journal
+ *
+ * @journal: the journal to initialize
+ * @sector:  the on disk sector where the journal start
+ * @size:the size of the journal
+ */
+static void qcow2_journal_reset(QCowJournal *journal,
+uint64_t sector,
+uint64_t size)
+{
+journal-sector = sector;
+journal-size = size;
+journal-index = 0;
+/* clear write buffer */
+memset(journal-write_buf, QCOW_LOG_NONE, JOURNAL_CLUSTER_SIZE * 2);
+journal-offset_in_buf = 0;
+journal-flushed = true;
+/* clear read cache */
+memset(journal-read_cache, QCOW_LOG_NONE, JOURNAL_CLUSTER_SIZE * 2);
+/* so we will load the cache at first hit */
+journal-read_index = -1;
+/* journal need to be resumed */
+journal-started = false;
+}
+
+/* This function set the journal to the correct state after having replaying it
+ *
+ * @journal: the journal to resume
+ * @offset:  the offset inside the journal
+ * @ret: 0 on success, -errno on error
+ */
+int qcow2_journal_resume(BlockDriverState *bs,
+ QCowJournal *journal,
+ uint64_t offset)
+{
+uint64_t disk_offset;
+
+/* flush read cache */
+journal-read_index = -1;
+
+/* mark the journal as resumed */
+journal-started = true;
+
+/* reset state */
+journal-index = offset / JOURNAL_CLUSTER_SIZE;
+journal-offset_in_buf = offset % JOURNAL_CLUSTER_SIZE;
+
+/* compute on disk offset of the current cluster */
+disk_offset = journal-sector * BDRV_SECTOR_SIZE +
+  journal-index * JOURNAL_CLUSTER_SIZE;
+
+/* read the current cluster in buffer */
+return bdrv_pread(bs-file,
+  disk_offset,
+  journal-write_buf,
+  JOURNAL_CLUSTER_SIZE);
+
+}
+
+uint64_t qcow2_journal_round_to_erase_blocks(uint64_t size)
+{
+return ((size / SSD_ERASE_BLOCK_SIZE) + 1) * SSD_ERASE_BLOCK_SIZE;
+}
+
+/* This function reset a journal content to none
+ *
+ * To be called at the end of the incarnate/freeze coroutine 
+ *
+ * @journal: the QCowJournal to reset
+ * @in_coroutine: true if called from a coroutine
+ * @ret: 0 on success, -errno on error
+ */
+static int qcow2_journal_set_to_none(BlockDriverState *bs,
+ QCowJournal *journal,
+ bool in_coroutine)
+{
+int ret = 0;
+uint8_t

[Qemu-devel] [RFC V8 08/24] qcow2: Create a way to link to l2 tables when deduplicating.

2013-06-20 Thread Benoît Canet

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block/qcow2-cluster.c |6 --
 block/qcow2.h |6 ++
 2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index c71470a..d6db0b9 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -696,7 +696,7 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, 
QCowL2Meta *m)
 old_cluster[j++] = l2_table[l2_index + i];
 
 l2_table[l2_index + i] = cpu_to_be64((cluster_offset +
-(i  s-cluster_bits)) | QCOW_OFLAG_COPIED);
+(i  s-cluster_bits)) | m-l2_entry_flags);
  }
 
 
@@ -709,7 +709,7 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, 
QCowL2Meta *m)
  * If this was a COW, we need to decrease the refcount of the old cluster.
  * Also flush bs-file to get the right order for L2 and refcount update.
  */
-if (j != 0) {
+if (!m-overwrite  j != 0) {
 for (i = 0; i  j; i++) {
 qcow2_free_any_clusters(bs, be64_to_cpu(old_cluster[i]), 1);
 }
@@ -1098,6 +1098,8 @@ static int handle_alloc(BlockDriverState *bs, uint64_t 
guest_offset,
 .offset = nb_sectors * BDRV_SECTOR_SIZE,
 .nb_sectors = avail_sectors - nb_sectors,
 },
+.l2_entry_flags = QCOW_OFLAG_COPIED,
+.overwrite  = false,
 };
 qemu_co_queue_init((*m)-dependent_requests);
 QLIST_INSERT_HEAD(s-cluster_allocs, *m, next_in_flight);
diff --git a/block/qcow2.h b/block/qcow2.h
index 9a9abd3..7754065 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -438,6 +438,12 @@ typedef struct QCowL2Meta
  */
 CoQueue dependent_requests;
 
+/* contains the flags to apply to the l2 entry */
+uint64_t l2_entry_flags;
+
+/* set to true if we are overwriting an L2 table entry */
+bool overwrite;
+
 /**
  * The COW Region between the start of the first allocated cluster and the
  * area the guest actually writes to.
-- 
1.7.10.4

[Qemu-devel] [RFC V8 05/24] qcow2: Add the hash store.

2013-06-20 Thread Benoît Canet

The hash store allow to do lookup in the set of previously frozen log stores.
Each frozen log store is called an incarnation. When the number of incarnation
reach a user defined setting oldest incarnations are dropped to avoid that the
dedup metadata grow without limit.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block/Makefile.objs  |2 +-
 block/qcow2-hash-store.c |  802 ++
 block/qcow2-log-store.c  |  242 ++
 block/qcow2.h|   38 +++
 4 files changed, 1083 insertions(+), 1 deletion(-)
 create mode 100644 block/qcow2-hash-store.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index 8c0b646..cafd4f8 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -1,6 +1,6 @@
 block-obj-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o 
vvfat.o
 block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o 
qcow2-cache.o
-block-obj-y += qcow2-journal.o qcow2-log-store.o
+block-obj-y += qcow2-journal.o qcow2-log-store.o qcow2-hash-store.o
 block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
 block-obj-y += qed-check.o
 block-obj-y += vhdx.o
diff --git a/block/qcow2-hash-store.c b/block/qcow2-hash-store.c
new file mode 100644
index 000..5284740
--- /dev/null
+++ b/block/qcow2-hash-store.c
@@ -0,0 +1,802 @@
+/*
+ * QCOW2 hash store
+ *
+ * Copyright (C) Nodalink, SARL. 2013
+ *
+ * Author:
+ *   Benoît Canet benoit.ca...@irqsave.net
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the Software), to 
deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include qemu-common.h
+#include block/block_int.h
+#include block/qcow2.h
+
+/* This function compute the number of bytes required to store a tag in hash
+ * store filters
+ *
+ * This function will only return values in the set [2, 4, 8] matching
+ * convenient compiler type sizes.
+ *
+ * @order: order of the hash store
+ * @ret:   the number of byte required to store a tag
+ */
+int qcow2_hash_store_get_tag_size(uint32_t order)
+{
+if (order  16) {
+return 2;
+}
+
+if (order  32) {
+return 4;
+}
+
+return 8;
+}
+
+static uint64_t qcow2_hash_store_round_to_cluster(uint64_t size)
+{
+return ((size / HASH_STORE_CLUSTER_SIZE) + 1) * HASH_STORE_CLUSTER_SIZE;
+}
+
+/* This function compute the size of an in ram filter
+ *
+ * @order: the order of the store
+ * @ret:   the size of the in ram filter in bytes
+ */
+uint64_t qcow2_hash_store_filter_size(uint32_t order)
+{
+uint64_t size = qcow2_log_store_nb_entries(order) *
+qcow2_hash_store_get_tag_size(order);
+return qcow2_hash_store_round_to_cluster(size);
+}
+
+/* this function make sure that no buckets are splitted between two clusters */
+static uint64_t qcow2_hash_store_round_to_bucket(uint64_t count)
+{
+return (count / QCOW_LOG_STORE_BUCKET_SIZE) * QCOW_LOG_STORE_BUCKET_SIZE;
+}
+
+uint64_t qcow2_hash_store_nb_hash_info_per_cluster(void)
+{
+uint64_t raw_nb = HASH_STORE_CLUSTER_SIZE / sizeof(QCowHashInfo);
+return qcow2_hash_store_round_to_bucket(raw_nb);
+}
+
+/* This function compute the size of a frozen hash table that is dumped to disk
+ *
+ * @order: the order of the store
+ * @ret:   the size in bytes
+ */
+static uint64_t qcow2_hash_store_table_size(uint32_t order)
+{
+uint32_t nb_per_cluster = qcow2_hash_store_nb_hash_info_per_cluster();
+uint64_t count = (qcow2_log_store_nb_entries(order) / nb_per_cluster) + 1;
+return count * HASH_STORE_CLUSTER_SIZE;
+}
+
+/* This function compute the size of the result of a log store incarnation
+ *
+ * @order: the order of the store
+ * @ret:   the size in bytes
+ */
+uint64_t qcow2_hash_store_incarnation_size(uint32_t order)
+{
+return qcow2_hash_store_filter_size(order) +
+   qcow2_hash_store_table_size(order);
+}
+
+/* This function reset a hash store
+ *
+ * @store:   the hash store to initialize
+ * @order:   the bit order of the hash store

[Qemu-devel] [RFC V8 11/24] qcow2: Add qcow2_dedup_store_new_hashes.

2013-06-20 Thread Benoît Canet

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block/qcow2-dedup.c |   97 ++-
 block/qcow2.h   |6 +++-
 2 files changed, 101 insertions(+), 2 deletions(-)

diff --git a/block/qcow2-dedup.c b/block/qcow2-dedup.c
index 0daf77e..ffbf866 100644
--- a/block/qcow2-dedup.c
+++ b/block/qcow2-dedup.c
@@ -251,6 +251,7 @@ static int qcow2_dedup_link_l2(BlockDriverState *bs,
 static int qcow2_clear_l2_copied_flag_if_needed(BlockDriverState *bs,
 QCowHashInfo *hash_info)
 {
+BDRVQcowState *s = bs-opaque;
 int ret = 0;
 uint64_t first_logical_sect = hash_info-first_logical_sect;
 
@@ -273,7 +274,8 @@ static int 
qcow2_clear_l2_copied_flag_if_needed(BlockDriverState *bs,
 /* remember that we don't need to clear QCOW_OFLAG_COPIED again */
 hash_info-first_logical_sect = first_logical_sect;
 
-return 0;
+/* clear the QCOW_OFLAG_COPIED flag from disk */
+return qcow2_store_insert(bs, s-key_value_store, hash_info);
 }
 
 /* This function deduplicate a cluster
@@ -534,3 +536,96 @@ exit:
 
 return deduped_clusters_nr * s-cluster_sectors - start_index;
 }
+
+static inline bool is_hash_info_empty(QCowHashInfo *hash_info)
+{
+return hash_info-physical_sect  QCOW_DEDUP_FLAG_EMPTY;
+}
+
+/* This function store a hash information to disk and RAM
+ *
+ * @hash_info:  the QCowHashInfo to process
+ * @logical_sect:   the logical sector of the cluster seen by the guest
+ * @physical_sect:  the physical sector of the stored cluster
+ * @ret:0 on success, negative on error
+ */
+static int qcow2_store_hash(BlockDriverState *bs,
+QCowHashInfo *hash_info,
+uint64_t logical_sect,
+uint64_t physical_sect)
+{
+BDRVQcowState *s = bs-opaque;
+int ret = 0;
+
+ret = qcow2_store_get(bs, s-key_value_store, hash_info);
+
+/* no hash info found in store or error */
+if (ret = 0) {
+return ret;
+}
+
+/* the hash info information are already completed */
+if (!is_hash_info_empty(hash_info)) {
+return 0;
+}
+
+/* Remember that this QCowHashInfo represents the first occurrence of the
+ * cluster so we will be able to clear QCOW_OFLAG_COPIED from the L2 table
+ * entry when refcount will go  1.
+ */
+logical_sect = logical_sect | QCOW_OFLAG_COPIED;
+
+/* fill the missing fields of the hash info */
+hash_info-physical_sect = physical_sect;
+hash_info-first_logical_sect = logical_sect;
+
+/* write the hash into the key value store */
+return qcow2_store_insert(bs, s-key_value_store, hash_info);
+}
+
+/* This function store the hashes of the clusters which are not duplicated
+ *
+ * @ds:The deduplication state
+ * @count: the number of dedup hash to process
+ * @logical_sect:  logical offset of the first cluster (in sectors)
+ * @physical_sect: offset of the first cluster (in sectors)
+ * @ret:   0 on succes, errno on error
+ */
+int qcow2_dedup_store_new_hashes(BlockDriverState *bs,
+ QCowDedupState *ds,
+ int count,
+ uint64_t logical_sect,
+ uint64_t physical_sect)
+{
+int ret = 0;
+int i = 0;
+BDRVQcowState *s = bs-opaque;
+QCowHashElement *dedup_hash, *next_dedup_hash;
+
+/* round values on cluster boundaries for easier cluster deletion */
+logical_sect = logical_sect  ~(s-cluster_sectors - 1);
+physical_sect = physical_sect  ~(s-cluster_sectors - 1);
+
+QTAILQ_FOREACH_SAFE(dedup_hash, ds-undedupables, next, next_dedup_hash) {
+
+ret = qcow2_store_hash(bs,
+   dedup_hash-hash_info,
+   logical_sect + i * s-cluster_sectors,
+   physical_sect + i * s-cluster_sectors);
+
+QTAILQ_REMOVE(ds-undedupables, dedup_hash, next);
+g_free(dedup_hash);
+
+if (ret  0) {
+break;
+}
+
+i++;
+
+if (i == count) {
+break;
+}
+}
+
+return ret;
+}
diff --git a/block/qcow2.h b/block/qcow2.h
index 3346842..ff26e81 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -742,6 +742,10 @@ int qcow2_dedup(BlockDriverState *bs,
 uint64_t sector_num,
 uint8_t *data,
 int data_nr);
-
+int qcow2_dedup_store_new_hashes(BlockDriverState *bs,
+ QCowDedupState *ds,
+ int count,
+ uint64_t logical_sect,
+ uint64_t physical_sect);
 
 #endif
-- 
1.7.10.4

[Qemu-devel] [RFC V8 06/24] qcow2: Add the deduplication store.

2013-06-20 Thread Benoît Canet

The key value store is an SSD optimised store used to store deduplication's
QCowHashInfo.
It binds together the QCowJournal, the QCowLogStore and the QCowHashStore and
must be called be the deduplication code.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block/Makefile.objs |2 +-
 block/qcow2-store.c |  771 +++
 block/qcow2.h   |   20 +-
 3 files changed, 791 insertions(+), 2 deletions(-)
 create mode 100644 block/qcow2-store.c

diff --git a/block/Makefile.objs b/block/Makefile.objs
index cafd4f8..2d1a269 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -1,6 +1,6 @@
 block-obj-y += raw.o cow.o qcow.o vdi.o vmdk.o cloop.o dmg.o bochs.o vpc.o 
vvfat.o
 block-obj-y += qcow2.o qcow2-refcount.o qcow2-cluster.o qcow2-snapshot.o 
qcow2-cache.o
-block-obj-y += qcow2-journal.o qcow2-log-store.o qcow2-hash-store.o
+block-obj-y += qcow2-journal.o qcow2-log-store.o qcow2-hash-store.o 
qcow2-store.o
 block-obj-y += qed.o qed-gencb.o qed-l2-cache.o qed-table.o qed-cluster.o
 block-obj-y += qed-check.o
 block-obj-y += vhdx.o
diff --git a/block/qcow2-store.c b/block/qcow2-store.c
new file mode 100644
index 000..8e3fad5
--- /dev/null
+++ b/block/qcow2-store.c
@@ -0,0 +1,771 @@
+/*
+ * QCOW2 key value store for SSD storage
+ *
+ * Copyright (C) Nodalink, SARL. 2013
+ *
+ * Author:
+ *   Benoît Canet benoit.ca...@irqsave.net
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the Software), to 
deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED AS IS, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include qemu-common.h
+#include block/block_int.h
+#include block/qcow2.h
+
+/* The qcow2_store combine the qcow2_log_store with the qcow2_hash_store-
+ *
+ * insertions:
+ * Entries are inserted in the qcow2_log_store which delegate writing to disk
+ * to the qcow2_journal.
+ * When full the qcow2_log_store is frozen into an incarnation and given to the
+ * qcow2_hash_store- A new qcow2_log store is then created
+ *
+ * deletions:
+ * Deletions are insertions with an additional flag set and a mandatory
+ * write flush of the journal: loosing an insertion is not a problem whereas
+ * loosing a deletion is a problem.
+ *
+ * queries:
+ * The log store is first queried and if the entry is not found in it the hash
+ * store is queried.
+ * When queried the hash store it will check for the presence of the
+ * QCowHashInfo in every incarnation starting from the youngest to the oldest.
+ *
+ * fifo:
+ * A user configurable setting allow the qcow2_store to act as a fifo.
+ * the oldest QCowHashInfos stored in it will be dropped an incarnation at 
once.
+ * Loosing the oldest QCowHashInfo won't corrupt the dedup metadata the only
+ * risk if to make to dedup ratio lower.
+ *
+ * algorithm:
+ * This code is an implementation of the first two stages of the SILT paper.
+ * The third stage was dropped to lower the complexity of the code
+ * The idea of behaving as a FIFO and the notion of incarnations comes from
+ * BufferHash and is handy to help reducing RAM consumption.
+ *
+ * SILT: www.cs.cmu.edu/~dga/papers/silt-sosp2011.pdf
+ * BufferHash: research.microsoft.com/en-us/people/sumann/bufferhash-nsdi10.pdf
+ */
+
+/* This function initialize a key value store
+ *
+ * @store: the store to initialize
+ * @order: the bit order
+ */
+static void qcow2_store_init(BlockDriverState *bs,
+ QCowStore *store,
+ uint32_t order)
+{
+store-order = order;
+qemu_co_mutex_init(store-insert_lock);
+qcow2_log_store_init(bs, store-log_store, order);
+qcow2_log_store_init(bs, store-frozen_log_store, order);
+qcow2_hash_store_init(store-hash_store, order);
+}
+
+/* This function cleanup a key value store
+ *
+ * @store: the store to cleanup
+ */
+void qcow2_store_cleanup(QCowStore *store)
+{
+qcow2_log_store_cleanup(store-log_store);
+qcow2_log_store_cleanup(store-frozen_log_store);
+qcow2_hash_store_cleanup(store-hash_store);
+}
+
+/* This function look for a given hash info

[Qemu-devel] [RFC V8 16/24] block: Add qcow2_dedup format and image creation code.

2013-06-20 Thread Benoît Canet

Also modify qemu-io-test.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block/qcow2.c|  176 --
 include/block/block_int.h|1 +
 tests/qemu-iotests/common.rc |3 +-
 3 files changed, 173 insertions(+), 7 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 58e6236..e1265a2 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1313,7 +1313,8 @@ static int preallocate(BlockDriverState *bs)
 static int qcow2_create2(const char *filename, int64_t total_size,
  const char *backing_file, const char *backing_format,
  int flags, size_t cluster_size, int prealloc,
- QEMUOptionParameter *options, int version)
+ QEMUOptionParameter *options, int version,
+ bool dedup, uint8_t hash_algo)
 {
 /* Calculate cluster_bits */
 int cluster_bits;
@@ -1417,11 +1418,39 @@ static int qcow2_create2(const char *filename, int64_t 
total_size,
 }
 
 /* Okay, now that we have a valid image, let's give it the right size */
+BDRVQcowState *s = bs-opaque;
 ret = bdrv_truncate(bs, total_size * BDRV_SECTOR_SIZE);
 if (ret  0) {
 goto out;
 }
 
+if (dedup) {
+s-has_dedup = true;
+/* fill cluster size for qcow2_store_create */
+s-cluster_size = cluster_size;
+/* unlimited incarnation number */
+s-dedup_max_incarnations = 0;
+s-dedup_hash_algo = hash_algo;
+
+ret = qcow2_set_incompat_feature(bs, QCOW2_INCOMPAT_DEDUP);
+if (ret  0) {
+goto out;
+}
+
+/* this will create the journal and save the conf to disk then
+ * save header extension
+ */
+ret = qcow2_store_create(bs,
+ s-key_value_store,
+ total_size * BDRV_SECTOR_SIZE);
+
+if (ret  0) {
+goto out;
+}
+
+qcow2_store_cleanup(bs, s-key_value_store);
+}
+
 /* Want a backing file? There you go.*/
 if (backing_file) {
 ret = bdrv_change_backing_file(bs, backing_file, backing_format);
@@ -1447,15 +1476,41 @@ out:
 return ret;
 }
 
+static int qcow2_warn_if_version_3_is_needed(int version,
+ bool has_feature,
+ const char *feature)
+{
+if (version  3  has_feature) {
+fprintf(stderr, %s only supported with compatibility 
+level 1.1 and above (use compat=1.1 or greater)\n,
+feature);
+return -EINVAL;
+}
+return 0;
+}
+
+static int8_t qcow2_get_dedup_hash_algo(char *value)
+{
+if (!value || !strcmp(value, sha256)) {
+return QCOW_HASH_SHA256;
+}
+
+error_printf(Unsupported deduplication hash algorithm.\n);
+return -EINVAL;
+}
+
 static int qcow2_create(const char *filename, QEMUOptionParameter *options)
 {
 const char *backing_file = NULL;
 const char *backing_fmt = NULL;
 uint64_t sectors = 0;
 int flags = 0;
+int ret;
 size_t cluster_size = DEFAULT_CLUSTER_SIZE;
 int prealloc = 0;
 int version = 2;
+bool dedup = false;
+int8_t hash_algo = 0;
 
 /* Read out options */
 while (options  options-name) {
@@ -1493,6 +1548,13 @@ static int qcow2_create(const char *filename, 
QEMUOptionParameter *options)
 }
 } else if (!strcmp(options-name, BLOCK_OPT_LAZY_REFCOUNTS)) {
 flags |= options-value.n ? BLOCK_FLAG_LAZY_REFCOUNTS : 0;
+} else if (!strcmp(options-name, BLOCK_OPT_DEDUP)) {
+hash_algo = qcow2_get_dedup_hash_algo(options-value.s);
+if (hash_algo  0) {
+return hash_algo;
+}
+dedup = true;
+version = 3;
 }
 options++;
 }
@@ -1503,14 +1565,22 @@ static int qcow2_create(const char *filename, 
QEMUOptionParameter *options)
 return -EINVAL;
 }
 
-if (version  3  (flags  BLOCK_FLAG_LAZY_REFCOUNTS)) {
-fprintf(stderr, Lazy refcounts only supported with compatibility 
-level 1.1 and above (use compat=1.1 or greater)\n);
-return -EINVAL;
+ret = qcow2_warn_if_version_3_is_needed(version,
+flags  BLOCK_FLAG_LAZY_REFCOUNTS,
+Lazy refcounts);
+if (ret  0) {
+return ret;
+}
+ret = qcow2_warn_if_version_3_is_needed(version,
+dedup,
+Deduplication);
+if (ret  0) {
+return ret;
 }
 
 return qcow2_create2(filename, sectors, backing_file, backing_fmt, flags,
- cluster_size, prealloc, options, version);
+ cluster_size, prealloc, options, version,
+ dedup,

[Qemu-devel] [RFC V8 13/24] qcow2: Implement qcow2_compute_cluster_hash.

2013-06-20 Thread Benoît Canet

Also factorize detection of libgnutls with vnc tls.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block/qcow2-dedup.c |   17 +-
 configure   |   86 +--
 2 files changed, 79 insertions(+), 24 deletions(-)

diff --git a/block/qcow2-dedup.c b/block/qcow2-dedup.c
index ffbf866..2d2c15c 100644
--- a/block/qcow2-dedup.c
+++ b/block/qcow2-dedup.c
@@ -28,6 +28,10 @@
 #include block/block_int.h
 #include qemu-common.h
 #include qcow2.h
+#ifdef CONFIG_SHA256_DEDUP
+#include gnutls/gnutls.h
+#include gnutls/crypto.h
+#endif
 
 /*
  * Prepare a buffer containing everything required to compute cluster
@@ -142,7 +146,18 @@ static int qcow2_compute_cluster_hash(BlockDriverState *bs,
QCowHash *hash,
uint8_t *data)
 {
-return 0;
+BDRVQcowState *s = bs-opaque;
+switch (s-dedup_hash_algo) {
+#ifdef CONFIG_SHA256_DEDUP
+case QCOW_HASH_SHA256:
+return gnutls_hash_fast(GNUTLS_DIG_SHA256, data,
+s-cluster_size, hash-data);
+#endif
+default:
+error_report(Invalid deduplication hash algorithm %i,
+ s-dedup_hash_algo);
+abort();
+}
 }
 
 /*
diff --git a/configure b/configure
index e818e8b..34b26b2 100755
--- a/configure
+++ b/configure
@@ -240,6 +240,7 @@ gtk=
 gtkabi=2.0
 tpm=no
 libssh2=
+sha256_dedup=yes
 
 # parse CC options first
 for opt do
@@ -930,6 +931,10 @@ for opt do
   ;;
   --enable-libssh2) libssh2=yes
   ;;
+  --disable-sha256-dedup) sha256_dedup=no
+  ;;
+  --enable-sha256-dedup) sha256_dedup=yes
+  ;;
   *) echo ERROR: unknown option $opt; show_help=yes
   ;;
   esac
@@ -1190,6 +1195,8 @@ echo   --with-coroutine=BACKEND coroutine backend. 
Supported options:
 echogthread, ucontext, sigaltstack, windows
 echo   --enable-glusterfs   enable GlusterFS backend
 echo   --disable-glusterfs  disable GlusterFS backend
+echo   --disable-sha256-dedup   disable sha256 dedup
+echo   --enable-sha256-dedupenables sha256 dedup
 echo   --enable-gcovenable test coverage analysis with gcov
 echo   --gcov=GCOV  use specified gcov [$gcov_tool]
 echo   --enable-tpm enable TPM support
@@ -1812,32 +1819,60 @@ EOF
 fi
 
 ##
-# VNC TLS/WS detection
-if test $vnc = yes -a \( $vnc_tls != no -o $vnc_ws != no \) ; then
-  cat  $TMPC EOF
+# gnutls detection (factorize the VNC TLS and SHA256 deduplication test)
+cat  $TMPC EOF
 #include gnutls/gnutls.h
-int main(void) { gnutls_session_t s; gnutls_init(s, GNUTLS_SERVER); return 0; 
}
+#include gnutls/crypto.h
+int main(void) {char data[4096], digest[32];
+gnutls_hash_fast(GNUTLS_DIG_SHA256, data, 4096, digest);
+return 0;
+}
 EOF
-  vnc_tls_cflags=`$pkg_config --cflags gnutls 2 /dev/null`
-  vnc_tls_libs=`$pkg_config --libs gnutls 2 /dev/null`
-  if compile_prog $vnc_tls_cflags $vnc_tls_libs ; then
-if test $vnc_tls != no ; then
-  vnc_tls=yes
-fi
-if test $vnc_ws != no ; then
-  vnc_ws=yes
-fi
-libs_softmmu=$vnc_tls_libs $libs_softmmu
-QEMU_CFLAGS=$QEMU_CFLAGS $vnc_tls_cflags
+gnu_tls_cflags=`$pkg_config --cflags gnutls 2 /dev/null`
+gnu_tls_libs=`$pkg_config --libs gnutls 2 /dev/null`
+if compile_prog $gnu_tls_cflags $gnu_tls_libs ; then
+  gnu_tls=yes
+else
+  gnu_tls=no
+fi
+
+##
+# VNC TLS/WS
+if test $vnc = yes -a $gnu_tls != no; then
+  if test $vnc_tls != no ; then
+libs_softmmu=$gnu_tls_libs $libs_softmmu
+libs_tools=$gnu_tls_libs $libs_softmmu
+QEMU_CFLAGS=$QEMU_CFLAGS $gnu_tls_cflags
+vnc_tls=yes
+  fi
+  if test $vnc_ws != no ; then
+libs_softmmu=$gnu_tls_libs $libs_softmmu
+libs_tools=$gnu_tls_libs $libs_softmmu
+QEMU_CFLAGS=$QEMU_CFLAGS $gnu_tls_cflags
+vnc_ws=yes
+  fi
+else
+  if test $vnc_tls = yes ; then
+feature_not_found vnc-tls
+  fi
+  if test $vnc_ws = yes ; then
+feature_not_found vnc-ws
+  fi
+  vnc_tls=no
+  vnc_ws=no
+fi
+
+##
+# SHA256 deduplication
+if test $sha256_dedup = yes; then
+  if test $gnu_tls = yes; then
+libs_softmmu=$gnu_tls_libs $libs_softmmu
+libs_tools=$gnu_tls_libs $libs_softmmu
+QEMU_CFLAGS=$QEMU_CFLAGS $gnu_tls_cflags
+sha256_dedup=yes
   else
-if test $vnc_tls = yes ; then
-  feature_not_found vnc-tls
-fi
-if test $vnc_ws = yes ; then
-  feature_not_found vnc-ws
-fi
-vnc_tls=no
-vnc_ws=no
+echo gnutls  2.10.0 required to compile QEMU with sha256 deduplication
+exit 1
   fi
 fi
 
@@ -3562,6 +3597,7 @@ echo seccomp support   $seccomp
 echo coroutine backend $coroutine
 echo GlusterFS support $glusterfs
 echo virtio-blk-data-plane $virtio_blk_data_plane
+echo sha256-dedup  $sha256_dedup
 echo gcov  $gcov_tool
 echo gcov enabled  $gcov
 echo TPM support   $tpm
@@

[Qemu-devel] [RFC V8 15/24] qcow2: Extract qcow2_set_incompat_feature and qcow2_clear_incompat_feature.

2013-06-20 Thread Benoît Canet

Also change callers.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block/qcow2-cluster.c |2 +-
 block/qcow2.c |   43 ++-
 block/qcow2.h |7 ---
 3 files changed, 27 insertions(+), 25 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 41c4bc2..7cd6e75 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -672,7 +672,7 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, 
QCowL2Meta *m)
 
 /* Update L2 table. */
 if (s-use_lazy_refcounts) {
-qcow2_mark_dirty(bs);
+qcow2_set_incompat_feature(bs, QCOW2_INCOMPAT_DIRTY);
 }
 if (qcow2_need_accurate_refcounts(s)) {
 qcow2_cache_set_dependency(bs, s-l2_table_cache,
diff --git a/block/qcow2.c b/block/qcow2.c
index 3cd1051..58e6236 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -250,56 +250,57 @@ static void report_unsupported_feature(BlockDriverState 
*bs,
 }
 
 /*
- * Sets the dirty bit and flushes afterwards if necessary.
+ * Sets the an incompatible feature bit and flushes afterwards if necessary.
  *
  * The incompatible_features bit is only set if the image file header was
  * updated successfully.  Therefore it is not required to check the return
  * value of this function.
  */
-int qcow2_mark_dirty(BlockDriverState *bs)
+int qcow2_set_incompat_feature(BlockDriverState *bs,
+   QCow2IncompatibleFeature feature)
 {
 BDRVQcowState *s = bs-opaque;
 uint64_t val;
-int ret;
+int ret = 0;
 
 assert(s-qcow_version = 3);
 
-if (s-incompatible_features  QCOW2_INCOMPAT_DIRTY) {
-return 0; /* already dirty */
+if (s-incompatible_features  feature) {
+return 0; /* already added */
 }
 
-val = cpu_to_be64(s-incompatible_features | QCOW2_INCOMPAT_DIRTY);
+val = cpu_to_be64(s-incompatible_features | feature);
 ret = bdrv_pwrite(bs-file, offsetof(QCowHeader, incompatible_features),
   val, sizeof(val));
 if (ret  0) {
 return ret;
 }
-ret = bdrv_flush(bs-file);
-if (ret  0) {
-return ret;
-}
 
-/* Only treat image as dirty if the header was updated successfully */
-s-incompatible_features |= QCOW2_INCOMPAT_DIRTY;
+/* Only treat image as having the feature if the header was updated
+ * successfully
+ */
+s-incompatible_features |= feature;
 return 0;
 }
 
 /*
- * Clears the dirty bit and flushes before if necessary.  Only call this
- * function when there are no pending requests, it does not guard against
- * concurrent requests dirtying the image.
+ * Clears an incompatible feature bit and flushes before if necessary.
+ * Only call this function when there are no pending requests, it does not
+ * guard against concurrent requests adding a feature to the image.
  */
-static int qcow2_mark_clean(BlockDriverState *bs)
+static int qcow2_clear_incompat_feature(BlockDriverState *bs,
+QCow2IncompatibleFeature feature)
 {
 BDRVQcowState *s = bs-opaque;
+int ret = 0;
 
-if (s-incompatible_features  QCOW2_INCOMPAT_DIRTY) {
-int ret = bdrv_flush(bs);
+if (s-incompatible_features  feature) {
+ret = bdrv_flush(bs);
 if (ret  0) {
 return ret;
 }
 
-s-incompatible_features = ~QCOW2_INCOMPAT_DIRTY;
+s-incompatible_features = ~feature;
 return qcow2_update_header(bs);
 }
 return 0;
@@ -314,7 +315,7 @@ static int qcow2_check(BlockDriverState *bs, 
BdrvCheckResult *result,
 }
 
 if (fix  result-check_errors == 0  result-corruptions == 0) {
-return qcow2_mark_clean(bs);
+return qcow2_clear_incompat_feature(bs, QCOW2_INCOMPAT_DIRTY);
 }
 return ret;
 }
@@ -949,7 +950,7 @@ static void qcow2_close(BlockDriverState *bs)
 qcow2_cache_flush(bs, s-l2_table_cache);
 qcow2_cache_flush(bs, s-refcount_block_cache);
 
-qcow2_mark_clean(bs);
+qcow2_clear_incompat_feature(bs, QCOW2_INCOMPAT_DIRTY);
 
 qcow2_cache_destroy(bs, s-l2_table_cache);
 qcow2_cache_destroy(bs, s-refcount_block_cache);
diff --git a/block/qcow2.h b/block/qcow2.h
index ff26e81..720131d 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -290,14 +290,14 @@ enum {
 };
 
 /* Incompatible feature bits */
-enum {
+typedef enum {
 QCOW2_INCOMPAT_DIRTY_BITNR   = 0,
 QCOW2_INCOMPAT_DIRTY = 1  QCOW2_INCOMPAT_DIRTY_BITNR,
 QCOW2_INCOMPAT_DEDUP_BITNR   = 1,
 QCOW2_INCOMPAT_DEDUP = 1  QCOW2_INCOMPAT_DEDUP_BITNR,
 
 QCOW2_INCOMPAT_MASK  = QCOW2_INCOMPAT_DIRTY | QCOW2_INCOMPAT_DEDUP,
-};
+} QCow2IncompatibleFeature;
 
 /* Compatible feature bits */
 enum {
@@ -543,7 +543,8 @@ static inline uint64_t l2meta_cow_end(QCowL2Meta *m)
 int qcow2_backing_read1(BlockDriverState *bs, QEMUIOVector *qiov,
   int64_t sector_num, int nb_sectors);
 
-int qcow2_mark_dirty(BlockDriverState *bs);

[Qemu-devel] [RFC V8 21/24] qcow2: Integrate SKEIN hash algorithm in deduplication.

2013-06-20 Thread Benoît Canet

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block/qcow2-dedup.c |   15 +++
 block/qcow2.c   |5 +
 configure   |   35 +++
 3 files changed, 55 insertions(+)

diff --git a/block/qcow2-dedup.c b/block/qcow2-dedup.c
index 599cb2e..606459f 100644
--- a/block/qcow2-dedup.c
+++ b/block/qcow2-dedup.c
@@ -33,6 +33,10 @@
 #include gnutls/crypto.h
 #endif
 
+#ifdef CONFIG_SKEIN_DEDUP
+#include skeinApi.h
+#endif
+
 /*
  * Prepare a buffer containing everything required to compute cluster
  * sized deduplication hashes.
@@ -153,6 +157,17 @@ static int qcow2_compute_cluster_hash(BlockDriverState *bs,
 return gnutls_hash_fast(GNUTLS_DIG_SHA256, data,
 s-cluster_size, hash-data);
 #endif
+#if defined(CONFIG_SKEIN_DEDUP)
+case QCOW_HASH_SKEIN:
+{
+SkeinCtx_t ctx;
+skeinCtxPrepare(ctx, Skein256);
+skeinInit(ctx, Skein256);
+skeinUpdate(ctx, data, s-cluster_size);
+skeinFinal(ctx, hash-data);
+}
+return 0;
+#endif
 default:
 error_report(Invalid deduplication hash algorithm %i,
  s-dedup_hash_algo);
diff --git a/block/qcow2.c b/block/qcow2.c
index 11c115f..ea2f0f2 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1592,6 +1592,11 @@ static int8_t qcow2_get_dedup_hash_algo(char *value)
 if (!value || !strcmp(value, sha256)) {
 return QCOW_HASH_SHA256;
 }
+#if defined(CONFIG_SKEIN_DEDUP)
+if (!strcmp(value, skein)) {
+return QCOW_HASH_SKEIN;
+}
+#endif
 
 error_printf(Unsupported deduplication hash algorithm.\n);
 return -EINVAL;
diff --git a/configure b/configure
index 34b26b2..902c3c1 100755
--- a/configure
+++ b/configure
@@ -241,6 +241,7 @@ gtkabi=2.0
 tpm=no
 libssh2=
 sha256_dedup=yes
+skein_dedup=no
 
 # parse CC options first
 for opt do
@@ -935,6 +936,10 @@ for opt do
   ;;
   --enable-sha256-dedup) sha256_dedup=yes
   ;;
+  --disable-skein-dedup) skein_dedup=no
+  ;;
+  --enable-skein-dedup) skein_dedup=yes
+  ;;
   *) echo ERROR: unknown option $opt; show_help=yes
   ;;
   esac
@@ -1202,6 +1207,7 @@ echo   --gcov=GCOV  use specified gcov 
[$gcov_tool]
 echo   --enable-tpm enable TPM support
 echo   --disable-libssh2disable ssh block device support
 echo   --enable-libssh2 enable ssh block device support
+echo   --enable-skein-dedup enable computing dedup hashes with SKEIN
 echo 
 echo NOTE: The object files are built at the place where configure is 
launched
 exit 1
@@ -2613,6 +2619,30 @@ EOF
   fi
 fi
 
+##
+# SKEIN dedup hash function probe
+if test $skein_dedup != no ; then
+  cat  $TMPC EOF
+#include skeinApi.h
+int main(void) {
+SkeinCtx_t ctx;
+skeinCtxPrepare(ctx, 512);
+return 0;
+}
+EOF
+  skein_libs=-lskein3fish
+  if compile_prog  $skein_libs ; then
+skein_dedup=yes
+libs_tools=$skein_libs $libs_tools
+libs_softmmu=$skein_libs $libs_softmmu
+  else
+if test $skein_dedup = yes ; then
+  feature_not_found libskein3fish not found
+fi
+skein_dedup=no
+  fi
+fi
+
 #
 # Check for xxxat() functions when we are building linux-user
 # emulator.  This is done because older glibc versions don't
@@ -3603,6 +3633,7 @@ echo gcov enabled  $gcov
 echo TPM support   $tpm
 echo libssh2 support   $libssh2
 echo TPM passthrough   $tpm_passthrough
+echo SKEIN support $skein_dedup
 
 if test $sdl_too_old = yes; then
 echo - Your SDL version is too old - please upgrade to have SDL support
@@ -3991,6 +4022,10 @@ if test $sha256_dedup = yes ; then
   echo CONFIG_SHA256_DEDUP=y  $config_host_mak
 fi
 
+if test $skein_dedup = yes ; then
+  echo CONFIG_SKEIN_DEDUP=y  $config_host_mak
+fi
+
 # USB host support
 case $usb in
 linux)
-- 
1.7.10.4

[Qemu-devel] [RFC V8 23/24] qcow2: Enable the deduplication feature.

2013-06-20 Thread Benoît Canet

---
 block/qcow2.c |   17 +
 1 file changed, 17 insertions(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index f7b94dd..bb7bf74 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -569,6 +569,13 @@ static int qcow2_open(BlockDriverState *bs, QDict 
*options, int flags)
 goto fail;
 }
 
+if (s-incompatible_features  QCOW2_INCOMPAT_DEDUP) {
+ret = qcow2_dedup_init(bs);
+if (ret  0) {
+goto fail;
+}
+}
+
 #ifdef DEBUG_ALLOC
 {
 BdrvCheckResult result = {0};
@@ -1043,8 +1050,13 @@ fail:
 static void qcow2_close(BlockDriverState *bs)
 {
 BDRVQcowState *s = bs-opaque;
+
 g_free(s-l1_table);
 
+if (s-has_dedup) {
+qcow2_dedup_close(bs);
+}
+
 qcow2_cache_flush(bs, s-l2_table_cache);
 qcow2_cache_flush(bs, s-refcount_block_cache);
 
@@ -1547,6 +1559,11 @@ static int qcow2_create2(const char *filename, int64_t 
total_size,
 }
 
 qcow2_store_cleanup(s-key_value_store);
+
+ret = qcow2_dedup_init(bs);
+if (ret  0) {
+goto out;
+}
 }
 
 /* Want a backing file? There you go.*/
-- 
1.7.10.4

[Qemu-devel] [RFC V8 12/24] qcow2: Do allocate on rewrite on the dedup case.

2013-06-20 Thread Benoît Canet

This patch does allocate on rewrite when deduplication is on.
This get rid of the need of removing the old hash of the lookup structure
when a cluster get rewritten.
The old data is left in place and will be collected/deleted when it's cluster
will reach 0.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block/qcow2-cluster.c |5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index d6db0b9..41c4bc2 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -878,7 +878,8 @@ static int handle_copied(BlockDriverState *bs, uint64_t 
guest_offset,
 cluster_offset = be64_to_cpu(l2_table[l2_index]);
 
 /* Check how many clusters are already allocated and don't need COW */
-if (qcow2_get_cluster_type(cluster_offset) == QCOW2_CLUSTER_NORMAL
+if (!s-has_dedup 
+qcow2_get_cluster_type(cluster_offset) == QCOW2_CLUSTER_NORMAL
  (cluster_offset  QCOW_OFLAG_COPIED))
 {
 /* If a specific host_offset is required, check it */
@@ -1028,7 +1029,7 @@ static int handle_alloc(BlockDriverState *bs, uint64_t 
guest_offset,
 /* For the moment, overwrite compressed clusters one by one */
 if (entry  QCOW_OFLAG_COMPRESSED) {
 nb_clusters = 1;
-} else {
+} else if (!s-has_dedup) {
 nb_clusters = count_cow_clusters(s, nb_clusters, l2_table, l2_index);
 }
 
-- 
1.7.10.4

[Qemu-devel] [RFC V8 10/24] qcow2: Add qcow2_dedup and related functions

2013-06-20 Thread Benoît Canet

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block/qcow2-dedup.c |  415 +++
 block/qcow2.h   |5 +
 2 files changed, 420 insertions(+)

diff --git a/block/qcow2-dedup.c b/block/qcow2-dedup.c
index bc6e2c2..0daf77e 100644
--- a/block/qcow2-dedup.c
+++ b/block/qcow2-dedup.c
@@ -119,3 +119,418 @@ fail:
 *data = NULL;
 return ret;
 }
+
+/*
+ * Set a QCowHashInfo into the new state
+ *
+ * @hash_info: the hash_info to set to new
+ */
+static void qcow2_set_hash_info_new(QCowHashInfo *hash_info)
+{
+hash_info-physical_sect = QCOW_DEDUP_FLAG_EMPTY;
+hash_info-first_logical_sect = QCOW_DEDUP_FLAG_EMPTY;
+}
+
+/*
+ * Compute the hash of a given cluster
+ *
+ * @data: a buffer containing the cluster data
+ * @hash: a QCowHash where to store the computed hash
+ * @ret:  0 on success, negative on error
+ */
+static int qcow2_compute_cluster_hash(BlockDriverState *bs,
+   QCowHash *hash,
+   uint8_t *data)
+{
+return 0;
+}
+
+/*
+ * Get a QCowHashInfo corresponding to a cluster data
+ *
+ * @phash:   if phash can be used no hash is computed
+ * @data:a buffer containing the cluster
+ * @ret: 1 if if hash found, 0 if not found, -errno on error
+ */
+static int qcow2_get_persistent_hash_for_cluster(BlockDriverState *bs,
+ QcowPersistentHash *phash,
+ uint8_t *data)
+{
+BDRVQcowState *s = bs-opaque;
+int ret = 0;
+
+/* no hash has been provided compute it and store it for later usage */
+if (!phash-reuse) {
+ret = qcow2_compute_cluster_hash(bs,
+ phash-hash_info.hash,
+ data);
+}
+
+/* do not reuse the hash anymore if it was precomputed */
+phash-reuse = false;
+
+/* error */
+if (ret  0) {
+return ret;
+}
+
+/* ask the store if it knows this QCowHashInfo */
+return qcow2_store_get(bs, s-key_value_store, phash-hash_info);
+}
+
+/*
+ * Insert a QCowHashInfo into the store as new
+ *
+ * @hash_info: the QCowHashInfo to set to new and insert
+ * @ret:   0 on succes, -errno on error
+ */
+static int qcow2_add_new_hash_info_to_store(BlockDriverState *bs,
+QCowHashInfo *hash_info)
+{
+BDRVQcowState *s = bs-opaque;
+
+/* set the QCowHashInfo to the new state so we will remember to fill these
+ * field later with real values
+ */
+qcow2_set_hash_info_new(hash_info);
+return qcow2_store_insert(bs, s-key_value_store, hash_info);
+}
+
+/*
+ * Helper used to build a QCowHashElement
+ *
+ * @hash: the QCowHash to use
+ * @ret:  a newly allocated QCowHashElement containing the given hash
+ */
+static QCowHashElement *qcow2_dedup_hash_new(QCowHash *hash)
+{
+QCowHashElement *dedup_hash;
+dedup_hash = g_new0(QCowHashElement, 1);
+memcpy(dedup_hash-hash_info.hash.data, hash-data, HASH_LENGTH);
+return dedup_hash;
+}
+
+/*
+ * Helper used to link a deduplicated cluster in the l2
+ *
+ * @logical_sect:  the cluster sector seen by the guest
+ * @physical_sect: the cluster sector in the QCOW2 file
+ * @overwrite: true if we must overwrite the L2 table entry
+ * @ret:
+ */
+static int qcow2_dedup_link_l2(BlockDriverState *bs,
+   uint64_t logical_sect,
+   uint64_t physical_sect,
+   bool overwrite)
+{
+QCowL2Meta m = {
+.alloc_offset   = physical_sect  9,
+.offset = logical_sect  9,
+.nb_clusters= 1,
+.nb_available   = 0,
+.cow_start = {
+.offset = 0,
+.nb_sectors = 0,
+},
+.cow_end = {
+.offset = 0,
+.nb_sectors = 0,
+},
+.l2_entry_flags = 0,
+.overwrite  = overwrite,
+};
+return qcow2_alloc_cluster_link_l2(bs, m);
+}
+
+/* Clear the QCOW_OFLAG_COPIED from the first L2 entry written for a physical
+ * cluster.
+ *
+ * @hash_info: the duplicated hash info
+ * @ret:   0 on success, negative on error
+ */
+static int qcow2_clear_l2_copied_flag_if_needed(BlockDriverState *bs,
+QCowHashInfo *hash_info)
+{
+int ret = 0;
+uint64_t first_logical_sect = hash_info-first_logical_sect;
+
+/* QCOW_OFLAG_COPIED already cleared - do nothing */
+if (!(first_logical_sect  QCOW_OFLAG_COPIED)) {
+return 0;
+}
+
+first_logical_sect = ~QCOW_OFLAG_COPIED;
+
+/* overwrite first L2 entry to clear QCOW_FLAG_COPIED */
+ret = qcow2_dedup_link_l2(bs, first_logical_sect,
+  hash_info-physical_sect,
+  true);
+
+if (ret  0) {
+return ret;
+}
+
+/*

[Qemu-devel] [RFC V8 14/24] qcow2: Load and save deduplication table header extension.

2013-06-20 Thread Benoît Canet

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block/qcow2.c |   53 +
 1 file changed, 53 insertions(+)

diff --git a/block/qcow2.c b/block/qcow2.c
index 34b2a87..3cd1051 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -54,9 +54,19 @@ typedef struct {
 uint32_t len;
 } QCowExtension;
 
+typedef struct {
+uint64_t offset;
+uint32_t size;
+uint32_t max;
+uint32_t flags;
+uint8_t  hash_algo;
+char reserved[56];
+} QCowDedupConfExtension;
+
 #define  QCOW2_EXT_MAGIC_END 0
 #define  QCOW2_EXT_MAGIC_BACKING_FORMAT 0xE2792ACA
 #define  QCOW2_EXT_MAGIC_FEATURE_TABLE 0x6803f857
+#define  QCOW2_EXT_MAGIC_DEDUP_TABLE 0xCD8E819B
 
 static int qcow2_probe(const uint8_t *buf, int buf_size, const char *filename)
 {
@@ -85,6 +95,7 @@ static int qcow2_read_extensions(BlockDriverState *bs, 
uint64_t start_offset,
 QCowExtension ext;
 uint64_t offset;
 int ret;
+QCowDedupConfExtension dedup_conf_ext;
 
 #ifdef DEBUG_EXT
 printf(qcow2_read_extensions: start=%ld end=%ld\n, start_offset, 
end_offset);
@@ -149,6 +160,28 @@ static int qcow2_read_extensions(BlockDriverState *bs, 
uint64_t start_offset,
 }
 break;
 
+case QCOW2_EXT_MAGIC_DEDUP_TABLE:
+if (ext.len  sizeof(dedup_conf_ext)) {
+fprintf(stderr, ERROR: dedup_conf_ext: len=%u too large
+ (=%zu)\n,
+ext.len, sizeof(dedup_conf_ext));
+return 2;
+}
+ret = bdrv_pread(bs-file, offset,
+ dedup_conf_ext, ext.len);
+if (ret  0) {
+return ret;
+}
+s-dedup_conf_offset =
+be64_to_cpu(dedup_conf_ext.offset);
+s-dedup_conf_size =
+be32_to_cpu(dedup_conf_ext.size);
+s-dedup_max_incarnations =
+be32_to_cpu(dedup_conf_ext.max);
+s-dedup_hash_algo = dedup_conf_ext.hash_algo;
+s-dedup_dirty = dedup_conf_ext.flags  QCOW_DEDUP_DIRTY;
+break;
+
 default:
 /* unknown magic - save it in case we need to rewrite the header */
 {
@@ -1006,6 +1039,7 @@ int qcow2_update_header(BlockDriverState *bs)
 uint32_t refcount_table_clusters;
 size_t header_length;
 Qcow2UnknownHeaderExtension *uext;
+QCowDedupConfExtension dedup_conf_ext;
 
 buf = qemu_blockalign(bs, buflen);
 
@@ -1109,6 +1143,25 @@ int qcow2_update_header(BlockDriverState *bs)
 buf += ret;
 buflen -= ret;
 
+if (s-has_dedup) {
+memset(dedup_conf_ext, 0, sizeof(dedup_conf_ext));
+dedup_conf_ext.offset= cpu_to_be64(s-dedup_conf_offset);
+dedup_conf_ext.size  = cpu_to_be32(s-dedup_conf_size);
+dedup_conf_ext.max   = cpu_to_be32(s-dedup_max_incarnations);
+dedup_conf_ext.hash_algo = s-dedup_hash_algo;
+dedup_conf_ext.flags = s-dedup_dirty ? QCOW_DEDUP_DIRTY : 0;
+ret = header_ext_add(buf,
+ QCOW2_EXT_MAGIC_DEDUP_TABLE,
+ dedup_conf_ext,
+ sizeof(dedup_conf_ext),
+ buflen);
+if (ret  0) {
+goto fail;
+}
+buf += ret;
+buflen -= ret;
+}
+
 /* Keep unknown header extensions */
 QLIST_FOREACH(uext, s-unknown_header_ext, next) {
 ret = header_ext_add(buf, uext-magic, uext-data, uext-len, buflen);
-- 
1.7.10.4

[Qemu-devel] [RFC V8 17/24] qcow2: Drop hash for a given cluster when dedup makes refcount 2^16/2.

2013-06-20 Thread Benoît Canet

A new physical cluster with the same hash value will be used for further
occurrence of this hash.

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block/qcow2-dedup.c |   20 
 1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/block/qcow2-dedup.c b/block/qcow2-dedup.c
index 2d2c15c..da4ad5c 100644
--- a/block/qcow2-dedup.c
+++ b/block/qcow2-dedup.c
@@ -305,12 +305,24 @@ static int qcow2_deduplicate_cluster(BlockDriverState *bs,
 {
 BDRVQcowState *s = bs-opaque;
 uint64_t cluster_index = hash_info-physical_sect / s-cluster_sectors;
-int ret = 0;
+int refcount, ret = 0;
 
 /* Increment the refcount of the cluster */
-ret = qcow2_update_cluster_refcount(bs,
-cluster_index,
-1);
+refcount = qcow2_update_cluster_refcount(bs,
+ cluster_index,
+ 1);
+
+if (refcount  0) {
+return refcount;
+}
+
+/* if we reached half the max refcount delete the QCowHashInfo from the
+ * store so the next time this cluster will be seen it will be handled as
+ * new
+ */
+if (refcount = 0x/2) {
+ret = qcow2_store_delete(bs, s-key_value_store, hash_info);
+}
 
 if (ret  0) {
 return ret;
-- 
1.7.10.4

[Qemu-devel] [RFC V8 19/24] qcow2: Integrate deduplication in qcow2_co_writev loop.

2013-06-20 Thread Benoît Canet

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block/qcow2.c |   88 +++--
 1 file changed, 86 insertions(+), 2 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index e1265a2..8eb63f1 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -342,6 +342,7 @@ static int qcow2_open(BlockDriverState *bs, QDict *options, 
int flags)
 Error *local_err = NULL;
 uint64_t ext_end;
 
+s-has_dedup = false;
 ret = bdrv_pread(bs-file, 0, header, sizeof(header));
 if (ret  0) {
 goto fail;
@@ -821,13 +822,17 @@ static coroutine_fn int qcow2_co_writev(BlockDriverState 
*bs,
 BDRVQcowState *s = bs-opaque;
 int index_in_cluster;
 int n_end;
-int ret;
+int ret = 0;
 int cur_nr_sectors; /* number of sectors in current iteration */
 uint64_t cluster_offset;
 QEMUIOVector hd_qiov;
 uint64_t bytes_done = 0;
 uint8_t *cluster_data = NULL;
 QCowL2Meta *l2meta = NULL;
+uint8_t *dedup_cluster_data = NULL;
+int dedup_cluster_data_nr;
+int deduped_sectors_nr;
+QCowDedupState ds;
 
 trace_qcow2_writev_start_req(qemu_coroutine_self(), sector_num,
  remaining_sectors);
@@ -838,13 +843,69 @@ static coroutine_fn int qcow2_co_writev(BlockDriverState 
*bs,
 
 qemu_co_mutex_lock(s-lock);
 
+if (s-has_dedup) {
+QTAILQ_INIT(ds.undedupables);
+ds.phash.reuse = false;
+ds.nb_undedupable_sectors = 0;
+ds.nb_clusters_processed = 0;
+
+/* if deduplication is on we make sure dedup_cluster_data
+ * contains a multiple of cluster size of data in order
+ * to compute the hashes
+ */
+ret = qcow2_dedup_read_missing_and_concatenate(bs,
+   qiov,
+   sector_num,
+   remaining_sectors,
+   dedup_cluster_data,
+   dedup_cluster_data_nr);
+
+if (ret  0) {
+goto fail;
+}
+}
+
 while (remaining_sectors != 0) {
 
 l2meta = NULL;
 
 trace_qcow2_writev_start_part(qemu_coroutine_self());
+
+if (s-has_dedup  ds.nb_undedupable_sectors == 0) {
+/* Try to deduplicate as much clusters as possible */
+deduped_sectors_nr = qcow2_dedup(bs,
+ ds,
+ sector_num,
+ dedup_cluster_data,
+ dedup_cluster_data_nr);
+
+if (deduped_sectors_nr  0) {
+goto fail;
+}
+
+remaining_sectors -= deduped_sectors_nr;
+sector_num += deduped_sectors_nr;
+bytes_done += deduped_sectors_nr * 512;
+
+/* no more data to write - exit */
+if (remaining_sectors = 0) {
+break;
+}
+
+/* if we deduped something trace it */
+if (deduped_sectors_nr) {
+trace_qcow2_writev_done_part(qemu_coroutine_self(),
+ deduped_sectors_nr);
+trace_qcow2_writev_start_part(qemu_coroutine_self());
+}
+}
+
 index_in_cluster = sector_num  (s-cluster_sectors - 1);
-n_end = index_in_cluster + remaining_sectors;
+n_end = s-has_dedup 
+ds.nb_undedupable_sectors  remaining_sectors ?
+index_in_cluster + ds.nb_undedupable_sectors :
+index_in_cluster + remaining_sectors;
+
 if (s-crypt_method 
 n_end  QCOW_MAX_CRYPT_CLUSTERS * s-cluster_sectors) {
 n_end = QCOW_MAX_CRYPT_CLUSTERS * s-cluster_sectors;
@@ -912,6 +973,28 @@ static coroutine_fn int qcow2_co_writev(BlockDriverState 
*bs,
 l2meta = next;
 }
 
+/* Write the non duplicated clusters hashes to disk */
+if (s-has_dedup) {
+int count = cur_nr_sectors / s-cluster_sectors;
+int has_ending = ((cluster_offset  9) + index_in_cluster +
+ cur_nr_sectors)  (s-cluster_sectors - 1);
+if (index_in_cluster) {
+count++;
+}
+if (has_ending) {
+count++;
+}
+ret = qcow2_dedup_store_new_hashes(bs,
+   ds,
+   count,
+   sector_num,
+   (cluster_offset  9));
+if (ret  0) {
+goto fail;
+}
+}
+
+ds.nb_undedupable_sectors -= cur_nr_sectors;
 remaining_sectors -= cur_nr_sectors;

[Qemu-devel] [RFC V8 22/24] qcow2: Add qcow2_dedup_init and qcow2_dedup_close.

2013-06-20 Thread Benoît Canet

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block/qcow2-dedup.c |   60 +++
 block/qcow2.c   |2 +-
 block/qcow2.h   |2 ++
 3 files changed, 63 insertions(+), 1 deletion(-)

diff --git a/block/qcow2-dedup.c b/block/qcow2-dedup.c
index 606459f..5eeea38 100644
--- a/block/qcow2-dedup.c
+++ b/block/qcow2-dedup.c
@@ -716,3 +716,63 @@ void qcow2_dedup_destroy_hash(BlockDriverState *bs,
 free_exit:
qemu_vfree(buf);
 }
+
+int qcow2_dedup_init(BlockDriverState *bs)
+{
+BDRVQcowState *s = bs-opaque;
+int ret = 0;
+
+s-has_dedup = true;
+
+/* if we are read-only we don't init anything */
+if (bs-read_only) {
+return 0;
+}
+
+/* no need to allocate the various store's buffers since qcow2_store_load
+ * will do it
+ */
+
+/* load and parse the configuration from disk */
+ret = qcow2_store_load(bs, s-key_value_store);
+
+if (ret  0) {
+return ret;
+}
+
+/* if QEMU crashed forget everyting that was stored in the store */
+if(s-dedup_dirty) {
+qcow2_store_forget(bs, s-key_value_store);
+}
+
+/* set the dirty bit */
+s-dedup_dirty = true;
+qcow2_update_header(bs);
+
+/* load the journal and the incarnations in a coroutine */
+return qcow2_store_start(bs, s-key_value_store);
+}
+
+void qcow2_dedup_close(BlockDriverState *bs)
+{
+BDRVQcowState *s = bs-opaque;
+int ret = 0;
+
+/* if we are read-only we don't need to cleanup */
+if (bs-read_only) {
+return;
+}
+
+ret = qcow2_store_flush(bs, s-key_value_store);
+
+qcow2_store_cleanup(s-key_value_store);
+
+/* flush failed - leave the store dirty so it will be discard at restart 
*/
+if (ret  0) {
+return;
+}
+
+/* clear the dirty bit */
+s-dedup_dirty = false;
+qcow2_update_header(bs);
+}
diff --git a/block/qcow2.c b/block/qcow2.c
index ea2f0f2..f7b94dd 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1546,7 +1546,7 @@ static int qcow2_create2(const char *filename, int64_t 
total_size,
 goto out;
 }
 
-qcow2_store_cleanup(bs, s-key_value_store);
+qcow2_store_cleanup(s-key_value_store);
 }
 
 /* Want a backing file? There you go.*/
diff --git a/block/qcow2.h b/block/qcow2.h
index 3c6e685..b293be7 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -751,5 +751,7 @@ int qcow2_dedup_store_new_hashes(BlockDriverState *bs,
  uint64_t physical_sect);
 void qcow2_dedup_destroy_hash(BlockDriverState *bs,
   uint64_t cluster_index);
+int qcow2_dedup_init(BlockDriverState *bs);
+void qcow2_dedup_close(BlockDriverState *bs);
 
 #endif
-- 
1.7.10.4

[Qemu-devel] [RFC V8 24/24] qcow2: Enable deduplication tests

2013-06-20 Thread Benoît Canet

---
 tests/qemu-iotests/common |6 ++
 1 file changed, 6 insertions(+)

diff --git a/tests/qemu-iotests/common b/tests/qemu-iotests/common
index 6826ea7..2483742 100644
--- a/tests/qemu-iotests/common
+++ b/tests/qemu-iotests/common
@@ -130,6 +130,7 @@ check options
 -cowtest cow
 -qcow   test qcow
 -qcow2  test qcow2
+-qcow2_deduptest qcow2 deduplication
 -qedtest qed
 -vditest vdi
 -vpctest vpc
@@ -175,6 +176,11 @@ testlist options
xpand=false
;;
 
+   -qcow2_dedup)
+   IMGFMT=qcow2_dedup
+   xpand=false
+   ;;
+
-qed)
IMGFMT=qed
xpand=false
-- 
1.7.10.4

[Qemu-devel] [RFC V8 02/24] qcow2: Add deduplication structures and fields.

2013-06-20 Thread Benoît Canet

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block/qcow2.h |  203 -
 1 file changed, 201 insertions(+), 2 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 9421843..953edfe 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -57,7 +57,182 @@
 #define REFCOUNT_CACHE_SIZE 4
 
 #define DEFAULT_CLUSTER_SIZE 65536
-
+#define DEFAULT_DEDUP_CLUSTER_SIZE 4096
+
+#define HASH_LENGTH 32
+
+/* indicate that this cluster hash has been deleted from the key value store */
+#define QCOW_DEDUP_DELETED (1LL  61)
+/* indicate that the hash structure is empty and miss offset */
+#define QCOW_DEDUP_FLAG_EMPTY   (1LL  62)
+
+#define SSD_ERASE_BLOCK_SIZE (512 * 1024) /* match SSD erase block size */
+#define JOURNAL_CLUSTER_SIZE 4096   /* used to read entries */
+#define HASH_STORE_CLUSTER_SIZE 4096
+
+#define QCOW_LOG_END_SIZE 2/* size of a end block journal entry */
+#define QCOW_LOG_STORE_ENTRY_USED (1LL  60) /* mark used entry in table */
+#define QCOW_LOG_STORE_BUCKET_SIZE 4   /* size of a cuckoo hash bucket */
+#define QCOW_LOG_STORE_MAX_KICKS 128   /* max numbers of cuckoo hash kicks */
+#define QCOW_LOG_STORE_JOURNAL_RATIO 2 /* the ratio to compute the extra
+* room the journal will take based
+* on the log store size
+*/
+#define QCOW2_NB_INCARNATION_GOAL  128 /* targeted number of incarnation */
+
+#define QCOW_DEDUP_DIRTY 1 /* dirty flag in the qcow2 header extension */
+
+typedef enum {
+QCOW_LOG_NONE = 0xFF, /* on SSD erased clusters will mark none */
+QCOW_LOG_END = 1, /* end a block and point to the next */
+QCOW_LOG_HASH = 2,/* used to journalize a QCowHashInfo */
+} QCowLogEntryType;
+
+typedef enum {
+QCOW_HASH_SHA256 = 0,
+QCOW_HASH_SHA3   = 1,
+QCOW_HASH_SKEIN  = 2,
+} QCowHashAlgo;
+
+typedef struct {
+uint8_t data[HASH_LENGTH]; /* 32 bytes hash of a given cluster */
+} __attribute__((packed)) QCowHash;
+
+/* deduplication info */
+typedef struct {
+QCowHash hash;
+uint64_t physical_sect;   /* where the cluster is stored on disk */
+uint64_t first_logical_sect;  /* logical sector of the first occurrence of
+   * this cluster
+   */
+} __attribute__((packed)) QCowHashInfo;
+
+/* Used to keep a single precomputed hash between the calls of the dedup
+ * function
+ */
+typedef struct {
+QCowHashInfo hash_info;
+bool reuse; /* The main deduplication function can set this field to
+ * true before exiting to avoid computing the same hash
+ * twice. It's a speed optimization.
+ */
+} QcowPersistentHash;
+
+/* Undedupable hashes that must be written later to disk */
+typedef struct QCowHashElement {
+QCowHashInfo hash_info;
+QTAILQ_ENTRY(QCowHashElement) next;
+} QCowHashElement;
+
+typedef struct {
+QcowPersistentHash phash;  /* contains a hash persisting between calls of
+* qcow2_dedup()
+*/
+QTAILQ_HEAD(, QCowHashElement) undedupables;
+uint64_t nb_clusters_processed;
+uint64_t nb_undedupable_sectors;
+} QCowDedupState;
+
+/* The code must take care that the maximum size field of a QCowJournalEntry
+ * will be no more than 254 bytes.
+ * It's required to save the 2 bytes of room for QCOW_LOG_END entries
+ * in every cases
+ */
+typedef union {
+QCowHashInfo hash_info;
+uint8_t  padding[254]; /* note the extra two bytes of padding to avoid
+* read overflow.
+*/
+} QCowJournalEntryUnion;
+
+typedef struct {
+uint8_t size;/* maximum size of a journal entry is 254 bytes */
+uint8_t type;/* contains a QCowLogEntryType for future usage */
+QCowJournalEntryUnion u;
+} __attribute__((packed)) QCowJournalEntry;
+
+typedef struct {
+uint64_t sector;  /* the journal physical on disk sector */
+uint64_t size;/* the size of the journal in bytes */
+uint64_t index;   /* index of next buf cluster to write */
+uint8_t  *write_buf;  /* used to buffer written data */
+uint64_t offset_in_buf;   /* the offset in the write buffer */
+bool flushed; /* true if the buffer reached disk*/
+uint8_t  *read_cache; /* used to cache read data */
+int64_t read_index;   /* index the cached read cluster */
+bool started; /* has the journal been resumed */
+} QCowJournal;
+
+typedef struct {
+QCowJournal journal;  /* the journal this log store will use */
+uint32_t order;   /* the number of bits used for the sub hashes
+

[Qemu-devel] [RFC V8 18/24] qcow2: Remove hash when cluster is deleted.

2013-06-20 Thread Benoît Canet

Signed-off-by: Benoit Canet ben...@irqsave.net
---
 block/qcow2-dedup.c|   45 +
 block/qcow2-refcount.c |3 +++
 block/qcow2.h  |2 ++
 3 files changed, 50 insertions(+)

diff --git a/block/qcow2-dedup.c b/block/qcow2-dedup.c
index da4ad5c..599cb2e 100644
--- a/block/qcow2-dedup.c
+++ b/block/qcow2-dedup.c
@@ -656,3 +656,48 @@ int qcow2_dedup_store_new_hashes(BlockDriverState *bs,
 
 return ret;
 }
+
+/* Clean the last reference to a given cluster when its refcount is zero
+ *
+ * @cluster_index: the index of the physical cluster
+ */
+void qcow2_dedup_destroy_hash(BlockDriverState *bs,
+  uint64_t cluster_index)
+{
+BDRVQcowState *s = bs-opaque;
+uint64_t offset = cluster_index * s-cluster_size;
+QCowHashInfo hash_info;
+uint8_t *buf;
+int ret = 0;
+
+/* allocate buffer */
+buf = qemu_blockalign(bs, s-cluster_size);
+
+/* read cluster from disk */
+ret = bdrv_pread(bs-file, offset, buf, s-cluster_size);
+
+/* error */
+if (ret  0) {
+goto free_exit;
+}
+
+/* clear hash info */
+memset(hash_info, 0, sizeof(QCowHashInfo));
+
+/* compute hash for the cluster */
+ret = qcow2_compute_cluster_hash(bs,
+ hash_info.hash,
+ buf);
+
+
+/* error */
+if (ret  0) {
+goto free_exit;
+}
+
+/* delete hash from key value store. It will not be deduplicated anymore */
+qcow2_store_delete(bs, s-key_value_store, hash_info);
+
+free_exit:
+   qemu_vfree(buf);
+}
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 3bd8f37..2734cd9 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -482,6 +482,9 @@ static int QEMU_WARN_UNUSED_RESULT 
update_refcount(BlockDriverState *bs,
 ret = -EINVAL;
 goto fail;
 }
+if (s-has_dedup  refcount == 0) {
+qcow2_dedup_destroy_hash(bs, cluster_index);
+}
 if (refcount == 0  cluster_index  s-free_cluster_index) {
 s-free_cluster_index = cluster_index;
 }
diff --git a/block/qcow2.h b/block/qcow2.h
index 720131d..6f85e03 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -748,5 +748,7 @@ int qcow2_dedup_store_new_hashes(BlockDriverState *bs,
  int count,
  uint64_t logical_sect,
  uint64_t physical_sect);
+void qcow2_dedup_destroy_hash(BlockDriverState *bs,
+  uint64_t cluster_index);
 
 #endif
-- 
1.7.10.4

< 1 2 3 4 5 6 7 8 9 10 >

101 - 200 of 2243 matches

Mail list logo