Re: [Qemu-block] [PATCH 1/4] block: Fix dirty bitmap in bdrv_co_discard

2015-05-06 Thread Paolo Bonzini


On 06/05/2015 03:45, Fam Zheng wrote:
  This is not enough, you also have to do the discard in block/mirror.c,
  otherwise the destination image could even become fully provisioned!
 I wasn't sure what to do if src and dest have different
 can_write_zeroes_with_unmap values, but your argument is stronger.
 
 I will add discard in mirror.  Besides that, do we need to compare the
 can_write_zeroes_with_unmap?

Hmm, if can_write_zeroes_with_unmap is set, it's probably better to
write zeroes instead of discarding, in case the guest is relying on it.

Paolo



Re: [Qemu-block] [Qemu-devel] [RFC] Differential Backups

2015-05-06 Thread Stefan Hajnoczi
On Tue, May 05, 2015 at 11:55:49AM -0400, John Snow wrote:
 On 05/05/2015 06:25 AM, Stefan Hajnoczi wrote:
 On Wed, Apr 29, 2015 at 06:51:08PM -0400, John Snow wrote:
 This is a feature that should be very easy to add on top of the existing
 incremental feature, since it's just a difference in how the bitmap is
 treated:
 
 Incremental
 - Links to the last incremental (managed by libvirt)
 - Clears the bitmap after creation
 
 Differential:
 - Links to the last full backup always (managed by libvirt)
 - Does not clear the bitmap after creation
 
 No biggie.
 
 Differential backups can be done using incremental backup functionality
 in QEMU:
 
 The client application points QEMU to the same target repeatedly instead
 of keeping separate incremental backups.
 
 Stefan
 
 
 Oh, so you're saying:
 
 [anchor]--[diff1]
 
 And then when making a new incremental, we re-use diff1 as a target and
 overwrite it so that it becomes:
 
 [anchor]--[diff2]
 
 In effect giving us a differential.
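 
 As a concrete illustration, a hedged sketch using the QMP interface from
 the in-flight incremental backup work (the exact parameter names are
 assumptions, not settled API):
 
 { "execute": "drive-backup",
   "arguments": { "device": "drive0",
                  "sync": "incremental",
                  "bitmap": "bitmap0",
                  "target": "diff1.qcow2",
                  "mode": "existing" } }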
 
 OK, so it's possible, but we still lose out on some flexibility that a
 slightly different mode would provide us, like the ability to keep multiple
 differentials if desired. (Well, I suppose we *can* create those by manually
 copying differentials after we create them, but that seems hackier than
 necessary.)
 
 Still, it would be such a paltry few lines of code and introduce no real
 complexity to the subsystem, and it might make libvirt's life a little
 easier for managing such things.

I have CCed Eric Blake and the libvirt mailing list.

This is a good time to discuss libvirt requirements for backup APIs.
Current work for QEMU 2.4:
 * QMP command to take incremental backup of guest disks in a single
   atomic operation
 * Dirty bitmap persistence across live migration and QEMU restart

Eric: Do you have any opinion on a differential backup feature in
addition to incremental backup?

My thoughts are that QEMU should only copy out changed blocks and let
the backup application worry about concepts like incremental vs
differential.

From a host performance perspective, copying out the same blocks that
have already been sent in a previous backup is a waste of I/O bandwidth.
Even backup applications that want differential backup may not actually
use the QEMU feature for this reason.

Regarding the size of the patch, there's a cost to adding the tests,
documentation, and committing to QMP APIs.  These costs dwarf the small
code change that preserves the dirty bitmap.

If there's a concrete requirement for this feature then I'm happy with
it, but let's not add bells and whistles just because we can.

Stefan




Re: [Qemu-block] [Qemu-devel] [PATCH 1/4] block: Fix dirty bitmap in bdrv_co_discard

2015-05-06 Thread Paolo Bonzini


On 06/05/2015 11:50, Fam Zheng wrote:
  #   src can_write_zeroes_with_unmap   target can_write_zeroes_with_unmap
 
  1   true                              true
  2   true                              false
  3   false                             true
  4   false                             false

1 should replicate WRITE SAME, in case the unmap granularity of the
target is different from that of the source.  In that case, a discard on
the target might leave some sectors untouched.  Writing zeroes would
ensure matching data between the source and the target.

2 should _not_ discard: it should write zeroes even at the cost of
making the target fully provisioned.  Perhaps you can optimize it by
looking at bdrv_get_block_status for the target, and checking the answer
for BDRV_BLOCK_ZERO.

3 and 4 can use discard on the target.

So it looks like only the source setting matters.

We need to check the cost of bdrv_co_get_block_status for raw, too.
If it's too expensive, that can be a problem.
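
As an illustration (a sketch, not the actual patch): the resulting decision
in block/mirror.c could be keyed off the source's can_write_zeroes_with_unmap
alone; mirror_handle_unallocated is a hypothetical helper name here.

static void mirror_handle_unallocated(MirrorBlockJob *s, int64_t sector_num,
                                      int nb_sectors, MirrorOp *op)
{
    BlockDriverInfo bdi;

    if (bdrv_get_info(s->common.bs, &bdi) >= 0 &&
        bdi.can_write_zeroes_with_unmap) {
        /* Cases 1 and 2: the guest may rely on reading zeroes, so
         * replicate WRITE SAME semantics on the target. */
        bdrv_aio_write_zeroes(s->target, sector_num, nb_sectors,
                              BDRV_REQ_MAY_UNMAP,
                              mirror_write_complete, op);
    } else {
        /* Cases 3 and 4: a plain discard on the target is enough. */
        bdrv_aio_discard(s->target, sector_num, nb_sectors,
                         mirror_write_complete, op);
    }
}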

Paolo

 For cases 2 & 3 it's probably better to mirror the actual reading of the source.
 
 I'm not sure about 4.

Even in case 1, discard could be UNMAP and write zeroes could be
WRITE SAME.  If the unmap granularity of the target differs, UNMAP
might leave some unaligned sectors aside while WRITE SAME will write
them with zeroes.



Re: [Qemu-block] [Qemu-devel] qemu-img convert (vmdk)

2015-05-06 Thread Kevin Wolf
Am 05.05.2015 um 19:01 hat Antoni Villalonga geschrieben:
 Hi,
 
 It's my first email to this list ;)
 
 
 I can reproduce this bug with v2.2 and v2.3. I'm not sure about the results
 after testing with v2.1 (doesn't show errors but seems to be still broken).
 
 % qemu-img convert -f raw -O vmdk -o subformat=streamOptimized 100GB_inputfile.img outputfile0.vmdk
 (no shown errors/warnings/info)
 % qemu-img convert -f vmdk -O qcow2 outputfile0.vmdk outputfile11.qcow2
 qemu-img: error while reading sector 278528: Invalid argument
 
 Let me know if I can provide you more info.

Fam, any idea?

Kevin



Re: [Qemu-block] [PATCH v2 1/6] mirror: Discard target sectors if not allocated at source side

2015-05-06 Thread wangxiaolong
I just wonder if bdrv_is_allocated_above works as intended; migrating an image
in qcow2 format runs into such a backtrace:

#0  0x7f9e73822c6d in lseek64 () at ../sysdeps/unix/syscall-template.S:82
#1  0x7f9e765f08e4 in find_allocation (bs=<value optimized out>,
    sector_num=<value optimized out>, nb_sectors=20480, pnum=0x7f9e7bab00fc)
    at block/raw-posix.c:1285
#2  raw_co_get_block_status (bs=<value optimized out>,
    sector_num=<value optimized out>, nb_sectors=20480, pnum=0x7f9e7bab00fc)
    at block/raw-posix.c:1381
#3  0x7f9e765cfe61 in bdrv_co_get_block_status (bs=0x7f9e790e4300,
    sector_num=22470528, nb_sectors=20480, pnum=0x7f9e7bab00fc) at block.c:3593
#4  0x7f9e765cffa8 in bdrv_co_get_block_status (bs=0x7f9e790e00a0,
    sector_num=22466560, nb_sectors=<value optimized out>, pnum=0x7f9e7bab00fc)
    at block.c:3620
#5  0x7f9e765cfff7 in bdrv_get_block_status_co_entry (opaque=0x7f9e7bab0080)
    at block.c:3639
#6  0x7f9e765d0088 in bdrv_get_block_status (bs=<value optimized out>,
    sector_num=<value optimized out>, nb_sectors=<value optimized out>,
    pnum=<value optimized out>) at block.c:3663
#7  0x7f9e765d00a9 in bdrv_is_allocated (bs=0x7f9e790e00a0,
    sector_num=<value optimized out>, nb_sectors=<value optimized out>,
    pnum=<value optimized out>) at block.c:3677
#8  0x7f9e765d0166 in bdrv_is_allocated_above (top=0x7f9e790e00a0, base=0x0,
    sector_num=22466560, nb_sectors=20480, pnum=0x7f9e7bab020c) at block.c:3709
#9  0x7f9e765dcf9e in mirror_iteration (opaque=0x7f9e79486110)
    at block/mirror.c:305
#10 mirror_run (opaque=0x7f9e79486110) at block/mirror.c:430
#11 0x7f9e766120eb in coroutine_trampoline (i0=<value optimized out>,
    i1=<value optimized out>) at coroutine-ucontext.c:118
#12 0x7f9e734c4b70 in ?? () from /lib64/libc.so.6
#13 0x7a897850 in ?? ()
#14 0x in ?? ()

It is too slow to tolerate...

Original Message
Sender: Fam Zheng <f...@redhat.com>
Recipient: qemu-devel <qemu-de...@nongnu.org>
Cc: Kevin Wolf <kw...@redhat.com>; Stefan Hajnoczi <stefa...@redhat.com>;
    qemu-block <qemu-block@nongnu.org>; pbonzini <pbonz...@redhat.com>;
    jsnow <js...@redhat.com>; wangxiaolong <wangxiaol...@ucloud.cn>
Date: Wednesday, May 6, 2015 12:52
Subject: [PATCH v2 1/6] mirror: Discard target sectors if not allocated at source side

If guest discards a source cluster during mirror, we would want to
discard target side as well.

Signed-off-by: Fam Zheng f...@redhat.com
---
 block/mirror.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index 58f391a..37a5b61 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -163,6 +163,7 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
     int64_t end, sector_num, next_chunk, next_sector, hbitmap_next_sector;
     uint64_t delay_ns = 0;
     MirrorOp *op;
+    int pnum;
 
     s->sector_num = hbitmap_iter_next(&s->hbi);
     if (s->sector_num < 0) {
@@ -289,8 +290,15 @@ static uint64_t coroutine_fn mirror_iteration(MirrorBlockJob *s)
     s->in_flight++;
     s->sectors_in_flight += nb_sectors;
     trace_mirror_one_iteration(s, sector_num, nb_sectors);
-    bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,
-                   mirror_read_complete, op);
+
+    if (!bdrv_is_allocated_above(source, NULL, sector_num,
+                                 nb_sectors, &pnum)) {
+        bdrv_aio_discard(s->target, sector_num, nb_sectors,
+                         mirror_write_complete, op);
+    } else {
+        bdrv_aio_readv(source, sector_num, &op->qiov, nb_sectors,
+                       mirror_read_complete, op);
+    }
     return delay_ns;
 }
 
-- 
1.9.3



Re: [Qemu-block] [Qemu-devel] qemu-img convert (vmdk)

2015-05-06 Thread Fam Zheng
On Wed, 05/06 12:01, Kevin Wolf wrote:
 Am 05.05.2015 um 19:01 hat Antoni Villalonga geschrieben:
  Hi,
  
   It's my first email to this list ;)
  
  
  I can reproduce this bug with v2.2 and v2.3. I'm not sure about the results
  after testing with v2.1 (doesn't show errors but seems to be still broken).
  
  % qemu-img convert -f raw -O vmdk -o subformat=streamOptimized 100GB_inputfile.img outputfile0.vmdk
  (no shown errors/warnings/info)
  % qemu-img convert -f vmdk -O qcow2 outputfile0.vmdk outputfile11.qcow2
  qemu-img: error while reading sector 278528: Invalid argument
  
  Let me know if I can provide you more info.
 
 Fam, any idea?

It's a bug. I can reproduce it with my 21G guest image. I'll take a look.

Fam



[Qemu-block] [RFC PATCH 1/7] block: Add op blocker type device IO

2015-05-06 Thread Fam Zheng
Preventing devices from submitting IO is useful around various nested
polls. The op blocker is a good place to put this flag.

Devices submit IO requests through the blk_* block backend interface,
which calls blk_check_request to check the validity. Return -EBUSY if
the operation is blocked, in which case device IO has been temporarily
disabled by another BDS user.
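
A hypothetical caller-side sketch (not part of this patch) of what a device
submitting IO would now observe:

    ret = blk_read(blk, sector_num, buf, nb_sectors);
    if (ret == -EBUSY) {
        /* device IO is temporarily blocked by another BDS user;
         * retry later rather than treating this as a fatal IO error */
    }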

Signed-off-by: Fam Zheng f...@redhat.com
---
 block/block-backend.c | 4 
 include/block/block.h | 1 +
 2 files changed, 5 insertions(+)

diff --git a/block/block-backend.c b/block/block-backend.c
index 93e46f3..71fc695 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -478,6 +478,10 @@ static int blk_check_request(BlockBackend *blk, int64_t sector_num,
         return -EIO;
     }
 
+    if (bdrv_op_is_blocked(blk->bs, BLOCK_OP_TYPE_DEVICE_IO, NULL)) {
+        return -EBUSY;
+    }
+
     return blk_check_byte_request(blk, sector_num * BDRV_SECTOR_SIZE,
                                   nb_sectors * BDRV_SECTOR_SIZE);
 }
diff --git a/include/block/block.h b/include/block/block.h
index 7d1a717..906fb31 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -159,6 +159,7 @@ typedef enum BlockOpType {
     BLOCK_OP_TYPE_RESIZE,
     BLOCK_OP_TYPE_STREAM,
     BLOCK_OP_TYPE_REPLACE,
+    BLOCK_OP_TYPE_DEVICE_IO,
     BLOCK_OP_TYPE_MAX,
 } BlockOpType;
 
-- 
1.9.3




[Qemu-block] [RFC PATCH 0/7] Fix transactional snapshot with virtio-blk dataplane

2015-05-06 Thread Fam Zheng
Reported by Paolo.

Unlike the iohandler in the main loop, iothreads currently process the event
notifier used as the virtio-blk ioeventfd in all nested aio_poll calls. This
is dangerous without proper protection, because guest requests could sneak
into the block layer where they mustn't.

For example, a QMP transaction may involve multiple bdrv_drain_all() calls
while handling the list of AioContexts it works on. If an aio_poll in one of
the bdrv_drain_all() calls happens to process a guest VQ kick by dispatching
the ioeventfd event, a new guest write is submitted, and voila, the
transaction semantics are violated.

This series avoids this problem by disabling virtio-blk handlers during
bdrv_drain_all() and transactions.
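
As a rough sketch (simplified; the real hooks live in bdrv_drain_all() and
the snapshot transaction code), the intended pairing looks like:

    Error *blocker = NULL;

    error_setg(&blocker, "transaction in progress");
    bdrv_op_block(bs, BLOCK_OP_TYPE_DEVICE_IO, blocker);
    bdrv_drain_all();   /* nested aio_poll can no longer dispatch the
                         * ioeventfd and sneak in new guest requests */
    /* ... prepare and commit the transaction ... */
    bdrv_op_unblock(bs, BLOCK_OP_TYPE_DEVICE_IO, blocker);
    error_free(blocker);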

Notes:

If the general approach is right, other transaction types could get the
blockers similarly in the next revision. Some related bdrv_drain_all() calls
could also be changed to bdrv_drain().

virtio-scsi-dataplane will be a bit more complicated, but still doable.  It
would probably need one more interface abstraction between scsi-disk, scsi-bus
and virtio-scsi.

Although other devices don't have a pause/resume callback yet,
blk_check_request, which returns -EBUSY if the device IO op blocker is set,
should hopefully cover most cases already.

Timers and block jobs also generate IO, but it should be fine as long as they
don't change guest-visible data, which is true AFAICT.


Fam Zheng (7):
  block: Add op blocker type device IO
  block: Block device IO during bdrv_drain and bdrv_drain_all
  block: Add op blocker notifier list
  block-backend: Add blk_op_blocker_add_notifier
  virtio-blk: Move complete_request to 'ops' structure
  virtio-blk: Don't handle output when there is device IO op blocker
  blockdev: Add device IO op blocker during snapshot transaction

 block.c | 20 
 block/block-backend.c   | 10 ++
 block/io.c  | 12 +++
 blockdev.c  |  7 +
 hw/block/dataplane/virtio-blk.c | 36 ++---
 hw/block/virtio-blk.c   | 69 +++--
 include/block/block.h   |  9 ++
 include/block/block_int.h   |  3 ++
 include/hw/virtio/virtio-blk.h  | 17 --
 include/sysemu/block-backend.h  |  2 ++
 10 files changed, 174 insertions(+), 11 deletions(-)

-- 
1.9.3




[Qemu-block] [RFC PATCH 2/7] block: Block device IO during bdrv_drain and bdrv_drain_all

2015-05-06 Thread Fam Zheng
We don't want new requests from the guest, so block the operation around the
nested poll.

Signed-off-by: Fam Zheng f...@redhat.com
---
 block/io.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/block/io.c b/block/io.c
index 1ce62c4..d369de3 100644
--- a/block/io.c
+++ b/block/io.c
@@ -289,9 +289,15 @@ static bool bdrv_drain_one(BlockDriverState *bs)
  */
 void bdrv_drain(BlockDriverState *bs)
 {
+    Error *blocker = NULL;
+
+    error_setg(&blocker, "bdrv_drain in progress");
+    bdrv_op_block(bs, BLOCK_OP_TYPE_DEVICE_IO, blocker);
     while (bdrv_drain_one(bs)) {
         /* Keep iterating */
     }
+    bdrv_op_unblock(bs, BLOCK_OP_TYPE_DEVICE_IO, blocker);
+    error_free(blocker);
 }
 
 /*
@@ -311,6 +317,9 @@ void bdrv_drain_all(void)
     /* Always run first iteration so any pending completion BHs run */
     bool busy = true;
     BlockDriverState *bs = NULL;
+    Error *blocker = NULL;
+
+    error_setg(&blocker, "bdrv_drain_all in progress");
 
     while ((bs = bdrv_next(bs))) {
         AioContext *aio_context = bdrv_get_aio_context(bs);
@@ -319,6 +328,7 @@
         if (bs->job) {
             block_job_pause(bs->job);
         }
+        bdrv_op_block(bs, BLOCK_OP_TYPE_DEVICE_IO, blocker);
         aio_context_release(aio_context);
     }
 
@@ -343,8 +353,10 @@ void bdrv_drain_all(void)
         if (bs->job) {
             block_job_resume(bs->job);
         }
+        bdrv_op_unblock(bs, BLOCK_OP_TYPE_DEVICE_IO, blocker);
         aio_context_release(aio_context);
     }
+    error_free(blocker);
 }
 
 /**
-- 
1.9.3




[Qemu-block] [PATCH 1/7] qcow2: use one single memory block for the L2/refcount cache tables

2015-05-06 Thread Alberto Garcia
The qcow2 L2/refcount cache contains one separate table for each cache
entry. Doing one allocation per table adds unnecessary overhead and it
also requires us to store the address of each table separately.

Since the size of the cache is constant during its lifetime, it's
better to have an array that contains all the tables using one single
allocation.

In my tests measuring freshly created caches with sizes 128MB (L2) and
32MB (refcount) this uses around 10MB of RAM less.

Signed-off-by: Alberto Garcia be...@igalia.com
---
 block/qcow2-cache.c| 55 --
 block/qcow2-cluster.c  | 12 +--
 block/qcow2-refcount.c |  8 +---
 block/qcow2.h  |  3 ++-
 4 files changed, 39 insertions(+), 39 deletions(-)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index b115549..f0dfb69 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -28,7 +28,6 @@
 #include "trace.h"
 
 typedef struct Qcow2CachedTable {
-    void*   table;
     int64_t offset;
     bool    dirty;
     int     cache_hits;
@@ -40,39 +39,35 @@ struct Qcow2Cache {
     struct Qcow2Cache*  depends;
     int                 size;
     bool                depends_on_flush;
+    void               *table_array;
 };
 
+static inline void *qcow2_cache_get_table_addr(BlockDriverState *bs,
+                                               Qcow2Cache *c, int table)
+{
+    BDRVQcowState *s = bs->opaque;
+    return (uint8_t *) c->table_array + (size_t) table * s->cluster_size;
+}
+
 Qcow2Cache *qcow2_cache_create(BlockDriverState *bs, int num_tables)
 {
     BDRVQcowState *s = bs->opaque;
     Qcow2Cache *c;
-    int i;
 
     c = g_new0(Qcow2Cache, 1);
     c->size = num_tables;
     c->entries = g_try_new0(Qcow2CachedTable, num_tables);
-    if (!c->entries) {
-        goto fail;
-    }
-
-    for (i = 0; i < c->size; i++) {
-        c->entries[i].table = qemu_try_blockalign(bs->file, s->cluster_size);
-        if (c->entries[i].table == NULL) {
-            goto fail;
-        }
+    c->table_array = qemu_try_blockalign(bs->file,
+                                         (size_t) num_tables * s->cluster_size);
+
+    if (!c->entries || !c->table_array) {
+        qemu_vfree(c->table_array);
+        g_free(c->entries);
+        g_free(c);
+        c = NULL;
     }
 
     return c;
-
-fail:
-    if (c->entries) {
-        for (i = 0; i < c->size; i++) {
-            qemu_vfree(c->entries[i].table);
-        }
-    }
-    g_free(c->entries);
-    g_free(c);
-    return NULL;
 }
 
 int qcow2_cache_destroy(BlockDriverState* bs, Qcow2Cache *c)
@@ -81,9 +76,9 @@ int qcow2_cache_destroy(BlockDriverState* bs, Qcow2Cache *c)
 
     for (i = 0; i < c->size; i++) {
         assert(c->entries[i].ref == 0);
-        qemu_vfree(c->entries[i].table);
     }
 
+    qemu_vfree(c->table_array);
     g_free(c->entries);
     g_free(c);
 
@@ -151,8 +146,8 @@ static int qcow2_cache_entry_flush(BlockDriverState *bs, Qcow2Cache *c, int i)
         BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE);
     }
 
-    ret = bdrv_pwrite(bs->file, c->entries[i].offset, c->entries[i].table,
-                      s->cluster_size);
+    ret = bdrv_pwrite(bs->file, c->entries[i].offset,
+                      qcow2_cache_get_table_addr(bs, c, i), s->cluster_size);
     if (ret < 0) {
         return ret;
     }
@@ -304,7 +299,8 @@ static int qcow2_cache_do_get(BlockDriverState *bs, Qcow2Cache *c,
         BLKDBG_EVENT(bs->file, BLKDBG_L2_LOAD);
     }
 
-    ret = bdrv_pread(bs->file, offset, c->entries[i].table, s->cluster_size);
+    ret = bdrv_pread(bs->file, offset, qcow2_cache_get_table_addr(bs, c, i),
+                     s->cluster_size);
     if (ret < 0) {
         return ret;
     }
@@ -319,7 +315,7 @@ static int qcow2_cache_do_get(BlockDriverState *bs, Qcow2Cache *c,
 found:
     c->entries[i].cache_hits++;
     c->entries[i].ref++;
-    *table = c->entries[i].table;
+    *table = qcow2_cache_get_table_addr(bs, c, i);
 
     trace_qcow2_cache_get_done(qemu_coroutine_self(),
                                c == s->l2_table_cache, i);
@@ -344,7 +340,7 @@ int qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table)
     int i;
 
     for (i = 0; i < c->size; i++) {
-        if (c->entries[i].table == *table) {
+        if (qcow2_cache_get_table_addr(bs, c, i) == *table) {
             goto found;
         }
     }
@@ -358,12 +354,13 @@ found:
     return 0;
 }
 
-void qcow2_cache_entry_mark_dirty(Qcow2Cache *c, void *table)
+void qcow2_cache_entry_mark_dirty(BlockDriverState *bs, Qcow2Cache *c,
+                                  void *table)
 {
     int i;
 
     for (i = 0; i < c->size; i++) {
-        if (c->entries[i].table == table) {
+        if (qcow2_cache_get_table_addr(bs, c, i) == table) {
             goto found;
         }
     }
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index ed2b44d..5cd418a 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -263,7 +263,7 @@ static int l2_allocate(BlockDriverState *bs, int l1_index, uint64_t 

[Qemu-block] [PATCH 4/7] qcow2: remove qcow2_cache_find_entry_to_replace()

2015-05-06 Thread Alberto Garcia
A cache miss means that the whole array was traversed and the entry
we were looking for was not found, so there's no need to traverse it
again in order to select an entry to replace.

Signed-off-by: Alberto Garcia be...@igalia.com
---
 block/qcow2-cache.c | 45 -
 1 file changed, 16 insertions(+), 29 deletions(-)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index 786c10a..dffb887 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -242,51 +242,38 @@ int qcow2_cache_empty(BlockDriverState *bs, Qcow2Cache *c)
 return 0;
 }
 
-static int qcow2_cache_find_entry_to_replace(Qcow2Cache *c)
-{
-    int i;
-    uint64_t min_lru_counter = UINT64_MAX;
-    int min_index = -1;
-
-
-    for (i = 0; i < c->size; i++) {
-        if (c->entries[i].ref) {
-            continue;
-        }
-
-        if (c->entries[i].lru_counter < min_lru_counter) {
-            min_index = i;
-            min_lru_counter = c->entries[i].lru_counter;
-        }
-    }
-
-    if (min_index == -1) {
-        /* This can't happen in current synchronous code, but leave the check
-         * here as a reminder for whoever starts using AIO with the cache */
-        abort();
-    }
-    return min_index;
-}
-
 static int qcow2_cache_do_get(BlockDriverState *bs, Qcow2Cache *c,
                               uint64_t offset, void **table, bool read_from_disk)
 {
     BDRVQcowState *s = bs->opaque;
     int i;
     int ret;
+    uint64_t min_lru_counter = UINT64_MAX;
+    int min_lru_index = -1;
 
     trace_qcow2_cache_get(qemu_coroutine_self(), c == s->l2_table_cache,
                           offset, read_from_disk);
 
     /* Check if the table is already cached */
     for (i = 0; i < c->size; i++) {
-        if (c->entries[i].offset == offset) {
+        const Qcow2CachedTable *t = &c->entries[i];
+        if (t->offset == offset) {
             goto found;
         }
+        if (t->ref == 0 && t->lru_counter < min_lru_counter) {
+            min_lru_counter = t->lru_counter;
+            min_lru_index = i;
+        }
+    }
+
+    if (min_lru_index == -1) {
+        /* This can't happen in current synchronous code, but leave the check
+         * here as a reminder for whoever starts using AIO with the cache */
+        abort();
     }
 
-    /* If not, write a table back and replace it */
-    i = qcow2_cache_find_entry_to_replace(c);
+    /* Cache miss: write a table back and replace it */
+    i = min_lru_index;
     trace_qcow2_cache_get_replace_entry(qemu_coroutine_self(),
                                         c == s->l2_table_cache, i);
     if (i < 0) {
-- 
2.1.4




Re: [Qemu-block] [PATCH v4 16/17] qapi: Expose new qcow2 overlap check options

2015-05-06 Thread Max Reitz

On 04.05.2015 21:39, Eric Blake wrote:

On 05/04/2015 01:15 PM, Max Reitz wrote:

Expose the two new options for controlling the memory usage of the
overlap check implementation via QAPI.

Signed-off-by: Max Reitz mre...@redhat.com
---
  qapi/block-core.json | 37 +
  1 file changed, 37 insertions(+)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 1c17224..99456e6 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -1509,6 +1509,39 @@
  'mode':  'Qcow2OverlapCheckMode' } }
  
  ##

+# @Qcow2OverlapStructures
+#
+# Contains options for controlling the behavior of the metadata overlap
+# prevention structures.
+#
+# The primary structure used for overlap check and prevention is a bitmap
+# (actually a bytemap) which has one entry per cluster designating the type(s)
+# of metadata it contains. In order to save memory, there is an RLE-like
+# representation of this bitmap, too.
+#
+# The whole array of clusters is chunked. The RLE representation of one chunk
+# is converted to the equivalent bitmap whenever a cluster in the chunk is
+# accessed. The bitmaps are kept for a limited number of chunks (as a cache).
+# On cache eviction, the bitmap is converted back to RLE again.

Feels a bit verbose at describing the internal representation,
especially if we change the internals at some point in the future.  I'm
more interested that there are two limits at play, and that tuning them
larger may consume more memory but offer better protection on large images.


Well, that works for total-size-limit, but it won't work as an 
explanation for bitmap-size. I could rename it to cache-size and just 
say that it's the size of some (opaque) internal cache, though.



+#
+# @bitmap-size:  #optional size (in bytes) of the bitmap cache, defaults to
+#                64 kB
+#

Any requirement on it being a power of 2, or even a recommendation on
how large or small to make it in relation to an image size?


No requirements and I can't give recommendations. We do have a fixed 
default for the metadata caches (refcount blocks and L2 tables), too, 
and this is basically the same thing when it comes to choosing the right 
size. For a large image, you have multiple e.g. L2 tables and also 
multiple windows concerning the overlap checks; therefore, if you have a 
lot of totally random accesses, both caches will fail. Considering that 
we seem to have accepted this for the metadata caches, I think it should 
be fine here, too.



+# @total-size-limit: #optional maximum total size (in bytes) of all the metadata

What happens if bitmap-size is larger than total-size-limit?


Interesting question. Well, what does happen is that at some point it 
tries to create a bitmap for the cache, and that will fail due to the 
size limit, so the event will get thrown. The problem is that accessing 
any area of the qcow2 file which does not yet have an entry in the 
bitmap cache will result in hitting the limit, so there is no protection 
for these areas.


The problem is that it's difficult to give a limit for the ratio of 
bitmap-size to total-size-limit which always works. A fragments list 
(the RLE representation) will never exceed the bitmap in size, so using 
0.5 as the maximum ratio seems to make sense. However, there are other 
structures as well, such as the array of windows which depends on the 
size of the qcow2 file, so depending on the file, the maximum ratio may 
get infinitely small.


So I guess the best thing to do would be to detect the limit excess 
condition when creating a new bitmap, and then instead of trying to 
occupy a previously empty cache entry, evict a non-empty one (repeat 
until there is enough memory available, or the cache is empty, in which 
case it is justified to complain about hitting the memory limit).
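
Sketched in code, with purely hypothetical helper names, that would be
something like:

    while (total_size + new_bitmap_size > total_size_limit) {
        if (!evict_one_nonempty_bitmap(cache)) {
            /* cache is empty; only now is the memory limit truly hit */
            emit_memory_limit_reached_event(bs);
            break;
        }
        /* each eviction converts a bitmap back to RLE, freeing memory */
    }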





+#                    overlap prevention data structures combined; if this limit
+#                    is exceeded, a QCOW2_OVERLAP_CHECK_MEMORY_LIMIT_REACHED
+#                    event will be emitted and some parts of the image may no
+#                    longer be protected against erroneous overwriting of
+#                    metadata by (meta)data of a different kind. Defaults to
+#                    SIZE_MAX.
+#
+# Since: 2.4
+##
+{ 'type': 'Qcow2OverlapStructures',

Rebase conflict with my series that does s/type/struct/


Will fix.

Thanks for reviewing,

Max


+  'data': { '*bitmap-size':  'int',
+'*total-size-limit': 'int' } }
+
+##
  # @BlockdevOptionsQcow2
  #
  # Driver specific block device options for qcow2.
@@ -1530,6 +1563,9 @@
  # @overlap-check: #optional which overlap checks to perform for writes
  # to the image, defaults to 'cached' (since 2.2)
  #
+# @overlap-structures:#optional options for controlling the behavior of the
+# metadata overlap prevention structures (since 2.4)
+#
  # @cache-size:#optional the 

[Qemu-block] [PATCH 2/7] qcow2: simplify qcow2_cache_put() and qcow2_cache_entry_mark_dirty()

2015-05-06 Thread Alberto Garcia
Since all tables are now stored together, it is possible to obtain
the position of a particular table directly from its address, so the
operation becomes O(1).

Signed-off-by: Alberto Garcia be...@igalia.com
---
 block/qcow2-cache.c | 32 +++-
 1 file changed, 15 insertions(+), 17 deletions(-)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index f0dfb69..6e78c8f 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -49,6 +49,16 @@ static inline void *qcow2_cache_get_table_addr(BlockDriverState *bs,
     return (uint8_t *) c->table_array + (size_t) table * s->cluster_size;
 }
 
+static inline int qcow2_cache_get_table_idx(BlockDriverState *bs,
+                                            Qcow2Cache *c, void *table)
+{
+    BDRVQcowState *s = bs->opaque;
+    ptrdiff_t table_offset = (uint8_t *) table - (uint8_t *) c->table_array;
+    int idx = table_offset / s->cluster_size;
+    assert(idx >= 0 && idx < c->size && table_offset % s->cluster_size == 0);
+    return idx;
+}
+
 Qcow2Cache *qcow2_cache_create(BlockDriverState *bs, int num_tables)
 {
     BDRVQcowState *s = bs->opaque;
@@ -337,16 +347,12 @@ int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset,
 
 int qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table)
 {
-    int i;
+    int i = qcow2_cache_get_table_idx(bs, c, *table);
 
-    for (i = 0; i < c->size; i++) {
-        if (qcow2_cache_get_table_addr(bs, c, i) == *table) {
-            goto found;
-        }
+    if (c->entries[i].offset == 0) {
+        return -ENOENT;
     }
-    return -ENOENT;
 
-found:
     c->entries[i].ref--;
     *table = NULL;
 
@@ -357,15 +363,7 @@ found:
 void qcow2_cache_entry_mark_dirty(BlockDriverState *bs, Qcow2Cache *c,
                                   void *table)
 {
-    int i;
-
-    for (i = 0; i < c->size; i++) {
-        if (qcow2_cache_get_table_addr(bs, c, i) == table) {
-            goto found;
-        }
-    }
-    abort();
-
-found:
+    int i = qcow2_cache_get_table_idx(bs, c, table);
+    assert(c->entries[i].offset != 0);
     c->entries[i].dirty = true;
 }
-- 
2.1.4




[Qemu-block] [PATCH v2 0/7] qcow2 L2/refcount cache improvements

2015-05-06 Thread Alberto Garcia
New version of the qcow2 cache patches:

v2:
- Don't do pointer arithmetic on void *
- Rename table_addr() to qcow2_cache_get_table_addr()
- Add qcow2_cache_get_table_idx()
- Cast cache size to size_t to prevent overflows
- Make qcow2_cache_put() a void function
- Don't store the cluster size in the cache, get it from the BDS instead

v1: https://lists.nongnu.org/archive/html/qemu-devel/2015-04/msg04355.html

Regards,

Berto

Alberto Garcia (7):
  qcow2: use one single memory block for the L2/refcount cache tables
  qcow2: simplify qcow2_cache_put() and qcow2_cache_entry_mark_dirty()
  qcow2: use an LRU algorithm to replace entries from the L2 cache
  qcow2: remove qcow2_cache_find_entry_to_replace()
  qcow2: use a hash to look for entries in the L2 cache
  qcow2: make qcow2_cache_put() a void function
  qcow2: style fixes in qcow2-cache.c

 block/qcow2-cache.c| 169 ++---
 block/qcow2-cluster.c  |  62 +-
 block/qcow2-refcount.c |  37 +++
 block/qcow2.h  |   5 +-
 4 files changed, 104 insertions(+), 169 deletions(-)

-- 
2.1.4




[Qemu-block] [PATCH 7/7] qcow2: style fixes in qcow2-cache.c

2015-05-06 Thread Alberto Garcia
Fix pointer declaration to make it consistent with the rest of the
code.

Signed-off-by: Alberto Garcia be...@igalia.com
---
 block/qcow2-cache.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index 0f629c4..bea43c1 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -35,8 +35,8 @@ typedef struct Qcow2CachedTable {
 } Qcow2CachedTable;
 
 struct Qcow2Cache {
-    Qcow2CachedTable*   entries;
-    struct Qcow2Cache*  depends;
+    Qcow2CachedTable   *entries;
+    struct Qcow2Cache  *depends;
     int                 size;
     bool                depends_on_flush;
     void               *table_array;
@@ -81,7 +81,7 @@ Qcow2Cache *qcow2_cache_create(BlockDriverState *bs, int num_tables)
 return c;
 }
 
-int qcow2_cache_destroy(BlockDriverState* bs, Qcow2Cache *c)
+int qcow2_cache_destroy(BlockDriverState *bs, Qcow2Cache *c)
 {
 int i;
 
-- 
2.1.4




Re: [Qemu-block] [RFC PATCH 6/7] virtio-blk: Don't handle output when there is device IO op blocker

2015-05-06 Thread Paolo Bonzini


On 06/05/2015 14:20, Fam Zheng wrote:
  
  Does non-dataplane need to do anything, since it uses iohandlers rather
  than aio_set_event_notifier_handler?
 I guess it's not needed for this specific bug. See this as an attempt at a
 general-purpose pause mechanism for the device, an investment for the
 future, for example, bdrv_aio_drain. ;-)
 
 I can drop it in v2 if you think the idea is not very helpful.

It's slightly more complicated but I think it's worth the extra coverage
testing that this provides.

Paolo



Re: [Qemu-block] [PATCH v4 13/17] qcow2/overlaps: Add memory limit reached event

2015-05-06 Thread Max Reitz

On 04.05.2015 21:32, Eric Blake wrote:

On 05/04/2015 01:15 PM, Max Reitz wrote:

Later, a mechanism to set a limit on how much memory may be used for the
overlap prevention structures will be introduced. If that limit is about
to be exceeded, a QMP event should be emitted. This very event is
specified by this patch.

Signed-off-by: Max Reitz mre...@redhat.com
---
  docs/qmp/qmp-events.txt | 28 
  qapi/event.json | 27 +++
  2 files changed, 55 insertions(+)
+
+Data:
+- "reference": Device name if set; node name otherwise. (json-string)
+- "start":     Offset of the range of clusters (possibly) no longer being
+               checked for writes overlapping with existing metadata.
+               (json-int, optional)
+- "length":    Length of that range in bytes. (json-int, optional)
+
+Example:
+
+{ "event": "QCOW2_OVERLAP_CHECK_MEMORY_LIMIT_REACHED",
+  "data": { "reference": "virtio0", "start": "805306368",
+            "length": "268435456" },

s/"805306368"/805306368/ and likewise for length (a json-int does not
use quotes).


Oops, right. Will fix.


Otherwise seems okay.


Good, thanks :-)

Max



[Qemu-block] [PATCH 5/7] qcow2: use a hash to look for entries in the L2 cache

2015-05-06 Thread Alberto Garcia
The current cache algorithm traverses the array starting always from
the beginning, so the average number of comparisons needed to perform
a lookup is proportional to the size of the array.

By using a hash of the offset as the starting point, lookups are
faster and independent from the array size.

The hash is computed using the cluster number of the table, multiplied
by 4 to make it perform better when there are collisions.

In my tests, using a cache with 2048 entries, this reduces the average
number of comparisons per lookup from 430 to 2.5.

Signed-off-by: Alberto Garcia be...@igalia.com
---
 block/qcow2-cache.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index dffb887..3137ba8 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -248,6 +248,7 @@ static int qcow2_cache_do_get(BlockDriverState *bs, Qcow2Cache *c,
     BDRVQcowState *s = bs->opaque;
     int i;
     int ret;
+    int lookup_index;
     uint64_t min_lru_counter = UINT64_MAX;
     int min_lru_index = -1;
 
@@ -255,7 +256,8 @@ static int qcow2_cache_do_get(BlockDriverState *bs, Qcow2Cache *c,
                           offset, read_from_disk);
 
     /* Check if the table is already cached */
-    for (i = 0; i < c->size; i++) {
+    i = lookup_index = (offset / s->cluster_size * 4) % c->size;
+    do {
         const Qcow2CachedTable *t = &c->entries[i];
         if (t->offset == offset) {
             goto found;
@@ -264,7 +266,10 @@ static int qcow2_cache_do_get(BlockDriverState *bs, Qcow2Cache *c,
             min_lru_counter = t->lru_counter;
             min_lru_index = i;
         }
-    }
+        if (++i == c->size) {
+            i = 0;
+        }
+    } while (i != lookup_index);
 
 if (min_lru_index == -1) {
 /* This can't happen in current synchronous code, but leave the check
-- 
2.1.4




[Qemu-block] [PATCH] qcow2: Flush pending discards before allocating cluster

2015-05-06 Thread Kevin Wolf
Before a freed cluster can be reused, pending discards for this cluster
must be processed.

The original assumption was that this was not a problem because discards
are only cached during discard/write zeroes operations, which are
synchronous so that no concurrent write requests can cause cluster
allocations.

However, the discard/write zeroes operation itself can allocate a new L2
table (and it has to in order to put zero flags there), so make sure we
can cope with the situation.

This fixes https://bugs.launchpad.net/bugs/1349972.

Signed-off-by: Kevin Wolf kw...@redhat.com
---
 block/qcow2-refcount.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index f47260b..83467c3 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -833,6 +833,11 @@ static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size)
     uint64_t i, nb_clusters, refcount;
     int ret;
 
+    /* We can't allocate clusters if they may still be queued for discard. */
+    if (s->cache_discards) {
+        qcow2_process_discards(bs, 0);
+    }
+
     nb_clusters = size_to_clusters(s, size);
 retry:
     for(i = 0; i < nb_clusters; i++) {
-- 
1.8.3.1




[Qemu-block] [RFC PATCH 7/7] blockdev: Add device IO op blocker during snapshot transaction

2015-05-06 Thread Fam Zheng
Signed-off-by: Fam Zheng f...@redhat.com
---
 blockdev.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/blockdev.c b/blockdev.c
index 5eaf77e..859fa2e 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -1398,6 +1398,7 @@ typedef struct ExternalSnapshotState {
     BlockDriverState *old_bs;
     BlockDriverState *new_bs;
     AioContext *aio_context;
+    Error *blocker;
 } ExternalSnapshotState;
 
 static void external_snapshot_prepare(BlkTransactionState *common,
@@ -1519,6 +1520,9 @@ static void external_snapshot_prepare(BlkTransactionState *common,
     if (ret != 0) {
         error_propagate(errp, local_err);
     }
+
+    error_setg(&state->blocker, "snapshot in progress");
+    bdrv_op_block(state->old_bs, BLOCK_OP_TYPE_DEVICE_IO, state->blocker);
 }
 
 static void external_snapshot_commit(BlkTransactionState *common)
@@ -1536,6 +1540,9 @@ static void external_snapshot_commit(BlkTransactionState *common)
     bdrv_reopen(state->new_bs, state->new_bs->open_flags & ~BDRV_O_RDWR,
                 NULL);
 
+    bdrv_op_unblock(state->old_bs, BLOCK_OP_TYPE_DEVICE_IO, state->blocker);
+    error_free(state->blocker);
+
     aio_context_release(state->aio_context);
 }
 
-- 
1.9.3




[Qemu-block] [RFC PATCH 4/7] block-backend: Add blk_op_blocker_add_notifier

2015-05-06 Thread Fam Zheng
Forward the call to bdrv_op_blocker_add_notifier.

Signed-off-by: Fam Zheng f...@redhat.com
---
 block.c| 4 ++--
 block/block-backend.c  | 6 ++
 include/sysemu/block-backend.h | 2 ++
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/block.c b/block.c
index 054ddb4..d98a304 100644
--- a/block.c
+++ b/block.c
@@ -3414,23 +3414,23 @@ void bdrv_op_block(BlockDriverState *bs, BlockOpType op, Error *reason)
     BdrvOpBlocker *blocker;
     assert((int) op >= 0 && op < BLOCK_OP_TYPE_MAX);
 
-    bdrv_op_blocker_notify(bs, op, reason, true);
     blocker = g_new0(BdrvOpBlocker, 1);
     blocker->reason = reason;
     QLIST_INSERT_HEAD(&bs->op_blockers[op], blocker, list);
+    bdrv_op_blocker_notify(bs, op, reason, true);
 }
 
 void bdrv_op_unblock(BlockDriverState *bs, BlockOpType op, Error *reason)
 {
     BdrvOpBlocker *blocker, *next;
     assert((int) op >= 0 && op < BLOCK_OP_TYPE_MAX);
-    bdrv_op_blocker_notify(bs, op, reason, false);
     QLIST_FOREACH_SAFE(blocker, &bs->op_blockers[op], list, next) {
         if (blocker->reason == reason) {
             QLIST_REMOVE(blocker, list);
             g_free(blocker);
         }
     }
+    bdrv_op_blocker_notify(bs, op, reason, false);
 }
 
 void bdrv_op_block_all(BlockDriverState *bs, Error *reason)
diff --git a/block/block-backend.c b/block/block-backend.c
index 71fc695..90d7476 100644
--- a/block/block-backend.c
+++ b/block/block-backend.c
@@ -806,6 +806,12 @@ void blk_op_unblock_all(BlockBackend *blk, Error *reason)
     bdrv_op_unblock_all(blk->bs, reason);
 }
 
+void blk_op_blocker_add_notifier(BlockBackend *blk,
+                                 Notifier *notifier)
+{
+    bdrv_op_blocker_add_notifier(blk->bs, notifier);
+}
+
 AioContext *blk_get_aio_context(BlockBackend *blk)
 {
     return bdrv_get_aio_context(blk->bs);
diff --git a/include/sysemu/block-backend.h b/include/sysemu/block-backend.h
index b4a4d5e..cde9651 100644
--- a/include/sysemu/block-backend.h
+++ b/include/sysemu/block-backend.h
@@ -136,6 +136,8 @@ int blk_get_flags(BlockBackend *blk);
 int blk_get_max_transfer_length(BlockBackend *blk);
 void blk_set_guest_block_size(BlockBackend *blk, int align);
 void *blk_blockalign(BlockBackend *blk, size_t size);
+void blk_op_blocker_add_notifier(BlockBackend *blk,
+                                 Notifier *notifier);
 bool blk_op_is_blocked(BlockBackend *blk, BlockOpType op, Error **errp);
 void blk_op_unblock(BlockBackend *blk, BlockOpType op, Error *reason);
 void blk_op_block_all(BlockBackend *blk, Error *reason);
-- 
1.9.3




[Qemu-block] [PATCH 3/7] qcow2: use an LRU algorithm to replace entries from the L2 cache

2015-05-06 Thread Alberto Garcia
The current algorithm to evict entries from the cache always gives
preference to those in the lowest positions. As the size of the cache
increases, the chances of the later elements being removed decrease
exponentially.

In a scenario with random I/O and lots of cache misses, entries in
positions 8 and higher are rarely (if ever) evicted. This can be seen
even with the default cache size, but with larger caches the problem
becomes more obvious.

Using an LRU algorithm makes the chances of being removed from the
cache independent from the position.

Signed-off-by: Alberto Garcia be...@igalia.com
---
 block/qcow2-cache.c | 31 +++
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index 6e78c8f..786c10a 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -28,10 +28,10 @@
 #include trace.h
 
 typedef struct Qcow2CachedTable {
-    int64_t offset;
-    bool    dirty;
-    int     cache_hits;
-    int     ref;
+    int64_t  offset;
+    bool     dirty;
+    uint64_t lru_counter;
+    int      ref;
 } Qcow2CachedTable;
 
 struct Qcow2Cache {
@@ -40,6 +40,7 @@ struct Qcow2Cache {
     int                 size;
     bool                depends_on_flush;
     void               *table_array;
+    uint64_t            lru_counter;
 };
 
 static inline void *qcow2_cache_get_table_addr(BlockDriverState *bs,
@@ -233,16 +234,18 @@ int qcow2_cache_empty(BlockDriverState *bs, Qcow2Cache *c)
     for (i = 0; i < c->size; i++) {
         assert(c->entries[i].ref == 0);
         c->entries[i].offset = 0;
-        c->entries[i].cache_hits = 0;
+        c->entries[i].lru_counter = 0;
     }
 
+    c->lru_counter = 0;
+
 return 0;
 }
 
 static int qcow2_cache_find_entry_to_replace(Qcow2Cache *c)
 {
 int i;
-int min_count = INT_MAX;
+uint64_t min_lru_counter = UINT64_MAX;
 int min_index = -1;
 
 
@@ -251,15 +254,9 @@ static int qcow2_cache_find_entry_to_replace(Qcow2Cache *c)
             continue;
         }
 
-        if (c->entries[i].cache_hits < min_count) {
+        if (c->entries[i].lru_counter < min_lru_counter) {
             min_index = i;
-            min_count = c->entries[i].cache_hits;
-        }
-
-        /* Give newer hits priority */
-        /* TODO Check how to optimize the replacement strategy */
-        if (c->entries[i].cache_hits > 1) {
-            c->entries[i].cache_hits /= 2;
+            min_lru_counter = c->entries[i].lru_counter;
         }
 }
 
@@ -318,12 +315,10 @@ static int qcow2_cache_do_get(BlockDriverState *bs, Qcow2Cache *c,
 
     /* Give the table some hits for the start so that it won't be replaced
      * immediately. The number 32 is completely arbitrary. */
-    c->entries[i].cache_hits = 32;
     c->entries[i].offset = offset;
 
     /* And return the right table */
 found:
-    c->entries[i].cache_hits++;
     c->entries[i].ref++;
     *table = qcow2_cache_get_table_addr(bs, c, i);
 
@@ -356,6 +351,10 @@ int qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table)
     c->entries[i].ref--;
     *table = NULL;
 
+    if (c->entries[i].ref == 0) {
+        c->entries[i].lru_counter = ++c->lru_counter;
+    }
+
     assert(c->entries[i].ref >= 0);
 return 0;
 }
-- 
2.1.4




Re: [Qemu-block] [Qemu-devel] [PATCH 6/6] qcow2: style fixes in qcow2-cache.c

2015-05-06 Thread Stefan Hajnoczi
On Thu, Apr 30, 2015 at 01:11:45PM +0300, Alberto Garcia wrote:
 Fix pointer declaration to make it consistent with the rest of the
 code.
 
 Signed-off-by: Alberto Garcia be...@igalia.com
 ---
  block/qcow2-cache.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)

:)

Reviewed-by: Stefan Hajnoczi stefa...@redhat.com




Re: [Qemu-block] [Qemu-devel] [PATCH v2 0/3] block: Warn about usage of growing formats over non-growable protocols

2015-05-06 Thread Paolo Bonzini


On 06/05/2015 18:12, Max Reitz wrote:
 I very much think it would be worth fixing, if there wasn't the problem
 with legitimate use cases throwing unnecessary warnings.

Right.

 I remember having a discussion with Kevin about this series (v1)
 regarding qcow2 on LVM; I think my point was that the warning is
 basically still correct, or only needs rewording (oops, I guess I did
 forget that in v2). If you are using qcow2 on LVM, you need to know
 exactly what you are doing, so a warning about this is indeed
 appropriate (in my opinion, that is).

There's another thing to check.  In the BZ you linked you got an EINVAL
or EIO.  Why didn't you get an ENOSPC?  Can you check if virtio-scsi
gives ENOSPC?

If so, you could perhaps only warn for werror=report.  But even then,
there are legitimate cases where you want the guest to see the ENOSPC.
In fact, that's the reason why virtio-scsi converts ENOSPC to a SCSI
SPACE ALLOCATION FAILED sense code. :)

 So I think if we can word the warning in a way to make it clear that
 there are legitimate use cases, but you need to know what you are doing,
 I think it's worth having this warning. Users who know what they're
 doing won't be surprised or at least will know what it means, while
 users who don't know what it means most probably don't know what they're
 doing and thus the warning is appropriate for them.

I don't know...  But then, I'm not a maintainer of this code. :)

Paolo



Re: [Qemu-block] [Qemu-devel] [PATCH v2 0/3] block: Warn about usage of growing formats over non-growable protocols

2015-05-06 Thread Max Reitz

On 06.05.2015 18:20, Paolo Bonzini wrote:


On 06/05/2015 18:12, Max Reitz wrote:

I very much think it would be worth fixing, if there wasn't the problem
with legitimate use cases throwing unnecessary warnings.

Right.


I remember having a discussion with Kevin about this series (v1)
regarding qcow2 on LVM; I think my point was that the warning is
basically still correct, or only needs rewording (oops, I guess I did
forget that in v2). If you are using qcow2 on LVM, you need to know
exactly what you are doing, so a warning about this is indeed
appropriate (in my opinion, that is).

There's another thing to check.  In the BZ you linked you got an EINVAL
or EIO.  Why didn't you get an ENOSPC?


Because qcow2 tries to write beyond the end of the file; the NBD client 
implementation passes that on to the server, and the server simply 
reports an error (which the NBD client turns into EIO).


We could make the NBD client detect this condition and report ENOSPC 
immediately. But I don't think this would improve matters, for people 
would then complain "Linux reports 'no space left on device' in qemu, 
while df reports that there is enough space available". It's the same 
thing, people don't know what they're doing and nobody warned them that 
what they are doing might be wrong.



Can you check if virtio-scsi
gives ENOSPC?


In which configuration? Using virtio-scsi on top of qcow2 on top of some 
SCSI passthrough block driver?



If so, you could perhaps only warn for werror=report.  But even then,
there are legitimate cases where you want the guest to see the ENOSPC.
In fact, that's the reason why virtio-scsi converts ENOSPC to a SCSI
SPACE ALLOCATION FAILED sense code. :)


Sounds like we ought to make NBD return ENOSPC no matter the fate of 
this series.


The problem with only warning for a certain non-default configuration is 
that people who don't know what they are doing are more likely to use 
the default configuration, so I'd like the warning to appear then.



So I think if we can word the warning in a way to make it clear that
there are legitimate use cases, but you need to know what you are doing,
I think it's worth having this warning. Users who know what they're
doing won't be surprised or at least will know what it means, while
users who don't know what it means most probably don't know what they're
doing and thus the warning is appropriate for them.

I don't know...  But then, I'm not a maintainer of this code. :)


Well, this is not about this code in particular, but more about qemu's 
interface design in general, so I'm grateful about any opinion on it. :-)


Max



Re: [Qemu-block] [Qemu-devel] [PATCH v2 0/3] block: Warn about usage of growing formats over non-growable protocols

2015-05-06 Thread Max Reitz

On 06.05.2015 17:30, Paolo Bonzini wrote:


On 06/05/2015 15:04, Max Reitz wrote:

Introducing a warning for a normal QEMU invocation is a bit weird.

What is the point of this series?  Were users confused that they hit
ENOSPC?

Users were confused when exporting a qcow2 image using nbd-server
instead of qemu-img, and then accessing that NBD export with qemu
(subsequently getting I/O errors on guest writes, if the image is not
yet fully allocated): http://bugzilla.redhat.com/show_bug.cgi?id=1090713

I think NBD exports of non-raw images fall within the case of "user
should know what they're doing".  In particular, you don't even need
metadata preallocation, you just need a "truncate -s 10G file.qcow2"
before invoking nbd-server.


Well, actually, you need to export it with qemu-nbd instead of 
nbd-server; which this series is trying to tell the users.



So I think it's not worth fixing this, even though I see how it can be a
minor UI/UX issue.


I very much think it would be worth fixing, if there wasn't the problem 
with legitimate use cases throwing unnecessary warnings.


I remember having a discussion with Kevin about this series (v1) 
regarding qcow2 on LVM; I think my point was that the warning is 
basically still correct, or only needs rewording (oops, I guess I did 
forget that in v2). If you are using qcow2 on LVM, you need to know 
exactly what you are doing, so a warning about this is indeed 
appropriate (in my opinion, that is).


So I think if we can word the warning in a way to make it clear that 
there are legitimate use cases, but you need to know what you are doing, 
I think it's worth having this warning. Users who know what they're 
doing won't be surprised or at least will know what it means, while 
users who don't know what it means most probably don't know what they're 
doing and thus the warning is appropriate for them.


And if you're using management software (which hopefully does know what 
it's doing), the warning shouldn't be prominently visible anyway (in 
case of using libvirt, stderr is written to a log file, right?).


Max



Re: [Qemu-block] [Qemu-devel] [PATCH] use bdrv_flush to provide barrier semantic in block/vdi.c for metadata updates

2015-05-06 Thread Max Reitz

CC-ing qemu-block and Stefan Weil (maintainer of vdi).

On 06.05.2015 19:23, phoeagon wrote:

Thanks for your input.

So I changed it to:
 1. Only call bdrv_flush when bdrv_pwrite was successful.
 2. Update the return value of vdi_co_write only if bdrv_flush was
 unsuccessful.


 One of the two is enough. Both are too much. :-)

It is indeed correct, technically (because ret is 0 before the 
bdrv_write()), but it's too verbose. (See below)


In this way we try to avoid messing up any potential return value 
checks possible while still propagating bdrv_flush errors.
 That return value was a catch and I admit I'm no pro with the return 
 value convention in QEMU. bdrv_pwrite doesn't return the same value as 
 bdrv_pwrite_sync, I assume (they both return negative values when they 
 fail, but different values when successful).


It doesn't really matter, I think. Returning any non-negative value from 
vdi_co_write() should be enough to signal success.



---


Signed-off-by: Zhe Qiu address@hidden

From 19b2fabbe00765b418362d8c1891f266091621f3 Mon Sep 17 00:00:00 2001
From: phoeagon address-hidden
Date: Thu, 7 May 2015 01:09:38 +0800
Subject: [PATCH] block/vdi: Use bdrv_flush after metadata updates

In reference to 
b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2, 
metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to 
succeeding writes.


---
 block/vdi.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/vdi.c b/block/vdi.c
index 5d09b36..54a5fa8 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -713,7 +713,11 @@ static int vdi_co_write(BlockDriverState *bs,
         logout("will write %u block map sectors starting from entry %u\n",
                n_sectors, bmap_first);
         ret = bdrv_write(bs->file, offset, base, n_sectors);
+        if (!(ret < 0)) {
+            int flush_ret = bdrv_flush(bs->file);
+            if (flush_ret < 0)
+                ret = flush_ret;
+        }


I think bdrv_write() always returns 0 on success. In any case, it's fine 
for vdi_co_write() to return 0 on success (which is what bdrv_flush() 
returns), so shortening these four lines to "ret = bdrv_flush(bs->file);" 
is enough.


The patch is correct, though, so if you want to leave it as it is, all 
you need to do is bring it into proper form 
(http://wiki.qemu.org/Contribute/SubmitAPatch).


The previous version was nearly right, except for the things I 
mentioned: The subject needs to start with the part of qemu the patch is 
targeting (in this case "block/vdi: " or simply "vdi: "), the 
Signed-off-by needs to contain your name (or any alias you desire) and 
your email address, and comments for the patch should be separated from 
the actual commit message by "---".


Finally, for sending the next version, please change the "[PATCH]" in 
the subject to "[PATCH v3]" in order to indicate that it will be version 
3 of this patch.


Thanks!

Max


 }
 return ret;
--
2.4.0




Re: [Qemu-block] [PATCH COLO v3 10/14] util/hbitmap: Add an API to reset all set bits in hbitmap

2015-05-06 Thread Wen Congyang
On 05/02/2015 12:47 AM, John Snow wrote:
 
 
 On 04/03/2015 07:05 AM, Paolo Bonzini wrote:


 On 03/04/2015 12:01, Wen Congyang wrote:
 Signed-off-by: Wen Congyang we...@cn.fujitsu.com
 Signed-off-by: zhanghailiang zhang.zhanghaili...@huawei.com
 Signed-off-by: Gonglei arei.gong...@huawei.com
 ---
   include/qemu/hbitmap.h |  8 
   tests/test-hbitmap.c   | 39 +++
   util/hbitmap.c | 16 
   3 files changed, 63 insertions(+)

 diff --git a/include/qemu/hbitmap.h b/include/qemu/hbitmap.h
 index 550d7ce..95a55e4 100644
 --- a/include/qemu/hbitmap.h
 +++ b/include/qemu/hbitmap.h
  @@ -109,6 +109,14 @@ void hbitmap_set(HBitmap *hb, uint64_t start, uint64_t count);
   void hbitmap_reset(HBitmap *hb, uint64_t start, uint64_t count);

   /**
 + * hbitmap_reset_all:
 + * @hb: HBitmap to operate on.
 + *
 + * Reset all bits in an HBitmap.
 + */
 +void hbitmap_reset_all(HBitmap *hb);
 +
 +/**
* hbitmap_get:
* @hb: HBitmap to operate on.
* @item: Bit to query (0-based).
 diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c
 index 8c902f2..1f0078a 100644
 --- a/tests/test-hbitmap.c
 +++ b/tests/test-hbitmap.c
 @@ -11,6 +11,7 @@

    #include <glib.h>
    #include <stdarg.h>
  +#include <string.h>
    #include "qemu/hbitmap.h"

   #define LOG_BITS_PER_LONG  (BITS_PER_LONG == 32 ? 5 : 6)
 @@ -143,6 +144,23 @@ static void hbitmap_test_reset(TestHBitmapData *data,
   }
   }

 +static void hbitmap_test_reset_all(TestHBitmapData *data)
 +{
  +    size_t n;
  +
  +    hbitmap_reset_all(data->hb);
  +
  +    n = (data->size + BITS_PER_LONG - 1) / BITS_PER_LONG;
  +    if (n == 0) {
  +        n = 1;
  +    }
  +    memset(data->bits, 0, n * sizeof(unsigned long));
  +
  +    if (data->granularity == 0) {
  +        hbitmap_test_check(data, 0);
  +    }
  +}
 +
   static void hbitmap_test_check_get(TestHBitmapData *data)
   {
   uint64_t count = 0;
 @@ -323,6 +341,26 @@ static void test_hbitmap_reset(TestHBitmapData *data,
   hbitmap_test_set(data, L3 / 2, L3);
   }

  +static void test_hbitmap_reset_all(TestHBitmapData *data,
  +                                   const void *unused)
  +{
  +    hbitmap_test_init(data, L3 * 2, 0);
  +    hbitmap_test_set(data, L1 - 1, L1 + 2);
  +    hbitmap_test_reset_all(data);
  +    hbitmap_test_set(data, 0, L1 * 3);
  +    hbitmap_test_reset_all(data);
  +    hbitmap_test_set(data, L2, L1);
  +    hbitmap_test_reset_all(data);
  +    hbitmap_test_set(data, L2, L3 - L2 + 1);
  +    hbitmap_test_reset_all(data);
  +    hbitmap_test_set(data, L3 - 1, 3);
  +    hbitmap_test_reset_all(data);
  +    hbitmap_test_set(data, 0, L3 * 2);
  +    hbitmap_test_reset_all(data);
  +    hbitmap_test_set(data, L3 / 2, L3);
  +    hbitmap_test_reset_all(data);
  +}
 +
   static void test_hbitmap_granularity(TestHBitmapData *data,
const void *unused)
   {
 @@ -394,6 +432,7 @@ int main(int argc, char **argv)
    hbitmap_test_add("/hbitmap/set/overlap", test_hbitmap_set_overlap);
    hbitmap_test_add("/hbitmap/reset/empty", test_hbitmap_reset_empty);
    hbitmap_test_add("/hbitmap/reset/general", test_hbitmap_reset);
  +hbitmap_test_add("/hbitmap/reset/all", test_hbitmap_reset_all);
    hbitmap_test_add("/hbitmap/granularity", test_hbitmap_granularity);
   g_test_run();

 diff --git a/util/hbitmap.c b/util/hbitmap.c
 index ab13971..acce93c 100644
 --- a/util/hbitmap.c
 +++ b/util/hbitmap.c
  @@ -353,6 +353,22 @@ void hbitmap_reset(HBitmap *hb, uint64_t start, uint64_t count)
    hb_reset_between(hb, HBITMAP_LEVELS - 1, start, last);
    }

  +void hbitmap_reset_all(HBitmap *hb)
  +{
  +    uint64_t size = hb->size;
  +    unsigned int i;
  +
  +    /* Same as hbitmap_alloc() except memset() */
  +    for (i = HBITMAP_LEVELS; --i >= 1; ) {
  +        size = MAX((size + BITS_PER_LONG - 1) >> BITS_PER_LEVEL, 1);
  +        memset(hb->levels[i], 0, size * sizeof(unsigned long));
  +    }
  +
 
  For what it's worth, I recently added in a hb->sizes[i] cache to store the 
  size of each array so you don't have to recompute this all the time.

Yes, will fix it in the next version.

Thanks
Wen Congyang
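
(For reference, a minimal sketch of that loop using the hb->sizes[] cache
John mentions; the field name is taken from his description, not verified
against his series:)

    for (i = HBITMAP_LEVELS; --i >= 1; ) {
        memset(hb->levels[i], 0, hb->sizes[i] * sizeof(unsigned long));
    }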

 
  +    assert(size == 1);
  +    hb->levels[0][0] = 1UL << (BITS_PER_LONG - 1);
  +    hb->count = 0;
 +}
 +
   bool hbitmap_get(const HBitmap *hb, uint64_t item)
   {
   /* Compute position and bit in the last layer.  */


 Acked-by: Paolo Bonzini pbonz...@redhat.com

 




[Qemu-block] [PATCH v3] block/vdi: Use bdrv_flush after metadata updates

2015-05-06 Thread Zhe Qiu
From: phoeagon phoea...@gmail.com

In reference to 
b0ad5a455d7e5352d4c86ba945112011dbeadfb8~078a458e077d6b0db262c4b05fee51d01de2d1d2,
 metadata writes to qcow2/cow/qcow/vpc/vmdk are all synced prior to succeeding 
writes.

bdrv_flush is called only when the write is successful.

Signed-off-by: Zhe Qiu phoea...@gmail.com
---
 block/vdi.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/block/vdi.c b/block/vdi.c
index 7642ef3..dfe8ade 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -713,6 +713,9 @@ static int vdi_co_write(BlockDriverState *bs,
         logout("will write %u block map sectors starting from entry %u\n",
                n_sectors, bmap_first);
         ret = bdrv_write(bs->file, offset, base, n_sectors);
+        if (ret >= 0) {
+            ret = bdrv_flush(bs->file);
+        }
 }
 
 return ret;
-- 
2.4.0




Re: [Qemu-block] [Qemu-devel] [PATCH v2 0/3] block: Warn about usage of growing formats over non-growable protocols

2015-05-06 Thread Paolo Bonzini


On 06/05/2015 15:04, Max Reitz wrote:

 Introducing a warning for a normal QEMU invocation is a bit weird.

 What is the point of this series?  Were users confused that they hit
 ENOSPC?
 
 Users were confused when exporting a qcow2 image using nbd-server
 instead of qemu-img, and then accessing that NBD export with qemu
 (subsequently getting I/O errors on guest writes, if the image is not
 yet fully allocated): http://bugzilla.redhat.com/show_bug.cgi?id=1090713

I think NBD exports of non-raw images fall within the case of "user
should know what they're doing".  In particular, you don't even need
metadata preallocation, you just need a "truncate -s 10G file.qcow2"
before invoking nbd-server.

So I think it's not worth fixing this, even though I see how it can be a
minor UI/UX issue.

Paolo



Re: [Qemu-block] [Qemu-devel] [PATCH 2/6] qcow2: simplify qcow2_cache_put() and qcow2_cache_entry_mark_dirty()

2015-05-06 Thread Stefan Hajnoczi
On Tue, May 05, 2015 at 03:06:52PM +0200, Alberto Garcia wrote:
 On Fri 01 May 2015 04:31:52 PM CEST, Stefan Hajnoczi wrote:
 
   int qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table)
   {
  -    int i;
  +    int i = (*table - c->table_array) / c->table_size;
   
  -    for (i = 0; i < c->size; i++) {
  -        if (table_addr(c, i) == *table) {
  -            goto found;
  -        }
  +    if (c->entries[i].offset == 0) {
  +        return -ENOENT;
       }
  -    return -ENOENT;
   
  -found:
       c->entries[i].ref--;
       *table = NULL;
   
 
  When is this function called with a bogus table pointer?
 
 I also could not imagine any such scenario, but I decided to be
 conservative and keep the error handling code. I'll double check all
 places where it's used and remove the relevant code.

The reason why I'd think this is worth looking into is:

The new qcow2_cache_put() code indexes outside the array bounds if table
is a bogus value.  The old code would return -ENOENT.  So I am a little
nervous about the change, although I suspect the function is never
called with invalid inputs at all.
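
One defensive option, as a sketch only, would be to validate the pointer
before indexing, mirroring the checks in the new qcow2_cache_get_table_idx():

    ptrdiff_t off = (uint8_t *) *table - (uint8_t *) c->table_array;
    assert(off >= 0 && off % c->table_size == 0 &&
           off / c->table_size < c->size);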

Stefan




Re: [Qemu-block] [RFC PATCH 3/7] block: Add op blocker notifier list

2015-05-06 Thread Fam Zheng
On Wed, 05/06 16:22, Paolo Bonzini wrote:
 
 
 On 06/05/2015 13:23, Fam Zheng wrote:
   void bdrv_op_block(BlockDriverState *bs, BlockOpType op, Error *reason)
   {
       BdrvOpBlocker *blocker;
       assert((int) op >= 0 && op < BLOCK_OP_TYPE_MAX);
   
  +    bdrv_op_blocker_notify(bs, op, reason, true);
       blocker = g_new0(BdrvOpBlocker, 1);
       blocker->reason = reason;
       QLIST_INSERT_HEAD(&bs->op_blockers[op], blocker, list);
  @@ -3405,6 +3424,7 @@ void bdrv_op_unblock(BlockDriverState *bs, BlockOpType op, Error *reason)
   {
       BdrvOpBlocker *blocker, *next;
       assert((int) op >= 0 && op < BLOCK_OP_TYPE_MAX);
  +    bdrv_op_blocker_notify(bs, op, reason, false);
       QLIST_FOREACH_SAFE(blocker, &bs->op_blockers[op], list, next) {
           if (blocker->reason == reason) {
               QLIST_REMOVE(blocker, list);
 
 Changed in the following patch.

Bad git rebase -i, will fix.

 
 Also, should we only invoke the notifier if the list becomes
 empty/non-empty?  Is there any reason why the notifier would like to
 know the particular Error that was added?
 

Good point, this can be simplified.
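
Something like this sketch (notify only on the empty/non-empty transition;
QLIST_EMPTY is from qemu/queue.h):

    void bdrv_op_block(BlockDriverState *bs, BlockOpType op, Error *reason)
    {
        BdrvOpBlocker *blocker;
        bool was_empty = QLIST_EMPTY(&bs->op_blockers[op]);

        blocker = g_new0(BdrvOpBlocker, 1);
        blocker->reason = reason;
        QLIST_INSERT_HEAD(&bs->op_blockers[op], blocker, list);
        if (was_empty) {
            bdrv_op_blocker_notify(bs, op, reason, true);
        }
    }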

Fam



Re: [Qemu-block] [PATCH 1/6] qcow2: use one single memory block for the L2/refcount cache tables

2015-05-06 Thread Stefan Hajnoczi
On Tue, May 05, 2015 at 01:20:19PM +0200, Kevin Wolf wrote:
 Am 05.05.2015 um 12:28 hat Stefan Hajnoczi geschrieben:
  On Mon, May 04, 2015 at 12:58:13PM +0200, Kevin Wolf wrote:
   Am 01.05.2015 um 16:23 hat Stefan Hajnoczi geschrieben:
On Thu, Apr 30, 2015 at 01:11:40PM +0300, Alberto Garcia wrote:
  Qcow2Cache *qcow2_cache_create(BlockDriverState *bs, int num_tables)
  {
      BDRVQcowState *s = bs->opaque;
      Qcow2Cache *c;
 -    int i;
  
      c = g_new0(Qcow2Cache, 1);
      c->size = num_tables;
 +    c->table_size = s->cluster_size;

This assumes c->table_size meets bs' memory alignment requirements.  The
following would be safer:

c->table_size = QEMU_ALIGN_UP(c->table_size, bdrv_opt_mem_align(bs->file));
   
   You can't just access more than one cluster. You might be caching data
   and later write it back when it's stale.
  
  I don't mean I/O alignment, just memory alignment (i.e. the start
  address of the data buffer in memory).
 
 The start address is already taken care of by qemu_blockalign(). By
 rounding c->table_size, you'd align the length, and that would be wrong.

It wasn't clear to me that c->table + n * c->table_size for all n is
aligned, but I agree with you now:

bdrv_qiov_is_aligned() checks both address and size against memory
alignment.  This means that if the I/O is memory aligned for table_size,
then all multiples of table_size are also aligned.

Stefan

