date:20180213

[Qemu-block] [PATCH 1/2] qcow2: Prefer 'entries' over 'size' for non-byte values in spec

2018-02-13 Thread Eric Blake

We want to limit the use of the term 'size' for only values that
count by bytes.  Renaming fields in the spec does not invalidate
any existing implementation, but may make future implementations
easier to write.

A reasonable followup would be to rename internal qemu code that
operates on qcow2 images to also use the distinction between
size and entries in variable names.

Signed-off-by: Eric Blake 
---
 docs/interop/qcow2.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
index d7fdb1fee31..597d3f261d5 100644
--- a/docs/interop/qcow2.txt
+++ b/docs/interop/qcow2.txt
@@ -47,7 +47,7 @@ The first cluster of a qcow2 image contains the file header:
 1 for AES encryption
 2 for LUKS encryption

- 36 - 39:   l1_size
+ 36 - 39:   l1_entries
 Number of entries in the active L1 table

  40 - 47:   l1_table_offset
@@ -538,7 +538,7 @@ Structure of a bitmap directory entry:
 (described below) for the bitmap starts. Must be aligned to
 a cluster boundary.

- 8 - 11:bitmap_table_size
+ 8 - 11:bitmap_table_entries
 Number of entries in the bitmap table of the bitmap.

 12 - 15:flags
-- 
2.14.3

[Qemu-block] [RFC PATCH 0/2] s/size/entries/ when dealing with non-byte units

2018-02-13 Thread Eric Blake

I mentioned this while reviewing Berto's series on L2 slice handling;
this is a first cut at patches that I think are worth doing throughout
the qcow2 code base if we like the idea.

Eric Blake (2):
  qcow2: Prefer 'entries' over 'size' for non-byte values in spec
  qcow2: Prefer 'entries' over 'size' during cache creation

 docs/interop/qcow2.txt |  4 ++--
 block/qcow2.h  |  4 ++--
 block/qcow2.c  | 21 +++--
 3 files changed, 15 insertions(+), 14 deletions(-)

-- 
2.14.3

[Qemu-block] [PATCH 2/2] qcow2: Prefer 'entries' over 'size' during cache creation

2018-02-13 Thread Eric Blake

Using 'size' for anything other than bytes is difficult to
reason about; let's rename entries related to the number of
entries in a cache accordingly.

Signed-off-by: Eric Blake 
---
 block/qcow2.h |  4 ++--
 block/qcow2.c | 21 +++--
 2 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 883802241fb..0daf8e6d6f8 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -68,10 +68,10 @@
 #define MAX_CLUSTER_BITS 21

 /* Must be at least 2 to cover COW */
-#define MIN_L2_CACHE_SIZE 2 /* cache entries */
+#define MIN_L2_CACHE_ENTRIES 2

 /* Must be at least 4 to cover all cases of refcount table growth */
-#define MIN_REFCOUNT_CACHE_SIZE 4 /* clusters */
+#define MIN_REFCOUNT_CACHE_ENTRIES 4

 /* Whichever is more */
 #define DEFAULT_L2_CACHE_CLUSTERS 8 /* clusters */
diff --git a/block/qcow2.c b/block/qcow2.c
index 288b5299d80..f25c33df1d1 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -843,6 +843,7 @@ static int qcow2_update_options_prepare(BlockDriverState 
*bs,
 const char *opt_overlap_check, *opt_overlap_check_template;
 int overlap_check_template = 0;
 uint64_t l2_cache_size, l2_cache_entry_size, refcount_cache_size;
+uint64_t l2_cache_entries, refcount_cache_entries;
 int i;
 const char *encryptfmt;
 QDict *encryptopts = NULL;
@@ -869,21 +870,21 @@ static int qcow2_update_options_prepare(BlockDriverState 
*bs,
 goto fail;
 }

-l2_cache_size /= l2_cache_entry_size;
-if (l2_cache_size < MIN_L2_CACHE_SIZE) {
-l2_cache_size = MIN_L2_CACHE_SIZE;
+l2_cache_entries = l2_cache_size / l2_cache_entry_size;
+if (l2_cache_entries < MIN_L2_CACHE_ENTRIES) {
+l2_cache_entries = MIN_L2_CACHE_ENTRIES;
 }
-if (l2_cache_size > INT_MAX) {
+if (l2_cache_entries > INT_MAX) {
 error_setg(errp, "L2 cache size too big");
 ret = -EINVAL;
 goto fail;
 }

-refcount_cache_size /= s->cluster_size;
-if (refcount_cache_size < MIN_REFCOUNT_CACHE_SIZE) {
-refcount_cache_size = MIN_REFCOUNT_CACHE_SIZE;
+refcount_cache_entries = refcount_cache_size / s->cluster_size;
+if (refcount_cache_entries < MIN_REFCOUNT_CACHE_ENTRIES) {
+refcount_cache_entries = MIN_REFCOUNT_CACHE_ENTRIES;
 }
-if (refcount_cache_size > INT_MAX) {
+if (refcount_cache_entries > INT_MAX) {
 error_setg(errp, "Refcount cache size too big");
 ret = -EINVAL;
 goto fail;
@@ -908,9 +909,9 @@ static int qcow2_update_options_prepare(BlockDriverState 
*bs,
 }

 r->l2_slice_size = l2_cache_entry_size / sizeof(uint64_t);
-r->l2_table_cache = qcow2_cache_create(bs, l2_cache_size,
+r->l2_table_cache = qcow2_cache_create(bs, l2_cache_entries,
l2_cache_entry_size);
-r->refcount_block_cache = qcow2_cache_create(bs, refcount_cache_size,
+r->refcount_block_cache = qcow2_cache_create(bs, refcount_cache_entries,
  s->cluster_size);
 if (r->l2_table_cache == NULL || r->refcount_block_cache == NULL) {
 error_setg(errp, "Could not allocate metadata caches");
-- 
2.14.3

Re: [Qemu-block] qcow2 images corruption

2018-02-13 Thread John Snow



On 02/13/2018 04:41 AM, Kevin Wolf wrote:
> Am 07.02.2018 um 18:06 hat Nicolas Ecarnot geschrieben:
>> TL; DR : qcow2 images keep getting corrupted. Any workaround?
> 
> Not without knowing the cause.
> 
> The first thing to make sure is that the image isn't touched by a second
> process while QEMU is running a VM. The classic one is using 'qemu-img
> snapshot' on the image of a running VM, which is instant corruption (and
> newer QEMU versions have locking in place to prevent this), but we have
> seen more absurd cases of things outside QEMU tampering with the image
> when we were investigating previous corruption reports.
> 
> This covers the majority of all reports, we haven't had a real
> corruption caused by a QEMU bug in ages.
> 
>> After having found (https://access.redhat.com/solutions/1173623) the right
>> logical volume hosting the qcow2 image, I can run qemu-img check on it.
>> - On 80% of my VMs, I find no errors.
>> - On 15% of them, I find Leaked cluster errors that I can correct using
>> "qemu-img check -r all"
>> - On 5% of them, I find Leaked clusters errors and further fatal errors,
>> which can not be corrected with qemu-img.
>> In rare cases, qemu-img can correct them, but destroys large parts of the
>> image (becomes unusable), and on other cases it can not correct them at all.
> 
> It would be good if you could make the 'qemu-img check' output available
> somewhere.
> 
> It would be even better if we could have a look at the respective image.
> I seem to remember that John (CCed) had a few scripts to analyse
> corrupted qcow2 images, maybe we would be able to see something there.
> 

Hi! I did write a pretty simplistic tool for trying to tell the shape of
a corruption at a glance. It seems to work pretty similarly to the other
tool you already found, but it won't hurt anything to run it:

https://github.com/jnsnow/qcheck

(Actually, that other tool looks like it has an awful lot of options.
I'll have to check it out.)

It can print a really upsetting amount of data (especially for very
corrupt images), but in the default case, the simple setting should do
the trick just fine.

You could always put the output from this tool in a pastebin too; it
might help me visualize the problem a bit more -- I find seeing the
exact offsets and locations of where all the various tables and things
to be pretty helpful.

You can also always use the "deluge" option and compress it if you want,
just don't let it print to your terminal:

jsnow@probe (dev) ~/s/qcheck> ./qcheck -xd
/home/bos/jsnow/src/qemu/bin/git/install_test_f26.qcow2 > deluge.log;
and ls -sh deluge.log
4.3M deluge.log

but it compresses down very well:

jsnow@probe (dev) ~/s/qcheck> 7z a -t7z -m0=ppmd deluge.ppmd.7z deluge.log
jsnow@probe (dev) ~/s/qcheck> ls -s deluge.ppmd.7z
316 deluge.ppmd.7z

So I suppose if you want to send along:
(1) The basic output without any flags, in a pastebin
(2) The zipped deluge output, just in case

and I will try my hand at guessing what went wrong.


(Also, maybe my tool will totally choke for your image, who knows. It
hasn't received an overwhelming amount of testing apart from when I go
to use it personally and inevitably wind up displeased with how it
handles certain situations, so ...)

>> What I read similar to my case is :
>> - usage of qcow2
>> - heavy disk I/O
>> - using the virtio-blk driver
>>
>> In the proxmox thread, they tend to say that using virtio-scsi is the
>> solution. Having asked this question to oVirt experts
>> (https://lists.ovirt.org/pipermail/users/2018-February/086753.html) but it's
>> not clear the driver is to blame.
> 
> This seems very unlikely. The corruption you're seeing is in the qcow2
> metadata, not only in the guest data. If anything, virtio-scsi exercises
> more qcow2 code paths than virtio-blk, so any potential bug that affects
> virtio-blk should also affect virtio-scsi, but not the other way around.
> 
>> I agree with the answer Yaniv Kaul gave to me, saying I have to properly
>> report the issue, so I'm longing to know which peculiar information I can
>> give you now.
> 
> To be honest, debugging corruption after the fact is pretty hard. We'd
> need the 'qemu-img check' output and ideally the image to do anything,
> but I can't promise that anything would come out of this.
> 
> Best would be a reproducer, or at least some operation that you can link
> to the appearance of the corruption. Then we could take a more targeted
> look at the respective code.
> 
>> As you can imagine, all this setup is in production, and for most of the
>> VMs, I can not "play" with them. Moreover, we launched a campaign of nightly
>> stopping every VM, qemu-img check them one by one, then boot.
>> So it might take some time before I find another corrupted image.
>> (which I'll preciously store for debug)
>>
>> Other informations : We very rarely do snapshots, but I'm close to imagine
>> that automated migrations of VMs could trigger similar behaviors on qcow2
>> images.
> 
> To my

[Qemu-block] [PATCH v8 21/21] block: Drop unused .bdrv_co_get_block_status()

2018-02-13 Thread Eric Blake

We are gradually moving away from sector-based interfaces, towards
byte-based.  Now that all drivers have been updated to provide the
byte-based .bdrv_co_block_status(), we can delete the sector-based
interface.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Fam Zheng 

---
v7: no change
v6: rebase to changes in patch 1, drop R-b
v5: rebase to master
v4: rebase to interface tweak
v3: no change
v2: rebase to earlier changes
---
 include/block/block_int.h |  3 ---
 block/io.c| 50 ++-
 2 files changed, 10 insertions(+), 43 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index bf2598856cf..5ae7738cf8d 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -215,9 +215,6 @@ struct BlockDriver {
  * as well as non-NULL pnum, map, and file; in turn, the driver
  * must return an error or set pnum to an aligned non-zero value.
  */
-int64_t coroutine_fn (*bdrv_co_get_block_status)(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors, int *pnum,
-BlockDriverState **file);
 int coroutine_fn (*bdrv_co_block_status)(BlockDriverState *bs,
 bool want_zero, int64_t offset, int64_t bytes, int64_t *pnum,
 int64_t *map, BlockDriverState **file);
diff --git a/block/io.c b/block/io.c
index 5bae79f282e..4c3dba09730 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1963,7 +1963,7 @@ static int coroutine_fn 
bdrv_co_block_status(BlockDriverState *bs,

 /* Must be non-NULL or bdrv_getlength() would have failed */
 assert(bs->drv);
-if (!bs->drv->bdrv_co_get_block_status && !bs->drv->bdrv_co_block_status) {
+if (!bs->drv->bdrv_co_block_status) {
 *pnum = bytes;
 ret = BDRV_BLOCK_DATA | BDRV_BLOCK_ALLOCATED;
 if (offset + bytes == total_size) {
@@ -1981,53 +1981,23 @@ static int coroutine_fn 
bdrv_co_block_status(BlockDriverState *bs,

 /* Round out to request_alignment boundaries */
 align = bs->bl.request_alignment;
-if (bs->drv->bdrv_co_get_block_status && align < BDRV_SECTOR_SIZE) {
-align = BDRV_SECTOR_SIZE;
-}
 aligned_offset = QEMU_ALIGN_DOWN(offset, align);
 aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;

-if (bs->drv->bdrv_co_get_block_status) {
-int count; /* sectors */
-int64_t longret;
-
-assert(QEMU_IS_ALIGNED(aligned_offset | aligned_bytes,
-   BDRV_SECTOR_SIZE));
-/*
- * The contract allows us to return pnum smaller than bytes, even
- * if the next query would see the same status; we truncate the
- * request to avoid overflowing the driver's 32-bit interface.
- */
-longret = bs->drv->bdrv_co_get_block_status(
-bs, aligned_offset >> BDRV_SECTOR_BITS,
-MIN(INT_MAX, aligned_bytes) >> BDRV_SECTOR_BITS, ,
-_file);
-if (longret < 0) {
-assert(INT_MIN <= longret);
-ret = longret;
-goto out;
-}
-if (longret & BDRV_BLOCK_OFFSET_VALID) {
-local_map = longret & BDRV_BLOCK_OFFSET_MASK;
-}
-ret = longret & ~BDRV_BLOCK_OFFSET_MASK;
-*pnum = count * BDRV_SECTOR_SIZE;
-} else {
-ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
-aligned_bytes, pnum, _map,
-_file);
-if (ret < 0) {
-*pnum = 0;
-goto out;
-}
-assert(*pnum); /* The block driver must make progress */
+ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
+aligned_bytes, pnum, _map,
+_file);
+if (ret < 0) {
+*pnum = 0;
+goto out;
 }

 /*
- * The driver's result must be a multiple of request_alignment.
+ * The driver's result must be a non-zero multiple of request_alignment.
  * Clamp pnum and adjust map to original request.
  */
-assert(QEMU_IS_ALIGNED(*pnum, align) && align > offset - aligned_offset);
+assert(*pnum && QEMU_IS_ALIGNED(*pnum, align) &&
+   align > offset - aligned_offset);
 *pnum -= offset - aligned_offset;
 if (*pnum > bytes) {
 *pnum = bytes;
-- 
2.14.3

[Qemu-block] [PATCH v8 19/21] vpc: Switch to .bdrv_co_block_status()

2018-02-13 Thread Eric Blake

We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the vpc driver accordingly.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Fam Zheng 

---
v7: tweak commit message and type of 'n' [Fam]
v6: no change
v5: fix incorrect rounding in 'map' and bad loop condition [Vladimir]
v4: rebase to interface tweak
v3: rebase to master
v2: drop get_sector_offset() [Kevin], rebase to mapping flag
---
 block/vpc.c | 45 +++--
 1 file changed, 23 insertions(+), 22 deletions(-)

diff --git a/block/vpc.c b/block/vpc.c
index cfa5144e867..fba4492fd7b 100644
--- a/block/vpc.c
+++ b/block/vpc.c
@@ -706,53 +706,54 @@ fail:
 return ret;
 }

-static int64_t coroutine_fn vpc_co_get_block_status(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors, int *pnum, BlockDriverState **file)
+static int coroutine_fn vpc_co_block_status(BlockDriverState *bs,
+bool want_zero,
+int64_t offset, int64_t bytes,
+int64_t *pnum, int64_t *map,
+BlockDriverState **file)
 {
 BDRVVPCState *s = bs->opaque;
 VHDFooter *footer = (VHDFooter*) s->footer_buf;
-int64_t start, offset;
+int64_t image_offset;
 bool allocated;
-int64_t ret;
-int n;
+int ret;
+int64_t n;

 if (be32_to_cpu(footer->type) == VHD_FIXED) {
-*pnum = nb_sectors;
+*pnum = bytes;
+*map = offset;
 *file = bs->file->bs;
-return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID |
-   (sector_num << BDRV_SECTOR_BITS);
+return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID;
 }

 qemu_co_mutex_lock(>lock);

-offset = get_image_offset(bs, sector_num << BDRV_SECTOR_BITS, false, NULL);
-start = offset;
-allocated = (offset != -1);
+image_offset = get_image_offset(bs, offset, false, NULL);
+allocated = (image_offset != -1);
 *pnum = 0;
 ret = 0;

 do {
 /* All sectors in a block are contiguous (without using the bitmap) */
-n = ROUND_UP(sector_num + 1, s->block_size / BDRV_SECTOR_SIZE)
-  - sector_num;
-n = MIN(n, nb_sectors);
+n = ROUND_UP(offset + 1, s->block_size) - offset;
+n = MIN(n, bytes);

 *pnum += n;
-sector_num += n;
-nb_sectors -= n;
+offset += n;
+bytes -= n;
 /* *pnum can't be greater than one block for allocated
  * sectors since there is always a bitmap in between. */
 if (allocated) {
 *file = bs->file->bs;
-ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID | start;
+*map = image_offset;
+ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
 break;
 }
-if (nb_sectors == 0) {
+if (bytes == 0) {
 break;
 }
-offset = get_image_offset(bs, sector_num << BDRV_SECTOR_BITS, false,
-  NULL);
-} while (offset == -1);
+image_offset = get_image_offset(bs, offset, false, NULL);
+} while (image_offset == -1);

 qemu_co_mutex_unlock(>lock);
 return ret;
@@ -1098,7 +1099,7 @@ static BlockDriver bdrv_vpc = {

 .bdrv_co_preadv = vpc_co_preadv,
 .bdrv_co_pwritev= vpc_co_pwritev,
-.bdrv_co_get_block_status   = vpc_co_get_block_status,
+.bdrv_co_block_status   = vpc_co_block_status,

 .bdrv_get_info  = vpc_get_info,

-- 
2.14.3

[Qemu-block] [PATCH v8 20/21] vvfat: Switch to .bdrv_co_block_status()

2018-02-13 Thread Eric Blake

We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the vvfat driver accordingly.  Note that we
can rely on the block driver having already clamped limits to our
block size, and simplify accordingly.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Fam Zheng 

---
v5-v7: no change
v4: rebase to interface tweak
v3: no change
v2: rebase to earlier changes, simplify
---
 block/vvfat.c | 16 +++-
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/block/vvfat.c b/block/vvfat.c
index 7e06ebacf61..4a17a49e128 100644
--- a/block/vvfat.c
+++ b/block/vvfat.c
@@ -3088,15 +3088,13 @@ vvfat_co_pwritev(BlockDriverState *bs, uint64_t offset, 
uint64_t bytes,
 return ret;
 }

-static int64_t coroutine_fn vvfat_co_get_block_status(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors, int *n, BlockDriverState **file)
+static int coroutine_fn vvfat_co_block_status(BlockDriverState *bs,
+  bool want_zero, int64_t offset,
+  int64_t bytes, int64_t *n,
+  int64_t *map,
+  BlockDriverState **file)
 {
-*n = bs->total_sectors - sector_num;
-if (*n > nb_sectors) {
-*n = nb_sectors;
-} else if (*n < 0) {
-return 0;
-}
+*n = bytes;
 return BDRV_BLOCK_DATA;
 }

@@ -3257,7 +3255,7 @@ static BlockDriver bdrv_vvfat = {

 .bdrv_co_preadv = vvfat_co_preadv,
 .bdrv_co_pwritev= vvfat_co_pwritev,
-.bdrv_co_get_block_status = vvfat_co_get_block_status,
+.bdrv_co_block_status   = vvfat_co_block_status,
 };

 static void bdrv_vvfat_init(void)
-- 
2.14.3

[Qemu-block] [PATCH v8 18/21] vmdk: Switch to .bdrv_co_block_status()

2018-02-13 Thread Eric Blake

We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the vmdk driver accordingly.  Drop the
now-unused vmdk_find_index_in_cluster().

Also, fix a pre-existing bug: if find_extent() fails (unlikely,
since the block layer did a bounds check), then we must return a
failure, rather than 0.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Fam Zheng 

---
v6-v7: no change
v5: drop dead code [Vladimir], return error on find_extent() failure
v4: rebase to interface tweak
v3: no change
v2: rebase to mapping flag
---
 block/vmdk.c | 38 ++
 1 file changed, 14 insertions(+), 24 deletions(-)

diff --git a/block/vmdk.c b/block/vmdk.c
index ef15ddbfd3d..75f84213e6f 100644
--- a/block/vmdk.c
+++ b/block/vmdk.c
@@ -1304,33 +1304,27 @@ static inline uint64_t 
vmdk_find_offset_in_cluster(VmdkExtent *extent,
 return extent_relative_offset % cluster_size;
 }

-static inline uint64_t vmdk_find_index_in_cluster(VmdkExtent *extent,
-  int64_t sector_num)
-{
-uint64_t offset;
-offset = vmdk_find_offset_in_cluster(extent, sector_num * 
BDRV_SECTOR_SIZE);
-return offset / BDRV_SECTOR_SIZE;
-}
-
-static int64_t coroutine_fn vmdk_co_get_block_status(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors, int *pnum, BlockDriverState **file)
+static int coroutine_fn vmdk_co_block_status(BlockDriverState *bs,
+ bool want_zero,
+ int64_t offset, int64_t bytes,
+ int64_t *pnum, int64_t *map,
+ BlockDriverState **file)
 {
 BDRVVmdkState *s = bs->opaque;
 int64_t index_in_cluster, n, ret;
-uint64_t offset;
+uint64_t cluster_offset;
 VmdkExtent *extent;

-extent = find_extent(s, sector_num, NULL);
+extent = find_extent(s, offset >> BDRV_SECTOR_BITS, NULL);
 if (!extent) {
-return 0;
+return -EIO;
 }
 qemu_co_mutex_lock(>lock);
-ret = get_cluster_offset(bs, extent, NULL,
- sector_num * 512, false, ,
+ret = get_cluster_offset(bs, extent, NULL, offset, false, _offset,
  0, 0);
 qemu_co_mutex_unlock(>lock);

-index_in_cluster = vmdk_find_index_in_cluster(extent, sector_num);
+index_in_cluster = vmdk_find_offset_in_cluster(extent, offset);
 switch (ret) {
 case VMDK_ERROR:
 ret = -EIO;
@@ -1345,18 +1339,14 @@ static int64_t coroutine_fn 
vmdk_co_get_block_status(BlockDriverState *bs,
 ret = BDRV_BLOCK_DATA;
 if (!extent->compressed) {
 ret |= BDRV_BLOCK_OFFSET_VALID;
-ret |= (offset + (index_in_cluster << BDRV_SECTOR_BITS))
-& BDRV_BLOCK_OFFSET_MASK;
+*map = cluster_offset + index_in_cluster;
 }
 *file = extent->file->bs;
 break;
 }

-n = extent->cluster_sectors - index_in_cluster;
-if (n > nb_sectors) {
-n = nb_sectors;
-}
-*pnum = n;
+n = extent->cluster_sectors * BDRV_SECTOR_SIZE - index_in_cluster;
+*pnum = MIN(n, bytes);
 return ret;
 }

@@ -2410,7 +2400,7 @@ static BlockDriver bdrv_vmdk = {
 .bdrv_close   = vmdk_close,
 .bdrv_create  = vmdk_create,
 .bdrv_co_flush_to_disk= vmdk_co_flush,
-.bdrv_co_get_block_status = vmdk_co_get_block_status,
+.bdrv_co_block_status = vmdk_co_block_status,
 .bdrv_get_allocated_file_size = vmdk_get_allocated_file_size,
 .bdrv_has_zero_init   = vmdk_has_zero_init,
 .bdrv_get_specific_info   = vmdk_get_specific_info,
-- 
2.14.3

[Qemu-block] [PATCH v8 17/21] vdi: Switch to .bdrv_co_block_status()

2018-02-13 Thread Eric Blake

We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the vdi driver accordingly.  Note that the
TODO is already covered (the block layer guarantees bounds of its
requests), and that we can remove the now-unused s->block_sectors.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Fam Zheng 

---
v6-v7: no change
v5: fix pnum when offset rounded down to block_size [Vladimir]
v4: rebase to interface tweak
v3: no change
v2: rebase to mapping flag
---
 block/vdi.c | 33 +
 1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/block/vdi.c b/block/vdi.c
index 32b1763cde0..0780c82d829 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -172,8 +172,6 @@ typedef struct {
 uint32_t *bmap;
 /* Size of block (bytes). */
 uint32_t block_size;
-/* Size of block (sectors). */
-uint32_t block_sectors;
 /* First sector of block map. */
 uint32_t bmap_sector;
 /* VDI header (converted to host endianness). */
@@ -463,7 +461,6 @@ static int vdi_open(BlockDriverState *bs, QDict *options, 
int flags,
 bs->total_sectors = header.disk_size / SECTOR_SIZE;

 s->block_size = header.block_size;
-s->block_sectors = header.block_size / SECTOR_SIZE;
 s->bmap_sector = header.offset_bmap / SECTOR_SIZE;
 s->header = header;

@@ -509,33 +506,29 @@ static int vdi_reopen_prepare(BDRVReopenState *state,
 return 0;
 }

-static int64_t coroutine_fn vdi_co_get_block_status(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors, int *pnum, BlockDriverState **file)
+static int coroutine_fn vdi_co_block_status(BlockDriverState *bs,
+bool want_zero,
+int64_t offset, int64_t bytes,
+int64_t *pnum, int64_t *map,
+BlockDriverState **file)
 {
-/* TODO: Check for too large sector_num (in bdrv_is_allocated or here). */
 BDRVVdiState *s = (BDRVVdiState *)bs->opaque;
-size_t bmap_index = sector_num / s->block_sectors;
-size_t sector_in_block = sector_num % s->block_sectors;
-int n_sectors = s->block_sectors - sector_in_block;
+size_t bmap_index = offset / s->block_size;
+size_t index_in_block = offset % s->block_size;
 uint32_t bmap_entry = le32_to_cpu(s->bmap[bmap_index]);
-uint64_t offset;
 int result;

-logout("%p, %" PRId64 ", %d, %p\n", bs, sector_num, nb_sectors, pnum);
-if (n_sectors > nb_sectors) {
-n_sectors = nb_sectors;
-}
-*pnum = n_sectors;
+logout("%p, %" PRId64 ", %" PRId64 ", %p\n", bs, offset, bytes, pnum);
+*pnum = MIN(s->block_size - index_in_block, bytes);
 result = VDI_IS_ALLOCATED(bmap_entry);
 if (!result) {
 return 0;
 }

-offset = s->header.offset_data +
-  (uint64_t)bmap_entry * s->block_size +
-  sector_in_block * SECTOR_SIZE;
+*map = s->header.offset_data + (uint64_t)bmap_entry * s->block_size +
+index_in_block;
 *file = bs->file->bs;
-return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID | offset;
+return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
 }

 static int coroutine_fn
@@ -903,7 +896,7 @@ static BlockDriver bdrv_vdi = {
 .bdrv_child_perm  = bdrv_format_default_perms,
 .bdrv_create = vdi_create,
 .bdrv_has_zero_init = bdrv_has_zero_init_1,
-.bdrv_co_get_block_status = vdi_co_get_block_status,
+.bdrv_co_block_status = vdi_co_block_status,
 .bdrv_make_empty = vdi_make_empty,

 .bdrv_co_preadv = vdi_co_preadv,
-- 
2.14.3

[Qemu-block] [PATCH v8 13/21] qed: Switch to .bdrv_co_block_status()

2018-02-13 Thread Eric Blake

We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the qed driver accordingly, taking the opportunity
to inline qed_is_allocated_cb() into its lone caller (the callback
used to be important, until we switched qed to coroutines).  There is
no intent to optimize based on the want_zero flag for this format.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Fam Zheng 

---
v6-v7: no change
v5: initialize len before qed_find_cluster() [Vladimir]
v4: rebase to interface change, inline pointless callback
v3: no change
v2: rebase to mapping flag, fix mask in qed_is_allocated_cb
---
 block/qed.c | 84 +
 1 file changed, 28 insertions(+), 56 deletions(-)

diff --git a/block/qed.c b/block/qed.c
index c6ff3ab015d..a5952209261 100644
--- a/block/qed.c
+++ b/block/qed.c
@@ -688,74 +688,46 @@ finish:
 return ret;
 }

-typedef struct {
-BlockDriverState *bs;
-Coroutine *co;
-uint64_t pos;
-int64_t status;
-int *pnum;
-BlockDriverState **file;
-} QEDIsAllocatedCB;
-
-/* Called with table_lock held.  */
-static void qed_is_allocated_cb(void *opaque, int ret, uint64_t offset, size_t 
len)
-{
-QEDIsAllocatedCB *cb = opaque;
-BDRVQEDState *s = cb->bs->opaque;
-*cb->pnum = len / BDRV_SECTOR_SIZE;
-switch (ret) {
-case QED_CLUSTER_FOUND:
-offset |= qed_offset_into_cluster(s, cb->pos);
-cb->status = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID | offset;
-*cb->file = cb->bs->file->bs;
-break;
-case QED_CLUSTER_ZERO:
-cb->status = BDRV_BLOCK_ZERO;
-break;
-case QED_CLUSTER_L2:
-case QED_CLUSTER_L1:
-cb->status = 0;
-break;
-default:
-assert(ret < 0);
-cb->status = ret;
-break;
-}
-
-if (cb->co) {
-aio_co_wake(cb->co);
-}
-}
-
-static int64_t coroutine_fn bdrv_qed_co_get_block_status(BlockDriverState *bs,
- int64_t sector_num,
- int nb_sectors, int *pnum,
+static int coroutine_fn bdrv_qed_co_block_status(BlockDriverState *bs,
+ bool want_zero,
+ int64_t pos, int64_t bytes,
+ int64_t *pnum, int64_t *map,
  BlockDriverState **file)
 {
 BDRVQEDState *s = bs->opaque;
-size_t len = (size_t)nb_sectors * BDRV_SECTOR_SIZE;
-QEDIsAllocatedCB cb = {
-.bs = bs,
-.pos = (uint64_t)sector_num * BDRV_SECTOR_SIZE,
-.status = BDRV_BLOCK_OFFSET_MASK,
-.pnum = pnum,
-.file = file,
-};
+size_t len = MIN(bytes, SIZE_MAX);
+int status;
 QEDRequest request = { .l2_table = NULL };
 uint64_t offset;
 int ret;

 qemu_co_mutex_lock(>table_lock);
-ret = qed_find_cluster(s, , cb.pos, , );
-qed_is_allocated_cb(, ret, offset, len);
+ret = qed_find_cluster(s, , pos, , );

-/* The callback was invoked immediately */
-assert(cb.status != BDRV_BLOCK_OFFSET_MASK);
+*pnum = len;
+switch (ret) {
+case QED_CLUSTER_FOUND:
+*map = offset | qed_offset_into_cluster(s, pos);
+status = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
+*file = bs->file->bs;
+break;
+case QED_CLUSTER_ZERO:
+status = BDRV_BLOCK_ZERO;
+break;
+case QED_CLUSTER_L2:
+case QED_CLUSTER_L1:
+status = 0;
+break;
+default:
+assert(ret < 0);
+status = ret;
+break;
+}

 qed_unref_l2_cache_entry(request.l2_table);
 qemu_co_mutex_unlock(>table_lock);

-return cb.status;
+return status;
 }

 static BDRVQEDState *acb_to_s(QEDAIOCB *acb)
@@ -1594,7 +1566,7 @@ static BlockDriver bdrv_qed = {
 .bdrv_child_perm  = bdrv_format_default_perms,
 .bdrv_create  = bdrv_qed_create,
 .bdrv_has_zero_init   = bdrv_has_zero_init_1,
-.bdrv_co_get_block_status = bdrv_qed_co_get_block_status,
+.bdrv_co_block_status = bdrv_qed_co_block_status,
 .bdrv_co_readv= bdrv_qed_co_readv,
 .bdrv_co_writev   = bdrv_qed_co_writev,
 .bdrv_co_pwrite_zeroes= bdrv_qed_co_pwrite_zeroes,
-- 
2.14.3

[Qemu-block] [PATCH v8 15/21] sheepdog: Switch to .bdrv_co_block_status()

2018-02-13 Thread Eric Blake

We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the sheepdog driver accordingly.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Fam Zheng 
Reviewed-by: Jeff Cody 

---
v7: rebase to minor spacing changes in master
v5-v6: no change
v4: update to interface tweak
v3: no change
v2: rebase to mapping flag
---
 block/sheepdog.c | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index ac02b10fe03..3c3becf94df 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -3004,19 +3004,19 @@ static coroutine_fn int sd_co_pdiscard(BlockDriverState 
*bs, int64_t offset,
 return acb.ret;
 }

-static coroutine_fn int64_t
-sd_co_get_block_status(BlockDriverState *bs, int64_t sector_num, int 
nb_sectors,
-   int *pnum, BlockDriverState **file)
+static coroutine_fn int
+sd_co_block_status(BlockDriverState *bs, bool want_zero, int64_t offset,
+   int64_t bytes, int64_t *pnum, int64_t *map,
+   BlockDriverState **file)
 {
 BDRVSheepdogState *s = bs->opaque;
 SheepdogInode *inode = >inode;
 uint32_t object_size = (UINT32_C(1) << inode->block_size_shift);
-uint64_t offset = sector_num * BDRV_SECTOR_SIZE;
 unsigned long start = offset / object_size,
-  end = DIV_ROUND_UP((sector_num + nb_sectors) *
- BDRV_SECTOR_SIZE, object_size);
+  end = DIV_ROUND_UP(offset + bytes, object_size);
 unsigned long idx;
-int64_t ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID | offset;
+*map = offset;
+int ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;

 for (idx = start; idx < end; idx++) {
 if (inode->data_vdi_id[idx] == 0) {
@@ -3033,9 +3033,9 @@ sd_co_get_block_status(BlockDriverState *bs, int64_t 
sector_num, int nb_sectors,
 }
 }

-*pnum = (idx - start) * object_size / BDRV_SECTOR_SIZE;
-if (*pnum > nb_sectors) {
-*pnum = nb_sectors;
+*pnum = (idx - start) * object_size;
+if (*pnum > bytes) {
+*pnum = bytes;
 }
 if (ret > 0 && ret & BDRV_BLOCK_OFFSET_VALID) {
 *file = bs;
@@ -3113,7 +3113,7 @@ static BlockDriver bdrv_sheepdog = {
 .bdrv_co_writev   = sd_co_writev,
 .bdrv_co_flush_to_disk= sd_co_flush_to_disk,
 .bdrv_co_pdiscard = sd_co_pdiscard,
-.bdrv_co_get_block_status = sd_co_get_block_status,
+.bdrv_co_block_status = sd_co_block_status,

 .bdrv_snapshot_create = sd_snapshot_create,
 .bdrv_snapshot_goto   = sd_snapshot_goto,
@@ -3149,7 +3149,7 @@ static BlockDriver bdrv_sheepdog_tcp = {
 .bdrv_co_writev   = sd_co_writev,
 .bdrv_co_flush_to_disk= sd_co_flush_to_disk,
 .bdrv_co_pdiscard = sd_co_pdiscard,
-.bdrv_co_get_block_status = sd_co_get_block_status,
+.bdrv_co_block_status = sd_co_block_status,

 .bdrv_snapshot_create = sd_snapshot_create,
 .bdrv_snapshot_goto   = sd_snapshot_goto,
@@ -3185,7 +3185,7 @@ static BlockDriver bdrv_sheepdog_unix = {
 .bdrv_co_writev   = sd_co_writev,
 .bdrv_co_flush_to_disk= sd_co_flush_to_disk,
 .bdrv_co_pdiscard = sd_co_pdiscard,
-.bdrv_co_get_block_status = sd_co_get_block_status,
+.bdrv_co_block_status = sd_co_block_status,

 .bdrv_snapshot_create = sd_snapshot_create,
 .bdrv_snapshot_goto   = sd_snapshot_goto,
-- 
2.14.3

[Qemu-block] [PATCH v8 05/21] gluster: Switch to .bdrv_co_block_status()

2018-02-13 Thread Eric Blake

We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the gluster driver accordingly.

In want_zero mode, we continue to report fine-grained hole
information (the caller wants as much mapping detail as possible);
but when not in that mode, the caller prefers larger *pnum and
merely cares about what offsets are allocated at this layer, rather
than where the holes live.  Since holes still read as zeroes at
this layer (rather than deferring to a backing layer), we can take
the shortcut of skipping find_allocation(), and merely state that
all bytes are allocated.

We can also drop redundant bounds checks that are already
guaranteed by the block layer.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Fam Zheng 

---
v6-v7: no change
v5: drop redundant code
v4: tweak commit message [Fam], rebase to interface tweak
v3: no change
v2: tweak comments [Prasanna], add mapping, drop R-b
---
 block/gluster.c | 70 -
 1 file changed, 34 insertions(+), 36 deletions(-)

diff --git a/block/gluster.c b/block/gluster.c
index 3f17b7819d2..1a07d221d17 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -1362,68 +1362,66 @@ exit:
 }

 /*
- * Returns the allocation status of the specified sectors.
+ * Returns the allocation status of the specified offset.
  *
- * If 'sector_num' is beyond the end of the disk image the return value is 0
- * and 'pnum' is set to 0.
+ * The block layer guarantees 'offset' and 'bytes' are within bounds.
  *
- * 'pnum' is set to the number of sectors (including and immediately following
- * the specified sector) that are known to be in the same
+ * 'pnum' is set to the number of bytes (including and immediately following
+ * the specified offset) that are known to be in the same
  * allocated/unallocated state.
  *
- * 'nb_sectors' is the max value 'pnum' should be set to.  If nb_sectors goes
- * beyond the end of the disk image it will be clamped.
+ * 'bytes' is the max value 'pnum' should be set to.
  *
- * (Based on raw_co_get_block_status() from file-posix.c.)
+ * (Based on raw_co_block_status() from file-posix.c.)
  */
-static int64_t coroutine_fn qemu_gluster_co_get_block_status(
-BlockDriverState *bs, int64_t sector_num, int nb_sectors, int *pnum,
-BlockDriverState **file)
+static int coroutine_fn qemu_gluster_co_block_status(BlockDriverState *bs,
+ bool want_zero,
+ int64_t offset,
+ int64_t bytes,
+ int64_t *pnum,
+ int64_t *map,
+ BlockDriverState **file)
 {
 BDRVGlusterState *s = bs->opaque;
-off_t start, data = 0, hole = 0;
-int64_t total_size;
+off_t data = 0, hole = 0;
 int ret = -EINVAL;

 if (!s->fd) {
 return ret;
 }

-start = sector_num * BDRV_SECTOR_SIZE;
-total_size = bdrv_getlength(bs);
-if (total_size < 0) {
-return total_size;
-} else if (start >= total_size) {
-*pnum = 0;
-return 0;
-} else if (start + nb_sectors * BDRV_SECTOR_SIZE > total_size) {
-nb_sectors = DIV_ROUND_UP(total_size - start, BDRV_SECTOR_SIZE);
+if (!want_zero) {
+*pnum = bytes;
+*map = offset;
+*file = bs;
+return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
 }

-ret = find_allocation(bs, start, , );
+ret = find_allocation(bs, offset, , );
 if (ret == -ENXIO) {
 /* Trailing hole */
-*pnum = nb_sectors;
+*pnum = bytes;
 ret = BDRV_BLOCK_ZERO;
 } else if (ret < 0) {
 /* No info available, so pretend there are no holes */
-*pnum = nb_sectors;
+*pnum = bytes;
 ret = BDRV_BLOCK_DATA;
-} else if (data == start) {
-/* On a data extent, compute sectors to the end of the extent,
+} else if (data == offset) {
+/* On a data extent, compute bytes to the end of the extent,
  * possibly including a partial sector at EOF. */
-*pnum = MIN(nb_sectors, DIV_ROUND_UP(hole - start, BDRV_SECTOR_SIZE));
+*pnum = MIN(bytes, hole - offset);
 ret = BDRV_BLOCK_DATA;
 } else {
-/* On a hole, compute sectors to the beginning of the next extent.  */
-assert(hole == start);
-*pnum = MIN(nb_sectors, (data - start) / BDRV_SECTOR_SIZE);
+/* On a hole, compute bytes to the beginning of the next extent.  */
+assert(hole == offset);
+*pnum = MIN(bytes, data - offset);
 ret = BDRV_BLOCK_ZERO;
 }

+*map = offset;
 *file = bs;

-return ret | BDRV_BLOCK_OFFSET_VALID | start;
+return ret |

[Qemu-block] [PATCH v8 16/21] vdi: Avoid bitrot of debugging code

2018-02-13 Thread Eric Blake

Rework the debug define so that we always get -Wformat checking,
even when debugging is disabled.

Signed-off-by: Eric Blake 
Reviewed-by: Stefan Weil 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Fam Zheng 

---
v2-v7: no change
---
 block/vdi.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/block/vdi.c b/block/vdi.c
index fc1c614cb12..32b1763cde0 100644
--- a/block/vdi.c
+++ b/block/vdi.c
@@ -87,12 +87,18 @@
 #define DEFAULT_CLUSTER_SIZE (1 * MiB)

 #if defined(CONFIG_VDI_DEBUG)
-#define logout(fmt, ...) \
-fprintf(stderr, "vdi\t%-24s" fmt, __func__, ##__VA_ARGS__)
+#define VDI_DEBUG 1
 #else
-#define logout(fmt, ...) ((void)0)
+#define VDI_DEBUG 0
 #endif

+#define logout(fmt, ...) \
+do {\
+if (VDI_DEBUG) {\
+fprintf(stderr, "vdi\t%-24s" fmt, __func__, ##__VA_ARGS__); \
+}   \
+} while (0)
+
 /* Image signature. */
 #define VDI_SIGNATURE 0xbeda107f

-- 
2.14.3

[Qemu-block] [PATCH v8 09/21] null: Switch to .bdrv_co_block_status()

2018-02-13 Thread Eric Blake

We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the null driver accordingly.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Fam Zheng 

---
v6-v7: no change
v5: minor fix to type of 'ret'
v4: rebase to interface tweak
v3: no change
v2: rebase to mapping parameter
---
 block/null.c | 23 ---
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/block/null.c b/block/null.c
index 214d394fff4..806a8631e4d 100644
--- a/block/null.c
+++ b/block/null.c
@@ -223,22 +223,23 @@ static int null_reopen_prepare(BDRVReopenState 
*reopen_state,
 return 0;
 }

-static int64_t coroutine_fn null_co_get_block_status(BlockDriverState *bs,
- int64_t sector_num,
- int nb_sectors, int *pnum,
- BlockDriverState **file)
+static int coroutine_fn null_co_block_status(BlockDriverState *bs,
+ bool want_zero, int64_t offset,
+ int64_t bytes, int64_t *pnum,
+ int64_t *map,
+ BlockDriverState **file)
 {
 BDRVNullState *s = bs->opaque;
-off_t start = sector_num * BDRV_SECTOR_SIZE;
+int ret = BDRV_BLOCK_OFFSET_VALID;

-*pnum = nb_sectors;
+*pnum = bytes;
+*map = offset;
 *file = bs;

 if (s->read_zeroes) {
-return BDRV_BLOCK_OFFSET_VALID | start | BDRV_BLOCK_ZERO;
-} else {
-return BDRV_BLOCK_OFFSET_VALID | start;
+ret |= BDRV_BLOCK_ZERO;
 }
+return ret;
 }

 static void null_refresh_filename(BlockDriverState *bs, QDict *opts)
@@ -270,7 +271,7 @@ static BlockDriver bdrv_null_co = {
 .bdrv_co_flush_to_disk  = null_co_flush,
 .bdrv_reopen_prepare= null_reopen_prepare,

-.bdrv_co_get_block_status   = null_co_get_block_status,
+.bdrv_co_block_status   = null_co_block_status,

 .bdrv_refresh_filename  = null_refresh_filename,
 };
@@ -290,7 +291,7 @@ static BlockDriver bdrv_null_aio = {
 .bdrv_aio_flush = null_aio_flush,
 .bdrv_reopen_prepare= null_reopen_prepare,

-.bdrv_co_get_block_status   = null_co_get_block_status,
+.bdrv_co_block_status   = null_co_block_status,

 .bdrv_refresh_filename  = null_refresh_filename,
 };
-- 
2.14.3

[Qemu-block] [PATCH v8 14/21] raw: Switch to .bdrv_co_block_status()

2018-02-13 Thread Eric Blake

We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the raw driver accordingly.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Fam Zheng 

---
v5-v7: no change
v4: rebase to interface tweak
v3: no change
v2: rebase to mapping
---
 block/raw-format.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/block/raw-format.c b/block/raw-format.c
index ab552c09541..830243a8e48 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -250,17 +250,17 @@ fail:
 return ret;
 }

-static int64_t coroutine_fn raw_co_get_block_status(BlockDriverState *bs,
-int64_t sector_num,
-int nb_sectors, int *pnum,
+static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
+bool want_zero, int64_t offset,
+int64_t bytes, int64_t *pnum,
+int64_t *map,
 BlockDriverState **file)
 {
 BDRVRawState *s = bs->opaque;
-*pnum = nb_sectors;
+*pnum = bytes;
 *file = bs->file->bs;
-sector_num += s->offset / BDRV_SECTOR_SIZE;
-return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID |
-   (sector_num << BDRV_SECTOR_BITS);
+*map = offset + s->offset;
+return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID;
 }

 static int coroutine_fn raw_co_pwrite_zeroes(BlockDriverState *bs,
@@ -496,7 +496,7 @@ BlockDriver bdrv_raw = {
 .bdrv_co_pwritev  = _co_pwritev,
 .bdrv_co_pwrite_zeroes = _co_pwrite_zeroes,
 .bdrv_co_pdiscard = _co_pdiscard,
-.bdrv_co_get_block_status = _co_get_block_status,
+.bdrv_co_block_status = _co_block_status,
 .bdrv_truncate= _truncate,
 .bdrv_getlength   = _getlength,
 .has_variable_length  = true,
-- 
2.14.3

[Qemu-block] [PATCH v8 08/21] iscsi: Switch to .bdrv_co_block_status()

2018-02-13 Thread Eric Blake

We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the iscsi driver accordingly.  In this case,
it is handy to teach iscsi_co_block_status() to handle a NULL map
and file parameter, even though the block layer passes non-NULL
values, because we also call the function directly.  For now, there
are no optimizations done based on the want_zero flag.

We can also make the simplification of asserting that the block
layer passed in aligned values.

Signed-off-by: Eric Blake 
Reviewed-by: Fam Zheng 

---
v8: rebase to master
v7: rebase to master
v6: no change
v5: assert rather than check for alignment
v4: rebase to interface tweaks
v3: no change
v2: rebase to mapping parameter
---
 block/iscsi.c | 67 ---
 1 file changed, 32 insertions(+), 35 deletions(-)

diff --git a/block/iscsi.c b/block/iscsi.c
index d2b0466775c..4842519fdad 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -653,36 +653,36 @@ out_unlock:



-static int64_t coroutine_fn iscsi_co_get_block_status(BlockDriverState *bs,
-  int64_t sector_num,
-  int nb_sectors, int *pnum,
-  BlockDriverState **file)
+static int coroutine_fn iscsi_co_block_status(BlockDriverState *bs,
+  bool want_zero, int64_t offset,
+  int64_t bytes, int64_t *pnum,
+  int64_t *map,
+  BlockDriverState **file)
 {
 IscsiLun *iscsilun = bs->opaque;
 struct scsi_get_lba_status *lbas = NULL;
 struct scsi_lba_status_descriptor *lbasd = NULL;
 struct IscsiTask iTask;
 uint64_t lba;
-int64_t ret;
+int ret;

 iscsi_co_init_iscsitask(iscsilun, );

-if (!is_sector_request_lun_aligned(sector_num, nb_sectors, iscsilun)) {
-ret = -EINVAL;
-goto out;
-}
+assert(QEMU_IS_ALIGNED(offset | bytes, iscsilun->block_size));

 /* default to all sectors allocated */
-ret = BDRV_BLOCK_DATA;
-ret |= (sector_num << BDRV_SECTOR_BITS) | BDRV_BLOCK_OFFSET_VALID;
-*pnum = nb_sectors;
+ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
+if (map) {
+*map = offset;
+}
+*pnum = bytes;

 /* LUN does not support logical block provisioning */
 if (!iscsilun->lbpme) {
 goto out;
 }

-lba = sector_qemu2lun(sector_num, iscsilun);
+lba = offset / iscsilun->block_size;

 qemu_mutex_lock(>mutex);
 retry:
@@ -727,12 +727,12 @@ retry:

 lbasd = >descriptors[0];

-if (sector_qemu2lun(sector_num, iscsilun) != lbasd->lba) {
+if (lba != lbasd->lba) {
 ret = -EIO;
 goto out_unlock;
 }

-*pnum = sector_lun2qemu(lbasd->num_blocks, iscsilun);
+*pnum = lbasd->num_blocks * iscsilun->block_size;

 if (lbasd->provisioning == SCSI_PROVISIONING_TYPE_DEALLOCATED ||
 lbasd->provisioning == SCSI_PROVISIONING_TYPE_ANCHORED) {
@@ -743,15 +743,13 @@ retry:
 }

 if (ret & BDRV_BLOCK_ZERO) {
-iscsi_allocmap_set_unallocated(iscsilun, sector_num * BDRV_SECTOR_SIZE,
-   *pnum * BDRV_SECTOR_SIZE);
+iscsi_allocmap_set_unallocated(iscsilun, offset, *pnum);
 } else {
-iscsi_allocmap_set_allocated(iscsilun, sector_num * BDRV_SECTOR_SIZE,
- *pnum * BDRV_SECTOR_SIZE);
+iscsi_allocmap_set_allocated(iscsilun, offset, *pnum);
 }

-if (*pnum > nb_sectors) {
-*pnum = nb_sectors;
+if (*pnum > bytes) {
+*pnum = bytes;
 }
 out_unlock:
 qemu_mutex_unlock(>mutex);
@@ -760,7 +758,7 @@ out:
 if (iTask.task != NULL) {
 scsi_free_scsi_task(iTask.task);
 }
-if (ret > 0 && ret & BDRV_BLOCK_OFFSET_VALID) {
+if (ret > 0 && ret & BDRV_BLOCK_OFFSET_VALID && file) {
 *file = bs;
 }
 return ret;
@@ -800,25 +798,24 @@ static int coroutine_fn iscsi_co_readv(BlockDriverState 
*bs,
  nb_sectors * BDRV_SECTOR_SIZE) &&
 !iscsi_allocmap_is_allocated(iscsilun, sector_num * BDRV_SECTOR_SIZE,
  nb_sectors * BDRV_SECTOR_SIZE)) {
-int pnum;
-BlockDriverState *file;
+int64_t pnum;
 /* check the block status from the beginning of the cluster
  * containing the start sector */
-int cluster_sectors = iscsilun->cluster_size >> BDRV_SECTOR_BITS;
-int head;
-int64_t ret;
+int64_t head;
+int ret;

-assert(cluster_sectors);
-head = sector_num % cluster_sectors;
-ret = iscsi_co_get_block_status(bs, sector_num - head,
-BDRV_REQUEST_MAX_SECTORS, ,
-

[Qemu-block] [PATCH v8 11/21] qcow: Switch to .bdrv_co_block_status()

2018-02-13 Thread Eric Blake

We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the qcow driver accordingly.  There is no
intent to optimize based on the want_zero flag for this format.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Fam Zheng 

---
v5-v7: no change
v4: rebase to interface tweak
v3: rebase to master
v2: rebase to mapping flag
---
 block/qcow.c | 27 ---
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/block/qcow.c b/block/qcow.c
index 8631155ac81..dead5029c67 100644
--- a/block/qcow.c
+++ b/block/qcow.c
@@ -524,23 +524,28 @@ static int get_cluster_offset(BlockDriverState *bs,
 return 1;
 }

-static int64_t coroutine_fn qcow_co_get_block_status(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors, int *pnum, BlockDriverState **file)
+static int coroutine_fn qcow_co_block_status(BlockDriverState *bs,
+ bool want_zero,
+ int64_t offset, int64_t bytes,
+ int64_t *pnum, int64_t *map,
+ BlockDriverState **file)
 {
 BDRVQcowState *s = bs->opaque;
-int index_in_cluster, n, ret;
+int index_in_cluster, ret;
+int64_t n;
 uint64_t cluster_offset;

 qemu_co_mutex_lock(>lock);
-ret = get_cluster_offset(bs, sector_num << 9, 0, 0, 0, 0, _offset);
+ret = get_cluster_offset(bs, offset, 0, 0, 0, 0, _offset);
 qemu_co_mutex_unlock(>lock);
 if (ret < 0) {
 return ret;
 }
-index_in_cluster = sector_num & (s->cluster_sectors - 1);
-n = s->cluster_sectors - index_in_cluster;
-if (n > nb_sectors)
-n = nb_sectors;
+index_in_cluster = offset & (s->cluster_size - 1);
+n = s->cluster_size - index_in_cluster;
+if (n > bytes) {
+n = bytes;
+}
 *pnum = n;
 if (!cluster_offset) {
 return 0;
@@ -548,9 +553,9 @@ static int64_t coroutine_fn 
qcow_co_get_block_status(BlockDriverState *bs,
 if ((cluster_offset & QCOW_OFLAG_COMPRESSED) || s->crypto) {
 return BDRV_BLOCK_DATA;
 }
-cluster_offset |= (index_in_cluster << BDRV_SECTOR_BITS);
+*map = cluster_offset | index_in_cluster;
 *file = bs->file->bs;
-return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID | cluster_offset;
+return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
 }

 static int decompress_buffer(uint8_t *out_buf, int out_buf_size,
@@ -1128,7 +1133,7 @@ static BlockDriver bdrv_qcow = {

 .bdrv_co_readv  = qcow_co_readv,
 .bdrv_co_writev = qcow_co_writev,
-.bdrv_co_get_block_status   = qcow_co_get_block_status,
+.bdrv_co_block_status   = qcow_co_block_status,

 .bdrv_make_empty= qcow_make_empty,
 .bdrv_co_pwritev_compressed = qcow_co_pwritev_compressed,
-- 
2.14.3

[Qemu-block] [PATCH v8 12/21] qcow2: Switch to .bdrv_co_block_status()

2018-02-13 Thread Eric Blake

We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the qcow2 driver accordingly.

For now, we are ignoring the 'want_zero' hint.  However, it should
be relatively straightforward to honor the hint as a way to return
larger *pnum values when we have consecutive clusters with the same
data/zero status but which differ only in having non-consecutive
mappings.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Fam Zheng 

---
v5-v7: no change
v4: update to interface tweak
v3: no change
v2: rebase to mapping flag
---
 block/qcow2.c | 24 +---
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index 57a517e2bdd..288b5299d80 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1670,32 +1670,34 @@ static void qcow2_join_options(QDict *options, QDict 
*old_options)
 }
 }

-static int64_t coroutine_fn qcow2_co_get_block_status(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors, int *pnum, BlockDriverState **file)
+static int coroutine_fn qcow2_co_block_status(BlockDriverState *bs,
+  bool want_zero,
+  int64_t offset, int64_t count,
+  int64_t *pnum, int64_t *map,
+  BlockDriverState **file)
 {
 BDRVQcow2State *s = bs->opaque;
 uint64_t cluster_offset;
 int index_in_cluster, ret;
 unsigned int bytes;
-int64_t status = 0;
+int status = 0;

-bytes = MIN(INT_MAX, nb_sectors * BDRV_SECTOR_SIZE);
+bytes = MIN(INT_MAX, count);
 qemu_co_mutex_lock(>lock);
-ret = qcow2_get_cluster_offset(bs, sector_num << BDRV_SECTOR_BITS, ,
-   _offset);
+ret = qcow2_get_cluster_offset(bs, offset, , _offset);
 qemu_co_mutex_unlock(>lock);
 if (ret < 0) {
 return ret;
 }

-*pnum = bytes >> BDRV_SECTOR_BITS;
+*pnum = bytes;

 if (cluster_offset != 0 && ret != QCOW2_CLUSTER_COMPRESSED &&
 !s->crypto) {
-index_in_cluster = sector_num & (s->cluster_sectors - 1);
-cluster_offset |= (index_in_cluster << BDRV_SECTOR_BITS);
+index_in_cluster = offset & (s->cluster_size - 1);
+*map = cluster_offset | index_in_cluster;
 *file = bs->file->bs;
-status |= BDRV_BLOCK_OFFSET_VALID | cluster_offset;
+status |= BDRV_BLOCK_OFFSET_VALID;
 }
 if (ret == QCOW2_CLUSTER_ZERO_PLAIN || ret == QCOW2_CLUSTER_ZERO_ALLOC) {
 status |= BDRV_BLOCK_ZERO;
@@ -4352,7 +4354,7 @@ BlockDriver bdrv_qcow2 = {
 .bdrv_child_perm  = bdrv_format_default_perms,
 .bdrv_create= qcow2_create,
 .bdrv_has_zero_init = bdrv_has_zero_init_1,
-.bdrv_co_get_block_status = qcow2_co_get_block_status,
+.bdrv_co_block_status = qcow2_co_block_status,

 .bdrv_co_preadv = qcow2_co_preadv,
 .bdrv_co_pwritev= qcow2_co_pwritev,
-- 
2.14.3

[Qemu-block] [PATCH v8 10/21] parallels: Switch to .bdrv_co_block_status()

2018-02-13 Thread Eric Blake

We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the parallels driver accordingly.  Note that
the internal function block_status() is still sector-based, because
it is still in use by other sector-based functions; but that's okay
because request_alignment is 512 as a result of those functions.
For now, no optimizations are added based on the mapping hint.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Fam Zheng 

---
v7: fix bug in *map [Vladimir]
v6: no change
v5: fix pnum when return is 0
v4: rebase to interface tweak, R-b dropped
v3: no change
v2: rebase to mapping parameter; it is ignored, so R-b kept
---
 block/parallels.c | 22 +++---
 1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/block/parallels.c b/block/parallels.c
index e1e3d80c887..3e952a9c147 100644
--- a/block/parallels.c
+++ b/block/parallels.c
@@ -261,23 +261,31 @@ static coroutine_fn int 
parallels_co_flush_to_os(BlockDriverState *bs)
 }


-static int64_t coroutine_fn parallels_co_get_block_status(BlockDriverState *bs,
-int64_t sector_num, int nb_sectors, int *pnum, BlockDriverState **file)
+static int coroutine_fn parallels_co_block_status(BlockDriverState *bs,
+  bool want_zero,
+  int64_t offset,
+  int64_t bytes,
+  int64_t *pnum,
+  int64_t *map,
+  BlockDriverState **file)
 {
 BDRVParallelsState *s = bs->opaque;
-int64_t offset;
+int count;

+assert(QEMU_IS_ALIGNED(offset | bytes, BDRV_SECTOR_SIZE));
 qemu_co_mutex_lock(>lock);
-offset = block_status(s, sector_num, nb_sectors, pnum);
+offset = block_status(s, offset >> BDRV_SECTOR_BITS,
+  bytes >> BDRV_SECTOR_BITS, );
 qemu_co_mutex_unlock(>lock);

+*pnum = count * BDRV_SECTOR_SIZE;
 if (offset < 0) {
 return 0;
 }

+*map = offset * BDRV_SECTOR_SIZE;
 *file = bs->file->bs;
-return (offset << BDRV_SECTOR_BITS) |
-BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
+return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
 }

 static coroutine_fn int parallels_co_writev(BlockDriverState *bs,
@@ -782,7 +790,7 @@ static BlockDriver bdrv_parallels = {
 .bdrv_open = parallels_open,
 .bdrv_close= parallels_close,
 .bdrv_child_perm  = bdrv_format_default_perms,
-.bdrv_co_get_block_status = parallels_co_get_block_status,
+.bdrv_co_block_status = parallels_co_block_status,
 .bdrv_has_zero_init   = bdrv_has_zero_init_1,
 .bdrv_co_flush_to_os  = parallels_co_flush_to_os,
 .bdrv_co_readv  = parallels_co_readv,
-- 
2.14.3

[Qemu-block] [PATCH v8 03/21] block: Switch passthrough drivers to .bdrv_co_block_status()

2018-02-13 Thread Eric Blake

We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the generic helpers, and all passthrough clients
(blkdebug, commit, mirror, throttle) accordingly.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Fam Zheng 

---
v6-v7: no change
v5: rebase to master
v4: rebase to interface tweak
v3: rebase to addition of throttle driver
v2: rebase to master, retitle while merging blkdebug, commit, and mirror
---
 include/block/block_int.h | 28 
 block/io.c| 36 
 block/blkdebug.c  | 20 +++-
 block/commit.c|  2 +-
 block/mirror.c|  2 +-
 block/throttle.c  |  2 +-
 6 files changed, 50 insertions(+), 40 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index c93722b43a4..bf2598856cf 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -1041,23 +1041,27 @@ void bdrv_format_default_perms(BlockDriverState *bs, 
BdrvChild *c,
uint64_t *nperm, uint64_t *nshared);

 /*
- * Default implementation for drivers to pass bdrv_co_get_block_status() to
+ * Default implementation for drivers to pass bdrv_co_block_status() to
  * their file.
  */
-int64_t coroutine_fn bdrv_co_get_block_status_from_file(BlockDriverState *bs,
-int64_t sector_num,
-int nb_sectors,
-int *pnum,
-BlockDriverState 
**file);
+int coroutine_fn bdrv_co_block_status_from_file(BlockDriverState *bs,
+bool want_zero,
+int64_t offset,
+int64_t bytes,
+int64_t *pnum,
+int64_t *map,
+BlockDriverState **file);
 /*
- * Default implementation for drivers to pass bdrv_co_get_block_status() to
+ * Default implementation for drivers to pass bdrv_co_block_status() to
  * their backing file.
  */
-int64_t coroutine_fn bdrv_co_get_block_status_from_backing(BlockDriverState 
*bs,
-   int64_t sector_num,
-   int nb_sectors,
-   int *pnum,
-   BlockDriverState 
**file);
+int coroutine_fn bdrv_co_block_status_from_backing(BlockDriverState *bs,
+   bool want_zero,
+   int64_t offset,
+   int64_t bytes,
+   int64_t *pnum,
+   int64_t *map,
+   BlockDriverState **file);
 const char *bdrv_get_parent_name(const BlockDriverState *bs);
 void blk_dev_change_media_cb(BlockBackend *blk, bool load, Error **errp);
 bool blk_dev_has_removable_media(BlockBackend *blk);
diff --git a/block/io.c b/block/io.c
index b00c7e2e2c0..5bae79f282e 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1868,30 +1868,34 @@ typedef struct BdrvCoBlockStatusData {
 bool done;
 } BdrvCoBlockStatusData;

-int64_t coroutine_fn bdrv_co_get_block_status_from_file(BlockDriverState *bs,
-int64_t sector_num,
-int nb_sectors,
-int *pnum,
-BlockDriverState 
**file)
+int coroutine_fn bdrv_co_block_status_from_file(BlockDriverState *bs,
+bool want_zero,
+int64_t offset,
+int64_t bytes,
+int64_t *pnum,
+int64_t *map,
+BlockDriverState **file)
 {
 assert(bs->file && bs->file->bs);
-*pnum = nb_sectors;
+*pnum = bytes;
+*map = offset;
 *file = bs->file->bs;
-return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID |
-   (sector_num << BDRV_SECTOR_BITS);
+return BDRV_BLOCK_RAW | BDRV_BLOCK_OFFSET_VALID;
 }

-int64_t coroutine_fn bdrv_co_get_block_status_from_backing(BlockDriverState 
*bs,
-   int64_t

[Qemu-block] [PATCH v8 00/21] add byte-based block_status driver callbacks

2018-02-13 Thread Eric Blake

There are patches floating around to add NBD_CMD_BLOCK_STATUS,
but NBD wants to report status on byte granularity (even if the
reporting will probably be naturally aligned to sectors or even
much higher levels).  I've therefore started the task of
converting our block status code to report at a byte granularity
rather than sectors.

These patches have been around for a while, but it's time to
finish it now that 2.12 is open for patches.

Based-on: <20180213170529.10858-1-kw...@redhat.com>
(Kevin's [PULL 00/55] Block layer patches)

The overall conversion currently looks like:
part 1: bdrv_is_allocated (merged, commit 51b0a488, 2.10)
part 2: dirty-bitmap (merged, commit ca759622, 2.11)
part 3: bdrv_get_block_status (merged, commit f0a9c18f, 2.11)
part 4: .bdrv_co_block_status (this series, v7 was here [1])

[1] https://lists.gnu.org/archive/html/qemu-devel/2018-01/msg00954.html

Available as a tag at:
git fetch git://repo.or.cz/qemu/ericb.git nbd-byte-callback-v8

Since v7:
- rebase to master (more iscsi context changes, nvme driver is new)
- add a few more R-bys

Eric Blake (21):
  block: Add .bdrv_co_block_status() callback
  nvme: Drop pointless .bdrv_co_get_block_status()
  block: Switch passthrough drivers to .bdrv_co_block_status()
  file-posix: Switch to .bdrv_co_block_status()
  gluster: Switch to .bdrv_co_block_status()
  iscsi: Switch cluster_sectors to byte-based
  iscsi: Switch iscsi_allocmap_update() to byte-based
  iscsi: Switch to .bdrv_co_block_status()
  null: Switch to .bdrv_co_block_status()
  parallels: Switch to .bdrv_co_block_status()
  qcow: Switch to .bdrv_co_block_status()
  qcow2: Switch to .bdrv_co_block_status()
  qed: Switch to .bdrv_co_block_status()
  raw: Switch to .bdrv_co_block_status()
  sheepdog: Switch to .bdrv_co_block_status()
  vdi: Avoid bitrot of debugging code
  vdi: Switch to .bdrv_co_block_status()
  vmdk: Switch to .bdrv_co_block_status()
  vpc: Switch to .bdrv_co_block_status()
  vvfat: Switch to .bdrv_co_block_status()
  block: Drop unused .bdrv_co_get_block_status()

 include/block/block.h |  14 ++---
 include/block/block_int.h |  51 +--
 block/io.c|  86 +++--
 block/blkdebug.c  |  20 +++---
 block/commit.c|   2 +-
 block/file-posix.c|  62 +-
 block/gluster.c   |  70 ++---
 block/iscsi.c | 157 --
 block/mirror.c|   2 +-
 block/null.c  |  23 +++
 block/nvme.c  |  14 -
 block/parallels.c |  22 ---
 block/qcow.c  |  27 
 block/qcow2.c |  24 +++
 block/qed.c   |  84 +
 block/raw-format.c|  16 ++---
 block/sheepdog.c  |  26 
 block/throttle.c  |   2 +-
 block/vdi.c   |  45 +++--
 block/vmdk.c  |  38 +--
 block/vpc.c   |  45 ++---
 block/vvfat.c |  16 +++--
 22 files changed, 404 insertions(+), 442 deletions(-)

-- 
2.14.3

[Qemu-block] [PATCH v8 04/21] file-posix: Switch to .bdrv_co_block_status()

2018-02-13 Thread Eric Blake

We are gradually moving away from sector-based interfaces, towards
byte-based.  Update the file protocol driver accordingly.

In want_zero mode, we continue to report fine-grained hole
information (the caller wants as much mapping detail as possible);
but when not in that mode, the caller prefers larger *pnum and
merely cares about what offsets are allocated at this layer, rather
than where the holes live.  Since holes still read as zeroes at
this layer (rather than deferring to a backing layer), we can take
the shortcut of skipping lseek(), and merely state that all bytes
are allocated.

We can also drop redundant bounds checks that are already
guaranteed by the block layer.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Fam Zheng 

---
v6-v7: no change
v5: drop redundant code
v4: tweak commit message [Fam], rebase to interface tweak
v3: no change
v2: tweak comment, add mapping support
---
 block/file-posix.c | 62 +-
 1 file changed, 29 insertions(+), 33 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index ca49c1a98ae..f1591c38490 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2131,25 +2131,24 @@ static int find_allocation(BlockDriverState *bs, off_t 
start,
 }

 /*
- * Returns the allocation status of the specified sectors.
+ * Returns the allocation status of the specified offset.
  *
- * If 'sector_num' is beyond the end of the disk image the return value is 0
- * and 'pnum' is set to 0.
+ * The block layer guarantees 'offset' and 'bytes' are within bounds.
  *
- * 'pnum' is set to the number of sectors (including and immediately following
- * the specified sector) that are known to be in the same
+ * 'pnum' is set to the number of bytes (including and immediately following
+ * the specified offset) that are known to be in the same
  * allocated/unallocated state.
  *
- * 'nb_sectors' is the max value 'pnum' should be set to.  If nb_sectors goes
- * beyond the end of the disk image it will be clamped.
+ * 'bytes' is the max value 'pnum' should be set to.
  */
-static int64_t coroutine_fn raw_co_get_block_status(BlockDriverState *bs,
-int64_t sector_num,
-int nb_sectors, int *pnum,
-BlockDriverState **file)
+static int coroutine_fn raw_co_block_status(BlockDriverState *bs,
+bool want_zero,
+int64_t offset,
+int64_t bytes, int64_t *pnum,
+int64_t *map,
+BlockDriverState **file)
 {
-off_t start, data = 0, hole = 0;
-int64_t total_size;
+off_t data = 0, hole = 0;
 int ret;

 ret = fd_open(bs);
@@ -2157,39 +2156,36 @@ static int64_t coroutine_fn 
raw_co_get_block_status(BlockDriverState *bs,
 return ret;
 }

-start = sector_num * BDRV_SECTOR_SIZE;
-total_size = bdrv_getlength(bs);
-if (total_size < 0) {
-return total_size;
-} else if (start >= total_size) {
-*pnum = 0;
-return 0;
-} else if (start + nb_sectors * BDRV_SECTOR_SIZE > total_size) {
-nb_sectors = DIV_ROUND_UP(total_size - start, BDRV_SECTOR_SIZE);
+if (!want_zero) {
+*pnum = bytes;
+*map = offset;
+*file = bs;
+return BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
 }

-ret = find_allocation(bs, start, , );
+ret = find_allocation(bs, offset, , );
 if (ret == -ENXIO) {
 /* Trailing hole */
-*pnum = nb_sectors;
+*pnum = bytes;
 ret = BDRV_BLOCK_ZERO;
 } else if (ret < 0) {
 /* No info available, so pretend there are no holes */
-*pnum = nb_sectors;
+*pnum = bytes;
 ret = BDRV_BLOCK_DATA;
-} else if (data == start) {
-/* On a data extent, compute sectors to the end of the extent,
+} else if (data == offset) {
+/* On a data extent, compute bytes to the end of the extent,
  * possibly including a partial sector at EOF. */
-*pnum = MIN(nb_sectors, DIV_ROUND_UP(hole - start, BDRV_SECTOR_SIZE));
+*pnum = MIN(bytes, hole - offset);
 ret = BDRV_BLOCK_DATA;
 } else {
-/* On a hole, compute sectors to the beginning of the next extent.  */
-assert(hole == start);
-*pnum = MIN(nb_sectors, (data - start) / BDRV_SECTOR_SIZE);
+/* On a hole, compute bytes to the beginning of the next extent.  */
+assert(hole == offset);
+*pnum = MIN(bytes, data - offset);
 ret = BDRV_BLOCK_ZERO;
 }
+*map = offset;
 *file = bs;
-return ret | BDRV_BLOCK_OFFSET_VALID | start;
+return ret |

[Qemu-block] [PATCH v8 01/21] block: Add .bdrv_co_block_status() callback

2018-02-13 Thread Eric Blake

We are gradually moving away from sector-based interfaces, towards
byte-based. Now that the block layer exposes byte-based allocation,
it's time to tackle the drivers.  Add a new callback that operates
on as small as byte boundaries. Subsequent patches will then update
individual drivers, then finally remove .bdrv_co_get_block_status().

The new code also passes through the 'want_zero' hint, which will
allow subsequent patches to further optimize callers that only care
about how much of the image is allocated (want_zero is false),
rather than full details about runs of zeroes and which offsets the
allocation actually maps to (want_zero is true).  As part of this
effort, fix another part of the documentation: the claim in commit
4c41cb4 that BDRV_BLOCK_ALLOCATED is short for 'DATA || ZERO' is a
lie at the block layer (see commit e88ae2264), even though it is
how the bit is computed from the driver layer.  After all, there
are intentionally cases where we return ZERO but not ALLOCATED at
the block layer, when we know that a read sees zero because the
backing file is too short.  Note that the driver interface is thus
slightly different than the public interface with regards to which
bits will be set, and what guarantees are provided on input.

We also add an assertion that any driver using the new callback will
make progress (the only time pnum will be 0 is if the block layer
already handled an out-of-bounds request, or if there is an error);
the old driver interface did not provide this guarantee, which
could lead to some inf-loops in drastic corner-case failures.

Signed-off-by: Eric Blake 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Fam Zheng 

---
v7: add R-b
v6: drop now-useless rounding of mid-sector end-of-file hole [Kevin],
better documentation of 'want_zero' [Kevin]
v5: rebase to master, typo fix, document more block layer guarantees
v4: rebase to master
v3: no change
v2: improve alignment handling, ensure all iotests still pass
---
 include/block/block.h | 14 +++---
 include/block/block_int.h | 20 +++-
 block/io.c| 28 +++-
 3 files changed, 41 insertions(+), 21 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 19b3ab9cb5e..947e8876cdd 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -115,19 +115,19 @@ typedef struct HDGeometry {
  * BDRV_BLOCK_ZERO: offset reads as zero
  * BDRV_BLOCK_OFFSET_VALID: an associated offset exists for accessing raw data
  * BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
- *   layer (short for DATA || ZERO), set by block layer
- * BDRV_BLOCK_EOF: the returned pnum covers through end of file for this layer
+ *   layer rather than any backing, set by block layer
+ * BDRV_BLOCK_EOF: the returned pnum covers through end of file for this
+ * layer, set by block layer
  *
  * Internal flag:
  * BDRV_BLOCK_RAW: for use by passthrough drivers, such as raw, to request
  * that the block layer recompute the answer from the returned
  * BDS; must be accompanied by just BDRV_BLOCK_OFFSET_VALID.
  *
- * If BDRV_BLOCK_OFFSET_VALID is set, bits 9-62 (BDRV_BLOCK_OFFSET_MASK) of
- * the return value (old interface) or the entire map parameter (new
- * interface) represent the offset in the returned BDS that is allocated for
- * the corresponding raw data.  However, whether that offset actually
- * contains data also depends on BDRV_BLOCK_DATA, as follows:
+ * If BDRV_BLOCK_OFFSET_VALID is set, the map parameter represents the
+ * host offset within the returned BDS that is allocated for the
+ * corresponding raw guest data.  However, whether that offset
+ * actually contains data also depends on BDRV_BLOCK_DATA, as follows:
  *
  * DATA ZERO OFFSET_VALID
  *  ttt   sectors read as zero, returned file is zero at offset
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 5ea63f8fa8a..c93722b43a4 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -202,15 +202,25 @@ struct BlockDriver {
 /*
  * Building block for bdrv_block_status[_above] and
  * bdrv_is_allocated[_above].  The driver should answer only
- * according to the current layer, and should not set
- * BDRV_BLOCK_ALLOCATED, but may set BDRV_BLOCK_RAW.  See block.h
- * for the meaning of _DATA, _ZERO, and _OFFSET_VALID.  The block
- * layer guarantees input aligned to request_alignment, as well as
- * non-NULL pnum and file.
+ * according to the current layer, and should only need to set
+ * BDRV_BLOCK_DATA, BDRV_BLOCK_ZERO, BDRV_BLOCK_OFFSET_VALID,
+ * and/or BDRV_BLOCK_RAW; if the current layer defers to a backing
+ * layer, the result should be 0 (and not BDRV_BLOCK_ZERO).  See
+ * block.h for the overall meaning of the bits.

[Qemu-block] [PATCH v8 02/21] nvme: Drop pointless .bdrv_co_get_block_status()

2018-02-13 Thread Eric Blake

Commit bdd6a90 has a bug: drivers should never directly set
BDRV_BLOCK_ALLOCATED, but only io.c should do that (as needed).
Instead, drivers should report BDRV_BLOCK_DATA if it knows that
data comes from this BDS.

But let's look at the bigger picture: semantically, the nvme
driver is similar to the nbd, null, and raw drivers (no backing
file, all data comes from this BDS).  But while two of those
other drivers have to supply the callback (null because it can
special-case BDRV_BLOCK_ZERO, raw because it can special-case
a different offset), in this case the block layer defaults are
good enough without the callback at all (similar to nbd).

So, fix the bug by deletion ;)

Signed-off-by: Eric Blake 

---
v8: new patch
---
 block/nvme.c | 14 --
 1 file changed, 14 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index 10bffbbf2f4..4e561b08df3 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -1068,18 +1068,6 @@ static int nvme_reopen_prepare(BDRVReopenState 
*reopen_state,
 return 0;
 }

-static int64_t coroutine_fn nvme_co_get_block_status(BlockDriverState *bs,
- int64_t sector_num,
- int nb_sectors, int *pnum,
- BlockDriverState **file)
-{
-*pnum = nb_sectors;
-*file = bs;
-
-return BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_OFFSET_VALID |
-   (sector_num << BDRV_SECTOR_BITS);
-}
-
 static void nvme_refresh_filename(BlockDriverState *bs, QDict *opts)
 {
 QINCREF(opts);
@@ -1179,8 +1167,6 @@ static BlockDriver bdrv_nvme = {
 .bdrv_co_flush_to_disk= nvme_co_flush,
 .bdrv_reopen_prepare  = nvme_reopen_prepare,

-.bdrv_co_get_block_status = nvme_co_get_block_status,
-
 .bdrv_refresh_filename= nvme_refresh_filename,
 .bdrv_refresh_limits  = nvme_refresh_limits,

-- 
2.14.3

[Qemu-block] [PATCH v8 06/21] iscsi: Switch cluster_sectors to byte-based

2018-02-13 Thread Eric Blake

We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Convert all uses of
the cluster size in sectors, along with adding assertions that we
are not dividing by zero.

Improve some comment grammar while in the area.

Signed-off-by: Eric Blake 
Acked-by: Paolo Bonzini 
Reviewed-by: Fam Zheng 

---
v8: rebase to master
v2-v7: no change
---
 block/iscsi.c | 56 +++-
 1 file changed, 35 insertions(+), 21 deletions(-)

diff --git a/block/iscsi.c b/block/iscsi.c
index 421983dd6ff..3414c21c7f5 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -86,7 +86,7 @@ typedef struct IscsiLun {
 unsigned long *allocmap;
 unsigned long *allocmap_valid;
 long allocmap_size;
-int cluster_sectors;
+int cluster_size;
 bool use_16_for_rw;
 bool write_protected;
 bool lbpme;
@@ -430,9 +430,10 @@ static int iscsi_allocmap_init(IscsiLun *iscsilun, int 
open_flags)
 {
 iscsi_allocmap_free(iscsilun);

+assert(iscsilun->cluster_size);
 iscsilun->allocmap_size =
-DIV_ROUND_UP(sector_lun2qemu(iscsilun->num_blocks, iscsilun),
- iscsilun->cluster_sectors);
+DIV_ROUND_UP(iscsilun->num_blocks * iscsilun->block_size,
+ iscsilun->cluster_size);

 iscsilun->allocmap = bitmap_try_new(iscsilun->allocmap_size);
 if (!iscsilun->allocmap) {
@@ -440,7 +441,7 @@ static int iscsi_allocmap_init(IscsiLun *iscsilun, int 
open_flags)
 }

 if (open_flags & BDRV_O_NOCACHE) {
-/* in case that cache.direct = on all allocmap entries are
+/* when cache.direct = on all allocmap entries are
  * treated as invalid to force a relookup of the block
  * status on every read request */
 return 0;
@@ -461,17 +462,19 @@ iscsi_allocmap_update(IscsiLun *iscsilun, int64_t 
sector_num,
   int nb_sectors, bool allocated, bool valid)
 {
 int64_t cl_num_expanded, nb_cls_expanded, cl_num_shrunk, nb_cls_shrunk;
+int cluster_sectors = iscsilun->cluster_size >> BDRV_SECTOR_BITS;

 if (iscsilun->allocmap == NULL) {
 return;
 }
 /* expand to entirely contain all affected clusters */
-cl_num_expanded = sector_num / iscsilun->cluster_sectors;
+assert(cluster_sectors);
+cl_num_expanded = sector_num / cluster_sectors;
 nb_cls_expanded = DIV_ROUND_UP(sector_num + nb_sectors,
-   iscsilun->cluster_sectors) - 
cl_num_expanded;
+   cluster_sectors) - cl_num_expanded;
 /* shrink to touch only completely contained clusters */
-cl_num_shrunk = DIV_ROUND_UP(sector_num, iscsilun->cluster_sectors);
-nb_cls_shrunk = (sector_num + nb_sectors) / iscsilun->cluster_sectors
+cl_num_shrunk = DIV_ROUND_UP(sector_num, cluster_sectors);
+nb_cls_shrunk = (sector_num + nb_sectors) / cluster_sectors
   - cl_num_shrunk;
 if (allocated) {
 bitmap_set(iscsilun->allocmap, cl_num_expanded, nb_cls_expanded);
@@ -535,9 +538,12 @@ iscsi_allocmap_is_allocated(IscsiLun *iscsilun, int64_t 
sector_num,
 if (iscsilun->allocmap == NULL) {
 return true;
 }
-size = DIV_ROUND_UP(sector_num + nb_sectors, iscsilun->cluster_sectors);
+assert(iscsilun->cluster_size);
+size = DIV_ROUND_UP(sector_num + nb_sectors,
+iscsilun->cluster_size >> BDRV_SECTOR_BITS);
 return !(find_next_bit(iscsilun->allocmap, size,
-   sector_num / iscsilun->cluster_sectors) == size);
+   sector_num * BDRV_SECTOR_SIZE /
+   iscsilun->cluster_size) == size);
 }

 static inline bool iscsi_allocmap_is_valid(IscsiLun *iscsilun,
@@ -547,9 +553,12 @@ static inline bool iscsi_allocmap_is_valid(IscsiLun 
*iscsilun,
 if (iscsilun->allocmap_valid == NULL) {
 return false;
 }
-size = DIV_ROUND_UP(sector_num + nb_sectors, iscsilun->cluster_sectors);
+assert(iscsilun->cluster_size);
+size = DIV_ROUND_UP(sector_num + nb_sectors,
+iscsilun->cluster_size >> BDRV_SECTOR_BITS);
 return (find_next_zero_bit(iscsilun->allocmap_valid, size,
-   sector_num / iscsilun->cluster_sectors) == 
size);
+   sector_num * BDRV_SECTOR_SIZE /
+   iscsilun->cluster_size) == size);
 }

 static int coroutine_fn
@@ -793,16 +802,21 @@ static int coroutine_fn iscsi_co_readv(BlockDriverState 
*bs,
 BlockDriverState *file;
 /* check the block status from the beginning of the cluster
  * containing the start sector */
-int64_t ret = iscsi_co_get_block_status(bs,
-  sector_num - sector_num % iscsilun->cluster_sectors,
-  BDRV_REQUEST_MAX_SECTORS, , );
+int

[Qemu-block] [PATCH v8 07/21] iscsi: Switch iscsi_allocmap_update() to byte-based

2018-02-13 Thread Eric Blake

We are gradually converting to byte-based interfaces, as they are
easier to reason about than sector-based.  Convert all uses of
the allocmap (no semantic change).  Callers that already had bytes
available are simpler, and callers that now scale to bytes will be
easier to switch to byte-based in the future.

Signed-off-by: Eric Blake 
Acked-by: Paolo Bonzini 
Reviewed-by: Fam Zheng 

---
v7: rebase to master, simple enough to keep ack
v3-v6: no change
v2: rebase to count/bytes rename
---
 block/iscsi.c | 90 +--
 1 file changed, 44 insertions(+), 46 deletions(-)

diff --git a/block/iscsi.c b/block/iscsi.c
index 3414c21c7f5..d2b0466775c 100644
--- a/block/iscsi.c
+++ b/block/iscsi.c
@@ -458,24 +458,22 @@ static int iscsi_allocmap_init(IscsiLun *iscsilun, int 
open_flags)
 }

 static void
-iscsi_allocmap_update(IscsiLun *iscsilun, int64_t sector_num,
-  int nb_sectors, bool allocated, bool valid)
+iscsi_allocmap_update(IscsiLun *iscsilun, int64_t offset,
+  int64_t bytes, bool allocated, bool valid)
 {
 int64_t cl_num_expanded, nb_cls_expanded, cl_num_shrunk, nb_cls_shrunk;
-int cluster_sectors = iscsilun->cluster_size >> BDRV_SECTOR_BITS;

 if (iscsilun->allocmap == NULL) {
 return;
 }
 /* expand to entirely contain all affected clusters */
-assert(cluster_sectors);
-cl_num_expanded = sector_num / cluster_sectors;
-nb_cls_expanded = DIV_ROUND_UP(sector_num + nb_sectors,
-   cluster_sectors) - cl_num_expanded;
+assert(iscsilun->cluster_size);
+cl_num_expanded = offset / iscsilun->cluster_size;
+nb_cls_expanded = DIV_ROUND_UP(offset + bytes,
+   iscsilun->cluster_size) - cl_num_expanded;
 /* shrink to touch only completely contained clusters */
-cl_num_shrunk = DIV_ROUND_UP(sector_num, cluster_sectors);
-nb_cls_shrunk = (sector_num + nb_sectors) / cluster_sectors
-  - cl_num_shrunk;
+cl_num_shrunk = DIV_ROUND_UP(offset, iscsilun->cluster_size);
+nb_cls_shrunk = (offset + bytes) / iscsilun->cluster_size - cl_num_shrunk;
 if (allocated) {
 bitmap_set(iscsilun->allocmap, cl_num_expanded, nb_cls_expanded);
 } else {
@@ -498,26 +496,26 @@ iscsi_allocmap_update(IscsiLun *iscsilun, int64_t 
sector_num,
 }

 static void
-iscsi_allocmap_set_allocated(IscsiLun *iscsilun, int64_t sector_num,
- int nb_sectors)
+iscsi_allocmap_set_allocated(IscsiLun *iscsilun, int64_t offset,
+ int64_t bytes)
 {
-iscsi_allocmap_update(iscsilun, sector_num, nb_sectors, true, true);
+iscsi_allocmap_update(iscsilun, offset, bytes, true, true);
 }

 static void
-iscsi_allocmap_set_unallocated(IscsiLun *iscsilun, int64_t sector_num,
-   int nb_sectors)
+iscsi_allocmap_set_unallocated(IscsiLun *iscsilun, int64_t offset,
+   int64_t bytes)
 {
 /* Note: if cache.direct=on the fifth argument to iscsi_allocmap_update
  * is ignored, so this will in effect be an iscsi_allocmap_set_invalid.
  */
-iscsi_allocmap_update(iscsilun, sector_num, nb_sectors, false, true);
+iscsi_allocmap_update(iscsilun, offset, bytes, false, true);
 }

-static void iscsi_allocmap_set_invalid(IscsiLun *iscsilun, int64_t sector_num,
-   int nb_sectors)
+static void iscsi_allocmap_set_invalid(IscsiLun *iscsilun, int64_t offset,
+   int64_t bytes)
 {
-iscsi_allocmap_update(iscsilun, sector_num, nb_sectors, false, false);
+iscsi_allocmap_update(iscsilun, offset, bytes, false, false);
 }

 static void iscsi_allocmap_invalidate(IscsiLun *iscsilun)
@@ -531,34 +529,30 @@ static void iscsi_allocmap_invalidate(IscsiLun *iscsilun)
 }

 static inline bool
-iscsi_allocmap_is_allocated(IscsiLun *iscsilun, int64_t sector_num,
-int nb_sectors)
+iscsi_allocmap_is_allocated(IscsiLun *iscsilun, int64_t offset,
+int64_t bytes)
 {
 unsigned long size;
 if (iscsilun->allocmap == NULL) {
 return true;
 }
 assert(iscsilun->cluster_size);
-size = DIV_ROUND_UP(sector_num + nb_sectors,
-iscsilun->cluster_size >> BDRV_SECTOR_BITS);
+size = DIV_ROUND_UP(offset + bytes, iscsilun->cluster_size);
 return !(find_next_bit(iscsilun->allocmap, size,
-   sector_num * BDRV_SECTOR_SIZE /
-   iscsilun->cluster_size) == size);
+   offset / iscsilun->cluster_size) == size);
 }

 static inline bool iscsi_allocmap_is_valid(IscsiLun *iscsilun,
-   int64_t sector_num, int nb_sectors)
+   int64_t offset, int64_t bytes)
 {

Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI

2018-02-13 Thread Denis V. Lunev

On 02/13/2018 07:46 PM, Eric Blake wrote:
> On 02/13/2018 08:48 AM, Daniel P. Berrangé wrote:
 No, that's policy decision that doesn't matter from QMP pov. If the
 mgmt
 app wants the snapshot to be wrt to the initial time, it can simply
 invoke the "stop" QMP command before doing the live migration and
 "cont" afterwards.
>>>
>>> That would be non-live. I think Roman means a live snapshot that saves
>>> the state at the beginning of the operation. Basically the difference
>>> between blockdev-backup (state at the beginning) and blockdev-mirror
>>> (state at the end), except for a whole VM.
>>
>> That doesn't seem practical unless you can instantaneously write out
>> the entire guest RAM to disk without blocking, or can somehow snapshot
>> the RAM so you can write out a consistent view of the original RAM,
>> while the guest continues to dirty RAM pages.
>
> One idea for that is via fork()'s copy-on-write semantics; the parent
> continues processing the guest, while the child writes out RAM pages.
> Pages touched by the guest in the parent are now cleanly copied by the
> OS so that the child can take all the time it wants, but still writes
> out the state of the guest at the time of the fork().  It may be
> possible to use userfaultfd to achieve the same effects without a fork().
>
this would be problematic as we for sure will face memory problems.
Guest stalls are IMHO much less problematic (as they are expected
by end-user) then memory shortage.

Anyway, this is discussable.

Den

Re: [Qemu-block] [PATCH v6 3/9] block: Add VFIO based NVMe driver

2018-02-13 Thread Eric Blake


On 01/16/2018 12:08 AM, Fam Zheng wrote:

This is a new protocol driver that exclusively opens a host NVMe
controller through VFIO. It achieves better latency than linux-aio by
completely bypassing host kernel vfs/block layer.

 $rw-$bs-$iodepth  linux-aio nvme://
 
 randread-4k-1 10.5k 21.6k
 randread-512k-1   745   1591
 randwrite-4k-130.7k 37.0k
 randwrite-512k-1  1945  1980

 (unit: IOPS)

The driver also integrates with the polling mechanism of iothread.

This patch is co-authored by Paolo and me.

Signed-off-by: Paolo Bonzini 
Signed-off-by: Fam Zheng 
Message-Id: <20180110091846.10699-4-f...@redhat.com>
---


Sorry for not noticing sooner, but


+static int64_t coroutine_fn nvme_co_get_block_status(BlockDriverState *bs,
+ int64_t sector_num,
+ int nb_sectors, int *pnum,
+ BlockDriverState **file)
+{
+*pnum = nb_sectors;
+*file = bs;
+
+return BDRV_BLOCK_ALLOCATED | BDRV_BLOCK_OFFSET_VALID |
+   (sector_num << BDRV_SECTOR_BITS);


This is wrong.  Drivers should only ever return BDRV_BLOCK_DATA (which 
io.c then _adds_ BDRV_BLOCK_ALLOCATED to, as needed).  I'll fix it up as 
part of my byte-based block status series (v8 coming up soon).


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-block] [PATCH v4] ssh: switch from libssh2 to libssh

2018-02-13 Thread Eric Blake


On 02/13/2018 12:49 PM, Max Reitz wrote:

On 2018-01-18 17:44, Pino Toscano wrote:

Rewrite the implementation of the ssh block driver to use libssh instead
of libssh2.  The libssh library has various advantages over libssh2:
- easier API for authentication (for example for using ssh-agent)
- easier API for known_hosts handling
- supports newer types of keys in known_hosts




@@ -628,6 +570,8 @@ static int connect_to_ssh(BDRVSSHState *s, QDict *options,
  Error *local_err = NULL;
  const char *user, *path, *host_key_check;
  long port = 0;
+unsigned long portU = 0;


I was about to say: How about making port an unsigned long and swapping
the qemu_strtol() for a qemu_strtoul()?

But I think you'd rather want an unsigned int instead (and that won't
work with qemu_strtoul()).


Dan has a pending patch that adds qemu_strtoi() and qemu_strtoui(), when 
we want to deal with parsing to ints.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-block] [PATCH v4] ssh: switch from libssh2 to libssh

2018-02-13 Thread Max Reitz

On 2018-01-18 17:44, Pino Toscano wrote:
> Rewrite the implementation of the ssh block driver to use libssh instead
> of libssh2.  The libssh library has various advantages over libssh2:
> - easier API for authentication (for example for using ssh-agent)
> - easier API for known_hosts handling
> - supports newer types of keys in known_hosts
> 
> Kerberos authentication can be enabled once the libssh bug for it [1] is
> fixed.
> 
> The development version of libssh (i.e. the future 0.8.x) supports
> fsync, so reuse the build time check for this.
> 
> [1] https://red.libssh.org/issues/242
> 
> Signed-off-by: Pino Toscano 
> ---
> 
> Changes from v3:
> - fix socket cleanup in connect_to_ssh()
> - add comments about the socket cleanup
> - improve the error reporting (closer to what was with libssh2)
> - improve EOF detection on sftp_read()
> 
> Changes from v2:
> - used again an own fd
> - fixed co_yield() implementation
> 
> Changes from v1:
> - fixed jumbo packets writing
> - fixed missing 'err' assignment
> - fixed commit message

One thing: The performance seems to have dropped hugely, from what I can
tell.

Before this patch, running the iotests 1-10 over ssh (raw/ssh) took
12.6 s.  With this patch, they take 59.3 s.  Perhaps the starkest
contrast can be seen in test 1, which took 4 s before and 27 s after --
this test simply reads and writes 128 MB of continuous data.

I like having elliptic curves, but I think this patch needs optimization
work before we can replace libssh2.

>  block/Makefile.objs |   6 +-
>  block/ssh.c | 522 
> 
>  configure   |  65 ---
>  3 files changed, 278 insertions(+), 315 deletions(-)

[...]

> diff --git a/block/ssh.c b/block/ssh.c
> index b049a16eb9..2975fc27d8 100644
> --- a/block/ssh.c
> +++ b/block/ssh.c

[...]

> @@ -87,27 +81,25 @@ static void ssh_state_init(BDRVSSHState *s)
>  {
>  memset(s, 0, sizeof *s);
>  s->sock = -1;
> -s->offset = -1;
>  qemu_co_mutex_init(>lock);
>  }
>  
>  static void ssh_state_free(BDRVSSHState *s)
>  {
> +if (s->attrs) {
> +sftp_attributes_free(s->attrs);
> +}
>  if (s->sftp_handle) {
> -libssh2_sftp_close(s->sftp_handle);
> +sftp_close(s->sftp_handle);
>  }
>  if (s->sftp) {
> -libssh2_sftp_shutdown(s->sftp);
> +sftp_free(s->sftp);
>  }
>  if (s->session) {
> -libssh2_session_disconnect(s->session,
> -   "from qemu ssh client: "
> -   "user closed the connection");
> -libssh2_session_free(s->session);
> -}
> -if (s->sock >= 0) {
> -close(s->sock);
> +ssh_disconnect(s->session);
> +ssh_free(s->session);
>  }
> +/* s->sock is owned by the ssh_session, which free's it. */

s/free's/frees/

>  }
>  
>  static void GCC_FMT_ATTR(3, 4)
> @@ -121,13 +113,13 @@ session_error_setg(Error **errp, BDRVSSHState *s, const 
> char *fs, ...)
>  va_end(args);
>  
>  if (s->session) {
> -char *ssh_err;
> +const char *ssh_err;
>  int ssh_err_code;
>  
> -/* This is not an errno.  See . */
> -ssh_err_code = libssh2_session_last_error(s->session,
> -  _err, NULL, 0);
> -error_setg(errp, "%s: %s (libssh2 error code: %d)",
> +/* This is not an errno.  See . */
> +ssh_err = ssh_get_error(s->session);
> +ssh_err_code = ssh_get_error_code(s->session);
> +error_setg(errp, "%s: %s (libssh error code: %d)",
> msg, ssh_err, ssh_err_code);

Maybe we should not append the error info if there is no error.

(Example:

$ ./qemu-img info ssh://localhost/tmp/foo
qemu-img: Could not open 'ssh://localhost/tmp/foo': no host key was
found in known_hosts:  (libssh error code: 0)

)

>  } else {
>  error_setg(errp, "%s", msg);

[...]

> @@ -291,68 +283,41 @@ static void ssh_parse_filename(const char *filename, 
> QDict *options,
>  static int check_host_key_knownhosts(BDRVSSHState *s,
>   const char *host, int port, Error 
> **errp)
>  {
> -const char *home;
> -char *knh_file = NULL;
> -LIBSSH2_KNOWNHOSTS *knh = NULL;
> -struct libssh2_knownhost *found;
> -int ret, r;
> -const char *hostkey;
> -size_t len;
> -int type;
> +int ret;
> +int state;
>  
> -hostkey = libssh2_session_hostkey(s->session, , );
> -if (!hostkey) {
> -ret = -EINVAL;
> -session_error_setg(errp, s, "failed to read remote host key");
> -goto out;
> -}
> +state = ssh_is_server_known(s->session);
>  
> -knh = libssh2_knownhost_init(s->session);
> -if (!knh) {
> -ret = -EINVAL;
> -session_error_setg(errp, s,
> -   "failed to initialize known hosts support");
> -goto out;
> -}
> -
> -

Re: [Qemu-block] block_status automatically added flags

2018-02-13 Thread Eric Blake


On 02/13/2018 11:36 AM, Vladimir Sementsov-Ogievskiy wrote:

Hi Eric!

I'm now testing my nbd block status realization (block_status part, not 
about dirty bitmaps), and faced into the following effect.


I created empty qcow2 image and wrote to the first sector, so

qemu-io -c map x

reports:

64 KiB (0x1) bytes allocated at offset 0 bytes (0x0)
9.938 MiB (0x9f) bytes not allocated at offset 64 KiB (0x1)

But I can't get same results, when connecting to nbd server, exporting 
the same qcow2 file, I get


10 MiB (0xa0) bytes allocated at offset 0 bytes (0x0)


Is this with or without your NBD_CMD_BLOCK_STATUS patches applied?  And 
are you exposing the data over NBD as raw ('qemu-nbd -f qcow2'/'qemu-io 
-f raw') or as qcow2 ('qemu-nbd -f raw'/'qemu-io -f qcow2')?


/me does a quick reproduction

Yes, I definitely see that behavior without any NBD_CMD_BLOCK_STATUS 
patches and when the image is exposed over NBD as raw, but not when 
exposed as qcow2, when testing the 2.11 release:


$ qemu-img create -f qcow2 file3 10M
Formatting 'file3', fmt=qcow2 size=10485760 cluster_size=65536 
lazy_refcounts=off refcount_bits=16

$ qemu-io -c 'w 0 64k' -c map -f qcow2 file3
wrote 65536/65536 bytes at offset 0
64 KiB, 1 ops; 0.0579 sec (1.079 MiB/sec and 17.2601 ops/sec)
64 KiB (0x1) bytes allocated at offset 0 bytes (0x0)
9.938 MiB (0x9f) bytes not allocated at offset 64 KiB (0x1)
$ qemu-nbd -f qcow2 -x foo file3
$ qemu-io -f raw -c map nbd://localhost:10809/foo
10 MiB (0xa0) bytes allocated at offset 0 bytes (0x0)
$ qemu-nbd -f raw -x foo file3
$ qemu-io -f qcow2 -c map nbd://localhost:10809/foo
64 KiB (0x1) bytes allocated at offset 0 bytes (0x0)
9.938 MiB (0x9f) bytes not allocated at offset 64 KiB (0x1)

Right now, without NBD block status, the NBD driver reports the entire 
file as allocated, as it can't do any better (NBD has no backing file, 
and all data .  Presumably, once NBD_CMD_BLOCK_STATUS is implemented, we 
can then use that for more accurate information.





Finally, I understand the reason:

for local file, qemu-io calls bdrv_is_allocated, which calls 
bdrv_common_block_status_above with want_zero=false. So, it doesn't set 
BDRV_BLOCK_ZERO, and doesn't set BDRV_BLOCK_ALLOCATED.


'qemu-io map' is a bit unusual; it is the only UI that easily exposes 
bdrv_is_allocated() to the outside world ('qemu-img map' does not). 
(The fact that both operations are named 'map' but do something 
different is annoying; for back-compat reasons, we can't change 
qemu-img, and I don't know if changing qemu-io is worth it.)



And, even if we 
change want_zero to true,


Well, you'd do that by invoking bdrv_block_status() (via 'qemu-img map', 
for example).


here, it will set BDRV_BLOCK_ZERO, but will 
not set BDRV_BLOCK_ALLOCATED, which contradicts with it's definition:


  BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
    layer (short for DATA || ZERO), set by block layer


This text is wrong; it gets fixed in my still-pending concluding series 
for byte-based block status:


https://lists.gnu.org/archive/html/qemu-devel/2018-01/msg00955.html

Conceptually, BDRV_BLOCK_ALLOCATED means "is THIS layer of the backing 
chain responsible for the contents at this guest offset"; and there are 
cases where we know that we read zeroes but where the current layer is 
not responsible for the contents (such as a qcow2 that has a backing 
file with shorter length, where we return BDRV_BLOCK_ZERO but not 
BDRV_BLOCK_ALLOCATED).  But since NBD has no backing chain, the entire 
image is considered allocated.  Meanwhile, asking whether something is 
allocated ('qemu-io -c map') is not the usual question you want to ask 
when determining what portions of a file are zero.





for nbd, we go through the similar way on server (but with want_zero = 
true), and we finally have BDRV_BLOCK_ZERO without BDRV_BLOCK_ALLOCATED, 
which maps to NBD_STATE_HOLE+NBD_STATE_ZERO. But then, in the client we 
have BDRV_BLOCK_ZERO not automatically added by block layer but directly 
from nbd driver, therefor BDRV_BLOCK_ALLOCATED is set and I get 
different result.


Drivers should never set BDRV_BLOCK_ALLOCATED; only the code in io.c 
should set it; and output based on BDRV_BLOCK_ALLOCATED is only useful 
in backing chain scenarios (which NBD does not have).





this all looks weird for me.

BDRV_BLOCK_ALLOCATED definition should be fixed, to show that this flag 
show only reported by driver flags, not automatically added.


If we need to report yet more flags, where the driver can report 
additional information, then that's different.  But changing the 
BDRV_BLOCK_ALLOCATED semantics would probably have knock-on effects that 
I'm not prepared to audit for (that is, we'd rather fix the 
documentation to match reality, which my pending patch does, and NOT 
change the code to  match the current incorrect documentation).




And then the situation

[Qemu-block] block_status automatically added flags

2018-02-13 Thread Vladimir Sementsov-Ogievskiy


Hi Eric!

I'm now testing my nbd block status realization (block_status part, not 
about dirty bitmaps), and faced into the following effect.


I created empty qcow2 image and wrote to the first sector, so

qemu-io -c map x

reports:

64 KiB (0x1) bytes allocated at offset 0 bytes (0x0)
9.938 MiB (0x9f) bytes not allocated at offset 64 KiB (0x1)

But I can't get same results, when connecting to nbd server, exporting 
the same qcow2 file, I get


10 MiB (0xa0) bytes allocated at offset 0 bytes (0x0)


Finally, I understand the reason:

for local file, qemu-io calls bdrv_is_allocated, which calls 
bdrv_common_block_status_above with want_zero=false. So, it doesn't set 
BDRV_BLOCK_ZERO, and doesn't set BDRV_BLOCK_ALLOCATED. And, even if we 
change want_zero to true, here, it will set BDRV_BLOCK_ZERO, but will 
not set BDRV_BLOCK_ALLOCATED, which contradicts with it's definition:


 BDRV_BLOCK_ALLOCATED: the content of the block is determined by this
   layer (short for DATA || ZERO), set by block layer


for nbd, we go through the similar way on server (but with want_zero = 
true), and we finally have BDRV_BLOCK_ZERO without BDRV_BLOCK_ALLOCATED, 
which maps to NBD_STATE_HOLE+NBD_STATE_ZERO. But then, in the client we 
have BDRV_BLOCK_ZERO not automatically added by block layer but directly 
from nbd driver, therefor BDRV_BLOCK_ALLOCATED is set and I get 
different result.



this all looks weird for me.

BDRV_BLOCK_ALLOCATED definition should be fixed, to show that this flag 
show only reported by driver flags, not automatically added.


And then the situation with nbd looks strange, because automatically 
added flag becomes "flag, reported by block driver".


On the other hand, it is not wrong.


I think, to test the feature, I'll improve qemu-io map, to show zeros 
(or is it better to add separate command?) and leave the other things as 
is. What do you think?


--
Best regards,
Vladimir

Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI

2018-02-13 Thread Denis Plotnikov




On 13.02.2018 17:43, Kevin Wolf wrote:

Am 13.02.2018 um 15:30 hat Roman Kagan geschrieben:

On Tue, Feb 13, 2018 at 11:50:24AM +0100, Kevin Wolf wrote:

Am 11.01.2018 um 14:04 hat Daniel P. Berrange geschrieben:

Then you could just use the regular migrate QMP commands for loading
and saving snapshots.


Yes, you could. I think for a proper implementation you would want to do
better, though. Live migration provides just a stream, but that's not
really well suited for snapshots. When a RAM page is dirtied, you just
want to overwrite the old version of it in a snapshot [...]


This means the point in time where the guest state is snapshotted is not
when the command is issued, but any unpredictable amount of time later.

I'm not sure this is what a user expects.


I don't think it's necessarily a big problem as long as you set the
expectations right, but good point anyway.


A better approach for the save part appears to be to stop the vcpus,
dump the device state, resume the vcpus, and save the memory contents
in the background, prioritizing the old copies of the pages that
change.


So basically you would let the guest fault whenever it writes to a page
that is not saved yet, and then save it first before you make the page
writable again? Essentially blockdev-backup, except for RAM.


No multiple copies of the same page would have to be saved so the
stream format would be fine.  For the load part the usual inmigrate
should work.


This is true.

I think it will require changes to the qcow2 driver, though. Currently,
you write the VM state into the active layer and then take the disk
snapshot so that the VM state is automatically snapshotted as well.
Afterwards, the VM state is discarded again in the active layer.

If you want to take the snapshot at the point in time when the command
was issued, you first need to create a qcow2 snapshot to save the disk
state, but then continue to write the VM state into that snapshot, even
though it's not the active layer.

In fact I think this would be more elegant than writing the VM state
into the active layer and then discarding it again, but inactive
snapshots haven't been writable so far, so that will require some
changes.

I'm sure that Denis has already some thoughts on this, though.
Yes, I stick to the scheme when you write a snapshot as it is now but 
instead of ram writing I just preallocate the space for ram and save the 
actual file offsets of the clusters forming the space, then I resume 
vcpus and write the ram using those cluster offsets with respect to vcpu 
pagefaults.




Kevin



--
Best,
Denis

[Qemu-block] [PULL 52/55] qcow2: Allow configuring the L2 slice size

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

Now that the code is ready to handle L2 slices we can finally add an
option to allow configuring their size.

An L2 slice is the portion of an L2 table that is read by the qcow2
cache. Until now the cache was always reading full L2 tables, and
since the L2 table size is equal to the cluster size this was not very
efficient with large clusters. Here's a more detailed explanation of
why it makes sense to have smaller cache entries in order to load L2
data:

   https://lists.gnu.org/archive/html/qemu-block/2017-09/msg00635.html

This patch introduces a new command-line option to the qcow2 driver
named l2-cache-entry-size (cf. l2-cache-size). The cache entry size
has the same restrictions as the cluster size: it must be a power of
two and it has the same range of allowed values, with the additional
requirement that it must not be larger than the cluster size.

The L2 cache entry size (L2 slice size) remains equal to the cluster
size for now by default, so this feature must be explicitly enabled.
Although my tests show that 4KB slices consistently improve
performance and give the best results, let's wait and make more tests
with different cluster sizes before deciding on an optimal default.

Now that the cache entry size is not necessarily equal to the cluster
size we need to reflect that in the MIN_L2_CACHE_SIZE documentation.
That minimum value is a requirement of the COW algorithm: we need to
read two L2 slices (and not two L2 tables) in order to do COW, see
l2_allocate() for the actual code.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
c73e5611ff4a9ec5d20de68a6c289553a13d2354.1517840877.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 qapi/block-core.json |  6 ++
 block/qcow2.h|  6 --
 block/qcow2-cache.c  | 10 --
 block/qcow2.c| 34 +++---
 4 files changed, 45 insertions(+), 11 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 2c107823fe..5c5921bfb7 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -2521,6 +2521,11 @@
 # @l2-cache-size: the maximum size of the L2 table cache in
 # bytes (since 2.2)
 #
+# @l2-cache-entry-size:   the size of each entry in the L2 cache in
+# bytes. It must be a power of two between 512
+# and the cluster size. The default value is
+# the cluster size (since 2.12)
+#
 # @refcount-cache-size:   the maximum size of the refcount block cache
 # in bytes (since 2.2)
 #
@@ -2542,6 +2547,7 @@
 '*overlap-check': 'Qcow2OverlapChecks',
 '*cache-size': 'int',
 '*l2-cache-size': 'int',
+'*l2-cache-entry-size': 'int',
 '*refcount-cache-size': 'int',
 '*cache-clean-interval': 'int',
 '*encrypt': 'BlockdevQcow2Encryption' } }
diff --git a/block/qcow2.h b/block/qcow2.h
index d956a6cdd2..883802241f 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -68,7 +68,7 @@
 #define MAX_CLUSTER_BITS 21
 
 /* Must be at least 2 to cover COW */
-#define MIN_L2_CACHE_SIZE 2 /* clusters */
+#define MIN_L2_CACHE_SIZE 2 /* cache entries */
 
 /* Must be at least 4 to cover all cases of refcount table growth */
 #define MIN_REFCOUNT_CACHE_SIZE 4 /* clusters */
@@ -100,6 +100,7 @@
 #define QCOW2_OPT_OVERLAP_INACTIVE_L2 "overlap-check.inactive-l2"
 #define QCOW2_OPT_CACHE_SIZE "cache-size"
 #define QCOW2_OPT_L2_CACHE_SIZE "l2-cache-size"
+#define QCOW2_OPT_L2_CACHE_ENTRY_SIZE "l2-cache-entry-size"
 #define QCOW2_OPT_REFCOUNT_CACHE_SIZE "refcount-cache-size"
 #define QCOW2_OPT_CACHE_CLEAN_INTERVAL "cache-clean-interval"
 
@@ -647,7 +648,8 @@ void qcow2_free_snapshots(BlockDriverState *bs);
 int qcow2_read_snapshots(BlockDriverState *bs);
 
 /* qcow2-cache.c functions */
-Qcow2Cache *qcow2_cache_create(BlockDriverState *bs, int num_tables);
+Qcow2Cache *qcow2_cache_create(BlockDriverState *bs, int num_tables,
+   unsigned table_size);
 int qcow2_cache_destroy(Qcow2Cache *c);
 
 void qcow2_cache_entry_mark_dirty(Qcow2Cache *c, void *table);
diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index b1aa42477e..d9dafa31e5 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -120,14 +120,20 @@ void qcow2_cache_clean_unused(Qcow2Cache *c)
 c->cache_clean_lru_counter = c->lru_counter;
 }
 
-Qcow2Cache *qcow2_cache_create(BlockDriverState *bs, int num_tables)
+Qcow2Cache *qcow2_cache_create(BlockDriverState *bs, int num_tables,
+   unsigned table_size)
 {
 BDRVQcow2State *s = bs->opaque;
 Qcow2Cache *c;
 
+assert(num_tables > 0);
+assert(is_power_of_2(table_size));
+assert(table_size >= (1 << MIN_CLUSTER_BITS));
+assert(table_size <= s->cluster_size);
+

[Qemu-block] [PULL 55/55] iotests: Add l2-cache-entry-size to iotest 137

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

This test tries reopening a qcow2 image with valid and invalid
options. This patch adds l2-cache-entry-size to the set.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
3d3b7d2dbfc020deaef60fb58739b0801eb9517c.1517840877.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/137 | 5 +
 tests/qemu-iotests/137.out | 2 ++
 2 files changed, 7 insertions(+)

diff --git a/tests/qemu-iotests/137 b/tests/qemu-iotests/137
index 5a01250005..87965625d8 100755
--- a/tests/qemu-iotests/137
+++ b/tests/qemu-iotests/137
@@ -83,6 +83,9 @@ $QEMU_IO \
 -c "reopen -o overlap-check.inactive-l2=off" \
 -c "reopen -o cache-size=1M" \
 -c "reopen -o l2-cache-size=512k" \
+-c "reopen -o l2-cache-entry-size=512" \
+-c "reopen -o l2-cache-entry-size=4k" \
+-c "reopen -o l2-cache-entry-size=64k" \
 -c "reopen -o refcount-cache-size=128k" \
 -c "reopen -o cache-clean-interval=5" \
 -c "reopen -o cache-clean-interval=0" \
@@ -107,6 +110,8 @@ $QEMU_IO \
 -c "reopen -o cache-size=1M,l2-cache-size=2M" \
 -c "reopen -o cache-size=1M,refcount-cache-size=2M" \
 -c "reopen -o l2-cache-size=256T" \
+-c "reopen -o l2-cache-entry-size=33k" \
+-c "reopen -o l2-cache-entry-size=128k" \
 -c "reopen -o refcount-cache-size=256T" \
 -c "reopen -o overlap-check=constant,overlap-check.template=all" \
 -c "reopen -o overlap-check=blubb" \
diff --git a/tests/qemu-iotests/137.out b/tests/qemu-iotests/137.out
index 05efd74d17..e28e1eadba 100644
--- a/tests/qemu-iotests/137.out
+++ b/tests/qemu-iotests/137.out
@@ -20,6 +20,8 @@ cache-size, l2-cache-size and refcount-cache-size may not be 
set the same time
 l2-cache-size may not exceed cache-size
 refcount-cache-size may not exceed cache-size
 L2 cache size too big
+L2 cache entry size must be a power of two between 512 and the cluster size 
(65536)
+L2 cache entry size must be a power of two between 512 and the cluster size 
(65536)
 L2 cache size too big
 Conflicting values for qcow2 options 'overlap-check' ('constant') and 
'overlap-check.template' ('all')
 Unsupported value 'blubb' for qcow2 option 'overlap-check'. Allowed are any of 
the following: none, constant, cached, all
-- 
2.13.6

Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI

2018-02-13 Thread Denis Plotnikov




On 13.02.2018 18:05, Dr. David Alan Gilbert wrote:

* Denis V. Lunev (d...@virtuozzo.com) wrote:

On 02/13/2018 05:59 PM, Dr. David Alan Gilbert wrote:

* Daniel P. Berrangé (berra...@redhat.com) wrote:

On Tue, Feb 13, 2018 at 03:45:21PM +0100, Kevin Wolf wrote:

Am 13.02.2018 um 15:36 hat Daniel P. Berrangé geschrieben:

On Tue, Feb 13, 2018 at 05:30:02PM +0300, Roman Kagan wrote:

On Tue, Feb 13, 2018 at 11:50:24AM +0100, Kevin Wolf wrote:

Am 11.01.2018 um 14:04 hat Daniel P. Berrange geschrieben:

Then you could just use the regular migrate QMP commands for loading
and saving snapshots.

Yes, you could. I think for a proper implementation you would want to do
better, though. Live migration provides just a stream, but that's not
really well suited for snapshots. When a RAM page is dirtied, you just
want to overwrite the old version of it in a snapshot [...]

This means the point in time where the guest state is snapshotted is not
when the command is issued, but any unpredictable amount of time later.

I'm not sure this is what a user expects.

A better approach for the save part appears to be to stop the vcpus,
dump the device state, resume the vcpus, and save the memory contents in
the background, prioritizing the old copies of the pages that change.
No multiple copies of the same page would have to be saved so the stream
format would be fine.  For the load part the usual inmigrate should
work.

No, that's policy decision that doesn't matter from QMP pov. If the mgmt
app wants the snapshot to be wrt to the initial time, it can simply
invoke the "stop" QMP command before doing the live migration and
"cont" afterwards.

That would be non-live. I think Roman means a live snapshot that saves
the state at the beginning of the operation. Basically the difference
between blockdev-backup (state at the beginning) and blockdev-mirror
(state at the end), except for a whole VM.

That doesn't seem practical unless you can instantaneously write out
the entire guest RAM to disk without blocking, or can somehow snapshot
the RAM so you can write out a consistent view of the original RAM,
while the guest continues to dirty RAM pages.

People have suggested doing something like that with userfault write
mode; but the same would also be doable just by write protecting the
whole of RAM and then following the faults.


nope, userfault fd does not help :( We have tried, the functionality is not
enough. Better to have small extension to KVM to protect all memory
and notify QEMU with accessed address.


Can you explain why? I thought the write-protect mode of userfaultfd was
supposed to be able to do that; cc'ing in Andrea


Hi everybody

Yes, that's true but it isn't implemented yet in the kernel

...
userfaultfd_register

if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) {
vm_flags |= VM_UFFD_WP;
/*
 * FIXME: remove the below error constraint by
 * implementing the wprotect tracking mode.
 */
ret = -EINVAL;
goto out;
}

and I don't feel like doing that taking into account that somebody has 
it done already

but when it's done I'll use it with pleasure.

KVM need to tiny wini modification to be made in its mmu part which 
would tell to the userspace the fault address. This is a simple solution 
which can be used while we're living without userfaultfd.




Dave


Den

--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK



--
Best,
Denis

[Qemu-block] [PULL 45/55] qcow2: Prepare expand_zero_clusters_in_l1() for adding L2 slice support

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

Adding support for L2 slices to expand_zero_clusters_in_l1() needs
(among other things) an extra loop that iterates over all slices of
each L2 table.

Putting all changes in one patch would make it hard to read because
all semantic changes would be mixed with pure indentation changes.

To make things easier this patch simply creates a new block and
changes the indentation of all lines of code inside it. Thus, all
modifications in this patch are cosmetic. There are no semantic
changes and no variables are renamed yet. The next patch will take
care of that.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
c2ae9f31ed5b6e591477ad4654448badd1c89d73.1517840877.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 block/qcow2-cluster.c | 187 ++
 1 file changed, 96 insertions(+), 91 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 1041dcb6dd..ccb2944eda 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1904,118 +1904,123 @@ static int 
expand_zero_clusters_in_l1(BlockDriverState *bs, uint64_t *l1_table,
 goto fail;
 }
 
-if (is_active_l1) {
-/* get active L2 tables from cache */
-ret = qcow2_cache_get(bs, s->l2_table_cache, l2_offset,
-(void **)_table);
-} else {
-/* load inactive L2 tables from disk */
-ret = bdrv_read(bs->file, l2_offset / BDRV_SECTOR_SIZE,
-(void *)l2_table, s->cluster_sectors);
-}
-if (ret < 0) {
-goto fail;
-}
-
-for (j = 0; j < s->l2_size; j++) {
-uint64_t l2_entry = be64_to_cpu(l2_table[j]);
-int64_t offset = l2_entry & L2E_OFFSET_MASK;
-QCow2ClusterType cluster_type = qcow2_get_cluster_type(l2_entry);
-
-if (cluster_type != QCOW2_CLUSTER_ZERO_PLAIN &&
-cluster_type != QCOW2_CLUSTER_ZERO_ALLOC) {
-continue;
+{
+if (is_active_l1) {
+/* get active L2 tables from cache */
+ret = qcow2_cache_get(bs, s->l2_table_cache, l2_offset,
+  (void **)_table);
+} else {
+/* load inactive L2 tables from disk */
+ret = bdrv_read(bs->file, l2_offset / BDRV_SECTOR_SIZE,
+(void *)l2_table, s->cluster_sectors);
+}
+if (ret < 0) {
+goto fail;
 }
 
-if (cluster_type == QCOW2_CLUSTER_ZERO_PLAIN) {
-if (!bs->backing) {
-/* not backed; therefore we can simply deallocate the
- * cluster */
-l2_table[j] = 0;
-l2_dirty = true;
+for (j = 0; j < s->l2_size; j++) {
+uint64_t l2_entry = be64_to_cpu(l2_table[j]);
+int64_t offset = l2_entry & L2E_OFFSET_MASK;
+QCow2ClusterType cluster_type =
+qcow2_get_cluster_type(l2_entry);
+
+if (cluster_type != QCOW2_CLUSTER_ZERO_PLAIN &&
+cluster_type != QCOW2_CLUSTER_ZERO_ALLOC) {
 continue;
 }
 
-offset = qcow2_alloc_clusters(bs, s->cluster_size);
-if (offset < 0) {
-ret = offset;
-goto fail;
-}
+if (cluster_type == QCOW2_CLUSTER_ZERO_PLAIN) {
+if (!bs->backing) {
+/* not backed; therefore we can simply deallocate the
+ * cluster */
+l2_table[j] = 0;
+l2_dirty = true;
+continue;
+}
+
+offset = qcow2_alloc_clusters(bs, s->cluster_size);
+if (offset < 0) {
+ret = offset;
+goto fail;
+}
 
-if (l2_refcount > 1) {
-/* For shared L2 tables, set the refcount accordingly (it 
is
- * already 1 and needs to be l2_refcount) */
-ret = qcow2_update_cluster_refcount(bs,
-offset >> s->cluster_bits,
+if (l2_refcount > 1) {
+/* For shared L2 tables, set the refcount accordingly
+ * (it is already 1 and needs to be l2_refcount) */
+ret = qcow2_update_cluster_refcount(
+bs, offset >> s->cluster_bits,
 refcount_diff(1, l2_refcount), false,
 QCOW2_DISCARD_OTHER);
-if (ret < 0) {

[Qemu-block] [PULL 54/55] iotests: Test downgrading an image using a small L2 slice size

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

expand_zero_clusters_in_l1() is used when downgrading qcow2 images
from v3 to v2 (compat=0.10). This is one of the functions that needed
more changes to support L2 slices, so this patch extends iotest 061 to
test downgrading a qcow2 image using a smaller slice size.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
3e5662dce5e4926c8fabbad4c0b9142b2a506dd4.1517840877.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/061 | 16 
 tests/qemu-iotests/061.out | 61 ++
 2 files changed, 77 insertions(+)

diff --git a/tests/qemu-iotests/061 b/tests/qemu-iotests/061
index f5678b10c9..911b6f2894 100755
--- a/tests/qemu-iotests/061
+++ b/tests/qemu-iotests/061
@@ -54,6 +54,22 @@ $QEMU_IO -c "read -P 0 0 128k" "$TEST_IMG" | _filter_qemu_io
 _check_test_img
 
 echo
+echo "=== Testing version downgrade with zero expansion and 4K cache entries 
==="
+echo
+IMGOPTS="compat=1.1,lazy_refcounts=on" _make_test_img 64M
+$QEMU_IO -c "write -z 0 128k" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "write -z 32M 128k" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c map "$TEST_IMG" | _filter_qemu_io
+$PYTHON qcow2.py "$TEST_IMG" dump-header
+$QEMU_IMG amend -o "compat=0.10" --image-opts \
+  driver=qcow2,file.filename=$TEST_IMG,l2-cache-entry-size=4096
+$PYTHON qcow2.py "$TEST_IMG" dump-header
+$QEMU_IO -c "read -P 0 0 128k" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c "read -P 0 32M 128k" "$TEST_IMG" | _filter_qemu_io
+$QEMU_IO -c map "$TEST_IMG" | _filter_qemu_io
+_check_test_img
+
+echo
 echo "=== Testing dirty version downgrade ==="
 echo
 IMGOPTS="compat=1.1,lazy_refcounts=on" _make_test_img 64M
diff --git a/tests/qemu-iotests/061.out b/tests/qemu-iotests/061.out
index 942485de99..e857ef9a7d 100644
--- a/tests/qemu-iotests/061.out
+++ b/tests/qemu-iotests/061.out
@@ -52,6 +52,67 @@ read 131072/131072 bytes at offset 0
 128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 No errors were found on the image.
 
+=== Testing version downgrade with zero expansion and 4K cache entries ===
+
+Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
+wrote 131072/131072 bytes at offset 0
+128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 131072/131072 bytes at offset 33554432
+128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+128 KiB (0x2) bytes allocated at offset 0 bytes (0x0)
+31.875 MiB (0x1fe) bytes not allocated at offset 128 KiB (0x2)
+128 KiB (0x2) bytes allocated at offset 32 MiB (0x200)
+31.875 MiB (0x1fe) bytes not allocated at offset 32.125 MiB (0x202)
+magic 0x514649fb
+version   3
+backing_file_offset   0x0
+backing_file_size 0x0
+cluster_bits  16
+size  67108864
+crypt_method  0
+l1_size   1
+l1_table_offset   0x3
+refcount_table_offset 0x1
+refcount_table_clusters   1
+nb_snapshots  0
+snapshot_offset   0x0
+incompatible_features 0x0
+compatible_features   0x1
+autoclear_features0x0
+refcount_order4
+header_length 104
+
+Header extension:
+magic 0x6803f857
+length144
+data  
+
+magic 0x514649fb
+version   2
+backing_file_offset   0x0
+backing_file_size 0x0
+cluster_bits  16
+size  67108864
+crypt_method  0
+l1_size   1
+l1_table_offset   0x3
+refcount_table_offset 0x1
+refcount_table_clusters   1
+nb_snapshots  0
+snapshot_offset   0x0
+incompatible_features 0x0
+compatible_features   0x0
+autoclear_features0x0
+refcount_order4
+header_length 72
+
+read 131072/131072 bytes at offset 0
+128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 131072/131072 bytes at offset 33554432
+128 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+64 MiB (0x400) bytes not allocated at offset 0 bytes (0x0)
+No errors were found on the image.
+
 === Testing dirty version downgrade ===
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=67108864
-- 
2.13.6

[Qemu-block] [PULL 53/55] iotests: Test valid values of l2-cache-entry-size

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

The l2-cache-entry-size setting can only contain values that are
powers of two between 512 and the cluster size.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
bd3547b670b8d0af11480c760991a22bcae5b48c.1517840877.git.be...@igalia.com
[mreitz: Changed non-power-of-two test value from 300 to 4242]
Signed-off-by: Max Reitz 
---
 tests/qemu-iotests/103 | 17 +
 tests/qemu-iotests/103.out |  3 +++
 2 files changed, 20 insertions(+)

diff --git a/tests/qemu-iotests/103 b/tests/qemu-iotests/103
index d0cfab8844..2841318492 100755
--- a/tests/qemu-iotests/103
+++ b/tests/qemu-iotests/103
@@ -66,6 +66,14 @@ $QEMU_IO -c "open -o cache-size=1M,refcount-cache-size=2M 
$TEST_IMG" 2>&1 \
 $QEMU_IO -c "open -o cache-size=0,l2-cache-size=0,refcount-cache-size=0 
$TEST_IMG" \
 2>&1 | _filter_testdir | _filter_imgfmt
 
+# Invalid cache entry sizes
+$QEMU_IO -c "open -o l2-cache-entry-size=256 $TEST_IMG" \
+2>&1 | _filter_testdir | _filter_imgfmt
+$QEMU_IO -c "open -o l2-cache-entry-size=4242 $TEST_IMG" \
+2>&1 | _filter_testdir | _filter_imgfmt
+$QEMU_IO -c "open -o l2-cache-entry-size=128k $TEST_IMG" \
+2>&1 | _filter_testdir | _filter_imgfmt
+
 echo
 echo '=== Testing valid option combinations ==='
 echo
@@ -94,6 +102,15 @@ $QEMU_IO -c "open -o 
l2-cache-size=1M,refcount-cache-size=0.25M $TEST_IMG" \
  -c 'read -P 42 0 64k' \
 | _filter_qemu_io
 
+# Valid cache entry sizes
+$QEMU_IO -c "open -o l2-cache-entry-size=512 $TEST_IMG" \
+2>&1 | _filter_testdir | _filter_imgfmt
+$QEMU_IO -c "open -o l2-cache-entry-size=16k $TEST_IMG" \
+2>&1 | _filter_testdir | _filter_imgfmt
+$QEMU_IO -c "open -o l2-cache-entry-size=64k $TEST_IMG" \
+2>&1 | _filter_testdir | _filter_imgfmt
+
+
 echo
 echo '=== Testing minimal L2 cache and COW ==='
 echo
diff --git a/tests/qemu-iotests/103.out b/tests/qemu-iotests/103.out
index b7aaadf89a..bd45d3875a 100644
--- a/tests/qemu-iotests/103.out
+++ b/tests/qemu-iotests/103.out
@@ -9,6 +9,9 @@ can't open device TEST_DIR/t.IMGFMT: cache-size, l2-cache-size 
and refcount-cach
 can't open device TEST_DIR/t.IMGFMT: l2-cache-size may not exceed cache-size
 can't open device TEST_DIR/t.IMGFMT: refcount-cache-size may not exceed 
cache-size
 can't open device TEST_DIR/t.IMGFMT: cache-size, l2-cache-size and 
refcount-cache-size may not be set the same time
+can't open device TEST_DIR/t.IMGFMT: L2 cache entry size must be a power of 
two between 512 and the cluster size (65536)
+can't open device TEST_DIR/t.IMGFMT: L2 cache entry size must be a power of 
two between 512 and the cluster size (65536)
+can't open device TEST_DIR/t.IMGFMT: L2 cache entry size must be a power of 
two between 512 and the cluster size (65536)
 
 === Testing valid option combinations ===
 
-- 
2.13.6

[Qemu-block] [PULL 48/55] qcow2: Rename l2_table in qcow2_alloc_compressed_cluster_offset()

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

This function doesn't need any changes to support L2 slices, but since
it's now dealing with slices instead of full tables, the l2_table
variable is renamed for clarity.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
0c5d4b9bf163aa3b49ec19cc512a50d83563f2ad.1517840877.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 block/qcow2-cluster.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 8fbaba008b..24055e19a1 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -747,26 +747,26 @@ uint64_t 
qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
 {
 BDRVQcow2State *s = bs->opaque;
 int l2_index, ret;
-uint64_t *l2_table;
+uint64_t *l2_slice;
 int64_t cluster_offset;
 int nb_csectors;
 
-ret = get_cluster_table(bs, offset, _table, _index);
+ret = get_cluster_table(bs, offset, _slice, _index);
 if (ret < 0) {
 return 0;
 }
 
 /* Compression can't overwrite anything. Fail if the cluster was already
  * allocated. */
-cluster_offset = be64_to_cpu(l2_table[l2_index]);
+cluster_offset = be64_to_cpu(l2_slice[l2_index]);
 if (cluster_offset & L2E_OFFSET_MASK) {
-qcow2_cache_put(s->l2_table_cache, (void **) _table);
+qcow2_cache_put(s->l2_table_cache, (void **) _slice);
 return 0;
 }
 
 cluster_offset = qcow2_alloc_bytes(bs, compressed_size);
 if (cluster_offset < 0) {
-qcow2_cache_put(s->l2_table_cache, (void **) _table);
+qcow2_cache_put(s->l2_table_cache, (void **) _slice);
 return 0;
 }
 
@@ -781,9 +781,9 @@ uint64_t 
qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
 /* compressed clusters never have the copied flag */
 
 BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE_COMPRESSED);
-qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table);
-l2_table[l2_index] = cpu_to_be64(cluster_offset);
-qcow2_cache_put(s->l2_table_cache, (void **) _table);
+qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
+l2_slice[l2_index] = cpu_to_be64(cluster_offset);
+qcow2_cache_put(s->l2_table_cache, (void **) _slice);
 
 return cluster_offset;
 }
-- 
2.13.6

[Qemu-block] [PULL 51/55] qcow2: Rename l2_table in count_cow_clusters()

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

This function doesn't need any changes to support L2 slices, but since
it's now dealing with slices intead of full tables, the l2_table
variable is renamed for clarity.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
6107001fc79e6739242f1de7d191375e4f130aac.1517840877.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 block/qcow2-cluster.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 074a4aaf1e..e406b0f3b9 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -999,12 +999,12 @@ err:
  * which must copy from the backing file)
  */
 static int count_cow_clusters(BDRVQcow2State *s, int nb_clusters,
-uint64_t *l2_table, int l2_index)
+uint64_t *l2_slice, int l2_index)
 {
 int i;
 
 for (i = 0; i < nb_clusters; i++) {
-uint64_t l2_entry = be64_to_cpu(l2_table[l2_index + i]);
+uint64_t l2_entry = be64_to_cpu(l2_slice[l2_index + i]);
 QCow2ClusterType cluster_type = qcow2_get_cluster_type(l2_entry);
 
 switch(cluster_type) {
-- 
2.13.6

[Qemu-block] [PULL 50/55] qcow2: Rename l2_table in count_contiguous_clusters_unallocated()

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

This function doesn't need any changes to support L2 slices, but since
it's now dealing with slices instead of full tables, the l2_table
variable is renamed for clarity.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
78bcc54bc632574dd0b900a77a00a1b6ffc359e6.1517840877.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 block/qcow2-cluster.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 54ae210b43..074a4aaf1e 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -406,10 +406,10 @@ static int count_contiguous_clusters(int nb_clusters, int 
cluster_size,
 
 /*
  * Checks how many consecutive unallocated clusters in a given L2
- * table have the same cluster type.
+ * slice have the same cluster type.
  */
 static int count_contiguous_clusters_unallocated(int nb_clusters,
- uint64_t *l2_table,
+ uint64_t *l2_slice,
  QCow2ClusterType wanted_type)
 {
 int i;
@@ -417,7 +417,7 @@ static int count_contiguous_clusters_unallocated(int 
nb_clusters,
 assert(wanted_type == QCOW2_CLUSTER_ZERO_PLAIN ||
wanted_type == QCOW2_CLUSTER_UNALLOCATED);
 for (i = 0; i < nb_clusters; i++) {
-uint64_t entry = be64_to_cpu(l2_table[i]);
+uint64_t entry = be64_to_cpu(l2_slice[i]);
 QCow2ClusterType type = qcow2_get_cluster_type(entry);
 
 if (type != wanted_type) {
-- 
2.13.6

[Qemu-block] [PULL 49/55] qcow2: Rename l2_table in count_contiguous_clusters()

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

This function doesn't need any changes to support L2 slices, but since
it's now dealing with slices intead of full tables, the l2_table
variable is renamed for clarity.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
812b0c3505bb1687e51285dccf1a94f0cecb1f74.1517840877.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 block/qcow2-cluster.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 24055e19a1..54ae210b43 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -370,19 +370,19 @@ fail:
 }
 
 /*
- * Checks how many clusters in a given L2 table are contiguous in the image
+ * Checks how many clusters in a given L2 slice are contiguous in the image
  * file. As soon as one of the flags in the bitmask stop_flags changes compared
  * to the first cluster, the search is stopped and the cluster is not counted
  * as contiguous. (This allows it, for example, to stop at the first compressed
  * cluster which may require a different handling)
  */
 static int count_contiguous_clusters(int nb_clusters, int cluster_size,
-uint64_t *l2_table, uint64_t stop_flags)
+uint64_t *l2_slice, uint64_t stop_flags)
 {
 int i;
 QCow2ClusterType first_cluster_type;
 uint64_t mask = stop_flags | L2E_OFFSET_MASK | QCOW_OFLAG_COMPRESSED;
-uint64_t first_entry = be64_to_cpu(l2_table[0]);
+uint64_t first_entry = be64_to_cpu(l2_slice[0]);
 uint64_t offset = first_entry & mask;
 
 if (!offset) {
@@ -395,7 +395,7 @@ static int count_contiguous_clusters(int nb_clusters, int 
cluster_size,
first_cluster_type == QCOW2_CLUSTER_ZERO_ALLOC);
 
 for (i = 0; i < nb_clusters; i++) {
-uint64_t l2_entry = be64_to_cpu(l2_table[i]) & mask;
+uint64_t l2_entry = be64_to_cpu(l2_slice[i]) & mask;
 if (offset + (uint64_t) i * cluster_size != l2_entry) {
 break;
 }
-- 
2.13.6

[Qemu-block] [PULL 46/55] qcow2: Update expand_zero_clusters_in_l1() to support L2 slices

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

expand_zero_clusters_in_l1() expands zero clusters as a necessary step
to downgrade qcow2 images to a version that doesn't support metadata
zero clusters. This function takes an L1 table (which may or may not
be active) and iterates over all its L2 tables looking for zero
clusters.

Since we'll be loading L2 slices instead of full tables we need to add
an extra loop that iterates over all slices of each L2 table, and we
should also use the slice size when allocating the buffer used when
the L1 table is not active.

This function doesn't need any additional changes so apart from that
this patch simply updates the variable name from l2_table to l2_slice.

Finally, and since we have to touch the bdrv_read() / bdrv_write()
calls anyway, this patch takes the opportunity to replace them with
the byte-based bdrv_pread() / bdrv_pwrite().

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
43590976f730501688096cff103f2923b72b0f32.1517840877.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 block/qcow2-cluster.c | 51 ---
 1 file changed, 28 insertions(+), 23 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index ccb2944eda..8fbaba008b 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1863,22 +1863,25 @@ static int expand_zero_clusters_in_l1(BlockDriverState 
*bs, uint64_t *l1_table,
 {
 BDRVQcow2State *s = bs->opaque;
 bool is_active_l1 = (l1_table == s->l1_table);
-uint64_t *l2_table = NULL;
+uint64_t *l2_slice = NULL;
+unsigned slice, slice_size2, n_slices;
 int ret;
 int i, j;
 
+slice_size2 = s->l2_slice_size * sizeof(uint64_t);
+n_slices = s->cluster_size / slice_size2;
+
 if (!is_active_l1) {
 /* inactive L2 tables require a buffer to be stored in when loading
  * them from disk */
-l2_table = qemu_try_blockalign(bs->file->bs, s->cluster_size);
-if (l2_table == NULL) {
+l2_slice = qemu_try_blockalign(bs->file->bs, slice_size2);
+if (l2_slice == NULL) {
 return -ENOMEM;
 }
 }
 
 for (i = 0; i < l1_size; i++) {
 uint64_t l2_offset = l1_table[i] & L1E_OFFSET_MASK;
-bool l2_dirty = false;
 uint64_t l2_refcount;
 
 if (!l2_offset) {
@@ -1904,22 +1907,23 @@ static int expand_zero_clusters_in_l1(BlockDriverState 
*bs, uint64_t *l1_table,
 goto fail;
 }
 
-{
+for (slice = 0; slice < n_slices; slice++) {
+uint64_t slice_offset = l2_offset + slice * slice_size2;
+bool l2_dirty = false;
 if (is_active_l1) {
 /* get active L2 tables from cache */
-ret = qcow2_cache_get(bs, s->l2_table_cache, l2_offset,
-  (void **)_table);
+ret = qcow2_cache_get(bs, s->l2_table_cache, slice_offset,
+  (void **)_slice);
 } else {
 /* load inactive L2 tables from disk */
-ret = bdrv_read(bs->file, l2_offset / BDRV_SECTOR_SIZE,
-(void *)l2_table, s->cluster_sectors);
+ret = bdrv_pread(bs->file, slice_offset, l2_slice, 
slice_size2);
 }
 if (ret < 0) {
 goto fail;
 }
 
-for (j = 0; j < s->l2_size; j++) {
-uint64_t l2_entry = be64_to_cpu(l2_table[j]);
+for (j = 0; j < s->l2_slice_size; j++) {
+uint64_t l2_entry = be64_to_cpu(l2_slice[j]);
 int64_t offset = l2_entry & L2E_OFFSET_MASK;
 QCow2ClusterType cluster_type =
 qcow2_get_cluster_type(l2_entry);
@@ -1933,7 +1937,7 @@ static int expand_zero_clusters_in_l1(BlockDriverState 
*bs, uint64_t *l1_table,
 if (!bs->backing) {
 /* not backed; therefore we can simply deallocate the
  * cluster */
-l2_table[j] = 0;
+l2_slice[j] = 0;
 l2_dirty = true;
 continue;
 }
@@ -1960,12 +1964,13 @@ static int expand_zero_clusters_in_l1(BlockDriverState 
*bs, uint64_t *l1_table,
 }
 
 if (offset_into_cluster(s, offset)) {
+int l2_index = slice * s->l2_slice_size + j;
 qcow2_signal_corruption(
 bs, true, -1, -1,
 "Cluster allocation offset "
 "%#" PRIx64 " unaligned (L2 offset: %#"
 PRIx64 ", L2 index: %#x)", offset,
-l2_offset, j);
+l2_offset, l2_index);
 if (cluster_type ==

[Qemu-block] [PULL 47/55] qcow2: Update qcow2_truncate() to support L2 slices

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

The qcow2_truncate() code is mostly independent from whether
we're using L2 slices or full L2 tables, but in full and
falloc preallocation modes new L2 tables are allocated using
qcow2_alloc_cluster_link_l2().  Therefore the code needs to be
modified to ensure that all nb_clusters that are processed in each
call can be allocated with just one L2 slice.

Signed-off-by: Alberto Garcia 
Message-id: 
1fd7d272b5e7b66254a090b74cf2bed1cc334c0e.1517840877.git.be...@igalia.com
Reviewed-by: Max Reitz 
Reviewed-by: Eric Blake 
Signed-off-by: Max Reitz 
---
 block/qcow2.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/block/qcow2.c b/block/qcow2.c
index ba8d71c72d..953254a004 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -3262,9 +3262,9 @@ static int qcow2_truncate(BlockDriverState *bs, int64_t 
offset,
 host_offset = allocation_start;
 guest_offset = old_length;
 while (nb_new_data_clusters) {
-int64_t guest_cluster = guest_offset >> s->cluster_bits;
-int64_t nb_clusters = MIN(nb_new_data_clusters,
-  s->l2_size - guest_cluster % s->l2_size);
+int64_t nb_clusters = MIN(
+nb_new_data_clusters,
+s->l2_slice_size - offset_to_l2_slice_index(s, guest_offset));
 QCowL2Meta allocation = {
 .offset   = guest_offset,
 .alloc_offset = host_offset,
-- 
2.13.6

[Qemu-block] [PULL 43/55] qcow2: Update qcow2_update_snapshot_refcount() to support L2 slices

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

qcow2_update_snapshot_refcount() increases the refcount of all
clusters of a given snapshot. In order to do that it needs to load all
its L2 tables and iterate over their entries. Since we'll be loading
L2 slices instead of full tables we need to add an extra loop that
iterates over all slices of each L2 table.

This function doesn't need any additional changes so apart from that
this patch simply updates the variable name from l2_table to l2_slice.

Signed-off-by: Alberto Garcia 
Message-id: 
5f4db199b9637f4833b58487135124d70add8cf0.1517840877.git.be...@igalia.com
Reviewed-by: Max Reitz 
Reviewed-by: Eric Blake 
Signed-off-by: Max Reitz 
---
 block/qcow2-refcount.c | 32 ++--
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index dfa28301c4..d46b69d7f3 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1183,17 +1183,20 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
 int64_t l1_table_offset, int l1_size, int addend)
 {
 BDRVQcow2State *s = bs->opaque;
-uint64_t *l1_table, *l2_table, l2_offset, entry, l1_size2, refcount;
+uint64_t *l1_table, *l2_slice, l2_offset, entry, l1_size2, refcount;
 bool l1_allocated = false;
 int64_t old_entry, old_l2_offset;
+unsigned slice, slice_size2, n_slices;
 int i, j, l1_modified = 0, nb_csectors;
 int ret;
 
 assert(addend >= -1 && addend <= 1);
 
-l2_table = NULL;
+l2_slice = NULL;
 l1_table = NULL;
 l1_size2 = l1_size * sizeof(uint64_t);
+slice_size2 = s->l2_slice_size * sizeof(uint64_t);
+n_slices = s->cluster_size / slice_size2;
 
 s->cache_discards = true;
 
@@ -1236,19 +1239,19 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
 goto fail;
 }
 
-{
+for (slice = 0; slice < n_slices; slice++) {
 ret = qcow2_cache_get(bs, s->l2_table_cache,
-  l2_offset,
-  (void **) _table);
+  l2_offset + slice * slice_size2,
+  (void **) _slice);
 if (ret < 0) {
 goto fail;
 }
 
-for (j = 0; j < s->l2_size; j++) {
+for (j = 0; j < s->l2_slice_size; j++) {
 uint64_t cluster_index;
 uint64_t offset;
 
-entry = be64_to_cpu(l2_table[j]);
+entry = be64_to_cpu(l2_slice[j]);
 old_entry = entry;
 entry &= ~QCOW_OFLAG_COPIED;
 offset = entry & L2E_OFFSET_MASK;
@@ -1273,12 +1276,14 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
 case QCOW2_CLUSTER_NORMAL:
 case QCOW2_CLUSTER_ZERO_ALLOC:
 if (offset_into_cluster(s, offset)) {
+/* Here l2_index means table (not slice) index */
+int l2_index = slice * s->l2_slice_size + j;
 qcow2_signal_corruption(
 bs, true, -1, -1, "Cluster "
 "allocation offset %#" PRIx64
 " unaligned (L2 offset: %#"
 PRIx64 ", L2 index: %#x)",
-offset, l2_offset, j);
+offset, l2_offset, l2_index);
 ret = -EIO;
 goto fail;
 }
@@ -1317,14 +1322,13 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
 qcow2_cache_set_dependency(bs, s->l2_table_cache,

s->refcount_block_cache);
 }
-l2_table[j] = cpu_to_be64(entry);
+l2_slice[j] = cpu_to_be64(entry);
 qcow2_cache_entry_mark_dirty(s->l2_table_cache,
- l2_table);
+ l2_slice);
 }
 }
 
-qcow2_cache_put(s->l2_table_cache, (void **) _table);
-
+qcow2_cache_put(s->l2_table_cache, (void **) _slice);
 }
 
 if (addend != 0) {
@@ -1352,8 +1356,8 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
 
 ret = bdrv_flush(bs);
 fail:
-if (l2_table) {
-qcow2_cache_put(s->l2_table_cache, (void **) _table);
+if (l2_slice) {
+qcow2_cache_put(s->l2_table_cache, (void **) _slice);
 }
 
 s->cache_discards = false;
-- 
2.13.6

[Qemu-block] [PULL 36/55] qcow2: Update qcow2_get_cluster_offset() to support L2 slices

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

qcow2_get_cluster_offset() checks how many contiguous bytes are
available at a given offset. The returned number of bytes is limited
by the amount that can be addressed without having to load more than
one L2 table.

Since we'll be loading L2 slices instead of full tables this patch
changes the limit accordingly using the size of the L2 slice for the
calculations instead of the full table size.

One consequence of this is that with small L2 slices operations such
as 'qemu-img map' will need to iterate in more steps because each
qcow2_get_cluster_offset() call will potentially return a smaller
number. However the code is already prepared for that so this doesn't
break semantics.

The l2_table variable is also renamed to l2_slice to reflect this, and
offset_to_l2_index() is replaced with offset_to_l2_slice_index().

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
6b602260acb33da56ed6af9611731cb7acd110eb.1517840877.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 block/qcow2-cluster.c | 30 +++---
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index cd1ba40f74..e0eea73297 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -529,8 +529,8 @@ int qcow2_get_cluster_offset(BlockDriverState *bs, uint64_t 
offset,
 {
 BDRVQcow2State *s = bs->opaque;
 unsigned int l2_index;
-uint64_t l1_index, l2_offset, *l2_table;
-int l1_bits, c;
+uint64_t l1_index, l2_offset, *l2_slice;
+int c;
 unsigned int offset_in_cluster;
 uint64_t bytes_available, bytes_needed, nb_clusters;
 QCow2ClusterType type;
@@ -539,12 +539,12 @@ int qcow2_get_cluster_offset(BlockDriverState *bs, 
uint64_t offset,
 offset_in_cluster = offset_into_cluster(s, offset);
 bytes_needed = (uint64_t) *bytes + offset_in_cluster;
 
-l1_bits = s->l2_bits + s->cluster_bits;
-
 /* compute how many bytes there are between the start of the cluster
- * containing offset and the end of the l1 entry */
-bytes_available = (1ULL << l1_bits) - (offset & ((1ULL << l1_bits) - 1))
-+ offset_in_cluster;
+ * containing offset and the end of the l2 slice that contains
+ * the entry pointing to it */
+bytes_available =
+((uint64_t) (s->l2_slice_size - offset_to_l2_slice_index(s, offset)))
+<< s->cluster_bits;
 
 if (bytes_needed > bytes_available) {
 bytes_needed = bytes_available;
@@ -573,17 +573,17 @@ int qcow2_get_cluster_offset(BlockDriverState *bs, 
uint64_t offset,
 return -EIO;
 }
 
-/* load the l2 table in memory */
+/* load the l2 slice in memory */
 
-ret = l2_load(bs, offset, l2_offset, _table);
+ret = l2_load(bs, offset, l2_offset, _slice);
 if (ret < 0) {
 return ret;
 }
 
 /* find the cluster offset for the given disk offset */
 
-l2_index = offset_to_l2_index(s, offset);
-*cluster_offset = be64_to_cpu(l2_table[l2_index]);
+l2_index = offset_to_l2_slice_index(s, offset);
+*cluster_offset = be64_to_cpu(l2_slice[l2_index]);
 
 nb_clusters = size_to_clusters(s, bytes_needed);
 /* bytes_needed <= *bytes + offset_in_cluster, both of which are unsigned
@@ -610,14 +610,14 @@ int qcow2_get_cluster_offset(BlockDriverState *bs, 
uint64_t offset,
 case QCOW2_CLUSTER_UNALLOCATED:
 /* how many empty clusters ? */
 c = count_contiguous_clusters_unallocated(nb_clusters,
-  _table[l2_index], type);
+  _slice[l2_index], type);
 *cluster_offset = 0;
 break;
 case QCOW2_CLUSTER_ZERO_ALLOC:
 case QCOW2_CLUSTER_NORMAL:
 /* how many allocated clusters ? */
 c = count_contiguous_clusters(nb_clusters, s->cluster_size,
-  _table[l2_index], QCOW_OFLAG_ZERO);
+  _slice[l2_index], QCOW_OFLAG_ZERO);
 *cluster_offset &= L2E_OFFSET_MASK;
 if (offset_into_cluster(s, *cluster_offset)) {
 qcow2_signal_corruption(bs, true, -1, -1,
@@ -633,7 +633,7 @@ int qcow2_get_cluster_offset(BlockDriverState *bs, uint64_t 
offset,
 abort();
 }
 
-qcow2_cache_put(s->l2_table_cache, (void **) _table);
+qcow2_cache_put(s->l2_table_cache, (void **) _slice);
 
 bytes_available = (int64_t)c * s->cluster_size;
 
@@ -651,7 +651,7 @@ out:
 return type;
 
 fail:
-qcow2_cache_put(s->l2_table_cache, (void **)_table);
+qcow2_cache_put(s->l2_table_cache, (void **)_slice);
 return ret;
 }
 
-- 
2.13.6

[Qemu-block] [PULL 33/55] qcow2: Update l2_allocate() to support L2 slices

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

This patch updates l2_allocate() to support the qcow2 cache returning
L2 slices instead of full L2 tables.

The old code simply gets an L2 table from the cache and initializes it
with zeroes or with the contents of an existing table. With a cache
that returns slices instead of tables the idea remains the same, but
the code must now iterate over all the slices that are contained in an
L2 table.

Since now we're operating with slices the function can no longer
return the newly-allocated table, so it's up to the caller to retrieve
the appropriate L2 slice after calling l2_allocate() (note that with
this patch the caller is still loading full L2 tables, but we'll deal
with that in a separate patch).

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Message-id: 
20fc0415bf0e011e29f6487ec86eb06a11f37445.1517840877.git.be...@igalia.com
Reviewed-by: Max Reitz 
Signed-off-by: Max Reitz 
---
 block/qcow2-cluster.c | 56 +++
 1 file changed, 34 insertions(+), 22 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 3f7272441d..4aa9ea7b29 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -263,11 +263,12 @@ int qcow2_write_l1_entry(BlockDriverState *bs, int 
l1_index)
  *
  */
 
-static int l2_allocate(BlockDriverState *bs, int l1_index, uint64_t **table)
+static int l2_allocate(BlockDriverState *bs, int l1_index)
 {
 BDRVQcow2State *s = bs->opaque;
 uint64_t old_l2_offset;
-uint64_t *l2_table = NULL;
+uint64_t *l2_slice = NULL;
+unsigned slice, slice_size2, n_slices;
 int64_t l2_offset;
 int ret;
 
@@ -298,42 +299,45 @@ static int l2_allocate(BlockDriverState *bs, int 
l1_index, uint64_t **table)
 
 /* allocate a new entry in the l2 cache */
 
+slice_size2 = s->l2_slice_size * sizeof(uint64_t);
+n_slices = s->cluster_size / slice_size2;
+
 trace_qcow2_l2_allocate_get_empty(bs, l1_index);
-{
+for (slice = 0; slice < n_slices; slice++) {
 ret = qcow2_cache_get_empty(bs, s->l2_table_cache,
-l2_offset,
-(void **) table);
+l2_offset + slice * slice_size2,
+(void **) _slice);
 if (ret < 0) {
 goto fail;
 }
 
-l2_table = *table;
-
 if ((old_l2_offset & L1E_OFFSET_MASK) == 0) {
-/* if there was no old l2 table, clear the new table */
-memset(l2_table, 0, s->l2_size * sizeof(uint64_t));
+/* if there was no old l2 table, clear the new slice */
+memset(l2_slice, 0, slice_size2);
 } else {
-uint64_t *old_table;
+uint64_t *old_slice;
+uint64_t old_l2_slice_offset =
+(old_l2_offset & L1E_OFFSET_MASK) + slice * slice_size2;
 
-/* if there was an old l2 table, read it from the disk */
+/* if there was an old l2 table, read a slice from the disk */
 BLKDBG_EVENT(bs->file, BLKDBG_L2_ALLOC_COW_READ);
-ret = qcow2_cache_get(bs, s->l2_table_cache,
-  old_l2_offset & L1E_OFFSET_MASK,
-  (void **) _table);
+ret = qcow2_cache_get(bs, s->l2_table_cache, old_l2_slice_offset,
+  (void **) _slice);
 if (ret < 0) {
 goto fail;
 }
 
-memcpy(l2_table, old_table, s->cluster_size);
+memcpy(l2_slice, old_slice, slice_size2);
 
-qcow2_cache_put(s->l2_table_cache, (void **) _table);
+qcow2_cache_put(s->l2_table_cache, (void **) _slice);
 }
 
-/* write the l2 table to the file */
+/* write the l2 slice to the file */
 BLKDBG_EVENT(bs->file, BLKDBG_L2_ALLOC_WRITE);
 
 trace_qcow2_l2_allocate_write_l2(bs, l1_index);
-qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table);
+qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
+qcow2_cache_put(s->l2_table_cache, (void **) _slice);
 }
 
 ret = qcow2_cache_flush(bs, s->l2_table_cache);
@@ -349,14 +353,13 @@ static int l2_allocate(BlockDriverState *bs, int 
l1_index, uint64_t **table)
 goto fail;
 }
 
-*table = l2_table;
 trace_qcow2_l2_allocate_done(bs, l1_index, 0);
 return 0;
 
 fail:
 trace_qcow2_l2_allocate_done(bs, l1_index, ret);
-if (l2_table != NULL) {
-qcow2_cache_put(s->l2_table_cache, (void **) table);
+if (l2_slice != NULL) {
+qcow2_cache_put(s->l2_table_cache, (void **) _slice);
 }
 s->l1_table[l1_index] = old_l2_offset;
 if (l2_offset > 0) {
@@ -701,7 +704,7 @@ static int get_cluster_table(BlockDriverState *bs, uint64_t 
offset,
 }
 } else {

[Qemu-block] [PULL 38/55] qcow2: Update handle_copied() to support L2 slices

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

handle_copied() loads an L2 table and limits the number of checked
clusters to the amount that fits inside that table. Since we'll be
loading L2 slices instead of full tables we need to update that limit.

Apart from that, this function doesn't need any additional changes, so
this patch simply updates the variable name from l2_table to l2_slice.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
541ac001a7d6b86bab2392554bee53c2b312148c.1517840877.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 block/qcow2-cluster.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index a01abb3b18..7699544358 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1119,7 +1119,7 @@ static int handle_copied(BlockDriverState *bs, uint64_t 
guest_offset,
 BDRVQcow2State *s = bs->opaque;
 int l2_index;
 uint64_t cluster_offset;
-uint64_t *l2_table;
+uint64_t *l2_slice;
 uint64_t nb_clusters;
 unsigned int keep_clusters;
 int ret;
@@ -1131,23 +1131,23 @@ static int handle_copied(BlockDriverState *bs, uint64_t 
guest_offset,
 == offset_into_cluster(s, *host_offset));
 
 /*
- * Calculate the number of clusters to look for. We stop at L2 table
+ * Calculate the number of clusters to look for. We stop at L2 slice
  * boundaries to keep things simple.
  */
 nb_clusters =
 size_to_clusters(s, offset_into_cluster(s, guest_offset) + *bytes);
 
-l2_index = offset_to_l2_index(s, guest_offset);
-nb_clusters = MIN(nb_clusters, s->l2_size - l2_index);
+l2_index = offset_to_l2_slice_index(s, guest_offset);
+nb_clusters = MIN(nb_clusters, s->l2_slice_size - l2_index);
 assert(nb_clusters <= INT_MAX);
 
 /* Find L2 entry for the first involved cluster */
-ret = get_cluster_table(bs, guest_offset, _table, _index);
+ret = get_cluster_table(bs, guest_offset, _slice, _index);
 if (ret < 0) {
 return ret;
 }
 
-cluster_offset = be64_to_cpu(l2_table[l2_index]);
+cluster_offset = be64_to_cpu(l2_slice[l2_index]);
 
 /* Check how many clusters are already allocated and don't need COW */
 if (qcow2_get_cluster_type(cluster_offset) == QCOW2_CLUSTER_NORMAL
@@ -1175,7 +1175,7 @@ static int handle_copied(BlockDriverState *bs, uint64_t 
guest_offset,
 /* We keep all QCOW_OFLAG_COPIED clusters */
 keep_clusters =
 count_contiguous_clusters(nb_clusters, s->cluster_size,
-  _table[l2_index],
+  _slice[l2_index],
   QCOW_OFLAG_COPIED | QCOW_OFLAG_ZERO);
 assert(keep_clusters <= nb_clusters);
 
@@ -1190,7 +1190,7 @@ static int handle_copied(BlockDriverState *bs, uint64_t 
guest_offset,
 
 /* Cleanup */
 out:
-qcow2_cache_put(s->l2_table_cache, (void **) _table);
+qcow2_cache_put(s->l2_table_cache, (void **) _slice);
 
 /* Only return a host offset if we actually made progress. Otherwise we
  * would make requirements for handle_alloc() that it can't fulfill */
-- 
2.13.6

[Qemu-block] [PULL 35/55] qcow2: Update get_cluster_table() to support L2 slices

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

This patch updates get_cluster_table() to return L2 slices instead of
full L2 tables.

The code itself needs almost no changes, it only needs to call
offset_to_l2_slice_index() instead of offset_to_l2_index(). This patch
also renames all the relevant variables and the documentation.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
64cf064c0021ba315d3f3032da0f95db1b615f33.1517840877.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 block/qcow2-cluster.c | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 044eccda72..cd1ba40f74 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -659,20 +659,20 @@ fail:
  * get_cluster_table
  *
  * for a given disk offset, load (and allocate if needed)
- * the l2 table.
+ * the appropriate slice of its l2 table.
  *
- * the cluster index in the l2 table is given to the caller.
+ * the cluster index in the l2 slice is given to the caller.
  *
  * Returns 0 on success, -errno in failure case
  */
 static int get_cluster_table(BlockDriverState *bs, uint64_t offset,
- uint64_t **new_l2_table,
+ uint64_t **new_l2_slice,
  int *new_l2_index)
 {
 BDRVQcow2State *s = bs->opaque;
 unsigned int l2_index;
 uint64_t l1_index, l2_offset;
-uint64_t *l2_table = NULL;
+uint64_t *l2_slice = NULL;
 int ret;
 
 /* seek to the l2 offset in the l1 table */
@@ -712,17 +712,17 @@ static int get_cluster_table(BlockDriverState *bs, 
uint64_t offset,
 assert(offset_into_cluster(s, l2_offset) == 0);
 }
 
-/* load the l2 table in memory */
-ret = l2_load(bs, offset, l2_offset, _table);
+/* load the l2 slice in memory */
+ret = l2_load(bs, offset, l2_offset, _slice);
 if (ret < 0) {
 return ret;
 }
 
 /* find the cluster offset for the given disk offset */
 
-l2_index = offset_to_l2_index(s, offset);
+l2_index = offset_to_l2_slice_index(s, offset);
 
-*new_l2_table = l2_table;
+*new_l2_slice = l2_slice;
 *new_l2_index = l2_index;
 
 return 0;
-- 
2.13.6

[Qemu-block] [PULL 44/55] qcow2: Read refcount before L2 table in expand_zero_clusters_in_l1()

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

At the moment it doesn't really make a difference whether we call
qcow2_get_refcount() before of after reading the L2 table, but if we
want to support L2 slices we'll need to read the refcount first.

This patch simply changes the order of those two operations to prepare
for that. The patch with the actual semantic changes will be easier to
read because of this.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
947a91d934053a2dbfef979aeb9568f57ef57c5d.1517840877.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 block/qcow2-cluster.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index f440105f6c..1041dcb6dd 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1898,6 +1898,12 @@ static int expand_zero_clusters_in_l1(BlockDriverState 
*bs, uint64_t *l1_table,
 goto fail;
 }
 
+ret = qcow2_get_refcount(bs, l2_offset >> s->cluster_bits,
+ _refcount);
+if (ret < 0) {
+goto fail;
+}
+
 if (is_active_l1) {
 /* get active L2 tables from cache */
 ret = qcow2_cache_get(bs, s->l2_table_cache, l2_offset,
@@ -1911,12 +1917,6 @@ static int expand_zero_clusters_in_l1(BlockDriverState 
*bs, uint64_t *l1_table,
 goto fail;
 }
 
-ret = qcow2_get_refcount(bs, l2_offset >> s->cluster_bits,
- _refcount);
-if (ret < 0) {
-goto fail;
-}
-
 for (j = 0; j < s->l2_size; j++) {
 uint64_t l2_entry = be64_to_cpu(l2_table[j]);
 int64_t offset = l2_entry & L2E_OFFSET_MASK;
-- 
2.13.6

[Qemu-block] [PULL 39/55] qcow2: Update handle_alloc() to support L2 slices

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

handle_alloc() loads an L2 table and limits the number of checked
clusters to the amount that fits inside that table. Since we'll be
loading L2 slices instead of full tables we need to update that limit.

Apart from that, this function doesn't need any additional changes, so
this patch simply updates the variable name from l2_table to l2_slice.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
b243299c7136f7014c5af51665431ddbf5e99afd.1517840877.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 block/qcow2-cluster.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 7699544358..dbd57cd35f 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1274,7 +1274,7 @@ static int handle_alloc(BlockDriverState *bs, uint64_t 
guest_offset,
 {
 BDRVQcow2State *s = bs->opaque;
 int l2_index;
-uint64_t *l2_table;
+uint64_t *l2_slice;
 uint64_t entry;
 uint64_t nb_clusters;
 int ret;
@@ -1287,29 +1287,29 @@ static int handle_alloc(BlockDriverState *bs, uint64_t 
guest_offset,
 assert(*bytes > 0);
 
 /*
- * Calculate the number of clusters to look for. We stop at L2 table
+ * Calculate the number of clusters to look for. We stop at L2 slice
  * boundaries to keep things simple.
  */
 nb_clusters =
 size_to_clusters(s, offset_into_cluster(s, guest_offset) + *bytes);
 
-l2_index = offset_to_l2_index(s, guest_offset);
-nb_clusters = MIN(nb_clusters, s->l2_size - l2_index);
+l2_index = offset_to_l2_slice_index(s, guest_offset);
+nb_clusters = MIN(nb_clusters, s->l2_slice_size - l2_index);
 assert(nb_clusters <= INT_MAX);
 
 /* Find L2 entry for the first involved cluster */
-ret = get_cluster_table(bs, guest_offset, _table, _index);
+ret = get_cluster_table(bs, guest_offset, _slice, _index);
 if (ret < 0) {
 return ret;
 }
 
-entry = be64_to_cpu(l2_table[l2_index]);
+entry = be64_to_cpu(l2_slice[l2_index]);
 
 /* For the moment, overwrite compressed clusters one by one */
 if (entry & QCOW_OFLAG_COMPRESSED) {
 nb_clusters = 1;
 } else {
-nb_clusters = count_cow_clusters(s, nb_clusters, l2_table, l2_index);
+nb_clusters = count_cow_clusters(s, nb_clusters, l2_slice, l2_index);
 }
 
 /* This function is only called when there were no non-COW clusters, so if
@@ -1338,7 +1338,7 @@ static int handle_alloc(BlockDriverState *bs, uint64_t 
guest_offset,
  * nb_clusters already to a range of COW clusters */
 preallocated_nb_clusters =
 count_contiguous_clusters(nb_clusters, s->cluster_size,
-  _table[l2_index], QCOW_OFLAG_COPIED);
+  _slice[l2_index], QCOW_OFLAG_COPIED);
 assert(preallocated_nb_clusters > 0);
 
 nb_clusters = preallocated_nb_clusters;
@@ -1349,7 +1349,7 @@ static int handle_alloc(BlockDriverState *bs, uint64_t 
guest_offset,
 keep_old_clusters = true;
 }
 
-qcow2_cache_put(s->l2_table_cache, (void **) _table);
+qcow2_cache_put(s->l2_table_cache, (void **) _slice);
 
 if (!alloc_cluster_offset) {
 /* Allocate, if necessary at a given offset in the image file */
-- 
2.13.6

[Qemu-block] [PULL 34/55] qcow2: Refactor get_cluster_table()

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

After the previous patch we're now always using l2_load() in
get_cluster_table() regardless of whether a new L2 table has to be
allocated or not.

This patch refactors that part of the code to use one single l2_load()
call.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
ce31758c4a1fadccea7a6ccb93951eb01d95fd4c.1517840877.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 block/qcow2-cluster.c | 21 +++--
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 4aa9ea7b29..044eccda72 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -694,15 +694,7 @@ static int get_cluster_table(BlockDriverState *bs, 
uint64_t offset,
 return -EIO;
 }
 
-/* seek the l2 table of the given l2 offset */
-
-if (s->l1_table[l1_index] & QCOW_OFLAG_COPIED) {
-/* load the l2 table in memory */
-ret = l2_load(bs, offset, l2_offset, _table);
-if (ret < 0) {
-return ret;
-}
-} else {
+if (!(s->l1_table[l1_index] & QCOW_OFLAG_COPIED)) {
 /* First allocate a new L2 table (and do COW if needed) */
 ret = l2_allocate(bs, l1_index);
 if (ret < 0) {
@@ -718,11 +710,12 @@ static int get_cluster_table(BlockDriverState *bs, 
uint64_t offset,
 /* Get the offset of the newly-allocated l2 table */
 l2_offset = s->l1_table[l1_index] & L1E_OFFSET_MASK;
 assert(offset_into_cluster(s, l2_offset) == 0);
-/* Load the l2 table in memory */
-ret = l2_load(bs, offset, l2_offset, _table);
-if (ret < 0) {
-return ret;
-}
+}
+
+/* load the l2 table in memory */
+ret = l2_load(bs, offset, l2_offset, _table);
+if (ret < 0) {
+return ret;
 }
 
 /* find the cluster offset for the given disk offset */
-- 
2.13.6

[Qemu-block] [PULL 32/55] qcow2: Prepare l2_allocate() for adding L2 slice support

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

Adding support for L2 slices to l2_allocate() needs (among other
things) an extra loop that iterates over all slices of a new L2 table.

Putting all changes in one patch would make it hard to read because
all semantic changes would be mixed with pure indentation changes.

To make things easier this patch simply creates a new block and
changes the indentation of all lines of code inside it. Thus, all
modifications in this patch are cosmetic. There are no semantic
changes and no variables are renamed yet. The next patch will take
care of that.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
d0d7dca8520db304524f52f49d8157595a707a35.1517840877.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 block/qcow2-cluster.c | 53 ---
 1 file changed, 29 insertions(+), 24 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 542080aed0..3f7272441d 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -299,38 +299,43 @@ static int l2_allocate(BlockDriverState *bs, int 
l1_index, uint64_t **table)
 /* allocate a new entry in the l2 cache */
 
 trace_qcow2_l2_allocate_get_empty(bs, l1_index);
-ret = qcow2_cache_get_empty(bs, s->l2_table_cache, l2_offset, (void**) 
table);
-if (ret < 0) {
-goto fail;
-}
+{
+ret = qcow2_cache_get_empty(bs, s->l2_table_cache,
+l2_offset,
+(void **) table);
+if (ret < 0) {
+goto fail;
+}
 
-l2_table = *table;
+l2_table = *table;
 
-if ((old_l2_offset & L1E_OFFSET_MASK) == 0) {
-/* if there was no old l2 table, clear the new table */
-memset(l2_table, 0, s->l2_size * sizeof(uint64_t));
-} else {
-uint64_t* old_table;
+if ((old_l2_offset & L1E_OFFSET_MASK) == 0) {
+/* if there was no old l2 table, clear the new table */
+memset(l2_table, 0, s->l2_size * sizeof(uint64_t));
+} else {
+uint64_t *old_table;
 
-/* if there was an old l2 table, read it from the disk */
-BLKDBG_EVENT(bs->file, BLKDBG_L2_ALLOC_COW_READ);
-ret = qcow2_cache_get(bs, s->l2_table_cache,
-old_l2_offset & L1E_OFFSET_MASK,
-(void**) _table);
-if (ret < 0) {
-goto fail;
+/* if there was an old l2 table, read it from the disk */
+BLKDBG_EVENT(bs->file, BLKDBG_L2_ALLOC_COW_READ);
+ret = qcow2_cache_get(bs, s->l2_table_cache,
+  old_l2_offset & L1E_OFFSET_MASK,
+  (void **) _table);
+if (ret < 0) {
+goto fail;
+}
+
+memcpy(l2_table, old_table, s->cluster_size);
+
+qcow2_cache_put(s->l2_table_cache, (void **) _table);
 }
 
-memcpy(l2_table, old_table, s->cluster_size);
+/* write the l2 table to the file */
+BLKDBG_EVENT(bs->file, BLKDBG_L2_ALLOC_WRITE);
 
-qcow2_cache_put(s->l2_table_cache, (void **) _table);
+trace_qcow2_l2_allocate_write_l2(bs, l1_index);
+qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table);
 }
 
-/* write the l2 table to the file */
-BLKDBG_EVENT(bs->file, BLKDBG_L2_ALLOC_WRITE);
-
-trace_qcow2_l2_allocate_write_l2(bs, l1_index);
-qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table);
 ret = qcow2_cache_flush(bs, s->l2_table_cache);
 if (ret < 0) {
 goto fail;
-- 
2.13.6

[Qemu-block] [PULL 40/55] qcow2: Update discard_single_l2() to support L2 slices

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

discard_single_l2() limits the number of clusters to be discarded
to the amount that fits inside an L2 table. Since we'll be loading
L2 slices instead of full tables we need to update that limit. The
function is renamed to discard_in_l2_slice() for clarity.

Apart from that, this function doesn't need any additional changes, so
this patch simply updates the variable name from l2_table to l2_slice.

Signed-off-by: Alberto Garcia 
Message-id: 
1cb44a5b68be5334cb01b97a3db3a3c5a43396e5.1517840877.git.be...@igalia.com
Reviewed-by: Eric Blake 
Signed-off-by: Max Reitz 
---
 block/qcow2-cluster.c | 32 
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index dbd57cd35f..62336960a5 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1631,32 +1631,32 @@ int qcow2_decompress_cluster(BlockDriverState *bs, 
uint64_t cluster_offset)
 
 /*
  * This discards as many clusters of nb_clusters as possible at once (i.e.
- * all clusters in the same L2 table) and returns the number of discarded
+ * all clusters in the same L2 slice) and returns the number of discarded
  * clusters.
  */
-static int discard_single_l2(BlockDriverState *bs, uint64_t offset,
- uint64_t nb_clusters, enum qcow2_discard_type 
type,
- bool full_discard)
+static int discard_in_l2_slice(BlockDriverState *bs, uint64_t offset,
+   uint64_t nb_clusters,
+   enum qcow2_discard_type type, bool full_discard)
 {
 BDRVQcow2State *s = bs->opaque;
-uint64_t *l2_table;
+uint64_t *l2_slice;
 int l2_index;
 int ret;
 int i;
 
-ret = get_cluster_table(bs, offset, _table, _index);
+ret = get_cluster_table(bs, offset, _slice, _index);
 if (ret < 0) {
 return ret;
 }
 
-/* Limit nb_clusters to one L2 table */
-nb_clusters = MIN(nb_clusters, s->l2_size - l2_index);
+/* Limit nb_clusters to one L2 slice */
+nb_clusters = MIN(nb_clusters, s->l2_slice_size - l2_index);
 assert(nb_clusters <= INT_MAX);
 
 for (i = 0; i < nb_clusters; i++) {
 uint64_t old_l2_entry;
 
-old_l2_entry = be64_to_cpu(l2_table[l2_index + i]);
+old_l2_entry = be64_to_cpu(l2_slice[l2_index + i]);
 
 /*
  * If full_discard is false, make sure that a discarded area reads back
@@ -1694,18 +1694,18 @@ static int discard_single_l2(BlockDriverState *bs, 
uint64_t offset,
 }
 
 /* First remove L2 entries */
-qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table);
+qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
 if (!full_discard && s->qcow_version >= 3) {
-l2_table[l2_index + i] = cpu_to_be64(QCOW_OFLAG_ZERO);
+l2_slice[l2_index + i] = cpu_to_be64(QCOW_OFLAG_ZERO);
 } else {
-l2_table[l2_index + i] = cpu_to_be64(0);
+l2_slice[l2_index + i] = cpu_to_be64(0);
 }
 
 /* Then decrease the refcount */
 qcow2_free_any_clusters(bs, old_l2_entry, 1, type);
 }
 
-qcow2_cache_put(s->l2_table_cache, (void **) _table);
+qcow2_cache_put(s->l2_table_cache, (void **) _slice);
 
 return nb_clusters;
 }
@@ -1729,10 +1729,10 @@ int qcow2_cluster_discard(BlockDriverState *bs, 
uint64_t offset,
 
 s->cache_discards = true;
 
-/* Each L2 table is handled by its own loop iteration */
+/* Each L2 slice is handled by its own loop iteration */
 while (nb_clusters > 0) {
-cleared = discard_single_l2(bs, offset, nb_clusters, type,
-full_discard);
+cleared = discard_in_l2_slice(bs, offset, nb_clusters, type,
+  full_discard);
 if (cleared < 0) {
 ret = cleared;
 goto fail;
-- 
2.13.6

[Qemu-block] [PULL 42/55] qcow2: Prepare qcow2_update_snapshot_refcount() for adding L2 slice support

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

Adding support for L2 slices to qcow2_update_snapshot_refcount() needs
(among other things) an extra loop that iterates over all slices of
each L2 table.

Putting all changes in one patch would make it hard to read because
all semantic changes would be mixed with pure indentation changes.

To make things easier this patch simply creates a new block and
changes the indentation of all lines of code inside it. Thus, all
modifications in this patch are cosmetic. There are no semantic
changes and no variables are renamed yet. The next patch will take
care of that.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
8ffaa5e55bd51121f80e498f4045b64902a94293.1517840877.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 block/qcow2-refcount.c | 144 +
 1 file changed, 75 insertions(+), 69 deletions(-)

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 9df380d52c..dfa28301c4 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -1236,90 +1236,96 @@ int qcow2_update_snapshot_refcount(BlockDriverState *bs,
 goto fail;
 }
 
-ret = qcow2_cache_get(bs, s->l2_table_cache, l2_offset,
-(void**) _table);
-if (ret < 0) {
-goto fail;
-}
+{
+ret = qcow2_cache_get(bs, s->l2_table_cache,
+  l2_offset,
+  (void **) _table);
+if (ret < 0) {
+goto fail;
+}
 
-for (j = 0; j < s->l2_size; j++) {
-uint64_t cluster_index;
-uint64_t offset;
-
-entry = be64_to_cpu(l2_table[j]);
-old_entry = entry;
-entry &= ~QCOW_OFLAG_COPIED;
-offset = entry & L2E_OFFSET_MASK;
-
-switch (qcow2_get_cluster_type(entry)) {
-case QCOW2_CLUSTER_COMPRESSED:
-nb_csectors = ((entry >> s->csize_shift) &
-   s->csize_mask) + 1;
-if (addend != 0) {
-ret = update_refcount(bs,
-(entry & s->cluster_offset_mask) & ~511,
+for (j = 0; j < s->l2_size; j++) {
+uint64_t cluster_index;
+uint64_t offset;
+
+entry = be64_to_cpu(l2_table[j]);
+old_entry = entry;
+entry &= ~QCOW_OFLAG_COPIED;
+offset = entry & L2E_OFFSET_MASK;
+
+switch (qcow2_get_cluster_type(entry)) {
+case QCOW2_CLUSTER_COMPRESSED:
+nb_csectors = ((entry >> s->csize_shift) &
+   s->csize_mask) + 1;
+if (addend != 0) {
+ret = update_refcount(
+bs, (entry & s->cluster_offset_mask) & ~511,
 nb_csectors * 512, abs(addend), addend < 0,
 QCOW2_DISCARD_SNAPSHOT);
-if (ret < 0) {
+if (ret < 0) {
+goto fail;
+}
+}
+/* compressed clusters are never modified */
+refcount = 2;
+break;
+
+case QCOW2_CLUSTER_NORMAL:
+case QCOW2_CLUSTER_ZERO_ALLOC:
+if (offset_into_cluster(s, offset)) {
+qcow2_signal_corruption(
+bs, true, -1, -1, "Cluster "
+"allocation offset %#" PRIx64
+" unaligned (L2 offset: %#"
+PRIx64 ", L2 index: %#x)",
+offset, l2_offset, j);
+ret = -EIO;
 goto fail;
 }
-}
-/* compressed clusters are never modified */
-refcount = 2;
-break;
-
-case QCOW2_CLUSTER_NORMAL:
-case QCOW2_CLUSTER_ZERO_ALLOC:
-if (offset_into_cluster(s, offset)) {
-qcow2_signal_corruption(bs, true, -1, -1, "Cluster "
-"allocation offset %#" PRIx64
-" unaligned (L2 offset: %#"
-PRIx64 ", L2 index: %#x)",
-offset, l2_offset, j);
-

[Qemu-block] [PULL 41/55] qcow2: Update zero_single_l2() to support L2 slices

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

zero_single_l2() limits the number of clusters to be zeroed to the
amount that fits inside an L2 table. Since we'll be loading L2 slices
instead of full tables we need to update that limit. The function is
renamed to zero_in_l2_slice() for clarity.

Apart from that, this function doesn't need any additional changes, so
this patch simply updates the variable name from l2_table to l2_slice.

Signed-off-by: Alberto Garcia 
Message-id: 
ebc16e7e79fa6969d8975ef487d679794de4fbcc.1517840877.git.be...@igalia.com
Reviewed-by: Eric Blake 
Signed-off-by: Max Reitz 
---
 block/qcow2-cluster.c | 28 ++--
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 62336960a5..f440105f6c 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1752,33 +1752,33 @@ fail:
 
 /*
  * This zeroes as many clusters of nb_clusters as possible at once (i.e.
- * all clusters in the same L2 table) and returns the number of zeroed
+ * all clusters in the same L2 slice) and returns the number of zeroed
  * clusters.
  */
-static int zero_single_l2(BlockDriverState *bs, uint64_t offset,
-  uint64_t nb_clusters, int flags)
+static int zero_in_l2_slice(BlockDriverState *bs, uint64_t offset,
+uint64_t nb_clusters, int flags)
 {
 BDRVQcow2State *s = bs->opaque;
-uint64_t *l2_table;
+uint64_t *l2_slice;
 int l2_index;
 int ret;
 int i;
 bool unmap = !!(flags & BDRV_REQ_MAY_UNMAP);
 
-ret = get_cluster_table(bs, offset, _table, _index);
+ret = get_cluster_table(bs, offset, _slice, _index);
 if (ret < 0) {
 return ret;
 }
 
-/* Limit nb_clusters to one L2 table */
-nb_clusters = MIN(nb_clusters, s->l2_size - l2_index);
+/* Limit nb_clusters to one L2 slice */
+nb_clusters = MIN(nb_clusters, s->l2_slice_size - l2_index);
 assert(nb_clusters <= INT_MAX);
 
 for (i = 0; i < nb_clusters; i++) {
 uint64_t old_offset;
 QCow2ClusterType cluster_type;
 
-old_offset = be64_to_cpu(l2_table[l2_index + i]);
+old_offset = be64_to_cpu(l2_slice[l2_index + i]);
 
 /*
  * Minimize L2 changes if the cluster already reads back as
@@ -1790,16 +1790,16 @@ static int zero_single_l2(BlockDriverState *bs, 
uint64_t offset,
 continue;
 }
 
-qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table);
+qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_slice);
 if (cluster_type == QCOW2_CLUSTER_COMPRESSED || unmap) {
-l2_table[l2_index + i] = cpu_to_be64(QCOW_OFLAG_ZERO);
+l2_slice[l2_index + i] = cpu_to_be64(QCOW_OFLAG_ZERO);
 qcow2_free_any_clusters(bs, old_offset, 1, QCOW2_DISCARD_REQUEST);
 } else {
-l2_table[l2_index + i] |= cpu_to_be64(QCOW_OFLAG_ZERO);
+l2_slice[l2_index + i] |= cpu_to_be64(QCOW_OFLAG_ZERO);
 }
 }
 
-qcow2_cache_put(s->l2_table_cache, (void **) _table);
+qcow2_cache_put(s->l2_table_cache, (void **) _slice);
 
 return nb_clusters;
 }
@@ -1823,13 +1823,13 @@ int qcow2_cluster_zeroize(BlockDriverState *bs, 
uint64_t offset,
 return -ENOTSUP;
 }
 
-/* Each L2 table is handled by its own loop iteration */
+/* Each L2 slice is handled by its own loop iteration */
 nb_clusters = size_to_clusters(s, bytes);
 
 s->cache_discards = true;
 
 while (nb_clusters > 0) {
-cleared = zero_single_l2(bs, offset, nb_clusters, flags);
+cleared = zero_in_l2_slice(bs, offset, nb_clusters, flags);
 if (cleared < 0) {
 ret = cleared;
 goto fail;
-- 
2.13.6

[Qemu-block] [PULL 27/55] qcow2: Remove BDS parameter from qcow2_cache_is_table_offset()

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

This function was only using the BlockDriverState parameter to pass it
to qcow2_cache_get_table_addr(). This is no longer necessary so this
parameter can be removed.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
eb0ed90affcf302e5a954bafb5931b5215483d3a.1517840877.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 block/qcow2.h  | 3 +--
 block/qcow2-cache.c| 3 +--
 block/qcow2-refcount.c | 6 +++---
 3 files changed, 5 insertions(+), 7 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 7296784353..edc5d8d57d 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -654,8 +654,7 @@ int qcow2_cache_get(BlockDriverState *bs, Qcow2Cache *c, 
uint64_t offset,
 int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset,
 void **table);
 void qcow2_cache_put(Qcow2Cache *c, void **table);
-void *qcow2_cache_is_table_offset(BlockDriverState *bs, Qcow2Cache *c,
-  uint64_t offset);
+void *qcow2_cache_is_table_offset(Qcow2Cache *c, uint64_t offset);
 void qcow2_cache_discard(Qcow2Cache *c, void *table);
 
 /* qcow2-bitmap.c functions */
diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index 8d0b65e671..b1aa42477e 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -428,8 +428,7 @@ void qcow2_cache_entry_mark_dirty(Qcow2Cache *c, void 
*table)
 c->entries[i].dirty = true;
 }
 
-void *qcow2_cache_is_table_offset(BlockDriverState *bs, Qcow2Cache *c,
-  uint64_t offset)
+void *qcow2_cache_is_table_offset(Qcow2Cache *c, uint64_t offset)
 {
 int i;
 
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 361b39d5cc..9df380d52c 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -871,14 +871,14 @@ static int QEMU_WARN_UNUSED_RESULT 
update_refcount(BlockDriverState *bs,
 if (refcount == 0) {
 void *table;
 
-table = qcow2_cache_is_table_offset(bs, s->refcount_block_cache,
+table = qcow2_cache_is_table_offset(s->refcount_block_cache,
 offset);
 if (table != NULL) {
 qcow2_cache_put(s->refcount_block_cache, _block);
 qcow2_cache_discard(s->refcount_block_cache, table);
 }
 
-table = qcow2_cache_is_table_offset(bs, s->l2_table_cache, offset);
+table = qcow2_cache_is_table_offset(s->l2_table_cache, offset);
 if (table != NULL) {
 qcow2_cache_discard(s->l2_table_cache, table);
 }
@@ -3186,7 +3186,7 @@ static int qcow2_discard_refcount_block(BlockDriverState 
*bs,
 s->free_cluster_index = cluster_index;
 }
 
-refblock = qcow2_cache_is_table_offset(bs, s->refcount_block_cache,
+refblock = qcow2_cache_is_table_offset(s->refcount_block_cache,
discard_block_offs);
 if (refblock) {
 /* discard refblock from the cache if refblock is cached */
-- 
2.13.6

[Qemu-block] [PULL 28/55] qcow2: Add offset_to_l1_index()

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

Similar to offset_to_l2_index(), this function returns the index in
the L1 table for a given guest offset. This is only used in a couple
of places and it's not a particularly complex calculation, but it
makes the code a bit more readable.

Although in the qcow2_get_cluster_offset() case the old code was
taking advantage of the l1_bits variable, we're going to get rid of
the other uses of l1_bits in a later patch anyway, so it doesn't make
sense to keep it just for this.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
a5f626fed526b7459a0425fad06d823d18df8522.1517840877.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 block/qcow2.h | 5 +
 block/qcow2-cluster.c | 4 ++--
 2 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index edc5d8d57d..d9ba57c030 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -463,6 +463,11 @@ static inline int64_t size_to_l1(BDRVQcow2State *s, 
int64_t size)
 return (size + (1ULL << shift) - 1) >> shift;
 }
 
+static inline int offset_to_l1_index(BDRVQcow2State *s, uint64_t offset)
+{
+return offset >> (s->l2_bits + s->cluster_bits);
+}
+
 static inline int offset_to_l2_index(BDRVQcow2State *s, int64_t offset)
 {
 return (offset >> s->cluster_bits) & (s->l2_size - 1);
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index ab5715c97e..ce7591dc3d 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -540,7 +540,7 @@ int qcow2_get_cluster_offset(BlockDriverState *bs, uint64_t 
offset,
 
 /* seek to the l2 offset in the l1 table */
 
-l1_index = offset >> l1_bits;
+l1_index = offset_to_l1_index(s, offset);
 if (l1_index >= s->l1_size) {
 type = QCOW2_CLUSTER_UNALLOCATED;
 goto out;
@@ -663,7 +663,7 @@ static int get_cluster_table(BlockDriverState *bs, uint64_t 
offset,
 
 /* seek to the l2 offset in the l1 table */
 
-l1_index = offset >> (s->l2_bits + s->cluster_bits);
+l1_index = offset_to_l1_index(s, offset);
 if (l1_index >= s->l1_size) {
 ret = qcow2_grow_l1_table(bs, l1_index + 1, false);
 if (ret < 0) {
-- 
2.13.6

[Qemu-block] [PULL 29/55] qcow2: Add l2_slice_size field to BDRVQcow2State

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

The BDRVQcow2State structure contains an l2_size field, which stores
the number of 64-bit entries in an L2 table.

For efficiency reasons we want to be able to load slices instead of
full L2 tables, so we need to know how many entries an L2 slice can
hold.

An L2 slice is the portion of an L2 table that is loaded by the qcow2
cache. At the moment that cache can only load complete tables,
therefore an L2 slice has the same size as an L2 table (one cluster)
and l2_size == l2_slice_size.

Later we'll allow smaller slices, but until then we have to use this
new l2_slice_size field to make the rest of the code ready for that.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
adb048595f9fb5dfb110c802a8b3c3be3b937f37.1517840877.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 block/qcow2.h | 1 +
 block/qcow2.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/block/qcow2.h b/block/qcow2.h
index d9ba57c030..45fb7f4299 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -251,6 +251,7 @@ typedef struct BDRVQcow2State {
 int cluster_bits;
 int cluster_size;
 int cluster_sectors;
+int l2_slice_size;
 int l2_bits;
 int l2_size;
 int l1_size;
diff --git a/block/qcow2.c b/block/qcow2.c
index 461cc7ab1c..ba8d71c72d 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -807,6 +807,7 @@ static void read_cache_sizes(BlockDriverState *bs, QemuOpts 
*opts,
 typedef struct Qcow2ReopenState {
 Qcow2Cache *l2_table_cache;
 Qcow2Cache *refcount_block_cache;
+int l2_slice_size; /* Number of entries in a slice of the L2 table */
 bool use_lazy_refcounts;
 int overlap_check;
 bool discard_passthrough[QCOW2_DISCARD_MAX];
@@ -888,6 +889,7 @@ static int qcow2_update_options_prepare(BlockDriverState 
*bs,
 }
 }
 
+r->l2_slice_size = s->cluster_size / sizeof(uint64_t);
 r->l2_table_cache = qcow2_cache_create(bs, l2_cache_size);
 r->refcount_block_cache = qcow2_cache_create(bs, refcount_cache_size);
 if (r->l2_table_cache == NULL || r->refcount_block_cache == NULL) {
@@ -1051,6 +1053,7 @@ static void qcow2_update_options_commit(BlockDriverState 
*bs,
 }
 s->l2_table_cache = r->l2_table_cache;
 s->refcount_block_cache = r->refcount_block_cache;
+s->l2_slice_size = r->l2_slice_size;
 
 s->overlap_check = r->overlap_check;
 s->use_lazy_refcounts = r->use_lazy_refcounts;
-- 
2.13.6

[Qemu-block] [PULL 23/55] qcow2: Remove BDS parameter from qcow2_cache_put()

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

This function was only using the BlockDriverState parameter to pass it
to qcow2_cache_get_table_idx(). This is no longer necessary so this
parameter can be removed.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
6f98155489054a457563da77cdad1a66ebb3e896.1517840876.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 block/qcow2.h  |  2 +-
 block/qcow2-cache.c|  2 +-
 block/qcow2-cluster.c  | 28 ++--
 block/qcow2-refcount.c | 30 +++---
 4 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 73e0e0133d..606dcdd47f 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -653,7 +653,7 @@ int qcow2_cache_get(BlockDriverState *bs, Qcow2Cache *c, 
uint64_t offset,
 void **table);
 int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset,
 void **table);
-void qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table);
+void qcow2_cache_put(Qcow2Cache *c, void **table);
 void *qcow2_cache_is_table_offset(BlockDriverState *bs, Qcow2Cache *c,
   uint64_t offset);
 void qcow2_cache_discard(BlockDriverState *bs, Qcow2Cache *c, void *table);
diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index 07603e6b15..3b55f39afb 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -407,7 +407,7 @@ int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache 
*c, uint64_t offset,
 return qcow2_cache_do_get(bs, c, offset, table, false);
 }
 
-void qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, void **table)
+void qcow2_cache_put(Qcow2Cache *c, void **table)
 {
 int i = qcow2_cache_get_table_idx(c, *table);
 
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 08af5c5995..ab5715c97e 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -317,7 +317,7 @@ static int l2_allocate(BlockDriverState *bs, int l1_index, 
uint64_t **table)
 
 memcpy(l2_table, old_table, s->cluster_size);
 
-qcow2_cache_put(bs, s->l2_table_cache, (void **) _table);
+qcow2_cache_put(s->l2_table_cache, (void **) _table);
 }
 
 /* write the l2 table to the file */
@@ -345,7 +345,7 @@ static int l2_allocate(BlockDriverState *bs, int l1_index, 
uint64_t **table)
 fail:
 trace_qcow2_l2_allocate_done(bs, l1_index, ret);
 if (l2_table != NULL) {
-qcow2_cache_put(bs, s->l2_table_cache, (void**) table);
+qcow2_cache_put(s->l2_table_cache, (void **) table);
 }
 s->l1_table[l1_index] = old_l2_offset;
 if (l2_offset > 0) {
@@ -619,7 +619,7 @@ int qcow2_get_cluster_offset(BlockDriverState *bs, uint64_t 
offset,
 abort();
 }
 
-qcow2_cache_put(bs, s->l2_table_cache, (void**) _table);
+qcow2_cache_put(s->l2_table_cache, (void **) _table);
 
 bytes_available = (int64_t)c * s->cluster_size;
 
@@ -637,7 +637,7 @@ out:
 return type;
 
 fail:
-qcow2_cache_put(bs, s->l2_table_cache, (void **)_table);
+qcow2_cache_put(s->l2_table_cache, (void **)_table);
 return ret;
 }
 
@@ -744,13 +744,13 @@ uint64_t 
qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
  * allocated. */
 cluster_offset = be64_to_cpu(l2_table[l2_index]);
 if (cluster_offset & L2E_OFFSET_MASK) {
-qcow2_cache_put(bs, s->l2_table_cache, (void**) _table);
+qcow2_cache_put(s->l2_table_cache, (void **) _table);
 return 0;
 }
 
 cluster_offset = qcow2_alloc_bytes(bs, compressed_size);
 if (cluster_offset < 0) {
-qcow2_cache_put(bs, s->l2_table_cache, (void**) _table);
+qcow2_cache_put(s->l2_table_cache, (void **) _table);
 return 0;
 }
 
@@ -767,7 +767,7 @@ uint64_t 
qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
 BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE_COMPRESSED);
 qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table);
 l2_table[l2_index] = cpu_to_be64(cluster_offset);
-qcow2_cache_put(bs, s->l2_table_cache, (void **) _table);
+qcow2_cache_put(s->l2_table_cache, (void **) _table);
 
 return cluster_offset;
 }
@@ -956,7 +956,7 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, 
QCowL2Meta *m)
  }
 
 
-qcow2_cache_put(bs, s->l2_table_cache, (void **) _table);
+qcow2_cache_put(s->l2_table_cache, (void **) _table);
 
 /*
  * If this was a COW, we need to decrease the refcount of the old cluster.
@@ -1174,7 +1174,7 @@ static int handle_copied(BlockDriverState *bs, uint64_t 
guest_offset,
 
 /* Cleanup */
 out:
-qcow2_cache_put(bs, s->l2_table_cache, (void **) _table);
+qcow2_cache_put(s->l2_table_cache, (void **) _table);
 
 /* Only return a host offset if we actually made progress. Otherwise we
  * would make requirements for handle_alloc() that it can't fulfill */
@@

[Qemu-block] [PULL 26/55] qcow2: Remove BDS parameter from qcow2_cache_discard()

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

This function was only using the BlockDriverState parameter to pass it
to qcow2_cache_get_table_idx() and qcow2_cache_table_release(). This
is no longer necessary so this parameter can be removed.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
9724f7e38e763ad3be32627c6b7fe8df9edb1476.1517840877.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 block/qcow2.h  | 2 +-
 block/qcow2-cache.c| 2 +-
 block/qcow2-refcount.c | 6 +++---
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 867f2e8828..7296784353 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -656,7 +656,7 @@ int qcow2_cache_get_empty(BlockDriverState *bs, Qcow2Cache 
*c, uint64_t offset,
 void qcow2_cache_put(Qcow2Cache *c, void **table);
 void *qcow2_cache_is_table_offset(BlockDriverState *bs, Qcow2Cache *c,
   uint64_t offset);
-void qcow2_cache_discard(BlockDriverState *bs, Qcow2Cache *c, void *table);
+void qcow2_cache_discard(Qcow2Cache *c, void *table);
 
 /* qcow2-bitmap.c functions */
 int qcow2_check_bitmaps_refcounts(BlockDriverState *bs, BdrvCheckResult *res,
diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index 03b3e03c6c..8d0b65e671 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -441,7 +441,7 @@ void *qcow2_cache_is_table_offset(BlockDriverState *bs, 
Qcow2Cache *c,
 return NULL;
 }
 
-void qcow2_cache_discard(BlockDriverState *bs, Qcow2Cache *c, void *table)
+void qcow2_cache_discard(Qcow2Cache *c, void *table)
 {
 int i = qcow2_cache_get_table_idx(c, table);
 
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 65af05dd23..361b39d5cc 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -875,12 +875,12 @@ static int QEMU_WARN_UNUSED_RESULT 
update_refcount(BlockDriverState *bs,
 offset);
 if (table != NULL) {
 qcow2_cache_put(s->refcount_block_cache, _block);
-qcow2_cache_discard(bs, s->refcount_block_cache, table);
+qcow2_cache_discard(s->refcount_block_cache, table);
 }
 
 table = qcow2_cache_is_table_offset(bs, s->l2_table_cache, offset);
 if (table != NULL) {
-qcow2_cache_discard(bs, s->l2_table_cache, table);
+qcow2_cache_discard(s->l2_table_cache, table);
 }
 
 if (s->discard_passthrough[type]) {
@@ -3190,7 +3190,7 @@ static int qcow2_discard_refcount_block(BlockDriverState 
*bs,
discard_block_offs);
 if (refblock) {
 /* discard refblock from the cache if refblock is cached */
-qcow2_cache_discard(bs, s->refcount_block_cache, refblock);
+qcow2_cache_discard(s->refcount_block_cache, refblock);
 }
 update_refcount_discard(bs, discard_block_offs, s->cluster_size);
 
-- 
2.13.6

[Qemu-block] [PULL 19/55] qcow2: Remove BDS parameter from qcow2_cache_get_table_addr()

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

This function was only using the BlockDriverState parameter to get the
cache table size (since it was equal to the cluster size). This is no
longer necessary so this parameter can be removed.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
e1f943a9e89e1deb876f45de1bb22419ccdb6ad3.1517840876.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 block/qcow2-cache.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index 38c03770b4..a481ef499a 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -46,8 +46,7 @@ struct Qcow2Cache {
 uint64_tcache_clean_lru_counter;
 };
 
-static inline void *qcow2_cache_get_table_addr(BlockDriverState *bs,
-Qcow2Cache *c, int table)
+static inline void *qcow2_cache_get_table_addr(Qcow2Cache *c, int table)
 {
 return (uint8_t *) c->table_array + (size_t) table * c->table_size;
 }
@@ -78,7 +77,7 @@ static void qcow2_cache_table_release(BlockDriverState *bs, 
Qcow2Cache *c,
 {
 /* Using MADV_DONTNEED to discard memory is a Linux-specific feature */
 #ifdef CONFIG_LINUX
-void *t = qcow2_cache_get_table_addr(bs, c, i);
+void *t = qcow2_cache_get_table_addr(c, i);
 int align = getpagesize();
 size_t mem_size = (size_t) c->table_size * num_tables;
 size_t offset = QEMU_ALIGN_UP((uintptr_t) t, align) - (uintptr_t) t;
@@ -222,7 +221,7 @@ static int qcow2_cache_entry_flush(BlockDriverState *bs, 
Qcow2Cache *c, int i)
 }
 
 ret = bdrv_pwrite(bs->file, c->entries[i].offset,
-  qcow2_cache_get_table_addr(bs, c, i), c->table_size);
+  qcow2_cache_get_table_addr(c, i), c->table_size);
 if (ret < 0) {
 return ret;
 }
@@ -378,7 +377,7 @@ static int qcow2_cache_do_get(BlockDriverState *bs, 
Qcow2Cache *c,
 }
 
 ret = bdrv_pread(bs->file, offset,
- qcow2_cache_get_table_addr(bs, c, i),
+ qcow2_cache_get_table_addr(c, i),
  c->table_size);
 if (ret < 0) {
 return ret;
@@ -390,7 +389,7 @@ static int qcow2_cache_do_get(BlockDriverState *bs, 
Qcow2Cache *c,
 /* And return the right table */
 found:
 c->entries[i].ref++;
-*table = qcow2_cache_get_table_addr(bs, c, i);
+*table = qcow2_cache_get_table_addr(c, i);
 
 trace_qcow2_cache_get_done(qemu_coroutine_self(),
c == s->l2_table_cache, i);
@@ -439,7 +438,7 @@ void *qcow2_cache_is_table_offset(BlockDriverState *bs, 
Qcow2Cache *c,
 
 for (i = 0; i < c->size; i++) {
 if (c->entries[i].offset == offset) {
-return qcow2_cache_get_table_addr(bs, c, i);
+return qcow2_cache_get_table_addr(c, i);
 }
 }
 return NULL;
-- 
2.13.6

[Qemu-block] [PULL 25/55] qcow2: Remove BDS parameter from qcow2_cache_clean_unused()

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

This function was only using the BlockDriverState parameter to pass it
to qcow2_cache_table_release(). This is no longer necessary so this
parameter can be removed.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
b74f17591af52f201de0ea3a3b2dd0a81932334d.1517840876.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 block/qcow2.h   | 2 +-
 block/qcow2-cache.c | 2 +-
 block/qcow2.c   | 4 ++--
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 9936cd0700..867f2e8828 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -646,7 +646,7 @@ int qcow2_cache_set_dependency(BlockDriverState *bs, 
Qcow2Cache *c,
 Qcow2Cache *dependency);
 void qcow2_cache_depends_on_flush(Qcow2Cache *c);
 
-void qcow2_cache_clean_unused(BlockDriverState *bs, Qcow2Cache *c);
+void qcow2_cache_clean_unused(Qcow2Cache *c);
 int qcow2_cache_empty(BlockDriverState *bs, Qcow2Cache *c);
 
 int qcow2_cache_get(BlockDriverState *bs, Qcow2Cache *c, uint64_t offset,
diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index 6f17a28635..03b3e03c6c 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -93,7 +93,7 @@ static inline bool can_clean_entry(Qcow2Cache *c, int i)
 t->lru_counter <= c->cache_clean_lru_counter;
 }
 
-void qcow2_cache_clean_unused(BlockDriverState *bs, Qcow2Cache *c)
+void qcow2_cache_clean_unused(Qcow2Cache *c)
 {
 int i = 0;
 while (i < c->size) {
diff --git a/block/qcow2.c b/block/qcow2.c
index 38e7fdc4b0..461cc7ab1c 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -706,8 +706,8 @@ static void cache_clean_timer_cb(void *opaque)
 {
 BlockDriverState *bs = opaque;
 BDRVQcow2State *s = bs->opaque;
-qcow2_cache_clean_unused(bs, s->l2_table_cache);
-qcow2_cache_clean_unused(bs, s->refcount_block_cache);
+qcow2_cache_clean_unused(s->l2_table_cache);
+qcow2_cache_clean_unused(s->refcount_block_cache);
 timer_mod(s->cache_clean_timer, qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) +
   (int64_t) s->cache_clean_interval * 1000);
 }
-- 
2.13.6

[Qemu-block] [PULL 22/55] qcow2: Remove BDS parameter from qcow2_cache_entry_mark_dirty()

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

This function was only using the BlockDriverState parameter to pass it
to qcow2_cache_get_table_idx(). This is no longer necessary so this
parameter can be removed.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
5c40516a91782b083c1428b7b6a41bb9e2679bfb.1517840876.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 block/qcow2.h  |  3 +--
 block/qcow2-cache.c|  3 +--
 block/qcow2-cluster.c  | 12 ++--
 block/qcow2-refcount.c | 14 ++
 4 files changed, 14 insertions(+), 18 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 016e87c81a..73e0e0133d 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -639,8 +639,7 @@ int qcow2_read_snapshots(BlockDriverState *bs);
 Qcow2Cache *qcow2_cache_create(BlockDriverState *bs, int num_tables);
 int qcow2_cache_destroy(BlockDriverState* bs, Qcow2Cache *c);
 
-void qcow2_cache_entry_mark_dirty(BlockDriverState *bs, Qcow2Cache *c,
- void *table);
+void qcow2_cache_entry_mark_dirty(Qcow2Cache *c, void *table);
 int qcow2_cache_flush(BlockDriverState *bs, Qcow2Cache *c);
 int qcow2_cache_write(BlockDriverState *bs, Qcow2Cache *c);
 int qcow2_cache_set_dependency(BlockDriverState *bs, Qcow2Cache *c,
diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index 5ff2cbf5c5..07603e6b15 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -421,8 +421,7 @@ void qcow2_cache_put(BlockDriverState *bs, Qcow2Cache *c, 
void **table)
 assert(c->entries[i].ref >= 0);
 }
 
-void qcow2_cache_entry_mark_dirty(BlockDriverState *bs, Qcow2Cache *c,
- void *table)
+void qcow2_cache_entry_mark_dirty(Qcow2Cache *c, void *table)
 {
 int i = qcow2_cache_get_table_idx(c, table);
 assert(c->entries[i].offset != 0);
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index e5ab102c29..08af5c5995 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -324,7 +324,7 @@ static int l2_allocate(BlockDriverState *bs, int l1_index, 
uint64_t **table)
 BLKDBG_EVENT(bs->file, BLKDBG_L2_ALLOC_WRITE);
 
 trace_qcow2_l2_allocate_write_l2(bs, l1_index);
-qcow2_cache_entry_mark_dirty(bs, s->l2_table_cache, l2_table);
+qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table);
 ret = qcow2_cache_flush(bs, s->l2_table_cache);
 if (ret < 0) {
 goto fail;
@@ -765,7 +765,7 @@ uint64_t 
qcow2_alloc_compressed_cluster_offset(BlockDriverState *bs,
 /* compressed clusters never have the copied flag */
 
 BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE_COMPRESSED);
-qcow2_cache_entry_mark_dirty(bs, s->l2_table_cache, l2_table);
+qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table);
 l2_table[l2_index] = cpu_to_be64(cluster_offset);
 qcow2_cache_put(bs, s->l2_table_cache, (void **) _table);
 
@@ -937,7 +937,7 @@ int qcow2_alloc_cluster_link_l2(BlockDriverState *bs, 
QCowL2Meta *m)
 if (ret < 0) {
 goto err;
 }
-qcow2_cache_entry_mark_dirty(bs, s->l2_table_cache, l2_table);
+qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table);
 
 assert(l2_index + m->nb_clusters <= s->l2_size);
 for (i = 0; i < m->nb_clusters; i++) {
@@ -1678,7 +1678,7 @@ static int discard_single_l2(BlockDriverState *bs, 
uint64_t offset,
 }
 
 /* First remove L2 entries */
-qcow2_cache_entry_mark_dirty(bs, s->l2_table_cache, l2_table);
+qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table);
 if (!full_discard && s->qcow_version >= 3) {
 l2_table[l2_index + i] = cpu_to_be64(QCOW_OFLAG_ZERO);
 } else {
@@ -1774,7 +1774,7 @@ static int zero_single_l2(BlockDriverState *bs, uint64_t 
offset,
 continue;
 }
 
-qcow2_cache_entry_mark_dirty(bs, s->l2_table_cache, l2_table);
+qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table);
 if (cluster_type == QCOW2_CLUSTER_COMPRESSED || unmap) {
 l2_table[l2_index + i] = cpu_to_be64(QCOW_OFLAG_ZERO);
 qcow2_free_any_clusters(bs, old_offset, 1, QCOW2_DISCARD_REQUEST);
@@ -1983,7 +1983,7 @@ static int expand_zero_clusters_in_l1(BlockDriverState 
*bs, uint64_t *l1_table,
 
 if (is_active_l1) {
 if (l2_dirty) {
-qcow2_cache_entry_mark_dirty(bs, s->l2_table_cache, l2_table);
+qcow2_cache_entry_mark_dirty(s->l2_table_cache, l2_table);
 qcow2_cache_depends_on_flush(s->l2_table_cache);
 }
 qcow2_cache_put(bs, s->l2_table_cache, (void **) _table);
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 92701ab7af..5434e7d4c8 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -421,7 +421,7 @@ static int alloc_refcount_block(BlockDriverState *bs,
 
 /* Now the new refcount block needs to be written to disk */
 BLKDBG_EVENT(bs->file,

[Qemu-block] [PULL 12/55] gluster: Add preallocated truncation

2018-02-13 Thread Kevin Wolf

From: Max Reitz 

By using qemu_do_cluster_truncate() in qemu_cluster_truncate(), we now
automatically have preallocated truncation.

Signed-off-by: Max Reitz 
Reviewed-by: Eric Blake 
Signed-off-by: Kevin Wolf 
---
 block/gluster.c | 17 +
 1 file changed, 1 insertion(+), 16 deletions(-)

diff --git a/block/gluster.c b/block/gluster.c
index 806b894bc8..3f17b7819d 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -1122,23 +1122,8 @@ static coroutine_fn int 
qemu_gluster_co_rw(BlockDriverState *bs,
 static int qemu_gluster_truncate(BlockDriverState *bs, int64_t offset,
  PreallocMode prealloc, Error **errp)
 {
-int ret;
 BDRVGlusterState *s = bs->opaque;
-
-if (prealloc != PREALLOC_MODE_OFF) {
-error_setg(errp, "Unsupported preallocation mode '%s'",
-   PreallocMode_str(prealloc));
-return -ENOTSUP;
-}
-
-ret = glfs_ftruncate(s->fd, offset);
-if (ret < 0) {
-ret = -errno;
-error_setg_errno(errp, -ret, "Failed to truncate file");
-return ret;
-}
-
-return 0;
+return qemu_gluster_do_truncate(s->fd, offset, prealloc, errp);
 }
 
 static coroutine_fn int qemu_gluster_co_readv(BlockDriverState *bs,
-- 
2.13.6

[Qemu-block] [PULL 17/55] qcow2: Fix documentation of get_cluster_table()

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

This function has not been returning the offset of the L2 table since
commit 3948d1d4876065160583e79533bf604481063833

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
b498733b6706a859a03678d74ecbd26aeba129aa.1517840876.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 block/qcow2-cluster.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index f077cd3ac5..e5ab102c29 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -647,8 +647,7 @@ fail:
  * for a given disk offset, load (and allocate if needed)
  * the l2 table.
  *
- * the l2 table offset in the qcow2 file and the cluster index
- * in the l2 table are given to the caller.
+ * the cluster index in the l2 table is given to the caller.
  *
  * Returns 0 on success, -errno in failure case
  */
-- 
2.13.6

[Qemu-block] [PULL 13/55] sheepdog: Make sd_prealloc() take a BDS

2018-02-13 Thread Kevin Wolf

From: Max Reitz 

We want to use this function in sd_truncate() later on, so taking a
filename is not exactly ideal.

Signed-off-by: Max Reitz 
Reviewed-by: Eric Blake 
Signed-off-by: Kevin Wolf 
---
 block/sheepdog.c | 29 +
 1 file changed, 21 insertions(+), 8 deletions(-)

diff --git a/block/sheepdog.c b/block/sheepdog.c
index af125a2c8d..cc1d37b3da 100644
--- a/block/sheepdog.c
+++ b/block/sheepdog.c
@@ -1826,10 +1826,10 @@ static int do_sd_create(BDRVSheepdogState *s, uint32_t 
*vdi_id, int snapshot,
 return 0;
 }
 
-static int sd_prealloc(const char *filename, Error **errp)
+static int sd_prealloc(BlockDriverState *bs, Error **errp)
 {
 BlockBackend *blk = NULL;
-BDRVSheepdogState *base = NULL;
+BDRVSheepdogState *base = bs->opaque;
 unsigned long buf_size;
 uint32_t idx, max_idx;
 uint32_t object_size;
@@ -1837,10 +1837,11 @@ static int sd_prealloc(const char *filename, Error 
**errp)
 void *buf = NULL;
 int ret;
 
-blk = blk_new_open(filename, NULL, NULL,
-   BDRV_O_RDWR | BDRV_O_RESIZE | BDRV_O_PROTOCOL, errp);
-if (blk == NULL) {
-ret = -EIO;
+blk = blk_new(BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE | BLK_PERM_RESIZE,
+  BLK_PERM_ALL);
+
+ret = blk_insert_bs(blk, bs, errp);
+if (ret < 0) {
 goto out_with_err_set;
 }
 
@@ -1852,7 +1853,6 @@ static int sd_prealloc(const char *filename, Error **errp)
 goto out;
 }
 
-base = blk_bs(blk)->opaque;
 object_size = (UINT32_C(1) << base->inode.block_size_shift);
 buf_size = MIN(object_size, SD_DATA_OBJ_SIZE);
 buf = g_malloc0(buf_size);
@@ -2108,7 +2108,20 @@ static int sd_create(const char *filename, QemuOpts 
*opts,
 }
 
 if (prealloc) {
-ret = sd_prealloc(filename, errp);
+BlockDriverState *bs;
+QDict *opts;
+
+opts = qdict_new();
+qdict_put_str(opts, "driver", "sheepdog");
+bs = bdrv_open(filename, NULL, opts, BDRV_O_PROTOCOL | BDRV_O_RDWR,
+   errp);
+if (!bs) {
+goto out;
+}
+
+ret = sd_prealloc(bs, errp);
+
+bdrv_unref(bs);
 }
 out:
 g_free(backing_file);
-- 
2.13.6

[Qemu-block] [PULL 04/55] docs: Document share-rw property more thoroughly

2018-02-13 Thread Kevin Wolf

From: Fam Zheng 

Suggested-by: Stefan Hajnoczi 
Signed-off-by: Fam Zheng 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Kashyap Chamarthy 
Signed-off-by: Kevin Wolf 
---
 docs/qemu-block-drivers.texi | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/docs/qemu-block-drivers.texi b/docs/qemu-block-drivers.texi
index cd74767ed3..f1793692bb 100644
--- a/docs/qemu-block-drivers.texi
+++ b/docs/qemu-block-drivers.texi
@@ -845,6 +845,16 @@ QEMU transparently handles lock handover during shared 
storage migration.  For
 shared virtual disk images between multiple VMs, the "share-rw" device option
 should be used.
 
+By default, the guest has exclusive write access to its disk image. If the
+guest can safely share the disk image with other writers the @code{-device
+...,share-rw=on} parameter can be used.  This is only safe if the guest is
+running software, such as a cluster file system, that coordinates disk accesses
+to avoid corruption.
+
+Note that share-rw=on only declares the guest's ability to share the disk.
+Some QEMU features, such as image file formats, require exclusive write access
+to the disk image and this is unaffected by the share-rw=on option.
+
 Alternatively, locking can be fully disabled by "locking=off" block device
 option. In the command line, the option is usually in the form of
 "file.locking=off" as the protocol driver is normally placed as a "file" child
-- 
2.13.6

[Qemu-block] [PULL 00/55] Block layer patches

2018-02-13 Thread Kevin Wolf

The following changes since commit fb68096da3d35e64c88cd610c1fa42766c58e92a:

  Revert "tests: use memfd in vhost-user-test" (2018-02-13 09:51:52 +)

are available in the git repository at:

  git://repo.or.cz/qemu/kevin.git tags/for-upstream

for you to fetch changes up to 0a4dc980e6c935e9be745ce3ee1a4c71629ecd00:

  Merge remote-tracking branch 'mreitz/tags/pull-block-2018-02-13' into 
queue-block (2018-02-13 17:01:13 +0100)


Block layer patches


Alberto Garcia (40):
  qcow2: Use g_try_realloc() in qcow2_expand_zero_clusters()
  qcow2: Fix documentation of get_cluster_table()
  qcow2: Add table size field to Qcow2Cache
  qcow2: Remove BDS parameter from qcow2_cache_get_table_addr()
  qcow2: Remove BDS parameter from qcow2_cache_get_table_idx()
  qcow2: Remove BDS parameter from qcow2_cache_table_release()
  qcow2: Remove BDS parameter from qcow2_cache_entry_mark_dirty()
  qcow2: Remove BDS parameter from qcow2_cache_put()
  qcow2: Remove BDS parameter from qcow2_cache_destroy()
  qcow2: Remove BDS parameter from qcow2_cache_clean_unused()
  qcow2: Remove BDS parameter from qcow2_cache_discard()
  qcow2: Remove BDS parameter from qcow2_cache_is_table_offset()
  qcow2: Add offset_to_l1_index()
  qcow2: Add l2_slice_size field to BDRVQcow2State
  qcow2: Add offset_to_l2_slice_index()
  qcow2: Update l2_load() to support L2 slices
  qcow2: Prepare l2_allocate() for adding L2 slice support
  qcow2: Update l2_allocate() to support L2 slices
  qcow2: Refactor get_cluster_table()
  qcow2: Update get_cluster_table() to support L2 slices
  qcow2: Update qcow2_get_cluster_offset() to support L2 slices
  qcow2: Update qcow2_alloc_cluster_link_l2() to support L2 slices
  qcow2: Update handle_copied() to support L2 slices
  qcow2: Update handle_alloc() to support L2 slices
  qcow2: Update discard_single_l2() to support L2 slices
  qcow2: Update zero_single_l2() to support L2 slices
  qcow2: Prepare qcow2_update_snapshot_refcount() for adding L2 slice 
support
  qcow2: Update qcow2_update_snapshot_refcount() to support L2 slices
  qcow2: Read refcount before L2 table in expand_zero_clusters_in_l1()
  qcow2: Prepare expand_zero_clusters_in_l1() for adding L2 slice support
  qcow2: Update expand_zero_clusters_in_l1() to support L2 slices
  qcow2: Update qcow2_truncate() to support L2 slices
  qcow2: Rename l2_table in qcow2_alloc_compressed_cluster_offset()
  qcow2: Rename l2_table in count_contiguous_clusters()
  qcow2: Rename l2_table in count_contiguous_clusters_unallocated()
  qcow2: Rename l2_table in count_cow_clusters()
  qcow2: Allow configuring the L2 slice size
  iotests: Test valid values of l2-cache-entry-size
  iotests: Test downgrading an image using a small L2 slice size
  iotests: Add l2-cache-entry-size to iotest 137

Daniel P. Berrangé (1):
  qemu-io: fix EOF Ctrl-D handling in qemu-io readline code

Fam Zheng (4):
  iotests: Fix CID for VMDK afl image
  qemu-img.texi: Clean up parameter list
  qemu-img: Document --force-share / -U
  docs: Document share-rw property more thoroughly

Kevin Wolf (1):
  Merge remote-tracking branch 'mreitz/tags/pull-block-2018-02-13' into 
queue-block

Max Reitz (8):
  iotests: Use virtio-blk in 155
  gluster: Move glfs_close() to create's clean-up
  gluster: Pull truncation from qemu_gluster_create
  gluster: Query current size in do_truncate()
  gluster: Add preallocated truncation
  sheepdog: Make sd_prealloc() take a BDS
  sheepdog: Pass old and new size to sd_prealloc()
  sheepdog: Allow fully preallocated truncation

Paolo Bonzini (1):
  block: early check for blockers on drive-mirror

Vladimir Sementsov-Ogievskiy (1):
  block: maintain persistent disabled bitmaps

 qapi/block-core.json   |  12 +-
 block/qcow2.h  |  33 +-
 include/block/dirty-bitmap.h   |   1 -
 block/dirty-bitmap.c   |  18 -
 block/gluster.c| 116 +++---
 block/qcow2-bitmap.c   |  12 +-
 block/qcow2-cache.c|  80 ++--
 block/qcow2-cluster.c  | 519 +
 block/qcow2-refcount.c | 206 +-
 block/qcow2.c  |  63 ++-
 block/sheepdog.c   |  56 ++-
 blockdev.c |  15 +-
 qemu-io.c  |  27 +-
 docs/qemu-block-drivers.texi   |  10 +
 qemu-doc.texi  |   7 +
 qemu-img.texi  |  74 ++--

[Qemu-block] [PULL 18/55] qcow2: Add table size field to Qcow2Cache

2018-02-13 Thread Kevin Wolf

From: Alberto Garcia 

The table size in the qcow2 cache is currently equal to the cluster
size. This doesn't allow us to use the cache memory efficiently,
particularly with large cluster sizes, so we need to be able to have
smaller cache tables that are independent from the cluster size. This
patch adds a new field to Qcow2Cache that we can use instead of the
cluster size.

The current table size is still being initialized to the cluster size,
so there are no semantic changes yet, but this patch will allow us to
prepare the rest of the code and simplify a few function calls.

Signed-off-by: Alberto Garcia 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Message-id: 
67a1bf9e55f417005c567bead95a018dc34bc687.1517840876.git.be...@igalia.com
Signed-off-by: Max Reitz 
---
 block/qcow2-cache.c | 29 ++---
 1 file changed, 14 insertions(+), 15 deletions(-)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index c48ffebd8f..38c03770b4 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -39,6 +39,7 @@ struct Qcow2Cache {
 Qcow2CachedTable   *entries;
 struct Qcow2Cache  *depends;
 int size;
+int table_size;
 booldepends_on_flush;
 void   *table_array;
 uint64_tlru_counter;
@@ -48,17 +49,15 @@ struct Qcow2Cache {
 static inline void *qcow2_cache_get_table_addr(BlockDriverState *bs,
 Qcow2Cache *c, int table)
 {
-BDRVQcow2State *s = bs->opaque;
-return (uint8_t *) c->table_array + (size_t) table * s->cluster_size;
+return (uint8_t *) c->table_array + (size_t) table * c->table_size;
 }
 
 static inline int qcow2_cache_get_table_idx(BlockDriverState *bs,
   Qcow2Cache *c, void *table)
 {
-BDRVQcow2State *s = bs->opaque;
 ptrdiff_t table_offset = (uint8_t *) table - (uint8_t *) c->table_array;
-int idx = table_offset / s->cluster_size;
-assert(idx >= 0 && idx < c->size && table_offset % s->cluster_size == 0);
+int idx = table_offset / c->table_size;
+assert(idx >= 0 && idx < c->size && table_offset % c->table_size == 0);
 return idx;
 }
 
@@ -79,10 +78,9 @@ static void qcow2_cache_table_release(BlockDriverState *bs, 
Qcow2Cache *c,
 {
 /* Using MADV_DONTNEED to discard memory is a Linux-specific feature */
 #ifdef CONFIG_LINUX
-BDRVQcow2State *s = bs->opaque;
 void *t = qcow2_cache_get_table_addr(bs, c, i);
 int align = getpagesize();
-size_t mem_size = (size_t) s->cluster_size * num_tables;
+size_t mem_size = (size_t) c->table_size * num_tables;
 size_t offset = QEMU_ALIGN_UP((uintptr_t) t, align) - (uintptr_t) t;
 size_t length = QEMU_ALIGN_DOWN(mem_size - offset, align);
 if (mem_size > offset && length > 0) {
@@ -132,9 +130,10 @@ Qcow2Cache *qcow2_cache_create(BlockDriverState *bs, int 
num_tables)
 
 c = g_new0(Qcow2Cache, 1);
 c->size = num_tables;
+c->table_size = s->cluster_size;
 c->entries = g_try_new0(Qcow2CachedTable, num_tables);
 c->table_array = qemu_try_blockalign(bs->file->bs,
- (size_t) num_tables * 
s->cluster_size);
+ (size_t) num_tables * c->table_size);
 
 if (!c->entries || !c->table_array) {
 qemu_vfree(c->table_array);
@@ -203,13 +202,13 @@ static int qcow2_cache_entry_flush(BlockDriverState *bs, 
Qcow2Cache *c, int i)
 
 if (c == s->refcount_block_cache) {
 ret = qcow2_pre_write_overlap_check(bs, QCOW2_OL_REFCOUNT_BLOCK,
-c->entries[i].offset, s->cluster_size);
+c->entries[i].offset, c->table_size);
 } else if (c == s->l2_table_cache) {
 ret = qcow2_pre_write_overlap_check(bs, QCOW2_OL_ACTIVE_L2,
-c->entries[i].offset, s->cluster_size);
+c->entries[i].offset, c->table_size);
 } else {
 ret = qcow2_pre_write_overlap_check(bs, 0,
-c->entries[i].offset, s->cluster_size);
+c->entries[i].offset, c->table_size);
 }
 
 if (ret < 0) {
@@ -223,7 +222,7 @@ static int qcow2_cache_entry_flush(BlockDriverState *bs, 
Qcow2Cache *c, int i)
 }
 
 ret = bdrv_pwrite(bs->file, c->entries[i].offset,
-  qcow2_cache_get_table_addr(bs, c, i), s->cluster_size);
+  qcow2_cache_get_table_addr(bs, c, i), c->table_size);
 if (ret < 0) {
 return ret;
 }
@@ -331,7 +330,7 @@ static int qcow2_cache_do_get(BlockDriverState *bs, 
Qcow2Cache *c,
 trace_qcow2_cache_get(qemu_coroutine_self(), c == s->l2_table_cache,
   offset, read_from_disk);
 
-if (offset_into_cluster(s, offset)) {
+if (!QEMU_IS_ALIGNED(offset, c->table_size)) {
 qcow2_signal_corruption(bs, true, -1, -1, "Cannot get entry from %s "

[Qemu-block] [PULL 02/55] qemu-img.texi: Clean up parameter list

2018-02-13 Thread Kevin Wolf

From: Fam Zheng 

Split options out of the "@table @var" section and create a "@table
@option", then use whitespaces and blank lines consistently.

Suggested-by: Kevin Wolf 
Signed-off-by: Fam Zheng 
Reviewed-by: Stefan Hajnoczi 
Reviewed-by: Kashyap Chamarthy 
Signed-off-by: Kevin Wolf 
---
 qemu-img.texi | 66 +++
 1 file changed, 39 insertions(+), 27 deletions(-)

diff --git a/qemu-img.texi b/qemu-img.texi
index fdcf120f36..60a0e080c6 100644
--- a/qemu-img.texi
+++ b/qemu-img.texi
@@ -33,38 +33,14 @@ The following commands are supported:
 
 Command parameters:
 @table @var
-@item filename
- is a disk image filename
-
-@item --object @var{objectdef}
-
-is a QEMU user creatable object definition. See the @code{qemu(1)} manual
-page for a description of the object properties. The most common object
-type is a @code{secret}, which is used to supply passwords and/or encryption
-keys.
-
-@item --image-opts
-
-Indicates that the source @var{filename} parameter is to be interpreted as a
-full option string, not a plain filename. This parameter is mutually
-exclusive with the @var{-f} parameter.
-
-@item --target-image-opts
 
-Indicates that the @var{output_filename} parameter(s) are to be interpreted as
-a full option string, not a plain filename. This parameter is mutually
-exclusive with the @var{-O} parameters. It is currently required to also use
-the @var{-n} parameter to skip image creation. This restriction may be relaxed
-in a future release.
+@item filename
+is a disk image filename
 
 @item fmt
 is the disk image format. It is guessed automatically in most cases. See below
 for a description of the supported disk formats.
 
-@item --backing-chain
-will enumerate information about backing files in a disk image chain. Refer
-below for further description.
-
 @item size
 is the disk image size in bytes. Optional suffixes @code{k} or @code{K}
 (kilobyte, 1024) @code{M} (megabyte, 1024k) and @code{G} (gigabyte, 1024M)
@@ -74,42 +50,78 @@ and T (terabyte, 1024G) are supported.  @code{b} is ignored.
 is the destination disk image filename
 
 @item output_fmt
- is the destination format
+is the destination format
+
 @item options
 is a comma separated list of format specific options in a
 name=value format. Use @code{-o ?} for an overview of the options supported
 by the used format or see the format descriptions below for details.
+
 @item snapshot_param
 is param used for internal snapshot, format is
 'snapshot.id=[ID],snapshot.name=[NAME]' or '[ID_OR_NAME]'
+
 @item snapshot_id_or_name
 is deprecated, use snapshot_param instead
 
+@end table
+
+@table @option
+
+@item --object @var{objectdef}
+is a QEMU user creatable object definition. See the @code{qemu(1)} manual
+page for a description of the object properties. The most common object
+type is a @code{secret}, which is used to supply passwords and/or encryption
+keys.
+
+@item --image-opts
+Indicates that the source @var{filename} parameter is to be interpreted as a
+full option string, not a plain filename. This parameter is mutually
+exclusive with the @var{-f} parameter.
+
+@item --target-image-opts
+Indicates that the @var{output_filename} parameter(s) are to be interpreted as
+a full option string, not a plain filename. This parameter is mutually
+exclusive with the @var{-O} parameters. It is currently required to also use
+the @var{-n} parameter to skip image creation. This restriction may be relaxed
+in a future release.
+
+@item --backing-chain
+will enumerate information about backing files in a disk image chain. Refer
+below for further description.
+
 @item -c
 indicates that target image must be compressed (qcow format only)
+
 @item -h
 with or without a command shows help and lists the supported formats
+
 @item -p
 display progress bar (compare, convert and rebase commands only).
 If the @var{-p} option is not used for a command that supports it, the
 progress is reported when the process receives a @code{SIGUSR1} or
 @code{SIGINFO} signal.
+
 @item -q
 Quiet mode - do not print any output (except errors). There's no progress bar
 in case both @var{-q} and @var{-p} options are used.
+
 @item -S @var{size}
 indicates the consecutive number of bytes that must contain only zeros
 for qemu-img to create a sparse image during conversion. This value is rounded
 down to the nearest 512 bytes. You may use the common size suffixes like
 @code{k} for kilobytes.
+
 @item -t @var{cache}
 specifies the cache mode that should be used with the (destination) file. See
 the documentation of the emulator's @code{-drive cache=...} option for allowed
 values.
+
 @item -T @var{src_cache}
 specifies the cache mode that should be used with the source file(s). See
 the documentation of the emulator's @code{-drive cache=...} option for allowed
 values.
+
 @end table
 
 Parameters to snapshot

[Qemu-block] [PULL 07/55] iotests: Use virtio-blk in 155

2018-02-13 Thread Kevin Wolf

From: Max Reitz 

Only a few select machine types support floppy drives and there is
actually nothing preventing us from using virtio here, so let's do it.

Reported-by: Christian Borntraeger 
Signed-off-by: Max Reitz 
Tested-by: Christian Borntraeger 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/155 | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/tests/qemu-iotests/155 b/tests/qemu-iotests/155
index fc9fa975be..42dae04c83 100755
--- a/tests/qemu-iotests/155
+++ b/tests/qemu-iotests/155
@@ -64,7 +64,7 @@ class BaseClass(iotests.QMPTestCase):
 'file': {'driver': 'file',
  'filename': source_img}}
 self.vm.add_blockdev(self.qmp_to_opts(blockdev))
-self.vm.add_device('floppy,id=qdev0,drive=source')
+self.vm.add_device('virtio-blk,id=qdev0,drive=source')
 self.vm.launch()
 
 self.assertIntactSourceBackingChain()
@@ -173,21 +173,24 @@ class MirrorBaseClass(BaseClass):
 def testFull(self):
 self.runMirror('full')
 
-node = self.findBlockNode('target', 'qdev0')
+node = self.findBlockNode('target',
+  '/machine/peripheral/qdev0/virtio-backend')
 self.assertCorrectBackingImage(node, None)
 self.assertIntactSourceBackingChain()
 
 def testTop(self):
 self.runMirror('top')
 
-node = self.findBlockNode('target', 'qdev0')
+node = self.findBlockNode('target',
+  '/machine/peripheral/qdev0/virtio-backend')
 self.assertCorrectBackingImage(node, back2_img)
 self.assertIntactSourceBackingChain()
 
 def testNone(self):
 self.runMirror('none')
 
-node = self.findBlockNode('target', 'qdev0')
+node = self.findBlockNode('target',
+  '/machine/peripheral/qdev0/virtio-backend')
 self.assertCorrectBackingImage(node, source_img)
 self.assertIntactSourceBackingChain()
 
@@ -239,7 +242,8 @@ class TestCommit(BaseClass):
 
 self.vm.event_wait('BLOCK_JOB_COMPLETED')
 
-node = self.findBlockNode(None, 'qdev0')
+node = self.findBlockNode(None,
+  '/machine/peripheral/qdev0/virtio-backend')
 self.assert_qmp(node, 'image' + '/backing-image' * 0 + '/filename',
 back1_img)
 self.assert_qmp(node, 'image' + '/backing-image' * 1 + '/filename',
-- 
2.13.6

[Qemu-block] [PULL 10/55] gluster: Pull truncation from qemu_gluster_create

2018-02-13 Thread Kevin Wolf

From: Max Reitz 

Pull out the truncation code from the qemu_cluster_create() function so
we can later reuse it in qemu_gluster_truncate().

Signed-off-by: Max Reitz 
Reviewed-by: Eric Blake 
Signed-off-by: Kevin Wolf 
---
 block/gluster.c | 74 +++--
 1 file changed, 40 insertions(+), 34 deletions(-)

diff --git a/block/gluster.c b/block/gluster.c
index 7fab2dfa12..8178541416 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -965,6 +965,45 @@ static coroutine_fn int 
qemu_gluster_co_pwrite_zeroes(BlockDriverState *bs,
 }
 #endif
 
+static int qemu_gluster_do_truncate(struct glfs_fd *fd, int64_t offset,
+PreallocMode prealloc, Error **errp)
+{
+switch (prealloc) {
+#ifdef CONFIG_GLUSTERFS_FALLOCATE
+case PREALLOC_MODE_FALLOC:
+if (glfs_fallocate(fd, 0, 0, offset)) {
+error_setg_errno(errp, errno, "Could not preallocate data");
+return -errno;
+}
+break;
+#endif /* CONFIG_GLUSTERFS_FALLOCATE */
+#ifdef CONFIG_GLUSTERFS_ZEROFILL
+case PREALLOC_MODE_FULL:
+if (glfs_ftruncate(fd, offset)) {
+error_setg_errno(errp, errno, "Could not resize file");
+return -errno;
+}
+if (glfs_zerofill(fd, 0, offset)) {
+error_setg_errno(errp, errno, "Could not zerofill the new area");
+return -errno;
+}
+break;
+#endif /* CONFIG_GLUSTERFS_ZEROFILL */
+case PREALLOC_MODE_OFF:
+if (glfs_ftruncate(fd, offset)) {
+error_setg_errno(errp, errno, "Could not resize file");
+return -errno;
+}
+break;
+default:
+error_setg(errp, "Unsupported preallocation mode: %s",
+   PreallocMode_str(prealloc));
+return -EINVAL;
+}
+
+return 0;
+}
+
 static int qemu_gluster_create(const char *filename,
QemuOpts *opts, Error **errp)
 {
@@ -1019,40 +1058,7 @@ static int qemu_gluster_create(const char *filename,
 goto out;
 }
 
-switch (prealloc) {
-#ifdef CONFIG_GLUSTERFS_FALLOCATE
-case PREALLOC_MODE_FALLOC:
-if (glfs_fallocate(fd, 0, 0, total_size)) {
-error_setg(errp, "Could not preallocate data for the new file");
-ret = -errno;
-}
-break;
-#endif /* CONFIG_GLUSTERFS_FALLOCATE */
-#ifdef CONFIG_GLUSTERFS_ZEROFILL
-case PREALLOC_MODE_FULL:
-if (!glfs_ftruncate(fd, total_size)) {
-if (glfs_zerofill(fd, 0, total_size)) {
-error_setg(errp, "Could not zerofill the new file");
-ret = -errno;
-}
-} else {
-error_setg(errp, "Could not resize file");
-ret = -errno;
-}
-break;
-#endif /* CONFIG_GLUSTERFS_ZEROFILL */
-case PREALLOC_MODE_OFF:
-if (glfs_ftruncate(fd, total_size) != 0) {
-ret = -errno;
-error_setg(errp, "Could not resize file");
-}
-break;
-default:
-ret = -EINVAL;
-error_setg(errp, "Unsupported preallocation mode: %s",
-   PreallocMode_str(prealloc));
-break;
-}
+ret = qemu_gluster_do_truncate(fd, total_size, prealloc, errp);
 
 out:
 if (fd) {
-- 
2.13.6

[Qemu-block] [PULL 06/55] block: early check for blockers on drive-mirror

2018-02-13 Thread Kevin Wolf

From: Paolo Bonzini 

Even if an op blocker is present for BLOCK_OP_TYPE_MIRROR_SOURCE,
it is checked a bit late and the result is that the target is
created even if drive-mirror subsequently fails.  Add an early
check to avoid this.

Signed-off-by: Paolo Bonzini 
Reviewed-by: Fam Zheng 
Reviewed-by: Alberto Garcia 
Signed-off-by: Kevin Wolf 
---
 blockdev.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/blockdev.c b/blockdev.c
index bdbdeae7e4..7423c5317b 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -3569,6 +3569,11 @@ void qmp_drive_mirror(DriveMirror *arg, Error **errp)
 return;
 }
 
+/* Early check to avoid creating target */
+if (bdrv_op_is_blocked(bs, BLOCK_OP_TYPE_MIRROR_SOURCE, errp)) {
+return;
+}
+
 aio_context = bdrv_get_aio_context(bs);
 aio_context_acquire(aio_context);
 
-- 
2.13.6

[Qemu-block] [PULL 01/55] iotests: Fix CID for VMDK afl image

2018-02-13 Thread Kevin Wolf

From: Fam Zheng 

This reverts commit 76bf133c4 which updated the reference output, and
fixed the reference image, because the code path we want to exercise is
actually the invalid image size.

The descriptor block in the image, which includes the CID to verify, has been
invalid since the reference image was added. Since commit 9877860e7bd we report
this error earlier than the "file too large", so 059.out mismatches.

The binary change is generated along the operations of:

  $ bunzip2 afl9.vmdk.bz2
  $ qemu-img create -f vmdk fix.vmdk 1G
  $ dd if=afl9.vmdk of=fix.vmdk bs=512 count=1 conv=notrunc
  $ mv fix.vmdk afl9.vmdk
  $ bzip2 afl9.vmdk

Signed-off-by: Fam Zheng 
Reviewed-by: Eric Blake 
Signed-off-by: Kevin Wolf 
---
 tests/qemu-iotests/059.out |   2 +-
 tests/qemu-iotests/sample_images/afl9.vmdk.bz2 | Bin 178 -> 618 bytes
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/qemu-iotests/059.out b/tests/qemu-iotests/059.out
index 1ac5d56233..f6dce7947c 100644
--- a/tests/qemu-iotests/059.out
+++ b/tests/qemu-iotests/059.out
@@ -2358,5 +2358,5 @@ Offset  Length  Mapped to   File
 0x14000 0x1 0x5 TEST_DIR/t-s003.vmdk
 
 === Testing afl image with a very large capacity ===
-qemu-img: Could not open 'TEST_DIR/afl9.IMGFMT': Could not open 
'TEST_DIR/afl9.IMGFMT': Invalid argument
+qemu-img: Can't get image size 'TEST_DIR/afl9.IMGFMT': File too large
 *** done
diff --git a/tests/qemu-iotests/sample_images/afl9.vmdk.bz2 
b/tests/qemu-iotests/sample_images/afl9.vmdk.bz2
index 
03615d36a12425cf4240bab86f4cfe648db14572..9fcd0af45a815431acf4689e0845ecf2d333cd58
 100644
GIT binary patch
literal 618
zcmV-w0+szjT4*^jL0KkKSvgW7ssIN3|NsBH-Q9UpfAhclU70`s-*NE~5QvC~h=_=Y
zh>D2n*q*=vygR634445h35k;?00h9835kMW4$iPepVE{Bqk)uhJ^wfGLr=)3s
zhM5CR88jLh7)B;cA*K)*6GmuECPU3o4NWG5O#pg>Ak#xY8Z^CrMt}oD38Ns$
z02n}M0LdjZ&}cLPqd+nPKmn$j0iXe(02%-d27nnJriN-uE+X@Bj4BBfd|yV!NB
zwqkL}nW3AI5x^jp=t%^F1pxqp)v#n#)j$zcm1xqv(!$2d*5%vF{5RPWnOV8-^tE<(
zU~%&}Y0uNu*9Wt=yS^8PkC%IG;aD{l#sG`m4Ho*fsHXdM

Re: [Qemu-block] qcow2 images corruption

2018-02-13 Thread Nicolas Ecarnot


Le 13/02/2018 à 16:26, Nicolas Ecarnot a écrit :
>> It would be good if you could make the 'qemu-img check' output available
>> somewhere.
>

I found this :
https://github.com/ShijunDeng/qcow2-dump

and the transcript (beautiful colors when viewed with "more") is attached :


--
Nicolas ECARNOT
Le script a débuté sur mar. 13 févr. 2018 17:31:05 CET
]0;root@serv-hv-adm13:/home[?1034h[01;32mroot@serv-hv-adm13[00m:[01;34m/home[00m#
 /root/qcow2-dump -m check serv-term-adm4-corr.qcow2.img
[1;32m
File:[1;36m serv-term-adm4-corr.qcow2.img
[0m

magic: 0x514649fb
version: [1;36m2
[0mbacking_file_offset: 0x0
backing_file_size: 0
fs_type: [1;32mxfs
[0mvirtual_size: 64424509440 / 61440M / 60G
disk_size: 36507222016 / 34816M / 34G
seek_end: 36507222016 [[1;32m0x88000[0m] / 34816M / 34G
cluster_bits: [1;36m16
[0mcluster_size: [1;36m65536
[0mcrypt_method: 0
csize_shift: 54
csize_mask: 255
cluster_offset_mask: [1;36m0x3f
[0ml1_table_offset: [1;32m0x76a46
[0ml1_size: [1;32m120
[0ml1_vm_state_index: [1;32m120
[0ml2_size: [1;36m8192
[0mrefcount_order: [1;36m4
[0mrefcount_bits: [1;36m16
[0mrefcount_block_bits: [1;36m15
[0mrefcount_block_size: [1;36m32768
[0mrefcount_table_offset: [1;32m0x1
[0mrefcount_table_clusters: [1;32m1
[0msnapshots_offset: [1;32m0x0
[0mnb_snapshots: [1;32m0
[0mincompatible_features: 
compatible_features: 
autoclear_features: 



[1;32mActive Snapshot:
[0m
L1 Table:   [offset: 0x76a46, len: 120]

[1;36mResult:
[0mL1 Table:   unaligned: [1;33m0, [0minvalid: [1;33m0, [0munused: 53, 
used: 67
L2 Table:   unaligned: [1;33m0, [0minvalid: [1;33m0, [0munused: 20304, 
used: 528560



[1;32mRefcount Table:
[0m
Refcount Table: [offset: 0x1, len: 8192]

[1;36mResult:
[0mRefcount Table: unaligned: [1;33m0, [0minvalid: [1;33m0, [0munused: 
8175, used: 17
Refcount:   error: [1;33m4342, [0mleak: [1;33m0, [0munused: 28426, 
used: 524288



[1;32mCOPIED OFLAG:
[0m

[1;36mResult:
[0mL1 Table ERROR OFLAG_COPIED: [1;33m1
[0mL2 Table ERROR OFLAG_COPIED: [1;33m4323
[0mActive L2 COPIED: [1;33m528560 [34639708160 / 33035M / 32G]

[0m

[1;32mActive Cluster:
[0m
[1;36m
Result:
[0mActive Cluster: reuse: [1;33m17

[0m

[1;31mSummary:
[0mpreallocation:  [1;32moff
[0mActive Cluster: [1;31mreuse: 17
[0mRefcount Table: [1;33munaligned: 0, [0m[1;33minvalid: 0, [0munused: 
8175, used: 17
Refcount:   [1;33merror: [0m[1;31m4342, [0m[1;33mleak: 0, 
[0m[1;31mrebuild: 4325, [0munused: 28426, used: 524288
L1 Table:   [1;33munaligned: 0, [0m[1;33minvalid: 0, [0munused: 53, 
used: 67
[1;33moflag copied: [0m[1;31m1
[0mL2 Table:   [1;33munaligned: 0, [0m[1;33minvalid: 0, [0munused: 
20304, used: 528560
[1;33moflag copied: [0m[1;31m4323
[0m

###[5;31m qcow2 image has refcount errors!   (=_=#)[0m###
###[5;31mand qcow2 image has copied errors!  (o_0)?[0m###
###[5;31m  Sadly: refcount error cause active cluster reused! Orz[0m  ###
###[1;33m Please backup this image and contact the author![0m ###



]0;root@serv-hv-adm13:/home[01;32mroot@serv-hv-adm13[00m:[01;34m/home[00m#
 exit

Script terminé sur mar. 13 févr. 2018 17:31:13 CET

Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI

2018-02-13 Thread Eric Blake


On 02/13/2018 08:48 AM, Daniel P. Berrangé wrote:

No, that's policy decision that doesn't matter from QMP pov. If the mgmt
app wants the snapshot to be wrt to the initial time, it can simply
invoke the "stop" QMP command before doing the live migration and
"cont" afterwards.


That would be non-live. I think Roman means a live snapshot that saves
the state at the beginning of the operation. Basically the difference
between blockdev-backup (state at the beginning) and blockdev-mirror
(state at the end), except for a whole VM.


That doesn't seem practical unless you can instantaneously write out
the entire guest RAM to disk without blocking, or can somehow snapshot
the RAM so you can write out a consistent view of the original RAM,
while the guest continues to dirty RAM pages.


One idea for that is via fork()'s copy-on-write semantics; the parent 
continues processing the guest, while the child writes out RAM pages. 
Pages touched by the guest in the parent are now cleanly copied by the 
OS so that the child can take all the time it wants, but still writes 
out the state of the guest at the time of the fork().  It may be 
possible to use userfaultfd to achieve the same effects without a fork().


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-block] qcow2 images corruption

2018-02-13 Thread Nicolas Ecarnot


Le 13/02/2018 à 16:26, Nicolas Ecarnot a écrit :

It would be good if you could make the 'qemu-img check' output available
somewhere.




I found this :
https://github.com/ShijunDeng/qcow2-dump

and the transcript (beautiful colors when viewed with "more") is attached :




--
Nicolas ECARNOT
Le script a débuté sur mar. 13 févr. 2018 17:31:05 CET
]0;root@serv-hv-adm13:/home[?1034h[01;32mroot@serv-hv-adm13[00m:[01;34m/home[00m#
 /root/qcow2-dump -m check serv-term-adm4-corr.qcow2.img
[1;32m
File:[1;36m serv-term-adm4-corr.qcow2.img
[0m

magic: 0x514649fb
version: [1;36m2
[0mbacking_file_offset: 0x0
backing_file_size: 0
fs_type: [1;32mxfs
[0mvirtual_size: 64424509440 / 61440M / 60G
disk_size: 36507222016 / 34816M / 34G
seek_end: 36507222016 [[1;32m0x88000[0m] / 34816M / 34G
cluster_bits: [1;36m16
[0mcluster_size: [1;36m65536
[0mcrypt_method: 0
csize_shift: 54
csize_mask: 255
cluster_offset_mask: [1;36m0x3f
[0ml1_table_offset: [1;32m0x76a46
[0ml1_size: [1;32m120
[0ml1_vm_state_index: [1;32m120
[0ml2_size: [1;36m8192
[0mrefcount_order: [1;36m4
[0mrefcount_bits: [1;36m16
[0mrefcount_block_bits: [1;36m15
[0mrefcount_block_size: [1;36m32768
[0mrefcount_table_offset: [1;32m0x1
[0mrefcount_table_clusters: [1;32m1
[0msnapshots_offset: [1;32m0x0
[0mnb_snapshots: [1;32m0
[0mincompatible_features: 
compatible_features: 
autoclear_features: 



[1;32mActive Snapshot:
[0m
L1 Table:   [offset: 0x76a46, len: 120]

[1;36mResult:
[0mL1 Table:   unaligned: [1;33m0, [0minvalid: [1;33m0, [0munused: 53, 
used: 67
L2 Table:   unaligned: [1;33m0, [0minvalid: [1;33m0, [0munused: 20304, 
used: 528560



[1;32mRefcount Table:
[0m
Refcount Table: [offset: 0x1, len: 8192]

[1;36mResult:
[0mRefcount Table: unaligned: [1;33m0, [0minvalid: [1;33m0, [0munused: 
8175, used: 17
Refcount:   error: [1;33m4342, [0mleak: [1;33m0, [0munused: 28426, 
used: 524288



[1;32mCOPIED OFLAG:
[0m

[1;36mResult:
[0mL1 Table ERROR OFLAG_COPIED: [1;33m1
[0mL2 Table ERROR OFLAG_COPIED: [1;33m4323
[0mActive L2 COPIED: [1;33m528560 [34639708160 / 33035M / 32G]

[0m

[1;32mActive Cluster:
[0m
[1;36m
Result:
[0mActive Cluster: reuse: [1;33m17

[0m

[1;31mSummary:
[0mpreallocation:  [1;32moff
[0mActive Cluster: [1;31mreuse: 17
[0mRefcount Table: [1;33munaligned: 0, [0m[1;33minvalid: 0, [0munused: 
8175, used: 17
Refcount:   [1;33merror: [0m[1;31m4342, [0m[1;33mleak: 0, 
[0m[1;31mrebuild: 4325, [0munused: 28426, used: 524288
L1 Table:   [1;33munaligned: 0, [0m[1;33minvalid: 0, [0munused: 53, 
used: 67
[1;33moflag copied: [0m[1;31m1
[0mL2 Table:   [1;33munaligned: 0, [0m[1;33minvalid: 0, [0munused: 
20304, used: 528560
[1;33moflag copied: [0m[1;31m4323
[0m

###[5;31m qcow2 image has refcount errors!   (=_=#)[0m###
###[5;31mand qcow2 image has copied errors!  (o_0)?[0m###
###[5;31m  Sadly: refcount error cause active cluster reused! Orz[0m  ###
###[1;33m Please backup this image and contact the author![0m ###



]0;root@serv-hv-adm13:/home[01;32mroot@serv-hv-adm13[00m:[01;34m/home[00m#
 exit

Script terminé sur mar. 13 févr. 2018 17:31:13 CET

Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI

2018-02-13 Thread Kevin Wolf

Am 13.02.2018 um 16:30 hat Daniel P. Berrangé geschrieben:
> On Tue, Feb 13, 2018 at 04:23:21PM +0100, Kevin Wolf wrote:
> > Am 13.02.2018 um 15:58 hat Daniel P. Berrangé geschrieben:
> > > On Tue, Feb 13, 2018 at 03:43:10PM +0100, Kevin Wolf wrote:
> > > > Am 13.02.2018 um 15:30 hat Roman Kagan geschrieben:
> > > > > On Tue, Feb 13, 2018 at 11:50:24AM +0100, Kevin Wolf wrote:
> > > > > > Am 11.01.2018 um 14:04 hat Daniel P. Berrange geschrieben:
> > > > > > > Then you could just use the regular migrate QMP commands for 
> > > > > > > loading
> > > > > > > and saving snapshots.
> > > > > > 
> > > > > > Yes, you could. I think for a proper implementation you would want 
> > > > > > to do
> > > > > > better, though. Live migration provides just a stream, but that's 
> > > > > > not
> > > > > > really well suited for snapshots. When a RAM page is dirtied, you 
> > > > > > just
> > > > > > want to overwrite the old version of it in a snapshot [...]
> > > > > 
> > > > > This means the point in time where the guest state is snapshotted is 
> > > > > not
> > > > > when the command is issued, but any unpredictable amount of time 
> > > > > later.
> > > > > 
> > > > > I'm not sure this is what a user expects.
> > > > 
> > > > I don't think it's necessarily a big problem as long as you set the
> > > > expectations right, but good point anyway.
> > > > 
> > > > > A better approach for the save part appears to be to stop the vcpus,
> > > > > dump the device state, resume the vcpus, and save the memory contents
> > > > > in the background, prioritizing the old copies of the pages that
> > > > > change.
> > > > 
> > > > So basically you would let the guest fault whenever it writes to a page
> > > > that is not saved yet, and then save it first before you make the page
> > > > writable again? Essentially blockdev-backup, except for RAM.
> > > 
> > > The page fault servicing will be delayed by however long it takes to
> > > write the page to underling storage, which could be considerable with
> > > non-SSD. So guest performance could be significantly impacted on slow
> > > storage with high dirtying rate. On the flip side it gurantees a live
> > > snapshot would complete in finite time which is good.
> > 
> > You can just use a bounce buffer for writing out the old page. Then the
> > VM is only stopped for the duration of a malloc() + memcpy().
> 
> The would allow QEMU memory usage to balloon up to x2  RAM, if there was
> slow storage and very fast dirtying rate. I don't think that's viable
> unless there was a cap on how much bounce buffering we would allow before
> just blocking the page faults

Yes, you'd probably want to do this. But anyway, unless the guest is
under really heavy load, allowing just a few MB to be dirtied without
waiting for I/O will make the guest a lot more responsive immediately
after taking a snapshot.

Kevin

Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI

2018-02-13 Thread Denis V. Lunev

On 02/13/2018 06:27 PM, Dr. David Alan Gilbert wrote:
> * Roman Kagan (rka...@virtuozzo.com) wrote:
>> On Tue, Feb 13, 2018 at 03:05:03PM +, Dr. David Alan Gilbert wrote:
>>> * Denis V. Lunev (d...@virtuozzo.com) wrote:
 On 02/13/2018 05:59 PM, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrangé (berra...@redhat.com) wrote:
>> That doesn't seem practical unless you can instantaneously write out
>> the entire guest RAM to disk without blocking, or can somehow snapshot
>> the RAM so you can write out a consistent view of the original RAM,
>> while the guest continues to dirty RAM pages.
> People have suggested doing something like that with userfault write
> mode; but the same would also be doable just by write protecting the
> whole of RAM and then following the faults.
 nope, userfault fd does not help :( We have tried, the functionality is not
 enough. Better to have small extension to KVM to protect all memory
 and notify QEMU with accessed address.
>>> Can you explain why? I thought the write-protect mode of userfaultfd was
>>> supposed to be able to do that; cc'ing in Andrea
>> IIRC it would if it worked; it just didn't when we tried.
> Hmm that doesn't seem to be the ideal reason to create new KVM
> functionality, especially since there were a bunch of people wanting the
> userfaultfd-write mode for other uses.
This does not seems to work in practice. We status with that is the
same for more than a year and a half AFAIR. Thus there is a lot
of sense to create something feasible as async snapshots are
badly wanted.

Den

Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI

2018-02-13 Thread Daniel P . Berrangé

On Tue, Feb 13, 2018 at 04:23:21PM +0100, Kevin Wolf wrote:
> Am 13.02.2018 um 15:58 hat Daniel P. Berrangé geschrieben:
> > On Tue, Feb 13, 2018 at 03:43:10PM +0100, Kevin Wolf wrote:
> > > Am 13.02.2018 um 15:30 hat Roman Kagan geschrieben:
> > > > On Tue, Feb 13, 2018 at 11:50:24AM +0100, Kevin Wolf wrote:
> > > > > Am 11.01.2018 um 14:04 hat Daniel P. Berrange geschrieben:
> > > > > > Then you could just use the regular migrate QMP commands for loading
> > > > > > and saving snapshots.
> > > > > 
> > > > > Yes, you could. I think for a proper implementation you would want to 
> > > > > do
> > > > > better, though. Live migration provides just a stream, but that's not
> > > > > really well suited for snapshots. When a RAM page is dirtied, you just
> > > > > want to overwrite the old version of it in a snapshot [...]
> > > > 
> > > > This means the point in time where the guest state is snapshotted is not
> > > > when the command is issued, but any unpredictable amount of time later.
> > > > 
> > > > I'm not sure this is what a user expects.
> > > 
> > > I don't think it's necessarily a big problem as long as you set the
> > > expectations right, but good point anyway.
> > > 
> > > > A better approach for the save part appears to be to stop the vcpus,
> > > > dump the device state, resume the vcpus, and save the memory contents
> > > > in the background, prioritizing the old copies of the pages that
> > > > change.
> > > 
> > > So basically you would let the guest fault whenever it writes to a page
> > > that is not saved yet, and then save it first before you make the page
> > > writable again? Essentially blockdev-backup, except for RAM.
> > 
> > The page fault servicing will be delayed by however long it takes to
> > write the page to underling storage, which could be considerable with
> > non-SSD. So guest performance could be significantly impacted on slow
> > storage with high dirtying rate. On the flip side it gurantees a live
> > snapshot would complete in finite time which is good.
> 
> You can just use a bounce buffer for writing out the old page. Then the
> VM is only stopped for the duration of a malloc() + memcpy().

The would allow QEMU memory usage to balloon up to x2  RAM, if there was
slow storage and very fast dirtying rate. I don't think that's viable
unless there was a cap on how much bounce buffering we would allow before
just blocking the page faults

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|

Re: [Qemu-block] [PATCH 0/7] block: Preallocated truncation for gluster and sheepdog

2018-02-13 Thread Kevin Wolf

Am 13.02.2018 um 14:03 hat Max Reitz geschrieben:
> As far as I can see, these are the only protocols beside file-posix that
> support preallocated creation.  In contrast to file-posix, however, they
> have not supported preallocated truncation so far.  This series brings
> their truncation code to feature parity with their creation code in this
> regard.
> 
> Note that I do not have a test setup for either of the two drivers, so I
> do not actually know whether this works.  Anyone with a setup is more
> than welcome to test this series.

Thanks, applied to the block branch.

We'll see if someone screams once this hits master.

Kevin

Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI

2018-02-13 Thread Dr. David Alan Gilbert

* Roman Kagan (rka...@virtuozzo.com) wrote:
> On Tue, Feb 13, 2018 at 03:05:03PM +, Dr. David Alan Gilbert wrote:
> > * Denis V. Lunev (d...@virtuozzo.com) wrote:
> > > On 02/13/2018 05:59 PM, Dr. David Alan Gilbert wrote:
> > > > * Daniel P. Berrangé (berra...@redhat.com) wrote:
> > > >> That doesn't seem practical unless you can instantaneously write out
> > > >> the entire guest RAM to disk without blocking, or can somehow snapshot
> > > >> the RAM so you can write out a consistent view of the original RAM,
> > > >> while the guest continues to dirty RAM pages.
> > > > People have suggested doing something like that with userfault write
> > > > mode; but the same would also be doable just by write protecting the
> > > > whole of RAM and then following the faults.
> > > 
> > > nope, userfault fd does not help :( We have tried, the functionality is 
> > > not
> > > enough. Better to have small extension to KVM to protect all memory
> > > and notify QEMU with accessed address.
> > 
> > Can you explain why? I thought the write-protect mode of userfaultfd was
> > supposed to be able to do that; cc'ing in Andrea
> 
> IIRC it would if it worked; it just didn't when we tried.

Hmm that doesn't seem to be the ideal reason to create new KVM
functionality, especially since there were a bunch of people wanting the
userfaultfd-write mode for other uses.

Dave

> Roman.
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [Qemu-block] qcow2 images corruption

2018-02-13 Thread Nicolas Ecarnot


Hello Kevin,

Le 13/02/2018 à 10:41, Kevin Wolf a écrit :

Am 07.02.2018 um 18:06 hat Nicolas Ecarnot geschrieben:

TL; DR : qcow2 images keep getting corrupted. Any workaround?


Not without knowing the cause.


Actually, my main concern is mostly about finding the cause rather than 
correcting my corrupted VMs.


Another way to say it : I prefer to help oVirt than help myself.


The first thing to make sure is that the image isn't touched by a second
process while QEMU is running a VM.


Indeed, I read some BZ about this issue : they were raised by a user who 
ran some qemu-img commands on a "mounted" image, thus leading to some 
corruption.
In my case, I'm not playing with this, and the corrupted VMs were only 
touched by classical oVirt actions.



The classic one is using 'qemu-img
snapshot' on the image of a running VM, which is instant corruption (and
newer QEMU versions have locking in place to prevent this), but we have
seen more absurd cases of things outside QEMU tampering with the image
when we were investigating previous corruption reports.

This covers the majority of all reports, we haven't had a real
corruption caused by a QEMU bug in ages.


May I ask after what QEMU version this kind of locking has been added.
As I wrote, our oVirt setup is 3.6 so not recent.




After having found (https://access.redhat.com/solutions/1173623) the right
logical volume hosting the qcow2 image, I can run qemu-img check on it.
- On 80% of my VMs, I find no errors.
- On 15% of them, I find Leaked cluster errors that I can correct using
"qemu-img check -r all"
- On 5% of them, I find Leaked clusters errors and further fatal errors,
which can not be corrected with qemu-img.
In rare cases, qemu-img can correct them, but destroys large parts of the
image (becomes unusable), and on other cases it can not correct them at all.


It would be good if you could make the 'qemu-img check' output available
somewhere.


See attachment.



It would be even better if we could have a look at the respective image.
I seem to remember that John (CCed) had a few scripts to analyse
corrupted qcow2 images, maybe we would be able to see something there.


I just exported it like this :
qemu-img convert /dev/the_correct_path /home/blablah.qcow2.img

The resulting file is 32G and I need an idea to transfer this img to you.




What I read similar to my case is :
- usage of qcow2
- heavy disk I/O
- using the virtio-blk driver

In the proxmox thread, they tend to say that using virtio-scsi is the
solution. Having asked this question to oVirt experts
(https://lists.ovirt.org/pipermail/users/2018-February/086753.html) but it's
not clear the driver is to blame.


This seems very unlikely. The corruption you're seeing is in the qcow2
metadata, not only in the guest data.


Are you saying:
- the corruption is in the metadata and in the guest data
OR
- the corruption is only in the metadata
?


If anything, virtio-scsi exercises
more qcow2 code paths than virtio-blk, so any potential bug that affects
virtio-blk should also affect virtio-scsi, but not the other way around.


I get that.




I agree with the answer Yaniv Kaul gave to me, saying I have to properly
report the issue, so I'm longing to know which peculiar information I can
give you now.


To be honest, debugging corruption after the fact is pretty hard. We'd
need the 'qemu-img check' output


Done.


and ideally the image to do anything,


I remember some Redhat people once gave me a temporary access to put 
heavy file on some dedicated server. Is it still possible?



but I can't promise that anything would come out of this.

Best would be a reproducer, or at least some operation that you can link
to the appearance of the corruption. Then we could take a more targeted
look at the respective code.


Sure.
Alas I find no obvious pattern leading to corruption :
From the guest side, it appeared with windows 2003, 2008, 2012, linux 
centOS 6 and 7. It appeared with virtio-blk; and I changed some VMs to 
used virtio-scsi but it's too soon to see appearance of corruption in 
that case.
As I said, I'm using snapshots VERY rarely, and our versions are too old 
so we do them the cold way only (VM shutdown). So very safely.
The "weirdest" thing we do is to migrate VMs : you see how conservative 
we are!



As you can imagine, all this setup is in production, and for most of the
VMs, I can not "play" with them. Moreover, we launched a campaign of nightly
stopping every VM, qemu-img check them one by one, then boot.
So it might take some time before I find another corrupted image.
(which I'll preciously store for debug)

Other informations : We very rarely do snapshots, but I'm close to imagine
that automated migrations of VMs could trigger similar behaviors on qcow2
images.


To my knowledge, oVirt only uses external snapshots and creates them
with QMP. This should be perfectly safe because from the perspective of
the qcow2 image being snapshotted, it just means that it gets no new
write

Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI

2018-02-13 Thread Kevin Wolf

Am 13.02.2018 um 15:58 hat Daniel P. Berrangé geschrieben:
> On Tue, Feb 13, 2018 at 03:43:10PM +0100, Kevin Wolf wrote:
> > Am 13.02.2018 um 15:30 hat Roman Kagan geschrieben:
> > > On Tue, Feb 13, 2018 at 11:50:24AM +0100, Kevin Wolf wrote:
> > > > Am 11.01.2018 um 14:04 hat Daniel P. Berrange geschrieben:
> > > > > Then you could just use the regular migrate QMP commands for loading
> > > > > and saving snapshots.
> > > > 
> > > > Yes, you could. I think for a proper implementation you would want to do
> > > > better, though. Live migration provides just a stream, but that's not
> > > > really well suited for snapshots. When a RAM page is dirtied, you just
> > > > want to overwrite the old version of it in a snapshot [...]
> > > 
> > > This means the point in time where the guest state is snapshotted is not
> > > when the command is issued, but any unpredictable amount of time later.
> > > 
> > > I'm not sure this is what a user expects.
> > 
> > I don't think it's necessarily a big problem as long as you set the
> > expectations right, but good point anyway.
> > 
> > > A better approach for the save part appears to be to stop the vcpus,
> > > dump the device state, resume the vcpus, and save the memory contents
> > > in the background, prioritizing the old copies of the pages that
> > > change.
> > 
> > So basically you would let the guest fault whenever it writes to a page
> > that is not saved yet, and then save it first before you make the page
> > writable again? Essentially blockdev-backup, except for RAM.
> 
> The page fault servicing will be delayed by however long it takes to
> write the page to underling storage, which could be considerable with
> non-SSD. So guest performance could be significantly impacted on slow
> storage with high dirtying rate. On the flip side it gurantees a live
> snapshot would complete in finite time which is good.

You can just use a bounce buffer for writing out the old page. Then the
VM is only stopped for the duration of a malloc() + memcpy().

Kevin

Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI

2018-02-13 Thread Dr. David Alan Gilbert

* Denis V. Lunev (d...@virtuozzo.com) wrote:
> On 02/13/2018 05:59 PM, Dr. David Alan Gilbert wrote:
> > * Daniel P. Berrangé (berra...@redhat.com) wrote:
> >> On Tue, Feb 13, 2018 at 03:45:21PM +0100, Kevin Wolf wrote:
> >>> Am 13.02.2018 um 15:36 hat Daniel P. Berrangé geschrieben:
>  On Tue, Feb 13, 2018 at 05:30:02PM +0300, Roman Kagan wrote:
> > On Tue, Feb 13, 2018 at 11:50:24AM +0100, Kevin Wolf wrote:
> >> Am 11.01.2018 um 14:04 hat Daniel P. Berrange geschrieben:
> >>> Then you could just use the regular migrate QMP commands for loading
> >>> and saving snapshots.
> >> Yes, you could. I think for a proper implementation you would want to 
> >> do
> >> better, though. Live migration provides just a stream, but that's not
> >> really well suited for snapshots. When a RAM page is dirtied, you just
> >> want to overwrite the old version of it in a snapshot [...]
> > This means the point in time where the guest state is snapshotted is not
> > when the command is issued, but any unpredictable amount of time later.
> >
> > I'm not sure this is what a user expects.
> >
> > A better approach for the save part appears to be to stop the vcpus,
> > dump the device state, resume the vcpus, and save the memory contents in
> > the background, prioritizing the old copies of the pages that change.
> > No multiple copies of the same page would have to be saved so the stream
> > format would be fine.  For the load part the usual inmigrate should
> > work.
>  No, that's policy decision that doesn't matter from QMP pov. If the mgmt
>  app wants the snapshot to be wrt to the initial time, it can simply
>  invoke the "stop" QMP command before doing the live migration and
>  "cont" afterwards.
> >>> That would be non-live. I think Roman means a live snapshot that saves
> >>> the state at the beginning of the operation. Basically the difference
> >>> between blockdev-backup (state at the beginning) and blockdev-mirror
> >>> (state at the end), except for a whole VM.
> >> That doesn't seem practical unless you can instantaneously write out
> >> the entire guest RAM to disk without blocking, or can somehow snapshot
> >> the RAM so you can write out a consistent view of the original RAM,
> >> while the guest continues to dirty RAM pages.
> > People have suggested doing something like that with userfault write
> > mode; but the same would also be doable just by write protecting the
> > whole of RAM and then following the faults.
> 
> nope, userfault fd does not help :( We have tried, the functionality is not
> enough. Better to have small extension to KVM to protect all memory
> and notify QEMU with accessed address.

Can you explain why? I thought the write-protect mode of userfaultfd was
supposed to be able to do that; cc'ing in Andrea

Dave

> Den
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [Qemu-block] [Qemu-devel] [PATCH 5/7] sheepdog: Make sd_prealloc() take a BDS

2018-02-13 Thread Eric Blake


On 02/13/2018 07:03 AM, Max Reitz wrote:

We want to use this function in sd_truncate() later on, so taking a
filename is not exactly ideal.

Signed-off-by: Max Reitz 
---
  block/sheepdog.c | 29 +
  1 file changed, 21 insertions(+), 8 deletions(-)




@@ -1837,10 +1837,11 @@ static int sd_prealloc(const char *filename, Error 
**errp)
  void *buf = NULL;
  int ret;
  
-blk = blk_new_open(filename, NULL, NULL,

-   BDRV_O_RDWR | BDRV_O_RESIZE | BDRV_O_PROTOCOL, errp);
-if (blk == NULL) {
-ret = -EIO;
+blk = blk_new(BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE | BLK_PERM_RESIZE,
+  BLK_PERM_ALL);
+
+ret = blk_insert_bs(blk, bs, errp);
+if (ret < 0) {




  if (prealloc) {
-ret = sd_prealloc(filename, errp);
+BlockDriverState *bs;
+QDict *opts;
+
+opts = qdict_new();
+qdict_put_str(opts, "driver", "sheepdog");
+bs = bdrv_open(filename, NULL, opts, BDRV_O_PROTOCOL | BDRV_O_RDWR,
+   errp);


I'm a bit weak on ensuring the same ending flags result after the new 
bdrv_open()/blk_new() as what you used to get on blk_new_open(), so 
someone that can actually test this is appreciated.


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI

2018-02-13 Thread Denis V. Lunev

On 02/13/2018 05:59 PM, Dr. David Alan Gilbert wrote:
> * Daniel P. Berrangé (berra...@redhat.com) wrote:
>> On Tue, Feb 13, 2018 at 03:45:21PM +0100, Kevin Wolf wrote:
>>> Am 13.02.2018 um 15:36 hat Daniel P. Berrangé geschrieben:
 On Tue, Feb 13, 2018 at 05:30:02PM +0300, Roman Kagan wrote:
> On Tue, Feb 13, 2018 at 11:50:24AM +0100, Kevin Wolf wrote:
>> Am 11.01.2018 um 14:04 hat Daniel P. Berrange geschrieben:
>>> Then you could just use the regular migrate QMP commands for loading
>>> and saving snapshots.
>> Yes, you could. I think for a proper implementation you would want to do
>> better, though. Live migration provides just a stream, but that's not
>> really well suited for snapshots. When a RAM page is dirtied, you just
>> want to overwrite the old version of it in a snapshot [...]
> This means the point in time where the guest state is snapshotted is not
> when the command is issued, but any unpredictable amount of time later.
>
> I'm not sure this is what a user expects.
>
> A better approach for the save part appears to be to stop the vcpus,
> dump the device state, resume the vcpus, and save the memory contents in
> the background, prioritizing the old copies of the pages that change.
> No multiple copies of the same page would have to be saved so the stream
> format would be fine.  For the load part the usual inmigrate should
> work.
 No, that's policy decision that doesn't matter from QMP pov. If the mgmt
 app wants the snapshot to be wrt to the initial time, it can simply
 invoke the "stop" QMP command before doing the live migration and
 "cont" afterwards.
>>> That would be non-live. I think Roman means a live snapshot that saves
>>> the state at the beginning of the operation. Basically the difference
>>> between blockdev-backup (state at the beginning) and blockdev-mirror
>>> (state at the end), except for a whole VM.
>> That doesn't seem practical unless you can instantaneously write out
>> the entire guest RAM to disk without blocking, or can somehow snapshot
>> the RAM so you can write out a consistent view of the original RAM,
>> while the guest continues to dirty RAM pages.
> People have suggested doing something like that with userfault write
> mode; but the same would also be doable just by write protecting the
> whole of RAM and then following the faults.

nope, userfault fd does not help :( We have tried, the functionality is not
enough. Better to have small extension to KVM to protect all memory
and notify QEMU with accessed address.

Den

Re: [Qemu-block] [Qemu-devel] [PATCH 6/7] sheepdog: Pass old and new size to sd_prealloc()

2018-02-13 Thread Eric Blake


On 02/13/2018 07:03 AM, Max Reitz wrote:

sd_prealloc() will now preallocate the area [old_size, new_size).  As
before, it rounds to buf_size and may thus overshoot and preallocate
areas that were not requested to be preallocated.  For image creation,
this is no change in behavior.  For truncation, this is in accordance
with the documentation for preallocated truncation.

Signed-off-by: Max Reitz 
---
  block/sheepdog.c | 16 +---
  1 file changed, 5 insertions(+), 11 deletions(-)




@@ -1847,19 +1847,13 @@ static int sd_prealloc(BlockDriverState *bs, Error 
**errp)
  
  blk_set_allow_write_beyond_eof(blk, true);
  
-vdi_size = blk_getlength(blk);

-if (vdi_size < 0) {
-ret = vdi_size;
-goto out;
-}
-



@@ -2119,7 +2113,7 @@ static int sd_create(const char *filename, QemuOpts *opts,
  goto out;
  }
  
-ret = sd_prealloc(bs, errp);

+ret = sd_prealloc(bs, 0, s->inode.vdi_size, errp);


Nice - you also got rid of a potential failure in blk_getlength().

Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI

2018-02-13 Thread Dr. David Alan Gilbert

* Daniel P. Berrangé (berra...@redhat.com) wrote:
> On Tue, Feb 13, 2018 at 03:45:21PM +0100, Kevin Wolf wrote:
> > Am 13.02.2018 um 15:36 hat Daniel P. Berrangé geschrieben:
> > > On Tue, Feb 13, 2018 at 05:30:02PM +0300, Roman Kagan wrote:
> > > > On Tue, Feb 13, 2018 at 11:50:24AM +0100, Kevin Wolf wrote:
> > > > > Am 11.01.2018 um 14:04 hat Daniel P. Berrange geschrieben:
> > > > > > Then you could just use the regular migrate QMP commands for loading
> > > > > > and saving snapshots.
> > > > > 
> > > > > Yes, you could. I think for a proper implementation you would want to 
> > > > > do
> > > > > better, though. Live migration provides just a stream, but that's not
> > > > > really well suited for snapshots. When a RAM page is dirtied, you just
> > > > > want to overwrite the old version of it in a snapshot [...]
> > > > 
> > > > This means the point in time where the guest state is snapshotted is not
> > > > when the command is issued, but any unpredictable amount of time later.
> > > > 
> > > > I'm not sure this is what a user expects.
> > > >
> > > > A better approach for the save part appears to be to stop the vcpus,
> > > > dump the device state, resume the vcpus, and save the memory contents in
> > > > the background, prioritizing the old copies of the pages that change.
> > > > No multiple copies of the same page would have to be saved so the stream
> > > > format would be fine.  For the load part the usual inmigrate should
> > > > work.
> > > 
> > > No, that's policy decision that doesn't matter from QMP pov. If the mgmt
> > > app wants the snapshot to be wrt to the initial time, it can simply
> > > invoke the "stop" QMP command before doing the live migration and
> > > "cont" afterwards.
> > 
> > That would be non-live. I think Roman means a live snapshot that saves
> > the state at the beginning of the operation. Basically the difference
> > between blockdev-backup (state at the beginning) and blockdev-mirror
> > (state at the end), except for a whole VM.
> 
> That doesn't seem practical unless you can instantaneously write out
> the entire guest RAM to disk without blocking, or can somehow snapshot
> the RAM so you can write out a consistent view of the original RAM,
> while the guest continues to dirty RAM pages.

People have suggested doing something like that with userfault write
mode; but the same would also be doable just by write protecting the
whole of RAM and then following the faults.

Dave

> Regards,
> Daniel
> -- 
> |: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o-https://fstop138.berrange.com :|
> |: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK

Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI

2018-02-13 Thread Daniel P . Berrangé

On Tue, Feb 13, 2018 at 03:43:10PM +0100, Kevin Wolf wrote:
> Am 13.02.2018 um 15:30 hat Roman Kagan geschrieben:
> > On Tue, Feb 13, 2018 at 11:50:24AM +0100, Kevin Wolf wrote:
> > > Am 11.01.2018 um 14:04 hat Daniel P. Berrange geschrieben:
> > > > Then you could just use the regular migrate QMP commands for loading
> > > > and saving snapshots.
> > > 
> > > Yes, you could. I think for a proper implementation you would want to do
> > > better, though. Live migration provides just a stream, but that's not
> > > really well suited for snapshots. When a RAM page is dirtied, you just
> > > want to overwrite the old version of it in a snapshot [...]
> > 
> > This means the point in time where the guest state is snapshotted is not
> > when the command is issued, but any unpredictable amount of time later.
> > 
> > I'm not sure this is what a user expects.
> 
> I don't think it's necessarily a big problem as long as you set the
> expectations right, but good point anyway.
> 
> > A better approach for the save part appears to be to stop the vcpus,
> > dump the device state, resume the vcpus, and save the memory contents
> > in the background, prioritizing the old copies of the pages that
> > change.
> 
> So basically you would let the guest fault whenever it writes to a page
> that is not saved yet, and then save it first before you make the page
> writable again? Essentially blockdev-backup, except for RAM.

The page fault servicing will be delayed by however long it takes to
write the page to underling storage, which could be considerable with
non-SSD. So guest performance could be significantly impacted on slow
storage with high dirtying rate. On the flip side it gurantees a live
snapshot would complete in finite time which is good.

Regards,
Daniel
-- 
|: https://berrange.com  -o-https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o-https://fstop138.berrange.com :|
|: https://entangle-photo.org-o-https://www.instagram.com/dberrange :|

Re: [Qemu-block] [Qemu-devel] [PATCH 7/7] sheepdog: Allow fully preallocated truncation

2018-02-13 Thread Eric Blake


On 02/13/2018 07:03 AM, Max Reitz wrote:

Signed-off-by: Max Reitz 
---
  block/sheepdog.c | 15 ---
  1 file changed, 12 insertions(+), 3 deletions(-)


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-block] [Qemu-devel] [PATCH 1/7] gluster: Move glfs_close() to create's clean-up

2018-02-13 Thread Eric Blake


On 02/13/2018 07:03 AM, Max Reitz wrote:

glfs_close() is a classical clean-up operation, as can be seen by the
fact that it is executed even if the truncation before it failed.
Also, moving it to clean-up makes it more clear that if it fails, we do
not want it to overwrite the current ret value if that signifies an
error already.

Signed-off-by: Max Reitz 
---
  block/gluster.c | 10 ++
  1 file changed, 6 insertions(+), 4 deletions(-)



Reviewed-by: Eric Blake 


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-block] [Qemu-devel] [PATCH 4/7] gluster: Add preallocated truncation

2018-02-13 Thread Eric Blake


On 02/13/2018 07:03 AM, Max Reitz wrote:

By using qemu_do_cluster_truncate() in qemu_cluster_truncate(), we now
automatically have preallocated truncation.

Signed-off-by: Max Reitz 
---
  block/gluster.c | 17 +
  1 file changed, 1 insertion(+), 16 deletions(-)



Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-block] [Qemu-devel] [PATCH 3/7] gluster: Query current size in do_truncate()

2018-02-13 Thread Eric Blake


On 02/13/2018 07:03 AM, Max Reitz wrote:

Instead of expecting the current size to be 0, query it and allocate
only the area [current_size, offset) if preallocation is requested.

Signed-off-by: Max Reitz 
---
  block/gluster.c | 21 +++--
  1 file changed, 19 insertions(+), 2 deletions(-)



I don't have a gluster setup handy to test this, but the concept matches 
file-posix and the change looks reasonable.


Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-block] [Qemu-devel] [PATCH 2/7] gluster: Pull truncation from qemu_gluster_create

2018-02-13 Thread Eric Blake


On 02/13/2018 07:03 AM, Max Reitz wrote:

Pull out the truncation code from the qemu_cluster_create() function so
we can later reuse it in qemu_gluster_truncate().

Signed-off-by: Max Reitz 
---
  block/gluster.c | 74 +++--
  1 file changed, 40 insertions(+), 34 deletions(-)



Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-block] [PATCH v4 31/39] qcow2: Update qcow2_truncate() to support L2 slices

2018-02-13 Thread Eric Blake


On 02/05/2018 08:33 AM, Alberto Garcia wrote:

The qcow2_truncate() code is mostly independent from whether
we're using L2 slices or full L2 tables, but in full and
falloc preallocation modes new L2 tables are allocated using
qcow2_alloc_cluster_link_l2().  Therefore the code needs to be
modified to ensure that all nb_clusters that are processed in each
call can be allocated with just one L2 slice.

Signed-off-by: Alberto Garcia 
---
  block/qcow2.c | 6 +++---
  1 file changed, 3 insertions(+), 3 deletions(-)



Reviewed-by: Eric Blake 

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

Re: [Qemu-block] [Qemu-devel] [PATCH 1/2] Add save-snapshot, load-snapshot and delete-snapshot to QAPI

2018-02-13 Thread Denis V. Lunev

On 02/13/2018 05:48 PM, Daniel P. Berrangé wrote:
> On Tue, Feb 13, 2018 at 03:45:21PM +0100, Kevin Wolf wrote:
>> Am 13.02.2018 um 15:36 hat Daniel P. Berrangé geschrieben:
>>> On Tue, Feb 13, 2018 at 05:30:02PM +0300, Roman Kagan wrote:
 On Tue, Feb 13, 2018 at 11:50:24AM +0100, Kevin Wolf wrote:
> Am 11.01.2018 um 14:04 hat Daniel P. Berrange geschrieben:
>> Then you could just use the regular migrate QMP commands for loading
>> and saving snapshots.
> Yes, you could. I think for a proper implementation you would want to do
> better, though. Live migration provides just a stream, but that's not
> really well suited for snapshots. When a RAM page is dirtied, you just
> want to overwrite the old version of it in a snapshot [...]
 This means the point in time where the guest state is snapshotted is not
 when the command is issued, but any unpredictable amount of time later.

 I'm not sure this is what a user expects.

 A better approach for the save part appears to be to stop the vcpus,
 dump the device state, resume the vcpus, and save the memory contents in
 the background, prioritizing the old copies of the pages that change.
 No multiple copies of the same page would have to be saved so the stream
 format would be fine.  For the load part the usual inmigrate should
 work.
>>> No, that's policy decision that doesn't matter from QMP pov. If the mgmt
>>> app wants the snapshot to be wrt to the initial time, it can simply
>>> invoke the "stop" QMP command before doing the live migration and
>>> "cont" afterwards.
>> That would be non-live. I think Roman means a live snapshot that saves
>> the state at the beginning of the operation. Basically the difference
>> between blockdev-backup (state at the beginning) and blockdev-mirror
>> (state at the end), except for a whole VM.
> That doesn't seem practical unless you can instantaneously write out
> the entire guest RAM to disk without blocking, or can somehow snapshot
> the RAM so you can write out a consistent view of the original RAM,
> while the guest continues to dirty RAM pages.
>
> Regards,
> Daniel
we have working prototype of that right now, not yet ready to be
submitted even as RFC. In practice here we are spoken about
snapshots with memory as "The equivalent HMP command is savevm."
Thus all Roman's concerns are here.

Den

Re: [Qemu-block] [PATCH v4 00/39] Allow configuring the qcow2 L2 cache entry size

2018-02-13 Thread Max Reitz

On 2018-02-05 15:33, Alberto Garcia wrote:
> this is the new revision of the patch series to allow configuring the
> entry size of the qcow2 L2 cache. Follow this link for the full
> description from the first version:
> 
>https://lists.gnu.org/archive/html/qemu-block/2017-10/msg00458.html
> 
> And here are some numbers showing the performance improvements:
> 
>https://lists.gnu.org/archive/html/qemu-block/2017-12/msg00507.html
> 
> Regards,
> 
> Berto

Thanks, applied to my block branch
(with s/intead/instead/ and s/=300/=4242/):

https://github.com/XanClic/qemu/commits/block

Max



signature.asc
Description: OpenPGP digital signature

1 2 >

1 - 100 of 124 matches

Mail list logo