
Re: [RFC PATCH v1 2/7] char-socket: return -1 in case of disconnect during tcp_chr_write

2020-04-24 Thread Anton Nefedov

> On Thu, Apr 23, 2020 at 8:43 PM Dima Stepanov  wrote:
>> The problem is that vhost_user_write doesn't get an error after
>> disconnect and try to call vhost_user_read(). The tcp_chr_write()
>> routine should return -1 in case of disconnect. Indicate the EIO error
>> if this routine is called in the disconnected state.
>>
>> Signed-off-by: Dima Stepanov 
>> ---
>>   chardev/char-socket.c | 8 +++++---
>>   1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/chardev/char-socket.c b/chardev/char-socket.c
>> index 185fe38..c128cca 100644
>> --- a/chardev/char-socket.c
>> +++ b/chardev/char-socket.c
>> @@ -175,14 +175,16 @@ static int tcp_chr_write(Chardev *chr, const uint8_t *buf, int len)
>>   if (ret < 0 && errno != EAGAIN) {
>>   if (tcp_chr_read_poll(chr) <= 0) {
>>   tcp_chr_disconnect_locked(chr);
>> -return len;
>> +/* Return an error since we made a disconnect. */
>> +return ret;
> 
> Looks ok, but this return was introduced in commit
> b0a335e351103bf92f3f9d0bd5759311be8156ac ("qemu-char: socket backend:
> disconnect on write error"). It doesn't say why it didn't return -1
> though. Anton, could you review? thanks
> 

hej,

I think I had no special intent but to repeat the behaviour as in the
snippet below, that is to return @len when the socket is disconnected.

Seems that tcp_chr_write() worked that way since the very beginning
(commit 0bab00f).

It looks ok to me to return an error though. If some guest device doesn't
expect that, I guess it should ignore the error on its side.

>>   } /* else let the read handler finish it properly */
>>   }
>>
>>   return ret;
>>   } else {
>> -/* XXX: indicate an error ? */
>> -return len;
>> +/* Indicate an error. */
>> +errno = EIO;
>> +return -1;
>>   }
>>   }
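
To illustrate why returning @len in the disconnected state hides the
failure from a caller such as vhost_user_write(), here is a minimal
standalone sketch. It is not QEMU code: FakeChardev, fake_chr_write_old()
and fake_chr_write_new() are hypothetical names made up for this example,
and the connected path only pretends to send data.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a socket chardev; not the QEMU Chardev type. */
typedef struct FakeChardev {
    bool connected;
} FakeChardev;

/* Old behaviour: claim the whole buffer was written when disconnected. */
static int fake_chr_write_old(FakeChardev *chr, const unsigned char *buf,
                              int len)
{
    (void)buf;
    if (!chr->connected) {
        return len;     /* the caller cannot tell that the data was dropped */
    }
    return len;         /* a real backend would send the data here */
}

/* Proposed behaviour: report EIO so the caller can stop and clean up. */
static int fake_chr_write_new(FakeChardev *chr, const unsigned char *buf,
                              int len)
{
    (void)buf;
    if (!chr->connected) {
        errno = EIO;
        return -1;
    }
    return len;         /* a real backend would send the data here */
}
```

With the old variant a caller would go on to wait for a reply that can
never arrive; with the new one it sees -1 and can bail out before reading.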

Re: Problems with c8bb23cbdbe3 on ppc64le

2019-10-10 Thread Anton Nefedov

On 10/10/2019 6:17 PM, Max Reitz wrote:
> Hi everyone,
> 
> (CCs just based on tags in the commit in question)
> 
> I have two bug reports which claim problems of qcow2 on XFS on ppc64le
> machines since qemu 4.1.0.  One of those is about bad performance
> (sorry, it isn’t public :-/), the other about data corruption
> (https://bugzilla.redhat.com/show_bug.cgi?id=1751934).
> 
> It looks like in both cases reverting c8bb23cbdbe3 solves the problem
> (which optimized COW of unallocated areas).
> 
> I think I’ve looked at every angle but can’t find what could be wrong
> with it.  Do any of you have any idea? :-/
> 

hi,

oh, that patch strikes again..

I don't quite follow, was this bug confirmed to happen on x86? Comment 8
(https://bugzilla.redhat.com/show_bug.cgi?id=1751934#c8) mentioned that
(or was that mixed up with the old xfsctl bug?)

Regardless of the platform, does it reproduce? That's comforting
already; worst case we can trace each and every request then (unless it
stops reproducing that way).

Also, perhaps it's worth trying to replace fallocate with write(0)?
Either in qcow2 (in the patch, bdrv_co_pwrite_zeroes -> bdrv_co_pwritev)
or in the file driver. It might hint whether it's misbehaving fallocate
(in qemu or in kernel) or something else.

/Anton
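
The debugging idea above (zero the range with explicit writes instead of
fallocate) can be sketched outside of QEMU roughly as follows. This is
only an illustration under assumptions: zero_range_by_write() is a
hypothetical helper, not a QEMU function, and it uses plain POSIX
pwrite() on a file descriptor rather than the block layer.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: zero [offset, offset + bytes) by writing explicit
 * zero buffers with pwrite() rather than calling fallocate().
 * Returns 0 on success or a negative errno value.
 */
static int zero_range_by_write(int fd, off_t offset, size_t bytes)
{
    static const unsigned char zeroes[4096];

    while (bytes > 0) {
        size_t chunk = bytes < sizeof(zeroes) ? bytes : sizeof(zeroes);
        ssize_t n = pwrite(fd, zeroes, chunk, offset);

        if (n < 0) {
            if (errno == EINTR) {
                continue;       /* interrupted, retry the same chunk */
            }
            return -errno;
        }
        if (n == 0) {
            return -EIO;        /* no progress, avoid looping forever */
        }
        offset += n;
        bytes -= (size_t)n;
    }
    return 0;
}
```

If the corruption disappears with such a write-based path, that would
point towards fallocate (in qemu or in the kernel); if it stays, the
cause is more likely elsewhere.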

Re: [PATCH v9 0/9] discard blockstats

2019-09-23 Thread Anton Nefedov

On 23/9/2019 1:59 PM, Max Reitz wrote:
> On 06.09.19 18:01, Anton Nefedov wrote:
>> v9:
>>   - fixed patch 5 so the fields are actually numbered in sectors not blocks
>>   - fixed patch 7 accordingly
>>   - patch 8: make stat fields unsigned
>>   - qapi patches: "since 4.1" -> "since 4.2"
>>
>> v8: https://lists.gnu.org/archive/html/qemu-devel/2019-05/msg03709.html
> 
> For the record: Looks OK to me, but I suppose you still wanted to change
> patch 3’s commit message.
> 
> (I still don’t like patch 9 very much, but that won’t stop me from
> taking it.)
> 

hi,
I've now resent the series with the updated patch 3 commit message

https://lists.nongnu.org/archive/html/qemu-devel/2019-09/msg05050.html

thanks,

[PATCH v10 2/9] qapi: add unmap to BlockDeviceStats

2019-09-23 Thread Anton Nefedov

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Alberto Garcia 
Reviewed-by: Eric Blake 
---
 qapi/block-core.json       | 29 +++++++++++++++++++++++------
 include/block/accounting.h |  1 +
 block/qapi.c               |  6 ++++++
 tests/qemu-iotests/227.out | 18 ++++++++++++++++++
 4 files changed, 48 insertions(+), 6 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 5ab554b54a..7d3e05891c 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -860,6 +860,8 @@
 #
 # @wr_bytes:  The number of bytes written by the device.
 #
+# @unmap_bytes: The number of bytes unmapped by the device (Since 4.2)
+#
 # @rd_operations: The number of read operations performed by the device.
 #
 # @wr_operations: The number of write operations performed by the device.
@@ -867,6 +869,9 @@
 # @flush_operations: The number of cache flush operations performed by the
 #device (since 0.15.0)
 #
+# @unmap_operations: The number of unmap operations performed by the device
+#(Since 4.2)
+#
 # @rd_total_time_ns: Total time spent on reads in nanoseconds (since 0.15.0).
 #
 # @wr_total_time_ns: Total time spent on writes in nanoseconds (since 0.15.0).
@@ -874,6 +879,9 @@
 # @flush_total_time_ns: Total time spent on cache flushes in nanoseconds
 #   (since 0.15.0).
 #
+# @unmap_total_time_ns: Total time spent on unmap operations in nanoseconds
+#   (Since 4.2)
+#
 # @wr_highest_offset: The offset after the greatest byte written to the
 # device.  The intended use of this information is for
 # growable sparse files (like qcow2) that are used on top
@@ -885,6 +893,9 @@
 # @wr_merged: Number of write requests that have been merged into another
 # request (Since 2.3).
 #
+# @unmap_merged: Number of unmap requests that have been merged into another
+#request (Since 4.2)
+#
 # @idle_time_ns: Time since the last I/O operation, in
 #nanoseconds. If the field is absent it means that
 #there haven't been any operations yet (Since 2.5).
@@ -898,6 +909,9 @@
 # @failed_flush_operations: The number of failed flush operations
 #   performed by the device (Since 2.5)
 #
+# @failed_unmap_operations: The number of failed unmap operations performed
+#   by the device (Since 4.2)
+#
 # @invalid_rd_operations: The number of invalid read operations
 #  performed by the device (Since 2.5)
 #
@@ -907,6 +921,9 @@
 # @invalid_flush_operations: The number of invalid flush operations
 #performed by the device (Since 2.5)
 #
+# @invalid_unmap_operations: The number of invalid unmap operations performed
+#by the device (Since 4.2)
+#
 # @account_invalid: Whether invalid operations are included in the
 #   last access statistics (Since 2.5)
 #
@@ -925,18 +942,18 @@
 # Since: 0.14.0
 ##
 { 'struct': 'BlockDeviceStats',
-  'data': {'rd_bytes': 'int', 'wr_bytes': 'int',
+  'data': {'rd_bytes': 'int', 'wr_bytes': 'int', 'unmap_bytes' : 'int',
'rd_operations': 'int', 'wr_operations': 'int',
-   'flush_operations': 'int',
+   'flush_operations': 'int', 'unmap_operations': 'int',
'rd_total_time_ns': 'int', 'wr_total_time_ns': 'int',
-   'flush_total_time_ns': 'int',
+   'flush_total_time_ns': 'int', 'unmap_total_time_ns': 'int',
'wr_highest_offset': 'int',
-   'rd_merged': 'int', 'wr_merged': 'int',
+   'rd_merged': 'int', 'wr_merged': 'int', 'unmap_merged': 'int',
'*idle_time_ns': 'int',
'failed_rd_operations': 'int', 'failed_wr_operations': 'int',
-   'failed_flush_operations': 'int',
+   'failed_flush_operations': 'int', 'failed_unmap_operations': 'int',
'invalid_rd_operations': 'int', 'invalid_wr_operations': 'int',
-   'invalid_flush_operations': 'int',
+   'invalid_flush_operations': 'int', 'invalid_unmap_operations': 'int',
'account_invalid': 'bool', 'account_failed': 'bool',
'timed_stats': ['BlockDeviceTimedStats'],
'*rd_latency_histogram': 'BlockLatencyHistogramInfo',
diff --git a/include/block/accounting.h b/include/block/accounting.h
index d1f67b10dd..ba8b04d572 100644
--- a/include/block/accounting.h
+++ b/include/block/accounting.h
@@ -36,6 +36,7 @@ enum BlockAcctType {
 BLOCK_ACCT_READ,
 BLOCK_ACCT_WRITE,
 BLOCK_ACCT_FLUSH,
+BLOCK_ACCT_UNMAP,
 BLOCK_MAX_IOTYPE,
 };
 
diff --git a/block/qapi.c b/block/qapi.c
index 7ee2ee065d..69c35c4196 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -440,24 +440,30 @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, BlockBackend *blk)
 
 ds->rd_bytes = stats->nr_bytes[BLOCK_ACCT_READ];
 ds->wr_byte

[PATCH v10 7/9] scsi: account unmap operations

2019-09-23 Thread Anton Nefedov

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 hw/scsi/scsi-disk.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index a002fdabe8..68b1675fd9 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -1617,10 +1617,16 @@ static void scsi_unmap_complete_noio(UnmapCBData *data, int ret)
 r->sector_count = (ldl_be_p(&data->inbuf[8]) & 0xffffffffULL)
 * (s->qdev.blocksize / BDRV_SECTOR_SIZE);
 if (!check_lba_range(s, r->sector, r->sector_count)) {
+block_acct_invalid(blk_get_stats(s->qdev.conf.blk),
+   BLOCK_ACCT_UNMAP);
 scsi_check_condition(r, SENSE_CODE(LBA_OUT_OF_RANGE));
 goto done;
 }
 
+block_acct_start(blk_get_stats(s->qdev.conf.blk), &r->acct,
+ r->sector_count * BDRV_SECTOR_SIZE,
+ BLOCK_ACCT_UNMAP);
+
 r->req.aiocb = blk_aio_pdiscard(s->qdev.conf.blk,
 r->sector * BDRV_SECTOR_SIZE,
 r->sector_count * BDRV_SECTOR_SIZE,
@@ -1647,10 +1653,11 @@ static void scsi_unmap_complete(void *opaque, int ret)
 r->req.aiocb = NULL;
 
 aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
-if (scsi_disk_req_check_error(r, ret, false)) {
+if (scsi_disk_req_check_error(r, ret, true)) {
 scsi_req_unref(&r->req);
 g_free(data);
 } else {
+block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
 scsi_unmap_complete_noio(data, ret);
 }
 aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
@@ -1682,6 +1689,7 @@ static void scsi_disk_emulate_unmap(SCSIDiskReq *r, uint8_t *inbuf)
 }
 
 if (blk_is_read_only(s->qdev.conf.blk)) {
+block_acct_invalid(blk_get_stats(s->qdev.conf.blk), BLOCK_ACCT_UNMAP);
 scsi_check_condition(r, SENSE_CODE(WRITE_PROTECTED));
 return;
 }
@@ -1697,10 +1705,12 @@ static void scsi_disk_emulate_unmap(SCSIDiskReq *r, uint8_t *inbuf)
 return;
 
 invalid_param_len:
+block_acct_invalid(blk_get_stats(s->qdev.conf.blk), BLOCK_ACCT_UNMAP);
 scsi_check_condition(r, SENSE_CODE(INVALID_PARAM_LEN));
 return;
 
 invalid_field:
+block_acct_invalid(blk_get_stats(s->qdev.conf.blk), BLOCK_ACCT_UNMAP);
 scsi_check_condition(r, SENSE_CODE(INVALID_FIELD));
 }
 
-- 
2.17.1

[PATCH v10 0/9] discard blockstats

2019-09-23 Thread Anton Nefedov

v10:
  - patch 3 commit message updated

v9: https://lists.nongnu.org/archive/html/qemu-devel/2019-09/msg01190.html



qmp query-blockstats provides stats info for write/read/flush ops.

Patches 1-7 implement the same for the discard (unmap) command for scsi
and ide disks.
The discard stats "unmap_ops / unmap_bytes" are supposed to account for the
ops that have completed without an error.

However, the discard operation is advisory. Specifically,
 - common block layer ignores ENOTSUP error code.
   That might be returned if the block driver does not support discard,
   or discard has been configured to be ignored.
 - format drivers such as qcow2 may ignore discard if they were configured
   to ignore that, or if the corresponding area is already marked unused
   (unallocated / zero clusters).

What is actually useful is the number of bytes really discarded
down on the host filesystem.
To achieve that, driver-specific statistics have been added to blockstats
(patch 9).
With patch 8, the file-posix driver accounts discard operations on its level too.
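
As a rough illustration of why the device-level and file-level numbers
can diverge, the sketch below models an advisory discard with two sets of
counters. It is not how QEMU structures this: LayerStats, DiscardFn,
advisory_discard() and the two driver stubs are hypothetical names, and
only the ENOTSUP case from the list above is modelled.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical counters for one accounting level (device or file). */
typedef struct LayerStats {
    uint64_t ops_ok;
    uint64_t bytes_ok;
} LayerStats;

/* Hypothetical driver callback: returns 0 or a negative errno value. */
typedef int (*DiscardFn)(uint64_t bytes);

static int driver_without_discard(uint64_t bytes)
{
    (void)bytes;
    return -ENOTSUP;            /* driver cannot discard at all */
}

static int driver_with_discard(uint64_t bytes)
{
    (void)bytes;
    return 0;                   /* pretend the range was discarded */
}

static int advisory_discard(LayerStats *device, LayerStats *file,
                            DiscardFn drv, uint64_t bytes)
{
    int ret = drv(bytes);

    if (ret == 0) {
        /* Only discards that really happened count at the file level. */
        file->ops_ok++;
        file->bytes_ok += bytes;
    } else if (ret == -ENOTSUP) {
        ret = 0;                /* advisory: unsupported is not an error */
    }
    if (ret == 0) {
        /* The guest-visible operation completed without an error. */
        device->ops_ok++;
        device->bytes_ok += bytes;
    }
    return ret;
}
```

In this model a guest unmap against a driver without discard support
still bumps the device-level counters while the file-level ones stay
at zero.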

query-blockstats result:

(note the difference between blockdevice unmap and file discard stats. qcow2
sends fewer ops down to the file as the clusters are actually unallocated
on qcow2 level)

{
  "device": "drive-scsi0-0-0-0",
  "node-name": "#block159",
  "stats": {
>   "invalid_unmap_operations": 0,
>   "failed_unmap_operations": 0,
"wr_highest_offset": 13411688448,
"rd_total_time_ns": 2859566315,
"rd_bytes": 103182336,
"rd_merged": 0,
"flush_operations": 19,
"invalid_wr_operations": 0,
"flush_total_time_ns": 23111608,
"failed_rd_operations": 0,
"failed_flush_operations": 0,
"invalid_flush_operations": 0,
"timed_stats": [
  
],
"wr_merged": 0,
"wr_bytes": 1702912,
>   "unmap_bytes": 11954954240,
>   "unmap_operations": 865,
"idle_time_ns": 2669508623,
"account_invalid": true,
>   "unmap_total_time_ns": 19698002,
"wr_operations": 143,
"failed_wr_operations": 0,
"rd_operations": 4816,
"account_failed": true,
>   "unmap_merged": 0,
"wr_total_time_ns": 1262686124,
"invalid_rd_operations": 0
  },
  "parent": {
>   "driver-specific": {
> "discard-nb-failed": 0,
> "discard-bytes-ok": 720896,
> "driver": "file",
> "discard-nb-ok": 8
>   },
"node-name": "#block009",
"stats": {
[..]
}
  }
},
{
  "device": "floppy0",

Anton Nefedov (9):
  qapi: group BlockDeviceStats fields
  qapi: add unmap to BlockDeviceStats
  block: add empty account cookie type
  ide: account UNMAP (TRIM) operations
  scsi: store unmap offset and nb_sectors in request struct
  scsi: move unmap error checking to the complete callback
  scsi: account unmap operations
  file-posix: account discard operations
  qapi: query-blockstat: add driver specific file-posix stats

 qapi/block-core.json   | 81 --
 include/block/accounting.h |  2 +
 include/block/block.h  |  1 +
 include/block/block_int.h  |  1 +
 block.c|  9 +
 block/accounting.c |  6 +++
 block/file-posix.c | 54 -
 block/qapi.c   | 11 ++
 hw/ide/core.c  | 12 ++
 hw/scsi/scsi-disk.c| 34 ++--
 tests/qemu-iotests/227.out | 18 +
 11 files changed, 206 insertions(+), 23 deletions(-)

-- 
2.17.1

[PATCH v10 1/9] qapi: group BlockDeviceStats fields

2019-09-23 Thread Anton Nefedov

Make the stat fields definition slightly more readable.
Also reorder total_time_ns stats read-write-flush as done elsewhere.
Cosmetic change only.

Signed-off-by: Anton Nefedov 
Reviewed-by: Alberto Garcia 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 qapi/block-core.json | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index e6edd641f1..5ab554b54a 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -867,12 +867,12 @@
 # @flush_operations: The number of cache flush operations performed by the
 #device (since 0.15.0)
 #
-# @flush_total_time_ns: Total time spend on cache flushes in nano-seconds
-#   (since 0.15.0).
+# @rd_total_time_ns: Total time spent on reads in nanoseconds (since 0.15.0).
 #
-# @wr_total_time_ns: Total time spend on writes in nano-seconds (since 0.15.0).
+# @wr_total_time_ns: Total time spent on writes in nanoseconds (since 0.15.0).
 #
-# @rd_total_time_ns: Total_time_spend on reads in nano-seconds (since 0.15.0).
+# @flush_total_time_ns: Total time spent on cache flushes in nanoseconds
+#   (since 0.15.0).
 #
 # @wr_highest_offset: The offset after the greatest byte written to the
 # device.  The intended use of this information is for
@@ -925,14 +925,18 @@
 # Since: 0.14.0
 ##
 { 'struct': 'BlockDeviceStats',
-  'data': {'rd_bytes': 'int', 'wr_bytes': 'int', 'rd_operations': 'int',
-   'wr_operations': 'int', 'flush_operations': 'int',
-   'flush_total_time_ns': 'int', 'wr_total_time_ns': 'int',
-   'rd_total_time_ns': 'int', 'wr_highest_offset': 'int',
-   'rd_merged': 'int', 'wr_merged': 'int', '*idle_time_ns': 'int',
+  'data': {'rd_bytes': 'int', 'wr_bytes': 'int',
+   'rd_operations': 'int', 'wr_operations': 'int',
+   'flush_operations': 'int',
+   'rd_total_time_ns': 'int', 'wr_total_time_ns': 'int',
+   'flush_total_time_ns': 'int',
+   'wr_highest_offset': 'int',
+   'rd_merged': 'int', 'wr_merged': 'int',
+   '*idle_time_ns': 'int',
'failed_rd_operations': 'int', 'failed_wr_operations': 'int',
-   'failed_flush_operations': 'int', 'invalid_rd_operations': 'int',
-   'invalid_wr_operations': 'int', 'invalid_flush_operations': 'int',
+   'failed_flush_operations': 'int',
+   'invalid_rd_operations': 'int', 'invalid_wr_operations': 'int',
+   'invalid_flush_operations': 'int',
'account_invalid': 'bool', 'account_failed': 'bool',
'timed_stats': ['BlockDeviceTimedStats'],
'*rd_latency_histogram': 'BlockLatencyHistogramInfo',
-- 
2.17.1

[PATCH v10 9/9] qapi: query-blockstat: add driver specific file-posix stats

2019-09-23 Thread Anton Nefedov

A block driver can provide a callback to report driver-specific
statistics.

file-posix driver now reports discard statistics

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Acked-by: Markus Armbruster 
---
 qapi/block-core.json      | 38 ++++++++++++++++++++++++++++++++++++++
 include/block/block.h     |  1 +
 include/block/block_int.h |  1 +
 block.c                   |  9 +++++++++
 block/file-posix.c        | 32 ++++++++++++++++++++++++++++++++
 block/qapi.c              |  5 +++++
 6 files changed, 86 insertions(+)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 7d3e05891c..859acea014 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -960,6 +960,41 @@
'*wr_latency_histogram': 'BlockLatencyHistogramInfo',
'*flush_latency_histogram': 'BlockLatencyHistogramInfo' } }
 
+##
+# @BlockStatsSpecificFile:
+#
+# File driver statistics
+#
+# @discard-nb-ok: The number of successful discard operations performed by
+# the driver.
+#
+# @discard-nb-failed: The number of failed discard operations performed by
+# the driver.
+#
+# @discard-bytes-ok: The number of bytes discarded by the driver.
+#
+# Since: 4.2
+##
+{ 'struct': 'BlockStatsSpecificFile',
+  'data': {
+  'discard-nb-ok': 'uint64',
+  'discard-nb-failed': 'uint64',
+  'discard-bytes-ok': 'uint64' } }
+
+##
+# @BlockStatsSpecific:
+#
+# Block driver specific statistics
+#
+# Since: 4.2
+##
+{ 'union': 'BlockStatsSpecific',
+  'base': { 'driver': 'BlockdevDriver' },
+  'discriminator': 'driver',
+  'data': {
+  'file': 'BlockStatsSpecificFile',
+  'host_device': 'BlockStatsSpecificFile' } }
+
 ##
 # @BlockStats:
 #
@@ -975,6 +1010,8 @@
 #
 # @stats:  A @BlockDeviceStats for the device.
 #
+# @driver-specific: Optional driver-specific stats. (Since 4.2)
+#
 # @parent: This describes the file block device if it has one.
 #  Contains recursively the statistics of the underlying
 #  protocol (e.g. the host file for a qcow2 image). If there is
@@ -988,6 +1025,7 @@
 { 'struct': 'BlockStats',
   'data': {'*device': 'str', '*qdev': 'str', '*node-name': 'str',
'stats': 'BlockDeviceStats',
+   '*driver-specific': 'BlockStatsSpecific',
'*parent': 'BlockStats',
'*backing': 'BlockStats'} }
 
diff --git a/include/block/block.h b/include/block/block.h
index 37c9de7446..792bb826db 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -501,6 +501,7 @@ int bdrv_get_flags(BlockDriverState *bs);
 int bdrv_get_info(BlockDriverState *bs, BlockDriverInfo *bdi);
 ImageInfoSpecific *bdrv_get_specific_info(BlockDriverState *bs,
   Error **errp);
+BlockStatsSpecific *bdrv_get_specific_stats(BlockDriverState *bs);
 void bdrv_round_to_clusters(BlockDriverState *bs,
 int64_t offset, int64_t bytes,
 int64_t *cluster_offset,
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 0422acdf1c..2b113eb3c7 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -366,6 +366,7 @@ struct BlockDriver {
 int (*bdrv_get_info)(BlockDriverState *bs, BlockDriverInfo *bdi);
 ImageInfoSpecific *(*bdrv_get_specific_info)(BlockDriverState *bs,
  Error **errp);
+BlockStatsSpecific *(*bdrv_get_specific_stats)(BlockDriverState *bs);
 
 int coroutine_fn (*bdrv_save_vmstate)(BlockDriverState *bs,
   QEMUIOVector *qiov,
diff --git a/block.c b/block.c
index 5944124845..5d3a5a6b95 100644
--- a/block.c
+++ b/block.c
@@ -5155,6 +5155,15 @@ ImageInfoSpecific *bdrv_get_specific_info(BlockDriverState *bs,
 return NULL;
 }
 
+BlockStatsSpecific *bdrv_get_specific_stats(BlockDriverState *bs)
+{
+BlockDriver *drv = bs->drv;
+if (!drv || !drv->bdrv_get_specific_stats) {
+return NULL;
+}
+return drv->bdrv_get_specific_stats(bs);
+}
+
 void bdrv_debug_event(BlockDriverState *bs, BlkdebugEvent event)
 {
 if (!bs || !bs->drv || !bs->drv->bdrv_debug_event) {
diff --git a/block/file-posix.c b/block/file-posix.c
index f3934c4e10..695fcf740d 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2753,6 +2753,36 @@ static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
 return 0;
 }
 
+static BlockStatsSpecificFile get_blockstats_specific_file(BlockDriverState *bs)
+{
+BDRVRawState *s = bs->opaque;
+return (BlockStatsSpecificFile) {
+.discard_nb_ok = s->stats.discard_nb_ok,
+.discard_nb_failed = s->stats.discard_nb_failed,
+.discard_bytes_ok = s->stats.discard_bytes_ok,
+};
+}
+
+static BlockStatsSpecific *raw_get_specific_stats(BlockDriverState *bs)
+{
+BlockStatsSpecific *stats = g_new(BlockStatsSpecific, 1);
+
+stats->driver = BLOCKDE

[PATCH v10 5/9] scsi: store unmap offset and nb_sectors in request struct

2019-09-23 Thread Anton Nefedov

This allows them to be reported in the error handler.

Signed-off-by: Anton Nefedov 
---
 hw/scsi/scsi-disk.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 915641a0f1..b3dd21800d 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -1608,8 +1608,6 @@ static void scsi_unmap_complete_noio(UnmapCBData *data, int ret)
 {
 SCSIDiskReq *r = data->r;
 SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
-uint64_t sector_num;
-uint32_t nb_sectors;
 
 assert(r->req.aiocb == NULL);
 if (scsi_disk_req_check_error(r, ret, false)) {
@@ -1617,16 +1615,18 @@ static void scsi_unmap_complete_noio(UnmapCBData *data, int ret)
 }
 
 if (data->count > 0) {
-sector_num = ldq_be_p(&data->inbuf[0]);
-nb_sectors = ldl_be_p(&data->inbuf[8]) & 0xffffffffULL;
-if (!check_lba_range(s, sector_num, nb_sectors)) {
+r->sector = ldq_be_p(&data->inbuf[0])
+* (s->qdev.blocksize / BDRV_SECTOR_SIZE);
+r->sector_count = (ldl_be_p(&data->inbuf[8]) & 0xffffffffULL)
+* (s->qdev.blocksize / BDRV_SECTOR_SIZE);
+if (!check_lba_range(s, r->sector, r->sector_count)) {
 scsi_check_condition(r, SENSE_CODE(LBA_OUT_OF_RANGE));
 goto done;
 }
 
 r->req.aiocb = blk_aio_pdiscard(s->qdev.conf.blk,
-sector_num * s->qdev.blocksize,
-nb_sectors * s->qdev.blocksize,
+r->sector * BDRV_SECTOR_SIZE,
+r->sector_count * BDRV_SECTOR_SIZE,
 scsi_unmap_complete, data);
 data->count--;
 data->inbuf += 16;
-- 
2.17.1
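
The unit change in the patch above (device logical blocks converted to
512-byte sectors, then to bytes for the discard call) can be checked with
a small standalone sketch. The helper names blocks_to_sectors() and
sectors_to_bytes() are hypothetical, and SECTOR_SIZE merely stands in for
BDRV_SECTOR_SIZE, which is 512.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u    /* stands in for BDRV_SECTOR_SIZE */

/* Convert an LBA or a count in device logical blocks to 512-byte sectors. */
static uint64_t blocks_to_sectors(uint64_t blocks, uint32_t blocksize)
{
    /* The device block size is assumed to be a multiple of 512. */
    assert(blocksize >= SECTOR_SIZE && blocksize % SECTOR_SIZE == 0);
    return blocks * (blocksize / SECTOR_SIZE);
}

/* Convert 512-byte sectors to a byte offset or length. */
static uint64_t sectors_to_bytes(uint64_t sectors)
{
    return sectors * SECTOR_SIZE;
}
```

For a 4096-byte block size, 10 blocks become 80 sectors and 40960 bytes,
which matches what the old code computed as nb_sectors * blocksize.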

[PATCH v10 3/9] block: add empty account cookie type

2019-09-23 Thread Anton Nefedov

Each block_acct_done/failed call is designed to correspond to a
previous block_acct_start call, which initializes the stats cookie.
However sometimes it is not the case, e.g. some error paths might
report the same cookie twice because it is hard to accurately track if
the cookie was reported yet or not.

This patch cleans the cookie after report.
(Note: block_acct_failed/done without a previous block_acct_start at
all should be avoided. Uninitialized cookie might hold a garbage value
and there is still "< BLOCK_MAX_IOTYPE" assertion for that)

It will be particularly useful in ide code where it's hard to
keep track of whether the request has done its accounting or not: in the
following patch of the series, trim requests will do the accounting
separately.

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/accounting.h | 1 +
 block/accounting.c         | 6 ++++++
 2 files changed, 7 insertions(+)

diff --git a/include/block/accounting.h b/include/block/accounting.h
index ba8b04d572..878b4c3581 100644
--- a/include/block/accounting.h
+++ b/include/block/accounting.h
@@ -33,6 +33,7 @@ typedef struct BlockAcctTimedStats BlockAcctTimedStats;
 typedef struct BlockAcctStats BlockAcctStats;
 
 enum BlockAcctType {
+BLOCK_ACCT_NONE = 0,
 BLOCK_ACCT_READ,
 BLOCK_ACCT_WRITE,
 BLOCK_ACCT_FLUSH,
diff --git a/block/accounting.c b/block/accounting.c
index 70a3d9a426..8d41c8a83a 100644
--- a/block/accounting.c
+++ b/block/accounting.c
@@ -195,6 +195,10 @@ static void block_account_one_io(BlockAcctStats *stats, BlockAcctCookie *cookie,
 
 assert(cookie->type < BLOCK_MAX_IOTYPE);
 
+if (cookie->type == BLOCK_ACCT_NONE) {
+return;
+}
+
 qemu_mutex_lock(&stats->lock);
 
 if (failed) {
@@ -217,6 +221,8 @@ static void block_account_one_io(BlockAcctStats *stats, BlockAcctCookie *cookie,
 }
 
 qemu_mutex_unlock(&stats->lock);
+
+cookie->type = BLOCK_ACCT_NONE;
 }
 
 void block_acct_done(BlockAcctStats *stats, BlockAcctCookie *cookie)
-- 
2.17.1
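
The effect of clearing the cookie after a report can be shown with a
simplified standalone model. This is a sketch only: AcctType, AcctStats,
AcctCookie, acct_start() and acct_one_io() are hypothetical reductions of
the accounting code, without the mutex, byte counts or timing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum AcctType { ACCT_NONE = 0, ACCT_READ, ACCT_WRITE, ACCT_MAX };

typedef struct AcctStats {
    uint64_t done[ACCT_MAX];
    uint64_t failed[ACCT_MAX];
} AcctStats;

typedef struct AcctCookie {
    enum AcctType type;
} AcctCookie;

/* Arm the cookie for one operation. */
static void acct_start(AcctCookie *cookie, enum AcctType type)
{
    cookie->type = type;
}

/* Report one completion; a second report for the same cookie is a no-op. */
static void acct_one_io(AcctStats *stats, AcctCookie *cookie, bool failed)
{
    assert(cookie->type < ACCT_MAX);

    if (cookie->type == ACCT_NONE) {
        return;                 /* already reported, nothing to account */
    }
    if (failed) {
        stats->failed[cookie->type]++;
    } else {
        stats->done[cookie->type]++;
    }
    cookie->type = ACCT_NONE;   /* clean the cookie after the report */
}
```

An error path that reports the same cookie twice then increments the
failure counter only once.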

[PATCH v10 8/9] file-posix: account discard operations

2019-09-23 Thread Anton Nefedov

This will help to identify how many of the user-issued discard operations
(accounted on a device level) have actually succeeded down on the host file
(even though the numbers will not be exactly the same if a non-raw format
driver is used (e.g. qcow2 sending metadata discards)).

Note that these numbers will not include discards triggered by
write-zeroes + MAY_UNMAP calls.

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/file-posix.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index f12c06de2d..f3934c4e10 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -161,6 +161,11 @@ typedef struct BDRVRawState {
 bool needs_alignment;
 bool drop_cache;
 bool check_cache_dropped;
+struct {
+uint64_t discard_nb_ok;
+uint64_t discard_nb_failed;
+uint64_t discard_bytes_ok;
+} stats;
 
 PRManager *pr_mgr;
 } BDRVRawState;
@@ -2660,11 +2665,22 @@ static void coroutine_fn raw_co_invalidate_cache(BlockDriverState *bs,
 #endif /* !__linux__ */
 }
 
+static void raw_account_discard(BDRVRawState *s, uint64_t nbytes, int ret)
+{
+if (ret) {
+s->stats.discard_nb_failed++;
+} else {
+s->stats.discard_nb_ok++;
+s->stats.discard_bytes_ok += nbytes;
+}
+}
+
 static coroutine_fn int
 raw_do_pdiscard(BlockDriverState *bs, int64_t offset, int bytes, bool blkdev)
 {
 BDRVRawState *s = bs->opaque;
 RawPosixAIOData acb;
+int ret;
 
 acb = (RawPosixAIOData) {
 .bs = bs,
@@ -2678,7 +2694,9 @@ raw_do_pdiscard(BlockDriverState *bs, int64_t offset, int bytes, bool blkdev)
 acb.aio_type |= QEMU_AIO_BLKDEV;
 }
 
-return raw_thread_pool_submit(bs, handle_aiocb_discard, &acb);
+ret = raw_thread_pool_submit(bs, handle_aiocb_discard, &acb);
+raw_account_discard(s, bytes, ret);
+return ret;
 }
 
 static coroutine_fn int
@@ -3301,10 +3319,12 @@ static int fd_open(BlockDriverState *bs)
 static coroutine_fn int
 hdev_co_pdiscard(BlockDriverState *bs, int64_t offset, int bytes)
 {
+BDRVRawState *s = bs->opaque;
 int ret;
 
 ret = fd_open(bs);
 if (ret < 0) {
+raw_account_discard(s, bytes, ret);
 return ret;
 }
 return raw_do_pdiscard(bs, offset, bytes, true);
-- 
2.17.1

[PATCH v10 6/9] scsi: move unmap error checking to the complete callback

2019-09-23 Thread Anton Nefedov

This will help to account the operation in the following commit.

The difference is that we don't call scsi_disk_req_check_error() before
the 1st discard iteration anymore. That function also checks if
the request is cancelled; however, it shouldn't get cancelled until it
yields in blk_aio() functions anyway.
The same approach is already used for emulate_write_same.

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Alberto Garcia 
---
 hw/scsi/scsi-disk.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index b3dd21800d..a002fdabe8 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -1610,9 +1610,6 @@ static void scsi_unmap_complete_noio(UnmapCBData *data, int ret)
 SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
 
 assert(r->req.aiocb == NULL);
-if (scsi_disk_req_check_error(r, ret, false)) {
-goto done;
-}
 
 if (data->count > 0) {
 r->sector = ldq_be_p(&data->inbuf[0])
@@ -1650,7 +1647,12 @@ static void scsi_unmap_complete(void *opaque, int ret)
 r->req.aiocb = NULL;
 
 aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
-scsi_unmap_complete_noio(data, ret);
+if (scsi_disk_req_check_error(r, ret, false)) {
+scsi_req_unref(&r->req);
+g_free(data);
+} else {
+scsi_unmap_complete_noio(data, ret);
+}
 aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
 }
 
-- 
2.17.1

[PATCH v10 4/9] ide: account UNMAP (TRIM) operations

2019-09-23 Thread Anton Nefedov

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 hw/ide/core.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/hw/ide/core.c b/hw/ide/core.c
index e6e54c6c9a..754ff4dc34 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -442,6 +442,14 @@ static void ide_issue_trim_cb(void *opaque, int ret)
 TrimAIOCB *iocb = opaque;
 IDEState *s = iocb->s;
 
+if (iocb->i >= 0) {
+if (ret >= 0) {
+block_acct_done(blk_get_stats(s->blk), &s->acct);
+} else {
+block_acct_failed(blk_get_stats(s->blk), &s->acct);
+}
+}
+
 if (ret >= 0) {
 while (iocb->j < iocb->qiov->niov) {
 int j = iocb->j;
@@ -459,10 +467,14 @@ static void ide_issue_trim_cb(void *opaque, int ret)
 }
 
 if (!ide_sect_range_ok(s, sector, count)) {
+block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_UNMAP);
 iocb->ret = -EINVAL;
 goto done;
 }
 
+block_acct_start(blk_get_stats(s->blk), &s->acct,
+ count << BDRV_SECTOR_BITS, BLOCK_ACCT_UNMAP);
+
 /* Got an entry! Submit and exit.  */
 iocb->aiocb = blk_aio_pdiscard(s->blk,
sector << BDRV_SECTOR_BITS,
-- 
2.17.1

Re: [Qemu-devel] [PATCH v9 3/9] block: add empty account cookie type

2019-09-10 Thread Anton Nefedov

On 9/9/2019 5:54 PM, Alberto Garcia wrote:
> On Fri 06 Sep 2019 06:01:14 PM CEST, Anton Nefedov wrote:
>> This adds some protection from accounting uninitialized cookie.
>> That is, block_acct_failed/done without previous block_acct_start;
>> in that case, cookie probably holds values from previous operation.
>>
>> (Note: it might also be uninitialized holding garbage value and there
>> is still "< BLOCK_MAX_IOTYPE" assertion for that.  So
>> block_acct_failed/done without previous block_acct_start should be
>> used with caution.)
>>
>> Currently this is particularly useful in ide code where it's hard to
>> keep track whether the request started accounting or not. For example,
>> trim requests do the accounting separately.
> 
> Sorry if I'm understanding it wrong, but it sounds like you know that
> there's a bug in the ide code (where you call block_acct_done() without
> having initialized it first), and the purpose of this patch is to
> hide the bug?
> 

hi,

not really; in the existing code, I can't see block_acct_done() without
block_acct_start(), but there might be double-accounting though;
e.g. ide_atapi_cmd_read_dma_cb(): it can account the same operation
twice like
   ide_handle_rw_error();
   goto eot;
   block_acct_failed();

The patch should solve it.

The commit message is misleading, sorry. I'll change to:

 > Each block_acct_done/failed call is designed to correspond to a
 > previous block_acct_start call, which initializes the stats cookie.
 > However sometimes it is not the case, e.g. some error paths might
 > report the same cookie twice because it is hard to accurately track if
 > the cookie was reported yet or not.

 > This patch cleans the cookie after report.
 > (Note: block_acct_failed/done without a previous block_acct_start at
 > all should be avoided. Uninitialized cookie might hold a garbage value
 > and there is still "< BLOCK_MAX_IOTYPE" assertion for that)

 > It will be particularly useful in ide code where it's hard to
 > keep track of whether the request has done its accounting or not: in the
 > following patch of the series, trim requests will do the accounting
 > separately.

/Anton

[Qemu-devel] [PATCH v9 7/9] scsi: account unmap operations

2019-09-06 Thread Anton Nefedov

Signed-off-by: Anton Nefedov 
---
 hw/scsi/scsi-disk.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index a002fdabe8..68b1675fd9 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -1617,10 +1617,16 @@ static void scsi_unmap_complete_noio(UnmapCBData *data, int ret)
 r->sector_count = (ldl_be_p(&data->inbuf[8]) & 0xffffffffULL)
 * (s->qdev.blocksize / BDRV_SECTOR_SIZE);
 if (!check_lba_range(s, r->sector, r->sector_count)) {
+block_acct_invalid(blk_get_stats(s->qdev.conf.blk),
+   BLOCK_ACCT_UNMAP);
 scsi_check_condition(r, SENSE_CODE(LBA_OUT_OF_RANGE));
 goto done;
 }
 
+block_acct_start(blk_get_stats(s->qdev.conf.blk), &r->acct,
+ r->sector_count * BDRV_SECTOR_SIZE,
+ BLOCK_ACCT_UNMAP);
+
 r->req.aiocb = blk_aio_pdiscard(s->qdev.conf.blk,
 r->sector * BDRV_SECTOR_SIZE,
 r->sector_count * BDRV_SECTOR_SIZE,
@@ -1647,10 +1653,11 @@ static void scsi_unmap_complete(void *opaque, int ret)
 r->req.aiocb = NULL;
 
 aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
-if (scsi_disk_req_check_error(r, ret, false)) {
+if (scsi_disk_req_check_error(r, ret, true)) {
 scsi_req_unref(&r->req);
 g_free(data);
 } else {
+block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
 scsi_unmap_complete_noio(data, ret);
 }
 aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
@@ -1682,6 +1689,7 @@ static void scsi_disk_emulate_unmap(SCSIDiskReq *r, uint8_t *inbuf)
 }
 
 if (blk_is_read_only(s->qdev.conf.blk)) {
+block_acct_invalid(blk_get_stats(s->qdev.conf.blk), BLOCK_ACCT_UNMAP);
 scsi_check_condition(r, SENSE_CODE(WRITE_PROTECTED));
 return;
 }
@@ -1697,10 +1705,12 @@ static void scsi_disk_emulate_unmap(SCSIDiskReq *r, uint8_t *inbuf)
 return;
 
 invalid_param_len:
+block_acct_invalid(blk_get_stats(s->qdev.conf.blk), BLOCK_ACCT_UNMAP);
 scsi_check_condition(r, SENSE_CODE(INVALID_PARAM_LEN));
 return;
 
 invalid_field:
+block_acct_invalid(blk_get_stats(s->qdev.conf.blk), BLOCK_ACCT_UNMAP);
 scsi_check_condition(r, SENSE_CODE(INVALID_FIELD));
 }
 
-- 
2.17.1

[Qemu-devel] [PATCH v9 3/9] block: add empty account cookie type

2019-09-06 Thread Anton Nefedov

This adds some protection against accounting an uninitialized cookie,
that is, block_acct_failed/done without a previous block_acct_start;
in that case, the cookie probably holds values from the previous operation.

(Note: it might also be uninitialized and hold a garbage value; there is
 still the "< BLOCK_MAX_IOTYPE" assertion for that.
 So block_acct_failed/done without a previous block_acct_start should be
 used with caution.)

Currently this is particularly useful in ide code, where it's hard to
keep track of whether the request started accounting or not. For example,
trim requests do the accounting separately.

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 include/block/accounting.h | 1 +
 block/accounting.c | 6 ++
 2 files changed, 7 insertions(+)

diff --git a/include/block/accounting.h b/include/block/accounting.h
index ba8b04d572..878b4c3581 100644
--- a/include/block/accounting.h
+++ b/include/block/accounting.h
@@ -33,6 +33,7 @@ typedef struct BlockAcctTimedStats BlockAcctTimedStats;
 typedef struct BlockAcctStats BlockAcctStats;
 
 enum BlockAcctType {
+BLOCK_ACCT_NONE = 0,
 BLOCK_ACCT_READ,
 BLOCK_ACCT_WRITE,
 BLOCK_ACCT_FLUSH,
diff --git a/block/accounting.c b/block/accounting.c
index 70a3d9a426..8d41c8a83a 100644
--- a/block/accounting.c
+++ b/block/accounting.c
@@ -195,6 +195,10 @@ static void block_account_one_io(BlockAcctStats *stats, BlockAcctCookie *cookie,
 
 assert(cookie->type < BLOCK_MAX_IOTYPE);
 
+if (cookie->type == BLOCK_ACCT_NONE) {
+return;
+}
+
 qemu_mutex_lock(&stats->lock);
 
 if (failed) {
@@ -217,6 +221,8 @@ static void block_account_one_io(BlockAcctStats *stats, BlockAcctCookie *cookie,
 }
 
 qemu_mutex_unlock(&stats->lock);
+
+cookie->type = BLOCK_ACCT_NONE;
 }
 
 void block_acct_done(BlockAcctStats *stats, BlockAcctCookie *cookie)
-- 
2.17.1

[Qemu-devel] [PATCH v9 2/9] qapi: add unmap to BlockDeviceStats

2019-09-06 Thread Anton Nefedov

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Alberto Garcia 
Reviewed-by: Eric Blake 
---
 qapi/block-core.json   | 29 +++--
 include/block/accounting.h |  1 +
 block/qapi.c   |  6 ++
 tests/qemu-iotests/227.out | 18 ++
 4 files changed, 48 insertions(+), 6 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 5ab554b54a..7d3e05891c 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -860,6 +860,8 @@
 #
 # @wr_bytes:  The number of bytes written by the device.
 #
+# @unmap_bytes: The number of bytes unmapped by the device (Since 4.2)
+#
 # @rd_operations: The number of read operations performed by the device.
 #
 # @wr_operations: The number of write operations performed by the device.
@@ -867,6 +869,9 @@
 # @flush_operations: The number of cache flush operations performed by the
 #device (since 0.15.0)
 #
+# @unmap_operations: The number of unmap operations performed by the device
+#(Since 4.2)
+#
 # @rd_total_time_ns: Total time spent on reads in nanoseconds (since 0.15.0).
 #
 # @wr_total_time_ns: Total time spent on writes in nanoseconds (since 0.15.0).
@@ -874,6 +879,9 @@
 # @flush_total_time_ns: Total time spent on cache flushes in nanoseconds
 #   (since 0.15.0).
 #
+# @unmap_total_time_ns: Total time spent on unmap operations in nanoseconds
+#   (Since 4.2)
+#
 # @wr_highest_offset: The offset after the greatest byte written to the
 # device.  The intended use of this information is for
 # growable sparse files (like qcow2) that are used on top
@@ -885,6 +893,9 @@
 # @wr_merged: Number of write requests that have been merged into another
 # request (Since 2.3).
 #
+# @unmap_merged: Number of unmap requests that have been merged into another
+#request (Since 4.2)
+#
 # @idle_time_ns: Time since the last I/O operation, in
 #nanoseconds. If the field is absent it means that
 #there haven't been any operations yet (Since 2.5).
@@ -898,6 +909,9 @@
 # @failed_flush_operations: The number of failed flush operations
 #   performed by the device (Since 2.5)
 #
+# @failed_unmap_operations: The number of failed unmap operations performed
+#   by the device (Since 4.2)
+#
 # @invalid_rd_operations: The number of invalid read operations
 #  performed by the device (Since 2.5)
 #
@@ -907,6 +921,9 @@
 # @invalid_flush_operations: The number of invalid flush operations
 #performed by the device (Since 2.5)
 #
+# @invalid_unmap_operations: The number of invalid unmap operations performed
+#by the device (Since 4.2)
+#
 # @account_invalid: Whether invalid operations are included in the
 #   last access statistics (Since 2.5)
 #
@@ -925,18 +942,18 @@
 # Since: 0.14.0
 ##
 { 'struct': 'BlockDeviceStats',
-  'data': {'rd_bytes': 'int', 'wr_bytes': 'int',
+  'data': {'rd_bytes': 'int', 'wr_bytes': 'int', 'unmap_bytes' : 'int',
'rd_operations': 'int', 'wr_operations': 'int',
-   'flush_operations': 'int',
+   'flush_operations': 'int', 'unmap_operations': 'int',
'rd_total_time_ns': 'int', 'wr_total_time_ns': 'int',
-   'flush_total_time_ns': 'int',
+   'flush_total_time_ns': 'int', 'unmap_total_time_ns': 'int',
'wr_highest_offset': 'int',
-   'rd_merged': 'int', 'wr_merged': 'int',
+   'rd_merged': 'int', 'wr_merged': 'int', 'unmap_merged': 'int',
'*idle_time_ns': 'int',
'failed_rd_operations': 'int', 'failed_wr_operations': 'int',
-   'failed_flush_operations': 'int',
+   'failed_flush_operations': 'int', 'failed_unmap_operations': 'int',
'invalid_rd_operations': 'int', 'invalid_wr_operations': 'int',
-   'invalid_flush_operations': 'int',
+   'invalid_flush_operations': 'int', 'invalid_unmap_operations': 'int',
'account_invalid': 'bool', 'account_failed': 'bool',
'timed_stats': ['BlockDeviceTimedStats'],
'*rd_latency_histogram': 'BlockLatencyHistogramInfo',
diff --git a/include/block/accounting.h b/include/block/accounting.h
index d1f67b10dd..ba8b04d572 100644
--- a/include/block/accounting.h
+++ b/include/block/accounting.h
@@ -36,6 +36,7 @@ enum BlockAcctType {
 BLOCK_ACCT_READ,
 BLOCK_ACCT_WRITE,
 BLOCK_ACCT_FLUSH,
+BLOCK_ACCT_UNMAP,
 BLOCK_MAX_IOTYPE,
 };
 
diff --git a/block/qapi.c b/block/qapi.c
index 15f1030264..3356a1dc59 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -440,24 +440,30 @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, BlockBackend *blk)
 
 ds->rd_bytes = stats->nr_bytes[BLOCK_ACCT_READ];
 ds->wr_byte

[Qemu-devel] [PATCH v9 6/9] scsi: move unmap error checking to the complete callback

2019-09-06 Thread Anton Nefedov

This will help to account the operation in the following commit.

The difference is that we no longer call scsi_disk_req_check_error() before
the first discard iteration. That function also checks whether
the request has been cancelled; however, it shouldn't get cancelled until it
yields in blk_aio() functions anyway.
The same approach is already used for emulate_write_same.

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Alberto Garcia 
---
 hw/scsi/scsi-disk.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index b3dd21800d..a002fdabe8 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -1610,9 +1610,6 @@ static void scsi_unmap_complete_noio(UnmapCBData *data, int ret)
 SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
 
 assert(r->req.aiocb == NULL);
-if (scsi_disk_req_check_error(r, ret, false)) {
-goto done;
-}
 
 if (data->count > 0) {
 r->sector = ldq_be_p(&data->inbuf[0])
@@ -1650,7 +1647,12 @@ static void scsi_unmap_complete(void *opaque, int ret)
 r->req.aiocb = NULL;
 
 aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
-scsi_unmap_complete_noio(data, ret);
+if (scsi_disk_req_check_error(r, ret, false)) {
+scsi_req_unref(&r->req);
+g_free(data);
+} else {
+scsi_unmap_complete_noio(data, ret);
+}
 aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
 }
 
-- 
2.17.1

[Qemu-devel] [PATCH v9 4/9] ide: account UNMAP (TRIM) operations

2019-09-06 Thread Anton Nefedov

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 hw/ide/core.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/hw/ide/core.c b/hw/ide/core.c
index e6e54c6c9a..754ff4dc34 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -442,6 +442,14 @@ static void ide_issue_trim_cb(void *opaque, int ret)
 TrimAIOCB *iocb = opaque;
 IDEState *s = iocb->s;
 
+if (iocb->i >= 0) {
+if (ret >= 0) {
+block_acct_done(blk_get_stats(s->blk), &s->acct);
+} else {
+block_acct_failed(blk_get_stats(s->blk), &s->acct);
+}
+}
+
 if (ret >= 0) {
 while (iocb->j < iocb->qiov->niov) {
 int j = iocb->j;
@@ -459,10 +467,14 @@ static void ide_issue_trim_cb(void *opaque, int ret)
 }
 
 if (!ide_sect_range_ok(s, sector, count)) {
+block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_UNMAP);
 iocb->ret = -EINVAL;
 goto done;
 }
 
+block_acct_start(blk_get_stats(s->blk), &s->acct,
+ count << BDRV_SECTOR_BITS, BLOCK_ACCT_UNMAP);
+
 /* Got an entry! Submit and exit.  */
 iocb->aiocb = blk_aio_pdiscard(s->blk,
sector << BDRV_SECTOR_BITS,
-- 
2.17.1

[Qemu-devel] [PATCH v9 0/9] discard blockstats

2019-09-06 Thread Anton Nefedov

v9:
 - fixed patch 5 so the fields are actually numbered in sectors not blocks
 - fixed patch 7 accordingly
 - patch 8: make stat fields unsigned
 - qapi patches: "since 4.1" -> "since 4.2"

v8: https://lists.gnu.org/archive/html/qemu-devel/2019-05/msg03709.html



qmp query-blockstats provides stats info for write/read/flush ops.

Patches 1-7 implement the same for the discard (unmap) command for scsi
and ide disks.
The discard stats "unmap_ops / unmap_bytes" are supposed to account the ops
that have completed without an error.

However, the discard operation is advisory. Specifically,
 - common block layer ignores ENOTSUP error code.
   That might be returned if the block driver does not support discard,
   or discard has been configured to be ignored.
 - format drivers such as qcow2 may ignore discard if they were configured
   to ignore that, or if the corresponding area is already marked unused
   (unallocated / zero clusters).

And what is actually useful is the number of bytes actually discarded
down on the host filesystem.
To achieve that, driver-specific statistics have been added to blockstats
(patch 9).
With patch 8, the file-posix driver accounts discard operations on its level
too.
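
The gap between the two levels can be sketched as follows. This is a simplified standalone model, not the QEMU code paths: `DiscardCounters`, `generic_discard` and the two driver functions are hypothetical names, and the device-level accounting is folded into the generic helper for brevity. It only illustrates that an ignored ENOTSUP still counts as a completed op at device level while nothing is counted at file level.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

typedef struct DiscardCounters {
    uint64_t ops_ok;
    uint64_t ops_failed;
    uint64_t bytes_ok;
} DiscardCounters;

/* Hypothetical driver hook: returns 0 or a negative errno value. */
typedef int (*discard_fn)(uint64_t offset, uint64_t bytes,
                          DiscardCounters *file);

/* A driver that cannot discard: nothing reaches the host file. */
static int driver_without_discard(uint64_t offset, uint64_t bytes,
                                  DiscardCounters *file)
{
    (void)offset;
    (void)bytes;
    (void)file;
    return -ENOTSUP;
}

/* A driver that discards and accounts it on its own (file) level. */
static int driver_with_discard(uint64_t offset, uint64_t bytes,
                               DiscardCounters *file)
{
    (void)offset;
    file->ops_ok++;
    file->bytes_ok += bytes;
    return 0;
}

/* Discard is advisory, so "not supported" is not reported as an error. */
static int generic_discard(discard_fn drv, uint64_t offset, uint64_t bytes,
                           DiscardCounters *dev, DiscardCounters *file)
{
    int ret = drv(offset, bytes, file);

    if (ret == -ENOTSUP) {
        ret = 0;
    }
    if (ret < 0) {
        dev->ops_failed++;
    } else {
        dev->ops_ok++;
        dev->bytes_ok += bytes;
    }
    return ret;
}
```

After one discard through each driver, the device-level counters show two successful ops while the file-level counters show one, which is the kind of difference visible in the output below.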

query-blockstats result:

(note the difference between blockdevice unmap and file discard stats. qcow2
sends fewer ops down to the file as the clusters are actually unallocated
on qcow2 level)

{
  "device": "drive-scsi0-0-0-0",
  "node-name": "#block159",
  "stats": {
>   "invalid_unmap_operations": 0,
>   "failed_unmap_operations": 0,
"wr_highest_offset": 13411688448,
"rd_total_time_ns": 2859566315,
"rd_bytes": 103182336,
"rd_merged": 0,
"flush_operations": 19,
"invalid_wr_operations": 0,
"flush_total_time_ns": 23111608,
"failed_rd_operations": 0,
"failed_flush_operations": 0,
"invalid_flush_operations": 0,
"timed_stats": [
  
],
"wr_merged": 0,
"wr_bytes": 1702912,
>   "unmap_bytes": 11954954240,
>   "unmap_operations": 865,
"idle_time_ns": 2669508623,
"account_invalid": true,
>   "unmap_total_time_ns": 19698002,
"wr_operations": 143,
"failed_wr_operations": 0,
"rd_operations": 4816,
"account_failed": true,
>   "unmap_merged": 0,
"wr_total_time_ns": 1262686124,
    "invalid_rd_operations": 0
  },
  "parent": {
>   "driver-specific": {
> "discard-nb-failed": 0,
> "discard-bytes-ok": 720896,
> "driver": "file",
> "discard-nb-ok": 8
>   },
"node-name": "#block009",
"stats": {
[..]
}
  }
},
{
  "device": "floppy0",

Anton Nefedov (9):
  qapi: group BlockDeviceStats fields
  qapi: add unmap to BlockDeviceStats
  block: add empty account cookie type
  ide: account UNMAP (TRIM) operations
  scsi: store unmap offset and nb_sectors in request struct
  scsi: move unmap error checking to the complete callback
  scsi: account unmap operations
  file-posix: account discard operations
  qapi: query-blockstat: add driver specific file-posix stats

 qapi/block-core.json   | 81 --
 include/block/accounting.h |  2 +
 include/block/block.h  |  1 +
 include/block/block_int.h  |  1 +
 block.c|  9 +
 block/accounting.c |  6 +++
 block/file-posix.c | 54 -
 block/qapi.c   | 11 ++
 hw/ide/core.c  | 12 ++
 hw/scsi/scsi-disk.c| 34 ++--
 tests/qemu-iotests/227.out | 18 +
 11 files changed, 206 insertions(+), 23 deletions(-)

-- 
2.17.1

[Qemu-devel] [PATCH v9 9/9] qapi: query-blockstat: add driver specific file-posix stats

2019-09-06 Thread Anton Nefedov

A block driver can provide a callback to report driver-specific
statistics.

The file-posix driver now reports discard statistics.
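
The optional-callback pattern can be sketched in isolation like this. All names here (`Driver`, `Node`, `SpecificStats`, `node_get_specific_stats`) are hypothetical, and unlike the patch, which returns an allocated QAPI object or NULL, this sketch fills a caller-provided struct to stay self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Node Node;

typedef struct SpecificStats {
    uint64_t discard_nb_ok;
    uint64_t discard_nb_failed;
    uint64_t discard_bytes_ok;
} SpecificStats;

typedef struct Driver {
    const char *name;
    /* Optional hook: NULL when the driver has nothing specific to report. */
    int (*get_specific_stats)(const Node *node, SpecificStats *out);
} Driver;

struct Node {
    const Driver *drv;
    SpecificStats counters;
};

/* Hook of a driver that does keep its own discard counters. */
static int file_get_specific_stats(const Node *node, SpecificStats *out)
{
    *out = node->counters;
    return 0;
}

/* Returns 0 and fills *out if the driver reports stats, -1 otherwise. */
static int node_get_specific_stats(const Node *node, SpecificStats *out)
{
    if (!node->drv || !node->drv->get_specific_stats) {
        return -1;
    }
    return node->drv->get_specific_stats(node, out);
}
```

The caller treats "no stats" as a normal outcome, which is why the corresponding field in the query result is optional.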

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Acked-by: Markus Armbruster 
---
 qapi/block-core.json  | 38 ++
 include/block/block.h |  1 +
 include/block/block_int.h |  1 +
 block.c   |  9 +
 block/file-posix.c| 32 
 block/qapi.c  |  5 +
 6 files changed, 86 insertions(+)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 7d3e05891c..859acea014 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -960,6 +960,41 @@
'*wr_latency_histogram': 'BlockLatencyHistogramInfo',
'*flush_latency_histogram': 'BlockLatencyHistogramInfo' } }
 
+##
+# @BlockStatsSpecificFile:
+#
+# File driver statistics
+#
+# @discard-nb-ok: The number of successful discard operations performed by
+# the driver.
+#
+# @discard-nb-failed: The number of failed discard operations performed by
+# the driver.
+#
+# @discard-bytes-ok: The number of bytes discarded by the driver.
+#
+# Since: 4.2
+##
+{ 'struct': 'BlockStatsSpecificFile',
+  'data': {
+  'discard-nb-ok': 'uint64',
+  'discard-nb-failed': 'uint64',
+  'discard-bytes-ok': 'uint64' } }
+
+##
+# @BlockStatsSpecific:
+#
+# Block driver specific statistics
+#
+# Since: 4.2
+##
+{ 'union': 'BlockStatsSpecific',
+  'base': { 'driver': 'BlockdevDriver' },
+  'discriminator': 'driver',
+  'data': {
+  'file': 'BlockStatsSpecificFile',
+  'host_device': 'BlockStatsSpecificFile' } }
+
 ##
 # @BlockStats:
 #
@@ -975,6 +1010,8 @@
 #
 # @stats:  A @BlockDeviceStats for the device.
 #
+# @driver-specific: Optional driver-specific stats. (Since 4.2)
+#
 # @parent: This describes the file block device if it has one.
 #  Contains recursively the statistics of the underlying
 #  protocol (e.g. the host file for a qcow2 image). If there is
@@ -988,6 +1025,7 @@
 { 'struct': 'BlockStats',
   'data': {'*device': 'str', '*qdev': 'str', '*node-name': 'str',
'stats': 'BlockDeviceStats',
+   '*driver-specific': 'BlockStatsSpecific',
'*parent': 'BlockStats',
'*backing': 'BlockStats'} }
 
diff --git a/include/block/block.h b/include/block/block.h
index 124ad40809..6c5279da99 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -503,6 +503,7 @@ int bdrv_get_flags(BlockDriverState *bs);
 int bdrv_get_info(BlockDriverState *bs, BlockDriverInfo *bdi);
 ImageInfoSpecific *bdrv_get_specific_info(BlockDriverState *bs,
   Error **errp);
+BlockStatsSpecific *bdrv_get_specific_stats(BlockDriverState *bs);
 void bdrv_round_to_clusters(BlockDriverState *bs,
 int64_t offset, int64_t bytes,
 int64_t *cluster_offset,
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 0422acdf1c..2b113eb3c7 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -366,6 +366,7 @@ struct BlockDriver {
 int (*bdrv_get_info)(BlockDriverState *bs, BlockDriverInfo *bdi);
 ImageInfoSpecific *(*bdrv_get_specific_info)(BlockDriverState *bs,
  Error **errp);
+BlockStatsSpecific *(*bdrv_get_specific_stats)(BlockDriverState *bs);
 
 int coroutine_fn (*bdrv_save_vmstate)(BlockDriverState *bs,
   QEMUIOVector *qiov,
diff --git a/block.c b/block.c
index 5944124845..5d3a5a6b95 100644
--- a/block.c
+++ b/block.c
@@ -5155,6 +5155,15 @@ ImageInfoSpecific *bdrv_get_specific_info(BlockDriverState *bs,
 return NULL;
 }
 
+BlockStatsSpecific *bdrv_get_specific_stats(BlockDriverState *bs)
+{
+BlockDriver *drv = bs->drv;
+if (!drv || !drv->bdrv_get_specific_stats) {
+return NULL;
+}
+return drv->bdrv_get_specific_stats(bs);
+}
+
 void bdrv_debug_event(BlockDriverState *bs, BlkdebugEvent event)
 {
 if (!bs || !bs->drv || !bs->drv->bdrv_debug_event) {
diff --git a/block/file-posix.c b/block/file-posix.c
index 6548dda309..767feadc21 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2821,6 +2821,36 @@ static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
 return 0;
 }
 
+static BlockStatsSpecificFile get_blockstats_specific_file(BlockDriverState *bs)
+{
+BDRVRawState *s = bs->opaque;
+return (BlockStatsSpecificFile) {
+.discard_nb_ok = s->stats.discard_nb_ok,
+.discard_nb_failed = s->stats.discard_nb_failed,
+.discard_bytes_ok = s->stats.discard_bytes_ok,
+};
+}
+
+static BlockStatsSpecific *raw_get_specific_stats(BlockDriverState *bs)
+{
+BlockStatsSpecific *stats = g_new(BlockStatsSpecific, 1);
+
+stats->driver = BLOCKDE

[Qemu-devel] [PATCH v9 8/9] file-posix: account discard operations

2019-09-06 Thread Anton Nefedov

This will help to identify how many of the user-issued discard operations
(accounted on a device level) have actually succeeded down on the host file
(even though the numbers will not be exactly the same if a non-raw format
driver is used, e.g. qcow2 sending metadata discards).

Note that these numbers will not include discards triggered by
write-zeroes + MAY_UNMAP calls.

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/file-posix.c | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 71f168ee2f..6548dda309 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -161,6 +161,11 @@ typedef struct BDRVRawState {
 bool needs_alignment;
 bool drop_cache;
 bool check_cache_dropped;
+struct {
+uint64_t discard_nb_ok;
+uint64_t discard_nb_failed;
+uint64_t discard_bytes_ok;
+} stats;
 
 PRManager *pr_mgr;
 } BDRVRawState;
@@ -2728,11 +2733,22 @@ static void coroutine_fn raw_co_invalidate_cache(BlockDriverState *bs,
 #endif /* !__linux__ */
 }
 
+static void raw_account_discard(BDRVRawState *s, uint64_t nbytes, int ret)
+{
+if (ret) {
+s->stats.discard_nb_failed++;
+} else {
+s->stats.discard_nb_ok++;
+s->stats.discard_bytes_ok += nbytes;
+}
+}
+
 static coroutine_fn int
 raw_do_pdiscard(BlockDriverState *bs, int64_t offset, int bytes, bool blkdev)
 {
 BDRVRawState *s = bs->opaque;
 RawPosixAIOData acb;
+int ret;
 
 acb = (RawPosixAIOData) {
 .bs = bs,
@@ -2746,7 +2762,9 @@ raw_do_pdiscard(BlockDriverState *bs, int64_t offset, int bytes, bool blkdev)
 acb.aio_type |= QEMU_AIO_BLKDEV;
 }
 
-return raw_thread_pool_submit(bs, handle_aiocb_discard, &acb);
+ret = raw_thread_pool_submit(bs, handle_aiocb_discard, &acb);
+raw_account_discard(s, bytes, ret);
+return ret;
 }
 
 static coroutine_fn int
@@ -3369,10 +3387,12 @@ static int fd_open(BlockDriverState *bs)
 static coroutine_fn int
 hdev_co_pdiscard(BlockDriverState *bs, int64_t offset, int bytes)
 {
+BDRVRawState *s = bs->opaque;
 int ret;
 
 ret = fd_open(bs);
 if (ret < 0) {
+raw_account_discard(s, bytes, ret);
 return ret;
 }
 return raw_do_pdiscard(bs, offset, bytes, true);
-- 
2.17.1

[Qemu-devel] [PATCH v9 1/9] qapi: group BlockDeviceStats fields

2019-09-06 Thread Anton Nefedov

Make the stat fields definition slightly more readable.
Also reorder total_time_ns stats read-write-flush as done elsewhere.
Cosmetic change only.

Signed-off-by: Anton Nefedov 
Reviewed-by: Alberto Garcia 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 qapi/block-core.json | 26 +++---
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index e6edd641f1..5ab554b54a 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -867,12 +867,12 @@
 # @flush_operations: The number of cache flush operations performed by the
 #device (since 0.15.0)
 #
-# @flush_total_time_ns: Total time spend on cache flushes in nano-seconds
-#   (since 0.15.0).
+# @rd_total_time_ns: Total time spent on reads in nanoseconds (since 0.15.0).
 #
-# @wr_total_time_ns: Total time spend on writes in nano-seconds (since 0.15.0).
+# @wr_total_time_ns: Total time spent on writes in nanoseconds (since 0.15.0).
 #
-# @rd_total_time_ns: Total_time_spend on reads in nano-seconds (since 0.15.0).
+# @flush_total_time_ns: Total time spent on cache flushes in nanoseconds
+#   (since 0.15.0).
 #
 # @wr_highest_offset: The offset after the greatest byte written to the
 # device.  The intended use of this information is for
@@ -925,14 +925,18 @@
 # Since: 0.14.0
 ##
 { 'struct': 'BlockDeviceStats',
-  'data': {'rd_bytes': 'int', 'wr_bytes': 'int', 'rd_operations': 'int',
-   'wr_operations': 'int', 'flush_operations': 'int',
-   'flush_total_time_ns': 'int', 'wr_total_time_ns': 'int',
-   'rd_total_time_ns': 'int', 'wr_highest_offset': 'int',
-   'rd_merged': 'int', 'wr_merged': 'int', '*idle_time_ns': 'int',
+  'data': {'rd_bytes': 'int', 'wr_bytes': 'int',
+   'rd_operations': 'int', 'wr_operations': 'int',
+   'flush_operations': 'int',
+   'rd_total_time_ns': 'int', 'wr_total_time_ns': 'int',
+   'flush_total_time_ns': 'int',
+   'wr_highest_offset': 'int',
+   'rd_merged': 'int', 'wr_merged': 'int',
+   '*idle_time_ns': 'int',
'failed_rd_operations': 'int', 'failed_wr_operations': 'int',
-   'failed_flush_operations': 'int', 'invalid_rd_operations': 'int',
-   'invalid_wr_operations': 'int', 'invalid_flush_operations': 'int',
+   'failed_flush_operations': 'int',
+   'invalid_rd_operations': 'int', 'invalid_wr_operations': 'int',
+   'invalid_flush_operations': 'int',
'account_invalid': 'bool', 'account_failed': 'bool',
'timed_stats': ['BlockDeviceTimedStats'],
'*rd_latency_histogram': 'BlockLatencyHistogramInfo',
-- 
2.17.1

[Qemu-devel] [PATCH v9 5/9] scsi: store unmap offset and nb_sectors in request struct

2019-09-06 Thread Anton Nefedov

This allows them to be reported in the error handler.
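
As a worked example of the unit change (the request fields are kept in 512-byte sectors rather than in device logical blocks), here is a small standalone sketch. `UnmapRange` and the helper names are made up for illustration, `SECTOR_SIZE` mirrors the value of BDRV_SECTOR_SIZE, and the block size is assumed to be a multiple of 512.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u /* same value as BDRV_SECTOR_SIZE */

typedef struct UnmapRange {
    uint64_t sector;       /* start, in 512-byte sectors */
    uint64_t sector_count; /* length, in 512-byte sectors */
} UnmapRange;

/* Convert an (LBA, block count) pair from device blocks to sectors. */
static UnmapRange unmap_range_from_blocks(uint64_t lba, uint32_t nb_blocks,
                                          uint32_t blocksize)
{
    uint64_t ratio = blocksize / SECTOR_SIZE; /* sectors per logical block */
    UnmapRange r = { lba * ratio, (uint64_t)nb_blocks * ratio };
    return r;
}

/* Byte offset and byte length, as passed on to the discard call. */
static uint64_t unmap_range_offset(UnmapRange r)
{
    return r.sector * SECTOR_SIZE;
}

static uint64_t unmap_range_bytes(UnmapRange r)
{
    return r.sector_count * SECTOR_SIZE;
}
```

For a 4096-byte block size, LBA 10 with 4 blocks becomes sector 80 with 32 sectors, i.e. byte offset 40960 and length 16384, the same bytes as `lba * blocksize` and `nb_blocks * blocksize` before the change.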

Signed-off-by: Anton Nefedov 
---
 hw/scsi/scsi-disk.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 915641a0f1..b3dd21800d 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -1608,8 +1608,6 @@ static void scsi_unmap_complete_noio(UnmapCBData *data, int ret)
 {
 SCSIDiskReq *r = data->r;
 SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
-uint64_t sector_num;
-uint32_t nb_sectors;
 
 assert(r->req.aiocb == NULL);
 if (scsi_disk_req_check_error(r, ret, false)) {
@@ -1617,16 +1615,18 @@ static void scsi_unmap_complete_noio(UnmapCBData *data, int ret)
 }
 
 if (data->count > 0) {
-sector_num = ldq_be_p(&data->inbuf[0]);
-nb_sectors = ldl_be_p(&data->inbuf[8]) & 0xffffffffULL;
-if (!check_lba_range(s, sector_num, nb_sectors)) {
+r->sector = ldq_be_p(&data->inbuf[0])
+* (s->qdev.blocksize / BDRV_SECTOR_SIZE);
+r->sector_count = (ldl_be_p(&data->inbuf[8]) & 0xffffffffULL)
+* (s->qdev.blocksize / BDRV_SECTOR_SIZE);
+if (!check_lba_range(s, r->sector, r->sector_count)) {
 scsi_check_condition(r, SENSE_CODE(LBA_OUT_OF_RANGE));
 goto done;
 }
 
 r->req.aiocb = blk_aio_pdiscard(s->qdev.conf.blk,
-sector_num * s->qdev.blocksize,
-nb_sectors * s->qdev.blocksize,
+r->sector * BDRV_SECTOR_SIZE,
+r->sector_count * BDRV_SECTOR_SIZE,
 scsi_unmap_complete, data);
 data->count--;
 data->inbuf += 16;
-- 
2.17.1

Re: [Qemu-devel] [PATCH 0/2] block/file-posix: Fix xfs_write_zeroes()

2019-08-23 Thread Anton Nefedov

On 22/8/2019 7:26 PM, Max Reitz wrote:
> Lukáš ran over a nasty regression in our xfs_write_zeroes() function
> (sorry, my fault) made apparent by a recent patch from Anton that makes
> qcow2 images heavily exercise the offending code path.
> 
> This series fixes the bug and adds a test to prevent it from
> reoccurring.
> 
> 
> Max Reitz (2):
>block/file-posix: Fix xfs_write_zeroes()
>iotests: Test reverse sub-cluster qcow2 writes
> 
>   block/file-posix.c | 16 ++---
>   tests/qemu-iotests/265 | 67 ++
>   tests/qemu-iotests/265.out |  6 
>   tests/qemu-iotests/group   |  1 +
>   4 files changed, 85 insertions(+), 5 deletions(-)
>   create mode 100755 tests/qemu-iotests/265
>   create mode 100644 tests/qemu-iotests/265.out
> 

Nice! thanks Max

Reviewed-by: Anton Nefedov

Re: [Qemu-devel] Broken aarch64 by qcow2: skip writing zero buffers to empty COW areas [v2]

2019-08-21 Thread Anton Nefedov

On 21/8/2019 5:14 PM, Lukáš Doktor wrote:
> Hello guys,
> 
> First attempt was rejected due to zip attachment, let's try it again with 
> just Avocado-vt debug.log and serial console log files attached.
> 
> I bisected a regression on aarch64 all the way to this commit: "qcow2: skip 
> writing zero buffers to empty COW areas" 
> c8bb23cbdbe32f5c326365e0a82e1b0e68cdcd8a. Would you please have a look at it?
> 
> My reproducer is running kickstart installation of RHEL-8 from DVD on aarch64 
> gicv3 machine, which never finishes since this commit, where anaconda 
> complains about package installation, occasionally there are also XFS 
> metadata corruption messages on serial console:
> 

hi,

this looks scary :( I doubt that it has anything to do with aarch64;
more likely it's some really tricky timing (or, possibly, a broken
environment, like a broken fallocate() on the host? who knows..)

Is it always the same machine you observe this issue on? Did you try
others?

I just wonder if it's worth trying to reproduce it on my machine
(and I don't have aarch64 on hand now). I can probably come up with
some torture test that will continuously write to qcow2 with random
offsets/sizes and verify the result.
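
A rough sketch of what such a torture test could look like, assuming it runs against a file descriptor inside the guest (or any file backed by the image). The sizes, the PRNG and the `torture_fd` name are arbitrary choices for illustration, not an existing tool.

```c
#define _XOPEN_SOURCE 700 /* feature-test macro for pwrite/pread/mkstemp */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define IMG_SIZE (1u << 20)  /* 1 MiB test area */
#define MAX_IO   (64u << 10) /* up to 64 KiB per write */

/* Tiny deterministic PRNG so that a failing seed can be replayed. */
static uint32_t next_rand(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 8;
}

/*
 * Write random-sized, random-offset patterns to fd, mirror them in memory,
 * then read everything back and compare.
 * Returns the number of mismatching bytes, or -1 on an I/O error.
 */
static long torture_fd(int fd, uint32_t seed, int iterations)
{
    uint8_t *shadow = calloc(1, IMG_SIZE);
    uint8_t *buf = malloc(IMG_SIZE);
    long mismatches = -1;

    if (!shadow || !buf || ftruncate(fd, IMG_SIZE) < 0) {
        goto out;
    }
    for (int i = 0; i < iterations; i++) {
        uint32_t len = next_rand(&seed) % MAX_IO + 1;
        uint32_t off = next_rand(&seed) % (IMG_SIZE - len + 1);

        memset(buf, (int)(next_rand(&seed) & 0xff), len);
        if (pwrite(fd, buf, len, off) != (ssize_t)len) {
            goto out;
        }
        memcpy(shadow + off, buf, len); /* expected contents */
    }
    if (pread(fd, buf, IMG_SIZE, 0) != (ssize_t)IMG_SIZE) {
        goto out;
    }
    mismatches = 0;
    for (uint32_t i = 0; i < IMG_SIZE; i++) {
        mismatches += buf[i] != shadow[i];
    }
out:
    free(buf);
    free(shadow);
    return mismatches;
}
```

A real run would want a much larger area, O_DIRECT or periodic cache drops, and sub-cluster sized writes to actually hit the COW paths; this only shows the write/shadow/verify loop.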

If you could kindly reproduce it again then we can probably start with
enabling qemu traces by appending
  " -trace bdrv* -trace qcow2* -trace file=/some_huge_partition/qemu.log"
to the command line.

Beware that it's going to produce a huge amount of logs.

Also, the corrupted image and the serial log will be required for
investigation.

thanks,

/Anton

Re: [Qemu-devel] [PATCH v8 9/9] qapi: query-blockstat: add driver specific file-posix stats

2019-08-21 Thread Anton Nefedov

On 21/8/2019 2:21 PM, Max Reitz wrote:
> On 21.08.19 13:00, Anton Nefedov wrote:
>> On 12/8/2019 10:04 PM, Max Reitz wrote:
>>> On 16.05.19 16:33, Anton Nefedov wrote:
>>>> A block driver can provide a callback to report driver-specific
>>>> statistics.
>>>>
>>>> file-posix driver now reports discard statistics
>>>>
>>>> Signed-off-by: Anton Nefedov 
>>>> Reviewed-by: Vladimir Sementsov-Ogievskiy 
>>>> Acked-by: Markus Armbruster 
>>>> ---
>>>>qapi/block-core.json  | 38 ++
>>>>include/block/block.h |  1 +
>>>>include/block/block_int.h |  1 +
>>>>block.c   |  9 +
>>>>block/file-posix.c| 38 +++---
>>>>block/qapi.c  |  5 +
>>>>6 files changed, 89 insertions(+), 3 deletions(-)
>>>
>>>
>>>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>>>> index 55194f84ce..368e09ae37 100644
>>>> --- a/qapi/block-core.json
>>>> +++ b/qapi/block-core.json
>>>> @@ -956,6 +956,41 @@
>>>>   '*wr_latency_histogram': 'BlockLatencyHistogramInfo',
>>>>   '*flush_latency_histogram': 'BlockLatencyHistogramInfo' } }
>>>>
>>>> +##
>>>> +# @BlockStatsSpecificFile:
>>>> +#
>>>> +# File driver statistics
>>>> +#
>>>> +# @discard-nb-ok: The number of successful discard operations performed by
>>>> +# the driver.
>>>> +#
>>>> +# @discard-nb-failed: The number of failed discard operations performed by
>>>> +# the driver.
>>>> +#
>>>> +# @discard-bytes-ok: The number of bytes discarded by the driver.
>>>> +#
>>>> +# Since: 4.1
>>>> +##
>>>> +{ 'struct': 'BlockStatsSpecificFile',
>>>> +  'data': {
>>>> +  'discard-nb-ok': 'uint64',
>>>> +  'discard-nb-failed': 'uint64',
>>>> +  'discard-bytes-ok': 'uint64' } }
>>>> +
>>>> +##
>>>> +# @BlockStatsSpecific:
>>>> +#
>>>> +# Block driver specific statistics
>>>> +#
>>>> +# Since: 4.1
>>>> +##
>>>> +{ 'union': 'BlockStatsSpecific',
>>>> +  'base': { 'driver': 'BlockdevDriver' },
>>>> +  'discriminator': 'driver',
>>>> +  'data': {
>>>> +  'file': 'BlockStatsSpecificFile',
>>>> +  'host_device': 'BlockStatsSpecificFile' } }
>>>
>>> I would like to use this chance to complain that I find this awkward.
>>> My problem is that I don’t know how any management application is
>>> supposed to reasonably consume this.  It feels weird to potentially have
>>> to recognize the result for every block driver.
>>>
>>> I would now like to note that I’m clearly not in a position to block
>>> this at this point, because I’ve had a year to do so, I didn’t, so it
>>> would be unfair to do it now.
>>>
>>> (Still, I feel like if I have a concern, I should raise it, even if it’s
>>> too late.)
>>>
>>> I know Markus has proposed this, but I don’t understand why.  He set
>>> ImageInfoSpecific as a precedent, but that has a different reasoning
>>> behind it.  The point for that is that it simply doesn’t work any other
>>> way, because it is clearly format-specific information that cannot be
>>> shared between drivers.  Anything that can be shared is put into
>>> ImageInfo (like the cluster size).
>>>
>>> We have the same constellation here, BlockStats contains common stuff,
>>> and BlockStatsSpecific would contain driver-specific stuff.  But to me,
>>> BlockStatsSpecificFile doesn’t look very special.  It looks like it just
>>> duplicates fields that already exist in BlockDeviceStats.
>>>
>>>
>>> (Furthermore, most of ImageInfoSpecific is actually not useful to
>>> management software, but only as an information for humans (and having
>>> such a structure for that is perfectly fine).  But these stats don’t
>>> really look like something for immediate human consumption.)
>>>
>>>
>>> So I wonder why you don’t just put this information into
>>> BlockDeviceStats.  From what I can tell looking at
>>> bdrv_query_bds_stats() and qmp_query_blockstats(), the @stat

Re: [Qemu-devel] [PATCH v8 4/9] ide: account UNMAP (TRIM) operations

2019-08-21 Thread Anton Nefedov

On 12/8/2019 9:16 PM, Max Reitz wrote:
> On 16.05.19 16:33, Anton Nefedov wrote:
>> Signed-off-by: Anton Nefedov 
>> Reviewed-by: Vladimir Sementsov-Ogievskiy 
>> ---
>>   hw/ide/core.c | 12 
>>   1 file changed, 12 insertions(+)
>>
>> diff --git a/hw/ide/core.c b/hw/ide/core.c
>> index 6afadf894f..3a7ac93777 100644
>> --- a/hw/ide/core.c
>> +++ b/hw/ide/core.c
>> @@ -441,6 +441,14 @@ static void ide_issue_trim_cb(void *opaque, int ret)
>>   TrimAIOCB *iocb = opaque;
>>   IDEState *s = iocb->s;
>>   
>> +if (iocb->i >= 0) {
>> +if (ret >= 0) {
>> +block_acct_done(blk_get_stats(s->blk), &s->acct);
>> +} else {
>> +block_acct_failed(blk_get_stats(s->blk), &s->acct);
> 
> Hmm, in other places (ide_handle_rw_error() here or
> scsi_handle_rw_error() in scsi-disk) only report this with
> BLOCK_ERROR_ACTION_REPORT.
> 
> So I’m wondering whether the same should be done here.
> 

Many other places do not check the action:
scsi_{dma|read|write}_complete(), hw/ide/atapi.c calls

That feels somewhat weird to me, to account the operation complete when
it's not.
(But I don't really know the use cases of BLOCK_ERROR_ACTION_IGNORE).

Can it be that the error action is only checked so that the operation is
not accounted twice (as there might be block_acct_done() later in the
common path with ACTION_IGNORE)?

/Anton

Re: [Qemu-devel] [PATCH v8 5/9] scsi: store unmap offset and nb_sectors in request struct

2019-08-21 Thread Anton Nefedov

On 12/8/2019 8:58 PM, Max Reitz wrote:
> On 16.05.19 16:33, Anton Nefedov wrote:
>> it allows to report it in the error handler
>>
>> Signed-off-by: Anton Nefedov 
>> Reviewed-by: Vladimir Sementsov-Ogievskiy 
>> Reviewed-by: Alberto Garcia 
>> ---
>>   hw/scsi/scsi-disk.c | 12 +---
>>   1 file changed, 5 insertions(+), 7 deletions(-)
> 
> (Sorry for the late reply :-/)
> 
>> diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
>> index e7e865ab3b..b43254103c 100644
>> --- a/hw/scsi/scsi-disk.c
>> +++ b/hw/scsi/scsi-disk.c
>> @@ -1602,8 +1602,6 @@ static void scsi_unmap_complete_noio(UnmapCBData 
>> *data, int ret)
>>   {
>>   SCSIDiskReq *r = data->r;
>>   SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
>> -uint64_t sector_num;
>> -uint32_t nb_sectors;
>>   
>>   assert(r->req.aiocb == NULL);
>>   if (scsi_disk_req_check_error(r, ret, false)) {
>> @@ -1611,16 +1609,16 @@ static void scsi_unmap_complete_noio(UnmapCBData 
>> *data, int ret)
>>   }
>>   
>>   if (data->count > 0) {
>> -sector_num = ldq_be_p(&data->inbuf[0]);
>> -nb_sectors = ldl_be_p(&data->inbuf[8]) & 0xffffffffULL;
>> -if (!check_lba_range(s, sector_num, nb_sectors)) {
>> +r->sector = ldq_be_p(&data->inbuf[0]);
>> +r->sector_count = ldl_be_p(&data->inbuf[8]) & 0xffffffffULL;
>> +if (!check_lba_range(s, r->sector, r->sector_count)) {
>>   scsi_check_condition(r, SENSE_CODE(LBA_OUT_OF_RANGE));
>>   goto done;
>>   }
>>   
>>   r->req.aiocb = blk_aio_pdiscard(s->qdev.conf.blk,
>> -sector_num * s->qdev.blocksize,
>> -nb_sectors * s->qdev.blocksize,
>> +r->sector * s->qdev.blocksize,
>> +r->sector_count * s->qdev.blocksize,
> 
> This looks to me like these are not necessarily in terms of 512-byte
> sectors.  It doesn’t seem to make anything technically wrong, because
> patch 7 takes that into account.
> 
> But it’s still weird if everything else in this file treats these fields
> as being in terms of 512 byte sectors (and they are actually defined
> this way in SCSIDiskReq).
> 

Nice that you caught this, thanks! I guess variable names misled me
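
To illustrate the unit mismatch discussed above, here is a minimal
standalone sketch (not QEMU code; SECTOR_SIZE, lba_to_sectors() and
sectors_to_bytes() are made-up names) of converting an UNMAP descriptor's
LBA, given in device logical blocks, into 512-byte sectors. It assumes the
logical block size is a multiple of 512.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constant: the 512-byte unit the request fields are defined in. */
#define SECTOR_SIZE 512u

/* Convert an LBA (or a block count) from device logical blocks to 512-byte sectors. */
static uint64_t lba_to_sectors(uint64_t lba, uint32_t blocksize)
{
    /* Assumption: the logical block size is a non-zero multiple of 512. */
    assert(blocksize >= SECTOR_SIZE && blocksize % SECTOR_SIZE == 0);
    return lba * (blocksize / SECTOR_SIZE);
}

/* Convert a value already stored in 512-byte sectors to a byte offset. */
static uint64_t sectors_to_bytes(uint64_t sectors)
{
    return sectors * SECTOR_SIZE;
}
```

With a 4096-byte block size, LBA 10 becomes sector 80, and converting that
back to bytes gives the same offset as 10 * 4096.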

Re: [Qemu-devel] [PATCH v8 9/9] qapi: query-blockstat: add driver specific file-posix stats

2019-08-21 Thread Anton Nefedov

On 12/8/2019 10:04 PM, Max Reitz wrote:
> On 16.05.19 16:33, Anton Nefedov wrote:
>> A block driver can provide a callback to report driver-specific
>> statistics.
>>
>> file-posix driver now reports discard statistics
>>
>> Signed-off-by: Anton Nefedov 
>> Reviewed-by: Vladimir Sementsov-Ogievskiy 
>> Acked-by: Markus Armbruster 
>> ---
>>   qapi/block-core.json  | 38 ++
>>   include/block/block.h |  1 +
>>   include/block/block_int.h |  1 +
>>   block.c   |  9 +
>>   block/file-posix.c| 38 +++---
>>   block/qapi.c  |  5 +
>>   6 files changed, 89 insertions(+), 3 deletions(-)
> 
> 
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index 55194f84ce..368e09ae37 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -956,6 +956,41 @@
>>  '*wr_latency_histogram': 'BlockLatencyHistogramInfo',
>>  '*flush_latency_histogram': 'BlockLatencyHistogramInfo' } }
>>   
>> +##
>> +# @BlockStatsSpecificFile:
>> +#
>> +# File driver statistics
>> +#
>> +# @discard-nb-ok: The number of successful discard operations performed by
>> +# the driver.
>> +#
>> +# @discard-nb-failed: The number of failed discard operations performed by
>> +# the driver.
>> +#
>> +# @discard-bytes-ok: The number of bytes discarded by the driver.
>> +#
>> +# Since: 4.1
>> +##
>> +{ 'struct': 'BlockStatsSpecificFile',
>> +  'data': {
>> +  'discard-nb-ok': 'uint64',
>> +  'discard-nb-failed': 'uint64',
>> +  'discard-bytes-ok': 'uint64' } }
>> +
>> +##
>> +# @BlockStatsSpecific:
>> +#
>> +# Block driver specific statistics
>> +#
>> +# Since: 4.1
>> +##
>> +{ 'union': 'BlockStatsSpecific',
>> +  'base': { 'driver': 'BlockdevDriver' },
>> +  'discriminator': 'driver',
>> +  'data': {
>> +  'file': 'BlockStatsSpecificFile',
>> +  'host_device': 'BlockStatsSpecificFile' } }
> 
> I would like to use this chance to complain that I find this awkward.
> My problem is that I don’t know how any management application is
> supposed to reasonably consume this.  It feels weird to potentially have
> to recognize the result for every block driver.
> 
> I would now like to note that I’m clearly not in a position to block
> this at this point, because I’ve had a year to do so, I didn’t, so it
> would be unfair to do it now.
> 
> (Still, I feel like if I have a concern, I should raise it, even if it’s
> too late.)
> 
> I know Markus has proposed this, but I don’t understand why.  He set
> ImageInfoSpecific as a precedent, but that has a different reasoning
> behind it.  The point for that is that it simply doesn’t work any other
> way, because it is clearly format-specific information that cannot be
> shared between drivers.  Anything that can be shared is put into
> ImageInfo (like the cluster size).
> 
> We have the same constellation here, BlockStats contains common stuff,
> and BlockStatsSpecific would contain driver-specific stuff.  But to me,
> BlockStatsSpecificFile doesn’t look very special.  It looks like it just
> duplicates fields that already exist in BlockDeviceStats.
> 
> 
> (Furthermore, most of ImageInfoSpecific is actually not useful to
> management software, but only as an information for humans (and having
> such a structure for that is perfectly fine).  But these stats don’t
> really look like something for immediate human consumption.)
> 
> 
> So I wonder why you don’t just put this information into
> BlockDeviceStats.  From what I can tell looking at
> bdrv_query_bds_stats() and qmp_query_blockstats(), the @stats field is
> currently completely 0 if @query-nodes is true.
> 
> (Furthermore, I wonder whether it would make sense to re-add
> BlockAcctStats to each BDS and then let the generic block code do the
> accounting on it.  I moved it to the BB in 7f0e9da6f13 because we didn’t
> care about node-level information at the time, but maybe it’s time to
> reconsider.)
> 
> 
> Anyway, as I’ve said, I fully understand that complaining about a design
> decision is just unfair at this point, so this is not a veto.
> 

hi!

Having both "unmap" and "discard" stats in the same list feels weird.
The intention is that unmap belongs to the device level, and discard is
from the driver level. Now we have a separate structure named "driver-
specific". Could also be called "driver-stats".

W

Re: [Qemu-devel] [Qemu-block] [PATCH v7 0/9] discard blockstats

2019-06-04 Thread Anton Nefedov

On 3/6/2019 1:09 PM, Stefano Garzarella wrote:
> On Tue, May 14, 2019 at 12:10:40PM +0000, Anton Nefedov wrote:
>> hi,
>>
>> yet another take for this patch series; please kindly consider these for 4.1
>>
>> Just a few cosmetic comments were received for v6 so this is mostly
>> a rebase+ping.
>>
>> new in v7:
>>  - general rebase
>>  - since clauses -> 4.1
>>  - patch 8: not completely trivial rebase: raw_account_discard moved to
>>common raw_do_pdiscard()
>>  - patch 9: comment wording fixed
>>
>> v6: http://lists.nongnu.org/archive/html/qemu-devel/2018-11/msg06633.html
>> v5: http://lists.nongnu.org/archive/html/qemu-devel/2018-10/msg06828.html
>> v4: http://lists.nongnu.org/archive/html/qemu-devel/2018-08/msg04308.html
>> v3: http://lists.nongnu.org/archive/html/qemu-devel/2018-06/msg03688.html
>>
> 
> Hi Anton,
> since qemu 4.0 added the discard support on virtio-blk, should we consider it
> in this series?
> 
> If you prefer I can work on it and send you the patch.
> 

hi Stefano,

if this series is finally getting in, it is really nice if you can add
virtio-blk support too.

thanks!

Re: [Qemu-devel] [RFC 1/3] block: Add ImageRotationalInfo

2019-05-27 Thread Anton Nefedov

On 27/5/2019 3:57 PM, Max Reitz wrote:
> On 27.05.19 14:37, Alberto Garcia wrote:
>> On Mon 27 May 2019 02:16:53 PM CEST, Max Reitz wrote:
>>> On 26.05.19 17:08, Alberto Garcia wrote:
>>>> On Fri 24 May 2019 07:28:10 PM CEST, Max Reitz  wrote:
>>>>> +##
>>>>> +# @ImageRotationalInfo:
>>>>> +#
>>>>> +# Indicates whether an image is stored on a rotating disk or not.
>>>>> +#
>>>>> +# @solid-state: Image is stored on a solid-state drive
>>>>> +#
>>>>> +# @rotating:Image is stored on a rotating disk
>>>>
>>>> What happens when you cannot tell? You assume it's solid-state?
>>>
>>> When *I* cannot tell?  This field is generally optional, so in that case
>>> it just will not be present.
>>>
>>> (When Linux cannot tell?  I don’t know :-))
>>>

Linux defaults to rotational == 1 unless the driver sets
QUEUE_FLAG_NONROT.

By the way as far as I can tell, qemu does not report this flag unless
explicitly set in a device property.

 DEFINE_PROP_UINT16("rotation_rate", IDEDrive, dev.rotation_rate, 0),
and
 DEFINE_PROP_UINT16("rotation_rate", SCSIDiskState, rotation_rate, 0),

>>> Do you think there should be an explicit value for that?
>>
>> I'll try to rephrase:
>>
>> we have a new optimization that improves performance on SSDs but reduces
>> performance on HDDs, so this series would detect where an image is
>> stored in order to enable the faster code path for each case.
>>
>> What happens if QEMU cannot detect if we have a solid drive or a
>> rotational drive? (e.g. a remote storage backend). Will it default to
>> enabling preallocation using write_zeroes()?
> 
> In this series, yes.  That is the default I chose.
> 
> We have to make a separate decision for each case.  In the case of
> filling newly allocated areas with zeroes, I think the performance gain
> for SSDs is more important than the performance loss for HDDs.  That is
> what I wrote in my response to Anton’s series.  So I took the series
> even without it being able to distinguish both cases at all.
> Consequentially, I believe it is reasonable for that to be the default
> behavior if we cannot tell.
> 
> I think in general optimizing for SSDs should probably be the default.
> HDDs are slow anyway, so whoever uses them probably doesn’t care about
> performance too much anyway...?  Whereas people using SSDs probably do.
>   But as I said, we can and should always make a separate decision for
> each case.
> 

Overall it looks good to me, but I wonder how we ensure that both variants
are covered by tests. Do we need a blockdev option to enforce the mode?
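
A rough sketch of what such an override could look like, as a standalone
model rather than QEMU code. RotInfo, rot_info_from_sysfs() and
use_zero_prealloc() are made-up names, and the "forced" argument stands in
for a hypothetical option that does not exist in the series. The input
string is assumed to be the contents of /sys/block/<dev>/queue/rotational.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical tri-state; this is not the QAPI enum proposed in the series. */
typedef enum {
    ROT_UNKNOWN = 0,
    ROT_SOLID_STATE,
    ROT_ROTATING,
    ROT__MAX,
} RotInfo;

/* Interpret the text read from the sysfs "rotational" attribute ("0" or "1"). */
static RotInfo rot_info_from_sysfs(const char *text)
{
    if (text == NULL) {
        return ROT_UNKNOWN;
    }
    if (text[0] == '0' && (text[1] == '\n' || text[1] == '\0')) {
        return ROT_SOLID_STATE;
    }
    if (text[0] == '1' && (text[1] == '\n' || text[1] == '\0')) {
        return ROT_ROTATING;
    }
    return ROT_UNKNOWN;
}

/*
 * Decide whether to take the write_zeroes() preallocation path.
 * An explicit override wins over detection, so a test can force either
 * variant; if nothing is known, the solid-state path is the default.
 */
static bool use_zero_prealloc(RotInfo detected, RotInfo forced)
{
    assert(detected < ROT__MAX && forced < ROT__MAX);
    RotInfo effective = (forced != ROT_UNKNOWN) ? forced : detected;
    return effective != ROT_ROTATING;
}
```

A test would then run once with the override set to rotating and once with
it set to solid-state, regardless of the storage the test host has.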

/Anton

Re: [Qemu-devel] [PATCH v14 1/1] qcow2: skip writing zero buffers to empty COW areas

2019-05-23 Thread Anton Nefedov

On 22/5/2019 11:40 PM, Max Reitz wrote:
> On 16.05.19 16:27, Anton Nefedov wrote:
>> If COW areas of the newly allocated clusters are zeroes on the backing
>> image, efficient bdrv_write_zeroes(flags=BDRV_REQ_NO_FALLBACK) can be
>> used on the whole cluster instead of writing explicit zero buffers later
>> in perform_cow().
>>
>> iotest 060:
>> write to the discarded cluster does not trigger COW anymore.
>> Use a backing image instead.
>>
>> Signed-off-by: Anton Nefedov 
>> ---
>>   qapi/block-core.json   |  4 +-
>>   block/qcow2.h  |  6 +++
>>   block/qcow2-cluster.c  |  2 +-
>>   block/qcow2.c  | 93 +-
>>   block/trace-events |  1 +
>>   tests/qemu-iotests/060 |  7 ++-
>>   tests/qemu-iotests/060.out |  5 +-
>>   7 files changed, 112 insertions(+), 6 deletions(-)
> 
> [...]
> 
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index 8e024007db..e6b1293ddf 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
> 
> [...]
> 
>> @@ -2145,6 +2150,80 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
>>   return false;
>>   }
>>   
>> +static bool is_unallocated(BlockDriverState *bs, int64_t offset, int64_t 
>> bytes)
>> +{
>> +int64_t nr;
>> +return !bytes ||
>> +(!bdrv_is_allocated_above(bs, NULL, offset, bytes, &nr) && nr == 
>> bytes);
> 
> It's a pity that this bdrv_is_allocated() throws away BDRV_BLOCK_ZERO
> information.  If something in the backing chain has explicit zero
> clusters there, we could get the information for free, but this will
> consider it allocated.
> 
> Wouldn't it make sense to make bdrv_co_block_status_above() public and
> then use that (with want_zero == false)?
> 
> (But that can be done later, too, of course.)
> 

or something like bdrv_has_non_zero_data(), with the argument
want_zero (maybe call it check_file in this case).

Yes, I'd rather implement it separately.

>> +}
>> +
>> +static bool is_zero_cow(BlockDriverState *bs, QCowL2Meta *m)
>> +{
>> +/*
>> + * This check is designed for optimization shortcut so it must be
>> + * efficient.
>> + * Instead of is_zero(), use is_unallocated() as it is faster (but not
>> + * as accurate and can result in false negatives).
>> + */
>> +return is_unallocated(bs, m->offset + m->cow_start.offset,
>> +  m->cow_start.nb_bytes) &&
>> +   is_unallocated(bs, m->offset + m->cow_end.offset,
>> +  m->cow_end.nb_bytes);
>> +}
>> +
>> +static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
>> +{
>> +BDRVQcow2State *s = bs->opaque;
>> +QCowL2Meta *m;
>> +
>> +if (!(s->data_file->bs->supported_zero_flags & BDRV_REQ_NO_FALLBACK)) {
>> +return 0;
>> +}
>> +
>> +if (bs->encrypted) {
>> +return 0;
>> +}
>> +
>> +for (m = l2meta; m != NULL; m = m->next) {
>> +int ret;
>> +
>> +if (!m->cow_start.nb_bytes && !m->cow_end.nb_bytes) {
>> +continue;
>> +}
>> +
>> +if (!is_zero_cow(bs, m)) {
>> +continue;
>> +}
>> +
>> +/*
>> + * instead of writing zero COW buffers,
>> + * efficiently zero out the whole clusters
>> + */
>> +
>> +ret = qcow2_pre_write_overlap_check(bs, 0, m->alloc_offset,
>> +m->nb_clusters * 
>> s->cluster_size,
>> +true);
>> +if (ret < 0) {
>> +return ret;
>> +}
>> +
>> +BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_SPACE);
>> +ret = bdrv_co_pwrite_zeroes(s->data_file, m->alloc_offset,
>> +m->nb_clusters * s->cluster_size,
>> +BDRV_REQ_NO_FALLBACK);
> 
> Mostly I'm writing this mail because of this.  Zeroing the whole range
> seems inefficient to me for very large requests, and the commit message
> doesn't say anything about the reasoning behind unconditionally zeroing
> everything.
> 
> Benchmarking looks like in most cases it is about equal to zeroing head
> and tail separately.  But if I pre-fill an image with random data, it
> is much slower:
> 
> $ qemu-img create -f qcow2

Re: [Qemu-devel] [PATCH v8 3/9] block: add empty account cookie type

2019-05-16 Thread Anton Nefedov

On 16/5/2019 6:34 PM, Vladimir Sementsov-Ogievskiy wrote:
> 16.05.2019 17:33, Anton Nefedov wrote:
>> This adds some protection from accounting uninitialized cookie.
>> That is, block_acct_failed/done without previous block_acct_start;
>> in that case, cookie probably holds values from previous operation.
>>
>> (Note: it might also be uninitialized holding garbage value and there is
>>still "< BLOCK_MAX_IOTYPE" assertion for that.
>>So block_acct_failed/done without previous block_acct_start should be used
>>with caution.)
>>
>> Currently this is particularly useful in ide code where it's hard to
>> keep track whether the request started accounting or not. For example,
>> trim requests do the accounting separately.
>>
>> Signed-off-by: Anton Nefedov 
>> ---
>>include/block/accounting.h | 1 +
>>block/accounting.c | 6 ++
>>2 files changed, 7 insertions(+)
>>
>> diff --git a/include/block/accounting.h b/include/block/accounting.h
>> index ba8b04d572..878b4c3581 100644
>> --- a/include/block/accounting.h
>> +++ b/include/block/accounting.h
>> @@ -33,6 +33,7 @@ typedef struct BlockAcctTimedStats BlockAcctTimedStats;
>>typedef struct BlockAcctStats BlockAcctStats;
>>
>>enum BlockAcctType {
>> +BLOCK_ACCT_NONE = 0,
>>BLOCK_ACCT_READ,
>>BLOCK_ACCT_WRITE,
>>BLOCK_ACCT_FLUSH,
>> diff --git a/block/accounting.c b/block/accounting.c
>> index 70a3d9a426..8d41c8a83a 100644
>> --- a/block/accounting.c
>> +++ b/block/accounting.c
>> @@ -195,6 +195,10 @@ static void block_account_one_io(BlockAcctStats *stats, 
>> BlockAcctCookie *cookie,
>>
>>assert(cookie->type < BLOCK_MAX_IOTYPE);
>>
>> +if (cookie->type == BLOCK_ACCT_NONE) {
> 
> worth error_report() ?
> 

I don't think we should necessarily consider it an error;
as mentioned in the commit message this might be useful in some places
like IDE trim handling.
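
To make the intended behaviour concrete, here is a simplified standalone
model of it (not the actual block/accounting.c code; the Acct* names are
made up, and locking and latency tracking are left out). A cookie of type
NONE is silently skipped, and the cookie is reset to NONE once accounted,
so a second completion without a new start is a no-op.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the accounting types, with NONE as value 0. */
enum AcctType {
    ACCT_NONE = 0,
    ACCT_READ,
    ACCT_WRITE,
    ACCT_FLUSH,
    ACCT_UNMAP,
    ACCT_MAX,
};

typedef struct {
    uint64_t nr_ops[ACCT_MAX];
    uint64_t failed_ops[ACCT_MAX];
} AcctStats;

typedef struct {
    enum AcctType type;
} AcctCookie;

/* Start accounting: mark the cookie as carrying a real operation. */
static void acct_start(AcctCookie *cookie, enum AcctType type)
{
    assert(type != ACCT_NONE && type < ACCT_MAX);
    cookie->type = type;
}

/* Finish accounting; an empty cookie is not an error, it is just ignored. */
static void acct_one_io(AcctStats *stats, AcctCookie *cookie, bool failed)
{
    assert(cookie->type < ACCT_MAX);   /* still catches garbage values */

    if (cookie->type == ACCT_NONE) {
        return;
    }
    if (failed) {
        stats->failed_ops[cookie->type]++;
    } else {
        stats->nr_ops[cookie->type]++;
    }
    cookie->type = ACCT_NONE;          /* protect against double accounting */
}
```

Note that this only helps when the cookie memory is zero-initialized or was
used before; a cookie holding garbage can still trip the assertion.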

[Qemu-devel] [PATCH v8 0/9] discard blockstats

2019-05-16 Thread Anton Nefedov

..apparently v7 ended up in a weird base64 that would not easily git-am.
Resending.



hi,

yet another take for this patch series; please kindly consider these for 4.1

Just a few cosmetic comments were received for v6 so this is mostly
a rebase+ping.

new in v7:
- general rebase
- since clauses -> 4.1
- patch 8: not completely trivial rebase: raw_account_discard moved to
  common raw_do_pdiscard()
- patch 9: comment wording fixed

v6: http://lists.nongnu.org/archive/html/qemu-devel/2018-11/msg06633.html
v5: http://lists.nongnu.org/archive/html/qemu-devel/2018-10/msg06828.html
v4: http://lists.nongnu.org/archive/html/qemu-devel/2018-08/msg04308.html
v3: http://lists.nongnu.org/archive/html/qemu-devel/2018-06/msg03688.html



qmp query-blockstats provides stats info for write/read/flush ops.

Patches 1-7 implement the same for the discard (unmap) command for scsi
and ide disks.
The discard stats "unmap_ops / unmap_bytes" are supposed to account for the
ops that have completed without an error.

However, the discard operation is advisory. Specifically,
 - common block layer ignores ENOTSUP error code.
   That might be returned if the block driver does not support discard,
   or discard has been configured to be ignored.
 - format drivers such as qcow2 may ignore discard if they were configured
   to ignore that, or if the corresponding area is already marked unused
   (unallocated / zero clusters).

What is actually useful is the number of bytes really discarded
down on the host filesystem.
To achieve that, driver-specific statistics have been added to blockstats
(patch 9).
With patch 8, file-posix driver accounts discard operations on its level too.
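
As a rough standalone model of the two levels (not the code from patches
8 and 9; FileDiscardStats, file_account_discard() and
advisory_discard_result() are made-up names): the driver counts what really
happened on the host file, while the common layer hides ENOTSUP from the
caller because discard is advisory.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical counters, named after the fields reported by the driver. */
typedef struct {
    uint64_t discard_nb_ok;
    uint64_t discard_nb_failed;
    uint64_t discard_bytes_ok;
} FileDiscardStats;

/* Driver level: record the outcome of one discard sent to the host file. */
static void file_account_discard(FileDiscardStats *st, uint64_t nbytes, int ret)
{
    assert(ret <= 0);   /* 0 on success, negative errno on failure */

    if (ret < 0) {
        st->discard_nb_failed++;
    } else {
        st->discard_nb_ok++;
        st->discard_bytes_ok += nbytes;
    }
}

/* Common layer: an unsupported discard is not reported as an error. */
static int advisory_discard_result(int ret)
{
    return ret == -ENOTSUP ? 0 : ret;
}
```

So a device-level unmap may be accounted as successful even though nothing
was discarded on the host, which is why the driver-level numbers can be
much smaller.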

query-blockstat result:

(note the difference between blockdevice unmap and file discard stats. qcow2
sends fewer ops down to the file as the clusters are actually unallocated
on qcow2 level)

{
  "device": "drive-scsi0-0-0-0",
  "node-name": "#block159",
  "stats": {
>   "invalid_unmap_operations": 0,
>   "failed_unmap_operations": 0,
"wr_highest_offset": 13411688448,
"rd_total_time_ns": 2859566315,
"rd_bytes": 103182336,
"rd_merged": 0,
"flush_operations": 19,
"invalid_wr_operations": 0,
"flush_total_time_ns": 23111608,
"failed_rd_operations": 0,
"failed_flush_operations": 0,
"invalid_flush_operations": 0,
"timed_stats": [
  
],
"wr_merged": 0,
"wr_bytes": 1702912,
>   "unmap_bytes": 11954954240,
>   "unmap_operations": 865,
"idle_time_ns": 2669508623,
"account_invalid": true,
>   "unmap_total_time_ns": 19698002,
"wr_operations": 143,
"failed_wr_operations": 0,
"rd_operations": 4816,
"account_failed": true,
>   "unmap_merged": 0,
"wr_total_time_ns": 1262686124,
"invalid_rd_operations": 0
  },
  "parent": {
>   "driver-specific": {
> "discard-nb-failed": 0,
> "discard-bytes-ok": 720896,
> "driver": "file",
> "discard-nb-ok": 8
>   },
"node-name": "#block009",
"stats": {
[..]
}
  }
},
{
  "device": "floppy0",

Anton Nefedov (9):
  qapi: group BlockDeviceStats fields
  qapi: add unmap to BlockDeviceStats
  block: add empty account cookie type
  ide: account UNMAP (TRIM) operations
  scsi: store unmap offset and nb_sectors in request struct
  scsi: move unmap error checking to the complete callback
  scsi: account unmap operations
  file-posix: account discard operations
  qapi: query-blockstat: add driver specific file-posix stats

 qapi/block-core.json   | 81 --
 include/block/accounting.h |  2 +
 include/block/block.h  |  1 +
 include/block/block_int.h  |  1 +
 block.c|  9 +
 block/accounting.c |  6 +++
 block/file-posix.c | 54 -
 block/qapi.c   | 11 ++
 hw/ide/core.c  | 12 ++
 hw/scsi/scsi-disk.c| 32 +--
 tests/qemu-iotests/227.out | 18 +
 11 files changed, 204 insertions(+), 23 deletions(-)

-- 
2.17.1

[Qemu-devel] [PATCH v8 2/9] qapi: add unmap to BlockDeviceStats

2019-05-16 Thread Anton Nefedov

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Alberto Garcia 
Reviewed-by: Eric Blake 
---
 qapi/block-core.json   | 29 +++--
 include/block/accounting.h |  1 +
 block/qapi.c   |  6 ++
 tests/qemu-iotests/227.out | 18 ++
 4 files changed, 48 insertions(+), 6 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 754d07f1fb..55194f84ce 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -856,6 +856,8 @@
 #
 # @wr_bytes:  The number of bytes written by the device.
 #
+# @unmap_bytes: The number of bytes unmapped by the device (Since 4.1)
+#
 # @rd_operations: The number of read operations performed by the device.
 #
 # @wr_operations: The number of write operations performed by the device.
@@ -863,6 +865,9 @@
 # @flush_operations: The number of cache flush operations performed by the
 #device (since 0.15.0)
 #
+# @unmap_operations: The number of unmap operations performed by the device
+#(Since 4.1)
+#
 # @rd_total_time_ns: Total time spent on reads in nanoseconds (since 0.15.0).
 #
 # @wr_total_time_ns: Total time spent on writes in nanoseconds (since 0.15.0).
@@ -870,6 +875,9 @@
 # @flush_total_time_ns: Total time spent on cache flushes in nanoseconds
 #   (since 0.15.0).
 #
+# @unmap_total_time_ns: Total time spent on unmap operations in nanoseconds
+#   (Since 4.1)
+#
 # @wr_highest_offset: The offset after the greatest byte written to the
 # device.  The intended use of this information is for
 # growable sparse files (like qcow2) that are used on top
@@ -881,6 +889,9 @@
 # @wr_merged: Number of write requests that have been merged into another
 # request (Since 2.3).
 #
+# @unmap_merged: Number of unmap requests that have been merged into another
+#request (Since 4.1)
+#
 # @idle_time_ns: Time since the last I/O operation, in
 #nanoseconds. If the field is absent it means that
 #there haven't been any operations yet (Since 2.5).
@@ -894,6 +905,9 @@
 # @failed_flush_operations: The number of failed flush operations
 #   performed by the device (Since 2.5)
 #
+# @failed_unmap_operations: The number of failed unmap operations performed
+#   by the device (Since 4.1)
+#
 # @invalid_rd_operations: The number of invalid read operations
 #  performed by the device (Since 2.5)
 #
@@ -903,6 +917,9 @@
 # @invalid_flush_operations: The number of invalid flush operations
 #performed by the device (Since 2.5)
 #
+# @invalid_unmap_operations: The number of invalid unmap operations performed
+#by the device (Since 4.1)
+#
 # @account_invalid: Whether invalid operations are included in the
 #   last access statistics (Since 2.5)
 #
@@ -921,18 +938,18 @@
 # Since: 0.14.0
 ##
 { 'struct': 'BlockDeviceStats',
-  'data': {'rd_bytes': 'int', 'wr_bytes': 'int',
+  'data': {'rd_bytes': 'int', 'wr_bytes': 'int', 'unmap_bytes' : 'int',
'rd_operations': 'int', 'wr_operations': 'int',
-   'flush_operations': 'int',
+   'flush_operations': 'int', 'unmap_operations': 'int',
'rd_total_time_ns': 'int', 'wr_total_time_ns': 'int',
-   'flush_total_time_ns': 'int',
+   'flush_total_time_ns': 'int', 'unmap_total_time_ns': 'int',
'wr_highest_offset': 'int',
-   'rd_merged': 'int', 'wr_merged': 'int',
+   'rd_merged': 'int', 'wr_merged': 'int', 'unmap_merged': 'int',
'*idle_time_ns': 'int',
'failed_rd_operations': 'int', 'failed_wr_operations': 'int',
-   'failed_flush_operations': 'int',
+   'failed_flush_operations': 'int', 'failed_unmap_operations': 'int',
'invalid_rd_operations': 'int', 'invalid_wr_operations': 'int',
-   'invalid_flush_operations': 'int',
+   'invalid_flush_operations': 'int', 'invalid_unmap_operations': 
'int',
'account_invalid': 'bool', 'account_failed': 'bool',
'timed_stats': ['BlockDeviceTimedStats'],
'*rd_latency_histogram': 'BlockLatencyHistogramInfo',
diff --git a/include/block/accounting.h b/include/block/accounting.h
index d1f67b10dd..ba8b04d572 100644
--- a/include/block/accounting.h
+++ b/include/block/accounting.h
@@ -36,6 +36,7 @@ enum BlockAcctType {
 BLOCK_ACCT_READ,
 BLOCK_ACCT_WRITE,
 BLOCK_ACCT_FLUSH,
+BLOCK_ACCT_UNMAP,
 BLOCK_MAX_IOTYPE,
 };
 
diff --git a/block/qapi.c b/block/qapi.c
index 0c13c86f4e..f9447a3297 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -434,24 +434,30 @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, 
BlockBackend *blk)
 
 ds->rd_bytes = stats->nr_bytes[BLOCK_ACCT_READ];
 ds->wr_byte

[Qemu-devel] [PATCH v8 5/9] scsi: store unmap offset and nb_sectors in request struct

2019-05-16 Thread Anton Nefedov

it allows to report it in the error handler

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Alberto Garcia 
---
 hw/scsi/scsi-disk.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index e7e865ab3b..b43254103c 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -1602,8 +1602,6 @@ static void scsi_unmap_complete_noio(UnmapCBData *data, 
int ret)
 {
 SCSIDiskReq *r = data->r;
 SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
-uint64_t sector_num;
-uint32_t nb_sectors;
 
 assert(r->req.aiocb == NULL);
 if (scsi_disk_req_check_error(r, ret, false)) {
@@ -1611,16 +1609,16 @@ static void scsi_unmap_complete_noio(UnmapCBData *data, 
int ret)
 }
 
 if (data->count > 0) {
-sector_num = ldq_be_p(&data->inbuf[0]);
-nb_sectors = ldl_be_p(&data->inbuf[8]) & 0xffffffffULL;
-if (!check_lba_range(s, sector_num, nb_sectors)) {
+r->sector = ldq_be_p(&data->inbuf[0]);
+r->sector_count = ldl_be_p(&data->inbuf[8]) & 0xffffffffULL;
+if (!check_lba_range(s, r->sector, r->sector_count)) {
 scsi_check_condition(r, SENSE_CODE(LBA_OUT_OF_RANGE));
 goto done;
 }
 
 r->req.aiocb = blk_aio_pdiscard(s->qdev.conf.blk,
-sector_num * s->qdev.blocksize,
-nb_sectors * s->qdev.blocksize,
+r->sector * s->qdev.blocksize,
+r->sector_count * s->qdev.blocksize,
 scsi_unmap_complete, data);
 data->count--;
 data->inbuf += 16;
-- 
2.17.1

[Qemu-devel] [PATCH v8 9/9] qapi: query-blockstat: add driver specific file-posix stats

2019-05-16 Thread Anton Nefedov

A block driver can provide a callback to report driver-specific
statistics.

file-posix driver now reports discard statistics

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Acked-by: Markus Armbruster 
---
 qapi/block-core.json  | 38 ++
 include/block/block.h |  1 +
 include/block/block_int.h |  1 +
 block.c   |  9 +
 block/file-posix.c| 38 +++---
 block/qapi.c  |  5 +
 6 files changed, 89 insertions(+), 3 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 55194f84ce..368e09ae37 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -956,6 +956,41 @@
'*wr_latency_histogram': 'BlockLatencyHistogramInfo',
'*flush_latency_histogram': 'BlockLatencyHistogramInfo' } }
 
+##
+# @BlockStatsSpecificFile:
+#
+# File driver statistics
+#
+# @discard-nb-ok: The number of successful discard operations performed by
+# the driver.
+#
+# @discard-nb-failed: The number of failed discard operations performed by
+# the driver.
+#
+# @discard-bytes-ok: The number of bytes discarded by the driver.
+#
+# Since: 4.1
+##
+{ 'struct': 'BlockStatsSpecificFile',
+  'data': {
+  'discard-nb-ok': 'uint64',
+  'discard-nb-failed': 'uint64',
+  'discard-bytes-ok': 'uint64' } }
+
+##
+# @BlockStatsSpecific:
+#
+# Block driver specific statistics
+#
+# Since: 4.1
+##
+{ 'union': 'BlockStatsSpecific',
+  'base': { 'driver': 'BlockdevDriver' },
+  'discriminator': 'driver',
+  'data': {
+  'file': 'BlockStatsSpecificFile',
+  'host_device': 'BlockStatsSpecificFile' } }
+
 ##
 # @BlockStats:
 #
@@ -971,6 +1006,8 @@
 #
 # @stats:  A @BlockDeviceStats for the device.
 #
+# @driver-specific: Optional driver-specific stats. (Since 4.1)
+#
 # @parent: This describes the file block device if it has one.
 #  Contains recursively the statistics of the underlying
 #  protocol (e.g. the host file for a qcow2 image). If there is
@@ -984,6 +1021,7 @@
 { 'struct': 'BlockStats',
   'data': {'*device': 'str', '*qdev': 'str', '*node-name': 'str',
'stats': 'BlockDeviceStats',
+   '*driver-specific': 'BlockStatsSpecific',
'*parent': 'BlockStats',
'*backing': 'BlockStats'} }
 
diff --git a/include/block/block.h b/include/block/block.h
index 5e2b98b0ee..b182f0c7ae 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -490,6 +490,7 @@ int bdrv_get_flags(BlockDriverState *bs);
 int bdrv_get_info(BlockDriverState *bs, BlockDriverInfo *bdi);
 ImageInfoSpecific *bdrv_get_specific_info(BlockDriverState *bs,
   Error **errp);
+BlockStatsSpecific *bdrv_get_specific_stats(BlockDriverState *bs);
 void bdrv_round_to_clusters(BlockDriverState *bs,
 int64_t offset, int64_t bytes,
 int64_t *cluster_offset,
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 94d45c9708..dc3bc97ea3 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -358,6 +358,7 @@ struct BlockDriver {
 int (*bdrv_get_info)(BlockDriverState *bs, BlockDriverInfo *bdi);
 ImageInfoSpecific *(*bdrv_get_specific_info)(BlockDriverState *bs,
  Error **errp);
+BlockStatsSpecific *(*bdrv_get_specific_stats)(BlockDriverState *bs);
 
 int coroutine_fn (*bdrv_save_vmstate)(BlockDriverState *bs,
   QEMUIOVector *qiov,
diff --git a/block.c b/block.c
index 6999aad446..f68fb5aaec 100644
--- a/block.c
+++ b/block.c
@@ -4942,6 +4942,15 @@ ImageInfoSpecific 
*bdrv_get_specific_info(BlockDriverState *bs,
 return NULL;
 }
 
+BlockStatsSpecific *bdrv_get_specific_stats(BlockDriverState *bs)
+{
+BlockDriver *drv = bs->drv;
+if (!drv || !drv->bdrv_get_specific_stats) {
+return NULL;
+}
+return drv->bdrv_get_specific_stats(bs);
+}
+
 void bdrv_debug_event(BlockDriverState *bs, BlkdebugEvent event)
 {
 if (!bs || !bs->drv || !bs->drv->bdrv_debug_event) {
diff --git a/block/file-posix.c b/block/file-posix.c
index 76d54b3a85..a2f01cfe10 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -160,9 +160,9 @@ typedef struct BDRVRawState {
 bool drop_cache;
 bool check_cache_dropped;
 struct {
-int64_t discard_nb_ok;
-int64_t discard_nb_failed;
-int64_t discard_bytes_ok;
+uint64_t discard_nb_ok;
+uint64_t discard_nb_failed;
+uint64_t discard_bytes_ok;
 } stats;
 
 PRManager *pr_mgr;
@@ -2723,6 +2723,36 @@ static int raw_get_info(BlockDriverState *bs, 
BlockDriverInfo *bdi)
 return 0;
 }
 
+static BlockStatsSpecificFile get_blockstats_specific_file(BlockDriverState 
*bs)
+{
+BDRVRawState *s = bs->opaque;
+return (B

[Qemu-devel] [PATCH v8 3/9] block: add empty account cookie type

2019-05-16 Thread Anton Nefedov

This adds some protection from accounting uninitialized cookie.
That is, block_acct_failed/done without previous block_acct_start;
in that case, cookie probably holds values from previous operation.

(Note: it might also be uninitialized holding garbage value and there is
 still "< BLOCK_MAX_IOTYPE" assertion for that.
 So block_acct_failed/done without previous block_acct_start should be used
 with caution.)

Currently this is particularly useful in ide code where it's hard to
keep track whether the request started accounting or not. For example,
trim requests do the accounting separately.

Signed-off-by: Anton Nefedov 
---
 include/block/accounting.h | 1 +
 block/accounting.c | 6 ++
 2 files changed, 7 insertions(+)

diff --git a/include/block/accounting.h b/include/block/accounting.h
index ba8b04d572..878b4c3581 100644
--- a/include/block/accounting.h
+++ b/include/block/accounting.h
@@ -33,6 +33,7 @@ typedef struct BlockAcctTimedStats BlockAcctTimedStats;
 typedef struct BlockAcctStats BlockAcctStats;
 
 enum BlockAcctType {
+BLOCK_ACCT_NONE = 0,
 BLOCK_ACCT_READ,
 BLOCK_ACCT_WRITE,
 BLOCK_ACCT_FLUSH,
diff --git a/block/accounting.c b/block/accounting.c
index 70a3d9a426..8d41c8a83a 100644
--- a/block/accounting.c
+++ b/block/accounting.c
@@ -195,6 +195,10 @@ static void block_account_one_io(BlockAcctStats *stats, BlockAcctCookie *cookie,
 
 assert(cookie->type < BLOCK_MAX_IOTYPE);
 
+if (cookie->type == BLOCK_ACCT_NONE) {
+return;
+}
+
 qemu_mutex_lock(&stats->lock);
 
 if (failed) {
@@ -217,6 +221,8 @@ static void block_account_one_io(BlockAcctStats *stats, BlockAcctCookie *cookie,
 }
 
 qemu_mutex_unlock(&stats->lock);
+
+cookie->type = BLOCK_ACCT_NONE;
 }
 
 void block_acct_done(BlockAcctStats *stats, BlockAcctCookie *cookie)
-- 
2.17.1
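
The guard described in this patch can be illustrated with a minimal standalone sketch. It is not the QEMU code: the types and names below (AcctType, AcctCookie, AcctStats, acct_start, acct_done) are hypothetical, simplified stand-ins, and locking is left out. It only shows how a zero-valued "none" type lets a completion call without a preceding start become a no-op.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the accounting types. */
enum AcctType {
    ACCT_NONE = 0,   /* cookie holds no operation in flight */
    ACCT_READ,
    ACCT_WRITE,
    ACCT_MAX
};

typedef struct {
    uint64_t bytes;
    enum AcctType type;
} AcctCookie;

typedef struct {
    uint64_t nr_ops[ACCT_MAX];
    uint64_t nr_bytes[ACCT_MAX];
} AcctStats;

/* Record the start of an operation in the cookie. */
static void acct_start(AcctCookie *cookie, uint64_t bytes, enum AcctType type)
{
    assert(type > ACCT_NONE && type < ACCT_MAX);
    cookie->bytes = bytes;
    cookie->type = type;
}

/* Account a finished operation; ignore cookies that were never started. */
static void acct_done(AcctStats *stats, AcctCookie *cookie)
{
    assert(cookie->type < ACCT_MAX);
    if (cookie->type == ACCT_NONE) {
        return;  /* nothing was started: do not account stale values */
    }
    stats->nr_ops[cookie->type]++;
    stats->nr_bytes[cookie->type] += cookie->bytes;
    cookie->type = ACCT_NONE;  /* mark the cookie as consumed */
}
```

With a zero-initialized cookie, calling acct_done() twice after a single acct_start() accounts the operation only once. As the commit message notes, a cookie that holds garbage rather than zero is still only caught by the range assertion.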

[Qemu-devel] [PATCH v8 1/9] qapi: group BlockDeviceStats fields

2019-05-16 Thread Anton Nefedov

Make the stat fields definition slightly more readable.
Also reorder total_time_ns stats read-write-flush as done elsewhere.
Cosmetic change only.

Signed-off-by: Anton Nefedov 
Reviewed-by: Alberto Garcia 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 qapi/block-core.json | 26 +++---
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 7ccbfff9d0..754d07f1fb 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -863,12 +863,12 @@
 # @flush_operations: The number of cache flush operations performed by the
 #device (since 0.15.0)
 #
-# @flush_total_time_ns: Total time spend on cache flushes in nano-seconds
-#   (since 0.15.0).
+# @rd_total_time_ns: Total time spent on reads in nanoseconds (since 0.15.0).
 #
-# @wr_total_time_ns: Total time spend on writes in nano-seconds (since 0.15.0).
+# @wr_total_time_ns: Total time spent on writes in nanoseconds (since 0.15.0).
 #
-# @rd_total_time_ns: Total_time_spend on reads in nano-seconds (since 0.15.0).
+# @flush_total_time_ns: Total time spent on cache flushes in nanoseconds
+#   (since 0.15.0).
 #
 # @wr_highest_offset: The offset after the greatest byte written to the
 # device.  The intended use of this information is for
@@ -921,14 +921,18 @@
 # Since: 0.14.0
 ##
 { 'struct': 'BlockDeviceStats',
-  'data': {'rd_bytes': 'int', 'wr_bytes': 'int', 'rd_operations': 'int',
-   'wr_operations': 'int', 'flush_operations': 'int',
-   'flush_total_time_ns': 'int', 'wr_total_time_ns': 'int',
-   'rd_total_time_ns': 'int', 'wr_highest_offset': 'int',
-   'rd_merged': 'int', 'wr_merged': 'int', '*idle_time_ns': 'int',
+  'data': {'rd_bytes': 'int', 'wr_bytes': 'int',
+   'rd_operations': 'int', 'wr_operations': 'int',
+   'flush_operations': 'int',
+   'rd_total_time_ns': 'int', 'wr_total_time_ns': 'int',
+   'flush_total_time_ns': 'int',
+   'wr_highest_offset': 'int',
+   'rd_merged': 'int', 'wr_merged': 'int',
+   '*idle_time_ns': 'int',
'failed_rd_operations': 'int', 'failed_wr_operations': 'int',
-   'failed_flush_operations': 'int', 'invalid_rd_operations': 'int',
-   'invalid_wr_operations': 'int', 'invalid_flush_operations': 'int',
+   'failed_flush_operations': 'int',
+   'invalid_rd_operations': 'int', 'invalid_wr_operations': 'int',
+   'invalid_flush_operations': 'int',
'account_invalid': 'bool', 'account_failed': 'bool',
'timed_stats': ['BlockDeviceTimedStats'],
'*rd_latency_histogram': 'BlockLatencyHistogramInfo',
-- 
2.17.1

[Qemu-devel] [PATCH v2] iotest 134: test cluster-misaligned encrypted write

2019-05-16 Thread Anton Nefedov

COW (even empty/zero) areas require encryption too

Signed-off-by: Anton Nefedov 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Reviewed-by: Alberto Garcia 
---

..apparently v1 ended up in a weird base64 that would not easily git-am.
Resending.

used to be a part of 'qcow2: cluster space preallocation' series
http://lists.nongnu.org/archive/html/qemu-devel/2019-01/msg02769.html

---
 tests/qemu-iotests/134 |  9 +
 tests/qemu-iotests/134.out | 10 ++
 2 files changed, 19 insertions(+)

diff --git a/tests/qemu-iotests/134 b/tests/qemu-iotests/134
index e9e3e84c2a..f5fd3b7f34 100755
--- a/tests/qemu-iotests/134
+++ b/tests/qemu-iotests/134
@@ -57,6 +57,15 @@ echo
 echo "== reading whole image =="
 $QEMU_IO --object $SECRET -c "read 0 $size" --image-opts $IMGSPEC | _filter_qemu_io | _filter_testdir
 
+echo
+echo "== rewriting cluster part =="
+$QEMU_IO --object $SECRET -c "write -P 0xb 512 512" --image-opts $IMGSPEC | _filter_qemu_io | _filter_testdir
+
+echo
+echo "== verify pattern =="
+$QEMU_IO --object $SECRET -c "read -P 0 0 512"  --image-opts $IMGSPEC | _filter_qemu_io | _filter_testdir
+$QEMU_IO --object $SECRET -c "read -P 0xb 512 512"  --image-opts $IMGSPEC | _filter_qemu_io | _filter_testdir
+
 echo
 echo "== rewriting whole image =="
 $QEMU_IO --object $SECRET -c "write -P 0xa 0 $size" --image-opts $IMGSPEC | _filter_qemu_io | _filter_testdir
diff --git a/tests/qemu-iotests/134.out b/tests/qemu-iotests/134.out
index 972be49d91..09d46f6b17 100644
--- a/tests/qemu-iotests/134.out
+++ b/tests/qemu-iotests/134.out
@@ -5,6 +5,16 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728 
encryption=on encrypt.
 read 134217728/134217728 bytes at offset 0
 128 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
+== rewriting cluster part ==
+wrote 512/512 bytes at offset 512
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+== verify pattern ==
+read 512/512 bytes at offset 0
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 512
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
 == rewriting whole image ==
 wrote 134217728/134217728 bytes at offset 0
 128 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-- 
2.17.1

[Qemu-devel] [PATCH v14 0/1] qcow2: cluster space preallocation

2019-05-16 Thread Anton Nefedov

..apparently v13 ended up in a weird base64 that would not easily git-am.
Resending.



hi,

this used to be a large 10-patches series, but now thanks to Kevin's
BDRV_REQ_NO_FALLBACK series
(http://lists.nongnu.org/archive/html/qemu-devel/2019-03/msg06389.html)
the core block layer functionality is already in place: the existing flag
can be reused.

A few accompanying patches were also dropped from the series as those can
be processed separately.

So,

new in v13:
   - patch 1 (was patch 9) rebased to use s->data_file pointer
   - comments fixed/added
   - other patches dropped ;)

v12: http://lists.nongnu.org/archive/html/qemu-devel/2019-01/msg02761.html
v11: http://lists.nongnu.org/archive/html/qemu-devel/2018-12/msg04342.html
v10: http://lists.nongnu.org/archive/html/qemu-devel/2018-12/msg00121.html



This patch series is a first step towards improving a few performance
aspects of the qcow2 format:

  1. non cluster-aligned write requests (to unallocated clusters) explicitly
     pad data with zeroes if there is no backing data.
     The resulting increase in the number of ops and the potential cluster
     fragmentation (on the host file) is already solved by:
       ee22a9d qcow2: Merge the writing of the COW regions with the guest data
     However, in case of zero COW regions, the padding can be avoided
     altogether: the whole clusters are preallocated and zeroed in a single
     efficient write_zeroes() operation instead.

  2. moreover, an efficient write_zeroes() operation can be used to
     preallocate space megabytes (*configurable number) ahead, which gives
     a noticeable improvement on some storage types (e.g. distributed
     storage, where the space allocation operation might be expensive).
     (Not included in this patchset since v6).

  3. this will also make it possible to allow simultaneous writes to the
     same unallocated cluster after the space has been allocated & zeroed
     but before the first data is written and the cluster is linked to L2.
     (Not included in this patchset).

Efficient write_zeroes usually implies that the blocks are not actually
written to but only reserved and marked as zeroed by the storage.
Existing bdrv_write_zeroes() falls back to writing zero buffers if
write_zeroes is not supported by the driver, so BDRV_REQ_NO_FALLBACK is used.
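
The "no fallback" idea from the paragraph above can be modelled with a toy backend. This is only a sketch under stated assumptions: ToyBackend, toy_write_zeroes_no_fallback and toy_prepare_cluster are hypothetical names, not QEMU APIs, and the counters merely stand in for real I/O. The point is that the fast path is taken only when the storage can zero a range efficiently; otherwise the caller keeps writing the COW buffers itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a storage backend. */
typedef struct {
    bool has_fast_zero;        /* can reserve and zero a range cheaply */
    size_t fast_zero_calls;    /* efficient zeroing requests served */
    size_t cow_buffer_writes;  /* explicit COW buffer writes performed */
} ToyBackend;

/* Models a write-zeroes request that must not fall back to writing buffers. */
static int toy_write_zeroes_no_fallback(ToyBackend *b)
{
    if (!b->has_fast_zero) {
        return -ENOTSUP;
    }
    b->fast_zero_calls++;
    return 0;
}

/*
 * Returns true if the COW padding can be skipped because the whole
 * cluster was zeroed efficiently; false if the regular COW path ran.
 */
static bool toy_prepare_cluster(ToyBackend *b, bool cow_is_zero)
{
    if (cow_is_zero && toy_write_zeroes_no_fallback(b) == 0) {
        return true;
    }
    b->cow_buffer_writes++;  /* regular path: write the COW buffers */
    return false;
}
```

A backend without efficient zeroing simply reports -ENOTSUP in this model, so the optimization degrades to the previous behaviour instead of writing zero buffers twice.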

simple perf test:

  qemu-img create -f qcow2 test.img 4G && \
  qemu-img bench -c $((1024*1024)) -f qcow2 -n -s 4k -t none -w test.img

test results (seconds):

+-----------+----------+----------+------+
|   file    |  before  |  after   | gain |
+-----------+----------+----------+------+
|    ssd    |  61.153  |  36.313  |  41% |
|    hdd    | 112.676  | 122.056  |  -8% |
+-----------+----------+----------+------+

Anton Nefedov (1):
  qcow2: skip writing zero buffers to empty COW areas

 qapi/block-core.json   |  4 +-
 block/qcow2.h  |  6 +++
 block/qcow2-cluster.c  |  2 +-
 block/qcow2.c  | 93 +-
 block/trace-events |  1 +
 tests/qemu-iotests/060 |  7 ++-
 tests/qemu-iotests/060.out |  5 +-
 7 files changed, 112 insertions(+), 6 deletions(-)

-- 
2.17.1
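
The gain column in the table above appears to be the relative reduction in run time, (before - after) / before. That formula is an assumption, since the cover letter does not state it, but it reproduces both rows. A minimal sketch with a hypothetical helper gain_percent():

```c
#include <assert.h>

/* Relative run-time reduction in percent; negative means a slowdown. */
static double gain_percent(double before_sec, double after_sec)
{
    assert(before_sec > 0.0);
    return (before_sec - after_sec) / before_sec * 100.0;
}
```

For the ssd row this gives about 40.6, which rounds to 41%; for the hdd row about -8.3, which rounds to -8%.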

[Qemu-devel] [PATCH v8 8/9] file-posix: account discard operations

2019-05-16 Thread Anton Nefedov

This will help to identify how many of the user-issued discard operations
(accounted on a device level) have actually succeeded down on the host file
(even though the numbers will not be exactly the same if a non-raw format
driver is used (e.g. qcow2 sending metadata discards)).

Note that these numbers will not include discards triggered by
write-zeroes + MAY_UNMAP calls.

Signed-off-by: Anton Nefedov 
---
 block/file-posix.c | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 1cf4ee49eb..76d54b3a85 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -159,6 +159,11 @@ typedef struct BDRVRawState {
 bool needs_alignment;
 bool drop_cache;
 bool check_cache_dropped;
+struct {
+int64_t discard_nb_ok;
+int64_t discard_nb_failed;
+int64_t discard_bytes_ok;
+} stats;
 
 PRManager *pr_mgr;
 } BDRVRawState;
@@ -2630,11 +2635,22 @@ static void coroutine_fn raw_co_invalidate_cache(BlockDriverState *bs,
 #endif /* !__linux__ */
 }
 
+static void raw_account_discard(BDRVRawState *s, uint64_t nbytes, int ret)
+{
+if (ret) {
+s->stats.discard_nb_failed++;
+} else {
+s->stats.discard_nb_ok++;
+s->stats.discard_bytes_ok += nbytes;
+}
+}
+
 static coroutine_fn int
 raw_do_pdiscard(BlockDriverState *bs, int64_t offset, int bytes, bool blkdev)
 {
 BDRVRawState *s = bs->opaque;
 RawPosixAIOData acb;
+int ret;
 
 acb = (RawPosixAIOData) {
 .bs = bs,
@@ -2648,7 +2664,9 @@ raw_do_pdiscard(BlockDriverState *bs, int64_t offset, int bytes, bool blkdev)
 acb.aio_type |= QEMU_AIO_BLKDEV;
 }
 
-return raw_thread_pool_submit(bs, handle_aiocb_discard, &acb);
+ret = raw_thread_pool_submit(bs, handle_aiocb_discard, &acb);
+raw_account_discard(s, bytes, ret);
+return ret;
 }
 
 static coroutine_fn int
@@ -3263,10 +3281,12 @@ static int fd_open(BlockDriverState *bs)
 static coroutine_fn int
 hdev_co_pdiscard(BlockDriverState *bs, int64_t offset, int bytes)
 {
+BDRVRawState *s = bs->opaque;
 int ret;
 
 ret = fd_open(bs);
 if (ret < 0) {
+raw_account_discard(s, bytes, ret);
 return ret;
 }
 return raw_do_pdiscard(bs, offset, bytes, true);
-- 
2.17.1
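
The counting rule introduced by this patch can be tried in isolation with a small sketch. The struct and function names below (DiscardStats, account_discard) are hypothetical stand-ins for the fields added to BDRVRawState; only the rule itself follows the patch: a non-zero return value counts as one failed discard, a zero return value counts as one successful discard plus its byte count.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical standalone copy of the per-file discard counters. */
typedef struct {
    uint64_t discard_nb_ok;
    uint64_t discard_nb_failed;
    uint64_t discard_bytes_ok;
} DiscardStats;

/* Account one discard request according to its return code. */
static void account_discard(DiscardStats *st, uint64_t nbytes, int ret)
{
    if (ret) {
        st->discard_nb_failed++;   /* bytes of failed requests are not counted */
    } else {
        st->discard_nb_ok++;
        st->discard_bytes_ok += nbytes;
    }
}
```

Because bytes are only added on success, discard_bytes_ok reflects the amount of space that was actually discarded on the host file, not the amount requested.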

[Qemu-devel] [PATCH v8 7/9] scsi: account unmap operations

2019-05-16 Thread Anton Nefedov

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 hw/scsi/scsi-disk.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 6eff496b54..5c77981d60 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -1609,10 +1609,16 @@ static void scsi_unmap_complete_noio(UnmapCBData *data, int ret)
 r->sector = ldq_be_p(&data->inbuf[0]);
 r->sector_count = ldl_be_p(&data->inbuf[8]) & 0xffffffffULL;
 if (!check_lba_range(s, r->sector, r->sector_count)) {
+block_acct_invalid(blk_get_stats(s->qdev.conf.blk),
+   BLOCK_ACCT_UNMAP);
 scsi_check_condition(r, SENSE_CODE(LBA_OUT_OF_RANGE));
 goto done;
 }
 
+block_acct_start(blk_get_stats(s->qdev.conf.blk), &r->acct,
+ r->sector_count * s->qdev.blocksize,
+ BLOCK_ACCT_UNMAP);
+
 r->req.aiocb = blk_aio_pdiscard(s->qdev.conf.blk,
 r->sector * s->qdev.blocksize,
 r->sector_count * s->qdev.blocksize,
@@ -1639,10 +1645,11 @@ static void scsi_unmap_complete(void *opaque, int ret)
 r->req.aiocb = NULL;
 
 aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
-if (scsi_disk_req_check_error(r, ret, false)) {
+if (scsi_disk_req_check_error(r, ret, true)) {
 scsi_req_unref(&r->req);
 g_free(data);
 } else {
+block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
 scsi_unmap_complete_noio(data, ret);
 }
 aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
@@ -1674,6 +1681,7 @@ static void scsi_disk_emulate_unmap(SCSIDiskReq *r, uint8_t *inbuf)
 }
 
 if (blk_is_read_only(s->qdev.conf.blk)) {
+block_acct_invalid(blk_get_stats(s->qdev.conf.blk), BLOCK_ACCT_UNMAP);
 scsi_check_condition(r, SENSE_CODE(WRITE_PROTECTED));
 return;
 }
@@ -1689,10 +1697,12 @@ static void scsi_disk_emulate_unmap(SCSIDiskReq *r, uint8_t *inbuf)
 return;
 
 invalid_param_len:
+block_acct_invalid(blk_get_stats(s->qdev.conf.blk), BLOCK_ACCT_UNMAP);
 scsi_check_condition(r, SENSE_CODE(INVALID_PARAM_LEN));
 return;
 
 invalid_field:
+block_acct_invalid(blk_get_stats(s->qdev.conf.blk), BLOCK_ACCT_UNMAP);
 scsi_check_condition(r, SENSE_CODE(INVALID_FIELD));
 }
 
-- 
2.17.1

[Qemu-devel] [PATCH v8 4/9] ide: account UNMAP (TRIM) operations

2019-05-16 Thread Anton Nefedov

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 hw/ide/core.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/hw/ide/core.c b/hw/ide/core.c
index 6afadf894f..3a7ac93777 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -441,6 +441,14 @@ static void ide_issue_trim_cb(void *opaque, int ret)
 TrimAIOCB *iocb = opaque;
 IDEState *s = iocb->s;
 
+if (iocb->i >= 0) {
+if (ret >= 0) {
+block_acct_done(blk_get_stats(s->blk), &s->acct);
+} else {
+block_acct_failed(blk_get_stats(s->blk), &s->acct);
+}
+}
+
 if (ret >= 0) {
 while (iocb->j < iocb->qiov->niov) {
 int j = iocb->j;
@@ -458,10 +466,14 @@ static void ide_issue_trim_cb(void *opaque, int ret)
 }
 
 if (!ide_sect_range_ok(s, sector, count)) {
+block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_UNMAP);
 iocb->ret = -EINVAL;
 goto done;
 }
 
+block_acct_start(blk_get_stats(s->blk), &s->acct,
+ count << BDRV_SECTOR_BITS, BLOCK_ACCT_UNMAP);
+
 /* Got an entry! Submit and exit.  */
 iocb->aiocb = blk_aio_pdiscard(s->blk,
sector << BDRV_SECTOR_BITS,
-- 
2.17.1

[Qemu-devel] [PATCH v8 6/9] scsi: move unmap error checking to the complete callback

2019-05-16 Thread Anton Nefedov

This will help to account the operation in the following commit.

The difference is that we don't call scsi_disk_req_check_error() before
the 1st discard iteration anymore. That function also checks whether
the request has been canceled; however, the request shouldn't get canceled
until it yields in the blk_aio() functions anyway.
The same approach is already used for emulate_write_same.

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Alberto Garcia 
---
 hw/scsi/scsi-disk.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index b43254103c..6eff496b54 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -1604,9 +1604,6 @@ static void scsi_unmap_complete_noio(UnmapCBData *data, int ret)
 SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
 
 assert(r->req.aiocb == NULL);
-if (scsi_disk_req_check_error(r, ret, false)) {
-goto done;
-}
 
 if (data->count > 0) {
 r->sector = ldq_be_p(&data->inbuf[0]);
@@ -1642,7 +1639,12 @@ static void scsi_unmap_complete(void *opaque, int ret)
 r->req.aiocb = NULL;
 
 aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
-scsi_unmap_complete_noio(data, ret);
+if (scsi_disk_req_check_error(r, ret, false)) {
+scsi_req_unref(&r->req);
+g_free(data);
+} else {
+scsi_unmap_complete_noio(data, ret);
+}
 aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
 }
 
-- 
2.17.1

[Qemu-devel] [PATCH v14 1/1] qcow2: skip writing zero buffers to empty COW areas

2019-05-16 Thread Anton Nefedov

If COW areas of the newly allocated clusters are zeroes on the backing
image, efficient bdrv_write_zeroes(flags=BDRV_REQ_NO_FALLBACK) can be
used on the whole cluster instead of writing explicit zero buffers later
in perform_cow().

iotest 060:
write to the discarded cluster does not trigger COW anymore.
Use a backing image instead.

Signed-off-by: Anton Nefedov 
---
 qapi/block-core.json   |  4 +-
 block/qcow2.h  |  6 +++
 block/qcow2-cluster.c  |  2 +-
 block/qcow2.c  | 93 +-
 block/trace-events |  1 +
 tests/qemu-iotests/060 |  7 ++-
 tests/qemu-iotests/060.out |  5 +-
 7 files changed, 112 insertions(+), 6 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 7ccbfff9d0..3e4042be7f 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -3215,6 +3215,8 @@
 #
 # @cor_write: a write due to copy-on-read (since 2.11)
 #
+# @cluster_alloc_space: an allocation of file space for a cluster (since 4.1)
+#
 # Since: 2.9
 ##
 { 'enum': 'BlkdebugEvent', 'prefix': 'BLKDBG',
@@ -3233,7 +3235,7 @@
 'pwritev_rmw_tail', 'pwritev_rmw_after_tail', 'pwritev',
 'pwritev_zero', 'pwritev_done', 'empty_image_prepare',
 'l1_shrink_write_table', 'l1_shrink_free_l2_clusters',
-'cor_write'] }
+'cor_write', 'cluster_alloc_space'] }
 
 ##
 # @BlkdebugInjectErrorOptions:
diff --git a/block/qcow2.h b/block/qcow2.h
index e62508d1ce..07b18a733b 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -398,6 +398,12 @@ typedef struct QCowL2Meta
  */
 Qcow2COWRegion cow_end;
 
+/*
+ * Indicates that COW regions are already handled and do not require
+ * any more processing.
+ */
+bool skip_cow;
+
 /**
  * The I/O vector with the data from the actual guest write request.
  * If non-NULL, this is meant to be merged together with the data
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 974a4e8656..79d4651603 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -832,7 +832,7 @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
 assert(start->offset + start->nb_bytes <= end->offset);
 assert(!m->data_qiov || m->data_qiov->size == data_bytes);
 
-if (start->nb_bytes == 0 && end->nb_bytes == 0) {
+if ((start->nb_bytes == 0 && end->nb_bytes == 0) || m->skip_cow) {
 return 0;
 }
 
diff --git a/block/qcow2.c b/block/qcow2.c
index 8e024007db..e6b1293ddf 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2120,6 +2120,11 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
 continue;
 }
 
+/* If COW regions are handled already, skip this too */
+if (m->skip_cow) {
+continue;
+}
+
 /* The data (middle) region must be immediately after the
  * start region */
 if (l2meta_cow_start(m) + m->cow_start.nb_bytes != offset) {
@@ -2145,6 +2150,80 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
 return false;
 }
 
+static bool is_unallocated(BlockDriverState *bs, int64_t offset, int64_t bytes)
+{
+int64_t nr;
+return !bytes ||
+(!bdrv_is_allocated_above(bs, NULL, offset, bytes, &nr) && nr == bytes);
+}
+
+static bool is_zero_cow(BlockDriverState *bs, QCowL2Meta *m)
+{
+/*
+ * This check is designed for optimization shortcut so it must be
+ * efficient.
+ * Instead of is_zero(), use is_unallocated() as it is faster (but not
+ * as accurate and can result in false negatives).
+ */
+return is_unallocated(bs, m->offset + m->cow_start.offset,
+  m->cow_start.nb_bytes) &&
+   is_unallocated(bs, m->offset + m->cow_end.offset,
+  m->cow_end.nb_bytes);
+}
+
+static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
+{
+BDRVQcow2State *s = bs->opaque;
+QCowL2Meta *m;
+
+if (!(s->data_file->bs->supported_zero_flags & BDRV_REQ_NO_FALLBACK)) {
+return 0;
+}
+
+if (bs->encrypted) {
+return 0;
+}
+
+for (m = l2meta; m != NULL; m = m->next) {
+int ret;
+
+if (!m->cow_start.nb_bytes && !m->cow_end.nb_bytes) {
+continue;
+}
+
+if (!is_zero_cow(bs, m)) {
+continue;
+}
+
+/*
+ * instead of writing zero COW buffers,
+ * efficiently zero out the whole clusters
+ */
+
+ret = qcow2_pre_write_overlap_check(bs, 0, m->alloc_offset,
+m->nb_clusters * s->cluster_size,
+true);
+if (ret < 0) {
+return ret;
+}
+
+BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_SP

[Qemu-devel] [PATCH] iotest 134: test cluster-misaligned encrypted write

2019-05-14 Thread Anton Nefedov

COW (even empty/zero) areas require encryption too

Signed-off-by: Anton Nefedov 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Reviewed-by: Alberto Garcia 
---
used to be a part of 'qcow2: cluster space preallocation' series
http://lists.nongnu.org/archive/html/qemu-devel/2019-01/msg02769.html

 tests/qemu-iotests/134 |  9 +
 tests/qemu-iotests/134.out | 10 ++
 2 files changed, 19 insertions(+)

diff --git a/tests/qemu-iotests/134 b/tests/qemu-iotests/134
index e9e3e84c2a..f5fd3b7f34 100755
--- a/tests/qemu-iotests/134
+++ b/tests/qemu-iotests/134
@@ -57,6 +57,15 @@ echo
 echo "== reading whole image =="
 $QEMU_IO --object $SECRET -c "read 0 $size" --image-opts $IMGSPEC | _filter_qemu_io | _filter_testdir
 
+echo
+echo "== rewriting cluster part =="
+$QEMU_IO --object $SECRET -c "write -P 0xb 512 512" --image-opts $IMGSPEC | _filter_qemu_io | _filter_testdir
+
+echo
+echo "== verify pattern =="
+$QEMU_IO --object $SECRET -c "read -P 0 0 512"  --image-opts $IMGSPEC | _filter_qemu_io | _filter_testdir
+$QEMU_IO --object $SECRET -c "read -P 0xb 512 512"  --image-opts $IMGSPEC | _filter_qemu_io | _filter_testdir
+
 echo
 echo "== rewriting whole image =="
 $QEMU_IO --object $SECRET -c "write -P 0xa 0 $size" --image-opts $IMGSPEC | _filter_qemu_io | _filter_testdir
diff --git a/tests/qemu-iotests/134.out b/tests/qemu-iotests/134.out
index 972be49d91..09d46f6b17 100644
--- a/tests/qemu-iotests/134.out
+++ b/tests/qemu-iotests/134.out
@@ -5,6 +5,16 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728 
encryption=on encrypt.
 read 134217728/134217728 bytes at offset 0
 128 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
+== rewriting cluster part ==
+wrote 512/512 bytes at offset 512
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+== verify pattern ==
+read 512/512 bytes at offset 0
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 512
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
 == rewriting whole image ==
 wrote 134217728/134217728 bytes at offset 0
 128 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-- 
2.17.1

[Qemu-devel] [PATCH v13 1/1] qcow2: skip writing zero buffers to empty COW areas

2019-05-14 Thread Anton Nefedov

If COW areas of the newly allocated clusters are zeroes on the backing
image, efficient bdrv_write_zeroes(flags=BDRV_REQ_NO_FALLBACK) can be
used on the whole cluster instead of writing explicit zero buffers later
in perform_cow().

iotest 060:
write to the discarded cluster does not trigger COW anymore.
Use a backing image instead.

Signed-off-by: Anton Nefedov 
---
 qapi/block-core.json   |  4 +-
 block/qcow2.h  |  6 +++
 block/qcow2-cluster.c  |  2 +-
 block/qcow2.c  | 93 +-
 block/trace-events |  1 +
 tests/qemu-iotests/060 |  7 ++-
 tests/qemu-iotests/060.out |  5 +-
 7 files changed, 112 insertions(+), 6 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 7ccbfff9d0..3e4042be7f 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -3215,6 +3215,8 @@
 #
 # @cor_write: a write due to copy-on-read (since 2.11)
 #
+# @cluster_alloc_space: an allocation of file space for a cluster (since 4.1)
+#
 # Since: 2.9
 ##
 { 'enum': 'BlkdebugEvent', 'prefix': 'BLKDBG',
@@ -3233,7 +3235,7 @@
 'pwritev_rmw_tail', 'pwritev_rmw_after_tail', 'pwritev',
 'pwritev_zero', 'pwritev_done', 'empty_image_prepare',
 'l1_shrink_write_table', 'l1_shrink_free_l2_clusters',
-'cor_write'] }
+'cor_write', 'cluster_alloc_space'] }
 
 ##
 # @BlkdebugInjectErrorOptions:
diff --git a/block/qcow2.h b/block/qcow2.h
index e62508d1ce..07b18a733b 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -398,6 +398,12 @@ typedef struct QCowL2Meta
  */
 Qcow2COWRegion cow_end;
 
+/*
+ * Indicates that COW regions are already handled and do not require
+ * any more processing.
+ */
+bool skip_cow;
+
 /**
  * The I/O vector with the data from the actual guest write request.
  * If non-NULL, this is meant to be merged together with the data
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 974a4e8656..79d4651603 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -832,7 +832,7 @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
 assert(start->offset + start->nb_bytes <= end->offset);
 assert(!m->data_qiov || m->data_qiov->size == data_bytes);
 
-if (start->nb_bytes == 0 && end->nb_bytes == 0) {
+if ((start->nb_bytes == 0 && end->nb_bytes == 0) || m->skip_cow) {
 return 0;
 }
 
diff --git a/block/qcow2.c b/block/qcow2.c
index 8e024007db..e6b1293ddf 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2120,6 +2120,11 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
 continue;
 }
 
+/* If COW regions are handled already, skip this too */
+if (m->skip_cow) {
+continue;
+}
+
 /* The data (middle) region must be immediately after the
  * start region */
 if (l2meta_cow_start(m) + m->cow_start.nb_bytes != offset) {
@@ -2145,6 +2150,80 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
 return false;
 }
 
+static bool is_unallocated(BlockDriverState *bs, int64_t offset, int64_t bytes)
+{
+int64_t nr;
+return !bytes ||
+(!bdrv_is_allocated_above(bs, NULL, offset, bytes, &nr) && nr == bytes);
+}
+
+static bool is_zero_cow(BlockDriverState *bs, QCowL2Meta *m)
+{
+/*
+ * This check is designed for optimization shortcut so it must be
+ * efficient.
+ * Instead of is_zero(), use is_unallocated() as it is faster (but not
+ * as accurate and can result in false negatives).
+ */
+return is_unallocated(bs, m->offset + m->cow_start.offset,
+  m->cow_start.nb_bytes) &&
+   is_unallocated(bs, m->offset + m->cow_end.offset,
+  m->cow_end.nb_bytes);
+}
+
+static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
+{
+BDRVQcow2State *s = bs->opaque;
+QCowL2Meta *m;
+
+if (!(s->data_file->bs->supported_zero_flags & BDRV_REQ_NO_FALLBACK)) {
+return 0;
+}
+
+if (bs->encrypted) {
+return 0;
+}
+
+for (m = l2meta; m != NULL; m = m->next) {
+int ret;
+
+if (!m->cow_start.nb_bytes && !m->cow_end.nb_bytes) {
+continue;
+}
+
+if (!is_zero_cow(bs, m)) {
+continue;
+}
+
+/*
+ * instead of writing zero COW buffers,
+ * efficiently zero out the whole clusters
+ */
+
+ret = qcow2_pre_write_overlap_check(bs, 0, m->alloc_offset,
+m->nb_clusters * s->cluster_size,
+true);
+if (ret < 0) {
+return ret;
+}
+
+BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_SP

[Qemu-devel] [PATCH v13 0/1] qcow2: cluster space preallocation

2019-05-14 Thread Anton Nefedov

hi,

this used to be a large 10-patches series, but now thanks to Kevin's
BDRV_REQ_NO_FALLBACK series
(http://lists.nongnu.org/archive/html/qemu-devel/2019-03/msg06389.html)
the core block layer functionality is already in place: the existing flag
can be reused.

A few accompanying patches were also dropped from the series as those can
be processed separately.

So,

new in v13:
   - patch 1 (was patch 9) rebased to use s->data_file pointer
   - comments fixed/added
   - other patches dropped ;)

v12: http://lists.nongnu.org/archive/html/qemu-devel/2019-01/msg02761.html
v11: http://lists.nongnu.org/archive/html/qemu-devel/2018-12/msg04342.html
v10: http://lists.nongnu.org/archive/html/qemu-devel/2018-12/msg00121.html



This patch series is a first step towards improving a few performance
aspects of the qcow2 format:

  1. non cluster-aligned write requests (to unallocated clusters) explicitly
     pad data with zeroes if there is no backing data.
     The resulting increase in the number of ops and the potential cluster
     fragmentation (on the host file) is already solved by:
       ee22a9d qcow2: Merge the writing of the COW regions with the guest data
     However, in case of zero COW regions, the padding can be avoided
     altogether: the whole clusters are preallocated and zeroed in a single
     efficient write_zeroes() operation instead.

  2. moreover, an efficient write_zeroes() operation can be used to
     preallocate space megabytes (*configurable number) ahead, which gives
     a noticeable improvement on some storage types (e.g. distributed
     storage, where the space allocation operation might be expensive).
     (Not included in this patchset since v6).

  3. this will also make it possible to allow simultaneous writes to the
     same unallocated cluster after the space has been allocated & zeroed
     but before the first data is written and the cluster is linked to L2.
     (Not included in this patchset).

Efficient write_zeroes usually implies that the blocks are not actually
written to but only reserved and marked as zeroed by the storage.
Existing bdrv_write_zeroes() falls back to writing zero buffers if
write_zeroes is not supported by the driver, so BDRV_REQ_NO_FALLBACK is used.

simple perf test:

  qemu-img create -f qcow2 test.img 4G && \
  qemu-img bench -c $((1024*1024)) -f qcow2 -n -s 4k -t none -w test.img

test results (seconds):

+-----------+----------+----------+------+
|   file    |  before  |  after   | gain |
+-----------+----------+----------+------+
|    ssd    |  61.153  |  36.313  |  41% |
|    hdd    | 112.676  | 122.056  |  -8% |
+-----------+----------+----------+------+

Anton Nefedov (1):
  qcow2: skip writing zero buffers to empty COW areas

 qapi/block-core.json   |  4 +-
 block/qcow2.h  |  6 +++
 block/qcow2-cluster.c  |  2 +-
 block/qcow2.c  | 93 +-
 block/trace-events |  1 +
 tests/qemu-iotests/060 |  7 ++-
 tests/qemu-iotests/060.out |  5 +-
 7 files changed, 112 insertions(+), 6 deletions(-)

-- 
2.17.1

[Qemu-devel] [PATCH v7 8/9] file-posix: account discard operations

2019-05-14 Thread Anton Nefedov

This will help to identify how many of the user-issued discard operations
(accounted on a device level) have actually succeeded down on the host file
(even though the numbers will not be exactly the same if a non-raw format
driver is used (e.g. qcow2 sending metadata discards)).

Note that these numbers will not include discards triggered by
write-zeroes + MAY_UNMAP calls.

Signed-off-by: Anton Nefedov 
---
 block/file-posix.c | 22 +-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 1cf4ee49eb..76d54b3a85 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -159,6 +159,11 @@ typedef struct BDRVRawState {
 bool needs_alignment;
 bool drop_cache;
 bool check_cache_dropped;
+struct {
+int64_t discard_nb_ok;
+int64_t discard_nb_failed;
+int64_t discard_bytes_ok;
+} stats;
 
 PRManager *pr_mgr;
 } BDRVRawState;
@@ -2630,11 +2635,22 @@ static void coroutine_fn raw_co_invalidate_cache(BlockDriverState *bs,
 #endif /* !__linux__ */
 }
 
+static void raw_account_discard(BDRVRawState *s, uint64_t nbytes, int ret)
+{
+if (ret) {
+s->stats.discard_nb_failed++;
+} else {
+s->stats.discard_nb_ok++;
+s->stats.discard_bytes_ok += nbytes;
+}
+}
+
 static coroutine_fn int
 raw_do_pdiscard(BlockDriverState *bs, int64_t offset, int bytes, bool blkdev)
 {
 BDRVRawState *s = bs->opaque;
 RawPosixAIOData acb;
+int ret;
 
 acb = (RawPosixAIOData) {
 .bs = bs,
@@ -2648,7 +2664,9 @@ raw_do_pdiscard(BlockDriverState *bs, int64_t offset, int bytes, bool blkdev)
 acb.aio_type |= QEMU_AIO_BLKDEV;
 }
 
-return raw_thread_pool_submit(bs, handle_aiocb_discard, &acb);
+ret = raw_thread_pool_submit(bs, handle_aiocb_discard, &acb);
+raw_account_discard(s, bytes, ret);
+return ret;
 }
 
 static coroutine_fn int
@@ -3263,10 +3281,12 @@ static int fd_open(BlockDriverState *bs)
 static coroutine_fn int
 hdev_co_pdiscard(BlockDriverState *bs, int64_t offset, int bytes)
 {
+BDRVRawState *s = bs->opaque;
 int ret;
 
 ret = fd_open(bs);
 if (ret < 0) {
+raw_account_discard(s, bytes, ret);
 return ret;
 }
 return raw_do_pdiscard(bs, offset, bytes, true);
-- 
2.17.1
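
For illustration only (not part of the patch): the accounting rule that
raw_account_discard() applies can be sketched in isolation as below. The
DiscardStats type and the replay_discards() helper are hypothetical names
invented for this sketch; only the three counters and the "any non-zero ret
is a failure" rule are taken from the diff above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the counters added to BDRVRawState.stats. */
typedef struct DiscardStats {
    uint64_t discard_nb_ok;
    uint64_t discard_nb_failed;
    uint64_t discard_bytes_ok;
} DiscardStats;

/* Same rule as raw_account_discard(): a non-zero ret counts as a failure,
 * and bytes are only accumulated for successful discards. */
static void account_discard(DiscardStats *st, uint64_t nbytes, int ret)
{
    if (ret) {
        st->discard_nb_failed++;
    } else {
        st->discard_nb_ok++;
        st->discard_bytes_ok += nbytes;
    }
}

/* Hypothetical helper: replay n (nbytes, ret) results into fresh counters. */
static DiscardStats replay_discards(const uint64_t *nbytes, const int *rets,
                                    int n)
{
    DiscardStats st = {0, 0, 0};

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        account_discard(&st, nbytes[i], rets[i]);
    }
    return st;
}
```

Note that a failed discard contributes nothing to discard_bytes_ok, so that
counter only ever reflects bytes the host file accepted.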

[Qemu-devel] [PATCH v7 7/9] scsi: account unmap operations

2019-05-14 Thread Anton Nefedov

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 hw/scsi/scsi-disk.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index 6eff496b54..5c77981d60 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -1609,10 +1609,16 @@ static void scsi_unmap_complete_noio(UnmapCBData *data, int ret)
 r->sector = ldq_be_p(&data->inbuf[0]);
 r->sector_count = ldl_be_p(&data->inbuf[8]) & 0xffffffffULL;
 if (!check_lba_range(s, r->sector, r->sector_count)) {
+block_acct_invalid(blk_get_stats(s->qdev.conf.blk),
+   BLOCK_ACCT_UNMAP);
 scsi_check_condition(r, SENSE_CODE(LBA_OUT_OF_RANGE));
 goto done;
 }
 
+block_acct_start(blk_get_stats(s->qdev.conf.blk), &r->acct,
+ r->sector_count * s->qdev.blocksize,
+ BLOCK_ACCT_UNMAP);
+
 r->req.aiocb = blk_aio_pdiscard(s->qdev.conf.blk,
 r->sector * s->qdev.blocksize,
 r->sector_count * s->qdev.blocksize,
@@ -1639,10 +1645,11 @@ static void scsi_unmap_complete(void *opaque, int ret)
 r->req.aiocb = NULL;
 
 aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
-if (scsi_disk_req_check_error(r, ret, false)) {
+if (scsi_disk_req_check_error(r, ret, true)) {
 scsi_req_unref(&r->req);
 g_free(data);
 } else {
+block_acct_done(blk_get_stats(s->qdev.conf.blk), &r->acct);
 scsi_unmap_complete_noio(data, ret);
 }
 aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
@@ -1674,6 +1681,7 @@ static void scsi_disk_emulate_unmap(SCSIDiskReq *r, uint8_t *inbuf)
 }
 
 if (blk_is_read_only(s->qdev.conf.blk)) {
+block_acct_invalid(blk_get_stats(s->qdev.conf.blk), BLOCK_ACCT_UNMAP);
 scsi_check_condition(r, SENSE_CODE(WRITE_PROTECTED));
 return;
 }
@@ -1689,10 +1697,12 @@ static void scsi_disk_emulate_unmap(SCSIDiskReq *r, uint8_t *inbuf)
 return;
 
 invalid_param_len:
+block_acct_invalid(blk_get_stats(s->qdev.conf.blk), BLOCK_ACCT_UNMAP);
 scsi_check_condition(r, SENSE_CODE(INVALID_PARAM_LEN));
 return;
 
 invalid_field:
+block_acct_invalid(blk_get_stats(s->qdev.conf.blk), BLOCK_ACCT_UNMAP);
 scsi_check_condition(r, SENSE_CODE(INVALID_FIELD));
 }
 
-- 
2.17.1
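
For illustration only (not part of the patch): the three outcomes the patch
distinguishes for one unmap descriptor (rejected as invalid before any I/O,
failed I/O, completed I/O) can be sketched as below. UnmapStats,
lba_range_ok() and account_unmap() are hypothetical names; the range check is
a simplification and not the real check_lba_range().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counters, loosely modelled on the unmap fields of
 * BlockDeviceStats. */
typedef struct UnmapStats {
    uint64_t unmap_operations;
    uint64_t unmap_bytes;
    uint64_t failed_unmap_operations;
    uint64_t invalid_unmap_operations;
} UnmapStats;

/* Simplified, overflow-safe range check (assumption, see lead-in). */
static bool lba_range_ok(uint64_t nb_blocks, uint64_t lba, uint64_t count)
{
    return lba <= nb_blocks && count <= nb_blocks - lba;
}

/* Account one unmap descriptor given the result of its I/O. */
static void account_unmap(UnmapStats *st, uint64_t nb_blocks,
                          uint32_t blocksize, uint64_t lba, uint64_t count,
                          int io_ret)
{
    assert(blocksize > 0);

    if (!lba_range_ok(nb_blocks, lba, count)) {
        st->invalid_unmap_operations++;  /* rejected, no I/O was issued */
        return;
    }
    if (io_ret < 0) {
        st->failed_unmap_operations++;   /* I/O was issued and failed */
        return;
    }
    st->unmap_operations++;
    st->unmap_bytes += count * blocksize;
}
```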

[Qemu-devel] [PATCH v7 6/9] scsi: move unmap error checking to the complete callback

2019-05-14 Thread Anton Nefedov

This will help to account the operation in the following commit.

The difference is that we don't call scsi_disk_req_check_error() before
the 1st discard iteration anymore. That function also checks whether
the request is cancelled; however, it shouldn't get cancelled until it
yields in blk_aio() functions anyway.
The same approach is already used for emulate_write_same.

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Alberto Garcia 
---
 hw/scsi/scsi-disk.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index b43254103c..6eff496b54 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -1604,9 +1604,6 @@ static void scsi_unmap_complete_noio(UnmapCBData *data, int ret)
 SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
 
 assert(r->req.aiocb == NULL);
-if (scsi_disk_req_check_error(r, ret, false)) {
-goto done;
-}
 
 if (data->count > 0) {
 r->sector = ldq_be_p(&data->inbuf[0]);
@@ -1642,7 +1639,12 @@ static void scsi_unmap_complete(void *opaque, int ret)
 r->req.aiocb = NULL;
 
 aio_context_acquire(blk_get_aio_context(s->qdev.conf.blk));
-scsi_unmap_complete_noio(data, ret);
+if (scsi_disk_req_check_error(r, ret, false)) {
+scsi_req_unref(&r->req);
+g_free(data);
+} else {
+scsi_unmap_complete_noio(data, ret);
+}
 aio_context_release(blk_get_aio_context(s->qdev.conf.blk));
 }
 
-- 
2.17.1

[Qemu-devel] [PATCH v7 3/9] block: add empty account cookie type

2019-05-14 Thread Anton Nefedov

This adds some protection against accounting an uninitialized cookie.
That is, block_acct_failed/done without a previous block_acct_start;
in that case, the cookie probably holds values from the previous operation.

(Note: it might also be uninitialized and hold a garbage value, and there is
 still the "< BLOCK_MAX_IOTYPE" assertion for that.
 So block_acct_failed/done without a previous block_acct_start should be used
 with caution.)

Currently this is particularly useful in ide code where it's hard to
keep track of whether the request started accounting or not. For example,
trim requests do the accounting separately.

Signed-off-by: Anton Nefedov 
---
 include/block/accounting.h | 1 +
 block/accounting.c | 6 ++
 2 files changed, 7 insertions(+)

diff --git a/include/block/accounting.h b/include/block/accounting.h
index ba8b04d572..878b4c3581 100644
--- a/include/block/accounting.h
+++ b/include/block/accounting.h
@@ -33,6 +33,7 @@ typedef struct BlockAcctTimedStats BlockAcctTimedStats;
 typedef struct BlockAcctStats BlockAcctStats;
 
 enum BlockAcctType {
+BLOCK_ACCT_NONE = 0,
 BLOCK_ACCT_READ,
 BLOCK_ACCT_WRITE,
 BLOCK_ACCT_FLUSH,
diff --git a/block/accounting.c b/block/accounting.c
index 70a3d9a426..8d41c8a83a 100644
--- a/block/accounting.c
+++ b/block/accounting.c
@@ -195,6 +195,10 @@ static void block_account_one_io(BlockAcctStats *stats, BlockAcctCookie *cookie,
 
 assert(cookie->type < BLOCK_MAX_IOTYPE);
 
+if (cookie->type == BLOCK_ACCT_NONE) {
+return;
+}
+
 qemu_mutex_lock(&stats->lock);
 
 if (failed) {
@@ -217,6 +221,8 @@ static void block_account_one_io(BlockAcctStats *stats, BlockAcctCookie *cookie,
 }
 
 qemu_mutex_unlock(&stats->lock);
+
+cookie->type = BLOCK_ACCT_NONE;
 }
 
 void block_acct_done(BlockAcctStats *stats, BlockAcctCookie *cookie)
-- 
2.17.1
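
For illustration only (not part of the patch): the effect of the empty cookie
type can be sketched as below. A zeroed or already-consumed cookie is ignored,
so a stray "done" call does not account the same operation twice. All names
here (AcctType, Cookie, Stats, acct_start, acct_done) are hypothetical
simplifications; the real block_account_one_io() also handles failures,
latency and locking.

```c
#include <assert.h>
#include <stdint.h>

enum AcctType { ACCT_NONE = 0, ACCT_READ, ACCT_WRITE, ACCT_MAX };

typedef struct Cookie {
    enum AcctType type;
    uint64_t bytes;
} Cookie;

typedef struct Stats {
    uint64_t nr_ops[ACCT_MAX];
    uint64_t nr_bytes[ACCT_MAX];
} Stats;

/* Arm the cookie for one operation. */
static void acct_start(Cookie *c, uint64_t bytes, enum AcctType type)
{
    assert(type > ACCT_NONE && type < ACCT_MAX);
    c->bytes = bytes;
    c->type = type;
}

/* Account the operation once, then reset the cookie to the empty type. */
static void acct_done(Stats *st, Cookie *c)
{
    assert(c->type < ACCT_MAX);

    if (c->type == ACCT_NONE) {
        return;                 /* never started, or already accounted */
    }
    st->nr_ops[c->type]++;
    st->nr_bytes[c->type] += c->bytes;
    c->type = ACCT_NONE;        /* makes a repeated call a no-op */
}
```

As the commit message notes, this only helps when the cookie is
zero-initialized or was used before; a cookie holding garbage can still trip
the assertion.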

[Qemu-devel] [PATCH v7 9/9] qapi: query-blockstat: add driver specific file-posix stats

2019-05-14 Thread Anton Nefedov

A block driver can provide a callback to report driver-specific
statistics.

The file-posix driver now reports discard statistics.

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Acked-by: Markus Armbruster 
---
 qapi/block-core.json  | 38 ++
 include/block/block.h |  1 +
 include/block/block_int.h |  1 +
 block.c   |  9 +
 block/file-posix.c| 38 +++---
 block/qapi.c  |  5 +
 6 files changed, 89 insertions(+), 3 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 55194f84ce..368e09ae37 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -956,6 +956,41 @@
'*wr_latency_histogram': 'BlockLatencyHistogramInfo',
'*flush_latency_histogram': 'BlockLatencyHistogramInfo' } }
 
+##
+# @BlockStatsSpecificFile:
+#
+# File driver statistics
+#
+# @discard-nb-ok: The number of successful discard operations performed by
+# the driver.
+#
+# @discard-nb-failed: The number of failed discard operations performed by
+# the driver.
+#
+# @discard-bytes-ok: The number of bytes discarded by the driver.
+#
+# Since: 4.1
+##
+{ 'struct': 'BlockStatsSpecificFile',
+  'data': {
+  'discard-nb-ok': 'uint64',
+  'discard-nb-failed': 'uint64',
+  'discard-bytes-ok': 'uint64' } }
+
+##
+# @BlockStatsSpecific:
+#
+# Block driver specific statistics
+#
+# Since: 4.1
+##
+{ 'union': 'BlockStatsSpecific',
+  'base': { 'driver': 'BlockdevDriver' },
+  'discriminator': 'driver',
+  'data': {
+  'file': 'BlockStatsSpecificFile',
+  'host_device': 'BlockStatsSpecificFile' } }
+
 ##
 # @BlockStats:
 #
@@ -971,6 +1006,8 @@
 #
 # @stats:  A @BlockDeviceStats for the device.
 #
+# @driver-specific: Optional driver-specific stats. (Since 4.1)
+#
 # @parent: This describes the file block device if it has one.
 #  Contains recursively the statistics of the underlying
 #  protocol (e.g. the host file for a qcow2 image). If there is
@@ -984,6 +1021,7 @@
 { 'struct': 'BlockStats',
   'data': {'*device': 'str', '*qdev': 'str', '*node-name': 'str',
'stats': 'BlockDeviceStats',
+   '*driver-specific': 'BlockStatsSpecific',
'*parent': 'BlockStats',
'*backing': 'BlockStats'} }
 
diff --git a/include/block/block.h b/include/block/block.h
index 5e2b98b0ee..b182f0c7ae 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -490,6 +490,7 @@ int bdrv_get_flags(BlockDriverState *bs);
 int bdrv_get_info(BlockDriverState *bs, BlockDriverInfo *bdi);
 ImageInfoSpecific *bdrv_get_specific_info(BlockDriverState *bs,
   Error **errp);
+BlockStatsSpecific *bdrv_get_specific_stats(BlockDriverState *bs);
 void bdrv_round_to_clusters(BlockDriverState *bs,
 int64_t offset, int64_t bytes,
 int64_t *cluster_offset,
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 94d45c9708..dc3bc97ea3 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -358,6 +358,7 @@ struct BlockDriver {
 int (*bdrv_get_info)(BlockDriverState *bs, BlockDriverInfo *bdi);
 ImageInfoSpecific *(*bdrv_get_specific_info)(BlockDriverState *bs,
  Error **errp);
+BlockStatsSpecific *(*bdrv_get_specific_stats)(BlockDriverState *bs);
 
 int coroutine_fn (*bdrv_save_vmstate)(BlockDriverState *bs,
   QEMUIOVector *qiov,
diff --git a/block.c b/block.c
index 6999aad446..f68fb5aaec 100644
--- a/block.c
+++ b/block.c
@@ -4942,6 +4942,15 @@ ImageInfoSpecific *bdrv_get_specific_info(BlockDriverState *bs,
 return NULL;
 }
 
+BlockStatsSpecific *bdrv_get_specific_stats(BlockDriverState *bs)
+{
+BlockDriver *drv = bs->drv;
+if (!drv || !drv->bdrv_get_specific_stats) {
+return NULL;
+}
+return drv->bdrv_get_specific_stats(bs);
+}
+
 void bdrv_debug_event(BlockDriverState *bs, BlkdebugEvent event)
 {
 if (!bs || !bs->drv || !bs->drv->bdrv_debug_event) {
diff --git a/block/file-posix.c b/block/file-posix.c
index 76d54b3a85..a2f01cfe10 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -160,9 +160,9 @@ typedef struct BDRVRawState {
 bool drop_cache;
 bool check_cache_dropped;
 struct {
-int64_t discard_nb_ok;
-int64_t discard_nb_failed;
-int64_t discard_bytes_ok;
+uint64_t discard_nb_ok;
+uint64_t discard_nb_failed;
+uint64_t discard_bytes_ok;
 } stats;
 
 PRManager *pr_mgr;
@@ -2723,6 +2723,36 @@ static int raw_get_info(BlockDriverState *bs, BlockDriverInfo *bdi)
 return 0;
 }
 
+static BlockStatsSpecificFile get_blockstats_specific_file(BlockDriverState *bs)
+{
+BDRVRawState *s = bs->opaque;
+return (B

[Qemu-devel] [PATCH v7 2/9] qapi: add unmap to BlockDeviceStats

2019-05-14 Thread Anton Nefedov

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Alberto Garcia 
Reviewed-by: Eric Blake 
---
 qapi/block-core.json   | 29 +++--
 include/block/accounting.h |  1 +
 block/qapi.c   |  6 ++
 tests/qemu-iotests/227.out | 18 ++
 4 files changed, 48 insertions(+), 6 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 754d07f1fb..55194f84ce 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -856,6 +856,8 @@
 #
 # @wr_bytes:  The number of bytes written by the device.
 #
+# @unmap_bytes: The number of bytes unmapped by the device (Since 4.1)
+#
 # @rd_operations: The number of read operations performed by the device.
 #
 # @wr_operations: The number of write operations performed by the device.
@@ -863,6 +865,9 @@
 # @flush_operations: The number of cache flush operations performed by the
 #device (since 0.15.0)
 #
+# @unmap_operations: The number of unmap operations performed by the device
+#(Since 4.1)
+#
 # @rd_total_time_ns: Total time spent on reads in nanoseconds (since 0.15.0).
 #
 # @wr_total_time_ns: Total time spent on writes in nanoseconds (since 0.15.0).
@@ -870,6 +875,9 @@
 # @flush_total_time_ns: Total time spent on cache flushes in nanoseconds
 #   (since 0.15.0).
 #
+# @unmap_total_time_ns: Total time spent on unmap operations in nanoseconds
+#   (Since 4.1)
+#
 # @wr_highest_offset: The offset after the greatest byte written to the
 # device.  The intended use of this information is for
 # growable sparse files (like qcow2) that are used on top
@@ -881,6 +889,9 @@
 # @wr_merged: Number of write requests that have been merged into another
 # request (Since 2.3).
 #
+# @unmap_merged: Number of unmap requests that have been merged into another
+#request (Since 4.1)
+#
 # @idle_time_ns: Time since the last I/O operation, in
 #nanoseconds. If the field is absent it means that
 #there haven't been any operations yet (Since 2.5).
@@ -894,6 +905,9 @@
 # @failed_flush_operations: The number of failed flush operations
 #   performed by the device (Since 2.5)
 #
+# @failed_unmap_operations: The number of failed unmap operations performed
+#   by the device (Since 4.1)
+#
 # @invalid_rd_operations: The number of invalid read operations
 #  performed by the device (Since 2.5)
 #
@@ -903,6 +917,9 @@
 # @invalid_flush_operations: The number of invalid flush operations
 #performed by the device (Since 2.5)
 #
+# @invalid_unmap_operations: The number of invalid unmap operations performed
+#by the device (Since 4.1)
+#
 # @account_invalid: Whether invalid operations are included in the
 #   last access statistics (Since 2.5)
 #
@@ -921,18 +938,18 @@
 # Since: 0.14.0
 ##
 { 'struct': 'BlockDeviceStats',
-  'data': {'rd_bytes': 'int', 'wr_bytes': 'int',
+  'data': {'rd_bytes': 'int', 'wr_bytes': 'int', 'unmap_bytes' : 'int',
'rd_operations': 'int', 'wr_operations': 'int',
-   'flush_operations': 'int',
+   'flush_operations': 'int', 'unmap_operations': 'int',
'rd_total_time_ns': 'int', 'wr_total_time_ns': 'int',
-   'flush_total_time_ns': 'int',
+   'flush_total_time_ns': 'int', 'unmap_total_time_ns': 'int',
'wr_highest_offset': 'int',
-   'rd_merged': 'int', 'wr_merged': 'int',
+   'rd_merged': 'int', 'wr_merged': 'int', 'unmap_merged': 'int',
'*idle_time_ns': 'int',
'failed_rd_operations': 'int', 'failed_wr_operations': 'int',
-   'failed_flush_operations': 'int',
+   'failed_flush_operations': 'int', 'failed_unmap_operations': 'int',
'invalid_rd_operations': 'int', 'invalid_wr_operations': 'int',
-   'invalid_flush_operations': 'int',
+   'invalid_flush_operations': 'int', 'invalid_unmap_operations': 
'int',
'account_invalid': 'bool', 'account_failed': 'bool',
'timed_stats': ['BlockDeviceTimedStats'],
'*rd_latency_histogram': 'BlockLatencyHistogramInfo',
diff --git a/include/block/accounting.h b/include/block/accounting.h
index d1f67b10dd..ba8b04d572 100644
--- a/include/block/accounting.h
+++ b/include/block/accounting.h
@@ -36,6 +36,7 @@ enum BlockAcctType {
 BLOCK_ACCT_READ,
 BLOCK_ACCT_WRITE,
 BLOCK_ACCT_FLUSH,
+BLOCK_ACCT_UNMAP,
 BLOCK_MAX_IOTYPE,
 };
 
diff --git a/block/qapi.c b/block/qapi.c
index 0c13c86f4e..f9447a3297 100644
--- a/block/qapi.c
+++ b/block/qapi.c
@@ -434,24 +434,30 @@ static void bdrv_query_blk_stats(BlockDeviceStats *ds, BlockBackend *blk)
 
 ds->rd_bytes = stats->nr_bytes[BLOCK_ACCT_READ];
 ds->wr_byte

[Qemu-devel] [PATCH v7 0/9] discard blockstats

2019-05-14 Thread Anton Nefedov

hi,

yet another take for this patch series; please kindly consider these for 4.1

Just a few cosmetic comments were received for v6 so this is mostly
a rebase+ping.

new in v7:
- general rebase
- since clauses -> 4.1
- patch 8: not completely trivial rebase: raw_account_discard moved to
  common raw_do_pdiscard()
- patch 9: comment wording fixed

v6: http://lists.nongnu.org/archive/html/qemu-devel/2018-11/msg06633.html
v5: http://lists.nongnu.org/archive/html/qemu-devel/2018-10/msg06828.html
v4: http://lists.nongnu.org/archive/html/qemu-devel/2018-08/msg04308.html
v3: http://lists.nongnu.org/archive/html/qemu-devel/2018-06/msg03688.html



qmp query-blockstats provides stats info for write/read/flush ops.

Patches 1-7 implement similar accounting for the discard (unmap) command for
scsi and ide disks.
The discard stats "unmap_operations / unmap_bytes" are supposed to account the
ops that have completed without an error.

However, discard operation is advisory. Specifically,
 - common block layer ignores ENOTSUP error code.
   That might be returned if the block driver does not support discard,
   or discard has been configured to be ignored.
 - format drivers such as qcow2 may ignore discard if they were configured
   to ignore that, or if the corresponding area is already marked unused
   (unallocated / zero clusters).

And what is actually useful is the number of bytes actually discarded
down on the host filesystem.
To achieve that, driver-specific statistics have been added to blockstats
(patch 9).
With patch 8, the file-posix driver accounts discard operations on its level too.

query-blockstat result:

(note the difference between blockdevice unmap and file discard stats. qcow2
sends fewer ops down to the file as the clusters are actually unallocated
on qcow2 level)

{
  "device": "drive-scsi0-0-0-0",
  "node-name": "#block159",
  "stats": {
>   "invalid_unmap_operations": 0,
>   "failed_unmap_operations": 0,
"wr_highest_offset": 13411688448,
"rd_total_time_ns": 2859566315,
"rd_bytes": 103182336,
"rd_merged": 0,
"flush_operations": 19,
"invalid_wr_operations": 0,
"flush_total_time_ns": 23111608,
"failed_rd_operations": 0,
"failed_flush_operations": 0,
"invalid_flush_operations": 0,
"timed_stats": [
  
],
"wr_merged": 0,
"wr_bytes": 1702912,
>   "unmap_bytes": 11954954240,
>   "unmap_operations": 865,
"idle_time_ns": 2669508623,
"account_invalid": true,
>   "unmap_total_time_ns": 19698002,
"wr_operations": 143,
"failed_wr_operations": 0,
"rd_operations": 4816,
"account_failed": true,
>   "unmap_merged": 0,
"wr_total_time_ns": 1262686124,
"invalid_rd_operations": 0
  },
  "parent": {
>   "driver-specific": {
> "discard-nb-failed": 0,
> "discard-bytes-ok": 720896,
> "driver": "file",
> "discard-nb-ok": 8
>   },
"node-name": "#block009",
"stats": {
[..]
}
  }
},
{
  "device": "floppy0",

Anton Nefedov (9):
  qapi: group BlockDeviceStats fields
  qapi: add unmap to BlockDeviceStats
  block: add empty account cookie type
  ide: account UNMAP (TRIM) operations
  scsi: store unmap offset and nb_sectors in request struct
  scsi: move unmap error checking to the complete callback
  scsi: account unmap operations
  file-posix: account discard operations
  qapi: query-blockstat: add driver specific file-posix stats

 qapi/block-core.json   | 81 --
 include/block/accounting.h |  2 +
 include/block/block.h  |  1 +
 include/block/block_int.h  |  1 +
 block.c|  9 +
 block/accounting.c |  6 +++
 block/file-posix.c | 54 -
 block/qapi.c   | 11 ++
 hw/ide/core.c  | 12 ++
 hw/scsi/scsi-disk.c| 32 +--
 tests/qemu-iotests/227.out | 18 +
 11 files changed, 204 insertions(+), 23 deletions(-)

-- 
2.17.1

[Qemu-devel] [PATCH v7 5/9] scsi: store unmap offset and nb_sectors in request struct

2019-05-14 Thread Anton Nefedov

This allows reporting them in the error handler.

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Alberto Garcia 
---
 hw/scsi/scsi-disk.c | 12 +---
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
index e7e865ab3b..b43254103c 100644
--- a/hw/scsi/scsi-disk.c
+++ b/hw/scsi/scsi-disk.c
@@ -1602,8 +1602,6 @@ static void scsi_unmap_complete_noio(UnmapCBData *data, int ret)
 {
 SCSIDiskReq *r = data->r;
 SCSIDiskState *s = DO_UPCAST(SCSIDiskState, qdev, r->req.dev);
-uint64_t sector_num;
-uint32_t nb_sectors;
 
 assert(r->req.aiocb == NULL);
 if (scsi_disk_req_check_error(r, ret, false)) {
@@ -1611,16 +1609,16 @@ static void scsi_unmap_complete_noio(UnmapCBData *data, int ret)
 }
 
 if (data->count > 0) {
-sector_num = ldq_be_p(&data->inbuf[0]);
-nb_sectors = ldl_be_p(&data->inbuf[8]) & 0xffffffffULL;
-if (!check_lba_range(s, sector_num, nb_sectors)) {
+r->sector = ldq_be_p(&data->inbuf[0]);
+r->sector_count = ldl_be_p(&data->inbuf[8]) & 0xffffffffULL;
+if (!check_lba_range(s, r->sector, r->sector_count)) {
 scsi_check_condition(r, SENSE_CODE(LBA_OUT_OF_RANGE));
 goto done;
 }
 
 r->req.aiocb = blk_aio_pdiscard(s->qdev.conf.blk,
-sector_num * s->qdev.blocksize,
-nb_sectors * s->qdev.blocksize,
+r->sector * s->qdev.blocksize,
+r->sector_count * s->qdev.blocksize,
 scsi_unmap_complete, data);
 data->count--;
 data->inbuf += 16;
-- 
2.17.1

[Qemu-devel] [PATCH v7 4/9] ide: account UNMAP (TRIM) operations

2019-05-14 Thread Anton Nefedov

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 hw/ide/core.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/hw/ide/core.c b/hw/ide/core.c
index 6afadf894f..3a7ac93777 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -441,6 +441,14 @@ static void ide_issue_trim_cb(void *opaque, int ret)
 TrimAIOCB *iocb = opaque;
 IDEState *s = iocb->s;
 
+if (iocb->i >= 0) {
+if (ret >= 0) {
+block_acct_done(blk_get_stats(s->blk), &s->acct);
+} else {
+block_acct_failed(blk_get_stats(s->blk), &s->acct);
+}
+}
+
 if (ret >= 0) {
 while (iocb->j < iocb->qiov->niov) {
 int j = iocb->j;
@@ -458,10 +466,14 @@ static void ide_issue_trim_cb(void *opaque, int ret)
 }
 
 if (!ide_sect_range_ok(s, sector, count)) {
+block_acct_invalid(blk_get_stats(s->blk), BLOCK_ACCT_UNMAP);
 iocb->ret = -EINVAL;
 goto done;
 }
 
+block_acct_start(blk_get_stats(s->blk), &s->acct,
+ count << BDRV_SECTOR_BITS, BLOCK_ACCT_UNMAP);
+
 /* Got an entry! Submit and exit.  */
 iocb->aiocb = blk_aio_pdiscard(s->blk,
sector << BDRV_SECTOR_BITS,
-- 
2.17.1

[Qemu-devel] [PATCH v7 1/9] qapi: group BlockDeviceStats fields

2019-05-14 Thread Anton Nefedov

Make the stat fields definition slightly more readable.
Also reorder total_time_ns stats read-write-flush as done elsewhere.
Cosmetic change only.

Signed-off-by: Anton Nefedov 
Reviewed-by: Alberto Garcia 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 qapi/block-core.json | 26 +++---
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 7ccbfff9d0..754d07f1fb 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -863,12 +863,12 @@
 # @flush_operations: The number of cache flush operations performed by the
 #device (since 0.15.0)
 #
-# @flush_total_time_ns: Total time spend on cache flushes in nano-seconds
-#   (since 0.15.0).
+# @rd_total_time_ns: Total time spent on reads in nanoseconds (since 0.15.0).
 #
-# @wr_total_time_ns: Total time spend on writes in nano-seconds (since 0.15.0).
+# @wr_total_time_ns: Total time spent on writes in nanoseconds (since 0.15.0).
 #
-# @rd_total_time_ns: Total_time_spend on reads in nano-seconds (since 0.15.0).
+# @flush_total_time_ns: Total time spent on cache flushes in nanoseconds
+#   (since 0.15.0).
 #
 # @wr_highest_offset: The offset after the greatest byte written to the
 # device.  The intended use of this information is for
@@ -921,14 +921,18 @@
 # Since: 0.14.0
 ##
 { 'struct': 'BlockDeviceStats',
-  'data': {'rd_bytes': 'int', 'wr_bytes': 'int', 'rd_operations': 'int',
-   'wr_operations': 'int', 'flush_operations': 'int',
-   'flush_total_time_ns': 'int', 'wr_total_time_ns': 'int',
-   'rd_total_time_ns': 'int', 'wr_highest_offset': 'int',
-   'rd_merged': 'int', 'wr_merged': 'int', '*idle_time_ns': 'int',
+  'data': {'rd_bytes': 'int', 'wr_bytes': 'int',
+   'rd_operations': 'int', 'wr_operations': 'int',
+   'flush_operations': 'int',
+   'rd_total_time_ns': 'int', 'wr_total_time_ns': 'int',
+   'flush_total_time_ns': 'int',
+   'wr_highest_offset': 'int',
+   'rd_merged': 'int', 'wr_merged': 'int',
+   '*idle_time_ns': 'int',
'failed_rd_operations': 'int', 'failed_wr_operations': 'int',
-   'failed_flush_operations': 'int', 'invalid_rd_operations': 'int',
-   'invalid_wr_operations': 'int', 'invalid_flush_operations': 'int',
+   'failed_flush_operations': 'int',
+   'invalid_rd_operations': 'int', 'invalid_wr_operations': 'int',
+   'invalid_flush_operations': 'int',
'account_invalid': 'bool', 'account_failed': 'bool',
'timed_stats': ['BlockDeviceTimedStats'],
'*rd_latency_histogram': 'BlockLatencyHistogramInfo',
-- 
2.17.1

Re: [Qemu-devel] [PATCH v12 00/10] qcow2: cluster space preallocation

2019-02-21 Thread Anton Nefedov

On 14/1/2019 2:18 PM, Anton Nefedov wrote:
> This pull request is to start to improve a few performance points of
> qcow2 format:
> 
>1. non cluster-aligned write requests (to unallocated clusters) explicitly
>   pad data with zeroes if there is no backing data.
>   Resulting increase in ops number and potential cluster fragmentation
>   (on the host file) is already solved by:
> ee22a9d qcow2: Merge the writing of the COW regions with the guest 
> data
>   However, in case of zero COW regions, that can be avoided at all
>   but the whole clusters are preallocated and zeroed in a single
>   efficient write_zeroes() operation
> 
>2. moreover, efficient write_zeroes() operation can be used to preallocate
>   space megabytes (*configurable number) ahead which gives noticeable
>   improvement on some storage types (e.g. distributed storage)
>   where the space allocation operation might be expensive)
>   (Not included in this patchset since v6).
> 
>3. this will also allow to enable simultaneous writes to the same 
> unallocated
>   cluster after the space has been allocated & zeroed but before
>   the first data is written and the cluster is linked to L2.
>   (Not included in this patchset).
> 
> Efficient write_zeroes usually implies that the blocks are not actually
> written to but only reserved and marked as zeroed by the storage.
> In this patchset, file-posix driver is marked as supporting this operation
> if it supports (/configured to support) fallocate() operation.
> 
> Existing bdrv_write_zeroes() falls back to writing zero buffers if
> write_zeroes is not supported by the driver.
> A new flag (BDRV_REQ_ALLOCATE) is introduced to avoid that but return ENOTSUP.
> Such allocate requests are also implemented to possibly overlap with the
> other requests. No wait is performed but an error returned in such case as 
> well.
> So the operation should be considered advisory and a fallback scenario still
> handled by the caller (in this case, qcow2 driver).
> 
> simple perf test:
> 
>qemu-img create -f qcow2 test.img 4G && \
>qemu-img bench -c $((1024*1024)) -f qcow2 -n -s 4k -t none -w test.img
> 
> test results (seconds):
> 
>  +------+----------+----------+------+
>  | file |  before  |  after   | gain |
>  +------+----------+----------+------+
>  | ssd  |  61.153  |  36.313  |  41% |
>  | hdd  | 112.676  | 122.056  |  -8% |
>  +------+----------+----------+------+
> 

ping

Re: [Qemu-devel] [PATCH v6 0/9] discard blockstats

2019-02-21 Thread Anton Nefedov

On 14/1/2019 4:16 PM, Anton Nefedov wrote:
> On 30/11/2018 5:47 PM, Anton Nefedov wrote:
>> qmp query-blockstats provides stats info for write/read/flush ops.
>>
>> Patches 1-7 implement the similar for discard (unmap) command for scsi
>> and ide disks.
>> Discard stat "unmap_ops / unmap_bytes" is supposed to account the ops that
>> have completed without an error.
>>
>> However, discard operation is advisory. Specifically,
>>   - common block layer ignores ENOTSUP error code.
>>     That might be returned if the block driver does not support discard,
>>     or discard has been configured to be ignored.
>>   - format drivers such as qcow2 may ignore discard if they were 
>> configured
>>     to ignore that, or if the corresponding area is already marked unused
>>     (unallocated / zero clusters).
>>
>> And what is actually useful is the number of bytes actually discarded
>> down on the host filesystem.
>> To achieve that, driver-specific statistics has been added to blockstats
>> (patch 9).
>> With patch 8, file-posix driver accounts discard operations on its level too.
>>
> 
> ping

ping

Re: [Qemu-devel] [PATCH v12 09/10] qcow2: skip writing zero buffers to empty COW areas

2019-01-16 Thread Anton Nefedov

On 15/1/2019 6:27 PM, Alberto Garcia wrote:
> On Mon 14 Jan 2019 12:18:30 PM CET, Anton Nefedov wrote:
>> If COW areas of the newly allocated clusters are zeroes on the backing image,
>> efficient bdrv_write_zeroes(flags=BDRV_REQ_ALLOCATE) can be used on the whole
>> cluster instead of writing explicit zero buffers later in perform_cow().
>>
>> iotest 060:
>> write to the discarded cluster does not trigger COW anymore.
>> Use a backing image instead.
>>
>> Signed-off-by: Anton Nefedov 
>> Reviewed-by: Vladimir Sementsov-Ogievskiy 
> 
> Reviewed-by: Alberto Garcia 
> 
>> +ret = handle_alloc_space(bs, l2meta);
> 
> I insist that it would be nice to have a short comment explaining what
> this does.
> 

Right, sorry, I forgot your comment.
I'd go with:

+/* Try to efficiently initialize the physical space with zeroes */
  ret = handle_alloc_space(bs, l2meta);
  if (ret < 0) {
  qemu_co_mutex_lock(&s->lock);

Re: [Qemu-devel] [PATCH v6 0/9] discard blockstats

2019-01-14 Thread Anton Nefedov

On 30/11/2018 5:47 PM, Anton Nefedov wrote:
> qmp query-blockstats provides stats info for write/read/flush ops.
> 
> Patches 1-7 implement the similar for discard (unmap) command for scsi
> and ide disks.
> Discard stat "unmap_ops / unmap_bytes" is supposed to account the ops that
> have completed without an error.
> 
> However, discard operation is advisory. Specifically,
>   - common block layer ignores ENOTSUP error code.
> That might be returned if the block driver does not support discard,
> or discard has been configured to be ignored.
>   - format drivers such as qcow2 may ignore discard if they were configured
> to ignore that, or if the corresponding area is already marked unused
> (unallocated / zero clusters).
> 
> And what is actually useful is the number of bytes actually discarded
> down on the host filesystem.
> To achieve that, driver-specific statistics has been added to blockstats
> (patch 9).
> With patch 8, file-posix driver accounts discard operations on its level too.
> 

ping

[Qemu-devel] [PATCH v12 05/10] block: treat BDRV_REQ_ALLOCATE as serialising

2019-01-14 Thread Anton Nefedov

The idea is that ALLOCATE requests may overlap with other requests.
Reuse the existing block layer infrastructure for serialising requests.
Use the following approach:
  - mark ALLOCATE also SERIALISING, so subsequent requests to the area wait
  - ALLOCATE request itself must never wait if another request is in flight
already. Return EAGAIN, let the caller reconsider.

Signed-off-by: Anton Nefedov 
Reviewed-by: Alberto Garcia 
---
 include/block/block.h |  3 +++
 block/io.c| 31 ---
 2 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 643d32f4b8..dfc0fc1b8f 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -88,6 +88,9 @@ typedef enum {
  * efficiently allocate the space so it reads as zeroes, or return an error.
  * If this flag is set then BDRV_REQ_ZERO_WRITE must also be set.
  * This flag cannot be set together with BDRV_REQ_MAY_UNMAP.
+ * This flag implicitly sets BDRV_REQ_SERIALISING meaning it is protected
+ * from conflicts with overlapping requests. If such conflict is detected,
+ * -EAGAIN is returned.
  */
 BDRV_REQ_ALLOCATE   = 0x100,
 
diff --git a/block/io.c b/block/io.c
index 66006a089d..4451714a60 100644
--- a/block/io.c
+++ b/block/io.c
@@ -720,12 +720,13 @@ void bdrv_dec_in_flight(BlockDriverState *bs)
 bdrv_wakeup(bs);
 }
 
-static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
+static bool coroutine_fn find_or_wait_serialising_requests(
+BdrvTrackedRequest *self, bool wait)
 {
 BlockDriverState *bs = self->bs;
 BdrvTrackedRequest *req;
 bool retry;
-bool waited = false;
+bool found = false;
 
 if (!atomic_read(&bs->serialising_in_flight)) {
 return false;
@@ -751,11 +752,14 @@ static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
  * will wait for us as soon as it wakes up, then just go on
  * (instead of producing a deadlock in the former case). */
 if (!req->waiting_for) {
+found = true;
+if (!wait) {
+break;
+}
 self->waiting_for = req;
 qemu_co_queue_wait(&req->wait_queue, &bs->reqs_lock);
 self->waiting_for = NULL;
 retry = true;
-waited = true;
 break;
 }
 }
@@ -763,7 +767,12 @@ static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
 qemu_co_mutex_unlock(&bs->reqs_lock);
 } while (retry);
 
-return waited;
+return found;
+}
+
+static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
+{
+return find_or_wait_serialising_requests(self, true);
 }
 
 static int bdrv_check_byte_request(BlockDriverState *bs, int64_t offset,
@@ -1585,7 +1594,7 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t offset, uint64_t bytes,
   BdrvTrackedRequest *req, int flags)
 {
 BlockDriverState *bs = child->bs;
-bool waited;
+bool found;
 int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);
 
 if (bs->read_only) {
@@ -1602,9 +1611,13 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t offset, uint64_t bytes,
 mark_request_serialising(req, bdrv_get_cluster_size(bs));
 }
 
-waited = wait_serialising_requests(req);
+found = find_or_wait_serialising_requests(req,
+  !(flags & BDRV_REQ_ALLOCATE));
+if (found && (flags & BDRV_REQ_ALLOCATE)) {
+return -EAGAIN;
+}
 
-assert(!waited || !req->serialising ||
+assert(!found || !req->serialising ||
is_request_serialising_and_aligned(req));
 assert(req->overlap_offset <= offset);
 assert(offset + bytes <= req->overlap_offset + req->overlap_bytes);
@@ -1864,6 +1877,10 @@ int coroutine_fn bdrv_co_pwritev(BdrvChild *child,
 assert(!((flags & BDRV_REQ_ALLOCATE) && (flags & BDRV_REQ_MAY_UNMAP)));
 assert(!((flags & BDRV_REQ_ALLOCATE) && !(flags & BDRV_REQ_ZERO_WRITE)));
 
+if (flags & BDRV_REQ_ALLOCATE) {
+flags |= BDRV_REQ_SERIALISING;
+}
+
 trace_bdrv_co_pwritev(child->bs, offset, bytes, flags);
 
 if (!bs->drv) {
-- 
2.17.1
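
The find-or-fail behaviour of the patch above can be sketched outside of QEMU. This is a simplified illustration, not the block-layer code: `struct tracked_req`, `reqs_overlap()`, `find_conflict()` and `prepare_write()` are hypothetical names, coroutines and locking are left out, and waiting is not modelled at all.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same value as BDRV_REQ_ALLOCATE in the series; used here for illustration. */
enum { REQ_ALLOCATE = 0x100 };

/* Hypothetical, simplified tracked request (not the QEMU structure). */
struct tracked_req {
    int64_t offset;
    int64_t bytes;
    bool serialising;
};

/* Two byte ranges overlap if each one starts before the other one ends. */
bool reqs_overlap(const struct tracked_req *a, const struct tracked_req *b)
{
    return a->offset < b->offset + b->bytes &&
           b->offset < a->offset + a->bytes;
}

/*
 * Returns true if any other in-flight request conflicts with @self.
 * A conflict needs an overlap and at least one serialising side.
 */
bool find_conflict(const struct tracked_req *self,
                   const struct tracked_req *in_flight, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        const struct tracked_req *req = &in_flight[i];

        if (req == self || (!req->serialising && !self->serialising)) {
            continue;
        }
        if (reqs_overlap(self, req)) {
            return true;
        }
    }
    return false;
}

/*
 * Models the write-prepare decision: an allocate request gives up with
 * -EAGAIN on a conflict. Any other request would wait in the real code;
 * waiting is not modelled here, so 0 is returned.
 */
int prepare_write(const struct tracked_req *self, int flags,
                  const struct tracked_req *in_flight, size_t n)
{
    bool found = find_conflict(self, in_flight, n);

    if (found && (flags & REQ_ALLOCATE)) {
        return -EAGAIN;
    }
    return 0;
}
```

The point of the design is that an allocate request is only an optimization, so it is cheaper to abandon it than to queue behind a conflicting request.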

[Qemu-devel] [PATCH v12 03/10] quorum: set supported write flags

2019-01-14 Thread Anton Nefedov

Signed-off-by: Anton Nefedov 
Reviewed-by: Alberto Garcia 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/quorum.c | 19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/block/quorum.c b/block/quorum.c
index 16b3c8067c..d21a6a3b8e 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -857,6 +857,19 @@ static QemuOptsList quorum_runtime_opts = {
 },
 };
 
+static void quorum_set_supported_flags(BlockDriverState *bs)
+{
+BDRVQuorumState *s = bs->opaque;
+int i;
+
+bs->supported_write_flags = BDRV_REQ_FUA;
+for (i = 0; i < s->num_children; i++) {
+bs->supported_write_flags &= s->children[i]->bs->supported_write_flags;
+}
+
+bs->supported_write_flags |= BDRV_REQ_WRITE_UNCHANGED;
+}
+
 static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
Error **errp)
 {
@@ -950,7 +963,7 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
 }
 s->next_child_index = s->num_children;
 
-bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED;
+quorum_set_supported_flags(bs);
 
 g_free(opened);
 goto exit;
@@ -1025,6 +1038,8 @@ static void quorum_add_child(BlockDriverState *bs, BlockDriverState *child_bs,
 s->children = g_renew(BdrvChild *, s->children, s->num_children + 1);
 s->children[s->num_children++] = child;
 
+quorum_set_supported_flags(bs);
+
 out:
 bdrv_drained_end(bs);
 }
@@ -1063,6 +1078,8 @@ static void quorum_del_child(BlockDriverState *bs, BdrvChild *child,
 bdrv_unref_child(bs, child);
 
 bdrv_drained_end(bs);
+
+quorum_set_supported_flags(bs);
 }
 
 static void quorum_refresh_filename(BlockDriverState *bs, QDict *options)
-- 
2.17.1
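
The flag computation in quorum_set_supported_flags() above boils down to a bitwise AND across all children. A minimal standalone sketch, assuming illustrative flag values and a hypothetical helper name `combined_write_flags()` (neither is taken from QEMU headers):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag bits; the values are assumptions for this sketch. */
enum {
    REQ_FUA             = 0x10,
    REQ_WRITE_UNCHANGED = 0x40,
};

/*
 * FUA is advertised only if every child supports it (bitwise AND across
 * the children); WRITE_UNCHANGED is always advertised.
 */
unsigned int combined_write_flags(const unsigned int *child_flags, size_t n)
{
    unsigned int flags = REQ_FUA;

    for (size_t i = 0; i < n; i++) {
        flags &= child_flags[i];
    }
    return flags | REQ_WRITE_UNCHANGED;
}
```

Since the result depends on the set of children, it has to be recomputed whenever a child is added or removed, which is what the two extra call sites in the patch do.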

[Qemu-devel] [PATCH v12 00/10] qcow2: cluster space preallocation

2019-01-14 Thread Anton Nefedov

new in v12:
   - patch 9: pre-write overlap check added

v11: http://lists.nongnu.org/archive/html/qemu-devel/2018-12/msg04342.html
- patch 4, 9: fixed commentary format
- patch 4: removed one hunk with a dead check
- patch 5: added commentary to BDRV_REQ_ALLOCATE definition
- new auxiliary patch 6 for the following patch-7 change:
- patch 7: reset BDRV_REQ_ALLOCATE from supported flag if CONFIG_FALLOCATE
   is false
- patch 9: add commentary about missing qcow2_pre_write_overlap_check().
   Omit redundant changes in the test 060.

v10: http://lists.nongnu.org/archive/html/qemu-devel/2018-12/msg00121.html
- patches 1-3,6,7: rebase after REQ_WRITE_UNCHANGED
- patch 3: drop supported_zero_flags. My bad, no write_zeroes in quorum.
- patch 4: almost trivial rebase. RB-tags not stripped.
   Choose another constant for BDRV_REQ_ALLOCATE
- patch 5: rebase. Instead of marking REQ_ALLOCATE serialising, accompany
   it with REQ_SERIALISING.
- patch 7: add symmetric copy-on-read change
- patch 8: trivial rebase. RB-tags not stripped.



This patch series starts to improve a few performance points of the
qcow2 format:

  1. non cluster-aligned write requests (to unallocated clusters) explicitly
     pad data with zeroes if there is no backing data.
     The resulting increase in the number of operations and the potential
     cluster fragmentation (on the host file) is already solved by:
       ee22a9d qcow2: Merge the writing of the COW regions with the guest data
     However, in the case of zero COW regions, the padding can be avoided
     altogether: the whole clusters are instead preallocated and zeroed in
     a single efficient write_zeroes() operation.

  2. moreover, an efficient write_zeroes() operation can be used to preallocate
     space megabytes (*configurable number) ahead, which gives a noticeable
     improvement on some storage types (e.g. distributed storage,
     where the space allocation operation might be expensive).
     (Not included in this patchset since v6).

  3. this will also make it possible to enable simultaneous writes to the same
     unallocated cluster after the space has been allocated & zeroed but before
     the first data is written and the cluster is linked to L2.
     (Not included in this patchset).

Efficient write_zeroes usually implies that the blocks are not actually
written to but only reserved and marked as zeroed by the storage.
In this patchset, the file-posix driver is marked as supporting this operation
if it supports (or is configured to support) the fallocate() operation.

The existing bdrv_write_zeroes() falls back to writing zero buffers if
write_zeroes is not supported by the driver.
A new flag (BDRV_REQ_ALLOCATE) is introduced to avoid that fallback and
return -ENOTSUP instead.
Such allocate requests may also overlap with other requests; in that case
no wait is performed and an error (-EAGAIN) is returned as well.
So the operation should be considered advisory, and the fallback scenario must
still be handled by the caller (in this case, the qcow2 driver).
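
The advisory contract described above can be sketched from the caller's side. This is a minimal standalone model, not QEMU code: `struct fake_file`, `try_alloc_zeroes()`, `write_zero_buffers()` and `zero_cow_area()` are hypothetical stand-ins for the host file, the block-layer call with BDRV_REQ_ALLOCATE set, and the caller's slow path.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a host file; not a QEMU type. */
struct fake_file {
    bool supports_allocate;   /* driver advertises efficient allocation */
    bool has_overlapping_req; /* a conflicting request is in flight */
    int slow_writes;          /* counts fallback zero-buffer writes */
};

/*
 * Models an advisory allocate request: it never falls back to a slow
 * path and never waits; it reports why it could not proceed instead.
 */
int try_alloc_zeroes(struct fake_file *f)
{
    if (!f->supports_allocate) {
        return -ENOTSUP;
    }
    if (f->has_overlapping_req) {
        return -EAGAIN;
    }
    return 0;
}

/* Models the caller's slow path: writing explicit zero buffers. */
int write_zero_buffers(struct fake_file *f)
{
    f->slow_writes++;
    return 0;
}

/* The caller treats both errors as "not done, use the fallback". */
int zero_cow_area(struct fake_file *f)
{
    int ret = try_alloc_zeroes(f);

    if (ret == -ENOTSUP || ret == -EAGAIN) {
        return write_zero_buffers(f);
    }
    return ret;
}
```

In this model any other error would be passed on to the caller unchanged; only the two "could not do it efficiently" cases trigger the fallback.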

simple perf test:

  qemu-img create -f qcow2 test.img 4G && \
  qemu-img bench -c $((1024*1024)) -f qcow2 -n -s 4k -t none -w test.img

test results (seconds):

+------+----------+----------+------+
| file |  before  |  after   | gain |
+------+----------+----------+------+
| ssd  |  61.153  |  36.313  |  41% |
| hdd  | 112.676  | 122.056  |  -8% |
+------+----------+----------+------+

Anton Nefedov (10):
  mirror: inherit supported write/zero flags
  blkverify: set supported write/zero flags
  quorum: set supported write flags
  block: introduce BDRV_REQ_ALLOCATE flag
  block: treat BDRV_REQ_ALLOCATE as serialising
  file-posix: reset fallocate-related flags without CONFIG_FALLOCATE*
  file-posix: support BDRV_REQ_ALLOCATE
  block: support BDRV_REQ_ALLOCATE in passthrough drivers
  qcow2: skip writing zero buffers to empty COW areas
  iotest 134: test cluster-misaligned encrypted write

 qapi/block-core.json       |  4 +-
 block/qcow2.h              |  6 +++
 include/block/block.h      | 13 +-
 include/block/block_int.h  |  3 +-
 block/blkdebug.c           |  2 +-
 block/blkverify.c          | 10 -
 block/copy-on-read.c       |  4 +-
 block/file-posix.c         | 16 +--
 block/io.c                 | 45 +++
 block/mirror.c             |  8 +++-
 block/qcow2-cluster.c      |  2 +-
 block/qcow2.c              | 91 +-
 block/quorum.c             | 19 +++-
 block/raw-format.c         |  2 +-
 block/trace-events         |  1 +
 tests/qemu-iotests/060     |  7 ++-
 tests/qemu-iotests/060.out |  5 ++-
 tests/qemu-iotests/134     |  9
 tests/qemu-iotests/134.out | 10 +
 19 files changed, 229 insertions(+), 28 deletions(-)

-- 
2.17.1

[Qemu-devel] [PATCH v12 07/10] file-posix: support BDRV_REQ_ALLOCATE

2019-01-14 Thread Anton Nefedov

The current write_zeroes implementation is good enough to satisfy this flag too.

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Alberto Garcia 
---
 block/file-posix.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 8d3ec96627..dac218ec7f 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -607,6 +607,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
 } else {
 s->discard_zeroes = true;
 s->has_fallocate = true;
+bs->supported_zero_flags = BDRV_REQ_ALLOCATE;
 }
 } else {
 if (!(S_ISCHR(st.st_mode) || S_ISBLK(st.st_mode))) {
@@ -650,10 +651,11 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
 #ifdef CONFIG_XFS
 if (platform_test_xfs_fd(s->fd)) {
 s->is_xfs = true;
+bs->supported_zero_flags = BDRV_REQ_ALLOCATE;
 }
 #endif
 
-bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
+bs->supported_zero_flags |= BDRV_REQ_MAY_UNMAP;
 ret = 0;
 fail:
 if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
@@ -1552,6 +1554,10 @@ static int handle_aiocb_write_zeroes(void *opaque)
 s->has_fallocate = false;
 #endif
 
+if (!s->has_fallocate) {
+aiocb->bs->supported_zero_flags &= ~BDRV_REQ_ALLOCATE;
+}
+
 return -ENOTSUP;
 }
 
-- 
2.17.1
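
The last hunk above withdraws the advertised flag once the host turns out not to support the allocation call. A reduced sketch of that runtime downgrade, assuming a hypothetical `struct fake_raw_state` and a helper `fake_write_zeroes()` whose second argument stands in for the result of the host call:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Flag values used for illustration only. */
enum { REQ_MAY_UNMAP = 0x4, REQ_ALLOCATE = 0x100 };

/* Hypothetical, reduced driver state (not BDRVRawState). */
struct fake_raw_state {
    bool has_fallocate;
    unsigned int supported_zero_flags;
};

/*
 * @fallocate_ret models the result of the host allocation call:
 * 0 on success, -ENOTSUP if the filesystem cannot do it.
 */
int fake_write_zeroes(struct fake_raw_state *s, int fallocate_ret)
{
    if (s->has_fallocate) {
        if (fallocate_ret != -ENOTSUP) {
            /* Success or a real error: report it as is. */
            return fallocate_ret;
        }
        /* Remember that the host cannot do this, do not retry later. */
        s->has_fallocate = false;
    }

    if (!s->has_fallocate) {
        /* Stop advertising the flag so that callers skip the attempt. */
        s->supported_zero_flags &= ~(unsigned int)REQ_ALLOCATE;
    }

    return -ENOTSUP;
}
```

After the first failed attempt, a caller that checks the supported flags before issuing an allocate request no longer pays for a call that is known to fail.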

[Qemu-devel] [PATCH v12 06/10] file-posix: reset fallocate-related flags without CONFIG_FALLOCATE*

2019-01-14 Thread Anton Nefedov

These flags currently affect nothing without CONFIG_FALLOCATE*, so this is
not a bug, but resetting them makes it possible to adjust the supported zero
flag BDRV_REQ_ALLOCATE regardless of the configuration.

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Alberto Garcia 
---
 block/file-posix.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 8aee7a3fb8..8d3ec96627 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1488,9 +1488,7 @@ static ssize_t handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
 static int handle_aiocb_write_zeroes(void *opaque)
 {
 RawPosixAIOData *aiocb = opaque;
-#if defined(CONFIG_FALLOCATE) || defined(CONFIG_XFS)
 BDRVRawState *s = aiocb->bs->opaque;
-#endif
 #ifdef CONFIG_FALLOCATE
 int64_t len;
 #endif
@@ -1514,6 +1512,8 @@ static int handle_aiocb_write_zeroes(void *opaque)
 }
 s->has_write_zeroes = false;
 }
+#else
+s->has_write_zeroes = false;
 #endif
 
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
@@ -1533,6 +1533,8 @@ static int handle_aiocb_write_zeroes(void *opaque)
 s->has_discard = false;
 }
 }
+#else
+s->has_discard = false;
 #endif
 
 #ifdef CONFIG_FALLOCATE
@@ -1546,6 +1548,8 @@ static int handle_aiocb_write_zeroes(void *opaque)
 }
 s->has_fallocate = false;
 }
+#else
+s->has_fallocate = false;
 #endif
 
 return -ENOTSUP;
-- 
2.17.1

[Qemu-devel] [PATCH v12 10/10] iotest 134: test cluster-misaligned encrypted write

2019-01-14 Thread Anton Nefedov

COW (even empty/zero) areas require encryption too

Signed-off-by: Anton Nefedov 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Reviewed-by: Alberto Garcia 
---
 tests/qemu-iotests/134 |  9 +
 tests/qemu-iotests/134.out | 10 ++
 2 files changed, 19 insertions(+)

diff --git a/tests/qemu-iotests/134 b/tests/qemu-iotests/134
index cacabcd28b..792c8ca12f 100755
--- a/tests/qemu-iotests/134
+++ b/tests/qemu-iotests/134
@@ -57,6 +57,15 @@ echo
 echo "== reading whole image =="
 $QEMU_IO --object $SECRET -c "read 0 $size" --image-opts $IMGSPEC | _filter_qemu_io | _filter_testdir
 
+echo
+echo "== rewriting cluster part =="
+$QEMU_IO --object $SECRET -c "write -P 0xb 512 512" --image-opts $IMGSPEC | _filter_qemu_io | _filter_testdir
+
+echo
+echo "== verify pattern =="
+$QEMU_IO --object $SECRET -c "read -P 0 0 512"  --image-opts $IMGSPEC | _filter_qemu_io | _filter_testdir
+$QEMU_IO --object $SECRET -c "read -P 0xb 512 512"  --image-opts $IMGSPEC | _filter_qemu_io | _filter_testdir
+
 echo
 echo "== rewriting whole image =="
 $QEMU_IO --object $SECRET -c "write -P 0xa 0 $size" --image-opts $IMGSPEC | _filter_qemu_io | _filter_testdir
diff --git a/tests/qemu-iotests/134.out b/tests/qemu-iotests/134.out
index 972be49d91..09d46f6b17 100644
--- a/tests/qemu-iotests/134.out
+++ b/tests/qemu-iotests/134.out
@@ -5,6 +5,16 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728 encryption=on encrypt.
 read 134217728/134217728 bytes at offset 0
 128 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
+== rewriting cluster part ==
+wrote 512/512 bytes at offset 512
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+== verify pattern ==
+read 512/512 bytes at offset 0
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 512
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
 == rewriting whole image ==
 wrote 134217728/134217728 bytes at offset 0
 128 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-- 
2.17.1

[Qemu-devel] [PATCH v12 04/10] block: introduce BDRV_REQ_ALLOCATE flag

2019-01-14 Thread Anton Nefedov

The flag indicates that the region of the disk image has to be
sufficiently allocated so that it reads as zeroes.

A call with the flag set must return -ENOTSUP if the allocation cannot
be done efficiently.
This has to be ensured by both
  - the drivers that support the flag
  - and the common block layer (so it will not fall back to any slow path
    (like writing zero buffers) in case the driver does not support
    the flag).

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Alberto Garcia 
---
 include/block/block.h     | 10 +-
 include/block/block_int.h |  3 ++-
 block/io.c                | 14 +-
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index f70a843b72..643d32f4b8 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -83,8 +83,16 @@ typedef enum {
  */
 BDRV_REQ_SERIALISING= 0x80,
 
+/*
+ * The BDRV_REQ_ALLOCATE flag is used to indicate that the driver has to
+ * efficiently allocate the space so it reads as zeroes, or return an error.
+ * If this flag is set then BDRV_REQ_ZERO_WRITE must also be set.
+ * This flag cannot be set together with BDRV_REQ_MAY_UNMAP.
+ */
+BDRV_REQ_ALLOCATE   = 0x100,
+
 /* Mask of valid flags */
-BDRV_REQ_MASK   = 0xff,
+BDRV_REQ_MASK   = 0x1ff,
 } BdrvRequestFlags;
 
 typedef struct BlockSizes {
diff --git a/include/block/block_int.h b/include/block/block_int.h
index f605622216..833129d912 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -724,7 +724,8 @@ struct BlockDriverState {
  * their children. */
 unsigned int supported_write_flags;
 /* Flags honored during pwrite_zeroes (so far: BDRV_REQ_FUA,
- * BDRV_REQ_MAY_UNMAP, BDRV_REQ_WRITE_UNCHANGED) */
+ * BDRV_REQ_MAY_UNMAP, BDRV_REQ_WRITE_UNCHANGED, BDRV_REQ_ALLOCATE)
+ */
 unsigned int supported_zero_flags;
 
 /* the following member gives a name to every node on the bs graph. */
diff --git a/block/io.c b/block/io.c
index bd9d688f8b..66006a089d 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1534,7 +1534,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
 assert(!bs->supported_zero_flags);
 }
 
-if (ret == -ENOTSUP) {
+if (ret == -ENOTSUP && !(flags & BDRV_REQ_ALLOCATE)) {
 /* Fall back to bounce buffer if write zeroes is unsupported */
 BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
 
@@ -1773,6 +1773,9 @@ static int coroutine_fn bdrv_co_do_zero_pwritev(BdrvChild *child,
 
 assert(flags & BDRV_REQ_ZERO_WRITE);
 if (head_padding_bytes || tail_padding_bytes) {
+if (flags & BDRV_REQ_ALLOCATE) {
+return -ENOTSUP;
+}
 buf = qemu_blockalign(bs, align);
 iov = (struct iovec) {
 .iov_base   = buf,
@@ -1858,6 +1861,9 @@ int coroutine_fn bdrv_co_pwritev(BdrvChild *child,
 bool use_local_qiov = false;
 int ret;
 
+assert(!((flags & BDRV_REQ_ALLOCATE) && (flags & BDRV_REQ_MAY_UNMAP)));
+assert(!((flags & BDRV_REQ_ALLOCATE) && !(flags & BDRV_REQ_ZERO_WRITE)));
+
 trace_bdrv_co_pwritev(child->bs, offset, bytes, flags);
 
 if (!bs->drv) {
@@ -1980,6 +1986,12 @@ int coroutine_fn bdrv_co_pwrite_zeroes(BdrvChild *child, int64_t offset,
 {
 trace_bdrv_co_pwrite_zeroes(child->bs, offset, bytes, flags);
 
+if ((flags & BDRV_REQ_ALLOCATE) &&
+!(child->bs->supported_zero_flags & BDRV_REQ_ALLOCATE))
+{
+return -ENOTSUP;
+}
+
 if (!(child->bs->open_flags & BDRV_O_UNMAP)) {
 flags &= ~BDRV_REQ_MAY_UNMAP;
 }
-- 
2.17.1
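
The rules that this patch attaches to the flag can be condensed into two small checks. The sketch below is standalone and only illustrative: `allocate_flags_valid()` and `zero_write_gate()` are hypothetical helpers that mirror the asserts and the early -ENOTSUP return in the diff, and the values of the two older flags are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum {
    REQ_ZERO_WRITE = 0x2,   /* assumed value, not shown in this patch */
    REQ_MAY_UNMAP  = 0x4,   /* assumed value, not shown in this patch */
    REQ_ALLOCATE   = 0x100, /* value introduced by this patch */
};

/* ALLOCATE requires ZERO_WRITE and excludes MAY_UNMAP. */
bool allocate_flags_valid(int flags)
{
    if (!(flags & REQ_ALLOCATE)) {
        return true;
    }
    return (flags & REQ_ZERO_WRITE) && !(flags & REQ_MAY_UNMAP);
}

/*
 * Early gate for a zero-write request: refuse an allocate request right
 * away if the node does not advertise support, so no slow path is taken.
 */
int zero_write_gate(int flags, unsigned int supported_zero_flags)
{
    if ((flags & REQ_ALLOCATE) && !(supported_zero_flags & REQ_ALLOCATE)) {
        return -ENOTSUP;
    }
    return 0;
}
```

Checking support before doing any work keeps the failure cheap, which matters because callers are expected to hit this path routinely on storage without efficient allocation.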

[Qemu-devel] [PATCH v12 08/10] block: support BDRV_REQ_ALLOCATE in passthrough drivers

2019-01-14 Thread Anton Nefedov

Support the flag if the underlying BDS supports it

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Alberto Garcia 
---
 block/blkdebug.c | 2 +-
 block/blkverify.c| 2 +-
 block/copy-on-read.c | 4 ++--
 block/mirror.c   | 2 +-
 block/raw-format.c   | 2 +-
 5 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/blkdebug.c b/block/blkdebug.c
index 0759452925..f0fc2ec276 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -401,7 +401,7 @@ static int blkdebug_open(BlockDriverState *bs, QDict *options, int flags,
 bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
 (BDRV_REQ_FUA & bs->file->bs->supported_write_flags);
 bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
-((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
+((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE) &
 bs->file->bs->supported_zero_flags);
 ret = -EINVAL;
 
diff --git a/block/blkverify.c b/block/blkverify.c
index bb52596cbb..9cb4f94b68 100644
--- a/block/blkverify.c
+++ b/block/blkverify.c
@@ -146,7 +146,7 @@ static int blkverify_open(BlockDriverState *bs, QDict *options, int flags,
  bs->file->bs->supported_write_flags &
  s->test_file->bs->supported_write_flags);
 bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
-((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
+((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE) &
  bs->file->bs->supported_zero_flags &
  s->test_file->bs->supported_zero_flags);
 
diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index 64dcc424b5..1eb993699a 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -38,8 +38,8 @@ static int cor_open(BlockDriverState *bs, QDict *options, int flags,
 bs->file->bs->supported_write_flags);
 
 bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
-   ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
-bs->file->bs->supported_zero_flags);
+((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE) &
+ bs->file->bs->supported_zero_flags);
 
 return 0;
 }
diff --git a/block/mirror.c b/block/mirror.c
index 7b5a5f13a2..057516acb9 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1532,7 +1532,7 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
 mirror_top_bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
 (BDRV_REQ_FUA & bs->supported_write_flags);
 mirror_top_bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
-((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP)
+((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE)
  & bs->supported_zero_flags);
 
 bs_opaque = g_new0(MirrorBDSOpaque, 1);
diff --git a/block/raw-format.c b/block/raw-format.c
index 6f6dc99b2c..ad7453dc83 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -432,7 +432,7 @@ static int raw_open(BlockDriverState *bs, QDict *options, int flags,
 bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
 (BDRV_REQ_FUA & bs->file->bs->supported_write_flags);
 bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
-((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
+((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE) &
 bs->file->bs->supported_zero_flags);
 
 if (bs->probed && !bdrv_is_read_only(bs)) {
-- 
2.17.1

[Qemu-devel] [PATCH v12 09/10] qcow2: skip writing zero buffers to empty COW areas

2019-01-14 Thread Anton Nefedov

If COW areas of the newly allocated clusters are zeroes on the backing image,
efficient bdrv_write_zeroes(flags=BDRV_REQ_ALLOCATE) can be used on the whole
cluster instead of writing explicit zero buffers later in perform_cow().

iotest 060:
write to the discarded cluster does not trigger COW anymore.
Use a backing image instead.

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 qapi/block-core.json       |  4 +-
 block/qcow2.h              |  6 +++
 block/qcow2-cluster.c      |  2 +-
 block/qcow2.c              | 91 +-
 block/trace-events         |  1 +
 tests/qemu-iotests/060     |  7 ++-
 tests/qemu-iotests/060.out |  5 ++-
 7 files changed, 110 insertions(+), 6 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 762000f31f..204528b3f6 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -3009,6 +3009,8 @@
 #
 # @cor_write: a write due to copy-on-read (since 2.11)
 #
+# @cluster_alloc_space: an allocation of file space for a cluster (since 4.0)
+#
 # Since: 2.9
 ##
 { 'enum': 'BlkdebugEvent', 'prefix': 'BLKDBG',
@@ -3027,7 +3029,7 @@
 'pwritev_rmw_tail', 'pwritev_rmw_after_tail', 'pwritev',
 'pwritev_zero', 'pwritev_done', 'empty_image_prepare',
 'l1_shrink_write_table', 'l1_shrink_free_l2_clusters',
-'cor_write'] }
+'cor_write', 'cluster_alloc_space'] }
 
 ##
 # @BlkdebugInjectErrorOptions:
diff --git a/block/qcow2.h b/block/qcow2.h
index 438a1dee9e..dad4b1c7ca 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -389,6 +389,12 @@ typedef struct QCowL2Meta
  */
 Qcow2COWRegion cow_end;
 
+/*
+ * Indicates that COW regions are already handled and do not require
+ * any more processing.
+ */
+bool skip_cow;
+
 /**
  * The I/O vector with the data from the actual guest write request.
  * If non-NULL, this is meant to be merged together with the data
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 30eca26c47..e5f936a82c 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -806,7 +806,7 @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
 assert(start->offset + start->nb_bytes <= end->offset);
 assert(!m->data_qiov || m->data_qiov->size == data_bytes);
 
-if (start->nb_bytes == 0 && end->nb_bytes == 0) {
+if ((start->nb_bytes == 0 && end->nb_bytes == 0) || m->skip_cow) {
 return 0;
 }
 
diff --git a/block/qcow2.c b/block/qcow2.c
index 4897abae5e..05a7cbebbd 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2021,6 +2021,11 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
 continue;
 }
 
+/* If COW regions are handled already, skip this too */
+if (m->skip_cow) {
+continue;
+}
+
 /* The data (middle) region must be immediately after the
  * start region */
 if (l2meta_cow_start(m) + m->cow_start.nb_bytes != offset) {
@@ -2046,6 +2051,79 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
 return false;
 }
 
+static bool is_unallocated(BlockDriverState *bs, int64_t offset, int64_t bytes)
+{
+int64_t nr;
+return !bytes ||
+(!bdrv_is_allocated_above(bs, NULL, offset, bytes, &nr) && nr == bytes);
+}
+
+static bool is_zero_cow(BlockDriverState *bs, QCowL2Meta *m)
+{
+/*
+ * This check is designed for optimization shortcut so it must be
+ * efficient.
+ * Instead of is_zero(), use is_unallocated() as it is faster (but not
+ * as accurate and can result in false negatives).
+ */
+return is_unallocated(bs, m->offset + m->cow_start.offset,
+  m->cow_start.nb_bytes) &&
+   is_unallocated(bs, m->offset + m->cow_end.offset,
+  m->cow_end.nb_bytes);
+}
+
+static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
+{
+BDRVQcow2State *s = bs->opaque;
+QCowL2Meta *m;
+
+if (!(bs->file->bs->supported_zero_flags & BDRV_REQ_ALLOCATE)) {
+return 0;
+}
+
+if (bs->encrypted) {
+return 0;
+}
+
+for (m = l2meta; m != NULL; m = m->next) {
+int ret;
+
+if (!m->cow_start.nb_bytes && !m->cow_end.nb_bytes) {
+continue;
+}
+
+if (!is_zero_cow(bs, m)) {
+continue;
+}
+
+/*
+ * instead of writing zero COW buffers,
+ * efficiently zero out the whole clusters
+ */
+
+ret = qcow2_pre_write_overlap_check(bs, 0, m->alloc_offset,
+m->nb_clusters * s->cluster_size);
+if (ret < 0) {
+return ret;
+}
+
+BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_SPACE);
+ret = bdrv_co_pwrite_zeroes(bs->fi

[Qemu-devel] [PATCH v12 02/10] blkverify: set supported write/zero flags

2019-01-14 Thread Anton Nefedov

Signed-off-by: Anton Nefedov 
Reviewed-by: Alberto Garcia 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/blkverify.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/block/blkverify.c b/block/blkverify.c
index 89bf4386e3..bb52596cbb 100644
--- a/block/blkverify.c
+++ b/block/blkverify.c
@@ -141,8 +141,14 @@ static int blkverify_open(BlockDriverState *bs, QDict *options, int flags,
 goto fail;
 }
 
-bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED;
-bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED;
+bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
+(BDRV_REQ_FUA &
+ bs->file->bs->supported_write_flags &
+ s->test_file->bs->supported_write_flags);
+bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
+((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
+ bs->file->bs->supported_zero_flags &
+ s->test_file->bs->supported_zero_flags);
 
 ret = 0;
 fail:
-- 
2.17.1

[Qemu-devel] [PATCH v12 01/10] mirror: inherit supported write/zero flags

2019-01-14 Thread Anton Nefedov

Signed-off-by: Anton Nefedov 
Reviewed-by: Alberto Garcia 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/mirror.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index f0b211a9c8..7b5a5f13a2 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1529,8 +1529,12 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
 mirror_top_bs->implicit = true;
 }
 mirror_top_bs->total_sectors = bs->total_sectors;
-mirror_top_bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED;
-mirror_top_bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED;
+mirror_top_bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
+(BDRV_REQ_FUA & bs->supported_write_flags);
+mirror_top_bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
+((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP)
+ & bs->supported_zero_flags);
+
 bs_opaque = g_new0(MirrorBDSOpaque, 1);
 mirror_top_bs->opaque = bs_opaque;
 bdrv_set_aio_context(mirror_top_bs, bdrv_get_aio_context(bs));
-- 
2.17.1

Re: [Qemu-devel] [PATCH v11 09/10] qcow2: skip writing zero buffers to empty COW areas

2018-12-24 Thread Anton Nefedov

On 21/12/2018 7:16 PM, Vladimir Sementsov-Ogievskiy wrote:
> 18.12.2018 10:57, Anton Nefedov wrote:
>> If COW areas of the newly allocated clusters are zeroes on the backing image,
>> efficient bdrv_write_zeroes(flags=BDRV_REQ_ALLOCATE) can be used on the whole
>> cluster instead of writing explicit zero buffers later in perform_cow().
>>
>> iotest 060:
>> write to the discarded cluster does not trigger COW anymore.
>> Use a backing image instead.
>>
>> Signed-off-by: Anton Nefedov 
>> ---
>>qapi/block-core.json   |  4 +-
>>block/qcow2.h  |  6 +++
>>block/qcow2-cluster.c  |  2 +-
>>block/qcow2.c  | 89 +-
>>block/trace-events |  1 +
>>tests/qemu-iotests/060 |  7 ++-
>>tests/qemu-iotests/060.out |  5 ++-
>>7 files changed, 108 insertions(+), 6 deletions(-)
>>
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index 762000f31f..204528b3f6 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -3009,6 +3009,8 @@
>>#
>># @cor_write: a write due to copy-on-read (since 2.11)
>>#
>> +# @cluster_alloc_space: an allocation of file space for a cluster (since 
>> 4.0)
>> +#
>># Since: 2.9
>>##
>>{ 'enum': 'BlkdebugEvent', 'prefix': 'BLKDBG',
>> @@ -3027,7 +3029,7 @@
>>'pwritev_rmw_tail', 'pwritev_rmw_after_tail', 'pwritev',
>>'pwritev_zero', 'pwritev_done', 'empty_image_prepare',
>>'l1_shrink_write_table', 'l1_shrink_free_l2_clusters',
>> -'cor_write'] }
>> +'cor_write', 'cluster_alloc_space'] }
>>
>>##
>># @BlkdebugInjectErrorOptions:
>> diff --git a/block/qcow2.h b/block/qcow2.h
>> index a98d24500b..32d2c04bfa 100644
>> --- a/block/qcow2.h
>> +++ b/block/qcow2.h
>> @@ -386,6 +386,12 @@ typedef struct QCowL2Meta
>> */
>>Qcow2COWRegion cow_end;
>>
>> +/*
>> + * Indicates that COW regions are already handled and do not require
>> + * any more processing.
>> + */
>> +bool skip_cow;
>> +
>>/**
> 
> hmm, around it, all comments starts from '/**', so, I think, yours should too.
> (this note doesn't touch your other comments)
> 

that triggers a warning in checkpatch.pl

   WARNING: Block comments use a leading /* on a separate line

>> * The I/O vector with the data from the actual guest write request.
>> * If non-NULL, this is meant to be merged together with the data
>> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
>> index e2737429f5..23e0702027 100644
>> --- a/block/qcow2-cluster.c
>> +++ b/block/qcow2-cluster.c
>> @@ -806,7 +806,7 @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta 
>> *m)
>>assert(start->offset + start->nb_bytes <= end->offset);
>>assert(!m->data_qiov || m->data_qiov->size == data_bytes);
>>
>> -if (start->nb_bytes == 0 && end->nb_bytes == 0) {
>> +if ((start->nb_bytes == 0 && end->nb_bytes == 0) || m->skip_cow) {
>>return 0;
>>}
>>
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index 4897abae5e..161b935962 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -2021,6 +2021,11 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
>>continue;
>>}
>>
>> +/* If COW regions are handled already, skip this too */
>> +if (m->skip_cow) {
>> +continue;
>> +}
>> +
>>/* The data (middle) region must be immediately after the
>> * start region */
>>if (l2meta_cow_start(m) + m->cow_start.nb_bytes != offset) {
>> @@ -2046,6 +2051,77 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
>>return false;
>>}
>>
>> +static bool is_unallocated(BlockDriverState *bs, int64_t offset, int64_t 
>> bytes)
>> +{
>> +int64_t nr;
>> +return !bytes ||
>> +(!bdrv_is_allocated_above(bs, NULL, offset, bytes, ) && nr == 
>> bytes);
>> +}
>> +
>> +static bool is_zero_cow(BlockDriverState *bs, QCowL2Meta *m)
>> +{
>> +/*
>> + * This check is designed for optimization shortcut so it must be
>> + * efficient.
>> + * Instead of is_zero(), use is_unallocated() as it is faster (b

[Qemu-devel] [PATCH v11 09/10] qcow2: skip writing zero buffers to empty COW areas

2018-12-18 Thread Anton Nefedov

If COW areas of the newly allocated clusters are zeroes on the backing image,
efficient bdrv_write_zeroes(flags=BDRV_REQ_ALLOCATE) can be used on the whole
cluster instead of writing explicit zero buffers later in perform_cow().

iotest 060:
write to the discarded cluster does not trigger COW anymore.
Use a backing image instead.

Signed-off-by: Anton Nefedov 
---
 qapi/block-core.json       |  4 +-
 block/qcow2.h              |  6 +++
 block/qcow2-cluster.c      |  2 +-
 block/qcow2.c              | 89 +-
 block/trace-events         |  1 +
 tests/qemu-iotests/060     |  7 ++-
 tests/qemu-iotests/060.out |  5 ++-
 7 files changed, 108 insertions(+), 6 deletions(-)

diff --git a/qapi/block-core.json b/qapi/block-core.json
index 762000f31f..204528b3f6 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -3009,6 +3009,8 @@
 #
 # @cor_write: a write due to copy-on-read (since 2.11)
 #
+# @cluster_alloc_space: an allocation of file space for a cluster (since 4.0)
+#
 # Since: 2.9
 ##
 { 'enum': 'BlkdebugEvent', 'prefix': 'BLKDBG',
@@ -3027,7 +3029,7 @@
 'pwritev_rmw_tail', 'pwritev_rmw_after_tail', 'pwritev',
 'pwritev_zero', 'pwritev_done', 'empty_image_prepare',
 'l1_shrink_write_table', 'l1_shrink_free_l2_clusters',
-'cor_write'] }
+'cor_write', 'cluster_alloc_space'] }
 
 ##
 # @BlkdebugInjectErrorOptions:
diff --git a/block/qcow2.h b/block/qcow2.h
index a98d24500b..32d2c04bfa 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -386,6 +386,12 @@ typedef struct QCowL2Meta
  */
 Qcow2COWRegion cow_end;
 
+/*
+ * Indicates that COW regions are already handled and do not require
+ * any more processing.
+ */
+bool skip_cow;
+
 /**
  * The I/O vector with the data from the actual guest write request.
  * If non-NULL, this is meant to be merged together with the data
diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index e2737429f5..23e0702027 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -806,7 +806,7 @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta *m)
 assert(start->offset + start->nb_bytes <= end->offset);
 assert(!m->data_qiov || m->data_qiov->size == data_bytes);
 
-if (start->nb_bytes == 0 && end->nb_bytes == 0) {
+if ((start->nb_bytes == 0 && end->nb_bytes == 0) || m->skip_cow) {
 return 0;
 }
 
diff --git a/block/qcow2.c b/block/qcow2.c
index 4897abae5e..161b935962 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2021,6 +2021,11 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
 continue;
 }
 
+/* If COW regions are handled already, skip this too */
+if (m->skip_cow) {
+continue;
+}
+
 /* The data (middle) region must be immediately after the
  * start region */
 if (l2meta_cow_start(m) + m->cow_start.nb_bytes != offset) {
@@ -2046,6 +2051,77 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
 return false;
 }
 
+static bool is_unallocated(BlockDriverState *bs, int64_t offset, int64_t bytes)
+{
+int64_t nr;
+return !bytes ||
+(!bdrv_is_allocated_above(bs, NULL, offset, bytes, &nr) && nr == bytes);
+}
+
+static bool is_zero_cow(BlockDriverState *bs, QCowL2Meta *m)
+{
+/*
+ * This check is designed for optimization shortcut so it must be
+ * efficient.
+ * Instead of is_zero(), use is_unallocated() as it is faster (but not
+ * as accurate and can result in false negatives).
+ */
+return is_unallocated(bs, m->offset + m->cow_start.offset,
+  m->cow_start.nb_bytes) &&
+   is_unallocated(bs, m->offset + m->cow_end.offset,
+  m->cow_end.nb_bytes);
+}
+
+static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
+{
+BDRVQcow2State *s = bs->opaque;
+QCowL2Meta *m;
+
+if (!(bs->file->bs->supported_zero_flags & BDRV_REQ_ALLOCATE)) {
+return 0;
+}
+
+if (bs->encrypted) {
+return 0;
+}
+
+for (m = l2meta; m != NULL; m = m->next) {
+int ret;
+
+if (!m->cow_start.nb_bytes && !m->cow_end.nb_bytes) {
+continue;
+}
+
+if (!is_zero_cow(bs, m)) {
+continue;
+}
+
+/*
+ * Conventional place for qcow2_pre_write_overlap_check() but in this
+ * case it is already done for these clusters
+ */
+
+BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_SPACE);
+/*
+ * instead of writing zero COW buffers,
+ * efficiently zero out the whole clusters
+ */
+ret = bdrv_co_pwrite_zeroes(bs->file, m->alloc_offset,
+m->nb_clusters * s->cluster_size,
+

[Qemu-devel] [PATCH v11 05/10] block: treat BDRV_REQ_ALLOCATE as serialising

2018-12-18 Thread Anton Nefedov

The idea is that ALLOCATE requests may overlap with other requests.
Reuse the existing block layer infrastructure for serialising requests.
Use the following approach:
  - mark ALLOCATE also SERIALISING, so subsequent requests to the area wait
  - ALLOCATE request itself must never wait if another request is in flight
already. Return EAGAIN, let the caller reconsider.
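
The approach above can be sketched as a small stand-alone model. This is
only an illustration, not the block-layer code: demo_req,
demo_overlaps() and demo_prepare_write() are hypothetical names, and the
in-flight list is a plain array rather than the tracked-request list.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Value taken from the patch; the name is shortened for the sketch. */
enum { DEMO_REQ_ALLOCATE = 0x100 };

/* Hypothetical stand-in for a tracked in-flight request. */
struct demo_req {
    int64_t offset;
    int64_t bytes;
};

/* True if [offset, offset + bytes) intersects the request's range. */
static bool demo_overlaps(const struct demo_req *r,
                          int64_t offset, int64_t bytes)
{
    return offset < r->offset + r->bytes && r->offset < offset + bytes;
}

/*
 * Returns 0 if the write may proceed, or -EAGAIN if an ALLOCATE request
 * conflicts with an in-flight one. *would_wait reports whether an
 * ordinary request would have had to wait instead.
 */
static int demo_prepare_write(const struct demo_req *in_flight, size_t n,
                              int64_t offset, int64_t bytes,
                              unsigned flags, bool *would_wait)
{
    bool found = false;

    assert(bytes > 0);
    for (size_t i = 0; i < n; i++) {
        if (demo_overlaps(&in_flight[i], offset, bytes)) {
            found = true;
            break;
        }
    }

    /* ALLOCATE never waits: the conflict is reported to the caller. */
    *would_wait = found && !(flags & DEMO_REQ_ALLOCATE);
    if (found && (flags & DEMO_REQ_ALLOCATE)) {
        return -EAGAIN;
    }
    return 0;
}
```

With one request in flight on the first 64k, an ALLOCATE request inside
that range gets -EAGAIN, while an ordinary write to the same range would
wait and an ALLOCATE request to the next cluster proceeds.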

Signed-off-by: Anton Nefedov 
---
 include/block/block.h |  3 +++
 block/io.c| 31 ---
 2 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index 643d32f4b8..dfc0fc1b8f 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -88,6 +88,9 @@ typedef enum {
  * efficiently allocate the space so it reads as zeroes, or return an error.
  * If this flag is set then BDRV_REQ_ZERO_WRITE must also be set.
  * This flag cannot be set together with BDRV_REQ_MAY_UNMAP.
+ * This flag implicitly sets BDRV_REQ_SERIALISING meaning it is protected
+ * from conflicts with overlapping requests. If such conflict is detected,
+ * -EAGAIN is returned.
  */
 BDRV_REQ_ALLOCATE   = 0x100,
 
diff --git a/block/io.c b/block/io.c
index 66006a089d..4451714a60 100644
--- a/block/io.c
+++ b/block/io.c
@@ -720,12 +720,13 @@ void bdrv_dec_in_flight(BlockDriverState *bs)
 bdrv_wakeup(bs);
 }
 
-static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
+static bool coroutine_fn find_or_wait_serialising_requests(
+BdrvTrackedRequest *self, bool wait)
 {
 BlockDriverState *bs = self->bs;
 BdrvTrackedRequest *req;
 bool retry;
-bool waited = false;
+bool found = false;
 
 if (!atomic_read(&bs->serialising_in_flight)) {
 return false;
@@ -751,11 +752,14 @@ static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
  * will wait for us as soon as it wakes up, then just go on
  * (instead of producing a deadlock in the former case). */
 if (!req->waiting_for) {
+found = true;
+if (!wait) {
+break;
+}
 self->waiting_for = req;
 qemu_co_queue_wait(&req->wait_queue, &bs->reqs_lock);
 self->waiting_for = NULL;
 retry = true;
-waited = true;
 break;
 }
 }
@@ -763,7 +767,12 @@ static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
 qemu_co_mutex_unlock(&bs->reqs_lock);
 } while (retry);
 
-return waited;
+return found;
+}
+
+static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
+{
+return find_or_wait_serialising_requests(self, true);
 }
 
 static int bdrv_check_byte_request(BlockDriverState *bs, int64_t offset,
@@ -1585,7 +1594,7 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t offset, uint64_t bytes,
   BdrvTrackedRequest *req, int flags)
 {
 BlockDriverState *bs = child->bs;
-bool waited;
+bool found;
 int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);
 
 if (bs->read_only) {
@@ -1602,9 +1611,13 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t offset, uint64_t bytes,
 mark_request_serialising(req, bdrv_get_cluster_size(bs));
 }
 
-waited = wait_serialising_requests(req);
+found = find_or_wait_serialising_requests(req,
+  !(flags & BDRV_REQ_ALLOCATE));
+if (found && (flags & BDRV_REQ_ALLOCATE)) {
+return -EAGAIN;
+}
 
-assert(!waited || !req->serialising ||
+assert(!found || !req->serialising ||
is_request_serialising_and_aligned(req));
 assert(req->overlap_offset <= offset);
 assert(offset + bytes <= req->overlap_offset + req->overlap_bytes);
@@ -1864,6 +1877,10 @@ int coroutine_fn bdrv_co_pwritev(BdrvChild *child,
 assert(!((flags & BDRV_REQ_ALLOCATE) && (flags & BDRV_REQ_MAY_UNMAP)));
 assert(!((flags & BDRV_REQ_ALLOCATE) && !(flags & BDRV_REQ_ZERO_WRITE)));
 
+if (flags & BDRV_REQ_ALLOCATE) {
+flags |= BDRV_REQ_SERIALISING;
+}
+
 trace_bdrv_co_pwritev(child->bs, offset, bytes, flags);
 
 if (!bs->drv) {
-- 
2.17.1

[Qemu-devel] [PATCH v11 08/10] block: support BDRV_REQ_ALLOCATE in passthrough drivers

2018-12-18 Thread Anton Nefedov

Support the flag if the underlying BDS supports it
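
The pattern repeated in each hunk below can be sketched on its own. This
is a minimal illustration, not driver code: demo_filter_zero_flags() is a
hypothetical helper, and only the ALLOCATE and mask values are taken from
this series (the other flag values are assumed for the example).

```c
#include <assert.h>

/* ALLOCATE (0x100) and the mask (0x1ff) follow the series; the rest are
 * assumed values used only for this illustration. */
enum {
    DEMO_REQ_MAY_UNMAP       = 0x4,
    DEMO_REQ_FUA             = 0x10,
    DEMO_REQ_WRITE_UNCHANGED = 0x40,
    DEMO_REQ_ALLOCATE        = 0x100,
    DEMO_REQ_MASK            = 0x1ff,
};

/*
 * A passthrough driver always honours WRITE_UNCHANGED itself and
 * forwards FUA, MAY_UNMAP and ALLOCATE only if its child supports them.
 */
static unsigned demo_filter_zero_flags(unsigned child_zero_flags)
{
    assert((child_zero_flags & ~DEMO_REQ_MASK) == 0);

    return DEMO_REQ_WRITE_UNCHANGED |
           ((DEMO_REQ_FUA | DEMO_REQ_MAY_UNMAP | DEMO_REQ_ALLOCATE) &
            child_zero_flags);
}
```

A child that supports nothing yields only WRITE_UNCHANGED; a child that
supports ALLOCATE makes the filter advertise ALLOCATE as well.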

Signed-off-by: Anton Nefedov 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
Reviewed-by: Alberto Garcia 
---
 block/blkdebug.c | 2 +-
 block/blkverify.c| 2 +-
 block/copy-on-read.c | 4 ++--
 block/mirror.c   | 2 +-
 block/raw-format.c   | 2 +-
 5 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/blkdebug.c b/block/blkdebug.c
index 0759452925..f0fc2ec276 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -401,7 +401,7 @@ static int blkdebug_open(BlockDriverState *bs, QDict *options, int flags,
 bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
 (BDRV_REQ_FUA & bs->file->bs->supported_write_flags);
 bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
-((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
+((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE) &
 bs->file->bs->supported_zero_flags);
 ret = -EINVAL;
 
diff --git a/block/blkverify.c b/block/blkverify.c
index bb52596cbb..9cb4f94b68 100644
--- a/block/blkverify.c
+++ b/block/blkverify.c
@@ -146,7 +146,7 @@ static int blkverify_open(BlockDriverState *bs, QDict *options, int flags,
  bs->file->bs->supported_write_flags &
  s->test_file->bs->supported_write_flags);
 bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
-((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
+((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE) &
  bs->file->bs->supported_zero_flags &
  s->test_file->bs->supported_zero_flags);
 
diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index 64dcc424b5..1eb993699a 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -38,8 +38,8 @@ static int cor_open(BlockDriverState *bs, QDict *options, int flags,
 bs->file->bs->supported_write_flags);
 
 bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
-   ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
-bs->file->bs->supported_zero_flags);
+((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE) &
+ bs->file->bs->supported_zero_flags);
 
 return 0;
 }
diff --git a/block/mirror.c b/block/mirror.c
index be52c9be9c..eebcacac98 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1532,7 +1532,7 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
 mirror_top_bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
 (BDRV_REQ_FUA & bs->supported_write_flags);
 mirror_top_bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
-((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP)
+((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE)
  & bs->supported_zero_flags);
 
 bs_opaque = g_new0(MirrorBDSOpaque, 1);
diff --git a/block/raw-format.c b/block/raw-format.c
index 6f6dc99b2c..ad7453dc83 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -432,7 +432,7 @@ static int raw_open(BlockDriverState *bs, QDict *options, int flags,
 bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
 (BDRV_REQ_FUA & bs->file->bs->supported_write_flags);
 bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
-((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
+((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE) &
 bs->file->bs->supported_zero_flags);
 
 if (bs->probed && !bdrv_is_read_only(bs)) {
-- 
2.17.1

[Qemu-devel] [PATCH v11 10/10] iotest 134: test cluster-misaligned encrypted write

2018-12-18 Thread Anton Nefedov

COW (even empty/zero) areas require encryption too

Signed-off-by: Anton Nefedov 
Reviewed-by: Eric Blake 
Reviewed-by: Max Reitz 
Reviewed-by: Alberto Garcia 
---
 tests/qemu-iotests/134 |  9 +
 tests/qemu-iotests/134.out | 10 ++
 2 files changed, 19 insertions(+)

diff --git a/tests/qemu-iotests/134 b/tests/qemu-iotests/134
index cacabcd28b..792c8ca12f 100755
--- a/tests/qemu-iotests/134
+++ b/tests/qemu-iotests/134
@@ -57,6 +57,15 @@ echo
 echo "== reading whole image =="
 $QEMU_IO --object $SECRET -c "read 0 $size" --image-opts $IMGSPEC | _filter_qemu_io | _filter_testdir
 
+echo
+echo "== rewriting cluster part =="
+$QEMU_IO --object $SECRET -c "write -P 0xb 512 512" --image-opts $IMGSPEC | _filter_qemu_io | _filter_testdir
+
+echo
+echo "== verify pattern =="
+$QEMU_IO --object $SECRET -c "read -P 0 0 512"  --image-opts $IMGSPEC | _filter_qemu_io | _filter_testdir
+$QEMU_IO --object $SECRET -c "read -P 0xb 512 512"  --image-opts $IMGSPEC | _filter_qemu_io | _filter_testdir
+
 echo
 echo "== rewriting whole image =="
 $QEMU_IO --object $SECRET -c "write -P 0xa 0 $size" --image-opts $IMGSPEC | _filter_qemu_io | _filter_testdir
diff --git a/tests/qemu-iotests/134.out b/tests/qemu-iotests/134.out
index 972be49d91..09d46f6b17 100644
--- a/tests/qemu-iotests/134.out
+++ b/tests/qemu-iotests/134.out
@@ -5,6 +5,16 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=134217728 encryption=on encrypt.
 read 134217728/134217728 bytes at offset 0
 128 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
 
+== rewriting cluster part ==
+wrote 512/512 bytes at offset 512
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
+== verify pattern ==
+read 512/512 bytes at offset 0
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+read 512/512 bytes at offset 512
+512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+
 == rewriting whole image ==
 wrote 134217728/134217728 bytes at offset 0
 128 MiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-- 
2.17.1

[Qemu-devel] [PATCH v11 06/10] file-posix: reset fallocate-related flags without CONFIG_FALLOCATE*

2018-12-18 Thread Anton Nefedov

These flags currently have no effect without CONFIG_FALLOCATE*, so this is
not a bug fix. Resetting them makes it possible to adjust the supported zero
flag BDRV_REQ_ALLOCATE regardless of configuration (in the following patch).

Signed-off-by: Anton Nefedov 
---
 block/file-posix.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index d8f0b93752..a65e464cbc 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1488,9 +1488,7 @@ static ssize_t handle_aiocb_write_zeroes_block(RawPosixAIOData *aiocb)
 static int handle_aiocb_write_zeroes(void *opaque)
 {
 RawPosixAIOData *aiocb = opaque;
-#if defined(CONFIG_FALLOCATE) || defined(CONFIG_XFS)
 BDRVRawState *s = aiocb->bs->opaque;
-#endif
 #ifdef CONFIG_FALLOCATE
 int64_t len;
 #endif
@@ -1514,6 +1512,8 @@ static int handle_aiocb_write_zeroes(void *opaque)
 }
 s->has_write_zeroes = false;
 }
+#else
+s->has_write_zeroes = false;
 #endif
 
 #ifdef CONFIG_FALLOCATE_PUNCH_HOLE
@@ -1533,6 +1533,8 @@ static int handle_aiocb_write_zeroes(void *opaque)
 s->has_discard = false;
 }
 }
+#else
+s->has_discard = false;
 #endif
 
 #ifdef CONFIG_FALLOCATE
@@ -1546,6 +1548,8 @@ static int handle_aiocb_write_zeroes(void *opaque)
 }
 s->has_fallocate = false;
 }
+#else
+s->has_fallocate = false;
 #endif
 
 return -ENOTSUP;
-- 
2.17.1

[Qemu-devel] [PATCH v11 02/10] blkverify: set supported write/zero flags

2018-12-18 Thread Anton Nefedov

Signed-off-by: Anton Nefedov 
Reviewed-by: Alberto Garcia 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/blkverify.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/block/blkverify.c b/block/blkverify.c
index 89bf4386e3..bb52596cbb 100644
--- a/block/blkverify.c
+++ b/block/blkverify.c
@@ -141,8 +141,14 @@ static int blkverify_open(BlockDriverState *bs, QDict *options, int flags,
 goto fail;
 }
 
-bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED;
-bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED;
+bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
+(BDRV_REQ_FUA &
+ bs->file->bs->supported_write_flags &
+ s->test_file->bs->supported_write_flags);
+bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
+((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
+ bs->file->bs->supported_zero_flags &
+ s->test_file->bs->supported_zero_flags);
 
 ret = 0;
 fail:
-- 
2.17.1

[Qemu-devel] [PATCH v11 04/10] block: introduce BDRV_REQ_ALLOCATE flag

2018-12-18 Thread Anton Nefedov

The flag is supposed to indicate that the region of the disk image has
to be sufficiently allocated so it reads as zeroes.

The call with the flag set must return -ENOTSUP if allocation cannot
be done efficiently.
This has to be ensured by both
  - the drivers that support the flag
  - and the common block layer (so it will not fall back to any slowpath
(like writing zero buffers) in case the driver does not support
the flag).
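
A minimal model of this contract, assuming a toy driver: demo_driver and
demo_write_zeroes() are hypothetical and only mirror the rule described
above; the ZERO_WRITE value is assumed for the illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* ALLOCATE's value follows the patch; ZERO_WRITE's is assumed here. */
enum {
    DEMO_REQ_ZERO_WRITE = 0x2,
    DEMO_REQ_ALLOCATE   = 0x100,
};

/* Hypothetical stand-in for a block driver. */
struct demo_driver {
    unsigned supported_zero_flags;
    bool has_fast_zero;   /* driver can zero a range efficiently */
    int slow_writes;      /* counts fallbacks to writing zero buffers */
};

static int demo_write_zeroes(struct demo_driver *drv, unsigned flags)
{
    int ret;

    /* ALLOCATE is only meaningful together with ZERO_WRITE. */
    assert(!(flags & DEMO_REQ_ALLOCATE) || (flags & DEMO_REQ_ZERO_WRITE));

    /* Common layer: refuse early if the driver does not support it. */
    if ((flags & DEMO_REQ_ALLOCATE) &&
        !(drv->supported_zero_flags & DEMO_REQ_ALLOCATE)) {
        return -ENOTSUP;
    }

    ret = drv->has_fast_zero ? 0 : -ENOTSUP;

    /* The slow path is taken only when ALLOCATE is not requested. */
    if (ret == -ENOTSUP && !(flags & DEMO_REQ_ALLOCATE)) {
        drv->slow_writes++;
        ret = 0;
    }
    return ret;
}
```

A plain zero write to a driver without an efficient path succeeds through
the slow path, while the same request with ALLOCATE set fails with
-ENOTSUP and never touches the slow path.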

Signed-off-by: Anton Nefedov 
---
 include/block/block.h | 10 +-
 include/block/block_int.h |  3 ++-
 block/io.c| 14 +-
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/include/block/block.h b/include/block/block.h
index f70a843b72..643d32f4b8 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -83,8 +83,16 @@ typedef enum {
  */
 BDRV_REQ_SERIALISING= 0x80,
 
+/*
+ * The BDRV_REQ_ALLOCATE flag is used to indicate that the driver has to
+ * efficiently allocate the space so it reads as zeroes, or return an error.
+ * If this flag is set then BDRV_REQ_ZERO_WRITE must also be set.
+ * This flag cannot be set together with BDRV_REQ_MAY_UNMAP.
+ */
+BDRV_REQ_ALLOCATE   = 0x100,
+
 /* Mask of valid flags */
-BDRV_REQ_MASK   = 0xff,
+BDRV_REQ_MASK   = 0x1ff,
 } BdrvRequestFlags;
 
 typedef struct BlockSizes {
diff --git a/include/block/block_int.h b/include/block/block_int.h
index f605622216..833129d912 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -724,7 +724,8 @@ struct BlockDriverState {
  * their children. */
 unsigned int supported_write_flags;
 /* Flags honored during pwrite_zeroes (so far: BDRV_REQ_FUA,
- * BDRV_REQ_MAY_UNMAP, BDRV_REQ_WRITE_UNCHANGED) */
+ * BDRV_REQ_MAY_UNMAP, BDRV_REQ_WRITE_UNCHANGED, BDRV_REQ_ALLOCATE)
+ */
 unsigned int supported_zero_flags;
 
 /* the following member gives a name to every node on the bs graph. */
diff --git a/block/io.c b/block/io.c
index bd9d688f8b..66006a089d 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1534,7 +1534,7 @@ static int coroutine_fn bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
 assert(!bs->supported_zero_flags);
 }
 
-if (ret == -ENOTSUP) {
+if (ret == -ENOTSUP && !(flags & BDRV_REQ_ALLOCATE)) {
 /* Fall back to bounce buffer if write zeroes is unsupported */
 BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
 
@@ -1773,6 +1773,9 @@ static int coroutine_fn bdrv_co_do_zero_pwritev(BdrvChild *child,
 
 assert(flags & BDRV_REQ_ZERO_WRITE);
 if (head_padding_bytes || tail_padding_bytes) {
+if (flags & BDRV_REQ_ALLOCATE) {
+return -ENOTSUP;
+}
 buf = qemu_blockalign(bs, align);
 iov = (struct iovec) {
 .iov_base   = buf,
@@ -1858,6 +1861,9 @@ int coroutine_fn bdrv_co_pwritev(BdrvChild *child,
 bool use_local_qiov = false;
 int ret;
 
+assert(!((flags & BDRV_REQ_ALLOCATE) && (flags & BDRV_REQ_MAY_UNMAP)));
+assert(!((flags & BDRV_REQ_ALLOCATE) && !(flags & BDRV_REQ_ZERO_WRITE)));
+
 trace_bdrv_co_pwritev(child->bs, offset, bytes, flags);
 
 if (!bs->drv) {
@@ -1980,6 +1986,12 @@ int coroutine_fn bdrv_co_pwrite_zeroes(BdrvChild *child, int64_t offset,
 {
 trace_bdrv_co_pwrite_zeroes(child->bs, offset, bytes, flags);
 
+if ((flags & BDRV_REQ_ALLOCATE) &&
+!(child->bs->supported_zero_flags & BDRV_REQ_ALLOCATE))
+{
+return -ENOTSUP;
+}
+
 if (!(child->bs->open_flags & BDRV_O_UNMAP)) {
 flags &= ~BDRV_REQ_MAY_UNMAP;
 }
-- 
2.17.1

[Qemu-devel] [PATCH v11 07/10] file-posix: support BDRV_REQ_ALLOCATE

2018-12-17 Thread Anton Nefedov

The current write_zeroes implementation is good enough to satisfy this flag too.

Signed-off-by: Anton Nefedov 
---
 block/file-posix.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index a65e464cbc..c3fbf53853 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -607,6 +607,7 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
 } else {
 s->discard_zeroes = true;
 s->has_fallocate = true;
+bs->supported_zero_flags = BDRV_REQ_ALLOCATE;
 }
 } else {
 if (!(S_ISCHR(st.st_mode) || S_ISBLK(st.st_mode))) {
@@ -650,10 +651,11 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
 #ifdef CONFIG_XFS
 if (platform_test_xfs_fd(s->fd)) {
 s->is_xfs = true;
+bs->supported_zero_flags = BDRV_REQ_ALLOCATE;
 }
 #endif
 
-bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
+bs->supported_zero_flags |= BDRV_REQ_MAY_UNMAP;
 ret = 0;
 fail:
 if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
@@ -1552,6 +1554,10 @@ static int handle_aiocb_write_zeroes(void *opaque)
 s->has_fallocate = false;
 #endif
 
+if (!s->has_fallocate) {
+aiocb->bs->supported_zero_flags &= ~BDRV_REQ_ALLOCATE;
+}
+
 return -ENOTSUP;
 }
 
-- 
2.17.1

[Qemu-devel] [PATCH v11 00/10] qcow2: cluster space preallocation

2018-12-17 Thread Anton Nefedov

new in v11:
- patch 4, 9: fixed commentary format
- patch 4: removed one hunk with a dead check
- patch 5: added commentary to BDRV_REQ_ALLOCATE definition
- new auxiliary patch 6 for the following patch-7 change:
- patch 7: reset BDRV_REQ_ALLOCATE from supported flag if CONFIG_FALLOCATE
   is false
- patch 9: add commentary about missing qcow2_pre_write_overlap_check().
   Omit redundant changes in the test 060.

v10: http://lists.nongnu.org/archive/html/qemu-devel/2018-12/msg00121.html
- patches 1-3,6,7: rebase after REQ_WRITE_UNCHANGED
- patch 3: drop supported_zero_flags. My bad, no write_zeroes in quorum.
- patch 4: almost trivial rebase. RB-tags not stripped.
   Choose another constant for BDRV_REQ_ALLOCATE
- patch 5: rebase. Instead of marking REQ_ALLOCATE serialising, accompany
   it with REQ_SERIALISING.
- patch 7: add symmetric copy-on-read change
- patch 8: trivial rebase. RB-tags not stripped.



This patchset starts to address a few performance issues of the
qcow2 format:

  1. non cluster-aligned write requests (to unallocated clusters) explicitly
     pad data with zeroes if there is no backing data.
     The resulting increase in the number of ops and the potential cluster
     fragmentation (on the host file) are already solved by:
       ee22a9d qcow2: Merge the writing of the COW regions with the guest data
     However, when the COW regions are zero, the padding writes can be
     avoided entirely: the whole clusters are preallocated and zeroed in a
     single efficient write_zeroes() operation.

  2. moreover, an efficient write_zeroes() operation can be used to
     preallocate space megabytes (*configurable number) ahead, which gives
     a noticeable improvement on some storage types (e.g. distributed
     storage, where the space allocation operation might be expensive).
     (Not included in this patchset since v6).

  3. this will also make it possible to enable simultaneous writes to the
     same unallocated cluster after the space has been allocated & zeroed
     but before the first data is written and the cluster is linked to L2.
     (Not included in this patchset).

An efficient write_zeroes usually implies that the blocks are not actually
written to but only reserved and marked as zeroed by the storage.
In this patchset, the file-posix driver is marked as supporting this operation
if it supports (or is configured to support) the fallocate() operation.

The existing bdrv_write_zeroes() falls back to writing zero buffers if
write_zeroes is not supported by the driver.
A new flag (BDRV_REQ_ALLOCATE) is introduced to avoid that and return ENOTSUP
instead.
Such allocate requests may also overlap with other requests. In that case no
wait is performed and an error is returned as well.
So the operation should be considered advisory, and the fallback scenario
must still be handled by the caller (in this case, the qcow2 driver).
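
The caller-side handling of such an advisory operation might look roughly
like the sketch below. It is an assumption-laden model, not the qcow2
code: demo_cluster and demo_handle_cluster() are hypothetical, and the
result of the allocate request is simply passed in as alloc_ret.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical per-cluster state kept by the caller. */
struct demo_cluster {
    bool skip_cow;           /* COW areas already zeroed by the fast path */
    int zero_buffer_writes;  /* conventional COW writes of zero buffers */
};

/*
 * alloc_ret is the result of the advisory allocate request:
 * 0 on success, -ENOTSUP or -EAGAIN if the fast path is unavailable,
 * any other negative errno on a real failure.
 */
static int demo_handle_cluster(struct demo_cluster *c, int alloc_ret)
{
    assert(alloc_ret <= 0);

    if (alloc_ret == 0) {
        /* The whole cluster reads as zeroes now, no COW is needed. */
        c->skip_cow = true;
        return 0;
    }
    if (alloc_ret == -ENOTSUP || alloc_ret == -EAGAIN) {
        /* Advisory operation declined: fall back to regular COW. */
        c->zero_buffer_writes++;
        return 0;
    }
    /* Anything else is a genuine error and is propagated. */
    return alloc_ret;
}
```

Both -ENOTSUP and -EAGAIN lead to the same fallback here; only other
errors reach the caller of demo_handle_cluster().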

simple perf test:

  qemu-img create -f qcow2 test.img 4G && \
  qemu-img bench -c $((1024*1024)) -f qcow2 -n -s 4k -t none -w test.img

test results (seconds):

+-----------+----------+----------+------+
|   file    |  before  |  after   | gain |
+-----------+----------+----------+------+
|    ssd    |  61.153  |  36.313  |  41% |
|    hdd    | 112.676  | 122.056  |  -8% |
+-----------+----------+----------+------+

Anton Nefedov (10):
  mirror: inherit supported write/zero flags
  blkverify: set supported write/zero flags
  quorum: set supported write flags
  block: introduce BDRV_REQ_ALLOCATE flag
  block: treat BDRV_REQ_ALLOCATE as serialising
  file-posix: reset fallocate-related flags without CONFIG_FALLOCATE*
  file-posix: support BDRV_REQ_ALLOCATE
  block: support BDRV_REQ_ALLOCATE in passthrough drivers
  qcow2: skip writing zero buffers to empty COW areas
  iotest 134: test cluster-misaligned encrypted write

 qapi/block-core.json   |  4 +-
 block/qcow2.h  |  6 +++
 include/block/block.h  | 13 +-
 include/block/block_int.h  |  3 +-
 block/blkdebug.c   |  2 +-
 block/blkverify.c  | 10 -
 block/copy-on-read.c   |  4 +-
 block/file-posix.c | 16 +--
 block/io.c | 45 +++
 block/mirror.c |  8 +++-
 block/qcow2-cluster.c  |  2 +-
 block/qcow2.c  | 89 +-
 block/quorum.c | 19 +++-
 block/raw-format.c |  2 +-
 block/trace-events |  1 +
 tests/qemu-iotests/060 |  7 ++-
 tests/qemu-iotests/060.out |  5 ++-
 tests/qemu-iotests/134 |  9 
 tests/qemu-iotests/134.out | 10 +
 19 files changed, 227 insertions(+), 28 deletions(-)

-- 
2.17.1

[Qemu-devel] [PATCH v11 03/10] quorum: set supported write flags

2018-12-17 Thread Anton Nefedov

Signed-off-by: Anton Nefedov 
Reviewed-by: Alberto Garcia 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/quorum.c | 19 ++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/block/quorum.c b/block/quorum.c
index 16b3c8067c..d21a6a3b8e 100644
--- a/block/quorum.c
+++ b/block/quorum.c
@@ -857,6 +857,19 @@ static QemuOptsList quorum_runtime_opts = {
 },
 };
 
+static void quorum_set_supported_flags(BlockDriverState *bs)
+{
+BDRVQuorumState *s = bs->opaque;
+int i;
+
+bs->supported_write_flags = BDRV_REQ_FUA;
+for (i = 0; i < s->num_children; i++) {
+bs->supported_write_flags &= s->children[i]->bs->supported_write_flags;
+}
+
+bs->supported_write_flags |= BDRV_REQ_WRITE_UNCHANGED;
+}
+
 static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
Error **errp)
 {
@@ -950,7 +963,7 @@ static int quorum_open(BlockDriverState *bs, QDict *options, int flags,
 }
 s->next_child_index = s->num_children;
 
-bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED;
+quorum_set_supported_flags(bs);
 
 g_free(opened);
 goto exit;
@@ -1025,6 +1038,8 @@ static void quorum_add_child(BlockDriverState *bs, BlockDriverState *child_bs,
 s->children = g_renew(BdrvChild *, s->children, s->num_children + 1);
 s->children[s->num_children++] = child;
 
+quorum_set_supported_flags(bs);
+
 out:
 bdrv_drained_end(bs);
 }
@@ -1063,6 +1078,8 @@ static void quorum_del_child(BlockDriverState *bs, BdrvChild *child,
 bdrv_unref_child(bs, child);
 
 bdrv_drained_end(bs);
+
+quorum_set_supported_flags(bs);
 }
 
 static void quorum_refresh_filename(BlockDriverState *bs, QDict *options)
-- 
2.17.1

[Qemu-devel] [PATCH v11 01/10] mirror: inherit supported write/zero flags

2018-12-17 Thread Anton Nefedov

Signed-off-by: Anton Nefedov 
Reviewed-by: Alberto Garcia 
Reviewed-by: Vladimir Sementsov-Ogievskiy 
---
 block/mirror.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/block/mirror.c b/block/mirror.c
index ab59ad77e8..be52c9be9c 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1529,8 +1529,12 @@ static void mirror_start_job(const char *job_id, BlockDriverState *bs,
 mirror_top_bs->implicit = true;
 }
 mirror_top_bs->total_sectors = bs->total_sectors;
-mirror_top_bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED;
-mirror_top_bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED;
+mirror_top_bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
+(BDRV_REQ_FUA & bs->supported_write_flags);
+mirror_top_bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
+((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP)
+ & bs->supported_zero_flags);
+
 bs_opaque = g_new0(MirrorBDSOpaque, 1);
 mirror_top_bs->opaque = bs_opaque;
 bdrv_set_aio_context(mirror_top_bs, bdrv_get_aio_context(bs));
-- 
2.17.1

Re: [Qemu-devel] [PATCH v10 8/9] qcow2: skip writing zero buffers to empty COW areas

2018-12-17 Thread Anton Nefedov



On 14/12/2018 7:20 PM, Vladimir Sementsov-Ogievskiy wrote:
> 03.12.2018 13:14, Anton Nefedov wrote:
>> If COW areas of the newly allocated clusters are zeroes on the backing image,
>> efficient bdrv_write_zeroes(flags=BDRV_REQ_ALLOCATE) can be used on the whole
>> cluster instead of writing explicit zero buffers later in perform_cow().
>>
>> iotest 060:
>> write to the discarded cluster does not trigger COW anymore.
>> Use a backing image instead.
>>
> 
> [..]
> 
>> --- a/tests/qemu-iotests/060
>> +++ b/tests/qemu-iotests/060
>> @@ -150,27 +150,33 @@ $QEMU_IO -c "$OPEN_RO" -c "read -P 1 0 512" | _filter_qemu_io
>>echo
>>echo "=== Testing overlap while COW is in flight ==="
>>echo
>> +BACKING_IMG=$TEST_IMG.base
>> +TEST_IMG=$BACKING_IMG _make_test_img 1G
>> +
>> +$QEMU_IO -c 'write 64k 64k' "$BACKING_IMG" | _filter_qemu_io
>> +
>># compat=0.10 is required in order to make the following discard actually
>> -# unallocate the sector rather than make it a zero sector - we want COW, after
>> -# all.
>> -IMGOPTS='compat=0.10' _make_test_img 1G
>> +# unallocate the sector rather than make it a zero sector as we would like
>> +# to reuse it for another guest offset
>> +IMGOPTS='compat=0.10' _make_test_img -b "$BACKING_IMG" 1G
>># Write two clusters, the second one enforces creation of an L2 table 
>> after
>># the first data cluster.
>>$QEMU_IO -c 'write 0k 64k' -c 'write 512M 64k' "$TEST_IMG" | 
>> _filter_qemu_io
>> -# Discard the first cluster. This cluster will soon enough be reallocated 
>> and
>> -# used for COW.
>> +# Discard the first cluster. This cluster will soon enough be reallocated
>>$QEMU_IO -c 'discard 0k 64k' "$TEST_IMG" | _filter_qemu_io
>># Now, corrupt the image by marking the second L2 table cluster as free.
>>poke_file "$TEST_IMG" '131084' "\x00\x00" # 0x2000c
>> -# Start a write operation requiring COW on the image stopping it right before
>> -# doing the read; then, trigger the corruption prevention by writing anything to
>> -# any unallocated cluster, leading to an attempt to overwrite the second L2
>> +# Start a write operation requiring COW on the image;
>> +# this write will reuse the host offset released by a previous discard.
>> +# Stop it right before doing the read.
>> +# Then, trigger the corruption prevention by writing anything to
>> +# another unallocated cluster, leading to an attempt to overwrite the second L2
>># table. Finally, resume the COW write and see it fail (but not crash).
>>echo "open -o file.driver=blkdebug $TEST_IMG
>>break cow_read 0
>> -aio_write 0k 1k
>> +aio_write 64k 1k
>>wait_break 0
>> -write 64k 64k
>> +write 128k 64k
> 
> don't understand why you need these changes.
> 
> works for me, without them, if write to backing at 0 offset, of course.
> 
> As I understand, discard create unallocated holes in top qcow2 for old qcow2 
> version.
> 

Ok, so COW happens regardless of whether this guest offset has been
discarded before. These offset changes are indeed not needed; only the
backing file is.

Re: [Qemu-devel] [PATCH v6 9/9] qapi: query-blockstat: add driver specific file-posix stats

2018-12-13 Thread Anton Nefedov

On 13/12/2018 3:20 PM, Markus Armbruster wrote:
> I'm reviewing just the QAPI schema today.
> 
> Anton Nefedov  writes:
> 
>> A block driver can provide a callback to report driver-specific
>> statistics.
>>
>> file-posix driver now reports discard statistics
>>
>> Signed-off-by: Anton Nefedov 
>> ---
>>   qapi/block-core.json  | 38 ++
>>   include/block/block.h |  1 +
>>   include/block/block_int.h |  1 +
>>   block.c   |  9 +
>>   block/file-posix.c| 32 
>>   block/qapi.c  |  5 +
>>   6 files changed, 86 insertions(+)
>>
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index 959358ccc4..b100e852c7 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -877,6 +877,41 @@
>>  '*x_wr_latency_histogram': 'BlockLatencyHistogramInfo',
>>  '*x_flush_latency_histogram': 'BlockLatencyHistogramInfo' } }
>>   
>> +##
>> +# @BlockStatsSpecificFile:
>> +#
>> +# File driver statistics
>> +#
>> +# @discard-nb-ok: The number of succeeded discard operations performed by
> 
> successful discard operations
> 

Fixed.

>> +# the driver.
>> +#
>> +# @discard-nb-failed: The number of failed discard operations performed by
>> +# the driver.
>> +#
>> +# @discard-bytes-ok: The number of bytes discarded by the driver.
>> +#
>> +# Since: 4.0
>> +##
>> +{ 'struct': 'BlockStatsSpecificFile',
>> +  'data': {
>> +  'discard-nb-ok': 'int',
>> +  'discard-nb-failed': 'int',
>> +  'discard-bytes-ok': 'int' } }
> 
> Should these be unsigned?
> 
> For what it's worth, similar counters nearby are also 'int'.
> 

And I just added these symmetrically.
Probably shouldn't have - let these be uint64.

>> +
>> +##
>> +# @BlockStatsSpecific:
>> +#
>> +# Block driver specific statistics
>> +#
>> +# Since: 4.0
>> +##
>> +{ 'union': 'BlockStatsSpecific',
>> +  'base': { 'driver': 'BlockdevDriver' },
>> +  'discriminator': 'driver',
>> +  'data': {
>> +  'file': 'BlockStatsSpecificFile',
>> +  'host_device': 'BlockStatsSpecificFile' } }
>> +
>>   ##
>>   # @BlockStats:
>>   #
>> @@ -892,6 +927,8 @@
>>   #
>>   # @stats:  A @BlockDeviceStats for the device.
>>   #
>> +# @driver-specific: Optional driver-specific stats. (Since 4.0)
>> +#
>>   # @parent: This describes the file block device if it has one.
>>   #  Contains recursively the statistics of the underlying
>>   #  protocol (e.g. the host file for a qcow2 image). If there is
>> @@ -905,6 +942,7 @@
>>   { 'struct': 'BlockStats',
>> 'data': {'*device': 'str', '*qdev': 'str', '*node-name': 'str',
>>  'stats': 'BlockDeviceStats',
>> +   '*driver-specific': 'BlockStatsSpecific',
>>  '*parent': 'BlockStats',
>>  '*backing': 'BlockStats'} }
>>   
> 
> Feels awkward.
> 
> When is @driver-specific present?  Exactly when the driver is 'file' or
> 'host_device'?  If that's correct, then turning BlockStats into a union
> would be clearer and reduce parenthesises on the wire:
> 
> { 'union': 'BlockStats',
>'base': {
>'driver': 'BlockdevDriver',
>... all the other existing members of BlockStats ... }
>'discriminator': 'driver',
>'data': {
>'file': 'BlockStatsSpecificFile',
>'host_device': 'BlockStatsSpecificFile' } }
> 
> [...]
> 

This series has been dragging on for quite a while - we already discussed
this :)
In short: a Blockdev does not always have a driver, so it's either this
or adding weird BlockdevDriver values like "none".

http://lists.nongnu.org/archive/html/qemu-devel/2018-06/msg01845.html

Re: [Qemu-devel] [PATCH v10 8/9] qcow2: skip writing zero buffers to empty COW areas

2018-12-13 Thread Anton Nefedov

On 13/12/2018 3:02 PM, Vladimir Sementsov-Ogievskiy wrote:
> 03.12.2018 13:14, Anton Nefedov wrote:
>> If COW areas of the newly allocated clusters are zeroes on the backing image,
>> efficient bdrv_write_zeroes(flags=BDRV_REQ_ALLOCATE) can be used on the whole
>> cluster instead of writing explicit zero buffers later in perform_cow().
>>
> 
> [...]
> 
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -2015,6 +2015,11 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
>>continue;
>>}
>>
>> +/* If COW regions are handled already, skip this too */
>> +if (m->skip_cow) {
>> +continue;
>> +}
>> +
>>/* The data (middle) region must be immediately after the
>> * start region */
>>if (l2meta_cow_start(m) + m->cow_start.nb_bytes != offset) {
>> @@ -2040,6 +2045,68 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
>>return false;
>>}
>>
>> +static bool is_unallocated(BlockDriverState *bs, int64_t offset, int64_t bytes)
>> +{
>> +int64_t nr;
>> +return !bytes ||
>> +(!bdrv_is_allocated_above(bs, NULL, offset, bytes, &nr) && nr == bytes);
> 
> hm, nr may be < bytes if it is up to file length. And we lose this case, when it
> may be considered as unallocated too.
> 
> Doesn't harm, however.
> 

Thanks guys for your review.

This case I think is too rare to care about.

>> +}
>> +
>> +static bool is_zero_cow(BlockDriverState *bs, QCowL2Meta *m)
>> +{
>> +/* This check is designed for optimization shortcut so it must be
>> + * efficient.
>> + * Instead of is_zero(), use is_unallocated() as it is faster (but not
>> + * as accurate and can result in false negatives). */
> 
> But, in case of allocated zeros, we'll read them anyway (as part of COW process),
> so, it may be handled in the same way too. Maybe not here, but after read we can
> check for zeroes, and again effectively write zeros to the whole cluster.
> 
> Again it may be done separately, I'm not sure it's worth doing.
> 

Detecting zeroes after we read them (and not here) might make sense;
performance gain should be about the same (minus some CPU to check the
read buffer for zeroes).

The question is how frequent it might hit:
  - raw backing image with holes
  - qcow2 backing image with sub-cluster zero areas
(e.g. smaller cluster size)
  - backing image contains lots of space with explicit zeroes
(e.g. guest FS with 'shred' extensively used to delete files)

Probably none of these occur that frequently.
But it might be a next step and a separate series.
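
As a rough illustration of the "check the read buffer for zeroes" step discussed above, here is a minimal sketch. The helper names are hypothetical and this is not QEMU's own zero-detection routine; it only shows the kind of CPU work such a check would add after the COW read.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not QEMU's own routine):
 * true when every byte in buf[0..len) is zero. */
static bool sk_buf_is_zero(const unsigned char *buf, size_t len)
{
    if (len == 0) {
        return true;    /* an empty region is trivially zero */
    }
    /* The first byte is zero and every byte equals its successor. */
    return buf[0] == 0 && memcmp(buf, buf + 1, len - 1) == 0;
}

/* Both COW regions (head and tail of the cluster) must be zero before
 * the whole cluster could be zeroed with one efficient call. */
static bool sk_cow_data_is_zero(const unsigned char *head, size_t head_len,
                                const unsigned char *tail, size_t tail_len)
{
    return sk_buf_is_zero(head, head_len) && sk_buf_is_zero(tail, tail_len);
}
```

Unlike the allocation-status check, this costs a pass over the data that was just read, which is the CPU overhead mentioned above.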

>> +return is_unallocated(bs, m->offset + m->cow_start.offset,
>> +  m->cow_start.nb_bytes) &&
>> +   is_unallocated(bs, m->offset + m->cow_end.offset,
>> +  m->cow_end.nb_bytes);
>> +}
>> +
>> +static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
>> +{
>> +BDRVQcow2State *s = bs->opaque;
>> +QCowL2Meta *m;
>> +
>> +if (!(bs->file->bs->supported_zero_flags & BDRV_REQ_ALLOCATE)) {
>> +return 0;
>> +}
>> +
>> +if (bs->encrypted) {
>> +return 0;
>> +}
>> +
>> +for (m = l2meta; m != NULL; m = m->next) {
>> +int ret;
>> +
>> +if (!m->cow_start.nb_bytes && !m->cow_end.nb_bytes) {
>> +continue;
>> +}
>> +
>> +if (!is_zero_cow(bs, m)) {
>> +continue;
>> +}
> 
> pre_write_overlap_check should be here
> 

Existing pre_write_overlap_check in qcow2_co_pwritev() should cover it,
since the check aligns the range to cluster boundaries.

However I think I missed a comment about it here. For suspicious
readers :) and just in case someone starts moving this code around.
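
The argument above relies on the existing check covering whole clusters. A minimal sketch of that alignment, assuming a power-of-two cluster size (the names and the struct are hypothetical, not taken from qcow2.c):

```c
#include <stdint.h>

/* Hypothetical range type for illustration only. */
struct sk_range {
    uint64_t offset;
    uint64_t bytes;
};

/* cluster_size is assumed to be a power of two. */
static uint64_t sk_align_down(uint64_t v, uint64_t cluster_size)
{
    return v & ~(cluster_size - 1);
}

static uint64_t sk_align_up(uint64_t v, uint64_t cluster_size)
{
    return sk_align_down(v + cluster_size - 1, cluster_size);
}

/* Widen [offset, offset + bytes) to cluster boundaries. The COW head and
 * tail of a guest write live inside the widened range, so a check done on
 * it also covers them. */
static struct sk_range sk_cluster_align(uint64_t offset, uint64_t bytes,
                                        uint64_t cluster_size)
{
    uint64_t start = sk_align_down(offset, cluster_size);
    uint64_t end = sk_align_up(offset + bytes, cluster_size);
    return (struct sk_range){ .offset = start, .bytes = end - start };
}
```

For example, a 1000-byte write at offset 70000 with 64 KiB clusters widens to the single cluster starting at 65536.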

I propose:

diff --git a/block/qcow2.c b/block/qcow2.c
index 027188a1a3..b3b3124083 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2088,6 +2088,9 @@ static int handle_alloc_space(BlockDriverState *bs, QCowL2Meta *l2meta)
  continue;
  }

+/* Conventional place for qcow2_pre_write_overlap_check() but in this
+   case it is already done for these clusters */
+
  BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_SPACE);
  /* instead of writing zero COW buffers,
 efficiently zero out the whole clusters */


>> +
>> +BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_SPACE);

Re: [Qemu-devel] [PATCH v10 6/9] file-posix: support BDRV_REQ_ALLOCATE

2018-12-13 Thread Anton Nefedov

On 12/12/2018 8:19 PM, Vladimir Sementsov-Ogievskiy wrote:
> 05.12.2018 17:11, Anton Nefedov wrote:
>> On 5/12/2018 4:25 PM, Vladimir Sementsov-Ogievskiy wrote:
>>> 03.12.2018 13:14, Anton Nefedov wrote:
>>>>  }
>>>>  #endif
>>>>  
>>>> -bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
>>>> +bs->supported_zero_flags |= BDRV_REQ_MAY_UNMAP;
>>>>  ret = 0;
>>>>  fail:
>>>>  if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
>>>> @@ -1520,6 +1522,10 @@ static ssize_t 
>>>> handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
>>>>  }
>>>>  s->has_fallocate = false;
>>>>  }
>>>> +
>>>> +if (!s->has_fallocate) {
>>>> +aiocb->bs->supported_zero_flags &= ~BDRV_REQ_ALLOCATE;
>>>> +}
>>>>  #endif
> 
> hm, if CONFIG_FALLOCATE is disabled, flag will remain in supported_zero_flags
> 

right..
I think there should be a separate patch to reset s->has_* in
non-CONFIG_FALLOCATE* cases. Then I'll move this hunk just one line down
under the following #endif

Re: [Qemu-devel] [PATCH v10 5/9] block: treat BDRV_REQ_ALLOCATE as serialising

2018-12-13 Thread Anton Nefedov



On 12/12/2018 3:48 PM, Vladimir Sementsov-Ogievskiy wrote:
> 05.12.2018 17:01, Anton Nefedov wrote:
>> --- a/include/block/block.h
>> +++ b/include/block/block.h
>> @@ -87,6 +87,9 @@ typedef enum {
>>  * efficiently allocate the space so it reads as zeroes, or return
>> an error.
>>  * If this flag is set then BDRV_REQ_ZERO_WRITE must also be set.
>>  * This flag cannot be set together with BDRV_REQ_MAY_UNMAP.
>> + * This flag implicitly behaves as BDRV_REQ_SERIALISING i.e. it is
>> + * protected from conflicts with overlapping requests. If such
>> conflict is
>> + * detected, -EAGAIN is returned.
> 
> "behaves as" sounds like "do the same" for me, so better is "implicitly sets" 
> or something like this.
> 

"implicitly sets" is good enough for me

Re: [Qemu-devel] [PATCH v10 3/9] quorum: set supported write flags

2018-12-07 Thread Anton Nefedov

On 7/12/2018 5:33 PM, Alberto Garcia wrote:
> On Mon 03 Dec 2018 11:14:55 AM CET, Anton Nefedov wrote:
>> Signed-off-by: Anton Nefedov 
>> ---
>>   block/quorum.c | 19 ++-
>>   1 file changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/quorum.c b/block/quorum.c
>> index 16b3c8067c..d21a6a3b8e 100644
>> --- a/block/quorum.c
>> +++ b/block/quorum.c
>> @@ -857,6 +857,19 @@ static QemuOptsList quorum_runtime_opts = {
>>   },
>>   };
>>   
>> +static void quorum_set_supported_flags(BlockDriverState *bs)
>> +{
>> +BDRVQuorumState *s = bs->opaque;
>> +int i;
>> +
>> +bs->supported_write_flags = BDRV_REQ_FUA;
>> +for (i = 0; i < s->num_children; i++) {
>> +bs->supported_write_flags &= 
>> s->children[i]->bs->supported_write_flags;
>> +}
>> +
>> +bs->supported_write_flags |= BDRV_REQ_WRITE_UNCHANGED;
>> +}
> 
> You don't set supported_zero_flags here anymore ?
> 
> Berto
> 

Yes, I noticed (rather late) that quorum doesn't actually implement
write_zeroes(). bdrv_co_do_pwrite_zeroes() specifically asserts that
no supported zero flags are set in that case.
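
A small sketch of the two rules in play here: a multi-child driver can only advertise a flag that every child supports, and a driver without a write-zeroes callback must leave its zero flags empty. The flag values and function names below are illustrative assumptions, not copied from the QEMU headers.

```c
#include <stdbool.h>

/* Illustrative flag values; treat them as assumptions. */
enum {
    SK_REQ_FUA             = 0x10,
    SK_REQ_WRITE_UNCHANGED = 0x40,
};

/* Intersect the children's write flags: FUA survives only if every child
 * supports it. WRITE_UNCHANGED is added unconditionally, as in the quoted
 * patch. */
static unsigned sk_common_write_flags(const unsigned *child_flags, int n)
{
    unsigned flags = SK_REQ_FUA;
    for (int i = 0; i < n; i++) {
        flags &= child_flags[i];
    }
    return flags | SK_REQ_WRITE_UNCHANGED;
}

/* A driver with no write-zeroes callback advertises no zero flags at all,
 * whatever its children could do. */
static unsigned sk_zero_flags(bool has_write_zeroes, unsigned wanted)
{
    return has_write_zeroes ? wanted : 0;
}
```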

Re: [Qemu-devel] [PATCH v10 8/9] qcow2: skip writing zero buffers to empty COW areas

2018-12-05 Thread Anton Nefedov

On 5/12/2018 5:01 PM, Vladimir Sementsov-Ogievskiy wrote:
> 03.12.2018 13:14, Anton Nefedov wrote:
>> If COW areas of the newly allocated clusters are zeroes on the backing image,
>> efficient bdrv_write_zeroes(flags=BDRV_REQ_ALLOCATE) can be used on the whole
>> cluster instead of writing explicit zero buffers later in perform_cow().
>>
>> iotest 060:
>> write to the discarded cluster does not trigger COW anymore.
>> Use a backing image instead.
>>
>> Signed-off-by: Anton Nefedov 
>> Reviewed-by: Alberto Garcia 
>> ---
>>qapi/block-core.json   |  4 +-
>>block/qcow2.h  |  6 +++
>>block/qcow2-cluster.c  |  2 +-
>>block/qcow2.c  | 80 +-
>>block/trace-events |  1 +
>>tests/qemu-iotests/060 | 26 -
>>tests/qemu-iotests/060.out |  5 ++-
>>7 files changed, 109 insertions(+), 15 deletions(-)
>>
>> diff --git a/qapi/block-core.json b/qapi/block-core.json
>> index d4fe710836..50598aa8fe 100644
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -3004,6 +3004,8 @@
>>#
>># @cor_write: a write due to copy-on-read (since 2.11)
>>#
>> +# @cluster_alloc_space: an allocation of file space for a cluster (since 
>> 4.0)
>> +#
>># Since: 2.9
>>##
>>{ 'enum': 'BlkdebugEvent', 'prefix': 'BLKDBG',
>> @@ -3022,7 +3024,7 @@
>>'pwritev_rmw_tail', 'pwritev_rmw_after_tail', 'pwritev',
>>'pwritev_zero', 'pwritev_done', 'empty_image_prepare',
>>'l1_shrink_write_table', 'l1_shrink_free_l2_clusters',
>> -'cor_write'] }
>> +'cor_write', 'cluster_alloc_space'] }
>>
>>##
>># @BlkdebugInjectErrorOptions:
>> diff --git a/block/qcow2.h b/block/qcow2.h
>> index 8662b68575..8a64077897 100644
>> --- a/block/qcow2.h
>> +++ b/block/qcow2.h
>> @@ -389,6 +389,12 @@ typedef struct QCowL2Meta
>> */
>>Qcow2COWRegion cow_end;
>>
>> +/**
>> + * Indicates that COW regions are already handled and do not require
>> + * any more processing.
>> + */
>> +bool skip_cow;
>> +
>>/**
>> * The I/O vector with the data from the actual guest write request.
>> * If non-NULL, this is meant to be merged together with the data
>> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
>> index d37fe08b3d..3685c5f67e 100644
>> --- a/block/qcow2-cluster.c
>> +++ b/block/qcow2-cluster.c
>> @@ -806,7 +806,7 @@ static int perform_cow(BlockDriverState *bs, QCowL2Meta 
>> *m)
>>assert(start->offset + start->nb_bytes <= end->offset);
>>assert(!m->data_qiov || m->data_qiov->size == data_bytes);
>>
>> -if (start->nb_bytes == 0 && end->nb_bytes == 0) {
>> +if ((start->nb_bytes == 0 && end->nb_bytes == 0) || m->skip_cow) {
>>return 0;
>>}
>>
>> diff --git a/block/qcow2.c b/block/qcow2.c
>> index 991d6ac91b..027188a1a3 100644
>> --- a/block/qcow2.c
>> +++ b/block/qcow2.c
>> @@ -2015,6 +2015,11 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
>>continue;
>>}
>>
>> +/* If COW regions are handled already, skip this too */
>> +if (m->skip_cow) {
>> +continue;
>> +}
>> +
>>/* The data (middle) region must be immediately after the
>> * start region */
>>if (l2meta_cow_start(m) + m->cow_start.nb_bytes != offset) {
>> @@ -2040,6 +2045,68 @@ static bool merge_cow(uint64_t offset, unsigned bytes,
>>return false;
>>}
>>
>> +static bool is_unallocated(BlockDriverState *bs, int64_t offset, int64_t bytes)
>> +{
>> +int64_t nr;
>> +return !bytes ||
>> +(!bdrv_is_allocated_above(bs, NULL, offset, bytes, &nr) && nr == bytes);
> 
> bdrv_is_allocated_above may return error < 0
> 

Probably I just took is_zero() as an example.
But somewhere there's even a rationale (bdrv_co_do_copy_on_readv):

 ret = bdrv_is_allocated(bs, cluster_offset,
 MIN(cluster_bytes, max_transfer), );
 if (ret < 0) {
 /* Safe to treat errors in querying allocation as if
  * unallocated; we'll probably fail again soon on the
  * read, but at least that wi

Re: [Qemu-devel] [PATCH v10 6/9] file-posix: support BDRV_REQ_ALLOCATE

2018-12-05 Thread Anton Nefedov

On 5/12/2018 4:25 PM, Vladimir Sementsov-Ogievskiy wrote:
> 03.12.2018 13:14, Anton Nefedov wrote:
>> Current write_zeroes implementation is good enough to satisfy this flag too
>>
>> Signed-off-by: Anton Nefedov 
>> ---
>>block/file-posix.c | 8 +++-
>>1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/file-posix.c b/block/file-posix.c
>> index 07bbdab953..b0b7ab0159 100644
>> --- a/block/file-posix.c
>> +++ b/block/file-posix.c
>> @@ -603,6 +603,7 @@ static int raw_open_common(BlockDriverState *bs, QDict 
>> *options,
>>} else {
>>s->discard_zeroes = true;
>>s->has_fallocate = true;
>> +bs->supported_zero_flags = BDRV_REQ_ALLOCATE;
>>}
>>} else {
>>if (!(S_ISCHR(st.st_mode) || S_ISBLK(st.st_mode))) {
>> @@ -646,10 +647,11 @@ static int raw_open_common(BlockDriverState *bs, QDict 
>> *options,
>>#ifdef CONFIG_XFS
>>if (platform_test_xfs_fd(s->fd)) {
>>s->is_xfs = true;
>> +bs->supported_zero_flags = BDRV_REQ_ALLOCATE;
> 
> why we should handle xfs separately? is there a case with xfs, not handled by 
> previous generic if ()?
> 

The driver supports ALLOCATE either when it's XFS or when fallocate is
available. Currently there is no test for fallocate; it is simply assumed
to be supported until ENOTSUP is returned.
I think it's safer (for possible future changes) to set it twice, even
though you're right that the first condition currently covers the XFS
condition too.
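
The "assumed supported until ENOTSUP is returned" pattern can be sketched as follows. The struct, the callback type and the function names are hypothetical; only the 0x100 flag value is taken from the patch in this series.

```c
#include <errno.h>
#include <stdbool.h>

#define SK_REQ_ALLOCATE 0x100u   /* value from the patch in this series */

/* Hypothetical per-file state for illustration. */
struct sk_file_state {
    bool has_fallocate;
    unsigned supported_zero_flags;
};

/* Hypothetical stand-in for the actual fallocate attempt. */
typedef int (*sk_fallocate_fn)(void);

static int sk_fallocate_ok(void)          { return 0; }
static int sk_fallocate_unsupported(void) { return -ENOTSUP; }

/* Try the efficient path optimistically; on the first -ENOTSUP remember
 * that it is unavailable and stop advertising ALLOCATE. */
static int sk_write_zeroes(struct sk_file_state *s, sk_fallocate_fn do_fallocate)
{
    if (s->has_fallocate) {
        int ret = do_fallocate();
        if (ret != -ENOTSUP) {
            return ret;
        }
        s->has_fallocate = false;
    }
    if (!s->has_fallocate) {
        s->supported_zero_flags &= ~SK_REQ_ALLOCATE;
    }
    return -ENOTSUP;
}
```

After the first failure, callers that consult supported_zero_flags no longer issue ALLOCATE requests to this file.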

>>}
>>#endif
>>
>> -bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
>> +bs->supported_zero_flags |= BDRV_REQ_MAY_UNMAP;
>>ret = 0;
>>fail:
>>if (filename && (bdrv_flags & BDRV_O_TEMPORARY)) {
>> @@ -1520,6 +1522,10 @@ static ssize_t 
>> handle_aiocb_write_zeroes(RawPosixAIOData *aiocb)
>>}
>>s->has_fallocate = false;
>>}
>> +
>> +if (!s->has_fallocate) {
>> +aiocb->bs->supported_zero_flags &= ~BDRV_REQ_ALLOCATE;
>> +}
>>#endif
>>
>>return -ENOTSUP;
>>
> 
>

Re: [Qemu-devel] [PATCH v10 5/9] block: treat BDRV_REQ_ALLOCATE as serialising

2018-12-05 Thread Anton Nefedov



On 5/12/2018 4:14 PM, Vladimir Sementsov-Ogievskiy wrote:
> 03.12.2018 13:14, Anton Nefedov wrote:
>> The idea is that ALLOCATE requests may overlap with other requests.
> 
> please, describe why
> 

It has not been used in this series for a while, but the idea is that the
caller might use ALLOCATE requests on a larger extent.

Described in the series header:

   2. moreover, efficient write_zeroes() operation can be used to
preallocate
  space megabytes (*configurable number) ahead which gives noticeable
  improvement on some storage types (e.g. distributed storage,
  where the space allocation operation might be expensive)
  (Not included in this patchset since v6).

So it's possible to drop it from this series and add it later, but I'd like
to receive general remarks on whether this is an acceptable approach.
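
A minimal sketch of the "never wait, return EAGAIN" behaviour for an ALLOCATE request that covers more than the guest write. The request struct and the function names are hypothetical and much simpler than the tracked-request machinery in block/io.c.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified tracked request. */
struct sk_req {
    int64_t offset;
    int64_t bytes;
    bool serialising;
};

static bool sk_overlaps(const struct sk_req *a, const struct sk_req *b)
{
    return a->offset < b->offset + b->bytes &&
           b->offset < a->offset + a->bytes;
}

/* An ALLOCATE request is serialising but must not wait: if any in-flight
 * request overlaps it, report -EAGAIN and let the caller reconsider. */
static int sk_try_allocate(const struct sk_req *self,
                           const struct sk_req *in_flight, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        const struct sk_req *req = &in_flight[i];
        if (req == self) {
            continue;
        }
        /* Two non-serialising requests never conflict with each other. */
        if (!req->serialising && !self->serialising) {
            continue;
        }
        if (sk_overlaps(self, req)) {
            return -EAGAIN;
        }
    }
    return 0;
}
```

The caller is then free to fall back to the ordinary write path for that range, or to retry with a smaller extent.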

>> Reuse the existing block layer infrastructure for serialising requests.
>> Use the following approach:
>> - mark ALLOCATE also SERIALISING, so subsequent requests to the area wait
>> - ALLOCATE request itself must never wait if another request is in flight
>>   already. Return EAGAIN, let the caller reconsider.
>>
>> Signed-off-by: Anton Nefedov 
>> ---
>>block/io.c | 31 ---
>>1 file changed, 24 insertions(+), 7 deletions(-)
>>
>> diff --git a/block/io.c b/block/io.c
>> index d9d7644858..6ff946f63d 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -720,12 +720,13 @@ void bdrv_dec_in_flight(BlockDriverState *bs)
>>bdrv_wakeup(bs);
>>}
>>
>> -static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
>> +static bool coroutine_fn find_or_wait_serialising_requests(
>> +BdrvTrackedRequest *self, bool wait)
>>{
>>BlockDriverState *bs = self->bs;
>>BdrvTrackedRequest *req;
>>bool retry;
>> -bool waited = false;
>> +bool found = false;
>>
>>if (!atomic_read(&bs->serialising_in_flight)) {
>>return false;
>> @@ -751,11 +752,14 @@ static bool coroutine_fn 
>> wait_serialising_requests(BdrvTrackedRequest *self)
>> * will wait for us as soon as it wakes up, then just go 
>> on
>> * (instead of producing a deadlock in the former case). 
>> */
>>if (!req->waiting_for) {
>> +found = true;
>> +if (!wait) {
>> +break;
>> +}
>>self->waiting_for = req;
>>qemu_co_queue_wait(&req->wait_queue, &bs->reqs_lock);
>>self->waiting_for = NULL;
>>retry = true;
>> -waited = true;
>>break;
>>}
>>}
>> @@ -763,7 +767,12 @@ static bool coroutine_fn 
>> wait_serialising_requests(BdrvTrackedRequest *self)
>>qemu_co_mutex_unlock(&bs->reqs_lock);
>>} while (retry);
>>
>> -return waited;
>> +return found;
>> +}
>> +
>> +static bool coroutine_fn wait_serialising_requests(BdrvTrackedRequest *self)
>> +{
>> +return find_or_wait_serialising_requests(self, true);
>>}
>>
>>static int bdrv_check_byte_request(BlockDriverState *bs, int64_t offset,
>> @@ -1585,7 +1594,7 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t 
>> offset, uint64_t bytes,
>>  BdrvTrackedRequest *req, int flags)
>>{
>>BlockDriverState *bs = child->bs;
>> -bool waited;
>> +bool found;
>>int64_t end_sector = DIV_ROUND_UP(offset + bytes, BDRV_SECTOR_SIZE);
>>
>>if (bs->read_only) {
>> @@ -1602,9 +1611,13 @@ bdrv_co_write_req_prepare(BdrvChild *child, int64_t 
>> offset, uint64_t bytes,
>>mark_request_serialising(req, bdrv_get_cluster_size(bs));
>>}
>>
>> -waited = wait_serialising_requests(req);
>> +found = find_or_wait_serialising_requests(req,
>> +  !(flags & BDRV_REQ_ALLOCATE));
>> +if (found && (flags & BDRV_REQ_ALLOCATE)) {
>> +return -EAGAIN;
>> +}
>>
>> -assert(!waited || !req->serialising ||
>> +assert(!found || !req->serialising ||
>>   is_request_serialising_and_aligned(req));
>>assert(req->overlap_offset <= offset);
>>assert(offset + bytes &

Re: [Qemu-devel] [PATCH v10 4/9] block: introduce BDRV_REQ_ALLOCATE flag

2018-12-05 Thread Anton Nefedov



On 5/12/2018 3:59 PM, Vladimir Sementsov-Ogievskiy wrote:
> 03.12.2018 13:14, Anton Nefedov wrote:
>> The flag is supposed to indicate that the region of the disk image has
>> to be sufficiently allocated so it reads as zeroes.
>>
>> The call with the flag set must return -ENOTSUP if allocation cannot
>> be done efficiently.
>> This has to be made sure of by both
>> - the drivers that support the flag
>> - and the common block layer (so it will not fall back to any slowpath
>>   (like writing zero buffers) in case the driver does not support
>>   the flag).
>>
>> Signed-off-by: Anton Nefedov 
>> Reviewed-by: Alberto Garcia 
>> ---
>>include/block/block.h |  9 -
>>include/block/block_int.h |  2 +-
>>block/io.c| 18 --
>>3 files changed, 25 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/block/block.h b/include/block/block.h
>> index 7f5453b45b..f571082415 100644
>> --- a/include/block/block.h
>> +++ b/include/block/block.h
>> @@ -83,8 +83,15 @@ typedef enum {
>> */
>>BDRV_REQ_SERIALISING= 0x80,
>>
>> +/* The BDRV_REQ_ALLOCATE flag is used to indicate that the driver has to
>> + * efficiently allocate the space so it reads as zeroes, or return an 
>> error.
>> + * If this flag is set then BDRV_REQ_ZERO_WRITE must also be set.
>> + * This flag cannot be set together with BDRV_REQ_MAY_UNMAP.
> 
> and, may be, it can't be set with FUA too?
> 

I don't quite see why it cannot. Even an efficient allocate call
usually implies some Unit Access; it's up to the protocol driver to
decide exactly which.

>> + */
>> +BDRV_REQ_ALLOCATE   = 0x100,
>> +
>>/* Mask of valid flags */
>> -BDRV_REQ_MASK   = 0xff,
>> +BDRV_REQ_MASK   = 0x1ff,
>>} BdrvRequestFlags;
>>
>>typedef struct BlockSizes {
>> diff --git a/include/block/block_int.h b/include/block/block_int.h
>> index f605622216..ff84c5d8aa 100644
>> --- a/include/block/block_int.h
>> +++ b/include/block/block_int.h
>> @@ -724,7 +724,7 @@ struct BlockDriverState {
>> * their children. */
>>unsigned int supported_write_flags;
>>/* Flags honored during pwrite_zeroes (so far: BDRV_REQ_FUA,
>> - * BDRV_REQ_MAY_UNMAP, BDRV_REQ_WRITE_UNCHANGED) */
>> + * BDRV_REQ_MAY_UNMAP, BDRV_REQ_WRITE_UNCHANGED, BDRV_REQ_ALLOCATE) */
>>unsigned int supported_zero_flags;
>>
>>/* the following member gives a name to every node on the bs graph. */
>> diff --git a/block/io.c b/block/io.c
>> index bd9d688f8b..d9d7644858 100644
>> --- a/block/io.c
>> +++ b/block/io.c
>> @@ -1534,7 +1534,7 @@ static int coroutine_fn 
>> bdrv_co_do_pwrite_zeroes(BlockDriverState *bs,
>>assert(!bs->supported_zero_flags);
>>}
>>
>> -if (ret == -ENOTSUP) {
>> +if (ret == -ENOTSUP && !(flags & BDRV_REQ_ALLOCATE)) {
>>/* Fall back to bounce buffer if write zeroes is unsupported 
>> */
>>BdrvRequestFlags write_flags = flags & ~BDRV_REQ_ZERO_WRITE;
>>
>> @@ -1702,7 +1702,9 @@ static int coroutine_fn bdrv_aligned_pwritev(BdrvChild 
>> *child,
>>!(flags & BDRV_REQ_ZERO_WRITE) && drv->bdrv_co_pwrite_zeroes &&
>>qemu_iovec_is_zero(qiov)) {
>>flags |= BDRV_REQ_ZERO_WRITE;
>> -if (bs->detect_zeroes == BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP) {
>> +if (bs->detect_zeroes == BLOCKDEV_DETECT_ZEROES_OPTIONS_UNMAP &&
>> +!(flags & BDRV_REQ_ALLOCATE))
> 
> 
> dead check. we are in if (!(flags & BDRV_REQ_ZERO_WRITE)), so (flags & 
> BDRV_REQ_ALLOCATE) must be zero as well.
> 

Agree.
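
The reasoning behind the dead check can be sketched with the flag rules from the header comment: ALLOCATE requires ZERO_WRITE and excludes MAY_UNMAP. The 0x100 value comes from the patch; the other two values and the function name are illustrative assumptions.

```c
#include <stdbool.h>

enum {
    SK_REQ_ZERO_WRITE = 0x2,    /* illustrative value */
    SK_REQ_MAY_UNMAP  = 0x4,    /* illustrative value */
    SK_REQ_ALLOCATE   = 0x100,  /* value from the patch */
};

/* Valid combinations according to the comment on BDRV_REQ_ALLOCATE:
 * it needs ZERO_WRITE and cannot be combined with MAY_UNMAP. */
static bool sk_allocate_flags_valid(unsigned flags)
{
    if (!(flags & SK_REQ_ALLOCATE)) {
        return true;
    }
    return (flags & SK_REQ_ZERO_WRITE) && !(flags & SK_REQ_MAY_UNMAP);
}
```

For any valid flag set, a branch entered only when ZERO_WRITE is clear can never see ALLOCATE set, which is why the extra test in that branch is redundant.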

>> +{
>>flags |= BDRV_REQ_MAY_UNMAP;
>>}
>>}
>> @@ -1773,6 +1775,9 @@ static int coroutine_fn 
>> bdrv_co_do_zero_pwritev(BdrvChild *child,
>>
>>assert(flags & BDRV_REQ_ZERO_WRITE);
>>if (head_padding_bytes || tail_padding_bytes) {
>> +if (flags & BDRV_REQ_ALLOCATE) {
>> +return -ENOTSUP;
>> +}
>>buf = qemu_blockalign(bs, align);
>>iov = (struct iovec) {
>>.iov_base   = buf,
>> @@ -1858,6 +1863,9 @@ int coroutine_fn bdrv_co_pwritev(BdrvChild *child,
>>

Re: [Qemu-devel] [PATCH v10 1/9] mirror: inherit supported write/zero flags

2018-12-05 Thread Anton Nefedov

On 5/12/2018 3:43 PM, Vladimir Sementsov-Ogievskiy wrote:
> Could you please write, what is the behavior change and why here?
> 

The idea is that passthrough drivers should report the flags if there
are no obstacles to supporting them.
Technically, these changes are not connected to this series, but since
the BDRV_REQ_ALLOCATE flag is being added, we might want to expose it
where possible.

> Is it a bug, that FUA was not inherited before?
> 

I don't think it's a bug really since there is a fallback path in
block/io.c.
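
A minimal sketch of the inheritance rule for a passthrough driver: the filter node always handles WRITE_UNCHANGED itself and passes through only those optional flags that its child supports. The flag values and the function name are illustrative assumptions, not taken from the QEMU headers.

```c
/* Illustrative flag values; treat them as assumptions. */
enum {
    SK_REQ_MAY_UNMAP       = 0x4,
    SK_REQ_FUA             = 0x10,
    SK_REQ_WRITE_UNCHANGED = 0x40,
};

/* Zero flags a passthrough node can advertise on top of its child. */
static unsigned sk_inherit_zero_flags(unsigned child_zero_flags)
{
    return SK_REQ_WRITE_UNCHANGED |
           ((SK_REQ_FUA | SK_REQ_MAY_UNMAP) & child_zero_flags);
}
```

Flags the child does not support are masked out, so requests carrying them are handled by the generic layer above the filter instead of being passed down.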

> 03.12.2018 13:14, Anton Nefedov wrote:
>> Signed-off-by: Anton Nefedov 
>> ---
>>block/mirror.c | 8 ++--
>>1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/mirror.c b/block/mirror.c
>> index 56d9ef7474..56908c9b19 100644
>> --- a/block/mirror.c
>> +++ b/block/mirror.c
>> @@ -1528,8 +1528,12 @@ static void mirror_start_job(const char *job_id, 
>> BlockDriverState *bs,
>>mirror_top_bs->implicit = true;
>>}
>>mirror_top_bs->total_sectors = bs->total_sectors;
>> -mirror_top_bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED;
>> -mirror_top_bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED;
>> +mirror_top_bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
>> +(BDRV_REQ_FUA & bs->supported_write_flags);
>> +mirror_top_bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
>> +((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP)
>> + & bs->supported_zero_flags);
>> +
>>bs_opaque = g_new0(MirrorBDSOpaque, 1);
>>mirror_top_bs->opaque = bs_opaque;
>>bdrv_set_aio_context(mirror_top_bs, bdrv_get_aio_context(bs));
>>
> 
>

Re: [Qemu-devel] [PATCH v10 8/9] qcow2: skip writing zero buffers to empty COW areas

2018-12-03 Thread Anton Nefedov



On 3/12/2018 4:59 PM, Alberto Garcia wrote:
> On Mon 03 Dec 2018 11:14:59 AM CET, Anton Nefedov wrote:
>> --- a/qapi/block-core.json
>> +++ b/qapi/block-core.json
>> @@ -3004,6 +3004,8 @@
>>   #
>>   # @cor_write: a write due to copy-on-read (since 2.11)
>>   #
>> +# @cluster_alloc_space: an allocation of file space for a cluster (since 
>> 4.0)
> 
> Since 4.0 ??
> 
> Berto
> 

I heard that 3.1 will be followed by 4.0

/Anton

[Qemu-devel] [PATCH v10 7/9] block: support BDRV_REQ_ALLOCATE in passthrough drivers

2018-12-03 Thread Anton Nefedov

Support the flag if the underlying BDS supports it

Signed-off-by: Anton Nefedov 
---
 block/blkdebug.c | 2 +-
 block/blkverify.c| 2 +-
 block/copy-on-read.c | 4 ++--
 block/mirror.c   | 2 +-
 block/raw-format.c   | 2 +-
 5 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/block/blkdebug.c b/block/blkdebug.c
index 0759452925..f0fc2ec276 100644
--- a/block/blkdebug.c
+++ b/block/blkdebug.c
@@ -401,7 +401,7 @@ static int blkdebug_open(BlockDriverState *bs, QDict 
*options, int flags,
 bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
 (BDRV_REQ_FUA & bs->file->bs->supported_write_flags);
 bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
-((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
+((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE) &
 bs->file->bs->supported_zero_flags);
 ret = -EINVAL;
 
diff --git a/block/blkverify.c b/block/blkverify.c
index bb52596cbb..9cb4f94b68 100644
--- a/block/blkverify.c
+++ b/block/blkverify.c
@@ -146,7 +146,7 @@ static int blkverify_open(BlockDriverState *bs, QDict 
*options, int flags,
  bs->file->bs->supported_write_flags &
  s->test_file->bs->supported_write_flags);
 bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
-((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
+((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE) &
  bs->file->bs->supported_zero_flags &
  s->test_file->bs->supported_zero_flags);
 
diff --git a/block/copy-on-read.c b/block/copy-on-read.c
index 64dcc424b5..1eb993699a 100644
--- a/block/copy-on-read.c
+++ b/block/copy-on-read.c
@@ -38,8 +38,8 @@ static int cor_open(BlockDriverState *bs, QDict *options, int 
flags,
 bs->file->bs->supported_write_flags);
 
 bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
-   ((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
-bs->file->bs->supported_zero_flags);
+((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE) &
+ bs->file->bs->supported_zero_flags);
 
 return 0;
 }
diff --git a/block/mirror.c b/block/mirror.c
index 56908c9b19..9d836a6bd3 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -1531,7 +1531,7 @@ static void mirror_start_job(const char *job_id, 
BlockDriverState *bs,
 mirror_top_bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
 (BDRV_REQ_FUA & bs->supported_write_flags);
 mirror_top_bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
-((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP)
+((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE)
  & bs->supported_zero_flags);
 
 bs_opaque = g_new0(MirrorBDSOpaque, 1);
diff --git a/block/raw-format.c b/block/raw-format.c
index 6f6dc99b2c..ad7453dc83 100644
--- a/block/raw-format.c
+++ b/block/raw-format.c
@@ -432,7 +432,7 @@ static int raw_open(BlockDriverState *bs, QDict *options, 
int flags,
 bs->supported_write_flags = BDRV_REQ_WRITE_UNCHANGED |
 (BDRV_REQ_FUA & bs->file->bs->supported_write_flags);
 bs->supported_zero_flags = BDRV_REQ_WRITE_UNCHANGED |
-((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP) &
+((BDRV_REQ_FUA | BDRV_REQ_MAY_UNMAP | BDRV_REQ_ALLOCATE) &
 bs->file->bs->supported_zero_flags);
 
 if (bs->probed && !bdrv_is_read_only(bs)) {
-- 
2.17.1
