Re: [PATCH v3 2/6] block: block-status cache for data regions

2021-08-25 Thread Vladimir Sementsov-Ogievskiy

12.08.2021 11:41, Hanna Reitz wrote:

As we have attempted before
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), this patch seeks to reduce the number of
SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
main difference is that this time it is implemented as part of the
general block layer code.

The problem we face is that on some filesystems or in some
circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
implementation is outside of qemu, there is little we can do about its
performance.

We have already introduced the want_zero parameter to
bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
unless we really want zero information; but sometimes we do want that
information, because for files that consist largely of zero areas,
special-casing those areas can give large performance boosts.  So the
real problem is with files that consist largely of data, so that
inquiring the block status does not gain us much performance, but where
such an inquiry itself takes a lot of time.

To address this, we want to cache data regions.  Most of the time, when
bad performance is reported, it is in places where the image is iterated
over from start to end (qemu-img convert or the mirror job), so a simple
yet effective solution is to cache only the current data region.

(Note that only caching data regions but not zero regions means that
returning false information from the cache is not catastrophic: Treating
zeroes as data is fine.  While we try to invalidate the cache on zero
writes and discards, such incongruences may still occur when there are
other processes writing to the image.)

We only use the cache for nodes without children (i.e. protocol nodes),
because that is where the problem is: Drivers that rely on block-status
implementations outside of qemu (e.g. SEEK_DATA/HOLE).

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
Signed-off-by: Hanna Reitz 
---
  include/block/block_int.h | 50 


[..]


+    /*
+     * Note that checking QLIST_EMPTY(&bs->children) is also done when
+     * the cache is queried above.  Technically, we do not need to check
+     * it here; the worst that can happen is that we fill the cache for
+     * non-protocol nodes, and then it is never used.  However, filling
+     * the cache requires an RCU update, so double check here to avoid
+     * such an update if possible.
+     */
+    if (ret == (BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID) &&
+        QLIST_EMPTY(&bs->children))
+    {
+        /*
+         * When a protocol driver reports BLOCK_OFFSET_VALID, the
+         * returned local_map value must be the same as the offset we
+         * have passed (aligned_offset).
+         * Assert this, because we follow this rule when reading from
+         * the cache (see the `local_map = aligned_offset` assignment
+         * above), and the result the cache delivers must be the same
+         * as the driver would deliver.
+         */
+        assert(local_map == aligned_offset);


Maybe also assert(local_file == bs);

As well, we could check only the BDRV_BLOCK_DATA flag above and
assert BDRV_BLOCK_OFFSET_VALID here.
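
For illustration, a rough sketch of that variant (hypothetical, not what
the patch does):

+    if ((ret & BDRV_BLOCK_DATA) && QLIST_EMPTY(&bs->children)) {
+        /*
+         * Protocol drivers are expected to report OFFSET_VALID
+         * whenever they report DATA, with local_map equal to the
+         * offset that was passed in.
+         */
+        assert(ret & BDRV_BLOCK_OFFSET_VALID);
+        assert(local_map == aligned_offset);
+        assert(local_file == bs);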

anyway:
Reviewed-by: Vladimir Sementsov-Ogievskiy 


--
Best regards,
Vladimir



Re: [PATCH v3 2/6] block: block-status cache for data regions

2021-08-17 Thread Hanna Reitz

On 16.08.21 23:38, Eric Blake wrote:

On Thu, Aug 12, 2021 at 10:41:44AM +0200, Hanna Reitz wrote:

As we have attempted before
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), this patch seeks to reduce the number of
SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
main difference is that this time it is implemented as part of the
general block layer code.

We only use the cache for nodes without children (i.e. protocol nodes),
because that is where the problem is: Drivers that rely on block-status
implementations outside of qemu (e.g. SEEK_DATA/HOLE).

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
Signed-off-by: Hanna Reitz 
---
+++ b/block.c
+/**
+ * Check whether [offset, offset + bytes) overlaps with the cached
+ * block-status data region.
+ *
+ * If so, and @pnum is not NULL, set *pnum to `bsc.data_end - offset`,
+ * which is what bdrv_bsc_is_data()'s interface needs.
+ * Otherwise, *pnum is not touched.

Why duplicate this comment,...


I don’t think this duplication can be avoided, because this is a static 
function, so there is no header declaration to point to.  It is very 
similar to bdrv_bsc_is_data()’s interface, I admit, but it isn’t exactly 
the same (besides the _locked suffix, the only difference is that 
bdrv_bsc_is_data() is for a single byte, while this function checks a 
range).



+ */
+static bool bdrv_bsc_range_overlaps_locked(BlockDriverState *bs,
+                                           int64_t offset, int64_t bytes,
+                                           int64_t *pnum)
+{
+    BdrvBlockStatusCache *bsc = qatomic_rcu_read(&bs->block_status_cache);
+    bool overlaps;
+
+    overlaps =
+        qatomic_read(&bsc->valid) &&
+        ranges_overlap(offset, bytes, bsc->data_start,
+                       bsc->data_end - bsc->data_start);
+
+    if (overlaps && pnum) {
+        *pnum = bsc->data_end - offset;
+    }
+
+    return overlaps;
+}
+
+/**
+ * See block_int.h for this function's documentation.
+ */
+bool bdrv_bsc_is_data(BlockDriverState *bs, int64_t offset, int64_t *pnum)
+{
+    RCU_READ_LOCK_GUARD();
+
+    return bdrv_bsc_range_overlaps_locked(bs, offset, 1, pnum);
+}
+
+/**
+ * See block_int.h for this function's documentation.

...but not these?


These are not static functions, and so I can point to the header where 
they’re declared.


(We have a wild mix of where functions are described in qemu, and it’s 
often in their C files.  I prefer having descriptions in the header, but 
because we have the precedent of explaining interfaces in C files, I 
thought I couldn’t get around adding at least pointers in the C file.)



But that's minor.

Reviewed-by: Eric Blake 


Thanks!

Hanna




Re: [PATCH v3 2/6] block: block-status cache for data regions

2021-08-16 Thread Eric Blake
On Thu, Aug 12, 2021 at 10:41:44AM +0200, Hanna Reitz wrote:
> As we have attempted before
> (https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
> "file-posix: Cache lseek result for data regions";
> https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
> "file-posix: Cache next hole"), this patch seeks to reduce the number of
> SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
> main difference is that this time it is implemented as part of the
> general block layer code.
> 

> We only use the cache for nodes without children (i.e. protocol nodes),
> because that is where the problem is: Drivers that rely on block-status
> implementations outside of qemu (e.g. SEEK_DATA/HOLE).
> 
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
> Signed-off-by: Hanna Reitz 
> ---

> +++ b/block.c

> +/**
> + * Check whether [offset, offset + bytes) overlaps with the cached
> + * block-status data region.
> + *
> + * If so, and @pnum is not NULL, set *pnum to `bsc.data_end - offset`,
> + * which is what bdrv_bsc_is_data()'s interface needs.
> + * Otherwise, *pnum is not touched.

Why duplicate this comment,...

> + */
> +static bool bdrv_bsc_range_overlaps_locked(BlockDriverState *bs,
> +                                           int64_t offset, int64_t bytes,
> +                                           int64_t *pnum)
> +{
> +    BdrvBlockStatusCache *bsc = qatomic_rcu_read(&bs->block_status_cache);
> +    bool overlaps;
> +
> +    overlaps =
> +        qatomic_read(&bsc->valid) &&
> +        ranges_overlap(offset, bytes, bsc->data_start,
> +                       bsc->data_end - bsc->data_start);
> +
> +    if (overlaps && pnum) {
> +        *pnum = bsc->data_end - offset;
> +    }
> +
> +    return overlaps;
> +}
> +
> +/**
> + * See block_int.h for this function's documentation.
> + */
> +bool bdrv_bsc_is_data(BlockDriverState *bs, int64_t offset, int64_t *pnum)
> +{
> +    RCU_READ_LOCK_GUARD();
> +
> +    return bdrv_bsc_range_overlaps_locked(bs, offset, 1, pnum);
> +}
> +
> +/**
> + * See block_int.h for this function's documentation.

...but not these?

But that's minor.

Reviewed-by: Eric Blake 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




[PATCH v3 2/6] block: block-status cache for data regions

2021-08-12 Thread Hanna Reitz
As we have attempted before
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), this patch seeks to reduce the number of
SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
main difference is that this time it is implemented as part of the
general block layer code.

The problem we face is that on some filesystems or in some
circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
implementation is outside of qemu, there is little we can do about its
performance.

We have already introduced the want_zero parameter to
bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
unless we really want zero information; but sometimes we do want that
information, because for files that consist largely of zero areas,
special-casing those areas can give large performance boosts.  So the
real problem is with files that consist largely of data, so that
inquiring the block status does not gain us much performance, but where
such an inquiry itself takes a lot of time.

To address this, we want to cache data regions.  Most of the time, when
bad performance is reported, it is in places where the image is iterated
over from start to end (qemu-img convert or the mirror job), so a simple
yet effective solution is to cache only the current data region.

(Note that only caching data regions but not zero regions means that
returning false information from the cache is not catastrophic: Treating
zeroes as data is fine.  While we try to invalidate the cache on zero
writes and discards, such incongruences may still occur when there are
other processes writing to the image.)
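
(For instance, a hedged sketch of the hook in the zero-write path; the
actual call sites are in the block/io.c part of this patch:)

    /* About to turn a (possibly cached) data region into zeroes/holes */
    bdrv_bsc_invalidate_range(bs, offset, bytes);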

We only use the cache for nodes without children (i.e. protocol nodes),
because that is where the problem is: Drivers that rely on block-status
implementations outside of qemu (e.g. SEEK_DATA/HOLE).
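
As a simplified sketch, the resulting fast path in bdrv_co_block_status()
looks roughly like this (the real code is in the block/io.c part of this
patch):

    if (QLIST_EMPTY(&bs->children) &&
        bdrv_bsc_is_data(bs, aligned_offset, pnum))
    {
        /* Cache hit: report a data region without calling the driver */
        ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
        local_file = bs;
        local_map = aligned_offset;
    } else {
        /* Cache miss: ask the driver's block-status implementation */
        ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
                                            aligned_bytes, pnum, &local_map,
                                            &local_file);
    }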

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
Signed-off-by: Hanna Reitz 
---
 include/block/block_int.h | 50 
 block.c                   | 80 +++
 block/io.c                | 65 +--
 3 files changed, 192 insertions(+), 3 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index 12e5750fe8..437d746733 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -34,6 +34,7 @@
 #include "qemu/hbitmap.h"
 #include "block/snapshot.h"
 #include "qemu/throttle.h"
+#include "qemu/rcu.h"
 
 #define BLOCK_FLAG_LAZY_REFCOUNTS   8
 
@@ -839,6 +840,24 @@ struct BdrvChild {
     QLIST_ENTRY(BdrvChild) next_parent;
 };
 
+/*
+ * Allows bdrv_co_block_status() to cache one data region for a
+ * protocol node.
+ *
+ * @valid: Whether the cache is valid (should be accessed with atomic
+ * functions so this can be reset by RCU readers)
+ * @data_start: Offset where we know (or strongly assume) is data
+ * @data_end: Offset where the data region ends (which is not necessarily
+ *            the start of a zeroed region)
+ */
+typedef struct BdrvBlockStatusCache {
+    struct rcu_head rcu;
+
+    bool valid;
+    int64_t data_start;
+    int64_t data_end;
+} BdrvBlockStatusCache;
+
 struct BlockDriverState {
     /* Protected by big QEMU lock or read-only after opening.  No special
      * locking needed during I/O...
@@ -1004,6 +1023,11 @@ struct BlockDriverState {
 
     /* BdrvChild links to this node may never be frozen */
     bool never_freeze;
+
+    /* Lock for block-status cache RCU writers */
+    CoMutex bsc_modify_lock;
+    /* Always non-NULL, but must only be dereferenced under an RCU read guard */
+    BdrvBlockStatusCache *block_status_cache;
 };
 
 struct BlockBackendRootState {
@@ -1429,4 +1453,30 @@ static inline BlockDriverState *bdrv_primary_bs(BlockDriverState *bs)
  */
 void bdrv_drain_all_end_quiesce(BlockDriverState *bs);
 
+/**
+ * Check whether the given offset is in the cached block-status data
+ * region.
+ *
+ * If it is, and @pnum is not NULL, *pnum is set to
+ * `bsc.data_end - offset`, i.e. how many bytes, starting from
+ * @offset, are data (according to the cache).
+ * Otherwise, *pnum is not touched.
+ */
+bool bdrv_bsc_is_data(BlockDriverState *bs, int64_t offset, int64_t *pnum);
+
+/**
+ * If [offset, offset + bytes) overlaps with the currently cached
+ * block-status region, invalidate the cache.
+ *
+ * (To be used by I/O paths that cause data regions to be zero or
+ * holes.)
+ */
+void bdrv_bsc_invalidate_range(BlockDriverState *bs,
+                               int64_t offset, int64_t bytes);
+
+/**
+ * Mark the range [offset, offset + bytes) as a data region.
+ */
+void bdrv_bsc_fill(BlockDriverState *bs, int64_t offset, int64_t bytes);
+
 #endif /* BLOCK_INT_H */
diff --git a/block.c b/block.c
index e97ce0b1c8..28b00d7239 100644
--- a/block.c
+++ b/block.c
@@ -49,6 +49,8 @@
 #include
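
(As a hedged sketch of the writer side implied by the block_int.h
interface above; the actual block.c implementation may differ in its
details:)

void bdrv_bsc_invalidate_range(BlockDriverState *bs,
                               int64_t offset, int64_t bytes)
{
    RCU_READ_LOCK_GUARD();

    if (bdrv_bsc_range_overlaps_locked(bs, offset, bytes, NULL)) {
        /* @valid may be reset by RCU readers, so an atomic write suffices */
        qatomic_set(&qatomic_rcu_read(&bs->block_status_cache)->valid, false);
    }
}

void bdrv_bsc_fill(BlockDriverState *bs, int64_t offset, int64_t bytes)
{
    BdrvBlockStatusCache *new_bsc = g_new(BdrvBlockStatusCache, 1);
    BdrvBlockStatusCache *old_bsc;

    *new_bsc = (BdrvBlockStatusCache) {
        .valid      = true,
        .data_start = offset,
        .data_end   = offset + bytes,
    };

    /* Serialize RCU writers against each other; readers need no lock */
    QEMU_LOCK_GUARD(&bs->bsc_modify_lock);

    old_bsc = qatomic_rcu_read(&bs->block_status_cache);
    qatomic_rcu_set(&bs->block_status_cache, new_bsc);
    /* Reclaim the old cache object once all RCU readers are done with it */
    g_free_rcu(old_bsc, rcu);
}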