Re: [PATCH v2 2/6] block: block-status cache for data regions

2021-07-12 Thread Max Reitz

On 06.07.21 19:04, Kevin Wolf wrote:

On 23.06.2021 at 17:01, Max Reitz wrote:

As we have attempted before
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), this patch seeks to reduce the number of
SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
main difference is that this time it is implemented as part of the
general block layer code.

The problem we face is that on some filesystems or in some
circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
implementation is outside of qemu, there is little we can do about its
performance.

We have already introduced the want_zero parameter to
bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
unless we really want zero information; but sometimes we do want that
information, because for files that consist largely of zero areas,
special-casing those areas can give large performance boosts.  So the
real problem is with files that consist largely of data, so that
inquiring the block status does not gain us much performance, but where
such an inquiry itself takes a lot of time.

To address this, we want to cache data regions.  Most of the time, when
bad performance is reported, it is in places where the image is iterated
over from start to end (qemu-img convert or the mirror job), so a simple
yet effective solution is to cache only the current data region.

(Note that only caching data regions but not zero regions means that
returning false information from the cache is not catastrophic: Treating
zeroes as data is fine.  While we try to invalidate the cache on zero
writes and discards, such incongruences may still occur when there are
other processes writing to the image.)

We only use the cache for nodes without children (i.e. protocol nodes),
because that is where the problem is: Drivers that rely on block-status
implementations outside of qemu (e.g. SEEK_DATA/HOLE).

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
Signed-off-by: Max Reitz 

Since you indicated that you'll respin the patch, I'll add my minor
comments:


@@ -2442,9 +2445,58 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
  aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
  
  if (bs->drv->bdrv_co_block_status) {

-ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
-aligned_bytes, pnum, &local_map,
-&local_file);
+bool from_cache = false;
+
+/*
+ * Use the block-status cache only for protocol nodes: Format
+ * drivers are generally quick to inquire the status, but protocol
+ * drivers often need to get information from outside of qemu, so
+ * we do not have control over the actual implementation.  There
+ * have been cases where inquiring the status took an unreasonably
+ * long time, and we can do nothing in qemu to fix it.
+ * This is especially problematic for images with large data areas,
+ * because finding the few holes in them and giving them special
+ * treatment does not gain much performance.  Therefore, we try to
+ * cache the last-identified data region.
+ *
+ * Second, limiting ourselves to protocol nodes allows us to assume
+ * the block status for data regions to be DATA | OFFSET_VALID, and
+ * that the host offset is the same as the guest offset.
+ *
+ * Note that it is possible that external writers zero parts of
+ * the cached regions without the cache being invalidated, and so
+ * we may report zeroes as data.  This is not catastrophic,
+ * however, because reporting zeroes as data is fine.
+ */
+if (QLIST_EMPTY(&bs->children)) {
+if (bdrv_bsc_is_data(bs, aligned_offset, pnum)) {
+ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
+local_file = bs;
+local_map = aligned_offset;
+
+from_cache = true;
+}
+}
+
+if (!from_cache) {

Is having a separate variable from_cache really useful? This looks like
it could just be:

 if (QLIST_EMPTY() && bdrv_bsc_is_data()) {
 // The code above
 } else {
 // The code below
 }


Oh, yes.

(I guess this was mainly an artifact from v1 where there was a mutex 
around the bdrv_bsc_is_data() block.  Now it’s better to just roll both 
conditions into one, yes.)
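
So for v3 the two branches above would collapse into something like this (just a sketch using the variables from the quoted hunk; the respin may of course end up slightly different):

    if (QLIST_EMPTY(&bs->children) &&
        bdrv_bsc_is_data(bs, aligned_offset, pnum))
    {
        /* Cache hit: for a protocol node, host offset == guest offset */
        ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
        local_file = bs;
        local_map = aligned_offset;
    } else {
        /* Cache miss, or not a protocol node: ask the driver */
        ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
                                            aligned_bytes, pnum, &local_map,
                                            &local_file);
    }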



+ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
+aligned_bytes, pnum, &local_map,
+&local_file);
+
+/*
+ * Note that checking QLIST_EMPTY(&bs->children) is also done when

Re: [PATCH v2 2/6] block: block-status cache for data regions

2021-07-06 Thread Kevin Wolf
On 23.06.2021 at 17:01, Max Reitz wrote:
> As we have attempted before
> (https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
> "file-posix: Cache lseek result for data regions";
> https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
> "file-posix: Cache next hole"), this patch seeks to reduce the number of
> SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
> main difference is that this time it is implemented as part of the
> general block layer code.
> 
> The problem we face is that on some filesystems or in some
> circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
> implementation is outside of qemu, there is little we can do about its
> performance.
> 
> We have already introduced the want_zero parameter to
> bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
> unless we really want zero information; but sometimes we do want that
> information, because for files that consist largely of zero areas,
> special-casing those areas can give large performance boosts.  So the
> real problem is with files that consist largely of data, so that
> inquiring the block status does not gain us much performance, but where
> such an inquiry itself takes a lot of time.
> 
> To address this, we want to cache data regions.  Most of the time, when
> bad performance is reported, it is in places where the image is iterated
> over from start to end (qemu-img convert or the mirror job), so a simple
> yet effective solution is to cache only the current data region.
> 
> (Note that only caching data regions but not zero regions means that
> returning false information from the cache is not catastrophic: Treating
> zeroes as data is fine.  While we try to invalidate the cache on zero
> writes and discards, such incongruences may still occur when there are
> other processes writing to the image.)
> 
> We only use the cache for nodes without children (i.e. protocol nodes),
> because that is where the problem is: Drivers that rely on block-status
> implementations outside of qemu (e.g. SEEK_DATA/HOLE).
> 
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
> Signed-off-by: Max Reitz 

Since you indicated that you'll respin the patch, I'll add my minor
comments:

> @@ -2442,9 +2445,58 @@ static int coroutine_fn bdrv_co_block_status(BlockDriverState *bs,
>  aligned_bytes = ROUND_UP(offset + bytes, align) - aligned_offset;
>  
>  if (bs->drv->bdrv_co_block_status) {
> -ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
> -aligned_bytes, pnum, &local_map,
> -&local_file);
> +bool from_cache = false;
> +
> +/*
> + * Use the block-status cache only for protocol nodes: Format
> + * drivers are generally quick to inquire the status, but protocol
> + * drivers often need to get information from outside of qemu, so
> + * we do not have control over the actual implementation.  There
> + * have been cases where inquiring the status took an unreasonably
> + * long time, and we can do nothing in qemu to fix it.
> + * This is especially problematic for images with large data areas,
> + * because finding the few holes in them and giving them special
> + * treatment does not gain much performance.  Therefore, we try to
> + * cache the last-identified data region.
> + *
> + * Second, limiting ourselves to protocol nodes allows us to assume
> + * the block status for data regions to be DATA | OFFSET_VALID, and
> + * that the host offset is the same as the guest offset.
> + *
> + * Note that it is possible that external writers zero parts of
> + * the cached regions without the cache being invalidated, and so
> + * we may report zeroes as data.  This is not catastrophic,
> + * however, because reporting zeroes as data is fine.
> + */
> +if (QLIST_EMPTY(&bs->children)) {
> +if (bdrv_bsc_is_data(bs, aligned_offset, pnum)) {
> +ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
> +local_file = bs;
> +local_map = aligned_offset;
> +
> +from_cache = true;
> +}
> +}
> +
> +if (!from_cache) {

Is having a separate variable from_cache really useful? This looks like
it could just be:

if (QLIST_EMPTY() && bdrv_bsc_is_data()) {
// The code above
} else {
// The code below
}

> +ret = bs->drv->bdrv_co_block_status(bs, want_zero, aligned_offset,
> +aligned_bytes, pnum, &local_map,
> +&local_file);
> +
> +/*
> + * Note that checking QLIST_EMPTY(&bs->children) is also done when
> + * the cache is queried above.  

Re: [PATCH v2 2/6] block: block-status cache for data regions

2021-06-24 Thread Max Reitz

On 24.06.21 12:05, Vladimir Sementsov-Ogievskiy wrote:

23.06.2021 18:01, Max Reitz wrote:

As we have attempted before
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), this patch seeks to reduce the number of
SEEK_DATA/HOLE operations the file-posix driver has to perform. The
main difference is that this time it is implemented as part of the
general block layer code.

The problem we face is that on some filesystems or in some
circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
implementation is outside of qemu, there is little we can do about its
performance.

We have already introduced the want_zero parameter to
bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
unless we really want zero information; but sometimes we do want that
information, because for files that consist largely of zero areas,
special-casing those areas can give large performance boosts. So the
real problem is with files that consist largely of data, so that
inquiring the block status does not gain us much performance, but where
such an inquiry itself takes a lot of time.

To address this, we want to cache data regions.  Most of the time, when
bad performance is reported, it is in places where the image is iterated
over from start to end (qemu-img convert or the mirror job), so a simple
yet effective solution is to cache only the current data region.

(Note that only caching data regions but not zero regions means that
returning false information from the cache is not catastrophic: Treating
zeroes as data is fine.  While we try to invalidate the cache on zero
writes and discards, such incongruences may still occur when there are
other processes writing to the image.)

We only use the cache for nodes without children (i.e. protocol nodes),
because that is where the problem is: Drivers that rely on block-status
implementations outside of qemu (e.g. SEEK_DATA/HOLE).

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
Signed-off-by: Max Reitz 


I'm new to RCU, so my review can't be reliable..


Yeah, well, same here. :)
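
(If it helps to spell out my mental model: the reader side is the easy part.  A lookup only needs an RCU read guard around the pointer dereference plus an atomic read of @valid — roughly like the following sketch, which assumes the usual qemu/rcu.h and qemu/atomic.h helpers and is not necessarily line-for-line what the patch below does:

    bool bdrv_bsc_is_data(BlockDriverState *bs, int64_t offset, int64_t *pnum)
    {
        BdrvBlockStatusCache *bsc;
        bool in_range;

        /* Keep the cache object alive while we read from it */
        RCU_READ_LOCK_GUARD();
        bsc = qatomic_rcu_read(&bs->block_status_cache);

        in_range = qatomic_read(&bsc->valid) &&
                   offset >= bsc->data_start && offset < bsc->data_end;

        if (in_range && pnum) {
            *pnum = bsc->data_end - offset;
        }

        return in_range;
    }

The writer side — replacing the object under bsc_modify_lock and only freeing the old one once all readers are gone — is the part that actually needs care.)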


---
  include/block/block_int.h | 47 ++
  block.c   | 84 +++
  block/io.c    | 61 ++--
  3 files changed, 189 insertions(+), 3 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index a8f9598102..fcb599dd1c 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -832,6 +832,22 @@ struct BdrvChild {
  QLIST_ENTRY(BdrvChild) next_parent;
  };
  +/*
+ * Allows bdrv_co_block_status() to cache one data region for a
+ * protocol node.
+ *
+ * @valid: Whether the cache is valid (should be accessed with atomic
+ * functions so this can be reset by RCU readers)
+ * @data_start: Offset where we know (or strongly assume) is data
+ * @data_end: Offset where the data region ends (which is not necessarily
+ *    the start of a zeroed region)
+ */
+typedef struct BdrvBlockStatusCache {
+    bool valid;
+    int64_t data_start;
+    int64_t data_end;
+} BdrvBlockStatusCache;
+
  struct BlockDriverState {
  /* Protected by big QEMU lock or read-only after opening.  No special
   * locking needed during I/O...
@@ -997,6 +1013,11 @@ struct BlockDriverState {
    /* BdrvChild links to this node may never be frozen */
  bool never_freeze;
+
+    /* Lock for block-status cache RCU writers */
+    CoMutex bsc_modify_lock;
+    /* Always non-NULL, but must only be dereferenced under an RCU read guard */
+    BdrvBlockStatusCache *block_status_cache;
  };
    struct BlockBackendRootState {
@@ -1422,4 +1443,30 @@ static inline BlockDriverState *bdrv_primary_bs(BlockDriverState *bs)
   */
  void bdrv_drain_all_end_quiesce(BlockDriverState *bs);
  +/**
+ * Check whether the given offset is in the cached block-status data
+ * region.
+ *
+ * If it is, and @pnum is not NULL, *pnum is set to
+ * `bsc.data_end - offset`, i.e. how many bytes, starting from
+ * @offset, are data (according to the cache).
+ * Otherwise, *pnum is not touched.
+ */
+bool bdrv_bsc_is_data(BlockDriverState *bs, int64_t offset, int64_t *pnum);
+
+/**
+ * If [offset, offset + bytes) overlaps with the currently cached
+ * block-status region, invalidate the cache.
+ *
+ * (To be used by I/O paths that cause data regions to be zero or
+ * holes.)
+ */
+void bdrv_bsc_invalidate_range(BlockDriverState *bs,
+   int64_t offset, int64_t bytes);
+
+/**
+ * Mark the range [offset, offset + bytes) as a data region.
+ */
+void bdrv_bsc_fill(BlockDriverState *bs, int64_t offset, int64_t bytes);
+
  #endif /* BLOCK_INT_H */
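
The point of the @valid flag, by the way, is that invalidation can stay on the fast path: an I/O path that turns data into zeroes or holes only clears @valid atomically under the RCU read guard, without taking bsc_modify_lock.  A rough sketch (assuming qemu/range.h's ranges_overlap(); the actual block.c code below may differ in detail):

    void bdrv_bsc_invalidate_range(BlockDriverState *bs,
                                   int64_t offset, int64_t bytes)
    {
        BdrvBlockStatusCache *bsc;

        RCU_READ_LOCK_GUARD();
        bsc = qatomic_rcu_read(&bs->block_status_cache);

        if (ranges_overlap(offset, bytes, bsc->data_start,
                           bsc->data_end - bsc->data_start)) {
            /* A zero write or discard makes the cached data region stale */
            qatomic_set(&bsc->valid, false);
        }
    }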
diff --git a/block.c b/block.c
index 3f456892d0..9ab9459f7a 100644
--- a/block.c
+++ b/block.c
@@ -49,6 +49,8 @@
  #include 

Re: [PATCH v2 2/6] block: block-status cache for data regions

2021-06-24 Thread Vladimir Sementsov-Ogievskiy

23.06.2021 18:01, Max Reitz wrote:

As we have attempted before
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), this patch seeks to reduce the number of
SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
main difference is that this time it is implemented as part of the
general block layer code.

The problem we face is that on some filesystems or in some
circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
implementation is outside of qemu, there is little we can do about its
performance.

We have already introduced the want_zero parameter to
bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
unless we really want zero information; but sometimes we do want that
information, because for files that consist largely of zero areas,
special-casing those areas can give large performance boosts.  So the
real problem is with files that consist largely of data, so that
inquiring the block status does not gain us much performance, but where
such an inquiry itself takes a lot of time.

To address this, we want to cache data regions.  Most of the time, when
bad performance is reported, it is in places where the image is iterated
over from start to end (qemu-img convert or the mirror job), so a simple
yet effective solution is to cache only the current data region.

(Note that only caching data regions but not zero regions means that
returning false information from the cache is not catastrophic: Treating
zeroes as data is fine.  While we try to invalidate the cache on zero
writes and discards, such incongruences may still occur when there are
other processes writing to the image.)

We only use the cache for nodes without children (i.e. protocol nodes),
because that is where the problem is: Drivers that rely on block-status
implementations outside of qemu (e.g. SEEK_DATA/HOLE).

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
Signed-off-by: Max Reitz 


I'm new to RCU, so my review can't be reliable..


---
  include/block/block_int.h | 47 ++
  block.c   | 84 +++
  block/io.c| 61 ++--
  3 files changed, 189 insertions(+), 3 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index a8f9598102..fcb599dd1c 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -832,6 +832,22 @@ struct BdrvChild {
  QLIST_ENTRY(BdrvChild) next_parent;
  };
  
+/*

+ * Allows bdrv_co_block_status() to cache one data region for a
+ * protocol node.
+ *
+ * @valid: Whether the cache is valid (should be accessed with atomic
+ * functions so this can be reset by RCU readers)
+ * @data_start: Offset where we know (or strongly assume) is data
+ * @data_end: Offset where the data region ends (which is not necessarily
+ *    the start of a zeroed region)
+ */
+typedef struct BdrvBlockStatusCache {
+bool valid;
+int64_t data_start;
+int64_t data_end;
+} BdrvBlockStatusCache;
+
  struct BlockDriverState {
  /* Protected by big QEMU lock or read-only after opening.  No special
   * locking needed during I/O...
@@ -997,6 +1013,11 @@ struct BlockDriverState {
  
  /* BdrvChild links to this node may never be frozen */

  bool never_freeze;
+
+/* Lock for block-status cache RCU writers */
+CoMutex bsc_modify_lock;
+/* Always non-NULL, but must only be dereferenced under an RCU read guard */
+BdrvBlockStatusCache *block_status_cache;
  };
  
  struct BlockBackendRootState {

@@ -1422,4 +1443,30 @@ static inline BlockDriverState *bdrv_primary_bs(BlockDriverState *bs)
   */
  void bdrv_drain_all_end_quiesce(BlockDriverState *bs);
  
+/**

+ * Check whether the given offset is in the cached block-status data
+ * region.
+ *
+ * If it is, and @pnum is not NULL, *pnum is set to
+ * `bsc.data_end - offset`, i.e. how many bytes, starting from
+ * @offset, are data (according to the cache).
+ * Otherwise, *pnum is not touched.
+ */
+bool bdrv_bsc_is_data(BlockDriverState *bs, int64_t offset, int64_t *pnum);
+
+/**
+ * If [offset, offset + bytes) overlaps with the currently cached
+ * block-status region, invalidate the cache.
+ *
+ * (To be used by I/O paths that cause data regions to be zero or
+ * holes.)
+ */
+void bdrv_bsc_invalidate_range(BlockDriverState *bs,
+   int64_t offset, int64_t bytes);
+
+/**
+ * Mark the range [offset, offset + bytes) as a data region.
+ */
+void bdrv_bsc_fill(BlockDriverState *bs, int64_t offset, int64_t bytes);
+
  #endif /* BLOCK_INT_H */
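
As far as I can tell from the commit message (the block/io.c hunk is not quoted here), the intended callers are the generic zero/discard paths and bdrv_co_block_status() itself — roughly along these lines, with the exact fill condition being my guess rather than something quoted above:

    /* bdrv_co_pwrite_zeroes() / bdrv_co_pdiscard():
     * the range may stop being data, so drop it from the cache */
    bdrv_bsc_invalidate_range(bs, offset, bytes);

    /* bdrv_co_block_status(), after a protocol driver reported an
     * allocated, non-zero range: remember it for the next query */
    if (ret & BDRV_BLOCK_DATA && !(ret & BDRV_BLOCK_ZERO)) {
        bdrv_bsc_fill(bs, aligned_offset, *pnum);
    }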
diff --git a/block.c b/block.c
index 3f456892d0..9ab9459f7a 100644
--- a/block.c
+++ b/block.c
@@ -49,6 +49,8 @@
  #include "qemu/timer.h"
  #include "qemu/cutils.h"
  #include "qemu/id.h"
+#include 

[PATCH v2 2/6] block: block-status cache for data regions

2021-06-23 Thread Max Reitz
As we have attempted before
(https://lists.gnu.org/archive/html/qemu-devel/2019-01/msg06451.html,
"file-posix: Cache lseek result for data regions";
https://lists.nongnu.org/archive/html/qemu-block/2021-02/msg00934.html,
"file-posix: Cache next hole"), this patch seeks to reduce the number of
SEEK_DATA/HOLE operations the file-posix driver has to perform.  The
main difference is that this time it is implemented as part of the
general block layer code.

The problem we face is that on some filesystems or in some
circumstances, SEEK_DATA/HOLE is unreasonably slow.  Given the
implementation is outside of qemu, there is little we can do about its
performance.

We have already introduced the want_zero parameter to
bdrv_co_block_status() to reduce the number of SEEK_DATA/HOLE calls
unless we really want zero information; but sometimes we do want that
information, because for files that consist largely of zero areas,
special-casing those areas can give large performance boosts.  So the
real problem is with files that consist largely of data, so that
inquiring the block status does not gain us much performance, but where
such an inquiry itself takes a lot of time.

To address this, we want to cache data regions.  Most of the time, when
bad performance is reported, it is in places where the image is iterated
over from start to end (qemu-img convert or the mirror job), so a simple
yet effective solution is to cache only the current data region.

(Note that only caching data regions but not zero regions means that
returning false information from the cache is not catastrophic: Treating
zeroes as data is fine.  While we try to invalidate the cache on zero
writes and discards, such incongruences may still occur when there are
other processes writing to the image.)

We only use the cache for nodes without children (i.e. protocol nodes),
because that is where the problem is: Drivers that rely on block-status
implementations outside of qemu (e.g. SEEK_DATA/HOLE).

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/307
Signed-off-by: Max Reitz 
---
 include/block/block_int.h | 47 ++
 block.c   | 84 +++
 block/io.c| 61 ++--
 3 files changed, 189 insertions(+), 3 deletions(-)

diff --git a/include/block/block_int.h b/include/block/block_int.h
index a8f9598102..fcb599dd1c 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -832,6 +832,22 @@ struct BdrvChild {
 QLIST_ENTRY(BdrvChild) next_parent;
 };
 
+/*
+ * Allows bdrv_co_block_status() to cache one data region for a
+ * protocol node.
+ *
+ * @valid: Whether the cache is valid (should be accessed with atomic
+ * functions so this can be reset by RCU readers)
+ * @data_start: Offset where we know (or strongly assume) is data
+ * @data_end: Offset where the data region ends (which is not necessarily
+ *    the start of a zeroed region)
+ */
+typedef struct BdrvBlockStatusCache {
+bool valid;
+int64_t data_start;
+int64_t data_end;
+} BdrvBlockStatusCache;
+
 struct BlockDriverState {
 /* Protected by big QEMU lock or read-only after opening.  No special
  * locking needed during I/O...
@@ -997,6 +1013,11 @@ struct BlockDriverState {
 
 /* BdrvChild links to this node may never be frozen */
 bool never_freeze;
+
+/* Lock for block-status cache RCU writers */
+CoMutex bsc_modify_lock;
+/* Always non-NULL, but must only be dereferenced under an RCU read guard */
+BdrvBlockStatusCache *block_status_cache;
 };
 
 struct BlockBackendRootState {
@@ -1422,4 +1443,30 @@ static inline BlockDriverState *bdrv_primary_bs(BlockDriverState *bs)
  */
 void bdrv_drain_all_end_quiesce(BlockDriverState *bs);
 
+/**
+ * Check whether the given offset is in the cached block-status data
+ * region.
+ *
+ * If it is, and @pnum is not NULL, *pnum is set to
+ * `bsc.data_end - offset`, i.e. how many bytes, starting from
+ * @offset, are data (according to the cache).
+ * Otherwise, *pnum is not touched.
+ */
+bool bdrv_bsc_is_data(BlockDriverState *bs, int64_t offset, int64_t *pnum);
+
+/**
+ * If [offset, offset + bytes) overlaps with the currently cached
+ * block-status region, invalidate the cache.
+ *
+ * (To be used by I/O paths that cause data regions to be zero or
+ * holes.)
+ */
+void bdrv_bsc_invalidate_range(BlockDriverState *bs,
+   int64_t offset, int64_t bytes);
+
+/**
+ * Mark the range [offset, offset + bytes) as a data region.
+ */
+void bdrv_bsc_fill(BlockDriverState *bs, int64_t offset, int64_t bytes);
+
 #endif /* BLOCK_INT_H */
diff --git a/block.c b/block.c
index 3f456892d0..9ab9459f7a 100644
--- a/block.c
+++ b/block.c
@@ -49,6 +49,8 @@
 #include "qemu/timer.h"
 #include "qemu/cutils.h"
 #include "qemu/id.h"
+#include "qemu/range.h"
+#include "qemu/rcu.h"
 #include "block/coroutines.h"
 
 #ifdef CONFIG_BSD
@@ -398,6 +400,9 @@ BlockDriverState