[PATCH v2 0/3] md superblock write alignment on 512e devices

2020-10-29 Thread Christopher Unkel
Hello,

Thanks for the feedback on the previous patch series.

A updated patch series with the same function as the first patch
(https://lkml.org/lkml/2020/10/22/1058 "md: align superblock writes to
physical blocks") follows.

As suggested, it introduces a helper function, which can be used to
reduce some code duplication.  It handles the case in super_1_sync()
where the superblock is extended by the addition of new component
devices.

I think it also fixes a bug where the existing code in super_1_load()
ought to be rejecting the array with EINVAL: if the superblock padded
out to the *logical* block length runs into the bitmap.  For example, if
the bitmap offset is 2 (bitmap 1K after superblock) and the logical
block size is 4K, the superblock padded out to 4K runs into the bitmap.
This case may be unusual (perhaps only happens if the array is created
on a 512n device and then raw contents are copied onto a 4kn device) but
I think it is possible.

With respect to the option of simply replacing
queue_logical_block_size() with queue_physical_block_size(), I think
this can result in the code rejecting devices that can be loaded, but
for which the physical block alignment can't be respected--the longer
padded size would trigger the EINVAL cases testing against
data_offset/new_data_offset.  I think it's better to proceed in such
cases, just with unaligned superblock writes as would presently happen.
Also if I'm right about the above bug, then I think this subsitution
would be more likely to trigger it.

Thanks,

  --Chris


Christopher Unkel (3):
  md: factor out repeated sb alignment logic
  md: align superblock writes to physical blocks
  md: reuse sb length-checking logic

 drivers/md/md.c | 69 +
 1 file changed, 52 insertions(+), 17 deletions(-)

-- 
2.17.1



[PATCH 2/3] md: align superblock writes to physical blocks

2020-10-29 Thread Christopher Unkel
Writes of the md superblock are aligned to the logical blocks of the
containing device, but no attempt is made to align them to physical
block boundaries.  This means that on a "512e" device (4k physical, 512
logical) every superblock update hits the 512-byte emulation and the
possible associated performance penalty.

Respect the physical block alignment when possible, that is, when the
write padded out to the physical block doesn't run into the data or
bitmap.

Signed-off-by: Christopher Unkel 
---
This series replaces the first patch of the previous series
(https://lkml.org/lkml/2020/10/22/1058), with the following changes:

1. Creates a helper function super_1_sb_length_ok().
2. Fixes operator placement style violation.
3. Covers case in super_1_sync().
4. Refactors duplicate logic.
5. Covers a case in existing code where aligned superblock could
   run into bitmap.

 drivers/md/md.c | 45 +
 1 file changed, 41 insertions(+), 4 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index d6a55ca1d52e..802a9a256fe5 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1646,15 +1646,52 @@ static __le32 calc_sb_1_csum(struct mdp_superblock_1 
*sb)
return cpu_to_le32(csum);
 }
 
+static int
+super_1_sb_length_ok(struct md_rdev *rdev, int minor_version, int sb_len)
+{
+   int sectors = sb_len / 512;
+   struct mdp_superblock_1 *sb;
+
+   /* superblock is stored in memory as a single page */
+   if (sb_len > PAGE_SIZE)
+   return 0;
+
+   /* check if sb runs into data */
+   if (minor_version) {
+   if (rdev->sb_start + sectors > rdev->data_offset
+   || rdev->sb_start + sectors > rdev->new_data_offset)
+   return 0;
+   } else if (sb_len > 4096)
+   return 0;
+
+   /* check if sb runs into bitmap */
+   sb = page_address(rdev->sb_page);
+   if (le32_to_cpu(sb->feature_map) & MD_FEATURE_BITMAP_OFFSET) {
+   __s32 bitmap_offset = (__s32)le32_to_cpu(sb->bitmap_offset);
+   if (bitmap_offset > 0 && sectors > bitmap_offset)
+   return 0;
+   }
+
+   return 1;
+}
+
 /*
  * set rdev->sb_size to that required for number of devices in array
  * with appropriate padding to underlying sectors
  */
 static void
-super_1_set_rdev_sb_size(struct md_rdev *rdev, int max_dev)
+super_1_set_rdev_sb_size(struct md_rdev *rdev, int max_dev, int minor_version)
 {
int sb_size = max_dev * 2 + 256;
-   rdev->sb_size = round_up(sb_size, bdev_logical_block_size(rdev->bdev));
+   int pb_aligned_size = round_up(sb_size,
+  bdev_physical_block_size(rdev->bdev));
+
+   /* generate physical-block aligned writes if legal */
+   if (super_1_sb_length_ok(rdev, minor_version, pb_aligned_size))
+   rdev->sb_size = pb_aligned_size;
+   else
+   rdev->sb_size = round_up(sb_size,
+bdev_logical_block_size(rdev->bdev));
 }
 
 static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int 
minor_version)
@@ -1730,7 +1767,7 @@ static int super_1_load(struct md_rdev *rdev, struct 
md_rdev *refdev, int minor_
rdev->new_data_offset += (s32)le32_to_cpu(sb->new_offset);
atomic_set(>corrected_errors, 
le32_to_cpu(sb->cnt_corrected_read));
 
-   super_1_set_rdev_sb_size(rdev, le32_to_cpu(sb->max_dev));
+   super_1_set_rdev_sb_size(rdev, le32_to_cpu(sb->max_dev), minor_version);
 
if (minor_version
&& rdev->data_offset < sb_start + (rdev->sb_size/512))
@@ -2140,7 +2177,7 @@ static void super_1_sync(struct mddev *mddev, struct 
md_rdev *rdev)
 
if (max_dev > le32_to_cpu(sb->max_dev)) {
sb->max_dev = cpu_to_le32(max_dev);
-   super_1_set_rdev_sb_size(rdev, max_dev);
+   super_1_set_rdev_sb_size(rdev, max_dev, mddev->minor_version);
} else
max_dev = le32_to_cpu(sb->max_dev);
 
-- 
2.17.1



[PATCH 3/3] md: reuse sb length-checking logic

2020-10-29 Thread Christopher Unkel
The logic in super_1_load() to check the length of the superblock
against (new_)data_offset has the same purpose as the newly-created
super_1_sb_length_ok().  The latter is also more complete in that it
check for overlap between the superblock write and the bitmap.

Signed-off-by: Christopher Unkel 
---
This series replaces the first patch of the previous series
(https://lkml.org/lkml/2020/10/22/1058), with the following changes:

1. Creates a helper function super_1_sb_length_ok().
2. Fixes operator placement style violation.
3. Covers case in super_1_sync().
4. Refactors duplicate logic.
5. Covers a case in existing code where aligned superblock could
   run into bitmap.

 drivers/md/md.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 802a9a256fe5..3b7bf14922ac 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1768,13 +1768,8 @@ static int super_1_load(struct md_rdev *rdev, struct 
md_rdev *refdev, int minor_
atomic_set(>corrected_errors, 
le32_to_cpu(sb->cnt_corrected_read));
 
super_1_set_rdev_sb_size(rdev, le32_to_cpu(sb->max_dev), minor_version);
-
-   if (minor_version
-   && rdev->data_offset < sb_start + (rdev->sb_size/512))
-   return -EINVAL;
-   if (minor_version
-   && rdev->new_data_offset < sb_start + (rdev->sb_size/512))
-   return -EINVAL;
+   if (!super_1_sb_length_ok(rdev, minor_version, rdev->sb_size))
+   return -EINVAL;
 
if (sb->level == cpu_to_le32(LEVEL_MULTIPATH))
rdev->desc_nr = -1;
-- 
2.17.1



[PATCH v2 1/3] md: factor out repeated sb alignment logic

2020-10-29 Thread Christopher Unkel
super_1_load() and super_1_sync() both contain a copy of logic to pad
out the superblock size so that it is aligned on a logical block
boundary.  Factor into new function, and use round_up() rather than
explict bitmask-based calculation.

Signed-off-by: Christopher Unkel 
---
This series replaces the first patch of the previous series
(https://lkml.org/lkml/2020/10/22/1058), with the following changes:

1. Creates a helper function super_1_sb_length_ok().
2. Fixes operator placement style violation.
3. Covers case in super_1_sync().
4. Refactors duplicate logic.
5. Covers a case in existing code where aligned superblock could
   run into bitmap.

drivers/md/md.c | 23 +--
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 98bac4f304ae..d6a55ca1d52e 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1646,6 +1646,17 @@ static __le32 calc_sb_1_csum(struct mdp_superblock_1 *sb)
return cpu_to_le32(csum);
 }
 
+/*
+ * set rdev->sb_size to that required for number of devices in array
+ * with appropriate padding to underlying sectors
+ */
+static void
+super_1_set_rdev_sb_size(struct md_rdev *rdev, int max_dev)
+{
+   int sb_size = max_dev * 2 + 256;
+   rdev->sb_size = round_up(sb_size, bdev_logical_block_size(rdev->bdev));
+}
+
 static int super_1_load(struct md_rdev *rdev, struct md_rdev *refdev, int 
minor_version)
 {
struct mdp_superblock_1 *sb;
@@ -1653,7 +1664,6 @@ static int super_1_load(struct md_rdev *rdev, struct 
md_rdev *refdev, int minor_
sector_t sb_start;
sector_t sectors;
char b[BDEVNAME_SIZE], b2[BDEVNAME_SIZE];
-   int bmask;
bool spare_disk = true;
 
/*
@@ -1720,10 +1730,7 @@ static int super_1_load(struct md_rdev *rdev, struct 
md_rdev *refdev, int minor_
rdev->new_data_offset += (s32)le32_to_cpu(sb->new_offset);
atomic_set(>corrected_errors, 
le32_to_cpu(sb->cnt_corrected_read));
 
-   rdev->sb_size = le32_to_cpu(sb->max_dev) * 2 + 256;
-   bmask = queue_logical_block_size(rdev->bdev->bd_disk->queue)-1;
-   if (rdev->sb_size & bmask)
-   rdev->sb_size = (rdev->sb_size | bmask) + 1;
+   super_1_set_rdev_sb_size(rdev, le32_to_cpu(sb->max_dev));
 
if (minor_version
&& rdev->data_offset < sb_start + (rdev->sb_size/512))
@@ -2132,12 +2139,8 @@ static void super_1_sync(struct mddev *mddev, struct 
md_rdev *rdev)
max_dev = rdev2->desc_nr+1;
 
if (max_dev > le32_to_cpu(sb->max_dev)) {
-   int bmask;
sb->max_dev = cpu_to_le32(max_dev);
-   rdev->sb_size = max_dev * 2 + 256;
-   bmask = queue_logical_block_size(rdev->bdev->bd_disk->queue)-1;
-   if (rdev->sb_size & bmask)
-   rdev->sb_size = (rdev->sb_size | bmask) + 1;
+   super_1_set_rdev_sb_size(rdev, max_dev);
} else
max_dev = le32_to_cpu(sb->max_dev);
 
-- 
2.17.1



[PATCH 3/3] md: pad writes to end of bitmap to physical blocks

2020-10-22 Thread Christopher Unkel
Writes of the last page of the bitmap are padded out to the next logical
block boundary.  However, they are not padded out to the next physical
block boundary, so the writes may be less than a physical block.  On a
"512e" disk (logical block 512 bytes, physical block 4k) and if the last
page of the bitmap is less than 3584 bytes, this means that writes of
the last bitmap page hit the 512-byte emulation.

Respect the physical block boundary as long as the resulting write
doesn't run into other data, and is no longer than a page.  (If the
physical block size is larger than a page no bitmap write will respect
the physical block boundaries.)

Signed-off-by: Christopher Unkel 
---
 drivers/md/md-bitmap.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 600b89d5a3ad..21af5f94d495 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -264,10 +264,18 @@ static int write_sb_page(struct bitmap *bitmap, struct 
page *page, int wait)
 
if (page->index == store->file_pages-1) {
int last_page_size = store->bytes & (PAGE_SIZE-1);
+   int pb_aligned_size;
if (last_page_size == 0)
last_page_size = PAGE_SIZE;
size = roundup(last_page_size,
   bdev_logical_block_size(bdev));
+   pb_aligned_size = roundup(last_page_size,
+ 
bdev_physical_block_size(bdev));
+   if (pb_aligned_size > size
+   && pb_aligned_size <= PAGE_SIZE
+   && sb_write_alignment_ok(mddev, rdev, page, offset,
+pb_aligned_size))
+   size = pb_aligned_size;
}
/* Just make sure we aren't corrupting data or
 * metadata
-- 
2.17.1



[PATCH 2/3] md: factor sb write alignment check into function

2020-10-22 Thread Christopher Unkel
Refactor in preparation for a second use of the logic.

Signed-off-by: Christopher Unkel 
---
 drivers/md/md-bitmap.c | 72 +++---
 1 file changed, 40 insertions(+), 32 deletions(-)

diff --git a/drivers/md/md-bitmap.c b/drivers/md/md-bitmap.c
index 200c5d0f08bf..600b89d5a3ad 100644
--- a/drivers/md/md-bitmap.c
+++ b/drivers/md/md-bitmap.c
@@ -209,6 +209,44 @@ static struct md_rdev *next_active_rdev(struct md_rdev 
*rdev, struct mddev *mdde
return NULL;
 }
 
+static int sb_write_alignment_ok(struct mddev *mddev, struct md_rdev *rdev,
+struct page *page, int offset, int size)
+{
+   if (mddev->external) {
+   /* Bitmap could be anywhere. */
+   if (rdev->sb_start + offset + (page->index
+  * (PAGE_SIZE/512))
+   > rdev->data_offset
+   &&
+   rdev->sb_start + offset
+   < (rdev->data_offset + mddev->dev_sectors
++ (PAGE_SIZE/512)))
+   return 0;
+   } else if (offset < 0) {
+   /* DATA  BITMAP METADATA  */
+   if (offset
+   + (long)(page->index * (PAGE_SIZE/512))
+   + size/512 > 0)
+   /* bitmap runs in to metadata */
+   return 0;
+   if (rdev->data_offset + mddev->dev_sectors
+   > rdev->sb_start + offset)
+   /* data runs in to bitmap */
+   return 0;
+   } else if (rdev->sb_start < rdev->data_offset) {
+   /* METADATA BITMAP DATA */
+   if (rdev->sb_start
+   + offset
+   + page->index*(PAGE_SIZE/512) + size/512
+   > rdev->data_offset)
+   /* bitmap runs in to data */
+   return 0;
+   } else {
+   /* DATA METADATA BITMAP - no problems */
+   }
+   return 1;
+}
+
 static int write_sb_page(struct bitmap *bitmap, struct page *page, int wait)
 {
struct md_rdev *rdev;
@@ -234,38 +272,8 @@ static int write_sb_page(struct bitmap *bitmap, struct 
page *page, int wait)
/* Just make sure we aren't corrupting data or
 * metadata
 */
-   if (mddev->external) {
-   /* Bitmap could be anywhere. */
-   if (rdev->sb_start + offset + (page->index
-  * (PAGE_SIZE/512))
-   > rdev->data_offset
-   &&
-   rdev->sb_start + offset
-   < (rdev->data_offset + mddev->dev_sectors
-+ (PAGE_SIZE/512)))
-   goto bad_alignment;
-   } else if (offset < 0) {
-   /* DATA  BITMAP METADATA  */
-   if (offset
-   + (long)(page->index * (PAGE_SIZE/512))
-   + size/512 > 0)
-   /* bitmap runs in to metadata */
-   goto bad_alignment;
-   if (rdev->data_offset + mddev->dev_sectors
-   > rdev->sb_start + offset)
-   /* data runs in to bitmap */
-   goto bad_alignment;
-   } else if (rdev->sb_start < rdev->data_offset) {
-   /* METADATA BITMAP DATA */
-   if (rdev->sb_start
-   + offset
-   + page->index*(PAGE_SIZE/512) + size/512
-   > rdev->data_offset)
-   /* bitmap runs in to data */
-   goto bad_alignment;
-   } else {
-   /* DATA METADATA BITMAP - no problems */
-   }
+   if (!sb_write_alignment_ok(mddev, rdev, page, offset, size))
+   goto bad_alignment;
md_super_write(mddev, rdev,
   rdev->sb_start + offset
   + page->index * (PAGE_SIZE/512),
-- 
2.17.1



[PATCH 0/3] mdraid sb and bitmap write alignment on 512e drives

2020-10-22 Thread Christopher Unkel
Hello all,

While investigating some performance issues on mdraid 10 volumes
formed with "512e" disks (4k native/physical sector size but with 512
byte sector emulation), I've found two cases where mdraid will
needlessly issue writes that start on 4k byte boundary, but are are
shorter than 4k:

1. writes of the raid superblock; and
2. writes of the last page of the write-intent bitmap.

The following is an excerpt of a blocktrace of one of the component
members of a mdraid 10 volume during a 4k write near the end of the
array:

  8,32  112 0.01687   711  D  WS 2064 + 8 [kworker/11:1H]
* 8,32  115 0.001454119   711  D  WS 2056 + 1 [kworker/11:1H]
* 8,32  118 0.002847204   711  D  WS 2080 + 7 [kworker/11:1H]
  8,32  11   11 0.003700545  3094  D  WS 11721043920 + 8 [md127_raid1]
  8,32  11   14 0.308785692   711  D  WS 2064 + 8 [kworker/11:1H]
* 8,32  11   17 0.310201697   711  D  WS 2056 + 1 [kworker/11:1H]
  8,32  11   20 5.500799245   711  D  WS 2064 + 8 [kworker/11:1H]
* 8,32  11   2315.740923558   711  D  WS 2080 + 7 [kworker/11:1H]

Note the starred transactions, which each start on a 4k boundary, but
are less than 4k in length, and so will use the 512-byte emulation.
Sector 2056 holds the superblock, and is written as a single 512-byte
write.  Sector 2086 holds the bitmap bit relevant to the written
sector.  When it is written the active bits of the last page of the
bitmap are written, starting at sector 2080, padded out to the end of
the 512-byte logical sector as required.  This results in a 3.5kb
write, again using the 512-byte emulation.

Note that in some arrays the last page of the bitmap may be
sufficiently full that they are not affected by the issue with the
bitmap write.

As there can be a substantial penalty to using the 512-byte sector
emulation (turning writes into read-modify writes if the relevant
sector is not in the drive's cache) I believe it makes sense to pad
these writes out to a 4k boundary.  The writes are already padded out
for "4k native" drives, where the short access is illegal.

The following patch set changes the superblock and bitmap writes to
respect the physical block size (e.g. 4k for today's 512e drives) when
possible.  In each case there is already logic for padding out to the
underlying logical sector size.  I reuse or repeat the logic for
padding out to the physical sector size, but treat the padding out as
optional rather than mandatory.

The corresponding block trace with these patches is:

   8,32   12 0.03410   694  D  WS 2064 + 8 [kworker/1:1H]
   8,32   15 0.001368788   694  D  WS 2056 + 8 [kworker/1:1H]
   8,32   18 0.002727981   694  D  WS 2080 + 8 [kworker/1:1H]
   8,32   1   11 0.003533831  3063  D  WS 11721043920 + 8 [md127_raid1]
   8,32   1   14 0.253952321   694  D  WS 2064 + 8 [kworker/1:1H]
   8,32   1   17 0.255354215   694  D  WS 2056 + 8 [kworker/1:1H]
   8,32   1   20 5.337938486   694  D  WS 2064 + 8 [kworker/1:1H]
   8,32   1   2315.577963062   694  D  WS 2080 + 8 [kworker/1:1H]

I do notice that the code for bitmap writes has a more sophisticated
and thorough check for overlap than the code for superblock writes.
(Compare write_sb_page in md-bitmap.c vs. super_1_load in md.c.) From
what I know since the various structures starts have always been 4k
aligned anyway, it is always safe to pad the superblock write out to
4k (as occurs on 4k native drives) but not necessarily futher.

Feedback appreciated.

  --Chris


Christopher Unkel (3):
  md: align superblock writes to physical blocks
  md: factor sb write alignment check into function
  md: pad writes to end of bitmap to physical blocks

 drivers/md/md-bitmap.c | 80 +-
 drivers/md/md.c| 15 
 2 files changed, 63 insertions(+), 32 deletions(-)

-- 
2.17.1



[PATCH 1/3] md: align superblock writes to physical blocks

2020-10-22 Thread Christopher Unkel
Writes of the md superblock are aligned to the logical blocks of the
containing device, but no attempt is made to align them to physical
block boundaries.  This means that on a "512e" device (4k physical, 512
logical) every superblock update hits the 512-byte emulation and the
possible associated performance penalty.

Respect the physical block alignment when possible.

Signed-off-by: Christopher Unkel 
---
 drivers/md/md.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 98bac4f304ae..2b42850acfb3 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -1732,6 +1732,21 @@ static int super_1_load(struct md_rdev *rdev, struct 
md_rdev *refdev, int minor_
&& rdev->new_data_offset < sb_start + (rdev->sb_size/512))
return -EINVAL;
 
+   /* Respect physical block size if feasible. */
+   bmask = queue_physical_block_size(rdev->bdev->bd_disk->queue)-1;
+   if (!((rdev->sb_start * 512) & bmask) && (rdev->sb_size & bmask)) {
+   int candidate_size = (rdev->sb_size | bmask) + 1;
+
+   if (minor_version) {
+   int sectors = candidate_size / 512;
+
+   if (rdev->data_offset >= sb_start + sectors
+   && rdev->new_data_offset >= sb_start + sectors)
+   rdev->sb_size = candidate_size;
+   } else if (bmask <= 4095)
+   rdev->sb_size = candidate_size;
+   }
+
if (sb->level == cpu_to_le32(LEVEL_MULTIPATH))
rdev->desc_nr = -1;
else
-- 
2.17.1