Re: [RFC] md: make queue limits depending on limits of RAID members

2017-11-30 Thread Martin K. Petersen

Mariusz,

>>> One potential solution for that is to change type of some queue limits
>>> (at least discard_granularity and discard_alignment, maybe more, like
>>> max_discard_sectors?) from 32 bits to 64 bits.

You're still restricted by the maximum bio size, which is 32 bits on
32-bit platforms.

> Let me give you an example with io_min (it also applies to other
> parameters): we've got a RAID 0 with 2 disks and a 32KiB chunk size.
> Each member disk reports io_min=64KiB. blk_stack_limits takes the max
> of the chunk size and the members' io_min
>   t->io_min = max(t->io_min, b->io_min);
> so the new array's io_min will be 64KiB. Such a request will be split
> across both array members; each device will get a 32KiB request, which
> is not enough to meet its minimum I/O size requirement.

When this was written, there was a clear benefit in giving priority to
the MD/DM topology over any values reported by the hardware.

I.e. the performance was much better when honoring the MD stripe chunk
and stripe size over any values reported by the underlying devices. This
was true both for disk drives and for RAID devices that reported the
relevant limits.

That's why the stacking works the way it does.

For your particular example, I'd say that if your device reports an
io_min of 64KB, then it's a user error to create an MD device with a
stripe chunk of 32KB. mdadm should discourage creating such a
configuration.
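
To make the mismatch concrete, here is a minimal user-space sketch of
the stacking rule quoted above, using the 2-disk, 32KiB-chunk numbers
from the example (plain C, not kernel code; stack_io_min is a stand-in
for the kernel's max()):

#include <stdio.h>

/* Mirrors the quoted stacking rule: t->io_min = max(t->io_min, b->io_min) */
static unsigned int stack_io_min(unsigned int t, unsigned int b)
{
	return t > b ? t : b;
}

int main(void)
{
	unsigned int chunk = 32 * 1024;          /* RAID 0 chunk: 32 KiB       */
	unsigned int member_io_min = 64 * 1024;  /* each member: io_min 64 KiB */
	unsigned int array_io_min = stack_io_min(chunk, member_io_min);

	printf("array io_min: %u KiB\n", array_io_min / 1024);

	/* A 64 KiB request is striped over 2 members, so each member only
	 * sees 32 KiB -- half of its own io_min. */
	printf("per-member I/O: %u KiB (member io_min is %u KiB)\n",
	       array_io_min / 2 / 1024, member_io_min / 1024);
	return 0;
}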

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [RFC] md: make queue limits depending on limits of RAID members

2017-11-30 Thread Mariusz Dabrowski

Hi,

do you have any comments on this?

Regards,
Mariusz


Re: [RFC] md: make queue limits depending on limits of RAID members

2017-11-22 Thread Mariusz Dabrowski

On 11/17/2017 07:40 PM, Shaohua Li wrote:

On Wed, Nov 15, 2017 at 11:25:12AM +0100, Mariusz Dabrowski wrote:

Hi all,

In order to be consistent with pass-through drive behavior, RAID queue
limits should be calculated so that the minimal I/O, optimal I/O, and
discard granularity sizes are met from a single drive's perspective.
Currently the MD driver ignores the queue limits reported by its members
and bases all calculations on the chunk (stripe) size.


We do use blk_stack_limits, which takes care of these.


I am working on a patch which changes this. Since kernel 4.13, drives
which support streams (write hints) may report a discard_alignment or
discard_granularity bigger than the array stripe (for details, see the
NVMe 1.3 specification, chapter 9.3 Streams), so the result of the
current calculation is not optimal for those drives. For example, for a
4-disk RAID 0 with a chunk size smaller than the discard_granularity of
the member drives, the discard_granularity of the RAID array should be
equal to 4*member_discard_granularity.
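
A quick worked example of that scaling, with hypothetical numbers chosen
only for illustration:

#include <stdio.h>

int main(void)
{
	/* Hypothetical 4-disk RAID 0 whose members report a large,
	 * stream-derived discard granularity. */
	unsigned int nr_disks = 4;
	unsigned int chunk = 128 * 1024;             /* 128 KiB chunk    */
	unsigned long long member_gran = 1ULL << 20; /* 1 MiB per member */

	/* One array-level discard granule has to cover a full granule on
	 * every member, so it is nr_disks * member_gran. */
	unsigned long long array_gran = nr_disks * member_gran;

	printf("chunk %u KiB, member granularity %llu KiB -> "
	       "array granularity %llu KiB\n",
	       chunk / 1024, member_gran / 1024, array_gran / 1024);
	return 0;
}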

The problem is that for some drives there is a risk that the result of
this calculation exceeds the range of an unsigned int. The current
implementation of nvme_config_discard in the NVMe driver can suffer from
the same problem:

	if (ctrl->nr_streams && ns->sws && ns->sgs) {
		/* the product is evaluated as unsigned int and can wrap */
		unsigned int sz = logical_block_size * ns->sws * ns->sgs;

		ns->queue->limits.discard_alignment = sz;
		ns->queue->limits.discard_granularity = sz;
	}
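
To see how that multiplication can wrap, a small stand-alone sketch with
hypothetical (but not unrealistic) stream parameters:

#include <stdio.h>

int main(void)
{
	/* Hypothetical values: 4 KiB logical blocks, sws = 256 blocks
	 * (1 MiB stream write size), sgs = 8192 units of sws.
	 * True size: 4096 * 256 * 8192 = 2^33 bytes = 8 GiB. */
	unsigned int logical_block_size = 4096;
	unsigned int sws = 256, sgs = 8192;

	unsigned int sz32 = logical_block_size * sws * sgs;  /* wraps to 0 */
	unsigned long long sz64 =
		(unsigned long long)logical_block_size * sws * sgs;

	printf("32-bit result: %u, 64-bit result: %llu\n", sz32, sz64);
	return 0;
}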

One potential solution for that is to change type of some queue limits
(at least discard_granularity and discard_alignment, maybe more, like
max_discard_sectors?) from 32 bits to 64 bits.

What are your thoughts about this? Do you expect any trouble with
changing this in the block layer? Would you be willing to make such a
change?

Signed-off-by: Mariusz Dabrowski 
---

 drivers/md/md.c | 24 
 drivers/md/md.h |  1 +
 drivers/md/raid0.c  | 23 ---
 drivers/md/raid10.c | 30 +-
 drivers/md/raid5.c  | 25 +++--
 5 files changed, 89 insertions(+), 14 deletions(-)

diff --git a/drivers/md/md.c b/drivers/md/md.c
index 0ff1bbf6c90e..e0dc114cab19 100644
--- a/drivers/md/md.c
+++ b/drivers/md/md.c
@@ -66,6 +66,7 @@
 #include 
 #include 
 #include 
+#include <linux/lcm.h>

 #include 
 #include "md.h"
@@ -679,6 +680,29 @@ struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr)
 }
 EXPORT_SYMBOL_GPL(md_find_rdev_nr_rcu);

+void md_rdev_get_queue_limits(struct mddev *mddev, struct queue_limits *limits)
+{
+   struct md_rdev *rdev;
+
+   memset(limits, 0, sizeof(struct queue_limits));
+
+   rdev_for_each(rdev, mddev) {
+   struct queue_limits *rdev_limits;
+
+   rdev_limits = &rdev->bdev->bd_queue->limits;
+   limits->io_min = max(limits->io_min, rdev_limits->io_min);
+   limits->io_opt = lcm_not_zero(limits->io_opt,
+ rdev_limits->io_opt);
+   limits->discard_granularity =
+   max(limits->discard_granularity,
+   rdev_limits->discard_granularity);
+   limits->discard_alignment =
+   max(limits->discard_alignment,
+   rdev_limits->discard_alignment);
+   }
+}


isn't this exactly what blk_stack_limits does?


Yes, but the limits of the member devices have to be known before
calling blk_stack_limits in order to calculate the values for the array;
a sketch of the conventional pattern follows.
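
For reference, the conventional stacking pattern in md of this era looks
roughly like the fragment below (a sketch against the 4.14-era block
API, not part of this patch):

/* Reset the array's limits to stacking defaults, then fold in each
 * member's limits; disk_stack_limits() calls blk_stack_limits()
 * underneath. */
struct md_rdev *rdev;

blk_set_stacking_limits(&mddev->queue->limits);
rdev_for_each(rdev, mddev)
	disk_stack_limits(mddev->gendisk, rdev->bdev,
			  rdev->data_offset << 9);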



+EXPORT_SYMBOL_GPL(md_rdev_get_queue_limits);
+
 static struct md_rdev *find_rdev(struct mddev *mddev, dev_t dev)
 {
struct md_rdev *rdev;
diff --git a/drivers/md/md.h b/drivers/md/md.h
index d8287d3cd1bf..5b514b9bb535 100644
--- a/drivers/md/md.h
+++ b/drivers/md/md.h
@@ -702,6 +702,7 @@ extern void md_reload_sb(struct mddev *mddev, int raid_disk);
 extern void md_update_sb(struct mddev *mddev, int force);
 extern void md_kick_rdev_from_array(struct md_rdev * rdev);
 struct md_rdev *md_find_rdev_nr_rcu(struct mddev *mddev, int nr);
+void md_rdev_get_queue_limits(struct mddev *mddev, struct queue_limits *limits);

 static inline void rdev_dec_pending(struct md_rdev *rdev, struct mddev *mddev)
 {
diff --git a/drivers/md/raid0.c b/drivers/md/raid0.c
index 5a00fc118470..f1dcf7fd3eb1 100644
--- a/drivers/md/raid0.c
+++ b/drivers/md/raid0.c
@@ -382,15 +382,31 @@ static int raid0_run(struct mddev *mddev)
if (mddev->queue) {
struct md_rdev *rdev;
bool discard_supported = false;
+   struct queue_limits limits;
+   unsigned int io_min, io_opt;
+   unsigned int granularity, alignment;
+   unsigned int chunk_size = mddev->chunk_sectors << 9;

+   md_rdev_get_queue_limits(mddev, &limits);
blk_queue_max_hw_sectors(mddev->queue, mddev->chunk_sectors);
	blk_queue_max_write_same_sectors(mddev->queue, mddev->chunk_sectors);
	blk_queue_max_write_zeroes_sectors(mddev->queue, mddev->chunk_sectors);
