Re: [RFC PATCH v2] blk-mq: Introduce per sw queue time-slice

2016-02-10 Thread Andreas Herrmann
On Wed, Feb 10, 2016 at 08:47:15PM +0100, Markus Trippelsdorf wrote:
> On 2016.02.10 at 20:34 +0100, Andreas Herrmann wrote:
> > On Tue, Feb 09, 2016 at 06:41:56PM +0100, Markus Trippelsdorf wrote:
> > > > Recently Johannes sent a patch to enable scsi-mq per driver, see
> > > > http://marc.info/?l=linux-scsi&m=145347009631192&w=2
> > > > 
> > > > Probably that is a good solution (at least in the short term) to allow
> > > > users to switch to blk-mq for some host adapters (with fast storage
> > > > attached) but to stick to legacy stuff on other host adapters with
> > > > rotary devices.
> > > 
> > > I don't think that Johannes' patch is a good solution.
> > 
> > Why? Because it's not per device?
> 
> Yes. Like Christoph said in his reply to the patch: »The host is simply
> the wrong place to decide these things.«
> 
> > > The best solution for the user would be if blk-mq could be toggled
> > > per drive (or even automatically enabled if queue/rotational == 0).
> > 
> > Yes, I agree, but ...
> > 
> > > Is there a fundamental reason why this is not feasible?
> > 
> > ... it's not possible (*) with the current implementation.
> > 
> > Tag handling/command allocation differs. Respective functions are set
> > per host.
> > 
> > (*) Or maybe it's possible but just hard to achieve and I didn't look
> > long enough into relevant code to get an idea how to do it.
> > 
> > > Your solution is better than nothing, but it requires that the user
> > > finds out the drive <=> host mapping by hand and then runs something
> > > like: 
> > > echo "250" > 
> > > /sys/devices/pci:00/:00:11.0/ata2/host1/target1:0:0/1:0:0:0/block/sdb/mq/0/time_slice_us
> > > during boot for spinning rust drives...
> > 
> > Or it could be set automatically for rotational devices.
> > (Once we know for sure that it doesn't cause performance degradation.)
> 
> Yes, this sounds like a good idea.
> 
> But, if I understand things correctly, your patch is only an interim
> solution until proper I/O scheduler support gets implemented for blk-mq, no?

That's to be discussed (hence the RFC).

My (potentially wrong) claims are:

- I don't think that fast storage (e.g. SSDs) requires I/O scheduler
  support with blk-mq. blk-mq is very good at pushing a large number
  of requests from per-CPU sw queues to hw queue(s). Why then
  introduce any overhead for I/O scheduler support?

- Slow storage (e.g. spinning drives) is fine with the old code which
  provides scheduler support and I doubt that there is any benefit for
  those devices when switching to blk-mq.

- The big hammer (scsi_mod.use_blk_mq) that decides for the entire
  SCSI stack what to use is suboptimal. You can't have optimal
  performance when you have both slow and fast storage devices in
  your system.

I doubt that it is possible to add I/O scheduling support to blk-mq
which can be on par with what CFQ is able to achieve for slow devices
at the moment.

Requests are scattered among per-CPU software queues (and almost
instantly passed to hardware queue(s)). Due to CPU scheduling,
requests initiated from one process might come down via different
software queues. What is an efficient way to sort/merge requests from
all the software queues in such a way that the result is comparable to
what CFQ does (assuming that CFQ provides optimal performance)? So far
I didn't find a solution to this problem. (I just have this patch,
which doesn't add too much overhead and improves the situation a
bit.)
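
To make that concrete, here is a rough userspace sketch of the
time-slice idea (purely illustrative, with made-up names, not the
actual blk-mq code): stick with one software queue until its slice
expires or it runs empty, then move on round-robin.

/*
 * Hypothetical sketch only: simulate per-sw-queue time slices in
 * userspace.  Stay on the current "software queue" until its slice
 * expires or it is empty, then advance round-robin to the next one.
 */
#include <stdio.h>
#include <time.h>

#define NR_SW_QUEUES	4
#define SLICE_NS	(250 * 1000LL)	/* 250 us, as used in the tests */

struct sw_queue {
	int pending;			/* simulated number of queued requests */
};

static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static void dispatch_one(struct sw_queue *q, int idx)
{
	/* stand-in for moving one request to the hardware queue */
	q->pending--;
	printf("dispatch from sw queue %d (%d left)\n", idx, q->pending);
}

int main(void)
{
	struct sw_queue queues[NR_SW_QUEUES] = {
		{ 3 }, { 5 }, { 2 }, { 4 },
	};
	int cur = 0, remaining = 3 + 5 + 2 + 4;

	while (remaining > 0) {
		long long slice_end = now_ns() + SLICE_NS;

		/* service only this queue until slice expiry or it runs dry */
		while (queues[cur].pending > 0 && now_ns() < slice_end) {
			dispatch_one(&queues[cur], cur);
			remaining--;
		}
		cur = (cur + 1) % NR_SW_QUEUES;	/* round-robin to next queue */
	}
	return 0;
}

(The real patch additionally extends the slice, up to three times the
initial value, when requests were actually serviced; see the v2
changelog.)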

Maybe the solution is to avoid per-CPU queues for slow storage and
fall back to a set of queues comparable to what CFQ uses.

One way to do this is to fall back to the non-blk-mq code and use CFQ
directly.

Code that allows selecting blk-mq per host would help to some extent,
but it doesn't help when both device types are connected to the same
host adapter.


Andreas


Re: [RFC PATCH v2] blk-mq: Introduce per sw queue time-slice

2016-02-10 Thread Markus Trippelsdorf
On 2016.02.10 at 20:34 +0100, Andreas Herrmann wrote:
> On Tue, Feb 09, 2016 at 06:41:56PM +0100, Markus Trippelsdorf wrote:
> > > Recently Johannes sent a patch to enable scsi-mq per driver, see
> > > http://marc.info/?l=linux-scsi&m=145347009631192&w=2
> > > 
> > > Probably that is a good solution (at least in the short term) to allow
> > > users to switch to blk-mq for some host adapters (with fast storage
> > > attached) but to stick to legacy stuff on other host adapters with
> > > rotary devices.
> > 
> > I don't think that Johannes' patch is a good solution.
> 
> Why? Because it's not per device?

Yes. Like Christoph said in his reply to the patch: »The host is simply
the wrong place to decide these things.«

> > The best solution for the user would be if blk-mq could be toggled
> > per drive (or even automatically enabled if queue/rotational == 0).
> 
> Yes, I agree, but ...
> 
> > Is there a fundamental reason why this is not feasible?
> 
> ... it's not possible (*) with the current implementation.
> 
> Tag handling/command allocation differs. Respective functions are set
> per host.
> 
> (*) Or maybe it's possible but just hard to achieve and I didn't look
> long enough into relevant code to get an idea how to do it.
> 
> > Your solution is better than nothing, but it requires that the user
> > finds out the drive <=> host mapping by hand and then runs something
> > like: 
> > echo "250" > 
> > /sys/devices/pci:00/:00:11.0/ata2/host1/target1:0:0/1:0:0:0/block/sdb/mq/0/time_slice_us
> > during boot for spinning rust drives...
> 
> Or it could be set automatically for rotational devices.
> (Once we know for sure that it doesn't cause performance degradation.)

Yes, this sounds like a good idea.

But, if I understand things correctly, your patch is only an interim
solution until proper I/O scheduler support gets implemented for blk-mq, no?

-- 
Markus


Re: [RFC PATCH v2] blk-mq: Introduce per sw queue time-slice

2016-02-10 Thread Andreas Herrmann
On Tue, Feb 09, 2016 at 06:41:56PM +0100, Markus Trippelsdorf wrote:
> On 2016.02.09 at 18:12 +0100, Andreas Herrmann wrote:
> > [CC-ing linux-block and linux-scsi and adding some comments]
> > 
> > On Mon, Feb 01, 2016 at 11:43:40PM +0100, Andreas Herrmann wrote:
> > > This introduces a new blk_mq hw attribute time_slice_us which allows
> > > specifying a time slice in usecs.
> > > 
> > > Fio test results are sent in a separate mail in reply to this one.
> > 
> > See http://marc.info/?l=linux-kernel&m=145436682607949&w=2
> > 
> > In short it shows significant performance gains in some tests,
> > e.g. sequential read iops up by >40% with 8 jobs. But it's never on
> > par with CFQ when more than 1 job was used during the test.
> > 
> > > Results for fio improved to some extent with this patch. But in
> > > reality the picture is quite mixed. Performance is highly dependent on
> > > task scheduling. There is no guarantee that the requests originating
> > > from one CPU belong to the same process.
> > > 
> > > I think for rotary devices CFQ is by far the best choice. A simple
> > > illustration is:
> > > 
> > >   Copying two files (750MB in this case) in parallel on a rotary
> > >   device. The elapsed wall clock time (seconds) for this is
> > >                                  mean    stdev
> > >    cfq, slice_idle=8            16.18     4.95
> > >    cfq, slice_idle=0            23.74     2.82
> > >    blk-mq, time_slice_usec=0    24.37     2.05
> > >    blk-mq, time_slice_usec=250  25.58     3.16
> > 
> > This illustrates that although there was a performance gain with the
> > fio tests, the patch can cause higher variance and lower performance
> > compared to unmodified blk-mq with other tests. And it underscores
> > the superiority of CFQ for rotary disks.
> > 
> > Meanwhile my opinion is that it's not really worth looking further
> > into introducing I/O scheduling support in blk-mq. I don't see the
> > need for scheduling support (deadline or something else) for fast
> > storage devices. And rotary devices should really avoid blk-mq
> > and stick to CFQ.
> > 
> > Thus I think that introducing some coexistence of blk-mq and the
> > legacy block layer with CFQ is the best option.
> > 
> > Recently Johannes sent a patch to enable scsi-mq per driver, see
> > http://marc.info/?l=linux-scsi&m=145347009631192&w=2
> > 
> > Probably that is a good solution (at least in the short term) to allow
> > users to switch to blk-mq for some host adapters (with fast storage
> > attached) but to stick to legacy stuff on other host adapters with
> > rotary devices.
> 
> I don't think that Johannes' patch is a good solution.

Why? Because it's not per device?

> The best solution for the user would be if blk-mq could be toggled
> per drive (or even automatically enabled if queue/rotational == 0).

Yes, I agree, but ...

> Is there a fundamental reason why this is not feasible?

... it's not possible (*) with the current implementation.

Tag handling/command allocation differs. Respective functions are set
per host.

(*) Or maybe it's possible but just hard to achieve and I didn't look
long enough into relevant code to get an idea how to do it.

> Your solution is better than nothing, but it requires that the user
> finds out the drive <=> host mapping by hand and then runs something
> like: 
> echo "250" > 
> /sys/devices/pci:00/:00:11.0/ata2/host1/target1:0:0/1:0:0:0/block/sdb/mq/0/time_slice_us
> during boot for spinning rust drives...

Or it could be set automatically for rotational devices.
(Once we know for sure that it doesn't cause performance degradation.)
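
Just to illustrate what setting it automatically could look like: a
hypothetical boot-time helper (not part of the patch) that checks the
standard queue/rotational flag and writes the time_slice_us attribute
added by this patch for every hctx of a rotational disk, using the
250 us value from the measurements:

/*
 * Hypothetical helper, not part of the patch: for every rotational
 * disk, write 250 into each mq/<n>/time_slice_us attribute (the
 * attribute this patch introduces).  Everything else is standard sysfs.
 */
#include <dirent.h>
#include <stdio.h>

static int is_rotational(const char *disk)
{
	char path[256];
	int rot = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/%s/queue/rotational", disk);
	f = fopen(path, "r");
	if (!f)
		return 0;
	if (fscanf(f, "%d", &rot) != 1)
		rot = 0;
	fclose(f);
	return rot;
}

static void set_time_slice(const char *disk, unsigned int usecs)
{
	char mqdir[256], attr[320];
	struct dirent *d;
	DIR *dir;
	FILE *f;

	snprintf(mqdir, sizeof(mqdir), "/sys/block/%s/mq", disk);
	dir = opendir(mqdir);
	if (!dir)
		return;		/* device not driven by blk-mq (or no patch) */
	while ((d = readdir(dir))) {
		if (d->d_name[0] == '.')
			continue;
		snprintf(attr, sizeof(attr), "%s/%s/time_slice_us",
			 mqdir, d->d_name);
		f = fopen(attr, "w");
		if (f) {
			fprintf(f, "%u\n", usecs);
			fclose(f);
		}
	}
	closedir(dir);
}

int main(void)
{
	struct dirent *d;
	DIR *dir = opendir("/sys/block");

	if (!dir)
		return 1;
	while ((d = readdir(dir))) {
		if (d->d_name[0] == '.')
			continue;
		if (is_rotational(d->d_name))
			set_time_slice(d->d_name, 250);
	}
	closedir(dir);
	return 0;
}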

> -- 
> Markus

Andreas


Re: [RFC PATCH v2] blk-mq: Introduce per sw queue time-slice

2016-02-09 Thread Markus Trippelsdorf
On 2016.02.09 at 18:12 +0100, Andreas Herrmann wrote:
> [CC-ing linux-block and linux-scsi and adding some comments]
> 
> On Mon, Feb 01, 2016 at 11:43:40PM +0100, Andreas Herrmann wrote:
> > This introduces a new blk_mq hw attribute time_slice_us which allows
> > specifying a time slice in usecs.
> > 
> > Fio test results are sent in a separate mail in reply to this one.
> 
> See http://marc.info/?l=linux-kernel&m=145436682607949&w=2
> 
> In short it shows significant performance gains in some tests,
> e.g. sequential read iops up by >40% with 8 jobs. But it's never on
> par with CFQ when more than 1 job was used during the test.
> 
> > Results for fio improved to some extent with this patch. But in
> > reality the picture is quite mixed. Performance is highly dependent on
> > task scheduling. There is no guarantee that the requests originating
> > from one CPU belong to the same process.
> > 
> > I think for rotary devices CFQ is by far the best choice. A simple
> > illustration is:
> > 
> >   Copying two files (750MB in this case) in parallel on a rotary
> >   device. The elapsed wall clock time (seconds) for this is
> >                                  mean    stdev
> >    cfq, slice_idle=8            16.18     4.95
> >    cfq, slice_idle=0            23.74     2.82
> >    blk-mq, time_slice_usec=0    24.37     2.05
> >    blk-mq, time_slice_usec=250  25.58     3.16
> 
> This illustrates that although there was a performance gain with the
> fio tests, the patch can cause higher variance and lower performance
> compared to unmodified blk-mq with other tests. And it underscores
> the superiority of CFQ for rotary disks.
> 
> Meanwhile my opinion is that it's not really worth looking further
> into introducing I/O scheduling support in blk-mq. I don't see the
> need for scheduling support (deadline or something else) for fast
> storage devices. And rotary devices should really avoid blk-mq
> and stick to CFQ.
> 
> Thus I think that introducing some coexistence of blk-mq and the
> legacy block layer with CFQ is the best option.
> 
> Recently Johannes sent a patch to enable scsi-mq per driver, see
> http://marc.info/?l=linux-scsi&m=145347009631192&w=2
> 
> Probably that is a good solution (at least in the short term) to allow
> users to switch to blk-mq for some host adapters (with fast storage
> attached) but to stick to legacy stuff on other host adapters with
> rotary devices.

I don't think that Johannes' patch is a good solution.

The best solution for the user would be if blk-mq could be toggled per
drive (or even automatically enabled if queue/rotational == 0). Is there
a fundamental reason why this is not feasible?

Your solution is better than nothing, but it requires that the user
finds out the drive <=> host mapping by hand and then runs something
like: 
echo "250" > 
/sys/devices/pci:00/:00:11.0/ata2/host1/target1:0:0/1:0:0:0/block/sdb/mq/0/time_slice_us
during boot for spinning rust drives...
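
For illustration, the drive <=> host mapping can at least be read
straight out of sysfs; a minimal helper (hypothetical, nothing the
patch provides) might look like:

/*
 * Hypothetical helper: print where /sys/block/<disk> really points,
 * e.g. "./blkpath sdb".  The resolved path contains the ataN/hostN
 * components, so the host adapter the disk hangs off is visible.
 */
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char link[PATH_MAX], target[PATH_MAX];
	ssize_t n;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <disk, e.g. sdb>\n", argv[0]);
		return 1;
	}
	snprintf(link, sizeof(link), "/sys/block/%s", argv[1]);

	n = readlink(link, target, sizeof(target) - 1);
	if (n < 0) {
		perror("readlink");
		return 1;
	}
	target[n] = '\0';
	printf("%s -> %s\n", link, target);
	return 0;
}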

-- 
Markus


Re: [RFC PATCH v2] blk-mq: Introduce per sw queue time-slice

2016-02-09 Thread Andreas Herrmann
[CC-ing linux-block and linux-scsi and adding some comments]

On Mon, Feb 01, 2016 at 11:43:40PM +0100, Andreas Herrmann wrote:
> This introduces a new blk_mq hw attribute time_slice_us which allows
> specifying a time slice in usecs.
> 
> Default value is 0 and implies no modification to blk-mq behaviour.
> 
> A positive value changes blk-mq to service only one software queue
> within this time slice until it expires or the software queue is
> empty. Then the next software queue with pending requests is selected.
> 
> Signed-off-by: Andreas Herrmann 
> ---
>  block/blk-mq-sysfs.c   |  27 +++
>  block/blk-mq.c | 208 +
>  include/linux/blk-mq.h |   9 +++
>  3 files changed, 211 insertions(+), 33 deletions(-)
> 
> Hi,
> 
> This update is long overdue (sorry for the delay).
> 
> Change to v1:
> - time slice is now specified in usecs instead of msecs.
> - time slice is extended (up to 3 times the initial value) when there
>   was actually a request to be serviced for the software queue
> 
> Fio test results are sent in a separate mail in reply to this one.

See http://marc.info/?l=linux-kernel&m=145436682607949&w=2

In short it shows significant performance gains in some tests,
e.g. sequential read iops up by >40% with 8 jobs. But it's never on
par with CFQ when more than 1 job was used during the test.

> Results for fio improved to some extent with this patch. But in
> reality the picture is quite mixed. Performance is highly dependent on
> task scheduling. There is no guarantee that the requests originating
> from one CPU belong to the same process.
> 
> I think for rotary devices CFQ is by far the best choice. A simple
> illustration is:
> 
>   Copying two files (750MB in this case) in parallel on a rotary
>   device. The elapsed wall clock time (seconds) for this is
>                                  mean    stdev
>    cfq, slice_idle=8            16.18     4.95
>    cfq, slice_idle=0            23.74     2.82
>    blk-mq, time_slice_usec=0    24.37     2.05
>    blk-mq, time_slice_usec=250  25.58     3.16

This illustrates that although there was a performance gain with the
fio tests, the patch can cause higher variance and lower performance
compared to unmodified blk-mq with other tests. And it underscores
the superiority of CFQ for rotary disks.

Meanwhile my opinion is that it's not really worth looking further
into introducing I/O scheduling support in blk-mq. I don't see the
need for scheduling support (deadline or something else) for fast
storage devices. And rotary devices should really avoid blk-mq
and stick to CFQ.

Thus I think that introducing some coexistence of blk-mq and the
legacy block layer with CFQ is the best option.

Recently Johannes sent a patch to enable scsi-mq per driver, see
http://marc.info/?l=linux-scsi&m=145347009631192&w=2

Probably that is a good solution (at least in the short term) to allow
users to switch to blk-mq for some host adapters (with fast storage
attached) but to stick to legacy stuff on other host adapters with
rotary devices.

What do others think?


Thanks,

Andreas


> Regards,
> 
> Andreas
> 
> diff --git a/block/blk-mq-sysfs.c b/block/blk-mq-sysfs.c
> index 1cf1878..77c875c 100644
> --- a/block/blk-mq-sysfs.c
> +++ b/block/blk-mq-sysfs.c
> @@ -247,6 +247,26 @@ static ssize_t blk_mq_hw_sysfs_cpus_show(struct blk_mq_hw_ctx *hctx, char *page)
>   return ret;
>  }
>  
> +static ssize_t blk_mq_hw_sysfs_tslice_show(struct blk_mq_hw_ctx *hctx,
> +   char *page)
> +{
> + return sprintf(page, "%u\n", hctx->tslice_us);
> +}
> +
> +static ssize_t blk_mq_hw_sysfs_tslice_store(struct blk_mq_hw_ctx *hctx,
> + const char *page, size_t length)
> +{
> + unsigned long long store;
> + int err;
> +
> + err = kstrtoull(page, 10, &store);
> + if (err)
> + return -EINVAL;
> +
> + hctx->tslice_us = (unsigned)store;
> + return length;
> +}
> +
>  static struct blk_mq_ctx_sysfs_entry blk_mq_sysfs_dispatched = {
>   .attr = {.name = "dispatched", .mode = S_IRUGO },
>   .show = blk_mq_sysfs_dispatched_show,
> @@ -305,6 +325,12 @@ static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_poll = {
>   .show = blk_mq_hw_sysfs_poll_show,
>  };
>  
> +static struct blk_mq_hw_ctx_sysfs_entry blk_mq_hw_sysfs_tslice = {
> + .attr = {.name = "time_slice_us", .mode = S_IRUGO | S_IWUSR },
> + .show = blk_mq_hw_sysfs_tslice_show,
> + .store = blk_mq_hw_sysfs_tslice_store,
> +};
> +
>  static struct attribute *default_hw_ctx_attrs[] = {
>   &blk_mq_hw_sysfs_queued.attr,
>   &blk_mq_hw_sysfs_run.attr,
> @@ -314,6 +340,7 @@ static struct attribute *default_hw_ctx_attrs[] = {
>   &blk_mq_hw_sysfs_cpus.attr,
>   &blk_mq_hw_sysfs_active.attr,
>   &blk_mq_hw_sysfs_poll.attr,
> + &blk_mq_hw_sysfs_tslice.attr,
>   NULL,
>  };
>  
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 4c0622f..97d32f2 100644
> --- a/block/blk-mq.c
> 


Re: [RFC PATCH v2] blk-mq: Introduce per sw queue time-slice

2016-02-01 Thread Andreas Herrmann
Following are benchmark results for the patch, gathered with fio.

 - kernel version 4.5.0-rc2-0001 (i.e. with the time-slice patch)
 - Intel Core i7-3770S CPU @ 3.10GHz system, 4 cores, 8 threads/CPUs
 - fio version as of 2.2.9-37-g0e1c4
 - results were gathered by iterating over the rw and numjobs parameters, e.g.:

fio --directory=$1 --rw=$j --name=fio-scaling --size=5G --group_reporting \
--ioengine=libaio --direct=1 --iodepth=1 --runtime=$2 --numjobs=$i

 - rotational device was a 1500GB Samsung HD155UI

ata3.00: ATA-8: SAMSUNG HD155UI, 1AQ10001, max UDMA/133
scsi 2:0:0:0: Direct-Access ATA  SAMSUNG HD155UI  0001 PQ: 0 ANSI: 5

 - no regression between blk-mq on a vanilla v4.5-rc2 kernel and the new
   blk-mq with time_slice_us=0
 - I omitted random read/write/rw tests as there was no significant
   difference
 - time_slice_us=250 seems to be the optimal choice
   (I compared 1000 µs, 500 µs, 250 µs and 50 µs)

 (1) cfq, slice_idle=8 (default)
 (2) cfq, slice_idle=0
 (3) blk-mq, time_slice_us=0 (behaviour matches unmodified mainline) 
 (4) blk-mq, time_slice_us=250  (250 µs) (time slice incremented)

 -----------------------------------------------------------------------
 numjobs    cfq (1)          cfq (2)          blk-mq (3)       blk-mq (4)
                                              0 µs             250 µs
 -----------------------------------------------------------------------
read iops [mean (stdev)]
 -----------------------------------------------------------------------
 1 17394.6 (470.48) 17336.4 (382.88) 20324.5 ( 72.75) 18914.5 (324.23)
 2 15862.1 (416.38)  8427.8 (104.45)  8373.2 ( 71.97) 13216.6 (964.83)
 3 16597.1 (246.77)  7078.0 (107.29)  8671.3 (238.12) 12370.2 (276.76)
 4 15078.6 (132.34)  8589.8 (169.00)  9107.5 (328.00) 11582.0 (225.09)
 5 14392.2 ( 75.55)  5908.5 (120.76)  6052.0 (102.01) 10906.3 (435.44)
 6 13540.2 (142.83)  5902.9 (122.41)  5944.5 ( 90.23)  9758.9 (374.65)
 7 13112.4 (175.02)  5938.7 ( 99.70)  5982.2 (104.44)  9233.0 (616.16)
 8 12651.0 (187.11)  5955.9 (181.99)  6010.2 (105.62)  8908.5 (399.25)
 -----------------------------------------------------------------------
write iops [mean (stdev)]
 -----------------------------------------------------------------------
 1 15233.6 (236.28) 15062.4 (176.56) 17141.0 (107.48) 16236.8 (128.94)
 2 15333.5 (474.39)  9860.1 ( 58.42) 10020.0 ( 38.67) 15027.4 (288.07)
 3 15753.6 (459.48)  9552.5 (153.48)  9523.0 (111.82) 14021.5 (551.12)
 4 15932.2 (328.94)  9949.8 ( 97.03)  9946.0 (110.38) 13166.0 (289.70)
 5 15840.5 (407.32)  9886.9 ( 43.84)  9814.0 ( 60.92) 12401.2 (706.76)
 6 16071.2 (349.90)  9822.1 ( 88.81)  9796.0 ( 83.29) 11916.1 (904.20)
 7 15864.1 (353.14)  9684.6 ( 63.48)  9628.0 ( 35.54) 12194.0 (292.25)
 8 15901.2 (308.38)  9474.3 ( 86.85)  9447.0 ( 40.46) 11933.3 (633.21)
 -----------------------------------------------------------------------
rw (read) iops [mean (stdev)]
 -----------------------------------------------------------------------
 1  4034.5 ( 76.29)  4238.7 ( 65.71)  4305.9 (107.76)  4069.8 (115.45)
 2  3727.7 (108.46)  3860.5 (274.02)  4194.4 (132.19)  3909.3 (202.47)
 3  3837.6 ( 57.84)  3533.6 (206.67)  3560.5 (145.49)  3625.9 (170.29)
 4  3650.8 ( 53.90)  3201.1 ( 72.26)  3123.0 (154.18)  3506.5 (142.67)
 5  3638.8 (102.67)  3043.1 (122.89)  3210.6 ( 89.05)  3301.1 (194.63)
 6  3592.9 ( 93.41)  3002.3 (114.94)  3180.3 ( 42.36)  3297.5 (200.37)
 7  3666.1 ( 66.77)  3081.3 ( 92.76)  3120.7 (127.99)  3210.4 (139.00)
 8  3787.5 ( 46.90)  2859.6 (220.61)  2978.1 (119.55)  3215.4 (166.90)
 -----------------------------------------------------------------------
rw (write) iops [mean (stdev)]
 -----------------------------------------------------------------------
 1  4037.4 ( 74.32)  4241.1 ( 66.48)  4306.4 (105.00)  4073.3 (115.22)
 2  3740.1 (102.04)  3868.2 (265.28)  4193.9 (128.23)  3914.6 (195.13)
 3  3832.9 ( 58.52)  3526.3 (209.27)  3550.3 (148.23)  3622.4 (169.52)
 4  3647.1 ( 53.76)  3211.4 ( 68.84)  3133.7 (152.57)  3504.2 (138.19)
 5  3642.5 (102.65)  3051.0 (114.87)  3206.6 ( 84.58)  3303.5 (193.98)
 6  3610.2 ( 93.48)  3013.7 (124.22)  3208.8 ( 45.25)  3291.2 (197.91)
 7  3681.6 ( 66.38)  3077.4 ( 91.72)  3118.5 (130.27)  3184.9 (140.03)
 8  3792.6 ( 49.89)  2816.7 (235.14)  2943.7 (133.80)  3176.4 (164.24)
 -----------------------------------------------------------------------

