Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-13 Thread Mike Snitzer
On Fri, Jan 12 2018 at  8:37pm -0500,
Mike Snitzer  wrote:

> On Fri, Jan 12 2018 at  8:00pm -0500,
> Bart Van Assche  wrote:
> 
> > On Fri, 2018-01-12 at 19:52 -0500, Mike Snitzer wrote:
> > > It was 50 ms before it was 100 ms.  No real explanation for these
> > > values other than they seem to make Bart's IB SRP testbed happy?
> > 
> > But that constant was not introduced by me in the dm code.
> 
> No actually it was (not that there's anything wrong with that):
> 
> commit 06eb061f48594aa369f6e852b352410298b317a8
> Author: Bart Van Assche 
> Date:   Fri Apr 7 16:50:44 2017 -0700
> 
> dm mpath: requeue after a small delay if blk_get_request() fails
> 
> If blk_get_request() returns ENODEV then multipath_clone_and_map()
> causes a request to be requeued immediately. This can cause a kworker
> thread to spend 100% of the CPU time of a single core in
> __blk_mq_run_hw_queue() and also can cause device removal to never
> finish.
> 
> Avoid this by only requeuing after a delay if blk_get_request() fails.
> Additionally, reduce the requeue delay.
> 
> Cc: sta...@vger.kernel.org # 4.9+
> Signed-off-by: Bart Van Assche 
> Signed-off-by: Mike Snitzer 
> 
> Note that this commit actually details a different case where a
> blk_get_request() (in existing code) return of -ENODEV is a very
> compelling case to use DM_MAPIO_DELAY_REQUEUE.
> 
> So I'll revisit what is appropriate in multipath_clone_and_map() on
> Monday.

Sleep helped.  I had another look and it is only the old .request_fn
blk_get_request() code that even sets -ENODEV (if blk_queue_dying).

But thankfully the blk_get_request() error handling in
multipath_clone_and_map() checks for blk_queue_dying() and will return
DM_MAPIO_DELAY_REQUEUE.

So we're all set for this case.
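
For reference, that error-handling branch looks roughly like this (a
simplified sketch of drivers/md/dm-mpath.c, not the exact code):

	clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
	if (IS_ERR(clone)) {
		/* -EBUSY, -ENODEV or -EWOULDBLOCK: requeue after a delay */
		if (blk_queue_dying(q))
			/* trigger path recovery before the delayed requeue */
			activate_or_offline_path(pgpath);
		return DM_MAPIO_DELAY_REQUEUE;
	}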

Mike


Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-13 Thread Mike Snitzer
On Sat, Jan 13 2018 at 10:04am -0500,
Ming Lei  wrote:

> On Fri, Jan 12, 2018 at 05:31:17PM -0500, Mike Snitzer wrote:
> > 
> > Ming or Jens: might you be able to shed some light on how dm-mpath
> > would/could set BLK_MQ_S_SCHED_RESTART?  A new function added that can
> 
> When BLK_STS_RESOURCE is returned from .queue_rq(), blk_mq_dispatch_rq_list()
> will check if BLK_MQ_S_SCHED_RESTART is set.
> 
> If it has been set, the queue won't be rerun for this request; it will be
> rerun once an in-flight request completes, see blk_mq_sched_restart(),
> which is called from blk_mq_free_request().
> 
> If BLK_MQ_S_SCHED_RESTART isn't set, the queue is rerun in
> blk_mq_dispatch_rq_list(), and BLK_MQ_S_SCHED_RESTART is set before calling
> .queue_rq(), see blk_mq_sched_mark_restart_hctx(), which is called in
> blk_mq_sched_dispatch_requests().
> 
> This mechanism avoids rerunning the queue continuously in case of
> BLK_STS_RESOURCE, which means drivers don't need to work around it by
> adding an arbitrary delay.

Great, thanks for the overview.  Really appreciate it.

Mike


Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-13 Thread Ming Lei
On Fri, Jan 12, 2018 at 05:31:17PM -0500, Mike Snitzer wrote:
> On Fri, Jan 12 2018 at  1:54pm -0500,
> Bart Van Assche  wrote:
> 
> > On Fri, 2018-01-12 at 13:06 -0500, Mike Snitzer wrote:
> > > OK, you have the stage: please give me a pointer to your best
> > > explanation of the several.
> > 
> > Since the previous discussion about this topic occurred more than a month
> > ago it could take more time to look up an explanation than to explain it
> > again. Anyway, here we go. As you know, a block layer request queue needs
> > to be rerun if one or more requests are waiting and a previous condition
> > that prevented the requests from being executed has been cleared. For the
> > dm-mpath driver, examples of such conditions are: no tags available, a path
> > that is busy (see also pgpath_busy()), path initialization in progress
> > (pg_init_in_progress), or a request completing with an error status, e.g.
> > if the SCSI core calls __blk_mq_end_request(req, error) with error != 0.
> > For some of these conditions, e.g. when path initialization completes, a
> > callback function in the dm-mpath driver is called and it is possible to
> > rerun the queue explicitly. I agree that for such scenarios a delayed queue
> > run should not be triggered. For other scenarios, e.g. if a SCSI initiator
> > submits a SCSI request over a fabric and the SCSI target replies with
> > "BUSY", the SCSI core will end the I/O request with status BLK_STS_RESOURCE
> > after the maximum number of retries has been reached (see also
> > scsi_io_completion()). In that last case, if a SCSI target sends a "BUSY"
> > reply over the wire back to the initiator, the only way for the SCSI
> > initiator to figure out whether it can queue another request is to resubmit
> > the request. The worst possible strategy is to resubmit a request
> > immediately, because that will cause a significant fraction of the fabric
> > bandwidth to be used just for replying "BUSY" to requests that can't be
> > processed immediately.
> 
> The thing is, the 2 scenarios you are most concerned about have
> _nothing_ to do with dm_mq_queue_rq() at all.  They occur as an error in
> the IO path _after_ the request is successfully retrieved with
> blk_get_request() and then submitted.
>  
> > The intention of commit 6077c2d706097c0 was to address the last mentioned
> > case.
> 
> So it really makes me question why you think commit 6077c2d706097c0
> addresses the issue you think it does.  And it gives me more conviction to
> remove 6077c2d706097c0.
> 
> It may help just by virtue of blindly kicking the queue if
> blk_get_request() fails (regardless of whether the target is responding
> with BUSY or not).  Very unsatisfying to say the least.
> 
> I think it'd be much more beneficial for dm-mpath.c:multipath_end_io()
> to be trained to respond more intelligently to BLK_STS_RESOURCE.
> 
> E.g. set BLK_MQ_S_SCHED_RESTART if requests are known to be outstanding
> on the path.  This is one case where Ming said the queue would be
> re-run, as detailed in this header:
> https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=5b18cff4baedde77e0d69bd62a13ae78f9488d89
> 
> And Jens has reinforced to me that BLK_MQ_S_SCHED_RESTART is a means of
> kicking the queue more efficiently.  _BUT_ I'm not seeing any external
> blk-mq interface that exposes this capability to a blk-mq driver.  As is,
> BLK_MQ_S_SCHED_RESTART gets set very much in the bowels of blk-mq
> (blk_mq_sched_mark_restart_hctx).
> 
> So I have to do more homework here...
> 
> Ming or Jens: might you be able to shed some light on how dm-mpath
> would/could set BLK_MQ_S_SCHED_RESTART?  A new function added that can

When BLK_STS_RESOURCE is returned from .queue_rq(), blk_mq_dispatch_rq_list()
will check if BLK_MQ_S_SCHED_RESTART is set.

If it has been set, the queue won't be rerun for this request; it will be
rerun once an in-flight request completes, see blk_mq_sched_restart(),
which is called from blk_mq_free_request().

If BLK_MQ_S_SCHED_RESTART isn't set, the queue is rerun in
blk_mq_dispatch_rq_list(), and BLK_MQ_S_SCHED_RESTART is set before calling
.queue_rq(), see blk_mq_sched_mark_restart_hctx(), which is called in
blk_mq_sched_dispatch_requests().

This mechanism avoids rerunning the queue continuously in case of
BLK_STS_RESOURCE, which means drivers don't need to work around it by
adding an arbitrary delay.
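
To sketch the flow in simplified C (the real logic lives in
blk_mq_sched_dispatch_requests() and blk_mq_dispatch_rq_list(); the
dispatch helper below is only illustrative):

	/* mark restart before dispatching, so a completion reruns the queue */
	blk_mq_sched_mark_restart_hctx(hctx);

	if (dispatch_rq_list(hctx) == BLK_STS_RESOURCE) {
		if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state)) {
			/* no restart armed: rerun the queue ourselves */
			blk_mq_run_hw_queue(hctx, true);
		}
		/*
		 * Otherwise do nothing here: blk_mq_sched_restart(), called
		 * from blk_mq_free_request(), reruns the queue once an
		 * in-flight request completes.
		 */
	}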


-- 
Ming


Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-13 Thread Ming Lei
On Fri, Jan 12, 2018 at 06:54:49PM +, Bart Van Assche wrote:
> On Fri, 2018-01-12 at 13:06 -0500, Mike Snitzer wrote:
> > OK, you have the stage: please give me a pointer to your best
> > > explanation of the several.
> 
> Since the previous discussion about this topic occurred more than a month
> ago it could take more time to look up an explanation than to explain it
> again. Anyway, here we go. As you know, a block layer request queue needs
> to be rerun if one or more requests are waiting and a previous condition
> that prevented the requests from being executed has been cleared. For the
> dm-mpath driver, examples of such conditions are: no tags available, a path
> that is busy (see also pgpath_busy()), path initialization in progress
> (pg_init_in_progress), or a request completing with an error status, e.g.
> if the SCSI core calls __blk_mq_end_request(req, error) with error != 0.
> For some of these conditions, e.g. when path initialization completes, a
> callback function in the dm-mpath driver is called and it is possible to
> rerun the queue explicitly. I agree that for such scenarios a delayed queue
> run should not be triggered. For other scenarios, e.g. if a SCSI initiator
> submits a SCSI request over a fabric and the SCSI target replies with
> "BUSY", the SCSI core will end the I/O request with status BLK_STS_RESOURCE
> after the maximum number of retries has been reached (see also
> scsi_io_completion()). In that last case, if a SCSI target sends a "BUSY"
> reply over the wire back to the initiator, the only way for the SCSI
> initiator to figure out whether it can queue another request is to resubmit
> the request. The worst possible strategy is to resubmit a request
> immediately, because that will cause a significant fraction of the fabric
> bandwidth to be used just for replying "BUSY" to requests that can't be
> processed immediately.

That isn't true. When BLK_STS_RESOURCE is returned to blk-mq, blk-mq
applies BLK_MQ_S_SCHED_RESTART to hold the queue until an in-flight
request completes; please see blk_mq_sched_restart(), which is called
from blk_mq_free_request().

Also, now that we have IO schedulers, a NULL return from blk_get_request()
in dm-mpath doesn't reflect the underlying queue's busy state accurately or
in time, since by default the scheduler tag set is double the size of the
driver tag set. So it isn't good to depend on blk_get_request() alone to
evaluate the queue's busy status; this patchset feeds the underlying queue's
dispatch result directly back to blk-mq, which can deal
with this case much better.
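
In other words, with this patchset the busy signal propagates roughly like
this (a simplified sketch, not the literal patch):

	/* dm-mpath: multipath_clone_and_map() */
	clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
	if (IS_ERR(clone))
		return DM_MAPIO_REQUEUE;	/* underlying queue is busy */

	/* dm-rq: dm_mq_queue_rq() */
	if (map_request(tio) == DM_MAPIO_REQUEUE)
		return BLK_STS_RESOURCE;	/* blk-mq holds requests back */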

> 
> The intention of commit 6077c2d706097c0 was to address the last mentioned
> case. It may be possible to move the delayed queue rerun from the
> dm_mq_queue_rq() into dm_requeue_original_request(). But I think it would be
> wrong to rerun the queue immediately in case a SCSI target system returns
> "BUSY".

Again, the queue won't be rerun immediately after BLK_STS_RESOURCE is
returned to blk-mq. And BLK_MQ_S_SCHED_RESTART should address your concern
about continuous resubmission in case of running out of requests, right?

Thanks,
Ming


Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Mike Snitzer
On Fri, Jan 12 2018 at  8:00pm -0500,
Bart Van Assche  wrote:

> On Fri, 2018-01-12 at 19:52 -0500, Mike Snitzer wrote:
> > It was 50 ms before it was 100 ms.  No real explanation for these
> > values other than they seem to make Bart's IB SRP testbed happy?
> 
> But that constant was not introduced by me in the dm code.

No actually it was (not that there's anything wrong with that):

commit 06eb061f48594aa369f6e852b352410298b317a8
Author: Bart Van Assche 
Date:   Fri Apr 7 16:50:44 2017 -0700

dm mpath: requeue after a small delay if blk_get_request() fails

If blk_get_request() returns ENODEV then multipath_clone_and_map()
causes a request to be requeued immediately. This can cause a kworker
thread to spend 100% of the CPU time of a single core in
__blk_mq_run_hw_queue() and also can cause device removal to never
finish.

Avoid this by only requeuing after a delay if blk_get_request() fails.
Additionally, reduce the requeue delay.

Cc: sta...@vger.kernel.org # 4.9+
Signed-off-by: Bart Van Assche 
Signed-off-by: Mike Snitzer 

Note that this commit actually details a different case where a
blk_get_request() (in existing code) return of -ENODEV is a very
compelling case to use DM_MAPIO_DELAY_REQUEUE.

So I'll revisit what is appropriate in multipath_clone_and_map() on
Monday.

I need a break... have a good weekend Bart.


Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Bart Van Assche
On Fri, 2018-01-12 at 19:52 -0500, Mike Snitzer wrote:
> It was 50 ms before it was 100 ms.  No real explanation for these
> values other than they seem to make Bart's IB SRP testbed happy?

But that constant was not introduced by me in the dm code. See e.g. the
following commits:

commit d548b34b062b60b4f4df295a0b4823dfda1f1fc4
Author: Mike Snitzer 
Date:   Thu Mar 5 22:21:10 2015 -0500

dm: reduce the queue delay used in dm_request_fn from 100ms to 10ms
 
Commit 7eaceaccab ("block: remove per-queue plugging") didn't justify
DM's use of a 100ms delay; such an extended delay is a liability when
there is reason to re-kick the queue.
 
Signed-off-by: Mike Snitzer 

commit 7eaceaccab5f40bbfda044629a6298616aeaed50
Author: Jens Axboe 
Date:   Thu Mar 10 08:52:07 2011 +0100

block: remove per-queue plugging
 
Code has been converted over to the new explicit on-stack plugging,
and delay users have been converted to use the new API for that.
So lets kill off the old plugging along with aops->sync_page().
 
Signed-off-by: Jens Axboe 

Bart. 

Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Mike Snitzer
On Fri, Jan 12 2018 at  2:53pm -0500,
Elliott, Robert (Persistent Memory)  wrote:

> 
> 
> > -Original Message-
> > From: linux-block-ow...@vger.kernel.org [mailto:linux-block-
> > ow...@vger.kernel.org] On Behalf Of Bart Van Assche
> ...
> > The intention of commit 6077c2d706097c0 was to address the last mentioned
> > case. It may be possible to move the delayed queue rerun from the
> > dm_mq_queue_rq() into dm_requeue_original_request(). But I think it would be
> > wrong to rerun the queue immediately in case a SCSI target system returns
> > "BUSY".
> 
> Seeing SCSI BUSY mentioned here...
> 
> On its own, patch 6077c2d706 seems to be adding an arbitrarily selected
> magic value of 100 ms without explanation in the patch description or
> in the added code.

It was 50 ms before it was 100 ms.  No real explanation for these
values other than they seem to make Bart's IB SRP testbed happy?
 
> But, dm_requeue_original_request() already seems to have chosen that value
> for similar purposes:
> unsigned long delay_ms = delay_requeue ? 100 : 0;
> 
> So, making them share a #define would indicate there's a reason for that
> particular value.  Any change to the value would be picked up everywhere.
> 
> All the other callers of blk_mq_delay_run_hw_queue() use macros:
> drivers/md/dm-rq.c: blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
> drivers/nvme/host/fc.c: blk_mq_delay_run_hw_queue(queue->hctx, NVMEFC_QUEUE_DELAY);
> drivers/scsi/scsi_lib.c:blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
> drivers/scsi/scsi_lib.c:blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);

Sure, I'll add a #define.
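
Something like this, perhaps (the macro name below is illustrative only):

	/* shared requeue delay for dm-rq / dm-mpath, in milliseconds */
	#define DM_RQ_REQUEUE_DELAY_MS	100

	/* dm-rq.c */
	unsigned long delay_ms = delay_requeue ? DM_RQ_REQUEUE_DELAY_MS : 0;
	...
	blk_mq_delay_run_hw_queue(hctx, DM_RQ_REQUEUE_DELAY_MS);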

> Those both happen to be set to 3 (ms).

As for the value of 100, we're dealing with path faults so retrying them
extremely fast could be wasted effort.  But there is obviously no
one-size-fits-all value.  As storage gets faster, 100 ms could prove to be
very bad.

But without tests to verify a change, there won't be one.

Thanks,
Mike


Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Mike Snitzer
On Fri, Jan 12 2018 at  6:42pm -0500,
Bart Van Assche  wrote:

> On Fri, 2018-01-12 at 18:17 -0500, Mike Snitzer wrote:
> > @@ -1570,7 +1570,10 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
> > if (error && blk_path_error(error)) {
> > struct multipath *m = ti->private;
> >  
> > -   r = DM_ENDIO_REQUEUE;
> > +   if (r == BLK_STS_RESOURCE)
> > +   r = DM_ENDIO_DELAY_REQUEUE;
> > +   else
> > +   r = DM_ENDIO_REQUEUE;
> 
> Did you perhaps intend "error == BLK_STS_RESOURCE"?

Yes, it was a quick patch to get your thoughts.

> 
> > diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
> > index 9ba8453..da83f64 100644
> > --- a/include/linux/device-mapper.h
> > +++ b/include/linux/device-mapper.h
> > @@ -550,6 +550,7 @@ struct dm_table *dm_swap_table(struct mapped_device *md,
> >  #define DM_ENDIO_DONE  0
> >  #define DM_ENDIO_INCOMPLETE1
> >  #define DM_ENDIO_REQUEUE   2
> > +#define DM_ENDIO_DELAY_REQUEUE 3
> >  
> >  /*
> >   * Definitions of return values from target map function.
> > @@ -557,7 +558,7 @@ struct dm_table *dm_swap_table(struct mapped_device *md,
> >  #define DM_MAPIO_SUBMITTED 0
> >  #define DM_MAPIO_REMAPPED  1
> >  #define DM_MAPIO_REQUEUE   DM_ENDIO_REQUEUE
> > -#define DM_MAPIO_DELAY_REQUEUE 3
> > +#define DM_MAPIO_DELAY_REQUEUE DM_ENDIO_DELAY_REQUEUE
> >  #define DM_MAPIO_KILL  4
> >  
> >  #define dm_sector_div64(x, y)( \
> 
> Please consider introducing enumeration types for the DM_ENDIO_* and the
> DM_MAPIO_* constants such that the compiler can catch what I reported above.

OK, point taken.

> Otherwise this patch looks fine to me.

Cool, thanks.

Mike


Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Bart Van Assche
On Fri, 2018-01-12 at 18:17 -0500, Mike Snitzer wrote:
> @@ -1570,7 +1570,10 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
>   if (error && blk_path_error(error)) {
>   struct multipath *m = ti->private;
>  
> - r = DM_ENDIO_REQUEUE;
> + if (r == BLK_STS_RESOURCE)
> + r = DM_ENDIO_DELAY_REQUEUE;
> + else
> + r = DM_ENDIO_REQUEUE;

Did you perhaps intend "error == BLK_STS_RESOURCE"?

> diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
> index 9ba8453..da83f64 100644
> --- a/include/linux/device-mapper.h
> +++ b/include/linux/device-mapper.h
> @@ -550,6 +550,7 @@ struct dm_table *dm_swap_table(struct mapped_device *md,
>  #define DM_ENDIO_DONE0
>  #define DM_ENDIO_INCOMPLETE  1
>  #define DM_ENDIO_REQUEUE 2
> +#define DM_ENDIO_DELAY_REQUEUE   3
>  
>  /*
>   * Definitions of return values from target map function.
> @@ -557,7 +558,7 @@ struct dm_table *dm_swap_table(struct mapped_device *md,
>  #define DM_MAPIO_SUBMITTED   0
>  #define DM_MAPIO_REMAPPED1
>  #define DM_MAPIO_REQUEUE DM_ENDIO_REQUEUE
> -#define DM_MAPIO_DELAY_REQUEUE   3
> +#define DM_MAPIO_DELAY_REQUEUE   DM_ENDIO_DELAY_REQUEUE
>  #define DM_MAPIO_KILL4
>  
>  #define dm_sector_div64(x, y)( \

Please consider introducing enumeration types for the DM_ENDIO_* and the
DM_MAPIO_* constants such that the compiler can catch what I reported above.
Otherwise this patch looks fine to me.
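
A sketch of what that could look like (illustrative only; the values mirror
the existing macros):

	enum dm_endio_status {
		DM_ENDIO_DONE = 0,
		DM_ENDIO_INCOMPLETE = 1,
		DM_ENDIO_REQUEUE = 2,
		DM_ENDIO_DELAY_REQUEUE = 3,
	};

	enum dm_mapio_status {
		DM_MAPIO_SUBMITTED = 0,
		DM_MAPIO_REMAPPED = 1,
		DM_MAPIO_REQUEUE = DM_ENDIO_REQUEUE,
		DM_MAPIO_DELAY_REQUEUE = DM_ENDIO_DELAY_REQUEUE,
		DM_MAPIO_KILL = 4,
	};

With multipath_end_io() using an enum dm_endio_status local for its return
value, comparing that variable against a blk_status_t value such as
BLK_STS_RESOURCE becomes much easier for the compiler or a static checker
to flag.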

Thanks,

Bart.

Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Mike Snitzer
On Fri, Jan 12 2018 at  1:54pm -0500,
Bart Van Assche  wrote:
 
> The intention of commit 6077c2d706097c0 was to address the last mentioned
> case. It may be possible to move the delayed queue rerun from the
> dm_mq_queue_rq() into dm_requeue_original_request(). But I think it would be
> wrong to rerun the queue immediately in case a SCSI target system returns
> "BUSY".

This is my 3rd reply to this email... 3rd time's the charm? ;)

Here is a patch that will kick the hw queues via blk_mq_requeue_work(),
indirectly from dm-rq.c:__dm_mq_kick_requeue_list(), after a delay if
BLK_STS_RESOURCE is returned from the target.

Your thoughts on this patch as an alternative to commit 6077c2d7060?

Thanks,
Mike

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index d8c7259..ab2cfdc 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1570,7 +1570,10 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
if (error && blk_path_error(error)) {
struct multipath *m = ti->private;
 
-   r = DM_ENDIO_REQUEUE;
+   if (r == BLK_STS_RESOURCE)
+   r = DM_ENDIO_DELAY_REQUEUE;
+   else
+   r = DM_ENDIO_REQUEUE;
 
if (pgpath)
fail_path(pgpath);
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index 6b01298..ab0ed2d 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -315,6 +315,10 @@ static void dm_done(struct request *clone, blk_status_t error, bool mapped)
/* The target wants to requeue the I/O */
dm_requeue_original_request(tio, false);
break;
+   case DM_ENDIO_DELAY_REQUEUE:
+   /* The target wants to requeue the I/O after a delay */
+   dm_requeue_original_request(tio, true);
+   break;
default:
DMWARN("unimplemented target endio return value: %d", r);
BUG();
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 9ba8453..da83f64 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -550,6 +550,7 @@ struct dm_table *dm_swap_table(struct mapped_device *md,
 #define DM_ENDIO_DONE  0
 #define DM_ENDIO_INCOMPLETE1
 #define DM_ENDIO_REQUEUE   2
+#define DM_ENDIO_DELAY_REQUEUE 3
 
 /*
  * Definitions of return values from target map function.
@@ -557,7 +558,7 @@ struct dm_table *dm_swap_table(struct mapped_device *md,
 #define DM_MAPIO_SUBMITTED 0
 #define DM_MAPIO_REMAPPED  1
 #define DM_MAPIO_REQUEUE   DM_ENDIO_REQUEUE
-#define DM_MAPIO_DELAY_REQUEUE 3
+#define DM_MAPIO_DELAY_REQUEUE DM_ENDIO_DELAY_REQUEUE
 #define DM_MAPIO_KILL  4
 
 #define dm_sector_div64(x, y)( \




Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Mike Snitzer
On Fri, Jan 12 2018 at  1:54pm -0500,
Bart Van Assche  wrote:

> On Fri, 2018-01-12 at 13:06 -0500, Mike Snitzer wrote:
> > OK, you have the stage: please give me a pointer to your best
> > explanation of the several.
> 
> Since the previous discussion about this topic occurred more than a month
> ago it could take more time to look up an explanation than to explain it
> again. Anyway, here we go. As you know, a block layer request queue needs
> to be rerun if one or more requests are waiting and a previous condition
> that prevented the requests from being executed has been cleared. For the
> dm-mpath driver, examples of such conditions are: no tags available, a path
> that is busy (see also pgpath_busy()), path initialization in progress
> (pg_init_in_progress), or a request completing with an error status, e.g.
> if the SCSI core calls __blk_mq_end_request(req, error) with error != 0.
> For some of these conditions, e.g. when path initialization completes, a
> callback function in the dm-mpath driver is called and it is possible to
> rerun the queue explicitly. I agree that for such scenarios a delayed queue
> run should not be triggered. For other scenarios, e.g. if a SCSI initiator
> submits a SCSI request over a fabric and the SCSI target replies with
> "BUSY", the SCSI core will end the I/O request with status BLK_STS_RESOURCE
> after the maximum number of retries has been reached (see also
> scsi_io_completion()). In that last case, if a SCSI target sends a "BUSY"
> reply over the wire back to the initiator, the only way for the SCSI
> initiator to figure out whether it can queue another request is to resubmit
> the request. The worst possible strategy is to resubmit a request
> immediately, because that will cause a significant fraction of the fabric
> bandwidth to be used just for replying "BUSY" to requests that can't be
> processed immediately.

The thing is, the 2 scenarios you are most concerned about have
_nothing_ to do with dm_mq_queue_rq() at all.  They occur as an error in
the IO path _after_ the request is successfully retrieved with
blk_get_request() and then submitted.
 
> The intention of commit 6077c2d706097c0 was to address the last mentioned
> case.

So it really makes me question why you think commit 6077c2d706097c0
addresses the issue you think it does.  And it gives me more conviction to
remove 6077c2d706097c0.

It may help just by virtue of blindly kicking the queue if
blk_get_request() fails (regardless of whether the target is responding
with BUSY or not).  Very unsatisfying to say the least.

I think it'd be much more beneficial for dm-mpath.c:multipath_end_io()
to be trained to respond more intelligently to BLK_STS_RESOURCE.

E.g. set BLK_MQ_S_SCHED_RESTART if requests are known to be outstanding
on the path.  This is one case where Ming said the queue would be
re-run, as detailed in this header:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=5b18cff4baedde77e0d69bd62a13ae78f9488d89

And Jens has reinforced to me that BLK_MQ_S_SCHED_RESTART is a means of
kicking the queue more efficiently.  _BUT_ I'm not seeing any external
blk-mq interface that exposes this capability to a blk-mq driver.  As is,
BLK_MQ_S_SCHED_RESTART gets set very much in the bowels of blk-mq
(blk_mq_sched_mark_restart_hctx).

So I have to do more homework here...

Ming or Jens: might you be able to shed some light on how dm-mpath
would/could set BLK_MQ_S_SCHED_RESTART?  A new function added that can
be called from q->softirq_done_fn in response to BLK_STS_RESOURCE?  Or
is this level of control (and blk-mq detail) something that a blk-mq
driver should need to care about?
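
For example, something along these lines (purely hypothetical; no such
export exists in blk-mq today):

	/*
	 * Hypothetical blk-mq export: let a stacked driver arm the same
	 * deferred rerun that blk_mq_sched_mark_restart_hctx() arranges
	 * internally, so the queue is rerun when an in-flight request
	 * completes.
	 */
	void blk_mq_mark_sched_restart(struct blk_mq_hw_ctx *hctx)
	{
		set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
	}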

Anyway, we may need to get to that level of blk-mq driver controlled
retry optimization.

But Bart: again this is a vastly different feedback loop than the
stabbing in the dark your commit 6077c2d706097c0 is doing.  

Mike


RE: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Elliott, Robert (Persistent Memory)


> -Original Message-
> From: linux-block-ow...@vger.kernel.org [mailto:linux-block-
> ow...@vger.kernel.org] On Behalf Of Bart Van Assche
...
> The intention of commit 6077c2d706097c0 was to address the last mentioned
> case. It may be possible to move the delayed queue rerun from the
> dm_mq_queue_rq() into dm_requeue_original_request(). But I think it would be
> wrong to rerun the queue immediately in case a SCSI target system returns
> "BUSY".

Seeing SCSI BUSY mentioned here...

On its own, patch 6077c2d706 seems to be adding an arbitrarily selected
magic value of 100 ms without explanation in the patch description or
in the added code.

But, dm_requeue_original_request() already seems to have chosen that value
for similar purposes:
unsigned long delay_ms = delay_requeue ? 100 : 0;

So, making them share a #define would indicate there's a reason for that
particular value.  Any change to the value would be picked up everywhere.

All the other callers of blk_mq_delay_run_hw_queue() use macros:
drivers/md/dm-rq.c: blk_mq_delay_run_hw_queue(hctx, 100/*ms*/);
drivers/nvme/host/fc.c: blk_mq_delay_run_hw_queue(queue->hctx, NVMEFC_QUEUE_DELAY);
drivers/scsi/scsi_lib.c:blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);
drivers/scsi/scsi_lib.c:blk_mq_delay_run_hw_queue(hctx, SCSI_QUEUE_DELAY);

Those both happen to be set to 3 (ms).


---
Robert Elliott, HPE Persistent Memory





Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Mike Snitzer
On Fri, Jan 12 2018 at  1:54pm -0500,
Bart Van Assche  wrote:

> On Fri, 2018-01-12 at 13:06 -0500, Mike Snitzer wrote:
> > OK, you have the stage: please give me a pointer to your best
> > explanation of the several.
> 
> Since the previous discussion about this topic occurred more than a month
> ago it could take more time to look up an explanation than to explain it
> again. Anyway, here we go. As you know, a block layer request queue needs
> to be rerun if one or more requests are waiting and a previous condition
> that prevented the requests from being executed has been cleared. For the
> dm-mpath driver, examples of such conditions are: no tags available, a path
> that is busy (see also pgpath_busy()), path initialization in progress
> (pg_init_in_progress), or a request completing with an error status, e.g.
> if the SCSI core calls __blk_mq_end_request(req, error) with error != 0.
> For some of these conditions, e.g. when path initialization completes, a
> callback function in the dm-mpath driver is called and it is possible to
> rerun the queue explicitly. I agree that for such scenarios a delayed queue
> run should not be triggered. For other scenarios, e.g. if a SCSI initiator
> submits a SCSI request over a fabric and the SCSI target replies with
> "BUSY", the SCSI core will end the I/O request with status BLK_STS_RESOURCE
> after the maximum number of retries has been reached (see also
> scsi_io_completion()). In that last case, if a SCSI target sends a "BUSY"
> reply over the wire back to the initiator, the only way for the SCSI
> initiator to figure out whether it can queue another request is to resubmit
> the request. The worst possible strategy is to resubmit a request
> immediately, because that will cause a significant fraction of the fabric
> bandwidth to be used just for replying "BUSY" to requests that can't be
> processed immediately.
> 
> The intention of commit 6077c2d706097c0 was to address the last mentioned
> case. It may be possible to move the delayed queue rerun from the
> dm_mq_queue_rq() into dm_requeue_original_request(). But I think it would be
> wrong to rerun the queue immediately in case a SCSI target system returns
> "BUSY".

OK, thank you very much for this.  Really helps.

For starters, multipath_clone_and_map() could do a fair amount more with
the insight that a SCSI "BUSY" was transmitted back.  If both blk-mq
running out of tags and a SCSI "BUSY" simply surface as BLK_STS_RESOURCE,
then dm-mpath has no ability to behave more intelligently.

Anyway, armed with this info I'll have a think about what we might do to
tackle this problem head on.

Thanks,
Mike


Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Bart Van Assche
On Fri, 2018-01-12 at 13:06 -0500, Mike Snitzer wrote:
> OK, you have the stage: please give me a pointer to your best
> explanation of the several.

Since the previous discussion about this topic occurred more than a month
ago it could take more time to look up an explanation than to explain it
again. Anyway, here we go. As you know, a block layer request queue needs
to be rerun if one or more requests are waiting and a previous condition
that prevented the requests from being executed has been cleared. For the
dm-mpath driver, examples of such conditions are: no tags available, a path
that is busy (see also pgpath_busy()), path initialization in progress
(pg_init_in_progress), or a request completing with an error status, e.g.
if the SCSI core calls __blk_mq_end_request(req, error) with error != 0.
For some of these conditions, e.g. when path initialization completes, a
callback function in the dm-mpath driver is called and it is possible to
rerun the queue explicitly. I agree that for such scenarios a delayed queue
run should not be triggered. For other scenarios, e.g. if a SCSI initiator
submits a SCSI request over a fabric and the SCSI target replies with
"BUSY", the SCSI core will end the I/O request with status BLK_STS_RESOURCE
after the maximum number of retries has been reached (see also
scsi_io_completion()). In that last case, if a SCSI target sends a "BUSY"
reply over the wire back to the initiator, the only way for the SCSI
initiator to figure out whether it can queue another request is to resubmit
the request. The worst possible strategy is to resubmit a request
immediately, because that will cause a significant fraction of the fabric
bandwidth to be used just for replying "BUSY" to requests that can't be
processed immediately.

The intention of commit 6077c2d706097c0 was to address the last mentioned
case. It may be possible to move the delayed queue rerun from the
dm_mq_queue_rq() into dm_requeue_original_request(). But I think it would be
wrong to rerun the queue immediately in case a SCSI target system returns
"BUSY".

Bart.

Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Mike Snitzer
On Fri, Jan 12 2018 at 12:46pm -0500,
Bart Van Assche  wrote:

> On Fri, 2018-01-12 at 12:40 -0500, Mike Snitzer wrote:
> > You've not explained it many times.
> 
> That's not correct. I have already several times posted a detailed and easy
> to understand explanation.

OK, you have the stage: please give me a pointer to your best
explanation of the several.

> > We cannot get a straight answer from you.
>
> That is a gross and incorrect statement. Please calm down.

I'm perfectly calm.  I'm just tired of how you sow controversy.


Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Bart Van Assche
On Fri, 2018-01-12 at 12:40 -0500, Mike Snitzer wrote:
> You've not explained it many times.

That's not correct. I have already several times posted a detailed and easy
to understand explanation.

> We cannot get a straight answer from you.

That is a gross and incorrect statement. Please calm down.

Bart.

Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Mike Snitzer
On Fri, Jan 12 2018 at 12:26pm -0500,
Bart Van Assche  wrote:

> On Fri, 2018-01-12 at 12:18 -0500, Mike Snitzer wrote:
> > This is going upstream for 4.16:
> > https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=dm-4.16&id=5b18cff4baedde77e0d69bd62a13ae78f9488d89
> 
> That is really gross. I have explained many times in detail on the dm-devel
> list why that blk_mq_delay_run_hw_queue() call is necessary. So I think it is
> wrong if you remove it. Isn't it the responsibility of you as the device
> mapper kernel maintainer to avoid regressions instead of introducing them?

Please stop, seriously.

You've not explained it many times.  We cannot get a straight answer
from you.  No analysis that establishes that if an underlying dm-mq
multipath path is out of tags (shared or otherwise) the dm-rq core
_must_ run the hw queue after a delay.

Commit 6077c2d7060 just papers over a real blk-mq problem (that may now
be fixed).  Your assertion that blk-mq would need to otherwise poll
(and waste resources so it can immediately retry) ignores that blk-mq
_should_ make progress as requests complete or if/when a path is
recovered, etc.  So I'm not going to accept your dysfunctional reasoning
on this, sorry.  _THAT_ is my job as a maintainer.


Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Bart Van Assche
On Fri, 2018-01-12 at 12:18 -0500, Mike Snitzer wrote:
> This is going upstream for 4.16:
> https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/commit/?h=dm-4.16&id=5b18cff4baedde77e0d69bd62a13ae78f9488d89

That is really gross. I have explained many times in detail on the dm-devel
list why that blk_mq_delay_run_hw_queue() call is necessary. So I think it is
wrong if you remove it. Isn't it the responsibility of you as the device
mapper kernel maintainer to avoid regressions instead of introducing them?

Bart.

Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-12 Thread Mike Snitzer
On Thu, Jan 11 2018 at 10:33pm -0500,
Ming Lei  wrote:

> On Thu, Jan 11, 2018 at 08:57:21PM -0500, Mike Snitzer wrote:
> > On Thu, Jan 11 2018 at  8:42pm -0500,
> > Ming Lei  wrote:
> > 
> > > On Thu, Jan 11, 2018 at 10:37:37PM +, Bart Van Assche wrote:
> > > > On Thu, 2018-01-11 at 17:07 -0500, Mike Snitzer wrote:
> > > > > Bart, if for some reason we regress for some workload you're able to
> > > > > more readily test we can deal with it.  But I'm too encouraged by
> > > > > Ming's performance improvements to hold these changes back any
> > > > > further.
> > > > 
> > > > Sorry Mike but I think Ming's measurement results are not sufficient to
> > > > motivate upstreaming of these patches. What I remember from previous
> > > > versions of this patch series is that Ming measured the performance
> > > > impact of this patch series only for the Emulex FC driver (lpfc). That
> > > > driver limits queue depth to 3. Instead of modifying the dm code, that
> > > > driver needs to be fixed such that it allows larger queue depths.
> > > > 
> > > > Additionally, some time ago I had explained in detail why I think that
> > > > patch 2/5 in this series is wrong and why it will introduce fairness
> > > > regressions in multi-LUN workloads. I think we need performance
> > > > measurements for this patch series for at least the following two
> > > > configurations before this patch series can be considered for upstream
> > > > inclusion:
> > > > * The performance impact of this patch series for SCSI devices with a
> > > >   realistic queue depth (e.g. 64 or 128).
> > > 
> > > Please look at the cover letter or patch 5's commit log, it mentions that
> > > dm-mpath over virtio-scsi is tested, and the default queue depth of
> > > virtio-scsi is 128.
> > > 
> > > > * The performance impact for this patch series for a SCSI host with
> > > >   which multiple LUNs are associated and for which the target system
> > > >   often replies with the SCSI "BUSY" status.
> > > 
> > > I don't think this patch is related to this case; it just provides the
> > > BUSY feedback from the underlying queue to blk-mq via dm-rq.
> > 
> > Hi Ming,
> > 
> > Please see https://www.redhat.com/archives/dm-devel/2017-April/msg00190.html
> > 
> > Specifically:
> > "That patch [commit 6077c2d706] can be considered as a first step that
> > can be refined further, namely by modifying the dm-rq code further such
> > that dm-rq queues are only rerun after the busy condition has been
> > cleared. The patch at the start of this thread is easier to review and
> > easier to test than any patch that would only rerun dm-rq queues after
> > the busy condition has been cleared."
> > 
> > Do you have time to reason through Bart's previous proposal for how to
> > better address the specific issue that was documented to be fixed in the
> > header for commit 6077c2d706 ?
> 
> Hi Mike,
> 
> Recently we fixed one such issue in blk-mq (eb619fdb2d4cb8b3d34), and I
> strongly suspect that it may fix Bart's case.

Wow, kind of staggering that there was no mention of this fix prior to
now.  Seems _very_ relevant (like the real fix).

> Let's look at this commit 6077c2d706 again: I don't know the idea behind
> the fix, which just adds an arbitrary 100ms delay, and this delay degrades
> performance a lot, since no hctx can be scheduled any more during the delay.

I'd love to try to reproduce the issue documented in commit
6077c2d706 but sadly I cannot get that srp-test to work on my system.
Just fails with:

# while srp-test/run_tests -d -r 30 -t 02-mq; do :; done
Unloaded the ib_srpt kernel module
Unloaded the rdma_rxe kernel module
Zero-initializing /dev/ram0 ... done
Zero-initializing /dev/ram1 ... done
Configured SRP target driver
Running test srp-test/tests/02-mq ...
Test file I/O on top of multipath concurrently with logout and login (0 min; mq)
SRP login failed
Test srp-test/tests/02-mq failed

Think the login is failing due to the target not being set up properly?
Yep, now from reading this thread, it is clear that srp-test only works
if you have an IB HCA:
https://patchwork.kernel.org/patch/10041381/

> So again it is definitely a workaround: no reasoning, no theory, it is just
> asserted to be a fix. Is there anyone who can explain the story?
> 
> IMO, the blk_get_request() in dm-mpath is just like a kmalloc(ATOMIC), which
> is used in other blk-mq drivers' IO paths too, so I don't know why dm-mq is
> so special that it requires such an ugly 100ms delay.

The header for commit 6077c2d706 notes the same action that Jens took
to unwedge the request stalls (in the patchwork thread I referenced
above), with:

echo run > /sys/kernel/debug/block/nullb1/state
vs what Bart noted in commit 6077c2d706:
echo run >/sys/kernel/debug/block/dm-0/state

Really does feel like the issue Jens' shared tags fix addressed _is_ the
root cause that commit 6077c2d706 worked around.


Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-11 Thread Ming Lei
On Thu, Jan 11, 2018 at 08:57:21PM -0500, Mike Snitzer wrote:
> On Thu, Jan 11 2018 at  8:42pm -0500,
> Ming Lei  wrote:
> 
> > On Thu, Jan 11, 2018 at 10:37:37PM +, Bart Van Assche wrote:
> > > On Thu, 2018-01-11 at 17:07 -0500, Mike Snitzer wrote:
> > > > Bart, if for some reason we regress for some workload you're able to
> > > > more readily test we can deal with it.  But I'm too encouraged by Ming's
> > > > performance improvements to hold these changes back any further.
> > > 
> > > Sorry Mike but I think Ming's measurement results are not sufficient to
> > > motivate upstreaming of these patches. What I remember from previous
> > > versions of this patch series is that Ming measured the performance
> > > impact of this patch series only for the Emulex FC driver (lpfc). That
> > > driver limits queue depth to 3. Instead of modifying the dm code, that
> > > driver needs to be fixed such that it allows larger queue depths.
> > > 
> > > Additionally, some time ago I had explained in detail why I think that
> > > patch 2/5 in this series is wrong and why it will introduce fairness
> > > regressions in multi-LUN workloads. I think we need performance
> > > measurements for this patch series for at least the following two
> > > configurations before this patch series can be considered for upstream
> > > inclusion:
> > > * The performance impact of this patch series for SCSI devices with a
> > >   realistic queue depth (e.g. 64 or 128).
> > 
> > Please look at the cover letter or patch 5's commit log, it mentions that
> > dm-mpath over virtio-scsi is tested, and the default queue depth of
> > virtio-scsi is 128.
> > 
> > > * The performance impact for this patch series for a SCSI host with
> > >   which multiple LUNs are associated and for which the target system
> > >   often replies with the SCSI "BUSY" status.
> > 
> > I don't think this patch is related to this case; it just provides the
> > BUSY feedback from the underlying queue to blk-mq via dm-rq.
> 
> Hi Ming,
> 
> Please see https://www.redhat.com/archives/dm-devel/2017-April/msg00190.html
> 
> Specifically:
> "That patch [commit 6077c2d706] can be considered as a first step that
> can be refined further, namely by modifying the dm-rq code further such
> that dm-rq queues are only rerun after the busy condition has been
> cleared. The patch at the start of this thread is easier to review and
> easier to test than any patch that would only rerun dm-rq queues after
> the busy condition has been cleared."
> 
> Do you have time to reason through Bart's previous proposal for how to
> better address the specific issue that was documented to be fixed in the
> header for commit 6077c2d706 ?

Hi Mike,

Recently we fixed one such issue in blk-mq (eb619fdb2d4cb8b3d34), and I
strongly suspect that it may fix Bart's case.

Let's look at this commit 6077c2d706 again: I don't know the idea behind
the fix, which just adds an arbitrary 100ms delay, and this delay degrades
performance a lot, since no hctx can be scheduled any more during the delay.

So again it is definitely a workaround: no reasoning, no theory, it is just
asserted to be a fix. Is there anyone who can explain the story?

IMO, the blk_get_request() in dm-mpath is just like a kmalloc(ATOMIC), which
is used in other blk-mq drivers' IO paths too, so I don't know why dm-mq is
so special that it requires such an ugly 100ms delay.

So I suggest removing it; let's see if there are reports, and if there
are, I am quite confident we can root-cause and fix the problem.

> 
> Otherwise I fear we'll just keep hitting a wall with Bart...

I did test dm-mq over scsi-debug set up with two LUNs and can_queue set
to 1, and this patchset just works well, without any IO hang.
Anyone can try and play with it.

Thanks,
Ming


Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-11 Thread Mike Snitzer
On Thu, Jan 11 2018 at  8:42pm -0500,
Ming Lei  wrote:

> On Thu, Jan 11, 2018 at 10:37:37PM +, Bart Van Assche wrote:
> > On Thu, 2018-01-11 at 17:07 -0500, Mike Snitzer wrote:
> > > Bart, if for some reason we regress for some workload you're able to
> > > more readily test we can deal with it.  But I'm too encouraged by Ming's
> > > performance improvements to hold these changes back any further.
> > 
> > Sorry Mike but I think Ming's measurement results are not sufficient to
> > motivate upstreaming of these patches. What I remember from previous
> > versions of this patch series is that Ming measured the performance impact
> > of this patch series only for the Emulex FC driver (lpfc). That driver
> > limits queue depth to 3. Instead of modifying the dm code, that driver
> > needs to be fixed such that it allows larger queue depths.
> > 
> > Additionally, some time ago I had explained in detail why I think that patch
> > 2/5 in this series is wrong and why it will introduce fairness regressions
> > in multi-LUN workloads. I think we need performance measurements for this
> > patch series for at least the following two configurations before this patch
> > series can be considered for upstream inclusion:
> > * The performance impact of this patch series for SCSI devices with a
> >   realistic queue depth (e.g. 64 or 128).
> 
> Please look at the cover letter or patch 5's commit log, it mentions that
> dm-mpath over virtio-scsi is tested, and the default queue depth of
> virtio-scsi is 128.
> 
> > * The performance impact for this patch series for a SCSI host with which
> >   multiple LUNs are associated and for which the target system often replies
> >   with the SCSI "BUSY" status.
> 
> I don't think this patch is related to this case; it just provides the
> BUSY feedback from the underlying queue to blk-mq via dm-rq.

Hi Ming,

Please see https://www.redhat.com/archives/dm-devel/2017-April/msg00190.html

Specifically:
"That patch [commit 6077c2d706] can be considered as a first step that
can be refined further, namely by modifying the dm-rq code further such
that dm-rq queues are only rerun after the busy condition has been
cleared. The patch at the start of this thread is easier to review and
easier to test than any patch that would only rerun dm-rq queues after
the busy condition has been cleared."

Do you have time to reason through Bart's previous proposal for how to
better address the specific issue that was documented to be fixed in the
header for commit 6077c2d706 ?

Otherwise I fear we'll just keep hitting a wall with Bart...

Thanks,
Mike


Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-11 Thread Mike Snitzer
On Thu, Jan 11 2018 at  6:27pm -0500,
Bart Van Assche  wrote:

> On Thu, 2018-01-11 at 17:58 -0500, Mike Snitzer wrote:
> > The changes are pretty easy to review.  This notion that these changes
> > are problematic rings very hollow given your lack of actual numbers (or
> > some other concerning observation rooted in testing fact) to back up
> > your position.
> 
> It's not my job to run the multi-LUN test. That's the job of the people who
> want these patches upstream. Since I asked for these test results for the
> first time several months ago I'm surprised that nobody has run these tests
> yet.

I've reasoned through a few different ways to respond to this.  Fact is
you're not giving me much to work with.  AFAIK you _are_ chartered with
supporting the types of storage configs that you've requested
performance results from.

Your dm-rq.c commit 6077c2d706097c0 ("dm rq: Avoid that request
processing stalls sporadically") silently went in through Jens:
https://www.redhat.com/archives/dm-devel/2017-April/msg00157.html
Not sure why that happened to begin with honestly.

But at the end of that post I meant to say:
"If this dm-mq specific commit is justified the case certainly is
_not_ spelled out in the commit header."

Anyway, I've split this contentious removal of dm_mq_queue_rq's
blk_mq_delay_run_hw_queue(hctx, 100/*ms*/) into a separate patch; but
at this point I'm still inclined to accept it for 4.16.

I'll hopefully look closer at understanding the need for commit
6077c2d706097c0 tomorrow.

In the meantime, I'd _really_ appreciate it if you'd give the rest of
the changes Ming has proposed in this patchset a much more open mind!

Thanks,
Mike


Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-11 Thread Ming Lei
On Thu, Jan 11, 2018 at 10:37:37PM +, Bart Van Assche wrote:
> On Thu, 2018-01-11 at 17:07 -0500, Mike Snitzer wrote:
> > Bart, if for some reason we regress for some workload you're able to
> > more readily test we can deal with it.  But I'm too encouraged by Ming's
> > performance improvements to hold these changes back any further.
> 
> Sorry Mike but I think Ming's measurement results are not sufficient to
> motivate upstreaming of these patches. What I remember from previous versions
> of this patch series is that Ming measured the performance impact of this
> patch series only for the Emulex FC driver (lpfc). That driver limits queue
> depth to 3. Instead of modifying the dm code, that driver needs to be fixed
> such that it allows larger queue depths.
> 
> Additionally, some time ago I had explained in detail why I think that patch
> 2/5 in this series is wrong and why it will introduce fairness regressions
> in multi-LUN workloads. I think we need performance measurements for this
> patch series for at least the following two configurations before this patch
> series can be considered for upstream inclusion:
> * The performance impact of this patch series for SCSI devices with a
>   realistic queue depth (e.g. 64 or 128).

Please look at the cover letter or patch 5's commit log, it mentions that
dm-mpath over virtio-scsi is tested, and the default queue depth of
virtio-scsi is 128.

> * The performance impact for this patch series for a SCSI host with which
>   multiple LUNs are associated and for which the target system often replies
>   with the SCSI "BUSY" status.

I don't think this patch is related to this case; it just provides the
BUSY feedback from the underlying queue to blk-mq via dm-rq.

Thanks,
Ming


Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-11 Thread Bart Van Assche
On Thu, 2018-01-11 at 17:58 -0500, Mike Snitzer wrote:
> The changes are pretty easy to review.  This notion that these changes
> are problematic rings very hollow given your lack of actual numbers (or
> some other concerning observation rooted in testing fact) to back up
> your position.

It's not my job to run the multi-LUN test. That's the job of the people who
want these patches upstream. Since I asked for these test results for the first
time several months ago I'm surprised that nobody has run these tests yet.

Bart.

Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-11 Thread Mike Snitzer
On Thu, Jan 11 2018 at  5:37pm -0500,
Bart Van Assche  wrote:

> On Thu, 2018-01-11 at 17:07 -0500, Mike Snitzer wrote:
> > Bart, if for some reason we regress for some workload you're able to
> > more readily test we can deal with it.  But I'm too encouraged by Ming's
> > performance improvements to hold these changes back any further.
> 
> Sorry Mike but I think Ming's measurement results are not sufficient to
> motivate upstreaming of these patches. What I remember from previous versions
> of this patch series is that Ming measured the performance impact of this
> patch series only for the Emulex FC driver (lpfc). That driver limits queue
> depth to 3. Instead of modifying the dm code, that driver needs to be fixed
> such that it allows larger queue depths.

Sorry Bart but even in my test (using null_blk with a queue depth of 8, 12
hw queues on a 12 cpu system, with 12 fio jobs) it is yielding a performance
improvement.  Going from 1350K iops to 1420K iops.  And 5275MB/s to
5532MB/s.  For a sequential fio workload.

This test has been a staple of mine for a very long time.  The
improvement is very real (and hard to come by).  The iops improvement is
a bellwether for improved efficiency in the fast path.

So all your concerns about Ming's testing are fine.  But you'd really help
me out a lot if you could give these changes a test.
 
> Additionally, some time ago I had explained in detail why I think that patch
> 2/5 in this series is wrong and why it will introduce fairness regressions
> in multi-LUN workloads.

You may think you explained it... but IIRC it was very hand-wavy.  If
you can provide a pointer to what you think was a compelling argument
then please do.

But I'm left very much unconvinced by your argument given what I see
in patch 2.  If blk_get_request() fails it follows that BLK_STS_RESOURCE
should be returned.  Why is the 100ms delay meaningful in that case for
SCSI?

> I think we need performance measurements for this
> patch series for at least the following two configurations before this patch
> series can be considered for upstream inclusion:
> * The performance impact of this patch series for SCSI devices with a
>   realistic queue depth (e.g. 64 or 128).
> * The performance impact for this patch series for a SCSI host with which
>   multiple LUNs are associated and for which the target system often replies
>   with the SCSI "BUSY" status.

OK, so please test it.

The changes are pretty easy to review.  This notion that these changes
are problematic rings very hollow given your lack of actual numbers (or
some other concerning observation rooted in testing fact) to back up
your position.

Mike


Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-11 Thread Bart Van Assche
On Thu, 2018-01-11 at 17:07 -0500, Mike Snitzer wrote:
> Bart, if for some reason we regress for some workload you're able to
> more readily test we can deal with it.  But I'm too encouraged by Ming's
> performance improvements to hold these changes back any further.

Sorry Mike but I think Ming's measurement results are not sufficient to
motivate upstreaming of these patches. What I remember from previous versions
of this patch series is that Ming measured the performance impact of this
patch series only for the Emulex FC driver (lpfc). That driver limits queue
depth to 3. Instead of modifying the dm code, that driver needs to be fixed
such that it allows larger queue depths.

Additionally, some time ago I had explained in detail why I think that patch
2/5 in this series is wrong and why it will introduce fairness regressions
in multi-LUN workloads. I think we need performance measurements for this
patch series for at least the following two configurations before this patch
series can be considered for upstream inclusion:
* The performance impact of this patch series for SCSI devices with a
  realistic queue depth (e.g. 64 or 128).
* The performance impact for this patch series for a SCSI host with which
  multiple LUNs are associated and for which the target system often replies
  with the SCSI "BUSY" status.

Bart.

Re: [PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-11 Thread Mike Snitzer
On Thu, Jan 11 2018 at  1:01am -0500,
Ming Lei  wrote:

> Hi Guys,
> 
> The 1st patch removes the workaround of blk_mq_delay_run_hw_queue() in
> case of requeue; this workaround isn't necessary, and worse, it keeps
> BLK_MQ_S_SCHED_RESTART from working and degrades I/O performance.
> 
> The 2nd patch returns DM_MAPIO_REQUEUE to dm-rq if the underlying request
> allocation fails; then we can return BLK_STS_RESOURCE from dm-rq to
> blk-mq, so that blk-mq can hold back the requests to be dequeued.

In general the delayed requeue was meant to cope (slightly better) with
queue_if_no_path and no paths being available.  I think patch 1 respects
that intent.  And patch 2 looks right to me, BLK_STS_RESOURCE definitely
should be returned if blk_get_request() fails.

So I picked these up for dm-4.16 (and staged them in linux-next).  I
reworded both headers and the 2nd patch's block comment, see:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=64e4e12bf474cf3aaaf59245bbeba3057d2fedf9
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/commit/?h=dm-4.16&id=c50de17c1cc7d5910df58aba9b950e99579e239d

Bart, if for some reason we regress for some workload you're able to
more readily test we can deal with it.  But I'm too encouraged by Ming's
performance improvements to hold these changes back any further.

> The other 3 patches change the blk-mq part of blk_insert_cloned_request()
> to use blk_mq_try_issue_directly(), so that both dm-rq and blk-mq can see
> the dispatch result of the underlying queue; with this information, blk-mq
> can handle IO merging much better, and sequential I/O performance improves
> significantly.
> 
> In my dm-mpath over virtio-scsi test, this whole patchset improves
> sequential IO by 3X ~ 5X.

I've merged dm-4.16 and for-4.16/block; patches 3 - 5 apply cleanly
(so we won't have any merge conflicts during 4.16), and the resulting git
branch is here:
https://git.kernel.org/pub/scm/linux/kernel/git/snitzer/linux.git/log/?h=block-4.16_dm-4.16

I'll reply to patches 3 - 5 individually with my Reviewed-by.
Jens, please feel free to pick them up for 4.16.

Thanks,
Mike


[PATCH V3 0/5] dm-rq: improve sequential I/O performance

2018-01-10 Thread Ming Lei
Hi Guys,

The 1st patch removes the workaround of blk_mq_delay_run_hw_queue() in
case of requeue; this workaround isn't necessary, and worse, it keeps
BLK_MQ_S_SCHED_RESTART from working and degrades I/O performance.

The 2nd patch returns DM_MAPIO_REQUEUE to dm-rq if the underlying request
allocation fails; then we can return BLK_STS_RESOURCE from dm-rq to
blk-mq, so that blk-mq can hold back the requests to be dequeued.

The other 3 patches change the blk-mq part of blk_insert_cloned_request()
to use blk_mq_try_issue_directly(), so that both dm-rq and blk-mq can see
the dispatch result of the underlying queue; with this information, blk-mq
can handle IO merging much better, and sequential I/O performance improves
significantly.
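
Conceptually, the blk-mq side of blk_insert_cloned_request() changes
roughly like this (a simplified sketch, not the literal diff; the exact
helper signature is illustrative):

	if (q->mq_ops) {
		/*
		 * Issue the clone directly and return the dispatch result
		 * (e.g. BLK_STS_RESOURCE) to the caller (dm-rq), instead of
		 * inserting the request and kicking the queue asynchronously.
		 */
		return blk_mq_try_issue_directly(hctx, rq, &cookie);
	}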

In my dm-mpath over virtio-scsi test, this whole patchset improves
sequential IO by 3X ~ 5X.

V3:
- rebase on the latest for-4.16/block of block tree
- add missing pg_init_all_paths() in patch 1, according to Bart's review

V2:
- drop 'dm-mpath: cache ti->clone during requeue', which is a bit
too complicated, and didn't show an obvious performance improvement.
- make change on blk-mq part cleaner

Ming Lei (5):
  dm-mpath: don't call blk_mq_delay_run_hw_queue() in case of
BLK_STS_RESOURCE
  dm-mpath: return DM_MAPIO_REQUEUE in case of rq allocation failure
  blk-mq: move actual issue into one helper
  blk-mq: return dispatch result to caller in blk_mq_try_issue_directly
  blk-mq: issue request directly for blk_insert_cloned_request

 block/blk-core.c  |  3 +-
 block/blk-mq.c| 86 +++
 block/blk-mq.h|  3 ++
 drivers/md/dm-mpath.c | 19 +---
 drivers/md/dm-rq.c| 20 +---
 5 files changed, 101 insertions(+), 30 deletions(-)

-- 
2.9.5