from:"Laurence Oberman"

Re: [PATCH 2/2] blk-mq: provide a helper to check if a queue is busy

2018-11-08 Thread Laurence Oberman

On Thu, 2018-11-08 at 08:42 -0700, Jens Axboe wrote:
> +static bool blk_mq_check_busy(struct blk_mq_hw_ctx *hctx, struct
> request *rq,
> + void *priv, bool reserved)
> +{
> +   bool *busy = (bool *) priv;
> +
> +   /*
> +    * If we find a request, we know the queue is busy. Return
> false
> +    * to stop the iteration.
> +    */
> +   *busy = true;
> +   return false;
> +}
> +

Hi Jens,

Trying to understand the logic, likely just my lack of knowledge
You return false after setting true because you say we know the queue
is busy.
How do we know we found a request here
I dont see you check the result of 
 bool *busy = (bool *) priv;
Is the assumption here because this was called we already knew we had a
request

Apologies it its just my lack of understanding

+static bool blk_mq_check_busy(struct blk_mq_hw_ctx *hctx, struct
request *rq,
+ void *priv, bool reserved)
+{
+   bool *busy = (bool *) priv;
+
+   /*
+* If we find a request, we know the queue is busy. Return
false
+* to stop the iteration.
+*/
+   *busy = true;
+   return false;
+}
+

Re: [PATCH] block: Clear kernel memory before copying to user

2018-11-07 Thread Laurence Oberman

On Wed, 2018-11-07 at 07:37 -0700, Keith Busch wrote:
> If the kernel allocates a bounce buffer for user read data, this
> memory
> needs to be cleared before copying it to the user, otherwise it may
> leak
> kernel memory to user space.
> 
> Signed-off-by: Keith Busch 
> ---
>  block/bio.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/block/bio.c b/block/bio.c
> index d5368a445561..a50d59236b19 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1260,6 +1260,7 @@ struct bio *bio_copy_user_iov(struct
> request_queue *q,
>   if (ret)
>   goto cleanup;
>   } else {
> + zero_fill_bio(bio);
>   iov_iter_advance(iter, bio->bi_iter.bi_size);
>   }
>  

Straightforward, looks good from my view.

Reviewed-by:
Laurence Oberman

Re: [PATCH V3 0/3] blk-mq: improve IO perf in case of none io sched

2018-07-02 Thread Laurence Oberman

On Mon, 2018-07-02 at 17:35 +0800, Ming Lei wrote:
> Hi,
> 
> The 1st 2 patch improves ctx->lock uses, and it is observed that IOPS
> may be improved by ~5% in rand IO test on MegaRaid SAS run by
> Kashyap.
> 
> The 3rd patch fixes rand IO performance regression on MegaRaid SAS
> test, still reported by Kashyap.
> 
> V3:
>   - export dispatch busy from debugfs as suggested by Jens
>   - add comment on blk_mq_update_hctx_busy() as suggested by
> Christoph
> 
> V2:
>   - fix list corruption in patch 1/3
> 
> 
> Ming Lei (3):
>   blk-mq: use list_splice_tail_init() to insert requests
>   blk-mq: only attempt to merge bio if there is rq in sw queue
>   blk-mq: dequeue request one by one from sw queue iff hctx is busy


> 
>  block/blk-mq-debugfs.c |  9 +
>  block/blk-mq-sched.c   | 14 --
>  block/blk-mq.c | 44 --
> --
>  include/linux/blk-mq.h |  3 ++-
>  4 files changed, 51 insertions(+), 19 deletions(-)
> 
> Cc: Kashyap Desai 
> Cc: Laurence Oberman 
> Cc: Omar Sandoval 
> Cc: Christoph Hellwig 
> Cc: Bart Van Assche 
> Cc: Hannes Reinecke 
> 

Thanks Ming, I repaired my MSA50 shelf so will functional test.
I have 6 SSD drives working in there now. 
I know its already been tested by Kashyap.

Regards
Laurence

Re: [PATCH V4 0/7] nvme: pci: fix & improve timeout handling

2018-05-10 Thread Laurence Oberman

On Thu, 2018-05-10 at 18:28 +0800, Ming Lei wrote:
> On Sat, May 05, 2018 at 07:11:33PM -0400, Laurence Oberman wrote:
> > On Sat, 2018-05-05 at 21:58 +0800, Ming Lei wrote:
> > > Hi,
> > > 
> > > The 1st patch introduces blk_quiesce_timeout() and
> > > blk_unquiesce_timeout()
> > > for NVMe, meantime fixes blk_sync_queue().
> > > 
> > > The 2nd patch covers timeout for admin commands for recovering
> > > controller
> > > for avoiding possible deadlock.
> > > 
> > > The 3rd and 4th patches avoid to wait_freeze on queues which
> > > aren't
> > > frozen.
> > > 
> > > The last 4 patches fixes several races wrt. NVMe timeout handler,
> > > and
> > > finally can make blktests block/011 passed. Meantime the NVMe PCI
> > > timeout
> > > mecanism become much more rebost than before.
> > > 
> > > gitweb:
> > >   https://github.com/ming1/linux/commits/v4.17-rc-nvme-timeout.V4
> > > 
> > > V4:
> > >   - fixe nvme_init_set_host_mem_cmd()
> > >   - use nested EH model, and run both nvme_dev_disable() and
> > >   resetting in one same context
> > > 
> > > V3:
> > >   - fix one new race related freezing in patch 4,
> > > nvme_reset_work()
> > >   may hang forever without this patch
> > >   - rewrite the last 3 patches, and avoid to break
> > > nvme_reset_ctrl*()
> > > 
> > > V2:
> > >   - fix draining timeout work, so no need to change return value
> > > from
> > >   .timeout()
> > >   - fix race between nvme_start_freeze() and nvme_unfreeze()
> > >   - cover timeout for admin commands running in EH
> > > 
> > > Ming Lei (7):
> > >   block: introduce blk_quiesce_timeout() and
> > > blk_unquiesce_timeout()
> > >   nvme: pci: cover timeout for admin commands running in EH
> > >   nvme: pci: only wait freezing if queue is frozen
> > >   nvme: pci: freeze queue in nvme_dev_disable() in case of error
> > > recovery
> > >   nvme: core: introduce 'reset_lock' for sync reset state and
> > > reset
> > > activities
> > >   nvme: pci: prepare for supporting error recovery from resetting
> > > context
> > >   nvme: pci: support nested EH
> > > 
> > >  block/blk-core.c |  21 +++-
> > >  block/blk-mq.c   |   9 ++
> > >  block/blk-timeout.c  |   5 +-
> > >  drivers/nvme/host/core.c |  46 ++-
> > >  drivers/nvme/host/nvme.h |   5 +
> > >  drivers/nvme/host/pci.c  | 304
> > > ---
> > >  include/linux/blkdev.h   |  13 ++
> > >  7 files changed, 356 insertions(+), 47 deletions(-)
> > > 
> > > Cc: Jianchao Wang <jianchao.w.w...@oracle.com>
> > > Cc: Christoph Hellwig <h...@lst.de>
> > > Cc: Sagi Grimberg <s...@grimberg.me>
> > > Cc: linux-n...@lists.infradead.org
> > > Cc: Laurence Oberman <lober...@redhat.com>
> > 
> > Hello Ming
> > 
> > I have a two node NUMA system here running your kernel tree
> > 4.17.0-rc3.ming.nvme+
> > 
> > [root@segstorage1 ~]# numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 3 5 6 8 11 13 14
> > node 0 size: 63922 MB
> > node 0 free: 61310 MB
> > node 1 cpus: 1 2 4 7 9 10 12 15
> > node 1 size: 64422 MB
> > node 1 free: 62372 MB
> > node distances:
> > node   0   1 
> >   0:  10  20 
> >   1:  20  10 
> > 
> > I ran block/011
> > 
> > [root@segstorage1 blktests]# ./check block/011
> > block/011 => nvme0n1 (disable PCI device while doing
> > I/O)[failed]
> > runtime...  106.936s
> > --- tests/block/011.out 2018-05-05 18:01:14.268414752
> > -0400
> > +++ results/nvme0n1/block/011.out.bad   2018-05-05
> > 19:07:21.028634858 -0400
> > @@ -1,2 +1,36 @@
> >  Running block/011
> > +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> > IO_U_F_FLIGHT) == 0' failed.
> > +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> > IO_U_F_FLIGHT) == 0' failed.
> > +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> > IO_U_F_FLIGHT) == 0' failed.
> > +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> > IO_U_F_FLIGHT) == 0' failed.
> > +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> &g

Re: v4.16-rc1 + dm-mpath + BFQ

2018-05-10 Thread Laurence Oberman

On Thu, 2018-05-10 at 15:16 +, Bart Van Assche wrote:
> On Fri, 2018-05-04 at 16:42 -0400, Laurence Oberman wrote:
> > I was never able to reproduce Barts original issue using his tree
> > and
> > actual mlx5/cx4 hardware and ibsrp
> > I enabled BFQ with no other special tuning for the moath and
> > subpaths.
> > I was waiting for him to come back from vacation to check with him.
> 
> (back in the office)
> 
> Hello Laurence,
> 
> What I understood from off-list communication is that you tried to
> find
> a way to reproduce what I reported without using the srp-test
> software.
> My understanding is that both Paolo and I can reproduce the reported
> issue
> with the srp-test software.
> 
> Bart.
> 
> 
> 

Hello Bart

using your kernel
4.17.0-rc2.bart+

CONFIG_IOSCHED_BFQ=y
CONFIG_BFQ_GROUP_IOSCHED=y

These are all SRP LUNS

36001405b2b5c6c24c084b6fa4d55da2f dm-27 LIO-ORG ,block-10
size=3.9G features='2 queue_mode mq' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 2:0:0:9  sdap 66:144 active ready running
  `- 1:0:0:9  sdaz 67:48  active ready running

36001405b26ebe76dcb94a489f6f245f8 dm-18 LIO-ORG ,block-21
size=3.9G features='2 queue_mode mq' hwhandler='0' wp=rw
`-+- policy='service-time 0' prio=1 status=active
  |- 2:0:0:20 sdx  65:112 active ready running
  `- 1:0:0:20 sdaa 65:160 active ready running

[root@ibclient ~]# cd /sys/block
[root@ibclient block]# cat /sys/block/dm-18/queue/scheduler
mq-deadline kyber [bfq] none
[root@ibclient block]# cat /sys/block/sdaa/queue/scheduler
mq-deadline kyber [bfq] none
[root@ibclient block]# cat /sys/block/sdx/queue/scheduler
mq-deadline kyber [bfq] none

Not using the test software just exercising the LUNS via my own tests I
am unable to get the OOPS

I guess something in the srp-test software triggers it then.

Doing plenty of IO to 5 mpath devices (1.3Gbytes/sec)

#Time cpu sys inter  ctxsw Free Buff Cach Inac Slab  Map
KBRead  Reads KBWrit Writes   KBIn  PktIn  KBOut  PktOut 
12:08:320   0  1437   1107  88G   5M   1G 902M 300M
178M  1380K345  0  0  6 74  0   4 

Thanks
Laurence

Re: [PATCH V4 0/7] nvme: pci: fix & improve timeout handling

2018-05-05 Thread Laurence Oberman

On Sat, 2018-05-05 at 19:31 -0400, Laurence Oberman wrote:
> On Sat, 2018-05-05 at 19:11 -0400, Laurence Oberman wrote:
> > On Sat, 2018-05-05 at 21:58 +0800, Ming Lei wrote:
> > > Hi,
> > > 
> > > The 1st patch introduces blk_quiesce_timeout() and
> > > blk_unquiesce_timeout()
> > > for NVMe, meantime fixes blk_sync_queue().
> > > 
> > > The 2nd patch covers timeout for admin commands for recovering
> > > controller
> > > for avoiding possible deadlock.
> > > 
> > > The 3rd and 4th patches avoid to wait_freeze on queues which
> > > aren't
> > > frozen.
> > > 
> > > The last 4 patches fixes several races wrt. NVMe timeout handler,
> > > and
> > > finally can make blktests block/011 passed. Meantime the NVMe PCI
> > > timeout
> > > mecanism become much more rebost than before.
> > > 
> > > gitweb:
> > >   https://github.com/ming1/linux/commits/v4.17-rc-nvme-timeout.V4
> > > 
> > > V4:
> > >   - fixe nvme_init_set_host_mem_cmd()
> > >   - use nested EH model, and run both nvme_dev_disable() and
> > >   resetting in one same context
> > > 
> > > V3:
> > >   - fix one new race related freezing in patch 4,
> > > nvme_reset_work()
> > >   may hang forever without this patch
> > >   - rewrite the last 3 patches, and avoid to break
> > > nvme_reset_ctrl*()
> > > 
> > > V2:
> > >   - fix draining timeout work, so no need to change return value
> > > from
> > >   .timeout()
> > >   - fix race between nvme_start_freeze() and nvme_unfreeze()
> > >   - cover timeout for admin commands running in EH
> > > 
> > > Ming Lei (7):
> > >   block: introduce blk_quiesce_timeout() and
> > > blk_unquiesce_timeout()
> > >   nvme: pci: cover timeout for admin commands running in EH
> > >   nvme: pci: only wait freezing if queue is frozen
> > >   nvme: pci: freeze queue in nvme_dev_disable() in case of error
> > > recovery
> > >   nvme: core: introduce 'reset_lock' for sync reset state and
> > > reset
> > > activities
> > >   nvme: pci: prepare for supporting error recovery from resetting
> > > context
> > >   nvme: pci: support nested EH
> > > 
> > >  block/blk-core.c |  21 +++-
> > >  block/blk-mq.c   |   9 ++
> > >  block/blk-timeout.c  |   5 +-
> > >  drivers/nvme/host/core.c |  46 ++-
> > >  drivers/nvme/host/nvme.h |   5 +
> > >  drivers/nvme/host/pci.c  | 304
> > > ---
> > >  include/linux/blkdev.h   |  13 ++
> > >  7 files changed, 356 insertions(+), 47 deletions(-)
> > > 
> > > Cc: Jianchao Wang <jianchao.w.w...@oracle.com>
> > > Cc: Christoph Hellwig <h...@lst.de>
> > > Cc: Sagi Grimberg <s...@grimberg.me>
> > > Cc: linux-n...@lists.infradead.org
> > > Cc: Laurence Oberman <lober...@redhat.com>
> > 
> > Hello Ming
> > 
> > I have a two node NUMA system here running your kernel tree
> > 4.17.0-rc3.ming.nvme+
> > 
> > [root@segstorage1 ~]# numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 3 5 6 8 11 13 14
> > node 0 size: 63922 MB
> > node 0 free: 61310 MB
> > node 1 cpus: 1 2 4 7 9 10 12 15
> > node 1 size: 64422 MB
> > node 1 free: 62372 MB
> > node distances:
> > node   0   1 
> >   0:  10  20 
> >   1:  20  10 
> > 
> > I ran block/011
> > 
> > [root@segstorage1 blktests]# ./check block/011
> > block/011 => nvme0n1 (disable PCI device while doing
> > I/O)[failed]
> > runtime...  106.936s
> > --- tests/block/011.out 2018-05-05 18:01:14.268414752
> > -0400
> > +++ results/nvme0n1/block/011.out.bad   2018-05-05
> > 19:07:21.028634858 -0400
> > @@ -1,2 +1,36 @@
> >  Running block/011
> > +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> > IO_U_F_FLIGHT) == 0' failed.
> > +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> > IO_U_F_FLIGHT) == 0' failed.
> > +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> > IO_U_F_FLIGHT) == 0' failed.
> > +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> > IO_U_F_FLIGHT) == 0' failed.
> > +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> &g

Re: [PATCH V4 0/7] nvme: pci: fix & improve timeout handling

2018-05-05 Thread Laurence Oberman

On Sat, 2018-05-05 at 19:11 -0400, Laurence Oberman wrote:
> On Sat, 2018-05-05 at 21:58 +0800, Ming Lei wrote:
> > Hi,
> > 
> > The 1st patch introduces blk_quiesce_timeout() and
> > blk_unquiesce_timeout()
> > for NVMe, meantime fixes blk_sync_queue().
> > 
> > The 2nd patch covers timeout for admin commands for recovering
> > controller
> > for avoiding possible deadlock.
> > 
> > The 3rd and 4th patches avoid to wait_freeze on queues which aren't
> > frozen.
> > 
> > The last 4 patches fixes several races wrt. NVMe timeout handler,
> > and
> > finally can make blktests block/011 passed. Meantime the NVMe PCI
> > timeout
> > mecanism become much more rebost than before.
> > 
> > gitweb:
> > https://github.com/ming1/linux/commits/v4.17-rc-nvme-timeout.V4
> > 
> > V4:
> > - fixe nvme_init_set_host_mem_cmd()
> > - use nested EH model, and run both nvme_dev_disable() and
> > resetting in one same context
> > 
> > V3:
> > - fix one new race related freezing in patch 4,
> > nvme_reset_work()
> > may hang forever without this patch
> > - rewrite the last 3 patches, and avoid to break
> > nvme_reset_ctrl*()
> > 
> > V2:
> > - fix draining timeout work, so no need to change return value
> > from
> > .timeout()
> > - fix race between nvme_start_freeze() and nvme_unfreeze()
> > - cover timeout for admin commands running in EH
> > 
> > Ming Lei (7):
> >   block: introduce blk_quiesce_timeout() and
> > blk_unquiesce_timeout()
> >   nvme: pci: cover timeout for admin commands running in EH
> >   nvme: pci: only wait freezing if queue is frozen
> >   nvme: pci: freeze queue in nvme_dev_disable() in case of error
> > recovery
> >   nvme: core: introduce 'reset_lock' for sync reset state and reset
> > activities
> >   nvme: pci: prepare for supporting error recovery from resetting
> > context
> >   nvme: pci: support nested EH
> > 
> >  block/blk-core.c |  21 +++-
> >  block/blk-mq.c   |   9 ++
> >  block/blk-timeout.c  |   5 +-
> >  drivers/nvme/host/core.c |  46 ++-
> >  drivers/nvme/host/nvme.h |   5 +
> >  drivers/nvme/host/pci.c  | 304
> > ---
> >  include/linux/blkdev.h   |  13 ++
> >  7 files changed, 356 insertions(+), 47 deletions(-)
> > 
> > Cc: Jianchao Wang <jianchao.w.w...@oracle.com>
> > Cc: Christoph Hellwig <h...@lst.de>
> > Cc: Sagi Grimberg <s...@grimberg.me>
> > Cc: linux-n...@lists.infradead.org
> > Cc: Laurence Oberman <lober...@redhat.com>
> 
> Hello Ming
> 
> I have a two node NUMA system here running your kernel tree
> 4.17.0-rc3.ming.nvme+
> 
> [root@segstorage1 ~]# numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 3 5 6 8 11 13 14
> node 0 size: 63922 MB
> node 0 free: 61310 MB
> node 1 cpus: 1 2 4 7 9 10 12 15
> node 1 size: 64422 MB
> node 1 free: 62372 MB
> node distances:
> node   0   1 
>   0:  10  20 
>   1:  20  10 
> 
> I ran block/011
> 
> [root@segstorage1 blktests]# ./check block/011
> block/011 => nvme0n1 (disable PCI device while doing I/O)[failed]
> runtime...  106.936s
> --- tests/block/011.out   2018-05-05 18:01:14.268414752
> -0400
> +++ results/nvme0n1/block/011.out.bad 2018-05-05
> 19:07:21.028634858 -0400
> @@ -1,2 +1,36 @@
>  Running block/011
> +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> IO_U_F_FLIGHT) == 0' failed.
> +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> IO_U_F_FLIGHT) == 0' failed.
> +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> IO_U_F_FLIGHT) == 0' failed.
> +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> IO_U_F_FLIGHT) == 0' failed.
> +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> IO_U_F_FLIGHT) == 0' failed.
> +fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
> IO_U_F_FLIGHT) == 0' failed.
> ...
> (Run 'diff -u tests/block/011.out
> results/nvme0n1/block/011.out.bad' to see the entire diff)
> 
> [ 1421.738551] run blktests block/011 at 2018-05-05 19:05:34
> [ 1452.676351] nvme nvme0: controller is down; will reset: CSTS=0x3,
> PCI_STATUS=0x10
> [ 1452.718221] nvme nvme0: controller is down; will reset: CSTS=0x3,
> PCI_STATUS=0x10
> [ 1452.718239] nvme nvme0: EH 0: before shutdown
> [ 1452.760890] nvme nvme0: controller is down

Re: [PATCH V4 0/7] nvme: pci: fix & improve timeout handling

2018-05-05 Thread Laurence Oberman

On Sat, 2018-05-05 at 21:58 +0800, Ming Lei wrote:
> Hi,
> 
> The 1st patch introduces blk_quiesce_timeout() and
> blk_unquiesce_timeout()
> for NVMe, meantime fixes blk_sync_queue().
> 
> The 2nd patch covers timeout for admin commands for recovering
> controller
> for avoiding possible deadlock.
> 
> The 3rd and 4th patches avoid to wait_freeze on queues which aren't
> frozen.
> 
> The last 4 patches fixes several races wrt. NVMe timeout handler, and
> finally can make blktests block/011 passed. Meantime the NVMe PCI
> timeout
> mecanism become much more rebost than before.
> 
> gitweb:
>   https://github.com/ming1/linux/commits/v4.17-rc-nvme-timeout.V4
> 
> V4:
>   - fixe nvme_init_set_host_mem_cmd()
>   - use nested EH model, and run both nvme_dev_disable() and
>   resetting in one same context
> 
> V3:
>   - fix one new race related freezing in patch 4,
> nvme_reset_work()
>   may hang forever without this patch
>   - rewrite the last 3 patches, and avoid to break
> nvme_reset_ctrl*()
> 
> V2:
>   - fix draining timeout work, so no need to change return value
> from
>   .timeout()
>   - fix race between nvme_start_freeze() and nvme_unfreeze()
>   - cover timeout for admin commands running in EH
> 
> Ming Lei (7):
>   block: introduce blk_quiesce_timeout() and blk_unquiesce_timeout()
>   nvme: pci: cover timeout for admin commands running in EH
>   nvme: pci: only wait freezing if queue is frozen
>   nvme: pci: freeze queue in nvme_dev_disable() in case of error
> recovery
>   nvme: core: introduce 'reset_lock' for sync reset state and reset
> activities
>   nvme: pci: prepare for supporting error recovery from resetting
> context
>   nvme: pci: support nested EH
> 
>  block/blk-core.c |  21 +++-
>  block/blk-mq.c   |   9 ++
>  block/blk-timeout.c  |   5 +-
>  drivers/nvme/host/core.c |  46 ++-
>  drivers/nvme/host/nvme.h |   5 +
>  drivers/nvme/host/pci.c  | 304
> ---
>  include/linux/blkdev.h   |  13 ++
>  7 files changed, 356 insertions(+), 47 deletions(-)
> 
> Cc: Jianchao Wang <jianchao.w.w...@oracle.com>
> Cc: Christoph Hellwig <h...@lst.de>
> Cc: Sagi Grimberg <s...@grimberg.me>
> Cc: linux-n...@lists.infradead.org
> Cc: Laurence Oberman <lober...@redhat.com>

Hello Ming

I have a two node NUMA system here running your kernel tree
4.17.0-rc3.ming.nvme+

[root@segstorage1 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 3 5 6 8 11 13 14
node 0 size: 63922 MB
node 0 free: 61310 MB
node 1 cpus: 1 2 4 7 9 10 12 15
node 1 size: 64422 MB
node 1 free: 62372 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10 

I ran block/011

[root@segstorage1 blktests]# ./check block/011
block/011 => nvme0n1 (disable PCI device while doing I/O)[failed]
runtime...  106.936s
--- tests/block/011.out 2018-05-05 18:01:14.268414752 -0400
+++ results/nvme0n1/block/011.out.bad   2018-05-05
19:07:21.028634858 -0400
@@ -1,2 +1,36 @@
 Running block/011
+fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
IO_U_F_FLIGHT) == 0' failed.
+fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
IO_U_F_FLIGHT) == 0' failed.
+fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
IO_U_F_FLIGHT) == 0' failed.
+fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
IO_U_F_FLIGHT) == 0' failed.
+fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
IO_U_F_FLIGHT) == 0' failed.
+fio: ioengines.c:289: td_io_queue: Assertion `(io_u->flags &
IO_U_F_FLIGHT) == 0' failed.
...
(Run 'diff -u tests/block/011.out
results/nvme0n1/block/011.out.bad' to see the entire diff)

[ 1421.738551] run blktests block/011 at 2018-05-05 19:05:34
[ 1452.676351] nvme nvme0: controller is down; will reset: CSTS=0x3,
PCI_STATUS=0x10
[ 1452.718221] nvme nvme0: controller is down; will reset: CSTS=0x3,
PCI_STATUS=0x10
[ 1452.718239] nvme nvme0: EH 0: before shutdown
[ 1452.760890] nvme nvme0: controller is down; will reset: CSTS=0x3,
PCI_STATUS=0x10
[ 1452.760894] nvme nvme0: controller is down; will reset: CSTS=0x3,
PCI_STATUS=0x10
[ 1452.760897] nvme nvme0: controller is down; will reset: CSTS=0x3,
PCI_STATUS=0x10
[ 1452.760900] nvme nvme0: controller is down; will reset: CSTS=0x3,
PCI_STATUS=0x10
[ 1452.760903] nvme nvme0: controller is down; will reset: CSTS=0x3,
PCI_STATUS=0x10
[ 1452.760906] nvme nvme0: controller is down; will reset: CSTS=0x3,
PCI_STATUS=0x10
[ 1452.760909] nvme nvme0: controller is down; will reset: CSTS=0x3,
PCI_STATUS=0x10
[ 1452.760912] nvme nvme0: controller is down; will reset: CSTS=0x3,
PCI_STATUS=0x10
[ 1452.760915] nvme nv

Re: v4.16-rc1 + dm-mpath + BFQ

2018-05-04 Thread Laurence Oberman

On Fri, 2018-05-04 at 22:11 +0200, Paolo Valente wrote:
> > Il giorno 30 mar 2018, alle ore 18:57, Bart Van Assche  > c...@wdc.com> ha scritto:
> > 
> > On Fri, 2018-03-30 at 10:23 +0200, Paolo Valente wrote:
> > > Still 4.16-rc1, being that the version for which you reported
> > > this
> > > issue in the first place.
> > 
> > A vanilla v4.16-rc1 kernel is not sufficient to run the srp-test
> > software
> > since RDMA/CM support for the SRP target driver is missing from
> > that kernel.
> > That's why I asked you to use the for-next branch from my github
> > repository
> > in a previous e-mail. Anyway, since the necessary patches are now
> > in
> > linux-next, the srp-test software can also be run against linux-
> > next. Here
> > are the results that I obtained with label next-20180329 and the
> > kernel
> > config attached to your previous e-mail:
> > 
> > # while ./srp-test/run_tests -c -d -r 10 -e bfq; do :; done
> > 
> > BUG: unable to handle kernel NULL pointer dereference at
> > 0200
> > PGD 0 P4D 0 
> > Oops: 0002 [#1] SMP PTI
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-
> > prebuilt.qemu-project.org 04/01/2014
> > RIP: 0010:rb_erase+0x284/0x380
> > Call Trace:
> > 
> > elv_rb_del+0x24/0x30
> > bfq_remove_request+0x9a/0x2e0 [bfq]
> > ? rcu_read_lock_sched_held+0x64/0x70
> > ? update_load_avg+0x72b/0x760
> > bfq_finish_requeue_request+0x2e1/0x3b0 [bfq]
> > ? __lock_is_held+0x5a/0xa0
> > blk_mq_free_request+0x5f/0x1a0
> > blk_put_request+0x23/0x60
> > multipath_release_clone+0xe/0x10
> > dm_softirq_done+0xe3/0x270
> > __blk_mq_complete_request_remote+0x18/0x20
> > flush_smp_call_function_queue+0xa1/0x150
> > generic_smp_call_function_single_interrupt+0x13/0x30
> > smp_call_function_single_interrupt+0x4d/0x220
> > call_function_single_interrupt+0xf/0x20
> > 
> > 
> 
> Hi Bart,
> I suspect my recent fix [1] might fix your failure too.
> 
> Thanks,
> Paolo
> 
> [1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1682
> 264.html
> 
> > Bart.
> > 
> > 
> > 
> 
> 
I was never able to reproduce Barts original issue using his tree and
actual mlx5/cx4 hardware and ibsrp
I enabled BFQ with no other special tuning for the moath and subpaths.
I was waiting for him to come back from vacation to check with him.

Thanks
Laurence

Re: [PATCH 0/3] scsi: scsi-mq: don't hold host_busy in IO path

2018-04-27 Thread Laurence Oberman

On Fri, 2018-04-27 at 15:48 +, Bart Van Assche wrote:
> On Fri, 2018-04-27 at 09:39 -0600, Jens Axboe wrote:
> > blk_mq_tagset_busy_iter(>tag_set, scsi_host_check_in_flight,
> > _flight);
> > return in_flight.cnt + atomic_read(>host_busy);
> > 
> > The atomic read is basically free, once we get rid of the dirty of
> > that
> > variable on each IO.
> 
> Hello Jens,
> 
> What makes you think that " + atomic_read(>host_busy)" is
> necessary?
> I am not aware of any code outside the SCSI core that modifies the
> host_busy
> member.
> 
> Thanks,
> 
> Bart.
> 
> 
> 

As part of testing latest upstream in MQ and non-MQ I intend to test
this patch series fully on actual hardware
F/C 8G to memory backed array LUNS and of course SRP/RDMA
I have started working on this and will report back.
Thanks
Laurence

Re: [PATCH] Revert "blk-mq: remove code for dealing with remapping queue"

2018-04-24 Thread Laurence Oberman

On Wed, 2018-04-25 at 04:01 +0800, Ming Lei wrote:
> This reverts commit 37c7c6c76d431dd7ef9c29d95f6052bd425f004c.
> 
> Turns out some drivers(most are FC drivers) may not use managed
> IRQ affinity, and has their customized .map_queues meantime, so
> still keep this code for avoiding regression.
> 
> Reported-by: Laurence Oberman <lober...@redhat.com>
> Cc: Ewan Milne <emi...@redhat.com>
> Cc: Stefan Haberland <s...@linux.vnet.ibm.com>
> Cc: Christian Borntraeger <borntrae...@de.ibm.com>
> Cc: Christoph Hellwig <h...@lst.de>
> Cc: Sagi Grimberg <s...@grimberg.me>
> Signed-off-by: Ming Lei <ming@redhat.com>
> ---
>  block/blk-mq.c | 34 +++---
>  1 file changed, 31 insertions(+), 3 deletions(-)
> 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e39276852638..f4891b792d0d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2407,7 +2407,7 @@ static void blk_mq_free_map_and_requests(struct
> blk_mq_tag_set *set,
>  
>  static void blk_mq_map_swqueue(struct request_queue *q)
>  {
> - unsigned int i;
> + unsigned int i, hctx_idx;
>   struct blk_mq_hw_ctx *hctx;
>   struct blk_mq_ctx *ctx;
>   struct blk_mq_tag_set *set = q->tag_set;
> @@ -2424,8 +2424,23 @@ static void blk_mq_map_swqueue(struct
> request_queue *q)
>  
>   /*
>    * Map software to hardware queues.
> +  *
> +  * If the cpu isn't present, the cpu is mapped to first
> hctx.
>    */
>   for_each_possible_cpu(i) {
> + hctx_idx = q->mq_map[i];
> + /* unmapped hw queue can be remapped after CPU topo
> changed */
> + if (!set->tags[hctx_idx] &&
> + !__blk_mq_alloc_rq_map(set, hctx_idx)) {
> + /*
> +  * If tags initialization fail for some
> hctx,
> +  * that hctx won't be brought online.  In
> this
> +  * case, remap the current ctx to hctx[0]
> which
> +  * is guaranteed to always have tags
> allocated
> +  */
> + q->mq_map[i] = 0;
> + }
> +
>   ctx = per_cpu_ptr(q->queue_ctx, i);
>   hctx = blk_mq_map_queue(q, i);
>  
> @@ -2437,8 +2452,21 @@ static void blk_mq_map_swqueue(struct
> request_queue *q)
>   mutex_unlock(>sysfs_lock);
>  
>   queue_for_each_hw_ctx(q, hctx, i) {
> - /* every hctx should get mapped by at least one CPU
> */
> - WARN_ON(!hctx->nr_ctx);
> + /*
> +  * If no software queues are mapped to this hardware
> queue,
> +  * disable it and free the request entries.
> +  */
> + if (!hctx->nr_ctx) {
> + /* Never unmap queue 0.  We need it as a
> +  * fallback in case of a new remap fails
> +  * allocation
> +  */
> + if (i && set->tags[i])
> +     blk_mq_free_map_and_requests(set,
> i);
> +
> + hctx->tags = NULL;
> + continue;
> + }
>  
>   hctx->tags = set->tags[i];
>   WARN_ON(!hctx->tags);

Hello Ming

Thanks for getting this fixed so quickly.
The dumpstack is gone now of course.

Tested-by: Laurence Oberman <lober...@redhat.com>

Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue

2018-03-08 Thread Laurence Oberman

On Thu, 2018-03-08 at 21:42 +0800, Ming Lei wrote:
> On Wed, Mar 07, 2018 at 09:11:37AM -0500, Laurence Oberman wrote:
> > On Tue, 2018-03-06 at 14:24 -0500, Martin K. Petersen wrote:
> > > Ming,
> > > 
> > > > Given both Don and Laurence have verified that patch 1 and
> > > > patch 2
> > > > does fix IO hang, could you consider to merge the two first?
> > > 
> > > Oh, and I would still need a formal Acked-by: from Don and
> > > Tested-by:
> > > from Laurence.
> > > 
> > > Also, for 4.16/scsi-fixes I would prefer verification to be done
> > > with
> > > just patch 1/8 and none of the subsequent changes in place. Just
> > > to
> > > make
> > > sure we're testing the right thing.
> > > 
> > > Thanks!
> > > 
> > 
> > Hello Martin
> > 
> > I tested just Patch 1/8 from the V3 series.
> > No issues running workload and no issues booting on the DL380G7.
> > Don can you ack this so we can at least get this one in.
> > 
> > Against: 4.16.0-rc4.v31of8+ on an x86_64
> > 
> > Tested-by: Laurence Oberman <lober...@redhat.com>
> 
> Hi Laurence,
> 
> Thanks for your test!
> 
> Could you test patch 2 too since you have megaraid_sas controller?
> 
> Looks it is better to split the fix patches from the current
> patchset,
> since these fixes should be for V4.16.
> 
> Thanks
> Ming

Ho Ming I see a V4 now so I am going to wait until you split these and
then I will test both HPSA and megaraid_sas once Kashyap agrees.

When I see a V4 show up with the split will pull and act on it.

Thanks
Laurence

Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue

2018-03-07 Thread Laurence Oberman

On Tue, 2018-03-06 at 14:24 -0500, Martin K. Petersen wrote:
> Ming,
> 
> > Given both Don and Laurence have verified that patch 1 and patch 2
> > does fix IO hang, could you consider to merge the two first?
> 
> Oh, and I would still need a formal Acked-by: from Don and Tested-by:
> from Laurence.
> 
> Also, for 4.16/scsi-fixes I would prefer verification to be done with
> just patch 1/8 and none of the subsequent changes in place. Just to
> make
> sure we're testing the right thing.
> 
> Thanks!
> 

Hello Martin

I tested just Patch 1/8 from the V3 series.
No issues running workload and no issues booting on the DL380G7.
Don can you ack this so we can at least get this one in.

Against: 4.16.0-rc4.v31of8+ on an x86_64

Tested-by: Laurence Oberman <lober...@redhat.com>

Thanks
Laurence

Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue

2018-03-02 Thread Laurence Oberman

On Fri, 2018-03-02 at 15:03 +, Don Brace wrote:
> > -Original Message-
> > From: Laurence Oberman [mailto:lober...@redhat.com]
> > Sent: Friday, March 02, 2018 8:09 AM
> > To: Ming Lei <ming@redhat.com>
> > Cc: Don Brace <don.br...@microsemi.com>; Jens Axboe <axboe@kernel.d
> > k>;
> > linux-block@vger.kernel.org; Christoph Hellwig <h...@infradead.org>;
> > Mike
> > Snitzer <snit...@redhat.com>; linux-s...@vger.kernel.org; Hannes
> > Reinecke
> > <h...@suse.de>; Arun Easi <arun.e...@cavium.com>; Omar Sandoval
> > <osan...@fb.com>; Martin K . Petersen <martin.peter...@oracle.com>;
> > James
> > Bottomley <james.bottom...@hansenpartnership.com>; Christoph
> > Hellwig
> > <h...@lst.de>; Kashyap Desai <kashyap.de...@broadcom.com>; Peter
> > Rivera
> > <peter.riv...@broadcom.com>; Meelis Roos <mr...@linux.ee>
> > Subject: Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply
> > queue
> > 
> > EXTERNAL EMAIL
> > 
> > 
> > On Fri, 2018-03-02 at 10:16 +0800, Ming Lei wrote:
> > > On Thu, Mar 01, 2018 at 04:19:34PM -0500, Laurence Oberman wrote:
> > > > On Thu, 2018-03-01 at 14:01 -0500, Laurence Oberman wrote:
> > > > > On Thu, 2018-03-01 at 16:18 +, Don Brace wrote:
> > > > > > > -Original Message-
> > > > > > > From: Ming Lei [mailto:ming@redhat.com]
> > > > > > > Sent: Tuesday, February 27, 2018 4:08 AM
> > > > > > > To: Jens Axboe <ax...@kernel.dk>; linux-block@vger.kernel
> > > > > > > .org
> > > > > > > ;
> > > > > > > Christoph
> > > > > > > Hellwig <h...@infradead.org>; Mike Snitzer <snitzer@redhat
> > > > > > > .com
> > > > > > > > 
> > > > > > > 
> > > > > > > Cc: linux-s...@vger.kernel.org; Hannes Reinecke <hare@sus
> > > > > > > e.de
> > > > > > > > ;
> > > > > > > 
> > > > > > > Arun Easi
> > > > > > > <arun.e...@cavium.com>; Omar Sandoval <osan...@fb.com>;
> > > > > > > Martin K
> > > > > > > .
> > > > > > > Petersen <martin.peter...@oracle.com>; James Bottomley
> > > > > > > <james.bottom...@hansenpartnership.com>; Christoph
> > > > > > > Hellwig  > > > > > > ch@l
> > > > > > > st
> > > > > > > .de>;
> > > > > > > Don Brace <don.br...@microsemi.com>; Kashyap Desai
> > > > > > > <kashyap.de...@broadcom.com>; Peter Rivera <peter.rivera@
> > > > > > > broa
> > > > > > > dcom
> > > > > > > .c
> > > > > > > om>;
> > > > > > > Laurence Oberman <lober...@redhat.com>; Ming Lei
> > > > > > > <ming@redhat.com>; Meelis Roos <mr...@linux.ee>
> > > > > > > Subject: [PATCH V3 1/8] scsi: hpsa: fix selection of
> > > > > > > reply
> > > > > > > queue
> > > > > > > 
> > > 
> > > Seems Don run into IO failure without blk-mq, could you run your
> > > tests again
> > > in legacy mode?
> > > 
> > > Thanks,
> > > Ming
> > 
> > Hello Ming
> > I ran multiple passes on Legacy and still see no issues in my test
> > bed
> > 
> > BOOT_IMAGE=/vmlinuz-4.16.0-rc2.ming+ root=UUID=43f86d71-b1bf-4789-
> > a28e-
> > 21c6ddc90195 ro crashkernel=256M@64M log_buf_len=64M
> > console=ttyS1,115200n8
> > 
> > HEAD of the git kernel I am using
> > 
> > 694e16f scsi: megaraid: improve scsi_mq performance via
> > .host_tagset
> > 793686c scsi: hpsa: improve scsi_mq performance via .host_tagset
> > 60d5b36 block: null_blk: introduce module parameter of
> > 'g_host_tags'
> > 8847067 scsi: Add template flag 'host_tagset'
> > a8fbdd6 blk-mq: introduce BLK_MQ_F_HOST_TAGS
> > 4710fab blk-mq: introduce 'start_tag' field to 'struct blk_mq_tags'
> > 09bb153 scsi: megaraid_sas: fix selection of reply queue
> > 52700d8 scsi: hpsa: fix selection of reply queue
> 
> I checkout out Linus's tree (4.16.0-rc3+) and re-applied the above
> patches.
> I  and have been running 24 hours with no issues.
> Evidently my forked copy was corrupted. 
> 
> So, my I/O testing has gone well. 
> 
> I'll run some performance numbers next.
> 
> Thanks,
> Don

Unless Kashyap is not happy we need to consider getting this in to
Linus now because we are seeing HPE servers that keep hanging now with
the original commit now upstream.

Kashyap, are you good with the v3 patchset or still concerned with
performance. I was getting pretty good IOPS/sec to individual SSD
drives set up as jbod devices on the megaraid_sas.

With larger I/O sizes like 1MB I was getting good MB/sec and not seeing
a measurable performance impact.

I dont have the hardware you have to mimic your configuration.

Thanks
Laurence

Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue

2018-03-02 Thread Laurence Oberman

On Fri, 2018-03-02 at 10:16 +0800, Ming Lei wrote:
> On Thu, Mar 01, 2018 at 04:19:34PM -0500, Laurence Oberman wrote:
> > On Thu, 2018-03-01 at 14:01 -0500, Laurence Oberman wrote:
> > > On Thu, 2018-03-01 at 16:18 +, Don Brace wrote:
> > > > > -Original Message-
> > > > > From: Ming Lei [mailto:ming@redhat.com]
> > > > > Sent: Tuesday, February 27, 2018 4:08 AM
> > > > > To: Jens Axboe <ax...@kernel.dk>; linux-block@vger.kernel.org
> > > > > ;
> > > > > Christoph
> > > > > Hellwig <h...@infradead.org>; Mike Snitzer <snit...@redhat.com
> > > > > >
> > > > > Cc: linux-s...@vger.kernel.org; Hannes Reinecke <h...@suse.de
> > > > > >;
> > > > > Arun Easi
> > > > > <arun.e...@cavium.com>; Omar Sandoval <osan...@fb.com>;
> > > > > Martin K
> > > > > .
> > > > > Petersen <martin.peter...@oracle.com>; James Bottomley
> > > > > <james.bottom...@hansenpartnership.com>; Christoph Hellwig  > > > > ch@l
> > > > > st
> > > > > .de>;
> > > > > Don Brace <don.br...@microsemi.com>; Kashyap Desai
> > > > > <kashyap.de...@broadcom.com>; Peter Rivera <peter.rivera@broa
> > > > > dcom
> > > > > .c
> > > > > om>;
> > > > > Laurence Oberman <lober...@redhat.com>; Ming Lei
> > > > > <ming@redhat.com>; Meelis Roos <mr...@linux.ee>
> > > > > Subject: [PATCH V3 1/8] scsi: hpsa: fix selection of reply
> > > > > queue
> > > > > 
> > > > > EXTERNAL EMAIL
> > > > > 
> > > > > 
> > > > > From 84676c1f21 (genirq/affinity: assign vectors to all
> > > > > possible
> > > > > CPUs),
> > > > > one msix vector can be created without any online CPU mapped,
> > > > > then
> > > > > one
> > > > > command's completion may not be notified.
> > > > > 
> > > > > This patch setups mapping between cpu and reply queue
> > > > > according
> > > > > to
> > > > > irq
> > > > > affinity info retrived by pci_irq_get_affinity(), and uses
> > > > > this
> > > > > mapping
> > > > > table to choose reply queue for queuing one command.
> > > > > 
> > > > > Then the chosen reply queue has to be active, and fixes IO
> > > > > hang
> > > > > caused
> > > > > by using inactive reply queue which doesn't have any online
> > > > > CPU
> > > > > mapped.
> > > > > 
> > > > > Cc: Hannes Reinecke <h...@suse.de>
> > > > > Cc: Arun Easi <arun.e...@cavium.com>
> > > > > Cc: "Martin K. Petersen" <martin.peter...@oracle.com>,
> > > > > Cc: James Bottomley <james.bottom...@hansenpartnership.com>,
> > > > > Cc: Christoph Hellwig <h...@lst.de>,
> > > > > Cc: Don Brace <don.br...@microsemi.com>
> > > > > Cc: Kashyap Desai <kashyap.de...@broadcom.com>
> > > > > Cc: Peter Rivera <peter.riv...@broadcom.com>
> > > > > Cc: Laurence Oberman <lober...@redhat.com>
> > > > > Cc: Meelis Roos <mr...@linux.ee>
> > > > > Fixes: 84676c1f21e8 ("genirq/affinity: assign vectors to all
> > > > > possible CPUs")
> > > > > Signed-off-by: Ming Lei <ming@redhat.com>
> > > > 
> > > > I am getting some issues that need to be tracked down:
> > > > 
> > > > [ 1636.032984] hpsa :87:00.0: Acknowledging event:
> > > > 0xc032
> > > > (HP
> > > > SSD Smart Path configuration change)
> > > > [ 1638.510656] hpsa :87:00.0: scsi 3:0:8:0: updated Direct-
> > > > Access HP   MO0400JDVEU  PHYS DRV SSDSmartPathCap-
> > > > En-
> > > > Exp=0
> > > > [ 1653.967695] hpsa :87:00.0: Acknowledging event:
> > > > 0x8020
> > > > (HP
> > > > SSD Smart Path configuration change)
> > > > [ 1656.770377] hpsa :87:00.0: scsi 3:0:8:0: updated Direct-
> > > > Access HP   MO0400JDVEU  PHYS DRV SSDSmartPathCap-
> > > > En-
> > > > Exp=0
>

Re: [PATCH V3 0/8] blk-mq & scsi: fix reply queue selection and improve host wide tagset

2018-03-01 Thread Laurence Oberman

On Tue, 2018-02-27 at 18:07 +0800, Ming Lei wrote:
> Hi All,
> 
> The 1st two patches fixes reply queue selection, and this issue has
> been
> reported and can cause IO hang during booting, please consider the
> two
> for V4.16.
> 
> The following 6 patches try to improve hostwide tagset on hpsa and
> megaraid_sas by making hw queue per NUMA node.
> 
> I don't have high-performance hpsa and megaraid_sas device at hand.
> 
> Don Brace, could you test this patchset on concurrent IOs over you
> hpsa
> SSD and see if this approach is well?
> 
> Kashyap, could you test this patchset on your megaraid_sas SSDs?
> 
>   gitweb: https://github.com/ming1/linux/tree/v4.16-rc-host-tags-
> v3.2
> 
> thanks,
> Ming
> 
> Hannes Reinecke (1):
>   scsi: Add template flag 'host_tagset'
> 
> Ming Lei (7):
>   scsi: hpsa: fix selection of reply queue
>   scsi: megaraid_sas: fix selection of reply queue
>   blk-mq: introduce 'start_tag' field to 'struct blk_mq_tags'
>   blk-mq: introduce BLK_MQ_F_HOST_TAGS
>   block: null_blk: introduce module parameter of 'g_host_tags'
>   scsi: hpsa: improve scsi_mq performance via .host_tagset
>   scsi: megaraid: improve scsi_mq performance via .host_tagset
> 
>  block/blk-mq-debugfs.c  |  2 +
>  block/blk-mq-sched.c|  2 +-
>  block/blk-mq-tag.c  | 13 +++--
>  block/blk-mq-tag.h  | 11 ++--
>  block/blk-mq.c  | 50 +++---
>  block/blk-mq.h  |  3 +-
>  drivers/block/null_blk.c|  6 +++
>  drivers/scsi/hpsa.c | 79
> ++---
>  drivers/scsi/hpsa.h |  1 +
>  drivers/scsi/megaraid/megaraid_sas.h|  2 +-
>  drivers/scsi/megaraid/megaraid_sas_base.c   | 40 ++-
>  drivers/scsi/megaraid/megaraid_sas_fusion.c | 12 ++---
>  drivers/scsi/scsi_lib.c |  2 +
>  include/linux/blk-mq.h  |  2 +
>  include/scsi/scsi_host.h|  3 ++
>  15 files changed, 182 insertions(+), 46 deletions(-)
> 

For the patchset above
All functional I/O tests and boot tests passed with multiple concurrent
fio runs.

Original HPSA booting issue is also resolved and its important or we
will have to revert original genirq commit
commit 84676c1f21e8ff54befe985f4f14dc1edc10046b

Tested-by: Laurence Oberman <lober...@redhat.com>

Thanks
Laurence

Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue

2018-03-01 Thread Laurence Oberman

On Thu, 2018-03-01 at 14:01 -0500, Laurence Oberman wrote:
> On Thu, 2018-03-01 at 16:18 +, Don Brace wrote:
> > > -Original Message-
> > > From: Ming Lei [mailto:ming@redhat.com]
> > > Sent: Tuesday, February 27, 2018 4:08 AM
> > > To: Jens Axboe <ax...@kernel.dk>; linux-block@vger.kernel.org;
> > > Christoph
> > > Hellwig <h...@infradead.org>; Mike Snitzer <snit...@redhat.com>
> > > Cc: linux-s...@vger.kernel.org; Hannes Reinecke <h...@suse.de>;
> > > Arun Easi
> > > <arun.e...@cavium.com>; Omar Sandoval <osan...@fb.com>; Martin K
> > > .
> > > Petersen <martin.peter...@oracle.com>; James Bottomley
> > > <james.bottom...@hansenpartnership.com>; Christoph Hellwig <hch@l
> > > st
> > > .de>;
> > > Don Brace <don.br...@microsemi.com>; Kashyap Desai
> > > <kashyap.de...@broadcom.com>; Peter Rivera <peter.rivera@broadcom
> > > .c
> > > om>;
> > > Laurence Oberman <lober...@redhat.com>; Ming Lei
> > > <ming@redhat.com>; Meelis Roos <mr...@linux.ee>
> > > Subject: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue
> > > 
> > > EXTERNAL EMAIL
> > > 
> > > 
> > > From 84676c1f21 (genirq/affinity: assign vectors to all possible
> > > CPUs),
> > > one msix vector can be created without any online CPU mapped,
> > > then
> > > one
> > > command's completion may not be notified.
> > > 
> > > This patch setups mapping between cpu and reply queue according
> > > to
> > > irq
> > > affinity info retrived by pci_irq_get_affinity(), and uses this
> > > mapping
> > > table to choose reply queue for queuing one command.
> > > 
> > > Then the chosen reply queue has to be active, and fixes IO hang
> > > caused
> > > by using inactive reply queue which doesn't have any online CPU
> > > mapped.
> > > 
> > > Cc: Hannes Reinecke <h...@suse.de>
> > > Cc: Arun Easi <arun.e...@cavium.com>
> > > Cc: "Martin K. Petersen" <martin.peter...@oracle.com>,
> > > Cc: James Bottomley <james.bottom...@hansenpartnership.com>,
> > > Cc: Christoph Hellwig <h...@lst.de>,
> > > Cc: Don Brace <don.br...@microsemi.com>
> > > Cc: Kashyap Desai <kashyap.de...@broadcom.com>
> > > Cc: Peter Rivera <peter.riv...@broadcom.com>
> > > Cc: Laurence Oberman <lober...@redhat.com>
> > > Cc: Meelis Roos <mr...@linux.ee>
> > > Fixes: 84676c1f21e8 ("genirq/affinity: assign vectors to all
> > > possible CPUs")
> > > Signed-off-by: Ming Lei <ming@redhat.com>
> > 
> > I am getting some issues that need to be tracked down:
> > 
> > [ 1636.032984] hpsa :87:00.0: Acknowledging event: 0xc032
> > (HP
> > SSD Smart Path configuration change)
> > [ 1638.510656] hpsa :87:00.0: scsi 3:0:8:0: updated Direct-
> > Access HP   MO0400JDVEU  PHYS DRV SSDSmartPathCap- En-
> > Exp=0
> > [ 1653.967695] hpsa :87:00.0: Acknowledging event: 0x8020
> > (HP
> > SSD Smart Path configuration change)
> > [ 1656.770377] hpsa :87:00.0: scsi 3:0:8:0: updated Direct-
> > Access HP   MO0400JDVEU  PHYS DRV SSDSmartPathCap- En-
> > Exp=0
> > [ 2839.762267] hpsa :87:00.0: Acknowledging event: 0x8020
> > (HP
> > SSD Smart Path configuration change)
> > [ 2840.841290] hpsa :87:00.0: scsi 3:0:8:0: updated Direct-
> > Access HP   MO0400JDVEU  PHYS DRV SSDSmartPathCap- En-
> > Exp=0
> > [ 2917.582653] hpsa :87:00.0: Acknowledging event: 0xc020
> > (HP
> > SSD Smart Path configuration change)
> > [ 2919.087191] hpsa :87:00.0: scsi 3:1:0:1: updated Direct-
> > Access HP   LOGICAL VOLUME   RAID-5 SSDSmartPathCap+ En+
> > Exp=1
> > [ 2919.142527] hpsa :87:00.0: hpsa_figure_phys_disk_ptrs:
> > [3:1:0:2] A phys disk component of LV is missing, turning off
> > offload_enabled for LV.
> > [ 2919.203915] hpsa :87:00.0: hpsa_figure_phys_disk_ptrs:
> > [3:1:0:2] A phys disk component of LV is missing, turning off
> > offload_enabled for LV.
> > [ 2919.266921] hpsa :87:00.0: hpsa_figure_phys_disk_ptrs:
> > [3:1:0:2] A phys disk component of LV is missing, turning off
> > offload_enabled for LV.
> > [ 2934.999629] hpsa :87:00.0: Acknowledging event: 0x4000
> > (HP
> > SSD

Re: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue

2018-03-01 Thread Laurence Oberman

On Thu, 2018-03-01 at 16:18 +, Don Brace wrote:
> > -Original Message-
> > From: Ming Lei [mailto:ming@redhat.com]
> > Sent: Tuesday, February 27, 2018 4:08 AM
> > To: Jens Axboe <ax...@kernel.dk>; linux-block@vger.kernel.org;
> > Christoph
> > Hellwig <h...@infradead.org>; Mike Snitzer <snit...@redhat.com>
> > Cc: linux-s...@vger.kernel.org; Hannes Reinecke <h...@suse.de>;
> > Arun Easi
> > <arun.e...@cavium.com>; Omar Sandoval <osan...@fb.com>; Martin K .
> > Petersen <martin.peter...@oracle.com>; James Bottomley
> > <james.bottom...@hansenpartnership.com>; Christoph Hellwig <hch@lst
> > .de>;
> > Don Brace <don.br...@microsemi.com>; Kashyap Desai
> > <kashyap.de...@broadcom.com>; Peter Rivera <peter.rivera@broadcom.c
> > om>;
> > Laurence Oberman <lober...@redhat.com>; Ming Lei
> > <ming@redhat.com>; Meelis Roos <mr...@linux.ee>
> > Subject: [PATCH V3 1/8] scsi: hpsa: fix selection of reply queue
> > 
> > EXTERNAL EMAIL
> > 
> > 
> > From 84676c1f21 (genirq/affinity: assign vectors to all possible
> > CPUs),
> > one msix vector can be created without any online CPU mapped, then
> > one
> > command's completion may not be notified.
> > 
> > This patch setups mapping between cpu and reply queue according to
> > irq
> > affinity info retrived by pci_irq_get_affinity(), and uses this
> > mapping
> > table to choose reply queue for queuing one command.
> > 
> > Then the chosen reply queue has to be active, and fixes IO hang
> > caused
> > by using inactive reply queue which doesn't have any online CPU
> > mapped.
> > 
> > Cc: Hannes Reinecke <h...@suse.de>
> > Cc: Arun Easi <arun.e...@cavium.com>
> > Cc: "Martin K. Petersen" <martin.peter...@oracle.com>,
> > Cc: James Bottomley <james.bottom...@hansenpartnership.com>,
> > Cc: Christoph Hellwig <h...@lst.de>,
> > Cc: Don Brace <don.br...@microsemi.com>
> > Cc: Kashyap Desai <kashyap.de...@broadcom.com>
> > Cc: Peter Rivera <peter.riv...@broadcom.com>
> > Cc: Laurence Oberman <lober...@redhat.com>
> > Cc: Meelis Roos <mr...@linux.ee>
> > Fixes: 84676c1f21e8 ("genirq/affinity: assign vectors to all
> > possible CPUs")
> > Signed-off-by: Ming Lei <ming@redhat.com>
> 
> I am getting some issues that need to be tracked down:
> 
> [ 1636.032984] hpsa :87:00.0: Acknowledging event: 0xc032 (HP
> SSD Smart Path configuration change)
> [ 1638.510656] hpsa :87:00.0: scsi 3:0:8:0: updated Direct-
> Access HP   MO0400JDVEU  PHYS DRV SSDSmartPathCap- En-
> Exp=0
> [ 1653.967695] hpsa :87:00.0: Acknowledging event: 0x8020 (HP
> SSD Smart Path configuration change)
> [ 1656.770377] hpsa :87:00.0: scsi 3:0:8:0: updated Direct-
> Access HP   MO0400JDVEU  PHYS DRV SSDSmartPathCap- En-
> Exp=0
> [ 2839.762267] hpsa :87:00.0: Acknowledging event: 0x8020 (HP
> SSD Smart Path configuration change)
> [ 2840.841290] hpsa :87:00.0: scsi 3:0:8:0: updated Direct-
> Access HP   MO0400JDVEU  PHYS DRV SSDSmartPathCap- En-
> Exp=0
> [ 2917.582653] hpsa :87:00.0: Acknowledging event: 0xc020 (HP
> SSD Smart Path configuration change)
> [ 2919.087191] hpsa :87:00.0: scsi 3:1:0:1: updated Direct-
> Access HP   LOGICAL VOLUME   RAID-5 SSDSmartPathCap+ En+
> Exp=1
> [ 2919.142527] hpsa :87:00.0: hpsa_figure_phys_disk_ptrs:
> [3:1:0:2] A phys disk component of LV is missing, turning off
> offload_enabled for LV.
> [ 2919.203915] hpsa :87:00.0: hpsa_figure_phys_disk_ptrs:
> [3:1:0:2] A phys disk component of LV is missing, turning off
> offload_enabled for LV.
> [ 2919.266921] hpsa :87:00.0: hpsa_figure_phys_disk_ptrs:
> [3:1:0:2] A phys disk component of LV is missing, turning off
> offload_enabled for LV.
> [ 2934.999629] hpsa :87:00.0: Acknowledging event: 0x4000 (HP
> SSD Smart Path state change)
> [ 2936.937333] hpsa :87:00.0: hpsa_figure_phys_disk_ptrs:
> [3:1:0:2] A phys disk component of LV is missing, turning off
> offload_enabled for LV.
> [ 2936.998707] hpsa :87:00.0: hpsa_figure_phys_disk_ptrs:
> [3:1:0:2] A phys disk component of LV is missing, turning off
> offload_enabled for LV.
> [ 2937.060101] hpsa :87:00.0: hpsa_figure_phys_disk_ptrs:
> [3:1:0:2] A phys disk component of LV is missing, turning off
> offload_enabled for LV.
> [ 3619.711122] sd 3:1:0:3: [sde] tag#436 FAILED Result:
> hostbyte=DID_OK driverbyte=DRIVER_SENSE

Re: [PATCH V3 8/8] scsi: megaraid: improve scsi_mq performance via .host_tagset

2018-02-28 Thread Laurence Oberman

On Wed, 2018-02-28 at 23:21 +0800, Ming Lei wrote:
> On Wed, Feb 28, 2018 at 08:28:48PM +0530, Kashyap Desai wrote:
> > Ming -
> > 
> > Quick testing on my setup -  Performance slightly degraded (4-5%
> > drop)for
> > megaraid_sas driver with this patch. (From 1610K IOPS it goes to
> > 1544K)
> > I confirm that after applying this patch, we have #queue = #numa
> > node.
> > 
> > ls -l
> > /sys/devices/pci:80/:80:02.0/:83:00.0/host10/target10:2
> > :23/10:
> > 2:23:0/block/sdy/mq
> > total 0
> > drwxr-xr-x. 18 root root 0 Feb 28 09:53 0
> > drwxr-xr-x. 18 root root 0 Feb 28 09:53 1
> 
> OK, thanks for your test.
> 
> As I mentioned to you, this patch should have improved performance on
> megaraid_sas, but the current slight degrade might be caused by
> scsi_host_queue_ready() in scsi_queue_rq(), I guess.
> 
> With .host_tagset enabled and use per-numa-node hw queue, request can
> be
> queued to lld more frequently/quick than single queue, then the cost
> of
> atomic_inc_return(>host_busy) may be increased much meantime,
> think about millions of such operations, and finally slight IOPS drop
> is observed when the hw queue depth becomes half of .can_queue.
> 
> > 
> > 
> > I would suggest to skip megaraid_sas driver changes using
> > shared_tagset
> > until and unless there is obvious gain. If overall interface of
> > using
> > shared_tagset is commit in kernel tree, we will investigate
> > (megaraid_sas
> > driver) in future about real benefit of using it.
> 
> I'd suggest to not merge it until it is proved that performance can
> be
> improved in real device.
> 
> I will try to work to remove the expensive atomic_inc_return(
> >host_busy)
> from scsi_queue_rq(), since it isn't needed for SCSI_MQ, once it is
> done, will
> ask you to test again.
> 
> 
> Thanks,
> Ming

I will test this here as well
I just put the Megaraid card in to my system here

Kashyap, do you have ssd's on the back-end and are you you using jbods
or virtual devices. Let me have your config.
I only have 6G sas shelves though.

Regards
Laurence

Re: [PATCH V2 8/8] scsi: hpsa: use blk_mq to solve irq affinity issue

2018-02-05 Thread Laurence Oberman

On Mon, 2018-02-05 at 23:20 +0800, Ming Lei wrote:
> This patch uses .force_blk_mq to drive HPSA via SCSI_MQ, meantime
> maps
> each reply queue to blk_mq's hw queue, then .queuecommand can always
> choose the hw queue as the reply queue. And if no any online CPU is
> mapped to one hw queue, request can't be submitted to this hw queue
> at all, finally the irq affinity issue is solved.
> 
> Cc: Hannes Reinecke <h...@suse.de>
> Cc: Arun Easi <arun.e...@cavium.com>
> Cc: Omar Sandoval <osan...@fb.com>,
> Cc: "Martin K. Petersen" <martin.peter...@oracle.com>,
> Cc: James Bottomley <james.bottom...@hansenpartnership.com>,
> Cc: Christoph Hellwig <h...@lst.de>,
> Cc: Don Brace <don.br...@microsemi.com>
> Cc: Kashyap Desai <kashyap.de...@broadcom.com>
> Cc: Peter Rivera <peter.riv...@broadcom.com>
> Cc: Paolo Bonzini <pbonz...@redhat.com>
> Cc: Mike Snitzer <snit...@redhat.com>
> Tested-by: Laurence Oberman <lober...@redhat.com>
> Signed-off-by: Ming Lei <ming@redhat.com>
> ---
>  drivers/scsi/hpsa.c | 51 ++-
> 
>  1 file changed, 34 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
> index 443eabf63a9f..e517a4c74a28 100644
> --- a/drivers/scsi/hpsa.c
> +++ b/drivers/scsi/hpsa.c
> @@ -51,6 +51,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include "hpsa_cmd.h"
> @@ -956,6 +957,13 @@ static struct device_attribute
> *hpsa_shost_attrs[] = {
>  #define HPSA_NRESERVED_CMDS  (HPSA_CMDS_RESERVED_FOR_DRIVER +\
>    HPSA_MAX_CONCURRENT_PASSTHRUS)
>  
> +static int hpsa_map_queues(struct Scsi_Host *shost)
> +{
> +struct ctlr_info *h = shost_to_hba(shost);
> +
> +return blk_mq_pci_map_queues(>tag_set, h->pdev);
> +}
> +
>  static struct scsi_host_template hpsa_driver_template = {
>   .module = THIS_MODULE,
>   .name   = HPSA,
> @@ -974,10 +982,13 @@ static struct scsi_host_template
> hpsa_driver_template = {
>  #ifdef CONFIG_COMPAT
>   .compat_ioctl   = hpsa_compat_ioctl,
>  #endif
> + .map_queues = hpsa_map_queues,
>   .sdev_attrs = hpsa_sdev_attrs,
>   .shost_attrs = hpsa_shost_attrs,
>   .max_sectors = 1024,
>   .no_write_same = 1,
> + .force_blk_mq = 1,
> + .host_tagset = 1,
>  };
>  
>  static inline u32 next_command(struct ctlr_info *h, u8 q)
> @@ -1045,11 +1056,7 @@ static void set_performant_mode(struct
> ctlr_info *h, struct CommandList *c,
>   c->busaddr |= 1 | (h->blockFetchTable[c-
> >Header.SGList] << 1);
>   if (unlikely(!h->msix_vectors))
>   return;
> - if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> - c->Header.ReplyQueue =
> - raw_smp_processor_id() % h-
> >nreply_queues;
> - else
> - c->Header.ReplyQueue = reply_queue % h-
> >nreply_queues;
> + c->Header.ReplyQueue = reply_queue;
>   }
>  }
>  
> @@ -1063,10 +1070,7 @@ static void
> set_ioaccel1_performant_mode(struct ctlr_info *h,
>    * Tell the controller to post the reply to the queue for
> this
>    * processor.  This seems to give the best I/O throughput.
>    */
> - if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> - cp->ReplyQueue = smp_processor_id() % h-
> >nreply_queues;
> - else
> - cp->ReplyQueue = reply_queue % h->nreply_queues;
> + cp->ReplyQueue = reply_queue;
>   /*
>    * Set the bits in the address sent down to include:
>    *  - performant mode bit (bit 0)
> @@ -1087,10 +1091,7 @@ static void
> set_ioaccel2_tmf_performant_mode(struct ctlr_info *h,
>   /* Tell the controller to post the reply to the queue for
> this
>    * processor.  This seems to give the best I/O throughput.
>    */
> - if (likely(reply_queue == DEFAULT_REPLY_QUEUE))
> - cp->reply_queue = smp_processor_id() % h-
> >nreply_queues;
> - else
> - cp->reply_queue = reply_queue % h->nreply_queues;
> + cp->reply_queue = reply_queue;
>   /* Set the bits in the address sent down to include:
>    *  - performant mode bit not used in ioaccel mode 2
>    *  - pull count (bits 0-3)
> @@ -1109,10 +1110,7 @@ static void
> set_ioaccel2_performant_mode(struct ctlr_info *h,
>    * Tell the controller to post the reply to th

Re: [dm-devel] [PATCH v5] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-30 Thread Laurence Oberman

On Tue, 2018-01-30 at 09:52 -0800, Bart Van Assche wrote:
> On 01/30/18 06:24, Mike Snitzer wrote:
> > +    *
> > +    * If driver returns BLK_STS_RESOURCE and
> > SCHED_RESTART
> > +    * bit is set, run queue after a delay to avoid IO
> > stalls
> > +    * that could otherwise occur if the queue is
> > idle.
> >      */
> > -   if (!blk_mq_sched_needs_restart(hctx) ||
> > +   needs_restart = blk_mq_sched_needs_restart(hctx);
> > +   if (!needs_restart ||
> >     (no_tag && list_empty_careful(
> > >dispatch_wait.entry)))
> >     blk_mq_run_hw_queue(hctx, true);
> > +   else if (needs_restart && (ret ==
> > BLK_STS_RESOURCE))
> > +   blk_mq_delay_run_hw_queue(hctx,
> > BLK_MQ_QUEUE_DELAY);
> >     }
> 
> If a request completes concurrently with execution of the above code 
> then the request completion will trigger a call of 
> blk_mq_sched_restart_hctx() and that call will clear the 
> BLK_MQ_S_SCHED_RESTART bit. If that bit is cleared before the above
> code 
> tests it then the above code will schedule an asynchronous call of 
> __blk_mq_run_hw_queue(). If the .queue_rq() call triggered by the
> new 
> queue run returns again BLK_STS_RESOURCE then the above code will be 
> executed again. In other words, a loop occurs. That loop will repeat
> as 
> long as the described race occurs. The current (kernel v4.15) block 
> layer behavior is simpler: only block drivers call 
> blk_mq_delay_run_hw_queue() and the block layer core never calls
> that 
> function. Hence that loop cannot occur with the v4.15 block layer
> core 
> and block drivers. A motivation of why that loop is preferred
> compared 
> to the current behavior (no loop) is missing. Does this mean that
> that 
> loop is a needless complication of the block layer core?
> 
> Sorry but I still prefer the v4.15 block layer approach because this 
> patch has in my view the following disadvantages:
> - It involves a blk-mq API change. API changes are always risky and
> need
>    some time to stabilize.
> - The delay after which to rerun the queue is moved from block layer
>    drivers into the block layer core. I think that's wrong because
> only
>    the block driver authors can make a good choice for this constant.
> - This patch makes block drivers harder to understand. Anyone who
> sees
>    return BLK_STS_RESOURCE / return BLK_STS_DEV_RESOURCE for the
> first
>    time will have to look up the meaning of these constants. An
> explicit
>    blk_mq_delay_run_hw_queue() call is easier to understand.
> - This patch makes the blk-mq core harder to understand because of
> the
>    loop mentioned above.
> - This patch does not fix any bugs nor makes block drivers easier to
>    read or to implement. So why is this patch considered useful?
> 
> Thanks,
> 
> Bart.

Hello Bart

What is the performance implication of your method versus this patch
above.
Is there more of a delay in your approach or in Ming's approach from
your own testing.
I guess it seems we will never get consensus on this so is it time to
take a vote. 

I respect and trust your inputs, you know that, but are you perhaps
prepared to accept the approach above and agree to it and if it turns
out to expose more issues it can be addressed later.

Is that not a better way to progress this because to me it looks like
you and Ming will continue to disagree on which is the better approach.

With much respect
Laurence

Re: [PATCH V3] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-28 Thread Laurence Oberman

On Sun, 2018-01-28 at 16:57 +, Bart Van Assche wrote:
> On Sat, 2018-01-27 at 23:58 -0500, Mike Snitzer wrote:
> > On Sat, Jan 27 2018 at 10:00pm -0500,
> > Bart Van Assche  wrote:
> > 
> > > On Sat, 2018-01-27 at 21:03 -0500, Mike Snitzer wrote:
> > > > You cannot even be forthcoming about the technical merit of a
> > > > change you
> > > > authored (commit 6077c2d70) that I'm left to clean up in the
> > > > face of
> > > > performance bottlenecks it unwittingly introduced?  If you were
> > > > being
> > > > honest: you'd grant that the random delay of 100ms is utterly
> > > > baseless
> > > > (not to mention that kicking the queue like you did is a
> > > > complete
> > > > hack).  So that 100ms delay is what my dm-4.16 commit is
> > > > talking about.
> > > 
> > > There are multiple errors in the above:
> > > 1. I have already explained in detail why commit 6077c2d70 is (a)
> > > correct
> > >    and (b) essential. See e.g. https://www.redhat.com/archives/dm
> > > -devel/2018-January/msg00168.html.
> > 
> > And you'd be wrong.  Again.  We've already established that commit
> > 6077c2d70 is _not_ essential.  Ming's V3, in conjunction with all
> > the
> > changes that already went into block and DM for 4.16, prove that.
> 
> What I wrote was right when commit 6077c2d70 got introduced. My
> explanation
> shows that at that time  was both correct and essential. Ming's v3 is
> in my
> opinion not relevant yet to this discussion because it introduces new
> bugs
> and so far no attempt has been made to address these bugs.
> 
> > > 2. With patch "blk-mq: Avoid that blk_mq_delay_run_hw_queue()
> > > introduces
> > >    unintended delays" applied, there is nothing to clean up
> > > anymore since
> > >    that patch eliminates the queue delays that were triggered by
> > >    blk_mq_delay_run_hw_queue().
> > 
> > The issue Ming fixed is that your random queue kicks break RESTART
> > on
> > dm-mq mpath.
> 
> That comment is too short to be comprehensible for anyone. Can you
> elaborate
> this further?
> 
> > > 3. You know that I'm honest. Suggesting that I'm not is wrong.
> > 
> > No, I trully do _not_ know you're always honest.  You've conflated
> > so
> > many details on this topic that it makes me have serious
> > doubts.  You're
> > lashing out so much, in defense of your dm_mq_queue_rq delayed
> > queue
> > kick, that you're missing there is a genuine generic blk-mq problem
> > that
> > is getting fixed in Ming's V3.
> > 
> > [ ... ]
> > 
> > Bart, the number of emails exchanged on this topic has exhausted
> > _everyone_.  Emotions get tense when things don't evolve despite
> > every
> > oppotunity for progress to be realized.  When people cling to
> > solutions
> > that are no longer relevant.  There is a very real need for
> > progress
> > here.  So either get out of the way or suggest a specific series of
> > change(s) that you feel better than Ming's V3.
> 
> If you and Ming really care about the approach in the patch at the
> start
> of this e-mail thread, what are you waiting for to address the
> technical
> shortcomings that I pointed out?
> 
> > With that, I'll also stop responding to your baiting now and
> > forevermore
> > (e.g. "maintainers should this.. and maintainers should not that"
> > or
> > "your employer is not investing adequately".  Next time you find
> > yourself typing drivel like that: spare us all and hit delete).
> 
> My opinion about what you wrote in this and the other e-mails you
> sent to
> me during the past months is as follows:
> 1. That I have always done my best to provide all the relevant
> technical
>    information.
> 2. That my focus was on the technical aspects of the discussion.
> 3. That you did not try to reach a consensus but rather to dominate
> the
>    discussion.
> 4. That the tone of your e-mails was very aggressive.
> 5. That you insulted me several times personally.
> 
> I believe that your behavior is a violation of the kernel code of
> conflict
> and sufficient to file a complaint with the TAB. From
> Documentation/process/code-of-conflict.rst:
> 
> "If however, anyone feels personally abused, threatened, or otherwise
> uncomfortable due to this process, that is not acceptable.  If so,
> please contact the Linux Foundation's Technical Advisory Board [ ...
> ]"
> 
> Additionally, since you are a U.S. Citizen, you should know that what
> you
> wrote in previous e-mails amounts to libel. A quote from a definition
> of
> libel: "to publish in print (including pictures), writing or
> broadcast
> through radio, television or film, an untruth about another which
> will do
> harm to that person or his/her reputation, by tending to bring the
> target
> into ridicule, hatred, scorn or contempt of others." You should be
> aware
> since I live in the U.S. that this makes it possible for me to sue
> you for
> defamation and to ask for a compensation.
> 
> Bart.

Folks

I think we need to all take some time, take a breath and lets see how
we can

Re: [PATCH V2] blk-mq: introduce BLK_STS_DEV_RESOURCE

2018-01-23 Thread Laurence Oberman

On Tue, 2018-01-23 at 09:44 -0500, Mike Snitzer wrote:
> On Tue, Jan 23 2018 at  9:27am -0500,
> Ming Lei  wrote:
> 
> > Hello Martin,
> > 
> > On Tue, Jan 23, 2018 at 08:30:41AM -0500, Martin K. Petersen wrote:
> > > 
> > > Ming,
> > > 
> > > > + * Block layer and block driver specific status, which is
> > > > ususally returnd
> > > 
> > >  
> > >  ^^^
> > > > + * from driver to block layer in IO path.
> > > 
> > > Given that the comment blurb is long and the flag not defined
> > > until
> > > later, it is not entirely obvious that you are documenting
> > > BLK_STS_DEV_RESOURCE. So please make that clear at the beginning
> > > of the
> > > comment.
> > 
> > OK, how about the following document?
> > 
> > /*
> >  * BLK_STS_DEV_RESOURC: Block layer and block driver specific
> > status,
> 
>  ^^^ typo
> 
> >  * which is usually returned from driver to block layer in IO path.
> >  *
> >  * BLK_STS_DEV_RESOURCE is returned from driver to block layer if
> > device
> >  * related resource is run out of, but driver can guarantee that
> > queue
> >  * will be rerun in future for dispatching the current request when
> > this
> >  * resource is available.
> >  *
> >  * Difference with BLK_STS_RESOURCE:
> >  * If driver isn't sure if the queue can be run again for dealing
> > with the
> >  * current request after this kind of resource is available, please
> > return
> >  * BLK_STS_SOURCE, for example, when memory allocation, DMA Mapping
> > or other
> 
>  ^^ typo
> 
> >  * system resource allocation fails and IO can't be submitted to
> > device,
> >  * BLK_STS_RESOURCE should be used for avoiding IO hang.
> >  */
> 
> In general the 2nd paragraph is one big run-on sentence.  Needs some
> isolation.

I built a new kernel from Mikes latest tree which has all Ming's latest
patches so I could sanity test this on SRP.

I do not have Bart's special parameter the (the can_queue param) but I
am running Bart's 02-mq test over and over looking for a stall.

I am running 5 parallel 02-mq runs via fio against 5 mpaths on which I
have 5 xfs file systems.

So far I have 20 loops of 5 parallel runs and I do not see any stalls
or lock-ups.


Will be good to know if Bart can test this too in his test bed.


Top of Mikes tree is this

afb07d0 Merge branch 'block-4.16_dm-4.16' of https://git.kernel.org/pub
/scm/linux/kernel/git/snitzer/linux int
o block-4.16_dm-4.16
16ce390 Merge branch 'dm-4.16' into block-4.16_dm-4.16
0b943d7 Merge branch 'block-4.16' into block-4.16_dm-4.16
f046ecc blk-mq: introduce BLK_STS_DEV_RESOURCE
4c8cf1c dm mpath selector: more evenly distribute ties
475a055 blk-throttle: use queue_is_rq_based
7923111 dm thin: fix trailing semicolon in
__remap_and_issue_shared_cell
1b3c530 dm table: fix NVMe bio-based dm_table_determine_type()
validation
2ba27c8 dm: various cleanups to md->queue initialization code
e603071 dm mpath: delay the retry of a request if the target responded
as busy
316a795 dm mpath: don't call blk_mq_delay_run_hw_queue() in case of
BLK_STS_RESOURCE
f5ced52 block: Remove kblockd_schedule_delayed_work{,_on}()
ae943d2 blk-mq: Avoid that blk_mq_delay_run_hw_queue() introduces
unintended delays
c77ff7f blk-mq: Rename blk_mq_request_direct_issue() into
blk_mq_request_issue_directly()
8c7a8d1 lib/scatterlist: Fix chaining support in sgl_alloc_order()
9b9c63f7 Merge branch 'nvme-4.16' of git://git.infradead.org/nvme into
for-4.16/block
b889bf6 blk-throttle: track read and write request individually
486e45f IB/srp: Add target_can_queue login parameter
a13553c block: add bdev_read_only() checks to common helpers
721c7fc block: fail op_is_write() requests to read-only partitions
17534c6 blk-throttle: export io_serviced_recursive,
io_service_bytes_recursive
2c2086a block: Protect less code with sysfs_lock in
blk_{un,}register_queue()
14a2349 block: Document scheduler modification locking requirements
83d016a block: Unexport elv_register_queue() and elv_unregister_queue()
8a8747d block, bfq: limit sectors served with interactive weight
raising
a52a69e block, bfq: limit tags for writes and async I/O
2f53fbe Merge branch 'block-4.16_dm-4.16' of https://git.kernel.org/pub
/scm/linux/kernel/git/snitzer/linux int
o block-4.16_dm-4.16
7d089f98 Merge branch 'dm-4.16' into block-4.16_dm-4.16

Typical collectl capture looks like this, solid I/O until completion
ZERO gaps at 1s interval captures.

# <--Disks--->
#Time KBRead  Reads KBWrit
Writes 
10:11:50   0  4 32  20777 
10:11:51   0  1 961009  56736 
10:11:52   0  0  1400K  61694 
10:11:53   0  0  1595K  53154 
10:11:54   0  0  1838K  40027 
10:11:55   0  0  1736K  39865 
10:11:56   0  0  2010K  19867 
10:11:57   0  0  1746K  36227 
10:11:58   0  0  1256K 104195 
10:11:59   0  0 928996 126062 
10:12:00   29152   7288 773564  79886

Re: [PATCH] lib/scatterlist: Fix chaining support in sgl_alloc_order()

2018-01-19 Thread Laurence Oberman

On Fri, 2018-01-19 at 11:00 -0800, Bart Van Assche wrote:
> This patch avoids that workloads with large block sizes (megabytes)
> can trigger the following call stack with the ib_srpt driver (that
> driver is the only driver that chains scatterlists allocated by
> sgl_alloc_order()):
> 
> BUG: Bad page state in process kworker/0:1H  pfn:2423a78
> page:fb03d08e9e00 count:-3 mapcount:0 mapping:  (null)
> index:0x0
> flags: 0x57c000()
> raw: 0057c000  
> fffd
> raw: dead0100 dead0200 
> 
> page dumped because: nonzero _count
> CPU: 0 PID: 733 Comm: kworker/0:1H Tainted: G  I  4.15.0-
> rc7.bart+ #1
> Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
> Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
> Call Trace:
>  dump_stack+0x5c/0x83
>  bad_page+0xf5/0x10f
>  get_page_from_freelist+0xa46/0x11b0
>  __alloc_pages_nodemask+0x103/0x290
>  sgl_alloc_order+0x101/0x180
>  target_alloc_sgl+0x2c/0x40 [target_core_mod]
>  srpt_alloc_rw_ctxs+0x173/0x2d0 [ib_srpt]
>  srpt_handle_new_iu+0x61e/0x7f0 [ib_srpt]
>  __ib_process_cq+0x55/0xa0 [ib_core]
>  ib_cq_poll_work+0x1b/0x60 [ib_core]
>  process_one_work+0x141/0x340
>  worker_thread+0x47/0x3e0
>  kthread+0xf5/0x130
>  ret_from_fork+0x1f/0x30
> 
> Fixes: e80a0af4759a ("lib/scatterlist: Introduce sgl_alloc() and
> sgl_free()")
> Reported-by: Laurence Oberman <lober...@redhat.com>
> Signed-off-by: Bart Van Assche <bart.vanass...@wdc.com>
> Cc: Nicholas A. Bellinger <n...@linux-iscsi.org>
> Cc: Laurence Oberman <lober...@redhat.com>
> ---
>  drivers/target/target_core_transport.c |  2 +-
>  include/linux/scatterlist.h|  1 +
>  lib/scatterlist.c  | 32
> +++-
>  3 files changed, 29 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/target/target_core_transport.c
> b/drivers/target/target_core_transport.c
> index a001ba711cca..c03a78ee26cd 100644
> --- a/drivers/target/target_core_transport.c
> +++ b/drivers/target/target_core_transport.c
> @@ -2300,7 +2300,7 @@ static void target_complete_ok_work(struct
> work_struct *work)
>  
>  void target_free_sgl(struct scatterlist *sgl, int nents)
>  {
> - sgl_free(sgl);
> + sgl_free_n_order(sgl, nents, 0);
>  }
>  EXPORT_SYMBOL(target_free_sgl);
>  
> diff --git a/include/linux/scatterlist.h
> b/include/linux/scatterlist.h
> index b8a7c1d1dbe3..22b2131bcdcd 100644
> --- a/include/linux/scatterlist.h
> +++ b/include/linux/scatterlist.h
> @@ -282,6 +282,7 @@ struct scatterlist *sgl_alloc_order(unsigned long
> long length,
>   gfp_t gfp, unsigned int
> *nent_p);
>  struct scatterlist *sgl_alloc(unsigned long long length, gfp_t gfp,
>     unsigned int *nent_p);
> +void sgl_free_n_order(struct scatterlist *sgl, int nents, int
> order);
>  void sgl_free_order(struct scatterlist *sgl, int order);
>  void sgl_free(struct scatterlist *sgl);
>  #endif /* CONFIG_SGL_ALLOC */
> diff --git a/lib/scatterlist.c b/lib/scatterlist.c
> index 9afc9b432083..53728d391d3a 100644
> --- a/lib/scatterlist.c
> +++ b/lib/scatterlist.c
> @@ -512,7 +512,7 @@ struct scatterlist *sgl_alloc_order(unsigned long
> long length,
>   if (!sgl)
>   return NULL;
>  
> - sg_init_table(sgl, nent);
> + sg_init_table(sgl, nalloc);
>   sg = sgl;
>   while (length) {
>   elem_len = min_t(u64, length, PAGE_SIZE << order);
> @@ -526,7 +526,7 @@ struct scatterlist *sgl_alloc_order(unsigned long
> long length,
>   length -= elem_len;
>   sg = sg_next(sg);
>   }
> - WARN_ON_ONCE(sg);
> + WARN_ONCE(length, "length = %lld\n", length);
>   if (nent_p)
>   *nent_p = nent;
>   return sgl;
> @@ -549,22 +549,44 @@ struct scatterlist *sgl_alloc(unsigned long
> long length, gfp_t gfp,
>  EXPORT_SYMBOL(sgl_alloc);
>  
>  /**
> - * sgl_free_order - free a scatterlist and its pages
> + * sgl_free_n_order - free a scatterlist and its pages
>   * @sgl: Scatterlist with one or more elements
> + * @nents: Maximum number of elements to free
>   * @order: Second argument for __free_pages()
> + *
> + * Notes:
> + * - If several scatterlists have been chained and each chain
> element is
> + *   freed separately then it's essential to set nents correctly to
> avoid that a
> + *   page would get freed twice.
> + * - All pages in a chained scatterlist can be freed at once by
> setting @nents
> + *   to a high number.
>   */

Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Laurence Oberman

On Thu, 2018-01-18 at 22:24 +, Bart Van Assche wrote:
> On Thu, 2018-01-18 at 17:18 -0500, Laurence Oberman wrote:
> > OK, I ran 5 at once of 5 separate mount points.
> > I am using 4k block sizes
> > Its solid consistent for me. No stalls no gaps.
> 
> Hi Laurence,
> 
> That's great news and thank you for having shared this information
> but I think
> it should be mentioned that you have been running my tree in which
> some recent
> block and dm patches were reverted
> (https://github.com/bvanassche/linux/tree/block-scsi-for-next)
> 
> > 
> > [global]
> > name=02-mq
> > filename=fio-output-02-mq.txt
> > rw=randwrite
> > verify=md5
> > ;rwmixread=60
> > ;rwmixwrite=40
> > bs=4K
> > ;direct=1
> > ;numjobs=4
> > ;time_based=1
> > runtime=120
> > 
> > [file1]
> > size=3G
> > ioengine=libaio
> > iodepth=16
> > 
> > I watch I/O and I see it ramp up but fio still runs and it kind of
> > shuts down.
> 
> In test "file1" I see an I/O size of 3G. Does that mean that the
> patch that
> should fix the sgl_alloc() issue is working?
> 
> Thanks,
> 
> Bart.

Hello Bart Thank you.

OK, so booting into Mike tree now and then hopefully I get the lockups.
Can you give me some idea of what to look for.
I assume I/O just stops.

I want to get this happening in-house so we have an avenue to fix this.
Following getting the stall I will attend to the slg patch test for the
SRPT side.

Please note I am not running latest SRPT right now as you know.

Regards
Back later with results

Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Laurence Oberman

On Thu, 2018-01-18 at 17:01 -0500, Mike Snitzer wrote:
> On Thu, Jan 18 2018 at  4:39pm -0500,
> Bart Van Assche  wrote:
> 
> > On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote:
> > > On Thu, Jan 18 2018 at  3:58P -0500,
> > > Bart Van Assche  wrote:
> > > 
> > > > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > > > > For Bart's test the underlying scsi-mq driver is what is
> > > > > regularly
> > > > > hitting this case in __blk_mq_try_issue_directly():
> > > > > 
> > > > > if (blk_mq_hctx_stopped(hctx) ||
> > > > > blk_queue_quiesced(q))
> > > > 
> > > > These lockups were all triggered by incorrect handling of
> > > > .queue_rq() returning BLK_STS_RESOURCE.
> > > 
> > > Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE?
> > > "Incorrect" because it no longer runs
> > > blk_mq_delay_run_hw_queue()?
> > 
> > In what I wrote I was referring to both dm_mq_queue_rq() and
> > scsi_queue_rq().
> > With "incorrect" I meant that queue lockups are introduced that
> > make user
> > space processes unkillable. That's a severe bug.
> 
> And yet Laurence cannot reproduce any such lockups with your test...
> 
> Are you absolutely certain this patch doesn't help you?
> https://patchwork.kernel.org/patch/10174037/
> 
> If it doesn't then that is actually very useful to know.
> 
> > > We have time to get this right, please stop hyperventilating
> > > about
> > > "regressions".
> > 
> > Sorry Mike but that's something I consider as an unfair comment. If
> > Ming and
> > you work on patches together, it's your job to make sure that no
> > regressions
> > are introduced. Instead of blaming me because I report these
> > regressions you
> > should be grateful that I take the time and effort to report these
> > regressions
> > early. And since you are employed by a large organization that
> > sells Linux
> > support services, your employer should invest in developing test
> > cases that
> > reach a higher coverage of the dm, SCSI and block layer code. I
> > don't think
> > that it's normal that my tests discovered several issues that were
> > not
> > discovered by Red Hat's internal test suite. That's something Red
> > Hat has to
> > address.
> 
> You have no self-awareness of just how mypoic you're being about
> this.
> 
> I'm not ignoring or blaming you for your test no longer passing.  Far
> from it.  I very much want to fix this.  But I want it fixed in a way
> that doesn't paper over the real bug(s) while also introducing blind
> queue runs that compromise the blk-mq RESTART code's ability to work
> as
> intended.
> 
> I'd have thought you could appreciate this.  We need a root cause on
> this, not hand-waving justifications on why problematic delayed queue
> runs are correct.
> 
> Please just focus on helping Laurence get his very capable testbed to
> reproduce this issue.  Once we can reproduce these "unkillable"
> "stalls"
> in-house it'll be _much_ easier to analyze and fix.
> 
> Thanks,
> Mike

Hello Bart

OK, I ran 5 at once of 5 separate mount points.
I am using 4k block sizes
Its solid consistent for me. No stalls no gaps.

[global]
name=02-mq
filename=fio-output-02-mq.txt
rw=randwrite
verify=md5
;rwmixread=60
;rwmixwrite=40
bs=4K
;direct=1
;numjobs=4
;time_based=1
runtime=120

[file1]
size=3G
ioengine=libaio
iodepth=16

I watch I/O and I see it ramp up but fio still runs and it kind of
shuts down.

#!/bin/bash
for i in 1 2 3 4 5
do
cd /data-$i 
fio /root/srp-test.lobe/fio_tests/02-mq 1>/data-$i.out 2>&1 &
done


#Time cpu sys inter  ctxsw KBRead  Reads KBWrit Writes 

17:16:09   34  12 34431  10393  0  0 249872  36094 
17:16:10   31  10 13001   2679  0  0  57652   7980 
17:16:11   32  11 19473   4416  0  0 143248  17362 
17:16:12   31   9  8068   1606  0  0  20088   2026 
17:16:13   31   9  7718   1518  0  0  15908   1354 
17:16:14   36  14 41132   9710  0  0 686216  46661 
17:16:15   39  18 63622  21033  0  0  1246K  75108 
17:16:16   38  16 76114   9013  0  0  1146K  82338 
17:16:17   33  11 31941   2735  0  0 237340  25034 
17:16:18   36  14 23900   4982  0  1  1567K  43244 
17:16:19   37  15 55884   4974  0  4  1003K  67253 
17:16:20   28  12  7542   4975  0  0  1630K   3266 
17:16:218   6 14650  34721  0  0  2535K  21206 
17:16:222   2  6655  15839  0  0  2319K   9897 
17:16:23   11  11 37305 134030  0  0  1275K  87010 
17:16:24   23  22 59266 119350   6560   1640  1102K  78034 
17:16:25   21  17 65699 144462 148052  37013 159900  22120 
17:16:26   14   9 80700 164034 216588  54147  4  1 
# <--Disks--->
#Time cpu sys inter  ctxsw KBRead  Reads KBWrit Writes 
17:16:27   14   9 78699 162095 214412  53603  0  0 
17:16:28   14   9 75895 155352 204932  51233  0  0 
17:16:29   14   9 75377 161871 214136  53534

Re: [RFC PATCH] blk-mq: fixup RESTART when queue becomes idle

2018-01-18 Thread Laurence Oberman

On Thu, 2018-01-18 at 16:23 -0500, Mike Snitzer wrote:
> On Thu, Jan 18 2018 at  3:58P -0500,
> Bart Van Assche  wrote:
> 
> > On Thu, 2018-01-18 at 15:48 -0500, Mike Snitzer wrote:
> > > For Bart's test the underlying scsi-mq driver is what is
> > > regularly
> > > hitting this case in __blk_mq_try_issue_directly():
> > > 
> > > if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q))
> > 
> > Hello Mike,
> > 
> > That code path is not the code path that triggered the lockups that
> > I reported
> > during the past days.
> 
> If you're hitting blk_mq_sched_insert_request() then you most
> certainly
> are hitting that code path.
> 
> If you aren't then what was your earlier email going on about?
> https://www.redhat.com/archives/dm-devel/2018-January/msg00372.html
> 
> If you were just focusing on that as one possible reason, that isn't
> very helpful.  By this point you really should _know_ what is
> triggering
> the stall based on the code paths taken.  Please use ftrace's
> function_graph tracer if need be.
> 
> > These lockups were all triggered by incorrect handling of
> > .queue_rq() returning BLK_STS_RESOURCE.
> 
> Please be precise, dm_mq_queue_rq()'s return of BLK_STS_RESOURCE?
> "Incorrect" because it no longer runs blk_mq_delay_run_hw_queue()?
> 
> Please try to do more work analyzing the test case that only you can
> easily run (due to srp_test being a PITA).  And less time lobbying
> for
> a change that you don't understand to _really_ be correct.
> 
> We have time to get this right, please stop hyperventilating about
> "regressions".
> 
> Thanks,
> Mike

Hello Bart
I have run a good few loops of 02-mq and its stable for me on your
tree.
I am not running the entire disconnect re-connect loops and un-mounts
etc. for good reason.
I have 35 LUNS so its very impact-full to lose them and have them come
back all the time.

Anyway
I am very happy to try reproduce this in-house so Mike and Ming can
focus on it but I need to know if all I need to do is loop over 02-mq
over and over.

Also please let me know whats debugfs and sysfs to capture and I am
happy to try help move this along.

Regards
Laurence

Re: blk-mq: don't dispatch request in blk_mq_request_direct_issue if queue is busy

2018-01-17 Thread Laurence Oberman

On Wed, 2018-01-17 at 23:36 -0500, Mike Snitzer wrote:
> On Wed, Jan 17 2018 at 11:06pm -0500,
> Ming Lei <ming@redhat.com> wrote:
> 
> > If we run into blk_mq_request_direct_issue(), when queue is busy,
> > we
> > don't want to dispatch this request into hctx->dispatch_list, and
> > what we need to do is to return the queue busy info to caller, so
> > that caller can deal with it well.
> > 
> > Fixes: 396eaf21ee ("blk-mq: improve DM's blk-mq IO merging via
> > blk_insert_cloned_request feedback")
> > Reported-by: Laurence Oberman <lober...@redhat.com>
> > Reviewed-by: Mike Snitzer <snit...@redhat.com>
> > Signed-off-by: Ming Lei <ming@redhat.com>
> > ---
> >  block/blk-mq.c | 22 ++
> >  1 file changed, 10 insertions(+), 12 deletions(-)
> > 
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 4d4af8d712da..1af7fa70993b 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -1856,15 +1856,6 @@ static blk_status_t
> > __blk_mq_issue_directly(struct blk_mq_hw_ctx *hctx,
> >     return ret;
> >  }
> >  
> > -static void __blk_mq_fallback_to_insert(struct request *rq,
> > -   bool run_queue, bool
> > bypass_insert)
> > -{
> > -   if (!bypass_insert)
> > -   blk_mq_sched_insert_request(rq, false, run_queue,
> > false);
> > -   else
> > -   blk_mq_request_bypass_insert(rq, run_queue);
> > -}
> > -
> >  static blk_status_t __blk_mq_try_issue_directly(struct
> > blk_mq_hw_ctx *hctx,
> >     struct request
> > *rq,
> >     blk_qc_t *cookie,
> > @@ -1873,9 +1864,16 @@ static blk_status_t
> > __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> >     struct request_queue *q = rq->q;
> >     bool run_queue = true;
> >  
> > -   /* RCU or SRCU read lock is needed before checking
> > quiesced flag */
> > +   /*
> > +    * RCU or SRCU read lock is needed before checking
> > quiesced flag.
> > +    *
> > +    * When queue is stopped or quiesced, ignore
> > 'bypass_insert' from
> > +    * blk_mq_request_direct_issue(), and return BLK_STS_OK to
> > caller,
> > +    * and avoid driver to try to dispatch again.
> > +    */
> >     if (blk_mq_hctx_stopped(hctx) || blk_queue_quiesced(q)) {
> >     run_queue = false;
> > +   bypass_insert = false;
> >     goto insert;
> >     }
> >  
> > @@ -1892,10 +1890,10 @@ static blk_status_t
> > __blk_mq_try_issue_directly(struct blk_mq_hw_ctx *hctx,
> >  
> >     return __blk_mq_issue_directly(hctx, rq, cookie);
> >  insert:
> > -   __blk_mq_fallback_to_insert(rq, run_queue, bypass_insert);
> >     if (bypass_insert)
> >     return BLK_STS_RESOURCE;
> >  
> > +   blk_mq_sched_insert_request(rq, false, run_queue, false);
> >     return BLK_STS_OK;
> >  }
> 
> OK so you're just leveraging blk_mq_sched_insert_request()'s
> ability to resort to__blk_mq_insert_request() if !q->elevator.

I tested this against Mike's latest combined tree and its stable.
This fixes the list corruption issue.
Many Thanks Ming and Mike.

I will apply it to Bart's latest SRP/SRPT tree tomorrow as its very
late here but it will clearly fix the issue in Bart's tree too.

Tested-by: Laurence Oberman <lober...@redhat.com>

Re: [for-4.16 PATCH v6 0/3] blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback

2018-01-17 Thread Laurence Oberman

On Wed, 2018-01-17 at 23:53 +, Bart Van Assche wrote:
> On Wed, 2018-01-17 at 18:43 -0500, Laurence Oberman wrote:
> > On Wed, 2018-01-17 at 23:31 +, Bart Van Assche wrote:
> > > On Wed, 2018-01-17 at 11:58 -0500, Mike Snitzer wrote:
> > > > On Wed, Jan 17 2018 at 11:50am -0500,
> > > > Jens Axboe <ax...@kernel.dk> wrote:
> > > > 
> > > > > On 1/17/18 9:25 AM, Mike Snitzer wrote:
> > > > > > Hi Jens,
> > > > > > 
> > > > > > Think this finally takes care of it! ;)
> > > > > > 
> > > > > > Thanks,
> > > > > > Mike
> > > > > > 
> > > > > > Mike Snitzer (2):
> > > > > >   blk-mq: factor out a few helpers from
> > > > > > __blk_mq_try_issue_directly
> > > > > >   blk-mq-sched: remove unused 'can_block' arg from
> > > > > > blk_mq_sched_insert_request
> > > > > > 
> > > > > > Ming Lei (1):
> > > > > >   blk-mq: improve DM's blk-mq IO merging via
> > > > > > blk_insert_cloned_request feedback
> > > > > 
> > > > > Applied - added actual commit message to patch 3.
> > > > 
> > > > Great, thanks.
> > > 
> > > Hello Mike,
> > > 
> > > Laurence hit the following while retesting the SRP initiator
> > > code:
> > > 
> > > [ 2223.797129] list_add corruption. prev->next should be next
> > > (e0ddd5dd), but was 3defe5cd.
> > > (prev=3defe5cd).
> > > [ 2223.862168] WARNING: CPU: 14 PID: 577 at lib/list_debug.c:28
> > > __list_add_valid+0x6a/0x70
> > > [ 2224.481151] CPU: 14 PID: 577 Comm: kworker/14:1H Tainted:
> > > G  I  4.15.0-rc8.bart3+ #1
> > > [ 2224.531193] Hardware name: HP ProLiant DL380 G7, BIOS P67
> > > 08/16/2015
> > > [ 2224.567150] Workqueue: kblockd blk_mq_run_work_fn
> > > [ 2224.593182] RIP: 0010:__list_add_valid+0x6a/0x70
> > > [ 2224.967002] Call Trace:
> > > [ 2224.980941]  blk_mq_request_bypass_insert+0x57/0xa0
> > > [ 2225.009044]  __blk_mq_try_issue_directly+0x56/0x1e0
> > > [ 2225.037007]  blk_mq_request_direct_issue+0x5d/0xc0
> > > [ 2225.090608]  map_request+0x142/0x260 [dm_mod]
> > > [ 2225.114756]  dm_mq_queue_rq+0xa4/0x120 [dm_mod]
> > > [ 2225.140812]  blk_mq_dispatch_rq_list+0x90/0x5b0
> > > [ 2225.211769]  blk_mq_sched_dispatch_requests+0x107/0x1a0
> > > [ 2225.240825]  __blk_mq_run_hw_queue+0x5f/0xf0
> > > [ 2225.264852]  process_one_work+0x141/0x340
> > > [ 2225.287872]  worker_thread+0x47/0x3e0
> > > [ 2225.308354]  kthread+0xf5/0x130
> > > [ 2225.396405]  ret_from_fork+0x32/0x40
> > > 
> > > That call trace did not show up before this patch series was
> > > added to
> > > Jens'
> > > tree. This is a regression. Could this have been introduced by
> > > this
> > > patch
> > > series?
> > > 
> > > Thanks,
> > > 
> > > Bart.
> > 
> > Hi Bart
> > One thing to note.
> > 
> > I tested Mike's combined tree on the weekend fully dm4.16-block4.16 
> > and
> > did not see this.
> > This was with Mike combined tree and SRPT running 4.13-rc2.
> > 
> > I also tested your tree Monday with the revert of the
> > scatter/gather
> > patches with both SRP and SRPT running your tree and it was fine.
> > 
> > So its a combination of what you provided me before and that has
> > been
> > added to your tree.
> > 
> > Mike combined tree seemed to be fine, I can revisit that if needed.
> > I
> > still have that kernel in place.
> > 
> > I was not running latest SRPT when I tested Mike's tree
> 
> Hello Laurence,
> 
> The tree I sent you this morning did not only include Mike's latest
> dm code
> but also Jens' latest for-next branch. So what you wrote above does
> not
> contradict what I wrote in my e-mail, namely that I suspect that a
> regression
> was introduced by the patches in the series "blk-mq: improve DM's
> blk-mq IO
> merging via blk_insert_cloned_request feedback". These changes namely
> went in
> through the block tree and not through the dm tree. Additionally,
> these
> changes were not present in the block-scsi-for-next branch I sent you
> on
> Monday.
> 
> Bart.

Hi Bart
Thank you

I probably don't have latest stuff Mike added Monday in his tree.
I have not gone back to test that, been busy with yours all week.

I will get with Mike and pull his latest and test it with SRPT on 4.13-
rc2 like before.

That should be the best way to isolate this.

Regards
Laurence

Re: [for-4.16 PATCH v6 0/3] blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback

2018-01-17 Thread Laurence Oberman

On Wed, 2018-01-17 at 23:31 +, Bart Van Assche wrote:
> On Wed, 2018-01-17 at 11:58 -0500, Mike Snitzer wrote:
> > On Wed, Jan 17 2018 at 11:50am -0500,
> > Jens Axboe  wrote:
> > 
> > > On 1/17/18 9:25 AM, Mike Snitzer wrote:
> > > > Hi Jens,
> > > > 
> > > > Think this finally takes care of it! ;)
> > > > 
> > > > Thanks,
> > > > Mike
> > > > 
> > > > Mike Snitzer (2):
> > > >   blk-mq: factor out a few helpers from
> > > > __blk_mq_try_issue_directly
> > > >   blk-mq-sched: remove unused 'can_block' arg from
> > > > blk_mq_sched_insert_request
> > > > 
> > > > Ming Lei (1):
> > > >   blk-mq: improve DM's blk-mq IO merging via
> > > > blk_insert_cloned_request feedback
> > > 
> > > Applied - added actual commit message to patch 3.
> > 
> > Great, thanks.
> 
> Hello Mike,
> 
> Laurence hit the following while retesting the SRP initiator code:
> 
> [ 2223.797129] list_add corruption. prev->next should be next
> (e0ddd5dd), but was 3defe5cd.
> (prev=3defe5cd).
> [ 2223.862168] WARNING: CPU: 14 PID: 577 at lib/list_debug.c:28
> __list_add_valid+0x6a/0x70
> [ 2224.481151] CPU: 14 PID: 577 Comm: kworker/14:1H Tainted:
> G  I  4.15.0-rc8.bart3+ #1
> [ 2224.531193] Hardware name: HP ProLiant DL380 G7, BIOS P67
> 08/16/2015
> [ 2224.567150] Workqueue: kblockd blk_mq_run_work_fn
> [ 2224.593182] RIP: 0010:__list_add_valid+0x6a/0x70
> [ 2224.967002] Call Trace:
> [ 2224.980941]  blk_mq_request_bypass_insert+0x57/0xa0
> [ 2225.009044]  __blk_mq_try_issue_directly+0x56/0x1e0
> [ 2225.037007]  blk_mq_request_direct_issue+0x5d/0xc0
> [ 2225.090608]  map_request+0x142/0x260 [dm_mod]
> [ 2225.114756]  dm_mq_queue_rq+0xa4/0x120 [dm_mod]
> [ 2225.140812]  blk_mq_dispatch_rq_list+0x90/0x5b0
> [ 2225.211769]  blk_mq_sched_dispatch_requests+0x107/0x1a0
> [ 2225.240825]  __blk_mq_run_hw_queue+0x5f/0xf0
> [ 2225.264852]  process_one_work+0x141/0x340
> [ 2225.287872]  worker_thread+0x47/0x3e0
> [ 2225.308354]  kthread+0xf5/0x130
> [ 2225.396405]  ret_from_fork+0x32/0x40
> 
> That call trace did not show up before this patch series was added to
> Jens'
> tree. This is a regression. Could this have been introduced by this
> patch
> series?
> 
> Thanks,
> 
> Bart.

Hi Bart
One thing to note.

I tested Mike's combined tree on the weekend fully dm4.16-block4.16 and
did not see this.
This was with Mike combined tree and SRPT running 4.13-rc2.

I also tested your tree Monday with the revert of the scatter/gather
patches with both SRP and SRPT running your tree and it was fine.

So its a combination of what you provided me before and that has been
added to your tree.

Mike combined tree seemed to be fine, I can revisit that if needed. I
still have that kernel in place.

I was not running latest SRPT when I tested Mike's tree


Regards
Laurence

Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assgined to irq vector

2018-01-16 Thread Laurence Oberman

On Tue, 2018-01-16 at 15:22 +, Don Brace wrote:
> > -Original Message-
> > From: Laurence Oberman [mailto:lober...@redhat.com]
> > Sent: Tuesday, January 16, 2018 7:29 AM
> > To: Thomas Gleixner <t...@linutronix.de>; Ming Lei <ming.lei@redhat
> > .com>
> > Cc: Christoph Hellwig <h...@infradead.org>; Jens Axboe <ax...@fb.com
> > >;
> > linux-block@vger.kernel.org; linux-ker...@vger.kernel.org; Mike
> > Snitzer
> > <snit...@redhat.com>; Don Brace <don.br...@microsemi.com>
> > Subject: Re: [PATCH 0/2] genirq/affinity: try to make sure online
> > CPU is assgined
> > to irq vector
> > 
> > > > It is because of irq_create_affinity_masks().
> > > 
> > > That still does not answer the question. If the interrupt for a
> > > queue
> > > is
> > > assigned to an offline CPU, then the queue should not be used and
> > > never
> > > raise an interrupt. That's how managed interrupts have been
> > > designed.
> > > 
> > > Thanks,
> > > 
> > >   tglx
> > > 
> > > 
> > > 
> > > 
> > 
> > I captured a full boot log for this issue for Microsemi, I will
> > send it
> > to Don Brace.
> > I enabled all the HPSA debug and here is snippet
> > 
> > 
> > ..
> > ..
> > ..
> >   246.751135] INFO: task systemd-udevd:413 blocked for more than
> > 120
> > seconds.
> > [  246.788008]   Tainted: G  I  4.15.0-rc4.noming+
> > #1
> > [  246.822380] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> > disables this message.
> > [  246.865594] systemd-udevd   D0   413411 0x8004
> > [  246.895519] Call Trace:
> > [  246.909713]  ? __schedule+0x340/0xc20
> > [  246.930236]  schedule+0x32/0x80
> > [  246.947905]  schedule_timeout+0x23d/0x450
> > [  246.970047]  ? find_held_lock+0x2d/0x90
> > [  246.991774]  ? wait_for_completion_io+0x108/0x170
> > [  247.018172]  io_schedule_timeout+0x19/0x40
> > [  247.041208]  wait_for_completion_io+0x110/0x170
> > [  247.067326]  ? wake_up_q+0x70/0x70
> > [  247.086801]  hpsa_scsi_do_simple_cmd+0xc6/0x100 [hpsa]
> > [  247.114315]  hpsa_scsi_do_simple_cmd_with_retry+0xb7/0x1c0
> > [hpsa]
> > [  247.146629]  hpsa_scsi_do_inquiry+0x73/0xd0 [hpsa]
> > [  247.174118]  hpsa_init_one+0x12cb/0x1a59 [hpsa]
> 
> This trace comes from internally generated discovery commands. No
> SCSI devices have
> been presented to the SML yet.
> 
> At this point we should be running on only one CPU. These commands
> are meant to use
> reply queue 0 which are tied to CPU 0. It's interesting that the
> patch helps.
> 
> However, I was wondering if you could inspect the iLo IML logs and
> send the
> AHS logs for inspection.
> 
> Thanks,
> Don Brace
> ESC - Smart Storage
> Microsemi Corporation


Hello Don

I took two other dl380 g7's and ran the same kernel and it hangs in the
identical place. Its absolutely consistent here.

I doubt all three have hardware issues.

Nothing is logged of interest in the IML.

Ming will have more to share on specifically why it helps.
I think he sent that along to you already.

Regards
Laurence

Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assgined to irq vector

2018-01-16 Thread Laurence Oberman

On Tue, 2018-01-16 at 12:25 +0100, Thomas Gleixner wrote:
> On Tue, 16 Jan 2018, Ming Lei wrote:
> 
> > On Mon, Jan 15, 2018 at 09:40:36AM -0800, Christoph Hellwig wrote:
> > > On Tue, Jan 16, 2018 at 12:03:43AM +0800, Ming Lei wrote:
> > > > Hi,
> > > > 
> > > > These two patches fixes IO hang issue reported by Laurence.
> > > > 
> > > > 84676c1f21 ("genirq/affinity: assign vectors to all possible
> > > > CPUs")
> > > > may cause one irq vector assigned to all offline CPUs, then
> > > > this vector
> > > > can't handle irq any more.
> > > 
> > > Well, that very much was the intention of managed
> > > interrupts.  Why
> > > does the device raise an interrupt for a queue that has no online
> > > cpu assigned to it?
> > 
> > It is because of irq_create_affinity_masks().
> 
> That still does not answer the question. If the interrupt for a queue
> is
> assigned to an offline CPU, then the queue should not be used and
> never
> raise an interrupt. That's how managed interrupts have been designed.
> 
> Thanks,
> 
>   tglx
> 
> 
> 
> 

I captured a full boot log for this issue for Microsemi, I will send it
to Don Brace.
I enabled all the HPSA debug and here is snippet

[0.00] Kernel command line: BOOT_IMAGE=/vmlinuz-4.15.0-
rc4.noming+ root=/dev/mapper/rhel_ibclient-root ro crashkernel=512M@64M
 rd.lvm.lv=rhel_ibclient/root rd.lvm.lv=rhel_ibclient/swap
log_buf_len=54M console=ttyS1,115200n8 scsi_mod.use_blk_mq=y
dm_mod.use_blk_mq=y
[0.00] Memory: 7834908K/1002852K available (8397K kernel code,
3012K rwdata, 3660K rodata, 2184K init, 15344K bss, 2356808K reserved,
0K cma-reserved)
[0.00] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=32,
Nodes=2
[0.00] ftrace: allocating 33084 entries in 130 pages
[0.00] Running RCU self tests
[0.00] Hierarchical RCU implementation.
[0.00]  RCU lockdep checking is enabled.
[0.00]  RCU restricting CPUs from NR_CPUS=8192 to
nr_cpu_ids=32.
[0.00] RCU: Adjusting geometry for rcu_fanout_leaf=16,
nr_cpu_ids=32
[0.00] NR_IRQS: 524544, nr_irqs: 1088, preallocated irqs: 16
..
..
0.190147] smp: Brought up 2 nodes, 16 CPUs
[0.192006] smpboot: Max logical packages: 4
[0.193007] smpboot: Total of 16 processors activated (76776.33
BogoMIPS)
[0.940640] node 0 initialised, 10803218 pages in 743ms
[1.005449] node 1 initialised, 11812066 pages in 807ms
..
..
[7.440896] hpsa :05:00.0: can't disable ASPM; OS doesn't have
ASPM control
[7.442071] hpsa :05:00.0: Logical aborts not supported
[7.442075] hpsa :05:00.0: HP SSD Smart Path aborts not
supported
[7.442164] hpsa :05:00.0: Controller Configuration information
[7.442167] hpsa :05:00.0: 
[7.442173] hpsa :05:00.0:Signature = CISS
[7.442177] hpsa :05:00.0:Spec Number = 3
[7.442182] hpsa :05:00.0:Transport methods supported =
0x7a07
[7.442186] hpsa :05:00.0:Transport methods active = 0x3
[7.442190] hpsa :05:00.0:Requested transport Method = 0x2
[7.442194] hpsa :05:00.0:Coalesce Interrupt Delay = 0x0
[7.442198] hpsa :05:00.0:Coalesce Interrupt Count = 0x1
[7.442202] hpsa :05:00.0:Max outstanding commands = 1024
[7.442206] hpsa :05:00.0:Bus Types = 0x20
[7.442220] hpsa :05:00.0:Server Name = 2M21220149
[7.442224] hpsa :05:00.0:Heartbeat Counter = 0xd23
[7.442224] 
[7.442224] 
..
..
  246.751135] INFO: task systemd-udevd:413 blocked for more than 120
seconds.
[  246.788008]   Tainted: G  I  4.15.0-rc4.noming+ #1
[  246.822380] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  246.865594] systemd-udevd   D0   413411 0x8004
[  246.895519] Call Trace:
[  246.909713]  ? __schedule+0x340/0xc20
[  246.930236]  schedule+0x32/0x80
[  246.947905]  schedule_timeout+0x23d/0x450
[  246.970047]  ? find_held_lock+0x2d/0x90
[  246.991774]  ? wait_for_completion_io+0x108/0x170
[  247.018172]  io_schedule_timeout+0x19/0x40
[  247.041208]  wait_for_completion_io+0x110/0x170
[  247.067326]  ? wake_up_q+0x70/0x70
[  247.086801]  hpsa_scsi_do_simple_cmd+0xc6/0x100 [hpsa]
[  247.114315]  hpsa_scsi_do_simple_cmd_with_retry+0xb7/0x1c0 [hpsa]
[  247.146629]  hpsa_scsi_do_inquiry+0x73/0xd0 [hpsa]
[  247.174118]  hpsa_init_one+0x12cb/0x1a59 [hpsa]
[  247.199851]  ? __pm_runtime_resume+0x55/0x70
[  247.224527]  local_pci_probe+0x3f/0xa0
[  247.246034]  pci_device_probe+0x146/0x1b0
[  247.268413]  driver_probe_device+0x2b3/0x4a0
[  247.291868]  __driver_attach+0xda/0xe0
[  247.313370]  ? driver_probe_device+0x4a0/0x4a0
[  247.338399]  bus_for_each_dev+0x6a/0xb0
[  247.359912]  bus_add_driver+0x41/0x260
[  247.380244]  driver_register+0x5b/0xd0
[  247.400811]  ? 0xc016b000
[  247.418819]  hpsa_init+0x38/0x1000 [hpsa]
[  247.440763]  ? 0xc016b000
[  247.459451]  do_one_initcall+0x4d/0x19c
[

Re: [PATCH 0/2] genirq/affinity: try to make sure online CPU is assgined to irq vector

2018-01-15 Thread Laurence Oberman

On Mon, 2018-01-15 at 18:43 +0100, Thomas Gleixner wrote:
> On Tue, 16 Jan 2018, Ming Lei wrote:
> > These two patches fixes IO hang issue reported by Laurence.
> > 
> > 84676c1f21 ("genirq/affinity: assign vectors to all possible CPUs")
> > may cause one irq vector assigned to all offline CPUs, then this
> > vector
> > can't handle irq any more.
> > 
> > The 1st patch moves irq vectors spread into one function, and
> > prepares
> > for the fix done in 2nd patch.
> > 
> > The 2nd patch fixes the issue by trying to make sure online CPUs
> > assigned
> > to irq vector.
> 
> Which means it's completely undoing the intent and mechanism of
> managed
> interrupts. Not going to happen.
> 
> Which driver is that which abuses managed interrupts and does not
> keep its
> queues properly sorted on cpu hotplug?
> 
> Thanks,
> 
>   tglx

Hello Thomas

The servers I am using are all booting off hpsa (SmartArray)
The system would hang on boot with this stack below.

So seen when booting off hpsa driver, not seen by Mike when booting off
a server not using hpsa.

Also not seen when reverting the patch I called out and reverted.

Putting that patch back into Mike/Jens combined tree and adding Ming's
patch seems to fix this issue now. I can boot.

I just did a quick sanity boot and check, not any in-depth testing
right now.

Its not code I am at all familiar with that Ming has changed to make it
work so I defer to Ming to explain in-depth


[  246.751050] INFO: task systemd-udevd:411 blocked for more than 120
seconds.
[  246.791852]   Tainted: G  I  4.15.0-
rc4.block.dm.4.16+ #1
[  246.830650] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables
this message.
[  246.874637] systemd-udevd   D0   411408 0x8004
[  246.904934] Call Trace:
[  246.918191]  ? __schedule+0x28d/0x870
[  246.937643]  ? _cond_resched+0x15/0x30
[  246.958222]  schedule+0x32/0x80
[  246.975424]  async_synchronize_cookie_domain+0x8b/0x140
[  247.004452]  ? remove_wait_queue+0x60/0x60
[  247.027335]  do_init_module+0xbe/0x219
[  247.048022]  load_module+0x21d6/0x2910
[  247.069436]  ? m_show+0x1c0/0x1c0
[  247.087999]  SYSC_finit_module+0x94/0xe0
[  247.110392]  entry_SYSCALL_64_fastpath+0x1a/0x7d
[  247.136669] RIP: 0033:0x7f84049287f9
[  247.156112] RSP: 002b:7ffd13199ab8 EFLAGS: 0246 ORIG_RAX:
0139
[  247.196883] RAX: ffda RBX: 55b712b59e80 RCX:
7f84049287f9
[  247.237989] RDX:  RSI: 7f8405245099 RDI:
0008
[  247.279105] RBP: 7f8404bf2760 R08:  R09:
55b712b45760
[  247.320005] R10: 0008 R11: 0246 R12:
0020
[  247.360625] R13: 7f8404bf2818 R14: 0050 R15:
7f8404bf27b8
[  247.401062] INFO: task scsi_eh_0:471 blocked for more than 120
seconds.
[  247.438161]   Tainted: G  I  4.15.0-
rc4.block.dm.4.16+ #1
[  247.476640] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables
this message.
[  247.520700] scsi_eh_0   D0   471  2 0x8000
[  247.551339] Call Trace:
[  247.564360]  ? __schedule+0x28d/0x870
[  247.584720]  schedule+0x32/0x80
[  247.601294]  hpsa_eh_device_reset_handler+0x68c/0x700 [hpsa]
[  247.633358]  ? remove_wait_queue+0x60/0x60
[  247.656345]  scsi_try_bus_device_reset+0x27/0x40
[  247.682424]  scsi_eh_ready_devs+0x53f/0xe20
[  247.706467]  ? __pm_runtime_resume+0x55/0x70
[  247.730327]  scsi_error_handler+0x434/0x5e0
[  247.754387]  ? __schedule+0x295/0x870
[  247.775420]  kthread+0xf5/0x130
[  247.793461]  ? scsi_eh_get_sense+0x240/0x240
[  247.818008]  ? kthread_associate_blkcg+0x90/0x90
[  247.844759]  ret_from_fork+0x1f/0x30
[  247.865440] INFO: task scsi_id:488 blocked for more than 120
seconds.
[  247.901112]   Tainted: G  I  4.15.0-
rc4.block.dm.4.16+ #1
[  247.938743] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables
this message.
[  247.981092] scsi_id D0   488  1 0x0004
[  248.010535] Call Trace:
[  248.023567]  ? __schedule+0x28d/0x870
[  248.044236]  ? __switch_to+0x1f5/0x460
[  248.065776]  schedule+0x32/0x80
[  248.084238]  schedule_timeout+0x1d4/0x2f0
[  248.106184]  wait_for_completion+0x123/0x190
[  248.130759]  ? wake_up_q+0x70/0x70
[  248.150295]  flush_work+0x119/0x1a0
[  248.169238]  ? wake_up_worker+0x30/0x30
[  248.189670]  __cancel_work_timer+0x103/0x190
[  248.213751]  ? kobj_lookup+0x10b/0x160
[  248.235441]  disk_block_events+0x6f/0x90
[  248.257820]  __blkdev_get+0x6a/0x480
[  248.278770]  ? bd_acquire+0xd0/0xd0
[  248.298438]  blkdev_get+0x1a5/0x300
[  248.316587]  ? bd_acquire+0xd0/0xd0
[  248.334814]  do_dentry_open+0x202/0x320
[  248.354372]  ? security_inode_permission+0x3c/0x50
[  248.378818]  path_openat+0x537/0x12c0
[  248.397386]  ? vm_insert_page+0x1e0/0x1f0
[  248.417664]  ? vvar_fault+0x75/0x140
[  248.435811]  do_filp_open+0x91/0x100
[  248.454061]  do_sys_open+0x126/0x210
[  248.472462]  entry_SYSCALL_64_fastpath+0x1a/0x7d
[

Re: [PATCH] SCSI: don't get target/host busy_count in scsi_mq_get_budget()

2017-11-08 Thread Laurence Oberman

On Wed, 2017-11-08 at 10:57 -0700, Jens Axboe wrote:
> On 11/08/2017 09:41 AM, Bart Van Assche wrote:
> > On Tue, 2017-11-07 at 20:06 -0700, Jens Axboe wrote:
> > > At this point, I have no idea what Bart's setup looks like. Bart,
> > > it
> > > would be REALLY helpful if you could tell us how you are
> > > reproducing
> > > your hang. I don't know why this has to be dragged out.
> > 
> > Hello Jens,
> > 
> > It is a disappointment to me that you have allowed Ming to evaluate
> > other
> > approaches than reverting "blk-mq: don't handle TAG_SHARED in
> > restart". That
> > patch namely replaces an algorithm that is trusted by the community
> > with an
> > algorithm of which even Ming acknowledged that it is racy. A quote
> > from [1]:
> > "IO hang may be caused if all requests are completed just before
> > the current
> > SCSI device is added to shost->starved_list". I don't know of any
> > way to fix
> > that race other than serializing request submission and completion
> > by adding
> > locking around these actions, which is something we don't want.
> > Hence my
> > request to revert that patch.
> 
> I was reluctant to revert it, in case we could work out a better way
> of
> doing it. As I mentioned in the other replies, it's not exactly the
> prettiest or most efficient. However, since we currently don't have
> a good solution for the issue, I'm fine with reverting that patch.
> 
> > Regarding the test I run, here is a summary of what I mentioned in
> > previous
> > e-mails:
> > * I modified the SRP initiator such that the SCSI target queue
> > depth is
> >   reduced to one by setting starget->can_queue to 1 from inside
> >   scsi_host_template.target_alloc.
> > * With that modified SRP initiator I run the srp-test software as
> > follows
> >   until something breaks:
> >   while ./run_tests -f xfs -d -e deadline -r 60; do :; done
> 
> What kernel options are needed? Where do I download everything I
> need?
> 
> In other words, would it be possible to do a fuller guide for getting
> this setup and running?
> 
> I'll run my simple test case as well, since it's currently breaking
> basically everywhere.
> 
> > Today a system with at least one InfiniBand HCA is required to run
> > that test.
> > When I have the time I will post the SRP initiator and target
> > patches on the
> > linux-rdma mailing list that make it possible to run that test
> > against the
> > SoftRoCE driver (drivers/infiniband/sw/rxe). The only hardware
> > required to
> > use that driver is an Ethernet adapter.
> 
> OK, I guess I can't run it then... I'll have to rely on your testing.

Hello 

I agree with Bart in this case, we should revert this.
My test-bed is tied up and I have not been able to give it back to Ming
so he could follow up on Bart's last update.

Right now its safer to revert.

Thanks
Laurence
>

Re: [v4.13-rc BUG] system lockup when running big buffered write(4M) to IB SRP via mpath

2017-08-09 Thread Laurence Oberman




On 08/08/2017 10:28 PM, Laurence Oberman wrote:

On Tue, 2017-08-08 at 20:11 -0400, Laurence Oberman wrote:

On Tue, 2017-08-08 at 22:17 +0800, Ming Lei wrote:

Hi Guys,
  
Laurence and I see a system lockup issue when running

concurrent

big buffered write(4M bytes) to IB SRP on v4.13-rc3.
  
1 how to reproduce
  
1) setup IB_SRR & multi path
  
  	#./start_opensm.sh

#./start_srp.sh
  
  	# cat start_opensm.sh

#!/bin/bash
rmmod ib_srpt
opensm -F opensm.1.conf &
opensm -F opensm.2.conf &
  
  	# cat start_srp.sh

run_srp_daemon  -f /etc/ddn/srp_daemon.conf -R 30 -T 10

-t 7000

-ance -i mlx5_0 -p 1 1>/root/srp1.log 2>&1 &
run_srp_daemon  -f /etc/ddn/srp_daemon.conf -R 30 -T 10

-t 7000

-ance -i mlx5_1 -p 1 1>/root/srp2.log 2>&1 &

  
2) run the following test script
  
  	#./test-mp.sh
  
  	#cat test-mp.sh

#!/bin/sh
DEVS="360014051186d24cc27b4d04994e3
36001405b2b5c6c24c084b6fa4d55da2f
36001405b26ebe76dcb94a489f6f245f8"
  
  	for i in $DEVS; do

for j in `seq 0 15`; do
./hammer_write.sh $i 1>/dev/null 2>&1 &
done
done
  
  	#cat hammer_write.sh

#!/bin/bash
while true; do
dd if=/dev/zero of=/dev/mapper/$1 bs=4096k

count=800

done
  
  
2 lockup log

[root@ibclient ~]# ./test-mp.sh
[root@ibclient ~]# [  169.095494] perf: interrupt took too long
(2575

2500), lowering kernel.perf_event_max_sample_rate to 77000


  
[  178.569124] perf: interrupt took too long (3249 > 3218),

lowering
kernel.perf_event_max_sample_rate to 61000
[  190.586624] perf: interrupt took too long (4066 > 4061),
lowering
kernel.perf_event_max_sample_rate to 49000
[  210.381836] perf: interrupt took too long (5083 > 5082),
lowering
kernel.perf_event_max_sample_rate to 39000
[  240.498659] perf: interrupt took too long (6373 > 6353),
lowering
kernel.perf_event_max_sample_rate to 31000
  
[root@ibclient ~]#

[root@ibclient ~]# [  292.252795] perf: interrupt took too long
(8006

7966), lowering kernel.perf_event_max_sample_rate to 24000


  
[  410.354443] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  410.432725] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  410.499530] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  410.564952] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  410.644493] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  410.698091] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  410.755175] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  410.796784] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  410.837690] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  410.878743] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  463.476823] multipath_clone_and_map: 32 callbacks suppressed
[  463.506773] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  463.586752] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  463.656880] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  463.698919] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  463.723572] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  463.751820] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  463.781110] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  463.828267] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  463.856997] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  463.885998] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  468.572060] multipath_clone_and_map: 5201 callbacks

suppressed

[  468.602191] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  468.655761] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  468.697540] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  468.738610] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  468.779597] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  468.820705] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  468.862712] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  468.904662] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  468.945132] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  468.985128] device-mapper: multipath: blk_get_request()

returned

-11 - requeuing
[  473.643013] multipath_clone_and_map: 6569 callbacks

suppresse

Re: [v4.13-rc BUG] system lockup when running big buffered write(4M) to IB SRP via mpath

2017-08-08 Thread Laurence Oberman

On Tue, 2017-08-08 at 20:11 -0400, Laurence Oberman wrote:
> > On Tue, 2017-08-08 at 22:17 +0800, Ming Lei wrote:
> > > > Hi Guys,
> > > > 
> > > > Laurence and I see a system lockup issue when running
> > concurrent
> > > > big buffered write(4M bytes) to IB SRP on v4.13-rc3.
> > > > 
> > > > 1 how to reproduce
> > > > 
> > > > 1) setup IB_SRR & multi path
> > > > 
> > > > #./start_opensm.sh
> > > > #./start_srp.sh 
> > > > 
> > > > # cat start_opensm.sh 
> > > > #!/bin/bash
> > > > rmmod ib_srpt
> > > > opensm -F opensm.1.conf &
> > > > opensm -F opensm.2.conf &
> > > > 
> > > > # cat start_srp.sh
> > > > run_srp_daemon  -f /etc/ddn/srp_daemon.conf -R 30 -T 10
> > -t 7000
> > > > -ance -i mlx5_0 -p 1 1>/root/srp1.log 2>&1 &
> > > > run_srp_daemon  -f /etc/ddn/srp_daemon.conf -R 30 -T 10
> > -t 7000
> > > > -ance -i mlx5_1 -p 1 1>/root/srp2.log 2>&1 &
> > > > 
> > > > 
> > > > 2) run the following test script
> > > > 
> > > > #./test-mp.sh
> > > >  
> > > > #cat test-mp.sh 
> > > > #!/bin/sh
> > > > DEVS="360014051186d24cc27b4d04994e3
> > > > 36001405b2b5c6c24c084b6fa4d55da2f
> > > > 36001405b26ebe76dcb94a489f6f245f8"
> > > > 
> > > > for i in $DEVS; do
> > > > for j in `seq 0 15`; do
> > > > ./hammer_write.sh $i 1>/dev/null 2>&1 &
> > > > done
> > > > done
> > > > 
> > > > #cat hammer_write.sh
> > > > #!/bin/bash
> > > > while true; do
> > > > dd if=/dev/zero of=/dev/mapper/$1 bs=4096k
> > count=800 
> > > > done
> > > > 
> > > > 
> > > > 2 lockup log
> > > > [root@ibclient ~]# ./test-mp.sh 
> > > > [root@ibclient ~]# [  169.095494] perf: interrupt took too long
> > > > (2575
> > > > > > 2500), lowering kernel.perf_event_max_sample_rate to 77000
> > 
> > > > 
> > > > [  178.569124] perf: interrupt took too long (3249 > 3218),
> > > > lowering
> > > > kernel.perf_event_max_sample_rate to 61000
> > > > [  190.586624] perf: interrupt took too long (4066 > 4061),
> > > > lowering
> > > > kernel.perf_event_max_sample_rate to 49000
> > > > [  210.381836] perf: interrupt took too long (5083 > 5082),
> > > > lowering
> > > > kernel.perf_event_max_sample_rate to 39000
> > > > [  240.498659] perf: interrupt took too long (6373 > 6353),
> > > > lowering
> > > > kernel.perf_event_max_sample_rate to 31000
> > > > 
> > > > [root@ibclient ~]# 
> > > > [root@ibclient ~]# [  292.252795] perf: interrupt took too long
> > > > (8006
> > > > > > 7966), lowering kernel.perf_event_max_sample_rate to 24000
> > 
> > > > 
> > > > [  410.354443] device-mapper: multipath: blk_get_request()
> > returned
> > > > -11 - requeuing
> > > > [  410.432725] device-mapper: multipath: blk_get_request()
> > returned
> > > > -11 - requeuing
> > > > [  410.499530] device-mapper: multipath: blk_get_request()
> > returned
> > > > -11 - requeuing
> > > > [  410.564952] device-mapper: multipath: blk_get_request()
> > returned
> > > > -11 - requeuing
> > > > [  410.644493] device-mapper: multipath: blk_get_request()
> > returned
> > > > -11 - requeuing
> > > > [  410.698091] device-mapper: multipath: blk_get_request()
> > returned
> > > > -11 - requeuing
> > > > [  410.755175] device-mapper: multipath: blk_get_request()
> > returned
> > > > -11 - requeuing
> > > > [  410.796784] device-mapper: multipath: blk_get_request()
> > returned
> > > > -11 - requeuing
> > > > [  410.837690] device-mapper: multipath: blk_get_request()
> > returned
> > > > -11 - requeuing
> > > > [  410.878743] device-mapper: multipath: blk_get_request()
> > returned
> > &

Re: [v4.13-rc BUG] system lockup when running big buffered write(4M) to IB SRP via mpath

2017-08-08 Thread Laurence Oberman

On Tue, 2017-08-08 at 22:17 +0800, Ming Lei wrote:
> Hi Guys,
> 
> Laurence and I see a system lockup issue when running concurrent
> big buffered write(4M bytes) to IB SRP on v4.13-rc3.
> 
> 1 how to reproduce
> 
> 1) setup IB_SRR & multi path
> 
>   #./start_opensm.sh
>   #./start_srp.sh 
> 
>   # cat start_opensm.sh 
>   #!/bin/bash
>   rmmod ib_srpt
>   opensm -F opensm.1.conf &
>   opensm -F opensm.2.conf &
> 
>   # cat start_srp.sh
>   run_srp_daemon  -f /etc/ddn/srp_daemon.conf -R 30 -T 10 -t 7000
> -ance -i mlx5_0 -p 1 1>/root/srp1.log 2>&1 &
>   run_srp_daemon  -f /etc/ddn/srp_daemon.conf -R 30 -T 10 -t 7000
> -ance -i mlx5_1 -p 1 1>/root/srp2.log 2>&1 &
>   
> 
> 2) run the following test script
> 
>   #./test-mp.sh
>  
>   #cat test-mp.sh 
>   #!/bin/sh
>   DEVS="360014051186d24cc27b4d04994e3
> 36001405b2b5c6c24c084b6fa4d55da2f 36001405b26ebe76dcb94a489f6f245f8"
> 
>   for i in $DEVS; do
>   for j in `seq 0 15`; do
>   ./hammer_write.sh $i 1>/dev/null 2>&1 &
>   done
>   done
> 
>   #cat hammer_write.sh
>   #!/bin/bash
>   while true; do
>   dd if=/dev/zero of=/dev/mapper/$1 bs=4096k count=800 
>   done
> 
> 
> 2 lockup log
> [root@ibclient ~]# ./test-mp.sh 
> [root@ibclient ~]# [  169.095494] perf: interrupt took too long (2575
> > 2500), lowering kernel.perf_event_max_sample_rate to 77000
> [  178.569124] perf: interrupt took too long (3249 > 3218), lowering
> kernel.perf_event_max_sample_rate to 61000
> [  190.586624] perf: interrupt took too long (4066 > 4061), lowering
> kernel.perf_event_max_sample_rate to 49000
> [  210.381836] perf: interrupt took too long (5083 > 5082), lowering
> kernel.perf_event_max_sample_rate to 39000
> [  240.498659] perf: interrupt took too long (6373 > 6353), lowering
> kernel.perf_event_max_sample_rate to 31000
> 
> [root@ibclient ~]# 
> [root@ibclient ~]# [  292.252795] perf: interrupt took too long (8006
> > 7966), lowering kernel.perf_event_max_sample_rate to 24000
> [  410.354443] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  410.432725] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  410.499530] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  410.564952] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  410.644493] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  410.698091] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  410.755175] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  410.796784] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  410.837690] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  410.878743] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  463.476823] multipath_clone_and_map: 32 callbacks suppressed
> [  463.506773] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  463.586752] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  463.656880] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  463.698919] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  463.723572] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  463.751820] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  463.781110] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  463.828267] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  463.856997] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  463.885998] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  468.572060] multipath_clone_and_map: 5201 callbacks suppressed
> [  468.602191] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  468.655761] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  468.697540] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  468.738610] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  468.779597] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  468.820705] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  468.862712] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  468.904662] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  468.945132] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  468.985128] device-mapper: multipath: blk_get_request() returned
> -11 - requeuing
> [  473.643013] multipath_clone_and_map: 6569 callbacks suppressed
> [  473.673466] device-mapper: multipath: blk_get_request()

Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance

2017-08-08 Thread Laurence Oberman




On 08/08/2017 09:41 AM, Ming Lei wrote:

Hi Laurence and Guys,

On Mon, Aug 07, 2017 at 06:06:11PM -0400, Laurence Oberman wrote:

On Mon, Aug 7, 2017 at 8:48 AM, Laurence Oberman <lober...@redhat.com>
wrote:
Hello

I need to retract my Tested-by:

While its valid that the patches do not introduce performance regressions,
they seem to cause a hard lockup when the [mq-deadline] scheduler is
enabled so I am not confident with a passing result here.

This is specific to large buffered I/O writes (4MB) At least that is my
current test.

I did not wait long enough for the issue to show when I first sent the pass
(Tested-by) message because I know my test platform so well I thought I had
given it enough time to validate the patches for performance regressions.

I dont know if the failing clone in blk_get_request() is a direct a
catalyst for the hard lockup but what I do know is with the stock upstream
4.13-RC3 I only see them when I am set to [none] and stock upstream never
seems to see the hard lockup.

With [mq-deadline] enabled on stock I dont see them at all and it behaves.

Now with Ming's patches if we enable [mq-deadline] we DO see the clone
failures and the hard lockup so we have opposit behaviour with the
scheduler choice and we have the hard lockup.

On Ming's kernel with [none] we are well behaved and that was my original
focus, testing on [none] and hence my Tested-by: pass.

So more investigation is needed here.


Laurence, as we talked in IRC, the hard lock issue you saw isn't
related with this patchset, because the issue can be reproduced on
both v4.13-rc3 and RHEL7. The only trick is to run your hammer
write script concurrently in 16 jobs, then it just takes several
minutes to trigger, no matter with using mq none or mq-deadline
scheduler.

Given it is easy to reproduce, I believe it shouldn't be very
difficult to investigate and root cause.

I will report the issue on another thread, and attach the
script for reproduction.

So let's focus on this patchset([PATCH V2 00/20] blk-mq-sched: improve
SCSI-MQ performance) in this thread.

Thanks again for your test!

Thanks,
Ming



Hello Ming,

Yes I agree, this means my original Tested-by: for your patch set is 
then still valid for large size I/O tests.

Thank you for all this hard work and improving block-MQ

Regards
Laurence

Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance

2017-08-07 Thread Laurence Oberman




On 08/07/2017 01:29 PM, Laurence Oberman wrote:



On 08/07/2017 11:27 AM, Bart Van Assche wrote:

On Mon, 2017-08-07 at 08:48 -0400, Laurence Oberman wrote:

I tested this series using Ming's tests as well as my own set of tests
typically run against changes to upstream code in my SRP test-bed.
My tests also include very large sequential buffered and un-buffered 
I/O.


This series seems to be fine for me. I did uncover another issue that is
unrelated to these patches and also exists in 4.13-RC3 generic that I am
still debugging.


Hello Laurence,

What kind of tests did you run? Only functional tests or also performance
tests?

Has the issue you encountered with kernel 4.13-rc3 already been 
reported on

the linux-rdma mailing list?

Thanks,

Bart.



Hi Bart

Actually I was focusing on just performance to see if we had any 
regressions with Mings new patches for the large sequential I/O cases.


Ming had already tested the small I/O performance cases so I was making 
sure the large I/O sequential tests did not suffer.


The 4MB un-buffered direct read tests to DM devices seems to perform 
much the same in my test bed.
The 4MB buffered and un-buffered 4MB writes also seem to be well behaved 
with not much improvement.


These were not exhaustive tests and did not include my usual port 
disconnect and recovery tests either.

I was just making sure we did not regress with Ming's changes.

I was only just starting to baseline test the mq-deadline scheduler as 
prior to 4.13-RC3 I had not been testing any of the new MQ schedulers.

I had always only tested with [none]

The tests were with [none] and [mq-deadline]

The new issue I started seeing was not yet reported yet as I was still 
investigating it.


In summary:
With buffered writes we see the clone fail in blk_get_request in both 
Mings kernel and in the upstream 4.13-RC3 stock kernel


[  885.271451] io scheduler mq-deadline registered
[  898.455442] device-mapper: multipath: blk_get_request() returned -11 
- requeuing


This is due to

multipath_clone_and_map()

/*
  * Map cloned requests (request-based multipath)
  */
static int multipath_clone_and_map(struct dm_target *ti, struct request 
*rq,

union map_info *map_context,
struct request **__clone)
{
..
..
..
 clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, 
GFP_ATOMIC);

 if (IS_ERR(clone)) {
 /* EBUSY, ENODEV or EWOULDBLOCK: requeue */
 bool queue_dying = blk_queue_dying(q);
 DMERR_LIMIT("blk_get_request() returned %ld%s - 
requeuing",
 PTR_ERR(clone), queue_dying ? " (path 
offline)" : "");

 if (queue_dying) {
 atomic_inc(>pg_init_in_progress);
 activate_or_offline_path(pgpath);
 return DM_MAPIO_REQUEUE;
 }
 return DM_MAPIO_DELAY_REQUEUE;
 }

Still investigating but it leads to a hard lockup


So I still need to see if the hard-lockup happens in the stock kernel 
with mq-deadline and some other work before coming up with a full 
summary of the issue.


I also intend to re-run all tests including disconnect and reconnect 
tests on both mq-deadline and none.


Trace below


[ 1553.167357] NMI watchdog: Watchdog detected hard LOCKUP on cpu 4
[ 1553.167359] Modules linked in: mq_deadline binfmt_misc dm_round_robin 
xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter 
ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set 
nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat 
nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 rpcrdma ip6table_mangle 
ip6table_security ip6table_raw iptable_nat ib_isert nf_conntrack_ipv4 
iscsi_target_mod nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ib_iser 
libiscsi scsi_transport_iscsi iptable_mangle iptable_security 
iptable_raw ebtable_filter ebtables target_core_mod ip6table_filter 
ip6_tables iptable_filter ib_srp scsi_transport_srp ib_ipoib rdma_ucm 
ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core 
intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul
[ 1553.167385]  crc32_pclmul ghash_clmulni_intel pcbc aesni_intel sg 
joydev ipmi_si hpilo crypto_simd hpwdt iTCO_wdt cryptd ipmi_devintf 
glue_helper gpio_ich iTCO_vendor_support shpchp ipmi_msghandler pcspkr 
acpi_power_meter i7core_edac lpc_ich pcc_cpufreq nfsd auth_rpcgss 
nfs_acl lockd grace sunrpc dm_multipath ip_tables xfs libcrc32c sd_mod 
amdkfd amd_iommu_v2 radeon i2c_algo_bit drm_kms_helper syscopyarea 
sysfillrect sysimgblt fb_sys_fops ttm mlx5_core drm mlxfw i2c_core ptp 
serio_raw hpsa crc32c_intel bnx2 pps_core devlink scsi_transport_sas 
dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ib_srpt]
[ 1553.167410] CPU: 4 PID: 11532 Comm: dd Tainted: G  I 
4.13.0-rc3lobeming.ming_V4+ #

Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance

2017-08-07 Thread Laurence Oberman




On 08/07/2017 11:27 AM, Bart Van Assche wrote:

On Mon, 2017-08-07 at 08:48 -0400, Laurence Oberman wrote:

I tested this series using Ming's tests as well as my own set of tests
typically run against changes to upstream code in my SRP test-bed.
My tests also include very large sequential buffered and un-buffered I/O.

This series seems to be fine for me. I did uncover another issue that is
unrelated to these patches and also exists in 4.13-RC3 generic that I am
still debugging.


Hello Laurence,

What kind of tests did you run? Only functional tests or also performance
tests?

Has the issue you encountered with kernel 4.13-rc3 already been reported on
the linux-rdma mailing list?

Thanks,

Bart.



Hi Bart

Actually I was focusing on just performance to see if we had any 
regressions with Mings new patches for the large sequential I/O cases.


Ming had already tested the small I/O performance cases so I was making 
sure the large I/O sequential tests did not suffer.


The 4MB un-buffered direct read tests to DM devices seems to perform 
much the same in my test bed.
The 4MB buffered and un-buffered 4MB writes also seem to be well behaved 
with not much improvement.


These were not exhaustive tests and did not include my usual port 
disconnect and recovery tests either.

I was just making sure we did not regress with Ming's changes.

I was only just starting to baseline test the mq-deadline scheduler as 
prior to 4.13-RC3 I had not been testing any of the new MQ schedulers.

I had always only tested with [none]

The tests were with [none] and [mq-deadline]

The new issue I started seeing was not yet reported yet as I was still 
investigating it.


In summary:
With buffered writes we see the clone fail in blk_get_request in both 
Mings kernel and in the upstream 4.13-RC3 stock kernel


[  885.271451] io scheduler mq-deadline registered
[  898.455442] device-mapper: multipath: blk_get_request() returned -11 
- requeuing


This is due to

multipath_clone_and_map()

/*
 * Map cloned requests (request-based multipath)
 */
static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
   union map_info *map_context,
   struct request **__clone)
{
..
..
..
clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, 
GFP_ATOMIC);

if (IS_ERR(clone)) {
/* EBUSY, ENODEV or EWOULDBLOCK: requeue */
bool queue_dying = blk_queue_dying(q);
DMERR_LIMIT("blk_get_request() returned %ld%s - requeuing",
PTR_ERR(clone), queue_dying ? " (path 
offline)" : "");

if (queue_dying) {
atomic_inc(>pg_init_in_progress);
activate_or_offline_path(pgpath);
return DM_MAPIO_REQUEUE;
}
return DM_MAPIO_DELAY_REQUEUE;
}

Still investigating but it leads to a hard lockup


So I still need to see if the hard-lockup happens in the stock kernel 
with mq-deadline and some other work before coming up with a full 
summary of the issue.


I also intend to re-run all tests including disconnect and reconnect 
tests on both mq-deadline and none.


Trace below


[ 1553.167357] NMI watchdog: Watchdog detected hard LOCKUP on cpu 4
[ 1553.167359] Modules linked in: mq_deadline binfmt_misc dm_round_robin 
xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter 
ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set 
nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat 
nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 rpcrdma ip6table_mangle 
ip6table_security ip6table_raw iptable_nat ib_isert nf_conntrack_ipv4 
iscsi_target_mod nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ib_iser 
libiscsi scsi_transport_iscsi iptable_mangle iptable_security 
iptable_raw ebtable_filter ebtables target_core_mod ip6table_filter 
ip6_tables iptable_filter ib_srp scsi_transport_srp ib_ipoib rdma_ucm 
ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm mlx5_ib ib_core 
intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul
[ 1553.167385]  crc32_pclmul ghash_clmulni_intel pcbc aesni_intel sg 
joydev ipmi_si hpilo crypto_simd hpwdt iTCO_wdt cryptd ipmi_devintf 
glue_helper gpio_ich iTCO_vendor_support shpchp ipmi_msghandler pcspkr 
acpi_power_meter i7core_edac lpc_ich pcc_cpufreq nfsd auth_rpcgss 
nfs_acl lockd grace sunrpc dm_multipath ip_tables xfs libcrc32c sd_mod 
amdkfd amd_iommu_v2 radeon i2c_algo_bit drm_kms_helper syscopyarea 
sysfillrect sysimgblt fb_sys_fops ttm mlx5_core drm mlxfw i2c_core ptp 
serio_raw hpsa crc32c_intel bnx2 pps_core devlink scsi_transport_sas 
dm_mirror dm_region_hash dm_log dm_mod [last unloaded: ib_srpt]
[ 1553.167410] CPU: 4 PID: 11532 Comm: dd Tainted: G  I 
4.13.0-rc3lobeming.ming_V4+ #20

[ 1553.167411] Hardware name: HP ProLiant DL380 G7, BIOS P67 08/16/2015
[

Re: [PATCH V2 00/20] blk-mq-sched: improve SCSI-MQ performance

2017-08-07 Thread Laurence Oberman

++
  include/linux/sbitmap.h |  54 ++
  11 files changed, 504 insertions(+), 138 deletions(-)



Hello

I tested this series using Ming's tests as well as my own set of tests 
typically run against changes to upstream code in my SRP test-bed.

My tests also include very large sequential buffered and un-buffered I/O.

This series seems to be fine for me. I did uncover another issue that is 
unrelated to these patches and also exists in 4.13-RC3 generic that I am 
still debugging.


For what its worth:
Tested-by: Laurence Oberman <lober...@redhat.com>

Re: device support in hpsa, was: Re: OOPS from cciss_ioctl in 4.12+git

2017-07-07 Thread Laurence Oberman




On 07/07/2017 02:08 PM, Christoph Hellwig wrote:

On Fri, Jul 07, 2017 at 11:42:38AM -0400, Laurence Oberman wrote:

What happens when  hpsa_allow_any=1 with the Smart Array 64xx
It should probe.


But only if it has a HP vendor ID as far as I can tell.  We'd
still need to add the compaq ids so that these controllers get
probed.  But maybe it's time to add them and flip the hpsa_allow_any
default (maybe conditionally on a config option?) and mark cciss
deprecated.


Agreed, I vote yes.

Re: device support in hpasa, was: Re: OOPS from cciss_ioctl in 4.12+git

2017-07-07 Thread Laurence Oberman




On 07/07/2017 11:03 AM, Jens Axboe wrote:

On 07/07/2017 09:00 AM, Christoph Hellwig wrote:

On Thu, Jul 06, 2017 at 12:55:04PM +0300, Meelis Roos wrote:

Also we're trying to move people away from the cciss driver, can you
check if the hpsa SCSI driver works for you as well?


I have older adapter:

Compaq Computer Corporation Smart Array 64xx [0e11:0046] (rev 01)

That does not seem to be supported by hpsa AFAICS.


Looks like.  Although hpsa has support for various SA5 controllers
it seems like it decided to skip all Compaq branded controllers.

As far as I can tell we could simply add support for those to
hpsa.  Ccing hpsa folks to figure out if that's the case.


Pretty sure Hannes had a patch he tested for that, he talked about
that back at LSFMM earlier this year. Hannes?



What happens when  hpsa_allow_any=1 with the Smart Array 64xx
It should probe.

Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-04 Thread Laurence Oberman



- Original Message -
> From: "Laurence Oberman" <lober...@redhat.com>
> To: "Jan Kara" <j...@suse.cz>
> Cc: "Johannes Weiner" <han...@cmpxchg.org>, "Hugh Dickins" 
> <hu...@google.com>, "Linus Torvalds"
> <torva...@linux-foundation.org>, "Dave Chinner" <da...@fromorbit.com>, "Chris 
> Leech" <cle...@redhat.com>, "Linux
> Kernel Mailing List" <linux-ker...@vger.kernel.org>, "Lee Duncan" 
> <ldun...@suse.com>, open-is...@googlegroups.com,
> "Linux SCSI List" <linux-s...@vger.kernel.org>, linux-block@vger.kernel.org, 
> "Christoph Hellwig" <h...@lst.de>,
> "Andrea Arcangeli" <aarca...@redhat.com>
> Sent: Wednesday, January 4, 2017 10:26:09 AM
> Subject: Re: [4.10, panic, regression] iscsi: null pointer deref at 
> iscsi_tcp_segment_done+0x20d/0x2e0
> 
> 
> 
> - Original Message -
> > From: "Jan Kara" <j...@suse.cz>
> > To: "Johannes Weiner" <han...@cmpxchg.org>
> > Cc: "Hugh Dickins" <hu...@google.com>, "Linus Torvalds"
> > <torva...@linux-foundation.org>, "Dave Chinner"
> > <da...@fromorbit.com>, "Chris Leech" <cle...@redhat.com>, "Linux Kernel
> > Mailing List"
> > <linux-ker...@vger.kernel.org>, "Lee Duncan" <ldun...@suse.com>,
> > open-is...@googlegroups.com, "Linux SCSI List"
> > <linux-s...@vger.kernel.org>, linux-block@vger.kernel.org, "Christoph
> > Hellwig" <h...@lst.de>, "Jan Kara"
> > <j...@suse.cz>, "Andrea Arcangeli" <aarca...@redhat.com>
> > Sent: Tuesday, January 3, 2017 7:28:25 AM
> > Subject: Re: [4.10, panic, regression] iscsi: null pointer deref at
> > iscsi_tcp_segment_done+0x20d/0x2e0
> > 
> > On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner
> > > > > > > <da...@fromorbit.com>
> > > > > > > wrote:
> > > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > > workload again and about a minute in this fired:
> > > > > > > >
> > > > > > > > [628867.607417] [ cut here ]
> > > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at
> > > > > > > > mm/workingset.c:461
> > > > > > > > shadow_lru_isolate+0x171/0x220
> > > > > > > 
> > > > > > > Well, part of the changes during the merge window were the shadow
> > > > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > > > Johannes Weiner to the participants.
> > > > > > > 
> > > > > > > > Now, this workload does not touch the page cache at all - it's
> > > > > > > > entirely an XFS metadata workload, so it should not really be
> > > > > > > > affecting the working set code.
> > > > > > > 
> > > > > > > Well, I suspect that anything that creates memory pressure will
> > > > > > > end
> > > > > > > up
> > > > > > > triggering the working set code, so ..
> > > > > > > 
> > > > > > > That said, obviously memory corruption could be involved and
> > > > > > > result
> > > > > > > in
> > > > > > > random issues too, but I wouldn't really expect that in this
> > > > > > > code.
> > > > > > > 
> > > > > > > It would probably be really useful to get more data points - is
> > > > > > > the
> > > > > > > problem reliably in this area, or is it going to be random and
> > > > > > > all
> > > > > > > over the place.
> > > > > > 
> > > > > > Data point: kswapd got WARNING on mm/workingset.c:457 in
> > > > > > shadow_lru_isolate,
> > > > > > soon fo

Re: [4.10, panic, regression] iscsi: null pointer deref at iscsi_tcp_segment_done+0x20d/0x2e0

2017-01-04 Thread Laurence Oberman



- Original Message -
> From: "Jan Kara" 
> To: "Johannes Weiner" 
> Cc: "Hugh Dickins" , "Linus Torvalds" 
> , "Dave Chinner"
> , "Chris Leech" , "Linux Kernel 
> Mailing List"
> , "Lee Duncan" , 
> open-is...@googlegroups.com, "Linux SCSI List"
> , linux-block@vger.kernel.org, "Christoph 
> Hellwig" , "Jan Kara"
> , "Andrea Arcangeli" 
> Sent: Tuesday, January 3, 2017 7:28:25 AM
> Subject: Re: [4.10, panic, regression] iscsi: null pointer deref at 
> iscsi_tcp_segment_done+0x20d/0x2e0
> 
> On Mon 02-01-17 16:11:36, Johannes Weiner wrote:
> > On Fri, Dec 23, 2016 at 03:33:29AM -0500, Johannes Weiner wrote:
> > > On Fri, Dec 23, 2016 at 02:32:41AM -0500, Johannes Weiner wrote:
> > > > On Thu, Dec 22, 2016 at 12:22:27PM -0800, Hugh Dickins wrote:
> > > > > On Wed, 21 Dec 2016, Linus Torvalds wrote:
> > > > > > On Wed, Dec 21, 2016 at 9:13 PM, Dave Chinner 
> > > > > > wrote:
> > > > > > > I unmounted the fs, mkfs'd it again, ran the
> > > > > > > workload again and about a minute in this fired:
> > > > > > >
> > > > > > > [628867.607417] [ cut here ]
> > > > > > > [628867.608603] WARNING: CPU: 2 PID: 16925 at mm/workingset.c:461
> > > > > > > shadow_lru_isolate+0x171/0x220
> > > > > > 
> > > > > > Well, part of the changes during the merge window were the shadow
> > > > > > entry tracking changes that came in through Andrew's tree. Adding
> > > > > > Johannes Weiner to the participants.
> > > > > > 
> > > > > > > Now, this workload does not touch the page cache at all - it's
> > > > > > > entirely an XFS metadata workload, so it should not really be
> > > > > > > affecting the working set code.
> > > > > > 
> > > > > > Well, I suspect that anything that creates memory pressure will end
> > > > > > up
> > > > > > triggering the working set code, so ..
> > > > > > 
> > > > > > That said, obviously memory corruption could be involved and result
> > > > > > in
> > > > > > random issues too, but I wouldn't really expect that in this code.
> > > > > > 
> > > > > > It would probably be really useful to get more data points - is the
> > > > > > problem reliably in this area, or is it going to be random and all
> > > > > > over the place.
> > > > > 
> > > > > Data point: kswapd got WARNING on mm/workingset.c:457 in
> > > > > shadow_lru_isolate,
> > > > > soon followed by NULL pointer deref in list_lru_isolate, one time
> > > > > when
> > > > > I tried out Sunday's git tree.  Not seen since, I haven't had time to
> > > > > investigate, just set it aside as something to worry about if it
> > > > > happens
> > > > > again.  But it looks like shadow_lru_isolate() has issues beyond
> > > > > Dave's
> > > > > case (I've no XFS and no iscsi), suspect unrelated to his other
> > > > > problems.
> > > > 
> > > > This seems consistent with what Dave observed: we encounter regular
> > > > pages in radix tree nodes on the shadow LRU that should only contain
> > > > nodes full of exceptional shadow entries. It could be an issue in the
> > > > new slot replacement code and the node tracking callback.
> > > 
> > > Both encounters seem to indicate use-after-free. Dave's node didn't
> > > warn about an unexpected node->count / node->exceptional state, but
> > > had entries that were inconsistent with that. Hugh got the counter
> > > warning but crashed on a list_head that's not NULLed in a live node.
> > > 
> > > workingset_update_node() should be called on page cache radix tree
> > > leaf nodes that go empty. I must be missing an update_node callback
> > > where a leaf node gets freed somewhere.
> > 
> > Sorry for dropping silent on this. I'm traveling over the holidays
> > with sporadic access to my emails and no access to real equipment.
> > 
> > The times I managed to sneak away to look at the code didn't turn up
> > anything useful yet.
> > 
> > Andrea encountered the warning as well and I gave him a debugging
> > patch (attached below), but he hasn't been able to reproduce this
> > condition. I've personally never seen the warning trigger, even though
> > the patches have been running on my main development machine for quite
> > a while now. Albeit against an older base; I've updated to Linus's
> > master branch now in case it's an interaction with other new code.
> > 
> > If anybody manages to reproduce this, that would be helpful. Any extra
> > eyes on this would be much appreciated too until I'm back at my desk.
> 
> I was looking into this but I didn't find a way how we could possibly leave
> radix tree node on LRU. So your debug patch looks like a good way forward.
> 
>   Honza
> --
> Jan Kara 
> SUSE Labs, CR
> --
> To unsubscribe from this list: send

Re: [PATCH v5 14/14] nvme: Use BLK_MQ_S_STOPPED instead of QUEUE_FLAG_STOPPED in blk-mq code

2016-10-31 Thread Laurence Oberman



- Original Message -
> From: "Bart Van Assche" <bart.vanass...@sandisk.com>
> To: "Jens Axboe" <ax...@fb.com>
> Cc: "Christoph Hellwig" <h...@lst.de>, "James Bottomley" 
> <j...@linux.vnet.ibm.com>, "Martin K. Petersen"
> <martin.peter...@oracle.com>, "Mike Snitzer" <snit...@redhat.com>, "Doug 
> Ledford" <dledf...@redhat.com>, "Keith
> Busch" <keith.bu...@intel.com>, "Ming Lei" <tom.leim...@gmail.com>, "Konrad 
> Rzeszutek Wilk"
> <konrad.w...@oracle.com>, "Roger Pau Monné" <roger@citrix.com>, "Laurence 
> Oberman" <lober...@redhat.com>,
> linux-block@vger.kernel.org, linux-s...@vger.kernel.org, 
> linux-r...@vger.kernel.org, linux-n...@lists.infradead.org
> Sent: Friday, October 28, 2016 8:23:40 PM
> Subject: [PATCH v5 14/14] nvme: Use BLK_MQ_S_STOPPED instead of 
> QUEUE_FLAG_STOPPED in blk-mq code
> 
> Make nvme_requeue_req() check BLK_MQ_S_STOPPED instead of
> QUEUE_FLAG_STOPPED. Remove the QUEUE_FLAG_STOPPED manipulations
> that became superfluous because of this change. Change
> blk_queue_stopped() tests into blk_mq_queue_stopped().
> 
> This patch fixes a race condition: using queue_flag_clear_unlocked()
> is not safe if any other function that manipulates the queue flags
> can be called concurrently, e.g. blk_cleanup_queue().
> 
> Signed-off-by: Bart Van Assche <bart.vanass...@sandisk.com>
> Cc: Keith Busch <keith.bu...@intel.com>
> Cc: Christoph Hellwig <h...@lst.de>
> Cc: Sagi Grimberg <s...@grimberg.me>
> ---
>  drivers/nvme/host/core.c | 16 ++--
>  1 file changed, 2 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index fe15d94..45dd237 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -201,13 +201,7 @@ static struct nvme_ns *nvme_get_ns_from_disk(struct
> gendisk *disk)
>  
>  void nvme_requeue_req(struct request *req)
>  {
> - unsigned long flags;
> -
> - blk_mq_requeue_request(req, false);
> - spin_lock_irqsave(req->q->queue_lock, flags);
> - if (!blk_queue_stopped(req->q))
> - blk_mq_kick_requeue_list(req->q);
> - spin_unlock_irqrestore(req->q->queue_lock, flags);
> + blk_mq_requeue_request(req, !blk_mq_queue_stopped(req->q));
>  }
>  EXPORT_SYMBOL_GPL(nvme_requeue_req);
>  
> @@ -2078,13 +2072,8 @@ void nvme_stop_queues(struct nvme_ctrl *ctrl)
>   struct nvme_ns *ns;
>  
>   mutex_lock(>namespaces_mutex);
> - list_for_each_entry(ns, >namespaces, list) {
> - spin_lock_irq(ns->queue->queue_lock);
> - queue_flag_set(QUEUE_FLAG_STOPPED, ns->queue);
> - spin_unlock_irq(ns->queue->queue_lock);
> -
> + list_for_each_entry(ns, >namespaces, list)
>   blk_mq_quiesce_queue(ns->queue);
> - }
>   mutex_unlock(>namespaces_mutex);
>  }
>  EXPORT_SYMBOL_GPL(nvme_stop_queues);
> @@ -2095,7 +2084,6 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
>  
>   mutex_lock(>namespaces_mutex);
>   list_for_each_entry(ns, >namespaces, list) {
> - queue_flag_clear_unlocked(QUEUE_FLAG_STOPPED, ns->queue);
>   blk_mq_start_stopped_hw_queues(ns->queue, true);
>   blk_mq_kick_requeue_list(ns->queue);
>   }
> --
> 2.10.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

Hello Bart

Thanks for all this work.

Applied all 14 patches, also corrected the part of the xen-blkfront.c 
blkif_recover patch in patchv5-5/14.

diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
index 9908597..60fff99 100644
--- a/drivers/block/xen-blkfront.c
+++ b/drivers/block/xen-blkfront.c
@@ -2045,6 +2045,7 @@ static int blkif_recover(struct blkfront_info *info)
 BUG_ON(req->nr_phys_segments > segs);
 blk_mq_requeue_request(req);
 }
+blk_mq_start_stopped_hw_queues(infrq, true);*** 
Corrected
 blk_mq_kick_requeue_list(infrq);
 
     while ((bio = bio_list_pop(_list)) != NULL) {

Ran multiple read/write buffered and directio tests via RDMA/SRP and mlx5 
(100Gbit) with max_sectors_kb set to 1024, 2048, 4096 and 8196
Ran multiple read/write buffered and directio tests via RDMA/SRP and mlx4 
(56Gbit)  with max_sectors_kb set to 1024, 2048, 4096 and 8196
Reset the SRP hosts multiple times with multipath set to no_path_retry queue
Ran basic NVME read/write testing with no hot plug disconnects on multiple 
block sizes

All tests passed.

For the series:
Tested-by: Laurence Oberman <lober...@redhat.com>
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 0/9] Introduce blk_quiesce_queue() and blk_resume_queue()

2016-10-11 Thread Laurence Oberman



- Original Message -
> From: "Bart Van Assche" <bart.vanass...@sandisk.com>
> To: "Jens Axboe" <ax...@fb.com>
> Cc: "Christoph Hellwig" <h...@lst.de>, "James Bottomley" 
> <j...@linux.vnet.ibm.com>, "Martin K. Petersen"
> <martin.peter...@oracle.com>, "Mike Snitzer" <snit...@redhat.com>, "Doug 
> Ledford" <dledf...@redhat.com>, "Keith
> Busch" <keith.bu...@intel.com>, linux-block@vger.kernel.org, 
> linux-s...@vger.kernel.org, linux-r...@vger.kernel.org,
> linux-n...@lists.infradead.org
> Sent: Monday, September 26, 2016 2:25:54 PM
> Subject: [PATCH 0/9] Introduce blk_quiesce_queue() and blk_resume_queue()
> 
> Hello Jens,
> 
> Multiple block drivers need the functionality to stop a request queue
> and to wait until all ongoing request_fn() / queue_rq() calls have
> finished without waiting until all outstanding requests have finished.
> Hence this patch series that introduces the blk_quiesce_queue() and
> blk_resume_queue() functions. The dm-mq, SRP and nvme patches in this
> patch series are three examples of where these functions are useful.
> These patches apply on top of the September 21 version of your
> for-4.9/block branch. The individual patches in this series are:
> 
> 0001-blk-mq-Introduce-blk_mq_queue_stopped.patch
> 0002-dm-Fix-a-race-condition-related-to-stopping-and-star.patch
> 0003-RFC-nvme-Use-BLK_MQ_S_STOPPED-instead-of-QUEUE_FLAG_.patch
> 0004-block-Move-blk_freeze_queue-and-blk_unfreeze_queue-c.patch
> 0005-block-Extend-blk_freeze_queue_start-to-the-non-blk-m.patch
> 0006-block-Rename-mq_freeze_wq-and-mq_freeze_depth.patch
> 0007-blk-mq-Introduce-blk_quiesce_queue-and-blk_resume_qu.patch
> 0008-SRP-transport-Port-srp_wait_for_queuecommand-to-scsi.patch
> 0009-RFC-nvme-Fix-a-race-condition.patch
> 
> Thanks,
> 
> Bart.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
Hello

I took Bart's latest patches from his tree and tested all the SRP/RDMA and as 
many of the nvme tests.

Everything is passing my tests, including SRP port resets etc.
The nvme tests were all on a small intel nvme card.

Tested-by: Laurence Oberman <lober...@redhat.com>
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

47 matches

Mail list logo