[PATCH] ide: ide-atapi: fix compile error with defining macro DEBUG

2017-11-09 Thread Hongxu Jia
Compiling ide-atapi fails when the macro "DEBUG" is defined:
...
|drivers/ide/ide-atapi.c:285:52: error: 'struct request' has
no member named 'cmd'; did you mean 'csd'?
|  debug_log("%s: rq->cmd[0]: 0x%x\n", __func__, rq->cmd[0]);
...

Since we split scsi_request out of struct request, the debug_log() calls
were never updated to do the same and still dereference rq->cmd directly.
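
(For reference, and as far as I recall the post-82ed4db499b8 layout, the
scsi_req() helper that the patch switches to is just an accessor for the
scsi_request embedded in the request -- a sketch, not a quote of the tree:

static inline struct scsi_request *scsi_req(struct request *rq)
{
	/* the scsi_request lives in the request's private data (PDU) area */
	return blk_mq_rq_to_pdu(rq);
}

so the command bytes must now be read via scsi_req(rq)->cmd instead of
rq->cmd.)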

Fixes: 82ed4db499b8 ("block: split scsi_request out of struct request")

Signed-off-by: Hongxu Jia 
---
 drivers/ide/ide-atapi.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/ide/ide-atapi.c b/drivers/ide/ide-atapi.c
index 14d1e7d..0e6bc63 100644
--- a/drivers/ide/ide-atapi.c
+++ b/drivers/ide/ide-atapi.c
@@ -282,7 +282,7 @@ int ide_cd_expiry(ide_drive_t *drive)
struct request *rq = drive->hwif->rq;
unsigned long wait = 0;
 
-   debug_log("%s: rq->cmd[0]: 0x%x\n", __func__, rq->cmd[0]);
+   debug_log("%s: scsi_req(rq)->cmd[0]: 0x%x\n", __func__, 
scsi_req(rq)->cmd[0]);
 
/*
 * Some commands are *slow* and normally take a long time to complete.
@@ -463,7 +463,7 @@ static ide_startstop_t ide_pc_intr(ide_drive_t *drive)
return ide_do_reset(drive);
}
 
-   debug_log("[cmd %x]: check condition\n", rq->cmd[0]);
+   debug_log("[cmd %x]: check condition\n", 
scsi_req(rq)->cmd[0]);
 
/* Retry operation */
ide_retry_pc(drive);
@@ -531,7 +531,7 @@ static ide_startstop_t ide_pc_intr(ide_drive_t *drive)
ide_pad_transfer(drive, write, bcount);
 
debug_log("[cmd %x] transferred %d bytes, padded %d bytes, resid: %u\n",
- rq->cmd[0], done, bcount, scsi_req(rq)->resid_len);
+ scsi_req(rq)->cmd[0], done, bcount, scsi_req(rq)->resid_len);
 
/* And set the interrupt handler again */
ide_set_handler(drive, ide_pc_intr, timeout);
-- 
2.7.4



Re: [PATCH v2] locking/lockdep: Revise Documentation/locking/crossrelease.txt

2017-11-09 Thread Byungchul Park

On 11/10/2017 4:30 PM, Ingo Molnar wrote:
> * Byungchul Park  wrote:
> 
> > Event C depends on event A.
> > Event A depends on event B.
> > Event B depends on event C.
> >  
> > -   NOTE: Precisely speaking, a dependency is one between whether a
> > -   waiter for an event can be woken up and whether another waiter for
> > -   another event can be woken up. However from now on, we will describe
> > -   a dependency as if it's one between an event and another event for
> > -   simplicity.
> 
> Why was this explanation removed?
> 
> > -Lockdep tries to detect a deadlock by checking dependencies created by
> > -lock operations, acquire and release. Waiting for a lock corresponds to
> > -waiting for an event, and releasing a lock corresponds to triggering an
> > -event in the previous section.
> > +Lockdep tries to detect a deadlock by checking circular dependencies
> > +created by lock operations, acquire and release, which are wait and
> > +event respectively.
> 
> What? You changed a readable paragraph into an unreadable one.
> 
> Sorry, this text needs to be acked by someone with good English skills, and I
> don't have the time right now to fix it all up. Please send minimal, obvious
> typo/grammar fixes only.

I will send one including minimal fixes at the next spin.

--
Thanks,
Byungchul


Re: [PATCH v2] locking/lockdep: Revise Documentation/locking/crossrelease.txt

2017-11-09 Thread Ingo Molnar

* Byungchul Park  wrote:

> Event C depends on event A.
> Event A depends on event B.
> Event B depends on event C.
>  
> -   NOTE: Precisely speaking, a dependency is one between whether a
> -   waiter for an event can be woken up and whether another waiter for
> -   another event can be woken up. However from now on, we will describe
> -   a dependency as if it's one between an event and another event for
> -   simplicity.

Why was this explanation removed?

> -Lockdep tries to detect a deadlock by checking dependencies created by
> -lock operations, acquire and release. Waiting for a lock corresponds to
> -waiting for an event, and releasing a lock corresponds to triggering an
> -event in the previous section.
> +Lockdep tries to detect a deadlock by checking circular dependencies
> +created by lock operations, acquire and release, which are wait and
> +event respectively.

What? You changed a readable paragraph into an unreadable one.

Sorry, this text needs to be acked by someone with good English skills, and I 
don't have the time right now to fix it all up. Please send minimal, obvious
typo/grammar fixes only.

Thanks,

Ingo


Re: [PATCH 4/7] nvme: implement multipath access to nvme subsystems

2017-11-09 Thread Hannes Reinecke
On 11/09/2017 06:44 PM, Christoph Hellwig wrote:
> This patch adds native multipath support to the nvme driver.  For each
> namespace we create only a single block device node, which can be used
> to access that namespace through any of the controllers that refer to it.
> The gendisk for each controller's path to the namespace still exists
> inside the kernel, but is hidden from userspace.  The character device
> nodes are still available on a per-controller basis.  A new link from
> the sysfs directory for the subsystem makes it possible to find all
> controllers for a given subsystem.
> 
> Currently we will always send I/O to the first available path; this will
> be changed once the NVMe Asynchronous Namespace Access (ANA) TP is
> ratified and implemented, at which point we will look at the ANA state
> for each namespace.  Another possibility that was prototyped is to
> use the path that is closest to the submitting NUMA node, which will be
> mostly interesting for PCI, but might also be useful for RDMA or FC
> transports in the future.  There is no plan to implement round robin
> or I/O service time path selectors, as those are not scalable with
> the performance rates provided by NVMe.
> 
> The multipath device will go away once all paths to it disappear,
> any delay to keep it alive needs to be implemented at the controller
> level.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/nvme/host/Kconfig |   9 ++
>  drivers/nvme/host/Makefile|   1 +
>  drivers/nvme/host/core.c  | 133 +++---
>  drivers/nvme/host/multipath.c | 255 
> ++
>  drivers/nvme/host/nvme.h  |  57 ++
>  5 files changed, 440 insertions(+), 15 deletions(-)
>  create mode 100644 drivers/nvme/host/multipath.c
> 
After much discussion I'm finally fine with this.

Reviewed-by: Hannes Reinecke 

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH] blk-mq: fix issue with shared tag queue re-running

2017-11-09 Thread Ming Lei
On Fri, Nov 10, 2017 at 01:53:18PM +0800, Ming Lei wrote:
> On Thu, Nov 09, 2017 at 09:32:58AM -0700, Jens Axboe wrote:
> > On 11/09/2017 08:30 AM, Jens Axboe wrote:
> > > On 11/09/2017 03:00 AM, Ming Lei wrote:
> > >> On Thu, Nov 09, 2017 at 11:41:40AM +0800, Ming Lei wrote:
> > >>> On Wed, Nov 08, 2017 at 03:48:51PM -0700, Jens Axboe wrote:
> >  This patch attempts to make the case of hctx re-running on driver tag
> >  failure more robust. Without this patch, it's pretty easy to trigger a
> >  stall condition with shared tags. An example is using null_blk like
> >  this:
> > 
> >  modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 
> >  submit_queues=1 hw_queue_depth=1
> > 
> >  which sets up 4 devices, sharing the same tag set with a depth of 1.
> >  Running a fio job ala:
> > 
> >  [global]
> >  bs=4k
> >  rw=randread
> >  norandommap
> >  direct=1
> >  ioengine=libaio
> >  iodepth=4
> > 
> >  [nullb0]
> >  filename=/dev/nullb0
> >  [nullb1]
> >  filename=/dev/nullb1
> >  [nullb2]
> >  filename=/dev/nullb2
> >  [nullb3]
> >  filename=/dev/nullb3
> > 
> >  will inevitably end with one or more threads being stuck waiting for a
> >  scheduler tag. That IO is then stuck forever, until someone else
> >  triggers a run of the queue.
> > 
> >  Ensure that we always re-run the hardware queue, if the driver tag we
> >  were waiting for got freed before we added our leftover request entries
> >  back on the dispatch list.
> > 
> >  Signed-off-by: Jens Axboe 
> > 
> >  ---
> > 
> >  diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> >  index 7f4a1ba532af..bb7f08415203 100644
> >  --- a/block/blk-mq-debugfs.c
> >  +++ b/block/blk-mq-debugfs.c
> >  @@ -179,7 +179,6 @@ static const char *const hctx_state_name[] = {
> > HCTX_STATE_NAME(STOPPED),
> > HCTX_STATE_NAME(TAG_ACTIVE),
> > HCTX_STATE_NAME(SCHED_RESTART),
> >  -  HCTX_STATE_NAME(TAG_WAITING),
> > HCTX_STATE_NAME(START_ON_RUN),
> >   };
> >   #undef HCTX_STATE_NAME
> >  diff --git a/block/blk-mq.c b/block/blk-mq.c
> >  index 3d759bb8a5bb..8dc5db40df9d 100644
> >  --- a/block/blk-mq.c
> >  +++ b/block/blk-mq.c
> >  @@ -998,49 +998,64 @@ bool blk_mq_get_driver_tag(struct request *rq, 
> >  struct blk_mq_hw_ctx **hctx,
> > return rq->tag != -1;
> >   }
> >   
> >  -static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned 
> >  mode, int flags,
> >  -  void *key)
> >  +static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned 
> >  mode,
> >  +  int flags, void *key)
> >   {
> > struct blk_mq_hw_ctx *hctx;
> >   
> > hctx = container_of(wait, struct blk_mq_hw_ctx, dispatch_wait);
> >   
> >  -  list_del(>entry);
> >  -  clear_bit_unlock(BLK_MQ_S_TAG_WAITING, >state);
> >  +  list_del_init(>entry);
> > blk_mq_run_hw_queue(hctx, true);
> > return 1;
> >   }
> >   
> >  -static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx)
> >  +static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx **hctx,
> >  +   struct request *rq)
> >   {
> >  +  struct blk_mq_hw_ctx *this_hctx = *hctx;
> >  +  wait_queue_entry_t *wait = _hctx->dispatch_wait;
> > struct sbq_wait_state *ws;
> >   
> >  +  if (!list_empty_careful(>entry))
> >  +  return false;
> >  +
> >  +  spin_lock(_hctx->lock);
> >  +  if (!list_empty(>entry)) {
> >  +  spin_unlock(_hctx->lock);
> >  +  return false;
> >  +  }
> >  +
> >  +  ws = bt_wait_ptr(_hctx->tags->bitmap_tags, this_hctx);
> >  +  add_wait_queue(>wait, wait);
> >  +
> > /*
> >  -   * The TAG_WAITING bit serves as a lock protecting 
> >  hctx->dispatch_wait.
> >  -   * The thread which wins the race to grab this bit adds the 
> >  hardware
> >  -   * queue to the wait queue.
> >  +   * It's possible that a tag was freed in the window between the
> >  +   * allocation failure and adding the hardware queue to the wait
> >  +   * queue.
> >  */
> >  -  if (test_bit(BLK_MQ_S_TAG_WAITING, >state) ||
> >  -  test_and_set_bit_lock(BLK_MQ_S_TAG_WAITING, >state))
> >  +  if (!blk_mq_get_driver_tag(rq, hctx, false)) {
> >  +  spin_unlock(_hctx->lock);
> > return false;
> >  -
> >  -  init_waitqueue_func_entry(>dispatch_wait, 
> >  blk_mq_dispatch_wake);
> >  -  ws = 

Re: [PATCH] blk-mq: respect queue dead via blk_mq_quiesce_queue

2017-11-09 Thread Ming Lei
On Sun, Nov 05, 2017 at 08:10:08PM +0800, Ming Lei wrote:
> blk-mq never respects queue dead, and this may cause use-after-free on
> any kind of queue resources. This patch respects the rule by calling
> blk_mq_quiesce_queue() when queue is marked as DEAD.
> 
> This patch fixes the following kernel crash:
> 
> [   42.170824] BUG: unable to handle kernel NULL pointer dereference at   
> (null)
> [   42.172011] IP: blk_mq_flush_busy_ctxs+0x5a/0xe0
> [   42.172011] PGD 25d8d1067 P4D 25d8d1067 PUD 25d458067 PMD 0
> [   42.172011] Oops:  [#1] PREEMPT SMP
> [   42.172011] Dumping ftrace buffer:
> [   42.172011](ftrace buffer empty)
> [   42.172011] Modules linked in: scsi_debug ebtable_filter ebtables 
> ip6table_filter ip6_tables xt_CHECKSUM iptable_mangle ipt_MASQUERADE 
> nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 
> nf_nat_ipv4 nf_nat nf_conntrack libcrc32c bridge stp llc fuse iptable_filter 
> ip_tables sd_mod sg mptsas mptscsih mptbase crc32c_intel scsi_transport_sas 
> nvme lpc_ich serio_raw ahci virtio_scsi libahci libata nvme_core binfmt_misc 
> dm_mod iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi null_blk configs
> [   42.172011] CPU: 0 PID: 2750 Comm: fio Not tainted 
> 4.14.0-rc7.blk_mq_io_hang+ #507
> [   42.172011] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> 1.9.3-1.fc25 04/01/2014
> [   42.172011] task: 88025dc18000 task.stack: c900028c4000
> [   42.172011] RIP: 0010:blk_mq_flush_busy_ctxs+0x5a/0xe0
> [   42.172011] RSP: 0018:c900028c7be0 EFLAGS: 00010246
> [   42.172011] RAX:  RBX:  RCX: 
> 8802775907c8
> [   42.172011] RDX:  RSI: c900028c7c50 RDI: 
> 88025de25c00
> [   42.172011] RBP: c900028c7c28 R08:  R09: 
> 8802595692c0
> [   42.172011] R10: c900028c7e50 R11: 0008 R12: 
> c900028c7c50
> [   42.172011] R13: 88025de25cd8 R14: c900028c7d78 R15: 
> 88025de25c00
> [   42.172011] FS:  7faa8653f7c0() GS:88027fc0() 
> knlGS:
> [   42.172011] CS:  0010 DS:  ES:  CR0: 80050033
> [   42.172011] CR2:  CR3: 0002593df006 CR4: 
> 003606f0
> [   42.172011] DR0:  DR1:  DR2: 
> 
> [   42.172011] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [   42.172011] Call Trace:
> [   42.172011]  blk_mq_sched_dispatch_requests+0x1b4/0x1f0
> [   42.172011]  ? preempt_schedule+0x27/0x30
> [   42.172011]  __blk_mq_run_hw_queue+0x8b/0xa0
> [   42.172011]  __blk_mq_delay_run_hw_queue+0xb7/0x100
> [   42.172011]  blk_mq_run_hw_queue+0x14/0x20
> [   42.172011]  blk_mq_sched_insert_requests+0xda/0x120
> [   42.172011]  blk_mq_flush_plug_list+0x179/0x280
> [   42.172011]  blk_flush_plug_list+0x102/0x290
> [   42.172011]  blk_finish_plug+0x2c/0x40
> [   42.172011]  do_io_submit+0x3fa/0x780
> [   42.172011]  SyS_io_submit+0x10/0x20
> [   42.172011]  ? SyS_io_submit+0x10/0x20
> [   42.172011]  entry_SYSCALL_64_fastpath+0x1a/0xa5
> [   42.172011] RIP: 0033:0x7faa80ff8697
> [   42.172011] RSP: 002b:7ffe08236de8 EFLAGS: 0206 ORIG_RAX: 
> 00d1
> [   42.172011] RAX: ffda RBX: 01f1e600 RCX: 
> 7faa80ff8697
> [   42.172011] RDX: 0208e638 RSI: 0001 RDI: 
> 7faa6ecae000
> [   42.172011] RBP: 0001 R08: 0001 R09: 
> 01f1e0e0
> [   42.172011] R10: 0001 R11: 0206 R12: 
> 7faa614f19e0
> [   42.172011] R13: 7faa614ff0b0 R14: 7faa614fef50 R15: 
> 0001
> [   42.172011] Code: e0 00 00 00 c7 45 bc 00 00 00 00 85 c0 74 76 4c 8d af d8 
> 00 00 00 49 89 ff 44 8b 45 bc 4c 89 c0 49 c1 e0 06 4d 03 87 e8 00 00 00 <49> 
> 83 38 00 4d 89 c6 74 41 41 8b 8f dc 00 00 00 41 89 c4 31 db
> [   42.172011] RIP: blk_mq_flush_busy_ctxs+0x5a/0xe0 RSP: c900028c7be0
> [   42.172011] CR2: 
> [   42.172011] ---[ end trace 2432dfddf9b84061 ]---
> [   42.172011] Kernel panic - not syncing: Fatal exception
> [   42.172011] Dumping ftrace buffer:
> [   42.172011](ftrace buffer empty)
> [   42.172011] Kernel Offset: disabled
> [   42.172011] ---[ end Kernel panic - not syncing: Fatal exception
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Ming Lei 
> ---
>  block/blk-core.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 048be4aa6024..0b121f29e3b1 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -658,6 +658,10 @@ void blk_cleanup_queue(struct request_queue *q)
>   queue_flag_set(QUEUE_FLAG_DEAD, q);
>   spin_unlock_irq(lock);
>  
> + /* respect queue DEAD via quiesce for blk-mq */
> + if (q->mq_ops)
> + blk_mq_quiesce_queue(q);
> +
>   /* for synchronous bio-based driver finish in-flight integrity i/o */
>   blk_flush_integrity();

Hi Jens,


Re: [PATCH] blk-mq: fix issue with shared tag queue re-running

2017-11-09 Thread Ming Lei
On Thu, Nov 09, 2017 at 09:32:58AM -0700, Jens Axboe wrote:
> On 11/09/2017 08:30 AM, Jens Axboe wrote:
> > On 11/09/2017 03:00 AM, Ming Lei wrote:
> >> On Thu, Nov 09, 2017 at 11:41:40AM +0800, Ming Lei wrote:
> >>> On Wed, Nov 08, 2017 at 03:48:51PM -0700, Jens Axboe wrote:
>  This patch attempts to make the case of hctx re-running on driver tag
>  failure more robust. Without this patch, it's pretty easy to trigger a
>  stall condition with shared tags. An example is using null_blk like
>  this:
> 
>  modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 
>  submit_queues=1 hw_queue_depth=1
> 
>  which sets up 4 devices, sharing the same tag set with a depth of 1.
>  Running a fio job ala:
> 
>  [global]
>  bs=4k
>  rw=randread
>  norandommap
>  direct=1
>  ioengine=libaio
>  iodepth=4
> 
>  [nullb0]
>  filename=/dev/nullb0
>  [nullb1]
>  filename=/dev/nullb1
>  [nullb2]
>  filename=/dev/nullb2
>  [nullb3]
>  filename=/dev/nullb3
> 
>  will inevitably end with one or more threads being stuck waiting for a
>  scheduler tag. That IO is then stuck forever, until someone else
>  triggers a run of the queue.
> 
>  Ensure that we always re-run the hardware queue, if the driver tag we
>  were waiting for got freed before we added our leftover request entries
>  back on the dispatch list.
> 
>  Signed-off-by: Jens Axboe 
> 
>  ---
> 
>  diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
>  index 7f4a1ba532af..bb7f08415203 100644
>  --- a/block/blk-mq-debugfs.c
>  +++ b/block/blk-mq-debugfs.c
>  @@ -179,7 +179,6 @@ static const char *const hctx_state_name[] = {
>   HCTX_STATE_NAME(STOPPED),
>   HCTX_STATE_NAME(TAG_ACTIVE),
>   HCTX_STATE_NAME(SCHED_RESTART),
>  -HCTX_STATE_NAME(TAG_WAITING),
>   HCTX_STATE_NAME(START_ON_RUN),
>   };
>   #undef HCTX_STATE_NAME
>  diff --git a/block/blk-mq.c b/block/blk-mq.c
>  index 3d759bb8a5bb..8dc5db40df9d 100644
>  --- a/block/blk-mq.c
>  +++ b/block/blk-mq.c
>  @@ -998,49 +998,64 @@ bool blk_mq_get_driver_tag(struct request *rq, 
>  struct blk_mq_hw_ctx **hctx,
>   return rq->tag != -1;
>   }
>   
>  -static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned 
>  mode, int flags,
>  -void *key)
>  +static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
>  +int flags, void *key)
>   {
>   struct blk_mq_hw_ctx *hctx;
>   
>   hctx = container_of(wait, struct blk_mq_hw_ctx, dispatch_wait);
>   
>  -list_del(&wait->entry);
>  -clear_bit_unlock(BLK_MQ_S_TAG_WAITING, &hctx->state);
>  +list_del_init(&wait->entry);
>   blk_mq_run_hw_queue(hctx, true);
>   return 1;
>   }
>   
>  -static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx)
>  +static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx **hctx,
>  + struct request *rq)
>   {
>  +struct blk_mq_hw_ctx *this_hctx = *hctx;
>  +wait_queue_entry_t *wait = &this_hctx->dispatch_wait;
>   struct sbq_wait_state *ws;
>   
>  +if (!list_empty_careful(&wait->entry))
>  +return false;
>  +
>  +spin_lock(&this_hctx->lock);
>  +if (!list_empty(&wait->entry)) {
>  +spin_unlock(&this_hctx->lock);
>  +return false;
>  +}
>  +
>  +ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
>  +add_wait_queue(&ws->wait, wait);
>  +
>   /*
>  - * The TAG_WAITING bit serves as a lock protecting 
>  hctx->dispatch_wait.
>  - * The thread which wins the race to grab this bit adds the 
>  hardware
>  - * queue to the wait queue.
>  + * It's possible that a tag was freed in the window between the
>  + * allocation failure and adding the hardware queue to the wait
>  + * queue.
>    */
>  -if (test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state) ||
>  -test_and_set_bit_lock(BLK_MQ_S_TAG_WAITING, &hctx->state))
>  +if (!blk_mq_get_driver_tag(rq, hctx, false)) {
>  +spin_unlock(&this_hctx->lock);
>   return false;
>  -
>  -init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake);
>  -ws = bt_wait_ptr(&hctx->tags->bitmap_tags, hctx);
>  +}
>   
>   /*
>  - * As soon as this returns, it's no longer safe to fiddle with
>  - * hctx->dispatch_wait, since a completion can wake up the wait 
>  

Re: [PATCH 4/7] nvme: implement multipath access to nvme subsystems

2017-11-09 Thread Christoph Hellwig
On Fri, Nov 10, 2017 at 05:52:36AM +0100, Christoph Hellwig wrote:
> > If we've CMIC capabilities, we'll use the subsys->instance; if we don't
> > have CMIC, we use the ctrl->instance. 
> > 
> > Since the two instances are independent of each other, they can create
> > duplicate names.
> > 
> > To fix, I think we'll need to always use the subsys instance for
> > consistency if CONFIG_NVME_MULTIPATH=y.
> 
> Yes, we should.  But once we do that we should just always use
> subsys->instance, as it will always be the same as ctrl->instance
> for CONFIG_NVME_MULTIPATH=n.

Scrap that.  Of course they aren't always the same in the
CONFIG_NVME_MULTIPATH=n.  We'll need your patch as-is.


Re: [PATCH 4/7] nvme: implement multipath access to nvme subsystems

2017-11-09 Thread Christoph Hellwig
On Thu, Nov 09, 2017 at 04:22:17PM -0500, Mike Snitzer wrote:
> Your 0th header speaks to the NVMe multipath IO path leveraging NVMe's
> lack of partial completion but I think it'd be useful to have this
> header (that actually gets committed) speak to it.

There is a comment above blk_steal_bios to this respect, which is in
the tree.

> Also, the block core patch to introduce blk_steal_bios() already went in
> but should there be a QUEUE_FLAG that gets set by drivers like NVMe that
> don't support partial completion?
> 
> This would make it easier for other future drivers to know whether they
> can use a more optimized IO path.

As-is there is not real need for this.  But if you want to use a similar
path in dm-multipath feel free to add the flag when needed.


Re: [PATCH 4/7] nvme: implement multipath access to nvme subsystems

2017-11-09 Thread Christoph Hellwig
> If we've CMIC capabilities, we'll use the subsys->instance; if we don't
> have CMIC, we use the ctrl->instance. 
> 
> Since the two instances are independent of each other, they can create
> duplicate names.
> 
> To fix, I think we'll need to always use the subsys instance for
> consistency if CONFIG_NVME_MULTIPATH=y.

Yes, we should.  But once we do that we should just always use
subsys->instance, as it will always be the same as ctrl->instance
for CONFIG_NVME_MULTIPATH=n.


Re: [PATCH] blk-mq: fix issue with shared tag queue re-running

2017-11-09 Thread Bart Van Assche
On Thu, 2017-11-09 at 09:32 -0700, Jens Axboe wrote:
> It's been running happily for > 1 hour now, no issues observed.

The same null_blk test runs fine on my setup. But what's weird is that if
I run the srp-test software I again see a lockup in sd_probe_async().
That happens not only with today's for-next branch but also with the same
kernel tree with which my tests passed yesterday, which is weird. I will
analyze this further.

Bart.

[PATCH] blk-mq: improve tag waiting setup for non-shared tags

2017-11-09 Thread Jens Axboe
If we run out of driver tags, we currently treat shared and non-shared
tags the same - both cases hook into the tag waitqueue. This is a bit
more costly than it needs to be on unshared tags, since we have to both
grab the hctx lock, and the waitqueue lock (and disable interrupts).
For the non-shared case, we can simply mark the queue as needing a
restart.

Split blk_mq_dispatch_wait_add() to account for both cases, and
rename it to blk_mq_mark_tag_wait() to better reflect what it
does now.

Without this patch, shared and non-shared performance is about the same
with 4 fio threads hammering on a single null_blk device (~410K, at 75%
sys). With the patch, the shared case is the same, but the non-shared
tags case runs at 431K at 71% sys.
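
(For the non-shared case the restart marking relies on the existing
SCHED_RESTART handling on request completion; roughly, and simplified from
memory of the blk-mq-sched code of that era rather than quoted from the
tree, the consumer side looks like:

static bool blk_mq_sched_restart_hctx(struct blk_mq_hw_ctx *hctx)
{
	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
		return false;

	/* a completion freed a tag: clear the bit and re-run the queue */
	clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
	blk_mq_run_hw_queue(hctx, true);
	return true;
}

so no waitqueue hookup or extra locking is needed on that path.)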

Signed-off-by: Jens Axboe 

---

diff --git a/block/blk-mq.c b/block/blk-mq.c
index bfe24a5b62a3..965317ea45f9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -1006,44 +1006,68 @@ static int blk_mq_dispatch_wake(wait_queue_entry_t 
*wait, unsigned mode,
return 1;
 }
 
-static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx **hctx,
-struct request *rq)
+/*
+ * Mark us waiting for a tag. For shared tags, this involves hooking us into
+ * the tag wakeups. For non-shared tags, we can simply mark us needing a
+ * restart. For both cases, take care to check the condition again after
+ * marking us as waiting.
+ */
+static bool blk_mq_mark_tag_wait(struct blk_mq_hw_ctx **hctx,
+struct request *rq)
 {
struct blk_mq_hw_ctx *this_hctx = *hctx;
-   wait_queue_entry_t *wait = &this_hctx->dispatch_wait;
+   bool shared_tags = (this_hctx->flags & BLK_MQ_F_TAG_SHARED) != 0;
	struct sbq_wait_state *ws;
+   wait_queue_entry_t *wait;
+   bool ret;
 
-   if (!list_empty_careful(&wait->entry))
-   return false;
+   if (!shared_tags) {
+   if (!test_bit(BLK_MQ_S_SCHED_RESTART, &this_hctx->state))
+   set_bit(BLK_MQ_S_SCHED_RESTART, &this_hctx->state);
+   } else {
+   wait = &this_hctx->dispatch_wait;
+   if (!list_empty_careful(&wait->entry))
+   return false;
 
-   spin_lock(&this_hctx->lock);
-   if (!list_empty(&wait->entry)) {
-   spin_unlock(&this_hctx->lock);
-   return false;
-   }
+   spin_lock(&this_hctx->lock);
+   if (!list_empty(&wait->entry)) {
+   spin_unlock(&this_hctx->lock);
+   return false;
+   }
 
-   ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
-   add_wait_queue(&ws->wait, wait);
+   ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
+   add_wait_queue(&ws->wait, wait);
+   }
 
/*
 * It's possible that a tag was freed in the window between the
 * allocation failure and adding the hardware queue to the wait
 * queue.
 */
-   if (!blk_mq_get_driver_tag(rq, hctx, false)) {
+   ret = blk_mq_get_driver_tag(rq, hctx, false);
+
+   if (!shared_tags) {
+   /*
+* Don't clear RESTART here, someone else could have set it.
+* At most this will cost an extra queue run.
+*/
+   return ret;
+   } else {
+   if (!ret) {
+   spin_unlock(&this_hctx->lock);
+   return false;
+   }
+
+   /*
+* We got a tag, remove ourselves from the wait queue to ensure
+* someone else gets the wakeup.
+*/
+   spin_lock_irq(&ws->wait.lock);
+   list_del_init(&wait->entry);
+   spin_unlock_irq(&ws->wait.lock);
	spin_unlock(&this_hctx->lock);
-   return false;
+   return true;
}
-
-   /*
-* We got a tag, remove ourselves from the wait queue to ensure
-* someone else gets the wakeup.
-*/
-   spin_lock_irq(&ws->wait.lock);
-   list_del_init(&wait->entry);
-   spin_unlock_irq(&ws->wait.lock);
-   spin_unlock(&this_hctx->lock);
-   return true;
 }
 
 bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
@@ -1076,10 +1100,15 @@ bool blk_mq_dispatch_rq_list(struct request_queue *q, 
struct list_head *list,
 * before we add this entry back on the dispatch list,
 * we'll re-run it below.
 */
-   if (!blk_mq_dispatch_wait_add(&hctx, rq)) {
+   if (!blk_mq_mark_tag_wait(&hctx, rq)) {
if (got_budget)
blk_mq_put_dispatch_budget(hctx);
-   no_tag = true;
+   /*
+* For non-shared tags, the RESTART check
+* will suffice.
+*/
+  

Re: [PATCH 4/7] nvme: implement multipath access to nvme subsystems

2017-11-09 Thread Mike Snitzer
On Thu, Nov 09 2017 at 12:44pm -0500,
Christoph Hellwig  wrote:

> This patch adds native multipath support to the nvme driver.  For each
> namespace we create only a single block device node, which can be used
> to access that namespace through any of the controllers that refer to it.
> The gendisk for each controller's path to the namespace still exists
> inside the kernel, but is hidden from userspace.  The character device
> nodes are still available on a per-controller basis.  A new link from
> the sysfs directory for the subsystem makes it possible to find all
> controllers for a given subsystem.
> 
> Currently we will always send I/O to the first available path; this will
> be changed once the NVMe Asynchronous Namespace Access (ANA) TP is
> ratified and implemented, at which point we will look at the ANA state
> for each namespace.  Another possibility that was prototyped is to
> use the path that is closest to the submitting NUMA node, which will be
> mostly interesting for PCI, but might also be useful for RDMA or FC
> transports in the future.  There is no plan to implement round robin
> or I/O service time path selectors, as those are not scalable with
> the performance rates provided by NVMe.
> 
> The multipath device will go away once all paths to it disappear,
> any delay to keep it alive needs to be implemented at the controller
> level.
> 
> Signed-off-by: Christoph Hellwig 

Your 0th header speaks to the NVMe multipath IO path leveraging NVMe's
lack of partial completion but I think it'd be useful to have this
header (that actually gets committed) speak to it.

> diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
> new file mode 100644
> index ..062754ebebfd
> --- /dev/null
> +++ b/drivers/nvme/host/multipath.c
...
> +void nvme_failover_req(struct request *req)
> +{
> + struct nvme_ns *ns = req->q->queuedata;
> + unsigned long flags;
> +
> + spin_lock_irqsave(>head->requeue_lock, flags);
> + blk_steal_bios(>head->requeue_list, req);
> + spin_unlock_irqrestore(>head->requeue_lock, flags);
> + blk_mq_end_request(req, 0);
> +
> + nvme_reset_ctrl(ns->ctrl);
> + kblockd_schedule_work(>head->requeue_work);
> +}

Also, the block core patch to introduce blk_steal_bios() already went in
but should there be a QUEUE_FLAG that gets set by drivers like NVMe that
don't support partial completion?

This would make it easier for other future drivers to know whether they
can use a more optimized IO path.
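
(Purely as an illustration of the suggestion -- the flag name below is
hypothetical and not in the tree -- such a hint could be a one-time queue
flag set by the driver:

/* hypothetical: driver guarantees requests never complete partially */
#define QUEUE_FLAG_NO_PARTIAL_COMPLETION	30

static void nvme_mark_no_partial_completion(struct request_queue *q)
{
	/* hypothetical helper: set once when the namespace queue is created */
	queue_flag_set_unlocked(QUEUE_FLAG_NO_PARTIAL_COMPLETION, q);
}

which stacking drivers could then test before picking a blk_steal_bios()
style failover path.)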

Mike


Re: [PATCH 4/7] nvme: implement multipath access to nvme subsystems

2017-11-09 Thread Keith Busch
Ahh, I incorporated non-multipath disks into the mix and am observing some
trouble. Details below:

On Thu, Nov 09, 2017 at 06:44:47PM +0100, Christoph Hellwig wrote:
> +#ifdef CONFIG_NVME_MULTIPATH
> + if (ns->head->disk) {
> + sprintf(disk_name, "nvme%dc%dn%d", ctrl->subsys->instance,
> + ctrl->cntlid, ns->head->instance);
> + flags = GENHD_FL_HIDDEN;
> + } else
> +#endif
> + sprintf(disk_name, "nvme%dn%d", ctrl->instance, 
> ns->head->instance);

...
  
> +int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl, struct nvme_ns_head *head)
> +{
> + struct request_queue *q;
> + bool vwc = false;
> +
> + bio_list_init(>requeue_list);
> + spin_lock_init(>requeue_lock);
> + INIT_WORK(>requeue_work, nvme_requeue_work);
> +
> + /*
> +  * Add a multipath node if the subsystems supports multiple controllers.
> +  * We also do this for private namespaces as the namespace sharing data 
> could
> +  * change after a rescan.
> +  */
> + if (!(ctrl->subsys->cmic & (1 << 1)) || !multipath)
> + return 0;

...

> + sprintf(head->disk->disk_name, "nvme%dn%d",
> + ctrl->subsys->instance, head->instance);

If we've CMIC capabilities, we'll use the subsys->instance; if we don't
have CMIC, we use the ctrl->instance. 

Since the two instances are independent of each other, they can create
duplicate names.

To fix, I think we'll need to always use the subsys instance for
consistency if CONFIG_NVME_MULTIPATH=y.

---
@@ -2837,8 +2826,10 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, 
unsigned nsid)
ctrl->cntlid, ns->head->instance);
flags = GENHD_FL_HIDDEN;
} else
+   sprintf(disk_name, "nvme%dn%d", ctrl->subsys->instance, ns->head->instance);
+#else
+   sprintf(disk_name, "nvme%dn%d", ctrl->instance, ns->head->instance);
 #endif
-   sprintf(disk_name, "nvme%dn%d", ctrl->instance, ns->head->instance);
 
if ((ctrl->quirks & NVME_QUIRK_LIGHTNVM) && id->vs[0] == 0x1) {
if (nvme_nvm_register(ns, disk_name, node)) {
--

This may confuse some people since the block device's name will not always
match up to the controller's chardev name, but I don't see another way
to do it.
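
(To spell out the collision with hypothetical instance numbers: a non-CMIC
controller that was assigned ctrl->instance == 1 and an unrelated
CMIC-capable subsystem that was assigned subsys->instance == 1 would both
produce the same node name from the two sprintf() calls quoted above:

	/* non-CMIC controller: named after the controller instance */
	sprintf(disk_name, "nvme%dn%d", ctrl->instance, ns->head->instance);	/* "nvme1n1" */

	/* CMIC subsystem: multipath node named after the subsystem instance */
	sprintf(head->disk->disk_name, "nvme%dn%d",
		ctrl->subsys->instance, head->instance);	/* "nvme1n1" as well */

hence the proposal to key both off the subsystem instance when
CONFIG_NVME_MULTIPATH=y.)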


Re: [PATCH v12 0/7] block, scsi: Improve suspend and resume

2017-11-09 Thread Jens Axboe
On 11/09/2017 11:49 AM, Bart Van Assche wrote:
> Hello Jens,
> 
> It is known that during the resume following a hibernate, especially when
> using an md RAID1 array created on top of SCSI devices, sometimes the system
> hangs instead of resuming properly. This patch series fixes that
> problem. These patches have been tested on top of the block layer for-next
> branch. Please consider these changes for kernel v4.15.

Applied, thanks Bart.

-- 
Jens Axboe



Re: [PATCH 6/7] block: create 'slaves' and 'holders' entries for hidden gendisks

2017-11-09 Thread Martin K. Petersen

Christoph,

> From: Hannes Reinecke 
>
> When creating nvme multipath devices we should populate the 'slaves'
> and 'holders' directories properly to aid userspace topology detection.

Reviewed-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 7/7] nvme: create 'slaves' and 'holders' entries for hidden controllers

2017-11-09 Thread Martin K. Petersen

Christoph,

> From: Hannes Reinecke 
>
> When creating nvme multipath devices we should populate the 'slaves'
> and 'holders' directories properly to aid userspace topology detection.

Reviewed-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 5/7] nvme: also expose the namespace identification sysfs files for mpath nodes

2017-11-09 Thread Martin K. Petersen

Christoph,

> We do this by adding a helper that returns the ns_head for a device
> that can belong to either the per-controller or per-subsystem block
> device nodes, and otherwise reuse all the existing code.

Reviewed-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 4/7] nvme: implement multipath access to nvme subsystems

2017-11-09 Thread Martin K. Petersen

Christoph,

> This patch adds native multipath support to the nvme driver.  For each
> namespace we create only a single block device node, which can be used
> to access that namespace through any of the controllers that refer to
> it.  The gendisk for each controller's path to the namespace still
> exists inside the kernel, but is hidden from userspace.  The character
> device nodes are still available on a per-controller basis.  A new
> link from the sysfs directory for the subsystem makes it possible to
> find all controllers for a given subsystem.
>
> Currently we will always send I/O to the first available path; this
> will be changed once the NVMe Asynchronous Namespace Access (ANA) TP
> is ratified and implemented, at which point we will look at the ANA
> state for each namespace.  Another possibility that was prototyped is
> to use the path that is closest to the submitting NUMA node, which will
> be mostly interesting for PCI, but might also be useful for RDMA or FC
> transports in the future.  There is no plan to implement round robin
> or I/O service time path selectors, as those are not scalable with the
> performance rates provided by NVMe.
>
> The multipath device will go away once all paths to it disappear, any
> delay to keep it alive needs to be implemented at the controller
> level.

Beautiful!

Reviewed-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 3/7] nvme: track shared namespaces

2017-11-09 Thread Martin K. Petersen

Christoph,

> Introduce a new struct nvme_ns_head that holds information about an
> actual namespace, unlike struct nvme_ns, which only holds the
> per-controller namespace information.  For private namespaces there is
> a 1:1 relation of the two, but for shared namespaces this lets us
> discover all the paths to it.  For now only the identifiers are moved
> to the new structure, but most of the information in struct nvme_ns
> should eventually move over.
>
> To allow lockless path lookup the list of nvme_ns structures per
> nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
> structure through call_srcu.

Reviewed-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 2/7] nvme: introduce a nvme_ns_ids structure

2017-11-09 Thread Martin K. Petersen

Christoph,

> This allows us to manage the various unique namespace identifiers
> together instead of needing various variables and arguments.

Reviewed-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 1/7] nvme: track subsystems

2017-11-09 Thread Martin K. Petersen

Christoph,

> This adds a new nvme_subsystem structure so that we can track multiple
> controllers that belong to a single subsystem.  For now we only use it
> to store the NQN, and to check that we don't have duplicate NQNs
> unless the involved subsystems support multiple controllers.
>
> Includes code originally from Hannes Reinecke to expose the subsystems
> in sysfs.

Reviewed-by: Martin K. Petersen 

-- 
Martin K. Petersen  Oracle Linux Engineering


[PATCH v12 6/7] block, scsi: Make SCSI quiesce and resume work reliably

2017-11-09 Thread Bart Van Assche
The contexts from which a SCSI device can be quiesced or resumed are:
* Writing into /sys/class/scsi_device/*/device/state.
* SCSI parallel (SPI) domain validation.
* The SCSI device power management methods. See also scsi_bus_pm_ops.

It is essential during suspend and resume that neither the filesystem
state nor the filesystem metadata in RAM changes. This is why while
the hibernation image is being written or restored that SCSI devices
are quiesced. The SCSI core quiesces devices through scsi_device_quiesce()
and scsi_device_resume(). In the SDEV_QUIESCE state execution of
non-preempt requests is deferred. This is realized by returning
BLKPREP_DEFER from inside scsi_prep_state_check() for quiesced SCSI
devices. Avoid that a full queue prevents power management requests
to be submitted by deferring allocation of non-preempt requests for
devices in the quiesced state. This patch has been tested by running
the following commands and by verifying that after each resume the
fio job was still running:

for ((i=0; i<10; i++)); do
  (
cd /sys/block/md0/md &&
while true; do
  [ "$( sync_action
  sleep 1
done
  ) &
  pids=($!)
  for d in /sys/class/block/sd*[a-z]; do
bdev=${d#/sys/class/block/}
hcil=$(readlink "$d/device")
hcil=${hcil#../../../}
echo 4 > "$d/queue/nr_requests"
echo 1 > "/sys/class/scsi_device/$hcil/device/queue_depth"
fio --name="$bdev" --filename="/dev/$bdev" --buffered=0 --bs=512 \
  --rw=randread --ioengine=libaio --numjobs=4 --iodepth=16   \
  --iodepth_batch=1 --thread --loops=$((2**31)) &
pids+=($!)
  done
  sleep 1
  echo "$(date) Hibernating ..." >>hibernate-test-log.txt
  systemctl hibernate
  sleep 10
  kill "${pids[@]}"
  echo idle > /sys/block/md0/md/sync_action
  wait
  echo "$(date) Done." >>hibernate-test-log.txt
done

Reported-by: Oleksandr Natalenko 
References: "I/O hangs after resuming from suspend-to-ram" 
(https://marc.info/?l=linux-block&m=150340235201348).
Signed-off-by: Bart Van Assche 
Reviewed-by: Hannes Reinecke 
Tested-by: Martin Steigerwald 
Tested-by: Oleksandr Natalenko 
Cc: Martin K. Petersen 
Cc: Ming Lei 
Cc: Christoph Hellwig 
Cc: Johannes Thumshirn 
---
 block/blk-core.c   | 42 ++
 block/blk-mq.c |  4 ++--
 drivers/scsi/scsi_lib.c| 42 ++
 fs/block_dev.c |  4 ++--
 include/linux/blkdev.h |  2 +-
 include/scsi/scsi_device.h |  1 +
 6 files changed, 70 insertions(+), 25 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index edc276899116..29b08428ae45 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -374,6 +374,7 @@ void blk_clear_preempt_only(struct request_queue *q)
 
spin_lock_irqsave(q->queue_lock, flags);
queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
+   wake_up_all(&q->mq_freeze_wq);
spin_unlock_irqrestore(q->queue_lock, flags);
 }
 EXPORT_SYMBOL_GPL(blk_clear_preempt_only);
@@ -795,15 +796,38 @@ struct request_queue *blk_alloc_queue(gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
-int blk_queue_enter(struct request_queue *q, bool nowait)
+/**
+ * blk_queue_enter() - try to increase q->q_usage_counter
+ * @q: request queue pointer
+ * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
+ */
+int blk_queue_enter(struct request_queue *q, unsigned int flags)
 {
+   const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
+
while (true) {
+   bool success = false;
int ret;
 
-   if (percpu_ref_tryget_live(&q->q_usage_counter))
+   rcu_read_lock_sched();
+   if (percpu_ref_tryget_live(&q->q_usage_counter)) {
+   /*
+* The code that sets the PREEMPT_ONLY flag is
+* responsible for ensuring that that flag is globally
+* visible before the queue is unfrozen.
+*/
+   if (preempt || !blk_queue_preempt_only(q)) {
+   success = true;
+   } else {
+   percpu_ref_put(&q->q_usage_counter);
+   }
+   }
+   rcu_read_unlock_sched();
+
+   if (success)
return 0;
 
-   if (nowait)
+   if (flags & BLK_MQ_REQ_NOWAIT)
return -EBUSY;
 
/*
@@ -816,7 +840,8 @@ int blk_queue_enter(struct request_queue *q, bool nowait)
smp_rmb();
 
ret = wait_event_interruptible(q->mq_freeze_wq,
-   !atomic_read(&q->mq_freeze_depth) ||
+   

[PATCH v12 5/7] block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag

2017-11-09 Thread Bart Van Assche
This flag will be used in the next patch to let the block layer
core know whether or not a SCSI request queue has been quiesced.
A quiesced SCSI queue only processes RQF_PREEMPT requests.

Signed-off-by: Bart Van Assche 
Reviewed-by: Hannes Reinecke 
Tested-by: Martin Steigerwald 
Tested-by: Oleksandr Natalenko 
Cc: Ming Lei 
Cc: Christoph Hellwig 
Cc: Johannes Thumshirn 
---
 block/blk-core.c   | 30 ++
 block/blk-mq-debugfs.c |  1 +
 include/linux/blkdev.h |  6 ++
 3 files changed, 37 insertions(+)

diff --git a/block/blk-core.c b/block/blk-core.c
index 17eed16a6e04..edc276899116 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -348,6 +348,36 @@ void blk_sync_queue(struct request_queue *q)
 }
 EXPORT_SYMBOL(blk_sync_queue);
 
+/**
+ * blk_set_preempt_only - set QUEUE_FLAG_PREEMPT_ONLY
+ * @q: request queue pointer
+ *
+ * Returns the previous value of the PREEMPT_ONLY flag - 0 if the flag was not
+ * set and 1 if the flag was already set.
+ */
+int blk_set_preempt_only(struct request_queue *q)
+{
+   unsigned long flags;
+   int res;
+
+   spin_lock_irqsave(q->queue_lock, flags);
+   res = queue_flag_test_and_set(QUEUE_FLAG_PREEMPT_ONLY, q);
+   spin_unlock_irqrestore(q->queue_lock, flags);
+
+   return res;
+}
+EXPORT_SYMBOL_GPL(blk_set_preempt_only);
+
+void blk_clear_preempt_only(struct request_queue *q)
+{
+   unsigned long flags;
+
+   spin_lock_irqsave(q->queue_lock, flags);
+   queue_flag_clear(QUEUE_FLAG_PREEMPT_ONLY, q);
+   spin_unlock_irqrestore(q->queue_lock, flags);
+}
+EXPORT_SYMBOL_GPL(blk_clear_preempt_only);
+
 /**
  * __blk_run_queue_uncond - run a queue whether or not it has been stopped
  * @q: The queue to run
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 7e3ae8105d5f..b56a4f35720d 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -74,6 +74,7 @@ static const char *const blk_queue_flag_name[] = {
QUEUE_FLAG_NAME(REGISTERED),
QUEUE_FLAG_NAME(SCSI_PASSTHROUGH),
QUEUE_FLAG_NAME(QUIESCED),
+   QUEUE_FLAG_NAME(PREEMPT_ONLY),
 };
 #undef QUEUE_FLAG_NAME
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1af5ddd0f631..2147e2381a22 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -632,6 +632,7 @@ struct request_queue {
 #define QUEUE_FLAG_REGISTERED  26  /* queue has been registered to a disk */
 #define QUEUE_FLAG_SCSI_PASSTHROUGH 27 /* queue supports SCSI commands */
 #define QUEUE_FLAG_QUIESCED28  /* queue has been quiesced */
+#define QUEUE_FLAG_PREEMPT_ONLY29  /* only process REQ_PREEMPT requests */
 
 #define QUEUE_FLAG_DEFAULT ((1 << QUEUE_FLAG_IO_STAT) |\
 (1 << QUEUE_FLAG_SAME_COMP)|   \
@@ -732,6 +733,11 @@ static inline void queue_flag_clear(unsigned int flag, 
struct request_queue *q)
((rq)->cmd_flags & (REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT| \
 REQ_FAILFAST_DRIVER))
 #define blk_queue_quiesced(q)  test_bit(QUEUE_FLAG_QUIESCED, &(q)->queue_flags)
+#define blk_queue_preempt_only(q)  \
+   test_bit(QUEUE_FLAG_PREEMPT_ONLY, &(q)->queue_flags)
+
+extern int blk_set_preempt_only(struct request_queue *q);
+extern void blk_clear_preempt_only(struct request_queue *q);
 
 static inline bool blk_account_rq(struct request *rq)
 {
-- 
2.14.3



[PATCH v12 4/7] ide, scsi: Tell the block layer at request allocation time about preempt requests

2017-11-09 Thread Bart Van Assche
Convert blk_get_request(q, op, __GFP_RECLAIM) into
blk_get_request_flags(q, op, BLK_MQ_REQ_PREEMPT). This patch does not
change any functionality.

Signed-off-by: Bart Van Assche 
Tested-by: Martin Steigerwald 
Acked-by: David S. Miller  [ for IDE ]
Acked-by: Martin K. Petersen 
Reviewed-by: Hannes Reinecke 
Tested-by: Oleksandr Natalenko 
Cc: Ming Lei 
Cc: Christoph Hellwig 
Cc: Johannes Thumshirn 
---
 drivers/ide/ide-pm.c| 4 ++--
 drivers/scsi/scsi_lib.c | 6 +++---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/ide/ide-pm.c b/drivers/ide/ide-pm.c
index 544f02d673ca..f56d742908df 100644
--- a/drivers/ide/ide-pm.c
+++ b/drivers/ide/ide-pm.c
@@ -89,9 +89,9 @@ int generic_ide_resume(struct device *dev)
}
 
	memset(&rqpm, 0, sizeof(rqpm));
-   rq = blk_get_request(drive->queue, REQ_OP_DRV_IN, __GFP_RECLAIM);
+   rq = blk_get_request_flags(drive->queue, REQ_OP_DRV_IN,
+  BLK_MQ_REQ_PREEMPT);
ide_req(rq)->type = ATA_PRIV_PM_RESUME;
-   rq->rq_flags |= RQF_PREEMPT;
	rq->special = &rqpm;
rqpm.pm_step = IDE_PM_START_RESUME;
rqpm.pm_state = PM_EVENT_ON;
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index c537c4d768c1..1e94081cc791 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -252,9 +252,9 @@ int scsi_execute(struct scsi_device *sdev, const unsigned 
char *cmd,
struct scsi_request *rq;
int ret = DRIVER_ERROR << 24;
 
-   req = blk_get_request(sdev->request_queue,
+   req = blk_get_request_flags(sdev->request_queue,
data_direction == DMA_TO_DEVICE ?
-   REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, __GFP_RECLAIM);
+   REQ_OP_SCSI_OUT : REQ_OP_SCSI_IN, BLK_MQ_REQ_PREEMPT);
if (IS_ERR(req))
return ret;
rq = scsi_req(req);
@@ -268,7 +268,7 @@ int scsi_execute(struct scsi_device *sdev, const unsigned 
char *cmd,
rq->retries = retries;
req->timeout = timeout;
req->cmd_flags |= flags;
-   req->rq_flags |= rq_flags | RQF_QUIET | RQF_PREEMPT;
+   req->rq_flags |= rq_flags | RQF_QUIET;
 
/*
 * head injection *required* here otherwise quiesce won't work
-- 
2.14.3



[PATCH v12 2/7] block: Introduce blk_get_request_flags()

2017-11-09 Thread Bart Van Assche
A side effect of this patch is that the GFP mask that is passed to
several allocation functions in the legacy block layer is changed
from GFP_KERNEL into __GFP_DIRECT_RECLAIM.

Signed-off-by: Bart Van Assche 
Reviewed-by: Hannes Reinecke 
Tested-by: Martin Steigerwald 
Tested-by: Oleksandr Natalenko 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Johannes Thumshirn 
---
 block/blk-core.c   | 50 +++---
 include/linux/blkdev.h |  3 +++
 2 files changed, 38 insertions(+), 15 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index a4362849059a..0f7093dfc010 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1160,7 +1160,7 @@ int blk_update_nr_requests(struct request_queue *q, 
unsigned int nr)
  * @rl: request list to allocate from
  * @op: operation and flags
  * @bio: bio to allocate request for (can be %NULL)
- * @gfp_mask: allocation mask
+ * @flags: BLQ_MQ_REQ_* flags
  *
  * Get a free request from @q.  This function may fail under memory
  * pressure or if @q is dead.
@@ -1170,7 +1170,7 @@ int blk_update_nr_requests(struct request_queue *q, 
unsigned int nr)
  * Returns request pointer on success, with @q->queue_lock *not held*.
  */
 static struct request *__get_request(struct request_list *rl, unsigned int op,
-   struct bio *bio, gfp_t gfp_mask)
+struct bio *bio, unsigned int flags)
 {
struct request_queue *q = rl->q;
struct request *rq;
@@ -1179,6 +1179,8 @@ static struct request *__get_request(struct request_list 
*rl, unsigned int op,
struct io_cq *icq = NULL;
const bool is_sync = op_is_sync(op);
int may_queue;
+   gfp_t gfp_mask = flags & BLK_MQ_REQ_NOWAIT ? GFP_ATOMIC :
+__GFP_DIRECT_RECLAIM;
req_flags_t rq_flags = RQF_ALLOCED;
 
lockdep_assert_held(q->queue_lock);
@@ -1339,7 +1341,7 @@ static struct request *__get_request(struct request_list 
*rl, unsigned int op,
  * @q: request_queue to allocate request from
  * @op: operation and flags
  * @bio: bio to allocate request for (can be %NULL)
- * @gfp_mask: allocation mask
+ * @flags: BLK_MQ_REQ_* flags.
  *
  * Get a free request from @q.  If %__GFP_DIRECT_RECLAIM is set in @gfp_mask,
  * this function keeps retrying under memory pressure and fails iff @q is dead.
@@ -1349,7 +1351,7 @@ static struct request *__get_request(struct request_list 
*rl, unsigned int op,
  * Returns request pointer on success, with @q->queue_lock *not held*.
  */
 static struct request *get_request(struct request_queue *q, unsigned int op,
-   struct bio *bio, gfp_t gfp_mask)
+  struct bio *bio, unsigned int flags)
 {
const bool is_sync = op_is_sync(op);
DEFINE_WAIT(wait);
@@ -1361,7 +1363,7 @@ static struct request *get_request(struct request_queue 
*q, unsigned int op,
 
rl = blk_get_rl(q, bio);/* transferred to @rq on success */
 retry:
-   rq = __get_request(rl, op, bio, gfp_mask);
+   rq = __get_request(rl, op, bio, flags);
if (!IS_ERR(rq))
return rq;
 
@@ -1370,7 +1372,7 @@ static struct request *get_request(struct request_queue 
*q, unsigned int op,
return ERR_PTR(-EAGAIN);
}
 
-   if (!gfpflags_allow_blocking(gfp_mask) || unlikely(blk_queue_dying(q))) 
{
+   if ((flags & BLK_MQ_REQ_NOWAIT) || unlikely(blk_queue_dying(q))) {
blk_put_rl(rl);
return rq;
}
@@ -1397,10 +1399,13 @@ static struct request *get_request(struct request_queue 
*q, unsigned int op,
goto retry;
 }
 
+/* flags: BLK_MQ_REQ_PREEMPT and/or BLK_MQ_REQ_NOWAIT. */
 static struct request *blk_old_get_request(struct request_queue *q,
-  unsigned int op, gfp_t gfp_mask)
+  unsigned int op, unsigned int flags)
 {
struct request *rq;
+   gfp_t gfp_mask = flags & BLK_MQ_REQ_NOWAIT ? GFP_ATOMIC :
+__GFP_DIRECT_RECLAIM;
int ret = 0;
 
WARN_ON_ONCE(q->mq_ops);
@@ -1413,7 +1418,7 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
if (ret)
return ERR_PTR(ret);
spin_lock_irq(q->queue_lock);
-   rq = get_request(q, op, NULL, gfp_mask);
+   rq = get_request(q, op, NULL, flags);
if (IS_ERR(rq)) {
spin_unlock_irq(q->queue_lock);
blk_queue_exit(q);
@@ -1427,25 +1432,40 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
return rq;
 }
 
-struct request *blk_get_request(struct request_queue *q, unsigned int op,
-   gfp_t gfp_mask)
+/**
+ * blk_get_request_flags - allocate a request
+ * @q: request queue to allocate a 

[PATCH v12 7/7] block, nvme: Introduce blk_mq_req_flags_t

2017-11-09 Thread Bart Van Assche
Several block layer and NVMe core functions accept a combination
of BLK_MQ_REQ_* flags through the 'flags' argument but there is
no verification at compile time whether the right type of block
layer flags is passed. Make it possible for sparse to verify this.
This patch does not change any functionality.
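
(The mechanism is the usual __bitwise annotation; a sketch of the idea,
assuming the blk_types.h/blk-mq.h hunks that are not quoted in full below:

typedef __u32 __bitwise blk_mq_req_flags_t;

/* each flag is force-cast so that only blk_mq_req_flags_t values combine */
#define BLK_MQ_REQ_NOWAIT	((__force blk_mq_req_flags_t)(1 << 0))
#define BLK_MQ_REQ_PREEMPT	((__force blk_mq_req_flags_t)(1 << 3))

With that, sparse warns whenever a plain unsigned int is passed where a
blk_mq_req_flags_t is expected, or vice versa.)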

Signed-off-by: Bart Van Assche 
Reviewed-by: Hannes Reinecke 
Tested-by: Oleksandr Natalenko 
Cc: linux-n...@lists.infradead.org
Cc: Christoph Hellwig 
Cc: Johannes Thumshirn 
Cc: Ming Lei 
---
 block/blk-core.c  | 12 ++--
 block/blk-mq.c|  4 ++--
 block/blk-mq.h|  2 +-
 drivers/nvme/host/core.c  |  5 +++--
 drivers/nvme/host/nvme.h  |  5 +++--
 include/linux/blk-mq.h| 17 +++--
 include/linux/blk_types.h |  2 ++
 include/linux/blkdev.h|  4 ++--
 8 files changed, 30 insertions(+), 21 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 29b08428ae45..7c54c195e79e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -801,7 +801,7 @@ EXPORT_SYMBOL(blk_alloc_queue);
  * @q: request queue pointer
  * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PREEMPT
  */
-int blk_queue_enter(struct request_queue *q, unsigned int flags)
+int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags)
 {
const bool preempt = flags & BLK_MQ_REQ_PREEMPT;
 
@@ -1225,7 +1225,7 @@ int blk_update_nr_requests(struct request_queue *q, 
unsigned int nr)
  * Returns request pointer on success, with @q->queue_lock *not held*.
  */
 static struct request *__get_request(struct request_list *rl, unsigned int op,
-struct bio *bio, unsigned int flags)
+struct bio *bio, blk_mq_req_flags_t flags)
 {
struct request_queue *q = rl->q;
struct request *rq;
@@ -1408,7 +1408,7 @@ static struct request *__get_request(struct request_list 
*rl, unsigned int op,
  * Returns request pointer on success, with @q->queue_lock *not held*.
  */
 static struct request *get_request(struct request_queue *q, unsigned int op,
-  struct bio *bio, unsigned int flags)
+  struct bio *bio, blk_mq_req_flags_t flags)
 {
const bool is_sync = op_is_sync(op);
DEFINE_WAIT(wait);
@@ -1458,7 +1458,7 @@ static struct request *get_request(struct request_queue 
*q, unsigned int op,
 
 /* flags: BLK_MQ_REQ_PREEMPT and/or BLK_MQ_REQ_NOWAIT. */
 static struct request *blk_old_get_request(struct request_queue *q,
-  unsigned int op, unsigned int flags)
+   unsigned int op, blk_mq_req_flags_t flags)
 {
struct request *rq;
gfp_t gfp_mask = flags & BLK_MQ_REQ_NOWAIT ? GFP_ATOMIC :
@@ -1495,7 +1495,7 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
  * @flags: BLK_MQ_REQ_* flags, e.g. BLK_MQ_REQ_NOWAIT.
  */
 struct request *blk_get_request_flags(struct request_queue *q, unsigned int op,
- unsigned int flags)
+ blk_mq_req_flags_t flags)
 {
struct request *req;
 
@@ -2291,7 +2291,7 @@ blk_qc_t generic_make_request(struct bio *bio)
current->bio_list = bio_list_on_stack;
do {
struct request_queue *q = bio->bi_disk->queue;
-   unsigned int flags = bio->bi_opf & REQ_NOWAIT ?
+   blk_mq_req_flags_t flags = bio->bi_opf & REQ_NOWAIT ?
BLK_MQ_REQ_NOWAIT : 0;
 
if (likely(blk_queue_enter(q, flags) == 0)) {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4fb187570e96..462c188d9679 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -383,7 +383,7 @@ static struct request *blk_mq_get_request(struct 
request_queue *q,
 }
 
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
-   unsigned int flags)
+   blk_mq_req_flags_t flags)
 {
struct blk_mq_alloc_data alloc_data = { .flags = flags };
struct request *rq;
@@ -409,7 +409,7 @@ struct request *blk_mq_alloc_request(struct request_queue 
*q, unsigned int op,
 EXPORT_SYMBOL(blk_mq_alloc_request);
 
 struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
-   unsigned int op, unsigned int flags, unsigned int hctx_idx)
+   unsigned int op, blk_mq_req_flags_t flags, unsigned int hctx_idx)
 {
struct blk_mq_alloc_data alloc_data = { .flags = flags };
struct request *rq;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 2502f40ccdc0..99a19c5523e2 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -111,7 +111,7 @@ static inline void blk_mq_put_ctx(struct blk_mq_ctx *ctx)
 struct blk_mq_alloc_data {
/* input parameter */
struct request_queue *q;
-   unsigned int flags;
+   

[PATCH v12 0/7] block, scsi: Improve suspend and resume

2017-11-09 Thread Bart Van Assche
Hello Jens,

It is known that during the resume following a hibernate, especially when
using an md RAID1 array created on top of SCSI devices, sometimes the system
hangs instead of resuming properly. This patch series fixes that
problem. These patches have been tested on top of the block layer for-next
branch. Please consider these changes for kernel v4.15.

Thanks,

Bart.

Changes between v11 and v12:
- Left out the WARN_ONCE() statement from blk_queue_enter().
- In patch 6, also update the direct_make_request() blk_queue_enter() call.

Changes between v10 and v11:
- Left out the three md patches because a deadlock was reported when using XFS
  on top of md RAID 1. This deadlock occurred because the md kernel thread got
  frozen before the kernel thread running xfsaild().
- Left out the blk_queue_enter() / blk_queue_exit() changes from
  block/blk-timeout.c because a recent patch removed these calls from
  blk_timeout_work().
- Retested the whole series.

Changes between v9 and v10:
- Made sure that scsi_device_resume() handles SCSI devices that are in an
  error state correctly. Unlike other SCSI transport protocols, during an
  ATA device resume the ATA controller is reset. This should fix the
  blk_queue_enter WARNING reported by the kernel test robot.

Changes between v8 and v9:
- Modified the third patch such that the MD_RECOVERY_FROZEN flag is restored
  properly after a resume.
- Modified the ninth patch such that a kernel warning is reported if it is
  attempted to call scsi_device_quiesce() from multiple contexts concurrently.
- Added Reviewed-by / Tested-by tags as appropriate.
  
Changes between v7 and v8:
- Fixed a (theoretical?) race identified by Ming Lei.
- Added a tenth patch that checks whether the proper type of flags has been
  passed to a range of block layer functions.

Changes between v6 and v7:
- Added support for the PREEMPT_ONLY flag in blk-mq-debugfs.c.
- Fixed kerneldoc header of blk_queue_enter().
- Added a rcu_read_lock_sched() / rcu_read_unlock_sched() pair in
  blk_set_preempt_only().
- Removed a synchronize_rcu() call from scsi_device_quiesce().
- Modified the description of patch 9/9 in this series.
- Removed scsi_run_queue() call from scsi_device_resume().

Changes between v5 and v6:
- Split an md patch into two patches to make it easier to review the changes.
- For the md patch that suspends I/O while the system is frozen, switched back
  to the freezer mechanism because this makes the code shorter and easier to
  review.
- Changed blk_set/clear_preempt_only() from EXPORT_SYMBOL() into
  EXPORT_SYMBOL_GPL().
- Made blk_set_preempt_only() behave as a test-and-set operation.
- Introduced blk_get_request_flags() and BLK_MQ_REQ_PREEMPT as requested by
  Christoph and reduced the number of arguments of blk_queue_enter() back from
  three to two.
- In scsi_device_quiesce(), moved the blk_mq_freeze_queue() call out of a
  critical section. Made the explanation of why the synchronize_rcu() call
  is necessary more detailed.

Changes between v4 and v5:
- Split blk_set_preempt_only() into two functions as requested by Christoph.
- Made blk_get_request() trigger WARN_ONCE() if it is attempted to allocate
  a request while the system is frozen. This is a big help to identify drivers
  that submit I/O while the system is frozen.
- Since kernel thread freezing is on its way out, reworked the approach for
  avoiding that the md driver submits I/O while the system is frozen such that
  the approach no longer depends on the kernel thread freeze mechanism.
- Fixed the (theoretical) deadlock in scsi_device_quiesce() that was identified
  by Ming.

Changes between v3 and v4:
- Made sure that this patch series not only works for scsi-mq but also for
  the legacy SCSI stack.
- Removed an smp_rmb()/smp_wmb() pair from the hot path and added a comment
  that explains why that is safe.
- Reordered the patches in this patch series.

Changes between v2 and v3:
- Made md kernel threads freezable.
- Changed the approach for quiescing SCSI devices again.
- Addressed Ming's review comments.

Changes compared to v1 of this patch series:
- Changed the approach and rewrote the patch series.

Bart Van Assche (6):
  block: Introduce blk_get_request_flags()
  block: Introduce BLK_MQ_REQ_PREEMPT
  ide, scsi: Tell the block layer at request allocation time about
preempt requests
  block: Add the QUEUE_FLAG_PREEMPT_ONLY request queue flag
  block, scsi: Make SCSI quiesce and resume work reliably
  block, nvme: Introduce blk_mq_req_flags_t

Ming Lei (1):
  block: Make q_usage_counter also track legacy requests

 block/blk-core.c   | 132 +
 block/blk-mq-debugfs.c |   1 +
 block/blk-mq.c |  20 +++
 block/blk-mq.h |   2 +-
 drivers/ide/ide-pm.c   |   4 +-
 drivers/nvme/host/core.c   |   5 +-
 drivers/nvme/host/nvme.h   |   5 +-
 drivers/scsi/scsi_lib.c|  48 +++--
 fs/block_dev.c |   4 +-
 

[PATCH v12 3/7] block: Introduce BLK_MQ_REQ_PREEMPT

2017-11-09 Thread Bart Van Assche
Set RQF_PREEMPT if BLK_MQ_REQ_PREEMPT is passed to
blk_get_request_flags().
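
A minimal usage sketch (illustration only, not part of this patch): a
power-management path that must issue a command while the queue is quiesced
would allocate its request roughly like this:

	struct request *rq;

	rq = blk_get_request_flags(q, REQ_OP_DRV_IN, BLK_MQ_REQ_PREEMPT);
	if (IS_ERR(rq))
		return PTR_ERR(rq);
	/* rq->rq_flags now has RQF_PREEMPT set, so the request is admitted
	 * even while the queue only accepts preempt requests. */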

Signed-off-by: Bart Van Assche 
Reviewed-by: Hannes Reinecke 
Tested-by: Martin Steigerwald 
Tested-by: Oleksandr Natalenko 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Johannes Thumshirn 
---
 block/blk-core.c   | 4 +++-
 block/blk-mq.c | 2 ++
 include/linux/blk-mq.h | 1 +
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 0f7093dfc010..17eed16a6e04 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1263,6 +1263,8 @@ static struct request *__get_request(struct request_list 
*rl, unsigned int op,
blk_rq_set_rl(rq, rl);
rq->cmd_flags = op;
rq->rq_flags = rq_flags;
+   if (flags & BLK_MQ_REQ_PREEMPT)
+   rq->rq_flags |= RQF_PREEMPT;
 
/* init elvpriv */
if (rq_flags & RQF_ELVPRIV) {
@@ -1444,7 +1446,7 @@ struct request *blk_get_request_flags(struct 
request_queue *q, unsigned int op,
struct request *req;
 
WARN_ON_ONCE(op & REQ_NOWAIT);
-   WARN_ON_ONCE(flags & ~BLK_MQ_REQ_NOWAIT);
+   WARN_ON_ONCE(flags & ~(BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_PREEMPT));
 
if (q->mq_ops) {
req = blk_mq_alloc_request(q, op, flags);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2b7d95b2cdfb..2c7ceb34a016 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -291,6 +291,8 @@ static struct request *blk_mq_rq_ctx_init(struct 
blk_mq_alloc_data *data,
rq->q = data->q;
rq->mq_ctx = data->ctx;
rq->cmd_flags = op;
+   if (data->flags & BLK_MQ_REQ_PREEMPT)
+   rq->rq_flags |= RQF_PREEMPT;
if (blk_queue_io_stat(data->q))
rq->rq_flags |= RQF_IO_STAT;
/* do not touch atomic flags, it needs atomic ops against the timer */
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 4ae987c2352c..82b56609736a 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -212,6 +212,7 @@ enum {
BLK_MQ_REQ_NOWAIT   = (1 << 0), /* return when out of requests */
BLK_MQ_REQ_RESERVED = (1 << 1), /* allocate from reserved pool */
BLK_MQ_REQ_INTERNAL = (1 << 2), /* allocate internal/sched tag */
+   BLK_MQ_REQ_PREEMPT  = (1 << 3), /* set RQF_PREEMPT */
 };
 
 struct request *blk_mq_alloc_request(struct request_queue *q, unsigned int op,
-- 
2.14.3



[PATCH v12 1/7] block: Make q_usage_counter also track legacy requests

2017-11-09 Thread Bart Van Assche
From: Ming Lei 

This patch makes it possible to pause request allocation for
the legacy block layer by calling blk_mq_freeze_queue() and
blk_mq_unfreeze_queue().
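
Illustration (not part of the patch) of what this enables: a caller such as
the later SCSI quiesce change can now pause request allocation on a legacy
(non-mq) queue with the same helpers blk-mq already uses:

	blk_mq_freeze_queue(q);		/* blocks until q_usage_counter drains */
	/* ... no new requests can be allocated on q here ... */
	blk_mq_unfreeze_queue(q);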

Signed-off-by: Ming Lei 
[ bvanassche: Combined two patches into one, edited a comment and made sure
  REQ_NOWAIT is handled properly in blk_old_get_request() ]
Signed-off-by: Bart Van Assche 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Johannes Thumshirn 
Reviewed-by: Hannes Reinecke 
Tested-by: Martin Steigerwald 
Tested-by: Oleksandr Natalenko 
Cc: Ming Lei 
---
 block/blk-core.c | 12 
 block/blk-mq.c   | 10 ++
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 5e81dcf4690a..a4362849059a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -612,6 +612,9 @@ void blk_set_queue_dying(struct request_queue *q)
}
spin_unlock_irq(q->queue_lock);
}
+
+   /* Make blk_queue_enter() reexamine the DYING flag. */
+   wake_up_all(&q->mq_freeze_wq);
 }
 EXPORT_SYMBOL_GPL(blk_set_queue_dying);
 
@@ -1398,16 +1401,22 @@ static struct request *blk_old_get_request(struct 
request_queue *q,
   unsigned int op, gfp_t gfp_mask)
 {
struct request *rq;
+   int ret = 0;
 
WARN_ON_ONCE(q->mq_ops);
 
/* create ioc upfront */
create_io_context(gfp_mask, q->node);
 
+   ret = blk_queue_enter(q, !(gfp_mask & __GFP_DIRECT_RECLAIM) ||
+ (op & REQ_NOWAIT));
+   if (ret)
+   return ERR_PTR(ret);
spin_lock_irq(q->queue_lock);
rq = get_request(q, op, NULL, gfp_mask);
if (IS_ERR(rq)) {
spin_unlock_irq(q->queue_lock);
+   blk_queue_exit(q);
return rq;
}
 
@@ -1579,6 +1588,7 @@ void __blk_put_request(struct request_queue *q, struct 
request *req)
blk_free_request(rl, req);
freed_request(rl, sync, rq_flags);
blk_put_rl(rl);
+   blk_queue_exit(q);
}
 }
 EXPORT_SYMBOL_GPL(__blk_put_request);
@@ -1860,8 +1870,10 @@ static blk_qc_t blk_queue_bio(struct request_queue *q, 
struct bio *bio)
 * Grab a free request. This is might sleep but can not fail.
 * Returns with the queue unlocked.
 */
+   blk_queue_enter_live(q);
req = get_request(q, bio->bi_opf, bio, GFP_NOIO);
if (IS_ERR(req)) {
+   blk_queue_exit(q);
__wbt_done(q->rq_wb, wb_acct);
if (PTR_ERR(req) == -ENOMEM)
bio->bi_status = BLK_STS_RESOURCE;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 2396f036c388..2b7d95b2cdfb 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -126,7 +126,8 @@ void blk_freeze_queue_start(struct request_queue *q)
freeze_depth = atomic_inc_return(&q->mq_freeze_depth);
if (freeze_depth == 1) {
percpu_ref_kill(&q->q_usage_counter);
-   blk_mq_run_hw_queues(q, false);
+   if (q->mq_ops)
+   blk_mq_run_hw_queues(q, false);
}
 }
 EXPORT_SYMBOL_GPL(blk_freeze_queue_start);
@@ -256,13 +257,6 @@ void blk_mq_wake_waiters(struct request_queue *q)
queue_for_each_hw_ctx(q, hctx, i)
if (blk_mq_hw_queue_mapped(hctx))
blk_mq_tag_wakeup_all(hctx->tags, true);
-
-   /*
-* If we are called because the queue has now been marked as
-* dying, we need to ensure that processes currently waiting on
-* the queue are notified as well.
-*/
-   wake_up_all(&q->mq_freeze_wq);
 }
 
 bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
-- 
2.14.3



Re: [PATCH] blk-mq: fix issue with shared tag queue re-running

2017-11-09 Thread Omar Sandoval
On Wed, Nov 08, 2017 at 03:48:51PM -0700, Jens Axboe wrote:
> This patch attempts to make the case of hctx re-running on driver tag
> failure more robust. Without this patch, it's pretty easy to trigger a
> stall condition with shared tags. An example is using null_blk like
> this:
> 
> modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 submit_queues=1 
> hw_queue_depth=1
> 
> which sets up 4 devices, sharing the same tag set with a depth of 1.
> Running a fio job ala:
> 
> [global]
> bs=4k
> rw=randread
> norandommap
> direct=1
> ioengine=libaio
> iodepth=4
> 
> [nullb0]
> filename=/dev/nullb0
> [nullb1]
> filename=/dev/nullb1
> [nullb2]
> filename=/dev/nullb2
> [nullb3]
> filename=/dev/nullb3
> 
> will inevitably end with one or more threads being stuck waiting for a
> scheduler tag. That IO is then stuck forever, until someone else
> triggers a run of the queue.
> 
> Ensure that we always re-run the hardware queue, if the driver tag we
> were waiting for got freed before we added our leftover request entries
> back on the dispatch list.
> 
> Signed-off-by: Jens Axboe 

Reviewed-by: Omar Sandoval 


[PATCH 0/2] block: enforce ioctl(BLKROSET) and set_disk_ro()

2017-11-09 Thread Ilya Dryomov
Hello,

I was doing some cleanup work on rbd BLKROSET handler and discovered
that we ignore partition rw/ro setting (hd_struct->policy) for pretty
much everything but straight writes.

David (CCed) has blktests patches standing by.

(Another aspect of this is that we don't enforce open(2) mode.  Tejun
took a stab at this a few years ago, but his patch had to be reverted:

  75f1dc0d076d ("block: check bdev_read_only() from blkdev_get()")
  e51900f7d38c ("block: revert block_dev read-only check")

It is a separate issue and refusing writes to read-only devices is
obviously more important, but perhaps it's time to revisit that as
well?)

Thanks,

Ilya


Ilya Dryomov (2):
  block: fail op_is_write() requests to read-only partitions
  block: add bdev_read_only() checks to common helpers

 block/blk-core.c | 23 ++-
 block/blk-lib.c  | 12 
 2 files changed, 34 insertions(+), 1 deletion(-)

-- 
2.4.3



[PATCH 2/2] block: add bdev_read_only() checks to common helpers

2017-11-09 Thread Ilya Dryomov
Similar to blkdev_write_iter(), return -EPERM if the partition is
read-only.  This covers ioctl(), fallocate() and most in-kernel users
but isn't meant to be exhaustive -- everything else will be caught in
generic_make_request_checks(), fail with -EIO and can be fixed later.
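
For illustration (not part of the patch), the effect on one of these
helpers: a discard aimed at a read-only device now fails early instead of
reaching the driver:

	/* with ioctl(BLKROSET) or set_disk_ro() in effect on bdev: */
	int ret = blkdev_issue_discard(bdev, 0, nr_sects, GFP_KERNEL, 0);
	/* ret == -EPERM, returned by __blkdev_issue_discard() above */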

Signed-off-by: Ilya Dryomov 
---
 block/blk-lib.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/block/blk-lib.c b/block/blk-lib.c
index f625fda5f095..64fe863ae43a 100644
--- a/block/blk-lib.c
+++ b/block/blk-lib.c
@@ -36,6 +36,9 @@ int __blkdev_issue_discard(struct block_device *bdev, 
sector_t sector,
if (!q)
return -ENXIO;
 
+   if (bdev_read_only(bdev))
+   return -EPERM;
+
if (flags & BLKDEV_DISCARD_SECURE) {
if (!blk_queue_secure_erase(q))
return -EOPNOTSUPP;
@@ -155,6 +158,9 @@ static int __blkdev_issue_write_same(struct block_device 
*bdev, sector_t sector,
if (!q)
return -ENXIO;
 
+   if (bdev_read_only(bdev))
+   return -EPERM;
+
bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
if ((sector | nr_sects) & bs_mask)
return -EINVAL;
@@ -232,6 +238,9 @@ static int __blkdev_issue_write_zeroes(struct block_device 
*bdev,
if (!q)
return -ENXIO;
 
+   if (bdev_read_only(bdev))
+   return -EPERM;
+
/* Ensure that max_write_zeroes_sectors doesn't overflow bi_size */
max_write_zeroes_sectors = bdev_write_zeroes_sectors(bdev);
 
@@ -286,6 +295,9 @@ static int __blkdev_issue_zero_pages(struct block_device 
*bdev,
if (!q)
return -ENXIO;
 
+   if (bdev_read_only(bdev))
+   return -EPERM;
+
while (nr_sects != 0) {
bio = next_bio(bio, __blkdev_sectors_to_bio_pages(nr_sects),
   gfp_mask);
-- 
2.4.3



[PATCH 1/2] block: fail op_is_write() requests to read-only partitions

2017-11-09 Thread Ilya Dryomov
Regular block device writes go through blkdev_write_iter(), which checks
bdev_read_only(), while zeroout/discard/etc requests are never checked,
both userspace- and kernel-triggered.  Add a generic catch-all check to
generic_make_request_checks() to actually enforce ioctl(BLKROSET) and
set_disk_ro(), which is used by quite a few drivers for things like
snapshots, read-only backing files/images, etc.
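
For context, the policy being enforced is the one userspace sets through
BLKROSET (or that drivers set with set_disk_ro()).  A minimal userspace
illustration, assuming fd is an open descriptor for the partition:

	#include <sys/ioctl.h>
	#include <linux/fs.h>

	int ro = 1;

	ioctl(fd, BLKROSET, &ro);	/* mark the partition read-only */
	/* write/discard/zeroout bios against it are now rejected */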

Signed-off-by: Ilya Dryomov 
---
 block/blk-core.c | 23 ++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index b8d1aa2d1008..139ff47caf4a 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -2022,6 +2022,20 @@ static inline int bio_check_eod(struct bio *bio, 
unsigned int nr_sectors)
return 0;
 }
 
+static inline bool bio_check_ro(struct bio *bio)
+{
+   struct hd_struct *p;
+   int ret = false;
+
+   rcu_read_lock();
+   p = __disk_get_part(bio->bi_disk, bio->bi_partno);
+   if (!p || (p->policy && op_is_write(bio_op(bio))))
+   ret = true;
+   rcu_read_unlock();
+
+   return ret;
+}
+
 static noinline_for_stack bool
 generic_make_request_checks(struct bio *bio)
 {
@@ -2044,11 +2058,18 @@ generic_make_request_checks(struct bio *bio)
goto end_io;
}
 
+   if (bio_check_ro(bio)) {
+   printk(KERN_ERR
+  "generic_make_request: Trying to write "
+   "to read-only block-device %s (partno %d)\n",
+   bio_devname(bio, b), bio->bi_partno);
+   goto end_io;
+   }
+
/*
 * For a REQ_NOWAIT based request, return -EOPNOTSUPP
 * if queue is not a request based queue.
 */
-
if ((bio->bi_opf & REQ_NOWAIT) && !queue_is_rq_based(q))
goto not_supported;
 
-- 
2.4.3



Re: [PATCH 4/7] nvme: implement multipath access to nvme subsystems

2017-11-09 Thread Keith Busch
On Thu, Nov 09, 2017 at 06:44:47PM +0100, Christoph Hellwig wrote:
> +config NVME_MULTIPATH
> + bool "NVMe multipath support"
> + depends on NVME_CORE
> + ---help---
> +This option enables support for multipath access to NVMe
> +subsystems.  If this option is enabled only a single
> +/dev/nvneXnY device will show up for each NVMe namespaces,

Minor typo: should be /dev/nvmeXnY.

Otherwise, everything in the series looks good to me and testing on my
dual port devices hasn't found any problems.

For the whole series:

Reviewed-by: Keith Busch 


[PATCH 4/7] nvme: implement multipath access to nvme subsystems

2017-11-09 Thread Christoph Hellwig
This patch adds native multipath support to the nvme driver.  For each
namespace we create only a single block device node, which can be used
to access that namespace through any of the controllers that refer to it.
The gendisk for each controller's path to the namespace still exists
inside the kernel, but is hidden from userspace.  The character device
nodes are still available on a per-controller basis.  A new link from
the sysfs directory for the subsystem allows finding all controllers
for a given subsystem.

Currently we will always send I/O to the first available path; this will
be changed once the NVMe Asynchronous Namespace Access (ANA) TP is
ratified and implemented, at which point we will look at the ANA state
for each namespace.  Another possibility that was prototyped is to
use the path that is closest to the submitting NUMA node, which will be
mostly interesting for PCI, but might also be useful for RDMA or FC
transports in the future.  There is no plan to implement round robin
or I/O service time path selectors, as those are not scalable with
the performance rates provided by NVMe.

The multipath device will go away once all paths to it disappear;
any delay to keep it alive needs to be implemented at the controller
level.
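
A rough sketch of the "first available path" policy described above (my
illustration; field names are approximate, and the real nvme_find_path()
in this patch additionally caches the current path and runs under SRCU):

	list_for_each_entry_rcu(ns, &head->list, siblings) {
		if (ns->ctrl->state == NVME_CTRL_LIVE)
			return ns;	/* send I/O down this path */
	}
	return NULL;			/* no usable path right now */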

Signed-off-by: Christoph Hellwig 
---
 drivers/nvme/host/Kconfig |   9 ++
 drivers/nvme/host/Makefile|   1 +
 drivers/nvme/host/core.c  | 133 +++---
 drivers/nvme/host/multipath.c | 255 ++
 drivers/nvme/host/nvme.h  |  57 ++
 5 files changed, 440 insertions(+), 15 deletions(-)
 create mode 100644 drivers/nvme/host/multipath.c

diff --git a/drivers/nvme/host/Kconfig b/drivers/nvme/host/Kconfig
index 46d6cb1e03bd..45886800a1df 100644
--- a/drivers/nvme/host/Kconfig
+++ b/drivers/nvme/host/Kconfig
@@ -13,6 +13,15 @@ config BLK_DEV_NVME
  To compile this driver as a module, choose M here: the
  module will be called nvme.
 
+config NVME_MULTIPATH
+   bool "NVMe multipath support"
+   depends on NVME_CORE
+   ---help---
+  This option enables support for multipath access to NVMe
+  subsystems.  If this option is enabled only a single
+  /dev/nvneXnY device will show up for each NVMe namespaces,
+  even if it is accessible through multiple controllers.
+
 config NVME_FABRICS
tristate
 
diff --git a/drivers/nvme/host/Makefile b/drivers/nvme/host/Makefile
index cc0aacb4c8b4..b856f2f549cd 100644
--- a/drivers/nvme/host/Makefile
+++ b/drivers/nvme/host/Makefile
@@ -5,6 +5,7 @@ obj-$(CONFIG_NVME_RDMA) += nvme-rdma.o
 obj-$(CONFIG_NVME_FC)  += nvme-fc.o
 
 nvme-core-y:= core.o
+nvme-core-$(CONFIG_NVME_MULTIPATH) += multipath.o
 nvme-core-$(CONFIG_NVM)+= lightnvm.o
 
 nvme-y += pci.o
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f4d3c160d9b1..a65449741650 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -185,17 +185,22 @@ static inline bool nvme_req_needs_retry(struct request 
*req)
return false;
if (nvme_req(req)->retries >= nvme_max_retries)
return false;
-   if (blk_queue_dying(req->q))
-   return false;
return true;
 }
 
 void nvme_complete_rq(struct request *req)
 {
if (unlikely(nvme_req(req)->status && nvme_req_needs_retry(req))) {
-   nvme_req(req)->retries++;
-   blk_mq_requeue_request(req, true);
-   return;
+   if (nvme_req_needs_failover(req)) {
+   nvme_failover_req(req);
+   return;
+   }
+
+   if (!blk_queue_dying(req->q)) {
+   nvme_req(req)->retries++;
+   blk_mq_requeue_request(req, true);
+   return;
+   }
}
 
blk_mq_end_request(req, nvme_error_status(req));
@@ -286,7 +291,8 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
ctrl->state = new_state;
 
spin_unlock_irqrestore(&ctrl->lock, flags);
-
+   if (changed && ctrl->state == NVME_CTRL_LIVE)
+   nvme_kick_requeue_lists(ctrl);
return changed;
 }
 EXPORT_SYMBOL_GPL(nvme_change_ctrl_state);
@@ -296,6 +302,7 @@ static void nvme_free_ns_head(struct kref *ref)
struct nvme_ns_head *head =
container_of(ref, struct nvme_ns_head, ref);
 
+   nvme_mpath_remove_disk(head);
ida_simple_remove(&head->subsys->ns_ida, head->instance);
list_del_init(&head->entry);
cleanup_srcu_struct(&head->srcu);
@@ -1137,11 +1144,33 @@ static int nvme_user_cmd(struct nvme_ctrl *ctrl, struct 
nvme_ns *ns,
return status;
 }
 
-static int nvme_ioctl(struct block_device *bdev, fmode_t mode,
-   unsigned int cmd, unsigned long arg)
+/*
+ * Issue ioctl requests on 

[PATCH 5/7] nvme: also expose the namespace identification sysfs files for mpath nodes

2017-11-09 Thread Christoph Hellwig
We do this by adding a helper that returns the ns_head for a device that
can belong to either the per-controller or per-subsystem block device
nodes, and otherwise reuse all the existing code.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Keith Busch 
Reviewed-by: Sagi Grimberg 
Reviewed-by: Johannes Thumshirn 
---
 drivers/nvme/host/core.c  | 57 +++
 drivers/nvme/host/multipath.c |  6 +
 drivers/nvme/host/nvme.h  |  1 +
 3 files changed, 38 insertions(+), 26 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a65449741650..fab4c46e4e50 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2381,12 +2381,22 @@ static ssize_t nvme_sysfs_rescan(struct device *dev,
 }
 static DEVICE_ATTR(rescan_controller, S_IWUSR, NULL, nvme_sysfs_rescan);
 
+static inline struct nvme_ns_head *dev_to_ns_head(struct device *dev)
+{
+   struct gendisk *disk = dev_to_disk(dev);
+
+   if (disk->fops == &nvme_fops)
+   return nvme_get_ns_from_dev(dev)->head;
+   else
+   return disk->private_data;
+}
+
 static ssize_t wwid_show(struct device *dev, struct device_attribute *attr,
-   char *buf)
+   char *buf)
 {
-   struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-   struct nvme_ns_ids *ids = &ns->head->ids;
-   struct nvme_subsystem *subsys = ns->ctrl->subsys;
+   struct nvme_ns_head *head = dev_to_ns_head(dev);
+   struct nvme_ns_ids *ids = &head->ids;
+   struct nvme_subsystem *subsys = head->subsys;
int serial_len = sizeof(subsys->serial);
int model_len = sizeof(subsys->model);
 
@@ -2408,23 +2418,21 @@ static ssize_t wwid_show(struct device *dev, struct 
device_attribute *attr,
 
return sprintf(buf, "nvme.%04x-%*phN-%*phN-%08x\n", subsys->vendor_id,
serial_len, subsys->serial, model_len, subsys->model,
-   ns->head->ns_id);
+   head->ns_id);
 }
 static DEVICE_ATTR(wwid, S_IRUGO, wwid_show, NULL);
 
 static ssize_t nguid_show(struct device *dev, struct device_attribute *attr,
- char *buf)
+   char *buf)
 {
-   struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-   return sprintf(buf, "%pU\n", ns->head->ids.nguid);
+   return sprintf(buf, "%pU\n", dev_to_ns_head(dev)->ids.nguid);
 }
 static DEVICE_ATTR(nguid, S_IRUGO, nguid_show, NULL);
 
 static ssize_t uuid_show(struct device *dev, struct device_attribute *attr,
-   char *buf)
+   char *buf)
 {
-   struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-   struct nvme_ns_ids *ids = &ns->head->ids;
+   struct nvme_ns_ids *ids = &dev_to_ns_head(dev)->ids;
 
/* For backward compatibility expose the NGUID to userspace if
 * we have no UUID set
@@ -2439,22 +2447,20 @@ static ssize_t uuid_show(struct device *dev, struct 
device_attribute *attr,
 static DEVICE_ATTR(uuid, S_IRUGO, uuid_show, NULL);
 
 static ssize_t eui_show(struct device *dev, struct device_attribute *attr,
-   char *buf)
+   char *buf)
 {
-   struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-   return sprintf(buf, "%8ph\n", ns->head->ids.eui64);
+   return sprintf(buf, "%8ph\n", dev_to_ns_head(dev)->ids.eui64);
 }
 static DEVICE_ATTR(eui, S_IRUGO, eui_show, NULL);
 
 static ssize_t nsid_show(struct device *dev, struct device_attribute *attr,
-   char *buf)
+   char *buf)
 {
-   struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-   return sprintf(buf, "%d\n", ns->head->ns_id);
+   return sprintf(buf, "%d\n", dev_to_ns_head(dev)->ns_id);
 }
 static DEVICE_ATTR(nsid, S_IRUGO, nsid_show, NULL);
 
-static struct attribute *nvme_ns_attrs[] = {
+static struct attribute *nvme_ns_id_attrs[] = {
&dev_attr_wwid.attr,
&dev_attr_uuid.attr,
&dev_attr_nguid.attr,
@@ -2463,12 +2469,11 @@ static struct attribute *nvme_ns_attrs[] = {
NULL,
 };
 
-static umode_t nvme_ns_attrs_are_visible(struct kobject *kobj,
+static umode_t nvme_ns_id_attrs_are_visible(struct kobject *kobj,
struct attribute *a, int n)
 {
struct device *dev = container_of(kobj, struct device, kobj);
-   struct nvme_ns *ns = nvme_get_ns_from_dev(dev);
-   struct nvme_ns_ids *ids = &ns->head->ids;
+   struct nvme_ns_ids *ids = &dev_to_ns_head(dev)->ids;
 
if (a == &dev_attr_uuid.attr) {
if (uuid_is_null(&ids->uuid) ||
@@ -2486,9 +2491,9 @@ static umode_t nvme_ns_attrs_are_visible(struct kobject 
*kobj,
return a->mode;
 }
 
-static const struct attribute_group nvme_ns_attr_group = {
-   .attrs  = nvme_ns_attrs,
-   .is_visible = 

[PATCH 3/7] nvme: track shared namespaces

2017-11-09 Thread Christoph Hellwig
Introduce a new struct nvme_ns_head that holds information about an actual
namespace, unlike struct nvme_ns, which only holds the per-controller
namespace information.  For private namespaces there is a 1:1 relation of
the two, but for shared namespaces this lets us discover all the paths to
it.  For now only the identifiers are moved to the new structure, but most
of the information in struct nvme_ns should eventually move over.

To allow lockless path lookup the list of nvme_ns structures per
nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
structure through call_srcu.
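
For illustration, the read side this enables looks roughly like the
following (not the patch's exact code; the submission path may sleep,
which is why SRCU rather than plain RCU is needed):

	struct nvme_ns *ns;
	int srcu_idx;

	srcu_idx = srcu_read_lock(&head->srcu);
	ns = list_first_or_null_rcu(&head->list, struct nvme_ns, siblings);
	if (ns) {
		/* ... issue I/O through this path; this may block ... */
	}
	srcu_read_unlock(&head->srcu, srcu_idx);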

Signed-off-by: Christoph Hellwig 
Reviewed-by: Keith Busch 
Reviewed-by: Javier González 
Reviewed-by: Sagi Grimberg 
Reviewed-by: Johannes Thumshirn 
Reviewed-by: Hannes Reinecke 
---
 drivers/nvme/host/core.c | 220 +++
 drivers/nvme/host/lightnvm.c |  14 +--
 drivers/nvme/host/nvme.h |  26 -
 3 files changed, 210 insertions(+), 50 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 61b7c91213e7..f4d3c160d9b1 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -291,6 +291,22 @@ bool nvme_change_ctrl_state(struct nvme_ctrl *ctrl,
 }
 EXPORT_SYMBOL_GPL(nvme_change_ctrl_state);
 
+static void nvme_free_ns_head(struct kref *ref)
+{
+   struct nvme_ns_head *head =
+   container_of(ref, struct nvme_ns_head, ref);
+
+   ida_simple_remove(&head->subsys->ns_ida, head->instance);
+   list_del_init(&head->entry);
+   cleanup_srcu_struct(&head->srcu);
+   kfree(head);
+}
+
+static void nvme_put_ns_head(struct nvme_ns_head *head)
+{
+   kref_put(&head->ref, nvme_free_ns_head);
+}
+
 static void nvme_free_ns(struct kref *kref)
 {
struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
@@ -299,7 +315,7 @@ static void nvme_free_ns(struct kref *kref)
nvme_nvm_unregister(ns);
 
put_disk(ns->disk);
-   ida_simple_remove(&ns->ctrl->ns_ida, ns->instance);
+   nvme_put_ns_head(ns->head);
nvme_put_ctrl(ns->ctrl);
kfree(ns);
 }
@@ -435,7 +451,7 @@ static inline void nvme_setup_flush(struct nvme_ns *ns,
 {
memset(cmnd, 0, sizeof(*cmnd));
cmnd->common.opcode = nvme_cmd_flush;
-   cmnd->common.nsid = cpu_to_le32(ns->ns_id);
+   cmnd->common.nsid = cpu_to_le32(ns->head->ns_id);
 }
 
 static blk_status_t nvme_setup_discard(struct nvme_ns *ns, struct request *req,
@@ -466,7 +482,7 @@ static blk_status_t nvme_setup_discard(struct nvme_ns *ns, 
struct request *req,
 
memset(cmnd, 0, sizeof(*cmnd));
cmnd->dsm.opcode = nvme_cmd_dsm;
-   cmnd->dsm.nsid = cpu_to_le32(ns->ns_id);
+   cmnd->dsm.nsid = cpu_to_le32(ns->head->ns_id);
cmnd->dsm.nr = cpu_to_le32(segments - 1);
cmnd->dsm.attributes = cpu_to_le32(NVME_DSMGMT_AD);
 
@@ -495,7 +511,7 @@ static inline blk_status_t nvme_setup_rw(struct nvme_ns *ns,
 
memset(cmnd, 0, sizeof(*cmnd));
cmnd->rw.opcode = (rq_data_dir(req) ? nvme_cmd_write : nvme_cmd_read);
-   cmnd->rw.nsid = cpu_to_le32(ns->ns_id);
+   cmnd->rw.nsid = cpu_to_le32(ns->head->ns_id);
cmnd->rw.slba = cpu_to_le64(nvme_block_nr(ns, blk_rq_pos(req)));
cmnd->rw.length = cpu_to_le16((blk_rq_bytes(req) >> ns->lba_shift) - 1);
 
@@ -986,7 +1002,7 @@ static int nvme_submit_io(struct nvme_ns *ns, struct 
nvme_user_io __user *uio)
memset(&c, 0, sizeof(c));
c.rw.opcode = io.opcode;
c.rw.flags = io.flags;
-   c.rw.nsid = cpu_to_le32(ns->ns_id);
+   c.rw.nsid = cpu_to_le32(ns->head->ns_id);
c.rw.slba = cpu_to_le64(io.slba);
c.rw.length = cpu_to_le16(io.nblocks);
c.rw.control = cpu_to_le16(io.control);
@@ -1129,7 +1145,7 @@ static int nvme_ioctl(struct block_device *bdev, fmode_t 
mode,
switch (cmd) {
case NVME_IOCTL_ID:
force_successful_syscall_return();
-   return ns->ns_id;
+   return ns->head->ns_id;
case NVME_IOCTL_ADMIN_CMD:
return nvme_user_cmd(ns->ctrl, NULL, (void __user *)arg);
case NVME_IOCTL_IO_CMD:
@@ -1250,6 +1266,13 @@ static void nvme_report_ns_ids(struct nvme_ctrl *ctrl, 
unsigned int nsid,
}
 }
 
+static bool nvme_ns_ids_valid(struct nvme_ns_ids *ids)
+{
+   return !uuid_is_null(&ids->uuid) ||
+   memchr_inv(ids->nguid, 0, sizeof(ids->nguid)) ||
+   memchr_inv(ids->eui64, 0, sizeof(ids->eui64));
+}
+
 static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b)
 {
return uuid_equal(&a->uuid, &b->uuid) &&
@@ -1320,7 +1343,7 @@ static int nvme_revalidate_disk(struct gendisk *disk)
return -ENODEV;
}
 
-   id = nvme_identify_ns(ctrl, ns->ns_id);
+   id = nvme_identify_ns(ctrl, ns->head->ns_id);
if (!id)
return 

nvme multipath support V7

2017-11-09 Thread Christoph Hellwig
Hi all,

this series adds support for multipathing, that is accessing nvme
namespaces through multiple controllers to the nvme core driver.

I think we are pretty much done, with very few changes in
the last reposts.  Unless I hear objections I plan to send this
to Jens tomorrow with the remaining NVMe updates.

It is a very thin and efficient implementation that relies on
close cooperation with other bits of the nvme driver, and a few small
and simple block helpers.

Compared to dm-multipath the important differences are how management
of the paths is done, and how the I/O path works.

Management of the paths is fully integrated into the nvme driver,
for each newly found nvme controller we check if there are other
controllers that refer to the same subsystem, and if so we link them
up in the nvme driver.  Then for each namespace found we check if
the namespace id and identifiers match to check if we have multiple
controllers that refer to the same namespaces.  For now path
availability is based entirely on the controller status, which at
least for fabrics will be continuously updated based on the mandatory
keep alive timer.  Once the Asynchronous Namespace Access (ANA)
proposal passes in NVMe we will also get per-namespace states in
addition to that, but for now any details of that remain confidential
to NVMe members.

The I/O path is very different from the existing multipath drivers,
which is enabled by the fact that NVMe (unlike SCSI) does not support
partial completions - a controller will either complete a whole
command or not, but never only complete parts of it.  Because of that
there is no need to clone bios or requests - the I/O path simply
redirects the I/O to a suitable path.  For successful commands
multipath is not in the completion stack at all.  For failed commands
we decide if the error could be a path failure, and if yes remove
the bios from the request structure and requeue them before completing
the request.  All together this means there is no performance
degradation compared to normal nvme operation when using the multipath
device node (at least not until I find a dual ported DRAM backed
device :))
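
In code terms the failover described above is roughly the following (my
sketch; the series' nvme_failover_req() differs in detail and the
requeue_* names are approximations):

	unsigned long flags;

	/* on a path error: take the bios off the failed request ... */
	spin_lock_irqsave(&ns->head->requeue_lock, flags);
	blk_steal_bios(&ns->head->requeue_list, req);
	spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
	blk_mq_end_request(req, BLK_STS_OK);

	/* ... and let a worker resubmit them through the multipath node */
	kblockd_schedule_work(&ns->head->requeue_work);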

A git tree is available at:

   git://git.infradead.org/users/hch/block.git nvme-mpath

gitweb:

   http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/nvme-mpath

Changes since V6:
  - added slaves/holder directories (Hannes)
  - trivial rebase on top of the latest nvme-4.15 changes

Changes since V5:
  - dropped various prep patches merged into the nvme tree
  - removed a leftover debug printk
  - small cleanups to blk_steal_bio
  - add a sysfs attribute for hidden gendisks
  - don't complete cancelled requests if another path is available
  - use the instance index for device naming
  - make the multipath code optional at compile time
  - don't use the multipath node for single-controller subsystems
  - add a module_option to disable multipathing (temporarily)
   
Changes since V4:
  - add a refcount to release the device in struct nvme_subsystem
  - use the instance to name the nvme_subsystems in sysfs
  - remove a NULL check before nvme_put_ns_head
  - take a ns_head reference in ->open
  - improve open protection for GENHD_FL_HIDDEN
  - add poll support for the mpath device

Changes since V3:
  - new block layer support for hidden gendisks
  - a couple new patches to refactor device handling before the
actual multipath support
  - don't expose per-controller block device nodes
  - use /dev/nvmeXnZ as the device nodes for the whole subsystem.
  - expose subsystems in sysfs (Hannes Reinecke)
  - fix a subsystem leak when duplicate NQNs are found
  - fix up some names
  - don't clear current_path if freeing a different namespace

Changes since V2:
  - don't create duplicate subsystems on reset (Keith Bush)
  - free requests properly when failing over in I/O completion (Keith Bush)
  - new devices names: /dev/nvm-sub%dn%d
  - expose the namespace identification sysfs files for the mpath nodes

Changes since V1:
  - introduce new nvme_ns_ids structure to clean up identifier handling
  - generic_make_request_fast is now named direct_make_request and calls
generic_make_request_checks
  - reset bi_disk on resubmission
  - create sysfs links between the existing nvme namespace block devices and
the new shared mpath device
  - temporarily added the timeout patches from James, this should go into
nvme-4.14, though


[PATCH 7/7] nvme: create 'slaves' and 'holders' entries for hidden controllers

2017-11-09 Thread Christoph Hellwig
From: Hannes Reinecke 

When creating nvme multipath devices we should populate the 'slaves' and
'holders' directories properly to aid userspace topology detection.

Signed-off-by: Hannes Reinecke 
[hch: split from a larger patch, compile fix for disable multipath code]
Signed-off-by: Christoph Hellwig 
---
 drivers/nvme/host/core.c  |  2 ++
 drivers/nvme/host/multipath.c | 30 ++
 drivers/nvme/host/nvme.h  |  8 
 3 files changed, 40 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index fab4c46e4e50..f52919f8f4dc 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2879,6 +2879,7 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, 
unsigned nsid)
 
if (new)
nvme_mpath_add_disk(ns->head);
+   nvme_mpath_add_disk_links(ns);
return;
  out_unlink_ns:
mutex_lock(>subsys->lock);
@@ -2902,6 +2903,7 @@ static void nvme_ns_remove(struct nvme_ns *ns)
if (ns->disk && ns->disk->flags & GENHD_FL_UP) {
if (blk_get_integrity(ns->disk))
blk_integrity_unregister(ns->disk);
+   nvme_mpath_remove_disk_links(ns);
sysfs_remove_group(&disk_to_dev(ns->disk)->kobj,
&nvme_ns_id_attr_group);
if (ns->ndev)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 850275896e49..78d92151a904 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -245,6 +245,25 @@ void nvme_mpath_add_disk(struct nvme_ns_head *head)
head->disk->disk_name);
 }
 
+void nvme_mpath_add_disk_links(struct nvme_ns *ns)
+{
+   struct kobject *slave_disk_kobj, *holder_disk_kobj;
+
+   if (!ns->head->disk)
+   return;
+
+   slave_disk_kobj = &disk_to_dev(ns->disk)->kobj;
+   if (sysfs_create_link(ns->head->disk->slave_dir, slave_disk_kobj,
+   kobject_name(slave_disk_kobj)))
+   return;
+
+   holder_disk_kobj = &disk_to_dev(ns->head->disk)->kobj;
+   if (sysfs_create_link(ns->disk->part0.holder_dir, holder_disk_kobj,
+   kobject_name(holder_disk_kobj)))
+   sysfs_remove_link(ns->head->disk->slave_dir,
+   kobject_name(slave_disk_kobj));
+}
+
 void nvme_mpath_remove_disk(struct nvme_ns_head *head)
 {
if (!head->disk)
@@ -259,3 +278,14 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head)
blk_cleanup_queue(head->disk->queue);
put_disk(head->disk);
 }
+
+void nvme_mpath_remove_disk_links(struct nvme_ns *ns)
+{
+   if (!ns->head->disk)
+   return;
+
+   sysfs_remove_link(ns->disk->part0.holder_dir,
+   kobject_name(&disk_to_dev(ns->head->disk)->kobj));
+   sysfs_remove_link(ns->head->disk->slave_dir,
+   kobject_name(&disk_to_dev(ns->disk)->kobj));
+}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 598525d9c125..0df31a2c7e0a 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -404,7 +404,9 @@ bool nvme_req_needs_failover(struct request *req);
 void nvme_kick_requeue_lists(struct nvme_ctrl *ctrl);
 int nvme_mpath_alloc_disk(struct nvme_ctrl *ctrl,struct nvme_ns_head *head);
 void nvme_mpath_add_disk(struct nvme_ns_head *head);
+void nvme_mpath_add_disk_links(struct nvme_ns *ns);
 void nvme_mpath_remove_disk(struct nvme_ns_head *head);
+void nvme_mpath_remove_disk_links(struct nvme_ns *ns);
 
 static inline void nvme_mpath_clear_current_path(struct nvme_ns *ns)
 {
@@ -436,6 +438,12 @@ static inline void nvme_mpath_add_disk(struct nvme_ns_head 
*head)
 static inline void nvme_mpath_remove_disk(struct nvme_ns_head *head)
 {
 }
+static inline void nvme_mpath_add_disk_links(struct nvme_ns *ns)
+{
+}
+static inline void nvme_mpath_remove_disk_links(struct nvme_ns *ns)
+{
+}
 static inline void nvme_mpath_clear_current_path(struct nvme_ns *ns)
 {
 }
-- 
2.14.2



[PATCH 2/7] nvme: introduce a nvme_ns_ids structure

2017-11-09 Thread Christoph Hellwig
This allows us to manage the various unique namespace identifiers
together instead of needing various variables and arguments.
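
For reference, the structure introduced here groups the three identifiers;
the nvme.h hunk is cut off in this archive, so here is a sketch of its
shape:

	struct nvme_ns_ids {
		u8	eui64[8];
		u8	nguid[16];
		uuid_t	uuid;
	};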

Signed-off-by: Christoph Hellwig 
Reviewed-by: Keith Busch 
Reviewed-by: Sagi Grimberg 
Reviewed-by: Hannes Reinecke 
---
 drivers/nvme/host/core.c | 69 +++-
 drivers/nvme/host/nvme.h | 14 +++---
 2 files changed, 49 insertions(+), 34 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index cb6cf7d64343..61b7c91213e7 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -796,7 +796,7 @@ static int nvme_identify_ctrl(struct nvme_ctrl *dev, struct 
nvme_id_ctrl **id)
 }
 
 static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, unsigned nsid,
-   u8 *eui64, u8 *nguid, uuid_t *uuid)
+   struct nvme_ns_ids *ids)
 {
struct nvme_command c = { };
int status;
@@ -832,7 +832,7 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, 
unsigned nsid,
goto free_data;
}
len = NVME_NIDT_EUI64_LEN;
-   memcpy(eui64, data + pos + sizeof(*cur), len);
+   memcpy(ids->eui64, data + pos + sizeof(*cur), len);
break;
case NVME_NIDT_NGUID:
if (cur->nidl != NVME_NIDT_NGUID_LEN) {
@@ -842,7 +842,7 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, 
unsigned nsid,
goto free_data;
}
len = NVME_NIDT_NGUID_LEN;
-   memcpy(nguid, data + pos + sizeof(*cur), len);
+   memcpy(ids->nguid, data + pos + sizeof(*cur), len);
break;
case NVME_NIDT_UUID:
if (cur->nidl != NVME_NIDT_UUID_LEN) {
@@ -852,7 +852,7 @@ static int nvme_identify_ns_descs(struct nvme_ctrl *ctrl, 
unsigned nsid,
goto free_data;
}
len = NVME_NIDT_UUID_LEN;
-   uuid_copy(uuid, data + pos + sizeof(*cur));
+   uuid_copy(&ids->uuid, data + pos + sizeof(*cur));
break;
default:
/* Skip unnkown types */
@@ -1232,22 +1232,31 @@ static void nvme_config_discard(struct nvme_ctrl *ctrl,
 }
 
 static void nvme_report_ns_ids(struct nvme_ctrl *ctrl, unsigned int nsid,
-   struct nvme_id_ns *id, u8 *eui64, u8 *nguid, uuid_t *uuid)
+   struct nvme_id_ns *id, struct nvme_ns_ids *ids)
 {
+   memset(ids, 0, sizeof(*ids));
+
if (ctrl->vs >= NVME_VS(1, 1, 0))
-   memcpy(eui64, id->eui64, sizeof(id->eui64));
+   memcpy(ids->eui64, id->eui64, sizeof(id->eui64));
if (ctrl->vs >= NVME_VS(1, 2, 0))
-   memcpy(nguid, id->nguid, sizeof(id->nguid));
+   memcpy(ids->nguid, id->nguid, sizeof(id->nguid));
if (ctrl->vs >= NVME_VS(1, 3, 0)) {
 /* Don't treat error as fatal we potentially
  * already have a NGUID or EUI-64
  */
-   if (nvme_identify_ns_descs(ctrl, nsid, eui64, nguid, uuid))
+   if (nvme_identify_ns_descs(ctrl, nsid, ids))
dev_warn(ctrl->device,
 "%s: Identify Descriptors failed\n", __func__);
}
 }
 
+static bool nvme_ns_ids_equal(struct nvme_ns_ids *a, struct nvme_ns_ids *b)
+{
+   return uuid_equal(&a->uuid, &b->uuid) &&
+   memcmp(&a->nguid, &b->nguid, sizeof(a->nguid)) == 0 &&
+   memcmp(&a->eui64, &b->eui64, sizeof(a->eui64)) == 0;
+}
+
 static void nvme_update_disk_info(struct gendisk *disk,
struct nvme_ns *ns, struct nvme_id_ns *id)
 {
@@ -1303,8 +1312,7 @@ static int nvme_revalidate_disk(struct gendisk *disk)
struct nvme_ns *ns = disk->private_data;
struct nvme_ctrl *ctrl = ns->ctrl;
struct nvme_id_ns *id;
-   u8 eui64[8] = { 0 }, nguid[16] = { 0 };
-   uuid_t uuid = uuid_null;
+   struct nvme_ns_ids ids;
int ret = 0;
 
if (test_bit(NVME_NS_DEAD, >flags)) {
@@ -1321,10 +1329,8 @@ static int nvme_revalidate_disk(struct gendisk *disk)
goto out;
}
 
-   nvme_report_ns_ids(ctrl, ns->ns_id, id, eui64, nguid, &uuid);
-   if (!uuid_equal(&ns->uuid, &uuid) ||
-   memcmp(&ns->nguid, &nguid, sizeof(ns->nguid)) ||
-   memcmp(&ns->eui, &eui64, sizeof(ns->eui))) {
+   nvme_report_ns_ids(ctrl, ns->ns_id, id, &ids);
+   if (!nvme_ns_ids_equal(&ns->ids, &ids)) {
dev_err(ctrl->device,
"identifiers changed for nsid %d\n", ns->ns_id);
ret = -ENODEV;
@@ -2265,18 +2271,19 @@ static ssize_t wwid_show(struct device *dev, struct 
device_attribute *attr,
   

[PATCH 6/7] block: create 'slaves' and 'holders' entries for hidden gendisks

2017-11-09 Thread Christoph Hellwig
From: Hannes Reinecke 

When creating nvme multipath devices we should populate the 'slaves' and
'holders' directories properly to aid userspace topology detection.

Signed-off-by: Hannes Reinecke 
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig 
---
 block/genhd.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 835e907e6e01..3de1671631bf 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -585,14 +585,14 @@ static void register_disk(struct device *parent, struct 
gendisk *disk)
 */
pm_runtime_set_memalloc_noio(ddev, true);
 
+   disk->part0.holder_dir = kobject_create_and_add("holders", &ddev->kobj);
+   disk->slave_dir = kobject_create_and_add("slaves", &ddev->kobj);
+
if (disk->flags & GENHD_FL_HIDDEN) {
dev_set_uevent_suppress(ddev, 0);
return;
}
 
-   disk->part0.holder_dir = kobject_create_and_add("holders", &ddev->kobj);
-   disk->slave_dir = kobject_create_and_add("slaves", &ddev->kobj);
-
/* No minors to use for partitions */
if (!disk_part_scan_enabled(disk))
goto exit;
@@ -728,11 +728,11 @@ void del_gendisk(struct gendisk *disk)
WARN_ON(1);
}
 
-   if (!(disk->flags & GENHD_FL_HIDDEN)) {
+   if (!(disk->flags & GENHD_FL_HIDDEN))
blk_unregister_region(disk_devt(disk), disk->minors);
-   kobject_put(disk->part0.holder_dir);
-   kobject_put(disk->slave_dir);
-   }
+
+   kobject_put(disk->part0.holder_dir);
+   kobject_put(disk->slave_dir);
 
part_stat_set_all(&disk->part0, 0);
disk->part0.stamp = 0;
-- 
2.14.2



Re: [PATCH v11 0/7] block, scsi, md: Improve suspend and resume

2017-11-09 Thread Oleksandr Natalenko
Then,

Reported-by: Oleksandr Natalenko 
Tested-by: Oleksandr Natalenko 

On čtvrtek 9. listopadu 2017 17:55:58 CET Jens Axboe wrote:
> On 11/09/2017 09:54 AM, Bart Van Assche wrote:
> > On Thu, 2017-11-09 at 07:16 +0100, Oleksandr Natalenko wrote:
> >> is this something known to you, or it is just my fault applying this
> >> series to v4.13? Except having this warning, suspend/resume works for
> >> me:
> >> 
> >> [   27.383846] sd 0:0:0:0: [sda] Starting disk
> >> [   27.383976] sd 1:0:0:0: [sdb] Starting disk
> >> [   27.451218] sdb: Attempt to allocate non-preempt request in
> >> preempt-only
> >> mode.
> >> [   27.459640] [ cut here ]
> >> [   27.464521] WARNING: CPU: 0 PID: 172 at block/blk-core.c:823
> >> blk_queue_enter+0x222/0x280> 
> > Hello Oleksandr,
> > 
> > Thanks for the testing. The warning that you reported is expected. Maybe
> > it
> > should be left out to avoid that users get confused.
> 
> If the warning is expected, then it should be removed. It'll just confuse
> users and cause useless bug reports.
> 
> > Jens, this patch series still applies cleanly on top of your
> > for-4.15/block
> > branch. Are you fine with this patch series or do you perhaps want me to
> > repost it with Oleksandr's Tested-by tag added to each patch?
> 
> Since you need to kill the warning anyway, let's get it respun.




Re: [PATCH v11 0/7] block, scsi, md: Improve suspend and resume

2017-11-09 Thread Jens Axboe
On 11/09/2017 09:54 AM, Bart Van Assche wrote:
> On Thu, 2017-11-09 at 07:16 +0100, Oleksandr Natalenko wrote:
>> is this something known to you, or it is just my fault applying this series 
>> to 
>> v4.13? Except having this warning, suspend/resume works for me:
>>
>> [   27.383846] sd 0:0:0:0: [sda] Starting disk
>> [   27.383976] sd 1:0:0:0: [sdb] Starting disk
>> [   27.451218] sdb: Attempt to allocate non-preempt request in preempt-only 
>> mode.
>> [   27.459640] [ cut here ]
>> [   27.464521] WARNING: CPU: 0 PID: 172 at block/blk-core.c:823 
>> blk_queue_enter+0x222/0x280
> 
> Hello Oleksandr,
> 
> Thanks for the testing. The warning that you reported is expected. Maybe it
> should be left out to avoid that users get confused.

If the warning is expected, then it should be removed. It'll just confuse
users and cause useless bug reports.

> Jens, this patch series still applies cleanly on top of your for-4.15/block
> branch. Are you fine with this patch series or do you perhaps want me to
> repost it with Oleksandr's Tested-by tag added to each patch?

Since you need to kill the warning anyway, let's get it respun.

-- 
Jens Axboe



Re: [PATCH v11 0/7] block, scsi, md: Improve suspend and resume

2017-11-09 Thread Bart Van Assche
On Thu, 2017-11-09 at 07:16 +0100, Oleksandr Natalenko wrote:
> is this something known to you, or it is just my fault applying this series 
> to 
> v4.13? Except having this warning, suspend/resume works for me:
> 
> [   27.383846] sd 0:0:0:0: [sda] Starting disk
> [   27.383976] sd 1:0:0:0: [sdb] Starting disk
> [   27.451218] sdb: Attempt to allocate non-preempt request in preempt-only 
> mode.
> [   27.459640] [ cut here ]
> [   27.464521] WARNING: CPU: 0 PID: 172 at block/blk-core.c:823 
> blk_queue_enter+0x222/0x280

Hello Oleksandr,

Thanks for the testing. The warning that you reported is expected. Maybe it
should be left out to avoid confusing users.

Jens, this patch series still applies cleanly on top of your for-4.15/block
branch. Are you fine with this patch series or do you perhaps want me to
repost it with Oleksandr's Tested-by tag added to each patch?

Thanks,

Bart.

Re: [PATCH] blk-mq: fix issue with shared tag queue re-running

2017-11-09 Thread Jens Axboe
On 11/09/2017 08:30 AM, Jens Axboe wrote:
> On 11/09/2017 03:00 AM, Ming Lei wrote:
>> On Thu, Nov 09, 2017 at 11:41:40AM +0800, Ming Lei wrote:
>>> On Wed, Nov 08, 2017 at 03:48:51PM -0700, Jens Axboe wrote:
 This patch attempts to make the case of hctx re-running on driver tag
 failure more robust. Without this patch, it's pretty easy to trigger a
 stall condition with shared tags. An example is using null_blk like
 this:

 modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 submit_queues=1 
 hw_queue_depth=1

 which sets up 4 devices, sharing the same tag set with a depth of 1.
 Running a fio job ala:

 [global]
 bs=4k
 rw=randread
 norandommap
 direct=1
 ioengine=libaio
 iodepth=4

 [nullb0]
 filename=/dev/nullb0
 [nullb1]
 filename=/dev/nullb1
 [nullb2]
 filename=/dev/nullb2
 [nullb3]
 filename=/dev/nullb3

 will inevitably end with one or more threads being stuck waiting for a
 scheduler tag. That IO is then stuck forever, until someone else
 triggers a run of the queue.

 Ensure that we always re-run the hardware queue, if the driver tag we
 were waiting for got freed before we added our leftover request entries
 back on the dispatch list.

 Signed-off-by: Jens Axboe 

 ---

 diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
 index 7f4a1ba532af..bb7f08415203 100644
 --- a/block/blk-mq-debugfs.c
 +++ b/block/blk-mq-debugfs.c
 @@ -179,7 +179,6 @@ static const char *const hctx_state_name[] = {
HCTX_STATE_NAME(STOPPED),
HCTX_STATE_NAME(TAG_ACTIVE),
HCTX_STATE_NAME(SCHED_RESTART),
 -  HCTX_STATE_NAME(TAG_WAITING),
HCTX_STATE_NAME(START_ON_RUN),
  };
  #undef HCTX_STATE_NAME
 diff --git a/block/blk-mq.c b/block/blk-mq.c
 index 3d759bb8a5bb..8dc5db40df9d 100644
 --- a/block/blk-mq.c
 +++ b/block/blk-mq.c
 @@ -998,49 +998,64 @@ bool blk_mq_get_driver_tag(struct request *rq, 
 struct blk_mq_hw_ctx **hctx,
return rq->tag != -1;
  }
  
 -static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode, 
 int flags,
 -  void *key)
 +static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
 +  int flags, void *key)
  {
struct blk_mq_hw_ctx *hctx;
  
hctx = container_of(wait, struct blk_mq_hw_ctx, dispatch_wait);
  
 -  list_del(>entry);
 -  clear_bit_unlock(BLK_MQ_S_TAG_WAITING, >state);
 +  list_del_init(>entry);
blk_mq_run_hw_queue(hctx, true);
return 1;
  }
  
 -static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx)
 +static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx **hctx,
 +   struct request *rq)
  {
 +  struct blk_mq_hw_ctx *this_hctx = *hctx;
 +  wait_queue_entry_t *wait = _hctx->dispatch_wait;
struct sbq_wait_state *ws;
  
 +  if (!list_empty_careful(>entry))
 +  return false;
 +
 +  spin_lock(_hctx->lock);
 +  if (!list_empty(>entry)) {
 +  spin_unlock(_hctx->lock);
 +  return false;
 +  }
 +
 +  ws = bt_wait_ptr(_hctx->tags->bitmap_tags, this_hctx);
 +  add_wait_queue(>wait, wait);
 +
/*
 -   * The TAG_WAITING bit serves as a lock protecting hctx->dispatch_wait.
 -   * The thread which wins the race to grab this bit adds the hardware
 -   * queue to the wait queue.
 +   * It's possible that a tag was freed in the window between the
 +   * allocation failure and adding the hardware queue to the wait
 +   * queue.
 */
 -  if (test_bit(BLK_MQ_S_TAG_WAITING, >state) ||
 -  test_and_set_bit_lock(BLK_MQ_S_TAG_WAITING, >state))
 +  if (!blk_mq_get_driver_tag(rq, hctx, false)) {
 +  spin_unlock(_hctx->lock);
return false;
 -
 -  init_waitqueue_func_entry(>dispatch_wait, blk_mq_dispatch_wake);
 -  ws = bt_wait_ptr(>tags->bitmap_tags, hctx);
 +  }
  
/*
 -   * As soon as this returns, it's no longer safe to fiddle with
 -   * hctx->dispatch_wait, since a completion can wake up the wait queue
 -   * and unlock the bit.
 +   * We got a tag, remove outselves from the wait queue to ensure
 +   * someone else gets the wakeup.
 */
 -  add_wait_queue(>wait, >dispatch_wait);
 +  spin_lock_irq(>wait.lock);
 +  list_del_init(>entry);
 +  spin_unlock_irq(>wait.lock);
 +  spin_unlock(_hctx->lock);
return true;
  }
  
  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head 
 *list,
 -  bool got_budget)
 +   bool got_budget)
  {
struct blk_mq_hw_ctx *hctx;
struct request *rq, 

Re: [PATCH 3/5] nvme: track shared namespaces

2017-11-09 Thread Sagi Grimberg



To allow lockless path lookup the list of nvme_ns structures per
nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
structure through call_srcu.


Can you remind me why isn't rcu sufficient? Can looking up a
path (ns from head->list) block?


blk_mq_make_request can block.


Oh I see, so can you explain why srcu should cover direct_make_request?

What if we were to do:
--
rcu_read_lock();
ns = nvme_find_path(head);
rcu_read_unlock();
if (likely(ns)) {
...
--


That way we'd have to take and release a namespace reference for every I/O.
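
For context, the alternative being described would look roughly like this
(sketch only): an extra reference per I/O so the namespace cannot go away
once the RCU read section ends:

	rcu_read_lock();
	ns = nvme_find_path(head);
	if (ns && !kref_get_unless_zero(&ns->kref))
		ns = NULL;
	rcu_read_unlock();

	if (likely(ns)) {
		ret = direct_make_request(bio);	/* may block, outside RCU */
		nvme_put_ns(ns);		/* drop the per-I/O reference */
	}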


Yea,

I'm good,

Reviewed-by: Sagi Grimberg 


Re: kernel 4.14 how to install Mellanox connectx-3 infiniband driver?

2017-11-09 Thread Max Gurtovoy



On 11/9/2017 5:20 PM, Tony Yang wrote:

Hi, All
I downloaded the nvme with multipath kernel, The kernel version is
4.14, I encountered a problem, I use Mellanox connectx-3 infiniband
driver. Because the 4.14 kernel version is too new to install
infiniband driver, does anyone encounter with me The same situation?
How to solve? Thank you for your help.



Are you talking about MLNX_OFED? What do you mean by "infiniband driver"?


Re: kernel 4.14 how to install Mellanox connectx-3 infiniband driver?

2017-11-09 Thread Jack Wang
Tony,

2017-11-09 16:20 GMT+01:00 Tony Yang :
> Hi, All
>I downloaded the nvme with multipath kernel, The kernel version is
> 4.14, I encountered a problem, I use Mellanox connectx-3 infiniband
> driver. Because the 4.14 kernel version is too new to install
> infiniband driver, does anyone encounter with me The same situation?
> How to solve? Thank you for your help.
>
As Sagi also said, there is a driver in tree.

You need to build your kernel with CONFIG_MLX4_INFINIBAND enabled.
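
For reference, the relevant in-tree options are roughly (a minimal
illustration; module vs. built-in is up to you):

	CONFIG_MLX4_CORE=m
	CONFIG_MLX4_INFINIBAND=m
	CONFIG_NVME_RDMA=m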

Jack


Re: kernel 4.14 how to install Mellanox connectx-3 infiniband driver?

2017-11-09 Thread Sagi Grimberg

Hi Tony,


Hi, All
I downloaded the nvme with multipath kernel, The kernel version is
4.14, I encountered a problem, I use Mellanox connectx-3 infiniband
driver. Because the 4.14 kernel version is too new to install
infiniband driver, does anyone encounter with me The same situation?
How to solve? Thank you for your help.


Is there any reason why you're using Mellanox out-of-tree driver?
I'm pretty sure nvme-rdma will work just fine with the in-kernel
mlx4 driver (unless you need some other functionality that does not
exist in 4.14).


Re: [PATCH V13 03/10] mmc: block: Add blk-mq support

2017-11-09 Thread Linus Walleij
On Thu, Nov 9, 2017 at 11:42 AM, Adrian Hunter  wrote:
> On 08/11/17 10:54, Linus Walleij wrote:
>> On Fri, Nov 3, 2017 at 2:20 PM, Adrian Hunter  
>> wrote:

>> At least you could do what I did and break out a helper like
>> this:
>>
>> /*
>>  * This reports status back to the block layer for a finished request.
>>  */
>> static void mmc_blk_complete(struct mmc_queue_req *mq_rq,
>> blk_status_t status)
>> {
>>struct request *req = mmc_queue_req_to_req(mq_rq);
>>
>>if (req->mq_ctx) {
>>   blk_mq_end_request(req, status);
>>} else
>>   blk_end_request_all(req, status);
>> }
>
> You are quibbling.

Hm wiktionary "quibble" points to:
https://en.wikipedia.org/wiki/Trivial_objections

Sorry for not being a native speaker in English, I think that
maybe you were again trying to be snarky to a reviewer but
I honestly don't know.

> It makes next to no difference especially as it all goes
> away when the legacy code is deleted.

Which is something your patch set doesn't do. Nor did you write
anywhere that deleting the legacy code was your intention.
But I'm happy to hear it.

>  I will change it in the next
> revision, but what a waste of everyone's time.
>  Please try to focus on
> things that matter.

As Jean-Paul Sartre said "hell is other people".
Can you try to be friendlier anyway?

>> This retry and error handling using requeue is very elegant.
>> I really like this.
>>
>> If we could also go for MQ-only, only this nice code
>> remains in the tree.
>
> No one has ever suggested that the legacy API will remain.  Once blk-mq is
> ready the old code gets deleted.

The block layer maintainers most definitely think MQ is ready
but you seem to disagree. Why?

In the recent LWN article from kernel recepies:
https://lwn.net/Articles/735275/
the only obstacle I can see is a mention that SCSI was not
switched over by default because of some problems with
slow rotating media. "This change had to be backed out
because of some performance and scalability issues that
arose for rotating storage."

Christoph mentions that he's gonna try again for v4.14.
But it's true, I do not see that happening in linux-next yet.

But that is specifically rotating media, which MMC/SD is not.
And UBI has been using it for ages without complaints. And
that is a sign of something.

Have you seen any problems when deploying it on MMC/SD,
really? If there are problems I bet the block layer people want
us to find them, diagnose them and ultimately fix them rather
than wait for them to do it. I haven't seen any performance or
scalability issues. At one point I had processes running
on two eMMC and one SD-card and apart from maxing out
the CPUs and DMA backplane, as could be expected, all was
fine.

>> The problem: you have just reimplemented the whole error
>> handling we had in the old block layer and now we have to
>> maintain two copies and keep them in sync.
>>
>> This is not OK IMO, we will inevitably screw it up, so we
>> need to get *one* error path.
>
> Wow, you really didn't read the code at all.

Just quit this snarky attitude.

> As I have repeatedly pointed
> out, the new code is all new.  There is no overlap and there nothing to keep
> in sync.

The problem is that you have not clearly communicated your
vision to delete the old code. The best way to communicate that
would be to include a patch actually deleting it.

>  It may not look like it in this patch, but that is only because of
> the ridiculous idea of splitting up the patch.

Naming other people's review feedback as "ridiculous" is not very
helpful. I think the patch set became much easier to review
like this so I am happy with this format.

>>> +static void mmc_blk_mq_acct_req_done(struct mmc_queue *mq, struct request 
>>> *req)
>>
>> What does "acct" mean in the above function name?
>> Accounting? Actual? I'm lost.
>
> Does "actual" have two "c"'s.  You are just making things up.

Please stop your snarky and belittling attitude.

I am not a native English speaker and I am not making up
my review comments in order to annoy you.

Consider Hanlon's razor:
https://en.wikipedia.org/wiki/Hanlon%27s_razor
"Never attribute to malice that which is adequately explained by
stupidity".

It's bad enough that you repeatedly point out how stupid
you think I am. I am sorry about that; in my opinion sometimes
other people are really stupid but more often than not the problem
is really on your side, like being impatient and unwilling to educate
others about how your code actually works because it seems
so evident to yourself that you think it should be evident to everyone
else as well. I don't know what to say about that, it seems a bit
unempathetic.

But you even go and think I am making stuff up in purpose.
That's pretty offensive.

> Of course it is "account".

This is maybe obvious for you but it was not to me.

>  It is counting the number of requests in flight - which is
> 

Re: [PATCH 4/5] nvme: implement multipath access to nvme subsystems

2017-11-09 Thread Christoph Hellwig
On Thu, Nov 09, 2017 at 04:44:32PM +0100, Hannes Reinecke wrote:
> - We don't have the topology information in sysfs;

We have all the topology information in sysfs, but you seem to look
for the wrong thing.

> while the namespace
> device has the 'slaves' and 'holders' directories, they remain empty,
> and the path devices don't even have those directories. I really would
> like to see them populated to help things like dracut figuring out the
> topology when building up a list of modules to include.

Of course there aren't, because there is no block device you can hold
for the controller.  

> - The patch doesn't integrate with the 'claim' mechanism for block
> devices, ie device-mapper might accidentally stumble upon it when
> traversing devices.

Same as above.  There is no block device to be claimed.

> I'll be sending two patches to resurrect the 'bd_link_disk_holder'
> idea I posted earlier; that should take care of these issues.

It doesn't.  There is no block device you can hold, so there is no
point in claiming it.


Re: [PATCH 4/5] nvme: implement multipath access to nvme subsystems

2017-11-09 Thread Hannes Reinecke
On 11/02/2017 07:30 PM, Christoph Hellwig wrote:
> This patch adds native multipath support to the nvme driver.  For each
> namespace we create only a single block device node, which can be used
> to access that namespace through any of the controllers that refer to it.
> The gendisk for each controller's path to the namespace still exists
> inside the kernel, but is hidden from userspace.  The character device
> nodes are still available on a per-controller basis.  A new link from
> the sysfs directory for the subsystem allows finding all controllers
> for a given subsystem.
> 
> Currently we will always send I/O to the first available path, this will
> be changed once the NVMe Asynchronous Namespace Access (ANA) TP is
> ratified and implemented, at which point we will look at the ANA state
> for each namespace.  Another possibility that was prototyped is to
> use the path that is closest to the submitting NUMA node, which will be
> mostly interesting for PCI, but might also be useful for RDMA or FC
> transports in the future.  There is no plan to implement round robin
> or I/O service time path selectors, as those are not scalable with
> the performance rates provided by NVMe.
> 
> The multipath device will go away once all paths to it disappear,
> any delay to keep it alive needs to be implemented at the controller
> level.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/nvme/host/Kconfig |   9 ++
>  drivers/nvme/host/Makefile|   1 +
>  drivers/nvme/host/core.c  | 133 +++---
>  drivers/nvme/host/multipath.c | 255 
> ++
>  drivers/nvme/host/nvme.h  |  57 ++
>  5 files changed, 440 insertions(+), 15 deletions(-)
>  create mode 100644 drivers/nvme/host/multipath.c
> 
In general I'm okay with this approach, but would like to address two
things:

- We don't have the topology information in sysfs; while the namespace
device has the 'slaves' and 'holders' directories, they remain empty,
and the path devices don't even have those directories. I really would
like to see them populated to help things like dracut figuring out the
topology when building up a list of modules to include.

- The patch doesn't integrate with the 'claim' mechanism for block
devices, ie device-mapper might accidentally stumble upon it when
traversing devices.

I'll be sending two patches to resurrect the 'bd_link_disk_holder'
idea I posted earlier; that should take care of these issues.

If you're totally against having to access the block device I might be
willing to look into breaking things out, so that the nvme code just
creates the symlinks and the block-device claiming code honours the
'HIDDEN' flag.

Cheers,

Hannes
-- 
Dr. Hannes ReineckeTeamlead Storage & Networking
h...@suse.de   +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)


Re: [PATCH V13 07/10] mmc: block: blk-mq: Add support for direct completion

2017-11-09 Thread Adrian Hunter
On 09/11/17 14:34, Linus Walleij wrote:
> On Thu, Nov 9, 2017 at 8:27 AM, Adrian Hunter  wrote:
>> On 08/11/17 11:28, Linus Walleij wrote:
>>> On Fri, Nov 3, 2017 at 2:20 PM, Adrian Hunter  
>>> wrote:
>>>
 For blk-mq, add support for completing requests directly in the ->done
 callback. That means that error handling and urgent background operations
 must be handled by recovery_work in that case.

 Signed-off-by: Adrian Hunter 
>>>
>>> I tried enabling this on my MMC host (mmci) but I got weird
>>> DMA error messages when I did.
>>>
>>> I guess this has not been tested on a non-DMA-coherent
>>> system?
>>
>> I don't see what DMA-coherence has to do with anything.
>>
>> Possibilities:
>> - DMA unmapping doesn't work in an atomic context
>> - requests' DMA operations have to be synchronized with each other
> 
> So since MMCI needs the post_req() hook called with
> an error code to properly tear down any DMA operations,
> I was worried that maybe your error path is not doing this
> (passing an error code or calling in the right order).
> 
> I had a bunch of fallouts in my own patch set relating
> to that.
> 
>>> I think I might be seeing this because the .pre and .post
>>> callbacks need to be strictly sequenced, and this is
>>> maybe not taken into account here?
>>
>> I looked at mmci but that did not seem to be the case.
>>
>>> Isn't there as risk
>>> that the .post callback of the next request is called before
>>> the .post callback of the previous request has returned
>>> for example?
>>
>> Of course, the requests are treated as independent.  If the separate DMA
>> operations require synchronization, that is for the host driver to fix.
> 
> They are treated as independent by the block layer but
> it is the subsystem's duty to serialize them for the hardware.
> 
> MMCI strictly requires that pre/post hooks per request
> happen in the right order, so if you have prepared a second
> request after submitting the first, and the first fails, you have
> to back out by unpreparing the second one before unpreparing
> the first. It is also the only host driver requiring to be passed
> an error code in the last parameter to the post hook in
> order to work properly.
> 
> I think your patch set handles that nicely though, because I
> haven't seen any errors, it's just when we do this direct
> completion I see problems.

If a request gets an error, then we always do the post_req before starting
another request, so the driver can assume that the first request finished
successfully if it is asked to do post_req on the second request.  So the
driver does have enough information to keep the DMA unmapping in order if
necessary.
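
To make that contract concrete, here is a rough sketch of a host driver
->post_req() that relies on the ordering (and on the error code, as MMCI
does) to tear down and unmap DMA.  demo_dma_terminate() is a hypothetical
helper and the whole function is only an illustration, assuming the usual
<linux/mmc/host.h> and <linux/dma-mapping.h> definitions -- it is not code
from either patch set:

static void demo_post_req(struct mmc_host *host, struct mmc_request *mrq,
			  int err)
{
	struct mmc_data *data = mrq->data;

	if (!data)
		return;

	/* MMCI-style drivers need the error code to abort a DMA job that
	 * never completed (or was never started). */
	if (err)
		demo_dma_terminate(host, data);	/* hypothetical helper */

	/* Unmapping happens in the order post_req() is called, which the
	 * core keeps sane by calling post_req() for a failed request
	 * before issuing the next one. */
	dma_unmap_sg(mmc_dev(host), data->sg, data->sg_len,
		     (data->flags & MMC_DATA_WRITE) ?
		     DMA_TO_DEVICE : DMA_FROM_DEVICE);
}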


Re: [PATCH] blk-mq: fix issue with shared tag queue re-running

2017-11-09 Thread Jens Axboe
On 11/09/2017 03:00 AM, Ming Lei wrote:
> On Thu, Nov 09, 2017 at 11:41:40AM +0800, Ming Lei wrote:
>> On Wed, Nov 08, 2017 at 03:48:51PM -0700, Jens Axboe wrote:
>>> This patch attempts to make the case of hctx re-running on driver tag
>>> failure more robust. Without this patch, it's pretty easy to trigger a
>>> stall condition with shared tags. An example is using null_blk like
>>> this:
>>>
>>> modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 submit_queues=1 
>>> hw_queue_depth=1
>>>
>>> which sets up 4 devices, sharing the same tag set with a depth of 1.
>>> Running a fio job ala:
>>>
>>> [global]
>>> bs=4k
>>> rw=randread
>>> norandommap
>>> direct=1
>>> ioengine=libaio
>>> iodepth=4
>>>
>>> [nullb0]
>>> filename=/dev/nullb0
>>> [nullb1]
>>> filename=/dev/nullb1
>>> [nullb2]
>>> filename=/dev/nullb2
>>> [nullb3]
>>> filename=/dev/nullb3
>>>
>>> will inevitably end with one or more threads being stuck waiting for a
>>> scheduler tag. That IO is then stuck forever, until someone else
>>> triggers a run of the queue.
>>>
>>> Ensure that we always re-run the hardware queue, if the driver tag we
>>> were waiting for got freed before we added our leftover request entries
>>> back on the dispatch list.
>>>
>>> Signed-off-by: Jens Axboe 
>>>
>>> ---
>>>
>>> diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
>>> index 7f4a1ba532af..bb7f08415203 100644
>>> --- a/block/blk-mq-debugfs.c
>>> +++ b/block/blk-mq-debugfs.c
>>> @@ -179,7 +179,6 @@ static const char *const hctx_state_name[] = {
>>> HCTX_STATE_NAME(STOPPED),
>>> HCTX_STATE_NAME(TAG_ACTIVE),
>>> HCTX_STATE_NAME(SCHED_RESTART),
>>> -   HCTX_STATE_NAME(TAG_WAITING),
>>> HCTX_STATE_NAME(START_ON_RUN),
>>>  };
>>>  #undef HCTX_STATE_NAME
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 3d759bb8a5bb..8dc5db40df9d 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -998,49 +998,64 @@ bool blk_mq_get_driver_tag(struct request *rq, struct 
>>> blk_mq_hw_ctx **hctx,
>>> return rq->tag != -1;
>>>  }
>>>  
>>> -static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode, 
>>> int flags,
>>> -   void *key)
>>> +static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
>>> +   int flags, void *key)
>>>  {
>>> struct blk_mq_hw_ctx *hctx;
>>>  
>>> hctx = container_of(wait, struct blk_mq_hw_ctx, dispatch_wait);
>>>  
>>> -   list_del(>entry);
>>> -   clear_bit_unlock(BLK_MQ_S_TAG_WAITING, >state);
>>> +   list_del_init(>entry);
>>> blk_mq_run_hw_queue(hctx, true);
>>> return 1;
>>>  }
>>>  
>>> -static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx)
>>> +static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx **hctx,
>>> +struct request *rq)
>>>  {
>>> +   struct blk_mq_hw_ctx *this_hctx = *hctx;
>>> +   wait_queue_entry_t *wait = _hctx->dispatch_wait;
>>> struct sbq_wait_state *ws;
>>>  
>>> +   if (!list_empty_careful(>entry))
>>> +   return false;
>>> +
>>> +   spin_lock(_hctx->lock);
>>> +   if (!list_empty(>entry)) {
>>> +   spin_unlock(_hctx->lock);
>>> +   return false;
>>> +   }
>>> +
>>> +   ws = bt_wait_ptr(_hctx->tags->bitmap_tags, this_hctx);
>>> +   add_wait_queue(>wait, wait);
>>> +
>>> /*
>>> -* The TAG_WAITING bit serves as a lock protecting hctx->dispatch_wait.
>>> -* The thread which wins the race to grab this bit adds the hardware
>>> -* queue to the wait queue.
>>> +* It's possible that a tag was freed in the window between the
>>> +* allocation failure and adding the hardware queue to the wait
>>> +* queue.
>>>  */
>>> -   if (test_bit(BLK_MQ_S_TAG_WAITING, >state) ||
>>> -   test_and_set_bit_lock(BLK_MQ_S_TAG_WAITING, >state))
>>> +   if (!blk_mq_get_driver_tag(rq, hctx, false)) {
>>> +   spin_unlock(_hctx->lock);
>>> return false;
>>> -
>>> -   init_waitqueue_func_entry(>dispatch_wait, blk_mq_dispatch_wake);
>>> -   ws = bt_wait_ptr(>tags->bitmap_tags, hctx);
>>> +   }
>>>  
>>> /*
>>> -* As soon as this returns, it's no longer safe to fiddle with
>>> -* hctx->dispatch_wait, since a completion can wake up the wait queue
>>> -* and unlock the bit.
>>> +* We got a tag, remove outselves from the wait queue to ensure
>>> +* someone else gets the wakeup.
>>>  */
>>> -   add_wait_queue(>wait, >dispatch_wait);
>>> +   spin_lock_irq(>wait.lock);
>>> +   list_del_init(>entry);
>>> +   spin_unlock_irq(>wait.lock);
>>> +   spin_unlock(_hctx->lock);
>>> return true;
>>>  }
>>>  
>>>  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head 
>>> *list,
>>> -   bool got_budget)
>>> +bool got_budget)
>>>  {
>>> struct blk_mq_hw_ctx *hctx;
>>> struct request *rq, *nxt;
>>> +   bool no_tag = false;
>>> int errors, queued;
>>>  
>>> if (list_empty(list))
>>> @@ -1060,22 +1075,15 

Re: [PATCH V13 09/10] mmc: block: blk-mq: Stop using card_busy_detect()

2017-11-09 Thread Adrian Hunter
On 09/11/17 15:36, Ulf Hansson wrote:
> On 3 November 2017 at 14:20, Adrian Hunter  wrote:
>> card_busy_detect() doesn't set a correct timeout, and it doesn't take care
>> of error status bits. Stop using it for blk-mq.
> 
> I think this changelog isn't very descriptive. Could you please work
> on that for the next version?

Ok

> 
>>
>> Signed-off-by: Adrian Hunter 
>> ---
>>  drivers/mmc/core/block.c | 117 
>> +++
>>  1 file changed, 109 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
>> index 0c29b1d8d545..5c5ff3c34313 100644
>> --- a/drivers/mmc/core/block.c
>> +++ b/drivers/mmc/core/block.c
>> @@ -1426,15 +1426,18 @@ static inline void mmc_apply_rel_rw(struct 
>> mmc_blk_request *brq,
>> }
>>  }
>>
>> -#define CMD_ERRORS \
>> -   (R1_OUT_OF_RANGE |  /* Command argument out of range */ \
>> -R1_ADDRESS_ERROR | /* Misaligned address */\
>> +#define CMD_ERRORS_EXCL_OOR\
>> +   (R1_ADDRESS_ERROR | /* Misaligned address */\
>>  R1_BLOCK_LEN_ERROR |   /* Transferred block length incorrect */\
>>  R1_WP_VIOLATION |  /* Tried to write to protected block */ \
>>  R1_CARD_ECC_FAILED |   /* Card ECC failed */   \
>>  R1_CC_ERROR |  /* Card controller error */ \
>>  R1_ERROR)  /* General/unknown error */
>>
>> +#define CMD_ERRORS \
>> +   (CMD_ERRORS_EXCL_OOR |  \
>> +R1_OUT_OF_RANGE)   /* Command argument out of range */ \
>> +
>>  static void mmc_blk_eval_resp_error(struct mmc_blk_request *brq)
>>  {
>> u32 val;
>> @@ -1951,6 +1954,95 @@ static void mmc_blk_ss_read(struct mmc_queue *mq, 
>> struct request *req)
>> mqrq->retries = MMC_NO_RETRIES;
>>  }
>>
>> +static inline bool mmc_blk_oor_valid(struct mmc_blk_request *brq)
>> +{
>> +   return !!brq->mrq.sbc;
>> +}
>> +
>> +static inline u32 mmc_blk_stop_err_bits(struct mmc_blk_request *brq)
>> +{
>> +   return mmc_blk_oor_valid(brq) ? CMD_ERRORS : CMD_ERRORS_EXCL_OOR;
>> +}
>> +
>> +static inline bool mmc_blk_in_tran_state(u32 status)
>> +{
>> +   /*
>> +* Some cards mishandle the status bits, so make sure to check both 
>> the
>> +* busy indication and the card state.
>> +*/
>> +   return status & R1_READY_FOR_DATA &&
>> +  (R1_CURRENT_STATE(status) == R1_STATE_TRAN);
>> +}
>> +
>> +static unsigned int mmc_blk_clock_khz(struct mmc_host *host)
>> +{
>> +   if (host->actual_clock)
>> +   return host->actual_clock / 1000;
>> +
>> +   /* Clock may be subject to a divisor, fudge it by a factor of 2. */
>> +   if (host->ios.clock)
>> +   return host->ios.clock / 2000;
>> +
>> +   /* How can there be no clock */
>> +   WARN_ON_ONCE(1);
>> +   return 100; /* 100 kHz is minimum possible value */
>> +}
>> +
>> +static unsigned long mmc_blk_data_timeout_jiffies(struct mmc_host *host,
>> + struct mmc_data *data)
>> +{
>> +   unsigned int ms = DIV_ROUND_UP(data->timeout_ns, 100);
>> +   unsigned int khz;
>> +
>> +   if (data->timeout_clks) {
>> +   khz = mmc_blk_clock_khz(host);
>> +   ms += DIV_ROUND_UP(data->timeout_clks, khz);
>> +   }
>> +
>> +   return msecs_to_jiffies(ms);
>> +}
>> +
>> +static int mmc_blk_card_stuck(struct mmc_card *card, struct request *req,
>> + u32 *resp_errs)
>> +{
>> +   struct mmc_queue_req *mqrq = req_to_mmc_queue_req(req);
>> +   struct mmc_data *data = >brq.data;
>> +   unsigned long timeout;
>> +   u32 status;
>> +   int err;
>> +
>> +   timeout = jiffies + mmc_blk_data_timeout_jiffies(card->host, data);
>> +
>> +   while (1) {
>> +   bool done = time_after(jiffies, timeout);
>> +
>> +   err = __mmc_send_status(card, , 5);
>> +   if (err) {
>> +   pr_err("%s: error %d requesting status\n",
>> +  req->rq_disk->disk_name, err);
>> +   break;
>> +   }
>> +
>> +   /* Accumulate any response error bits seen */
>> +   if (resp_errs)
>> +   *resp_errs |= status;
>> +
>> +   if (mmc_blk_in_tran_state(status))
>> +   break;
>> +
>> +   /* Timeout if the device never becomes ready */
>> +   if (done) {
>> +   pr_err("%s: Card stuck in wrong state! %s %s\n",
>> +   mmc_hostname(card->host),
>> + 

Re: kernel 4.14 how to install Mellanox connectx-3 infiniband driver?

2017-11-09 Thread Tony Yang
Hi, All
   I downloaded the kernel with NVMe multipath support; the kernel version is
4.14. I ran into a problem: I use the Mellanox ConnectX-3 InfiniBand
driver, and the 4.14 kernel is too new to install the InfiniBand
driver against. Has anyone encountered the same situation?
How can it be solved? Thank you for your help.


[PATCH 1/1] partitions/msdos: Unable to mount UFS 44bsd partitions

2017-11-09 Thread Richard Narron

UFS partitions from newer versions of FreeBSD 10 and 11 use relative addressing
for their subpartitions. But older versions of FreeBSD still use absolute
addressing just like OpenBSD and NetBSD.

Instead of simply testing for a FreeBSD partition, the code needs to also
test if the starting offset of the C subpartition is zero.

https://bugzilla.kernel.org/show_bug.cgi?id=197733

Signed-off-by: Richard Narron 
---

--- a/block/partitions/msdos.c.orig 2017-11-05 13:05:14.0 -0800
+++ b/block/partitions/msdos.c  2017-11-06 09:46:00.148228242 -0800
@@ -301,7 +301,9 @@ static void parse_bsd(struct parsed_part
continue;
bsd_start = le32_to_cpu(p->p_offset);
bsd_size = le32_to_cpu(p->p_size);
-   if (memcmp(flavour, "bsd\0", 4) == 0)
+   /* FreeBSD has relative offset if C partition offset is zero */
+   if (memcmp(flavour, "bsd\0", 4) == 0 &&
+   le32_to_cpu(l->d_partitions[2].p_offset) == 0)
bsd_start += offset;
if (offset == bsd_start && size == bsd_size)
/* full parent partition, we have it already */
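
To illustrate the distinction with made-up numbers (not taken from the bug
report):

/*
 * Suppose the MBR slice carrying the BSD disklabel starts at sector 2048
 * (so "offset" == 2048) and subpartition 'a' occupies the start of it.
 *
 * Old FreeBSD, OpenBSD, NetBSD (absolute addressing):
 *   d_partitions[0].p_offset == 2048   -> already absolute, use as-is
 *   d_partitions[2].p_offset == 2048   ('c' spans the slice, non-zero)
 *
 * Newer FreeBSD 10/11 (relative addressing):
 *   d_partitions[0].p_offset == 0      -> needs "bsd_start += offset"
 *   d_partitions[2].p_offset == 0      ('c' starting at 0 is the tell)
 *
 * That is why the offset is only added when the 'c' subpartition
 * (index 2) has a zero starting offset.
 */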


Re: [PATCH V13 05/10] mmc: cqhci: support for command queue enabled host

2017-11-09 Thread Adrian Hunter
On 09/11/17 15:41, Ulf Hansson wrote:
> On 3 November 2017 at 14:20, Adrian Hunter  wrote:
>> From: Venkat Gopalakrishnan 
>>
>> This patch adds CMDQ support for command-queue compatible
>> hosts.
>>
>> Command queueing was added in the eMMC 5.1 specification. This
>> enables the controller to process up to 32 requests at
>> a time.
>>
>> Adrian Hunter contributed renaming to cqhci, recovery, suspend
>> and resume, cqhci_off, cqhci_wait_for_idle, and external timeout
>> handling.
>>
>> Signed-off-by: Asutosh Das 
>> Signed-off-by: Sujit Reddy Thumma 
>> Signed-off-by: Konstantin Dorfman 
>> Signed-off-by: Venkat Gopalakrishnan 
>> Signed-off-by: Subhash Jadavani 
>> Signed-off-by: Ritesh Harjani 
>> Signed-off-by: Adrian Hunter 
> 
> Overall this looks good to me.
> 
> However, I didn't see MAINTAINERS being updated. Is anybody above
> volunteering to maintain cqhci.*?

I have hardware that I can use for testing so I can maintain it if no one
else wants to, however if anyone else is keen please feel free to offer too.



[PATCH] blkcg: add wrappers for struct blkcg_policy operations

2017-11-09 Thread weiping zhang
Some blkcg policies may not implement all operations in struct blkcg_policy,
so there are lots of "if (pol->xxx)" checks; add wrappers for these pol->xxx_fn callbacks.

Signed-off-by: weiping zhang 
---
 block/blk-cgroup.c | 55 +--
 include/linux/blk-cgroup.h | 72 ++
 2 files changed, 98 insertions(+), 29 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index e7ec676..e34ecb2 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -71,7 +71,7 @@ static void blkg_free(struct blkcg_gq *blkg)
 
for (i = 0; i < BLKCG_MAX_POLS; i++)
if (blkg->pd[i])
-   blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
+   blkg_pol_free_pd(blkcg_policy[i], blkg->pd[i]);
 
if (blkg->blkcg != _root)
blk_exit_rl(blkg->q, >rl);
@@ -124,7 +124,7 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, 
struct request_queue *q,
continue;
 
/* alloc per-policy data and attach it to blkg */
-   pd = pol->pd_alloc_fn(gfp_mask, q->node);
+   pd = blkg_pol_alloc_pd(pol, gfp_mask, q->node);
if (!pd)
goto err_free;
 
@@ -218,8 +218,8 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
for (i = 0; i < BLKCG_MAX_POLS; i++) {
struct blkcg_policy *pol = blkcg_policy[i];
 
-   if (blkg->pd[i] && pol->pd_init_fn)
-   pol->pd_init_fn(blkg->pd[i]);
+   if (blkg->pd[i])
+   blkg_pol_init_pd(pol, blkg->pd[i]);
}
 
/* insert */
@@ -232,8 +232,8 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg,
for (i = 0; i < BLKCG_MAX_POLS; i++) {
struct blkcg_policy *pol = blkcg_policy[i];
 
-   if (blkg->pd[i] && pol->pd_online_fn)
-   pol->pd_online_fn(blkg->pd[i]);
+   if (blkg->pd[i])
+   blkg_pol_online_pd(pol, blkg->pd[i]);
}
}
blkg->online = true;
@@ -323,8 +323,8 @@ static void blkg_destroy(struct blkcg_gq *blkg)
for (i = 0; i < BLKCG_MAX_POLS; i++) {
struct blkcg_policy *pol = blkcg_policy[i];
 
-   if (blkg->pd[i] && pol->pd_offline_fn)
-   pol->pd_offline_fn(blkg->pd[i]);
+   if (blkg->pd[i])
+   blkg_pol_offline_pd(pol, blkg->pd[i]);
}
 
if (parent) {
@@ -457,8 +457,8 @@ static int blkcg_reset_stats(struct cgroup_subsys_state 
*css,
for (i = 0; i < BLKCG_MAX_POLS; i++) {
struct blkcg_policy *pol = blkcg_policy[i];
 
-   if (blkg->pd[i] && pol->pd_reset_stats_fn)
-   pol->pd_reset_stats_fn(blkg->pd[i]);
+   if (blkg->pd[i])
+   blkg_pol_reset_pd_stats(pol, blkg->pd[i]);
}
}
 
@@ -1045,7 +1045,7 @@ static void blkcg_css_free(struct cgroup_subsys_state 
*css)
 
for (i = 0; i < BLKCG_MAX_POLS; i++)
if (blkcg->cpd[i])
-   blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
+   blkcg_pol_free_cpd(blkcg_policy[i], blkcg->cpd[i]);
 
mutex_unlock(_pol_mutex);
 
@@ -1084,7 +1084,7 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
if (!pol || !pol->cpd_alloc_fn)
continue;
 
-   cpd = pol->cpd_alloc_fn(GFP_KERNEL);
+   cpd = blkcg_pol_alloc_cpd(pol, GFP_KERNEL);
if (!cpd) {
ret = ERR_PTR(-ENOMEM);
goto free_pd_blkcg;
@@ -1092,8 +1092,7 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
blkcg->cpd[i] = cpd;
cpd->blkcg = blkcg;
cpd->plid = i;
-   if (pol->cpd_init_fn)
-   pol->cpd_init_fn(cpd);
+   blkcg_pol_init_cpd(pol, cpd);
}
 
spin_lock_init(>lock);
@@ -1108,9 +1107,9 @@ blkcg_css_alloc(struct cgroup_subsys_state *parent_css)
return >css;
 
 free_pd_blkcg:
-   for (i--; i >= 0; i--)
+   while (i--)
if (blkcg->cpd[i])
-   blkcg_policy[i]->cpd_free_fn(blkcg->cpd[i]);
+   blkcg_pol_free_cpd(blkcg_policy[i], 
blkcg->cpd[pol->plid]);
 
if (blkcg != _root)
kfree(blkcg);
@@ -1246,7 +1245,7 @@ static void blkcg_bind(struct cgroup_subsys_state 
*root_css)
 
list_for_each_entry(blkcg, _blkcgs, all_blkcgs_node)
if (blkcg->cpd[pol->plid])
-   pol->cpd_bind_fn(blkcg->cpd[pol->plid]);
+   blkcg_pol_bind_cpd(pol, blkcg->cpd[pol->plid]);
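
The include/linux/blk-cgroup.h side of the diff is cut off in this archive;
presumably each wrapper just checks whether the callback is set before
calling it.  A guessed sketch of two of them (names taken from the .c hunks
above, bodies are my assumption, not the actual patch):

static inline void blkg_pol_init_pd(struct blkcg_policy *pol,
				    struct blkg_policy_data *pd)
{
	if (pol->pd_init_fn)
		pol->pd_init_fn(pd);
}

static inline void blkg_pol_offline_pd(struct blkcg_policy *pol,
				       struct blkg_policy_data *pd)
{
	if (pol->pd_offline_fn)
		pol->pd_offline_fn(pd);
}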
 

Re: [PATCH V13 05/10] mmc: cqhci: support for command queue enabled host

2017-11-09 Thread Ulf Hansson
On 3 November 2017 at 14:20, Adrian Hunter  wrote:
> From: Venkat Gopalakrishnan 
>
> This patch adds CMDQ support for command-queue compatible
> hosts.
>
> Command queue is added in eMMC-5.1 specification. This
> enables the controller to process upto 32 requests at
> a time.
>
> Adrian Hunter contributed renaming to cqhci, recovery, suspend
> and resume, cqhci_off, cqhci_wait_for_idle, and external timeout
> handling.
>
> Signed-off-by: Asutosh Das 
> Signed-off-by: Sujit Reddy Thumma 
> Signed-off-by: Konstantin Dorfman 
> Signed-off-by: Venkat Gopalakrishnan 
> Signed-off-by: Subhash Jadavani 
> Signed-off-by: Ritesh Harjani 
> Signed-off-by: Adrian Hunter 

Overall this looks good to me.

However, I didn't see MAINTAINERS being updated. Is anybody above
volunteering to maintain cqhci.*?

Kind regards
Uffe

> ---
>  drivers/mmc/host/Kconfig  |   13 +
>  drivers/mmc/host/Makefile |1 +
>  drivers/mmc/host/cqhci.c  | 1150 
> +
>  drivers/mmc/host/cqhci.h  |  240 ++
>  4 files changed, 1404 insertions(+)
>  create mode 100644 drivers/mmc/host/cqhci.c
>  create mode 100644 drivers/mmc/host/cqhci.h
>
> diff --git a/drivers/mmc/host/Kconfig b/drivers/mmc/host/Kconfig
> index 567028c9219a..3092b7085cb5 100644
> --- a/drivers/mmc/host/Kconfig
> +++ b/drivers/mmc/host/Kconfig
> @@ -857,6 +857,19 @@ config MMC_SUNXI
>   This selects support for the SD/MMC Host Controller on
>   Allwinner sunxi SoCs.
>
> +config MMC_CQHCI
> +   tristate "Command Queue Host Controller Interface support"
> +   depends on HAS_DMA
> +   help
> + This selects the Command Queue Host Controller Interface (CQHCI)
> + support present in host controllers of Qualcomm Technologies, Inc
> + amongst others.
> + This controller supports eMMC devices with command queue support.
> +
> + If you have a controller with this interface, say Y or M here.
> +
> + If unsure, say N.
> +
>  config MMC_TOSHIBA_PCI
> tristate "Toshiba Type A SD/MMC Card Interface Driver"
> depends on PCI
> diff --git a/drivers/mmc/host/Makefile b/drivers/mmc/host/Makefile
> index ab61a3e39c0b..de140e3ef402 100644
> --- a/drivers/mmc/host/Makefile
> +++ b/drivers/mmc/host/Makefile
> @@ -91,6 +91,7 @@ obj-$(CONFIG_MMC_SDHCI_ST)+= sdhci-st.o
>  obj-$(CONFIG_MMC_SDHCI_MICROCHIP_PIC32)+= sdhci-pic32.o
>  obj-$(CONFIG_MMC_SDHCI_BRCMSTB)+= sdhci-brcmstb.o
>  obj-$(CONFIG_MMC_SDHCI_OMAP)   += sdhci-omap.o
> +obj-$(CONFIG_MMC_CQHCI)+= cqhci.o
>
>  ifeq ($(CONFIG_CB710_DEBUG),y)
> CFLAGS-cb710-mmc+= -DDEBUG
> diff --git a/drivers/mmc/host/cqhci.c b/drivers/mmc/host/cqhci.c
> new file mode 100644
> index ..159270e947cf
> --- /dev/null
> +++ b/drivers/mmc/host/cqhci.c
> @@ -0,0 +1,1150 @@
> +/* Copyright (c) 2015, The Linux Foundation. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 and
> + * only version 2 as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include 
> +#include 
> +#include 
> +
> +#include "cqhci.h"
> +
> +#define DCMD_SLOT 31
> +#define NUM_SLOTS 32
> +
> +struct cqhci_slot {
> +   struct mmc_request *mrq;
> +   unsigned int flags;
> +#define CQHCI_EXTERNAL_TIMEOUT BIT(0)
> +#define CQHCI_COMPLETEDBIT(1)
> +#define CQHCI_HOST_CRC BIT(2)
> +#define CQHCI_HOST_TIMEOUT BIT(3)
> +#define CQHCI_HOST_OTHER   BIT(4)
> +};
> +
> +static inline u8 *get_desc(struct cqhci_host *cq_host, u8 tag)
> +{
> +   return cq_host->desc_base + (tag * cq_host->slot_sz);
> +}
> +
> +static inline u8 *get_link_desc(struct cqhci_host *cq_host, u8 tag)
> +{
> +   u8 *desc = get_desc(cq_host, tag);
> +
> +   return desc + cq_host->task_desc_len;
> +}
> +
> +static inline dma_addr_t get_trans_desc_dma(struct cqhci_host *cq_host, u8 
> tag)
> +{
> +   return cq_host->trans_desc_dma_base +
> +   (cq_host->mmc->max_segs * tag *
> +cq_host->trans_desc_len);
> +}
> +
> +static inline u8 *get_trans_desc(struct cqhci_host *cq_host, u8 tag)
> +{
> +   return cq_host->trans_desc_base +
> +   (cq_host->trans_desc_len * cq_host->mmc->max_segs * 

Re: [PATCH V13 06/10] mmc: sdhci-pci: Add CQHCI support for Intel GLK

2017-11-09 Thread Ulf Hansson
On 3 November 2017 at 14:20, Adrian Hunter  wrote:
> Add CQHCI initialization and implement CQHCI operations for Intel GLK.
>
> Signed-off-by: Adrian Hunter 

This looks good to me!

Kind regards
Uffe

> ---
>  drivers/mmc/host/Kconfig  |   1 +
>  drivers/mmc/host/sdhci-pci-core.c | 155 
> +-
>  2 files changed, 155 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/mmc/host/Kconfig b/drivers/mmc/host/Kconfig
> index 3092b7085cb5..2b02a9788bb6 100644
> --- a/drivers/mmc/host/Kconfig
> +++ b/drivers/mmc/host/Kconfig
> @@ -81,6 +81,7 @@ config MMC_SDHCI_BIG_ENDIAN_32BIT_BYTE_SWAPPER
>  config MMC_SDHCI_PCI
> tristate "SDHCI support on PCI bus"
> depends on MMC_SDHCI && PCI
> +   select MMC_CQHCI
> help
>   This selects the PCI Secure Digital Host Controller Interface.
>   Most controllers found today are PCI devices.
> diff --git a/drivers/mmc/host/sdhci-pci-core.c 
> b/drivers/mmc/host/sdhci-pci-core.c
> index 3e4f04fd5175..110c634cfb43 100644
> --- a/drivers/mmc/host/sdhci-pci-core.c
> +++ b/drivers/mmc/host/sdhci-pci-core.c
> @@ -30,6 +30,8 @@
>  #include 
>  #include 
>
> +#include "cqhci.h"
> +
>  #include "sdhci.h"
>  #include "sdhci-pci.h"
>
> @@ -116,6 +118,28 @@ int sdhci_pci_resume_host(struct sdhci_pci_chip *chip)
>
> return 0;
>  }
> +
> +static int sdhci_cqhci_suspend(struct sdhci_pci_chip *chip)
> +{
> +   int ret;
> +
> +   ret = cqhci_suspend(chip->slots[0]->host->mmc);
> +   if (ret)
> +   return ret;
> +
> +   return sdhci_pci_suspend_host(chip);
> +}
> +
> +static int sdhci_cqhci_resume(struct sdhci_pci_chip *chip)
> +{
> +   int ret;
> +
> +   ret = sdhci_pci_resume_host(chip);
> +   if (ret)
> +   return ret;
> +
> +   return cqhci_resume(chip->slots[0]->host->mmc);
> +}
>  #endif
>
>  #ifdef CONFIG_PM
> @@ -166,8 +190,48 @@ static int sdhci_pci_runtime_resume_host(struct 
> sdhci_pci_chip *chip)
>
> return 0;
>  }
> +
> +static int sdhci_cqhci_runtime_suspend(struct sdhci_pci_chip *chip)
> +{
> +   int ret;
> +
> +   ret = cqhci_suspend(chip->slots[0]->host->mmc);
> +   if (ret)
> +   return ret;
> +
> +   return sdhci_pci_runtime_suspend_host(chip);
> +}
> +
> +static int sdhci_cqhci_runtime_resume(struct sdhci_pci_chip *chip)
> +{
> +   int ret;
> +
> +   ret = sdhci_pci_runtime_resume_host(chip);
> +   if (ret)
> +   return ret;
> +
> +   return cqhci_resume(chip->slots[0]->host->mmc);
> +}
>  #endif
>
> +static u32 sdhci_cqhci_irq(struct sdhci_host *host, u32 intmask)
> +{
> +   int cmd_error = 0;
> +   int data_error = 0;
> +
> +   if (!sdhci_cqe_irq(host, intmask, _error, _error))
> +   return intmask;
> +
> +   cqhci_irq(host->mmc, intmask, cmd_error, data_error);
> +
> +   return 0;
> +}
> +
> +static void sdhci_pci_dumpregs(struct mmc_host *mmc)
> +{
> +   sdhci_dumpregs(mmc_priv(mmc));
> +}
> +
>  
> /*\
>   *   
> *
>   * Hardware specific quirk handling  
> *
> @@ -583,6 +647,18 @@ static void sdhci_intel_voltage_switch(struct sdhci_host 
> *host)
> .voltage_switch = sdhci_intel_voltage_switch,
>  };
>
> +static const struct sdhci_ops sdhci_intel_glk_ops = {
> +   .set_clock  = sdhci_set_clock,
> +   .set_power  = sdhci_intel_set_power,
> +   .enable_dma = sdhci_pci_enable_dma,
> +   .set_bus_width  = sdhci_set_bus_width,
> +   .reset  = sdhci_reset,
> +   .set_uhs_signaling  = sdhci_set_uhs_signaling,
> +   .hw_reset   = sdhci_pci_hw_reset,
> +   .voltage_switch = sdhci_intel_voltage_switch,
> +   .irq= sdhci_cqhci_irq,
> +};
> +
>  static void byt_read_dsm(struct sdhci_pci_slot *slot)
>  {
> struct intel_host *intel_host = sdhci_pci_priv(slot);
> @@ -612,12 +688,80 @@ static int glk_emmc_probe_slot(struct sdhci_pci_slot 
> *slot)
>  {
> int ret = byt_emmc_probe_slot(slot);
>
> +   slot->host->mmc->caps2 |= MMC_CAP2_CQE;
> +
> if (slot->chip->pdev->device != PCI_DEVICE_ID_INTEL_GLK_EMMC) {
> slot->host->mmc->caps2 |= MMC_CAP2_HS400_ES,
> slot->host->mmc_host_ops.hs400_enhanced_strobe =
> intel_hs400_enhanced_strobe;
> +   slot->host->mmc->caps2 |= MMC_CAP2_CQE_DCMD;
> +   }
> +
> +   return ret;
> +}
> +
> +static void glk_cqe_enable(struct mmc_host *mmc)
> +{
> +   struct sdhci_host *host = mmc_priv(mmc);
> +   u32 reg;
> +
> +   /*
> +* CQE gets stuck if it sees Buffer Read Enable bit set, 

Re: [PATCH V13 09/10] mmc: block: blk-mq: Stop using card_busy_detect()

2017-11-09 Thread Ulf Hansson
On 3 November 2017 at 14:20, Adrian Hunter  wrote:
> card_busy_detect() doesn't set a correct timeout, and it doesn't take care
> of error status bits. Stop using it for blk-mq.

I think this changelog isn't very descriptive. Could you please work
on that for the next version?

>
> Signed-off-by: Adrian Hunter 
> ---
>  drivers/mmc/core/block.c | 117 
> +++
>  1 file changed, 109 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
> index 0c29b1d8d545..5c5ff3c34313 100644
> --- a/drivers/mmc/core/block.c
> +++ b/drivers/mmc/core/block.c
> @@ -1426,15 +1426,18 @@ static inline void mmc_apply_rel_rw(struct 
> mmc_blk_request *brq,
> }
>  }
>
> -#define CMD_ERRORS \
> -   (R1_OUT_OF_RANGE |  /* Command argument out of range */ \
> -R1_ADDRESS_ERROR | /* Misaligned address */\
> +#define CMD_ERRORS_EXCL_OOR\
> +   (R1_ADDRESS_ERROR | /* Misaligned address */\
>  R1_BLOCK_LEN_ERROR |   /* Transferred block length incorrect */\
>  R1_WP_VIOLATION |  /* Tried to write to protected block */ \
>  R1_CARD_ECC_FAILED |   /* Card ECC failed */   \
>  R1_CC_ERROR |  /* Card controller error */ \
>  R1_ERROR)  /* General/unknown error */
>
> +#define CMD_ERRORS \
> +   (CMD_ERRORS_EXCL_OOR |  \
> +R1_OUT_OF_RANGE)   /* Command argument out of range */ \
> +
>  static void mmc_blk_eval_resp_error(struct mmc_blk_request *brq)
>  {
> u32 val;
> @@ -1951,6 +1954,95 @@ static void mmc_blk_ss_read(struct mmc_queue *mq, 
> struct request *req)
> mqrq->retries = MMC_NO_RETRIES;
>  }
>
> +static inline bool mmc_blk_oor_valid(struct mmc_blk_request *brq)
> +{
> +   return !!brq->mrq.sbc;
> +}
> +
> +static inline u32 mmc_blk_stop_err_bits(struct mmc_blk_request *brq)
> +{
> +   return mmc_blk_oor_valid(brq) ? CMD_ERRORS : CMD_ERRORS_EXCL_OOR;
> +}
> +
> +static inline bool mmc_blk_in_tran_state(u32 status)
> +{
> +   /*
> +* Some cards mishandle the status bits, so make sure to check both 
> the
> +* busy indication and the card state.
> +*/
> +   return status & R1_READY_FOR_DATA &&
> +  (R1_CURRENT_STATE(status) == R1_STATE_TRAN);
> +}
> +
> +static unsigned int mmc_blk_clock_khz(struct mmc_host *host)
> +{
> +   if (host->actual_clock)
> +   return host->actual_clock / 1000;
> +
> +   /* Clock may be subject to a divisor, fudge it by a factor of 2. */
> +   if (host->ios.clock)
> +   return host->ios.clock / 2000;
> +
> +   /* How can there be no clock */
> +   WARN_ON_ONCE(1);
> +   return 100; /* 100 kHz is minimum possible value */
> +}
> +
> +static unsigned long mmc_blk_data_timeout_jiffies(struct mmc_host *host,
> + struct mmc_data *data)
> +{
> +   unsigned int ms = DIV_ROUND_UP(data->timeout_ns, 100);
> +   unsigned int khz;
> +
> +   if (data->timeout_clks) {
> +   khz = mmc_blk_clock_khz(host);
> +   ms += DIV_ROUND_UP(data->timeout_clks, khz);
> +   }
> +
> +   return msecs_to_jiffies(ms);
> +}
> +
> +static int mmc_blk_card_stuck(struct mmc_card *card, struct request *req,
> + u32 *resp_errs)
> +{
> +   struct mmc_queue_req *mqrq = req_to_mmc_queue_req(req);
> +   struct mmc_data *data = >brq.data;
> +   unsigned long timeout;
> +   u32 status;
> +   int err;
> +
> +   timeout = jiffies + mmc_blk_data_timeout_jiffies(card->host, data);
> +
> +   while (1) {
> +   bool done = time_after(jiffies, timeout);
> +
> +   err = __mmc_send_status(card, , 5);
> +   if (err) {
> +   pr_err("%s: error %d requesting status\n",
> +  req->rq_disk->disk_name, err);
> +   break;
> +   }
> +
> +   /* Accumulate any response error bits seen */
> +   if (resp_errs)
> +   *resp_errs |= status;
> +
> +   if (mmc_blk_in_tran_state(status))
> +   break;
> +
> +   /* Timeout if the device never becomes ready */
> +   if (done) {
> +   pr_err("%s: Card stuck in wrong state! %s %s\n",
> +   mmc_hostname(card->host),
> +   req->rq_disk->disk_name, __func__);
> +   err = -ETIMEDOUT;
> +   break;
> +   }
> +   }
> +
> +   return 

Re: [PATCH V13 07/10] mmc: block: blk-mq: Add support for direct completion

2017-11-09 Thread Adrian Hunter
On 09/11/17 15:07, Ulf Hansson wrote:
> On 3 November 2017 at 14:20, Adrian Hunter  wrote:
>> For blk-mq, add support for completing requests directly in the ->done
>> callback. That means that error handling and urgent background operations
>> must be handled by recovery_work in that case.
> 
> As the mmc docs suck, I think it's important that we elaborate a bit
> more on the constraints this has on the host driver, here in the
> changelog.
> 
> Something along the lines of, "Using MMC_CAP_DIRECT_COMPLETE requires the
> host driver, when calling mmc_request_done(), to cope with the fact that its
> ->post_req() callback may be called immediately from the same context,
> etc.."

Yes, I will expand it.  It is also a stepping stone to getting to support
for issuing the next request from the ->done() callback.

> 
> Otherwise this looks good to me.
> 
> Kind regards
> Uffe
> 
>>
>> Signed-off-by: Adrian Hunter 
>> ---
>>  drivers/mmc/core/block.c | 100 
>> +--
>>  drivers/mmc/core/block.h |   1 +
>>  drivers/mmc/core/queue.c |   5 ++-
>>  drivers/mmc/core/queue.h |   6 +++
>>  include/linux/mmc/host.h |   1 +
>>  5 files changed, 101 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/mmc/core/block.c b/drivers/mmc/core/block.c
>> index e8be17152884..cbb4b35a592d 100644
>> --- a/drivers/mmc/core/block.c
>> +++ b/drivers/mmc/core/block.c
>> @@ -2086,6 +2086,22 @@ static void mmc_blk_rw_recovery(struct mmc_queue *mq, 
>> struct request *req)
>> }
>>  }
>>
>> +static inline bool mmc_blk_rq_error(struct mmc_blk_request *brq)
>> +{
>> +   mmc_blk_eval_resp_error(brq);
>> +
>> +   return brq->sbc.error || brq->cmd.error || brq->stop.error ||
>> +  brq->data.error || brq->cmd.resp[0] & CMD_ERRORS;
>> +}
>> +
>> +static inline void mmc_blk_rw_reset_success(struct mmc_queue *mq,
>> +   struct request *req)
>> +{
>> +   int type = rq_data_dir(req) == READ ? MMC_BLK_READ : MMC_BLK_WRITE;
>> +
>> +   mmc_blk_reset_success(mq->blkdata, type);
>> +}
>> +
>>  static void mmc_blk_mq_complete_rq(struct mmc_queue *mq, struct request 
>> *req)
>>  {
>> struct mmc_queue_req *mqrq = req_to_mmc_queue_req(req);
>> @@ -2167,14 +2183,43 @@ static void mmc_blk_mq_post_req(struct mmc_queue 
>> *mq, struct request *req)
>> if (host->ops->post_req)
>> host->ops->post_req(host, mrq, 0);
>>
>> -   blk_mq_complete_request(req);
>> +   /*
>> +* Block layer timeouts race with completions which means the normal
>> +* completion path cannot be used during recovery.
>> +*/
>> +   if (mq->in_recovery)
>> +   mmc_blk_mq_complete_rq(mq, req);
>> +   else
>> +   blk_mq_complete_request(req);
>>
>> mmc_blk_mq_acct_req_done(mq, req);
>>  }
>>
>> +void mmc_blk_mq_recovery(struct mmc_queue *mq)
>> +{
>> +   struct request *req = mq->recovery_req;
>> +   struct mmc_host *host = mq->card->host;
>> +   struct mmc_queue_req *mqrq = req_to_mmc_queue_req(req);
>> +
>> +   mq->recovery_req = NULL;
>> +   mq->rw_wait = false;
>> +
>> +   if (mmc_blk_rq_error(>brq)) {
>> +   mmc_retune_hold_now(host);
>> +   mmc_blk_rw_recovery(mq, req);
>> +   }
>> +
>> +   mmc_blk_urgent_bkops(mq, mqrq);
>> +
>> +   mmc_blk_mq_post_req(mq, req);
>> +}
>> +
>>  static void mmc_blk_mq_complete_prev_req(struct mmc_queue *mq,
>>  struct request **prev_req)
>>  {
>> +   if (mmc_queue_direct_complete(mq->card->host))
>> +   return;
>> +
>> mutex_lock(>complete_lock);
>>
>> if (!mq->complete_req)
>> @@ -2208,19 +2253,43 @@ static void mmc_blk_mq_req_done(struct mmc_request 
>> *mrq)
>> struct request *req = mmc_queue_req_to_req(mqrq);
>> struct request_queue *q = req->q;
>> struct mmc_queue *mq = q->queuedata;
>> +   struct mmc_host *host = mq->card->host;
>> unsigned long flags;
>> -   bool waiting;
>>
>> -   spin_lock_irqsave(q->queue_lock, flags);
>> -   mq->complete_req = req;
>> -   mq->rw_wait = false;
>> -   waiting = mq->waiting;
>> -   spin_unlock_irqrestore(q->queue_lock, flags);
>> +   if (!mmc_queue_direct_complete(host)) {
>> +   bool waiting;
>> +
>> +   spin_lock_irqsave(q->queue_lock, flags);
>> +   mq->complete_req = req;
>> +   mq->rw_wait = false;
>> +   waiting = mq->waiting;
>> +   spin_unlock_irqrestore(q->queue_lock, flags);
>> +
>> +   if (waiting)
>> +   wake_up(>wait);
>> +   else
>> +   kblockd_schedule_work(>complete_work);
>>
>> -   if (waiting)
>> +   return;
>> +   }
>> +
>> +   if (mmc_blk_rq_error(>brq) ||
>> +   

Re: [PATCH 1/5] nvme: track subsystems

2017-11-09 Thread Sagi Grimberg



Any reason to do all this before we know if we found an existing subsystem?


We'd either have to do all the initialization including the memory
allocation and ida_simple_get under nvme_subsystems_lock, or search
the list first, then allocate, then search again.

Given that the not found case is vastly more common than the found
case that doesn't seem to be worth the tradeoff.


Noted.


Re: [PATCH 3/5] nvme: track shared namespaces

2017-11-09 Thread Sagi Grimberg



To allow lockless path lookup the list of nvme_ns structures per
nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
structure through call_srcu.


Can you remind me why isn't rcu sufficient? Can looking up a
path (ns from head->list) block?


blk_mq_make_request can block.


Oh I see, so can you explain why srcu should cover direct_make_request?

What if we were to do:
--
rcu_read_lock();
ns = nvme_find_path(head);
rcu_read_unlock();
if (likely(ns)) {
...
--
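
The likely answer (my reading, not stated explicitly above) is that the
read-side section has to extend over the bio submission itself: once the
lock is dropped, the nvme_ns and its queue could be torn down while
direct_make_request() is still using them, and since submission may sleep,
sleepable SRCU is needed rather than rcu_read_lock().  A sketch of the
idea, with field and function names assumed rather than taken from the
patch:

static blk_qc_t demo_ns_head_make_request(struct request_queue *q,
					   struct bio *bio)
{
	struct nvme_ns_head *head = q->queuedata;
	blk_qc_t ret = BLK_QC_T_NONE;
	struct nvme_ns *ns;
	int srcu_idx;

	srcu_idx = srcu_read_lock(&head->srcu);
	ns = nvme_find_path(head);
	if (likely(ns)) {
		bio->bi_disk = ns->disk;
		/* May sleep, hence SRCU instead of rcu_read_lock(). */
		ret = direct_make_request(bio);
	}
	srcu_read_unlock(&head->srcu, srcu_idx);
	return ret;
}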


Re: [PATCH V13 05/10] mmc: cqhci: support for command queue enabled host

2017-11-09 Thread Adrian Hunter
On 09/11/17 14:26, Linus Walleij wrote:
> On Wed, Nov 8, 2017 at 3:14 PM, Adrian Hunter  wrote:
>> On 08/11/17 11:22, Linus Walleij wrote:
>>> On Fri, Nov 3, 2017 at 2:20 PM, Adrian Hunter  
>>> wrote:
> 
>>> (...)
> 
 +EXPORT_SYMBOL(cqhci_resume);
>>>
>>> Why would the CQE case require special suspend/resume
>>> functionality?
>>
>> Seems like a very strange question.
> 
> Please realize that patch review is partly about education.
> 
> Educating me or anyone about your patch set involves
> being humble and not seeing your peers as lesser.
> 
> Making your reviewer feel stupid by saying they ask
> "strange" or outright stupid questions is not helping
> your cause.

Please try to forgive me for being rude, it is just frustration.

> 
>> Obviously CQHCI has to be configured
>> after suspend.
> 
> Yeah. I think I misunderstood is such that:
> 
>> Also please don't confuse CQE and CQHCI.  CQHCI is an implementation of a
>> CQE.  We currently do not expect to have another implementation, but it is
>> not impossible.
> 
> OK now you educated me, see it's not that hard without
> using belittling language.
> 
>>> This seems two much like on the side CQE-silo engineering,
>>> just use the device .[runtime]_suspend/resume callbacks like
>>> everyone else, make it possible for the host to figure out
>>> if it is in CQE mode or not (I guess it should know already
>>> since cqhci .enable() has been called?) and handle it
>>> from there.
>>
>> That is how it works!  The host controller has to decide how to handle
>> suspend / resume.
> 
> OK.
> 
>> cqhci_suspend() / cqhci_resume() are helper functions that the host
>> controller can use, but doesn't have to.
> 
> OK.
> 
>>> Why would CQE hosts need special accessors and the rest
>>> of the host not need it?
>>
>> Special accessors can be used to fix up registers that don't work exactly
>> the way the standard specified.
> 
> Yeah this is fine as it is for CQHCI, I didn't get that
> part :)
> 
>>> ->enable and ->disable() for just CQE seem reasonable.
>>> But that leaves just two new ops.
>>>
>>> So why not just put .cqe_enable() and .cqe_disable()
>>> ops into mmc_host_ops as optional and be done with it?
>>
>> Ok so you are not understanding this at all.
> 
> No I did not get it. But I do now (I think).
> 
>> As a CQE implementation, CQHCI interfaces with the upper layers through the
>> CQE ops etc.
>>
>> But CQHCI also has to work with any host controller driver, so it needs an
>> interface for that, which is what cqhci_host_ops is for.  All the ops serve
>> useful purposes.
> (...)
>> The whole point is to provide a library that can work with any host controller
>> driver.  That means it must provide functions and callbacks.
> 
> OK
> 
>>> I think the above approach to put any CQE-specific callbacks
>>> directly into the struct mmc_host_ops is way more viable.
>>
>> Nothing to do with CQE.  This is CQHCI.  Please try to get the difference.
> 
> I am trying, please try to think about your language.

I strongly disapprove of being rude but sadly it seems to get results.


Re: [PATCH 1/5] nvme: track subsystems

2017-11-09 Thread Christoph Hellwig
On Thu, Nov 09, 2017 at 01:33:16PM +0200, Sagi Grimberg wrote:
> Any reason to do all this before we know if we found an existing subsystem? 

We'd either have to do all the initialization including the memory
allocation and ida_simple_get under nvme_subsystems_lock, or search
the list first, then allocate, then search again.

Given that the not found case is vastly more common than the found
case that doesn't seem to be worth the tradeoff.

> Also, can the enumeration end up with "holes" because you
> acquire a subsystem ida temporarily and free it (and another subsystem
> just came in)? Not a big issue though...

It can, just like it can for the discovery controller in fabrics.
but so far that hasn't been an issue.


Re: [PATCH V13 08/10] mmc: block: blk-mq: Separate card polling from recovery

2017-11-09 Thread Linus Walleij
On Thu, Nov 9, 2017 at 8:56 AM, Adrian Hunter  wrote:
> On 08/11/17 11:30, Linus Walleij wrote:
>> On Fri, Nov 3, 2017 at 2:20 PM, Adrian Hunter  
>> wrote:
>>
>>> Recovery is simpler to understand if it is only used for errors. Create a
>>> separate function for card polling.
>>>
>>> Signed-off-by: Adrian Hunter 
>>
>> This looks good but I can't see why it's not folded into
>> patch 3 already. This error handling is introduced there.
>
> What are you on about?

You are attacking your most valuable resource, a reviewer.

And I even said the patch looks good.

The only thing you attain with this kind of language is alienating
me and discouraging others from reviewing your patch set. You also
give your employer a bad name, since you are representing
them.

> If we're going to split up the patches (which I
> argued against - the new code is all new, so it could be read independently
> from the old mess) then this is a logically distinct step.  Polling and
> error-recovery are conceptually different things and it is important to
> separate them to make the code easier to understand.

I understand it can be tough to deal with review comments
and it can make you lose your temper when people (sometimes
even the same person!) say contradictory things.

But in hindsight, don't you think these 5 last lines of your message
had been enough without that first line?

Yours,
Linus Walleij


Re: [PATCH 3/5] nvme: track shared namespaces

2017-11-09 Thread Christoph Hellwig
On Thu, Nov 09, 2017 at 01:37:43PM +0200, Sagi Grimberg wrote:
>
>> To allow lockless path lookup the list of nvme_ns structures per
>> nvme_ns_head is protected by SRCU, which requires freeing the nvme_ns
>> structure through call_srcu.
>
> Can you remind me why isn't rcu sufficient? Can looking up a
> path (ns from head->list) block?

blk_mq_make_request can block.


Re: [PATCH V13 10/10] mmc: block: blk-mq: Stop using legacy recovery

2017-11-09 Thread Linus Walleij
On Thu, Nov 9, 2017 at 8:43 AM, Adrian Hunter  wrote:
> On 08/11/17 11:38, Linus Walleij wrote:
>> On Fri, Nov 3, 2017 at 2:20 PM, Adrian Hunter  
>> wrote:
>>
>>> There are only a few things the recovery needs to do. Primarily, it just
>>> needs to:
>>> Determine the number of bytes transferred
>>> Get the card back to transfer state
>>> Determine whether to retry
>>>
>>> There are also a couple of additional features:
>>> Reset the card before the last retry
>>> Read one sector at a time
>>>
>>> The legacy code spent much effort analyzing command errors, but commands
>>> fail fast, so it is simpler just to give all command errors the same number
>>> of retries.
>>>
>>> Signed-off-by: Adrian Hunter 
>>
>> I have nothing against the patch as such. In fact something
>> like this makes a lot of sense (to me).
>>
>> But this just makes mmc_blk_rw_recovery() look really nice.
>>
>> And leaves a very ugly mmc_blk_issue_rw_rq() with the legacy
>> error handling in-tree.
>>
>> The former function isn't even named with some *mq* infix
>> making it clear that the new recovery path only happens
>> in the MQ case.
>>
>> If newcomers read this code in the MMC stack they will
>> just tear their hair, scream and run away. Even faster than
>> before.
>>
>> How are they supposed to know which functions are used on
>> which path? Run ftrace?
>
> You're kidding me right?  You don't know how to find where a function is used?

What I mean is that there are now several functions in the same
file doing similar things, and it is pretty hard for a newcomer
already (IMO) to understand how the MMC/SD stack works.

This phenomenon of code complexity is frustrating not just me
but also you, because you get annoyed that people
like me don't "just get it".

>> This illustrates firmly why we need to refactor and/or kill off
>> the old block layer interface *first* then add MQ on top.
>
> No it doesn't!  You are playing games!

You need to stop your snarky and unfriendly attitude.

Look Adrian: you have one (1) person reviewing your patches.
That is me.

I think I need to quote Documentation/process/6.Followthrough.rst
verbatim (it's an awesome piece of text!)

8<8<---8<--

Working with reviewers
--

A patch of any significance will result in a number of comments from other
developers as they review the code.  Working with reviewers can be, for
many developers, the most intimidating part of the kernel development
process.  Life can be made much easier, though, if you keep a few things in
mind:

 - If you have explained your patch well, reviewers will understand its
   value and why you went to the trouble of writing it.  But that value
   will not keep them from asking a fundamental question: what will it be
   like to maintain a kernel with this code in it five or ten years later?
   Many of the changes you may be asked to make - from coding style tweaks
   to substantial rewrites - come from the understanding that Linux will
   still be around and under development a decade from now.

 - Code review is hard work, and it is a relatively thankless occupation;
   people remember who wrote kernel code, but there is little lasting fame
   for those who reviewed it.  So reviewers can get grumpy, especially when
   they see the same mistakes being made over and over again.  If you get a
   review which seems angry, insulting, or outright offensive, resist the
   impulse to respond in kind.  Code review is about the code, not about
   the people, and code reviewers are not attacking you personally.

 - Similarly, code reviewers are not trying to promote their employers'
   agendas at the expense of your own.  Kernel developers often expect to
   be working on the kernel years from now, but they understand that their
   employer could change.  They truly are, almost without exception,
   working toward the creation of the best kernel they can; they are not
   trying to create discomfort for their employers' competitors.

What all of this comes down to is that, when reviewers send you comments,
you need to pay attention to the technical observations that they are
making.  Do not let their form of expression or your own pride keep that
from happening.  When you get review comments on a patch, take the time to
understand what the reviewer is trying to say.  If possible, fix the things
that the reviewer is asking you to fix.  And respond back to the reviewer:
thank them, and describe how you will answer their questions.

8<8<---8<--

> One function could be named
> differently, so that is evidence the whole patch set should be ignored.

No I was not pointing to that at all.

> The old code is rubbish.  There is nothing worth keeping.  Churning it
> around is a waste of everybody's time.  Review and test the new code.
> Delete the old code.  Much much simpler!

Yeah, but 

Re: [PATCH V13 04/10] mmc: block: Add CQE support

2017-11-09 Thread Adrian Hunter
On 09/11/17 14:04, Linus Walleij wrote:
> On Wed, Nov 8, 2017 at 2:20 PM, Adrian Hunter  wrote:
>> On 08/11/17 11:00, Linus Walleij wrote:
> 
>>> This and other bits gives me the feeling CQE is now actually ONLY
>>> working on the MQ path.
>>
>> I was not allowed to support non-mq.
> 
> Fair enough.
> 
>>> That is good. We only add new functionality on the MQ path,
>>> yay!
>>>
>>> But this fact (only abailable iff MQ==true) should at least be
>>> mentioned in the commit message I think?
>>
>> Why?  CQE is MQ only.
> 
> So if you read what I say, I think the commit message should
> say that CQE is MQ only so that people know that CQE is
> MQ only.

Alright

> 
>>> So why not ditch the old block layer or at least make MQ default?
>>
>> CQE is MQ only.
> 
> Yeah? So why keep it around for everything else?

Never said we should keep it around.  As soon as blk-mq is ready and tested,
delete it.

> 
>>> When you keep it like this people have to reconfigure
>>> their kernel to enable MQ before they see the benefits of MQ+CQE
>>> combined, I think that should rather be the default experience.
>>
>> Not at all.  I guess you are confusing the legacy mmc with CQE.  CQE is not
>> a layer on top of legacy mmc.  It is an alternative to legacy mmc.  CQE
>> does not sit on top of the legacy mmc blk-mq support.  You don't have to
>> enable legacy mmc blk-mq support to use CQE.
> 
> Now I am confused. I can't parse the last sentence. There is no
> such thing as legacy blk-mq?

You don't need non-CQE mmc blk-mq support for CQE support.


Re: [PATCH V13 07/10] mmc: block: blk-mq: Add support for direct completion

2017-11-09 Thread Linus Walleij
On Thu, Nov 9, 2017 at 8:27 AM, Adrian Hunter  wrote:
> On 08/11/17 11:28, Linus Walleij wrote:
>> On Fri, Nov 3, 2017 at 2:20 PM, Adrian Hunter  
>> wrote:
>>
>>> For blk-mq, add support for completing requests directly in the ->done
>>> callback. That means that error handling and urgent background operations
>>> must be handled by recovery_work in that case.
>>>
>>> Signed-off-by: Adrian Hunter 
>>
>> I tried enabling this on my MMC host (mmci) but I got weird
>> DMA error messages when I did.
>>
>> I guess this has not been tested on a non-DMA-coherent
>> system?
>
> I don't see what DMA-coherence has to do with anything.
>
> Possibilities:
> - DMA unmapping doesn't work in an atomic context
> - requests' DMA operations have to be synchronized with each other

So since MMCI needs the post_req() hook called with
an error code to properly tear down any DMA operations,
I was worried that maybe your error path is not doing this
(passing an error code or calling in the right order).

I had a bunch of fallouts in my own patch set relating
to that.

>> I think I might be seeing this because the .pre and .post
>> callbacks need to be strictly sequenced, and this is
>> maybe not taken into account here?
>
> I looked at mmci but that did not seem to be the case.
>
>> Isn't there as risk
>> that the .post callback of the next request is called before
>> the .post callback of the previous request has returned
>> for example?
>
> Of course, the requests are treated as independent.  If the separate DMA
> operations require synchronization, that is for the host driver to fix.

They are treated as independent by the block layer but
it is the subsystem's duty to serialize them for the hardware.

MMCI strictly requires that pre/post hooks per request
happen in the right order, so if you have prepared a second
request after submitting the first, and the first fails, you have
to back out by unpreparing the second one before unpreparing
the first. It is also the only host driver requiring to be passed
an error code in the last parameter to the post hook in
order to work properly.

I think your patch set handles that nicely though, because I
haven't seen any errors, it's just when we do this direct
completion I see problems.

Yours,
Linus Walleij


Re: [PATCH V13 04/10] mmc: block: Add CQE support

2017-11-09 Thread Linus Walleij
On Wed, Nov 8, 2017 at 2:20 PM, Adrian Hunter  wrote:
> On 08/11/17 11:00, Linus Walleij wrote:

>> This and other bits gives me the feeling CQE is now actually ONLY
>> working on the MQ path.
>
> I was not allowed to support non-mq.

Fair enough.

>> That is good. We only add new functionality on the MQ path,
>> yay!
>>
>> But this fact (only abailable iff MQ==true) should at least be
>> mentioned in the commit message I think?
>
> Why?  CQE is MQ only.

So if you read what I say, I think the commit message should
say that CQE is MQ only so that people know that CQE is
MQ only.

>> So why not ditch the old block layer or at least make MQ default?
>
> CQE is MQ only.

Yeah? So why keep it around for everything else?

>> When you keep it like this people have to reconfigure
>> their kernel to enable MQ before they see the benefits of MQ+CQE
>> combined, I think that should rather be the default experience.
>
> Not at all.  I guess you are confusing the legacy mmc with CQE.  CQE is not
> a layer on top of legacy mmc.  It is an alternative to legacy mmc.  CQE
> does not sit on top of the legacy mmc blk-mq support.  You don't have to
> enable legacy mmc blk-mq support to use CQE.

Now I am confused. I can't parse the last sentence. There is no
such thing as legacy blk-mq?

Yours,
Linus Walleij


Re: [PATCH 1/5] nvme: track subsystems

2017-11-09 Thread Sagi Grimberg



+static int nvme_init_subsystem(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
+{
+   struct nvme_subsystem *subsys, *found;
+   int ret;
+
+   subsys = kzalloc(sizeof(*subsys), GFP_KERNEL);
+   if (!subsys)
+   return -ENOMEM;
+   ret = ida_simple_get(_subsystems_ida, 0, 0, GFP_KERNEL);
+   if (ret < 0) {
+   kfree(subsys);
+   return ret;
+   }
+   subsys->instance = ret;
+   mutex_init(&subsys->lock);
+   kref_init(&subsys->ref);
+   INIT_LIST_HEAD(&subsys->ctrls);
+   nvme_init_subnqn(subsys, ctrl, id);
+   memcpy(subsys->serial, id->sn, sizeof(subsys->serial));
+   memcpy(subsys->model, id->mn, sizeof(subsys->model));
+   memcpy(subsys->firmware_rev, id->fr, sizeof(subsys->firmware_rev));
+   subsys->vendor_id = le16_to_cpu(id->vid);
+   subsys->cmic = id->cmic;
+
+   subsys->dev.class = nvme_subsys_class;
+   subsys->dev.release = nvme_release_subsystem;
+   dev_set_name(&subsys->dev, "nvme-subsys%d", subsys->instance);
+   device_initialize(&subsys->dev);
+


Any reason to do all this before we know if we found an existing
subsystem? Also, can the enumeration end up with "holes" because you
acquire a subsystem ida temporarily and free it (and another subsystem
just came in)? Not a big issue though...
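
Something along these lines is what I had in mind (purely illustrative;
the lookup helper and lock names here are hypothetical, only
ida_simple_get() and the ida from your patch are real):

#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/idr.h>
#include <linux/mutex.h>

/*
 * Illustrative sketch only: do the duplicate-NQN check first and only
 * reserve an instance number for a genuinely new subsystem, so a
 * controller joining an existing subsystem never grabs and releases an
 * ida value.  nvme_find_subsystem_locked() and nvme_subsystems_lock
 * are hypothetical names.
 */
static int nvme_get_subsys_instance(struct nvme_subsystem *subsys)
{
	struct nvme_subsystem *found;
	int ret;

	mutex_lock(&nvme_subsystems_lock);
	found = nvme_find_subsystem_locked(subsys->subnqn);
	if (found) {
		mutex_unlock(&nvme_subsystems_lock);
		return -EEXIST;		/* caller reuses 'found' instead */
	}

	ret = ida_simple_get(&nvme_subsystems_ida, 0, 0, GFP_KERNEL);
	if (ret >= 0)
		subsys->instance = ret;
	mutex_unlock(&nvme_subsystems_lock);

	return ret < 0 ? ret : 0;
}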


Re: [PATCH] blk-mq: fix issue with shared tag queue re-running

2017-11-09 Thread Ming Lei
On Thu, Nov 09, 2017 at 11:41:40AM +0800, Ming Lei wrote:
> On Wed, Nov 08, 2017 at 03:48:51PM -0700, Jens Axboe wrote:
> > This patch attempts to make the case of hctx re-running on driver tag
> > failure more robust. Without this patch, it's pretty easy to trigger a
> > stall condition with shared tags. An example is using null_blk like
> > this:
> > 
> > modprobe null_blk queue_mode=2 nr_devices=4 shared_tags=1 submit_queues=1 
> > hw_queue_depth=1
> > 
> > which sets up 4 devices, sharing the same tag set with a depth of 1.
> > Running a fio job ala:
> > 
> > [global]
> > bs=4k
> > rw=randread
> > norandommap
> > direct=1
> > ioengine=libaio
> > iodepth=4
> > 
> > [nullb0]
> > filename=/dev/nullb0
> > [nullb1]
> > filename=/dev/nullb1
> > [nullb2]
> > filename=/dev/nullb2
> > [nullb3]
> > filename=/dev/nullb3
> > 
> > will inevitably end with one or more threads being stuck waiting for a
> > scheduler tag. That IO is then stuck forever, until someone else
> > triggers a run of the queue.
> > 
> > Ensure that we always re-run the hardware queue, if the driver tag we
> > were waiting for got freed before we added our leftover request entries
> > back on the dispatch list.
> > 
> > Signed-off-by: Jens Axboe 
> > 
> > ---
> > 
> > diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
> > index 7f4a1ba532af..bb7f08415203 100644
> > --- a/block/blk-mq-debugfs.c
> > +++ b/block/blk-mq-debugfs.c
> > @@ -179,7 +179,6 @@ static const char *const hctx_state_name[] = {
> > HCTX_STATE_NAME(STOPPED),
> > HCTX_STATE_NAME(TAG_ACTIVE),
> > HCTX_STATE_NAME(SCHED_RESTART),
> > -   HCTX_STATE_NAME(TAG_WAITING),
> > HCTX_STATE_NAME(START_ON_RUN),
> >  };
> >  #undef HCTX_STATE_NAME
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 3d759bb8a5bb..8dc5db40df9d 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -998,49 +998,64 @@ bool blk_mq_get_driver_tag(struct request *rq, struct 
> > blk_mq_hw_ctx **hctx,
> > return rq->tag != -1;
> >  }
> >  
> > -static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode, 
> > int flags,
> > -   void *key)
> > +static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
> > +   int flags, void *key)
> >  {
> > struct blk_mq_hw_ctx *hctx;
> >  
> > hctx = container_of(wait, struct blk_mq_hw_ctx, dispatch_wait);
> >  
> > -   list_del(&wait->entry);
> > -   clear_bit_unlock(BLK_MQ_S_TAG_WAITING, &hctx->state);
> > +   list_del_init(>entry);
> > blk_mq_run_hw_queue(hctx, true);
> > return 1;
> >  }
> >  
> > -static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx *hctx)
> > +static bool blk_mq_dispatch_wait_add(struct blk_mq_hw_ctx **hctx,
> > +struct request *rq)
> >  {
> > +   struct blk_mq_hw_ctx *this_hctx = *hctx;
> > +   wait_queue_entry_t *wait = &this_hctx->dispatch_wait;
> > struct sbq_wait_state *ws;
> >  
> > +   if (!list_empty_careful(&wait->entry))
> > +   return false;
> > +
> > +   spin_lock(&this_hctx->lock);
> > +   if (!list_empty(&wait->entry)) {
> > +   spin_unlock(&this_hctx->lock);
> > +   return false;
> > +   }
> > +
> > +   ws = bt_wait_ptr(&this_hctx->tags->bitmap_tags, this_hctx);
> > +   add_wait_queue(&ws->wait, wait);
> > +
> > /*
> > -* The TAG_WAITING bit serves as a lock protecting hctx->dispatch_wait.
> > -* The thread which wins the race to grab this bit adds the hardware
> > -* queue to the wait queue.
> > +* It's possible that a tag was freed in the window between the
> > +* allocation failure and adding the hardware queue to the wait
> > +* queue.
> >  */
> > -   if (test_bit(BLK_MQ_S_TAG_WAITING, &hctx->state) ||
> > -   test_and_set_bit_lock(BLK_MQ_S_TAG_WAITING, &hctx->state))
> > +   if (!blk_mq_get_driver_tag(rq, hctx, false)) {
> > +   spin_unlock(&this_hctx->lock);
> > return false;
> > -
> > -   init_waitqueue_func_entry(&hctx->dispatch_wait, blk_mq_dispatch_wake);
> > -   ws = bt_wait_ptr(&hctx->tags->bitmap_tags, hctx);
> > +   }
> >  
> > /*
> > -* As soon as this returns, it's no longer safe to fiddle with
> > -* hctx->dispatch_wait, since a completion can wake up the wait queue
> > -* and unlock the bit.
> > +* We got a tag, remove ourselves from the wait queue to ensure
> > +* someone else gets the wakeup.
> >  */
> > -   add_wait_queue(&ws->wait, &hctx->dispatch_wait);
> > +   spin_lock_irq(&ws->wait.lock);
> > +   list_del_init(&wait->entry);
> > +   spin_unlock_irq(&ws->wait.lock);
> > +   spin_unlock(&this_hctx->lock);
> > return true;
> >  }
> >  
> >  bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head 
> > *list,
> > -   bool got_budget)
> > +bool got_budget)
> >  {
> > struct blk_mq_hw_ctx *hctx;
> > struct request *rq, *nxt;
> > +   bool no_tag = false;
> > int errors, queued;
> >  
> > if (list_empty(list))
> > @@ -1060,22 +1075,15 @@ bool 

Re: [PATCH 4/5] nvme: implement multipath access to nvme subsystems

2017-11-09 Thread Christoph Hellwig
On Wed, Nov 08, 2017 at 12:18:32PM +0100, Hannes Reinecke wrote:
> On 11/08/2017 09:54 AM, Christoph Hellwig wrote:
> > Can I get a review for this one?  The only changes vs the previously
> > reviewed versions are that we don't use the multipath code at all for
> > subsystems that aren't multiported, and that there is an explicit
> > opt-out at compile and module load time, so it shouldn't be that hard
> > to review for those who reviewed the previous versions.
> > 
> The one thing I need to try is co-existence with dm-multipath, ie
> ensuring that dm-multipath stays out of the way if nvme multipathing is
> enabled.

For one, it will work just fine on the single node.  Second, you can
easily check it from the sysfs layout.

> 
> I hope to get it done sometime later this week
> 
> And I'll see to get the other patches reviewed.

Please do so now - you've complained about things going slow for a
merge in this merge window, and now we're only waiting for you.


Re: [General protection fault] in bio_integrity_advance

2017-11-09 Thread Yu Chen
On Tue, Nov 7, 2017 at 4:38 PM, Yu Chen  wrote:
> Hi all,
> We are using 4.13.5-100.fc25.x86_64, and a panic was hit during
> resume from hibernation; the backtrace is shown below. Would someone
> please take a look at whether this has already been fixed, or whether
> this issue is still present in the upstream kernel? Thanks!
> [  114.846213] PM: Using 3 thread(s) for decompression.
> [  114.846213] PM: Loading and decompressing image data (6555729 pages)...
> [  115.143169] PM: Image loading progress:   0%
> [  156.386990] PM: Image loading progress:  10%
> [  175.114169] PM: Image loading progress:  20%
> [  185.364073] PM: Image loading progress:  30%
> [  191.345652] PM: Image loading progress:  40%
> [  200.655883] PM: Image loading progress:  50%
> [  220.084360] PM: Image loading progress:  60%
> [  240.581079] PM: Image loading progress:  70%
> [  250.406290] general protection fault:  [#1] SMP
> [  250.411779] Modules linked in: nouveau video mxm_wmi i2c_algo_bit
> drm_kms_helper ttm drm crc32c_intel wmi
> [  250.422524] CPU: 99 PID: 0 Comm: swapper/99 Not tainted
> 4.13.5-100.fc25.x86_64 #1
> [  250.430902] Hardware name: Intel Corporation PURLEY/PURLEY, BIOS
> PLYXCRB1.86B.0521.D18.1710241520 10/24/2017
> [  250.441901] task: 97f5827c task.stack: b0e418cdc000
> [  250.448528] RIP: 0010:bio_integrity_advance+0x1a/0xf0
> [  250.454182] RSP: 0018:97f58f6c3da8 EFLAGS: 00010202
> [  250.460024] RAX: db19e5a5b91ff161 RBX: 58b38c0def2b26b8 RCX: 
> 000180400021
> [  250.468008] RDX:  RSI: 8000 RDI: 
> 97f56eb7fd20
> [  250.475993] RBP: 97f58f6c3db0 R08: 97f56d8d3600 R09: 
> 000180400021
> [  250.483976] R10: 97f58f6c3c48 R11: 000a8000 R12: 
> 8000
> [  250.491961] R13: 9739fcdfd400 R14: 000a R15: 
> 8000
> [  250.499944] FS:  () GS:97f58f6c()
> knlGS:
> [  250.508997] CS:  0010 DS:  ES:  CR0: 80050033
> [  250.515427] CR2: 565407552e40 CR3: 0115b7a67000 CR4: 
> 007406e0
> [  250.523412] DR0:  DR1:  DR2: 
> 
> [  250.533458] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [  250.543500] PKRU: 5554
> [  250.548604] Call Trace:
> [  250.553415]  <IRQ>
> [  250.557729]  bio_advance+0x28/0xf0
> [  250.563598]  blk_update_request+0x92/0x2f0
> [  250.570223]  scsi_end_request+0x37/0x1d0
> [  250.576654]  scsi_io_completion+0x20e/0x690
> [  250.583362]  ? rebalance_domains+0x160/0x2b0
> [  250.590187]  scsi_finish_command+0xd9/0x120
> [  250.596924]  scsi_softirq_done+0x125/0x140
> [  250.603562]  blk_done_softirq+0x9e/0xd0
> [  250.609916]  __do_softirq+0x10c/0x2a5
> [  250.616073]  irq_exit+0xff/0x110
> [  250.621737]  smp_call_function_single_interrupt+0x33/0x40
> [  250.629831]  call_function_single_interrupt+0x93/0xa0
> [  250.637544] RIP: 0010:cpuidle_enter_state+0x126/0x2c0
> [  250.645263] RSP: 0018:b0e418cdfe60 EFLAGS: 0246 ORIG_RAX:
> ff04
> [  250.655814] RAX:  RBX: 0002 RCX: 
> 001f
> [  250.665885] RDX: 003a4d5f2d20 RSI: ffc820eb310b RDI: 
> 
> [  250.675956] RBP: b0e418cdfe98 R08: 0176 R09: 
> 0018
> [  250.686018] R10: b0e418cdfe30 R11: 0094 R12: 
> 97f58f6e3b00
> [  250.696080] R13: b1f72a78 R14: 003a4d5f2d20 R15: 
> b1f72a60
> [  250.706123]  </IRQ>
> [  250.710547]  cpuidle_enter+0x17/0x20
> [  250.716609]  call_cpuidle+0x23/0x40
> [  250.722550]  do_idle+0x18e/0x1e0
> [  250.728177]  cpu_startup_entry+0x73/0x80
> [  250.734560]  start_secondary+0x156/0x190
> [  250.740930]  secondary_startup_64+0x9f/0x9f
> [  250.747578] Code: 01 79 cc b1 e8 09 16 ce ff 31 c0 eb e6 0f 1f 40
> 00 0f 1f 44 00 00 55 48 89 e5 53 31 db f6 47 16 01 74 04 48 8b 5f 68
> 48 8b 47 08 <48> 8b 80 80 00 00 00 48 8b 90 d0 03 00 00 48 83 ba 48 02
> 00 00
> [  250.770821] RIP: bio_integrity_advance+0x1a/0xf0 RSP: 97f58f6c3da8
> [  250.780481] ---[ end trace d7b00b76aab34156 ]---
> [  250.841521] Kernel panic - not syncing: Fatal exception in interrupt
> [  250.851158] Kernel Offset: 0x3000 from 0x8100
> (relocation range: 0x8000-0xbfff)
> [  250.912067] ---[ end Kernel panic - not syncing: Fatal exception in 
> interrupt

According to the log, the exception was triggered when trying to
access:
bio->bi_bdev->bd_disk:

03b0 :
 3b0:   e8 00 00 00 00  callq  3b5 
 ...
 3c2:   48 8b 5f 68 mov0x68(%rdi),%rbx
 3c6:   48 8b 47 08 mov0x8(%rdi),%rax

 bio->bi_bdev->bd_disk, BOOM!
 3ca:   48 8b 80 80 00 00 00mov0x80(%rax),%rax

When the exception was triggered, bio->bi_bdev was:
RAX: db19e5a5b91ff161
Besides, we can see that bio->bi_integrity was:
RBX: 58b38c0def2b26b8
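
For reference, the path in question looks roughly like this in v4.13
(paraphrased from memory rather than copied from block/bio-integrity.c,
so treat it as a sketch); the faulting load is the bd_disk lookup hidden
inside bdev_get_integrity():

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Paraphrase of the v4.13 code, annotated against the register dump above. */
void bio_integrity_advance(struct bio *bio, unsigned int bytes_done)
{
	struct bio_integrity_payload *bip = bio_integrity(bio);      /* rbx = *(bio + 0x68) */
	struct blk_integrity *bi = bdev_get_integrity(bio->bi_bdev); /* rax = bio->bi_bdev, then
								       * rax + 0x80 (bd_disk) faults */
	unsigned int bytes = bio_integrity_bytes(bi, bytes_done >> 9);

	bip->bip_iter.bi_sector += bytes_done >> 9;
	bvec_iter_advance(bip->bip_vec, &bip->bip_iter, bytes);
}

So the garbage value in RAX above is the bi_bdev pointer that was being
dereferenced when the fault hit.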