Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices

2018-05-30 Thread Damien Le Moal
Christoph,

On 5/30/18 17:47, Christoph Hellwig wrote:
> On Wed, May 30, 2018 at 06:22:04AM +, Damien Le Moal wrote:
>> That would necessitate splitting elevator_init() into a generic elevator
>> initialization function setting up the elevator related fields of the
>> request queue and a second part setting up the default elevator (e.g.
>> elevator_set_default()). Doing so, the function elevator_set_default()
>> could be called later in the device initialization sequence, after
>> information from the device has been obtained. It would make choosing a
>> sane default elevator much cleaner.
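For illustration, the split being described could look roughly like the
sketch below. This is a sketch only; the function names, signatures and
the zoned-device check are assumptions for the example, not the code in
the branch linked in the reply.

/* Part 1: generic setup of the elevator-related request queue fields. */
void elevator_init_queue(struct request_queue *q)
{
	/* bookkeeping only; no scheduler is attached at this point */
}

/*
 * Part 2: called later in the device initialization sequence, once
 * device information (such as the zone model) is known, to pick a sane
 * default scheduler.
 */
void elevator_set_default(struct request_queue *q)
{
	const char *name = blk_queue_is_zoned(q) ? "mq-deadline" : NULL;

	if (name)
		elevator_change(q, name);	/* hypothetical helper taking a scheduler name */
}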
> 
> For blk-mq this makes complete sense; untested series here:
> 
>   
> http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/elevator_init

Just reviewed and tried the series. Looks OK to me.
Feel free to post adding:
Reviewed-by: Damien Le Moal 
Tested-by: Damien Le Moal 

> For the legacy case we always need at least some I/O scheduler,
> so we can't just defer it.  But I think we could still switch to
> deadline in blk_register_queue if we found a zoned device.

Arrg... Yes, I forgot that. At least noop would be needed. So no real
point in changing anything I guess.

As for the switch to deadline once the device is identified as zoned, I
would like to see that added. That is Jens' decision though.

Best.

-- 
Damien Le Moal,
Western Digital

Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices

2018-05-30 Thread Damien Le Moal


On 5/31/18 00:52, Jens Axboe wrote:
> On 5/30/18 9:45 AM, Jeff Moyer wrote:
>> Jens Axboe  writes:
>>
>>> On 5/30/18 9:06 AM, Jeff Moyer wrote:
 Hi, Jens,

 Jens Axboe  writes:

> On 5/30/18 2:49 AM, Christoph Hellwig wrote:
>> While I really don't want drivers to change the I/O schedule themselves
>> we have a class of devices (zoned) that don't work at all with certain
>> I/O schedulers.  The kernel not choosing something sane and requiring
>> user workarounds is just silly.
>
> They work just fine for probing and reading purposes. There's absolutely
> no reason why we can't handle these special snowflakes with a udev rule.

 udev rules aren't shipped with the kernel, so it makes it hard to keep
 them in sync.  In this instance, I'm not sure anyone made an effort to
 notify distributions that a udev rule was even necessary.  (Is there a
 way of notifying distributions about kernel changes that require new
 udev rules, other than emailing each list individually?)

 From a technical standpoint, I totally agree with you, Jens.  However, I
 think the user experience sucks.  4.15 worked by default, 4.16 doesn't.
 The result will be bug reports from users (to the drive vendors,
 distribution bugzillas, here, etc.).
>>>
>>> I would imagine that most folks get their updates from a distro of some
>>> sort, in which case there's absolutely nothing stopping the distro from
>>> shipping updated rules for the 4.16 kernel update.
>>
>> The problem is distros have already shipped that kernel.  :)
> 
> Ship an update, then! I'm sure that most people would prefer a simple
> rule update over a kernel update. And you have to do one of them, to
> resolve this anyway.
> 
>>> But what's the regression? 4.15 had no zone write locking at all.
>>
>> The zone write locking was done in the sd driver prior to 4.16.  See
>> commit 39051dd85f287 ("scsi: sd: Remove zone write locking") for where
>> it was removed.  That means these devices "just worked" with all I/O
>> schedulers.
> 
> Gotcha, makes sense.
> 
 Moving on, assuming your mind is made up...

 I'm not sure how much logic should go into the udev rule.  As mentioned,
 this limitation was introduced in 4.16, and Damien has plans to lift the
 restriction in future kernels.  Because distributions tend to cherry
 pick changes, making decisions on whether a feature exists based solely
 on kernel version is usually not a great thing.  My inclination would be
 to just always force deadline for host-managed SMR drives.  These drives
 aren't that popular, after all.  Any opinions on this?
>>>
>>> The problem is that it's tied to an IO scheduler, which ends up causing
>>> issues like this, since users are free to select a different scheduler.
>>> Then things break. Granted, in this case, some extraordinarily shitty
>>> hardware even broke. That is on the hardware, not the kernel; that
>>> kind of breakage should not occur.
>>
>> If the firmware problem was widespread, I think we'd try to avoid it.  I
>> have no reason to believe that is the case, though.
>>
>> Damien made the argument that the user should be able to select an I/O
>> scheduler that doesn't perform the write locking, because a well-behaved
>> application could theoretically make use of it.  I think this is a weak
>> argument, given that dm-zoned doesn't even support such a mode.
> 
> Sure, the user should be able to select whatever they want. Maybe they
> are strictly using it through bsg or a similar interface, in which
> case no scheduler or kernel support is really needed to drive it.

dm-zoned and f2fs (with more file systems coming) use the drives through
the regular block/bio stack. The scheduler is involved and needs to be
correctly set. For applications not relying on these components and doing
raw disk I/Os, I see 2 camps in the field: the sg/bsg camp and the regular
POSIX system calls camp. For the former, since the kernel will have
minimal interaction with the commands, the application is on its own and
control over write ordering has to be coded in. But for the latter case,
which is orders of magnitude easier to use, the scheduler needs to be set
correctly, or the same pressure is on the application to "do the right
thing", which conflicts with the benefit of this access path choice
(simplicity).
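As a concrete illustration of "the scheduler needs to be set correctly"
for the POSIX path, here is a minimal sketch (not part of the original
mail; the device name is an assumption). The setting can be checked or
changed through sysfs, for example:

#include <stdio.h>

/*
 * Illustrative sketch: select the deadline scheduler for a zoned disk via
 * sysfs before driving it through the regular block layer.  Use
 * "mq-deadline" instead when the device runs on blk-mq.
 */
static int set_deadline_scheduler(const char *disk)	/* e.g. "sdb" */
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/sys/block/%s/queue/scheduler", disk);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fputs("deadline", f);
	fclose(f);
	return 0;
}

A udev rule doing the same at device discovery time is the approach
discussed elsewhere in the thread.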

In the end, if the drives are used directly from applications, I think
it is OK to only expect a correct system setting if deadline is
required. So udev is fine. But for the kernel components like dm-zoned,
a sane default being set from the start is my preferred choice.

Best regards.

-- 
Damien Le Moal,
Western Digital

Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices

2018-05-30 Thread Damien Le Moal

On 5/31/18 00:45, Jeff Moyer wrote:
> Jens Axboe  writes:
>> But what's the regression? 4.15 had no zone write locking at all.
> 
> The zone write locking was done in the sd driver prior to 4.16.  See
> commit 39051dd85f287 ("scsi: sd: Remove zone write locking") for where
> it was removed.  That means these devices "just worked" with all I/O
> schedulers.

Yes they did "just work", but that was not an ideal solution either
because of the performance implications: sequential writes to a single
zone were stalling the dispatch queue, waiting for the dispatched write
to the locked zone to complete. That was not optimal at all (sure, the
drive-side write caching was hiding this a bit, but still).

>>> Moving on, assuming your mind is made up...
>>>
>>> I'm not sure how much logic should go into the udev rule.  As mentioned,
>>> this limitation was introduced in 4.16, and Damien has plans to lift the
>>> restriction in future kernels.  Because distributions tend to cherry
>>> pick changes, making decisions on whether a feature exists based solely
>>> on kernel version is usually not a great thing.  My inclination would be
>>> to just always force deadline for host-managed SMR drives.  These drives
>>> aren't that popular, after all.  Any opinions on this?
>>
>> The problem is that it's tied to an IO scheduler, which ends up causing
>> issues like this, since users are free to select a different scheduler.
>> Then things break. Granted, in this case, some extraordinarily shitty
>> hardware even broke. That is on the hardware, not the kernel; that
>> kind of breakage should not occur.
> 
> If the firmware problem was widespread, I think we'd try to avoid it.  I
> have no reason to believe that is the case, though.

That is the first time in my career that I have heard of a disk breaking
a system BIOS. I will notify our system test lab to investigate this.
Jeff, let's take this discussion off-list since it is not kernel related.

> Damien made the argument that the user should be able to select an I/O
> scheduler that doesn't perform the write locking, because a well-behaved
> application could theoretically make use of it.  I think this is a weak
> argument, given that dm-zoned doesn't even support such a mode.

Yes, a little weak. That is definitely not the main use case I am seeing
with customers. That said, these drives are starting to be used with
other features enabled, such as I/O priorities. Considering their size,
this is a very interesting feature for controlling access latency.
Deadline and mq-deadline will not act on I/O priorities; cfq (and bfq ?)
will. Potentially better results can be achieved, but as these
schedulers do not support zone write locking, the application needs to
be careful with its per-zone write queue depth, and in doing so avoid
tripping over unintended kernel-level command reordering. I see this as
a valid enough use case to not "lock down" the scheduler to deadline
only and to allow other schedulers too. Yet, deadline should be the
default until an application asks for something else.
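To make the queue depth point concrete, a minimal sketch (not from the
thread; the zone geometry handling is an assumption) of keeping the
per-zone write queue depth at 1 through the POSIX path:

#include <unistd.h>
#include <sys/types.h>

/*
 * Illustrative only: fill one zone sequentially with a per-zone write
 * queue depth of 1.  Synchronous pwrite() naturally guarantees this,
 * since the next write is only issued after the previous one has
 * completed, so the kernel has nothing it could reorder within the zone.
 */
static int fill_zone(int fd, off_t zone_start, size_t zone_len,
		     const void *buf, size_t blksz)
{
	off_t wp = zone_start;	/* assumed to be the zone write pointer */

	while ((size_t)(wp - zone_start) < zone_len) {
		ssize_t ret = pwrite(fd, buf, blksz, wp);

		if (ret < 0)
			return -1;
		wp += ret;
	}
	return 0;
}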

dm-zoned and f2fs (btrfs in the lab too) "assume" that the underlying
stack does the right thing. That is of course true (for now) only if the
deadline scheduler is selected. A sane default set at device
initialization would be nice to have and would avoid potential headaches
with udev rule ordering with regard to component initialization (not to
mention that this would make booting from these disks possible).

> I definitely see this udev rule as a temporary workaround.

I agree. In fact I see the deadline-based zone write locking itself as a
temporary workaround. For now, I do not see any other clean method that
covers both the mq and legacy paths. Considering only mq, we discussed
interesting possibilities at LSFMM using dedicated write queues. That
could be handled generically, remove the dependency on the scheduler,
and extend the support to open-channel SSDs as well.

My guess is that no major change to this write locking will happen on
the legacy path, which hopefully will go away soon (?). But there are
options going forward with blk-mq.

>> So now we're stuck with this temporary situation which needs a work-around.
>> I don't think it's a terrible idea to have a rule that just sets
>> deadline/mq-deadline for an SMR device regardless of what kernel it is
>> running on. It'll probably never be a bad default.

I agree. But since other kernel components (dm-zoned, FSes and the
entire fs/block_dev.c direct I/O write path) depend on the scheduler
being set to something sane, setting that early in the device
initialization, before the device is grabbed by an FS or a device
mapper, would definitely be nice to have.

> 
> OK.  Barring future input to the contrary, I'll work to get updates into
> fedora, at least.  I've CC'd Colin and Hannes.  I'm not sure who else to
> include.
> 
> FYI, below is the udev rule Damien had provided to Bryan.  I'm not sure
> about the KERNEL=="sd[a-z]" bit, that may ne

Re: [PATCH 12/12] nvme: limit warnings from nvme_identify_ns

2018-05-30 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 11/12] nvme: use the changed namespaces list log to clear ns data changed AENs

2018-05-30 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 10/12] nvme: mark nvme_queue_scan static

2018-05-30 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 10/12] btrfs: convert to bioset_init()/mempool_init()

2018-05-30 Thread Chris Mason

On 20 May 2018, at 18:25, Kent Overstreet wrote:


Signed-off-by: Kent Overstreet 
---
 fs/btrfs/extent_io.c | 25 +++--
 1 file changed, 11 insertions(+), 14 deletions(-)



Assuming the rest of the comments/questions for the series get worked 
out, the btrfs side of this looks fine.


Acked-by: Chris Mason 



Re: [PATCH 08/12] nvmet: mask pending AERs

2018-05-30 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH 09/12] nvme: submit AEN event configuration on startup

2018-05-30 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 07/12] nvmet: Add AEN configuration support

2018-05-30 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 06/12] nvmet: implement the changed namespaces log

2018-05-30 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 05/12] nvmet: split log page implementation

2018-05-30 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 04/12] nvmet: add a new nvmet_zero_sgl helper

2018-05-30 Thread Sagi Grimberg




+u16 nvmet_zero_sgl(struct nvmet_req *req, off_t off, size_t len)


I would personally prefer nvmet_clear_sgl, but I am not hard-headed about it,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 03/12] nvme.h: add AER configuration symbols

2018-05-30 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 02/12] nvme.h: add the changed namespace list log

2018-05-30 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


Re: [PATCH 01/12] nvme.h: untangle AEN notice definitions

2018-05-30 Thread Sagi Grimberg

Looks good,

Reviewed-by: Sagi Grimberg 


[GIT PULL] Single fix for 4.17 final

2018-05-30 Thread Jens Axboe
Hi Linus,

Just a single fix that should make it into this release,
fixing a regression with T10-DIF on NVMe. Please pull!


  git://git.kernel.dk/linux-block.git tags/for-linus-20180530



Jens Axboe (1):
  Merge branch 'nvme-4.17' of git://git.infradead.org/nvme into for-linus

Max Gurtovoy (1):
  nvme: fix extended data LBA supported setting

 drivers/nvme/host/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

-- 
Jens Axboe



Re: [PATCH] blk-throttle: return proper bool type to caller instead of 0/1

2018-05-30 Thread Jens Axboe
On 5/29/18 4:32 AM, Chengguang Xu wrote:
> Change to return true/false only for bool type return code.

Looks good, applied.

-- 
Jens Axboe



Re: [PATCH 06/12] nvmet: implement the changed namespaces log

2018-05-30 Thread Daniel Verkamp
On 05/30/2018 09:45 AM, Christoph Hellwig wrote:
> Just keep a per-controller buffer of changed namespaces and copy it out
> in the get log page implementation.
> 
> Signed-off-by: Christoph Hellwig 
> ---
[...]
> +static void nvmet_add_to_changed_ns_log(struct nvmet_ctrl *ctrl, __le32 nsid)
> +{
> + mutex_lock(&ctrl->lock);
> + if (ctrl->nr_changed_ns < NVME_MAX_CHANGED_NAMESPACES) {

Minor nitpick here: if ctrl->nr_changed_ns is exactly
NVME_MAX_CHANGED_NAMESPACES and the new nsid is already in the list,
this will skip down to the 0xffffffff case below, even though it could
have just left the list as-is.

I don't know if this is a problem in practice; reporting 0xffffffff is
probably always safe, since the host will presumably treat that as
"rescan everything".

> + u32 i;
> +
> + for (i = 0; i < ctrl->nr_changed_ns; i++)
> + if (ctrl->changed_ns_list[i] == nsid)
> + goto out_unlock;
> + ctrl->changed_ns_list[ctrl->nr_changed_ns++] = nsid;
> + } else if (ctrl->nr_changed_ns == NVME_MAX_CHANGED_NAMESPACES) {
> + ctrl->changed_ns_list[0] = cpu_to_le32(0xffffffff);
> + ctrl->nr_changed_ns = U32_MAX;
> + }
> +out_unlock:
> + mutex_unlock(&ctrl->lock);
> +}

Thanks,
-- Daniel


Re: [PATCH] blk-mq: only iterate over inflight requests in blk_mq_tagset_busy_iter

2018-05-30 Thread Jens Axboe
On 5/30/18 10:51 AM, Christoph Hellwig wrote:
> We already check for started commands in all callbacks, but we should
> also protect against already completed commands.  Do this by taking
> the checks to common code.

Looks good, applied.

-- 
Jens Axboe



Re: [PATCH] nbd: clear DISCONNECT_REQUESTED flag once disconnection occurs.

2018-05-30 Thread Jens Axboe
On 5/30/18 10:45 AM, kvi...@gmail.com wrote:
> From: Kevin Vigor 
> 
> When a userspace client requests a NBD device be disconnected, the
> DISCONNECT_REQUESTED flag is set. While this flag is set, the driver
> will not inform userspace when a connection is closed.
> 
> Unfortunately the flag was never cleared, so once a disconnect was
> requested the driver would thereafter never tell userspace about a
> closed connection. Thus when connections failed due to timeout, no
> attempt to reconnect was made and eventually the device would fail.
> 
> Fix by clearing the DISCONNECT_REQUESTED flag (and setting the
> DISCONNECTED flag) once all connections are closed.

Applied, thanks Kevin.

-- 
Jens Axboe



Re: [PATCH] nbd: clear DISCONNECT_REQUESTED flag once disconnection occurs.

2018-05-30 Thread Josef Bacik
On Wed, May 30, 2018 at 10:45:11AM -0600, kvi...@gmail.com wrote:
> From: Kevin Vigor 
> 
> When a userspace client requests a NBD device be disconnected, the
> DISCONNECT_REQUESTED flag is set. While this flag is set, the driver
> will not inform userspace when a connection is closed.
> 
> Unfortunately the flag was never cleared, so once a disconnect was
> requested the driver would thereafter never tell userspace about a
> closed connection. Thus when connections failed due to timeout, no
> attempt to reconnect was made and eventually the device would fail.
> 
> Fix by clearing the DISCONNECT_REQUESTED flag (and setting the
> DISCONNECTED flag) once all connections are closed.
> 
> Changes relative to v1 (https://marc.info/?l=linux-block&m=152763540418902):
> 
> * remove pointless wake_up() -> wake_up_all() change.
> 

Usually you want to put the changelog after the --- bit below so we can all see
it but it doesn't end up in the git changelog.

Reviewed-by: Josef Bacik 

Thanks,

Josef


Re: [PATCH] blk-mq: only iterate over inflight requests in blk_mq_tagset_busy_iter

2018-05-30 Thread Josef Bacik
On Wed, May 30, 2018 at 06:51:00PM +0200, Christoph Hellwig wrote:
> We already check for started commands in all callbacks, but we should
> also protect against already completed commands.  Do this by taking
> the checks to common code.
> 
> Signed-off-by: Christoph Hellwig 

Acked-by: Josef Bacik 

for the nbd portion.  Thanks,

Josef


Re: [PATCH RESEND] blk-throttle: fix potential NULL pointer dereference in throtl_select_dispatch

2018-05-30 Thread Jens Axboe
On 5/29/18 2:29 AM, Liu Bo wrote:
> tg in throtl_select_dispatch is used first and only checked afterwards.
> Since tg may be NULL, there is a potential NULL pointer dereference risk.
> So fix it.

Applied, thanks.

-- 
Jens Axboe



[PATCH] blk-mq: only iterate over inflight requests in blk_mq_tagset_busy_iter

2018-05-30 Thread Christoph Hellwig
We already check for started commands in all callbacks, but we should
also protect against already completed commands.  Do this by taking
the checks to common code.

Signed-off-by: Christoph Hellwig 
---
 block/blk-mq-tag.c|  2 +-
 drivers/block/mtip32xx/mtip32xx.c | 12 ++--
 drivers/block/nbd.c   |  5 +
 drivers/nvme/host/core.c  |  3 ---
 drivers/nvme/host/fc.c|  3 ---
 5 files changed, 4 insertions(+), 21 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index a4e58fc28a06..70356a2a11ab 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -271,7 +271,7 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned 
int bitnr, void *data)
 * test and set the bit before assining ->rqs[].
 */
rq = tags->rqs[bitnr];
-   if (rq)
+   if (rq && blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT)
iter_data->fn(rq, iter_data->data, reserved);
 
return true;
diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index 95657b814543..c73626decb46 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -2725,15 +2725,11 @@ static void mtip_softirq_done_fn(struct request *rq)
blk_mq_end_request(rq, cmd->status);
 }
 
-static void mtip_abort_cmd(struct request *req, void *data,
-   bool reserved)
+static void mtip_abort_cmd(struct request *req, void *data, bool reserved)
 {
struct mtip_cmd *cmd = blk_mq_rq_to_pdu(req);
struct driver_data *dd = data;
 
-   if (!blk_mq_request_started(req))
-   return;
-
dbg_printk(MTIP_DRV_NAME " Aborting request, tag = %d\n", req->tag);
 
clear_bit(req->tag, dd->port->cmds_to_issue);
@@ -2741,14 +2737,10 @@ static void mtip_abort_cmd(struct request *req, void 
*data,
mtip_softirq_done_fn(req);
 }
 
-static void mtip_queue_cmd(struct request *req, void *data,
-   bool reserved)
+static void mtip_queue_cmd(struct request *req, void *data, bool reserved)
 {
struct driver_data *dd = data;
 
-   if (!blk_mq_request_started(req))
-   return;
-
set_bit(req->tag, dd->port->cmds_to_issue);
blk_abort_request(req);
 }
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 8860b24098bc..17d1d97d491e 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -666,11 +666,8 @@ static void recv_work(struct work_struct *work)
 
 static void nbd_clear_req(struct request *req, void *data, bool reserved)
 {
-   struct nbd_cmd *cmd;
+   struct nbd_cmd *cmd = blk_mq_rq_to_pdu(req);
 
-   if (!blk_mq_request_started(req))
-   return;
-   cmd = blk_mq_rq_to_pdu(req);
cmd->status = BLK_STS_IOERR;
blk_mq_complete_request(req);
 }
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7ad3cfc9d4e1..1744d8dcadfd 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -251,9 +251,6 @@ EXPORT_SYMBOL_GPL(nvme_complete_rq);
 
 void nvme_cancel_request(struct request *req, void *data, bool reserved)
 {
-   if (!blk_mq_request_started(req))
-   return;
-
dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
"Cancelling I/O %d", req->tag);
 
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index ac35a80f5532..0bad65803271 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2393,9 +2393,6 @@ nvme_fc_terminate_exchange(struct request *req, void 
*data, bool reserved)
struct nvme_fc_ctrl *ctrl = to_fc_ctrl(nctrl);
struct nvme_fc_fcp_op *op = blk_mq_rq_to_pdu(req);
 
-   if (!blk_mq_request_started(req))
-   return;
-
__nvme_fc_abort_op(ctrl, op);
 }
 
-- 
2.17.0



[PATCH 11/12] nvme: use the changed namespaces list log to clear ns data changed AENs

2018-05-30 Thread Christoph Hellwig
Per section 5.2 we need to issue the corresponding log page to clear an
AEN, so for a namespace data changed AEN we need to read the changed
namespace list log.  And once we read that log anyway we might as well
use it to optimize the rescan.

Signed-off-by: Christoph Hellwig 
---
 drivers/nvme/host/core.c | 51 
 drivers/nvme/host/nvme.h |  2 ++
 2 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 06cd04dcffbc..1ae77428a1a5 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3194,6 +3194,42 @@ static void nvme_scan_ns_sequential(struct nvme_ctrl 
*ctrl, unsigned nn)
nvme_remove_invalid_namespaces(ctrl, nn);
 }
 
+static bool nvme_scan_changed_ns_log(struct nvme_ctrl *ctrl)
+{
+   size_t log_size = NVME_MAX_CHANGED_NAMESPACES * sizeof(__le32);
+   __le32 *log;
+   int error, i;
+   bool ret = false;
+
+   log = kzalloc(log_size, GFP_KERNEL);
+   if (!log)
+   return false;
+
+   error = nvme_get_log(ctrl, NVME_LOG_CHANGED_NS, log, log_size);
+   if (error) {
+   dev_warn(ctrl->device,
+   "reading changed ns log failed: %d\n", error);
+   goto out_free_log;
+   }
+
+   if (log[0] == cpu_to_le32(0xffffffff))
+   goto out_free_log;
+
+   for (i = 0; i < NVME_MAX_CHANGED_NAMESPACES; i++) {
+   u32 nsid = le32_to_cpu(log[i]);
+
+   if (nsid == 0)
+   break;
+   dev_info(ctrl->device, "rescanning namespace %d.\n", nsid);
+   nvme_validate_ns(ctrl, nsid);
+   }
+   ret = true;
+
+out_free_log:
+   kfree(log);
+   return ret;
+}
+
 static void nvme_scan_work(struct work_struct *work)
 {
struct nvme_ctrl *ctrl =
@@ -3206,6 +3242,12 @@ static void nvme_scan_work(struct work_struct *work)
 
WARN_ON_ONCE(!ctrl->tagset);
 
+   if (test_and_clear_bit(EVENT_NS_CHANGED, &ctrl->events)) {
+   if (nvme_scan_changed_ns_log(ctrl))
+   goto out_sort_namespaces;
+   dev_info(ctrl->device, "rescanning namespaces.\n");
+   }
+
if (nvme_identify_ctrl(ctrl, &id))
return;
 
@@ -3213,14 +3255,15 @@ static void nvme_scan_work(struct work_struct *work)
if (ctrl->vs >= NVME_VS(1, 1, 0) &&
!(ctrl->quirks & NVME_QUIRK_IDENTIFY_CNS)) {
if (!nvme_scan_ns_list(ctrl, nn))
-   goto done;
+   goto out_free_id;
}
nvme_scan_ns_sequential(ctrl, nn);
- done:
+out_free_id:
+   kfree(id);
+out_sort_namespaces:
down_write(&ctrl->namespaces_rwsem);
list_sort(NULL, &ctrl->namespaces, ns_cmp);
up_write(&ctrl->namespaces_rwsem);
-   kfree(id);
 }
 
 /*
@@ -3340,7 +3383,7 @@ static void nvme_handle_aen_notice(struct nvme_ctrl 
*ctrl, u32 result)
 {
switch ((result & 0xff00) >> 8) {
case NVME_AER_NOTICE_NS_CHANGED:
-   dev_info(ctrl->device, "rescanning\n");
+   set_bit(EVENT_NS_CHANGED, &ctrl->events);
nvme_queue_scan(ctrl);
break;
case NVME_AER_NOTICE_FW_ACT_STARTING:
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 11681278fdf6..07e8bfe705c6 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -189,6 +189,8 @@ struct nvme_ctrl {
struct delayed_work ka_work;
struct nvme_command ka_cmd;
struct work_struct fw_act_work;
+#define EVENT_NS_CHANGED   (1 << 0)
+   unsigned long events;
 
/* Power saving configuration */
u64 ps_max_latency_us;
-- 
2.17.0



[PATCH 10/12] nvme: mark nvme_queue_scan static

2018-05-30 Thread Christoph Hellwig
And move it toward the top of the file to avoid a forward declaration.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Johannes Thumshirn 
---
 drivers/nvme/host/core.c | 19 +--
 drivers/nvme/host/nvme.h |  1 -
 2 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c0bc76d98194..06cd04dcffbc 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -100,6 +100,15 @@ static struct class *nvme_subsys_class;
 static void nvme_ns_remove(struct nvme_ns *ns);
 static int nvme_revalidate_disk(struct gendisk *disk);
 
+static void nvme_queue_scan(struct nvme_ctrl *ctrl)
+{
+   /*
+* Only new queue scan work when admin and IO queues are both alive
+*/
+   if (ctrl->state == NVME_CTRL_LIVE)
+   queue_work(nvme_wq, &ctrl->scan_work);
+}
+
 int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
 {
if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
@@ -3214,16 +3223,6 @@ static void nvme_scan_work(struct work_struct *work)
kfree(id);
 }
 
-void nvme_queue_scan(struct nvme_ctrl *ctrl)
-{
-   /*
-* Only new queue scan work when admin and IO queues are both alive
-*/
-   if (ctrl->state == NVME_CTRL_LIVE)
-   queue_work(nvme_wq, &ctrl->scan_work);
-}
-EXPORT_SYMBOL_GPL(nvme_queue_scan);
-
 /*
  * This function iterates the namespace list unlocked to allow recovery from
  * controller failure. It is up to the caller to ensure the namespace list is
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index dcf9e9592c1a..11681278fdf6 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -395,7 +395,6 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl);
 void nvme_put_ctrl(struct nvme_ctrl *ctrl);
 int nvme_init_identify(struct nvme_ctrl *ctrl);
 
-void nvme_queue_scan(struct nvme_ctrl *ctrl);
 void nvme_remove_namespaces(struct nvme_ctrl *ctrl);
 
 int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len,
-- 
2.17.0



[PATCH 12/12] nvme: limit warnings from nvme_identify_ns

2018-05-30 Thread Christoph Hellwig
When rescanning namespaces after an AEN we will issue Identify Namespace
commands to namespaces that have gone away, so don't warn for this specific
case.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Johannes Thumshirn 
---
 drivers/nvme/host/core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1ae77428a1a5..7ad3cfc9d4e1 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -984,7 +984,9 @@ static struct nvme_id_ns *nvme_identify_ns(struct nvme_ctrl 
*ctrl,
 
error = nvme_submit_sync_cmd(ctrl->admin_q, &c, id, sizeof(*id));
if (error) {
-   dev_warn(ctrl->device, "Identify namespace failed\n");
+   /* don't warn on a namespace that has gone away */
+   if (error < 0 || ((error & ~NVME_SC_DNR) != NVME_SC_INVALID_NS))
+   dev_warn(ctrl->device, "Identify namespace failed\n");
kfree(id);
return NULL;
}
-- 
2.17.0



[PATCH 09/12] nvme: submit AEN event configuration on startup

2018-05-30 Thread Christoph Hellwig
From: Hannes Reinecke 

We should register for AEN events; some law-abiding targets might
not be sending us AENs otherwise.

Signed-off-by: Hannes Reinecke 
[hch: slight cleanups]
Signed-off-by: Christoph Hellwig 
Reviewed-by: Johannes Thumshirn 
---
 drivers/nvme/host/core.c | 17 +
 drivers/nvme/host/nvme.h |  1 +
 2 files changed, 18 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 86cb78653155..c0bc76d98194 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1030,6 +1030,21 @@ int nvme_set_queue_count(struct nvme_ctrl *ctrl, int 
*count)
 }
 EXPORT_SYMBOL_GPL(nvme_set_queue_count);
 
+#define NVME_AEN_SUPPORTED \
+   (NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_FW_ACT)
+
+static void nvme_enable_aen(struct nvme_ctrl *ctrl)
+{
+   u32 result;
+   int status;
+
+   status = nvme_set_features(ctrl, NVME_FEAT_ASYNC_EVENT,
+   ctrl->oaes & NVME_AEN_SUPPORTED, NULL, 0, &result);
+   if (status)
+   dev_warn(ctrl->device, "Failed to configure AEN (cfg %x)\n",
+ctrl->oaes & NVME_AEN_SUPPORTED);
+}
+
 static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 {
struct nvme_user_io io;
@@ -2347,6 +2362,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 
ctrl->oacs = le16_to_cpu(id->oacs);
ctrl->oncs = le16_to_cpup(&id->oncs);
+   ctrl->oaes = le32_to_cpu(id->oaes);
atomic_set(&ctrl->abort_limit, id->acl + 1);
ctrl->vwc = id->vwc;
ctrl->cntlid = le16_to_cpup(&id->cntlid);
@@ -3379,6 +3395,7 @@ void nvme_start_ctrl(struct nvme_ctrl *ctrl)
 
if (ctrl->queue_count > 1) {
nvme_queue_scan(ctrl);
+   nvme_enable_aen(ctrl);
queue_work(nvme_wq, &ctrl->async_event_work);
nvme_start_queues(ctrl);
}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ec6e4acc4d48..dcf9e9592c1a 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -176,6 +176,7 @@ struct nvme_ctrl {
u16 kas;
u8 npss;
u8 apsta;
+   u32 oaes;
u32 aen_result;
unsigned int shutdown_timeout;
unsigned int kato;
-- 
2.17.0



[PATCH 07/12] nvmet: Add AEN configuration support

2018-05-30 Thread Christoph Hellwig
AEN configuration via the 'Get Features' and 'Set Features' admin
commands is mandatory, so we should be implementing handling for it.

Signed-off-by: Hannes Reinecke 
[hch: use WRITE_ONCE, check for invalid values]
Signed-off-by: Christoph Hellwig 
---
 drivers/nvme/target/admin-cmd.c | 15 +--
 drivers/nvme/target/core.c  |  3 +++
 drivers/nvme/target/nvmet.h | 16 
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 453d76db1647..4e269a8c8dbe 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -189,7 +189,7 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req 
*req)
id->ver = cpu_to_le32(ctrl->subsys->ver);
 
/* XXX: figure out what to do about RTD3R/RTD3 */
-   id->oaes = cpu_to_le32(1 << 8);
+   id->oaes = cpu_to_le32(NVMET_AEN_CFG_OPTIONAL);
id->ctratt = cpu_to_le32(1 << 0);
 
id->oacs = 0;
@@ -441,6 +441,16 @@ static void nvmet_execute_set_features(struct nvmet_req 
*req)
req->sq->ctrl->kato = DIV_ROUND_UP(val32, 1000);
nvmet_set_result(req, req->sq->ctrl->kato);
break;
+   case NVME_FEAT_ASYNC_EVENT:
+   val32 = le32_to_cpu(req->cmd->common.cdw10[1]);
+   if (val32 & ~NVMET_AEN_CFG_ALL) {
+   status = NVME_SC_INVALID_FIELD | NVME_SC_DNR;
+   break;
+   }
+
+   WRITE_ONCE(req->sq->ctrl->aen_enabled, val32);
+   nvmet_set_result(req, val32);
+   break;
case NVME_FEAT_HOST_ID:
status = NVME_SC_CMD_SEQ_ERROR | NVME_SC_DNR;
break;
@@ -479,9 +489,10 @@ static void nvmet_execute_get_features(struct nvmet_req 
*req)
break;
case NVME_FEAT_WRITE_ATOMIC:
break;
+#endif
case NVME_FEAT_ASYNC_EVENT:
+   nvmet_set_result(req, READ_ONCE(req->sq->ctrl->aen_enabled));
break;
-#endif
case NVME_FEAT_VOLATILE_WC:
nvmet_set_result(req, 1);
break;
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 99fcff757608..b2da65aec8b7 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -168,6 +168,8 @@ static void nvmet_ns_changed(struct nvmet_subsys *subsys, 
u32 nsid)
 
list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
nvmet_add_to_changed_ns_log(ctrl, cpu_to_le32(nsid));
+   if (!(READ_ONCE(ctrl->aen_enabled) & NVME_AEN_CFG_NS_ATTR))
+   continue;
nvmet_add_async_event(ctrl, NVME_AER_TYPE_NOTICE,
NVME_AER_NOTICE_NS_CHANGED,
NVME_LOG_CHANGED_NS);
@@ -855,6 +857,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char 
*hostnqn,
 
kref_init(&ctrl->ref);
ctrl->subsys = subsys;
+   WRITE_ONCE(ctrl->aen_enabled, NVMET_AEN_CFG_OPTIONAL);
 
ctrl->changed_ns_list = kmalloc_array(NVME_MAX_CHANGED_NAMESPACES,
sizeof(__le32), GFP_KERNEL);
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 8cdc1e550396..f4d16d9b3582 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -30,6 +30,21 @@
 #define NVMET_ASYNC_EVENTS 4
 #define NVMET_ERROR_LOG_SLOTS  128
 
+
+/*
+ * Supported optional AENs:
+ */
+#define NVMET_AEN_CFG_OPTIONAL \
+   NVME_AEN_CFG_NS_ATTR
+
+/*
+ * Plus mandatory SMART AENs (we'll never send them, but allow enabling them):
+ */
+#define NVMET_AEN_CFG_ALL \
+   (NVME_SMART_CRIT_SPARE | NVME_SMART_CRIT_TEMPERATURE | \
+NVME_SMART_CRIT_RELIABILITY | NVME_SMART_CRIT_MEDIA | \
+NVME_SMART_CRIT_VOLATILE_MEMORY | NVMET_AEN_CFG_OPTIONAL)
+
 /* Helper Macros when NVMe error is NVME_SC_CONNECT_INVALID_PARAM
  * The 16 bit shift is to set IATTR bit to 1, which means offending
  * offset starts in the data section of connect()
@@ -123,6 +138,7 @@ struct nvmet_ctrl {
u16 cntlid;
u32 kato;
 
+   u32 aen_enabled;
struct nvmet_req*async_event_cmds[NVMET_ASYNC_EVENTS];
unsigned intnr_async_event_cmds;
struct list_headasync_events;
-- 
2.17.0



[PATCH 08/12] nvmet: mask pending AERs

2018-05-30 Thread Christoph Hellwig
Per section 5.2 of the NVMe 1.3 spec:

  "When the controller posts a completion queue entry for an outstanding
  Asynchronous Event Request command and thus reports an asynchronous
  event, subsequent events of that event type are automatically masked by
  the controller until the host clears that event. An event is cleared by
  reading the log page associated with that event using the Get Log Page
  command (see section 5.14)."

Signed-off-by: Christoph Hellwig 
---
 drivers/nvme/target/admin-cmd.c | 1 +
 drivers/nvme/target/core.c  | 9 -
 drivers/nvme/target/nvmet.h | 1 +
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 4e269a8c8dbe..abf191ee00af 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -144,6 +144,7 @@ static void nvmet_execute_get_log_changed_ns(struct 
nvmet_req *req)
if (!status)
status = nvmet_zero_sgl(req, len, req->data_len - len);
ctrl->nr_changed_ns = 0;
+   clear_bit(NVME_AEN_CFG_NS_ATTR, &ctrl->aen_masked);
mutex_unlock(&ctrl->lock);
 out:
nvmet_req_complete(req, status);
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index b2da65aec8b7..8e6432b1abed 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -144,6 +144,13 @@ static void nvmet_add_async_event(struct nvmet_ctrl *ctrl, 
u8 event_type,
schedule_work(&ctrl->async_event_work);
 }
 
+static bool nvmet_aen_disabled(struct nvmet_ctrl *ctrl, u32 aen)
+{
+   if (!(READ_ONCE(ctrl->aen_enabled) & aen))
+   return true;
+   return test_and_set_bit(aen, &ctrl->aen_masked);
+}
+
 static void nvmet_add_to_changed_ns_log(struct nvmet_ctrl *ctrl, __le32 nsid)
 {
mutex_lock(&ctrl->lock);
@@ -168,7 +175,7 @@ static void nvmet_ns_changed(struct nvmet_subsys *subsys, 
u32 nsid)
 
list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
nvmet_add_to_changed_ns_log(ctrl, cpu_to_le32(nsid));
-   if (!(READ_ONCE(ctrl->aen_enabled) & NVME_AEN_CFG_NS_ATTR))
+   if (nvmet_aen_disabled(ctrl, NVME_AEN_CFG_NS_ATTR))
continue;
nvmet_add_async_event(ctrl, NVME_AER_TYPE_NOTICE,
NVME_AER_NOTICE_NS_CHANGED,
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index f4d16d9b3582..480dfe10fad9 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -139,6 +139,7 @@ struct nvmet_ctrl {
u32 kato;
 
u32 aen_enabled;
+   unsigned long   aen_masked;
struct nvmet_req*async_event_cmds[NVMET_ASYNC_EVENTS];
unsigned intnr_async_event_cmds;
struct list_headasync_events;
-- 
2.17.0



[PATCH 03/12] nvme.h: add AER configuration symbols

2018-05-30 Thread Christoph Hellwig
From: Hannes Reinecke 

Signed-off-by: Hannes Reinecke 
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig 
Reviewed-by: Johannes Thumshirn 
---
 include/linux/nvme.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 7ce0f3cf4409..2950ce957656 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -446,6 +446,11 @@ enum {
NVME_AER_NOTICE_FW_ACT_STARTING = 0x01,
 };
 
+enum {
+   NVME_AEN_CFG_NS_ATTR= 1 << 8,
+   NVME_AEN_CFG_FW_ACT = 1 << 9,
+};
+
 struct nvme_lba_range_type {
__u8type;
__u8attributes;
-- 
2.17.0



[PATCH 02/12] nvme.h: add the changed namespace list log

2018-05-30 Thread Christoph Hellwig
Signed-off-by: Christoph Hellwig 
Reviewed-by: Johannes Thumshirn 
---
 include/linux/nvme.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index c37103a4ad38..7ce0f3cf4409 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -751,6 +751,7 @@ enum {
NVME_LOG_ERROR  = 0x01,
NVME_LOG_SMART  = 0x02,
NVME_LOG_FW_SLOT= 0x03,
+   NVME_LOG_CHANGED_NS = 0x04,
NVME_LOG_CMD_EFFECTS= 0x05,
NVME_LOG_DISC   = 0x70,
NVME_LOG_RESERVATION= 0x80,
@@ -759,6 +760,8 @@ enum {
NVME_FWACT_ACTV = (2 << 3),
 };
 
+#define NVME_MAX_CHANGED_NAMESPACES1024
+
 struct nvme_identify {
__u8opcode;
__u8flags;
-- 
2.17.0



[PATCH 05/12] nvmet: split log page implementation

2018-05-30 Thread Christoph Hellwig
Remove the common code to allocate a buffer and copy it into the SGL.
Instead the two no-op implementations just zero the SGL directly, and
the smart log allocates a buffer on its own.  This prepares for the
more elaborate ANA log page.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Johannes Thumshirn 
---
 drivers/nvme/target/admin-cmd.c | 99 -
 1 file changed, 36 insertions(+), 63 deletions(-)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 7f906e4ccc96..678a393f01de 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -32,6 +32,11 @@ u32 nvmet_get_log_page_len(struct nvme_command *cmd)
return len;
 }
 
+static void nvmet_execute_get_log_page_noop(struct nvmet_req *req)
+{
+   nvmet_req_complete(req, nvmet_zero_sgl(req, 0, req->data_len));
+}
+
 static u16 nvmet_get_smart_log_nsid(struct nvmet_req *req,
struct nvme_smart_log *slog)
 {
@@ -97,74 +102,26 @@ static u16 nvmet_get_smart_log_all(struct nvmet_req *req,
return NVME_SC_SUCCESS;
 }
 
-static u16 nvmet_get_smart_log(struct nvmet_req *req,
-   struct nvme_smart_log *slog)
-{
-   u16 status;
-
-   WARN_ON(req == NULL || slog == NULL);
-   if (req->cmd->get_log_page.nsid == cpu_to_le32(NVME_NSID_ALL))
-   status = nvmet_get_smart_log_all(req, slog);
-   else
-   status = nvmet_get_smart_log_nsid(req, slog);
-   return status;
-}
-
-static void nvmet_execute_get_log_page(struct nvmet_req *req)
+static void nvmet_execute_get_log_page_smart(struct nvmet_req *req)
 {
-   struct nvme_smart_log *smart_log;
-   size_t data_len = nvmet_get_log_page_len(req->cmd);
-   void *buf;
-   u16 status = 0;
+   struct nvme_smart_log *log;
+   u16 status = NVME_SC_INTERNAL;
 
-   buf = kzalloc(data_len, GFP_KERNEL);
-   if (!buf) {
-   status = NVME_SC_INTERNAL;
+   if (req->data_len != sizeof(*log))
goto out;
-   }
 
-   switch (req->cmd->get_log_page.lid) {
-   case NVME_LOG_ERROR:
-   /*
-* We currently never set the More bit in the status field,
-* so all error log entries are invalid and can be zeroed out.
-* This is called a minum viable implementation (TM) of this
-* mandatory log page.
-*/
-   break;
-   case NVME_LOG_SMART:
-   /*
-* XXX: fill out actual smart log
-*
-* We might have a hard time coming up with useful values for
-* many of the fields, and even when we have useful data
-* available (e.g. units or commands read/written) those aren't
-* persistent over power loss.
-*/
-   if (data_len != sizeof(*smart_log)) {
-   status = NVME_SC_INTERNAL;
-   goto err;
-   }
-   smart_log = buf;
-   status = nvmet_get_smart_log(req, smart_log);
-   if (status)
-   goto err;
-   break;
-   case NVME_LOG_FW_SLOT:
-   /*
-* We only support a single firmware slot which always is
-* active, so we can zero out the whole firmware slot log and
-* still claim to fully implement this mandatory log page.
-*/
-   break;
-   default:
-   BUG();
-   }
+   log = kzalloc(sizeof(*log), GFP_KERNEL);
+   if (!log)
+   goto out;
 
-   status = nvmet_copy_to_sgl(req, 0, buf, data_len);
+   if (req->cmd->get_log_page.nsid == cpu_to_le32(NVME_NSID_ALL))
+   status = nvmet_get_smart_log_all(req, log);
+   else
+   status = nvmet_get_smart_log_nsid(req, log);
+   if (status)
+   goto out;
 
-err:
-   kfree(buf);
+   status = nvmet_copy_to_sgl(req, 0, log, sizeof(*log));
 out:
nvmet_req_complete(req, status);
 }
@@ -572,9 +529,25 @@ u16 nvmet_parse_admin_cmd(struct nvmet_req *req)
 
switch (cmd->get_log_page.lid) {
case NVME_LOG_ERROR:
+   /*
+* We currently never set the More bit in the status
+* field, so all error log entries are invalid and can
+* be zeroed out.  This is called a minum viable
+* implementation (TM) of this mandatory log page.
+*/
+   req->execute = nvmet_execute_get_log_page_noop;
+   return 0;
case NVME_LOG_SMART:
+   req->execute = nvmet_execute_get_log_page_smart;
+   return 0;
case NVME_LOG_FW_SLOT:
-   req->execute = nvmet_execute_get_log_page;
+

[PATCH 01/12] nvme.h: untangle AEN notice definitions

2018-05-30 Thread Christoph Hellwig
Stop including the event type in the definitions for the notice type.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Johannes Thumshirn 
---
 drivers/nvme/host/core.c | 30 ++
 include/linux/nvme.h |  8 ++--
 2 files changed, 24 insertions(+), 14 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 2c4cf65641a6..86cb78653155 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3321,6 +3321,21 @@ static void nvme_fw_act_work(struct work_struct *work)
nvme_get_fw_slot_info(ctrl);
 }
 
+static void nvme_handle_aen_notice(struct nvme_ctrl *ctrl, u32 result)
+{
+   switch ((result & 0xff00) >> 8) {
+   case NVME_AER_NOTICE_NS_CHANGED:
+   dev_info(ctrl->device, "rescanning\n");
+   nvme_queue_scan(ctrl);
+   break;
+   case NVME_AER_NOTICE_FW_ACT_STARTING:
+   queue_work(nvme_wq, &ctrl->fw_act_work);
+   break;
+   default:
+   dev_warn(ctrl->device, "async event result %08x\n", result);
+   }
+}
+
 void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status,
volatile union nvme_result *res)
 {
@@ -3330,6 +3345,9 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, 
__le16 status,
return;
 
switch (result & 0x7) {
+   case NVME_AER_NOTICE:
+   nvme_handle_aen_notice(ctrl, result);
+   break;
case NVME_AER_ERROR:
case NVME_AER_SMART:
case NVME_AER_CSS:
@@ -3339,18 +3357,6 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, 
__le16 status,
default:
break;
}
-
-   switch (result & 0xff07) {
-   case NVME_AER_NOTICE_NS_CHANGED:
-   dev_info(ctrl->device, "rescanning\n");
-   nvme_queue_scan(ctrl);
-   break;
-   case NVME_AER_NOTICE_FW_ACT_STARTING:
-   queue_work(nvme_wq, &ctrl->fw_act_work);
-   break;
-   default:
-   dev_warn(ctrl->device, "async event result %08x\n", result);
-   }
queue_work(nvme_wq, &ctrl->async_event_work);
 }
 EXPORT_SYMBOL_GPL(nvme_complete_async_event);
diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 4112e2bd747f..c37103a4ad38 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -436,10 +436,14 @@ enum {
 enum {
NVME_AER_ERROR  = 0,
NVME_AER_SMART  = 1,
+   NVME_AER_NOTICE = 2,
NVME_AER_CSS= 6,
NVME_AER_VS = 7,
-   NVME_AER_NOTICE_NS_CHANGED  = 0x0002,
-   NVME_AER_NOTICE_FW_ACT_STARTING = 0x0102,
+};
+
+enum {
+   NVME_AER_NOTICE_NS_CHANGED  = 0x00,
+   NVME_AER_NOTICE_FW_ACT_STARTING = 0x01,
 };
 
 struct nvme_lba_range_type {
-- 
2.17.0



[PATCH 06/12] nvmet: implement the changed namespaces log

2018-05-30 Thread Christoph Hellwig
Just keep a per-controller buffer of changed namespaces and copy it out
in the get log page implementation.

Signed-off-by: Christoph Hellwig 
---
 drivers/nvme/target/admin-cmd.c | 26 +
 drivers/nvme/target/core.c  | 50 +++--
 drivers/nvme/target/nvmet.h |  3 ++
 3 files changed, 70 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 678a393f01de..453d76db1647 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -126,6 +126,29 @@ static void nvmet_execute_get_log_page_smart(struct 
nvmet_req *req)
nvmet_req_complete(req, status);
 }
 
+static void nvmet_execute_get_log_changed_ns(struct nvmet_req *req)
+{
+   struct nvmet_ctrl *ctrl = req->sq->ctrl;
+   u16 status = NVME_SC_INTERNAL;
+   size_t len;
+
+   if (req->data_len != NVME_MAX_CHANGED_NAMESPACES * sizeof(__le32))
+   goto out;
+
+   mutex_lock(&ctrl->lock);
+   if (ctrl->nr_changed_ns == U32_MAX)
+   len = sizeof(__le32);
+   else
+   len = ctrl->nr_changed_ns * sizeof(__le32);
+   status = nvmet_copy_to_sgl(req, 0, ctrl->changed_ns_list, len);
+   if (!status)
+   status = nvmet_zero_sgl(req, len, req->data_len - len);
+   ctrl->nr_changed_ns = 0;
+   mutex_unlock(&ctrl->lock);
+out:
+   nvmet_req_complete(req, status);
+}
+
 static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
 {
struct nvmet_ctrl *ctrl = req->sq->ctrl;
@@ -549,6 +572,9 @@ u16 nvmet_parse_admin_cmd(struct nvmet_req *req)
 */
req->execute = nvmet_execute_get_log_page_noop;
return 0;
+   case NVME_LOG_CHANGED_NS:
+   req->execute = nvmet_execute_get_log_changed_ns;
+   return 0;
}
break;
case nvme_admin_identify:
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 55c4bc693aa2..99fcff757608 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -144,6 +144,36 @@ static void nvmet_add_async_event(struct nvmet_ctrl *ctrl, 
u8 event_type,
schedule_work(&ctrl->async_event_work);
 }
 
+static void nvmet_add_to_changed_ns_log(struct nvmet_ctrl *ctrl, __le32 nsid)
+{
+   mutex_lock(&ctrl->lock);
+   if (ctrl->nr_changed_ns < NVME_MAX_CHANGED_NAMESPACES) {
+   u32 i;
+
+   for (i = 0; i < ctrl->nr_changed_ns; i++)
+   if (ctrl->changed_ns_list[i] == nsid)
+   goto out_unlock;
+   ctrl->changed_ns_list[ctrl->nr_changed_ns++] = nsid;
+   } else if (ctrl->nr_changed_ns == NVME_MAX_CHANGED_NAMESPACES) {
+   ctrl->changed_ns_list[0] = cpu_to_le32(0xffffffff);
+   ctrl->nr_changed_ns = U32_MAX;
+   }
+out_unlock:
+   mutex_unlock(&ctrl->lock);
+}
+
+static void nvmet_ns_changed(struct nvmet_subsys *subsys, u32 nsid)
+{
+   struct nvmet_ctrl *ctrl;
+
+   list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+   nvmet_add_to_changed_ns_log(ctrl, cpu_to_le32(nsid));
+   nvmet_add_async_event(ctrl, NVME_AER_TYPE_NOTICE,
+   NVME_AER_NOTICE_NS_CHANGED,
+   NVME_LOG_CHANGED_NS);
+   }
+}
+
 int nvmet_register_transport(const struct nvmet_fabrics_ops *ops)
 {
int ret = 0;
@@ -287,7 +317,6 @@ static void nvmet_ns_dev_disable(struct nvmet_ns *ns)
 int nvmet_ns_enable(struct nvmet_ns *ns)
 {
struct nvmet_subsys *subsys = ns->subsys;
-   struct nvmet_ctrl *ctrl;
int ret = 0;
 
mutex_lock(&subsys->lock);
@@ -326,9 +355,7 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
list_add_tail_rcu(&ns->dev_link, &old->dev_link);
}
 
-   list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
-   nvmet_add_async_event(ctrl, NVME_AER_TYPE_NOTICE, 0, 0);
-
+   nvmet_ns_changed(subsys, ns->nsid);
ns->enabled = true;
ret = 0;
 out_unlock:
@@ -342,7 +369,6 @@ int nvmet_ns_enable(struct nvmet_ns *ns)
 void nvmet_ns_disable(struct nvmet_ns *ns)
 {
struct nvmet_subsys *subsys = ns->subsys;
-   struct nvmet_ctrl *ctrl;
 
mutex_lock(&subsys->lock);
if (!ns->enabled)
@@ -368,9 +394,7 @@ void nvmet_ns_disable(struct nvmet_ns *ns)
percpu_ref_exit(&ns->ref);
 
mutex_lock(&subsys->lock);
-   list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
-   nvmet_add_async_event(ctrl, NVME_AER_TYPE_NOTICE, 0, 0);
-
+   nvmet_ns_changed(subsys, ns->nsid);
nvmet_ns_dev_disable(ns);
 out_unlock:
mutex_unlock(&subsys->lock);
@@ -832,11 +856,16 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char 
*hostnqn,
kref_init(&ctrl->ref);
ctrl->subsys = subsys;
 
+   

[PATCH 04/12] nvmet: add a new nvmet_zero_sgl helper

2018-05-30 Thread Christoph Hellwig
Zeroes the SGL in the payload.

Signed-off-by: Christoph Hellwig 
Reviewed-by: Johannes Thumshirn 
---
 drivers/nvme/target/core.c  | 7 +++
 drivers/nvme/target/nvmet.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 800aaf96ddcd..55c4bc693aa2 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -57,6 +57,13 @@ u16 nvmet_copy_from_sgl(struct nvmet_req *req, off_t off, 
void *buf, size_t len)
return 0;
 }
 
+u16 nvmet_zero_sgl(struct nvmet_req *req, off_t off, size_t len)
+{
+   if (sg_zero_buffer(req->sg, req->sg_cnt, len, off) != len)
+   return NVME_SC_SGL_INVALID_DATA | NVME_SC_DNR;
+   return 0;
+}
+
 static unsigned int nvmet_max_nsid(struct nvmet_subsys *subsys)
 {
struct nvmet_ns *ns;
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 32ebffcf464c..768abe203298 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -330,6 +330,7 @@ u16 nvmet_copy_to_sgl(struct nvmet_req *req, off_t off, 
const void *buf,
size_t len);
 u16 nvmet_copy_from_sgl(struct nvmet_req *req, off_t off, void *buf,
size_t len);
+u16 nvmet_zero_sgl(struct nvmet_req *req, off_t off, size_t len);
 
 u32 nvmet_get_log_page_len(struct nvme_command *cmd);
 
-- 
2.17.0



nvme/nvmet AEN and log page fixups v2

2018-05-30 Thread Christoph Hellwig
Hi all,

this series started as prep work for ANA, but has grown a lot.

The idea is to make the AEN handling in the target closer to what
the standard says, and implement the changed namespaces list log page,
which is required to clear the namespace attribute notice event.

On the host side this just makes sure we actually have the events
we handle enabled using Set Features, and it fixes the handling of
the namespace attribute notice event so that it actually does get
cleared by issuing the log page.

Last but not least, two block cleanups are thrown in so that we don't
print a pointless message when we delete a namespace, and also
unexport the function printing said message, as it should only be
called from the core code.

Changes from v1:
 - don't copy garbage when more than 1024 namespaces changed
 - filter out duplicates in the changed namespace list log page
 - use the proper define for the supported events to initialize
   OAES
 - add the nvmet_aen_disabled helper from the ANA series
 - allow the mandatory smart AENs in Set Features


[PATCH] nbd: clear DISCONNECT_REQUESTED flag once disconnection occurs.

2018-05-30 Thread kvigor
From: Kevin Vigor 

When a userspace client requests a NBD device be disconnected, the
DISCONNECT_REQUESTED flag is set. While this flag is set, the driver
will not inform userspace when a connection is closed.

Unfortunately the flag was never cleared, so once a disconnect was
requested the driver would thereafter never tell userspace about a
closed connection. Thus when connections failed due to timeout, no
attempt to reconnect was made and eventually the device would fail.

Fix by clearing the DISCONNECT_REQUESTED flag (and setting the
DISCONNECTED flag) once all connections are closed.

Changes relative to v1 (https://marc.info/?l=linux-block&m=152763540418902):

* remove pointless wake_up() -> wake_up_all() change.

Signed-off-by: Kevin Vigor 
---
 drivers/block/nbd.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index afbc202..1cd041b 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -213,7 +213,15 @@ static void nbd_mark_nsock_dead(struct nbd_device *nbd, 
struct nbd_sock *nsock,
}
if (!nsock->dead) {
kernel_sock_shutdown(nsock->sock, SHUT_RDWR);
-   atomic_dec(&nbd->config->live_connections);
+   if (atomic_dec_return(&nbd->config->live_connections) == 0) {
+   if (test_and_clear_bit(NBD_DISCONNECT_REQUESTED,
+  &nbd->config->runtime_flags)) {
+   set_bit(NBD_DISCONNECTED,
+   &nbd->config->runtime_flags);
+   dev_info(nbd_to_dev(nbd),
+   "Disconnected due to user request.\n");
+   }
+   }
}
nsock->dead = true;
nsock->pending = NULL;
@@ -292,7 +300,9 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct 
request *req,
 
if (config->num_connections > 1) {
dev_err_ratelimited(nbd_to_dev(nbd),
-   "Connection timed out, retrying\n");
+   "Connection timed out, retrying (%d/%d 
alive)\n",
+   atomic_read(&config->live_connections),
+   config->num_connections);
/*
 * Hooray we have more connections, requeue this IO, the submit
 * path will put it on a real connection.
@@ -714,10 +724,9 @@ static int wait_for_reconnect(struct nbd_device *nbd)
return 0;
if (test_bit(NBD_DISCONNECTED, &config->runtime_flags))
return 0;
-   wait_event_timeout(config->conn_wait,
-  atomic_read(&config->live_connections),
-  config->dead_conn_timeout);
-   return atomic_read(&config->live_connections);
+   return wait_event_timeout(config->conn_wait,
+ atomic_read(&config->live_connections) > 0,
+ config->dead_conn_timeout) > 0;
 }
 
 static int nbd_handle_cmd(struct nbd_cmd *cmd, int index)
-- 
2.7.4



Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices

2018-05-30 Thread Jens Axboe
On 5/30/18 9:45 AM, Jeff Moyer wrote:
> Jens Axboe  writes:
> 
>> On 5/30/18 9:06 AM, Jeff Moyer wrote:
>>> Hi, Jens,
>>>
>>> Jens Axboe  writes:
>>>
 On 5/30/18 2:49 AM, Christoph Hellwig wrote:
> While I really don't want drivers to change the I/O schedule themselves
> we have a class of devices (zoned) that don't work at all with certain
> I/O schedulers.  The kernel not choosing something sane and requiring
> user workarounds is just silly.

 They work just fine for probing and reading purposes. There's absolutely
 no reason why we can't handle these special snowflakes with a udev rule.
>>>
>>> udev rules aren't shipped with the kernel, so it makes it hard to keep
>>> them in sync.  In this instance, I'm not sure anyone made an effort to
>>> notify distributions that a udev rule was even necessary.  (Is there a
>>> way of notifying distributions about kernel changes that require new
>>> udev rules, other than emailing each list individually?)
>>>
>>> From a technical standpoint, I totally agree with you, Jens.  However, I
>>> think the user experience sucks.  4.15 worked by default, 4.16 doesn't.
>>> The result will be bug reports from users (to the drive vendors,
>>> distribution bugzillas, here, etc.).
>>
>> I would imagine that most folks get their updates from a distro of some
>> sort, in which case there's absolutely nothing stopping the distro from
>> shipping updated rules for the 4.16 kernel update.
> 
> The problem is distros have already shipped that kernel.  :)

Ship an update, then! I'm sure that most people would prefer a simple
rule update over a kernel update. And you have to do one of them, to
resolve this anyway.

>> But what's the regression? 4.15 had no zone write locking at all.
> 
> The zone write locking was done in the sd driver prior to 4.16.  See
> commit 39051dd85f287 ("scsi: sd: Remove zone write locking") for where
> it was removed.  That means these devices "just worked" with all I/O
> schedulers.

Gotcha, makes sense.

>>> Moving on, assuming your mind is made up...
>>>
>>> I'm not sure how much logic should go into the udev rule.  As mentioned,
>>> this limitation was introduced in 4.16, and Damien has plans to lift the
>>> restriction in future kernels.  Because distributions tend to cherry
>>> pick changes, making decisions on whether a feature exists based solely
>>> on kernel version is usually not a great thing.  My inclination would be
>>> to just always force deadline for host-managed SMR drives.  These drives
>>> aren't that popular, after all.  Any opinions on this?
>>
>> The problem is that it's tied to an IO scheduler, which ends up causing
>> issues like this, since users are free to select a different scheduler.
>> Then things break. Granted, in this case, some extraordinarily shitty
>> hardware even broke. That is on the hardware, not the kernel; that
>> kind of breakage should not occur.
> 
> If the firmware problem was widespread, I think we'd try to avoid it.  I
> have no reason to believe that is the case, though.
> 
> Damien made the argument that the user should be able to select an I/O
> scheduler that doesn't perform the write locking, because a well-behaved
> application could theoretically make use of it.  I think this is a weak
> argument, given that dm-zoned doesn't even support such a mode.

Sure, the user should be able to select whatever they want. Maybe they
are strictly using it through bsg or a similar interface, in which
case no scheduler or kernel support is really needed to drive it.

>> So now we're stuck with this temporary situation which needs a work-around.
>> I don't think it's a terrible idea to have a rule that just sets
>> deadline/mq-deadline for an SMR device regardless of what kernel it is
>> running on. It'll probably never be a bad default.
> 
> OK.  Barring future input to the contrary, I'll work to get updates into
> fedora, at least.  I've CC'd Colin and Hannes.  I'm not sure who else to
> include.
> 
> FYI, below is the udev rule Damien had provided to Bryan.  I'm not sure
> about the KERNEL=="sd[a-z]" bit, that may need modification.  Note: I'm
> no udev expert.

Rule looks fine to me, but I'm also no udev expert.

-- 
Jens Axboe



Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices

2018-05-30 Thread Jeff Moyer
Jens Axboe  writes:

> On 5/30/18 9:06 AM, Jeff Moyer wrote:
>> Hi, Jens,
>> 
>> Jens Axboe  writes:
>> 
>>> On 5/30/18 2:49 AM, Christoph Hellwig wrote:
 While I really don't want drivers to change the I/O scheduler themselves,
 we have a class of devices (zoned) that don't work at all with certain
 I/O schedulers.  The kernel not choosing something sane and requiring
 user workarounds is just silly.
>>>
>>> They work just fine for probing and reading purposes. There's absolutely
>>> no reason why we can't handle these special snowflakes with a udev rule.
>> 
>> udev rules aren't shipped with the kernel, so it makes it hard to keep
>> them in sync.  In this instance, I'm not sure anyone made an effort to
>> notify distributions that a udev rule was even necessary.  (Is there a
>> way of notifying distributions about kernel changes that require new
>> udev rules, other than emailing each list individually?)
>> 
>> From a technical standpoint, I totally agree with you, Jens.  However, I
>> think the user experience sucks.  4.15 worked by default, 4.16 doesn't.
>> The result will be bug reports from users (to the drive vendors,
>> distribution bugzillas, here, etc.).
>
> I would imagine that most folks get their updates from a distro of some
> sort, in which case there's absolutely nothing stopping the distro from
> shipping updated rules for the 4.16 kernel update.

The problem is distros have already shipped that kernel.  :)

> But what's the regression? 4.15 had no zone write locking at all.

The zone write locking was done in the sd driver prior to 4.16.  See
commit 39051dd85f287 ("scsi: sd: Remove zone write locking") for where
it was removed.  That means these devices "just worked" with all I/O
schedulers.

>> Moving on, assuming your mind is made up...
>> 
>> I'm not sure how much logic should go into the udev rule.  As mentioned,
>> this limitation was introduced in 4.16, and Damien has plans to lift the
>> restriction in future kernels.  Because distributions tend to cherry
>> pick changes, making decisions on whether a feature exists based solely
>> on kernel version is usually not a great thing.  My inclination would be
>> to just always force deadline for host-managed SMR drives.  These drives
>> aren't that popular, after all.  Any opinions on this?
>
> The problem is that it's tied to an IO scheduler, which ends up causing
> issues like this, since users are free to select a different scheduler.
> Then things break. Granted, in this case, some extraordinarily shitty
> hardware even broke. That is on the hardware, not the kernel; that
> kind of breakage should not occur.

If the firmware problem was widespread, I think we'd try to avoid it.  I
have no reason to believe that is the case, though.

Damien made the argument that the user should be able to select an I/O
scheduler that doesn't perform the write locking, because a well-behaved
application could theoretically make use of it.  I think this is a weak
argument, given that dm-zoned doesn't even support such a mode.

I definitely see this udev rule as a temporary workaround.

> So now we're stuck with this temporary situation which needs a work-around.
> I don't think it's a terrible idea to have a rule that just sets
> deadline/mq-deadline for an SMR device regardless of what kernel it is
> running on. It'll probably never be a bad default.

OK.  Barring future input to the contrary, I'll work to get updates into
fedora, at least.  I've CC'd Colin and Hannes.  I'm not sure who else to
include.

FYI, below is the udev rule Damien had provided to Bryan.  I'm not sure
about the KERNEL=="sd[a-z]" bit, that may need modification.  Note: I'm
no udev expert.

Cheers,
Jeff

ACTION=="add|change", KERNEL=="sd[a-z]", ATTRS{queue/zoned}=="host-managed", ATTR{queue/scheduler}="deadline"
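
A quick way to double-check that the rule is keyed on the right attributes and
actually took effect is to read the same sysfs files the rule uses. The small
userspace program below is purely illustrative; the device name "sda" is an
assumption and should be replaced with the zoned disk in question.

#include <stdio.h>

/* Print one sysfs attribute, e.g. "host-managed" or "[mq-deadline] ..." */
static void show(const char *path)
{
        char buf[128];
        FILE *f = fopen(path, "r");

        if (!f) {
                perror(path);
                return;
        }
        if (fgets(buf, sizeof(buf), f))
                printf("%s: %s", path, buf);
        fclose(f);
}

int main(void)
{
        show("/sys/block/sda/queue/zoned");      /* expect "host-managed" */
        show("/sys/block/sda/queue/scheduler");  /* expect "[deadline]" or "[mq-deadline]" */
        return 0;
}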


Re: [PATCH] nbd: clear DISCONNECT_REQUESTED flag once disconnection occurs.

2018-05-30 Thread Jens Axboe
On 5/29/18 5:09 PM, kvi...@gmail.com wrote:
> @@ -937,7 +946,7 @@ static int nbd_reconnect_socket(struct nbd_device *nbd, unsigned long arg)
>   queue_work(recv_workqueue, &args->work);
>  
>   atomic_inc(&config->live_connections);
> - wake_up(&config->conn_wait);
> + wake_up_all(&config->conn_wait);
>   return 0;
>   }
>   sockfd_put(sock);

Unless you're using exclusive waits, and you are not, then wake_up() is
equivalent to wake_up_all().

-- 
Jens Axboe
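
For readers less familiar with the wait-queue semantics Jens is referring to:
wake_up() wakes every non-exclusive waiter plus at most one exclusive waiter,
while wake_up_all() wakes everything. Since nbd's conn_wait has no exclusive
waiters, the two calls behave identically there. A minimal illustrative sketch
of where they would differ (not nbd code; the names are made up):

#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(wq);
static int ready;

/* Non-exclusive waiter: woken by wake_up() and wake_up_all() alike. */
static int consumer(void)
{
        return wait_event_interruptible(wq, ready);
}

/* Exclusive waiter: wake_up() wakes at most one of these per call,
 * wake_up_all() wakes all of them. */
static int exclusive_consumer(void)
{
        return wait_event_interruptible_exclusive(wq, ready);
}

static void producer(void)
{
        ready = 1;
        wake_up(&wq);   /* same effect as wake_up_all() without exclusive waiters */
}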



Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices

2018-05-30 Thread Jens Axboe
On 5/30/18 9:06 AM, Jeff Moyer wrote:
> Hi, Jens,
> 
> Jens Axboe  writes:
> 
>> On 5/30/18 2:49 AM, Christoph Hellwig wrote:
>>> While I really don't want drivers to change the I/O scheduler themselves,
>>> we have a class of devices (zoned) that don't work at all with certain
>>> I/O schedulers.  The kernel not choosing something sane and requiring
>>> user workarounds is just silly.
>>
>> They work just fine for probing and reading purposes. There's absolutely
>> no reason why we can't handle these special snowflakes with a udev rule.
> 
> udev rules aren't shipped with the kernel, so it makes it hard to keep
> them in sync.  In this instance, I'm not sure anyone made an effort to
> notify distributions that a udev rule was even necessary.  (Is there a
> way of notifying distributions about kernel changes that require new
> udev rules, other than emailing each list individually?)
> 
> From a technical standpoint, I totally agree with you, Jens.  However, I
> think the user experience sucks.  4.15 worked by default, 4.16 doesn't.
> The result will be bug reports from users (to the drive vendors,
> distribution bugzillas, here, etc.).

I would imagine that most folks get their updates from a distro of some
sort, in which case there's absolutely nothing stopping the distro from
shipping updated rules for the 4.16 kernel update.

But what's the regression? 4.15 had no zone write locking at all.

> Moving on, assuming your mind is made up...
> 
> I'm not sure how much logic should go into the udev rule.  As mentioned,
> this limitation was introduced in 4.16, and Damien has plans to lift the
> restriction in future kernels.  Because distributions tend to cherry
> pick changes, making decisions on whether a feature exists based solely
> on kernel version is usually not a great thing.  My inclination would be
> to just always force deadline for host-managed SMR drives.  These drives
> aren't that popular, after all.  Any opinions on this?

The problem is that it's tied to an IO scheduler, which ends up causing
issues like this, since users are free to select a different scheduler.
Then things break. Granted, in this case, some extraordinarily shitty
hardware even broke. That is on the hardware, not the kernel; that
kind of breakage should not occur.

So now we're stuck with this temporary situation which needs a work-around.
I don't think it's a terrible idea to have a rule that just sets
deadline/mq-deadline for an SMR device regardless of what kernel it is
running on. It'll probably never be a bad default.

-- 
Jens Axboe



Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices

2018-05-30 Thread Jeff Moyer
Hi, Jens,

Jens Axboe  writes:

> On 5/30/18 2:49 AM, Christoph Hellwig wrote:
>> While I really don't want drivers to change the I/O scheduler themselves,
>> we have a class of devices (zoned) that don't work at all with certain
>> I/O schedulers.  The kernel not choosing something sane and requiring
>> user workarounds is just silly.
>
> They work just fine for probing and reading purposes. There's absolutely
> no reason why we can't handle these special snowflakes with a udev rule.

udev rules aren't shipped with the kernel, so it makes it hard to keep
them in sync.  In this instance, I'm not sure anyone made an effort to
notify distributions that a udev rule was even necessary.  (Is there a
way of notifying distributions about kernel changes that require new
udev rules, other than emailing each list individually?)

From a technical standpoint, I totally agree with you, Jens.  However, I
think the user experience sucks.  4.15 worked by default, 4.16 doesn't.
The result will be bug reports from users (to the drive vendors,
distribution bugzillas, here, etc.).

Moving on, assuming your mind is made up...

I'm not sure how much logic should go into the udev rule.  As mentioned,
this limitation was introduced in 4.16, and Damien has plans to lift the
restriction in future kernels.  Because distributions tend to cherry
pick changes, making decisions on whether a feature exists based solely
on kernel version is usually not a great thing.  My inclination would be
to just always force deadline for host-managed SMR drives.  These drives
aren't that popular, after all.  Any opinions on this?

Thanks!
Jeff


Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices

2018-05-30 Thread Jens Axboe
On 5/30/18 2:49 AM, Christoph Hellwig wrote:
> While I really don't want drivers to change the I/O scheduler themselves,
> we have a class of devices (zoned) that don't work at all with certain
> I/O schedulers.  The kernel not choosing something sane and requiring
> user workarounds is just silly.

They work just fine for probing and reading purposes. There's absolutely
no reason why we can't handle these special snowflakes with a udev rule.

-- 
Jens Axboe



Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices

2018-05-30 Thread Bryan Gurney
On Wed, May 30, 2018 at 2:22 AM, Damien Le Moal  wrote:
> Jens,
>
> On 5/26/18 13:01, Jens Axboe wrote:
>> On 5/25/18 4:18 PM, Jeff Moyer wrote:
>>> Hi, Jens,
>>>
>>> Jens Axboe  writes:
>>>
 On 5/25/18 3:14 PM, Jeff Moyer wrote:
> Bryan Gurney reported I/O errors when using dm-zoned with a host-managed
> SMR device.  It turns out he was using CFQ, which is the default.
> Unfortunately, as of v4.16, only the deadline schedulers work well with
> host-managed SMR devices.  This series attempts to switch the elevator
> to deadline for those devices.
>
> NOTE: I'm not super happy with setting up one iosched and then
> immediately tearing it down.  I'm open to suggestions on better ways
> to accomplish this goal.

 Let's please not do this, a few years ago I finally managed to kill
 drivers changing the scheduler manually. Why can't this go into a
 udev (or similar) rule? That's where it belongs, imho.
>>>
>>> We could do that.  The downside is that distros will have to pick up
>>> udev rules, which they haven't done yet, and the udev rules will have to
>>> be kernel version dependent.  And then later, when this restriction is
>>> lifted, we'll have to update the udev rules.  That also sounds awful to
>>> me.
>>
>> They only have to be feature dependent, which isn't that hard. And if I
>> had to pick between a kernel upgrade and a udev rule package update, the
>> choice is pretty clear.
>
> Currently, a queue elevator is first initialized with a call to
> elevator_init() from either blk_mq_init_allocated_queue() or
> blk_init_allocated_queue(). When these are called, the device has not
> yet been probed and no information is available. Do you think it is
> reasonable to delay the default elevator bring-up until some
> information about the device has been obtained?
>
> That would necessitate splitting elevator_init() into a generic elevator
> initialization function setting up the elevator related fields of the
> request queue and a second part setting up the default elevator (e.g.
> elevator_set_default()). Doing so, the function elevator_set_default()
> could be called later in the device initialization sequence, after
> information from the device has been obtained. It would make choosing a
> sane default elevator much cleaner.
>
> Granted, that is in effect very similar to what Jeff's patches do. But
> in a sense, the current code already considers some of the device
> characteristics when setting the default elevator, namely whether the
> device is mq or not. Would it be a stretch to add "is_zoned" as another
> characteristic that is looked at?
>
> On another note, I noticed that the only two callers of elevator_init()
> are blk_mq_init_allocated_queue() and blk_init_allocated_queue(). Both
> call elevator_init() with a NULL name, so I wonder how useful that
> argument is. Furthermore, elevator_init() is also exported, but I do not
> see any external module user. Is this on purpose, or is it a remnant of
> older usage? I can send a patch cleaning that up if you think it needs
> cleanup.
>
>>
>>> I understand why you don't like this patch set, but I happen to think
>>> the alternative is worse.  FYI, in Bryan's case, his system actually got
>>> bricked (likely due to buggy firmware).
>>
>> I disagree, I think the rule approach is much easier. If the wrong write
>> location bricked the drive, then I think that user has much larger
>> issues... That seems like a trivial issue that should have been caught
>> in basic testing, I would not trust that drive with any data if it
>> bricks that easily.
>
> I think that the bug in this particular case was with the BIOS (probably
> UEFI), not the disk. Disks only answer to trivial requests when probed
> on boot, unless the BIOS is too smart for its own good and is doing very
> weird things. Different issue anyway, not related to the kernel.

Actually, the BIOS boot mode was configured to "Legacy" mode (i.e.:
non-UEFI) at the time of the bricking.  I'm pretty confident of this,
because after the BIOS reflash, I can see a different rendering mode
in the GRUB screen of Fedora 28, and when I checked back in the BIOS
setup utility, I saw that the default for the new BIOS image was UEFI.

On the test systems in general, I prefer to use legacy BIOS if I don't
need UEFI.  I'm not sure if this means that I walked into a latent
BIOS bug, though.


Thanks,

Bryan


Re: [bug report] IO blocked after offline CPU during fio on nvme SSD

2018-05-30 Thread Yi Zhang

Thanks for the update; I will give it a try after Keith updates the v4 fix. :)


On 05/30/2018 04:29 PM, Ming Lei wrote:

On Wed, May 30, 2018 at 3:53 PM, Yi Zhang  wrote:

Hi
I found this issue during my NVMe test on 4.17.0-rc7. Here are the reproducer
and dmesg log; let me know if you need more info. Thanks.

Reproduce steps:
#!/bin/bash
set -x
fio -filename=/dev/nvme0n1p1 -iodepth=1 -thread -rw=randwrite -ioengine=psync 
-bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 
-direct=1 -runtime=120 -size=-group_reporting -name=mytest -numjobs=60 &
sleep 10
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 0 > /sys/devices/system/cpu/cpu2/online
echo 0 > /sys/devices/system/cpu/cpu3/online

dmesg log:
[  153.402709] IRQ 82: no longer affine to CPU1
[  153.408596] smpboot: CPU 1 is now offline
[  153.426422] IRQ 86: no longer affine to CPU2
[  153.431204] IRQ 89: no longer affine to CPU2
[  153.437054] smpboot: CPU 2 is now offline
[  153.682095] IRQ 83: no longer affine to CPU3
[  153.686866] IRQ 85: no longer affine to CPU3
[  153.692712] smpboot: CPU 3 is now offline
[  183.847661] nvme nvme0: I/O 417 QID 2 timeout, completion polled
[  183.854410] nvme nvme0: I/O 251 QID 7 timeout, completion polled
[  183.861133] nvme nvme0: I/O 252 QID 7 timeout, completion polled
[  183.867851] nvme nvme0: I/O 253 QID 7 timeout, aborting
[  183.873699] nvme nvme0: I/O 254 QID 7 timeout, aborting
[  183.879530] nvme nvme0: Abort status: 0x0
[  183.884017] nvme nvme0: I/O 255 QID 7 timeout, aborting
[  183.889847] nvme nvme0: Abort status: 0x0
[  183.894338] nvme nvme0: I/O 495 QID 8 timeout, completion polled
[  183.901033] nvme nvme0: Abort status: 0x0
[  183.905520] nvme nvme0: I/O 496 QID 8 timeout, aborting
[  183.911359] nvme nvme0: I/O 497 QID 8 timeout, aborting
[  183.917190] nvme nvme0: Abort status: 0x0
[  183.921677] nvme nvme0: I/O 498 QID 8 timeout, aborting
[  183.927506] nvme nvme0: Abort status: 0x0
[  183.931994] nvme nvme0: I/O 499 QID 8 timeout, aborting
[  183.937825] nvme nvme0: Abort status: 0x0
[  183.942311] nvme nvme0: I/O 500 QID 8 timeout, aborting
[  183.948141] nvme nvme0: Abort status: 0x0
[  183.952637] nvme nvme0: Abort status: 0x0
[  214.057408] nvme nvme0: I/O 253 QID 7 timeout, reset controller
[  216.422702] nvme nvme0: I/O 254 QID 7 timeout, disable controller
[  216.438524] nvme nvme0: I/O 255 QID 7 timeout, disable controller
[  216.453428] nvme nvme0: I/O 496 QID 8 timeout, disable controller
[  216.469426] nvme nvme0: I/O 497 QID 8 timeout, disable controller
[  216.484435] nvme nvme0: I/O 498 QID 8 timeout, disable controller
[  216.496417] nvme nvme0: I/O 499 QID 8 timeout, disable controller
[  216.511418] nvme nvme0: I/O 500 QID 8 timeout, disable controller
[  249.830165] nvme nvme0: I/O 253 QID 7 timeout, completion polled
[  249.836901] nvme nvme0: I/O 254 QID 7 timeout, disable controller
[  249.858307] nvme nvme0: I/O 255 QID 7 timeout, disable controller
[  249.874172] nvme nvme0: I/O 496 QID 8 timeout, disable controller
[  249.887164] nvme nvme0: I/O 497 QID 8 timeout, disable controller
[  249.903175] nvme nvme0: I/O 498 QID 8 timeout, disable controller
[  249.918171] nvme nvme0: I/O 499 QID 8 timeout, disable controller
[  249.934172] nvme nvme0: I/O 500 QID 8 timeout, disable controller
[  369.695234] INFO: task kworker/u25:8:472 blocked for more than 120 seconds.
[  369.703012]   Not tainted 4.17.0-rc7 #1
[  369.707687] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  369.716433] kworker/u25:8   D0   472  2 0x8000
[  369.716443] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[  369.716445] Call Trace:
[  369.719189]  ? __schedule+0x290/0x870
[  369.723280]  schedule+0x32/0x80
[  369.726791]  blk_mq_freeze_queue_wait+0x46/0xb0
[  369.731853]  ? remove_wait_queue+0x60/0x60
[  369.736435]  nvme_wait_freeze+0x2e/0x50 [nvme_core]
[  369.741876]  nvme_reset_work+0x819/0xd83 [nvme]
[  369.746942]  ? __switch_to+0xa8/0x4f0
[  369.751035]  process_one_work+0x158/0x360
[  369.755513]  worker_thread+0x47/0x3e0
[  369.759604]  kthread+0xf8/0x130
[  369.763114]  ? max_active_store+0x80/0x80
[  369.767590]  ? kthread_bind+0x10/0x10
[  369.771673]  ret_from_fork+0x35/0x40
[  369.775687] INFO: task fio:1826 blocked for more than 120 seconds.
[  369.782588]   Not tainted 4.17.0-rc7 #1
[  369.787259] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  369.796002] fio D0  1826  1 0x0080
[  369.796004] Call Trace:
[  369.798736]  ? __schedule+0x290/0x870
[  369.802826]  schedule+0x32/0x80
[  369.806338]  io_schedule+0x12/0x40
[  369.810139]  __blkdev_direct_IO_simple+0x1d3/0x310
[  369.815493]  ? bdput+0x10/0x10
[  369.818905]  blkdev_direct_IO+0x38e/0x3f0
[  369.823383]  ? __blkdev_direct_IO_simple+0x254/0x310
[  369.828929]  generic_file_direct_write+0xcc/0x170
[  369.834187]  __generic_file_write_iter+0xb7/0x1c0
[  369.839441]  blkdev_write_iter+0x9b/0x120
[  369.8

Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices

2018-05-30 Thread Christoph Hellwig
While I really don't want drivers to change the I/O scheduler themselves,
we have a class of devices (zoned) that don't work at all with certain
I/O schedulers.  The kernel not choosing something sane and requiring
user workarounds is just silly.


Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices

2018-05-30 Thread Christoph Hellwig
On Wed, May 30, 2018 at 06:22:04AM +, Damien Le Moal wrote:
> That would necessitate splitting elevator_init() into a generic elevator
> initialization function setting up the elevator related fields of the
> request queue and a second part setting up the default elevator (e.g.
> elevator_set_default()). Doing so, the function elevator_set_default()
> could be called later in the device initialization sequence, after
> information from the device has been obtained. It would make choosing a
> sane default elevator much cleaner.

For blk-mq this makes complete sense; untested series here:


http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/elevator_init

For the legacy case we always need at least some I/O scheduler,
so we can't just defer it.  But I think we could still switch to
deadline in blk_register_queue if we found a zoned device.
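
To make the idea concrete, a rough sketch of a deferred default-elevator hook
is shown below. This is purely illustrative and is not taken from the
elevator_init branch linked above; the function name, placement, and fallback
behaviour are assumptions.

/*
 * Hypothetical sketch: called once device probing has revealed whether the
 * queue is zoned, instead of picking the default elevator from
 * blk_mq_init_allocated_queue()/blk_init_allocated_queue().
 */
static void elevator_set_default(struct request_queue *q)
{
        if (q->elevator)        /* an elevator was already set up explicitly */
                return;

        if (blk_queue_is_zoned(q)) {
                /* Only the deadline schedulers serialize zone writes. */
                elevator_init(q, q->mq_ops ? "mq-deadline" : "deadline");
                return;
        }

        /* Otherwise keep whatever default a NULL name would select. */
        elevator_init(q, NULL);
}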


Re: [bug report] IO blocked after offline CPU during fio on nvme SSD

2018-05-30 Thread Ming Lei
On Wed, May 30, 2018 at 3:53 PM, Yi Zhang  wrote:
> Hi
> I found this issue during my NVMe test on 4.17.0-rc7. Here are the reproducer
> and dmesg log; let me know if you need more info. Thanks.
>
> Reproduce steps:
> #!/bin/bash
> set -x
> fio -filename=/dev/nvme0n1p1 -iodepth=1 -thread -rw=randwrite -ioengine=psync 
> -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 
> -direct=1 -runtime=120 -size=-group_reporting -name=mytest -numjobs=60 &
> sleep 10
> echo 0 > /sys/devices/system/cpu/cpu1/online
> echo 0 > /sys/devices/system/cpu/cpu2/online
> echo 0 > /sys/devices/system/cpu/cpu3/online
>
> dmesg log:
> [  153.402709] IRQ 82: no longer affine to CPU1
> [  153.408596] smpboot: CPU 1 is now offline
> [  153.426422] IRQ 86: no longer affine to CPU2
> [  153.431204] IRQ 89: no longer affine to CPU2
> [  153.437054] smpboot: CPU 2 is now offline
> [  153.682095] IRQ 83: no longer affine to CPU3
> [  153.686866] IRQ 85: no longer affine to CPU3
> [  153.692712] smpboot: CPU 3 is now offline
> [  183.847661] nvme nvme0: I/O 417 QID 2 timeout, completion polled
> [  183.854410] nvme nvme0: I/O 251 QID 7 timeout, completion polled
> [  183.861133] nvme nvme0: I/O 252 QID 7 timeout, completion polled
> [  183.867851] nvme nvme0: I/O 253 QID 7 timeout, aborting
> [  183.873699] nvme nvme0: I/O 254 QID 7 timeout, aborting
> [  183.879530] nvme nvme0: Abort status: 0x0
> [  183.884017] nvme nvme0: I/O 255 QID 7 timeout, aborting
> [  183.889847] nvme nvme0: Abort status: 0x0
> [  183.894338] nvme nvme0: I/O 495 QID 8 timeout, completion polled
> [  183.901033] nvme nvme0: Abort status: 0x0
> [  183.905520] nvme nvme0: I/O 496 QID 8 timeout, aborting
> [  183.911359] nvme nvme0: I/O 497 QID 8 timeout, aborting
> [  183.917190] nvme nvme0: Abort status: 0x0
> [  183.921677] nvme nvme0: I/O 498 QID 8 timeout, aborting
> [  183.927506] nvme nvme0: Abort status: 0x0
> [  183.931994] nvme nvme0: I/O 499 QID 8 timeout, aborting
> [  183.937825] nvme nvme0: Abort status: 0x0
> [  183.942311] nvme nvme0: I/O 500 QID 8 timeout, aborting
> [  183.948141] nvme nvme0: Abort status: 0x0
> [  183.952637] nvme nvme0: Abort status: 0x0
> [  214.057408] nvme nvme0: I/O 253 QID 7 timeout, reset controller
> [  216.422702] nvme nvme0: I/O 254 QID 7 timeout, disable controller
> [  216.438524] nvme nvme0: I/O 255 QID 7 timeout, disable controller
> [  216.453428] nvme nvme0: I/O 496 QID 8 timeout, disable controller
> [  216.469426] nvme nvme0: I/O 497 QID 8 timeout, disable controller
> [  216.484435] nvme nvme0: I/O 498 QID 8 timeout, disable controller
> [  216.496417] nvme nvme0: I/O 499 QID 8 timeout, disable controller
> [  216.511418] nvme nvme0: I/O 500 QID 8 timeout, disable controller
> [  249.830165] nvme nvme0: I/O 253 QID 7 timeout, completion polled
> [  249.836901] nvme nvme0: I/O 254 QID 7 timeout, disable controller
> [  249.858307] nvme nvme0: I/O 255 QID 7 timeout, disable controller
> [  249.874172] nvme nvme0: I/O 496 QID 8 timeout, disable controller
> [  249.887164] nvme nvme0: I/O 497 QID 8 timeout, disable controller
> [  249.903175] nvme nvme0: I/O 498 QID 8 timeout, disable controller
> [  249.918171] nvme nvme0: I/O 499 QID 8 timeout, disable controller
> [  249.934172] nvme nvme0: I/O 500 QID 8 timeout, disable controller
> [  369.695234] INFO: task kworker/u25:8:472 blocked for more than 120 seconds.
> [  369.703012]   Not tainted 4.17.0-rc7 #1
> [  369.707687] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [  369.716433] kworker/u25:8   D0   472  2 0x8000
> [  369.716443] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
> [  369.716445] Call Trace:
> [  369.719189]  ? __schedule+0x290/0x870
> [  369.723280]  schedule+0x32/0x80
> [  369.726791]  blk_mq_freeze_queue_wait+0x46/0xb0
> [  369.731853]  ? remove_wait_queue+0x60/0x60
> [  369.736435]  nvme_wait_freeze+0x2e/0x50 [nvme_core]
> [  369.741876]  nvme_reset_work+0x819/0xd83 [nvme]
> [  369.746942]  ? __switch_to+0xa8/0x4f0
> [  369.751035]  process_one_work+0x158/0x360
> [  369.755513]  worker_thread+0x47/0x3e0
> [  369.759604]  kthread+0xf8/0x130
> [  369.763114]  ? max_active_store+0x80/0x80
> [  369.767590]  ? kthread_bind+0x10/0x10
> [  369.771673]  ret_from_fork+0x35/0x40
> [  369.775687] INFO: task fio:1826 blocked for more than 120 seconds.
> [  369.782588]   Not tainted 4.17.0-rc7 #1
> [  369.787259] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [  369.796002] fio D0  1826  1 0x0080
> [  369.796004] Call Trace:
> [  369.798736]  ? __schedule+0x290/0x870
> [  369.802826]  schedule+0x32/0x80
> [  369.806338]  io_schedule+0x12/0x40
> [  369.810139]  __blkdev_direct_IO_simple+0x1d3/0x310
> [  369.815493]  ? bdput+0x10/0x10
> [  369.818905]  blkdev_direct_IO+0x38e/0x3f0
> [  369.823383]  ? __blkdev_direct_IO_simple+0x254/0x310
> [  369.828929]  generic_file_direct_write+0xcc/0x170
> [  369.834187]  __generic_file_w

[bug report] IO blocked after offline CPU during fio on nvme SSD

2018-05-30 Thread Yi Zhang
Hi
I found this issue during my NVMe test on 4.17.0-rc7. Here are the reproducer
and dmesg log; let me know if you need more info. Thanks.

Reproduce steps:
#!/bin/bash
set -x
fio -filename=/dev/nvme0n1p1 -iodepth=1 -thread -rw=randwrite -ioengine=psync 
-bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 
-direct=1 -runtime=120 -size=-group_reporting -name=mytest -numjobs=60 &
sleep 10
echo 0 > /sys/devices/system/cpu/cpu1/online
echo 0 > /sys/devices/system/cpu/cpu2/online
echo 0 > /sys/devices/system/cpu/cpu3/online

dmesg log:
[  153.402709] IRQ 82: no longer affine to CPU1
[  153.408596] smpboot: CPU 1 is now offline
[  153.426422] IRQ 86: no longer affine to CPU2
[  153.431204] IRQ 89: no longer affine to CPU2
[  153.437054] smpboot: CPU 2 is now offline
[  153.682095] IRQ 83: no longer affine to CPU3
[  153.686866] IRQ 85: no longer affine to CPU3
[  153.692712] smpboot: CPU 3 is now offline
[  183.847661] nvme nvme0: I/O 417 QID 2 timeout, completion polled
[  183.854410] nvme nvme0: I/O 251 QID 7 timeout, completion polled
[  183.861133] nvme nvme0: I/O 252 QID 7 timeout, completion polled
[  183.867851] nvme nvme0: I/O 253 QID 7 timeout, aborting
[  183.873699] nvme nvme0: I/O 254 QID 7 timeout, aborting
[  183.879530] nvme nvme0: Abort status: 0x0
[  183.884017] nvme nvme0: I/O 255 QID 7 timeout, aborting
[  183.889847] nvme nvme0: Abort status: 0x0
[  183.894338] nvme nvme0: I/O 495 QID 8 timeout, completion polled
[  183.901033] nvme nvme0: Abort status: 0x0
[  183.905520] nvme nvme0: I/O 496 QID 8 timeout, aborting
[  183.911359] nvme nvme0: I/O 497 QID 8 timeout, aborting
[  183.917190] nvme nvme0: Abort status: 0x0
[  183.921677] nvme nvme0: I/O 498 QID 8 timeout, aborting
[  183.927506] nvme nvme0: Abort status: 0x0
[  183.931994] nvme nvme0: I/O 499 QID 8 timeout, aborting
[  183.937825] nvme nvme0: Abort status: 0x0
[  183.942311] nvme nvme0: I/O 500 QID 8 timeout, aborting
[  183.948141] nvme nvme0: Abort status: 0x0
[  183.952637] nvme nvme0: Abort status: 0x0
[  214.057408] nvme nvme0: I/O 253 QID 7 timeout, reset controller
[  216.422702] nvme nvme0: I/O 254 QID 7 timeout, disable controller
[  216.438524] nvme nvme0: I/O 255 QID 7 timeout, disable controller
[  216.453428] nvme nvme0: I/O 496 QID 8 timeout, disable controller
[  216.469426] nvme nvme0: I/O 497 QID 8 timeout, disable controller
[  216.484435] nvme nvme0: I/O 498 QID 8 timeout, disable controller
[  216.496417] nvme nvme0: I/O 499 QID 8 timeout, disable controller
[  216.511418] nvme nvme0: I/O 500 QID 8 timeout, disable controller
[  249.830165] nvme nvme0: I/O 253 QID 7 timeout, completion polled
[  249.836901] nvme nvme0: I/O 254 QID 7 timeout, disable controller
[  249.858307] nvme nvme0: I/O 255 QID 7 timeout, disable controller
[  249.874172] nvme nvme0: I/O 496 QID 8 timeout, disable controller
[  249.887164] nvme nvme0: I/O 497 QID 8 timeout, disable controller
[  249.903175] nvme nvme0: I/O 498 QID 8 timeout, disable controller
[  249.918171] nvme nvme0: I/O 499 QID 8 timeout, disable controller
[  249.934172] nvme nvme0: I/O 500 QID 8 timeout, disable controller
[  369.695234] INFO: task kworker/u25:8:472 blocked for more than 120 seconds.
[  369.703012]   Not tainted 4.17.0-rc7 #1
[  369.707687] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  369.716433] kworker/u25:8   D0   472  2 0x8000
[  369.716443] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[  369.716445] Call Trace:
[  369.719189]  ? __schedule+0x290/0x870
[  369.723280]  schedule+0x32/0x80
[  369.726791]  blk_mq_freeze_queue_wait+0x46/0xb0
[  369.731853]  ? remove_wait_queue+0x60/0x60
[  369.736435]  nvme_wait_freeze+0x2e/0x50 [nvme_core]
[  369.741876]  nvme_reset_work+0x819/0xd83 [nvme]
[  369.746942]  ? __switch_to+0xa8/0x4f0
[  369.751035]  process_one_work+0x158/0x360
[  369.755513]  worker_thread+0x47/0x3e0
[  369.759604]  kthread+0xf8/0x130
[  369.763114]  ? max_active_store+0x80/0x80
[  369.767590]  ? kthread_bind+0x10/0x10
[  369.771673]  ret_from_fork+0x35/0x40
[  369.775687] INFO: task fio:1826 blocked for more than 120 seconds.
[  369.782588]   Not tainted 4.17.0-rc7 #1
[  369.787259] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[  369.796002] fio D0  1826  1 0x0080
[  369.796004] Call Trace:
[  369.798736]  ? __schedule+0x290/0x870
[  369.802826]  schedule+0x32/0x80
[  369.806338]  io_schedule+0x12/0x40
[  369.810139]  __blkdev_direct_IO_simple+0x1d3/0x310
[  369.815493]  ? bdput+0x10/0x10
[  369.818905]  blkdev_direct_IO+0x38e/0x3f0
[  369.823383]  ? __blkdev_direct_IO_simple+0x254/0x310
[  369.828929]  generic_file_direct_write+0xcc/0x170
[  369.834187]  __generic_file_write_iter+0xb7/0x1c0
[  369.839441]  blkdev_write_iter+0x9b/0x120
[  369.843923]  __vfs_write+0x100/0x180
[  369.847917]  vfs_write+0xad/0x1a0
[  369.851622]  ? __audit_syscall_exit+0x213/0x330
[  369.856682]  ksys_pwrite64+0x62/0x90
[  3

blktests block/019 lead system hang

2018-05-30 Thread Yi Zhang
Hi Keith
I found that blktests block/019 can also hang my NVMe server with 4.17.0-rc7;
let me know if you need more info. Thanks.

Server: Dell R730xd
NVMe SSD: 85:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd 
NVMe SSD Controller 172X (rev 01)

Console log:
Kernel 4.17.0-rc7 on an x86_64

storageqe-62 login: [ 6043.121834] run blktests block/019 at 2018-05-30 03:16:34
[ 6049.108476] {1}[Hardware Error]: Hardware error from APEI Generic Hardware 
Error Source: 3
[ 6049.108478] {1}[Hardware Error]: event severity: fatal
[ 6049.108479] {1}[Hardware Error]:  Error 0, type: fatal
[ 6049.108481] {1}[Hardware Error]:   section_type: PCIe error
[ 6049.108482] {1}[Hardware Error]:   port_type: 6, downstream switch port
[ 6049.108483] {1}[Hardware Error]:   version: 1.16
[ 6049.108484] {1}[Hardware Error]:   command: 0x0407, status: 0x0010
[ 6049.108485] {1}[Hardware Error]:   device_id: :83:05.0
[ 6049.108486] {1}[Hardware Error]:   slot: 0
[ 6049.108487] {1}[Hardware Error]:   secondary_bus: 0x85
[ 6049.108488] {1}[Hardware Error]:   vendor_id: 0x10b5, device_id: 0x8734
[ 6049.108489] {1}[Hardware Error]:   class_code: 000406
[ 6049.108489] {1}[Hardware Error]:   bridge: secondary_status: 0x, 
control: 0x0003
[ 6049.108491] Kernel panic - not syncing: Fatal hardware error!
[ 6049.108514] Kernel Offset: 0x2580 from 0x8100 (relocation 
range: 0x8000-0xbfff)


Best Regards,
  Yi Zhang