Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices
Christoph,

On 5/30/18 17:47, Christoph Hellwig wrote:
> On Wed, May 30, 2018 at 06:22:04AM +, Damien Le Moal wrote:
>> That would necessitate splitting elevator_init() into a generic elevator
>> initialization function setting up the elevator related fields of the
>> request queue, and a second part setting up the default elevator (e.g.
>> elevator_set_default()). Doing so, the function elevator_set_default()
>> could be called later in the device initialization sequence, after
>> information from the device has been obtained. It would make choosing a
>> sane default elevator much cleaner.
>
> For blk-mq this makes complete sense, untested series here:
>
> http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/elevator_init

Just reviewed and tried the series. It looks OK to me. Feel free to post it,
adding:

Reviewed-by: Damien Le Moal
Tested-by: Damien Le Moal

> For the legacy case we always need at least some I/O scheduler,
> so we can't just defer it. But I think we could still switch to
> deadline in blk_register_queue if we found a zoned device.

Argh... Yes, I forgot that. At least noop would be needed, so there is no
real point in changing anything there, I guess. As for the switch to
deadline once the device is identified as zoned, I would like to see that
added. That is Jens' decision, though.

Best.

--
Damien Le Moal, Western Digital
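To make the deferred-default idea concrete, here is a user-space sketch of the kind of decision an elevator_set_default()-style helper could make once the device type is known. The function name and the non-zoned defaults below are illustrative assumptions, not the kernel's actual code:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical, simplified model of picking a default elevator after
 * device probing.  The real logic lives in block/elevator.c; the
 * non-zoned defaults here are placeholders only.
 */
static const char *default_elevator(bool is_mq, bool is_zoned)
{
	if (is_mq) {
		/* blk-mq can defer the choice until the device is identified */
		return is_zoned ? "mq-deadline" : "none";
	}
	/* the legacy path always needs some scheduler at init time */
	return is_zoned ? "deadline" : "cfq";
}
```

The point of the split is exactly that the `is_zoned` input only becomes available after device identification, which is too late for the current single-shot elevator_init().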
Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices
On 5/31/18 00:52, Jens Axboe wrote:
> On 5/30/18 9:45 AM, Jeff Moyer wrote:
>> Jens Axboe writes:
>>> On 5/30/18 9:06 AM, Jeff Moyer wrote:
>>>> Hi, Jens,
>>>>
>>>> Jens Axboe writes:
>>>>> On 5/30/18 2:49 AM, Christoph Hellwig wrote:
>>>>>> While I really don't want drivers to change the I/O scheduler
>>>>>> themselves, we have a class of devices (zoned) that don't work at
>>>>>> all with certain I/O schedulers. The kernel not choosing something
>>>>>> sane and requiring user workarounds is just silly.
>>>>>
>>>>> They work just fine for probing and reading purposes. There's
>>>>> absolutely no reason why we can't handle these special snowflakes
>>>>> with a udev rule.
>>>>
>>>> udev rules aren't shipped with the kernel, so it makes it hard to keep
>>>> them in sync. In this instance, I'm not sure anyone made an effort to
>>>> notify distributions that a udev rule was even necessary. (Is there a
>>>> way of notifying distributions about kernel changes that require new
>>>> udev rules, other than emailing each list individually?)
>>>>
>>>> From a technical standpoint, I totally agree with you, Jens. However,
>>>> I think the user experience sucks. 4.15 worked by default, 4.16
>>>> doesn't. The result will be bug reports from users (to the drive
>>>> vendors, distribution bugzillas, here, etc.).
>>>
>>> I would imagine that most folks get their updates from a distro of some
>>> sort, in which case there's absolutely nothing stopping the distro from
>>> shipping updated rules for the 4.16 kernel update.
>>
>> The problem is distros have already shipped that kernel. :)
>
> Ship an update, then! I'm sure that most people would prefer a simple
> rule update over a kernel update. And you have to do one of them to
> resolve this anyway.
>
>>> But what's the regression? 4.15 had no zone write locking at all.
>>
>> The zone write locking was done in the sd driver prior to 4.16. See
>> commit 39051dd85f287 ("scsi: sd: Remove zone write locking") for where
>> it was removed. That means these devices "just worked" with all I/O
>> schedulers.
>
> Gotcha, makes sense.

>>>> Moving on, assuming your mind is made up... I'm not sure how much
>>>> logic should go into the udev rule. As mentioned, this limitation was
>>>> introduced in 4.16, and Damien has plans to lift the restriction in
>>>> future kernels. Because distributions tend to cherry-pick changes,
>>>> making decisions on whether a feature exists based solely on kernel
>>>> version is usually not a great thing. My inclination would be to just
>>>> always force deadline for host-managed SMR drives. These drives
>>>> aren't that popular, after all. Any opinions on this?
>>>
>>> The problem is that it's tied to an I/O scheduler, which ends up
>>> causing issues like this, since users are free to select a different
>>> scheduler. Then things break. Granted, in this case, some
>>> extraordinarily shitty hardware even broke. That is on the hardware,
>>> not the kernel; that kind of breakage should not occur.
>>
>> If the firmware problem was widespread, I think we'd try to avoid it. I
>> have no reason to believe that is the case, though.
>>
>> Damien made the argument that the user should be able to select an I/O
>> scheduler that doesn't perform the write locking, because a well-behaved
>> application could theoretically make use of it. I think this is a weak
>> argument, given that dm-zoned doesn't even support such a mode.
>
> Sure, the user should be able to select whatever they want. Maybe they
> are strictly using it through bsg or a similar interface, in which
> case no scheduler or kernel support is really needed to drive it.

dm-zoned and f2fs (with more file systems coming) use these drives through
the regular block/bio stack. The scheduler is involved and needs to be set
correctly. For applications not relying on these components and doing raw
disk I/O, I see two camps in the field: the sg/bsg camp and the regular
POSIX system calls camp. For the former, since the kernel will have minimal
interaction with the commands, the application is on its own; control over
write ordering has to be coded in.

But for the latter case, which is orders of magnitude easier to use, the
scheduler needs to be set correctly, or the same pressure falls on the
application to "do the right thing", which conflicts with the benefit of
this access path (simplicity).

In the end, if the drives are used directly from applications, I think it
is OK to expect a correct system setting when deadline is required, so udev
is fine. But for kernel components like dm-zoned, a sane default being set
from the start is my preferred choice.

Best regards.

--
Damien Le Moal, Western Digital
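For reference, a rule along the lines being discussed could look like the sketch below. This is a hypothetical example, not the exact rule Damien provided later in the thread; it relies on the `queue/zoned` sysfs attribute reporting `host-managed` for these drives, and on blk-mq kernels the scheduler name would be `mq-deadline` instead:

```
# Hypothetical udev rule sketch: force the deadline scheduler on
# host-managed zoned (SMR) block devices.
ACTION=="add|change", KERNEL=="sd*", ATTR{queue/zoned}=="host-managed", \
    ATTR{queue/scheduler}="deadline"
```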
Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices
On 5/31/18 00:45, Jeff Moyer wrote:
> Jens Axboe writes:
>> But what's the regression? 4.15 had no zone write locking at all.
>
> The zone write locking was done in the sd driver prior to 4.16. See
> commit 39051dd85f287 ("scsi: sd: Remove zone write locking") for where
> it was removed. That means these devices "just worked" with all I/O
> schedulers.

Yes, they did "just work", but that was not an ideal solution either
because of the performance implications: sequential writes to a single
zone were stalling the dispatch queue, waiting for the dispatched write
to the locked zone to complete. That was not optimal at all (sure, the
drive-side write caching was hiding this a bit, but still).

>>> Moving on, assuming your mind is made up... I'm not sure how much
>>> logic should go into the udev rule. As mentioned, this limitation was
>>> introduced in 4.16, and Damien has plans to lift the restriction in
>>> future kernels. Because distributions tend to cherry-pick changes,
>>> making decisions on whether a feature exists based solely on kernel
>>> version is usually not a great thing. My inclination would be to just
>>> always force deadline for host-managed SMR drives. These drives
>>> aren't that popular, after all. Any opinions on this?
>>
>> The problem is that it's tied to an I/O scheduler, which ends up
>> causing issues like this, since users are free to select a different
>> scheduler. Then things break. Granted, in this case, some
>> extraordinarily shitty hardware even broke. That is on the hardware,
>> not the kernel; that kind of breakage should not occur.
>
> If the firmware problem was widespread, I think we'd try to avoid it. I
> have no reason to believe that is the case, though.

First time in my career that I have heard of a disk breaking a system
BIOS. I will ask our system test lab to investigate this. Jeff, let's
take this discussion off-list since it is not kernel related.
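The stall described above can be illustrated with a toy user-space model of zone write locking (names, the zone count, and the API are illustrative only; the real implementation lives in the deadline schedulers): a write to a locked zone cannot be dispatched until the previous write to that zone completes.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: at most one write in flight per zone. */
#define NR_ZONES 4

static bool zone_locked[NR_ZONES];

/* returns true if the write was dispatched, false if it must wait */
static bool dispatch_write(int zone)
{
	if (zone_locked[zone])
		return false;           /* dispatch stalls on the locked zone */
	zone_locked[zone] = true;       /* take the zone write lock */
	return true;
}

static void complete_write(int zone)
{
	zone_locked[zone] = false;      /* release the lock on completion */
}
```

With queued sequential writes to a single zone, every dispatch after the first returns false until the completion arrives, which is exactly the dispatch-queue stall (partially hidden by drive-side write caching) mentioned above.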
> Damien made the argument that the user should be able to select an I/O
> scheduler that doesn't perform the write locking, because a well-behaved
> application could theoretically make use of it. I think this is a weak
> argument, given that dm-zoned doesn't even support such a mode.

Yes, a little weak. That is definitely not the main use case I am seeing
with customers. That said, these drives are starting to be used with other
feature sets enabled, such as I/O priorities. Considering their size, this
is a very interesting feature for controlling access latency. Deadline and
mq-deadline will not act on I/O priorities; cfq (and bfq?) will.
Potentially better results can be achieved, but as these schedulers do not
support zone write locking, the application needs to be careful with its
per-zone write queue depth, and in doing so avoid tripping over unintended
kernel-level command reordering. I see this as a valid enough use case not
to "lock down" the scheduler to deadline only, and to allow other
schedulers too. Yet deadline should be the default until an application
asks for something else.

dm-zoned and f2fs (btrfs in the lab, too) "assume" that the underlying
stack does the right thing, which is of course true (for now) only if the
deadline scheduler is enabled. A sane default set on device initialization
would be nice to have, and would avoid potential headaches with rule
ordering with regard to component initialization (not to mention that it
would make booting from these disks possible).

> I definitely see this udev rule as a temporary workaround.

I agree. In fact, I see the deadline-based zone write locking itself as a
temporary workaround. For now, I do not see any other clean method that
covers both the mq and legacy paths. Considering only mq, we discussed
interesting possibilities at LSFMM using dedicated write queues.
That could be handled generically, and would remove the dependency on the
scheduler while also extending the support to open-channel SSDs. My guess
is that no major change to this write locking will happen on the legacy
path, which hopefully will go away soon (?). But there are options going
forward with blk-mq.

>> So now we're stuck with this temporary situation which needs a
>> work-around. I don't think it's a terrible idea to have a rule that
>> just sets deadline/mq-deadline for an SMR device regardless of what
>> kernel it is running on. It'll probably never be a bad default.

I agree. But since other kernel components (dm-zoned, file systems, and
the entire fs/block_dev.c direct I/O write path) depend on the scheduler
being set to something sane, setting it early in device initialization,
before the device is grabbed by a file system or a device mapper, would
definitely be nice to have.

> OK. Barring future input to the contrary, I'll work to get updates into
> Fedora, at least. I've CC'd Colin and Hannes. I'm not sure who else to
> include.
>
> FYI, below is the udev rule Damien had provided to Bryan. I'm not sure
> about the KERNEL=="sd[a-z]" bit, that may ne
Re: [PATCH 12/12] nvme: limit warnings from nvme_identify_ns
Looks good, Reviewed-by: Sagi Grimberg
Re: [PATCH 11/12] nvme: use the changed namespaces list log to clear ns data changed AENs
Looks good, Reviewed-by: Sagi Grimberg
Re: [PATCH 10/12] nvme: mark nvme_queue_scan static
Looks good, Reviewed-by: Sagi Grimberg
Re: [PATCH 10/12] btrfs: convert to bioset_init()/mempool_init()
On 20 May 2018, at 18:25, Kent Overstreet wrote:
> Signed-off-by: Kent Overstreet
> ---
>  fs/btrfs/extent_io.c | 25 +++++++++++--------------
>  1 file changed, 11 insertions(+), 14 deletions(-)

Assuming the rest of the comments/questions for the series get worked
out, the btrfs side of this looks fine.

Acked-by: Chris Mason
Re: [PATCH 08/12] nvmet: mask pending AERs
Reviewed-by: Sagi Grimberg
Re: [PATCH 09/12] nvme: submit AEN event configuration on startup
Looks good, Reviewed-by: Sagi Grimberg
Re: [PATCH 07/12] nvmet: Add AEN configuration support
Looks good, Reviewed-by: Sagi Grimberg
Re: [PATCH 06/12] nvmet: implement the changed namespaces log
Looks good, Reviewed-by: Sagi Grimberg
Re: [PATCH 05/12] nvmet: split log page implementation
Looks good, Reviewed-by: Sagi Grimberg
Re: [PATCH 04/12] nvmet: add a new nvmet_zero_sgl helper
> +u16 nvmet_zero_sgl(struct nvmet_req *req, off_t off, size_t len)

I would personally prefer nvmet_clear_sgl, but I'm not hard-headed about
it.

Reviewed-by: Sagi Grimberg
Re: [PATCH 03/12] nvme.h: add AER configuration symbols
Looks good, Reviewed-by: Sagi Grimberg
Re: [PATCH 02/12] nvme.h: add the changed namespace list log
Looks good, Reviewed-by: Sagi Grimberg
Re: [PATCH 01/12] nvme.h: untangle AEN notice definitions
Looks good, Reviewed-by: Sagi Grimberg
[GIT PULL] Single fix for 4.17 final
Hi Linus,

Just a single fix that should make it into this release, fixing a
regression with T10-DIF on NVMe.

Please pull!

  git://git.kernel.dk/linux-block.git tags/for-linus-20180530

Jens Axboe (1):
      Merge branch 'nvme-4.17' of git://git.infradead.org/nvme into for-linus

Max Gurtovoy (1):
      nvme: fix extended data LBA supported setting

 drivers/nvme/host/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--
Jens Axboe
Re: [PATCH] blk-throttle: return proper bool type to caller instead of 0/1
On 5/29/18 4:32 AM, Chengguang Xu wrote:
> Change to return true/false only for bool type return code.

Looks good, applied.

--
Jens Axboe
Re: [PATCH 06/12] nvmet: implement the changed namespaces log
On 05/30/2018 09:45 AM, Christoph Hellwig wrote:
> Just keep a per-controller buffer of changed namespaces and copy it out
> in the get log page implementation.
>
> Signed-off-by: Christoph Hellwig
> ---
[...]
> +static void nvmet_add_to_changed_ns_log(struct nvmet_ctrl *ctrl, __le32 nsid)
> +{
> +	mutex_lock(&ctrl->lock);
> +	if (ctrl->nr_changed_ns < NVME_MAX_CHANGED_NAMESPACES) {

Minor nitpick here: if ctrl->nr_changed_ns is exactly
NVME_MAX_CHANGED_NAMESPACES and the new nsid is already in the list, this
will skip down to the 0xffffffff case below, even though it could have
just left the list as-is. I don't know whether this is a problem in
practice; reporting 0xffffffff is probably always safe, since the host
will presumably treat that as "rescan everything".

> +		u32 i;
> +
> +		for (i = 0; i < ctrl->nr_changed_ns; i++)
> +			if (ctrl->changed_ns_list[i] == nsid)
> +				goto out_unlock;
> +		ctrl->changed_ns_list[ctrl->nr_changed_ns++] = nsid;
> +	} else if (ctrl->nr_changed_ns == NVME_MAX_CHANGED_NAMESPACES) {
> +		ctrl->changed_ns_list[0] = cpu_to_le32(0xffffffff);
> +		ctrl->nr_changed_ns = U32_MAX;
> +	}
> +out_unlock:
> +	mutex_unlock(&ctrl->lock);
> +}

Thanks,
-- Daniel
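The edge case is easy to demonstrate with a shrunken user-space model of the function (no locking; MAX_NS stands in for NVME_MAX_CHANGED_NAMESPACES, reduced so the list fills quickly):

```c
#include <assert.h>
#include <stdint.h>

/* User-space model of nvmet_add_to_changed_ns_log() for illustration. */
#define MAX_NS      2           /* shrunk from NVME_MAX_CHANGED_NAMESPACES */
#define NS_OVERFLOW 0xffffffffu /* "all namespaces changed" marker */

static uint32_t changed_ns[MAX_NS];
static uint32_t nr_changed;

static void add_to_changed_ns_log(uint32_t nsid)
{
	uint32_t i;

	if (nr_changed < MAX_NS) {
		for (i = 0; i < nr_changed; i++)
			if (changed_ns[i] == nsid)
				return;         /* already recorded */
		changed_ns[nr_changed++] = nsid;
	} else if (nr_changed == MAX_NS) {
		/* list full: collapse to the overflow marker, even when
		 * nsid is already in the list (the nitpick above) */
		changed_ns[0] = NS_OVERFLOW;
		nr_changed = UINT32_MAX;
	}
}
```

Adding a nsid that is already present to a full list still collapses it to the overflow marker, which is the behavior Daniel points out: harmless (the host rescans everything), just slightly pessimistic.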
Re: [PATCH] blk-mq: only iterate over inflight requests in blk_mq_tagset_busy_iter
On 5/30/18 10:51 AM, Christoph Hellwig wrote:
> We already check for started commands in all callbacks, but we should
> also protect against already completed commands. Do this by taking
> the checks to common code.

Looks good, applied.

--
Jens Axboe
Re: [PATCH] nbd: clear DISCONNECT_REQUESTED flag once disconnection occurs.
On 5/30/18 10:45 AM, kvi...@gmail.com wrote:
> From: Kevin Vigor
>
> When a userspace client requests a NBD device be disconnected, the
> DISCONNECT_REQUESTED flag is set. While this flag is set, the driver
> will not inform userspace when a connection is closed.
>
> Unfortunately the flag was never cleared, so once a disconnect was
> requested the driver would thereafter never tell userspace about a
> closed connection. Thus when connections failed due to timeout, no
> attempt to reconnect was made and eventually the device would fail.
>
> Fix by clearing the DISCONNECT_REQUESTED flag (and setting the
> DISCONNECTED flag) once all connections are closed.

Applied, thanks Kevin.

--
Jens Axboe
Re: [PATCH] nbd: clear DISCONNECT_REQUESTED flag once disconnection occurs.
On Wed, May 30, 2018 at 10:45:11AM -0600, kvi...@gmail.com wrote:
> From: Kevin Vigor
>
> When a userspace client requests a NBD device be disconnected, the
> DISCONNECT_REQUESTED flag is set. While this flag is set, the driver
> will not inform userspace when a connection is closed.
>
> Unfortunately the flag was never cleared, so once a disconnect was
> requested the driver would thereafter never tell userspace about a
> closed connection. Thus when connections failed due to timeout, no
> attempt to reconnect was made and eventually the device would fail.
>
> Fix by clearing the DISCONNECT_REQUESTED flag (and setting the
> DISCONNECTED flag) once all connections are closed.
>
> Changes relative to v1 (https://marc.info/?l=linux-block&m=152763540418902):
>
> * remove pointless wake_up() -> wake_up_all() change.

Usually you want to put the changelog after the --- bit below so we can
all see it, but it doesn't end up in the git changelog.

Reviewed-by: Josef Bacik

Thanks,
Josef
Re: [PATCH] blk-mq: only iterate over inflight requests in blk_mq_tagset_busy_iter
On Wed, May 30, 2018 at 06:51:00PM +0200, Christoph Hellwig wrote:
> We already check for started commands in all callbacks, but we should
> also protect against already completed commands. Do this by taking
> the checks to common code.
>
> Signed-off-by: Christoph Hellwig

Acked-by: Josef Bacik

for the nbd portion. Thanks,

Josef
Re: [PATCH RESEND] blk-throttle: fix potential NULL pointer dereference in throtl_select_dispatch
On 5/29/18 2:29 AM, Liu Bo wrote:
> tg in throtl_select_dispatch is used first and then do check. Since tg
> may be NULL, it has potential NULL pointer dereference risk. So fix
> it.

Applied, thanks.

--
Jens Axboe
[PATCH] blk-mq: only iterate over inflight requests in blk_mq_tagset_busy_iter
We already check for started commands in all callbacks, but we should
also protect against already completed commands. Do this by taking
the checks to common code.

Signed-off-by: Christoph Hellwig
---
 block/blk-mq-tag.c                |  2 +-
 drivers/block/mtip32xx/mtip32xx.c | 12 ++----------
 drivers/block/nbd.c               |  5 +----
 drivers/nvme/host/core.c          |  3 ---
 drivers/nvme/host/fc.c            |  3 ---
 5 files changed, 4 insertions(+), 21 deletions(-)

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index a4e58fc28a06..70356a2a11ab 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -271,7 +271,7 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
 	 * test and set the bit before assigning ->rqs[].
 	 */
 	rq = tags->rqs[bitnr];
-	if (rq)
+	if (rq && blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT)
 		iter_data->fn(rq, iter_data->data, reserved);
 
 	return true;
diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index 95657b814543..c73626decb46 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -2725,15 +2725,11 @@ static void mtip_softirq_done_fn(struct request *rq)
 	blk_mq_end_request(rq, cmd->status);
 }
 
-static void mtip_abort_cmd(struct request *req, void *data,
-			   bool reserved)
+static void mtip_abort_cmd(struct request *req, void *data, bool reserved)
 {
 	struct mtip_cmd *cmd = blk_mq_rq_to_pdu(req);
 	struct driver_data *dd = data;
 
-	if (!blk_mq_request_started(req))
-		return;
-
 	dbg_printk(MTIP_DRV_NAME " Aborting request, tag = %d\n", req->tag);
 
 	clear_bit(req->tag, dd->port->cmds_to_issue);
@@ -2741,14 +2737,10 @@ static void mtip_abort_cmd(struct request *req, void *data,
 	mtip_softirq_done_fn(req);
 }
 
-static void mtip_queue_cmd(struct request *req, void *data,
-			   bool reserved)
+static void mtip_queue_cmd(struct request *req, void *data, bool reserved)
 {
 	struct driver_data *dd = data;
 
-	if (!blk_mq_request_started(req))
-		return;
-
 	set_bit(req->tag, dd->port->cmds_to_issue);
 	blk_abort_request(req);
 }
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 8860b24098bc..17d1d97d491e 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -666,11 +666,8 @@ static void recv_work(struct work_struct *work)
 
 static void nbd_clear_req(struct request *req, void *data, bool reserved)
 {
-	struct nbd_cmd *cmd;
+	struct nbd_cmd *cmd = blk_mq_rq_to_pdu(req);
 
-	if (!blk_mq_request_started(req))
-		return;
-	cmd = blk_mq_rq_to_pdu(req);
 	cmd->status = BLK_STS_IOERR;
 	blk_mq_complete_request(req);
 }
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 7ad3cfc9d4e1..1744d8dcadfd 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -251,9 +251,6 @@ EXPORT_SYMBOL_GPL(nvme_complete_rq);
 
 void nvme_cancel_request(struct request *req, void *data, bool reserved)
 {
-	if (!blk_mq_request_started(req))
-		return;
-
 	dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
 				"Cancelling I/O %d", req->tag);
 
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index ac35a80f5532..0bad65803271 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2393,9 +2393,6 @@ nvme_fc_terminate_exchange(struct request *req, void *data, bool reserved)
 	struct nvme_fc_ctrl *ctrl = to_fc_ctrl(nctrl);
 	struct nvme_fc_fcp_op *op = blk_mq_rq_to_pdu(req);
 
-	if (!blk_mq_request_started(req))
-		return;
-
 	__nvme_fc_abort_op(ctrl, op);
 }
-- 
2.17.0
[PATCH 11/12] nvme: use the changed namespaces list log to clear ns data changed AENs
Per section 5.2 we need to issue the corresponding log page to clear an
AEN, so for a namespace data changed AEN we need to read the changed
namespace list log. And once we read that log anyway we might as well
use it to optimize the rescan.

Signed-off-by: Christoph Hellwig
---
 drivers/nvme/host/core.c | 51 ++++++++++++++++++++++++++++++++++++----
 drivers/nvme/host/nvme.h |  2 ++
 2 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 06cd04dcffbc..1ae77428a1a5 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3194,6 +3194,42 @@ static void nvme_scan_ns_sequential(struct nvme_ctrl *ctrl, unsigned nn)
 	nvme_remove_invalid_namespaces(ctrl, nn);
 }
 
+static bool nvme_scan_changed_ns_log(struct nvme_ctrl *ctrl)
+{
+	size_t log_size = NVME_MAX_CHANGED_NAMESPACES * sizeof(__le32);
+	__le32 *log;
+	int error, i;
+	bool ret = false;
+
+	log = kzalloc(log_size, GFP_KERNEL);
+	if (!log)
+		return false;
+
+	error = nvme_get_log(ctrl, NVME_LOG_CHANGED_NS, log, log_size);
+	if (error) {
+		dev_warn(ctrl->device,
+			"reading changed ns log failed: %d\n", error);
+		goto out_free_log;
+	}
+
+	if (log[0] == cpu_to_le32(0xffffffff))
+		goto out_free_log;
+
+	for (i = 0; i < NVME_MAX_CHANGED_NAMESPACES; i++) {
+		u32 nsid = le32_to_cpu(log[i]);
+
+		if (nsid == 0)
+			break;
+		dev_info(ctrl->device, "rescanning namespace %d.\n", nsid);
+		nvme_validate_ns(ctrl, nsid);
+	}
+	ret = true;
+
+out_free_log:
+	kfree(log);
+	return ret;
+}
+
 static void nvme_scan_work(struct work_struct *work)
 {
 	struct nvme_ctrl *ctrl =
@@ -3206,6 +3242,12 @@ static void nvme_scan_work(struct work_struct *work)
 
 	WARN_ON_ONCE(!ctrl->tagset);
 
+	if (test_and_clear_bit(EVENT_NS_CHANGED, &ctrl->events)) {
+		if (nvme_scan_changed_ns_log(ctrl))
+			goto out_sort_namespaces;
+		dev_info(ctrl->device, "rescanning namespaces.\n");
+	}
+
 	if (nvme_identify_ctrl(ctrl, &id))
 		return;
 
@@ -3213,14 +3255,15 @@ static void nvme_scan_work(struct work_struct *work)
 	if (ctrl->vs >= NVME_VS(1, 1, 0) &&
 	    !(ctrl->quirks & NVME_QUIRK_IDENTIFY_CNS)) {
 		if (!nvme_scan_ns_list(ctrl, nn))
-			goto done;
+			goto out_free_id;
 	}
 	nvme_scan_ns_sequential(ctrl, nn);
- done:
+out_free_id:
+	kfree(id);
+out_sort_namespaces:
 	down_write(&ctrl->namespaces_rwsem);
 	list_sort(NULL, &ctrl->namespaces, ns_cmp);
 	up_write(&ctrl->namespaces_rwsem);
-	kfree(id);
 }
 
 /*
@@ -3340,7 +3383,7 @@ static void nvme_handle_aen_notice(struct nvme_ctrl *ctrl, u32 result)
 {
 	switch ((result & 0xff00) >> 8) {
 	case NVME_AER_NOTICE_NS_CHANGED:
-		dev_info(ctrl->device, "rescanning\n");
+		set_bit(EVENT_NS_CHANGED, &ctrl->events);
 		nvme_queue_scan(ctrl);
 		break;
 	case NVME_AER_NOTICE_FW_ACT_STARTING:
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index 11681278fdf6..07e8bfe705c6 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -189,6 +189,8 @@ struct nvme_ctrl {
 	struct delayed_work ka_work;
 	struct nvme_command ka_cmd;
 	struct work_struct fw_act_work;
+#define EVENT_NS_CHANGED	(1 << 0)
+	unsigned long events;
 
 	/* Power saving configuration */
 	u64 ps_max_latency_us;
-- 
2.17.0
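A user-space model of the log-walking logic in nvme_scan_changed_ns_log() above (simplified: no locking, no byte swapping, and the per-namespace rescan is just counted instead of calling nvme_validate_ns()):

```c
#include <assert.h>
#include <stdint.h>

#define LOG_ENTRIES 8   /* shrunk from NVME_MAX_CHANGED_NAMESPACES */

/*
 * Walk a Changed Namespace List log: return -1 if the log overflowed
 * (first entry is 0xffffffff, meaning a full rescan is needed),
 * otherwise return the number of namespaces that would be rescanned,
 * stopping at the first zero entry.
 */
static int scan_changed_ns_log(const uint32_t *log)
{
	int i, n = 0;

	if (log[0] == 0xffffffffu)
		return -1;              /* fall back to a full rescan */

	for (i = 0; i < LOG_ENTRIES; i++) {
		if (log[i] == 0)
			break;          /* zero terminates the list */
		n++;                    /* would rescan namespace log[i] */
	}
	return n;
}
```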
[PATCH 10/12] nvme: mark nvme_queue_scan static
And move it toward the top of the file to avoid a forward declaration.

Signed-off-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
---
 drivers/nvme/host/core.c | 19 +++++++++----------
 drivers/nvme/host/nvme.h |  1 -
 2 files changed, 9 insertions(+), 11 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index c0bc76d98194..06cd04dcffbc 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -100,6 +100,15 @@ static struct class *nvme_subsys_class;
 static void nvme_ns_remove(struct nvme_ns *ns);
 static int nvme_revalidate_disk(struct gendisk *disk);
 
+static void nvme_queue_scan(struct nvme_ctrl *ctrl)
+{
+	/*
+	 * Only new queue scan work when admin and IO queues are both alive
+	 */
+	if (ctrl->state == NVME_CTRL_LIVE)
+		queue_work(nvme_wq, &ctrl->scan_work);
+}
+
 int nvme_reset_ctrl(struct nvme_ctrl *ctrl)
 {
 	if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
@@ -3214,16 +3223,6 @@ static void nvme_scan_work(struct work_struct *work)
 	kfree(id);
 }
 
-void nvme_queue_scan(struct nvme_ctrl *ctrl)
-{
-	/*
-	 * Only new queue scan work when admin and IO queues are both alive
-	 */
-	if (ctrl->state == NVME_CTRL_LIVE)
-		queue_work(nvme_wq, &ctrl->scan_work);
-}
-EXPORT_SYMBOL_GPL(nvme_queue_scan);
-
 /*
  * This function iterates the namespace list unlocked to allow recovery from
  * controller failure. It is up to the caller to ensure the namespace list is
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index dcf9e9592c1a..11681278fdf6 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -395,7 +395,6 @@ void nvme_stop_ctrl(struct nvme_ctrl *ctrl);
 void nvme_put_ctrl(struct nvme_ctrl *ctrl);
 int nvme_init_identify(struct nvme_ctrl *ctrl);
 
-void nvme_queue_scan(struct nvme_ctrl *ctrl);
 void nvme_remove_namespaces(struct nvme_ctrl *ctrl);
 
 int nvme_sec_submit(void *data, u16 spsp, u8 secp, void *buffer, size_t len,
-- 
2.17.0
[PATCH 12/12] nvme: limit warnings from nvme_identify_ns
When rescanning namespaces after an AEN we will issue Identify Namespace
commands to namespaces that have gone away, so don't warn for this
specific case.

Signed-off-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
---
 drivers/nvme/host/core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1ae77428a1a5..7ad3cfc9d4e1 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -984,7 +984,9 @@ static struct nvme_id_ns *nvme_identify_ns(struct nvme_ctrl *ctrl,
 
 	error = nvme_submit_sync_cmd(ctrl->admin_q, &c, id, sizeof(*id));
 	if (error) {
-		dev_warn(ctrl->device, "Identify namespace failed\n");
+		/* don't warn on a namespace that has gone away */
+		if (error < 0 || ((error & ~NVME_SC_DNR) != NVME_SC_INVALID_NS))
+			dev_warn(ctrl->device, "Identify namespace failed\n");
 		kfree(id);
 		return NULL;
 	}
-- 
2.17.0
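The status check can be modeled in user space. The constants below follow the NVMe spec status-code values, and negative values stand for transport or internal errors, as in the kernel's submission path:

```c
#include <assert.h>
#include <stdbool.h>

#define NVME_SC_INVALID_NS 0x0b     /* Invalid Namespace or Format */
#define NVME_SC_DNR        0x4000   /* Do Not Retry bit */

/*
 * Mirror of the warning filter in the patch: stay quiet only when the
 * controller reported Invalid Namespace (with or without DNR set);
 * warn on transport errors and every other status code.
 */
static bool should_warn(int error)
{
	return error < 0 || ((error & ~NVME_SC_DNR) != NVME_SC_INVALID_NS);
}
```

Masking out DNR first means both the plain status and the DNR-flagged status are treated as "namespace went away", while anything else still produces the warning.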
[PATCH 09/12] nvme: submit AEN event configuration on startup
From: Hannes Reinecke

We should register for AEN events; some law-abiding targets might not
be sending us AENs otherwise.

Signed-off-by: Hannes Reinecke
[hch: slight cleanups]
Signed-off-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
---
 drivers/nvme/host/core.c | 17 +++++++++++++++++
 drivers/nvme/host/nvme.h |  1 +
 2 files changed, 18 insertions(+)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 86cb78653155..c0bc76d98194 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1030,6 +1030,21 @@ int nvme_set_queue_count(struct nvme_ctrl *ctrl, int *count)
 }
 EXPORT_SYMBOL_GPL(nvme_set_queue_count);
 
+#define NVME_AEN_SUPPORTED \
+	(NVME_AEN_CFG_NS_ATTR | NVME_AEN_CFG_FW_ACT)
+
+static void nvme_enable_aen(struct nvme_ctrl *ctrl)
+{
+	u32 result;
+	int status;
+
+	status = nvme_set_features(ctrl, NVME_FEAT_ASYNC_EVENT,
+			ctrl->oaes & NVME_AEN_SUPPORTED, NULL, 0, &result);
+	if (status)
+		dev_warn(ctrl->device, "Failed to configure AEN (cfg %x)\n",
+			 ctrl->oaes & NVME_AEN_SUPPORTED);
+}
+
 static int nvme_submit_io(struct nvme_ns *ns, struct nvme_user_io __user *uio)
 {
 	struct nvme_user_io io;
@@ -2347,6 +2362,7 @@ int nvme_init_identify(struct nvme_ctrl *ctrl)
 
 	ctrl->oacs = le16_to_cpu(id->oacs);
 	ctrl->oncs = le16_to_cpup(&id->oncs);
+	ctrl->oaes = le32_to_cpu(id->oaes);
 	atomic_set(&ctrl->abort_limit, id->acl + 1);
 	ctrl->vwc = id->vwc;
 	ctrl->cntlid = le16_to_cpup(&id->cntlid);
@@ -3379,6 +3395,7 @@ void nvme_start_ctrl(struct nvme_ctrl *ctrl)
 
 	if (ctrl->queue_count > 1) {
 		nvme_queue_scan(ctrl);
+		nvme_enable_aen(ctrl);
 		queue_work(nvme_wq, &ctrl->async_event_work);
 		nvme_start_queues(ctrl);
 	}
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index ec6e4acc4d48..dcf9e9592c1a 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -176,6 +176,7 @@ struct nvme_ctrl {
 	u16 kas;
 	u8 npss;
 	u8 apsta;
+	u32 oaes;
 	u32 aen_result;
 	unsigned int shutdown_timeout;
 	unsigned int kato;
-- 
2.17.0
[PATCH 07/12] nvmet: Add AEN configuration support
AEN configuration via the 'Get Features' and 'Set Features' admin
commands is mandatory, so we should be implementing handling for it.

Signed-off-by: Hannes Reinecke
[hch: use WRITE_ONCE, check for invalid values]
Signed-off-by: Christoph Hellwig
---
 drivers/nvme/target/admin-cmd.c | 15 +++++++++++++--
 drivers/nvme/target/core.c      |  3 +++
 drivers/nvme/target/nvmet.h     | 16 ++++++++++++++++
 3 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 453d76db1647..4e269a8c8dbe 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -189,7 +189,7 @@ static void nvmet_execute_identify_ctrl(struct nvmet_req *req)
 	id->ver = cpu_to_le32(ctrl->subsys->ver);
 
 	/* XXX: figure out what to do about RTD3R/RTD3 */
-	id->oaes = cpu_to_le32(1 << 8);
+	id->oaes = cpu_to_le32(NVMET_AEN_CFG_OPTIONAL);
 	id->ctratt = cpu_to_le32(1 << 0);
 
 	id->oacs = 0;
@@ -441,6 +441,16 @@ static void nvmet_execute_set_features(struct nvmet_req *req)
 		req->sq->ctrl->kato = DIV_ROUND_UP(val32, 1000);
 		nvmet_set_result(req, req->sq->ctrl->kato);
 		break;
+	case NVME_FEAT_ASYNC_EVENT:
+		val32 = le32_to_cpu(req->cmd->common.cdw10[1]);
+		if (val32 & ~NVMET_AEN_CFG_ALL) {
+			status = NVME_SC_INVALID_FIELD | NVME_SC_DNR;
+			break;
+		}
+
+		WRITE_ONCE(req->sq->ctrl->aen_enabled, val32);
+		nvmet_set_result(req, val32);
+		break;
 	case NVME_FEAT_HOST_ID:
 		status = NVME_SC_CMD_SEQ_ERROR | NVME_SC_DNR;
 		break;
@@ -479,9 +489,10 @@ static void nvmet_execute_get_features(struct nvmet_req *req)
 		break;
 	case NVME_FEAT_WRITE_ATOMIC:
 		break;
+#endif
 	case NVME_FEAT_ASYNC_EVENT:
+		nvmet_set_result(req, READ_ONCE(req->sq->ctrl->aen_enabled));
 		break;
-#endif
 	case NVME_FEAT_VOLATILE_WC:
 		nvmet_set_result(req, 1);
 		break;
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 99fcff757608..b2da65aec8b7 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -168,6 +168,8 @@ static void nvmet_ns_changed(struct nvmet_subsys *subsys, u32 nsid)
 	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
 		nvmet_add_to_changed_ns_log(ctrl, cpu_to_le32(nsid));
+		if (!(READ_ONCE(ctrl->aen_enabled) & NVME_AEN_CFG_NS_ATTR))
+			continue;
 		nvmet_add_async_event(ctrl, NVME_AER_TYPE_NOTICE,
 				NVME_AER_NOTICE_NS_CHANGED,
 				NVME_LOG_CHANGED_NS);
@@ -855,6 +857,7 @@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn,
 	kref_init(&ctrl->ref);
 	ctrl->subsys = subsys;
+	WRITE_ONCE(ctrl->aen_enabled, NVMET_AEN_CFG_OPTIONAL);
 
 	ctrl->changed_ns_list = kmalloc_array(NVME_MAX_CHANGED_NAMESPACES,
 			sizeof(__le32), GFP_KERNEL);
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 8cdc1e550396..f4d16d9b3582 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -30,6 +30,21 @@
 #define NVMET_ASYNC_EVENTS		4
 #define NVMET_ERROR_LOG_SLOTS		128
 
+/*
+ * Supported optional AENs:
+ */
+#define NVMET_AEN_CFG_OPTIONAL \
+	NVME_AEN_CFG_NS_ATTR
+
+/*
+ * Plus mandatory SMART AENs (we'll never send them, but allow enabling them):
+ */
+#define NVMET_AEN_CFG_ALL \
+	(NVME_SMART_CRIT_SPARE | NVME_SMART_CRIT_TEMPERATURE | \
+	 NVME_SMART_CRIT_RELIABILITY | NVME_SMART_CRIT_MEDIA | \
+	 NVME_SMART_CRIT_VOLATILE_MEMORY | NVMET_AEN_CFG_OPTIONAL)
+
 /* Helper Macros when NVMe error is NVME_SC_CONNECT_INVALID_PARAM
  * The 16 bit shift is to set IATTR bit to 1, which means offending
  * offset starts in the data section of connect()
@@ -123,6 +138,7 @@ struct nvmet_ctrl {
 	u16			cntlid;
 	u32			kato;
 
+	u32			aen_enabled;
 	struct nvmet_req	*async_event_cmds[NVMET_ASYNC_EVENTS];
 	unsigned int		nr_async_event_cmds;
 	struct list_head	async_events;
-- 
2.17.0
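A user-space model of the Set Features validation above. The mask value and the global variable are illustrative stand-ins, not the real NVME_AEN_CFG_* constants or controller state:

```c
#include <assert.h>
#include <stdint.h>

#define AEN_CFG_ALL      0x000007ffu  /* hypothetical supported-bits mask */
#define SC_INVALID_FIELD 0x4002u      /* NVME_SC_INVALID_FIELD | NVME_SC_DNR */

static uint32_t aen_enabled;          /* stands in for ctrl->aen_enabled */

/*
 * Mirror of the NVME_FEAT_ASYNC_EVENT handler in the patch: reject any
 * bit outside the supported mask with Invalid Field, otherwise store
 * the value (WRITE_ONCE() in the kernel) and echo it back as the result.
 */
static uint16_t set_async_event_config(uint32_t val32)
{
	if (val32 & ~AEN_CFG_ALL)
		return SC_INVALID_FIELD;
	aen_enabled = val32;
	return 0;
}
```

Checking against the full NVMET_AEN_CFG_ALL-style mask (rather than only the optional bits) is what lets a host enable the mandatory SMART AEN bits even though the target never sends them.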
[PATCH 08/12] nvmet: mask pending AERs
Per section 5.2 of the NVMe 1.3 spec:

"When the controller posts a completion queue entry for an outstanding
Asynchronous Event Request command and thus reports an asynchronous
event, subsequent events of that event type are automatically masked by
the controller until the host clears that event. An event is cleared by
reading the log page associated with that event using the Get Log Page
command (see section 5.14)."

Signed-off-by: Christoph Hellwig
---
 drivers/nvme/target/admin-cmd.c | 1 +
 drivers/nvme/target/core.c      | 9 -
 drivers/nvme/target/nvmet.h     | 1 +
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c
index 4e269a8c8dbe..abf191ee00af 100644
--- a/drivers/nvme/target/admin-cmd.c
+++ b/drivers/nvme/target/admin-cmd.c
@@ -144,6 +144,7 @@ static void nvmet_execute_get_log_changed_ns(struct nvmet_req *req)
 	if (!status)
 		status = nvmet_zero_sgl(req, len, req->data_len - len);
 	ctrl->nr_changed_ns = 0;
+	clear_bit(NVME_AEN_CFG_NS_ATTR, &ctrl->aen_masked);
 	mutex_unlock(&ctrl->lock);
 out:
 	nvmet_req_complete(req, status);
diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index b2da65aec8b7..8e6432b1abed 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -144,6 +144,13 @@ static void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type,
 	schedule_work(&ctrl->async_event_work);
 }
 
+static bool nvmet_aen_disabled(struct nvmet_ctrl *ctrl, u32 aen)
+{
+	if (!(READ_ONCE(ctrl->aen_enabled) & aen))
+		return true;
+	return test_and_set_bit(aen, &ctrl->aen_masked);
+}
+
 static void nvmet_add_to_changed_ns_log(struct nvmet_ctrl *ctrl, __le32 nsid)
 {
 	mutex_lock(&ctrl->lock);
@@ -168,7 +175,7 @@ static void nvmet_ns_changed(struct nvmet_subsys *subsys, u32 nsid)
 
 	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
 		nvmet_add_to_changed_ns_log(ctrl, cpu_to_le32(nsid));
-		if (!(READ_ONCE(ctrl->aen_enabled) & NVME_AEN_CFG_NS_ATTR))
+		if (nvmet_aen_disabled(ctrl, NVME_AEN_CFG_NS_ATTR))
 			continue;
 		nvmet_add_async_event(ctrl, NVME_AER_TYPE_NOTICE,
 				NVME_AER_NOTICE_NS_CHANGED,
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index f4d16d9b3582..480dfe10fad9 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -139,6 +139,7 @@ struct nvmet_ctrl {
 	u32			kato;
 
 	u32			aen_enabled;
+	unsigned long		aen_masked;
 	struct nvmet_req	*async_event_cmds[NVMET_ASYNC_EVENTS];
 	unsigned int		nr_async_event_cmds;
 	struct list_head	async_events;
-- 
2.17.0
[PATCH 03/12] nvme.h: add AER configuration symbols
From: Hannes Reinecke

Signed-off-by: Hannes Reinecke
[hch: split from a larger patch]
Signed-off-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
---
 include/linux/nvme.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index 7ce0f3cf4409..2950ce957656 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -446,6 +446,11 @@ enum {
 	NVME_AER_NOTICE_FW_ACT_STARTING	= 0x01,
 };
 
+enum {
+	NVME_AEN_CFG_NS_ATTR	= 1 << 8,
+	NVME_AEN_CFG_FW_ACT	= 1 << 9,
+};
+
 struct nvme_lba_range_type {
 	__u8			type;
 	__u8			attributes;
-- 
2.17.0
[PATCH 02/12] nvme.h: add the changed namespace list log
Signed-off-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
---
 include/linux/nvme.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/linux/nvme.h b/include/linux/nvme.h
index c37103a4ad38..7ce0f3cf4409 100644
--- a/include/linux/nvme.h
+++ b/include/linux/nvme.h
@@ -751,6 +751,7 @@ enum {
 	NVME_LOG_ERROR		= 0x01,
 	NVME_LOG_SMART		= 0x02,
 	NVME_LOG_FW_SLOT	= 0x03,
+	NVME_LOG_CHANGED_NS	= 0x04,
 	NVME_LOG_CMD_EFFECTS	= 0x05,
 	NVME_LOG_DISC		= 0x70,
 	NVME_LOG_RESERVATION	= 0x80,
@@ -759,6 +760,8 @@ enum {
 	NVME_FWACT_ACTV		= (2 << 3),
 };
 
+#define NVME_MAX_CHANGED_NAMESPACES	1024
+
 struct nvme_identify {
 	__u8			opcode;
 	__u8			flags;
-- 
2.17.0
[PATCH 05/12] nvmet: split log page implementation
Remove the common code to allocate a buffer and copy it into the SGL. Instead the two no-op implementations just zero the SGL directly, and the smart log allocates a buffer on its own. This prepares for the more elaborate ANA log page. Signed-off-by: Christoph Hellwig Reviewed-by: Johannes Thumshirn --- drivers/nvme/target/admin-cmd.c | 99 - 1 file changed, 36 insertions(+), 63 deletions(-) diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c index 7f906e4ccc96..678a393f01de 100644 --- a/drivers/nvme/target/admin-cmd.c +++ b/drivers/nvme/target/admin-cmd.c @@ -32,6 +32,11 @@ u32 nvmet_get_log_page_len(struct nvme_command *cmd) return len; } +static void nvmet_execute_get_log_page_noop(struct nvmet_req *req) +{ + nvmet_req_complete(req, nvmet_zero_sgl(req, 0, req->data_len)); +} + static u16 nvmet_get_smart_log_nsid(struct nvmet_req *req, struct nvme_smart_log *slog) { @@ -97,74 +102,26 @@ static u16 nvmet_get_smart_log_all(struct nvmet_req *req, return NVME_SC_SUCCESS; } -static u16 nvmet_get_smart_log(struct nvmet_req *req, - struct nvme_smart_log *slog) -{ - u16 status; - - WARN_ON(req == NULL || slog == NULL); - if (req->cmd->get_log_page.nsid == cpu_to_le32(NVME_NSID_ALL)) - status = nvmet_get_smart_log_all(req, slog); - else - status = nvmet_get_smart_log_nsid(req, slog); - return status; -} - -static void nvmet_execute_get_log_page(struct nvmet_req *req) +static void nvmet_execute_get_log_page_smart(struct nvmet_req *req) { - struct nvme_smart_log *smart_log; - size_t data_len = nvmet_get_log_page_len(req->cmd); - void *buf; - u16 status = 0; + struct nvme_smart_log *log; + u16 status = NVME_SC_INTERNAL; - buf = kzalloc(data_len, GFP_KERNEL); - if (!buf) { - status = NVME_SC_INTERNAL; + if (req->data_len != sizeof(*log)) goto out; - } - switch (req->cmd->get_log_page.lid) { - case NVME_LOG_ERROR: - /* -* We currently never set the More bit in the status field, -* so all error log entries are invalid and can be zeroed out. 
-* This is called a minum viable implementation (TM) of this -* mandatory log page. -*/ - break; - case NVME_LOG_SMART: - /* -* XXX: fill out actual smart log -* -* We might have a hard time coming up with useful values for -* many of the fields, and even when we have useful data -* available (e.g. units or commands read/written) those aren't -* persistent over power loss. -*/ - if (data_len != sizeof(*smart_log)) { - status = NVME_SC_INTERNAL; - goto err; - } - smart_log = buf; - status = nvmet_get_smart_log(req, smart_log); - if (status) - goto err; - break; - case NVME_LOG_FW_SLOT: - /* -* We only support a single firmware slot which always is -* active, so we can zero out the whole firmware slot log and -* still claim to fully implement this mandatory log page. -*/ - break; - default: - BUG(); - } + log = kzalloc(sizeof(*log), GFP_KERNEL); + if (!log) + goto out; - status = nvmet_copy_to_sgl(req, 0, buf, data_len); + if (req->cmd->get_log_page.nsid == cpu_to_le32(NVME_NSID_ALL)) + status = nvmet_get_smart_log_all(req, log); + else + status = nvmet_get_smart_log_nsid(req, log); + if (status) + goto out; -err: - kfree(buf); + status = nvmet_copy_to_sgl(req, 0, log, sizeof(*log)); out: nvmet_req_complete(req, status); } @@ -572,9 +529,25 @@ u16 nvmet_parse_admin_cmd(struct nvmet_req *req) switch (cmd->get_log_page.lid) { case NVME_LOG_ERROR: + /* +* We currently never set the More bit in the status +* field, so all error log entries are invalid and can +* be zeroed out. This is called a minum viable +* implementation (TM) of this mandatory log page. +*/ + req->execute = nvmet_execute_get_log_page_noop; + return 0; case NVME_LOG_SMART: + req->execute = nvmet_execute_get_log_page_smart; + return 0; case NVME_LOG_FW_SLOT: - req->execute = nvmet_execute_get_log_page; +
[PATCH 01/12] nvme.h: untangle AEN notice definitions
Stop including the event type in the definitions for the notice type. Signed-off-by: Christoph Hellwig Reviewed-by: Johannes Thumshirn --- drivers/nvme/host/core.c | 30 ++ include/linux/nvme.h | 8 ++-- 2 files changed, 24 insertions(+), 14 deletions(-) diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c index 2c4cf65641a6..86cb78653155 100644 --- a/drivers/nvme/host/core.c +++ b/drivers/nvme/host/core.c @@ -3321,6 +3321,21 @@ static void nvme_fw_act_work(struct work_struct *work) nvme_get_fw_slot_info(ctrl); } +static void nvme_handle_aen_notice(struct nvme_ctrl *ctrl, u32 result) +{ + switch ((result & 0xff00) >> 8) { + case NVME_AER_NOTICE_NS_CHANGED: + dev_info(ctrl->device, "rescanning\n"); + nvme_queue_scan(ctrl); + break; + case NVME_AER_NOTICE_FW_ACT_STARTING: + queue_work(nvme_wq, &ctrl->fw_act_work); + break; + default: + dev_warn(ctrl->device, "async event result %08x\n", result); + } +} + void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status, volatile union nvme_result *res) { @@ -3330,6 +3345,9 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status, return; switch (result & 0x7) { + case NVME_AER_NOTICE: + nvme_handle_aen_notice(ctrl, result); + break; case NVME_AER_ERROR: case NVME_AER_SMART: case NVME_AER_CSS: @@ -3339,18 +3357,6 @@ void nvme_complete_async_event(struct nvme_ctrl *ctrl, __le16 status, default: break; } - - switch (result & 0xff07) { - case NVME_AER_NOTICE_NS_CHANGED: - dev_info(ctrl->device, "rescanning\n"); - nvme_queue_scan(ctrl); - break; - case NVME_AER_NOTICE_FW_ACT_STARTING: - queue_work(nvme_wq, &ctrl->fw_act_work); - break; - default: - dev_warn(ctrl->device, "async event result %08x\n", result); - } queue_work(nvme_wq, &ctrl->async_event_work); } EXPORT_SYMBOL_GPL(nvme_complete_async_event); diff --git a/include/linux/nvme.h b/include/linux/nvme.h index 4112e2bd747f..c37103a4ad38 100644 --- a/include/linux/nvme.h +++ b/include/linux/nvme.h @@ -436,10 +436,14 @@ enum { enum { 
NVME_AER_ERROR = 0, NVME_AER_SMART = 1, + NVME_AER_NOTICE = 2, NVME_AER_CSS= 6, NVME_AER_VS = 7, - NVME_AER_NOTICE_NS_CHANGED = 0x0002, - NVME_AER_NOTICE_FW_ACT_STARTING = 0x0102, +}; + +enum { + NVME_AER_NOTICE_NS_CHANGED = 0x00, + NVME_AER_NOTICE_FW_ACT_STARTING = 0x01, }; struct nvme_lba_range_type { -- 2.17.0
[PATCH 06/12] nvmet: implement the changed namespaces log
Just keep a per-controller buffer of changed namespaces and copy it out in the get log page implementation. Signed-off-by: Christoph Hellwig --- drivers/nvme/target/admin-cmd.c | 26 + drivers/nvme/target/core.c | 50 +++-- drivers/nvme/target/nvmet.h | 3 ++ 3 files changed, 70 insertions(+), 9 deletions(-) diff --git a/drivers/nvme/target/admin-cmd.c b/drivers/nvme/target/admin-cmd.c index 678a393f01de..453d76db1647 100644 --- a/drivers/nvme/target/admin-cmd.c +++ b/drivers/nvme/target/admin-cmd.c @@ -126,6 +126,29 @@ static void nvmet_execute_get_log_page_smart(struct nvmet_req *req) nvmet_req_complete(req, status); } +static void nvmet_execute_get_log_changed_ns(struct nvmet_req *req) +{ + struct nvmet_ctrl *ctrl = req->sq->ctrl; + u16 status = NVME_SC_INTERNAL; + size_t len; + + if (req->data_len != NVME_MAX_CHANGED_NAMESPACES * sizeof(__le32)) + goto out; + + mutex_lock(&ctrl->lock); + if (ctrl->nr_changed_ns == U32_MAX) + len = sizeof(__le32); + else + len = ctrl->nr_changed_ns * sizeof(__le32); + status = nvmet_copy_to_sgl(req, 0, ctrl->changed_ns_list, len); + if (!status) + status = nvmet_zero_sgl(req, len, req->data_len - len); + ctrl->nr_changed_ns = 0; + mutex_unlock(&ctrl->lock); +out: + nvmet_req_complete(req, status); +} + static void nvmet_execute_identify_ctrl(struct nvmet_req *req) { struct nvmet_ctrl *ctrl = req->sq->ctrl; @@ -549,6 +572,9 @@ u16 nvmet_parse_admin_cmd(struct nvmet_req *req) */ req->execute = nvmet_execute_get_log_page_noop; return 0; + case NVME_LOG_CHANGED_NS: + req->execute = nvmet_execute_get_log_changed_ns; + return 0; } break; case nvme_admin_identify: diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c index 55c4bc693aa2..99fcff757608 100644 --- a/drivers/nvme/target/core.c +++ b/drivers/nvme/target/core.c @@ -144,6 +144,36 @@ static void nvmet_add_async_event(struct nvmet_ctrl *ctrl, u8 event_type, schedule_work(&ctrl->async_event_work); } +static void nvmet_add_to_changed_ns_log(struct nvmet_ctrl *ctrl, 
__le32 nsid) +{ + mutex_lock(&ctrl->lock); + if (ctrl->nr_changed_ns < NVME_MAX_CHANGED_NAMESPACES) { + u32 i; + + for (i = 0; i < ctrl->nr_changed_ns; i++) + if (ctrl->changed_ns_list[i] == nsid) + goto out_unlock; + ctrl->changed_ns_list[ctrl->nr_changed_ns++] = nsid; + } else if (ctrl->nr_changed_ns == NVME_MAX_CHANGED_NAMESPACES) { + ctrl->changed_ns_list[0] = cpu_to_le32(0x); + ctrl->nr_changed_ns = U32_MAX; + } +out_unlock: + mutex_unlock(&ctrl->lock); +} + +static void nvmet_ns_changed(struct nvmet_subsys *subsys, u32 nsid) +{ + struct nvmet_ctrl *ctrl; + + list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) { + nvmet_add_to_changed_ns_log(ctrl, cpu_to_le32(nsid)); + nvmet_add_async_event(ctrl, NVME_AER_TYPE_NOTICE, + NVME_AER_NOTICE_NS_CHANGED, + NVME_LOG_CHANGED_NS); + } +} + int nvmet_register_transport(const struct nvmet_fabrics_ops *ops) { int ret = 0; @@ -287,7 +317,6 @@ static void nvmet_ns_dev_disable(struct nvmet_ns *ns) int nvmet_ns_enable(struct nvmet_ns *ns) { struct nvmet_subsys *subsys = ns->subsys; - struct nvmet_ctrl *ctrl; int ret = 0; mutex_lock(&subsys->lock); @@ -326,9 +355,7 @@ int nvmet_ns_enable(struct nvmet_ns *ns) list_add_tail_rcu(&ns->dev_link, &old->dev_link); } - list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) - nvmet_add_async_event(ctrl, NVME_AER_TYPE_NOTICE, 0, 0); - + nvmet_ns_changed(subsys, ns->nsid); ns->enabled = true; ret = 0; out_unlock: @@ -342,7 +369,6 @@ int nvmet_ns_enable(struct nvmet_ns *ns) void nvmet_ns_disable(struct nvmet_ns *ns) { struct nvmet_subsys *subsys = ns->subsys; - struct nvmet_ctrl *ctrl; mutex_lock(&subsys->lock); if (!ns->enabled) @@ -368,9 +394,7 @@ void nvmet_ns_disable(struct nvmet_ns *ns) percpu_ref_exit(&ns->ref); mutex_lock(&subsys->lock); - list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) - nvmet_add_async_event(ctrl, NVME_AER_TYPE_NOTICE, 0, 0); - + nvmet_ns_changed(subsys, ns->nsid); nvmet_ns_dev_disable(ns); out_unlock: mutex_unlock(&subsys->lock); @@ -832,11 +856,16 
@@ u16 nvmet_alloc_ctrl(const char *subsysnqn, const char *hostnqn, kref_init(&ctrl->ref); ctrl->subsys = subsys; +
[PATCH 04/12] nvmet: add a new nvmet_zero_sgl helper
Zeroes the SGL in the payload.

Signed-off-by: Christoph Hellwig
Reviewed-by: Johannes Thumshirn
---
 drivers/nvme/target/core.c  | 7 +++
 drivers/nvme/target/nvmet.h | 1 +
 2 files changed, 8 insertions(+)

diff --git a/drivers/nvme/target/core.c b/drivers/nvme/target/core.c
index 800aaf96ddcd..55c4bc693aa2 100644
--- a/drivers/nvme/target/core.c
+++ b/drivers/nvme/target/core.c
@@ -57,6 +57,13 @@ u16 nvmet_copy_from_sgl(struct nvmet_req *req, off_t off, void *buf, size_t len)
 	return 0;
 }
 
+u16 nvmet_zero_sgl(struct nvmet_req *req, off_t off, size_t len)
+{
+	if (sg_zero_buffer(req->sg, req->sg_cnt, len, off) != len)
+		return NVME_SC_SGL_INVALID_DATA | NVME_SC_DNR;
+	return 0;
+}
+
 static unsigned int nvmet_max_nsid(struct nvmet_subsys *subsys)
 {
 	struct nvmet_ns *ns;
diff --git a/drivers/nvme/target/nvmet.h b/drivers/nvme/target/nvmet.h
index 32ebffcf464c..768abe203298 100644
--- a/drivers/nvme/target/nvmet.h
+++ b/drivers/nvme/target/nvmet.h
@@ -330,6 +330,7 @@ u16 nvmet_copy_to_sgl(struct nvmet_req *req, off_t off, const void *buf,
 		size_t len);
 u16 nvmet_copy_from_sgl(struct nvmet_req *req, off_t off, void *buf,
 		size_t len);
+u16 nvmet_zero_sgl(struct nvmet_req *req, off_t off, size_t len);
 
 u32 nvmet_get_log_page_len(struct nvme_command *cmd);
-- 
2.17.0
nvme/nvmet AEN and log page fixups v2
Hi all,

this series started as prep work for ANA, but has grown a lot. The idea
is to make the AEN handling in the target closer to what the standard
says, and to implement the changed namespace list log page, which is
required to clear the namespace attribute notice event.

On the host side this just makes sure we actually have the events we
handle enabled using Set Features, and it fixes the handling of the
namespace attribute notice event so that it actually does get cleared by
issuing the log page.

Last but not least, two block cleanups are thrown in so that we don't
print a pointless message when we delete a namespace, and also unexport
the function printing said message, as it should only be called from the
core code.

Changes from v1:
 - don't copy garbage when more than 1024 namespaces changed
 - filter out duplicates in the changed namespace list log page
 - use the proper define for the supported events to initialize OAES
 - add the nvmet_aen_disabled helper from the ANA series
 - allow the mandatory SMART AENs in Set Features
[PATCH] nbd: clear DISCONNECT_REQUESTED flag once disconnection occurs.
From: Kevin Vigor When a userspace client requests a NBD device be disconnected, the DISCONNECT_REQUESTED flag is set. While this flag is set, the driver will not inform userspace when a connection is closed. Unfortunately the flag was never cleared, so once a disconnect was requested the driver would thereafter never tell userspace about a closed connection. Thus when connections failed due to timeout, no attempt to reconnect was made and eventually the device would fail. Fix by clearing the DISCONNECT_REQUESTED flag (and setting the DISCONNECTED flag) once all connections are closed. Changes relative to v1 (https://marc.info/?l=linux-block&m=152763540418902): * remove pointless wake_up() -> wake_up_all() change. Signed-off-by: Kevin Vigor --- drivers/block/nbd.c | 21 +++-- 1 file changed, 15 insertions(+), 6 deletions(-) diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c index afbc202..1cd041b 100644 --- a/drivers/block/nbd.c +++ b/drivers/block/nbd.c @@ -213,7 +213,15 @@ static void nbd_mark_nsock_dead(struct nbd_device *nbd, struct nbd_sock *nsock, } if (!nsock->dead) { kernel_sock_shutdown(nsock->sock, SHUT_RDWR); - atomic_dec(&nbd->config->live_connections); + if (atomic_dec_return(&nbd->config->live_connections) == 0) { + if (test_and_clear_bit(NBD_DISCONNECT_REQUESTED, + &nbd->config->runtime_flags)) { + set_bit(NBD_DISCONNECTED, + &nbd->config->runtime_flags); + dev_info(nbd_to_dev(nbd), + "Disconnected due to user request.\n"); + } + } } nsock->dead = true; nsock->pending = NULL; @@ -292,7 +300,9 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req, if (config->num_connections > 1) { dev_err_ratelimited(nbd_to_dev(nbd), - "Connection timed out, retrying\n"); + "Connection timed out, retrying (%d/%d alive)\n", + atomic_read(&config->live_connections), + config->num_connections); /* * Hooray we have more connections, requeue this IO, the submit * path will put it on a real connection. 
@@ -714,10 +724,9 @@ static int wait_for_reconnect(struct nbd_device *nbd) return 0; if (test_bit(NBD_DISCONNECTED, &config->runtime_flags)) return 0; - wait_event_timeout(config->conn_wait, - atomic_read(&config->live_connections), - config->dead_conn_timeout); - return atomic_read(&config->live_connections); + return wait_event_timeout(config->conn_wait, + atomic_read(&config->live_connections) > 0, + config->dead_conn_timeout) > 0; } static int nbd_handle_cmd(struct nbd_cmd *cmd, int index) -- 2.7.4
Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices
On 5/30/18 9:45 AM, Jeff Moyer wrote: > Jens Axboe writes: > >> On 5/30/18 9:06 AM, Jeff Moyer wrote: >>> Hi, Jens, >>> >>> Jens Axboe writes: >>> On 5/30/18 2:49 AM, Christoph Hellwig wrote: > While I really don't want drivers to change the I/O schedule themselves > we have a class of devices (zoned) that don't work at all with certain > I/O schedulers. The kernel not chosing something sane and requiring > user workarounds is just silly. They work just fine for probing and reading purposes. There's absolutely no reason why we can't handle these special snowflakes with a udev rule. >>> >>> udev rules aren't shipped with the kernel, so it makes it hard to keep >>> them in sync. In this instance, I'm not sure anyone made an effort to >>> notify distributions that a udev rule was even necessary. (Is there a >>> way of notifying distributions about kernel changes that require new >>> udev rules, other than emailing each list individually?) >>> >>> From a technical standpoint, I totally agree with you, Jens. However, I >>> think the user experience sucks. 4.15 worked by default, 4.16 doesn't. >>> The result will be bug reports from users (to the drive vendors, >>> distribution bugzillas, here, etc.). >> >> I would imagine that most folks get their updates from a distro of some >> sort, in which case there's absolutely nothing stopping the distro from >> shipping updated rules for the 4.16 kernel update. > > The problem is distros have already shipped that kernel. :) Ship an update, then! I'm sure that most people would prefer a simple rule update over a kernel update. And you have to do one of them, to resolve this anyway. >> But what's the regression? 4.15 had no zone write locking at all. > > The zone write locking was done in the sd driver prior to 4.16. See > commit 39051dd85f287 ("scsi: sd: Remove zone write locking") for where > it was removed. That means these devices "just worked" with all I/O > schedulers. Gotcha, makes sense. 
>>> Moving on, assuming your mind is made up...
>>>
>>> I'm not sure how much logic should go into the udev rule. As mentioned,
>>> this limitation was introduced in 4.16, and Damien has plans to lift the
>>> restriction in future kernels. Because distributions tend to cherry
>>> pick changes, making decisions on whether a feature exists based solely
>>> on kernel version is usually not a great thing. My inclination would be
>>> to just always force deadline for host-managed SMR drives. These drives
>>> aren't that popular, after all. Any opinions on this?
>>
>> The problem is that it's tied to an IO scheduler, which ends up causing
>> issues like this, since users are free to select a different scheduler.
>> Then things break. Granted, in this case, some extraordinarily shitty
>> hardware even broke. That is on the hardware, not the kernel, that
>> kind of breakage should not occur.
>
> If the firmware problem was widespread, I think we'd try to avoid it. I
> have no reason to believe that is the case, though.
>
> Damien made the argument that the user should be able to select an I/O
> scheduler that doesn't perform the write locking, because a well-behaved
> application could theoretically make use of it. I think this is a weak
> argument, given that dm-zoned doesn't even support such a mode.

Sure, the user should be able to select whatever they want. Maybe they
are strictly using it through bsg or a similar interface, in which case
no scheduler or kernel support is really needed to drive it.

>> So now we're stuck with this temporary situation which needs a work-around.
>> I don't think it's a terrible idea to have a rule that just sets
>> deadline/mq-deadline for an SMR device regardless of what kernel it is
>> running on. It'll probably never be a bad default.
>
> OK. Barring future input to the contrary, I'll work to get updates into
> fedora, at least. I've CC'd Colin and Hannes. I'm not sure who else to
> include.
> > FYI, below is the udev rule Damien had provided to Bryan. I'm not sure > about the KERNEL=="sd[a-z]" bit, that may need modification. Note: I'm > no udev expert. Rule looks fine to me, but I'm also no udev expert. -- Jens Axboe
Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices
Jens Axboe writes: > On 5/30/18 9:06 AM, Jeff Moyer wrote: >> Hi, Jens, >> >> Jens Axboe writes: >> >>> On 5/30/18 2:49 AM, Christoph Hellwig wrote: While I really don't want drivers to change the I/O schedule themselves we have a class of devices (zoned) that don't work at all with certain I/O schedulers. The kernel not chosing something sane and requiring user workarounds is just silly. >>> >>> They work just fine for probing and reading purposes. There's absolutely >>> no reason why we can't handle these special snowflakes with a udev rule. >> >> udev rules aren't shipped with the kernel, so it makes it hard to keep >> them in sync. In this instance, I'm not sure anyone made an effort to >> notify distributions that a udev rule was even necessary. (Is there a >> way of notifying distributions about kernel changes that require new >> udev rules, other than emailing each list individually?) >> >> From a technical standpoint, I totally agree with you, Jens. However, I >> think the user experience sucks. 4.15 worked by default, 4.16 doesn't. >> The result will be bug reports from users (to the drive vendors, >> distribution bugzillas, here, etc.). > > I would imagine that most folks get their updates from a distro of some > sort, in which case there's absolutely nothing stopping the distro from > shipping updated rules for the 4.16 kernel update. The problem is distros have already shipped that kernel. :) > But what's the regression? 4.15 had no zone write locking at all. The zone write locking was done in the sd driver prior to 4.16. See commit 39051dd85f287 ("scsi: sd: Remove zone write locking") for where it was removed. That means these devices "just worked" with all I/O schedulers. >> Moving on, assuming your mind is made up... >> >> I'm not sure how much logic should go into the udev rule. As mentioned, >> this limitation was introduced in 4.16, and Damien has plans to lift the >> restriction in future kernels. 
Because distributions tend to cherry >> pick changes, making decisions on whether a feature exists based solely >> on kernel version is usually not a great thing. My inclination would be >> to just always force deadline for host-managed SMR drives. These drives >> aren't that popular, after all. Any opinions on this? > > The problem is that it's tied to an IO scheduler, which ends up causing > issues like this, since users are free to select a different scheduler. > Then things break. Granted, in this case, some extraordinarily shitty > hardware even broke. That is on the hardware, not the kernel, that > kind of breakage should not occur. If the firmware problem was widespread, I think we'd try to avoid it. I have no reason to believe that is the case, though. Damien made the argument that the user should be able to select an I/O scheduler that doesn't perform the write locking, because a well-behaved application could theoretically make use of it. I think this is a weak argument, given that dm-zoned doesn't even support such a mode. I definitely see this udev rule as a temporary workaround. > So now we're stuck with this temporary situation which needs a work-around. > I don't think it's a terrible idea to have a rule that just sets > deadline/mq-deadline for an SMR device regardless of what kernel it is > running on. It'll probably never be a bad default. OK. Barring future input to the contrary, I'll work to get updates into fedora, at least. I've CC'd Colin and Hannes. I'm not sure who else to include. FYI, below is the udev rule Damien had provided to Bryan. I'm not sure about the KERNEL=="sd[a-z]" bit, that may need modification. Note: I'm no udev expert. Cheers, Jeff ACTION=="add|change", KERNEL=="sd[a-z]", ATTRS{queue/zoned}=="host-managed", ATTR{queue/scheduler}="deadline"
Re: [PATCH] nbd: clear DISCONNECT_REQUESTED flag once disconnection occurs.
On 5/29/18 5:09 PM, kvi...@gmail.com wrote: > @@ -937,7 +946,7 @@ static int nbd_reconnect_socket(struct nbd_device *nbd, > unsigned long arg) > queue_work(recv_workqueue, &args->work); > > atomic_inc(&config->live_connections); > - wake_up(&config->conn_wait); > + wake_up_all(&config->conn_wait); > return 0; > } > sockfd_put(sock); Unless you're using exclusive waits, and you are not, then wake_up() is equivalent to wake_up_all(). -- Jens Axboe
Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices
On 5/30/18 9:06 AM, Jeff Moyer wrote: > Hi, Jens, > > Jens Axboe writes: > >> On 5/30/18 2:49 AM, Christoph Hellwig wrote: >>> While I really don't want drivers to change the I/O schedule themselves >>> we have a class of devices (zoned) that don't work at all with certain >>> I/O schedulers. The kernel not chosing something sane and requiring >>> user workarounds is just silly. >> >> They work just fine for probing and reading purposes. There's absolutely >> no reason why we can't handle these special snowflakes with a udev rule. > > udev rules aren't shipped with the kernel, so it makes it hard to keep > them in sync. In this instance, I'm not sure anyone made an effort to > notify distributions that a udev rule was even necessary. (Is there a > way of notifying distributions about kernel changes that require new > udev rules, other than emailing each list individually?) > > From a technical standpoint, I totally agree with you, Jens. However, I > think the user experience sucks. 4.15 worked by default, 4.16 doesn't. > The result will be bug reports from users (to the drive vendors, > distribution bugzillas, here, etc.). I would imagine that most folks get their updates from a distro of some sort, in which case there's absolutely nothing stopping the distro from shipping updated rules for the 4.16 kernel update. But what's the regression? 4.15 had no zone write locking at all. > Moving on, assuming your mind is made up... > > I'm not sure how much logic should go into the udev rule. As mentioned, > this limitation was introduced in 4.16, and Damien has plans to lift the > restriction in future kernels. Because distributions tend to cherry > pick changes, making decisions on whether a feature exists based solely > on kernel version is usually not a great thing. My inclination would be > to just always force deadline for host-managed SMR drives. These drives > aren't that popular, after all. Any opinions on this? 
The problem is that it's tied to an IO scheduler, which ends up causing issues like this, since users are free to select a different scheduler. Then things break. Granted, in this case, some extraordinarily shitty hardware even broke. That is on the hardware, not the kernel, that kind of breakage should not occur. So now we're stuck with this temporary situation which needs a work-around. I don't think it's a terrible idea to have a rule that just sets deadline/mq-deadline for an SMR device regardless of what kernel it is running on. It'll probably never be a bad default. -- Jens Axboe
Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices
Hi, Jens,

Jens Axboe writes:

> On 5/30/18 2:49 AM, Christoph Hellwig wrote:
>> While I really don't want drivers to change the I/O schedule themselves
>> we have a class of devices (zoned) that don't work at all with certain
>> I/O schedulers. The kernel not chosing something sane and requiring
>> user workarounds is just silly.
>
> They work just fine for probing and reading purposes. There's absolutely
> no reason why we can't handle these special snowflakes with a udev rule.

udev rules aren't shipped with the kernel, so it makes it hard to keep
them in sync. In this instance, I'm not sure anyone made an effort to
notify distributions that a udev rule was even necessary. (Is there a
way of notifying distributions about kernel changes that require new
udev rules, other than emailing each list individually?)

From a technical standpoint, I totally agree with you, Jens. However, I
think the user experience sucks. 4.15 worked by default, 4.16 doesn't.
The result will be bug reports from users (to the drive vendors,
distribution bugzillas, here, etc.).

Moving on, assuming your mind is made up...

I'm not sure how much logic should go into the udev rule. As mentioned,
this limitation was introduced in 4.16, and Damien has plans to lift the
restriction in future kernels. Because distributions tend to cherry
pick changes, making decisions on whether a feature exists based solely
on kernel version is usually not a great thing. My inclination would be
to just always force deadline for host-managed SMR drives. These drives
aren't that popular, after all. Any opinions on this?

Thanks!
Jeff
Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices
On 5/30/18 2:49 AM, Christoph Hellwig wrote: > While I really don't want drivers to change the I/O scheduler themselves, > we have a class of devices (zoned) that don't work at all with certain > I/O schedulers. The kernel not choosing something sane and requiring > user workarounds is just silly. They work just fine for probing and reading purposes. There's absolutely no reason why we can't handle these special snowflakes with a udev rule. -- Jens Axboe
Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices
On Wed, May 30, 2018 at 2:22 AM, Damien Le Moal wrote: > Jens, > > On 5/26/18 13:01, Jens Axboe wrote: >> On 5/25/18 4:18 PM, Jeff Moyer wrote: >>> Hi, Jens, >>> >>> Jens Axboe writes: >>> On 5/25/18 3:14 PM, Jeff Moyer wrote: > Bryan Gurney reported I/O errors when using dm-zoned with a host-managed > SMR device. It turns out he was using CFQ, which is the default. > Unfortunately, as of v4.16, only the deadline schedulers work well with > host-managed SMR devices. This series attempts to switch the elevator > to deadline for those devices. > > NOTE: I'm not super happy with setting up one iosched and then > immediately tearing it down. I'm open to suggestions on better ways > to accomplish this goal. Let's please not do this; a few years ago I finally managed to kill drivers changing the scheduler manually. Why can't this go into a udev (or similar) rule? That's where it belongs, imho. >>> >>> We could do that. The downside is that distros will have to pick up >>> udev rules, which they haven't done yet, and the udev rules will have to >>> be kernel version dependent. And then later, when this restriction is >>> lifted, we'll have to update the udev rules. That also sounds awful to >>> me. >> >> They only have to be feature dependent, which isn't that hard. And if I >> had to pick between a kernel upgrade and a udev rule package update, the >> choice is pretty clear. > > Currently, a queue elevator is first initialized with a call to > elevator_init() from either blk_mq_init_allocated_queue() or > blk_init_allocated_queue(). When these are called, the device has not > yet been probed and no information is available. Do you think it is > reasonable to delay the default elevator bring-up to after some > information about the device is obtained? > > That would necessitate splitting elevator_init() into a generic elevator > initialization function setting up the elevator related fields of the > request queue and a second part setting up the default elevator (e.g. 
> elevator_set_default()). Doing so, the function elevator_set_default() > could be called later in the device initialization sequence, after > information from the device has been obtained. It would make choosing a > sane default elevator much cleaner. > > Granted, that is in effect very similar to what Jeff's patches do. But > in a sense, the current code already considers some of the device > characteristics when setting the default elevator. That is, whether the > device is mq or not. Would it be a stretch to add "is_zoned" as another > characteristic that is looked at? > > On another note, I noticed that the only two callers of elevator_init() > are blk_mq_init_allocated_queue() and blk_init_allocated_queue(). Both > call elevator_init() with a NULL name. I wonder about the usefulness of that > argument. Furthermore, elevator_init() is also exported but I do not see > any external module user. Is this on purpose or is it the remnant of > older use of it? I can send a patch cleaning that up if you feel that > needs some cleanup. > >> >>> I understand why you don't like this patch set, but I happen to think >>> the alternative is worse. FYI, in Bryan's case, his system actually got >>> bricked (likely due to buggy firmware). >> >> I disagree, I think the rule approach is much easier. If the wrong write >> location bricked the drive, then I think that user has much larger >> issues... That seems like a trivial issue that should have been caught >> in basic testing, I would not trust that drive with any data if it >> bricks that easily. > > I think that the bug in this particular case was with the BIOS (probably > UEFI), not the disk. Disks only answer trivial requests when probed > on boot, unless the BIOS is too smart for its own good and doing very > weird things. Different issue anyway, not related to the kernel. Actually, the BIOS boot mode was configured to "Legacy" mode (i.e.: non-UEFI) at the time of the bricking. 
I'm pretty confident of this, because after the BIOS reflash, I can see a different rendering mode in the GRUB screen of Fedora 28, and when I checked back in the BIOS setup utility, I saw that the default for the new BIOS image was UEFI. On the test systems in general, I prefer to use legacy BIOS if I don't need UEFI. I'm not sure if this means that I walked into a latent BIOS bug, though. Thanks, Bryan
Re: [bug report] IO blocked after offline CPU during fio on nvme SSD
Thanks for the update; I will give it a try after Keith updates the v4 fix. :) On 05/30/2018 04:29 PM, Ming Lei wrote: On Wed, May 30, 2018 at 3:53 PM, Yi Zhang wrote: Hi I found this issue during my NVMe test on 4.17.0-rc7, here is the reproducer and dmesg log, let me know if you need more info, thanks. Reproduce steps: #!/bin/bash set -x fio -filename=/dev/nvme0n1p1 -iodepth=1 -thread -rw=randwrite -ioengine=psync -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 -direct=1 -runtime=120 -size=-group_reporting -name=mytest -numjobs=60 & sleep 10 echo 0 > /sys/devices/system/cpu/cpu1/online echo 0 > /sys/devices/system/cpu/cpu2/online echo 0 > /sys/devices/system/cpu/cpu3/online dmesg log: [ 153.402709] IRQ 82: no longer affine to CPU1 [ 153.408596] smpboot: CPU 1 is now offline [ 153.426422] IRQ 86: no longer affine to CPU2 [ 153.431204] IRQ 89: no longer affine to CPU2 [ 153.437054] smpboot: CPU 2 is now offline [ 153.682095] IRQ 83: no longer affine to CPU3 [ 153.686866] IRQ 85: no longer affine to CPU3 [ 153.692712] smpboot: CPU 3 is now offline [ 183.847661] nvme nvme0: I/O 417 QID 2 timeout, completion polled [ 183.854410] nvme nvme0: I/O 251 QID 7 timeout, completion polled [ 183.861133] nvme nvme0: I/O 252 QID 7 timeout, completion polled [ 183.867851] nvme nvme0: I/O 253 QID 7 timeout, aborting [ 183.873699] nvme nvme0: I/O 254 QID 7 timeout, aborting [ 183.879530] nvme nvme0: Abort status: 0x0 [ 183.884017] nvme nvme0: I/O 255 QID 7 timeout, aborting [ 183.889847] nvme nvme0: Abort status: 0x0 [ 183.894338] nvme nvme0: I/O 495 QID 8 timeout, completion polled [ 183.901033] nvme nvme0: Abort status: 0x0 [ 183.905520] nvme nvme0: I/O 496 QID 8 timeout, aborting [ 183.911359] nvme nvme0: I/O 497 QID 8 timeout, aborting [ 183.917190] nvme nvme0: Abort status: 0x0 [ 183.921677] nvme nvme0: I/O 498 QID 8 timeout, aborting [ 183.927506] nvme nvme0: Abort status: 0x0 [ 183.931994] nvme nvme0: I/O 499 QID 8 timeout, aborting [ 183.937825] nvme 
nvme0: Abort status: 0x0 [ 183.942311] nvme nvme0: I/O 500 QID 8 timeout, aborting [ 183.948141] nvme nvme0: Abort status: 0x0 [ 183.952637] nvme nvme0: Abort status: 0x0 [ 214.057408] nvme nvme0: I/O 253 QID 7 timeout, reset controller [ 216.422702] nvme nvme0: I/O 254 QID 7 timeout, disable controller [ 216.438524] nvme nvme0: I/O 255 QID 7 timeout, disable controller [ 216.453428] nvme nvme0: I/O 496 QID 8 timeout, disable controller [ 216.469426] nvme nvme0: I/O 497 QID 8 timeout, disable controller [ 216.484435] nvme nvme0: I/O 498 QID 8 timeout, disable controller [ 216.496417] nvme nvme0: I/O 499 QID 8 timeout, disable controller [ 216.511418] nvme nvme0: I/O 500 QID 8 timeout, disable controller [ 249.830165] nvme nvme0: I/O 253 QID 7 timeout, completion polled [ 249.836901] nvme nvme0: I/O 254 QID 7 timeout, disable controller [ 249.858307] nvme nvme0: I/O 255 QID 7 timeout, disable controller [ 249.874172] nvme nvme0: I/O 496 QID 8 timeout, disable controller [ 249.887164] nvme nvme0: I/O 497 QID 8 timeout, disable controller [ 249.903175] nvme nvme0: I/O 498 QID 8 timeout, disable controller [ 249.918171] nvme nvme0: I/O 499 QID 8 timeout, disable controller [ 249.934172] nvme nvme0: I/O 500 QID 8 timeout, disable controller [ 369.695234] INFO: task kworker/u25:8:472 blocked for more than 120 seconds. [ 369.703012] Not tainted 4.17.0-rc7 #1 [ 369.707687] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 369.716433] kworker/u25:8 D0 472 2 0x8000 [ 369.716443] Workqueue: nvme-reset-wq nvme_reset_work [nvme] [ 369.716445] Call Trace: [ 369.719189] ? __schedule+0x290/0x870 [ 369.723280] schedule+0x32/0x80 [ 369.726791] blk_mq_freeze_queue_wait+0x46/0xb0 [ 369.731853] ? remove_wait_queue+0x60/0x60 [ 369.736435] nvme_wait_freeze+0x2e/0x50 [nvme_core] [ 369.741876] nvme_reset_work+0x819/0xd83 [nvme] [ 369.746942] ? 
__switch_to+0xa8/0x4f0 [ 369.751035] process_one_work+0x158/0x360 [ 369.755513] worker_thread+0x47/0x3e0 [ 369.759604] kthread+0xf8/0x130 [ 369.763114] ? max_active_store+0x80/0x80 [ 369.767590] ? kthread_bind+0x10/0x10 [ 369.771673] ret_from_fork+0x35/0x40 [ 369.775687] INFO: task fio:1826 blocked for more than 120 seconds. [ 369.782588] Not tainted 4.17.0-rc7 #1 [ 369.787259] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 369.796002] fio D0 1826 1 0x0080 [ 369.796004] Call Trace: [ 369.798736] ? __schedule+0x290/0x870 [ 369.802826] schedule+0x32/0x80 [ 369.806338] io_schedule+0x12/0x40 [ 369.810139] __blkdev_direct_IO_simple+0x1d3/0x310 [ 369.815493] ? bdput+0x10/0x10 [ 369.818905] blkdev_direct_IO+0x38e/0x3f0 [ 369.823383] ? __blkdev_direct_IO_simple+0x254/0x310 [ 369.828929] generic_file_direct_write+0xcc/0x170 [ 369.834187] __generic_file_write_iter+0xb7/0x1c0 [ 369.839441] blkdev_write_iter+0x9b/0x120 [ 369.8
Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices
While I really don't want drivers to change the I/O scheduler themselves, we have a class of devices (zoned) that don't work at all with certain I/O schedulers. The kernel not choosing something sane and requiring user workarounds is just silly.
Re: [PATCH 0/2][RFC] block: default to deadline for SMR devices
On Wed, May 30, 2018 at 06:22:04AM +, Damien Le Moal wrote: > That would necessitate splitting elevator_init() into a generic elevator > initialization function setting up the elevator related fields of the > request queue and a second part setting up the default elevator (e.g. > elevator_set_default()). Doing so, the function elevator_set_default() > could be called later in the device initialization sequence, after > information from the device has been obtained. It would make choosing a > sane default elevator much cleaner. For blk-mq this makes complete sense, untested series here: http://git.infradead.org/users/hch/block.git/shortlog/refs/heads/elevator_init For the legacy case we always need at least some I/O scheduler, so we can't just defer it. But I think we could still switch to deadline in blk_register_queue if we find a zoned device.
Re: [bug report] IO blocked after offline CPU during fio on nvme SSD
On Wed, May 30, 2018 at 3:53 PM, Yi Zhang wrote: > Hi > I found this issue during my NVMe test on 4.17.0-rc7, here is the reproducer > and dmesg log, let me know if you need more info, thanks. > > Reproduce steps: > #!/bin/bash > set -x > fio -filename=/dev/nvme0n1p1 -iodepth=1 -thread -rw=randwrite -ioengine=psync > -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 > -direct=1 -runtime=120 -size=-group_reporting -name=mytest -numjobs=60 & > sleep 10 > echo 0 > /sys/devices/system/cpu/cpu1/online > echo 0 > /sys/devices/system/cpu/cpu2/online > echo 0 > /sys/devices/system/cpu/cpu3/online > > dmesg log: > [ 153.402709] IRQ 82: no longer affine to CPU1 > [ 153.408596] smpboot: CPU 1 is now offline > [ 153.426422] IRQ 86: no longer affine to CPU2 > [ 153.431204] IRQ 89: no longer affine to CPU2 > [ 153.437054] smpboot: CPU 2 is now offline > [ 153.682095] IRQ 83: no longer affine to CPU3 > [ 153.686866] IRQ 85: no longer affine to CPU3 > [ 153.692712] smpboot: CPU 3 is now offline > [ 183.847661] nvme nvme0: I/O 417 QID 2 timeout, completion polled > [ 183.854410] nvme nvme0: I/O 251 QID 7 timeout, completion polled > [ 183.861133] nvme nvme0: I/O 252 QID 7 timeout, completion polled > [ 183.867851] nvme nvme0: I/O 253 QID 7 timeout, aborting > [ 183.873699] nvme nvme0: I/O 254 QID 7 timeout, aborting > [ 183.879530] nvme nvme0: Abort status: 0x0 > [ 183.884017] nvme nvme0: I/O 255 QID 7 timeout, aborting > [ 183.889847] nvme nvme0: Abort status: 0x0 > [ 183.894338] nvme nvme0: I/O 495 QID 8 timeout, completion polled > [ 183.901033] nvme nvme0: Abort status: 0x0 > [ 183.905520] nvme nvme0: I/O 496 QID 8 timeout, aborting > [ 183.911359] nvme nvme0: I/O 497 QID 8 timeout, aborting > [ 183.917190] nvme nvme0: Abort status: 0x0 > [ 183.921677] nvme nvme0: I/O 498 QID 8 timeout, aborting > [ 183.927506] nvme nvme0: Abort status: 0x0 > [ 183.931994] nvme nvme0: I/O 499 QID 8 timeout, aborting > [ 183.937825] nvme nvme0: Abort status: 0x0 > [ 
183.942311] nvme nvme0: I/O 500 QID 8 timeout, aborting > [ 183.948141] nvme nvme0: Abort status: 0x0 > [ 183.952637] nvme nvme0: Abort status: 0x0 > [ 214.057408] nvme nvme0: I/O 253 QID 7 timeout, reset controller > [ 216.422702] nvme nvme0: I/O 254 QID 7 timeout, disable controller > [ 216.438524] nvme nvme0: I/O 255 QID 7 timeout, disable controller > [ 216.453428] nvme nvme0: I/O 496 QID 8 timeout, disable controller > [ 216.469426] nvme nvme0: I/O 497 QID 8 timeout, disable controller > [ 216.484435] nvme nvme0: I/O 498 QID 8 timeout, disable controller > [ 216.496417] nvme nvme0: I/O 499 QID 8 timeout, disable controller > [ 216.511418] nvme nvme0: I/O 500 QID 8 timeout, disable controller > [ 249.830165] nvme nvme0: I/O 253 QID 7 timeout, completion polled > [ 249.836901] nvme nvme0: I/O 254 QID 7 timeout, disable controller > [ 249.858307] nvme nvme0: I/O 255 QID 7 timeout, disable controller > [ 249.874172] nvme nvme0: I/O 496 QID 8 timeout, disable controller > [ 249.887164] nvme nvme0: I/O 497 QID 8 timeout, disable controller > [ 249.903175] nvme nvme0: I/O 498 QID 8 timeout, disable controller > [ 249.918171] nvme nvme0: I/O 499 QID 8 timeout, disable controller > [ 249.934172] nvme nvme0: I/O 500 QID 8 timeout, disable controller > [ 369.695234] INFO: task kworker/u25:8:472 blocked for more than 120 seconds. > [ 369.703012] Not tainted 4.17.0-rc7 #1 > [ 369.707687] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables > this message. > [ 369.716433] kworker/u25:8 D0 472 2 0x8000 > [ 369.716443] Workqueue: nvme-reset-wq nvme_reset_work [nvme] > [ 369.716445] Call Trace: > [ 369.719189] ? __schedule+0x290/0x870 > [ 369.723280] schedule+0x32/0x80 > [ 369.726791] blk_mq_freeze_queue_wait+0x46/0xb0 > [ 369.731853] ? remove_wait_queue+0x60/0x60 > [ 369.736435] nvme_wait_freeze+0x2e/0x50 [nvme_core] > [ 369.741876] nvme_reset_work+0x819/0xd83 [nvme] > [ 369.746942] ? 
__switch_to+0xa8/0x4f0 > [ 369.751035] process_one_work+0x158/0x360 > [ 369.755513] worker_thread+0x47/0x3e0 > [ 369.759604] kthread+0xf8/0x130 > [ 369.763114] ? max_active_store+0x80/0x80 > [ 369.767590] ? kthread_bind+0x10/0x10 > [ 369.771673] ret_from_fork+0x35/0x40 > [ 369.775687] INFO: task fio:1826 blocked for more than 120 seconds. > [ 369.782588] Not tainted 4.17.0-rc7 #1 > [ 369.787259] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables > this message. > [ 369.796002] fio D0 1826 1 0x0080 > [ 369.796004] Call Trace: > [ 369.798736] ? __schedule+0x290/0x870 > [ 369.802826] schedule+0x32/0x80 > [ 369.806338] io_schedule+0x12/0x40 > [ 369.810139] __blkdev_direct_IO_simple+0x1d3/0x310 > [ 369.815493] ? bdput+0x10/0x10 > [ 369.818905] blkdev_direct_IO+0x38e/0x3f0 > [ 369.823383] ? __blkdev_direct_IO_simple+0x254/0x310 > [ 369.828929] generic_file_direct_write+0xcc/0x170 > [ 369.834187] __generic_file_w
[bug report] IO blocked after offline CPU during fio on nvme SSD
Hi I found this issue during my NVMe test on 4.17.0-rc7, here is the reproducer and dmesg log, let me know if you need more info, thanks. Reproduce steps: #!/bin/bash set -x fio -filename=/dev/nvme0n1p1 -iodepth=1 -thread -rw=randwrite -ioengine=psync -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 -direct=1 -runtime=120 -size=-group_reporting -name=mytest -numjobs=60 & sleep 10 echo 0 > /sys/devices/system/cpu/cpu1/online echo 0 > /sys/devices/system/cpu/cpu2/online echo 0 > /sys/devices/system/cpu/cpu3/online dmesg log: [ 153.402709] IRQ 82: no longer affine to CPU1 [ 153.408596] smpboot: CPU 1 is now offline [ 153.426422] IRQ 86: no longer affine to CPU2 [ 153.431204] IRQ 89: no longer affine to CPU2 [ 153.437054] smpboot: CPU 2 is now offline [ 153.682095] IRQ 83: no longer affine to CPU3 [ 153.686866] IRQ 85: no longer affine to CPU3 [ 153.692712] smpboot: CPU 3 is now offline [ 183.847661] nvme nvme0: I/O 417 QID 2 timeout, completion polled [ 183.854410] nvme nvme0: I/O 251 QID 7 timeout, completion polled [ 183.861133] nvme nvme0: I/O 252 QID 7 timeout, completion polled [ 183.867851] nvme nvme0: I/O 253 QID 7 timeout, aborting [ 183.873699] nvme nvme0: I/O 254 QID 7 timeout, aborting [ 183.879530] nvme nvme0: Abort status: 0x0 [ 183.884017] nvme nvme0: I/O 255 QID 7 timeout, aborting [ 183.889847] nvme nvme0: Abort status: 0x0 [ 183.894338] nvme nvme0: I/O 495 QID 8 timeout, completion polled [ 183.901033] nvme nvme0: Abort status: 0x0 [ 183.905520] nvme nvme0: I/O 496 QID 8 timeout, aborting [ 183.911359] nvme nvme0: I/O 497 QID 8 timeout, aborting [ 183.917190] nvme nvme0: Abort status: 0x0 [ 183.921677] nvme nvme0: I/O 498 QID 8 timeout, aborting [ 183.927506] nvme nvme0: Abort status: 0x0 [ 183.931994] nvme nvme0: I/O 499 QID 8 timeout, aborting [ 183.937825] nvme nvme0: Abort status: 0x0 [ 183.942311] nvme nvme0: I/O 500 QID 8 timeout, aborting [ 183.948141] nvme nvme0: Abort status: 0x0 [ 183.952637] nvme nvme0: Abort 
status: 0x0 [ 214.057408] nvme nvme0: I/O 253 QID 7 timeout, reset controller [ 216.422702] nvme nvme0: I/O 254 QID 7 timeout, disable controller [ 216.438524] nvme nvme0: I/O 255 QID 7 timeout, disable controller [ 216.453428] nvme nvme0: I/O 496 QID 8 timeout, disable controller [ 216.469426] nvme nvme0: I/O 497 QID 8 timeout, disable controller [ 216.484435] nvme nvme0: I/O 498 QID 8 timeout, disable controller [ 216.496417] nvme nvme0: I/O 499 QID 8 timeout, disable controller [ 216.511418] nvme nvme0: I/O 500 QID 8 timeout, disable controller [ 249.830165] nvme nvme0: I/O 253 QID 7 timeout, completion polled [ 249.836901] nvme nvme0: I/O 254 QID 7 timeout, disable controller [ 249.858307] nvme nvme0: I/O 255 QID 7 timeout, disable controller [ 249.874172] nvme nvme0: I/O 496 QID 8 timeout, disable controller [ 249.887164] nvme nvme0: I/O 497 QID 8 timeout, disable controller [ 249.903175] nvme nvme0: I/O 498 QID 8 timeout, disable controller [ 249.918171] nvme nvme0: I/O 499 QID 8 timeout, disable controller [ 249.934172] nvme nvme0: I/O 500 QID 8 timeout, disable controller [ 369.695234] INFO: task kworker/u25:8:472 blocked for more than 120 seconds. [ 369.703012] Not tainted 4.17.0-rc7 #1 [ 369.707687] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 369.716433] kworker/u25:8 D0 472 2 0x8000 [ 369.716443] Workqueue: nvme-reset-wq nvme_reset_work [nvme] [ 369.716445] Call Trace: [ 369.719189] ? __schedule+0x290/0x870 [ 369.723280] schedule+0x32/0x80 [ 369.726791] blk_mq_freeze_queue_wait+0x46/0xb0 [ 369.731853] ? remove_wait_queue+0x60/0x60 [ 369.736435] nvme_wait_freeze+0x2e/0x50 [nvme_core] [ 369.741876] nvme_reset_work+0x819/0xd83 [nvme] [ 369.746942] ? __switch_to+0xa8/0x4f0 [ 369.751035] process_one_work+0x158/0x360 [ 369.755513] worker_thread+0x47/0x3e0 [ 369.759604] kthread+0xf8/0x130 [ 369.763114] ? max_active_store+0x80/0x80 [ 369.767590] ? 
kthread_bind+0x10/0x10 [ 369.771673] ret_from_fork+0x35/0x40 [ 369.775687] INFO: task fio:1826 blocked for more than 120 seconds. [ 369.782588] Not tainted 4.17.0-rc7 #1 [ 369.787259] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 369.796002] fio D0 1826 1 0x0080 [ 369.796004] Call Trace: [ 369.798736] ? __schedule+0x290/0x870 [ 369.802826] schedule+0x32/0x80 [ 369.806338] io_schedule+0x12/0x40 [ 369.810139] __blkdev_direct_IO_simple+0x1d3/0x310 [ 369.815493] ? bdput+0x10/0x10 [ 369.818905] blkdev_direct_IO+0x38e/0x3f0 [ 369.823383] ? __blkdev_direct_IO_simple+0x254/0x310 [ 369.828929] generic_file_direct_write+0xcc/0x170 [ 369.834187] __generic_file_write_iter+0xb7/0x1c0 [ 369.839441] blkdev_write_iter+0x9b/0x120 [ 369.843923] __vfs_write+0x100/0x180 [ 369.847917] vfs_write+0xad/0x1a0 [ 369.851622] ? __audit_syscall_exit+0x213/0x330 [ 369.856682] ksys_pwrite64+0x62/0x90 [ 3
blktests block/019 lead system hang
Hi Keith, I found that blktests block/019 can also hang my NVMe server with 4.17.0-rc7; let me know if you need more info, thanks. Server: Dell R730xd NVMe SSD: 85:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 172X (rev 01) Console log: Kernel 4.17.0-rc7 on an x86_64 storageqe-62 login: [ 6043.121834] run blktests block/019 at 2018-05-30 03:16:34 [ 6049.108476] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 3 [ 6049.108478] {1}[Hardware Error]: event severity: fatal [ 6049.108479] {1}[Hardware Error]: Error 0, type: fatal [ 6049.108481] {1}[Hardware Error]: section_type: PCIe error [ 6049.108482] {1}[Hardware Error]: port_type: 6, downstream switch port [ 6049.108483] {1}[Hardware Error]: version: 1.16 [ 6049.108484] {1}[Hardware Error]: command: 0x0407, status: 0x0010 [ 6049.108485] {1}[Hardware Error]: device_id: :83:05.0 [ 6049.108486] {1}[Hardware Error]: slot: 0 [ 6049.108487] {1}[Hardware Error]: secondary_bus: 0x85 [ 6049.108488] {1}[Hardware Error]: vendor_id: 0x10b5, device_id: 0x8734 [ 6049.108489] {1}[Hardware Error]: class_code: 000406 [ 6049.108489] {1}[Hardware Error]: bridge: secondary_status: 0x, control: 0x0003 [ 6049.108491] Kernel panic - not syncing: Fatal hardware error! [ 6049.108514] Kernel Offset: 0x2580 from 0x8100 (relocation range: 0x8000-0xbfff) Best Regards, Yi Zhang