[GIT PULL] Block changes for 4.17-rc
Hi Linus,

This is the main pull request for block for 4.17. It's a pretty quiet
round this time, which is nice. This pull request contains:

- Series from Bart, cleaning up the way we set/test/clear atomic queue
  flags.

- Series from Bart, fixing races between gendisk and queue registration
  and removal.

- Set of bcache fixes and improvements from various folks, by way of
  Michael Lyle.

- Set of lightnvm updates from Matias, most of it being the 1.2 to 2.0
  transition.

- Removal of unused DIO flags from Nikolay.

- blk-mq/sbitmap memory ordering fixes from Omar.

- Divide-by-zero fix for BFQ from Paolo.

- Minor documentation patches from Randy.

- Timeout fix from Tejun.

- Alpha "can't write a char atomically" fix from Mikulas.

- Set of NVMe fixes by way of Keith.

- bsg and bsg-lib improvements from Christoph.

- A few sed-opal fixes from Jonas.

- cdrom check-disk-change deadlock fix from Maurizio.

- Various little fixes, comment fixes, etc from various folks.

You'll get a merge conflict that you need to resolve in
drivers/nvme/host/core.c - I've included the resolution at the end of
the email. Also see:

http://git.kernel.dk/cgit/linux-block/log/?h=for-4.17/merged

if you wish, that's the above pull request merged with master.

Please pull!
git://git.kernel.dk/linux-block.git tags/for-4.17/block-20180402

Anshuman Khandual (1):
      lib/scatterlist: Add SG_CHAIN and SG_END macros for LSB encodings

Arnd Bergmann (2):
      misc: rtsx: rename SG_END macro
      staging: rts5208: rename SG_END macro

Bart Van Assche (31):
      blk-mq-debugfs: Reorder queue show and store methods
      blk-mq-debugfs: Show zone locking information
      block/loop: Delete gendisk before cleaning up the request queue
      md: Delete gendisk before cleaning up the request queue
      zram: Delete gendisk before cleaning up the request queue
      block: Add 'lock' as third argument to blk_alloc_queue_node()
      block: Fix a race between the cgroup code and request queue initialization
      block: Fix a race between request queue removal and the block cgroup controller
      block: Reorder the queue flag manipulation function definitions
      block: Use the queue_flag_*() functions instead of open-coding these
      block: Introduce blk_queue_flag_{set,clear,test_and_{set,clear}}()
      block: Protect queue flag changes with the queue lock
      mtip32xx: Use the blk_queue_flag_*() functions
      bcache: Use the blk_queue_flag_{set,clear}() functions
      iscsi: Use blk_queue_flag_set()
      target/tcm_loop: Use blk_queue_flag_set()
      block: Use blk_queue_flag_*() in drivers instead of queue_flag_*()
      block: Complain if queue_flag_(set|clear)_unlocked() is abused
      block: Move the queue_flag_*() functions from a public into a private header file
      block: Suppress kernel-doc warnings triggered by blk-zoned.c
      blk-mq-debugfs: Show more request state information
      block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into
      bcache: Fix indentation
      bcache: Add __printf annotation to __bch_check_keys()
      bcache: Annotate switch fall-through
      bcache: Fix kernel-doc warnings
      bcache: Remove an unused variable
      bcache: Suppress more warnings about set-but-not-used variables
      bcache: Reduce the number of sparse complaints about lock imbalances
      bcache: Fix a compiler warning in bcache_device_init()
      block: Change a rcu_read_{lock,unlock}_sched() pair into rcu_read_{lock,unlock}()

Chengguang Xu (1):
      bcache: move closure debug file into debug directory

Christoph Hellwig (6):
      bsg-lib: introduce a timeout field in struct bsg_job
      bsg-lib: remove bsg_job.req
      bsg: split handling of SCSI CDBs vs transport requeues
      block: bio_check_eod() needs to consider partitions
      nvmet: refactor configfs transport type handling
      nvmet: constify struct nvmet_fabrics_ops

Coly Li (7):
      bcache: fix cached_dev->count usage for bch_cache_set_error()
      bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set
      bcache: stop dc->writeback_rate_update properly
      bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags
      bcache: add stop_when_cache_set_failed option to backing device
      bcache: add backing_request_endio() for bi_end_io
      bcache: add io_disable to struct cached_dev

Dan Carpenter (1):
      lightnvm: pblk: remove some unnecessary NULL checks

Hans Holmberg (8):
      lightnvm: pblk: handle bad sectors in the emeta area correctly
      lightnvm: pblk: check data lines version on recovery
      lightnvm: pblk: export write amplification counters to sysfs
      lightnvm: pblk: add padding distribution sysfs attribute
      lightnvm: pblk: delete writer kick timer before stopping thread
      lightnvm: pblk: allow allocation of new lines during shutdown
      lightnvm:
Re: [PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution
On Mon, 2018-04-02 at 15:16 -0700, t...@kernel.org wrote:
> AFAIK, there's one non-critical race condition which has always been
> there. We have a larger race window for that case but don't yet know
> whether that's problematic or not. If that actually is problematic, we
> can figure out a way to solve that but such effort / added complexity
> doesn't seem justified yet. No?

Hello Tejun,

Some important block drivers systematically return BLK_EH_RESET_TIMER
from their timeout handler, e.g. the virtio-scsi initiator driver. What
will the consequence be for these drivers if a blk_mq_complete_request()
call is ignored? Will the request for which the completion was ignored
get stuck?

Thanks,

Bart.
Re: [PATCH] blk-mq: Directly schedule q->timeout_work when aborting a request
On 4/2/18 4:04 PM, Tejun Heo wrote:
> Request abortion is performed by overriding deadline to now and
> scheduling timeout handling immediately. For the latter part, the
> code was using mod_timer(timeout, 0) which can't guarantee that the
> timer runs afterwards. Let's schedule the underlying work item
> directly instead.
>
> This fixes the hangs during probing reported by Sitsofe but it isn't
> yet clear to me how the failure can happen reliably if it's just the
> above described race condition.
>
> Signed-off-by: Tejun Heo
> Reported-by: Sitsofe Wheeler
> Reported-by: Meelis Roos
> Fixes: 358f70da49d7 ("blk-mq: make blk_abort_request() trigger timeout path")
> Cc: sta...@vger.kernel.org # v4.16
> Link: http://lkml.kernel.org/r/CALjAwxh-PVYFnYFCJpGOja+m5SzZ8Sa4J7ohxdK=r8nyof-...@mail.gmail.com
> Link: http://lkml.kernel.org/r/alpine.lrh.2.21.1802261049140.4...@math.ut.ee
> ---
> Hello,
>
> I don't have the full explanation yet but here's a preliminary patch.
>
> Thanks.
>
>  block/blk-timeout.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/block/blk-timeout.c b/block/blk-timeout.c
> index a05e367..f0e6e41 100644
> --- a/block/blk-timeout.c
> +++ b/block/blk-timeout.c
> @@ -165,7 +165,7 @@ void blk_abort_request(struct request *req)
>  		 * No need for fancy synchronizations.
>  		 */
>  		blk_rq_set_deadline(req, jiffies);
> -		mod_timer(&req->q->timeout, 0);
> +		kblockd_schedule_work(&req->q->timeout_work);
>  	} else {
>  		if (blk_mark_rq_complete(req))
>  			return;

In any case, it's cleaner than relying on mod_timer(.., 0). If that
doesn't guarantee that the timer runs again, I can see how a race with
the running timer could prevent us from seeing the timeout after an
abort. I'll apply this, thanks.

-- 
Jens Axboe
Re: [PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution
Hello,

On Mon, Apr 02, 2018 at 10:09:18PM +0000, Bart Van Assche wrote:
> Please elaborate what your long-term goal is for the blk-mq timeout
> handler.

Hmm... I don't really have any plans beyond what's been posted.

> The legacy block layer suspends request state changes while a timeout
> is being processed by holding the request queue lock while requests are
> being processed, while processing a timeout and while calling
> q->rq_timed_out_fn(rq). Do you think it is possible to make the blk-mq
> core suspend request processing while processing a timeout without
> introducing locking in blk_mq_complete_request()? If you do not plan to
> add locking in blk_mq_complete_request(), do you think it is possible
> to fix all the races we discussed in previous e-mails?

I don't know of multiple race conditions. What am I missing? AFAIK,
there's one non-critical race condition which has always been there.
We have a larger race window for that case but don't yet know whether
that's problematic or not. If that actually is problematic, we can
figure out a way to solve that but such effort / added complexity
doesn't seem justified yet. No?

Thanks.

-- 
tejun
Re: [PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution
On Mon, 2018-04-02 at 15:01 -0700, t...@kernel.org wrote:
> On Mon, Apr 02, 2018 at 09:56:41PM +0000, Bart Van Assche wrote:
> > This patch increases the time during which .aborted_gstate == .gstate
> > if the timeout is reset. Does that increase the chance that a
> > completion will be missed if the request timeout is reset?
>
> It sure does, but we're comparing an outright kernel bug vs. an
> inherently opportunistic mechanism being a bit more lossy. I think
> the answer is pretty clear.

Hello Tejun,

Please elaborate what your long-term goal is for the blk-mq timeout
handler. The legacy block layer suspends request state changes while a
timeout is being processed by holding the request queue lock while
requests are being processed, while processing a timeout and while
calling q->rq_timed_out_fn(rq). Do you think it is possible to make the
blk-mq core suspend request processing while processing a timeout
without introducing locking in blk_mq_complete_request()? If you do not
plan to add locking in blk_mq_complete_request(), do you think it is
possible to fix all the races we discussed in previous e-mails?

Thanks,

Bart.
Re: 4.16-rc2+git: pata_serverworks: hanging ata detection thread on HP DL380G3
Hello, Meelis.

Can you please verify whether the following patch fixes the problem?

Thanks.

Subject: blk-mq: Directly schedule q->timeout_work when aborting a request

Request abortion is performed by overriding deadline to now and
scheduling timeout handling immediately. For the latter part, the
code was using mod_timer(timeout, 0) which can't guarantee that the
timer runs afterwards. Let's schedule the underlying work item
directly instead.

This fixes the hangs during probing reported by Sitsofe but it isn't
yet clear to me how the failure can happen reliably if it's just the
above described race condition.

Signed-off-by: Tejun Heo
Reported-by: Sitsofe Wheeler
Reported-by: Meelis Roos
---
 block/blk-timeout.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index a05e367..f0e6e41 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -165,7 +165,7 @@ void blk_abort_request(struct request *req)
 		 * No need for fancy synchronizations.
 		 */
 		blk_rq_set_deadline(req, jiffies);
-		mod_timer(&req->q->timeout, 0);
+		kblockd_schedule_work(&req->q->timeout_work);
 	} else {
 		if (blk_mark_rq_complete(req))
 			return;
[PATCH] blk-mq: Directly schedule q->timeout_work when aborting a request
Request abortion is performed by overriding deadline to now and
scheduling timeout handling immediately. For the latter part, the
code was using mod_timer(timeout, 0) which can't guarantee that the
timer runs afterwards. Let's schedule the underlying work item
directly instead.

This fixes the hangs during probing reported by Sitsofe but it isn't
yet clear to me how the failure can happen reliably if it's just the
above described race condition.

Signed-off-by: Tejun Heo
Reported-by: Sitsofe Wheeler
Reported-by: Meelis Roos
Fixes: 358f70da49d7 ("blk-mq: make blk_abort_request() trigger timeout path")
Cc: sta...@vger.kernel.org # v4.16
Link: http://lkml.kernel.org/r/CALjAwxh-PVYFnYFCJpGOja+m5SzZ8Sa4J7ohxdK=r8nyof-...@mail.gmail.com
Link: http://lkml.kernel.org/r/alpine.lrh.2.21.1802261049140.4...@math.ut.ee
---
Hello,

I don't have the full explanation yet but here's a preliminary patch.

Thanks.

 block/blk-timeout.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index a05e367..f0e6e41 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -165,7 +165,7 @@ void blk_abort_request(struct request *req)
 		 * No need for fancy synchronizations.
 		 */
 		blk_rq_set_deadline(req, jiffies);
-		mod_timer(&req->q->timeout, 0);
+		kblockd_schedule_work(&req->q->timeout_work);
 	} else {
 		if (blk_mark_rq_complete(req))
 			return;
Re: [PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution
Hello,

On Mon, Apr 02, 2018 at 09:56:41PM +0000, Bart Van Assche wrote:
> This patch increases the time during which .aborted_gstate == .gstate
> if the timeout is reset. Does that increase the chance that a
> completion will be missed if the request timeout is reset?

It sure does, but we're comparing an outright kernel bug vs. an
inherently opportunistic mechanism being a bit more lossy. I think
the answer is pretty clear.

Thanks.

-- 
tejun
Re: [PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution
On Mon, 2018-04-02 at 14:10 -0700, Tejun Heo wrote:
> On Mon, Apr 02, 2018 at 02:08:37PM -0700, Bart Van Assche wrote:
> > On 04/02/18 12:01, Tejun Heo wrote:
> > > +	 * As nothing prevents from completion happening while
> > > +	 * ->aborted_gstate is set, this may lead to ignored completions
> > > +	 * and further spurious timeouts.
> > > +	 */
> > > +	if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
> > > +		blk_mq_rq_update_aborted_gstate(rq, 0);
> >
> > Hello Tejun,
> >
> > Since this patch fixes one race but introduces another race, is this
> > patch really an improvement?
>
> Oh, that's not a new race. That's the same non-critical race which
> always existed. It's just being documented.

This patch increases the time during which .aborted_gstate == .gstate if
the timeout is reset. Does that increase the chance that a completion
will be missed if the request timeout is reset?

Thanks,

Bart.
Re: [PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution
Hello,

On Mon, Apr 02, 2018 at 09:31:34PM +0000, Bart Van Assche wrote:
> > > > +	 * As nothing prevents from completion happening while
> > > > +	 * ->aborted_gstate is set, this may lead to ignored completions
> > > > +	 * and further spurious timeouts.
> > > > +	 */
> > > > +	if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
> > > > +		blk_mq_rq_update_aborted_gstate(rq, 0);
...
> I think it can happen that the above code sees that (rq->rq_flags &
> RQF_MQ_TIMEOUT_RESET) != 0, that blk_mq_start_request() executes the
> following code:
>
>     blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
>     blk_add_timer(rq);
>
> and that subsequently blk_mq_rq_update_aborted_gstate(rq, 0) is called,
> which will cause the next completion to be lost. Is fixing one
> occurrence of a race and reintroducing it in another code path really
> an improvement?

I'm not following at all. How would blk_mq_start_request() get called
on the request while it's still owned by the timeout handler? That
gstate clearing is what transfers the ownership back to the non-timeout
path.

Thanks.

-- 
tejun
Re: [PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution
On Mon, 2018-04-02 at 14:10 -0700, Tejun Heo wrote:
> On Mon, Apr 02, 2018 at 02:08:37PM -0700, Bart Van Assche wrote:
> > On 04/02/18 12:01, Tejun Heo wrote:
> > > +	 * As nothing prevents from completion happening while
> > > +	 * ->aborted_gstate is set, this may lead to ignored completions
> > > +	 * and further spurious timeouts.
> > > +	 */
> > > +	if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
> > > +		blk_mq_rq_update_aborted_gstate(rq, 0);
> >
> > Hello Tejun,
> >
> > Since this patch fixes one race but introduces another race, is this
> > patch really an improvement?
>
> Oh, that's not a new race. That's the same non-critical race which
> always existed. It's just being documented.

Hello Tejun,

I think it can happen that the above code sees that (rq->rq_flags &
RQF_MQ_TIMEOUT_RESET) != 0, that blk_mq_start_request() executes the
following code:

    blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
    blk_add_timer(rq);

and that subsequently blk_mq_rq_update_aborted_gstate(rq, 0) is called,
which will cause the next completion to be lost. Is fixing one
occurrence of a race and reintroducing it in another code path really an
improvement?

Thanks,

Bart.
Re: [PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution
On Mon, Apr 02, 2018 at 02:08:37PM -0700, Bart Van Assche wrote:
> On 04/02/18 12:01, Tejun Heo wrote:
> > +	 * As nothing prevents from completion happening while
> > +	 * ->aborted_gstate is set, this may lead to ignored completions
> > +	 * and further spurious timeouts.
> > +	 */
> > +	if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
> > +		blk_mq_rq_update_aborted_gstate(rq, 0);
>
> Hello Tejun,
>
> Since this patch fixes one race but introduces another race, is this
> patch really an improvement?

Oh, that's not a new race. That's the same non-critical race which
always existed. It's just being documented.

Thanks.

-- 
tejun
Re: [PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution
On 04/02/18 12:01, Tejun Heo wrote:
> +	 * As nothing prevents from completion happening while
> +	 * ->aborted_gstate is set, this may lead to ignored completions
> +	 * and further spurious timeouts.
> +	 */
> +	if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
> +		blk_mq_rq_update_aborted_gstate(rq, 0);

Hello Tejun,

Since this patch fixes one race but introduces another race, is this
patch really an improvement?

Thanks,

Bart.
Re: [BISECTED][REGRESSION] Hang while booting EeePC 900
Hi Tejun,

On 2 April 2018 at 21:29, Tejun Heo wrote:
>
> Can you see whether the following patch makes any difference?
>
> Thanks.
>
> diff --git a/block/blk-timeout.c b/block/blk-timeout.c
> index a05e367..f0e6e41 100644
> --- a/block/blk-timeout.c
> +++ b/block/blk-timeout.c
> @@ -165,7 +165,7 @@ void blk_abort_request(struct request *req)
>  		 * No need for fancy synchronizations.
>  		 */
>  		blk_rq_set_deadline(req, jiffies);
> -		mod_timer(&req->q->timeout, 0);
> +		kblockd_schedule_work(&req->q->timeout_work);
>  	} else {
>  		if (blk_mark_rq_complete(req))
>  			return;

That patch seems to fix the issue.

-- 
Sitsofe | http://sucs.org/~sits/
Re: [PATCH 1/2] blk-mq: Factor out [s]rcu synchronization
On 04/02/18 12:00, Tejun Heo wrote:
> Factor out [s]rcu synchronization in blk_mq_timeout_work() into
> blk_mq_timeout_sync_rcu(). This is to add another user in the future
> and doesn't cause any functional changes.

Reviewed-by: Bart Van Assche
Re: [BISECTED][REGRESSION] Hang while booting EeePC 900
Hello, Sitsofe.

Can you see whether the following patch makes any difference?

Thanks.

diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index a05e367..f0e6e41 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -165,7 +165,7 @@ void blk_abort_request(struct request *req)
 		 * No need for fancy synchronizations.
 		 */
 		blk_rq_set_deadline(req, jiffies);
-		mod_timer(&req->q->timeout, 0);
+		kblockd_schedule_work(&req->q->timeout_work);
 	} else {
 		if (blk_mark_rq_complete(req))
 			return;
[PATCH 1/2] blk-mq: Factor out [s]rcu synchronization
Factor out [s]rcu synchronization in blk_mq_timeout_work() into
blk_mq_timeout_sync_rcu(). This is to add another user in the future
and doesn't cause any functional changes.

Signed-off-by: Tejun Heo
Cc: Bart Van Assche
---
Hello,

We were tracking this down in the following thread

  http://lkml.kernel.org/r/20180207011133.25957-1-bart.vanass...@wdc.com

but lost the reproducer in the middle and couldn't fully verify these
two patches fix the problem; however, the identified race is real and a
bug, so I think it'd be best to apply these two patches.

Given the lack of further reports on this front, I don't think it's
necessary to rush these patches. I think it'd be best to apply these
once the merge window closes. If we need to backport these to -stable,
we can tag them later.

Thanks.

 block/blk-mq.c         | 39 +++
 include/linux/blk-mq.h |  2 +-
 2 files changed, 24 insertions(+), 17 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -876,13 +876,34 @@ static void blk_mq_check_expired(struct
 	    time_after_eq(jiffies, deadline)) {
 		blk_mq_rq_update_aborted_gstate(rq, gstate);
 		data->nr_expired++;
-		hctx->nr_expired++;
+		hctx->need_sync_rcu = true;
 	} else if (!data->next_set || time_after(data->next, deadline)) {
 		data->next = deadline;
 		data->next_set = 1;
 	}
 }
 
+static void blk_mq_timeout_sync_rcu(struct request_queue *q)
+{
+	struct blk_mq_hw_ctx *hctx;
+	bool has_rcu = false;
+	int i;
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		if (!hctx->need_sync_rcu)
+			continue;
+
+		if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+			has_rcu = true;
+		else
+			synchronize_srcu(hctx->srcu);
+
+		hctx->need_sync_rcu = false;
+	}
+	if (has_rcu)
+		synchronize_rcu();
+}
+
 static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
 		struct request *rq, void *priv, bool reserved)
 {
@@ -930,27 +951,13 @@ static void blk_mq_timeout_work(struct w
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
 
 	if (data.nr_expired) {
-		bool has_rcu = false;
-
 		/*
 		 * Wait till everyone sees ->aborted_gstate. The
 		 * sequential waits for SRCUs aren't ideal. If this ever
 		 * becomes a problem, we can add per-hw_ctx rcu_head and
 		 * wait in parallel.
 		 */
-		queue_for_each_hw_ctx(q, hctx, i) {
-			if (!hctx->nr_expired)
-				continue;
-
-			if (!(hctx->flags & BLK_MQ_F_BLOCKING))
-				has_rcu = true;
-			else
-				synchronize_srcu(hctx->srcu);
-
-			hctx->nr_expired = 0;
-		}
-		if (has_rcu)
-			synchronize_rcu();
+		blk_mq_timeout_sync_rcu(q);
 
 		/* terminate the ones we won */
 		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -51,7 +51,7 @@ struct blk_mq_hw_ctx {
 	unsigned int		queue_num;
 
 	atomic_t		nr_active;
-	unsigned int		nr_expired;
+	bool			need_sync_rcu;
 
 	struct hlist_node	cpuhp_dead;
 	struct kobject		kobj;
[PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution
When a request is handed over from normal execution to timeout, we
synchronize using ->aborted_gstate and RCU grace periods; however, when
a request is being returned from timeout handling to normal execution
for BLK_EH_RESET_TIMER, we were skipping the same synchronization. This
means that it theoretically is possible for a returned request's
completion and recycling to compete against the reordered and delayed
writes from the timeout path.

This patch adds an equivalent synchronization when a request is
returned from the timeout path to the normal completion path.

Signed-off-by: Tejun Heo
Cc: Bart Van Assche
---
 block/blk-mq.c         | 49 -
 block/blk-timeout.c    |  2 +-
 include/linux/blkdev.h |  4 +++-
 3 files changed, 44 insertions(+), 11 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -818,7 +818,8 @@ struct blk_mq_timeout_data {
 	unsigned int nr_expired;
 };
 
-static void blk_mq_rq_timed_out(struct request *req, bool reserved)
+static void blk_mq_rq_timed_out(struct blk_mq_hw_ctx *hctx, struct request *req,
+				int *nr_resets, bool reserved)
 {
 	const struct blk_mq_ops *ops = req->q->mq_ops;
 	enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
@@ -833,13 +834,10 @@ static void blk_mq_rq_timed_out(struct r
 		__blk_mq_complete_request(req);
 		break;
 	case BLK_EH_RESET_TIMER:
-		/*
-		 * As nothing prevents from completion happening while
-		 * ->aborted_gstate is set, this may lead to ignored
-		 * completions and further spurious timeouts.
-		 */
-		blk_mq_rq_update_aborted_gstate(req, 0);
 		blk_add_timer(req);
+		req->rq_flags |= RQF_MQ_TIMEOUT_RESET;
+		(*nr_resets)++;
+		hctx->need_sync_rcu = true;
 		break;
 	case BLK_EH_NOT_HANDLED:
 		break;
@@ -916,7 +914,26 @@ static void blk_mq_terminate_expired(str
 	 */
 	if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
 	    READ_ONCE(rq->gstate) == rq->aborted_gstate)
-		blk_mq_rq_timed_out(rq, reserved);
+		blk_mq_rq_timed_out(hctx, rq, priv, reserved);
+}
+
+static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx,
+		struct request *rq, void *priv, bool reserved)
+{
+	/*
+	 * @rq's timer reset has gone through rcu synchronization and is
+	 * visible now. Allow normal completions again by resetting
+	 * ->aborted_gstate. Don't clear RQF_MQ_TIMEOUT_RESET here as
+	 * there's no memory ordering around ->aborted_gstate making it the
+	 * only field safe to update. Let blk_add_timer() clear it later
+	 * when the request is recycled or times out again.
+	 *
+	 * As nothing prevents from completion happening while
+	 * ->aborted_gstate is set, this may lead to ignored completions
+	 * and further spurious timeouts.
+	 */
+	if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
+		blk_mq_rq_update_aborted_gstate(rq, 0);
 }
 
 static void blk_mq_timeout_work(struct work_struct *work)
@@ -951,6 +968,8 @@ static void blk_mq_timeout_work(struct w
 	blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &data);
 
 	if (data.nr_expired) {
+		int nr_resets = 0;
+
 		/*
 		 * Wait till everyone sees ->aborted_gstate. The
 		 * sequential waits for SRCUs aren't ideal. If this ever
@@ -960,7 +979,19 @@ static void blk_mq_timeout_work(struct w
 		blk_mq_timeout_sync_rcu(q);
 
 		/* terminate the ones we won */
-		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
+		blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
+					   &nr_resets);
+
+		/*
+		 * For BLK_EH_RESET_TIMER, release the requests after
+		 * blk_add_timer() from above is visible to avoid timer
+		 * reset racing against recycling.
+		 */
+		if (nr_resets) {
+			blk_mq_timeout_sync_rcu(q);
+			blk_mq_queue_tag_busy_iter(q,
+					blk_mq_finish_timeout_reset, NULL);
+		}
 	}
 
 	if (data.next_set) {
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -216,7 +216,7 @@ void blk_add_timer(struct request *req)
 		req->timeout = q->rq_timeout;
 
 	blk_rq_set_deadline(req, jiffies + req->timeout);
-	req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
+	req->rq_flags &= ~(RQF_MQ_TIMEOUT_EXPIRED | RQF_MQ_TIMEOUT_RESET);
 
 	/*
 	 * Only the non-mq case needs to add the request to a protected list.
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -127,8
Re: Multi-Actuator SAS HDD First Look
On Mon, Apr 2, 2018 at 10:29 AM, Douglas Gilbert wrote:
> On 2018-04-02 11:34 AM, Tim Walker wrote:
>> On Sat, Mar 31, 2018 at 10:52 AM, Douglas Gilbert wrote:
>>> On 2018-03-30 04:01 PM, Bart Van Assche wrote:
>>>> On Fri, 2018-03-30 at 12:36 -0600, Tim Walker wrote:
>>>>> Yes I will be there to discuss the multi-LUN approach. I wanted to
>>>>> get these interface details out so we could have some background
>>>>> and perhaps folks would come with ideas. I don't have much more to
>>>>> put out, but I will certainly answer questions - either on this
>>>>> list or directly.
>>>>
>>>> Hello Tim,
>>>>
>>>> As far as I know the Linux SCSI stack does not yet support any SCSI
>>>> devices for which a single SCSI command affects multiple (two in
>>>> this case) LUNs. Adding such support may take significant work.
>>>> There will also be the disadvantage that most SCSI core contributors
>>>> do not have access to a multi-actuator device and hence that their
>>>> changes may break support for multi-actuator devices.
>>>
>>> Hmmm, INQUIRY (3PC bit) and REPORT LUNS seem to be counter examples
>>> to Bart's assertion. Plus there are a few more that tell you about
>>> things outside the addressed LU, for example the SCSI Ports VPD page
>>> tells you about other SCSI ports (hence other LUs) on the current
>>> device.
>>>
>>> From Tim's command list:
>>>
>>> Device level
>>> 0x0, 0x1: okay
>>> 0x4 (Format unit): yikes, that could be a nasty surprise, accessing
>>>   a file system on the other LU and getting an error "Not ready,
>>>   format in progress"!!
>>> 0x12: standard INQUIRY okay, VPD pages not so much LU id different;
>>>   relative port id, different; target port id different (at the least)
>>> 0x1b (SSU): storage LUs need to know this model, otherwise the logic
>>>   on each LU could get into a duel: "I want it spun up; no, I want it
>>>   spun down ..."
>>> 0x35, 0x37, 0x3b, 0x3c: okay
>>> 0x48 (sanitize): similar to Format unit above
>>> 0x91, 0x4c, 0x4d: okay
>>> MODE SENSE/SELECT(6,10): depends on page, block descriptor needs to
>>>   be partially device level (since LB size may be changed by FU which
>>>   is device level)
>>> rest of device level: okay or (I) don't know
>>> 0xf7: READ UDS DATA, that's interesting, but proprietary I guess
>>>
>>> Perhaps you could add a rider on FU and SAN: they get rejected unless
>>> the other storage LU is in (logical) spun down state.
>>>
>>> LU specific
>>> all okay, I'm hoping READ(6,10,12,16,32) and their WRITE cousins will
>>>   be there also :-) Plus the TMF: LU reset
>>>
>>> Device or LU
>>> all okay
>>>
>>> I'm intrigued by your 3 LU model. My wish list for that:
>>>
>>> LUN 0 would be processor device type (0x3) so it wouldn't confuse the
>>> OS (Linux) that it held storage (READ CAPACITY is mandatory for PDT
>>> 0x0 and cannot represent a 0 block LU) and you could pick and choose
>>> which SCSI commands to implement on it. LUN 0 TUR could report what
>>> the spindle was really doing, SSU could do precisely what it is told
>>> (and SSU on LUNs 1 and 2 would be an "and" for spin down and an "or"
>>> for spin up). I've got several boxes full of SAS cables and only one
>>> cable that I can think of that lets me get to the secondary SAS port.
>>> So on LUN 0 you could have a proprietary (unless T10 takes it on
>>> board) mode page to enable the user to say which ports LUNs 1 and 2
>>> were accessible on. Obviously LUN 0 would need to be accessible on
>>> both ports. [A non-accessible LUN would still respond to INQUIRY but
>>> with a first byte of 0x7f: PDT=0x1f (unknown) and PQ=3 which implies
>>> it is there but inaccessible via the current port.]
>>>
>>> A processor PDT opens up the possibility of putting a copy manager on
>>> LUN 0. Think offloaded (from main machine's perspective) backups and
>>> restores where LUN 1 or 2 is the source or destination.
>>>
>>> Enough dreaming.
>>>
>>> Doug Gilbert
>>
>> Thanks for all the input from everybody. I'll collate it for a meeting
>> with our interface architect.
>
> Just to amplify the case against a FU or SAN being allowed at almost
> any time from either storage LU irrespective of what the other was
> doing. Imagine one initiator has some important data on one LU and has
> an EXCLUSIVE ACCESS persistent reservation on it. Then another
> initiator (e.g. on a different machine) sends a FU to the other LU
> which is honoured, wiping the whole device. One unhappy customer
>
> Doug Gilbert

Thanks, Doug. That is very clear, and I get it. We'll have a solution.

Best regards,
-Tim

-- 
Tim Walker
Product Design Systems Engineering, Seagate Technology
(303) 775-3770
Re: Multi-Actuator SAS HDD First Look
On 2018-04-02 11:34 AM, Tim Walker wrote:
> On Sat, Mar 31, 2018 at 10:52 AM, Douglas Gilbert wrote:
>> On 2018-03-30 04:01 PM, Bart Van Assche wrote:
>>> On Fri, 2018-03-30 at 12:36 -0600, Tim Walker wrote:
>>>> Yes I will be there to discuss the multi-LUN approach. I wanted to
>>>> get these interface details out so we could have some background and
>>>> perhaps folks would come with ideas. I don't have much more to put
>>>> out, but I will certainly answer questions - either on this list or
>>>> directly.
>>>
>>> Hello Tim,
>>>
>>> As far as I know the Linux SCSI stack does not yet support any SCSI
>>> devices for which a single SCSI command affects multiple (two in this
>>> case) LUNs. Adding such support may take significant work. There will
>>> also be the disadvantage that most SCSI core contributors do not have
>>> access to a multi-actuator device and hence that their changes may
>>> break support for multi-actuator devices.
>>
>> Hmmm, INQUIRY (3PC bit) and REPORT LUNS seem to be counter examples to
>> Bart's assertion. Plus there are more that tell you about things
>> outside the addressed LU; for example the SCSI Ports VPD page tells
>> you about other SCSI ports (hence other LUs) on the current device.
>>
>> From Tim's command list:
>>
>> Device level
>>
>> 0x0, 0x1: okay
>> 0x4 (Format unit): yikes, that could be a nasty surprise, accessing a
>> file system on the other LU and getting an error "Not ready, format
>> in progress"!!
>> 0x12: standard INQUIRY okay, VPD pages not so much: LU id different;
>> relative port id different; target port id different (at the least)
>> 0x1b (SSU): storage LUs need to know this model, otherwise the logic
>> on each LU could get into a duel: "I want it spun up; no, I want it
>> spun down ..."
>> 0x35, 0x37, 0x3b, 0x3c: okay
>> 0x48 (sanitize): similar to Format unit above
>> 0x91, 0x4c, 0x4d: okay
>> MODE SENSE/SELECT(6,10): depends on page; block descriptor needs to
>> be partially device level (since LB size may be changed by FU, which
>> is device level); rest of device level: okay or (I) don't know
>> 0xf7: READ UDS DATA, that's interesting, but proprietary I guess
>>
>> Perhaps you could add a rider on FU and SAN: they get rejected unless
>> the other storage LU is in (logical) spun down state.
>>
>> LU specific
>>
>> All okay. I'm hoping READ(6,10,12,16,32) and their WRITE cousins will
>> be there also :-) Plus the TMF: LU reset.
>>
>> Device or LU
>>
>> All okay.
>>
>> I'm intrigued by your 3 LU model. My wish list for that:
>>
>> LUN 0 would be processor device type (0x3) so it wouldn't confuse the
>> OS (Linux) that it held storage (READ CAPACITY is mandatory for PDT
>> 0x0 and cannot represent a 0 block LU) and you could pick and choose
>> which SCSI commands to implement on it. LUN 0 TUR could report what
>> the spindle was really doing; SSU could do precisely what it is told
>> (and SSU on LUNs 1 and 2 would be an "and" for spin down and an "or"
>> for spin up). I've got several boxes full of SAS cables and only one
>> cable that I can think of that lets me get to the secondary SAS port.
>> So on LUN 0 you could have a proprietary (unless T10 takes it on
>> board) mode page to enable the user to say which ports LUNs 1 and 2
>> were accessible on. Obviously LUN 0 would need to be accessible on
>> both ports. [A non-accessible LUN would still respond to INQUIRY but
>> with a first byte of 0x7f: PDT=0x1f (unknown) and PQ=3, which implies
>> it is there but inaccessible via the current port.]
>>
>> A processor PDT opens up the possibility of putting a copy manager on
>> LUN 0. Think offloaded (from the main machine's perspective) backups
>> and restores where LUN 1 or 2 is the source or destination.
>>
>> Enough dreaming.
>>
>> Doug Gilbert
>
> Thanks for all the input from everybody. I'll collate it for a meeting
> with our interface architect.

Just to amplify the case against a FU or SAN being allowed at almost any
time from either storage LU irrespective of what the other was doing.
Imagine one initiator has some important data on one LU and has an
EXCLUSIVE ACCESS persistent reservation on it. Then another initiator
(e.g. on a different machine) sends a FU to the other LU which is
honoured, wiping the whole device. One unhappy customer.

Doug Gilbert
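Doug's note that a non-accessible LUN would answer INQUIRY with a first byte of 0x7f is just two SPC fields packed into one byte: the peripheral qualifier in bits 7..5 and the peripheral device type in bits 4..0. A minimal decoder sketch (the function name is mine, not from any existing library):

```python
def decode_inquiry_byte0(b0: int):
    """Split byte 0 of standard INQUIRY data into
    (peripheral qualifier, peripheral device type), per SPC:
    PQ is bits 7..5, PDT is bits 4..0."""
    return (b0 >> 5) & 0x07, b0 & 0x1F

# 0x7f: PQ=3 ("present but not accessible via this port"), PDT=0x1f (unknown)
assert decode_inquiry_byte0(0x7F) == (3, 0x1F)
# 0x03: PQ=0 (connected), PDT=0x03 (processor device, Doug's LUN 0 wish)
assert decode_inquiry_byte0(0x03) == (0, 0x03)
# 0x00: PQ=0, PDT=0x00 (ordinary direct-access device, i.e. LUNs 1 and 2)
assert decode_inquiry_byte0(0x00) == (0, 0x00)
```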
Re: 4.16-rc2+git: pata_serverworks: hanging ata detection thread on HP DL380G3
Hello,

On Fri, Mar 30, 2018 at 11:47:24AM +0300, Meelis Roos wrote:
> Added CC-s, start of the thread is at
> https://lkml.org/lkml/2018/2/26/165
>
> > > > 4.16 git bootup on HP Proliant DL380 G3 pauses for a minute or
> > > > two and then continues with "blocked for more than 120 seconds"
> > > > message with libata detection functions in their stack -
> > > > async_synchronize_cookie_domain() as the last. It seems to happen
> > > > during IDE CD-ROM detection (detected before but registered as
> > > > sr0 after the warning). After detection, the eject button on the
> > > > drive did not work.
> > > >
> > > > pata_serverworks is the libata driver in use.
> >
> > There were no changes to pata_serverworks since 2014 and libata
> > changes in v4.16 look obviously correct.
> >
> > > This is still the same in 4.16.0-rc7-00062-g0b412605ef5f.
> >
> > Any chance that you could bisect this issue?
>
> Bisected to the following commit:
>
> 358f70da49d77c43f2ca11b5da584213b2add29c is the first bad commit
> commit 358f70da49d77c43f2ca11b5da584213b2add29c
> Author: Tejun Heo
> Date: Tue Jan 9 08:29:50 2018 -0800
>
>     blk-mq: make blk_abort_request() trigger timeout path

Can you share the .config file?

Thanks.

--
tejun
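When comparing a reporter's .config against a known-good one in a bisect like this, it helps to pull out individual options programmatically. A small sketch of a lookup helper (my own code, not a kernel tool; the options in the sample are only examples of real config names):

```python
def config_value(config_text: str, option: str):
    """Look up a kernel config option in .config text.
    Returns its value, "n" for an explicit 'is not set' comment,
    or None if the option does not appear at all."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {option} is not set":
            return "n"
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None

sample = """\
CONFIG_BLK_DEV_SR=y
# CONFIG_SCSI_MQ_DEFAULT is not set
CONFIG_PATA_SERVERWORKS=m
"""
assert config_value(sample, "CONFIG_PATA_SERVERWORKS") == "m"
assert config_value(sample, "CONFIG_SCSI_MQ_DEFAULT") == "n"
assert config_value(sample, "CONFIG_FOO") is None
```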
Re: Multi-Actuator SAS HDD First Look
On Sat, Mar 31, 2018 at 10:52 AM, Douglas Gilbert wrote:
> On 2018-03-30 04:01 PM, Bart Van Assche wrote:
>> On Fri, 2018-03-30 at 12:36 -0600, Tim Walker wrote:
>>> Yes I will be there to discuss the multi-LUN approach. I wanted to
>>> get these interface details out so we could have some background and
>>> perhaps folks would come with ideas. I don't have much more to put
>>> out, but I will certainly answer questions - either on this list or
>>> directly.
>>
>> Hello Tim,
>>
>> As far as I know the Linux SCSI stack does not yet support any SCSI
>> devices for which a single SCSI command affects multiple (two in this
>> case) LUNs. Adding such support may take significant work. There will
>> also be the disadvantage that most SCSI core contributors do not have
>> access to a multi-actuator device and hence that their changes may
>> break support for multi-actuator devices.
>
> Hmmm, INQUIRY (3PC bit) and REPORT LUNS seem to be counter examples to
> Bart's assertion. Plus there are more that tell you about things
> outside the addressed LU; for example the SCSI Ports VPD page tells
> you about other SCSI ports (hence other LUs) on the current device.
>
> From Tim's command list:
>
> Device level
>
> 0x0, 0x1: okay
> 0x4 (Format unit): yikes, that could be a nasty surprise, accessing a
> file system on the other LU and getting an error "Not ready, format
> in progress"!!
> 0x12: standard INQUIRY okay, VPD pages not so much: LU id different;
> relative port id different; target port id different (at the least)
> 0x1b (SSU): storage LUs need to know this model, otherwise the logic
> on each LU could get into a duel: "I want it spun up; no, I want it
> spun down ..."
> 0x35, 0x37, 0x3b, 0x3c: okay
> 0x48 (sanitize): similar to Format unit above
> 0x91, 0x4c, 0x4d: okay
> MODE SENSE/SELECT(6,10): depends on page; block descriptor needs to
> be partially device level (since LB size may be changed by FU, which
> is device level); rest of device level: okay or (I) don't know
> 0xf7: READ UDS DATA, that's interesting, but proprietary I guess
>
> Perhaps you could add a rider on FU and SAN: they get rejected unless
> the other storage LU is in (logical) spun down state.
>
> LU specific
>
> All okay. I'm hoping READ(6,10,12,16,32) and their WRITE cousins will
> be there also :-) Plus the TMF: LU reset.
>
> Device or LU
>
> All okay.
>
> I'm intrigued by your 3 LU model. My wish list for that:
>
> LUN 0 would be processor device type (0x3) so it wouldn't confuse the
> OS (Linux) that it held storage (READ CAPACITY is mandatory for PDT
> 0x0 and cannot represent a 0 block LU) and you could pick and choose
> which SCSI commands to implement on it. LUN 0 TUR could report what
> the spindle was really doing; SSU could do precisely what it is told
> (and SSU on LUNs 1 and 2 would be an "and" for spin down and an "or"
> for spin up). I've got several boxes full of SAS cables and only one
> cable that I can think of that lets me get to the secondary SAS port.
> So on LUN 0 you could have a proprietary (unless T10 takes it on
> board) mode page to enable the user to say which ports LUNs 1 and 2
> were accessible on. Obviously LUN 0 would need to be accessible on
> both ports. [A non-accessible LUN would still respond to INQUIRY but
> with a first byte of 0x7f: PDT=0x1f (unknown) and PQ=3, which implies
> it is there but inaccessible via the current port.]
>
> A processor PDT opens up the possibility of putting a copy manager on
> LUN 0. Think offloaded (from the main machine's perspective) backups
> and restores where LUN 1 or 2 is the source or destination.
>
> Enough dreaming.
>
> Doug Gilbert

Thanks for all the input from everybody. I'll collate it for a meeting
with our interface architect.

Best regards,
-Tim

--
Tim Walker
Product Design Systems Engineering, Seagate Technology
(303) 775-3770
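Doug's proposed SSU semantics for the shared spindle ("and" for spin down, "or" for spin up) reduce to a single rule: the spindle spins if and only if at least one storage LUN wants it up. A sketch of that rule (the model is an assumption about the proposal, not existing firmware or driver code):

```python
def spindle_spinning(lun_wants_up: dict) -> bool:
    """Doug's rule: SSU(start) on any storage LUN spins the shared
    spindle up ("or"); it stops only once every storage LUN has asked
    for spin down ("and")."""
    return any(lun_wants_up.values())

state = {"lun1": True, "lun2": True}
state["lun1"] = False            # SSU stop on LUN 1 alone...
assert spindle_spinning(state)   # ...LUN 2 still wants up, keep spinning
state["lun2"] = False            # now both have asked for spin down
assert not spindle_spinning(state)
state["lun1"] = True             # SSU start on either LUN spins it up
assert spindle_spinning(state)
```

This is also why Doug wants LUN 0's TEST UNIT READY to report the real spindle state: the per-LUN SSU requests and the physical spindle state are otherwise decoupled.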