[GIT PULL] Block changes for 4.17-rc

2018-04-02 Thread Jens Axboe
Hi Linus,

This is the main pull request for block for 4.17. It's a pretty quiet
round this time, which is nice. This pull request contains:

- Series from Bart, cleaning up the way we set/test/clear atomic queue
  flags.

- Series from Bart, fixing races between gendisk and queue registration
  and removal.

- Set of bcache fixes and improvements from various folks, by way of
  Michael Lyle.

- Set of lightnvm updates from Matias, most of it being the 1.2 to
  2.0 transition.

- Removal of unused DIO flags from Nikolay.

- blk-mq/sbitmap memory ordering fixes from Omar.

- Divide-by-zero fix for BFQ from Paolo.

- Minor documentation patches from Randy.

- Timeout fix from Tejun.

- Alpha "can't write a char atomically" fix from Mikulas.

- Set of NVMe fixes by way of Keith.

- bsg and bsg-lib improvements from Christoph.

- A few sed-opal fixes from Jonas.

- cdrom check-disk-change deadlock fix from Maurizio.

- Various little fixes, comment fixes, etc from various folks.

You'll get a merge conflict that you need to resolve in
drivers/nvme/host/core.c - I've included the resolution at the end of
the email. Also see:

http://git.kernel.dk/cgit/linux-block/log/?h=for-4.17/merged

if you wish, that's the above pull request merged with master.

Please pull!


  git://git.kernel.dk/linux-block.git tags/for-4.17/block-20180402



Anshuman Khandual (1):
  lib/scatterlist: Add SG_CHAIN and SG_END macros for LSB encodings

Arnd Bergmann (2):
  misc: rtsx: rename SG_END macro
  staging: rts5208: rename SG_END macro

Bart Van Assche (31):
  blk-mq-debugfs: Reorder queue show and store methods
  blk-mq-debugfs: Show zone locking information
  block/loop: Delete gendisk before cleaning up the request queue
  md: Delete gendisk before cleaning up the request queue
  zram: Delete gendisk before cleaning up the request queue
  block: Add 'lock' as third argument to blk_alloc_queue_node()
  block: Fix a race between the cgroup code and request queue initialization
  block: Fix a race between request queue removal and the block cgroup 
controller
  block: Reorder the queue flag manipulation function definitions
  block: Use the queue_flag_*() functions instead of open-coding these
  block: Introduce blk_queue_flag_{set,clear,test_and_{set,clear}}()
  block: Protect queue flag changes with the queue lock
  mtip32xx: Use the blk_queue_flag_*() functions
  bcache: Use the blk_queue_flag_{set,clear}() functions
  iscsi: Use blk_queue_flag_set()
  target/tcm_loop: Use blk_queue_flag_set()
  block: Use blk_queue_flag_*() in drivers instead of queue_flag_*()
  block: Complain if queue_flag_(set|clear)_unlocked() is abused
  block: Move the queue_flag_*() functions from a public into a private 
header file
  block: Suppress kernel-doc warnings triggered by blk-zoned.c
  blk-mq-debugfs: Show more request state information
  block: Move SECTOR_SIZE and SECTOR_SHIFT definitions into 
  bcache: Fix indentation
  bcache: Add __printf annotation to __bch_check_keys()
  bcache: Annotate switch fall-through
  bcache: Fix kernel-doc warnings
  bcache: Remove an unused variable
  bcache: Suppress more warnings about set-but-not-used variables
  bcache: Reduce the number of sparse complaints about lock imbalances
  bcache: Fix a compiler warning in bcache_device_init()
  block: Change a rcu_read_{lock,unlock}_sched() pair into 
rcu_read_{lock,unlock}()

Chengguang Xu (1):
  bcache: move closure debug file into debug directory

Christoph Hellwig (6):
  bsg-lib: introduce a timeout field in struct bsg_job
  bsg-lib: remove bsg_job.req
  bsg: split handling of SCSI CDBs vs transport requeues
  block: bio_check_eod() needs to consider partitions
  nvmet: refactor configfs transport type handling
  nvmet: constify struct nvmet_fabrics_ops

Coly Li (7):
  bcache: fix cached_dev->count usage for bch_cache_set_error()
  bcache: quit dc->writeback_thread when BCACHE_DEV_DETACHING is set
  bcache: stop dc->writeback_rate_update properly
  bcache: add CACHE_SET_IO_DISABLE to struct cache_set flags
  bcache: add stop_when_cache_set_failed option to backing device
  bcache: add backing_request_endio() for bi_end_io
  bcache: add io_disable to struct cached_dev

Dan Carpenter (1):
  lightnvm: pblk: remove some unnecessary NULL checks

Hans Holmberg (8):
  lightnvm: pblk: handle bad sectors in the emeta area correctly
  lightnvm: pblk: check data lines version on recovery
  lightnvm: pblk: export write amplification counters to sysfs
  lightnvm: pblk: add padding distribution sysfs attribute
  lightnvm: pblk: delete writer kick timer before stopping thread
  lightnvm: pblk: allow allocation of new lines during shutdown
  lightnvm:

Re: [PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution

2018-04-02 Thread Bart Van Assche
On Mon, 2018-04-02 at 15:16 -0700, t...@kernel.org wrote:
> AFAIK,
> there's one non-critical race condition which has always been there.
> We have a larger race window for that case but don't yet know whether
> that's problematic or not.  If that actually is problematic, we can
> figure out a way to solve that but such effort / added complexity
> doesn't seem justified yet.  No?

Hello Tejun,

Some important block drivers systematically return BLK_EH_RESET_TIMER
from their timeout handler, e.g. the virtio-scsi initiator driver. What
will the consequence be for these drivers if a blk_mq_complete_request()
call is ignored? Will the request for which the completion was ignored
get stuck?

Thanks,

Bart.





Re: [PATCH] blk-mq: Directly schedule q->timeout_work when aborting a request

2018-04-02 Thread Jens Axboe
On 4/2/18 4:04 PM, Tejun Heo wrote:
> Request abortion is performed by overriding deadline to now and
> scheduling timeout handling immediately.  For the latter part, the
> code was using mod_timer(timeout, 0) which can't guarantee that the
> timer runs afterwards.  Let's schedule the underlying work item
> directly instead.
> 
> This fixes the hangs during probing reported by Sitsofe but it isn't
> yet clear to me how the failure can happen reliably if it's just the
> above described race condition.
> 
> Signed-off-by: Tejun Heo 
> Reported-by: Sitsofe Wheeler 
> Reported-by: Meelis Roos 
> Fixes: 358f70da49d7 ("blk-mq: make blk_abort_request() trigger timeout path")
> Cc: sta...@vger.kernel.org # v4.16
> Link: 
> http://lkml.kernel.org/r/CALjAwxh-PVYFnYFCJpGOja+m5SzZ8Sa4J7ohxdK=r8nyof-...@mail.gmail.com
> Link: http://lkml.kernel.org/r/alpine.lrh.2.21.1802261049140.4...@math.ut.ee
> ---
> Hello,
> 
> I don't have the full explanation yet but here's a preliminary patch.
> 
> Thanks.
> 
>  block/blk-timeout.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/blk-timeout.c b/block/blk-timeout.c
> index a05e367..f0e6e41 100644
> --- a/block/blk-timeout.c
> +++ b/block/blk-timeout.c
> @@ -165,7 +165,7 @@ void blk_abort_request(struct request *req)
>* No need for fancy synchronizations.
>*/
>   blk_rq_set_deadline(req, jiffies);
> - mod_timer(>q->timeout, 0);
> + kblockd_schedule_work(>q->timeout_work);
>   } else {
>   if (blk_mark_rq_complete(req))
>   return;

In any case, it's cleaner than relying on mod_timer(.., 0). If that
doesn't guarantee that the timer runs again, I can see how a race
with the running timer could prevent us from seeing the timeout
after an abort.

I'll apply this, thanks.

-- 
Jens Axboe



Re: [PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution

2018-04-02 Thread t...@kernel.org
Hello,

On Mon, Apr 02, 2018 at 10:09:18PM +, Bart Van Assche wrote:
> Please elaborate what your long-term goal is for the blk-mq timeout handler.

Hmm... I don't really have any plans beyond what's been posted.

> The legacy block layer suspends request state changes while a timeout is
> being processed by holding the request queue lock while requests are being
> processed, while processing a timeout and while calling 
> q->rq_timed_out_fn(rq).
> Do you think it is possible to make the blk-mq core suspend request processing
> while processing a timeout without introducing locking in
> blk_mq_complete_request()? If you do not plan to add locking in
> blk_mq_complete_request(), do you think it is possible to fix all the races we
> discussed in previous e-mails?

I don't know of multiple race conditions.  What am I missing?  AFAIK,
there's one non-critical race condition which has always been there.
We have a larger race window for that case but don't yet know whether
that's problematic or not.  If that actually is problematic, we can
figure out a way to solve that but such effort / added complexity
doesn't seem justified yet.  No?

Thanks.

-- 
tejun


Re: [PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution

2018-04-02 Thread Bart Van Assche
On Mon, 2018-04-02 at 15:01 -0700, t...@kernel.org wrote:
> On Mon, Apr 02, 2018 at 09:56:41PM +, Bart Van Assche wrote:
> > This patch increases the time during which .aborted_gstate == .gstate if the
> > timeout is reset. Does that increase the chance that a completion will be 
> > missed
> > if the request timeout is reset?
> 
> It sure does, but we're comparing an outright kernel bug vs. an
> inherently opportunistic mechanism being a bit more lossy.  I think
> the answer is pretty clear.

Hello Tejun,

Please elaborate what your long-term goal is for the blk-mq timeout handler.
The legacy block layer suspends request state changes while a timeout is
being processed by holding the request queue lock while requests are being
processed, while processing a timeout and while calling q->rq_timed_out_fn(rq).
Do you think it is possible to make the blk-mq core suspend request processing
while processing a timeout without introducing locking in
blk_mq_complete_request()? If you do not plan to add locking in
blk_mq_complete_request(), do you think it is possible to fix all the races we
discussed in previous e-mails?

Thanks,

Bart.




Re: 4.16-rc2+git: pata_serverworks: hanging ata detection thread on HP DL380G3

2018-04-02 Thread Tejun Heo
Hello, Meelis.

Can you please verify whether the following patch fixes the problem?

Thanks.

Subject: blk-mq: Directly schedule q->timeout_work when aborting a request

Request abortion is performed by overriding deadline to now and
scheduling timeout handling immediately.  For the latter part, the
code was using mod_timer(timeout, 0) which can't guarantee that the
timer runs afterwards.  Let's schedule the underlying work item
directly instead.

This fixes the hangs during probing reported by Sitsofe but it isn't
yet clear to me how the failure can happen reliably if it's just the
above described race condition.

Signed-off-by: Tejun Heo 
Reported-by: Sitsofe Wheeler 
Reported-by: Meelis Roos 
---
 block/blk-timeout.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index a05e367..f0e6e41 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -165,7 +165,7 @@ void blk_abort_request(struct request *req)
 * No need for fancy synchronizations.
 */
blk_rq_set_deadline(req, jiffies);
-   mod_timer(>q->timeout, 0);
+   kblockd_schedule_work(>q->timeout_work);
} else {
if (blk_mark_rq_complete(req))
return;


[PATCH] blk-mq: Directly schedule q->timeout_work when aborting a request

2018-04-02 Thread Tejun Heo
Request abortion is performed by overriding deadline to now and
scheduling timeout handling immediately.  For the latter part, the
code was using mod_timer(timeout, 0) which can't guarantee that the
timer runs afterwards.  Let's schedule the underlying work item
directly instead.

This fixes the hangs during probing reported by Sitsofe but it isn't
yet clear to me how the failure can happen reliably if it's just the
above described race condition.

Signed-off-by: Tejun Heo 
Reported-by: Sitsofe Wheeler 
Reported-by: Meelis Roos 
Fixes: 358f70da49d7 ("blk-mq: make blk_abort_request() trigger timeout path")
Cc: sta...@vger.kernel.org # v4.16
Link: 
http://lkml.kernel.org/r/CALjAwxh-PVYFnYFCJpGOja+m5SzZ8Sa4J7ohxdK=r8nyof-...@mail.gmail.com
Link: http://lkml.kernel.org/r/alpine.lrh.2.21.1802261049140.4...@math.ut.ee
---
Hello,

I don't have the full explanation yet but here's a preliminary patch.

Thanks.

 block/blk-timeout.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index a05e367..f0e6e41 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -165,7 +165,7 @@ void blk_abort_request(struct request *req)
 * No need for fancy synchronizations.
 */
blk_rq_set_deadline(req, jiffies);
-   mod_timer(>q->timeout, 0);
+   kblockd_schedule_work(>q->timeout_work);
} else {
if (blk_mark_rq_complete(req))
return;


Re: [PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution

2018-04-02 Thread t...@kernel.org
Hello,

On Mon, Apr 02, 2018 at 09:56:41PM +, Bart Van Assche wrote:
> This patch increases the time during which .aborted_gstate == .gstate if the
> timeout is reset. Does that increase the chance that a completion will be 
> missed
> if the request timeout is reset?

It sure does, but we're comparing an outright kernel bug vs. an
inherently opportunistic mechanism being a bit more lossy.  I think
the answer is pretty clear.

Thanks.

-- 
tejun


Re: [PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution

2018-04-02 Thread Bart Van Assche
On Mon, 2018-04-02 at 14:10 -0700, Tejun Heo wrote:
> On Mon, Apr 02, 2018 at 02:08:37PM -0700, Bart Van Assche wrote:
> > On 04/02/18 12:01, Tejun Heo wrote:
> > > +  * As nothing prevents from completion happening while
> > > +  * ->aborted_gstate is set, this may lead to ignored completions
> > > +  * and further spurious timeouts.
> > > +  */
> > > + if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
> > > + blk_mq_rq_update_aborted_gstate(rq, 0);
> > 
> > Hello Tejun,
> > 
> > Since this patch fixes one race but introduces another race, is this
> > patch really an improvement?
> 
> Oh, that's not a new race.  That's the same non-critical race which
> always existed.  It's just being documented.

This patch increases the time during which .aborted_gstate == .gstate if the
timeout is reset. Does that increase the chance that a completion will be missed
if the request timeout is reset?

Thanks,

Bart.




Re: [PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution

2018-04-02 Thread t...@kernel.org
Hello,

On Mon, Apr 02, 2018 at 09:31:34PM +, Bart Van Assche wrote:
> > > > +* As nothing prevents from completion happening while
> > > > +* ->aborted_gstate is set, this may lead to ignored completions
> > > > +* and further spurious timeouts.
> > > > +*/
> > > > +   if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
> > > > +   blk_mq_rq_update_aborted_gstate(rq, 0);
...
> I think it can happen that the above code sees that (rq->rq_flags &
> RQF_MQ_TIMEOUT_RESET) != 0, that blk_mq_start_request() executes the
> following code:
> 
>   blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
>   blk_add_timer(rq);
> 
> and that subsequently blk_mq_rq_update_aborted_gstate(rq, 0) is called,
> which will cause the next completion to be lost. Is fixing one occurrence
> of a race and reintroducing it in another code path really an improvement?

I'm not following at all.  How would blk_mq_start_request() get called
on the request while it's still owned by the timeout handler?  That
gstate clearing is what transfers the ownership back to the
non-timeout path.

Thanks.

-- 
tejun


Re: [PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution

2018-04-02 Thread Bart Van Assche
On Mon, 2018-04-02 at 14:10 -0700, Tejun Heo wrote:
> On Mon, Apr 02, 2018 at 02:08:37PM -0700, Bart Van Assche wrote:
> > On 04/02/18 12:01, Tejun Heo wrote:
> > > +  * As nothing prevents from completion happening while
> > > +  * ->aborted_gstate is set, this may lead to ignored completions
> > > +  * and further spurious timeouts.
> > > +  */
> > > + if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
> > > + blk_mq_rq_update_aborted_gstate(rq, 0);
> > 
> > Hello Tejun,
> > 
> > Since this patch fixes one race but introduces another race, is this
> > patch really an improvement?
> 
> Oh, that's not a new race.  That's the same non-critical race which
> always existed.  It's just being documented.

Hello Tejun,

I think it can happen that the above code sees that (rq->rq_flags &
RQF_MQ_TIMEOUT_RESET) != 0, that blk_mq_start_request() executes the
following code:

blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
blk_add_timer(rq);

and that subsequently blk_mq_rq_update_aborted_gstate(rq, 0) is called,
which will cause the next completion to be lost. Is fixing one occurrence
of a race and reintroducing it in another code path really an improvement?

Thanks,

Bart.





Re: [PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution

2018-04-02 Thread Tejun Heo
On Mon, Apr 02, 2018 at 02:08:37PM -0700, Bart Van Assche wrote:
> On 04/02/18 12:01, Tejun Heo wrote:
> >+ * As nothing prevents from completion happening while
> >+ * ->aborted_gstate is set, this may lead to ignored completions
> >+ * and further spurious timeouts.
> >+ */
> >+if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
> >+blk_mq_rq_update_aborted_gstate(rq, 0);
> 
> Hello Tejun,
> 
> Since this patch fixes one race but introduces another race, is this
> patch really an improvement?

Oh, that's not a new race.  That's the same non-critical race which
always existed.  It's just being documented.

Thanks.

-- 
tejun


Re: [PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution

2018-04-02 Thread Bart Van Assche

On 04/02/18 12:01, Tejun Heo wrote:

+* As nothing prevents from completion happening while
+* ->aborted_gstate is set, this may lead to ignored completions
+* and further spurious timeouts.
+*/
+   if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
+   blk_mq_rq_update_aborted_gstate(rq, 0);


Hello Tejun,

Since this patch fixes one race but introduces another race, is this 
patch really an improvement?


Thanks,

Bart.


Re: [BISECTED][REGRESSION] Hang while booting EeePC 900

2018-04-02 Thread Sitsofe Wheeler
Hi Tejun,

On 2 April 2018 at 21:29, Tejun Heo  wrote:
>
> Can you see whether the following patch makes any difference?
>
> Thanks.
>
> diff --git a/block/blk-timeout.c b/block/blk-timeout.c
> index a05e367..f0e6e41 100644
> --- a/block/blk-timeout.c
> +++ b/block/blk-timeout.c
> @@ -165,7 +165,7 @@ void blk_abort_request(struct request *req)
>  * No need for fancy synchronizations.
>  */
> blk_rq_set_deadline(req, jiffies);
> -   mod_timer(>q->timeout, 0);
> +   kblockd_schedule_work(>q->timeout_work);
> } else {
> if (blk_mark_rq_complete(req))
> return;

That patch seems to fix the issue.

-- 
Sitsofe | http://sucs.org/~sits/


Re: [PATCH 1/2] blk-mq: Factor out [s]rcu synchronization

2018-04-02 Thread Bart Van Assche

On 04/02/18 12:00, Tejun Heo wrote:

Factor out [s]rcu synchronization in blk_mq_timeout_work() into
blk_mq_timeout_sync_rcu().  This is to add another user in the future
and doesn't cause any functional changes.


Reviewed-by: Bart Van Assche 



Re: [BISECTED][REGRESSION] Hang while booting EeePC 900

2018-04-02 Thread Tejun Heo
Hello, Sitsofe.

Can you see whether the following patch makes any difference?

Thanks.

diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index a05e367..f0e6e41 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -165,7 +165,7 @@ void blk_abort_request(struct request *req)
 * No need for fancy synchronizations.
 */
blk_rq_set_deadline(req, jiffies);
-   mod_timer(>q->timeout, 0);
+   kblockd_schedule_work(>q->timeout_work);
} else {
if (blk_mark_rq_complete(req))
return;


[PATCH 1/2] blk-mq: Factor out [s]rcu synchronization

2018-04-02 Thread Tejun Heo
Factor out [s]rcu synchronization in blk_mq_timeout_work() into
blk_mq_timeout_sync_rcu().  This is to add another user in the future
and doesn't cause any functional changes.

Signed-off-by: Tejun Heo 
Cc: Bart Van Assche 
---
Hello,

We were tracking this down in the following thread

  http://lkml.kernel.org/r/20180207011133.25957-1-bart.vanass...@wdc.com

but lost the reproducer in the middle and couldn't fully verify these
two patches fix the problem; however, the identified race is real and
a bug, so I think it'd be best to apply these two patches.

Given the lack of further reports on this front, I don't think it's
necessary to rush these patches.  I think it'd be best to apply these
once the merge window closes.  If we need to backport these to
-stable, we can tag them later.

Thanks.

 block/blk-mq.c |   39 +++
 include/linux/blk-mq.h |2 +-
 2 files changed, 24 insertions(+), 17 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -876,13 +876,34 @@ static void blk_mq_check_expired(struct
time_after_eq(jiffies, deadline)) {
blk_mq_rq_update_aborted_gstate(rq, gstate);
data->nr_expired++;
-   hctx->nr_expired++;
+   hctx->need_sync_rcu = true;
} else if (!data->next_set || time_after(data->next, deadline)) {
data->next = deadline;
data->next_set = 1;
}
 }
 
+static void blk_mq_timeout_sync_rcu(struct request_queue *q)
+{
+   struct blk_mq_hw_ctx *hctx;
+   bool has_rcu = false;
+   int i;
+
+   queue_for_each_hw_ctx(q, hctx, i) {
+   if (!hctx->need_sync_rcu)
+   continue;
+
+   if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+   has_rcu = true;
+   else
+   synchronize_srcu(hctx->srcu);
+
+   hctx->need_sync_rcu = false;
+   }
+   if (has_rcu)
+   synchronize_rcu();
+}
+
 static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
struct request *rq, void *priv, bool reserved)
 {
@@ -930,27 +951,13 @@ static void blk_mq_timeout_work(struct w
blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, );
 
if (data.nr_expired) {
-   bool has_rcu = false;
-
/*
 * Wait till everyone sees ->aborted_gstate.  The
 * sequential waits for SRCUs aren't ideal.  If this ever
 * becomes a problem, we can add per-hw_ctx rcu_head and
 * wait in parallel.
 */
-   queue_for_each_hw_ctx(q, hctx, i) {
-   if (!hctx->nr_expired)
-   continue;
-
-   if (!(hctx->flags & BLK_MQ_F_BLOCKING))
-   has_rcu = true;
-   else
-   synchronize_srcu(hctx->srcu);
-
-   hctx->nr_expired = 0;
-   }
-   if (has_rcu)
-   synchronize_rcu();
+   blk_mq_timeout_sync_rcu(q);
 
/* terminate the ones we won */
blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -51,7 +51,7 @@ struct blk_mq_hw_ctx {
unsigned intqueue_num;
 
atomic_tnr_active;
-   unsigned intnr_expired;
+   boolneed_sync_rcu;
 
struct hlist_node   cpuhp_dead;
struct kobject  kobj;


[PATCH 2/2] blk-mq: Fix request handover from timeout path to normal execution

2018-04-02 Thread Tejun Heo
When a request is handed over from normal execution to timeout, we
synchronize using ->aborted_gstate and RCU grace periods; however,
when a request is being returned from timeout handling to normal
execution for BLK_EH_RESET_TIMER, we were skipping the same
synchronization.

This means that it theoretically is possible for a returned request's
completion and recycling compete against the reordered and delayed
writes from timeout path.

This patch adds an equivalent synchronization when a request is
returned from timeout path to normal completion path.

Signed-off-by: Tejun Heo 
Cc: Bart Van Assche 
---
 block/blk-mq.c |   49 -
 block/blk-timeout.c|2 +-
 include/linux/blkdev.h |4 +++-
 3 files changed, 44 insertions(+), 11 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -818,7 +818,8 @@ struct blk_mq_timeout_data {
unsigned int nr_expired;
 };
 
-static void blk_mq_rq_timed_out(struct request *req, bool reserved)
+static void blk_mq_rq_timed_out(struct blk_mq_hw_ctx *hctx, struct request 
*req,
+   int *nr_resets, bool reserved)
 {
const struct blk_mq_ops *ops = req->q->mq_ops;
enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
@@ -833,13 +834,10 @@ static void blk_mq_rq_timed_out(struct r
__blk_mq_complete_request(req);
break;
case BLK_EH_RESET_TIMER:
-   /*
-* As nothing prevents from completion happening while
-* ->aborted_gstate is set, this may lead to ignored
-* completions and further spurious timeouts.
-*/
-   blk_mq_rq_update_aborted_gstate(req, 0);
blk_add_timer(req);
+   req->rq_flags |= RQF_MQ_TIMEOUT_RESET;
+   (*nr_resets)++;
+   hctx->need_sync_rcu = true;
break;
case BLK_EH_NOT_HANDLED:
break;
@@ -916,7 +914,26 @@ static void blk_mq_terminate_expired(str
 */
if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
READ_ONCE(rq->gstate) == rq->aborted_gstate)
-   blk_mq_rq_timed_out(rq, reserved);
+   blk_mq_rq_timed_out(hctx, rq, priv, reserved);
+}
+
+static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx,
+   struct request *rq, void *priv, bool reserved)
+{
+   /*
+* @rq's timer reset has gone through rcu synchronization and is
+* visible now.  Allow normal completions again by resetting
+* ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
+* there's no memory ordering around ->aborted_gstate making it the
+* only field safe to update.  Let blk_add_timer() clear it later
+* when the request is recycled or times out again.
+*
+* As nothing prevents from completion happening while
+* ->aborted_gstate is set, this may lead to ignored completions
+* and further spurious timeouts.
+*/
+   if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
+   blk_mq_rq_update_aborted_gstate(rq, 0);
 }
 
 static void blk_mq_timeout_work(struct work_struct *work)
@@ -951,6 +968,8 @@ static void blk_mq_timeout_work(struct w
blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, );
 
if (data.nr_expired) {
+   int nr_resets = 0;
+
/*
 * Wait till everyone sees ->aborted_gstate.  The
 * sequential waits for SRCUs aren't ideal.  If this ever
@@ -960,7 +979,19 @@ static void blk_mq_timeout_work(struct w
blk_mq_timeout_sync_rcu(q);
 
/* terminate the ones we won */
-   blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, NULL);
+   blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
+  _resets);
+
+   /*
+* For BLK_EH_RESET_TIMER, release the requests after
+* blk_add_timer() from above is visible to avoid timer
+* reset racing against recycling.
+*/
+   if (nr_resets) {
+   blk_mq_timeout_sync_rcu(q);
+   blk_mq_queue_tag_busy_iter(q,
+   blk_mq_finish_timeout_reset, NULL);
+   }
}
 
if (data.next_set) {
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -216,7 +216,7 @@ void blk_add_timer(struct request *req)
req->timeout = q->rq_timeout;
 
blk_rq_set_deadline(req, jiffies + req->timeout);
-   req->rq_flags &= ~RQF_MQ_TIMEOUT_EXPIRED;
+   req->rq_flags &= ~(RQF_MQ_TIMEOUT_EXPIRED | RQF_MQ_TIMEOUT_RESET);
 
/*
 * Only the non-mq case needs to add the request to a protected list.
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -127,8 

Re: Multi-Actuator SAS HDD First Look

2018-04-02 Thread Tim Walker
On Mon, Apr 2, 2018 at 10:29 AM, Douglas Gilbert  wrote:
> On 2018-04-02 11:34 AM, Tim Walker wrote:
>>
>> On Sat, Mar 31, 2018 at 10:52 AM, Douglas Gilbert 
>> wrote:
>>>
>>> On 2018-03-30 04:01 PM, Bart Van Assche wrote:


 On Fri, 2018-03-30 at 12:36 -0600, Tim Walker wrote:
>
>
> Yes I will be there to discuss the multi-LUN approach. I wanted to get
> these interface details out so we could have some background and
> perhaps folks would come with ideas. I don't have much more to put
> out, but I will certainly answer questions - either on this list or
> directly.



 Hello Tim,

 As far as I know the Linux SCSI stack does not yet support any SCSI
 devices
 for which a single SCSI command affects multiple (two in this case)
 LUNs.
 Adding such support may take significant work. There will also be the
 disadvantage that most SCSI core contributors do not have access to a
 multi-
 actuator device and hence that their changes may break support for
 multi-
 actuator devices.
>>>
>>>
>>>
>>> Hmmm, INQUIRY (3PC bit) and REPORT LUNS seem to be counter examples to
>>> Bart's assertion. Plus there are a more that tell you about things
>>> outside
>>> the addressed LU, for example the SCSI Ports VPD page tells you about
>>> other
>>> SCSI ports hence other LUs) on the current device.
>>>
>>>
>>>  From Tim's command list:
>>>
>>> Device level
>>> 
>>> 0x0, 0x1: okay
>>> 0x4 (Format unit): yikes, that could be a nasty surprise, accessing a
>>> file
>>>system on the other LU and getting an error "Not ready, format in
>>> progress"!!
>>> 0x12: standard INQUIRY okay, VPD pages not so much LU id different;
>>> relative
>>>port id, different; target port id different (at the least)
>>> 0x1b (SSU): storage LUs need to know this model, otherwise the logic on
>>>each LU could get into a duel: "I want it spun up; no, I want it spun
>>>down ..."
>>> 0x35, 0x37, 0x3b, 0x3c: okay
>>> 0x48 (sanitize): similar to Format unit above
>>> 0x91,0x4c,0x4d: okay
>>> MODE SENSE/SELECT(6,10): depends on page, block descriptor needs to be
>>>partially device level (since LB size may be changed by FU which is
>>>device level)
>>> rest of device level: okay or (I) don't know
>>> 0xf7: READ UDS DATA, that's interesting, but proprietary I guess
>>>
>>> Perhaps you could add a rider on FU and SAN: they get rejected unless the
>>> other storage LU is in (logical) spun down state.
>>>
>>>
>>> LU specific
>>> ---
>>> all okay, I hoping READ(6,10,12,16,32) and their WRITE cousins will be
>>>there also :-) Plus the TMF: LU reset
>>>
>>> Device or LU
>>> 
>>> all okay
>>>
>>>
>>> I'm intrigued by your 3 LU model. My wish list for that:
>>>
>>> LUN 0 would be processor device type (0x3) so it wouldn't confuse the
>>> OS (Linux) that it held storage (READ CAPACITY is mandatory for PDT 0x0
>>> and cannot represent a 0 block LU) and you could pick and choose which
>>> SCSI commands to implement on it. LUN 0 TUR could report what the spindle
>>> was really doing, SSU could do precisely what it is told (and SSU on LUNs
>>> 1 and 2 would be an "and" for spin down and an "or" for spin up). I've
>>> got several boxes full of SAS cables and only one cable that I can think
>>> of that lets me get to the secondary SAS port. So on LUN 0 you could have
>>> a proprietary (unless T10 takes it on board) mode page to enable the user
>>> to say which ports that LUNs 1 and 2 where accessible on. Obviously LUN 0
>>> would need to be accessible on both ports. [A non-accessible LUN would
>>> still respond to INQUIRY but with a first byte of 07f: PDT=0x1f (unknown)
>>> and PQ=3 which implies it is there but inaccessible via the current
>>> port.]
>>>
>>> A processor PDT opens up the possibility of putting a copy manager on
>>> LUN 0. Think offloaded (from main machine's perspective) backups and
>>> restores where LUN 1 or 2 is the source or destination.
>>>
>>> Enough dreaming.
>>>
>>> Doug Gilbert
>>
>>
>>
>>
>> Thanks for all the input from everybody. I'll collate it for a meeting
>> with our interface architect.
>
>
> Just to amplify the case against a FU or SAN being allowed at almost any
> time from either storage LU irrespective of what the other was doing.
> Imagine one initiator has some important data on one LU and has an
> EXCLUSIVE ACCESS persistent reservation on it. Then another initiator
> (e.g. on a different machine) sends a FU to the other LU which is
> honoured, wiping the whole device. One unhappy customer 
>
> Doug Gilbert

Thanks, Doug. That is very clear, and I get it. We'll have a solution.

Best regards,
-Tim

-- 
Tim Walker
Product Design Systems Engineering, Seagate Technology
(303) 775-3770


Re: Multi-Actuator SAS HDD First Look

2018-04-02 Thread Douglas Gilbert

On 2018-04-02 11:34 AM, Tim Walker wrote:

On Sat, Mar 31, 2018 at 10:52 AM, Douglas Gilbert  wrote:

On 2018-03-30 04:01 PM, Bart Van Assche wrote:


On Fri, 2018-03-30 at 12:36 -0600, Tim Walker wrote:


Yes I will be there to discuss the multi-LUN approach. I wanted to get
these interface details out so we could have some background and
perhaps folks would come with ideas. I don't have much more to put
out, but I will certainly answer questions - either on this list or
directly.



Hello Tim,

As far as I know the Linux SCSI stack does not yet support any SCSI
devices
for which a single SCSI command affects multiple (two in this case) LUNs.
Adding such support may take significant work. There will also be the
disadvantage that most SCSI core contributors do not have access to a
multi-
actuator device and hence that their changes may break support for multi-
actuator devices.



Hmmm, INQUIRY (3PC bit) and REPORT LUNS seem to be counter examples to
Bart's assertion. Plus there are a more that tell you about things outside
the addressed LU, for example the SCSI Ports VPD page tells you about other
SCSI ports hence other LUs) on the current device.


 From Tim's command list:

Device level

0x0, 0x1: okay
0x4 (Format unit): yikes, that could be a nasty surprise, accessing a file
   system on the other LU and getting an error "Not ready, format in
progress"!!
0x12: standard INQUIRY okay, VPD pages not so much LU id different; relative
   port id, different; target port id different (at the least)
0x1b (SSU): storage LUs need to know this model, otherwise the logic on
   each LU could get into a duel: "I want it spun up; no, I want it spun
   down ..."
0x35, 0x37, 0x3b, 0x3c: okay
0x48 (sanitize): similar to Format unit above
0x91,0x4c,0x4d: okay
MODE SENSE/SELECT(6,10): depends on page, block descriptor needs to be
   partially device level (since LB size may be changed by FU which is
   device level)
rest of device level: okay or (I) don't know
0xf7: READ UDS DATA, that's interesting, but proprietary I guess

Perhaps you could add a rider on FU and SAN: they get rejected unless the
other storage LU is in (logical) spun down state.


LU specific
---
all okay, I hoping READ(6,10,12,16,32) and their WRITE cousins will be
   there also :-) Plus the TMF: LU reset

Device or LU

all okay


I'm intrigued by your 3 LU model. My wish list for that:

LUN 0 would be processor device type (0x3) so it wouldn't confuse the
OS (Linux) that it held storage (READ CAPACITY is mandatory for PDT 0x0
and cannot represent a 0 block LU) and you could pick and choose which
SCSI commands to implement on it. LUN 0 TUR could report what the spindle
was really doing, SSU could do precisely what it is told (and SSU on LUNs
1 and 2 would be an "and" for spin down and an "or" for spin up). I've
got several boxes full of SAS cables and only one cable that I can think
of that lets me get to the secondary SAS port. So on LUN 0 you could have
a proprietary (unless T10 takes it on board) mode page to enable the user
to say which ports that LUNs 1 and 2 where accessible on. Obviously LUN 0
would need to be accessible on both ports. [A non-accessible LUN would
still respond to INQUIRY but with a first byte of 07f: PDT=0x1f (unknown)
and PQ=3 which implies it is there but inaccessible via the current port.]

A processor PDT opens up the possibility of putting a copy manager on
LUN 0. Think offloaded (from main machine's perspective) backups and
restores where LUN 1 or 2 is the source or destination.

Enough dreaming.

Doug Gilbert




Thanks for all the input from everybody. I'll collate it for a meeting
with our interface architect.


Just to amplify the case against a FU or SAN being allowed at almost any
time from either storage LU irrespective of what the other was doing.
Imagine one initiator has some important data on one LU and has an
EXCLUSIVE ACCESS persistent reservation on it. Then another initiator
(e.g. on a different machine) sends a FU to the other LU which is
honoured, wiping the whole device. One unhappy customer 

Doug Gilbert


Re: 4.16-rc2+git: pata_serverworks: hanging ata detection thread on HP DL380G3

2018-04-02 Thread Tejun Heo
Hello,

On Fri, Mar 30, 2018 at 11:47:24AM +0300, Meelis Roos wrote:
> Added CC-s, start of the thread is at 
> https://lkml.org/lkml/2018/2/26/165
> 
> > > > 4.16 git bootup on HP Proliant DL380 G3 pauses for a a minute or two 
> > > > and 
> > > > then continues with "blocked for more than 120 seconds" message with 
> > > > libata detection functions in ther stack - 
> > > > async_synchronize_cookie_domain() as the last. It seems to happen 
> > > > during 
> > > > IDE CD-ROM detection (detected before but registered as sr0 after the 
> > > > warning). After detection, the eject button on the drive did not work.
> > > > 
> > > > 
> > > > pata_serverworks is the libata driver in use.
> > 
> > There were no changes to pata_serverworks since 2014 and libata changes
> > in v4.16 look obviously correct..
> > 
> > > This is still the same in 4.16.0-rc7-00062-g0b412605ef5f.
> > 
> > Any chance that you could bisect this issue?
> 
> Bisected to the following commit:
> 
> 358f70da49d77c43f2ca11b5da584213b2add29c is the first bad commit
> commit 358f70da49d77c43f2ca11b5da584213b2add29c
> Author: Tejun Heo 
> Date:   Tue Jan 9 08:29:50 2018 -0800
> 
> blk-mq: make blk_abort_request() trigger timeout path

Can you share the .config file?

Thanks.

-- 
tejun


Re: Multi-Actuator SAS HDD First Look

2018-04-02 Thread Tim Walker
On Sat, Mar 31, 2018 at 10:52 AM, Douglas Gilbert  wrote:
> On 2018-03-30 04:01 PM, Bart Van Assche wrote:
>>
>> On Fri, 2018-03-30 at 12:36 -0600, Tim Walker wrote:
>>>
>>> Yes I will be there to discuss the multi-LUN approach. I wanted to get
>>> these interface details out so we could have some background and
>>> perhaps folks would come with ideas. I don't have much more to put
>>> out, but I will certainly answer questions - either on this list or
>>> directly.
>>
>>
>> Hello Tim,
>>
>> As far as I know the Linux SCSI stack does not yet support any SCSI
>> devices
>> for which a single SCSI command affects multiple (two in this case) LUNs.
>> Adding such support may take significant work. There will also be the
>> disadvantage that most SCSI core contributors do not have access to a
>> multi-
>> actuator device and hence that their changes may break support for multi-
>> actuator devices.
>
>
> Hmmm, INQUIRY (3PC bit) and REPORT LUNS seem to be counter examples to
> Bart's assertion. Plus there are a more that tell you about things outside
> the addressed LU, for example the SCSI Ports VPD page tells you about other
> SCSI ports hence other LUs) on the current device.
>
>
> From Tim's command list:
>
> Device level
> 
> 0x0, 0x1: okay
> 0x4 (Format unit): yikes, that could be a nasty surprise, accessing a file
>   system on the other LU and getting an error "Not ready, format in
> progress"!!
> 0x12: standard INQUIRY okay, VPD pages not so much LU id different; relative
>   port id, different; target port id different (at the least)
> 0x1b (SSU): storage LUs need to know this model, otherwise the logic on
>   each LU could get into a duel: "I want it spun up; no, I want it spun
>   down ..."
> 0x35, 0x37, 0x3b, 0x3c: okay
> 0x48 (sanitize): similar to Format unit above
> 0x91,0x4c,0x4d: okay
> MODE SENSE/SELECT(6,10): depends on page, block descriptor needs to be
>   partially device level (since LB size may be changed by FU which is
>   device level)
> rest of device level: okay or (I) don't know
> 0xf7: READ UDS DATA, that's interesting, but proprietary I guess
>
> Perhaps you could add a rider on FU and SAN: they get rejected unless the
> other storage LU is in (logical) spun down state.
>
>
> LU specific
> ---
> all okay, I hoping READ(6,10,12,16,32) and their WRITE cousins will be
>   there also :-) Plus the TMF: LU reset
>
> Device or LU
> 
> all okay
>
>
> I'm intrigued by your 3 LU model. My wish list for that:
>
> LUN 0 would be processor device type (0x3) so it wouldn't confuse the
> OS (Linux) that it held storage (READ CAPACITY is mandatory for PDT 0x0
> and cannot represent a 0 block LU) and you could pick and choose which
> SCSI commands to implement on it. LUN 0 TUR could report what the spindle
> was really doing, SSU could do precisely what it is told (and SSU on LUNs
> 1 and 2 would be an "and" for spin down and an "or" for spin up). I've
> got several boxes full of SAS cables and only one cable that I can think
> of that lets me get to the secondary SAS port. So on LUN 0 you could have
> a proprietary (unless T10 takes it on board) mode page to enable the user
> to say which ports that LUNs 1 and 2 where accessible on. Obviously LUN 0
> would need to be accessible on both ports. [A non-accessible LUN would
> still respond to INQUIRY but with a first byte of 07f: PDT=0x1f (unknown)
> and PQ=3 which implies it is there but inaccessible via the current port.]
>
> A processor PDT opens up the possibility of putting a copy manager on
> LUN 0. Think offloaded (from main machine's perspective) backups and
> restores where LUN 1 or 2 is the source or destination.
>
> Enough dreaming.
>
> Doug Gilbert



Thanks for all the input from everybody. I'll collate it for a meeting
with our interface architect.

Best regards,
-TIm

-- 
Tim Walker
Product Design Systems Engineering, Seagate Technology
(303) 775-3770