Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-11 Thread Christoph Hellwig
On Wed, Apr 11, 2018 at 07:58:52PM -0600, Bart Van Assche wrote:
> Several block drivers call alloc_disk() followed by put_disk() if
> something fails before device_add_disk() is called without calling
> blk_cleanup_queue(). Make sure that also for this scenario a request
> queue is dissociated from the cgroup controller. This patch avoids
> that loading the parport_pc, paride and pf drivers triggers the
> following kernel crash:

Can we move the cleanup to put_disk in general and not just for
this case?  Having alloc/free routines pair up generally avoids
a lot of confusion.
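
The pairing Christoph is referring to is the usual rule that whatever an
alloc-style helper sets up, the matching put/free helper tears down, so
early-error paths only ever have to drop their reference. A stand-alone
sketch of that shape (hypothetical names and fields, not the actual
gendisk/request-queue code):

#include <stdlib.h>

/* Hypothetical illustration of alloc/put pairing. */
struct widget {
	int refcount;
	void *io_state;			/* acquired at alloc time */
};

struct widget *widget_alloc(void)
{
	struct widget *w = calloc(1, sizeof(*w));

	if (!w)
		return NULL;
	w->io_state = malloc(64);	/* released only in widget_put() */
	if (!w->io_state) {
		free(w);
		return NULL;
	}
	w->refcount = 1;
	return w;
}

void widget_put(struct widget *w)
{
	if (--w->refcount)
		return;
	free(w->io_state);		/* everything widget_alloc() set up ... */
	free(w);			/* ... is torn down by the matching put */
}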


Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-11 Thread Christoph Hellwig
On Wed, Apr 11, 2018 at 04:19:18PM +0300, Sagi Grimberg wrote:
>
>>   static void __blk_mq_requeue_request(struct request *rq)
>>   {
>>  	struct request_queue *q = rq->q;
>> +	enum mq_rq_state old_state = blk_mq_rq_state(rq);
>>  	blk_mq_put_driver_tag(rq);
>>  	trace_block_rq_requeue(q, rq);
>>  	wbt_requeue(q->rq_wb, &rq->issue_stat);
>> -	if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
>> -		blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
>> +	if (old_state != MQ_RQ_IDLE) {
>> +		if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE))
>> +			WARN_ON_ONCE(true);
>>  		if (q->dma_drain_size && blk_rq_bytes(rq))
>>  			rq->nr_phys_segments--;
>>  	}
>
> Can you explain why old_state was kept as a local variable?

As it was me who added this: it's just so we don't read the state again,
since no one else can change it at this point.

>> +static inline bool blk_mq_change_rq_state(struct request *rq,
>> +					   enum mq_rq_state old_state,
>> +					   enum mq_rq_state new_state)
>>   {
>> -	u64 old_val = READ_ONCE(rq->gstate);
>> -	u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state;
>> -
>> -	if (state == MQ_RQ_IN_FLIGHT) {
>> -		WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
>> -		new_val += MQ_RQ_GEN_INC;
>> -	}
>> +	unsigned long old_val = (READ_ONCE(rq->__deadline) & ~RQ_STATE_MASK) |
>> +				old_state;
>> +	unsigned long new_val = (old_val & ~RQ_STATE_MASK) | new_state;
>> -	/* avoid exposing interim values */
>> -	WRITE_ONCE(rq->gstate, new_val);
>> +	return cmpxchg(&rq->__deadline, old_val, new_val) == old_val;
>>   }
>
> Can you explain why this takes the old_state of the request?
>
> Otherwise this looks good to me,

Because that is how cmpxchg works?
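
To spell out the terse answer: a compare-and-exchange has to be told which
value the caller believes is current, and it only performs the write if that
guess still holds; that is exactly why the helper takes old_state. A small
stand-alone sketch of the same semantics using C11 atomics (an analogy, not
the kernel's cmpxchg() implementation):

#include <stdatomic.h>
#include <stdbool.h>

enum rq_state { RQ_IDLE, RQ_IN_FLIGHT, RQ_COMPLETE };

/* The transition succeeds only if the state is still exactly old_state;
 * if another context changed it first, nothing is written and the caller
 * knows it lost the race. */
static bool change_rq_state(_Atomic int *state,
			    enum rq_state old_state, enum rq_state new_state)
{
	int expected = old_state;

	return atomic_compare_exchange_strong(state, &expected, new_state);
}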


Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-11 Thread Alexandru Moise
On Thu, Apr 12, 2018 at 06:20:34AM +0200, Alexandru Moise wrote:
> On Wed, Apr 11, 2018 at 07:58:52PM -0600, Bart Van Assche wrote:
> > Several block drivers call alloc_disk() followed by put_disk() if
> > something fails before device_add_disk() is called without calling
> > blk_cleanup_queue(). Make sure that also for this scenario a request
> > queue is dissociated from the cgroup controller. This patch avoids
> > that loading the parport_pc, paride and pf drivers triggers the
> > following kernel crash:
> > 
> > BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride]
> > Read of size 4 at addr 0008 by task modprobe/744
> > Call Trace:
> > dump_stack+0x9a/0xeb
> > kasan_report+0x139/0x350
> > pi_init+0x42e/0x580 [paride]
> > pf_init+0x2bb/0x1000 [pf]
> > do_one_initcall+0x8e/0x405
> > do_init_module+0xd9/0x2f2
> > load_module+0x3ab4/0x4700
> > SYSC_finit_module+0x176/0x1a0
> > do_syscall_64+0xee/0x2b0
> > entry_SYSCALL_64_after_hwframe+0x42/0xb7
> > 
> > Reported-by: Alexandru Moise <00moses.alexande...@gmail.com>
> > Fixes: a063057d7c73 ("block: Fix a race between request queue removal and 
> > the block cgroup controller")
> > Signed-off-by: Bart Van Assche 
> > Cc: Alexandru Moise <00moses.alexande...@gmail.com>
> > Cc: Tejun Heo 
> > ---
> >  block/blk-sysfs.c | 26 ++
> >  1 file changed, 26 insertions(+)
> > 
> > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > index f8457d6f0190..2e134da78f82 100644
> > --- a/block/blk-sysfs.c
> > +++ b/block/blk-sysfs.c
> > @@ -788,11 +788,37 @@ static void blk_free_queue_rcu(struct rcu_head 
> > *rcu_head)
> >  static void __blk_release_queue(struct work_struct *work)
> >  {
> > struct request_queue *q = container_of(work, typeof(*q), release_work);
> > +   struct blkcg_gq *gq;
> >  
> > 	if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags))
> > blk_stat_remove_callback(q, q->poll_cb);
> > blk_stat_free_callback(q->poll_cb);
> >  
> > +   if (!blk_queue_dead(q)) {
> > +   /*
> > +* Last reference was dropped without having called
> > +* blk_cleanup_queue().
> > +*/
> > +   WARN_ONCE(blk_queue_init_done(q),
> > + "request queue %p has been registered but 
> > blk_cleanup_queue() has not been called for that queue\n",
> > + q);
> > +   bdi_put(q->backing_dev_info);
> > +   blkcg_exit_queue(q);
> > +
> > +   if (q->elevator) {
> > +   ioc_clear_queue(q);
> > +   elevator_exit(q, q->elevator);
> > +   }
> > +   }
> > +
> > +   rcu_read_lock();
> > +	gq = blkg_lookup(&blkcg_root, q);
> > +   rcu_read_unlock();
> > +
> > +   WARN(gq,
> > +"request queue %p is being released but it has not yet been 
> > removed from the blkcg controller\n",
> > +q);
> > +
> > blk_free_queue_stats(q->stats);
> 
> This solution is good. Thanks for fixing this!
> 
> Tested-by: Alexandru Moise <00moses.alexande...@gmail.com>
> 
> ../Alex

Oh, you also need the no-op stubs for !CONFIG_BLK_CGROUP; there is no
blkcg_root in that configuration either.

Thanks,
../Alex
> 
> >  
> > 	blk_exit_rl(q, &q->root_rl);
> > -- 
> > 2.16.2
> > 
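
Alexandru's point follows the usual kernel convention: when a subsystem is
compiled out, its header supplies static inline no-op stubs so callers build
without #ifdefs, and symbols such as blkcg_root simply do not exist. A rough
sketch of that shape (illustrative, not the exact contents of the blk-cgroup
header):

struct request_queue;
struct blkcg;
struct blkcg_gq;

#ifdef CONFIG_BLK_CGROUP
int blkcg_init_queue(struct request_queue *q);
void blkcg_exit_queue(struct request_queue *q);
struct blkcg_gq *blkg_lookup(struct blkcg *blkcg, struct request_queue *q);
extern struct blkcg blkcg_root;
#else
static inline int blkcg_init_queue(struct request_queue *q) { return 0; }
static inline void blkcg_exit_queue(struct request_queue *q) { }
/* There is no blkcg_root in this configuration, so any sanity check that
 * does blkg_lookup(&blkcg_root, q) must itself compile away to nothing. */
#endif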


Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-11 Thread Alexandru Moise
On Wed, Apr 11, 2018 at 07:58:52PM -0600, Bart Van Assche wrote:
> Several block drivers call alloc_disk() followed by put_disk() if
> something fails before device_add_disk() is called without calling
> blk_cleanup_queue(). Make sure that also for this scenario a request
> queue is dissociated from the cgroup controller. This patch avoids
> that loading the parport_pc, paride and pf drivers triggers the
> following kernel crash:
> 
> BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride]
> Read of size 4 at addr 0008 by task modprobe/744
> Call Trace:
> dump_stack+0x9a/0xeb
> kasan_report+0x139/0x350
> pi_init+0x42e/0x580 [paride]
> pf_init+0x2bb/0x1000 [pf]
> do_one_initcall+0x8e/0x405
> do_init_module+0xd9/0x2f2
> load_module+0x3ab4/0x4700
> SYSC_finit_module+0x176/0x1a0
> do_syscall_64+0xee/0x2b0
> entry_SYSCALL_64_after_hwframe+0x42/0xb7
> 
> Reported-by: Alexandru Moise <00moses.alexande...@gmail.com>
> Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the 
> block cgroup controller")
> Signed-off-by: Bart Van Assche 
> Cc: Alexandru Moise <00moses.alexande...@gmail.com>
> Cc: Tejun Heo 
> ---
>  block/blk-sysfs.c | 26 ++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index f8457d6f0190..2e134da78f82 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -788,11 +788,37 @@ static void blk_free_queue_rcu(struct rcu_head 
> *rcu_head)
>  static void __blk_release_queue(struct work_struct *work)
>  {
>   struct request_queue *q = container_of(work, typeof(*q), release_work);
> + struct blkcg_gq *gq;
>  
> 	if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags))
>   blk_stat_remove_callback(q, q->poll_cb);
>   blk_stat_free_callback(q->poll_cb);
>  
> + if (!blk_queue_dead(q)) {
> + /*
> +  * Last reference was dropped without having called
> +  * blk_cleanup_queue().
> +  */
> + WARN_ONCE(blk_queue_init_done(q),
> +   "request queue %p has been registered but 
> blk_cleanup_queue() has not been called for that queue\n",
> +   q);
> + bdi_put(q->backing_dev_info);
> + blkcg_exit_queue(q);
> +
> + if (q->elevator) {
> + ioc_clear_queue(q);
> + elevator_exit(q, q->elevator);
> + }
> + }
> +
> + rcu_read_lock();
> +	gq = blkg_lookup(&blkcg_root, q);
> + rcu_read_unlock();
> +
> + WARN(gq,
> +  "request queue %p is being released but it has not yet been 
> removed from the blkcg controller\n",
> +  q);
> +
>   blk_free_queue_stats(q->stats);

This solution is good. Thanks for fixing this!

Tested-by: Alexandru Moise <00moses.alexande...@gmail.com>

../Alex

>  
> 	blk_exit_rl(q, &q->root_rl);
> -- 
> 2.16.2
> 


Re: [PATCH V2] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-11 Thread jianchao.wang
Hi Ming

On 04/12/2018 07:38 AM, Ming Lei wrote:
> +  *
> +  * Cover complete vs BLK_EH_RESET_TIMER race in slow path with
> +	 * holding the queue lock.
>*/
> 	hctx_lock(hctx, &srcu_idx);
>   if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
>   __blk_mq_complete_request(rq);
> + else {
> + unsigned long flags;
> + bool need_complete = false;
> +
> + spin_lock_irqsave(q->queue_lock, flags);
> + if (!blk_mq_rq_aborted_gstate(rq))
> + need_complete = true;
> + else
> + blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE_IN_TIMEOUT);
> + spin_unlock_irqrestore(q->queue_lock, flags);

What if .timeout() returns BLK_EH_HANDLED while this is going on?

timeout context                        irq context
---------------                        -----------
.timeout()
                                       blk_mq_complete_request()
                                         set state to MQ_RQ_COMPLETE_IN_TIMEOUT
__blk_mq_complete_request()
  WARN_ON_ONCE(blk_mq_rq_state(rq)
               != MQ_RQ_IN_FLIGHT);

Furthermore, on blk_mq_free_request() the finally freed request may be left
in state MQ_RQ_COMPLETE_IN_TIMEOUT instead of MQ_RQ_IDLE.

Thanks
Jianchao


[PATCH] block: Ensure that a request queue is dissociated from the cgroup controller

2018-04-11 Thread Bart Van Assche
Several block drivers call alloc_disk() followed by put_disk() if
something fails before device_add_disk() is called without calling
blk_cleanup_queue(). Make sure that also for this scenario a request
queue is dissociated from the cgroup controller. This patch avoids
that loading the parport_pc, paride and pf drivers triggers the
following kernel crash:

BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride]
Read of size 4 at addr 0008 by task modprobe/744
Call Trace:
dump_stack+0x9a/0xeb
kasan_report+0x139/0x350
pi_init+0x42e/0x580 [paride]
pf_init+0x2bb/0x1000 [pf]
do_one_initcall+0x8e/0x405
do_init_module+0xd9/0x2f2
load_module+0x3ab4/0x4700
SYSC_finit_module+0x176/0x1a0
do_syscall_64+0xee/0x2b0
entry_SYSCALL_64_after_hwframe+0x42/0xb7

Reported-by: Alexandru Moise <00moses.alexande...@gmail.com>
Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the block cgroup controller")
Signed-off-by: Bart Van Assche 
Cc: Alexandru Moise <00moses.alexande...@gmail.com>
Cc: Tejun Heo 
---
 block/blk-sysfs.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index f8457d6f0190..2e134da78f82 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -788,11 +788,37 @@ static void blk_free_queue_rcu(struct rcu_head *rcu_head)
 static void __blk_release_queue(struct work_struct *work)
 {
struct request_queue *q = container_of(work, typeof(*q), release_work);
+   struct blkcg_gq *gq;
 
	if (test_bit(QUEUE_FLAG_POLL_STATS, &q->queue_flags))
blk_stat_remove_callback(q, q->poll_cb);
blk_stat_free_callback(q->poll_cb);
 
+	if (!blk_queue_dead(q)) {
+		/*
+		 * Last reference was dropped without having called
+		 * blk_cleanup_queue().
+		 */
+		WARN_ONCE(blk_queue_init_done(q),
+			  "request queue %p has been registered but blk_cleanup_queue() has not been called for that queue\n",
+			  q);
+		bdi_put(q->backing_dev_info);
+		blkcg_exit_queue(q);
+
+		if (q->elevator) {
+			ioc_clear_queue(q);
+			elevator_exit(q, q->elevator);
+		}
+	}
+
+	rcu_read_lock();
+	gq = blkg_lookup(&blkcg_root, q);
+	rcu_read_unlock();
+
+	WARN(gq,
+	     "request queue %p is being released but it has not yet been removed from the blkcg controller\n",
+	     q);
+
blk_free_queue_stats(q->stats);
 
	blk_exit_rl(q, &q->root_rl);
-- 
2.16.2



Re: System hung in I/O when booting with sd card

2018-04-11 Thread Bart Van Assche
On Thu, 2018-04-12 at 09:48 +0800, Shawn Lin wrote:
> I have twice run into my system hanging here when booting with an ext4 SD
> card. I'm not sure how to reproduce it, but it doesn't seem related to ext4,
> as I saw it again this morning with a vfat SD card on Linus' master branch.
> Is this a known issue, or any idea how to debug it?

Please verify whether the following patch resolves this:

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=for-linus=37f9579f4c31a6d698dbf3016d7bf132f9288d30

Thanks,

Bart.






System hung in I/O when booting with sd card

2018-04-11 Thread Shawn Lin

Hi,

I have twice run into my system hanging here when booting with an ext4 SD
card. I'm not sure how to reproduce it, but it doesn't seem related to ext4,
as I saw it again this morning with a vfat SD card on Linus' master branch.
Is this a known issue, or any idea how to debug it?


[  110.877652] Unable to handle kernel NULL pointer dereference at 
virtual address 00c8

[  110.878360] Mem abort info:
[  110.878607]   ESR = 0x9606
[  110.878876]   Exception class = DABT (current EL), IL = 32 bits
[  110.879391]   SET = 0, FnV = 0
[  110.879659]   EA = 0, S1PTW = 0
[  110.879957] Data abort info:
[  110.880210]   ISV = 0, ISS = 0x0006
[  110.880545]   CM = 0, WnR = 0
[  110.880807] user pgtable: 4k pages, 48-bit VAs, pgdp = ed0ed92e
[  110.881382] [00c8] pgd=f015e003, 
pud=f0173003, pmd=

[  110.882143] Internal error: Oops: 9606 [#1] PREEMPT SMP
[  110.882629] Modules linked in:
[  110.882902] CPU: 4 PID: 1777 Comm: ls Not tainted 
4.16.0-next-20180410-00013-geabffa6-dirty #315

[  110.883665] Hardware name: Excavator-RK3399 Board (DT)
[  110.884114] pstate: 4005 (nZcv daif -PAN -UAO)
[  110.884542] pc : percpu_counter_add_batch+0x2c/0x100
[  110.884977] lr : generic_make_request_checks+0x214/0x478
[  110.885440] sp : 0c403960
[  110.885730] x29: 0c403960 x28: 8000efcba6e0
[  110.886196] x27: 08a81000 x26: 082643f8
[  110.886662] x25: 08263828 x24: 3000
[  110.887127] x23: 1000 x22: 
[  110.887591] x21: 1000 x20: 00a8
[  110.888056] x19: 8000f0d7dc00 x18: 0001
[  110.888521] x17: 004ab8a8 x16: 08241628
[  110.888986] x15: ff80 x14: 7e0003bbaf80
[  110.889451] x13: 00c5 x12: 09069b88
[  110.889916] x11: 0040 x10: 8000efc57dc0
[  110.890380] x9 : 8000efc57e90 x8 : 8000f0d3b138
[  110.890844] x7 :  x6 : 8000f0d7dc88
[  110.891309] x5 : 0e56 x4 : 0001
[  110.891774] x3 : 8000f0cab800 x2 : 3fff
[  110.892239] x1 : 8000edf3 x0 : 00a8
[  110.892705] Process ls (pid: 1777, stack limit = 0xb3fb6733)
[  110.893258] Call trace:
[  110.893475]  percpu_counter_add_batch+0x2c/0x100
[  110.893879]  generic_make_request_checks+0x214/0x478
[  110.894313]  generic_make_request+0x34/0x260
[  110.894686]  submit_bio+0xcc/0x1b0
[  110.894988]  ll_rw_block+0xc0/0x100
[  110.895295]  ext4_bread+0x74/0xc0
[  110.895587]  __ext4_read_dirblock+0x3c/0x2d8
[  110.895960]  htree_dirblock_to_tree+0x70/0x1d8
[  110.896349]  ext4_htree_fill_tree+0xa4/0x2c8
[  110.896723]  ext4_readdir+0x5f4/0x7f0
[  110.897046]  iterate_dir+0x9c/0x1a8
[  110.897351]  ksys_getdents64+0x8c/0x168
[  110.897687]  sys_getdents64+0xc/0x18
[  110.898001]  el0_svc_naked+0x30/0x34
[  110.898315] Code: b9401064 11000484 b9001064 d538d081 (f9401000)
[  110.898848] ---[ end trace 3b0d37baa3bb2fb3 ]---
[  110.899260] note: ls[1777] exited with preempt_count 1
[  110.899852] WARNING: CPU: 4 PID: 1777 at kernel/rcu/tree_plugin.h:330 
rcu_note_context_switch+0x30/0x3b8

[  110.900676] Modules linked in:
Se[gmentation fault  110.900947
[root@rockchip:/]# ] CPU: 4 PID: 1777 Comm: ls Tainted: G  D 
  4.16.0-next-20180410-00013-geabffa6-dirty #315

[  110.902121] Hardware name: Excavator-RK3399 Board (DT)
[  110.902570] pstate: 2085 (nzCv daIf -PAN -UAO)
[  110.902990] pc : rcu_note_context_switch+0x30/0x3b8
[  110.903417] lr : rcu_note_context_switch+0x1c/0x3b8
[  110.903841] sp : 0c403480
[  110.904132] x29: 0c403480 x28: 0c403820
[  110.904597] x27:  x26: 8000f0cab800
[  110.905062] x25: 080fdaa8 x24: 09069000
[  110.905527] x23: 09042000 x22: 8000f0cab800
[  110.905992] x21: 8000f6f7ed00 x20: 
[  110.906456] x19: 8000f0cab800 x18: 
[  110.906921] x17: 004ab8a8 x16: 08241628
[  110.907386] x15:  x14: 0400
[  110.907851] x13: 091b6200 x12: 0004
[  110.908316] x11: 0019c85d4000 x10: 0400
[  110.908781] x9 : 0004 x8 : 8000f02e6a00
[  110.909246] x7 : 8000f6f7f760 x6 : 090ba86f
[  110.909711] x5 :  x4 : 
[  110.910176] x3 : 8000edf3 x2 : 0004
[  110.910641] x1 : 0904fa98 x0 : 0001
[  110.911106] Call trace:
[  110.911322]  rcu_note_context_switch+0x30/0x3b8
[  110.911722]  __schedule+0x90/0x600
[  110.912022]  do_task_dead+0x40/0x48
[  110.912328]  do_exit+0x6c8/0x9b8
[  110.912613]  die+0x1cc/0x1f8
[  110.912868]  __do_kernel_fault+0xa4/0xf8
[  110.913212]  do_page_fault+0x1f0/0x428
[  110.913541]  do_translation_fault+0x5c/0x68
[  110.913908]  do_mem_abort+0x54/0xd8
[  110.914215]  el1_da+0x20/0x80
[  110.914476]  

[PATCH] block/amiflop: Don't log an error message for an invalid ioctl

2018-04-11 Thread Finn Thain
Do as the swim3 driver does and just return -ENOTTY.

Cc: Geert Uytterhoeven 
Cc: linux-m...@lists.linux-m68k.org
Signed-off-by: Finn Thain 
---
 drivers/block/amiflop.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/block/amiflop.c b/drivers/block/amiflop.c
index 3aaf6af3ec23..b9fb794c1255 100644
--- a/drivers/block/amiflop.c
+++ b/drivers/block/amiflop.c
@@ -1533,9 +1533,7 @@ static int fd_locked_ioctl(struct block_device *bdev, 
fmode_t mode,
return p->type->read_size;
 #endif
default:
-   printk(KERN_DEBUG "fd_ioctl: unknown cmd %d for drive %d.",
-  cmd, drive);
-   return -ENOSYS;
+   return -ENOTTY;
}
return 0;
 }
-- 
2.16.1
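
For readers wondering about the error code choice: -ENOTTY is the
conventional "inappropriate ioctl for device" return, whereas -ENOSYS would
claim the ioctl system call itself is unimplemented, which confuses userspace
that probes for optional commands. A tiny hedged sketch of the pattern
(hypothetical command number, not the amiflop code):

#include <errno.h>

#define MYFD_EJECT 0x4d01		/* hypothetical supported command */

static int fd_do_ioctl(unsigned int cmd, unsigned long arg)
{
	(void)arg;
	switch (cmd) {
	case MYFD_EJECT:
		/* ... perform the eject ... */
		return 0;
	default:
		/* Unknown command: reject quietly, no log spam. */
		return -ENOTTY;
	}
}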



[PATCH v2 03/10] m68k/mac: Don't remap SWIM MMIO region

2018-04-11 Thread Finn Thain
For reasons I don't understand, calling ioremap() then iounmap() on
the SWIM MMIO region causes a hang on 68030 (but not on 68040).

~# modprobe swim_mod
SWIM floppy driver Version 0.2 (2008-10-30)
SWIM device not found !
watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:285]
Modules linked in: swim_mod(+)
Format 00  Vector: 0064  PC: 75aa  Status: 2000Not tainted
ORIG_D0:   D0: d00c  A2: 007c2370  A1: 003f810c
A0: 0004  D5: d0096800  D4: d0097e00
D3: 0001  D2: 0003  D1: 
Non-Maskable Interrupt
Modules linked in: swim_mod(+)
PC: [<75ba>] __iounmap+0x24/0x10e
SR: 2000  SP: 007abc48  a2: 007c2370
d0: d00cd1: 01a0d2: 0019d3: 0001
d4: d0097e00d5: d0096800a0: 0004a1: 003f810c
Process modprobe (pid: 285, task=007c2370)
Frame format=0
Stack from 007abc7c:
ffed  006a4060 004712e0 007abca0 76ea d008 0008
010bb4b8 007abcd8 010ba542 d0096000   0001 010bb59c
 007abf30 010bb4b8 0047760a 0047763c 00477612 00616540 007abcec
0020a91a 00477600 0047760a 010bb4cc 007abd18 002092f2 0047760a 00333b06
007abd5c  0047760a 010bb4cc 00404f90 004776b8 0001 007abd38
00209446 010bb4cc 0047760a 010bb4cc 0020938e 0031f8be 00616540 007abd64
Call Trace: [<76ea>] iounmap+0x46/0x5a
 [<0008>] shrink_page_list+0x7f6/0xe06
 [<010ba542>] swim_probe+0xe4/0x496 [swim_mod]
 [<0020a91a>] platform_drv_probe+0x20/0x5e
 [<002092f2>] driver_probe_device+0x21c/0x2b8
 [<00333b06>] mutex_lock+0x0/0x2e
 [<00209446>] __driver_attach+0xb8/0xce
 [<0020938e>] __driver_attach+0x0/0xce
 [<0031f8be>] klist_next+0x0/0xa0
 [<00207562>] bus_for_each_dev+0x74/0xba
 [<000344c0>] blocking_notifier_call_chain+0x0/0x20
 [<00333b06>] mutex_lock+0x0/0x2e
 [<00208e44>] driver_attach+0x1a/0x1e
 [<0020938e>] __driver_attach+0x0/0xce
 [<00207e26>] bus_add_driver+0x188/0x234
 [<000344c0>] blocking_notifier_call_chain+0x0/0x20
 [<00209894>] driver_register+0x58/0x104
 [<000344c0>] blocking_notifier_call_chain+0x0/0x20
 [<010bd000>] swim_init+0x0/0x2c [swim_mod]
 [<0020a7be>] __platform_driver_register+0x38/0x3c
 [<010bd028>] swim_init+0x28/0x2c [swim_mod]
 [<20dc>] do_one_initcall+0x38/0x196
 [<000344c0>] blocking_notifier_call_chain+0x0/0x20
 [<003331cc>] mutex_unlock+0x0/0x3e
 [<00333b06>] mutex_lock+0x0/0x2e
 [<003331cc>] mutex_unlock+0x0/0x3e
 [<00333b06>] mutex_lock+0x0/0x2e
 [<003331cc>] mutex_unlock+0x0/0x3e
 [<00333b06>] mutex_lock+0x0/0x2e
 [<003331cc>] mutex_unlock+0x0/0x3e
 [<00333b06>] mutex_lock+0x0/0x2e
 [<00075008>] __free_pages+0x0/0x38
 [<45c0>] mangle_kernel_stack+0x30/0xda
 [<000344c0>] blocking_notifier_call_chain+0x0/0x20
 [<003331cc>] mutex_unlock+0x0/0x3e
 [<00333b06>] mutex_lock+0x0/0x2e
 [<0005ced4>] do_init_module+0x42/0x266
 [<010bd000>] swim_init+0x0/0x2c [swim_mod]
 [<000344c0>] blocking_notifier_call_chain+0x0/0x20
 [<0005eda0>] load_module+0x1a30/0x1e70
 [<465d>] mangle_kernel_stack+0xcd/0xda
 [<00331c64>] __generic_copy_from_user+0x0/0x46
 [<0033256e>] _cond_resched+0x0/0x32
 [<00331b9c>] memset+0x0/0x98
 [<0033256e>] _cond_resched+0x0/0x32
 [<0005f25c>] SyS_init_module+0x7c/0x112
 [<2000>] _start+0x0/0x8
 [<2000>] _start+0x0/0x8
 [<00331c82>] __generic_copy_from_user+0x1e/0x46
 [<0005f2b2>] SyS_init_module+0xd2/0x112
 [<465d>] mangle_kernel_stack+0xcd/0xda
 [<2b40>] syscall+0x8/0xc
 [<465d>] mangle_kernel_stack+0xcd/0xda
 [<0008c00c>] pcpu_balance_workfn+0xb2/0x40e
Code: 2200 7419 e4a9 e589 2841 d9fc  1000 <2414> 7203 c282 7602 b681 6600 
0096 0242 fe00 0482   e9c0 11c3 ed89 2642

There's no need to call ioremap() for the SWIM address range, as it lies
within the usual IO device region at 0x5000 , which has already been
mapped by head.S.

Remove the redundant ioremap() and iounmap() calls to fix the hang.

Cc: Laurent Vivier 
Cc: sta...@vger.kernel.org # v4.14+
Tested-by: Stan Johnson 
Signed-off-by: Finn Thain 
Acked-by: Laurent Vivier 
---
 drivers/block/swim.c | 10 +++---
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/drivers/block/swim.c b/drivers/block/swim.c
index 64e066eba72e..92f0cddc597e 100644
--- a/drivers/block/swim.c
+++ b/drivers/block/swim.c
@@ -911,7 +911,7 @@ static int swim_probe(struct platform_device *dev)
goto out;
}
 
-   swim_base = ioremap(res->start, resource_size(res));
+   swim_base = (struct swim __iomem *)res->start;
if (!swim_base) {
ret = -ENOMEM;
goto out_release_io;
@@ -923,7 +923,7 @@ static int swim_probe(struct platform_device *dev)
if (!get_swim_mode(swim_base)) {
printk(KERN_INFO "SWIM device not found !\n");
ret = -ENODEV;
-   goto out_iounmap;
+   goto out_release_io;
}
 
/* set platform 

[PATCH v2 02/10] m68k/mac: Fix SWIM memory resource end address

2018-04-11 Thread Finn Thain
The resource size is 0x2000 == end - start + 1.
Therefore end == start + 0x2000 - 1.

Cc: Laurent Vivier 
Cc: sta...@vger.kernel.org # v4.14+
Tested-by: Stan Johnson 
Signed-off-by: Finn Thain 
Acked-by: Laurent Vivier 
Reviewed-by: Geert Uytterhoeven 
---
 arch/m68k/mac/config.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/m68k/mac/config.c b/arch/m68k/mac/config.c
index 84bd9161bf23..4a544a6c72dd 100644
--- a/arch/m68k/mac/config.c
+++ b/arch/m68k/mac/config.c
@@ -1011,7 +1011,7 @@ int __init mac_platform_init(void)
struct resource swim_rsrc = {
.flags = IORESOURCE_MEM,
.start = (resource_size_t)swim_base,
-   .end   = (resource_size_t)swim_base + 0x2000,
+   .end   = (resource_size_t)swim_base + 0x1FFF,
};
 
platform_device_register_simple("swim", -1, _rsrc, 1);
-- 
2.16.1



Re: 4.15.14 crash with iscsi target and dvd

2018-04-11 Thread Wakko Warner
Wakko Warner wrote:
> Ming Lei wrote:
> > Sure, thanks for your sharing.
> > 
> > Wakko, could you test the following patch and see if there is any
> > difference?
> > 
> > --
> > diff --git a/drivers/target/target_core_pscsi.c 
> > b/drivers/target/target_core_pscsi.c
> > index 0d99b242e82e..6147178f1f37 100644
> > --- a/drivers/target/target_core_pscsi.c
> > +++ b/drivers/target/target_core_pscsi.c
> > @@ -888,7 +888,7 @@ pscsi_map_sg(struct se_cmd *cmd, struct scatterlist 
> > *sgl, u32 sgl_nents,
> > if (len > 0 && data_len > 0) {
> > bytes = min_t(unsigned int, len, PAGE_SIZE - off);
> > bytes = min(bytes, data_len);
> > -
> > + new_bio:
> > if (!bio) {
> > nr_vecs = min_t(int, BIO_MAX_PAGES, nr_pages);
> > nr_pages -= nr_vecs;
> > @@ -931,6 +931,7 @@ pscsi_map_sg(struct se_cmd *cmd, struct scatterlist 
> > *sgl, u32 sgl_nents,
> >  * be allocated with pscsi_get_bio() above.
> >  */
> > bio = NULL;
> > +   goto new_bio;
> > }
> >  
> > data_len -= bytes;
> 
> Sorry for the delay.  I reverted my change, added this one.  I didn't
> reboot, I just unloaded and loaded this one.
> Note: /dev/sr1 as seen from the initiator is /dev/sr0 (physical disc) on the
> target.
> 
> Doesn't crash, however on the initiator I see this:
> [9273849.70] ISO 9660 Extensions: RRIP_1991A
> [9273863.359718] scsi_io_completion: 13 callbacks suppressed
> [9273863.359788] sr 26:0:0:0: [sr1] tag#1 UNKNOWN(0x2003) Result: 
> hostbyte=0x00 driverbyte=0x08
> [9273863.359909] sr 26:0:0:0: [sr1] tag#1 Sense Key : 0x2 [current] 
> [9273863.359974] sr 26:0:0:0: [sr1] tag#1 ASC=0x8 ASCQ=0x0 
> [9273863.360036] sr 26:0:0:0: [sr1] tag#1 CDB: opcode=0x28 28 00 00 22 f6 96 
> 00 00 80 00
> [9273863.360116] blk_update_request: 13 callbacks suppressed
> [9273863.360177] blk_update_request: I/O error, dev sr1, sector 9165400
> [9273875.864648] sr 26:0:0:0: [sr1] tag#1 UNKNOWN(0x2003) Result: 
> hostbyte=0x00 driverbyte=0x08
> [9273875.864738] sr 26:0:0:0: [sr1] tag#1 Sense Key : 0x2 [current] 
> [9273875.864801] sr 26:0:0:0: [sr1] tag#1 ASC=0x8 ASCQ=0x0 
> [9273875.864890] sr 26:0:0:0: [sr1] tag#1 CDB: opcode=0x28 28 00 00 22 f7 16 
> 00 00 80 00
> [9273875.864971] blk_update_request: I/O error, dev sr1, sector 9165912

Just FYI: the jobs I run that use the disc over iSCSI didn't produce any
kernel messages on either system (apart from the informational message when
the disc was mounted).

I have a dumb question though.  Could the label be placed just after the
'if' statement instead of before it?  bio is set to NULL right before the
goto, and the 'if' statement only checks whether it is NULL, so the check is
always true after the goto.

-- 
 Microsoft has beaten Volkswagen's world record.  Volkswagen only created 22
 million bugs.
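
A condensed, self-contained sketch of the control flow being discussed (toy
types and sizes, not the target_core_pscsi code). The goto re-enters just
before the NULL check so that the retried page gets a freshly allocated bio;
right after the "bio = NULL; goto" pair that check is indeed always true,
which is the redundancy the question above points out:

#include <stdbool.h>
#include <stddef.h>

#define TOY_BIO_MAX_VECS 4

struct toy_bio { int used; };

/* Toy stand-ins so the control flow compiles on its own. */
static struct toy_bio *toy_bio_alloc(void)
{
	static struct toy_bio b;
	b.used = 0;
	return &b;
}

static bool toy_bio_add_page(struct toy_bio *bio)
{
	if (bio->used == TOY_BIO_MAX_VECS)
		return false;		/* full: caller must start a new bio */
	bio->used++;
	return true;
}

static void map_pages(int nr_pages)
{
	struct toy_bio *bio = NULL;
	int i = 0;

	while (i < nr_pages) {
new_bio:
		if (!bio)		/* always true right after the goto */
			bio = toy_bio_alloc();

		if (!toy_bio_add_page(bio)) {
			/* bio is full: "submit" it, then retry this page */
			bio = NULL;
			goto new_bio;
		}
		i++;
	}
}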


[PATCH v2 01/10] m68k/mac: Revisit floppy disc controller base addresses

2018-04-11 Thread Finn Thain
Rename floppy_type macros to make them more consistent with the scsi_type
macros, which are named after classes of models with similar memory maps.

The documentation for LC-class machines has the IO devices at offsets
from $50F0 . Use these addresses (consistent with mac_scsi resources)
because they may not be aliased elsewhere in the memory map, e.g. at
offsets from $5000 .

Add comments with controller type information from 'Designing Cards and
Drivers for the Macintosh Family', relevant Developer Notes and
http://mess.redump.net/mess/driver_info/mac_technical_notes

Cc: Laurent Vivier 
Cc: sta...@vger.kernel.org # v4.14+
Tested-by: Stan Johnson 
Signed-off-by: Finn Thain 
Acked-by: Laurent Vivier 
---
 arch/m68k/include/asm/macintosh.h |  10 +--
 arch/m68k/mac/config.c| 124 --
 2 files changed, 70 insertions(+), 64 deletions(-)

diff --git a/arch/m68k/include/asm/macintosh.h 
b/arch/m68k/include/asm/macintosh.h
index 9b840c03ebb7..a61ce06c0a54 100644
--- a/arch/m68k/include/asm/macintosh.h
+++ b/arch/m68k/include/asm/macintosh.h
@@ -79,11 +79,11 @@ struct mac_model
 #define MAC_EXP_PDS_NUBUS  3 /* Accepts PDS card and/or NuBus card(s) */
 #define MAC_EXP_PDS_COMM   4 /* Accepts PDS card or Comm Slot card */
 
-#define MAC_FLOPPY_IWM 0
-#define MAC_FLOPPY_SWIM_ADDR1  1
-#define MAC_FLOPPY_SWIM_ADDR2  2
-#define MAC_FLOPPY_SWIM_IOP3
-#define MAC_FLOPPY_AV  4
+#define MAC_FLOPPY_UNSUPPORTED 0
+#define MAC_FLOPPY_QUADRA  1
+#define MAC_FLOPPY_OLD 2
+#define MAC_FLOPPY_IIFX3
+#define MAC_FLOPPY_LC  4
 
 extern struct mac_model *macintosh_config;
 
diff --git a/arch/m68k/mac/config.c b/arch/m68k/mac/config.c
index 0c3275aa0197..84bd9161bf23 100644
--- a/arch/m68k/mac/config.c
+++ b/arch/m68k/mac/config.c
@@ -214,7 +214,7 @@ static struct mac_model mac_data_table[] = {
.scsi_type  = MAC_SCSI_OLD,
.scc_type   = MAC_SCC_II,
.expansion_type = MAC_EXP_NUBUS,
-   .floppy_type= MAC_FLOPPY_IWM,
+   .floppy_type= MAC_FLOPPY_UNSUPPORTED, /* IWM */
},
 
/*
@@ -229,7 +229,7 @@ static struct mac_model mac_data_table[] = {
.scsi_type  = MAC_SCSI_OLD,
.scc_type   = MAC_SCC_II,
.expansion_type = MAC_EXP_NUBUS,
-   .floppy_type= MAC_FLOPPY_IWM,
+   .floppy_type= MAC_FLOPPY_UNSUPPORTED, /* IWM */
}, {
.ident  = MAC_MODEL_IIX,
.name   = "IIx",
@@ -238,7 +238,7 @@ static struct mac_model mac_data_table[] = {
.scsi_type  = MAC_SCSI_OLD,
.scc_type   = MAC_SCC_II,
.expansion_type = MAC_EXP_NUBUS,
-   .floppy_type= MAC_FLOPPY_SWIM_ADDR2,
+   .floppy_type= MAC_FLOPPY_OLD, /* SWIM */
}, {
.ident  = MAC_MODEL_IICX,
.name   = "IIcx",
@@ -247,7 +247,7 @@ static struct mac_model mac_data_table[] = {
.scsi_type  = MAC_SCSI_OLD,
.scc_type   = MAC_SCC_II,
.expansion_type = MAC_EXP_NUBUS,
-   .floppy_type= MAC_FLOPPY_SWIM_ADDR2,
+   .floppy_type= MAC_FLOPPY_OLD, /* SWIM */
}, {
.ident  = MAC_MODEL_SE30,
.name   = "SE/30",
@@ -256,7 +256,7 @@ static struct mac_model mac_data_table[] = {
.scsi_type  = MAC_SCSI_OLD,
.scc_type   = MAC_SCC_II,
.expansion_type = MAC_EXP_PDS,
-   .floppy_type= MAC_FLOPPY_SWIM_ADDR2,
+   .floppy_type= MAC_FLOPPY_OLD, /* SWIM */
},
 
/*
@@ -274,7 +274,7 @@ static struct mac_model mac_data_table[] = {
.scsi_type  = MAC_SCSI_OLD,
.scc_type   = MAC_SCC_II,
.expansion_type = MAC_EXP_NUBUS,
-   .floppy_type= MAC_FLOPPY_SWIM_ADDR2,
+   .floppy_type= MAC_FLOPPY_OLD, /* SWIM */
}, {
.ident  = MAC_MODEL_IIFX,
.name   = "IIfx",
@@ -283,7 +283,7 @@ static struct mac_model mac_data_table[] = {
.scsi_type  = MAC_SCSI_IIFX,
.scc_type   = MAC_SCC_IOP,
.expansion_type = MAC_EXP_PDS_NUBUS,
-   .floppy_type= MAC_FLOPPY_SWIM_IOP,
+   .floppy_type= MAC_FLOPPY_IIFX, /* SWIM with IOP */
}, {
.ident  = MAC_MODEL_IISI,
.name   = "IIsi",
@@ -292,7 +292,7 @@ static struct mac_model mac_data_table[] = {
.scsi_type  = MAC_SCSI_OLD,
.scc_type   = MAC_SCC_II,
.expansion_type = 

[PATCH v2 05/10] block/swim: Remove extra put_disk() call from error path

2018-04-11 Thread Finn Thain
Cc: Laurent Vivier 
Cc: Jens Axboe 
Cc: sta...@vger.kernel.org # v4.14+
Fixes: 103db8b2dfa5 ("[PATCH] swim: stop sharing request queue across multiple gendisks")
Tested-by: Stan Johnson 
Signed-off-by: Finn Thain 
Acked-by: Laurent Vivier 
Reviewed-by: Geert Uytterhoeven 
---
 drivers/block/swim.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/block/swim.c b/drivers/block/swim.c
index 2cdfc0db5966..0258a96e0c46 100644
--- a/drivers/block/swim.c
+++ b/drivers/block/swim.c
@@ -861,7 +861,6 @@ static int swim_floppy_init(struct swim_priv *swd)
 							      &swd->lock);
if (!swd->unit[drive].disk->queue) {
err = -ENOMEM;
-   put_disk(swd->unit[drive].disk);
goto exit_put_disks;
}
blk_queue_bounce_limit(swd->unit[drive].disk->queue,
-- 
2.16.1



[PATCH v2 06/10] block/swim: Don't log an error message for an invalid ioctl

2018-04-11 Thread Finn Thain
The 'eject' shell command may send various different ioctl commands.
This leads to error messages on the console even though the FDEJECT
ioctl succeeds.

~# eject floppy
SWIM floppy_ioctl: unknown cmd 21257
SWIM floppy_ioctl: unknown cmd 1

Don't log an error message for an invalid ioctl, just do as the
swim3 driver does and return -ENOTTY.

Cc: Laurent Vivier 
Cc: Jens Axboe 
Cc: sta...@vger.kernel.org # v4.14+
Tested-by: Stan Johnson 
Signed-off-by: Finn Thain 
Acked-by: Laurent Vivier 
Reviewed-by: Geert Uytterhoeven 
---
 drivers/block/swim.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/block/swim.c b/drivers/block/swim.c
index 0258a96e0c46..7b847170cf71 100644
--- a/drivers/block/swim.c
+++ b/drivers/block/swim.c
@@ -727,14 +727,9 @@ static int floppy_ioctl(struct block_device *bdev, fmode_t 
mode,
	if (copy_to_user((void __user *) param, (void *) &floppy_type,
 sizeof(struct floppy_struct)))
return -EFAULT;
-   break;
-
-   default:
-   printk(KERN_DEBUG "SWIM floppy_ioctl: unknown cmd %d\n",
-  cmd);
-   return -ENOSYS;
+   return 0;
}
-   return 0;
+   return -ENOTTY;
 }
 
 static int floppy_getgeo(struct block_device *bdev, struct hd_geometry *geo)
-- 
2.16.1



[PATCH v2 04/10] block/swim: Fix array bounds check

2018-04-11 Thread Finn Thain
In the floppy_find() function in swim.c is a call to
get_disk(swd->unit[drive].disk). The actual parameter to this call
can be a NULL pointer when drive == swd->floppy_count. This causes
an oops in get_disk().

Data read fault at 0x0198 in Super Data (pc=0x1be5b6)
BAD KERNEL BUSERR
Oops: 
Modules linked in: swim_mod ipv6 mac8390
PC: [<001be5b6>] get_disk+0xc/0x76
SR: 2004  SP: 9a078bc1  a2: 0213ed90
d0: d1: d2: d3: 00ff
d4: 0002d5: 02983590a0: 02332e00a1: 022dfd64
Process dd (pid: 285, task=020ab25b)
Frame format=B ssw=074d isc=4a88 isb=6732 daddr=0198 dobuf=
baddr=001be5bc dibuf=bfff ver=f
Stack from 022dfca4:
 0203fc00 0213ed90 022dfcc0 02982936  0020 022dfd08
0020f85a 0020 022dfd64 02332e00 004040fc 0014 001be77e 022dfd64
00334e4a 001be3f8 081d 022dfd64 01c04b60 01c04b70 022aba80 029828f8
02332e00 022dfd2c 001be7ac 0203fc00 0020 022dfd64 02103a00 01c04b60
01c04b60 0200e400 022dfd68 000e191a 0020 022dfd64 02103a00 081d
 0003 000b89de 0050 02103a00 01c04b60 02103a08 01c04c2e
Call Trace: [<02982936>] floppy_find+0x3e/0x4a [swim_mod]
 [<0020>] uart_remove_one_port+0x1a2/0x260
 [<0020f85a>] kobj_lookup+0xde/0x132
 [<0020>] uart_remove_one_port+0x1a2/0x260
 [<001be77e>] get_gendisk+0x0/0x130
 [<00334e4a>] mutex_lock+0x0/0x2e
 [<001be3f8>] disk_block_events+0x0/0x6c
 [<029828f8>] floppy_find+0x0/0x4a [swim_mod]
 [<001be7ac>] get_gendisk+0x2e/0x130
 [<0020>] uart_remove_one_port+0x1a2/0x260
 [<000e191a>] __blkdev_get+0x32/0x45a
 [<0020>] uart_remove_one_port+0x1a2/0x260
 [<000b89de>] complete_walk+0x0/0x8a
 [<000e1e22>] blkdev_get+0xe0/0x29a
 [<000e1fdc>] blkdev_open+0x0/0xb0
 [<000b89de>] complete_walk+0x0/0x8a
 [<000e1fdc>] blkdev_open+0x0/0xb0
 [<000e01cc>] bd_acquire+0x74/0x8a
 [<000e205c>] blkdev_open+0x80/0xb0
 [<000e1fdc>] blkdev_open+0x0/0xb0
 [<000abf24>] do_dentry_open+0x1a4/0x322
 [<0002>] __do_proc_douintvec+0x22/0x27e
 [<000b89de>] complete_walk+0x0/0x8a
 [<000baa62>] link_path_walk+0x0/0x48e
 [<000ba3f8>] inode_permission+0x20/0x54
 [<000ac0e4>] vfs_open+0x42/0x78
 [<000bc372>] path_openat+0x2b2/0xeaa
 [<000bc0c0>] path_openat+0x0/0xeaa
 [<0004463e>] __irq_wake_thread+0x0/0x4e
 [<0003a45a>] task_tick_fair+0x18/0xc8
 [<000bd00a>] do_filp_open+0xa0/0xea
 [<000abae0>] do_sys_open+0x11a/0x1ee
 [<0002>] __do_proc_douintvec+0x22/0x27e
 [<000abbf4>] SyS_open+0x1e/0x22
 [<0002>] __do_proc_douintvec+0x22/0x27e
 [<2b40>] syscall+0x8/0xc
 [<0002>] __do_proc_douintvec+0x22/0x27e
 [] dyadic+0x1/0x28
Code: 4e5e 4e75 4e56 fffc 2f0b 2f02 266e 0008 <206b> 0198 4a88 6732 2428 002c 
661e 486b 0058 4eb9 0032 0b96 588f 4a88 672c 2008
Disabling lock debugging due to kernel taint

Fix the array index bounds check to avoid this.

Cc: Laurent Vivier 
Cc: Jens Axboe 
Cc: sta...@vger.kernel.org # v4.14+
Fixes: 8852ecd97488 ("[PATCH] m68k: mac - Add SWIM floppy support")
Tested-by: Stan Johnson 
Signed-off-by: Finn Thain 
Acked-by: Laurent Vivier 
Reviewed-by: Geert Uytterhoeven 
---
 drivers/block/swim.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/swim.c b/drivers/block/swim.c
index 92f0cddc597e..2cdfc0db5966 100644
--- a/drivers/block/swim.c
+++ b/drivers/block/swim.c
@@ -795,7 +795,7 @@ static struct kobject *floppy_find(dev_t dev, int *part, 
void *data)
struct swim_priv *swd = data;
int drive = (*part & 3);
 
-   if (drive > swd->floppy_count)
+   if (drive >= swd->floppy_count)
return NULL;
 
*part = 0;
-- 
2.16.1
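
The arithmetic behind the fix, for completeness: an array sized
floppy_count has valid indices 0 through floppy_count - 1, so an index equal
to the count must already be rejected, which is what '>=' does and '>' does
not. A generic sketch (not the swim.c types):

#include <stddef.h>

/* Valid slots are disks[0] .. disks[count - 1]. */
static void *lookup_disk(void **disks, int count, int idx)
{
	if (idx < 0 || idx >= count)	/* '>' alone lets idx == count through */
		return NULL;
	return disks[idx];
}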



[PATCH v2 09/10] block/swim: Fix IO error at end of medium

2018-04-11 Thread Finn Thain
Reading to the end of a 720K disk results in an IO error instead of EOF
because the block layer thinks the disk has 2880 sectors. (Partly this
is a result of inverted logic of the ONEMEG_MEDIA bit that's now fixed.)

Initialize the density and head count in swim_add_floppy() to agree
with the device size passed to set_capacity() during drive probe.

Call set_capacity() again upon device open, after refreshing the density
and head count values.

Cc: Laurent Vivier 
Cc: Jens Axboe 
Cc: sta...@vger.kernel.org # v4.14+
Tested-by: Stan Johnson 
Signed-off-by: Finn Thain 
Acked-by: Laurent Vivier 
---
 drivers/block/swim.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/drivers/block/swim.c b/drivers/block/swim.c
index c8c8b9da3edd..2c75761b61e8 100644
--- a/drivers/block/swim.c
+++ b/drivers/block/swim.c
@@ -612,7 +612,6 @@ static void setup_medium(struct floppy_state *fs)
struct floppy_struct *g;
fs->disk_in = 1;
fs->write_protected = swim_readbit(base, WRITE_PROT);
-   fs->type = swim_readbit(base, TWOMEG_MEDIA);
 
if (swim_track00(base))
printk(KERN_ERR
@@ -620,6 +619,9 @@ static void setup_medium(struct floppy_state *fs)
 
swim_track00(base);
 
+   fs->type = swim_readbit(base, TWOMEG_MEDIA) ?
+   HD_MEDIA : DD_MEDIA;
+   fs->head_number = swim_readbit(base, SINGLE_SIDED) ? 1 : 2;
	get_floppy_geometry(fs, 0, &g);
fs->total_secs = g->size;
fs->secpercyl = g->head * g->sect;
@@ -656,6 +658,8 @@ static int floppy_open(struct block_device *bdev, fmode_t 
mode)
goto out;
}
 
+   set_capacity(fs->disk, fs->total_secs);
+
if (mode & FMODE_NDELAY)
return 0;
 
@@ -808,10 +812,9 @@ static int swim_add_floppy(struct swim_priv *swd, enum 
drive_location location)
 
swim_motor(base, OFF);
 
-   if (swim_readbit(base, SINGLE_SIDED))
-   fs->head_number = 1;
-   else
-   fs->head_number = 2;
+   fs->type = HD_MEDIA;
+   fs->head_number = 2;
+
fs->ref_count = 0;
fs->ejected = 1;
 
-- 
2.16.1
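
The numbers behind the symptom, as a hedged aside: a 720 KiB double-density
disk holds 720 * 1024 / 512 = 1440 sectors, while a 1440 KiB high-density
disk holds 2880, so a DD medium advertised with 2880 sectors returns I/O
errors past sector 1440 instead of a clean EOF. A small stand-alone sketch of
deriving the value handed to set_capacity() (illustrative geometry fields,
not the driver's structures):

#include <stdio.h>

/* Illustrative geometry; the driver reads the equivalent bits from the drive. */
struct floppy_geom {
	unsigned heads;		/* 1 or 2             */
	unsigned tracks;	/* 80 for 3.5" media  */
	unsigned spt;		/* sectors per track  */
};

static unsigned total_sectors(const struct floppy_geom *g)
{
	return g->heads * g->tracks * g->spt;	/* value passed to set_capacity() */
}

int main(void)
{
	struct floppy_geom dd = { 2, 80, 9 };	/* 720 KiB:  1440 sectors */
	struct floppy_geom hd = { 2, 80, 18 };	/* 1440 KiB: 2880 sectors */

	printf("DD: %u sectors, HD: %u sectors\n",
	       total_sectors(&dd), total_sectors(&hd));
	return 0;
}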



[PATCH v2 08/10] block/swim: Check drive type

2018-04-11 Thread Finn Thain
The SWIM chip is compatible with GCR-mode Sony 400K/800K drives but
this driver only supports MFM mode. Therefore only Sony FDHD drives
are supported. Skip incompatible drives.

Cc: Laurent Vivier 
Cc: Jens Axboe 
Cc: sta...@vger.kernel.org # v4.14+
Tested-by: Stan Johnson 
Signed-off-by: Finn Thain 
Acked-by: Laurent Vivier 
---
 drivers/block/swim.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/block/swim.c b/drivers/block/swim.c
index d1ee4670666a..c8c8b9da3edd 100644
--- a/drivers/block/swim.c
+++ b/drivers/block/swim.c
@@ -829,10 +829,12 @@ static int swim_floppy_init(struct swim_priv *swd)
/* scan floppy drives */
 
swim_drive(base, INTERNAL_DRIVE);
-   if (swim_readbit(base, DRIVE_PRESENT))
+   if (swim_readbit(base, DRIVE_PRESENT) &&
+   !swim_readbit(base, ONEMEG_DRIVE))
swim_add_floppy(swd, INTERNAL_DRIVE);
swim_drive(base, EXTERNAL_DRIVE);
-   if (swim_readbit(base, DRIVE_PRESENT))
+   if (swim_readbit(base, DRIVE_PRESENT) &&
+   !swim_readbit(base, ONEMEG_DRIVE))
swim_add_floppy(swd, EXTERNAL_DRIVE);
 
/* register floppy drives */
-- 
2.16.1



[PATCH v2 10/10] block/swim: Select appropriate drive on device open

2018-04-11 Thread Finn Thain
The driver supports internal and external FDD units so the floppy_open
function must not hard-code the drive location.

Cc: Laurent Vivier 
Cc: Jens Axboe 
Cc: sta...@vger.kernel.org # v4.14+
Tested-by: Stan Johnson 
Signed-off-by: Finn Thain 
Acked-by: Laurent Vivier 
---
 drivers/block/swim.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/block/swim.c b/drivers/block/swim.c
index 2c75761b61e8..0e31884a9519 100644
--- a/drivers/block/swim.c
+++ b/drivers/block/swim.c
@@ -648,7 +648,7 @@ static int floppy_open(struct block_device *bdev, fmode_t 
mode)
 
swim_write(base, setup, S_IBM_DRIVE  | S_FCLK_DIV2);
udelay(10);
-   swim_drive(base, INTERNAL_DRIVE);
+   swim_drive(base, fs->location);
swim_motor(base, ON);
swim_action(base, SETMFM);
if (fs->ejected)
-- 
2.16.1



[PATCH v2 07/10] block/swim: Rename macros to avoid inconsistent inverted logic

2018-04-11 Thread Finn Thain
The Sony drive status bits use active-low logic. The swim_readbit()
function converts that to 'C' logic for readability. Hence, the
sense of the names of the status bit macros should not be inverted.

Mostly they are correct. However, the TWOMEG_DRIVE, MFM_MODE and
TWOMEG_MEDIA macros have inverted sense (like MkLinux). Fix this
inconsistency and make the following patches less confusing.

The same problem affects swim3.c so fix that too.

No functional change.

The FDHD drive status bits are documented in sonydriv.cpp from MAME
and in swimiii.h from MkLinux.

Cc: Laurent Vivier 
Cc: Benjamin Herrenschmidt 
Cc: linuxppc-...@lists.ozlabs.org
Cc: Jens Axboe 
Cc: sta...@vger.kernel.org # v4.14+
Tested-by: Stan Johnson 
Signed-off-by: Finn Thain 
Acked-by: Laurent Vivier 
---
 drivers/block/swim.c  | 8 
 drivers/block/swim3.c | 6 +++---
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/block/swim.c b/drivers/block/swim.c
index 7b847170cf71..d1ee4670666a 100644
--- a/drivers/block/swim.c
+++ b/drivers/block/swim.c
@@ -110,7 +110,7 @@ struct iwm {
 /* Select values for swim_select and swim_readbit */
 
 #define READ_DATA_00x074
-#define TWOMEG_DRIVE   0x075
+#define ONEMEG_DRIVE   0x075
 #define SINGLE_SIDED   0x076
 #define DRIVE_PRESENT  0x077
 #define DISK_IN0x170
@@ -118,9 +118,9 @@ struct iwm {
 #define TRACK_ZERO 0x172
 #define TACHO  0x173
 #define READ_DATA_10x174
-#define MFM_MODE   0x175
+#define GCR_MODE   0x175
 #define SEEK_COMPLETE  0x176
-#define ONEMEG_MEDIA   0x177
+#define TWOMEG_MEDIA   0x177
 
 /* Bits in handshake register */
 
@@ -612,7 +612,7 @@ static void setup_medium(struct floppy_state *fs)
struct floppy_struct *g;
fs->disk_in = 1;
fs->write_protected = swim_readbit(base, WRITE_PROT);
-   fs->type = swim_readbit(base, ONEMEG_MEDIA);
+   fs->type = swim_readbit(base, TWOMEG_MEDIA);
 
if (swim_track00(base))
printk(KERN_ERR
diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c
index af51015d056e..469541c1e51e 100644
--- a/drivers/block/swim3.c
+++ b/drivers/block/swim3.c
@@ -148,7 +148,7 @@ struct swim3 {
 #define MOTOR_ON   2
 #define RELAX  3   /* also eject in progress */
 #define READ_DATA_04
-#define TWOMEG_DRIVE   5
+#define ONEMEG_DRIVE   5
 #define SINGLE_SIDED   6   /* drive or diskette is 4MB type? */
 #define DRIVE_PRESENT  7
 #define DISK_IN8
@@ -156,9 +156,9 @@ struct swim3 {
 #define TRACK_ZERO 10
 #define TACHO  11
 #define READ_DATA_112
-#define MFM_MODE   13
+#define GCR_MODE   13
 #define SEEK_COMPLETE  14
-#define ONEMEG_MEDIA   15
+#define TWOMEG_MEDIA   15
 
 /* Definitions of values used in writing and formatting */
 #define DATA_ESCAPE0x99
-- 
2.16.1
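
The 'C logic' conversion mentioned in the commit message is the usual
active-low idiom: the hardware reports an asserted condition as a 0 bit, and
the read helper inverts it so callers can write natural-looking tests. A
generic hedged sketch (not the actual swim_readbit() implementation):

#include <stdbool.h>
#include <stdint.h>

/* Active-low status read: the drive pulls the line low (bit == 0) when the
 * condition is true, so invert before handing the result to C code. */
static bool status_bit_set(uint8_t status_reg, unsigned int bit)
{
	return !(status_reg & (1u << bit));
}

/* Usage then reads naturally, e.g. if (status_bit_set(reg, DRIVE_PRESENT_BIT)) */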



[PATCH v2 00/10] SWIM driver fixes

2018-04-11 Thread Finn Thain
This patch series has fixes for bugs in the SWIM floppy disk controller
driver, including an oops and a soft lockup.

One way to apply these patches to v4.14+ is by first cherry-picking
these commits:
b87eaec27eca3def6c8ed617e3b1bac08d7bc715
e5f0d2e2a153b18dcf31e1a633e210c37829d759
There are of course other ways to fix the patch rejects, but this way
would be convenient for me because it would simplify my own backporting.

Changes since v1:
- Dropped the two IOP patches as they aren't simple fixes. This way,
  the entire series is suitable for stable trees.
- Added Cc, Fixes, Acked-by and Reviewed-by tags.


Finn Thain (10):
  m68k/mac: Revisit floppy disc controller base addresses
  m68k/mac: Fix SWIM memory resource end address
  m68k/mac: Don't remap SWIM MMIO region
  block/swim: Fix array bounds check
  block/swim: Remove extra put_disk() call from error path
  block/swim: Don't log an error message for an invalid ioctl
  block/swim: Rename macros to avoid inconsistent inverted logic
  block/swim: Check drive type
  block/swim: Fix IO error at end of medium
  block/swim: Select appropriate drive on device open

 arch/m68k/include/asm/macintosh.h |  10 +--
 arch/m68k/mac/config.c| 126 --
 drivers/block/swim.c  |  49 +++
 drivers/block/swim3.c |   6 +-
 4 files changed, 96 insertions(+), 95 deletions(-)

-- 
2.16.1



Re: usercopy whitelist woe in scsi_sense_cache

2018-04-11 Thread Kees Cook
On Wed, Apr 11, 2018 at 3:47 PM, Kees Cook  wrote:
> On Tue, Apr 10, 2018 at 8:13 PM, Kees Cook  wrote:
>> I'll see about booting with my own kernels, etc, and try to narrow this 
>> down. :)
>
> If I boot kernels I've built, I no longer hit the bug in this VM
> (though I'll keep trying). What compiler are you using?

Ignore that: I've reproduced it with my kernels now. I think I messed
up the initramfs initially. But with an exact copy of your .config,
booting under Arch grub with initramfs, I see it. I'll start removing
variables now... :P

-Kees

-- 
Kees Cook
Pixel Security


[PATCH V2] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-11 Thread Ming Lei
The normal request completion can happen before or while BLK_EH_RESET_TIMER
is being handled, and this race may cause the request to never be completed,
since the driver's .timeout() may keep returning BLK_EH_RESET_TIMER.

This issue can't be fixed completely by the driver, since the normal
completion can occur between .timeout() returning and BLK_EH_RESET_TIMER
being handled.

This patch fixes the race by introducing the rq state
MQ_RQ_COMPLETE_IN_TIMEOUT and by reading/writing the rq's state while holding
the queue lock. The lock could actually be per-request, but it is not worth
introducing one for such an unusual event.

Cc: Bart Van Assche 
Cc: Tejun Heo 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Sagi Grimberg 
Cc: Israel Rukshin ,
Cc: Max Gurtovoy 
Cc: sta...@vger.kernel.org
Signed-off-by: Ming Lei 
---

V2:
- rename the new flag as MQ_RQ_COMPLETE_IN_TIMEOUT
- fix lock uses in blk_mq_rq_timed_out
- document re-order between blk_add_timer() and
blk_mq_rq_update_aborted_gstate(req, 0)


 block/blk-mq.c | 49 -
 block/blk-mq.h |  1 +
 2 files changed, 45 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0dc9e341c2a7..db1c84b2bb5e 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -630,10 +630,27 @@ void blk_mq_complete_request(struct request *rq)
 * However, that would complicate paths which want to synchronize
 * against us.  Let stay in sync with the issue path so that
 * hctx_lock() covers both issue and completion paths.
+*
+* Cover complete vs BLK_EH_RESET_TIMER race in slow path with
+	 * holding the queue lock.
 */
	hctx_lock(hctx, &srcu_idx);
if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
__blk_mq_complete_request(rq);
+   else {
+   unsigned long flags;
+   bool need_complete = false;
+
+   spin_lock_irqsave(q->queue_lock, flags);
+   if (!blk_mq_rq_aborted_gstate(rq))
+   need_complete = true;
+   else
+   blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE_IN_TIMEOUT);
+   spin_unlock_irqrestore(q->queue_lock, flags);
+
+   if (need_complete)
+   __blk_mq_complete_request(rq);
+   }
hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -814,6 +831,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool 
reserved)
 {
const struct blk_mq_ops *ops = req->q->mq_ops;
enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
+   unsigned long flags;
 
req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
 
@@ -826,12 +844,33 @@ static void blk_mq_rq_timed_out(struct request *req, bool 
reserved)
break;
case BLK_EH_RESET_TIMER:
/*
-* As nothing prevents from completion happening while
-* ->aborted_gstate is set, this may lead to ignored
-* completions and further spurious timeouts.
+* The normal completion may happen during handling the
+* timeout, or even after returning from .timeout(), so
+* once the request has been completed, we can't reset
+* timer any more since this request may be handled as
+* BLK_EH_RESET_TIMER in next timeout handling too, and
+* it has to be completed in this situation.
+*
+* Holding the queue lock to cover read/write rq's
+* aborted_gstate and normal state, so the race can be
+* avoided completely.
+*
+* blk_add_timer() may be re-ordered with resetting
+* aborted_gstate, and the only side-effect is that if this
+* request is recycled after aborted_gstate is cleared, it
+* may be timed out a bit late; we can survive that, given
+* how unusual a timeout event is.
 */
-   blk_mq_rq_update_aborted_gstate(req, 0);
-   blk_add_timer(req);
+   spin_lock_irqsave(req->q->queue_lock, flags);
+   if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_TIMEOUT) {
+   blk_add_timer(req);
+   blk_mq_rq_update_aborted_gstate(req, 0);
+   spin_unlock_irqrestore(req->q->queue_lock, flags);
+   } else {
+   blk_mq_rq_update_state(req, MQ_RQ_IN_FLIGHT);
+   spin_unlock_irqrestore(req->q->queue_lock, flags);
+   __blk_mq_complete_request(req);
+   }
break;
case BLK_EH_NOT_HANDLED:
break;
diff --git a/block/blk-mq.h 

Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-11 Thread Ming Lei
On Wed, Apr 11, 2018 at 10:49:51PM +, Bart Van Assche wrote:
> On Thu, 2018-04-12 at 04:55 +0800, Ming Lei wrote:
> > +again:
> > switch (ret) {
> > case BLK_EH_HANDLED:
> > __blk_mq_complete_request(req);
> > break;
> > case BLK_EH_RESET_TIMER:
> > [ ... ]
> > +   spin_lock_irqsave(req->q->queue_lock, flags);
> > +   if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
> > +   blk_mq_rq_update_aborted_gstate(req, 0);
> > +   blk_add_timer(req);
> > +   } else {
> > +   blk_mq_rq_update_state(req, MQ_RQ_IN_FLIGHT);
> > +   ret = BLK_EH_HANDLED;
> > +   goto again;
> > +   }
> > +   spin_unlock_irqrestore(req->q->queue_lock, flags);
> 
> Does the above chunk introduce a backwards goto from inside a region around
> which a spinlock is held to outside that region? Can such a goto result in
> anything else than a deadlock?

Yes, it is being fixed in my local V2, :-)

-- 
Ming


Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-11 Thread Bart Van Assche
On Thu, 2018-04-12 at 04:55 +0800, Ming Lei wrote:
> +again:
>   switch (ret) {
>   case BLK_EH_HANDLED:
>   __blk_mq_complete_request(req);
>   break;
>   case BLK_EH_RESET_TIMER:
>   [ ... ]
> + spin_lock_irqsave(req->q->queue_lock, flags);
> + if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
> + blk_mq_rq_update_aborted_gstate(req, 0);
> + blk_add_timer(req);
> + } else {
> + blk_mq_rq_update_state(req, MQ_RQ_IN_FLIGHT);
> + ret = BLK_EH_HANDLED;
> + goto again;
> + }
> + spin_unlock_irqrestore(req->q->queue_lock, flags);

Does the above chunk introduce a backwards goto from inside a region around
which a spinlock is held to outside that region? Can such a goto result in
anything else than a deadlock?

Thanks,

Bart.
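
To make the concern concrete: if the backwards goto is taken while the lock
is still held, control leaves the critical section without ever reaching the
unlock, so the lock stays held and the next acquisition (a retry of this
path, or the completion path) can only deadlock. A minimal sketch of that
hazardous shape, with a pthread mutex standing in for the queue lock (not the
blk-mq code):

#include <pthread.h>

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hazardous shape: the backwards goto exits the locked region without
 * reaching the unlock, so queue_lock stays held forever. */
static void timed_out(int ret, int already_completed)
{
again:
	switch (ret) {
	case 0:		/* handled: complete the request */
		break;
	case 1:		/* reset timer */
		pthread_mutex_lock(&queue_lock);
		if (already_completed) {
			ret = 0;
			goto again;	/* BUG: skips the unlock below */
		}
		/* ... re-arm the timer ... */
		pthread_mutex_unlock(&queue_lock);
		break;
	}
}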





Re: usercopy whitelist woe in scsi_sense_cache

2018-04-11 Thread Kees Cook
On Tue, Apr 10, 2018 at 8:13 PM, Kees Cook  wrote:
> I'll see about booting with my own kernels, etc, and try to narrow this down. 
> :)

If I boot kernels I've built, I no longer hit the bug in this VM
(though I'll keep trying). What compiler are you using?

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-11 Thread Tejun Heo
Hello,

On Thu, Apr 12, 2018 at 06:43:45AM +0800, Ming Lei wrote:
> On Wed, Apr 11, 2018 at 02:30:07PM -0700, Tejun Heo wrote:
> > Hello, Ming.
> > 
> > On Thu, Apr 12, 2018 at 04:55:29AM +0800, Ming Lei wrote:
> > ...
> > > + spin_lock_irqsave(req->q->queue_lock, flags);
> > > + if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
> > > + blk_mq_rq_update_aborted_gstate(req, 0);
> > > + blk_add_timer(req);
> > 
> > Nothing prevents the above blk_add_timer() racing against the next
> > recycle instance of the request, so this still leaves a small race
> > window.
> 
> OK.
> 
> But this small race window can be avoided by running blk_add_timer(req)
> before blk_mq_rq_update_aborted_gstate(req, 0), can't it?

Not really because aborted_gstate right now doesn't have any memory
barrier around it, so nothing ensures blk_add_timer() actually appears
before.  We can either add the matching barriers in aborted_gstate
update and when it's read in the normal completion path, or we can
wait for the update to be visible everywhere by waiting for rcu grace
period (because the reader is rcu protected).

Thanks.

-- 
tejun
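
The "matching barriers" are a pair: one ordering the writer's stores (re-arm
the timer, then clear the aborted state) and one on the reader side before it
acts on what it read, so a reader that observes the cleared flag is also
guaranteed to observe the re-armed timer. A generic release/acquire sketch in
C11 atomics, purely as an illustration of the pairing and not the blk-mq
code (in the kernel this would be smp_store_release()/smp_load_acquire() or
explicit smp_wmb()/smp_rmb()):

#include <stdatomic.h>
#include <stdbool.h>

static _Atomic bool aborted;		/* stand-in for aborted_gstate     */
static _Atomic long timer_deadline;	/* stand-in for the re-armed timer */

/* Timeout path: re-arm the timer, then publish "not aborted" with release
 * semantics so the two stores cannot be observed in the opposite order. */
static void writer(long new_deadline)
{
	atomic_store_explicit(&timer_deadline, new_deadline, memory_order_relaxed);
	atomic_store_explicit(&aborted, false, memory_order_release);
}

/* Completion path: acquire-load the flag before trusting timer_deadline. */
static bool reader(long *deadline_out)
{
	if (atomic_load_explicit(&aborted, memory_order_acquire))
		return false;
	*deadline_out = atomic_load_explicit(&timer_deadline,
					     memory_order_relaxed);
	return true;
}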


Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-11 Thread Ming Lei
On Wed, Apr 11, 2018 at 02:30:07PM -0700, Tejun Heo wrote:
> Hello, Ming.
> 
> On Thu, Apr 12, 2018 at 04:55:29AM +0800, Ming Lei wrote:
> ...
> > +   spin_lock_irqsave(req->q->queue_lock, flags);
> > +   if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
> > +   blk_mq_rq_update_aborted_gstate(req, 0);
> > +   blk_add_timer(req);
> 
> Nothing prevents the above blk_add_timer() racing against the next
> recycle instance of the request, so this still leaves a small race
> window.

OK.

But this small race window can be avoided by running blk_add_timer(req)
before blk_mq_rq_update_aborted_gstate(req, 0), can't it?

-- 
Ming


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-11 Thread t...@kernel.org
Hey, again.

On Wed, Apr 11, 2018 at 10:07:33AM -0700, t...@kernel.org wrote:
> Hello, Israel.
> 
> On Wed, Apr 11, 2018 at 07:16:14PM +0300, Israel Rukshin wrote:
> > >Just noticed this one, this looks interesting to me as well. Israel,
> > >can you run your test with this patch?
> > 
> > Yes, I just did and it looks good.
> 
> Awesome.

Just to be sure, you tested the combined patch and saw the XXX debug
messages, right?

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Bart Van Assche
On Wed, 2018-04-11 at 23:23 +0200, Alexandru Moise wrote:
> Hi, I tested it, it doesn't solve the problem.
> By the time you get here it's already too late, my patch
> prevents this from failing in the first place.

Hello Alex,

If you can share the steps to follow to trigger the bug you reported then
I will have a closer look at this.

Thanks,

Bart.





Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-11 Thread Tejun Heo
Hello, Ming.

On Thu, Apr 12, 2018 at 04:55:29AM +0800, Ming Lei wrote:
...
> + spin_lock_irqsave(req->q->queue_lock, flags);
> + if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
> + blk_mq_rq_update_aborted_gstate(req, 0);
> + blk_add_timer(req);

Nothing prevents the above blk_add_timer() racing against the next
recycle instance of the request, so this still leaves a small race
window.

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Alexandru Moise
On Wed, Apr 11, 2018 at 01:55:25PM -0600, Bart Van Assche wrote:
> On 04/11/18 13:00, Alexandru Moise wrote:
> > But the root cause of it is in blkcg_init_queue() when blkg_create() returns
> > an ERR ptr, because it tries to insert into a populated index into 
> > blkcg->blkg_tree,
> > the entry that we fail to remove at __blk_release_queue().
> 
> Hello Alex,
> 
> Had you considered something like the untested patch below?
> 
> Thanks,
> 
> Bart.
> 
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index 1c16694ae145..f2ced19e74b8 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1191,14 +1191,17 @@ int blkcg_init_queue(struct request_queue *q)
>   if (preloaded)
>   radix_tree_preload_end();
> 
> - if (IS_ERR(blkg))
> - return PTR_ERR(blkg);
> + if (IS_ERR(blkg)) {
> + ret = PTR_ERR(blkg);
> + goto destroy_all;
> + }
> 
>   q->root_blkg = blkg;
>   q->root_rl.blkg = blkg;
> 
>   ret = blk_throtl_init(q);
>   if (ret) {
> +destroy_all:
>   spin_lock_irq(q->queue_lock);
>   blkg_destroy_all(q);
>   spin_unlock_irq(q->queue_lock);
> 

Hi, I tested it, it doesn't solve the problem.
By the time you get here it's already too late, my patch
prevents this from failing in the first place.

I would have liked this more than my solution though.

../Alex



[PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER

2018-04-11 Thread Ming Lei
The normal request completion can happen before or while BLK_EH_RESET_TIMER
is being handled, and this race may cause the request to never be completed
since the driver's .timeout() may keep returning BLK_EH_RESET_TIMER.

This issue can't be fixed completely by the driver, since the normal
completion can occur between .timeout() returning and BLK_EH_RESET_TIMER
being handled.

This patch fixes the race by introducing the request state
MQ_RQ_COMPLETE_IN_RESET and by reading/writing the request state while
holding the queue lock.  The lock could actually be per-request, but it is
not necessary to introduce a dedicated lock for such an unusual event.

Cc: Bart Van Assche 
Cc: Tejun Heo 
Cc: Christoph Hellwig 
Cc: Ming Lei 
Cc: Sagi Grimberg 
Cc: Israel Rukshin ,
Cc: Max Gurtovoy 
Cc: sta...@vger.kernel.org
Signed-off-by: Ming Lei 
---

This is another way to fix this long-standing issue, and it turns out this
solution is much simpler.

 block/blk-mq.c | 44 +++-
 block/blk-mq.h |  1 +
 2 files changed, 40 insertions(+), 5 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 0dc9e341c2a7..12e8850e3905 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -630,10 +630,27 @@ void blk_mq_complete_request(struct request *rq)
 * However, that would complicate paths which want to synchronize
 * against us.  Let stay in sync with the issue path so that
 * hctx_lock() covers both issue and completion paths.
+*
+* Cover the complete vs BLK_EH_RESET_TIMER race in the slow path by
+* holding the queue lock.
 */
	hctx_lock(hctx, &srcu_idx);
if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
__blk_mq_complete_request(rq);
+   else {
+   unsigned long flags;
+   bool need_complete = false;
+
+   spin_lock_irqsave(q->queue_lock, flags);
+   if (!blk_mq_rq_aborted_gstate(rq))
+   need_complete = true;
+   else
+   blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE_IN_RESET);
+   spin_unlock_irqrestore(q->queue_lock, flags);
+
+   if (need_complete)
+   __blk_mq_complete_request(rq);
+   }
hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -814,24 +831,41 @@ static void blk_mq_rq_timed_out(struct request *req, bool 
reserved)
 {
const struct blk_mq_ops *ops = req->q->mq_ops;
enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
+   unsigned long flags;
 
req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED;
 
if (ops->timeout)
ret = ops->timeout(req, reserved);
 
+again:
switch (ret) {
case BLK_EH_HANDLED:
__blk_mq_complete_request(req);
break;
case BLK_EH_RESET_TIMER:
/*
-* As nothing prevents from completion happening while
-* ->aborted_gstate is set, this may lead to ignored
-* completions and further spurious timeouts.
+* The normal completion may happen while the timeout is being
+* handled, or even after .timeout() has returned, so once the
+* request has been completed we must not reset the timer any
+* more, since the next timeout handling might treat this request
+* as BLK_EH_RESET_TIMER again, and it has to be completed in
+* this situation.
+*
+* Hold the queue lock to cover reads/writes of the request's
+* aborted_gstate and normal state, so the race can be avoided
+* completely.
 */
-   blk_mq_rq_update_aborted_gstate(req, 0);
-   blk_add_timer(req);
+   spin_lock_irqsave(req->q->queue_lock, flags);
+   if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) {
+   blk_mq_rq_update_aborted_gstate(req, 0);
+   blk_add_timer(req);
+   } else {
+   blk_mq_rq_update_state(req, MQ_RQ_IN_FLIGHT);
+   ret = BLK_EH_HANDLED;
+   goto again;
+   }
+   spin_unlock_irqrestore(req->q->queue_lock, flags);
break;
case BLK_EH_NOT_HANDLED:
break;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 88c558f71819..6dc242fc785a 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -35,6 +35,7 @@ enum mq_rq_state {
MQ_RQ_IDLE  = 0,
MQ_RQ_IN_FLIGHT = 1,
MQ_RQ_COMPLETE  = 2,
+   MQ_RQ_COMPLETE_IN_RESET = 3,
 
MQ_RQ_STATE_BITS= 2,
MQ_RQ_STATE_MASK= (1 << MQ_RQ_STATE_BITS) - 1,
-- 
2.9.5



Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Bart Van Assche
On Wed, 2018-04-11 at 13:02 -0700, t...@kernel.org wrote:
> On Wed, Apr 11, 2018 at 08:00:29PM +, Bart Van Assche wrote:
> > On Wed, 2018-04-11 at 12:57 -0700, t...@kernel.org wrote:
> > > On Wed, Apr 11, 2018 at 01:55:25PM -0600, Bart Van Assche wrote:
> > > > On 04/11/18 13:00, Alexandru Moise wrote:
> > > > > But the root cause of it is in blkcg_init_queue() when blkg_create() 
> > > > > returns
> > > > > an ERR ptr, because it tries to insert into a populated index into 
> > > > > blkcg->blkg_tree,
> > > > > the entry that we fail to remove at __blk_release_queue().
> > > > 
> > > > Hello Alex,
> > > > 
> > > > Had you considered something like the untested patch below?
> > > 
> > > But queue init shouldn't fail here, right?
> > 
> > Hello Tejun,
> > 
> > Your question is not entirely clear to me. Are you referring to the atomic
> > allocations in blkg_create() or are you perhaps referring to something else?
> 
> Hmm.. maybe I'm confused but I thought that the fact that
> blkcg_init_queue() fails itself is already a bug, which happens
> because a previously destroyed queue left behind blkgs.

Hello Tejun,

I had missed the start of this thread so I was not aware of which problem Alex
was trying to solve. In the description of v1 of this patch I read that Alex
thinks that he ran into a scenario in which blk_alloc_queue_node() assigns a
q->id that is still in use by another request queue? That's weird. The following
code still occurs in __blk_release_queue():

	ida_simple_remove(&blk_queue_ida, q->id);

It's not clear to me how that remove call could happen *before* q->id is removed
from the blkcg radix tree.

Bart.





Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread t...@kernel.org
On Wed, Apr 11, 2018 at 08:00:29PM +, Bart Van Assche wrote:
> On Wed, 2018-04-11 at 12:57 -0700, t...@kernel.org wrote:
> > On Wed, Apr 11, 2018 at 01:55:25PM -0600, Bart Van Assche wrote:
> > > On 04/11/18 13:00, Alexandru Moise wrote:
> > > > But the root cause of it is in blkcg_init_queue() when blkg_create() 
> > > > returns
> > > > an ERR ptr, because it tries to insert into a populated index into 
> > > > blkcg->blkg_tree,
> > > > the entry that we fail to remove at __blk_release_queue().
> > > 
> > > Hello Alex,
> > > 
> > > Had you considered something like the untested patch below?
> > 
> > But queue init shouldn't fail here, right?
> 
> Hello Tejun,
> 
> Your question is not entirely clear to me. Are you referring to the atomic
> allocations in blkg_create() or are you perhaps referring to something else?

Hmm.. maybe I'm confused but I thought that the fact that
blkcg_init_queue() fails itself is already a bug, which happens
because a previously destroyed queue left behind blkgs.

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Bart Van Assche
On Wed, 2018-04-11 at 12:57 -0700, t...@kernel.org wrote:
> On Wed, Apr 11, 2018 at 01:55:25PM -0600, Bart Van Assche wrote:
> > On 04/11/18 13:00, Alexandru Moise wrote:
> > > But the root cause of it is in blkcg_init_queue() when blkg_create() 
> > > returns
> > > an ERR ptr, because it tries to insert into a populated index into 
> > > blkcg->blkg_tree,
> > > the entry that we fail to remove at __blk_release_queue().
> > 
> > Hello Alex,
> > 
> > Had you considered something like the untested patch below?
> 
> But queue init shouldn't fail here, right?

Hello Tejun,

Your question is not entirely clear to me. Are you referring to the atomic
allocations in blkg_create() or are you perhaps referring to something else?

Bart.




Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread t...@kernel.org
On Wed, Apr 11, 2018 at 01:55:25PM -0600, Bart Van Assche wrote:
> On 04/11/18 13:00, Alexandru Moise wrote:
> >But the root cause of it is in blkcg_init_queue() when blkg_create() returns
> >an ERR ptr, because it tries to insert into a populated index into 
> >blkcg->blkg_tree,
> >the entry that we fail to remove at __blk_release_queue().
> 
> Hello Alex,
> 
> Had you considered something like the untested patch below?

But queue init shouldn't fail here, right?

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Bart Van Assche

On 04/11/18 13:00, Alexandru Moise wrote:

But the root cause of it is in blkcg_init_queue() when blkg_create() returns
an ERR ptr, because it tries to insert into a populated index into 
blkcg->blkg_tree,
the entry that we fail to remove at __blk_release_queue().


Hello Alex,

Had you considered something like the untested patch below?

Thanks,

Bart.


diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 1c16694ae145..f2ced19e74b8 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1191,14 +1191,17 @@ int blkcg_init_queue(struct request_queue *q)
if (preloaded)
radix_tree_preload_end();

-   if (IS_ERR(blkg))
-   return PTR_ERR(blkg);
+   if (IS_ERR(blkg)) {
+   ret = PTR_ERR(blkg);
+   goto destroy_all;
+   }

q->root_blkg = blkg;
q->root_rl.blkg = blkg;

ret = blk_throtl_init(q);
if (ret) {
+destroy_all:
spin_lock_irq(q->queue_lock);
blkg_destroy_all(q);
spin_unlock_irq(q->queue_lock);



Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Alexandru Moise
On Wed, Apr 11, 2018 at 03:54:53PM +, Bart Van Assche wrote:
> On Wed, 2018-04-11 at 16:28 +0200, Alexandru Moise wrote:
> > [0.76] BUG: unable to handle kernel NULL pointer dereference at 
> > 01b4
> > [0.763350] Kernel panic - not syncing: Attempted to kill init! 
> > exitcode=0x0009
> > [0.763350]
> > [0.76] PGD 0 P4D 0
> > [0.76] Oops:  [#2] PREEMPT SMP
> > [0.76] CPU: 0 PID: 6 Comm: kworker/u12:0 Tainted: G  D  
> > 4.16.0-ARCH+ #81
> > [0.76] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > 1.11.0-20171110_100015-anatol 04/01/20$
> > [0.76] Workqueue: nvme-reset-wq nvme_reset_work
> > [0.76] RIP: 0010:blk_queue_flag_set+0xf/0x40
> > [0.76] RSP: :c91bfcb0 EFLAGS: 00010246
> > [0.76] RAX: 88003b698000 RBX:  RCX: 
> > 
> > [0.76] RDX: 88003b698000 RSI: fff4 RDI: 
> > 001c
> > [0.76] RBP: c91bfcc0 R08:  R09: 
> > 
> > [0.76] R10: eaeaa980 R11: 814e0970 R12: 
> > 001c
> > [0.76] R13:  R14:  R15: 
> > 88003aad8010
> > [0.76] FS:  () GS:88003e40() 
> > knlGS:
> > [0.76] CS:  0010 DS:  ES:  CR0: 80050033
> > [0.76] CR2: 01b4 CR3: 02209001 CR4: 
> > 000606f0
> > [0.76] Call Trace:
> > [0.76]  blk_mq_quiesce_queue+0x23/0x80
> > [0.76]  nvme_dev_disable+0x34f/0x480
> > [0.76]  ? nvme_irq+0x50/0x50
> > [0.76]  ? dev_warn+0x64/0x80
> > [0.76]  nvme_reset_work+0x13de/0x1570
> > [0.76]  ? __switch_to_asm+0x34/0x70
> > [0.76]  ? __switch_to_asm+0x40/0x70
> > [0.76]  ? _raw_spin_unlock_irq+0x15/0x30
> > [0.76]  ? finish_task_switch+0x156/0x210
> > [0.76]  process_one_work+0x20c/0x3d0
> > [0.76]  worker_thread+0x216/0x400
> > [0.76]  kthread+0x125/0x130
> > [0.76]  ? process_one_work+0x3d0/0x3d0
> > [0.76]  ? __kthread_bind_mask+0x60/0x60
> > [0.76]  ret_from_fork+0x3a/0x50
> 
> Hello Alexandru,
> 
> What made you look at cgroups? In the above register dump I see that %rbx == 
> NULL.
> I think that means that the queue pointer argument of blk_queue_flag_set() is 
> NULL.
> The NVMe initiator driver should never pass a NULL pointer to 
> blk_mq_quiesce_queue().
> Please ask the NVMe driver maintainers for their opinion on the linux-nvme 
> mailing
> list.
> 
> Thanks,
> 
> Bart.

The %rbx == NULL is only a symptom of the cgroup mishandling.  Perhaps we
could improve the error handling in the NVMe driver, but I could say that
about a lot of block drivers; perhaps I will write some patches in the
future to improve the error handling.

But the root cause of it is in blkcg_init_queue() when blkg_create() returns
an ERR ptr, because it tries to insert into a populated index into 
blkcg->blkg_tree,
the entry that we fail to remove at __blk_release_queue().

Thanks,
../Alex


> 
> 
> 


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-11 Thread Martin Steigerwald
Bart Van Assche - 11.04.18, 14:50:
> On Tue, 2018-04-10 at 14:54 -0700, t...@kernel.org wrote:
> > Ah, yeah, I was moving it out of add_timer but forgot to actually add
> > it to the issue path.  Fixed patch below.
> > 
> > BTW, no matter what we do w/ the request handover between normal and
> > timeout paths, we'd need something similar.  Otherwise, while we can
> > reduce the window, we can't get rid of it.
> 
> (+Martin Steigerwald)
[…]
> Thank you for having shared this patch. It looks interesting to me.
> What I know about the blk-mq timeout handling is as follows:
> * Nobody who has reviewed the blk-mq timeout handling code with this
> patch applied has reported any shortcomings for that code.
> * However, several people have reported kernel crashes that disappear
> when the blk-mq timeout code is reworked. I'm referring to "nvme-rdma
> corrupts memory upon timeout"
>  
> (http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848
> .html) and also to a "RIP: scsi_times_out+0x17" crash during boot
> (https://bugzilla.kernel.org/show_bug.cgi?id=199077).

Yes, with the three patches:

- '[PATCH] blk-mq_Directly schedule q->timeout_work when aborting a 
request.mbox'

- '[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair into 
rcu_read_{lock,unlock}().mbox'

- '[PATCH v4] blk-mq_Fix race conditions in request timeout 
handling.mbox'

the occasional hangs on some boots / resumes from hibernation appear to 
be gone.

Also it appears that the error loading SMART data issue is gone as well 
(see my bug report). However it is still too early to say for sure. I 
think I need at least 2-3 days of additional testing with this kernel to 
be sufficiently sure about it.

However… I could also test another patch, but from reading the rest of 
this thread so far it is not clear to me whether to try one of the new 
patches, and if so which one, and whether to add it on top of the patches 
I already applied or to use it as a replacement for them.

So while giving a training this week and next I can apply a patch here 
and there, but I won't have much time to read the complete discussion to 
figure out what to apply.

Personally, as a stable kernel has been released with those issues, I 
think it's good to fix them up soon. On the other hand it may take quite 
some time until popular distros carry 4.16 for regular users. And I have 
no idea how frequent the reported issues are, i.e. how many users would 
be affected.

Thanks,
-- 
Martin




Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread t...@kernel.org
Hello,

On Wed, Apr 11, 2018 at 05:26:12PM +, Bart Van Assche wrote:
> Please explain what you wrote further. It's not clear to me why you think
> that anything is broken nor what a "sever model" is.

So, the sever (or revoke) model is where you actively disconnect an object
and ensure that there'll be no further references from others.  In
contrast, what we usually do is refcounting, where we don't actively
sever the dying object from its users but let the users drain
themselves over time and destroy the object when we know all the users
are gone (refcount reaching zero).
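
As a toy userspace illustration of the refcounting model (made-up names,
not kernel code), nothing ever disconnects the users; the object simply
goes away with the last put:

#include <stdatomic.h>
#include <stdlib.h>

struct obj {
	atomic_int refcnt;
	int payload;
};

static struct obj *obj_get(struct obj *o)
{
	atomic_fetch_add(&o->refcnt, 1);
	return o;
}

static void obj_put(struct obj *o)
{
	/* last user gone: only now is it safe to destroy the object */
	if (atomic_fetch_sub(&o->refcnt, 1) == 1)
		free(o);
}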

> I think we really need the blkcg_exit_queue() call in blk_cleanup_queue()
> to avoid race conditions between request queue cleanup and the block cgroup
> controller. Additionally, since it is guaranteed that no new requests will
> be submitted to a queue after it has been marked dead I don't see why it
> would make sense to keep the association between a request queue and the
> blkcg controller until the last reference on a queue is dropped.

Sure, new requests aren't the only source though.  E.g. there can be
derefs through sysfs / cgroupfs or writeback attempts on dead devices.
If you want to implement sever, you have to hunt down all of those and
make sure they can't make any further derefs.
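
For contrast, a sever/revoke style would need every one of those paths to
cooperate, roughly along these (again made-up, userspace) lines:

#include <pthread.h>
#include <stdbool.h>
#include <stdlib.h>

struct sobj {
	pthread_mutex_t lock;
	bool revoked;
	int *payload;
};

static int sobj_read(struct sobj *o)
{
	int val = -1;

	pthread_mutex_lock(&o->lock);
	if (!o->revoked)		/* every user has to check this */
		val = *o->payload;
	pthread_mutex_unlock(&o->lock);
	return val;
}

static void sobj_revoke(struct sobj *o)
{
	pthread_mutex_lock(&o->lock);
	o->revoked = true;		/* sever: no further derefs allowed */
	free(o->payload);
	o->payload = NULL;
	pthread_mutex_unlock(&o->lock);
}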

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Bart Van Assche
On Wed, 2018-04-11 at 10:15 -0700, t...@kernel.org wrote:
> On Wed, Apr 11, 2018 at 05:06:41PM +, Bart Van Assche wrote:
> > A simple and effective solution is to dissociate a request queue from the
> > block cgroup controller before blk_cleanup_queue() returns. This is why 
> > commit
> > a063057d7c73 ("block: Fix a race between request queue removal and the block
> > cgroup controller") moved the blkcg_exit_queue() call from 
> > __blk_release_queue()
> > into blk_cleanup_queue().
> 
> which is broken.  We can try to switch the lifetime model to revoking
> all live objects but that likely is a lot more involved than blindly
> moving blkg shootdown from release to cleanup.  Implementing sever
> semantics is usually a lot more involved / fragile because it requires
> explicit participation from all users (exactly the same way revoking
> ->queue_lock is difficult).
> 
> I'm not necessarily against switching to sever model, but what the
> patch did isn't that.  It just moved some code without actually
> understanding or auditing what the implications are.

Hello Tejun,

Please explain what you wrote further. It's not clear to me why you think
that anything is broken nor what a "sever model" is.

I think we really need the blkcg_exit_queue() call in blk_cleanup_queue()
to avoid race conditions between request queue cleanup and the block cgroup
controller. Additionally, since it is guaranteed that no new requests will
be submitted to a queue after it has been marked dead I don't see why it
would make sense to keep the association between a request queue and the
blkcg controller until the last reference on a queue is dropped.

Bart.





Re: sr: get/drop reference to device in revalidate and check_events

2018-04-11 Thread Jan Kara
On Wed 11-04-18 10:23:30, Jens Axboe wrote:
> We can't just use scsi_cd() to get the scsi_cd structure, we have
> to grab a live reference to the device. For both callbacks, we're
> not inside an open where we already hold a reference to the device.
> 
> This fixes device removal/addition under concurrent device access,
> which otherwise could result in the below oops.
> 
> NULL pointer dereference at 0010
> PGD 0 P4D 0 
> Oops:  [#1] PREEMPT SMP
> Modules linked in:
> sr 12:0:0:0: [sr2] scsi-1 drive
>  scsi_debug crc_t10dif crct10dif_generic crct10dif_common nvme nvme_core 
> sb_edac xl
> sr 12:0:0:0: Attached scsi CD-ROM sr2
>  sr_mod cdrom btrfs xor zstd_decompress zstd_compress xxhash lzo_compress 
> zlib_defc
> sr 12:0:0:0: Attached scsi generic sg7 type 5
>  igb ahci libahci i2c_algo_bit libata dca [last unloaded: crc_t10dif]
> CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ #650
> Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016
> RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod]
> RSP: 0018:883ff357bb58 EFLAGS: 00010292
> RAX: a00b07d0 RBX: 883ff3058000 RCX: 883ff357bb66
> RDX: 0003 RSI: 7530 RDI: 881fea631000
> RBP:  R08: 881fe4d38400 R09: 
> R10:  R11: 01b6 R12: 085d
> R13: 085d R14: 883ffd9b3790 R15: 
> FS:  7f7dc8e6d8c0() GS:883fff34() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 0010 CR3: 003ffda98005 CR4: 003606e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  ? __invalidate_device+0x48/0x60
>  check_disk_change+0x4c/0x60
>  sr_block_open+0x16/0xd0 [sr_mod]
>  __blkdev_get+0xb9/0x450
>  ? iget5_locked+0x1c0/0x1e0
>  blkdev_get+0x11e/0x320
>  ? bdget+0x11d/0x150
>  ? _raw_spin_unlock+0xa/0x20
>  ? bd_acquire+0xc0/0xc0
>  do_dentry_open+0x1b0/0x320
>  ? inode_permission+0x24/0xc0
>  path_openat+0x4e6/0x1420
>  ? cpumask_any_but+0x1f/0x40
>  ? flush_tlb_mm_range+0xa0/0x120
>  do_filp_open+0x8c/0xf0
>  ? __seccomp_filter+0x28/0x230
>  ? _raw_spin_unlock+0xa/0x20
>  ? __handle_mm_fault+0x7d6/0x9b0
>  ? list_lru_add+0xa8/0xc0
>  ? _raw_spin_unlock+0xa/0x20
>  ? __alloc_fd+0xaf/0x160
>  ? do_sys_open+0x1a6/0x230
>  do_sys_open+0x1a6/0x230
>  do_syscall_64+0x5a/0x100
>  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> 
> Signed-off-by: Jens Axboe 

Looks good to me. You can add:

Reviewed-by: Jan Kara 

Honza

> 
> ---
> 
> diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
> index 0cf25d789d05..3f3cb72e0c0c 100644
> --- a/drivers/scsi/sr.c
> +++ b/drivers/scsi/sr.c
> @@ -587,18 +587,28 @@ static int sr_block_ioctl(struct block_device *bdev, 
> fmode_t mode, unsigned cmd,
>  static unsigned int sr_block_check_events(struct gendisk *disk,
> unsigned int clearing)
>  {
> - struct scsi_cd *cd = scsi_cd(disk);
> + unsigned int ret = 0;
> + struct scsi_cd *cd;
>  
> - if (atomic_read(>device->disk_events_disable_depth))
> + cd = scsi_cd_get(disk);
> + if (!cd)
>   return 0;
>  
> - return cdrom_check_events(>cdi, clearing);
> + if (!atomic_read(>device->disk_events_disable_depth))
> + ret = cdrom_check_events(>cdi, clearing);
> +
> + scsi_cd_put(cd);
> + return ret;
>  }
>  
>  static int sr_block_revalidate_disk(struct gendisk *disk)
>  {
> - struct scsi_cd *cd = scsi_cd(disk);
>   struct scsi_sense_hdr sshdr;
> + struct scsi_cd *cd;
> +
> + cd = scsi_cd_get(disk);
> + if (!cd)
> + return -ENXIO;
>  
>   /* if the unit is not ready, nothing more to do */
>   if (scsi_test_unit_ready(cd->device, SR_TIMEOUT, MAX_RETRIES, ))
> @@ -607,6 +617,7 @@ static int sr_block_revalidate_disk(struct gendisk *disk)
>   sr_cd_check(>cdi);
>   get_sectorsize(cd);
>  out:
> + scsi_cd_put(cd);
>   return 0;
>  }
>  
> 
> -- 
> Jens Axboe
> 
-- 
Jan Kara 
SUSE Labs, CR


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread t...@kernel.org
Hello,

On Wed, Apr 11, 2018 at 05:06:41PM +, Bart Van Assche wrote:
> A simple and effective solution is to dissociate a request queue from the
> block cgroup controller before blk_cleanup_queue() returns. This is why commit
> a063057d7c73 ("block: Fix a race between request queue removal and the block
> cgroup controller") moved the blkcg_exit_queue() call from 
> __blk_release_queue()
> into blk_cleanup_queue().

which is broken.  We can try to switch the lifetime model to revoking
all live objects but that likely is a lot more involved than blindly
moving blkg shootdown from release to cleanup.  Implementing sever
semantics is usually a lot more involved / fragile because it requires
explicit participation from all users (exactly the same way revoking
->queue_lock is difficult).

I'm not necessarily against switching to sever model, but what the
patch did isn't that.  It just moved some code without actually
understanding or auditing what the implications are.

Thanks.

-- 
tejun


Re: sr: get/drop reference to device in revalidate and check_events

2018-04-11 Thread Lee Duncan
On 04/11/2018 09:23 AM, Jens Axboe wrote:
> We can't just use scsi_cd() to get the scsi_cd structure, we have
> to grab a live reference to the device. For both callbacks, we're
> not inside an open where we already hold a reference to the device.
> 
> This fixes device removal/addition under concurrent device access,
> which otherwise could result in the below oops.
> 
> NULL pointer dereference at 0010
> PGD 0 P4D 0 
> Oops:  [#1] PREEMPT SMP
> Modules linked in:
> sr 12:0:0:0: [sr2] scsi-1 drive
>  scsi_debug crc_t10dif crct10dif_generic crct10dif_common nvme nvme_core 
> sb_edac xl
> sr 12:0:0:0: Attached scsi CD-ROM sr2
>  sr_mod cdrom btrfs xor zstd_decompress zstd_compress xxhash lzo_compress 
> zlib_defc
> sr 12:0:0:0: Attached scsi generic sg7 type 5
>  igb ahci libahci i2c_algo_bit libata dca [last unloaded: crc_t10dif]
> CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ #650
> Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016
> RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod]
> RSP: 0018:883ff357bb58 EFLAGS: 00010292
> RAX: a00b07d0 RBX: 883ff3058000 RCX: 883ff357bb66
> RDX: 0003 RSI: 7530 RDI: 881fea631000
> RBP:  R08: 881fe4d38400 R09: 
> R10:  R11: 01b6 R12: 085d
> R13: 085d R14: 883ffd9b3790 R15: 
> FS:  7f7dc8e6d8c0() GS:883fff34() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 0010 CR3: 003ffda98005 CR4: 003606e0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  ? __invalidate_device+0x48/0x60
>  check_disk_change+0x4c/0x60
>  sr_block_open+0x16/0xd0 [sr_mod]
>  __blkdev_get+0xb9/0x450
>  ? iget5_locked+0x1c0/0x1e0
>  blkdev_get+0x11e/0x320
>  ? bdget+0x11d/0x150
>  ? _raw_spin_unlock+0xa/0x20
>  ? bd_acquire+0xc0/0xc0
>  do_dentry_open+0x1b0/0x320
>  ? inode_permission+0x24/0xc0
>  path_openat+0x4e6/0x1420
>  ? cpumask_any_but+0x1f/0x40
>  ? flush_tlb_mm_range+0xa0/0x120
>  do_filp_open+0x8c/0xf0
>  ? __seccomp_filter+0x28/0x230
>  ? _raw_spin_unlock+0xa/0x20
>  ? __handle_mm_fault+0x7d6/0x9b0
>  ? list_lru_add+0xa8/0xc0
>  ? _raw_spin_unlock+0xa/0x20
>  ? __alloc_fd+0xaf/0x160
>  ? do_sys_open+0x1a6/0x230
>  do_sys_open+0x1a6/0x230
>  do_syscall_64+0x5a/0x100
>  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> 
> Signed-off-by: Jens Axboe 
> 
> ---
> 
> diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
> index 0cf25d789d05..3f3cb72e0c0c 100644
> --- a/drivers/scsi/sr.c
> +++ b/drivers/scsi/sr.c
> @@ -587,18 +587,28 @@ static int sr_block_ioctl(struct block_device *bdev, 
> fmode_t mode, unsigned cmd,
>  static unsigned int sr_block_check_events(struct gendisk *disk,
> unsigned int clearing)
>  {
> - struct scsi_cd *cd = scsi_cd(disk);
> + unsigned int ret = 0;
> + struct scsi_cd *cd;
>  
> - if (atomic_read(>device->disk_events_disable_depth))
> + cd = scsi_cd_get(disk);
> + if (!cd)
>   return 0;
>  
> - return cdrom_check_events(>cdi, clearing);
> + if (!atomic_read(>device->disk_events_disable_depth))
> + ret = cdrom_check_events(>cdi, clearing);
> +
> + scsi_cd_put(cd);
> + return ret;
>  }
>  
>  static int sr_block_revalidate_disk(struct gendisk *disk)
>  {
> - struct scsi_cd *cd = scsi_cd(disk);
>   struct scsi_sense_hdr sshdr;
> + struct scsi_cd *cd;
> +
> + cd = scsi_cd_get(disk);
> + if (!cd)
> + return -ENXIO;
>  
>   /* if the unit is not ready, nothing more to do */
>   if (scsi_test_unit_ready(cd->device, SR_TIMEOUT, MAX_RETRIES, ))
> @@ -607,6 +617,7 @@ static int sr_block_revalidate_disk(struct gendisk *disk)
>   sr_cd_check(>cdi);
>   get_sectorsize(cd);
>  out:
> + scsi_cd_put(cd);
>   return 0;
>  }
>  
> 

Reviewed-by: Lee Duncan 


RE: [PATCH] block: ratelimite pr_err on IO path

2018-04-11 Thread Elliott, Robert (Persistent Memory)
> -Original Message-
> From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel-
> ow...@vger.kernel.org] On Behalf Of Jack Wang
> Sent: Wednesday, April 11, 2018 8:21 AM
> Subject: [PATCH] block: ratelimite pr_err on IO path
> 
> From: Jack Wang 
...
> - pr_err("%s: ref tag error at location %llu " \
> -"(rcvd %u)\n", iter->disk_name,
> -(unsigned long long)
> -iter->seed, be32_to_cpu(pi->ref_tag));
> + pr_err_ratelimited("%s: ref tag error at "
> +"location %llu (rcvd %u)\n",

Per process/coding-style.rst, you should keep a string like that on
one line even if that exceeds 80 columns:

  Statements longer than 80 columns will be broken into sensible chunks, unless
  exceeding 80 columns significantly increases readability and does not hide
  information. ... However, never break user-visible strings such as
  printk messages, because that breaks the ability to grep for them.
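
Applied to the hunk quoted above, the preferred form would presumably keep
the whole format string on one line, roughly:

	pr_err_ratelimited("%s: ref tag error at location %llu (rcvd %u)\n",
			   iter->disk_name,
			   (unsigned long long)iter->seed,
			   be32_to_cpu(pi->ref_tag));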




Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Bart Van Assche
On Wed, 2018-04-11 at 10:00 -0700, t...@kernel.org wrote:
> On Wed, Apr 11, 2018 at 04:42:55PM +, Bart Van Assche wrote:
> > On Wed, 2018-04-11 at 07:56 -0700, Tejun Heo wrote:
> > > And looking at the change, it looks like the right thing we should
> > > have done is caching @lock on the print_blkg side and when switching
> > > locks make sure both locks are held.  IOW, do the following in
> > > blk_cleanup_queue()
> > > 
> > >   spin_lock_irq(lock);
> > >   if (q->queue_lock != >__queue_lock) {
> > >   if (q->queue_lock != &q->__queue_lock) {
> > >   spin_lock(&q->__queue_lock);
> > >   q->queue_lock = &q->__queue_lock;
> > >   spin_unlock(&q->__queue_lock);
> > >   spin_unlock_irq(lock);
> > > 
> > > Otherwise, there can be two lock holders thinking they have exclusive
> > > access to the request_queue.
> > 
> > I think that's a bad idea. A block driver is allowed to destroy the
> > spinlock it associated with the request queue as soon as blk_cleanup_queue()
> > has finished. If the block cgroup controller would cache a pointer to the
> > block driver spinlock then that could cause the cgroup code to attempt to
> > lock a spinlock after it has been destroyed. I don't think we need that kind
> > of race conditions.
> 
> I see, but that problem is there with or without caching as long as we
> have queue_lock usage which reaches beyond cleanup_queue, right?  Whether
> that user caches the lock for the matching unlock or not doesn't really
> change the situation.
> 
> Short of adding protection around queue_lock switching, I can't think
> of a solution though.  Probably the right thing to do is to add queue
> lock/unlock helpers which are safe to use beyond cleanup_queue.

Hello Tejun,

A simple and effective solution is to dissociate a request queue from the
block cgroup controller before blk_cleanup_queue() returns. This is why commit
a063057d7c73 ("block: Fix a race between request queue removal and the block
cgroup controller") moved the blkcg_exit_queue() call from __blk_release_queue()
into blk_cleanup_queue().

Thanks,

Bart.






Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread t...@kernel.org
Hello,

On Wed, Apr 11, 2018 at 04:42:55PM +, Bart Van Assche wrote:
> On Wed, 2018-04-11 at 07:56 -0700, Tejun Heo wrote:
> > And looking at the change, it looks like the right thing we should
> > have done is caching @lock on the print_blkg side and when switching
> > locks make sure both locks are held.  IOW, do the following in
> > blk_cleanup_queue()
> > 
> > spin_lock_irq(lock);
> > if (q->queue_lock != &q->__queue_lock) {
> > spin_lock(&q->__queue_lock);
> > q->queue_lock = &q->__queue_lock;
> > spin_unlock(&q->__queue_lock);
> > }
> > spin_unlock_irq(lock);
> > 
> > Otherwise, there can be two lock holders thinking they have exclusive
> > access to the request_queue.
> 
> I think that's a bad idea. A block driver is allowed to destroy the
> spinlock it associated with the request queue as soon as blk_cleanup_queue()
> has finished. If the block cgroup controller would cache a pointer to the
> block driver spinlock then that could cause the cgroup code to attempt to
> lock a spinlock after it has been destroyed. I don't think we need that kind
> of race conditions.

I see, but that problem is there with or without caching as long as we
have queue_lock usage which reaches beyond cleanup_queue, right?  Whether
that user caches the lock for the matching unlock or not doesn't really
change the situation.

Short of adding protection around queue_lock switching, I can't think
of a solution though.  Probably the right thing to do is to add queue
lock/unlock helpers which are safe to use beyond cleanup_queue.

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Bart Van Assche
On Wed, 2018-04-11 at 07:56 -0700, Tejun Heo wrote:
> And looking at the change, it looks like the right thing we should
> have done is caching @lock on the print_blkg side and when switching
> locks make sure both locks are held.  IOW, do the following in
> blk_cleanup_queue()
> 
>   spin_lock_irq(lock);
>   if (q->queue_lock != &q->__queue_lock) {
>   spin_lock(&q->__queue_lock);
>   q->queue_lock = &q->__queue_lock;
>   spin_unlock(&q->__queue_lock);
>   }
>   spin_unlock_irq(lock);
> 
> Otherwise, there can be two lock holders thinking they have exclusive
> access to the request_queue.

I think that's a bad idea. A block driver is allowed to destroy the
spinlock it associated with the request queue as soon as blk_cleanup_queue()
has finished. If the block cgroup controller would cache a pointer to the
block driver spinlock then that could cause the cgroup code to attempt to
lock a spinlock after it has been destroyed. I don't think we need that kind
of race conditions.

Bart.





sr: get/drop reference to device in revalidate and check_events

2018-04-11 Thread Jens Axboe
We can't just use scsi_cd() to get the scsi_cd structure, we have
to grab a live reference to the device. For both callbacks, we're
not inside an open where we already hold a reference to the device.

This fixes device removal/addition under concurrent device access,
which otherwise could result in the below oops.

NULL pointer dereference at 0010
PGD 0 P4D 0 
Oops:  [#1] PREEMPT SMP
Modules linked in:
sr 12:0:0:0: [sr2] scsi-1 drive
 scsi_debug crc_t10dif crct10dif_generic crct10dif_common nvme nvme_core 
sb_edac xl
sr 12:0:0:0: Attached scsi CD-ROM sr2
 sr_mod cdrom btrfs xor zstd_decompress zstd_compress xxhash lzo_compress 
zlib_defc
sr 12:0:0:0: Attached scsi generic sg7 type 5
 igb ahci libahci i2c_algo_bit libata dca [last unloaded: crc_t10dif]
CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ #650
Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016
RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod]
RSP: 0018:883ff357bb58 EFLAGS: 00010292
RAX: a00b07d0 RBX: 883ff3058000 RCX: 883ff357bb66
RDX: 0003 RSI: 7530 RDI: 881fea631000
RBP:  R08: 881fe4d38400 R09: 
R10:  R11: 01b6 R12: 085d
R13: 085d R14: 883ffd9b3790 R15: 
FS:  7f7dc8e6d8c0() GS:883fff34() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 0010 CR3: 003ffda98005 CR4: 003606e0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 ? __invalidate_device+0x48/0x60
 check_disk_change+0x4c/0x60
 sr_block_open+0x16/0xd0 [sr_mod]
 __blkdev_get+0xb9/0x450
 ? iget5_locked+0x1c0/0x1e0
 blkdev_get+0x11e/0x320
 ? bdget+0x11d/0x150
 ? _raw_spin_unlock+0xa/0x20
 ? bd_acquire+0xc0/0xc0
 do_dentry_open+0x1b0/0x320
 ? inode_permission+0x24/0xc0
 path_openat+0x4e6/0x1420
 ? cpumask_any_but+0x1f/0x40
 ? flush_tlb_mm_range+0xa0/0x120
 do_filp_open+0x8c/0xf0
 ? __seccomp_filter+0x28/0x230
 ? _raw_spin_unlock+0xa/0x20
 ? __handle_mm_fault+0x7d6/0x9b0
 ? list_lru_add+0xa8/0xc0
 ? _raw_spin_unlock+0xa/0x20
 ? __alloc_fd+0xaf/0x160
 ? do_sys_open+0x1a6/0x230
 do_sys_open+0x1a6/0x230
 do_syscall_64+0x5a/0x100
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Signed-off-by: Jens Axboe 

---

diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c
index 0cf25d789d05..3f3cb72e0c0c 100644
--- a/drivers/scsi/sr.c
+++ b/drivers/scsi/sr.c
@@ -587,18 +587,28 @@ static int sr_block_ioctl(struct block_device *bdev, 
fmode_t mode, unsigned cmd,
 static unsigned int sr_block_check_events(struct gendisk *disk,
  unsigned int clearing)
 {
-   struct scsi_cd *cd = scsi_cd(disk);
+   unsigned int ret = 0;
+   struct scsi_cd *cd;
 
-   if (atomic_read(&cd->device->disk_events_disable_depth))
+   cd = scsi_cd_get(disk);
+   if (!cd)
	return 0;
 
-   return cdrom_check_events(&cd->cdi, clearing);
+   if (!atomic_read(&cd->device->disk_events_disable_depth))
+   ret = cdrom_check_events(&cd->cdi, clearing);
+
+   scsi_cd_put(cd);
+   return ret;
 }
 
 static int sr_block_revalidate_disk(struct gendisk *disk)
 {
-   struct scsi_cd *cd = scsi_cd(disk);
struct scsi_sense_hdr sshdr;
+   struct scsi_cd *cd;
+
+   cd = scsi_cd_get(disk);
+   if (!cd)
+   return -ENXIO;
 
/* if the unit is not ready, nothing more to do */
	if (scsi_test_unit_ready(cd->device, SR_TIMEOUT, MAX_RETRIES, &sshdr))
@@ -607,6 +617,7 @@ static int sr_block_revalidate_disk(struct gendisk *disk)
	sr_cd_check(&cd->cdi);
get_sectorsize(cd);
 out:
+   scsi_cd_put(cd);
return 0;
 }
 

-- 
Jens Axboe



Re: Hotplug

2018-04-11 Thread Jens Axboe
On 4/11/18 10:14 AM, Jan Kara wrote:
> Hello,
> 
> On Wed 11-04-18 08:11:13, Jens Axboe wrote:
>> On 4/11/18 7:58 AM, Jan Kara wrote:
>>> On Tue 10-04-18 11:17:46, Jens Axboe wrote:
 Been running some tests and I keep running into issues with hotplug.
 This looks similar to what Bart posted the other day, but it looks
 more deeply rooted than just having to protect the queue in
 generic_make_request_checks(). The test run is blktests,
 block/001. Current -git doesn't survive it. I've seen at least two
 different oopses, pasted below.

 [  102.163442] NULL pointer dereference at 0010
 [  102.163444] PGD 0 P4D 0 
 [  102.163447] Oops:  [#1] PREEMPT SMP
 [  102.163449] Modules linked in:
 [  102.175540] sr 12:0:0:0: [sr2] scsi-1 drive
 [  102.180112]  scsi_debug crc_t10dif crct10dif_generic crct10dif_common 
 nvme nvme_core sb_edac xl
 [  102.186934] sr 12:0:0:0: Attached scsi CD-ROM sr2
 [  102.191896]  sr_mod cdrom btrfs xor zstd_decompress zstd_compress 
 xxhash lzo_compress zlib_defc
 [  102.197169] sr 12:0:0:0: Attached scsi generic sg7 type 5
 [  102.203475]  igb ahci libahci i2c_algo_bit libata dca [last unloaded: 
 crc_t10dif]
 [  102.203484] CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ 
 #650
 [  102.203487] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 
 11/09/2016
 [  102.350882] RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod]
 [  102.358299] RSP: 0018:883ff357bb58 EFLAGS: 00010292
 [  102.364734] RAX: a00b07d0 RBX: 883ff3058000 RCX: 
 883ff357bb66
 [  102.373220] RDX: 0003 RSI: 7530 RDI: 
 881fea631000
 [  102.381705] RBP:  R08: 881fe4d38400 R09: 
 
 [  102.390185] R10:  R11: 01b6 R12: 
 085d
 [  102.398671] R13: 085d R14: 883ffd9b3790 R15: 
 
 [  102.407156] FS:  7f7dc8e6d8c0() GS:883fff34() 
 knlGS:
 [  102.417138] CS:  0010 DS:  ES:  CR0: 80050033
 [  102.424066] CR2: 0010 CR3: 003ffda98005 CR4: 
 003606e0
 [  102.432545] DR0:  DR1:  DR2: 
 
 [  102.441024] DR3:  DR6: fffe0ff0 DR7: 
 0400
 [  102.449502] Call Trace:
 [  102.452744]  ? __invalidate_device+0x48/0x60
 [  102.458022]  check_disk_change+0x4c/0x60
 [  102.462900]  sr_block_open+0x16/0xd0 [sr_mod]
 [  102.468270]  __blkdev_get+0xb9/0x450
 [  102.472774]  ? iget5_locked+0x1c0/0x1e0
 [  102.477568]  blkdev_get+0x11e/0x320
 [  102.481969]  ? bdget+0x11d/0x150
 [  102.486083]  ? _raw_spin_unlock+0xa/0x20
 [  102.490968]  ? bd_acquire+0xc0/0xc0
 [  102.495368]  do_dentry_open+0x1b0/0x320
 [  102.500159]  ? inode_permission+0x24/0xc0
 [  102.505140]  path_openat+0x4e6/0x1420
 [  102.509741]  ? cpumask_any_but+0x1f/0x40
 [  102.514630]  ? flush_tlb_mm_range+0xa0/0x120
 [  102.519903]  do_filp_open+0x8c/0xf0
 [  102.524305]  ? __seccomp_filter+0x28/0x230
 [  102.529389]  ? _raw_spin_unlock+0xa/0x20
 [  102.534283]  ? __handle_mm_fault+0x7d6/0x9b0
 [  102.539559]  ? list_lru_add+0xa8/0xc0
 [  102.544157]  ? _raw_spin_unlock+0xa/0x20
 [  102.549047]  ? __alloc_fd+0xaf/0x160
 [  102.553549]  ? do_sys_open+0x1a6/0x230
 [  102.558244]  do_sys_open+0x1a6/0x230
 [  102.562742]  do_syscall_64+0x5a/0x100
 [  102.567336]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>>>
>>> Interesting. Thinking out loud: This is cd->device dereference I guess
>>> which means disk->private_data was NULL. That gets set in sr_probe()
>>> together with disk->fops which are certainly set as they must have led us
>>> to the crashing function sr_block_revalidate_disk(). So likely
>>> disk->private_data got already cleared. That happens in sr_kref_release()
>>> and the fact that that function got called means struct scsi_cd went away -
>>> so sr_remove() must have been called as well. That all seems possible like:
>>>
>>> CPU1CPU2
>>> sr_probe()
>>> __blkdev_get()
>>>   disk = bdev_get_gendisk();
>>> 
>>> sr_remove()
>>>   del_gendisk()
>>>   ...
>>>   kref_put(&cd->kref, sr_kref_release);
>>> disk->private_data = NULL;
>>> put_disk(disk);
>>> kfree(cd);
>>>   if (disk->fops->open) {
>>> ret = disk->fops->open(bdev, mode); => sr_block_open
>>>   check_disk_change(bdev);
>>> sr_block_revalidate_disk()
>>>   CRASH
>>>
>>> And I think the problem is in sr_block_revalidate_disk() itself as the
>>> scsi_cd() call is not guaranteed that the caller holds reference to 'cd'
>>> and thus that 'cd' does not disappear under 

Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-11 Thread Israel Rukshin

On 4/11/2018 5:24 PM, Sagi Grimberg wrote:



Ah, yeah, I was moving it out of add_timer but forgot to actually add
it to the issue path.  Fixed patch below.

BTW, no matter what we do w/ the request handover between normal and
timeout paths, we'd need something similar.  Otherwise, while we can
reduce the window, we can't get rid of it.

Thanks.


Just noticed this one, this looks interesting to me as well. Israel,
can you run your test with this patch?


Yes, I just did and it looks good.





---
  block/blk-mq.c |   45 
-

  include/linux/blkdev.h |    2 ++
  2 files changed, 34 insertions(+), 13 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
  hctx_lock(hctx, _idx);
  if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
  __blk_mq_complete_request(rq);
+    else
+    rq->missed_completion = true;
  hctx_unlock(hctx, srcu_idx);
  }
  EXPORT_SYMBOL(blk_mq_complete_request);
@@ -684,6 +686,7 @@ void blk_mq_start_request(struct request
    blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
  blk_add_timer(rq);
+    rq->missed_completion = false;
    write_seqcount_end(>gstate_seq);
  preempt_enable();
@@ -881,7 +884,7 @@ static void blk_mq_check_expired(struct
  }
  }
  -static void blk_mq_timeout_sync_rcu(struct request_queue *q)
+static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool 
clear)

  {
  struct blk_mq_hw_ctx *hctx;
  bool has_rcu = false;
@@ -896,7 +899,8 @@ static void blk_mq_timeout_sync_rcu(stru
  else
  synchronize_srcu(hctx->srcu);
  -    hctx->need_sync_rcu = false;
+    if (clear)
+    hctx->need_sync_rcu = false;
  }
  if (has_rcu)
  synchronize_rcu();
@@ -917,25 +921,37 @@ static void blk_mq_terminate_expired(str
  blk_mq_rq_timed_out(hctx, rq, priv, reserved);
  }
  -static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx,
+static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
  struct request *rq, void *priv, bool reserved)
  {
  /*
   * @rq's timer reset has gone through rcu synchronization and is
   * visible now.  Allow normal completions again by resetting
   * ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
- * there's no memory ordering around ->aborted_gstate making it the
- * only field safe to update.  Let blk_add_timer() clear it later
- * when the request is recycled or times out again.
- *
- * As nothing prevents from completion happening while
- * ->aborted_gstate is set, this may lead to ignored completions
- * and further spurious timeouts.
+ * blk_mq_timeout_reset_cleanup() needs it again and there's no
+ * memory ordering around ->aborted_gstate making it the only field
+ * safe to update.  Let blk_add_timer() clear it later when the
+ * request is recycled or times out again.
   */
  if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
  blk_mq_rq_update_aborted_gstate(rq, 0);
  }
  +static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,
+    struct request *rq, void *priv, bool reserved)
+{
+    /*
+ * @rq is now fully returned to the normal path.  If normal
+ * completion raced timeout handling, execute the missed completion
+ * here.  This is safe because 1. ->missed_completion can no longer
+ * be asserted because nothing is timing out right now and 2. if
+ * ->missed_completion is set, @rq is safe from recycling because
+ * nobody could have completed it.
+ */
+    if ((rq->rq_flags & RQF_MQ_TIMEOUT_RESET) && rq->missed_completion)
+    blk_mq_complete_request(rq);
+}
+
  static void blk_mq_timeout_work(struct work_struct *work)
  {
  struct request_queue *q =
@@ -976,7 +992,7 @@ static void blk_mq_timeout_work(struct w
   * becomes a problem, we can add per-hw_ctx rcu_head and
   * wait in parallel.
   */
-    blk_mq_timeout_sync_rcu(q);
+    blk_mq_timeout_sync_rcu(q, true);
    /* terminate the ones we won */
  blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
@@ -988,9 +1004,12 @@ static void blk_mq_timeout_work(struct w
   * reset racing against recycling.
   */
  if (nr_resets) {
-    blk_mq_timeout_sync_rcu(q);
+    blk_mq_timeout_sync_rcu(q, false);
+    blk_mq_queue_tag_busy_iter(q,
+    blk_mq_timeout_reset_return, NULL);
+    blk_mq_timeout_sync_rcu(q, true);
  blk_mq_queue_tag_busy_iter(q,
-    blk_mq_finish_timeout_reset, NULL);
+    blk_mq_timeout_reset_cleanup, NULL);
  }
  }
  --- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -227,6 +227,8 @@ struct request {
    unsigned int extra_len;    /* length of alignment and padding */
  +    bool missed_completion;
+


Re: Hotplug

2018-04-11 Thread Jan Kara
Hello,

On Wed 11-04-18 08:11:13, Jens Axboe wrote:
> On 4/11/18 7:58 AM, Jan Kara wrote:
> > On Tue 10-04-18 11:17:46, Jens Axboe wrote:
> >> Been running some tests and I keep running into issues with hotplug.
> >> This looks similar to what Bart posted the other day, but it looks
> >> more deeply rooted than just having to protect the queue in
> >> generic_make_request_checks(). The test run is blktests,
> >> block/001. Current -git doesn't survive it. I've seen at least two
> >> different oopses, pasted below.
> >>
> >> [  102.163442] NULL pointer dereference at 0010
> >> [  102.163444] PGD 0 P4D 0 
> >> [  102.163447] Oops:  [#1] PREEMPT SMP
> >> [  102.163449] Modules linked in:
> >> [  102.175540] sr 12:0:0:0: [sr2] scsi-1 drive
> >> [  102.180112]  scsi_debug crc_t10dif crct10dif_generic crct10dif_common 
> >> nvme nvme_core sb_edac xl
> >> [  102.186934] sr 12:0:0:0: Attached scsi CD-ROM sr2
> >> [  102.191896]  sr_mod cdrom btrfs xor zstd_decompress zstd_compress 
> >> xxhash lzo_compress zlib_defc
> >> [  102.197169] sr 12:0:0:0: Attached scsi generic sg7 type 5
> >> [  102.203475]  igb ahci libahci i2c_algo_bit libata dca [last unloaded: 
> >> crc_t10dif]
> >> [  102.203484] CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ 
> >> #650
> >> [  102.203487] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 
> >> 11/09/2016
> >> [  102.350882] RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod]
> >> [  102.358299] RSP: 0018:883ff357bb58 EFLAGS: 00010292
> >> [  102.364734] RAX: a00b07d0 RBX: 883ff3058000 RCX: 
> >> 883ff357bb66
> >> [  102.373220] RDX: 0003 RSI: 7530 RDI: 
> >> 881fea631000
> >> [  102.381705] RBP:  R08: 881fe4d38400 R09: 
> >> 
> >> [  102.390185] R10:  R11: 01b6 R12: 
> >> 085d
> >> [  102.398671] R13: 085d R14: 883ffd9b3790 R15: 
> >> 
> >> [  102.407156] FS:  7f7dc8e6d8c0() GS:883fff34() 
> >> knlGS:
> >> [  102.417138] CS:  0010 DS:  ES:  CR0: 80050033
> >> [  102.424066] CR2: 0010 CR3: 003ffda98005 CR4: 
> >> 003606e0
> >> [  102.432545] DR0:  DR1:  DR2: 
> >> 
> >> [  102.441024] DR3:  DR6: fffe0ff0 DR7: 
> >> 0400
> >> [  102.449502] Call Trace:
> >> [  102.452744]  ? __invalidate_device+0x48/0x60
> >> [  102.458022]  check_disk_change+0x4c/0x60
> >> [  102.462900]  sr_block_open+0x16/0xd0 [sr_mod]
> >> [  102.468270]  __blkdev_get+0xb9/0x450
> >> [  102.472774]  ? iget5_locked+0x1c0/0x1e0
> >> [  102.477568]  blkdev_get+0x11e/0x320
> >> [  102.481969]  ? bdget+0x11d/0x150
> >> [  102.486083]  ? _raw_spin_unlock+0xa/0x20
> >> [  102.490968]  ? bd_acquire+0xc0/0xc0
> >> [  102.495368]  do_dentry_open+0x1b0/0x320
> >> [  102.500159]  ? inode_permission+0x24/0xc0
> >> [  102.505140]  path_openat+0x4e6/0x1420
> >> [  102.509741]  ? cpumask_any_but+0x1f/0x40
> >> [  102.514630]  ? flush_tlb_mm_range+0xa0/0x120
> >> [  102.519903]  do_filp_open+0x8c/0xf0
> >> [  102.524305]  ? __seccomp_filter+0x28/0x230
> >> [  102.529389]  ? _raw_spin_unlock+0xa/0x20
> >> [  102.534283]  ? __handle_mm_fault+0x7d6/0x9b0
> >> [  102.539559]  ? list_lru_add+0xa8/0xc0
> >> [  102.544157]  ? _raw_spin_unlock+0xa/0x20
> >> [  102.549047]  ? __alloc_fd+0xaf/0x160
> >> [  102.553549]  ? do_sys_open+0x1a6/0x230
> >> [  102.558244]  do_sys_open+0x1a6/0x230
> >> [  102.562742]  do_syscall_64+0x5a/0x100
> >> [  102.567336]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> > 
> > Interesting. Thinking out loud: This is cd->device dereference I guess
> > which means disk->private_data was NULL. That gets set in sr_probe()
> > together with disk->fops which are certainly set as they must have led us
> > to the crashing function sr_block_revalidate_disk(). So likely
> > disk->private_data got already cleared. That happens in sr_kref_release()
> > and the fact that that function got called means struct scsi_cd went away -
> > so sr_remove() must have been called as well. That all seems possible like:
> > 
> > CPU1CPU2
> > sr_probe()
> > __blkdev_get()
> >   disk = bdev_get_gendisk();
> > 
> > sr_remove()
> >   del_gendisk()
> >   ...
> >   kref_put(&cd->kref, sr_kref_release);
> > disk->private_data = NULL;
> > put_disk(disk);
> > kfree(cd);
> >   if (disk->fops->open) {
> > ret = disk->fops->open(bdev, mode); => sr_block_open
> >   check_disk_change(bdev);
> > sr_block_revalidate_disk()
> >   CRASH
> > 
> > And I think the problem is in sr_block_revalidate_disk() itself as the
> > scsi_cd() call is not guaranteed that the caller holds reference to 'cd'
> > and thus that 'cd' does not disappear under it. IMHO it needs to use
> > 

Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Bart Van Assche
On Wed, 2018-04-11 at 16:28 +0200, Alexandru Moise wrote:
> [0.76] BUG: unable to handle kernel NULL pointer dereference at 
> 01b4
> [0.763350] Kernel panic - not syncing: Attempted to kill init! 
> exitcode=0x0009
> [0.763350]
> [0.76] PGD 0 P4D 0
> [0.76] Oops:  [#2] PREEMPT SMP
> [0.76] CPU: 0 PID: 6 Comm: kworker/u12:0 Tainted: G  D  
> 4.16.0-ARCH+ #81
> [0.76] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> 1.11.0-20171110_100015-anatol 04/01/20$
> [0.76] Workqueue: nvme-reset-wq nvme_reset_work
> [0.76] RIP: 0010:blk_queue_flag_set+0xf/0x40
> [0.76] RSP: :c91bfcb0 EFLAGS: 00010246
> [0.76] RAX: 88003b698000 RBX:  RCX: 
> 
> [0.76] RDX: 88003b698000 RSI: fff4 RDI: 
> 001c
> [0.76] RBP: c91bfcc0 R08:  R09: 
> 
> [0.76] R10: eaeaa980 R11: 814e0970 R12: 
> 001c
> [0.76] R13:  R14:  R15: 
> 88003aad8010
> [0.76] FS:  () GS:88003e40() 
> knlGS:
> [0.76] CS:  0010 DS:  ES:  CR0: 80050033
> [0.76] CR2: 01b4 CR3: 02209001 CR4: 
> 000606f0
> [0.76] Call Trace:
> [0.76]  blk_mq_quiesce_queue+0x23/0x80
> [0.76]  nvme_dev_disable+0x34f/0x480
> [0.76]  ? nvme_irq+0x50/0x50
> [0.76]  ? dev_warn+0x64/0x80
> [0.76]  nvme_reset_work+0x13de/0x1570
> [0.76]  ? __switch_to_asm+0x34/0x70
> [0.76]  ? __switch_to_asm+0x40/0x70
> [0.76]  ? _raw_spin_unlock_irq+0x15/0x30
> [0.76]  ? finish_task_switch+0x156/0x210
> [0.76]  process_one_work+0x20c/0x3d0
> [0.76]  worker_thread+0x216/0x400
> [0.76]  kthread+0x125/0x130
> [0.76]  ? process_one_work+0x3d0/0x3d0
> [0.76]  ? __kthread_bind_mask+0x60/0x60
> [0.76]  ret_from_fork+0x3a/0x50

Hello Alexandru,

What made you look at cgroups? In the above register dump I see that %rbx == 
NULL.
I think that means that the queue pointer argument of blk_queue_flag_set() is 
NULL.
The NVMe initiator driver should never pass a NULL pointer to 
blk_mq_quiesce_queue().
Please ask the NVMe driver maintainers for their opinion on the linux-nvme 
mailing
list.

Thanks,

Bart.





Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Tejun Heo
Hello, again.

On Wed, Apr 11, 2018 at 07:51:23AM -0700, Tejun Heo wrote:
> Oh, it wasn't Joseph's change.  It was Bart's fix for a problem
> reported by Joseph.  Bart, a063057d7c73 ("block: Fix a race between
> request queue removal and the block cgroup controller") created a
> regression where a request_queue can be destroyed with blkgs still
> attached.  The original report is..
> 
>  http://lkml.kernel.org/r/20180407102148.ga9...@gmail.com

And looking at the change, it looks like the right thing we should
have done is caching @lock on the print_blkg side and when switching
locks make sure both locks are held.  IOW, do the following in
blk_cleanup_queue()

spin_lock_irq(lock);
	if (q->queue_lock != &q->__queue_lock) {
		spin_lock(&q->__queue_lock);
		q->queue_lock = &q->__queue_lock;
		spin_unlock(&q->__queue_lock);
}
spin_unlock_irq(lock);

Otherwise, there can be two lock holders thinking they have exclusive
access to the request_queue.

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Tejun Heo
Hello,

(cc'ing Bart)

On Wed, Apr 11, 2018 at 07:46:16AM -0700, Tejun Heo wrote:
> Hello,
> 
> On Wed, Apr 11, 2018 at 04:28:59PM +0200, Alexandru Moise wrote:
> > > Ah, that changed recently.  Can you please check out the current
> > > upstream git master?
> > > 
> > Just did, without my patch I see this crash:
> 
> lol I was looking at the old tree, so this is the fix for the new
> breakage introduced by the recent change.  Sorry about the confusion.
> Joseph, can you please take a look?

Oh, it wasn't Joseph's change.  It was Bart's fix for a problem
reported by Joseph.  Bart, a063057d7c73 ("block: Fix a race between
request queue removal and the block cgroup controller") created a
regression where a request_queue can be destroyed with blkgs still
attached.  The original report is..

 http://lkml.kernel.org/r/20180407102148.ga9...@gmail.com

Thanks.

-- 
tejun


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Tejun Heo
Hello,

On Wed, Apr 11, 2018 at 04:28:59PM +0200, Alexandru Moise wrote:
> > Ah, that changed recently.  Can you please check out the current
> > upstream git master?
> > 
> Just did, without my patch I see this crash:

lol I was looking at the old tree, so this is the fix for the new
breakage introduced by the recent change.  Sorry about the confusion.
Joseph, can you please take a look?

Thanks.

-- 
tejun


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-11 Thread t...@kernel.org
Hello,

On Wed, Apr 11, 2018 at 05:24:49PM +0300, Sagi Grimberg wrote:
> 
> >Ah, yeah, I was moving it out of add_timer but forgot to actually add
> >it to the issue path.  Fixed patch below.
> >
> >BTW, no matter what we do w/ the request handover between normal and
> >timeout paths, we'd need something similar.  Otherwise, while we can
> >reduce the window, we can't get rid of it.
> >
> >Thanks.
> 
> Just noticed this one, this looks interesting to me as well. Israel,
> can you run your test with this patch?

The following is the combined patch with one debug message to see
whether missed completions are actually happening.

Thanks.

Index: work/block/blk-mq.c
===
--- work.orig/block/blk-mq.c
+++ work/block/blk-mq.c
@@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
hctx_lock(hctx, _idx);
if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
__blk_mq_complete_request(rq);
+   else
+   rq->missed_completion = true;
hctx_unlock(hctx, srcu_idx);
 }
 EXPORT_SYMBOL(blk_mq_complete_request);
@@ -684,6 +686,7 @@ void blk_mq_start_request(struct request
 
blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);
blk_add_timer(rq);
+   rq->missed_completion = false;
 
write_seqcount_end(>gstate_seq);
preempt_enable();
@@ -818,7 +821,8 @@ struct blk_mq_timeout_data {
unsigned int nr_expired;
 };
 
-static void blk_mq_rq_timed_out(struct request *req, bool reserved)
+static void blk_mq_rq_timed_out(struct blk_mq_hw_ctx *hctx, struct request 
*req,
+   int *nr_resets, bool reserved)
 {
const struct blk_mq_ops *ops = req->q->mq_ops;
enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER;
@@ -833,13 +837,10 @@ static void blk_mq_rq_timed_out(struct r
__blk_mq_complete_request(req);
break;
case BLK_EH_RESET_TIMER:
-   /*
-* As nothing prevents from completion happening while
-* ->aborted_gstate is set, this may lead to ignored
-* completions and further spurious timeouts.
-*/
-   blk_mq_rq_update_aborted_gstate(req, 0);
blk_add_timer(req);
+   req->rq_flags |= RQF_MQ_TIMEOUT_RESET;
+   (*nr_resets)++;
+   hctx->need_sync_rcu = true;
break;
case BLK_EH_NOT_HANDLED:
break;
@@ -876,13 +877,35 @@ static void blk_mq_check_expired(struct
time_after_eq(jiffies, deadline)) {
blk_mq_rq_update_aborted_gstate(rq, gstate);
data->nr_expired++;
-   hctx->nr_expired++;
+   hctx->need_sync_rcu = true;
} else if (!data->next_set || time_after(data->next, deadline)) {
data->next = deadline;
data->next_set = 1;
}
 }
 
+static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
+{
+   struct blk_mq_hw_ctx *hctx;
+   bool has_rcu = false;
+   int i;
+
+   queue_for_each_hw_ctx(q, hctx, i) {
+   if (!hctx->need_sync_rcu)
+   continue;
+
+   if (!(hctx->flags & BLK_MQ_F_BLOCKING))
+   has_rcu = true;
+   else
+   synchronize_srcu(hctx->srcu);
+
+   if (clear)
+   hctx->need_sync_rcu = false;
+   }
+   if (has_rcu)
+   synchronize_rcu();
+}
+
 static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx,
struct request *rq, void *priv, bool reserved)
 {
@@ -895,7 +918,45 @@ static void blk_mq_terminate_expired(str
 */
if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) &&
READ_ONCE(rq->gstate) == rq->aborted_gstate)
-   blk_mq_rq_timed_out(rq, reserved);
+   blk_mq_rq_timed_out(hctx, rq, priv, reserved);
+}
+
+static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
+   struct request *rq, void *priv, bool reserved)
+{
+   /*
+* @rq's timer reset has gone through rcu synchronization and is
+* visible now.  Allow normal completions again by resetting
+* ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
+* blk_mq_timeout_reset_cleanup() needs it again and there's no
+* memory ordering around ->aborted_gstate making it the only field
+* safe to update.  Let blk_add_timer() clear it later when the
+* request is recycled or times out again.
+*/
+   if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
+   blk_mq_rq_update_aborted_gstate(rq, 0);
+}
+
+static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,
+   struct request *rq, void *priv, bool reserved)
+{
+   int cnt = 0;
+
+   /*
+* @rq is now fully returned to the normal path.  If normal
+ 

Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-11 Thread Sagi Grimberg



   static void __blk_mq_requeue_request(struct request *rq)
   {
struct request_queue *q = rq->q;
+   enum mq_rq_state old_state = blk_mq_rq_state(rq);
   
   	blk_mq_put_driver_tag(rq);
   
   	trace_block_rq_requeue(q, rq);

	wbt_requeue(q->rq_wb, &rq->issue_stat);
   
-	if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {

-   blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+   if (old_state != MQ_RQ_IDLE) {
+   if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE))
+   WARN_ON_ONCE(true);
if (q->dma_drain_size && blk_rq_bytes(rq))
rq->nr_phys_segments--;
}


Can you explain why was old_state kept as a local variable?


Hello Sagi,

Since blk_mq_requeue_request() must be called after a request has completed,
the timeout handler will ignore requests that are being requeued. Hence it
is safe in this function to cache the request state in a local variable.


I understand why it is safe, I just didn't understand why it is needed.


+static inline bool blk_mq_change_rq_state(struct request *rq,
+ enum mq_rq_state old_state,
+ enum mq_rq_state new_state)
   {
-   u64 old_val = READ_ONCE(rq->gstate);
-   u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state;
-
-   if (state == MQ_RQ_IN_FLIGHT) {
-   WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
-   new_val += MQ_RQ_GEN_INC;
-   }
+   unsigned long old_val = (READ_ONCE(rq->__deadline) & ~RQ_STATE_MASK) |
+   old_state;
+   unsigned long new_val = (old_val & ~RQ_STATE_MASK) | new_state;
   
-	/* avoid exposing interim values */

-   WRITE_ONCE(rq->gstate, new_val);
+   return cmpxchg(&rq->__deadline, old_val, new_val) == old_val;
   }


Can you explain why this takes the old_state of the request?


Can you clarify your question? The purpose of this function is to change
the request state only into @new_state if it matches @old_state. I think
that is also what the implementation of this function does.


I misread the documentation of this. never mind. thanks.
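
(Editor's illustration, not kernel code and not part of the thread: a small
userspace analogue of the helper discussed above, showing how the state lives
in the low bits of one word and a transition succeeds only if the caller's
expected old state is still in place. All names here are made up for the demo.)

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define RQ_STATE_BITS	2UL
#define RQ_STATE_MASK	((1UL << RQ_STATE_BITS) - 1)

enum mq_rq_state { MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT, MQ_RQ_COMPLETE };

struct fake_rq {
	_Atomic unsigned long __deadline;	/* deadline in high bits, state in low 2 bits */
};

static bool change_rq_state(struct fake_rq *rq, enum mq_rq_state old_state,
			    enum mq_rq_state new_state)
{
	unsigned long old_val = (atomic_load(&rq->__deadline) & ~RQ_STATE_MASK) |
				old_state;
	unsigned long new_val = (old_val & ~RQ_STATE_MASK) | new_state;

	/* fails (returns false) if someone else changed the word meanwhile */
	return atomic_compare_exchange_strong(&rq->__deadline, &old_val, new_val);
}

int main(void)
{
	struct fake_rq rq = { (1000UL << RQ_STATE_BITS) | MQ_RQ_IDLE };

	printf("idle -> in_flight: %d\n",
	       change_rq_state(&rq, MQ_RQ_IDLE, MQ_RQ_IN_FLIGHT));
	printf("idle -> complete:  %d\n",	/* fails: state is now IN_FLIGHT */
	       change_rq_state(&rq, MQ_RQ_IDLE, MQ_RQ_COMPLETE));
	return 0;
}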


Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Alexandru Moise
On Wed, Apr 11, 2018 at 07:20:19AM -0700, Tejun Heo wrote:
> Hello,
> 
> On Wed, Apr 11, 2018 at 12:12:56PM +0200, Alexandru Moise wrote:
> > > But we already do this through calling blkcg_exit_queue() from
> > > __blk_release_queue().  What's missing?
> > 
> > Hi,
> > 
> > It might be the jetlag but I can't see how you end up calling
> > blkcg_exit_queue() from __blk_release_queue().
> > 
> > As I see it the only way to reach blkcg_exit_queue() is from
> > blk_cleanup_queue(), which I don't see anywhere in __blk_release_queue().
> > 
> > I suspect that I'm just fixing a corner case though and
> > the general case is what you describe or similar.
> 
> Ah, that changed recently.  Can you please check out the current
> upstream git master?
> 
> Thanks.
> 
Just did, without my patch I see this crash:

[0.75] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 4.16.0-ARCH+ #81   
  [7/1949]
[0.75] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.11.0-20171110_100015-anatol 04/01/204
[0.75] RIP: 0010:pi_init+0x23f/0x2f0
[0.75] RSP: :c9197d90 EFLAGS: 00010246
[0.75] RAX:  RBX: 0020 RCX: 0038
[0.75] RDX: 0001 RSI: 0001 RDI: 
[0.75] RBP: c9197e18 R08:  R09: 
[0.75] R10: eaeda600 R11: 88003b69f164 R12: 82e2d740
[0.75] R13:  R14:  R15: 
[0.75] FS:  () GS:88003e50() 
knlGS:
[0.75] CS:  0010 DS:  ES:  CR0: 80050033
[0.75] CR2: 0008 CR3: 02209001 CR4: 000606e0
[0.75] Call Trace:
[0.75]  pf_init+0x1db/0x3be
[0.75]  ? pcd_init+0x4e8/0x4e8
[0.75]  do_one_initcall+0x9e/0x1b0
[0.75]  ? do_early_param+0x97/0x97
[0.75]  kernel_init_freeable+0x259/0x2fd
[0.75]  ? rest_init+0xd0/0xd0
[0.75]  ? syscall_slow_exit_work+0x1c/0x160
[0.75]  kernel_init+0xe/0x100
[0.75]  ret_from_fork+0x3a/0x50
[0.75] Code: 75 6a 49 8b 06 48 8b 40 78 48 85 c0 74 08 4c 89 f7 e8 46 
76 51 00 83 c3 01 3b 5d a8 7d 0d 49
[0.75] RIP: pi_init+0x23f/0x2f0 RSP: c9197d90
[0.75] CR2: 0008
[0.75] ---[ end trace 12004f267bb8bf7d ]---
[0.76] BUG: unable to handle kernel NULL pointer dereference at 
01b4
[0.763350] Kernel panic - not syncing: Attempted to kill init! 
exitcode=0x0009
[0.763350]
[0.76] PGD 0 P4D 0
[0.76] Oops:  [#2] PREEMPT SMP
[0.76] CPU: 0 PID: 6 Comm: kworker/u12:0 Tainted: G  D  
4.16.0-ARCH+ #81
[0.76] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.11.0-20171110_100015-anatol 04/01/20$
[0.76] Workqueue: nvme-reset-wq nvme_reset_work
[0.76] RIP: 0010:blk_queue_flag_set+0xf/0x40
[0.76] RSP: :c91bfcb0 EFLAGS: 00010246
[0.76] RAX: 88003b698000 RBX:  RCX: 
[0.76] RDX: 88003b698000 RSI: fff4 RDI: 001c
[0.76] RBP: c91bfcc0 R08:  R09: 
[0.76] R10: eaeaa980 R11: 814e0970 R12: 001c
[0.76] R13:  R14:  R15: 88003aad8010
[0.76] FS:  () GS:88003e40() 
knlGS:
[0.76] CS:  0010 DS:  ES:  CR0: 80050033
[0.76] CR2: 01b4 CR3: 02209001 CR4: 000606f0
[0.76] Call Trace:
[0.76]  blk_mq_quiesce_queue+0x23/0x80
[0.76]  nvme_dev_disable+0x34f/0x480
[0.76]  ? nvme_irq+0x50/0x50
[0.76]  ? dev_warn+0x64/0x80
[0.76]  nvme_reset_work+0x13de/0x1570
[0.76]  ? __switch_to_asm+0x34/0x70
[0.76]  ? __switch_to_asm+0x40/0x70
[0.76]  ? _raw_spin_unlock_irq+0x15/0x30
[0.76]  ? finish_task_switch+0x156/0x210
[0.76]  process_one_work+0x20c/0x3d0
[0.76]  worker_thread+0x216/0x400
[0.76]  kthread+0x125/0x130
[0.76]  ? process_one_work+0x3d0/0x3d0
[0.76]  ? __kthread_bind_mask+0x60/0x60
[0.76]  ret_from_fork+0x3a/0x50


With the patch the crash goes away,

Thanks,
../Alex

> -- 
> tejun


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-11 Thread Sagi Grimberg



Ah, yeah, I was moving it out of add_timer but forgot to actually add
it to the issue path.  Fixed patch below.

BTW, no matter what we do w/ the request handover between normal and
timeout paths, we'd need something similar.  Otherwise, while we can
reduce the window, we can't get rid of it.

Thanks.


Just noticed this one, this looks interesting to me as well. Israel,
can you run your test with this patch?



---
  block/blk-mq.c |   45 -
  include/linux/blkdev.h |2 ++
  2 files changed, 34 insertions(+), 13 deletions(-)

--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ
hctx_lock(hctx, &srcu_idx);
if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)
__blk_mq_complete_request(rq);
+   else
+   rq->missed_completion = true;
hctx_unlock(hctx, srcu_idx);
  }
  EXPORT_SYMBOL(blk_mq_complete_request);
@@ -684,6 +686,7 @@ void blk_mq_start_request(struct request
  
  	blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);

blk_add_timer(rq);
+   rq->missed_completion = false;
  
	write_seqcount_end(&rq->gstate_seq);

preempt_enable();
@@ -881,7 +884,7 @@ static void blk_mq_check_expired(struct
}
  }
  
-static void blk_mq_timeout_sync_rcu(struct request_queue *q)

+static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)
  {
struct blk_mq_hw_ctx *hctx;
bool has_rcu = false;
@@ -896,7 +899,8 @@ static void blk_mq_timeout_sync_rcu(stru
else
synchronize_srcu(hctx->srcu);
  
-		hctx->need_sync_rcu = false;

+   if (clear)
+   hctx->need_sync_rcu = false;
}
if (has_rcu)
synchronize_rcu();
@@ -917,25 +921,37 @@ static void blk_mq_terminate_expired(str
blk_mq_rq_timed_out(hctx, rq, priv, reserved);
  }
  
-static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx,

+static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,
struct request *rq, void *priv, bool reserved)
  {
/*
 * @rq's timer reset has gone through rcu synchronization and is
 * visible now.  Allow normal completions again by resetting
 * ->aborted_gstate.  Don't clear RQF_MQ_TIMEOUT_RESET here as
-* there's no memory ordering around ->aborted_gstate making it the
-* only field safe to update.  Let blk_add_timer() clear it later
-* when the request is recycled or times out again.
-*
-* As nothing prevents from completion happening while
-* ->aborted_gstate is set, this may lead to ignored completions
-* and further spurious timeouts.
+* blk_mq_timeout_reset_cleanup() needs it again and there's no
+* memory ordering around ->aborted_gstate making it the only field
+* safe to update.  Let blk_add_timer() clear it later when the
+* request is recycled or times out again.
 */
if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)
blk_mq_rq_update_aborted_gstate(rq, 0);
  }
  
+static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx,

+   struct request *rq, void *priv, bool reserved)
+{
+   /*
+* @rq is now fully returned to the normal path.  If normal
+* completion raced timeout handling, execute the missed completion
+* here.  This is safe because 1. ->missed_completion can no longer
+* be asserted because nothing is timing out right now and 2. if
+* ->missed_completion is set, @rq is safe from recycling because
+* nobody could have completed it.
+*/
+   if ((rq->rq_flags & RQF_MQ_TIMEOUT_RESET) && rq->missed_completion)
+   blk_mq_complete_request(rq);
+}
+
  static void blk_mq_timeout_work(struct work_struct *work)
  {
struct request_queue *q =
@@ -976,7 +992,7 @@ static void blk_mq_timeout_work(struct w
 * becomes a problem, we can add per-hw_ctx rcu_head and
 * wait in parallel.
 */
-   blk_mq_timeout_sync_rcu(q);
+   blk_mq_timeout_sync_rcu(q, true);
  
  		/* terminate the ones we won */

blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired,
@@ -988,9 +1004,12 @@ static void blk_mq_timeout_work(struct w
 * reset racing against recycling.
 */
if (nr_resets) {
-   blk_mq_timeout_sync_rcu(q);
+   blk_mq_timeout_sync_rcu(q, false);
+   blk_mq_queue_tag_busy_iter(q,
+   blk_mq_timeout_reset_return, NULL);
+   blk_mq_timeout_sync_rcu(q, true);
blk_mq_queue_tag_busy_iter(q,
-   blk_mq_finish_timeout_reset, NULL);
+   
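
(The archived copy of the patch is cut off above. As a reading aid, here is an
editor's summary of the sequence blk_mq_timeout_work() performs with the patch
applied; the names come from the quoted hunks, and this is a non-compilable
outline, not code to apply.)

/*
 * 1. blk_mq_check_expired():           mark expired requests by setting
 *                                       ->aborted_gstate
 * 2. blk_mq_timeout_sync_rcu(q, true):  wait out completion paths that may
 *                                       still observe the pre-abort state
 * 3. blk_mq_terminate_expired():        run ->timeout(); BLK_EH_RESET_TIMER
 *                                       cases are only flagged with
 *                                       RQF_MQ_TIMEOUT_RESET and re-timed
 * 4. if (nr_resets):
 *      blk_mq_timeout_sync_rcu(q, false)  - make the re-armed timers visible
 *      blk_mq_timeout_reset_return()      - clear ->aborted_gstate so normal
 *                                           completions work again
 *      blk_mq_timeout_sync_rcu(q, true)
 *      blk_mq_timeout_reset_cleanup()     - complete requests whose normal
 *                                           completion was missed
 *                                           (->missed_completion)
 */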

Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Tejun Heo
Hello,

On Wed, Apr 11, 2018 at 12:12:56PM +0200, Alexandru Moise wrote:
> > But we already do this through calling blkcg_exit_queue() from
> > __blk_release_queue().  What's missing?
> 
> Hi,
> 
> It might be the jetlag but I can't see how you end up calling
> blkcg_exit_queue() from __blk_release_queue().
> 
> As I see it the only way to reach blkcg_exit_queue() is from
> blk_cleanup_queue(), which I don't see anywhere in __blk_release_queue().
> 
> I suspect that I'm just fixing a corner case though and
> the general case is what you describe or similar.

Ah, that changed recently.  Can you please check out the current
upstream git master?

Thanks.

-- 
tejun


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-11 Thread t...@kernel.org
Hello, Bart.

On Wed, Apr 11, 2018 at 12:50:51PM +, Bart Van Assche wrote:
> Thank you for having shared this patch. It looks interesting to me. What I
> know about the blk-mq timeout handling is as follows:
> * Nobody who has reviewed the blk-mq timeout handling code with this patch
>   applied has reported any shortcomings for that code.
> * However, several people have reported kernel crashes that disappear when
>   the blk-mq timeout code is reworked. I'm referring to "nvme-rdma corrupts
>   memory upon timeout"
>   (http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html)
>   and also to a "RIP: scsi_times_out+0x17" crash during boot
>   (https://bugzilla.kernel.org/show_bug.cgi?id=199077).
> 
> So we have the choice between two approaches:
> (1) apply the patch from your previous e-mail and root-cause and fix the
> crashes referred to above.
> (2) apply a patch that makes the crashes reported against v4.16 disappear and
> remove the atomic instructions introduced by such a patch at a later time.
> 
> Since crashes have been reported for kernel v4.16 I think we should follow
> approach (2). That will remove the time pressure from root-causing and fixing
> the crashes reported for the NVMeOF initiator and SCSI initiator drivers.

So, it really bothers me how blindly we're going about this.  It isn't
an insurmountable emergency that we have to adopt whatever solution
passed a couple of tests this minute.  We can debug and root cause
this properly and pick the right solution.  We even have two most
likely causes already analysed and patches proposed, one of them
months ago.  If we wanna change the handover model, let's do that
because the new one is better, not because of vague fear.

Thanks.

-- 
tejun


Re: Hotplug

2018-04-11 Thread Jens Axboe
On 4/11/18 7:58 AM, Jan Kara wrote:
> Hi,
> 
> On Tue 10-04-18 11:17:46, Jens Axboe wrote:
>> Been running some tests and I keep running into issues with hotplug.
>> This looks similar to what Bart posted the other day, but it looks
>> more deeply rooted than just having to protect the queue in
>> generic_make_request_checks(). The test run is blktests,
>> block/001. Current -git doesn't survive it. I've seen at least two
>> different oopses, pasted below.
>>
>> [  102.163442] NULL pointer dereference at 0010
>> [  102.163444] PGD 0 P4D 0 
>> [  102.163447] Oops:  [#1] PREEMPT SMP
>> [  102.163449] Modules linked in:
>> [  102.175540] sr 12:0:0:0: [sr2] scsi-1 drive
>> [  102.180112]  scsi_debug crc_t10dif crct10dif_generic crct10dif_common 
>> nvme nvme_core sb_edac xl
>> [  102.186934] sr 12:0:0:0: Attached scsi CD-ROM sr2
>> [  102.191896]  sr_mod cdrom btrfs xor zstd_decompress zstd_compress xxhash 
>> lzo_compress zlib_defc
>> [  102.197169] sr 12:0:0:0: Attached scsi generic sg7 type 5
>> [  102.203475]  igb ahci libahci i2c_algo_bit libata dca [last unloaded: 
>> crc_t10dif]
>> [  102.203484] CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ #650
>> [  102.203487] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 
>> 11/09/2016
>> [  102.350882] RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod]
>> [  102.358299] RSP: 0018:883ff357bb58 EFLAGS: 00010292
>> [  102.364734] RAX: a00b07d0 RBX: 883ff3058000 RCX: 
>> 883ff357bb66
>> [  102.373220] RDX: 0003 RSI: 7530 RDI: 
>> 881fea631000
>> [  102.381705] RBP:  R08: 881fe4d38400 R09: 
>> 
>> [  102.390185] R10:  R11: 01b6 R12: 
>> 085d
>> [  102.398671] R13: 085d R14: 883ffd9b3790 R15: 
>> 
>> [  102.407156] FS:  7f7dc8e6d8c0() GS:883fff34() 
>> knlGS:
>> [  102.417138] CS:  0010 DS:  ES:  CR0: 80050033
>> [  102.424066] CR2: 0010 CR3: 003ffda98005 CR4: 
>> 003606e0
>> [  102.432545] DR0:  DR1:  DR2: 
>> 
>> [  102.441024] DR3:  DR6: fffe0ff0 DR7: 
>> 0400
>> [  102.449502] Call Trace:
>> [  102.452744]  ? __invalidate_device+0x48/0x60
>> [  102.458022]  check_disk_change+0x4c/0x60
>> [  102.462900]  sr_block_open+0x16/0xd0 [sr_mod]
>> [  102.468270]  __blkdev_get+0xb9/0x450
>> [  102.472774]  ? iget5_locked+0x1c0/0x1e0
>> [  102.477568]  blkdev_get+0x11e/0x320
>> [  102.481969]  ? bdget+0x11d/0x150
>> [  102.486083]  ? _raw_spin_unlock+0xa/0x20
>> [  102.490968]  ? bd_acquire+0xc0/0xc0
>> [  102.495368]  do_dentry_open+0x1b0/0x320
>> [  102.500159]  ? inode_permission+0x24/0xc0
>> [  102.505140]  path_openat+0x4e6/0x1420
>> [  102.509741]  ? cpumask_any_but+0x1f/0x40
>> [  102.514630]  ? flush_tlb_mm_range+0xa0/0x120
>> [  102.519903]  do_filp_open+0x8c/0xf0
>> [  102.524305]  ? __seccomp_filter+0x28/0x230
>> [  102.529389]  ? _raw_spin_unlock+0xa/0x20
>> [  102.534283]  ? __handle_mm_fault+0x7d6/0x9b0
>> [  102.539559]  ? list_lru_add+0xa8/0xc0
>> [  102.544157]  ? _raw_spin_unlock+0xa/0x20
>> [  102.549047]  ? __alloc_fd+0xaf/0x160
>> [  102.553549]  ? do_sys_open+0x1a6/0x230
>> [  102.558244]  do_sys_open+0x1a6/0x230
>> [  102.562742]  do_syscall_64+0x5a/0x100
>> [  102.567336]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
> 
> Interesting. Thinking out loud: This is cd->device dereference I guess
> which means disk->private_data was NULL. That gets set in sr_probe()
> together with disk->fops which are certainly set as they must have led us
> to the crashing function sr_block_revalidate_disk(). So likely
> disk->private_data got already cleared. That happens in sr_kref_release()
> and the fact that that function got called means struct scsi_cd went away -
> so sr_remove() must have been called as well. That all seems possible like:
> 
> CPU1  CPU2
> sr_probe()
>   __blkdev_get()
> disk = bdev_get_gendisk();
> 
> sr_remove()
>   del_gendisk()
>   ...
>   kref_put(&cd->kref, sr_kref_release);
> disk->private_data = NULL;
> put_disk(disk);
> kfree(cd);
> if (disk->fops->open) {
>   ret = disk->fops->open(bdev, mode); => sr_block_open
> check_disk_change(bdev);
>   sr_block_revalidate_disk()
> CRASH
> 
> And I think the problem is in sr_block_revalidate_disk() itself as the
> scsi_cd() call is not guaranteed that the caller holds reference to 'cd'
> and thus that 'cd' does not disappear under it. IMHO it needs to use
> scsi_cd_get() to get struct scsi_cd from gendisk. Am I missing something?

No, I think you are correct; from the revalidate path it should grab/release
a reference. Looks like sr_block_check_events() needs the same treatment.
How about the below?

>> [ 
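
(Jens' patch is cut off in the archive above. For readers, a minimal editor's
sketch of the approach being discussed, assuming the sr driver's
scsi_cd_get()/scsi_cd_put() helpers are used to pin the scsi_cd; this is an
illustration, not the actual patch referenced above.)

static int sr_block_revalidate_disk(struct gendisk *disk)
{
	struct scsi_cd *cd = scsi_cd_get(disk);	/* take a reference, may fail */
	struct scsi_sense_hdr sshdr;

	if (!cd)
		return -ENXIO;

	/* if the unit is not ready, nothing more to do */
	if (scsi_test_unit_ready(cd->device, SR_TIMEOUT, MAX_RETRIES, &sshdr))
		goto out;

	sr_cd_check(&cd->cdi);
	get_sectorsize(cd);
out:
	scsi_cd_put(cd);			/* drop the reference again */
	return 0;
}

sr_block_check_events() would take the same get/put pair around its body.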

Re: Hotplug

2018-04-11 Thread Jens Axboe
On 4/11/18 8:06 AM, Bart Van Assche wrote:
> On Wed, 2018-04-11 at 15:58 +0200, Jan Kara wrote:
>> I'm not really sure where this is crashing and 'Code' line is incomplete to
>> tell me.
> 
> Hello Jan,
> 
> The following patch should fix this crash:
> https://www.mail-archive.com/linux-block@vger.kernel.org/msg20209.html.

Yeah, I forgot the link in my reply, thanks.

-- 
Jens Axboe



Re: Hotplug

2018-04-11 Thread Bart Van Assche
On Wed, 2018-04-11 at 15:58 +0200, Jan Kara wrote:
> I'm not really sure where this is crashing and 'Code' line is incomplete to
> tell me.

Hello Jan,

The following patch should fix this crash:
https://www.mail-archive.com/linux-block@vger.kernel.org/msg20209.html.

Thanks,

Bart.




Re: [PATCH] blk-mq: Revert "blk-mq: reimplement blk_mq_hw_queue_mapped"

2018-04-11 Thread Jens Axboe
On 4/11/18 4:47 AM, Ming Lei wrote:
> This reverts commit 127276c6ce5a30fcc806b7fe53015f4f89b62956.
> 
> When all CPUs of one hw queue become offline, there may still be IOs
> not completed from this hctx. But blk_mq_hw_queue_mapped() is called in
> blk_mq_queue_tag_busy_iter(), which is used for iterating requests in the
> timeout handler, so timeout events will be missed on the inactive hctx and
> requests may never be completed.
> 
> Also the reimplementation of blk_mq_hw_queue_mapped() doesn't match the
> helper's name any more; it should have been named blk_mq_hw_queue_active().
> 
> Other callers of this reimplementation also need further verification.
> 
> So revert this patch now; we can improve hw queue activate/inactivate
> events after adequate research and testing.

Thanks, applied.

-- 
Jens Axboe



Re: Hotplug

2018-04-11 Thread Jan Kara
Hi,

On Tue 10-04-18 11:17:46, Jens Axboe wrote:
> Been running some tests and I keep running into issues with hotplug.
> This looks similar to what Bart posted the other day, but it looks
> more deeply rooted than just having to protect the queue in
> generic_make_request_checks(). The test run is blktests,
> block/001. Current -git doesn't survive it. I've seen at least two
> different oopses, pasted below.
> 
> [  102.163442] NULL pointer dereference at 0010
> [  102.163444] PGD 0 P4D 0 
> [  102.163447] Oops:  [#1] PREEMPT SMP
> [  102.163449] Modules linked in:
> [  102.175540] sr 12:0:0:0: [sr2] scsi-1 drive
> [  102.180112]  scsi_debug crc_t10dif crct10dif_generic crct10dif_common nvme 
> nvme_core sb_edac xl
> [  102.186934] sr 12:0:0:0: Attached scsi CD-ROM sr2
> [  102.191896]  sr_mod cdrom btrfs xor zstd_decompress zstd_compress xxhash 
> lzo_compress zlib_defc
> [  102.197169] sr 12:0:0:0: Attached scsi generic sg7 type 5
> [  102.203475]  igb ahci libahci i2c_algo_bit libata dca [last unloaded: 
> crc_t10dif]
> [  102.203484] CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ #650
> [  102.203487] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 
> 11/09/2016
> [  102.350882] RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod]
> [  102.358299] RSP: 0018:883ff357bb58 EFLAGS: 00010292
> [  102.364734] RAX: a00b07d0 RBX: 883ff3058000 RCX: 
> 883ff357bb66
> [  102.373220] RDX: 0003 RSI: 7530 RDI: 
> 881fea631000
> [  102.381705] RBP:  R08: 881fe4d38400 R09: 
> 
> [  102.390185] R10:  R11: 01b6 R12: 
> 085d
> [  102.398671] R13: 085d R14: 883ffd9b3790 R15: 
> 
> [  102.407156] FS:  7f7dc8e6d8c0() GS:883fff34() 
> knlGS:
> [  102.417138] CS:  0010 DS:  ES:  CR0: 80050033
> [  102.424066] CR2: 0010 CR3: 003ffda98005 CR4: 
> 003606e0
> [  102.432545] DR0:  DR1:  DR2: 
> 
> [  102.441024] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [  102.449502] Call Trace:
> [  102.452744]  ? __invalidate_device+0x48/0x60
> [  102.458022]  check_disk_change+0x4c/0x60
> [  102.462900]  sr_block_open+0x16/0xd0 [sr_mod]
> [  102.468270]  __blkdev_get+0xb9/0x450
> [  102.472774]  ? iget5_locked+0x1c0/0x1e0
> [  102.477568]  blkdev_get+0x11e/0x320
> [  102.481969]  ? bdget+0x11d/0x150
> [  102.486083]  ? _raw_spin_unlock+0xa/0x20
> [  102.490968]  ? bd_acquire+0xc0/0xc0
> [  102.495368]  do_dentry_open+0x1b0/0x320
> [  102.500159]  ? inode_permission+0x24/0xc0
> [  102.505140]  path_openat+0x4e6/0x1420
> [  102.509741]  ? cpumask_any_but+0x1f/0x40
> [  102.514630]  ? flush_tlb_mm_range+0xa0/0x120
> [  102.519903]  do_filp_open+0x8c/0xf0
> [  102.524305]  ? __seccomp_filter+0x28/0x230
> [  102.529389]  ? _raw_spin_unlock+0xa/0x20
> [  102.534283]  ? __handle_mm_fault+0x7d6/0x9b0
> [  102.539559]  ? list_lru_add+0xa8/0xc0
> [  102.544157]  ? _raw_spin_unlock+0xa/0x20
> [  102.549047]  ? __alloc_fd+0xaf/0x160
> [  102.553549]  ? do_sys_open+0x1a6/0x230
> [  102.558244]  do_sys_open+0x1a6/0x230
> [  102.562742]  do_syscall_64+0x5a/0x100
> [  102.567336]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Interesting. Thinking out loud: This is cd->device dereference I guess
which means disk->private_data was NULL. That gets set in sr_probe()
together with disk->fops which are certainly set as they must have led us
to the crashing function sr_block_revalidate_disk(). So likely
disk->private_data got already cleared. That happens in sr_kref_release()
and the fact that that function got called means struct scsi_cd went away -
so sr_remove() must have been called as well. That all seems possible like:

CPU1CPU2
sr_probe()
__blkdev_get()
  disk = bdev_get_gendisk();

sr_remove()
  del_gendisk()
  ...
  kref_put(&cd->kref, sr_kref_release);
disk->private_data = NULL;
put_disk(disk);
kfree(cd);
  if (disk->fops->open) {
ret = disk->fops->open(bdev, mode); => sr_block_open
  check_disk_change(bdev);
sr_block_revalidate_disk()
  CRASH

And I think the problem is in sr_block_revalidate_disk() itself as the
scsi_cd() call is not guaranteed that the caller holds reference to 'cd'
and thus that 'cd' does not disappear under it. IMHO it needs to use
scsi_cd_get() to get struct scsi_cd from gendisk. Am I missing something?

> and this one, similar to Barts:
> 
> [ 4676.738069] NULL pointer dereference at 0154
> [ 4676.738071] PGD 0 P4D 0 
> [ 4676.738073] Oops:  [#1] PREEMPT SMP
> [ 4676.738075] Modules linked in: scsi_debug crc_t10dif nvme nvme_core loop 
> configfs crct10dif_ge]
> [ 4676.859272] CPU: 10 PID: 16598 Comm: systemd-udevd Not 

Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-11 Thread Bart Van Assche
On Wed, 2018-04-11 at 10:11 +0800, Ming Lei wrote:
> On Tue, Apr 10, 2018 at 03:01:57PM -0600, Bart Van Assche wrote:
> > The blk-mq timeout handling code ignores completions that occur after
> > blk_mq_check_expired() has been called and before blk_mq_rq_timed_out()
> > has reset rq->aborted_gstate. If a block driver timeout handler always
> > returns BLK_EH_RESET_TIMER then the result will be that the request
> > never terminates.
> 
> Under this situation:
> 
> IMO, if this request has been handled by driver's irq handler, and if
> driver's .timeout still returns BLK_EH_RESET_TIMER, it is driver's bug,
> and the correct return value should be BLK_EH_HANDLED.

Maybe. In the virtio-scsi driver I found the following:

/*
 * The host guarantees to respond to each command, although I/O
 * latencies might be higher than on bare metal.  Reset the timer
 * unconditionally to give the host a chance to perform EH.
 */
static enum blk_eh_timer_return virtscsi_eh_timed_out(struct scsi_cmnd *scmnd)
{
return BLK_EH_RESET_TIMER;
}

Bart.
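
(Editor's illustration, not from the thread: the behaviour Ming describes
would look roughly like the following in a hypothetical blk-mq driver's
.timeout handler, where an already-completed command is acknowledged with
BLK_EH_HANDLED instead of unconditionally re-arming the timer. All mydrv_*
names are made up.)

struct mydrv_cmd {
	bool done;				/* set by the driver's irq handler */
};

static void mydrv_start_recovery(struct mydrv_cmd *cmd);	/* hypothetical */

static enum blk_eh_timer_return mydrv_timeout(struct request *rq, bool reserved)
{
	struct mydrv_cmd *cmd = blk_mq_rq_to_pdu(rq);

	if (READ_ONCE(cmd->done))	/* completion already seen by the irq handler */
		return BLK_EH_HANDLED;	/* let blk-mq finish the request now */

	mydrv_start_recovery(cmd);	/* otherwise start abort/reset handling */
	return BLK_EH_RESET_TIMER;	/* and give recovery time to run */
}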





Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-11 Thread Bart Van Assche
On Wed, 2018-04-11 at 16:19 +0300, Sagi Grimberg wrote:
> >   static void __blk_mq_requeue_request(struct request *rq)
> >   {
> > struct request_queue *q = rq->q;
> > +   enum mq_rq_state old_state = blk_mq_rq_state(rq);
> >   
> > blk_mq_put_driver_tag(rq);
> >   
> > trace_block_rq_requeue(q, rq);
> > 	wbt_requeue(q->rq_wb, &rq->issue_stat);
> >   
> > -   if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {
> > -   blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
> > +   if (old_state != MQ_RQ_IDLE) {
> > +   if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE))
> > +   WARN_ON_ONCE(true);
> > if (q->dma_drain_size && blk_rq_bytes(rq))
> > rq->nr_phys_segments--;
> > }
> 
> Can you explain why was old_state kept as a local variable?

Hello Sagi,

Since blk_mq_requeue_request() must be called after a request has completed,
the timeout handler will ignore requests that are being requeued. Hence it
is safe in this function to cache the request state in a local variable.

> > +static inline bool blk_mq_change_rq_state(struct request *rq,
> > + enum mq_rq_state old_state,
> > + enum mq_rq_state new_state)
> >   {
> > -   u64 old_val = READ_ONCE(rq->gstate);
> > -   u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state;
> > -
> > -   if (state == MQ_RQ_IN_FLIGHT) {
> > -   WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
> > -   new_val += MQ_RQ_GEN_INC;
> > -   }
> > +   unsigned long old_val = (READ_ONCE(rq->__deadline) & ~RQ_STATE_MASK) |
> > +   old_state;
> > +   unsigned long new_val = (old_val & ~RQ_STATE_MASK) | new_state;
> >   
> > -   /* avoid exposing interim values */
> > -   WRITE_ONCE(rq->gstate, new_val);
> > +   return cmpxchg(&rq->__deadline, old_val, new_val) == old_val;
> >   }
> 
> Can you explain why this takes the old_state of the request?

Can you clarify your question? The purpose of this function is to change
the request state only into @new_state if it matches @old_state. I think
that is also what the implementation of this function does.

Thanks,

Bart.






Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7

2018-04-11 Thread Sagi Grimberg



diff --git a/block/blk-mq.c b/block/blk-mq.c
index 75336848f7a7..81ced3096433 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct
request_queue *q,
 return ERR_PTR(-EXDEV);
 }
 cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
+   if (cpu >= nr_cpu_ids) {
+   pr_warn("no online cpu for hctx %d\n", hctx_idx);
+   cpu = cpumask_first(alloc_data.hctx->cpumask);
+   }


We may do it this way for the special case, but it is ugly, IMO.


Christoph?


[PATCH] block: ratelimite pr_err on IO path

2018-04-11 Thread Jack Wang
From: Jack Wang 

This avoids the soft lockup below:
[ 2328.328429] Call Trace:
[ 2328.328433]  vprintk_emit+0x229/0x2e0
[ 2328.328436]  ? t10_pi_type3_verify_ip+0x20/0x20
[ 2328.328437]  printk+0x52/0x6e
[ 2328.328439]  t10_pi_verify+0x9e/0xf0
[ 2328.328441]  bio_integrity_process+0x12e/0x220
[ 2328.328442]  ? t10_pi_type1_verify_crc+0x20/0x20
[ 2328.328443]  bio_integrity_verify_fn+0xde/0x140
[ 2328.328447]  process_one_work+0x13f/0x370
[ 2328.328449]  worker_thread+0x62/0x3d0
[ 2328.328450]  ? rescuer_thread+0x2f0/0x2f0
[ 2328.328452]  kthread+0x116/0x150
[ 2328.328454]  ? __kthread_parkme+0x70/0x70
[ 2328.328457]  ret_from_fork+0x35/0x40

Signed-off-by: Jack Wang 
---
 block/t10-pi.c | 20 
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/block/t10-pi.c b/block/t10-pi.c
index a98db38..35967d5 100644
--- a/block/t10-pi.c
+++ b/block/t10-pi.c
@@ -84,10 +84,12 @@ static blk_status_t t10_pi_verify(struct blk_integrity_iter 
*iter,
 
if (be32_to_cpu(pi->ref_tag) !=
lower_32_bits(iter->seed)) {
-   pr_err("%s: ref tag error at location %llu " \
-  "(rcvd %u)\n", iter->disk_name,
-  (unsigned long long)
-  iter->seed, be32_to_cpu(pi->ref_tag));
+   pr_err_ratelimited("%s: ref tag error at "
+  "location %llu (rcvd %u)\n",
+  iter->disk_name,
+  (unsigned long long)
+  iter->seed,
+  be32_to_cpu(pi->ref_tag));
return BLK_STS_PROTECTION;
}
break;
@@ -101,10 +103,12 @@ static blk_status_t t10_pi_verify(struct 
blk_integrity_iter *iter,
csum = fn(iter->data_buf, iter->interval);
 
if (pi->guard_tag != csum) {
-   pr_err("%s: guard tag error at sector %llu " \
-  "(rcvd %04x, want %04x)\n", iter->disk_name,
-  (unsigned long long)iter->seed,
-  be16_to_cpu(pi->guard_tag), be16_to_cpu(csum));
+   pr_err_ratelimited("%s: guard tag error at sector %llu "
+  "(rcvd %04x, want %04x)\n",
+  iter->disk_name,
+  (unsigned long long)iter->seed,
+  be16_to_cpu(pi->guard_tag),
+  be16_to_cpu(csum));
return BLK_STS_PROTECTION;
}
 
-- 
2.7.4
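
(Editor's note: pr_err_ratelimited() keeps a per-call-site ratelimit state,
roughly equivalent to the sketch below, so each verify failure path can emit
at most a burst of messages per interval instead of one line per bad sector,
which is what was driving the soft lockup in the trace above. The macro name
here is made up; the pattern follows the printk_ratelimited() helper.)

#define my_pr_err_ratelimited(fmt, ...)					\
({									\
	static DEFINE_RATELIMIT_STATE(_rs,				\
				      DEFAULT_RATELIMIT_INTERVAL,	\
				      DEFAULT_RATELIMIT_BURST);		\
									\
	if (__ratelimit(&_rs))						\
		pr_err(fmt, ##__VA_ARGS__);				\
})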



Re: [PATCH] blk-mq: Revert "blk-mq: reimplement blk_mq_hw_queue_mapped"

2018-04-11 Thread Sagi Grimberg

Reviewed-by: Sagi Grimberg 


Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-11 Thread Sagi Grimberg



  static void __blk_mq_requeue_request(struct request *rq)
  {
struct request_queue *q = rq->q;
+   enum mq_rq_state old_state = blk_mq_rq_state(rq);
  
  	blk_mq_put_driver_tag(rq);
  
  	trace_block_rq_requeue(q, rq);

	wbt_requeue(q->rq_wb, &rq->issue_stat);
  
-	if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) {

-   blk_mq_rq_update_state(rq, MQ_RQ_IDLE);
+   if (old_state != MQ_RQ_IDLE) {
+   if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE))
+   WARN_ON_ONCE(true);
if (q->dma_drain_size && blk_rq_bytes(rq))
rq->nr_phys_segments--;
}


Can you explain why was old_state kept as a local variable?


+static inline bool blk_mq_change_rq_state(struct request *rq,
+ enum mq_rq_state old_state,
+ enum mq_rq_state new_state)
  {
-   u64 old_val = READ_ONCE(rq->gstate);
-   u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state;
-
-   if (state == MQ_RQ_IN_FLIGHT) {
-   WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE);
-   new_val += MQ_RQ_GEN_INC;
-   }
+   unsigned long old_val = (READ_ONCE(rq->__deadline) & ~RQ_STATE_MASK) |
+   old_state;
+   unsigned long new_val = (old_val & ~RQ_STATE_MASK) | new_state;
  
-	/* avoid exposing interim values */

-   WRITE_ONCE(rq->gstate, new_val);
+   return cmpxchg(&rq->__deadline, old_val, new_val) == old_val;
  }


Can you explain why this takes the old_state of the request?

Otherwise this looks good to me,

Reviewed-by: Sagi Grimberg 


Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

2018-04-11 Thread Sagi Grimberg



The blk-mq timeout handling code ignores completions that occur after
blk_mq_check_expired() has been called and before blk_mq_rq_timed_out()
has reset rq->aborted_gstate. If a block driver timeout handler always
returns BLK_EH_RESET_TIMER then the result will be that the request
never terminates.


Under this situation:

IMO, if this request has been handled by driver's irq handler, and if
driver's .timeout still returns BLK_EH_RESET_TIMER, it is driver's bug,
and the correct return value should be BLK_EH_HANDLED.


Is it possible to guarantee this efficiently if the irq handler
can run concurrently with the timeout handler?


Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling

2018-04-11 Thread Bart Van Assche
On Tue, 2018-04-10 at 14:54 -0700, t...@kernel.org wrote:
> Ah, yeah, I was moving it out of add_timer but forgot to actully add
> it to the issue path.  Fixed patch below.
> 
> BTW, no matter what we do w/ the request handover between normal and
> timeout paths, we'd need something similar.  Otherwise, while we can
> reduce the window, we can't get rid of it.

(+Martin Steigerwald)

Hello Tejun,

Thank you for having shared this patch. It looks interesting to me. What I
know about the blk-mq timeout handling is as follows:
* Nobody who has reviewed the blk-mq timeout handling code with this patch
  applied has reported any shortcomings for that code.
* However, several people have reported kernel crashes that disappear when
  the blk-mq timeout code is reworked. I'm referring to "nvme-rdma corrupts
  memory upon timeout"
  (http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html)
  and also to a "RIP: scsi_times_out+0x17" crash during boot
  (https://bugzilla.kernel.org/show_bug.cgi?id=199077).

So we have the choice between two approaches:
(1) apply the patch from your previous e-mail and root-cause and fix the
crashes referred to above.
(2) apply a patch that makes the crashes reported against v4.16 disappear and
remove the atomic instructions introduced by such a patch at a later time.

Since crashes have been reported for kernel v4.16 I think we should follow
approach (2). That will remove the time pressure from root-causing and fixing
the crashes reported for the NVMeOF initiator and SCSI initiator drivers.

Thanks,

Bart.





[PATCH] blk-mq: Revert "blk-mq: reimplement blk_mq_hw_queue_mapped"

2018-04-11 Thread Ming Lei
This reverts commit 127276c6ce5a30fcc806b7fe53015f4f89b62956.

When all CPUs of one hw queue become offline, there may still be IOs
not completed from this hctx. But blk_mq_hw_queue_mapped() is called in
blk_mq_queue_tag_busy_iter(), which is used for iterating requests in the
timeout handler, so timeout events will be missed on the inactive hctx and
requests may never be completed.

Also the reimplementation of blk_mq_hw_queue_mapped() doesn't match the
helper's name any more; it should have been named blk_mq_hw_queue_active().

Other callers of this reimplementation also need further verification.

So revert this patch now; we can improve hw queue activate/inactivate
events after adequate research and testing.

Cc: Stefan Haberland 
Cc: Christian Borntraeger 
Cc: Christoph Hellwig 
Cc: Sagi Grimberg 
Reported-by: Jens Axboe 
Fixes:  127276c6ce5a30fcc ("blk-mq: reimplement blk_mq_hw_queue_mapped")
Signed-off-by: Ming Lei 
---
 block/blk-mq.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-mq.h b/block/blk-mq.h
index 502af371b83b..88c558f71819 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -181,7 +181,7 @@ static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx 
*hctx)
 
 static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx)
 {
-   return cpumask_first_and(hctx->cpumask, cpu_online_mask) < nr_cpu_ids;
+   return hctx->nr_ctx && hctx->tags;
 }
 
 void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part,
-- 
2.9.5
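
(Editor's reading aid, hypothetical helper names: the two behaviours being
swapped by this revert, side by side. The reverted check returns false once
every CPU mapped to the hctx is offline, so blk_mq_queue_tag_busy_iter()
skips that hctx and its in-flight requests are never visited by the timeout
handler.)

/* reverted ("active") variant: false once all mapped CPUs are offline */
static inline bool hw_queue_has_online_cpu(struct blk_mq_hw_ctx *hctx)
{
	return cpumask_first_and(hctx->cpumask, cpu_online_mask) < nr_cpu_ids;
}

/* restored variant: true for any hctx that was set up with sw ctxs and tags */
static inline bool hw_queue_is_mapped(struct blk_mq_hw_ctx *hctx)
{
	return hctx->nr_ctx && hctx->tags;
}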



Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release

2018-04-11 Thread Alexandru Moise
On Mon, Apr 09, 2018 at 03:09:38PM -0700, Tejun Heo wrote:
> (cc'ing Joseph as he worked on the area recently, hi!)
> 
> Hello,
> 
> On Sat, Apr 07, 2018 at 12:21:48PM +0200, Alexandru Moise wrote:
> > The q->id is used as an index within the blkg_tree radix tree.
> > 
> > If the entry is not released before reclaiming the blk_queue_ida's id
> > blkcg_init_queue() within a different driver from which this id
> > was originally for can fail due to the entry at that index within
> > the radix tree already existing.
> > 
> > Signed-off-by: Alexandru Moise <00moses.alexande...@gmail.com>
> > ---
> > v2: Added no-op for !CONFIG_BLK_CGROUP
> > 
> >  block/blk-cgroup.c | 2 +-
> >  block/blk-sysfs.c  | 4 
> >  include/linux/blk-cgroup.h | 3 +++
> >  3 files changed, 8 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > index 1c16694ae145..224e937dbb59 100644
> > --- a/block/blk-cgroup.c
> > +++ b/block/blk-cgroup.c
> > @@ -369,7 +369,7 @@ static void blkg_destroy(struct blkcg_gq *blkg)
> >   *
> >   * Destroy all blkgs associated with @q.
> >   */
> > -static void blkg_destroy_all(struct request_queue *q)
> > +void blkg_destroy_all(struct request_queue *q)
> >  {
> > struct blkcg_gq *blkg, *n;
> >  
> > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > index d00d1b0ec109..a72866458f22 100644
> > --- a/block/blk-sysfs.c
> > +++ b/block/blk-sysfs.c
> > @@ -816,6 +816,10 @@ static void __blk_release_queue(struct work_struct 
> > *work)
> > if (q->bio_split)
> > bioset_free(q->bio_split);
> >  
> > +   spin_lock_irq(q->queue_lock);
> > +   blkg_destroy_all(q);
> > +   spin_unlock_irq(q->queue_lock);
> > +
> > ida_simple_remove(&blk_queue_ida, q->id);
> > call_rcu(&q->rcu_head, blk_free_queue_rcu);
> 
> But we already do this through calling blkcg_exit_queue() from
> __blk_release_queue().  What's missing?

Hi,

It might be the jetlag but I can't see how you end up calling
blkcg_exit_queue() from __blk_release_queue().

As I see it the only way to reach blkcg_exit_queue() is from
blk_cleanup_queue(), which I don't see anywhere in __blk_release_queue().

I suspect that I'm just fixing a corner case though and
the general case is what you describe or similar.

../Alex
> 
> Thanks.
> 
> -- 
> tejun