Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller
On Wed, Apr 11, 2018 at 07:58:52PM -0600, Bart Van Assche wrote: > Several block drivers call alloc_disk() followed by put_disk() if > something fails before device_add_disk() is called without calling > blk_cleanup_queue(). Make sure that also for this scenario a request > queue is dissociated from the cgroup controller. This patch avoids > that loading the parport_pc, paride and pf drivers triggers the > following kernel crash: Can we move the cleanup to put_disk in general and not just for this case? Having alloc/free routines pair up generally avoids a lot of confusion.
Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER
On Wed, Apr 11, 2018 at 04:19:18PM +0300, Sagi Grimberg wrote: > >> static void __blk_mq_requeue_request(struct request *rq) >> { >> struct request_queue *q = rq->q; >> +enum mq_rq_state old_state = blk_mq_rq_state(rq); >> blk_mq_put_driver_tag(rq); >> trace_block_rq_requeue(q, rq); >> wbt_requeue(q->rq_wb, >issue_stat); >> - if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) { >> -blk_mq_rq_update_state(rq, MQ_RQ_IDLE); >> +if (old_state != MQ_RQ_IDLE) { >> +if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE)) >> +WARN_ON_ONCE(true); >> if (q->dma_drain_size && blk_rq_bytes(rq)) >> rq->nr_phys_segments--; >> } > > Can you explain why was old_state kept as a local variable? As it was me who added this: just to not read it again as no one else can change the state at this point. >> +static inline bool blk_mq_change_rq_state(struct request *rq, >> + enum mq_rq_state old_state, >> + enum mq_rq_state new_state) >> { >> -u64 old_val = READ_ONCE(rq->gstate); >> -u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state; >> - >> -if (state == MQ_RQ_IN_FLIGHT) { >> -WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE); >> -new_val += MQ_RQ_GEN_INC; >> -} >> +unsigned long old_val = (READ_ONCE(rq->__deadline) & ~RQ_STATE_MASK) | >> +old_state; >> +unsigned long new_val = (old_val & ~RQ_STATE_MASK) | new_state; >> - /* avoid exposing interim values */ >> -WRITE_ONCE(rq->gstate, new_val); >> +return cmpxchg(>__deadline, old_val, new_val) == old_val; >> } > > Can you explain why this takes the old_state of the request? > > Otherwise this looks good to me, Because that is how cmpxchg works?
Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller
On Thu, Apr 12, 2018 at 06:20:34AM +0200, Alexandru Moise wrote: > On Wed, Apr 11, 2018 at 07:58:52PM -0600, Bart Van Assche wrote: > > Several block drivers call alloc_disk() followed by put_disk() if > > something fails before device_add_disk() is called without calling > > blk_cleanup_queue(). Make sure that also for this scenario a request > > queue is dissociated from the cgroup controller. This patch avoids > > that loading the parport_pc, paride and pf drivers triggers the > > following kernel crash: > > > > BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride] > > Read of size 4 at addr 0008 by task modprobe/744 > > Call Trace: > > dump_stack+0x9a/0xeb > > kasan_report+0x139/0x350 > > pi_init+0x42e/0x580 [paride] > > pf_init+0x2bb/0x1000 [pf] > > do_one_initcall+0x8e/0x405 > > do_init_module+0xd9/0x2f2 > > load_module+0x3ab4/0x4700 > > SYSC_finit_module+0x176/0x1a0 > > do_syscall_64+0xee/0x2b0 > > entry_SYSCALL_64_after_hwframe+0x42/0xb7 > > > > Reported-by: Alexandru Moise <00moses.alexande...@gmail.com> > > Fixes: a063057d7c73 ("block: Fix a race between request queue removal and > > the block cgroup controller") > > Signed-off-by: Bart Van Assche> > Cc: Alexandru Moise <00moses.alexande...@gmail.com> > > Cc: Tejun Heo > > --- > > block/blk-sysfs.c | 26 ++ > > 1 file changed, 26 insertions(+) > > > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c > > index f8457d6f0190..2e134da78f82 100644 > > --- a/block/blk-sysfs.c > > +++ b/block/blk-sysfs.c > > @@ -788,11 +788,37 @@ static void blk_free_queue_rcu(struct rcu_head > > *rcu_head) > > static void __blk_release_queue(struct work_struct *work) > > { > > struct request_queue *q = container_of(work, typeof(*q), release_work); > > + struct blkcg_gq *gq; > > > > if (test_bit(QUEUE_FLAG_POLL_STATS, >queue_flags)) > > blk_stat_remove_callback(q, q->poll_cb); > > blk_stat_free_callback(q->poll_cb); > > > > + if (!blk_queue_dead(q)) { > > + /* > > +* Last reference was dropped without having called > > +* blk_cleanup_queue(). > > +*/ > > + WARN_ONCE(blk_queue_init_done(q), > > + "request queue %p has been registered but > > blk_cleanup_queue() has not been called for that queue\n", > > + q); > > + bdi_put(q->backing_dev_info); > > + blkcg_exit_queue(q); > > + > > + if (q->elevator) { > > + ioc_clear_queue(q); > > + elevator_exit(q, q->elevator); > > + } > > + } > > + > > + rcu_read_lock(); > > + gq = blkg_lookup(_root, q); > > + rcu_read_unlock(); > > + > > + WARN(gq, > > +"request queue %p is being released but it has not yet been > > removed from the blkcg controller\n", > > +q); > > + > > blk_free_queue_stats(q->stats); > > This solution is good. Thanks for fixing this! > > Tested-by: Alexandru Moise <00moses.alexande...@gmail.com> > > ../Alex Oh, you also need the no-ops for !CONFIG_BLK_CGROUP, also there is no blkcg_root then as well. Thanks, ../Alex > > > > > blk_exit_rl(q, >root_rl); > > -- > > 2.16.2 > >
Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller
On Wed, Apr 11, 2018 at 07:58:52PM -0600, Bart Van Assche wrote: > Several block drivers call alloc_disk() followed by put_disk() if > something fails before device_add_disk() is called without calling > blk_cleanup_queue(). Make sure that also for this scenario a request > queue is dissociated from the cgroup controller. This patch avoids > that loading the parport_pc, paride and pf drivers triggers the > following kernel crash: > > BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride] > Read of size 4 at addr 0008 by task modprobe/744 > Call Trace: > dump_stack+0x9a/0xeb > kasan_report+0x139/0x350 > pi_init+0x42e/0x580 [paride] > pf_init+0x2bb/0x1000 [pf] > do_one_initcall+0x8e/0x405 > do_init_module+0xd9/0x2f2 > load_module+0x3ab4/0x4700 > SYSC_finit_module+0x176/0x1a0 > do_syscall_64+0xee/0x2b0 > entry_SYSCALL_64_after_hwframe+0x42/0xb7 > > Reported-by: Alexandru Moise <00moses.alexande...@gmail.com> > Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the > block cgroup controller") > Signed-off-by: Bart Van Assche> Cc: Alexandru Moise <00moses.alexande...@gmail.com> > Cc: Tejun Heo > --- > block/blk-sysfs.c | 26 ++ > 1 file changed, 26 insertions(+) > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c > index f8457d6f0190..2e134da78f82 100644 > --- a/block/blk-sysfs.c > +++ b/block/blk-sysfs.c > @@ -788,11 +788,37 @@ static void blk_free_queue_rcu(struct rcu_head > *rcu_head) > static void __blk_release_queue(struct work_struct *work) > { > struct request_queue *q = container_of(work, typeof(*q), release_work); > + struct blkcg_gq *gq; > > if (test_bit(QUEUE_FLAG_POLL_STATS, >queue_flags)) > blk_stat_remove_callback(q, q->poll_cb); > blk_stat_free_callback(q->poll_cb); > > + if (!blk_queue_dead(q)) { > + /* > + * Last reference was dropped without having called > + * blk_cleanup_queue(). > + */ > + WARN_ONCE(blk_queue_init_done(q), > + "request queue %p has been registered but > blk_cleanup_queue() has not been called for that queue\n", > + q); > + bdi_put(q->backing_dev_info); > + blkcg_exit_queue(q); > + > + if (q->elevator) { > + ioc_clear_queue(q); > + elevator_exit(q, q->elevator); > + } > + } > + > + rcu_read_lock(); > + gq = blkg_lookup(_root, q); > + rcu_read_unlock(); > + > + WARN(gq, > + "request queue %p is being released but it has not yet been > removed from the blkcg controller\n", > + q); > + > blk_free_queue_stats(q->stats); This solution is good. Thanks for fixing this! Tested-by: Alexandru Moise <00moses.alexande...@gmail.com> ../Alex > > blk_exit_rl(q, >root_rl); > -- > 2.16.2 >
Re: [PATCH] block: Ensure that a request queue is dissociated from the cgroup controller
On Wed, Apr 11, 2018 at 07:58:52PM -0600, Bart Van Assche wrote: > Several block drivers call alloc_disk() followed by put_disk() if > something fails before device_add_disk() is called without calling > blk_cleanup_queue(). Make sure that also for this scenario a request > queue is dissociated from the cgroup controller. This patch avoids > that loading the parport_pc, paride and pf drivers triggers the > following kernel crash: > > BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride] > Read of size 4 at addr 0008 by task modprobe/744 > Call Trace: > dump_stack+0x9a/0xeb > kasan_report+0x139/0x350 > pi_init+0x42e/0x580 [paride] > pf_init+0x2bb/0x1000 [pf] > do_one_initcall+0x8e/0x405 > do_init_module+0xd9/0x2f2 > load_module+0x3ab4/0x4700 > SYSC_finit_module+0x176/0x1a0 > do_syscall_64+0xee/0x2b0 > entry_SYSCALL_64_after_hwframe+0x42/0xb7 > > Reported-by: Alexandru Moise <00moses.alexande...@gmail.com> > Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the > block cgroup controller") > Signed-off-by: Bart Van Assche> Cc: Alexandru Moise <00moses.alexande...@gmail.com> > Cc: Tejun Heo > --- > block/blk-sysfs.c | 26 ++ > 1 file changed, 26 insertions(+) > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c > index f8457d6f0190..2e134da78f82 100644 > --- a/block/blk-sysfs.c > +++ b/block/blk-sysfs.c > @@ -788,11 +788,37 @@ static void blk_free_queue_rcu(struct rcu_head > *rcu_head) > static void __blk_release_queue(struct work_struct *work) > { > struct request_queue *q = container_of(work, typeof(*q), release_work); > + struct blkcg_gq *gq; > > if (test_bit(QUEUE_FLAG_POLL_STATS, >queue_flags)) > blk_stat_remove_callback(q, q->poll_cb); > blk_stat_free_callback(q->poll_cb); > > + if (!blk_queue_dead(q)) { > + /* > + * Last reference was dropped without having called > + * blk_cleanup_queue(). > + */ > + WARN_ONCE(blk_queue_init_done(q), > + "request queue %p has been registered but > blk_cleanup_queue() has not been called for that queue\n", > + q); > + bdi_put(q->backing_dev_info); > + blkcg_exit_queue(q); > + > + if (q->elevator) { > + ioc_clear_queue(q); > + elevator_exit(q, q->elevator); > + } > + } > + > + rcu_read_lock(); > + gq = blkg_lookup(_root, q); > + rcu_read_unlock(); > + > + WARN(gq, > + "request queue %p is being released but it has not yet been > removed from the blkcg controller\n", > + q); > + > blk_free_queue_stats(q->stats); This solution is good. Thanks for fixing this! Tested-by: Alexandru Moise <00moses.alexande...@gmail.com> ../Alex > > blk_exit_rl(q, >root_rl); > -- > 2.16.2 >
Re: [PATCH V2] blk-mq: fix race between complete and BLK_EH_RESET_TIMER
Hi Ming On 04/12/2018 07:38 AM, Ming Lei wrote: > + * > + * Cover complete vs BLK_EH_RESET_TIMER race in slow path with > + * helding queue lock. >*/ > hctx_lock(hctx, _idx); > if (blk_mq_rq_aborted_gstate(rq) != rq->gstate) > __blk_mq_complete_request(rq); > + else { > + unsigned long flags; > + bool need_complete = false; > + > + spin_lock_irqsave(q->queue_lock, flags); > + if (!blk_mq_rq_aborted_gstate(rq)) > + need_complete = true; > + else > + blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE_IN_TIMEOUT); > + spin_unlock_irqrestore(q->queue_lock, flags); What if the .timeout return BLK_EH_HANDLED during this ? timeout context irq context .timeout() blk_mq_complete_request set state to MQ_RQ_COMPLETE_IN_TIMEOUT __blk_mq_complete_request WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT); If further upon blk_mq_free_request, the final freed request maybe changed to MQ_RQ_COMPLETE_IN_TIMEOUT instead of MQ_RQ_IDLE. Thanks Jianchao
[PATCH] block: Ensure that a request queue is dissociated from the cgroup controller
Several block drivers call alloc_disk() followed by put_disk() if something fails before device_add_disk() is called without calling blk_cleanup_queue(). Make sure that also for this scenario a request queue is dissociated from the cgroup controller. This patch avoids that loading the parport_pc, paride and pf drivers triggers the following kernel crash: BUG: KASAN: null-ptr-deref in pi_init+0x42e/0x580 [paride] Read of size 4 at addr 0008 by task modprobe/744 Call Trace: dump_stack+0x9a/0xeb kasan_report+0x139/0x350 pi_init+0x42e/0x580 [paride] pf_init+0x2bb/0x1000 [pf] do_one_initcall+0x8e/0x405 do_init_module+0xd9/0x2f2 load_module+0x3ab4/0x4700 SYSC_finit_module+0x176/0x1a0 do_syscall_64+0xee/0x2b0 entry_SYSCALL_64_after_hwframe+0x42/0xb7 Reported-by: Alexandru Moise <00moses.alexande...@gmail.com> Fixes: a063057d7c73 ("block: Fix a race between request queue removal and the block cgroup controller") Signed-off-by: Bart Van AsscheCc: Alexandru Moise <00moses.alexande...@gmail.com> Cc: Tejun Heo --- block/blk-sysfs.c | 26 ++ 1 file changed, 26 insertions(+) diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c index f8457d6f0190..2e134da78f82 100644 --- a/block/blk-sysfs.c +++ b/block/blk-sysfs.c @@ -788,11 +788,37 @@ static void blk_free_queue_rcu(struct rcu_head *rcu_head) static void __blk_release_queue(struct work_struct *work) { struct request_queue *q = container_of(work, typeof(*q), release_work); + struct blkcg_gq *gq; if (test_bit(QUEUE_FLAG_POLL_STATS, >queue_flags)) blk_stat_remove_callback(q, q->poll_cb); blk_stat_free_callback(q->poll_cb); + if (!blk_queue_dead(q)) { + /* +* Last reference was dropped without having called +* blk_cleanup_queue(). +*/ + WARN_ONCE(blk_queue_init_done(q), + "request queue %p has been registered but blk_cleanup_queue() has not been called for that queue\n", + q); + bdi_put(q->backing_dev_info); + blkcg_exit_queue(q); + + if (q->elevator) { + ioc_clear_queue(q); + elevator_exit(q, q->elevator); + } + } + + rcu_read_lock(); + gq = blkg_lookup(_root, q); + rcu_read_unlock(); + + WARN(gq, +"request queue %p is being released but it has not yet been removed from the blkcg controller\n", +q); + blk_free_queue_stats(q->stats); blk_exit_rl(q, >root_rl); -- 2.16.2
Re: System hung in I/O when booting with sd card
On Thu, 2018-04-12 at 09:48 +0800, Shawn Lin wrote: > I ran into 2 times that my system hung here when booting with a ext4 sd > card. No sure how to reproduce it but it seems doesn't matter with the > ext4 as I see it again with a vfat sd card, this morning with Linus' > master branch. Is it a known issue or any idea how to debug this? Please verify whether the following patch resolves this: https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=for-linus=37f9579f4c31a6d698dbf3016d7bf132f9288d30 Thanks, Bart.
System hung in I/O when booting with sd card
Hi, I ran into 2 times that my system hung here when booting with a ext4 sd card. No sure how to reproduce it but it seems doesn't matter with the ext4 as I see it again with a vfat sd card, this morning with Linus' master branch. Is it a known issue or any idea how to debug this? [ 110.877652] Unable to handle kernel NULL pointer dereference at virtual address 00c8 [ 110.878360] Mem abort info: [ 110.878607] ESR = 0x9606 [ 110.878876] Exception class = DABT (current EL), IL = 32 bits [ 110.879391] SET = 0, FnV = 0 [ 110.879659] EA = 0, S1PTW = 0 [ 110.879957] Data abort info: [ 110.880210] ISV = 0, ISS = 0x0006 [ 110.880545] CM = 0, WnR = 0 [ 110.880807] user pgtable: 4k pages, 48-bit VAs, pgdp = ed0ed92e [ 110.881382] [00c8] pgd=f015e003, pud=f0173003, pmd= [ 110.882143] Internal error: Oops: 9606 [#1] PREEMPT SMP [ 110.882629] Modules linked in: [ 110.882902] CPU: 4 PID: 1777 Comm: ls Not tainted 4.16.0-next-20180410-00013-geabffa6-dirty #315 [ 110.883665] Hardware name: Excavator-RK3399 Board (DT) [ 110.884114] pstate: 4005 (nZcv daif -PAN -UAO) [ 110.884542] pc : percpu_counter_add_batch+0x2c/0x100 [ 110.884977] lr : generic_make_request_checks+0x214/0x478 [ 110.885440] sp : 0c403960 [ 110.885730] x29: 0c403960 x28: 8000efcba6e0 [ 110.886196] x27: 08a81000 x26: 082643f8 [ 110.886662] x25: 08263828 x24: 3000 [ 110.887127] x23: 1000 x22: [ 110.887591] x21: 1000 x20: 00a8 [ 110.888056] x19: 8000f0d7dc00 x18: 0001 [ 110.888521] x17: 004ab8a8 x16: 08241628 [ 110.888986] x15: ff80 x14: 7e0003bbaf80 [ 110.889451] x13: 00c5 x12: 09069b88 [ 110.889916] x11: 0040 x10: 8000efc57dc0 [ 110.890380] x9 : 8000efc57e90 x8 : 8000f0d3b138 [ 110.890844] x7 : x6 : 8000f0d7dc88 [ 110.891309] x5 : 0e56 x4 : 0001 [ 110.891774] x3 : 8000f0cab800 x2 : 3fff [ 110.892239] x1 : 8000edf3 x0 : 00a8 [ 110.892705] Process ls (pid: 1777, stack limit = 0xb3fb6733) [ 110.893258] Call trace: [ 110.893475] percpu_counter_add_batch+0x2c/0x100 [ 110.893879] generic_make_request_checks+0x214/0x478 [ 110.894313] generic_make_request+0x34/0x260 [ 110.894686] submit_bio+0xcc/0x1b0 [ 110.894988] ll_rw_block+0xc0/0x100 [ 110.895295] ext4_bread+0x74/0xc0 [ 110.895587] __ext4_read_dirblock+0x3c/0x2d8 [ 110.895960] htree_dirblock_to_tree+0x70/0x1d8 [ 110.896349] ext4_htree_fill_tree+0xa4/0x2c8 [ 110.896723] ext4_readdir+0x5f4/0x7f0 [ 110.897046] iterate_dir+0x9c/0x1a8 [ 110.897351] ksys_getdents64+0x8c/0x168 [ 110.897687] sys_getdents64+0xc/0x18 [ 110.898001] el0_svc_naked+0x30/0x34 [ 110.898315] Code: b9401064 11000484 b9001064 d538d081 (f9401000) [ 110.898848] ---[ end trace 3b0d37baa3bb2fb3 ]--- [ 110.899260] note: ls[1777] exited with preempt_count 1 [ 110.899852] WARNING: CPU: 4 PID: 1777 at kernel/rcu/tree_plugin.h:330 rcu_note_context_switch+0x30/0x3b8 [ 110.900676] Modules linked in: Se[gmentation fault 110.900947 [root@rockchip:/]# ] CPU: 4 PID: 1777 Comm: ls Tainted: G D 4.16.0-next-20180410-00013-geabffa6-dirty #315 [ 110.902121] Hardware name: Excavator-RK3399 Board (DT) [ 110.902570] pstate: 2085 (nzCv daIf -PAN -UAO) [ 110.902990] pc : rcu_note_context_switch+0x30/0x3b8 [ 110.903417] lr : rcu_note_context_switch+0x1c/0x3b8 [ 110.903841] sp : 0c403480 [ 110.904132] x29: 0c403480 x28: 0c403820 [ 110.904597] x27: x26: 8000f0cab800 [ 110.905062] x25: 080fdaa8 x24: 09069000 [ 110.905527] x23: 09042000 x22: 8000f0cab800 [ 110.905992] x21: 8000f6f7ed00 x20: [ 110.906456] x19: 8000f0cab800 x18: [ 110.906921] x17: 004ab8a8 x16: 08241628 [ 110.907386] x15: x14: 0400 [ 110.907851] x13: 091b6200 x12: 0004 [ 110.908316] x11: 0019c85d4000 x10: 0400 [ 110.908781] x9 : 0004 x8 : 8000f02e6a00 [ 110.909246] x7 : 8000f6f7f760 x6 : 090ba86f [ 110.909711] x5 : x4 : [ 110.910176] x3 : 8000edf3 x2 : 0004 [ 110.910641] x1 : 0904fa98 x0 : 0001 [ 110.911106] Call trace: [ 110.911322] rcu_note_context_switch+0x30/0x3b8 [ 110.911722] __schedule+0x90/0x600 [ 110.912022] do_task_dead+0x40/0x48 [ 110.912328] do_exit+0x6c8/0x9b8 [ 110.912613] die+0x1cc/0x1f8 [ 110.912868] __do_kernel_fault+0xa4/0xf8 [ 110.913212] do_page_fault+0x1f0/0x428 [ 110.913541] do_translation_fault+0x5c/0x68 [ 110.913908] do_mem_abort+0x54/0xd8 [ 110.914215] el1_da+0x20/0x80 [ 110.914476]
[PATCH] block/amiflop: Don't log an error message for an invalid ioctl
Do as the swim3 driver does and just return -ENOTTY. Cc: Geert UytterhoevenCc: linux-m...@lists.linux-m68k.org Signed-off-by: Finn Thain --- drivers/block/amiflop.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/drivers/block/amiflop.c b/drivers/block/amiflop.c index 3aaf6af3ec23..b9fb794c1255 100644 --- a/drivers/block/amiflop.c +++ b/drivers/block/amiflop.c @@ -1533,9 +1533,7 @@ static int fd_locked_ioctl(struct block_device *bdev, fmode_t mode, return p->type->read_size; #endif default: - printk(KERN_DEBUG "fd_ioctl: unknown cmd %d for drive %d.", - cmd, drive); - return -ENOSYS; + return -ENOTTY; } return 0; } -- 2.16.1
[PATCH v2 03/10] m68k/mac: Don't remap SWIM MMIO region
For reasons I don't understand, calling ioremap() then iounmap() on the SWIM MMIO region causes a hang on 68030 (but not on 68040). ~# modprobe swim_mod SWIM floppy driver Version 0.2 (2008-10-30) SWIM device not found ! watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [modprobe:285] Modules linked in: swim_mod(+) Format 00 Vector: 0064 PC: 75aa Status: 2000Not tainted ORIG_D0: D0: d00c A2: 007c2370 A1: 003f810c A0: 0004 D5: d0096800 D4: d0097e00 D3: 0001 D2: 0003 D1: Non-Maskable Interrupt Modules linked in: swim_mod(+) PC: [<75ba>] __iounmap+0x24/0x10e SR: 2000 SP: 007abc48 a2: 007c2370 d0: d00cd1: 01a0d2: 0019d3: 0001 d4: d0097e00d5: d0096800a0: 0004a1: 003f810c Process modprobe (pid: 285, task=007c2370) Frame format=0 Stack from 007abc7c: ffed 006a4060 004712e0 007abca0 76ea d008 0008 010bb4b8 007abcd8 010ba542 d0096000 0001 010bb59c 007abf30 010bb4b8 0047760a 0047763c 00477612 00616540 007abcec 0020a91a 00477600 0047760a 010bb4cc 007abd18 002092f2 0047760a 00333b06 007abd5c 0047760a 010bb4cc 00404f90 004776b8 0001 007abd38 00209446 010bb4cc 0047760a 010bb4cc 0020938e 0031f8be 00616540 007abd64 Call Trace: [<76ea>] iounmap+0x46/0x5a [<0008>] shrink_page_list+0x7f6/0xe06 [<010ba542>] swim_probe+0xe4/0x496 [swim_mod] [<0020a91a>] platform_drv_probe+0x20/0x5e [<002092f2>] driver_probe_device+0x21c/0x2b8 [<00333b06>] mutex_lock+0x0/0x2e [<00209446>] __driver_attach+0xb8/0xce [<0020938e>] __driver_attach+0x0/0xce [<0031f8be>] klist_next+0x0/0xa0 [<00207562>] bus_for_each_dev+0x74/0xba [<000344c0>] blocking_notifier_call_chain+0x0/0x20 [<00333b06>] mutex_lock+0x0/0x2e [<00208e44>] driver_attach+0x1a/0x1e [<0020938e>] __driver_attach+0x0/0xce [<00207e26>] bus_add_driver+0x188/0x234 [<000344c0>] blocking_notifier_call_chain+0x0/0x20 [<00209894>] driver_register+0x58/0x104 [<000344c0>] blocking_notifier_call_chain+0x0/0x20 [<010bd000>] swim_init+0x0/0x2c [swim_mod] [<0020a7be>] __platform_driver_register+0x38/0x3c [<010bd028>] swim_init+0x28/0x2c [swim_mod] [<20dc>] do_one_initcall+0x38/0x196 [<000344c0>] blocking_notifier_call_chain+0x0/0x20 [<003331cc>] mutex_unlock+0x0/0x3e [<00333b06>] mutex_lock+0x0/0x2e [<003331cc>] mutex_unlock+0x0/0x3e [<00333b06>] mutex_lock+0x0/0x2e [<003331cc>] mutex_unlock+0x0/0x3e [<00333b06>] mutex_lock+0x0/0x2e [<003331cc>] mutex_unlock+0x0/0x3e [<00333b06>] mutex_lock+0x0/0x2e [<00075008>] __free_pages+0x0/0x38 [<45c0>] mangle_kernel_stack+0x30/0xda [<000344c0>] blocking_notifier_call_chain+0x0/0x20 [<003331cc>] mutex_unlock+0x0/0x3e [<00333b06>] mutex_lock+0x0/0x2e [<0005ced4>] do_init_module+0x42/0x266 [<010bd000>] swim_init+0x0/0x2c [swim_mod] [<000344c0>] blocking_notifier_call_chain+0x0/0x20 [<0005eda0>] load_module+0x1a30/0x1e70 [<465d>] mangle_kernel_stack+0xcd/0xda [<00331c64>] __generic_copy_from_user+0x0/0x46 [<0033256e>] _cond_resched+0x0/0x32 [<00331b9c>] memset+0x0/0x98 [<0033256e>] _cond_resched+0x0/0x32 [<0005f25c>] SyS_init_module+0x7c/0x112 [<2000>] _start+0x0/0x8 [<2000>] _start+0x0/0x8 [<00331c82>] __generic_copy_from_user+0x1e/0x46 [<0005f2b2>] SyS_init_module+0xd2/0x112 [<465d>] mangle_kernel_stack+0xcd/0xda [<2b40>] syscall+0x8/0xc [<465d>] mangle_kernel_stack+0xcd/0xda [<0008c00c>] pcpu_balance_workfn+0xb2/0x40e Code: 2200 7419 e4a9 e589 2841 d9fc 1000 <2414> 7203 c282 7602 b681 6600 0096 0242 fe00 0482 e9c0 11c3 ed89 2642 There's no need to call ioremap() for the SWIM address range, as it lies within the usual IO device region at 0x5000 , which has already been mapped by head.S. Remove the redundant ioremap() and iounmap() calls to fix the hang. Cc: Laurent VivierCc: sta...@vger.kernel.org # v4.14+ Tested-by: Stan Johnson Signed-off-by: Finn Thain Acked-by: Laurent Vivier --- drivers/block/swim.c | 10 +++--- 1 file changed, 3 insertions(+), 7 deletions(-) diff --git a/drivers/block/swim.c b/drivers/block/swim.c index 64e066eba72e..92f0cddc597e 100644 --- a/drivers/block/swim.c +++ b/drivers/block/swim.c @@ -911,7 +911,7 @@ static int swim_probe(struct platform_device *dev) goto out; } - swim_base = ioremap(res->start, resource_size(res)); + swim_base = (struct swim __iomem *)res->start; if (!swim_base) { ret = -ENOMEM; goto out_release_io; @@ -923,7 +923,7 @@ static int swim_probe(struct platform_device *dev) if (!get_swim_mode(swim_base)) { printk(KERN_INFO "SWIM device not found !\n"); ret = -ENODEV; - goto out_iounmap; + goto out_release_io; } /* set platform
[PATCH v2 02/10] m68k/mac: Fix SWIM memory resource end address
The resource size is 0x2000 == end - start + 1. Therefore end == start + 0x2000 - 1. Cc: Laurent VivierCc: sta...@vger.kernel.org # v4.14+ Tested-by: Stan Johnson Signed-off-by: Finn Thain Acked-by: Laurent Vivier Reviewed-by: Geert Uytterhoeven --- arch/m68k/mac/config.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/m68k/mac/config.c b/arch/m68k/mac/config.c index 84bd9161bf23..4a544a6c72dd 100644 --- a/arch/m68k/mac/config.c +++ b/arch/m68k/mac/config.c @@ -1011,7 +1011,7 @@ int __init mac_platform_init(void) struct resource swim_rsrc = { .flags = IORESOURCE_MEM, .start = (resource_size_t)swim_base, - .end = (resource_size_t)swim_base + 0x2000, + .end = (resource_size_t)swim_base + 0x1FFF, }; platform_device_register_simple("swim", -1, _rsrc, 1); -- 2.16.1
Re: 4.15.14 crash with iscsi target and dvd
Wakko Warner wrote: > Ming Lei wrote: > > Sure, thanks for your sharing. > > > > Wakko, could you test the following patch and see if there is any > > difference? > > > > -- > > diff --git a/drivers/target/target_core_pscsi.c > > b/drivers/target/target_core_pscsi.c > > index 0d99b242e82e..6147178f1f37 100644 > > --- a/drivers/target/target_core_pscsi.c > > +++ b/drivers/target/target_core_pscsi.c > > @@ -888,7 +888,7 @@ pscsi_map_sg(struct se_cmd *cmd, struct scatterlist > > *sgl, u32 sgl_nents, > > if (len > 0 && data_len > 0) { > > bytes = min_t(unsigned int, len, PAGE_SIZE - off); > > bytes = min(bytes, data_len); > > - > > + new_bio: > > if (!bio) { > > nr_vecs = min_t(int, BIO_MAX_PAGES, nr_pages); > > nr_pages -= nr_vecs; > > @@ -931,6 +931,7 @@ pscsi_map_sg(struct se_cmd *cmd, struct scatterlist > > *sgl, u32 sgl_nents, > > * be allocated with pscsi_get_bio() above. > > */ > > bio = NULL; > > + goto new_bio; > > } > > > > data_len -= bytes; > > Sorry for the delay. I reverted my change, added this one. I didn't > reboot, I just unloaded and loaded this one. > Note: /dev/sr1 as seen from the initiator is /dev/sr0 (physical disc) on the > target. > > Doesn't crash, however on the initiator I see this: > [9273849.70] ISO 9660 Extensions: RRIP_1991A > [9273863.359718] scsi_io_completion: 13 callbacks suppressed > [9273863.359788] sr 26:0:0:0: [sr1] tag#1 UNKNOWN(0x2003) Result: > hostbyte=0x00 driverbyte=0x08 > [9273863.359909] sr 26:0:0:0: [sr1] tag#1 Sense Key : 0x2 [current] > [9273863.359974] sr 26:0:0:0: [sr1] tag#1 ASC=0x8 ASCQ=0x0 > [9273863.360036] sr 26:0:0:0: [sr1] tag#1 CDB: opcode=0x28 28 00 00 22 f6 96 > 00 00 80 00 > [9273863.360116] blk_update_request: 13 callbacks suppressed > [9273863.360177] blk_update_request: I/O error, dev sr1, sector 9165400 > [9273875.864648] sr 26:0:0:0: [sr1] tag#1 UNKNOWN(0x2003) Result: > hostbyte=0x00 driverbyte=0x08 > [9273875.864738] sr 26:0:0:0: [sr1] tag#1 Sense Key : 0x2 [current] > [9273875.864801] sr 26:0:0:0: [sr1] tag#1 ASC=0x8 ASCQ=0x0 > [9273875.864890] sr 26:0:0:0: [sr1] tag#1 CDB: opcode=0x28 28 00 00 22 f7 16 > 00 00 80 00 > [9273875.864971] blk_update_request: I/O error, dev sr1, sector 9165912 Just FYI: The jobs that I do that uses the disc over iscsi didn't cause any kernel messages on either system (except for the informational when the disc was mounted) I have a dumb question though. Could the label be placed just after the 'if' statement instead of before it? bio is set to null and the 'if' statement checks if it's null, which it always would be after the goto. -- Microsoft has beaten Volkswagen's world record. Volkswagen only created 22 million bugs.
[PATCH v2 01/10] m68k/mac: Revisit floppy disc controller base addresses
Rename floppy_type macros to make them more consistent with the scsi_type macros, which are named after classes of models with similar memory maps. The documentation for LC-class machines has the IO devices at offsets from $50F0 . Use these addresses (consistent with mac_scsi resources) because they may not be aliased elsewhere in the memory map, e.g. at offsets from $5000 . Add comments with controller type information from 'Designing Cards and Drivers for the Macintosh Family', relevant Developer Notes and http://mess.redump.net/mess/driver_info/mac_technical_notes Cc: Laurent VivierCc: sta...@vger.kernel.org # v4.14+ Tested-by: Stan Johnson Signed-off-by: Finn Thain Acked-by: Laurent Vivier --- arch/m68k/include/asm/macintosh.h | 10 +-- arch/m68k/mac/config.c| 124 -- 2 files changed, 70 insertions(+), 64 deletions(-) diff --git a/arch/m68k/include/asm/macintosh.h b/arch/m68k/include/asm/macintosh.h index 9b840c03ebb7..a61ce06c0a54 100644 --- a/arch/m68k/include/asm/macintosh.h +++ b/arch/m68k/include/asm/macintosh.h @@ -79,11 +79,11 @@ struct mac_model #define MAC_EXP_PDS_NUBUS 3 /* Accepts PDS card and/or NuBus card(s) */ #define MAC_EXP_PDS_COMM 4 /* Accepts PDS card or Comm Slot card */ -#define MAC_FLOPPY_IWM 0 -#define MAC_FLOPPY_SWIM_ADDR1 1 -#define MAC_FLOPPY_SWIM_ADDR2 2 -#define MAC_FLOPPY_SWIM_IOP3 -#define MAC_FLOPPY_AV 4 +#define MAC_FLOPPY_UNSUPPORTED 0 +#define MAC_FLOPPY_QUADRA 1 +#define MAC_FLOPPY_OLD 2 +#define MAC_FLOPPY_IIFX3 +#define MAC_FLOPPY_LC 4 extern struct mac_model *macintosh_config; diff --git a/arch/m68k/mac/config.c b/arch/m68k/mac/config.c index 0c3275aa0197..84bd9161bf23 100644 --- a/arch/m68k/mac/config.c +++ b/arch/m68k/mac/config.c @@ -214,7 +214,7 @@ static struct mac_model mac_data_table[] = { .scsi_type = MAC_SCSI_OLD, .scc_type = MAC_SCC_II, .expansion_type = MAC_EXP_NUBUS, - .floppy_type= MAC_FLOPPY_IWM, + .floppy_type= MAC_FLOPPY_UNSUPPORTED, /* IWM */ }, /* @@ -229,7 +229,7 @@ static struct mac_model mac_data_table[] = { .scsi_type = MAC_SCSI_OLD, .scc_type = MAC_SCC_II, .expansion_type = MAC_EXP_NUBUS, - .floppy_type= MAC_FLOPPY_IWM, + .floppy_type= MAC_FLOPPY_UNSUPPORTED, /* IWM */ }, { .ident = MAC_MODEL_IIX, .name = "IIx", @@ -238,7 +238,7 @@ static struct mac_model mac_data_table[] = { .scsi_type = MAC_SCSI_OLD, .scc_type = MAC_SCC_II, .expansion_type = MAC_EXP_NUBUS, - .floppy_type= MAC_FLOPPY_SWIM_ADDR2, + .floppy_type= MAC_FLOPPY_OLD, /* SWIM */ }, { .ident = MAC_MODEL_IICX, .name = "IIcx", @@ -247,7 +247,7 @@ static struct mac_model mac_data_table[] = { .scsi_type = MAC_SCSI_OLD, .scc_type = MAC_SCC_II, .expansion_type = MAC_EXP_NUBUS, - .floppy_type= MAC_FLOPPY_SWIM_ADDR2, + .floppy_type= MAC_FLOPPY_OLD, /* SWIM */ }, { .ident = MAC_MODEL_SE30, .name = "SE/30", @@ -256,7 +256,7 @@ static struct mac_model mac_data_table[] = { .scsi_type = MAC_SCSI_OLD, .scc_type = MAC_SCC_II, .expansion_type = MAC_EXP_PDS, - .floppy_type= MAC_FLOPPY_SWIM_ADDR2, + .floppy_type= MAC_FLOPPY_OLD, /* SWIM */ }, /* @@ -274,7 +274,7 @@ static struct mac_model mac_data_table[] = { .scsi_type = MAC_SCSI_OLD, .scc_type = MAC_SCC_II, .expansion_type = MAC_EXP_NUBUS, - .floppy_type= MAC_FLOPPY_SWIM_ADDR2, + .floppy_type= MAC_FLOPPY_OLD, /* SWIM */ }, { .ident = MAC_MODEL_IIFX, .name = "IIfx", @@ -283,7 +283,7 @@ static struct mac_model mac_data_table[] = { .scsi_type = MAC_SCSI_IIFX, .scc_type = MAC_SCC_IOP, .expansion_type = MAC_EXP_PDS_NUBUS, - .floppy_type= MAC_FLOPPY_SWIM_IOP, + .floppy_type= MAC_FLOPPY_IIFX, /* SWIM with IOP */ }, { .ident = MAC_MODEL_IISI, .name = "IIsi", @@ -292,7 +292,7 @@ static struct mac_model mac_data_table[] = { .scsi_type = MAC_SCSI_OLD, .scc_type = MAC_SCC_II, .expansion_type =
[PATCH v2 05/10] block/swim: Remove extra put_disk() call from error path
Cc: Laurent VivierCc: Jens Axboe Cc: sta...@vger.kernel.org # v4.14+ Fixes: 103db8b2dfa5 ("[PATCH] swim: stop sharing request queue across multiple gendisks") Tested-by: Stan Johnson Signed-off-by: Finn Thain Acked-by: Laurent Vivier Reviewed-by: Geert Uytterhoeven --- drivers/block/swim.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/block/swim.c b/drivers/block/swim.c index 2cdfc0db5966..0258a96e0c46 100644 --- a/drivers/block/swim.c +++ b/drivers/block/swim.c @@ -861,7 +861,6 @@ static int swim_floppy_init(struct swim_priv *swd) >lock); if (!swd->unit[drive].disk->queue) { err = -ENOMEM; - put_disk(swd->unit[drive].disk); goto exit_put_disks; } blk_queue_bounce_limit(swd->unit[drive].disk->queue, -- 2.16.1
[PATCH v2 06/10] block/swim: Don't log an error message for an invalid ioctl
The 'eject' shell command may send various different ioctl commands. This leads to error messages on the console even though the FDEJECT ioctl succeeds. ~# eject floppy SWIM floppy_ioctl: unknown cmd 21257 SWIM floppy_ioctl: unknown cmd 1 Don't log an error message for an invalid ioctl, just do as the swim3 driver does and return -ENOTTY. Cc: Laurent VivierCc: Jens Axboe Cc: sta...@vger.kernel.org # v4.14+ Tested-by: Stan Johnson Signed-off-by: Finn Thain Acked-by: Laurent Vivier Reviewed-by: Geert Uytterhoeven --- drivers/block/swim.c | 9 ++--- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/drivers/block/swim.c b/drivers/block/swim.c index 0258a96e0c46..7b847170cf71 100644 --- a/drivers/block/swim.c +++ b/drivers/block/swim.c @@ -727,14 +727,9 @@ static int floppy_ioctl(struct block_device *bdev, fmode_t mode, if (copy_to_user((void __user *) param, (void *) _type, sizeof(struct floppy_struct))) return -EFAULT; - break; - - default: - printk(KERN_DEBUG "SWIM floppy_ioctl: unknown cmd %d\n", - cmd); - return -ENOSYS; + return 0; } - return 0; + return -ENOTTY; } static int floppy_getgeo(struct block_device *bdev, struct hd_geometry *geo) -- 2.16.1
[PATCH v2 04/10] block/swim: Fix array bounds check
In the floppy_find() function in swim.c is a call to get_disk(swd->unit[drive].disk). The actual parameter to this call can be a NULL pointer when drive == swd->floppy_count. This causes an oops in get_disk(). Data read fault at 0x0198 in Super Data (pc=0x1be5b6) BAD KERNEL BUSERR Oops: Modules linked in: swim_mod ipv6 mac8390 PC: [<001be5b6>] get_disk+0xc/0x76 SR: 2004 SP: 9a078bc1 a2: 0213ed90 d0: d1: d2: d3: 00ff d4: 0002d5: 02983590a0: 02332e00a1: 022dfd64 Process dd (pid: 285, task=020ab25b) Frame format=B ssw=074d isc=4a88 isb=6732 daddr=0198 dobuf= baddr=001be5bc dibuf=bfff ver=f Stack from 022dfca4: 0203fc00 0213ed90 022dfcc0 02982936 0020 022dfd08 0020f85a 0020 022dfd64 02332e00 004040fc 0014 001be77e 022dfd64 00334e4a 001be3f8 081d 022dfd64 01c04b60 01c04b70 022aba80 029828f8 02332e00 022dfd2c 001be7ac 0203fc00 0020 022dfd64 02103a00 01c04b60 01c04b60 0200e400 022dfd68 000e191a 0020 022dfd64 02103a00 081d 0003 000b89de 0050 02103a00 01c04b60 02103a08 01c04c2e Call Trace: [<02982936>] floppy_find+0x3e/0x4a [swim_mod] [<0020>] uart_remove_one_port+0x1a2/0x260 [<0020f85a>] kobj_lookup+0xde/0x132 [<0020>] uart_remove_one_port+0x1a2/0x260 [<001be77e>] get_gendisk+0x0/0x130 [<00334e4a>] mutex_lock+0x0/0x2e [<001be3f8>] disk_block_events+0x0/0x6c [<029828f8>] floppy_find+0x0/0x4a [swim_mod] [<001be7ac>] get_gendisk+0x2e/0x130 [<0020>] uart_remove_one_port+0x1a2/0x260 [<000e191a>] __blkdev_get+0x32/0x45a [<0020>] uart_remove_one_port+0x1a2/0x260 [<000b89de>] complete_walk+0x0/0x8a [<000e1e22>] blkdev_get+0xe0/0x29a [<000e1fdc>] blkdev_open+0x0/0xb0 [<000b89de>] complete_walk+0x0/0x8a [<000e1fdc>] blkdev_open+0x0/0xb0 [<000e01cc>] bd_acquire+0x74/0x8a [<000e205c>] blkdev_open+0x80/0xb0 [<000e1fdc>] blkdev_open+0x0/0xb0 [<000abf24>] do_dentry_open+0x1a4/0x322 [<0002>] __do_proc_douintvec+0x22/0x27e [<000b89de>] complete_walk+0x0/0x8a [<000baa62>] link_path_walk+0x0/0x48e [<000ba3f8>] inode_permission+0x20/0x54 [<000ac0e4>] vfs_open+0x42/0x78 [<000bc372>] path_openat+0x2b2/0xeaa [<000bc0c0>] path_openat+0x0/0xeaa [<0004463e>] __irq_wake_thread+0x0/0x4e [<0003a45a>] task_tick_fair+0x18/0xc8 [<000bd00a>] do_filp_open+0xa0/0xea [<000abae0>] do_sys_open+0x11a/0x1ee [<0002>] __do_proc_douintvec+0x22/0x27e [<000abbf4>] SyS_open+0x1e/0x22 [<0002>] __do_proc_douintvec+0x22/0x27e [<2b40>] syscall+0x8/0xc [<0002>] __do_proc_douintvec+0x22/0x27e [] dyadic+0x1/0x28 Code: 4e5e 4e75 4e56 fffc 2f0b 2f02 266e 0008 <206b> 0198 4a88 6732 2428 002c 661e 486b 0058 4eb9 0032 0b96 588f 4a88 672c 2008 Disabling lock debugging due to kernel taint Fix the array index bounds check to avoid this. Cc: Laurent Vivier Cc: Jens Axboe Cc: sta...@vger.kernel.org # v4.14+ Fixes: 8852ecd97488 ("[PATCH] m68k: mac - Add SWIM floppy support") Tested-by: Stan Johnson Signed-off-by: Finn Thain Acked-by: Laurent Vivier Reviewed-by: Geert Uytterhoeven --- drivers/block/swim.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/block/swim.c b/drivers/block/swim.c index 92f0cddc597e..2cdfc0db5966 100644 --- a/drivers/block/swim.c +++ b/drivers/block/swim.c @@ -795,7 +795,7 @@ static struct kobject *floppy_find(dev_t dev, int *part, void *data) struct swim_priv *swd = data; int drive = (*part & 3); - if (drive > swd->floppy_count) + if (drive >= swd->floppy_count) return NULL; *part = 0; -- 2.16.1
[PATCH v2 09/10] block/swim: Fix IO error at end of medium
Reading to the end of a 720K disk results in an IO error instead of EOF because the block layer thinks the disk has 2880 sectors. (Partly this is a result of inverted logic of the ONEMEG_MEDIA bit that's now fixed.) Initialize the density and head count in swim_add_floppy() to agree with the device size passed to set_capacity() during drive probe. Call set_capacity() again upon device open, after refreshing the density and head count values. Cc: Laurent VivierCc: Jens Axboe Cc: sta...@vger.kernel.org # v4.14+ Tested-by: Stan Johnson Signed-off-by: Finn Thain Acked-by: Laurent Vivier --- drivers/block/swim.c | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/drivers/block/swim.c b/drivers/block/swim.c index c8c8b9da3edd..2c75761b61e8 100644 --- a/drivers/block/swim.c +++ b/drivers/block/swim.c @@ -612,7 +612,6 @@ static void setup_medium(struct floppy_state *fs) struct floppy_struct *g; fs->disk_in = 1; fs->write_protected = swim_readbit(base, WRITE_PROT); - fs->type = swim_readbit(base, TWOMEG_MEDIA); if (swim_track00(base)) printk(KERN_ERR @@ -620,6 +619,9 @@ static void setup_medium(struct floppy_state *fs) swim_track00(base); + fs->type = swim_readbit(base, TWOMEG_MEDIA) ? + HD_MEDIA : DD_MEDIA; + fs->head_number = swim_readbit(base, SINGLE_SIDED) ? 1 : 2; get_floppy_geometry(fs, 0, ); fs->total_secs = g->size; fs->secpercyl = g->head * g->sect; @@ -656,6 +658,8 @@ static int floppy_open(struct block_device *bdev, fmode_t mode) goto out; } + set_capacity(fs->disk, fs->total_secs); + if (mode & FMODE_NDELAY) return 0; @@ -808,10 +812,9 @@ static int swim_add_floppy(struct swim_priv *swd, enum drive_location location) swim_motor(base, OFF); - if (swim_readbit(base, SINGLE_SIDED)) - fs->head_number = 1; - else - fs->head_number = 2; + fs->type = HD_MEDIA; + fs->head_number = 2; + fs->ref_count = 0; fs->ejected = 1; -- 2.16.1
[PATCH v2 08/10] block/swim: Check drive type
The SWIM chip is compatible with GCR-mode Sony 400K/800K drives but this driver only supports MFM mode. Therefore only Sony FDHD drives are supported. Skip incompatible drives. Cc: Laurent VivierCc: Jens Axboe Cc: sta...@vger.kernel.org # v4.14+ Tested-by: Stan Johnson Signed-off-by: Finn Thain Acked-by: Laurent Vivier --- drivers/block/swim.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/drivers/block/swim.c b/drivers/block/swim.c index d1ee4670666a..c8c8b9da3edd 100644 --- a/drivers/block/swim.c +++ b/drivers/block/swim.c @@ -829,10 +829,12 @@ static int swim_floppy_init(struct swim_priv *swd) /* scan floppy drives */ swim_drive(base, INTERNAL_DRIVE); - if (swim_readbit(base, DRIVE_PRESENT)) + if (swim_readbit(base, DRIVE_PRESENT) && + !swim_readbit(base, ONEMEG_DRIVE)) swim_add_floppy(swd, INTERNAL_DRIVE); swim_drive(base, EXTERNAL_DRIVE); - if (swim_readbit(base, DRIVE_PRESENT)) + if (swim_readbit(base, DRIVE_PRESENT) && + !swim_readbit(base, ONEMEG_DRIVE)) swim_add_floppy(swd, EXTERNAL_DRIVE); /* register floppy drives */ -- 2.16.1
[PATCH v2 10/10] block/swim: Select appropriate drive on device open
The driver supports internal and external FDD units so the floppy_open function must not hard-code the drive location. Cc: Laurent VivierCc: Jens Axboe Cc: sta...@vger.kernel.org # v4.14+ Tested-by: Stan Johnson Signed-off-by: Finn Thain Acked-by: Laurent Vivier --- drivers/block/swim.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/block/swim.c b/drivers/block/swim.c index 2c75761b61e8..0e31884a9519 100644 --- a/drivers/block/swim.c +++ b/drivers/block/swim.c @@ -648,7 +648,7 @@ static int floppy_open(struct block_device *bdev, fmode_t mode) swim_write(base, setup, S_IBM_DRIVE | S_FCLK_DIV2); udelay(10); - swim_drive(base, INTERNAL_DRIVE); + swim_drive(base, fs->location); swim_motor(base, ON); swim_action(base, SETMFM); if (fs->ejected) -- 2.16.1
[PATCH v2 07/10] block/swim: Rename macros to avoid inconsistent inverted logic
The Sony drive status bits use active-low logic. The swim_readbit() function converts that to 'C' logic for readability. Hence, the sense of the names of the status bit macros should not be inverted. Mostly they are correct. However, the TWOMEG_DRIVE, MFM_MODE and TWOMEG_MEDIA macros have inverted sense (like MkLinux). Fix this inconsistency and make the following patches less confusing. The same problem affects swim3.c so fix that too. No functional change. The FDHD drive status bits are documented in sonydriv.cpp from MAME and in swimiii.h from MkLinux. Cc: Laurent VivierCc: Benjamin Herrenschmidt Cc: linuxppc-...@lists.ozlabs.org Cc: Jens Axboe Cc: sta...@vger.kernel.org # v4.14+ Tested-by: Stan Johnson Signed-off-by: Finn Thain Acked-by: Laurent Vivier --- drivers/block/swim.c | 8 drivers/block/swim3.c | 6 +++--- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/block/swim.c b/drivers/block/swim.c index 7b847170cf71..d1ee4670666a 100644 --- a/drivers/block/swim.c +++ b/drivers/block/swim.c @@ -110,7 +110,7 @@ struct iwm { /* Select values for swim_select and swim_readbit */ #define READ_DATA_00x074 -#define TWOMEG_DRIVE 0x075 +#define ONEMEG_DRIVE 0x075 #define SINGLE_SIDED 0x076 #define DRIVE_PRESENT 0x077 #define DISK_IN0x170 @@ -118,9 +118,9 @@ struct iwm { #define TRACK_ZERO 0x172 #define TACHO 0x173 #define READ_DATA_10x174 -#define MFM_MODE 0x175 +#define GCR_MODE 0x175 #define SEEK_COMPLETE 0x176 -#define ONEMEG_MEDIA 0x177 +#define TWOMEG_MEDIA 0x177 /* Bits in handshake register */ @@ -612,7 +612,7 @@ static void setup_medium(struct floppy_state *fs) struct floppy_struct *g; fs->disk_in = 1; fs->write_protected = swim_readbit(base, WRITE_PROT); - fs->type = swim_readbit(base, ONEMEG_MEDIA); + fs->type = swim_readbit(base, TWOMEG_MEDIA); if (swim_track00(base)) printk(KERN_ERR diff --git a/drivers/block/swim3.c b/drivers/block/swim3.c index af51015d056e..469541c1e51e 100644 --- a/drivers/block/swim3.c +++ b/drivers/block/swim3.c @@ -148,7 +148,7 @@ struct swim3 { #define MOTOR_ON 2 #define RELAX 3 /* also eject in progress */ #define READ_DATA_04 -#define TWOMEG_DRIVE 5 +#define ONEMEG_DRIVE 5 #define SINGLE_SIDED 6 /* drive or diskette is 4MB type? */ #define DRIVE_PRESENT 7 #define DISK_IN8 @@ -156,9 +156,9 @@ struct swim3 { #define TRACK_ZERO 10 #define TACHO 11 #define READ_DATA_112 -#define MFM_MODE 13 +#define GCR_MODE 13 #define SEEK_COMPLETE 14 -#define ONEMEG_MEDIA 15 +#define TWOMEG_MEDIA 15 /* Definitions of values used in writing and formatting */ #define DATA_ESCAPE0x99 -- 2.16.1
[PATCH v2 00/10] SWIM driver fixes
This patch series has fixes for bugs in the SWIM floppy disk controller driver, including an oops and a soft lockup. One way to apply these patches to v4.14+ is by first cherry-picking these commits: b87eaec27eca3def6c8ed617e3b1bac08d7bc715 e5f0d2e2a153b18dcf31e1a633e210c37829d759 There are of course other ways to fix the patch rejects, but this way would be convenient for me because it would simplify my own backporting. Changes since v1: - Dropped the two IOP patches as they aren't simple fixes. This way, the entire series is suitable for stable trees. - Added Cc, Fixes, Acked-by and Reviewed-by tags. Finn Thain (10): m68k/mac: Revisit floppy disc controller base addresses m68k/mac: Fix SWIM memory resource end address m68k/mac: Don't remap SWIM MMIO region block/swim: Fix array bounds check block/swim: Remove extra put_disk() call from error path block/swim: Don't log an error message for an invalid ioctl block/swim: Rename macros to avoid inconsistent inverted logic block/swim: Check drive type block/swim: Fix IO error at end of medium block/swim: Select appropriate drive on device open arch/m68k/include/asm/macintosh.h | 10 +-- arch/m68k/mac/config.c| 126 -- drivers/block/swim.c | 49 +++ drivers/block/swim3.c | 6 +- 4 files changed, 96 insertions(+), 95 deletions(-) -- 2.16.1
Re: usercopy whitelist woe in scsi_sense_cache
On Wed, Apr 11, 2018 at 3:47 PM, Kees Cookwrote: > On Tue, Apr 10, 2018 at 8:13 PM, Kees Cook wrote: >> I'll see about booting with my own kernels, etc, and try to narrow this >> down. :) > > If I boot kernels I've built, I no longer hit the bug in this VM > (though I'll keep trying). What compiler are you using? Ignore that: I've reproduced it with my kernels now. I think I messed up the initramfs initially. But with an exact copy of your .config, booting under Arch grub with initramfs, I see it. I'll start removing variables now... :P -Kees -- Kees Cook Pixel Security
[PATCH V2] blk-mq: fix race between complete and BLK_EH_RESET_TIMER
The normal request completion can be done before or during handling BLK_EH_RESET_TIMER, and this race may cause the request to never be completed since driver's .timeout() may always return BLK_EH_RESET_TIMER. This issue can't be fixed completely by driver, since the normal completion can be done between returning .timeout() and handing BLK_EH_RESET_TIMER. This patch fixes this race by introducing rq state of MQ_RQ_COMPLETE_IN_RESET, and reading/writing rq's state by holding queue lock, which can be per-request actually, but just not necessary to introduce one lock for so unusual event. Cc: Bart Van AsscheCc: Tejun Heo Cc: Christoph Hellwig Cc: Ming Lei Cc: Sagi Grimberg Cc: Israel Rukshin , Cc: Max Gurtovoy Cc: sta...@vger.kernel.org Signed-off-by: Ming Lei --- V2: - rename the new flag as MQ_RQ_COMPLETE_IN_TIMEOUT - fix lock uses in blk_mq_rq_timed_out - document re-order between blk_add_timer() and blk_mq_rq_update_aborted_gstate(req, 0) block/blk-mq.c | 49 - block/blk-mq.h | 1 + 2 files changed, 45 insertions(+), 5 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 0dc9e341c2a7..db1c84b2bb5e 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -630,10 +630,27 @@ void blk_mq_complete_request(struct request *rq) * However, that would complicate paths which want to synchronize * against us. Let stay in sync with the issue path so that * hctx_lock() covers both issue and completion paths. +* +* Cover complete vs BLK_EH_RESET_TIMER race in slow path with +* helding queue lock. */ hctx_lock(hctx, _idx); if (blk_mq_rq_aborted_gstate(rq) != rq->gstate) __blk_mq_complete_request(rq); + else { + unsigned long flags; + bool need_complete = false; + + spin_lock_irqsave(q->queue_lock, flags); + if (!blk_mq_rq_aborted_gstate(rq)) + need_complete = true; + else + blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE_IN_TIMEOUT); + spin_unlock_irqrestore(q->queue_lock, flags); + + if (need_complete) + __blk_mq_complete_request(rq); + } hctx_unlock(hctx, srcu_idx); } EXPORT_SYMBOL(blk_mq_complete_request); @@ -814,6 +831,7 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved) { const struct blk_mq_ops *ops = req->q->mq_ops; enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER; + unsigned long flags; req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED; @@ -826,12 +844,33 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved) break; case BLK_EH_RESET_TIMER: /* -* As nothing prevents from completion happening while -* ->aborted_gstate is set, this may lead to ignored -* completions and further spurious timeouts. +* The normal completion may happen during handling the +* timeout, or even after returning from .timeout(), so +* once the request has been completed, we can't reset +* timer any more since this request may be handled as +* BLK_EH_RESET_TIMER in next timeout handling too, and +* it has to be completed in this situation. +* +* Holding the queue lock to cover read/write rq's +* aborted_gstate and normal state, so the race can be +* avoided completely. +* +* blk_add_timer() may be re-ordered with resetting +* aborted_gstate, and the only side-effec is that if this +* request is recycled after aborted_gstate is cleared, it +* may be timed out a bit late, that is what we can survive +* given timeout event is so unusual. */ - blk_mq_rq_update_aborted_gstate(req, 0); - blk_add_timer(req); + spin_lock_irqsave(req->q->queue_lock, flags); + if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_TIMEOUT) { + blk_add_timer(req); + blk_mq_rq_update_aborted_gstate(req, 0); + spin_unlock_irqrestore(req->q->queue_lock, flags); + } else { + blk_mq_rq_update_state(req, MQ_RQ_IN_FLIGHT); + spin_unlock_irqrestore(req->q->queue_lock, flags); + __blk_mq_complete_request(req); + } break; case BLK_EH_NOT_HANDLED: break; diff --git a/block/blk-mq.h
Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER
On Wed, Apr 11, 2018 at 10:49:51PM +, Bart Van Assche wrote: > On Thu, 2018-04-12 at 04:55 +0800, Ming Lei wrote: > > +again: > > switch (ret) { > > case BLK_EH_HANDLED: > > __blk_mq_complete_request(req); > > break; > > case BLK_EH_RESET_TIMER: > > [ ... ] > > + spin_lock_irqsave(req->q->queue_lock, flags); > > + if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) { > > + blk_mq_rq_update_aborted_gstate(req, 0); > > + blk_add_timer(req); > > + } else { > > + blk_mq_rq_update_state(req, MQ_RQ_IN_FLIGHT); > > + ret = BLK_EH_HANDLED; > > + goto again; > > + } > > + spin_unlock_irqrestore(req->q->queue_lock, flags); > > Does the above chunk introduce a backwards goto from inside a region around > which a spinlock is held to outside that region? Can such a goto result in > anything else than a deadlock? Yes, it is being fixed in my local V2, :-) -- Ming
Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER
On Thu, 2018-04-12 at 04:55 +0800, Ming Lei wrote: > +again: > switch (ret) { > case BLK_EH_HANDLED: > __blk_mq_complete_request(req); > break; > case BLK_EH_RESET_TIMER: > [ ... ] > + spin_lock_irqsave(req->q->queue_lock, flags); > + if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) { > + blk_mq_rq_update_aborted_gstate(req, 0); > + blk_add_timer(req); > + } else { > + blk_mq_rq_update_state(req, MQ_RQ_IN_FLIGHT); > + ret = BLK_EH_HANDLED; > + goto again; > + } > + spin_unlock_irqrestore(req->q->queue_lock, flags); Does the above chunk introduce a backwards goto from inside a region around which a spinlock is held to outside that region? Can such a goto result in anything else than a deadlock? Thanks, Bart.
Re: usercopy whitelist woe in scsi_sense_cache
On Tue, Apr 10, 2018 at 8:13 PM, Kees Cookwrote: > I'll see about booting with my own kernels, etc, and try to narrow this down. > :) If I boot kernels I've built, I no longer hit the bug in this VM (though I'll keep trying). What compiler are you using? -Kees -- Kees Cook Pixel Security
Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER
Hello, On Thu, Apr 12, 2018 at 06:43:45AM +0800, Ming Lei wrote: > On Wed, Apr 11, 2018 at 02:30:07PM -0700, Tejun Heo wrote: > > Hello, Ming. > > > > On Thu, Apr 12, 2018 at 04:55:29AM +0800, Ming Lei wrote: > > ... > > > + spin_lock_irqsave(req->q->queue_lock, flags); > > > + if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) { > > > + blk_mq_rq_update_aborted_gstate(req, 0); > > > + blk_add_timer(req); > > > > Nothing prevents the above blk_add_timer() racing against the next > > recycle instance of the request, so this still leaves a small race > > window. > > OK. > > But this small race window can be avoided by running blk_add_timer(req) > before blk_mq_rq_update_aborted_gstate(req, 0), can't it? Not really because aborted_gstate right now doesn't have any memory barrier around it, so nothing ensures blk_add_timer() actually appears before. We can either add the matching barriers in aborted_gstate update and when it's read in the normal completion path, or we can wait for the update to be visible everywhere by waiting for rcu grace period (because the reader is rcu protected). Thanks. -- tejun
Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER
On Wed, Apr 11, 2018 at 02:30:07PM -0700, Tejun Heo wrote: > Hello, Ming. > > On Thu, Apr 12, 2018 at 04:55:29AM +0800, Ming Lei wrote: > ... > > + spin_lock_irqsave(req->q->queue_lock, flags); > > + if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) { > > + blk_mq_rq_update_aborted_gstate(req, 0); > > + blk_add_timer(req); > > Nothing prevents the above blk_add_timer() racing against the next > recycle instance of the request, so this still leaves a small race > window. OK. But this small race window can be avoided by running blk_add_timer(req) before blk_mq_rq_update_aborted_gstate(req, 0), can't it? -- Ming
Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
Hey, again. On Wed, Apr 11, 2018 at 10:07:33AM -0700, t...@kernel.org wrote: > Hello, Israel. > > On Wed, Apr 11, 2018 at 07:16:14PM +0300, Israel Rukshin wrote: > > >Just noticed this one, this looks interesting to me as well. Israel, > > >can you run your test with this patch? > > > > Yes, I just did and it looks good. > > Awesome. Just to be sure, you tested the combined patch and saw the XXX debug messages, right? Thanks. -- tejun
Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release
On Wed, 2018-04-11 at 23:23 +0200, Alexandru Moise wrote: > Hi, I tested it, it doesn't solve the problem. > By the time you get here it's already too late, my patch > prevents this from failing in the first place. Hello Alex, If you can share the steps to follow to trigger the bug you reported then I will have a closer look at this. Thanks, Bart.
Re: [PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER
Hello, Ming. On Thu, Apr 12, 2018 at 04:55:29AM +0800, Ming Lei wrote: ... > + spin_lock_irqsave(req->q->queue_lock, flags); > + if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) { > + blk_mq_rq_update_aborted_gstate(req, 0); > + blk_add_timer(req); Nothing prevents the above blk_add_timer() racing against the next recycle instance of the request, so this still leaves a small race window. Thanks. -- tejun
Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release
On Wed, Apr 11, 2018 at 01:55:25PM -0600, Bart Van Assche wrote: > On 04/11/18 13:00, Alexandru Moise wrote: > > But the root cause of it is in blkcg_init_queue() when blkg_create() returns > > an ERR ptr, because it tries to insert into a populated index into > > blkcg->blkg_tree, > > the entry that we fail to remove at __blk_release_queue(). > > Hello Alex, > > Had you considered something like the untested patch below? > > Thanks, > > Bart. > > > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c > index 1c16694ae145..f2ced19e74b8 100644 > --- a/block/blk-cgroup.c > +++ b/block/blk-cgroup.c > @@ -1191,14 +1191,17 @@ int blkcg_init_queue(struct request_queue *q) > if (preloaded) > radix_tree_preload_end(); > > - if (IS_ERR(blkg)) > - return PTR_ERR(blkg); > + if (IS_ERR(blkg)) { > + ret = PTR_ERR(blkg); > + goto destroy_all; > + } > > q->root_blkg = blkg; > q->root_rl.blkg = blkg; > > ret = blk_throtl_init(q); > if (ret) { > +destroy_all: > spin_lock_irq(q->queue_lock); > blkg_destroy_all(q); > spin_unlock_irq(q->queue_lock); > Hi, I tested it, it doesn't solve the problem. By the time you get here it's already too late, my patch prevents this from failing in the first place. I would have liked this more than my solution though. ../Alex
[PATCH] blk-mq: fix race between complete and BLK_EH_RESET_TIMER
The normal request completion can be done before or during handling BLK_EH_RESET_TIMER, and this race may cause the request to never be completed since driver's .timeout() may always return BLK_EH_RESET_TIMER. This issue can't be fixed completely by driver, since the normal completion can be done between returning .timeout() and handing BLK_EH_RESET_TIMER. This patch fixes this race by introducing rq state of MQ_RQ_COMPLETE_IN_RESET, and reading/writing rq's state by holding queue lock, which can be per-request actually, but just not necessary to introduce one lock for so unusual event. Cc: Bart Van AsscheCc: Tejun Heo Cc: Christoph Hellwig Cc: Ming Lei Cc: Sagi Grimberg Cc: Israel Rukshin , Cc: Max Gurtovoy Cc: sta...@vger.kernel.org Signed-off-by: Ming Lei --- This is another way to fix this long-time issue, and turns out this solution is much simpler. block/blk-mq.c | 44 +++- block/blk-mq.h | 1 + 2 files changed, 40 insertions(+), 5 deletions(-) diff --git a/block/blk-mq.c b/block/blk-mq.c index 0dc9e341c2a7..12e8850e3905 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -630,10 +630,27 @@ void blk_mq_complete_request(struct request *rq) * However, that would complicate paths which want to synchronize * against us. Let stay in sync with the issue path so that * hctx_lock() covers both issue and completion paths. +* +* Cover complete vs BLK_EH_RESET_TIMER race in slow path with +* helding queue lock. */ hctx_lock(hctx, _idx); if (blk_mq_rq_aborted_gstate(rq) != rq->gstate) __blk_mq_complete_request(rq); + else { + unsigned long flags; + bool need_complete = false; + + spin_lock_irqsave(q->queue_lock, flags); + if (!blk_mq_rq_aborted_gstate(rq)) + need_complete = true; + else + blk_mq_rq_update_state(rq, MQ_RQ_COMPLETE_IN_RESET); + spin_unlock_irqrestore(q->queue_lock, flags); + + if (need_complete) + __blk_mq_complete_request(rq); + } hctx_unlock(hctx, srcu_idx); } EXPORT_SYMBOL(blk_mq_complete_request); @@ -814,24 +831,41 @@ static void blk_mq_rq_timed_out(struct request *req, bool reserved) { const struct blk_mq_ops *ops = req->q->mq_ops; enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER; + unsigned long flags; req->rq_flags |= RQF_MQ_TIMEOUT_EXPIRED; if (ops->timeout) ret = ops->timeout(req, reserved); +again: switch (ret) { case BLK_EH_HANDLED: __blk_mq_complete_request(req); break; case BLK_EH_RESET_TIMER: /* -* As nothing prevents from completion happening while -* ->aborted_gstate is set, this may lead to ignored -* completions and further spurious timeouts. +* The normal completion may happen during handling the +* timeout, or even after returning from .timeout(), so +* once the request has been completed, we can't reset +* timer any more since this request may be handled as +* BLK_EH_RESET_TIMER in next timeout handling too, and +* it has to be completed in this situation. +* +* Holding the queue lock to cover read/write rq's +* aborted_gstate and normal state, so the race can be +* avoided completely. */ - blk_mq_rq_update_aborted_gstate(req, 0); - blk_add_timer(req); + spin_lock_irqsave(req->q->queue_lock, flags); + if (blk_mq_rq_state(req) != MQ_RQ_COMPLETE_IN_RESET) { + blk_mq_rq_update_aborted_gstate(req, 0); + blk_add_timer(req); + } else { + blk_mq_rq_update_state(req, MQ_RQ_IN_FLIGHT); + ret = BLK_EH_HANDLED; + goto again; + } + spin_unlock_irqrestore(req->q->queue_lock, flags); break; case BLK_EH_NOT_HANDLED: break; diff --git a/block/blk-mq.h b/block/blk-mq.h index 88c558f71819..6dc242fc785a 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -35,6 +35,7 @@ enum mq_rq_state { MQ_RQ_IDLE = 0, MQ_RQ_IN_FLIGHT = 1, MQ_RQ_COMPLETE = 2, + MQ_RQ_COMPLETE_IN_RESET = 3, MQ_RQ_STATE_BITS= 2, MQ_RQ_STATE_MASK= (1 << MQ_RQ_STATE_BITS) - 1, -- 2.9.5
Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release
On Wed, 2018-04-11 at 13:02 -0700, t...@kernel.org wrote: > On Wed, Apr 11, 2018 at 08:00:29PM +, Bart Van Assche wrote: > > On Wed, 2018-04-11 at 12:57 -0700, t...@kernel.org wrote: > > > On Wed, Apr 11, 2018 at 01:55:25PM -0600, Bart Van Assche wrote: > > > > On 04/11/18 13:00, Alexandru Moise wrote: > > > > > But the root cause of it is in blkcg_init_queue() when blkg_create() > > > > > returns > > > > > an ERR ptr, because it tries to insert into a populated index into > > > > > blkcg->blkg_tree, > > > > > the entry that we fail to remove at __blk_release_queue(). > > > > > > > > Hello Alex, > > > > > > > > Had you considered something like the untested patch below? > > > > > > But queue init shouldn't fail here, right? > > > > Hello Tejun, > > > > Your question is not entirely clear to me. Are you referring to the atomic > > allocations in blkg_create() or are you perhaps referring to something else? > > Hmm.. maybe I'm confused but I thought that the fact that > blkcg_init_queue() fails itself is already a bug, which happens > because a previously destroyed queue left behind blkgs. Hello Tejun, I had missed the start of this thread so I was not aware of which problem Alex was trying to solve. In the description of v1 of this patch I read that Alex thinks that he ran into a scenario in which blk_queue_alloc_node() assigns a q->id that is still in use by another request queue? That's weird. The following code still occurs in __blk_release_queue(): ida_simple_remove(_queue_ida, q->id); It's not clear to me how that remove call could happen *before* q->id is removed from the blkcg radix tree. Bart.
Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release
On Wed, Apr 11, 2018 at 08:00:29PM +, Bart Van Assche wrote: > On Wed, 2018-04-11 at 12:57 -0700, t...@kernel.org wrote: > > On Wed, Apr 11, 2018 at 01:55:25PM -0600, Bart Van Assche wrote: > > > On 04/11/18 13:00, Alexandru Moise wrote: > > > > But the root cause of it is in blkcg_init_queue() when blkg_create() > > > > returns > > > > an ERR ptr, because it tries to insert into a populated index into > > > > blkcg->blkg_tree, > > > > the entry that we fail to remove at __blk_release_queue(). > > > > > > Hello Alex, > > > > > > Had you considered something like the untested patch below? > > > > But queue init shouldn't fail here, right? > > Hello Tejun, > > Your question is not entirely clear to me. Are you referring to the atomic > allocations in blkg_create() or are you perhaps referring to something else? Hmm.. maybe I'm confused but I thought that the fact that blkcg_init_queue() fails itself is already a bug, which happens because a previously destroyed queue left behind blkgs. Thanks. -- tejun
Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release
On Wed, 2018-04-11 at 12:57 -0700, t...@kernel.org wrote: > On Wed, Apr 11, 2018 at 01:55:25PM -0600, Bart Van Assche wrote: > > On 04/11/18 13:00, Alexandru Moise wrote: > > > But the root cause of it is in blkcg_init_queue() when blkg_create() > > > returns > > > an ERR ptr, because it tries to insert into a populated index into > > > blkcg->blkg_tree, > > > the entry that we fail to remove at __blk_release_queue(). > > > > Hello Alex, > > > > Had you considered something like the untested patch below? > > But queue init shouldn't fail here, right? Hello Tejun, Your question is not entirely clear to me. Are you referring to the atomic allocations in blkg_create() or are you perhaps referring to something else? Bart.
Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release
On Wed, Apr 11, 2018 at 01:55:25PM -0600, Bart Van Assche wrote: > On 04/11/18 13:00, Alexandru Moise wrote: > >But the root cause of it is in blkcg_init_queue() when blkg_create() returns > >an ERR ptr, because it tries to insert into a populated index into > >blkcg->blkg_tree, > >the entry that we fail to remove at __blk_release_queue(). > > Hello Alex, > > Had you considered something like the untested patch below? But queue init shouldn't fail here, right? Thanks. -- tejun
Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release
On 04/11/18 13:00, Alexandru Moise wrote: But the root cause of it is in blkcg_init_queue() when blkg_create() returns an ERR ptr, because it tries to insert into a populated index into blkcg->blkg_tree, the entry that we fail to remove at __blk_release_queue(). Hello Alex, Had you considered something like the untested patch below? Thanks, Bart. diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index 1c16694ae145..f2ced19e74b8 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -1191,14 +1191,17 @@ int blkcg_init_queue(struct request_queue *q) if (preloaded) radix_tree_preload_end(); - if (IS_ERR(blkg)) - return PTR_ERR(blkg); + if (IS_ERR(blkg)) { + ret = PTR_ERR(blkg); + goto destroy_all; + } q->root_blkg = blkg; q->root_rl.blkg = blkg; ret = blk_throtl_init(q); if (ret) { +destroy_all: spin_lock_irq(q->queue_lock); blkg_destroy_all(q); spin_unlock_irq(q->queue_lock);
Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release
On Wed, Apr 11, 2018 at 03:54:53PM +, Bart Van Assche wrote: > On Wed, 2018-04-11 at 16:28 +0200, Alexandru Moise wrote: > > [0.76] BUG: unable to handle kernel NULL pointer dereference at > > 01b4 > > [0.763350] Kernel panic - not syncing: Attempted to kill init! > > exitcode=0x0009 > > [0.763350] > > [0.76] PGD 0 P4D 0 > > [0.76] Oops: [#2] PREEMPT SMP > > [0.76] CPU: 0 PID: 6 Comm: kworker/u12:0 Tainted: G D > > 4.16.0-ARCH+ #81 > > [0.76] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > > 1.11.0-20171110_100015-anatol 04/01/20$ > > [0.76] Workqueue: nvme-reset-wq nvme_reset_work > > [0.76] RIP: 0010:blk_queue_flag_set+0xf/0x40 > > [0.76] RSP: :c91bfcb0 EFLAGS: 00010246 > > [0.76] RAX: 88003b698000 RBX: RCX: > > > > [0.76] RDX: 88003b698000 RSI: fff4 RDI: > > 001c > > [0.76] RBP: c91bfcc0 R08: R09: > > > > [0.76] R10: eaeaa980 R11: 814e0970 R12: > > 001c > > [0.76] R13: R14: R15: > > 88003aad8010 > > [0.76] FS: () GS:88003e40() > > knlGS: > > [0.76] CS: 0010 DS: ES: CR0: 80050033 > > [0.76] CR2: 01b4 CR3: 02209001 CR4: > > 000606f0 > > [0.76] Call Trace: > > [0.76] blk_mq_quiesce_queue+0x23/0x80 > > [0.76] nvme_dev_disable+0x34f/0x480 > > [0.76] ? nvme_irq+0x50/0x50 > > [0.76] ? dev_warn+0x64/0x80 > > [0.76] nvme_reset_work+0x13de/0x1570 > > [0.76] ? __switch_to_asm+0x34/0x70 > > [0.76] ? __switch_to_asm+0x40/0x70 > > [0.76] ? _raw_spin_unlock_irq+0x15/0x30 > > [0.76] ? finish_task_switch+0x156/0x210 > > [0.76] process_one_work+0x20c/0x3d0 > > [0.76] worker_thread+0x216/0x400 > > [0.76] kthread+0x125/0x130 > > [0.76] ? process_one_work+0x3d0/0x3d0 > > [0.76] ? __kthread_bind_mask+0x60/0x60 > > [0.76] ret_from_fork+0x3a/0x50 > > Hello Alexandru, > > What made you look at cgroups? In the above register dump I see that %rbx == > NULL. > I think that means that the queue pointer argument of blk_queue_flag_set() is > NULL. > The NVMe initiator driver should never pass a NULL pointer to > blk_mq_quiesce_queue(). > Please ask the NVMe driver maintainers for their opinion on the linux-nvme > mailing > list. > > Thanks, > > Bart. The %rbx == NULL is only a symptom of the cgroup mishandling, perhaps we could improve error handling in the NVMe driver, but I can say that about a lot of block drivers actually, perhaps I will write some patches in the future to improve the error handling. But the root cause of it is in blkcg_init_queue() when blkg_create() returns an ERR ptr, because it tries to insert into a populated index into blkcg->blkg_tree, the entry that we fail to remove at __blk_release_queue(). Thanks, ../Alex > > >
Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
Bart Van Assche - 11.04.18, 14:50: > On Tue, 2018-04-10 at 14:54 -0700, t...@kernel.org wrote: > > Ah, yeah, I was moving it out of add_timer but forgot to actully add > > it to the issue path. Fixed patch below. > > > > BTW, no matter what we do w/ the request handover between normal and > > timeout paths, we'd need something similar. Otherwise, while we can > > reduce the window, we can't get rid of it. > > (+Martin Steigerwald) […] > Thank you for having shared this patch. It looks interesting to me. > What I know about the blk-mq timeout handling is as follows: > * Nobody who has reviewed the blk-mq timeout handling code with this > patch applied has reported any shortcomings for that code. > * However, several people have reported kernel crashes that disappear > when the blk-mq timeout code is reworked. I'm referring to "nvme-rdma > corrupts memory upon timeout" > > (http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848 > .html) and also to a "RIP: scsi_times_out+0x17" crash during boot > (https://bugzilla.kernel.org/show_bug.cgi?id=199077). Yes, with the three patches: - '[PATCH] blk-mq_Directly schedule q->timeout_work when aborting a request.mbox' - '[PATCH v2] block: Change a rcu_read_{lock,unlock}_sched() pair into rcu_read_{lock,unlock}().mbox' - '[PATCH v4] blk-mq_Fix race conditions in request timeout handling.mbox' the occasional hangs on some boots / resumes from hibernation appear to be gone. Also it appears that the error loading SMART data issue is gone as well (see my bug report). However it is still to early to say for sure. I think I need at least 2-3 days of additional testing with this kernel to be sufficiently sure about it. However… I could also test another patch, but from reading the rest of this thread so far I have no clear on whether to try one of the new patches and if so which one and whether adding it on top of some of the patches I already applied or using it as a replacement of it. So while doing a training this and next week I can apply a patch here and then, but I won´t have much time to read the complete discussion to figure out what to apply. Personally as a stable kernel has been released with those issues, I think its good to fix it up soon. On the other hand it may take quite some time til popular distros carry 4.16 for regular users. And I have no idea how frequent the reported issues are, i.e. how many users would be affected. Thanks, -- Martin
Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release
Hello, On Wed, Apr 11, 2018 at 05:26:12PM +, Bart Van Assche wrote: > Please explain what you wrote further. It's not clear to me why you think > that anything is broken nor what a "sever model" is. So, sever (or revoke) model is where you actively disconnect an object and ensuring that there'll be no further references from others. In contrast, what we usually do is refcnting, where we don't actively sever the dying object from its users but let the users drain themselves over time and destroy the object when we know all the users are gone (recnt reaching zero). > I think we really need the blkcg_exit_queue() call in blk_cleanup_queue() > to avoid race conditions between request queue cleanup and the block cgroup > controller. Additionally, since it is guaranteed that no new requests will > be submitted to a queue after it has been marked dead I don't see why it > would make sense to keep the association between a request queue and the > blkcg controller until the last reference on a queue is dropped. Sure, new requests aren't the only source tho. e.g. there can be derefs through sysfs / cgroupfs or writeback attempts on dead devices. If you want to implement sever, you gotta hunt down all those and make sure they can't make no further derefs. Thanks. -- tejun
Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release
On Wed, 2018-04-11 at 10:15 -0700, t...@kernel.org wrote: > On Wed, Apr 11, 2018 at 05:06:41PM +, Bart Van Assche wrote: > > A simple and effective solution is to dissociate a request queue from the > > block cgroup controller before blk_cleanup_queue() returns. This is why > > commit > > a063057d7c73 ("block: Fix a race between request queue removal and the block > > cgroup controller") moved the blkcg_exit_queue() call from > > __blk_release_queue() > > into blk_cleanup_queue(). > > which is broken. We can try to switch the lifetime model to revoking > all live objects but that likely is a lot more involving than blindly > moving blkg shootdown from release to cleanup. Implementing sever > semantics is usually a lot more involved / fragile because it requires > explicit participation from all users (exactly the same way revoking > ->queue_lock is difficult). > > I'm not necessarily against switching to sever model, but what the > patch did isn't that. It just moved some code without actually > understanding or auditing what the implications are. Hello Tejun, Please explain what you wrote further. It's not clear to me why you think that anything is broken nor what a "sever model" is. I think we really need the blkcg_exit_queue() call in blk_cleanup_queue() to avoid race conditions between request queue cleanup and the block cgroup controller. Additionally, since it is guaranteed that no new requests will be submitted to a queue after it has been marked dead I don't see why it would make sense to keep the association between a request queue and the blkcg controller until the last reference on a queue is dropped. Bart.
Re: sr: get/drop reference to device in revalidate and check_events
On Wed 11-04-18 10:23:30, Jens Axboe wrote: > We can't just use scsi_cd() to get the scsi_cd structure, we have > to grab a live reference to the device. For both callbacks, we're > not inside an open where we already hold a reference to the device. > > This fixes device removal/addition under concurrent device access, > which otherwise could result in the below oops. > > NULL pointer dereference at 0010 > PGD 0 P4D 0 > Oops: [#1] PREEMPT SMP > Modules linked in: > sr 12:0:0:0: [sr2] scsi-1 drive > scsi_debug crc_t10dif crct10dif_generic crct10dif_common nvme nvme_core > sb_edac xl > sr 12:0:0:0: Attached scsi CD-ROM sr2 > sr_mod cdrom btrfs xor zstd_decompress zstd_compress xxhash lzo_compress > zlib_defc > sr 12:0:0:0: Attached scsi generic sg7 type 5 > igb ahci libahci i2c_algo_bit libata dca [last unloaded: crc_t10dif] > CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ #650 > Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016 > RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod] > RSP: 0018:883ff357bb58 EFLAGS: 00010292 > RAX: a00b07d0 RBX: 883ff3058000 RCX: 883ff357bb66 > RDX: 0003 RSI: 7530 RDI: 881fea631000 > RBP: R08: 881fe4d38400 R09: > R10: R11: 01b6 R12: 085d > R13: 085d R14: 883ffd9b3790 R15: > FS: 7f7dc8e6d8c0() GS:883fff34() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 0010 CR3: 003ffda98005 CR4: 003606e0 > DR0: DR1: DR2: > DR3: DR6: fffe0ff0 DR7: 0400 > Call Trace: > ? __invalidate_device+0x48/0x60 > check_disk_change+0x4c/0x60 > sr_block_open+0x16/0xd0 [sr_mod] > __blkdev_get+0xb9/0x450 > ? iget5_locked+0x1c0/0x1e0 > blkdev_get+0x11e/0x320 > ? bdget+0x11d/0x150 > ? _raw_spin_unlock+0xa/0x20 > ? bd_acquire+0xc0/0xc0 > do_dentry_open+0x1b0/0x320 > ? inode_permission+0x24/0xc0 > path_openat+0x4e6/0x1420 > ? cpumask_any_but+0x1f/0x40 > ? flush_tlb_mm_range+0xa0/0x120 > do_filp_open+0x8c/0xf0 > ? __seccomp_filter+0x28/0x230 > ? _raw_spin_unlock+0xa/0x20 > ? __handle_mm_fault+0x7d6/0x9b0 > ? list_lru_add+0xa8/0xc0 > ? _raw_spin_unlock+0xa/0x20 > ? __alloc_fd+0xaf/0x160 > ? do_sys_open+0x1a6/0x230 > do_sys_open+0x1a6/0x230 > do_syscall_64+0x5a/0x100 > entry_SYSCALL_64_after_hwframe+0x3d/0xa2 > > Signed-off-by: Jens AxboeLooks good to me. You can add: Reviewed-by: Jan Kara Honza > > --- > > diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c > index 0cf25d789d05..3f3cb72e0c0c 100644 > --- a/drivers/scsi/sr.c > +++ b/drivers/scsi/sr.c > @@ -587,18 +587,28 @@ static int sr_block_ioctl(struct block_device *bdev, > fmode_t mode, unsigned cmd, > static unsigned int sr_block_check_events(struct gendisk *disk, > unsigned int clearing) > { > - struct scsi_cd *cd = scsi_cd(disk); > + unsigned int ret = 0; > + struct scsi_cd *cd; > > - if (atomic_read(>device->disk_events_disable_depth)) > + cd = scsi_cd_get(disk); > + if (!cd) > return 0; > > - return cdrom_check_events(>cdi, clearing); > + if (!atomic_read(>device->disk_events_disable_depth)) > + ret = cdrom_check_events(>cdi, clearing); > + > + scsi_cd_put(cd); > + return ret; > } > > static int sr_block_revalidate_disk(struct gendisk *disk) > { > - struct scsi_cd *cd = scsi_cd(disk); > struct scsi_sense_hdr sshdr; > + struct scsi_cd *cd; > + > + cd = scsi_cd_get(disk); > + if (!cd) > + return -ENXIO; > > /* if the unit is not ready, nothing more to do */ > if (scsi_test_unit_ready(cd->device, SR_TIMEOUT, MAX_RETRIES, )) > @@ -607,6 +617,7 @@ static int sr_block_revalidate_disk(struct gendisk *disk) > sr_cd_check(>cdi); > get_sectorsize(cd); > out: > + scsi_cd_put(cd); > return 0; > } > > > -- > Jens Axboe > -- Jan Kara SUSE Labs, CR
Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release
Hello, On Wed, Apr 11, 2018 at 05:06:41PM +, Bart Van Assche wrote: > A simple and effective solution is to dissociate a request queue from the > block cgroup controller before blk_cleanup_queue() returns. This is why commit > a063057d7c73 ("block: Fix a race between request queue removal and the block > cgroup controller") moved the blkcg_exit_queue() call from > __blk_release_queue() > into blk_cleanup_queue(). which is broken. We can try to switch the lifetime model to revoking all live objects but that likely is a lot more involving than blindly moving blkg shootdown from release to cleanup. Implementing sever semantics is usually a lot more involved / fragile because it requires explicit participation from all users (exactly the same way revoking ->queue_lock is difficult). I'm not necessarily against switching to sever model, but what the patch did isn't that. It just moved some code without actually understanding or auditing what the implications are. Thanks. -- tejun
Re: sr: get/drop reference to device in revalidate and check_events
On 04/11/2018 09:23 AM, Jens Axboe wrote: > We can't just use scsi_cd() to get the scsi_cd structure, we have > to grab a live reference to the device. For both callbacks, we're > not inside an open where we already hold a reference to the device. > > This fixes device removal/addition under concurrent device access, > which otherwise could result in the below oops. > > NULL pointer dereference at 0010 > PGD 0 P4D 0 > Oops: [#1] PREEMPT SMP > Modules linked in: > sr 12:0:0:0: [sr2] scsi-1 drive > scsi_debug crc_t10dif crct10dif_generic crct10dif_common nvme nvme_core > sb_edac xl > sr 12:0:0:0: Attached scsi CD-ROM sr2 > sr_mod cdrom btrfs xor zstd_decompress zstd_compress xxhash lzo_compress > zlib_defc > sr 12:0:0:0: Attached scsi generic sg7 type 5 > igb ahci libahci i2c_algo_bit libata dca [last unloaded: crc_t10dif] > CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ #650 > Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016 > RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod] > RSP: 0018:883ff357bb58 EFLAGS: 00010292 > RAX: a00b07d0 RBX: 883ff3058000 RCX: 883ff357bb66 > RDX: 0003 RSI: 7530 RDI: 881fea631000 > RBP: R08: 881fe4d38400 R09: > R10: R11: 01b6 R12: 085d > R13: 085d R14: 883ffd9b3790 R15: > FS: 7f7dc8e6d8c0() GS:883fff34() knlGS: > CS: 0010 DS: ES: CR0: 80050033 > CR2: 0010 CR3: 003ffda98005 CR4: 003606e0 > DR0: DR1: DR2: > DR3: DR6: fffe0ff0 DR7: 0400 > Call Trace: > ? __invalidate_device+0x48/0x60 > check_disk_change+0x4c/0x60 > sr_block_open+0x16/0xd0 [sr_mod] > __blkdev_get+0xb9/0x450 > ? iget5_locked+0x1c0/0x1e0 > blkdev_get+0x11e/0x320 > ? bdget+0x11d/0x150 > ? _raw_spin_unlock+0xa/0x20 > ? bd_acquire+0xc0/0xc0 > do_dentry_open+0x1b0/0x320 > ? inode_permission+0x24/0xc0 > path_openat+0x4e6/0x1420 > ? cpumask_any_but+0x1f/0x40 > ? flush_tlb_mm_range+0xa0/0x120 > do_filp_open+0x8c/0xf0 > ? __seccomp_filter+0x28/0x230 > ? _raw_spin_unlock+0xa/0x20 > ? __handle_mm_fault+0x7d6/0x9b0 > ? list_lru_add+0xa8/0xc0 > ? _raw_spin_unlock+0xa/0x20 > ? __alloc_fd+0xaf/0x160 > ? do_sys_open+0x1a6/0x230 > do_sys_open+0x1a6/0x230 > do_syscall_64+0x5a/0x100 > entry_SYSCALL_64_after_hwframe+0x3d/0xa2 > > Signed-off-by: Jens Axboe> > --- > > diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c > index 0cf25d789d05..3f3cb72e0c0c 100644 > --- a/drivers/scsi/sr.c > +++ b/drivers/scsi/sr.c > @@ -587,18 +587,28 @@ static int sr_block_ioctl(struct block_device *bdev, > fmode_t mode, unsigned cmd, > static unsigned int sr_block_check_events(struct gendisk *disk, > unsigned int clearing) > { > - struct scsi_cd *cd = scsi_cd(disk); > + unsigned int ret = 0; > + struct scsi_cd *cd; > > - if (atomic_read(>device->disk_events_disable_depth)) > + cd = scsi_cd_get(disk); > + if (!cd) > return 0; > > - return cdrom_check_events(>cdi, clearing); > + if (!atomic_read(>device->disk_events_disable_depth)) > + ret = cdrom_check_events(>cdi, clearing); > + > + scsi_cd_put(cd); > + return ret; > } > > static int sr_block_revalidate_disk(struct gendisk *disk) > { > - struct scsi_cd *cd = scsi_cd(disk); > struct scsi_sense_hdr sshdr; > + struct scsi_cd *cd; > + > + cd = scsi_cd_get(disk); > + if (!cd) > + return -ENXIO; > > /* if the unit is not ready, nothing more to do */ > if (scsi_test_unit_ready(cd->device, SR_TIMEOUT, MAX_RETRIES, )) > @@ -607,6 +617,7 @@ static int sr_block_revalidate_disk(struct gendisk *disk) > sr_cd_check(>cdi); > get_sectorsize(cd); > out: > + scsi_cd_put(cd); > return 0; > } > > Reviewed-by: Lee Duncan
RE: [PATCH] block: ratelimite pr_err on IO path
> -Original Message- > From: linux-kernel-ow...@vger.kernel.org [mailto:linux-kernel- > ow...@vger.kernel.org] On Behalf Of Jack Wang > Sent: Wednesday, April 11, 2018 8:21 AM > Subject: [PATCH] block: ratelimite pr_err on IO path > > From: Jack Wang... > - pr_err("%s: ref tag error at location %llu " \ > -"(rcvd %u)\n", iter->disk_name, > -(unsigned long long) > -iter->seed, be32_to_cpu(pi->ref_tag)); > + pr_err_ratelimited("%s: ref tag error at " > +"location %llu (rcvd %u)\n", Per process/coding-style.rst, you should keep a string like that on one line even if that exceeds 80 columns: Statements longer than 80 columns will be broken into sensible chunks, unless exceeding 80 columns significantly increases readability and does not hide information. ... However, never break user-visible strings such as printk messages, because that breaks the ability to grep for them.
Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release
On Wed, 2018-04-11 at 10:00 -0700, t...@kernel.org wrote: > On Wed, Apr 11, 2018 at 04:42:55PM +, Bart Van Assche wrote: > > On Wed, 2018-04-11 at 07:56 -0700, Tejun Heo wrote: > > > And looking at the change, it looks like the right thing we should > > > have done is caching @lock on the print_blkg side and when switching > > > locks make sure both locks are held. IOW, do the following in > > > blk_cleanup_queue() > > > > > > spin_lock_irq(lock); > > > if (q->queue_lock != >__queue_lock) { > > > spin_lock(>__queue_lock); > > > q->queue_lock = >__queue_lock; > > > spin_unlock(>__queue_lock); > > > } > > > spin_unlock_irq(lock); > > > > > > Otherwise, there can be two lock holders thinking they have exclusive > > > access to the request_queue. > > > > I think that's a bad idea. A block driver is allowed to destroy the > > spinlock it associated with the request queue as soon as blk_cleanup_queue() > > has finished. If the block cgroup controller would cache a pointer to the > > block driver spinlock then that could cause the cgroup code to attempt to > > lock a spinlock after it has been destroyed. I don't think we need that kind > > of race conditions. > > I see, but that problem is there with or without caching as long as we > have queu_lock usage which reach beyond cleanup_queue, right? Whether > that user caches the lock for matching unlocking or not doesn't really > change the situation. > > Short of adding protection around queue_lock switching, I can't think > of a solution tho. Probably the right thing to do is adding queue > lock/unlock helpers which are safe to use beyond cleanup_queue. Hello Tejun, A simple and effective solution is to dissociate a request queue from the block cgroup controller before blk_cleanup_queue() returns. This is why commit a063057d7c73 ("block: Fix a race between request queue removal and the block cgroup controller") moved the blkcg_exit_queue() call from __blk_release_queue() into blk_cleanup_queue(). Thanks, Bart.
Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release
Hello, On Wed, Apr 11, 2018 at 04:42:55PM +, Bart Van Assche wrote: > On Wed, 2018-04-11 at 07:56 -0700, Tejun Heo wrote: > > And looking at the change, it looks like the right thing we should > > have done is caching @lock on the print_blkg side and when switching > > locks make sure both locks are held. IOW, do the following in > > blk_cleanup_queue() > > > > spin_lock_irq(lock); > > if (q->queue_lock != >__queue_lock) { > > spin_lock(>__queue_lock); > > q->queue_lock = >__queue_lock; > > spin_unlock(>__queue_lock); > > } > > spin_unlock_irq(lock); > > > > Otherwise, there can be two lock holders thinking they have exclusive > > access to the request_queue. > > I think that's a bad idea. A block driver is allowed to destroy the > spinlock it associated with the request queue as soon as blk_cleanup_queue() > has finished. If the block cgroup controller would cache a pointer to the > block driver spinlock then that could cause the cgroup code to attempt to > lock a spinlock after it has been destroyed. I don't think we need that kind > of race conditions. I see, but that problem is there with or without caching as long as we have queu_lock usage which reach beyond cleanup_queue, right? Whether that user caches the lock for matching unlocking or not doesn't really change the situation. Short of adding protection around queue_lock switching, I can't think of a solution tho. Probably the right thing to do is adding queue lock/unlock helpers which are safe to use beyond cleanup_queue. Thanks. -- tejun
Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release
On Wed, 2018-04-11 at 07:56 -0700, Tejun Heo wrote: > And looking at the change, it looks like the right thing we should > have done is caching @lock on the print_blkg side and when switching > locks make sure both locks are held. IOW, do the following in > blk_cleanup_queue() > > spin_lock_irq(lock); > if (q->queue_lock != >__queue_lock) { > spin_lock(>__queue_lock); > q->queue_lock = >__queue_lock; > spin_unlock(>__queue_lock); > } > spin_unlock_irq(lock); > > Otherwise, there can be two lock holders thinking they have exclusive > access to the request_queue. I think that's a bad idea. A block driver is allowed to destroy the spinlock it associated with the request queue as soon as blk_cleanup_queue() has finished. If the block cgroup controller would cache a pointer to the block driver spinlock then that could cause the cgroup code to attempt to lock a spinlock after it has been destroyed. I don't think we need that kind of race conditions. Bart.
sr: get/drop reference to device in revalidate and check_events
We can't just use scsi_cd() to get the scsi_cd structure, we have to grab a live reference to the device. For both callbacks, we're not inside an open where we already hold a reference to the device. This fixes device removal/addition under concurrent device access, which otherwise could result in the below oops. NULL pointer dereference at 0010 PGD 0 P4D 0 Oops: [#1] PREEMPT SMP Modules linked in: sr 12:0:0:0: [sr2] scsi-1 drive scsi_debug crc_t10dif crct10dif_generic crct10dif_common nvme nvme_core sb_edac xl sr 12:0:0:0: Attached scsi CD-ROM sr2 sr_mod cdrom btrfs xor zstd_decompress zstd_compress xxhash lzo_compress zlib_defc sr 12:0:0:0: Attached scsi generic sg7 type 5 igb ahci libahci i2c_algo_bit libata dca [last unloaded: crc_t10dif] CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ #650 Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016 RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod] RSP: 0018:883ff357bb58 EFLAGS: 00010292 RAX: a00b07d0 RBX: 883ff3058000 RCX: 883ff357bb66 RDX: 0003 RSI: 7530 RDI: 881fea631000 RBP: R08: 881fe4d38400 R09: R10: R11: 01b6 R12: 085d R13: 085d R14: 883ffd9b3790 R15: FS: 7f7dc8e6d8c0() GS:883fff34() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 0010 CR3: 003ffda98005 CR4: 003606e0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: ? __invalidate_device+0x48/0x60 check_disk_change+0x4c/0x60 sr_block_open+0x16/0xd0 [sr_mod] __blkdev_get+0xb9/0x450 ? iget5_locked+0x1c0/0x1e0 blkdev_get+0x11e/0x320 ? bdget+0x11d/0x150 ? _raw_spin_unlock+0xa/0x20 ? bd_acquire+0xc0/0xc0 do_dentry_open+0x1b0/0x320 ? inode_permission+0x24/0xc0 path_openat+0x4e6/0x1420 ? cpumask_any_but+0x1f/0x40 ? flush_tlb_mm_range+0xa0/0x120 do_filp_open+0x8c/0xf0 ? __seccomp_filter+0x28/0x230 ? _raw_spin_unlock+0xa/0x20 ? __handle_mm_fault+0x7d6/0x9b0 ? list_lru_add+0xa8/0xc0 ? _raw_spin_unlock+0xa/0x20 ? __alloc_fd+0xaf/0x160 ? do_sys_open+0x1a6/0x230 do_sys_open+0x1a6/0x230 do_syscall_64+0x5a/0x100 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Signed-off-by: Jens Axboe--- diff --git a/drivers/scsi/sr.c b/drivers/scsi/sr.c index 0cf25d789d05..3f3cb72e0c0c 100644 --- a/drivers/scsi/sr.c +++ b/drivers/scsi/sr.c @@ -587,18 +587,28 @@ static int sr_block_ioctl(struct block_device *bdev, fmode_t mode, unsigned cmd, static unsigned int sr_block_check_events(struct gendisk *disk, unsigned int clearing) { - struct scsi_cd *cd = scsi_cd(disk); + unsigned int ret = 0; + struct scsi_cd *cd; - if (atomic_read(>device->disk_events_disable_depth)) + cd = scsi_cd_get(disk); + if (!cd) return 0; - return cdrom_check_events(>cdi, clearing); + if (!atomic_read(>device->disk_events_disable_depth)) + ret = cdrom_check_events(>cdi, clearing); + + scsi_cd_put(cd); + return ret; } static int sr_block_revalidate_disk(struct gendisk *disk) { - struct scsi_cd *cd = scsi_cd(disk); struct scsi_sense_hdr sshdr; + struct scsi_cd *cd; + + cd = scsi_cd_get(disk); + if (!cd) + return -ENXIO; /* if the unit is not ready, nothing more to do */ if (scsi_test_unit_ready(cd->device, SR_TIMEOUT, MAX_RETRIES, )) @@ -607,6 +617,7 @@ static int sr_block_revalidate_disk(struct gendisk *disk) sr_cd_check(>cdi); get_sectorsize(cd); out: + scsi_cd_put(cd); return 0; } -- Jens Axboe
Re: Hotplug
On 4/11/18 10:14 AM, Jan Kara wrote: > Hello, > > On Wed 11-04-18 08:11:13, Jens Axboe wrote: >> On 4/11/18 7:58 AM, Jan Kara wrote: >>> On Tue 10-04-18 11:17:46, Jens Axboe wrote: Been running some tests and I keep running into issues with hotplug. This looks similar to what Bart posted the other day, but it looks more deeply rooted than just having to protect the queue in generic_make_request_checks(). The test run is blktests, block/001. Current -git doesn't survive it. I've seen at least two different oopses, pasted below. [ 102.163442] NULL pointer dereference at 0010 [ 102.163444] PGD 0 P4D 0 [ 102.163447] Oops: [#1] PREEMPT SMP [ 102.163449] Modules linked in: [ 102.175540] sr 12:0:0:0: [sr2] scsi-1 drive [ 102.180112] scsi_debug crc_t10dif crct10dif_generic crct10dif_common nvme nvme_core sb_edac xl [ 102.186934] sr 12:0:0:0: Attached scsi CD-ROM sr2 [ 102.191896] sr_mod cdrom btrfs xor zstd_decompress zstd_compress xxhash lzo_compress zlib_defc [ 102.197169] sr 12:0:0:0: Attached scsi generic sg7 type 5 [ 102.203475] igb ahci libahci i2c_algo_bit libata dca [last unloaded: crc_t10dif] [ 102.203484] CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ #650 [ 102.203487] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 11/09/2016 [ 102.350882] RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod] [ 102.358299] RSP: 0018:883ff357bb58 EFLAGS: 00010292 [ 102.364734] RAX: a00b07d0 RBX: 883ff3058000 RCX: 883ff357bb66 [ 102.373220] RDX: 0003 RSI: 7530 RDI: 881fea631000 [ 102.381705] RBP: R08: 881fe4d38400 R09: [ 102.390185] R10: R11: 01b6 R12: 085d [ 102.398671] R13: 085d R14: 883ffd9b3790 R15: [ 102.407156] FS: 7f7dc8e6d8c0() GS:883fff34() knlGS: [ 102.417138] CS: 0010 DS: ES: CR0: 80050033 [ 102.424066] CR2: 0010 CR3: 003ffda98005 CR4: 003606e0 [ 102.432545] DR0: DR1: DR2: [ 102.441024] DR3: DR6: fffe0ff0 DR7: 0400 [ 102.449502] Call Trace: [ 102.452744] ? __invalidate_device+0x48/0x60 [ 102.458022] check_disk_change+0x4c/0x60 [ 102.462900] sr_block_open+0x16/0xd0 [sr_mod] [ 102.468270] __blkdev_get+0xb9/0x450 [ 102.472774] ? iget5_locked+0x1c0/0x1e0 [ 102.477568] blkdev_get+0x11e/0x320 [ 102.481969] ? bdget+0x11d/0x150 [ 102.486083] ? _raw_spin_unlock+0xa/0x20 [ 102.490968] ? bd_acquire+0xc0/0xc0 [ 102.495368] do_dentry_open+0x1b0/0x320 [ 102.500159] ? inode_permission+0x24/0xc0 [ 102.505140] path_openat+0x4e6/0x1420 [ 102.509741] ? cpumask_any_but+0x1f/0x40 [ 102.514630] ? flush_tlb_mm_range+0xa0/0x120 [ 102.519903] do_filp_open+0x8c/0xf0 [ 102.524305] ? __seccomp_filter+0x28/0x230 [ 102.529389] ? _raw_spin_unlock+0xa/0x20 [ 102.534283] ? __handle_mm_fault+0x7d6/0x9b0 [ 102.539559] ? list_lru_add+0xa8/0xc0 [ 102.544157] ? _raw_spin_unlock+0xa/0x20 [ 102.549047] ? __alloc_fd+0xaf/0x160 [ 102.553549] ? do_sys_open+0x1a6/0x230 [ 102.558244] do_sys_open+0x1a6/0x230 [ 102.562742] do_syscall_64+0x5a/0x100 [ 102.567336] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 >>> >>> Interesting. Thinking out loud: This is cd->device dereference I guess >>> which means disk->private_data was NULL. That gets set in sr_probe() >>> together with disk->fops which are certainly set as they must have led us >>> to the crashing function sr_block_revalidate_disk(). So likely >>> disk->private_data got already cleared. That happens in sr_kref_release() >>> and the fact that that function got called means struct scsi_cd went away - >>> so sr_remove() must have been called as well. That all seems possible like: >>> >>> CPU1CPU2 >>> sr_probe() >>> __blkdev_get() >>> disk = bdev_get_gendisk(); >>> >>> sr_remove() >>> del_gendisk() >>> ... >>> kref_put(>kref, sr_kref_release); >>> disk->private_data = NULL; >>> put_disk(disk); >>> kfree(cd); >>> if (disk->fops->open) { >>> ret = disk->fops->open(bdev, mode); => sr_block_open >>> check_disk_change(bdev); >>> sr_block_revalidate_disk() >>> CRASH >>> >>> And I think the problem is in sr_block_revalidate_disk() itself as the >>> scsi_cd() call is not guaranteed that the caller holds reference to 'cd' >>> and thus that 'cd' does not disappear under
Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
On 4/11/2018 5:24 PM, Sagi Grimberg wrote: Ah, yeah, I was moving it out of add_timer but forgot to actully add it to the issue path. Fixed patch below. BTW, no matter what we do w/ the request handover between normal and timeout paths, we'd need something similar. Otherwise, while we can reduce the window, we can't get rid of it. Thanks. Just noticed this one, this looks interesting to me as well. Israel, can you run your test with this patch? Yes, I just did and it looks good. ---  block/blk-mq.c |  45 -  include/linux/blkdev.h |   2 ++  2 files changed, 34 insertions(+), 13 deletions(-) --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ  hctx_lock(hctx, _idx);  if (blk_mq_rq_aborted_gstate(rq) != rq->gstate)  __blk_mq_complete_request(rq); +   else +   rq->missed_completion = true;  hctx_unlock(hctx, srcu_idx);  }  EXPORT_SYMBOL(blk_mq_complete_request); @@ -684,6 +686,7 @@ void blk_mq_start_request(struct request   blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);  blk_add_timer(rq); +   rq->missed_completion = false;   write_seqcount_end(>gstate_seq);  preempt_enable(); @@ -881,7 +884,7 @@ static void blk_mq_check_expired(struct  }  }  -static void blk_mq_timeout_sync_rcu(struct request_queue *q) +static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear)  {  struct blk_mq_hw_ctx *hctx;  bool has_rcu = false; @@ -896,7 +899,8 @@ static void blk_mq_timeout_sync_rcu(stru  else  synchronize_srcu(hctx->srcu);  -   hctx->need_sync_rcu = false; +   if (clear) +   hctx->need_sync_rcu = false;  }  if (has_rcu)  synchronize_rcu(); @@ -917,25 +921,37 @@ static void blk_mq_terminate_expired(str  blk_mq_rq_timed_out(hctx, rq, priv, reserved);  }  -static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx, +static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx,  struct request *rq, void *priv, bool reserved)  {  /*   * @rq's timer reset has gone through rcu synchronization and is   * visible now. Allow normal completions again by resetting   * ->aborted_gstate. Don't clear RQF_MQ_TIMEOUT_RESET here as - * there's no memory ordering around ->aborted_gstate making it the - * only field safe to update. Let blk_add_timer() clear it later - * when the request is recycled or times out again. - * - * As nothing prevents from completion happening while - * ->aborted_gstate is set, this may lead to ignored completions - * and further spurious timeouts. + * blk_mq_timeout_reset_cleanup() needs it again and there's no + * memory ordering around ->aborted_gstate making it the only field + * safe to update. Let blk_add_timer() clear it later when the + * request is recycled or times out again.   */  if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET)  blk_mq_rq_update_aborted_gstate(rq, 0);  }  +static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx, +   struct request *rq, void *priv, bool reserved) +{ +   /* + * @rq is now fully returned to the normal path. If normal + * completion raced timeout handling, execute the missed completion + * here. This is safe because 1. ->missed_completion can no longer + * be asserted because nothing is timing out right now and 2. if + * ->missed_completion is set, @rq is safe from recycling because + * nobody could have completed it. + */ +   if ((rq->rq_flags & RQF_MQ_TIMEOUT_RESET) && rq->missed_completion) +   blk_mq_complete_request(rq); +} +  static void blk_mq_timeout_work(struct work_struct *work)  {  struct request_queue *q = @@ -976,7 +992,7 @@ static void blk_mq_timeout_work(struct w   * becomes a problem, we can add per-hw_ctx rcu_head and   * wait in parallel.   */ -   blk_mq_timeout_sync_rcu(q); +   blk_mq_timeout_sync_rcu(q, true);   /* terminate the ones we won */  blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, @@ -988,9 +1004,12 @@ static void blk_mq_timeout_work(struct w   * reset racing against recycling.   */  if (nr_resets) { -   blk_mq_timeout_sync_rcu(q); +   blk_mq_timeout_sync_rcu(q, false); +   blk_mq_queue_tag_busy_iter(q, +   blk_mq_timeout_reset_return, NULL); +   blk_mq_timeout_sync_rcu(q, true);  blk_mq_queue_tag_busy_iter(q, -   blk_mq_finish_timeout_reset, NULL); +   blk_mq_timeout_reset_cleanup, NULL);  }  }  --- a/include/linux/blkdev.h +++ b/include/linux/blkdev.h @@ -227,6 +227,8 @@ struct request {   unsigned int extra_len;   /* length of alignment and padding */  +   bool missed_completion; +
Re: Hotplug
Hello, On Wed 11-04-18 08:11:13, Jens Axboe wrote: > On 4/11/18 7:58 AM, Jan Kara wrote: > > On Tue 10-04-18 11:17:46, Jens Axboe wrote: > >> Been running some tests and I keep running into issues with hotplug. > >> This looks similar to what Bart posted the other day, but it looks > >> more deeply rooted than just having to protect the queue in > >> generic_make_request_checks(). The test run is blktests, > >> block/001. Current -git doesn't survive it. I've seen at least two > >> different oopses, pasted below. > >> > >> [ 102.163442] NULL pointer dereference at 0010 > >> [ 102.163444] PGD 0 P4D 0 > >> [ 102.163447] Oops: [#1] PREEMPT SMP > >> [ 102.163449] Modules linked in: > >> [ 102.175540] sr 12:0:0:0: [sr2] scsi-1 drive > >> [ 102.180112] scsi_debug crc_t10dif crct10dif_generic crct10dif_common > >> nvme nvme_core sb_edac xl > >> [ 102.186934] sr 12:0:0:0: Attached scsi CD-ROM sr2 > >> [ 102.191896] sr_mod cdrom btrfs xor zstd_decompress zstd_compress > >> xxhash lzo_compress zlib_defc > >> [ 102.197169] sr 12:0:0:0: Attached scsi generic sg7 type 5 > >> [ 102.203475] igb ahci libahci i2c_algo_bit libata dca [last unloaded: > >> crc_t10dif] > >> [ 102.203484] CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ > >> #650 > >> [ 102.203487] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 > >> 11/09/2016 > >> [ 102.350882] RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod] > >> [ 102.358299] RSP: 0018:883ff357bb58 EFLAGS: 00010292 > >> [ 102.364734] RAX: a00b07d0 RBX: 883ff3058000 RCX: > >> 883ff357bb66 > >> [ 102.373220] RDX: 0003 RSI: 7530 RDI: > >> 881fea631000 > >> [ 102.381705] RBP: R08: 881fe4d38400 R09: > >> > >> [ 102.390185] R10: R11: 01b6 R12: > >> 085d > >> [ 102.398671] R13: 085d R14: 883ffd9b3790 R15: > >> > >> [ 102.407156] FS: 7f7dc8e6d8c0() GS:883fff34() > >> knlGS: > >> [ 102.417138] CS: 0010 DS: ES: CR0: 80050033 > >> [ 102.424066] CR2: 0010 CR3: 003ffda98005 CR4: > >> 003606e0 > >> [ 102.432545] DR0: DR1: DR2: > >> > >> [ 102.441024] DR3: DR6: fffe0ff0 DR7: > >> 0400 > >> [ 102.449502] Call Trace: > >> [ 102.452744] ? __invalidate_device+0x48/0x60 > >> [ 102.458022] check_disk_change+0x4c/0x60 > >> [ 102.462900] sr_block_open+0x16/0xd0 [sr_mod] > >> [ 102.468270] __blkdev_get+0xb9/0x450 > >> [ 102.472774] ? iget5_locked+0x1c0/0x1e0 > >> [ 102.477568] blkdev_get+0x11e/0x320 > >> [ 102.481969] ? bdget+0x11d/0x150 > >> [ 102.486083] ? _raw_spin_unlock+0xa/0x20 > >> [ 102.490968] ? bd_acquire+0xc0/0xc0 > >> [ 102.495368] do_dentry_open+0x1b0/0x320 > >> [ 102.500159] ? inode_permission+0x24/0xc0 > >> [ 102.505140] path_openat+0x4e6/0x1420 > >> [ 102.509741] ? cpumask_any_but+0x1f/0x40 > >> [ 102.514630] ? flush_tlb_mm_range+0xa0/0x120 > >> [ 102.519903] do_filp_open+0x8c/0xf0 > >> [ 102.524305] ? __seccomp_filter+0x28/0x230 > >> [ 102.529389] ? _raw_spin_unlock+0xa/0x20 > >> [ 102.534283] ? __handle_mm_fault+0x7d6/0x9b0 > >> [ 102.539559] ? list_lru_add+0xa8/0xc0 > >> [ 102.544157] ? _raw_spin_unlock+0xa/0x20 > >> [ 102.549047] ? __alloc_fd+0xaf/0x160 > >> [ 102.553549] ? do_sys_open+0x1a6/0x230 > >> [ 102.558244] do_sys_open+0x1a6/0x230 > >> [ 102.562742] do_syscall_64+0x5a/0x100 > >> [ 102.567336] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 > > > > Interesting. Thinking out loud: This is cd->device dereference I guess > > which means disk->private_data was NULL. That gets set in sr_probe() > > together with disk->fops which are certainly set as they must have led us > > to the crashing function sr_block_revalidate_disk(). So likely > > disk->private_data got already cleared. That happens in sr_kref_release() > > and the fact that that function got called means struct scsi_cd went away - > > so sr_remove() must have been called as well. That all seems possible like: > > > > CPU1CPU2 > > sr_probe() > > __blkdev_get() > > disk = bdev_get_gendisk(); > > > > sr_remove() > > del_gendisk() > > ... > > kref_put(>kref, sr_kref_release); > > disk->private_data = NULL; > > put_disk(disk); > > kfree(cd); > > if (disk->fops->open) { > > ret = disk->fops->open(bdev, mode); => sr_block_open > > check_disk_change(bdev); > > sr_block_revalidate_disk() > > CRASH > > > > And I think the problem is in sr_block_revalidate_disk() itself as the > > scsi_cd() call is not guaranteed that the caller holds reference to 'cd' > > and thus that 'cd' does not disappear under it. IMHO it needs to use > >
Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release
On Wed, 2018-04-11 at 16:28 +0200, Alexandru Moise wrote: > [0.76] BUG: unable to handle kernel NULL pointer dereference at > 01b4 > [0.763350] Kernel panic - not syncing: Attempted to kill init! > exitcode=0x0009 > [0.763350] > [0.76] PGD 0 P4D 0 > [0.76] Oops: [#2] PREEMPT SMP > [0.76] CPU: 0 PID: 6 Comm: kworker/u12:0 Tainted: G D > 4.16.0-ARCH+ #81 > [0.76] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > 1.11.0-20171110_100015-anatol 04/01/20$ > [0.76] Workqueue: nvme-reset-wq nvme_reset_work > [0.76] RIP: 0010:blk_queue_flag_set+0xf/0x40 > [0.76] RSP: :c91bfcb0 EFLAGS: 00010246 > [0.76] RAX: 88003b698000 RBX: RCX: > > [0.76] RDX: 88003b698000 RSI: fff4 RDI: > 001c > [0.76] RBP: c91bfcc0 R08: R09: > > [0.76] R10: eaeaa980 R11: 814e0970 R12: > 001c > [0.76] R13: R14: R15: > 88003aad8010 > [0.76] FS: () GS:88003e40() > knlGS: > [0.76] CS: 0010 DS: ES: CR0: 80050033 > [0.76] CR2: 01b4 CR3: 02209001 CR4: > 000606f0 > [0.76] Call Trace: > [0.76] blk_mq_quiesce_queue+0x23/0x80 > [0.76] nvme_dev_disable+0x34f/0x480 > [0.76] ? nvme_irq+0x50/0x50 > [0.76] ? dev_warn+0x64/0x80 > [0.76] nvme_reset_work+0x13de/0x1570 > [0.76] ? __switch_to_asm+0x34/0x70 > [0.76] ? __switch_to_asm+0x40/0x70 > [0.76] ? _raw_spin_unlock_irq+0x15/0x30 > [0.76] ? finish_task_switch+0x156/0x210 > [0.76] process_one_work+0x20c/0x3d0 > [0.76] worker_thread+0x216/0x400 > [0.76] kthread+0x125/0x130 > [0.76] ? process_one_work+0x3d0/0x3d0 > [0.76] ? __kthread_bind_mask+0x60/0x60 > [0.76] ret_from_fork+0x3a/0x50 Hello Alexandru, What made you look at cgroups? In the above register dump I see that %rbx == NULL. I think that means that the queue pointer argument of blk_queue_flag_set() is NULL. The NVMe initiator driver should never pass a NULL pointer to blk_mq_quiesce_queue(). Please ask the NVMe driver maintainers for their opinion on the linux-nvme mailing list. Thanks, Bart.
Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release
Hello, again. On Wed, Apr 11, 2018 at 07:51:23AM -0700, Tejun Heo wrote: > Oh, it wasn't Joseph's change. It was Bart's fix for a problem > reported by Joseph. Bart, a063057d7c73 ("block: Fix a race between > request queue removal and the block cgroup controller") created a > regression where a request_queue can be destroyed with blkgs still > attached. The original report is.. > > http://lkml.kernel.org/r/20180407102148.ga9...@gmail.com And looking at the change, it looks like the right thing we should have done is caching @lock on the print_blkg side and when switching locks make sure both locks are held. IOW, do the following in blk_cleanup_queue() spin_lock_irq(lock); if (q->queue_lock != >__queue_lock) { spin_lock(>__queue_lock); q->queue_lock = >__queue_lock; spin_unlock(>__queue_lock); } spin_unlock_irq(lock); Otherwise, there can be two lock holders thinking they have exclusive access to the request_queue. Thanks. -- tejun
Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release
Hello, (cc'ing Bart) On Wed, Apr 11, 2018 at 07:46:16AM -0700, Tejun Heo wrote: > Hello, > > On Wed, Apr 11, 2018 at 04:28:59PM +0200, Alexandru Moise wrote: > > > Ah, that changed recently. Can you please check out the current > > > upstream git master? > > > > > Just did, without my patch I see this crash: > > lol I was looking at the old tree, so this is the fix for the new > breakage introduced by the recent change. Sorry about the confusion. > Joseph, can you please take a look? Oh, it wasn't Joseph's change. It was Bart's fix for a problem reported by Joseph. Bart, a063057d7c73 ("block: Fix a race between request queue removal and the block cgroup controller") created a regression where a request_queue can be destroyed with blkgs still attached. The original report is.. http://lkml.kernel.org/r/20180407102148.ga9...@gmail.com Thanks. -- tejun
Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release
Hello, On Wed, Apr 11, 2018 at 04:28:59PM +0200, Alexandru Moise wrote: > > Ah, that changed recently. Can you please check out the current > > upstream git master? > > > Just did, without my patch I see this crash: lol I was looking at the old tree, so this is the fix for the new breakage introduced by the recent change. Sorry about the confusion. Joseph, can you please take a look? Thanks. -- tejun
Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
Hello, On Wed, Apr 11, 2018 at 05:24:49PM +0300, Sagi Grimberg wrote: > > >Ah, yeah, I was moving it out of add_timer but forgot to actully add > >it to the issue path. Fixed patch below. > > > >BTW, no matter what we do w/ the request handover between normal and > >timeout paths, we'd need something similar. Otherwise, while we can > >reduce the window, we can't get rid of it. > > > >Thanks. > > Just noticed this one, this looks interesting to me as well. Israel, > can you run your test with this patch? The following is the combined patch with one debug message to see whether missed completions are actually happening. Thanks. Index: work/block/blk-mq.c === --- work.orig/block/blk-mq.c +++ work/block/blk-mq.c @@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ hctx_lock(hctx, _idx); if (blk_mq_rq_aborted_gstate(rq) != rq->gstate) __blk_mq_complete_request(rq); + else + rq->missed_completion = true; hctx_unlock(hctx, srcu_idx); } EXPORT_SYMBOL(blk_mq_complete_request); @@ -684,6 +686,7 @@ void blk_mq_start_request(struct request blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT); blk_add_timer(rq); + rq->missed_completion = false; write_seqcount_end(>gstate_seq); preempt_enable(); @@ -818,7 +821,8 @@ struct blk_mq_timeout_data { unsigned int nr_expired; }; -static void blk_mq_rq_timed_out(struct request *req, bool reserved) +static void blk_mq_rq_timed_out(struct blk_mq_hw_ctx *hctx, struct request *req, + int *nr_resets, bool reserved) { const struct blk_mq_ops *ops = req->q->mq_ops; enum blk_eh_timer_return ret = BLK_EH_RESET_TIMER; @@ -833,13 +837,10 @@ static void blk_mq_rq_timed_out(struct r __blk_mq_complete_request(req); break; case BLK_EH_RESET_TIMER: - /* -* As nothing prevents from completion happening while -* ->aborted_gstate is set, this may lead to ignored -* completions and further spurious timeouts. -*/ - blk_mq_rq_update_aborted_gstate(req, 0); blk_add_timer(req); + req->rq_flags |= RQF_MQ_TIMEOUT_RESET; + (*nr_resets)++; + hctx->need_sync_rcu = true; break; case BLK_EH_NOT_HANDLED: break; @@ -876,13 +877,35 @@ static void blk_mq_check_expired(struct time_after_eq(jiffies, deadline)) { blk_mq_rq_update_aborted_gstate(rq, gstate); data->nr_expired++; - hctx->nr_expired++; + hctx->need_sync_rcu = true; } else if (!data->next_set || time_after(data->next, deadline)) { data->next = deadline; data->next_set = 1; } } +static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear) +{ + struct blk_mq_hw_ctx *hctx; + bool has_rcu = false; + int i; + + queue_for_each_hw_ctx(q, hctx, i) { + if (!hctx->need_sync_rcu) + continue; + + if (!(hctx->flags & BLK_MQ_F_BLOCKING)) + has_rcu = true; + else + synchronize_srcu(hctx->srcu); + + if (clear) + hctx->need_sync_rcu = false; + } + if (has_rcu) + synchronize_rcu(); +} + static void blk_mq_terminate_expired(struct blk_mq_hw_ctx *hctx, struct request *rq, void *priv, bool reserved) { @@ -895,7 +918,45 @@ static void blk_mq_terminate_expired(str */ if (!(rq->rq_flags & RQF_MQ_TIMEOUT_EXPIRED) && READ_ONCE(rq->gstate) == rq->aborted_gstate) - blk_mq_rq_timed_out(rq, reserved); + blk_mq_rq_timed_out(hctx, rq, priv, reserved); +} + +static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx, + struct request *rq, void *priv, bool reserved) +{ + /* +* @rq's timer reset has gone through rcu synchronization and is +* visible now. Allow normal completions again by resetting +* ->aborted_gstate. Don't clear RQF_MQ_TIMEOUT_RESET here as +* blk_mq_timeout_reset_cleanup() needs it again and there's no +* memory ordering around ->aborted_gstate making it the only field +* safe to update. Let blk_add_timer() clear it later when the +* request is recycled or times out again. +*/ + if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET) + blk_mq_rq_update_aborted_gstate(rq, 0); +} + +static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx, + struct request *rq, void *priv, bool reserved) +{ + int cnt = 0; + + /* +* @rq is now fully returned to the normal path. If normal +
Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER
static void __blk_mq_requeue_request(struct request *rq) { struct request_queue *q = rq->q; + enum mq_rq_state old_state = blk_mq_rq_state(rq); blk_mq_put_driver_tag(rq); trace_block_rq_requeue(q, rq); wbt_requeue(q->rq_wb, >issue_stat); - if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) { - blk_mq_rq_update_state(rq, MQ_RQ_IDLE); + if (old_state != MQ_RQ_IDLE) { + if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE)) + WARN_ON_ONCE(true); if (q->dma_drain_size && blk_rq_bytes(rq)) rq->nr_phys_segments--; } Can you explain why was old_state kept as a local variable? Hello Sagi, Since blk_mq_requeue_request() must be called after a request has completed the timeout handler will ignore requests that are being requeued. Hence it is safe in this function to cache the request state in a local variable. I understand why it is safe, I just didn't understand why it is needed. +static inline bool blk_mq_change_rq_state(struct request *rq, + enum mq_rq_state old_state, + enum mq_rq_state new_state) { - u64 old_val = READ_ONCE(rq->gstate); - u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state; - - if (state == MQ_RQ_IN_FLIGHT) { - WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE); - new_val += MQ_RQ_GEN_INC; - } + unsigned long old_val = (READ_ONCE(rq->__deadline) & ~RQ_STATE_MASK) | + old_state; + unsigned long new_val = (old_val & ~RQ_STATE_MASK) | new_state; - /* avoid exposing interim values */ - WRITE_ONCE(rq->gstate, new_val); + return cmpxchg(>__deadline, old_val, new_val) == old_val; } Can you explain why this takes the old_state of the request? Can you clarify your question? The purpose of this function is to change the request state only into @new_state if it matches @old_state. I think that is also what the implementation of this function does. I misread the documentation of this. never mind. thanks.
Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release
On Wed, Apr 11, 2018 at 07:20:19AM -0700, Tejun Heo wrote: > Hello, > > On Wed, Apr 11, 2018 at 12:12:56PM +0200, Alexandru Moise wrote: > > > But we already do this through calling blkcg_exit_queue() from > > > __blk_release_queue(). What's missing? > > > > Hi, > > > > It might be the jetlag but I can't see how you end up calling > > blkcg_exit_queue() from __blk_release_queue(). > > > > As I see it the only way to reach blkcg_exit_queue() is from > > blk_cleanup_queue(), which I don't see anywhere in __blk_release_queue(). > > > > I suspect that I'm just fixing a corner case though and > > the general case is what you describe or similar. > > Ah, that changed recently. Can you please check out the current > upstream git master? > > Thanks. > Just did, without my patch I see this crash: [0.75] CPU: 4 PID: 1 Comm: swapper/0 Not tainted 4.16.0-ARCH+ #81 [7/1949] [0.75] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/204 [0.75] RIP: 0010:pi_init+0x23f/0x2f0 [0.75] RSP: :c9197d90 EFLAGS: 00010246 [0.75] RAX: RBX: 0020 RCX: 0038 [0.75] RDX: 0001 RSI: 0001 RDI: [0.75] RBP: c9197e18 R08: R09: [0.75] R10: eaeda600 R11: 88003b69f164 R12: 82e2d740 [0.75] R13: R14: R15: [0.75] FS: () GS:88003e50() knlGS: [0.75] CS: 0010 DS: ES: CR0: 80050033 [0.75] CR2: 0008 CR3: 02209001 CR4: 000606e0 [0.75] Call Trace: [0.75] pf_init+0x1db/0x3be [0.75] ? pcd_init+0x4e8/0x4e8 [0.75] do_one_initcall+0x9e/0x1b0 [0.75] ? do_early_param+0x97/0x97 [0.75] kernel_init_freeable+0x259/0x2fd [0.75] ? rest_init+0xd0/0xd0 [0.75] ? syscall_slow_exit_work+0x1c/0x160 [0.75] kernel_init+0xe/0x100 [0.75] ret_from_fork+0x3a/0x50 [0.75] Code: 75 6a 49 8b 06 48 8b 40 78 48 85 c0 74 08 4c 89 f7 e8 46 76 51 00 83 c3 01 3b 5d a8 7d 0d 49 [0.75] RIP: pi_init+0x23f/0x2f0 RSP: c9197d90 [0.75] CR2: 0008 [0.75] ---[ end trace 12004f267bb8bf7d ]--- [0.76] BUG: unable to handle kernel NULL pointer dereference at 01b4 [0.763350] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0009 [0.763350] [0.76] PGD 0 P4D 0 [0.76] Oops: [#2] PREEMPT SMP [0.76] CPU: 0 PID: 6 Comm: kworker/u12:0 Tainted: G D 4.16.0-ARCH+ #81 [0.76] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/20$ [0.76] Workqueue: nvme-reset-wq nvme_reset_work [0.76] RIP: 0010:blk_queue_flag_set+0xf/0x40 [0.76] RSP: :c91bfcb0 EFLAGS: 00010246 [0.76] RAX: 88003b698000 RBX: RCX: [0.76] RDX: 88003b698000 RSI: fff4 RDI: 001c [0.76] RBP: c91bfcc0 R08: R09: [0.76] R10: eaeaa980 R11: 814e0970 R12: 001c [0.76] R13: R14: R15: 88003aad8010 [0.76] FS: () GS:88003e40() knlGS: [0.76] CS: 0010 DS: ES: CR0: 80050033 [0.76] CR2: 01b4 CR3: 02209001 CR4: 000606f0 [0.76] Call Trace: [0.76] blk_mq_quiesce_queue+0x23/0x80 [0.76] nvme_dev_disable+0x34f/0x480 [0.76] ? nvme_irq+0x50/0x50 [0.76] ? dev_warn+0x64/0x80 [0.76] nvme_reset_work+0x13de/0x1570 [0.76] ? __switch_to_asm+0x34/0x70 [0.76] ? __switch_to_asm+0x40/0x70 [0.76] ? _raw_spin_unlock_irq+0x15/0x30 [0.76] ? finish_task_switch+0x156/0x210 [0.76] process_one_work+0x20c/0x3d0 [0.76] worker_thread+0x216/0x400 [0.76] kthread+0x125/0x130 [0.76] ? process_one_work+0x3d0/0x3d0 [0.76] ? __kthread_bind_mask+0x60/0x60 [0.76] ret_from_fork+0x3a/0x50 With the patch the crash goes away, Thanks, ../Alex > -- > tejun
Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
Ah, yeah, I was moving it out of add_timer but forgot to actully add it to the issue path. Fixed patch below. BTW, no matter what we do w/ the request handover between normal and timeout paths, we'd need something similar. Otherwise, while we can reduce the window, we can't get rid of it. Thanks. Just noticed this one, this looks interesting to me as well. Israel, can you run your test with this patch? --- block/blk-mq.c | 45 - include/linux/blkdev.h |2 ++ 2 files changed, 34 insertions(+), 13 deletions(-) --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -642,6 +642,8 @@ void blk_mq_complete_request(struct requ hctx_lock(hctx, _idx); if (blk_mq_rq_aborted_gstate(rq) != rq->gstate) __blk_mq_complete_request(rq); + else + rq->missed_completion = true; hctx_unlock(hctx, srcu_idx); } EXPORT_SYMBOL(blk_mq_complete_request); @@ -684,6 +686,7 @@ void blk_mq_start_request(struct request blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT); blk_add_timer(rq); + rq->missed_completion = false; write_seqcount_end(>gstate_seq); preempt_enable(); @@ -881,7 +884,7 @@ static void blk_mq_check_expired(struct } } -static void blk_mq_timeout_sync_rcu(struct request_queue *q) +static void blk_mq_timeout_sync_rcu(struct request_queue *q, bool clear) { struct blk_mq_hw_ctx *hctx; bool has_rcu = false; @@ -896,7 +899,8 @@ static void blk_mq_timeout_sync_rcu(stru else synchronize_srcu(hctx->srcu); - hctx->need_sync_rcu = false; + if (clear) + hctx->need_sync_rcu = false; } if (has_rcu) synchronize_rcu(); @@ -917,25 +921,37 @@ static void blk_mq_terminate_expired(str blk_mq_rq_timed_out(hctx, rq, priv, reserved); } -static void blk_mq_finish_timeout_reset(struct blk_mq_hw_ctx *hctx, +static void blk_mq_timeout_reset_return(struct blk_mq_hw_ctx *hctx, struct request *rq, void *priv, bool reserved) { /* * @rq's timer reset has gone through rcu synchronization and is * visible now. Allow normal completions again by resetting * ->aborted_gstate. Don't clear RQF_MQ_TIMEOUT_RESET here as -* there's no memory ordering around ->aborted_gstate making it the -* only field safe to update. Let blk_add_timer() clear it later -* when the request is recycled or times out again. -* -* As nothing prevents from completion happening while -* ->aborted_gstate is set, this may lead to ignored completions -* and further spurious timeouts. +* blk_mq_timeout_reset_cleanup() needs it again and there's no +* memory ordering around ->aborted_gstate making it the only field +* safe to update. Let blk_add_timer() clear it later when the +* request is recycled or times out again. */ if (rq->rq_flags & RQF_MQ_TIMEOUT_RESET) blk_mq_rq_update_aborted_gstate(rq, 0); } +static void blk_mq_timeout_reset_cleanup(struct blk_mq_hw_ctx *hctx, + struct request *rq, void *priv, bool reserved) +{ + /* +* @rq is now fully returned to the normal path. If normal +* completion raced timeout handling, execute the missed completion +* here. This is safe because 1. ->missed_completion can no longer +* be asserted because nothing is timing out right now and 2. if +* ->missed_completion is set, @rq is safe from recycling because +* nobody could have completed it. +*/ + if ((rq->rq_flags & RQF_MQ_TIMEOUT_RESET) && rq->missed_completion) + blk_mq_complete_request(rq); +} + static void blk_mq_timeout_work(struct work_struct *work) { struct request_queue *q = @@ -976,7 +992,7 @@ static void blk_mq_timeout_work(struct w * becomes a problem, we can add per-hw_ctx rcu_head and * wait in parallel. */ - blk_mq_timeout_sync_rcu(q); + blk_mq_timeout_sync_rcu(q, true); /* terminate the ones we won */ blk_mq_queue_tag_busy_iter(q, blk_mq_terminate_expired, @@ -988,9 +1004,12 @@ static void blk_mq_timeout_work(struct w * reset racing against recycling. */ if (nr_resets) { - blk_mq_timeout_sync_rcu(q); + blk_mq_timeout_sync_rcu(q, false); + blk_mq_queue_tag_busy_iter(q, + blk_mq_timeout_reset_return, NULL); + blk_mq_timeout_sync_rcu(q, true); blk_mq_queue_tag_busy_iter(q, - blk_mq_finish_timeout_reset, NULL); +
Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release
Hello, On Wed, Apr 11, 2018 at 12:12:56PM +0200, Alexandru Moise wrote: > > But we already do this through calling blkcg_exit_queue() from > > __blk_release_queue(). What's missing? > > Hi, > > It might be the jetlag but I can't see how you end up calling > blkcg_exit_queue() from __blk_release_queue(). > > As I see it the only way to reach blkcg_exit_queue() is from > blk_cleanup_queue(), which I don't see anywhere in __blk_release_queue(). > > I suspect that I'm just fixing a corner case though and > the general case is what you describe or similar. Ah, that changed recently. Can you please check out the current upstream git master? Thanks. -- tejun
Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
Hello, Bart. On Wed, Apr 11, 2018 at 12:50:51PM +, Bart Van Assche wrote: > Thank you for having shared this patch. It looks interesting to me. What I > know about the blk-mq timeout handling is as follows: > * Nobody who has reviewed the blk-mq timeout handling code with this patch > applied has reported any shortcomings for that code. > * However, several people have reported kernel crashes that disappear when > the blk-mq timeout code is reworked. I'm referring to "nvme-rdma corrupts > memory upon timeout" > (http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html) > and also to a "RIP: scsi_times_out+0x17" crash during boot > (https://bugzilla.kernel.org/show_bug.cgi?id=199077). > > So we have the choice between two approaches: > (1) apply the patch from your previous e-mail and root-cause and fix the > crashes referred to above. > (2) apply a patch that makes the crashes reported against v4.16 disappear and > remove the atomic instructions introduced by such a patch at a later time. > > Since crashes have been reported for kernel v4.16 I think we should follow > approach (2). That will remove the time pressure from root-causing and fixing > the crashes reported for the NVMeOF initiator and SCSI initiator drivers. So, it really bothers me how blind we're going about this. It isn't an insurmountable emergency that we have to adopt whatever solution which passed a couple tests this minute. We can debug and root cause this properly and pick the right solution. We even have two most likely causes already analysed and patches proposed, one of them months ago. If we wanna change the handover model, let's do that because the new one is better, not because of vague fear. Thanks. -- tejun
Re: Hotplug
On 4/11/18 7:58 AM, Jan Kara wrote: > Hi, > > On Tue 10-04-18 11:17:46, Jens Axboe wrote: >> Been running some tests and I keep running into issues with hotplug. >> This looks similar to what Bart posted the other day, but it looks >> more deeply rooted than just having to protect the queue in >> generic_make_request_checks(). The test run is blktests, >> block/001. Current -git doesn't survive it. I've seen at least two >> different oopses, pasted below. >> >> [ 102.163442] NULL pointer dereference at 0010 >> [ 102.163444] PGD 0 P4D 0 >> [ 102.163447] Oops: [#1] PREEMPT SMP >> [ 102.163449] Modules linked in: >> [ 102.175540] sr 12:0:0:0: [sr2] scsi-1 drive >> [ 102.180112] scsi_debug crc_t10dif crct10dif_generic crct10dif_common >> nvme nvme_core sb_edac xl >> [ 102.186934] sr 12:0:0:0: Attached scsi CD-ROM sr2 >> [ 102.191896] sr_mod cdrom btrfs xor zstd_decompress zstd_compress xxhash >> lzo_compress zlib_defc >> [ 102.197169] sr 12:0:0:0: Attached scsi generic sg7 type 5 >> [ 102.203475] igb ahci libahci i2c_algo_bit libata dca [last unloaded: >> crc_t10dif] >> [ 102.203484] CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ #650 >> [ 102.203487] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 >> 11/09/2016 >> [ 102.350882] RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod] >> [ 102.358299] RSP: 0018:883ff357bb58 EFLAGS: 00010292 >> [ 102.364734] RAX: a00b07d0 RBX: 883ff3058000 RCX: >> 883ff357bb66 >> [ 102.373220] RDX: 0003 RSI: 7530 RDI: >> 881fea631000 >> [ 102.381705] RBP: R08: 881fe4d38400 R09: >> >> [ 102.390185] R10: R11: 01b6 R12: >> 085d >> [ 102.398671] R13: 085d R14: 883ffd9b3790 R15: >> >> [ 102.407156] FS: 7f7dc8e6d8c0() GS:883fff34() >> knlGS: >> [ 102.417138] CS: 0010 DS: ES: CR0: 80050033 >> [ 102.424066] CR2: 0010 CR3: 003ffda98005 CR4: >> 003606e0 >> [ 102.432545] DR0: DR1: DR2: >> >> [ 102.441024] DR3: DR6: fffe0ff0 DR7: >> 0400 >> [ 102.449502] Call Trace: >> [ 102.452744] ? __invalidate_device+0x48/0x60 >> [ 102.458022] check_disk_change+0x4c/0x60 >> [ 102.462900] sr_block_open+0x16/0xd0 [sr_mod] >> [ 102.468270] __blkdev_get+0xb9/0x450 >> [ 102.472774] ? iget5_locked+0x1c0/0x1e0 >> [ 102.477568] blkdev_get+0x11e/0x320 >> [ 102.481969] ? bdget+0x11d/0x150 >> [ 102.486083] ? _raw_spin_unlock+0xa/0x20 >> [ 102.490968] ? bd_acquire+0xc0/0xc0 >> [ 102.495368] do_dentry_open+0x1b0/0x320 >> [ 102.500159] ? inode_permission+0x24/0xc0 >> [ 102.505140] path_openat+0x4e6/0x1420 >> [ 102.509741] ? cpumask_any_but+0x1f/0x40 >> [ 102.514630] ? flush_tlb_mm_range+0xa0/0x120 >> [ 102.519903] do_filp_open+0x8c/0xf0 >> [ 102.524305] ? __seccomp_filter+0x28/0x230 >> [ 102.529389] ? _raw_spin_unlock+0xa/0x20 >> [ 102.534283] ? __handle_mm_fault+0x7d6/0x9b0 >> [ 102.539559] ? list_lru_add+0xa8/0xc0 >> [ 102.544157] ? _raw_spin_unlock+0xa/0x20 >> [ 102.549047] ? __alloc_fd+0xaf/0x160 >> [ 102.553549] ? do_sys_open+0x1a6/0x230 >> [ 102.558244] do_sys_open+0x1a6/0x230 >> [ 102.562742] do_syscall_64+0x5a/0x100 >> [ 102.567336] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 > > Interesting. Thinking out loud: This is cd->device dereference I guess > which means disk->private_data was NULL. That gets set in sr_probe() > together with disk->fops which are certainly set as they must have led us > to the crashing function sr_block_revalidate_disk(). So likely > disk->private_data got already cleared. That happens in sr_kref_release() > and the fact that that function got called means struct scsi_cd went away - > so sr_remove() must have been called as well. That all seems possible like: > > CPU1 CPU2 > sr_probe() > __blkdev_get() > disk = bdev_get_gendisk(); > > sr_remove() > del_gendisk() > ... > kref_put(>kref, sr_kref_release); > disk->private_data = NULL; > put_disk(disk); > kfree(cd); > if (disk->fops->open) { > ret = disk->fops->open(bdev, mode); => sr_block_open > check_disk_change(bdev); > sr_block_revalidate_disk() > CRASH > > And I think the problem is in sr_block_revalidate_disk() itself as the > scsi_cd() call is not guaranteed that the caller holds reference to 'cd' > and thus that 'cd' does not disappear under it. IMHO it needs to use > scsi_cd_get() to get struct scsi_cd from gendisk. Am I missing something? No I think you are correct, from the revalidate path it should grab/release a reference. Looks like sr_block_check_events() needs the same treatment. How about the below? >> [
Re: Hotplug
On 4/11/18 8:06 AM, Bart Van Assche wrote: > On Wed, 2018-04-11 at 15:58 +0200, Jan Kara wrote: >> I'm not really sure where this is crashing and 'Code' line is incomplete to >> tell me. > > Hello Jan, > > The following patch should fix this crash: > https://www.mail-archive.com/linux-block@vger.kernel.org/msg20209.html. Yeah, I forgot the link in my reply, thanks. -- Jens Axboe
Re: Hotplug
On Wed, 2018-04-11 at 15:58 +0200, Jan Kara wrote: > I'm not really sure where this is crashing and 'Code' line is incomplete to > tell me. Hello Jan, The following patch should fix this crash: https://www.mail-archive.com/linux-block@vger.kernel.org/msg20209.html. Thanks, Bart.
Re: [PATCH] blk-mq: Revert "blk-mq: reimplement blk_mq_hw_queue_mapped"
On 4/11/18 4:47 AM, Ming Lei wrote: > This reverts commit 127276c6ce5a30fcc806b7fe53015f4f89b62956. > > When all CPUs of one hw queue become offline, there still may have IOs > not completed from this hctx. But blk_mq_hw_queue_mapped() is called in > blk_mq_queue_tag_busy_iter(), which is used for iterating request in timeout > handler, timeout event will be missed on the inactive hctx, then request may > never be completed. > > Also the replementation of blk_mq_hw_queue_mapped() doesn't match the helper's > name any more, and it should have been named as blk_mq_hw_queue_active(). > > Even other callers need further verification about this reimplemenation. > > So revert this patch now, and we can improve hw queue activate/inactivate > event > after adequent researching and test. Thanks, applied. -- Jens Axboe
Re: Hotplug
Hi, On Tue 10-04-18 11:17:46, Jens Axboe wrote: > Been running some tests and I keep running into issues with hotplug. > This looks similar to what Bart posted the other day, but it looks > more deeply rooted than just having to protect the queue in > generic_make_request_checks(). The test run is blktests, > block/001. Current -git doesn't survive it. I've seen at least two > different oopses, pasted below. > > [ 102.163442] NULL pointer dereference at 0010 > [ 102.163444] PGD 0 P4D 0 > [ 102.163447] Oops: [#1] PREEMPT SMP > [ 102.163449] Modules linked in: > [ 102.175540] sr 12:0:0:0: [sr2] scsi-1 drive > [ 102.180112] scsi_debug crc_t10dif crct10dif_generic crct10dif_common nvme > nvme_core sb_edac xl > [ 102.186934] sr 12:0:0:0: Attached scsi CD-ROM sr2 > [ 102.191896] sr_mod cdrom btrfs xor zstd_decompress zstd_compress xxhash > lzo_compress zlib_defc > [ 102.197169] sr 12:0:0:0: Attached scsi generic sg7 type 5 > [ 102.203475] igb ahci libahci i2c_algo_bit libata dca [last unloaded: > crc_t10dif] > [ 102.203484] CPU: 43 PID: 4629 Comm: systemd-udevd Not tainted 4.16.0+ #650 > [ 102.203487] Hardware name: Dell Inc. PowerEdge T630/0NT78X, BIOS 2.3.4 > 11/09/2016 > [ 102.350882] RIP: 0010:sr_block_revalidate_disk+0x23/0x190 [sr_mod] > [ 102.358299] RSP: 0018:883ff357bb58 EFLAGS: 00010292 > [ 102.364734] RAX: a00b07d0 RBX: 883ff3058000 RCX: > 883ff357bb66 > [ 102.373220] RDX: 0003 RSI: 7530 RDI: > 881fea631000 > [ 102.381705] RBP: R08: 881fe4d38400 R09: > > [ 102.390185] R10: R11: 01b6 R12: > 085d > [ 102.398671] R13: 085d R14: 883ffd9b3790 R15: > > [ 102.407156] FS: 7f7dc8e6d8c0() GS:883fff34() > knlGS: > [ 102.417138] CS: 0010 DS: ES: CR0: 80050033 > [ 102.424066] CR2: 0010 CR3: 003ffda98005 CR4: > 003606e0 > [ 102.432545] DR0: DR1: DR2: > > [ 102.441024] DR3: DR6: fffe0ff0 DR7: > 0400 > [ 102.449502] Call Trace: > [ 102.452744] ? __invalidate_device+0x48/0x60 > [ 102.458022] check_disk_change+0x4c/0x60 > [ 102.462900] sr_block_open+0x16/0xd0 [sr_mod] > [ 102.468270] __blkdev_get+0xb9/0x450 > [ 102.472774] ? iget5_locked+0x1c0/0x1e0 > [ 102.477568] blkdev_get+0x11e/0x320 > [ 102.481969] ? bdget+0x11d/0x150 > [ 102.486083] ? _raw_spin_unlock+0xa/0x20 > [ 102.490968] ? bd_acquire+0xc0/0xc0 > [ 102.495368] do_dentry_open+0x1b0/0x320 > [ 102.500159] ? inode_permission+0x24/0xc0 > [ 102.505140] path_openat+0x4e6/0x1420 > [ 102.509741] ? cpumask_any_but+0x1f/0x40 > [ 102.514630] ? flush_tlb_mm_range+0xa0/0x120 > [ 102.519903] do_filp_open+0x8c/0xf0 > [ 102.524305] ? __seccomp_filter+0x28/0x230 > [ 102.529389] ? _raw_spin_unlock+0xa/0x20 > [ 102.534283] ? __handle_mm_fault+0x7d6/0x9b0 > [ 102.539559] ? list_lru_add+0xa8/0xc0 > [ 102.544157] ? _raw_spin_unlock+0xa/0x20 > [ 102.549047] ? __alloc_fd+0xaf/0x160 > [ 102.553549] ? do_sys_open+0x1a6/0x230 > [ 102.558244] do_sys_open+0x1a6/0x230 > [ 102.562742] do_syscall_64+0x5a/0x100 > [ 102.567336] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Interesting. Thinking out loud: This is cd->device dereference I guess which means disk->private_data was NULL. That gets set in sr_probe() together with disk->fops which are certainly set as they must have led us to the crashing function sr_block_revalidate_disk(). So likely disk->private_data got already cleared. That happens in sr_kref_release() and the fact that that function got called means struct scsi_cd went away - so sr_remove() must have been called as well. That all seems possible like: CPU1CPU2 sr_probe() __blkdev_get() disk = bdev_get_gendisk(); sr_remove() del_gendisk() ... kref_put(>kref, sr_kref_release); disk->private_data = NULL; put_disk(disk); kfree(cd); if (disk->fops->open) { ret = disk->fops->open(bdev, mode); => sr_block_open check_disk_change(bdev); sr_block_revalidate_disk() CRASH And I think the problem is in sr_block_revalidate_disk() itself as the scsi_cd() call is not guaranteed that the caller holds reference to 'cd' and thus that 'cd' does not disappear under it. IMHO it needs to use scsi_cd_get() to get struct scsi_cd from gendisk. Am I missing something? > and this one, similar to Barts: > > [ 4676.738069] NULL pointer dereference at 0154 > [ 4676.738071] PGD 0 P4D 0 > [ 4676.738073] Oops: [#1] PREEMPT SMP > [ 4676.738075] Modules linked in: scsi_debug crc_t10dif nvme nvme_core loop > configfs crct10dif_ge] > [ 4676.859272] CPU: 10 PID: 16598 Comm: systemd-udevd Not
Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER
On Wed, 2018-04-11 at 10:11 +0800, Ming Lei wrote: > On Tue, Apr 10, 2018 at 03:01:57PM -0600, Bart Van Assche wrote: > > The blk-mq timeout handling code ignores completions that occur after > > blk_mq_check_expired() has been called and before blk_mq_rq_timed_out() > > has reset rq->aborted_gstate. If a block driver timeout handler always > > returns BLK_EH_RESET_TIMER then the result will be that the request > > never terminates. > > Under this situation: > > IMO, if this request has been handled by driver's irq handler, and if > driver's .timeout still returns BLK_EH_RESET_TIMER, it is driver's bug, > and the correct return value should be BLK_EH_HANDLED. Maybe. In the virtio-scsi driver I found the following: /* * The host guarantees to respond to each command, although I/O * latencies might be higher than on bare metal. Reset the timer * unconditionally to give the host a chance to perform EH. */ static enum blk_eh_timer_return virtscsi_eh_timed_out(struct scsi_cmnd *scmnd) { return BLK_EH_RESET_TIMER; } Bart.
Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER
On Wed, 2018-04-11 at 16:19 +0300, Sagi Grimberg wrote: > > static void __blk_mq_requeue_request(struct request *rq) > > { > > struct request_queue *q = rq->q; > > + enum mq_rq_state old_state = blk_mq_rq_state(rq); > > > > blk_mq_put_driver_tag(rq); > > > > trace_block_rq_requeue(q, rq); > > wbt_requeue(q->rq_wb, >issue_stat); > > > > - if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) { > > - blk_mq_rq_update_state(rq, MQ_RQ_IDLE); > > + if (old_state != MQ_RQ_IDLE) { > > + if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE)) > > + WARN_ON_ONCE(true); > > if (q->dma_drain_size && blk_rq_bytes(rq)) > > rq->nr_phys_segments--; > > } > > Can you explain why was old_state kept as a local variable? Hello Sagi, Since blk_mq_requeue_request() must be called after a request has completed the timeout handler will ignore requests that are being requeued. Hence it is safe in this function to cache the request state in a local variable. > > +static inline bool blk_mq_change_rq_state(struct request *rq, > > + enum mq_rq_state old_state, > > + enum mq_rq_state new_state) > > { > > - u64 old_val = READ_ONCE(rq->gstate); > > - u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state; > > - > > - if (state == MQ_RQ_IN_FLIGHT) { > > - WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE); > > - new_val += MQ_RQ_GEN_INC; > > - } > > + unsigned long old_val = (READ_ONCE(rq->__deadline) & ~RQ_STATE_MASK) | > > + old_state; > > + unsigned long new_val = (old_val & ~RQ_STATE_MASK) | new_state; > > > > - /* avoid exposing interim values */ > > - WRITE_ONCE(rq->gstate, new_val); > > + return cmpxchg(>__deadline, old_val, new_val) == old_val; > > } > > Can you explain why this takes the old_state of the request? Can you clarify your question? The purpose of this function is to change the request state only into @new_state if it matches @old_state. I think that is also what the implementation of this function does. Thanks, Bart.
Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
diff --git a/block/blk-mq.c b/block/blk-mq.c index 75336848f7a7..81ced3096433 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, return ERR_PTR(-EXDEV); } cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask); + if (cpu >= nr_cpu_ids) { + pr_warn("no online cpu for hctx %d\n", hctx_idx); + cpu = cpumask_first(alloc_data.hctx->cpumask); + } We may do this way for the special case, but it is ugly, IMO. Christoph?
[PATCH] block: ratelimite pr_err on IO path
From: Jack WangThis avoid soft lockup below: [ 2328.328429] Call Trace: [ 2328.328433] vprintk_emit+0x229/0x2e0 [ 2328.328436] ? t10_pi_type3_verify_ip+0x20/0x20 [ 2328.328437] printk+0x52/0x6e [ 2328.328439] t10_pi_verify+0x9e/0xf0 [ 2328.328441] bio_integrity_process+0x12e/0x220 [ 2328.328442] ? t10_pi_type1_verify_crc+0x20/0x20 [ 2328.328443] bio_integrity_verify_fn+0xde/0x140 [ 2328.328447] process_one_work+0x13f/0x370 [ 2328.328449] worker_thread+0x62/0x3d0 [ 2328.328450] ? rescuer_thread+0x2f0/0x2f0 [ 2328.328452] kthread+0x116/0x150 [ 2328.328454] ? __kthread_parkme+0x70/0x70 [ 2328.328457] ret_from_fork+0x35/0x40 Signed-off-by: Jack Wang --- block/t10-pi.c | 20 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/block/t10-pi.c b/block/t10-pi.c index a98db38..35967d5 100644 --- a/block/t10-pi.c +++ b/block/t10-pi.c @@ -84,10 +84,12 @@ static blk_status_t t10_pi_verify(struct blk_integrity_iter *iter, if (be32_to_cpu(pi->ref_tag) != lower_32_bits(iter->seed)) { - pr_err("%s: ref tag error at location %llu " \ - "(rcvd %u)\n", iter->disk_name, - (unsigned long long) - iter->seed, be32_to_cpu(pi->ref_tag)); + pr_err_ratelimited("%s: ref tag error at " + "location %llu (rcvd %u)\n", + iter->disk_name, + (unsigned long long) + iter->seed, + be32_to_cpu(pi->ref_tag)); return BLK_STS_PROTECTION; } break; @@ -101,10 +103,12 @@ static blk_status_t t10_pi_verify(struct blk_integrity_iter *iter, csum = fn(iter->data_buf, iter->interval); if (pi->guard_tag != csum) { - pr_err("%s: guard tag error at sector %llu " \ - "(rcvd %04x, want %04x)\n", iter->disk_name, - (unsigned long long)iter->seed, - be16_to_cpu(pi->guard_tag), be16_to_cpu(csum)); + pr_err_ratelimited("%s: guard tag error at sector %llu " + "(rcvd %04x, want %04x)\n", + iter->disk_name, + (unsigned long long)iter->seed, + be16_to_cpu(pi->guard_tag), + be16_to_cpu(csum)); return BLK_STS_PROTECTION; } -- 2.7.4
Re: [PATCH] blk-mq: Revert "blk-mq: reimplement blk_mq_hw_queue_mapped"
Reviewed-by: Sagi Grimberg
Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER
static void __blk_mq_requeue_request(struct request *rq) { struct request_queue *q = rq->q; + enum mq_rq_state old_state = blk_mq_rq_state(rq); blk_mq_put_driver_tag(rq); trace_block_rq_requeue(q, rq); wbt_requeue(q->rq_wb, >issue_stat); - if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) { - blk_mq_rq_update_state(rq, MQ_RQ_IDLE); + if (old_state != MQ_RQ_IDLE) { + if (!blk_mq_change_rq_state(rq, old_state, MQ_RQ_IDLE)) + WARN_ON_ONCE(true); if (q->dma_drain_size && blk_rq_bytes(rq)) rq->nr_phys_segments--; } Can you explain why was old_state kept as a local variable? +static inline bool blk_mq_change_rq_state(struct request *rq, + enum mq_rq_state old_state, + enum mq_rq_state new_state) { - u64 old_val = READ_ONCE(rq->gstate); - u64 new_val = (old_val & ~MQ_RQ_STATE_MASK) | state; - - if (state == MQ_RQ_IN_FLIGHT) { - WARN_ON_ONCE((old_val & MQ_RQ_STATE_MASK) != MQ_RQ_IDLE); - new_val += MQ_RQ_GEN_INC; - } + unsigned long old_val = (READ_ONCE(rq->__deadline) & ~RQ_STATE_MASK) | + old_state; + unsigned long new_val = (old_val & ~RQ_STATE_MASK) | new_state; - /* avoid exposing interim values */ - WRITE_ONCE(rq->gstate, new_val); + return cmpxchg(>__deadline, old_val, new_val) == old_val; } Can you explain why this takes the old_state of the request? Otherwise this looks good to me, Reviewed-by: Sagi Grimberg
Re: [PATCH v5] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER
The blk-mq timeout handling code ignores completions that occur after blk_mq_check_expired() has been called and before blk_mq_rq_timed_out() has reset rq->aborted_gstate. If a block driver timeout handler always returns BLK_EH_RESET_TIMER then the result will be that the request never terminates. Under this situation: IMO, if this request has been handled by driver's irq handler, and if driver's .timeout still returns BLK_EH_RESET_TIMER, it is driver's bug, and the correct return value should be BLK_EH_HANDLED. Is it possible to guarantee this efficiently if the irq handler can run concurrently with the timeout handler?
Re: [PATCH v4] blk-mq: Fix race conditions in request timeout handling
On Tue, 2018-04-10 at 14:54 -0700, t...@kernel.org wrote: > Ah, yeah, I was moving it out of add_timer but forgot to actully add > it to the issue path. Fixed patch below. > > BTW, no matter what we do w/ the request handover between normal and > timeout paths, we'd need something similar. Otherwise, while we can > reduce the window, we can't get rid of it. (+Martin Steigerwald) Hello Tejun, Thank you for having shared this patch. It looks interesting to me. What I know about the blk-mq timeout handling is as follows: * Nobody who has reviewed the blk-mq timeout handling code with this patch applied has reported any shortcomings for that code. * However, several people have reported kernel crashes that disappear when the blk-mq timeout code is reworked. I'm referring to "nvme-rdma corrupts memory upon timeout" (http://lists.infradead.org/pipermail/linux-nvme/2018-February/015848.html) and also to a "RIP: scsi_times_out+0x17" crash during boot (https://bugzilla.kernel.org/show_bug.cgi?id=199077). So we have the choice between two approaches: (1) apply the patch from your previous e-mail and root-cause and fix the crashes referred to above. (2) apply a patch that makes the crashes reported against v4.16 disappear and remove the atomic instructions introduced by such a patch at a later time. Since crashes have been reported for kernel v4.16 I think we should follow approach (2). That will remove the time pressure from root-causing and fixing the crashes reported for the NVMeOF initiator and SCSI initiator drivers. Thanks, Bart.
[PATCH] blk-mq: Revert "blk-mq: reimplement blk_mq_hw_queue_mapped"
This reverts commit 127276c6ce5a30fcc806b7fe53015f4f89b62956. When all CPUs of one hw queue become offline, there still may have IOs not completed from this hctx. But blk_mq_hw_queue_mapped() is called in blk_mq_queue_tag_busy_iter(), which is used for iterating request in timeout handler, timeout event will be missed on the inactive hctx, then request may never be completed. Also the replementation of blk_mq_hw_queue_mapped() doesn't match the helper's name any more, and it should have been named as blk_mq_hw_queue_active(). Even other callers need further verification about this reimplemenation. So revert this patch now, and we can improve hw queue activate/inactivate event after adequent researching and test. Cc: Stefan HaberlandCc: Christian Borntraeger Cc: Christoph Hellwig Cc: Sagi Grimberg Reported-by: Jens Axboe Fixes: 127276c6ce5a30fcc ("blk-mq: reimplement blk_mq_hw_queue_mapped") Signed-off-by: Ming Lei --- block/blk-mq.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/block/blk-mq.h b/block/blk-mq.h index 502af371b83b..88c558f71819 100644 --- a/block/blk-mq.h +++ b/block/blk-mq.h @@ -181,7 +181,7 @@ static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx) static inline bool blk_mq_hw_queue_mapped(struct blk_mq_hw_ctx *hctx) { - return cpumask_first_and(hctx->cpumask, cpu_online_mask) < nr_cpu_ids; + return hctx->nr_ctx && hctx->tags; } void blk_mq_in_flight(struct request_queue *q, struct hd_struct *part, -- 2.9.5
Re: [PATCH v2] blk-cgroup: remove entries in blkg_tree before queue release
On Mon, Apr 09, 2018 at 03:09:38PM -0700, Tejun Heo wrote: > (cc'ing Joseph as he worked on the area recently, hi!) > > Hello, > > On Sat, Apr 07, 2018 at 12:21:48PM +0200, Alexandru Moise wrote: > > The q->id is used as an index within the blkg_tree radix tree. > > > > If the entry is not released before reclaiming the blk_queue_ida's id > > blkcg_init_queue() within a different driver from which this id > > was originally for can fail due to the entry at that index within > > the radix tree already existing. > > > > Signed-off-by: Alexandru Moise <00moses.alexande...@gmail.com> > > --- > > v2: Added no-op for !CONFIG_BLK_CGROUP > > > > block/blk-cgroup.c | 2 +- > > block/blk-sysfs.c | 4 > > include/linux/blk-cgroup.h | 3 +++ > > 3 files changed, 8 insertions(+), 1 deletion(-) > > > > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c > > index 1c16694ae145..224e937dbb59 100644 > > --- a/block/blk-cgroup.c > > +++ b/block/blk-cgroup.c > > @@ -369,7 +369,7 @@ static void blkg_destroy(struct blkcg_gq *blkg) > > * > > * Destroy all blkgs associated with @q. > > */ > > -static void blkg_destroy_all(struct request_queue *q) > > +void blkg_destroy_all(struct request_queue *q) > > { > > struct blkcg_gq *blkg, *n; > > > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c > > index d00d1b0ec109..a72866458f22 100644 > > --- a/block/blk-sysfs.c > > +++ b/block/blk-sysfs.c > > @@ -816,6 +816,10 @@ static void __blk_release_queue(struct work_struct > > *work) > > if (q->bio_split) > > bioset_free(q->bio_split); > > > > + spin_lock_irq(q->queue_lock); > > + blkg_destroy_all(q); > > + spin_unlock_irq(q->queue_lock); > > + > > ida_simple_remove(_queue_ida, q->id); > > call_rcu(>rcu_head, blk_free_queue_rcu); > > But we already do this through calling blkcg_exit_queue() from > __blk_release_queue(). What's missing? Hi, It might be the jetlag but I can't see how you end up calling blkcg_exit_queue() from __blk_release_queue(). As I see it the only way to reach blkcg_exit_queue() is from blk_cleanup_queue(), which I don't see anywhere in __blk_release_queue(). I suspect that I'm just fixing a corner case though and the general case is what you describe or similar. ../Alex > > Thanks. > > -- > tejun