Re: v4.15 and I/O hang with BFQ
> On 30 Jan 2018, at 16:40, Paolo Valente wrote:
>
>> On 30 Jan 2018, at 15:40, Ming Lei wrote:
>>
>> On Tue, Jan 30, 2018 at 03:30:28PM +0100, Oleksandr Natalenko wrote:
>>> Hi.
>>>
>> ...
>>>  systemd-udevd-271 [000]        4.311033: bfq_insert_requests: insert rq->0
>>>  systemd-udevd-271 [000] ...1   4.311037: blk_mq_do_dispatch_sched: not get rq, 1
>>>         cfdisk-408 [000]       13.484220: bfq_insert_requests: insert rq->1
>>>   kworker/0:1H-174 [000]       13.484253: blk_mq_do_dispatch_sched: not get rq, 1
>>> ===
>>>
>>> Looks the same, right?
>>
>> Yeah, same as before.
>>
>
> Hi guys,
> sorry for the delay with this fix. We are proceeding very slowly on this, because I'm super busy. Anyway, now I can at least explain in more detail the cause that leads to this hang. Commit a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via RQF_FLUSH_SEQ") makes all non-flush re-prepared requests be re-inserted into the I/O scheduler. With this change, I/O schedulers may get the same request inserted again, even several times, without finish_request being invoked on the request before each re-insertion.
>
> For the I/O scheduler, every such re-prepared request is equivalent to the insertion of a new request. For schedulers like mq-deadline or kyber this causes no problems. In contrast, it confuses a stateful scheduler like BFQ, which preserves state for an I/O request until finish_request is invoked on it. In particular, BFQ has no way to know that these re-insertions concern the same, already dispatched request. So it may get stuck waiting forever for the completion of the re-inserted requests, thus preventing any other queue of requests from being served.
>
> We are trying to address this issue by adding the hook requeue_request to the bfq interface.
>
> Unfortunately, with our current implementation of requeue_request in place, bfq eventually gets into an incoherent state. This is apparently caused by a requeue of an I/O request, immediately followed by a completion of the same request. This seems rather absurd, and drives bfq crazy. But this is something for which we don't have definite results yet.
>
> We're working on it; sorry again for the delay.

Ok, patch arriving ... Please test it.

Thanks,
Paolo

> Thanks,
> Paolo
>
>> --
>> Ming
Re: v4.15 and I/O hang with BFQ
> On 30 Jan 2018, at 15:40, Ming Lei wrote:
>
> On Tue, Jan 30, 2018 at 03:30:28PM +0100, Oleksandr Natalenko wrote:
>> Hi.
>>
> ...
>>  systemd-udevd-271 [000]        4.311033: bfq_insert_requests: insert rq->0
>>  systemd-udevd-271 [000] ...1   4.311037: blk_mq_do_dispatch_sched: not get rq, 1
>>         cfdisk-408 [000]       13.484220: bfq_insert_requests: insert rq->1
>>   kworker/0:1H-174 [000]       13.484253: blk_mq_do_dispatch_sched: not get rq, 1
>> ===
>>
>> Looks the same, right?
>
> Yeah, same as before.
>

Hi guys,
sorry for the delay with this fix. We are proceeding very slowly on this, because I'm super busy. Anyway, now I can at least explain in more detail the cause that leads to this hang. Commit a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via RQF_FLUSH_SEQ") makes all non-flush re-prepared requests be re-inserted into the I/O scheduler. With this change, I/O schedulers may get the same request inserted again, even several times, without finish_request being invoked on the request before each re-insertion.

For the I/O scheduler, every such re-prepared request is equivalent to the insertion of a new request. For schedulers like mq-deadline or kyber this causes no problems. In contrast, it confuses a stateful scheduler like BFQ, which preserves state for an I/O request until finish_request is invoked on it. In particular, BFQ has no way to know that these re-insertions concern the same, already dispatched request. So it may get stuck waiting forever for the completion of the re-inserted requests, thus preventing any other queue of requests from being served.

We are trying to address this issue by adding the hook requeue_request to the bfq interface.

Unfortunately, with our current implementation of requeue_request in place, bfq eventually gets into an incoherent state. This is apparently caused by a requeue of an I/O request, immediately followed by a completion of the same request. This seems rather absurd, and drives bfq crazy. But this is something for which we don't have definite results yet.

We're working on it; sorry again for the delay.

Thanks,
Paolo

> --
> Ming
Re: v4.15 and I/O hang with BFQ
On Tue, Jan 30, 2018 at 03:30:28PM +0100, Oleksandr Natalenko wrote:
> Hi.
> ...
>  systemd-udevd-271 [000]        4.311033: bfq_insert_requests: insert rq->0
>  systemd-udevd-271 [000] ...1   4.311037: blk_mq_do_dispatch_sched: not get rq, 1
>         cfdisk-408 [000]       13.484220: bfq_insert_requests: insert rq->1
>   kworker/0:1H-174 [000]       13.484253: blk_mq_do_dispatch_sched: not get rq, 1
> ===
>
> Looks the same, right?

Yeah, same as before.

--
Ming
Re: v4.15 and I/O hang with BFQ
Hi.

30.01.2018 09:19, Ming Lei wrote:
> Hi,
>
> We knew there is an IO hang issue on BFQ over USB-storage wrt. blk-mq, and last time I found it is inside BFQ. You can try the debug patch in the following link [1] to see if it is the same as the previous report [1][2]:
>
> [1] https://marc.info/?l=linux-block&m=151214241518562&w=2
> [2] https://bugzilla.kernel.org/show_bug.cgi?id=198023
>
> If you aren't sure whether they are the same, please post the trace somewhere, then I can check if it is a new bug.

OK. First, I got 2 more stacktraces without tracing, then I rebooted with your patch and checked tracing.

Before reboot, cfdisk:

===
[  266.630386] INFO: task cfdisk:437 blocked for more than 20 seconds.
[  266.640950]       Not tainted 4.15.0-pf1 #1
[  266.645131] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  266.651789] cfdisk          D    0   437    410 0x
[  266.661331] Call Trace:
[  266.664517]  ? __schedule+0x35f/0x1000
[  266.668660]  ? bio_alloc_bioset+0xc7/0x1e0
[  266.672330]  ? submit_bh_wbc+0x162/0x190
[  266.678034]  schedule+0x32/0xc0
[  266.681293]  io_schedule+0x12/0x40
[  266.685230]  wait_on_page_bit+0xea/0x130
[  266.689015]  ? add_to_page_cache_lru+0xe0/0xe0
[  266.693456]  ? blkdev_writepages+0x30/0x30
[  266.695521]  do_read_cache_page+0x167/0x350
[  266.697160]  read_dev_sector+0x28/0xc0
[  266.698685]  scsi_bios_ptable+0x4e/0x120 [scsi_mod]
[  266.700156]  scsicam_bios_param+0x16/0x1a0 [scsi_mod]
[  266.701868]  sd_getgeo+0xbf/0xd0 [sd_mod]
[  266.703388]  blkdev_ioctl+0x178/0x990
[  266.704818]  block_ioctl+0x39/0x40
[  266.706381]  do_vfs_ioctl+0xa4/0x630
[  266.708584]  ? __fput+0x131/0x1e0
[  266.710184]  ? preempt_count_add+0x68/0xa0
[  266.711762]  ? _raw_spin_unlock_irq+0x1d/0x30
[  266.713304]  SyS_ioctl+0x74/0x80
[  266.714502]  ? exit_to_usermode_loop+0x39/0xa0
[  266.717352]  entry_SYSCALL_64_fastpath+0x20/0x83
[  266.718857] RIP: 0033:0x7fc482064d87
===

Blocked kworker:

===
[  389.511083] INFO: task kworker/u8:5:178 blocked for more than 20 seconds.
[  389.516454]       Not tainted 4.15.0-pf1 #1
[  389.520091] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  389.524726] kworker/u8:5    D    0   178      2 0x8000
[  389.528321] Workqueue: events_freezable_power_ disk_events_workfn
[  389.532228] Call Trace:
[  389.534909]  ? __schedule+0x35f/0x1000
[  389.541906]  ? blk_mq_sched_dispatch_requests+0x116/0x190
[  389.546561]  ? __sbitmap_get_word+0x2a/0x80
[  389.551167]  ? sbitmap_get_shallow+0x5c/0xa0
[  389.555633]  schedule+0x32/0xc0
[  389.559803]  io_schedule+0x12/0x40
[  389.564504]  blk_mq_get_tag+0x181/0x2a0
[  389.572541]  ? wait_woken+0x80/0x80
[  389.576008]  blk_mq_get_request+0xf9/0x410
[  389.579998]  blk_mq_alloc_request+0x7e/0xe0
[  389.584824]  blk_get_request_flags+0x40/0x160
[  389.588917]  scsi_execute+0x38/0x1e0 [scsi_mod]
[  389.593079]  scsi_test_unit_ready+0x5d/0xd0 [scsi_mod]
[  389.596966]  sd_check_events+0xf5/0x1a0 [sd_mod]
[  389.602558]  disk_check_events+0x69/0x150
[  389.604822]  process_one_work+0x1df/0x420
[  389.606584]  worker_thread+0x2b/0x3d0
[  389.608175]  ? process_one_work+0x420/0x420
[  389.609833]  kthread+0x113/0x130
[  389.611327]  ? kthread_create_on_node+0x70/0x70
[  389.612852]  ret_from_fork+0x35/0x40
===

After reboot, tracing info:

===
 systemd-udevd-269 [000] ...1   4.309198: blk_mq_do_dispatch_sched: get rq->0, 1
  kworker/0:1H-174 [000]        4.309380: blk_mq_do_dispatch_sched: not get rq, 1
  kworker/u8:2-167 [000]        4.309562: bfq_insert_requests: insert rq->0
  kworker/u8:2-167 [000] ...1   4.309563: blk_mq_do_dispatch_sched: get rq->0, 1
  kworker/0:1H-174 [000]        4.309694: blk_mq_do_dispatch_sched: not get rq, 1
  kworker/u8:2-167 [000]        4.309879: bfq_insert_requests: insert rq->0
  kworker/u8:2-167 [000] ...1   4.309880: blk_mq_do_dispatch_sched: get rq->0, 1
  kworker/0:1H-174 [000]        4.310001: blk_mq_do_dispatch_sched: not get rq, 1
 systemd-udevd-271 [000]        4.311033: bfq_insert_requests: insert rq->0
 systemd-udevd-271 [000] ...1   4.311037: blk_mq_do_dispatch_sched: not get rq, 1
        cfdisk-408 [000]       13.484220: bfq_insert_requests: insert rq->1
  kworker/0:1H-174 [000]       13.484253: blk_mq_do_dispatch_sched: not get rq, 1
===

Looks the same, right? Or Paolo should know whether the issue is fixed in v4.15. So, Paolo :)?

Regards,
  Oleksandr
v4.15 and I/O hang with BFQ
Hi, Paolo, Ivan, Ming et al.

It looks like I've just encountered the issue Ivan has already described in [1]. Since I'm able to reproduce it reliably in a VM, I'd like to draw more attention to it.

First, I'm using the v4.15 kernel with all pending BFQ fixes:

===
2ad909a300c4 bfq-iosched: don't call bfqg_and_blkg_put for !CONFIG_BFQ_GROUP_IOSCHED
83c97a310f83 block, bfq: release oom-queue ref to root group on exit
5b9eb4716af1 block, bfq: put async queues for root bfq groups too
3c5529454a27 block, bfq: limit sectors served with interactive weight raising
e6c72be3486b block, bfq: limit tags for writes and async I/O
e579b91d96ce block, bfq: increase threshold to deem I/O as random
f6cbc16aac88 block, bfq: remove superfluous check in queue-merging setup
8045d8575183 block, bfq: let a queue be merged only shortly after starting I/O
242954975f5e block, bfq: check low_latency flag in bfq_bfqq_save_state()
8349c1bddd95 block, bfq: add missing rq_pos_tree update on rq removal
558200440cb9 block, bfq: fix occurrences of request finish method's old name
6ed2f47ee870 block, bfq: consider also past I/O in soft real-time detection
e5f295dd18f2 block, bfq: remove batches of confusing ifdefs
===

Next, I boot an Arch VM with this kernel and an emulated USB stick attached:

===
qemu-system-x86_64 -display gtk,gl=on -machine q35,accel=kvm -cpu host,+vmx \
  -enable-kvm \
  -drive if=pflash,format=raw,readonly,file=/mnt/vms/ovmf/code.img \
  -drive if=pflash,format=raw,file=/mnt/vms/ovmf/vars.img \
  -cdrom /mnt/vms/ovmf/shell.iso \
  -netdev user,id=user.0 -device virtio-net,netdev=user.0 \
  -usb -device nec-usb-xhci,id=xhci -device usb-tablet,bus=xhci.0 \
  -serial stdio -m 512 -hda sda.img -hdb sdb.img -smp 4 \
  -drive if=none,id=stick,file=usb.img -device usb-storage,bus=xhci.0,drive=stick
===

Within the VM itself I use a udev rule to set the I/O scheduler:

===
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="bfq"
===

Things boot and work fine until I try to create a partition on the emulated USB stick with cfdisk. On writing a new partition table it blocks, and I get the following stacktrace:

===
[  225.670702] INFO: task cfdisk:416 blocked for more than 20 seconds.
[  225.681427]       Not tainted 4.15.0-pf1 #1
[  225.685341] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  225.691910] cfdisk          D    0   416    407 0x
[  225.700777] Call Trace:
[  225.703654]  ? __schedule+0x35f/0x1000
[  225.706990]  ? __switch_to_asm+0x40/0x70
[  225.709943]  ? __switch_to_asm+0x34/0x70
[  225.712224]  ? __switch_to_asm+0x40/0x70
[  225.714225]  ? __switch_to_asm+0x40/0x70
[  225.716790]  schedule+0x32/0xc0
[  225.718355]  io_schedule+0x12/0x40
[  225.719747]  wait_on_page_bit+0xea/0x130
[  225.721266]  ? add_to_page_cache_lru+0xe0/0xe0
[  225.722622]  __filemap_fdatawait_range+0xbf/0x110
[  225.724625]  ? preempt_count_sub+0x50/0x90
[  225.726591]  ? sync_inodes_one_sb+0x20/0x20
[  225.727507]  filemap_fdatawait_keep_errors+0x1a/0x40
[  225.728491]  iterate_bdevs+0xa7/0x140
[  225.729304]  sys_sync+0x7c/0xb0
[  225.730095]  entry_SYSCALL_64_fastpath+0x20/0x83
[  225.732420] RIP: 0033:0x7f2631cf4a17
===

I've tried it several times with the same result. Next, I reboot the system, change the scheduler to Kyber, and then I can create a new partition successfully.

I also went further. With Kyber active so that the emulated USB stick works properly, I create RAID10 on it, then an LVM PV, then a VG, then an LV, then XFS on top of that. Then I set the I/O scheduler back to BFQ and reboot, and on reboot mdadm gets blocked while discovering things, producing the following stacktrace:

===
[   41.350763] INFO: task mdadm:354 blocked for more than 20 seconds.
[   41.356154]       Not tainted 4.15.0-pf1 #1
[   41.358674] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[   41.363046] mdadm           D    0   354    352 0x0100
[   41.368700] Call Trace:
[   41.370417]  ? __schedule+0x35f/0x1000
[   41.372668]  ? blk_queue_exit+0x3e/0x60
[   41.374816]  ? generic_make_request+0x12f/0x2d0
[   41.377363]  schedule+0x32/0xc0
[   41.380416]  io_schedule+0x12/0x40
[   41.382698]  __blkdev_direct_IO_simple+0x206/0x360
[   41.382707]  ? bdget+0x120/0x120
[   41.382714]  ? blkdev_direct_IO+0x3a5/0x3f0
[   41.382718]  blkdev_direct_IO+0x3a5/0x3f0
[   41.382724]  ? current_time+0x18/0x70
[   41.382731]  ? __atime_needs_update+0x7f/0x190
[   41.382743]  ? generic_file_read_iter+0x8c/0x9d0
[   41.382747]  generic_file_read_iter+0x8c/0x9d0
[   41.382759]  ? __seccomp_filter+0x3b/0x260
[   41.382765]  __vfs_read+0xf5/0x170
[   41.382772]  vfs_read+0x91/0x130
[   41.382778]  SyS_read+0x52/0xc0
[   41.382819]  do_syscall_64+0x67/0x1a0
[   41.382828]  entry_SYSCALL_64_slow_path+0x25/0x25
[   41.382833] RIP: 0033:0x7f4b8088a3a1
[   41.382836] RSP: 002b:7ffdd681def8 EFLAGS: 0246 ORIG_RAX:
[   41.382841] RAX: ffda RBX: