Re: v4.15 and I/O hang with BFQ

2018-02-05 Thread Paolo Valente


> On 30 Jan 2018, at 16:40, Paolo Valente 
>  wrote:
> 
> 
> 
>> On 30 Jan 2018, at 15:40, Ming Lei  wrote:
>> 
>> On Tue, Jan 30, 2018 at 03:30:28PM +0100, Oleksandr Natalenko wrote:
>>> Hi.
>>> 
>> ...
>>>  systemd-udevd-271   [000]  4.311033: bfq_insert_requests: insert rq->0
>>>  systemd-udevd-271   [000] ...1 4.311037: blk_mq_do_dispatch_sched: not get rq, 1
>>> cfdisk-408   [000] 13.484220: bfq_insert_requests: insert rq->1
>>>   kworker/0:1H-174   [000] 13.484253: blk_mq_do_dispatch_sched: not get rq, 1
>>> ===
>>> 
>>> Looks the same, right?
>> 
>> Yeah, same as before.
>> 
> 
> Hi guys,
> sorry for the delay with this fix.  We are proceeding very slowly on
> this, because I'm super busy.  Anyway, now I can at least explain in
> more detail the cause that leads to this hang.  Commit 'a6a252e64914
> ("blk-mq-sched: decide how to handle flush rq via RQF_FLUSH_SEQ")'
> makes all non-flush re-prepared requests be re-inserted into the I/O
> scheduler.  With this change, I/O schedulers may get the same request
> inserted again, even several times, without a finish_request invoked
> on the request before each re-insertion.
> 
> For the I/O scheduler, every such re-prepared request is equivalent
> to the insertion of a new request. For schedulers like mq-deadline
> or kyber this fact causes no problems. In contrast, it confuses a stateful
> scheduler like BFQ, which preserves state for an I/O request until
> finish_request is invoked on it. In particular, BFQ has no way
> to know that the above re-insertions concern the same, already dispatched
> request. So it may get stuck waiting forever for the completion of these
> re-inserted requests, thus preventing any other queue of requests
> from being served.
> 
> We are trying to address this issue by adding the hook requeue_request
> to the bfq interface.
> 
> Unfortunately, with our current implementation of requeue_request in
> place, bfq eventually gets to an incoherent state.  This is apparently
> caused by a requeue of an I/O request, immediately followed by a
> completion of the same request.  This seems rather absurd, and drives
> bfq crazy.  But this is something for which we don't have definite
> results yet.
> 
> We're working on it, sorry again for the delay.
> 

Ok, patch arriving ... Please test it.

Thanks,
Paolo

> Thanks,
> Paolo
> 
>> -- 
>> Ming
> 



Re: v4.15 and I/O hang with BFQ

2018-01-30 Thread Paolo Valente


> On 30 Jan 2018, at 15:40, Ming Lei  wrote:
> 
> On Tue, Jan 30, 2018 at 03:30:28PM +0100, Oleksandr Natalenko wrote:
>> Hi.
>> 
> ...
>>   systemd-udevd-271   [000]  4.311033: bfq_insert_requests: insert rq->0
>>   systemd-udevd-271   [000] ...1 4.311037: blk_mq_do_dispatch_sched: not get rq, 1
>>  cfdisk-408   [000] 13.484220: bfq_insert_requests: insert rq->1
>>kworker/0:1H-174   [000] 13.484253: blk_mq_do_dispatch_sched: not get rq, 1
>> ===
>> 
>> Looks the same, right?
> 
> Yeah, same as before.
> 

Hi guys,
sorry for the delay with this fix.  We are proceeding very slowly on
this, because I'm super busy.  Anyway, now I can at least explain in
more detail the cause that leads to this hang.  Commit 'a6a252e64914
("blk-mq-sched: decide how to handle flush rq via RQF_FLUSH_SEQ")'
makes all non-flush re-prepared requests be re-inserted into the I/O
scheduler.  With this change, I/O schedulers may get the same request
inserted again, even several times, without a finish_request invoked
on the request before each re-insertion.

For the I/O scheduler, every such re-prepared request is equivalent
to the insertion of a new request. For schedulers like mq-deadline
or kyber this fact causes no problems. In contrast, it confuses a stateful
scheduler like BFQ, which preserves state for an I/O request until
finish_request is invoked on it. In particular, BFQ has no way
to know that the above re-insertions concern the same, already dispatched
request. So it may get stuck waiting forever for the completion of these
re-inserted requests, thus preventing any other queue of requests
from being served.

We are trying to address this issue by adding the hook requeue_request
to the bfq interface.
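To make the effect concrete, here is a toy model of the accounting a stateful
scheduler performs. This is illustrative shell only, not BFQ code, and the
function names are invented for this sketch: insertion increments the number
of requests believed queued, finish decrements it, and a re-insertion with no
intervening finish leaves the count permanently above zero.

```shell
# Toy accounting model: each insert bumps the number of requests the
# scheduler believes are queued; each finish drops it by one.
nr_queued=0
insert_request() { nr_queued=$((nr_queued + 1)); }
finish_request() { nr_queued=$((nr_queued - 1)); }

insert_request    # initial insertion of the request
insert_request    # same request re-inserted after re-prepare, no finish between
finish_request    # the single real completion

# nr_queued is now 1, not 0: the scheduler waits forever for a second
# completion that will never arrive, and other queues starve.
echo "$nr_queued"
```

With a requeue hook in place, the second insertion could instead be treated as
a requeue of a known request, keeping the count consistent.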

Unfortunately, with our current implementation of requeue_request in
place, bfq eventually gets to an incoherent state.  This is apparently
caused by a requeue of an I/O request, immediately followed by a
completion of the same request.  This seems rather absurd, and drives
bfq crazy.  But this is something for which we don't have definite
results yet.

We're working on it, sorry again for the delay.

Thanks,
Paolo

> -- 
> Ming



Re: v4.15 and I/O hang with BFQ

2018-01-30 Thread Ming Lei
On Tue, Jan 30, 2018 at 03:30:28PM +0100, Oleksandr Natalenko wrote:
> Hi.
> 
...
>    systemd-udevd-271   [000]  4.311033: bfq_insert_requests: insert rq->0
>    systemd-udevd-271   [000] ...1 4.311037: blk_mq_do_dispatch_sched: not get rq, 1
>   cfdisk-408   [000] 13.484220: bfq_insert_requests: insert rq->1
> kworker/0:1H-174   [000] 13.484253: blk_mq_do_dispatch_sched: not get rq, 1
> ===
> 
> Looks the same, right?

Yeah, same as before.

-- 
Ming


Re: v4.15 and I/O hang with BFQ

2018-01-30 Thread Oleksandr Natalenko

Hi.

30.01.2018 09:19, Ming Lei wrote:

Hi,
We know there is an I/O hang issue with BFQ over USB storage wrt. blk-mq, and
last time I found it was inside BFQ. You can try the debug patch in the
following link [1] to see if it is the same as the previous reports [1][2]:

[1] https://marc.info/?l=linux-block=151214241518562=2
[2] https://bugzilla.kernel.org/show_bug.cgi?id=198023

If you aren't sure whether they are the same, please post the trace somewhere,
and I can check whether it is a new bug.


OK. First, I got two more stacktraces without tracing; then I rebooted 
with your patch and checked tracing.


Before reboot, cfdisk:

===
[  266.630386] INFO: task cfdisk:437 blocked for more than 20 seconds.
[  266.640950]   Not tainted 4.15.0-pf1 #1
[  266.645131] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  266.651789] cfdisk  D0   437410 0x
[  266.661331] Call Trace:
[  266.664517]  ? __schedule+0x35f/0x1000
[  266.668660]  ? bio_alloc_bioset+0xc7/0x1e0
[  266.672330]  ? submit_bh_wbc+0x162/0x190
[  266.678034]  schedule+0x32/0xc0
[  266.681293]  io_schedule+0x12/0x40
[  266.685230]  wait_on_page_bit+0xea/0x130
[  266.689015]  ? add_to_page_cache_lru+0xe0/0xe0
[  266.693456]  ? blkdev_writepages+0x30/0x30
[  266.695521]  do_read_cache_page+0x167/0x350
[  266.697160]  read_dev_sector+0x28/0xc0
[  266.698685]  scsi_bios_ptable+0x4e/0x120 [scsi_mod]
[  266.700156]  scsicam_bios_param+0x16/0x1a0 [scsi_mod]
[  266.701868]  sd_getgeo+0xbf/0xd0 [sd_mod]
[  266.703388]  blkdev_ioctl+0x178/0x990
[  266.704818]  block_ioctl+0x39/0x40
[  266.706381]  do_vfs_ioctl+0xa4/0x630
[  266.708584]  ? __fput+0x131/0x1e0
[  266.710184]  ? preempt_count_add+0x68/0xa0
[  266.711762]  ? _raw_spin_unlock_irq+0x1d/0x30
[  266.713304]  SyS_ioctl+0x74/0x80
[  266.714502]  ? exit_to_usermode_loop+0x39/0xa0
[  266.717352]  entry_SYSCALL_64_fastpath+0x20/0x83
[  266.718857] RIP: 0033:0x7fc482064d87
===

Blocked kworker:

===
[  389.511083] INFO: task kworker/u8:5:178 blocked for more than 20 seconds.
[  389.516454]   Not tainted 4.15.0-pf1 #1
[  389.520091] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  389.524726] kworker/u8:5D0   178  2 0x8000
[  389.528321] Workqueue: events_freezable_power_ disk_events_workfn
[  389.532228] Call Trace:
[  389.534909]  ? __schedule+0x35f/0x1000
[  389.541906]  ? blk_mq_sched_dispatch_requests+0x116/0x190
[  389.546561]  ? __sbitmap_get_word+0x2a/0x80
[  389.551167]  ? sbitmap_get_shallow+0x5c/0xa0
[  389.555633]  schedule+0x32/0xc0
[  389.559803]  io_schedule+0x12/0x40
[  389.564504]  blk_mq_get_tag+0x181/0x2a0
[  389.572541]  ? wait_woken+0x80/0x80
[  389.576008]  blk_mq_get_request+0xf9/0x410
[  389.579998]  blk_mq_alloc_request+0x7e/0xe0
[  389.584824]  blk_get_request_flags+0x40/0x160
[  389.588917]  scsi_execute+0x38/0x1e0 [scsi_mod]
[  389.593079]  scsi_test_unit_ready+0x5d/0xd0 [scsi_mod]
[  389.596966]  sd_check_events+0xf5/0x1a0 [sd_mod]
[  389.602558]  disk_check_events+0x69/0x150
[  389.604822]  process_one_work+0x1df/0x420
[  389.606584]  worker_thread+0x2b/0x3d0
[  389.608175]  ? process_one_work+0x420/0x420
[  389.609833]  kthread+0x113/0x130
[  389.611327]  ? kthread_create_on_node+0x70/0x70
[  389.612852]  ret_from_fork+0x35/0x40
===

After reboot, tracing info:

===
   systemd-udevd-269   [000] ...1 4.309198: blk_mq_do_dispatch_sched: get rq->0, 1
    kworker/0:1H-174   [000]  4.309380: blk_mq_do_dispatch_sched: not get rq, 1
    kworker/u8:2-167   [000]  4.309562: bfq_insert_requests: insert rq->0
    kworker/u8:2-167   [000] ...1 4.309563: blk_mq_do_dispatch_sched: get rq->0, 1
    kworker/0:1H-174   [000]  4.309694: blk_mq_do_dispatch_sched: not get rq, 1
    kworker/u8:2-167   [000]  4.309879: bfq_insert_requests: insert rq->0
    kworker/u8:2-167   [000] ...1 4.309880: blk_mq_do_dispatch_sched: get rq->0, 1
    kworker/0:1H-174   [000]  4.310001: blk_mq_do_dispatch_sched: not get rq, 1
   systemd-udevd-271   [000]  4.311033: bfq_insert_requests: insert rq->0
   systemd-udevd-271   [000] ...1 4.311037: blk_mq_do_dispatch_sched: not get rq, 1
  cfdisk-408   [000] 13.484220: bfq_insert_requests: insert rq->1
    kworker/0:1H-174   [000] 13.484253: blk_mq_do_dispatch_sched: not get rq, 1

===

Looks the same, right?


Or Paolo should know whether the issue is fixed in v4.15.


So, Paolo :)?

Regards,
  Oleksandr


v4.15 and I/O hang with BFQ

2018-01-30 Thread Oleksandr Natalenko

Hi, Paolo, Ivan, Ming et al.

It looks like I've just encountered the issue Ivan has already described 
in [1]. Since I'm able to reproduce it reliably in a VM, I'd like to 
draw more attention to it.


First, I'm using v4.15 kernel with all pending BFQ fixes:

===
2ad909a300c4 bfq-iosched: don't call bfqg_and_blkg_put for !CONFIG_BFQ_GROUP_IOSCHED
83c97a310f83 block, bfq: release oom-queue ref to root group on exit
5b9eb4716af1 block, bfq: put async queues for root bfq groups too
3c5529454a27 block, bfq: limit sectors served with interactive weight raising
e6c72be3486b block, bfq: limit tags for writes and async I/O
e579b91d96ce block, bfq: increase threshold to deem I/O as random
f6cbc16aac88 block, bfq: remove superfluous check in queue-merging setup
8045d8575183 block, bfq: let a queue be merged only shortly after starting I/O
242954975f5e block, bfq: check low_latency flag in bfq_bfqq_save_state()
8349c1bddd95 block, bfq: add missing rq_pos_tree update on rq removal
558200440cb9 block, bfq: fix occurrences of request finish method's old name
6ed2f47ee870 block, bfq: consider also past I/O in soft real-time detection
e5f295dd18f2 block, bfq: remove batches of confusing ifdefs
===

Next, I boot an Arch VM with this kernel and an emulated USB stick attached:


===
qemu-system-x86_64 -display gtk,gl=on -machine q35,accel=kvm -cpu 
host,+vmx -enable-kvm -drive 
if=pflash,format=raw,readonly,file=/mnt/vms/ovmf/code.img -drive 
if=pflash,format=raw,file=/mnt/vms/ovmf/vars.img -cdrom 
/mnt/vms/ovmf/shell.iso -netdev user,id=user.0 -device 
virtio-net,netdev=user.0 -usb -device nec-usb-xhci,id=xhci -device 
usb-tablet,bus=xhci.0 -serial stdio -m 512 -hda sda.img -hdb sdb.img 
-smp 4 -drive if=none,id=stick,file=usb.img -device 
usb-storage,bus=xhci.0,drive=stick

===

Within the VM itself I use udev rule to set the I/O scheduler:

===
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/scheduler}="bfq"
===
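To confirm the rule took effect, the active scheduler can be read back from
sysfs, where it is shown in brackets. A minimal sketch: the path
/sys/block/sda/queue/scheduler is an assumption for your setup, and the
helper is exercised here on a sample line so it runs anywhere.

```shell
# The scheduler file lists all available schedulers with the active one in
# brackets, e.g. "mq-deadline kyber [bfq] none".  Extract the bracketed name:
active_sched() { sed 's/.*\[\(.*\)\].*/\1/'; }

# On a live system you would run:  active_sched < /sys/block/sda/queue/scheduler
# Here the helper is fed a sample line instead:
sample='mq-deadline kyber [bfq] none'
current=$(printf '%s\n' "$sample" | active_sched)
echo "$current"
```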

Things boot and work fine until I try to create a partition on the 
emulated USB stick with cfdisk. On writing a new partition table it 
blocks, and I get the following stacktrace:


===
[  225.670702] INFO: task cfdisk:416 blocked for more than 20 seconds.
[  225.681427]   Not tainted 4.15.0-pf1 #1
[  225.685341] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  225.691910] cfdisk  D0   416407 0x
[  225.700777] Call Trace:
[  225.703654]  ? __schedule+0x35f/0x1000
[  225.706990]  ? __switch_to_asm+0x40/0x70
[  225.709943]  ? __switch_to_asm+0x34/0x70
[  225.712224]  ? __switch_to_asm+0x40/0x70
[  225.714225]  ? __switch_to_asm+0x40/0x70
[  225.716790]  schedule+0x32/0xc0
[  225.718355]  io_schedule+0x12/0x40
[  225.719747]  wait_on_page_bit+0xea/0x130
[  225.721266]  ? add_to_page_cache_lru+0xe0/0xe0
[  225.722622]  __filemap_fdatawait_range+0xbf/0x110
[  225.724625]  ? preempt_count_sub+0x50/0x90
[  225.726591]  ? sync_inodes_one_sb+0x20/0x20
[  225.727507]  filemap_fdatawait_keep_errors+0x1a/0x40
[  225.728491]  iterate_bdevs+0xa7/0x140
[  225.729304]  sys_sync+0x7c/0xb0
[  225.730095]  entry_SYSCALL_64_fastpath+0x20/0x83
[  225.732420] RIP: 0033:0x7f2631cf4a17
===

I've tried it several times with the same result. Next, I reboot the 
system, change the scheduler to Kyber, and then I can create a new 
partition successfully.


I also went further. With Kyber active, so that the emulated USB stick 
works properly, I create RAID10 on it, then an LVM PV, then a VG, then an 
LV, then XFS on top of that. Then I set the I/O scheduler back to BFQ and 
reboot, and on reboot mdadm gets blocked while discovering things, 
producing the following stacktrace:


===
[   41.350763] INFO: task mdadm:354 blocked for more than 20 seconds.
[   41.356154]   Not tainted 4.15.0-pf1 #1
[   41.358674] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[   41.363046] mdadm   D0   354352 0x0100
[   41.368700] Call Trace:
[   41.370417]  ? __schedule+0x35f/0x1000
[   41.372668]  ? blk_queue_exit+0x3e/0x60
[   41.374816]  ? generic_make_request+0x12f/0x2d0
[   41.377363]  schedule+0x32/0xc0
[   41.380416]  io_schedule+0x12/0x40
[   41.382698]  __blkdev_direct_IO_simple+0x206/0x360
[   41.382707]  ? bdget+0x120/0x120
[   41.382714]  ? blkdev_direct_IO+0x3a5/0x3f0
[   41.382718]  blkdev_direct_IO+0x3a5/0x3f0
[   41.382724]  ? current_time+0x18/0x70
[   41.382731]  ? __atime_needs_update+0x7f/0x190
[   41.382743]  ? generic_file_read_iter+0x8c/0x9d0
[   41.382747]  generic_file_read_iter+0x8c/0x9d0
[   41.382759]  ? __seccomp_filter+0x3b/0x260
[   41.382765]  __vfs_read+0xf5/0x170
[   41.382772]  vfs_read+0x91/0x130
[   41.382778]  SyS_read+0x52/0xc0
[   41.382819]  do_syscall_64+0x67/0x1a0
[   41.382828]  entry_SYSCALL64_slow_path+0x25/0x25
[   41.382833] RIP: 0033:0x7f4b8088a3a1
[   41.382836] RSP: 002b:7ffdd681def8 EFLAGS: 0246 ORIG_RAX: 

[   41.382841] RAX: ffda RBX: