Re: dm mpath: switch IO scheduler of underlying paths to "none" [was: Re: BFQ + dm-mpath]
On 09/08/2017 01:58 PM, Mike Snitzer wrote:
> On Fri, Sep 08 2017 at 1:07pm -0400,
> Mike Snitzer wrote:
>
>> On Fri, Sep 08 2017 at 12:48pm -0400,
>> Jens Axboe wrote:
>>
>>>> Please see the following untested patch. All
>>>> testing/review/comments/acks appreciated.
>>>>
>>>> I elected to use elevator_change() rather than fiddle with adding a new
>>>> blk-mq elevator hook (e.g. ->request_prepared) to verify that each
>>>> blk-mq elevator enabled request did in fact get prepared.
>>>>
>>>> Bart, please test this patch and reply with your review/feedback.
>>>>
>>>> Jens, if you're OK with this solution please reply with your Ack and
>>>> I'll send it to Linus along with the rest of the handful of DM changes I
>>>> have for 4.14.
>>>
>>> I am not - we used to have this elevator change functionality from
>>> inside the kernel, and finally got rid of it when certain drivers killed
>>> it. I don't want to be bringing it back.
>>
>> Fine.
>
> BTW, while I conceded "Fine": I think your justification for not
> reintroducing elevator_change() lacks substance. What is inherently
> problematic about elevator_change()?

Because no in-kernel users should be mucking with the IO scheduler.
Adding this back is just an excuse for drivers to start doing it again,
which generally happens because whatever vendor's driver team tests some
synthetic benchmark and decides that X is better than the default of Y.
So we're not going back to that.

> Having an elevator attached to a DM multipath device's underlying path's
> request_queue just asks for trouble (especially given the blk-mq
> elevator interface).
>
> Please own this issue as a regression and help me arrive at a timely way
> forward.

I'm trying, I made suggestions on how we can proceed - we can have a way
to insert to hctx->dispatch without bothering the IO scheduler. I'm open
to other suggestions as well, just not open to exporting an interface to
change IO schedulers from inside the kernel.

-- 
Jens Axboe
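[For context, the scheduler-bypass insertion Jens alludes to above ("insert to hctx->dispatch without bothering the IO scheduler") could look roughly like the following. This is an illustrative pseudocode sketch in 4.13-era kernel-C style, not a patch from this thread; the function name and exact locking are assumptions.]

```c
/*
 * Illustrative sketch only (not from this thread's patch): put a
 * cloned request straight on the hctx dispatch list, so the attached
 * elevator's ->insert_requests() never sees a request that was not
 * run through ->prepare_request().
 */
static void blk_mq_insert_request_bypass(struct request *rq, bool run_queue)
{
	struct blk_mq_ctx *ctx = rq->mq_ctx;
	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(rq->q, ctx->cpu);

	/* FIFO onto the dispatch list; the IO scheduler is bypassed */
	spin_lock(&hctx->lock);
	list_add_tail(&rq->queuelist, &hctx->dispatch);
	spin_unlock(&hctx->lock);

	if (run_queue)
		blk_mq_run_hw_queue(hctx, false);
}
```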
Re: dm mpath: switch IO scheduler of underlying paths to "none" [was: Re: BFQ + dm-mpath]
On Fri, Sep 08 2017 at 1:07pm -0400,
Mike Snitzer wrote:

> On Fri, Sep 08 2017 at 12:48pm -0400,
> Jens Axboe wrote:
>
>>> Please see the following untested patch. All
>>> testing/review/comments/acks appreciated.
>>>
>>> I elected to use elevator_change() rather than fiddle with adding a new
>>> blk-mq elevator hook (e.g. ->request_prepared) to verify that each
>>> blk-mq elevator enabled request did in fact get prepared.
>>>
>>> Bart, please test this patch and reply with your review/feedback.
>>>
>>> Jens, if you're OK with this solution please reply with your Ack and
>>> I'll send it to Linus along with the rest of the handful of DM changes I
>>> have for 4.14.
>>
>> I am not - we used to have this elevator change functionality from
>> inside the kernel, and finally got rid of it when certain drivers killed
>> it. I don't want to be bringing it back.
>
> Fine.

BTW, while I conceded "Fine": I think your justification for not
reintroducing elevator_change() lacks substance. What is inherently
problematic about elevator_change()?

Having an elevator attached to a DM multipath device's underlying path's
request_queue just asks for trouble (especially given the blk-mq
elevator interface).

Please own this issue as a regression and help me arrive at a timely way
forward.

Thanks,
Mike
Re: dm mpath: switch IO scheduler of underlying paths to "none" [was: Re: BFQ + dm-mpath]
On Fri, Sep 08 2017 at 12:48pm -0400,
Jens Axboe wrote:

> On 09/08/2017 10:41 AM, Mike Snitzer wrote:
>> On Fri, Sep 08 2017 at 5:13P -0400,
>> Paolo Valente wrote:
>>
>>>
>>>> Il giorno 07 set 2017, alle ore 17:52, Mike Snitzer ha scritto:
>>>>
>>>> On Tue, Sep 05 2017 at 10:15am -0400,
>>>> Bart Van Assche wrote:
>>>>
>>>>> On Tue, 2017-09-05 at 09:56 +0200, Paolo Valente wrote:
>>>>>> Ok, my suspects seem confirmed: the path dm_mq_queue_rq -> map_request
>>>>>> -> setup_clone -> blk_rq_prep_clone creates a cloned request without
>>>>>> invoking e->type->ops.mq.prepare_request for the target elevator e.
>>>>>> The cloned request is therefore not initialized for the scheduler, but
>>>>>> it is however inserted into the scheduler by
>>>>>> blk_mq_sched_insert_request. This seems an error for any scheduler
>>>>>> that needs to initialize fields in the incoming request, or in general
>>>>>> to take some preliminary action.
>>>>>>
>>>>>> Am I missing something here?
>>>>>
>>>>> (+Mike Snitzer)
>>>>>
>>>>> Mike, do you perhaps have the time to look into this memory leak?
>>>>
>>>> It isn't a memory leak, it is missing initialization in the case of
>>>> cloned requests (if I'm understanding Paolo correctly).
>>>>
>>>
>>> Exactly!
>>>
>>>> But cloned requests shouldn't be going through the scheduler. Only the
>>>> original requests should.
>>>>
>>>> Commit bd166ef18 ("blk-mq-sched: add framework for MQ capable IO
>>>> schedulers") switched from blk_mq_insert_request() to
>>>> blk_mq_sched_insert_request() and in doing so it opened dm-mpath up to
>>>> this bug.
>>>>
>>>> Could be we need to take steps to ensure the block layer still
>>>> supports bypassing the elevator by using direct insertion?
>>>>
>>>> Or blk_mq_sched_insert_request() needs updating to check if
>>>> e->type->ops.mq.prepare_request were actually performed and to fallback
>>>> to the !elevator case if not..
>>>>
>>>> Not sure on the fix, but I can look closer if others (like Jens or
>>>> Paolo) don't have quicker suggestions.
>>>>
>>>
>>> No quick suggestion from me :(
>>>
>>> Thanks for analyzing this bug,
>>
>> Please see the following untested patch. All
>> testing/review/comments/acks appreciated.
>>
>> I elected to use elevator_change() rather than fiddle with adding a new
>> blk-mq elevator hook (e.g. ->request_prepared) to verify that each
>> blk-mq elevator enabled request did in fact get prepared.
>>
>> Bart, please test this patch and reply with your review/feedback.
>>
>> Jens, if you're OK with this solution please reply with your Ack and
>> I'll send it to Linus along with the rest of the handful of DM changes I
>> have for 4.14.
>
> I am not - we used to have this elevator change functionality from
> inside the kernel, and finally got rid of it when certain drivers killed
> it. I don't want to be bringing it back.

Fine.

> Sounds like we have two issues here. One is that we run into issues with
> stacking IO schedulers, and the other is that we'd rather not have
> multiple schedulers in play for a stacked setup.
>
> Maybe it'd be cleaner to have the dm-mq side of things not insert
> through the scheduler, but rather just FIFO on the target end?

That was how blk_insert_cloned_request() was before. From the patch
header (you may have missed it):

"Commit bd166ef18 ("blk-mq-sched: add framework for MQ capable IO
schedulers") switched blk_insert_cloned_request() from using
blk_mq_insert_request() to blk_mq_sched_insert_request(). Which
incorrectly added elevator machinery into a call chain that isn't
supposed to have any."

So shouldn't blk_insert_cloned_request() be made to _not_ use
blk_mq_sched_insert_request()?

We'd need a new block interface established, or equivalent open-coded in
blk_insert_cloned_request(), to handle direct dispatch to an mq
request_queue's queue(s).

Mike
Re: [PATCH] dm mpath: switch IO scheduler of underlying paths to "none" [was: Re: BFQ + dm-mpath]
On 09/08/2017 10:41 AM, Mike Snitzer wrote:
> On Fri, Sep 08 2017 at 5:13P -0400,
> Paolo Valente wrote:
>
>>
>>> Il giorno 07 set 2017, alle ore 17:52, Mike Snitzer ha scritto:
>>>
>>> On Tue, Sep 05 2017 at 10:15am -0400,
>>> Bart Van Assche wrote:
>>>
>>>> On Tue, 2017-09-05 at 09:56 +0200, Paolo Valente wrote:
>>>>> Ok, my suspects seem confirmed: the path dm_mq_queue_rq -> map_request
>>>>> -> setup_clone -> blk_rq_prep_clone creates a cloned request without
>>>>> invoking e->type->ops.mq.prepare_request for the target elevator e.
>>>>> The cloned request is therefore not initialized for the scheduler, but
>>>>> it is however inserted into the scheduler by
>>>>> blk_mq_sched_insert_request. This seems an error for any scheduler
>>>>> that needs to initialize fields in the incoming request, or in general
>>>>> to take some preliminary action.
>>>>>
>>>>> Am I missing something here?
>>>>
>>>> (+Mike Snitzer)
>>>>
>>>> Mike, do you perhaps have the time to look into this memory leak?
>>>
>>> It isn't a memory leak, it is missing initialization in the case of
>>> cloned requests (if I'm understanding Paolo correctly).
>>>
>>
>> Exactly!
>>
>>> But cloned requests shouldn't be going through the scheduler. Only the
>>> original requests should.
>>>
>>> Commit bd166ef18 ("blk-mq-sched: add framework for MQ capable IO
>>> schedulers") switched from blk_mq_insert_request() to
>>> blk_mq_sched_insert_request() and in doing so it opened dm-mpath up to
>>> this bug.
>>>
>>> Could be we need to take steps to ensure the block layer still
>>> supports bypassing the elevator by using direct insertion?
>>>
>>> Or blk_mq_sched_insert_request() needs updating to check if
>>> e->type->ops.mq.prepare_request were actually performed and to fallback
>>> to the !elevator case if not..
>>>
>>> Not sure on the fix, but I can look closer if others (like Jens or
>>> Paolo) don't have quicker suggestions.
>>>
>>
>> No quick suggestion from me :(
>>
>> Thanks for analyzing this bug,
>
> Please see the following untested patch. All
> testing/review/comments/acks appreciated.
>
> I elected to use elevator_change() rather than fiddle with adding a new
> blk-mq elevator hook (e.g. ->request_prepared) to verify that each
> blk-mq elevator enabled request did in fact get prepared.
>
> Bart, please test this patch and reply with your review/feedback.
>
> Jens, if you're OK with this solution please reply with your Ack and
> I'll send it to Linus along with the rest of the handful of DM changes I
> have for 4.14.

I am not - we used to have this elevator change functionality from
inside the kernel, and finally got rid of it when certain drivers killed
it. I don't want to be bringing it back.

Sounds like we have two issues here. One is that we run into issues with
stacking IO schedulers, and the other is that we'd rather not have
multiple schedulers in play for a stacked setup.

Maybe it'd be cleaner to have the dm-mq side of things not insert
through the scheduler, but rather just FIFO on the target end?

-- 
Jens Axboe
[PATCH] dm mpath: switch IO scheduler of underlying paths to "none" [was: Re: BFQ + dm-mpath]
On Fri, Sep 08 2017 at 5:13P -0400,
Paolo Valente wrote:

>
>> Il giorno 07 set 2017, alle ore 17:52, Mike Snitzer ha scritto:
>>
>> On Tue, Sep 05 2017 at 10:15am -0400,
>> Bart Van Assche wrote:
>>
>>> On Tue, 2017-09-05 at 09:56 +0200, Paolo Valente wrote:
>>>> Ok, my suspects seem confirmed: the path dm_mq_queue_rq -> map_request
>>>> -> setup_clone -> blk_rq_prep_clone creates a cloned request without
>>>> invoking e->type->ops.mq.prepare_request for the target elevator e.
>>>> The cloned request is therefore not initialized for the scheduler, but
>>>> it is however inserted into the scheduler by
>>>> blk_mq_sched_insert_request. This seems an error for any scheduler
>>>> that needs to initialize fields in the incoming request, or in general
>>>> to take some preliminary action.
>>>>
>>>> Am I missing something here?
>>>
>>> (+Mike Snitzer)
>>>
>>> Mike, do you perhaps have the time to look into this memory leak?
>>
>> It isn't a memory leak, it is missing initialization in the case of
>> cloned requests (if I'm understanding Paolo correctly).
>>
>
> Exactly!
>
>> But cloned requests shouldn't be going through the scheduler. Only the
>> original requests should.
>>
>> Commit bd166ef18 ("blk-mq-sched: add framework for MQ capable IO
>> schedulers") switched from blk_mq_insert_request() to
>> blk_mq_sched_insert_request() and in doing so it opened dm-mpath up to
>> this bug.
>>
>> Could be we need to take steps to ensure the block layer still
>> supports bypassing the elevator by using direct insertion?
>>
>> Or blk_mq_sched_insert_request() needs updating to check if
>> e->type->ops.mq.prepare_request were actually performed and to fallback
>> to the !elevator case if not..
>>
>> Not sure on the fix, but I can look closer if others (like Jens or
>> Paolo) don't have quicker suggestions.
>>
>
> No quick suggestion from me :(
>
> Thanks for analyzing this bug,

Please see the following untested patch. All
testing/review/comments/acks appreciated.

I elected to use elevator_change() rather than fiddle with adding a new
blk-mq elevator hook (e.g. ->request_prepared) to verify that each
blk-mq elevator enabled request did in fact get prepared.

Bart, please test this patch and reply with your review/feedback.

Jens, if you're OK with this solution please reply with your Ack and
I'll send it to Linus along with the rest of the handful of DM changes I
have for 4.14.

Thanks,
Mike

From: Mike Snitzer
Date: Fri, 8 Sep 2017 11:45:13 -0400
Subject: [PATCH] dm mpath: switch IO scheduler of underlying paths to "none"

A NULL pointer crash was reported for the case of having the BFQ IO
scheduler attached to the underlying paths of a DM multipath device.
The crash occurs in blk_mq_sched_insert_request()'s call to
e->type->ops.mq.insert_requests().

Paolo Valente correctly summarized why the crash occurred with:
"the call chain (dm_mq_queue_rq -> map_request -> setup_clone ->
blk_rq_prep_clone) creates a cloned request without invoking
e->type->ops.mq.prepare_request for the target elevator e. The cloned
request is therefore not initialized for the scheduler, but it is
however inserted into the scheduler by blk_mq_sched_insert_request."

All said, there is no reason for IO scheduling in the underlying paths
because the top-level DM multipath request_queue handles all IO
scheduling of the original requests issued to the multipath device.
The multipath device's clones of the original requests are then just
inserted directly into the underlying path's dispatch queue(s).

Commit bd166ef18 ("blk-mq-sched: add framework for MQ capable IO
schedulers") switched blk_insert_cloned_request() from using
blk_mq_insert_request() to blk_mq_sched_insert_request(). Which
incorrectly added elevator machinery into a call chain that isn't
supposed to have any.

To fix this DM multipath now explicitly removes the IO scheduler from
all underlying paths during multipath device initialization. To do so
elevator_change() is needed, so elevator_change() is reinstated by
reverting commit c033269490 ("block: Remove elevator_change()").

Fixes: bd166ef18 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Reported-by: Bart Van Assche
Signed-off-by: Mike Snitzer
---
 block/elevator.c         | 13 +++++++++++++
 drivers/md/dm-mpath.c    | 14 ++++++++++++--
 include/linux/elevator.h |  1 +
 3 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/block/elevator.c b/block/elevator.c
index 4bb2f0c..a5d9639 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -1084,6 +1084,19 @@ static int __elevator_change(struct request_queue *q, const char *name)
 	return elevator_switch(q, e);
 }
 
+int elevator_change(struct request_queue *q, const char *name)
+{
+	int ret;
+
+	/* Protect q->elevator from
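[The dm-mpath hunk of the patch is truncated above. Purely as an illustration of the approach the commit message describes, not the actual patch, the per-path elevator switch might look something like the sketch below. The helper name and its call site are assumptions; `elevator_change()` is the reinstated interface from the block/elevator.c hunk.]

```c
/*
 * Illustrative sketch only (hypothetical helper, not from the patch):
 * switch an underlying path's request_queue to the "none" elevator
 * during multipath device initialization, so clones are never inserted
 * into a scheduler that did not prepare them.
 */
static int pgpath_switch_elevator_to_none(struct pgpath *pgpath)
{
	struct request_queue *q = bdev_get_queue(pgpath->path.dev->bdev);
	int ret;

	ret = elevator_change(q, "none");
	if (ret)
		DMERR("failed to switch underlying path's elevator to \"none\"");

	return ret;
}
```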
Re: BFQ + dm-mpath
> Il giorno 07 set 2017, alle ore 17:52, Mike Snitzer ha scritto:
>
> On Tue, Sep 05 2017 at 10:15am -0400,
> Bart Van Assche wrote:
>
>> On Tue, 2017-09-05 at 09:56 +0200, Paolo Valente wrote:
>>> Ok, my suspects seem confirmed: the path dm_mq_queue_rq -> map_request
>>> -> setup_clone -> blk_rq_prep_clone creates a cloned request without
>>> invoking e->type->ops.mq.prepare_request for the target elevator e.
>>> The cloned request is therefore not initialized for the scheduler, but
>>> it is however inserted into the scheduler by
>>> blk_mq_sched_insert_request. This seems an error for any scheduler
>>> that needs to initialize fields in the incoming request, or in general
>>> to take some preliminary action.
>>>
>>> Am I missing something here?
>>
>> (+Mike Snitzer)
>>
>> Mike, do you perhaps have the time to look into this memory leak?
>
> It isn't a memory leak, it is missing initialization in the case of
> cloned requests (if I'm understanding Paolo correctly).
>

Exactly!

> But cloned requests shouldn't be going through the scheduler. Only the
> original requests should.
>
> Commit bd166ef18 ("blk-mq-sched: add framework for MQ capable IO
> schedulers") switched from blk_mq_insert_request() to
> blk_mq_sched_insert_request() and in doing so it opened dm-mpath up to
> this bug.
>
> Could be we need to take steps to ensure the block layer still
> supports bypassing the elevator by using direct insertion?
>
> Or blk_mq_sched_insert_request() needs updating to check if
> e->type->ops.mq.prepare_request were actually performed and to fallback
> to the !elevator case if not..
>
> Not sure on the fix, but I can look closer if others (like Jens or
> Paolo) don't have quicker suggestions.
>

No quick suggestion from me :(

Thanks for analyzing this bug,
Paolo

> Mike
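[The fallback Mike floats in the quoted text — have blk_mq_sched_insert_request() detect an unprepared request and take the !elevator path — can be sketched roughly as below. This is pseudocode in 4.13-era kernel-C style, not code from the thread; the two helper names are invented for illustration, and using a flag such as RQF_ELVPRIV to mark prepared requests is an assumption.]

```c
/*
 * Pseudocode sketch (hypothetical, names invented): treat a request
 * that never went through e->type->ops.mq.prepare_request() as if the
 * queue had no elevator, instead of handing it to insert_requests().
 */
void blk_mq_sched_insert_request(struct request *rq, bool at_head,
				 bool run_queue, bool async)
{
	struct elevator_queue *e = rq->q->elevator;

	if (e && (rq->rq_flags & RQF_ELVPRIV)) {
		/* prepared by ->prepare_request(): scheduler may see it */
		sched_insert_to_elevator(rq, at_head);	/* hypothetical */
	} else {
		/* unprepared (e.g. a dm-mpath clone): fall back to the
		 * !elevator case and insert directly */
		insert_to_sw_queue(rq, at_head);	/* hypothetical */
	}

	if (run_queue)
		blk_mq_run_hw_queue(blk_mq_map_queue(rq->q, rq->mq_ctx->cpu),
				    async);
}
```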
Re: BFQ + dm-mpath
On Tue, 2017-09-05 at 09:56 +0200, Paolo Valente wrote:
> Ok, my suspects seem confirmed: the path dm_mq_queue_rq -> map_request
> -> setup_clone -> blk_rq_prep_clone creates a cloned request without
> invoking e->type->ops.mq.prepare_request for the target elevator e.
> The cloned request is therefore not initialized for the scheduler, but
> it is however inserted into the scheduler by
> blk_mq_sched_insert_request. This seems an error for any scheduler
> that needs to initialize fields in the incoming request, or in general
> to take some preliminary action.
>
> Am I missing something here?

(+Mike Snitzer)

Mike, do you perhaps have the time to look into this memory leak?

Thanks,

Bart.
Re: BFQ + dm-mpath
> Il giorno 25 ago 2017, alle ore 01:16, Bart Van Assche ha scritto:
>
> Hello Paolo,
>

Hi Bart

> Has BFQ ever been tested in combination with dm-mpath?

Unfortunately no.

> A few seconds
> after I had started a run of the srp-tests software with BFQ a kernel
> oops appeared on the console. The command I ran is:
>
> $ while true; do ~bart/software/infiniband/srp-test/run_tests -d -r 30 -e bfq; done
>
> And this is what appeared on the console:
>
> [89251.977201] BUG: unable to handle kernel NULL pointer dereference at 0018
> [89251.987831] IP: bfq_setup_cooperator+0x16/0x2e0 [bfq]
> [89251.995241] PGD 0
> [89251.995243] P4D 0
> [89251.999194]
> [89252.006423] Oops: [#1] PREEMPT SMP
> [89252.012354] Modules linked in: ib_srp libcrc32c scsi_transport_srp
> target_core_file ib_srpt target_core_iblock target_core_mod scsi_debug brd
> bfq dm_service_time rdma_cm iw_cm bridge stp llc ip6table_filter
> ip6_tables iptable_filter intel_rapl sb_edac x86_pkg_temp_thermal
> intel_powerclamp coretemp kvm_intel kvm af_packet irqbypass ipmi_ssif dcdbas
> crct10dif_pclmul ioatdma crc32_pclmul mei_me joydev ghash_clmulni_intel
> mei aesni_intel sg shpchp ipmi_si aes_x86_64 ipmi_devintf crypto_simd dca
> cryptd glue_helper lpc_ich ipmi_msghandler acpi_power_meter acpi_pad wmi
> mfd_core ib_ipoib ib_cm ib_uverbs ib_umad mlx4_ib ib_core dm_multipath
> dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua netconsole configfs
> ip_tables x_tables autofs4 ext4 mbcache jbd2 mlx4_en hid_generic usbhid
> mgag200 i2c_algo_bit drm_kms_helper
> [89252.103941] syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ehci_pci
> mlx4_core tg3 ehci_hcd crc32c_intel devlink drm ptp usbcore usb_common
> pps_core megaraid_sas libphy [last unloaded: scsi_transport_srp]
> [89252.128375] CPU: 13 PID: 352 Comm: kworker/13:1H Tainted: G W 4.13.0-rc6-dbg+ #2
> [89252.139884] Hardware name: Dell Inc. PowerEdge R720/0VWT90, BIOS 1.3.6 09/11/2012
> [89252.150243] Workqueue: kblockd blk_mq_run_work_fn
> [89252.157449] task: 880911d80040 task.stack: c90007c64000
> [89252.166038] RIP: 0010:bfq_setup_cooperator+0x16/0x2e0 [bfq]
> [89252.174222] RSP: 0018:c90007c67b20 EFLAGS: 00010082
> [89252.182026] RAX: ffe0 RBX: 8807b9988700 RCX: 0001
> [89252.191990] RDX: 8807b9988700 RSI: RDI: 880288554a68
> [89252.201956] RBP: c90007c67b68 R08: 825820c0 R09:
> [89252.211899] R10: c90007c67ae8 R11: a05a5ad5 R12: 880288554dc8
> [89252.221821] R13: 880288554a68 R14: 880928078bd0 R15: 8807bbe03528
> [89252.231715] FS: