Re: [dm-devel] Revert "dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks"
On Tue, 2018-03-13 at 12:31 -0400, Mike Snitzer wrote:
> But now I cannot get the test to run:
>
> # srp-test/run_tests -c -d -r 10 -t 02-mq
> [ ... ]
> /dev/disk/by-id/dm-uuid-mpath-3600140572616d6469736b310: not found
> Test srp-test/tests/02-mq failed

Hello Mike,

The bug that caused this failure has been fixed and is in linux-next. The
srp-test software runs fine on my setup against linux-next. See also
https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=for-next&id=b470c154c600e427592df5237596ce0f33ce7d9f

Bart.

--
dm-devel mailing list
dm-devel@redhat.com
https://www.redhat.com/mailman/listinfo/dm-devel
Re: [dm-devel] Revert "dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks"
On Tue, 2018-03-13 at 17:41 -0400, Mike Snitzer wrote:
> On Tue, Mar 13 2018 at 1:10pm -0400,
> Bart Van Assche wrote:
> > Even if that would be the case, that can't have been the cause of what I
> > reported. Before I run any dm tests I merge the block layer, SCSI and RDMA
> > changes that are scheduled for the next kernel version into the dm tree.
>
> Well I've rebased dm-4.16 on top of v4.15-rc5. If you update to latest
> dm-4.16 and look at the following diff it is pretty clear that these
> changes will not compromise dm-mpath's "mq" mode (which you're using):
>
> git diff 8d47e65948ddea4398892946d9e50778a316b397^..e8f74a0f00113d74ac18d6de13096f9e2f95618a -- drivers/md/dm-mpath.c
>
> I see no reason why you'd hit hangs with requests lingering on the
> requeue_list

Hello Mike,

I agree with you that the hanging requests shouldn't be related to the most
recent dm changes. I will try to find some time to root-cause this myself.

Bart.
Re: [dm-devel] Revert "dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks"
On Tue, 2018-03-13 at 13:07 -0400, Mike Snitzer wrote:
> Just a thought: Maybe dm-4.16 (rc4 based) is missing a blk-mq fix?
>
> Might be worth cherry-picking the 2 topmost commits from dm-4.16 into
> the linus-based (rc5) tree you reported the original issue against?
>
> (looking at jens' rc5 block pull request, it seems unlikely but...)

Hello Mike,

Even if that would be the case, that can't have been the cause of what I
reported. Before I run any dm tests I merge the block layer, SCSI and RDMA
changes that are scheduled for the next kernel version into the dm tree.

Bart.
Re: [dm-devel] Revert "dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks"
On Tue, Mar 13 2018 at 1:02pm -0400, Mike Snitzer wrote:
> On Tue, Mar 13 2018 at 12:43pm -0400,
> Bart Van Assche wrote:
>
> > On Mon, 2018-03-12 at 21:23 -0400, Mike Snitzer wrote:
> > > Anyway, I'm hopeful I fixed the issue you reported. Please feel free to
> > > test the 2 topmost commits I've staged in linux-next, via dm-4.16:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.16
> >
> > So far I haven't seen the crash I reported yesterday in my tests with dm-4.16
> > of this morning. But I see hanging dm requests. I did not see any such issue
> > yesterday with the revert I posted applied on top of dm-4.16. What I see today
> > in the logs is the following:
> >
> > INFO: task kworker/3:150:1681 blocked for more than 120 seconds
> >
> > and in debugfs:
> > # (cd /sys/kernel/debug/block/ && grep -r op= .)
> > ./dm-2/requeue_list:df366fff {.op=READ, .cmd_flags=, .rq_flags=SORTED|STARTED|SOFTBARRIER|ELVPRIV|IO_STAT, .state=idle, .tag=-1, .internal_tag=45}
> > ./dm-0/requeue_list:8302ea45 {.op=READ, .cmd_flags=, .rq_flags=SORTED|STARTED|SOFTBARRIER|ELVPRIV|IO_STAT, .state=idle, .tag=-1, .internal_tag=211}
> > ./dm-1/requeue_list:7ea8ad0e {.op=READ, .cmd_flags=, .rq_flags=SOFTBARRIER|IO_STAT, .state=idle, .tag=428, .internal_tag=-1}
> > ./dm-1/requeue_list:e93ecaa8 {.op=READ, .cmd_flags=, .rq_flags=SOFTBARRIER|IO_STAT, .state=idle, .tag=429, .internal_tag=-1}
>
> Strange.. but I'll review closer. Clearly requests aren't getting
> pulled off the request_list

Just a thought: Maybe dm-4.16 (rc4 based) is missing a blk-mq fix?

Might be worth cherry-picking the 2 topmost commits from dm-4.16 into
the linus-based (rc5) tree you reported the original issue against?

(looking at jens' rc5 block pull request, it seems unlikely but...)
Re: [dm-devel] Revert "dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks"
On Tue, Mar 13 2018 at 12:43pm -0400, Bart Van Assche wrote:
> On Mon, 2018-03-12 at 21:23 -0400, Mike Snitzer wrote:
> > Anyway, I'm hopeful I fixed the issue you reported. Please feel free to
> > test the 2 topmost commits I've staged in linux-next, via dm-4.16:
> > https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.16
>
> So far I haven't seen the crash I reported yesterday in my tests with dm-4.16
> of this morning. But I see hanging dm requests. I did not see any such issue
> yesterday with the revert I posted applied on top of dm-4.16. What I see today
> in the logs is the following:
>
> INFO: task kworker/3:150:1681 blocked for more than 120 seconds
>
> and in debugfs:
> # (cd /sys/kernel/debug/block/ && grep -r op= .)
> ./dm-2/requeue_list:df366fff {.op=READ, .cmd_flags=, .rq_flags=SORTED|STARTED|SOFTBARRIER|ELVPRIV|IO_STAT, .state=idle, .tag=-1, .internal_tag=45}
> ./dm-0/requeue_list:8302ea45 {.op=READ, .cmd_flags=, .rq_flags=SORTED|STARTED|SOFTBARRIER|ELVPRIV|IO_STAT, .state=idle, .tag=-1, .internal_tag=211}
> ./dm-1/requeue_list:7ea8ad0e {.op=READ, .cmd_flags=, .rq_flags=SOFTBARRIER|IO_STAT, .state=idle, .tag=428, .internal_tag=-1}
> ./dm-1/requeue_list:e93ecaa8 {.op=READ, .cmd_flags=, .rq_flags=SOFTBARRIER|IO_STAT, .state=idle, .tag=429, .internal_tag=-1}

Strange.. but I'll review closer. Clearly requests aren't getting
pulled off the request_list
Re: [dm-devel] Revert "dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks"
On Tue, Mar 13 2018 at 12:46pm -0400, Bart Van Assche wrote:
> On Tue, 2018-03-13 at 12:31 -0400, Mike Snitzer wrote:
> > But now I cannot get the test to run:
> >
> > # srp-test/run_tests -c -d -r 10 -t 02-mq
> > Unloaded the ib_srpt kernel module
> > Unloaded the rdma_rxe kernel module
> > SoftRoCE network interfaces: rxe0 rxe1 rxe2 rxe3
> > Zero-initializing /dev/ram0 ... done
> > Zero-initializing /dev/ram1 ... done
> > Configured SRP target driver
> > Running test srp-test/tests/02-mq ...
> > Test file I/O on top of multipath concurrently with logout and login (0 min; mq)
> > Unloaded the ib_srp kernel module
> > /dev/disk/by-id/dm-uuid-mpath-3600140572616d6469736b310: not found
> > Test srp-test/tests/02-mq failed
> >
> > [ 379.634518] ib_srp: QP creation failed for dev rxe1: -22
> > [ 379.639849] srpt/10.16.43.122: Unsupported SCSI Opcode 0xa3, sending CHECK_CONDITION.
> > [ 379.665891] sd 7:0:0:1: [sdk] Attached SCSI disk
> > [ 379.673312] ib_srp: QP creation failed for dev rxe2: -22
> > [ 379.688331] ib_srp: QP creation failed for dev rxe3: -22
> > [ 379.708324] ib_srp: bad dest parameter '[2620:52:0:102f:219:99ff:feb7:2648'
> > [ 379.724538] ib_srp: target creation request is missing one or more parameters
> > [ 379.740253] ib_srp: bad dest parameter '[2620:52:0:102f:219:99ff:feb7:2648'
> > [ 379.756531] ib_srp: target creation request is missing one or more parameters
> > [ 379.773242] ib_srp: bad dest parameter '[2620:52:0:102f:219:99ff:feb7:2648'
> > [ 379.789532] ib_srp: target creation request is missing one or more parameters
> > [ 379.805255] ib_srp: bad dest parameter '[2620:52:0:102f:219:99ff:feb7:2648'
> > [ 379.822532] ib_srp: target creation request is missing one or more parameters
>
> That's weird. I will see whether I can reproduce this with linux-next, since
> I have not yet tried to run srp-test against linux-next myself.

OK, I appreciate it (you'll need to revert those commits I shared in the
other linux-next thread).
Re: [dm-devel] Revert "dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks"
On Tue, 2018-03-13 at 11:25 -0400, Mike Snitzer wrote:
> A pointer to the commit ids in question would be helpful so I can
> appreciate the details better.

Not sure whether this list is complete, but this is the most important one:

63cf1a902c9dd6b0761861ea87fce3663f59403b IB/srpt: Add RDMA/CM support

> Why is there need for a virtual machine?
> Just using extra isolation so as not to conflict with anything on the
> host?

Running these tests directly on the host is also fine. I wanted to explain
that no InfiniBand hardware is required.

Bart.
Re: [dm-devel] Revert "dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks"
On Tue, Mar 13 2018 at 11:25am -0400, Mike Snitzer wrote:
> On Mon, Mar 12 2018 at 5:32pm -0400,
> Bart Van Assche wrote:
>
> > On Mon, 2018-03-12 at 17:23 -0400, Mike Snitzer wrote:
> > > Could you provide more details on your setup?
> > >
> > > Obviously you're using "queue_mode mq", what are your underlying paths?
> > >
> > > Given the trace it would seem you're hitting multipath_clone_and_map()'s
> > > blk_queue_dying(q) error path that calls activate_or_offline_path().
> >
> > Hello Mike,
> >
> > The reported kernel crash was triggered by running the following command:
> >
> > srp-test/run_tests -c -d -r 10 -t 02-mq
> >
> > The srp-test software is available at https://github.com/bvanassche/srp-test.
> > All patches necessary to run that script in a virtual machine (RoCE support
> > for the SRP initiator and target drivers) will be sent to Linus during the
> > kernel v4.17 merge window. These patches are already available in linux-next
> > today. Although I have not tried this myself, I expect that if you run the
> > above command against a kernel built from the linux-next code that that will
> > allow you to reproduce what I reported.
>
> A pointer to the commit ids in question would be helpful so I can
> appreciate the details better.
>
> Why is there need for a virtual machine?
> Just using extra isolation so as not to conflict with anything on the
> host?
>
> Anyway, I followed srp-test's README.md
>
> But sadly, today's linux-next (next-20180313) is a mess (lots of macro
> expansion breakage, for me anyway).

I fixed the linux-next build errors (I cc'd you on that).

But now I cannot get the test to run:

# srp-test/run_tests -c -d -r 10 -t 02-mq
Unloaded the ib_srpt kernel module
Unloaded the rdma_rxe kernel module
SoftRoCE network interfaces: rxe0 rxe1 rxe2 rxe3
Zero-initializing /dev/ram0 ... done
Zero-initializing /dev/ram1 ... done
Configured SRP target driver
Running test srp-test/tests/02-mq ...
Test file I/O on top of multipath concurrently with logout and login (0 min; mq)
Unloaded the ib_srp kernel module
/dev/disk/by-id/dm-uuid-mpath-3600140572616d6469736b310: not found
Test srp-test/tests/02-mq failed

[ 379.634518] ib_srp: QP creation failed for dev rxe1: -22
[ 379.639849] srpt/10.16.43.122: Unsupported SCSI Opcode 0xa3, sending CHECK_CONDITION.
[ 379.665891] sd 7:0:0:1: [sdk] Attached SCSI disk
[ 379.673312] ib_srp: QP creation failed for dev rxe2: -22
[ 379.688331] ib_srp: QP creation failed for dev rxe3: -22
[ 379.708324] ib_srp: bad dest parameter '[2620:52:0:102f:219:99ff:feb7:2648'
[ 379.724538] ib_srp: target creation request is missing one or more parameters
[ 379.740253] ib_srp: bad dest parameter '[2620:52:0:102f:219:99ff:feb7:2648'
[ 379.756531] ib_srp: target creation request is missing one or more parameters
[ 379.773242] ib_srp: bad dest parameter '[2620:52:0:102f:219:99ff:feb7:2648'
[ 379.789532] ib_srp: target creation request is missing one or more parameters
[ 379.805255] ib_srp: bad dest parameter '[2620:52:0:102f:219:99ff:feb7:2648'
[ 379.822532] ib_srp: target creation request is missing one or more parameters

But I realized that was with an old srp-test build.. so I tried to make
again.. seems your buildrequires has expanded:

Why does it need shellcheck? On RHEL, having to pull in EPEL packages
sucks.

And even once installed via EPEL (so ShellCheck-0.3.5-1.el7.x86_64):

# make
...
shellcheck -x -f gcc run_tests bin/getuid_callout lib/functions \
    tests/*[^~]
unrecognized option `-x'

Usage: shellcheck [OPTIONS...] FILES...
  -e CODE1,CODE2..  --exclude=CODE1,CODE2..  exclude types of warnings
  -f FORMAT         --format=FORMAT          output format
  -s SHELLNAME      --shell=SHELLNAME        Specify dialect (bash,sh,ksh,zsh)
  -V                --version                Print version information

make: *** [shellcheck] Error 3

In general this srp-test suite is way too exotic with its requirements.
Barrier to entry from a dumb user like me is _way_ too high.
Re: [dm-devel] Revert "dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks"
On Mon, Mar 12 2018 at 5:32pm -0400, Bart Van Assche wrote:
> On Mon, 2018-03-12 at 17:23 -0400, Mike Snitzer wrote:
> > Could you provide more details on your setup?
> >
> > Obviously you're using "queue_mode mq", what are your underlying paths?
> >
> > Given the trace it would seem you're hitting multipath_clone_and_map()'s
> > blk_queue_dying(q) error path that calls activate_or_offline_path().
>
> Hello Mike,
>
> The reported kernel crash was triggered by running the following command:
>
> srp-test/run_tests -c -d -r 10 -t 02-mq
>
> The srp-test software is available at https://github.com/bvanassche/srp-test.
> All patches necessary to run that script in a virtual machine (RoCE support
> for the SRP initiator and target drivers) will be sent to Linus during the
> kernel v4.17 merge window. These patches are already available in linux-next
> today. Although I have not tried this myself, I expect that if you run the
> above command against a kernel built from the linux-next code that that will
> allow you to reproduce what I reported.

A pointer to the commit ids in question would be helpful so I can
appreciate the details better.

Why is there need for a virtual machine?
Just using extra isolation so as not to conflict with anything on the
host?

Anyway, I followed srp-test's README.md

But sadly, today's linux-next (next-20180313) is a mess (lots of macro
expansion breakage, for me anyway).
Re: [dm-devel] Revert "dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks"
On Mon, Mar 12 2018 at 5:32pm -0400, Bart Van Assche wrote:
> On Mon, 2018-03-12 at 17:23 -0400, Mike Snitzer wrote:
> > Could you provide more details on your setup?
> >
> > Obviously you're using "queue_mode mq", what are your underlying paths?
> >
> > Given the trace it would seem you're hitting multipath_clone_and_map()'s
> > blk_queue_dying(q) error path that calls activate_or_offline_path().
>
> Hello Mike,
>
> The reported kernel crash was triggered by running the following command:
>
> srp-test/run_tests -c -d -r 10 -t 02-mq
>
> The srp-test software is available at https://github.com/bvanassche/srp-test.
> All patches necessary to run that script in a virtual machine (RoCE support
> for the SRP initiator and target drivers) will be sent to Linus during the
> kernel v4.17 merge window. These patches are already available in linux-next
> today. Although I have not tried this myself, I expect that if you run the
> above command against a kernel built from the linux-next code that that will
> allow you to reproduce what I reported.

OK, that test again. I clearly need to invest time to making it run on
my testbed. But that will take time (though hopefully I can cut through
it tomorrow).

Anyway, I'm hopeful I fixed the issue you reported. Please feel free to
test the 2 topmost commits I've staged in linux-next, via dm-4.16:
https://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm.git/log/?h=dm-4.16

Mike
Re: [dm-devel] Revert "dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks"
On Mon, 2018-03-12 at 17:23 -0400, Mike Snitzer wrote:
> Could you provide more details on your setup?
>
> Obviously you're using "queue_mode mq", what are your underlying paths?
>
> Given the trace it would seem you're hitting multipath_clone_and_map()'s
> blk_queue_dying(q) error path that calls activate_or_offline_path().

Hello Mike,

The reported kernel crash was triggered by running the following command:

srp-test/run_tests -c -d -r 10 -t 02-mq

The srp-test software is available at https://github.com/bvanassche/srp-test.
All patches necessary to run that script in a virtual machine (RoCE support
for the SRP initiator and target drivers) will be sent to Linus during the
kernel v4.17 merge window. These patches are already available in linux-next
today. Although I have not tried this myself, I expect that if you run the
above command against a kernel built from the linux-next code that that will
allow you to reproduce what I reported.

Thanks,

Bart.
Re: [dm-devel] Revert "dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks"
On Mon, Mar 12 2018 at 4:28pm -0400, Bart Van Assche wrote:
> This patch fixes the following kernel crash:
>
> INFO: trying to register non-static key.
> the code is fine but needs lockdep annotation.
> turning off the locking correctness validator.
> CPU: 1 PID: 155 Comm: kworker/1:1H Not tainted 4.16.0-rc5-dbg+ #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
> Workqueue: kblockd blk_mq_run_work_fn
> Call Trace:
>  dump_stack+0x85/0xc7
>  register_lock_class+0x82a/0x830
>  __lock_acquire+0x141/0x1b10
>  lock_acquire+0xc9/0x260
>  _raw_spin_lock_irqsave+0x41/0x50
>  __wake_up_common_lock+0x9e/0x100
>  pg_init_done+0x100/0x240 [dm_multipath]
>  multipath_clone_and_map+0x32c/0x340 [dm_multipath]
>  map_request+0xc1/0x550 [dm_mod]
>  dm_mq_queue_rq+0xf9/0x1a0 [dm_mod]
>  blk_mq_dispatch_rq_list+0x143/0xac0
>  blk_mq_sched_dispatch_requests+0x23d/0x2f0
>  __blk_mq_run_hw_queue+0xdb/0x160
>  process_one_work+0x441/0xa50
>  worker_thread+0x76/0x6c0
>  kthread+0x1b2/0x1d0
>  ret_from_fork+0x24/0x30
> ==
> BUG: KASAN: null-ptr-deref in __wake_up_common+0x60/0x230
> Read of size 8 at addr by task kworker/1:1H/155
>
> CPU: 1 PID: 155 Comm: kworker/1:1H Not tainted 4.16.0-rc5-dbg+ #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
> Workqueue: kblockd blk_mq_run_work_fn
> Call Trace:
>  dump_stack+0x85/0xc7
>  kasan_report+0x139/0x350
>  __wake_up_common+0x60/0x230
>  __wake_up_common_lock+0xb9/0x100
>  pg_init_done+0x100/0x240 [dm_multipath]
>  multipath_clone_and_map+0x32c/0x340 [dm_multipath]
>  map_request+0xc1/0x550 [dm_mod]
>  dm_mq_queue_rq+0xf9/0x1a0 [dm_mod]
>  blk_mq_dispatch_rq_list+0x143/0xac0
>  blk_mq_sched_dispatch_requests+0x23d/0x2f0
>  __blk_mq_run_hw_queue+0xdb/0x160
>  process_one_work+0x441/0xa50
>  worker_thread+0x76/0x6c0
>  kthread+0x1b2/0x1d0
>  ret_from_fork+0x24/0x30
> ==
>
> Fixes: 8d47e65948dd ("dm mpath: remove unnecessary NVMe branching in favor of scsi_dh checks")
> Signed-off-by: Bart Van Assche

Sorry for your troubles but reverting isn't the proper way to handle
this (yet).

Could you provide more details on your setup?

Obviously you're using "queue_mode mq", what are your underlying paths?

Given the trace it would seem you're hitting multipath_clone_and_map()'s
blk_queue_dying(q) error path that calls activate_or_offline_path().

Would be useful to know the crash utility's output for:

	dis -l pg_init_done+0x100

But I'd imagine it isn't happy here:

	wake_up(&m->pg_init_wait);

Given the commit in question, I am assuming there is something about
this setup_scsi_dh() code that is causing m->pg_init_wait to not be
initialized:

	/*
	 * Init fields that are only used when a scsi_dh is attached
	 */
	if (!test_and_set_bit(MPATHF_QUEUE_IO, &m->flags)) {
		atomic_set(&m->pg_init_in_progress, 0);
		atomic_set(&m->pg_init_count, 0);
		m->pg_init_delay_msecs = DM_PG_INIT_DELAY_DEFAULT;
		init_waitqueue_head(&m->pg_init_wait);
	}

Wonder if having made that initialization conditional is the
culprit... that was needed because setup_scsi_dh() is called multiple
times now. Whereas before this commit it was only done once as part of
the initial multipath table load (in alloc_multipath_stage2).

I'll keep looking at this.

Mike
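The hypothesis above can be modeled in user space. The sketch below is not
kernel code and does not establish what actually happened in the reported
crash: every name in it (model_multipath, model_setup_conditional, and so on)
is hypothetical, the flag helper is a non-atomic stand-in for
test_and_set_bit(), and a boolean stands in for init_waitqueue_head() having
run. It only illustrates how a wait queue head whose initialization is guarded
by a flag can still be uninitialized when a wake-up arrives.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a flag bit such as MPATHF_QUEUE_IO. */
#define MODEL_QUEUE_IO (1u << 0)

/* 'initialized' models whether init_waitqueue_head() has been called. */
struct model_waitq {
	bool initialized;
};

struct model_multipath {
	unsigned int flags;
	struct model_waitq pg_init_wait;
};

/*
 * Non-atomic stand-in for test_and_set_bit(): sets the bit and returns
 * whether it was already set beforehand.
 */
static bool model_test_and_set(unsigned int *flags, unsigned int bit)
{
	bool was_set = (*flags & bit) != 0;

	*flags |= bit;
	return was_set;
}

/*
 * Conditional initialization, shaped like the quoted fragment: the wait
 * queue is only set up by the caller that flips the flag from 0 to 1.
 */
static void model_setup_conditional(struct model_multipath *m)
{
	if (!model_test_and_set(&m->flags, MODEL_QUEUE_IO))
		m->pg_init_wait.initialized = true;
}

/* Alternative: initialize unconditionally, once, at allocation time. */
static void model_alloc_init(struct model_multipath *m)
{
	m->flags = 0;
	m->pg_init_wait.initialized = true;
}

/*
 * Returns 0 when waking is safe and -1 where a real wake_up() would
 * operate on a wait queue head that was never initialized.
 */
static int model_wake_up(const struct model_multipath *m)
{
	return m->pg_init_wait.initialized ? 0 : -1;
}
```

In this model, model_wake_up() returns -1 on a zeroed struct when the setup
helper never ran, and also when the flag bit was already set before the helper
ran, since the guarded branch is then skipped. Calling the helper repeatedly
on a fresh struct is harmless, and initializing at allocation time avoids both
failing cases.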