Re: Hard lockup in ext4_finish_bio
On 10/08/2015 09:56 PM, John Stoffel wrote:
>> "Nikolay" == Nikolay Borisov writes:
>
> Nikolay> On 10/08/2015 05:34 PM, John Stoffel wrote:
> >> Great bug report, but you're missing the info on which kernel
> >> you're
>
> Nikolay> This is on 3.12.47 (self compiled). It was evident on my
> Nikolay> initial post, but I did forget to mention that in the
> Nikolay> reply. Also, I suspect even current kernels are susceptible to
> Nikolay> this since the locking in question hasn't changed.
>
> Hi Nikolay, must have missed it. I looked quickly, but didn't find
> it. Since it's such an older kernel release, it might be best if you
> upgrade to the latest version and try to re-create the lockup if at
> all possible.

The relevant code hasn't changed. I've already posted RFC patches. You
can find the thread here:

http://thread.gmane.org/gmane.linux.kernel/2056996

> What kind of workload are you running on there? And if you have more
> details on the hardware, that might help as well.
>
> John

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: Hard lockup in ext4_finish_bio
>> "Nikolay" == Nikolay Borisov writes:

Nikolay> On 10/08/2015 05:34 PM, John Stoffel wrote:
>> Great bug report, but you're missing the info on which kernel
>> you're

Nikolay> This is on 3.12.47 (self compiled). It was evident on my
Nikolay> initial post, but I did forget to mention that in the
Nikolay> reply. Also, I suspect even current kernels are susceptible to
Nikolay> this since the locking in question hasn't changed.

Hi Nikolay, must have missed it. I looked quickly, but didn't find
it. Since it's such an older kernel release, it might be best if you
upgrade to the latest version and try to re-create the lockup if at
all possible.

What kind of workload are you running on there? And if you have more
details on the hardware, that might help as well.

John
Re: Hard lockup in ext4_finish_bio
Great bug report, but you're missing the info on which kernel you're
running here... is this a vendor kernel or self-compiled?

Nikolay> I've hit a rather strange hard lockup on one of my servers
Nikolay> from the page writeback path, the actual backtrace is:

Nikolay> [427149.717553] WARNING: CPU: 23 PID: 4611 at
Nikolay> kernel/watchdog.c:245 watchdog_overflow_callback+0x98/0xc0()
Nikolay> [427149.718216] Watchdog detected hard LOCKUP on cpu 23

[quoted backtrace snipped; the full trace appears in the original
report below]
Re: Hard lockup in ext4_finish_bio
On 10/08/2015 05:34 PM, John Stoffel wrote:
> Great bug report, but you're missing the info on which kernel you're
> running here... is this a vendor kernel or self-compiled?

This is on 3.12.47 (self compiled). It was evident on my initial post,
but I did forget to mention that in the reply. Also, I suspect even
current kernels are susceptible to this since the locking in question
hasn't changed.

Regards,
Nikolay
Re: Hard lockup in ext4_finish_bio
[Adding the dm people since I believe this might be in their field as
well.]

After further investigation I discovered that there was indeed a task
executing block_write_full_page_endio (this function is supposed to
eventually call end_buffer_async_write, which would then unblock the
writeback flusher). Here is what its call stack looks like:

PID: 34220  TASK: 883937660810  CPU: 44  COMMAND: "kworker/u98:39"
 #0 [88209d5b10b8] __schedule at 81653d5a
 #1 [88209d5b1150] schedule at 816542f9
 #2 [88209d5b1160] schedule_preempt_disabled at 81654686
 #3 [88209d5b1180] __mutex_lock_slowpath at 816521eb
 #4 [88209d5b1200] mutex_lock at 816522d1
 #5 [88209d5b1220] new_read at a0152a7e [dm_bufio]
 #6 [88209d5b1280] dm_bufio_get at a0152ba6 [dm_bufio]
 #7 [88209d5b1290] dm_bm_read_try_lock at a015c878 [dm_persistent_data]
 #8 [88209d5b12e0] dm_tm_read_lock at a015f7ad [dm_persistent_data]
 #9 [88209d5b12f0] bn_read_lock at a016281b [dm_persistent_data]
#10 [88209d5b1300] ro_step at a0162856 [dm_persistent_data]
#11 [88209d5b1320] dm_btree_lookup at a01610e4 [dm_persistent_data]
#12 [88209d5b13d0] dm_thin_find_block at a017d20c [dm_thin_pool]
#13 [88209d5b1430] thin_bio_map at a017866b [dm_thin_pool]
#14 [88209d5b1540] thin_map at a0178825 [dm_thin_pool]
#15 [88209d5b1550] __map_bio at 81546a0e
#16 [88209d5b1590] __clone_and_map_data_bio at 81546d81
#17 [88209d5b1600] __split_and_process_bio at 815471cd
#18 [88209d5b16b0] dm_request at 815473a6
#19 [88209d5b16e0] generic_make_request at 812fbe0a
#20 [88209d5b1710] submit_bio at 812fbeb9
#21 [88209d5b1780] _submit_bh at 811d78a8
#22 [88209d5b17b0] submit_bh at 811d7990
#23 [88209d5b17c0] __block_write_full_page at 811da7ef
#24 [88209d5b1840] block_write_full_page_endio at 811daa0b
#25 [88209d5b1890] block_write_full_page at 811daa65
#26 [88209d5b18a0] blkdev_writepage at 811df658
#27 [88209d5b18b0] __writepage at 81144567
#28 [88209d5b18d0] write_cache_pages at 811467e3
#29 [88209d5b1a20] generic_writepages at 81146af1
#30 [88209d5b1a80] do_writepages at 81146b40
#31 [88209d5b1a90] __writeback_single_inode at 811cec79
#32 [88209d5b1ae0] writeback_sb_inodes at 811cfdee
#33 [88209d5b1bc0] __writeback_inodes_wb at 811d00ee
#34 [88209d5b1c10] wb_writeback at 811d035b
#35 [88209d5b1cc0] wb_do_writeback at 811d062c
#36 [88209d5b1d60] bdi_writeback_workfn at 811d0bf8
#37 [88209d5b1df0] process_one_work at 810a61e5
#38 [88209d5b1e40] worker_thread at 810a848a
#39 [88209d5b1ec0] kthread at 810ae48e
#40 [88209d5b1f50] ret_from_fork at 816571c8

So it in turn has also blocked on dm_bufio_lock in new_read. The owner
of this mutex turns out to be:

PID: 3  TASK: 881cbb2fb870  CPU: 2  COMMAND: "kworker/u96:0"
 #0 [881fffa46dc0] crash_nmi_callback at 8106f24e
 #1 [881fffa46de0] nmi_handle at 8104c152
 #2 [881fffa46e70] do_nmi at 8104c3b4
 #3 [881fffa46ef0] end_repeat_nmi at 81656e2e
    [exception RIP: smp_call_function_many+577]
    RIP: 810e7f81  RSP: 880d35b815c8  RFLAGS: 0202
    RAX: 0017  RBX: 81142690  RCX: 0017
    RDX: 883fff375478  RSI: 0040  RDI: 0040
    RBP: 880d35b81628  R8: 881fffa51ec8  R9:
    R10:  R11: 812943f3  R12:
    R13: 881fffa51ec0  R14: 881fffa51ec8  R15: 00011f00
    ORIG_RAX:  CS: 0010  SS: 0018
--- ---
 #4 [880d35b815c8] smp_call_function_many at 810e7f81
 #5 [880d35b81630] on_each_cpu_mask at 810e801c
 #6 [880d35b81660] drain_all_pages at 81140178
 #7 [880d35b81690] __alloc_pages_nodemask at 8114310b
 #8 [880d35b81810] alloc_pages_current at 81181c5e
 #9 [880d35b81860] new_slab at 81188305
#10 [880d35b818b0] __slab_alloc at 81189a30
#11 [880d35b81990] __kmalloc at 8118a70b
#12 [880d35b819e0] alloc_buffer at a0151b66 [dm_bufio]
#13 [880d35b81a20] __bufio_new at a0152831 [dm_bufio]
#14 [880d35b81a90] new_read at a0152a93 [dm_bufio]
#15 [880d35b81af0] dm_bufio_read at a0152b89 [dm_bufio]
#16 [880d35b81b00] dm_bm_read_lock at a015cb3d [dm_persistent_data]
#17 [880d35b81b80] dm_tm_read_lock at a015f79a [dm_persistent_data]
#18 [880d35b81b90] bn_read_lock at a016281b [dm_persistent_data]
#19 [880d35b81ba0] ro_step at a0162856
Hard lockup in ext4_finish_bio
Hello,

I've hit a rather strange hard lockup on one of my servers from the
page writeback path; the actual backtrace is:

[427149.717151] [ cut here ]
[427149.717553] WARNING: CPU: 23 PID: 4611 at kernel/watchdog.c:245 watchdog_overflow_callback+0x98/0xc0()
[427149.718216] Watchdog detected hard LOCKUP on cpu 23
[427149.718292] Modules linked in:
[427149.718723] tcp_diag inet_diag netconsole act_police cls_basic sch_ingress xt_pkttype xt_state veth openvswitch gre vxlan ip_tunnel xt_owner xt_conntrack iptable_mangle xt_nat iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat xt_CT nf_conntrack iptable_raw ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ipv6 ib_sa ext2 dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio dm_mirror dm_region_hash dm_log i2c_i801 lpc_ich mfd_core shpchp i oapic ioatdma igb i2c_algo_bit ses enclosure ipmi_devintf ipmi_si ipmi_msghandler ib_qib dca ib_mad ib_core
[427149.725321] CPU: 23 PID: 4611 Comm: kworker/u98:7 Not tainted 3.12.47-clouder1 #1
[427149.725690] Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015
[427149.726062] Workqueue: writeback bdi_writeback_workfn (flush-252:148)
[427149.726564] 00f5 883fff366b58 81651631 00f5
[427149.727212] 883fff366ba8 883fff366b98 81089a6c
[427149.727860] 883fd2f08000 883fff366ce8
[427149.728490] Call Trace:
[427149.728845] [] dump_stack+0x58/0x7f
[427149.729350] [] warn_slowpath_common+0x8c/0xc0
[427149.729712] [] warn_slowpath_fmt+0x46/0x50
[427149.730076] [] watchdog_overflow_callback+0x98/0xc0
[427149.730443] [] __perf_event_overflow+0x9c/0x250
[427149.730810] [] perf_event_overflow+0x14/0x20
[427149.731175] [] intel_pmu_handle_irq+0x1d6/0x3e0
[427149.739656] [] perf_event_nmi_handler+0x34/0x60
[427149.740027] [] nmi_handle+0xa2/0x1a0
[427149.740389] [] do_nmi+0x164/0x430
[427149.740754] [] end_repeat_nmi+0x1a/0x1e
[427149.741122] [] ? mempool_free_slab+0x17/0x20
[427149.741492] [] ? ext4_finish_bio+0x275/0x2a0
[427149.741854] [] ? ext4_finish_bio+0x275/0x2a0
[427149.742216] [] ? ext4_finish_bio+0x275/0x2a0
[427149.742579] <> [] ext4_end_bio+0xc8/0x120
[427149.743150] [] bio_endio+0x1d/0x40
[427149.743516] [] dec_pending+0x1c1/0x360
[427149.743878] [] clone_endio+0x76/0xa0
[427149.744239] [] bio_endio+0x1d/0x40
[427149.744599] [] dec_pending+0x1c1/0x360
[427149.744964] [] clone_endio+0x76/0xa0
[427149.745326] [] bio_endio+0x1d/0x40
[427149.745686] [] dec_pending+0x1c1/0x360
[427149.746048] [] clone_endio+0x76/0xa0
[427149.746407] [] bio_endio+0x1d/0x40
[427149.746773] [] blk_update_request+0x21b/0x450
[427149.747138] [] blk_update_bidi_request+0x27/0xb0
[427149.747513] [] blk_end_bidi_request+0x2f/0x80
[427149.748101] [] blk_end_request+0x10/0x20
[427149.748705] [] scsi_io_completion+0xbc/0x620
[427149.749297] [] scsi_finish_command+0xc9/0x130
[427149.749891] [] scsi_softirq_done+0x147/0x170
[427149.750491] [] blk_done_softirq+0x7d/0x90
[427149.751089] [] __do_softirq+0x137/0x2e0
[427149.751694] [] call_softirq+0x1c/0x30
[427149.752284] [] do_softirq+0x8d/0xc0
[427149.752892] [] irq_exit+0x95/0xa0
[427149.753526] [] smp_call_function_single_interrupt+0x35/0x40
[427149.754149] [] call_function_single_interrupt+0x6f/0x80
[427149.754750] [] ? memcpy+0x6/0x110
[427149.755572] [] ? __bio_clone+0x26/0x70
[427149.756179] [] __clone_and_map_data_bio+0x139/0x160
[427149.756814] [] __split_and_process_bio+0x3ed/0x490
[427149.757444] [] dm_request+0x136/0x1e0
[427149.758041] [] generic_make_request+0xca/0x100
[427149.758641] [] submit_bio+0x79/0x160
[427149.759035] [] ? account_page_writeback+0x2d/0x40
[427149.759406] [] ? __test_set_page_writeback+0x16d/0x1f0
[427149.759781] [] ext4_io_submit+0x29/0x50
[427149.760151] [] ext4_bio_write_page+0x12b/0x2f0
[427149.760519] [] mpage_submit_page+0x68/0x90
[427149.760887] [] mpage_process_page_bufs+0xf0/0x110
[427149.761257] [] mpage_prepare_extent_to_map+0x210/0x310
[427149.761624] [] ? ext4_writepages+0x361/0xc60
[427149.761998] [] ? __ext4_journal_start_sb+0x79/0x110
[427149.762362] [] ext4_writepages+0x398/0xc60
[427149.762726] [] ? blk_finish_plug+0x18/0x50
[427149.763089] [] do_writepages+0x20/0x40
[427149.763454] [] __writeback_single_inode+0x49/0x2b0
[427149.763819] [] ? wake_up_bit+0x2f/0x40
[427149.764182] [] writeback_sb_inodes+0x2de/0x540
[427149.764544] [] ? put_super+0x25/0x50
[427149.764903] [] __writeback_inodes_wb+0x9e/0xd0
[427149.765265] [] wb_writeback+0x23b/0x340
[427149.765628] [] wb_do_writeback+0x99/0x230
[427149.765996] [] ? set_worker_desc+0x81/0x90
[427149.766364] [] ? dequeue_task_fair+0x36a/0x4c0
[427149.766734] [] bdi_writeback_workfn+0x88/0x260
[427149.767101] [] ? finish_task_switch+0x4e/0xe0
[427149.767473] [] ? __schedule+0x2dc/0x760
[427149.767845] [] process_one_work+0x195/0x550