Re: Strange crash on Dell R720xd
On Tue, Oct 16, 2012 at 10:58:49AM -0700, Dan Williams wrote:
> I think this may be a bug in __raid_run_ops that is only possible when
> raid offload and CONFIG_MULTICORE_RAID456 are enabled. I'm thinking
> the descriptor is completed and recycled to another requester in the
> space between these two events:
>
>     ops_run_compute();
>
>     /* terminate the chain if reconstruct is not set to be run */
>     if (tx && !test_bit(STRIPE_OP_RECONSTRUCT, &ops_request))
>         async_tx_ack(tx);
>
> ...don't use the experimental CONFIG_MULTICORE_RAID456 even if you
> leave IOAT DMA disabled. A rework of the raid operation dma chaining
> is in progress, but may not be ready for a while.

Hi,

I usually don't use CONFIG_MULTICORE_RAID456, as it proved to be sluggish
and/or unstable in my experience, so I should be pretty safe leaving
I/O AT DMA disabled for now on those boxes.

Thanks

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: Strange crash on Dell R720xd
On Tue, Oct 16, 2012 at 5:52 AM, Laurent CARON <lca...@unix-scripts.info> wrote:
> On Tue, Oct 16, 2012 at 02:48:25PM +0200, Borislav Petkov wrote:
>> On Tue, Oct 16, 2012 at 11:26:01AM +0200, Laurent CARON wrote:
>> > On Tue, Oct 16, 2012 at 11:03:53AM +0200, Borislav Petkov wrote:
>> > > That's:
>> > >
>> > >     BUG_ON(async_tx_test_ack(depend_tx) ||
>> > >            txd_next(depend_tx) ||
>> > >            txd_parent(tx));
>> > >
>> > > but probably the b0rkage happens up the stack. And this __raid_run_ops
>> > > is probably starting the whole TX so maybe we should add
>> > > linux-r...@vger.kernel.org to CC. Added.
>> >
>> > Hi,
>> >
>> > The machines seem stable after disabling I/O AT DMA at the BIOS level.
>>
>> That's a good point because the backtrace goes through I/O AT DMA so it
>> could very well be the culprit. Let's add some more people to Cc.
>>
>> Vinod/Dan, here's the BUG_ON Laurent is hitting:
>>
>> http://marc.info/?l=linux-kernel&m=135033064724794&w=2
>>
>> and it has ioat2_tx_submit_unlock in the backtrace. Disabling ioat dma
>> in the BIOS makes the issue disappear so ...
>>
>> > > What is that "r510" thing in the kernel version? You have your patches
>> > > ontop? If yes, please try reproducing this with a kernel.org kernel
>> > > without anything else ontop.
>> >
>> > My kernel is vanilla from Kernel.org. The -r510 string is because I
>> > tried it on a -r510 also.
>>
>> Ok, good.
>>
>> > > Also, it might be worth trying plain 3.6 to rule out a regression
>> > > introduced in the stable 3.6 series.
>> >
>> > I tried 3.5.x, 3.6, 3.6.1, 3.6.2 with exactly the same results.
>> >
>> > For now, I did create more volumes, rsync lots of data over the network
>> > to the disks with no crashes (after disabling I/O AT DMA).
>>
>> And when you do this with ioat dma enabled, you get the bug, right? So
>> it is reproducible...?
>
> It is 100% reproducible. The only "nondeterministic" point is the time
> it takes to have the machine crash.

I think this may be a bug in __raid_run_ops that is only possible when
raid offload and CONFIG_MULTICORE_RAID456 are enabled. I'm thinking
the descriptor is completed and recycled to another requester in the
space between these two events:

    ops_run_compute();

    /* terminate the chain if reconstruct is not set to be run */
    if (tx && !test_bit(STRIPE_OP_RECONSTRUCT, &ops_request))
        async_tx_ack(tx);

...don't use the experimental CONFIG_MULTICORE_RAID456 even if you
leave IOAT DMA disabled. A rework of the raid operation dma chaining
is in progress, but may not be ready for a while.

--
Dan

Re: Strange crash on Dell R720xd
On Tue, Oct 16, 2012 at 02:48:25PM +0200, Borislav Petkov wrote:
> On Tue, Oct 16, 2012 at 11:26:01AM +0200, Laurent CARON wrote:
> > On Tue, Oct 16, 2012 at 11:03:53AM +0200, Borislav Petkov wrote:
> > > That's:
> > >
> > >     BUG_ON(async_tx_test_ack(depend_tx) ||
> > >            txd_next(depend_tx) ||
> > >            txd_parent(tx));
> > >
> > > but probably the b0rkage happens up the stack. And this __raid_run_ops
> > > is probably starting the whole TX so maybe we should add
> > > linux-r...@vger.kernel.org to CC. Added.
> >
> > Hi,
> >
> > The machines seem stable after disabling I/O AT DMA at the BIOS level.
>
> That's a good point because the backtrace goes through I/O AT DMA so it
> could very well be the culprit. Let's add some more people to Cc.
>
> Vinod/Dan, here's the BUG_ON Laurent is hitting:
>
> http://marc.info/?l=linux-kernel&m=135033064724794&w=2
>
> and it has ioat2_tx_submit_unlock in the backtrace. Disabling ioat dma
> in the BIOS makes the issue disappear so ...
>
> > > What is that "r510" thing in the kernel version? You have your patches
> > > ontop? If yes, please try reproducing this with a kernel.org kernel
> > > without anything else ontop.
> >
> > My kernel is vanilla from Kernel.org. The -r510 string is because I
> > tried it on a -r510 also.
>
> Ok, good.
>
> > > Also, it might be worth trying plain 3.6 to rule out a regression
> > > introduced in the stable 3.6 series.
> >
> > I tried 3.5.x, 3.6, 3.6.1, 3.6.2 with exactly the same results.
> >
> > For now, I did create more volumes, rsync lots of data over the network
> > to the disks with no crashes (after disabling I/O AT DMA).
>
> And when you do this with ioat dma enabled, you get the bug, right? So
> it is reproducible...?

It is 100% reproducible. The only "nondeterministic" point is the time
it takes to have the machine crash.

Re: Strange crash on Dell R720xd
On Tue, Oct 16, 2012 at 11:26:01AM +0200, Laurent CARON wrote:
> On Tue, Oct 16, 2012 at 11:03:53AM +0200, Borislav Petkov wrote:
> > That's:
> >
> >     BUG_ON(async_tx_test_ack(depend_tx) ||
> >            txd_next(depend_tx) ||
> >            txd_parent(tx));
> >
> > but probably the b0rkage happens up the stack. And this __raid_run_ops
> > is probably starting the whole TX so maybe we should add
> > linux-r...@vger.kernel.org to CC. Added.
>
> Hi,
>
> The machines seem stable after disabling I/O AT DMA at the BIOS level.

That's a good point because the backtrace goes through I/O AT DMA so it
could very well be the culprit. Let's add some more people to Cc.

Vinod/Dan, here's the BUG_ON Laurent is hitting:

http://marc.info/?l=linux-kernel&m=135033064724794&w=2

and it has ioat2_tx_submit_unlock in the backtrace. Disabling ioat dma
in the BIOS makes the issue disappear so ...

> > What is that "r510" thing in the kernel version? You have your patches
> > ontop? If yes, please try reproducing this with a kernel.org kernel
> > without anything else ontop.
>
> My kernel is vanilla from Kernel.org. The -r510 string is because I
> tried it on a -r510 also.

Ok, good.

> > Also, it might be worth trying plain 3.6 to rule out a regression
> > introduced in the stable 3.6 series.
>
> I tried 3.5.x, 3.6, 3.6.1, 3.6.2 with exactly the same results.
>
> For now, I did create more volumes, rsync lots of data over the network
> to the disks with no crashes (after disabling I/O AT DMA).

And when you do this with ioat dma enabled, you get the bug, right? So
it is reproducible...?

Thanks.

--
Regards/Gruss,
Boris.

Re: Strange crash on Dell R720xd
On Tue, Oct 16, 2012 at 11:03:53AM +0200, Borislav Petkov wrote:
> That's:
>
>     BUG_ON(async_tx_test_ack(depend_tx) ||
>            txd_next(depend_tx) ||
>            txd_parent(tx));
>
> but probably the b0rkage happens up the stack. And this __raid_run_ops
> is probably starting the whole TX so maybe we should add
> linux-r...@vger.kernel.org to CC. Added.

Hi,

The machines seem stable after disabling I/O AT DMA at the BIOS level.

> What is that "r510" thing in the kernel version? You have your patches
> ontop? If yes, please try reproducing this with a kernel.org kernel
> without anything else ontop.

My kernel is vanilla from Kernel.org. The -r510 string is because I
tried it on a -r510 also.

> Also, it might be worth trying plain 3.6 to rule out a regression
> introduced in the stable 3.6 series.

I tried 3.5.x, 3.6, 3.6.1, 3.6.2 with exactly the same results.

For now, I did create more volumes, rsync lots of data over the network
to the disks with no crashes (after disabling I/O AT DMA).

...snip...

Re: Strange crash on Dell R720xd
On Mon, Oct 15, 2012 at 09:42:58PM +0200, Laurent CARON wrote:
> Hi,
>
> I'm currently replacing an old system (HP DL 380 G5) by new Dell R720xd.
> On those new boxes I did configure the H310 controller as plain JBOD.
>
> Those boxes appear to crash more often than not (from 5 mins to a couple
> of hours).
> I have the impression those crashes appear under heavy IO.
>
> The setup consists of a few md RAID arrays serving as underlying devices
> for either filesystem, or drbd (plus lvm on top).
>
> I managed to catch a trace over netconsole:
> [ cut here ]
> kernel BUG at crypto/async_tx/async_tx.c:174!

That's:

    BUG_ON(async_tx_test_ack(depend_tx) ||
           txd_next(depend_tx) ||
           txd_parent(tx));

but probably the b0rkage happens up the stack. And this __raid_run_ops
is probably starting the whole TX so maybe we should add
linux-r...@vger.kernel.org to CC. Added.

> invalid opcode: [#1] SMP
> Modules linked in: drbd lru_cache netconsole iptable_filter ip_tables
> ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue
> bonding ipv6 btrfs ioatdma lpc_ich sb_edac dca mfd_core
> CPU 0
> Pid: 12580, comm: kworker/u:2 Not tainted 3.6.2-r510-r720xd #1 Dell Inc.
> PowerEdge R720xd

What is that "r510" thing in the kernel version? You have your patches
ontop? If yes, please try reproducing this with a kernel.org kernel
without anything else ontop.

Also, it might be worth trying plain 3.6 to rule out a regression
introduced in the stable 3.6 series.

(leaving in the rest for reference)

> RIP: 0010:[8130f9ab] [8130f9ab] async_tx_submit+0x29/0xab
> RSP: 0018:88100940fb30 EFLAGS: 00010202
> RAX: 88100b30aeb0 RBX: 88080b5cf390 RCX: 0029
> RDX: 88100940fd00 RSI: 88080b5cf390 RDI: 880809ad0818
> RBP: 8808054a7d90 R08: 88080b5cf900 R09: 0001
> R10: 1000 R11: 0001 R12: 88100940fd00
> R13: 0002 R14: 880809ad0638 R15: 880809ad0818
> FS: () GS:88080fc0() knlGS:
> CS: 0010 DS: ES: CR0: 8005003b
> CR2: ff600400 CR3: 000e4055f000 CR4: 000407f0
> DR0: DR1: DR2:
> DR3: DR6: 0ff0 DR7: 0400
> Process kworker/u:2 (pid: 12580, threadinfo 88100940e000, task 880804850630)
> Stack:
> 88100940fd00 88100940fc40 0101 8131044b
> 0001 0246 0201 a0073a00
> 8808054a7d90 8808054a7690 88100940fc40 88080bf9e668
> Call Trace:
> [8131044b] ? do_async_gen_syndrome+0x2f3/0x320
> [a0073a00] ? ioat2_tx_submit_unlock+0xac/0xb3 [ioatdma]
> [815e6820] ? ops_complete_compute+0x7b/0x7b
> [81310540] ? async_gen_syndrome+0xc8/0x1d6
> [815e8b9a] ? __raid_run_ops+0x9e7/0xb5a
> [810848f0] ? select_task_rq_fair+0x487/0x74b
> [815e6820] ? ops_complete_compute+0x7b/0x7b
> [8107e40b] ? __wake_up+0x35/0x46
> [8107ca2a] ? async_schedule+0x12/0x12
> [815e8d3f] ? async_run_ops+0x32/0x3e
> [8107cace] ? async_run_entry_fn+0xa4/0x17e
> [8107ca2a] ? async_schedule+0x12/0x12
> [81071cf8] ? process_one_work+0x259/0x381
> [81072312] ? worker_thread+0x2ad/0x3e3
> [81082e50] ? try_to_wake_up+0x1fc/0x20c
> [81072065] ? manage_workers+0x245/0x245
> [81072065] ? manage_workers+0x245/0x245
> [8107746a] ? kthread+0x81/0x89
> [81791034] ? kernel_thread_helper+0x4/0x10
> [810773e9] ? kthread_freezable_should_stop+0x4e/0x4e
> [81791030] ? gs_change+0xb/0xb
> Code: 5b c3 41 54 49 89 d4 55 53 48 89 f3 48 8b 6a 08 48 8b 42 10 48 85 ed 48
> 89 46 20 48 8b 42 18 48 89 46 28 74 5c f6 45 04 02 74 72 <0f> 0b eb fe 48 8b
> 02 48 8b 48 28 80 e1 40 74 24 31 f6 48 89 d7
> RIP [8130f9ab] async_tx_submit+0x29/0xab
> RSP 88100940fb30
> ---[ end trace 64fb561d16a3b535 ]---
> Kernel panic - not syncing: Fatal exception in interrupt
> Rebooting in 5 seconds..
>
> Do any of you guys have a clue about it ?
>
> Thanks
>
> Laurent
>
> PS: The very same kernel doesn't cause any trouble on R510 hardware.

--
Regards/Gruss,
Boris.
