Re: Strange crash on Dell R720xd

2012-10-17 Thread Laurent CARON
On Tue, Oct 16, 2012 at 10:58:49AM -0700, Dan Williams wrote:
> I think this may be a bug in __raid_run_ops that is only possible when
> raid offload and CONFIG_MULTICORE_RAID456 are enabled.  I'm thinking
> the descriptor is completed and recycled to another requester in the
> space between these two events:
> 
> ops_run_compute();
> 
> /* terminate the chain if reconstruct is not set to be run */
> if (tx && !test_bit(STRIPE_OP_RECONSTRUCT, &ops_request))
>         async_tx_ack(tx);
> 
> ...don't use the experimental CONFIG_MULTICORE_RAID456 even if you
> leave IOAT DMA disabled.  A rework of the raid operation dma chaining
> is in progress, but may not be ready for a while.

Hi,

I usually don't use CONFIG_MULTICORE_RAID456, as it has proved sluggish
and/or unstable in my experience, so I should be pretty safe leaving I/O
AT DMA disabled for now on those boxes.

Thanks

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: Strange crash on Dell R720xd

2012-10-16 Thread Dan Williams
On Tue, Oct 16, 2012 at 5:52 AM, Laurent CARON <lca...@unix-scripts.info> wrote:
> On Tue, Oct 16, 2012 at 02:48:25PM +0200, Borislav Petkov wrote:
>> On Tue, Oct 16, 2012 at 11:26:01AM +0200, Laurent CARON wrote:
>> > On Tue, Oct 16, 2012 at 11:03:53AM +0200, Borislav Petkov wrote:
>> > > That's:
>> > >
>> > > BUG_ON(async_tx_test_ack(depend_tx) || txd_next(depend_tx) ||
>> > >        txd_parent(tx));
>> > >
>> > > but probably the b0rkage happens up the stack. And this __raid_run_ops
>> > > is probably starting the whole TX so maybe we should add
>> > > linux-r...@vger.kernel.org to CC. Added.
>> >
>> >
>> > Hi,
>> >
>> > The machines seem stable after disabling I/O AT DMA at the BIOS level.
>>
>> That's a good point because the backtrace goes through I/O AT DMA so it
>> could very well be the culprit. Let's add some more people to Cc.
>>
>> Vinod/Dan, here's the BUG_ON Laurent is hitting:
>>
>> http://marc.info/?l=linux-kernel&m=135033064724794&w=2
>>
>> and it has ioat2_tx_submit_unlock in the backtrace. Disabling ioat dma
>> in the BIOS makes the issue disappear so ...
>>
>> > > What is that "r510" thing in the kernel version? You have your patches
>> > > ontop? If yes, please try reproducing this with a kernel.org kernel
>> > > without anything else ontop.
>> >
>> > My kernel is vanilla from Kernel.org. The -r510 string is because I
>> > tried it on a -r510 also.
>>
>> Ok, good.
>>
>> > > Also, it might be worth trying plain 3.6 to rule out a regression
>> > > introduced in the stable 3.6 series.
>> >
>> > I tried 3.5.x, 3.6, 3.6.1, 3.6.2 with exactly the same results.
>> >
>> > For now, I did create more volumes, rsync lots of data over the network
>> > to the disks with no crashes (after disabling I/O AT DMA).
>>
>> And when you do this with ioat dma enabled, you get the bug, right? So
>> it is reproducible...?
>
> It is 100% reproducible. The only "nondeterministic" point is the time
> it takes to have the machine crash.
>

I think this may be a bug in __raid_run_ops that is only possible when
raid offload and CONFIG_MULTICORE_RAID456 are enabled.  I'm thinking
the descriptor is completed and recycled to another requester in the
space between these two events:

ops_run_compute();

/* terminate the chain if reconstruct is not set to be run */
if (tx && !test_bit(STRIPE_OP_RECONSTRUCT, &ops_request))
        async_tx_ack(tx);

...don't use the experimental CONFIG_MULTICORE_RAID456 even if you
leave IOAT DMA disabled.  A rework of the raid operation dma chaining
is in progress, but may not be ready for a while.


--
Dan


Re: Strange crash on Dell R720xd

2012-10-16 Thread Laurent CARON
On Tue, Oct 16, 2012 at 02:48:25PM +0200, Borislav Petkov wrote:
> On Tue, Oct 16, 2012 at 11:26:01AM +0200, Laurent CARON wrote:
> > On Tue, Oct 16, 2012 at 11:03:53AM +0200, Borislav Petkov wrote:
> > > That's:
> > > 
> > > BUG_ON(async_tx_test_ack(depend_tx) || txd_next(depend_tx) ||
> > >        txd_parent(tx));
> > > 
> > > but probably the b0rkage happens up the stack. And this __raid_run_ops
> > > is probably starting the whole TX so maybe we should add
> > > linux-r...@vger.kernel.org to CC. Added.
> > 
> > 
> > Hi,
> > 
> > The machines seem stable after disabling I/O AT DMA at the BIOS level.
> 
> That's a good point because the backtrace goes through I/O AT DMA so it
> could very well be the culprit. Let's add some more people to Cc.
> 
> Vinod/Dan, here's the BUG_ON Laurent is hitting:
> 
> http://marc.info/?l=linux-kernel&m=135033064724794&w=2
> 
> and it has ioat2_tx_submit_unlock in the backtrace. Disabling ioat dma
> in the BIOS makes the issue disappear so ...
> 
> > > What is that "r510" thing in the kernel version? You have your patches
> > > ontop? If yes, please try reproducing this with a kernel.org kernel
> > > without anything else ontop.
> > 
> > My kernel is vanilla from Kernel.org. The -r510 string is because I
> > tried it on a -r510 also.
> 
> Ok, good.
> 
> > > Also, it might be worth trying plain 3.6 to rule out a regression
> > > introduced in the stable 3.6 series.
> > 
> > I tried 3.5.x, 3.6, 3.6.1, 3.6.2 with exactly the same results.
> > 
> > For now, I did create more volumes, rsync lots of data over the network
> > to the disks with no crashes (after disabling I/O AT DMA).
> 
> And when you do this with ioat dma enabled, you get the bug, right? So
> it is reproducible...?

It is 100% reproducible. The only "nondeterministic" point is the time
it takes to have the machine crash.




Re: Strange crash on Dell R720xd

2012-10-16 Thread Borislav Petkov
On Tue, Oct 16, 2012 at 11:26:01AM +0200, Laurent CARON wrote:
> On Tue, Oct 16, 2012 at 11:03:53AM +0200, Borislav Petkov wrote:
> > That's:
> > 
> > BUG_ON(async_tx_test_ack(depend_tx) || txd_next(depend_tx) ||
> >        txd_parent(tx));
> > 
> > but probably the b0rkage happens up the stack. And this __raid_run_ops
> > is probably starting the whole TX so maybe we should add
> > linux-r...@vger.kernel.org to CC. Added.
> 
> 
> Hi,
> 
> The machines seem stable after disabling I/O AT DMA at the BIOS level.

That's a good point because the backtrace goes through I/O AT DMA so it
could very well be the culprit. Let's add some more people to Cc.

Vinod/Dan, here's the BUG_ON Laurent is hitting:

http://marc.info/?l=linux-kernel&m=135033064724794&w=2

and it has ioat2_tx_submit_unlock in the backtrace. Disabling ioat dma
in the BIOS makes the issue disappear so ...

> > What is that "r510" thing in the kernel version? You have your patches
> > ontop? If yes, please try reproducing this with a kernel.org kernel
> > without anything else ontop.
> 
> My kernel is vanilla from Kernel.org. The -r510 string is because I
> tried it on a -r510 also.

Ok, good.

> > Also, it might be worth trying plain 3.6 to rule out a regression
> > introduced in the stable 3.6 series.
> 
> I tried 3.5.x, 3.6, 3.6.1, 3.6.2 with exactly the same results.
> 
> For now, I did create more volumes, rsync lots of data over the network
> to the disks with no crashes (after disabling I/O AT DMA).

And when you do this with ioat dma enabled, you get the bug, right? So
it is reproducible...?

Thanks.

-- 
Regards/Gruss,
Boris.


Re: Strange crash on Dell R720xd

2012-10-16 Thread Laurent CARON
On Tue, Oct 16, 2012 at 11:03:53AM +0200, Borislav Petkov wrote:
> That's:
> 
> BUG_ON(async_tx_test_ack(depend_tx) || txd_next(depend_tx) ||
>   txd_parent(tx));
> 
> but probably the b0rkage happens up the stack. And this __raid_run_ops
> is probably starting the whole TX so maybe we should add
> linux-r...@vger.kernel.org to CC. Added.


Hi,

The machines seem stable after disabling I/O AT DMA at the BIOS level.

> What is that "r510" thing in the kernel version? You have your patches
> ontop? If yes, please try reproducing this with a kernel.org kernel
> without anything else ontop.

My kernel is vanilla from Kernel.org. The -r510 string is because I
tried it on a -r510 also.

> Also, it might be worth trying plain 3.6 to rule out a regression
> introduced in the stable 3.6 series.

I tried 3.5.x, 3.6, 3.6.1, 3.6.2 with exactly the same results.

For now, I did create more volumes, rsync lots of data over the network
to the disks with no crashes (after disabling I/O AT DMA).

...snip...



Re: Strange crash on Dell R720xd

2012-10-16 Thread Borislav Petkov
On Mon, Oct 15, 2012 at 09:42:58PM +0200, Laurent CARON wrote:
> Hi,
> 
> I'm currently replacing an old system (HP DL 380 G5) with new Dell R720xd
> boxes. On those new boxes I configured the H310 controller as plain JBOD.
> 
> Those boxes appear to crash more often than not (from 5 mins to a couple
> of hours).
> I have the impression those crashes appear under heavy IO.
> 
> The setup consists of a few md RAID arrays serving as underlying devices
> for either filesystem, or drbd (plus lvm on top).
> 
> I managed to catch a trace over netconsole:
> ------------[ cut here ]------------
> kernel BUG at crypto/async_tx/async_tx.c:174!

That's:

BUG_ON(async_tx_test_ack(depend_tx) || txd_next(depend_tx) ||
txd_parent(tx));

but probably the b0rkage happens up the stack. And this __raid_run_ops
is probably starting the whole TX so maybe we should add
linux-r...@vger.kernel.org to CC. Added.

> invalid opcode:  [#1] SMP 
> Modules linked in: drbd lru_cache netconsole iptable_filter ip_tables 
> ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue 
> bonding ipv6 btrfs ioatdma lpc_ich sb_edac dca mfd_core
> CPU 0 
> Pid: 12580, comm: kworker/u:2 Not tainted 3.6.2-r510-r720xd #1 Dell Inc. 
> PowerEdge R720xd

What is that "r510" thing in the kernel version? You have your patches
ontop? If yes, please try reproducing this with a kernel.org kernel
without anything else ontop.

Also, it might be worth trying plain 3.6 to rule out a regression
introduced in the stable 3.6 series.

(leaving in the rest for reference)

> RIP: 0010:[]  [] async_tx_submit+0x29/0xab
> RSP: 0018:88100940fb30  EFLAGS: 00010202
> RAX: 88100b30aeb0 RBX: 88080b5cf390 RCX: 0029
> RDX: 88100940fd00 RSI: 88080b5cf390 RDI: 880809ad0818
> RBP: 8808054a7d90 R08: 88080b5cf900 R09: 0001
> R10: 1000 R11: 0001 R12: 88100940fd00
> R13: 0002 R14: 880809ad0638 R15: 880809ad0818
> FS:  () GS:88080fc0() knlGS:
> CS:  0010 DS:  ES:  CR0: 8005003b
> CR2: ff600400 CR3: 000e4055f000 CR4: 000407f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: 0ff0 DR7: 0400
> Process kworker/u:2 (pid: 12580, threadinfo 88100940e000, task 
> 880804850630)
> Stack:
>  88100940fd00 88100940fc40 0101 8131044b
>  0001 0246 0201 a0073a00
>  8808054a7d90 8808054a7690 88100940fc40 88080bf9e668
> Call Trace:
>  [] ? do_async_gen_syndrome+0x2f3/0x320
>  [] ? ioat2_tx_submit_unlock+0xac/0xb3 [ioatdma]
>  [] ? ops_complete_compute+0x7b/0x7b
>  [] ? async_gen_syndrome+0xc8/0x1d6
>  [] ? __raid_run_ops+0x9e7/0xb5a
>  [] ? select_task_rq_fair+0x487/0x74b
>  [] ? ops_complete_compute+0x7b/0x7b
>  [] ? __wake_up+0x35/0x46
>  [] ? async_schedule+0x12/0x12
>  [] ? async_run_ops+0x32/0x3e
>  [] ? async_run_entry_fn+0xa4/0x17e
>  [] ? async_schedule+0x12/0x12
>  [] ? process_one_work+0x259/0x381
>  [] ? worker_thread+0x2ad/0x3e3
>  [] ? try_to_wake_up+0x1fc/0x20c
>  [] ? manage_workers+0x245/0x245
>  [] ? manage_workers+0x245/0x245
>  [] ? kthread+0x81/0x89
>  [] ? kernel_thread_helper+0x4/0x10
>  [] ? kthread_freezable_should_stop+0x4e/0x4e
>  [] ? gs_change+0xb/0xb
> Code: 5b c3 41 54 49 89 d4 55 53 48 89 f3 48 8b 6a 08 48 8b 42 10 48 85 ed 48 
> 89 46 20 48 8b 42 18 48 89 46 28 74 5c f6 45 04 02 74 72 <0f> 0b eb fe 48 8b 
> 02 48 8b 48 28 80 e1 40 74 24 31 f6 48 89 d7 
> RIP  [] async_tx_submit+0x29/0xab
>  RSP 
> ---[ end trace 64fb561d16a3b535 ]---
> Kernel panic - not syncing: Fatal exception in interrupt
> Rebooting in 5 seconds..
> 
> Do any of you guys have a clue about it ?
> 
> Thanks
> 
> Laurent
> 
> PS: The very same kernel doesn't cause any trouble on R510 hardware.
> 

-- 
Regards/Gruss,
Boris.

