Re: EXT4 Oops (Re: [PATCH V15 06/22] mmc: block: Add blk-mq support)

2018-03-05 Thread Dmitry Osipenko
On 01.03.2018 19:04, Theodore Ts'o wrote:
> On Thu, Mar 01, 2018 at 10:55:37AM +0200, Adrian Hunter wrote:
>> On 27/02/18 11:28, Adrian Hunter wrote:
>>> On 26/02/18 23:48, Dmitry Osipenko wrote:
>>>> But still something is wrong... I've been getting occasional EXT4 Ooops's,
>>>> like
>>>> the one below, and __wait_on_bit() is always figuring in the stacktrace. It
>>>> never happened with blk-mq disabled, though it could be a coincidence and
>>>> actually unrelated to blk-mq patches.
>>>
>>>> [ 6625.992337] Unable to handle kernel NULL pointer dereference at virtual
>>>> address 001c
>>>> [ 6625.993004] pgd = 00b30c03
>>>> [ 6625.993257] [001c] *pgd=
>>>> [ 6625.993594] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
>>>> [ 6625.994022] Modules linked in:
>>>> [ 6625.994326] CPU: 1 PID: 19355 Comm: dpkg Not tainted
>>>> 4.16.0-rc2-next-20180220-00095-ge9c9f5689a84-dirty #2090
>>>> [ 6625.995078] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
>>>> [ 6625.995595] PC is at dx_probe+0x68/0x684
>>>> [ 6625.995947] LR is at __wait_on_bit+0xac/0xc8
> 
> This doesn't seem to make sense; the PC is where we are currently
> executing, and LR is the "Link Register" where the flow of control
> will be returning after the current function returns, right?  Well,
> dx_probe should *not* be returning to __wait_on_bit().  So this just
> seems weird.
> 
> Ignoring the LR register, this stack trace looks sane...  I can't see
> which pointer could be NULL and getting dereferenced, though.  How
> easily can you reproduce the problem?  Can you either (a) translate
> the PC into a line number, or better yet, if you can reproduce, add a
> series of BUG_ON's so we can see what's going on?
> 
> + BUG_ON(frame);
>   memset(frame_in, 0, EXT4_HTREE_LEVEL * sizeof(frame_in[0]));
>   frame->bh = ext4_read_dirblock(dir, 0, INDEX);
>   if (IS_ERR(frame->bh))
>   return (struct dx_frame *) frame->bh;
> 
> + BUG_ON(frame->bh);
> + BUG_ON(frame->bh->b_data);
>   root = (struct dx_root *) frame->bh->b_data;
>   if (root->info.hash_version != DX_HASH_TEA &&
>   root->info.hash_version != DX_HASH_HALF_MD4 &&
>   root->info.hash_version != DX_HASH_LEGACY) {
> 
> These are "could never" happen scenarios from looking at the code, but
> that will help explain what is going on.
> 
> If this is reliably only happening with mq, the only way I could see
> that is if something is returning an error when it previously wasn't.
> This isn't a problem we're seeing with any of our testing, though.

It happened again today; this time "BUG_ON(!frame->bh->b_data);" was triggered.

kernel BUG at fs/ext4/namei.c:751!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
Modules linked in:
CPU: 0 PID: 296 Comm: cron Not tainted
4.16.0-rc2-next-20180220-00095-ge9c9f5689a84-dirty #2100
Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
PC is at dx_probe+0x308/0x694
LR is at __wait_on_bit+0xac/0xc8
pc : []lr : []psr: 60040013
sp : d545bc20  ip : c0170e88  fp : d545bc74
r10:   r9 : d545bca0  r8 : d4209300
r7 :   r6 :   r5 : d656e838  r4 : d545bcbc
r3 : 007b  r2 : d5830800  r1 : d5831000  r0 : d4209300
Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
Control: 10c5387d  Table: 1552004a  DAC: 0051
Process cron (pid: 296, stack limit = 0x4d1ebf14)
Stack: (0xd545bc20 to 0xd545c000)
bc20: 02ea c0c019d4 60040113 014000c0 c029e640 d6cf3540 d545bc7c d545bc48
bc40: c02797f4 c0152804 d545bca4 0007 d5830800  d656e838 0001
bc60: d545bca0  d545bd0c d545bc78 c033d578 c033b904 c029e714 c029b088
bc80: 0148 c0c01984 d65f6be0  d545be10 d545bd24 d545bd00 d5830800
bca0: d65f6bf8 d65f6c0c 0007 d6547720 8420edbe c029eec8  d4209300
bcc0:        
bce0: d545bd48 d65f6be0 d656e838 d65f6be0 d6547720 0001 d545be10 
bd00: d545bd44 d545bd10 c033d7c0 c033d1c8 d545bd34 d656e8b8 d656e838 d545be08
bd20: d656e838  d65f6be0 d656e838 d656e8b8 d6547720 d545bd8c d545bd48
bd40: c028ea50 c033d774  dead4ead   d545bd58 d545bd58
bd60: d6d7f015 d545be08   d545bee8 d545bee8 d545bf28 
bd80: d545bdd4 d545bd90 c028f310 c028e9b0 d545be08 80808080 d545be08 d6d7f010
bda0: d545bdd4 d545bdb0 c028df9c d545be08 d6d7f010  d545bee8 d545bee8
bdc0: d545bf28  d545be04 d545bdd8 c0290e24 c028f160 c0111a1c c0111674
bde0: d545be04 d545bdf0 0001 d6d7f000 d545be08 0001 d545beb4 d545be08
be00: c0293848 c0290da4 d6dd0310 d6547720 8420edbe 0007 d6d7f015 000c
be20: d6dd0310 d6547098 d656e838 0001 0002 0fe0  
be40:  d545be48 c02797f4 0ff0 d6d7f010 c102b4c8 d5522db8 d6d7f000
be60: c130bbdc 004f73f8  0001 d545bf28  d6d7f000 
be80: c0293570 0002 ff9c 0001 ff9c 0001 ff9c d545bee8
bea0: ff9c 004f73f8 

Re: EXT4 Oops (Re: [PATCH V15 06/22] mmc: block: Add blk-mq support)

2018-03-02 Thread Dmitry Osipenko
On 01.03.2018 23:20, Andreas Dilger wrote:
> 
> On Mar 1, 2018, at 9:04 AM, Theodore Ts'o  wrote:
>> This doesn't seem to make sense; the PC is where we are currently
>> executing, and LR is the "Link Register" where the flow of control
>> will be returning after the current function returns, right?  Well,
>> dx_probe should *not* be returning to __wait_on_bit().  So this just
>> seems weird.
>>
>> Ignoring the LR register, this stack trace looks sane...  I can't see
>> which pointer could be NULL and getting dereferenced, though.  How
>> easily can you reproduce the problem?  Can you either (a) translate
>> the PC into a line number, or better yet, if you can reproduce, add a
>> series of BUG_ON's so we can see what's going on?

Ted, thank you for the suggestion. I don't have a reproducer; the bug happens
only under some IO load, and quite randomly. I've applied the BUG_ON()'s, but it
may take some time to catch the bug again.

>> +BUG_ON(frame);
> 
> I think you mean:
>   BUG_ON(frame == NULL);
> or
>   BUG_ON(!frame);
> 
> 
>>  memset(frame_in, 0, EXT4_HTREE_LEVEL * sizeof(frame_in[0]));
>>  frame->bh = ext4_read_dirblock(dir, 0, INDEX);
>>  if (IS_ERR(frame->bh))
>>  return (struct dx_frame *) frame->bh;
>>
>> +BUG_ON(frame->bh);
>> +BUG_ON(frame->bh->b_data);
> 
> Same here.
> 
>   BUG_ON(frame->bh == NULL);
>   BUG_ON(frame->bh->b_data == NULL);
> 
> This is why I don't like implicit "is NULL" or "is non-zero" usage.  Lustre
> used to require "== NULL" or "!= NULL" to avoid bugs like this, but had to
> abandon that because of upstream code style.

Well spotted, thanks Andreas.

>>  root = (struct dx_root *) frame->bh->b_data;
>>  if (root->info.hash_version != DX_HASH_TEA &&
>>  root->info.hash_version != DX_HASH_HALF_MD4 &&
>>  root->info.hash_version != DX_HASH_LEGACY) {
>>
>> These are "could never" happen scenarios from looking at the code, but
>> that will help explain what is going on.
>>
>> If this is reliably only happening with mq, the only way I could see
>> that is if something is returning an error when it previously wasn't.
>> This isn't a problem we're seeing with any of our testing, though.



Re: EXT4 Oops (Re: [PATCH V15 06/22] mmc: block: Add blk-mq support)

2018-03-01 Thread Andreas Dilger

On Mar 1, 2018, at 9:04 AM, Theodore Ts'o  wrote:
> This doesn't seem to make sense; the PC is where we are currently
> executing, and LR is the "Link Register" where the flow of control
> will be returning after the current function returns, right?  Well,
> dx_probe should *not* be returning to __wait_on_bit().  So this just
> seems weird.
> 
> Ignoring the LR register, this stack trace looks sane...  I can't see
> which pointer could be NULL and getting dereferenced, though.  How
> easily can you reproduce the problem?  Can you either (a) translate
> the PC into a line number, or better yet, if you can reproduce, add a
> series of BUG_ON's so we can see what's going on?
> 
> + BUG_ON(frame);

I think you mean:
BUG_ON(frame == NULL);
or
BUG_ON(!frame);


>   memset(frame_in, 0, EXT4_HTREE_LEVEL * sizeof(frame_in[0]));
>   frame->bh = ext4_read_dirblock(dir, 0, INDEX);
>   if (IS_ERR(frame->bh))
>   return (struct dx_frame *) frame->bh;
> 
> + BUG_ON(frame->bh);
> + BUG_ON(frame->bh->b_data);

Same here.

BUG_ON(frame->bh == NULL);
BUG_ON(frame->bh->b_data == NULL);

This is why I don't like implicit "is NULL" or "is non-zero" usage.  Lustre
used to require "== NULL" or "!= NULL" to avoid bugs like this, but had to
abandon that because of upstream code style.
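
To make the inverted condition concrete, here is a minimal userspace sketch. It assumes a stand-in macro, WOULD_BUG_ON(), that only reports whether the kernel's BUG_ON() would have fired, and mock structs (mock_frame, mock_bh) that are hypothetical and not the real ext4 types.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Userspace stand-in for the kernel's BUG_ON(): it only reports whether the
 * condition would have triggered, instead of crashing. Hypothetical helper,
 * not the kernel macro.
 */
#define WOULD_BUG_ON(cond) ((cond) ? true : false)

/* Mock of the pointers checked in the thread; not the real ext4 structs. */
struct mock_bh {
	char *b_data;
};

struct mock_frame {
	struct mock_bh *bh;
};

/* As first posted: fires whenever the pointer is valid (non-NULL). */
static bool inverted_check_fires(const struct mock_frame *frame)
{
	return WOULD_BUG_ON(frame->bh);
}

/* As corrected: fires only when the pointer is NULL. */
static bool corrected_check_fires(const struct mock_frame *frame)
{
	return WOULD_BUG_ON(!frame->bh);
}
```

With a valid bh pointer the inverted check fires and the corrected one stays quiet; with a NULL bh it is the other way around, which is the case the debugging patch is meant to catch.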

>   root = (struct dx_root *) frame->bh->b_data;
>   if (root->info.hash_version != DX_HASH_TEA &&
>   root->info.hash_version != DX_HASH_HALF_MD4 &&
>   root->info.hash_version != DX_HASH_LEGACY) {
> 
> These are "could never" happen scenarios from looking at the code, but
> that will help explain what is going on.
> 
> If this is reliably only happening with mq, the only way I could see
> that is if something is returning an error when it previously wasn't.
> This isn't a problem we're seeing with any of our testing, though.
> 
> Cheers,
> 
>   - Ted
> 


Cheers, Andreas









Re: EXT4 Oops (Re: [PATCH V15 06/22] mmc: block: Add blk-mq support)

2018-03-01 Thread Theodore Ts'o
On Thu, Mar 01, 2018 at 01:15:24AM -0800, Jose R R wrote:
> It is probably unwise to place all your eggs (data) in one basket
> (ext4); why not diversify to viable alternatives that won't be affected
> by the UNIX year 2038 date problem as well?
> < 
> https://metztli.it/blog/index.php/amatl8/reiser-nahui/reiser4-filesystem-and-the-unix
> >

All of the modern file systems (btrfs, ext4, f2fs, xfs, etc.) are fine
with respect to the 2038 problem.

- Ted


Re: EXT4 Oops (Re: [PATCH V15 06/22] mmc: block: Add blk-mq support)

2018-03-01 Thread Theodore Ts'o
On Thu, Mar 01, 2018 at 10:55:37AM +0200, Adrian Hunter wrote:
> On 27/02/18 11:28, Adrian Hunter wrote:
> > On 26/02/18 23:48, Dmitry Osipenko wrote:
> >> But still something is wrong... I've been getting occasional EXT4 Ooops's, 
> >> like
> >> the one below, and __wait_on_bit() is always figuring in the stacktrace. It
> >> never happened with blk-mq disabled, though it could be a coincidence and
> >> actually unrelated to blk-mq patches.
> > 
> >> [ 6625.992337] Unable to handle kernel NULL pointer dereference at virtual
> >> address 001c
> >> [ 6625.993004] pgd = 00b30c03
> >> [ 6625.993257] [001c] *pgd=
> >> [ 6625.993594] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
> >> [ 6625.994022] Modules linked in:
> >> [ 6625.994326] CPU: 1 PID: 19355 Comm: dpkg Not tainted
> >> 4.16.0-rc2-next-20180220-00095-ge9c9f5689a84-dirty #2090
> >> [ 6625.995078] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
> >> [ 6625.995595] PC is at dx_probe+0x68/0x684
> >> [ 6625.995947] LR is at __wait_on_bit+0xac/0xc8

This doesn't seem to make sense; the PC is where we are currently
executing, and LR is the "Link Register" where the flow of control
will be returning after the current function returns, right?  Well,
dx_probe should *not* be returning to __wait_on_bit().  So this just
seems weird.

Ignoring the LR register, this stack trace looks sane...  I can't see
which pointer could be NULL and getting dereferenced, though.  How
easily can you reproduce the problem?  Can you either (a) translate
the PC into a line number, or better yet, if you can reproduce, add a
series of BUG_ON's so we can see what's going on?

+       BUG_ON(frame);
        memset(frame_in, 0, EXT4_HTREE_LEVEL * sizeof(frame_in[0]));
        frame->bh = ext4_read_dirblock(dir, 0, INDEX);
        if (IS_ERR(frame->bh))
                return (struct dx_frame *) frame->bh;

+       BUG_ON(frame->bh);
+       BUG_ON(frame->bh->b_data);
        root = (struct dx_root *) frame->bh->b_data;
        if (root->info.hash_version != DX_HASH_TEA &&
            root->info.hash_version != DX_HASH_HALF_MD4 &&
            root->info.hash_version != DX_HASH_LEGACY) {

These are "could never" happen scenarios from looking at the code, but
that will help explain what is going on.

If this is reliably only happening with mq, the only way I could see
that is if something is returning an error when it previously wasn't.
This isn't a problem we're seeing with any of our testing, though.
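
As a hint for working out which pointer was NULL: the oops reports the fault at virtual address 001c rather than 0. One common reading, which is an assumption here and not something established in this thread, is that a member at offset 0x1c was accessed through a NULL base pointer. A minimal sketch with a hypothetical struct (mock_object is invented for illustration and is not struct buffer_head or struct dx_root):

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout, chosen only so that one member lands at offset 0x1c.
 * This is not the real struct buffer_head or struct dx_root.
 */
struct mock_object {
	uint8_t earlier_members[0x1c]; /* whatever precedes the member of interest */
	uint8_t member;                /* lives at offset 0x1c */
};

/*
 * Address the CPU would try to access for base->member, computed with
 * integer arithmetic so that nothing is actually dereferenced.
 */
static uintptr_t member_address(uintptr_t base)
{
	return base + offsetof(struct mock_object, member);
}
```

Calling member_address(0) models the NULL-base case: the result is the member's offset itself, so comparing the reported fault address against the offsets of the structure members touched near the faulting PC can narrow down the candidates.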

Cheers,

- Ted



Re: EXT4 Oops (Re: [PATCH V15 06/22] mmc: block: Add blk-mq support)

2018-03-01 Thread Jose R R
On Thu, Mar 1, 2018 at 12:55 AM, Adrian Hunter  wrote:
> On 27/02/18 11:28, Adrian Hunter wrote:
>> On 26/02/18 23:48, Dmitry Osipenko wrote:
>>> But still something is wrong... I've been getting occasional EXT4 Ooops's, 
>>> like
>>> the one below, and __wait_on_bit() is always figuring in the stacktrace. It
>>> never happened with blk-mq disabled, though it could be a coincidence and
>>> actually unrelated to blk-mq patches.
>>
>> I can't think how an IO driver could cause that.

It is probably unwise to place all your eggs (data) in one basket
(ext4); why not diversify to viable alternatives that won't be affected
by the UNIX year 2038 date problem as well?
< 
https://metztli.it/blog/index.php/amatl/reiser-nahui/reiser4-filesystem-and-the-unix
>

>>
>> cc'ing ext4 mailing list for more advice.
>
> + Ted and Andreas
>
>>
>>>
>>>
>>> [ 6625.992337] Unable to handle kernel NULL pointer dereference at virtual
>>> address 001c
>>> [ 6625.993004] pgd = 00b30c03
>>> [ 6625.993257] [001c] *pgd=
>>> [ 6625.993594] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
>>> [ 6625.994022] Modules linked in:
>>> [ 6625.994326] CPU: 1 PID: 19355 Comm: dpkg Not tainted
>>> 4.16.0-rc2-next-20180220-00095-ge9c9f5689a84-dirty #2090
>>> [ 6625.995078] Hardware name: NVIDIA Tegra SoC (Flattened Device Tree)
>>> [ 6625.995595] PC is at dx_probe+0x68/0x684
>>> [ 6625.995947] LR is at __wait_on_bit+0xac/0xc8
>>> [ 6625.996307] pc : []lr : []psr: 800f0013
>>> [ 6625.996806] sp : d55e3df0  ip : c0170e88  fp : d55e3e44
>>> [ 6625.997227] r10: d55e3f4c  r9 : d55e3e70  r8 : 
>>> [ 6625.997650] r7 : c4e13240  r6 :   r5 : d657db18  r4 : d55e3e8c
>>> [ 6625.998165] r3 : 007b  r2 : d5830800  r1 : d5831000  r0 : c4e13240
>>> [ 6625.998686] Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment 
>>> none
>>> [ 6625.999246] Control: 10c5387d  Table: 0a63004a  DAC: 0051
>>> [ 6625.999710] Process dpkg (pid: 19355, stack limit = 0x139a48b6)
>>> [ 6626.000184] Stack: (0xd55e3df0 to 0xd55e4000)
>>> [ 6626.000560] 3de0: 02e9 d55e3e00
>>> c0c01964 c0278c70
>>> [ 6626.001209] 3e00: d55e3e24 014000c0 c04f3580 c0c01958 d55e3e90 801a001a
>>> d55e3e3c 0012
>>> [ 6626.001854] 3e20: d5830800  d657db18 c24b d55e3e70 d55e3f4c
>>> d55e3edc d55e3e48
>>> [ 6626.002502] 3e40: c033d568 c033b904  600f0013 c029e640 d6cf3540
>>> e000 
>>> [ 6626.003150] 3e60: 00076e99 d55e3ef4 d55e3e8c d5830800 d409c440 d409c454
>>> 0012 c029e640
>>> [ 6626.003795] 3e80: d55e3ec4 d55e3e90 c02797b4 c4e13240  
>>>  
>>> [ 6626.004442] 3ea0:     d409c428 d409c428
>>> d657db18 d409c428
>>> [ 6626.005088] 3ec0:  c24b ff9c d55e3f4c d55e3f14 d55e3ee0
>>> c033d7b0 c033d1b8
>>> [ 6626.005732] 3ee0: c0c01964 c0180050 d55e3f14 d55e3ef8 c029e870 
>>> d409c428 d6546558
>>> [ 6626.006382] 3f00: d55e3f58  d55e3f34 d55e3f18 c0291f04 c033d764
>>>  0001
>>> [ 6626.007032] 3f20:  d55e3f58 d55e3f94 d55e3f38 c0293d70 c0291ea0
>>> d55e3f58 d55e3f4c
>>> [ 6626.007679] 3f40:  0090abb0 d5467800  d6dd0110 d6546558
>>> f3bc423c 0012
>>> [ 6626.008326] 3f60: c24b0019 80808080  015ce1b0 0090abb0 00d8d670
>>> 0028 c01011e4
>>> [ 6626.008971] 3f80: d55e2000  d55e3fa4 d55e3f98 c0294544 c0293c44
>>>  d55e3fa8
>>> [ 6626.009620] 3fa0: c0101000 c0294530 015ce1b0 0090abb0 0090abb0 02a8
>>> 7d5a8800 7d5a8800
>>> [ 6626.010264] 3fc0: 015ce1b0 0090abb0 00d8d670 0028 0048eb80 00487344
>>> 015eb160 004a6c10
>>> [ 6626.010912] 3fe0: 004a6c8c bede3c0c 0048149d b6ecc6b8 600f0030 0090abb0
>>>  
>>> [ 6626.011577] [] (dx_probe) from []
>>> (ext4_find_entry+0x3bc/0x5ac)
>>> [ 6626.012198] [] (ext4_find_entry) from []
>>> (ext4_lookup+0x58/0x1f4)
>>> [ 6626.012844] [] (ext4_lookup) from []
>>> (__lookup_hash+0x70/0x9c)
>>> [ 6626.013468] [] (__lookup_hash) from [] 
>>> (do_rmdir+0x138/0x1b8)
>>> [ 6626.014071] [] (do_rmdir) from [] 
>>> (SyS_rmdir+0x20/0x24)
>>> [ 6626.014642] [] (SyS_rmdir) from []
>>> (ret_fast_syscall+0x0/0x54)
>>> [ 6626.015231] Exception stack(0xd55e3fa8 to 0xd55e3ff0)
>>> [ 6626.015656] 3fa0:   015ce1b0 0090abb0 0090abb0 02a8
>>> 7d5a8800 7d5a8800
>>> [ 6626.016302] 3fc0: 015ce1b0 0090abb0 00d8d670 0028 0048eb80 00487344
>>> 015eb160 004a6c10
>>> [ 6626.035930] 3fe0: 004a6c8c bede3c0c 0048149d b6ecc6b8
>>> [ 6626.055341] Code: e1a07000 e584 8a78 e590601c (e5d6301c)
>>> [ 6626.075632] ---[ end trace 034f3552437a92bc ]---
>>>
>>
>>
>

 Sorry if I intrude but just my 2¢.


Best Professional Regards.

-- 
Jose R R
http://metztli.it
-
Download Metztli Reiser4: Debian Stretch w/ Linux 4.14 AMD64
