Re: Dcache oops

2016-06-04 Thread Oleg Drokin
On Jun 3, 2016, at 8:56 PM, Al Viro wrote: > On Fri, Jun 03, 2016 at 07:58:37PM -0400, Oleg Drokin wrote: > >>> EOPENSTALE, that is... Oleg, could you check if the following works? >> >> Yes, this one lasted for an hour with no crashing, so it must be good. >> Thanks. >> (note, I am not

Re: Dcache oops

2016-06-04 Thread Jeff Layton
On Sat, 2016-06-04 at 01:56 +0100, Al Viro wrote: > On Fri, Jun 03, 2016 at 07:58:37PM -0400, Oleg Drokin wrote: > > > > > > > > > EOPENSTALE, that is...  Oleg, could you check if the following works? > > Yes, this one lasted for an hour with no crashing, so it must be good. > > Thanks. > >

Re: Dcache oops

2016-06-03 Thread Al Viro
On Fri, Jun 03, 2016 at 07:58:37PM -0400, Oleg Drokin wrote: > > EOPENSTALE, that is... Oleg, could you check if the following works? > > Yes, this one lasted for an hour with no crashing, so it must be good. > Thanks. > (note, I am not equipped to verify correctness of NFS operations, though).

Re: Dcache oops

2016-06-03 Thread Oleg Drokin
On Jun 3, 2016, at 6:37 PM, Al Viro wrote: > On Fri, Jun 03, 2016 at 11:23:55PM +0100, Al Viro wrote: > >> It's not that. It's explicit put_link() in do_last(), followed by >> ESTALEOPEN and subsequent misbegotten "retry the last step on ESTALEOPEN" >> looking at now-freed nd->last.name. IOW,

Re: Dcache oops

2016-06-03 Thread Al Viro
On Fri, Jun 03, 2016 at 03:36:22PM -0700, Linus Torvalds wrote: > Happy to hear that you seem to have figured it out. > > But why did it apparently only start happening now? Oleg has started to use Lustre torture tests on NFS, that's all. Note, BTW, that first they'd triggered an oopsable bug

Re: Dcache oops

2016-06-03 Thread Oleg Drokin
On Jun 3, 2016, at 6:36 PM, Linus Torvalds wrote: > On Fri, Jun 3, 2016 at 3:23 PM, Al Viro wrote: >> On Fri, Jun 03, 2016 at 03:00:02PM -0700, Linus Torvalds wrote: >>> Normally it's done at terminate_walk() time. But I note that in >>> walk_component(), we do put_link(nd) which does a

Re: Dcache oops

2016-06-03 Thread Al Viro
On Fri, Jun 03, 2016 at 11:23:55PM +0100, Al Viro wrote: > It's not that. It's explicit put_link() in do_last(), followed by > ESTALEOPEN and subsequent misbegotten "retry the last step on ESTALEOPEN" > looking at now-freed nd->last.name. IOW, the bug predates delayed_call > stuff. EOPENSTALE,

Re: Dcache oops

2016-06-03 Thread Linus Torvalds
On Fri, Jun 3, 2016 at 3:23 PM, Al Viro wrote: > On Fri, Jun 03, 2016 at 03:00:02PM -0700, Linus Torvalds wrote: >>> >> Normally it's done at terminate_walk() time. But I note that in >> walk_component(), we do put_link(nd) which does a do_delayed_call(), >> but does *not* do a

Re: Dcache oops

2016-06-03 Thread Al Viro
On Fri, Jun 03, 2016 at 11:23:55PM +0100, Al Viro wrote: > It's not that. It's explicit put_link() in do_last(), followed by > ESTALEOPEN and subsequent misbegotten "retry the last step on ESTALEOPEN" > looking at now-freed nd->last.name. IOW, the bug predates delayed_call > stuff. FWIW, I'd

Re: Dcache oops

2016-06-03 Thread Al Viro
On Fri, Jun 03, 2016 at 03:00:02PM -0700, Linus Torvalds wrote: > Is perhaps the "delayed_call" logic broken, and the symlink is free'd too > early? > > That whole set_delayed_call/do_delayed_call thing came in 4.5. Maybe > something broke that logic, and we've executed the delayed freeing >

Re: Dcache oops

2016-06-03 Thread Al Viro
On Fri, Jun 03, 2016 at 10:46:31PM +0100, Al Viro wrote: > On Fri, Jun 03, 2016 at 05:17:06PM -0400, Oleg Drokin wrote: > > > > Can the same thing be reproduced (with NFS fix) on v4.6, ede4090, 7f427d3, > > > 4e8440b? > > > > Well, that was faster than I expected. 4e8440b triggers right away, so

Re: Dcache oops

2016-06-03 Thread Linus Torvalds
On Fri, Jun 3, 2016 at 2:26 PM, Al Viro wrote: >> >> in the __d_lookup() disassembly. And %rdi contains 2, so there were >> supposed to be two more characters at 'ct' (which is %rdx). > > ... and since r8 and rsi are 0, we couldn't have consumed anything. Right you are. So it really started out

Re: Dcache oops

2016-06-03 Thread Al Viro
On Fri, Jun 03, 2016 at 05:17:06PM -0400, Oleg Drokin wrote: > > Can the same thing be reproduced (with NFS fix) on v4.6, ede4090, 7f427d3, > > 4e8440b? > > Well, that was faster than I expected. 4e8440b triggers right away, so I guess > there's no point in trying the later ones? > BTW, just to

Re: Dcache oops

2016-06-03 Thread Al Viro
On Fri, Jun 03, 2016 at 02:18:15PM -0700, Linus Torvalds wrote: > So something must have corrupted the qstr. > > The remaining length *should* in %edi, judging by the > > 0x81243b82 <+306>: cmp $0x7,%edi > > in the __d_lookup() disassembly. And %rdi contains 2, so there were >

Re: Dcache oops

2016-06-03 Thread Linus Torvalds
On Fri, Jun 3, 2016 at 1:07 PM, Al Viro wrote: > > Aha... It's load_unaligned_zeropad() from dentry_string_cmp(), hitting > a genuinely unmapped address. That sends it into fixup, where it tries to > load an aligned word containing the address in question, in hope that > fault was on attempt to

Re: Dcache oops

2016-06-03 Thread Oleg Drokin
On Jun 3, 2016, at 4:07 PM, Al Viro wrote: > On Fri, Jun 03, 2016 at 02:35:41PM -0400, Oleg Drokin wrote: > [ 2642.364383] BUG: unable to handle kernel paging request at 880113f82000 [ 2642.365014] IP: [] bad_gs+0xd1d/0x1ba9 >>> >>> *ow* >>> Could you dump your vmlinux (and

Re: Dcache oops

2016-06-03 Thread Al Viro
On Fri, Jun 03, 2016 at 02:35:41PM -0400, Oleg Drokin wrote: > >> [ 2642.364383] BUG: unable to handle kernel paging request at > >> 880113f82000 > >> [ 2642.365014] IP: [] bad_gs+0xd1d/0x1ba9 > > > > *ow* > > Could you dump your vmlinux (and System.map) somewhere on anonftp? > > This

Re: Dcache oops

2016-06-03 Thread Oleg Drokin
On Jun 3, 2016, at 2:22 PM, Al Viro wrote: > On Fri, Jun 03, 2016 at 12:38:40PM -0400, Oleg Drokin wrote: >> I am dropping NFS people since it seems to be converting into a generic >> VFS/dcache bug even though you need NFS or the like to trigger it - the >> lookup_open path. > > NFS bug is

Re: Dcache oops

2016-06-03 Thread Al Viro
On Fri, Jun 03, 2016 at 12:38:40PM -0400, Oleg Drokin wrote: > I am dropping NFS people since it seems to be converting into a generic > VFS/dcache bug even though you need NFS or the like to trigger it - the > lookup_open path. NFS bug is real; there might very well be something else, but that

Dcache oops

2016-06-03 Thread Oleg Drokin
I am dropping NFS people since it seems to be converting into a generic VFS/dcache bug even though you need NFS or the like to trigger it - the lookup_open path. On Jun 3, 2016, at 12:26 AM, Al Viro wrote: > Looks like the right thing to do would be to do d_drop() at no_open:, > just before we
