Re: Dcache oops
On Jun 3, 2016, at 8:56 PM, Al Viro wrote: > On Fri, Jun 03, 2016 at 07:58:37PM -0400, Oleg Drokin wrote: > >>> EOPENSTALE, that is... Oleg, could you check if the following works? >> >> Yes, this one lasted for an hour with no crashing, so it must be good. >> Thanks. >> (note, I am not equipped to verify correctness of NFS operations, though). > > I suspect that Jeff Layton might have relevant regression tests. > Incidentally, > we really need a consolidated regression testsuite, including the tests you'd > been running. Right now there's some stuff in xfstests, LTP and cthon; if > anything, this mess shows just why we need all of that and then some in > a single place. Lustre stuff has caught a 3 years old NFS bug (missing > d_drop() in nfs_atomic_open()) and a year-old bug in handling of EOPENSTALE > retries on the last component of a trailing non-embedded symlink. Neither > is hard to trigger; it's just that relevant tests hadn't been run on NFS, > period. BTW, the nets also have brought in another use after free in nfs4 state tracking code (this is the one I was trying to hunt down from the start). I'll submit a patch shortly. And also there's a mysterious ext4 data corruption that I do not really fully understand but only hit once so far.
Re: Dcache oops
On Jun 3, 2016, at 8:56 PM, Al Viro wrote: > On Fri, Jun 03, 2016 at 07:58:37PM -0400, Oleg Drokin wrote: > >>> EOPENSTALE, that is... Oleg, could you check if the following works? >> >> Yes, this one lasted for an hour with no crashing, so it must be good. >> Thanks. >> (note, I am not equipped to verify correctness of NFS operations, though). > > I suspect that Jeff Layton might have relevant regression tests. > Incidentally, > we really need a consolidated regression testsuite, including the tests you'd > been running. Right now there's some stuff in xfstests, LTP and cthon; if > anything, this mess shows just why we need all of that and then some in > a single place. Lustre stuff has caught a 3 years old NFS bug (missing > d_drop() in nfs_atomic_open()) and a year-old bug in handling of EOPENSTALE > retries on the last component of a trailing non-embedded symlink. Neither > is hard to trigger; it's just that relevant tests hadn't been run on NFS, > period. BTW, the nets also have brought in another use after free in nfs4 state tracking code (this is the one I was trying to hunt down from the start). I'll submit a patch shortly. And also there's a mysterious ext4 data corruption that I do not really fully understand but only hit once so far.
Re: Dcache oops
On Sat, 2016-06-04 at 01:56 +0100, Al Viro wrote: > On Fri, Jun 03, 2016 at 07:58:37PM -0400, Oleg Drokin wrote: > > > > > > > > > EOPENSTALE, that is... Oleg, could you check if the following works? > > Yes, this one lasted for an hour with no crashing, so it must be good. > > Thanks. > > (note, I am not equipped to verify correctness of NFS operations, though). > I suspect that Jeff Layton might have relevant regression tests. > Incidentally, > we really need a consolidated regression testsuite, including the tests you'd > been running. Right now there's some stuff in xfstests, LTP and cthon; if > anything, this mess shows just why we need all of that and then some in > a single place. Lustre stuff has caught a 3 years old NFS bug (missing > d_drop() in nfs_atomic_open()) and a year-old bug in handling of EOPENSTALE > retries on the last component of a trailing non-embedded symlink. Neither > is hard to trigger; it's just that relevant tests hadn't been run on NFS, > period. > > Jeff, could you verify that the following does not cause regressions in > stale fhandles treatment? I want to rip the damn retry logics out of > do_last() and if the staleness had only been discovered inside of > nfs4_file_open() just have the upper-level logics handle it by doing > a normal LOOKUP_REVAL pass from scratch. To hell with trying to be clever; > a few roundtrips it saves us in some cases is not worth the complexity and > potential for bugs. I'm fairly sure that the time spent debugging this > particular turd exceeds the total amount of time it has ever saved, > and do_last() is in dire need of simplification. All talk about "enough eyes" > isn't worth much when the readers of code in question feel like ripping their > eyes out... > Agreed. I see no need to optimize an error case here. Any performance hit that we'd get here is almost certainly acceptable in this situation. The main thing is that we prevent the ESTALE from bubbling up into userland if we can avoid it by retrying. No, I didn't have the test for this anymore unfortunately. RHQA might have one though. Either way, I cooked one up that does this on the server: #!/bin/bash while true; do rm -rf foo mkdir foo echo foo > foo/bar usleep 10 done ...and then this on the client after mounting the fs with lookupcache=none and noac. #include #include #include #include #include #include int main(int argc, char **argv) { int fd; while(1) { fd = open(argv[1], O_RDONLY); if (fd < 0) { if (errno == ESTALE) { printf("ESTALE"); return 1; } continue; } close(fd); } return 0; } I did see some of the OPEN compounds come back with NFS4ERR_STALE on the PUTFH op but no corresponding ESTALE error in userland. So, this patch does seem to do the right thing. Reviewed-and-Tested-by: Jeff Layton
Re: Dcache oops
On Sat, 2016-06-04 at 01:56 +0100, Al Viro wrote: > On Fri, Jun 03, 2016 at 07:58:37PM -0400, Oleg Drokin wrote: > > > > > > > > > EOPENSTALE, that is... Oleg, could you check if the following works? > > Yes, this one lasted for an hour with no crashing, so it must be good. > > Thanks. > > (note, I am not equipped to verify correctness of NFS operations, though). > I suspect that Jeff Layton might have relevant regression tests. > Incidentally, > we really need a consolidated regression testsuite, including the tests you'd > been running. Right now there's some stuff in xfstests, LTP and cthon; if > anything, this mess shows just why we need all of that and then some in > a single place. Lustre stuff has caught a 3 years old NFS bug (missing > d_drop() in nfs_atomic_open()) and a year-old bug in handling of EOPENSTALE > retries on the last component of a trailing non-embedded symlink. Neither > is hard to trigger; it's just that relevant tests hadn't been run on NFS, > period. > > Jeff, could you verify that the following does not cause regressions in > stale fhandles treatment? I want to rip the damn retry logics out of > do_last() and if the staleness had only been discovered inside of > nfs4_file_open() just have the upper-level logics handle it by doing > a normal LOOKUP_REVAL pass from scratch. To hell with trying to be clever; > a few roundtrips it saves us in some cases is not worth the complexity and > potential for bugs. I'm fairly sure that the time spent debugging this > particular turd exceeds the total amount of time it has ever saved, > and do_last() is in dire need of simplification. All talk about "enough eyes" > isn't worth much when the readers of code in question feel like ripping their > eyes out... > Agreed. I see no need to optimize an error case here. Any performance hit that we'd get here is almost certainly acceptable in this situation. The main thing is that we prevent the ESTALE from bubbling up into userland if we can avoid it by retrying. No, I didn't have the test for this anymore unfortunately. RHQA might have one though. Either way, I cooked one up that does this on the server: #!/bin/bash while true; do rm -rf foo mkdir foo echo foo > foo/bar usleep 10 done ...and then this on the client after mounting the fs with lookupcache=none and noac. #include #include #include #include #include #include int main(int argc, char **argv) { int fd; while(1) { fd = open(argv[1], O_RDONLY); if (fd < 0) { if (errno == ESTALE) { printf("ESTALE"); return 1; } continue; } close(fd); } return 0; } I did see some of the OPEN compounds come back with NFS4ERR_STALE on the PUTFH op but no corresponding ESTALE error in userland. So, this patch does seem to do the right thing. Reviewed-and-Tested-by: Jeff Layton
Re: Dcache oops
On Fri, Jun 03, 2016 at 07:58:37PM -0400, Oleg Drokin wrote: > > EOPENSTALE, that is... Oleg, could you check if the following works? > > Yes, this one lasted for an hour with no crashing, so it must be good. > Thanks. > (note, I am not equipped to verify correctness of NFS operations, though). I suspect that Jeff Layton might have relevant regression tests. Incidentally, we really need a consolidated regression testsuite, including the tests you'd been running. Right now there's some stuff in xfstests, LTP and cthon; if anything, this mess shows just why we need all of that and then some in a single place. Lustre stuff has caught a 3 years old NFS bug (missing d_drop() in nfs_atomic_open()) and a year-old bug in handling of EOPENSTALE retries on the last component of a trailing non-embedded symlink. Neither is hard to trigger; it's just that relevant tests hadn't been run on NFS, period. Jeff, could you verify that the following does not cause regressions in stale fhandles treatment? I want to rip the damn retry logics out of do_last() and if the staleness had only been discovered inside of nfs4_file_open() just have the upper-level logics handle it by doing a normal LOOKUP_REVAL pass from scratch. To hell with trying to be clever; a few roundtrips it saves us in some cases is not worth the complexity and potential for bugs. I'm fairly sure that the time spent debugging this particular turd exceeds the total amount of time it has ever saved, and do_last() is in dire need of simplification. All talk about "enough eyes" isn't worth much when the readers of code in question feel like ripping their eyes out... diff --git a/fs/namei.c b/fs/namei.c index 4c4f95a..3d9511e 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3166,9 +3166,7 @@ static int do_last(struct nameidata *nd, int acc_mode = op->acc_mode; unsigned seq; struct inode *inode; - struct path save_parent = { .dentry = NULL, .mnt = NULL }; struct path path; - bool retried = false; int error; nd->flags &= ~LOOKUP_PARENT; @@ -3211,7 +3209,6 @@ static int do_last(struct nameidata *nd, return -EISDIR; } -retry_lookup: if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) { error = mnt_want_write(nd->path.mnt); if (!error) @@ -3292,23 +3289,14 @@ finish_lookup: if (unlikely(error)) return error; - if ((nd->flags & LOOKUP_RCU) || nd->path.mnt != path.mnt) { - path_to_nameidata(, nd); - } else { - save_parent.dentry = nd->path.dentry; - save_parent.mnt = mntget(path.mnt); - nd->path.dentry = path.dentry; - - } + path_to_nameidata(, nd); nd->inode = inode; nd->seq = seq; /* Why this, you ask? _Now_ we might have grown LOOKUP_JUMPED... */ finish_open: error = complete_walk(nd); - if (error) { - path_put(_parent); + if (error) return error; - } audit_inode(nd->name, nd->path.dentry, 0); error = -EISDIR; if ((open_flag & O_CREAT) && d_is_dir(nd->path.dentry)) @@ -3331,13 +3319,9 @@ finish_open_created: goto out; BUG_ON(*opened & FILE_OPENED); /* once it's opened, it's opened */ error = vfs_open(>path, file, current_cred()); - if (!error) { - *opened |= FILE_OPENED; - } else { - if (error == -EOPENSTALE) - goto stale_open; + if (error) goto out; - } + *opened |= FILE_OPENED; opened: error = open_check_o_direct(file); if (!error) @@ -3353,26 +3337,7 @@ out: } if (got_write) mnt_drop_write(nd->path.mnt); - path_put(_parent); return error; - -stale_open: - /* If no saved parent or already retried then can't retry */ - if (!save_parent.dentry || retried) - goto out; - - BUG_ON(save_parent.dentry != dir); - path_put(>path); - nd->path = save_parent; - nd->inode = dir->d_inode; - save_parent.mnt = NULL; - save_parent.dentry = NULL; - if (got_write) { - mnt_drop_write(nd->path.mnt); - got_write = false; - } - retried = true; - goto retry_lookup; } static int do_tmpfile(struct nameidata *nd, unsigned flags,
Re: Dcache oops
On Fri, Jun 03, 2016 at 07:58:37PM -0400, Oleg Drokin wrote: > > EOPENSTALE, that is... Oleg, could you check if the following works? > > Yes, this one lasted for an hour with no crashing, so it must be good. > Thanks. > (note, I am not equipped to verify correctness of NFS operations, though). I suspect that Jeff Layton might have relevant regression tests. Incidentally, we really need a consolidated regression testsuite, including the tests you'd been running. Right now there's some stuff in xfstests, LTP and cthon; if anything, this mess shows just why we need all of that and then some in a single place. Lustre stuff has caught a 3 years old NFS bug (missing d_drop() in nfs_atomic_open()) and a year-old bug in handling of EOPENSTALE retries on the last component of a trailing non-embedded symlink. Neither is hard to trigger; it's just that relevant tests hadn't been run on NFS, period. Jeff, could you verify that the following does not cause regressions in stale fhandles treatment? I want to rip the damn retry logics out of do_last() and if the staleness had only been discovered inside of nfs4_file_open() just have the upper-level logics handle it by doing a normal LOOKUP_REVAL pass from scratch. To hell with trying to be clever; a few roundtrips it saves us in some cases is not worth the complexity and potential for bugs. I'm fairly sure that the time spent debugging this particular turd exceeds the total amount of time it has ever saved, and do_last() is in dire need of simplification. All talk about "enough eyes" isn't worth much when the readers of code in question feel like ripping their eyes out... diff --git a/fs/namei.c b/fs/namei.c index 4c4f95a..3d9511e 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3166,9 +3166,7 @@ static int do_last(struct nameidata *nd, int acc_mode = op->acc_mode; unsigned seq; struct inode *inode; - struct path save_parent = { .dentry = NULL, .mnt = NULL }; struct path path; - bool retried = false; int error; nd->flags &= ~LOOKUP_PARENT; @@ -3211,7 +3209,6 @@ static int do_last(struct nameidata *nd, return -EISDIR; } -retry_lookup: if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) { error = mnt_want_write(nd->path.mnt); if (!error) @@ -3292,23 +3289,14 @@ finish_lookup: if (unlikely(error)) return error; - if ((nd->flags & LOOKUP_RCU) || nd->path.mnt != path.mnt) { - path_to_nameidata(, nd); - } else { - save_parent.dentry = nd->path.dentry; - save_parent.mnt = mntget(path.mnt); - nd->path.dentry = path.dentry; - - } + path_to_nameidata(, nd); nd->inode = inode; nd->seq = seq; /* Why this, you ask? _Now_ we might have grown LOOKUP_JUMPED... */ finish_open: error = complete_walk(nd); - if (error) { - path_put(_parent); + if (error) return error; - } audit_inode(nd->name, nd->path.dentry, 0); error = -EISDIR; if ((open_flag & O_CREAT) && d_is_dir(nd->path.dentry)) @@ -3331,13 +3319,9 @@ finish_open_created: goto out; BUG_ON(*opened & FILE_OPENED); /* once it's opened, it's opened */ error = vfs_open(>path, file, current_cred()); - if (!error) { - *opened |= FILE_OPENED; - } else { - if (error == -EOPENSTALE) - goto stale_open; + if (error) goto out; - } + *opened |= FILE_OPENED; opened: error = open_check_o_direct(file); if (!error) @@ -3353,26 +3337,7 @@ out: } if (got_write) mnt_drop_write(nd->path.mnt); - path_put(_parent); return error; - -stale_open: - /* If no saved parent or already retried then can't retry */ - if (!save_parent.dentry || retried) - goto out; - - BUG_ON(save_parent.dentry != dir); - path_put(>path); - nd->path = save_parent; - nd->inode = dir->d_inode; - save_parent.mnt = NULL; - save_parent.dentry = NULL; - if (got_write) { - mnt_drop_write(nd->path.mnt); - got_write = false; - } - retried = true; - goto retry_lookup; } static int do_tmpfile(struct nameidata *nd, unsigned flags,
Re: Dcache oops
On Jun 3, 2016, at 6:37 PM, Al Viro wrote: > On Fri, Jun 03, 2016 at 11:23:55PM +0100, Al Viro wrote: > >> It's not that. It's explicit put_link() in do_last(), followed by >> ESTALEOPEN and subsequent misbegotten "retry the last step on ESTALEOPEN" >> looking at now-freed nd->last.name. IOW, the bug predates delayed_call >> stuff. > > EOPENSTALE, that is... Oleg, could you check if the following works? Yes, this one lasted for an hour with no crashing, so it must be good. Thanks. (note, I am not equipped to verify correctness of NFS operations, though). > diff --git a/fs/namei.c b/fs/namei.c > index 4c4f95a..3d9511e 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -3166,9 +3166,7 @@ static int do_last(struct nameidata *nd, > int acc_mode = op->acc_mode; > unsigned seq; > struct inode *inode; > - struct path save_parent = { .dentry = NULL, .mnt = NULL }; > struct path path; > - bool retried = false; > int error; > > nd->flags &= ~LOOKUP_PARENT; > @@ -3211,7 +3209,6 @@ static int do_last(struct nameidata *nd, > return -EISDIR; > } > > -retry_lookup: > if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) { > error = mnt_want_write(nd->path.mnt); > if (!error) > @@ -3292,23 +3289,14 @@ finish_lookup: > if (unlikely(error)) > return error; > > - if ((nd->flags & LOOKUP_RCU) || nd->path.mnt != path.mnt) { > - path_to_nameidata(, nd); > - } else { > - save_parent.dentry = nd->path.dentry; > - save_parent.mnt = mntget(path.mnt); > - nd->path.dentry = path.dentry; > - > - } > + path_to_nameidata(, nd); > nd->inode = inode; > nd->seq = seq; > /* Why this, you ask? _Now_ we might have grown LOOKUP_JUMPED... */ > finish_open: > error = complete_walk(nd); > - if (error) { > - path_put(_parent); > + if (error) > return error; > - } > audit_inode(nd->name, nd->path.dentry, 0); > error = -EISDIR; > if ((open_flag & O_CREAT) && d_is_dir(nd->path.dentry)) > @@ -3331,13 +3319,9 @@ finish_open_created: > goto out; > BUG_ON(*opened & FILE_OPENED); /* once it's opened, it's opened */ > error = vfs_open(>path, file, current_cred()); > - if (!error) { > - *opened |= FILE_OPENED; > - } else { > - if (error == -EOPENSTALE) > - goto stale_open; > + if (error) > goto out; > - } > + *opened |= FILE_OPENED; > opened: > error = open_check_o_direct(file); > if (!error) > @@ -3353,26 +3337,7 @@ out: > } > if (got_write) > mnt_drop_write(nd->path.mnt); > - path_put(_parent); > return error; > - > -stale_open: > - /* If no saved parent or already retried then can't retry */ > - if (!save_parent.dentry || retried) > - goto out; > - > - BUG_ON(save_parent.dentry != dir); > - path_put(>path); > - nd->path = save_parent; > - nd->inode = dir->d_inode; > - save_parent.mnt = NULL; > - save_parent.dentry = NULL; > - if (got_write) { > - mnt_drop_write(nd->path.mnt); > - got_write = false; > - } > - retried = true; > - goto retry_lookup; > } > > static int do_tmpfile(struct nameidata *nd, unsigned flags,
Re: Dcache oops
On Jun 3, 2016, at 6:37 PM, Al Viro wrote: > On Fri, Jun 03, 2016 at 11:23:55PM +0100, Al Viro wrote: > >> It's not that. It's explicit put_link() in do_last(), followed by >> ESTALEOPEN and subsequent misbegotten "retry the last step on ESTALEOPEN" >> looking at now-freed nd->last.name. IOW, the bug predates delayed_call >> stuff. > > EOPENSTALE, that is... Oleg, could you check if the following works? Yes, this one lasted for an hour with no crashing, so it must be good. Thanks. (note, I am not equipped to verify correctness of NFS operations, though). > diff --git a/fs/namei.c b/fs/namei.c > index 4c4f95a..3d9511e 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -3166,9 +3166,7 @@ static int do_last(struct nameidata *nd, > int acc_mode = op->acc_mode; > unsigned seq; > struct inode *inode; > - struct path save_parent = { .dentry = NULL, .mnt = NULL }; > struct path path; > - bool retried = false; > int error; > > nd->flags &= ~LOOKUP_PARENT; > @@ -3211,7 +3209,6 @@ static int do_last(struct nameidata *nd, > return -EISDIR; > } > > -retry_lookup: > if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) { > error = mnt_want_write(nd->path.mnt); > if (!error) > @@ -3292,23 +3289,14 @@ finish_lookup: > if (unlikely(error)) > return error; > > - if ((nd->flags & LOOKUP_RCU) || nd->path.mnt != path.mnt) { > - path_to_nameidata(, nd); > - } else { > - save_parent.dentry = nd->path.dentry; > - save_parent.mnt = mntget(path.mnt); > - nd->path.dentry = path.dentry; > - > - } > + path_to_nameidata(, nd); > nd->inode = inode; > nd->seq = seq; > /* Why this, you ask? _Now_ we might have grown LOOKUP_JUMPED... */ > finish_open: > error = complete_walk(nd); > - if (error) { > - path_put(_parent); > + if (error) > return error; > - } > audit_inode(nd->name, nd->path.dentry, 0); > error = -EISDIR; > if ((open_flag & O_CREAT) && d_is_dir(nd->path.dentry)) > @@ -3331,13 +3319,9 @@ finish_open_created: > goto out; > BUG_ON(*opened & FILE_OPENED); /* once it's opened, it's opened */ > error = vfs_open(>path, file, current_cred()); > - if (!error) { > - *opened |= FILE_OPENED; > - } else { > - if (error == -EOPENSTALE) > - goto stale_open; > + if (error) > goto out; > - } > + *opened |= FILE_OPENED; > opened: > error = open_check_o_direct(file); > if (!error) > @@ -3353,26 +3337,7 @@ out: > } > if (got_write) > mnt_drop_write(nd->path.mnt); > - path_put(_parent); > return error; > - > -stale_open: > - /* If no saved parent or already retried then can't retry */ > - if (!save_parent.dentry || retried) > - goto out; > - > - BUG_ON(save_parent.dentry != dir); > - path_put(>path); > - nd->path = save_parent; > - nd->inode = dir->d_inode; > - save_parent.mnt = NULL; > - save_parent.dentry = NULL; > - if (got_write) { > - mnt_drop_write(nd->path.mnt); > - got_write = false; > - } > - retried = true; > - goto retry_lookup; > } > > static int do_tmpfile(struct nameidata *nd, unsigned flags,
Re: Dcache oops
On Jun 3, 2016, at 6:37 PM, Al Viro wrote: > On Fri, Jun 03, 2016 at 11:23:55PM +0100, Al Viro wrote: > >> It's not that. It's explicit put_link() in do_last(), followed by >> ESTALEOPEN and subsequent misbegotten "retry the last step on ESTALEOPEN" >> looking at now-freed nd->last.name. IOW, the bug predates delayed_call >> stuff. > > EOPENSTALE, that is... Oleg, could you check if the following works? Ok, I'll try this one too. The other one with goto stale_open also was appearing to work. We'll see what else is going to show up next... > diff --git a/fs/namei.c b/fs/namei.c > index 4c4f95a..3d9511e 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -3166,9 +3166,7 @@ static int do_last(struct nameidata *nd, > int acc_mode = op->acc_mode; > unsigned seq; > struct inode *inode; > - struct path save_parent = { .dentry = NULL, .mnt = NULL }; > struct path path; > - bool retried = false; > int error; > > nd->flags &= ~LOOKUP_PARENT; > @@ -3211,7 +3209,6 @@ static int do_last(struct nameidata *nd, > return -EISDIR; > } > > -retry_lookup: > if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) { > error = mnt_want_write(nd->path.mnt); > if (!error) > @@ -3292,23 +3289,14 @@ finish_lookup: > if (unlikely(error)) > return error; > > - if ((nd->flags & LOOKUP_RCU) || nd->path.mnt != path.mnt) { > - path_to_nameidata(, nd); > - } else { > - save_parent.dentry = nd->path.dentry; > - save_parent.mnt = mntget(path.mnt); > - nd->path.dentry = path.dentry; > - > - } > + path_to_nameidata(, nd); > nd->inode = inode; > nd->seq = seq; > /* Why this, you ask? _Now_ we might have grown LOOKUP_JUMPED... */ > finish_open: > error = complete_walk(nd); > - if (error) { > - path_put(_parent); > + if (error) > return error; > - } > audit_inode(nd->name, nd->path.dentry, 0); > error = -EISDIR; > if ((open_flag & O_CREAT) && d_is_dir(nd->path.dentry)) > @@ -3331,13 +3319,9 @@ finish_open_created: > goto out; > BUG_ON(*opened & FILE_OPENED); /* once it's opened, it's opened */ > error = vfs_open(>path, file, current_cred()); > - if (!error) { > - *opened |= FILE_OPENED; > - } else { > - if (error == -EOPENSTALE) > - goto stale_open; > + if (error) > goto out; > - } > + *opened |= FILE_OPENED; > opened: > error = open_check_o_direct(file); > if (!error) > @@ -3353,26 +3337,7 @@ out: > } > if (got_write) > mnt_drop_write(nd->path.mnt); > - path_put(_parent); > return error; > - > -stale_open: > - /* If no saved parent or already retried then can't retry */ > - if (!save_parent.dentry || retried) > - goto out; > - > - BUG_ON(save_parent.dentry != dir); > - path_put(>path); > - nd->path = save_parent; > - nd->inode = dir->d_inode; > - save_parent.mnt = NULL; > - save_parent.dentry = NULL; > - if (got_write) { > - mnt_drop_write(nd->path.mnt); > - got_write = false; > - } > - retried = true; > - goto retry_lookup; > } > > static int do_tmpfile(struct nameidata *nd, unsigned flags,
Re: Dcache oops
On Jun 3, 2016, at 6:37 PM, Al Viro wrote: > On Fri, Jun 03, 2016 at 11:23:55PM +0100, Al Viro wrote: > >> It's not that. It's explicit put_link() in do_last(), followed by >> ESTALEOPEN and subsequent misbegotten "retry the last step on ESTALEOPEN" >> looking at now-freed nd->last.name. IOW, the bug predates delayed_call >> stuff. > > EOPENSTALE, that is... Oleg, could you check if the following works? Ok, I'll try this one too. The other one with goto stale_open also was appearing to work. We'll see what else is going to show up next... > diff --git a/fs/namei.c b/fs/namei.c > index 4c4f95a..3d9511e 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -3166,9 +3166,7 @@ static int do_last(struct nameidata *nd, > int acc_mode = op->acc_mode; > unsigned seq; > struct inode *inode; > - struct path save_parent = { .dentry = NULL, .mnt = NULL }; > struct path path; > - bool retried = false; > int error; > > nd->flags &= ~LOOKUP_PARENT; > @@ -3211,7 +3209,6 @@ static int do_last(struct nameidata *nd, > return -EISDIR; > } > > -retry_lookup: > if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) { > error = mnt_want_write(nd->path.mnt); > if (!error) > @@ -3292,23 +3289,14 @@ finish_lookup: > if (unlikely(error)) > return error; > > - if ((nd->flags & LOOKUP_RCU) || nd->path.mnt != path.mnt) { > - path_to_nameidata(, nd); > - } else { > - save_parent.dentry = nd->path.dentry; > - save_parent.mnt = mntget(path.mnt); > - nd->path.dentry = path.dentry; > - > - } > + path_to_nameidata(, nd); > nd->inode = inode; > nd->seq = seq; > /* Why this, you ask? _Now_ we might have grown LOOKUP_JUMPED... */ > finish_open: > error = complete_walk(nd); > - if (error) { > - path_put(_parent); > + if (error) > return error; > - } > audit_inode(nd->name, nd->path.dentry, 0); > error = -EISDIR; > if ((open_flag & O_CREAT) && d_is_dir(nd->path.dentry)) > @@ -3331,13 +3319,9 @@ finish_open_created: > goto out; > BUG_ON(*opened & FILE_OPENED); /* once it's opened, it's opened */ > error = vfs_open(>path, file, current_cred()); > - if (!error) { > - *opened |= FILE_OPENED; > - } else { > - if (error == -EOPENSTALE) > - goto stale_open; > + if (error) > goto out; > - } > + *opened |= FILE_OPENED; > opened: > error = open_check_o_direct(file); > if (!error) > @@ -3353,26 +3337,7 @@ out: > } > if (got_write) > mnt_drop_write(nd->path.mnt); > - path_put(_parent); > return error; > - > -stale_open: > - /* If no saved parent or already retried then can't retry */ > - if (!save_parent.dentry || retried) > - goto out; > - > - BUG_ON(save_parent.dentry != dir); > - path_put(>path); > - nd->path = save_parent; > - nd->inode = dir->d_inode; > - save_parent.mnt = NULL; > - save_parent.dentry = NULL; > - if (got_write) { > - mnt_drop_write(nd->path.mnt); > - got_write = false; > - } > - retried = true; > - goto retry_lookup; > } > > static int do_tmpfile(struct nameidata *nd, unsigned flags,
Re: Dcache oops
On Fri, Jun 03, 2016 at 03:36:22PM -0700, Linus Torvalds wrote: > Happy to hear that you seem to have figured it out. > > But why did it apparently only start happening now? Oleg has started to use Lustre torture tests on NFS, that's all. Note, BTW, that first they'd triggered an oopsable bug (fairly easy to reproduce) in nfs_atomic_open() that had been there for 3 years ;-/
Re: Dcache oops
On Fri, Jun 03, 2016 at 03:36:22PM -0700, Linus Torvalds wrote: > Happy to hear that you seem to have figured it out. > > But why did it apparently only start happening now? Oleg has started to use Lustre torture tests on NFS, that's all. Note, BTW, that first they'd triggered an oopsable bug (fairly easy to reproduce) in nfs_atomic_open() that had been there for 3 years ;-/
Re: Dcache oops
On Jun 3, 2016, at 6:36 PM, Linus Torvalds wrote: > On Fri, Jun 3, 2016 at 3:23 PM, Al Virowrote: >> On Fri, Jun 03, 2016 at 03:00:02PM -0700, Linus Torvalds wrote: >>> Normally it's done at terminate_walk() time. But I note that in >>> walk_component(), we do put_link(nd) which does a do_delayed_call(), >>> but does *not* do a clear_delayed_call(), so now I think a subsequent >>> terminate_walk() might drop it *again*. >> >> Nope - put_link() also decrements nd->depth. No double calls there... > > Yeah, I figured that out, and then continued to try to look at other cases.. > > Happy to hear that you seem to have figured it out. > > But why did it apparently only start happening now? Apparently nobody runs NFS in a serious manner these days anymore. EOPENSTALE is only used by NFS.
Re: Dcache oops
On Jun 3, 2016, at 6:36 PM, Linus Torvalds wrote: > On Fri, Jun 3, 2016 at 3:23 PM, Al Viro wrote: >> On Fri, Jun 03, 2016 at 03:00:02PM -0700, Linus Torvalds wrote: >>> Normally it's done at terminate_walk() time. But I note that in >>> walk_component(), we do put_link(nd) which does a do_delayed_call(), >>> but does *not* do a clear_delayed_call(), so now I think a subsequent >>> terminate_walk() might drop it *again*. >> >> Nope - put_link() also decrements nd->depth. No double calls there... > > Yeah, I figured that out, and then continued to try to look at other cases.. > > Happy to hear that you seem to have figured it out. > > But why did it apparently only start happening now? Apparently nobody runs NFS in a serious manner these days anymore. EOPENSTALE is only used by NFS.
Re: Dcache oops
On Fri, Jun 03, 2016 at 11:23:55PM +0100, Al Viro wrote: > It's not that. It's explicit put_link() in do_last(), followed by > ESTALEOPEN and subsequent misbegotten "retry the last step on ESTALEOPEN" > looking at now-freed nd->last.name. IOW, the bug predates delayed_call > stuff. EOPENSTALE, that is... Oleg, could you check if the following works? diff --git a/fs/namei.c b/fs/namei.c index 4c4f95a..3d9511e 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3166,9 +3166,7 @@ static int do_last(struct nameidata *nd, int acc_mode = op->acc_mode; unsigned seq; struct inode *inode; - struct path save_parent = { .dentry = NULL, .mnt = NULL }; struct path path; - bool retried = false; int error; nd->flags &= ~LOOKUP_PARENT; @@ -3211,7 +3209,6 @@ static int do_last(struct nameidata *nd, return -EISDIR; } -retry_lookup: if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) { error = mnt_want_write(nd->path.mnt); if (!error) @@ -3292,23 +3289,14 @@ finish_lookup: if (unlikely(error)) return error; - if ((nd->flags & LOOKUP_RCU) || nd->path.mnt != path.mnt) { - path_to_nameidata(, nd); - } else { - save_parent.dentry = nd->path.dentry; - save_parent.mnt = mntget(path.mnt); - nd->path.dentry = path.dentry; - - } + path_to_nameidata(, nd); nd->inode = inode; nd->seq = seq; /* Why this, you ask? _Now_ we might have grown LOOKUP_JUMPED... */ finish_open: error = complete_walk(nd); - if (error) { - path_put(_parent); + if (error) return error; - } audit_inode(nd->name, nd->path.dentry, 0); error = -EISDIR; if ((open_flag & O_CREAT) && d_is_dir(nd->path.dentry)) @@ -3331,13 +3319,9 @@ finish_open_created: goto out; BUG_ON(*opened & FILE_OPENED); /* once it's opened, it's opened */ error = vfs_open(>path, file, current_cred()); - if (!error) { - *opened |= FILE_OPENED; - } else { - if (error == -EOPENSTALE) - goto stale_open; + if (error) goto out; - } + *opened |= FILE_OPENED; opened: error = open_check_o_direct(file); if (!error) @@ -3353,26 +3337,7 @@ out: } if (got_write) mnt_drop_write(nd->path.mnt); - path_put(_parent); return error; - -stale_open: - /* If no saved parent or already retried then can't retry */ - if (!save_parent.dentry || retried) - goto out; - - BUG_ON(save_parent.dentry != dir); - path_put(>path); - nd->path = save_parent; - nd->inode = dir->d_inode; - save_parent.mnt = NULL; - save_parent.dentry = NULL; - if (got_write) { - mnt_drop_write(nd->path.mnt); - got_write = false; - } - retried = true; - goto retry_lookup; } static int do_tmpfile(struct nameidata *nd, unsigned flags,
Re: Dcache oops
On Fri, Jun 03, 2016 at 11:23:55PM +0100, Al Viro wrote: > It's not that. It's explicit put_link() in do_last(), followed by > ESTALEOPEN and subsequent misbegotten "retry the last step on ESTALEOPEN" > looking at now-freed nd->last.name. IOW, the bug predates delayed_call > stuff. EOPENSTALE, that is... Oleg, could you check if the following works? diff --git a/fs/namei.c b/fs/namei.c index 4c4f95a..3d9511e 100644 --- a/fs/namei.c +++ b/fs/namei.c @@ -3166,9 +3166,7 @@ static int do_last(struct nameidata *nd, int acc_mode = op->acc_mode; unsigned seq; struct inode *inode; - struct path save_parent = { .dentry = NULL, .mnt = NULL }; struct path path; - bool retried = false; int error; nd->flags &= ~LOOKUP_PARENT; @@ -3211,7 +3209,6 @@ static int do_last(struct nameidata *nd, return -EISDIR; } -retry_lookup: if (open_flag & (O_CREAT | O_TRUNC | O_WRONLY | O_RDWR)) { error = mnt_want_write(nd->path.mnt); if (!error) @@ -3292,23 +3289,14 @@ finish_lookup: if (unlikely(error)) return error; - if ((nd->flags & LOOKUP_RCU) || nd->path.mnt != path.mnt) { - path_to_nameidata(, nd); - } else { - save_parent.dentry = nd->path.dentry; - save_parent.mnt = mntget(path.mnt); - nd->path.dentry = path.dentry; - - } + path_to_nameidata(, nd); nd->inode = inode; nd->seq = seq; /* Why this, you ask? _Now_ we might have grown LOOKUP_JUMPED... */ finish_open: error = complete_walk(nd); - if (error) { - path_put(_parent); + if (error) return error; - } audit_inode(nd->name, nd->path.dentry, 0); error = -EISDIR; if ((open_flag & O_CREAT) && d_is_dir(nd->path.dentry)) @@ -3331,13 +3319,9 @@ finish_open_created: goto out; BUG_ON(*opened & FILE_OPENED); /* once it's opened, it's opened */ error = vfs_open(>path, file, current_cred()); - if (!error) { - *opened |= FILE_OPENED; - } else { - if (error == -EOPENSTALE) - goto stale_open; + if (error) goto out; - } + *opened |= FILE_OPENED; opened: error = open_check_o_direct(file); if (!error) @@ -3353,26 +3337,7 @@ out: } if (got_write) mnt_drop_write(nd->path.mnt); - path_put(_parent); return error; - -stale_open: - /* If no saved parent or already retried then can't retry */ - if (!save_parent.dentry || retried) - goto out; - - BUG_ON(save_parent.dentry != dir); - path_put(>path); - nd->path = save_parent; - nd->inode = dir->d_inode; - save_parent.mnt = NULL; - save_parent.dentry = NULL; - if (got_write) { - mnt_drop_write(nd->path.mnt); - got_write = false; - } - retried = true; - goto retry_lookup; } static int do_tmpfile(struct nameidata *nd, unsigned flags,
Re: Dcache oops
On Fri, Jun 3, 2016 at 3:23 PM, Al Virowrote: > On Fri, Jun 03, 2016 at 03:00:02PM -0700, Linus Torvalds wrote: >>> >> Normally it's done at terminate_walk() time. But I note that in >> walk_component(), we do put_link(nd) which does a do_delayed_call(), >> but does *not* do a clear_delayed_call(), so now I think a subsequent >> terminate_walk() might drop it *again*. > > Nope - put_link() also decrements nd->depth. No double calls there... Yeah, I figured that out, and then continued to try to look at other cases.. Happy to hear that you seem to have figured it out. But why did it apparently only start happening now? Linus
Re: Dcache oops
On Fri, Jun 3, 2016 at 3:23 PM, Al Viro wrote: > On Fri, Jun 03, 2016 at 03:00:02PM -0700, Linus Torvalds wrote: >>> >> Normally it's done at terminate_walk() time. But I note that in >> walk_component(), we do put_link(nd) which does a do_delayed_call(), >> but does *not* do a clear_delayed_call(), so now I think a subsequent >> terminate_walk() might drop it *again*. > > Nope - put_link() also decrements nd->depth. No double calls there... Yeah, I figured that out, and then continued to try to look at other cases.. Happy to hear that you seem to have figured it out. But why did it apparently only start happening now? Linus
Re: Dcache oops
On Fri, Jun 03, 2016 at 11:23:55PM +0100, Al Viro wrote: > It's not that. It's explicit put_link() in do_last(), followed by > ESTALEOPEN and subsequent misbegotten "retry the last step on ESTALEOPEN" > looking at now-freed nd->last.name. IOW, the bug predates delayed_call > stuff. FWIW, I'd stepped into that in d63ff28f "namei: make should_follow_link() store the link in nd->link", so it's 4.1+ mess. delayed_call stuff is 4.4+...
Re: Dcache oops
On Fri, Jun 03, 2016 at 11:23:55PM +0100, Al Viro wrote: > It's not that. It's explicit put_link() in do_last(), followed by > ESTALEOPEN and subsequent misbegotten "retry the last step on ESTALEOPEN" > looking at now-freed nd->last.name. IOW, the bug predates delayed_call > stuff. FWIW, I'd stepped into that in d63ff28f "namei: make should_follow_link() store the link in nd->link", so it's 4.1+ mess. delayed_call stuff is 4.4+...
Re: Dcache oops
On Fri, Jun 03, 2016 at 03:00:02PM -0700, Linus Torvalds wrote: > Is perhaps the "delayed_call" logic broken, and the symlink is free'd too > early? > > That whole set_delayed_call/do_delayed_call thing came in 4.5. Maybe > something broke that logic, and we've executed the delayed freeing > before we should have. > > Normally it's done at terminate_walk() time. But I note that in > walk_component(), we do put_link(nd) which does a do_delayed_call(), > but does *not* do a clear_delayed_call(), so now I think a subsequent > terminate_walk() might drop it *again*. Nope - put_link() also decrements nd->depth. No double calls there... > I'm probably missing something, but I have to say that the old > explicit "put_link()" callback logic was more obvious than the new > delayed calls are. It's not that. It's explicit put_link() in do_last(), followed by ESTALEOPEN and subsequent misbegotten "retry the last step on ESTALEOPEN" looking at now-freed nd->last.name. IOW, the bug predates delayed_call stuff.
Re: Dcache oops
On Fri, Jun 03, 2016 at 03:00:02PM -0700, Linus Torvalds wrote: > Is perhaps the "delayed_call" logic broken, and the symlink is free'd too > early? > > That whole set_delayed_call/do_delayed_call thing came in 4.5. Maybe > something broke that logic, and we've executed the delayed freeing > before we should have. > > Normally it's done at terminate_walk() time. But I note that in > walk_component(), we do put_link(nd) which does a do_delayed_call(), > but does *not* do a clear_delayed_call(), so now I think a subsequent > terminate_walk() might drop it *again*. Nope - put_link() also decrements nd->depth. No double calls there... > I'm probably missing something, but I have to say that the old > explicit "put_link()" callback logic was more obvious than the new > delayed calls are. It's not that. It's explicit put_link() in do_last(), followed by ESTALEOPEN and subsequent misbegotten "retry the last step on ESTALEOPEN" looking at now-freed nd->last.name. IOW, the bug predates delayed_call stuff.
Re: Dcache oops
On Fri, Jun 03, 2016 at 10:46:31PM +0100, Al Viro wrote: > On Fri, Jun 03, 2016 at 05:17:06PM -0400, Oleg Drokin wrote: > > > > Can the same thing be reproduced (with NFS fix) on v4.6, ede4090, 7f427d3, > > > 4e8440b? > > > > Well, that was faster than I expected. 4e8440b triggers right away, so I > > guess > > there's no point in trying the later ones? > > BTW, just to confirm you are noticing - this is a DEBUG_PAGEALLOC build, > > so all freed memory is unmapped which is likely causing this oops - as a > > sign > > of use after free. > > > [ 54.990119] BUG: unable to handle kernel paging request at > > 8800d2b7f000 > > Again a page-aligned nd->last.name and even smaller nd->last.len. It smells > like a page that used to contain a symlink body, but got freed under us. OK, I think I understand what's going on there. We have a pathname that ends with a trailing symlink. Traverse that symlink up to the last component. And get EOPENSTALE on attempt to open that. At that point we proceed to retry_lookup: and call lookup_open(). But we'd *already* done put_link() on the first pass, so now nd->last.name points into freed page. Damn... I'm very tempted to rip the retry_lookup logics out of there and just let the damn thing repeat the whole pathname resolution ;-/ do_last() will become so much saner after that... Let's at least verify that this is what's going on - remove if (error == -EOPENSTALE) goto stale_open; from do_last() and see if that fixes the damn thing. Alternative solution would be to turn that if (nd->depth) put_link(nd); error = should_follow_link(nd, , nd->flags & LOOKUP_FOLLOW, inode, seq); if (unlikely(error)) return error; in do_last() into error = should_follow_link(nd, , nd->flags & LOOKUP_FOLLOW, inode, seq); if (unlikely(error)) { if (nd->depth == 2) { struct saved *last = nd->stack[0]; do_delayed_call(>done); if (!(nd->flags & LOOKUP_RCU)) path_put(>link); nd->stack[0] = nd->stack[1]; nd->depth--; } return error; } but I would really prefer the first approach - it allows to remove arseloads of convoluted crap from do_last().
Re: Dcache oops
On Fri, Jun 03, 2016 at 10:46:31PM +0100, Al Viro wrote: > On Fri, Jun 03, 2016 at 05:17:06PM -0400, Oleg Drokin wrote: > > > > Can the same thing be reproduced (with NFS fix) on v4.6, ede4090, 7f427d3, > > > 4e8440b? > > > > Well, that was faster than I expected. 4e8440b triggers right away, so I > > guess > > there's no point in trying the later ones? > > BTW, just to confirm you are noticing - this is a DEBUG_PAGEALLOC build, > > so all freed memory is unmapped which is likely causing this oops - as a > > sign > > of use after free. > > > [ 54.990119] BUG: unable to handle kernel paging request at > > 8800d2b7f000 > > Again a page-aligned nd->last.name and even smaller nd->last.len. It smells > like a page that used to contain a symlink body, but got freed under us. OK, I think I understand what's going on there. We have a pathname that ends with a trailing symlink. Traverse that symlink up to the last component. And get EOPENSTALE on attempt to open that. At that point we proceed to retry_lookup: and call lookup_open(). But we'd *already* done put_link() on the first pass, so now nd->last.name points into freed page. Damn... I'm very tempted to rip the retry_lookup logics out of there and just let the damn thing repeat the whole pathname resolution ;-/ do_last() will become so much saner after that... Let's at least verify that this is what's going on - remove if (error == -EOPENSTALE) goto stale_open; from do_last() and see if that fixes the damn thing. Alternative solution would be to turn that if (nd->depth) put_link(nd); error = should_follow_link(nd, , nd->flags & LOOKUP_FOLLOW, inode, seq); if (unlikely(error)) return error; in do_last() into error = should_follow_link(nd, , nd->flags & LOOKUP_FOLLOW, inode, seq); if (unlikely(error)) { if (nd->depth == 2) { struct saved *last = nd->stack[0]; do_delayed_call(>done); if (!(nd->flags & LOOKUP_RCU)) path_put(>link); nd->stack[0] = nd->stack[1]; nd->depth--; } return error; } but I would really prefer the first approach - it allows to remove arseloads of convoluted crap from do_last().
Re: Dcache oops
On Fri, Jun 3, 2016 at 2:26 PM, Al Virowrote: >> >> in the __d_lookup() disassembly. And %rdi contains 2, so there were >> supposed to be two more characters at 'ct' (which is %rdx). > > ... and since r8 and rsi are 0, we couldn't have consumed anything. Right you are. So it really started out page-aligned. >> Why would nd->last.name be bogus? I don't see anything. > > An interesting part is that it's page-aligned. Which is impossible for > a short name obtained by getname(), but is quite likely for a symlink body. > So at a guess, we have a page containing a symlink body freed under us. Hmm. Good point. Is perhaps the "delayed_call" logic broken, and the symlink is free'd too early? That whole set_delayed_call/do_delayed_call thing came in 4.5. Maybe something broke that logic, and we've executed the delayed freeing before we should have. Normally it's done at terminate_walk() time. But I note that in walk_component(), we do put_link(nd) which does a do_delayed_call(), but does *not* do a clear_delayed_call(), so now I think a subsequent terminate_walk() might drop it *again*. I'm probably missing something, but I have to say that the old explicit "put_link()" callback logic was more obvious than the new delayed calls are. Linus
Re: Dcache oops
On Fri, Jun 3, 2016 at 2:26 PM, Al Viro wrote: >> >> in the __d_lookup() disassembly. And %rdi contains 2, so there were >> supposed to be two more characters at 'ct' (which is %rdx). > > ... and since r8 and rsi are 0, we couldn't have consumed anything. Right you are. So it really started out page-aligned. >> Why would nd->last.name be bogus? I don't see anything. > > An interesting part is that it's page-aligned. Which is impossible for > a short name obtained by getname(), but is quite likely for a symlink body. > So at a guess, we have a page containing a symlink body freed under us. Hmm. Good point. Is perhaps the "delayed_call" logic broken, and the symlink is free'd too early? That whole set_delayed_call/do_delayed_call thing came in 4.5. Maybe something broke that logic, and we've executed the delayed freeing before we should have. Normally it's done at terminate_walk() time. But I note that in walk_component(), we do put_link(nd) which does a do_delayed_call(), but does *not* do a clear_delayed_call(), so now I think a subsequent terminate_walk() might drop it *again*. I'm probably missing something, but I have to say that the old explicit "put_link()" callback logic was more obvious than the new delayed calls are. Linus
Re: Dcache oops
On Fri, Jun 03, 2016 at 05:17:06PM -0400, Oleg Drokin wrote: > > Can the same thing be reproduced (with NFS fix) on v4.6, ede4090, 7f427d3, > > 4e8440b? > > Well, that was faster than I expected. 4e8440b triggers right away, so I guess > there's no point in trying the later ones? > BTW, just to confirm you are noticing - this is a DEBUG_PAGEALLOC build, > so all freed memory is unmapped which is likely causing this oops - as a sign > of use after free. > [ 54.990119] BUG: unable to handle kernel paging request at 8800d2b7f000 Again a page-aligned nd->last.name and even smaller nd->last.len. It smells like a page that used to contain a symlink body, but got freed under us.
Re: Dcache oops
On Fri, Jun 03, 2016 at 05:17:06PM -0400, Oleg Drokin wrote: > > Can the same thing be reproduced (with NFS fix) on v4.6, ede4090, 7f427d3, > > 4e8440b? > > Well, that was faster than I expected. 4e8440b triggers right away, so I guess > there's no point in trying the later ones? > BTW, just to confirm you are noticing - this is a DEBUG_PAGEALLOC build, > so all freed memory is unmapped which is likely causing this oops - as a sign > of use after free. > [ 54.990119] BUG: unable to handle kernel paging request at 8800d2b7f000 Again a page-aligned nd->last.name and even smaller nd->last.len. It smells like a page that used to contain a symlink body, but got freed under us.
Re: Dcache oops
On Fri, Jun 03, 2016 at 02:18:15PM -0700, Linus Torvalds wrote: > So something must have corrupted the qstr. > > The remaining length *should* in %edi, judging by the > >0x81243b82 <+306>: cmp$0x7,%edi > > in the __d_lookup() disassembly. And %rdi contains 2, so there were > supposed to be two more characters at 'ct' (which is %rdx). ... and since r8 and rsi are 0, we couldn't have consumed anything. > > Why would nd->last.name be bogus? I don't see anything. An interesting part is that it's page-aligned. Which is impossible for a short name obtained by getname(), but is quite likely for a symlink body. So at a guess, we have a page containing a symlink body freed under us.
Re: Dcache oops
On Fri, Jun 03, 2016 at 02:18:15PM -0700, Linus Torvalds wrote: > So something must have corrupted the qstr. > > The remaining length *should* in %edi, judging by the > >0x81243b82 <+306>: cmp$0x7,%edi > > in the __d_lookup() disassembly. And %rdi contains 2, so there were > supposed to be two more characters at 'ct' (which is %rdx). ... and since r8 and rsi are 0, we couldn't have consumed anything. > > Why would nd->last.name be bogus? I don't see anything. An interesting part is that it's page-aligned. Which is impossible for a short name obtained by getname(), but is quite likely for a symlink body. So at a guess, we have a page containing a symlink body freed under us.
Re: Dcache oops
On Fri, Jun 3, 2016 at 1:07 PM, Al Virowrote: > > Aha... It's load_unaligned_zeropad() from dentry_string_cmp(), hitting > a genuinely unmapped address. That sends it into fixup, where it tries to > load an aligned word containing the address in question, in hope that > fault was on attempt to cross into the next page. No such luck, address > was aligned in the first place (it's in %rdx - 0x880113f82000), so > we still oops. Hmm. We do end up comparing the string length racily with the name address, so I suspect there could be a real bug in dentry_cmp: the fact that the name pointer and length aren't necessarily atomic wrt each other means that we could overrun the dentry pointer to the next page, even though it's aligned. So maybe we need to do a careful load of the aligned dentry string value too, not just the possibly unaligned qstr name. It's a *very* unlikely race to hit, though. You'd have to shrink the name in the dentry, get the old (longer name length) to match the one you look up, and when have the memory unmapped at the end of the new (short) length. However, this is not that theoretical race case for two reasons: (1) this happens in __d_lookup(), which has taken the dentry lock. So it's not the racy unlocked RCU case to begin with. (2) as you point out, since it is the load_unaligned_zeropad() case, it isn't the possibly racy dentry name at all, but trhe qstr we're comparing against (which may have the unaligned case, but not the confusion about length). > The unexpected part is that unmapped address did *NOT* come from a dentry; > it's .name of qstr we were looking for. And your call chain was > __d_lookup() <- d_lookup() <- lookup_open(), so in lookup_open() it was > nd->last.name... That should have no such issues. It should be a normal qstr, and the length should be reliable. So something must have corrupted the qstr. The remaining length *should* in %edi, judging by the 0x81243b82 <+306>: cmp$0x7,%edi in the __d_lookup() disassembly. And %rdi contains 2, so there were supposed to be two more characters at 'ct' (which is %rdx). Why would nd->last.name be bogus? I don't see anything. Linus
Re: Dcache oops
On Fri, Jun 3, 2016 at 1:07 PM, Al Viro wrote: > > Aha... It's load_unaligned_zeropad() from dentry_string_cmp(), hitting > a genuinely unmapped address. That sends it into fixup, where it tries to > load an aligned word containing the address in question, in hope that > fault was on attempt to cross into the next page. No such luck, address > was aligned in the first place (it's in %rdx - 0x880113f82000), so > we still oops. Hmm. We do end up comparing the string length racily with the name address, so I suspect there could be a real bug in dentry_cmp: the fact that the name pointer and length aren't necessarily atomic wrt each other means that we could overrun the dentry pointer to the next page, even though it's aligned. So maybe we need to do a careful load of the aligned dentry string value too, not just the possibly unaligned qstr name. It's a *very* unlikely race to hit, though. You'd have to shrink the name in the dentry, get the old (longer name length) to match the one you look up, and when have the memory unmapped at the end of the new (short) length. However, this is not that theoretical race case for two reasons: (1) this happens in __d_lookup(), which has taken the dentry lock. So it's not the racy unlocked RCU case to begin with. (2) as you point out, since it is the load_unaligned_zeropad() case, it isn't the possibly racy dentry name at all, but trhe qstr we're comparing against (which may have the unaligned case, but not the confusion about length). > The unexpected part is that unmapped address did *NOT* come from a dentry; > it's .name of qstr we were looking for. And your call chain was > __d_lookup() <- d_lookup() <- lookup_open(), so in lookup_open() it was > nd->last.name... That should have no such issues. It should be a normal qstr, and the length should be reliable. So something must have corrupted the qstr. The remaining length *should* in %edi, judging by the 0x81243b82 <+306>: cmp$0x7,%edi in the __d_lookup() disassembly. And %rdi contains 2, so there were supposed to be two more characters at 'ct' (which is %rdx). Why would nd->last.name be bogus? I don't see anything. Linus
Re: Dcache oops
On Jun 3, 2016, at 4:07 PM, Al Viro wrote: > On Fri, Jun 03, 2016 at 02:35:41PM -0400, Oleg Drokin wrote: > [ 2642.364383] BUG: unable to handle kernel paging request at 880113f82000 [ 2642.365014] IP: [] bad_gs+0xd1d/0x1ba9 >>> >>> *ow* >>> Could you dump your vmlinux (and System.map) somewhere on anonftp? >>> This 'bad_gs' is there simply because it's one of the few labels in >>> .fixup - to say anything useful we'll need to find out where we'd >>> really come from. >> >> I see. >> vmlinux with debug symbols: http://knox.linuxhacker.ru/tmp/dcache/vmlinux.gz >> System.map: http://knox.linuxhacker.ru/tmp/dcache/System.map.gz > > OK... > 817f87cd: 48 8d 0alea(%rdx),%rcx > 817f87d0: 48 83 e1 f8 and > $0xfff8,%rcx > 817f87d4: 4c 8b 01mov(%rcx),%r8 > 817f87d7: 8d 0a lea(%rdx),%ecx > 817f87d9: 83 e1 07and$0x7,%ecx > 817f87dc: c1 e1 03shl$0x3,%ecx > 817f87df: 49 d3 e8shr%cl,%r8 > 817f87e2: e9 9b b3 a4 ff jmpq 81243b82 > <__d_lookup+0x132> > > Aha... It's load_unaligned_zeropad() from dentry_string_cmp(), hitting > a genuinely unmapped address. That sends it into fixup, where it tries to > load an aligned word containing the address in question, in hope that > fault was on attempt to cross into the next page. No such luck, address > was aligned in the first place (it's in %rdx - 0x880113f82000), so > we still oops. > > The unexpected part is that unmapped address did *NOT* come from a dentry; > it's .name of qstr we were looking for. And your call chain was > __d_lookup() <- d_lookup() <- lookup_open(), so in lookup_open() it was > nd->last.name... > > Can the same thing be reproduced (with NFS fix) on v4.6, ede4090, 7f427d3, > 4e8440b? Well, that was faster than I expected. 4e8440b triggers right away, so I guess there's no point in trying the later ones? BTW, just to confirm you are noticing - this is a DEBUG_PAGEALLOC build, so all freed memory is unmapped which is likely causing this oops - as a sign of use after free. [ 54.990119] BUG: unable to handle kernel paging request at 8800d2b7f000 [ 54.990423] IP: [] bad_gs+0xcff/0x1b89 [ 54.990598] PGD 2dca067 PUD 11f900067 PMD 11f86a067 PTE 8000d2b7f060 [ 54.990942] Oops: [#1] SMP DEBUG_PAGEALLOC [ 54.991044] Modules linked in: loop rpcsec_gss_krb5 joydev pcspkr i2c_piix4 acpi_cpufreq tpm_tis tpm nfsd drm_kms_helper ttm drm serio_raw virtio_blk [ 54.992301] CPU: 7 PID: 5550 Comm: file_concat.sh Not tainted 4.6.0-4e8440b+ #1 [ 54.993019] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 54.993407] task: 8800d6ec6200 ti: 880110f44000 task.ti: 880110f44000 [ 54.994112] RIP: 0010:[] [] bad_gs+0xcff/0x1b89 [ 54.994870] RSP: 0018:880110f47ba0 EFLAGS: 00010286 [ 54.995288] RAX: 8800d2760f30 RBX: 8800d2760f00 RCX: 8800d2b7f000 [ 54.995736] RDX: 8800d2b7f000 RSI: RDI: 0001 [ 54.996154] RBP: 880110f47be8 R08: R09: 8800d2760ed0 [ 54.996628] R10: 0059 R11: 8800d6ec6d78 R12: 8800d2694ed0 [ 54.997044] R13: 880110f47df0 R14: 8800d2760f50 R15: b761b37e [ 54.997450] FS: 7fe8804d2700() GS:88011f5c() knlGS: [ 54.998164] CS: 0010 DS: ES: CR0: 80050033 [ 54.998624] CR2: 8800d2b7f000 CR3: 00011931f000 CR4: 06e0 [ 54.999225] Stack: [ 54.999596] 81243965 0096 8800d2b7f000 0001 [ 55.000632] 042c 880110f47df0 8800d2694ed0 880110f47de0 [ 55.001497] 880110f47df0 880110f47c18 81243b94 81234b3e [ 55.002353] Call Trace: [ 55.002709] [] ? __d_lookup+0x5/0x1b0 [ 55.003104] [] d_lookup+0x84/0xb0 [ 55.003487] [] ? lookup_open+0xfe/0x7a0 [ 55.003878] [] lookup_open+0xfe/0x7a0 [ 55.004272] [] path_openat+0x4de/0xfc0 [ 55.004665] [] do_filp_open+0x7e/0xe0 [ 55.005061] [] ? lock_rename+0x100/0x100 [ 55.005452] [] ? _raw_spin_unlock+0x27/0x40 [ 55.005846] [] ? __alloc_fd+0xbc/0x170 [ 55.006246] [] do_sys_open+0x116/0x1f0 [ 55.006635] [] SyS_open+0x1e/0x20 [ 55.007036] [] entry_SYSCALL_64_fastpath+0x1e/0xad [ 55.007434] Code: e1 03 49 d3 e8 e9 8f a3 a4 ff 48 8d 0a 48 83 e1 f8 4c 8b 01 8d 0a 83 e1 07 c1 e1 03 49 d3 e8 e9 4e a4 a4 ff 48 8d 0a 48 83 e1 f8 <4c> 8b 01 8d 0a 83 e1 07 c1 e1 03 49 d3 e8 e9 c9 a8 a4 ff b9 f2 [ 55.011146] RIP [] bad_gs+0xcff/0x1b89 [ 55.011584] RSP [ 55.011943] CR2: 8800d2b7f000
Re: Dcache oops
On Jun 3, 2016, at 4:07 PM, Al Viro wrote: > On Fri, Jun 03, 2016 at 02:35:41PM -0400, Oleg Drokin wrote: > [ 2642.364383] BUG: unable to handle kernel paging request at 880113f82000 [ 2642.365014] IP: [] bad_gs+0xd1d/0x1ba9 >>> >>> *ow* >>> Could you dump your vmlinux (and System.map) somewhere on anonftp? >>> This 'bad_gs' is there simply because it's one of the few labels in >>> .fixup - to say anything useful we'll need to find out where we'd >>> really come from. >> >> I see. >> vmlinux with debug symbols: http://knox.linuxhacker.ru/tmp/dcache/vmlinux.gz >> System.map: http://knox.linuxhacker.ru/tmp/dcache/System.map.gz > > OK... > 817f87cd: 48 8d 0alea(%rdx),%rcx > 817f87d0: 48 83 e1 f8 and > $0xfff8,%rcx > 817f87d4: 4c 8b 01mov(%rcx),%r8 > 817f87d7: 8d 0a lea(%rdx),%ecx > 817f87d9: 83 e1 07and$0x7,%ecx > 817f87dc: c1 e1 03shl$0x3,%ecx > 817f87df: 49 d3 e8shr%cl,%r8 > 817f87e2: e9 9b b3 a4 ff jmpq 81243b82 > <__d_lookup+0x132> > > Aha... It's load_unaligned_zeropad() from dentry_string_cmp(), hitting > a genuinely unmapped address. That sends it into fixup, where it tries to > load an aligned word containing the address in question, in hope that > fault was on attempt to cross into the next page. No such luck, address > was aligned in the first place (it's in %rdx - 0x880113f82000), so > we still oops. > > The unexpected part is that unmapped address did *NOT* come from a dentry; > it's .name of qstr we were looking for. And your call chain was > __d_lookup() <- d_lookup() <- lookup_open(), so in lookup_open() it was > nd->last.name... > > Can the same thing be reproduced (with NFS fix) on v4.6, ede4090, 7f427d3, > 4e8440b? Well, that was faster than I expected. 4e8440b triggers right away, so I guess there's no point in trying the later ones? BTW, just to confirm you are noticing - this is a DEBUG_PAGEALLOC build, so all freed memory is unmapped which is likely causing this oops - as a sign of use after free. [ 54.990119] BUG: unable to handle kernel paging request at 8800d2b7f000 [ 54.990423] IP: [] bad_gs+0xcff/0x1b89 [ 54.990598] PGD 2dca067 PUD 11f900067 PMD 11f86a067 PTE 8000d2b7f060 [ 54.990942] Oops: [#1] SMP DEBUG_PAGEALLOC [ 54.991044] Modules linked in: loop rpcsec_gss_krb5 joydev pcspkr i2c_piix4 acpi_cpufreq tpm_tis tpm nfsd drm_kms_helper ttm drm serio_raw virtio_blk [ 54.992301] CPU: 7 PID: 5550 Comm: file_concat.sh Not tainted 4.6.0-4e8440b+ #1 [ 54.993019] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 54.993407] task: 8800d6ec6200 ti: 880110f44000 task.ti: 880110f44000 [ 54.994112] RIP: 0010:[] [] bad_gs+0xcff/0x1b89 [ 54.994870] RSP: 0018:880110f47ba0 EFLAGS: 00010286 [ 54.995288] RAX: 8800d2760f30 RBX: 8800d2760f00 RCX: 8800d2b7f000 [ 54.995736] RDX: 8800d2b7f000 RSI: RDI: 0001 [ 54.996154] RBP: 880110f47be8 R08: R09: 8800d2760ed0 [ 54.996628] R10: 0059 R11: 8800d6ec6d78 R12: 8800d2694ed0 [ 54.997044] R13: 880110f47df0 R14: 8800d2760f50 R15: b761b37e [ 54.997450] FS: 7fe8804d2700() GS:88011f5c() knlGS: [ 54.998164] CS: 0010 DS: ES: CR0: 80050033 [ 54.998624] CR2: 8800d2b7f000 CR3: 00011931f000 CR4: 06e0 [ 54.999225] Stack: [ 54.999596] 81243965 0096 8800d2b7f000 0001 [ 55.000632] 042c 880110f47df0 8800d2694ed0 880110f47de0 [ 55.001497] 880110f47df0 880110f47c18 81243b94 81234b3e [ 55.002353] Call Trace: [ 55.002709] [] ? __d_lookup+0x5/0x1b0 [ 55.003104] [] d_lookup+0x84/0xb0 [ 55.003487] [] ? lookup_open+0xfe/0x7a0 [ 55.003878] [] lookup_open+0xfe/0x7a0 [ 55.004272] [] path_openat+0x4de/0xfc0 [ 55.004665] [] do_filp_open+0x7e/0xe0 [ 55.005061] [] ? lock_rename+0x100/0x100 [ 55.005452] [] ? _raw_spin_unlock+0x27/0x40 [ 55.005846] [] ? __alloc_fd+0xbc/0x170 [ 55.006246] [] do_sys_open+0x116/0x1f0 [ 55.006635] [] SyS_open+0x1e/0x20 [ 55.007036] [] entry_SYSCALL_64_fastpath+0x1e/0xad [ 55.007434] Code: e1 03 49 d3 e8 e9 8f a3 a4 ff 48 8d 0a 48 83 e1 f8 4c 8b 01 8d 0a 83 e1 07 c1 e1 03 49 d3 e8 e9 4e a4 a4 ff 48 8d 0a 48 83 e1 f8 <4c> 8b 01 8d 0a 83 e1 07 c1 e1 03 49 d3 e8 e9 c9 a8 a4 ff b9 f2 [ 55.011146] RIP [] bad_gs+0xcff/0x1b89 [ 55.011584] RSP [ 55.011943] CR2: 8800d2b7f000
Re: Dcache oops
On Fri, Jun 03, 2016 at 02:35:41PM -0400, Oleg Drokin wrote: > >> [ 2642.364383] BUG: unable to handle kernel paging request at > >> 880113f82000 > >> [ 2642.365014] IP: [] bad_gs+0xd1d/0x1ba9 > > > > *ow* > > Could you dump your vmlinux (and System.map) somewhere on anonftp? > > This 'bad_gs' is there simply because it's one of the few labels in > > .fixup - to say anything useful we'll need to find out where we'd > > really come from. > > I see. > vmlinux with debug symbols: http://knox.linuxhacker.ru/tmp/dcache/vmlinux.gz > System.map: http://knox.linuxhacker.ru/tmp/dcache/System.map.gz OK... 817f87cd: 48 8d 0alea(%rdx),%rcx 817f87d0: 48 83 e1 f8 and$0xfff8,%rcx 817f87d4: 4c 8b 01mov(%rcx),%r8 817f87d7: 8d 0a lea(%rdx),%ecx 817f87d9: 83 e1 07and$0x7,%ecx 817f87dc: c1 e1 03shl$0x3,%ecx 817f87df: 49 d3 e8shr%cl,%r8 817f87e2: e9 9b b3 a4 ff jmpq 81243b82 <__d_lookup+0x132> Aha... It's load_unaligned_zeropad() from dentry_string_cmp(), hitting a genuinely unmapped address. That sends it into fixup, where it tries to load an aligned word containing the address in question, in hope that fault was on attempt to cross into the next page. No such luck, address was aligned in the first place (it's in %rdx - 0x880113f82000), so we still oops. The unexpected part is that unmapped address did *NOT* come from a dentry; it's .name of qstr we were looking for. And your call chain was __d_lookup() <- d_lookup() <- lookup_open(), so in lookup_open() it was nd->last.name... Can the same thing be reproduced (with NFS fix) on v4.6, ede4090, 7f427d3, 4e8440b?
Re: Dcache oops
On Fri, Jun 03, 2016 at 02:35:41PM -0400, Oleg Drokin wrote: > >> [ 2642.364383] BUG: unable to handle kernel paging request at > >> 880113f82000 > >> [ 2642.365014] IP: [] bad_gs+0xd1d/0x1ba9 > > > > *ow* > > Could you dump your vmlinux (and System.map) somewhere on anonftp? > > This 'bad_gs' is there simply because it's one of the few labels in > > .fixup - to say anything useful we'll need to find out where we'd > > really come from. > > I see. > vmlinux with debug symbols: http://knox.linuxhacker.ru/tmp/dcache/vmlinux.gz > System.map: http://knox.linuxhacker.ru/tmp/dcache/System.map.gz OK... 817f87cd: 48 8d 0alea(%rdx),%rcx 817f87d0: 48 83 e1 f8 and$0xfff8,%rcx 817f87d4: 4c 8b 01mov(%rcx),%r8 817f87d7: 8d 0a lea(%rdx),%ecx 817f87d9: 83 e1 07and$0x7,%ecx 817f87dc: c1 e1 03shl$0x3,%ecx 817f87df: 49 d3 e8shr%cl,%r8 817f87e2: e9 9b b3 a4 ff jmpq 81243b82 <__d_lookup+0x132> Aha... It's load_unaligned_zeropad() from dentry_string_cmp(), hitting a genuinely unmapped address. That sends it into fixup, where it tries to load an aligned word containing the address in question, in hope that fault was on attempt to cross into the next page. No such luck, address was aligned in the first place (it's in %rdx - 0x880113f82000), so we still oops. The unexpected part is that unmapped address did *NOT* come from a dentry; it's .name of qstr we were looking for. And your call chain was __d_lookup() <- d_lookup() <- lookup_open(), so in lookup_open() it was nd->last.name... Can the same thing be reproduced (with NFS fix) on v4.6, ede4090, 7f427d3, 4e8440b?
Re: Dcache oops
On Jun 3, 2016, at 2:22 PM, Al Viro wrote: > On Fri, Jun 03, 2016 at 12:38:40PM -0400, Oleg Drokin wrote: >> I am dropping NFS people since it seems to be converting into a generic >> VFS/dcache bug even though you need NFS or the like to trigger it - the >> lookup_open path. > > NFS bug is real; there might very well be something else, but that d_drop() > in nfs_atomic_open() needs to be restored. Yes, that's what I mean - after the nfs fix, there seems to be another bug in dcache that I can now trigger once nfs bug is no longer blocking the way. >> [ 2642.364383] BUG: unable to handle kernel paging request at >> 880113f82000 >> [ 2642.365014] IP: [] bad_gs+0xd1d/0x1ba9 > > *ow* > Could you dump your vmlinux (and System.map) somewhere on anonftp? > This 'bad_gs' is there simply because it's one of the few labels in > .fixup - to say anything useful we'll need to find out where we'd > really come from. I see. vmlinux with debug symbols: http://knox.linuxhacker.ru/tmp/dcache/vmlinux.gz System.map: http://knox.linuxhacker.ru/tmp/dcache/System.map.gz >> Hm, somehow crashdumping support is broken for the newish kernels on my test >> box, I guess >> I'll try to fix it and then re-reproduce to better understand what's going >> on here, >> this trace is all I have for now in case anybody has any immediate ideas. > > PS: Oleg, fix your MUA, please - long lines in mail are bloody annoying. Huh, sorry about that. I kind of hoped the era of 80 columns text terminals was mostly behind us. Time to dust off mutt, I guess.
Re: Dcache oops
On Jun 3, 2016, at 2:22 PM, Al Viro wrote: > On Fri, Jun 03, 2016 at 12:38:40PM -0400, Oleg Drokin wrote: >> I am dropping NFS people since it seems to be converting into a generic >> VFS/dcache bug even though you need NFS or the like to trigger it - the >> lookup_open path. > > NFS bug is real; there might very well be something else, but that d_drop() > in nfs_atomic_open() needs to be restored. Yes, that's what I mean - after the nfs fix, there seems to be another bug in dcache that I can now trigger once nfs bug is no longer blocking the way. >> [ 2642.364383] BUG: unable to handle kernel paging request at >> 880113f82000 >> [ 2642.365014] IP: [] bad_gs+0xd1d/0x1ba9 > > *ow* > Could you dump your vmlinux (and System.map) somewhere on anonftp? > This 'bad_gs' is there simply because it's one of the few labels in > .fixup - to say anything useful we'll need to find out where we'd > really come from. I see. vmlinux with debug symbols: http://knox.linuxhacker.ru/tmp/dcache/vmlinux.gz System.map: http://knox.linuxhacker.ru/tmp/dcache/System.map.gz >> Hm, somehow crashdumping support is broken for the newish kernels on my test >> box, I guess >> I'll try to fix it and then re-reproduce to better understand what's going >> on here, >> this trace is all I have for now in case anybody has any immediate ideas. > > PS: Oleg, fix your MUA, please - long lines in mail are bloody annoying. Huh, sorry about that. I kind of hoped the era of 80 columns text terminals was mostly behind us. Time to dust off mutt, I guess.
Re: Dcache oops
On Fri, Jun 03, 2016 at 12:38:40PM -0400, Oleg Drokin wrote: > I am dropping NFS people since it seems to be converting into a generic > VFS/dcache bug even though you need NFS or the like to trigger it - the > lookup_open path. NFS bug is real; there might very well be something else, but that d_drop() in nfs_atomic_open() needs to be restored. > [ 2642.364383] BUG: unable to handle kernel paging request at 880113f82000 > [ 2642.365014] IP: [] bad_gs+0xd1d/0x1ba9 *ow* Could you dump your vmlinux (and System.map) somewhere on anonftp? This 'bad_gs' is there simply because it's one of the few labels in .fixup - to say anything useful we'll need to find out where we'd really come from. > [ 2642.369810] RIP: 0010:[] [] > bad_gs+0xd1d/0x1ba9 > [ 2642.370750] RSP: 0018:8800d7277ba0 EFLAGS: 00010286 > [ 2642.371239] RAX: 8800c3a6ff30 RBX: 8800c3a6ff00 RCX: > 880113f82000 > [ 2642.371765] RDX: 880113f82000 RSI: RDI: > 0002 > [ 2642.372286] RBP: 8800d7277be8 R08: R09: > 8800c3a6fed0 > [ 2642.372818] R10: 0059 R11: 8800d6956dd0 R12: > 880111567ed0 > [ 2642.373415] R13: 8800d7277df0 R14: 8800c3a6ff50 R15: > 84832a57 > [ 2642.373940] FS: 7fa1814a4700() GS:88011f48() > knlGS: > [ 2642.374877] CS: 0010 DS: ES: CR0: 80050033 > [ 2642.375378] CR2: 880113f82000 CR3: d605c000 CR4: > 06e0 > [ 2642.375881] Stack: > [ 2642.376295] 81243a55 0096 880113f82000 > 0002 > [ 2642.377204] bc04 8800d7277df0 880111567ed0 > 8800d7277de0 > [ 2642.378113] 8800d7277df0 8800d7277c18 81243c84 > 81236c4e > [ 2642.379022] Call Trace: > [ 2642.379451] [] ? __d_lookup+0x5/0x1b0 > [ 2642.379920] [] d_lookup+0x84/0xb0 > [ 2642.380388] [] ? lookup_open+0xfe/0x7a0 > [ 2642.380862] [] lookup_open+0xfe/0x7a0 > [ 2642.381374] [] path_openat+0x94f/0xfc0 > [ 2642.381852] [] do_filp_open+0x7e/0xe0 > [ 2642.382182] [] ? lock_rename+0x100/0x100 > [ 2642.382747] [] ? _raw_spin_unlock+0x27/0x40 > [ 2642.383324] [] ? __alloc_fd+0xbc/0x170 > [ 2642.383864] [] do_sys_open+0x116/0x1f0 > [ 2642.384230] [] SyS_open+0x1e/0x20 > [ 2642.384569] [] entry_SYSCALL_64_fastpath+0x1e/0xad > [ 2642.398718] Code: e1 03 49 d3 e8 e9 61 ae a4 ff 48 8d 0a 48 83 e1 f8 4c 8b > 01 8d 0a 83 e1 07 c1 e1 03 49 d3 e8 e9 20 af a4 ff 48 8d 0a 48 83 e1 f8 <4c> > 8b 01 8d 0a 83 e1 07 c1 e1 03 49 d3 e8 e9 9b b3 a4 ff b9 f2 > [ 2642.400892] RIP [] bad_gs+0xd1d/0x1ba9 > [ 2642.401417] RSP > [ 2642.401856] CR2: 880113f82000 > > Hm, somehow crashdumping support is broken for the newish kernels on my test > box, I guess > I'll try to fix it and then re-reproduce to better understand what's going on > here, > this trace is all I have for now in case anybody has any immediate ideas. PS: Oleg, fix your MUA, please - long lines in mail are bloody annoying.
Re: Dcache oops
On Fri, Jun 03, 2016 at 12:38:40PM -0400, Oleg Drokin wrote: > I am dropping NFS people since it seems to be converting into a generic > VFS/dcache bug even though you need NFS or the like to trigger it - the > lookup_open path. NFS bug is real; there might very well be something else, but that d_drop() in nfs_atomic_open() needs to be restored. > [ 2642.364383] BUG: unable to handle kernel paging request at 880113f82000 > [ 2642.365014] IP: [] bad_gs+0xd1d/0x1ba9 *ow* Could you dump your vmlinux (and System.map) somewhere on anonftp? This 'bad_gs' is there simply because it's one of the few labels in .fixup - to say anything useful we'll need to find out where we'd really come from. > [ 2642.369810] RIP: 0010:[] [] > bad_gs+0xd1d/0x1ba9 > [ 2642.370750] RSP: 0018:8800d7277ba0 EFLAGS: 00010286 > [ 2642.371239] RAX: 8800c3a6ff30 RBX: 8800c3a6ff00 RCX: > 880113f82000 > [ 2642.371765] RDX: 880113f82000 RSI: RDI: > 0002 > [ 2642.372286] RBP: 8800d7277be8 R08: R09: > 8800c3a6fed0 > [ 2642.372818] R10: 0059 R11: 8800d6956dd0 R12: > 880111567ed0 > [ 2642.373415] R13: 8800d7277df0 R14: 8800c3a6ff50 R15: > 84832a57 > [ 2642.373940] FS: 7fa1814a4700() GS:88011f48() > knlGS: > [ 2642.374877] CS: 0010 DS: ES: CR0: 80050033 > [ 2642.375378] CR2: 880113f82000 CR3: d605c000 CR4: > 06e0 > [ 2642.375881] Stack: > [ 2642.376295] 81243a55 0096 880113f82000 > 0002 > [ 2642.377204] bc04 8800d7277df0 880111567ed0 > 8800d7277de0 > [ 2642.378113] 8800d7277df0 8800d7277c18 81243c84 > 81236c4e > [ 2642.379022] Call Trace: > [ 2642.379451] [] ? __d_lookup+0x5/0x1b0 > [ 2642.379920] [] d_lookup+0x84/0xb0 > [ 2642.380388] [] ? lookup_open+0xfe/0x7a0 > [ 2642.380862] [] lookup_open+0xfe/0x7a0 > [ 2642.381374] [] path_openat+0x94f/0xfc0 > [ 2642.381852] [] do_filp_open+0x7e/0xe0 > [ 2642.382182] [] ? lock_rename+0x100/0x100 > [ 2642.382747] [] ? _raw_spin_unlock+0x27/0x40 > [ 2642.383324] [] ? __alloc_fd+0xbc/0x170 > [ 2642.383864] [] do_sys_open+0x116/0x1f0 > [ 2642.384230] [] SyS_open+0x1e/0x20 > [ 2642.384569] [] entry_SYSCALL_64_fastpath+0x1e/0xad > [ 2642.398718] Code: e1 03 49 d3 e8 e9 61 ae a4 ff 48 8d 0a 48 83 e1 f8 4c 8b > 01 8d 0a 83 e1 07 c1 e1 03 49 d3 e8 e9 20 af a4 ff 48 8d 0a 48 83 e1 f8 <4c> > 8b 01 8d 0a 83 e1 07 c1 e1 03 49 d3 e8 e9 9b b3 a4 ff b9 f2 > [ 2642.400892] RIP [] bad_gs+0xd1d/0x1ba9 > [ 2642.401417] RSP > [ 2642.401856] CR2: 880113f82000 > > Hm, somehow crashdumping support is broken for the newish kernels on my test > box, I guess > I'll try to fix it and then re-reproduce to better understand what's going on > here, > this trace is all I have for now in case anybody has any immediate ideas. PS: Oleg, fix your MUA, please - long lines in mail are bloody annoying.
Dcache oops
I am dropping NFS people since it seems to be converting into a generic VFS/dcache bug even though you need NFS or the like to trigger it - the lookup_open path. On Jun 3, 2016, at 12:26 AM, Al Viro wrote: > Looks like the right thing to do would be to do d_drop() at no_open:, > just before we call nfs_lookup(). If not earlier, actually... How > about the following? So with the below patch applied I get the same crash as with the patch from me that allowed unhashed dentries to come through. Same workload, but this time it takes much longer to trigger, so must be some race. [ 2642.364383] BUG: unable to handle kernel paging request at 880113f82000 [ 2642.365014] IP: [] bad_gs+0xd1d/0x1ba9 [ 2642.365590] PGD 2dca067 PUD 2dcd067 PMD 11ff61067 PTE 800113f82060 [ 2642.366200] Oops: [#1] SMP DEBUG_PAGEALLOC [ 2642.366719] Modules linked in: loop rpcsec_gss_krb5 acpi_cpufreq tpm_tis joydev i2c_piix4 pcspkr tpm nfsd drm_kms_helper ttm drm virtio_blk serio_raw [ 2642.367852] CPU: 2 PID: 18446 Comm: cat Not tainted 4.7.0-rc1-vm-nfs+ #103 [ 2642.368372] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 2642.368873] task: 8800d6956280 ti: 8800d7274000 task.ti: 8800d7274000 [ 2642.369810] RIP: 0010:[] [] bad_gs+0xd1d/0x1ba9 [ 2642.370750] RSP: 0018:8800d7277ba0 EFLAGS: 00010286 [ 2642.371239] RAX: 8800c3a6ff30 RBX: 8800c3a6ff00 RCX: 880113f82000 [ 2642.371765] RDX: 880113f82000 RSI: RDI: 0002 [ 2642.372286] RBP: 8800d7277be8 R08: R09: 8800c3a6fed0 [ 2642.372818] R10: 0059 R11: 8800d6956dd0 R12: 880111567ed0 [ 2642.373415] R13: 8800d7277df0 R14: 8800c3a6ff50 R15: 84832a57 [ 2642.373940] FS: 7fa1814a4700() GS:88011f48() knlGS: [ 2642.374877] CS: 0010 DS: ES: CR0: 80050033 [ 2642.375378] CR2: 880113f82000 CR3: d605c000 CR4: 06e0 [ 2642.375881] Stack: [ 2642.376295] 81243a55 0096 880113f82000 0002 [ 2642.377204] bc04 8800d7277df0 880111567ed0 8800d7277de0 [ 2642.378113] 8800d7277df0 8800d7277c18 81243c84 81236c4e [ 2642.379022] Call Trace: [ 2642.379451] [] ? __d_lookup+0x5/0x1b0 [ 2642.379920] [] d_lookup+0x84/0xb0 [ 2642.380388] [] ? lookup_open+0xfe/0x7a0 [ 2642.380862] [] lookup_open+0xfe/0x7a0 [ 2642.381374] [] path_openat+0x94f/0xfc0 [ 2642.381852] [] do_filp_open+0x7e/0xe0 [ 2642.382182] [] ? lock_rename+0x100/0x100 [ 2642.382747] [] ? _raw_spin_unlock+0x27/0x40 [ 2642.383324] [] ? __alloc_fd+0xbc/0x170 [ 2642.383864] [] do_sys_open+0x116/0x1f0 [ 2642.384230] [] SyS_open+0x1e/0x20 [ 2642.384569] [] entry_SYSCALL_64_fastpath+0x1e/0xad [ 2642.398718] Code: e1 03 49 d3 e8 e9 61 ae a4 ff 48 8d 0a 48 83 e1 f8 4c 8b 01 8d 0a 83 e1 07 c1 e1 03 49 d3 e8 e9 20 af a4 ff 48 8d 0a 48 83 e1 f8 <4c> 8b 01 8d 0a 83 e1 07 c1 e1 03 49 d3 e8 e9 9b b3 a4 ff b9 f2 [ 2642.400892] RIP [] bad_gs+0xd1d/0x1ba9 [ 2642.401417] RSP [ 2642.401856] CR2: 880113f82000 Hm, somehow crashdumping support is broken for the newish kernels on my test box, I guess I'll try to fix it and then re-reproduce to better understand what's going on here, this trace is all I have for now in case anybody has any immediate ideas. > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c > index aaf7bd0..6e3a6f4 100644 > --- a/fs/nfs/dir.c > +++ b/fs/nfs/dir.c > @@ -1536,9 +1536,9 @@ int nfs_atomic_open(struct inode *dir, struct dentry > *dentry, > err = PTR_ERR(inode); > trace_nfs_atomic_open_exit(dir, ctx, open_flags, err); > put_nfs_open_context(ctx); > + d_drop(dentry); > switch (err) { > case -ENOENT: > - d_drop(dentry); > d_add(dentry, NULL); > nfs_set_verifier(dentry, > nfs_save_change_attribute(dir)); > break;
Dcache oops
I am dropping NFS people since it seems to be converting into a generic VFS/dcache bug even though you need NFS or the like to trigger it - the lookup_open path. On Jun 3, 2016, at 12:26 AM, Al Viro wrote: > Looks like the right thing to do would be to do d_drop() at no_open:, > just before we call nfs_lookup(). If not earlier, actually... How > about the following? So with the below patch applied I get the same crash as with the patch from me that allowed unhashed dentries to come through. Same workload, but this time it takes much longer to trigger, so must be some race. [ 2642.364383] BUG: unable to handle kernel paging request at 880113f82000 [ 2642.365014] IP: [] bad_gs+0xd1d/0x1ba9 [ 2642.365590] PGD 2dca067 PUD 2dcd067 PMD 11ff61067 PTE 800113f82060 [ 2642.366200] Oops: [#1] SMP DEBUG_PAGEALLOC [ 2642.366719] Modules linked in: loop rpcsec_gss_krb5 acpi_cpufreq tpm_tis joydev i2c_piix4 pcspkr tpm nfsd drm_kms_helper ttm drm virtio_blk serio_raw [ 2642.367852] CPU: 2 PID: 18446 Comm: cat Not tainted 4.7.0-rc1-vm-nfs+ #103 [ 2642.368372] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 2642.368873] task: 8800d6956280 ti: 8800d7274000 task.ti: 8800d7274000 [ 2642.369810] RIP: 0010:[] [] bad_gs+0xd1d/0x1ba9 [ 2642.370750] RSP: 0018:8800d7277ba0 EFLAGS: 00010286 [ 2642.371239] RAX: 8800c3a6ff30 RBX: 8800c3a6ff00 RCX: 880113f82000 [ 2642.371765] RDX: 880113f82000 RSI: RDI: 0002 [ 2642.372286] RBP: 8800d7277be8 R08: R09: 8800c3a6fed0 [ 2642.372818] R10: 0059 R11: 8800d6956dd0 R12: 880111567ed0 [ 2642.373415] R13: 8800d7277df0 R14: 8800c3a6ff50 R15: 84832a57 [ 2642.373940] FS: 7fa1814a4700() GS:88011f48() knlGS: [ 2642.374877] CS: 0010 DS: ES: CR0: 80050033 [ 2642.375378] CR2: 880113f82000 CR3: d605c000 CR4: 06e0 [ 2642.375881] Stack: [ 2642.376295] 81243a55 0096 880113f82000 0002 [ 2642.377204] bc04 8800d7277df0 880111567ed0 8800d7277de0 [ 2642.378113] 8800d7277df0 8800d7277c18 81243c84 81236c4e [ 2642.379022] Call Trace: [ 2642.379451] [] ? __d_lookup+0x5/0x1b0 [ 2642.379920] [] d_lookup+0x84/0xb0 [ 2642.380388] [] ? lookup_open+0xfe/0x7a0 [ 2642.380862] [] lookup_open+0xfe/0x7a0 [ 2642.381374] [] path_openat+0x94f/0xfc0 [ 2642.381852] [] do_filp_open+0x7e/0xe0 [ 2642.382182] [] ? lock_rename+0x100/0x100 [ 2642.382747] [] ? _raw_spin_unlock+0x27/0x40 [ 2642.383324] [] ? __alloc_fd+0xbc/0x170 [ 2642.383864] [] do_sys_open+0x116/0x1f0 [ 2642.384230] [] SyS_open+0x1e/0x20 [ 2642.384569] [] entry_SYSCALL_64_fastpath+0x1e/0xad [ 2642.398718] Code: e1 03 49 d3 e8 e9 61 ae a4 ff 48 8d 0a 48 83 e1 f8 4c 8b 01 8d 0a 83 e1 07 c1 e1 03 49 d3 e8 e9 20 af a4 ff 48 8d 0a 48 83 e1 f8 <4c> 8b 01 8d 0a 83 e1 07 c1 e1 03 49 d3 e8 e9 9b b3 a4 ff b9 f2 [ 2642.400892] RIP [] bad_gs+0xd1d/0x1ba9 [ 2642.401417] RSP [ 2642.401856] CR2: 880113f82000 Hm, somehow crashdumping support is broken for the newish kernels on my test box, I guess I'll try to fix it and then re-reproduce to better understand what's going on here, this trace is all I have for now in case anybody has any immediate ideas. > > diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c > index aaf7bd0..6e3a6f4 100644 > --- a/fs/nfs/dir.c > +++ b/fs/nfs/dir.c > @@ -1536,9 +1536,9 @@ int nfs_atomic_open(struct inode *dir, struct dentry > *dentry, > err = PTR_ERR(inode); > trace_nfs_atomic_open_exit(dir, ctx, open_flags, err); > put_nfs_open_context(ctx); > + d_drop(dentry); > switch (err) { > case -ENOENT: > - d_drop(dentry); > d_add(dentry, NULL); > nfs_set_verifier(dentry, > nfs_save_change_attribute(dir)); > break;