Re: [bisected] Stack overflow after fs: "switch the IO-triggering parts of umount to fs_pin" (was net namespaces kernel stack overflow)
Hi, On Thu, Apr 19, 2018 at 12:56 PM, Kirill Tkhaiwrote: > On 19.04.2018 19:44, Al Viro wrote: >> On Thu, Apr 19, 2018 at 04:34:48PM +0100, Al Viro wrote: >> >>> IOW, we only get there if our vfsmount was an MNT_INTERNAL one. >>> So we have mnt->mnt_umount of some MNT_INTERNAL mount found in >>> ->mnt_pins of some other mount. Which, AFAICS, means that >>> it used to be mounted on that other mount. How the hell can >>> that happen? >>> >>> It looks like you somehow get a long chain of MNT_INTERNAL mounts >>> stacked on top of each other, which ought to be prevented by >>> mnt_flags &= ~MNT_INTERNAL_FLAGS; >>> in do_add_mount(). Nuts... >> >> Argh... Nuts is right - clone_mnt() preserves the sodding >> MNT_INTERNAL, with obvious results. >> >> netns is related to the problem, by exposing MNT_INTERNAL mounts >> (in /proc/*/ns/*) for mount --bind to copy and attach to the >> tree. AFAICS, the minimal reproducer is >> >> touch /tmp/a >> unshare -m sh -c 'for i in `seq 1`; do mount --bind /proc/1/ns/net >> /tmp/a; done' >> >> (and it can be anything in /proc/*/ns/*, really) >> >> I think the fix should be along the lines of the following: >> >> Don't leak MNT_INTERNAL away from internal mounts >> >> We want it only for the stuff created by SB_KERNMOUNT mounts, *not* for >> their copies. >> >> Cc: sta...@kernel.org >> Signed-off-by: Al Viro > > Flawless victory! Thanks. > Thanks to all. Also thanks to Kirill for helping me here and doing the main part by bisecting this issue. Finally, my testing stuff which produced this bug also works well now. Tested-by: Alexander Aring - Alex
Re: [bisected] Stack overflow after fs: "switch the IO-triggering parts of umount to fs_pin" (was net namespaces kernel stack overflow)
On 19.04.2018 19:44, Al Viro wrote: > On Thu, Apr 19, 2018 at 04:34:48PM +0100, Al Viro wrote: > >> IOW, we only get there if our vfsmount was an MNT_INTERNAL one. >> So we have mnt->mnt_umount of some MNT_INTERNAL mount found in >> ->mnt_pins of some other mount. Which, AFAICS, means that >> it used to be mounted on that other mount. How the hell can >> that happen? >> >> It looks like you somehow get a long chain of MNT_INTERNAL mounts >> stacked on top of each other, which ought to be prevented by >> mnt_flags &= ~MNT_INTERNAL_FLAGS; >> in do_add_mount(). Nuts... > > Argh... Nuts is right - clone_mnt() preserves the sodding > MNT_INTERNAL, with obvious results. > > netns is related to the problem, by exposing MNT_INTERNAL mounts > (in /proc/*/ns/*) for mount --bind to copy and attach to the > tree. AFAICS, the minimal reproducer is > > touch /tmp/a > unshare -m sh -c 'for i in `seq 1`; do mount --bind /proc/1/ns/net > /tmp/a; done' > > (and it can be anything in /proc/*/ns/*, really) > > I think the fix should be along the lines of the following: > > Don't leak MNT_INTERNAL away from internal mounts > > We want it only for the stuff created by SB_KERNMOUNT mounts, *not* for > their copies. > > Cc: sta...@kernel.org > Signed-off-by: Al ViroFlawless victory! Thanks. Tested-by: Kirill Tkhai > --- > diff --git a/fs/namespace.c b/fs/namespace.c > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -1089,7 +1089,8 @@ static struct mount *clone_mnt(struct mount *old, > struct dentry *root, > goto out_free; > } > > - mnt->mnt.mnt_flags = old->mnt.mnt_flags & ~(MNT_WRITE_HOLD|MNT_MARKED); > + mnt->mnt.mnt_flags = old->mnt.mnt_flags; > + mnt->mnt.mnt_flags &= ~(MNT_WRITE_HOLD|MNT_MARKED|MNT_INTERNAL); > /* Don't allow unprivileged users to change mount flags */ > if (flag & CL_UNPRIVILEGED) { > mnt->mnt.mnt_flags |= MNT_LOCK_ATIME; >
Re: [bisected] Stack overflow after fs: "switch the IO-triggering parts of umount to fs_pin" (was net namespaces kernel stack overflow)
On Thu, Apr 19, 2018 at 04:34:48PM +0100, Al Viro wrote: > IOW, we only get there if our vfsmount was an MNT_INTERNAL one. > So we have mnt->mnt_umount of some MNT_INTERNAL mount found in > ->mnt_pins of some other mount. Which, AFAICS, means that > it used to be mounted on that other mount. How the hell can > that happen? > > It looks like you somehow get a long chain of MNT_INTERNAL mounts > stacked on top of each other, which ought to be prevented by > mnt_flags &= ~MNT_INTERNAL_FLAGS; > in do_add_mount(). Nuts... Argh... Nuts is right - clone_mnt() preserves the sodding MNT_INTERNAL, with obvious results. netns is related to the problem, by exposing MNT_INTERNAL mounts (in /proc/*/ns/*) for mount --bind to copy and attach to the tree. AFAICS, the minimal reproducer is touch /tmp/a unshare -m sh -c 'for i in `seq 1`; do mount --bind /proc/1/ns/net /tmp/a; done' (and it can be anything in /proc/*/ns/*, really) I think the fix should be along the lines of the following: Don't leak MNT_INTERNAL away from internal mounts We want it only for the stuff created by SB_KERNMOUNT mounts, *not* for their copies. Cc: sta...@kernel.org Signed-off-by: Al Viro--- diff --git a/fs/namespace.c b/fs/namespace.c --- a/fs/namespace.c +++ b/fs/namespace.c @@ -1089,7 +1089,8 @@ static struct mount *clone_mnt(struct mount *old, struct dentry *root, goto out_free; } - mnt->mnt.mnt_flags = old->mnt.mnt_flags & ~(MNT_WRITE_HOLD|MNT_MARKED); + mnt->mnt.mnt_flags = old->mnt.mnt_flags; + mnt->mnt.mnt_flags &= ~(MNT_WRITE_HOLD|MNT_MARKED|MNT_INTERNAL); /* Don't allow unprivileged users to change mount flags */ if (flag & CL_UNPRIVILEGED) { mnt->mnt.mnt_flags |= MNT_LOCK_ATIME;
Re: [bisected] Stack overflow after fs: "switch the IO-triggering parts of umount to fs_pin" (was net namespaces kernel stack overflow)
On Thu, Apr 19, 2018 at 03:50:25PM +0300, Kirill Tkhai wrote: > Hi, Al, > > commit 87b95ce0964c016ede92763be9c164e49f1019e9 is the first after which the > below test crashes the kernel: > > Author: Al Viro> Date: Sat Jan 10 19:01:08 2015 -0500 > > switch the IO-triggering parts of umount to fs_pin > > Signed-off-by: Al Viro > > $modprobe dummy > > $while true > do > mkdir /var/run/netns > touch /var/run/netns/init_net > mount --bind /proc/1/ns/net /var/run/netns/init_net > > ip netns add foo > ip netns exec foo ip link add dummy0 type dummy > ip netns delete foo > done I can reproduce that, all right, and with a stack chain that looks like this: [77132.414912] pin_kill+0x81/0x150 [77132.424362] ? finish_wait+0x80/0x80 [77132.433917] mnt_pin_kill+0x1e/0x30 [77132.443829] cleanup_mnt+0x6b/0x70 [77132.453477] pin_kill+0x81/0x150 [77132.463064] ? finish_wait+0x80/0x80 [77132.472553] group_pin_kill+0x1a/0x30 [77132.481973] namespace_unlock+0x6f/0x80 [77132.491801] put_mnt_ns+0x1d/0x30 [77132.501258] free_nsproxy+0x17/0x90 [77132.510604] do_exit+0x2dc/0xb40 [77132.520146] ? handle_mm_fault+0xaa/0x1e0 [77132.529725] do_group_exit+0x3a/0xa0 [77132.539506] SyS_exit_group+0x10/0x10 with the top 4 entries repeated a lot. Those cleanup_mnt() could be called from __cleanup_mnt(), delayed_mntput() or mntput_no_expire(). __cleanup_mnt() is only fed to task_work_add(); no way in hell would you get the call stack similar to that; it would be called by task_work_run() from exit_task_work() from do_exit(). Not in the evidence. delayed_mntput() is only fed to schedule_delayed_work(); again, not a chance of having the call chain like that. The one in mntput_no_expire() is a tail-call, with mntput_no_expire() called from umount(2) and mntput() (tail-calls both of them). The former is never called from exit(2), so that call chain reads pin_kill -> mntput or something tail-calling mntput -> mntput_no_expire -> cleanup_mnt -> mnt_pin_kill -> pin_kill Now, the thing called by pin_kill must be something passed to init_fs_pin(), i.e. acct_pin_kill() or drop_mountpoint(). acct_pin_kill() ends with pin_remove(pin); acct_put(acct); } with static void acct_put(struct bsd_acct_struct *p) { if (atomic_long_dec_and_test(>count)) kfree_rcu(p, rcu); } IOW, no tail-call of mntput() in there. OTOH, static void drop_mountpoint(struct fs_pin *p) { struct mount *m = container_of(p, struct mount, mnt_umount); dput(m->mnt_ex_mountpoint); pin_remove(p); mntput(>mnt); } *does* have the tail-call, so this call chain must be pin_kill -> drop_mountpoint -> mntput -> mntput_no_expire -> cleanup_mnt -> mnt_pin_kill -> pin_kill So far, so good, but if you look into mntput_no_expire() you see if (likely(!(mnt->mnt.mnt_flags & MNT_INTERNAL))) { struct task_struct *task = current; if (likely(!(task->flags & PF_KTHREAD))) { init_task_work(>mnt_rcu, __cleanup_mnt); if (!task_work_add(task, >mnt_rcu, true)) return; } if (llist_add(>mnt_llist, _mntput_list)) schedule_delayed_work(_mntput_work, 1); return; } cleanup_mnt(mnt); IOW, we only get there if our vfsmount was an MNT_INTERNAL one. So we have mnt->mnt_umount of some MNT_INTERNAL mount found in ->mnt_pins of some other mount. Which, AFAICS, means that it used to be mounted on that other mount. How the hell can that happen? It looks like you somehow get a long chain of MNT_INTERNAL mounts stacked on top of each other, which ought to be prevented by mnt_flags &= ~MNT_INTERNAL_FLAGS; in do_add_mount(). Nuts...