Re: [RFC][PATCH v2] mount: In mark_umount_candidates and __propogate_umount visit each mount once
On Fri, Oct 14, 2016 at 01:29:18PM -0500, Eric W. Biederman wrote: > > Adrei Vagin pointed out that time to executue propagate_umount can go > non-linear (and take a ludicrious amount of time) when the mount > propogation trees of the mounts to be unmunted by a lazy unmount > overlap. > > Solve this in the most straight forward way possible, by adding a new > mount flag to mark parts of the mount propagation tree that have been > visited, and use that mark to skip parts of the mount propagation tree > that have already been visited during an unmount. This guarantees > that each mountpoint in the possibly overlapping mount propagation > trees will be visited exactly once. > > Add the functions propagation_visit_next and propagation_revisit_next > to coordinate setting and clearling the visited mount mark. > > The skipping of already unmounted mounts has been moved from > __lookup_mnt_last to mark_umount_candidates, so that the new > propagation functions can notice record when the propagation tree > passes through the initial set of unmounted mounts. Except in > umount_tree as part of the unmounting process the only place where > unmounted mounts should be found are in unmounted subtrees. All of > the other callers of __lookup_mnt_last are from mounted subtrees so > the not checking for unmounted mounts should not affect them. > > Here is a script to generate such mount tree: > $ cat run.sh > mount -t tmpfs test-mount /mnt > mount --make-shared /mnt > for i in `seq $1`; do > mkdir /mnt/test.$i > mount --bind /mnt /mnt/test.$i > done > cat /proc/mounts | grep test-mount | wc -l > time umount -l /mnt > $ for i in `seq 10 16`; do echo $i; unshare -Urm bash ./run.sh $i; done > > Here are the performance numbers with and without the patch: > > mhash | 8192 | 8192 | 8192 | 131072 | 131072 | 104857 | > 104857 > mounts | before | after | after (sys) | after | after (sys) | after | > after (sys) > - > 1024 | 0.071s | 0.023s | 0.008s | 0.026s | 0.000s | 0.024s | > 0.008s > 2048 | 0.184s | 0.030s | 0.012s | 0.035s | 0.008s | 0.030s | > 0.012s > 4096 | 0.604s | 0.047s | 0.012s | 0.042s | 0.016s | 0.032s | > 0.016s > 8912 | 4.471s | 0.085s | 0.020s | 0.059s | 0.059s | 0.050s | > 0.036s > 16384 | 34.826s | 0.105s | 0.092s | 0.109s | 0.060s | 0.087s | > 0.068s > 32768 | | 0.245s | 0.168s | 0.192s | 0.144s | 0.167s | > 0.156s > 65536 | | 0.833s | 0.716s | 0.485s | 0.276s | 0.468s | > 0.316s > 131072 | | 4.628s | 4.108s | 0.758s | 0.632s | 0.736s | > 0.612s > > Andrei Vagin reports fixing this performance problem is part of the > work to fix CVE-2016-6213. > > Cc: sta...@vger.kernel.org > Reported-by: Andrei Vagin> Signed-off-by: "Eric W. Biederman" > --- > > I think this version is very close. I had to modify __lookup_mnt_last > to not skip MOUNT_UMOUNT or we would never see when the mount > propagation trees intersected. > > This doesn't look as good as the previous buggy version but it looks > good. When the hash table isn't getting full the times look pretty > linear. So it may be necessary to do some hash table resizing. > > That said there remains one issue I need to think about some more. > > In mark_umount_candidates I don't mark mounts that are locked to their > parent and their parent is not marked as a umount candidate. Given that > we skip processing mounts multiple times this might result in a mount > whose parent gets marked as unmountable after the first time we see a > mount not getting marked as unmountable later. > > Anyway Andrei if you could check this out and see if you can see > anything I missed please let me know. I've tested this patch today and it works to me. The idea of this patch looks good for me too. Thanks! There is one inline comment. > > Eric > > fs/namespace.c| 6 +-- > fs/pnode.c| 147 > -- > fs/pnode.h| 4 ++ > include/linux/mount.h | 2 + > 4 files changed, 138 insertions(+), 21 deletions(-) > > diff --git a/fs/namespace.c b/fs/namespace.c > index db1b5a38864e..1ca99fa2e0f4 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -650,13 +650,11 @@ struct mount *__lookup_mnt_last(struct vfsmount *mnt, > struct dentry *dentry) > p = __lookup_mnt(mnt, dentry); > if (!p) > goto out; > - if (!(p->mnt.mnt_flags & MNT_UMOUNT)) > - res = p; > + res = p; > hlist_for_each_entry_continue(p, mnt_hash) { > if (>mnt_parent->mnt != mnt || p->mnt_mountpoint != dentry) > break; > - if (!(p->mnt.mnt_flags & MNT_UMOUNT)) > - res = p; __lookup_mnt_last is used in
Re: [RFC][PATCH v2] mount: In mark_umount_candidates and __propogate_umount visit each mount once
On Fri, Oct 14, 2016 at 01:29:18PM -0500, Eric W. Biederman wrote: > > Adrei Vagin pointed out that time to executue propagate_umount can go > non-linear (and take a ludicrious amount of time) when the mount > propogation trees of the mounts to be unmunted by a lazy unmount > overlap. > > Solve this in the most straight forward way possible, by adding a new > mount flag to mark parts of the mount propagation tree that have been > visited, and use that mark to skip parts of the mount propagation tree > that have already been visited during an unmount. This guarantees > that each mountpoint in the possibly overlapping mount propagation > trees will be visited exactly once. > > Add the functions propagation_visit_next and propagation_revisit_next > to coordinate setting and clearling the visited mount mark. > > The skipping of already unmounted mounts has been moved from > __lookup_mnt_last to mark_umount_candidates, so that the new > propagation functions can notice record when the propagation tree > passes through the initial set of unmounted mounts. Except in > umount_tree as part of the unmounting process the only place where > unmounted mounts should be found are in unmounted subtrees. All of > the other callers of __lookup_mnt_last are from mounted subtrees so > the not checking for unmounted mounts should not affect them. > > Here is a script to generate such mount tree: > $ cat run.sh > mount -t tmpfs test-mount /mnt > mount --make-shared /mnt > for i in `seq $1`; do > mkdir /mnt/test.$i > mount --bind /mnt /mnt/test.$i > done > cat /proc/mounts | grep test-mount | wc -l > time umount -l /mnt > $ for i in `seq 10 16`; do echo $i; unshare -Urm bash ./run.sh $i; done > > Here are the performance numbers with and without the patch: > > mhash | 8192 | 8192 | 8192 | 131072 | 131072 | 104857 | > 104857 > mounts | before | after | after (sys) | after | after (sys) | after | > after (sys) > - > 1024 | 0.071s | 0.023s | 0.008s | 0.026s | 0.000s | 0.024s | > 0.008s > 2048 | 0.184s | 0.030s | 0.012s | 0.035s | 0.008s | 0.030s | > 0.012s > 4096 | 0.604s | 0.047s | 0.012s | 0.042s | 0.016s | 0.032s | > 0.016s > 8912 | 4.471s | 0.085s | 0.020s | 0.059s | 0.059s | 0.050s | > 0.036s > 16384 | 34.826s | 0.105s | 0.092s | 0.109s | 0.060s | 0.087s | > 0.068s > 32768 | | 0.245s | 0.168s | 0.192s | 0.144s | 0.167s | > 0.156s > 65536 | | 0.833s | 0.716s | 0.485s | 0.276s | 0.468s | > 0.316s > 131072 | | 4.628s | 4.108s | 0.758s | 0.632s | 0.736s | > 0.612s > > Andrei Vagin reports fixing this performance problem is part of the > work to fix CVE-2016-6213. > > Cc: sta...@vger.kernel.org > Reported-by: Andrei Vagin > Signed-off-by: "Eric W. Biederman" > --- > > I think this version is very close. I had to modify __lookup_mnt_last > to not skip MOUNT_UMOUNT or we would never see when the mount > propagation trees intersected. > > This doesn't look as good as the previous buggy version but it looks > good. When the hash table isn't getting full the times look pretty > linear. So it may be necessary to do some hash table resizing. > > That said there remains one issue I need to think about some more. > > In mark_umount_candidates I don't mark mounts that are locked to their > parent and their parent is not marked as a umount candidate. Given that > we skip processing mounts multiple times this might result in a mount > whose parent gets marked as unmountable after the first time we see a > mount not getting marked as unmountable later. > > Anyway Andrei if you could check this out and see if you can see > anything I missed please let me know. I've tested this patch today and it works to me. The idea of this patch looks good for me too. Thanks! There is one inline comment. > > Eric > > fs/namespace.c| 6 +-- > fs/pnode.c| 147 > -- > fs/pnode.h| 4 ++ > include/linux/mount.h | 2 + > 4 files changed, 138 insertions(+), 21 deletions(-) > > diff --git a/fs/namespace.c b/fs/namespace.c > index db1b5a38864e..1ca99fa2e0f4 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -650,13 +650,11 @@ struct mount *__lookup_mnt_last(struct vfsmount *mnt, > struct dentry *dentry) > p = __lookup_mnt(mnt, dentry); > if (!p) > goto out; > - if (!(p->mnt.mnt_flags & MNT_UMOUNT)) > - res = p; > + res = p; > hlist_for_each_entry_continue(p, mnt_hash) { > if (>mnt_parent->mnt != mnt || p->mnt_mountpoint != dentry) > break; > - if (!(p->mnt.mnt_flags & MNT_UMOUNT)) > - res = p; __lookup_mnt_last is used in propagate_mount_busy and attach_recursive_mnt. Should we do
Re: [RFC][PATCH v2] mount: In mark_umount_candidates and __propogate_umount visit each mount once
Andrei Vaginwrites: > On Fri, Oct 14, 2016 at 01:29:18PM -0500, Eric W. Biederman wrote: >> >> Adrei Vagin pointed out that time to executue propagate_umount can go >> non-linear (and take a ludicrious amount of time) when the mount >> propogation trees of the mounts to be unmunted by a lazy unmount >> overlap. >> >> Solve this in the most straight forward way possible, by adding a new >> mount flag to mark parts of the mount propagation tree that have been >> visited, and use that mark to skip parts of the mount propagation tree >> that have already been visited during an unmount. This guarantees >> that each mountpoint in the possibly overlapping mount propagation >> trees will be visited exactly once. >> >> Add the functions propagation_visit_next and propagation_revisit_next >> to coordinate setting and clearling the visited mount mark. >> >> The skipping of already unmounted mounts has been moved from >> __lookup_mnt_last to mark_umount_candidates, so that the new >> propagation functions can notice record when the propagation tree >> passes through the initial set of unmounted mounts. Except in >> umount_tree as part of the unmounting process the only place where >> unmounted mounts should be found are in unmounted subtrees. All of >> the other callers of __lookup_mnt_last are from mounted subtrees so >> the not checking for unmounted mounts should not affect them. >> >> Here is a script to generate such mount tree: >> $ cat run.sh >> mount -t tmpfs test-mount /mnt >> mount --make-shared /mnt >> for i in `seq $1`; do >> mkdir /mnt/test.$i >> mount --bind /mnt /mnt/test.$i >> done >> cat /proc/mounts | grep test-mount | wc -l >> time umount -l /mnt >> $ for i in `seq 10 16`; do echo $i; unshare -Urm bash ./run.sh $i; done >> >> Here are the performance numbers with and without the patch: >> >> mhash | 8192 | 8192 | 8192 | 131072 | 131072 | 104857 | >> 104857 >> mounts | before | after | after (sys) | after | after (sys) | after | >> after (sys) >> - >> 1024 | 0.071s | 0.023s | 0.008s | 0.026s | 0.000s | 0.024s | >> 0.008s >> 2048 | 0.184s | 0.030s | 0.012s | 0.035s | 0.008s | 0.030s | >> 0.012s >> 4096 | 0.604s | 0.047s | 0.012s | 0.042s | 0.016s | 0.032s | >> 0.016s >> 8912 | 4.471s | 0.085s | 0.020s | 0.059s | 0.059s | 0.050s | >> 0.036s >> 16384 | 34.826s | 0.105s | 0.092s | 0.109s | 0.060s | 0.087s | >> 0.068s >> 32768 | | 0.245s | 0.168s | 0.192s | 0.144s | 0.167s | >> 0.156s >> 65536 | | 0.833s | 0.716s | 0.485s | 0.276s | 0.468s | >> 0.316s >> 131072 | | 4.628s | 4.108s | 0.758s | 0.632s | 0.736s | >> 0.612s >> >> Andrei Vagin reports fixing this performance problem is part of the >> work to fix CVE-2016-6213. >> >> Cc: sta...@vger.kernel.org >> Reported-by: Andrei Vagin >> Signed-off-by: "Eric W. Biederman" >> --- >> >> I think this version is very close. I had to modify __lookup_mnt_last >> to not skip MOUNT_UMOUNT or we would never see when the mount >> propagation trees intersected. >> >> This doesn't look as good as the previous buggy version but it looks >> good. When the hash table isn't getting full the times look pretty >> linear. So it may be necessary to do some hash table resizing. >> >> That said there remains one issue I need to think about some more. >> >> In mark_umount_candidates I don't mark mounts that are locked to their >> parent and their parent is not marked as a umount candidate. Given that >> we skip processing mounts multiple times this might result in a mount >> whose parent gets marked as unmountable after the first time we see a >> mount not getting marked as unmountable later. Unfortunately my fears are born out as demonstrated by the script below. $ cat pathology.sh #!/bin/sh set -e set -x mount -t tmpfs base /mnt mount --make-shared /mnt mkdir -p /mnt/b mount -t tmpfs test1 /mnt/b mount --make-shared /mnt/b mkdir -p /mnt/b/10 mount -t tmpfs test2 /mnt/b/10 mount --make-shared /mnt/b/10 mkdir -p /mnt/b/10/20 mount --rbind /mnt/b /mnt/b/10/20 cat /proc/self/mountinfo ls /mnt /mnt/b /mnt/b/10 /mnt/b/10/20 /mnt/b/10/20/10 /mnt/b/10/20/10/20 || true unshare -Urm --propagation unchanged /bin/bash -c 'cat /proc/self/mountinfo; sleep 5; ls /mnt /mnt/b /mnt/b/10 /mnt/b/10/20 /mnt/b/10/20/10 \ /mnt/b/10/20/10/20 || true; cat /proc/self/mountinfo' & sleep 1 umount -l /mnt/b/ wait %% $ unshare -Urm ./pathology.sh >> Anyway Andrei if you could check this out and see if you can see >> anything I missed please let me know. > > I've tested this patch today and it works to me. The idea of this patch > looks good for me
Re: [RFC][PATCH v2] mount: In mark_umount_candidates and __propogate_umount visit each mount once
Andrei Vagin writes: > On Fri, Oct 14, 2016 at 01:29:18PM -0500, Eric W. Biederman wrote: >> >> Adrei Vagin pointed out that time to executue propagate_umount can go >> non-linear (and take a ludicrious amount of time) when the mount >> propogation trees of the mounts to be unmunted by a lazy unmount >> overlap. >> >> Solve this in the most straight forward way possible, by adding a new >> mount flag to mark parts of the mount propagation tree that have been >> visited, and use that mark to skip parts of the mount propagation tree >> that have already been visited during an unmount. This guarantees >> that each mountpoint in the possibly overlapping mount propagation >> trees will be visited exactly once. >> >> Add the functions propagation_visit_next and propagation_revisit_next >> to coordinate setting and clearling the visited mount mark. >> >> The skipping of already unmounted mounts has been moved from >> __lookup_mnt_last to mark_umount_candidates, so that the new >> propagation functions can notice record when the propagation tree >> passes through the initial set of unmounted mounts. Except in >> umount_tree as part of the unmounting process the only place where >> unmounted mounts should be found are in unmounted subtrees. All of >> the other callers of __lookup_mnt_last are from mounted subtrees so >> the not checking for unmounted mounts should not affect them. >> >> Here is a script to generate such mount tree: >> $ cat run.sh >> mount -t tmpfs test-mount /mnt >> mount --make-shared /mnt >> for i in `seq $1`; do >> mkdir /mnt/test.$i >> mount --bind /mnt /mnt/test.$i >> done >> cat /proc/mounts | grep test-mount | wc -l >> time umount -l /mnt >> $ for i in `seq 10 16`; do echo $i; unshare -Urm bash ./run.sh $i; done >> >> Here are the performance numbers with and without the patch: >> >> mhash | 8192 | 8192 | 8192 | 131072 | 131072 | 104857 | >> 104857 >> mounts | before | after | after (sys) | after | after (sys) | after | >> after (sys) >> - >> 1024 | 0.071s | 0.023s | 0.008s | 0.026s | 0.000s | 0.024s | >> 0.008s >> 2048 | 0.184s | 0.030s | 0.012s | 0.035s | 0.008s | 0.030s | >> 0.012s >> 4096 | 0.604s | 0.047s | 0.012s | 0.042s | 0.016s | 0.032s | >> 0.016s >> 8912 | 4.471s | 0.085s | 0.020s | 0.059s | 0.059s | 0.050s | >> 0.036s >> 16384 | 34.826s | 0.105s | 0.092s | 0.109s | 0.060s | 0.087s | >> 0.068s >> 32768 | | 0.245s | 0.168s | 0.192s | 0.144s | 0.167s | >> 0.156s >> 65536 | | 0.833s | 0.716s | 0.485s | 0.276s | 0.468s | >> 0.316s >> 131072 | | 4.628s | 4.108s | 0.758s | 0.632s | 0.736s | >> 0.612s >> >> Andrei Vagin reports fixing this performance problem is part of the >> work to fix CVE-2016-6213. >> >> Cc: sta...@vger.kernel.org >> Reported-by: Andrei Vagin >> Signed-off-by: "Eric W. Biederman" >> --- >> >> I think this version is very close. I had to modify __lookup_mnt_last >> to not skip MOUNT_UMOUNT or we would never see when the mount >> propagation trees intersected. >> >> This doesn't look as good as the previous buggy version but it looks >> good. When the hash table isn't getting full the times look pretty >> linear. So it may be necessary to do some hash table resizing. >> >> That said there remains one issue I need to think about some more. >> >> In mark_umount_candidates I don't mark mounts that are locked to their >> parent and their parent is not marked as a umount candidate. Given that >> we skip processing mounts multiple times this might result in a mount >> whose parent gets marked as unmountable after the first time we see a >> mount not getting marked as unmountable later. Unfortunately my fears are born out as demonstrated by the script below. $ cat pathology.sh #!/bin/sh set -e set -x mount -t tmpfs base /mnt mount --make-shared /mnt mkdir -p /mnt/b mount -t tmpfs test1 /mnt/b mount --make-shared /mnt/b mkdir -p /mnt/b/10 mount -t tmpfs test2 /mnt/b/10 mount --make-shared /mnt/b/10 mkdir -p /mnt/b/10/20 mount --rbind /mnt/b /mnt/b/10/20 cat /proc/self/mountinfo ls /mnt /mnt/b /mnt/b/10 /mnt/b/10/20 /mnt/b/10/20/10 /mnt/b/10/20/10/20 || true unshare -Urm --propagation unchanged /bin/bash -c 'cat /proc/self/mountinfo; sleep 5; ls /mnt /mnt/b /mnt/b/10 /mnt/b/10/20 /mnt/b/10/20/10 \ /mnt/b/10/20/10/20 || true; cat /proc/self/mountinfo' & sleep 1 umount -l /mnt/b/ wait %% $ unshare -Urm ./pathology.sh >> Anyway Andrei if you could check this out and see if you can see >> anything I missed please let me know. > > I've tested this patch today and it works to me. The idea of this patch > looks good for me too. Thanks! There is one inline comment. It is definitely close
[RFC][PATCH v2] mount: In mark_umount_candidates and __propogate_umount visit each mount once
Adrei Vagin pointed out that time to executue propagate_umount can go non-linear (and take a ludicrious amount of time) when the mount propogation trees of the mounts to be unmunted by a lazy unmount overlap. Solve this in the most straight forward way possible, by adding a new mount flag to mark parts of the mount propagation tree that have been visited, and use that mark to skip parts of the mount propagation tree that have already been visited during an unmount. This guarantees that each mountpoint in the possibly overlapping mount propagation trees will be visited exactly once. Add the functions propagation_visit_next and propagation_revisit_next to coordinate setting and clearling the visited mount mark. The skipping of already unmounted mounts has been moved from __lookup_mnt_last to mark_umount_candidates, so that the new propagation functions can notice record when the propagation tree passes through the initial set of unmounted mounts. Except in umount_tree as part of the unmounting process the only place where unmounted mounts should be found are in unmounted subtrees. All of the other callers of __lookup_mnt_last are from mounted subtrees so the not checking for unmounted mounts should not affect them. Here is a script to generate such mount tree: $ cat run.sh mount -t tmpfs test-mount /mnt mount --make-shared /mnt for i in `seq $1`; do mkdir /mnt/test.$i mount --bind /mnt /mnt/test.$i done cat /proc/mounts | grep test-mount | wc -l time umount -l /mnt $ for i in `seq 10 16`; do echo $i; unshare -Urm bash ./run.sh $i; done Here are the performance numbers with and without the patch: mhash | 8192 | 8192 | 8192 | 131072 | 131072 | 104857 | 104857 mounts | before | after | after (sys) | after | after (sys) | after | after (sys) - 1024 | 0.071s | 0.023s | 0.008s | 0.026s | 0.000s | 0.024s | 0.008s 2048 | 0.184s | 0.030s | 0.012s | 0.035s | 0.008s | 0.030s | 0.012s 4096 | 0.604s | 0.047s | 0.012s | 0.042s | 0.016s | 0.032s | 0.016s 8912 | 4.471s | 0.085s | 0.020s | 0.059s | 0.059s | 0.050s | 0.036s 16384 | 34.826s | 0.105s | 0.092s | 0.109s | 0.060s | 0.087s | 0.068s 32768 | | 0.245s | 0.168s | 0.192s | 0.144s | 0.167s | 0.156s 65536 | | 0.833s | 0.716s | 0.485s | 0.276s | 0.468s | 0.316s 131072 | | 4.628s | 4.108s | 0.758s | 0.632s | 0.736s | 0.612s Andrei Vagin reports fixing this performance problem is part of the work to fix CVE-2016-6213. Cc: sta...@vger.kernel.org Reported-by: Andrei VaginSigned-off-by: "Eric W. Biederman" --- I think this version is very close. I had to modify __lookup_mnt_last to not skip MOUNT_UMOUNT or we would never see when the mount propagation trees intersected. This doesn't look as good as the previous buggy version but it looks good. When the hash table isn't getting full the times look pretty linear. So it may be necessary to do some hash table resizing. That said there remains one issue I need to think about some more. In mark_umount_candidates I don't mark mounts that are locked to their parent and their parent is not marked as a umount candidate. Given that we skip processing mounts multiple times this might result in a mount whose parent gets marked as unmountable after the first time we see a mount not getting marked as unmountable later. Anyway Andrei if you could check this out and see if you can see anything I missed please let me know. Eric fs/namespace.c| 6 +-- fs/pnode.c| 147 -- fs/pnode.h| 4 ++ include/linux/mount.h | 2 + 4 files changed, 138 insertions(+), 21 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index db1b5a38864e..1ca99fa2e0f4 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -650,13 +650,11 @@ struct mount *__lookup_mnt_last(struct vfsmount *mnt, struct dentry *dentry) p = __lookup_mnt(mnt, dentry); if (!p) goto out; - if (!(p->mnt.mnt_flags & MNT_UMOUNT)) - res = p; + res = p; hlist_for_each_entry_continue(p, mnt_hash) { if (>mnt_parent->mnt != mnt || p->mnt_mountpoint != dentry) break; - if (!(p->mnt.mnt_flags & MNT_UMOUNT)) - res = p; + res = p; } out: return res; diff --git a/fs/pnode.c b/fs/pnode.c index 234a9ac49958..025e3d9339b0 100644 --- a/fs/pnode.c +++ b/fs/pnode.c @@ -390,20 +390,137 @@ void propagate_mount_unlock(struct mount *mnt) } /* + * get the next mount in the propagation tree (that has not been visited) + * @m: the mount seen last + * @origin: the original mount from where the tree walk initiated + * + * Note that peer groups form
[RFC][PATCH v2] mount: In mark_umount_candidates and __propogate_umount visit each mount once
Adrei Vagin pointed out that time to executue propagate_umount can go non-linear (and take a ludicrious amount of time) when the mount propogation trees of the mounts to be unmunted by a lazy unmount overlap. Solve this in the most straight forward way possible, by adding a new mount flag to mark parts of the mount propagation tree that have been visited, and use that mark to skip parts of the mount propagation tree that have already been visited during an unmount. This guarantees that each mountpoint in the possibly overlapping mount propagation trees will be visited exactly once. Add the functions propagation_visit_next and propagation_revisit_next to coordinate setting and clearling the visited mount mark. The skipping of already unmounted mounts has been moved from __lookup_mnt_last to mark_umount_candidates, so that the new propagation functions can notice record when the propagation tree passes through the initial set of unmounted mounts. Except in umount_tree as part of the unmounting process the only place where unmounted mounts should be found are in unmounted subtrees. All of the other callers of __lookup_mnt_last are from mounted subtrees so the not checking for unmounted mounts should not affect them. Here is a script to generate such mount tree: $ cat run.sh mount -t tmpfs test-mount /mnt mount --make-shared /mnt for i in `seq $1`; do mkdir /mnt/test.$i mount --bind /mnt /mnt/test.$i done cat /proc/mounts | grep test-mount | wc -l time umount -l /mnt $ for i in `seq 10 16`; do echo $i; unshare -Urm bash ./run.sh $i; done Here are the performance numbers with and without the patch: mhash | 8192 | 8192 | 8192 | 131072 | 131072 | 104857 | 104857 mounts | before | after | after (sys) | after | after (sys) | after | after (sys) - 1024 | 0.071s | 0.023s | 0.008s | 0.026s | 0.000s | 0.024s | 0.008s 2048 | 0.184s | 0.030s | 0.012s | 0.035s | 0.008s | 0.030s | 0.012s 4096 | 0.604s | 0.047s | 0.012s | 0.042s | 0.016s | 0.032s | 0.016s 8912 | 4.471s | 0.085s | 0.020s | 0.059s | 0.059s | 0.050s | 0.036s 16384 | 34.826s | 0.105s | 0.092s | 0.109s | 0.060s | 0.087s | 0.068s 32768 | | 0.245s | 0.168s | 0.192s | 0.144s | 0.167s | 0.156s 65536 | | 0.833s | 0.716s | 0.485s | 0.276s | 0.468s | 0.316s 131072 | | 4.628s | 4.108s | 0.758s | 0.632s | 0.736s | 0.612s Andrei Vagin reports fixing this performance problem is part of the work to fix CVE-2016-6213. Cc: sta...@vger.kernel.org Reported-by: Andrei Vagin Signed-off-by: "Eric W. Biederman" --- I think this version is very close. I had to modify __lookup_mnt_last to not skip MOUNT_UMOUNT or we would never see when the mount propagation trees intersected. This doesn't look as good as the previous buggy version but it looks good. When the hash table isn't getting full the times look pretty linear. So it may be necessary to do some hash table resizing. That said there remains one issue I need to think about some more. In mark_umount_candidates I don't mark mounts that are locked to their parent and their parent is not marked as a umount candidate. Given that we skip processing mounts multiple times this might result in a mount whose parent gets marked as unmountable after the first time we see a mount not getting marked as unmountable later. Anyway Andrei if you could check this out and see if you can see anything I missed please let me know. Eric fs/namespace.c| 6 +-- fs/pnode.c| 147 -- fs/pnode.h| 4 ++ include/linux/mount.h | 2 + 4 files changed, 138 insertions(+), 21 deletions(-) diff --git a/fs/namespace.c b/fs/namespace.c index db1b5a38864e..1ca99fa2e0f4 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -650,13 +650,11 @@ struct mount *__lookup_mnt_last(struct vfsmount *mnt, struct dentry *dentry) p = __lookup_mnt(mnt, dentry); if (!p) goto out; - if (!(p->mnt.mnt_flags & MNT_UMOUNT)) - res = p; + res = p; hlist_for_each_entry_continue(p, mnt_hash) { if (>mnt_parent->mnt != mnt || p->mnt_mountpoint != dentry) break; - if (!(p->mnt.mnt_flags & MNT_UMOUNT)) - res = p; + res = p; } out: return res; diff --git a/fs/pnode.c b/fs/pnode.c index 234a9ac49958..025e3d9339b0 100644 --- a/fs/pnode.c +++ b/fs/pnode.c @@ -390,20 +390,137 @@ void propagate_mount_unlock(struct mount *mnt) } /* + * get the next mount in the propagation tree (that has not been visited) + * @m: the mount seen last + * @origin: the original mount from where the tree walk initiated + * + * Note that peer groups form contiguous segments of slave lists. + * We rely on