Re: [RFC][PATCH] mount: In mark_umount_candidates and __propagate_umount visit each mount once

2016-10-13 Thread Eric W. Biederman
Andrey Vagin  writes:

> On Thu, Oct 13, 2016 at 2:46 PM, Andrei Vagin  wrote:
>> On Thu, Oct 13, 2016 at 02:53:46PM -0500, Eric W. Biederman wrote:
>>>
>>> Andrei Vagin pointed out that the time to execute propagate_umount can go
>>> non-linear (and take a ludicrous amount of time) when the mount
>>> propagation trees of the mounts to be unmounted by a lazy unmount
>>> overlap.
>>>
>>> Solve this in the most straightforward way possible, by adding a new
>>> mount flag to mark parts of the mount propagation tree that have been
>>> visited, and use that mark to skip parts of the mount propagation tree
>>> that have already been visited during an unmount.  This guarantees
>>> that each mountpoint in the possibly overlapping mount propagation
>>> trees will be visited exactly once.
>>>
>>> Add the functions propagation_visit_next and propagation_revisit_next
>>> to coordinate setting and clearing the visited mount mark.
>>>
>>> Here is a script to generate such a mount tree:
>>> $ cat run.sh
>>> mount -t tmpfs test-mount /mnt
>>> mount --make-shared /mnt
>>> for i in `seq $1`; do
>>> mkdir /mnt/test.$i
>>> mount --bind /mnt /mnt/test.$i
>>> done
>>> cat /proc/mounts | grep test-mount | wc -l
>>> time umount -l /mnt
>>> $ for i in `seq 10 16`; do echo $i; unshare -Urm bash ./run.sh $i; done
>>>
>>> Here are the performance numbers with and without the patch:
>>>
>>> mounts | before | after (real sec)
>>> -------+--------+-----------------
>>>   1024 |  0.071 | 0.024
>>>   2048 |  0.184 | 0.030
>>>   4096 |  0.604 | 0.040
>>>   8192 |  4.471 | 0.043
>>>  16384 | 34.826 | 0.082
>>>  32768 |        | 0.151
>>>  65536 |        | 0.289
>>> 131072 |        | 0.659
>>>
>>> Fixing this performance problem is part of Andrei Vagin's
>>> work to fix CVE-2016-6213.
>>>
>>> Cc: sta...@vger.kernel.org
>>> Reported-by: Andrei Vagin 
>>> Signed-off-by: "Eric W. Biederman" 
>>> ---
>>>
>>> Andrei, can you take a look at this patch and see if you can spot any
>>> problems?  My limited testing suggests this approach does a much better
>>> job of solving the problem you were seeing, with the time now looking
>>> almost linear in the number of mounts.
>>
>> I read this patch and I like the idea.
>>
>> Then I ran my tests and one of them doesn't work with this patch.
>> I haven't found the reason yet.
>
>>> + for (m = propagation_visit_next(parent, parent); m;
>>> + m = propagation_visit_next(m, parent)) {
>>>   struct mount *child = __lookup_mnt_last(&m->mnt,
>>>   mnt->mnt_mountpoint);
>
> The reason is that this loop is called for different "mnt" values, but
> with this optimization it is executed only once.
>
> So I think the idea of marking the parent will not work, because one
> parent can have several children which have to be unmounted.

Good catch.  So what needs to be marked is the parent mount and
mountpoint combination, which is effectively the child mount.

I still think replacing propagation_next and fixing the propagation
walk is the way to go.  But it sounds like, to make things work, the
__lookup_mnt_last call needs to be moved into the propagation walk
function.

That doesn't feel too hard.  I will have to see what the code looks like.
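
Something along these lines, perhaps (an untested sketch;
propagation_visit_child is a hypothetical name, not code from this
thread, and the body reuses helpers that already exist in fs/pnode.c):

static struct mount *propagation_visit_child(struct mount *last_child,
					     struct mount *origin_child)
{
	struct mount *m = last_child->mnt_parent;
	struct mount *origin = origin_child->mnt_parent;
	struct dentry *mountpoint = origin_child->mnt_mountpoint;

	/* Walk the propagation tree of the parent... */
	while ((m = propagation_next(m, origin))) {
		struct mount *child = __lookup_mnt_last(&m->mnt, mountpoint);

		/* ...but test and mark the child, i.e. the
		 * (parent, mountpoint) pair, not the parent itself. */
		if (child && !IS_MNT_VISITED(child)) {
			SET_MNT_VISITED(child);
			return child;
		}
	}
	return NULL;
}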

Eric


Re: [RFC][PATCH] mount: In mark_umount_candidates and __propagate_umount visit each mount once

2016-10-13 Thread Andrey Vagin
On Thu, Oct 13, 2016 at 2:46 PM, Andrei Vagin  wrote:
> On Thu, Oct 13, 2016 at 02:53:46PM -0500, Eric W. Biederman wrote:
>>
>> Andrei Vagin pointed out that the time to execute propagate_umount can go
>> non-linear (and take a ludicrous amount of time) when the mount
>> propagation trees of the mounts to be unmounted by a lazy unmount
>> overlap.
>>
>> Solve this in the most straightforward way possible, by adding a new
>> mount flag to mark parts of the mount propagation tree that have been
>> visited, and use that mark to skip parts of the mount propagation tree
>> that have already been visited during an unmount.  This guarantees
>> that each mountpoint in the possibly overlapping mount propagation
>> trees will be visited exactly once.
>>
>> Add the functions propagation_visit_next and propagation_revisit_next
>> to coordinate setting and clearing the visited mount mark.
>>
>> Here is a script to generate such a mount tree:
>> $ cat run.sh
>> mount -t tmpfs test-mount /mnt
>> mount --make-shared /mnt
>> for i in `seq $1`; do
>> mkdir /mnt/test.$i
>> mount --bind /mnt /mnt/test.$i
>> done
>> cat /proc/mounts | grep test-mount | wc -l
>> time umount -l /mnt
>> $ for i in `seq 10 16`; do echo $i; unshare -Urm bash ./run.sh $i; done
>>
>> Here are the performance numbers with and without the patch:
>>
>> mounts | before | after (real sec)
>> -------+--------+-----------------
>>   1024 |  0.071 | 0.024
>>   2048 |  0.184 | 0.030
>>   4096 |  0.604 | 0.040
>>   8192 |  4.471 | 0.043
>>  16384 | 34.826 | 0.082
>>  32768 |        | 0.151
>>  65536 |        | 0.289
>> 131072 |        | 0.659
>>
>> Fixing this performance problem is part of Andrei Vagin's
>> work to fix CVE-2016-6213.
>>
>> Cc: sta...@vger.kernel.org
>> Reported-by: Andrei Vagin 
>> Signed-off-by: "Eric W. Biederman" 
>> ---
>>
>> Andrei, can you take a look at this patch and see if you can spot any
>> problems?  My limited testing suggests this approach does a much better
>> job of solving the problem you were seeing, with the time now looking
>> almost linear in the number of mounts.
>
> I read this patch and I like the idea.
>
> Then I ran my tests and one of them doesn't work with this patch.
> I haven't found the reason yet.

>> + for (m = propagation_visit_next(parent, parent); m;
>> + m = propagation_visit_next(m, parent)) {
>>   struct mount *child = __lookup_mnt_last(&m->mnt,
>>   mnt->mnt_mountpoint);

The reason is that this loop is called for different "mnt" values, but
with this optimization it is executed only once.

So I think the idea of marking the parent will not work, because one
parent can have several children which have to be unmounted.
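
A simplified illustration of the failure mode (a diagram, not kernel
code; propagate_umount() invokes mark_umount_candidates() once per
mount on the umount list):

/*
 * Two mounts on the umount list share the same parent:
 *
 *   parent
 *     |- child_a   (mnt_mountpoint = /a)
 *     |- child_b   (mnt_mountpoint = /b)
 *
 * Call for child_a: propagation_visit_next(parent, parent) sets
 * MNT_VISITED on parent and walks its propagation tree.
 *
 * Call for child_b: propagation_visit_next(parent, parent) now sees
 * IS_MNT_VISITED(parent) and returns NULL immediately, so the
 * propagated copies of child_b are never collected for unmounting.
 */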

>
> Here is the test:
>
> [root@fc24 mounts]# cat run.sh
> set -e
> mount -t tmpfs zdtm /mnt
> mkdir -p /mnt/1 /mnt/2
> mount -t tmpfs zdtm /mnt/1
> mount --make-shared /mnt/1
> for i in `seq $1`; do
> mount --bind /mnt/1 `mktemp -d /mnt/1/test.XXXXXX`
> done
> mount --rbind /mnt/1 /mnt/2
> cat /proc/self/mountinfo | grep zdtm | wc -l
> time umount -l /mnt/1
> cat /proc/self/mountinfo | grep zdtm | wc -l
> umount /mnt/2
>
>
> [root@fc24 mounts]# unshare -Urm ./run.sh  5
> 65
>
> real    0m0.014s
> user    0m0.000s
> sys     0m0.004s
> 33
> umount: /mnt/2: target is busy
> (In some cases useful info about processes that
>  use the device is found by lsof(8) or fuser(1).)
>
>>

Thanks,
Andrei


Re: [RFC][PATCH] mount: In mark_umount_candidates and __propagate_umount visit each mount once

2016-10-13 Thread Andrei Vagin
On Thu, Oct 13, 2016 at 02:53:46PM -0500, Eric W. Biederman wrote:
> 
> Andrei Vagin pointed out that the time to execute propagate_umount can go
> non-linear (and take a ludicrous amount of time) when the mount
> propagation trees of the mounts to be unmounted by a lazy unmount
> overlap.
> 
> Solve this in the most straightforward way possible, by adding a new
> mount flag to mark parts of the mount propagation tree that have been
> visited, and use that mark to skip parts of the mount propagation tree
> that have already been visited during an unmount.  This guarantees
> that each mountpoint in the possibly overlapping mount propagation
> trees will be visited exactly once.
> 
> Add the functions propagation_visit_next and propagation_revisit_next
> to coordinate setting and clearing the visited mount mark.
> 
> Here is a script to generate such a mount tree:
> $ cat run.sh
> mount -t tmpfs test-mount /mnt
> mount --make-shared /mnt
> for i in `seq $1`; do
> mkdir /mnt/test.$i
> mount --bind /mnt /mnt/test.$i
> done
> cat /proc/mounts | grep test-mount | wc -l
> time umount -l /mnt
> $ for i in `seq 10 16`; do echo $i; unshare -Urm bash ./run.sh $i; done
> 
> Here are the performance numbers with and without the patch:
> 
> mounts | before | after (real sec)
> -------+--------+-----------------
>   1024 |  0.071 | 0.024
>   2048 |  0.184 | 0.030
>   4096 |  0.604 | 0.040
>   8192 |  4.471 | 0.043
>  16384 | 34.826 | 0.082
>  32768 |        | 0.151
>  65536 |        | 0.289
> 131072 |        | 0.659
> 
> Fixing this performance problem is part of Andrei Vagin's
> work to fix CVE-2016-6213.
> 
> Cc: sta...@vger.kernel.org
> Reported-by: Andrei Vagin 
> Signed-off-by: "Eric W. Biederman" 
> ---
> 
> Andrei, can you take a look at this patch and see if you can spot any
> problems?  My limited testing suggests this approach does a much better
> job of solving the problem you were seeing, with the time now looking
> almost linear in the number of mounts.

I read this patch and I like the idea.

Then I ran my tests and one of them doesn't work with this patch.
I haven't found the reason yet.

Here is the test:

[root@fc24 mounts]# cat run.sh
set -e
mount -t tmpfs zdtm /mnt
mkdir -p /mnt/1 /mnt/2
mount -t tmpfs zdtm /mnt/1
mount --make-shared /mnt/1
for i in `seq $1`; do
mount --bind /mnt/1 `mktemp -d /mnt/1/test.XXXXXX`
done
mount --rbind /mnt/1 /mnt/2
cat /proc/self/mountinfo | grep zdtm | wc -l
time umount -l /mnt/1
cat /proc/self/mountinfo | grep zdtm | wc -l
umount /mnt/2


[root@fc24 mounts]# unshare -Urm ./run.sh  5
65

real    0m0.014s
user    0m0.000s
sys     0m0.004s
33
umount: /mnt/2: target is busy
(In some cases useful info about processes that
 use the device is found by lsof(8) or fuser(1).)

> 
>  fs/pnode.c            | 125 +++++++++++++++++++++++++++++++++++++++++++++---
>  fs/pnode.h            |   4 ++
>  include/linux/mount.h |   2 +
>  3 files changed, 126 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/pnode.c b/fs/pnode.c
> index 234a9ac49958..3acce0c75f94 100644
> --- a/fs/pnode.c
> +++ b/fs/pnode.c
> @@ -164,6 +164,120 @@ static struct mount *propagation_next(struct mount *m,
>   }
>  }
>  
> +/*
> + * get the next mount in the propagation tree (that has not been visited)
> + * @m: the mount seen last
> + * @origin: the original mount from where the tree walk initiated
> + *
> + * Note that peer groups form contiguous segments of slave lists.
> + * We rely on that in get_source() to be able to find out if
> + * vfsmount found while iterating with propagation_next() is
> + * a peer of one we'd found earlier.
> + */
> +static struct mount *propagation_visit_next(struct mount *m,
> + struct mount *origin)
> +{
> + /* Has this part of the propagation tree already been visited? */
> + if (IS_MNT_VISITED(m))
> + return NULL;
> +
> + SET_MNT_VISITED(m);
> +
> + /* are there any slaves of this mount? */
> + if (!list_empty(&m->mnt_slave_list)) {
> + struct mount *slave = first_slave(m);
> + while (1) {
> + if (!IS_MNT_VISITED(slave))
> + return slave;
> + if (slave->mnt_slave.next == &m->mnt_slave_list)
> + break;
> + slave = next_slave(slave);
> + }
> + }
> + while (1) {
> + struct mount *master = m->mnt_master;
> +
> + if (master == origin->mnt_master) {
> + struct mount *next = next_peer(m);
> + while (1) {
> + if (next == origin)
> + return NULL;
> + if (!IS_MNT_VISITED(next))
> + return next;
> + next = next_peer(next);
> + }
> + } else {
> + while (1) {

[RFC][PATCH] mount: In mark_umount_candidates and __propagate_umount visit each mount once

2016-10-13 Thread Eric W. Biederman

Andrei Vagin pointed out that the time to execute propagate_umount can go
non-linear (and take a ludicrous amount of time) when the mount
propagation trees of the mounts to be unmounted by a lazy unmount
overlap.

Solve this in the most straightforward way possible, by adding a new
mount flag to mark parts of the mount propagation tree that have been
visited, and use that mark to skip parts of the mount propagation tree
that have already been visited during an unmount.  This guarantees
that each mountpoint in the possibly overlapping mount propagation
trees will be visited exactly once.

Add the functions propagation_visit_next and propagation_revisit_next
to coordinate setting and clearing the visited mount mark.
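
The intended calling pattern, as it appears in the mark_umount_candidates
hunk quoted in the replies above (the loop body is elided there and here),
is:

	for (m = propagation_visit_next(parent, parent); m;
	     m = propagation_visit_next(m, parent)) {
		struct mount *child = __lookup_mnt_last(&m->mnt,
						mnt->mnt_mountpoint);
		/* ... mark 'child' as a candidate for unmounting ... */
	}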

Here is a script to generate such a mount tree:
$ cat run.sh
mount -t tmpfs test-mount /mnt
mount --make-shared /mnt
for i in `seq $1`; do
mkdir /mnt/test.$i
mount --bind /mnt /mnt/test.$i
done
cat /proc/mounts | grep test-mount | wc -l
time umount -l /mnt
$ for i in `seq 10 16`; do echo $i; unshare -Urm bash ./run.sh $i; done

Here are the performance numbers with and without the patch:

mounts | before | after (real sec)
-------+--------+-----------------
  1024 |  0.071 | 0.024
  2048 |  0.184 | 0.030
  4096 |  0.604 | 0.040
  8192 |  4.471 | 0.043
 16384 | 34.826 | 0.082
 32768 |        | 0.151
 65536 |        | 0.289
131072 |        | 0.659

Fixing this performance problem is part of Andrei Vagin's
work to fix CVE-2016-6213.

Cc: sta...@vger.kernel.org
Reported-by: Andrei Vagin 
Signed-off-by: "Eric W. Biederman" 
---

Andrei, can you take a look at this patch and see if you can spot any
problems?  My limited testing suggests this approach does a much better
job of solving the problem you were seeing, with the time now looking
almost linear in the number of mounts.

 fs/pnode.c            | 125 +++++++++++++++++++++++++++++++++++++++++++++---
 fs/pnode.h            |   4 ++
 include/linux/mount.h |   2 +
 3 files changed, 126 insertions(+), 5 deletions(-)
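
The fs/pnode.h and include/linux/mount.h hunks are not reproduced in
this archive.  Presumably they add the flag and helpers along these
lines (an assumption, mirroring the existing IS_MNT_SHARED() helpers;
the flag value here is made up):

/* include/linux/mount.h (assumed, not shown in this archive) */
#define MNT_VISITED		0x010000	/* hypothetical flag value */

/* fs/pnode.h (assumed helpers, in the style of IS_MNT_SHARED()) */
#define IS_MNT_VISITED(m)	((m)->mnt.mnt_flags & MNT_VISITED)
#define SET_MNT_VISITED(m)	((m)->mnt.mnt_flags |= MNT_VISITED)
#define CLEAR_MNT_VISITED(m)	((m)->mnt.mnt_flags &= ~MNT_VISITED)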

diff --git a/fs/pnode.c b/fs/pnode.c
index 234a9ac49958..3acce0c75f94 100644
--- a/fs/pnode.c
+++ b/fs/pnode.c
@@ -164,6 +164,120 @@ static struct mount *propagation_next(struct mount *m,
}
 }
 
+/*
+ * get the next mount in the propagation tree (that has not been visited)
+ * @m: the mount seen last
+ * @origin: the original mount from where the tree walk initiated
+ *
+ * Note that peer groups form contiguous segments of slave lists.
+ * We rely on that in get_source() to be able to find out if
+ * vfsmount found while iterating with propagation_next() is
+ * a peer of one we'd found earlier.
+ */
+static struct mount *propagation_visit_next(struct mount *m,
+   struct mount *origin)
+{
+   /* Has this part of the propagation tree already been visited? */
+   if (IS_MNT_VISITED(m))
+   return NULL;
+
+   SET_MNT_VISITED(m);
+
+   /* are there any slaves of this mount? */
+   if (!list_empty(&m->mnt_slave_list)) {
+   struct mount *slave = first_slave(m);
+   while (1) {
+   if (!IS_MNT_VISITED(slave))
+   return slave;
+   if (slave->mnt_slave.next == &m->mnt_slave_list)
+   break;
+   slave = next_slave(slave);
+   }
+   }
+   while (1) {
+   struct mount *master = m->mnt_master;
+
+   if (master == origin->mnt_master) {
+   struct mount *next = next_peer(m);
+   while (1) {
+   if (next == origin)
+   return NULL;
+   if (!IS_MNT_VISITED(next))
+   return next;
+   next = next_peer(next);
+   }
+   } else {
+   while (1) {
+   if (m->mnt_slave.next == 
&master->mnt_slave_list)
+   break;
+   m = next_slave(m);
+   if (!IS_MNT_VISITED(m))
+   return m;
+   }
+   }
+
+   /* back at master */
+   m = master;
+   }
+}
+
+/*
+ * get the next mount in the propagation tree (that has not been revisited)
+ * @m: the mount seen last
+ * @origin: the original mount from where the tree walk initiated
+ *
+ * Note that peer groups form contiguous segments of slave lists.
+ * We rely on that in get_source() to be able to find out if
+ * vfsmount found while iterating with propagation_next() is
+ * a peer of one we'd found earlier.
+ */
+static struct mount *propagation_revisit_next(struct mount *m,
+ struct mount *origin)
+{
+   /* Has this part of the propagation tree already b