Re: [RFC] Add option to mount only a pids subset

2017-03-30 Thread Alexey Gladkov
On Sun, Mar 26, 2017 at 09:03:33AM +0200, Djalal Harouni wrote:
> (Cc'ed Kees and Jann for the procfs stacking issue)
> 
> > I completely agree with you that it looks wrong when options of one
> > mountpoint affect the others mountpoints.
> >
> >> I'm not sure if that's the right approach, it is still buggy, however
> >> seems that your patch also stores the mount option inside the
> >> pid_namespace which may get propagated to all mounts inside same pidns
> >> ?
> >
> > I don't store 'pidonly' option in my current patch. I mean in:
> > https://lkml.org/lkml/2017/3/20/363
> >
> > I parse options twice at first mount of procfs. It happens before
> > the mounting /proc in userspace.
> >
> > I know it's bad, but I couldn't find place to store this option.
> 
> Ok, then maybe that approach of having a procfs struct holder instead
> of using pid namespace may help!

I deside to stop doing my patch. I talked with a few people and found out
that the overlayfs doesn't feel very well if on the lower level filesystem
appear/disappear files.

In addition with the pidfs isn't so simple. Separate the root will lead to
a doubling of memory consumption and restrictions on the filesystem operations
level can easily be skipped.

It means that even I can do this pidfs (or pid subset in /proc), it
will be pointless.

-- 
Rgrds, legion



Re: [RFC] Add option to mount only a pids subset

2017-03-30 Thread Alexey Gladkov
On Sun, Mar 26, 2017 at 09:03:33AM +0200, Djalal Harouni wrote:
> (Cc'ed Kees and Jann for the procfs stacking issue)
> 
> > I completely agree with you that it looks wrong when options of one
> > mountpoint affect the others mountpoints.
> >
> >> I'm not sure if that's the right approach, it is still buggy, however
> >> seems that your patch also stores the mount option inside the
> >> pid_namespace which may get propagated to all mounts inside same pidns
> >> ?
> >
> > I don't store 'pidonly' option in my current patch. I mean in:
> > https://lkml.org/lkml/2017/3/20/363
> >
> > I parse options twice at first mount of procfs. It happens before
> > the mounting /proc in userspace.
> >
> > I know it's bad, but I couldn't find place to store this option.
> 
> Ok, then maybe that approach of having a procfs struct holder instead
> of using pid namespace may help!

I deside to stop doing my patch. I talked with a few people and found out
that the overlayfs doesn't feel very well if on the lower level filesystem
appear/disappear files.

In addition with the pidfs isn't so simple. Separate the root will lead to
a doubling of memory consumption and restrictions on the filesystem operations
level can easily be skipped.

It means that even I can do this pidfs (or pid subset in /proc), it
will be pointless.

-- 
Rgrds, legion



Re: [RFC] Add option to mount only a pids subset

2017-03-26 Thread Djalal Harouni
(Cc'ed Kees and Jann for the procfs stacking issue)


On Thu, Mar 23, 2017 at 11:07 PM, Alexey Gladkov
 wrote:
> On Thu, Mar 23, 2017 at 05:06:28PM +0100, Djalal Harouni wrote:
>> Hi Alexey,
>>
>> On Mon, Mar 20, 2017 at 1:58 PM, Alexey Gladkov
>>  wrote:
>> >
>> >
>> > Al Viro, this patch looks better ?
>> >
>> > == Overview ==
>> >
>> > Some of the container virtualization systems are mounted /proc inside
>> > the container. This is done in most cases to operate with information
>> > about the processes. Knowing that /proc filesystem is not fully
>> > virtualized they are mounted on top of dangerous places empty files or
>> > directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.).
>> >
>> > The structure of this filesystem is dynamic and any module can create a
>> > new object which will not necessarily be virtualized. There are
>> > proprietary modules that aren't in the mainline whose work we can not
>> > verify.
>> >
>> > This opens up a potential threat to the system. The developers of the
>> > virtualization system can't predict all dangerous places in /proc by
>> > definition.
>> >
>> > A more effective solution would be to mount into the container only what
>> > is necessary and ignore the rest.
>> >
>> > Right now there is the opportunity to pass in the container any port of
>> > the /proc filesystem using mount --bind expect the pids.
>> >
>> > This patch allows to mount only the part of /proc related to pids without
>> > rest objects. Since this is an option for /proc, flags applied to /proc
>> > have an effect on this subset of filesystem.
>>
>> I just sent a patch that also has to deal with proc hidepid here:
>> https://lkml.org/lkml/2017/3/23/505
>
> I completely agree with you that it looks wrong when options of one
> mountpoint affect the others mountpoints.
>
>> I'm not sure if that's the right approach, it is still buggy, however
>> seems that your patch also stores the mount option inside the
>> pid_namespace which may get propagated to all mounts inside same pidns
>> ?
>
> I don't store 'pidonly' option in my current patch. I mean in:
> https://lkml.org/lkml/2017/3/20/363
>
> I parse options twice at first mount of procfs. It happens before
> the mounting /proc in userspace.
>
> I know it's bad, but I couldn't find place to store this option.

Ok, then maybe that approach of having a procfs struct holder instead
of using pid namespace may help!


>> I didn't have enough time but maybe if they are related we can work it
>> out together ?
>
> I don't have enough experience in kenrel hacking, but I would happily do
> my best :)

OK, let's sync on this then.

> I also tring to mention it in every patch, as my changes almost completely
> useless without the ability to use the overlayfs.
>
> Now if you remove the restriction:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/inode.c#n497

This protection was introduced in e54ad7f1ee263

That's a security fix, ecryptfs seems to schedule some code through
kthreads which always run in the initial namespaces with full
capabilities, bypassing all security checks.

I don't know if overlayfs has something similar, if not then maybe if
it's possible in ecryptfs to check the lower mount and the fs type and
move this procfs there...  Cc'ed Jann in case he may comment.

> and mount procfs as lowerdir in overlayfs then you get NULL pointer
> dereference at:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/namei.c#n891
>
> I got it when I tried to do `ls -la /overlay/proc/self/`.
>
> --
> Rgrds, legion


Re: [RFC] Add option to mount only a pids subset

2017-03-26 Thread Djalal Harouni
(Cc'ed Kees and Jann for the procfs stacking issue)


On Thu, Mar 23, 2017 at 11:07 PM, Alexey Gladkov
 wrote:
> On Thu, Mar 23, 2017 at 05:06:28PM +0100, Djalal Harouni wrote:
>> Hi Alexey,
>>
>> On Mon, Mar 20, 2017 at 1:58 PM, Alexey Gladkov
>>  wrote:
>> >
>> >
>> > Al Viro, this patch looks better ?
>> >
>> > == Overview ==
>> >
>> > Some of the container virtualization systems are mounted /proc inside
>> > the container. This is done in most cases to operate with information
>> > about the processes. Knowing that /proc filesystem is not fully
>> > virtualized they are mounted on top of dangerous places empty files or
>> > directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.).
>> >
>> > The structure of this filesystem is dynamic and any module can create a
>> > new object which will not necessarily be virtualized. There are
>> > proprietary modules that aren't in the mainline whose work we can not
>> > verify.
>> >
>> > This opens up a potential threat to the system. The developers of the
>> > virtualization system can't predict all dangerous places in /proc by
>> > definition.
>> >
>> > A more effective solution would be to mount into the container only what
>> > is necessary and ignore the rest.
>> >
>> > Right now there is the opportunity to pass in the container any port of
>> > the /proc filesystem using mount --bind expect the pids.
>> >
>> > This patch allows to mount only the part of /proc related to pids without
>> > rest objects. Since this is an option for /proc, flags applied to /proc
>> > have an effect on this subset of filesystem.
>>
>> I just sent a patch that also has to deal with proc hidepid here:
>> https://lkml.org/lkml/2017/3/23/505
>
> I completely agree with you that it looks wrong when options of one
> mountpoint affect the others mountpoints.
>
>> I'm not sure if that's the right approach, it is still buggy, however
>> seems that your patch also stores the mount option inside the
>> pid_namespace which may get propagated to all mounts inside same pidns
>> ?
>
> I don't store 'pidonly' option in my current patch. I mean in:
> https://lkml.org/lkml/2017/3/20/363
>
> I parse options twice at first mount of procfs. It happens before
> the mounting /proc in userspace.
>
> I know it's bad, but I couldn't find place to store this option.

Ok, then maybe that approach of having a procfs struct holder instead
of using pid namespace may help!


>> I didn't have enough time but maybe if they are related we can work it
>> out together ?
>
> I don't have enough experience in kenrel hacking, but I would happily do
> my best :)

OK, let's sync on this then.

> I also tring to mention it in every patch, as my changes almost completely
> useless without the ability to use the overlayfs.
>
> Now if you remove the restriction:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/inode.c#n497

This protection was introduced in e54ad7f1ee263

That's a security fix, ecryptfs seems to schedule some code through
kthreads which always run in the initial namespaces with full
capabilities, bypassing all security checks.

I don't know if overlayfs has something similar, if not then maybe if
it's possible in ecryptfs to check the lower mount and the fs type and
move this procfs there...  Cc'ed Jann in case he may comment.

> and mount procfs as lowerdir in overlayfs then you get NULL pointer
> dereference at:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/namei.c#n891
>
> I got it when I tried to do `ls -la /overlay/proc/self/`.
>
> --
> Rgrds, legion


Re: [RFC] Add option to mount only a pids subset

2017-03-23 Thread Alexey Gladkov
On Thu, Mar 23, 2017 at 05:05:07PM +0100, Oleg Nesterov wrote:
> Again, I can't really review this, I know nothing about vfs, but since
> nobody else replied...

Thanks anyway :)

> On 03/20, Alexey Gladkov wrote:
> >
> > @@ -97,7 +169,23 @@ static struct dentry *proc_mount(struct 
> > file_system_type *fs_type,
> > ns = task_active_pid_ns(current);
> > }
> >
> > -   return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
> > +   root = mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
> > +
> > +   if (!IS_ERR(root)) {
> > +   if (!proc_fill_options(data, ))
> > +   return ERR_PTR(-EINVAL);
> 
> So we have to call proc_fill_options() twice, not good... Yes, I understand
> why, but perhaps we factor it out somehow, we can pack options + pid_ns into
> sb->s_fs_info. Nevermind, this is minor.

It happens only once, when we don't have the s_root yet.

> > +   if (opts.pid_only) {
> > +   int ret;
> > +
> > +   if (!ns->pidfs && (ret = fill_pidfs_root(root->d_sb)))
> > +   return ERR_PTR(ret);
> > +
> > +   root = ns->pidfs;
> 
> Afaics this lacks dget(ns->pidfs) which should pair with dput(mnt.mnt_root)
> in cleanup_mnt(). IIUC otherwise ns->pidfs can go away after umount, OTOH,
> if we return ns->pidfs then dget(sb->s_root) in mount_ns() is not balanced.
> But this all is fixeable.
> 
> So with this change "mount -opidonly" creates another IS_ROOT() dentry which
> is not equal to sb->s_root. I simply do not know if this is technically
> correct or not... but, say, the "Only bind mounts can have disconnected paths"
> comment in path_connected() makes me worry ;)
> 
> And this obviously means that /path-to-pidonly-mnt/ won't share dentries with
> the normal /proc mount. Not really good imo even if not really wrong... Lets
> look at proc_flush_task(). The exiting task will flush its $pid dentries in
> /proc/ but not in /path-to-pidonly-mnt/ iiuc. Again, not really a bug, but
> still...

I know that I'm cheater, but I did not start first :)

-- 
Rgrds, legion



Re: [RFC] Add option to mount only a pids subset

2017-03-23 Thread Alexey Gladkov
On Thu, Mar 23, 2017 at 05:05:07PM +0100, Oleg Nesterov wrote:
> Again, I can't really review this, I know nothing about vfs, but since
> nobody else replied...

Thanks anyway :)

> On 03/20, Alexey Gladkov wrote:
> >
> > @@ -97,7 +169,23 @@ static struct dentry *proc_mount(struct 
> > file_system_type *fs_type,
> > ns = task_active_pid_ns(current);
> > }
> >
> > -   return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
> > +   root = mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
> > +
> > +   if (!IS_ERR(root)) {
> > +   if (!proc_fill_options(data, ))
> > +   return ERR_PTR(-EINVAL);
> 
> So we have to call proc_fill_options() twice, not good... Yes, I understand
> why, but perhaps we factor it out somehow, we can pack options + pid_ns into
> sb->s_fs_info. Nevermind, this is minor.

It happens only once, when we don't have the s_root yet.

> > +   if (opts.pid_only) {
> > +   int ret;
> > +
> > +   if (!ns->pidfs && (ret = fill_pidfs_root(root->d_sb)))
> > +   return ERR_PTR(ret);
> > +
> > +   root = ns->pidfs;
> 
> Afaics this lacks dget(ns->pidfs) which should pair with dput(mnt.mnt_root)
> in cleanup_mnt(). IIUC otherwise ns->pidfs can go away after umount, OTOH,
> if we return ns->pidfs then dget(sb->s_root) in mount_ns() is not balanced.
> But this all is fixeable.
> 
> So with this change "mount -opidonly" creates another IS_ROOT() dentry which
> is not equal to sb->s_root. I simply do not know if this is technically
> correct or not... but, say, the "Only bind mounts can have disconnected paths"
> comment in path_connected() makes me worry ;)
> 
> And this obviously means that /path-to-pidonly-mnt/ won't share dentries with
> the normal /proc mount. Not really good imo even if not really wrong... Lets
> look at proc_flush_task(). The exiting task will flush its $pid dentries in
> /proc/ but not in /path-to-pidonly-mnt/ iiuc. Again, not really a bug, but
> still...

I know that I'm cheater, but I did not start first :)

-- 
Rgrds, legion



Re: [RFC] Add option to mount only a pids subset

2017-03-23 Thread Alexey Gladkov
On Thu, Mar 23, 2017 at 05:06:28PM +0100, Djalal Harouni wrote:
> Hi Alexey,
> 
> On Mon, Mar 20, 2017 at 1:58 PM, Alexey Gladkov
>  wrote:
> >
> >
> > Al Viro, this patch looks better ?
> >
> > == Overview ==
> >
> > Some of the container virtualization systems are mounted /proc inside
> > the container. This is done in most cases to operate with information
> > about the processes. Knowing that /proc filesystem is not fully
> > virtualized they are mounted on top of dangerous places empty files or
> > directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.).
> >
> > The structure of this filesystem is dynamic and any module can create a
> > new object which will not necessarily be virtualized. There are
> > proprietary modules that aren't in the mainline whose work we can not
> > verify.
> >
> > This opens up a potential threat to the system. The developers of the
> > virtualization system can't predict all dangerous places in /proc by
> > definition.
> >
> > A more effective solution would be to mount into the container only what
> > is necessary and ignore the rest.
> >
> > Right now there is the opportunity to pass in the container any port of
> > the /proc filesystem using mount --bind expect the pids.
> >
> > This patch allows to mount only the part of /proc related to pids without
> > rest objects. Since this is an option for /proc, flags applied to /proc
> > have an effect on this subset of filesystem.
> 
> I just sent a patch that also has to deal with proc hidepid here:
> https://lkml.org/lkml/2017/3/23/505

I completely agree with you that it looks wrong when options of one
mountpoint affect the others mountpoints.

> I'm not sure if that's the right approach, it is still buggy, however
> seems that your patch also stores the mount option inside the
> pid_namespace which may get propagated to all mounts inside same pidns
> ?

I don't store 'pidonly' option in my current patch. I mean in:
https://lkml.org/lkml/2017/3/20/363

I parse options twice at first mount of procfs. It happens before
the mounting /proc in userspace.

I know it's bad, but I couldn't find place to store this option.

> I didn't have enough time but maybe if they are related we can work it
> out together ?

I don't have enough experience in kenrel hacking, but I would happily do
my best :)

I also tring to mention it in every patch, as my changes almost completely
useless without the ability to use the overlayfs.

Now if you remove the restriction:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/inode.c#n497

and mount procfs as lowerdir in overlayfs then you get NULL pointer
dereference at:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/namei.c#n891

I got it when I tried to do `ls -la /overlay/proc/self/`.

-- 
Rgrds, legion



Re: [RFC] Add option to mount only a pids subset

2017-03-23 Thread Alexey Gladkov
On Thu, Mar 23, 2017 at 05:06:28PM +0100, Djalal Harouni wrote:
> Hi Alexey,
> 
> On Mon, Mar 20, 2017 at 1:58 PM, Alexey Gladkov
>  wrote:
> >
> >
> > Al Viro, this patch looks better ?
> >
> > == Overview ==
> >
> > Some of the container virtualization systems are mounted /proc inside
> > the container. This is done in most cases to operate with information
> > about the processes. Knowing that /proc filesystem is not fully
> > virtualized they are mounted on top of dangerous places empty files or
> > directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.).
> >
> > The structure of this filesystem is dynamic and any module can create a
> > new object which will not necessarily be virtualized. There are
> > proprietary modules that aren't in the mainline whose work we can not
> > verify.
> >
> > This opens up a potential threat to the system. The developers of the
> > virtualization system can't predict all dangerous places in /proc by
> > definition.
> >
> > A more effective solution would be to mount into the container only what
> > is necessary and ignore the rest.
> >
> > Right now there is the opportunity to pass in the container any port of
> > the /proc filesystem using mount --bind expect the pids.
> >
> > This patch allows to mount only the part of /proc related to pids without
> > rest objects. Since this is an option for /proc, flags applied to /proc
> > have an effect on this subset of filesystem.
> 
> I just sent a patch that also has to deal with proc hidepid here:
> https://lkml.org/lkml/2017/3/23/505

I completely agree with you that it looks wrong when options of one
mountpoint affect the others mountpoints.

> I'm not sure if that's the right approach, it is still buggy, however
> seems that your patch also stores the mount option inside the
> pid_namespace which may get propagated to all mounts inside same pidns
> ?

I don't store 'pidonly' option in my current patch. I mean in:
https://lkml.org/lkml/2017/3/20/363

I parse options twice at first mount of procfs. It happens before
the mounting /proc in userspace.

I know it's bad, but I couldn't find place to store this option.

> I didn't have enough time but maybe if they are related we can work it
> out together ?

I don't have enough experience in kenrel hacking, but I would happily do
my best :)

I also tring to mention it in every patch, as my changes almost completely
useless without the ability to use the overlayfs.

Now if you remove the restriction:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/inode.c#n497

and mount procfs as lowerdir in overlayfs then you get NULL pointer
dereference at:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/namei.c#n891

I got it when I tried to do `ls -la /overlay/proc/self/`.

-- 
Rgrds, legion



Re: [RFC] Add option to mount only a pids subset

2017-03-23 Thread Oleg Nesterov
Again, I can't really review this, I know nothing about vfs, but since
nobody else replied...

On 03/20, Alexey Gladkov wrote:
>
> @@ -97,7 +169,23 @@ static struct dentry *proc_mount(struct file_system_type 
> *fs_type,
>   ns = task_active_pid_ns(current);
>   }
>
> - return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
> + root = mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
> +
> + if (!IS_ERR(root)) {
> + if (!proc_fill_options(data, ))
> + return ERR_PTR(-EINVAL);

So we have to call proc_fill_options() twice, not good... Yes, I understand
why, but perhaps we factor it out somehow, we can pack options + pid_ns into
sb->s_fs_info. Nevermind, this is minor.

> + if (opts.pid_only) {
> + int ret;
> +
> + if (!ns->pidfs && (ret = fill_pidfs_root(root->d_sb)))
> + return ERR_PTR(ret);
> +
> + root = ns->pidfs;

Afaics this lacks dget(ns->pidfs) which should pair with dput(mnt.mnt_root)
in cleanup_mnt(). IIUC otherwise ns->pidfs can go away after umount, OTOH,
if we return ns->pidfs then dget(sb->s_root) in mount_ns() is not balanced.
But this all is fixeable.

So with this change "mount -opidonly" creates another IS_ROOT() dentry which
is not equal to sb->s_root. I simply do not know if this is technically
correct or not... but, say, the "Only bind mounts can have disconnected paths"
comment in path_connected() makes me worry ;)

And this obviously means that /path-to-pidonly-mnt/ won't share dentries with
the normal /proc mount. Not really good imo even if not really wrong... Lets
look at proc_flush_task(). The exiting task will flush its $pid dentries in
/proc/ but not in /path-to-pidonly-mnt/ iiuc. Again, not really a bug, but
still...

Oleg.



Re: [RFC] Add option to mount only a pids subset

2017-03-23 Thread Oleg Nesterov
Again, I can't really review this, I know nothing about vfs, but since
nobody else replied...

On 03/20, Alexey Gladkov wrote:
>
> @@ -97,7 +169,23 @@ static struct dentry *proc_mount(struct file_system_type 
> *fs_type,
>   ns = task_active_pid_ns(current);
>   }
>
> - return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
> + root = mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super);
> +
> + if (!IS_ERR(root)) {
> + if (!proc_fill_options(data, ))
> + return ERR_PTR(-EINVAL);

So we have to call proc_fill_options() twice, not good... Yes, I understand
why, but perhaps we factor it out somehow, we can pack options + pid_ns into
sb->s_fs_info. Nevermind, this is minor.

> + if (opts.pid_only) {
> + int ret;
> +
> + if (!ns->pidfs && (ret = fill_pidfs_root(root->d_sb)))
> + return ERR_PTR(ret);
> +
> + root = ns->pidfs;

Afaics this lacks dget(ns->pidfs) which should pair with dput(mnt.mnt_root)
in cleanup_mnt(). IIUC otherwise ns->pidfs can go away after umount, OTOH,
if we return ns->pidfs then dget(sb->s_root) in mount_ns() is not balanced.
But this all is fixeable.

So with this change "mount -opidonly" creates another IS_ROOT() dentry which
is not equal to sb->s_root. I simply do not know if this is technically
correct or not... but, say, the "Only bind mounts can have disconnected paths"
comment in path_connected() makes me worry ;)

And this obviously means that /path-to-pidonly-mnt/ won't share dentries with
the normal /proc mount. Not really good imo even if not really wrong... Lets
look at proc_flush_task(). The exiting task will flush its $pid dentries in
/proc/ but not in /path-to-pidonly-mnt/ iiuc. Again, not really a bug, but
still...

Oleg.



Re: [RFC] Add option to mount only a pids subset

2017-03-23 Thread Djalal Harouni
Hi Alexey,

On Mon, Mar 20, 2017 at 1:58 PM, Alexey Gladkov
 wrote:
>
>
> Al Viro, this patch looks better ?
>
> == Overview ==
>
> Some of the container virtualization systems are mounted /proc inside
> the container. This is done in most cases to operate with information
> about the processes. Knowing that /proc filesystem is not fully
> virtualized they are mounted on top of dangerous places empty files or
> directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.).
>
> The structure of this filesystem is dynamic and any module can create a
> new object which will not necessarily be virtualized. There are
> proprietary modules that aren't in the mainline whose work we can not
> verify.
>
> This opens up a potential threat to the system. The developers of the
> virtualization system can't predict all dangerous places in /proc by
> definition.
>
> A more effective solution would be to mount into the container only what
> is necessary and ignore the rest.
>
> Right now there is the opportunity to pass in the container any port of
> the /proc filesystem using mount --bind expect the pids.
>
> This patch allows to mount only the part of /proc related to pids without
> rest objects. Since this is an option for /proc, flags applied to /proc
> have an effect on this subset of filesystem.

I just sent a patch that also has to deal with proc hidepid here:
https://lkml.org/lkml/2017/3/23/505

I'm not sure if that's the right approach, it is still buggy, however
seems that your patch also stores the mount option inside the
pid_namespace which may get propagated to all mounts inside same pidns
?

I didn't have enough time but maybe if they are related we can work it
out together ?

Thank you!


-- 
tixxdz


Re: [RFC] Add option to mount only a pids subset

2017-03-23 Thread Djalal Harouni
Hi Alexey,

On Mon, Mar 20, 2017 at 1:58 PM, Alexey Gladkov
 wrote:
>
>
> Al Viro, this patch looks better ?
>
> == Overview ==
>
> Some of the container virtualization systems are mounted /proc inside
> the container. This is done in most cases to operate with information
> about the processes. Knowing that /proc filesystem is not fully
> virtualized they are mounted on top of dangerous places empty files or
> directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.).
>
> The structure of this filesystem is dynamic and any module can create a
> new object which will not necessarily be virtualized. There are
> proprietary modules that aren't in the mainline whose work we can not
> verify.
>
> This opens up a potential threat to the system. The developers of the
> virtualization system can't predict all dangerous places in /proc by
> definition.
>
> A more effective solution would be to mount into the container only what
> is necessary and ignore the rest.
>
> Right now there is the opportunity to pass in the container any port of
> the /proc filesystem using mount --bind expect the pids.
>
> This patch allows to mount only the part of /proc related to pids without
> rest objects. Since this is an option for /proc, flags applied to /proc
> have an effect on this subset of filesystem.

I just sent a patch that also has to deal with proc hidepid here:
https://lkml.org/lkml/2017/3/23/505

I'm not sure if that's the right approach, it is still buggy, however
seems that your patch also stores the mount option inside the
pid_namespace which may get propagated to all mounts inside same pidns
?

I didn't have enough time but maybe if they are related we can work it
out together ?

Thank you!


-- 
tixxdz


[RFC] Add option to mount only a pids subset

2017-03-20 Thread Alexey Gladkov

Al Viro, this patch looks better ?

== Overview ==

Some of the container virtualization systems are mounted /proc inside
the container. This is done in most cases to operate with information
about the processes. Knowing that /proc filesystem is not fully
virtualized they are mounted on top of dangerous places empty files or
directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.).

The structure of this filesystem is dynamic and any module can create a
new object which will not necessarily be virtualized. There are
proprietary modules that aren't in the mainline whose work we can not
verify.

This opens up a potential threat to the system. The developers of the
virtualization system can't predict all dangerous places in /proc by
definition.

A more effective solution would be to mount into the container only what
is necessary and ignore the rest.

Right now there is the opportunity to pass in the container any port of
the /proc filesystem using mount --bind expect the pids.

This patch allows to mount only the part of /proc related to pids without
rest objects. Since this is an option for /proc, flags applied to /proc
have an effect on this subset of filesystem.

Originally the idea was that the container will be mounted only pid
sunset and additional required files will be mounted on top using the
overlayfs.

But I found out that /proc does not support overlayfs and does not allow
to mount anything on top or under it.

== TODO ==

There is still work to do:

* Add overlayfs support.

Signed-off-by: Alexey Gladkov 
---
 fs/proc/generic.c |   5 ++
 fs/proc/inode.c   |   2 +
 fs/proc/internal.h|   7 +++
 fs/proc/root.c| 113 --
 include/linux/pid_namespace.h |   1 +
 5 files changed, 123 insertions(+), 5 deletions(-)

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index ee27feb..50bb1e9 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -307,6 +308,10 @@ int proc_readdir_de(struct proc_dir_entry *de, struct file 
*file,
 int proc_readdir(struct file *file, struct dir_context *ctx)
 {
struct inode *inode = file_inode(file);
+   struct pid_namespace *ns = inode->i_sb->s_fs_info;
+
+   if (ns->pidfs && inode == d_inode(ns->pidfs))
+   return 1;
 
return proc_readdir_de(PDE(inode), file, ctx);
 }
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 2cc7a80..0c9be65 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -109,6 +109,8 @@ static int proc_show_options(struct seq_file *seq, struct 
dentry *root)
seq_printf(seq, ",gid=%u", from_kgid_munged(_user_ns, 
pid->pid_gid));
if (pid->hide_pid != HIDEPID_OFF)
seq_printf(seq, ",hidepid=%u", pid->hide_pid);
+   if (root == pid->pidfs)
+   seq_printf(seq, ",pidonly");
 
return 0;
 }
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index c5ae09b..a5a4bf1 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -260,7 +260,14 @@ static inline void proc_tty_init(void) {}
 /*
  * root.c
  */
+struct proc_options {
+   kgid_t pid_gid;
+   int hide_pid;
+   int pid_only;
+};
+
 extern struct proc_dir_entry proc_root;
+extern struct proc_dir_entry pidfs_root;
 extern int proc_parse_options(char *options, struct pid_namespace *pid);
 
 extern void proc_self_init(void);
diff --git a/fs/proc/root.c b/fs/proc/root.c
index deecb39..c2443d5 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -26,16 +26,17 @@
 #include "internal.h"
 
 enum {
-   Opt_gid, Opt_hidepid, Opt_err,
+   Opt_gid, Opt_hidepid, Opt_pidonly, Opt_err,
 };
 
 static const match_table_t tokens = {
{Opt_hidepid, "hidepid=%u"},
{Opt_gid, "gid=%u"},
+   {Opt_pidonly, "pidonly"},
{Opt_err, NULL},
 };
 
-int proc_parse_options(char *options, struct pid_namespace *pid)
+static int proc_fill_options(char *options, struct proc_options *fs_opts)
 {
char *p;
substring_t args[MAX_OPT_ARGS];
@@ -55,7 +56,7 @@ int proc_parse_options(char *options, struct pid_namespace 
*pid)
case Opt_gid:
if (match_int([0], ))
return 0;
-   pid->pid_gid = make_kgid(current_user_ns(), option);
+   fs_opts->pid_gid = make_kgid(current_user_ns(), option);
break;
case Opt_hidepid:
if (match_int([0], ))
@@ -65,7 +66,10 @@ int proc_parse_options(char *options, struct pid_namespace 
*pid)
pr_err("proc: hidepid value must be between 0 
and 2.\n");
return 0;
}
-   pid->hide_pid = option;
+   fs_opts->hide_pid = option;
+   break;
+   

[RFC] Add option to mount only a pids subset

2017-03-20 Thread Alexey Gladkov

Al Viro, this patch looks better ?

== Overview ==

Some of the container virtualization systems are mounted /proc inside
the container. This is done in most cases to operate with information
about the processes. Knowing that /proc filesystem is not fully
virtualized they are mounted on top of dangerous places empty files or
directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.).

The structure of this filesystem is dynamic and any module can create a
new object which will not necessarily be virtualized. There are
proprietary modules that aren't in the mainline whose work we can not
verify.

This opens up a potential threat to the system. The developers of the
virtualization system can't predict all dangerous places in /proc by
definition.

A more effective solution would be to mount into the container only what
is necessary and ignore the rest.

Right now there is the opportunity to pass in the container any port of
the /proc filesystem using mount --bind expect the pids.

This patch allows to mount only the part of /proc related to pids without
rest objects. Since this is an option for /proc, flags applied to /proc
have an effect on this subset of filesystem.

Originally the idea was that the container will be mounted only pid
sunset and additional required files will be mounted on top using the
overlayfs.

But I found out that /proc does not support overlayfs and does not allow
to mount anything on top or under it.

== TODO ==

There is still work to do:

* Add overlayfs support.

Signed-off-by: Alexey Gladkov 
---
 fs/proc/generic.c |   5 ++
 fs/proc/inode.c   |   2 +
 fs/proc/internal.h|   7 +++
 fs/proc/root.c| 113 --
 include/linux/pid_namespace.h |   1 +
 5 files changed, 123 insertions(+), 5 deletions(-)

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index ee27feb..50bb1e9 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -307,6 +308,10 @@ int proc_readdir_de(struct proc_dir_entry *de, struct file 
*file,
 int proc_readdir(struct file *file, struct dir_context *ctx)
 {
struct inode *inode = file_inode(file);
+   struct pid_namespace *ns = inode->i_sb->s_fs_info;
+
+   if (ns->pidfs && inode == d_inode(ns->pidfs))
+   return 1;
 
return proc_readdir_de(PDE(inode), file, ctx);
 }
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 2cc7a80..0c9be65 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -109,6 +109,8 @@ static int proc_show_options(struct seq_file *seq, struct 
dentry *root)
seq_printf(seq, ",gid=%u", from_kgid_munged(_user_ns, 
pid->pid_gid));
if (pid->hide_pid != HIDEPID_OFF)
seq_printf(seq, ",hidepid=%u", pid->hide_pid);
+   if (root == pid->pidfs)
+   seq_printf(seq, ",pidonly");
 
return 0;
 }
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index c5ae09b..a5a4bf1 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -260,7 +260,14 @@ static inline void proc_tty_init(void) {}
 /*
  * root.c
  */
+struct proc_options {
+   kgid_t pid_gid;
+   int hide_pid;
+   int pid_only;
+};
+
 extern struct proc_dir_entry proc_root;
+extern struct proc_dir_entry pidfs_root;
 extern int proc_parse_options(char *options, struct pid_namespace *pid);
 
 extern void proc_self_init(void);
diff --git a/fs/proc/root.c b/fs/proc/root.c
index deecb39..c2443d5 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -26,16 +26,17 @@
 #include "internal.h"
 
 enum {
-   Opt_gid, Opt_hidepid, Opt_err,
+   Opt_gid, Opt_hidepid, Opt_pidonly, Opt_err,
 };
 
 static const match_table_t tokens = {
{Opt_hidepid, "hidepid=%u"},
{Opt_gid, "gid=%u"},
+   {Opt_pidonly, "pidonly"},
{Opt_err, NULL},
 };
 
-int proc_parse_options(char *options, struct pid_namespace *pid)
+static int proc_fill_options(char *options, struct proc_options *fs_opts)
 {
char *p;
substring_t args[MAX_OPT_ARGS];
@@ -55,7 +56,7 @@ int proc_parse_options(char *options, struct pid_namespace 
*pid)
case Opt_gid:
if (match_int([0], ))
return 0;
-   pid->pid_gid = make_kgid(current_user_ns(), option);
+   fs_opts->pid_gid = make_kgid(current_user_ns(), option);
break;
case Opt_hidepid:
if (match_int([0], ))
@@ -65,7 +66,10 @@ int proc_parse_options(char *options, struct pid_namespace 
*pid)
pr_err("proc: hidepid value must be between 0 
and 2.\n");
return 0;
}
-   pid->hide_pid = option;
+   fs_opts->hide_pid = option;
+   break;
+   case 

Re: [RFC] Add option to mount only a pids subset

2017-03-13 Thread Andy Lutomirski
On Mon, Mar 13, 2017 at 6:27 AM, Al Viro  wrote:
> On Sun, Mar 12, 2017 at 08:19:33PM -0700, Andy Lutomirski wrote:
>> On Sat, Mar 11, 2017 at 6:13 PM, Al Viro  wrote:
>> > PS: AFAICS, simple mount --bind of your pid-only mount will suddenly
>> > expose the full thing.  And as for the lifetimes making no sense...
>> > note that you are simply not freeing these structures of yours.
>> > Try to handle that and you'll get a serious PITA all over the
>> > place.
>> >
>> > What are you trying to achieve, anyway?  Why not add a second vfsmount
>> > pointer per pid_namespace and make it initialized on demand, at the
>> > first attempt of no-pid mount?  Just have a separate no-pid instance
>> > created for those namespaces where it had been asked for, with
>> > separate superblock and dentry tree not containing anything other
>> > that pid-only parts + self + thread-self...
>>
>> Can't we just make procfs work like most other filesystems and have
>> each mount have its own superblock?  If we need to do something funky
>> to stat() output to keep existing userspace working, I think that's
>> okay.
>
> First of all, most of the filesystems do *NOT* guarantee anything of
> that sort.  And what's the point of having more instances than
> necessary, anyway?

I mean that, if I do:

mount -t proc -o foobar none a
mount -t proc -o baz none b

Then I think that the second mount should create a whole new proc
instance rather than just a new vfsmount.  Then the options could
differ, which would solve a bunch of problems.

>
> Again, what for?  It won't salvage that kludge...  It's not as if it
> had been hard to have separate pid-only instance created when asked
> for (and reused every time when we are asked for pid-only).  What's
> the point of ever having more than two instances per pidns?  IDGI...

I can easily procfs growing more than one interesting option.
Pid-only and hidepid come to mind, and that's already six possible
combinations.  The current hidepid implementation is really awful.

>
> Folks, there is no one-to-one correspondence between mountpoints and
> superblocks.  Not since 2000 or so.  Just don't try to shove your
> per-superblock stuff into vfsmount; it simply won't work.  If you
> want a separate instance for that thing, then just go ahead and
> have ->mount() decide which one to use (and whether to create a new
> one).  All there is to it...

That's what I mean.  I just don't see the point of going all-out in
trying to reuse superblocks.


Re: [RFC] Add option to mount only a pids subset

2017-03-13 Thread Andy Lutomirski
On Mon, Mar 13, 2017 at 6:27 AM, Al Viro  wrote:
> On Sun, Mar 12, 2017 at 08:19:33PM -0700, Andy Lutomirski wrote:
>> On Sat, Mar 11, 2017 at 6:13 PM, Al Viro  wrote:
>> > PS: AFAICS, simple mount --bind of your pid-only mount will suddenly
>> > expose the full thing.  And as for the lifetimes making no sense...
>> > note that you are simply not freeing these structures of yours.
>> > Try to handle that and you'll get a serious PITA all over the
>> > place.
>> >
>> > What are you trying to achieve, anyway?  Why not add a second vfsmount
>> > pointer per pid_namespace and make it initialized on demand, at the
>> > first attempt of no-pid mount?  Just have a separate no-pid instance
>> > created for those namespaces where it had been asked for, with
>> > separate superblock and dentry tree not containing anything other
>> > that pid-only parts + self + thread-self...
>>
>> Can't we just make procfs work like most other filesystems and have
>> each mount have its own superblock?  If we need to do something funky
>> to stat() output to keep existing userspace working, I think that's
>> okay.
>
> First of all, most of the filesystems do *NOT* guarantee anything of
> that sort.  And what's the point of having more instances than
> necessary, anyway?

I mean that, if I do:

mount -t proc -o foobar none a
mount -t proc -o baz none b

Then I think that the second mount should create a whole new proc
instance rather than just a new vfsmount.  Then the options could
differ, which would solve a bunch of problems.

>
> Again, what for?  It won't salvage that kludge...  It's not as if it
> had been hard to have separate pid-only instance created when asked
> for (and reused every time when we are asked for pid-only).  What's
> the point of ever having more than two instances per pidns?  IDGI...

I can easily procfs growing more than one interesting option.
Pid-only and hidepid come to mind, and that's already six possible
combinations.  The current hidepid implementation is really awful.

>
> Folks, there is no one-to-one correspondence between mountpoints and
> superblocks.  Not since 2000 or so.  Just don't try to shove your
> per-superblock stuff into vfsmount; it simply won't work.  If you
> want a separate instance for that thing, then just go ahead and
> have ->mount() decide which one to use (and whether to create a new
> one).  All there is to it...

That's what I mean.  I just don't see the point of going all-out in
trying to reuse superblocks.


Re: [RFC] Add option to mount only a pids subset

2017-03-13 Thread Al Viro
On Sun, Mar 12, 2017 at 08:19:33PM -0700, Andy Lutomirski wrote:
> On Sat, Mar 11, 2017 at 6:13 PM, Al Viro  wrote:
> > PS: AFAICS, simple mount --bind of your pid-only mount will suddenly
> > expose the full thing.  And as for the lifetimes making no sense...
> > note that you are simply not freeing these structures of yours.
> > Try to handle that and you'll get a serious PITA all over the
> > place.
> >
> > What are you trying to achieve, anyway?  Why not add a second vfsmount
> > pointer per pid_namespace and make it initialized on demand, at the
> > first attempt of no-pid mount?  Just have a separate no-pid instance
> > created for those namespaces where it had been asked for, with
> > separate superblock and dentry tree not containing anything other
> > that pid-only parts + self + thread-self...
> 
> Can't we just make procfs work like most other filesystems and have
> each mount have its own superblock?  If we need to do something funky
> to stat() output to keep existing userspace working, I think that's
> okay.

First of all, most of the filesystems do *NOT* guarantee anything of
that sort.  And what's the point of having more instances than
necessary, anyway?

> As far as I can tell, proc_mnt is very nearly useless -- it seems to
> be used for proc_flush_task (which claims to be purely an optimization
> and could be preserved in the common case where there's only one
> relevant mount) and for sysctl_binary.  For the latter, we could
> create proc_mnt but make actual user-initiated mounts be new
> superblocks anyway.

Again, what for?  It won't salvage that kludge...  It's not as if it
had been hard to have separate pid-only instance created when asked
for (and reused every time when we are asked for pid-only).  What's
the point of ever having more than two instances per pidns?  IDGI...

Folks, there is no one-to-one correspondence between mountpoints and
superblocks.  Not since 2000 or so.  Just don't try to shove your
per-superblock stuff into vfsmount; it simply won't work.  If you
want a separate instance for that thing, then just go ahead and
have ->mount() decide which one to use (and whether to create a new
one).  All there is to it...


Re: [RFC] Add option to mount only a pids subset

2017-03-13 Thread Al Viro
On Sun, Mar 12, 2017 at 08:19:33PM -0700, Andy Lutomirski wrote:
> On Sat, Mar 11, 2017 at 6:13 PM, Al Viro  wrote:
> > PS: AFAICS, simple mount --bind of your pid-only mount will suddenly
> > expose the full thing.  And as for the lifetimes making no sense...
> > note that you are simply not freeing these structures of yours.
> > Try to handle that and you'll get a serious PITA all over the
> > place.
> >
> > What are you trying to achieve, anyway?  Why not add a second vfsmount
> > pointer per pid_namespace and make it initialized on demand, at the
> > first attempt of no-pid mount?  Just have a separate no-pid instance
> > created for those namespaces where it had been asked for, with
> > separate superblock and dentry tree not containing anything other
> > that pid-only parts + self + thread-self...
> 
> Can't we just make procfs work like most other filesystems and have
> each mount have its own superblock?  If we need to do something funky
> to stat() output to keep existing userspace working, I think that's
> okay.

First of all, most of the filesystems do *NOT* guarantee anything of
that sort.  And what's the point of having more instances than
necessary, anyway?

> As far as I can tell, proc_mnt is very nearly useless -- it seems to
> be used for proc_flush_task (which claims to be purely an optimization
> and could be preserved in the common case where there's only one
> relevant mount) and for sysctl_binary.  For the latter, we could
> create proc_mnt but make actual user-initiated mounts be new
> superblocks anyway.

Again, what for?  It won't salvage that kludge...  It's not as if it
had been hard to have separate pid-only instance created when asked
for (and reused every time when we are asked for pid-only).  What's
the point of ever having more than two instances per pidns?  IDGI...

Folks, there is no one-to-one correspondence between mountpoints and
superblocks.  Not since 2000 or so.  Just don't try to shove your
per-superblock stuff into vfsmount; it simply won't work.  If you
want a separate instance for that thing, then just go ahead and
have ->mount() decide which one to use (and whether to create a new
one).  All there is to it...


Re: [RFC] Add option to mount only a pids subset

2017-03-12 Thread Andy Lutomirski
On Sat, Mar 11, 2017 at 6:13 PM, Al Viro  wrote:
> PS: AFAICS, simple mount --bind of your pid-only mount will suddenly
> expose the full thing.  And as for the lifetimes making no sense...
> note that you are simply not freeing these structures of yours.
> Try to handle that and you'll get a serious PITA all over the
> place.
>
> What are you trying to achieve, anyway?  Why not add a second vfsmount
> pointer per pid_namespace and make it initialized on demand, at the
> first attempt of no-pid mount?  Just have a separate no-pid instance
> created for those namespaces where it had been asked for, with
> separate superblock and dentry tree not containing anything other
> that pid-only parts + self + thread-self...

Can't we just make procfs work like most other filesystems and have
each mount have its own superblock?  If we need to do something funky
to stat() output to keep existing userspace working, I think that's
okay.

As far as I can tell, proc_mnt is very nearly useless -- it seems to
be used for proc_flush_task (which claims to be purely an optimization
and could be preserved in the common case where there's only one
relevant mount) and for sysctl_binary.  For the latter, we could
create proc_mnt but make actual user-initiated mounts be new
superblocks anyway.


Re: [RFC] Add option to mount only a pids subset

2017-03-12 Thread Andy Lutomirski
On Sat, Mar 11, 2017 at 6:13 PM, Al Viro  wrote:
> PS: AFAICS, simple mount --bind of your pid-only mount will suddenly
> expose the full thing.  And as for the lifetimes making no sense...
> note that you are simply not freeing these structures of yours.
> Try to handle that and you'll get a serious PITA all over the
> place.
>
> What are you trying to achieve, anyway?  Why not add a second vfsmount
> pointer per pid_namespace and make it initialized on demand, at the
> first attempt of no-pid mount?  Just have a separate no-pid instance
> created for those namespaces where it had been asked for, with
> separate superblock and dentry tree not containing anything other
> that pid-only parts + self + thread-self...

Can't we just make procfs work like most other filesystems and have
each mount have its own superblock?  If we need to do something funky
to stat() output to keep existing userspace working, I think that's
okay.

As far as I can tell, proc_mnt is very nearly useless -- it seems to
be used for proc_flush_task (which claims to be purely an optimization
and could be preserved in the common case where there's only one
relevant mount) and for sysctl_binary.  For the latter, we could
create proc_mnt but make actual user-initiated mounts be new
superblocks anyway.


Re: [RFC] Add option to mount only a pids subset

2017-03-11 Thread Al Viro
On Sun, Mar 12, 2017 at 01:54:38AM +, Al Viro wrote:
> On Tue, Mar 07, 2017 at 12:05:16AM +0100, Alexey Gladkov wrote:
> 
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 83de8b6..5bd1b84 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1759,6 +1759,8 @@ struct super_operations {
> > int (*thaw_super) (struct super_block *);
> > int (*unfreeze_fs) (struct super_block *);
> > int (*statfs) (struct dentry *, struct kstatfs *);
> > +   int (*getattr_fs) (struct vfsmount *, struct dentry *, struct kstat *);
> > +   int (*mount_fs) (struct vfsmount *, int *, char *);
> > int (*remount_fs) (struct super_block *, int *, char *);
> > void (*umount_begin) (struct super_block *);
> >  
> > diff --git a/include/linux/mount.h b/include/linux/mount.h
> > index 1172cce..4bd364e 100644
> > --- a/include/linux/mount.h
> > +++ b/include/linux/mount.h
> > @@ -67,6 +67,7 @@ struct vfsmount {
> > struct dentry *mnt_root;/* root of the mounted tree */
> > struct super_block *mnt_sb; /* pointer to superblock */
> > int mnt_flags;
> > +   void *fs_data;  /* fs-specific data */
> 
>   No.  This is really asking for trouble - you *can't* hang
> fs-specific data structures off vfsmount.  Lifetimes make no
> sense whatsoever.
>
>   BTW, your patch leaks supreblock reference on failure, and
> is kludgy as hell wrt to what it's doing in procfs itself, but
> that's secondary - the quoted part is enough for a hard NAK on
> architectural grounds.  Don't go there.

PS: AFAICS, simple mount --bind of your pid-only mount will suddenly
expose the full thing.  And as for the lifetimes making no sense...
note that you are simply not freeing these structures of yours.
Try to handle that and you'll get a serious PITA all over the
place.

What are you trying to achieve, anyway?  Why not add a second vfsmount
pointer per pid_namespace and make it initialized on demand, at the
first attempt of no-pid mount?  Just have a separate no-pid instance
created for those namespaces where it had been asked for, with
separate superblock and dentry tree not containing anything other
that pid-only parts + self + thread-self...


Re: [RFC] Add option to mount only a pids subset

2017-03-11 Thread Al Viro
On Sun, Mar 12, 2017 at 01:54:38AM +, Al Viro wrote:
> On Tue, Mar 07, 2017 at 12:05:16AM +0100, Alexey Gladkov wrote:
> 
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 83de8b6..5bd1b84 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1759,6 +1759,8 @@ struct super_operations {
> > int (*thaw_super) (struct super_block *);
> > int (*unfreeze_fs) (struct super_block *);
> > int (*statfs) (struct dentry *, struct kstatfs *);
> > +   int (*getattr_fs) (struct vfsmount *, struct dentry *, struct kstat *);
> > +   int (*mount_fs) (struct vfsmount *, int *, char *);
> > int (*remount_fs) (struct super_block *, int *, char *);
> > void (*umount_begin) (struct super_block *);
> >  
> > diff --git a/include/linux/mount.h b/include/linux/mount.h
> > index 1172cce..4bd364e 100644
> > --- a/include/linux/mount.h
> > +++ b/include/linux/mount.h
> > @@ -67,6 +67,7 @@ struct vfsmount {
> > struct dentry *mnt_root;/* root of the mounted tree */
> > struct super_block *mnt_sb; /* pointer to superblock */
> > int mnt_flags;
> > +   void *fs_data;  /* fs-specific data */
> 
>   No.  This is really asking for trouble - you *can't* hang
> fs-specific data structures off vfsmount.  Lifetimes make no
> sense whatsoever.
>
>   BTW, your patch leaks supreblock reference on failure, and
> is kludgy as hell wrt to what it's doing in procfs itself, but
> that's secondary - the quoted part is enough for a hard NAK on
> architectural grounds.  Don't go there.

PS: AFAICS, simple mount --bind of your pid-only mount will suddenly
expose the full thing.  And as for the lifetimes making no sense...
note that you are simply not freeing these structures of yours.
Try to handle that and you'll get a serious PITA all over the
place.

What are you trying to achieve, anyway?  Why not add a second vfsmount
pointer per pid_namespace and make it initialized on demand, at the
first attempt of no-pid mount?  Just have a separate no-pid instance
created for those namespaces where it had been asked for, with
separate superblock and dentry tree not containing anything other
that pid-only parts + self + thread-self...


Re: [RFC] Add option to mount only a pids subset

2017-03-11 Thread Al Viro
On Tue, Mar 07, 2017 at 12:05:16AM +0100, Alexey Gladkov wrote:

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 83de8b6..5bd1b84 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1759,6 +1759,8 @@ struct super_operations {
>   int (*thaw_super) (struct super_block *);
>   int (*unfreeze_fs) (struct super_block *);
>   int (*statfs) (struct dentry *, struct kstatfs *);
> + int (*getattr_fs) (struct vfsmount *, struct dentry *, struct kstat *);
> + int (*mount_fs) (struct vfsmount *, int *, char *);
>   int (*remount_fs) (struct super_block *, int *, char *);
>   void (*umount_begin) (struct super_block *);
>  
> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index 1172cce..4bd364e 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -67,6 +67,7 @@ struct vfsmount {
>   struct dentry *mnt_root;/* root of the mounted tree */
>   struct super_block *mnt_sb; /* pointer to superblock */
>   int mnt_flags;
> + void *fs_data;  /* fs-specific data */

No.  This is really asking for trouble - you *can't* hang
fs-specific data structures off vfsmount.  Lifetimes make no
sense whatsoever.

BTW, your patch leaks supreblock reference on failure, and
is kludgy as hell wrt to what it's doing in procfs itself, but
that's secondary - the quoted part is enough for a hard NAK on
architectural grounds.  Don't go there.


Re: [RFC] Add option to mount only a pids subset

2017-03-11 Thread Al Viro
On Tue, Mar 07, 2017 at 12:05:16AM +0100, Alexey Gladkov wrote:

> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 83de8b6..5bd1b84 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1759,6 +1759,8 @@ struct super_operations {
>   int (*thaw_super) (struct super_block *);
>   int (*unfreeze_fs) (struct super_block *);
>   int (*statfs) (struct dentry *, struct kstatfs *);
> + int (*getattr_fs) (struct vfsmount *, struct dentry *, struct kstat *);
> + int (*mount_fs) (struct vfsmount *, int *, char *);
>   int (*remount_fs) (struct super_block *, int *, char *);
>   void (*umount_begin) (struct super_block *);
>  
> diff --git a/include/linux/mount.h b/include/linux/mount.h
> index 1172cce..4bd364e 100644
> --- a/include/linux/mount.h
> +++ b/include/linux/mount.h
> @@ -67,6 +67,7 @@ struct vfsmount {
>   struct dentry *mnt_root;/* root of the mounted tree */
>   struct super_block *mnt_sb; /* pointer to superblock */
>   int mnt_flags;
> + void *fs_data;  /* fs-specific data */

No.  This is really asking for trouble - you *can't* hang
fs-specific data structures off vfsmount.  Lifetimes make no
sense whatsoever.

BTW, your patch leaks supreblock reference on failure, and
is kludgy as hell wrt to what it's doing in procfs itself, but
that's secondary - the quoted part is enough for a hard NAK on
architectural grounds.  Don't go there.


Re: [RFC] Add option to mount only a pids subset

2017-03-11 Thread Alexey Gladkov
On Thu, Mar 09, 2017 at 12:26:49PM +0100, Djalal Harouni wrote:
> I'm bit lost in the two discussion, however the main concern I was
> discussing with Andy was if you have per superblock proc mounts then
> each mount will end up with its own device ID st_dev, right now they
> share the same ID if they are in the same pid namespace, but if we
> change that then we may break the following:
> http://man7.org/linux/man-pages/man7/namespaces.7.html

In fact, nothing has changed. I added a parameter that affects
the mountpoint, not the entire pid namespace.

The procfs will still be global. The device ID will be the same as before.

> Both new NS_GET_PARENT and NS_GET_USERNS ioctl() that return an fd,
> suggests to follow up with fstat() to identify the namespaces..
> "By applying fstat(2) to the returned file descriptor, one obtains a
> stat structure whose st_dev (resident device) and st_ino (inode
> number) fields together identify the owning/parent namespace."
> 
> Other /proc/self/ns/* comparison and stat() logic...
> 
> Andy suggested that we may have the same st_dev for mounts in the same
> pid namespace... I'm not sure which side effect this may bring!

Basically we have here a issue because other proc options (hidepid for
example) affect the entire pid namespace, but, I guess, have to affect
the mountpoint.

# grep ^proc /proc/mounts 
proc /proc proc rw,relatime 0 0

# mount -t proc proc /tmp/proc
# mount -o remount,hidepid=2 -t proc proc /tmp/proc

# grep ^proc /proc/mounts 
proc /proc proc rw,relatime,hidepid=2 0 0
proc /tmp/proc proc rw,relatime,hidepid=2 0 0

-- 
Rgrds, legion



Re: [RFC] Add option to mount only a pids subset

2017-03-11 Thread Alexey Gladkov
On Thu, Mar 09, 2017 at 12:26:49PM +0100, Djalal Harouni wrote:
> I'm bit lost in the two discussion, however the main concern I was
> discussing with Andy was if you have per superblock proc mounts then
> each mount will end up with its own device ID st_dev, right now they
> share the same ID if they are in the same pid namespace, but if we
> change that then we may break the following:
> http://man7.org/linux/man-pages/man7/namespaces.7.html

In fact, nothing has changed. I added a parameter that affects
the mountpoint, not the entire pid namespace.

The procfs will still be global. The device ID will be the same as before.

> Both new NS_GET_PARENT and NS_GET_USERNS ioctl() that return an fd,
> suggests to follow up with fstat() to identify the namespaces..
> "By applying fstat(2) to the returned file descriptor, one obtains a
> stat structure whose st_dev (resident device) and st_ino (inode
> number) fields together identify the owning/parent namespace."
> 
> Other /proc/self/ns/* comparison and stat() logic...
> 
> Andy suggested that we may have the same st_dev for mounts in the same
> pid namespace... I'm not sure which side effect this may bring!

Basically we have here a issue because other proc options (hidepid for
example) affect the entire pid namespace, but, I guess, have to affect
the mountpoint.

# grep ^proc /proc/mounts 
proc /proc proc rw,relatime 0 0

# mount -t proc proc /tmp/proc
# mount -o remount,hidepid=2 -t proc proc /tmp/proc

# grep ^proc /proc/mounts 
proc /proc proc rw,relatime,hidepid=2 0 0
proc /tmp/proc proc rw,relatime,hidepid=2 0 0

-- 
Rgrds, legion



Re: [RFC] Add option to mount only a pids subset

2017-03-10 Thread Alexey Gladkov
On Tue, Mar 07, 2017 at 08:24:12AM -0800, Andy Lutomirski wrote:
> On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov  
> wrote:
> >
> > After discussion with Oleg Nesterov I reimplement my patch as an additional
> > option for /proc. This option affects the mountpoint. It means that in one
> > pid namespace it possible to have both the whole traditional /proc and
> > /proc with only pids subset.
> >
> 
> I like this.  I think you should split it into two patches, though:
> one that reworks how procfs gets mounted and one that makes adds the
> new functionality.

Sure, but first I wanted to discuss the idea. My patch isn't very useful
without the ability to add additional files.

I will split it into parts in the new version.

-- 
Rgrds, legion



Re: [RFC] Add option to mount only a pids subset

2017-03-10 Thread Alexey Gladkov
On Tue, Mar 07, 2017 at 08:24:12AM -0800, Andy Lutomirski wrote:
> On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov  
> wrote:
> >
> > After discussion with Oleg Nesterov I reimplement my patch as an additional
> > option for /proc. This option affects the mountpoint. It means that in one
> > pid namespace it possible to have both the whole traditional /proc and
> > /proc with only pids subset.
> >
> 
> I like this.  I think you should split it into two patches, though:
> one that reworks how procfs gets mounted and one that makes adds the
> new functionality.

Sure, but first I wanted to discuss the idea. My patch isn't very useful
without the ability to add additional files.

I will split it into parts in the new version.

-- 
Rgrds, legion



Re: [RFC] Add option to mount only a pids subset

2017-03-10 Thread Alexey Gladkov
On Tue, Mar 07, 2017 at 06:49:09PM +0100, Oleg Nesterov wrote:
> I can't really review this... but in any case I think you should split
> this patch to separate the vfs and proc changes.
> 
> On 03/07, Alexey Gladkov wrote:
> >
> > @@ -962,6 +963,14 @@ vfs_kern_mount(struct file_system_type *type, int 
> > flags, const char *name, void
> > mnt->mnt.mnt_sb = root->d_sb;
> > mnt->mnt_mountpoint = mnt->mnt.mnt_root;
> > mnt->mnt_parent = mnt;
> > +
> > +   err = do_mount_sb(>mnt, flags, data);
> > +   if(err) {
> > +   mnt_free_id(mnt);
> > +   free_vfsmnt(mnt);
> > +   return ERR_PTR(err);
> > +   }
> 
> This duplicates the error handling, we do the same if mount_fs() fails.
> Perhaps you should move these 2 lines into cleanup block and add goto's.
> 
> > +int proc_getattrfs(struct vfsmount *mnt, struct dentry *dentry, struct 
> > kstat *stat)
> > +{
> > +   struct inode *inode = d_inode(dentry);
> > +   struct pid *pid = proc_pid(dentry->d_inode);
> > +   struct proc_options *opts = mnt->fs_data;
> > +
> > +   if (opts && opts->pid_only && mnt->mnt_root != dentry && !pid)
> > +   return -ENOENT;
> 
> Hmm. I don't quite understand why do we need this, and how this should work.
> 
> Yes, "/bin/ls /pidonly-proc/sys" or opendir(/pidonly-proc/sys) should fail,
> but only because they both do stat() ?
> 
> Afaics you still can do open("/pidonly-proc/sys") + getdents() and this should
> work ?

Yes, you're right! I thought that getattr is called always together with
open(). I wanted to prevent all attempts open() for not-pid directories.

> I still think proc_dir_operations.open() makes more sense. Yes, as you pointed
> out we also need to update proc_sys_dir_file_operations too and may be 
> something
> else...

My main task was to hide all possible direcitrices from the /proc
(in pidonly mode)... even those which we do not know. In this case we
can't rely on the fact that everyone will follow the rules and to
properly handle open().

My current attempt was to force filesystem level check of mountpoint flag.
This is necessary to avoid even the theoretical possibility of ignoring
"pidonly" parameter.

I guess I need to add callback to vfs_open or something to can be sure
that we will not open the wrong file or directory in pidonly mode.

-- 
Rgrds, legion



Re: [RFC] Add option to mount only a pids subset

2017-03-10 Thread Alexey Gladkov
On Tue, Mar 07, 2017 at 06:49:09PM +0100, Oleg Nesterov wrote:
> I can't really review this... but in any case I think you should split
> this patch to separate the vfs and proc changes.
> 
> On 03/07, Alexey Gladkov wrote:
> >
> > @@ -962,6 +963,14 @@ vfs_kern_mount(struct file_system_type *type, int 
> > flags, const char *name, void
> > mnt->mnt.mnt_sb = root->d_sb;
> > mnt->mnt_mountpoint = mnt->mnt.mnt_root;
> > mnt->mnt_parent = mnt;
> > +
> > +   err = do_mount_sb(>mnt, flags, data);
> > +   if(err) {
> > +   mnt_free_id(mnt);
> > +   free_vfsmnt(mnt);
> > +   return ERR_PTR(err);
> > +   }
> 
> This duplicates the error handling, we do the same if mount_fs() fails.
> Perhaps you should move these 2 lines into cleanup block and add goto's.
> 
> > +int proc_getattrfs(struct vfsmount *mnt, struct dentry *dentry, struct 
> > kstat *stat)
> > +{
> > +   struct inode *inode = d_inode(dentry);
> > +   struct pid *pid = proc_pid(dentry->d_inode);
> > +   struct proc_options *opts = mnt->fs_data;
> > +
> > +   if (opts && opts->pid_only && mnt->mnt_root != dentry && !pid)
> > +   return -ENOENT;
> 
> Hmm. I don't quite understand why do we need this, and how this should work.
> 
> Yes, "/bin/ls /pidonly-proc/sys" or opendir(/pidonly-proc/sys) should fail,
> but only because they both do stat() ?
> 
> Afaics you still can do open("/pidonly-proc/sys") + getdents() and this should
> work ?

Yes, you're right! I thought that getattr is called always together with
open(). I wanted to prevent all attempts open() for not-pid directories.

> I still think proc_dir_operations.open() makes more sense. Yes, as you pointed
> out we also need to update proc_sys_dir_file_operations too and may be 
> something
> else...

My main task was to hide all possible direcitrices from the /proc
(in pidonly mode)... even those which we do not know. In this case we
can't rely on the fact that everyone will follow the rules and to
properly handle open().

My current attempt was to force filesystem level check of mountpoint flag.
This is necessary to avoid even the theoretical possibility of ignoring
"pidonly" parameter.

I guess I need to add callback to vfs_open or something to can be sure
that we will not open the wrong file or directory in pidonly mode.

-- 
Rgrds, legion



Re: [RFC] Add option to mount only a pids subset

2017-03-09 Thread Eric W. Biederman
Djalal Harouni  writes:

> On Tue, Mar 7, 2017 at 5:24 PM, Andy Lutomirski  wrote:
>>
>> On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov  
>> wrote:
>> >
>> > After discussion with Oleg Nesterov I reimplement my patch as an additional
>> > option for /proc. This option affects the mountpoint. It means that in one
>> > pid namespace it possible to have both the whole traditional /proc and
>> > /proc with only pids subset.
>> >
>>
>> I like this.  I think you should split it into two patches, though:
>> one that reworks how procfs gets mounted and one that makes adds the
>> new functionality.
>>
>> Djajal had some concerns about the first part breaking applications
>> that use stat and expect certain behavior.  This should be manageable,
>> though, but making stat work appropriately.
>
> I'm bit lost in the two discussion, however the main concern I was
> discussing with Andy was if you have per superblock proc mounts then
> each mount will end up with its own device ID st_dev, right now they
> share the same ID if they are in the same pid namespace, but if we
> change that then we may break the following:
> http://man7.org/linux/man-pages/man7/namespaces.7.html
>
> Both new NS_GET_PARENT and NS_GET_USERNS ioctl() that return an fd,
> suggests to follow up with fstat() to identify the namespaces..
> "By applying fstat(2) to the returned file descriptor, one obtains a
> stat structure whose st_dev (resident device) and st_ino (inode
> number) fields together identify the owning/parent namespace."
>
> Other /proc/self/ns/* comparison and stat() logic...
>
> Andy suggested that we may have the same st_dev for mounts in the same
> pid namespace... I'm not sure which side effect this may bring!

Well right now it is less of an issue than you image.  I suggest you
stat /proc and /proc/self/ns/* and look at st_dev.

That said anything that changes today's proc needs to be tested with a
range of distro's especially those using selinux and apparmor as they
tend to have policies that are very sensitive to the implementation
details.  Such that seemingly innocent changes can cause systems to stop
booting.

Eric


Re: [RFC] Add option to mount only a pids subset

2017-03-09 Thread Eric W. Biederman
Djalal Harouni  writes:

> On Tue, Mar 7, 2017 at 5:24 PM, Andy Lutomirski  wrote:
>>
>> On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov  
>> wrote:
>> >
>> > After discussion with Oleg Nesterov I reimplement my patch as an additional
>> > option for /proc. This option affects the mountpoint. It means that in one
>> > pid namespace it possible to have both the whole traditional /proc and
>> > /proc with only pids subset.
>> >
>>
>> I like this.  I think you should split it into two patches, though:
>> one that reworks how procfs gets mounted and one that makes adds the
>> new functionality.
>>
>> Djajal had some concerns about the first part breaking applications
>> that use stat and expect certain behavior.  This should be manageable,
>> though, but making stat work appropriately.
>
> I'm bit lost in the two discussion, however the main concern I was
> discussing with Andy was if you have per superblock proc mounts then
> each mount will end up with its own device ID st_dev, right now they
> share the same ID if they are in the same pid namespace, but if we
> change that then we may break the following:
> http://man7.org/linux/man-pages/man7/namespaces.7.html
>
> Both new NS_GET_PARENT and NS_GET_USERNS ioctl() that return an fd,
> suggests to follow up with fstat() to identify the namespaces..
> "By applying fstat(2) to the returned file descriptor, one obtains a
> stat structure whose st_dev (resident device) and st_ino (inode
> number) fields together identify the owning/parent namespace."
>
> Other /proc/self/ns/* comparison and stat() logic...
>
> Andy suggested that we may have the same st_dev for mounts in the same
> pid namespace... I'm not sure which side effect this may bring!

Well right now it is less of an issue than you image.  I suggest you
stat /proc and /proc/self/ns/* and look at st_dev.

That said anything that changes today's proc needs to be tested with a
range of distro's especially those using selinux and apparmor as they
tend to have policies that are very sensitive to the implementation
details.  Such that seemingly innocent changes can cause systems to stop
booting.

Eric


Re: [RFC] Add option to mount only a pids subset

2017-03-09 Thread Djalal Harouni
On Tue, Mar 7, 2017 at 5:24 PM, Andy Lutomirski  wrote:
>
> On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov  
> wrote:
> >
> > After discussion with Oleg Nesterov I reimplement my patch as an additional
> > option for /proc. This option affects the mountpoint. It means that in one
> > pid namespace it possible to have both the whole traditional /proc and
> > /proc with only pids subset.
> >
>
> I like this.  I think you should split it into two patches, though:
> one that reworks how procfs gets mounted and one that makes adds the
> new functionality.
>
> Djajal had some concerns about the first part breaking applications
> that use stat and expect certain behavior.  This should be manageable,
> though, but making stat work appropriately.

I'm bit lost in the two discussion, however the main concern I was
discussing with Andy was if you have per superblock proc mounts then
each mount will end up with its own device ID st_dev, right now they
share the same ID if they are in the same pid namespace, but if we
change that then we may break the following:
http://man7.org/linux/man-pages/man7/namespaces.7.html

Both new NS_GET_PARENT and NS_GET_USERNS ioctl() that return an fd,
suggests to follow up with fstat() to identify the namespaces..
"By applying fstat(2) to the returned file descriptor, one obtains a
stat structure whose st_dev (resident device) and st_ino (inode
number) fields together identify the owning/parent namespace."

Other /proc/self/ns/* comparison and stat() logic...

Andy suggested that we may have the same st_dev for mounts in the same
pid namespace... I'm not sure which side effect this may bring!

Thanks!

-- 
tixxdz


Re: [RFC] Add option to mount only a pids subset

2017-03-09 Thread Djalal Harouni
On Tue, Mar 7, 2017 at 5:24 PM, Andy Lutomirski  wrote:
>
> On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov  
> wrote:
> >
> > After discussion with Oleg Nesterov I reimplement my patch as an additional
> > option for /proc. This option affects the mountpoint. It means that in one
> > pid namespace it possible to have both the whole traditional /proc and
> > /proc with only pids subset.
> >
>
> I like this.  I think you should split it into two patches, though:
> one that reworks how procfs gets mounted and one that makes adds the
> new functionality.
>
> Djajal had some concerns about the first part breaking applications
> that use stat and expect certain behavior.  This should be manageable,
> though, but making stat work appropriately.

I'm bit lost in the two discussion, however the main concern I was
discussing with Andy was if you have per superblock proc mounts then
each mount will end up with its own device ID st_dev, right now they
share the same ID if they are in the same pid namespace, but if we
change that then we may break the following:
http://man7.org/linux/man-pages/man7/namespaces.7.html

Both new NS_GET_PARENT and NS_GET_USERNS ioctl() that return an fd,
suggests to follow up with fstat() to identify the namespaces..
"By applying fstat(2) to the returned file descriptor, one obtains a
stat structure whose st_dev (resident device) and st_ino (inode
number) fields together identify the owning/parent namespace."

Other /proc/self/ns/* comparison and stat() logic...

Andy suggested that we may have the same st_dev for mounts in the same
pid namespace... I'm not sure which side effect this may bring!

Thanks!

-- 
tixxdz


Re: [RFC] Add option to mount only a pids subset

2017-03-07 Thread Oleg Nesterov
I can't really review this... but in any case I think you should split
this patch to separate the vfs and proc changes.

On 03/07, Alexey Gladkov wrote:
>
> @@ -962,6 +963,14 @@ vfs_kern_mount(struct file_system_type *type, int flags, 
> const char *name, void
>   mnt->mnt.mnt_sb = root->d_sb;
>   mnt->mnt_mountpoint = mnt->mnt.mnt_root;
>   mnt->mnt_parent = mnt;
> +
> + err = do_mount_sb(>mnt, flags, data);
> + if(err) {
> + mnt_free_id(mnt);
> + free_vfsmnt(mnt);
> + return ERR_PTR(err);
> + }

This duplicates the error handling, we do the same if mount_fs() fails.
Perhaps you should move these 2 lines into cleanup block and add goto's.

> +int proc_getattrfs(struct vfsmount *mnt, struct dentry *dentry, struct kstat 
> *stat)
> +{
> + struct inode *inode = d_inode(dentry);
> + struct pid *pid = proc_pid(dentry->d_inode);
> + struct proc_options *opts = mnt->fs_data;
> +
> + if (opts && opts->pid_only && mnt->mnt_root != dentry && !pid)
> + return -ENOENT;

Hmm. I don't quite understand why do we need this, and how this should work.

Yes, "/bin/ls /pidonly-proc/sys" or opendir(/pidonly-proc/sys) should fail,
but only because they both do stat() ?

Afaics you still can do open("/pidonly-proc/sys") + getdents() and this should
work ?

I still think proc_dir_operations.open() makes more sense. Yes, as you pointed
out we also need to update proc_sys_dir_file_operations too and may be something
else...

> +
> + if (!inode->i_op->getattr) {
> + generic_fillattr(inode, stat);
> + return 0;
> + }
> +
> + return inode->i_op->getattr(mnt, dentry, stat);
> +}

Oh, it would be nice to not duplicate the code from the caller, imo.

Oleg.




Re: [RFC] Add option to mount only a pids subset

2017-03-07 Thread Oleg Nesterov
I can't really review this... but in any case I think you should split
this patch to separate the vfs and proc changes.

On 03/07, Alexey Gladkov wrote:
>
> @@ -962,6 +963,14 @@ vfs_kern_mount(struct file_system_type *type, int flags, 
> const char *name, void
>   mnt->mnt.mnt_sb = root->d_sb;
>   mnt->mnt_mountpoint = mnt->mnt.mnt_root;
>   mnt->mnt_parent = mnt;
> +
> + err = do_mount_sb(>mnt, flags, data);
> + if(err) {
> + mnt_free_id(mnt);
> + free_vfsmnt(mnt);
> + return ERR_PTR(err);
> + }

This duplicates the error handling, we do the same if mount_fs() fails.
Perhaps you should move these 2 lines into cleanup block and add goto's.

> +int proc_getattrfs(struct vfsmount *mnt, struct dentry *dentry, struct kstat 
> *stat)
> +{
> + struct inode *inode = d_inode(dentry);
> + struct pid *pid = proc_pid(dentry->d_inode);
> + struct proc_options *opts = mnt->fs_data;
> +
> + if (opts && opts->pid_only && mnt->mnt_root != dentry && !pid)
> + return -ENOENT;

Hmm. I don't quite understand why do we need this, and how this should work.

Yes, "/bin/ls /pidonly-proc/sys" or opendir(/pidonly-proc/sys) should fail,
but only because they both do stat() ?

Afaics you still can do open("/pidonly-proc/sys") + getdents() and this should
work ?

I still think proc_dir_operations.open() makes more sense. Yes, as you pointed
out we also need to update proc_sys_dir_file_operations too and may be something
else...

> +
> + if (!inode->i_op->getattr) {
> + generic_fillattr(inode, stat);
> + return 0;
> + }
> +
> + return inode->i_op->getattr(mnt, dentry, stat);
> +}

Oh, it would be nice to not duplicate the code from the caller, imo.

Oleg.




Re: [RFC] Add option to mount only a pids subset

2017-03-07 Thread Andy Lutomirski
On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov  wrote:
>
> After discussion with Oleg Nesterov I reimplement my patch as an additional
> option for /proc. This option affects the mountpoint. It means that in one
> pid namespace it possible to have both the whole traditional /proc and
> /proc with only pids subset.
>

I like this.  I think you should split it into two patches, though:
one that reworks how procfs gets mounted and one that makes adds the
new functionality.

Djajal had some concerns about the first part breaking applications
that use stat and expect certain behavior.  This should be manageable,
though, but making stat work appropriately.


Re: [RFC] Add option to mount only a pids subset

2017-03-07 Thread Andy Lutomirski
On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov  wrote:
>
> After discussion with Oleg Nesterov I reimplement my patch as an additional
> option for /proc. This option affects the mountpoint. It means that in one
> pid namespace it possible to have both the whole traditional /proc and
> /proc with only pids subset.
>

I like this.  I think you should split it into two patches, though:
one that reworks how procfs gets mounted and one that makes adds the
new functionality.

Djajal had some concerns about the first part breaking applications
that use stat and expect certain behavior.  This should be manageable,
though, but making stat work appropriately.


[RFC] Add option to mount only a pids subset

2017-03-06 Thread Alexey Gladkov

After discussion with Oleg Nesterov I reimplement my patch as an additional
option for /proc. This option affects the mountpoint. It means that in one
pid namespace it possible to have both the whole traditional /proc and
/proc with only pids subset.

However, it remains an open question about overlayfs support in the /proc.

== Overview ==

Some of the container virtualization systems are mounted /proc inside
the container. This is done in most cases to operate with information
about the processes. Knowing that /proc filesystem is not fully
virtualized they are mounted on top of dangerous places empty files or
directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.).

The structure of this filesystem is dynamic and any module can create a
new object which will not necessarily be virtualized. There are
proprietary modules that aren't in the mainline whose work we can not
verify.

This opens up a potential threat to the system. The developers of the
virtualization system can't predict all dangerous places in /proc by
definition.

A more effective solution would be to mount into the container only what
is necessary and ignore the rest.

Right now there is the opportunity to pass in the container any port of
the /proc filesystem using mount --bind expect the pids.

This patch allows to mount only the part of /proc related to pids
without rest objects. Since this is an option for /proc, flags applied to
/proc have an effect on this subset of filesystem.

Originally the idea was that the container will be mounted only pid sunset
and additional required files will be mounted on top using the overlayfs.
But I found out that /proc does not support overlayfs and does not allow
to mount anything on top or under it.

== TODO ==

There is still work to do:

 * Show pidonly via proc_show_options.
 * Add overlayfs support.

---
 fs/internal.h |  1 +
 fs/namespace.c|  9 
 fs/proc/generic.c |  4 
 fs/proc/inode.c   |  7 +-
 fs/proc/internal.h|  8 +++
 fs/proc/root.c| 62 +++
 fs/stat.c |  4 
 fs/super.c| 20 +
 include/linux/fs.h|  2 ++
 include/linux/mount.h |  1 +
 10 files changed, 113 insertions(+), 5 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index 4fcf517..cb44ca7 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -88,6 +88,7 @@ extern struct file *get_empty_filp(void);
  * super.c
  */
 extern int do_remount_sb(struct super_block *, int, void *, int);
+extern int do_mount_sb(struct vfsmount *mnt, int flags, void *data);
 extern bool trylock_super(struct super_block *sb);
 extern struct dentry *mount_fs(struct file_system_type *,
   int, const char *, void *);
diff --git a/fs/namespace.c b/fs/namespace.c
index e6c234b..6cb2bcb 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -938,6 +938,7 @@ static struct mount *skip_mnt_tree(struct mount *p)
 struct vfsmount *
 vfs_kern_mount(struct file_system_type *type, int flags, const char *name, 
void *data)
 {
+   int err;
struct mount *mnt;
struct dentry *root;
 
@@ -962,6 +963,14 @@ vfs_kern_mount(struct file_system_type *type, int flags, 
const char *name, void
mnt->mnt.mnt_sb = root->d_sb;
mnt->mnt_mountpoint = mnt->mnt.mnt_root;
mnt->mnt_parent = mnt;
+
+   err = do_mount_sb(>mnt, flags, data);
+   if(err) {
+   mnt_free_id(mnt);
+   free_vfsmnt(mnt);
+   return ERR_PTR(err);
+   }
+
lock_mount_hash();
list_add_tail(>mnt_instance, >d_sb->s_mounts);
unlock_mount_hash();
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 7eb3cef..dd3da60 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -306,6 +306,10 @@ int proc_readdir_de(struct proc_dir_entry *de, struct file 
*file,
 int proc_readdir(struct file *file, struct dir_context *ctx)
 {
struct inode *inode = file_inode(file);
+   struct proc_options *opts = file->f_path.mnt->fs_data;
+
+   if (opts && opts->pid_only)
+   return 1;
 
return proc_readdir_de(PDE(inode), file, ctx);
 }
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 783bc19..51b1712 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -108,7 +108,6 @@ static int proc_show_options(struct seq_file *seq, struct 
dentry *root)
seq_printf(seq, ",gid=%u", from_kgid_munged(_user_ns, 
pid->pid_gid));
if (pid->hide_pid != 0)
seq_printf(seq, ",hidepid=%u", pid->hide_pid);
-
return 0;
 }
 
@@ -118,6 +117,8 @@ static const struct super_operations proc_sops = {
.drop_inode = generic_delete_inode,
.evict_inode= proc_evict_inode,
.statfs = simple_statfs,
+   .getattr_fs = proc_getattrfs,
+   .mount_fs   = proc_mount_cb,
.remount_fs = proc_remount,
.show_options   = 

[RFC] Add option to mount only a pids subset

2017-03-06 Thread Alexey Gladkov

After discussion with Oleg Nesterov I reimplement my patch as an additional
option for /proc. This option affects the mountpoint. It means that in one
pid namespace it possible to have both the whole traditional /proc and
/proc with only pids subset.

However, it remains an open question about overlayfs support in the /proc.

== Overview ==

Some of the container virtualization systems are mounted /proc inside
the container. This is done in most cases to operate with information
about the processes. Knowing that /proc filesystem is not fully
virtualized they are mounted on top of dangerous places empty files or
directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.).

The structure of this filesystem is dynamic and any module can create a
new object which will not necessarily be virtualized. There are
proprietary modules that aren't in the mainline whose work we can not
verify.

This opens up a potential threat to the system. The developers of the
virtualization system can't predict all dangerous places in /proc by
definition.

A more effective solution would be to mount into the container only what
is necessary and ignore the rest.

Right now there is the opportunity to pass in the container any port of
the /proc filesystem using mount --bind expect the pids.

This patch allows to mount only the part of /proc related to pids
without rest objects. Since this is an option for /proc, flags applied to
/proc have an effect on this subset of filesystem.

Originally the idea was that the container will be mounted only pid sunset
and additional required files will be mounted on top using the overlayfs.
But I found out that /proc does not support overlayfs and does not allow
to mount anything on top or under it.

== TODO ==

There is still work to do:

 * Show pidonly via proc_show_options.
 * Add overlayfs support.

---
 fs/internal.h |  1 +
 fs/namespace.c|  9 
 fs/proc/generic.c |  4 
 fs/proc/inode.c   |  7 +-
 fs/proc/internal.h|  8 +++
 fs/proc/root.c| 62 +++
 fs/stat.c |  4 
 fs/super.c| 20 +
 include/linux/fs.h|  2 ++
 include/linux/mount.h |  1 +
 10 files changed, 113 insertions(+), 5 deletions(-)

diff --git a/fs/internal.h b/fs/internal.h
index 4fcf517..cb44ca7 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -88,6 +88,7 @@ extern struct file *get_empty_filp(void);
  * super.c
  */
 extern int do_remount_sb(struct super_block *, int, void *, int);
+extern int do_mount_sb(struct vfsmount *mnt, int flags, void *data);
 extern bool trylock_super(struct super_block *sb);
 extern struct dentry *mount_fs(struct file_system_type *,
   int, const char *, void *);
diff --git a/fs/namespace.c b/fs/namespace.c
index e6c234b..6cb2bcb 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -938,6 +938,7 @@ static struct mount *skip_mnt_tree(struct mount *p)
 struct vfsmount *
 vfs_kern_mount(struct file_system_type *type, int flags, const char *name, 
void *data)
 {
+   int err;
struct mount *mnt;
struct dentry *root;
 
@@ -962,6 +963,14 @@ vfs_kern_mount(struct file_system_type *type, int flags, 
const char *name, void
mnt->mnt.mnt_sb = root->d_sb;
mnt->mnt_mountpoint = mnt->mnt.mnt_root;
mnt->mnt_parent = mnt;
+
+   err = do_mount_sb(>mnt, flags, data);
+   if(err) {
+   mnt_free_id(mnt);
+   free_vfsmnt(mnt);
+   return ERR_PTR(err);
+   }
+
lock_mount_hash();
list_add_tail(>mnt_instance, >d_sb->s_mounts);
unlock_mount_hash();
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 7eb3cef..dd3da60 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -306,6 +306,10 @@ int proc_readdir_de(struct proc_dir_entry *de, struct file 
*file,
 int proc_readdir(struct file *file, struct dir_context *ctx)
 {
struct inode *inode = file_inode(file);
+   struct proc_options *opts = file->f_path.mnt->fs_data;
+
+   if (opts && opts->pid_only)
+   return 1;
 
return proc_readdir_de(PDE(inode), file, ctx);
 }
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 783bc19..51b1712 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -108,7 +108,6 @@ static int proc_show_options(struct seq_file *seq, struct 
dentry *root)
seq_printf(seq, ",gid=%u", from_kgid_munged(_user_ns, 
pid->pid_gid));
if (pid->hide_pid != 0)
seq_printf(seq, ",hidepid=%u", pid->hide_pid);
-
return 0;
 }
 
@@ -118,6 +117,8 @@ static const struct super_operations proc_sops = {
.drop_inode = generic_delete_inode,
.evict_inode= proc_evict_inode,
.statfs = simple_statfs,
+   .getattr_fs = proc_getattrfs,
+   .mount_fs   = proc_mount_cb,
.remount_fs = proc_remount,
.show_options   =