Re: [RFC] Add option to mount only a pids subset
On Sun, Mar 26, 2017 at 09:03:33AM +0200, Djalal Harouni wrote: > (Cc'ed Kees and Jann for the procfs stacking issue) > > > I completely agree with you that it looks wrong when options of one > > mountpoint affect the others mountpoints. > > > >> I'm not sure if that's the right approach, it is still buggy, however > >> seems that your patch also stores the mount option inside the > >> pid_namespace which may get propagated to all mounts inside same pidns > >> ? > > > > I don't store 'pidonly' option in my current patch. I mean in: > > https://lkml.org/lkml/2017/3/20/363 > > > > I parse options twice at first mount of procfs. It happens before > > the mounting /proc in userspace. > > > > I know it's bad, but I couldn't find place to store this option. > > Ok, then maybe that approach of having a procfs struct holder instead > of using pid namespace may help! I deside to stop doing my patch. I talked with a few people and found out that the overlayfs doesn't feel very well if on the lower level filesystem appear/disappear files. In addition with the pidfs isn't so simple. Separate the root will lead to a doubling of memory consumption and restrictions on the filesystem operations level can easily be skipped. It means that even I can do this pidfs (or pid subset in /proc), it will be pointless. -- Rgrds, legion
Re: [RFC] Add option to mount only a pids subset
On Sun, Mar 26, 2017 at 09:03:33AM +0200, Djalal Harouni wrote: > (Cc'ed Kees and Jann for the procfs stacking issue) > > > I completely agree with you that it looks wrong when options of one > > mountpoint affect the others mountpoints. > > > >> I'm not sure if that's the right approach, it is still buggy, however > >> seems that your patch also stores the mount option inside the > >> pid_namespace which may get propagated to all mounts inside same pidns > >> ? > > > > I don't store 'pidonly' option in my current patch. I mean in: > > https://lkml.org/lkml/2017/3/20/363 > > > > I parse options twice at first mount of procfs. It happens before > > the mounting /proc in userspace. > > > > I know it's bad, but I couldn't find place to store this option. > > Ok, then maybe that approach of having a procfs struct holder instead > of using pid namespace may help! I deside to stop doing my patch. I talked with a few people and found out that the overlayfs doesn't feel very well if on the lower level filesystem appear/disappear files. In addition with the pidfs isn't so simple. Separate the root will lead to a doubling of memory consumption and restrictions on the filesystem operations level can easily be skipped. It means that even I can do this pidfs (or pid subset in /proc), it will be pointless. -- Rgrds, legion
Re: [RFC] Add option to mount only a pids subset
(Cc'ed Kees and Jann for the procfs stacking issue) On Thu, Mar 23, 2017 at 11:07 PM, Alexey Gladkovwrote: > On Thu, Mar 23, 2017 at 05:06:28PM +0100, Djalal Harouni wrote: >> Hi Alexey, >> >> On Mon, Mar 20, 2017 at 1:58 PM, Alexey Gladkov >> wrote: >> > >> > >> > Al Viro, this patch looks better ? >> > >> > == Overview == >> > >> > Some of the container virtualization systems are mounted /proc inside >> > the container. This is done in most cases to operate with information >> > about the processes. Knowing that /proc filesystem is not fully >> > virtualized they are mounted on top of dangerous places empty files or >> > directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.). >> > >> > The structure of this filesystem is dynamic and any module can create a >> > new object which will not necessarily be virtualized. There are >> > proprietary modules that aren't in the mainline whose work we can not >> > verify. >> > >> > This opens up a potential threat to the system. The developers of the >> > virtualization system can't predict all dangerous places in /proc by >> > definition. >> > >> > A more effective solution would be to mount into the container only what >> > is necessary and ignore the rest. >> > >> > Right now there is the opportunity to pass in the container any port of >> > the /proc filesystem using mount --bind expect the pids. >> > >> > This patch allows to mount only the part of /proc related to pids without >> > rest objects. Since this is an option for /proc, flags applied to /proc >> > have an effect on this subset of filesystem. >> >> I just sent a patch that also has to deal with proc hidepid here: >> https://lkml.org/lkml/2017/3/23/505 > > I completely agree with you that it looks wrong when options of one > mountpoint affect the others mountpoints. > >> I'm not sure if that's the right approach, it is still buggy, however >> seems that your patch also stores the mount option inside the >> pid_namespace which may get propagated to all mounts inside same pidns >> ? > > I don't store 'pidonly' option in my current patch. I mean in: > https://lkml.org/lkml/2017/3/20/363 > > I parse options twice at first mount of procfs. It happens before > the mounting /proc in userspace. > > I know it's bad, but I couldn't find place to store this option. Ok, then maybe that approach of having a procfs struct holder instead of using pid namespace may help! >> I didn't have enough time but maybe if they are related we can work it >> out together ? > > I don't have enough experience in kenrel hacking, but I would happily do > my best :) OK, let's sync on this then. > I also tring to mention it in every patch, as my changes almost completely > useless without the ability to use the overlayfs. > > Now if you remove the restriction: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/inode.c#n497 This protection was introduced in e54ad7f1ee263 That's a security fix, ecryptfs seems to schedule some code through kthreads which always run in the initial namespaces with full capabilities, bypassing all security checks. I don't know if overlayfs has something similar, if not then maybe if it's possible in ecryptfs to check the lower mount and the fs type and move this procfs there... Cc'ed Jann in case he may comment. > and mount procfs as lowerdir in overlayfs then you get NULL pointer > dereference at: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/namei.c#n891 > > I got it when I tried to do `ls -la /overlay/proc/self/`. > > -- > Rgrds, legion
Re: [RFC] Add option to mount only a pids subset
(Cc'ed Kees and Jann for the procfs stacking issue) On Thu, Mar 23, 2017 at 11:07 PM, Alexey Gladkov wrote: > On Thu, Mar 23, 2017 at 05:06:28PM +0100, Djalal Harouni wrote: >> Hi Alexey, >> >> On Mon, Mar 20, 2017 at 1:58 PM, Alexey Gladkov >> wrote: >> > >> > >> > Al Viro, this patch looks better ? >> > >> > == Overview == >> > >> > Some of the container virtualization systems are mounted /proc inside >> > the container. This is done in most cases to operate with information >> > about the processes. Knowing that /proc filesystem is not fully >> > virtualized they are mounted on top of dangerous places empty files or >> > directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.). >> > >> > The structure of this filesystem is dynamic and any module can create a >> > new object which will not necessarily be virtualized. There are >> > proprietary modules that aren't in the mainline whose work we can not >> > verify. >> > >> > This opens up a potential threat to the system. The developers of the >> > virtualization system can't predict all dangerous places in /proc by >> > definition. >> > >> > A more effective solution would be to mount into the container only what >> > is necessary and ignore the rest. >> > >> > Right now there is the opportunity to pass in the container any port of >> > the /proc filesystem using mount --bind expect the pids. >> > >> > This patch allows to mount only the part of /proc related to pids without >> > rest objects. Since this is an option for /proc, flags applied to /proc >> > have an effect on this subset of filesystem. >> >> I just sent a patch that also has to deal with proc hidepid here: >> https://lkml.org/lkml/2017/3/23/505 > > I completely agree with you that it looks wrong when options of one > mountpoint affect the others mountpoints. > >> I'm not sure if that's the right approach, it is still buggy, however >> seems that your patch also stores the mount option inside the >> pid_namespace which may get propagated to all mounts inside same pidns >> ? > > I don't store 'pidonly' option in my current patch. I mean in: > https://lkml.org/lkml/2017/3/20/363 > > I parse options twice at first mount of procfs. It happens before > the mounting /proc in userspace. > > I know it's bad, but I couldn't find place to store this option. Ok, then maybe that approach of having a procfs struct holder instead of using pid namespace may help! >> I didn't have enough time but maybe if they are related we can work it >> out together ? > > I don't have enough experience in kenrel hacking, but I would happily do > my best :) OK, let's sync on this then. > I also tring to mention it in every patch, as my changes almost completely > useless without the ability to use the overlayfs. > > Now if you remove the restriction: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/inode.c#n497 This protection was introduced in e54ad7f1ee263 That's a security fix, ecryptfs seems to schedule some code through kthreads which always run in the initial namespaces with full capabilities, bypassing all security checks. I don't know if overlayfs has something similar, if not then maybe if it's possible in ecryptfs to check the lower mount and the fs type and move this procfs there... Cc'ed Jann in case he may comment. > and mount procfs as lowerdir in overlayfs then you get NULL pointer > dereference at: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/namei.c#n891 > > I got it when I tried to do `ls -la /overlay/proc/self/`. > > -- > Rgrds, legion
Re: [RFC] Add option to mount only a pids subset
On Thu, Mar 23, 2017 at 05:05:07PM +0100, Oleg Nesterov wrote: > Again, I can't really review this, I know nothing about vfs, but since > nobody else replied... Thanks anyway :) > On 03/20, Alexey Gladkov wrote: > > > > @@ -97,7 +169,23 @@ static struct dentry *proc_mount(struct > > file_system_type *fs_type, > > ns = task_active_pid_ns(current); > > } > > > > - return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super); > > + root = mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super); > > + > > + if (!IS_ERR(root)) { > > + if (!proc_fill_options(data, )) > > + return ERR_PTR(-EINVAL); > > So we have to call proc_fill_options() twice, not good... Yes, I understand > why, but perhaps we factor it out somehow, we can pack options + pid_ns into > sb->s_fs_info. Nevermind, this is minor. It happens only once, when we don't have the s_root yet. > > + if (opts.pid_only) { > > + int ret; > > + > > + if (!ns->pidfs && (ret = fill_pidfs_root(root->d_sb))) > > + return ERR_PTR(ret); > > + > > + root = ns->pidfs; > > Afaics this lacks dget(ns->pidfs) which should pair with dput(mnt.mnt_root) > in cleanup_mnt(). IIUC otherwise ns->pidfs can go away after umount, OTOH, > if we return ns->pidfs then dget(sb->s_root) in mount_ns() is not balanced. > But this all is fixeable. > > So with this change "mount -opidonly" creates another IS_ROOT() dentry which > is not equal to sb->s_root. I simply do not know if this is technically > correct or not... but, say, the "Only bind mounts can have disconnected paths" > comment in path_connected() makes me worry ;) > > And this obviously means that /path-to-pidonly-mnt/ won't share dentries with > the normal /proc mount. Not really good imo even if not really wrong... Lets > look at proc_flush_task(). The exiting task will flush its $pid dentries in > /proc/ but not in /path-to-pidonly-mnt/ iiuc. Again, not really a bug, but > still... I know that I'm cheater, but I did not start first :) -- Rgrds, legion
Re: [RFC] Add option to mount only a pids subset
On Thu, Mar 23, 2017 at 05:05:07PM +0100, Oleg Nesterov wrote: > Again, I can't really review this, I know nothing about vfs, but since > nobody else replied... Thanks anyway :) > On 03/20, Alexey Gladkov wrote: > > > > @@ -97,7 +169,23 @@ static struct dentry *proc_mount(struct > > file_system_type *fs_type, > > ns = task_active_pid_ns(current); > > } > > > > - return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super); > > + root = mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super); > > + > > + if (!IS_ERR(root)) { > > + if (!proc_fill_options(data, )) > > + return ERR_PTR(-EINVAL); > > So we have to call proc_fill_options() twice, not good... Yes, I understand > why, but perhaps we factor it out somehow, we can pack options + pid_ns into > sb->s_fs_info. Nevermind, this is minor. It happens only once, when we don't have the s_root yet. > > + if (opts.pid_only) { > > + int ret; > > + > > + if (!ns->pidfs && (ret = fill_pidfs_root(root->d_sb))) > > + return ERR_PTR(ret); > > + > > + root = ns->pidfs; > > Afaics this lacks dget(ns->pidfs) which should pair with dput(mnt.mnt_root) > in cleanup_mnt(). IIUC otherwise ns->pidfs can go away after umount, OTOH, > if we return ns->pidfs then dget(sb->s_root) in mount_ns() is not balanced. > But this all is fixeable. > > So with this change "mount -opidonly" creates another IS_ROOT() dentry which > is not equal to sb->s_root. I simply do not know if this is technically > correct or not... but, say, the "Only bind mounts can have disconnected paths" > comment in path_connected() makes me worry ;) > > And this obviously means that /path-to-pidonly-mnt/ won't share dentries with > the normal /proc mount. Not really good imo even if not really wrong... Lets > look at proc_flush_task(). The exiting task will flush its $pid dentries in > /proc/ but not in /path-to-pidonly-mnt/ iiuc. Again, not really a bug, but > still... I know that I'm cheater, but I did not start first :) -- Rgrds, legion
Re: [RFC] Add option to mount only a pids subset
On Thu, Mar 23, 2017 at 05:06:28PM +0100, Djalal Harouni wrote: > Hi Alexey, > > On Mon, Mar 20, 2017 at 1:58 PM, Alexey Gladkov >wrote: > > > > > > Al Viro, this patch looks better ? > > > > == Overview == > > > > Some of the container virtualization systems are mounted /proc inside > > the container. This is done in most cases to operate with information > > about the processes. Knowing that /proc filesystem is not fully > > virtualized they are mounted on top of dangerous places empty files or > > directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.). > > > > The structure of this filesystem is dynamic and any module can create a > > new object which will not necessarily be virtualized. There are > > proprietary modules that aren't in the mainline whose work we can not > > verify. > > > > This opens up a potential threat to the system. The developers of the > > virtualization system can't predict all dangerous places in /proc by > > definition. > > > > A more effective solution would be to mount into the container only what > > is necessary and ignore the rest. > > > > Right now there is the opportunity to pass in the container any port of > > the /proc filesystem using mount --bind expect the pids. > > > > This patch allows to mount only the part of /proc related to pids without > > rest objects. Since this is an option for /proc, flags applied to /proc > > have an effect on this subset of filesystem. > > I just sent a patch that also has to deal with proc hidepid here: > https://lkml.org/lkml/2017/3/23/505 I completely agree with you that it looks wrong when options of one mountpoint affect the others mountpoints. > I'm not sure if that's the right approach, it is still buggy, however > seems that your patch also stores the mount option inside the > pid_namespace which may get propagated to all mounts inside same pidns > ? I don't store 'pidonly' option in my current patch. I mean in: https://lkml.org/lkml/2017/3/20/363 I parse options twice at first mount of procfs. It happens before the mounting /proc in userspace. I know it's bad, but I couldn't find place to store this option. > I didn't have enough time but maybe if they are related we can work it > out together ? I don't have enough experience in kenrel hacking, but I would happily do my best :) I also tring to mention it in every patch, as my changes almost completely useless without the ability to use the overlayfs. Now if you remove the restriction: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/inode.c#n497 and mount procfs as lowerdir in overlayfs then you get NULL pointer dereference at: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/namei.c#n891 I got it when I tried to do `ls -la /overlay/proc/self/`. -- Rgrds, legion
Re: [RFC] Add option to mount only a pids subset
On Thu, Mar 23, 2017 at 05:06:28PM +0100, Djalal Harouni wrote: > Hi Alexey, > > On Mon, Mar 20, 2017 at 1:58 PM, Alexey Gladkov > wrote: > > > > > > Al Viro, this patch looks better ? > > > > == Overview == > > > > Some of the container virtualization systems are mounted /proc inside > > the container. This is done in most cases to operate with information > > about the processes. Knowing that /proc filesystem is not fully > > virtualized they are mounted on top of dangerous places empty files or > > directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.). > > > > The structure of this filesystem is dynamic and any module can create a > > new object which will not necessarily be virtualized. There are > > proprietary modules that aren't in the mainline whose work we can not > > verify. > > > > This opens up a potential threat to the system. The developers of the > > virtualization system can't predict all dangerous places in /proc by > > definition. > > > > A more effective solution would be to mount into the container only what > > is necessary and ignore the rest. > > > > Right now there is the opportunity to pass in the container any port of > > the /proc filesystem using mount --bind expect the pids. > > > > This patch allows to mount only the part of /proc related to pids without > > rest objects. Since this is an option for /proc, flags applied to /proc > > have an effect on this subset of filesystem. > > I just sent a patch that also has to deal with proc hidepid here: > https://lkml.org/lkml/2017/3/23/505 I completely agree with you that it looks wrong when options of one mountpoint affect the others mountpoints. > I'm not sure if that's the right approach, it is still buggy, however > seems that your patch also stores the mount option inside the > pid_namespace which may get propagated to all mounts inside same pidns > ? I don't store 'pidonly' option in my current patch. I mean in: https://lkml.org/lkml/2017/3/20/363 I parse options twice at first mount of procfs. It happens before the mounting /proc in userspace. I know it's bad, but I couldn't find place to store this option. > I didn't have enough time but maybe if they are related we can work it > out together ? I don't have enough experience in kenrel hacking, but I would happily do my best :) I also tring to mention it in every patch, as my changes almost completely useless without the ability to use the overlayfs. Now if you remove the restriction: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/proc/inode.c#n497 and mount procfs as lowerdir in overlayfs then you get NULL pointer dereference at: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/namei.c#n891 I got it when I tried to do `ls -la /overlay/proc/self/`. -- Rgrds, legion
Re: [RFC] Add option to mount only a pids subset
Again, I can't really review this, I know nothing about vfs, but since nobody else replied... On 03/20, Alexey Gladkov wrote: > > @@ -97,7 +169,23 @@ static struct dentry *proc_mount(struct file_system_type > *fs_type, > ns = task_active_pid_ns(current); > } > > - return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super); > + root = mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super); > + > + if (!IS_ERR(root)) { > + if (!proc_fill_options(data, )) > + return ERR_PTR(-EINVAL); So we have to call proc_fill_options() twice, not good... Yes, I understand why, but perhaps we factor it out somehow, we can pack options + pid_ns into sb->s_fs_info. Nevermind, this is minor. > + if (opts.pid_only) { > + int ret; > + > + if (!ns->pidfs && (ret = fill_pidfs_root(root->d_sb))) > + return ERR_PTR(ret); > + > + root = ns->pidfs; Afaics this lacks dget(ns->pidfs) which should pair with dput(mnt.mnt_root) in cleanup_mnt(). IIUC otherwise ns->pidfs can go away after umount, OTOH, if we return ns->pidfs then dget(sb->s_root) in mount_ns() is not balanced. But this all is fixeable. So with this change "mount -opidonly" creates another IS_ROOT() dentry which is not equal to sb->s_root. I simply do not know if this is technically correct or not... but, say, the "Only bind mounts can have disconnected paths" comment in path_connected() makes me worry ;) And this obviously means that /path-to-pidonly-mnt/ won't share dentries with the normal /proc mount. Not really good imo even if not really wrong... Lets look at proc_flush_task(). The exiting task will flush its $pid dentries in /proc/ but not in /path-to-pidonly-mnt/ iiuc. Again, not really a bug, but still... Oleg.
Re: [RFC] Add option to mount only a pids subset
Again, I can't really review this, I know nothing about vfs, but since nobody else replied... On 03/20, Alexey Gladkov wrote: > > @@ -97,7 +169,23 @@ static struct dentry *proc_mount(struct file_system_type > *fs_type, > ns = task_active_pid_ns(current); > } > > - return mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super); > + root = mount_ns(fs_type, flags, data, ns, ns->user_ns, proc_fill_super); > + > + if (!IS_ERR(root)) { > + if (!proc_fill_options(data, )) > + return ERR_PTR(-EINVAL); So we have to call proc_fill_options() twice, not good... Yes, I understand why, but perhaps we factor it out somehow, we can pack options + pid_ns into sb->s_fs_info. Nevermind, this is minor. > + if (opts.pid_only) { > + int ret; > + > + if (!ns->pidfs && (ret = fill_pidfs_root(root->d_sb))) > + return ERR_PTR(ret); > + > + root = ns->pidfs; Afaics this lacks dget(ns->pidfs) which should pair with dput(mnt.mnt_root) in cleanup_mnt(). IIUC otherwise ns->pidfs can go away after umount, OTOH, if we return ns->pidfs then dget(sb->s_root) in mount_ns() is not balanced. But this all is fixeable. So with this change "mount -opidonly" creates another IS_ROOT() dentry which is not equal to sb->s_root. I simply do not know if this is technically correct or not... but, say, the "Only bind mounts can have disconnected paths" comment in path_connected() makes me worry ;) And this obviously means that /path-to-pidonly-mnt/ won't share dentries with the normal /proc mount. Not really good imo even if not really wrong... Lets look at proc_flush_task(). The exiting task will flush its $pid dentries in /proc/ but not in /path-to-pidonly-mnt/ iiuc. Again, not really a bug, but still... Oleg.
Re: [RFC] Add option to mount only a pids subset
Hi Alexey, On Mon, Mar 20, 2017 at 1:58 PM, Alexey Gladkovwrote: > > > Al Viro, this patch looks better ? > > == Overview == > > Some of the container virtualization systems are mounted /proc inside > the container. This is done in most cases to operate with information > about the processes. Knowing that /proc filesystem is not fully > virtualized they are mounted on top of dangerous places empty files or > directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.). > > The structure of this filesystem is dynamic and any module can create a > new object which will not necessarily be virtualized. There are > proprietary modules that aren't in the mainline whose work we can not > verify. > > This opens up a potential threat to the system. The developers of the > virtualization system can't predict all dangerous places in /proc by > definition. > > A more effective solution would be to mount into the container only what > is necessary and ignore the rest. > > Right now there is the opportunity to pass in the container any port of > the /proc filesystem using mount --bind expect the pids. > > This patch allows to mount only the part of /proc related to pids without > rest objects. Since this is an option for /proc, flags applied to /proc > have an effect on this subset of filesystem. I just sent a patch that also has to deal with proc hidepid here: https://lkml.org/lkml/2017/3/23/505 I'm not sure if that's the right approach, it is still buggy, however seems that your patch also stores the mount option inside the pid_namespace which may get propagated to all mounts inside same pidns ? I didn't have enough time but maybe if they are related we can work it out together ? Thank you! -- tixxdz
Re: [RFC] Add option to mount only a pids subset
Hi Alexey, On Mon, Mar 20, 2017 at 1:58 PM, Alexey Gladkov wrote: > > > Al Viro, this patch looks better ? > > == Overview == > > Some of the container virtualization systems are mounted /proc inside > the container. This is done in most cases to operate with information > about the processes. Knowing that /proc filesystem is not fully > virtualized they are mounted on top of dangerous places empty files or > directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.). > > The structure of this filesystem is dynamic and any module can create a > new object which will not necessarily be virtualized. There are > proprietary modules that aren't in the mainline whose work we can not > verify. > > This opens up a potential threat to the system. The developers of the > virtualization system can't predict all dangerous places in /proc by > definition. > > A more effective solution would be to mount into the container only what > is necessary and ignore the rest. > > Right now there is the opportunity to pass in the container any port of > the /proc filesystem using mount --bind expect the pids. > > This patch allows to mount only the part of /proc related to pids without > rest objects. Since this is an option for /proc, flags applied to /proc > have an effect on this subset of filesystem. I just sent a patch that also has to deal with proc hidepid here: https://lkml.org/lkml/2017/3/23/505 I'm not sure if that's the right approach, it is still buggy, however seems that your patch also stores the mount option inside the pid_namespace which may get propagated to all mounts inside same pidns ? I didn't have enough time but maybe if they are related we can work it out together ? Thank you! -- tixxdz
[RFC] Add option to mount only a pids subset
Al Viro, this patch looks better ? == Overview == Some of the container virtualization systems are mounted /proc inside the container. This is done in most cases to operate with information about the processes. Knowing that /proc filesystem is not fully virtualized they are mounted on top of dangerous places empty files or directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.). The structure of this filesystem is dynamic and any module can create a new object which will not necessarily be virtualized. There are proprietary modules that aren't in the mainline whose work we can not verify. This opens up a potential threat to the system. The developers of the virtualization system can't predict all dangerous places in /proc by definition. A more effective solution would be to mount into the container only what is necessary and ignore the rest. Right now there is the opportunity to pass in the container any port of the /proc filesystem using mount --bind expect the pids. This patch allows to mount only the part of /proc related to pids without rest objects. Since this is an option for /proc, flags applied to /proc have an effect on this subset of filesystem. Originally the idea was that the container will be mounted only pid sunset and additional required files will be mounted on top using the overlayfs. But I found out that /proc does not support overlayfs and does not allow to mount anything on top or under it. == TODO == There is still work to do: * Add overlayfs support. Signed-off-by: Alexey Gladkov--- fs/proc/generic.c | 5 ++ fs/proc/inode.c | 2 + fs/proc/internal.h| 7 +++ fs/proc/root.c| 113 -- include/linux/pid_namespace.h | 1 + 5 files changed, 123 insertions(+), 5 deletions(-) diff --git a/fs/proc/generic.c b/fs/proc/generic.c index ee27feb..50bb1e9 100644 --- a/fs/proc/generic.c +++ b/fs/proc/generic.c @@ -23,6 +23,7 @@ #include #include #include +#include #include "internal.h" @@ -307,6 +308,10 @@ int proc_readdir_de(struct proc_dir_entry *de, struct file *file, int proc_readdir(struct file *file, struct dir_context *ctx) { struct inode *inode = file_inode(file); + struct pid_namespace *ns = inode->i_sb->s_fs_info; + + if (ns->pidfs && inode == d_inode(ns->pidfs)) + return 1; return proc_readdir_de(PDE(inode), file, ctx); } diff --git a/fs/proc/inode.c b/fs/proc/inode.c index 2cc7a80..0c9be65 100644 --- a/fs/proc/inode.c +++ b/fs/proc/inode.c @@ -109,6 +109,8 @@ static int proc_show_options(struct seq_file *seq, struct dentry *root) seq_printf(seq, ",gid=%u", from_kgid_munged(_user_ns, pid->pid_gid)); if (pid->hide_pid != HIDEPID_OFF) seq_printf(seq, ",hidepid=%u", pid->hide_pid); + if (root == pid->pidfs) + seq_printf(seq, ",pidonly"); return 0; } diff --git a/fs/proc/internal.h b/fs/proc/internal.h index c5ae09b..a5a4bf1 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -260,7 +260,14 @@ static inline void proc_tty_init(void) {} /* * root.c */ +struct proc_options { + kgid_t pid_gid; + int hide_pid; + int pid_only; +}; + extern struct proc_dir_entry proc_root; +extern struct proc_dir_entry pidfs_root; extern int proc_parse_options(char *options, struct pid_namespace *pid); extern void proc_self_init(void); diff --git a/fs/proc/root.c b/fs/proc/root.c index deecb39..c2443d5 100644 --- a/fs/proc/root.c +++ b/fs/proc/root.c @@ -26,16 +26,17 @@ #include "internal.h" enum { - Opt_gid, Opt_hidepid, Opt_err, + Opt_gid, Opt_hidepid, Opt_pidonly, Opt_err, }; static const match_table_t tokens = { {Opt_hidepid, "hidepid=%u"}, {Opt_gid, "gid=%u"}, + {Opt_pidonly, "pidonly"}, {Opt_err, NULL}, }; -int proc_parse_options(char *options, struct pid_namespace *pid) +static int proc_fill_options(char *options, struct proc_options *fs_opts) { char *p; substring_t args[MAX_OPT_ARGS]; @@ -55,7 +56,7 @@ int proc_parse_options(char *options, struct pid_namespace *pid) case Opt_gid: if (match_int([0], )) return 0; - pid->pid_gid = make_kgid(current_user_ns(), option); + fs_opts->pid_gid = make_kgid(current_user_ns(), option); break; case Opt_hidepid: if (match_int([0], )) @@ -65,7 +66,10 @@ int proc_parse_options(char *options, struct pid_namespace *pid) pr_err("proc: hidepid value must be between 0 and 2.\n"); return 0; } - pid->hide_pid = option; + fs_opts->hide_pid = option; + break; +
[RFC] Add option to mount only a pids subset
Al Viro, this patch looks better ? == Overview == Some of the container virtualization systems are mounted /proc inside the container. This is done in most cases to operate with information about the processes. Knowing that /proc filesystem is not fully virtualized they are mounted on top of dangerous places empty files or directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.). The structure of this filesystem is dynamic and any module can create a new object which will not necessarily be virtualized. There are proprietary modules that aren't in the mainline whose work we can not verify. This opens up a potential threat to the system. The developers of the virtualization system can't predict all dangerous places in /proc by definition. A more effective solution would be to mount into the container only what is necessary and ignore the rest. Right now there is the opportunity to pass in the container any port of the /proc filesystem using mount --bind expect the pids. This patch allows to mount only the part of /proc related to pids without rest objects. Since this is an option for /proc, flags applied to /proc have an effect on this subset of filesystem. Originally the idea was that the container will be mounted only pid sunset and additional required files will be mounted on top using the overlayfs. But I found out that /proc does not support overlayfs and does not allow to mount anything on top or under it. == TODO == There is still work to do: * Add overlayfs support. Signed-off-by: Alexey Gladkov --- fs/proc/generic.c | 5 ++ fs/proc/inode.c | 2 + fs/proc/internal.h| 7 +++ fs/proc/root.c| 113 -- include/linux/pid_namespace.h | 1 + 5 files changed, 123 insertions(+), 5 deletions(-) diff --git a/fs/proc/generic.c b/fs/proc/generic.c index ee27feb..50bb1e9 100644 --- a/fs/proc/generic.c +++ b/fs/proc/generic.c @@ -23,6 +23,7 @@ #include #include #include +#include #include "internal.h" @@ -307,6 +308,10 @@ int proc_readdir_de(struct proc_dir_entry *de, struct file *file, int proc_readdir(struct file *file, struct dir_context *ctx) { struct inode *inode = file_inode(file); + struct pid_namespace *ns = inode->i_sb->s_fs_info; + + if (ns->pidfs && inode == d_inode(ns->pidfs)) + return 1; return proc_readdir_de(PDE(inode), file, ctx); } diff --git a/fs/proc/inode.c b/fs/proc/inode.c index 2cc7a80..0c9be65 100644 --- a/fs/proc/inode.c +++ b/fs/proc/inode.c @@ -109,6 +109,8 @@ static int proc_show_options(struct seq_file *seq, struct dentry *root) seq_printf(seq, ",gid=%u", from_kgid_munged(_user_ns, pid->pid_gid)); if (pid->hide_pid != HIDEPID_OFF) seq_printf(seq, ",hidepid=%u", pid->hide_pid); + if (root == pid->pidfs) + seq_printf(seq, ",pidonly"); return 0; } diff --git a/fs/proc/internal.h b/fs/proc/internal.h index c5ae09b..a5a4bf1 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -260,7 +260,14 @@ static inline void proc_tty_init(void) {} /* * root.c */ +struct proc_options { + kgid_t pid_gid; + int hide_pid; + int pid_only; +}; + extern struct proc_dir_entry proc_root; +extern struct proc_dir_entry pidfs_root; extern int proc_parse_options(char *options, struct pid_namespace *pid); extern void proc_self_init(void); diff --git a/fs/proc/root.c b/fs/proc/root.c index deecb39..c2443d5 100644 --- a/fs/proc/root.c +++ b/fs/proc/root.c @@ -26,16 +26,17 @@ #include "internal.h" enum { - Opt_gid, Opt_hidepid, Opt_err, + Opt_gid, Opt_hidepid, Opt_pidonly, Opt_err, }; static const match_table_t tokens = { {Opt_hidepid, "hidepid=%u"}, {Opt_gid, "gid=%u"}, + {Opt_pidonly, "pidonly"}, {Opt_err, NULL}, }; -int proc_parse_options(char *options, struct pid_namespace *pid) +static int proc_fill_options(char *options, struct proc_options *fs_opts) { char *p; substring_t args[MAX_OPT_ARGS]; @@ -55,7 +56,7 @@ int proc_parse_options(char *options, struct pid_namespace *pid) case Opt_gid: if (match_int([0], )) return 0; - pid->pid_gid = make_kgid(current_user_ns(), option); + fs_opts->pid_gid = make_kgid(current_user_ns(), option); break; case Opt_hidepid: if (match_int([0], )) @@ -65,7 +66,10 @@ int proc_parse_options(char *options, struct pid_namespace *pid) pr_err("proc: hidepid value must be between 0 and 2.\n"); return 0; } - pid->hide_pid = option; + fs_opts->hide_pid = option; + break; + case
Re: [RFC] Add option to mount only a pids subset
On Mon, Mar 13, 2017 at 6:27 AM, Al Virowrote: > On Sun, Mar 12, 2017 at 08:19:33PM -0700, Andy Lutomirski wrote: >> On Sat, Mar 11, 2017 at 6:13 PM, Al Viro wrote: >> > PS: AFAICS, simple mount --bind of your pid-only mount will suddenly >> > expose the full thing. And as for the lifetimes making no sense... >> > note that you are simply not freeing these structures of yours. >> > Try to handle that and you'll get a serious PITA all over the >> > place. >> > >> > What are you trying to achieve, anyway? Why not add a second vfsmount >> > pointer per pid_namespace and make it initialized on demand, at the >> > first attempt of no-pid mount? Just have a separate no-pid instance >> > created for those namespaces where it had been asked for, with >> > separate superblock and dentry tree not containing anything other >> > that pid-only parts + self + thread-self... >> >> Can't we just make procfs work like most other filesystems and have >> each mount have its own superblock? If we need to do something funky >> to stat() output to keep existing userspace working, I think that's >> okay. > > First of all, most of the filesystems do *NOT* guarantee anything of > that sort. And what's the point of having more instances than > necessary, anyway? I mean that, if I do: mount -t proc -o foobar none a mount -t proc -o baz none b Then I think that the second mount should create a whole new proc instance rather than just a new vfsmount. Then the options could differ, which would solve a bunch of problems. > > Again, what for? It won't salvage that kludge... It's not as if it > had been hard to have separate pid-only instance created when asked > for (and reused every time when we are asked for pid-only). What's > the point of ever having more than two instances per pidns? IDGI... I can easily procfs growing more than one interesting option. Pid-only and hidepid come to mind, and that's already six possible combinations. The current hidepid implementation is really awful. > > Folks, there is no one-to-one correspondence between mountpoints and > superblocks. Not since 2000 or so. Just don't try to shove your > per-superblock stuff into vfsmount; it simply won't work. If you > want a separate instance for that thing, then just go ahead and > have ->mount() decide which one to use (and whether to create a new > one). All there is to it... That's what I mean. I just don't see the point of going all-out in trying to reuse superblocks.
Re: [RFC] Add option to mount only a pids subset
On Mon, Mar 13, 2017 at 6:27 AM, Al Viro wrote: > On Sun, Mar 12, 2017 at 08:19:33PM -0700, Andy Lutomirski wrote: >> On Sat, Mar 11, 2017 at 6:13 PM, Al Viro wrote: >> > PS: AFAICS, simple mount --bind of your pid-only mount will suddenly >> > expose the full thing. And as for the lifetimes making no sense... >> > note that you are simply not freeing these structures of yours. >> > Try to handle that and you'll get a serious PITA all over the >> > place. >> > >> > What are you trying to achieve, anyway? Why not add a second vfsmount >> > pointer per pid_namespace and make it initialized on demand, at the >> > first attempt of no-pid mount? Just have a separate no-pid instance >> > created for those namespaces where it had been asked for, with >> > separate superblock and dentry tree not containing anything other >> > that pid-only parts + self + thread-self... >> >> Can't we just make procfs work like most other filesystems and have >> each mount have its own superblock? If we need to do something funky >> to stat() output to keep existing userspace working, I think that's >> okay. > > First of all, most of the filesystems do *NOT* guarantee anything of > that sort. And what's the point of having more instances than > necessary, anyway? I mean that, if I do: mount -t proc -o foobar none a mount -t proc -o baz none b Then I think that the second mount should create a whole new proc instance rather than just a new vfsmount. Then the options could differ, which would solve a bunch of problems. > > Again, what for? It won't salvage that kludge... It's not as if it > had been hard to have separate pid-only instance created when asked > for (and reused every time when we are asked for pid-only). What's > the point of ever having more than two instances per pidns? IDGI... I can easily procfs growing more than one interesting option. Pid-only and hidepid come to mind, and that's already six possible combinations. The current hidepid implementation is really awful. > > Folks, there is no one-to-one correspondence between mountpoints and > superblocks. Not since 2000 or so. Just don't try to shove your > per-superblock stuff into vfsmount; it simply won't work. If you > want a separate instance for that thing, then just go ahead and > have ->mount() decide which one to use (and whether to create a new > one). All there is to it... That's what I mean. I just don't see the point of going all-out in trying to reuse superblocks.
Re: [RFC] Add option to mount only a pids subset
On Sun, Mar 12, 2017 at 08:19:33PM -0700, Andy Lutomirski wrote: > On Sat, Mar 11, 2017 at 6:13 PM, Al Virowrote: > > PS: AFAICS, simple mount --bind of your pid-only mount will suddenly > > expose the full thing. And as for the lifetimes making no sense... > > note that you are simply not freeing these structures of yours. > > Try to handle that and you'll get a serious PITA all over the > > place. > > > > What are you trying to achieve, anyway? Why not add a second vfsmount > > pointer per pid_namespace and make it initialized on demand, at the > > first attempt of no-pid mount? Just have a separate no-pid instance > > created for those namespaces where it had been asked for, with > > separate superblock and dentry tree not containing anything other > > that pid-only parts + self + thread-self... > > Can't we just make procfs work like most other filesystems and have > each mount have its own superblock? If we need to do something funky > to stat() output to keep existing userspace working, I think that's > okay. First of all, most of the filesystems do *NOT* guarantee anything of that sort. And what's the point of having more instances than necessary, anyway? > As far as I can tell, proc_mnt is very nearly useless -- it seems to > be used for proc_flush_task (which claims to be purely an optimization > and could be preserved in the common case where there's only one > relevant mount) and for sysctl_binary. For the latter, we could > create proc_mnt but make actual user-initiated mounts be new > superblocks anyway. Again, what for? It won't salvage that kludge... It's not as if it had been hard to have separate pid-only instance created when asked for (and reused every time when we are asked for pid-only). What's the point of ever having more than two instances per pidns? IDGI... Folks, there is no one-to-one correspondence between mountpoints and superblocks. Not since 2000 or so. Just don't try to shove your per-superblock stuff into vfsmount; it simply won't work. If you want a separate instance for that thing, then just go ahead and have ->mount() decide which one to use (and whether to create a new one). All there is to it...
Re: [RFC] Add option to mount only a pids subset
On Sun, Mar 12, 2017 at 08:19:33PM -0700, Andy Lutomirski wrote: > On Sat, Mar 11, 2017 at 6:13 PM, Al Viro wrote: > > PS: AFAICS, simple mount --bind of your pid-only mount will suddenly > > expose the full thing. And as for the lifetimes making no sense... > > note that you are simply not freeing these structures of yours. > > Try to handle that and you'll get a serious PITA all over the > > place. > > > > What are you trying to achieve, anyway? Why not add a second vfsmount > > pointer per pid_namespace and make it initialized on demand, at the > > first attempt of no-pid mount? Just have a separate no-pid instance > > created for those namespaces where it had been asked for, with > > separate superblock and dentry tree not containing anything other > > that pid-only parts + self + thread-self... > > Can't we just make procfs work like most other filesystems and have > each mount have its own superblock? If we need to do something funky > to stat() output to keep existing userspace working, I think that's > okay. First of all, most of the filesystems do *NOT* guarantee anything of that sort. And what's the point of having more instances than necessary, anyway? > As far as I can tell, proc_mnt is very nearly useless -- it seems to > be used for proc_flush_task (which claims to be purely an optimization > and could be preserved in the common case where there's only one > relevant mount) and for sysctl_binary. For the latter, we could > create proc_mnt but make actual user-initiated mounts be new > superblocks anyway. Again, what for? It won't salvage that kludge... It's not as if it had been hard to have separate pid-only instance created when asked for (and reused every time when we are asked for pid-only). What's the point of ever having more than two instances per pidns? IDGI... Folks, there is no one-to-one correspondence between mountpoints and superblocks. Not since 2000 or so. Just don't try to shove your per-superblock stuff into vfsmount; it simply won't work. If you want a separate instance for that thing, then just go ahead and have ->mount() decide which one to use (and whether to create a new one). All there is to it...
Re: [RFC] Add option to mount only a pids subset
On Sat, Mar 11, 2017 at 6:13 PM, Al Virowrote: > PS: AFAICS, simple mount --bind of your pid-only mount will suddenly > expose the full thing. And as for the lifetimes making no sense... > note that you are simply not freeing these structures of yours. > Try to handle that and you'll get a serious PITA all over the > place. > > What are you trying to achieve, anyway? Why not add a second vfsmount > pointer per pid_namespace and make it initialized on demand, at the > first attempt of no-pid mount? Just have a separate no-pid instance > created for those namespaces where it had been asked for, with > separate superblock and dentry tree not containing anything other > that pid-only parts + self + thread-self... Can't we just make procfs work like most other filesystems and have each mount have its own superblock? If we need to do something funky to stat() output to keep existing userspace working, I think that's okay. As far as I can tell, proc_mnt is very nearly useless -- it seems to be used for proc_flush_task (which claims to be purely an optimization and could be preserved in the common case where there's only one relevant mount) and for sysctl_binary. For the latter, we could create proc_mnt but make actual user-initiated mounts be new superblocks anyway.
Re: [RFC] Add option to mount only a pids subset
On Sat, Mar 11, 2017 at 6:13 PM, Al Viro wrote: > PS: AFAICS, simple mount --bind of your pid-only mount will suddenly > expose the full thing. And as for the lifetimes making no sense... > note that you are simply not freeing these structures of yours. > Try to handle that and you'll get a serious PITA all over the > place. > > What are you trying to achieve, anyway? Why not add a second vfsmount > pointer per pid_namespace and make it initialized on demand, at the > first attempt of no-pid mount? Just have a separate no-pid instance > created for those namespaces where it had been asked for, with > separate superblock and dentry tree not containing anything other > that pid-only parts + self + thread-self... Can't we just make procfs work like most other filesystems and have each mount have its own superblock? If we need to do something funky to stat() output to keep existing userspace working, I think that's okay. As far as I can tell, proc_mnt is very nearly useless -- it seems to be used for proc_flush_task (which claims to be purely an optimization and could be preserved in the common case where there's only one relevant mount) and for sysctl_binary. For the latter, we could create proc_mnt but make actual user-initiated mounts be new superblocks anyway.
Re: [RFC] Add option to mount only a pids subset
On Sun, Mar 12, 2017 at 01:54:38AM +, Al Viro wrote: > On Tue, Mar 07, 2017 at 12:05:16AM +0100, Alexey Gladkov wrote: > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > index 83de8b6..5bd1b84 100644 > > --- a/include/linux/fs.h > > +++ b/include/linux/fs.h > > @@ -1759,6 +1759,8 @@ struct super_operations { > > int (*thaw_super) (struct super_block *); > > int (*unfreeze_fs) (struct super_block *); > > int (*statfs) (struct dentry *, struct kstatfs *); > > + int (*getattr_fs) (struct vfsmount *, struct dentry *, struct kstat *); > > + int (*mount_fs) (struct vfsmount *, int *, char *); > > int (*remount_fs) (struct super_block *, int *, char *); > > void (*umount_begin) (struct super_block *); > > > > diff --git a/include/linux/mount.h b/include/linux/mount.h > > index 1172cce..4bd364e 100644 > > --- a/include/linux/mount.h > > +++ b/include/linux/mount.h > > @@ -67,6 +67,7 @@ struct vfsmount { > > struct dentry *mnt_root;/* root of the mounted tree */ > > struct super_block *mnt_sb; /* pointer to superblock */ > > int mnt_flags; > > + void *fs_data; /* fs-specific data */ > > No. This is really asking for trouble - you *can't* hang > fs-specific data structures off vfsmount. Lifetimes make no > sense whatsoever. > > BTW, your patch leaks supreblock reference on failure, and > is kludgy as hell wrt to what it's doing in procfs itself, but > that's secondary - the quoted part is enough for a hard NAK on > architectural grounds. Don't go there. PS: AFAICS, simple mount --bind of your pid-only mount will suddenly expose the full thing. And as for the lifetimes making no sense... note that you are simply not freeing these structures of yours. Try to handle that and you'll get a serious PITA all over the place. What are you trying to achieve, anyway? Why not add a second vfsmount pointer per pid_namespace and make it initialized on demand, at the first attempt of no-pid mount? Just have a separate no-pid instance created for those namespaces where it had been asked for, with separate superblock and dentry tree not containing anything other that pid-only parts + self + thread-self...
Re: [RFC] Add option to mount only a pids subset
On Sun, Mar 12, 2017 at 01:54:38AM +, Al Viro wrote: > On Tue, Mar 07, 2017 at 12:05:16AM +0100, Alexey Gladkov wrote: > > > diff --git a/include/linux/fs.h b/include/linux/fs.h > > index 83de8b6..5bd1b84 100644 > > --- a/include/linux/fs.h > > +++ b/include/linux/fs.h > > @@ -1759,6 +1759,8 @@ struct super_operations { > > int (*thaw_super) (struct super_block *); > > int (*unfreeze_fs) (struct super_block *); > > int (*statfs) (struct dentry *, struct kstatfs *); > > + int (*getattr_fs) (struct vfsmount *, struct dentry *, struct kstat *); > > + int (*mount_fs) (struct vfsmount *, int *, char *); > > int (*remount_fs) (struct super_block *, int *, char *); > > void (*umount_begin) (struct super_block *); > > > > diff --git a/include/linux/mount.h b/include/linux/mount.h > > index 1172cce..4bd364e 100644 > > --- a/include/linux/mount.h > > +++ b/include/linux/mount.h > > @@ -67,6 +67,7 @@ struct vfsmount { > > struct dentry *mnt_root;/* root of the mounted tree */ > > struct super_block *mnt_sb; /* pointer to superblock */ > > int mnt_flags; > > + void *fs_data; /* fs-specific data */ > > No. This is really asking for trouble - you *can't* hang > fs-specific data structures off vfsmount. Lifetimes make no > sense whatsoever. > > BTW, your patch leaks supreblock reference on failure, and > is kludgy as hell wrt to what it's doing in procfs itself, but > that's secondary - the quoted part is enough for a hard NAK on > architectural grounds. Don't go there. PS: AFAICS, simple mount --bind of your pid-only mount will suddenly expose the full thing. And as for the lifetimes making no sense... note that you are simply not freeing these structures of yours. Try to handle that and you'll get a serious PITA all over the place. What are you trying to achieve, anyway? Why not add a second vfsmount pointer per pid_namespace and make it initialized on demand, at the first attempt of no-pid mount? Just have a separate no-pid instance created for those namespaces where it had been asked for, with separate superblock and dentry tree not containing anything other that pid-only parts + self + thread-self...
Re: [RFC] Add option to mount only a pids subset
On Tue, Mar 07, 2017 at 12:05:16AM +0100, Alexey Gladkov wrote: > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 83de8b6..5bd1b84 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -1759,6 +1759,8 @@ struct super_operations { > int (*thaw_super) (struct super_block *); > int (*unfreeze_fs) (struct super_block *); > int (*statfs) (struct dentry *, struct kstatfs *); > + int (*getattr_fs) (struct vfsmount *, struct dentry *, struct kstat *); > + int (*mount_fs) (struct vfsmount *, int *, char *); > int (*remount_fs) (struct super_block *, int *, char *); > void (*umount_begin) (struct super_block *); > > diff --git a/include/linux/mount.h b/include/linux/mount.h > index 1172cce..4bd364e 100644 > --- a/include/linux/mount.h > +++ b/include/linux/mount.h > @@ -67,6 +67,7 @@ struct vfsmount { > struct dentry *mnt_root;/* root of the mounted tree */ > struct super_block *mnt_sb; /* pointer to superblock */ > int mnt_flags; > + void *fs_data; /* fs-specific data */ No. This is really asking for trouble - you *can't* hang fs-specific data structures off vfsmount. Lifetimes make no sense whatsoever. BTW, your patch leaks supreblock reference on failure, and is kludgy as hell wrt to what it's doing in procfs itself, but that's secondary - the quoted part is enough for a hard NAK on architectural grounds. Don't go there.
Re: [RFC] Add option to mount only a pids subset
On Tue, Mar 07, 2017 at 12:05:16AM +0100, Alexey Gladkov wrote: > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 83de8b6..5bd1b84 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -1759,6 +1759,8 @@ struct super_operations { > int (*thaw_super) (struct super_block *); > int (*unfreeze_fs) (struct super_block *); > int (*statfs) (struct dentry *, struct kstatfs *); > + int (*getattr_fs) (struct vfsmount *, struct dentry *, struct kstat *); > + int (*mount_fs) (struct vfsmount *, int *, char *); > int (*remount_fs) (struct super_block *, int *, char *); > void (*umount_begin) (struct super_block *); > > diff --git a/include/linux/mount.h b/include/linux/mount.h > index 1172cce..4bd364e 100644 > --- a/include/linux/mount.h > +++ b/include/linux/mount.h > @@ -67,6 +67,7 @@ struct vfsmount { > struct dentry *mnt_root;/* root of the mounted tree */ > struct super_block *mnt_sb; /* pointer to superblock */ > int mnt_flags; > + void *fs_data; /* fs-specific data */ No. This is really asking for trouble - you *can't* hang fs-specific data structures off vfsmount. Lifetimes make no sense whatsoever. BTW, your patch leaks supreblock reference on failure, and is kludgy as hell wrt to what it's doing in procfs itself, but that's secondary - the quoted part is enough for a hard NAK on architectural grounds. Don't go there.
Re: [RFC] Add option to mount only a pids subset
On Thu, Mar 09, 2017 at 12:26:49PM +0100, Djalal Harouni wrote: > I'm bit lost in the two discussion, however the main concern I was > discussing with Andy was if you have per superblock proc mounts then > each mount will end up with its own device ID st_dev, right now they > share the same ID if they are in the same pid namespace, but if we > change that then we may break the following: > http://man7.org/linux/man-pages/man7/namespaces.7.html In fact, nothing has changed. I added a parameter that affects the mountpoint, not the entire pid namespace. The procfs will still be global. The device ID will be the same as before. > Both new NS_GET_PARENT and NS_GET_USERNS ioctl() that return an fd, > suggests to follow up with fstat() to identify the namespaces.. > "By applying fstat(2) to the returned file descriptor, one obtains a > stat structure whose st_dev (resident device) and st_ino (inode > number) fields together identify the owning/parent namespace." > > Other /proc/self/ns/* comparison and stat() logic... > > Andy suggested that we may have the same st_dev for mounts in the same > pid namespace... I'm not sure which side effect this may bring! Basically we have here a issue because other proc options (hidepid for example) affect the entire pid namespace, but, I guess, have to affect the mountpoint. # grep ^proc /proc/mounts proc /proc proc rw,relatime 0 0 # mount -t proc proc /tmp/proc # mount -o remount,hidepid=2 -t proc proc /tmp/proc # grep ^proc /proc/mounts proc /proc proc rw,relatime,hidepid=2 0 0 proc /tmp/proc proc rw,relatime,hidepid=2 0 0 -- Rgrds, legion
Re: [RFC] Add option to mount only a pids subset
On Thu, Mar 09, 2017 at 12:26:49PM +0100, Djalal Harouni wrote: > I'm bit lost in the two discussion, however the main concern I was > discussing with Andy was if you have per superblock proc mounts then > each mount will end up with its own device ID st_dev, right now they > share the same ID if they are in the same pid namespace, but if we > change that then we may break the following: > http://man7.org/linux/man-pages/man7/namespaces.7.html In fact, nothing has changed. I added a parameter that affects the mountpoint, not the entire pid namespace. The procfs will still be global. The device ID will be the same as before. > Both new NS_GET_PARENT and NS_GET_USERNS ioctl() that return an fd, > suggests to follow up with fstat() to identify the namespaces.. > "By applying fstat(2) to the returned file descriptor, one obtains a > stat structure whose st_dev (resident device) and st_ino (inode > number) fields together identify the owning/parent namespace." > > Other /proc/self/ns/* comparison and stat() logic... > > Andy suggested that we may have the same st_dev for mounts in the same > pid namespace... I'm not sure which side effect this may bring! Basically we have here a issue because other proc options (hidepid for example) affect the entire pid namespace, but, I guess, have to affect the mountpoint. # grep ^proc /proc/mounts proc /proc proc rw,relatime 0 0 # mount -t proc proc /tmp/proc # mount -o remount,hidepid=2 -t proc proc /tmp/proc # grep ^proc /proc/mounts proc /proc proc rw,relatime,hidepid=2 0 0 proc /tmp/proc proc rw,relatime,hidepid=2 0 0 -- Rgrds, legion
Re: [RFC] Add option to mount only a pids subset
On Tue, Mar 07, 2017 at 08:24:12AM -0800, Andy Lutomirski wrote: > On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov> wrote: > > > > After discussion with Oleg Nesterov I reimplement my patch as an additional > > option for /proc. This option affects the mountpoint. It means that in one > > pid namespace it possible to have both the whole traditional /proc and > > /proc with only pids subset. > > > > I like this. I think you should split it into two patches, though: > one that reworks how procfs gets mounted and one that makes adds the > new functionality. Sure, but first I wanted to discuss the idea. My patch isn't very useful without the ability to add additional files. I will split it into parts in the new version. -- Rgrds, legion
Re: [RFC] Add option to mount only a pids subset
On Tue, Mar 07, 2017 at 08:24:12AM -0800, Andy Lutomirski wrote: > On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov > wrote: > > > > After discussion with Oleg Nesterov I reimplement my patch as an additional > > option for /proc. This option affects the mountpoint. It means that in one > > pid namespace it possible to have both the whole traditional /proc and > > /proc with only pids subset. > > > > I like this. I think you should split it into two patches, though: > one that reworks how procfs gets mounted and one that makes adds the > new functionality. Sure, but first I wanted to discuss the idea. My patch isn't very useful without the ability to add additional files. I will split it into parts in the new version. -- Rgrds, legion
Re: [RFC] Add option to mount only a pids subset
On Tue, Mar 07, 2017 at 06:49:09PM +0100, Oleg Nesterov wrote: > I can't really review this... but in any case I think you should split > this patch to separate the vfs and proc changes. > > On 03/07, Alexey Gladkov wrote: > > > > @@ -962,6 +963,14 @@ vfs_kern_mount(struct file_system_type *type, int > > flags, const char *name, void > > mnt->mnt.mnt_sb = root->d_sb; > > mnt->mnt_mountpoint = mnt->mnt.mnt_root; > > mnt->mnt_parent = mnt; > > + > > + err = do_mount_sb(>mnt, flags, data); > > + if(err) { > > + mnt_free_id(mnt); > > + free_vfsmnt(mnt); > > + return ERR_PTR(err); > > + } > > This duplicates the error handling, we do the same if mount_fs() fails. > Perhaps you should move these 2 lines into cleanup block and add goto's. > > > +int proc_getattrfs(struct vfsmount *mnt, struct dentry *dentry, struct > > kstat *stat) > > +{ > > + struct inode *inode = d_inode(dentry); > > + struct pid *pid = proc_pid(dentry->d_inode); > > + struct proc_options *opts = mnt->fs_data; > > + > > + if (opts && opts->pid_only && mnt->mnt_root != dentry && !pid) > > + return -ENOENT; > > Hmm. I don't quite understand why do we need this, and how this should work. > > Yes, "/bin/ls /pidonly-proc/sys" or opendir(/pidonly-proc/sys) should fail, > but only because they both do stat() ? > > Afaics you still can do open("/pidonly-proc/sys") + getdents() and this should > work ? Yes, you're right! I thought that getattr is called always together with open(). I wanted to prevent all attempts open() for not-pid directories. > I still think proc_dir_operations.open() makes more sense. Yes, as you pointed > out we also need to update proc_sys_dir_file_operations too and may be > something > else... My main task was to hide all possible direcitrices from the /proc (in pidonly mode)... even those which we do not know. In this case we can't rely on the fact that everyone will follow the rules and to properly handle open(). My current attempt was to force filesystem level check of mountpoint flag. This is necessary to avoid even the theoretical possibility of ignoring "pidonly" parameter. I guess I need to add callback to vfs_open or something to can be sure that we will not open the wrong file or directory in pidonly mode. -- Rgrds, legion
Re: [RFC] Add option to mount only a pids subset
On Tue, Mar 07, 2017 at 06:49:09PM +0100, Oleg Nesterov wrote: > I can't really review this... but in any case I think you should split > this patch to separate the vfs and proc changes. > > On 03/07, Alexey Gladkov wrote: > > > > @@ -962,6 +963,14 @@ vfs_kern_mount(struct file_system_type *type, int > > flags, const char *name, void > > mnt->mnt.mnt_sb = root->d_sb; > > mnt->mnt_mountpoint = mnt->mnt.mnt_root; > > mnt->mnt_parent = mnt; > > + > > + err = do_mount_sb(>mnt, flags, data); > > + if(err) { > > + mnt_free_id(mnt); > > + free_vfsmnt(mnt); > > + return ERR_PTR(err); > > + } > > This duplicates the error handling, we do the same if mount_fs() fails. > Perhaps you should move these 2 lines into cleanup block and add goto's. > > > +int proc_getattrfs(struct vfsmount *mnt, struct dentry *dentry, struct > > kstat *stat) > > +{ > > + struct inode *inode = d_inode(dentry); > > + struct pid *pid = proc_pid(dentry->d_inode); > > + struct proc_options *opts = mnt->fs_data; > > + > > + if (opts && opts->pid_only && mnt->mnt_root != dentry && !pid) > > + return -ENOENT; > > Hmm. I don't quite understand why do we need this, and how this should work. > > Yes, "/bin/ls /pidonly-proc/sys" or opendir(/pidonly-proc/sys) should fail, > but only because they both do stat() ? > > Afaics you still can do open("/pidonly-proc/sys") + getdents() and this should > work ? Yes, you're right! I thought that getattr is called always together with open(). I wanted to prevent all attempts open() for not-pid directories. > I still think proc_dir_operations.open() makes more sense. Yes, as you pointed > out we also need to update proc_sys_dir_file_operations too and may be > something > else... My main task was to hide all possible direcitrices from the /proc (in pidonly mode)... even those which we do not know. In this case we can't rely on the fact that everyone will follow the rules and to properly handle open(). My current attempt was to force filesystem level check of mountpoint flag. This is necessary to avoid even the theoretical possibility of ignoring "pidonly" parameter. I guess I need to add callback to vfs_open or something to can be sure that we will not open the wrong file or directory in pidonly mode. -- Rgrds, legion
Re: [RFC] Add option to mount only a pids subset
Djalal Harouniwrites: > On Tue, Mar 7, 2017 at 5:24 PM, Andy Lutomirski wrote: >> >> On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov >> wrote: >> > >> > After discussion with Oleg Nesterov I reimplement my patch as an additional >> > option for /proc. This option affects the mountpoint. It means that in one >> > pid namespace it possible to have both the whole traditional /proc and >> > /proc with only pids subset. >> > >> >> I like this. I think you should split it into two patches, though: >> one that reworks how procfs gets mounted and one that makes adds the >> new functionality. >> >> Djajal had some concerns about the first part breaking applications >> that use stat and expect certain behavior. This should be manageable, >> though, but making stat work appropriately. > > I'm bit lost in the two discussion, however the main concern I was > discussing with Andy was if you have per superblock proc mounts then > each mount will end up with its own device ID st_dev, right now they > share the same ID if they are in the same pid namespace, but if we > change that then we may break the following: > http://man7.org/linux/man-pages/man7/namespaces.7.html > > Both new NS_GET_PARENT and NS_GET_USERNS ioctl() that return an fd, > suggests to follow up with fstat() to identify the namespaces.. > "By applying fstat(2) to the returned file descriptor, one obtains a > stat structure whose st_dev (resident device) and st_ino (inode > number) fields together identify the owning/parent namespace." > > Other /proc/self/ns/* comparison and stat() logic... > > Andy suggested that we may have the same st_dev for mounts in the same > pid namespace... I'm not sure which side effect this may bring! Well right now it is less of an issue than you image. I suggest you stat /proc and /proc/self/ns/* and look at st_dev. That said anything that changes today's proc needs to be tested with a range of distro's especially those using selinux and apparmor as they tend to have policies that are very sensitive to the implementation details. Such that seemingly innocent changes can cause systems to stop booting. Eric
Re: [RFC] Add option to mount only a pids subset
Djalal Harouni writes: > On Tue, Mar 7, 2017 at 5:24 PM, Andy Lutomirski wrote: >> >> On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov >> wrote: >> > >> > After discussion with Oleg Nesterov I reimplement my patch as an additional >> > option for /proc. This option affects the mountpoint. It means that in one >> > pid namespace it possible to have both the whole traditional /proc and >> > /proc with only pids subset. >> > >> >> I like this. I think you should split it into two patches, though: >> one that reworks how procfs gets mounted and one that makes adds the >> new functionality. >> >> Djajal had some concerns about the first part breaking applications >> that use stat and expect certain behavior. This should be manageable, >> though, but making stat work appropriately. > > I'm bit lost in the two discussion, however the main concern I was > discussing with Andy was if you have per superblock proc mounts then > each mount will end up with its own device ID st_dev, right now they > share the same ID if they are in the same pid namespace, but if we > change that then we may break the following: > http://man7.org/linux/man-pages/man7/namespaces.7.html > > Both new NS_GET_PARENT and NS_GET_USERNS ioctl() that return an fd, > suggests to follow up with fstat() to identify the namespaces.. > "By applying fstat(2) to the returned file descriptor, one obtains a > stat structure whose st_dev (resident device) and st_ino (inode > number) fields together identify the owning/parent namespace." > > Other /proc/self/ns/* comparison and stat() logic... > > Andy suggested that we may have the same st_dev for mounts in the same > pid namespace... I'm not sure which side effect this may bring! Well right now it is less of an issue than you image. I suggest you stat /proc and /proc/self/ns/* and look at st_dev. That said anything that changes today's proc needs to be tested with a range of distro's especially those using selinux and apparmor as they tend to have policies that are very sensitive to the implementation details. Such that seemingly innocent changes can cause systems to stop booting. Eric
Re: [RFC] Add option to mount only a pids subset
On Tue, Mar 7, 2017 at 5:24 PM, Andy Lutomirskiwrote: > > On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov > wrote: > > > > After discussion with Oleg Nesterov I reimplement my patch as an additional > > option for /proc. This option affects the mountpoint. It means that in one > > pid namespace it possible to have both the whole traditional /proc and > > /proc with only pids subset. > > > > I like this. I think you should split it into two patches, though: > one that reworks how procfs gets mounted and one that makes adds the > new functionality. > > Djajal had some concerns about the first part breaking applications > that use stat and expect certain behavior. This should be manageable, > though, but making stat work appropriately. I'm bit lost in the two discussion, however the main concern I was discussing with Andy was if you have per superblock proc mounts then each mount will end up with its own device ID st_dev, right now they share the same ID if they are in the same pid namespace, but if we change that then we may break the following: http://man7.org/linux/man-pages/man7/namespaces.7.html Both new NS_GET_PARENT and NS_GET_USERNS ioctl() that return an fd, suggests to follow up with fstat() to identify the namespaces.. "By applying fstat(2) to the returned file descriptor, one obtains a stat structure whose st_dev (resident device) and st_ino (inode number) fields together identify the owning/parent namespace." Other /proc/self/ns/* comparison and stat() logic... Andy suggested that we may have the same st_dev for mounts in the same pid namespace... I'm not sure which side effect this may bring! Thanks! -- tixxdz
Re: [RFC] Add option to mount only a pids subset
On Tue, Mar 7, 2017 at 5:24 PM, Andy Lutomirski wrote: > > On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov > wrote: > > > > After discussion with Oleg Nesterov I reimplement my patch as an additional > > option for /proc. This option affects the mountpoint. It means that in one > > pid namespace it possible to have both the whole traditional /proc and > > /proc with only pids subset. > > > > I like this. I think you should split it into two patches, though: > one that reworks how procfs gets mounted and one that makes adds the > new functionality. > > Djajal had some concerns about the first part breaking applications > that use stat and expect certain behavior. This should be manageable, > though, but making stat work appropriately. I'm bit lost in the two discussion, however the main concern I was discussing with Andy was if you have per superblock proc mounts then each mount will end up with its own device ID st_dev, right now they share the same ID if they are in the same pid namespace, but if we change that then we may break the following: http://man7.org/linux/man-pages/man7/namespaces.7.html Both new NS_GET_PARENT and NS_GET_USERNS ioctl() that return an fd, suggests to follow up with fstat() to identify the namespaces.. "By applying fstat(2) to the returned file descriptor, one obtains a stat structure whose st_dev (resident device) and st_ino (inode number) fields together identify the owning/parent namespace." Other /proc/self/ns/* comparison and stat() logic... Andy suggested that we may have the same st_dev for mounts in the same pid namespace... I'm not sure which side effect this may bring! Thanks! -- tixxdz
Re: [RFC] Add option to mount only a pids subset
I can't really review this... but in any case I think you should split this patch to separate the vfs and proc changes. On 03/07, Alexey Gladkov wrote: > > @@ -962,6 +963,14 @@ vfs_kern_mount(struct file_system_type *type, int flags, > const char *name, void > mnt->mnt.mnt_sb = root->d_sb; > mnt->mnt_mountpoint = mnt->mnt.mnt_root; > mnt->mnt_parent = mnt; > + > + err = do_mount_sb(>mnt, flags, data); > + if(err) { > + mnt_free_id(mnt); > + free_vfsmnt(mnt); > + return ERR_PTR(err); > + } This duplicates the error handling, we do the same if mount_fs() fails. Perhaps you should move these 2 lines into cleanup block and add goto's. > +int proc_getattrfs(struct vfsmount *mnt, struct dentry *dentry, struct kstat > *stat) > +{ > + struct inode *inode = d_inode(dentry); > + struct pid *pid = proc_pid(dentry->d_inode); > + struct proc_options *opts = mnt->fs_data; > + > + if (opts && opts->pid_only && mnt->mnt_root != dentry && !pid) > + return -ENOENT; Hmm. I don't quite understand why do we need this, and how this should work. Yes, "/bin/ls /pidonly-proc/sys" or opendir(/pidonly-proc/sys) should fail, but only because they both do stat() ? Afaics you still can do open("/pidonly-proc/sys") + getdents() and this should work ? I still think proc_dir_operations.open() makes more sense. Yes, as you pointed out we also need to update proc_sys_dir_file_operations too and may be something else... > + > + if (!inode->i_op->getattr) { > + generic_fillattr(inode, stat); > + return 0; > + } > + > + return inode->i_op->getattr(mnt, dentry, stat); > +} Oh, it would be nice to not duplicate the code from the caller, imo. Oleg.
Re: [RFC] Add option to mount only a pids subset
I can't really review this... but in any case I think you should split this patch to separate the vfs and proc changes. On 03/07, Alexey Gladkov wrote: > > @@ -962,6 +963,14 @@ vfs_kern_mount(struct file_system_type *type, int flags, > const char *name, void > mnt->mnt.mnt_sb = root->d_sb; > mnt->mnt_mountpoint = mnt->mnt.mnt_root; > mnt->mnt_parent = mnt; > + > + err = do_mount_sb(>mnt, flags, data); > + if(err) { > + mnt_free_id(mnt); > + free_vfsmnt(mnt); > + return ERR_PTR(err); > + } This duplicates the error handling, we do the same if mount_fs() fails. Perhaps you should move these 2 lines into cleanup block and add goto's. > +int proc_getattrfs(struct vfsmount *mnt, struct dentry *dentry, struct kstat > *stat) > +{ > + struct inode *inode = d_inode(dentry); > + struct pid *pid = proc_pid(dentry->d_inode); > + struct proc_options *opts = mnt->fs_data; > + > + if (opts && opts->pid_only && mnt->mnt_root != dentry && !pid) > + return -ENOENT; Hmm. I don't quite understand why do we need this, and how this should work. Yes, "/bin/ls /pidonly-proc/sys" or opendir(/pidonly-proc/sys) should fail, but only because they both do stat() ? Afaics you still can do open("/pidonly-proc/sys") + getdents() and this should work ? I still think proc_dir_operations.open() makes more sense. Yes, as you pointed out we also need to update proc_sys_dir_file_operations too and may be something else... > + > + if (!inode->i_op->getattr) { > + generic_fillattr(inode, stat); > + return 0; > + } > + > + return inode->i_op->getattr(mnt, dentry, stat); > +} Oh, it would be nice to not duplicate the code from the caller, imo. Oleg.
Re: [RFC] Add option to mount only a pids subset
On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkovwrote: > > After discussion with Oleg Nesterov I reimplement my patch as an additional > option for /proc. This option affects the mountpoint. It means that in one > pid namespace it possible to have both the whole traditional /proc and > /proc with only pids subset. > I like this. I think you should split it into two patches, though: one that reworks how procfs gets mounted and one that makes adds the new functionality. Djajal had some concerns about the first part breaking applications that use stat and expect certain behavior. This should be manageable, though, but making stat work appropriately.
Re: [RFC] Add option to mount only a pids subset
On Mon, Mar 6, 2017 at 3:05 PM, Alexey Gladkov wrote: > > After discussion with Oleg Nesterov I reimplement my patch as an additional > option for /proc. This option affects the mountpoint. It means that in one > pid namespace it possible to have both the whole traditional /proc and > /proc with only pids subset. > I like this. I think you should split it into two patches, though: one that reworks how procfs gets mounted and one that makes adds the new functionality. Djajal had some concerns about the first part breaking applications that use stat and expect certain behavior. This should be manageable, though, but making stat work appropriately.
[RFC] Add option to mount only a pids subset
After discussion with Oleg Nesterov I reimplement my patch as an additional option for /proc. This option affects the mountpoint. It means that in one pid namespace it possible to have both the whole traditional /proc and /proc with only pids subset. However, it remains an open question about overlayfs support in the /proc. == Overview == Some of the container virtualization systems are mounted /proc inside the container. This is done in most cases to operate with information about the processes. Knowing that /proc filesystem is not fully virtualized they are mounted on top of dangerous places empty files or directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.). The structure of this filesystem is dynamic and any module can create a new object which will not necessarily be virtualized. There are proprietary modules that aren't in the mainline whose work we can not verify. This opens up a potential threat to the system. The developers of the virtualization system can't predict all dangerous places in /proc by definition. A more effective solution would be to mount into the container only what is necessary and ignore the rest. Right now there is the opportunity to pass in the container any port of the /proc filesystem using mount --bind expect the pids. This patch allows to mount only the part of /proc related to pids without rest objects. Since this is an option for /proc, flags applied to /proc have an effect on this subset of filesystem. Originally the idea was that the container will be mounted only pid sunset and additional required files will be mounted on top using the overlayfs. But I found out that /proc does not support overlayfs and does not allow to mount anything on top or under it. == TODO == There is still work to do: * Show pidonly via proc_show_options. * Add overlayfs support. --- fs/internal.h | 1 + fs/namespace.c| 9 fs/proc/generic.c | 4 fs/proc/inode.c | 7 +- fs/proc/internal.h| 8 +++ fs/proc/root.c| 62 +++ fs/stat.c | 4 fs/super.c| 20 + include/linux/fs.h| 2 ++ include/linux/mount.h | 1 + 10 files changed, 113 insertions(+), 5 deletions(-) diff --git a/fs/internal.h b/fs/internal.h index 4fcf517..cb44ca7 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -88,6 +88,7 @@ extern struct file *get_empty_filp(void); * super.c */ extern int do_remount_sb(struct super_block *, int, void *, int); +extern int do_mount_sb(struct vfsmount *mnt, int flags, void *data); extern bool trylock_super(struct super_block *sb); extern struct dentry *mount_fs(struct file_system_type *, int, const char *, void *); diff --git a/fs/namespace.c b/fs/namespace.c index e6c234b..6cb2bcb 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -938,6 +938,7 @@ static struct mount *skip_mnt_tree(struct mount *p) struct vfsmount * vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void *data) { + int err; struct mount *mnt; struct dentry *root; @@ -962,6 +963,14 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void mnt->mnt.mnt_sb = root->d_sb; mnt->mnt_mountpoint = mnt->mnt.mnt_root; mnt->mnt_parent = mnt; + + err = do_mount_sb(>mnt, flags, data); + if(err) { + mnt_free_id(mnt); + free_vfsmnt(mnt); + return ERR_PTR(err); + } + lock_mount_hash(); list_add_tail(>mnt_instance, >d_sb->s_mounts); unlock_mount_hash(); diff --git a/fs/proc/generic.c b/fs/proc/generic.c index 7eb3cef..dd3da60 100644 --- a/fs/proc/generic.c +++ b/fs/proc/generic.c @@ -306,6 +306,10 @@ int proc_readdir_de(struct proc_dir_entry *de, struct file *file, int proc_readdir(struct file *file, struct dir_context *ctx) { struct inode *inode = file_inode(file); + struct proc_options *opts = file->f_path.mnt->fs_data; + + if (opts && opts->pid_only) + return 1; return proc_readdir_de(PDE(inode), file, ctx); } diff --git a/fs/proc/inode.c b/fs/proc/inode.c index 783bc19..51b1712 100644 --- a/fs/proc/inode.c +++ b/fs/proc/inode.c @@ -108,7 +108,6 @@ static int proc_show_options(struct seq_file *seq, struct dentry *root) seq_printf(seq, ",gid=%u", from_kgid_munged(_user_ns, pid->pid_gid)); if (pid->hide_pid != 0) seq_printf(seq, ",hidepid=%u", pid->hide_pid); - return 0; } @@ -118,6 +117,8 @@ static const struct super_operations proc_sops = { .drop_inode = generic_delete_inode, .evict_inode= proc_evict_inode, .statfs = simple_statfs, + .getattr_fs = proc_getattrfs, + .mount_fs = proc_mount_cb, .remount_fs = proc_remount, .show_options =
[RFC] Add option to mount only a pids subset
After discussion with Oleg Nesterov I reimplement my patch as an additional option for /proc. This option affects the mountpoint. It means that in one pid namespace it possible to have both the whole traditional /proc and /proc with only pids subset. However, it remains an open question about overlayfs support in the /proc. == Overview == Some of the container virtualization systems are mounted /proc inside the container. This is done in most cases to operate with information about the processes. Knowing that /proc filesystem is not fully virtualized they are mounted on top of dangerous places empty files or directories (for exmaple /proc/sys, /proc/kcore, /sys/firmware, etc.). The structure of this filesystem is dynamic and any module can create a new object which will not necessarily be virtualized. There are proprietary modules that aren't in the mainline whose work we can not verify. This opens up a potential threat to the system. The developers of the virtualization system can't predict all dangerous places in /proc by definition. A more effective solution would be to mount into the container only what is necessary and ignore the rest. Right now there is the opportunity to pass in the container any port of the /proc filesystem using mount --bind expect the pids. This patch allows to mount only the part of /proc related to pids without rest objects. Since this is an option for /proc, flags applied to /proc have an effect on this subset of filesystem. Originally the idea was that the container will be mounted only pid sunset and additional required files will be mounted on top using the overlayfs. But I found out that /proc does not support overlayfs and does not allow to mount anything on top or under it. == TODO == There is still work to do: * Show pidonly via proc_show_options. * Add overlayfs support. --- fs/internal.h | 1 + fs/namespace.c| 9 fs/proc/generic.c | 4 fs/proc/inode.c | 7 +- fs/proc/internal.h| 8 +++ fs/proc/root.c| 62 +++ fs/stat.c | 4 fs/super.c| 20 + include/linux/fs.h| 2 ++ include/linux/mount.h | 1 + 10 files changed, 113 insertions(+), 5 deletions(-) diff --git a/fs/internal.h b/fs/internal.h index 4fcf517..cb44ca7 100644 --- a/fs/internal.h +++ b/fs/internal.h @@ -88,6 +88,7 @@ extern struct file *get_empty_filp(void); * super.c */ extern int do_remount_sb(struct super_block *, int, void *, int); +extern int do_mount_sb(struct vfsmount *mnt, int flags, void *data); extern bool trylock_super(struct super_block *sb); extern struct dentry *mount_fs(struct file_system_type *, int, const char *, void *); diff --git a/fs/namespace.c b/fs/namespace.c index e6c234b..6cb2bcb 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -938,6 +938,7 @@ static struct mount *skip_mnt_tree(struct mount *p) struct vfsmount * vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void *data) { + int err; struct mount *mnt; struct dentry *root; @@ -962,6 +963,14 @@ vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void mnt->mnt.mnt_sb = root->d_sb; mnt->mnt_mountpoint = mnt->mnt.mnt_root; mnt->mnt_parent = mnt; + + err = do_mount_sb(>mnt, flags, data); + if(err) { + mnt_free_id(mnt); + free_vfsmnt(mnt); + return ERR_PTR(err); + } + lock_mount_hash(); list_add_tail(>mnt_instance, >d_sb->s_mounts); unlock_mount_hash(); diff --git a/fs/proc/generic.c b/fs/proc/generic.c index 7eb3cef..dd3da60 100644 --- a/fs/proc/generic.c +++ b/fs/proc/generic.c @@ -306,6 +306,10 @@ int proc_readdir_de(struct proc_dir_entry *de, struct file *file, int proc_readdir(struct file *file, struct dir_context *ctx) { struct inode *inode = file_inode(file); + struct proc_options *opts = file->f_path.mnt->fs_data; + + if (opts && opts->pid_only) + return 1; return proc_readdir_de(PDE(inode), file, ctx); } diff --git a/fs/proc/inode.c b/fs/proc/inode.c index 783bc19..51b1712 100644 --- a/fs/proc/inode.c +++ b/fs/proc/inode.c @@ -108,7 +108,6 @@ static int proc_show_options(struct seq_file *seq, struct dentry *root) seq_printf(seq, ",gid=%u", from_kgid_munged(_user_ns, pid->pid_gid)); if (pid->hide_pid != 0) seq_printf(seq, ",hidepid=%u", pid->hide_pid); - return 0; } @@ -118,6 +117,8 @@ static const struct super_operations proc_sops = { .drop_inode = generic_delete_inode, .evict_inode= proc_evict_inode, .statfs = simple_statfs, + .getattr_fs = proc_getattrfs, + .mount_fs = proc_mount_cb, .remount_fs = proc_remount, .show_options =