Re: [PATCH] procfs: change the owner of non-dumpable and writeable files
AKA it should be this fix that removes the need for your dumpable setting. bfedb589252c ("mm: Add a user_ns owner to mm_struct and fix ptrace permission checks") I will check, though from what I recall that patch doesn't fix the ptrace_may_access checks. Not to mention it won't help if the container doesn't have it's own user namespace. That change very much is to the ptrace_may_access checks. Sorry, you're right. I misremembered the patch. You are not playing with setgroups if you don't have your own user namespace. So I don't see how the other cases are relevant. There's also oom_score_adj and a few others. User namespaces were just the first example that I hit. What I think I would do in the situation you describe is to join what you are going to join. Limit yourself to creating pid namespaces with unshare. If you are joining a user namespace set undumpable. If you are creating a user namespace create it and then set undumpable. I just tried to implement this, and it works _okay_. Except that oom_scoore_adj also is no longer writeable if the process is not dumpable -- and the "create container" and "modify process" stages in runC are very separate and there's no easy way of reconciling the two. Ultimately this only affects rootless containers, where security is simply not a major concern. However it is a bit frustrating that a process which is not dumpable cannot even modify _its own_ /proc/self/... files. My current fix is to just not set dumpable for rootless containers, as it seems there's no proper way of setting dumpable in this context. It appears as though the only way that dumpable works is by just using CAP_DAC_OVERRIDE and completely ignoring the dumpable restrictions. Clearing dumpable is to help not leak things into a container when you call setns on a user namespace. It is also to help not leak things into a container when you join other namespaces. Most notably the PID namespace. Except that you don't strictly join a PID namespace. You set a context for children to run in a different PID namespace. So you are safe from PID namespaces as long as you don't call fork. But you need to fork eventually if you want to set settings on the process which will eventually be the container process (PID 1 in the new container). You can't exec before then, you need to fork first. + if (mode != (S_IFDIR|S_IRUGO|S_IXUGO)) { I'd just like to draw your attention to this special case -- why is this special cased? What was the original reasoning behind it? Does it make sense for a non-dumpable process to allow someone to change the mode of some random /proc/[pid]/ directories? This has nothing at all to do with changing modes and is all about what uid/gid are set on the proc inode. Usually it is the uid/gid of the process in question but occassionally for undumpable processes it is root/root to prevent people from accessing the files in question. My question was "why are o+rx directories set to the process's e[ug]id but other inodes are set to the root [ug]id?". And it's the other way around -- only for two directories on my system is it the case that the process has /proc/pid/... inodes owned by the process's e[ug]id (the rest are owned by root). And it is about modes, because once you're the owner of a file you can change its mode (allowing other processes to access the inodes). I'm not sure what other purpose changing the ownership *of a directory* serves -- you cannot create new files (or unlink files) under /proc/self/... directories so u+w permissions aren't actually "useful" (as far as I can tell). I get the feeling that some of this logic is a bit iffy. It looks like I forgot to carry forward the comment that explains that case in my patch. Something I need to fix before I merge it. /* * Before the /proc/pid/status file was created the only way to read * the effective uid of a /process was to stat /proc/pid. Reading * /proc/pid/status is slow enough that procps and other packages * kept stating /proc/pid. To keep the rules in /proc simple I have * made this apply to all per process world readable and executable * directories. */ Or in short. I broke ps when I removed all of the special cases, and to fix ps I added the existing special case. Not that the uid or gid of a directory that the whole world can access matters. Okay, but why is it being applied to _all_ subdirectories of /proc/pid? Why not make it *only* set the owner of /proc/pid and nothing else? Looks like that was not intended given the reasons you just provided. -- Aleksa Sarai Software Engineer (Containers) SUSE Linux GmbH https://www.cyphar.com/
Re: [PATCH] procfs: change the owner of non-dumpable and writeable files
AKA it should be this fix that removes the need for your dumpable setting. bfedb589252c ("mm: Add a user_ns owner to mm_struct and fix ptrace permission checks") I will check, though from what I recall that patch doesn't fix the ptrace_may_access checks. Not to mention it won't help if the container doesn't have it's own user namespace. That change very much is to the ptrace_may_access checks. Sorry, you're right. I misremembered the patch. You are not playing with setgroups if you don't have your own user namespace. So I don't see how the other cases are relevant. There's also oom_score_adj and a few others. User namespaces were just the first example that I hit. What I think I would do in the situation you describe is to join what you are going to join. Limit yourself to creating pid namespaces with unshare. If you are joining a user namespace set undumpable. If you are creating a user namespace create it and then set undumpable. I just tried to implement this, and it works _okay_. Except that oom_scoore_adj also is no longer writeable if the process is not dumpable -- and the "create container" and "modify process" stages in runC are very separate and there's no easy way of reconciling the two. Ultimately this only affects rootless containers, where security is simply not a major concern. However it is a bit frustrating that a process which is not dumpable cannot even modify _its own_ /proc/self/... files. My current fix is to just not set dumpable for rootless containers, as it seems there's no proper way of setting dumpable in this context. It appears as though the only way that dumpable works is by just using CAP_DAC_OVERRIDE and completely ignoring the dumpable restrictions. Clearing dumpable is to help not leak things into a container when you call setns on a user namespace. It is also to help not leak things into a container when you join other namespaces. Most notably the PID namespace. Except that you don't strictly join a PID namespace. You set a context for children to run in a different PID namespace. So you are safe from PID namespaces as long as you don't call fork. But you need to fork eventually if you want to set settings on the process which will eventually be the container process (PID 1 in the new container). You can't exec before then, you need to fork first. + if (mode != (S_IFDIR|S_IRUGO|S_IXUGO)) { I'd just like to draw your attention to this special case -- why is this special cased? What was the original reasoning behind it? Does it make sense for a non-dumpable process to allow someone to change the mode of some random /proc/[pid]/ directories? This has nothing at all to do with changing modes and is all about what uid/gid are set on the proc inode. Usually it is the uid/gid of the process in question but occassionally for undumpable processes it is root/root to prevent people from accessing the files in question. My question was "why are o+rx directories set to the process's e[ug]id but other inodes are set to the root [ug]id?". And it's the other way around -- only for two directories on my system is it the case that the process has /proc/pid/... inodes owned by the process's e[ug]id (the rest are owned by root). And it is about modes, because once you're the owner of a file you can change its mode (allowing other processes to access the inodes). I'm not sure what other purpose changing the ownership *of a directory* serves -- you cannot create new files (or unlink files) under /proc/self/... directories so u+w permissions aren't actually "useful" (as far as I can tell). I get the feeling that some of this logic is a bit iffy. It looks like I forgot to carry forward the comment that explains that case in my patch. Something I need to fix before I merge it. /* * Before the /proc/pid/status file was created the only way to read * the effective uid of a /process was to stat /proc/pid. Reading * /proc/pid/status is slow enough that procps and other packages * kept stating /proc/pid. To keep the rules in /proc simple I have * made this apply to all per process world readable and executable * directories. */ Or in short. I broke ps when I removed all of the special cases, and to fix ps I added the existing special case. Not that the uid or gid of a directory that the whole world can access matters. Okay, but why is it being applied to _all_ subdirectories of /proc/pid? Why not make it *only* set the owner of /proc/pid and nothing else? Looks like that was not intended given the reasons you just provided. -- Aleksa Sarai Software Engineer (Containers) SUSE Linux GmbH https://www.cyphar.com/
Re: [PATCH] procfs: change the owner of non-dumpable and writeable files
The patch under discussion just appears wrong. Part of what we are detecting when we ask if a task is dumpable is if a process might have under gone a security transition. If a process has undergone a security transition (calling setuid or the like) it is quite possible that everyone with it's current effective credentials should not be able to access that process. Please verify but the ptrace issue that allowed processes in a container to call setns on our processes should be fixed as of 4.10-rc1. And the change has been marked for backporting. AKA it should be this fix that removes the need for your dumpable setting. Fixes: bfedb589252c ("mm: Add a user_ns owner to mm_struct and fix ptrace permission checks") Now with that said I believe we want to add the following change now that dumpable is user namespace relative. That will use not the GLOBAL_ROOT_UID/GID but instead uid and gid 0 in the namespace that dumpable is relative too. But ugh! Your case is even more confused that I had first noticed. Saying that a processes is undumpable is completely unnecessary when you are entering into a new fresh user namespace. Touching setgroups at any point where there are other processes in the namespace makes no sense whatsoever. Clearing dumpable is to help not leak things into a container when you call setns on a user namespace. So my general impression of your use case is don't do that, it isn't necessary and it causes you problems. At the same time here is what I am looking at to better set the uids and gids of proc files. From: "Eric W. Biederman"Date: Tue, 3 Jan 2017 10:23:11 +1300 Subject: [PATCH] proc: Better ownership of files for non-dumpable tasks in user namespaces Instead of making the files owned by the GLOBAL_ROOT_USER. Make non-dumpable files whose mm has always lived in a user namespace owned by the user namespace root. This allows the container root to have things work as expected in a container. Signed-off-by: "Eric W. Biederman" --- fs/proc/base.c | 88 +++--- fs/proc/fd.c | 12 +--- fs/proc/internal.h | 16 ++ 3 files changed, 53 insertions(+), 63 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index 8e7e61b28f31..e96487af2307 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1667,12 +1667,55 @@ const struct inode_operations proc_pid_link_inode_operations = { /* building an inode */ +void task_dump_owner(struct task_struct *task, mode_t mode, +kuid_t *ruid, kgid_t *rgid) +{ + /* Depending on the state of dumpable compute who should own a +* proc file for a task. +*/ + const struct cred *cred; + kuid_t uid; + kgid_t gid; + + /* Default to the tasks effective ownership */ + rcu_read_lock(); + cred = __task_cred(task); + uid = cred->euid; + gid = cred->egid; + rcu_read_unlock(); + + if (mode != (S_IFDIR|S_IRUGO|S_IXUGO)) { + struct mm_struct *mm; + task_lock(task); + mm = task->mm; + /* Make non-dumpable tasks owned by some root */ + if (mm) { + if (get_dumpable(mm) != SUID_DUMP_USER) { + struct user_namespace *user_ns = mm->user_ns; + + uid = make_kuid(user_ns, 0); + if (!uid_valid(uid)) + uid = GLOBAL_ROOT_UID; + + gid = make_kgid(user_ns, 0); + if (!gid_valid(gid)) + gid = GLOBAL_ROOT_GID; + } + } else { + uid = GLOBAL_ROOT_UID; + gid = GLOBAL_ROOT_GID; + } + task_unlock(task); + } + *ruid = uid; + *rgid = gid; +} + struct inode *proc_pid_make_inode(struct super_block * sb, struct task_struct *task, umode_t mode) { struct inode * inode; struct proc_inode *ei; - const struct cred *cred; /* We need a new inode */ @@ -1694,13 +1737,7 @@ struct inode *proc_pid_make_inode(struct super_block * sb, if (!ei->pid) goto out_unlock; - if (task_dumpable(task)) { - rcu_read_lock(); - cred = __task_cred(task); - inode->i_uid = cred->euid; - inode->i_gid = cred->egid; - rcu_read_unlock(); - } + task_dump_owner(task, 0, >i_uid, >i_gid); security_task_to_inode(task, inode); out: @@ -1715,7 +1752,6 @@ int pid_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat) { struct inode *inode = d_inode(dentry); struct task_struct *task; - const struct cred *cred; struct pid_namespace
Re: [PATCH] procfs: change the owner of non-dumpable and writeable files
The patch under discussion just appears wrong. Part of what we are detecting when we ask if a task is dumpable is if a process might have under gone a security transition. If a process has undergone a security transition (calling setuid or the like) it is quite possible that everyone with it's current effective credentials should not be able to access that process. Please verify but the ptrace issue that allowed processes in a container to call setns on our processes should be fixed as of 4.10-rc1. And the change has been marked for backporting. AKA it should be this fix that removes the need for your dumpable setting. Fixes: bfedb589252c ("mm: Add a user_ns owner to mm_struct and fix ptrace permission checks") Now with that said I believe we want to add the following change now that dumpable is user namespace relative. That will use not the GLOBAL_ROOT_UID/GID but instead uid and gid 0 in the namespace that dumpable is relative too. But ugh! Your case is even more confused that I had first noticed. Saying that a processes is undumpable is completely unnecessary when you are entering into a new fresh user namespace. Touching setgroups at any point where there are other processes in the namespace makes no sense whatsoever. Clearing dumpable is to help not leak things into a container when you call setns on a user namespace. So my general impression of your use case is don't do that, it isn't necessary and it causes you problems. At the same time here is what I am looking at to better set the uids and gids of proc files. From: "Eric W. Biederman" Date: Tue, 3 Jan 2017 10:23:11 +1300 Subject: [PATCH] proc: Better ownership of files for non-dumpable tasks in user namespaces Instead of making the files owned by the GLOBAL_ROOT_USER. Make non-dumpable files whose mm has always lived in a user namespace owned by the user namespace root. This allows the container root to have things work as expected in a container. Signed-off-by: "Eric W. Biederman" --- fs/proc/base.c | 88 +++--- fs/proc/fd.c | 12 +--- fs/proc/internal.h | 16 ++ 3 files changed, 53 insertions(+), 63 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index 8e7e61b28f31..e96487af2307 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1667,12 +1667,55 @@ const struct inode_operations proc_pid_link_inode_operations = { /* building an inode */ +void task_dump_owner(struct task_struct *task, mode_t mode, +kuid_t *ruid, kgid_t *rgid) +{ + /* Depending on the state of dumpable compute who should own a +* proc file for a task. +*/ + const struct cred *cred; + kuid_t uid; + kgid_t gid; + + /* Default to the tasks effective ownership */ + rcu_read_lock(); + cred = __task_cred(task); + uid = cred->euid; + gid = cred->egid; + rcu_read_unlock(); + + if (mode != (S_IFDIR|S_IRUGO|S_IXUGO)) { + struct mm_struct *mm; + task_lock(task); + mm = task->mm; + /* Make non-dumpable tasks owned by some root */ + if (mm) { + if (get_dumpable(mm) != SUID_DUMP_USER) { + struct user_namespace *user_ns = mm->user_ns; + + uid = make_kuid(user_ns, 0); + if (!uid_valid(uid)) + uid = GLOBAL_ROOT_UID; + + gid = make_kgid(user_ns, 0); + if (!gid_valid(gid)) + gid = GLOBAL_ROOT_GID; + } + } else { + uid = GLOBAL_ROOT_UID; + gid = GLOBAL_ROOT_GID; + } + task_unlock(task); + } + *ruid = uid; + *rgid = gid; +} + struct inode *proc_pid_make_inode(struct super_block * sb, struct task_struct *task, umode_t mode) { struct inode * inode; struct proc_inode *ei; - const struct cred *cred; /* We need a new inode */ @@ -1694,13 +1737,7 @@ struct inode *proc_pid_make_inode(struct super_block * sb, if (!ei->pid) goto out_unlock; - if (task_dumpable(task)) { - rcu_read_lock(); - cred = __task_cred(task); - inode->i_uid = cred->euid; - inode->i_gid = cred->egid; - rcu_read_unlock(); - } + task_dump_owner(task, 0, >i_uid, >i_gid); security_task_to_inode(task, inode); out: @@ -1715,7 +1752,6 @@ int pid_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat) { struct inode *inode = d_inode(dentry); struct task_struct *task; - const struct cred *cred; struct pid_namespace *pid = dentry->d_sb->s_fs_info;
Re: [PATCH] procfs: change the owner of non-dumpable and writeable files
Aleksa Saraiwrites: >> Please verify but the ptrace issue that allowed processes in a container >> to call setns on our processes should be fixed as of 4.10-rc1. And the >> change has been marked for backporting. > > ptrace(2) is not the only issue, the issue that we had in runC is that > a process joining a namespace may have file descriptors that refer to > the host filesystem. If the process joining is dumpable, a racing > process inside the container can access those file descriptors through > the /proc/[pid]/fd/... mechanism. > > See CVE-2016-9962. Wow, I said that badly. But the issue is ptrace_attach and the ptrace_may_access checks. >> AKA it should be this fix that removes the need for your dumpable setting. >> bfedb589252c ("mm: Add a user_ns owner to mm_struct and fix ptrace >> permission checks") > > I will check, though from what I recall that patch doesn't fix the > ptrace_may_access checks. Not to mention it won't help if the > container doesn't have it's own user namespace. That change very much is to the ptrace_may_access checks. You are not playing with setgroups if you don't have your own user namespace. So I don't see how the other cases are relevant. Going further there are only two namespaces that are only 3 pieces of information that I see are relevant in this context. The pid namespace, the user namespace, and the process credentials. You can not join a PID namespace you can only fork into one. The user namespace is relevant in older kernels as joining a user namespace would allow ptrace_has_cap check in ptrace_may_access to pass. After the change listed above that check does not start passing until after exec. The credentials on your process are relevant as changing those allow the ptrace_may_access checks to start passing for other sets of processes. What I think I would do in the situation you describe is to join what you are going to join. Limit yourself to creating pid namespaces with unshare. If you are joining a user namespace set undumpable. If you are creating a user namespace create it and then set undumpable. fork your process into the new pid namespace. change your pocesses credentials. exec whatever you are execing, letting close_on_exec clean up for you. >> Now with that said I believe we want to add the following change now >> that dumpable is user namespace relative. That will use not the >> GLOBAL_ROOT_UID/GID but instead uid and gid 0 in the namespace >> that dumpable is relative too. > > Sure, but that's tangential to the issue under discussion. It is pretty much the only case I see worth considering. Certainly I don't see setgroups as a motivational example. >> But ugh! Your case is even more confused that I had first noticed. >> Saying that a processes is undumpable is completely unnecessary >> when you are entering into a new fresh user namespace. Touching >> setgroups at any point where there are other processes in the namespace >> makes no sense whatsoever. > > Currently in runC the ordering for mixed create-and-join namespaces is > that we first join existing namespaces and _then_ create new ones. So > we need to be non-dumpable to avoid the problem in CVE-2016-9962. Yes mixed create and join is trickier than a pure create and use case, and trickier than a pure join case. But see above it looks quite doable to me to avoid being vulnerable while using the existing permissions even in the case you describe. >> Clearing dumpable is to help not leak things >> into a container when you call setns on a user namespace. > > It is also to help not leak things into a container when you join > other namespaces. Most notably the PID namespace. Except that you don't strictly join a PID namespace. You set a context for children to run in a different PID namespace. So you are safe from PID namespaces as long as you don't call fork. >> +if (mode != (S_IFDIR|S_IRUGO|S_IXUGO)) { > > I'd just like to draw your attention to this special case -- why is > this special cased? What was the original reasoning behind it? Does it > make sense for a non-dumpable process to allow someone to change the > mode of some random /proc/[pid]/ directories? This has nothing at all to do with changing modes and is all about what uid/gid are set on the proc inode. Usually it is the uid/gid of the process in question but occassionally for undumpable processes it is root/root to prevent people from accessing the files in question. > I get the feeling that some of this logic is a bit iffy. It looks like I forgot to carry forward the comment that explains that case in my patch. Something I need to fix before I merge it. /* * Before the /proc/pid/status file was created the only way to read * the effective uid of a /process was to stat /proc/pid. Reading * /proc/pid/status is slow enough that procps and other packages * kept stating /proc/pid. To keep the rules in /proc simple I have * made this apply to all per process world
Re: [PATCH] procfs: change the owner of non-dumpable and writeable files
Aleksa Sarai writes: >> Please verify but the ptrace issue that allowed processes in a container >> to call setns on our processes should be fixed as of 4.10-rc1. And the >> change has been marked for backporting. > > ptrace(2) is not the only issue, the issue that we had in runC is that > a process joining a namespace may have file descriptors that refer to > the host filesystem. If the process joining is dumpable, a racing > process inside the container can access those file descriptors through > the /proc/[pid]/fd/... mechanism. > > See CVE-2016-9962. Wow, I said that badly. But the issue is ptrace_attach and the ptrace_may_access checks. >> AKA it should be this fix that removes the need for your dumpable setting. >> bfedb589252c ("mm: Add a user_ns owner to mm_struct and fix ptrace >> permission checks") > > I will check, though from what I recall that patch doesn't fix the > ptrace_may_access checks. Not to mention it won't help if the > container doesn't have it's own user namespace. That change very much is to the ptrace_may_access checks. You are not playing with setgroups if you don't have your own user namespace. So I don't see how the other cases are relevant. Going further there are only two namespaces that are only 3 pieces of information that I see are relevant in this context. The pid namespace, the user namespace, and the process credentials. You can not join a PID namespace you can only fork into one. The user namespace is relevant in older kernels as joining a user namespace would allow ptrace_has_cap check in ptrace_may_access to pass. After the change listed above that check does not start passing until after exec. The credentials on your process are relevant as changing those allow the ptrace_may_access checks to start passing for other sets of processes. What I think I would do in the situation you describe is to join what you are going to join. Limit yourself to creating pid namespaces with unshare. If you are joining a user namespace set undumpable. If you are creating a user namespace create it and then set undumpable. fork your process into the new pid namespace. change your pocesses credentials. exec whatever you are execing, letting close_on_exec clean up for you. >> Now with that said I believe we want to add the following change now >> that dumpable is user namespace relative. That will use not the >> GLOBAL_ROOT_UID/GID but instead uid and gid 0 in the namespace >> that dumpable is relative too. > > Sure, but that's tangential to the issue under discussion. It is pretty much the only case I see worth considering. Certainly I don't see setgroups as a motivational example. >> But ugh! Your case is even more confused that I had first noticed. >> Saying that a processes is undumpable is completely unnecessary >> when you are entering into a new fresh user namespace. Touching >> setgroups at any point where there are other processes in the namespace >> makes no sense whatsoever. > > Currently in runC the ordering for mixed create-and-join namespaces is > that we first join existing namespaces and _then_ create new ones. So > we need to be non-dumpable to avoid the problem in CVE-2016-9962. Yes mixed create and join is trickier than a pure create and use case, and trickier than a pure join case. But see above it looks quite doable to me to avoid being vulnerable while using the existing permissions even in the case you describe. >> Clearing dumpable is to help not leak things >> into a container when you call setns on a user namespace. > > It is also to help not leak things into a container when you join > other namespaces. Most notably the PID namespace. Except that you don't strictly join a PID namespace. You set a context for children to run in a different PID namespace. So you are safe from PID namespaces as long as you don't call fork. >> +if (mode != (S_IFDIR|S_IRUGO|S_IXUGO)) { > > I'd just like to draw your attention to this special case -- why is > this special cased? What was the original reasoning behind it? Does it > make sense for a non-dumpable process to allow someone to change the > mode of some random /proc/[pid]/ directories? This has nothing at all to do with changing modes and is all about what uid/gid are set on the proc inode. Usually it is the uid/gid of the process in question but occassionally for undumpable processes it is root/root to prevent people from accessing the files in question. > I get the feeling that some of this logic is a bit iffy. It looks like I forgot to carry forward the comment that explains that case in my patch. Something I need to fix before I merge it. /* * Before the /proc/pid/status file was created the only way to read * the effective uid of a /process was to stat /proc/pid. Reading * /proc/pid/status is slow enough that procps and other packages * kept stating /proc/pid. To keep the rules in /proc simple I have * made this apply to all per process world readable and
Re: [PATCH] procfs: change the owner of non-dumpable and writeable files
Please verify but the ptrace issue that allowed processes in a container to call setns on our processes should be fixed as of 4.10-rc1. And the change has been marked for backporting. ptrace(2) is not the only issue, the issue that we had in runC is that a process joining a namespace may have file descriptors that refer to the host filesystem. If the process joining is dumpable, a racing process inside the container can access those file descriptors through the /proc/[pid]/fd/... mechanism. See CVE-2016-9962. AKA it should be this fix that removes the need for your dumpable setting. Fixes: bfedb589252c ("mm: Add a user_ns owner to mm_struct and fix ptrace permission checks") I will check, though from what I recall that patch doesn't fix the ptrace_may_access checks. Not to mention it won't help if the container doesn't have it's own user namespace. Now with that said I believe we want to add the following change now that dumpable is user namespace relative. That will use not the GLOBAL_ROOT_UID/GID but instead uid and gid 0 in the namespace that dumpable is relative too. Sure, but that's tangential to the issue under discussion. But ugh! Your case is even more confused that I had first noticed. Saying that a processes is undumpable is completely unnecessary when you are entering into a new fresh user namespace. Touching setgroups at any point where there are other processes in the namespace makes no sense whatsoever. Currently in runC the ordering for mixed create-and-join namespaces is that we first join existing namespaces and _then_ create new ones. So we need to be non-dumpable to avoid the problem in CVE-2016-9962. Clearing dumpable is to help not leak things into a container when you call setns on a user namespace. It is also to help not leak things into a container when you join other namespaces. Most notably the PID namespace. + if (mode != (S_IFDIR|S_IRUGO|S_IXUGO)) { I'd just like to draw your attention to this special case -- why is this special cased? What was the original reasoning behind it? Does it make sense for a non-dumpable process to allow someone to change the mode of some random /proc/[pid]/ directories? I get the feeling that some of this logic is a bit iffy. -- Aleksa Sarai Software Engineer (Containers) SUSE Linux GmbH https://www.cyphar.com/
Re: [PATCH] procfs: change the owner of non-dumpable and writeable files
Please verify but the ptrace issue that allowed processes in a container to call setns on our processes should be fixed as of 4.10-rc1. And the change has been marked for backporting. ptrace(2) is not the only issue, the issue that we had in runC is that a process joining a namespace may have file descriptors that refer to the host filesystem. If the process joining is dumpable, a racing process inside the container can access those file descriptors through the /proc/[pid]/fd/... mechanism. See CVE-2016-9962. AKA it should be this fix that removes the need for your dumpable setting. Fixes: bfedb589252c ("mm: Add a user_ns owner to mm_struct and fix ptrace permission checks") I will check, though from what I recall that patch doesn't fix the ptrace_may_access checks. Not to mention it won't help if the container doesn't have it's own user namespace. Now with that said I believe we want to add the following change now that dumpable is user namespace relative. That will use not the GLOBAL_ROOT_UID/GID but instead uid and gid 0 in the namespace that dumpable is relative too. Sure, but that's tangential to the issue under discussion. But ugh! Your case is even more confused that I had first noticed. Saying that a processes is undumpable is completely unnecessary when you are entering into a new fresh user namespace. Touching setgroups at any point where there are other processes in the namespace makes no sense whatsoever. Currently in runC the ordering for mixed create-and-join namespaces is that we first join existing namespaces and _then_ create new ones. So we need to be non-dumpable to avoid the problem in CVE-2016-9962. Clearing dumpable is to help not leak things into a container when you call setns on a user namespace. It is also to help not leak things into a container when you join other namespaces. Most notably the PID namespace. + if (mode != (S_IFDIR|S_IRUGO|S_IXUGO)) { I'd just like to draw your attention to this special case -- why is this special cased? What was the original reasoning behind it? Does it make sense for a non-dumpable process to allow someone to change the mode of some random /proc/[pid]/ directories? I get the feeling that some of this logic is a bit iffy. -- Aleksa Sarai Software Engineer (Containers) SUSE Linux GmbH https://www.cyphar.com/
Re: [PATCH] procfs: change the owner of non-dumpable and writeable files
In order to protect against ptrace(2) and similar attacks on container runtimes when they join namespaces, many runtimes set mm->dumpable to SUID_DUMP_DISABLE. However, doing this means that attempting to set up an unprivileged user namespace will fail because an unprivileged process can no longer access /proc/self/{setgroups,{uid,gid}_map} for the container process (which is the same uid as the runtime process). Fix this by changing pid_getattr to *also* change the owner of regular files that have a mode of 0644 (when the process is not dumpable). This ensures that the important /proc/[pid]/... files mentioned above are properly accessible by a container runtime in a rootless container context. The most blantant issue is that a non-dumpable process in a rootless container context is unable to open /proc/self/setgroups, because it doesn't own the file. int main(void) { prctl(PR_SET_DUMPABLE, 0, 0, 0, 0); unshare(CLONE_NEWUSER); /* This will fail. */ int fd = open("/proc/self/setgroups", O_WRONLY); if (fd < 0) abort(); return 0; } I do agree that failing to open anything in /proc/self/ is more than unexepcted! I cannot judge the patch but my gut feeling tells me that the fix should be somewhere in the open handler. Maybe that would suffice as a more specific fix (for the special case of /proc/self), but the fact that none of the users and groups are correctly set in /proc/[pid] will cause issues for runC and other container runtimes (because they don't go through /proc/self -- it's accessing /proc/[pid] from another process). Though I get the feeling that the *correct* fix would be to remove the conditional and *always* change the owner -- maybe I'm missing something but I can't think of the security issue that this code currently fixes (since all of the important permission checks are *in addition* to the generic_permission used for /proc/self/..., which use ptrace_may_access). -- Aleksa Sarai Software Engineer (Containers) SUSE Linux GmbH https://www.cyphar.com/
Re: [PATCH] procfs: change the owner of non-dumpable and writeable files
In order to protect against ptrace(2) and similar attacks on container runtimes when they join namespaces, many runtimes set mm->dumpable to SUID_DUMP_DISABLE. However, doing this means that attempting to set up an unprivileged user namespace will fail because an unprivileged process can no longer access /proc/self/{setgroups,{uid,gid}_map} for the container process (which is the same uid as the runtime process). Fix this by changing pid_getattr to *also* change the owner of regular files that have a mode of 0644 (when the process is not dumpable). This ensures that the important /proc/[pid]/... files mentioned above are properly accessible by a container runtime in a rootless container context. The most blantant issue is that a non-dumpable process in a rootless container context is unable to open /proc/self/setgroups, because it doesn't own the file. int main(void) { prctl(PR_SET_DUMPABLE, 0, 0, 0, 0); unshare(CLONE_NEWUSER); /* This will fail. */ int fd = open("/proc/self/setgroups", O_WRONLY); if (fd < 0) abort(); return 0; } I do agree that failing to open anything in /proc/self/ is more than unexepcted! I cannot judge the patch but my gut feeling tells me that the fix should be somewhere in the open handler. Maybe that would suffice as a more specific fix (for the special case of /proc/self), but the fact that none of the users and groups are correctly set in /proc/[pid] will cause issues for runC and other container runtimes (because they don't go through /proc/self -- it's accessing /proc/[pid] from another process). Though I get the feeling that the *correct* fix would be to remove the conditional and *always* change the owner -- maybe I'm missing something but I can't think of the security issue that this code currently fixes (since all of the important permission checks are *in addition* to the generic_permission used for /proc/self/..., which use ptrace_may_access). -- Aleksa Sarai Software Engineer (Containers) SUSE Linux GmbH https://www.cyphar.com/
Re: [PATCH] procfs: change the owner of non-dumpable and writeable files
Cc Eric On Wed 18-01-17 15:01:59, Aleksa Sarai wrote: > In order to protect against ptrace(2) and similar attacks on container > runtimes when they join namespaces, many runtimes set mm->dumpable to > SUID_DUMP_DISABLE. However, doing this means that attempting to set up > an unprivileged user namespace will fail because an unprivileged process > can no longer access /proc/self/{setgroups,{uid,gid}_map} for the > container process (which is the same uid as the runtime process). > > Fix this by changing pid_getattr to *also* change the owner of regular > files that have a mode of 0644 (when the process is not dumpable). This > ensures that the important /proc/[pid]/... files mentioned above are > properly accessible by a container runtime in a rootless container > context. > > The most blantant issue is that a non-dumpable process in a rootless > container context is unable to open /proc/self/setgroups, because it > doesn't own the file. > > int main(void) > { > prctl(PR_SET_DUMPABLE, 0, 0, 0, 0); > unshare(CLONE_NEWUSER); > > /* This will fail. */ > int fd = open("/proc/self/setgroups", O_WRONLY); > if (fd < 0) > abort(); > > return 0; > } I do agree that failing to open anything in /proc/self/ is more than unexepcted! I cannot judge the patch but my gut feeling tells me that the fix should be somewhere in the open handler. One nit below > > Cc: d...@opencontainers.org > Cc: contain...@lists.linux-foundation.org > Signed-off-by: Aleksa Sarai> --- > fs/proc/base.c | 6 -- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/fs/proc/base.c b/fs/proc/base.c > index ca651ac00660..ebabb12f4536 100644 > --- a/fs/proc/base.c > +++ b/fs/proc/base.c > @@ -1729,6 +1729,7 @@ int pid_getattr(struct vfsmount *mnt, struct dentry > *dentry, struct kstat *stat) > return -ENOENT; > } > if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) || > + (inode->i_mode == (S_IFREG|S_IRUGO|S_IWUSR)) || > task_dumpable(task)) { > cred = __task_cred(task); > stat->uid = cred->euid; > @@ -1770,6 +1771,7 @@ int pid_revalidate(struct dentry *dentry, unsigned int > flags) > > if (task) { > if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) || > + (inode->i_mode == (S_IFREG|S_IRUGO|S_IWUSR)) || > task_dumpable(task)) { > rcu_read_lock(); > cred = __task_cred(task); > @@ -2394,7 +2396,7 @@ static int proc_pident_instantiate(struct inode *dir, > return -ENOENT; > } > > -static struct dentry *proc_pident_lookup(struct inode *dir, > +static struct dentry *proc_pident_lookup(struct inode *dir, >struct dentry *dentry, >const struct pid_entry *ents, >unsigned int nents) > @@ -2536,7 +2538,7 @@ static const struct pid_entry attr_dir_stuff[] = { > > static int proc_attr_dir_readdir(struct file *file, struct dir_context *ctx) > { > - return proc_pident_readdir(file, ctx, > + return proc_pident_readdir(file, ctx, > attr_dir_stuff, ARRAY_SIZE(attr_dir_stuff)); > } this two are just whitespace noise -- Michal Hocko SUSE Labs
Re: [PATCH] procfs: change the owner of non-dumpable and writeable files
Cc Eric On Wed 18-01-17 15:01:59, Aleksa Sarai wrote: > In order to protect against ptrace(2) and similar attacks on container > runtimes when they join namespaces, many runtimes set mm->dumpable to > SUID_DUMP_DISABLE. However, doing this means that attempting to set up > an unprivileged user namespace will fail because an unprivileged process > can no longer access /proc/self/{setgroups,{uid,gid}_map} for the > container process (which is the same uid as the runtime process). > > Fix this by changing pid_getattr to *also* change the owner of regular > files that have a mode of 0644 (when the process is not dumpable). This > ensures that the important /proc/[pid]/... files mentioned above are > properly accessible by a container runtime in a rootless container > context. > > The most blantant issue is that a non-dumpable process in a rootless > container context is unable to open /proc/self/setgroups, because it > doesn't own the file. > > int main(void) > { > prctl(PR_SET_DUMPABLE, 0, 0, 0, 0); > unshare(CLONE_NEWUSER); > > /* This will fail. */ > int fd = open("/proc/self/setgroups", O_WRONLY); > if (fd < 0) > abort(); > > return 0; > } I do agree that failing to open anything in /proc/self/ is more than unexepcted! I cannot judge the patch but my gut feeling tells me that the fix should be somewhere in the open handler. One nit below > > Cc: d...@opencontainers.org > Cc: contain...@lists.linux-foundation.org > Signed-off-by: Aleksa Sarai > --- > fs/proc/base.c | 6 -- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/fs/proc/base.c b/fs/proc/base.c > index ca651ac00660..ebabb12f4536 100644 > --- a/fs/proc/base.c > +++ b/fs/proc/base.c > @@ -1729,6 +1729,7 @@ int pid_getattr(struct vfsmount *mnt, struct dentry > *dentry, struct kstat *stat) > return -ENOENT; > } > if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) || > + (inode->i_mode == (S_IFREG|S_IRUGO|S_IWUSR)) || > task_dumpable(task)) { > cred = __task_cred(task); > stat->uid = cred->euid; > @@ -1770,6 +1771,7 @@ int pid_revalidate(struct dentry *dentry, unsigned int > flags) > > if (task) { > if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) || > + (inode->i_mode == (S_IFREG|S_IRUGO|S_IWUSR)) || > task_dumpable(task)) { > rcu_read_lock(); > cred = __task_cred(task); > @@ -2394,7 +2396,7 @@ static int proc_pident_instantiate(struct inode *dir, > return -ENOENT; > } > > -static struct dentry *proc_pident_lookup(struct inode *dir, > +static struct dentry *proc_pident_lookup(struct inode *dir, >struct dentry *dentry, >const struct pid_entry *ents, >unsigned int nents) > @@ -2536,7 +2538,7 @@ static const struct pid_entry attr_dir_stuff[] = { > > static int proc_attr_dir_readdir(struct file *file, struct dir_context *ctx) > { > - return proc_pident_readdir(file, ctx, > + return proc_pident_readdir(file, ctx, > attr_dir_stuff, ARRAY_SIZE(attr_dir_stuff)); > } this two are just whitespace noise -- Michal Hocko SUSE Labs
Re: [PATCH] procfs: change the owner of non-dumpable and writeable files
In order to protect against ptrace(2) and similar attacks on container runtimes when they join namespaces, many runtimes set mm->dumpable to SUID_DUMP_DISABLE. However, doing this means that attempting to set up an unprivileged user namespace will fail because an unprivileged process can no longer access /proc/self/{setgroups,{uid,gid}_map} for the container process (which is the same uid as the runtime process). Fix this by changing pid_getattr to *also* change the owner of regular files that have a mode of 0644 (when the process is not dumpable). This ensures that the important /proc/[pid]/... files mentioned above are properly accessible by a container runtime in a rootless container context. The most blantant issue is that a non-dumpable process in a rootless container context is unable to open /proc/self/setgroups, because it doesn't own the file. This changes a lot more than just setgroups, doesn't it? This bypasses the task_dumpable check for all kinds of things. Yeah. I can special case /proc/self/setgroups as well as uid_map, gid_map and the other *specific* things that runC needs. But ultimately I think we should come up with agreement on what things should always appear to be owned by the process's user. Though, I expect the has_pid_permissions() check to be the harder one to pass. Why does has_pid_permissions() succeed in the case you've given? Because the group id of the container process is the same as the parent process, so is_group_p() will succeed. Also hide_pid_min is set such that it will work in either case. -- Aleksa Sarai Software Engineer (Containers) SUSE Linux GmbH https://www.cyphar.com/
Re: [PATCH] procfs: change the owner of non-dumpable and writeable files
In order to protect against ptrace(2) and similar attacks on container runtimes when they join namespaces, many runtimes set mm->dumpable to SUID_DUMP_DISABLE. However, doing this means that attempting to set up an unprivileged user namespace will fail because an unprivileged process can no longer access /proc/self/{setgroups,{uid,gid}_map} for the container process (which is the same uid as the runtime process). Fix this by changing pid_getattr to *also* change the owner of regular files that have a mode of 0644 (when the process is not dumpable). This ensures that the important /proc/[pid]/... files mentioned above are properly accessible by a container runtime in a rootless container context. The most blantant issue is that a non-dumpable process in a rootless container context is unable to open /proc/self/setgroups, because it doesn't own the file. This changes a lot more than just setgroups, doesn't it? This bypasses the task_dumpable check for all kinds of things. Yeah. I can special case /proc/self/setgroups as well as uid_map, gid_map and the other *specific* things that runC needs. But ultimately I think we should come up with agreement on what things should always appear to be owned by the process's user. Though, I expect the has_pid_permissions() check to be the harder one to pass. Why does has_pid_permissions() succeed in the case you've given? Because the group id of the container process is the same as the parent process, so is_group_p() will succeed. Also hide_pid_min is set such that it will work in either case. -- Aleksa Sarai Software Engineer (Containers) SUSE Linux GmbH https://www.cyphar.com/
Re: [PATCH] procfs: change the owner of non-dumpable and writeable files
On Tue, Jan 17, 2017 at 8:01 PM, Aleksa Saraiwrote: > In order to protect against ptrace(2) and similar attacks on container > runtimes when they join namespaces, many runtimes set mm->dumpable to > SUID_DUMP_DISABLE. However, doing this means that attempting to set up > an unprivileged user namespace will fail because an unprivileged process > can no longer access /proc/self/{setgroups,{uid,gid}_map} for the > container process (which is the same uid as the runtime process). > > Fix this by changing pid_getattr to *also* change the owner of regular > files that have a mode of 0644 (when the process is not dumpable). This > ensures that the important /proc/[pid]/... files mentioned above are > properly accessible by a container runtime in a rootless container > context. > > The most blantant issue is that a non-dumpable process in a rootless > container context is unable to open /proc/self/setgroups, because it > doesn't own the file. This changes a lot more than just setgroups, doesn't it? This bypasses the task_dumpable check for all kinds of things. Though, I expect the has_pid_permissions() check to be the harder one to pass. Why does has_pid_permissions() succeed in the case you've given? -Kees > int main(void) > { > prctl(PR_SET_DUMPABLE, 0, 0, 0, 0); > unshare(CLONE_NEWUSER); > > /* This will fail. */ > int fd = open("/proc/self/setgroups", O_WRONLY); > if (fd < 0) > abort(); > > return 0; > } > > Cc: d...@opencontainers.org > Cc: contain...@lists.linux-foundation.org > Signed-off-by: Aleksa Sarai > --- > fs/proc/base.c | 6 -- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/fs/proc/base.c b/fs/proc/base.c > index ca651ac00660..ebabb12f4536 100644 > --- a/fs/proc/base.c > +++ b/fs/proc/base.c > @@ -1729,6 +1729,7 @@ int pid_getattr(struct vfsmount *mnt, struct dentry > *dentry, struct kstat *stat) > return -ENOENT; > } > if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) || > + (inode->i_mode == (S_IFREG|S_IRUGO|S_IWUSR)) || > task_dumpable(task)) { > cred = __task_cred(task); > stat->uid = cred->euid; > @@ -1770,6 +1771,7 @@ int pid_revalidate(struct dentry *dentry, unsigned int > flags) > > if (task) { > if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) || > + (inode->i_mode == (S_IFREG|S_IRUGO|S_IWUSR)) || > task_dumpable(task)) { > rcu_read_lock(); > cred = __task_cred(task); > @@ -2394,7 +2396,7 @@ static int proc_pident_instantiate(struct inode *dir, > return -ENOENT; > } > > -static struct dentry *proc_pident_lookup(struct inode *dir, > +static struct dentry *proc_pident_lookup(struct inode *dir, > struct dentry *dentry, > const struct pid_entry *ents, > unsigned int nents) > @@ -2536,7 +2538,7 @@ static const struct pid_entry attr_dir_stuff[] = { > > static int proc_attr_dir_readdir(struct file *file, struct dir_context *ctx) > { > - return proc_pident_readdir(file, ctx, > + return proc_pident_readdir(file, ctx, >attr_dir_stuff, > ARRAY_SIZE(attr_dir_stuff)); > } > > -- > 2.11.0 > -- Kees Cook Nexus Security
Re: [PATCH] procfs: change the owner of non-dumpable and writeable files
On Tue, Jan 17, 2017 at 8:01 PM, Aleksa Sarai wrote: > In order to protect against ptrace(2) and similar attacks on container > runtimes when they join namespaces, many runtimes set mm->dumpable to > SUID_DUMP_DISABLE. However, doing this means that attempting to set up > an unprivileged user namespace will fail because an unprivileged process > can no longer access /proc/self/{setgroups,{uid,gid}_map} for the > container process (which is the same uid as the runtime process). > > Fix this by changing pid_getattr to *also* change the owner of regular > files that have a mode of 0644 (when the process is not dumpable). This > ensures that the important /proc/[pid]/... files mentioned above are > properly accessible by a container runtime in a rootless container > context. > > The most blantant issue is that a non-dumpable process in a rootless > container context is unable to open /proc/self/setgroups, because it > doesn't own the file. This changes a lot more than just setgroups, doesn't it? This bypasses the task_dumpable check for all kinds of things. Though, I expect the has_pid_permissions() check to be the harder one to pass. Why does has_pid_permissions() succeed in the case you've given? -Kees > int main(void) > { > prctl(PR_SET_DUMPABLE, 0, 0, 0, 0); > unshare(CLONE_NEWUSER); > > /* This will fail. */ > int fd = open("/proc/self/setgroups", O_WRONLY); > if (fd < 0) > abort(); > > return 0; > } > > Cc: d...@opencontainers.org > Cc: contain...@lists.linux-foundation.org > Signed-off-by: Aleksa Sarai > --- > fs/proc/base.c | 6 -- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/fs/proc/base.c b/fs/proc/base.c > index ca651ac00660..ebabb12f4536 100644 > --- a/fs/proc/base.c > +++ b/fs/proc/base.c > @@ -1729,6 +1729,7 @@ int pid_getattr(struct vfsmount *mnt, struct dentry > *dentry, struct kstat *stat) > return -ENOENT; > } > if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) || > + (inode->i_mode == (S_IFREG|S_IRUGO|S_IWUSR)) || > task_dumpable(task)) { > cred = __task_cred(task); > stat->uid = cred->euid; > @@ -1770,6 +1771,7 @@ int pid_revalidate(struct dentry *dentry, unsigned int > flags) > > if (task) { > if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) || > + (inode->i_mode == (S_IFREG|S_IRUGO|S_IWUSR)) || > task_dumpable(task)) { > rcu_read_lock(); > cred = __task_cred(task); > @@ -2394,7 +2396,7 @@ static int proc_pident_instantiate(struct inode *dir, > return -ENOENT; > } > > -static struct dentry *proc_pident_lookup(struct inode *dir, > +static struct dentry *proc_pident_lookup(struct inode *dir, > struct dentry *dentry, > const struct pid_entry *ents, > unsigned int nents) > @@ -2536,7 +2538,7 @@ static const struct pid_entry attr_dir_stuff[] = { > > static int proc_attr_dir_readdir(struct file *file, struct dir_context *ctx) > { > - return proc_pident_readdir(file, ctx, > + return proc_pident_readdir(file, ctx, >attr_dir_stuff, > ARRAY_SIZE(attr_dir_stuff)); > } > > -- > 2.11.0 > -- Kees Cook Nexus Security
[PATCH] procfs: change the owner of non-dumpable and writeable files
In order to protect against ptrace(2) and similar attacks on container runtimes when they join namespaces, many runtimes set mm->dumpable to SUID_DUMP_DISABLE. However, doing this means that attempting to set up an unprivileged user namespace will fail because an unprivileged process can no longer access /proc/self/{setgroups,{uid,gid}_map} for the container process (which is the same uid as the runtime process). Fix this by changing pid_getattr to *also* change the owner of regular files that have a mode of 0644 (when the process is not dumpable). This ensures that the important /proc/[pid]/... files mentioned above are properly accessible by a container runtime in a rootless container context. The most blantant issue is that a non-dumpable process in a rootless container context is unable to open /proc/self/setgroups, because it doesn't own the file. int main(void) { prctl(PR_SET_DUMPABLE, 0, 0, 0, 0); unshare(CLONE_NEWUSER); /* This will fail. */ int fd = open("/proc/self/setgroups", O_WRONLY); if (fd < 0) abort(); return 0; } Cc: d...@opencontainers.org Cc: contain...@lists.linux-foundation.org Signed-off-by: Aleksa Sarai--- fs/proc/base.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index ca651ac00660..ebabb12f4536 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1729,6 +1729,7 @@ int pid_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat) return -ENOENT; } if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) || + (inode->i_mode == (S_IFREG|S_IRUGO|S_IWUSR)) || task_dumpable(task)) { cred = __task_cred(task); stat->uid = cred->euid; @@ -1770,6 +1771,7 @@ int pid_revalidate(struct dentry *dentry, unsigned int flags) if (task) { if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) || + (inode->i_mode == (S_IFREG|S_IRUGO|S_IWUSR)) || task_dumpable(task)) { rcu_read_lock(); cred = __task_cred(task); @@ -2394,7 +2396,7 @@ static int proc_pident_instantiate(struct inode *dir, return -ENOENT; } -static struct dentry *proc_pident_lookup(struct inode *dir, +static struct dentry *proc_pident_lookup(struct inode *dir, struct dentry *dentry, const struct pid_entry *ents, unsigned int nents) @@ -2536,7 +2538,7 @@ static const struct pid_entry attr_dir_stuff[] = { static int proc_attr_dir_readdir(struct file *file, struct dir_context *ctx) { - return proc_pident_readdir(file, ctx, + return proc_pident_readdir(file, ctx, attr_dir_stuff, ARRAY_SIZE(attr_dir_stuff)); } -- 2.11.0
[PATCH] procfs: change the owner of non-dumpable and writeable files
In order to protect against ptrace(2) and similar attacks on container runtimes when they join namespaces, many runtimes set mm->dumpable to SUID_DUMP_DISABLE. However, doing this means that attempting to set up an unprivileged user namespace will fail because an unprivileged process can no longer access /proc/self/{setgroups,{uid,gid}_map} for the container process (which is the same uid as the runtime process). Fix this by changing pid_getattr to *also* change the owner of regular files that have a mode of 0644 (when the process is not dumpable). This ensures that the important /proc/[pid]/... files mentioned above are properly accessible by a container runtime in a rootless container context. The most blantant issue is that a non-dumpable process in a rootless container context is unable to open /proc/self/setgroups, because it doesn't own the file. int main(void) { prctl(PR_SET_DUMPABLE, 0, 0, 0, 0); unshare(CLONE_NEWUSER); /* This will fail. */ int fd = open("/proc/self/setgroups", O_WRONLY); if (fd < 0) abort(); return 0; } Cc: d...@opencontainers.org Cc: contain...@lists.linux-foundation.org Signed-off-by: Aleksa Sarai --- fs/proc/base.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index ca651ac00660..ebabb12f4536 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -1729,6 +1729,7 @@ int pid_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat) return -ENOENT; } if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) || + (inode->i_mode == (S_IFREG|S_IRUGO|S_IWUSR)) || task_dumpable(task)) { cred = __task_cred(task); stat->uid = cred->euid; @@ -1770,6 +1771,7 @@ int pid_revalidate(struct dentry *dentry, unsigned int flags) if (task) { if ((inode->i_mode == (S_IFDIR|S_IRUGO|S_IXUGO)) || + (inode->i_mode == (S_IFREG|S_IRUGO|S_IWUSR)) || task_dumpable(task)) { rcu_read_lock(); cred = __task_cred(task); @@ -2394,7 +2396,7 @@ static int proc_pident_instantiate(struct inode *dir, return -ENOENT; } -static struct dentry *proc_pident_lookup(struct inode *dir, +static struct dentry *proc_pident_lookup(struct inode *dir, struct dentry *dentry, const struct pid_entry *ents, unsigned int nents) @@ -2536,7 +2538,7 @@ static const struct pid_entry attr_dir_stuff[] = { static int proc_attr_dir_readdir(struct file *file, struct dir_context *ctx) { - return proc_pident_readdir(file, ctx, + return proc_pident_readdir(file, ctx, attr_dir_stuff, ARRAY_SIZE(attr_dir_stuff)); } -- 2.11.0