Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8
Rob Landley writes: > On 12/17/2012 05:18:57 PM, Eric W. Biederman wrote: >> Work remains to make it safe to build user namespaces and 9p, afs, >> ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the >> Kconfig >> guard remains in place preventing that user namespaces from being >> built >> when any of those filesystems are enabled. > > What work specifically? Essentially the uid_t to kuid_t and gid_t to kgid_t conversion. These are the more complex filesystems. And I haven't yet had the time to go through the preliminary patches in my development branch and convert them into stupid obviously correct patches, or review them closely. The changes generally are pretty simple the amount of review to ensure I didn't overlook things tends to proprotional to the size of the file system. There is also the fact that I don't use most of those file-systems. The big advantage of finishing those filesystems is that user namespaces can be turned on in allyesconfig, allowing my compile time checks to notice when someone doesn't use kuids and kgids. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8
On 12/17/2012 05:18:57 PM, Eric W. Biederman wrote: Work remains to make it safe to build user namespaces and 9p, afs, ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the Kconfig guard remains in place preventing that user namespaces from being built when any of those filesystems are enabled. What work specifically? Rob-- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8
On 12/17/2012 05:18:57 PM, Eric W. Biederman wrote: Work remains to make it safe to build user namespaces and 9p, afs, ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the Kconfig guard remains in place preventing that user namespaces from being built when any of those filesystems are enabled. What work specifically? Rob-- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8
Rob Landley r...@landley.net writes: On 12/17/2012 05:18:57 PM, Eric W. Biederman wrote: Work remains to make it safe to build user namespaces and 9p, afs, ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the Kconfig guard remains in place preventing that user namespaces from being built when any of those filesystems are enabled. What work specifically? Essentially the uid_t to kuid_t and gid_t to kgid_t conversion. These are the more complex filesystems. And I haven't yet had the time to go through the preliminary patches in my development branch and convert them into stupid obviously correct patches, or review them closely. The changes generally are pretty simple the amount of review to ensure I didn't overlook things tends to proprotional to the size of the file system. There is also the fact that I don't use most of those file-systems. The big advantage of finishing those filesystems is that user namespaces can be turned on in allyesconfig, allowing my compile time checks to notice when someone doesn't use kuids and kgids. Eric -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8
ebied...@xmission.com (Eric W. Biederman) writes: > Linus, > > Please pull the for-linus git tree from: > >git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git > for-linus > >HEAD: 5155040ed349950e16c093ba8e65ad534994df2a userns: Fix typo in > description of the limitation of userns_install > >This tree is against v3.7-rc3 > > The embarrasing oversights that Andy found have been corrected. Those bugs, those darn embarrasing bugs just want don't want to get fixed. Linus I just updated my mirror of your kernel.org tree and it appears you successfully pulled everything except the last 4 commits that fix those embarrasing bugs. When you get a chance can you please repull my branch (the details above are still corect. The pending changes are. Eric W. Biederman (4): Fix cap_capable to only allow owners in the parent user namespace to have caps. userns: Require CAP_SYS_ADMIN for most uses of setns. userns: Add a more complete capability subset test to commit_creds userns: Fix typo in description of the limitation of userns_install fs/namespace.c |3 ++- ipc/namespace.c |3 ++- kernel/cred.c| 27 ++- kernel/pid_namespace.c |3 ++- kernel/user_namespace.c |2 +- kernel/utsname.c |3 ++- net/core/net_namespace.c |3 ++- security/commoncap.c | 25 + 8 files changed, 54 insertions(+), 15 deletions(-) Eric -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[GIT PULL] user namespace and namespace infrastructure changes for 3.8
Linus, Please pull the for-linus git tree from: git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus HEAD: 5155040ed349950e16c093ba8e65ad534994df2a userns: Fix typo in description of the limitation of userns_install This tree is against v3.7-rc3 The embarrasing oversights that Andy found have been corrected. While small this set of changes is very significant with respect to containers in general and user namespaces in particular. The user space interface is now complete. This set of changes adds support for unprivileged users to create user namespaces and as a user namespace root to create other namespaces. The tyrrany of supporting suid root preventing unprivileged users from using cool new kernel features is broken. This set of changes completes the work on setns, adding support for the pid, user, mount namespaces. This set of changes includes a bunch of basic pid namespace cleanups/simplifications. Of particular significance is the rework of the pid namespace cleanup so it no longer requires sending out tendrils into all kinds of unexpected cleanup paths for operation. At least one case of broken error handling is fixed by this cleanup. The files under /proc//ns/ have been converted from regular files to magic symlinks which prevents incorrect caching by the VFS, ensuring the files always refer to the namespace the process is currently using and ensuring that the ptrace_mayaccess permission checks are always applied. The files under /proc//ns/ have been given stable inode numbers so it is now possible to see if different processes share the same namespaces. Through the David Miller's net tree are changes to relax many of the permission checks in the networking stack to allowing the user namespace root to usefully use the networking stack. Similar changes for the mount namespace and the pid namespace are coming through my tree. Two small changes to add user namespace support were commited here adn in David Miller's -net tree so that I could complete the work on the /proc//ns/ files in this tree. Work remains to make it safe to build user namespaces and 9p, afs, ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the Kconfig guard remains in place preventing that user namespaces from being built when any of those filesystems are enabled. Future design work remains to allow root users outside of the initial user namespace to mount more than just /proc and /sys. Eric W. Biederman (41): userns: Support autofs4 interacing with multiple user namespaces userns: Support fuse interacting with multiple user namespaces netns: Deduplicate and fix copy_net_ns when !CONFIG_NET_NS userns: make each net (net_ns) belong to a user_ns userns: On mips modify check_same_owner to use uid_eq procfs: Use the proc generic infrastructure for proc/self. procfs: Don't cache a pid in the root inode. pidns: Capture the user namespace and filter ns_last_pid pidns: Use task_active_pid_ns where appropriate pidns: Make the pidns proc mount/umount logic obvious. pidns: Don't allow new processes in a dead pid namespace. pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1 pidns: Deny strange cases when creating pid namespaces. pidns: Add setns support pidns: Consolidate initialzation of special init task state pidns: Support unsharing the pid namespace. vfs: Allow chroot if you have CAP_SYS_CHROOT in your user namespace vfs: Add setns support for the mount namespace vfs: Add a user namespace reference from struct mnt_namespace vfs: Only support slave subtrees across different user namespaces vfs: Allow unprivileged manipulation of the mount namespace. userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped userns: Allow unprivileged users to create user namespaces. userns: Allow chown and setgid preservation userns: Allow setting a userns mapping to your current uid. userns: Allow unprivileged users to create new namespaces userns: Allow unprivileged use of setns. userns: Make create_new_namespaces take a user_ns parameter userns: Kill task_user_ns userns: Implent proc namespace operations userns: Implement unshare of the user namespace procfs: Print task uids and gids in the userns that opened the proc file userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file userns: Allow unprivilged mounts of proc and sysfs proc: Generalize proc inode allocation proc: Fix the namespace inode permission checks. proc: Usable inode numbers for the namespace file descriptors. Fix cap_capable to only allow owners in the parent user namespace to have caps. userns: Require CAP_SYS_ADMIN for most uses of setns. userns: Add a more complete capability subset test to commit_creds
[GIT PULL] user namespace and namespace infrastructure changes for 3.8
Linus, Please pull the for-linus git tree from: git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus HEAD: 5155040ed349950e16c093ba8e65ad534994df2a userns: Fix typo in description of the limitation of userns_install This tree is against v3.7-rc3 The embarrasing oversights that Andy found have been corrected. While small this set of changes is very significant with respect to containers in general and user namespaces in particular. The user space interface is now complete. This set of changes adds support for unprivileged users to create user namespaces and as a user namespace root to create other namespaces. The tyrrany of supporting suid root preventing unprivileged users from using cool new kernel features is broken. This set of changes completes the work on setns, adding support for the pid, user, mount namespaces. This set of changes includes a bunch of basic pid namespace cleanups/simplifications. Of particular significance is the rework of the pid namespace cleanup so it no longer requires sending out tendrils into all kinds of unexpected cleanup paths for operation. At least one case of broken error handling is fixed by this cleanup. The files under /proc/pid/ns/ have been converted from regular files to magic symlinks which prevents incorrect caching by the VFS, ensuring the files always refer to the namespace the process is currently using and ensuring that the ptrace_mayaccess permission checks are always applied. The files under /proc/pid/ns/ have been given stable inode numbers so it is now possible to see if different processes share the same namespaces. Through the David Miller's net tree are changes to relax many of the permission checks in the networking stack to allowing the user namespace root to usefully use the networking stack. Similar changes for the mount namespace and the pid namespace are coming through my tree. Two small changes to add user namespace support were commited here adn in David Miller's -net tree so that I could complete the work on the /proc/pid/ns/ files in this tree. Work remains to make it safe to build user namespaces and 9p, afs, ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the Kconfig guard remains in place preventing that user namespaces from being built when any of those filesystems are enabled. Future design work remains to allow root users outside of the initial user namespace to mount more than just /proc and /sys. Eric W. Biederman (41): userns: Support autofs4 interacing with multiple user namespaces userns: Support fuse interacting with multiple user namespaces netns: Deduplicate and fix copy_net_ns when !CONFIG_NET_NS userns: make each net (net_ns) belong to a user_ns userns: On mips modify check_same_owner to use uid_eq procfs: Use the proc generic infrastructure for proc/self. procfs: Don't cache a pid in the root inode. pidns: Capture the user namespace and filter ns_last_pid pidns: Use task_active_pid_ns where appropriate pidns: Make the pidns proc mount/umount logic obvious. pidns: Don't allow new processes in a dead pid namespace. pidns: Wait in zap_pid_ns_processes until pid_ns-nr_hashed == 1 pidns: Deny strange cases when creating pid namespaces. pidns: Add setns support pidns: Consolidate initialzation of special init task state pidns: Support unsharing the pid namespace. vfs: Allow chroot if you have CAP_SYS_CHROOT in your user namespace vfs: Add setns support for the mount namespace vfs: Add a user namespace reference from struct mnt_namespace vfs: Only support slave subtrees across different user namespaces vfs: Allow unprivileged manipulation of the mount namespace. userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped userns: Allow unprivileged users to create user namespaces. userns: Allow chown and setgid preservation userns: Allow setting a userns mapping to your current uid. userns: Allow unprivileged users to create new namespaces userns: Allow unprivileged use of setns. userns: Make create_new_namespaces take a user_ns parameter userns: Kill task_user_ns userns: Implent proc namespace operations userns: Implement unshare of the user namespace procfs: Print task uids and gids in the userns that opened the proc file userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file userns: Allow unprivilged mounts of proc and sysfs proc: Generalize proc inode allocation proc: Fix the namespace inode permission checks. proc: Usable inode numbers for the namespace file descriptors. Fix cap_capable to only allow owners in the parent user namespace to have caps. userns: Require CAP_SYS_ADMIN for most uses of setns. userns: Add a more complete capability subset test to commit_creds
Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8
ebied...@xmission.com (Eric W. Biederman) writes: Linus, Please pull the for-linus git tree from: git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus HEAD: 5155040ed349950e16c093ba8e65ad534994df2a userns: Fix typo in description of the limitation of userns_install This tree is against v3.7-rc3 The embarrasing oversights that Andy found have been corrected. Those bugs, those darn embarrasing bugs just want don't want to get fixed. Linus I just updated my mirror of your kernel.org tree and it appears you successfully pulled everything except the last 4 commits that fix those embarrasing bugs. When you get a chance can you please repull my branch (the details above are still corect. The pending changes are. Eric W. Biederman (4): Fix cap_capable to only allow owners in the parent user namespace to have caps. userns: Require CAP_SYS_ADMIN for most uses of setns. userns: Add a more complete capability subset test to commit_creds userns: Fix typo in description of the limitation of userns_install fs/namespace.c |3 ++- ipc/namespace.c |3 ++- kernel/cred.c| 27 ++- kernel/pid_namespace.c |3 ++- kernel/user_namespace.c |2 +- kernel/utsname.c |3 ++- net/core/net_namespace.c |3 ++- security/commoncap.c | 25 + 8 files changed, 54 insertions(+), 15 deletions(-) Eric -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8
Andy Lutomirski writes: > On Thu, Dec 13, 2012 at 8:11 PM, Eric W. Biederman > wrote: >> Andy Lutomirski writes: >> >>> One more issue: the requirement that both upper and lower uids (etc.) >>> in the maps are in order is rather limiting. I have no objection if >>> you only require upper ids to be monotonic, but currently there's no >>> way to may root outside to uid n (for n > 0) and some nonroot user >>> outside to uid 0. >> >> There is. You may set up to 5 (extents). You just have to use a second >> extent for the non-contiguous bits. My reader is lazy and you have to >> set all of the extents with a single write, so you may have missed the >> ability to set more than one extent. >> > > If I'm wrong, I'll happily eat my words. Both: > > 0 1 1 > 1 0 1 > > and > > 1 0 1 > 0 1 1 > > are rejected, unless I totally messed up. Duh. You are right. It is this check: /* For now only accept extents that are strictly in order */ if (last && (((last->first + last->count) > extent->first) || ((last->lower_first + last->count) > extent->lower_first))) goto out; Fundamentally every value mapped must be distinct. Aka the direction of the mapping must be reversible without loss of information. Ensuring all of the values were increasing in the extents was just a lame way of ensuring that the same value was not mapped twice in either the upper or lower ranges. That check can most certainly be relaxed (patches welcome). But that probably isn't 3.8 material as that is feature work. Not having bumped into this limitation myself I'm not certain the value in removing this check. But there is no good reason not to replace the current check with a more general one either. So your example should work, and that it doesn't is a misfeature. Eric -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8
Andy Lutomirski l...@amacapital.net writes: On Thu, Dec 13, 2012 at 8:11 PM, Eric W. Biederman ebied...@xmission.com wrote: Andy Lutomirski l...@amacapital.net writes: One more issue: the requirement that both upper and lower uids (etc.) in the maps are in order is rather limiting. I have no objection if you only require upper ids to be monotonic, but currently there's no way to may root outside to uid n (for n 0) and some nonroot user outside to uid 0. There is. You may set up to 5 (extents). You just have to use a second extent for the non-contiguous bits. My reader is lazy and you have to set all of the extents with a single write, so you may have missed the ability to set more than one extent. If I'm wrong, I'll happily eat my words. Both: 0 1 1 1 0 1 and 1 0 1 0 1 1 are rejected, unless I totally messed up. Duh. You are right. It is this check: /* For now only accept extents that are strictly in order */ if (last (((last-first + last-count) extent-first) || ((last-lower_first + last-count) extent-lower_first))) goto out; Fundamentally every value mapped must be distinct. Aka the direction of the mapping must be reversible without loss of information. Ensuring all of the values were increasing in the extents was just a lame way of ensuring that the same value was not mapped twice in either the upper or lower ranges. That check can most certainly be relaxed (patches welcome). But that probably isn't 3.8 material as that is feature work. Not having bumped into this limitation myself I'm not certain the value in removing this check. But there is no good reason not to replace the current check with a more general one either. So your example should work, and that it doesn't is a misfeature. Eric -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8
On Thu, Dec 13, 2012 at 8:11 PM, Eric W. Biederman wrote: > Andy Lutomirski writes: > >> One more issue: the requirement that both upper and lower uids (etc.) >> in the maps are in order is rather limiting. I have no objection if >> you only require upper ids to be monotonic, but currently there's no >> way to may root outside to uid n (for n > 0) and some nonroot user >> outside to uid 0. > > There is. You may set up to 5 (extents). You just have to use a second > extent for the non-contiguous bits. My reader is lazy and you have to > set all of the extents with a single write, so you may have missed the > ability to set more than one extent. > If I'm wrong, I'll happily eat my words. Both: 0 1 1 1 0 1 and 1 0 1 0 1 1 are rejected, unless I totally messed up. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8
Andy Lutomirski writes: > On Thu, Dec 13, 2012 at 2:01 PM, Eric W. Biederman > wrote: >> Andy Lutomirski writes: >> >>> On 12/11/2012 01:17 PM, Eric W. Biederman wrote: > >> But please also note the difference between capable and ns_capable. Any >> security check that is capable() still requires priviliges in the >> initial user namespace. > > Huh? > > I'm talking about: > > clone with CLONE_NEWUSER > - child does unshare(CLONE_NEWPID) > - parent does setfd(child's pid namespace) > > Now the parent is running in the init userns with a different pid ns. > Setuid binaries will work but will see the unexpected pid ns. With > mount namespaces, this would be Really Bad (tm). With pid or ipc or > net, it's less obviously dangerous, but I'm not convinced it's safe. That isn't safe. That is a sneaky bug in my tree that I overlooked. :( > I sort of think that setns on a *non*-userns should require > CAP_SYS_ADMIN in the current userns, at least if no_new_privs isn't > set. Yes. CAP_SYS_ADMIN in your current user namespace should make setns as safe as it currently is before my patches. That is just a matter of adding a couple nsown_capable(CAP_SYS_ADMIN) permission checks. Right now I test for nsown_capable(CAP_SYS_CHROOT) for the mount namespace, which is probably sufficient to prevent those kinds of shenanigans but I am going to add a nsown_capable(CAP_SYS_ADMIN) for good measure. > A non-threaded process can have mm_users == 2 with CLONE_VM. I'm not > sure this is a problem, though. No it isn't. I said threads because they are the easy concept not because that covers all possible corner cases. >>> In any case, why are threads special here? >> >> You know I don't think I stopped to think about it. The combination >> of CLONE_NEWUSER and CLONE_THREAD have been denined since the first user >> namespace support was merged in 2008. >> >> I do know that things can get really strange when you mix multiple >> namespaces in a process. tkill of your own threads will stop working. >> Which access permissions should apply to files you mmap, file handles >> you have open, the core dumper etc. >> >> We do allow setresuid per thread so we might be able to cope >> with a process that mixes with user namespaces in different threads, >> but I would want a close review of things before we allow that kind of >> sharing. >> > > Fair enough. > > I'd personally be more concerned about shared signal handlers than > shared tgid, though. The signal handler set has all kinds of weird > things like session id. CLONE_THREAD implies CLONE_VM and CLONE_SIGNAL, and in practice mm_users > 1 protects against all of those cases. So I was really thinking all of those cases. >> (See the end. A significant bug in cap_capable slipped in about >> 3.5. cap_capable is only supposed to grant permissions to the owner >> of a user namespace if it is a child user namespace). > > [snipping lots of stuff] > > If the intended semantics of cap_capable are, indeed: > > If targ_ns is equals or is a descendent of cred->user_ns and the cap > is effective, return true. If targ_ns is owned by cred->euid and > targ_ns is a descendent of cred->user_ns (but is not equal to > cred->user_ns), then return true. Else return false > > then I agree with you on almost everything you said. I assumed that > the actual semantics were intended. Good. >>> unshare has a bug. This code: >> >> Interesting... >> >> Looking at it this is a very small misfeature. >> >> What is happening is that commit_creds is setting is making the task >> undumpable because we changed the set of capabilities in struct cred. >> >> This in turn results in pid_revalidate setting the owner of >> of /proc/self/uid_map to GLOBAL_ROOT_UID. >> >> From the outside the permissions on /proc/self/uid_map look like: >> -rw-r--r-- 1 root root 0 2012-12-13 12:43 /proc/30530/uid_map >> >> Then since /proc/self/uid_map uses the default permission function >> and the test program below is not run as root the read-write open >> of uid_map fails. > > It's probably either worth fixing this or disabling unshare of userns. > This makes it hard to use. IMO non-dumpable tasks should still be > able to access the contents of /proc/self -- i.e. I'd call this a more > general bug. > > But I'd also argue that unsharing userns shouldn't set non-dumpable -- > cap_permitted increased, but the new capabilities are still logically > a subset of the old ones. Agreed. Setting dumpable is the bug. I am going to sleep on it but the code in commit_creds should probably read: /* dumpability changes */ if (!uid_eq(old->euid, new->euid) || !gid_eq(old->egid, new->egid) || !uid_eq(old->fsuid, new->fsuid) || !gid_eq(old->fsgid, new->fsgid) || ((old->user_ns == new->user_ns) && !cap_issubset(new->cap_permitted, old->cap_permitted))) { if (task->mm) set_dumpable(task->mm, suid_dumpable);
Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8
On Thu, Dec 13, 2012 at 2:01 PM, Eric W. Biederman wrote: > Andy Lutomirski writes: > >> On 12/11/2012 01:17 PM, Eric W. Biederman wrote: >>> >>> Linus, >>> >>> Please pull the for-linus git tree from: >>> >>> >>> git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git >>> for-linus >>> >>>HEAD: 98f842e675f96ffac96e6c50315790912b2812be proc: Usable inode >>> numbers for the namespace file descriptors. >>> >>>This tree is against v3.7-rc3 >> >> You've just allowed unprivileged users to create new pid namespaces, >> etc, by creating a new userns, then creating a new pid namespace inside >> that userns, then setns-ing from outside the userns into the pid ns. Is >> this intentional? (The mount ns is okay -- it checks for CAP_CHROOT on >> setns.) > > Absolutely. My commit messages talk about this. I allow creating other > namespaces once inside a user namespace deliberately. There is no > reason I know of to ban creation of pid and other namespaces once you > are inside of a user namespace. > I must be looking at the wrong commit message. > But please also note the difference between capable and ns_capable. Any > security check that is capable() still requires priviliges in the > initial user namespace. Huh? I'm talking about: clone with CLONE_NEWUSER - child does unshare(CLONE_NEWPID) - parent does setfd(child's pid namespace) Now the parent is running in the init userns with a different pid ns. Setuid binaries will work but will see the unexpected pid ns. With mount namespaces, this would be Really Bad (tm). With pid or ipc or net, it's less obviously dangerous, but I'm not convinced it's safe. I sort of think that setns on a *non*-userns should require CAP_SYS_ADMIN in the current userns, at least if no_new_privs isn't set. > >> In user_namespace.c: >> >> /* Threaded many not enter a different user namespace */ >> if (atomic_read(>mm->mm_users) > 1) >> return -EINVAL; >> >> The comment has a typo. Also, you're checking the wrong condition: >> that's whether the vm is shared, not whether the thread group has more >> than one member. > > Yes the comment should say. > > Threaded processes may not enter a different user namespace. > > As for the condition. mm_users will equal one for a non-threaded > process. And mm_users is the check we use in unshare to detect if > a threaded process calls unshare so I think the check seems perfectly > reasonable. Especially since the vm must have more than one member if > there is more than one member in the thread group. A non-threaded process can have mm_users == 2 with CLONE_VM. I'm not sure this is a problem, though. > >> In any case, why are threads special here? > > You know I don't think I stopped to think about it. The combination > of CLONE_NEWUSER and CLONE_THREAD have been denined since the first user > namespace support was merged in 2008. > > I do know that things can get really strange when you mix multiple > namespaces in a process. tkill of your own threads will stop working. > Which access permissions should apply to files you mmap, file handles > you have open, the core dumper etc. > > We do allow setresuid per thread so we might be able to cope > with a process that mixes with user namespaces in different threads, > but I would want a close review of things before we allow that kind of > sharing. > Fair enough. I'd personally be more concerned about shared signal handlers than shared tgid, though. The signal handler set has all kinds of weird things like session id. >> I think, although I haven't verified it, that these changes allow >> CAP_SYS_ADMIN to bypass the bounding set (and, in particular, to gain >> CAP_MODULE): unshare the user namespace and then setfd yourself back. I >> think that setns should only grant caps when changing to a descendent >> namespace. > > (See the end. A significant bug in cap_capable slipped in about > 3.5. cap_capable is only supposed to grant permissions to the owner > of a user namespace if it is a child user namespace). [snipping lots of stuff] If the intended semantics of cap_capable are, indeed: If targ_ns is equals or is a descendent of cred->user_ns and the cap is effective, return true. If targ_ns is owned by cred->euid and targ_ns is a descendent of cred->user_ns (but is not equal to cred->user_ns), then return true. Else return false then I agree with you on almost everything you said. I assumed that the actual semantics were intended. > >> unshare has a bug. This code: > > Interesting... > > Looking at it this is a very small misfeature. > > What is happening is that commit_creds is setting is making the task > undumpable because we changed the set of capabilities in struct cred. > > This in turn results in pid_revalidate setting the owner of > of /proc/self/uid_map to GLOBAL_ROOT_UID. > > From the outside the permissions on /proc/self/uid_map look like: > -rw-r--r-- 1 root root 0 2012-12-13 12:43
Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8
Andy Lutomirski writes: > On 12/11/2012 01:17 PM, Eric W. Biederman wrote: >> >> Linus, >> >> Please pull the for-linus git tree from: >> >> >> git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git >> for-linus >> >>HEAD: 98f842e675f96ffac96e6c50315790912b2812be proc: Usable inode numbers >> for the namespace file descriptors. >> >>This tree is against v3.7-rc3 > > You've just allowed unprivileged users to create new pid namespaces, > etc, by creating a new userns, then creating a new pid namespace inside > that userns, then setns-ing from outside the userns into the pid ns. Is > this intentional? (The mount ns is okay -- it checks for CAP_CHROOT on > setns.) Absolutely. My commit messages talk about this. I allow creating other namespaces once inside a user namespace deliberately. There is no reason I know of to ban creation of pid and other namespaces once you are inside of a user namespace. But please also note the difference between capable and ns_capable. Any security check that is capable() still requires priviliges in the initial user namespace. > In user_namespace.c: > > /* Threaded many not enter a different user namespace */ > if (atomic_read(>mm->mm_users) > 1) > return -EINVAL; > > The comment has a typo. Also, you're checking the wrong condition: > that's whether the vm is shared, not whether the thread group has more > than one member. Yes the comment should say. Threaded processes may not enter a different user namespace. As for the condition. mm_users will equal one for a non-threaded process. And mm_users is the check we use in unshare to detect if a threaded process calls unshare so I think the check seems perfectly reasonable. Especially since the vm must have more than one member if there is more than one member in the thread group. > In any case, why are threads special here? You know I don't think I stopped to think about it. The combination of CLONE_NEWUSER and CLONE_THREAD have been denined since the first user namespace support was merged in 2008. I do know that things can get really strange when you mix multiple namespaces in a process. tkill of your own threads will stop working. Which access permissions should apply to files you mmap, file handles you have open, the core dumper etc. We do allow setresuid per thread so we might be able to cope with a process that mixes with user namespaces in different threads, but I would want a close review of things before we allow that kind of sharing. > I think, although I haven't verified it, that these changes allow > CAP_SYS_ADMIN to bypass the bounding set (and, in particular, to gain > CAP_MODULE): unshare the user namespace and then setfd yourself back. I > think that setns should only grant caps when changing to a descendent > namespace. (See the end. A significant bug in cap_capable slipped in about 3.5. cap_capable is only supposed to grant permissions to the owner of a user namespace if it is a child user namespace). These changes do not allow CAP_SYS_ADMIN to bypass the bounding set. The test: if (!ns_capable(user_ns, CAP_SYS_ADMIN)) return -EPERM; verifies that the user namespace we are entering is a nested user namespace, because we can only have CAP_SYS_ADMIN in our current namespace and in nested user namespaces. > Also in userns_install: > > 796 /* Don't allow gaining capabilities by reentering > 797 * the same user namespace. > 798 */ > 799 if (user_ns == current_user_ns()) > 800 return -EINVAL; > > Why? To keep processes that deliberately drop some capabilities from being able to gain those capabilities back by reentering the current user namespace. Aka that test plus the ns_capable test prevent are the combination that prevent a process gaining privileges in the current user namespace. > You can trivially bypass this by creating a temporary user ns. > (If you're the owner of your own ns, then you can create a subsidiary > ns, map yourself into it, then setns back -- you'll still be the > owner.) Nope. Once you have capabilities in a user namespace you do not have any capabilities in the parent user namespace. Entering a user namespace is a one way operation. > unshare has a bug. This code: Interesting... Looking at it this is a very small misfeature. What is happening is that commit_creds is setting is making the task undumpable because we changed the set of capabilities in struct cred. This in turn results in pid_revalidate setting the owner of of /proc/self/uid_map to GLOBAL_ROOT_UID. >From the outside the permissions on /proc/self/uid_map look like: -rw-r--r-- 1 root root 0 2012-12-13 12:43 /proc/30530/uid_map Then since /proc/self/uid_map uses the default permission function and the test program below is not run as root the read-write open of uid_map fails. > Also, I'm entirely unconvinced that the owner of a userns should
Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8
On 12/11/2012 01:17 PM, Eric W. Biederman wrote: > > Linus, > > Please pull the for-linus git tree from: > >git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git > for-linus > >HEAD: 98f842e675f96ffac96e6c50315790912b2812be proc: Usable inode numbers > for the namespace file descriptors. > >This tree is against v3.7-rc3 You've just allowed unprivileged users to create new pid namespaces, etc, by creating a new userns, then creating a new pid namespace inside that userns, then setns-ing from outside the userns into the pid ns. Is this intentional? (The mount ns is okay -- it checks for CAP_CHROOT on setns.) In user_namespace.c: /* Threaded many not enter a different user namespace */ if (atomic_read(>mm->mm_users) > 1) return -EINVAL; The comment has a typo. Also, you're checking the wrong condition: that's whether the vm is shared, not whether the thread group has more than one member. In any case, why are threads special here? I think, although I haven't verified it, that these changes allow CAP_SYS_ADMIN to bypass the bounding set (and, in particular, to gain CAP_MODULE): unshare the user namespace and then setfd yourself back. I think that setns should only grant caps when changing to a descendent namespace. Also in userns_install: 796 /* Don't allow gaining capabilities by reentering 797 * the same user namespace. 798 */ 799 if (user_ns == current_user_ns()) 800 return -EINVAL; Why? You can trivially bypass this by creating a temporary user ns. (If you're the owner of your own ns, then you can create a subsidiary ns, map yourself into it, then setns back -- you'll still be the owner.) unshare has a bug. This code: #define _GNU_SOURCE #include #include #include #include #include #include #include #include #include #include #include #include #include static void fail(const char *msg) { perror(msg); exit(1); } int main() { if (unshare(CLONE_NEWUSER) != 0) fail("CLONE_NEWUSER"); if (open("/proc/self/uid_map", O_RDWR) == -1) perror("/proc/self/uid_map O_RDWR"); int fd = open("/proc/self/uid_map", O_RDONLY); if (fd == -1) { perror("/proc/self/uid_map O_RDONLY"); } else { char buf[4096]; ssize_t len = read(fd, buf, sizeof(buf)); if (len > 0) write(1, buf, len); else printf("read uid_map returned %d\n", (int)len); } } produces this output: /proc/self/uid_map O_RDWR: Permission denied read uid_map returned 0 With clone instead of unshare, it works. I'm not quite sure what's going on. Also, I'm entirely unconvinced that the owner of a userns should automatically have all caps (in the cap_capable sense) on a userns if the task is inside that ns. What's wrong with just using normal caps? (Of course, the fact that caps don't usefully inherit is an issue -- there's a lng thread going on right now about that.) But doing this enshrines root-has-caps even farther into ABI. At least please make this depend on !SECURE_NOROOT. > > While small this set of changes is very significant with respect to > containers in general and user namespaces in particular. The user space > interface is now complete. > > This set of changes adds support for unprivileged users to create user > namespaces and as a user namespace root to create other namespaces. The > tyrrany of supporting suid root preventing unprivileged users from using > cool new kernel features is broken. > no_new_privs already (kind of) did that :) --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8
On 12/11/2012 01:17 PM, Eric W. Biederman wrote: Linus, Please pull the for-linus git tree from: git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus HEAD: 98f842e675f96ffac96e6c50315790912b2812be proc: Usable inode numbers for the namespace file descriptors. This tree is against v3.7-rc3 You've just allowed unprivileged users to create new pid namespaces, etc, by creating a new userns, then creating a new pid namespace inside that userns, then setns-ing from outside the userns into the pid ns. Is this intentional? (The mount ns is okay -- it checks for CAP_CHROOT on setns.) In user_namespace.c: /* Threaded many not enter a different user namespace */ if (atomic_read(current-mm-mm_users) 1) return -EINVAL; The comment has a typo. Also, you're checking the wrong condition: that's whether the vm is shared, not whether the thread group has more than one member. In any case, why are threads special here? I think, although I haven't verified it, that these changes allow CAP_SYS_ADMIN to bypass the bounding set (and, in particular, to gain CAP_MODULE): unshare the user namespace and then setfd yourself back. I think that setns should only grant caps when changing to a descendent namespace. Also in userns_install: 796 /* Don't allow gaining capabilities by reentering 797 * the same user namespace. 798 */ 799 if (user_ns == current_user_ns()) 800 return -EINVAL; Why? You can trivially bypass this by creating a temporary user ns. (If you're the owner of your own ns, then you can create a subsidiary ns, map yourself into it, then setns back -- you'll still be the owner.) unshare has a bug. This code: #define _GNU_SOURCE #include stdbool.h #include stdio.h #include stdlib.h #include string.h #include errno.h #include signal.h #include unistd.h #include sys/types.h #include sys/wait.h #include sys/prctl.h #include sys/mman.h #include sched.h #include fcntl.h static void fail(const char *msg) { perror(msg); exit(1); } int main() { if (unshare(CLONE_NEWUSER) != 0) fail(CLONE_NEWUSER); if (open(/proc/self/uid_map, O_RDWR) == -1) perror(/proc/self/uid_map O_RDWR); int fd = open(/proc/self/uid_map, O_RDONLY); if (fd == -1) { perror(/proc/self/uid_map O_RDONLY); } else { char buf[4096]; ssize_t len = read(fd, buf, sizeof(buf)); if (len 0) write(1, buf, len); else printf(read uid_map returned %d\n, (int)len); } } produces this output: /proc/self/uid_map O_RDWR: Permission denied read uid_map returned 0 With clone instead of unshare, it works. I'm not quite sure what's going on. Also, I'm entirely unconvinced that the owner of a userns should automatically have all caps (in the cap_capable sense) on a userns if the task is inside that ns. What's wrong with just using normal caps? (Of course, the fact that caps don't usefully inherit is an issue -- there's a lng thread going on right now about that.) But doing this enshrines root-has-caps even farther into ABI. At least please make this depend on !SECURE_NOROOT. While small this set of changes is very significant with respect to containers in general and user namespaces in particular. The user space interface is now complete. This set of changes adds support for unprivileged users to create user namespaces and as a user namespace root to create other namespaces. The tyrrany of supporting suid root preventing unprivileged users from using cool new kernel features is broken. no_new_privs already (kind of) did that :) --Andy -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8
Andy Lutomirski l...@amacapital.net writes: On 12/11/2012 01:17 PM, Eric W. Biederman wrote: Linus, Please pull the for-linus git tree from: git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus HEAD: 98f842e675f96ffac96e6c50315790912b2812be proc: Usable inode numbers for the namespace file descriptors. This tree is against v3.7-rc3 You've just allowed unprivileged users to create new pid namespaces, etc, by creating a new userns, then creating a new pid namespace inside that userns, then setns-ing from outside the userns into the pid ns. Is this intentional? (The mount ns is okay -- it checks for CAP_CHROOT on setns.) Absolutely. My commit messages talk about this. I allow creating other namespaces once inside a user namespace deliberately. There is no reason I know of to ban creation of pid and other namespaces once you are inside of a user namespace. But please also note the difference between capable and ns_capable. Any security check that is capable() still requires priviliges in the initial user namespace. In user_namespace.c: /* Threaded many not enter a different user namespace */ if (atomic_read(current-mm-mm_users) 1) return -EINVAL; The comment has a typo. Also, you're checking the wrong condition: that's whether the vm is shared, not whether the thread group has more than one member. Yes the comment should say. Threaded processes may not enter a different user namespace. As for the condition. mm_users will equal one for a non-threaded process. And mm_users is the check we use in unshare to detect if a threaded process calls unshare so I think the check seems perfectly reasonable. Especially since the vm must have more than one member if there is more than one member in the thread group. In any case, why are threads special here? You know I don't think I stopped to think about it. The combination of CLONE_NEWUSER and CLONE_THREAD have been denined since the first user namespace support was merged in 2008. I do know that things can get really strange when you mix multiple namespaces in a process. tkill of your own threads will stop working. Which access permissions should apply to files you mmap, file handles you have open, the core dumper etc. We do allow setresuid per thread so we might be able to cope with a process that mixes with user namespaces in different threads, but I would want a close review of things before we allow that kind of sharing. I think, although I haven't verified it, that these changes allow CAP_SYS_ADMIN to bypass the bounding set (and, in particular, to gain CAP_MODULE): unshare the user namespace and then setfd yourself back. I think that setns should only grant caps when changing to a descendent namespace. (See the end. A significant bug in cap_capable slipped in about 3.5. cap_capable is only supposed to grant permissions to the owner of a user namespace if it is a child user namespace). These changes do not allow CAP_SYS_ADMIN to bypass the bounding set. The test: if (!ns_capable(user_ns, CAP_SYS_ADMIN)) return -EPERM; verifies that the user namespace we are entering is a nested user namespace, because we can only have CAP_SYS_ADMIN in our current namespace and in nested user namespaces. Also in userns_install: 796 /* Don't allow gaining capabilities by reentering 797 * the same user namespace. 798 */ 799 if (user_ns == current_user_ns()) 800 return -EINVAL; Why? To keep processes that deliberately drop some capabilities from being able to gain those capabilities back by reentering the current user namespace. Aka that test plus the ns_capable test prevent are the combination that prevent a process gaining privileges in the current user namespace. You can trivially bypass this by creating a temporary user ns. (If you're the owner of your own ns, then you can create a subsidiary ns, map yourself into it, then setns back -- you'll still be the owner.) Nope. Once you have capabilities in a user namespace you do not have any capabilities in the parent user namespace. Entering a user namespace is a one way operation. unshare has a bug. This code: Interesting... Looking at it this is a very small misfeature. What is happening is that commit_creds is setting is making the task undumpable because we changed the set of capabilities in struct cred. This in turn results in pid_revalidate setting the owner of of /proc/self/uid_map to GLOBAL_ROOT_UID. From the outside the permissions on /proc/self/uid_map look like: -rw-r--r-- 1 root root 0 2012-12-13 12:43 /proc/30530/uid_map Then since /proc/self/uid_map uses the default permission function and the test program below is not run as root the read-write open of uid_map fails. Also, I'm entirely unconvinced that the owner of a userns should automatically have all caps (in the
Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8
On Thu, Dec 13, 2012 at 2:01 PM, Eric W. Biederman ebied...@xmission.com wrote: Andy Lutomirski l...@amacapital.net writes: On 12/11/2012 01:17 PM, Eric W. Biederman wrote: Linus, Please pull the for-linus git tree from: git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus HEAD: 98f842e675f96ffac96e6c50315790912b2812be proc: Usable inode numbers for the namespace file descriptors. This tree is against v3.7-rc3 You've just allowed unprivileged users to create new pid namespaces, etc, by creating a new userns, then creating a new pid namespace inside that userns, then setns-ing from outside the userns into the pid ns. Is this intentional? (The mount ns is okay -- it checks for CAP_CHROOT on setns.) Absolutely. My commit messages talk about this. I allow creating other namespaces once inside a user namespace deliberately. There is no reason I know of to ban creation of pid and other namespaces once you are inside of a user namespace. I must be looking at the wrong commit message. But please also note the difference between capable and ns_capable. Any security check that is capable() still requires priviliges in the initial user namespace. Huh? I'm talking about: clone with CLONE_NEWUSER - child does unshare(CLONE_NEWPID) - parent does setfd(child's pid namespace) Now the parent is running in the init userns with a different pid ns. Setuid binaries will work but will see the unexpected pid ns. With mount namespaces, this would be Really Bad (tm). With pid or ipc or net, it's less obviously dangerous, but I'm not convinced it's safe. I sort of think that setns on a *non*-userns should require CAP_SYS_ADMIN in the current userns, at least if no_new_privs isn't set. In user_namespace.c: /* Threaded many not enter a different user namespace */ if (atomic_read(current-mm-mm_users) 1) return -EINVAL; The comment has a typo. Also, you're checking the wrong condition: that's whether the vm is shared, not whether the thread group has more than one member. Yes the comment should say. Threaded processes may not enter a different user namespace. As for the condition. mm_users will equal one for a non-threaded process. And mm_users is the check we use in unshare to detect if a threaded process calls unshare so I think the check seems perfectly reasonable. Especially since the vm must have more than one member if there is more than one member in the thread group. A non-threaded process can have mm_users == 2 with CLONE_VM. I'm not sure this is a problem, though. In any case, why are threads special here? You know I don't think I stopped to think about it. The combination of CLONE_NEWUSER and CLONE_THREAD have been denined since the first user namespace support was merged in 2008. I do know that things can get really strange when you mix multiple namespaces in a process. tkill of your own threads will stop working. Which access permissions should apply to files you mmap, file handles you have open, the core dumper etc. We do allow setresuid per thread so we might be able to cope with a process that mixes with user namespaces in different threads, but I would want a close review of things before we allow that kind of sharing. Fair enough. I'd personally be more concerned about shared signal handlers than shared tgid, though. The signal handler set has all kinds of weird things like session id. I think, although I haven't verified it, that these changes allow CAP_SYS_ADMIN to bypass the bounding set (and, in particular, to gain CAP_MODULE): unshare the user namespace and then setfd yourself back. I think that setns should only grant caps when changing to a descendent namespace. (See the end. A significant bug in cap_capable slipped in about 3.5. cap_capable is only supposed to grant permissions to the owner of a user namespace if it is a child user namespace). [snipping lots of stuff] If the intended semantics of cap_capable are, indeed: If targ_ns is equals or is a descendent of cred-user_ns and the cap is effective, return true. If targ_ns is owned by cred-euid and targ_ns is a descendent of cred-user_ns (but is not equal to cred-user_ns), then return true. Else return false then I agree with you on almost everything you said. I assumed that the actual semantics were intended. unshare has a bug. This code: Interesting... Looking at it this is a very small misfeature. What is happening is that commit_creds is setting is making the task undumpable because we changed the set of capabilities in struct cred. This in turn results in pid_revalidate setting the owner of of /proc/self/uid_map to GLOBAL_ROOT_UID. From the outside the permissions on /proc/self/uid_map look like: -rw-r--r-- 1 root root 0 2012-12-13 12:43 /proc/30530/uid_map Then since /proc/self/uid_map uses the default permission function and the test
Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8
Andy Lutomirski l...@amacapital.net writes: On Thu, Dec 13, 2012 at 2:01 PM, Eric W. Biederman ebied...@xmission.com wrote: Andy Lutomirski l...@amacapital.net writes: On 12/11/2012 01:17 PM, Eric W. Biederman wrote: But please also note the difference between capable and ns_capable. Any security check that is capable() still requires priviliges in the initial user namespace. Huh? I'm talking about: clone with CLONE_NEWUSER - child does unshare(CLONE_NEWPID) - parent does setfd(child's pid namespace) Now the parent is running in the init userns with a different pid ns. Setuid binaries will work but will see the unexpected pid ns. With mount namespaces, this would be Really Bad (tm). With pid or ipc or net, it's less obviously dangerous, but I'm not convinced it's safe. That isn't safe. That is a sneaky bug in my tree that I overlooked. :( I sort of think that setns on a *non*-userns should require CAP_SYS_ADMIN in the current userns, at least if no_new_privs isn't set. Yes. CAP_SYS_ADMIN in your current user namespace should make setns as safe as it currently is before my patches. That is just a matter of adding a couple nsown_capable(CAP_SYS_ADMIN) permission checks. Right now I test for nsown_capable(CAP_SYS_CHROOT) for the mount namespace, which is probably sufficient to prevent those kinds of shenanigans but I am going to add a nsown_capable(CAP_SYS_ADMIN) for good measure. A non-threaded process can have mm_users == 2 with CLONE_VM. I'm not sure this is a problem, though. No it isn't. I said threads because they are the easy concept not because that covers all possible corner cases. In any case, why are threads special here? You know I don't think I stopped to think about it. The combination of CLONE_NEWUSER and CLONE_THREAD have been denined since the first user namespace support was merged in 2008. I do know that things can get really strange when you mix multiple namespaces in a process. tkill of your own threads will stop working. Which access permissions should apply to files you mmap, file handles you have open, the core dumper etc. We do allow setresuid per thread so we might be able to cope with a process that mixes with user namespaces in different threads, but I would want a close review of things before we allow that kind of sharing. Fair enough. I'd personally be more concerned about shared signal handlers than shared tgid, though. The signal handler set has all kinds of weird things like session id. CLONE_THREAD implies CLONE_VM and CLONE_SIGNAL, and in practice mm_users 1 protects against all of those cases. So I was really thinking all of those cases. (See the end. A significant bug in cap_capable slipped in about 3.5. cap_capable is only supposed to grant permissions to the owner of a user namespace if it is a child user namespace). [snipping lots of stuff] If the intended semantics of cap_capable are, indeed: If targ_ns is equals or is a descendent of cred-user_ns and the cap is effective, return true. If targ_ns is owned by cred-euid and targ_ns is a descendent of cred-user_ns (but is not equal to cred-user_ns), then return true. Else return false then I agree with you on almost everything you said. I assumed that the actual semantics were intended. Good. unshare has a bug. This code: Interesting... Looking at it this is a very small misfeature. What is happening is that commit_creds is setting is making the task undumpable because we changed the set of capabilities in struct cred. This in turn results in pid_revalidate setting the owner of of /proc/self/uid_map to GLOBAL_ROOT_UID. From the outside the permissions on /proc/self/uid_map look like: -rw-r--r-- 1 root root 0 2012-12-13 12:43 /proc/30530/uid_map Then since /proc/self/uid_map uses the default permission function and the test program below is not run as root the read-write open of uid_map fails. It's probably either worth fixing this or disabling unshare of userns. This makes it hard to use. IMO non-dumpable tasks should still be able to access the contents of /proc/self -- i.e. I'd call this a more general bug. But I'd also argue that unsharing userns shouldn't set non-dumpable -- cap_permitted increased, but the new capabilities are still logically a subset of the old ones. Agreed. Setting dumpable is the bug. I am going to sleep on it but the code in commit_creds should probably read: /* dumpability changes */ if (!uid_eq(old-euid, new-euid) || !gid_eq(old-egid, new-egid) || !uid_eq(old-fsuid, new-fsuid) || !gid_eq(old-fsgid, new-fsgid) || ((old-user_ns == new-user_ns) !cap_issubset(new-cap_permitted, old-cap_permitted))) { if (task-mm) set_dumpable(task-mm, suid_dumpable); task-pdeath_signal = 0; smp_wmb(); } Which is correct
Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8
On Thu, Dec 13, 2012 at 8:11 PM, Eric W. Biederman ebied...@xmission.com wrote: Andy Lutomirski l...@amacapital.net writes: One more issue: the requirement that both upper and lower uids (etc.) in the maps are in order is rather limiting. I have no objection if you only require upper ids to be monotonic, but currently there's no way to may root outside to uid n (for n 0) and some nonroot user outside to uid 0. There is. You may set up to 5 (extents). You just have to use a second extent for the non-contiguous bits. My reader is lazy and you have to set all of the extents with a single write, so you may have missed the ability to set more than one extent. If I'm wrong, I'll happily eat my words. Both: 0 1 1 1 0 1 and 1 0 1 0 1 1 are rejected, unless I totally messed up. --Andy -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[GIT PULL] user namespace and namespace infrastructure changes for 3.8
Linus, Please pull the for-linus git tree from: git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus HEAD: 98f842e675f96ffac96e6c50315790912b2812be proc: Usable inode numbers for the namespace file descriptors. This tree is against v3.7-rc3 While small this set of changes is very significant with respect to containers in general and user namespaces in particular. The user space interface is now complete. This set of changes adds support for unprivileged users to create user namespaces and as a user namespace root to create other namespaces. The tyrrany of supporting suid root preventing unprivileged users from using cool new kernel features is broken. This set of changes completes the work on setns, adding support for the pid, user, mount namespaces. This set of changes includes a bunch of basic pid namespace cleanups/simplifications. Of particular significance is the rework of the pid namespace cleanup so it no longer requires sending out tendrils into all kinds of unexpected cleanup paths for operation. At least one case of broken error handling is fixed by this cleanup. The files under /proc//ns/ have been converted from regular files to magic symlinks which prevents incorrect caching by the VFS, ensuring the files always refer to the namespace the process is currently using and ensuring that the ptrace_mayaccess permission checks are always applied. The files under /proc//ns/ have been given stable inode numbers so it is now possible to see if different processes share the same namespaces. Through the David Miller's net tree are changes to relax many of the permission checks in the networking stack to allowing the user namespace root to usefully use the networking stack. Similar changes for the mount namespace and the pid namespace are coming through my tree. Two small nework namespace changes were double committed here and in David Millers -net tree so that I could complete the work on the /proc//ns/ files in this tree. The user namespace work that remains is converting, 9p, afs, ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so they are safe to enable when user namespaces are enabled, and implementing unprivileged mounts of more than just /proc and /sys. I had hoped to get through more of those changes this cycle but I turned into a cold magnet this season and the UAPI changes caused a lot of churn late into the 3.7 -rc cycle that made a stable starting place hard to work from hard to find. Eric W. Biederman (37): userns: Support autofs4 interacing with multiple user namespaces userns: Support fuse interacting with multiple user namespaces netns: Deduplicate and fix copy_net_ns when !CONFIG_NET_NS userns: make each net (net_ns) belong to a user_ns userns: On mips modify check_same_owner to use uid_eq procfs: Use the proc generic infrastructure for proc/self. procfs: Don't cache a pid in the root inode. pidns: Capture the user namespace and filter ns_last_pid pidns: Use task_active_pid_ns where appropriate pidns: Make the pidns proc mount/umount logic obvious. pidns: Don't allow new processes in a dead pid namespace. pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1 pidns: Deny strange cases when creating pid namespaces. pidns: Add setns support pidns: Consolidate initialzation of special init task state pidns: Support unsharing the pid namespace. vfs: Allow chroot if you have CAP_SYS_CHROOT in your user namespace vfs: Add setns support for the mount namespace vfs: Add a user namespace reference from struct mnt_namespace vfs: Only support slave subtrees across different user namespaces vfs: Allow unprivileged manipulation of the mount namespace. userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped userns: Allow unprivileged users to create user namespaces. userns: Allow chown and setgid preservation userns: Allow setting a userns mapping to your current uid. userns: Allow unprivileged users to create new namespaces userns: Allow unprivileged use of setns. userns: Make create_new_namespaces take a user_ns parameter userns: Kill task_user_ns userns: Implent proc namespace operations userns: Implement unshare of the user namespace procfs: Print task uids and gids in the userns that opened the proc file userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file userns: Allow unprivilged mounts of proc and sysfs proc: Generalize proc inode allocation proc: Fix the namespace inode permission checks. proc: Usable inode numbers for the namespace file descriptors. Zhao Hongjiang (1): userns: fix return value on mntns_install() failure arch/mips/kernel/mips-mt-fpaff.c |4 +- arch/powerpc/platforms/cell/spufs/sched.c |2 +-
[GIT PULL] user namespace and namespace infrastructure changes for 3.8
Linus, Please pull the for-linus git tree from: git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus HEAD: 98f842e675f96ffac96e6c50315790912b2812be proc: Usable inode numbers for the namespace file descriptors. This tree is against v3.7-rc3 While small this set of changes is very significant with respect to containers in general and user namespaces in particular. The user space interface is now complete. This set of changes adds support for unprivileged users to create user namespaces and as a user namespace root to create other namespaces. The tyrrany of supporting suid root preventing unprivileged users from using cool new kernel features is broken. This set of changes completes the work on setns, adding support for the pid, user, mount namespaces. This set of changes includes a bunch of basic pid namespace cleanups/simplifications. Of particular significance is the rework of the pid namespace cleanup so it no longer requires sending out tendrils into all kinds of unexpected cleanup paths for operation. At least one case of broken error handling is fixed by this cleanup. The files under /proc/pid/ns/ have been converted from regular files to magic symlinks which prevents incorrect caching by the VFS, ensuring the files always refer to the namespace the process is currently using and ensuring that the ptrace_mayaccess permission checks are always applied. The files under /proc/pid/ns/ have been given stable inode numbers so it is now possible to see if different processes share the same namespaces. Through the David Miller's net tree are changes to relax many of the permission checks in the networking stack to allowing the user namespace root to usefully use the networking stack. Similar changes for the mount namespace and the pid namespace are coming through my tree. Two small nework namespace changes were double committed here and in David Millers -net tree so that I could complete the work on the /proc/pid/ns/ files in this tree. The user namespace work that remains is converting, 9p, afs, ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so they are safe to enable when user namespaces are enabled, and implementing unprivileged mounts of more than just /proc and /sys. I had hoped to get through more of those changes this cycle but I turned into a cold magnet this season and the UAPI changes caused a lot of churn late into the 3.7 -rc cycle that made a stable starting place hard to work from hard to find. Eric W. Biederman (37): userns: Support autofs4 interacing with multiple user namespaces userns: Support fuse interacting with multiple user namespaces netns: Deduplicate and fix copy_net_ns when !CONFIG_NET_NS userns: make each net (net_ns) belong to a user_ns userns: On mips modify check_same_owner to use uid_eq procfs: Use the proc generic infrastructure for proc/self. procfs: Don't cache a pid in the root inode. pidns: Capture the user namespace and filter ns_last_pid pidns: Use task_active_pid_ns where appropriate pidns: Make the pidns proc mount/umount logic obvious. pidns: Don't allow new processes in a dead pid namespace. pidns: Wait in zap_pid_ns_processes until pid_ns-nr_hashed == 1 pidns: Deny strange cases when creating pid namespaces. pidns: Add setns support pidns: Consolidate initialzation of special init task state pidns: Support unsharing the pid namespace. vfs: Allow chroot if you have CAP_SYS_CHROOT in your user namespace vfs: Add setns support for the mount namespace vfs: Add a user namespace reference from struct mnt_namespace vfs: Only support slave subtrees across different user namespaces vfs: Allow unprivileged manipulation of the mount namespace. userns: Ignore suid and sgid on binaries if the uid or gid can not be mapped userns: Allow unprivileged users to create user namespaces. userns: Allow chown and setgid preservation userns: Allow setting a userns mapping to your current uid. userns: Allow unprivileged users to create new namespaces userns: Allow unprivileged use of setns. userns: Make create_new_namespaces take a user_ns parameter userns: Kill task_user_ns userns: Implent proc namespace operations userns: Implement unshare of the user namespace procfs: Print task uids and gids in the userns that opened the proc file userns: For /proc/self/{uid,gid}_map derive the lower userns from the struct file userns: Allow unprivilged mounts of proc and sysfs proc: Generalize proc inode allocation proc: Fix the namespace inode permission checks. proc: Usable inode numbers for the namespace file descriptors. Zhao Hongjiang (1): userns: fix return value on mntns_install() failure arch/mips/kernel/mips-mt-fpaff.c |4 +- arch/powerpc/platforms/cell/spufs/sched.c |2 +-