Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8

2012-12-20 Thread Eric W. Biederman
Rob Landley  writes:

> On 12/17/2012 05:18:57 PM, Eric W. Biederman wrote:
>> Work remains to make it safe to build user namespaces and 9p, afs,
>> ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the  
>> Kconfig
>> guard remains in place preventing that user namespaces from being  
>> built
>> when any of those filesystems are enabled.
>
> What work specifically?

Essentially the uid_t to kuid_t and gid_t to kgid_t conversion.

These are the more complex filesystems.  And I haven't yet had
the time to go through the preliminary patches in my development branch
and convert them into stupid obviously correct patches, or review them
closely.

The changes generally are pretty simple the amount of review to ensure I
didn't overlook things tends to proprotional to the size of the file
system.

There is also the fact that I don't use most of those file-systems.

The big advantage of finishing those filesystems is that user namespaces
can be turned on in allyesconfig, allowing my compile time checks to
notice when someone doesn't use kuids and kgids.

Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8

2012-12-20 Thread Rob Landley

On 12/17/2012 05:18:57 PM, Eric W. Biederman wrote:

Work remains to make it safe to build user namespaces and 9p, afs,
ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the  
Kconfig
guard remains in place preventing that user namespaces from being  
built

when any of those filesystems are enabled.


What work specifically?

Rob--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8

2012-12-20 Thread Rob Landley

On 12/17/2012 05:18:57 PM, Eric W. Biederman wrote:

Work remains to make it safe to build user namespaces and 9p, afs,
ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the  
Kconfig
guard remains in place preventing that user namespaces from being  
built

when any of those filesystems are enabled.


What work specifically?

Rob--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8

2012-12-20 Thread Eric W. Biederman
Rob Landley r...@landley.net writes:

 On 12/17/2012 05:18:57 PM, Eric W. Biederman wrote:
 Work remains to make it safe to build user namespaces and 9p, afs,
 ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the  
 Kconfig
 guard remains in place preventing that user namespaces from being  
 built
 when any of those filesystems are enabled.

 What work specifically?

Essentially the uid_t to kuid_t and gid_t to kgid_t conversion.

These are the more complex filesystems.  And I haven't yet had
the time to go through the preliminary patches in my development branch
and convert them into stupid obviously correct patches, or review them
closely.

The changes generally are pretty simple the amount of review to ensure I
didn't overlook things tends to proprotional to the size of the file
system.

There is also the fact that I don't use most of those file-systems.

The big advantage of finishing those filesystems is that user namespaces
can be turned on in allyesconfig, allowing my compile time checks to
notice when someone doesn't use kuids and kgids.

Eric

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8

2012-12-17 Thread Eric W. Biederman
ebied...@xmission.com (Eric W. Biederman) writes:

> Linus,
>
> Please pull the for-linus git tree from:
>
>git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git 
> for-linus
>
>HEAD: 5155040ed349950e16c093ba8e65ad534994df2a userns: Fix typo in 
> description of the limitation of userns_install
>
>This tree is against v3.7-rc3
>
> The embarrasing oversights that Andy found have been corrected.

Those bugs, those darn embarrasing bugs just want don't want to get
fixed.

Linus I just updated my mirror of your kernel.org tree and it appears
you successfully pulled everything except the last 4 commits that fix
those embarrasing bugs.

When you get a chance can you please repull my branch (the details
above are still corect.

The pending changes are.

Eric W. Biederman (4):
  Fix cap_capable to only allow owners in the parent user namespace to have 
caps.
  userns: Require CAP_SYS_ADMIN for most uses of setns.
  userns: Add a more complete capability subset test to commit_creds
  userns: Fix typo in description of the limitation of userns_install

 fs/namespace.c   |3 ++-
 ipc/namespace.c  |3 ++-
 kernel/cred.c|   27 ++-
 kernel/pid_namespace.c   |3 ++-
 kernel/user_namespace.c  |2 +-
 kernel/utsname.c |3 ++-
 net/core/net_namespace.c |3 ++-
 security/commoncap.c |   25 +
 8 files changed, 54 insertions(+), 15 deletions(-)

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] user namespace and namespace infrastructure changes for 3.8

2012-12-17 Thread Eric W. Biederman

Linus,

Please pull the for-linus git tree from:

   git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git 
for-linus

   HEAD: 5155040ed349950e16c093ba8e65ad534994df2a userns: Fix typo in 
description of the limitation of userns_install

   This tree is against v3.7-rc3

The embarrasing oversights that Andy found have been corrected.

While small this set of changes is very significant with respect to
containers in general and user namespaces in particular.  The user space
interface is now complete.

This set of changes adds support for unprivileged users to create user
namespaces and as a user namespace root to create other namespaces.  The
tyrrany of supporting suid root preventing unprivileged users from using
cool new kernel features is broken.

This set of changes completes the work on setns, adding support for
the pid, user, mount namespaces.

This set of changes includes a bunch of basic pid namespace
cleanups/simplifications.  Of particular significance is the rework of
the pid namespace cleanup so it no longer requires sending out tendrils
into all kinds of unexpected cleanup paths for operation.  At least one
case of broken error handling is fixed by this cleanup.

The files under /proc//ns/ have been converted from regular files
to magic symlinks which prevents incorrect caching by the VFS, ensuring
the files always refer to the namespace the process is currently using
and ensuring that the ptrace_mayaccess permission checks are always
applied.

The files under /proc//ns/ have been given stable inode numbers so
it is now possible to see if different processes share the same
namespaces.

Through the David Miller's net tree are changes to relax many of the
permission checks in the networking stack to allowing the user namespace
root to usefully use the networking stack.  Similar changes for the
mount namespace and the pid namespace are coming through my tree.

Two small changes to add user namespace support were commited here
adn in David Miller's -net tree so that I could complete the work on the
/proc//ns/ files in this tree.  

Work remains to make it safe to build user namespaces and 9p, afs,
ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the Kconfig
guard remains in place preventing that user namespaces from being built
when any of those filesystems are enabled.

Future design work remains to allow root users outside of the initial
user namespace to mount more than just /proc and /sys.

Eric W. Biederman (41):
  userns: Support autofs4 interacing with multiple user namespaces
  userns: Support fuse interacting with multiple user namespaces
  netns: Deduplicate and fix copy_net_ns when !CONFIG_NET_NS
  userns: make each net (net_ns) belong to a user_ns
  userns: On mips modify check_same_owner to use uid_eq
  procfs: Use the proc generic infrastructure for proc/self.
  procfs: Don't cache a pid in the root inode.
  pidns: Capture the user namespace and filter ns_last_pid
  pidns: Use task_active_pid_ns where appropriate
  pidns: Make the pidns proc mount/umount logic obvious.
  pidns: Don't allow new processes in a dead pid namespace.
  pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1
  pidns: Deny strange cases when creating pid namespaces.
  pidns: Add setns support
  pidns: Consolidate initialzation of special init task state
  pidns: Support unsharing the pid namespace.
  vfs: Allow chroot if you have CAP_SYS_CHROOT in your user namespace
  vfs: Add setns support for the mount namespace
  vfs: Add a user namespace reference from struct mnt_namespace
  vfs: Only support slave subtrees across different user namespaces
  vfs: Allow unprivileged manipulation of the mount namespace.
  userns: Ignore suid and sgid on binaries if the uid or gid can not be 
mapped
  userns: Allow unprivileged users to create user namespaces.
  userns: Allow chown and setgid preservation
  userns: Allow setting a userns mapping to your current uid.
  userns: Allow unprivileged users to create new namespaces
  userns: Allow unprivileged use of setns.
  userns: Make create_new_namespaces take a user_ns parameter
  userns: Kill task_user_ns
  userns: Implent proc namespace operations
  userns: Implement unshare of the user namespace
  procfs: Print task uids and gids in the userns that opened the proc file
  userns: For /proc/self/{uid,gid}_map derive the lower userns from the 
struct file
  userns: Allow unprivilged mounts of proc and sysfs
  proc: Generalize proc inode allocation
  proc: Fix the namespace inode permission checks.
  proc: Usable inode numbers for the namespace file descriptors.
  Fix cap_capable to only allow owners in the parent user namespace to have 
caps.
  userns: Require CAP_SYS_ADMIN for most uses of setns.
  userns: Add a more complete capability subset test to commit_creds
  

[GIT PULL] user namespace and namespace infrastructure changes for 3.8

2012-12-17 Thread Eric W. Biederman

Linus,

Please pull the for-linus git tree from:

   git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git 
for-linus

   HEAD: 5155040ed349950e16c093ba8e65ad534994df2a userns: Fix typo in 
description of the limitation of userns_install

   This tree is against v3.7-rc3

The embarrasing oversights that Andy found have been corrected.

While small this set of changes is very significant with respect to
containers in general and user namespaces in particular.  The user space
interface is now complete.

This set of changes adds support for unprivileged users to create user
namespaces and as a user namespace root to create other namespaces.  The
tyrrany of supporting suid root preventing unprivileged users from using
cool new kernel features is broken.

This set of changes completes the work on setns, adding support for
the pid, user, mount namespaces.

This set of changes includes a bunch of basic pid namespace
cleanups/simplifications.  Of particular significance is the rework of
the pid namespace cleanup so it no longer requires sending out tendrils
into all kinds of unexpected cleanup paths for operation.  At least one
case of broken error handling is fixed by this cleanup.

The files under /proc/pid/ns/ have been converted from regular files
to magic symlinks which prevents incorrect caching by the VFS, ensuring
the files always refer to the namespace the process is currently using
and ensuring that the ptrace_mayaccess permission checks are always
applied.

The files under /proc/pid/ns/ have been given stable inode numbers so
it is now possible to see if different processes share the same
namespaces.

Through the David Miller's net tree are changes to relax many of the
permission checks in the networking stack to allowing the user namespace
root to usefully use the networking stack.  Similar changes for the
mount namespace and the pid namespace are coming through my tree.

Two small changes to add user namespace support were commited here
adn in David Miller's -net tree so that I could complete the work on the
/proc/pid/ns/ files in this tree.  

Work remains to make it safe to build user namespaces and 9p, afs,
ceph, cifs, coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so the Kconfig
guard remains in place preventing that user namespaces from being built
when any of those filesystems are enabled.

Future design work remains to allow root users outside of the initial
user namespace to mount more than just /proc and /sys.

Eric W. Biederman (41):
  userns: Support autofs4 interacing with multiple user namespaces
  userns: Support fuse interacting with multiple user namespaces
  netns: Deduplicate and fix copy_net_ns when !CONFIG_NET_NS
  userns: make each net (net_ns) belong to a user_ns
  userns: On mips modify check_same_owner to use uid_eq
  procfs: Use the proc generic infrastructure for proc/self.
  procfs: Don't cache a pid in the root inode.
  pidns: Capture the user namespace and filter ns_last_pid
  pidns: Use task_active_pid_ns where appropriate
  pidns: Make the pidns proc mount/umount logic obvious.
  pidns: Don't allow new processes in a dead pid namespace.
  pidns: Wait in zap_pid_ns_processes until pid_ns-nr_hashed == 1
  pidns: Deny strange cases when creating pid namespaces.
  pidns: Add setns support
  pidns: Consolidate initialzation of special init task state
  pidns: Support unsharing the pid namespace.
  vfs: Allow chroot if you have CAP_SYS_CHROOT in your user namespace
  vfs: Add setns support for the mount namespace
  vfs: Add a user namespace reference from struct mnt_namespace
  vfs: Only support slave subtrees across different user namespaces
  vfs: Allow unprivileged manipulation of the mount namespace.
  userns: Ignore suid and sgid on binaries if the uid or gid can not be 
mapped
  userns: Allow unprivileged users to create user namespaces.
  userns: Allow chown and setgid preservation
  userns: Allow setting a userns mapping to your current uid.
  userns: Allow unprivileged users to create new namespaces
  userns: Allow unprivileged use of setns.
  userns: Make create_new_namespaces take a user_ns parameter
  userns: Kill task_user_ns
  userns: Implent proc namespace operations
  userns: Implement unshare of the user namespace
  procfs: Print task uids and gids in the userns that opened the proc file
  userns: For /proc/self/{uid,gid}_map derive the lower userns from the 
struct file
  userns: Allow unprivilged mounts of proc and sysfs
  proc: Generalize proc inode allocation
  proc: Fix the namespace inode permission checks.
  proc: Usable inode numbers for the namespace file descriptors.
  Fix cap_capable to only allow owners in the parent user namespace to have 
caps.
  userns: Require CAP_SYS_ADMIN for most uses of setns.
  userns: Add a more complete capability subset test to commit_creds
  

Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8

2012-12-17 Thread Eric W. Biederman
ebied...@xmission.com (Eric W. Biederman) writes:

 Linus,

 Please pull the for-linus git tree from:

git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git 
 for-linus

HEAD: 5155040ed349950e16c093ba8e65ad534994df2a userns: Fix typo in 
 description of the limitation of userns_install

This tree is against v3.7-rc3

 The embarrasing oversights that Andy found have been corrected.

Those bugs, those darn embarrasing bugs just want don't want to get
fixed.

Linus I just updated my mirror of your kernel.org tree and it appears
you successfully pulled everything except the last 4 commits that fix
those embarrasing bugs.

When you get a chance can you please repull my branch (the details
above are still corect.

The pending changes are.

Eric W. Biederman (4):
  Fix cap_capable to only allow owners in the parent user namespace to have 
caps.
  userns: Require CAP_SYS_ADMIN for most uses of setns.
  userns: Add a more complete capability subset test to commit_creds
  userns: Fix typo in description of the limitation of userns_install

 fs/namespace.c   |3 ++-
 ipc/namespace.c  |3 ++-
 kernel/cred.c|   27 ++-
 kernel/pid_namespace.c   |3 ++-
 kernel/user_namespace.c  |2 +-
 kernel/utsname.c |3 ++-
 net/core/net_namespace.c |3 ++-
 security/commoncap.c |   25 +
 8 files changed, 54 insertions(+), 15 deletions(-)

Eric
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8

2012-12-14 Thread Eric W. Biederman
Andy Lutomirski  writes:

> On Thu, Dec 13, 2012 at 8:11 PM, Eric W. Biederman
>  wrote:
>> Andy Lutomirski  writes:
>>
>>> One more issue: the requirement that both upper and lower uids (etc.)
>>> in the maps are in order is rather limiting.  I have no objection if
>>> you only require upper ids to be monotonic, but currently there's no
>>> way to may root outside to uid n (for n > 0) and some nonroot user
>>> outside to uid 0.
>>
>> There is.  You may set up to 5 (extents).  You just have to use a second
>> extent for the non-contiguous bits.  My reader is lazy and you have to
>> set all of the extents with a single write, so you may have missed the
>> ability to set more than one extent.
>>
>
> If I'm wrong, I'll happily eat my words.  Both:
>
> 0 1 1
> 1 0 1
>
> and
>
> 1 0 1
> 0 1 1
>
> are rejected, unless I totally messed up.

Duh.  You are right.  

It is this check:

/* For now only accept extents that are strictly in order */
if (last &&
(((last->first + last->count) > extent->first) ||
 ((last->lower_first + last->count) > extent->lower_first)))
goto out;

Fundamentally every value mapped must be distinct.  Aka the direction of
the mapping must be reversible without loss of information.  

Ensuring all of the values were increasing in the extents was just a
lame way of ensuring that the same value was not mapped twice in either
the upper or lower ranges.

That check can most certainly be relaxed (patches welcome).  But that
probably isn't 3.8 material as that is feature work.

Not having bumped into this limitation myself I'm not certain the value
in removing this check.  But there is no good reason not to replace the
current check with a more general one either.

So your example should work, and that it doesn't is a misfeature.

Eric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8

2012-12-14 Thread Eric W. Biederman
Andy Lutomirski l...@amacapital.net writes:

 On Thu, Dec 13, 2012 at 8:11 PM, Eric W. Biederman
 ebied...@xmission.com wrote:
 Andy Lutomirski l...@amacapital.net writes:

 One more issue: the requirement that both upper and lower uids (etc.)
 in the maps are in order is rather limiting.  I have no objection if
 you only require upper ids to be monotonic, but currently there's no
 way to may root outside to uid n (for n  0) and some nonroot user
 outside to uid 0.

 There is.  You may set up to 5 (extents).  You just have to use a second
 extent for the non-contiguous bits.  My reader is lazy and you have to
 set all of the extents with a single write, so you may have missed the
 ability to set more than one extent.


 If I'm wrong, I'll happily eat my words.  Both:

 0 1 1
 1 0 1

 and

 1 0 1
 0 1 1

 are rejected, unless I totally messed up.

Duh.  You are right.  

It is this check:

/* For now only accept extents that are strictly in order */
if (last 
(((last-first + last-count)  extent-first) ||
 ((last-lower_first + last-count)  extent-lower_first)))
goto out;

Fundamentally every value mapped must be distinct.  Aka the direction of
the mapping must be reversible without loss of information.  

Ensuring all of the values were increasing in the extents was just a
lame way of ensuring that the same value was not mapped twice in either
the upper or lower ranges.

That check can most certainly be relaxed (patches welcome).  But that
probably isn't 3.8 material as that is feature work.

Not having bumped into this limitation myself I'm not certain the value
in removing this check.  But there is no good reason not to replace the
current check with a more general one either.

So your example should work, and that it doesn't is a misfeature.

Eric

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8

2012-12-13 Thread Andy Lutomirski
On Thu, Dec 13, 2012 at 8:11 PM, Eric W. Biederman
 wrote:
> Andy Lutomirski  writes:
>
>> One more issue: the requirement that both upper and lower uids (etc.)
>> in the maps are in order is rather limiting.  I have no objection if
>> you only require upper ids to be monotonic, but currently there's no
>> way to may root outside to uid n (for n > 0) and some nonroot user
>> outside to uid 0.
>
> There is.  You may set up to 5 (extents).  You just have to use a second
> extent for the non-contiguous bits.  My reader is lazy and you have to
> set all of the extents with a single write, so you may have missed the
> ability to set more than one extent.
>

If I'm wrong, I'll happily eat my words.  Both:

0 1 1
1 0 1

and

1 0 1
0 1 1

are rejected, unless I totally messed up.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8

2012-12-13 Thread Eric W. Biederman
Andy Lutomirski  writes:

> On Thu, Dec 13, 2012 at 2:01 PM, Eric W. Biederman
>  wrote:
>> Andy Lutomirski  writes:
>>
>>> On 12/11/2012 01:17 PM, Eric W. Biederman wrote:
>
>> But please also note the difference between capable and ns_capable.  Any
>> security check that is capable() still requires priviliges in the
>> initial user namespace.
>
> Huh?
>
> I'm talking about:
>
> clone with CLONE_NEWUSER
>  - child does unshare(CLONE_NEWPID)
>  - parent does setfd(child's pid namespace)
>
> Now the parent is running in the init userns with a different pid ns.
> Setuid binaries will work but will see the unexpected pid ns.  With
> mount namespaces, this would be Really Bad (tm).  With pid or ipc or
> net, it's less obviously dangerous, but I'm not convinced it's safe.

That isn't safe.  That is a sneaky bug in my tree that I overlooked. :(

> I sort of think that setns on a *non*-userns should require
> CAP_SYS_ADMIN in the current userns, at least if no_new_privs isn't
> set.

Yes.  CAP_SYS_ADMIN in your current user namespace should make
setns as safe as it currently is before my patches.

That is just a matter of adding a couple nsown_capable(CAP_SYS_ADMIN)
permission checks.

Right now I test for nsown_capable(CAP_SYS_CHROOT) for the mount
namespace, which is probably sufficient to prevent those kinds of
shenanigans but I am going to add a nsown_capable(CAP_SYS_ADMIN) for
good measure.

> A non-threaded process can have mm_users == 2 with CLONE_VM.  I'm not
> sure this is a problem, though.

No it isn't.  I said threads because they are the easy concept not
because that covers all possible corner cases.

>>> In any case, why are threads special here?
>>
>> You know I don't think I stopped to think about it.   The combination
>> of CLONE_NEWUSER and CLONE_THREAD have been denined since the first user
>> namespace support was merged in 2008.
>>
>> I do know that things can get really strange when you mix multiple
>> namespaces in a process.  tkill of your own threads will stop working.
>> Which access permissions should apply to files you mmap, file handles
>> you have open, the core dumper etc.
>>
>> We do allow setresuid per thread so we might be able to cope
>> with a process that mixes with user namespaces in different threads,
>> but I would want a close review of things before we allow that kind of
>> sharing.
>>
>
> Fair enough.
>
> I'd personally be more concerned about shared signal handlers than
> shared tgid, though.  The signal handler set has all kinds of weird
> things like session id.

CLONE_THREAD implies CLONE_VM and CLONE_SIGNAL,
and in practice mm_users > 1 protects against all of those cases.

So I was really thinking all of those cases.


>> (See the end.  A significant bug in cap_capable slipped in about
>>  3.5. cap_capable is only supposed to grant permissions to the owner
>>  of a user namespace if it is a child user namespace).
>
> [snipping lots of stuff]
>
> If the intended semantics of cap_capable are, indeed:
>
> If targ_ns is equals or is a descendent of cred->user_ns and the cap
> is effective, return true.  If targ_ns is owned by cred->euid and
> targ_ns is a descendent of cred->user_ns (but is not equal to
> cred->user_ns), then return true.  Else return false
>
> then I agree with you on almost everything you said.  I assumed that
> the actual semantics were intended.

Good.

>>> unshare has a bug.  This code:
>>
>> Interesting...
>>
>> Looking at it this is a very small misfeature.
>>
>> What is happening is that commit_creds is setting is making the task
>> undumpable because we changed the set of capabilities in struct cred.
>>
>> This in turn results in pid_revalidate setting the owner of
>> of /proc/self/uid_map to GLOBAL_ROOT_UID.
>>
>> From the outside the permissions on /proc/self/uid_map look like:
>> -rw-r--r-- 1 root root 0 2012-12-13 12:43 /proc/30530/uid_map
>>
>> Then since /proc/self/uid_map uses the default permission function
>> and the test program below is not run as root the read-write open
>> of uid_map fails.
>
> It's probably either worth fixing this or disabling unshare of userns.
>  This makes it hard to use.  IMO non-dumpable tasks should still be
> able to access the contents of /proc/self -- i.e. I'd call this a more
> general bug.
>
> But I'd also argue that unsharing userns shouldn't set non-dumpable --
> cap_permitted increased, but the new capabilities are still logically
> a subset of the old ones.

Agreed.  Setting dumpable is the bug.

I am going to sleep on it but the code in commit_creds should probably
read:

/* dumpability changes */
if (!uid_eq(old->euid, new->euid) ||
!gid_eq(old->egid, new->egid) ||
!uid_eq(old->fsuid, new->fsuid) ||
!gid_eq(old->fsgid, new->fsgid) ||
((old->user_ns == new->user_ns) &&
 !cap_issubset(new->cap_permitted, old->cap_permitted))) {
if (task->mm)
set_dumpable(task->mm, suid_dumpable);
   

Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8

2012-12-13 Thread Andy Lutomirski
On Thu, Dec 13, 2012 at 2:01 PM, Eric W. Biederman
 wrote:
> Andy Lutomirski  writes:
>
>> On 12/11/2012 01:17 PM, Eric W. Biederman wrote:
>>>
>>> Linus,
>>>
>>> Please pull the for-linus git tree from:
>>>
>>>
>>> git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git 
>>> for-linus
>>>
>>>HEAD: 98f842e675f96ffac96e6c50315790912b2812be proc: Usable inode 
>>> numbers for the namespace file descriptors.
>>>
>>>This tree is against v3.7-rc3
>>
>> You've just allowed unprivileged users to create new pid namespaces,
>> etc, by creating a new userns, then creating a new pid namespace inside
>> that userns, then setns-ing from outside the userns into the pid ns.  Is
>> this intentional?  (The mount ns is okay -- it checks for CAP_CHROOT on
>> setns.)
>
> Absolutely.  My commit messages talk about this.  I allow creating other
> namespaces once inside a user namespace deliberately.  There is no
> reason I know of to ban creation of pid and other namespaces once you
> are inside of a user namespace.
>

I must be looking at the wrong commit message.

> But please also note the difference between capable and ns_capable.  Any
> security check that is capable() still requires priviliges in the
> initial user namespace.

Huh?

I'm talking about:

clone with CLONE_NEWUSER
 - child does unshare(CLONE_NEWPID)
 - parent does setfd(child's pid namespace)

Now the parent is running in the init userns with a different pid ns.
Setuid binaries will work but will see the unexpected pid ns.  With
mount namespaces, this would be Really Bad (tm).  With pid or ipc or
net, it's less obviously dangerous, but I'm not convinced it's safe.

I sort of think that setns on a *non*-userns should require
CAP_SYS_ADMIN in the current userns, at least if no_new_privs isn't
set.

>
>> In user_namespace.c:
>>
>> /* Threaded many not enter a different user namespace */
>> if (atomic_read(>mm->mm_users) > 1)
>> return -EINVAL;
>>
>> The comment has a typo.  Also, you're checking the wrong condition:
>> that's whether the vm is shared, not whether the thread group has more
>> than one member.
>
> Yes the comment should say.
>
> Threaded processes may not enter a different user namespace.
>
> As for the condition.  mm_users will equal one for a non-threaded
> process.  And mm_users is the check we use in unshare to detect if
> a threaded process calls unshare so I think the check seems perfectly
> reasonable.  Especially since the vm must have more than one member if
> there is more than one member in the thread group.

A non-threaded process can have mm_users == 2 with CLONE_VM.  I'm not
sure this is a problem, though.

>
>> In any case, why are threads special here?
>
> You know I don't think I stopped to think about it.   The combination
> of CLONE_NEWUSER and CLONE_THREAD have been denined since the first user
> namespace support was merged in 2008.
>
> I do know that things can get really strange when you mix multiple
> namespaces in a process.  tkill of your own threads will stop working.
> Which access permissions should apply to files you mmap, file handles
> you have open, the core dumper etc.
>
> We do allow setresuid per thread so we might be able to cope
> with a process that mixes with user namespaces in different threads,
> but I would want a close review of things before we allow that kind of
> sharing.
>

Fair enough.

I'd personally be more concerned about shared signal handlers than
shared tgid, though.  The signal handler set has all kinds of weird
things like session id.

>> I think, although I haven't verified it, that these changes allow
>> CAP_SYS_ADMIN to bypass the bounding set (and, in particular, to gain
>> CAP_MODULE): unshare the user namespace and then setfd yourself back.  I
>> think that setns should only grant caps when changing to a descendent
>> namespace.
>
> (See the end.  A significant bug in cap_capable slipped in about
>  3.5. cap_capable is only supposed to grant permissions to the owner
>  of a user namespace if it is a child user namespace).

[snipping lots of stuff]

If the intended semantics of cap_capable are, indeed:

If targ_ns is equals or is a descendent of cred->user_ns and the cap
is effective, return true.  If targ_ns is owned by cred->euid and
targ_ns is a descendent of cred->user_ns (but is not equal to
cred->user_ns), then return true.  Else return false

then I agree with you on almost everything you said.  I assumed that
the actual semantics were intended.

>
>> unshare has a bug.  This code:
>
> Interesting...
>
> Looking at it this is a very small misfeature.
>
> What is happening is that commit_creds is setting is making the task
> undumpable because we changed the set of capabilities in struct cred.
>
> This in turn results in pid_revalidate setting the owner of
> of /proc/self/uid_map to GLOBAL_ROOT_UID.
>
> From the outside the permissions on /proc/self/uid_map look like:
> -rw-r--r-- 1 root root 0 2012-12-13 12:43 

Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8

2012-12-13 Thread Eric W. Biederman
Andy Lutomirski  writes:

> On 12/11/2012 01:17 PM, Eric W. Biederman wrote:
>> 
>> Linus,
>> 
>> Please pull the for-linus git tree from:
>> 
>>
>> git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git 
>> for-linus
>> 
>>HEAD: 98f842e675f96ffac96e6c50315790912b2812be proc: Usable inode numbers 
>> for the namespace file descriptors.
>> 
>>This tree is against v3.7-rc3
>
> You've just allowed unprivileged users to create new pid namespaces,
> etc, by creating a new userns, then creating a new pid namespace inside
> that userns, then setns-ing from outside the userns into the pid ns.  Is
> this intentional?  (The mount ns is okay -- it checks for CAP_CHROOT on
> setns.)

Absolutely.  My commit messages talk about this.  I allow creating other
namespaces once inside a user namespace deliberately.  There is no
reason I know of to ban creation of pid and other namespaces once you
are inside of a user namespace.

But please also note the difference between capable and ns_capable.  Any
security check that is capable() still requires priviliges in the
initial user namespace.

> In user_namespace.c:
>
> /* Threaded many not enter a different user namespace */
> if (atomic_read(>mm->mm_users) > 1)
> return -EINVAL;
>
> The comment has a typo.  Also, you're checking the wrong condition:
> that's whether the vm is shared, not whether the thread group has more
> than one member.

Yes the comment should say.

Threaded processes may not enter a different user namespace.

As for the condition.  mm_users will equal one for a non-threaded
process.  And mm_users is the check we use in unshare to detect if
a threaded process calls unshare so I think the check seems perfectly
reasonable.  Especially since the vm must have more than one member if
there is more than one member in the thread group.

> In any case, why are threads special here?

You know I don't think I stopped to think about it.   The combination
of CLONE_NEWUSER and CLONE_THREAD have been denined since the first user
namespace support was merged in 2008.

I do know that things can get really strange when you mix multiple
namespaces in a process.  tkill of your own threads will stop working.
Which access permissions should apply to files you mmap, file handles
you have open, the core dumper etc.

We do allow setresuid per thread so we might be able to cope
with a process that mixes with user namespaces in different threads,
but I would want a close review of things before we allow that kind of
sharing.

> I think, although I haven't verified it, that these changes allow
> CAP_SYS_ADMIN to bypass the bounding set (and, in particular, to gain
> CAP_MODULE): unshare the user namespace and then setfd yourself back.  I
> think that setns should only grant caps when changing to a descendent
> namespace.

(See the end.  A significant bug in cap_capable slipped in about
 3.5. cap_capable is only supposed to grant permissions to the owner
 of a user namespace if it is a child user namespace).

These changes do not allow CAP_SYS_ADMIN to bypass the bounding set.

The test:

if (!ns_capable(user_ns, CAP_SYS_ADMIN))
return -EPERM;

verifies that the user namespace we are entering is a nested user
namespace, because we can only have CAP_SYS_ADMIN in our current
namespace and in nested user namespaces.

> Also in userns_install:
>
> 796 /* Don't allow gaining capabilities by reentering
> 797  * the same user namespace.
> 798  */
> 799 if (user_ns == current_user_ns())
> 800 return -EINVAL;
>
> Why?

To keep processes that deliberately drop some capabilities from being
able to gain those capabilities back by reentering the current user
namespace.

Aka that test plus the ns_capable test prevent are the combination
that prevent a process gaining privileges in the current user namespace.

> You can trivially bypass this by creating a temporary user ns.
> (If you're the owner of your own ns, then you can create a subsidiary
> ns, map yourself into it, then setns back -- you'll still be the
> owner.)

Nope.   Once you have capabilities in a user namespace you do not have
any capabilities in the parent user namespace.  Entering a user
namespace is a one way operation.

> unshare has a bug.  This code:

Interesting...

Looking at it this is a very small misfeature.

What is happening is that commit_creds is setting is making the task
undumpable because we changed the set of capabilities in struct cred.

This in turn results in pid_revalidate setting the owner of
of /proc/self/uid_map to GLOBAL_ROOT_UID.

>From the outside the permissions on /proc/self/uid_map look like:
-rw-r--r-- 1 root root 0 2012-12-13 12:43 /proc/30530/uid_map

Then since /proc/self/uid_map uses the default permission function
and the test program below is not run as root the read-write open
of uid_map fails.

> Also, I'm entirely unconvinced that the owner of a userns should

Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8

2012-12-13 Thread Andy Lutomirski
On 12/11/2012 01:17 PM, Eric W. Biederman wrote:
> 
> Linus,
> 
> Please pull the for-linus git tree from:
> 
>git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git 
> for-linus
> 
>HEAD: 98f842e675f96ffac96e6c50315790912b2812be proc: Usable inode numbers 
> for the namespace file descriptors.
> 
>This tree is against v3.7-rc3

You've just allowed unprivileged users to create new pid namespaces,
etc, by creating a new userns, then creating a new pid namespace inside
that userns, then setns-ing from outside the userns into the pid ns.  Is
this intentional?  (The mount ns is okay -- it checks for CAP_CHROOT on
setns.)


In user_namespace.c:

/* Threaded many not enter a different user namespace */
if (atomic_read(>mm->mm_users) > 1)
return -EINVAL;

The comment has a typo.  Also, you're checking the wrong condition:
that's whether the vm is shared, not whether the thread group has more
than one member.

In any case, why are threads special here?



I think, although I haven't verified it, that these changes allow
CAP_SYS_ADMIN to bypass the bounding set (and, in particular, to gain
CAP_MODULE): unshare the user namespace and then setfd yourself back.  I
think that setns should only grant caps when changing to a descendent
namespace.

Also in userns_install:

796 /* Don't allow gaining capabilities by reentering
797  * the same user namespace.
798  */
799 if (user_ns == current_user_ns())
800 return -EINVAL;

Why?  You can trivially bypass this by creating a temporary user ns.
(If you're the owner of your own ns, then you can create a subsidiary
ns, map yourself into it, then setns back -- you'll still be the owner.)


unshare has a bug.  This code:

#define _GNU_SOURCE

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

static void fail(const char *msg)
{
  perror(msg);
  exit(1);
}

int main()
{
  if (unshare(CLONE_NEWUSER) != 0)
fail("CLONE_NEWUSER");

  if (open("/proc/self/uid_map", O_RDWR) == -1)
perror("/proc/self/uid_map O_RDWR");

  int fd = open("/proc/self/uid_map", O_RDONLY);
  if (fd == -1) {
perror("/proc/self/uid_map O_RDONLY");
  } else {
char buf[4096];
ssize_t len = read(fd, buf, sizeof(buf));
if (len > 0)
  write(1, buf, len);
else
  printf("read uid_map returned %d\n", (int)len);
  }
}

produces this output:

/proc/self/uid_map O_RDWR: Permission denied
read uid_map returned 0

With clone instead of unshare, it works.  I'm not quite sure what's
going on.



Also, I'm entirely unconvinced that the owner of a userns should
automatically have all caps (in the cap_capable sense) on a userns if
the task is inside that ns.  What's wrong with just using normal caps?
(Of course, the fact that caps don't usefully inherit is an issue --
there's a lng thread going on right now about that.)  But doing this
enshrines root-has-caps even farther into ABI.  At least please make
this depend on !SECURE_NOROOT.


> 
> While small this set of changes is very significant with respect to
> containers in general and user namespaces in particular.  The user space
> interface is now complete.
> 
> This set of changes adds support for unprivileged users to create user
> namespaces and as a user namespace root to create other namespaces.  The
> tyrrany of supporting suid root preventing unprivileged users from using
> cool new kernel features is broken.
> 

no_new_privs already (kind of) did that :)

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8

2012-12-13 Thread Andy Lutomirski
On 12/11/2012 01:17 PM, Eric W. Biederman wrote:
 
 Linus,
 
 Please pull the for-linus git tree from:
 
git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git 
 for-linus
 
HEAD: 98f842e675f96ffac96e6c50315790912b2812be proc: Usable inode numbers 
 for the namespace file descriptors.
 
This tree is against v3.7-rc3

You've just allowed unprivileged users to create new pid namespaces,
etc, by creating a new userns, then creating a new pid namespace inside
that userns, then setns-ing from outside the userns into the pid ns.  Is
this intentional?  (The mount ns is okay -- it checks for CAP_CHROOT on
setns.)


In user_namespace.c:

/* Threaded many not enter a different user namespace */
if (atomic_read(current-mm-mm_users)  1)
return -EINVAL;

The comment has a typo.  Also, you're checking the wrong condition:
that's whether the vm is shared, not whether the thread group has more
than one member.

In any case, why are threads special here?



I think, although I haven't verified it, that these changes allow
CAP_SYS_ADMIN to bypass the bounding set (and, in particular, to gain
CAP_MODULE): unshare the user namespace and then setfd yourself back.  I
think that setns should only grant caps when changing to a descendent
namespace.

Also in userns_install:

796 /* Don't allow gaining capabilities by reentering
797  * the same user namespace.
798  */
799 if (user_ns == current_user_ns())
800 return -EINVAL;

Why?  You can trivially bypass this by creating a temporary user ns.
(If you're the owner of your own ns, then you can create a subsidiary
ns, map yourself into it, then setns back -- you'll still be the owner.)


unshare has a bug.  This code:

#define _GNU_SOURCE

#include stdbool.h
#include stdio.h
#include stdlib.h
#include string.h
#include errno.h
#include signal.h
#include unistd.h
#include sys/types.h
#include sys/wait.h
#include sys/prctl.h
#include sys/mman.h
#include sched.h
#include fcntl.h

static void fail(const char *msg)
{
  perror(msg);
  exit(1);
}

int main()
{
  if (unshare(CLONE_NEWUSER) != 0)
fail(CLONE_NEWUSER);

  if (open(/proc/self/uid_map, O_RDWR) == -1)
perror(/proc/self/uid_map O_RDWR);

  int fd = open(/proc/self/uid_map, O_RDONLY);
  if (fd == -1) {
perror(/proc/self/uid_map O_RDONLY);
  } else {
char buf[4096];
ssize_t len = read(fd, buf, sizeof(buf));
if (len  0)
  write(1, buf, len);
else
  printf(read uid_map returned %d\n, (int)len);
  }
}

produces this output:

/proc/self/uid_map O_RDWR: Permission denied
read uid_map returned 0

With clone instead of unshare, it works.  I'm not quite sure what's
going on.



Also, I'm entirely unconvinced that the owner of a userns should
automatically have all caps (in the cap_capable sense) on a userns if
the task is inside that ns.  What's wrong with just using normal caps?
(Of course, the fact that caps don't usefully inherit is an issue --
there's a lng thread going on right now about that.)  But doing this
enshrines root-has-caps even farther into ABI.  At least please make
this depend on !SECURE_NOROOT.


 
 While small this set of changes is very significant with respect to
 containers in general and user namespaces in particular.  The user space
 interface is now complete.
 
 This set of changes adds support for unprivileged users to create user
 namespaces and as a user namespace root to create other namespaces.  The
 tyrrany of supporting suid root preventing unprivileged users from using
 cool new kernel features is broken.
 

no_new_privs already (kind of) did that :)

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8

2012-12-13 Thread Eric W. Biederman
Andy Lutomirski l...@amacapital.net writes:

 On 12/11/2012 01:17 PM, Eric W. Biederman wrote:
 
 Linus,
 
 Please pull the for-linus git tree from:
 

 git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git 
 for-linus
 
HEAD: 98f842e675f96ffac96e6c50315790912b2812be proc: Usable inode numbers 
 for the namespace file descriptors.
 
This tree is against v3.7-rc3

 You've just allowed unprivileged users to create new pid namespaces,
 etc, by creating a new userns, then creating a new pid namespace inside
 that userns, then setns-ing from outside the userns into the pid ns.  Is
 this intentional?  (The mount ns is okay -- it checks for CAP_CHROOT on
 setns.)

Absolutely.  My commit messages talk about this.  I allow creating other
namespaces once inside a user namespace deliberately.  There is no
reason I know of to ban creation of pid and other namespaces once you
are inside of a user namespace.

But please also note the difference between capable and ns_capable.  Any
security check that is capable() still requires priviliges in the
initial user namespace.

 In user_namespace.c:

 /* Threaded many not enter a different user namespace */
 if (atomic_read(current-mm-mm_users)  1)
 return -EINVAL;

 The comment has a typo.  Also, you're checking the wrong condition:
 that's whether the vm is shared, not whether the thread group has more
 than one member.

Yes the comment should say.

Threaded processes may not enter a different user namespace.

As for the condition.  mm_users will equal one for a non-threaded
process.  And mm_users is the check we use in unshare to detect if
a threaded process calls unshare so I think the check seems perfectly
reasonable.  Especially since the vm must have more than one member if
there is more than one member in the thread group.

 In any case, why are threads special here?

You know I don't think I stopped to think about it.   The combination
of CLONE_NEWUSER and CLONE_THREAD have been denined since the first user
namespace support was merged in 2008.

I do know that things can get really strange when you mix multiple
namespaces in a process.  tkill of your own threads will stop working.
Which access permissions should apply to files you mmap, file handles
you have open, the core dumper etc.

We do allow setresuid per thread so we might be able to cope
with a process that mixes with user namespaces in different threads,
but I would want a close review of things before we allow that kind of
sharing.

 I think, although I haven't verified it, that these changes allow
 CAP_SYS_ADMIN to bypass the bounding set (and, in particular, to gain
 CAP_MODULE): unshare the user namespace and then setfd yourself back.  I
 think that setns should only grant caps when changing to a descendent
 namespace.

(See the end.  A significant bug in cap_capable slipped in about
 3.5. cap_capable is only supposed to grant permissions to the owner
 of a user namespace if it is a child user namespace).

These changes do not allow CAP_SYS_ADMIN to bypass the bounding set.

The test:

if (!ns_capable(user_ns, CAP_SYS_ADMIN))
return -EPERM;

verifies that the user namespace we are entering is a nested user
namespace, because we can only have CAP_SYS_ADMIN in our current
namespace and in nested user namespaces.

 Also in userns_install:

 796 /* Don't allow gaining capabilities by reentering
 797  * the same user namespace.
 798  */
 799 if (user_ns == current_user_ns())
 800 return -EINVAL;

 Why?

To keep processes that deliberately drop some capabilities from being
able to gain those capabilities back by reentering the current user
namespace.

Aka that test plus the ns_capable test prevent are the combination
that prevent a process gaining privileges in the current user namespace.

 You can trivially bypass this by creating a temporary user ns.
 (If you're the owner of your own ns, then you can create a subsidiary
 ns, map yourself into it, then setns back -- you'll still be the
 owner.)

Nope.   Once you have capabilities in a user namespace you do not have
any capabilities in the parent user namespace.  Entering a user
namespace is a one way operation.

 unshare has a bug.  This code:

Interesting...

Looking at it this is a very small misfeature.

What is happening is that commit_creds is setting is making the task
undumpable because we changed the set of capabilities in struct cred.

This in turn results in pid_revalidate setting the owner of
of /proc/self/uid_map to GLOBAL_ROOT_UID.

From the outside the permissions on /proc/self/uid_map look like:
-rw-r--r-- 1 root root 0 2012-12-13 12:43 /proc/30530/uid_map

Then since /proc/self/uid_map uses the default permission function
and the test program below is not run as root the read-write open
of uid_map fails.

 Also, I'm entirely unconvinced that the owner of a userns should
 automatically have all caps (in the 

Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8

2012-12-13 Thread Andy Lutomirski
On Thu, Dec 13, 2012 at 2:01 PM, Eric W. Biederman
ebied...@xmission.com wrote:
 Andy Lutomirski l...@amacapital.net writes:

 On 12/11/2012 01:17 PM, Eric W. Biederman wrote:

 Linus,

 Please pull the for-linus git tree from:


 git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git 
 for-linus

HEAD: 98f842e675f96ffac96e6c50315790912b2812be proc: Usable inode 
 numbers for the namespace file descriptors.

This tree is against v3.7-rc3

 You've just allowed unprivileged users to create new pid namespaces,
 etc, by creating a new userns, then creating a new pid namespace inside
 that userns, then setns-ing from outside the userns into the pid ns.  Is
 this intentional?  (The mount ns is okay -- it checks for CAP_CHROOT on
 setns.)

 Absolutely.  My commit messages talk about this.  I allow creating other
 namespaces once inside a user namespace deliberately.  There is no
 reason I know of to ban creation of pid and other namespaces once you
 are inside of a user namespace.


I must be looking at the wrong commit message.

 But please also note the difference between capable and ns_capable.  Any
 security check that is capable() still requires priviliges in the
 initial user namespace.

Huh?

I'm talking about:

clone with CLONE_NEWUSER
 - child does unshare(CLONE_NEWPID)
 - parent does setfd(child's pid namespace)

Now the parent is running in the init userns with a different pid ns.
Setuid binaries will work but will see the unexpected pid ns.  With
mount namespaces, this would be Really Bad (tm).  With pid or ipc or
net, it's less obviously dangerous, but I'm not convinced it's safe.

I sort of think that setns on a *non*-userns should require
CAP_SYS_ADMIN in the current userns, at least if no_new_privs isn't
set.


 In user_namespace.c:

 /* Threaded many not enter a different user namespace */
 if (atomic_read(current-mm-mm_users)  1)
 return -EINVAL;

 The comment has a typo.  Also, you're checking the wrong condition:
 that's whether the vm is shared, not whether the thread group has more
 than one member.

 Yes the comment should say.

 Threaded processes may not enter a different user namespace.

 As for the condition.  mm_users will equal one for a non-threaded
 process.  And mm_users is the check we use in unshare to detect if
 a threaded process calls unshare so I think the check seems perfectly
 reasonable.  Especially since the vm must have more than one member if
 there is more than one member in the thread group.

A non-threaded process can have mm_users == 2 with CLONE_VM.  I'm not
sure this is a problem, though.


 In any case, why are threads special here?

 You know I don't think I stopped to think about it.   The combination
 of CLONE_NEWUSER and CLONE_THREAD have been denined since the first user
 namespace support was merged in 2008.

 I do know that things can get really strange when you mix multiple
 namespaces in a process.  tkill of your own threads will stop working.
 Which access permissions should apply to files you mmap, file handles
 you have open, the core dumper etc.

 We do allow setresuid per thread so we might be able to cope
 with a process that mixes with user namespaces in different threads,
 but I would want a close review of things before we allow that kind of
 sharing.


Fair enough.

I'd personally be more concerned about shared signal handlers than
shared tgid, though.  The signal handler set has all kinds of weird
things like session id.

 I think, although I haven't verified it, that these changes allow
 CAP_SYS_ADMIN to bypass the bounding set (and, in particular, to gain
 CAP_MODULE): unshare the user namespace and then setfd yourself back.  I
 think that setns should only grant caps when changing to a descendent
 namespace.

 (See the end.  A significant bug in cap_capable slipped in about
  3.5. cap_capable is only supposed to grant permissions to the owner
  of a user namespace if it is a child user namespace).

[snipping lots of stuff]

If the intended semantics of cap_capable are, indeed:

If targ_ns is equals or is a descendent of cred-user_ns and the cap
is effective, return true.  If targ_ns is owned by cred-euid and
targ_ns is a descendent of cred-user_ns (but is not equal to
cred-user_ns), then return true.  Else return false

then I agree with you on almost everything you said.  I assumed that
the actual semantics were intended.


 unshare has a bug.  This code:

 Interesting...

 Looking at it this is a very small misfeature.

 What is happening is that commit_creds is setting is making the task
 undumpable because we changed the set of capabilities in struct cred.

 This in turn results in pid_revalidate setting the owner of
 of /proc/self/uid_map to GLOBAL_ROOT_UID.

 From the outside the permissions on /proc/self/uid_map look like:
 -rw-r--r-- 1 root root 0 2012-12-13 12:43 /proc/30530/uid_map

 Then since /proc/self/uid_map uses the default permission function
 and the test 

Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8

2012-12-13 Thread Eric W. Biederman
Andy Lutomirski l...@amacapital.net writes:

 On Thu, Dec 13, 2012 at 2:01 PM, Eric W. Biederman
 ebied...@xmission.com wrote:
 Andy Lutomirski l...@amacapital.net writes:

 On 12/11/2012 01:17 PM, Eric W. Biederman wrote:

 But please also note the difference between capable and ns_capable.  Any
 security check that is capable() still requires priviliges in the
 initial user namespace.

 Huh?

 I'm talking about:

 clone with CLONE_NEWUSER
  - child does unshare(CLONE_NEWPID)
  - parent does setfd(child's pid namespace)

 Now the parent is running in the init userns with a different pid ns.
 Setuid binaries will work but will see the unexpected pid ns.  With
 mount namespaces, this would be Really Bad (tm).  With pid or ipc or
 net, it's less obviously dangerous, but I'm not convinced it's safe.

That isn't safe.  That is a sneaky bug in my tree that I overlooked. :(

 I sort of think that setns on a *non*-userns should require
 CAP_SYS_ADMIN in the current userns, at least if no_new_privs isn't
 set.

Yes.  CAP_SYS_ADMIN in your current user namespace should make
setns as safe as it currently is before my patches.

That is just a matter of adding a couple nsown_capable(CAP_SYS_ADMIN)
permission checks.

Right now I test for nsown_capable(CAP_SYS_CHROOT) for the mount
namespace, which is probably sufficient to prevent those kinds of
shenanigans but I am going to add a nsown_capable(CAP_SYS_ADMIN) for
good measure.

 A non-threaded process can have mm_users == 2 with CLONE_VM.  I'm not
 sure this is a problem, though.

No it isn't.  I said threads because they are the easy concept not
because that covers all possible corner cases.

 In any case, why are threads special here?

 You know I don't think I stopped to think about it.   The combination
 of CLONE_NEWUSER and CLONE_THREAD have been denined since the first user
 namespace support was merged in 2008.

 I do know that things can get really strange when you mix multiple
 namespaces in a process.  tkill of your own threads will stop working.
 Which access permissions should apply to files you mmap, file handles
 you have open, the core dumper etc.

 We do allow setresuid per thread so we might be able to cope
 with a process that mixes with user namespaces in different threads,
 but I would want a close review of things before we allow that kind of
 sharing.


 Fair enough.

 I'd personally be more concerned about shared signal handlers than
 shared tgid, though.  The signal handler set has all kinds of weird
 things like session id.

CLONE_THREAD implies CLONE_VM and CLONE_SIGNAL,
and in practice mm_users  1 protects against all of those cases.

So I was really thinking all of those cases.


 (See the end.  A significant bug in cap_capable slipped in about
  3.5. cap_capable is only supposed to grant permissions to the owner
  of a user namespace if it is a child user namespace).

 [snipping lots of stuff]

 If the intended semantics of cap_capable are, indeed:

 If targ_ns is equals or is a descendent of cred-user_ns and the cap
 is effective, return true.  If targ_ns is owned by cred-euid and
 targ_ns is a descendent of cred-user_ns (but is not equal to
 cred-user_ns), then return true.  Else return false

 then I agree with you on almost everything you said.  I assumed that
 the actual semantics were intended.

Good.

 unshare has a bug.  This code:

 Interesting...

 Looking at it this is a very small misfeature.

 What is happening is that commit_creds is setting is making the task
 undumpable because we changed the set of capabilities in struct cred.

 This in turn results in pid_revalidate setting the owner of
 of /proc/self/uid_map to GLOBAL_ROOT_UID.

 From the outside the permissions on /proc/self/uid_map look like:
 -rw-r--r-- 1 root root 0 2012-12-13 12:43 /proc/30530/uid_map

 Then since /proc/self/uid_map uses the default permission function
 and the test program below is not run as root the read-write open
 of uid_map fails.

 It's probably either worth fixing this or disabling unshare of userns.
  This makes it hard to use.  IMO non-dumpable tasks should still be
 able to access the contents of /proc/self -- i.e. I'd call this a more
 general bug.

 But I'd also argue that unsharing userns shouldn't set non-dumpable --
 cap_permitted increased, but the new capabilities are still logically
 a subset of the old ones.

Agreed.  Setting dumpable is the bug.

I am going to sleep on it but the code in commit_creds should probably
read:

/* dumpability changes */
if (!uid_eq(old-euid, new-euid) ||
!gid_eq(old-egid, new-egid) ||
!uid_eq(old-fsuid, new-fsuid) ||
!gid_eq(old-fsgid, new-fsgid) ||
((old-user_ns == new-user_ns) 
 !cap_issubset(new-cap_permitted, old-cap_permitted))) {
if (task-mm)
set_dumpable(task-mm, suid_dumpable);
task-pdeath_signal = 0;
smp_wmb();
}

Which is correct 

Re: [GIT PULL] user namespace and namespace infrastructure changes for 3.8

2012-12-13 Thread Andy Lutomirski
On Thu, Dec 13, 2012 at 8:11 PM, Eric W. Biederman
ebied...@xmission.com wrote:
 Andy Lutomirski l...@amacapital.net writes:

 One more issue: the requirement that both upper and lower uids (etc.)
 in the maps are in order is rather limiting.  I have no objection if
 you only require upper ids to be monotonic, but currently there's no
 way to may root outside to uid n (for n  0) and some nonroot user
 outside to uid 0.

 There is.  You may set up to 5 (extents).  You just have to use a second
 extent for the non-contiguous bits.  My reader is lazy and you have to
 set all of the extents with a single write, so you may have missed the
 ability to set more than one extent.


If I'm wrong, I'll happily eat my words.  Both:

0 1 1
1 0 1

and

1 0 1
0 1 1

are rejected, unless I totally messed up.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[GIT PULL] user namespace and namespace infrastructure changes for 3.8

2012-12-11 Thread Eric W. Biederman

Linus,

Please pull the for-linus git tree from:

   git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git 
for-linus

   HEAD: 98f842e675f96ffac96e6c50315790912b2812be proc: Usable inode numbers 
for the namespace file descriptors.

   This tree is against v3.7-rc3

While small this set of changes is very significant with respect to
containers in general and user namespaces in particular.  The user space
interface is now complete.

This set of changes adds support for unprivileged users to create user
namespaces and as a user namespace root to create other namespaces.  The
tyrrany of supporting suid root preventing unprivileged users from using
cool new kernel features is broken.

This set of changes completes the work on setns, adding support for
the pid, user, mount namespaces.

This set of changes includes a bunch of basic pid namespace
cleanups/simplifications.  Of particular significance is the rework of
the pid namespace cleanup so it no longer requires sending out tendrils
into all kinds of unexpected cleanup paths for operation.  At least one
case of broken error handling is fixed by this cleanup.

The files under /proc//ns/ have been converted from regular files
to magic symlinks which prevents incorrect caching by the VFS, ensuring
the files always refer to the namespace the process is currently using
and ensuring that the ptrace_mayaccess permission checks are always
applied.

The files under /proc//ns/ have been given stable inode numbers so
it is now possible to see if different processes share the same
namespaces.

Through the David Miller's net tree are changes to relax many of the
permission checks in the networking stack to allowing the user namespace
root to usefully use the networking stack.  Similar changes for the
mount namespace and the pid namespace are coming through my tree.

Two small nework namespace changes were double committed here and in
David Millers -net tree so that I could complete the work on the
/proc//ns/ files in this tree.

The user namespace work that remains is converting, 9p, afs, ceph, cifs,
coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so they are safe to enable
when user namespaces are enabled, and implementing unprivileged mounts
of more than just /proc and /sys.

I had hoped to get through more of those changes this cycle but
I turned into a cold magnet this season and the UAPI changes caused
a lot of churn late into the 3.7 -rc cycle that made a stable starting
place hard to work from hard to find.

Eric W. Biederman (37):
  userns: Support autofs4 interacing with multiple user namespaces
  userns: Support fuse interacting with multiple user namespaces
  netns: Deduplicate and fix copy_net_ns when !CONFIG_NET_NS
  userns: make each net (net_ns) belong to a user_ns
  userns: On mips modify check_same_owner to use uid_eq
  procfs: Use the proc generic infrastructure for proc/self.
  procfs: Don't cache a pid in the root inode.
  pidns: Capture the user namespace and filter ns_last_pid
  pidns: Use task_active_pid_ns where appropriate
  pidns: Make the pidns proc mount/umount logic obvious.
  pidns: Don't allow new processes in a dead pid namespace.
  pidns: Wait in zap_pid_ns_processes until pid_ns->nr_hashed == 1
  pidns: Deny strange cases when creating pid namespaces.
  pidns: Add setns support
  pidns: Consolidate initialzation of special init task state
  pidns: Support unsharing the pid namespace.
  vfs: Allow chroot if you have CAP_SYS_CHROOT in your user namespace
  vfs: Add setns support for the mount namespace
  vfs: Add a user namespace reference from struct mnt_namespace
  vfs: Only support slave subtrees across different user namespaces
  vfs: Allow unprivileged manipulation of the mount namespace.
  userns: Ignore suid and sgid on binaries if the uid or gid can not be 
mapped
  userns: Allow unprivileged users to create user namespaces.
  userns: Allow chown and setgid preservation
  userns: Allow setting a userns mapping to your current uid.
  userns: Allow unprivileged users to create new namespaces
  userns: Allow unprivileged use of setns.
  userns: Make create_new_namespaces take a user_ns parameter
  userns: Kill task_user_ns
  userns: Implent proc namespace operations
  userns: Implement unshare of the user namespace
  procfs: Print task uids and gids in the userns that opened the proc file
  userns: For /proc/self/{uid,gid}_map derive the lower userns from the 
struct file
  userns: Allow unprivilged mounts of proc and sysfs
  proc: Generalize proc inode allocation
  proc: Fix the namespace inode permission checks.
  proc: Usable inode numbers for the namespace file descriptors.

Zhao Hongjiang (1):
  userns: fix return value on mntns_install() failure

 arch/mips/kernel/mips-mt-fpaff.c  |4 +-
 arch/powerpc/platforms/cell/spufs/sched.c |2 +-
 

[GIT PULL] user namespace and namespace infrastructure changes for 3.8

2012-12-11 Thread Eric W. Biederman

Linus,

Please pull the for-linus git tree from:

   git://git.kernel.org:/pub/scm/linux/kernel/git/ebiederm/user-namespace.git 
for-linus

   HEAD: 98f842e675f96ffac96e6c50315790912b2812be proc: Usable inode numbers 
for the namespace file descriptors.

   This tree is against v3.7-rc3

While small this set of changes is very significant with respect to
containers in general and user namespaces in particular.  The user space
interface is now complete.

This set of changes adds support for unprivileged users to create user
namespaces and as a user namespace root to create other namespaces.  The
tyrrany of supporting suid root preventing unprivileged users from using
cool new kernel features is broken.

This set of changes completes the work on setns, adding support for
the pid, user, mount namespaces.

This set of changes includes a bunch of basic pid namespace
cleanups/simplifications.  Of particular significance is the rework of
the pid namespace cleanup so it no longer requires sending out tendrils
into all kinds of unexpected cleanup paths for operation.  At least one
case of broken error handling is fixed by this cleanup.

The files under /proc/pid/ns/ have been converted from regular files
to magic symlinks which prevents incorrect caching by the VFS, ensuring
the files always refer to the namespace the process is currently using
and ensuring that the ptrace_mayaccess permission checks are always
applied.

The files under /proc/pid/ns/ have been given stable inode numbers so
it is now possible to see if different processes share the same
namespaces.

Through the David Miller's net tree are changes to relax many of the
permission checks in the networking stack to allowing the user namespace
root to usefully use the networking stack.  Similar changes for the
mount namespace and the pid namespace are coming through my tree.

Two small nework namespace changes were double committed here and in
David Millers -net tree so that I could complete the work on the
/proc/pid/ns/ files in this tree.

The user namespace work that remains is converting, 9p, afs, ceph, cifs,
coda, gfs2, ncpfs, nfs, nfsd, ocfs2, and xfs so they are safe to enable
when user namespaces are enabled, and implementing unprivileged mounts
of more than just /proc and /sys.

I had hoped to get through more of those changes this cycle but
I turned into a cold magnet this season and the UAPI changes caused
a lot of churn late into the 3.7 -rc cycle that made a stable starting
place hard to work from hard to find.

Eric W. Biederman (37):
  userns: Support autofs4 interacing with multiple user namespaces
  userns: Support fuse interacting with multiple user namespaces
  netns: Deduplicate and fix copy_net_ns when !CONFIG_NET_NS
  userns: make each net (net_ns) belong to a user_ns
  userns: On mips modify check_same_owner to use uid_eq
  procfs: Use the proc generic infrastructure for proc/self.
  procfs: Don't cache a pid in the root inode.
  pidns: Capture the user namespace and filter ns_last_pid
  pidns: Use task_active_pid_ns where appropriate
  pidns: Make the pidns proc mount/umount logic obvious.
  pidns: Don't allow new processes in a dead pid namespace.
  pidns: Wait in zap_pid_ns_processes until pid_ns-nr_hashed == 1
  pidns: Deny strange cases when creating pid namespaces.
  pidns: Add setns support
  pidns: Consolidate initialzation of special init task state
  pidns: Support unsharing the pid namespace.
  vfs: Allow chroot if you have CAP_SYS_CHROOT in your user namespace
  vfs: Add setns support for the mount namespace
  vfs: Add a user namespace reference from struct mnt_namespace
  vfs: Only support slave subtrees across different user namespaces
  vfs: Allow unprivileged manipulation of the mount namespace.
  userns: Ignore suid and sgid on binaries if the uid or gid can not be 
mapped
  userns: Allow unprivileged users to create user namespaces.
  userns: Allow chown and setgid preservation
  userns: Allow setting a userns mapping to your current uid.
  userns: Allow unprivileged users to create new namespaces
  userns: Allow unprivileged use of setns.
  userns: Make create_new_namespaces take a user_ns parameter
  userns: Kill task_user_ns
  userns: Implent proc namespace operations
  userns: Implement unshare of the user namespace
  procfs: Print task uids and gids in the userns that opened the proc file
  userns: For /proc/self/{uid,gid}_map derive the lower userns from the 
struct file
  userns: Allow unprivilged mounts of proc and sysfs
  proc: Generalize proc inode allocation
  proc: Fix the namespace inode permission checks.
  proc: Usable inode numbers for the namespace file descriptors.

Zhao Hongjiang (1):
  userns: fix return value on mntns_install() failure

 arch/mips/kernel/mips-mt-fpaff.c  |4 +-
 arch/powerpc/platforms/cell/spufs/sched.c |2 +-