Re: [RFC PATCH] capabilities: require CAP_SETFCAP to map uid 0 (v3)

2021-04-16 Thread Serge E. Hallyn
On Fri, Apr 16, 2021 at 04:34:53PM -0500, Serge E. Hallyn wrote:
> On Fri, Apr 16, 2021 at 05:05:01PM +0200, Christian Brauner wrote:
> > On Thu, Apr 15, 2021 at 11:58:51PM -0500, Serge Hallyn wrote:
> > > (Eric - this patch (v3) is a cleaned up version of the previous approach.
> > > v4 is at 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4
> > > and is the approach you suggested.  I can send it also as a separate patch
> > > if you like)
> > > 
> > > A process running as uid 0 but without cap_setfcap currently can simply
> > > unshare a new user namespace with uid 0 mapped to 0.  While this task
> > > will not have new capabilities against the parent namespace, there is
> > > a loophole due to the way namespaced file capabilities work.  File
> > > capabilities valid in userns 1 are distinguised from file capabilities
> > > valid in userns 2 by the kuid which underlies uid 0.  Therefore
> > > the restricted root process can unshare a new self-mapping namespace,
> > > add a namespaced file capability onto a file, then use that file
> > > capability in the parent namespace.
> > > 
> > > To prevent that, do not allow mapping uid 0 if the process which
> > > opened the uid_map file does not have CAP_SETFCAP, which is the capability
> > > for setting file capabilities.
> > > 
> > > A further wrinkle:  a task can unshare its user namespace, then
> > > open its uid_map file itself, and map (only) its own uid.  In this
> > > case we do not have the credential from before unshare,  which was
> > > potentially more restricted.  So, when creating a user namespace, we
> > > record whether the creator had CAP_SETFCAP.  Then we can use that
> > > during map_write().
> > > 
> > > With this patch:
> > > 
> > > 1. unprivileged user can still unshare -Ur
> > > 
> > > ubuntu@caps:~$ unshare -Ur
> > > root@caps:~# logout
> > > 
> > > 2. root user can still unshare -Ur
> > > 
> > > ubuntu@caps:~$ sudo bash
> > > root@caps:/home/ubuntu# unshare -Ur
> > > root@caps:/home/ubuntu# logout
> > > 
> > > 3. root user without CAP_SETFCAP cannot unshare -Ur:
> > > 
> > > root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
> > > root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
> > > unable to set CAP_SETFCAP effective capability: Operation not permitted
> > > root@caps:/home/ubuntu# unshare -Ur
> > > unshare: write failed /proc/self/uid_map: Operation not permitted
> > > 
> > > Signed-off-by: Serge Hallyn 
> > > 
> > > Changelog:
> > >* fix logic in the case of writing to another task's uid_map
> > >* rename 'ns' to 'map_ns', and make a file_ns local variable
> > >* use /* comments */
> > >* update the CAP_SETFCAP comment in capability.h
> > >* rename parent_unpriv to parent_can_setfcap (and reverse the
> > >  logic)
> > >* remove printks
> > >* clarify (i hope) the code comments
> > >* update capability.h comment
> > >* renamed parent_can_setfcap to parent_could_setfcap
> > >* made the check its own disallowed_0_mapping() fn
> > >* moved the check into new_idmap_permitted
> > > ---
> > 
> > Thank you for working on this fix!
> > 
> > I do prefer your approach of doing the check at user namespace creation
> > time instead of moving it into the setxattr() codepath.
> > 
> > Let me reiterate that the ability to write through fscaps is a valid
> > usecase and this should continue to work but that for locked down user
> > namespace as Andrew wants to use them your patch provides a clean
> > solution.
> > We've are using identity mappings in quite a few scenarios partially
> > when performing tests but also to write through fscaps.
> > We also had reports of users that use identity mappings. They create
> > their rootfs by running image extraction in an identity mapped userns
> > where fscaps are written through.
> > Podman has use-cases for this feature as well and has been affected by
> > the regression of the first fix.
> 
> Thanks for reviewing.
> 
> I'm not sure what your point above is, so just to make sure - the
> alternative implementation also does allow fscaps for cases where
> root uid is remapped, only disallowing it if it would violate the
> ancestor's lack of cap_setfcap.
> 
> 
> > >  include/linux/user_namespace.h  |  3 ++
> > >  include/uapi/linux/capability.h |  3 +-
> > >  kernel/user_namespace.c | 61 +++--
> > >  3 files changed, 63 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/include/linux/user_namespace.h 
> > > b/include/linux/user_namespace.h
> > > index 64cf8ebdc4ec..f6c5f784be5a 100644
> > > --- a/include/linux/user_namespace.h
> > > +++ b/include/linux/user_namespace.h
> > > @@ -63,6 +63,9 @@ struct user_namespace {
> > >   kgid_t  group;
> > >   struct ns_commonns;
> > >   unsigned long   flags;
> > > + /* parent_could_setfcap: true if the creator if this ns had CAP_SETFCAP
> > > +  * in its effective 

Re: [RFC PATCH] capabilities: require CAP_SETFCAP to map uid 0 (v3)

2021-04-16 Thread Serge E. Hallyn
On Fri, Apr 16, 2021 at 05:05:01PM +0200, Christian Brauner wrote:
> On Thu, Apr 15, 2021 at 11:58:51PM -0500, Serge Hallyn wrote:
> > (Eric - this patch (v3) is a cleaned up version of the previous approach.
> > v4 is at 
> > https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4
> > and is the approach you suggested.  I can send it also as a separate patch
> > if you like)
> > 
> > A process running as uid 0 but without cap_setfcap currently can simply
> > unshare a new user namespace with uid 0 mapped to 0.  While this task
> > will not have new capabilities against the parent namespace, there is
> > a loophole due to the way namespaced file capabilities work.  File
> > capabilities valid in userns 1 are distinguised from file capabilities
> > valid in userns 2 by the kuid which underlies uid 0.  Therefore
> > the restricted root process can unshare a new self-mapping namespace,
> > add a namespaced file capability onto a file, then use that file
> > capability in the parent namespace.
> > 
> > To prevent that, do not allow mapping uid 0 if the process which
> > opened the uid_map file does not have CAP_SETFCAP, which is the capability
> > for setting file capabilities.
> > 
> > A further wrinkle:  a task can unshare its user namespace, then
> > open its uid_map file itself, and map (only) its own uid.  In this
> > case we do not have the credential from before unshare,  which was
> > potentially more restricted.  So, when creating a user namespace, we
> > record whether the creator had CAP_SETFCAP.  Then we can use that
> > during map_write().
> > 
> > With this patch:
> > 
> > 1. unprivileged user can still unshare -Ur
> > 
> > ubuntu@caps:~$ unshare -Ur
> > root@caps:~# logout
> > 
> > 2. root user can still unshare -Ur
> > 
> > ubuntu@caps:~$ sudo bash
> > root@caps:/home/ubuntu# unshare -Ur
> > root@caps:/home/ubuntu# logout
> > 
> > 3. root user without CAP_SETFCAP cannot unshare -Ur:
> > 
> > root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
> > root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
> > unable to set CAP_SETFCAP effective capability: Operation not permitted
> > root@caps:/home/ubuntu# unshare -Ur
> > unshare: write failed /proc/self/uid_map: Operation not permitted
> > 
> > Signed-off-by: Serge Hallyn 
> > 
> > Changelog:
> >* fix logic in the case of writing to another task's uid_map
> >* rename 'ns' to 'map_ns', and make a file_ns local variable
> >* use /* comments */
> >* update the CAP_SETFCAP comment in capability.h
> >* rename parent_unpriv to parent_can_setfcap (and reverse the
> >  logic)
> >* remove printks
> >* clarify (i hope) the code comments
> >* update capability.h comment
> >* renamed parent_can_setfcap to parent_could_setfcap
> >* made the check its own disallowed_0_mapping() fn
> >* moved the check into new_idmap_permitted
> > ---
> 
> Thank you for working on this fix!
> 
> I do prefer your approach of doing the check at user namespace creation
> time instead of moving it into the setxattr() codepath.
> 
> Let me reiterate that the ability to write through fscaps is a valid
> usecase and this should continue to work but that for locked down user
> namespace as Andrew wants to use them your patch provides a clean
> solution.
> We've are using identity mappings in quite a few scenarios partially
> when performing tests but also to write through fscaps.
> We also had reports of users that use identity mappings. They create
> their rootfs by running image extraction in an identity mapped userns
> where fscaps are written through.
> Podman has use-cases for this feature as well and has been affected by
> the regression of the first fix.

Thanks for reviewing.

I'm not sure what your point above is, so just to make sure - the
alternative implementation also does allow fscaps for cases where
root uid is remapped, only disallowing it if it would violate the
ancestor's lack of cap_setfcap.


> >  include/linux/user_namespace.h  |  3 ++
> >  include/uapi/linux/capability.h |  3 +-
> >  kernel/user_namespace.c | 61 +++--
> >  3 files changed, 63 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> > index 64cf8ebdc4ec..f6c5f784be5a 100644
> > --- a/include/linux/user_namespace.h
> > +++ b/include/linux/user_namespace.h
> > @@ -63,6 +63,9 @@ struct user_namespace {
> > kgid_t  group;
> > struct ns_commonns;
> > unsigned long   flags;
> > +   /* parent_could_setfcap: true if the creator if this ns had CAP_SETFCAP
> > +* in its effective capability set at the child ns creation time. */
> > +   boolparent_could_setfcap;
> >  
> >  #ifdef CONFIG_KEYS
> > /* List of joinable keyrings in this namespace.  Modification access of
> > diff --git a/include/uapi/linux/capability.h 
> > 

Re: [RFC PATCH] capabilities: require CAP_SETFCAP to map uid 0 (v3)

2021-04-16 Thread Christian Brauner
On Thu, Apr 15, 2021 at 11:58:51PM -0500, Serge Hallyn wrote:
> (Eric - this patch (v3) is a cleaned up version of the previous approach.
> v4 is at 
> https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4
> and is the approach you suggested.  I can send it also as a separate patch
> if you like)
> 
> A process running as uid 0 but without cap_setfcap currently can simply
> unshare a new user namespace with uid 0 mapped to 0.  While this task
> will not have new capabilities against the parent namespace, there is
> a loophole due to the way namespaced file capabilities work.  File
> capabilities valid in userns 1 are distinguised from file capabilities
> valid in userns 2 by the kuid which underlies uid 0.  Therefore
> the restricted root process can unshare a new self-mapping namespace,
> add a namespaced file capability onto a file, then use that file
> capability in the parent namespace.
> 
> To prevent that, do not allow mapping uid 0 if the process which
> opened the uid_map file does not have CAP_SETFCAP, which is the capability
> for setting file capabilities.
> 
> A further wrinkle:  a task can unshare its user namespace, then
> open its uid_map file itself, and map (only) its own uid.  In this
> case we do not have the credential from before unshare,  which was
> potentially more restricted.  So, when creating a user namespace, we
> record whether the creator had CAP_SETFCAP.  Then we can use that
> during map_write().
> 
> With this patch:
> 
> 1. unprivileged user can still unshare -Ur
> 
> ubuntu@caps:~$ unshare -Ur
> root@caps:~# logout
> 
> 2. root user can still unshare -Ur
> 
> ubuntu@caps:~$ sudo bash
> root@caps:/home/ubuntu# unshare -Ur
> root@caps:/home/ubuntu# logout
> 
> 3. root user without CAP_SETFCAP cannot unshare -Ur:
> 
> root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
> root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
> unable to set CAP_SETFCAP effective capability: Operation not permitted
> root@caps:/home/ubuntu# unshare -Ur
> unshare: write failed /proc/self/uid_map: Operation not permitted
> 
> Signed-off-by: Serge Hallyn 
> 
> Changelog:
>* fix logic in the case of writing to another task's uid_map
>* rename 'ns' to 'map_ns', and make a file_ns local variable
>* use /* comments */
>* update the CAP_SETFCAP comment in capability.h
>* rename parent_unpriv to parent_can_setfcap (and reverse the
>  logic)
>* remove printks
>* clarify (i hope) the code comments
>* update capability.h comment
>* renamed parent_can_setfcap to parent_could_setfcap
>* made the check its own disallowed_0_mapping() fn
>* moved the check into new_idmap_permitted
> ---

Thank you for working on this fix!

I do prefer your approach of doing the check at user namespace creation
time instead of moving it into the setxattr() codepath.

Let me reiterate that the ability to write through fscaps is a valid
usecase and this should continue to work but that for locked down user
namespace as Andrew wants to use them your patch provides a clean
solution.
We've are using identity mappings in quite a few scenarios partially
when performing tests but also to write through fscaps.
We also had reports of users that use identity mappings. They create
their rootfs by running image extraction in an identity mapped userns
where fscaps are written through.
Podman has use-cases for this feature as well and has been affected by
the regression of the first fix.

>  include/linux/user_namespace.h  |  3 ++
>  include/uapi/linux/capability.h |  3 +-
>  kernel/user_namespace.c | 61 +++--
>  3 files changed, 63 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index 64cf8ebdc4ec..f6c5f784be5a 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -63,6 +63,9 @@ struct user_namespace {
>   kgid_t  group;
>   struct ns_commonns;
>   unsigned long   flags;
> + /* parent_could_setfcap: true if the creator if this ns had CAP_SETFCAP
> +  * in its effective capability set at the child ns creation time. */
> + boolparent_could_setfcap;
>  
>  #ifdef CONFIG_KEYS
>   /* List of joinable keyrings in this namespace.  Modification access of
> diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> index c6ca33034147..2ddb4226cd23 100644
> --- a/include/uapi/linux/capability.h
> +++ b/include/uapi/linux/capability.h
> @@ -335,7 +335,8 @@ struct vfs_ns_cap_data {
>  
>  #define CAP_AUDIT_CONTROL30
>  
> -/* Set or remove capabilities on files */
> +/* Set or remove capabilities on files.
> +   Map uid=0 into a child user namespace. */
>  
>  #define CAP_SETFCAP   31
>  
> diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> index af612945a4d0..8c75028a9aae 100644
> 

[RFC PATCH] capabilities: require CAP_SETFCAP to map uid 0 (v3)

2021-04-15 Thread Serge E. Hallyn
(Eric - this patch (v3) is a cleaned up version of the previous approach.
v4 is at 
https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4
and is the approach you suggested.  I can send it also as a separate patch
if you like)

A process running as uid 0 but without cap_setfcap currently can simply
unshare a new user namespace with uid 0 mapped to 0.  While this task
will not have new capabilities against the parent namespace, there is
a loophole due to the way namespaced file capabilities work.  File
capabilities valid in userns 1 are distinguised from file capabilities
valid in userns 2 by the kuid which underlies uid 0.  Therefore
the restricted root process can unshare a new self-mapping namespace,
add a namespaced file capability onto a file, then use that file
capability in the parent namespace.

To prevent that, do not allow mapping uid 0 if the process which
opened the uid_map file does not have CAP_SETFCAP, which is the capability
for setting file capabilities.

A further wrinkle:  a task can unshare its user namespace, then
open its uid_map file itself, and map (only) its own uid.  In this
case we do not have the credential from before unshare,  which was
potentially more restricted.  So, when creating a user namespace, we
record whether the creator had CAP_SETFCAP.  Then we can use that
during map_write().

With this patch:

1. unprivileged user can still unshare -Ur

ubuntu@caps:~$ unshare -Ur
root@caps:~# logout

2. root user can still unshare -Ur

ubuntu@caps:~$ sudo bash
root@caps:/home/ubuntu# unshare -Ur
root@caps:/home/ubuntu# logout

3. root user without CAP_SETFCAP cannot unshare -Ur:

root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
unable to set CAP_SETFCAP effective capability: Operation not permitted
root@caps:/home/ubuntu# unshare -Ur
unshare: write failed /proc/self/uid_map: Operation not permitted

Signed-off-by: Serge Hallyn 

Changelog:
   * fix logic in the case of writing to another task's uid_map
   * rename 'ns' to 'map_ns', and make a file_ns local variable
   * use /* comments */
   * update the CAP_SETFCAP comment in capability.h
   * rename parent_unpriv to parent_can_setfcap (and reverse the
 logic)
   * remove printks
   * clarify (i hope) the code comments
   * update capability.h comment
   * renamed parent_can_setfcap to parent_could_setfcap
   * made the check its own disallowed_0_mapping() fn
   * moved the check into new_idmap_permitted
---
 include/linux/user_namespace.h  |  3 ++
 include/uapi/linux/capability.h |  3 +-
 kernel/user_namespace.c | 61 +++--
 3 files changed, 63 insertions(+), 4 deletions(-)

diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index 64cf8ebdc4ec..f6c5f784be5a 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -63,6 +63,9 @@ struct user_namespace {
kgid_t  group;
struct ns_commonns;
unsigned long   flags;
+   /* parent_could_setfcap: true if the creator if this ns had CAP_SETFCAP
+* in its effective capability set at the child ns creation time. */
+   boolparent_could_setfcap;
 
 #ifdef CONFIG_KEYS
/* List of joinable keyrings in this namespace.  Modification access of
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index c6ca33034147..2ddb4226cd23 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -335,7 +335,8 @@ struct vfs_ns_cap_data {
 
 #define CAP_AUDIT_CONTROL30
 
-/* Set or remove capabilities on files */
+/* Set or remove capabilities on files.
+   Map uid=0 into a child user namespace. */
 
 #define CAP_SETFCAP 31
 
diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
index af612945a4d0..8c75028a9aae 100644
--- a/kernel/user_namespace.c
+++ b/kernel/user_namespace.c
@@ -106,6 +106,7 @@ int create_user_ns(struct cred *new)
if (!ns)
goto fail_dec;
 
+   ns->parent_could_setfcap = cap_raised(new->cap_effective, CAP_SETFCAP);
ret = ns_alloc_inum(>ns);
if (ret)
goto fail_free;
@@ -841,6 +842,56 @@ static int sort_idmaps(struct uid_gid_map *map)
return 0;
 }
 
+/*
+ * If mapping uid 0, then file capabilities created by the new namespace will
+ * be effective in the parent namespace.  Adding file capabilities requires
+ * CAP_SETFCAP, which the child namespace will have, so creating such a
+ * mapping requires CAP_SETFCAP in the parent namespace.
+ */
+static bool disallowed_0_mapping(const struct file *file,
+struct user_namespace *map_ns,
+struct uid_gid_map *new_map)
+{
+   int idx;
+   bool zeromapping = false;
+   const struct user_namespace *file_ns = file->f_cred->user_ns;
+
+