[PATCH v3] fs: Fix the default values of i_uid/i_gid on /proc/sys inodes.

2019-07-08 Thread Radoslaw Burny
fs: Fix the default values of i_uid/i_gid on /proc/sys inodes.
Normally, the inode's i_uid/i_gid are translated relative to s_user_ns,
but this is not a correct behavior for proc. Since sysctl permission
check in test_perm is done against GLOBAL_ROOT_[UG]ID, it makes more
sense to use these values in u_[ug]id of proc inodes.
In other words: although uid/gid in the inode is not read during
test_perm, the inode logically belongs to the root of the namespace.
I have confirmed this with Eric Biederman at LPC and in this thread:
https://lore.kernel.org/lkml/87k1kzjdff@xmission.com

Consequences

Since the i_[ug]id values of proc nodes are not used for permissions
checks, this change usually makes no functional difference. However, it
causes an issue in a setup where:
* a namespace container is created without root user in container -
  hence the i_[ug]id of proc nodes are set to INVALID_[UG]ID
* container creator tries to configure it by writing /proc/sys files,
  e.g. writing /proc/sys/kernel/shmmax to configure shared memory limit
Kernel does not allow to open an inode for writing if its i_[ug]id are
invalid, making it impossible to write shmmax and thus - configure the
container.
Using a container with no root mapping is apparently rare, but we do use
this configuration at Google. Also, we use a generic tool to configure
the container limits, and the inability to write any of them causes a
failure.

History
===
The invalid uids/gids in inodes first appeared due to 81754357770e (fs:
Update i_[ug]id_(read|write) to translate relative to s_user_ns).
However, AFAIK, this did not immediately cause any issues.
The inability to write to these "invalid" inodes was only caused by a
later commit 0bd23d09b874 (vfs: Don't modify inodes with a uid or gid
unknown to the vfs).

Tested: Used a repro program that creates a user namespace without any
mapping and stat'ed /proc/$PID/root/proc/sys/kernel/shmmax from outside.
Before the change, it shows the overflow uid, with the change it's 0.
The overflow uid indicates that the uid in the inode is not correct and
thus it is not possible to open the file for writing.

Fixes: 0bd23d09b874 ("vfs: Don't modify inodes with a uid or gid unknown to the 
vfs")
Cc: sta...@vger.kernel.org # v4.8+
Signed-off-by: Radoslaw Burny 
---
Changelog since v1:
  - Updated the commit title and description.

 fs/proc/proc_sysctl.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index c74570736b24..36ad1b0d6259 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -499,6 +499,10 @@ static struct inode *proc_sys_make_inode(struct 
super_block *sb,
 
if (root->set_ownership)
root->set_ownership(head, table, >i_uid, >i_gid);
+   else {
+   inode->i_uid = GLOBAL_ROOT_UID;
+   inode->i_gid = GLOBAL_ROOT_GID;
+   }
 
return inode;
 }
-- 
2.22.0.410.gd8fdbe21b5-goog



Re: [PATCH v2] fs: Fix the default values of i_uid/i_gid on /proc/sys inodes.

2019-07-05 Thread Radoslaw Burny
On Fri, Jul 5, 2019 at 10:02 PM Luis Chamberlain  wrote:
>
>
> Please re-state the main fix in the commit log, not just the subject.

Sure, I'll do this. Just to make sure - for every iteration on the
commit message, I need to increment the patch "version" and resend the
whole patch, right?

>
> Also, this does not explain why the current values are and the impact to
> systems / users. This would help in determine and evaluating if this
> deserves to be a stable fix.

This commit a (much overdue) resend of https://lkml.org/lkml/2018/11/30/990
I think Eric's comment on the previous thread explained it best:

> We spoke about this at LPC.  And this is the correct behavioral change.
>
> The problem is there is a default value for i_uid and i_gid that is
> correct in the general case.  That default value is not corect for
> sysctl, because proc is weird.  As the sysctl permission check in
> test_perm are all against GLOBAL_ROOT_UID and GLOBAL_ROOT_GID we did not
> notice that i_uid and i_gid were being set wrong.
>
> So all this patch does is fix the default values i_uid and i_gid.

If my new commit message is still not conveying this clearly, feel
free to suggest the specific wording (I'm new to the kernel patch
process, and I might not be explaining the problems well enough).

>
>
> On Fri, Jul 05, 2019 at 06:30:21PM +0200, Radoslaw Burny wrote:
> > This also fixes a problem where, in a user namespace without root user
> > mapping, it is not possible to write to /proc/sys/kernel/shmmax.
>
> This does not explain why that should be possible and what impact this
> limitation has.

Writing to /proc/sys/kernel/shmmax allows setting a shared memory
limit for that container. Since this is usually a part of container's
initial configuration, one would expect that the container's owner /
creator is able to set the limit. Yet, due to the bug described here,
no process can write the container's shmmax if the container's user
namespace does not contain root mapping.

Using a container with no root mapping seems to be a rare case, but we
do use this configuration at Google, which is how I found the issue.
Also, we use a generic tool to configure the container limits, and the
inability to write any of them causes a hard failure.

>
> > The problem was introduced by the combination of the two commits:
> > * 81754357770ebd900801231e7bc8d151ddc00498: fs: Update
> >   i_[ug]id_(read|write) to translate relative to s_user_ns
> > - this caused the kernel to write INVALID_[UG]ID to i_uid/i_gid
> > members of /proc/sys inodes if a containing userns does not have
> > entries for root in the uid/gid_map.
> This is 2014 commit merged as of v4.8.
>
> > * 0bd23d09b874e53bd1a2fe2296030aa2720d7b08: vfs: Don't modify inodes
> >   with a uid or gid unknown to the vfs
> > - changed the kernel to prevent opens for write if the i_uid/i_gid
> > field in the inode is invalid
>
> This is a 2016 commit merged as of v4.8 as well...
>
> So regardless of the dates of the commits, are you saying this is a
> regression you can confirm did not exist prior to v4.8? Did you test
> v4.7 to confirm?

I assume no one has noticed this issue before because it requires such
a specific combination of triggers.
Yes, I've tested this with older kernel versions. I've additionally
tested a 4.8 build with just 0aa2720d7b08 reverted, confirming that
the revert fixes the issue.

>
> > This commit fixes the issue by defaulting i_uid/i_gid to
> > GLOBAL_ROOT_UID/GID.
>
> Why is this right?

Quoting Eric: "the sysctl permission check in test_perm are all
against GLOBAL_ROOT_UID and GLOBAL_ROOT_GID".
The values in the inode are not even read during test_perm, but
logically, the inode belongs to the root of the namespace.

>
> > Note that these values are not used for /proc/sys
> > access checks, so the change does not otherwise affect /proc semantics.
> >
> > Tested: Used a repro program that creates a user namespace without any
> > mapping and stat'ed /proc/$PID/root/proc/sys/kernel/shmmax from outside.
> > Before the change, it shows the overflow uid, with the change it's 0.
>
> Why is the overflow uid bad for user experience? Did you test prior to
> v4.8, ie on v4.7 to confirm this is indeed a regression?
>
> You'd want then to also ammend in the commit log a Fixes:  tag with both
> commits listed. If this is a stable fix (criteria yet to be determined),
> then we'd need a stable tag.

The overflow is technically correct; the uid in the inode is invalid,
hence it must be displayed as overflow uid. The fact that the uid is
invalid is the issue.
Logically, this commit fixes 81754357770e (as that commit first
introduced invalid uid/gid values). If you agree, I'll add this to my
updated 

[PATCH v2] fs: Fix the default values of i_uid/i_gid on /proc/sys inodes.

2019-07-05 Thread Radoslaw Burny
This also fixes a problem where, in a user namespace without root user
mapping, it is not possible to write to /proc/sys/kernel/shmmax.

The problem was introduced by the combination of the two commits:
* 81754357770ebd900801231e7bc8d151ddc00498: fs: Update
  i_[ug]id_(read|write) to translate relative to s_user_ns
- this caused the kernel to write INVALID_[UG]ID to i_uid/i_gid
members of /proc/sys inodes if a containing userns does not have
entries for root in the uid/gid_map.
* 0bd23d09b874e53bd1a2fe2296030aa2720d7b08: vfs: Don't modify inodes
  with a uid or gid unknown to the vfs
- changed the kernel to prevent opens for write if the i_uid/i_gid
field in the inode is invalid

This commit fixes the issue by defaulting i_uid/i_gid to
GLOBAL_ROOT_UID/GID. Note that these values are not used for /proc/sys
access checks, so the change does not otherwise affect /proc semantics.

Tested: Used a repro program that creates a user namespace without any
mapping and stat'ed /proc/$PID/root/proc/sys/kernel/shmmax from outside.
Before the change, it shows the overflow uid, with the change it's 0.

Signed-off-by: Radoslaw Burny 
---
Changelog since v1:
- Updated the commit title and description.

 fs/proc/proc_sysctl.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index c74570736b24..36ad1b0d6259 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -499,6 +499,10 @@ static struct inode *proc_sys_make_inode(struct 
super_block *sb,
 
if (root->set_ownership)
root->set_ownership(head, table, >i_uid, >i_gid);
+   else {
+   inode->i_uid = GLOBAL_ROOT_UID;
+   inode->i_gid = GLOBAL_ROOT_GID;
+   }
 
return inode;
 }
-- 
2.22.0.410.gd8fdbe21b5-goog



Re: [PATCH] fs: Make /proc/sys inodes be owned by global root.

2018-11-30 Thread Radoslaw Burny
On Fri, Nov 30, 2018 at 2:09 AM Luis Chamberlain  wrote:
>
> On Mon, Nov 26, 2018 at 11:29:40PM -0600, Eric W. Biederman wrote:
> > Luis Chamberlain  writes:
> > > Thanks for the description of how to run into the issue described but
> > > is there also a practical use case today where this is happening? I ask
> > > as it would be good to know the severity of the issue in the real world
> > > today.
> >
> > People trying to run containers without a root user in the container.
> > It atypical but something doable.
>
> My question was if there are generic tools / propreitary tools which are
> doing this widely *today*. Or is this just a custom setup some folks
> use?

We will soon start using this setup at Google to harden our usage of CRIU.
There are some more details in my LPC presentation:
https://linuxplumbersconf.org/event/2/contributions/210/

Although I don't know of specific tools using this setup, there was a
kernel patch in 2017 to support such use case:
7c6d78148fa0 - prctl: Allow local CAP_SYS_ADMIN changing exe_file
So, perhaps Virtuozzo people use a similar setup too?

> > We spoke about this at LPC.  And this is the correct behavioral change.
> >
> > The problem is there is a default value for i_uid and i_gid that is
> > correct in the general case.  That default value is not corect for
> > sysctl, because proc is weird.  As the sysctl permission check in
> > test_perm are all against GLOBAL_ROOT_UID and GLOBAL_ROOT_GID we did not
> > notice that i_uid and i_gid were being set wrong.
> >
> > So all this patch does is fix the default values i_uid and i_gid.
> >
> > The commit comment seems worth cleaning up.  But for the
> > content of the code.
>
> The logic seems sensible then, but are we implicating what a container
> does with its sysctl values onto the entire system? If so, sure, it
> seems you want this for networking purposes as there are a series of
> sysctl values a container may want to muck with, but are we sure we
> want the same for *all* sysctl entries?

The point is that these sysctls do not affect the whole system, just
an appropriate namespace.
For example, IPC-related files (e.g. shmmax) will always affect
writing process's UTS namespace, regardless of /proc mountpoint that
is used to access them:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/ipc/ipc_sysctl.c?h=v4.20-rc4#n24

I presume the net-related sysctls that Eric was referring to have a
similar behavior.


>
>   Luis


smime.p7s
Description: S/MIME Cryptographic Signature


Re: [PATCH] fs: Make /proc/sys inodes be owned by global root.

2018-11-30 Thread Radoslaw Burny
On Fri, Nov 30, 2018 at 2:09 AM Luis Chamberlain  wrote:
>
> On Mon, Nov 26, 2018 at 11:29:40PM -0600, Eric W. Biederman wrote:
> > Luis Chamberlain  writes:
> > > Thanks for the description of how to run into the issue described but
> > > is there also a practical use case today where this is happening? I ask
> > > as it would be good to know the severity of the issue in the real world
> > > today.
> >
> > People trying to run containers without a root user in the container.
> > It atypical but something doable.
>
> My question was if there are generic tools / propreitary tools which are
> doing this widely *today*. Or is this just a custom setup some folks
> use?

We will soon start using this setup at Google to harden our usage of CRIU.
There are some more details in my LPC presentation:
https://linuxplumbersconf.org/event/2/contributions/210/

Although I don't know of specific tools using this setup, there was a
kernel patch in 2017 to support such use case:
7c6d78148fa0 - prctl: Allow local CAP_SYS_ADMIN changing exe_file
So, perhaps Virtuozzo people use a similar setup too?

> > We spoke about this at LPC.  And this is the correct behavioral change.
> >
> > The problem is there is a default value for i_uid and i_gid that is
> > correct in the general case.  That default value is not corect for
> > sysctl, because proc is weird.  As the sysctl permission check in
> > test_perm are all against GLOBAL_ROOT_UID and GLOBAL_ROOT_GID we did not
> > notice that i_uid and i_gid were being set wrong.
> >
> > So all this patch does is fix the default values i_uid and i_gid.
> >
> > The commit comment seems worth cleaning up.  But for the
> > content of the code.
>
> The logic seems sensible then, but are we implicating what a container
> does with its sysctl values onto the entire system? If so, sure, it
> seems you want this for networking purposes as there are a series of
> sysctl values a container may want to muck with, but are we sure we
> want the same for *all* sysctl entries?

The point is that these sysctls do not affect the whole system, just
an appropriate namespace.
For example, IPC-related files (e.g. shmmax) will always affect
writing process's UTS namespace, regardless of /proc mountpoint that
is used to access them:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/ipc/ipc_sysctl.c?h=v4.20-rc4#n24

I presume the net-related sysctls that Eric was referring to have a
similar behavior.


>
>   Luis


smime.p7s
Description: S/MIME Cryptographic Signature


[PATCH] fs: Make /proc/sys inodes be owned by global root.

2018-11-26 Thread Radoslaw Burny
Due to a recent commit (d151ddc00498 - fs: Update i_[ug]id_(read|write)
to translate relative to s_user_ns), inodes under /proc/sys have -1
written to their i_uid/i_gid members if a containing userns does not
have entries for root in the uid/gid_map.

This wouldn't normally matter, because these values are not used for
access checks. However, a later change (0bd23d09b874 - Don't modify
inodes with a uid or gid unknown to the vfs) changes the kernel to
prevent opens for write if the i_uid/i_gid field in the inode is -1,
even if the /proc/sys-specific access checks would otherwise pass.

This causes a problem: in a userns without root mapping, even the
namespace creator cannot write to e.g. /proc/sys/kernel/shmmax.
This change fixes the problem by overriding i_uid/i_gid back to
GLOBAL_ROOT_UID/GID.

Tested: Used a repro program that creates a user namespace without any
mapping and stat'ed /proc/$PID/root/proc/sys/kernel/shmmax from outside.
Before the change, it shows uid/gid of 65534, with the change it's 0.

Signed-off-by: Radoslaw Burny 
---
 fs/proc/proc_sysctl.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index c5cbbdff3c3d..67379a389658 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -499,6 +499,10 @@ static struct inode *proc_sys_make_inode(struct 
super_block *sb,
 
if (root->set_ownership)
root->set_ownership(head, table, >i_uid, >i_gid);
+   else {
+   inode->i_uid = GLOBAL_ROOT_UID;
+   inode->i_gid = GLOBAL_ROOT_GID;
+   }
 
 out:
return inode;
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog



[PATCH] fs: Make /proc/sys inodes be owned by global root.

2018-11-26 Thread Radoslaw Burny
Due to a recent commit (d151ddc00498 - fs: Update i_[ug]id_(read|write)
to translate relative to s_user_ns), inodes under /proc/sys have -1
written to their i_uid/i_gid members if a containing userns does not
have entries for root in the uid/gid_map.

This wouldn't normally matter, because these values are not used for
access checks. However, a later change (0bd23d09b874 - Don't modify
inodes with a uid or gid unknown to the vfs) changes the kernel to
prevent opens for write if the i_uid/i_gid field in the inode is -1,
even if the /proc/sys-specific access checks would otherwise pass.

This causes a problem: in a userns without root mapping, even the
namespace creator cannot write to e.g. /proc/sys/kernel/shmmax.
This change fixes the problem by overriding i_uid/i_gid back to
GLOBAL_ROOT_UID/GID.

Tested: Used a repro program that creates a user namespace without any
mapping and stat'ed /proc/$PID/root/proc/sys/kernel/shmmax from outside.
Before the change, it shows uid/gid of 65534, with the change it's 0.

Signed-off-by: Radoslaw Burny 
---
 fs/proc/proc_sysctl.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index c5cbbdff3c3d..67379a389658 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -499,6 +499,10 @@ static struct inode *proc_sys_make_inode(struct 
super_block *sb,
 
if (root->set_ownership)
root->set_ownership(head, table, >i_uid, >i_gid);
+   else {
+   inode->i_uid = GLOBAL_ROOT_UID;
+   inode->i_gid = GLOBAL_ROOT_GID;
+   }
 
 out:
return inode;
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog