Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces

2020-10-29 Thread Enrico Weigelt, metux IT consult
On 17.10.20 18:51, Eric W. Biederman wrote:

Hi folks,

>> I believe subusers aren't meant for typical containers (like docker or
>> lxc), but unprivileged user programs that wanna have further isolation
>> for subprocesses (eg. a browser's renderer or js engine).
>>
>> Correct me if I'm wrong.
> 
> There is an on-going trend to make unprivileged containers typical
> containers.

Yes, that's what I hope for :)
But I'm still unsure whether these files really fit into the scenarios
we're currently discussing.

What still puzzles me: we've got several quite different scenarios
related to uid allocation and mapping. Maybe we should first work out,
what they all have in common ?

Some quick examples:

a) arbitrary user wants to run certain programs (eg. daemons) with
   limited privileges (eg. can access only certain resources, eg.
   subdir of his homedir), possibly under some different UID, but still
   have full control over them (signals, strace, ...) - without any
   special help by root.

b) arbitrary user wants to run some programs with different mounts
   (plan9 style) w/o any special help by root. (unprivileged mount_ns
   still needs user_ns, right ?)

c) arbitrary user wants to run some (docker-style) containerized
   GUI application, which needs access to certain files in his homedir,
   just as if it would run directly

d) classical container workload (really being root inside it) with
   shared images and possibly shared directories w/ the calling user.

Steps to take care of are, eg.:

* allocate new user-visible UIDs (usually w/ names assigned), either
  permanently or temporarily
* sane mapping between several namespaces (which ones exactly shall
  appear from inside vs outside ?)
* map file system permissions and fs uids

Tricky. How can we decide which mappings an unprivileged user shall be
allowed to do under which circumstances ?

Scenario a) container is running with (parts of) the host fs
  --> we need to make sure it cannot escape and access some
      sensitive files
  --> different fs-UID mappings per fs ?

Scenario b) container is running with its own fs image
  --> the image could be entirely under the unprivileged
      user's control (maybe created by himself)
  --> uids recorded in the fs probably should be exactly those
      visible inside the container

Maybe we should put in a separate UID/permission translation layer
into VFS, which would process different policies (not just plain range
shifting, but possibly more complex translations) depending on the
namespace ?
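
To make the idea a bit more concrete, here is a rough userspace sketch of
such a translation layer. It is only a model, not kernel code: the names
(shift, passthrough, translate) and the example ranges are made up for
illustration. Each namespace carries a list of rules, and the first rule
that matches decides the on-disk uid.

```python
from typing import Callable, List, Optional

# A rule maps an in-namespace uid to an on-disk uid, or returns None when
# it does not apply. All names here are hypothetical, for illustration.
Rule = Callable[[int], Optional[int]]


def shift(inside_start: int, outside_start: int, count: int) -> Rule:
    """Plain range shifting: [inside_start, +count) -> [outside_start, +count)."""
    def rule(uid: int) -> Optional[int]:
        if inside_start <= uid < inside_start + count:
            return outside_start + (uid - inside_start)
        return None
    return rule


def passthrough(uid_to_keep: int) -> Rule:
    """Keep a single uid identical on both sides (eg. a shared homedir owner)."""
    def rule(uid: int) -> Optional[int]:
        return uid if uid == uid_to_keep else None
    return rule


def translate(policy: List[Rule], uid: int) -> Optional[int]:
    """First matching rule wins; None means the uid has no mapping."""
    for rule in policy:
        result = rule(uid)
        if result is not None:
            return result
    return None


# Container on (parts of) the host fs: uid 1000 is shared, the rest shifted.
host_fs_policy = [passthrough(1000), shift(0, 100000, 65536)]
# Container on its own image: on-disk uids equal the uids seen inside.
own_image_policy = [shift(0, 0, 65536)]
```

With these assumed policies, translate(host_fs_policy, 0) gives 100000
while translate(host_fs_policy, 1000) stays 1000, and anything outside
the ranges simply has no mapping.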

> I forget the details but systemd has a feature where it will randomly
> allocate a uid for a service.  Calling them something like temporary uids.

I'd consider this a horrible bug - especially from an operations
perspective. As operator, I really want to know what users (uids) I've
got on the system, and what's running under them.
(I never use systemd, for tons of other reasons, anyways)

>> IMHO, all we need is to maintain a list of active ranges (more precisely
>> the 16bit prefixes, just like class B networks ;-)). As said, I'd
>> declare the scenario #P3 as invalid and rather fix those few broken
>> applications.
> 
> Which is /etc/subuid and /etc/subgid, and it was very much inspired from
> the same source.

Why not just move this into some common daemon or access pattern ?
(outside the kernel)
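
A minimal sketch of what such a common allocator could track, assuming
ranges are handed out as 16bit prefixes. The class and method names are
hypothetical, and a real daemon would also need persistence and locking.
Prefix 0 is reserved for the host's own uids and prefix 0xFFFF is skipped
because it contains (uid_t)-1, which is not a valid id.

```python
class PrefixAllocator:
    """Hands out 16-bit-aligned uid ranges, keyed by their upper 16 bits."""

    def __init__(self, reserved=(0, 0xFFFF)):
        self.active = {}               # prefix -> owner label
        self.reserved = set(reserved)  # prefixes that are never handed out

    def allocate(self, owner: str) -> range:
        """Return the lowest free 65536-id range and record who owns it."""
        for prefix in range(1 << 16):
            if prefix in self.reserved or prefix in self.active:
                continue
            self.active[prefix] = owner
            start = prefix << 16
            return range(start, start + (1 << 16))
        raise RuntimeError("no free 16-bit prefix left")

    def release(self, prefix: int) -> None:
        """Mark a prefix as free again (no-op if it was not active)."""
        self.active.pop(prefix, None)
```

The first allocation would start at uid 65536, the second at 131072, and
a released prefix becomes available for the next caller.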

>> Is this really a practical issue, when we're using uid namespaces ?
> 
> Very much so.  There are containers that would otherwise use the same
> uid range (AKA they have the same set of users), but can't, because
> there are cases like daemons that set their RLIMIT_NPROC to 1: the
> daemon knows that the user for that daemon will never run any other
> processes.

Just curious: why are these containers (smells like typical server
workloads) running with the same UIDs in the first place ?

Maybe because the lack of proper mapping of fs-uids ? (see above).
Or are there any reasons why they should run under the same uid ?

>>> S2. Kernel-enforced user namespace isolation.
>>> This means, there is no need for different container runtimes to
>>> collaborate on id ranges with immediate benefits for everyone.
>>> This solves P1 and P2.
>>
>> Okay, but how to support scenarios where some of the UIDs should
>> overlap on purpose ? (eg. mounting some of the host's user homedirs
>> into namespaces ?)
> 
> Just have a limited number of mappings for the cases that actually need
> on-disk storage.  The key idea is adding uids that don't need to be
> mapped.  Everything else stays the same.

Okay, but the interesting question becomes: what does need to be mapped,
and what not ? How exactly shall we find that out in a generic manner ?

I guess your proposal only helps for those UIDs which are really randomly
allocated - or anything outside the explicitly given ranges, which
(IMHO) now is mapped to -1. Correct ?

Just a weird thought: shall we introduce an 

Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces

2020-10-19 Thread Giuseppe Scrivano
"Serge E. Hallyn"  writes:

> On Tue, Oct 13, 2020 at 05:17:36PM +0200, Giuseppe Scrivano wrote:
>> "Serge E. Hallyn"  writes:
>> 
>> > On Mon, Oct 12, 2020 at 07:05:10PM +0200, Giuseppe Scrivano wrote:
>> >> Josh Triplett  writes:
>> >> 
>> >> > On Fri, Oct 09, 2020 at 11:26:06PM -0500, Serge E. Hallyn wrote:
>> >> >> > 3. Find a way to allow setgroups() in a user namespace while keeping
>> >> >> >in mind the case of groups used for negative access control.
>> >> >> >This was suggested by Josh Triplett and Geoffrey Thomas. Their 
>> >> >> > idea was to
>> >> >> >investigate adding a prctl() to allow setgroups() to be called in 
>> >> >> > a user
>> >> >> >namespace at the cost of restricting paths to the most restrictive
>> >> >> >permission. So if something is 0707 it needs to be treated as if 
>> >> >> > it's 
>> >> >> >even though the caller is not in its owning group which is used 
>> >> >> > for negative
>> >> >> >access control (how these new semantics will interact with ACLs 
>> >> >> > will also
>> >> >> >need to be looked into).
>> >> >> 
>> >> >> I should probably think this through more, but for this problem, would 
>> >> >> it
>> >> >> not suffice to add a new prevgroups grouplist to the struct cred, maybe
>> >> >> struct group_info *locked_groups, and every time an unprivileged task 
>> >> >> creates
>> >> >> a new user namespace, add all its current groups to this list?
>> >> >
>> >> > So, effectively, you would be allowed to drop permissions, but
>> >> > locked_groups would still be checked for restrictions?
>> >> >
>> >> > That seems like it'd introduce a new level of complexity (a new facet of
>> >> > permission) to manage. Not opposed, but it does seem more complex than
>> >> > just opting out of using groups for negative permissions.
>> >> 
>> >> I have played with something similar in the past.  At that time I've
>> >> discussed it only privately with Eric and we agreed it wasn't worth the
>> >> extra complexity:
>> >> 
>> >> https://github.com/giuseppe/linux/commit/7e0701b389c497472d11fab8570c153a414050af
>> >
>> > Hi, you linked the setgroups patch, but do you also have a link to the
>> > attempt which you deemed was not worth it?
>> 
>> it was just part of a private discussion; but was 4 years ago so we can
>> probably revisit and accept the additional complexity since setgroups()
>> is still an issue with user namespaces.
>> 
>> 
>> >> instead of a prctl, I've added a new mode to /proc/PID/setgroups that
>> >> allows setgroups in a userns locking the current gids.
>> >> 
>> >> What do you think about using /proc/PID/setgroups instead of a new
>> >> prctl()?
>> >
>> > It's better than not having it, but two concerns -
>> >
>> > 1. some userspace, especially testsuites, could become confused by the fact
>> > that they can't drop groups no matter how hard they try, since these will 
>> > all
>> > still show up as regular groups.
>> 
>> I forgot to send a link to a second patch :-) that completes the feature:
>> https://github.com/giuseppe/linux/commit/1c5fe726346b216293a527719e64f34e6297f0c2
>> 
>> When the new mode is used, the gids that are not known in the userns do
>> not show up in userspace.
>
> Ah, right - and of course those gids better not be mapped into the namespace 
> :)
>
> But so, this is the patch you said you agreed was not worth the extra
> complexity?

yes, these two patches are what looked too complex at that time.  The
problem still exists though, we could perhaps reconsider if the
extra-complexity is acceptable to address it.

Regards,
Giuseppe



Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces

2020-10-18 Thread Christian Brauner
On Sat, Oct 17, 2020 at 11:51:22AM -0500, Eric W. Biederman wrote:
> "Enrico Weigelt, metux IT consult"  writes:
> 
> > On 30.08.20 16:39, Christian Brauner wrote:
> >
> > Hi Christian,
> >
> >> P1. Isolated id mappings can only be guaranteed to be locally isolated.
> >> A container runtime/daemon can only guarantee non-overlapping id 
> >> mappings
> >> when no other users on the system create containers.
> >
> > Indeed. But couldn't we just record the mappings in some standardized
> > place (eg. some file) which all engines maintain ?
> >
> > I'd guess other solutions would need changes in the runtimes, too.
> >
> > Please keep in mind that some scenarios actually need some overlaps, eg.
> > application containers that shall have direct access to home dirs.
> >
> >> P2. Enforcing isolated id mappings in userspace is difficult.
> >> It is always possible to create other processes with overlapping id
> >> mappings. Coordinating id mappings in userspace will always remain
> >> optional. Quite a few tools nowadays (including systemd) don't care 
> >> about
> >> /etc/sub{g,u}id and actively advise against using it. This is made even
> >> more problematic since sub{g,u}id delegation is done per-user rather 
> >> than
> >> per-container-runtime.
> >
> > I believe subusers aren't meant for typical containers (like docker or
> > lxc), but unprivileged user programs that wanna have further isolation
> > for subprocesses (eg. a browser's renderer or js engine).
> >
> > Correct me if I'm wrong.
> 
> There is an on-going trend to make unprivileged containers typical
> containers.

In general, this is something we all have been collectively pushing on
for years. Our users running LXD run unprivileged containers by default.
The daemon requires you to explicitly request running privileged
containers. All Linux workloads on Chromebooks are LXD-based and are
thus run in fully unprivileged containers, as are all workloads on
ppc/arm64/s390x on Travis.
And now we're finally also seeing more runC based container managers like
Podman/cri-o adopting unprivileged containers too. So this is becoming
more and more common and in the interest of security we have an
obligation to help push for more adoption.

> 
> >> P3. The range of the id mapping of a container can't be predetermined.
> >> While POSIX mandates that a standard system should use a range of 
> >> 65536 ids
> >> reality is very different. Some programs allocate high ids for random
> >> processes or for network authentication. This means, in practice it is
> >> often necessary to assign a range of up to 10 million ids to a 
> >> container.
> >> This limits a system to less than 500 containers total.
> >
> > In 25+ years, haven't seen such an application in the field. I'd
> > consider this a horrible and dangerous bug. Sane applications create
> > specific user entries (/etc/passwd) for that.
> >
> > I'd say we're safe w/ max 2^16 users per container, which should give us
> > space for about 2^16 containers.
> 
> I forget the details but systemd has a feature where it will randomly
> allocate a uid for a service.  Calling them something like temporary uids.

and things like ldap, pam, or samba. The number is growing with
applications becoming more security aware. Here's an example from a user
reported bug:

Jun 13 02:05:39 xenial-template sshd[390]: Accepted password for sokoow from 10.21.34.100 port 37532 ssh2
Jun 13 02:05:39 xenial-template sshd[390]: pam_keyinit(sshd:session): Unable to change GID to 99000 temporarily
Jun 13 02:05:39 xenial-template sshd[390]: pam_unix(sshd:session): session opened for user sokoow by (uid=0)
Jun 13 02:05:39 xenial-template sshd[390]: pam_motd(sshd:session): pam_modutil_drop_priv: change_gid failed: Success
Jun 13 02:05:39 xenial-template sshd[390]: pam_motd(sshd:session): Unable to change UID to 10003 temporarily
Jun 13 02:05:39 xenial-template sshd[390]: pam_motd(sshd:session): pam_modutil_regain_priv: called with invalid state
Jun 13 02:05:39 xenial-template sshd[390]: pam_motd(sshd:session): Unable to change UID back to -1
Jun 13 02:05:39 xenial-template sshd[390]: pam_motd(sshd:session): pam_modutil_drop_priv: change_gid failed: Success
Jun 13 02:05:39 xenial-template sshd[390]: pam_motd(sshd:session): Unable to change UID to 10003 temporarily
Jun 13 02:05:39 xenial-template sshd[390]: pam_motd(sshd:session): pam_modutil_regain_priv: called with invalid state
Jun 13 02:05:39 xenial-template sshd[390]: pam_motd(sshd:session): Unable to change UID back to -1
Jun 13 02:05:39 xenial-template sshd[390]: pam_mail(sshd:session): pam_modutil_drop_priv: change_gid failed: Success

Maybe when running application containers that problem is not as
pressing, but for containers running full systems, bug reports
involving high id allocations are pretty common:
https://github.com/lxc/lxd/issues/2111

There's nothing wrong with dropping to high ids technically and we can't
really 

Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces

2020-10-17 Thread Eric W. Biederman
"Enrico Weigelt, metux IT consult"  writes:

> On 30.08.20 16:39, Christian Brauner wrote:
>
> Hi Christian,
>
>> P1. Isolated id mappings can only be guaranteed to be locally isolated.
>> A container runtime/daemon can only guarantee non-overlapping id mappings
>> when no other users on the system create containers.
>
> Indeed. But couldn't we just record the mappings in some standardized
> place (eg. some file) which all engines maintain ?
>
> I'd guess other solutions would need changes in the runtimes, too.
>
> Please keep in mind that some scenarios actually need some overlaps, eg.
> application containers that shall have direct access to home dirs.
>
>> P2. Enforcing isolated id mappings in userspace is difficult.
>> It is always possible to create other processes with overlapping id
>> mappings. Coordinating id mappings in userspace will always remain
>> optional. Quite a few tools nowadays (including systemd) don't care about
>> /etc/sub{g,u}id and actively advise against using it. This is made even
>> more problematic since sub{g,u}id delegation is done per-user rather 
>> than
>> per-container-runtime.
>
> I believe subusers aren't meant for typical containers (like docker or
> lxc), but unprivileged user programs that wanna have further isolation
> for subprocesses (eg. a browser's renderer or js engine).
>
> Correct me if I'm wrong.

There is an on-going trend to make unprivileged containers typical
containers.

>> P3. The range of the id mapping of a container can't be predetermined.
>> While POSIX mandates that a standard system should use a range of 65536 
>> ids
>> reality is very different. Some programs allocate high ids for random
>> processes or for network authentication. This means, in practice it is
>> often necessary to assign a range of up to 10 million ids to a container.
>> This limits a system to less than 500 containers total.
>
> In 25+ years, haven't seen such an application in the field. I'd
> consider this a horrible and dangerous bug. Sane applications create
> specific user entries (/etc/passwd) for that.
>
> I'd say we're safe w/ max 2^16 users per container, which should give us
> space for about 2^16 containers.

I forget the details but systemd has a feature where it will randomly
allocate a uid for a service.  Calling them something like temporary uids.

>> P4. Isolated id mappings severely restrict the number of containers that can 
>> be
>> run on a system.
>> This ties back to the point about pre-determining the id range of a
>> container and how large range allocations tend to be on real systems. 
>> That
>> becomes even more relevant when nesting containers.
>
> IMHO, all we need is to maintain a list of active ranges (more precisely
> the 16bit prefixes, just like class B networks ;-)). As said, I'd
> declare the scenario #P3 as invalid and rather fix those few broken
> applications.

Which is /etc/subuid and /etc/subgid, and it was very much inspired from
the same source.

>> P5. Container runtimes cannot reuse overlayfs lower directories if each
>> container uses isolated ID mappings, leading to either needless storage
>> overhead (LXD -- though the LXD folks don’t really mind), completely
>> ignoring the benefits of isolating containers from each other (Docker), 
>> or
>> not using them at all (Kubernetes). (This is a more general issue but 
>> bears
>> repeating since it is closely tied to most userns proposals.)
>
> Indeed. That's IMHO the main problem. We somehow need to map the UIDs.
> Maybe a synthetic filesystem that just does exactly the same uid<->kuid
> translations we're already doing in other places ?
>
>> P6. Rlimits pose a problem for containers that share the same id mapping.
>> This means containers with overlapping id mappings can DOS each other by
>> exhausting their rlimits. The reason for this lies with the current
>> implementation of rlimits -- rlimits are currently tied to users and are
>> not hierarchically limited like inotify limits are. This is a severe
>> problem in unprivileged workloads. Eric and others identified that this
>> issue can be fixed independently of the isolated user namespace proposal.
>
> Is this really a practical issue, when we're using uid namespaces ?

Very much so.  There are containers that would otherwise use the same
uid range (AKA they have the same set of users), but can't, because
there are cases like daemons that set their RLIMIT_NPROC to 1: the
daemon knows that the user for that daemon will never run any other
processes.

Run two containers with the same mappings and that daemon DOS's itself.
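
A toy model of that failure mode, under the simplifying assumption that
the process count is charged per host-side uid (kuid) and that a spawn is
refused once the count reaches the limit. The Host class and to_kuid
helper are hypothetical and only mimic the accounting, not the real
kernel paths.

```python
from collections import Counter


class Host:
    """Tracks running processes per kuid, like a very simplified RLIMIT_NPROC."""

    def __init__(self):
        self.nproc = Counter()  # kuid -> number of running processes

    def spawn(self, kuid: int, rlimit_nproc: int) -> bool:
        """Start one process for kuid; False models a fork failing with EAGAIN."""
        if self.nproc[kuid] >= rlimit_nproc:
            return False
        self.nproc[kuid] += 1
        return True


def to_kuid(outside_start: int, uid: int) -> int:
    """Range-shift an in-container uid to its host-side uid."""
    return outside_start + uid


DAEMON_UID = 105  # arbitrary in-container uid of the daemon

# Two containers sharing the mapping 0 -> 100000: same kuid, second daemon fails.
shared = Host()
first_ok = shared.spawn(to_kuid(100000, DAEMON_UID), rlimit_nproc=1)
second_ok = shared.spawn(to_kuid(100000, DAEMON_UID), rlimit_nproc=1)

# Two containers with disjoint mappings: different kuids, both daemons start.
isolated = Host()
iso_first = isolated.spawn(to_kuid(100000, DAEMON_UID), rlimit_nproc=1)
iso_second = isolated.spawn(to_kuid(165536, DAEMON_UID), rlimit_nproc=1)
```

In the shared case first_ok is True and second_ok is False, while with
disjoint mappings both spawns succeed.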

>> S2. Kernel-enforced user namespace isolation.
>> This means, there is no need for different container runtimes to
>> collaborate on id ranges with immediate benefits for everyone.
>> This solves P1 and P2.
>
> Okay, but how to support scenarios where some of the UIDs should
> 

Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces

2020-10-17 Thread Eric W. Biederman
"Serge E. Hallyn"  writes:

> On Wed, Oct 14, 2020 at 02:46:46PM -0500, Eric W. Biederman wrote:
>> "Serge E. Hallyn"  writes:
>> 
>> > On Mon, Oct 12, 2020 at 12:01:09AM -0500, Eric W. Biederman wrote:
>> >> Andy Lutomirski  writes:
>> >> 
>> >> > On Sun, Oct 11, 2020 at 1:53 PM Josh Triplett  
>> >> > wrote:
>> >> >>
>> >> >> On Fri, Oct 09, 2020 at 11:26:06PM -0500, Serge E. Hallyn wrote:
>> >> >> > > 3. Find a way to allow setgroups() in a user namespace while 
>> >> >> > > keeping
>> >> >> > >in mind the case of groups used for negative access control.
>> >> >> > >This was suggested by Josh Triplett and Geoffrey Thomas. Their 
>> >> >> > > idea was to
>> >> >> > >investigate adding a prctl() to allow setgroups() to be called 
>> >> >> > > in a user
>> >> >> > >namespace at the cost of restricting paths to the most 
>> >> >> > > restrictive
>> >> >> > >permission. So if something is 0707 it needs to be treated as 
>> >> >> > > if it's 
>> >> >> > >even though the caller is not in its owning group which is used 
>> >> >> > > for negative
>> >> >> > >access control (how these new semantics will interact with ACLs 
>> >> >> > > will also
>> >> >> > >need to be looked into).
>> >> >> >
>> >> >> > I should probably think this through more, but for this problem, 
>> >> >> > would it
>> >> >> > not suffice to add a new prevgroups grouplist to the struct cred, 
>> >> >> > maybe
>> >> >> > struct group_info *locked_groups, and every time an unprivileged 
>> >> >> > task creates
>> >> >> > a new user namespace, add all its current groups to this list?
>> >> >>
>> >> >> So, effectively, you would be allowed to drop permissions, but
>> >> >> locked_groups would still be checked for restrictions?
>> >> >>
>> >> >> That seems like it'd introduce a new level of complexity (a new facet 
>> >> >> of
>> >> >> permission) to manage. Not opposed, but it does seem more complex than
>> >> >> just opting out of using groups for negative permissions.
>> >
>> > Yeah, it would, but I basically hoped that we could catch most of this at
>> > e.g. generic_permission(), and/or we could introduce a helper which
>> > automatically adds a check for permission denied from locked_groups, so
>> > it shouldn't be too wide-spread.  If it does end up showing up all over
>> > the place, then that's a good reason not to do this.
>> >
>> >> > Is there any context other than regular UNIX DAC in which groups can
>> >> > act as negative permissions or is this literally just an issue for
>> >> > files with a more restrictive group mode than other mode?
>> >> 
>> >> Just that.
>> >> 
>> >> The ideas kicked around in the conversation were some variant of having
>> >> a sysctl that says "This system never uses groups for negative
>> >> permissions".
>> >> 
>> >> It was also suggested that if the sysctl was set the permission
>> >> checks would be altered such that even if someone tried to set a
>> >> negative permission, the more liberal permissions of other would be used
>> >> instead.
>> >
>> > So then this would touch all the same code points which the
>> > locked_groups approach would have to touch?
>> 
>> No, locked_groups would touch in_group_p and set_groups.  Especially what
>> set_groups means in that context.  It would have to handle what happens
>> when you start accumulating locked groups (because of multiple
>> namespaces).  How you dedup locked groups etc.
>
> Well since group_info is sorted, you should be able to do a pretty
> simple and quick merge of current->locked_groups and
> current->group_info.  I suppose we'd have to consider a nasty user who
> is allocated 100k groups, sticks them all in groupinfo, then unshare
> twice, locking the kernel up for awhile, but that user can already hurt
> us.
>
>> I was not able to convince myself that not being able to clear out
>> groups that a user has when they create a user namespace won't cause
>> other problems.  Especially as user namespaces had been in use for a
>> while at that point.
>
> The locked_groups would *only* be considered for negative acls, right?

I had not seen that idea proposed.  I had assumed they would be
consulted in all cases for group membership in permission checks,
and that the only change would be to in_group_p and the code to
maintain the group lists.
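
For the list maintenance part, the merge Serge describes could look
roughly like this in userspace terms. This is only a sketch of a sorted
merge with de-duplication, assuming both lists are already sorted gid
lists; merge_sorted_unique is a made-up name, not an existing kernel
helper.

```python
def merge_sorted_unique(a, b):
    """Merge two sorted gid lists into one sorted list without duplicates."""
    out = []

    def push(gid):
        # Skip a gid that equals the last one emitted (handles duplicates
        # both across the two lists and within a single list).
        if not out or out[-1] != gid:
            out.append(gid)

    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            push(a[i])
            i += 1
        elif b[j] < a[i]:
            push(b[j])
            j += 1
        else:
            push(a[i])
            i += 1
            j += 1
    # At most one of the two tails is non-empty here.
    for gid in a[i:]:
        push(gid)
    for gid in b[j:]:
        push(gid)
    return out
```

The cost is linear in the combined length, so repeated unshares only add
the groups that were not already locked.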

> You would not *grant* any perms based on them.  It seems like exactly
> what you want.  If any user is denied perms on account of it, then that
> was the intent, and that's the whole reason we're having this problem.
> We are discussing whether it's ok to let a new user_ns be a way to
> bypass that restriction - not *looking* for a way to support bypassing
> it.
>
> I could state this as a more formal proof if you like.


If you modify the permission checks as you suggest it does seem easier
to reason about with respect to causing problems.  I would want to call
them denied_groups or something like that in the data structure for
clarity.
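
To check my understanding of the modified check, here is a small model of
the classic owner/group/other mode check with an extra denied_groups set
that can only take permissions away. The function names and the example
ids are assumptions for illustration; ACLs and capabilities are ignored.

```python
def dac_bits(mode, file_uid, file_gid, uid, groups):
    """Classic UNIX DAC: owner bits, else group bits if a member, else other."""
    if uid == file_uid:
        return (mode >> 6) & 7
    if file_gid in groups:
        return (mode >> 3) & 7
    return mode & 7


def dac_bits_with_denied(mode, file_uid, file_gid, uid, groups, denied_groups):
    """Same check, but a denied group may only restrict, never grant."""
    bits = dac_bits(mode, file_uid, file_gid, uid, groups)
    if uid != file_uid and file_gid in denied_groups:
        bits &= (mode >> 3) & 7
    return bits


MODE = 0o707           # group is locked out, everybody else has full access
FILE_UID, FILE_GID = 0, 50
UID = 1000

before = dac_bits(MODE, FILE_UID, FILE_GID, UID, {50})        # member of 50
after_drop = dac_bits(MODE, FILE_UID, FILE_GID, UID, set())   # dropped 50
with_denied = dac_bits_with_denied(MODE, FILE_UID, FILE_GID, UID, set(), {50})
```

Here before is 0, after_drop is 7 (the bypass being discussed), and
with_denied is 0 again. For a 0740 file the denied group alone yields no
access, so nothing is granted on account of it.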

However there is a big question of 

Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces

2020-10-15 Thread Enrico Weigelt, metux IT consult
On 30.08.20 16:39, Christian Brauner wrote:

Hi Christian,

> P1. Isolated id mappings can only be guaranteed to be locally isolated.
> A container runtime/daemon can only guarantee non-overlapping id mappings
> when no other users on the system create containers.

Indeed. But couldn't we just record the mappings in some standardized
place (eg. some file) which all engines maintain ?

I'd guess other solutions would need changes in the runtimes, too.

Please keep in mind that some scenarios actually need some overlaps, eg.
application containers that shall have direct access to home dirs.

> P2. Enforcing isolated id mappings in userspace is difficult.
> It is always possible to create other processes with overlapping id
> mappings. Coordinating id mappings in userspace will always remain
> optional. Quite a few tools nowadays (including systemd) don't care about
> /etc/sub{g,u}id and actively advise against using it. This is made even
> more problematic since sub{g,u}id delegation is done per-user rather than
> per-container-runtime.

I believe subusers aren't meant for typical containers (like docker or
lxc), but unprivileged user programs that wanna have further isolation
for subprocesses (eg. a browser's renderer or js engine).

Correct me if I'm wrong.

> P3. The range of the id mapping of a container can't be predetermined.
> While POSIX mandates that a standard system should use a range of 65536 
> ids
> reality is very different. Some programs allocate high ids for random
> processes or for network authentication. This means, in practice it is
> often necessary to assign a range of up to 10 million ids to a container.
> This limits a system to less than 500 containers total.

In 25+ years, haven't seen such an application in the field. I'd
consider this a horrible and dangerous bug. Sane applications create
specific user entries (/etc/passwd) for that.

I'd say we're safe w/ max 2^16 users per container, which should give us
space for about 2^16 containers.
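
The back-of-the-envelope numbers behind both figures, assuming a 32bit
uid space and ignoring the host's own range and the reserved (uid_t)-1,
so these are upper bounds. max_containers is just a helper name for this
calculation.

```python
UID_SPACE = 2 ** 32  # uid_t is 32 bits wide on Linux


def max_containers(ids_per_container: int) -> int:
    """Upper bound on non-overlapping ranges of the given size."""
    return UID_SPACE // ids_per_container


print(max_containers(10_000_000))  # 429, the "less than 500" from P3
print(max_containers(2 ** 16))     # 65536, ie. 2^16 containers
```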

> P4. Isolated id mappings severely restrict the number of containers that can 
> be
> run on a system.
> This ties back to the point about pre-determining the id range of a
> container and how large range allocations tend to be on real systems. That
> becomes even more relevant when nesting containers.

IMHO, all we need is to maintain a list of active ranges (more precisely
the 16bit prefixes, just like class B networks ;-)). As said, I'd
declare the scenario #P3 as invalid and rather fix those few broken
applications.

> P5. Container runtimes cannot reuse overlayfs lower directories if each
> container uses isolated ID mappings, leading to either needless storage
> overhead (LXD -- though the LXD folks don’t really mind), completely
> ignoring the benefits of isolating containers from each other (Docker), or
> not using them at all (Kubernetes). (This is a more general issue but 
> bears
> repeating since it is closely tied to most userns proposals.)

Indeed. That's IMHO the main problem. We somehow need to map the UIDs.
Maybe a synthetic filesystem that just does exactly the same uid<->kuid
translations we're already doing in other places ?

> P6. Rlimits pose a problem for containers that share the same id mapping.
> This means containers with overlapping id mappings can DOS each other by
> exhausting their rlimits. The reason for this lies with the current
> implementation of rlimits -- rlimits are currently tied to users and are
> not hierarchically limited like inotify limits are. This is a severe
> problem in unprivileged workloads. Eric and others identified that this
> issue can be fixed independently of the isolated user namespace proposal.

Is this really a practical issue, when we're using uid namespaces ?

> S2. Kernel-enforced user namespace isolation.
> This means, there is no need for different container runtimes to
> collaborate on id ranges with immediate benefits for everyone.
> This solves P1 and P2.

Okay, but how to support scenarios where some of the UIDs should
overlap on purpose ? (eg. mounting some of the host's user homedirs
into namespaces ?)

> S5. The owning id concept of a user namespace makes monitoring and interacting
> with such containers way easier.

What exactly is the owning id ? How is it created and managed ?
Some magic id or a cryptographic token ?

> 1. How are interactions across isolated user namespaces handled?

What kind of interaction do you have in mind ?
Data transfers ? Process manipulation ? Namespace destruction ?

Can you please illustrate some actual use cases ?

>Proposal 1.1 seemed preferred since it would allow an unprivileged
>user creating an isolated user namespace to kill/ptrace all processes
>in the isolated namespace they spawned. 

Don't we already have this if this user is mapped as root inside the
container ?

>The first consensus 

Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces

2020-10-15 Thread Serge E. Hallyn
On Tue, Oct 13, 2020 at 05:17:36PM +0200, Giuseppe Scrivano wrote:
> "Serge E. Hallyn"  writes:
> 
> > On Mon, Oct 12, 2020 at 07:05:10PM +0200, Giuseppe Scrivano wrote:
> >> Josh Triplett  writes:
> >> 
> >> > On Fri, Oct 09, 2020 at 11:26:06PM -0500, Serge E. Hallyn wrote:
> >> >> > 3. Find a way to allow setgroups() in a user namespace while keeping
> >> >> >in mind the case of groups used for negative access control.
> >> >> >This was suggested by Josh Triplett and Geoffrey Thomas. Their 
> >> >> > idea was to
> >> >> >investigate adding a prctl() to allow setgroups() to be called in 
> >> >> > a user
> >> >> >namespace at the cost of restricting paths to the most restrictive
> >> >> >permission. So if something is 0707 it needs to be treated as if 
> >> >> > it's 
> >> >> >even though the caller is not in its owning group which is used 
> >> >> > for negative
> >> >> >access control (how these new semantics will interact with ACLs 
> >> >> > will also
> >> >> >need to be looked into).
> >> >> 
> >> >> I should probably think this through more, but for this problem, would 
> >> >> it
> >> >> not suffice to add a new prevgroups grouplist to the struct cred, maybe
> >> >> struct group_info *locked_groups, and every time an unprivileged task 
> >> >> creates
> >> >> a new user namespace, add all its current groups to this list?
> >> >
> >> > So, effectively, you would be allowed to drop permissions, but
> >> > locked_groups would still be checked for restrictions?
> >> >
> >> > That seems like it'd introduce a new level of complexity (a new facet of
> >> > permission) to manage. Not opposed, but it does seem more complex than
> >> > just opting out of using groups for negative permissions.
> >> 
> >> I have played with something similar in the past.  At that time I've
> >> discussed it only privately with Eric and we agreed it wasn't worth the
> >> extra complexity:
> >> 
> >> https://github.com/giuseppe/linux/commit/7e0701b389c497472d11fab8570c153a414050af
> >
> > Hi, you linked the setgroups patch, but do you also have a link to the
> > attempt which you deemed was not worth it?
> 
> it was just part of a private discussion; but was 4 years ago so we can
> probably revisit and accept the additional complexity since setgroups()
> is still an issue with user namespaces.
> 
> 
> >> instead of a prctl, I've added a new mode to /proc/PID/setgroups that
> >> allows setgroups in a userns locking the current gids.
> >> 
> >> What do you think about using /proc/PID/setgroups instead of a new
> >> prctl()?
> >
> > It's better than not having it, but two concerns -
> >
> > 1. some userspace, especially testsuites, could become confused by the fact
> > that they can't drop groups no matter how hard they try, since these will 
> > all
> > still show up as regular groups.
> 
> I forgot to send a link to a second patch :-) that completes the feature:
> https://github.com/giuseppe/linux/commit/1c5fe726346b216293a527719e64f34e6297f0c2
> 
> When the new mode is used, the gids that are not known in the userns do
> not show up in userspace.

Ah, right - and of course those gids better not be mapped into the namespace :)

But so, this is the patch you said you agreed was not worth the extra
complexity?

> > 2. whereas in my lockgroups proposal, lock_groups would only be taken into account
> > for permission denial, this proposal would count for permission grants too.  This
> > means that if I have a group which is permitted to read /foo/topsecret, and I
> > start a program in a new user namespace expecting it to drop that permission,
> > I can't have that, right?  The new program will always have that permission?
> 
> right.  The new mode I was working on cannot be used to drop granted 
> permissions.
> 
> Giuseppe


Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces

2020-10-15 Thread Serge E. Hallyn
On Wed, Oct 14, 2020 at 02:46:46PM -0500, Eric W. Biederman wrote:
> "Serge E. Hallyn"  writes:
> 
> > On Mon, Oct 12, 2020 at 12:01:09AM -0500, Eric W. Biederman wrote:
> >> Andy Lutomirski  writes:
> >> 
> >> > On Sun, Oct 11, 2020 at 1:53 PM Josh Triplett  
> >> > wrote:
> >> >>
> >> >> On Fri, Oct 09, 2020 at 11:26:06PM -0500, Serge E. Hallyn wrote:
> >> >> > > 3. Find a way to allow setgroups() in a user namespace while keeping
> >> >> > >    in mind the case of groups used for negative access control.
> >> >> > >    This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to
> >> >> > >    investigate adding a prctl() to allow setgroups() to be called in a user
> >> >> > >    namespace at the cost of restricting paths to the most restrictive
> >> >> > >    permission. So if something is 0707 it needs to be treated as if it's
> >> >> > >    even though the caller is not in its owning group which is used for negative
> >> >> > >    access control (how these new semantics will interact with ACLs will also
> >> >> > >    need to be looked into).
> >> >> >
> >> >> > I should probably think this through more, but for this problem, would it
> >> >> > not suffice to add a new prevgroups grouplist to the struct cred, maybe
> >> >> > struct group_info *locked_groups, and every time an unprivileged task creates
> >> >> > a new user namespace, add all its current groups to this list?
> >> >>
> >> >> So, effectively, you would be allowed to drop permissions, but
> >> >> locked_groups would still be checked for restrictions?
> >> >>
> >> >> That seems like it'd introduce a new level of complexity (a new facet of
> >> >> permission) to manage. Not opposed, but it does seem more complex than
> >> >> just opting out of using groups for negative permissions.
> >
> > Yeah, it would, but I basically hoped that we could catch most of this at
> > e.g. generic_permission(), and/or we could introduce a helper which
> > automatically adds a check for permission denied from locked_groups, so
> > it shouldn't be too wide-spread.  If it does end up showing up all over
> > the place, then that's a good reason not to do this.
> >
> >> > Is there any context other than regular UNIX DAC in which groups can
> >> > act as negative permissions or is this literally just an issue for
> >> > files with a more restrictive group mode than other mode?
> >> 
> >> Just that.
> >> 
> >> The ideas kicked around in the conversation were some variant of having
> >> a sysctl that says "This system never uses groups for negative
> >> permissions".
> >> 
> >> It was also suggested that if the sysctl was set, the permission
> >> checks would be altered such that even if someone tried to set a
> >> negative permission, the more liberal permissions of other would be used
> >> instead.
> >
> > So then this would touch all the same code points which the
> > locked_groups approach would have to touch?
> 
> No, locked_groups would touch in_group_p and set_groups.  Especially what
> set_groups means in that context.  It would have to handle what happens
> when you start accumulating locked groups (because of multiple
> namespaces).  How you dedup locked groups etc.

Well, since group_info is sorted, you should be able to do a pretty
simple and quick merge of current->locked_groups and
current->group_info.  I suppose we'd have to consider a nasty user who
is allocated 100k groups, sticks them all in group_info, then unshares
twice, locking the kernel up for a while, but that user can already hurt
us.
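
For illustration, the merge could look roughly like the user-space sketch
below. It is not kernel code: plain uint32_t arrays stand in for the sorted
group_info lists, and merge_sorted_gids() is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Merge two sorted gid arrays into out[], dropping duplicates.
 * Hypothetical stand-in for folding current->group_info into
 * current->locked_groups; out[] must have room for na + nb entries.
 * Returns the number of gids written. Runs in O(na + nb).
 */
size_t merge_sorted_gids(const uint32_t *a, size_t na,
                         const uint32_t *b, size_t nb, uint32_t *out)
{
    size_t i = 0, j = 0, n = 0;

    while (i < na || j < nb) {
        uint32_t next;

        /* Take the smaller head; fall back to whichever list is left. */
        if (j >= nb || (i < na && a[i] <= b[j]))
            next = a[i++];
        else
            next = b[j++];

        /* Skip a gid that was already emitted. */
        if (n == 0 || out[n - 1] != next)
            out[n++] = next;
    }
    return n;
}
```

Because both inputs are sorted, duplicates always end up adjacent, so a
single comparison with the last emitted gid is enough to dedup.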

> I was not able to convince myself that not being able to clear out
> groups that a user has when they create a user namespace won't cause
> other problems.  Especially as user namespaces had been in use for a
> while at that point.

The locked_groups would *only* be considered for negative acls, right?
You would not *grant* any perms based on them.  It seems like exactly
what you want.  If any user is denied perms on account of it, then that
was the intent, and that's the whole reason we're having this problem.
We are discussing whether it's ok to let a new user_ns be a way to
bypass that restriction - not *looking* for a way to support bypassing
it.

I could state this as a more formal proof if you like.

> Not supporting negative groups would touch acl_permission and modify it
> like:
> 
>  static int acl_permission_check(struct inode *inode, int mask)
>  {
> [irrelevant code snipped]
>          /* Only RWX matters for group/other mode bits */
>          mask &= 7;
>  
>          /*
>           * Are the group permissions different from
>           * the other permissions in the bits we care
>           * about? Need to check group ownership if so.
>           */
>          if (mask & (mode ^ (mode >> 3))) {
> -                if (in_group_p(inode->i_gid))
> +                if (in_group_p(inode->i_gid) &&
> +                    (!sysctl_force_positive_groups ||
Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces

2020-10-14 Thread Eric W. Biederman
"Serge E. Hallyn"  writes:

> On Mon, Oct 12, 2020 at 12:01:09AM -0500, Eric W. Biederman wrote:
>> Andy Lutomirski  writes:
>> 
>> > On Sun, Oct 11, 2020 at 1:53 PM Josh Triplett  
>> > wrote:
>> >>
>> >> On Fri, Oct 09, 2020 at 11:26:06PM -0500, Serge E. Hallyn wrote:
>> >> > > 3. Find a way to allow setgroups() in a user namespace while keeping
>> >> > >    in mind the case of groups used for negative access control.
>> >> > >    This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to
>> >> > >    investigate adding a prctl() to allow setgroups() to be called in a user
>> >> > >    namespace at the cost of restricting paths to the most restrictive
>> >> > >    permission. So if something is 0707 it needs to be treated as if it's
>> >> > >    even though the caller is not in its owning group which is used for negative
>> >> > >    access control (how these new semantics will interact with ACLs will also
>> >> > >    need to be looked into).
>> >> >
>> >> > I should probably think this through more, but for this problem, would it
>> >> > not suffice to add a new prevgroups grouplist to the struct cred, maybe
>> >> > struct group_info *locked_groups, and every time an unprivileged task creates
>> >> > a new user namespace, add all its current groups to this list?
>> >>
>> >> So, effectively, you would be allowed to drop permissions, but
>> >> locked_groups would still be checked for restrictions?
>> >>
>> >> That seems like it'd introduce a new level of complexity (a new facet of
>> >> permission) to manage. Not opposed, but it does seem more complex than
>> >> just opting out of using groups for negative permissions.
>
> Yeah, it would, but I basically hoped that we could catch most of this at
> e.g. generic_permission(), and/or we could introduce a helper which
> automatically adds a check for permission denied from locked_groups, so
> it shouldn't be too wide-spread.  If it does end up showing up all over
> the place, then that's a good reason not to do this.
>
>> > Is there any context other than regular UNIX DAC in which groups can
>> > act as negative permissions or is this literally just an issue for
>> > files with a more restrictive group mode than other mode?
>> 
>> Just that.
>> 
>> The ideas kicked around in the conversation were some variant of having
>> a sysctl that says "This system never uses groups for negative
>> permissions".
>> 
>> It was also suggested that if the sysctl was set, the permission
>> checks would be altered such that even if someone tried to set a
>> negative permission, the more liberal permissions of other would be used
>> instead.
>
> So then this would touch all the same code points which the
> locked_groups approach would have to touch?

No, locked_groups would touch in_group_p and set_groups.  Especially what
set_groups means in that context.  It would have to handle what happens
when you start accumulating locked groups (because of multiple
namespaces).  How you dedup locked groups etc.

I was not able to convince myself that not being able to clear out
groups that a user has when they create a user namespace won't cause
other problems.  Especially as user namespaces had been in use for a
while at that point.

Not supporting negative groups would touch acl_permission_check() and
modify it like:

 static int acl_permission_check(struct inode *inode, int mask)
 {
[irrelevant code snipped]
         /* Only RWX matters for group/other mode bits */
         mask &= 7;
 
         /*
          * Are the group permissions different from
          * the other permissions in the bits we care
          * about? Need to check group ownership if so.
          */
         if (mask & (mode ^ (mode >> 3))) {
-                if (in_group_p(inode->i_gid))
+                if (in_group_p(inode->i_gid) &&
+                    (!sysctl_force_positive_groups ||
+                     (mask & ~(mode >> 3))))
                         mode >>= 3;
         }
 
         /* Bits in 'mode' clear that we require? */
         return (mask & ~mode) ? -EACCES : 0;
 }


I don't know that we need to do that.  But it might be a good way
of flushing out the issues.
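
To experiment with the semantics outside the kernel, here is a small
user-space model of the group/other part of the check. It is only a sketch
under assumptions: model_permission_check() and its force_positive flag are
made-up names, the owner branch is left out, and the condition is written
to match the behaviour described in this thread ("the more liberal
permissions of other would be used instead"). It is not a transcription of
the diff.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * mode:           low 9 permission bits of the file
 * mask:           requested rwx bits (only the low 3 bits are used)
 * in_group:       what in_group_p(inode->i_gid) would return
 * force_positive: models the hypothetical sysctl_force_positive_groups
 */
int model_permission_check(unsigned int mode, unsigned int mask,
                           bool in_group, bool force_positive)
{
    mask &= 7;

    /* Group bits only matter when they differ from the other bits. */
    if (mask & (mode ^ (mode >> 3))) {
        bool group_grants = !(mask & ~(mode >> 3));

        /*
         * Classic behaviour: a group member always gets the group bits,
         * even when they are stricter than "other" (negative permission).
         * Modelled sysctl behaviour: use the group bits only when they
         * grant the request; otherwise fall through to the other bits.
         */
        if (in_group && (!force_positive || group_grants))
            mode >>= 3;
    }

    /* Bits in 'mode' clear that we require? */
    return (mask & ~mode) ? -EACCES : 0;
}
```

With a 0707 file and a read request, a group member is refused in the
classic case and allowed once the flag is set, which is exactly the
behaviour change an admin would be opting into.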


>> Given that creating /etc/subgid is effectively opting out of negative
>> permissions already, having a sysctl that says that upfront feels like a
>> very clean solution.
>> 
>> Eric
>
> That feels like a cop-out to me.  If some young admin at Roxxon Corp decides
> she needs to run a container, so installs subuid package and sets that sysctl,
> how does she know whether or not some previous admin, who has since retired and
> did not keep good docs, set things up so that a negative acl is keeping nginx
> from reading some supersecret doc?
>
> Now personally I'm not a great believer in the negative acls so I think the
> above is a very unlikely scenario, but if we're going to worry about it, then
> we should worry about it :)

There is a difference between

Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces

2020-10-13 Thread Giuseppe Scrivano
"Serge E. Hallyn"  writes:

> On Mon, Oct 12, 2020 at 07:05:10PM +0200, Giuseppe Scrivano wrote:
>> Josh Triplett  writes:
>> 
>> > On Fri, Oct 09, 2020 at 11:26:06PM -0500, Serge E. Hallyn wrote:
>> >> > 3. Find a way to allow setgroups() in a user namespace while keeping
>> >> >    in mind the case of groups used for negative access control.
>> >> >    This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to
>> >> >    investigate adding a prctl() to allow setgroups() to be called in a user
>> >> >    namespace at the cost of restricting paths to the most restrictive
>> >> >    permission. So if something is 0707 it needs to be treated as if it's
>> >> >    even though the caller is not in its owning group which is used for negative
>> >> >    access control (how these new semantics will interact with ACLs will also
>> >> >    need to be looked into).
>> >> 
>> >> I should probably think this through more, but for this problem, would it
>> >> not suffice to add a new prevgroups grouplist to the struct cred, maybe
>> >> struct group_info *locked_groups, and every time an unprivileged task creates
>> >> a new user namespace, add all its current groups to this list?
>> >
>> > So, effectively, you would be allowed to drop permissions, but
>> > locked_groups would still be checked for restrictions?
>> >
>> > That seems like it'd introduce a new level of complexity (a new facet of
>> > permission) to manage. Not opposed, but it does seem more complex than
>> > just opting out of using groups for negative permissions.
>> 
>> I have played with something similar in the past.  At that time I've
>> discussed it only privately with Eric and we agreed it wasn't worth the
>> extra complexity:
>> 
>> https://github.com/giuseppe/linux/commit/7e0701b389c497472d11fab8570c153a414050af
>
> Hi, you linked the setgroups patch, but do you also have a link to the
> attempt which you deemed was not worth it?

It was just part of a private discussion, but that was 4 years ago, so we
can probably revisit it and accept the additional complexity, since
setgroups() is still an issue with user namespaces.


>> instead of a prctl, I've added a new mode to /proc/PID/setgroups that
>> allows setgroups in a userns locking the current gids.
>> 
>> What do you think about using /proc/PID/setgroups instead of a new
>> prctl()?
>
> It's better than not having it, but two concerns -
>
> 1. some userspace, especially testsuites, could become confused by the fact
> that they can't drop groups no matter how hard they try, since these will all
> still show up as regular groups.

I forgot to send a link to a second patch :-) that completes the feature:
https://github.com/giuseppe/linux/commit/1c5fe726346b216293a527719e64f34e6297f0c2

When the new mode is used, the gids that are not known in the userns do
not show up in userspace.

> 2. whereas in my lockgroups proposal, lock_groups would only be taken into account
> for permission denial, this proposal would count for permission grants too.  This
> means that if I have a group which is permitted to read /foo/topsecret, and I
> start a program in a new user namespace expecting it to drop that permission,
> I can't have that, right?  The new program will always have that permission?

right.  The new mode I was working on cannot be used to drop granted 
permissions.

Giuseppe



Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces

2020-10-13 Thread Serge E. Hallyn
On Mon, Oct 12, 2020 at 07:05:10PM +0200, Giuseppe Scrivano wrote:
> Josh Triplett  writes:
> 
> > On Fri, Oct 09, 2020 at 11:26:06PM -0500, Serge E. Hallyn wrote:
> >> > 3. Find a way to allow setgroups() in a user namespace while keeping
> >> >    in mind the case of groups used for negative access control.
> >> >    This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to
> >> >    investigate adding a prctl() to allow setgroups() to be called in a user
> >> >    namespace at the cost of restricting paths to the most restrictive
> >> >    permission. So if something is 0707 it needs to be treated as if it's
> >> >    even though the caller is not in its owning group which is used for negative
> >> >    access control (how these new semantics will interact with ACLs will also
> >> >    need to be looked into).
> >> 
> >> I should probably think this through more, but for this problem, would it
> >> not suffice to add a new prevgroups grouplist to the struct cred, maybe
> >> struct group_info *locked_groups, and every time an unprivileged task creates
> >> a new user namespace, add all its current groups to this list?
> >
> > So, effectively, you would be allowed to drop permissions, but
> > locked_groups would still be checked for restrictions?
> >
> > That seems like it'd introduce a new level of complexity (a new facet of
> > permission) to manage. Not opposed, but it does seem more complex than
> > just opting out of using groups for negative permissions.
> 
> I have played with something similar in the past.  At that time I've
> discussed it only privately with Eric and we agreed it wasn't worth the
> extra complexity:
> 
> https://github.com/giuseppe/linux/commit/7e0701b389c497472d11fab8570c153a414050af

Hi, you linked the setgroups patch, but do you also have a link to the
attempt which you deemed was not worth it?

> instead of a prctl, I've added a new mode to /proc/PID/setgroups that
> allows setgroups in a userns locking the current gids.
> 
> What do you think about using /proc/PID/setgroups instead of a new
> prctl()?

It's better than not having it, but two concerns -

1. some userspace, especially testsuites, could become confused by the fact
that they can't drop groups no matter how hard they try, since these will all
still show up as regular groups.
2. whereas in my lockgroups proposal, lock_groups would only be taken into account
for permission denial, this proposal would count for permission grants too.  This
means that if I have a group which is permitted to read /foo/topsecret, and I
start a program in a new user namespace expecting it to drop that permission,
I can't have that, right?  The new program will always have that permission?


Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces

2020-10-12 Thread Giuseppe Scrivano
Josh Triplett  writes:

> On Fri, Oct 09, 2020 at 11:26:06PM -0500, Serge E. Hallyn wrote:
>> > 3. Find a way to allow setgroups() in a user namespace while keeping
>> >    in mind the case of groups used for negative access control.
>> >    This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to
>> >    investigate adding a prctl() to allow setgroups() to be called in a user
>> >    namespace at the cost of restricting paths to the most restrictive
>> >    permission. So if something is 0707 it needs to be treated as if it's
>> >    even though the caller is not in its owning group which is used for negative
>> >    access control (how these new semantics will interact with ACLs will also
>> >    need to be looked into).
>> 
>> I should probably think this through more, but for this problem, would it
>> not suffice to add a new prevgroups grouplist to the struct cred, maybe
>> struct group_info *locked_groups, and every time an unprivileged task creates
>> a new user namespace, add all its current groups to this list?
>
> So, effectively, you would be allowed to drop permissions, but
> locked_groups would still be checked for restrictions?
>
> That seems like it'd introduce a new level of complexity (a new facet of
> permission) to manage. Not opposed, but it does seem more complex than
> just opting out of using groups for negative permissions.

I have played with something similar in the past.  At that time I
discussed it only privately with Eric and we agreed it wasn't worth the
extra complexity:

https://github.com/giuseppe/linux/commit/7e0701b389c497472d11fab8570c153a414050af

Instead of a prctl(), I've added a new mode to /proc/PID/setgroups that
allows setgroups() in a userns while locking the current gids.

What do you think about using /proc/PID/setgroups instead of a new
prctl()?

Giuseppe



Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces

2020-10-12 Thread Serge E. Hallyn
On Mon, Oct 12, 2020 at 12:01:09AM -0500, Eric W. Biederman wrote:
> Andy Lutomirski  writes:
> 
> > On Sun, Oct 11, 2020 at 1:53 PM Josh Triplett  wrote:
> >>
> >> On Fri, Oct 09, 2020 at 11:26:06PM -0500, Serge E. Hallyn wrote:
> >> > > 3. Find a way to allow setgroups() in a user namespace while keeping
> >> > >    in mind the case of groups used for negative access control.
> >> > >    This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to
> >> > >    investigate adding a prctl() to allow setgroups() to be called in a user
> >> > >    namespace at the cost of restricting paths to the most restrictive
> >> > >    permission. So if something is 0707 it needs to be treated as if it's
> >> > >    even though the caller is not in its owning group which is used for negative
> >> > >    access control (how these new semantics will interact with ACLs will also
> >> > >    need to be looked into).
> >> >
> >> > I should probably think this through more, but for this problem, would it
> >> > not suffice to add a new prevgroups grouplist to the struct cred, maybe
> >> > struct group_info *locked_groups, and every time an unprivileged task creates
> >> > a new user namespace, add all its current groups to this list?
> >>
> >> So, effectively, you would be allowed to drop permissions, but
> >> locked_groups would still be checked for restrictions?
> >>
> >> That seems like it'd introduce a new level of complexity (a new facet of
> >> permission) to manage. Not opposed, but it does seem more complex than
> >> just opting out of using groups for negative permissions.

Yeah, it would, but I basically hoped that we could catch most of this at
e.g. generic_permission(), and/or we could introduce a helper which
automatically adds a check for permission denied from locked_groups, so
it shouldn't be too wide-spread.  If it does end up showing up all over
the place, then that's a good reason not to do this.

> > Is there any context other than regular UNIX DAC in which groups can
> > act as negative permissions or is this literally just an issue for
> > files with a more restrictive group mode than other mode?
> 
> Just that.
> 
> The ideas kicked around in the conversation were some variant of having
> a sysctl that says "This system never uses groups for negative
> permissions".
> 
> It was also suggested that if the sysctl was set, the permission
> checks would be altered such that even if someone tried to set a
> negative permission, the more liberal permissions of other would be used
> instead.

So then this would touch all the same code points which the
locked_groups approach would have to touch?

> Given that creating /etc/subgid is effectively opting out of negative
> permissions already, having a sysctl that says that upfront feels like a
> very clean solution.
> 
> Eric

That feels like a cop-out to me.  If some young admin at Roxxon Corp decides
she needs to run a container, so installs subuid package and sets that sysctl,
how does she know whether or not some previous admin, who has since retired and
did not keep good docs, set things up so that a negative acl is keeping nginx
from reading some supersecret doc?

Now personally I'm not a great believer in the negative acls so I think the
above is a very unlikely scenario, but if we're going to worry about it, then
we should worry about it :)

"Click this button if no one has ever used feature X on this server"

-serge


Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces

2020-10-11 Thread Eric W. Biederman
Andy Lutomirski  writes:

> On Sun, Oct 11, 2020 at 1:53 PM Josh Triplett  wrote:
>>
>> On Fri, Oct 09, 2020 at 11:26:06PM -0500, Serge E. Hallyn wrote:
>> > > 3. Find a way to allow setgroups() in a user namespace while keeping
>> > >    in mind the case of groups used for negative access control.
>> > >    This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to
>> > >    investigate adding a prctl() to allow setgroups() to be called in a user
>> > >    namespace at the cost of restricting paths to the most restrictive
>> > >    permission. So if something is 0707 it needs to be treated as if it's
>> > >    even though the caller is not in its owning group which is used for negative
>> > >    access control (how these new semantics will interact with ACLs will also
>> > >    need to be looked into).
>> >
>> > I should probably think this through more, but for this problem, would it
>> > not suffice to add a new prevgroups grouplist to the struct cred, maybe
>> > struct group_info *locked_groups, and every time an unprivileged task creates
>> > a new user namespace, add all its current groups to this list?
>>
>> So, effectively, you would be allowed to drop permissions, but
>> locked_groups would still be checked for restrictions?
>>
>> That seems like it'd introduce a new level of complexity (a new facet of
>> permission) to manage. Not opposed, but it does seem more complex than
>> just opting out of using groups for negative permissions.
>
> Is there any context other than regular UNIX DAC in which groups can
> act as negative permissions or is this literally just an issue for
> files with a more restrictive group mode than other mode?

Just that.

The ideas kicked around in the conversation were some variant of having
a sysctl that says "This system never uses groups for negative
permissions".

It was also suggested that if the sysctl was set, the permission
checks would be altered such that even if someone tried to set a
negative permission, the more liberal permissions of other would be used
instead.

Given that creating /etc/subgid is effectively opting out of negative
permissions already, having a sysctl that says that upfront feels like a
very clean solution.

Eric


Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces

2020-10-11 Thread Andy Lutomirski
On Sun, Oct 11, 2020 at 1:53 PM Josh Triplett  wrote:
>
> On Fri, Oct 09, 2020 at 11:26:06PM -0500, Serge E. Hallyn wrote:
> > > 3. Find a way to allow setgroups() in a user namespace while keeping
> > >    in mind the case of groups used for negative access control.
> > >    This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to
> > >    investigate adding a prctl() to allow setgroups() to be called in a user
> > >    namespace at the cost of restricting paths to the most restrictive
> > >    permission. So if something is 0707 it needs to be treated as if it's
> > >    even though the caller is not in its owning group which is used for negative
> > >    access control (how these new semantics will interact with ACLs will also
> > >    need to be looked into).
> >
> > I should probably think this through more, but for this problem, would it
> > not suffice to add a new prevgroups grouplist to the struct cred, maybe
> > struct group_info *locked_groups, and every time an unprivileged task creates
> > a new user namespace, add all its current groups to this list?
>
> So, effectively, you would be allowed to drop permissions, but
> locked_groups would still be checked for restrictions?
>
> That seems like it'd introduce a new level of complexity (a new facet of
> permission) to manage. Not opposed, but it does seem more complex than
> just opting out of using groups for negative permissions.

Is there any context other than regular UNIX DAC in which groups can
act as negative permissions or is this literally just an issue for
files with a more restrictive group mode than other mode?


Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces

2020-10-11 Thread Josh Triplett
On Fri, Oct 09, 2020 at 11:26:06PM -0500, Serge E. Hallyn wrote:
> > 3. Find a way to allow setgroups() in a user namespace while keeping
> >    in mind the case of groups used for negative access control.
> >    This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to
> >    investigate adding a prctl() to allow setgroups() to be called in a user
> >    namespace at the cost of restricting paths to the most restrictive
> >    permission. So if something is 0707 it needs to be treated as if it's
> >    even though the caller is not in its owning group which is used for negative
> >    access control (how these new semantics will interact with ACLs will also
> >    need to be looked into).
> 
> I should probably think this through more, but for this problem, would it
> not suffice to add a new prevgroups grouplist to the struct cred, maybe
> struct group_info *locked_groups, and every time an unprivileged task creates
> a new user namespace, add all its current groups to this list?

So, effectively, you would be allowed to drop permissions, but
locked_groups would still be checked for restrictions?

That seems like it'd introduce a new level of complexity (a new facet of
permission) to manage. Not opposed, but it does seem more complex than
just opting out of using groups for negative permissions.


Re: LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces

2020-10-09 Thread Serge E. Hallyn
> 3. Find a way to allow setgroups() in a user namespace while keeping
>    in mind the case of groups used for negative access control.
>    This was suggested by Josh Triplett and Geoffrey Thomas. Their idea was to
>    investigate adding a prctl() to allow setgroups() to be called in a user
>    namespace at the cost of restricting paths to the most restrictive
>    permission. So if something is 0707 it needs to be treated as if it's
>    even though the caller is not in its owning group which is used for negative
>    access control (how these new semantics will interact with ACLs will also
>    need to be looked into).

I should probably think this through more, but for this problem, would it
not suffice to add a new prevgroups grouplist to the struct cred, maybe
struct group_info *locked_groups, and every time an unprivileged task creates
a new user namespace, add all its current groups to this list?
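
To make the idea concrete, here is a rough user-space model of how such a
check might behave. Everything in it is an assumption for illustration:
struct model_cred is a stripped-down stand-in for struct cred, and
model_locked_check() is a hypothetical helper covering only the group/other
mode bits (no owner, no ACLs).

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified credential: two gid lists instead of struct cred. */
struct model_cred {
    const uint32_t *groups;        /* current supplementary groups */
    size_t ngroups;
    const uint32_t *locked_groups; /* groups held when the userns was created */
    size_t nlocked;
};

static bool gid_in(const uint32_t *list, size_t n, uint32_t gid)
{
    for (size_t i = 0; i < n; i++)
        if (list[i] == gid)
            return true;
    return false;
}

/*
 * A current group applies the group bits as usual. A locked group is
 * consulted only to deny: if the group bits lack a requested bit, access
 * is refused; otherwise the other bits decide, so it never grants.
 */
int model_locked_check(const struct model_cred *c, uint32_t file_gid,
                       unsigned int mode, unsigned int mask)
{
    unsigned int group_bits = (mode >> 3) & 7;
    unsigned int other_bits = mode & 7;

    mask &= 7;
    if (gid_in(c->groups, c->ngroups, file_gid))
        return (mask & ~group_bits) ? -EACCES : 0;

    if (gid_in(c->locked_groups, c->nlocked, file_gid) &&
        (mask & ~group_bits))
        return -EACCES; /* negative permission survives setgroups() */

    return (mask & ~other_bits) ? -EACCES : 0;
}
```

In this model a task that dropped gid 100 via setgroups() is still refused
on a 0707 file owned by group 100, but gains nothing on a 0770 file.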


LPC 2020 Hackroom Session: summary and next steps for isolated user namespaces

2020-08-30 Thread Christian Brauner
Hello everyone,

## Preliminaries

This is the summary of the Hackroom session Stéphane and I led as a follow-up
to our presentations in the Containers & Checkpoint/Restore micro-conference at
Linux Plumbers 2020.

Please make sure to see the Action Items section below as it outlines the next
concrete steps that came up during the meeting and who seemed interested in
tackling them.

The background for this summary is:

1. Stéphane's and my talk "Isolated Dynamic User Namespaces"
   People interested in the full session can watch it on YouTube:
   https://youtu.be/fSyr_IXM21Y?t=8856

2. The Hackroom session on Wednesday, 25.08.2020 at 17:00 UTC
   This session has been recorded as well. It is not yet on YouTube because
   Hackroom sessions weren't streamed. However, I plan on cutting that video
   and putting it up on YouTube as well just so there's no chance of
   miscommunication.

All people that attended session 1. were asked to send me an e-mail if they
wanted to attend session 2. to hash out details. The following people requested
to attend session 2. and were informed either through the e-mail I sent out
or IRC:

Aleksa Sarai
Alexander Mihalicyn
Andy Lutomirski
Christian Brauner
Eric W. Biederman
Geoffrey Thomas
Giuseppe Scrivano
Joseph Christopher Sible
Josh Triplett
Kees Cook
Mickaël Salaün
Mrunal Patel
Pavel Tikhomirov
Sargun Dhillon
Serge Hallyn
Stephane Graber
Vivek Goyal
Wat Lim

All of them should be Cced here. In case I forgot someone don't hesitate to
forward this mail to them!

## Summary

During the Containers & Checkpoint/Restore micro-conference and in the hackroom
session Stéphane Graber and I proposed a way to make using user namespaces
simpler and more isolated. The following current problems were identified:

P1. Isolated id mappings can only be guaranteed to be locally isolated.
A container runtime/daemon can only guarantee non-overlapping id mappings
when no other users on the system create containers.

P2. Enforcing isolated id mappings in userspace is difficult.
It is always possible to create other processes with overlapping id
mappings. Coordinating id mappings in userspace will always remain
optional. Quite a few tools nowadays (including systemd) don't care about
/etc/sub{g,u}id and actively advise against using it. This is made even
more problematic since sub{g,u}id delegation is done per-user rather than
per-container-runtime.

P3. The range of the id mapping of a container can't be predetermined.
While POSIX mandates that a standard system should use a range of 65536 ids,
reality is very different. Some programs allocate high ids for random
processes or for network authentication. This means that in practice it is
often necessary to assign a range of up to 10 million ids to a container.
This limits a system to fewer than 500 containers total.

P4. Isolated id mappings severely restrict the number of containers that can be
run on a system.
This ties back to the point about pre-determining the id range of a
container and how large range allocations tend to be on real systems. That
becomes even more relevant when nesting containers.

P5. Container runtimes cannot reuse overlayfs lower directories if each
container uses isolated ID mappings. This leads to either needless storage
overhead (LXD -- though the LXD folks don’t really mind), completely
ignoring the benefits of isolating containers from each other (Docker), or
not using user namespaces at all (Kubernetes). (This is a more general issue
but bears repeating since it is closely tied to most userns proposals.)

P6. Rlimits pose a problem for containers that share the same id mapping.
Containers with overlapping id mappings can DoS each other by
exhausting their rlimits. The reason for this lies with the current
implementation of rlimits -- rlimits are currently tied to users and are
not hierarchically limited like inotify limits are. This is a severe
problem in unprivileged workloads. Eric and others identified that this
issue can be fixed independently of the isolated user namespace proposal.
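
The effect described in P6 can be shown with a toy model. This is only a
sketch of per-user accounting, not kernel code: the class, the limit of 100 and
the mapping base are all invented for the example.

```python
from collections import Counter


class PerUidLimit:
    """Toy model: one counter per host uid, shared by everyone mapped to it."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.in_use = Counter()

    def charge(self, host_uid: int) -> bool:
        """Account one more resource to host_uid; False if over the limit."""
        if self.in_use[host_uid] >= self.limit:
            return False
        self.in_use[host_uid] += 1
        return True


def to_host(mapping_base: int, container_uid: int) -> int:
    """Translate a container uid to a host uid for a simple linear mapping."""
    return mapping_base + container_uid


nproc = PerUidLimit(limit=100)
SHARED_BASE = 100000  # hypothetical: both containers use the same mapping

# Container A, running as uid 1000, tries to create 150 processes.
a_created = sum(nproc.charge(to_host(SHARED_BASE, 1000)) for _ in range(150))

# Container B, also uid 1000, now cannot create a single one.
b_ok = nproc.charge(to_host(SHARED_BASE, 1000))

print(a_created, b_ok)  # 100 False
```

With distinct mapping bases the two containers would be charged against
different host uids and could not starve each other this way.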

In response to these and other issues, we made the following proposal, which
had already been floated in less clear form in informal discussions at Linux
Plumbers 2019 in Lisbon:

## Proposal

Introduce an in-kernel concept of an isolated user namespace by switching the
id types in the kernel from 32 to 64 bits. Userspace will only get to see the
lower 32 bits as usual. The upper 32 bits are used for a unique, in-kernel user
namespace token. The owner of such a namespace will be the effective id of
the creator of that namespace or, optionally, an explicitly set owning id
(when created by a privileged user).
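
The split between the userspace-visible half and the namespace token can be
sketched as follows. This only illustrates the bit layout described above; the
function names are invented, and how tokens would actually be allocated is not
specified here.

```python
ID_BITS = 32
ID_MASK = (1 << ID_BITS) - 1


def make_kernel_id(ns_token: int, visible_id: int) -> int:
    """Combine a namespace token (upper 32 bits) with an id (lower 32 bits)."""
    if not 0 <= ns_token <= ID_MASK:
        raise ValueError("namespace token must fit in 32 bits")
    if not 0 <= visible_id <= ID_MASK:
        raise ValueError("id must fit in 32 bits")
    return (ns_token << ID_BITS) | visible_id


def userspace_view(kernel_id: int) -> int:
    """What userspace gets to see: the lower 32 bits only."""
    return kernel_id & ID_MASK


def namespace_token(kernel_id: int) -> int:
    """The in-kernel token stored in the upper 32 bits."""
    return kernel_id >> ID_BITS


# Two isolated namespaces (hypothetical tokens 1 and 2), each with its own root.
root_a = make_kernel_id(1, 0)
root_b = make_kernel_id(2, 0)

print(userspace_view(root_a), userspace_view(root_b))  # 0 0
print(root_a == root_b)                                # False
```

Both namespaces see uid 0, yet the full 64-bit values never collide, so no
userspace coordination of ranges is needed.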

The following advantages were identified by various people during the session:

S1. An isolated user namespace has access to the full 32 bit id range.
This makes it compatible with every Linux workload and