Re: The problem of setgroups and containers

2020-10-18 Thread Eric W. Biederman
ebied...@xmission.com (Eric W. Biederman) writes:

> [ Added linux-api because we are talking about a subtle semantic
>   change to the permission checks ]
>
> Christian Brauner  writes:
>
>> On Sat, Oct 17, 2020 at 11:51:22AM -0500, Eric W. Biederman wrote:
>>> "Enrico Weigelt, metux IT consult"  writes:
>>> 
>>> > On 30.08.20 16:39, Christian Brauner wrote:
>>> >> For mount points
>>> >>that originate from outside the namespace, everything will show as
>>> >>the overflow ids and access would be restricted to the most
>>> >>restricted permission bit for any path that can be accessed.
>>> >
>>> > So, I can't just take a btrfs snapshot as rootfs anymore ?
>>> 
>>> Interesting until reading through your commentary I had missed the
>>> proposal to effectively effectively change the permissions to:
>>> ((mode >> 3) & (mode >> 6) & mode & 7).
>>> 
>>> The challenge is that in a permission triple it is possible to set
>>> lower permissions for the owner of the file, or for a specific group,
>>> than for everyone else.
>>> 
>>> Today we require root permissions to be able to map users and groups in
>>> /proc//uid_map and /proc//gid_map, and we require root
>>> permissions to be able to drop groups with setgroups.
>>> 
>>> Now we are discussiong moving to a world where we can use users and
>>> groups that don't map to any other user namespace in uid_map and
>>> gid_map.  It should be completely safe to use those users and groups
>>> except for negative permissions in filesystems.  So a big question is
>>> how do we arrange the system so anyone can use those files without
>>> negative permission causing problems.
>>> 
>>> 
>>> I believe it is safe to not limit the owner of a file, as the
>>> owner of a file can always chmode the file and remove any restrictions.
>>> Which is no worse than calling setuid to a different uid.
>>> 
>>> Which leaves where we have been dealing with the ability to drop groups
>>> with setgroups.
>>> 
>>> I guess the practical proposal is when the !in_group_p and we are
>>> looking at the other permission.  Treat the permissions as:
>>> ((mode >> 3) & mode & 7).  Instead of just (mode & 7).
>>> 
>>> Which for systems who don't use negative group permissions is a no-op.
>>> So this should not effect your btrfs snapshots at all (unless you use
>>> negative group permissions).
>>> 
>>> It denies things before we get to an NFS server or other interesting
>>> case so it should work for pretty much everything the kernel deals with.
>>> 
>>> Userspace repeating permission checks could break.  But that is just a
>>> problem of inconsistency, and will always be a problem.
>>> 
>>> We could make it more precise as Serge was suggesting with a set of that
>>> were dropped from setgroups, but under the assumption that negative
>>> groups are sufficient rare we can avoid that overhead.
>>
>> I'm tempted to agree and say that it's safe to assume that they are used
>> very much. Negative acls have been brought up a couple of times in
>> related contexts though. One being a potential bug in newgidmap which we
>> discussed back in
>> https://bugs.launchpad.net/ubuntu/+source/shadow/+bug/1729357
>> But I think if we have this under a sysctl as proposed earlier is good
>> enough.
>>
>>> 
>>>  static int acl_permission_check(struct inode *inode, int mask)
>>>  {
>>> unsigned int mode = inode->i_mode;
>>>  
>>> - [irrelevant bits of this function]
>>>  
>>> /* Only RWX matters for group/other mode bits */
>>> mask &= 7;
>>>  
>>> /*
>>>  * Are the group permissions different from
>>>  * the other permissions in the bits we care
>>>  * about? Need to check group ownership if so.
>>>  */
>>> if (mask & (mode ^ (mode >> 3))) {
>>> if (in_group_p(inode->i_gid))
>>> mode >>= 3;
>>> +   /* Use the most restrictive permissions? */
>>> +   else (current->user_ns->flags & USERNS_ALWAYS_DENY_GROUPS)
>>> +   mode &= (mode >> 3);
>>> }
>>>  
>>> /* Bits in 'mode' clear that we require? */
>>> return (mask & ~mode) ? -EACCES : 0;
>>>  }
>>> 
>>> As I read posix_acl_permission all of the posix acls for groups are
>>> positive permissions.  So I think the only other code that would need to
>>> be updated would be the filesystems that replace generic_permission with
>>> something that doesn't call acl_permission check.
>>> 
>>> Userspace could then activate this mode with:
>>> echo "safely_allow" > /proc//setgroups
>>> 
>>> That looks very elegant and simple, and I don't think will cause
>>> problems for anyone.  It might even make sense to make that the default
>>> mode when creating a new user namespace.
>>> 
>>> I guess we owe this idea to Josh Triplett and Geoffrey Thomas.
>>> 
>>> Does anyone see any problems with tweaking the permissions this way so
>>> that we can always allow setgroups in a user namespace?
>>
>> This looks sane and simple. I would still think that making it opt-in
>> 

The problem of setgroups and containers

2020-10-18 Thread Eric W. Biederman
[ Added linux-api because we are talking about a subtle semantic
  change to the permission checks ]

Christian Brauner  writes:

> On Sat, Oct 17, 2020 at 11:51:22AM -0500, Eric W. Biederman wrote:
>> "Enrico Weigelt, metux IT consult"  writes:
>> 
>> > On 30.08.20 16:39, Christian Brauner wrote:
>> >> For mount points
>> >>that originate from outside the namespace, everything will show as
>> >>the overflow ids and access would be restricted to the most
>> >>restricted permission bit for any path that can be accessed.
>> >
>> > So, I can't just take a btrfs snapshot as rootfs anymore ?
>> 
>> Interesting until reading through your commentary I had missed the
>> proposal to effectively effectively change the permissions to:
>> ((mode >> 3) & (mode >> 6) & mode & 7).
>> 
>> The challenge is that in a permission triple it is possible to set
>> lower permissions for the owner of the file, or for a specific group,
>> than for everyone else.
>> 
>> Today we require root permissions to be able to map users and groups in
>> /proc//uid_map and /proc//gid_map, and we require root
>> permissions to be able to drop groups with setgroups.
>> 
>> Now we are discussiong moving to a world where we can use users and
>> groups that don't map to any other user namespace in uid_map and
>> gid_map.  It should be completely safe to use those users and groups
>> except for negative permissions in filesystems.  So a big question is
>> how do we arrange the system so anyone can use those files without
>> negative permission causing problems.
>> 
>> 
>> I believe it is safe to not limit the owner of a file, as the
>> owner of a file can always chmode the file and remove any restrictions.
>> Which is no worse than calling setuid to a different uid.
>> 
>> Which leaves where we have been dealing with the ability to drop groups
>> with setgroups.
>> 
>> I guess the practical proposal is when the !in_group_p and we are
>> looking at the other permission.  Treat the permissions as:
>> ((mode >> 3) & mode & 7).  Instead of just (mode & 7).
>> 
>> Which for systems who don't use negative group permissions is a no-op.
>> So this should not effect your btrfs snapshots at all (unless you use
>> negative group permissions).
>> 
>> It denies things before we get to an NFS server or other interesting
>> case so it should work for pretty much everything the kernel deals with.
>> 
>> Userspace repeating permission checks could break.  But that is just a
>> problem of inconsistency, and will always be a problem.
>> 
>> We could make it more precise as Serge was suggesting with a set of that
>> were dropped from setgroups, but under the assumption that negative
>> groups are sufficient rare we can avoid that overhead.
>
> I'm tempted to agree and say that it's safe to assume that they are used
> very much. Negative acls have been brought up a couple of times in
> related contexts though. One being a potential bug in newgidmap which we
> discussed back in
> https://bugs.launchpad.net/ubuntu/+source/shadow/+bug/1729357
> But I think if we have this under a sysctl as proposed earlier is good
> enough.
>
>> 
>>  static int acl_permission_check(struct inode *inode, int mask)
>>  {
>>  unsigned int mode = inode->i_mode;
>>  
>> - [irrelevant bits of this function]
>>  
>>  /* Only RWX matters for group/other mode bits */
>>  mask &= 7;
>>  
>>  /*
>>   * Are the group permissions different from
>>   * the other permissions in the bits we care
>>   * about? Need to check group ownership if so.
>>   */
>>  if (mask & (mode ^ (mode >> 3))) {
>>  if (in_group_p(inode->i_gid))
>>  mode >>= 3;
>> +/* Use the most restrictive permissions? */
>> +else (current->user_ns->flags & USERNS_ALWAYS_DENY_GROUPS)
>> +mode &= (mode >> 3);
>>  }
>>  
>>  /* Bits in 'mode' clear that we require? */
>>  return (mask & ~mode) ? -EACCES : 0;
>>  }
>> 
>> As I read posix_acl_permission all of the posix acls for groups are
>> positive permissions.  So I think the only other code that would need to
>> be updated would be the filesystems that replace generic_permission with
>> something that doesn't call acl_permission check.
>> 
>> Userspace could then activate this mode with:
>>  echo "safely_allow" > /proc//setgroups
>> 
>> That looks very elegant and simple, and I don't think will cause
>> problems for anyone.  It might even make sense to make that the default
>> mode when creating a new user namespace.
>> 
>> I guess we owe this idea to Josh Triplett and Geoffrey Thomas.
>> 
>> Does anyone see any problems with tweaking the permissions this way so
>> that we can always allow setgroups in a user namespace?
>
> This looks sane and simple. I would still think that making it opt-in
> for a few kernel releases might be preferable to just making it the new
> default. We can then revisit flipping the default. Advanced enough
>