Re: [PATCH v24 07/12] landlock: Support filesystem access-control

2020-11-23 Thread Jann Horn
On Mon, Nov 23, 2020 at 10:16 PM Mickaël Salaün  wrote:
> On 23/11/2020 20:44, Jann Horn wrote:
> > On Sat, Nov 21, 2020 at 11:06 AM Mickaël Salaün  wrote:
> >> On 21/11/2020 08:00, Jann Horn wrote:
> >>> On Thu, Nov 12, 2020 at 9:52 PM Mickaël Salaün  wrote:
> >>>> Thanks to the Landlock objects and ruleset, it is possible to identify
> >>>> inodes according to a process's domain.  To enable an unprivileged
> >>>> process to express a file hierarchy, it first needs to open a directory
> >>>> (or a file) and pass this file descriptor to the kernel through
> >>>> landlock_add_rule(2).  When checking if a file access request is
> >>>> allowed, we walk from the requested dentry to the real root, following
> >>>> the different mount layers.  The access to each "tagged" inodes are
> >>>> collected according to their rule layer level, and ANDed to create
> >>>> access to the requested file hierarchy.  This makes possible to identify
> >>>> a lot of files without tagging every inodes nor modifying the
> >>>> filesystem, while still following the view and understanding the user
> >>>> has from the filesystem.
> >>>>
> >>>> Add a new ARCH_EPHEMERAL_INODES for UML because it currently does not
> >>>> keep the same struct inodes for the same inodes whereas these inodes are
> >>>> in use.
> >>>>
> >>>> This commit adds a minimal set of supported filesystem access-control
> >>>> which doesn't enable to restrict all file-related actions.  This is the
> >>>> result of multiple discussions to minimize the code of Landlock to ease
> >>>> review.  Thanks to the Landlock design, extending this access-control
> >>>> without breaking user space will not be a problem.  Moreover, seccomp
> >>>> filters can be used to restrict the use of syscall families which may
> >>>> not be currently handled by Landlock.
> >>>>
> >>>> Cc: Al Viro
> >>>> Cc: Anton Ivanov
> >>>> Cc: James Morris
> >>>> Cc: Jann Horn
> >>>> Cc: Jeff Dike
> >>>> Cc: Kees Cook
> >>>> Cc: Richard Weinberger
> >>>> Cc: Serge E. Hallyn
> >>>> Signed-off-by: Mickaël Salaün
> >>>> ---
> >>>>
> >>>> Changes since v23:
> >>>> * Enforce deterministic interleaved path rules.  To have consistent
> >>>>   layered rules, granting access to a path implies that all accesses
> >>>>   tied to inodes, from the requested file to the real root, must be
> >>>>   checked.  Otherwise, stacked rules may result to overzealous
> >>>>   restrictions.  By excluding the ability to add exceptions in the same
> >>>>   layer (e.g. /a allowed, /a/b denied, and /a/b/c allowed), we get
> >>>>   deterministic interleaved path rules.  This removes an optimization
> >>>
> >>> I don't understand the "deterministic interleaved path rules" part.
> >>
> >> I explain bellow.
> >>
> >>>
> >>>
> >>> What if I have a policy like this?
> >>>
> >>> /home/user READ
> >>> /home/user/Downloads READ+WRITE
> >>>
> >>> That's a reasonable policy, right?
> >>
> >> Definitely, I forgot this, thanks for the outside perspective!
> >>
> >>>
> >>> If I then try to open /home/user/Downloads/foo in WRITE mode, the loop
> >>> will first check against the READ+WRITE rule for /home/user, that
> >>> check will pass, and then it will check against the READ rule for /,
> >>> which will deny the access, right? That seems bad.
> >>
> >> Yes that was the intent.
> >>
> >>>
> >>>
> >>> The v22 code ensured that for each layer, the most specific rule (the
> >>> first we encounter on the walk) always wins, right? What's the problem
> >>> with that?
> >>
> >> This can be explained with the interleaved_masked_accesses test:
> >> https://github.com/landlock-lsm/linux/blob/landlock-v24/tools/testing/selftests/landlock/fs_test.c#L647
> >>
> >> In this case there is 4 stacked layers:
> >> layer 1: allows s1d1/s1d2/s1d3/file1
> >> layer 2: allows s1d1/s1d2/s1d3
> >>  denies s1d1/s1d2
> >> layer 3: allows s1d1
> >> layer 4: allows s1d1/s1d2
> >>
> >> In the v23, access to file1 would be allowed until layer 3, but layer 4
> >> would merge a new rule for the s1d2 inode. Because we don't record where
> >> exactly the access come from, we can't tell that layer 2 allowed access
> >> thanks to s1d3 and that its s1d2 rule was ignored. I think this behavior
> >> doesn't make sense from the user point of view.
> >
> > Aah, I think I'm starting to understand the issue now. Basically, with
> > the current UAPI, the semantics have to be "an access is permitted if,
> > for each policy layer, at least one rule encountered on the pathwalk
> > permits the access; rules that deny the access are irrelevant". And if
> > it turns out that someone needs to be able to deny access to specific
> > inodes, we'll have to extend struct landlock_path_beneath_attr.
>
> Right, I'll add this to the documentation (aligned with the new
> implementation).
>
> >
> > That reminds me... if we do need to make such a change in the future,
> > it would be easier in terms of UAPI compatibility if
> > landlock_add_rule() used copy_struct_from_user(), which is 

Re: [PATCH v24 07/12] landlock: Support filesystem access-control

2020-11-23 Thread Mickaël Salaün


On 23/11/2020 20:44, Jann Horn wrote:
> On Sat, Nov 21, 2020 at 11:06 AM Mickaël Salaün  wrote:
>> On 21/11/2020 08:00, Jann Horn wrote:
>>> On Thu, Nov 12, 2020 at 9:52 PM Mickaël Salaün  wrote:
>>>> Thanks to the Landlock objects and ruleset, it is possible to identify
>>>> inodes according to a process's domain.  To enable an unprivileged
>>>> process to express a file hierarchy, it first needs to open a directory
>>>> (or a file) and pass this file descriptor to the kernel through
>>>> landlock_add_rule(2).  When checking if a file access request is
>>>> allowed, we walk from the requested dentry to the real root, following
>>>> the different mount layers.  The access to each "tagged" inodes are
>>>> collected according to their rule layer level, and ANDed to create
>>>> access to the requested file hierarchy.  This makes possible to identify
>>>> a lot of files without tagging every inodes nor modifying the
>>>> filesystem, while still following the view and understanding the user
>>>> has from the filesystem.
>>>>
>>>> Add a new ARCH_EPHEMERAL_INODES for UML because it currently does not
>>>> keep the same struct inodes for the same inodes whereas these inodes are
>>>> in use.
>>>>
>>>> This commit adds a minimal set of supported filesystem access-control
>>>> which doesn't enable to restrict all file-related actions.  This is the
>>>> result of multiple discussions to minimize the code of Landlock to ease
>>>> review.  Thanks to the Landlock design, extending this access-control
>>>> without breaking user space will not be a problem.  Moreover, seccomp
>>>> filters can be used to restrict the use of syscall families which may
>>>> not be currently handled by Landlock.
>>>>
>>>> Cc: Al Viro
>>>> Cc: Anton Ivanov
>>>> Cc: James Morris
>>>> Cc: Jann Horn
>>>> Cc: Jeff Dike
>>>> Cc: Kees Cook
>>>> Cc: Richard Weinberger
>>>> Cc: Serge E. Hallyn
>>>> Signed-off-by: Mickaël Salaün
>>>> ---
>>>>
>>>> Changes since v23:
>>>> * Enforce deterministic interleaved path rules.  To have consistent
>>>>   layered rules, granting access to a path implies that all accesses
>>>>   tied to inodes, from the requested file to the real root, must be
>>>>   checked.  Otherwise, stacked rules may result to overzealous
>>>>   restrictions.  By excluding the ability to add exceptions in the same
>>>>   layer (e.g. /a allowed, /a/b denied, and /a/b/c allowed), we get
>>>>   deterministic interleaved path rules.  This removes an optimization
>>>
>>> I don't understand the "deterministic interleaved path rules" part.
>>
>> I explain bellow.
>>
>>>
>>>
>>> What if I have a policy like this?
>>>
>>> /home/user READ
>>> /home/user/Downloads READ+WRITE
>>>
>>> That's a reasonable policy, right?
>>
>> Definitely, I forgot this, thanks for the outside perspective!
>>
>>>
>>> If I then try to open /home/user/Downloads/foo in WRITE mode, the loop
>>> will first check against the READ+WRITE rule for /home/user, that
>>> check will pass, and then it will check against the READ rule for /,
>>> which will deny the access, right? That seems bad.
>>
>> Yes that was the intent.
>>
>>>
>>>
>>> The v22 code ensured that for each layer, the most specific rule (the
>>> first we encounter on the walk) always wins, right? What's the problem
>>> with that?
>>
>> This can be explained with the interleaved_masked_accesses test:
>> https://github.com/landlock-lsm/linux/blob/landlock-v24/tools/testing/selftests/landlock/fs_test.c#L647
>>
>> In this case there is 4 stacked layers:
>> layer 1: allows s1d1/s1d2/s1d3/file1
>> layer 2: allows s1d1/s1d2/s1d3
>>  denies s1d1/s1d2
>> layer 3: allows s1d1
>> layer 4: allows s1d1/s1d2
>>
>> In the v23, access to file1 would be allowed until layer 3, but layer 4
>> would merge a new rule for the s1d2 inode. Because we don't record where
>> exactly the access come from, we can't tell that layer 2 allowed access
>> thanks to s1d3 and that its s1d2 rule was ignored. I think this behavior
>> doesn't make sense from the user point of view.
> 
> Aah, I think I'm starting to understand the issue now. Basically, with
> the current UAPI, the semantics have to be "an access is permitted if,
> for each policy layer, at least one rule encountered on the pathwalk
> permits the access; rules that deny the access are irrelevant". And if
> it turns out that someone needs to be able to deny access to specific
> inodes, we'll have to extend struct landlock_path_beneath_attr.

Right, I'll add this to the documentation (aligned with the new
implementation).

> 
> That reminds me... if we do need to make such a change in the future,
> it would be easier in terms of UAPI compatibility if
> landlock_add_rule() used copy_struct_from_user(), which is designed to
> create backwards and forwards compatibility with other version of UAPI
> headers. So adding that now might save us some headaches later.

I used copy_struct_from_user() before v21, but Arnd wasn't a fan of
having type and size arguments, so we simplified the 

Re: [PATCH v24 07/12] landlock: Support filesystem access-control

2020-11-23 Thread Jann Horn
On Sat, Nov 21, 2020 at 11:06 AM Mickaël Salaün  wrote:
> On 21/11/2020 08:00, Jann Horn wrote:
> > On Thu, Nov 12, 2020 at 9:52 PM Mickaël Salaün  wrote:
> >> Thanks to the Landlock objects and ruleset, it is possible to identify
> >> inodes according to a process's domain.  To enable an unprivileged
> >> process to express a file hierarchy, it first needs to open a directory
> >> (or a file) and pass this file descriptor to the kernel through
> >> landlock_add_rule(2).  When checking if a file access request is
> >> allowed, we walk from the requested dentry to the real root, following
> >> the different mount layers.  The access to each "tagged" inodes are
> >> collected according to their rule layer level, and ANDed to create
> >> access to the requested file hierarchy.  This makes possible to identify
> >> a lot of files without tagging every inodes nor modifying the
> >> filesystem, while still following the view and understanding the user
> >> has from the filesystem.
> >>
> >> Add a new ARCH_EPHEMERAL_INODES for UML because it currently does not
> >> keep the same struct inodes for the same inodes whereas these inodes are
> >> in use.
> >>
> >> This commit adds a minimal set of supported filesystem access-control
> >> which doesn't enable to restrict all file-related actions.  This is the
> >> result of multiple discussions to minimize the code of Landlock to ease
> >> review.  Thanks to the Landlock design, extending this access-control
> >> without breaking user space will not be a problem.  Moreover, seccomp
> >> filters can be used to restrict the use of syscall families which may
> >> not be currently handled by Landlock.
> >>
> >> Cc: Al Viro 
> >> Cc: Anton Ivanov 
> >> Cc: James Morris 
> >> Cc: Jann Horn 
> >> Cc: Jeff Dike 
> >> Cc: Kees Cook 
> >> Cc: Richard Weinberger 
> >> Cc: Serge E. Hallyn 
> >> Signed-off-by: Mickaël Salaün 
> >> ---
> >>
> >> Changes since v23:
> >> * Enforce deterministic interleaved path rules.  To have consistent
> >>   layered rules, granting access to a path implies that all accesses
> >>   tied to inodes, from the requested file to the real root, must be
> >>   checked.  Otherwise, stacked rules may result to overzealous
> >>   restrictions.  By excluding the ability to add exceptions in the same
> >>   layer (e.g. /a allowed, /a/b denied, and /a/b/c allowed), we get
> >>   deterministic interleaved path rules.  This removes an optimization
> >
> > I don't understand the "deterministic interleaved path rules" part.
>
> I explain bellow.
>
> >
> >
> > What if I have a policy like this?
> >
> > /home/user READ
> > /home/user/Downloads READ+WRITE
> >
> > That's a reasonable policy, right?
>
> Definitely, I forgot this, thanks for the outside perspective!
>
> >
> > If I then try to open /home/user/Downloads/foo in WRITE mode, the loop
> > will first check against the READ+WRITE rule for /home/user, that
> > check will pass, and then it will check against the READ rule for /,
> > which will deny the access, right? That seems bad.
>
> Yes that was the intent.
>
> >
> >
> > The v22 code ensured that for each layer, the most specific rule (the
> > first we encounter on the walk) always wins, right? What's the problem
> > with that?
>
> This can be explained with the interleaved_masked_accesses test:
> https://github.com/landlock-lsm/linux/blob/landlock-v24/tools/testing/selftests/landlock/fs_test.c#L647
>
> In this case there is 4 stacked layers:
> layer 1: allows s1d1/s1d2/s1d3/file1
> layer 2: allows s1d1/s1d2/s1d3
>  denies s1d1/s1d2
> layer 3: allows s1d1
> layer 4: allows s1d1/s1d2
>
> In the v23, access to file1 would be allowed until layer 3, but layer 4
> would merge a new rule for the s1d2 inode. Because we don't record where
> exactly the access come from, we can't tell that layer 2 allowed access
> thanks to s1d3 and that its s1d2 rule was ignored. I think this behavior
> doesn't make sense from the user point of view.

Aah, I think I'm starting to understand the issue now. Basically, with
the current UAPI, the semantics have to be "an access is permitted if,
for each policy layer, at least one rule encountered on the pathwalk
permits the access; rules that deny the access are irrelevant". And if
it turns out that someone needs to be able to deny access to specific
inodes, we'll have to extend struct landlock_path_beneath_attr.
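
The semantics stated above ("at least one rule per layer permits the
access; denying rules are irrelevant") can be sketched in userspace C
against the four-layer example quoted earlier. This is only an
illustration, not kernel code: struct inode_rules, access_permitted(),
file1_permitted() and the ACCESS_* bit values are hypothetical names,
and keeping one access mask per layer on every inode is a deliberately
naive representation chosen for readability.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_LAYERS   4
#define ACCESS_READ  (1u << 0)	/* hypothetical bit values */
#define ACCESS_WRITE (1u << 1)

/* Hypothetical per-inode record: accesses granted by each layer (0 = none). */
struct inode_rules {
	uint32_t access[NUM_LAYERS];
};

/*
 * walk[] lists the inodes from the requested file up to the root.
 * A layer is validated as soon as one of its rules grants the whole
 * request; rules that do not grant it are simply skipped.
 */
static bool access_permitted(const struct inode_rules walk[], size_t depth,
			     uint32_t request)
{
	uint32_t pending = (1u << NUM_LAYERS) - 1;	/* layers not yet validated */

	assert(depth > 0);
	for (size_t i = 0; i < depth && pending; i++) {
		for (int l = 0; l < NUM_LAYERS; l++) {
			if ((walk[i].access[l] & request) == request)
				pending &= ~(1u << l);
		}
	}
	return pending == 0;
}

/* The four stacked layers from the example; array index l is layer l + 1. */
bool file1_permitted(uint32_t request)
{
	const struct inode_rules walk[] = {
		{ .access = { ACCESS_READ, 0, 0, 0 } },	/* file1: layer 1 allows */
		{ .access = { 0, ACCESS_READ, 0, 0 } },	/* s1d3: layer 2 allows */
		/* s1d2: layer 2 grants only WRITE ("denies" READ), layer 4 allows */
		{ .access = { 0, ACCESS_WRITE, 0, ACCESS_READ } },
		{ .access = { 0, 0, ACCESS_READ, 0 } },	/* s1d1: layer 3 allows */
	};

	return access_permitted(walk, sizeof(walk) / sizeof(walk[0]), request);
}
```

With these semantics, file1_permitted(ACCESS_READ) holds because every
layer has at least one granting rule on the walk, and the layer 2 rule
on s1d2 has no effect on the result.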

That reminds me... if we do need to make such a change in the future,
it would be easier in terms of UAPI compatibility if
landlock_add_rule() used copy_struct_from_user(), which is designed to
create backwards and forwards compatibility with other versions of UAPI
headers. So adding that now might save us some headaches later.
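
For readers unfamiliar with the helper, the size handling that
copy_struct_from_user() performs can be emulated in userspace as
follows. This is a sketch only: copy_struct_emulated() is a hypothetical
stand-in that works on ordinary memory, and struct attr_v1/attr_v2
(including the denied_access field) are invented layouts for
illustration, not the Landlock UAPI.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical "old" and "new" layouts of an extensible UAPI struct. */
struct attr_v1 {
	uint64_t allowed_access;
	int32_t parent_fd;
	uint32_t reserved;
};

struct attr_v2 {
	uint64_t allowed_access;
	int32_t parent_fd;
	uint32_t reserved;
	uint64_t denied_access;		/* invented extension field */
};

/*
 * Emulates the size handling of copy_struct_from_user():
 * - caller struct larger than ours: extra bytes must all be zero,
 *   otherwise the caller relies on something we do not know (-E2BIG);
 * - caller struct smaller than ours: missing fields read as zero.
 */
int copy_struct_emulated(void *dst, size_t ksize, const void *src, size_t usize)
{
	size_t size = usize < ksize ? usize : ksize;
	const unsigned char *tail = (const unsigned char *)src + size;

	assert(dst && src);
	for (size_t i = 0; i < usize - size; i++) {
		if (tail[i] != 0)
			return -E2BIG;
	}
	memset((unsigned char *)dst + size, 0, ksize - size);
	memcpy(dst, src, size);
	return 0;
}
```

A caller built against attr_v2 that leaves denied_access at zero is
accepted by code that only knows attr_v1, and a caller built against
attr_v1 is accepted by newer code with denied_access read as zero.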


> In the v24, access to file1 would only be allowed with layer 1. The
> layer 2, would deny access to file1 because of the s1d2 rule. This makes
> the reasoning consistent and deterministic whatever the layers are,
> while storing the same access and layer bits. But I agree that this may
> not be 

Re: [PATCH v24 07/12] landlock: Support filesystem access-control

2020-11-21 Thread Mickaël Salaün


On 21/11/2020 08:00, Jann Horn wrote:
> On Thu, Nov 12, 2020 at 9:52 PM Mickaël Salaün  wrote:
>> Thanks to the Landlock objects and ruleset, it is possible to identify
>> inodes according to a process's domain.  To enable an unprivileged
>> process to express a file hierarchy, it first needs to open a directory
>> (or a file) and pass this file descriptor to the kernel through
>> landlock_add_rule(2).  When checking if a file access request is
>> allowed, we walk from the requested dentry to the real root, following
>> the different mount layers.  The access to each "tagged" inodes are
>> collected according to their rule layer level, and ANDed to create
>> access to the requested file hierarchy.  This makes possible to identify
>> a lot of files without tagging every inodes nor modifying the
>> filesystem, while still following the view and understanding the user
>> has from the filesystem.
>>
>> Add a new ARCH_EPHEMERAL_INODES for UML because it currently does not
>> keep the same struct inodes for the same inodes whereas these inodes are
>> in use.
>>
>> This commit adds a minimal set of supported filesystem access-control
>> which doesn't enable to restrict all file-related actions.  This is the
>> result of multiple discussions to minimize the code of Landlock to ease
>> review.  Thanks to the Landlock design, extending this access-control
>> without breaking user space will not be a problem.  Moreover, seccomp
>> filters can be used to restrict the use of syscall families which may
>> not be currently handled by Landlock.
>>
>> Cc: Al Viro 
>> Cc: Anton Ivanov 
>> Cc: James Morris 
>> Cc: Jann Horn 
>> Cc: Jeff Dike 
>> Cc: Kees Cook 
>> Cc: Richard Weinberger 
>> Cc: Serge E. Hallyn 
>> Signed-off-by: Mickaël Salaün 
>> ---
>>
>> Changes since v23:
>> * Enforce deterministic interleaved path rules.  To have consistent
>>   layered rules, granting access to a path implies that all accesses
>>   tied to inodes, from the requested file to the real root, must be
>>   checked.  Otherwise, stacked rules may result to overzealous
>>   restrictions.  By excluding the ability to add exceptions in the same
>>   layer (e.g. /a allowed, /a/b denied, and /a/b/c allowed), we get
>>   deterministic interleaved path rules.  This removes an optimization
> 
> I don't understand the "deterministic interleaved path rules" part.

I explain below.

> 
> 
> What if I have a policy like this?
> 
> /home/user READ
> /home/user/Downloads READ+WRITE
> 
> That's a reasonable policy, right?

Definitely, I forgot this, thanks for the outside perspective!

> 
> If I then try to open /home/user/Downloads/foo in WRITE mode, the loop
> will first check against the READ+WRITE rule for /home/user, that
> check will pass, and then it will check against the READ rule for /,
> which will deny the access, right? That seems bad.

Yes that was the intent.

> 
> 
> The v22 code ensured that for each layer, the most specific rule (the
> first we encounter on the walk) always wins, right? What's the problem
> with that?

This can be explained with the interleaved_masked_accesses test:
https://github.com/landlock-lsm/linux/blob/landlock-v24/tools/testing/selftests/landlock/fs_test.c#L647

In this case there are 4 stacked layers:
layer 1: allows s1d1/s1d2/s1d3/file1
layer 2: allows s1d1/s1d2/s1d3
         denies s1d1/s1d2
layer 3: allows s1d1
layer 4: allows s1d1/s1d2

In v23, access to file1 would be allowed until layer 3, but layer 4
would merge a new rule for the s1d2 inode. Because we don't record where
exactly the access comes from, we can't tell that layer 2 allowed access
thanks to s1d3 and that its s1d2 rule was ignored. I think this behavior
doesn't make sense from the user's point of view.

In v24, access to file1 would only be allowed with layer 1. Layer 2
would deny access to file1 because of the s1d2 rule. This makes the
reasoning consistent and deterministic whatever the layers are, while
storing the same access and layer bits. But I agree that this may not
be desirable.

In a perfect v25, file1 should be allowed by all these layers. I didn't
find a simple solution to this while minimizing the memory allocated per
rule (cf. struct landlock_rule: mainly 32 bits for access rights and
64 bits for the layers that contributed to these ANDed accesses). I would
like to avoid storing 32-bit access rights per stacked layer. Do you
see another solution?

> 
>>   which could be replaced by a proper cache mechanism.  This also
>>   further simplifies and explain check_access_path_continue().
> 
> From the interdiff between v23 and v24 (git range-diff
> 99ade5d59b23~1..99ade5d59b23 faa8c09be9fd~1..faa8c09be9fd):
> 
> @@ security/landlock/fs.c (new)
>  +  rcu_dereference(landlock_inode(inode)->object));
>  +  rcu_read_unlock();
>  +
> -+  /* Checks for matching layers. */
> -+  if (rule && (rule->layers | *layer_mask)) {
> -+  if ((rule->access & access_request) == access_request) 

Re: [PATCH v24 07/12] landlock: Support filesystem access-control

2020-11-20 Thread Jann Horn
On Thu, Nov 12, 2020 at 9:52 PM Mickaël Salaün  wrote:
> Thanks to the Landlock objects and ruleset, it is possible to identify
> inodes according to a process's domain.  To enable an unprivileged
> process to express a file hierarchy, it first needs to open a directory
> (or a file) and pass this file descriptor to the kernel through
> landlock_add_rule(2).  When checking if a file access request is
> allowed, we walk from the requested dentry to the real root, following
> the different mount layers.  The access to each "tagged" inodes are
> collected according to their rule layer level, and ANDed to create
> access to the requested file hierarchy.  This makes possible to identify
> a lot of files without tagging every inodes nor modifying the
> filesystem, while still following the view and understanding the user
> has from the filesystem.
>
> Add a new ARCH_EPHEMERAL_INODES for UML because it currently does not
> keep the same struct inodes for the same inodes whereas these inodes are
> in use.
>
> This commit adds a minimal set of supported filesystem access-control
> which doesn't enable to restrict all file-related actions.  This is the
> result of multiple discussions to minimize the code of Landlock to ease
> review.  Thanks to the Landlock design, extending this access-control
> without breaking user space will not be a problem.  Moreover, seccomp
> filters can be used to restrict the use of syscall families which may
> not be currently handled by Landlock.
>
> Cc: Al Viro 
> Cc: Anton Ivanov 
> Cc: James Morris 
> Cc: Jann Horn 
> Cc: Jeff Dike 
> Cc: Kees Cook 
> Cc: Richard Weinberger 
> Cc: Serge E. Hallyn 
> Signed-off-by: Mickaël Salaün 
> ---
>
> Changes since v23:
> * Enforce deterministic interleaved path rules.  To have consistent
>   layered rules, granting access to a path implies that all accesses
>   tied to inodes, from the requested file to the real root, must be
>   checked.  Otherwise, stacked rules may result to overzealous
>   restrictions.  By excluding the ability to add exceptions in the same
>   layer (e.g. /a allowed, /a/b denied, and /a/b/c allowed), we get
>   deterministic interleaved path rules.  This removes an optimization

I don't understand the "deterministic interleaved path rules" part.


What if I have a policy like this?

/home/user READ
/home/user/Downloads READ+WRITE

That's a reasonable policy, right?

If I then try to open /home/user/Downloads/foo in WRITE mode, the loop
will first check against the READ+WRITE rule for /home/user/Downloads,
that check will pass, and then it will check against the READ rule for
/home/user, which will deny the access, right? That seems bad.
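
The concern can be reproduced with a small userspace model of the v24
check shown in the interdiff later in this message. check_continue()
follows the added lines of that diff; everything else is hypothetical
scaffolding for illustration: struct rule as a two-field stand-in, the
ACCESS_* bit values, walk_allows(), and the assumption that the walk
succeeds only if every layer bit has been cleared once the root is
reached.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ACCESS_READ  (1u << 0)	/* hypothetical bit values */
#define ACCESS_WRITE (1u << 1)

/* Simplified stand-in for the rule tied to one inode. */
struct rule {
	uint32_t access;	/* accesses granted by this rule */
	uint64_t layers;	/* bitmask of layers the rule belongs to */
};

/* Mirrors the v24 logic: any rule lacking the access stops the walk. */
static bool check_continue(const struct rule *rule, uint32_t access_request,
			   uint64_t *layer_mask)
{
	if (!rule)
		return true;	/* no rule for this inode: keep walking */
	if ((rule->access & access_request) == access_request) {
		*layer_mask &= ~rule->layers;	/* these layers are validated */
		return true;
	}
	return false;
}

/* rules[0] is the requested file, the last entry is the root. */
static bool walk_allows(const struct rule *const rules[], size_t depth,
			uint32_t access_request, uint64_t layer_mask)
{
	assert(depth > 0);
	for (size_t i = 0; i < depth; i++) {
		if (!check_continue(rules[i], access_request, &layer_mask))
			return false;
	}
	return layer_mask == 0;	/* assumed: all layers must be validated */
}

/* Single-layer policy: /home/user READ, /home/user/Downloads READ+WRITE. */
bool downloads_foo_allows(uint32_t access_request)
{
	static const struct rule downloads = { ACCESS_READ | ACCESS_WRITE, 1 };
	static const struct rule home_user = { ACCESS_READ, 1 };
	/* foo, Downloads, user, home, / */
	const struct rule *const walk[] = {
		NULL, &downloads, &home_user, NULL, NULL,
	};

	return walk_allows(walk, sizeof(walk) / sizeof(walk[0]),
			   access_request, 1);
}
```

In this model a WRITE request on /home/user/Downloads/foo passes the
Downloads rule and is then refused at /home/user, while a READ request
goes through.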


The v22 code ensured that for each layer, the most specific rule (the
first we encounter on the walk) always wins, right? What's the problem
with that?

>   which could be replaced by a proper cache mechanism.  This also
>   further simplifies and explain check_access_path_continue().

From the interdiff between v23 and v24 (git range-diff
99ade5d59b23~1..99ade5d59b23 faa8c09be9fd~1..faa8c09be9fd):

@@ security/landlock/fs.c (new)
 +  rcu_dereference(landlock_inode(inode)->object));
 +  rcu_read_unlock();
 +
-+  /* Checks for matching layers. */
-+  if (rule && (rule->layers | *layer_mask)) {
-+    if ((rule->access & access_request) == access_request) {
-+      *layer_mask &= ~rule->layers;
-+      return true;
-+    } else {
-+      return false;
-+    }
++  if (!rule)
++    /* Continues to walk if there is no rule for this inode. */
++    return true;
++  /*
++   * We must check all layers for each inode because we may encounter
++   * multiple different accesses from the same layer in a walk.  Each
++   * layer must at least allow the access request one time (i.e. with one
++   * inode).  This enables to have a deterministic behavior whatever
++   * inode is tagged within interleaved layers.
++   */
++  if ((rule->access & access_request) == access_request) {
++    /* Validates layers for which all accesses are allowed. */
++    *layer_mask &= ~rule->layers;
++    /* Continues to walk until all layers are validated. */
++    return true;
 +  }
-+  return true;
++  /* Stops if a rule in the path don't allow all requested access. */
++  return false;
 +}
 +
 +static int check_access_path(const struct landlock_ruleset *const domain,
@@ security/landlock/fs.c (new)
 +  _mask)) {
 +  struct dentry *parent_dentry;
 +
-+  /* Stops when a rule from each layer granted access. */
-+  if (layer_mask == 0) {
-+    allowed = true;
-+    break;
-+  }
-+

This change also made it so that disconnected paths aren't accessible
unless they're internal, right? 

[PATCH v24 07/12] landlock: Support filesystem access-control

2020-11-12 Thread Mickaël Salaün
From: Mickaël Salaün 

Thanks to the Landlock objects and ruleset, it is possible to identify
inodes according to a process's domain.  To enable an unprivileged
process to express a file hierarchy, it first needs to open a directory
(or a file) and pass this file descriptor to the kernel through
landlock_add_rule(2).  When checking if a file access request is
allowed, we walk from the requested dentry to the real root, following
the different mount layers.  The accesses tied to each "tagged" inode
are collected according to their rule layer level, and ANDed to create
access to the requested file hierarchy.  This makes it possible to
identify a lot of files without tagging every inode or modifying the
filesystem, while still following the view and understanding the user
has of the filesystem.

Add a new ARCH_EPHEMERAL_INODES for UML because it currently does not
keep the same struct inode for the same inode while that inode is in
use.

This commit adds a minimal set of supported filesystem access controls,
which does not make it possible to restrict all file-related actions.
This is the result of multiple discussions to minimize the code of
Landlock to ease review.  Thanks to the Landlock design, extending this
access control without breaking user space will not be a problem.
Moreover, seccomp filters can be used to restrict the use of syscall
families which may not be currently handled by Landlock.

Cc: Al Viro 
Cc: Anton Ivanov 
Cc: James Morris 
Cc: Jann Horn 
Cc: Jeff Dike 
Cc: Kees Cook 
Cc: Richard Weinberger 
Cc: Serge E. Hallyn 
Signed-off-by: Mickaël Salaün 
---

Changes since v23:
* Enforce deterministic interleaved path rules.  To have consistent
  layered rules, granting access to a path implies that all accesses
  tied to inodes, from the requested file to the real root, must be
  checked.  Otherwise, stacked rules may result in overzealous
  restrictions.  By excluding the ability to add exceptions in the same
  layer (e.g. /a allowed, /a/b denied, and /a/b/c allowed), we get
  deterministic interleaved path rules.  This removes an optimization
  which could be replaced by a proper cache mechanism.  This also
  further simplifies and explains check_access_path_continue().
* Fix memory allocation error handling in landlock_create_object()
  calls.  This prevents inadvertently holding an inode.
* In get_inode_object(), improve comments, make code more readable and
  move kfree() call out of the lock window.
* Use the simplified landlock_insert_rule() API.

Changes since v22:
* Simplify check_access_path_continue() (suggested by Jann Horn).
* Remove prefetch() call for now (suggested by Jann Horn).
* Fix spelling and remove superfluous comment (spotted by Jann Horn).
* Cosmetic variable renaming.

Changes since v21:
* Rename ARCH_EPHEMERAL_STATES to ARCH_EPHEMERAL_INODES (suggested by
  James Morris).
* Remove the LANDLOCK_ACCESS_FS_CHROOT right because chroot(2) (which
  requires CAP_SYS_CHROOT) doesn't make it possible to bypass Landlock
  (as the tests demonstrate), and because it is often used by sandboxes,
  it would be counterproductive to forbid it.  This also reduces the
  code size.
* Clean up documentation.

Changes since v19:
* Fix spelling (spotted by Randy Dunlap).

Changes since v18:
* Remove useless include.
* Fix spelling.

Changes since v17:
* Replace landlock_release_inodes() with security_sb_delete() (requested
  by James Morris).
* Replace struct super_block->s_landlock_inode_refs with the LSM
  infrastructure management of the superblock (requested by James
  Morris).
* Fix mknod restriction with a zero mode (spotted by Vincent Dagonneau).
* Minimize executed code in path_mknod and file_open hooks when the
  current task is not sandboxed.
* Remove useless checks on the file pointer and inode in
  hook_file_open().
* Constify domain pointers.
* Rename inode_landlock() to landlock_inode().
* Import include/uapi/linux/landlock.h and _LANDLOCK_ACCESS_FS_* from
  the ruleset and domain management patch.
* Explain the rationale of this minimal set of access-control.
  https://lore.kernel.org/lkml/f646e1c7-33cf-333f-070c-0a40ad046...@digikod.net/

Changes since v16:
* Add ARCH_EPHEMERAL_STATES and enable it for UML.

Changes since v15:
* Replace layer_levels and layer_depth with a bitfield of layers: this
  makes it possible to properly manage supersets and subsets of access
  rights, whatever their order in the stack of layers.
  Cf. https://lore.kernel.org/lkml/e07fe473-1801-01cc-12ae-b3167f952...@digikod.net/
* Allow to open pipes and similar special files through /proc/self/fd/.
* Properly handle internal filesystems such as nsfs: always allow these
  kinds of roots because disconnected paths cannot be evaluated.
* Remove the LANDLOCK_ACCESS_FS_LINK_TO and
  LANDLOCK_ACCESS_FS_RENAME_{TO,FROM}, but use the
  LANDLOCK_ACCESS_FS_REMOVE_{FILE,DIR} and LANDLOCK_ACCESS_FS_MAKE_*
  instead.  Indeed, it is not possible for now (and not really useful)
  to express the semantics of a source and a destination.
* Check