Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-04-01 Thread Tycho Andersen
Hi Mickaël,

On Mon, Apr 02, 2018 at 12:04:36AM +0200, Mickaël Salaün wrote:
> >> The vDSO is code mapped into all processes. As you said, these processes
> >> may use it or not. What I was thinking about is to use the same concept,
> >> i.e. map "shim" code into each process belonging to a particular
> >> hierarchy (the same way seccomp filters are inherited across processes).
> >> With a seccomp filter matching some syscall (e.g. mount, open), it is
> >> possible to jump back to the shim code thanks to SECCOMP_RET_TRAP. This
> >> shim code should then be able to emulate/patch what is needed, even
> >> faking a file opening by receiving a file descriptor through a UNIX
> >> socket. As the Chrome sandbox did, the seccomp filter may look at the
> >> calling address to allow the shim code to call syscalls without being
> >> caught, if needed. However, relying on SIGSYS may not fit with
> >> arbitrary code. A new SECCOMP_RET_EMULATE (?) could be used to jump
> >> to a specific process address, to emulate the syscall in an easier way
> >> than only relying on a {c,e}BPF program.
> >>
> > 
> > This could indeed be done, but I think that Tycho's approach is much
> > cleaner and probably faster.
> > 
> 
> I like it too but how does this handle file descriptors?

I think it could be done fairly simply; the most complicated part is
probably designing an API that doesn't suck. But the basic idea would
be:

struct seccomp_notif_resp {
	__u64 id;
	__s32 error;
	__s64 val;
	__s32 fd;
};

If the handler responds with fd >= 0, we grab the tracer's fd,
duplicate it, and install it somewhere in the tracee's fd table. Since
things like socket() will want to return the fd number as it's
installed, and the handler doesn't know that number, we'll probably
want some way to indicate that the kernel should return this value. We
could either mandate that if fd >= 0, that's the value that will be
returned from the syscall, or add another flag that says "no, install
the fd, but really return what's in val instead".

I guess we can't mandate that we return fd, because e.g. netlink
sockets can sometimes return fds as part of the netlink messages, and
not as the return value from the syscall.
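
As an illustration only, the return-value choice described above can be
modelled in plain C. This is a sketch under stated assumptions, not kernel
code: the `flags` field, the `NOTIF_RESP_RETURN_VAL` name, the
`notif_resp_sketch` struct and the `installed_fd` parameter are all
hypothetical, and `error` is assumed to carry a negative errno.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag (not an existing kernel constant): return `val`
 * to the tracee instead of the newly installed fd number. */
#define NOTIF_RESP_RETURN_VAL (1u << 0)

/* Response layout from the mail above, plus an assumed `flags` field. */
struct notif_resp_sketch {
	uint64_t id;
	int32_t error;   /* assumed: 0 on success, negative errno otherwise */
	int64_t val;
	int32_t fd;      /* handler-side fd to install, or -1 for none */
	uint32_t flags;  /* assumed: may carry NOTIF_RESP_RETURN_VAL */
};

/*
 * Decide what the intercepted syscall returns to the tracee.
 * `installed_fd` stands in for the slot picked in the tracee's fd table;
 * it is only meaningful when resp->fd >= 0.
 */
int64_t notif_syscall_result(const struct notif_resp_sketch *resp,
			     int installed_fd)
{
	if (resp->error)
		return resp->error;       /* handler asked for a failure */
	if (resp->fd < 0)
		return resp->val;         /* nothing was installed */
	assert(installed_fd >= 0);
	if (resp->flags & NOTIF_RESP_RETURN_VAL)
		return resp->val;         /* netlink-like case */
	return installed_fd;              /* socket()-like case */
}
```

With the flag clear, a socket()-like call sees the installed fd number;
with it set, the fd is still installed but the caller sees `val`.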

Tycho


[PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-04-01 Thread Mickaël Salaün

On 03/09/2018 12:53 AM, Andy Lutomirski wrote:
> On Thu, Mar 8, 2018 at 11:51 PM, Mickaël Salaün  wrote:
>>
>> On 07/03/2018 02:21, Andy Lutomirski wrote:
>>> On Tue, Mar 6, 2018 at 11:06 PM, Mickaël Salaün  wrote:

>>>> On 06/03/2018 23:46, Tycho Andersen wrote:
>>>>> On Tue, Mar 06, 2018 at 10:33:17PM +, Andy Lutomirski wrote:
>>>>>>>> Suppose I'm writing a container manager.  I want to run "mount" in the
>>>>>>>> container, but I don't want to allow mount() in general and I want to
>>>>>>>> emulate certain mount() actions.  I can write a filter that catches
>>>>>>>> mount using seccomp and calls out to the container manager for help.
>>>>>>>> This isn't theoretical -- Tycho wants *exactly* this use case to be
>>>>>>>> supported.
>>>>>>>
>>>>>>> Well, I think this use case should be handled with something like
>>>>>>> LD_PRELOAD and a helper library. FYI, I did something like this:
>>>>>>> https://github.com/stemjail/stemshim
>>>>>>
>>>>>> I doubt that will work for containers.  Containers that use user
>>>>>> namespaces and, for example, setuid programs aren't going to honor
>>>>>> LD_PRELOAD.
>>>>>
>>>>> Or anything that calls syscalls directly, like go programs.
>>>>
>>>> That's why the vDSO-like approach. Enforcing access control is not
>>>> the issue here; patching a buggy userland (without patching its code) is
>>>> the issue, isn't it?
>>>>
>>>> As far as I remember, the main problem is to handle file descriptors
>>>> while "emulating" the kernel behavior. This can be done with "shim"
>>>> code mapped in every process. Chrome used something like this (in a
>>>> previous sandbox mechanism) as a kind of emulation (with the current
>>>> seccomp-bpf). I think it should be doable to replace the (userland)
>>>> emulation code with an IPC wrapper receiving file descriptors through
>>>> a UNIX socket.

>>>
>>> Can you explain exactly what you mean by "vDSO-like"?
>>>
>>> When a 64-bit program does a syscall, it just executes the SYSCALL
>>> instruction.  The vDSO isn't involved at all.  32-bit programs usually
>>> go through the vDSO, but not always.
>>>
>>> It could be possible to force-load a DSO into an entire container and
>>> rig up seccomp to intercept all SYSCALLs not originating from the DSO
>>> such that they merely redirect control to the DSO, but that seems
>>> quite messy.
>>
>> The vDSO is code mapped into all processes. As you said, these processes
>> may use it or not. What I was thinking about is to use the same concept,
>> i.e. map "shim" code into each process belonging to a particular
>> hierarchy (the same way seccomp filters are inherited across processes).
>> With a seccomp filter matching some syscall (e.g. mount, open), it is
>> possible to jump back to the shim code thanks to SECCOMP_RET_TRAP. This
>> shim code should then be able to emulate/patch what is needed, even
>> faking a file opening by receiving a file descriptor through a UNIX
>> socket. As the Chrome sandbox did, the seccomp filter may look at the
>> calling address to allow the shim code to call syscalls without being
>> caught, if needed. However, relying on SIGSYS may not fit with
>> arbitrary code. A new SECCOMP_RET_EMULATE (?) could be used to jump
>> to a specific process address, to emulate the syscall in an easier way
>> than only relying on a {c,e}BPF program.
>>
> 
> This could indeed be done, but I think that Tycho's approach is much
> cleaner and probably faster.
> 

I like it too but how does this handle file descriptors?




Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-03-08 Thread Andy Lutomirski
On Thu, Mar 8, 2018 at 11:51 PM, Mickaël Salaün  wrote:
>
> On 07/03/2018 02:21, Andy Lutomirski wrote:
>> On Tue, Mar 6, 2018 at 11:06 PM, Mickaël Salaün  wrote:
>>>
>>> On 06/03/2018 23:46, Tycho Andersen wrote:
>>>> On Tue, Mar 06, 2018 at 10:33:17PM +, Andy Lutomirski wrote:
>>>>>>> Suppose I'm writing a container manager.  I want to run "mount" in the
>>>>>>> container, but I don't want to allow mount() in general and I want to
>>>>>>> emulate certain mount() actions.  I can write a filter that catches
>>>>>>> mount using seccomp and calls out to the container manager for help.
>>>>>>> This isn't theoretical -- Tycho wants *exactly* this use case to be
>>>>>>> supported.
>>>>>>
>>>>>> Well, I think this use case should be handled with something like
>>>>>> LD_PRELOAD and a helper library. FYI, I did something like this:
>>>>>> https://github.com/stemjail/stemshim
>>>>>
>>>>> I doubt that will work for containers.  Containers that use user
>>>>> namespaces and, for example, setuid programs aren't going to honor
>>>>> LD_PRELOAD.
>>>>
>>>> Or anything that calls syscalls directly, like go programs.
>>>
>>> That's why the vDSO-like approach. Enforcing access control is not
>>> the issue here; patching a buggy userland (without patching its code) is
>>> the issue, isn't it?
>>>
>>> As far as I remember, the main problem is to handle file descriptors
>>> while "emulating" the kernel behavior. This can be done with "shim"
>>> code mapped in every process. Chrome used something like this (in a
>>> previous sandbox mechanism) as a kind of emulation (with the current
>>> seccomp-bpf). I think it should be doable to replace the (userland)
>>> emulation code with an IPC wrapper receiving file descriptors through
>>> a UNIX socket.
>>>
>>
>> Can you explain exactly what you mean by "vDSO-like"?
>>
>> When a 64-bit program does a syscall, it just executes the SYSCALL
>> instruction.  The vDSO isn't involved at all.  32-bit programs usually
>> go through the vDSO, but not always.
>>
>> It could be possible to force-load a DSO into an entire container and
>> rig up seccomp to intercept all SYSCALLs not originating from the DSO
>> such that they merely redirect control to the DSO, but that seems
>> quite messy.
>
> The vDSO is code mapped into all processes. As you said, these processes
> may use it or not. What I was thinking about is to use the same concept,
> i.e. map "shim" code into each process belonging to a particular
> hierarchy (the same way seccomp filters are inherited across processes).
> With a seccomp filter matching some syscall (e.g. mount, open), it is
> possible to jump back to the shim code thanks to SECCOMP_RET_TRAP. This
> shim code should then be able to emulate/patch what is needed, even
> faking a file opening by receiving a file descriptor through a UNIX
> socket. As the Chrome sandbox did, the seccomp filter may look at the
> calling address to allow the shim code to call syscalls without being
> caught, if needed. However, relying on SIGSYS may not fit with
> arbitrary code. A new SECCOMP_RET_EMULATE (?) could be used to jump
> to a specific process address, to emulate the syscall in an easier way
> than only relying on a {c,e}BPF program.
>

This could indeed be done, but I think that Tycho's approach is much
cleaner and probably faster.


Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-03-08 Thread Mickaël Salaün

On 07/03/2018 02:21, Andy Lutomirski wrote:
> On Tue, Mar 6, 2018 at 11:06 PM, Mickaël Salaün  wrote:
>>
>> On 06/03/2018 23:46, Tycho Andersen wrote:
>>> On Tue, Mar 06, 2018 at 10:33:17PM +, Andy Lutomirski wrote:
>>>>>> Suppose I'm writing a container manager.  I want to run "mount" in the
>>>>>> container, but I don't want to allow mount() in general and I want to
>>>>>> emulate certain mount() actions.  I can write a filter that catches
>>>>>> mount using seccomp and calls out to the container manager for help.
>>>>>> This isn't theoretical -- Tycho wants *exactly* this use case to be
>>>>>> supported.
>>>>>
>>>>> Well, I think this use case should be handled with something like
>>>>> LD_PRELOAD and a helper library. FYI, I did something like this:
>>>>> https://github.com/stemjail/stemshim
>>>>
>>>> I doubt that will work for containers.  Containers that use user
>>>> namespaces and, for example, setuid programs aren't going to honor
>>>> LD_PRELOAD.
>>>
>>> Or anything that calls syscalls directly, like go programs.
>>
>> That's why the vDSO-like approach. Enforcing access control is not
>> the issue here; patching a buggy userland (without patching its code) is
>> the issue, isn't it?
>>
>> As far as I remember, the main problem is to handle file descriptors
>> while "emulating" the kernel behavior. This can be done with "shim"
>> code mapped in every process. Chrome used something like this (in a
>> previous sandbox mechanism) as a kind of emulation (with the current
>> seccomp-bpf). I think it should be doable to replace the (userland)
>> emulation code with an IPC wrapper receiving file descriptors through
>> a UNIX socket.
>>
> 
> Can you explain exactly what you mean by "vDSO-like"?
> 
> When a 64-bit program does a syscall, it just executes the SYSCALL
> instruction.  The vDSO isn't involved at all.  32-bit programs usually
> go through the vDSO, but not always.
> 
> It could be possible to force-load a DSO into an entire container and
> rig up seccomp to intercept all SYSCALLs not originating from the DSO
> such that they merely redirect control to the DSO, but that seems
> quite messy.

The vDSO is code mapped into all processes. As you said, these processes
may use it or not. What I was thinking about is to use the same concept,
i.e. map "shim" code into each process belonging to a particular
hierarchy (the same way seccomp filters are inherited across processes).
With a seccomp filter matching some syscall (e.g. mount, open), it is
possible to jump back to the shim code thanks to SECCOMP_RET_TRAP. This
shim code should then be able to emulate/patch what is needed, even
faking a file opening by receiving a file descriptor through a UNIX
socket. As the Chrome sandbox did, the seccomp filter may look at the
calling address to allow the shim code to call syscalls without being
caught, if needed. However, relying on SIGSYS may not fit with
arbitrary code. A new SECCOMP_RET_EMULATE (?) could be used to jump
to a specific process address, to emulate the syscall in an easier way
than only relying on a {c,e}BPF program.
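
For readers unfamiliar with the SECCOMP_RET_TRAP path mentioned above, a
minimal sketch follows. It is not the shim itself: the names
`install_mount_trap`, `sigsys_handler` and `trapped_nr` are made up for the
example, the handler only records the trapped syscall number, and the
architecture check that a real filter should perform on
`seccomp_data.arch` is left out for brevity. A real shim would emulate the
call in the handler and write the result into the saved registers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Last syscall number reported through SIGSYS, or -1 if none yet. */
volatile sig_atomic_t trapped_nr = -1;

/* Stand-in for the shim entry point: just remember what was trapped. */
void sigsys_handler(int sig, siginfo_t *info, void *ctx)
{
	(void)sig;
	(void)ctx;
	trapped_nr = info->si_syscall;
}

/* Install a filter that traps mount(2) and allows everything else.
 * Returns 0 on success, -1 on failure. The filter cannot be removed. */
int install_mount_trap(void)
{
	struct sock_filter filter[] = {
		/* Load the syscall number. */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
			 offsetof(struct seccomp_data, nr)),
		/* mount? fall through to TRAP, otherwise skip to ALLOW. */
		BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_mount, 0, 1),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = (unsigned short)(sizeof(filter) / sizeof(filter[0])),
		.filter = filter,
	};
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = sigsys_handler;
	sa.sa_flags = SA_SIGINFO;
	if (sigaction(SIGSYS, &sa, NULL))
		return -1;
	/* Needed so an unprivileged process may install a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) ? -1 : 0;
}
```

After `install_mount_trap()` succeeds, any mount(2) attempt in the process
(or its future children) is skipped by the kernel and lands in the SIGSYS
handler instead.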




Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-03-06 Thread Andy Lutomirski
On Tue, Mar 6, 2018 at 11:06 PM, Mickaël Salaün  wrote:
>
> On 06/03/2018 23:46, Tycho Andersen wrote:
>> On Tue, Mar 06, 2018 at 10:33:17PM +, Andy Lutomirski wrote:
>>>>> Suppose I'm writing a container manager.  I want to run "mount" in the
>>>>> container, but I don't want to allow mount() in general and I want to
>>>>> emulate certain mount() actions.  I can write a filter that catches
>>>>> mount using seccomp and calls out to the container manager for help.
>>>>> This isn't theoretical -- Tycho wants *exactly* this use case to be
>>>>> supported.
>>>>
>>>> Well, I think this use case should be handled with something like
>>>> LD_PRELOAD and a helper library. FYI, I did something like this:
>>>> https://github.com/stemjail/stemshim
>>>
>>> I doubt that will work for containers.  Containers that use user
>>> namespaces and, for example, setuid programs aren't going to honor
>>> LD_PRELOAD.
>>
>> Or anything that calls syscalls directly, like go programs.
>
> That's why the vDSO-like approach. Enforcing access control is not
> the issue here; patching a buggy userland (without patching its code) is
> the issue, isn't it?
>
> As far as I remember, the main problem is to handle file descriptors
> while "emulating" the kernel behavior. This can be done with "shim"
> code mapped in every process. Chrome used something like this (in a
> previous sandbox mechanism) as a kind of emulation (with the current
> seccomp-bpf). I think it should be doable to replace the (userland)
> emulation code with an IPC wrapper receiving file descriptors through
> a UNIX socket.
>

Can you explain exactly what you mean by "vDSO-like"?

When a 64-bit program does a syscall, it just executes the SYSCALL
instruction.  The vDSO isn't involved at all.  32-bit programs usually
go through the vDSO, but not always.

It could be possible to force-load a DSO into an entire container and
rig up seccomp to intercept all SYSCALLs not originating from the DSO
such that they merely redirect control to the DSO, but that seems
quite messy.


Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-03-06 Thread Mickaël Salaün

On 06/03/2018 23:46, Tycho Andersen wrote:
> On Tue, Mar 06, 2018 at 10:33:17PM +, Andy Lutomirski wrote:
>>>> Suppose I'm writing a container manager.  I want to run "mount" in the
>>>> container, but I don't want to allow mount() in general and I want to
>>>> emulate certain mount() actions.  I can write a filter that catches
>>>> mount using seccomp and calls out to the container manager for help.
>>>> This isn't theoretical -- Tycho wants *exactly* this use case to be
>>>> supported.
>>>
>>> Well, I think this use case should be handled with something like
>>> LD_PRELOAD and a helper library. FYI, I did something like this:
>>> https://github.com/stemjail/stemshim
>>
>> I doubt that will work for containers.  Containers that use user
>> namespaces and, for example, setuid programs aren't going to honor
>> LD_PRELOAD.
> 
> Or anything that calls syscalls directly, like go programs.

That's why the vDSO-like approach. Enforcing access control is not
the issue here; patching a buggy userland (without patching its code) is
the issue, isn't it?

As far as I remember, the main problem is to handle file descriptors
while "emulating" the kernel behavior. This can be done with "shim"
code mapped in every process. Chrome used something like this (in a
previous sandbox mechanism) as a kind of emulation (with the current
seccomp-bpf). I think it should be doable to replace the (userland)
emulation code with an IPC wrapper receiving file descriptors through
a UNIX socket.
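
Receiving a file descriptor through a UNIX socket is done with
SCM_RIGHTS ancillary data. The sketch below shows the two halves such an
IPC wrapper would need; the helper names `send_fd` and `recv_fd` are
chosen for the example and do not come from Chrome, stemshim or any other
project named in this thread.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send `fd` over the connected UNIX socket `sock`. Returns 0 or -1. */
int send_fd(int sock, int fd)
{
	char byte = 0;  /* at least one byte of real data must be sent */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;  /* forces correct alignment */
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd from `sock`. Returns the new fd, or -1 on failure. */
int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg = { 0 };
	struct cmsghdr *cmsg;
	int fd;

	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;  /* a new descriptor in the receiving process */
}
```

The received descriptor refers to the same open file as the sender's, so
a shim could return it to the caller as if open() had succeeded locally.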




Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-03-06 Thread Tycho Andersen
On Tue, Mar 06, 2018 at 10:33:17PM +, Andy Lutomirski wrote:
> >> Suppose I'm writing a container manager.  I want to run "mount" in the
> >> container, but I don't want to allow mount() in general and I want to
> >> emulate certain mount() actions.  I can write a filter that catches
> >> mount using seccomp and calls out to the container manager for help.
> >> This isn't theoretical -- Tycho wants *exactly* this use case to be
> >> supported.
> >
> > Well, I think this use case should be handled with something like
> > LD_PRELOAD and a helper library. FYI, I did something like this:
> > https://github.com/stemjail/stemshim
> 
> I doubt that will work for containers.  Containers that use user
> namespaces and, for example, setuid programs aren't going to honor
> LD_PRELOAD.

Or anything that calls syscalls directly, like go programs.

Tycho


Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-03-06 Thread Andy Lutomirski
On Tue, Mar 6, 2018 at 10:25 PM, Mickaël Salaün  wrote:
>
>
> On 28/02/2018 00:09, Andy Lutomirski wrote:
>> On Tue, Feb 27, 2018 at 10:03 PM, Mickaël Salaün  wrote:
>>>
>>> On 27/02/2018 05:36, Andy Lutomirski wrote:
>>>> On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün  wrote:
>>>>> Hi,
>
>>
>
>>>>> ## Why use the seccomp(2) syscall?
>>>>>
>>>>> Landlock uses the same semantics as seccomp to apply access rule
>>>>> restrictions. It adds a new layer of security for the current process
>>>>> which is inherited by its children. It makes sense to use a unique
>>>>> access-restricting syscall (that should be allowed by seccomp filters)
>>>>> which can only drop privileges. Moreover, a Landlock rule could come
>>>>> from outside a process (e.g.  passed through a UNIX socket). It is then
>>>>> useful to differentiate the creation/load of Landlock eBPF programs via
>>>>> bpf(2), from rule enforcement via seccomp(2).
>>>>
>>>> This seems like a weak argument to me.  Sure, this is a bit different
>>>> from seccomp(), and maybe shoving it into the seccomp() multiplexer is
>>>> awkward, but surely the bpf() multiplexer is even less applicable.
>>>
>>> I think using the seccomp syscall is fine, and everyone agreed on it.
>>>
>>
>> Ah, sorry, I completely misread what you wrote.  My apologies.  You
>> can disregard most of my email.
>>
>>>

>>>> Also, looking forward, I think you're going to want a bunch of the
>>>> stuff that's under consideration as new seccomp features.  Tycho is
>>>> working on a "user notifier" feature for seccomp where, in addition to
>>>> accepting, rejecting, or kicking to ptrace, you can send a message to
>>>> the creator of the filter and wait for a reply.  I think that Landlock
>>>> will want exactly the same feature.
>>>
>>> I don't see why this would be useful at all here. Landlock does not
>>> filter at the syscall level but handles kernel objects and actions as
>>> an LSM does. That is the whole purpose of Landlock.
>>
>> Suppose I'm writing a container manager.  I want to run "mount" in the
>> container, but I don't want to allow mount() in general and I want to
>> emulate certain mount() actions.  I can write a filter that catches
>> mount using seccomp and calls out to the container manager for help.
>> This isn't theoretical -- Tycho wants *exactly* this use case to be
>> supported.
>
> Well, I think this use case should be handled with something like
> LD_PRELOAD and a helper library. FYI, I did something like this:
> https://github.com/stemjail/stemshim

I doubt that will work for containers.  Containers that use user
namespaces and, for example, setuid programs aren't going to honor
LD_PRELOAD.

>
> Otherwise, we should think about enabling a process to (dynamically)
> extend/patch the vDSO (similar to LD_PRELOAD but at the syscall level
> and works with static binaries) for a subset of processes (the same way
> seccomp filters are inherited). It may be more powerful and flexible
> than extending the kernel/seccomp to patch (buggy?) userland.

Egads!


Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-03-06 Thread Mickaël Salaün


On 28/02/2018 00:09, Andy Lutomirski wrote:
> On Tue, Feb 27, 2018 at 10:03 PM, Mickaël Salaün  wrote:
>>
>> On 27/02/2018 05:36, Andy Lutomirski wrote:
>>> On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün  wrote:
>>>> Hi,

> 

>>>> ## Why use the seccomp(2) syscall?
>>>>
>>>> Landlock uses the same semantics as seccomp to apply access rule
>>>> restrictions. It adds a new layer of security for the current process
>>>> which is inherited by its children. It makes sense to use a unique
>>>> access-restricting syscall (that should be allowed by seccomp filters)
>>>> which can only drop privileges. Moreover, a Landlock rule could come
>>>> from outside a process (e.g.  passed through a UNIX socket). It is then
>>>> useful to differentiate the creation/load of Landlock eBPF programs via
>>>> bpf(2), from rule enforcement via seccomp(2).
>>>
>>> This seems like a weak argument to me.  Sure, this is a bit different
>>> from seccomp(), and maybe shoving it into the seccomp() multiplexer is
>>> awkward, but surely the bpf() multiplexer is even less applicable.
>>
>> I think using the seccomp syscall is fine, and everyone agreed on it.
>>
> 
> Ah, sorry, I completely misread what you wrote.  My apologies.  You
> can disregard most of my email.
> 
>>
>>>
>>> Also, looking forward, I think you're going to want a bunch of the
>>> stuff that's under consideration as new seccomp features.  Tycho is
>>> working on a "user notifier" feature for seccomp where, in addition to
>>> accepting, rejecting, or kicking to ptrace, you can send a message to
>>> the creator of the filter and wait for a reply.  I think that Landlock
>>> will want exactly the same feature.
>>
>> I don't see why this would be useful at all here. Landlock does not
>> filter at the syscall level but handles kernel objects and actions as
>> an LSM does. That is the whole purpose of Landlock.
> 
> Suppose I'm writing a container manager.  I want to run "mount" in the
> container, but I don't want to allow mount() in general and I want to
> emulate certain mount() actions.  I can write a filter that catches
> mount using seccomp and calls out to the container manager for help.
> This isn't theoretical -- Tycho wants *exactly* this use case to be
> supported.

Well, I think this use case should be handled with something like
LD_PRELOAD and a helper library. FYI, I did something like this:
https://github.com/stemjail/stemshim

Otherwise, we should think about enabling a process to (dynamically)
extend/patch the vDSO (similar to LD_PRELOAD but at the syscall level
and works with static binaries) for a subset of processes (the same way
seccomp filters are inherited). It may be more powerful and flexible
than extending the kernel/seccomp to patch (buggy?) userland.

> 
> But using seccomp for this is indeed annoying.  It would be nice to
> use Landlock's ability to filter based on the filesystem type, for
> example.  So Tycho could write a Landlock rule like:
> 
> bool filter_mount(...)
> {
>   if (path needs emulation)
>     call_user_notifier();
> }
> 
> And it should work.
> 
> This means that, if both seccomp user notifiers and Landlock make it
> upstream, then there should probably be a way to have a user notifier
> bound to a seccomp filter and a set of landlock filters.
> 

Using seccomp filters and Landlock programs may be powerful. However,
for this use case, I think a *post-syscall* vDSO-like mechanism (which
could get some data returned by a Landlock program) may be much more
flexible (with less kernel code). What is needed here is a way to know
the kernel semantics (Landlock) and a way to patch userland without
patching its code (vDSO-like).




Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-02-27 Thread Andy Lutomirski
On Tue, Feb 27, 2018 at 10:03 PM, Mickaël Salaün  wrote:
>
> On 27/02/2018 05:36, Andy Lutomirski wrote:
>> On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün  wrote:
>>> Hi,
>>>

>>>
>>> ## Why use the seccomp(2) syscall?
>>>
>>> Landlock uses the same semantics as seccomp to apply access rule
>>> restrictions. It adds a new layer of security for the current process
>>> which is inherited by its children. It makes sense to use a unique
>>> access-restricting syscall (that should be allowed by seccomp filters)
>>> which can only drop privileges. Moreover, a Landlock rule could come
>>> from outside a process (e.g.  passed through a UNIX socket). It is then
>>> useful to differentiate the creation/load of Landlock eBPF programs via
>>> bpf(2), from rule enforcement via seccomp(2).
>>
>> This seems like a weak argument to me.  Sure, this is a bit different
>> from seccomp(), and maybe shoving it into the seccomp() multiplexer is
>> awkward, but surely the bpf() multiplexer is even less applicable.
>
> I think using the seccomp syscall is fine, and everyone agreed on it.
>

Ah, sorry, I completely misread what you wrote.  My apologies.  You
can disregard most of my email.

>
>>
>> Also, looking forward, I think you're going to want a bunch of the
>> stuff that's under consideration as new seccomp features.  Tycho is
>> working on a "user notifier" feature for seccomp where, in addition to
>> accepting, rejecting, or kicking to ptrace, you can send a message to
>> the creator of the filter and wait for a reply.  I think that Landlock
>> will want exactly the same feature.
>
> I don't see why this would be useful at all here. Landlock does not
> filter at the syscall level but handles kernel objects and actions as
> an LSM does. That is the whole purpose of Landlock.

Suppose I'm writing a container manager.  I want to run "mount" in the
container, but I don't want to allow mount() in general and I want to
emulate certain mount() actions.  I can write a filter that catches
mount using seccomp and calls out to the container manager for help.
This isn't theoretical -- Tycho wants *exactly* this use case to be
supported.

But using seccomp for this is indeed annoying.  It would be nice to
use Landlock's ability to filter based on the filesystem type, for
example.  So Tycho could write a Landlock rule like:

bool filter_mount(...)
{
  if (path needs emulation)
    call_user_notifier();
}

And it should work.

This means that, if both seccomp user notifiers and Landlock make it
upstream, then there should probably be a way to have a user notifier
bound to a seccomp filter and a set of landlock filters.


Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-02-27 Thread Mickaël Salaün

On 27/02/2018 05:36, Andy Lutomirski wrote:
> On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün  wrote:
>> Hi,
>>
>> This eighth series is a major revamp of the Landlock design compared to
>> the previous series [1]. This enables more flexibility and granularity
>> of access control with file paths. It is now possible to enforce an
>> access control according to a file hierarchy. Landlock uses the concept
>> of inode and path to identify such hierarchy. In a way, it brings tools
>> to program what is a file hierarchy.
>>
>> There are now three types of Landlock hooks: FS_WALK, FS_PICK and FS_GET.
>> Each of them accepts a dedicated eBPF program, called a Landlock
>> program.  They can be chained to enforce a full access control according
>> to a list of directories or files. The set of actions on a file is well
>> defined (e.g. read, write, ioctl, append, lock, mount...) taking
>> inspiration from the major Linux LSMs and some other access-controls
>> like Capsicum.  These program types are designed to be cache-friendly,
>> which gives room for optimizations in the future.
>>
>> The documentation patch contains some kernel documentation and
>> explanations on how to use Landlock.  The compiled documentation and
>> a talk I gave at FOSDEM can be found here: https://landlock.io
>> This patch series can be found in the branch landlock-v8 in this repo:
>> https://github.com/landlock-lsm/linux
>>
>> There are still some minor issues with this patch series but it should
>> demonstrate how powerful this design may be. One of these issues is that
>> it is not a stackable LSM anymore, but the infrastructure management of
>> security blobs should allow stacking it with other LSMs [4].
>>
>> This is the first step of the roadmap discussed at LPC [2].  While the
>> intended final goal is to allow unprivileged users to use Landlock, this
>> series allows only a process with global CAP_SYS_ADMIN to load and
>> enforce a rule.  This may help to get feedback and avoid unexpected
>> behaviors.
>>
>> This series can be applied on top of bpf-next, commit 7d72637eb39f
>> ("Merge branch 'x86-jit'").  This can be tested with
>> CONFIG_SECCOMP_FILTER and CONFIG_SECURITY_LANDLOCK.  I would really
>> appreciate constructive comments on the design and the code.
>>
>>
>> # Landlock LSM
>>
>> The goal of this new Linux Security Module (LSM) called Landlock is to
>> allow any process, including unprivileged ones, to create powerful
>> security sandboxes comparable to XNU Sandbox or OpenBSD Pledge. This
>> kind of sandbox is expected to help mitigate the security impact of bugs
>> or unexpected/malicious behaviors in user-space applications.
>>
>> The approach taken is to add the minimum amount of code while still
>> allowing the user-space application to create quite complex access
>> rules.  A dedicated security policy language such as the one used by
>> SELinux, AppArmor and other major LSMs involves a lot of code and is
>> usually permitted to only a trusted user (i.e. root).  On the contrary,
>> eBPF programs already exist and are designed to be safely loaded by
>> unprivileged user-space.
>>
>> This design does not seem too intrusive but is flexible enough to allow
>> a powerful sandbox mechanism accessible by any process on Linux. The use
>> of seccomp and Landlock is more suitable with the help of a user-space
>> library (e.g.  libseccomp) that could help to specify a high-level
>> language to express a security policy instead of raw eBPF programs.
>> Moreover, thanks to the LLVM front-end, it is quite easy to write an
>> eBPF program with a subset of the C language.
>>
>>
>> # Frequently asked questions
>>
>> ## Why is seccomp-bpf not enough?
>>
>> A seccomp filter can access only raw syscall arguments (i.e. the
>> register values) which means that it is not possible to filter according
>> to the value pointed to by an argument, such as a file pathname. As an
>> embryonic Landlock version demonstrated, filtering at the syscall level
>> is complicated (e.g. need to take care of race conditions). This is
>> mainly because the access control checkpoints of the kernel are not at
>> this high-level but more underneath, at the LSM-hook level. The LSM
>> hooks are designed to handle this kind of checks.  Landlock abstracts
>> this approach to leverage the ability of unprivileged users to limit
>> themselves.
>>
>> Cf. section "What it isn't?" in Documentation/prctl/seccomp_filter.txt
>>
>>
>> ## Why use the seccomp(2) syscall?
>>
>> Landlock uses the same semantics as seccomp to apply access rule
>> restrictions. It adds a new layer of security for the current process
>> which is inherited by its children. It makes sense to use a unique
>> access-restricting syscall (that should be allowed by seccomp filters)
>> which can only drop privileges. Moreover, a Landlock rule could come
>> from outside a process (e.g.  passed through a UNIX socket). It is then
>> useful to differentiate the creation/load of Landlock eBPF 

Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-02-26 Thread Andy Lutomirski
On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün  wrote:
> Hi,
>
> This eighth series is a major revamp of the Landlock design compared to
> the previous series [1]. This enables more flexibility and granularity
> of access control with file paths. It is now possible to enforce an
> access control according to a file hierarchy. Landlock uses the concepts
> of inode and path to identify such a hierarchy. In a way, it provides
> tools to program what a file hierarchy is.
>
> There are now three types of Landlock hooks: FS_WALK, FS_PICK and FS_GET.
> Each of them accepts a dedicated eBPF program, called a Landlock
> program.  They can be chained to enforce a full access control according
> to a list of directories or files. The set of actions on a file is well
> defined (e.g. read, write, ioctl, append, lock, mount...) taking
> inspiration from the major Linux LSMs and some other access-controls
> like Capsicum.  These program types are designed to be cache-friendly,
> which gives room for optimizations in the future.
>
> The documentation patch contains some kernel documentation and
> explanations on how to use Landlock.  The compiled documentation and
> a talk I gave at FOSDEM can be found here: https://landlock.io
> This patch series can be found in the branch landlock-v8 in this repo:
> https://github.com/landlock-lsm/linux
>
> There are still some minor issues with this patch series but it should
> demonstrate how powerful this design may be. One of these issues is that
> it is not a stackable LSM anymore, but the infrastructure management of
> security blobs should allow stacking it with other LSMs [4].
>
> This is the first step of the roadmap discussed at LPC [2].  While the
> intended final goal is to allow unprivileged users to use Landlock, this
> series allows only a process with global CAP_SYS_ADMIN to load and
> enforce a rule.  This may help to get feedback and avoid unexpected
> behaviors.
>
> This series can be applied on top of bpf-next, commit 7d72637eb39f
> ("Merge branch 'x86-jit'").  This can be tested with
> CONFIG_SECCOMP_FILTER and CONFIG_SECURITY_LANDLOCK.  I would really
> appreciate constructive comments on the design and the code.
>
>
> # Landlock LSM
>
> The goal of this new Linux Security Module (LSM) called Landlock is to
> allow any process, including unprivileged ones, to create powerful
> security sandboxes comparable to XNU Sandbox or OpenBSD Pledge. This
> kind of sandbox is expected to help mitigate the security impact of bugs
> or unexpected/malicious behaviors in user-space applications.
>
> The approach taken is to add the minimum amount of code while still
> allowing the user-space application to create quite complex access
> rules.  A dedicated security policy language such as the one used by
> SELinux, AppArmor and other major LSMs involves a lot of code and is
> usually available only to a trusted user (i.e. root).  By contrast,
> eBPF programs already exist and are designed to be safely loaded by
> unprivileged user-space.
>
> This design does not seem too intrusive but is flexible enough to allow
> a powerful sandbox mechanism accessible by any process on Linux.
> Both seccomp and Landlock are best used through a user-space library
> (e.g. libseccomp) that could provide a high-level language to express
> a security policy instead of raw eBPF programs.
> Moreover, thanks to the LLVM front-end, it is quite easy to write an
> eBPF program with a subset of the C language.
>
>
> # Frequently asked questions
>
> ## Why is seccomp-bpf not enough?
>
> A seccomp filter can access only raw syscall arguments (i.e. the
> register values) which means that it is not possible to filter according
> to the value pointed to by an argument, such as a file pathname. As an
> embryonic Landlock version demonstrated, filtering at the syscall level
> is complicated (e.g. race conditions must be handled). This is
> mainly because the access control checkpoints of the kernel are not at
> this high level but lower down, at the LSM-hook level. The LSM
> hooks are designed to handle this kind of check.  Landlock abstracts
> this approach to leverage the ability of unprivileged users to limit
> themselves.
>
> Cf. section "What it isn't?" in Documentation/prctl/seccomp_filter.txt
>
>
> ## Why use the seccomp(2) syscall?
>
> Landlock uses the same semantics as seccomp to apply access rule
> restrictions. It adds a new layer of security for the current process
> which is inherited by its children. It makes sense to use a unique
> access-restricting syscall (that should be allowed by seccomp filters)
> which can only drop privileges. Moreover, a Landlock rule could come
> from outside a process (e.g.  passed through a UNIX socket). It is then
> useful to differentiate the creation/load of Landlock eBPF programs via
> bpf(2), from rule enforcement via seccomp(2).

This seems like a weak argument to me.  Sure, this is a bit different
from seccomp(), 

[PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-02-26 Thread Mickaël Salaün
Hi,

This eighth series is a major revamp of the Landlock design compared to
the previous series [1]. This enables more flexibility and granularity
of access control with file paths. It is now possible to enforce an
access control according to a file hierarchy. Landlock uses the concepts
of inode and path to identify such a hierarchy. In a way, it provides
tools to program what a file hierarchy is.

There are now three types of Landlock hooks: FS_WALK, FS_PICK and FS_GET.
Each of them accepts a dedicated eBPF program, called a Landlock
program.  They can be chained to enforce a full access control according
to a list of directories or files. The set of actions on a file is well
defined (e.g. read, write, ioctl, append, lock, mount...) taking
inspiration from the major Linux LSMs and some other access-controls
like Capsicum.  These program types are designed to be cache-friendly,
which gives room for optimizations in the future.

The documentation patch contains some kernel documentation and
explanations on how to use Landlock.  The compiled documentation and
a talk I gave at FOSDEM can be found here: https://landlock.io
This patch series can be found in the branch landlock-v8 in this repo:
https://github.com/landlock-lsm/linux

There are still some minor issues with this patch series but it should
demonstrate how powerful this design may be. One of these issues is that
it is not a stackable LSM anymore, but the infrastructure management of
security blobs should allow stacking it with other LSMs [4].

This is the first step of the roadmap discussed at LPC [2].  While the
intended final goal is to allow unprivileged users to use Landlock, this
series allows only a process with global CAP_SYS_ADMIN to load and
enforce a rule.  This may help to get feedback and avoid unexpected
behaviors.

This series can be applied on top of bpf-next, commit 7d72637eb39f
("Merge branch 'x86-jit'").  This can be tested with
CONFIG_SECCOMP_FILTER and CONFIG_SECURITY_LANDLOCK.  I would really
appreciate constructive comments on the design and the code.


# Landlock LSM

The goal of this new Linux Security Module (LSM) called Landlock is to
allow any process, including unprivileged ones, to create powerful
security sandboxes comparable to XNU Sandbox or OpenBSD Pledge. This
kind of sandbox is expected to help mitigate the security impact of bugs
or unexpected/malicious behaviors in user-space applications.

The approach taken is to add the minimum amount of code while still
allowing the user-space application to create quite complex access
rules.  A dedicated security policy language such as the one used by
SELinux, AppArmor and other major LSMs involves a lot of code and is
usually available only to a trusted user (i.e. root).  By contrast,
eBPF programs already exist and are designed to be safely loaded by
unprivileged user-space.

This design does not seem too intrusive but is flexible enough to allow
a powerful sandbox mechanism accessible by any process on Linux.
Both seccomp and Landlock are best used through a user-space library
(e.g. libseccomp) that could provide a high-level language to express
a security policy instead of raw eBPF programs.
Moreover, thanks to the LLVM front-end, it is quite easy to write an
eBPF program with a subset of the C language.


# Frequently asked questions

## Why is seccomp-bpf not enough?

A seccomp filter can access only raw syscall arguments (i.e. the
register values) which means that it is not possible to filter according
to the value pointed to by an argument, such as a file pathname. As an
embryonic Landlock version demonstrated, filtering at the syscall level
is complicated (e.g. race conditions must be handled). This is
mainly because the access control checkpoints of the kernel are not at
this high level but lower down, at the LSM-hook level. The LSM
hooks are designed to handle this kind of check.  Landlock abstracts
this approach to leverage the ability of unprivileged users to limit
themselves.

Cf. section "What it isn't?" in Documentation/prctl/seccomp_filter.txt
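
A minimal sketch of this limitation, in plain C rather than as a real
filter. The struct below only mirrors the kind of data a seccomp filter
receives (a syscall number plus raw register values); the names
syscall_view, view_of_open, same_to_filter and demo_pointer_blindness
are hypothetical and exist only for this illustration. It shows that two
calls with different pathnames stored in the same buffer look identical
at the register level, which is also why a check made on the pointed-to
string can race with a later rewrite of that buffer.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative mirror of what a syscall-level filter can see:
 * a syscall number and six raw argument registers. Hypothetical type. */
struct syscall_view {
    int nr;
    uint64_t args[6];
};

/* Record what a filter would observe for an open(path, flags) call.
 * The syscall number used here is only a placeholder. */
struct syscall_view view_of_open(const char *path, int flags)
{
    struct syscall_view v;

    memset(&v, 0, sizeof(v));
    v.nr = 2;                                /* placeholder number */
    v.args[0] = (uint64_t)(uintptr_t)path;   /* the pointer, not the string */
    v.args[1] = (uint64_t)flags;
    return v;
}

/* Return 1 when a filter limited to raw values cannot tell the calls apart. */
int same_to_filter(const struct syscall_view *a, const struct syscall_view *b)
{
    return a->nr == b->nr &&
           memcmp(a->args, b->args, sizeof(a->args)) == 0;
}

/* Two different pathnames, same buffer address: identical raw arguments. */
int demo_pointer_blindness(void)
{
    char buf[32];
    struct syscall_view first, second;

    strcpy(buf, "/tmp/harmless");
    first = view_of_open(buf, 0);

    strcpy(buf, "/etc/shadow");              /* buffer rewritten in place */
    second = view_of_open(buf, 0);

    return same_to_filter(&first, &second);
}
```

An LSM hook, by contrast, runs after the kernel has resolved the
pathname, so it works on the object actually being accessed.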


## Why use the seccomp(2) syscall?

Landlock uses the same semantics as seccomp to apply access rule
restrictions. It adds a new layer of security for the current process
which is inherited by its children. It makes sense to use a unique
access-restricting syscall (that should be allowed by seccomp filters)
which can only drop privileges. Moreover, a Landlock rule could come
from outside a process (e.g.  passed through a UNIX socket). It is then
useful to differentiate the creation/load of Landlock eBPF programs via
bpf(2), from rule enforcement via seccomp(2).
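
The layering semantics described above can be modelled in a few lines of
C. This is only a toy model under stated assumptions, not the Landlock
or seccomp API: the access bits, struct sandbox and the sandbox_*
functions are all hypothetical. Each restriction call adds a layer, a
child starts with a copy of its parent's layers, and the effective
rights are the intersection of all layers, so a new layer can only drop
privileges.

```c
#include <stdint.h>

#define MAX_LAYERS 8

/* Hypothetical access bits, for illustration only. */
enum { ACC_READ = 1u << 0, ACC_WRITE = 1u << 1, ACC_EXEC = 1u << 2 };

/* Toy model of a stack of restriction layers attached to a process. */
struct sandbox {
    uint32_t layers[MAX_LAYERS];
    int depth;
};

/* Add one more layer. Returns 0 on success, -1 if the stack is full. */
int sandbox_restrict(struct sandbox *s, uint32_t allowed)
{
    if (s->depth >= MAX_LAYERS)
        return -1;
    s->layers[s->depth++] = allowed;
    return 0;
}

/* Effective rights are the AND of every layer: adding a layer can
 * remove rights but never grant new ones. */
uint32_t sandbox_effective(const struct sandbox *s)
{
    uint32_t eff = ACC_READ | ACC_WRITE | ACC_EXEC;

    for (int i = 0; i < s->depth; i++)
        eff &= s->layers[i];
    return eff;
}

/* A child inherits a copy of the parent's layers. */
struct sandbox sandbox_inherit(const struct sandbox *parent)
{
    return *parent;
}

/* The child asks for READ|EXEC on top of the parent's READ|WRITE and
 * ends up with READ only; the parent is unaffected. Returns 1 if so. */
int demo_layers(void)
{
    struct sandbox parent = { {0}, 0 };
    struct sandbox child;

    sandbox_restrict(&parent, ACC_READ | ACC_WRITE);
    child = sandbox_inherit(&parent);
    sandbox_restrict(&child, ACC_READ | ACC_EXEC);

    return sandbox_effective(&parent) == (ACC_READ | ACC_WRITE) &&
           sandbox_effective(&child) == ACC_READ;
}
```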


## Why a new LSM? Are SELinux, AppArmor, Smack and Tomoyo not good
   enough?

The current access control LSMs are fine for their purpose, which is to
give *root* the ability to enforce a security policy for the
*system*. What is missing is a way to enforce a security policy for any
application by its developer and