Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing
Hi Mickaël,

On Mon, Apr 02, 2018 at 12:04:36AM +0200, Mickaël Salaün wrote:
> >> The vDSO is code mapped into all processes. As you said, these
> >> processes may use it or not. What I was thinking about is to use the
> >> same concept, i.e. map "shim" code into each process belonging to a
> >> particular hierarchy (the same way seccomp filters are inherited
> >> across processes). With a seccomp filter matching some syscall (e.g.
> >> mount, open), it is possible to jump back to the shim code thanks to
> >> SECCOMP_RET_TRAP. This shim code should then be able to emulate/patch
> >> what is needed, even faking a file opening by receiving a file
> >> descriptor through a UNIX socket. As the Chrome sandbox did, the
> >> seccomp filter may look at the calling address to allow the shim code
> >> to call syscalls without being caught, if needed. However, relying on
> >> SIGSYS may not fit with arbitrary code. A new SECCOMP_RET_EMULATE (?)
> >> could be used to jump to a specific process address, to emulate the
> >> syscall in an easier way than only relying on a {c,e}BPF program.
> >
> > This could indeed be done, but I think that Tycho's approach is much
> > cleaner and probably faster.
>
> I like it too but how does this handle file descriptors?

I think it could be done fairly simply; the most complicated part is
probably designing an API that doesn't suck. But the basic idea would
be:

    struct seccomp_notif_resp {
        __u64 id;
        __s32 error;
        __s64 val;
        __s32 fd;
    };

If the handler responds with fd >= 0, we grab the tracer's fd,
duplicate it, and install it somewhere in the tracee's fd table. Since
things like socket() will want to return the fd number as it's
installed, and the handler doesn't know that number, we'll probably
want some way to indicate that the kernel should return this value.

We could either mandate that if fd >= 0, that's the value that will be
returned from the syscall, or add another flag that says "no, install
the fd, but really return what's in val instead". I guess we can't
mandate that we return fd, because e.g. netlink sockets can sometimes
return fds as part of the netlink messages, and not as the return value
from the syscall.

Tycho
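The two return-value conventions discussed in this message can be sketched in a few lines of userspace C. This is only an illustration of the proposal, not kernel code and not a merged API: `notif_resp_sketch`, its `flags` field, `RESP_RETURN_VAL` and `syscall_result()` are hypothetical names, and `error` is assumed to hold 0 or a negative errno value.

```c
#include <stdint.h>

/* Hypothetical flag: install the fd, but return `val` to the tracee. */
#define RESP_RETURN_VAL 0x1u

/*
 * Userspace mirror of the response layout listed in the message, plus a
 * hypothetical `flags` field standing in for the "another flag" option.
 */
struct notif_resp_sketch {
    uint64_t id;     /* identifies the request being answered */
    int32_t  error;  /* assumed: 0 on success, negative errno otherwise */
    int64_t  val;    /* return value chosen by the handler */
    int32_t  fd;     /* handler-side fd to install, or -1 for none */
    uint32_t flags;  /* hypothetical, see RESP_RETURN_VAL */
};

/*
 * Compute what the intercepted syscall would return to the tracee.
 * `installed_fd` is the fd number picked in the tracee's fd table; it is
 * only meaningful when resp->fd >= 0.
 */
int64_t syscall_result(const struct notif_resp_sketch *resp, int installed_fd)
{
    if (resp->error != 0)
        return resp->error;
    if (resp->fd >= 0 && !(resp->flags & RESP_RETURN_VAL))
        return installed_fd;  /* socket()-like: the new fd is the result */
    return resp->val;         /* netlink-like: the fd travels elsewhere */
}
```

With the flag clear, a socket()-style emulation returns the freshly installed fd number; with it set, the fd is still installed but the caller sees `val`, which covers the netlink case mentioned above.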
Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing
On Thu, Mar 8, 2018 at 11:51 PM, Mickaël Salaün wrote:
>
> On 07/03/2018 02:21, Andy Lutomirski wrote:
>> On Tue, Mar 6, 2018 at 11:06 PM, Mickaël Salaün wrote:
>>>
>>> On 06/03/2018 23:46, Tycho Andersen wrote:
>>>> On Tue, Mar 06, 2018 at 10:33:17PM +, Andy Lutomirski wrote:
>>>>>>> Suppose I'm writing a container manager. I want to run "mount" in the
>>>>>>> container, but I don't want to allow mount() in general and I want to
>>>>>>> emulate certain mount() actions. I can write a filter that catches
>>>>>>> mount using seccomp and calls out to the container manager for help.
>>>>>>> This isn't theoretical -- Tycho wants *exactly* this use case to be
>>>>>>> supported.
>>>>>>
>>>>>> Well, I think this use case should be handled with something like
>>>>>> LD_PRELOAD and a helper library. FYI, I did something like this:
>>>>>> https://github.com/stemjail/stemshim
>>>>>
>>>>> I doubt that will work for containers. Containers that use user
>>>>> namespaces and, for example, setuid programs aren't going to honor
>>>>> LD_PRELOAD.
>>>>
>>>> Or anything that calls syscalls directly, like Go programs.
>>>
>>> That's why I proposed the vDSO-like approach. Enforcing access control
>>> is not the issue here; patching a buggy userland (without patching its
>>> code) is the issue, isn't it?
>>>
>>> As far as I remember, the main problem is handling file descriptors
>>> while "emulating" the kernel behavior. This can be done with "shim"
>>> code mapped into every process. Chrome used something like this (in a
>>> previous sandbox mechanism) as a kind of emulation (with the current
>>> seccomp-bpf). I think it should be doable to replace the (userland)
>>> emulation code with an IPC wrapper receiving file descriptors through
>>> a UNIX socket.
>>
>> Can you explain exactly what you mean by "vDSO-like"?
>>
>> When a 64-bit program does a syscall, it just executes the SYSCALL
>> instruction. The vDSO isn't involved at all. 32-bit programs usually
>> go through the vDSO, but not always.
>>
>> It could be possible to force-load a DSO into an entire container and
>> rig up seccomp to intercept all SYSCALLs not originating from the DSO
>> such that they merely redirect control to the DSO, but that seems
>> quite messy.
>
> The vDSO is code mapped into all processes. As you said, these
> processes may use it or not. What I was thinking about is to use the
> same concept, i.e. map "shim" code into each process belonging to a
> particular hierarchy (the same way seccomp filters are inherited
> across processes). With a seccomp filter matching some syscall (e.g.
> mount, open), it is possible to jump back to the shim code thanks to
> SECCOMP_RET_TRAP. This shim code should then be able to emulate/patch
> what is needed, even faking a file opening by receiving a file
> descriptor through a UNIX socket. As the Chrome sandbox did, the
> seccomp filter may look at the calling address to allow the shim code
> to call syscalls without being caught, if needed. However, relying on
> SIGSYS may not fit with arbitrary code. A new SECCOMP_RET_EMULATE (?)
> could be used to jump to a specific process address, to emulate the
> syscall in an easier way than only relying on a {c,e}BPF program.

This could indeed be done, but I think that Tycho's approach is much
cleaner and probably faster.
Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing
On 07/03/2018 02:21, Andy Lutomirski wrote:
> On Tue, Mar 6, 2018 at 11:06 PM, Mickaël Salaün wrote:
>>
>> On 06/03/2018 23:46, Tycho Andersen wrote:
>>> On Tue, Mar 06, 2018 at 10:33:17PM +, Andy Lutomirski wrote:
>>>>>> Suppose I'm writing a container manager. I want to run "mount" in the
>>>>>> container, but I don't want to allow mount() in general and I want to
>>>>>> emulate certain mount() actions. I can write a filter that catches
>>>>>> mount using seccomp and calls out to the container manager for help.
>>>>>> This isn't theoretical -- Tycho wants *exactly* this use case to be
>>>>>> supported.
>>>>>
>>>>> Well, I think this use case should be handled with something like
>>>>> LD_PRELOAD and a helper library. FYI, I did something like this:
>>>>> https://github.com/stemjail/stemshim
>>>>
>>>> I doubt that will work for containers. Containers that use user
>>>> namespaces and, for example, setuid programs aren't going to honor
>>>> LD_PRELOAD.
>>>
>>> Or anything that calls syscalls directly, like Go programs.
>>
>> That's why I proposed the vDSO-like approach. Enforcing access control
>> is not the issue here; patching a buggy userland (without patching its
>> code) is the issue, isn't it?
>>
>> As far as I remember, the main problem is handling file descriptors
>> while "emulating" the kernel behavior. This can be done with "shim"
>> code mapped into every process. Chrome used something like this (in a
>> previous sandbox mechanism) as a kind of emulation (with the current
>> seccomp-bpf). I think it should be doable to replace the (userland)
>> emulation code with an IPC wrapper receiving file descriptors through
>> a UNIX socket.
>
> Can you explain exactly what you mean by "vDSO-like"?
>
> When a 64-bit program does a syscall, it just executes the SYSCALL
> instruction. The vDSO isn't involved at all. 32-bit programs usually
> go through the vDSO, but not always.
>
> It could be possible to force-load a DSO into an entire container and
> rig up seccomp to intercept all SYSCALLs not originating from the DSO
> such that they merely redirect control to the DSO, but that seems
> quite messy.

The vDSO is code mapped into all processes. As you said, these
processes may use it or not. What I was thinking about is to use the
same concept, i.e. map "shim" code into each process belonging to a
particular hierarchy (the same way seccomp filters are inherited
across processes). With a seccomp filter matching some syscall (e.g.
mount, open), it is possible to jump back to the shim code thanks to
SECCOMP_RET_TRAP. This shim code should then be able to emulate/patch
what is needed, even faking a file opening by receiving a file
descriptor through a UNIX socket. As the Chrome sandbox did, the
seccomp filter may look at the calling address to allow the shim code
to call syscalls without being caught, if needed. However, relying on
SIGSYS may not fit with arbitrary code. A new SECCOMP_RET_EMULATE (?)
could be used to jump to a specific process address, to emulate the
syscall in an easier way than only relying on a {c,e}BPF program.
Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing
On Tue, Mar 6, 2018 at 11:06 PM, Mickaël Salaün wrote:
>
> On 06/03/2018 23:46, Tycho Andersen wrote:
>> On Tue, Mar 06, 2018 at 10:33:17PM +, Andy Lutomirski wrote:
>>>>> Suppose I'm writing a container manager. I want to run "mount" in the
>>>>> container, but I don't want to allow mount() in general and I want to
>>>>> emulate certain mount() actions. I can write a filter that catches
>>>>> mount using seccomp and calls out to the container manager for help.
>>>>> This isn't theoretical -- Tycho wants *exactly* this use case to be
>>>>> supported.
>>>>
>>>> Well, I think this use case should be handled with something like
>>>> LD_PRELOAD and a helper library. FYI, I did something like this:
>>>> https://github.com/stemjail/stemshim
>>>
>>> I doubt that will work for containers. Containers that use user
>>> namespaces and, for example, setuid programs aren't going to honor
>>> LD_PRELOAD.
>>
>> Or anything that calls syscalls directly, like Go programs.
>
> That's why I proposed the vDSO-like approach. Enforcing access control
> is not the issue here; patching a buggy userland (without patching its
> code) is the issue, isn't it?
>
> As far as I remember, the main problem is handling file descriptors
> while "emulating" the kernel behavior. This can be done with "shim"
> code mapped into every process. Chrome used something like this (in a
> previous sandbox mechanism) as a kind of emulation (with the current
> seccomp-bpf). I think it should be doable to replace the (userland)
> emulation code with an IPC wrapper receiving file descriptors through
> a UNIX socket.

Can you explain exactly what you mean by "vDSO-like"?

When a 64-bit program does a syscall, it just executes the SYSCALL
instruction. The vDSO isn't involved at all. 32-bit programs usually
go through the vDSO, but not always.

It could be possible to force-load a DSO into an entire container and
rig up seccomp to intercept all SYSCALLs not originating from the DSO
such that they merely redirect control to the DSO, but that seems
quite messy.
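To make the point about the SYSCALL instruction concrete, here is a small sketch contrasting a call through a libc symbol with a call that enters the kernel directly. The function names are made up for the example. The inline assembly path is x86-64 only; on other architectures the sketch falls back to syscall(2), which is itself a libc function and is used there only to keep the example portable.

```c
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Calls the libc symbol getpid(); a preloaded library can interpose it. */
pid_t pid_via_libc(void)
{
    return getpid();
}

/*
 * On x86-64, executes the SYSCALL instruction directly. No dynamic symbol
 * is involved, so LD_PRELOAD (or any DSO-based shim) never sees the call.
 */
pid_t pid_via_raw_syscall(void)
{
#if defined(__x86_64__)
    long ret;

    /* rax carries the syscall number in and the result out;
     * SYSCALL clobbers rcx and r11. */
    __asm__ volatile("syscall"
                     : "=a"(ret)
                     : "a"((long)SYS_getpid)
                     : "rcx", "r11", "memory");
    return (pid_t)ret;
#else
    return (pid_t)syscall(SYS_getpid);
#endif
}
```

Both functions return the same PID, but only the first can be redirected from userspace without kernel help, which is why statically linked binaries and runtimes that issue syscalls themselves escape a preload-based shim.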
Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing
On 06/03/2018 23:46, Tycho Andersen wrote:
> On Tue, Mar 06, 2018 at 10:33:17PM +, Andy Lutomirski wrote:
>>>> Suppose I'm writing a container manager. I want to run "mount" in the
>>>> container, but I don't want to allow mount() in general and I want to
>>>> emulate certain mount() actions. I can write a filter that catches
>>>> mount using seccomp and calls out to the container manager for help.
>>>> This isn't theoretical -- Tycho wants *exactly* this use case to be
>>>> supported.
>>>
>>> Well, I think this use case should be handled with something like
>>> LD_PRELOAD and a helper library. FYI, I did something like this:
>>> https://github.com/stemjail/stemshim
>>
>> I doubt that will work for containers. Containers that use user
>> namespaces and, for example, setuid programs aren't going to honor
>> LD_PRELOAD.
>
> Or anything that calls syscalls directly, like Go programs.

That's why I proposed the vDSO-like approach. Enforcing access control
is not the issue here; patching a buggy userland (without patching its
code) is the issue, isn't it?

As far as I remember, the main problem is handling file descriptors
while "emulating" the kernel behavior. This can be done with "shim"
code mapped into every process. Chrome used something like this (in a
previous sandbox mechanism) as a kind of emulation (with the current
seccomp-bpf). I think it should be doable to replace the (userland)
emulation code with an IPC wrapper receiving file descriptors through
a UNIX socket.
Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing
On Tue, Mar 06, 2018 at 10:33:17PM +, Andy Lutomirski wrote:
> >> Suppose I'm writing a container manager. I want to run "mount" in the
> >> container, but I don't want to allow mount() in general and I want to
> >> emulate certain mount() actions. I can write a filter that catches
> >> mount using seccomp and calls out to the container manager for help.
> >> This isn't theoretical -- Tycho wants *exactly* this use case to be
> >> supported.
> >
> > Well, I think this use case should be handled with something like
> > LD_PRELOAD and a helper library. FYI, I did something like this:
> > https://github.com/stemjail/stemshim
>
> I doubt that will work for containers. Containers that use user
> namespaces and, for example, setuid programs aren't going to honor
> LD_PRELOAD.

Or anything that calls syscalls directly, like Go programs.

Tycho
Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing
On Tue, Mar 6, 2018 at 10:25 PM, Mickaël Salaün wrote:
>
> On 28/02/2018 00:09, Andy Lutomirski wrote:
>> On Tue, Feb 27, 2018 at 10:03 PM, Mickaël Salaün wrote:
>>>
>>> On 27/02/2018 05:36, Andy Lutomirski wrote:
>>>> On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün wrote:
>>>>> Hi,
>>>>>
>>>>> ## Why use the seccomp(2) syscall?
>>>>>
>>>>> Landlock uses the same semantics as seccomp to apply access rule
>>>>> restrictions. It adds a new layer of security for the current process
>>>>> which is inherited by its children. It makes sense to use a unique
>>>>> access-restricting syscall (that should be allowed by seccomp filters)
>>>>> which can only drop privileges. Moreover, a Landlock rule could come
>>>>> from outside a process (e.g. passed through a UNIX socket). It is then
>>>>> useful to differentiate the creation/load of Landlock eBPF programs via
>>>>> bpf(2), from rule enforcement via seccomp(2).
>>>>
>>>> This seems like a weak argument to me. Sure, this is a bit different
>>>> from seccomp(), and maybe shoving it into the seccomp() multiplexer is
>>>> awkward, but surely the bpf() multiplexer is even less applicable.
>>>
>>> I think using the seccomp syscall is fine, and everyone agreed on it.
>>
>> Ah, sorry, I completely misread what you wrote. My apologies. You
>> can disregard most of my email.
>>
>>>> Also, looking forward, I think you're going to want a bunch of the
>>>> stuff that's under consideration as new seccomp features. Tycho is
>>>> working on a "user notifier" feature for seccomp where, in addition to
>>>> accepting, rejecting, or kicking to ptrace, you can send a message to
>>>> the creator of the filter and wait for a reply. I think that Landlock
>>>> will want exactly the same feature.
>>>
>>> I don't see why this would be useful at all here. Landlock does not
>>> filter at the syscall level but handles kernel objects and actions as
>>> an LSM does. That is the whole purpose of Landlock.
>>
>> Suppose I'm writing a container manager. I want to run "mount" in the
>> container, but I don't want to allow mount() in general and I want to
>> emulate certain mount() actions. I can write a filter that catches
>> mount using seccomp and calls out to the container manager for help.
>> This isn't theoretical -- Tycho wants *exactly* this use case to be
>> supported.
>
> Well, I think this use case should be handled with something like
> LD_PRELOAD and a helper library. FYI, I did something like this:
> https://github.com/stemjail/stemshim

I doubt that will work for containers. Containers that use user
namespaces and, for example, setuid programs aren't going to honor
LD_PRELOAD.

> Otherwise, we should think about enabling a process to (dynamically)
> extend/patch the vDSO (similar to LD_PRELOAD but at the syscall level,
> and working with static binaries) for a subset of processes (the same
> way seccomp filters are inherited). It may be more powerful and
> flexible than extending the kernel/seccomp to patch (buggy?) userland.

Egads!
Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing
On 28/02/2018 00:09, Andy Lutomirski wrote:
> On Tue, Feb 27, 2018 at 10:03 PM, Mickaël Salaün wrote:
>>
>> On 27/02/2018 05:36, Andy Lutomirski wrote:
>>> On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün wrote:
>>>> Hi,
>>>>
>>>> ## Why use the seccomp(2) syscall?
>>>>
>>>> Landlock uses the same semantics as seccomp to apply access rule
>>>> restrictions. It adds a new layer of security for the current process
>>>> which is inherited by its children. It makes sense to use a unique
>>>> access-restricting syscall (that should be allowed by seccomp filters)
>>>> which can only drop privileges. Moreover, a Landlock rule could come
>>>> from outside a process (e.g. passed through a UNIX socket). It is then
>>>> useful to differentiate the creation/load of Landlock eBPF programs via
>>>> bpf(2), from rule enforcement via seccomp(2).
>>>
>>> This seems like a weak argument to me. Sure, this is a bit different
>>> from seccomp(), and maybe shoving it into the seccomp() multiplexer is
>>> awkward, but surely the bpf() multiplexer is even less applicable.
>>
>> I think using the seccomp syscall is fine, and everyone agreed on it.
>
> Ah, sorry, I completely misread what you wrote. My apologies. You
> can disregard most of my email.
>
>>> Also, looking forward, I think you're going to want a bunch of the
>>> stuff that's under consideration as new seccomp features. Tycho is
>>> working on a "user notifier" feature for seccomp where, in addition to
>>> accepting, rejecting, or kicking to ptrace, you can send a message to
>>> the creator of the filter and wait for a reply. I think that Landlock
>>> will want exactly the same feature.
>>
>> I don't see why this would be useful at all here. Landlock does not
>> filter at the syscall level but handles kernel objects and actions as
>> an LSM does. That is the whole purpose of Landlock.
>
> Suppose I'm writing a container manager. I want to run "mount" in the
> container, but I don't want to allow mount() in general and I want to
> emulate certain mount() actions. I can write a filter that catches
> mount using seccomp and calls out to the container manager for help.
> This isn't theoretical -- Tycho wants *exactly* this use case to be
> supported.

Well, I think this use case should be handled with something like
LD_PRELOAD and a helper library. FYI, I did something like this:
https://github.com/stemjail/stemshim

Otherwise, we should think about enabling a process to (dynamically)
extend/patch the vDSO (similar to LD_PRELOAD but at the syscall level,
and working with static binaries) for a subset of processes (the same
way seccomp filters are inherited). It may be more powerful and
flexible than extending the kernel/seccomp to patch (buggy?) userland.

> But using seccomp for this is indeed annoying. It would be nice to
> use Landlock's ability to filter based on the filesystem type, for
> example. So Tycho could write a Landlock rule like:
>
> bool filter_mount(...)
> {
>     if (path needs emulation)
>         call_user_notifier();
> }
>
> And it should work.
>
> This means that, if both seccomp user notifiers and Landlock make it
> upstream, then there should probably be a way to have a user notifier
> bound to a seccomp filter and a set of Landlock filters.

Using seccomp filters and Landlock programs together may be powerful.
However, for this use case, I think a *post-syscall* vDSO-like
mechanism (which could get some data returned by a Landlock program)
may be much more flexible (with less kernel code). What is needed here
is a way to know the kernel semantics (Landlock) and a way to patch
userland without patching its code (vDSO-like).
Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing
On Tue, Feb 27, 2018 at 10:03 PM, Mickaël Salaün wrote:
>
> On 27/02/2018 05:36, Andy Lutomirski wrote:
>> On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün wrote:
>>> Hi,
>>>
>>> ## Why use the seccomp(2) syscall?
>>>
>>> Landlock uses the same semantics as seccomp to apply access rule
>>> restrictions. It adds a new layer of security for the current process
>>> which is inherited by its children. It makes sense to use a unique
>>> access-restricting syscall (that should be allowed by seccomp filters)
>>> which can only drop privileges. Moreover, a Landlock rule could come
>>> from outside a process (e.g. passed through a UNIX socket). It is then
>>> useful to differentiate the creation/load of Landlock eBPF programs via
>>> bpf(2), from rule enforcement via seccomp(2).
>>
>> This seems like a weak argument to me. Sure, this is a bit different
>> from seccomp(), and maybe shoving it into the seccomp() multiplexer is
>> awkward, but surely the bpf() multiplexer is even less applicable.
>
> I think using the seccomp syscall is fine, and everyone agreed on it.

Ah, sorry, I completely misread what you wrote. My apologies. You
can disregard most of my email.

>> Also, looking forward, I think you're going to want a bunch of the
>> stuff that's under consideration as new seccomp features. Tycho is
>> working on a "user notifier" feature for seccomp where, in addition to
>> accepting, rejecting, or kicking to ptrace, you can send a message to
>> the creator of the filter and wait for a reply. I think that Landlock
>> will want exactly the same feature.
>
> I don't see why this would be useful at all here. Landlock does not
> filter at the syscall level but handles kernel objects and actions as
> an LSM does. That is the whole purpose of Landlock.

Suppose I'm writing a container manager. I want to run "mount" in the
container, but I don't want to allow mount() in general and I want to
emulate certain mount() actions. I can write a filter that catches
mount using seccomp and calls out to the container manager for help.
This isn't theoretical -- Tycho wants *exactly* this use case to be
supported.

But using seccomp for this is indeed annoying. It would be nice to
use Landlock's ability to filter based on the filesystem type, for
example. So Tycho could write a Landlock rule like:

bool filter_mount(...)
{
    if (path needs emulation)
        call_user_notifier();
}

And it should work.

This means that, if both seccomp user notifiers and Landlock make it
upstream, then there should probably be a way to have a user notifier
bound to a seccomp filter and a set of Landlock filters.
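The filter_mount() pseudocode in this message leaves the rule's inputs and return value open. As a purely illustrative userspace mock (Landlock programs in this series are eBPF, and none of the names below come from the patches), one possible reading is a rule that denies mount() by default and defers to a notifier callback for the filesystem types the container manager wants to emulate. The callback type, the `emulated_fstypes` list, the "true means allow" convention and the sample notifier are all assumptions made for the sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the user notifier: asks the container
 * manager and reports whether the mount should go ahead. */
typedef bool (*user_notifier_fn)(const char *fstype, const char *target);

/* Hypothetical policy: filesystem types the manager emulates. */
static const char *const emulated_fstypes[] = { "proc", "sysfs" };

static bool needs_emulation(const char *fstype)
{
    size_t n = sizeof(emulated_fstypes) / sizeof(emulated_fstypes[0]);

    for (size_t i = 0; i < n; i++)
        if (strcmp(fstype, emulated_fstypes[i]) == 0)
            return true;
    return false;
}

/* Assumed convention: true allows the mount, false denies it. */
bool filter_mount(const char *fstype, const char *target,
                  user_notifier_fn notify)
{
    if (needs_emulation(fstype))
        return notify(fstype, target);
    return false; /* assumed default: mount() is not allowed in general */
}

/* Sample notifier: the "manager" approves targets under /mnt/ only. */
bool manager_approves_mnt(const char *fstype, const char *target)
{
    (void)fstype;
    return strncmp(target, "/mnt/", 5) == 0;
}
```

The design point this mock tries to capture is the one made in the message: the decision of whether to call out is taken on kernel-level information (here the filesystem type) rather than on raw syscall arguments.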
Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing
On 27/02/2018 05:36, Andy Lutomirski wrote:
> On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün wrote:
>> Hi,
>>
>> This eighth series is a major revamp of the Landlock design compared to
>> the previous series [1]. This enables more flexibility and granularity
>> of access control with file paths. It is now possible to enforce an
>> access control according to a file hierarchy. Landlock uses the concept
>> of inode and path to identify such a hierarchy. In a way, it brings tools
>> to program what is a file hierarchy.
>>
>> There are now three types of Landlock hooks: FS_WALK, FS_PICK and FS_GET.
>> Each of them accepts a dedicated eBPF program, called a Landlock
>> program. They can be chained to enforce a full access control according
>> to a list of directories or files. The set of actions on a file is well
>> defined (e.g. read, write, ioctl, append, lock, mount...) taking
>> inspiration from the major Linux LSMs and some other access-controls
>> like Capsicum. These program types are designed to be cache-friendly,
>> which gives room for optimizations in the future.
>>
>> The documentation patch contains some kernel documentation and
>> explanations on how to use Landlock. The compiled documentation and
>> a talk I gave at FOSDEM can be found here: https://landlock.io
>> This patch series can be found in the branch landlock-v8 in this repo:
>> https://github.com/landlock-lsm/linux
>>
>> There are still some minor issues with this patch series but it should
>> demonstrate how powerful this design may be. One of these issues is that
>> it is not a stackable LSM anymore, but the infrastructure management of
>> security blobs should allow stacking it with other LSMs [4].
>>
>> This is the first step of the roadmap discussed at LPC [2]. While the
>> intended final goal is to allow unprivileged users to use Landlock, this
>> series allows only a process with global CAP_SYS_ADMIN to load and
>> enforce a rule. This may help to get feedback and avoid unexpected
>> behaviors.
>>
>> This series can be applied on top of bpf-next, commit 7d72637eb39f
>> ("Merge branch 'x86-jit'"). This can be tested with
>> CONFIG_SECCOMP_FILTER and CONFIG_SECURITY_LANDLOCK. I would really
>> appreciate constructive comments on the design and the code.
>>
>>
>> # Landlock LSM
>>
>> The goal of this new Linux Security Module (LSM) called Landlock is to
>> allow any process, including unprivileged ones, to create powerful
>> security sandboxes comparable to XNU Sandbox or OpenBSD Pledge. This
>> kind of sandbox is expected to help mitigate the security impact of bugs
>> or unexpected/malicious behaviors in user-space applications.
>>
>> The approach taken is to add the minimum amount of code while still
>> allowing the user-space application to create quite complex access
>> rules. A dedicated security policy language such as the one used by
>> SELinux, AppArmor and other major LSMs involves a lot of code and is
>> usually permitted to only a trusted user (i.e. root). On the contrary,
>> eBPF programs already exist and are designed to be safely loaded by
>> unprivileged user-space.
>>
>> This design does not seem too intrusive but is flexible enough to allow
>> a powerful sandbox mechanism accessible by any process on Linux. The use
>> of seccomp and Landlock is more suitable with the help of a user-space
>> library (e.g. libseccomp) that could help to specify a high-level
>> language to express a security policy instead of raw eBPF programs.
>> Moreover, thanks to the LLVM front-end, it is quite easy to write an
>> eBPF program with a subset of the C language.
>>
>>
>> # Frequently asked questions
>>
>> ## Why is seccomp-bpf not enough?
>>
>> A seccomp filter can access only raw syscall arguments (i.e. the
>> register values) which means that it is not possible to filter according
>> to the value pointed to by an argument, such as a file pathname. As an
>> embryonic Landlock version demonstrated, filtering at the syscall level
>> is complicated (e.g. need to take care of race conditions). This is
>> mainly because the access control checkpoints of the kernel are not at
>> this high level but more underneath, at the LSM-hook level. The LSM
>> hooks are designed to handle this kind of check. Landlock abstracts
>> this approach to leverage the ability of unprivileged users to limit
>> themselves.
>>
>> Cf. section "What it isn't?" in Documentation/prctl/seccomp_filter.txt
>>
>>
>> ## Why use the seccomp(2) syscall?
>>
>> Landlock uses the same semantics as seccomp to apply access rule
>> restrictions. It adds a new layer of security for the current process
>> which is inherited by its children. It makes sense to use a unique
>> access-restricting syscall (that should be allowed by seccomp filters)
>> which can only drop privileges. Moreover, a Landlock rule could come
>> from outside a process (e.g. passed through a UNIX socket). It is then
>> useful to differentiate the creation/load of Landlock eBPF programs via
>> bpf(2), fr
Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing
On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün wrote:
> Hi,
>
> This eighth series is a major revamp of the Landlock design compared to
> the previous series [1]. This enables more flexibility and granularity
> of access control with file paths. It is now possible to enforce an
> access control according to a file hierarchy. Landlock uses the concept
> of inode and path to identify such a hierarchy. In a way, it brings tools
> to program what is a file hierarchy.
>
> There are now three types of Landlock hooks: FS_WALK, FS_PICK and FS_GET.
> Each of them accepts a dedicated eBPF program, called a Landlock
> program. They can be chained to enforce a full access control according
> to a list of directories or files. The set of actions on a file is well
> defined (e.g. read, write, ioctl, append, lock, mount...) taking
> inspiration from the major Linux LSMs and some other access-controls
> like Capsicum. These program types are designed to be cache-friendly,
> which gives room for optimizations in the future.
>
> The documentation patch contains some kernel documentation and
> explanations on how to use Landlock. The compiled documentation and
> a talk I gave at FOSDEM can be found here: https://landlock.io
> This patch series can be found in the branch landlock-v8 in this repo:
> https://github.com/landlock-lsm/linux
>
> There are still some minor issues with this patch series but it should
> demonstrate how powerful this design may be. One of these issues is that
> it is not a stackable LSM anymore, but the infrastructure management of
> security blobs should allow stacking it with other LSMs [4].
>
> This is the first step of the roadmap discussed at LPC [2]. While the
> intended final goal is to allow unprivileged users to use Landlock, this
> series allows only a process with global CAP_SYS_ADMIN to load and
> enforce a rule. This may help to get feedback and avoid unexpected
> behaviors.
>
> This series can be applied on top of bpf-next, commit 7d72637eb39f
> ("Merge branch 'x86-jit'"). This can be tested with
> CONFIG_SECCOMP_FILTER and CONFIG_SECURITY_LANDLOCK. I would really
> appreciate constructive comments on the design and the code.
>
>
> # Landlock LSM
>
> The goal of this new Linux Security Module (LSM) called Landlock is to
> allow any process, including unprivileged ones, to create powerful
> security sandboxes comparable to XNU Sandbox or OpenBSD Pledge. This
> kind of sandbox is expected to help mitigate the security impact of bugs
> or unexpected/malicious behaviors in user-space applications.
>
> The approach taken is to add the minimum amount of code while still
> allowing the user-space application to create quite complex access
> rules. A dedicated security policy language such as the one used by
> SELinux, AppArmor and other major LSMs involves a lot of code and is
> usually permitted to only a trusted user (i.e. root). On the contrary,
> eBPF programs already exist and are designed to be safely loaded by
> unprivileged user-space.
>
> This design does not seem too intrusive but is flexible enough to allow
> a powerful sandbox mechanism accessible by any process on Linux. The use
> of seccomp and Landlock is more suitable with the help of a user-space
> library (e.g. libseccomp) that could help to specify a high-level
> language to express a security policy instead of raw eBPF programs.
> Moreover, thanks to the LLVM front-end, it is quite easy to write an
> eBPF program with a subset of the C language.
>
>
> # Frequently asked questions
>
> ## Why is seccomp-bpf not enough?
>
> A seccomp filter can access only raw syscall arguments (i.e. the
> register values) which means that it is not possible to filter according
> to the value pointed to by an argument, such as a file pathname. As an
> embryonic Landlock version demonstrated, filtering at the syscall level
> is complicated (e.g. need to take care of race conditions). This is
> mainly because the access control checkpoints of the kernel are not at
> this high level but more underneath, at the LSM-hook level. The LSM
> hooks are designed to handle this kind of check. Landlock abstracts
> this approach to leverage the ability of unprivileged users to limit
> themselves.
>
> Cf. section "What it isn't?" in Documentation/prctl/seccomp_filter.txt
>
>
> ## Why use the seccomp(2) syscall?
>
> Landlock uses the same semantics as seccomp to apply access rule
> restrictions. It adds a new layer of security for the current process
> which is inherited by its children. It makes sense to use a unique
> access-restricting syscall (that should be allowed by seccomp filters)
> which can only drop privileges. Moreover, a Landlock rule could come
> from outside a process (e.g. passed through a UNIX socket). It is then
> useful to differentiate the creation/load of Landlock eBPF programs via
> bpf(2), from rule enforcement via seccomp(2).

This seems like a weak argument to me. Sure, this is a bit different
from seccomp(), and maybe shoving