Re: [PATCH 00/34] fs: idmapped mounts
On Fri, Oct 30, 2020 at 5:02 AM Christian Brauner wrote: > > On Thu, Oct 29, 2020 at 02:58:55PM -0700, Andy Lutomirski wrote: > > > > > > > On Oct 28, 2020, at 5:35 PM, Christian Brauner > > > wrote: > > > > > > Hey everyone, > > > > > > I vanished for a little while to focus on this work here so sorry for > > > not being available by mail for a while. > > > > > > Since quite a long time we have issues with sharing mounts between > > > multiple unprivileged containers with different id mappings, sharing a > > > rootfs between multiple containers with different id mappings, and also > > > sharing regular directories and filesystems between users with different > > > uids and gids. The latter use-cases have become even more important with > > > the availability and adoption of systemd-homed (cf. [1]) to implement > > > portable home directories. > > > > > > The solutions we have tried and proposed so far include the introduction > > > of fsid mappings, a tiny overlay based filesystem, and an approach to > > > call override creds in the vfs. None of these solutions have covered all > > > of the above use-cases. > > > > > > The solution proposed here has it's origins in multiple discussions > > > during Linux Plumbers 2017 during and after the end of the containers > > > microconference. > > > To the best of my knowledge this involved Aleksa, Stéphane, Eric, David, > > > James, and myself. A variant of the solution proposed here has also been > > > discussed, again to the best of my knowledge, after a Linux conference > > > in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017 > > > after Linux Plumbers. > > > I've taken the time to finally implement a working version of this > > > solution over the last weeks to the best of my abilities. Tycho has > > > signed up for this sligthly crazy endeavour as well and he has helped > > > with the conversion of the xattr codepaths. > > > > > > The core idea is to make idmappings a property of struct vfsmount > > > instead of tying it to a process being inside of a user namespace which > > > has been the case for all other proposed approaches. > > > It means that idmappings become a property of bind-mounts, i.e. each > > > bind-mount can have a separate idmapping. This has the obvious advantage > > > that idmapped mounts can be created inside of the initial user > > > namespace, i.e. on the host itself instead of requiring the caller to be > > > located inside of a user namespace. This enables such use-cases as e.g. > > > making a usb stick available in multiple locations with different > > > idmappings (see the vfat port that is part of this patch series). > > > > > > The vfsmount struct gains a new struct user_namespace member. The > > > idmapping of the user namespace becomes the idmapping of the mount. A > > > caller that is either privileged with respect to the user namespace of > > > the superblock of the underlying filesystem or a caller that is > > > privileged with respect to the user namespace a mount has been idmapped > > > with can create a new bind-mount and mark it with a user namespace. > > > > So one way of thinking about this is that a user namespace that has an > > idmapped mount can, effectively, create or chown files with *any* on-disk > > uid or gid by doing it directly (if that uid exists in-namespace, which is > > likely for interesting ids like 0) or by creating a new userns with that id > > inside. > > > > For a file system that is private to a container, this seems moderately > > safe, although this may depend on what exactly “private” means. We probably > > want a mechanism such that, if you are outside the namespace, a reference > > to a file with the namespace’s vfsmnt does not confer suid privilege. > > > > Imagine the following attack: user creates a namespace with a root user and > > arranges to get an idmapped fs, e.g. by inserting an ext4 usb stick or > > using whatever container management tool does this. Inside the namespace, > > the user creates a suid-root file. > > > > Now, outside the namespace, the user has privilege over the namespace. > > (I’m assuming there is some tool that will idmap things in a namespace > > owned by an unprivileged user, which seems likely.). So the user makes a > > new bind mount and if maps it to the init namespace. Game over. > > > > So I think we need to have s
Re: [PATCH 00/34] fs: idmapped mounts
> On Oct 28, 2020, at 5:35 PM, Christian Brauner > wrote: > > Hey everyone, > > I vanished for a little while to focus on this work here so sorry for > not being available by mail for a while. > > Since quite a long time we have issues with sharing mounts between > multiple unprivileged containers with different id mappings, sharing a > rootfs between multiple containers with different id mappings, and also > sharing regular directories and filesystems between users with different > uids and gids. The latter use-cases have become even more important with > the availability and adoption of systemd-homed (cf. [1]) to implement > portable home directories. > > The solutions we have tried and proposed so far include the introduction > of fsid mappings, a tiny overlay based filesystem, and an approach to > call override creds in the vfs. None of these solutions have covered all > of the above use-cases. > > The solution proposed here has it's origins in multiple discussions > during Linux Plumbers 2017 during and after the end of the containers > microconference. > To the best of my knowledge this involved Aleksa, Stéphane, Eric, David, > James, and myself. A variant of the solution proposed here has also been > discussed, again to the best of my knowledge, after a Linux conference > in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017 > after Linux Plumbers. > I've taken the time to finally implement a working version of this > solution over the last weeks to the best of my abilities. Tycho has > signed up for this sligthly crazy endeavour as well and he has helped > with the conversion of the xattr codepaths. > > The core idea is to make idmappings a property of struct vfsmount > instead of tying it to a process being inside of a user namespace which > has been the case for all other proposed approaches. > It means that idmappings become a property of bind-mounts, i.e. each > bind-mount can have a separate idmapping. This has the obvious advantage > that idmapped mounts can be created inside of the initial user > namespace, i.e. on the host itself instead of requiring the caller to be > located inside of a user namespace. This enables such use-cases as e.g. > making a usb stick available in multiple locations with different > idmappings (see the vfat port that is part of this patch series). > > The vfsmount struct gains a new struct user_namespace member. The > idmapping of the user namespace becomes the idmapping of the mount. A > caller that is either privileged with respect to the user namespace of > the superblock of the underlying filesystem or a caller that is > privileged with respect to the user namespace a mount has been idmapped > with can create a new bind-mount and mark it with a user namespace. So one way of thinking about this is that a user namespace that has an idmapped mount can, effectively, create or chown files with *any* on-disk uid or gid by doing it directly (if that uid exists in-namespace, which is likely for interesting ids like 0) or by creating a new userns with that id inside. For a file system that is private to a container, this seems moderately safe, although this may depend on what exactly “private” means. We probably want a mechanism such that, if you are outside the namespace, a reference to a file with the namespace’s vfsmnt does not confer suid privilege. Imagine the following attack: user creates a namespace with a root user and arranges to get an idmapped fs, e.g. by inserting an ext4 usb stick or using whatever container management tool does this. Inside the namespace, the user creates a suid-root file. Now, outside the namespace, the user has privilege over the namespace. (I’m assuming there is some tool that will idmap things in a namespace owned by an unprivileged user, which seems likely.). So the user makes a new bind mount and if maps it to the init namespace. Game over. So I think we need to have some control to mitigate this in a comprehensible way. A big hammer would be to require nosuid. A smaller hammer might be to say that you can’t create a new idmapped mount unless you have privilege over the userns that you want to use for the idmap and to say that a vfsmnt’s paths don’t do suid outside the idmap namespace. We already do the latter for the vfsmnt’s mntns’s userns. Hmm. What happens if we require that an idmap userns equal the vfsmnt’s mntns’s userns? Is that too limiting? I hope that whatever solution gets used is straightforward enough to wrap one’s head around. > When a file/inode is accessed through an idmapped mount the i_uid and > i_gid of the inode will be remapped according to the user namespace the > mount has been marked with. When a new object is created based on the > fsuid and fsgid of the caller they will similarly be remapped according > to the user namespace of the mount they care created from. By “mapped according to”, I presume you mean that the on-disk uid/gid is the gid
Re: [PATCH v2 16/15] syscall_get_arch: add "struct task_struct *" argument
On Tue, Nov 20, 2018 at 4:44 PM Dmitry V. Levin wrote: > > This argument is required to extend the generic ptrace API > with PTRACE_GET_SYSCALL_INFO request: syscall_get_arch() is going to be > called from ptrace_request() along with other syscall_get_* functions > with a tracee as their argument. > > This change partially reverts commit 5e937a9ae913 ("syscall_get_arch: > remove useless function arguments"). > Reviewed-by: Andy Lutomirski # for x86 -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH 06/13] arc: define syscall_get_arch()
> On Nov 9, 2018, at 8:50 AM, Vineet Gupta wrote: > >> On 11/8/18 7:16 PM, Dmitry V. Levin wrote: >> syscall_get_arch() is required to be implemented on all architectures >> that use tracehook_report_syscall_entry() in order to extend >> the generic ptrace API with PTRACE_GET_SYSCALL_INFO request. >> >> Signed-off-by: Dmitry V. Levin >> --- >> arch/arc/include/asm/syscall.h | 6 ++ >> include/uapi/linux/audit.h | 1 + >> 2 files changed, 7 insertions(+) >> >> diff --git a/arch/arc/include/asm/syscall.h b/arch/arc/include/asm/syscall.h >> index 29de09804306..5662778a7411 100644 >> --- a/arch/arc/include/asm/syscall.h >> +++ b/arch/arc/include/asm/syscall.h >> @@ -9,6 +9,7 @@ >> #ifndef _ASM_ARC_SYSCALL_H >> #define _ASM_ARC_SYSCALL_H 1 >> >> +#include >> #include >> #include >> #include >> @@ -68,4 +69,9 @@ syscall_get_arguments(struct task_struct *task, struct >> pt_regs *regs, >>} >> } >> >> +static inline int syscall_get_arch(void) >> +{ >> +return AUDIT_ARCH_ARC; >> +} >> + > > Does ptrace (or user of this API) need a unique value per arch. Otherwise > instead > of adding the boilerplate code to all arches, they could simply define > AUDIT_ARCH > and common code could return it. Also the EM_xxx are not there in > include/uapi/linux/elf.h to begin with since libc elf.h already defines them. A lot of architectures allow multiple audit_arches at runtime due to compat support and similar features, so it really does want to be a function. The goal of this patch set is to get it supported everywhere. >> #endif >> diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h >> index 818ae690ab79..a7149ceb5b98 100644 >> --- a/include/uapi/linux/audit.h >> +++ b/include/uapi/linux/audit.h >> @@ -375,6 +375,7 @@ enum { >> >> #define AUDIT_ARCH_AARCH64(EM_AARCH64|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE) >> #define AUDIT_ARCH_ALPHA(EM_ALPHA|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE) >> +#define AUDIT_ARCH_ARC(EM_ARC) >> #define AUDIT_ARCH_ARM(EM_ARM|__AUDIT_ARCH_LE) >> #define AUDIT_ARCH_ARMEB(EM_ARM) >> #define AUDIT_ARCH_CRIS(EM_CRIS|__AUDIT_ARCH_LE) > > So I don't have the context of this patch (or coverletter) but what exactly > are we > trying to do with this (adding LE to audit) - what happens when an arch is > capable of either and is say built for BE ? The primary intent is that the triple (audit_arch, syscall_nr, arg1, ..., arg6) should describe what system call is being called and what its arguments are. I’m personally not sure what, if any, technical value there is in the LE bit. I do think it makes sense for BE and LE to have different values. -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH 06/13] arc: define syscall_get_arch()
On Fri, Nov 9, 2018 at 8:11 AM Alexey Brodkin wrote: > > Hi Andy, > > On Fri, 2018-11-09 at 07:56 -0800, Andy Lutomirski wrote: > > > On Nov 9, 2018, at 7:27 AM, Alexey Brodkin > > > wrote: > > > > > > Hi Andy, > > > > > > > On Fri, 2018-11-09 at 07:17 -0800, Andy Lutomirski wrote: > > > > On Fri, Nov 9, 2018 at 6:22 AM Alexey Brodkin > > > > wrote: > > > > > Hi Dmitry, > > > > > > > > > > > On Fri, 2018-11-09 at 06:16 +0300, Dmitry V. Levin wrote: > > > > > > syscall_get_arch() is required to be implemented on all > > > > > > architectures > > > > > > that use tracehook_report_syscall_entry() in order to extend > > > > > > the generic ptrace API with PTRACE_GET_SYSCALL_INFO request. > > > > > > > > > > > > Signed-off-by: Dmitry V. Levin > > > > > > --- > > > > > > arch/arc/include/asm/syscall.h | 6 ++ > > > > > > include/uapi/linux/audit.h | 1 + > > > > > > 2 files changed, 7 insertions(+) > > > > > > > > > > [snip] > > > > > > > > > > > diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h > > > > > > index 818ae690ab79..a7149ceb5b98 100644 > > > > > > --- a/include/uapi/linux/audit.h > > > > > > +++ b/include/uapi/linux/audit.h > > > > > > @@ -375,6 +375,7 @@ enum { > > > > > > > > > > > > #define AUDIT_ARCH_AARCH64 > > > > > > (EM_AARCH64|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE) > > > > > > #define AUDIT_ARCH_ALPHA > > > > > > (EM_ALPHA|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE) > > > > > > +#define AUDIT_ARCH_ARC (EM_ARC) > > > > > > > > > > Similarly here we need to have: > > > > > >8- > > > > > +#define AUDIT_ARCH_ARC (EM_ARC|EM_ARCV2) > > > > > >8- > > > > > > > > > > > > > Huh? How does the bitwise or of two ELF machine codes make any sense? > > > > > > Oops... I didn't read examples of AUDIT_ARCH_ALPHA above :( > > > Indeed that was stupid. > > > > > > But what would be a proper fix then? > > > > > > Something like that? > > > >8- > > > #define AUDIT_ARCH_ARC (EM_ARC) > > > #define AUDIT_ARCH_ARCV2 (EM_ARCV2) > > > > > > > > > static inline int syscall_get_arch(void) > > > { > > > #ifdef __ARC700__ > > > return AUDIT_ARCH_ARC; > > > #else > > > return AUDIT_ARCH_ARCV2; > > > #endif > > > } > > > >8- > > > > > > > Maybe, but I know basically nothing about ARC. Is the syscall numbering or > > calling convention different on ARC vs ARCv2? > > Syscall numbering should be the same as we use UAPI for both ARCompact (AKA > ARCv1) > and ARCv2. As for calling convention I think it indeed differs. > > Note ARCompact and ARCv2 ISAs are binary incompatible! > > Even though assembly look pretty much the same (sans instructions > available only for either ARCompact or ARCv2) encodings are different so > in that sense these are completely different architectures. > > Also I'm wondering what could be other cases for use of syscall_get_arch(). The intent of syscall_get_arch() is that the tuple: (arch, nr, arg1, ..., arg6) fully identifies a system call and its arguments. So it sounds like we do indeed need to arch values. > > So I'd say it's better to report different values for ARC ISAs. > And given we use the same values as in Binutils IMHO it would be good > to not mix IDs here. > > -Alexey -- Andy Lutomirski AMA Capital Management, LLC -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH 06/13] arc: define syscall_get_arch()
> On Nov 9, 2018, at 7:27 AM, Alexey Brodkin > wrote: > > Hi Andy, > >> On Fri, 2018-11-09 at 07:17 -0800, Andy Lutomirski wrote: >> On Fri, Nov 9, 2018 at 6:22 AM Alexey Brodkin >> wrote: >>> Hi Dmitry, >>> >>>> On Fri, 2018-11-09 at 06:16 +0300, Dmitry V. Levin wrote: >>>> syscall_get_arch() is required to be implemented on all architectures >>>> that use tracehook_report_syscall_entry() in order to extend >>>> the generic ptrace API with PTRACE_GET_SYSCALL_INFO request. >>>> >>>> Signed-off-by: Dmitry V. Levin >>>> --- >>>> arch/arc/include/asm/syscall.h | 6 ++ >>>> include/uapi/linux/audit.h | 1 + >>>> 2 files changed, 7 insertions(+) >>> >>> [snip] >>> >>>> diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h >>>> index 818ae690ab79..a7149ceb5b98 100644 >>>> --- a/include/uapi/linux/audit.h >>>> +++ b/include/uapi/linux/audit.h >>>> @@ -375,6 +375,7 @@ enum { >>>> >>>> #define AUDIT_ARCH_AARCH64 >>>> (EM_AARCH64|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE) >>>> #define AUDIT_ARCH_ALPHA (EM_ALPHA|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE) >>>> +#define AUDIT_ARCH_ARC (EM_ARC) >>> >>> Similarly here we need to have: >>> >8- >>> +#define AUDIT_ARCH_ARC (EM_ARC|EM_ARCV2) >>> >8- >>> >> >> Huh? How does the bitwise or of two ELF machine codes make any sense? > > Oops... I didn't read examples of AUDIT_ARCH_ALPHA above :( > Indeed that was stupid. > > But what would be a proper fix then? > > Something like that? > >8- > #define AUDIT_ARCH_ARC (EM_ARC) > #define AUDIT_ARCH_ARCV2 (EM_ARCV2) > > > static inline int syscall_get_arch(void) > { > #ifdef __ARC700__ > return AUDIT_ARCH_ARC; > #else > return AUDIT_ARCH_ARCV2; > #endif > } > >8- > Maybe, but I know basically nothing about ARC. Is the syscall numbering or calling convention different on ARC vs ARCv2? -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH 06/13] arc: define syscall_get_arch()
On Fri, Nov 9, 2018 at 6:22 AM Alexey Brodkin wrote: > > Hi Dmitry, > > On Fri, 2018-11-09 at 06:16 +0300, Dmitry V. Levin wrote: > > syscall_get_arch() is required to be implemented on all architectures > > that use tracehook_report_syscall_entry() in order to extend > > the generic ptrace API with PTRACE_GET_SYSCALL_INFO request. > > > > Signed-off-by: Dmitry V. Levin > > --- > > arch/arc/include/asm/syscall.h | 6 ++ > > include/uapi/linux/audit.h | 1 + > > 2 files changed, 7 insertions(+) > > [snip] > > > diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h > > index 818ae690ab79..a7149ceb5b98 100644 > > --- a/include/uapi/linux/audit.h > > +++ b/include/uapi/linux/audit.h > > @@ -375,6 +375,7 @@ enum { > > > > #define AUDIT_ARCH_AARCH64 > > (EM_AARCH64|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE) > > #define AUDIT_ARCH_ALPHA (EM_ALPHA|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE) > > +#define AUDIT_ARCH_ARC (EM_ARC) > > Similarly here we need to have: > >8- > +#define AUDIT_ARCH_ARC (EM_ARC|EM_ARCV2) > >8- > Huh? How does the bitwise or of two ELF machine codes make any sense? -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH 00/13] Prepare for PTRACE_GET_SYSCALL_INFO
On Thu, Nov 8, 2018 at 7:13 PM Dmitry V. Levin wrote: > > syscall_get_arch() is required to be implemented on all architectures > that use tracehook_report_syscall_entry() in order to extend > the generic ptrace API with PTRACE_GET_SYSCALL_INFO request. Nice! I'm pretty sure you have vastly more changelog than code here :) --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH 2/2] RISC-V: Add support for SECCOMP
On Wed, Oct 24, 2018 at 2:42 PM Kees Cook wrote: > > On Wed, Oct 24, 2018 at 1:40 PM, Palmer Dabbelt wrote: > > From: "Wesley W. Terpstra" > > > > This is a fairly straight-forward implementation of seccomp for RISC-V > > systems. > > > > Signed-off-by: Wesley W. Terpstra > > Signed-off-by: Palmer Dabbelt > > --- > > arch/riscv/Kconfig | 18 ++ > > arch/riscv/include/asm/seccomp.h | 10 ++ > > arch/riscv/include/asm/syscall.h | 6 ++ > > arch/riscv/include/asm/thread_info.h | 1 + > > include/uapi/linux/audit.h | 1 + > > 5 files changed, 36 insertions(+) > > create mode 100644 arch/riscv/include/asm/seccomp.h > > > > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig > > index a344980287a5..28abe47602a1 100644 > > --- a/arch/riscv/Kconfig > > +++ b/arch/riscv/Kconfig > > @@ -28,6 +28,7 @@ config RISCV > > select GENERIC_STRNLEN_USER > > select GENERIC_SMP_IDLE_THREAD > > select GENERIC_ATOMIC64 if !64BIT || !RISCV_ISA_A > > + select HAVE_ARCH_SECCOMP_FILTER > > I think this patch is missing most of the actual seccomp glue? > > config HAVE_ARCH_SECCOMP_FILTER > bool > help > An arch should select this symbol if it provides all of these > things: > - syscall_get_arch() > - syscall_get_arguments() > - syscall_rollback() > - syscall_set_return_value() > - SIGSYS siginfo_t support > - secure_computing is called from a ptrace_event()-safe context > - secure_computing return value is checked and a return value of -1 > results in the system call being skipped immediately. > - seccomp syscall wired up > > I only see syscall_get_arch(). Nothing is using TIF_SECCOMP (I'd > expect a masked check in entry.S -- it seems like tracepoints are > getting missed too? I see it handled in ptrace.c but not checked in > entry.S?) There's no checking for seccomp in ptrace.c, etc. Hi RISC-V people: I strongly, strongly suggest that you rewrite your asm to work the way that x86's does: have a function called prepare_exit_to_usermode() and make it work more or less like x86's. Doing all the exit work in asm like you are is just setting you up for a world of pain. --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH 00/18] xfrm: Add compat layer
> On Jul 27, 2018, at 9:48 AM, Nathan Harold wrote: > > We (Android) are very interested in removing the restriction for 32-bit > userspace processes accessing xfrm netlink on 64-bit kernels. IPsec support > is required to pass Android conformance tests, and any manufacturer wishing > to ship 32-bit userspace with a recent kernel needs out-of-tree changes > (removing the compat_task check) to do so. > > That said, it’s not difficult to work around alignment issues directly in > userspace, so maybe we could just remove the check and make this the caller's > responsibility? Here’s an example of the workaround currently in the Android > tree: > https://android.googlesource.com/platform/system/netd/+/refs/heads/master/server/XfrmController.h#257 > > We could also employ a (relatively simple) solution such as the one above in > the uapi XFRM header itself, though it would require a caller to declare the > target kernel ABI at compile time. Maybe that’s not unthinkable for an > uncommon case? > Could there just be an XFRM2 that is entirely identical to XFRM for 64-bit userspace but makes the 32-bit structures match? If there are a grand total of two or so userspace implementations, that should cover most use cases. L > -Nathan > > >> On Fri, Jul 27, 2018 at 7:51 AM, Dmitry Safonov wrote: >> On Fri, 2018-07-27 at 16:19 +0200, Florian Westphal wrote: >> > Dmitry Safonov wrote: >> > > 1. It will double copy netlink messages, making it O(n) instead of >> > > O(1), where n - is number of bind()s.. Probably we don't care much. >> > >> > About those bind() patches, I don't understand why they are needed. >> > >> > Why can't you just add the compat skb to the native skb when doing >> > the multicast call? >> > >> > skb_shinfo(skb)->frag_list = compat_skb; >> > xfrm_nlmsg_multicast(net, skb, 0, ... >> >> Oh yeah, sorry, I think I misread the patch - will try to add compat >> skb in the multicast call. >> >> -- >> Thanks, >> Dmitry > -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH] audit: set TIF_AUDIT_SYSCALL only if audit filter has been populated
On Wed, Mar 14, 2018 at 12:28 AM, Jiri Kosina <ji...@kernel.org> wrote: > On Wed, 14 Mar 2018, Andy Lutomirski wrote: > >> > Yes...I wished I was in on the beginning of this discussion. Here's the >> > problem. We need all tasks auditable unless specifically dismissed as >> > uninteresting. This would be a task,never rule. >> > >> > The way we look at it, is if it boots with audit=1, then we know auditd >> > is expected to run at some point. So, we need all tasks to stay >> > auditable. If they weren't and auditd enabled auditing, then we'd need >> > to walk the whole proctable and stab TIF_AUDIT_SYSCALL into every >> > process in the system. It was decided that this is too ugly. >> >> When was that decided? That's what this patch does. > > I'd like to see some more justification as well. > > Namely, if I compare "setting TIF_AUDIT_SYSCALL for every process on a > need-to-be-so basis" to "we always go through the slow path and > pessimistically assume that audit is enabled and has reasonable ruleset > loaded", I have my own (different) opinion of what is too ugly. Me too. That being said, on re-review of my old code, I think that audit_dec_n_rules() may be the wrong approach. It may be better to leave TIF_AUDIT_SYSCALL set but to make the audit code itself clear the flag the next time through. That way we don't end up with a partially filled in syscall audit record that never gets consumed if we clear TIF_AUDIT_SYSCALL in the middle of a syscall. -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH] audit: set TIF_AUDIT_SYSCALL only if audit filter has been populated
On Sat, Mar 10, 2018 at 10:15 AM, Steve Grubbwrote: > On Wed, 7 Mar 2018 18:43:42 -0500 > Paul Moore wrote: >> ... and I just realized that linux-audit isn't on the To/CC line, >> adding them now. >> >> Link to the patch is below. >> >> * https://marc.info/?t=15204188763=1=2 > > Yes...I wished I was in on the beginning of this discussion. Here's the > problem. We need all tasks auditable unless specifically dismissed as > uninteresting. This would be a task,never rule. > > The way we look at it, is if it boots with audit=1, then we know auditd > is expected to run at some point. So, we need all tasks to stay > auditable. If they weren't and auditd enabled auditing, then we'd need > to walk the whole proctable and stab TIF_AUDIT_SYSCALL into every > process in the system. It was decided that this is too ugly. When was that decided? That's what this patch does. -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH] audit: set TIF_AUDIT_SYSCALL only if audit filter has been populated
On Wed, Mar 7, 2018 at 11:41 PM, Paul Moore <p...@paul-moore.com> wrote: > On Wed, Mar 7, 2018 at 11:48 AM, Jiri Kosina <ji...@kernel.org> wrote: >> On Wed, 7 Mar 2018, Andy Lutomirski wrote: >>> Wow, this was a long time ago. >> >> Oh yeah; but it now resurfaced on our side, as we are of course receiving >> a lot of requests with respect to making syscall performance great again >> :) > > Ooof. I'm not sure I can handle making more things "great again" ;) > >>> From memory and a bit of email diving, there are two reasons. >>> >>> 1. The probably was partially solved (by Oleg, IIRC) by making auditctl >>>-a task,never cause newly spawned tasks to not suck. Yes, it's a >>>very partial solution. After considerable nagging, I got Fedora to >>>default to -a task,never. >> >> Hm, right; that's a bit inconvenient, because it takes each and every >> vendor having to realize this option, and put it in. Making kernel do the >> right thing automatically sounds like a better option to me. > > This predates audit falling into my lap, but speaking generally I > think it would be good if the kernel did The Right Thing, so long as > it isn't too painful. > >>> 2. This patch, as is, may be a bit problematic. In particular, if one >>>task changes the audit rules while another task is in the middle of >>>the syscall, then it's too late to audit that syscall correctly. >>>This could be seen as a bug or it could be seen as being just fine. >> >> I don't think this should be a problem, given the fact that the whole >> timing/ordering is not predictable anyway due to scheduling. >> >> Paul, what do you think? > > I'm not overly concerned about the race condition between configuring > the audit filters and syscalls that are currently in-flight; after all > we have that now and "fixing" it would be pretty much impractical > (impossible maybe?). Most serious audit users configure it during > boot and let it run, frequent runtime changes are not common as far as > I can tell. > > I just looked quickly at the patch and decided it isn't something I'm > going to be able to carefully review in the time I've got left today, > so it's going to have to wait until tomorrow and Friday ... however, > speaking on general principle I don't have an objection to the ideas > put forth here. > > Andy, if you've got any Reviewed-by/Tested-by/NACK/etc. you want to > add, that would be good to have. > It's sort of my patch, and I was reasonably happy with it, so definitely no NACK from me. The only caveat I have is that I wrote the original version so long ago that we need to re-audit the code. In particular, I want to make sure that the following two cases don't result in warnings, oopses, or other misbehavior: 1. Do a syscall with TIF_AUDIT_SYSCALL clear. Return with TIF_AUDIT_SYSCALL set and that syscall enabled for auditing. 2.. Do a VFS syscall with TIF_AUDIT_CLEAR and have TIF_AUDIT_SYSCALL set before we execute any VFS code. The VFS code will call into the audit code to log the paths it touches (IIRC). Again, this shouldn't warn or otherwise misbehave. -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH] audit: set TIF_AUDIT_SYSCALL only if audit filter has been populated
> On Mar 8, 2018, at 1:12 AM, Richard Guy Briggs <r...@redhat.com> wrote: > >> On 2018-03-07 18:43, Paul Moore wrote: >>> On Wed, Mar 7, 2018 at 6:41 PM, Paul Moore <p...@paul-moore.com> wrote: >>>> On Wed, Mar 7, 2018 at 11:48 AM, Jiri Kosina <ji...@kernel.org> wrote: >>>>> On Wed, 7 Mar 2018, Andy Lutomirski wrote: >>>>> Wow, this was a long time ago. >>>> >>>> Oh yeah; but it now resurfaced on our side, as we are of course receiving >>>> a lot of requests with respect to making syscall performance great again >>>> :) >>> >>> Ooof. I'm not sure I can handle making more things "great again" ;) >>> >>>>> From memory and a bit of email diving, there are two reasons. >>>>> >>>>> 1. The probably was partially solved (by Oleg, IIRC) by making auditctl >>>>> -a task,never cause newly spawned tasks to not suck. Yes, it's a >>>>> very partial solution. After considerable nagging, I got Fedora to >>>>> default to -a task,never. >>>> >>>> Hm, right; that's a bit inconvenient, because it takes each and every >>>> vendor having to realize this option, and put it in. Making kernel do the >>>> right thing automatically sounds like a better option to me. >>> >>> This predates audit falling into my lap, but speaking generally I >>> think it would be good if the kernel did The Right Thing, so long as >>> it isn't too painful. >>> >>>>> 2. This patch, as is, may be a bit problematic. In particular, if one >>>>> task changes the audit rules while another task is in the middle of >>>>> the syscall, then it's too late to audit that syscall correctly. >>>>> This could be seen as a bug or it could be seen as being just fine. >>>> >>>> I don't think this should be a problem, given the fact that the whole >>>> timing/ordering is not predictable anyway due to scheduling. >>>> >>>> Paul, what do you think? >>> >>> I'm not overly concerned about the race condition between configuring >>> the audit filters and syscalls that are currently in-flight; after all >>> we have that now and "fixing" it would be pretty much impractical >>> (impossible maybe?). Most serious audit users configure it during >>> boot and let it run, frequent runtime changes are not common as far as >>> I can tell. > > I'd agree the race condition here can't easily be fixed and isn't worth > fixing for the reasons already stated (rules don't change often and may > even be locked once in place relatively early, scheduling uncertainties). > >>> I just looked quickly at the patch and decided it isn't something I'm >>> going to be able to carefully review in the time I've got left today, >>> so it's going to have to wait until tomorrow and Friday ... however, >>> speaking on general principle I don't have an objection to the ideas >>> put forth here. > > The approach seems a bit draconian since it touches all tasks but only > when adding the first rule or deleting the last. > > What we lose is the ability to set or clear individual task auditing and > have it stick to speed up non-audited tasks when there are rules > present, though this isn't currently used, in favour of audit_context > presence. > >>> Andy, if you've got any Reviewed-by/Tested-by/NACK/etc. you want to >>> add, that would be good to have. >> >> ... and I just realized that linux-audit isn't on the To/CC line, >> adding them now. > > (and Andy's non-NACK missed too...) The mailing list *is* in MAINTAINERS. > The mailing list bounces my emails. >> Link to the patch is below. >> >> * https://marc.info/?t=15204188763=1=2 >> >> paul moore > > - RGB > > -- > Richard Guy Briggs <r...@redhat.com> > Sr. S/W Engineer, Kernel Security, Base Operating Systems > Remote, Ottawa, Red Hat Canada > IRC: rgb, SunRaycer > Voice: +1.647.777.2635, Internal: (81) 32635 -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH V3 07/10] capabilities: remove a layer of conditional logic
On Wed, Aug 23, 2017 at 3:12 AM, Richard Guy Briggswrote: > Remove a layer of conditional logic to make the use of conditions > easier to read and analyse. > > Signed-off-by: Richard Guy Briggs > --- > security/commoncap.c | 13 ++--- > 1 files changed, 6 insertions(+), 7 deletions(-) > > diff --git a/security/commoncap.c b/security/commoncap.c > index 5d81354..ffcaff0 100644 > --- a/security/commoncap.c > +++ b/security/commoncap.c > @@ -551,13 +551,12 @@ static inline bool nonroot_raised_pE(struct cred *cred, > kuid_t root) > { > bool ret = false; > > - if (cap_grew(effective, ambient, cred)) { > - if (!cap_full(effective, cred) || > - !is_eff(root, cred) || !is_real(root, cred) || > - !root_privileged()) { > - ret = true; > - } > - } > + if (cap_grew(effective, ambient, cred) && > + (!cap_full(effective, cred) || > +!is_eff(root, cred) || > +!is_real(root, cred) || > +!root_privileged())) > + ret = true; I'm trying to give this whole series the benefit of the doubt. Here's what happens when I try to read this: Did effective grow ambient caps? No, makes no sense. Did ambient grow effective caps? No, not that either. Is effective more fully grown than ambient? That's probably it, but that sounds really weird. The rest would IMO be clearer if you named the helpers ruid_eq and euid_eq instead of is_eff and is_real. > return ret; > } > > -- > 1.7.1 > -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH V3 05/10] capabilities: use intuitive names for id changes
On Wed, Aug 23, 2017 at 3:12 AM, Richard Guy Briggswrote: > Introduce a number of inlines to make the use of the negation of > uid_eq() easier to read and analyse. > > Signed-off-by: Richard Guy Briggs > --- > security/commoncap.c | 26 +- > 1 files changed, 21 insertions(+), 5 deletions(-) > > diff --git a/security/commoncap.c b/security/commoncap.c > index 36c38a1..1af7dec 100644 > --- a/security/commoncap.c > +++ b/security/commoncap.c > @@ -483,6 +483,15 @@ static int get_file_caps(struct linux_binprm *bprm, bool > *effective, bool *has_f > > static inline bool root_privileged(void) { return !issecure(SECURE_NOROOT); } > > +static inline bool is_real(kuid_t uid, struct cred *cred) > +{ return uid_eq(cred->uid, uid); } OK I guess, but this just seems like a way to obfuscate the code a bit and save typing "->uid". > + > +static inline bool is_eff(kuid_t uid, struct cred *cred) > +{ return uid_eq(cred->euid, uid); } Ditto. > + > +static inline bool is_suid(kuid_t uid, struct cred *cred) > +{ return !is_real(uid, cred) && is_eff(uid, cred); } Please no. This is IMO insane. You're hiding really weird, nonintuitive logic in an oddly named helper. Also, this is going to cause massive confusion and severe bugs: given the same, the only remotely sensible guess as to what this function does is uid_eq(cred->suid, uid). So NAK to this. > + > void handle_privileged_root(struct linux_binprm *bprm, bool has_fcap, bool > *effective, kuid_t root_uid) > { > const struct cred *old = current_cred(); > @@ -493,7 +502,7 @@ void handle_privileged_root(struct linux_binprm *bprm, > bool has_fcap, bool *effe > * for a setuid root binary run by a non-root user. Do set it > * for a root user just to cause least surprise to an admin. > */ > - if (has_fcap && !uid_eq(new->uid, root_uid) && uid_eq(new->euid, > root_uid)) { > + if (has_fcap && is_suid(root_uid, new)) { e.g. this. The logic used to be obviously slightly dicey. Now it looks sane but doesn't do what you'd naively expect it to do, which is far worse. > @@ -519,6 +528,13 @@ void handle_privileged_root(struct linux_binprm *bprm, > bool has_fcap, bool *effe > !cap_issubset(cred->cap_##target, cred->cap_##source) > #define cap_full(field, cred) \ > cap_issubset(CAP_FULL_SET, cred->cap_##field) > + > +static inline bool is_setuid(struct cred *new, const struct cred *old) > +{ return !uid_eq(new->euid, old->uid); } > + > +static inline bool is_setgid(struct cred *new, const struct cred *old) > +{ return !gid_eq(new->egid, old->gid); } Please don't. This logic is fragile, and these helpers are pretending it's not fragile even though it's still fragile. -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH V3 05/10] capabilities: use intuitive names for id changes
--Andy > On Aug 25, 2017, at 11:51 AM, Serge E. Hallyn <se...@hallyn.com> wrote: > > Quoting Andy Lutomirski (l...@kernel.org): >>> On Wed, Aug 23, 2017 at 3:12 AM, Richard Guy Briggs <r...@redhat.com> wrote: >>> Introduce a number of inlines to make the use of the negation of >>> uid_eq() easier to read and analyse. >>> >>> Signed-off-by: Richard Guy Briggs <r...@redhat.com> >>> --- >>> security/commoncap.c | 26 +- >>> 1 files changed, 21 insertions(+), 5 deletions(-) >>> >>> diff --git a/security/commoncap.c b/security/commoncap.c >>> index 36c38a1..1af7dec 100644 >>> --- a/security/commoncap.c >>> +++ b/security/commoncap.c >>> @@ -483,6 +483,15 @@ static int get_file_caps(struct linux_binprm *bprm, >>> bool *effective, bool *has_f >>> >>> static inline bool root_privileged(void) { return !issecure(SECURE_NOROOT); >>> } >>> >>> +static inline bool is_real(kuid_t uid, struct cred *cred) >>> +{ return uid_eq(cred->uid, uid); } >> >> OK I guess, but this just seems like a way to obfuscate the code a bit >> and save typing "->uid". > > Personally I find the new to be far more readable. In the old, the > distinction between uid and euid is one character hidden in the middle > of the expression. Would real_uid_eq be better? > >>> + >>> +static inline bool is_eff(kuid_t uid, struct cred *cred) >>> +{ return uid_eq(cred->euid, uid); } >> >> Ditto. >> >>> + >>> +static inline bool is_suid(kuid_t uid, struct cred *cred) >>> +{ return !is_real(uid, cred) && is_eff(uid, cred); } >> >> Please no. This is IMO insane. You're hiding really weird, >> nonintuitive logic in an oddly named helper. > > How is it nonintuitive? It's very precisely checking that a > nonroot user is executing something that results in euid=0. I can think of several sensible predicated: 1. Are we execing a setuid-root program, where the setuod bit wasn't suppressed by nnp, mount options, trace, etc? 2. Same as 1, but also require that we weren't root. 3. Is the new euid 0 and old uid != 0? 4. Does suid == 0? This helper checks something equivalent to 3, but only once were far enough through exec and before user code starts. This is probably equivalent to 2 as well. This is quite subtle and deserves an open-coded check, a much more carefully named helper, or, better yet, something that looks at binprm instead of cred. is_suid sounds like #4. > That's what it's checking for, the name of the new helper makes > that clear, and the code becomes clearer because we only see the > thing we care about checking for rather than the intent being > hidden. > >> Also, this is going to cause massive confusion and severe bugs: given >> the same, the only remotely sensible guess as to what this function >> does is uid_eq(cred->suid, uid). So NAK to this. >> >>> + >>> void handle_privileged_root(struct linux_binprm *bprm, bool has_fcap, bool >>> *effective, kuid_t root_uid) >>> { >>>const struct cred *old = current_cred(); >>> @@ -493,7 +502,7 @@ void handle_privileged_root(struct linux_binprm *bprm, >>> bool has_fcap, bool *effe >>> * for a setuid root binary run by a non-root user. Do set it >>> * for a root user just to cause least surprise to an admin. >>> */ >>> - if (has_fcap && !uid_eq(new->uid, root_uid) && uid_eq(new->euid, >>> root_uid)) { >>> + if (has_fcap && is_suid(root_uid, new)) { >> >> e.g. this. The logic used to be obviously slightly dicey. Now it >> looks sane but doesn't do what you'd naively expect it to do, which is >> far worse. > > In what way does not do what you'd expect? It doesn't look at cred->suid. -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH v3 0/4] Improved seccomp logging
On Mon, May 1, 2017 at 7:41 PM, Tyler Hickswrote: > On 04/27/2017 07:42 PM, Kees Cook wrote: >> On Thu, Apr 27, 2017 at 3:17 PM, Tyler Hicks wrote: >>> Quick update... I finished the move from the high-water mark >>> log_max_action sysctl to the bitmask based actions_logged sysctl. >> >> Awesome! >> >>> Unfortunately, I've just realized that SECCOMP_SET_LOGGING, or any >>> process-wide logging configuration mechanism, will not work. It is fine >>> for the situation where two unrelated processes set up seccomp filters >>> that should be logged differently. However, it fails when two closely >>> related processes, such as parent and child, need to set up seccomp >>> filters that should be logged differently. Imagine a launcher that sets >>> up an application sandbox (including a seccomp filter) and then launches >>> an electron app which will have its own seccomp filter for sandboxing >>> untrusted code that it runs. Unless the launcher and app completely >>> agree on actions that should be logged, the logging won't work as >>> intended for both processes. >> >> Oh, you mean the forked process sets up the logging it wants for the >> filters it just installed, then after exec a process sets up new >> logging requirements? > > Yes - see below. > >> >>> I think this needs to be configured at the filter level. >> >> I'm not sure that's even the right way to compose the logging desires. >> >> So, my initial thought was "whatever ran SECCOMP_SET_LOGGING knows >> what it's doing" and it should be the actual value. >> >> If the launcher wants logs of everything the application does with its >> filters, then a purely-tied-to-filter approach won't work either. >> >> Perhaps log bits can only be enabled? I.e. SECCOMP_SET_LOGGING >> performs an OR instead of an assignment? > > The problem that I'm envisioning with this design is this: > > 1. Launcher is told to launch Chrome and forks off a process. > > 2. Launcher sets up a filter using RET_ERRNO for all unacceptable > syscalls and enables auditing of RET_ERRNO. > > 3. Launcher execs Chrome. > > 4. Chrome then sets up its own, more restrictive filter that uses > RET_ERRNO, among other actions, but does not want auditing of RET_ERRNO. > > If we use process-wide auditing controls, the logs will be filled with > RET_ERRNO messages that were unintended and unrelated to the RET_ERRNO > actions set up in the launcher's filter. > > Unfortunately, the OR'ing idea doesn't solve the problem. Things like my more complicated solution solve this completely, I think. The launcher would, by whatever means, say "RET_ERRNO and log this". The more restrictive sandbox would say "RET_ERROR and don't log this" and we'd just make sure that the composition rules mean the inner rule wins. --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH v3 0/4] Improved seccomp logging
On Fri, Apr 7, 2017 at 3:16 PM, Tyler Hicks <tyhi...@canonical.com> wrote: > On 02/22/2017 12:46 PM, Kees Cook wrote: >> On Thu, Feb 16, 2017 at 3:29 PM, Kees Cook <keesc...@chromium.org> wrote: >>> On Wed, Feb 15, 2017 at 7:24 PM, Andy Lutomirski <l...@amacapital.net> >>> wrote: >>>> On Mon, Feb 13, 2017 at 7:45 PM, Tyler Hicks <tyhi...@canonical.com> wrote: >>>>> This patch set is the third revision of the following two previously >>>>> submitted patch sets: >>>>> >>>>> v1: >>>>> http://lkml.kernel.org/r/1483375990-14948-1-git-send-email-tyhi...@canonical.com >>>>> v1: >>>>> http://lkml.kernel.org/r/1483377999-15019-2-git-send-email-tyhi...@canonical.com >>>>> >>>>> v2: >>>>> http://lkml.kernel.org/r/1486100262-32391-1-git-send-email-tyhi...@canonical.com >>>>> >>>>> The patch set aims to address some known deficiencies in seccomp's current >>>>> logging capabilities: >>>>> >>>>> 1. Inability to log all filter actions. >>>>> 2. Inability to selectively enable filtering; e.g. devs want noisy >>>>> logging, >>>>> users want relative quiet. >>>>> 3. Consistent behavior with audit enabled and disabled. >>>>> 4. Inability to easily develop a filter due to the lack of a >>>>> permissive/complain mode. >>>> >>>> I think I dislike this, but I think my dislikes may be fixable with >>>> minor changes. >>>> >>>> What I dislike is that this mixes app-specific built-in configuration >>>> (seccomp) with global privileged stuff (audit). The result is a >>>> potentially difficult to use situation in which you need to modify an >>>> app to make it loggable (using RET_LOG) and then fiddle with >>>> privileged config (auditctl, etc) to actually see the logs. >>> >>> You make a good point about RET_LOG vs log_max_action. I think making >>> RET_LOG the default value would work for 99% of the cases. >> >> Actually, I take this back: making "log" the default means that >> everything else gets logged too, include "expected" return values like >> errno, trap, etc. I think that would be extremely noisy as a default >> (for upstream or Ubuntu). >> >> Perhaps RET_LOG should unconditionally log? Or maybe the logged >> actions should be a bit field instead of a single value? Then the >> default could be "RET_KILL and RET_LOG", but an admin could switch it >> to just RET_KILL, or even nothing at all? Hmmm... > > Hi Kees - my apologies for going quiet on this topic after we spoke > about it more in IRC. I needed to tend to other work but now I'm able to > return to this seccomp logging patch set. > > To summarize what we discussed in IRC, the Chrome browser makes > extensive use of RET_ERRNO, RET_TRACE, etc., to sandbox code that may > not ever be adjusted to keep from bump into the sandbox walls. > Therefore, it is unreasonable to enable logging of something like > RET_ERRNO on a system-wide level where Chrome browser is in use. > > In contrast, snapd wants to set up "noisier" sandboxes for applications > to make it clear to the developers and the users that the sandboxed > application is bumping into the sandbox walls. Developers will know why > their code may not be working as intended and users will know that the > application is doing things that the platform doesn't agree with. These > sandboxes will end up using RET_ERRNO in the majority of cases. > > This means that with the current design of this patch set, Chrome > browser will either be unintentionally spamming the logs or snapd's > sandboxes will be helplessly silent when both Chrome and snapd is > installed at the same time, depending on the admin's preferences. > > To bring it back up a level, two applications may have a very different > outlook on how acceptable a given seccomp action is and they may > disagree on whether or not the user/administrator should be informed. > > I've been giving thought to the idea of providing a way for the > application setting up the filter to opt into logging of certain > actions. Here's a high level breakdown: At the risk of overcomplicating things, I think this issue may be a decent argument for doing something more like what I suggested earlier: let seccomp users emit loggable things and let their parents (optionally) catch and handle those things. Then we get real scoping rather than fiddling with bitmasks and hoping we get what we want to see without generating massive log spam. --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH v3 0/4] Improved seccomp logging
On Thu, Feb 16, 2017 at 3:29 PM, Kees Cook <keesc...@chromium.org> wrote: > On Wed, Feb 15, 2017 at 7:24 PM, Andy Lutomirski <l...@amacapital.net> wrote: >> On Mon, Feb 13, 2017 at 7:45 PM, Tyler Hicks <tyhi...@canonical.com> wrote: >>> This patch set is the third revision of the following two previously >>> submitted patch sets: >>> >>> v1: >>> http://lkml.kernel.org/r/1483375990-14948-1-git-send-email-tyhi...@canonical.com >>> v1: >>> http://lkml.kernel.org/r/1483377999-15019-2-git-send-email-tyhi...@canonical.com >>> >>> v2: >>> http://lkml.kernel.org/r/1486100262-32391-1-git-send-email-tyhi...@canonical.com >>> >>> The patch set aims to address some known deficiencies in seccomp's current >>> logging capabilities: >>> >>> 1. Inability to log all filter actions. >>> 2. Inability to selectively enable filtering; e.g. devs want noisy >>> logging, >>> users want relative quiet. >>> 3. Consistent behavior with audit enabled and disabled. >>> 4. Inability to easily develop a filter due to the lack of a >>> permissive/complain mode. >> >> I think I dislike this, but I think my dislikes may be fixable with >> minor changes. >> >> What I dislike is that this mixes app-specific built-in configuration >> (seccomp) with global privileged stuff (audit). The result is a >> potentially difficult to use situation in which you need to modify an >> app to make it loggable (using RET_LOG) and then fiddle with >> privileged config (auditctl, etc) to actually see the logs. > > You make a good point about RET_LOG vs log_max_action. I think making > RET_LOG the default value would work for 99% of the cases. > >> What if, instead of logging straight to the audit log, SECCOMP_RET_LOG >> [1] merely meant "tell our parent about this syscall"? (Ideally we'd >> also figure out a way to express "log this and allow", "log this and >> do ERRNO", etc.) Then we could have another mechanism that installs a >> layer in the seccomp stack that, instead of catching syscalls, catches >> log events and sticks them in a ring buffer (or audit). > > So, I really don't like this because it's yet another logging system. > We already have a security event logger: audit. This continues to use > that subsystem without changing semantics very much. Audit sucks for this kind of thing, though. It's mostly useless in a container, for example. But let me propose a middle ground. > >> Concretely, it might work like this. If a filter returns >> SECCOMP_RET_LOG, then we "log" and keep processing. SECCOMP_RET_LOG >> is otherwise treated literally like SECCOMP_RET_ALLOW and has no >> effect on return value. If you want log-and-kill, you install two >> filters. >> >> There's a new seccomp(2) action that returns an fd. That fd >> references a new thing in the seccomp stack that is a BPF program that >> is called whenever SECCOMP_RET_LOG is returned from lower down. The >> output of this filter determines whether the log event is ignored, >> stuck in the ring buffer, or passed up the stack for further >> processing. You read(2) the fd to access the ring buffer. >> >> Using this mechanism, you could write a simple seccomptrace tool that >> needs no privilege and dumps SECCOMP_RET_LOG events from the target >> program to stderr. > > If someone was going to do this, they could just as well set up a > tracer to use RET_TRAP. (And this is what things like minijail does > already, IIRC.) The reality of the situation is that this is way too > much overhead for the common case. We need a generalized logging > system that uses the existing logging mechanisms. True. And we can always add this part later if we want to. But let me propose a different, much more minor change to the patches: First, we currently have seccomp_run_filters running the whole stack and keeping (more or less) the lowest value. What if we changed it a bit so that return values of 0xff were special. Specifically, a return value of 0xff?? from a filter means "take some action right now but don't change the outcome of the filter stack". Then we define SECCOMP_RET_LOG as 0xff00 and perhaps reserve a few bits to be a number reflected in the log entry. (e.g. SECCOMP_RET_LOG(x) = 0xff00 | (x & 0xff)). Now SECCOMP_RET_LOG or SECCOMP_RET_LOG(0) does approximately what it does in the current patch series if used in isolation, but you can install two filters, one of which logs and one of which kills, to get "log and kill". If we do this, we might want SECCOMP_RET_KILL to stop running filters so that filters farther up the stack don't log the syscall. What do you think? This should be a very small delta on top of the current patches. -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH v3 1/4] seccomp: Add sysctl to display available actions
On Thu, Feb 16, 2017 at 10:47 AM, Tyler Hicks <tyhi...@canonical.com> wrote: > On 02/15/2017 09:14 PM, Andy Lutomirski wrote: >> On Mon, Feb 13, 2017 at 7:45 PM, Tyler Hicks <tyhi...@canonical.com> wrote: >>> This patch creates a read-only sysctl containing an ordered list of >>> seccomp actions that the kernel supports. The ordering, from left to >>> right, is the lowest action value (kill) to the highest action value >>> (allow). Currently, a read of the sysctl file would return "kill trap >>> errno trace allow". The contents of this sysctl file can be useful for >>> userspace code as well as the system administrator. >> >> Would this make more sense as a new seccomp(2) mode a la >> SECCOMP_HAS_ACTION? Then sandboxy things that have no fs access could >> use it. >> > > It would make sense for code that needs to check which actions are > available. It wouldn't make sense for administrators that need to check > which actions are available unless libseccomp provided a wrapper utility. > > Is this a theoretical concern or do you know of a sandboxed piece of > code that cannot access the sysctl before constructing a seccomp filter? > It's semi-theoretical. But suppose I unshare namespaces, unmount a bunch of stuff, then ask libseccomp to install a filter. (I've written code that does exactly that.) libseccomp won't be able to read the sysctl. --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH v3 0/4] Improved seccomp logging
On Mon, Feb 13, 2017 at 7:45 PM, Tyler Hickswrote: > This patch set is the third revision of the following two previously > submitted patch sets: > > v1: > http://lkml.kernel.org/r/1483375990-14948-1-git-send-email-tyhi...@canonical.com > v1: > http://lkml.kernel.org/r/1483377999-15019-2-git-send-email-tyhi...@canonical.com > > v2: > http://lkml.kernel.org/r/1486100262-32391-1-git-send-email-tyhi...@canonical.com > > The patch set aims to address some known deficiencies in seccomp's current > logging capabilities: > > 1. Inability to log all filter actions. > 2. Inability to selectively enable filtering; e.g. devs want noisy logging, > users want relative quiet. > 3. Consistent behavior with audit enabled and disabled. > 4. Inability to easily develop a filter due to the lack of a > permissive/complain mode. I think I dislike this, but I think my dislikes may be fixable with minor changes. What I dislike is that this mixes app-specific built-in configuration (seccomp) with global privileged stuff (audit). The result is a potentially difficult to use situation in which you need to modify an app to make it loggable (using RET_LOG) and then fiddle with privileged config (auditctl, etc) to actually see the logs. What if, instead of logging straight to the audit log, SECCOMP_RET_LOG [1] merely meant "tell our parent about this syscall"? (Ideally we'd also figure out a way to express "log this and allow", "log this and do ERRNO", etc.) Then we could have another mechanism that installs a layer in the seccomp stack that, instead of catching syscalls, catches log events and sticks them in a ring buffer (or audit). Concretely, it might work like this. If a filter returns SECCOMP_RET_LOG, then we "log" and keep processing. SECCOMP_RET_LOG is otherwise treated literally like SECCOMP_RET_ALLOW and has no effect on return value. If you want log-and-kill, you install two filters. There's a new seccomp(2) action that returns an fd. That fd references a new thing in the seccomp stack that is a BPF program that is called whenever SECCOMP_RET_LOG is returned from lower down. The output of this filter determines whether the log event is ignored, stuck in the ring buffer, or passed up the stack for further processing. You read(2) the fd to access the ring buffer. Using this mechanism, you could write a simple seccomptrace tool that needs no privilege and dumps SECCOMP_RET_LOG events from the target program to stderr. Thoughts? [1] If we went this route, it might want to be renamed. P.S. We ought to be able to write a BPF verifier pass that makes sure that filters don't return unsupported return values if we cared to do so. -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH v3 1/4] seccomp: Add sysctl to display available actions
On Mon, Feb 13, 2017 at 7:45 PM, Tyler Hickswrote: > This patch creates a read-only sysctl containing an ordered list of > seccomp actions that the kernel supports. The ordering, from left to > right, is the lowest action value (kill) to the highest action value > (allow). Currently, a read of the sysctl file would return "kill trap > errno trace allow". The contents of this sysctl file can be useful for > userspace code as well as the system administrator. Would this make more sense as a new seccomp(2) mode a la SECCOMP_HAS_ACTION? Then sandboxy things that have no fs access could use it. --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH 0/2] Begin auditing SECCOMP_RET_ERRNO return actions
On Mon, Jan 2, 2017 at 2:47 PM, Paul Moorewrote: > On Mon, Jan 2, 2017 at 11:53 AM, Tyler Hicks wrote: >> This patch set creates the basis for auditing information specific to a given >> seccomp return action and then starts auditing SECCOMP_RET_ERRNO return >> actions. The audit messages for SECCOMP_RET_ERRNO return actions include the >> errno value that will be returned to userspace. > > I'm replying to this patchset posting because it his my inbox first, > but my comments here apply to both this patchset and the other > seccomp/audit patchset you posted. > > In my experience, we have two or three problems (the count varies > depending on perspective) when it comes to seccomp filter reporting: > > 1. Inability to log all filter actions. > 2. Inability to selectively enable filtering; e.g. devs want noisy > logging, users want relative quiet. > 3. Consistent behavior with audit enabled and disabled. > > My current thinking - forgive me, this has been kicking around in my > head for the better part of six months (longer?) and I haven't > attempted to code it up - is to create a sysctl knob for a system wide > seccomp logging threshold that would be applied to the high 16-bits of > *every* triggered action: if the action was at/below the threshold a > record would be emitted, otherwise silence. This should resolve > problems #1 and #2, and the code should be relatively straightforward > and small. > > As part of the code above, I expect that all seccomp logging would get > routed through a single logging function (sort of like a better > implementation of the existing audit_seccomp()) that would check the > threshold and trigger the logging if needed. This function could be > augmented to check for CONFIG_AUDIT and in the case where audit was > not built into the kernel, a simple printk could be used to log the > seccomp event; solving problem #3. Would this not be doable with a seccomp tracepoint and a BPF filter? --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH 0/2] Begin auditing SECCOMP_RET_ERRNO return actions
On Mon, Jan 2, 2017 at 8:53 AM, Tyler Hickswrote: > This patch set creates the basis for auditing information specific to a given > seccomp return action and then starts auditing SECCOMP_RET_ERRNO return > actions. The audit messages for SECCOMP_RET_ERRNO return actions include the > errno value that will be returned to userspace. > Not that I'm opposed to the idea, but what's the intended purpose? -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Should audit_seccomp check audit_enabled?
I would argue that, if auditing is off, audit_seccomp shouldn't do anything. After all, unlike e.g. selinux, seccomp is not a systemwide policy, and seccomp signals might be ordinary behavior that's internal to the seccomp-using application. IOW, for people with audit compiled in and subscribed by journald but switched off, I think that the records shouldn't be emitted. If you agree, I can send the two-line patch. --Andy -- Andy Lutomirski AMA Capital Management, LLC -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: Should audit_seccomp check audit_enabled?
On Fri, Oct 23, 2015 at 1:58 PM, Paul Moore <p...@paul-moore.com> wrote: > On Fri, Oct 23, 2015 at 4:51 PM, Steve Grubb <sgr...@redhat.com> wrote: >> On Friday, October 23, 2015 03:38:05 PM Paul Moore wrote: >>> On Fri, Oct 23, 2015 at 1:01 PM, Kees Cook <keesc...@chromium.org> wrote: >>> > On Fri, Oct 23, 2015 at 9:19 AM, Andy Lutomirski <l...@amacapital.net> >> wrote: >>> >> I would argue that, if auditing is off, audit_seccomp shouldn't do >>> >> anything. After all, unlike e.g. selinux, seccomp is not a systemwide >>> >> policy, and seccomp signals might be ordinary behavior that's internal >>> >> to the seccomp-using application. IOW, for people with audit compiled >>> >> in and subscribed by journald but switched off, I think that the >>> >> records shouldn't be emitted. >>> >> >>> >> If you agree, I can send the two-line patch. >>> > >>> > I think signr==0 states (which I would identify as "intended >>> > behavior") don't need to be reported under any situation, but audit >>> > folks wanted to keep it around. >>> >>> Wearing my libseccomp hat, I would like some logging when the seccomp >>> filter triggers a result other than allow. I don't care if this is >>> via audit or printk(), I just want some notification. If we go the >>> printk route and people really don't want to see anything in their >>> logs, I suppose we could always add a sysctl knob to turn off the >>> message completely (we would still need to do whatever audit records >>> are required, see below). >>> >>> Wearing my audit hat, I want to make sure we tick off all the right >>> boxes for the various certifications that people care about. Steve >>> Grubb has commented on what he needs in the past, although I'm not >>> sure it was on-list, so I'll ask him to repeat it here. >> >> I went back and reviewed my notes since this came up in the current Common >> Criteria evaluation. What we decided to do is treat syscall failures which >> failed due to seccomp the same as syscall failures caused by dropping >> capabilities. Both are opt-in DAC policies. That means we don't care. Do >> whatever you like. :-) > > Thanks Steve. > > Andy, is your objection that you don't want to see any seccomp > messages, or just seccomp audit records when audit is disabled? > My objection is that people who have audit compiled in but disabled at runtime shouldn't have the overhead or the log noise from these messages. If people want the messages, then I think they should turn on audit (auditctl -e 1 or whatever). --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: Should audit_seccomp check audit_enabled?
On Oct 23, 2015 10:01 AM, "Kees Cook" <keesc...@chromium.org> wrote: > > On Fri, Oct 23, 2015 at 9:19 AM, Andy Lutomirski <l...@amacapital.net> wrote: > > I would argue that, if auditing is off, audit_seccomp shouldn't do > > anything. After all, unlike e.g. selinux, seccomp is not a systemwide > > policy, and seccomp signals might be ordinary behavior that's internal > > to the seccomp-using application. IOW, for people with audit compiled > > in and subscribed by journald but switched off, I think that the > > records shouldn't be emitted. > > > > If you agree, I can send the two-line patch. > > I think signr==0 states (which I would identify as "intended > behavior") don't need to be reported under any situation, but audit > folks wanted to keep it around. Even if there is a nonzero signr, it could just be a program opting to trap and emulate one of its own syscalls. --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: Should audit_seccomp check audit_enabled?
On Fri, Oct 23, 2015 at 2:22 PM, Kees Cook <keesc...@chromium.org> wrote: > On Fri, Oct 23, 2015 at 2:07 PM, Andy Lutomirski <l...@amacapital.net> wrote: >> On Oct 23, 2015 10:01 AM, "Kees Cook" <keesc...@chromium.org> wrote: >>> >>> On Fri, Oct 23, 2015 at 9:19 AM, Andy Lutomirski <l...@amacapital.net> >>> wrote: >>> > I would argue that, if auditing is off, audit_seccomp shouldn't do >>> > anything. After all, unlike e.g. selinux, seccomp is not a systemwide >>> > policy, and seccomp signals might be ordinary behavior that's internal >>> > to the seccomp-using application. IOW, for people with audit compiled >>> > in and subscribed by journald but switched off, I think that the >>> > records shouldn't be emitted. >>> > >>> > If you agree, I can send the two-line patch. >>> >>> I think signr==0 states (which I would identify as "intended >>> behavior") don't need to be reported under any situation, but audit >>> folks wanted to keep it around. >> >> Even if there is a nonzero signr, it could just be a program opting to >> trap and emulate one of its own syscalls. > > At present, that is a rare situation. Programs tend to be ptrace > managed externally. Is there anything catching SIGSYS itself? > I wrote one once. I also wrote a whole set of patches for libseccomp to make it easier that never went anywhere -- I should dust those off and package them into their own library. --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH V6 05/10] audit: log creation and deletion of namespace instances
On Thu, May 14, 2015 at 7:32 PM, Richard Guy Briggs r...@redhat.com wrote: On 15/05/14, Paul Moore wrote: * Look at our existing audit records to determine which records should have namespace and container ID tokens added. We may only want to add the additional fields in the case where the namespace/container ID tokens are not the init namespace. If we have a record that ties a set of namespace IDs with a container ID, then I expect we only need to list the containerID along with auid and sessionID. The problem here is that the kernel has no concept of a container, and I don't think it makes any sense to add one just for audit. Container is a marketing term used by some userspace tools. I can imagine that both audit could benefit from a concept of a namespace *path* that understands nesting (e.g. root/2/5/1 or something along those lines). Mapping these to containers belongs in userspace, I think. --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH V6 05/10] audit: log creation and deletion of namespace instances
On May 15, 2015 9:38 PM, Steve Grubb sgr...@redhat.com wrote: On Thursday, May 14, 2015 11:23:09 PM Andy Lutomirski wrote: On Thu, May 14, 2015 at 7:32 PM, Richard Guy Briggs r...@redhat.com wrote: On 15/05/14, Paul Moore wrote: * Look at our existing audit records to determine which records should have namespace and container ID tokens added. We may only want to add the additional fields in the case where the namespace/container ID tokens are not the init namespace. If we have a record that ties a set of namespace IDs with a container ID, then I expect we only need to list the containerID along with auid and sessionID. The problem here is that the kernel has no concept of a container, and I don't think it makes any sense to add one just for audit. Container is a marketing term used by some userspace tools. No, its a real thing just like a login. Does the kernel have any concept of a login? Yet it happens. And it causes us to generate events describing who, where from, role, success, and time of day. :-) I really hope those records come from userspace, not the kernel. I also wonder what happens when a user logs in and types sudo agetty /dev/ttyS0 115200. If a user does that and then someone logs in on /dev/ttyS0, which login are they? I can imagine that both audit could benefit from a concept of a namespace *path* that understands nesting (e.g. root/2/5/1 or something along those lines). Mapping these to containers belongs in userspace, I think. I don't doubt that just as user space sequences the actions that are a login. I just need the kernel to do some book keeping and associate the necessary attributes in the event record to be able to reconstruct what is actually happening. A precondition for that is having those records have some correspondence to what is actually happening. Since the kernel has no concept of a container, and since the same kernel mechanisms could be used for things that are probably not whatever the Common Criteria rules think a container is, this could be quite difficult to define in a meaningful manner. Hence my suggestion to add only minimal support in the kernel and to do this in userspace. --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: x32 + audit status?
On Mar 5, 2015 10:32 AM, David Drysdale drysd...@google.com wrote: Hi, Do we currently expect the audit system to work with x32 syscalls? I was playing with the audit system for the first time today (on v4.0-rc2, due to [1]), and it didn't seem to work for me. (Tweaking ptrace.c like the patch below seemed to help, but I may just have configured something wrong.) I know there was a bunch of activity around this area in mid-2014, but I'm not sure what the final position was... It's totally broken, and it needs ABI work. I think it should keep the high syscall numbers, which means that both userspace and the audit core need to learn how to deal with it. --Andy Thanks, David [1]: https://lkml.org/lkml/2015/3/4/879 diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index e510618b2e91..443932afd9e8 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -1445,7 +1445,7 @@ static void do_audit_syscall_entry(struct pt_regs *regs, u32 arch) { #ifdef CONFIG_X86_64 if (arch == AUDIT_ARCH_X86_64) { - audit_syscall_entry(regs-orig_ax, regs-di, + audit_syscall_entry(regs-orig_ax __SYSCALL_MASK, regs-di, regs-si, regs-dx, regs-r10); } else #endif -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH] powerpc: add little endian flag to syscall_get_arch()
On Tue, Dec 2, 2014 at 1:27 PM, Richard Guy Briggs r...@redhat.com wrote: Since both ppc and ppc64 have LE variants which are now reported by uname, add that flag (__AUDIT_ARCH_LE) to syscall_get_arch() and add AUDIT_ARCH_PPC*LE variants. Without this, perf trace and auditctl fail. Mainline kernel reports ppc64le (per a058801) but there is no matching AUDIT_ARCH_PPC64LE. There's no seccomp filter support for powerpc, so there's no risk that this breaks it. --Andy See: https://www.redhat.com/archives/linux-audit/2014-August/msg00082.html https://www.redhat.com/archives/linux-audit/2014-December/msg4.html Signed-off-by: Richard Guy Briggs r...@redhat.com --- arch/powerpc/include/asm/syscall.h |6 +- include/uapi/linux/audit.h |2 ++ 2 files changed, 7 insertions(+), 1 deletions(-) diff --git a/arch/powerpc/include/asm/syscall.h b/arch/powerpc/include/asm/syscall.h index 6fa2708..a58acab 100644 --- a/arch/powerpc/include/asm/syscall.h +++ b/arch/powerpc/include/asm/syscall.h @@ -90,6 +90,10 @@ static inline void syscall_set_arguments(struct task_struct *task, static inline int syscall_get_arch(void) { - return is_32bit_task() ? AUDIT_ARCH_PPC : AUDIT_ARCH_PPC64; + int arch = is_32bit_task() ? AUDIT_ARCH_PPC : AUDIT_ARCH_PPC64; +#ifdef __LITTLE_ENDIAN__ + arch |= __AUDIT_ARCH_LE +#endif + return arch; } #endif /* _ASM_SYSCALL_H */ diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h index 4d100c8..fe29a99 100644 --- a/include/uapi/linux/audit.h +++ b/include/uapi/linux/audit.h @@ -364,7 +364,9 @@ enum { #define AUDIT_ARCH_PARISC (EM_PARISC) #define AUDIT_ARCH_PARISC64(EM_PARISC|__AUDIT_ARCH_64BIT) #define AUDIT_ARCH_PPC (EM_PPC) +#define AUDIT_ARCH_PPCLE (EM_PPC|__AUDIT_ARCH_LE) #define AUDIT_ARCH_PPC64 (EM_PPC64|__AUDIT_ARCH_64BIT) +#define AUDIT_ARCH_PPC64LE (EM_PPC64|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE) #define AUDIT_ARCH_S390(EM_S390) #define AUDIT_ARCH_S390X (EM_S390|__AUDIT_ARCH_64BIT) #define AUDIT_ARCH_SH (EM_SH) -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-api in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Andy Lutomirski AMA Capital Management, LLC -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH] i386/audit: stop scribbling on the stack frame
On Fri, Oct 24, 2014 at 1:19 PM, H. Peter Anvin h...@zytor.com wrote: On 10/23/2014 12:38 PM, Eric Paris wrote: After the call __audit_syscall_entry aren't they already polluted? Isn't that the reason we need to reload EAX? Well, I guess EAX is special... Because system calls are asmlinkage, all the parameters are on the stack, but %eax is used as the index into the system call table. This should thus be fine until we get rid of regparm(0) entirely, if that ever happens. ...and because __audit_syscall_entry *isn't* asmlinkage, it uses the other convention, which is where the confusion comes from. And, by the time you get to sysenter_do_call, nothing cares about ecx, so you can freely clobber it while popping from the stack. I get it now. --Andy -hpa -- Andy Lutomirski AMA Capital Management, LLC -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH] i386/audit: stop scribbling on the stack frame
On 10/22/2014 09:04 PM, Eric Paris wrote: git commit b4f0d3755c5e9cc86292d5fd78261903b4f23d4a was very very dumb. It was writing over %esp/pt_regs semi-randomly on i686 with the expected system can't boot results. As noted in: https://bugs.freedesktop.org/show_bug.cgi?id=85277 This patch stops fscking with pt_regs. Instead it sets up the registers for the call to __audit_syscall_entry in the most obvious conceivable way. It then does just a tiny tiny touch of magic. We need to get what started in PT_EDX into 0(%esp) and PT_ESI into 4(%esp). This is as easy as a pair of pushes. After the call to __audit_syscall_entry all we need to do is get that now useless junk off the stack (pair of pops) and reload %eax with the original syscall so other stuff can keep going about it's business. Signed-off-by: Eric Paris epa...@redhat.com Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@redhat.com Cc: H. Peter Anvin h...@zytor.com Cc: x...@kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux-audit@redhat.com --- arch/x86/kernel/entry_32.S | 15 +++ 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S index f9e3fab..fb01d22 100644 --- a/arch/x86/kernel/entry_32.S +++ b/arch/x86/kernel/entry_32.S @@ -447,15 +447,14 @@ sysenter_exit: sysenter_audit: testl $(_TIF_WORK_SYSCALL_ENTRY ~_TIF_SYSCALL_AUDIT),TI_flags(%ebp) jnz syscall_trace_entry - addl $4,%esp - CFI_ADJUST_CFA_OFFSET -4 - movl %esi,4(%esp) /* 5th arg: 4th syscall arg */ - movl %edx,(%esp)/* 4th arg: 3rd syscall arg */ - /* %ecx already in %ecx3rd arg: 2nd syscall arg */ - movl %ebx,%edx /* 2nd arg: 1st syscall arg */ - /* %eax already in %eax1st arg: syscall number */ + /* movl PT_EAX(%esp), %eax already set, syscall number: 1st arg to audit */ + movl PT_EBX(%esp), %edx /* ebx/a0: 2nd arg to audit */ + /* movl PT_ECX(%esp), %ecx already set, a1: 3nd arg to audit */ + pushl_cfi PT_ESI(%esp) /* a3: 5th arg */ + pushl_cfi PT_EDX+4(%esp)/* a2: 4th arg */ call __audit_syscall_entry - pushl_cfi %ebx + popl_cfi %ecx /* get that remapped edx off the stack */ + popl_cfi %ecx /* get that remapped esi off the stack */ movl PT_EAX(%esp),%eax /* reload syscall number */ jmp sysenter_do_call This looks reasonably likely to be correct, but this code is complicated and now ever slower. How hard would it be to just delete it and replace it with a straightforward two-phase trace invocation a la x86_64? --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH] i386/audit: stop scribbling on the stack frame
On Thu, Oct 23, 2014 at 12:15 PM, Eric Paris epa...@redhat.com wrote: On Thu, 2014-10-23 at 11:39 -0700, Andy Lutomirski wrote: On 10/22/2014 09:04 PM, Eric Paris wrote: git commit b4f0d3755c5e9cc86292d5fd78261903b4f23d4a was very very dumb. It was writing over %esp/pt_regs semi-randomly on i686 with the expected system can't boot results. As noted in: https://bugs.freedesktop.org/show_bug.cgi?id=85277 This patch stops fscking with pt_regs. Instead it sets up the registers for the call to __audit_syscall_entry in the most obvious conceivable way. It then does just a tiny tiny touch of magic. We need to get what started in PT_EDX into 0(%esp) and PT_ESI into 4(%esp). This is as easy as a pair of pushes. After the call to __audit_syscall_entry all we need to do is get that now useless junk off the stack (pair of pops) and reload %eax with the original syscall so other stuff can keep going about it's business. Signed-off-by: Eric Paris epa...@redhat.com Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@redhat.com Cc: H. Peter Anvin h...@zytor.com Cc: x...@kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux-audit@redhat.com --- arch/x86/kernel/entry_32.S | 15 +++ 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S index f9e3fab..fb01d22 100644 --- a/arch/x86/kernel/entry_32.S +++ b/arch/x86/kernel/entry_32.S @@ -447,15 +447,14 @@ sysenter_exit: sysenter_audit: testl $(_TIF_WORK_SYSCALL_ENTRY ~_TIF_SYSCALL_AUDIT),TI_flags(%ebp) jnz syscall_trace_entry - addl $4,%esp - CFI_ADJUST_CFA_OFFSET -4 - movl %esi,4(%esp) /* 5th arg: 4th syscall arg */ - movl %edx,(%esp)/* 4th arg: 3rd syscall arg */ - /* %ecx already in %ecx3rd arg: 2nd syscall arg */ - movl %ebx,%edx /* 2nd arg: 1st syscall arg */ - /* %eax already in %eax1st arg: syscall number */ + /* movl PT_EAX(%esp), %eax already set, syscall number: 1st arg to audit */ + movl PT_EBX(%esp), %edx /* ebx/a0: 2nd arg to audit */ + /* movl PT_ECX(%esp), %ecx already set, a1: 3nd arg to audit */ + pushl_cfi PT_ESI(%esp) /* a3: 5th arg */ + pushl_cfi PT_EDX+4(%esp)/* a2: 4th arg */ call __audit_syscall_entry - pushl_cfi %ebx + popl_cfi %ecx /* get that remapped edx off the stack */ + popl_cfi %ecx /* get that remapped esi off the stack */ movl PT_EAX(%esp),%eax /* reload syscall number */ jmp sysenter_do_call This looks reasonably likely to be correct, but this code is complicated and now ever slower. I guess I could just use push/pop and do the CFI_ADJUST_CFA_OFFSET by hand. But I figured this was reasonable enough... I'm not complaining about your new assembly in particular. There's just too much assembly in there in general. But I feel like I'm missing something in the new code. Aren't you corrupting ecx with those popl_cfi insns? How hard would it be to just delete it and replace it with a straightforward two-phase trace invocation a la x86_64? For me? Hard. I can try it eventually, but my track record with the 32-bit entry code isn't so good. I agree that this would be pretty nasty for urgent, but at least neither that or your fix would need to go to -stable. --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH] i386/audit: stop scribbling on the stack frame
On Thu, Oct 23, 2014 at 12:30 PM, Eric Paris epa...@redhat.com wrote: On Thu, 2014-10-23 at 12:20 -0700, Andy Lutomirski wrote: On Thu, Oct 23, 2014 at 12:15 PM, Eric Paris epa...@redhat.com wrote: On Thu, 2014-10-23 at 11:39 -0700, Andy Lutomirski wrote: On 10/22/2014 09:04 PM, Eric Paris wrote: git commit b4f0d3755c5e9cc86292d5fd78261903b4f23d4a was very very dumb. It was writing over %esp/pt_regs semi-randomly on i686 with the expected system can't boot results. As noted in: https://bugs.freedesktop.org/show_bug.cgi?id=85277 This patch stops fscking with pt_regs. Instead it sets up the registers for the call to __audit_syscall_entry in the most obvious conceivable way. It then does just a tiny tiny touch of magic. We need to get what started in PT_EDX into 0(%esp) and PT_ESI into 4(%esp). This is as easy as a pair of pushes. After the call to __audit_syscall_entry all we need to do is get that now useless junk off the stack (pair of pops) and reload %eax with the original syscall so other stuff can keep going about it's business. Signed-off-by: Eric Paris epa...@redhat.com Cc: Thomas Gleixner t...@linutronix.de Cc: Ingo Molnar mi...@redhat.com Cc: H. Peter Anvin h...@zytor.com Cc: x...@kernel.org Cc: linux-ker...@vger.kernel.org Cc: linux-audit@redhat.com --- arch/x86/kernel/entry_32.S | 15 +++ 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S index f9e3fab..fb01d22 100644 --- a/arch/x86/kernel/entry_32.S +++ b/arch/x86/kernel/entry_32.S @@ -447,15 +447,14 @@ sysenter_exit: sysenter_audit: testl $(_TIF_WORK_SYSCALL_ENTRY ~_TIF_SYSCALL_AUDIT),TI_flags(%ebp) jnz syscall_trace_entry - addl $4,%esp - CFI_ADJUST_CFA_OFFSET -4 - movl %esi,4(%esp) /* 5th arg: 4th syscall arg */ - movl %edx,(%esp)/* 4th arg: 3rd syscall arg */ - /* %ecx already in %ecx3rd arg: 2nd syscall arg */ - movl %ebx,%edx /* 2nd arg: 1st syscall arg */ - /* %eax already in %eax1st arg: syscall number */ + /* movl PT_EAX(%esp), %eax already set, syscall number: 1st arg to audit */ + movl PT_EBX(%esp), %edx /* ebx/a0: 2nd arg to audit */ + /* movl PT_ECX(%esp), %ecx already set, a1: 3nd arg to audit */ + pushl_cfi PT_ESI(%esp) /* a3: 5th arg */ + pushl_cfi PT_EDX+4(%esp)/* a2: 4th arg */ call __audit_syscall_entry - pushl_cfi %ebx + popl_cfi %ecx /* get that remapped edx off the stack */ + popl_cfi %ecx /* get that remapped esi off the stack */ movl PT_EAX(%esp),%eax /* reload syscall number */ jmp sysenter_do_call This looks reasonably likely to be correct, but this code is complicated and now ever slower. I guess I could just use push/pop and do the CFI_ADJUST_CFA_OFFSET by hand. But I figured this was reasonable enough... I'm not complaining about your new assembly in particular. There's just too much assembly in there in general. But I feel like I'm missing something in the new code. Aren't you corrupting ecx with those popl_cfi insns? After the call __audit_syscall_entry aren't they already polluted? Isn't that the reason we need to reload EAX? You can verify this leaves things in a similar state (although slightly differently polluted) than before it got screwed up. Here is diff between before the breakage and what I propose we do now. (I admit I don't understand how the pushl_cfi %ebx wasn't messing up PT_EBX) /me anxiously awaits x86 guy to tell me how dumb I am hpa, do you have time to figure this out? I don't know the 32-bit ABI well enough, nor will I have time to disassemble things or find and read the spec to figure this out in the next few days. --Andy $ git diff a17c8b54dc738c4fda31e8be0302cd131a04c19f -- arch/x86/kernel/entry_32.S diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S index 0d0c9d4..fb01d22 100644 --- a/arch/x86/kernel/entry_32.S +++ b/arch/x86/kernel/entry_32.S @@ -447,16 +447,14 @@ sysenter_exit: sysenter_audit: testl $(_TIF_WORK_SYSCALL_ENTRY ~_TIF_SYSCALL_AUDIT),TI_flags(%ebp) jnz syscall_trace_entry - addl $4,%esp - CFI_ADJUST_CFA_OFFSET -4 - /* %esi already in 8(%esp) 6th arg: 4th syscall arg */ - /* %edx already in 4(%esp) 5th arg: 3rd syscall arg */ - /* %ecx already in 0(%esp) 4th arg: 2nd syscall arg */ - movl %ebx,%ecx /* 3rd arg: 1st syscall arg */ - movl %eax,%edx /* 2nd arg: syscall number */ - movl $AUDIT_ARCH_I386,%eax /* 1st arg: audit arch */ + /* movl PT_EAX(%esp), %eax already set, syscall number: 1st arg to audit */ + movl PT_EBX
Re: Regression: audit: x86: drop arch from __audit_syscall_entry() interface
On 10/22/2014 11:23 AM, Eric Paris wrote: That's really serious. Looking now. On Wed, 2014-10-22 at 16:08 -0200, Paulo Zanoni wrote: Hi (Cc'ing everybody mentioned in the original patch) I work for Intel, on our Linux Graphics driver - aka i915.ko - and our QA team recently reported a regression on: commit b4f0d3755c5e9cc86292d5fd78261903b4f23d4a Author: Richard Guy Briggs Date: Tue Mar 4 10:38:06 2014 -0500 audit: x86: drop arch from __audit_syscall_entry() interface According to our QA, their i386 machine doesn't boot anymore. I tried to write my own revert for the patch, asked QA to test, and they confirmed it solves the problem. Here are the details of QA' s bug report: https://bugs.freedesktop.org/show_bug.cgi?id=85277 . The trees our QA tests are the development trees from i915.ko: http://cgit.freedesktop.org/drm-intel?h=drm-intel-fixes . I tried searching for other bug reports on the same patch, but couldn't find any. Forgive me if this bug was already reported. Feel free to continue this discussion on the bugzilla report if you want. This piece: movl %esi,4(%esp) /* 5th arg: 4th syscall arg */ movl %edx,(%esp)/* 4th arg: 3rd syscall arg */ looks like it's overwriting syscall arguments. This is clearly fixable, but an even better fix would be to drop the asm entirely and switch to two-phase tracing. Want to do it? I can test the seccomp bits if you switch over the asm :) --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: Regression: audit: x86: drop arch from __audit_syscall_entry() interface
On Wed, Oct 22, 2014 at 12:16 PM, Richard Guy Briggs r...@redhat.com wrote: On 14/10/22, Andy Lutomirski wrote: On 10/22/2014 11:23 AM, Eric Paris wrote: That's really serious. Looking now. On Wed, 2014-10-22 at 16:08 -0200, Paulo Zanoni wrote: Hi (Cc'ing everybody mentioned in the original patch) I work for Intel, on our Linux Graphics driver - aka i915.ko - and our QA team recently reported a regression on: commit b4f0d3755c5e9cc86292d5fd78261903b4f23d4a Author: Richard Guy Briggs Date: Tue Mar 4 10:38:06 2014 -0500 audit: x86: drop arch from __audit_syscall_entry() interface According to our QA, their i386 machine doesn't boot anymore. I tried to write my own revert for the patch, asked QA to test, and they confirmed it solves the problem. Here are the details of QA' s bug report: https://bugs.freedesktop.org/show_bug.cgi?id=85277 . The trees our QA tests are the development trees from i915.ko: http://cgit.freedesktop.org/drm-intel?h=drm-intel-fixes . I tried searching for other bug reports on the same patch, but couldn't find any. Forgive me if this bug was already reported. Feel free to continue this discussion on the bugzilla report if you want. This piece: movl %esi,4(%esp) /* 5th arg: 4th syscall arg */ movl %edx,(%esp)/* 4th arg: 3rd syscall arg */ looks like it's overwriting syscall arguments. This is clearly fixable, but an even better fix would be to drop the asm entirely and switch to two-phase tracing. Want to do it? I can test the seccomp bits if you switch over the asm :) Like what you did for x86_64. That sounds worth investigating. I'll have a look at the asm, but I'm being distracted by a gunman loose 2km from me and my wife and kids under lockdown in two different locations on the other side of the shooting site. Had to cancel lunch today with two work colleagues 1/2km away from that site. ...not been a productive day. That's putting it mildly. Stay safe. --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc
On Mon, Aug 25, 2014 at 8:43 AM, Nicolas Dichtel nicolas.dich...@6wind.com wrote: Le 25/08/2014 16:04, Andy Lutomirski a écrit : On Aug 25, 2014 6:30 AM, Nicolas Dichtel nicolas.dich...@6wind.com wrote: CRIU wants to save the complete state of a namespace and then restore it. For that to work, any information exposed to things in the namespace *cannot* be globally unique or unique per boot, since CRIU needs to arrange for that information to match whatever it was when CRIU saved it. How are ifindex of network devices managed? These ifindexes are unique per boot, thus can change depending on the order in which netdev are created. These ifindexes are unique per boot and exposed to userspace ... This does not appear to be true. $ sudo unshare --net # ip link add veth0 type veth peer name veth1 # ip link 1: lo: LOOPBACK mtu 65536 qdisc noop state DOWN mode DEFAULT group default link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 2: veth1: BROADCAST,MULTICAST mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 06:0d:59:c7:a6:a8 brd ff:ff:ff:ff:ff:ff 3: veth0: BROADCAST,MULTICAST mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether b2:5c:8b:f2:12:28 brd ff:ff:ff:ff:ff:ff # logout $ ip link 1: lo: LOOPBACK,UP,LOWER_UP mtu 65536 qdisc noqueue state UNKNOWN link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 3: em1: NO-CARRIER,BROADCAST,MULTICAST,UP mtu 1500 qdisc pfifo_fast state DOWN qlen 1000 I've probably misunderstood what you're trying to say. ifindexes are unique per boot and per netns. I think we both misunderstood each other. The ifindexes are unique *per netns*, which means that, if you're unprivileged in a netns, global information doesn't leak to you. I think this is good. Let me try again, with emphasis in the right place. I think that *code running in a namespace* has no business even knowing a unique identity of *that namespace* from the perspective of the host. In your example, if there's a veth device between netns A and netns B, then code *in netns A* has no business knowing the identity of its veth peer if its peer (B) is a sibling or ancestor. It also IMO has no business knowing the identity of its own netns (A) other than as my netns. I do not agree (see the example below). If A and B are siblings, then their parent needs to know where that veth device goes, but I think this is already the case to a sufficient extent today. I'm not aware of a hierarchy between netns. A daemon should be able to got the full network configuration, even if it's started when this configuration is already applied, ie even if it doesn't know what happen before it starts. I don't know exactly which namespaces have an explicit hierarchy, but there is certainly a hierarchy of *user* namespaces, and network namespaces live in user namespaces, so they at least have somewhat of a hierarchy. I feel like this discussion is falling into a common trap of new API discussions. Can one of you who wants this API please articulate, with a reasonably precise example, what it is that you want to do, why you can't easily do it already, and how this API helps? I currently understand how the API creates problems, but I don't understand how it solves any problems, and I will NAK it (and I suspect that Eric will, too, which is pretty much fatal) unless that changes. What I'm trying to solve is to have full info in netlink messages sent by the kernel, thus beeing able to identify a peer netns (and this is close from what audit guys are trying to have). Theorically, messages sent by the kernel can be reused as is to have the same configuration. This is not the case with x-netns devices. Here is an example, with ip tunnels: $ ip netns add 1 $ ip link add ipip1 type ipip remote 10.16.0.121 local 10.16.0.249 dev eth0 $ ip -d link ls ipip1 8: ipip1@eth0: POINTOPOINT,NOARP mtu 1480 qdisc noop state DOWN mode DEFAULT group default link/ipip 10.16.0.249 peer 10.16.0.121 promiscuity 0 ipip remote 10.16.0.121 local 10.16.0.249 dev eth0 ttl inherit pmtudisc $ ip link set ipip1 netns 1 $ ip netns exec 1 ip -d link ls ipip1 8: ipip1@tunl0: POINTOPOINT,NOARP,M-DOWN mtu 1480 qdisc noop state DOWN mode DEFAULT group default link/ipip 10.16.0.249 peer 10.16.0.121 promiscuity 0 ipip remote 10.16.0.121 local 10.16.0.249 dev tunl0 ttl inherit pmtudisc Now informations got with 'ip link' are wrong and incomplete: - the link dev is now tunl0 instead of eth0, because we only got an ifindex from the kernel without any netns informations. - the encapsulation addresses are not part of this netns but the user doesn't known that (still because netns info is missing). These IPv4 addresses may exist into this netns. - it's not possible to create the same netdevice with these infos. Aha. That's a genuine problem. Perhaps we need a concept
Re: [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc
On Mon, Aug 25, 2014 at 9:41 AM, Nicolas Dichtel nicolas.dich...@6wind.com wrote: Le 25/08/2014 18:13, Andy Lutomirski a écrit : On Mon, Aug 25, 2014 at 8:43 AM, Nicolas Dichtel nicolas.dich...@6wind.com wrote: Le 25/08/2014 16:04, Andy Lutomirski a écrit : On Aug 25, 2014 6:30 AM, Nicolas Dichtel nicolas.dich...@6wind.com wrote: CRIU wants to save the complete state of a namespace and then restore it. For that to work, any information exposed to things in the namespace *cannot* be globally unique or unique per boot, since CRIU needs to arrange for that information to match whatever it was when CRIU saved it. How are ifindex of network devices managed? These ifindexes are unique per boot, thus can change depending on the order in which netdev are created. These ifindexes are unique per boot and exposed to userspace ... This does not appear to be true. $ sudo unshare --net # ip link add veth0 type veth peer name veth1 # ip link 1: lo: LOOPBACK mtu 65536 qdisc noop state DOWN mode DEFAULT group default link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 2: veth1: BROADCAST,MULTICAST mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 06:0d:59:c7:a6:a8 brd ff:ff:ff:ff:ff:ff 3: veth0: BROADCAST,MULTICAST mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether b2:5c:8b:f2:12:28 brd ff:ff:ff:ff:ff:ff # logout $ ip link 1: lo: LOOPBACK,UP,LOWER_UP mtu 65536 qdisc noqueue state UNKNOWN link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 3: em1: NO-CARRIER,BROADCAST,MULTICAST,UP mtu 1500 qdisc pfifo_fast state DOWN qlen 1000 I've probably misunderstood what you're trying to say. ifindexes are unique per boot and per netns. I think we both misunderstood each other. The ifindexes are unique *per netns*, which means that, if you're unprivileged in a netns, global information doesn't leak to you. I think this is good. Ok, I agree. I think audit daemons are always running under privileged users. Let me try again, with emphasis in the right place. I think that *code running in a namespace* has no business even knowing a unique identity of *that namespace* from the perspective of the host. In your example, if there's a veth device between netns A and netns B, then code *in netns A* has no business knowing the identity of its veth peer if its peer (B) is a sibling or ancestor. It also IMO has no business knowing the identity of its own netns (A) other than as my netns. I do not agree (see the example below). If A and B are siblings, then their parent needs to know where that veth device goes, but I think this is already the case to a sufficient extent today. I'm not aware of a hierarchy between netns. A daemon should be able to got the full network configuration, even if it's started when this configuration is already applied, ie even if it doesn't know what happen before it starts. I don't know exactly which namespaces have an explicit hierarchy, but there is certainly a hierarchy of *user* namespaces, and network namespaces live in user namespaces, so they at least have somewhat of a hierarchy. I feel like this discussion is falling into a common trap of new API discussions. Can one of you who wants this API please articulate, with a reasonably precise example, what it is that you want to do, why you can't easily do it already, and how this API helps? I currently understand how the API creates problems, but I don't understand how it solves any problems, and I will NAK it (and I suspect that Eric will, too, which is pretty much fatal) unless that changes. What I'm trying to solve is to have full info in netlink messages sent by the kernel, thus beeing able to identify a peer netns (and this is close from what audit guys are trying to have). Theorically, messages sent by the kernel can be reused as is to have the same configuration. This is not the case with x-netns devices. Here is an example, with ip tunnels: $ ip netns add 1 $ ip link add ipip1 type ipip remote 10.16.0.121 local 10.16.0.249 dev eth0 $ ip -d link ls ipip1 8: ipip1@eth0: POINTOPOINT,NOARP mtu 1480 qdisc noop state DOWN mode DEFAULT group default link/ipip 10.16.0.249 peer 10.16.0.121 promiscuity 0 ipip remote 10.16.0.121 local 10.16.0.249 dev eth0 ttl inherit pmtudisc $ ip link set ipip1 netns 1 $ ip netns exec 1 ip -d link ls ipip1 8: ipip1@tunl0: POINTOPOINT,NOARP,M-DOWN mtu 1480 qdisc noop state DOWN mode DEFAULT group default link/ipip 10.16.0.249 peer 10.16.0.121 promiscuity 0 ipip remote 10.16.0.121 local 10.16.0.249 dev tunl0 ttl inherit pmtudisc Now informations got with 'ip link' are wrong and incomplete: - the link dev is now tunl0 instead of eth0, because we only got an ifindex from the kernel without any netns informations. - the encapsulation addresses are not part of this netns but the user
Re: [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc
On Aug 25, 2014 6:30 AM, Nicolas Dichtel nicolas.dich...@6wind.com wrote: CRIU wants to save the complete state of a namespace and then restore it. For that to work, any information exposed to things in the namespace *cannot* be globally unique or unique per boot, since CRIU needs to arrange for that information to match whatever it was when CRIU saved it. How are ifindex of network devices managed? These ifindexes are unique per boot, thus can change depending on the order in which netdev are created. These ifindexes are unique per boot and exposed to userspace ... This does not appear to be true. $ sudo unshare --net # ip link add veth0 type veth peer name veth1 # ip link 1: lo: LOOPBACK mtu 65536 qdisc noop state DOWN mode DEFAULT group default link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 2: veth1: BROADCAST,MULTICAST mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 06:0d:59:c7:a6:a8 brd ff:ff:ff:ff:ff:ff 3: veth0: BROADCAST,MULTICAST mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether b2:5c:8b:f2:12:28 brd ff:ff:ff:ff:ff:ff # logout $ ip link 1: lo: LOOPBACK,UP,LOWER_UP mtu 65536 qdisc noqueue state UNKNOWN link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 3: em1: NO-CARRIER,BROADCAST,MULTICAST,UP mtu 1500 qdisc pfifo_fast state DOWN qlen 1000 Also, I think that code running in a namespace has no business even knowing a unique identity of that namespace from the perspective of the host. Another scenario is when you have virtual network devices across two netns. You need to identify the peer netns to have a netlink message which is fully interpretable by the userspace. Let me try again, with emphasis in the right place. I think that *code running in a namespace* has no business even knowing a unique identity of *that namespace* from the perspective of the host. In your example, if there's a veth device between netns A and netns B, then code *in netns A* has no business knowing the identity of its veth peer if its peer (B) is a sibling or ancestor. It also IMO has no business knowing the identity of its own netns (A) other than as my netns. If A and B are siblings, then their parent needs to know where that veth device goes, but I think this is already the case to a sufficient extent today. I feel like this discussion is falling into a common trap of new API discussions. Can one of you who wants this API please articulate, with a reasonably precise example, what it is that you want to do, why you can't easily do it already, and how this API helps? I currently understand how the API creates problems, but I don't understand how it solves any problems, and I will NAK it (and I suspect that Eric will, too, which is pretty much fatal) unless that changes. Thanks, Andy Regards, Nicolas -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc
On Thu, Aug 21, 2014 at 6:58 PM, Richard Guy Briggs r...@redhat.com wrote: On 14/08/21, Andy Lutomirski wrote: On Aug 20, 2014 8:12 PM, Richard Guy Briggs r...@redhat.com wrote: Expose the namespace instace serial numbers in the proc filesystem at /proc/pid/ns/ns_snum. The link text gives the serial number in hex. What's the use case? I understand the utility of giving unique numbers to the audit code, but I don't think this part is necessary for that, and I'd like to understand what else will use this before committing to a duplicative API like this. How does a container manager get those numbers? It could provoke a task to cause an audit event that emits a NS_INFO message, or it could run a task in that container to report its namespace serial numbers directly from its /proc mount. Why does a container manager need them? Is there any reason that keeping them entirely contained within the audit system would be a problem? The discussion in this thread touches on the use cases: https://lkml.org/lkml/2014/4/22/662 Note that this API is thoroughly incompatible with CRIU. If we do this, someone will ask for a namespace number namespace, and that way lies madness. I had a very brief look at CRIU, but not enough to understand the issue. Others have hinted at this problem. Do you have a suggestion of a different approach that would be compatible with CRIU? I'd originally considered some sort of UUID that would be globally unique, but that would be very hard to devise or guarantee, and besides, namespaces aren't only used by containers and could be shared in other ways. Tracking the usage and migration of namespaces should be the task of an upper layer. CRIU wants to save the complete state of a namespace and then restore it. For that to work, any information exposed to things in the namespace *cannot* be globally unique or unique per boot, since CRIU needs to arrange for that information to match whatever it was when CRIU saved it. Also, I think that code running in a namespace has no business even knowing a unique identity of that namespace from the perspective of the host. Here's a specific use case for *not* exposing this: Tor. Ideally, Tor clients would run in a namespace that does not know about any global identity. That means no IP addresses, but it also means no global namespace serial numbers. --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH V4 1/8] namespaces: assign each namespace instance a serial number
On Aug 20, 2014 8:12 PM, Richard Guy Briggs r...@redhat.com wrote: Generate and assign a serial number per namespace instance since boot. Use a serial number per namespace (unique across one boot of one kernel) instead of the inode number (which is claimed to have had the right to change reserved and is not necessarily unique if there is more than one proc fs) to uniquely identify it per kernel boot. Signed-off-by: Richard Guy Briggs r...@redhat.com --- fs/mount.h |1 + fs/namespace.c |1 + include/linux/ipc_namespace.h |1 + include/linux/nsproxy.h|8 include/linux/pid_namespace.h |1 + include/linux/user_namespace.h |1 + include/linux/utsname.h|1 + include/net/net_namespace.h|1 + init/version.c |1 + ipc/msgutil.c |1 + ipc/namespace.c|2 ++ kernel/nsproxy.c | 17 + kernel/pid.c |1 + kernel/pid_namespace.c |2 ++ kernel/user.c |1 + kernel/user_namespace.c|2 ++ kernel/utsname.c |2 ++ net/core/net_namespace.c |8 +++- 18 files changed, 51 insertions(+), 1 deletions(-) diff --git a/fs/mount.h b/fs/mount.h index d55297f..c076f99 100644 --- a/fs/mount.h +++ b/fs/mount.h @@ -5,6 +5,7 @@ struct mnt_namespace { atomic_tcount; unsigned intproc_inum; + long long serial_num; struct mount * root; struct list_headlist; struct user_namespace *user_ns; diff --git a/fs/namespace.c b/fs/namespace.c index 182bc41..9af49ff 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2486,6 +2486,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns) kfree(new_ns); return ERR_PTR(ret); } + new_ns-serial_num = ns_serial(); new_ns-seq = atomic64_add_return(1, mnt_ns_seq); atomic_set(new_ns-count, 1); new_ns-root = NULL; diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h index 35e7eca..8ccfb2d 100644 --- a/include/linux/ipc_namespace.h +++ b/include/linux/ipc_namespace.h @@ -69,6 +69,7 @@ struct ipc_namespace { struct user_namespace *user_ns; unsigned intproc_inum; + long long serial_num; }; extern struct ipc_namespace init_ipc_ns; diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h index b4ec59d..8e5fe0d 100644 --- a/include/linux/nsproxy.h +++ b/include/linux/nsproxy.h @@ -66,6 +66,14 @@ static inline struct nsproxy *task_nsproxy(struct task_struct *tsk) return rcu_dereference(tsk-nsproxy); } +long long ns_serial(void); +enum { + NS_IPC_INIT_SN = 1, + NS_UTS_INIT_SN = 2, + NS_USER_INIT_SN = 3, + NS_PID_INIT_SN = 4, Please add an extra value here for the first dynamic value... +/** + * ns_serial - compute a serial number for the namespace + * + * Compute a serial number for the namespace to uniquely identify it in + * audit records. + */ +long long ns_serial(void) +{ + static atomic64_t serial = ATOMIC_INIT(4); /* reserved for IPC, UTS, user, PID */ ...and use it here. Also, does this work on all architectures? --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc
On Aug 20, 2014 8:12 PM, Richard Guy Briggs r...@redhat.com wrote: Expose the namespace instace serial numbers in the proc filesystem at /proc/pid/ns/ns_snum. The link text gives the serial number in hex. What's the use case? I understand the utility of giving unique numbers to the audit code, but I don't think this part is necessary for that, and I'd like to understand what else will use this before committing to a duplicative API like this. Note that this API is thoroughly incompatible with CRIU. If we do this, someone will ask for a namespace number namespace, and that way lies madness. --Andy snum was chosen instead of seq for consistency with inum and there are a number of other uses of seq in the namespace code. Suggested-by: Serge E. Hallyn se...@hallyn.com Signed-off-by: Richard Guy Briggs r...@redhat.com --- fs/proc/namespaces.c | 33 + 1 files changed, 25 insertions(+), 8 deletions(-) diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c index 8902609..e953e0a 100644 --- a/fs/proc/namespaces.c +++ b/fs/proc/namespaces.c @@ -47,12 +47,15 @@ static char *ns_dname(struct dentry *dentry, char *buffer, int buflen) struct inode *inode = dentry-d_inode; const struct proc_ns_operations *ns_ops = PROC_I(inode)-ns.ns_ops; - return dynamic_dname(dentry, buffer, buflen, %s:[%lu], - ns_ops-name, inode-i_ino); + if (strstr(dentry-d_iname, _snum)) + return dynamic_dname(dentry, buffer, buflen, %s_snum:[%llx], + ns_ops-name, ns_ops-snum(PROC_I(inode)-ns.ns)); + else + return dynamic_dname(dentry, buffer, buflen, %s:[%lu], + ns_ops-name, inode-i_ino); } -const struct dentry_operations ns_dentry_operations = -{ +const struct dentry_operations ns_dentry_operations = { .d_delete = always_delete_dentry, .d_dname= ns_dname, }; @@ -160,7 +163,10 @@ static int proc_ns_readlink(struct dentry *dentry, char __user *buffer, int bufl if (!ns) goto out_put_task; - snprintf(name, sizeof(name), %s:[%u], ns_ops-name, ns_ops-inum(ns)); + if (strstr(dentry-d_iname, _snum)) + snprintf(name, sizeof(name), %s_snum:[%llx], ns_ops-name, ns_ops-snum(ns)); + else + snprintf(name, sizeof(name), %s:[%u], ns_ops-name, ns_ops-inum(ns)); res = readlink_copy(buffer, buflen, name); ns_ops-put(ns); out_put_task: @@ -210,16 +216,23 @@ static int proc_ns_dir_readdir(struct file *file, struct dir_context *ctx) if (!dir_emit_dots(file, ctx)) goto out; - if (ctx-pos = 2 + ARRAY_SIZE(ns_entries)) + if (ctx-pos = 2 + 2 * ARRAY_SIZE(ns_entries)) goto out; entry = ns_entries + (ctx-pos - 2); last = ns_entries[ARRAY_SIZE(ns_entries) - 1]; while (entry = last) { const struct proc_ns_operations *ops = *entry; + char name[50]; + if (!proc_fill_cache(file, ctx, ops-name, strlen(ops-name), proc_ns_instantiate, task, ops)) break; ctx-pos++; + snprintf(name, sizeof(name), %s_snum, ops-name); + if (!proc_fill_cache(file, ctx, name, strlen(name), +proc_ns_instantiate, task, ops)) + break; + ctx-pos++; entry++; } out: @@ -247,9 +260,13 @@ static struct dentry *proc_ns_dir_lookup(struct inode *dir, last = ns_entries[ARRAY_SIZE(ns_entries)]; for (entry = ns_entries; entry last; entry++) { - if (strlen((*entry)-name) != len) + char name[50]; + + snprintf(name, sizeof(name), %s_snum, (*entry)-name); + if (strlen((*entry)-name) != len strlen(name) != len) continue; - if (!memcmp(dentry-d_name.name, (*entry)-name, len)) + if (!memcmp(dentry-d_name.name, (*entry)-name, len) + || !memcmp(dentry-d_name.name, name, len)) break; } if (entry == last) -- 1.7.1 -- To unsubscribe from this list: send the line unsubscribe linux-api in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH V4 1/8] namespaces: assign each namespace instance a serial number
On Thu, Aug 21, 2014 at 2:28 PM, Richard Guy Briggs r...@redhat.com wrote: On 14/08/21, Andy Lutomirski wrote: On Aug 20, 2014 8:12 PM, Richard Guy Briggs r...@redhat.com wrote: Generate and assign a serial number per namespace instance since boot. Use a serial number per namespace (unique across one boot of one kernel) instead of the inode number (which is claimed to have had the right to change reserved and is not necessarily unique if there is more than one proc fs) to uniquely identify it per kernel boot. Signed-off-by: Richard Guy Briggs r...@redhat.com --- fs/mount.h |1 + fs/namespace.c |1 + include/linux/ipc_namespace.h |1 + include/linux/nsproxy.h|8 include/linux/pid_namespace.h |1 + include/linux/user_namespace.h |1 + include/linux/utsname.h|1 + include/net/net_namespace.h|1 + init/version.c |1 + ipc/msgutil.c |1 + ipc/namespace.c|2 ++ kernel/nsproxy.c | 17 + kernel/pid.c |1 + kernel/pid_namespace.c |2 ++ kernel/user.c |1 + kernel/user_namespace.c|2 ++ kernel/utsname.c |2 ++ net/core/net_namespace.c |8 +++- 18 files changed, 51 insertions(+), 1 deletions(-) diff --git a/fs/mount.h b/fs/mount.h index d55297f..c076f99 100644 --- a/fs/mount.h +++ b/fs/mount.h @@ -5,6 +5,7 @@ struct mnt_namespace { atomic_tcount; unsigned intproc_inum; + long long serial_num; struct mount * root; struct list_headlist; struct user_namespace *user_ns; diff --git a/fs/namespace.c b/fs/namespace.c index 182bc41..9af49ff 100644 --- a/fs/namespace.c +++ b/fs/namespace.c @@ -2486,6 +2486,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct user_namespace *user_ns) kfree(new_ns); return ERR_PTR(ret); } + new_ns-serial_num = ns_serial(); new_ns-seq = atomic64_add_return(1, mnt_ns_seq); atomic_set(new_ns-count, 1); new_ns-root = NULL; diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h index 35e7eca..8ccfb2d 100644 --- a/include/linux/ipc_namespace.h +++ b/include/linux/ipc_namespace.h @@ -69,6 +69,7 @@ struct ipc_namespace { struct user_namespace *user_ns; unsigned intproc_inum; + long long serial_num; }; extern struct ipc_namespace init_ipc_ns; diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h index b4ec59d..8e5fe0d 100644 --- a/include/linux/nsproxy.h +++ b/include/linux/nsproxy.h @@ -66,6 +66,14 @@ static inline struct nsproxy *task_nsproxy(struct task_struct *tsk) return rcu_dereference(tsk-nsproxy); } +long long ns_serial(void); +enum { + NS_IPC_INIT_SN = 1, + NS_UTS_INIT_SN = 2, + NS_USER_INIT_SN = 3, + NS_PID_INIT_SN = 4, Please add an extra value here for the first dynamic value... +/** + * ns_serial - compute a serial number for the namespace + * + * Compute a serial number for the namespace to uniquely identify it in + * audit records. + */ +long long ns_serial(void) +{ + static atomic64_t serial = ATOMIC_INIT(4); /* reserved for IPC, UTS, user, PID */ ...and use it here. Yup, good idea. Thanks. Also, does this work on all architectures? I only know for certain x86_64. There is discussion elsewhere about changing ns_serial from returning a long long to returning a u64. That should help. I can run standard compile and boot tests on several others, but won't be able to check output. I was thinking of atomic64. I guess there's a generic implementation that works everywhere, if somewhat slowly. --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Fwd: 3.15: kernel BUG at kernel/auditsc.c:1525!
Steve Grubb pointed out that I forgot to cc: linux-audit. --Andy -- Forwarded message -- From: Andy Lutomirski l...@amacapital.net Date: Mon, Jun 16, 2014 at 2:35 PM Subject: Re: 3.15: kernel BUG at kernel/auditsc.c:1525! To: Richard Weinberger rich...@nod.at, H. Peter Anvin h...@zytor.com, X86 ML x...@kernel.org Cc: Toralf Förster toralf.foers...@gmx.de, Eric Paris epa...@redhat.com, Linux Kernel linux-ker...@vger.kernel.org [cc: hpa, x86 list] On Mon, Jun 16, 2014 at 1:43 PM, Richard Weinberger rich...@nod.at wrote: Am 16.06.2014 22:41, schrieb Toralf Förster: Well, might be the mail:subject should be adapted, b/c the issue can be triggered in a 3.13.11 kernel too. Unfortunately it does not appear within an UML guest, therefore an automated bisecting isn't possible I fear. You could try KVM. :) Before you do that, just to clarify: What bitness is your kernel? That is, are you on 32-bit or 64-bit kernel? What bitness is your test case? 'file a.out' will say. What does /proc/cpuinfo say in flags? Can you try the attached patch? It's only compile-tested. To hpa, etc: It appears that entry_32.S is missing any call to the audit exit hook on the badsys path. If I'm diagnosing this bug report correctly, this causes OOPSes. The the world at large: it's increasingly apparent that no one (except maybe the blackhats) has ever scrutinized the syscall auditing code. This is two old severe bugs in the code that have probably been there for a long time. --Andy -- Andy Lutomirski AMA Capital Management, LLC From 8b43bd2118d876cb3163e8f7d9cd8253da649335 Mon Sep 17 00:00:00 2001 Message-Id: 8b43bd2118d876cb3163e8f7d9cd8253da649335.1402954406.git.l...@amacapital.net From: Andy Lutomirski l...@amacapital.net Date: Mon, 16 Jun 2014 14:28:19 -0700 Subject: [PATCH] x86_32,entry: Fix badsys paths MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The bad syscall nr paths are their own incomprehensible route through the entry control flow. Rearrange them to work just like syscalls that return -ENOSYS. This should fix an OOPS in the audit code when auditing is enabled and bad syscall nrs are used. Reported-by: Toralf Förster toralf.foers...@gmx.de Signed-off-by: Andy Lutomirski l...@amacapital.net --- arch/x86/kernel/entry_32.S | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S index 98313ff..eb6e07e 100644 --- a/arch/x86/kernel/entry_32.S +++ b/arch/x86/kernel/entry_32.S @@ -431,9 +431,10 @@ sysenter_past_esp: jnz sysenter_audit sysenter_do_call: cmpl $(NR_syscalls), %eax - jae syscall_badsys + jae sysenter_badsys call *sys_call_table(,%eax,4) movl %eax,PT_EAX(%esp) +sysenter_after_call: LOCKDEP_SYS_EXIT DISABLE_INTERRUPTS(CLBR_ANY) TRACE_IRQS_OFF @@ -687,7 +688,12 @@ END(syscall_fault) syscall_badsys: movl $-ENOSYS,PT_EAX(%esp) - jmp resume_userspace + jmp syscall_exit +END(syscall_badsys) + +sysenter_badsys: + movl $-ENOSYS,PT_EAX(%esp) + jmp sysenter_after_call END(syscall_badsys) CFI_ENDPROC /* -- 1.9.3 -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH 1/2] auditsc: audit_krule mask accesses need bounds checking
On Mon, Jun 9, 2014 at 3:30 PM, Greg KH gre...@linuxfoundation.org wrote: On Wed, May 28, 2014 at 11:09:58PM -0400, Eric Paris wrote: From: Andy Lutomirski l...@amacapital.net Fixes an easy DoS and possible information disclosure. This does nothing about the broken state of x32 auditing. eparis: If the admin has enabled auditd and has specifically loaded audit rules. This bug has been around since before git. Wow... Cc: sta...@vger.kernel.org Signed-off-by: Andy Lutomirski l...@amacapital.net Signed-off-by: Eric Paris epa...@redhat.com --- kernel/auditsc.c | 27 ++- 1 file changed, 18 insertions(+), 9 deletions(-) Did this patch get dropped somewhere? Isn't it a valid bugfix, or did I miss a later conversation about this? Hmm. It seems that it didn't make it into Linus' tree. Crap. IMO we need some kind of real tracking system for issues reported to security@. This shouldn't have been possible (and if I'd realized that the patch got dropped, I wouldn't have publicly disclosed it). For whoever applies this: it's CVE-2014-3917. --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH 1/2] auditsc: audit_krule mask accesses need bounds checking
On Mon, Jun 9, 2014 at 3:53 PM, Linus Torvalds torva...@linux-foundation.org wrote: On Mon, Jun 9, 2014 at 3:35 PM, Andy Lutomirski l...@amacapital.net wrote: Hmm. It seems that it didn't make it into Linus' tree. Crap. I assume that if there is a maintainer who normally sends me stuff by git, when I see patches in emails they are just informational heads-ups about stuff that is being discussed or pending, and that I'll see it later in a pull request. So I just ignore them unless I have specific comments, since clearly the emailed patch is just informational and/or for comments/acks from others. The exception is unless it *VERY CLEARLY* says otherwise (as in Linus, can you please take this directly due to xyz). Because why would somebody send me a patch series sometimes, and git trees at other times? That would just be stupid. In this particular case, it's my patch, and I've never sent you a pull request. I sort of assumed that secur...@kernel.org magically caused acknowledged fixes to end up in your tree. I'm not sure what I'm supposed to do here. Maybe the confusion is because Eric resent the patch? --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
[GIT PULL] CVE-2014-3917
[This is my first pull request. I may be doing any number of things wrong.] Hi Linus, The following changes since commit f2159d1e99612ceb94bf9a2dc2fbca409d828b1b: Merge tag 'sound-3.15-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound (2014-05-28 11:17:41 -0700) are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git CVE-2014-3917 for you to fetch changes up to 9fbb05647f3aea17e4e01393bbc42f64ee307409: auditsc: audit_krule mask accesses need bounds checking (2014-06-09 16:06:40 -0700) Andy Lutomirski (1): auditsc: audit_krule mask accesses need bounds checking kernel/auditsc.c | 27 ++- 1 file changed, 18 insertions(+), 9 deletions(-) NB: This is exactly the same patch that's been on the list, except that I added the CVE number to the description. -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH 1/2] auditsc: audit_krule mask accesses need bounds checking
On Mon, Jun 9, 2014 at 3:46 PM, Greg KH gre...@linuxfoundation.org wrote: On Mon, Jun 09, 2014 at 03:35:02PM -0700, Andy Lutomirski wrote: On Mon, Jun 9, 2014 at 3:30 PM, Greg KH gre...@linuxfoundation.org wrote: On Wed, May 28, 2014 at 11:09:58PM -0400, Eric Paris wrote: From: Andy Lutomirski l...@amacapital.net Fixes an easy DoS and possible information disclosure. This does nothing about the broken state of x32 auditing. eparis: If the admin has enabled auditd and has specifically loaded audit rules. This bug has been around since before git. Wow... Cc: sta...@vger.kernel.org Signed-off-by: Andy Lutomirski l...@amacapital.net Signed-off-by: Eric Paris epa...@redhat.com --- kernel/auditsc.c | 27 ++- 1 file changed, 18 insertions(+), 9 deletions(-) Did this patch get dropped somewhere? Isn't it a valid bugfix, or did I miss a later conversation about this? Hmm. It seems that it didn't make it into Linus' tree. Crap. IMO we need some kind of real tracking system for issues reported to security@. That seems to be my mbox at times :) But yes, having something real might be good if the load gets higher, right now it's so low that my sweep pending security patches task usually catches anything pending, which is rare. There are currently at least two issues that I reported that are stuck in limbo: this one and the (not-yet-public) vfs thing. And there's the CVE-2014-0181 regression fix that almost got forgotten, but that isn't really a security issue. And I can't read your mbox :-/ --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH 1/2] auditsc: audit_krule mask accesses need bounds checking
On Mon, Jun 9, 2014 at 5:32 PM, Greg KH gre...@linuxfoundation.org wrote: On Mon, Jun 09, 2014 at 03:55:20PM -0700, Andy Lutomirski wrote: On Mon, Jun 9, 2014 at 3:46 PM, Greg KH gre...@linuxfoundation.org wrote: On Mon, Jun 09, 2014 at 03:35:02PM -0700, Andy Lutomirski wrote: On Mon, Jun 9, 2014 at 3:30 PM, Greg KH gre...@linuxfoundation.org wrote: On Wed, May 28, 2014 at 11:09:58PM -0400, Eric Paris wrote: From: Andy Lutomirski l...@amacapital.net Fixes an easy DoS and possible information disclosure. This does nothing about the broken state of x32 auditing. eparis: If the admin has enabled auditd and has specifically loaded audit rules. This bug has been around since before git. Wow... Cc: sta...@vger.kernel.org Signed-off-by: Andy Lutomirski l...@amacapital.net Signed-off-by: Eric Paris epa...@redhat.com --- kernel/auditsc.c | 27 ++- 1 file changed, 18 insertions(+), 9 deletions(-) Did this patch get dropped somewhere? Isn't it a valid bugfix, or did I miss a later conversation about this? Hmm. It seems that it didn't make it into Linus' tree. Crap. IMO we need some kind of real tracking system for issues reported to security@. That seems to be my mbox at times :) But yes, having something real might be good if the load gets higher, right now it's so low that my sweep pending security patches task usually catches anything pending, which is rare. There are currently at least two issues that I reported that are stuck in limbo: this one and the (not-yet-public) vfs thing. That was next on my list to poke people about... And there's the CVE-2014-0181 regression fix that almost got forgotten, but that isn't really a security issue. What is that, where was that reported? commit 2d7a85f4b06e9c27ff629f07a524c48074f07f81 Author: Eric W. Biederman ebied...@xmission.com Date: Fri May 30 11:04:00 2014 -0700 netlink: Only check file credentials for implicit destinations The security issue got fixed quickly, but the fix turned out to be problematic. --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
[PATCH 2/2] audit: Syscall auditing lite
AFAICS the main use of syscall auditing is to get syscall information for syscalls that are already causing another audit message. We don't need any of the fancy syscall auditing machinery for that, though: we can just log this information directly. This should have essentially no overhead and it could end up being much easier to use than auditsc. This produces messages like this: audit: type=1123 audit(1401485315.370:2): pid=125 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='blah blah blah' arch=c03e syscall=44 a0=3 a1=7fff383feb60 a2=5c a3=0 a4=7fff383feb50 a5=c The new fields (arch, syscall, and a0..a5) will only be logged if we are in a syscall but we aren't otherwise building an auditsc context. This is only supported on x86 for now. Other architectures can get this if they implement syscall_in_syscall. Signed-off-by: Andy Lutomirski l...@amacapital.net --- kernel/audit.c | 44 +++- 1 file changed, 43 insertions(+), 1 deletion(-) diff --git a/kernel/audit.c b/kernel/audit.c index 47845c5..8509d00 100644 --- a/kernel/audit.c +++ b/kernel/audit.c @@ -67,6 +67,10 @@ #include linux/pid_namespace.h #include net/netns/generic.h +#ifdef CONFIG_HAVE_SYSCALL_IN_SYSCALL +#include asm/syscall.h +#endif + #include audit.h /* No auditing will take place until audit_initialized == AUDIT_INITIALIZED. @@ -1897,6 +1901,40 @@ out: kfree(name); } +#ifdef CONFIG_HAVE_SYSCALL_IN_SYSCALL +/** + * audit_log_missing_context - append otherwise-missing context + * @ab: the audit_buffer + * + * If syscall auditing is unavailable, try to log syscall context + * information anyway. + */ +static void audit_log_missing_context(struct audit_buffer *ab) +{ + struct task_struct *tsk = current; + struct pt_regs *regs = current_pt_regs(); + unsigned long args[6]; + + if (!syscall_in_syscall(tsk, regs)) + return; + + if (ab-ctx ab-ctx-in_syscall) + return; /* Let audit_log_exit log the context. */ + + syscall_get_arguments(tsk, regs, 0, 6, args); + + audit_log_format(ab, arch=%x syscall=%d a0=%lx a1=%lx a2=%lx a3=%lx a4=%lx a5=%lx, +(unsigned int)syscall_get_arch(), +syscall_get_nr(tsk, regs), +args[0], args[1], args[2], args[3], args[4], args[5]); +} +#else +static void audit_log_missing_context(struct audit_buffer *ab) +{ + /* We need arch support to do this reliably, so don't even try. */ +} +#endif + /** * audit_log_end - end one audit record * @ab: the audit_buffer @@ -1913,7 +1951,11 @@ void audit_log_end(struct audit_buffer *ab) if (!audit_rate_check()) { audit_log_lost(rate limit exceeded); } else { - struct nlmsghdr *nlh = nlmsg_hdr(ab-skb); + struct nlmsghdr *nlh; + + audit_log_missing_context(ab); + + nlh = nlmsg_hdr(ab-skb); nlh-nlmsg_len = ab-skb-len - NLMSG_HDRLEN; if (audit_pid) { -- 1.9.3 -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
[PATCH 0/2] Syscall auditing lite
I've made no secret of the fact that I dislike syscall auditing. As far as I can tell, the main technical (i.e. not compliance-related) use of syscall auditing is to supply some useful context information to go along with events like AVC denials. CONFIG_AUDITSYSCALL is serious overkill to do this. kernel/auditsc.c is ~2500 lines of terror. This patchset accomplishes the same goal, more usefully, with no overhead at all, in under 70 lines of code. It tries to coexist cleanly with CONFIG_AUDITSYSCALL. This is only implemented for x86. Other architectures can add support fairly easily, I think. Andy Lutomirski (2): x86,syscall: Add syscall_in_syscall to test whether we're in a syscall audit: Syscall auditing lite arch/x86/Kconfig | 1 + arch/x86/include/asm/syscall.h | 21 init/Kconfig | 3 +++ kernel/audit.c | 44 +- 4 files changed, 68 insertions(+), 1 deletion(-) -- 1.9.3 -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
[PATCH 1/2] x86, syscall: Add syscall_in_syscall to test whether we're in a syscall
syscall_in_syscall will return true if we're in a real syscall and will return false if we're not in a syscall. If we're in a bad syscall, the return value can vary. The idea is to use this to come up with a much simpler replacement for syscall auditing. Signed-off-by: Andy Lutomirski l...@amacapital.net --- arch/x86/Kconfig | 1 + arch/x86/include/asm/syscall.h | 21 + init/Kconfig | 3 +++ 3 files changed, 25 insertions(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 25d2c6f..e2602d4 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -130,6 +130,7 @@ config X86 select HAVE_CC_STACKPROTECTOR select GENERIC_CPU_AUTOPROBE select HAVE_ARCH_AUDITSYSCALL + select HAVE_SYSCALL_IN_SYSCALL config INSTRUCTION_DECODER def_bool y diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h index d6a756a..91e38b3 100644 --- a/arch/x86/include/asm/syscall.h +++ b/arch/x86/include/asm/syscall.h @@ -23,6 +23,27 @@ typedef void (*sys_call_ptr_t)(void); extern const sys_call_ptr_t sys_call_table[]; +/** + * syscall_in_syscall() - are we in a syscall context? + * @task: The task to query. + * @regs: The task's pt_regs. + * + * This checks whether we are in a syscall. If it returns true, then + * syscall_get_nr(), etc are usable and the current task is guaranteed + * to either die or to go through the syscall exit path when the syscall + * is done. + * + * If it returns false, no particular guarantees are made. In + * particular, a malicious task can issue a syscall that causes + * syscall_in_syscall to return false. Such a syscall won't do much, + * but it can still cause tracing code and such to run. + */ +static inline bool syscall_in_syscall(struct task_struct *task, + struct pt_regs *regs) +{ + return regs-orig_ax != -1; +} + /* * Only the low 32 bits of orig_ax are meaningful, so we return int. * This importantly ignores the high bits on 64-bit, so comparisons diff --git a/init/Kconfig b/init/Kconfig index 9d3585b..bad2053 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -295,6 +295,9 @@ config AUDIT config HAVE_ARCH_AUDITSYSCALL bool +config HAVE_SYSCALL_IN_SYSCALL + bool + config AUDITSYSCALL bool Enable system-call auditing support depends on AUDIT HAVE_ARCH_AUDITSYSCALL -- 1.9.3 -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH 1/2] x86,syscall: Add syscall_in_syscall to test whether we're in a syscall
On May 30, 2014 2:58 PM, Andy Lutomirski l...@amacapital.net wrote: syscall_in_syscall will return true if we're in a real syscall and will return false if we're not in a syscall. If we're in a bad syscall, the return value can vary. The idea is to use this to come up with a much simpler replacement for syscall auditing. Signed-off-by: Andy Lutomirski l...@amacapital.net --- arch/x86/Kconfig | 1 + arch/x86/include/asm/syscall.h | 21 + init/Kconfig | 3 +++ 3 files changed, 25 insertions(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 25d2c6f..e2602d4 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -130,6 +130,7 @@ config X86 select HAVE_CC_STACKPROTECTOR select GENERIC_CPU_AUTOPROBE select HAVE_ARCH_AUDITSYSCALL + select HAVE_SYSCALL_IN_SYSCALL config INSTRUCTION_DECODER def_bool y diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h index d6a756a..91e38b3 100644 --- a/arch/x86/include/asm/syscall.h +++ b/arch/x86/include/asm/syscall.h @@ -23,6 +23,27 @@ typedef void (*sys_call_ptr_t)(void); extern const sys_call_ptr_t sys_call_table[]; +/** + * syscall_in_syscall() - are we in a syscall context? + * @task: The task to query. + * @regs: The task's pt_regs. + * + * This checks whether we are in a syscall. If it returns true, then + * syscall_get_nr(), etc are usable and the current task is guaranteed + * to either die or to go through the syscall exit path when the syscall + * is done. + * + * If it returns false, no particular guarantees are made. In + * particular, a malicious task can issue a syscall that causes + * syscall_in_syscall to return false. Such a syscall won't do much, + * but it can still cause tracing code and such to run. + */ +static inline bool syscall_in_syscall(struct task_struct *task, + struct pt_regs *regs) +{ + return regs-orig_ax != -1; This is insufficient: anything that interrupts an errorentry too early will incorrectly return true. Also, the actual IRQ entries seem to shove the IRQ number into orig_ax. I'll send a new version. --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH v2 2/2] audit: Mark CONFIG_AUDITSYSCALL BROKEN and update help text
On Wed, May 28, 2014 at 7:54 PM, Eric Paris epa...@redhat.com wrote: On Wed, 2014-05-28 at 19:40 -0700, Andy Lutomirski wrote: On Wed, May 28, 2014 at 7:09 PM, Eric Paris epa...@redhat.com wrote: NAK On Wed, 2014-05-28 at 18:44 -0700, Andy Lutomirski wrote: Here are some issues with the code: - It thinks that syscalls have four arguments. Not true at all. It records the registers that would hold the first 4 entries on syscall entry, for use later if needed, as getting those later on some arches is not feasible (see ia64). It makes no assumption about how many syscalls a function has. What about a5 and a6? On the couple of syscalls where a5 and a6 had any state that was actually wanted by someone (mainly just the fd on mmap) audit collects it later in the actual syscall. - It assumes that syscall numbers are between 0 and 2048. There could well be a bug here. Not questioning that. Although that would be patch 1/2 Even with patch 1, it still doesn't handle large syscall numbers -- it just assumes they're not audited. That's because we haven't had large syscall numbers. That's the whole point of an arch doing select HAVE_ARCH_AUDITSYSCALL. If they don't meet the requirements, they shouldn't be selecting it - It's unclear whether it's supposed to be reliable. Unclear to whom? To me. If some inode access or selinux rule triggers an audit, is the auditsc code guaranteed to write an exit record? And see below... This is an honest question: Do you want to discuss these things, or would you be happier if I shut up, fix the bugs you found, and leave things be? I don't want to have an argument, I'm happy to have a discussion if you think that will be beneficial... I'm happy to discuss these things. I'd be happy if you fix things. I think that there are two major issues affecting users of auditsc and that either those issues should be fixed or users of auditsc should understand and accept these issues. Issue 1: syscall numbering. The syscall filters are unlikely to work well on lots of architectures. Issue 2: performance. Until someone fixes it (which is probably hard), turning this thing on hurts a lot. Oleg added a hack awhile ago that lets you do 'auditctl -a task,never' to get rid of the performance hit for new tasks, but that rather defeats the point. The scariness of the code can be lived with, but holy cow it's scary. I'm just not going to fix the issues myself. I tried and gave up. --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
[PATCH v2 2/2] audit: Mark CONFIG_AUDITSYSCALL BROKEN and update help text
Here are some issues with the code: - It thinks that syscalls have four arguments. - It's a performance disaster. - It assumes that syscall numbers are between 0 and 2048. - It's unclear whether it's supposed to be reliable. - It's broken on things like x32. - It can't support ARM OABI. - Its approach to freeing memory is terrifying. Signed-off-by: Andy Lutomirski l...@amacapital.net --- init/Kconfig | 13 - 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/init/Kconfig b/init/Kconfig index 9d3585b..24d4b53 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -296,13 +296,16 @@ config HAVE_ARCH_AUDITSYSCALL bool config AUDITSYSCALL - bool Enable system-call auditing support - depends on AUDIT HAVE_ARCH_AUDITSYSCALL + bool Enable system-call auditing support (not recommended) + depends on AUDIT HAVE_ARCH_AUDITSYSCALL BROKEN default y if SECURITY_SELINUX help - Enable low-overhead system-call auditing infrastructure that - can be used independently or with another kernel subsystem, - such as SELinux. + Enable system-call auditing infrastructure that can be used + independently or with another kernel subsystem, such as + SELinux. + + AUDITSYSCALL has serious performance and correctness issues. + Use it with extreme caution. config AUDIT_WATCH def_bool y -- 1.9.3 -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH v2 1/2] auditsc: audit_krule mask accesses need bounds checking
On Wed, May 28, 2014 at 7:43 PM, Eric Paris epa...@redhat.com wrote: On Wed, 2014-05-28 at 19:27 -0700, Andy Lutomirski wrote: On Wed, May 28, 2014 at 7:23 PM, Eric Paris epa...@redhat.com wrote: On Wed, 2014-05-28 at 18:44 -0700, Andy Lutomirski wrote: Fixes an easy DoS and possible information disclosure. This does nothing about the broken state of x32 auditing. Cc: sta...@vger.kernel.org Signed-off-by: Andy Lutomirski l...@amacapital.net --- kernel/auditsc.c | 27 ++- 1 file changed, 18 insertions(+), 9 deletions(-) diff --git a/kernel/auditsc.c b/kernel/auditsc.c index f251a5e..7ccd9db 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -728,6 +728,22 @@ static enum audit_state audit_filter_task(struct task_struct *tsk, char **key) return AUDIT_BUILD_CONTEXT; } +static bool audit_in_mask(const struct audit_krule *rule, unsigned long val) +{ + int word, bit; + + if (val 0x) + return false; Why is this necessary? To avoid an integer overflow. Admittedly, this particular overflow won't cause a crash, but it will cause incorrect results. You know this code pre-dates git? I admit, I'm shocked no one ever noticed it before! This is ANCIENT. And clearly broken. I'll likely ask Richard to add a WARN_ONCE() in both this place, and below in word AUDIT_BITMASK_SIZE so we might know if we ever need a larger bitmask to store syscall numbers Not there, please! It would make sense to check when rules are *added*, but any user can trivially invoke the filter with pretty much any syscall nr. It'd be nice if lib had a efficient bitmask implementation... + + word = AUDIT_WORD(val); + if (word = AUDIT_BITMASK_SIZE) + return false; Since this covers it and it extensible... + + bit = AUDIT_BIT(val); + + return rule-mask[word] bit; Returning an int as a bool creates worse code than just returning the int. (although in this case if the compiler chooses to inline it might be smart enough not to actually convert this int to a bool and make worse assembly...) I'd suggest the function here return an int. bools usually make the assembly worse... I'm ambivalent. The right assembly would use flags on x86, not an int or a bool, and I don't know what the compiler will do. Also, clearly X86_X32 was implemented without audit support, so we shouldn't config it in. What do you think of this? diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 25d2c6f..fa852e2 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -129,7 +129,7 @@ config X86 select HAVE_IRQ_EXIT_ON_IRQ_STACK if X86_64 select HAVE_CC_STACKPROTECTOR select GENERIC_CPU_AUTOPROBE - select HAVE_ARCH_AUDITSYSCALL + select HAVE_ARCH_AUDITSYSCALL if !X86_X32 I'm fine with that, but I hope that distros will choose x32 over auditsc, at least until someone makes enabling auditsc be less intrusive. Or someone could fix it, modulo the syscall nr 2048 thing. x32 syscall nrs are *huge*. So are lots of other architectures', a few of which probably claim to support auditsc. --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
[PATCH v2 0/2] Fix auditsc DoS and mark it BROKEN
CONFIG_AUDITSYSCALL is awful. Patch 2 enumerates some reasons. Patch 1 fixes a nasty DoS and possible information leak. It should be applied and backported. Patch 2 is optional. I leave it to other peoples' judgment. Andy Lutomirski (2): auditsc: audit_krule mask accesses need bounds checking audit: Move CONFIG_AUDITSYSCALL into staging and update help text Andy Lutomirski (2): auditsc: audit_krule mask accesses need bounds checking audit: Mark CONFIG_AUDITSYSCALL BROKEN and update help text init/Kconfig | 13 - kernel/auditsc.c | 27 ++- 2 files changed, 26 insertions(+), 14 deletions(-) -- 1.9.3 -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH v2 1/2] auditsc: audit_krule mask accesses need bounds checking
On Wed, May 28, 2014 at 7:23 PM, Eric Paris epa...@redhat.com wrote: On Wed, 2014-05-28 at 18:44 -0700, Andy Lutomirski wrote: Fixes an easy DoS and possible information disclosure. This does nothing about the broken state of x32 auditing. Cc: sta...@vger.kernel.org Signed-off-by: Andy Lutomirski l...@amacapital.net --- kernel/auditsc.c | 27 ++- 1 file changed, 18 insertions(+), 9 deletions(-) diff --git a/kernel/auditsc.c b/kernel/auditsc.c index f251a5e..7ccd9db 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -728,6 +728,22 @@ static enum audit_state audit_filter_task(struct task_struct *tsk, char **key) return AUDIT_BUILD_CONTEXT; } +static bool audit_in_mask(const struct audit_krule *rule, unsigned long val) +{ + int word, bit; + + if (val 0x) + return false; Why is this necessary? To avoid an integer overflow. Admittedly, this particular overflow won't cause a crash, but it will cause incorrect results. + + word = AUDIT_WORD(val); + if (word = AUDIT_BITMASK_SIZE) + return false; Since this covers it and it extensible... + + bit = AUDIT_BIT(val); + + return rule-mask[word] bit; Returning an int as a bool creates worse code than just returning the int. (although in this case if the compiler chooses to inline it might be smart enough not to actually convert this int to a bool and make worse assembly...) I'd suggest the function here return an int. bools usually make the assembly worse... I'm ambivalent. The right assembly would use flags on x86, not an int or a bool, and I don't know what the compiler will do. --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH v2 2/2] audit: Mark CONFIG_AUDITSYSCALL BROKEN and update help text
On Wed, May 28, 2014 at 7:09 PM, Eric Paris epa...@redhat.com wrote: NAK On Wed, 2014-05-28 at 18:44 -0700, Andy Lutomirski wrote: Here are some issues with the code: - It thinks that syscalls have four arguments. Not true at all. It records the registers that would hold the first 4 entries on syscall entry, for use later if needed, as getting those later on some arches is not feasible (see ia64). It makes no assumption about how many syscalls a function has. What about a5 and a6? - It's a performance disaster. Only if you enable it. If you don't use audit it is a single branch. Hardly a disaster. It forces all syscalls into the slow path and it can do crazy things like building audit contexts just in case actual handling of the syscall triggers an audit condition so that the exit path can log the syscall. That's way worse than a single branch. Try it: benchmark getpid on Fedora and then repeat the experiment with syscall auditing fully disabled. The difference is striking. - It assumes that syscall numbers are between 0 and 2048. There could well be a bug here. Not questioning that. Although that would be patch 1/2 Even with patch 1, it still doesn't handle large syscall numbers -- it just assumes they're not audited. - It's unclear whether it's supposed to be reliable. Unclear to whom? To me. If some inode access or selinux rule triggers an audit, is the auditsc code guaranteed to write an exit record? And see below... - It's broken on things like x32. - It can't support ARM OABI. Some arches aren't supported? And that makes it BROKEN? It acts like x32 is supported. OABI is fixable. Admittedly, OABI not being supported is fine, now that it's correctly disabled. - Its approach to freeing memory is terrifying. What? None of your reasons hold water. Bugs need to be fixed. Try reporting them... This is just stupid. ...for reference, I've *tried* to fix the performance issues. I've run into all kinds of problems. The actual context code is incredibly tangled. It's unclear whether it would be permissible for various rule combinations to suppress exit records triggered by selinux. Any effort to actually deal with this stuff will have to deal with the fact that the audit code builds up lots of state and frees it on syscall exit. That means that the exit hook needs to be actually invoked, otherwise we leak. I was never able to convince myself that the current code is correct wrt kernel threads. In summary, the code is a giant mess. The way it works is nearly incomprehensible. It contains at least one severe bug. I'd love to see it fixed, but for now, distributions seem to think that enabling CONFIG_AUDITSYSCALL is a reasonable thing to do, and I'd argue that it's actually a terrible choice for anyone who doesn't actually need syscall audit rules. And I don't know who needs these things. I've heard an argument that selinux benefits from this, since the syscall exit log helps with diagnosing denials. I think that's absurd. I've never found anything wrong with the denial record itself that would be helped by seeing the syscall log. (And there's always syscall_get_xyz, which would cover this case splendidly with about an order of magnitude less code.) As I said, I'm not going to push hard for this code to be marked BROKEN. But I think it's rather broken, and I'm definitely not volunteering to fix it. --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH v2 2/2] audit: Mark CONFIG_AUDITSYSCALL BROKEN and update help text
On Thu, May 29, 2014 at 6:05 AM, Steve Grubb sgr...@redhat.com wrote: On Wednesday, May 28, 2014 07:40:57 PM Andy Lutomirski wrote: - It assumes that syscall numbers are between 0 and 2048. There could well be a bug here. Not questioning that. Although that would be patch 1/2 Even with patch 1, it still doesn't handle large syscall numbers -- it just assumes they're not audited. All syscalls must be auditable. Meaning that if an arch goes above 2048, then we'll need to do some math to get it to fall back within the range. Off the top of my head, x32, some ARM variants, and some MIPS variants have syscalls above 2048. auditsc has been disabled on the offending ARM variant (because I noticed that the same issue affects seccomp, and no one particularly wants to fix it), but I suspect that auditsc is enabled but completely broken on whichever MIPS ABI has the issue. I haven't checked. TBH, I think that it's silly to let the auditsc design be heavily constrained by ia64 considerations -- I still think that the syscall entry hooks could be removed completely with some effort on most architectures and that performance would improve considerably. For users who don't have syscall filter rules, performance might even improve on ia64 -- I don't care how expensive syscall_get_args is, the actual printk/auditd thing will dominate in cases where anything is written. I wonder whether the syscall restart mechanism can be used to thoroughly confused auditsc. I don't really know how syscall restarts work. My point is: I don't know what guarantees auditsc is supposed to provide, nor do I know how to evaluate whether the code is correct. For users who aren't bound by Common Criteria or related things, I suspect that syscall auditing (as opposed to, say, LSM-based auditing) will provide dubious value at best. Keep in mind that many syscalls take pointer arguments, and the auditsc code can't really do anything sensible with them. - It's unclear whether it's supposed to be reliable. Unclear to whom? To me. If some inode access or selinux rule triggers an audit, is the auditsc code guaranteed to write an exit record? And see below... It should or indicate that it could not. --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH v2 2/2] audit: Mark CONFIG_AUDITSYSCALL BROKEN and update help text
On Thu, May 29, 2014 at 9:25 AM, Steve Grubb sgr...@redhat.com wrote: On Thursday, May 29, 2014 09:04:10 AM Andy Lutomirski wrote: On Thu, May 29, 2014 at 6:05 AM, Steve Grubb sgr...@redhat.com wrote: On Wednesday, May 28, 2014 07:40:57 PM Andy Lutomirski wrote: - It assumes that syscall numbers are between 0 and 2048. There could well be a bug here. Not questioning that. Although that would be patch 1/2 Even with patch 1, it still doesn't handle large syscall numbers -- it just assumes they're not audited. All syscalls must be auditable. Meaning that if an arch goes above 2048, then we'll need to do some math to get it to fall back within the range. Off the top of my head, x32, some ARM variants, and some MIPS variants have syscalls above 2048. That could be fixed by putting a subtraction in place to get the bit mask to map correctly. User space audit rules would need to fix that also. auditsc has been disabled on the offending ARM variant (because I noticed that the same issue affects seccomp, and no one particularly wants to fix it), but I suspect that auditsc is enabled but completely broken on whichever MIPS ABI has the issue. I haven't checked. TBH, I think that it's silly to let the auditsc design be heavily constrained by ia64 considerations -- I still think that the syscall entry hooks could be removed completely with some effort on most architectures and that performance would improve considerably. For users who don't have syscall filter rules, performance might even improve on ia64 -- I don't care how expensive syscall_get_args is, the actual printk/auditd thing will dominate in cases where anything is written. The last time I heard of benchmarking being done, audit's performance hit was negligible. That was about 4 years ago and there has been a whole lot of code churn since then. But it should in general be the cost of an 'if' statement unless auditing is enabled. If it is enabled, then you necessarily get the full treatment in case you trigger an event. The actual rule matching is reasonably inexpensive, but the killer is the fact that every system call is forced into the slow path. Without audit, this cost is paid by seccomp users and tasks that are being ptraced. With audit, every single syscall for every single task pays the full cost of a slow path system call. This is extra sad because, AFAICS, most of the syscall entry audit code shouldn't be necessary even for audit users. For users who aren't bound by Common Criteria or related things, I suspect that syscall auditing (as opposed to, say, LSM-based auditing) will provide dubious value at best. No, it works pretty good. You can see who is accessing what files pretty clearly. Right, but I don't think it's the *syscall* part that's really necessary here. The fact that someone opened /etc/passwd is interesting. The fact that someone sys_opened (const char *)0x123456 is not. Keep in mind that many syscalls take pointer arguments, and the auditsc code can't really do anything sensible with them. All security relevant information is collected in auxiliary records if they are not directly readable as a syscall argument. This is a basic requirement. We are not required to capture all possible arguments. If you know of any security relevant parameters not captured, please let us know. a5 and a6. The cmsg data in sendmsg (which is very security-relevant -- think of SCM_RIGHTS and SCM_CREDENTIALS). The destination for sendto. The whole payload of netlink messages. I don't know how many of these are actually captures. Plus, except for things like setuid and sigreturn, capturing parameters on entry shouldn't be needed anyway. Capturing at log time should be fine. If this magic capturing only happened for users who need it for compliance reasons, that would be one thing. But it happens for every single system call on Fedora. --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [ARCH question] Do syscall_get_nr and syscall_get_arguments always work?
On Tue, Feb 18, 2014 at 11:39 AM, Eric Paris epa...@redhat.com wrote: On Fri, 2014-02-07 at 08:40 -0800, Andy Lutomirski wrote: On Fri, Feb 7, 2014 at 4:58 AM, Jonas Bonn jonas.b...@gmail.com wrote: Hi Andy, On 5 February 2014 00:50, Andy Lutomirski l...@amacapital.net wrote: I can't even find the system call entry point on mips. Is there a semi-official answer here? I don't have an official answer for you, but when I wanted to do something with these entry points a couple of years back I discovered that they aren't very thoroughly implemented across the various architectures. I started cleaning this up and can probably dig up some of this for you if you need it. The syscall_get_xyz functions are certainly implemented and functional in all relevant architectures -- the audit code is already using them. The thing I'm uncertain about is whether they are usable with no syscall slow path bits set. I guess that, if the syscall restart logic needs to read the argument registers, then they're probably reliably saved... Al just indicated to me that on at least ia64, syscall_get_arguments() is really expensive. So maybe not a deal breaker, but sounds like we'd lose a lot of performance trying to get them at syscall exit... I wonder how slow syscall_get_arguments has to be before it's a real problem. Remember that we only need to call it when we already know that an audit record needs to be written (or if a syscall argument is used in a filter rule, I suppose -- I'm sure sure whether that's possible). As long as syscall_get_nr is fast, which I think it is on ia64, then the tradeoff might still be a win. But I think this is still a bit of a lost cause. Currently, if I'm reading the code correctly, signal delivery to a non-auditd process can result in writing an audit event. If the signal is delivered during a syscall, then current code will write an audit record for that syscall on syscall exit. If we want to preserve that behavior without a syscall audit hook, then the signal delivery code needs to know whether it's in the middle of a syscall. AFAIK this is not possible. On the other hand, most interesting signals are probably *not* the result of a syscall anyway, so it may make sense to just remove that code entirely. TBH, as long as something happens to get rid of audit overhead when there are no rules, my interest in personally writing something fancy to make the nonzero-number-of-rules case have less overhead is rather low. --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH v3] audit: Turn off TIF_SYSCALL_AUDIT when there are no rules
On Mon, Feb 10, 2014 at 8:57 AM, Oleg Nesterov o...@redhat.com wrote: On 02/08, Andy Lutomirski wrote: +void audit_inc_n_rules() +{ + struct task_struct *p, *t; + + read_lock(tasklist_lock); + audit_n_rules++; + smp_wmb(); + if (audit_n_rules == 1) { + /* + * We now have a rule; we need to hook syscall entry. + */ + for_each_process_thread(p, t) { + if (t-audit_context) + set_tsk_thread_flag(t, TIF_SYSCALL_AUDIT); + } + } + read_unlock(tasklist_lock); +} + +void audit_dec_n_rules() +{ + read_lock(tasklist_lock); + --audit_n_rules; + BUG_ON(audit_n_rules 0); + + /* + * If audit_n_rules == 0, then __audit_syscall_exit will clear + * TIF_SYSCALL_AUDIT. + */ + + read_unlock(tasklist_lock); +} To be honest, I do not understand why _dec_ takes tasklist_lock... And why _inc_ increments audit_n_rules under tasklist. Bah, incorrect leftover from last time. @@ -1528,6 +1562,25 @@ void __audit_syscall_exit(int success, long return_code) context-filterkey = NULL; } tsk-audit_context = context; + + if (ACCESS_ONCE(audit_n_rules) == 0) { + /* + * Either this is the very first syscall by this process or + * audit_dec_n_rules recently set audit_n_rules to zero. + */ + smp_rmb(); rmb() looks wrong, we need mb() to serialize ACCESS_ONCE() and clear_tsk_thread_flag(). I clearly need to review the rules. I think you're right, though -- no barrier should be needed. But, otoh, I think we do not need any barrier at all, we can rely on control dependency. See the recent 18c03c61444a21 Documentation/ memory-barriers.txt: Prohibit speculative writes. + /* audit_inc_n_rules could increment audit_n_rules here... */ + + clear_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT); + + smp_rmb(); Again, I guess this should be mb() or smp_mb__after_clear_bit(). And I still think this needs more changes. Once again, I do not think that, say, __audit_log_bprm_fcaps() should populate context-aux if !TIF_SYSCALL_AUDIT, this list can grow indefinitely. Or __audit_signal_info()... Perhaps __audit_syscall_exit() should also set context-dummy? That would work. I'm still torn between trying to make it possible for things like __audit_log_bprm_fcaps to start a syscall audit record in the middle of a syscall or to just try to tighten up the current approach to the point where it will work correctly. Grr. Why is all this crap tied up with syscall auditing anyway? ISTM it would have been a lot nicer if audit calls just immediately emitted audit records, completely independently of the syscall machinery. --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
[PATCH v3] audit: Turn off TIF_SYSCALL_AUDIT when there are no rules
This toggles TIF_SYSCALL_AUDIT as needed when rules change instead of leaving it set whenever rules might be set in the future. This reduces syscall latency from 60ns to closer to 40ns on my laptop. This code is a little bit tricky. It's not safe to turn TIF_SYSCALL_AUDIT off during a syscall that is being audited. So, with this change, the only thing that ever turns TIF_SYSCALL_AUDIT off after a process has been created is __audit_syscall_exit. This has the somewhat odd side-effect that, unless there's a 'task,never' rule, every task's first syscall will use the slow path. This should be more or less unnoticable (what's another 20-40ns at task creation time between friends?). There is a user-visible effect of this change: if there are no audit rules, then events like AVC denials will not result in a syscall audit record being written. I'm happy to accept this minor loss in debuggability in exchange for a massive speedup of all system calls on default distribution configurations. A better solution would be to ditch the syscall entry hook entirely and use the syscall_get_xyz calls as needed to fill in the context. This could be done on top of this patch, but it would be more code. Cc: Oleg Nesterov o...@redhat.com Cc: Steve Grubb sgr...@redhat.com Cc: Eric Paris epa...@redhat.com Signed-off-by: Andy Lutomirski l...@amacapital.net --- Changes from v2 (actually v2.1): - Use for_each_process_thread instead of do_each_thread :) - Turn off TIF_SYSCALL_AUDIT lazily, thus hopefully avoiding all of the nasty cases that Oleg noticed. Changes from v1: - For new tasks, set flags in a new audit_sync_flags callback instead of in audit_alloc (thanks, Oleg). - Rework locking. - Use irqsave/irqrestore to avoid having to think about who else might have taken spinlocks. include/linux/audit.h | 8 ++-- kernel/auditfilter.c | 4 ++-- kernel/auditsc.c | 57 +-- 3 files changed, 63 insertions(+), 6 deletions(-) diff --git a/include/linux/audit.h b/include/linux/audit.h index aa865a9..ab00ffb 100644 --- a/include/linux/audit.h +++ b/include/linux/audit.h @@ -298,7 +298,8 @@ static inline void audit_mmap_fd(int fd, int flags) __audit_mmap_fd(fd, flags); } -extern int audit_n_rules; +extern void audit_inc_n_rules(void); +extern void audit_dec_n_rules(void); extern int audit_signals; #else /* CONFIG_AUDITSYSCALL */ static inline int audit_alloc(struct task_struct *task) @@ -404,7 +405,10 @@ static inline void audit_mmap_fd(int fd, int flags) { } static inline void audit_ptrace(struct task_struct *t) { } -#define audit_n_rules 0 +static inline void audit_inc_n_rules(void) +{ } +static inline void audit_dec_n_rules(void) +{ } #define audit_signals 0 #endif /* CONFIG_AUDITSYSCALL */ diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c index 14a78cc..4c7054b 100644 --- a/kernel/auditfilter.c +++ b/kernel/auditfilter.c @@ -903,7 +903,7 @@ static inline int audit_add_rule(struct audit_entry *entry) } #ifdef CONFIG_AUDITSYSCALL if (!dont_count) - audit_n_rules++; + audit_inc_n_rules(); if (!audit_match_signal(entry)) audit_signals++; @@ -955,7 +955,7 @@ static inline int audit_del_rule(struct audit_entry *entry) #ifdef CONFIG_AUDITSYSCALL if (!dont_count) - audit_n_rules--; + audit_dec_n_rules(); if (!audit_match_signal(entry)) audit_signals--; diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 7aef2f4..d76947c 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -80,7 +80,7 @@ #define MAX_EXECVE_AUDIT_LEN 7500 /* number of audit rules */ -int audit_n_rules; +static int audit_n_rules; /* determines whether we collect data for signals sent */ int audit_signals; @@ -911,6 +911,39 @@ static inline struct audit_context *audit_alloc_context(enum audit_state state) return context; } +void audit_inc_n_rules() +{ + struct task_struct *p, *t; + + read_lock(tasklist_lock); + audit_n_rules++; + smp_wmb(); + if (audit_n_rules == 1) { + /* +* We now have a rule; we need to hook syscall entry. +*/ + for_each_process_thread(p, t) { + if (t-audit_context) + set_tsk_thread_flag(t, TIF_SYSCALL_AUDIT); + } + } + read_unlock(tasklist_lock); +} + +void audit_dec_n_rules() +{ + read_lock(tasklist_lock); + --audit_n_rules; + BUG_ON(audit_n_rules 0); + + /* +* If audit_n_rules == 0, then __audit_syscall_exit will clear +* TIF_SYSCALL_AUDIT. +*/ + + read_unlock(tasklist_lock); +} + /** * audit_alloc - allocate an audit context block for a task * @tsk: task @@ -938,11 +971,12 @@ int audit_alloc(struct task_struct *tsk) if (!(context
[ARCH question] Do syscall_get_nr and syscall_get_arguments always work?
On Tue, Feb 4, 2014 at 11:32 AM, Andy Lutomirski l...@amacapital.net wrote: Now we get rid of __audit_syscall_entry. (This speeds up even the auditing-is-on case.) Instead we have __audit_start_record, which does more or less the same thing, except that (a) it doesn't BUG if in_syscall and (b) it *sets* TIF_SYSCALL_AUDIT. This relies on the fact that syscall_get_nr and syscall_get_arguments are reliable on x86_64. I suspect that they're reliable everywhere else, too. The idea is that there's nothing wrong with calling __audit_start_record more than once. (Maybe it should be called __audit_record_this_syscall.) I'd like to make a change that can result in syscall_get_nr and syscall_get_arguments being called (on current and task_pt_regs(current)) from any system call (as opposed to being called only from the audit/trace slowpaths). Is this safe? Here's my somewhat clueless analysis: On x86_64, I've tested it, and it works. The entry code saves all of the argument registers, even in the fast path. i386 and ia32_compat look okay, too. If stmiasp, {r0 - r12}@ Calling r0 - r12 does what I think it does, then arm should be okay. I'm totally guessing here, but e10_sync on aarch64 seems to save enough registers. I admit to being a little bit surprised, though -- aarch64 is new, and if I were designing an ABI, I specify that syscalls *don't* save registers. ia64 has a comment in ivt.S that streamlined syscalls save nr in r15. The rest come from unwind info (!). I assume this has something to do with the magic ia64 register rotation thing. I have no idea what happens if there's a NaT in an argument register. I can't even find the system call entry point on mips. Is there a semi-official answer here? --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH v2.1] audit: Only use the syscall slowpath when syscall audit rules exist
On Tue, Feb 4, 2014 at 8:50 AM, Oleg Nesterov o...@redhat.com wrote: On 02/03, Andy Lutomirski wrote: +void audit_inc_n_rules() +{ + struct task_struct *p, *g; + unsigned long flags; + + read_lock_irqsave(tasklist_lock, flags); Confused... read_lock(tasklist) doesn't need to disable irqs. (ftrace does this for no reason too, perhaps I should resend the patch) Is this because there are no interrupt handlers that write_lock(tasklist)? + if (audit_n_rules++ == 0) { probably this can be done outside of read_lock? I don't think so. I'm cheating and using the tasklist_lock to prevent audit_sync_flags from racing against this increment, so this needs to be inside the lock. I'll add a comment. + do_each_thread(g, p) { for_each_process_thread ;) do_each_thread will die, I hope. Sorry, forgot to mention: where is this mythical for_each_process_thread? Either it only exists in a kernel version I'm not looking at, my grepping skills aren't up to the task, or you just hate do_each_thread so much that you imagined up an alternative :) +void audit_dec_n_rules() +{ + struct task_struct *p, *g; + unsigned long flags; + + read_lock_irqsave(tasklist_lock, flags); + + --audit_n_rules; + BUG_ON(audit_n_rules 0); + + if (audit_n_rules == 0) { + do_each_thread(g, p) { + clear_tsk_thread_flag(p, TIF_SYSCALL_AUDIT); + } while_each_thread(g, p); + } The same, and... On a second thought it seems that audit_dec_n_rules() has a problem. Note the BUG_ON(context-in_syscall) in __audit_syscall_entry(). Suppose that audit_dec_n_rules() clears TIF_SYSCALL_AUDIT when a task runs a syscall. In this case (afaics) __audit_syscall_exit() won't be called. The next audit_inc_n_rules() can set TIF_SYSCALL_AUDIT and trigger another __audit_syscall_entry() which will hit this BUG_ON(). Egads! And in general it doesn't look safe although I know almost nothing about audit. I mean, currently __audit_syscall_entry() or __audit_log_bprm_fcaps() assume that __audit_syscall_exit() or __audit_free() will cleanup -audit_context, perhaps we should not break the rules? Once again, I do not pretend I understand this code, this is the question, not the comment. But if I am right, then TIF_SYSCALL_AUDIT should be cleared in __audit_syscall_exit() as you suggested before. I think I'll wait for Eric to chime in. I suspect that the real solution is to simplify all this stuff by relying on the fact that the syscall nr and args are saved by the (fast path and slow path) entry code, so the audit entry hook may be entirely unnecessary. Or maybe a new TIF flag would be needed to make it work. Anyway, *grumble*. My only real interest in this stuff is to get rid of the overhead. I don't actually want syscall auditing on any of my boxes (in contrast to avc auditing, which is useful but (mostly) independent). --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Re: [PATCH v2.1] audit: Only use the syscall slowpath when syscall audit rules exist
On Tue, Feb 4, 2014 at 11:11 AM, Oleg Nesterov o...@redhat.com wrote: On 02/04, Andy Lutomirski wrote: On Tue, Feb 4, 2014 at 8:50 AM, Oleg Nesterov o...@redhat.com wrote: On 02/03, Andy Lutomirski wrote: Sorry, forgot to mention: where is this mythical for_each_process_thread? In Linus's tree, please see 0c740d0afc3bff. or you just hate do_each_thread so much that you imagined up an alternative :) sort of ;) Aha -- it probably got merged just after I pulled Linus' tree and looked for it. Or something. I think I'll wait for Eric to chime in. I suspect that the real solution is to simplify all this stuff by relying on the fact that the syscall nr and args are saved by the (fast path and slow path) entry code, so the audit entry hook may be entirely unnecessary. Perhaps... but even in this case we need to do something with, say, __audit_log_bprm_fcaps(). At least this list should not grow indefinitely if the task skips __audit_syscall_exit(). Although at first glance this can be probably cleanuped too. OK, here's a thought: let's change the semantics of TIF_SYSCALL_AUDIT. New semantics: TIF_SYSCALL_AUDIT is set if (the task is eligible for syscall auditing and n_rules != 0 *or* something has started writing an audit record). TIF_SYSCALL_AUDIT is *never* cleared by anything other than copy_process or __audit_syscall_exit. This means that, if there's a pending audit record, then TIF_SYSCALL_AUDIT will be set. That, in turn, means that __audit_syscall_exit will be called, which avoids the BUGs you pointed out. Now we get rid of __audit_syscall_entry. (This speeds up even the auditing-is-on case.) Instead we have __audit_start_record, which does more or less the same thing, except that (a) it doesn't BUG if in_syscall and (b) it *sets* TIF_SYSCALL_AUDIT. This relies on the fact that syscall_get_nr and syscall_get_arguments are reliable on x86_64. I suspect that they're reliable everywhere else, too. The idea is that there's nothing wrong with calling __audit_start_record more than once. (Maybe it should be called __audit_record_this_syscall.) To finish the job, we change __audit_syscall_exit to clear TIF_SYSCALL_AUDIT if n_rules==0. inc_n_rules can set TIF_SYSCALL_AUDIT (without even needing to worry about races, I think). dec_n_rules doesn't need to do anything special. Benefits: - Removing the syscall entry hook speeds everyone up. Even silly people who use exit rules :) - There's no need to think about mismatched entry/exit hook calls, since there is no entry hook. - The same mechanism could be reused for non-audit purposes. Any code could say, at any point hey, this syscall is interesting. let's record it. Disadvantages: - Need to check other architectures to make sure that syscall_get_xyz works reliably for fast path syscalls. - For the full performance boost, all architectures need to avoid checking TIF_SYSCALL_AUDIT in the entry path. I prefer not to mess with non-x86 assembly, so I won't do that part, since it's not required for correctness. Eric, any thoughts? --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
Why is syscall auditing on with no rules?
On a stock Fedora installation: $ sudo auditctl -l No rules Nonetheless TIF_SYSCALL_AUDIT is set and the __audit_syscall_entry and __audit_syscall_exit account for 20% of syscall overhead according to perf. This sucks. Unless I'm missing something, syscall auditing is *off*. How hard would it be to arrange for TIF_SYSCALL_AUDIT to be cleared when there are no syscall rules? (This is extra bad in kernels before 3.13, where the clear call for TIF_SYSCALL_AUDIT was completely missing.) --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit
[PATCH] audit: Only use the syscall slowpath when syscall audit rules exist
This toggles TIF_SYSCALL_AUDIT as needed when rules change instead of leaving it set whenever rules might be set in the future. This reduces syscall latency from 60ns to closer to 40ns on my laptop. Cc: Oleg Nesterov o...@redhat.com Cc: Steve Grubb sgr...@redhat.com Cc: Eric Paris epa...@redhat.com Signed-off-by: Andy Lutomirski l...@amacapital.net --- This is not the best-tested patch in the world. Someone who actually knows how to use syscall auditing should probably give it a spin. It fixes an IMO huge performance issue, though. include/linux/audit.h | 9 +-- kernel/auditfilter.c | 4 ++-- kernel/auditsc.c | 65 --- 3 files changed, 71 insertions(+), 7 deletions(-) diff --git a/include/linux/audit.h b/include/linux/audit.h index a406419..534ba85 100644 --- a/include/linux/audit.h +++ b/include/linux/audit.h @@ -298,7 +298,9 @@ static inline void audit_mmap_fd(int fd, int flags) __audit_mmap_fd(fd, flags); } -extern int audit_n_rules; +extern void audit_inc_n_rules(void); +extern void audit_dec_n_rules(void); + extern int audit_signals; #else /* CONFIG_AUDITSYSCALL */ static inline int audit_alloc(struct task_struct *task) @@ -404,7 +406,10 @@ static inline void audit_mmap_fd(int fd, int flags) { } static inline void audit_ptrace(struct task_struct *t) { } -#define audit_n_rules 0 +static inline void audit_inc_n_rules(void) +{ } +static inline void audit_dec_n_rules(void) +{ } #define audit_signals 0 #endif /* CONFIG_AUDITSYSCALL */ diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c index 51f3fd4..0ce531d 100644 --- a/kernel/auditfilter.c +++ b/kernel/auditfilter.c @@ -903,7 +903,7 @@ static inline int audit_add_rule(struct audit_entry *entry) } #ifdef CONFIG_AUDITSYSCALL if (!dont_count) - audit_n_rules++; + audit_inc_n_rules(); if (!audit_match_signal(entry)) audit_signals++; @@ -955,7 +955,7 @@ static inline int audit_del_rule(struct audit_entry *entry) #ifdef CONFIG_AUDITSYSCALL if (!dont_count) - audit_n_rules--; + audit_dec_n_rules(); if (!audit_match_signal(entry)) audit_signals--; diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 90594c9..363d0bb 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -79,8 +79,15 @@ /* no execve audit message should be longer than this (userspace limits) */ #define MAX_EXECVE_AUDIT_LEN 7500 -/* number of audit rules */ -int audit_n_rules; +/* + * number of audit rules + * + * n_rules_lock protects audit_n_rules. It also prevents audit_alloc from + * racing against changes to audit_n_rules: to change someone else's + * audit_context or TIF_SYSCALL_AUDIT, you need write_lock(n_rules_lock). + */ +static DEFINE_RWLOCK(n_rules_lock); +static int audit_n_rules; /* determines whether we collect data for signals sent */ int audit_signals; @@ -911,6 +918,47 @@ static inline struct audit_context *audit_alloc_context(enum audit_state state) return context; } +void audit_inc_n_rules() +{ + struct task_struct *p, *g; + + write_lock(n_rules_lock); + + if (audit_n_rules++ != 0) + goto out; /* The overall state isn't changing. */ + + read_lock(tasklist_lock); + do_each_thread(g, p) { + if (p-audit_context) + set_tsk_thread_flag(p, TIF_SYSCALL_AUDIT); + } while_each_thread(g, p); + read_unlock(tasklist_lock); + +out: + write_unlock(n_rules_lock); +} + +void audit_dec_n_rules() +{ + struct task_struct *p, *g; + + write_lock(n_rules_lock); + --audit_n_rules; + BUG_ON(audit_n_rules 0); + + if (audit_n_rules != 0) + goto out; /* The overall state isn't changing. */ + + read_lock(tasklist_lock); + do_each_thread(g, p) { + clear_tsk_thread_flag(p, TIF_SYSCALL_AUDIT); + } while_each_thread(g, p); + read_unlock(tasklist_lock); + +out: + write_unlock(n_rules_lock); +} + /** * audit_alloc - allocate an audit context block for a task * @tsk: task @@ -931,6 +979,11 @@ int audit_alloc(struct task_struct *tsk) state = audit_filter_task(tsk, key); if (state == AUDIT_DISABLED) { + /* +* There is no relevant race against inc/dec_n_rules +* here -- inc won't set TIF_SYSCALL_AUDIT, and, if dec +* clears it redundantly, there's no harm done. +*/ clear_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT); return 0; } @@ -942,8 +995,14 @@ int audit_alloc(struct task_struct *tsk) } context-filterkey = key; + read_lock(n_rules_lock); tsk-audit_context = context; - set_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT); + if (audit_n_rules) + set_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT
[PATCH v2.1] audit: Only use the syscall slowpath when syscall audit rules exist
This toggles TIF_SYSCALL_AUDIT as needed when rules change instead of leaving it set whenever rules might be set in the future. This reduces syscall latency from 60ns to closer to 40ns on my laptop. Cc: Oleg Nesterov o...@redhat.com Cc: Steve Grubb sgr...@redhat.com Cc: Eric Paris epa...@redhat.com Signed-off-by: Andy Lutomirski l...@amacapital.net --- This brown paper bag release is brought to you by git commit's -a flag. Changes from v2: Contains the correct patch Changes from v1: - For new tasks, set flags in a new audit_sync_flags callback instead of in audit_alloc (thanks, Oleg). - Rework locking. - Use irqsave/irqrestore to avoid having to think about who else might have taken spinlocks. include/linux/audit.h | 11 -- kernel/auditfilter.c | 4 ++-- kernel/auditsc.c | 57 +-- kernel/fork.c | 3 +++ 4 files changed, 65 insertions(+), 10 deletions(-) diff --git a/include/linux/audit.h b/include/linux/audit.h index a406419..a81f498 100644 --- a/include/linux/audit.h +++ b/include/linux/audit.h @@ -298,7 +298,9 @@ static inline void audit_mmap_fd(int fd, int flags) __audit_mmap_fd(fd, flags); } -extern int audit_n_rules; +extern void audit_inc_n_rules(void); +extern void audit_dec_n_rules(void); +extern void audit_sync_flags(struct task_struct *tsk); extern int audit_signals; #else /* CONFIG_AUDITSYSCALL */ static inline int audit_alloc(struct task_struct *task) @@ -404,7 +406,12 @@ static inline void audit_mmap_fd(int fd, int flags) { } static inline void audit_ptrace(struct task_struct *t) { } -#define audit_n_rules 0 +static inline void audit_inc_n_rules(void) +{ } +static inline void audit_dec_n_rules(void) +{ } +static inline void audit_sync_flags(struct task_struct *tsk) +{ } #define audit_signals 0 #endif /* CONFIG_AUDITSYSCALL */ diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c index 51f3fd4..0ce531d 100644 --- a/kernel/auditfilter.c +++ b/kernel/auditfilter.c @@ -903,7 +903,7 @@ static inline int audit_add_rule(struct audit_entry *entry) } #ifdef CONFIG_AUDITSYSCALL if (!dont_count) - audit_n_rules++; + audit_inc_n_rules(); if (!audit_match_signal(entry)) audit_signals++; @@ -955,7 +955,7 @@ static inline int audit_del_rule(struct audit_entry *entry) #ifdef CONFIG_AUDITSYSCALL if (!dont_count) - audit_n_rules--; + audit_dec_n_rules(); if (!audit_match_signal(entry)) audit_signals--; diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 90594c9..cd44c88 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -79,8 +79,13 @@ /* no execve audit message should be longer than this (userspace limits) */ #define MAX_EXECVE_AUDIT_LEN 7500 -/* number of audit rules */ -int audit_n_rules; +/* + * number of audit rules + * + * To change this, you must hold audit_filter_mutex *and* have a read lock + * on tasklist_lock. + */ +static int audit_n_rules; /* determines whether we collect data for signals sent */ int audit_signals; @@ -911,6 +916,40 @@ static inline struct audit_context *audit_alloc_context(enum audit_state state) return context; } +void audit_inc_n_rules() +{ + struct task_struct *p, *g; + unsigned long flags; + + read_lock_irqsave(tasklist_lock, flags); + if (audit_n_rules++ == 0) { + do_each_thread(g, p) { + if (p-audit_context) + set_tsk_thread_flag(p, TIF_SYSCALL_AUDIT); + } while_each_thread(g, p); + } + read_unlock_irqrestore(tasklist_lock, flags); +} + +void audit_dec_n_rules() +{ + struct task_struct *p, *g; + unsigned long flags; + + read_lock_irqsave(tasklist_lock, flags); + + --audit_n_rules; + BUG_ON(audit_n_rules 0); + + if (audit_n_rules == 0) { + do_each_thread(g, p) { + clear_tsk_thread_flag(p, TIF_SYSCALL_AUDIT); + } while_each_thread(g, p); + } + + read_unlock_irqrestore(tasklist_lock, flags); +} + /** * audit_alloc - allocate an audit context block for a task * @tsk: task @@ -930,10 +969,8 @@ int audit_alloc(struct task_struct *tsk) return 0; /* Return if not auditing. */ state = audit_filter_task(tsk, key); - if (state == AUDIT_DISABLED) { - clear_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT); + if (state == AUDIT_DISABLED) return 0; - } if (!(context = audit_alloc_context(state))) { kfree(key); @@ -943,10 +980,18 @@ int audit_alloc(struct task_struct *tsk) context-filterkey = key; tsk-audit_context = context; - set_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT); return 0; } +void audit_sync_flags(struct task_struct *tsk) +{ + /* The caller has a write
[PATCH v2] audit: Only use the syscall slowpath when syscall audit rules exist
This toggles TIF_SYSCALL_AUDIT as needed when rules change instead of leaving it set whenever rules might be set in the future. This reduces syscall latency from 60ns to closer to 40ns on my laptop. Cc: Oleg Nesterov o...@redhat.com Cc: Steve Grubb sgr...@redhat.com Cc: Eric Paris epa...@redhat.com Signed-off-by: Andy Lutomirski l...@amacapital.net --- Changes from v1: - For new tasks, set flags in a new audit_sync_flags callback instead of in audit_alloc (thanks, Oleg). - Rework locking. - Use irqsave/irqrestore to avoid having to think about who else might have taken spinlocks. include/linux/audit.h | 9 +-- kernel/auditfilter.c | 4 ++-- kernel/auditsc.c | 65 --- 3 files changed, 71 insertions(+), 7 deletions(-) diff --git a/include/linux/audit.h b/include/linux/audit.h index a406419..534ba85 100644 --- a/include/linux/audit.h +++ b/include/linux/audit.h @@ -298,7 +298,9 @@ static inline void audit_mmap_fd(int fd, int flags) __audit_mmap_fd(fd, flags); } -extern int audit_n_rules; +extern void audit_inc_n_rules(void); +extern void audit_dec_n_rules(void); + extern int audit_signals; #else /* CONFIG_AUDITSYSCALL */ static inline int audit_alloc(struct task_struct *task) @@ -404,7 +406,10 @@ static inline void audit_mmap_fd(int fd, int flags) { } static inline void audit_ptrace(struct task_struct *t) { } -#define audit_n_rules 0 +static inline void audit_inc_n_rules(void) +{ } +static inline void audit_dec_n_rules(void) +{ } #define audit_signals 0 #endif /* CONFIG_AUDITSYSCALL */ diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c index 51f3fd4..0ce531d 100644 --- a/kernel/auditfilter.c +++ b/kernel/auditfilter.c @@ -903,7 +903,7 @@ static inline int audit_add_rule(struct audit_entry *entry) } #ifdef CONFIG_AUDITSYSCALL if (!dont_count) - audit_n_rules++; + audit_inc_n_rules(); if (!audit_match_signal(entry)) audit_signals++; @@ -955,7 +955,7 @@ static inline int audit_del_rule(struct audit_entry *entry) #ifdef CONFIG_AUDITSYSCALL if (!dont_count) - audit_n_rules--; + audit_dec_n_rules(); if (!audit_match_signal(entry)) audit_signals--; diff --git a/kernel/auditsc.c b/kernel/auditsc.c index 90594c9..363d0bb 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -79,8 +79,15 @@ /* no execve audit message should be longer than this (userspace limits) */ #define MAX_EXECVE_AUDIT_LEN 7500 -/* number of audit rules */ -int audit_n_rules; +/* + * number of audit rules + * + * n_rules_lock protects audit_n_rules. It also prevents audit_alloc from + * racing against changes to audit_n_rules: to change someone else's + * audit_context or TIF_SYSCALL_AUDIT, you need write_lock(n_rules_lock). + */ +static DEFINE_RWLOCK(n_rules_lock); +static int audit_n_rules; /* determines whether we collect data for signals sent */ int audit_signals; @@ -911,6 +918,47 @@ static inline struct audit_context *audit_alloc_context(enum audit_state state) return context; } +void audit_inc_n_rules() +{ + struct task_struct *p, *g; + + write_lock(n_rules_lock); + + if (audit_n_rules++ != 0) + goto out; /* The overall state isn't changing. */ + + read_lock(tasklist_lock); + do_each_thread(g, p) { + if (p-audit_context) + set_tsk_thread_flag(p, TIF_SYSCALL_AUDIT); + } while_each_thread(g, p); + read_unlock(tasklist_lock); + +out: + write_unlock(n_rules_lock); +} + +void audit_dec_n_rules() +{ + struct task_struct *p, *g; + + write_lock(n_rules_lock); + --audit_n_rules; + BUG_ON(audit_n_rules 0); + + if (audit_n_rules != 0) + goto out; /* The overall state isn't changing. */ + + read_lock(tasklist_lock); + do_each_thread(g, p) { + clear_tsk_thread_flag(p, TIF_SYSCALL_AUDIT); + } while_each_thread(g, p); + read_unlock(tasklist_lock); + +out: + write_unlock(n_rules_lock); +} + /** * audit_alloc - allocate an audit context block for a task * @tsk: task @@ -931,6 +979,11 @@ int audit_alloc(struct task_struct *tsk) state = audit_filter_task(tsk, key); if (state == AUDIT_DISABLED) { + /* +* There is no relevant race against inc/dec_n_rules +* here -- inc won't set TIF_SYSCALL_AUDIT, and, if dec +* clears it redundantly, there's no harm done. +*/ clear_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT); return 0; } @@ -942,8 +995,14 @@ int audit_alloc(struct task_struct *tsk) } context-filterkey = key; + read_lock(n_rules_lock); tsk-audit_context = context; - set_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT); + if (audit_n_rules
Re: [PATCH] audit: Only use the syscall slowpath when syscall audit rules exist
On Mon, Feb 3, 2014 at 12:23 PM, Steve Grubb sgr...@redhat.com wrote: On Monday, February 03, 2014 09:53:23 AM Andy Lutomirski wrote: This toggles TIF_SYSCALL_AUDIT as needed when rules change instead of leaving it set whenever rules might be set in the future. This reduces syscall latency from 60ns to closer to 40ns on my laptop. Does this mean that we have processes that don't have the TIF_SYSCALL_AUDIT flag set? When rules get loaded, how do we get the flag put back into all processes? By looping over all processes and setting the flag, which is what my patch does. The theory of ops is supposed to be that for anyone not needing audit, there is only the cost of if (tif TIF_SYSCALL_AUDIT). On current kernels *all* processes have TIF_SYSCALL_AUDIT, even if they don't need auditing because there's nothing to audit. So everything pays the full cost. That should be it. If you have audit enabled or had it enabled (which means it might be loaded with new rules), we want to inspect the syscall. My point is that there's nothing to inspect -- there are no rules. Unless the audit code needs to do something just in case a non-syscall audit event gets written, in which case the audit code should IMO be fixed. (This is what Eric is talking about, I think.) --Andy -- Linux-audit mailing list Linux-audit@redhat.com https://www.redhat.com/mailman/listinfo/linux-audit