from:"Andy Lutomirski"

Re: [PATCH 00/34] fs: idmapped mounts

2020-11-03 Thread Andy Lutomirski

On Fri, Oct 30, 2020 at 5:02 AM Christian Brauner
 wrote:
>
> On Thu, Oct 29, 2020 at 02:58:55PM -0700, Andy Lutomirski wrote:
> >
> >
> > > On Oct 28, 2020, at 5:35 PM, Christian Brauner 
> > >  wrote:
> > >
> > > Hey everyone,
> > >
> > > I vanished for a little while to focus on this work here so sorry for
> > > not being available by mail for a while.
> > >
> > > Since quite a long time we have issues with sharing mounts between
> > > multiple unprivileged containers with different id mappings, sharing a
> > > rootfs between multiple containers with different id mappings, and also
> > > sharing regular directories and filesystems between users with different
> > > uids and gids. The latter use-cases have become even more important with
> > > the availability and adoption of systemd-homed (cf. [1]) to implement
> > > portable home directories.
> > >
> > > The solutions we have tried and proposed so far include the introduction
> > > of fsid mappings, a tiny overlay based filesystem, and an approach to
> > > call override creds in the vfs. None of these solutions have covered all
> > > of the above use-cases.
> > >
> > > The solution proposed here has it's origins in multiple discussions
> > > during Linux Plumbers 2017 during and after the end of the containers
> > > microconference.
> > > To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
> > > James, and myself. A variant of the solution proposed here has also been
> > > discussed, again to the best of my knowledge, after a Linux conference
> > > in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
> > > after Linux Plumbers.
> > > I've taken the time to finally implement a working version of this
> > > solution over the last weeks to the best of my abilities. Tycho has
> > > signed up for this sligthly crazy endeavour as well and he has helped
> > > with the conversion of the xattr codepaths.
> > >
> > > The core idea is to make idmappings a property of struct vfsmount
> > > instead of tying it to a process being inside of a user namespace which
> > > has been the case for all other proposed approaches.
> > > It means that idmappings become a property of bind-mounts, i.e. each
> > > bind-mount can have a separate idmapping. This has the obvious advantage
> > > that idmapped mounts can be created inside of the initial user
> > > namespace, i.e. on the host itself instead of requiring the caller to be
> > > located inside of a user namespace. This enables such use-cases as e.g.
> > > making a usb stick available in multiple locations with different
> > > idmappings (see the vfat port that is part of this patch series).
> > >
> > > The vfsmount struct gains a new struct user_namespace member. The
> > > idmapping of the user namespace becomes the idmapping of the mount. A
> > > caller that is either privileged with respect to the user namespace of
> > > the superblock of the underlying filesystem or a caller that is
> > > privileged with respect to the user namespace a mount has been idmapped
> > > with can create a new bind-mount and mark it with a user namespace.
> >
> > So one way of thinking about this is that a user namespace that has an 
> > idmapped mount can, effectively, create or chown files with *any* on-disk 
> > uid or gid by doing it directly (if that uid exists in-namespace, which is 
> > likely for interesting ids like 0) or by creating a new userns with that id 
> > inside.
> >
> > For a file system that is private to a container, this seems moderately 
> > safe, although this may depend on what exactly “private” means. We probably 
> > want a mechanism such that, if you are outside the namespace, a reference 
> > to a file with the namespace’s vfsmnt does not confer suid privilege.
> >
> > Imagine the following attack: user creates a namespace with a root user and 
> > arranges to get an idmapped fs, e.g. by inserting an ext4 usb stick or 
> > using whatever container management tool does this.  Inside the namespace, 
> > the user creates a suid-root file.
> >
> > Now, outside the namespace, the user has privilege over the namespace.  
> > (I’m assuming there is some tool that will idmap things in a namespace 
> > owned by an unprivileged user, which seems likely.). So the user makes a 
> > new bind mount and if maps it to the init namespace. Game over.
> >
> > So I think we need to have s

Re: [PATCH 00/34] fs: idmapped mounts

2020-10-29 Thread Andy Lutomirski

> On Oct 28, 2020, at 5:35 PM, Christian Brauner  
> wrote:
> 
> Hey everyone,
> 
> I vanished for a little while to focus on this work here so sorry for
> not being available by mail for a while.
> 
> Since quite a long time we have issues with sharing mounts between
> multiple unprivileged containers with different id mappings, sharing a
> rootfs between multiple containers with different id mappings, and also
> sharing regular directories and filesystems between users with different
> uids and gids. The latter use-cases have become even more important with
> the availability and adoption of systemd-homed (cf. [1]) to implement
> portable home directories.
> 
> The solutions we have tried and proposed so far include the introduction
> of fsid mappings, a tiny overlay based filesystem, and an approach to
> call override creds in the vfs. None of these solutions have covered all
> of the above use-cases.
> 
> The solution proposed here has it's origins in multiple discussions
> during Linux Plumbers 2017 during and after the end of the containers
> microconference.
> To the best of my knowledge this involved Aleksa, Stéphane, Eric, David,
> James, and myself. A variant of the solution proposed here has also been
> discussed, again to the best of my knowledge, after a Linux conference
> in St. Petersburg in Russia between Christoph, Tycho, and myself in 2017
> after Linux Plumbers.
> I've taken the time to finally implement a working version of this
> solution over the last weeks to the best of my abilities. Tycho has
> signed up for this sligthly crazy endeavour as well and he has helped
> with the conversion of the xattr codepaths.
> 
> The core idea is to make idmappings a property of struct vfsmount
> instead of tying it to a process being inside of a user namespace which
> has been the case for all other proposed approaches.
> It means that idmappings become a property of bind-mounts, i.e. each
> bind-mount can have a separate idmapping. This has the obvious advantage
> that idmapped mounts can be created inside of the initial user
> namespace, i.e. on the host itself instead of requiring the caller to be
> located inside of a user namespace. This enables such use-cases as e.g.
> making a usb stick available in multiple locations with different
> idmappings (see the vfat port that is part of this patch series).
> 
> The vfsmount struct gains a new struct user_namespace member. The
> idmapping of the user namespace becomes the idmapping of the mount. A
> caller that is either privileged with respect to the user namespace of
> the superblock of the underlying filesystem or a caller that is
> privileged with respect to the user namespace a mount has been idmapped
> with can create a new bind-mount and mark it with a user namespace.

So one way of thinking about this is that a user namespace that has an idmapped 
mount can, effectively, create or chown files with *any* on-disk uid or gid by 
doing it directly (if that uid exists in-namespace, which is likely for 
interesting ids like 0) or by creating a new userns with that id inside.

For a file system that is private to a container, this seems moderately safe, 
although this may depend on what exactly “private” means. We probably want a 
mechanism such that, if you are outside the namespace, a reference to a file 
with the namespace’s vfsmnt does not confer suid privilege.

Imagine the following attack: user creates a namespace with a root user and 
arranges to get an idmapped fs, e.g. by inserting an ext4 usb stick or using 
whatever container management tool does this.  Inside the namespace, the user 
creates a suid-root file.

Now, outside the namespace, the user has privilege over the namespace.  (I’m 
assuming there is some tool that will idmap things in a namespace owned by an 
unprivileged user, which seems likely.). So the user makes a new bind mount and 
if maps it to the init namespace. Game over.

So I think we need to have some control to mitigate this in a comprehensible 
way. A big hammer would be to require nosuid. A smaller hammer might be to say 
that you can’t create a new idmapped mount unless you have privilege over the 
userns that you want to use for the idmap and to say that a vfsmnt’s paths 
don’t do suid outside the idmap namespace.  We already do the latter for the 
vfsmnt’s mntns’s userns.

Hmm.  What happens if we require that an idmap userns equal the vfsmnt’s 
mntns’s userns?  Is that too limiting?

I hope that whatever solution gets used is straightforward enough to wrap one’s 
head around.

> When a file/inode is accessed through an idmapped mount the i_uid and
> i_gid of the inode will be remapped according to the user namespace the
> mount has been marked with. When a new object is created based on the
> fsuid and fsgid of the caller they will similarly be remapped according
> to the user namespace of the mount they care created from.

By “mapped according to”, I presume you mean that the on-disk uid/gid is the 
gid

Re: [PATCH v2 16/15] syscall_get_arch: add "struct task_struct *" argument

2018-11-21 Thread Andy Lutomirski

On Tue, Nov 20, 2018 at 4:44 PM Dmitry V. Levin  wrote:
>
> This argument is required to extend the generic ptrace API
> with PTRACE_GET_SYSCALL_INFO request: syscall_get_arch() is going to be
> called from ptrace_request() along with other syscall_get_* functions
> with a tracee as their argument.
>
> This change partially reverts commit 5e937a9ae913 ("syscall_get_arch:
> remove useless function arguments").
>

Reviewed-by: Andy Lutomirski  # for x86

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH 06/13] arc: define syscall_get_arch()

2018-11-09 Thread Andy Lutomirski



> On Nov 9, 2018, at 8:50 AM, Vineet Gupta  wrote:
> 
>> On 11/8/18 7:16 PM, Dmitry V. Levin wrote:
>> syscall_get_arch() is required to be implemented on all architectures
>> that use tracehook_report_syscall_entry() in order to extend
>> the generic ptrace API with PTRACE_GET_SYSCALL_INFO request.
>> 
>> Signed-off-by: Dmitry V. Levin 
>> ---
>> arch/arc/include/asm/syscall.h | 6 ++
>> include/uapi/linux/audit.h | 1 +
>> 2 files changed, 7 insertions(+)
>> 
>> diff --git a/arch/arc/include/asm/syscall.h b/arch/arc/include/asm/syscall.h
>> index 29de09804306..5662778a7411 100644
>> --- a/arch/arc/include/asm/syscall.h
>> +++ b/arch/arc/include/asm/syscall.h
>> @@ -9,6 +9,7 @@
>> #ifndef _ASM_ARC_SYSCALL_H
>> #define _ASM_ARC_SYSCALL_H  1
>> 
>> +#include 
>> #include 
>> #include 
>> #include 
>> @@ -68,4 +69,9 @@ syscall_get_arguments(struct task_struct *task, struct 
>> pt_regs *regs,
>>}
>> }
>> 
>> +static inline int syscall_get_arch(void)
>> +{
>> +return AUDIT_ARCH_ARC;
>> +}
>> +
> 
> Does ptrace (or user of this API) need a unique value per arch. Otherwise 
> instead
> of adding the boilerplate code to all arches, they could simply define 
> AUDIT_ARCH
> and common code could return it. Also the EM_xxx are not there in
> include/uapi/linux/elf.h to begin with since libc elf.h already defines them.

A lot of architectures allow multiple audit_arches at runtime due to compat 
support and similar features, so it really does want to be a function.  The 
goal of this patch set is to get it supported everywhere.

>> #endif
>> diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
>> index 818ae690ab79..a7149ceb5b98 100644
>> --- a/include/uapi/linux/audit.h
>> +++ b/include/uapi/linux/audit.h
>> @@ -375,6 +375,7 @@ enum {
>> 
>> #define AUDIT_ARCH_AARCH64(EM_AARCH64|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE)
>> #define AUDIT_ARCH_ALPHA(EM_ALPHA|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE)
>> +#define AUDIT_ARCH_ARC(EM_ARC)
>> #define AUDIT_ARCH_ARM(EM_ARM|__AUDIT_ARCH_LE)
>> #define AUDIT_ARCH_ARMEB(EM_ARM)
>> #define AUDIT_ARCH_CRIS(EM_CRIS|__AUDIT_ARCH_LE)
> 
> So I don't have the context of this patch (or coverletter) but what exactly 
> are we
> trying to do with this (adding LE to audit)  - what happens when an arch is
> capable of either and is say built for BE ?

The primary intent is that the triple (audit_arch, syscall_nr, arg1, ..., arg6) 
should describe what system call is being called and what its arguments are.  
I’m personally not sure what, if any, technical value there is in the LE bit.

I do think it makes sense for BE and LE to have different values.

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH 06/13] arc: define syscall_get_arch()

2018-11-09 Thread Andy Lutomirski

On Fri, Nov 9, 2018 at 8:11 AM Alexey Brodkin
 wrote:
>
> Hi Andy,
>
> On Fri, 2018-11-09 at 07:56 -0800, Andy Lutomirski wrote:
> > > On Nov 9, 2018, at 7:27 AM, Alexey Brodkin  
> > > wrote:
> > >
> > > Hi Andy,
> > >
> > > > On Fri, 2018-11-09 at 07:17 -0800, Andy Lutomirski wrote:
> > > > On Fri, Nov 9, 2018 at 6:22 AM Alexey Brodkin
> > > >  wrote:
> > > > > Hi Dmitry,
> > > > >
> > > > > > On Fri, 2018-11-09 at 06:16 +0300, Dmitry V. Levin wrote:
> > > > > > syscall_get_arch() is required to be implemented on all 
> > > > > > architectures
> > > > > > that use tracehook_report_syscall_entry() in order to extend
> > > > > > the generic ptrace API with PTRACE_GET_SYSCALL_INFO request.
> > > > > >
> > > > > > Signed-off-by: Dmitry V. Levin 
> > > > > > ---
> > > > > > arch/arc/include/asm/syscall.h | 6 ++
> > > > > > include/uapi/linux/audit.h | 1 +
> > > > > > 2 files changed, 7 insertions(+)
> > > > >
> > > > > [snip]
> > > > >
> > > > > > diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> > > > > > index 818ae690ab79..a7149ceb5b98 100644
> > > > > > --- a/include/uapi/linux/audit.h
> > > > > > +++ b/include/uapi/linux/audit.h
> > > > > > @@ -375,6 +375,7 @@ enum {
> > > > > >
> > > > > > #define AUDIT_ARCH_AARCH64   
> > > > > > (EM_AARCH64|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE)
> > > > > > #define AUDIT_ARCH_ALPHA 
> > > > > > (EM_ALPHA|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE)
> > > > > > +#define AUDIT_ARCH_ARC   (EM_ARC)
> > > > >
> > > > > Similarly here we need to have:
> > > > > >8-
> > > > > +#define AUDIT_ARCH_ARC (EM_ARC|EM_ARCV2)
> > > > > >8-
> > > > >
> > > >
> > > > Huh?  How does the bitwise or of two ELF machine codes make any sense?
> > >
> > > Oops... I didn't read examples of AUDIT_ARCH_ALPHA above :(
> > > Indeed that was stupid.
> > >
> > > But what would be a proper fix then?
> > >
> > > Something like that?
> > > >8-
> > > #define AUDIT_ARCH_ARC   (EM_ARC)
> > > #define AUDIT_ARCH_ARCV2 (EM_ARCV2)
> > >
> > >
> > > static inline int syscall_get_arch(void)
> > > {
> > > #ifdef __ARC700__
> > >   return AUDIT_ARCH_ARC;
> > > #else
> > >   return AUDIT_ARCH_ARCV2;
> > > #endif
> > > }
> > > >8-
> > >
> >
> > Maybe, but I know basically nothing about ARC.  Is the syscall numbering or 
> > calling convention different on ARC vs ARCv2?
>
> Syscall numbering should be the same as we use UAPI for both ARCompact (AKA 
> ARCv1)
> and ARCv2. As for calling convention I think it indeed differs.
>
> Note ARCompact and ARCv2 ISAs are binary incompatible!
>
> Even though assembly look pretty much the same (sans instructions
> available only for either ARCompact or ARCv2) encodings are different so
> in that sense these are completely different architectures.
>
> Also I'm wondering what could be other cases for use of syscall_get_arch().

The intent of syscall_get_arch() is that the tuple:

(arch, nr, arg1, ..., arg6)

fully identifies a system call and its arguments.  So it sounds like
we do indeed need to arch values.

>
> So I'd say it's better to report different values for ARC ISAs.
> And given we use the same values as in Binutils IMHO it would be good
> to not mix IDs here.
>
> -Alexey



-- 
Andy Lutomirski
AMA Capital Management, LLC

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH 06/13] arc: define syscall_get_arch()

2018-11-09 Thread Andy Lutomirski




> On Nov 9, 2018, at 7:27 AM, Alexey Brodkin  
> wrote:
> 
> Hi Andy,
> 
>> On Fri, 2018-11-09 at 07:17 -0800, Andy Lutomirski wrote:
>> On Fri, Nov 9, 2018 at 6:22 AM Alexey Brodkin
>>  wrote:
>>> Hi Dmitry,
>>> 
>>>> On Fri, 2018-11-09 at 06:16 +0300, Dmitry V. Levin wrote:
>>>> syscall_get_arch() is required to be implemented on all architectures
>>>> that use tracehook_report_syscall_entry() in order to extend
>>>> the generic ptrace API with PTRACE_GET_SYSCALL_INFO request.
>>>> 
>>>> Signed-off-by: Dmitry V. Levin 
>>>> ---
>>>> arch/arc/include/asm/syscall.h | 6 ++
>>>> include/uapi/linux/audit.h | 1 +
>>>> 2 files changed, 7 insertions(+)
>>> 
>>> [snip]
>>> 
>>>> diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
>>>> index 818ae690ab79..a7149ceb5b98 100644
>>>> --- a/include/uapi/linux/audit.h
>>>> +++ b/include/uapi/linux/audit.h
>>>> @@ -375,6 +375,7 @@ enum {
>>>> 
>>>> #define AUDIT_ARCH_AARCH64   
>>>> (EM_AARCH64|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE)
>>>> #define AUDIT_ARCH_ALPHA (EM_ALPHA|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE)
>>>> +#define AUDIT_ARCH_ARC   (EM_ARC)
>>> 
>>> Similarly here we need to have:
>>> >8-
>>> +#define AUDIT_ARCH_ARC (EM_ARC|EM_ARCV2)
>>> >8-
>>> 
>> 
>> Huh?  How does the bitwise or of two ELF machine codes make any sense?
> 
> Oops... I didn't read examples of AUDIT_ARCH_ALPHA above :(
> Indeed that was stupid.
> 
> But what would be a proper fix then?
> 
> Something like that?
> >8-
> #define AUDIT_ARCH_ARC   (EM_ARC)
> #define AUDIT_ARCH_ARCV2 (EM_ARCV2)
> 
> 
> static inline int syscall_get_arch(void)
> {
> #ifdef __ARC700__
>   return AUDIT_ARCH_ARC;
> #else
>   return AUDIT_ARCH_ARCV2;
> #endif
> }
> >8-
> 

Maybe, but I know basically nothing about ARC.  Is the syscall numbering or 
calling convention different on ARC vs ARCv2?

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH 06/13] arc: define syscall_get_arch()

2018-11-09 Thread Andy Lutomirski

On Fri, Nov 9, 2018 at 6:22 AM Alexey Brodkin
 wrote:
>
> Hi Dmitry,
>
> On Fri, 2018-11-09 at 06:16 +0300, Dmitry V. Levin wrote:
> > syscall_get_arch() is required to be implemented on all architectures
> > that use tracehook_report_syscall_entry() in order to extend
> > the generic ptrace API with PTRACE_GET_SYSCALL_INFO request.
> >
> > Signed-off-by: Dmitry V. Levin 
> > ---
> >  arch/arc/include/asm/syscall.h | 6 ++
> >  include/uapi/linux/audit.h | 1 +
> >  2 files changed, 7 insertions(+)
>
> [snip]
>
> > diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> > index 818ae690ab79..a7149ceb5b98 100644
> > --- a/include/uapi/linux/audit.h
> > +++ b/include/uapi/linux/audit.h
> > @@ -375,6 +375,7 @@ enum {
> >
> >  #define AUDIT_ARCH_AARCH64   
> > (EM_AARCH64|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE)
> >  #define AUDIT_ARCH_ALPHA (EM_ALPHA|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE)
> > +#define AUDIT_ARCH_ARC   (EM_ARC)
>
> Similarly here we need to have:
> >8-
> +#define AUDIT_ARCH_ARC (EM_ARC|EM_ARCV2)
> >8-
>

Huh?  How does the bitwise or of two ELF machine codes make any sense?

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH 00/13] Prepare for PTRACE_GET_SYSCALL_INFO

2018-11-09 Thread Andy Lutomirski

On Thu, Nov 8, 2018 at 7:13 PM Dmitry V. Levin  wrote:
>
> syscall_get_arch() is required to be implemented on all architectures
> that use tracehook_report_syscall_entry() in order to extend
> the generic ptrace API with PTRACE_GET_SYSCALL_INFO request.

Nice!

I'm pretty sure you have vastly more changelog than code here :)

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH 2/2] RISC-V: Add support for SECCOMP

2018-10-25 Thread Andy Lutomirski

On Wed, Oct 24, 2018 at 2:42 PM Kees Cook  wrote:
>
> On Wed, Oct 24, 2018 at 1:40 PM, Palmer Dabbelt  wrote:
> > From: "Wesley W. Terpstra" 
> >
> > This is a fairly straight-forward implementation of seccomp for RISC-V
> > systems.
> >
> > Signed-off-by: Wesley W. Terpstra 
> > Signed-off-by: Palmer Dabbelt 
> > ---
> >  arch/riscv/Kconfig   | 18 ++
> >  arch/riscv/include/asm/seccomp.h | 10 ++
> >  arch/riscv/include/asm/syscall.h |  6 ++
> >  arch/riscv/include/asm/thread_info.h |  1 +
> >  include/uapi/linux/audit.h   |  1 +
> >  5 files changed, 36 insertions(+)
> >  create mode 100644 arch/riscv/include/asm/seccomp.h
> >
> > diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
> > index a344980287a5..28abe47602a1 100644
> > --- a/arch/riscv/Kconfig
> > +++ b/arch/riscv/Kconfig
> > @@ -28,6 +28,7 @@ config RISCV
> > select GENERIC_STRNLEN_USER
> > select GENERIC_SMP_IDLE_THREAD
> > select GENERIC_ATOMIC64 if !64BIT || !RISCV_ISA_A
> > +   select HAVE_ARCH_SECCOMP_FILTER
>
> I think this patch is missing most of the actual seccomp glue?
>
> config HAVE_ARCH_SECCOMP_FILTER
> bool
> help
>   An arch should select this symbol if it provides all of these 
> things:
>   - syscall_get_arch()
>   - syscall_get_arguments()
>   - syscall_rollback()
>   - syscall_set_return_value()
>   - SIGSYS siginfo_t support
>   - secure_computing is called from a ptrace_event()-safe context
>   - secure_computing return value is checked and a return value of -1
> results in the system call being skipped immediately.
>   - seccomp syscall wired up
>
> I only see syscall_get_arch(). Nothing is using TIF_SECCOMP (I'd
> expect a masked check in entry.S -- it seems like tracepoints are
> getting missed too? I see it handled in ptrace.c but not checked in
> entry.S?) There's no checking for seccomp in ptrace.c, etc.

Hi RISC-V people:

I strongly, strongly suggest that you rewrite your asm to work the way
that x86's does: have a function called prepare_exit_to_usermode() and
make it work more or less like x86's.  Doing all the exit work in asm
like you are is just setting you up for a world of pain.

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH 00/18] xfrm: Add compat layer

2018-07-27 Thread Andy Lutomirski



> On Jul 27, 2018, at 9:48 AM, Nathan Harold  wrote:
> 
> We (Android) are very interested in removing the restriction for 32-bit 
> userspace processes accessing xfrm netlink on 64-bit kernels. IPsec support 
> is required to pass Android conformance tests, and any manufacturer wishing 
> to ship 32-bit userspace with a recent kernel needs out-of-tree changes 
> (removing the compat_task check) to do so.
> 
> That said, it’s not difficult to work around alignment issues directly in 
> userspace, so maybe we could just remove the check and make this the caller's 
> responsibility? Here’s an example of the workaround currently in the Android 
> tree:
> https://android.googlesource.com/platform/system/netd/+/refs/heads/master/server/XfrmController.h#257
> 
> We could also employ a (relatively simple) solution such as the one above in 
> the uapi XFRM header itself, though it would require a caller to declare the 
> target kernel ABI at compile time. Maybe that’s not unthinkable for an 
> uncommon case?
> 

Could there just be an XFRM2 that is entirely identical to XFRM for 64-bit 
userspace but makes the 32-bit structures match?  If there are a grand total of 
two or so userspace implementations, that should cover most use cases. L

> -Nathan
> 
> 
>> On Fri, Jul 27, 2018 at 7:51 AM, Dmitry Safonov  wrote:
>> On Fri, 2018-07-27 at 16:19 +0200, Florian Westphal wrote:
>> > Dmitry Safonov  wrote:
>> > > 1. It will double copy netlink messages, making it O(n) instead of
>> > > O(1), where n - is number of bind()s.. Probably we don't care much.
>> > 
>> > About those bind() patches, I don't understand why they are needed.
>> > 
>> > Why can't you just add the compat skb to the native skb when doing
>> > the multicast call?
>> > 
>> > skb_shinfo(skb)->frag_list = compat_skb;
>> > xfrm_nlmsg_multicast(net, skb, 0, ...
>> 
>> Oh yeah, sorry, I think I misread the patch - will try to add compat
>> skb in the multicast call.
>> 
>> -- 
>> Thanks,
>>  Dmitry
> 
--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH] audit: set TIF_AUDIT_SYSCALL only if audit filter has been populated

2018-03-15 Thread Andy Lutomirski

On Wed, Mar 14, 2018 at 12:28 AM, Jiri Kosina <ji...@kernel.org> wrote:
> On Wed, 14 Mar 2018, Andy Lutomirski wrote:
>
>> > Yes...I wished I was in on the beginning of this discussion. Here's the
>> > problem. We need all tasks auditable unless specifically dismissed as
>> > uninteresting. This would be a task,never rule.
>> >
>> > The way we look at it, is if it boots with audit=1, then we know auditd
>> > is expected to run at some point. So, we need all tasks to stay
>> > auditable. If they weren't and auditd enabled auditing, then we'd need
>> > to walk the whole proctable and stab TIF_AUDIT_SYSCALL into every
>> > process in the system. It was decided that this is too ugly.
>>
>> When was that decided?  That's what this patch does.
>
> I'd like to see some more justification as well.
>
> Namely, if I compare "setting TIF_AUDIT_SYSCALL for every process on a
> need-to-be-so basis" to "we always go through the slow path and
> pessimistically assume that audit is enabled and has reasonable ruleset
> loaded", I have my own (different) opinion of what is too ugly.

Me too.

That being said, on re-review of my old code, I think that
audit_dec_n_rules() may be the wrong approach.  It may be better to
leave TIF_AUDIT_SYSCALL set but to make the audit code itself clear
the flag the next time through.  That way we don't end up with a
partially filled in syscall audit record that never gets consumed if
we clear TIF_AUDIT_SYSCALL in the middle of a syscall.

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH] audit: set TIF_AUDIT_SYSCALL only if audit filter has been populated

2018-03-14 Thread Andy Lutomirski

On Sat, Mar 10, 2018 at 10:15 AM, Steve Grubb  wrote:
> On Wed, 7 Mar 2018 18:43:42 -0500
> Paul Moore  wrote:
>> ... and I just realized that linux-audit isn't on the To/CC line,
>> adding them now.
>>
>> Link to the patch is below.
>>
>> * https://marc.info/?t=15204188763=1=2
>
> Yes...I wished I was in on the beginning of this discussion. Here's the
> problem. We need all tasks auditable unless specifically dismissed as
> uninteresting. This would be a task,never rule.
>
> The way we look at it, is if it boots with audit=1, then we know auditd
> is expected to run at some point. So, we need all tasks to stay
> auditable. If they weren't and auditd enabled auditing, then we'd need
> to walk the whole proctable and stab TIF_AUDIT_SYSCALL into every
> process in the system. It was decided that this is too ugly.

When was that decided?  That's what this patch does.

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH] audit: set TIF_AUDIT_SYSCALL only if audit filter has been populated

2018-03-08 Thread Andy Lutomirski

On Wed, Mar 7, 2018 at 11:41 PM, Paul Moore <p...@paul-moore.com> wrote:
> On Wed, Mar 7, 2018 at 11:48 AM, Jiri Kosina <ji...@kernel.org> wrote:
>> On Wed, 7 Mar 2018, Andy Lutomirski wrote:
>>> Wow, this was a long time ago.
>>
>> Oh yeah; but it now resurfaced on our side, as we are of course receiving
>> a lot of requests with respect to making syscall performance great again
>> :)
>
> Ooof.  I'm not sure I can handle making more things "great again" ;)
>
>>> From memory and a bit of email diving, there are two reasons.
>>>
>>> 1. The probably was partially solved (by Oleg, IIRC) by making auditctl
>>>-a task,never cause newly spawned tasks to not suck.  Yes, it's a
>>>very partial solution.  After considerable nagging, I got Fedora to
>>>default to -a task,never.
>>
>> Hm, right; that's a bit inconvenient, because it takes each and every
>> vendor having to realize this option, and put it in. Making kernel do the
>> right thing automatically sounds like a better option to me.
>
> This predates audit falling into my lap, but speaking generally I
> think it would be good if the kernel did The Right Thing, so long as
> it isn't too painful.
>
>>> 2. This patch, as is, may be a bit problematic.  In particular, if one
>>>task changes the audit rules while another task is in the middle of
>>>the syscall, then it's too late to audit that syscall correctly.
>>>This could be seen as a bug or it could be seen as being just fine.
>>
>> I don't think this should be a problem, given the fact that the whole
>> timing/ordering is not predictable anyway due to scheduling.
>>
>> Paul, what do you think?
>
> I'm not overly concerned about the race condition between configuring
> the audit filters and syscalls that are currently in-flight; after all
> we have that now and "fixing" it would be pretty much impractical
> (impossible maybe?).  Most serious audit users configure it during
> boot and let it run, frequent runtime changes are not common as far as
> I can tell.
>
> I just looked quickly at the patch and decided it isn't something I'm
> going to be able to carefully review in the time I've got left today,
> so it's going to have to wait until tomorrow and Friday ... however,
> speaking on general principle I don't have an objection to the ideas
> put forth here.
>
> Andy, if you've got any Reviewed-by/Tested-by/NACK/etc. you want to
> add, that would be good to have.
>

It's sort of my patch, and I was reasonably happy with it, so
definitely no NACK from me.  The only caveat I have is that I wrote
the original version so long ago that we need to re-audit the code.
In particular, I want to make sure that the following two cases don't
result in warnings, oopses, or other misbehavior:

1. Do a syscall with TIF_AUDIT_SYSCALL clear.  Return with
TIF_AUDIT_SYSCALL set and that syscall enabled for auditing.

2.. Do a VFS syscall with TIF_AUDIT_CLEAR and have TIF_AUDIT_SYSCALL
set before we execute any VFS code.  The VFS code will call into the
audit code to log the paths it touches (IIRC).  Again, this shouldn't
warn or otherwise misbehave.

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH] audit: set TIF_AUDIT_SYSCALL only if audit filter has been populated

2018-03-08 Thread Andy Lutomirski



> On Mar 8, 2018, at 1:12 AM, Richard Guy Briggs <r...@redhat.com> wrote:
> 
>> On 2018-03-07 18:43, Paul Moore wrote:
>>> On Wed, Mar 7, 2018 at 6:41 PM, Paul Moore <p...@paul-moore.com> wrote:
>>>> On Wed, Mar 7, 2018 at 11:48 AM, Jiri Kosina <ji...@kernel.org> wrote:
>>>>> On Wed, 7 Mar 2018, Andy Lutomirski wrote:
>>>>> Wow, this was a long time ago.
>>>> 
>>>> Oh yeah; but it now resurfaced on our side, as we are of course receiving
>>>> a lot of requests with respect to making syscall performance great again
>>>> :)
>>> 
>>> Ooof.  I'm not sure I can handle making more things "great again" ;)
>>> 
>>>>> From memory and a bit of email diving, there are two reasons.
>>>>> 
>>>>> 1. The probably was partially solved (by Oleg, IIRC) by making auditctl
>>>>>   -a task,never cause newly spawned tasks to not suck.  Yes, it's a
>>>>>   very partial solution.  After considerable nagging, I got Fedora to
>>>>>   default to -a task,never.
>>>> 
>>>> Hm, right; that's a bit inconvenient, because it takes each and every
>>>> vendor having to realize this option, and put it in. Making kernel do the
>>>> right thing automatically sounds like a better option to me.
>>> 
>>> This predates audit falling into my lap, but speaking generally I
>>> think it would be good if the kernel did The Right Thing, so long as
>>> it isn't too painful.
>>> 
>>>>> 2. This patch, as is, may be a bit problematic.  In particular, if one
>>>>>   task changes the audit rules while another task is in the middle of
>>>>>   the syscall, then it's too late to audit that syscall correctly.
>>>>>   This could be seen as a bug or it could be seen as being just fine.
>>>> 
>>>> I don't think this should be a problem, given the fact that the whole
>>>> timing/ordering is not predictable anyway due to scheduling.
>>>> 
>>>> Paul, what do you think?
>>> 
>>> I'm not overly concerned about the race condition between configuring
>>> the audit filters and syscalls that are currently in-flight; after all
>>> we have that now and "fixing" it would be pretty much impractical
>>> (impossible maybe?).  Most serious audit users configure it during
>>> boot and let it run, frequent runtime changes are not common as far as
>>> I can tell.
> 
> I'd agree the race condition here can't easily be fixed and isn't worth
> fixing for the reasons already stated (rules don't change often and may
> even be locked once in place relatively early, scheduling uncertainties).
> 
>>> I just looked quickly at the patch and decided it isn't something I'm
>>> going to be able to carefully review in the time I've got left today,
>>> so it's going to have to wait until tomorrow and Friday ... however,
>>> speaking on general principle I don't have an objection to the ideas
>>> put forth here.
> 
> The approach seems a bit draconian since it touches all tasks but only
> when adding the first rule or deleting the last.
> 
> What we lose is the ability to set or clear individual task auditing and
> have it stick to speed up non-audited tasks when there are rules
> present, though this isn't currently used, in favour of audit_context
> presence.
> 
>>> Andy, if you've got any Reviewed-by/Tested-by/NACK/etc. you want to
>>> add, that would be good to have.
>> 
>> ... and I just realized that linux-audit isn't on the To/CC line,
>> adding them now.
> 
> (and Andy's non-NACK missed too...)  The mailing list *is* in MAINTAINERS.
> 

The mailing list bounces my emails. 

>> Link to the patch is below.
>> 
>> * https://marc.info/?t=15204188763=1=2
>> 
>> paul moore
> 
> - RGB
> 
> --
> Richard Guy Briggs <r...@redhat.com>
> Sr. S/W Engineer, Kernel Security, Base Operating Systems
> Remote, Ottawa, Red Hat Canada
> IRC: rgb, SunRaycer
> Voice: +1.647.777.2635, Internal: (81) 32635

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH V3 07/10] capabilities: remove a layer of conditional logic

2017-08-27 Thread Andy Lutomirski

On Wed, Aug 23, 2017 at 3:12 AM, Richard Guy Briggs  wrote:
> Remove a layer of conditional logic to make the use of conditions
> easier to read and analyse.
>
> Signed-off-by: Richard Guy Briggs 
> ---
>  security/commoncap.c |   13 ++---
>  1 files changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 5d81354..ffcaff0 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -551,13 +551,12 @@ static inline bool nonroot_raised_pE(struct cred *cred, 
> kuid_t root)
>  {
> bool ret = false;
>
> -   if (cap_grew(effective, ambient, cred)) {
> -   if (!cap_full(effective, cred) ||
> -   !is_eff(root, cred) || !is_real(root, cred) ||
> -   !root_privileged()) {
> -   ret = true;
> -   }
> -   }
> +   if (cap_grew(effective, ambient, cred) &&
> +   (!cap_full(effective, cred) ||
> +!is_eff(root, cred) ||
> +!is_real(root, cred) ||
> +!root_privileged()))
> +   ret = true;

I'm trying to give this whole series the benefit of the doubt.  Here's
what happens when I try to read this:

Did effective grow ambient caps?  No, makes no sense.  Did ambient
grow effective caps?  No, not that either.  Is effective more fully
grown than ambient?  That's probably it, but that sounds really weird.

The rest would IMO be clearer if you named the helpers ruid_eq and
euid_eq instead of is_eff and is_real.

> return ret;
>  }
>
> --
> 1.7.1
>

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH V3 05/10] capabilities: use intuitive names for id changes

2017-08-27 Thread Andy Lutomirski

On Wed, Aug 23, 2017 at 3:12 AM, Richard Guy Briggs  wrote:
> Introduce a number of inlines to make the use of the negation of
> uid_eq() easier to read and analyse.
>
> Signed-off-by: Richard Guy Briggs 
> ---
>  security/commoncap.c |   26 +-
>  1 files changed, 21 insertions(+), 5 deletions(-)
>
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 36c38a1..1af7dec 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -483,6 +483,15 @@ static int get_file_caps(struct linux_binprm *bprm, bool 
> *effective, bool *has_f
>
>  static inline bool root_privileged(void) { return !issecure(SECURE_NOROOT); }
>
> +static inline bool is_real(kuid_t uid, struct cred *cred)
> +{ return uid_eq(cred->uid, uid); }

OK I guess, but this just seems like a way to obfuscate the code a bit
and save typing "->uid".

> +
> +static inline bool is_eff(kuid_t uid, struct cred *cred)
> +{ return uid_eq(cred->euid, uid); }

Ditto.

> +
> +static inline bool is_suid(kuid_t uid, struct cred *cred)
> +{ return !is_real(uid, cred) && is_eff(uid, cred); }

Please no.  This is IMO insane.  You're hiding really weird,
nonintuitive logic in an oddly named helper.

Also, this is going to cause massive confusion and severe bugs: given
the same, the only remotely sensible guess as to what this function
does is uid_eq(cred->suid, uid).  So NAK to this.

> +
>  void handle_privileged_root(struct linux_binprm *bprm, bool has_fcap, bool 
> *effective, kuid_t root_uid)
>  {
> const struct cred *old = current_cred();
> @@ -493,7 +502,7 @@ void handle_privileged_root(struct linux_binprm *bprm, 
> bool has_fcap, bool *effe
>  * for a setuid root binary run by a non-root user.  Do set it
>  * for a root user just to cause least surprise to an admin.
>  */
> -   if (has_fcap && !uid_eq(new->uid, root_uid) && uid_eq(new->euid, 
> root_uid)) {
> +   if (has_fcap && is_suid(root_uid, new)) {

e.g. this.  The logic used to be obviously slightly dicey.  Now it
looks sane but doesn't do what you'd naively expect it to do, which is
far worse.

> @@ -519,6 +528,13 @@ void handle_privileged_root(struct linux_binprm *bprm, 
> bool has_fcap, bool *effe
> !cap_issubset(cred->cap_##target, cred->cap_##source)
>  #define cap_full(field, cred) \
> cap_issubset(CAP_FULL_SET, cred->cap_##field)
> +
> +static inline bool is_setuid(struct cred *new, const struct cred *old)
> +{ return !uid_eq(new->euid, old->uid); }
> +
> +static inline bool is_setgid(struct cred *new, const struct cred *old)
> +{ return !gid_eq(new->egid, old->gid); }

Please don't.  This logic is fragile, and these helpers are pretending
it's not fragile even though it's still fragile.

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH V3 05/10] capabilities: use intuitive names for id changes

2017-08-27 Thread Andy Lutomirski




--Andy
> On Aug 25, 2017, at 11:51 AM, Serge E. Hallyn <se...@hallyn.com> wrote:
> 
> Quoting Andy Lutomirski (l...@kernel.org):
>>> On Wed, Aug 23, 2017 at 3:12 AM, Richard Guy Briggs <r...@redhat.com> wrote:
>>> Introduce a number of inlines to make the use of the negation of
>>> uid_eq() easier to read and analyse.
>>> 
>>> Signed-off-by: Richard Guy Briggs <r...@redhat.com>
>>> ---
>>> security/commoncap.c |   26 +-
>>> 1 files changed, 21 insertions(+), 5 deletions(-)
>>> 
>>> diff --git a/security/commoncap.c b/security/commoncap.c
>>> index 36c38a1..1af7dec 100644
>>> --- a/security/commoncap.c
>>> +++ b/security/commoncap.c
>>> @@ -483,6 +483,15 @@ static int get_file_caps(struct linux_binprm *bprm, 
>>> bool *effective, bool *has_f
>>> 
>>> static inline bool root_privileged(void) { return !issecure(SECURE_NOROOT); 
>>> }
>>> 
>>> +static inline bool is_real(kuid_t uid, struct cred *cred)
>>> +{ return uid_eq(cred->uid, uid); }
>> 
>> OK I guess, but this just seems like a way to obfuscate the code a bit
>> and save typing "->uid".
> 
> Personally I find the new to be far more readable.  In the old, the
> distinction between uid and euid is one character hidden in the middle
> of the expression.

Would real_uid_eq be better?

> 
>>> +
>>> +static inline bool is_eff(kuid_t uid, struct cred *cred)
>>> +{ return uid_eq(cred->euid, uid); }
>> 
>> Ditto.
>> 
>>> +
>>> +static inline bool is_suid(kuid_t uid, struct cred *cred)
>>> +{ return !is_real(uid, cred) && is_eff(uid, cred); }
>> 
>> Please no.  This is IMO insane.  You're hiding really weird,
>> nonintuitive logic in an oddly named helper.
> 
> How is it nonintuitive?  It's very precisely checking that a
> nonroot user is executing something that results in euid=0.

I can think of several sensible predicated:

1. Are we execing a setuid-root program, where the setuod bit wasn't suppressed 
by nnp, mount options, trace, etc?

2. Same as 1, but also require that we weren't root.

3. Is the new euid 0 and old uid != 0?

4. Does suid == 0?

This helper checks something equivalent to 3, but only once were far enough 
through exec and before user code starts.  This is probably equivalent to 2 as 
well.  This is quite subtle and deserves an open-coded check, a much more 
carefully named helper, or, better yet, something that looks at binprm instead 
of cred.

is_suid sounds like #4.

> That's what it's checking for, the name of the new helper makes
> that clear, and the code becomes clearer because we only see the
> thing we care about checking for rather than the intent being
> hidden.


> 
>> Also, this is going to cause massive confusion and severe bugs: given
>> the same, the only remotely sensible guess as to what this function
>> does is uid_eq(cred->suid, uid).  So NAK to this.
>> 
>>> +
>>> void handle_privileged_root(struct linux_binprm *bprm, bool has_fcap, bool 
>>> *effective, kuid_t root_uid)
>>> {
>>>const struct cred *old = current_cred();
>>> @@ -493,7 +502,7 @@ void handle_privileged_root(struct linux_binprm *bprm, 
>>> bool has_fcap, bool *effe
>>> * for a setuid root binary run by a non-root user.  Do set it
>>> * for a root user just to cause least surprise to an admin.
>>> */
>>> -   if (has_fcap && !uid_eq(new->uid, root_uid) && uid_eq(new->euid, 
>>> root_uid)) {
>>> +   if (has_fcap && is_suid(root_uid, new)) {
>> 
>> e.g. this.  The logic used to be obviously slightly dicey.  Now it
>> looks sane but doesn't do what you'd naively expect it to do, which is
>> far worse.
> 
> In what way does not do what you'd expect?

It doesn't look at cred->suid.

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH v3 0/4] Improved seccomp logging

2017-05-02 Thread Andy Lutomirski

On Mon, May 1, 2017 at 7:41 PM, Tyler Hicks  wrote:
> On 04/27/2017 07:42 PM, Kees Cook wrote:
>> On Thu, Apr 27, 2017 at 3:17 PM, Tyler Hicks  wrote:
>>> Quick update... I finished the move from the high-water mark
>>> log_max_action sysctl to the bitmask based actions_logged sysctl.
>>
>> Awesome!
>>
>>> Unfortunately, I've just realized that SECCOMP_SET_LOGGING, or any
>>> process-wide logging configuration mechanism, will not work. It is fine
>>> for the situation where two unrelated processes set up seccomp filters
>>> that should be logged differently. However, it fails when two closely
>>> related processes, such as parent and child, need to set up seccomp
>>> filters that should be logged differently. Imagine a launcher that sets
>>> up an application sandbox (including a seccomp filter) and then launches
>>> an electron app which will have its own seccomp filter for sandboxing
>>> untrusted code that it runs. Unless the launcher and app completely
>>> agree on actions that should be logged, the logging won't work as
>>> intended for both processes.
>>
>> Oh, you mean the forked process sets up the logging it wants for the
>> filters it just installed, then after exec a process sets up new
>> logging requirements?
>
> Yes - see below.
>
>>
>>> I think this needs to be configured at the filter level.
>>
>> I'm not sure that's even the right way to compose the logging desires.
>>
>> So, my initial thought was "whatever ran SECCOMP_SET_LOGGING knows
>> what it's doing" and it should be the actual value.
>>
>> If the launcher wants logs of everything the application does with its
>> filters, then a purely-tied-to-filter approach won't work either.
>>
>> Perhaps log bits can only be enabled? I.e. SECCOMP_SET_LOGGING
>> performs an OR instead of an assignment?
>
> The problem that I'm envisioning with this design is this:
>
> 1. Launcher is told to launch Chrome and forks off a process.
>
> 2. Launcher sets up a filter using RET_ERRNO for all unacceptable
> syscalls and enables auditing of RET_ERRNO.
>
> 3. Launcher execs Chrome.
>
> 4. Chrome then sets up its own, more restrictive filter that uses
> RET_ERRNO, among other actions, but does not want auditing of RET_ERRNO.
>
> If we use process-wide auditing controls, the logs will be filled with
> RET_ERRNO messages that were unintended and unrelated to the RET_ERRNO
> actions set up in the launcher's filter.
>
> Unfortunately, the OR'ing idea doesn't solve the problem.

Things like my more complicated solution solve this completely, I
think.  The launcher would, by whatever means, say "RET_ERRNO and log
this".  The more restrictive sandbox would say "RET_ERROR and don't
log this" and we'd just make sure that the composition rules mean the
inner rule wins.

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH v3 0/4] Improved seccomp logging

2017-04-10 Thread Andy Lutomirski

On Fri, Apr 7, 2017 at 3:16 PM, Tyler Hicks <tyhi...@canonical.com> wrote:
> On 02/22/2017 12:46 PM, Kees Cook wrote:
>> On Thu, Feb 16, 2017 at 3:29 PM, Kees Cook <keesc...@chromium.org> wrote:
>>> On Wed, Feb 15, 2017 at 7:24 PM, Andy Lutomirski <l...@amacapital.net> 
>>> wrote:
>>>> On Mon, Feb 13, 2017 at 7:45 PM, Tyler Hicks <tyhi...@canonical.com> wrote:
>>>>> This patch set is the third revision of the following two previously
>>>>> submitted patch sets:
>>>>>
>>>>> v1: 
>>>>> http://lkml.kernel.org/r/1483375990-14948-1-git-send-email-tyhi...@canonical.com
>>>>> v1: 
>>>>> http://lkml.kernel.org/r/1483377999-15019-2-git-send-email-tyhi...@canonical.com
>>>>>
>>>>> v2: 
>>>>> http://lkml.kernel.org/r/1486100262-32391-1-git-send-email-tyhi...@canonical.com
>>>>>
>>>>> The patch set aims to address some known deficiencies in seccomp's current
>>>>> logging capabilities:
>>>>>
>>>>>   1. Inability to log all filter actions.
>>>>>   2. Inability to selectively enable filtering; e.g. devs want noisy 
>>>>> logging,
>>>>>  users want relative quiet.
>>>>>   3. Consistent behavior with audit enabled and disabled.
>>>>>   4. Inability to easily develop a filter due to the lack of a
>>>>>  permissive/complain mode.
>>>>
>>>> I think I dislike this, but I think my dislikes may be fixable with
>>>> minor changes.
>>>>
>>>> What I dislike is that this mixes app-specific built-in configuration
>>>> (seccomp) with global privileged stuff (audit).  The result is a
>>>> potentially difficult to use situation in which you need to modify an
>>>> app to make it loggable (using RET_LOG) and then fiddle with
>>>> privileged config (auditctl, etc) to actually see the logs.
>>>
>>> You make a good point about RET_LOG vs log_max_action. I think making
>>> RET_LOG the default value would work for 99% of the cases.
>>
>> Actually, I take this back: making "log" the default means that
>> everything else gets logged too, include "expected" return values like
>> errno, trap, etc. I think that would be extremely noisy as a default
>> (for upstream or Ubuntu).
>>
>> Perhaps RET_LOG should unconditionally log? Or maybe the logged
>> actions should be a bit field instead of a single value? Then the
>> default could be "RET_KILL and RET_LOG", but an admin could switch it
>> to just RET_KILL, or even nothing at all? Hmmm...
>
> Hi Kees - my apologies for going quiet on this topic after we spoke
> about it more in IRC. I needed to tend to other work but now I'm able to
> return to this seccomp logging patch set.
>
> To summarize what we discussed in IRC, the Chrome browser makes
> extensive use of RET_ERRNO, RET_TRACE, etc., to sandbox code that may
> not ever be adjusted to keep from bump into the sandbox walls.
> Therefore, it is unreasonable to enable logging of something like
> RET_ERRNO on a system-wide level where Chrome browser is in use.
>
> In contrast, snapd wants to set up "noisier" sandboxes for applications
> to make it clear to the developers and the users that the sandboxed
> application is bumping into the sandbox walls. Developers will know why
> their code may not be working as intended and users will know that the
> application is doing things that the platform doesn't agree with. These
> sandboxes will end up using RET_ERRNO in the majority of cases.
>
> This means that with the current design of this patch set, Chrome
> browser will either be unintentionally spamming the logs or snapd's
> sandboxes will be helplessly silent when both Chrome and snapd is
> installed at the same time, depending on the admin's preferences.
>
> To bring it back up a level, two applications may have a very different
> outlook on how acceptable a given seccomp action is and they may
> disagree on whether or not the user/administrator should be informed.
>
> I've been giving thought to the idea of providing a way for the
> application setting up the filter to opt into logging of certain
> actions. Here's a high level breakdown:

At the risk of overcomplicating things, I think this issue may be a
decent argument for doing something more like what I suggested
earlier: let seccomp users emit loggable things and let their parents
(optionally) catch and handle those things.  Then we get real scoping
rather than fiddling with bitmasks and hoping we get what we want to
see without generating massive log spam.

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH v3 0/4] Improved seccomp logging

2017-02-17 Thread Andy Lutomirski

On Thu, Feb 16, 2017 at 3:29 PM, Kees Cook <keesc...@chromium.org> wrote:
> On Wed, Feb 15, 2017 at 7:24 PM, Andy Lutomirski <l...@amacapital.net> wrote:
>> On Mon, Feb 13, 2017 at 7:45 PM, Tyler Hicks <tyhi...@canonical.com> wrote:
>>> This patch set is the third revision of the following two previously
>>> submitted patch sets:
>>>
>>> v1: 
>>> http://lkml.kernel.org/r/1483375990-14948-1-git-send-email-tyhi...@canonical.com
>>> v1: 
>>> http://lkml.kernel.org/r/1483377999-15019-2-git-send-email-tyhi...@canonical.com
>>>
>>> v2: 
>>> http://lkml.kernel.org/r/1486100262-32391-1-git-send-email-tyhi...@canonical.com
>>>
>>> The patch set aims to address some known deficiencies in seccomp's current
>>> logging capabilities:
>>>
>>>   1. Inability to log all filter actions.
>>>   2. Inability to selectively enable filtering; e.g. devs want noisy 
>>> logging,
>>>  users want relative quiet.
>>>   3. Consistent behavior with audit enabled and disabled.
>>>   4. Inability to easily develop a filter due to the lack of a
>>>  permissive/complain mode.
>>
>> I think I dislike this, but I think my dislikes may be fixable with
>> minor changes.
>>
>> What I dislike is that this mixes app-specific built-in configuration
>> (seccomp) with global privileged stuff (audit).  The result is a
>> potentially difficult to use situation in which you need to modify an
>> app to make it loggable (using RET_LOG) and then fiddle with
>> privileged config (auditctl, etc) to actually see the logs.
>
> You make a good point about RET_LOG vs log_max_action. I think making
> RET_LOG the default value would work for 99% of the cases.
>
>> What if, instead of logging straight to the audit log, SECCOMP_RET_LOG
>> [1] merely meant "tell our parent about this syscall"?  (Ideally we'd
>> also figure out a way to express "log this and allow", "log this and
>> do ERRNO", etc.)  Then we could have another mechanism that installs a
>> layer in the seccomp stack that, instead of catching syscalls, catches
>> log events and sticks them in a ring buffer (or audit).
>
> So, I really don't like this because it's yet another logging system.
> We already have a security event logger: audit. This continues to use
> that subsystem without changing semantics very much.

Audit sucks for this kind of thing, though.  It's mostly useless in a
container, for example.  But let me propose a middle ground.

>
>> Concretely, it might work like this.  If a filter returns
>> SECCOMP_RET_LOG, then we "log" and keep processing.  SECCOMP_RET_LOG
>> is otherwise treated literally like SECCOMP_RET_ALLOW and has no
>> effect on return value.  If you want log-and-kill, you install two
>> filters.
>>
>> There's a new seccomp(2) action that returns an fd.  That fd
>> references a new thing in the seccomp stack that is a BPF program that
>> is called whenever SECCOMP_RET_LOG is returned from lower down.  The
>> output of this filter determines whether the log event is ignored,
>> stuck in the ring buffer, or passed up the stack for further
>> processing.  You read(2) the fd to access the ring buffer.
>>
>> Using this mechanism, you could write a simple seccomptrace tool that
>> needs no privilege and dumps SECCOMP_RET_LOG events from the target
>> program to stderr.
>
> If someone was going to do this, they could just as well set up a
> tracer to use RET_TRAP. (And this is what things like minijail does
> already, IIRC.) The reality of the situation is that this is way too
> much overhead for the common case. We need a generalized logging
> system that uses the existing logging mechanisms.

True.  And we can always add this part later if we want to.

But let me propose a different, much more minor change to the patches:

First, we currently have seccomp_run_filters running the whole stack
and keeping (more or less) the lowest value.  What if we changed it a
bit so that return values of 0xff were special.  Specifically,
a return value of 0xff?? from a filter means "take some action
right now but don't change the outcome of the filter stack".  Then we
define SECCOMP_RET_LOG as 0xff00 and perhaps reserve a few bits to
be a number reflected in the log entry.  (e.g. SECCOMP_RET_LOG(x) =
0xff00 | (x & 0xff)).

Now SECCOMP_RET_LOG or SECCOMP_RET_LOG(0) does approximately what it
does in the current patch series if used in isolation, but you can
install two filters, one of which logs and one of which kills, to get
"log and kill".

If we do this, we might want SECCOMP_RET_KILL to stop running filters
so that filters farther up the stack don't log the syscall.

What do you think?  This should be a very small delta on top of the
current patches.

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH v3 1/4] seccomp: Add sysctl to display available actions

2017-02-16 Thread Andy Lutomirski

On Thu, Feb 16, 2017 at 10:47 AM, Tyler Hicks <tyhi...@canonical.com> wrote:
> On 02/15/2017 09:14 PM, Andy Lutomirski wrote:
>> On Mon, Feb 13, 2017 at 7:45 PM, Tyler Hicks <tyhi...@canonical.com> wrote:
>>> This patch creates a read-only sysctl containing an ordered list of
>>> seccomp actions that the kernel supports. The ordering, from left to
>>> right, is the lowest action value (kill) to the highest action value
>>> (allow). Currently, a read of the sysctl file would return "kill trap
>>> errno trace allow". The contents of this sysctl file can be useful for
>>> userspace code as well as the system administrator.
>>
>> Would this make more sense as a new seccomp(2) mode a la
>> SECCOMP_HAS_ACTION?  Then sandboxy things that have no fs access could
>> use it.
>>
>
> It would make sense for code that needs to check which actions are
> available. It wouldn't make sense for administrators that need to check
> which actions are available unless libseccomp provided a wrapper utility.
>
> Is this a theoretical concern or do you know of a sandboxed piece of
> code that cannot access the sysctl before constructing a seccomp filter?
>

It's semi-theoretical.  But suppose I unshare namespaces, unmount a
bunch of stuff, then ask libseccomp to install a filter.  (I've
written code that does exactly that.)   libseccomp won't be able to
read the sysctl.

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH v3 0/4] Improved seccomp logging

2017-02-16 Thread Andy Lutomirski

On Mon, Feb 13, 2017 at 7:45 PM, Tyler Hicks  wrote:
> This patch set is the third revision of the following two previously
> submitted patch sets:
>
> v1: 
> http://lkml.kernel.org/r/1483375990-14948-1-git-send-email-tyhi...@canonical.com
> v1: 
> http://lkml.kernel.org/r/1483377999-15019-2-git-send-email-tyhi...@canonical.com
>
> v2: 
> http://lkml.kernel.org/r/1486100262-32391-1-git-send-email-tyhi...@canonical.com
>
> The patch set aims to address some known deficiencies in seccomp's current
> logging capabilities:
>
>   1. Inability to log all filter actions.
>   2. Inability to selectively enable filtering; e.g. devs want noisy logging,
>  users want relative quiet.
>   3. Consistent behavior with audit enabled and disabled.
>   4. Inability to easily develop a filter due to the lack of a
>  permissive/complain mode.

I think I dislike this, but I think my dislikes may be fixable with
minor changes.

What I dislike is that this mixes app-specific built-in configuration
(seccomp) with global privileged stuff (audit).  The result is a
potentially difficult to use situation in which you need to modify an
app to make it loggable (using RET_LOG) and then fiddle with
privileged config (auditctl, etc) to actually see the logs.

What if, instead of logging straight to the audit log, SECCOMP_RET_LOG
[1] merely meant "tell our parent about this syscall"?  (Ideally we'd
also figure out a way to express "log this and allow", "log this and
do ERRNO", etc.)  Then we could have another mechanism that installs a
layer in the seccomp stack that, instead of catching syscalls, catches
log events and sticks them in a ring buffer (or audit).

Concretely, it might work like this.  If a filter returns
SECCOMP_RET_LOG, then we "log" and keep processing.  SECCOMP_RET_LOG
is otherwise treated literally like SECCOMP_RET_ALLOW and has no
effect on return value.  If you want log-and-kill, you install two
filters.

There's a new seccomp(2) action that returns an fd.  That fd
references a new thing in the seccomp stack that is a BPF program that
is called whenever SECCOMP_RET_LOG is returned from lower down.  The
output of this filter determines whether the log event is ignored,
stuck in the ring buffer, or passed up the stack for further
processing.  You read(2) the fd to access the ring buffer.

Using this mechanism, you could write a simple seccomptrace tool that
needs no privilege and dumps SECCOMP_RET_LOG events from the target
program to stderr.

Thoughts?

[1] If we went this route, it might want to be renamed.

P.S. We ought to be able to write a BPF verifier pass that makes sure
that filters don't return unsupported return values if we cared to do
so.

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH v3 1/4] seccomp: Add sysctl to display available actions

2017-02-15 Thread Andy Lutomirski

On Mon, Feb 13, 2017 at 7:45 PM, Tyler Hicks  wrote:
> This patch creates a read-only sysctl containing an ordered list of
> seccomp actions that the kernel supports. The ordering, from left to
> right, is the lowest action value (kill) to the highest action value
> (allow). Currently, a read of the sysctl file would return "kill trap
> errno trace allow". The contents of this sysctl file can be useful for
> userspace code as well as the system administrator.

Would this make more sense as a new seccomp(2) mode a la
SECCOMP_HAS_ACTION?  Then sandboxy things that have no fs access could
use it.

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH 0/2] Begin auditing SECCOMP_RET_ERRNO return actions

2017-01-03 Thread Andy Lutomirski

On Mon, Jan 2, 2017 at 2:47 PM, Paul Moore  wrote:
> On Mon, Jan 2, 2017 at 11:53 AM, Tyler Hicks  wrote:
>> This patch set creates the basis for auditing information specific to a given
>> seccomp return action and then starts auditing SECCOMP_RET_ERRNO return
>> actions. The audit messages for SECCOMP_RET_ERRNO return actions include the
>> errno value that will be returned to userspace.
>
> I'm replying to this patchset posting because it his my inbox first,
> but my comments here apply to both this patchset and the other
> seccomp/audit patchset you posted.
>
> In my experience, we have two or three problems (the count varies
> depending on perspective) when it comes to seccomp filter reporting:
>
> 1. Inability to log all filter actions.
> 2. Inability to selectively enable filtering; e.g. devs want noisy
> logging, users want relative quiet.
> 3. Consistent behavior with audit enabled and disabled.
>
> My current thinking - forgive me, this has been kicking around in my
> head for the better part of six months (longer?) and I haven't
> attempted to code it up - is to create a sysctl knob for a system wide
> seccomp logging threshold that would be applied to the high 16-bits of
> *every* triggered action: if the action was at/below the threshold a
> record would be emitted, otherwise silence.  This should resolve
> problems #1 and #2, and the code should be relatively straightforward
> and small.
>
> As part of the code above, I expect that all seccomp logging would get
> routed through a single logging function (sort of like a better
> implementation of the existing audit_seccomp()) that would check the
> threshold and trigger the logging if needed.  This function could be
> augmented to check for CONFIG_AUDIT and in the case where audit was
> not built into the kernel, a simple printk could be used to log the
> seccomp event; solving problem #3.

Would this not be doable with a seccomp tracepoint and a BPF filter?

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH 0/2] Begin auditing SECCOMP_RET_ERRNO return actions

2017-01-03 Thread Andy Lutomirski

On Mon, Jan 2, 2017 at 8:53 AM, Tyler Hicks  wrote:
> This patch set creates the basis for auditing information specific to a given
> seccomp return action and then starts auditing SECCOMP_RET_ERRNO return
> actions. The audit messages for SECCOMP_RET_ERRNO return actions include the
> errno value that will be returned to userspace.
>

Not that I'm opposed to the idea, but what's the intended purpose?

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Should audit_seccomp check audit_enabled?

2015-10-23 Thread Andy Lutomirski

I would argue that, if auditing is off, audit_seccomp shouldn't do
anything.  After all, unlike e.g. selinux, seccomp is not a systemwide
policy, and seccomp signals might be ordinary behavior that's internal
to the seccomp-using application.  IOW, for people with audit compiled
in and subscribed by journald but switched off, I think that the
records shouldn't be emitted.

If you agree, I can send the two-line patch.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: Should audit_seccomp check audit_enabled?

2015-10-23 Thread Andy Lutomirski

On Fri, Oct 23, 2015 at 1:58 PM, Paul Moore <p...@paul-moore.com> wrote:
> On Fri, Oct 23, 2015 at 4:51 PM, Steve Grubb <sgr...@redhat.com> wrote:
>> On Friday, October 23, 2015 03:38:05 PM Paul Moore wrote:
>>> On Fri, Oct 23, 2015 at 1:01 PM, Kees Cook <keesc...@chromium.org> wrote:
>>> > On Fri, Oct 23, 2015 at 9:19 AM, Andy Lutomirski <l...@amacapital.net>
>> wrote:
>>> >> I would argue that, if auditing is off, audit_seccomp shouldn't do
>>> >> anything.  After all, unlike e.g. selinux, seccomp is not a systemwide
>>> >> policy, and seccomp signals might be ordinary behavior that's internal
>>> >> to the seccomp-using application.  IOW, for people with audit compiled
>>> >> in and subscribed by journald but switched off, I think that the
>>> >> records shouldn't be emitted.
>>> >>
>>> >> If you agree, I can send the two-line patch.
>>> >
>>> > I think signr==0 states (which I would identify as "intended
>>> > behavior") don't need to be reported under any situation, but audit
>>> > folks wanted to keep it around.
>>>
>>> Wearing my libseccomp hat, I would like some logging when the seccomp
>>> filter triggers a result other than allow.  I don't care if this is
>>> via audit or printk(), I just want some notification.  If we go the
>>> printk route and people really don't want to see anything in their
>>> logs, I suppose we could always add a sysctl knob to turn off the
>>> message completely (we would still need to do whatever audit records
>>> are required, see below).
>>>
>>> Wearing my audit hat, I want to make sure we tick off all the right
>>> boxes for the various certifications that people care about.  Steve
>>> Grubb has commented on what he needs in the past, although I'm not
>>> sure it was on-list, so I'll ask him to repeat it here.
>>
>> I went back and reviewed my notes since this came up in the current Common
>> Criteria evaluation. What we decided to do is treat syscall failures which
>> failed due to seccomp the same as syscall failures caused by dropping
>> capabilities. Both are opt-in DAC policies. That means we don't care. Do
>> whatever you like. :-)
>
> Thanks Steve.
>
> Andy, is your objection that you don't want to see any seccomp
> messages, or just seccomp audit records when audit is disabled?
>

My objection is that people who have audit compiled in but disabled at
runtime shouldn't have the overhead or the log noise from these
messages.  If people want the messages, then I think they should turn
on audit (auditctl -e 1 or whatever).

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: Should audit_seccomp check audit_enabled?

2015-10-23 Thread Andy Lutomirski

On Oct 23, 2015 10:01 AM, "Kees Cook" <keesc...@chromium.org> wrote:
>
> On Fri, Oct 23, 2015 at 9:19 AM, Andy Lutomirski <l...@amacapital.net> wrote:
> > I would argue that, if auditing is off, audit_seccomp shouldn't do
> > anything.  After all, unlike e.g. selinux, seccomp is not a systemwide
> > policy, and seccomp signals might be ordinary behavior that's internal
> > to the seccomp-using application.  IOW, for people with audit compiled
> > in and subscribed by journald but switched off, I think that the
> > records shouldn't be emitted.
> >
> > If you agree, I can send the two-line patch.
>
> I think signr==0 states (which I would identify as "intended
> behavior") don't need to be reported under any situation, but audit
> folks wanted to keep it around.

Even if there is a nonzero signr, it could just be a program opting to
trap and emulate one of its own syscalls.

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: Should audit_seccomp check audit_enabled?

2015-10-23 Thread Andy Lutomirski

On Fri, Oct 23, 2015 at 2:22 PM, Kees Cook <keesc...@chromium.org> wrote:
> On Fri, Oct 23, 2015 at 2:07 PM, Andy Lutomirski <l...@amacapital.net> wrote:
>> On Oct 23, 2015 10:01 AM, "Kees Cook" <keesc...@chromium.org> wrote:
>>>
>>> On Fri, Oct 23, 2015 at 9:19 AM, Andy Lutomirski <l...@amacapital.net> 
>>> wrote:
>>> > I would argue that, if auditing is off, audit_seccomp shouldn't do
>>> > anything.  After all, unlike e.g. selinux, seccomp is not a systemwide
>>> > policy, and seccomp signals might be ordinary behavior that's internal
>>> > to the seccomp-using application.  IOW, for people with audit compiled
>>> > in and subscribed by journald but switched off, I think that the
>>> > records shouldn't be emitted.
>>> >
>>> > If you agree, I can send the two-line patch.
>>>
>>> I think signr==0 states (which I would identify as "intended
>>> behavior") don't need to be reported under any situation, but audit
>>> folks wanted to keep it around.
>>
>> Even if there is a nonzero signr, it could just be a program opting to
>> trap and emulate one of its own syscalls.
>
> At present, that is a rare situation. Programs tend to be ptrace
> managed externally. Is there anything catching SIGSYS itself?
>

I wrote one once.  I also wrote a whole set of patches for libseccomp
to make it easier that never went anywhere -- I should dust those off
and package them into their own library.

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH V6 05/10] audit: log creation and deletion of namespace instances

2015-05-15 Thread Andy Lutomirski

On Thu, May 14, 2015 at 7:32 PM, Richard Guy Briggs r...@redhat.com wrote:
 On 15/05/14, Paul Moore wrote:
 * Look at our existing audit records to determine which records should have
 namespace and container ID tokens added.  We may only want to add the
 additional fields in the case where the namespace/container ID tokens are not
 the init namespace.

 If we have a record that ties a set of namespace IDs with a container
 ID, then I expect we only need to list the containerID along with auid
 and sessionID.

The problem here is that the kernel has no concept of a container, and I
don't think it makes any sense to add one just for audit.  Container is a
marketing term used by some userspace tools.

I can imagine that both audit could benefit from a concept of a
namespace *path* that understands nesting (e.g. root/2/5/1 or
something along those lines).  Mapping these to containers belongs
in userspace, I think.

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH V6 05/10] audit: log creation and deletion of namespace instances

2015-05-15 Thread Andy Lutomirski

On May 15, 2015 9:38 PM, Steve Grubb sgr...@redhat.com wrote:

 On Thursday, May 14, 2015 11:23:09 PM Andy Lutomirski wrote:
  On Thu, May 14, 2015 at 7:32 PM, Richard Guy Briggs r...@redhat.com wrote:
   On 15/05/14, Paul Moore wrote:
   * Look at our existing audit records to determine which records should
   have
   namespace and container ID tokens added.  We may only want to add the
   additional fields in the case where the namespace/container ID tokens are
   not the init namespace.
  
   If we have a record that ties a set of namespace IDs with a container
   ID, then I expect we only need to list the containerID along with auid
   and sessionID.
 
  The problem here is that the kernel has no concept of a container, and I
  don't think it makes any sense to add one just for audit.  Container is a
  marketing term used by some userspace tools.

 No, its a real thing just like a login. Does the kernel have any concept of a
 login? Yet it happens. And it causes us to generate events describing who,
 where from, role, success, and time of day. :-)


I really hope those records come from userspace, not the kernel.  I
also wonder what happens when a user logs in and types sudo agetty
/dev/ttyS0 115200.  If a user does that and then someone logs in on
/dev/ttyS0, which login are they?


  I can imagine that both audit could benefit from a concept of a
  namespace *path* that understands nesting (e.g. root/2/5/1 or
  something along those lines).  Mapping these to containers belongs
  in userspace, I think.

 I don't doubt that just as user space sequences the actions that are a login.
 I just need the kernel to do some book keeping and associate the necessary
 attributes in the event record to be able to reconstruct what is actually
 happening.

A precondition for that is having those records have some
correspondence to what is actually happening.  Since the kernel has no
concept of a container, and since the same kernel mechanisms could be
used for things that are probably not whatever the Common Criteria
rules think a container is, this could be quite difficult to define in
a meaningful manner.

Hence my suggestion to add only minimal support in the kernel and to
do this in userspace.

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: x32 + audit status?

2015-03-06 Thread Andy Lutomirski

On Mar 5, 2015 10:32 AM, David Drysdale drysd...@google.com wrote:

 Hi,

 Do we currently expect the audit system to work with x32 syscalls?

 I was playing with the audit system for the first time today (on
 v4.0-rc2, due to [1]), and it didn't seem to work for me.  (Tweaking
 ptrace.c like the patch below seemed to help, but I may just have
 configured something wrong.)

 I know there was a bunch of activity around this area in mid-2014,
 but I'm not sure what the final position was...

It's totally broken, and it needs ABI work.  I think it should keep
the high syscall numbers, which means that both userspace and the
audit core need to learn how to deal with it.

--Andy


 Thanks,
 David

 [1]: https://lkml.org/lkml/2015/3/4/879

 diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
 index e510618b2e91..443932afd9e8 100644
 --- a/arch/x86/kernel/ptrace.c
 +++ b/arch/x86/kernel/ptrace.c
 @@ -1445,7 +1445,7 @@ static void do_audit_syscall_entry(struct
 pt_regs *regs, u32 arch)
  {
  #ifdef CONFIG_X86_64
 if (arch == AUDIT_ARCH_X86_64) {
 -   audit_syscall_entry(regs-orig_ax, regs-di,
 +   audit_syscall_entry(regs-orig_ax  __SYSCALL_MASK, regs-di,
 regs-si, regs-dx, regs-r10);
 } else
  #endif

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH] powerpc: add little endian flag to syscall_get_arch()

2014-12-02 Thread Andy Lutomirski

On Tue, Dec 2, 2014 at 1:27 PM, Richard Guy Briggs r...@redhat.com wrote:
 Since both ppc and ppc64 have LE variants which are now reported by uname, add
 that flag (__AUDIT_ARCH_LE) to syscall_get_arch() and add AUDIT_ARCH_PPC*LE
 variants.

 Without this,  perf trace and auditctl fail.

 Mainline kernel reports ppc64le (per a058801) but there is no matching
 AUDIT_ARCH_PPC64LE.


There's no seccomp filter support for powerpc, so there's no risk that
this breaks it.

--Andy

 See:
 https://www.redhat.com/archives/linux-audit/2014-August/msg00082.html
 
 https://www.redhat.com/archives/linux-audit/2014-December/msg4.html

 Signed-off-by: Richard Guy Briggs r...@redhat.com
 ---
  arch/powerpc/include/asm/syscall.h |6 +-
  include/uapi/linux/audit.h |2 ++
  2 files changed, 7 insertions(+), 1 deletions(-)

 diff --git a/arch/powerpc/include/asm/syscall.h 
 b/arch/powerpc/include/asm/syscall.h
 index 6fa2708..a58acab 100644
 --- a/arch/powerpc/include/asm/syscall.h
 +++ b/arch/powerpc/include/asm/syscall.h
 @@ -90,6 +90,10 @@ static inline void syscall_set_arguments(struct 
 task_struct *task,

  static inline int syscall_get_arch(void)
  {
 -   return is_32bit_task() ? AUDIT_ARCH_PPC : AUDIT_ARCH_PPC64;
 +   int arch = is_32bit_task() ? AUDIT_ARCH_PPC : AUDIT_ARCH_PPC64;
 +#ifdef __LITTLE_ENDIAN__
 +   arch |= __AUDIT_ARCH_LE
 +#endif
 +   return arch;
  }
  #endif /* _ASM_SYSCALL_H */
 diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
 index 4d100c8..fe29a99 100644
 --- a/include/uapi/linux/audit.h
 +++ b/include/uapi/linux/audit.h
 @@ -364,7 +364,9 @@ enum {
  #define AUDIT_ARCH_PARISC  (EM_PARISC)
  #define AUDIT_ARCH_PARISC64(EM_PARISC|__AUDIT_ARCH_64BIT)
  #define AUDIT_ARCH_PPC (EM_PPC)
 +#define AUDIT_ARCH_PPCLE   (EM_PPC|__AUDIT_ARCH_LE)
  #define AUDIT_ARCH_PPC64   (EM_PPC64|__AUDIT_ARCH_64BIT)
 +#define AUDIT_ARCH_PPC64LE (EM_PPC64|__AUDIT_ARCH_64BIT|__AUDIT_ARCH_LE)
  #define AUDIT_ARCH_S390(EM_S390)
  #define AUDIT_ARCH_S390X   (EM_S390|__AUDIT_ARCH_64BIT)
  #define AUDIT_ARCH_SH  (EM_SH)
 --
 1.7.1

 --
 To unsubscribe from this list: send the line unsubscribe linux-api in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Andy Lutomirski
AMA Capital Management, LLC

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH] i386/audit: stop scribbling on the stack frame

2014-10-25 Thread Andy Lutomirski

On Fri, Oct 24, 2014 at 1:19 PM, H. Peter Anvin h...@zytor.com wrote:
 On 10/23/2014 12:38 PM, Eric Paris wrote:

 After the call __audit_syscall_entry aren't they already polluted?
 Isn't that the reason we need to reload EAX?

 Well, I guess EAX is special...


 Because system calls are asmlinkage, all the parameters are on the
 stack, but %eax is used as the index into the system call table.  This
 should thus be fine until we get rid of regparm(0) entirely, if that
 ever happens.


...and because __audit_syscall_entry *isn't* asmlinkage, it uses the
other convention, which is where the confusion comes from.  And, by
the time you get to sysenter_do_call, nothing cares about ecx, so you
can freely clobber it while popping from the stack.  I get it now.

--Andy

 -hpa





-- 
Andy Lutomirski
AMA Capital Management, LLC

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH] i386/audit: stop scribbling on the stack frame

2014-10-24 Thread Andy Lutomirski

On 10/22/2014 09:04 PM, Eric Paris wrote:
 git commit b4f0d3755c5e9cc86292d5fd78261903b4f23d4a was very very dumb.
 It was writing over %esp/pt_regs semi-randomly on i686  with the expected
 system can't boot results.  As noted in:
 
 https://bugs.freedesktop.org/show_bug.cgi?id=85277
 
 This patch stops fscking with pt_regs.  Instead it sets up the registers
 for the call to __audit_syscall_entry in the most obvious conceivable
 way.  It then does just a tiny tiny touch of magic.  We need to get what
 started in PT_EDX into 0(%esp) and PT_ESI into 4(%esp).  This is as easy
 as a pair of pushes.
 
 After the call to __audit_syscall_entry all we need to do is get that
 now useless junk off the stack (pair of pops) and reload %eax with the
 original syscall so other stuff can keep going about it's business.
 
 Signed-off-by: Eric Paris epa...@redhat.com
 Cc: Thomas Gleixner t...@linutronix.de
 Cc: Ingo Molnar mi...@redhat.com
 Cc: H. Peter Anvin h...@zytor.com
 Cc: x...@kernel.org
 Cc: linux-ker...@vger.kernel.org
 Cc: linux-audit@redhat.com
 ---
  arch/x86/kernel/entry_32.S | 15 +++
  1 file changed, 7 insertions(+), 8 deletions(-)
 
 diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
 index f9e3fab..fb01d22 100644
 --- a/arch/x86/kernel/entry_32.S
 +++ b/arch/x86/kernel/entry_32.S
 @@ -447,15 +447,14 @@ sysenter_exit:
  sysenter_audit:
   testl $(_TIF_WORK_SYSCALL_ENTRY  ~_TIF_SYSCALL_AUDIT),TI_flags(%ebp)
   jnz syscall_trace_entry
 - addl $4,%esp
 - CFI_ADJUST_CFA_OFFSET -4
 - movl %esi,4(%esp)   /* 5th arg: 4th syscall arg */
 - movl %edx,(%esp)/* 4th arg: 3rd syscall arg */
 - /* %ecx already in %ecx3rd arg: 2nd syscall arg */
 - movl %ebx,%edx  /* 2nd arg: 1st syscall arg */
 - /* %eax already in %eax1st arg: syscall number */
 + /* movl PT_EAX(%esp), %eax  already set, syscall number: 1st arg to 
 audit */
 + movl PT_EBX(%esp), %edx /* ebx/a0: 2nd arg to audit */
 + /* movl PT_ECX(%esp), %ecx  already set, a1: 3nd arg to audit */
 + pushl_cfi PT_ESI(%esp)  /* a3: 5th arg */
 + pushl_cfi PT_EDX+4(%esp)/* a2: 4th arg */
   call __audit_syscall_entry
 - pushl_cfi %ebx
 + popl_cfi %ecx /* get that remapped edx off the stack */
 + popl_cfi %ecx /* get that remapped esi off the stack */
   movl PT_EAX(%esp),%eax  /* reload syscall number */
   jmp sysenter_do_call
  
 

This looks reasonably likely to be correct, but this code is complicated
and now ever slower.

How hard would it be to just delete it and replace it with a
straightforward two-phase trace invocation a la x86_64?

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH] i386/audit: stop scribbling on the stack frame

2014-10-24 Thread Andy Lutomirski

On Thu, Oct 23, 2014 at 12:15 PM, Eric Paris epa...@redhat.com wrote:
 On Thu, 2014-10-23 at 11:39 -0700, Andy Lutomirski wrote:
 On 10/22/2014 09:04 PM, Eric Paris wrote:
  git commit b4f0d3755c5e9cc86292d5fd78261903b4f23d4a was very very dumb.
  It was writing over %esp/pt_regs semi-randomly on i686  with the expected
  system can't boot results.  As noted in:
 
  https://bugs.freedesktop.org/show_bug.cgi?id=85277
 
  This patch stops fscking with pt_regs.  Instead it sets up the registers
  for the call to __audit_syscall_entry in the most obvious conceivable
  way.  It then does just a tiny tiny touch of magic.  We need to get what
  started in PT_EDX into 0(%esp) and PT_ESI into 4(%esp).  This is as easy
  as a pair of pushes.
 
  After the call to __audit_syscall_entry all we need to do is get that
  now useless junk off the stack (pair of pops) and reload %eax with the
  original syscall so other stuff can keep going about it's business.
 
  Signed-off-by: Eric Paris epa...@redhat.com
  Cc: Thomas Gleixner t...@linutronix.de
  Cc: Ingo Molnar mi...@redhat.com
  Cc: H. Peter Anvin h...@zytor.com
  Cc: x...@kernel.org
  Cc: linux-ker...@vger.kernel.org
  Cc: linux-audit@redhat.com
  ---
   arch/x86/kernel/entry_32.S | 15 +++
   1 file changed, 7 insertions(+), 8 deletions(-)
 
  diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
  index f9e3fab..fb01d22 100644
  --- a/arch/x86/kernel/entry_32.S
  +++ b/arch/x86/kernel/entry_32.S
  @@ -447,15 +447,14 @@ sysenter_exit:
   sysenter_audit:
  testl $(_TIF_WORK_SYSCALL_ENTRY  ~_TIF_SYSCALL_AUDIT),TI_flags(%ebp)
  jnz syscall_trace_entry
  -   addl $4,%esp
  -   CFI_ADJUST_CFA_OFFSET -4
  -   movl %esi,4(%esp)   /* 5th arg: 4th syscall arg */
  -   movl %edx,(%esp)/* 4th arg: 3rd syscall arg */
  -   /* %ecx already in %ecx3rd arg: 2nd syscall arg */
  -   movl %ebx,%edx  /* 2nd arg: 1st syscall arg */
  -   /* %eax already in %eax1st arg: syscall number */
  +   /* movl PT_EAX(%esp), %eax  already set, syscall number: 1st arg 
  to audit */
  +   movl PT_EBX(%esp), %edx /* ebx/a0: 2nd arg to audit */
  +   /* movl PT_ECX(%esp), %ecx  already set, a1: 3nd arg to audit */
  +   pushl_cfi PT_ESI(%esp)  /* a3: 5th arg */
  +   pushl_cfi PT_EDX+4(%esp)/* a2: 4th arg */
  call __audit_syscall_entry
  -   pushl_cfi %ebx
  +   popl_cfi %ecx /* get that remapped edx off the stack */
  +   popl_cfi %ecx /* get that remapped esi off the stack */
  movl PT_EAX(%esp),%eax  /* reload syscall number */
  jmp sysenter_do_call
 
 

 This looks reasonably likely to be correct, but this code is complicated
 and now ever slower.

 I guess I could just use push/pop and do the CFI_ADJUST_CFA_OFFSET by
 hand.  But I figured this was reasonable enough...


I'm not complaining about your new assembly in particular.  There's
just too much assembly in there in general.

But I feel like I'm missing something in the new code.  Aren't you
corrupting ecx with those popl_cfi insns?

 How hard would it be to just delete it and replace it with a
 straightforward two-phase trace invocation a la x86_64?

 For me?  Hard.


I can try it eventually, but my track record with the 32-bit entry
code isn't so good.  I agree that this would be pretty nasty for
urgent, but at least neither that or your fix would need to go to
-stable.

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH] i386/audit: stop scribbling on the stack frame

2014-10-24 Thread Andy Lutomirski

On Thu, Oct 23, 2014 at 12:30 PM, Eric Paris epa...@redhat.com wrote:
 On Thu, 2014-10-23 at 12:20 -0700, Andy Lutomirski wrote:
 On Thu, Oct 23, 2014 at 12:15 PM, Eric Paris epa...@redhat.com wrote:
  On Thu, 2014-10-23 at 11:39 -0700, Andy Lutomirski wrote:
  On 10/22/2014 09:04 PM, Eric Paris wrote:
   git commit b4f0d3755c5e9cc86292d5fd78261903b4f23d4a was very very dumb.
   It was writing over %esp/pt_regs semi-randomly on i686  with the 
   expected
   system can't boot results.  As noted in:
  
   https://bugs.freedesktop.org/show_bug.cgi?id=85277
  
   This patch stops fscking with pt_regs.  Instead it sets up the registers
   for the call to __audit_syscall_entry in the most obvious conceivable
   way.  It then does just a tiny tiny touch of magic.  We need to get what
   started in PT_EDX into 0(%esp) and PT_ESI into 4(%esp).  This is as easy
   as a pair of pushes.
  
   After the call to __audit_syscall_entry all we need to do is get that
   now useless junk off the stack (pair of pops) and reload %eax with the
   original syscall so other stuff can keep going about it's business.
  
   Signed-off-by: Eric Paris epa...@redhat.com
   Cc: Thomas Gleixner t...@linutronix.de
   Cc: Ingo Molnar mi...@redhat.com
   Cc: H. Peter Anvin h...@zytor.com
   Cc: x...@kernel.org
   Cc: linux-ker...@vger.kernel.org
   Cc: linux-audit@redhat.com
   ---
arch/x86/kernel/entry_32.S | 15 +++
1 file changed, 7 insertions(+), 8 deletions(-)
  
   diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
   index f9e3fab..fb01d22 100644
   --- a/arch/x86/kernel/entry_32.S
   +++ b/arch/x86/kernel/entry_32.S
   @@ -447,15 +447,14 @@ sysenter_exit:
sysenter_audit:
   testl $(_TIF_WORK_SYSCALL_ENTRY  
   ~_TIF_SYSCALL_AUDIT),TI_flags(%ebp)
   jnz syscall_trace_entry
   -   addl $4,%esp
   -   CFI_ADJUST_CFA_OFFSET -4
   -   movl %esi,4(%esp)   /* 5th arg: 4th syscall arg */
   -   movl %edx,(%esp)/* 4th arg: 3rd syscall arg */
   -   /* %ecx already in %ecx3rd arg: 2nd syscall arg */
   -   movl %ebx,%edx  /* 2nd arg: 1st syscall arg */
   -   /* %eax already in %eax1st arg: syscall number */
   +   /* movl PT_EAX(%esp), %eax  already set, syscall number: 1st 
   arg to audit */
   +   movl PT_EBX(%esp), %edx /* ebx/a0: 2nd arg to audit */
   +   /* movl PT_ECX(%esp), %ecx  already set, a1: 3nd arg to audit */
   +   pushl_cfi PT_ESI(%esp)  /* a3: 5th arg */
   +   pushl_cfi PT_EDX+4(%esp)/* a2: 4th arg */
   call __audit_syscall_entry
   -   pushl_cfi %ebx
   +   popl_cfi %ecx /* get that remapped edx off the stack */
   +   popl_cfi %ecx /* get that remapped esi off the stack */
   movl PT_EAX(%esp),%eax  /* reload syscall number */
   jmp sysenter_do_call
  
  
 
  This looks reasonably likely to be correct, but this code is complicated
  and now ever slower.
 
  I guess I could just use push/pop and do the CFI_ADJUST_CFA_OFFSET by
  hand.  But I figured this was reasonable enough...
 

 I'm not complaining about your new assembly in particular.  There's
 just too much assembly in there in general.

 But I feel like I'm missing something in the new code.  Aren't you
 corrupting ecx with those popl_cfi insns?

 After the call __audit_syscall_entry aren't they already polluted?
 Isn't that the reason we need to reload EAX?  You can verify this leaves
 things in a similar state (although slightly differently polluted) than
 before it got screwed up.  Here is diff between before the breakage and
 what I propose we do now.

 (I admit I don't understand how the pushl_cfi %ebx wasn't messing up
 PT_EBX)

 /me anxiously awaits x86 guy to tell me how dumb I am


hpa, do you have time to figure this out?  I don't know the 32-bit ABI
well enough, nor will I have time to disassemble things or find and
read the spec to figure this out in the next few days.

--Andy

 $ git diff a17c8b54dc738c4fda31e8be0302cd131a04c19f -- 
 arch/x86/kernel/entry_32.S
 diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
 index 0d0c9d4..fb01d22 100644
 --- a/arch/x86/kernel/entry_32.S
 +++ b/arch/x86/kernel/entry_32.S
 @@ -447,16 +447,14 @@ sysenter_exit:
  sysenter_audit:
 testl $(_TIF_WORK_SYSCALL_ENTRY  ~_TIF_SYSCALL_AUDIT),TI_flags(%ebp)
 jnz syscall_trace_entry
 -   addl $4,%esp
 -   CFI_ADJUST_CFA_OFFSET -4
 -   /* %esi already in 8(%esp) 6th arg: 4th syscall arg */
 -   /* %edx already in 4(%esp) 5th arg: 3rd syscall arg */
 -   /* %ecx already in 0(%esp) 4th arg: 2nd syscall arg */
 -   movl %ebx,%ecx  /* 3rd arg: 1st syscall arg */
 -   movl %eax,%edx  /* 2nd arg: syscall number */
 -   movl $AUDIT_ARCH_I386,%eax  /* 1st arg: audit arch */
 +   /* movl PT_EAX(%esp), %eax  already set, syscall number: 1st arg 
 to audit */
 +   movl PT_EBX

Re: Regression: audit: x86: drop arch from __audit_syscall_entry() interface

2014-10-22 Thread Andy Lutomirski

On 10/22/2014 11:23 AM, Eric Paris wrote:
 That's really serious.  Looking now.
 
 On Wed, 2014-10-22 at 16:08 -0200, Paulo Zanoni wrote:
 Hi

 (Cc'ing everybody mentioned in the original patch)

 I work for Intel, on our Linux Graphics driver - aka i915.ko - and our
 QA team recently reported a regression on:

 commit b4f0d3755c5e9cc86292d5fd78261903b4f23d4a
 Author: Richard Guy Briggs
 Date:   Tue Mar 4 10:38:06 2014 -0500
 audit: x86: drop arch from __audit_syscall_entry() interface

 According to our QA, their i386 machine doesn't boot anymore. I tried
 to write my own revert for the patch, asked QA to test, and they
 confirmed it solves the problem.

 Here are the details of QA' s bug report:
 https://bugs.freedesktop.org/show_bug.cgi?id=85277 .

 The trees our QA tests are the development trees from i915.ko:
 http://cgit.freedesktop.org/drm-intel?h=drm-intel-fixes .

 I tried searching for other bug reports on the same patch, but
 couldn't find any. Forgive me if this bug was already reported.

 Feel free to continue this discussion on the bugzilla report if you want.

This piece:

movl %esi,4(%esp)   /* 5th arg: 4th syscall arg */
movl %edx,(%esp)/* 4th arg: 3rd syscall arg */

looks like it's overwriting syscall arguments.

This is clearly fixable, but an even better fix would be to drop the asm
entirely and switch to two-phase tracing.  Want to do it?  I can test
the seccomp bits if you switch over the asm :)

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: Regression: audit: x86: drop arch from __audit_syscall_entry() interface

2014-10-22 Thread Andy Lutomirski

On Wed, Oct 22, 2014 at 12:16 PM, Richard Guy Briggs r...@redhat.com wrote:
 On 14/10/22, Andy Lutomirski wrote:
 On 10/22/2014 11:23 AM, Eric Paris wrote:
  That's really serious.  Looking now.
 
  On Wed, 2014-10-22 at 16:08 -0200, Paulo Zanoni wrote:
  Hi
 
  (Cc'ing everybody mentioned in the original patch)
 
  I work for Intel, on our Linux Graphics driver - aka i915.ko - and our
  QA team recently reported a regression on:
 
  commit b4f0d3755c5e9cc86292d5fd78261903b4f23d4a
  Author: Richard Guy Briggs
  Date:   Tue Mar 4 10:38:06 2014 -0500
  audit: x86: drop arch from __audit_syscall_entry() interface
 
  According to our QA, their i386 machine doesn't boot anymore. I tried
  to write my own revert for the patch, asked QA to test, and they
  confirmed it solves the problem.
 
  Here are the details of QA' s bug report:
  https://bugs.freedesktop.org/show_bug.cgi?id=85277 .
 
  The trees our QA tests are the development trees from i915.ko:
  http://cgit.freedesktop.org/drm-intel?h=drm-intel-fixes .
 
  I tried searching for other bug reports on the same patch, but
  couldn't find any. Forgive me if this bug was already reported.
 
  Feel free to continue this discussion on the bugzilla report if you want.

 This piece:

   movl %esi,4(%esp)   /* 5th arg: 4th syscall arg */
   movl %edx,(%esp)/* 4th arg: 3rd syscall arg */

 looks like it's overwriting syscall arguments.

 This is clearly fixable, but an even better fix would be to drop the asm
 entirely and switch to two-phase tracing.  Want to do it?  I can test
 the seccomp bits if you switch over the asm :)

 Like what you did for x86_64.  That sounds worth investigating.

 I'll have a look at the asm, but I'm being distracted by a gunman loose
 2km from me and my wife and kids under lockdown in two different
 locations on the other side of the shooting site.  Had to cancel lunch
 today with two work colleagues 1/2km away from that site.  ...not been a
 productive day.


That's putting it mildly.  Stay safe.

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc

2014-08-25 Thread Andy Lutomirski

On Mon, Aug 25, 2014 at 8:43 AM, Nicolas Dichtel
nicolas.dich...@6wind.com wrote:
 Le 25/08/2014 16:04, Andy Lutomirski a écrit :

 On Aug 25, 2014 6:30 AM, Nicolas Dichtel nicolas.dich...@6wind.com
 wrote:

 CRIU wants to save the complete state of a namespace and then restore
 it.  For that to work, any information exposed to things in the
 namespace *cannot* be globally unique or unique per boot, since CRIU
 needs to arrange for that information to match whatever it was when
 CRIU saved it.


 How are ifindex of network devices managed? These ifindexes are unique
 per boot,
 thus can change depending on the order in which netdev are created.
 These ifindexes are unique per boot and exposed to userspace ...


 This does not appear to be true.

 $ sudo unshare --net
 # ip link add veth0 type veth peer name veth1
 # ip link
 1: lo: LOOPBACK mtu 65536 qdisc noop state DOWN mode DEFAULT group
 default
  link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
 2: veth1: BROADCAST,MULTICAST mtu 1500 qdisc noop state DOWN mode
 DEFAULT group default qlen 1000
  link/ether 06:0d:59:c7:a6:a8 brd ff:ff:ff:ff:ff:ff
 3: veth0: BROADCAST,MULTICAST mtu 1500 qdisc noop state DOWN mode
 DEFAULT group default qlen 1000
  link/ether b2:5c:8b:f2:12:28 brd ff:ff:ff:ff:ff:ff
 # logout
 $ ip link
 1: lo: LOOPBACK,UP,LOWER_UP mtu 65536 qdisc noqueue state UNKNOWN
  link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
 3: em1: NO-CARRIER,BROADCAST,MULTICAST,UP mtu 1500 qdisc pfifo_fast
 state DOWN qlen 1000

 I've probably misunderstood what you're trying to say. ifindexes are unique
 per
 boot and per netns.

I think we both misunderstood each other.  The ifindexes are unique
*per netns*, which means that, if you're unprivileged in a netns,
global information doesn't leak to you.  I think this is good.


 Let me try again, with emphasis in the right place.

 I think that *code running in a namespace* has no business even
 knowing a unique identity of *that namespace* from the perspective of
 the host.

 In your example, if there's a veth device between netns A and netns B,
 then code *in netns A* has no business knowing the identity of its
 veth peer if its peer (B) is a sibling or ancestor.  It also IMO has
 no business knowing the identity of its own netns (A) other than as
 my netns.

 I do not agree (see the example below).



 If A and B are siblings, then their parent needs to know where that
 veth device goes, but I think this is already the case to a sufficient
 extent today.

 I'm not aware of a hierarchy between netns. A daemon should be able to
 got the full network configuration, even if it's started when this
 configuration
 is already applied, ie even if it doesn't know what happen before it starts.


I don't know exactly which namespaces have an explicit hierarchy, but
there is certainly a hierarchy of *user* namespaces, and network
namespaces live in user namespaces, so they at least have somewhat of
a hierarchy.



 I feel like this discussion is falling into a common trap of new API
 discussions.  Can one of you who wants this API please articulate,
 with a reasonably precise example, what it is that you want to do, why
 you can't easily do it already, and how this API helps?  I currently
 understand how the API creates problems, but I don't understand how it
 solves any problems, and I will NAK it (and I suspect that Eric will,
 too, which is pretty much fatal) unless that changes.

 What I'm trying to solve is to have full info in netlink messages sent by
 the
 kernel, thus beeing able to identify a peer netns (and this is close from
 what
 audit guys are trying to have). Theorically, messages sent by the kernel can
 be
 reused as is to have the same configuration. This is not the case with
 x-netns
 devices. Here is an example, with ip tunnels:

 $ ip netns add 1
 $ ip link add ipip1 type ipip remote 10.16.0.121 local 10.16.0.249 dev eth0
 $ ip -d link ls ipip1
 8: ipip1@eth0: POINTOPOINT,NOARP mtu 1480 qdisc noop state DOWN mode
 DEFAULT group default
 link/ipip 10.16.0.249 peer 10.16.0.121 promiscuity 0
 ipip remote 10.16.0.121 local 10.16.0.249 dev eth0 ttl inherit pmtudisc
 $ ip link set ipip1 netns 1
 $ ip netns exec 1 ip -d link ls ipip1
 8: ipip1@tunl0: POINTOPOINT,NOARP,M-DOWN mtu 1480 qdisc noop state DOWN
 mode DEFAULT group default
 link/ipip 10.16.0.249 peer 10.16.0.121 promiscuity 0
 ipip remote 10.16.0.121 local 10.16.0.249 dev tunl0 ttl inherit pmtudisc

 Now informations got with 'ip link' are wrong and incomplete:
  - the link dev is now tunl0 instead of eth0, because we only got an ifindex
from the kernel without any netns informations.
  - the encapsulation addresses are not part of this netns but the user
 doesn't
known that (still because netns info is missing). These IPv4 addresses
 may
exist into this netns.
  - it's not possible to create the same netdevice with these infos.


Aha.  That's a genuine problem.

Perhaps we need a concept

Re: [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc

2014-08-25 Thread Andy Lutomirski

On Mon, Aug 25, 2014 at 9:41 AM, Nicolas Dichtel
nicolas.dich...@6wind.com wrote:
 Le 25/08/2014 18:13, Andy Lutomirski a écrit :

 On Mon, Aug 25, 2014 at 8:43 AM, Nicolas Dichtel
 nicolas.dich...@6wind.com wrote:

 Le 25/08/2014 16:04, Andy Lutomirski a écrit :

 On Aug 25, 2014 6:30 AM, Nicolas Dichtel nicolas.dich...@6wind.com
 wrote:


 CRIU wants to save the complete state of a namespace and then restore
 it.  For that to work, any information exposed to things in the
 namespace *cannot* be globally unique or unique per boot, since CRIU
 needs to arrange for that information to match whatever it was when
 CRIU saved it.



 How are ifindex of network devices managed? These ifindexes are unique
 per boot,
 thus can change depending on the order in which netdev are created.
 These ifindexes are unique per boot and exposed to userspace ...


 This does not appear to be true.

 $ sudo unshare --net
 # ip link add veth0 type veth peer name veth1
 # ip link
 1: lo: LOOPBACK mtu 65536 qdisc noop state DOWN mode DEFAULT group
 default
   link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
 2: veth1: BROADCAST,MULTICAST mtu 1500 qdisc noop state DOWN mode
 DEFAULT group default qlen 1000
   link/ether 06:0d:59:c7:a6:a8 brd ff:ff:ff:ff:ff:ff
 3: veth0: BROADCAST,MULTICAST mtu 1500 qdisc noop state DOWN mode
 DEFAULT group default qlen 1000
   link/ether b2:5c:8b:f2:12:28 brd ff:ff:ff:ff:ff:ff
 # logout
 $ ip link
 1: lo: LOOPBACK,UP,LOWER_UP mtu 65536 qdisc noqueue state UNKNOWN
   link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
 3: em1: NO-CARRIER,BROADCAST,MULTICAST,UP mtu 1500 qdisc pfifo_fast
 state DOWN qlen 1000

 I've probably misunderstood what you're trying to say. ifindexes are
 unique
 per
 boot and per netns.


 I think we both misunderstood each other.  The ifindexes are unique
 *per netns*, which means that, if you're unprivileged in a netns,
 global information doesn't leak to you.  I think this is good.

 Ok, I agree. I think audit daemons are always running under privileged
 users.




 Let me try again, with emphasis in the right place.

 I think that *code running in a namespace* has no business even
 knowing a unique identity of *that namespace* from the perspective of
 the host.

 In your example, if there's a veth device between netns A and netns B,
 then code *in netns A* has no business knowing the identity of its
 veth peer if its peer (B) is a sibling or ancestor.  It also IMO has
 no business knowing the identity of its own netns (A) other than as
 my netns.


 I do not agree (see the example below).



 If A and B are siblings, then their parent needs to know where that
 veth device goes, but I think this is already the case to a sufficient
 extent today.


 I'm not aware of a hierarchy between netns. A daemon should be able to
 got the full network configuration, even if it's started when this
 configuration
 is already applied, ie even if it doesn't know what happen before it
 starts.


 I don't know exactly which namespaces have an explicit hierarchy, but
 there is certainly a hierarchy of *user* namespaces, and network
 namespaces live in user namespaces, so they at least have somewhat of
 a hierarchy.



 I feel like this discussion is falling into a common trap of new API
 discussions.  Can one of you who wants this API please articulate,
 with a reasonably precise example, what it is that you want to do, why
 you can't easily do it already, and how this API helps?  I currently
 understand how the API creates problems, but I don't understand how it
 solves any problems, and I will NAK it (and I suspect that Eric will,
 too, which is pretty much fatal) unless that changes.


 What I'm trying to solve is to have full info in netlink messages sent by
 the
 kernel, thus beeing able to identify a peer netns (and this is close from
 what
 audit guys are trying to have). Theorically, messages sent by the kernel
 can
 be
 reused as is to have the same configuration. This is not the case with
 x-netns
 devices. Here is an example, with ip tunnels:

 $ ip netns add 1
 $ ip link add ipip1 type ipip remote 10.16.0.121 local 10.16.0.249 dev
 eth0
 $ ip -d link ls ipip1
 8: ipip1@eth0: POINTOPOINT,NOARP mtu 1480 qdisc noop state DOWN mode
 DEFAULT group default
  link/ipip 10.16.0.249 peer 10.16.0.121 promiscuity 0
  ipip remote 10.16.0.121 local 10.16.0.249 dev eth0 ttl inherit
 pmtudisc
 $ ip link set ipip1 netns 1
 $ ip netns exec 1 ip -d link ls ipip1
 8: ipip1@tunl0: POINTOPOINT,NOARP,M-DOWN mtu 1480 qdisc noop state DOWN
 mode DEFAULT group default
  link/ipip 10.16.0.249 peer 10.16.0.121 promiscuity 0
  ipip remote 10.16.0.121 local 10.16.0.249 dev tunl0 ttl inherit
 pmtudisc

 Now informations got with 'ip link' are wrong and incomplete:
   - the link dev is now tunl0 instead of eth0, because we only got an
 ifindex
 from the kernel without any netns informations.
   - the encapsulation addresses are not part of this netns but the user

Re: [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc

2014-08-25 Thread Andy Lutomirski

On Aug 25, 2014 6:30 AM, Nicolas Dichtel nicolas.dich...@6wind.com wrote:
 CRIU wants to save the complete state of a namespace and then restore
 it.  For that to work, any information exposed to things in the
 namespace *cannot* be globally unique or unique per boot, since CRIU
 needs to arrange for that information to match whatever it was when
 CRIU saved it.

 How are ifindex of network devices managed? These ifindexes are unique per 
 boot,
 thus can change depending on the order in which netdev are created.
 These ifindexes are unique per boot and exposed to userspace ...


This does not appear to be true.

$ sudo unshare --net
# ip link add veth0 type veth peer name veth1
# ip link
1: lo: LOOPBACK mtu 65536 qdisc noop state DOWN mode DEFAULT group default
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: veth1: BROADCAST,MULTICAST mtu 1500 qdisc noop state DOWN mode
DEFAULT group default qlen 1000
link/ether 06:0d:59:c7:a6:a8 brd ff:ff:ff:ff:ff:ff
3: veth0: BROADCAST,MULTICAST mtu 1500 qdisc noop state DOWN mode
DEFAULT group default qlen 1000
link/ether b2:5c:8b:f2:12:28 brd ff:ff:ff:ff:ff:ff
# logout
$ ip link
1: lo: LOOPBACK,UP,LOWER_UP mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: em1: NO-CARRIER,BROADCAST,MULTICAST,UP mtu 1500 qdisc pfifo_fast
state DOWN qlen 1000



 Also, I think that code running in a namespace has no business even
 knowing a unique identity of that namespace from the perspective of
 the host.

 Another scenario is when you have virtual network devices across two netns. 
 You
 need to identify the peer netns to have a netlink message which is fully 
 interpretable by the userspace.

Let me try again, with emphasis in the right place.

I think that *code running in a namespace* has no business even
knowing a unique identity of *that namespace* from the perspective of
the host.

In your example, if there's a veth device between netns A and netns B,
then code *in netns A* has no business knowing the identity of its
veth peer if its peer (B) is a sibling or ancestor.  It also IMO has
no business knowing the identity of its own netns (A) other than as
my netns.

If A and B are siblings, then their parent needs to know where that
veth device goes, but I think this is already the case to a sufficient
extent today.

I feel like this discussion is falling into a common trap of new API
discussions.  Can one of you who wants this API please articulate,
with a reasonably precise example, what it is that you want to do, why
you can't easily do it already, and how this API helps?  I currently
understand how the API creates problems, but I don't understand how it
solves any problems, and I will NAK it (and I suspect that Eric will,
too, which is pretty much fatal) unless that changes.

Thanks,
Andy



 Regards,
 Nicolas

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc

2014-08-24 Thread Andy Lutomirski

On Thu, Aug 21, 2014 at 6:58 PM, Richard Guy Briggs r...@redhat.com wrote:
 On 14/08/21, Andy Lutomirski wrote:
 On Aug 20, 2014 8:12 PM, Richard Guy Briggs r...@redhat.com wrote:
  Expose the namespace instace serial numbers in the proc filesystem at
  /proc/pid/ns/ns_snum.  The link text gives the serial number in hex.

 What's the use case?

 I understand the utility of giving unique numbers to the audit code,
 but I don't think this part is necessary for that, and I'd like to
 understand what else will use this before committing to a duplicative
 API like this.

 How does a container manager get those numbers?  It could provoke a task
 to cause an audit event that emits a NS_INFO message, or it could run a
 task in that container to report its namespace serial numbers directly
 from its /proc mount.

Why does a container manager need them?  Is there any reason that
keeping them entirely contained within the audit system would be a
problem?


 The discussion in this thread touches on the use cases:
 https://lkml.org/lkml/2014/4/22/662

 Note that this API is thoroughly incompatible with CRIU.  If we do
 this, someone will ask for a namespace number namespace, and that way
 lies madness.

 I had a very brief look at CRIU, but not enough to understand the issue.
 Others have hinted at this problem.

 Do you have a suggestion of a different approach that would be
 compatible with CRIU?

 I'd originally considered some sort of UUID that would be globally
 unique, but that would be very hard to devise or guarantee, and besides,
 namespaces aren't only used by containers and could be shared in other
 ways.  Tracking the usage and migration of namespaces should be the task
 of an upper layer.


CRIU wants to save the complete state of a namespace and then restore
it.  For that to work, any information exposed to things in the
namespace *cannot* be globally unique or unique per boot, since CRIU
needs to arrange for that information to match whatever it was when
CRIU saved it.

Also, I think that code running in a namespace has no business even
knowing a unique identity of that namespace from the perspective of
the host.

Here's a specific use case for *not* exposing this: Tor.  Ideally, Tor
clients would run in a namespace that does not know about any global
identity.  That means no IP addresses, but it also means no global
namespace serial numbers.

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH V4 1/8] namespaces: assign each namespace instance a serial number

2014-08-21 Thread Andy Lutomirski

On Aug 20, 2014 8:12 PM, Richard Guy Briggs r...@redhat.com wrote:

 Generate and assign a serial number per namespace instance since boot.

 Use a serial number per namespace (unique across one boot of one kernel)
 instead of the inode number (which is claimed to have had the right to change
 reserved and is not necessarily unique if there is more than one proc fs) to
 uniquely identify it per kernel boot.

 Signed-off-by: Richard Guy Briggs r...@redhat.com
 ---
  fs/mount.h |1 +
  fs/namespace.c |1 +
  include/linux/ipc_namespace.h  |1 +
  include/linux/nsproxy.h|8 
  include/linux/pid_namespace.h  |1 +
  include/linux/user_namespace.h |1 +
  include/linux/utsname.h|1 +
  include/net/net_namespace.h|1 +
  init/version.c |1 +
  ipc/msgutil.c  |1 +
  ipc/namespace.c|2 ++
  kernel/nsproxy.c   |   17 +
  kernel/pid.c   |1 +
  kernel/pid_namespace.c |2 ++
  kernel/user.c  |1 +
  kernel/user_namespace.c|2 ++
  kernel/utsname.c   |2 ++
  net/core/net_namespace.c   |8 +++-
  18 files changed, 51 insertions(+), 1 deletions(-)

 diff --git a/fs/mount.h b/fs/mount.h
 index d55297f..c076f99 100644
 --- a/fs/mount.h
 +++ b/fs/mount.h
 @@ -5,6 +5,7 @@
  struct mnt_namespace {
 atomic_tcount;
 unsigned intproc_inum;
 +   long long   serial_num;
 struct mount *  root;
 struct list_headlist;
 struct user_namespace   *user_ns;
 diff --git a/fs/namespace.c b/fs/namespace.c
 index 182bc41..9af49ff 100644
 --- a/fs/namespace.c
 +++ b/fs/namespace.c
 @@ -2486,6 +2486,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct 
 user_namespace *user_ns)
 kfree(new_ns);
 return ERR_PTR(ret);
 }
 +   new_ns-serial_num = ns_serial();
 new_ns-seq = atomic64_add_return(1, mnt_ns_seq);
 atomic_set(new_ns-count, 1);
 new_ns-root = NULL;
 diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
 index 35e7eca..8ccfb2d 100644
 --- a/include/linux/ipc_namespace.h
 +++ b/include/linux/ipc_namespace.h
 @@ -69,6 +69,7 @@ struct ipc_namespace {
 struct user_namespace *user_ns;

 unsigned intproc_inum;
 +   long long   serial_num;
  };

  extern struct ipc_namespace init_ipc_ns;
 diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
 index b4ec59d..8e5fe0d 100644
 --- a/include/linux/nsproxy.h
 +++ b/include/linux/nsproxy.h
 @@ -66,6 +66,14 @@ static inline struct nsproxy *task_nsproxy(struct 
 task_struct *tsk)
 return rcu_dereference(tsk-nsproxy);
  }

 +long long ns_serial(void);
 +enum {
 +   NS_IPC_INIT_SN  = 1,
 +   NS_UTS_INIT_SN  = 2,
 +   NS_USER_INIT_SN = 3,
 +   NS_PID_INIT_SN  = 4,

Please add an extra value here for the first dynamic value...
 +/**
 + * ns_serial - compute a serial number for the namespace
 + *
 + * Compute a serial number for the namespace to uniquely identify it in
 + * audit records.
 + */
 +long long ns_serial(void)
 +{
 +   static atomic64_t serial = ATOMIC_INIT(4); /* reserved for IPC, UTS, 
 user, PID */

...and use it here.

Also, does this work on all architectures?


--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH V4 3/8] namespaces: expose ns instance serial numbers in proc

2014-08-21 Thread Andy Lutomirski

On Aug 20, 2014 8:12 PM, Richard Guy Briggs r...@redhat.com wrote:

 Expose the namespace instace serial numbers in the proc filesystem at
 /proc/pid/ns/ns_snum.  The link text gives the serial number in hex.

What's the use case?

I understand the utility of giving unique numbers to the audit code,
but I don't think this part is necessary for that, and I'd like to
understand what else will use this before committing to a duplicative
API like this.

Note that this API is thoroughly incompatible with CRIU.  If we do
this, someone will ask for a namespace number namespace, and that way
lies madness.

--Andy


 snum was chosen instead of seq for consistency with inum and there are a
 number of other uses of seq in the namespace code.

 Suggested-by: Serge E. Hallyn se...@hallyn.com
 Signed-off-by: Richard Guy Briggs r...@redhat.com
 ---
  fs/proc/namespaces.c |   33 +
  1 files changed, 25 insertions(+), 8 deletions(-)

 diff --git a/fs/proc/namespaces.c b/fs/proc/namespaces.c
 index 8902609..e953e0a 100644
 --- a/fs/proc/namespaces.c
 +++ b/fs/proc/namespaces.c
 @@ -47,12 +47,15 @@ static char *ns_dname(struct dentry *dentry, char 
 *buffer, int buflen)
 struct inode *inode = dentry-d_inode;
 const struct proc_ns_operations *ns_ops = PROC_I(inode)-ns.ns_ops;

 -   return dynamic_dname(dentry, buffer, buflen, %s:[%lu],
 -   ns_ops-name, inode-i_ino);
 +   if (strstr(dentry-d_iname, _snum))
 +   return dynamic_dname(dentry, buffer, buflen, %s_snum:[%llx],
 +   ns_ops-name, ns_ops-snum(PROC_I(inode)-ns.ns));
 +   else
 +   return dynamic_dname(dentry, buffer, buflen, %s:[%lu],
 +   ns_ops-name, inode-i_ino);
  }

 -const struct dentry_operations ns_dentry_operations =
 -{
 +const struct dentry_operations ns_dentry_operations = {
 .d_delete   = always_delete_dentry,
 .d_dname= ns_dname,
  };
 @@ -160,7 +163,10 @@ static int proc_ns_readlink(struct dentry *dentry, char 
 __user *buffer, int bufl
 if (!ns)
 goto out_put_task;

 -   snprintf(name, sizeof(name), %s:[%u], ns_ops-name, 
 ns_ops-inum(ns));
 +   if (strstr(dentry-d_iname, _snum))
 +   snprintf(name, sizeof(name), %s_snum:[%llx], ns_ops-name, 
 ns_ops-snum(ns));
 +   else
 +   snprintf(name, sizeof(name), %s:[%u], ns_ops-name, 
 ns_ops-inum(ns));
 res = readlink_copy(buffer, buflen, name);
 ns_ops-put(ns);
  out_put_task:
 @@ -210,16 +216,23 @@ static int proc_ns_dir_readdir(struct file *file, 
 struct dir_context *ctx)

 if (!dir_emit_dots(file, ctx))
 goto out;
 -   if (ctx-pos = 2 + ARRAY_SIZE(ns_entries))
 +   if (ctx-pos = 2 + 2 * ARRAY_SIZE(ns_entries))
 goto out;
 entry = ns_entries + (ctx-pos - 2);
 last = ns_entries[ARRAY_SIZE(ns_entries) - 1];
 while (entry = last) {
 const struct proc_ns_operations *ops = *entry;
 +   char name[50];
 +
 if (!proc_fill_cache(file, ctx, ops-name, strlen(ops-name),
  proc_ns_instantiate, task, ops))
 break;
 ctx-pos++;
 +   snprintf(name, sizeof(name), %s_snum, ops-name);
 +   if (!proc_fill_cache(file, ctx, name, strlen(name),
 +proc_ns_instantiate, task, ops))
 +   break;
 +   ctx-pos++;
 entry++;
 }
  out:
 @@ -247,9 +260,13 @@ static struct dentry *proc_ns_dir_lookup(struct inode 
 *dir,

 last = ns_entries[ARRAY_SIZE(ns_entries)];
 for (entry = ns_entries; entry  last; entry++) {
 -   if (strlen((*entry)-name) != len)
 +   char name[50];
 +
 +   snprintf(name, sizeof(name), %s_snum, (*entry)-name);
 +   if (strlen((*entry)-name) != len  strlen(name) != len)
 continue;
 -   if (!memcmp(dentry-d_name.name, (*entry)-name, len))
 +   if (!memcmp(dentry-d_name.name, (*entry)-name, len)
 +   || !memcmp(dentry-d_name.name, name, len))
 break;
 }
 if (entry == last)
 --
 1.7.1

 --
 To unsubscribe from this list: send the line unsubscribe linux-api in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH V4 1/8] namespaces: assign each namespace instance a serial number

2014-08-21 Thread Andy Lutomirski

On Thu, Aug 21, 2014 at 2:28 PM, Richard Guy Briggs r...@redhat.com wrote:
 On 14/08/21, Andy Lutomirski wrote:
 On Aug 20, 2014 8:12 PM, Richard Guy Briggs r...@redhat.com wrote:
 
  Generate and assign a serial number per namespace instance since boot.
 
  Use a serial number per namespace (unique across one boot of one kernel)
  instead of the inode number (which is claimed to have had the right to 
  change
  reserved and is not necessarily unique if there is more than one proc fs) 
  to
  uniquely identify it per kernel boot.
 
  Signed-off-by: Richard Guy Briggs r...@redhat.com
  ---
   fs/mount.h |1 +
   fs/namespace.c |1 +
   include/linux/ipc_namespace.h  |1 +
   include/linux/nsproxy.h|8 
   include/linux/pid_namespace.h  |1 +
   include/linux/user_namespace.h |1 +
   include/linux/utsname.h|1 +
   include/net/net_namespace.h|1 +
   init/version.c |1 +
   ipc/msgutil.c  |1 +
   ipc/namespace.c|2 ++
   kernel/nsproxy.c   |   17 +
   kernel/pid.c   |1 +
   kernel/pid_namespace.c |2 ++
   kernel/user.c  |1 +
   kernel/user_namespace.c|2 ++
   kernel/utsname.c   |2 ++
   net/core/net_namespace.c   |8 +++-
   18 files changed, 51 insertions(+), 1 deletions(-)
 
  diff --git a/fs/mount.h b/fs/mount.h
  index d55297f..c076f99 100644
  --- a/fs/mount.h
  +++ b/fs/mount.h
  @@ -5,6 +5,7 @@
   struct mnt_namespace {
  atomic_tcount;
  unsigned intproc_inum;
  +   long long   serial_num;
  struct mount *  root;
  struct list_headlist;
  struct user_namespace   *user_ns;
  diff --git a/fs/namespace.c b/fs/namespace.c
  index 182bc41..9af49ff 100644
  --- a/fs/namespace.c
  +++ b/fs/namespace.c
  @@ -2486,6 +2486,7 @@ static struct mnt_namespace *alloc_mnt_ns(struct 
  user_namespace *user_ns)
  kfree(new_ns);
  return ERR_PTR(ret);
  }
  +   new_ns-serial_num = ns_serial();
  new_ns-seq = atomic64_add_return(1, mnt_ns_seq);
  atomic_set(new_ns-count, 1);
  new_ns-root = NULL;
  diff --git a/include/linux/ipc_namespace.h b/include/linux/ipc_namespace.h
  index 35e7eca..8ccfb2d 100644
  --- a/include/linux/ipc_namespace.h
  +++ b/include/linux/ipc_namespace.h
  @@ -69,6 +69,7 @@ struct ipc_namespace {
  struct user_namespace *user_ns;
 
  unsigned intproc_inum;
  +   long long   serial_num;
   };
 
   extern struct ipc_namespace init_ipc_ns;
  diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
  index b4ec59d..8e5fe0d 100644
  --- a/include/linux/nsproxy.h
  +++ b/include/linux/nsproxy.h
  @@ -66,6 +66,14 @@ static inline struct nsproxy *task_nsproxy(struct 
  task_struct *tsk)
  return rcu_dereference(tsk-nsproxy);
   }
 
  +long long ns_serial(void);
  +enum {
  +   NS_IPC_INIT_SN  = 1,
  +   NS_UTS_INIT_SN  = 2,
  +   NS_USER_INIT_SN = 3,
  +   NS_PID_INIT_SN  = 4,

 Please add an extra value here for the first dynamic value...
  +/**
  + * ns_serial - compute a serial number for the namespace
  + *
  + * Compute a serial number for the namespace to uniquely identify it in
  + * audit records.
  + */
  +long long ns_serial(void)
  +{
  +   static atomic64_t serial = ATOMIC_INIT(4); /* reserved for IPC, 
  UTS, user, PID */

 ...and use it here.

 Yup, good idea.  Thanks.

 Also, does this work on all architectures?

 I only know for certain x86_64.  There is discussion elsewhere about
 changing ns_serial from returning a long long to returning a u64.  That
 should help.  I can run standard compile and boot tests on several
 others, but won't be able to check output.

I was thinking of atomic64.  I guess there's a generic implementation
that works everywhere, if somewhat slowly.

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Fwd: 3.15: kernel BUG at kernel/auditsc.c:1525!

2014-06-20 Thread Andy Lutomirski

Steve Grubb pointed out that I forgot to cc: linux-audit.

--Andy

-- Forwarded message --
From: Andy Lutomirski l...@amacapital.net
Date: Mon, Jun 16, 2014 at 2:35 PM
Subject: Re: 3.15: kernel BUG at kernel/auditsc.c:1525!
To: Richard Weinberger rich...@nod.at, H. Peter Anvin
h...@zytor.com, X86 ML x...@kernel.org
Cc: Toralf Förster toralf.foers...@gmx.de, Eric Paris
epa...@redhat.com, Linux Kernel linux-ker...@vger.kernel.org

[cc: hpa, x86 list]

On Mon, Jun 16, 2014 at 1:43 PM, Richard Weinberger rich...@nod.at wrote:
 Am 16.06.2014 22:41, schrieb Toralf Förster:
 Well, might be the mail:subject should be adapted, b/c the issue can be 
 triggered in a 3.13.11 kernel too.
 Unfortunately it does not appear within an UML guest, therefore an automated 
 bisecting isn't possible I fear.

 You could try KVM. :)

Before you do that, just to clarify:

What bitness is your kernel?  That is, are you on 32-bit or 64-bit kernel?

What bitness is your test case?  'file a.out' will say.

What does /proc/cpuinfo say in flags?

Can you try the attached patch?  It's only compile-tested.

To hpa, etc:  It appears that entry_32.S is missing any call to the
audit exit hook on the badsys path.  If I'm diagnosing this bug report
correctly, this causes OOPSes.

The the world at large: it's increasingly apparent that no one (except
maybe the blackhats) has ever scrutinized the syscall auditing code.
This is two old severe bugs in the code that have probably been there
for a long time.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC
From 8b43bd2118d876cb3163e8f7d9cd8253da649335 Mon Sep 17 00:00:00 2001
Message-Id: 8b43bd2118d876cb3163e8f7d9cd8253da649335.1402954406.git.l...@amacapital.net
From: Andy Lutomirski l...@amacapital.net
Date: Mon, 16 Jun 2014 14:28:19 -0700
Subject: [PATCH] x86_32,entry: Fix badsys paths
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The bad syscall nr paths are their own incomprehensible route
through the entry control flow.  Rearrange them to work just like
syscalls that return -ENOSYS.

This should fix an OOPS in the audit code when auditing is enabled
and bad syscall nrs are used.

Reported-by: Toralf Förster toralf.foers...@gmx.de
Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 arch/x86/kernel/entry_32.S | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 98313ff..eb6e07e 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -431,9 +431,10 @@ sysenter_past_esp:
 	jnz sysenter_audit
 sysenter_do_call:
 	cmpl $(NR_syscalls), %eax
-	jae syscall_badsys
+	jae sysenter_badsys
 	call *sys_call_table(,%eax,4)
 	movl %eax,PT_EAX(%esp)
+sysenter_after_call:
 	LOCKDEP_SYS_EXIT
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	TRACE_IRQS_OFF
@@ -687,7 +688,12 @@ END(syscall_fault)

 syscall_badsys:
 	movl $-ENOSYS,PT_EAX(%esp)
-	jmp resume_userspace
+	jmp syscall_exit
+END(syscall_badsys)
+
+sysenter_badsys:
+	movl $-ENOSYS,PT_EAX(%esp)
+	jmp sysenter_after_call
 END(syscall_badsys)
 	CFI_ENDPROC
 /*
-- 
1.9.3

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH 1/2] auditsc: audit_krule mask accesses need bounds checking

2014-06-09 Thread Andy Lutomirski

On Mon, Jun 9, 2014 at 3:30 PM, Greg KH gre...@linuxfoundation.org wrote:
 On Wed, May 28, 2014 at 11:09:58PM -0400, Eric Paris wrote:
 From: Andy Lutomirski l...@amacapital.net

 Fixes an easy DoS and possible information disclosure.

 This does nothing about the broken state of x32 auditing.

 eparis: If the admin has enabled auditd and has specifically loaded audit
 rules.  This bug has been around since before git.  Wow...

 Cc: sta...@vger.kernel.org
 Signed-off-by: Andy Lutomirski l...@amacapital.net
 Signed-off-by: Eric Paris epa...@redhat.com
 ---
  kernel/auditsc.c | 27 ++-
  1 file changed, 18 insertions(+), 9 deletions(-)

 Did this patch get dropped somewhere?  Isn't it a valid bugfix, or did I
 miss a later conversation about this?

Hmm.  It seems that it didn't make it into Linus' tree.  Crap.

IMO we need some kind of real tracking system for issues reported to
security@.  This shouldn't have been possible (and if I'd realized
that the patch got dropped, I wouldn't have publicly disclosed it).

For whoever applies this: it's CVE-2014-3917.

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH 1/2] auditsc: audit_krule mask accesses need bounds checking

2014-06-09 Thread Andy Lutomirski

On Mon, Jun 9, 2014 at 3:53 PM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Mon, Jun 9, 2014 at 3:35 PM, Andy Lutomirski l...@amacapital.net wrote:

 Hmm.  It seems that it didn't make it into Linus' tree.  Crap.

 I assume that if there is a maintainer who normally sends me stuff by
 git, when I see patches in emails they are just informational
 heads-ups about stuff that is being discussed or pending, and that
 I'll see it later in a pull request. So I just ignore them unless I
 have specific comments, since clearly the emailed patch is just
 informational and/or for comments/acks from others.

 The exception is unless it *VERY CLEARLY* says otherwise (as in
 Linus, can you please take this directly due to xyz).

 Because why would somebody send me a patch series sometimes, and git
 trees at other times? That would just be stupid.

In this particular case, it's my patch, and I've never sent you a pull
request.  I sort of assumed that secur...@kernel.org magically caused
acknowledged fixes to end up in your tree.  I'm not sure what I'm
supposed to do here.

Maybe the confusion is because Eric resent the patch?

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

[GIT PULL] CVE-2014-3917

2014-06-09 Thread Andy Lutomirski

[This is my first pull request.  I may be doing any number of things wrong.]

Hi Linus,

The following changes since commit f2159d1e99612ceb94bf9a2dc2fbca409d828b1b:

  Merge tag 'sound-3.15-rc8' of
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound (2014-05-28
11:17:41 -0700)

are available in the git repository at:


  git://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git CVE-2014-3917

for you to fetch changes up to 9fbb05647f3aea17e4e01393bbc42f64ee307409:

  auditsc: audit_krule mask accesses need bounds checking (2014-06-09
16:06:40 -0700)


Andy Lutomirski (1):
  auditsc: audit_krule mask accesses need bounds checking

 kernel/auditsc.c | 27 ++-
 1 file changed, 18 insertions(+), 9 deletions(-)

NB: This is exactly the same patch that's been on the list, except
that I added the CVE number to the description.

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH 1/2] auditsc: audit_krule mask accesses need bounds checking

2014-06-09 Thread Andy Lutomirski

On Mon, Jun 9, 2014 at 3:46 PM, Greg KH gre...@linuxfoundation.org wrote:
 On Mon, Jun 09, 2014 at 03:35:02PM -0700, Andy Lutomirski wrote:
 On Mon, Jun 9, 2014 at 3:30 PM, Greg KH gre...@linuxfoundation.org wrote:
  On Wed, May 28, 2014 at 11:09:58PM -0400, Eric Paris wrote:
  From: Andy Lutomirski l...@amacapital.net
 
  Fixes an easy DoS and possible information disclosure.
 
  This does nothing about the broken state of x32 auditing.
 
  eparis: If the admin has enabled auditd and has specifically loaded audit
  rules.  This bug has been around since before git.  Wow...
 
  Cc: sta...@vger.kernel.org
  Signed-off-by: Andy Lutomirski l...@amacapital.net
  Signed-off-by: Eric Paris epa...@redhat.com
  ---
   kernel/auditsc.c | 27 ++-
   1 file changed, 18 insertions(+), 9 deletions(-)
 
  Did this patch get dropped somewhere?  Isn't it a valid bugfix, or did I
  miss a later conversation about this?

 Hmm.  It seems that it didn't make it into Linus' tree.  Crap.

 IMO we need some kind of real tracking system for issues reported to
 security@.

 That seems to be my mbox at times :)

 But yes, having something real might be good if the load gets higher,
 right now it's so low that my sweep pending security patches task
 usually catches anything pending, which is rare.


There are currently at least two issues that I reported that are stuck
in limbo: this one and the (not-yet-public) vfs thing.  And there's
the CVE-2014-0181 regression fix that almost got forgotten, but that
isn't really a security issue.

And I can't read your mbox :-/

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH 1/2] auditsc: audit_krule mask accesses need bounds checking

2014-06-09 Thread Andy Lutomirski

On Mon, Jun 9, 2014 at 5:32 PM, Greg KH gre...@linuxfoundation.org wrote:
 On Mon, Jun 09, 2014 at 03:55:20PM -0700, Andy Lutomirski wrote:
 On Mon, Jun 9, 2014 at 3:46 PM, Greg KH gre...@linuxfoundation.org wrote:
  On Mon, Jun 09, 2014 at 03:35:02PM -0700, Andy Lutomirski wrote:
  On Mon, Jun 9, 2014 at 3:30 PM, Greg KH gre...@linuxfoundation.org 
  wrote:
   On Wed, May 28, 2014 at 11:09:58PM -0400, Eric Paris wrote:
   From: Andy Lutomirski l...@amacapital.net
  
   Fixes an easy DoS and possible information disclosure.
  
   This does nothing about the broken state of x32 auditing.
  
   eparis: If the admin has enabled auditd and has specifically loaded 
   audit
   rules.  This bug has been around since before git.  Wow...
  
   Cc: sta...@vger.kernel.org
   Signed-off-by: Andy Lutomirski l...@amacapital.net
   Signed-off-by: Eric Paris epa...@redhat.com
   ---
kernel/auditsc.c | 27 ++-
1 file changed, 18 insertions(+), 9 deletions(-)
  
   Did this patch get dropped somewhere?  Isn't it a valid bugfix, or did I
   miss a later conversation about this?
 
  Hmm.  It seems that it didn't make it into Linus' tree.  Crap.
 
  IMO we need some kind of real tracking system for issues reported to
  security@.
 
  That seems to be my mbox at times :)
 
  But yes, having something real might be good if the load gets higher,
  right now it's so low that my sweep pending security patches task
  usually catches anything pending, which is rare.
 

 There are currently at least two issues that I reported that are stuck
 in limbo: this one and the (not-yet-public) vfs thing.

 That was next on my list to poke people about...

 And there's the CVE-2014-0181 regression fix that almost got
 forgotten, but that isn't really a security issue.

 What is that, where was that reported?

commit 2d7a85f4b06e9c27ff629f07a524c48074f07f81
Author: Eric W. Biederman ebied...@xmission.com
Date:   Fri May 30 11:04:00 2014 -0700

netlink: Only check file credentials for implicit destinations


The security issue got fixed quickly, but the fix turned out to be problematic.

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

[PATCH 2/2] audit: Syscall auditing lite

2014-06-02 Thread Andy Lutomirski

AFAICS the main use of syscall auditing is to get syscall
information for syscalls that are already causing another audit
message.

We don't need any of the fancy syscall auditing machinery for that,
though: we can just log this information directly.  This should have
essentially no overhead and it could end up being much easier to use
than auditsc.

This produces messages like this:

audit: type=1123 audit(1401485315.370:2): pid=125 uid=0 auid=4294967295 
ses=4294967295 subj=kernel msg='blah blah blah' arch=c03e syscall=44 a0=3 
a1=7fff383feb60 a2=5c a3=0 a4=7fff383feb50 a5=c

The new fields (arch, syscall, and a0..a5) will only be logged if we
are in a syscall but we aren't otherwise building an auditsc context.

This is only supported on x86 for now.  Other architectures can get
this if they implement syscall_in_syscall.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 kernel/audit.c | 44 +++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/kernel/audit.c b/kernel/audit.c
index 47845c5..8509d00 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -67,6 +67,10 @@
 #include linux/pid_namespace.h
 #include net/netns/generic.h
 
+#ifdef CONFIG_HAVE_SYSCALL_IN_SYSCALL
+#include asm/syscall.h
+#endif
+
 #include audit.h
 
 /* No auditing will take place until audit_initialized == AUDIT_INITIALIZED.
@@ -1897,6 +1901,40 @@ out:
kfree(name);
 }
 
+#ifdef CONFIG_HAVE_SYSCALL_IN_SYSCALL
+/**
+ * audit_log_missing_context - append otherwise-missing context
+ * @ab: the audit_buffer
+ *
+ * If syscall auditing is unavailable, try to log syscall context
+ * information anyway.
+ */
+static void audit_log_missing_context(struct audit_buffer *ab)
+{
+   struct task_struct *tsk = current;
+   struct pt_regs *regs = current_pt_regs();
+   unsigned long args[6];
+
+   if (!syscall_in_syscall(tsk, regs))
+   return;
+
+   if (ab-ctx  ab-ctx-in_syscall)
+   return;  /* Let audit_log_exit log the context. */
+
+   syscall_get_arguments(tsk, regs, 0, 6, args);
+
+   audit_log_format(ab,  arch=%x syscall=%d a0=%lx a1=%lx a2=%lx a3=%lx 
a4=%lx a5=%lx,
+(unsigned int)syscall_get_arch(),
+syscall_get_nr(tsk, regs),
+args[0], args[1], args[2], args[3], args[4], args[5]);
+}
+#else
+static void audit_log_missing_context(struct audit_buffer *ab)
+{
+   /* We need arch support to do this reliably, so don't even try. */
+}
+#endif
+
 /**
  * audit_log_end - end one audit record
  * @ab: the audit_buffer
@@ -1913,7 +1951,11 @@ void audit_log_end(struct audit_buffer *ab)
if (!audit_rate_check()) {
audit_log_lost(rate limit exceeded);
} else {
-   struct nlmsghdr *nlh = nlmsg_hdr(ab-skb);
+   struct nlmsghdr *nlh;
+
+   audit_log_missing_context(ab);
+
+   nlh = nlmsg_hdr(ab-skb);
nlh-nlmsg_len = ab-skb-len - NLMSG_HDRLEN;
 
if (audit_pid) {
-- 
1.9.3

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

[PATCH 0/2] Syscall auditing lite

2014-06-02 Thread Andy Lutomirski

I've made no secret of the fact that I dislike syscall auditing.  As far
as I can tell, the main technical (i.e. not compliance-related) use of
syscall auditing is to supply some useful context information to go
along with events like AVC denials.

CONFIG_AUDITSYSCALL is serious overkill to do this.  kernel/auditsc.c is
~2500 lines of terror.

This patchset accomplishes the same goal, more usefully, with no
overhead at all, in under 70 lines of code.  It tries to coexist cleanly
with CONFIG_AUDITSYSCALL.

This is only implemented for x86.  Other architectures can add support
fairly easily, I think.

Andy Lutomirski (2):
  x86,syscall: Add syscall_in_syscall to test whether we're in a syscall
  audit: Syscall auditing lite

 arch/x86/Kconfig   |  1 +
 arch/x86/include/asm/syscall.h | 21 
 init/Kconfig   |  3 +++
 kernel/audit.c | 44 +-
 4 files changed, 68 insertions(+), 1 deletion(-)

-- 
1.9.3

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

[PATCH 1/2] x86, syscall: Add syscall_in_syscall to test whether we're in a syscall

2014-06-02 Thread Andy Lutomirski

syscall_in_syscall will return true if we're in a real syscall and
will return false if we're not in a syscall.  If we're in a bad
syscall, the return value can vary.

The idea is to use this to come up with a much simpler replacement
for syscall auditing.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 arch/x86/Kconfig   |  1 +
 arch/x86/include/asm/syscall.h | 21 +
 init/Kconfig   |  3 +++
 3 files changed, 25 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 25d2c6f..e2602d4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -130,6 +130,7 @@ config X86
select HAVE_CC_STACKPROTECTOR
select GENERIC_CPU_AUTOPROBE
select HAVE_ARCH_AUDITSYSCALL
+   select HAVE_SYSCALL_IN_SYSCALL
 
 config INSTRUCTION_DECODER
def_bool y
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index d6a756a..91e38b3 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -23,6 +23,27 @@
 typedef void (*sys_call_ptr_t)(void);
 extern const sys_call_ptr_t sys_call_table[];
 
+/**
+ * syscall_in_syscall() - are we in a syscall context?
+ * @task: The task to query.
+ * @regs: The task's pt_regs.
+ *
+ * This checks whether we are in a syscall.  If it returns true, then
+ * syscall_get_nr(), etc are usable and the current task is guaranteed
+ * to either die or to go through the syscall exit path when the syscall
+ * is done.
+ *
+ * If it returns false, no particular guarantees are made.  In
+ * particular, a malicious task can issue a syscall that causes
+ * syscall_in_syscall to return false.  Such a syscall won't do much,
+ * but it can still cause tracing code and such to run.
+ */
+static inline bool syscall_in_syscall(struct task_struct *task,
+ struct pt_regs *regs)
+{
+   return regs-orig_ax != -1;
+}
+
 /*
  * Only the low 32 bits of orig_ax are meaningful, so we return int.
  * This importantly ignores the high bits on 64-bit, so comparisons
diff --git a/init/Kconfig b/init/Kconfig
index 9d3585b..bad2053 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -295,6 +295,9 @@ config AUDIT
 config HAVE_ARCH_AUDITSYSCALL
bool
 
+config HAVE_SYSCALL_IN_SYSCALL
+   bool
+
 config AUDITSYSCALL
bool Enable system-call auditing support
depends on AUDIT  HAVE_ARCH_AUDITSYSCALL
-- 
1.9.3

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH 1/2] x86,syscall: Add syscall_in_syscall to test whether we're in a syscall

2014-06-02 Thread Andy Lutomirski

On May 30, 2014 2:58 PM, Andy Lutomirski l...@amacapital.net wrote:

 syscall_in_syscall will return true if we're in a real syscall and
 will return false if we're not in a syscall.  If we're in a bad
 syscall, the return value can vary.

 The idea is to use this to come up with a much simpler replacement
 for syscall auditing.

 Signed-off-by: Andy Lutomirski l...@amacapital.net
 ---
  arch/x86/Kconfig   |  1 +
  arch/x86/include/asm/syscall.h | 21 +
  init/Kconfig   |  3 +++
  3 files changed, 25 insertions(+)

 diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
 index 25d2c6f..e2602d4 100644
 --- a/arch/x86/Kconfig
 +++ b/arch/x86/Kconfig
 @@ -130,6 +130,7 @@ config X86
 select HAVE_CC_STACKPROTECTOR
 select GENERIC_CPU_AUTOPROBE
 select HAVE_ARCH_AUDITSYSCALL
 +   select HAVE_SYSCALL_IN_SYSCALL

  config INSTRUCTION_DECODER
 def_bool y
 diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
 index d6a756a..91e38b3 100644
 --- a/arch/x86/include/asm/syscall.h
 +++ b/arch/x86/include/asm/syscall.h
 @@ -23,6 +23,27 @@
  typedef void (*sys_call_ptr_t)(void);
  extern const sys_call_ptr_t sys_call_table[];

 +/**
 + * syscall_in_syscall() - are we in a syscall context?
 + * @task: The task to query.
 + * @regs: The task's pt_regs.
 + *
 + * This checks whether we are in a syscall.  If it returns true, then
 + * syscall_get_nr(), etc are usable and the current task is guaranteed
 + * to either die or to go through the syscall exit path when the syscall
 + * is done.
 + *
 + * If it returns false, no particular guarantees are made.  In
 + * particular, a malicious task can issue a syscall that causes
 + * syscall_in_syscall to return false.  Such a syscall won't do much,
 + * but it can still cause tracing code and such to run.
 + */
 +static inline bool syscall_in_syscall(struct task_struct *task,
 + struct pt_regs *regs)
 +{
 +   return regs-orig_ax != -1;

This is insufficient: anything that interrupts an errorentry too early
will incorrectly return true.  Also, the actual IRQ entries seem to
shove the IRQ number into orig_ax.

I'll send a new version.

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH v2 2/2] audit: Mark CONFIG_AUDITSYSCALL BROKEN and update help text

2014-05-29 Thread Andy Lutomirski

On Wed, May 28, 2014 at 7:54 PM, Eric Paris epa...@redhat.com wrote:
 On Wed, 2014-05-28 at 19:40 -0700, Andy Lutomirski wrote:
 On Wed, May 28, 2014 at 7:09 PM, Eric Paris epa...@redhat.com wrote:
  NAK
 
  On Wed, 2014-05-28 at 18:44 -0700, Andy Lutomirski wrote:
  Here are some issues with the code:
   - It thinks that syscalls have four arguments.
 
  Not true at all.  It records the registers that would hold the first 4
  entries on syscall entry, for use later if needed, as getting those
  later on some arches is not feasible (see ia64).  It makes no assumption
  about how many syscalls a function has.

 What about a5 and a6?

 On the couple of syscalls where a5 and a6 had any state that was
 actually wanted by someone (mainly just the fd on mmap) audit collects
 it later in the actual syscall.

   - It assumes that syscall numbers are between 0 and 2048.
 
  There could well be a bug here.  Not questioning that.  Although that
  would be patch 1/2

 Even with patch 1, it still doesn't handle large syscall numbers -- it
 just assumes they're not audited.

 That's because we haven't had large syscall numbers.  That's the whole
 point of an arch doing select HAVE_ARCH_AUDITSYSCALL.  If they don't
 meet the requirements, they shouldn't be selecting it

   - It's unclear whether it's supposed to be reliable.
 
  Unclear to whom?

 To me.

 If some inode access or selinux rule triggers an audit, is the auditsc
 code guaranteed to write an exit record?  And see below...

 This is an honest question:  Do you want to discuss these things, or
 would you be happier if I shut up, fix the bugs you found, and leave
 things be?  I don't want to have an argument, I'm happy to have a
 discussion if you think that will be beneficial...


I'm happy to discuss these things.  I'd be happy if you fix things.  I
think that there are two major issues affecting users of auditsc and
that either those issues should be fixed or users of auditsc should
understand and accept these issues.

Issue 1: syscall numbering.  The syscall filters are unlikely to work
well on lots of architectures.

Issue 2: performance.  Until someone fixes it (which is probably
hard), turning this thing on hurts a lot.  Oleg added a hack awhile
ago that lets you do 'auditctl -a task,never' to get rid of the
performance hit for new tasks, but that rather defeats the point.

The scariness of the code can be lived with, but holy cow it's scary.

I'm just not going to fix the issues myself.  I tried and gave up.

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

[PATCH v2 2/2] audit: Mark CONFIG_AUDITSYSCALL BROKEN and update help text

2014-05-29 Thread Andy Lutomirski

Here are some issues with the code:
 - It thinks that syscalls have four arguments.
 - It's a performance disaster.
 - It assumes that syscall numbers are between 0 and 2048.
 - It's unclear whether it's supposed to be reliable.
 - It's broken on things like x32.
 - It can't support ARM OABI.
 - Its approach to freeing memory is terrifying.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 init/Kconfig | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 9d3585b..24d4b53 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -296,13 +296,16 @@ config HAVE_ARCH_AUDITSYSCALL
bool
 
 config AUDITSYSCALL
-   bool Enable system-call auditing support
-   depends on AUDIT  HAVE_ARCH_AUDITSYSCALL
+   bool Enable system-call auditing support (not recommended)
+   depends on AUDIT  HAVE_ARCH_AUDITSYSCALL  BROKEN
default y if SECURITY_SELINUX
help
- Enable low-overhead system-call auditing infrastructure that
- can be used independently or with another kernel subsystem,
- such as SELinux.
+ Enable system-call auditing infrastructure that can be used
+ independently or with another kernel subsystem, such as
+ SELinux.
+
+ AUDITSYSCALL has serious performance and correctness issues.
+ Use it with extreme caution.
 
 config AUDIT_WATCH
def_bool y
-- 
1.9.3

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH v2 1/2] auditsc: audit_krule mask accesses need bounds checking

2014-05-29 Thread Andy Lutomirski

On Wed, May 28, 2014 at 7:43 PM, Eric Paris epa...@redhat.com wrote:
 On Wed, 2014-05-28 at 19:27 -0700, Andy Lutomirski wrote:
 On Wed, May 28, 2014 at 7:23 PM, Eric Paris epa...@redhat.com wrote:
  On Wed, 2014-05-28 at 18:44 -0700, Andy Lutomirski wrote:
  Fixes an easy DoS and possible information disclosure.
 
  This does nothing about the broken state of x32 auditing.
 
  Cc: sta...@vger.kernel.org
  Signed-off-by: Andy Lutomirski l...@amacapital.net
  ---
   kernel/auditsc.c | 27 ++-
   1 file changed, 18 insertions(+), 9 deletions(-)
 
  diff --git a/kernel/auditsc.c b/kernel/auditsc.c
  index f251a5e..7ccd9db 100644
  --- a/kernel/auditsc.c
  +++ b/kernel/auditsc.c
  @@ -728,6 +728,22 @@ static enum audit_state audit_filter_task(struct 
  task_struct *tsk, char **key)
return AUDIT_BUILD_CONTEXT;
   }
 
  +static bool audit_in_mask(const struct audit_krule *rule, unsigned long 
  val)
  +{
  + int word, bit;
  +
  + if (val  0x)
  + return false;
 
  Why is this necessary?

 To avoid an integer overflow.  Admittedly, this particular overflow
 won't cause a crash, but it will cause incorrect results.

 You know this code pre-dates git?  I admit, I'm shocked no one ever
 noticed it before!  This is ANCIENT.  And clearly broken.

 I'll likely ask Richard to add a WARN_ONCE() in both this place, and
 below in word  AUDIT_BITMASK_SIZE so we might know if we ever need a
 larger bitmask to store syscall numbers


Not there, please!  It would make sense to check when rules are
*added*, but any user can trivially invoke the filter with pretty much
any syscall nr.

 It'd be nice if lib had a efficient bitmask implementation...

 
  +
  + word = AUDIT_WORD(val);
  + if (word = AUDIT_BITMASK_SIZE)
  + return false;
 
  Since this covers it and it extensible...
 
  +
  + bit = AUDIT_BIT(val);
  +
  + return rule-mask[word]  bit;
 
  Returning an int as a bool creates worse code than just returning the
  int.  (although in this case if the compiler chooses to inline it might
  be smart enough not to actually convert this int to a bool and make
  worse assembly...)   I'd suggest the function here return an int.  bools
  usually make the assembly worse...

 I'm ambivalent.  The right assembly would use flags on x86, not an int
 or a bool, and I don't know what the compiler will do.

 Also, clearly X86_X32 was implemented without audit support, so we
 shouldn't config it in.  What do you think of this?

 diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
 index 25d2c6f..fa852e2 100644
 --- a/arch/x86/Kconfig
 +++ b/arch/x86/Kconfig
 @@ -129,7 +129,7 @@ config X86
 select HAVE_IRQ_EXIT_ON_IRQ_STACK if X86_64
 select HAVE_CC_STACKPROTECTOR
 select GENERIC_CPU_AUTOPROBE
 -   select HAVE_ARCH_AUDITSYSCALL
 +   select HAVE_ARCH_AUDITSYSCALL if !X86_X32

I'm fine with that, but I hope that distros will choose x32 over
auditsc, at least until someone makes enabling auditsc be less
intrusive.

Or someone could fix it, modulo the syscall nr 2048 thing.  x32
syscall nrs are *huge*.  So are lots of other architectures', a few of
which probably claim to support auditsc.

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

[PATCH v2 0/2] Fix auditsc DoS and mark it BROKEN

2014-05-29 Thread Andy Lutomirski

CONFIG_AUDITSYSCALL is awful.  Patch 2 enumerates some reasons.

Patch 1 fixes a nasty DoS and possible information leak.  It should
be applied and backported.

Patch 2 is optional.  I leave it to other peoples' judgment.

Andy Lutomirski (2):
  auditsc: audit_krule mask accesses need bounds checking
  audit: Move CONFIG_AUDITSYSCALL into staging and update help text

Andy Lutomirski (2):
  auditsc: audit_krule mask accesses need bounds checking
  audit: Mark CONFIG_AUDITSYSCALL BROKEN and update help text

 init/Kconfig | 13 -
 kernel/auditsc.c | 27 ++-
 2 files changed, 26 insertions(+), 14 deletions(-)

-- 
1.9.3

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH v2 1/2] auditsc: audit_krule mask accesses need bounds checking

2014-05-29 Thread Andy Lutomirski

On Wed, May 28, 2014 at 7:23 PM, Eric Paris epa...@redhat.com wrote:
 On Wed, 2014-05-28 at 18:44 -0700, Andy Lutomirski wrote:
 Fixes an easy DoS and possible information disclosure.

 This does nothing about the broken state of x32 auditing.

 Cc: sta...@vger.kernel.org
 Signed-off-by: Andy Lutomirski l...@amacapital.net
 ---
  kernel/auditsc.c | 27 ++-
  1 file changed, 18 insertions(+), 9 deletions(-)

 diff --git a/kernel/auditsc.c b/kernel/auditsc.c
 index f251a5e..7ccd9db 100644
 --- a/kernel/auditsc.c
 +++ b/kernel/auditsc.c
 @@ -728,6 +728,22 @@ static enum audit_state audit_filter_task(struct 
 task_struct *tsk, char **key)
   return AUDIT_BUILD_CONTEXT;
  }

 +static bool audit_in_mask(const struct audit_krule *rule, unsigned long val)
 +{
 + int word, bit;
 +
 + if (val  0x)
 + return false;

 Why is this necessary?

To avoid an integer overflow.  Admittedly, this particular overflow
won't cause a crash, but it will cause incorrect results.


 +
 + word = AUDIT_WORD(val);
 + if (word = AUDIT_BITMASK_SIZE)
 + return false;

 Since this covers it and it extensible...

 +
 + bit = AUDIT_BIT(val);
 +
 + return rule-mask[word]  bit;

 Returning an int as a bool creates worse code than just returning the
 int.  (although in this case if the compiler chooses to inline it might
 be smart enough not to actually convert this int to a bool and make
 worse assembly...)   I'd suggest the function here return an int.  bools
 usually make the assembly worse...

I'm ambivalent.  The right assembly would use flags on x86, not an int
or a bool, and I don't know what the compiler will do.

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH v2 2/2] audit: Mark CONFIG_AUDITSYSCALL BROKEN and update help text

2014-05-29 Thread Andy Lutomirski

On Wed, May 28, 2014 at 7:09 PM, Eric Paris epa...@redhat.com wrote:
 NAK

 On Wed, 2014-05-28 at 18:44 -0700, Andy Lutomirski wrote:
 Here are some issues with the code:
  - It thinks that syscalls have four arguments.

 Not true at all.  It records the registers that would hold the first 4
 entries on syscall entry, for use later if needed, as getting those
 later on some arches is not feasible (see ia64).  It makes no assumption
 about how many syscalls a function has.

What about a5 and a6?


  - It's a performance disaster.

 Only if you enable it.  If you don't use audit it is a single branch.
 Hardly a disaster.

It forces all syscalls into the slow path and it can do crazy things
like building audit contexts just in case actual handling of the
syscall triggers an audit condition so that the exit path can log the
syscall.  That's way worse than a single branch.

Try it: benchmark getpid on Fedora and then repeat the experiment with
syscall auditing fully disabled.  The difference is striking.


  - It assumes that syscall numbers are between 0 and 2048.

 There could well be a bug here.  Not questioning that.  Although that
 would be patch 1/2

Even with patch 1, it still doesn't handle large syscall numbers -- it
just assumes they're not audited.


  - It's unclear whether it's supposed to be reliable.

 Unclear to whom?

To me.

If some inode access or selinux rule triggers an audit, is the auditsc
code guaranteed to write an exit record?  And see below...


  - It's broken on things like x32.
  - It can't support ARM OABI.

 Some arches aren't supported?  And that makes it BROKEN?

It acts like x32 is supported.  OABI is fixable.  Admittedly, OABI not
being supported is fine, now that it's correctly disabled.


  - Its approach to freeing memory is terrifying.

 What?

 None of your reasons hold water.  Bugs need to be fixed.  Try reporting
 them...  This is just stupid.

...for reference, I've *tried* to fix the performance issues.  I've
run into all kinds of problems.

The actual context code is incredibly tangled.  It's unclear whether
it would be permissible for various rule combinations to suppress exit
records triggered by selinux.  Any effort to actually deal with this
stuff will have to deal with the fact that the audit code builds up
lots of state and frees it on syscall exit.  That means that the exit
hook needs to be actually invoked, otherwise we leak.  I was never
able to convince myself that the current code is correct wrt kernel
threads.

In summary, the code is a giant mess.  The way it works is nearly
incomprehensible.  It contains at least one severe bug.  I'd love to
see it fixed, but for now, distributions seem to think that enabling
CONFIG_AUDITSYSCALL is a reasonable thing to do, and I'd argue that
it's actually a terrible choice for anyone who doesn't actually need
syscall audit rules.  And I don't know who needs these things.

I've heard an argument that selinux benefits from this, since the
syscall exit log helps with diagnosing denials.  I think that's
absurd.  I've never found anything wrong with the denial record itself
that would be helped by seeing the syscall log.  (And there's always
syscall_get_xyz, which would cover this case splendidly with about an
order of magnitude less code.)

As I said, I'm not going to push hard for this code to be marked
BROKEN.  But I think it's rather broken, and I'm definitely not
volunteering to fix it.

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH v2 2/2] audit: Mark CONFIG_AUDITSYSCALL BROKEN and update help text

2014-05-29 Thread Andy Lutomirski

On Thu, May 29, 2014 at 6:05 AM, Steve Grubb sgr...@redhat.com wrote:
 On Wednesday, May 28, 2014 07:40:57 PM Andy Lutomirski wrote:
   - It assumes that syscall numbers are between 0 and 2048.
 
  There could well be a bug here.  Not questioning that.  Although that
  would be patch 1/2

 Even with patch 1, it still doesn't handle large syscall numbers -- it
 just assumes they're not audited.

 All syscalls must be auditable. Meaning that if an arch goes above 2048, then
 we'll need to do some math to get it to fall back within the range.


Off the top of my head, x32, some ARM variants, and some MIPS variants
have syscalls above 2048.  auditsc has been disabled on the offending
ARM variant (because I noticed that the same issue affects seccomp,
and no one particularly wants to fix it), but I suspect that auditsc
is enabled but completely broken on whichever MIPS ABI has the issue.
I haven't checked.

TBH, I think that it's silly to let the auditsc design be heavily
constrained by ia64 considerations -- I still think that the syscall
entry hooks could be removed completely with some effort on most
architectures and that performance would improve considerably.  For
users who don't have syscall filter rules, performance might even
improve on ia64 -- I don't care how expensive syscall_get_args is, the
actual printk/auditd thing will dominate in cases where anything is
written.

I wonder whether the syscall restart mechanism can be used to
thoroughly confused auditsc.  I don't really know how syscall restarts
work.

My point is: I don't know what guarantees auditsc is supposed to
provide, nor do I know how to evaluate whether the code is correct.
For users who aren't bound by Common Criteria or related things, I
suspect that syscall auditing (as opposed to, say, LSM-based auditing)
will provide dubious value at best.  Keep in mind that many syscalls
take pointer arguments, and the auditsc code can't really do anything
sensible with them.


   - It's unclear whether it's supposed to be reliable.
 
  Unclear to whom?

 To me.

 If some inode access or selinux rule triggers an audit, is the auditsc
 code guaranteed to write an exit record?  And see below...

 It should or indicate that it could not.


--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH v2 2/2] audit: Mark CONFIG_AUDITSYSCALL BROKEN and update help text

2014-05-29 Thread Andy Lutomirski

On Thu, May 29, 2014 at 9:25 AM, Steve Grubb sgr...@redhat.com wrote:
 On Thursday, May 29, 2014 09:04:10 AM Andy Lutomirski wrote:
 On Thu, May 29, 2014 at 6:05 AM, Steve Grubb sgr...@redhat.com wrote:
  On Wednesday, May 28, 2014 07:40:57 PM Andy Lutomirski wrote:
- It assumes that syscall numbers are between 0 and 2048.
  
   There could well be a bug here.  Not questioning that.  Although that
   would be patch 1/2
 
  Even with patch 1, it still doesn't handle large syscall numbers -- it
  just assumes they're not audited.
 
  All syscalls must be auditable. Meaning that if an arch goes above 2048,
  then we'll need to do some math to get it to fall back within the range.

 Off the top of my head, x32, some ARM variants, and some MIPS variants
 have syscalls above 2048.

 That could be fixed by putting a subtraction in place to get the bit mask to
 map correctly. User space audit rules would need to fix that also.


 auditsc has been disabled on the offending
 ARM variant (because I noticed that the same issue affects seccomp,
 and no one particularly wants to fix it), but I suspect that auditsc
 is enabled but completely broken on whichever MIPS ABI has the issue.
 I haven't checked.

 TBH, I think that it's silly to let the auditsc design be heavily
 constrained by ia64 considerations -- I still think that the syscall
 entry hooks could be removed completely with some effort on most
 architectures and that performance would improve considerably.  For
 users who don't have syscall filter rules, performance might even
 improve on ia64 -- I don't care how expensive syscall_get_args is, the
 actual printk/auditd thing will dominate in cases where anything is
 written.

 The last time I heard of benchmarking being done, audit's performance hit was
 negligible. That was about 4 years ago and there has been a whole lot of code
 churn since then. But it should in general be the cost of an 'if' statement
 unless auditing is enabled. If it is enabled, then you necessarily get the
 full treatment in case you trigger an event.


The actual rule matching is reasonably inexpensive, but the killer is
the fact that every system call is forced into the slow path.  Without
audit, this cost is paid by seccomp users and tasks that are being
ptraced.  With audit, every single syscall for every single task pays
the full cost of a slow path system call.

This is extra sad because, AFAICS, most of the syscall entry audit
code shouldn't be necessary even for audit users.

 For users who aren't bound by Common Criteria or related things, I
 suspect that syscall auditing (as opposed to, say, LSM-based auditing)
 will provide dubious value at best.

 No, it works pretty good. You can see who is accessing what files pretty
 clearly.

Right, but I don't think it's the *syscall* part that's really
necessary here.  The fact that someone opened /etc/passwd is
interesting.  The fact that someone sys_opened (const char *)0x123456
is not.


 Keep in mind that many syscalls take pointer arguments, and the auditsc code
 can't really do anything sensible with them.

 All security relevant information is collected in auxiliary records if they
 are not directly readable as a syscall argument. This is a basic requirement.
 We are not required to capture all possible arguments. If you know of any
 security relevant parameters not captured, please let us know.


a5 and a6.  The cmsg data in sendmsg (which is very security-relevant
-- think of SCM_RIGHTS and SCM_CREDENTIALS).  The destination for
sendto.  The whole payload of netlink messages.  I don't know how many
of these are actually captures.

Plus, except for things like setuid and sigreturn, capturing
parameters on entry shouldn't be needed anyway.  Capturing at log time
should be fine.

If this magic capturing only happened for users who need it for
compliance reasons, that would be one thing.  But it happens for every
single system call on Fedora.

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [ARCH question] Do syscall_get_nr and syscall_get_arguments always work?

2014-02-19 Thread Andy Lutomirski

On Tue, Feb 18, 2014 at 11:39 AM, Eric Paris epa...@redhat.com wrote:
 On Fri, 2014-02-07 at 08:40 -0800, Andy Lutomirski wrote:
 On Fri, Feb 7, 2014 at 4:58 AM, Jonas Bonn jonas.b...@gmail.com wrote:
  Hi Andy,
 
  On 5 February 2014 00:50, Andy Lutomirski l...@amacapital.net wrote:
 
  I can't even find the system call entry point on mips.
 
 
  Is there a semi-official answer here?
 
  I don't have an official answer for you, but when I wanted to do
  something with these entry points a couple of years back I discovered
  that they aren't very thoroughly implemented across the various
  architectures.  I started cleaning this up and can probably dig up
  some of this for you if you need it.

 The syscall_get_xyz functions are certainly implemented and functional
 in all relevant architectures -- the audit code is already using them.
  The thing I'm uncertain about is whether they are usable with no
 syscall slow path bits set.

 I guess that, if the syscall restart logic needs to read the argument
 registers, then they're probably reliably saved...

 Al just indicated to me that on at least ia64, syscall_get_arguments()
 is really expensive.  So maybe not a deal breaker, but sounds like we'd
 lose a lot of performance trying to get them at syscall exit...


I wonder how slow syscall_get_arguments has to be before it's a real
problem.  Remember that we only need to call it when we already know
that an audit record needs to be written (or if a syscall argument is
used in a filter rule, I suppose -- I'm sure sure whether that's
possible).  As long as syscall_get_nr is fast, which I think it is on
ia64, then the tradeoff might still be a win.

But I think this is still a bit of a lost cause.  Currently, if I'm
reading the code correctly, signal delivery to a non-auditd process
can result in writing an audit event.  If the signal is delivered
during a syscall, then current code will write an audit record for
that syscall on syscall exit.

If we want to preserve that behavior without a syscall audit hook,
then the signal delivery code needs to know whether it's in the middle
of a syscall.  AFAIK this is not possible.

On the other hand, most interesting signals are probably *not* the
result of a syscall anyway, so it may make sense to just remove that
code entirely.

TBH, as long as something happens to get rid of audit overhead when
there are no rules, my interest in personally writing something fancy
to make the nonzero-number-of-rules case have less overhead is rather
low.

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH v3] audit: Turn off TIF_SYSCALL_AUDIT when there are no rules

2014-02-10 Thread Andy Lutomirski

On Mon, Feb 10, 2014 at 8:57 AM, Oleg Nesterov o...@redhat.com wrote:
 On 02/08, Andy Lutomirski wrote:

 +void audit_inc_n_rules()
 +{
 + struct task_struct *p, *t;
 +
 + read_lock(tasklist_lock);
 + audit_n_rules++;
 + smp_wmb();
 + if (audit_n_rules == 1) {
 + /*
 +  * We now have a rule; we need to hook syscall entry.
 +  */
 + for_each_process_thread(p, t) {
 + if (t-audit_context)
 + set_tsk_thread_flag(t, TIF_SYSCALL_AUDIT);
 + }
 + }
 + read_unlock(tasklist_lock);
 +}
 +
 +void audit_dec_n_rules()
 +{
 + read_lock(tasklist_lock);
 + --audit_n_rules;
 + BUG_ON(audit_n_rules  0);
 +
 + /*
 +  * If audit_n_rules == 0, then __audit_syscall_exit will clear
 +  * TIF_SYSCALL_AUDIT.
 +  */
 +
 + read_unlock(tasklist_lock);
 +}

 To be honest, I do not understand why _dec_ takes tasklist_lock...
 And why _inc_ increments audit_n_rules under tasklist.

Bah, incorrect leftover from last time.


 @@ -1528,6 +1562,25 @@ void __audit_syscall_exit(int success, long 
 return_code)
   context-filterkey = NULL;
   }
   tsk-audit_context = context;
 +
 + if (ACCESS_ONCE(audit_n_rules) == 0) {
 + /*
 +  * Either this is the very first syscall by this process or
 +  * audit_dec_n_rules recently set audit_n_rules to zero.
 +  */
 + smp_rmb();

 rmb() looks wrong, we need mb() to serialize ACCESS_ONCE() and
 clear_tsk_thread_flag().

I clearly need to review the rules.  I think you're right, though --
no barrier should be needed.


 But, otoh, I think we do not need any barrier at all, we can rely on
 control dependency. See the recent 18c03c61444a21 Documentation/
 memory-barriers.txt: Prohibit speculative writes.

 + /* audit_inc_n_rules could increment audit_n_rules here... */
 +
 + clear_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT);
 +
 + smp_rmb();

 Again, I guess this should be mb() or smp_mb__after_clear_bit().


 And I still think this needs more changes. Once again, I do not think
 that, say, __audit_log_bprm_fcaps() should populate context-aux if
 !TIF_SYSCALL_AUDIT, this list can grow indefinitely. Or 
 __audit_signal_info()...

 Perhaps __audit_syscall_exit() should also set context-dummy?

That would work.

I'm still torn between trying to make it possible for things like
__audit_log_bprm_fcaps to start a syscall audit record in the middle
of a syscall or to just try to tighten up the current approach to the
point where it will work correctly.

Grr.  Why is all this crap tied up with syscall auditing anyway?  ISTM
it would have been a lot nicer if audit calls just immediately emitted
audit records, completely independently of the syscall machinery.

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

[PATCH v3] audit: Turn off TIF_SYSCALL_AUDIT when there are no rules

2014-02-09 Thread Andy Lutomirski

This toggles TIF_SYSCALL_AUDIT as needed when rules change instead
of leaving it set whenever rules might be set in the future.  This
reduces syscall latency from 60ns to closer to 40ns on my laptop.

This code is a little bit tricky.  It's not safe to turn
TIF_SYSCALL_AUDIT off during a syscall that is being audited.  So,
with this change, the only thing that ever turns TIF_SYSCALL_AUDIT
off after a process has been created is __audit_syscall_exit.

This has the somewhat odd side-effect that, unless there's a
'task,never' rule, every task's first syscall will use the slow
path.  This should be more or less unnoticable (what's another
20-40ns at task creation time between friends?).

There is a user-visible effect of this change: if there are no audit
rules, then events like AVC denials will not result in a syscall
audit record being written.  I'm happy to accept this minor loss in
debuggability in exchange for a massive speedup of all system calls
on default distribution configurations.

A better solution would be to ditch the syscall entry hook entirely
and use the syscall_get_xyz calls as needed to fill in the context.
This could be done on top of this patch, but it would be more code.

Cc: Oleg Nesterov o...@redhat.com
Cc: Steve Grubb sgr...@redhat.com
Cc: Eric Paris epa...@redhat.com
Signed-off-by: Andy Lutomirski l...@amacapital.net
---

Changes from v2 (actually v2.1):
 - Use for_each_process_thread instead of do_each_thread :)
 - Turn off TIF_SYSCALL_AUDIT lazily, thus hopefully avoiding all of
   the nasty cases that Oleg noticed.

Changes from v1:
 - For new tasks, set flags in a new audit_sync_flags callback instead of
   in audit_alloc (thanks, Oleg).
 - Rework locking.
 - Use irqsave/irqrestore to avoid having to think about who else might have
   taken spinlocks.

 include/linux/audit.h |  8 ++--
 kernel/auditfilter.c  |  4 ++--
 kernel/auditsc.c  | 57 +--
 3 files changed, 63 insertions(+), 6 deletions(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index aa865a9..ab00ffb 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -298,7 +298,8 @@ static inline void audit_mmap_fd(int fd, int flags)
__audit_mmap_fd(fd, flags);
 }
 
-extern int audit_n_rules;
+extern void audit_inc_n_rules(void);
+extern void audit_dec_n_rules(void);
 extern int audit_signals;
 #else /* CONFIG_AUDITSYSCALL */
 static inline int audit_alloc(struct task_struct *task)
@@ -404,7 +405,10 @@ static inline void audit_mmap_fd(int fd, int flags)
 { }
 static inline void audit_ptrace(struct task_struct *t)
 { }
-#define audit_n_rules 0
+static inline void audit_inc_n_rules(void)
+{ }
+static inline void audit_dec_n_rules(void)
+{ }
 #define audit_signals 0
 #endif /* CONFIG_AUDITSYSCALL */
 
diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
index 14a78cc..4c7054b 100644
--- a/kernel/auditfilter.c
+++ b/kernel/auditfilter.c
@@ -903,7 +903,7 @@ static inline int audit_add_rule(struct audit_entry *entry)
}
 #ifdef CONFIG_AUDITSYSCALL
if (!dont_count)
-   audit_n_rules++;
+   audit_inc_n_rules();
 
if (!audit_match_signal(entry))
audit_signals++;
@@ -955,7 +955,7 @@ static inline int audit_del_rule(struct audit_entry *entry)
 
 #ifdef CONFIG_AUDITSYSCALL
if (!dont_count)
-   audit_n_rules--;
+   audit_dec_n_rules();
 
if (!audit_match_signal(entry))
audit_signals--;
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 7aef2f4..d76947c 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -80,7 +80,7 @@
 #define MAX_EXECVE_AUDIT_LEN 7500
 
 /* number of audit rules */
-int audit_n_rules;
+static int audit_n_rules;
 
 /* determines whether we collect data for signals sent */
 int audit_signals;
@@ -911,6 +911,39 @@ static inline struct audit_context 
*audit_alloc_context(enum audit_state state)
return context;
 }
 
+void audit_inc_n_rules()
+{
+   struct task_struct *p, *t;
+
+   read_lock(tasklist_lock);
+   audit_n_rules++;
+   smp_wmb();
+   if (audit_n_rules == 1) {
+   /*
+* We now have a rule; we need to hook syscall entry.
+*/
+   for_each_process_thread(p, t) {
+   if (t-audit_context)
+   set_tsk_thread_flag(t, TIF_SYSCALL_AUDIT);
+   }
+   }
+   read_unlock(tasklist_lock);
+}
+
+void audit_dec_n_rules()
+{
+   read_lock(tasklist_lock);
+   --audit_n_rules;
+   BUG_ON(audit_n_rules  0);
+
+   /*
+* If audit_n_rules == 0, then __audit_syscall_exit will clear
+* TIF_SYSCALL_AUDIT.
+*/
+
+   read_unlock(tasklist_lock);
+}
+
 /**
  * audit_alloc - allocate an audit context block for a task
  * @tsk: task
@@ -938,11 +971,12 @@ int audit_alloc(struct task_struct *tsk)
if (!(context

[ARCH question] Do syscall_get_nr and syscall_get_arguments always work?

2014-02-05 Thread Andy Lutomirski

On Tue, Feb 4, 2014 at 11:32 AM, Andy Lutomirski l...@amacapital.net wrote:
 Now we get rid of __audit_syscall_entry.  (This speeds up even the
 auditing-is-on case.)  Instead we have __audit_start_record, which
 does more or less the same thing, except that (a) it doesn't BUG if
 in_syscall and (b) it *sets* TIF_SYSCALL_AUDIT.  This relies on the
 fact that syscall_get_nr and syscall_get_arguments are reliable on
 x86_64.  I suspect that they're reliable everywhere else, too.  The
 idea is that there's nothing wrong with calling __audit_start_record
 more than once.  (Maybe it should be called
 __audit_record_this_syscall.)

I'd like to make a change that can result in syscall_get_nr and
syscall_get_arguments being called (on current and
task_pt_regs(current)) from any system call (as opposed to being
called only from the audit/trace slowpaths).  Is this safe?

Here's my somewhat clueless analysis:

On x86_64, I've tested it, and it works.  The entry code saves all of
the argument registers, even in the fast path.

i386 and ia32_compat look okay, too.

If stmiasp, {r0 - r12}@ Calling r0 - r12 does what I
think it does, then arm should be okay.

I'm totally guessing here, but e10_sync on aarch64 seems to save
enough registers.  I admit to being a little bit surprised, though --
aarch64 is new, and if I were designing an ABI, I specify that
syscalls *don't* save registers.

ia64 has a comment in ivt.S that streamlined syscalls save nr in r15.
The rest come from unwind info (!).  I assume this has something to do
with the magic ia64 register rotation thing.  I have no idea what
happens if there's a NaT in an argument register.

I can't even find the system call entry point on mips.


Is there a semi-official answer here?


--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH v2.1] audit: Only use the syscall slowpath when syscall audit rules exist

2014-02-04 Thread Andy Lutomirski

On Tue, Feb 4, 2014 at 8:50 AM, Oleg Nesterov o...@redhat.com wrote:
 On 02/03, Andy Lutomirski wrote:

 +void audit_inc_n_rules()
 +{
 + struct task_struct *p, *g;
 + unsigned long flags;
 +
 + read_lock_irqsave(tasklist_lock, flags);

 Confused... read_lock(tasklist) doesn't need to disable irqs.

 (ftrace does this for no reason too, perhaps I should resend the patch)

Is this because there are no interrupt handlers that write_lock(tasklist)?


 + if (audit_n_rules++ == 0) {

 probably this can be done outside of read_lock?

I don't think so.  I'm cheating and using the tasklist_lock to prevent
audit_sync_flags from racing against this increment, so this needs to
be inside the lock.  I'll add a comment.


 + do_each_thread(g, p) {

 for_each_process_thread ;) do_each_thread will die, I hope.


Sorry, forgot to mention: where is this mythical
for_each_process_thread?  Either it only exists in a kernel version
I'm not looking at, my grepping skills aren't up to the task, or you
just hate do_each_thread so much that you imagined up an alternative
:)

 +void audit_dec_n_rules()
 +{
 + struct task_struct *p, *g;
 + unsigned long flags;
 +
 + read_lock_irqsave(tasklist_lock, flags);
 +
 + --audit_n_rules;
 + BUG_ON(audit_n_rules  0);
 +
 + if (audit_n_rules == 0) {
 + do_each_thread(g, p) {
 + clear_tsk_thread_flag(p, TIF_SYSCALL_AUDIT);
 + } while_each_thread(g, p);
 + }

 The same, and...

 On a second thought it seems that audit_dec_n_rules() has a problem.
 Note the BUG_ON(context-in_syscall) in __audit_syscall_entry().

 Suppose that audit_dec_n_rules() clears TIF_SYSCALL_AUDIT when a task
 runs a syscall. In this case (afaics) __audit_syscall_exit() won't be
 called. The next audit_inc_n_rules() can set TIF_SYSCALL_AUDIT and
 trigger another __audit_syscall_entry() which will hit this BUG_ON().


Egads!

 And in general it doesn't look safe although I know almost nothing
 about audit. I mean, currently __audit_syscall_entry() or
 __audit_log_bprm_fcaps() assume that __audit_syscall_exit() or
 __audit_free() will cleanup -audit_context, perhaps we should not
 break the rules?

 Once again, I do not pretend I understand this code, this is the
 question, not the comment.

 But if I am right, then TIF_SYSCALL_AUDIT should be cleared in
 __audit_syscall_exit() as you suggested before.

I think I'll wait for Eric to chime in.  I suspect that the real
solution is to simplify all this stuff by relying on the fact that the
syscall nr and args are saved by the (fast path and slow path) entry
code, so the audit entry hook may be entirely unnecessary.  Or maybe a
new TIF flag would be needed to make it work.

Anyway, *grumble*.  My only real interest in this stuff is to get rid
of the overhead.  I don't actually want syscall auditing on any of my
boxes (in contrast to avc auditing, which is useful but (mostly)
independent).

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Re: [PATCH v2.1] audit: Only use the syscall slowpath when syscall audit rules exist

2014-02-04 Thread Andy Lutomirski

On Tue, Feb 4, 2014 at 11:11 AM, Oleg Nesterov o...@redhat.com wrote:
 On 02/04, Andy Lutomirski wrote:
 On Tue, Feb 4, 2014 at 8:50 AM, Oleg Nesterov o...@redhat.com wrote:
  On 02/03, Andy Lutomirski wrote:
 Sorry, forgot to mention: where is this mythical
 for_each_process_thread?

 In Linus's tree, please see 0c740d0afc3bff.

 or you
 just hate do_each_thread so much that you imagined up an alternative
 :)

 sort of ;)

Aha -- it probably got merged just after I pulled Linus' tree and
looked for it.  Or something.


 I think I'll wait for Eric to chime in.  I suspect that the real
 solution is to simplify all this stuff by relying on the fact that the
 syscall nr and args are saved by the (fast path and slow path) entry
 code, so the audit entry hook may be entirely unnecessary.

 Perhaps... but even in this case we need to do something with, say,
 __audit_log_bprm_fcaps().

 At least this list should not grow indefinitely if the task skips
 __audit_syscall_exit(). Although at first glance this can be probably
 cleanuped too.

OK, here's a thought: let's change the semantics of TIF_SYSCALL_AUDIT.
 New semantics:

TIF_SYSCALL_AUDIT is set if (the task is eligible for syscall auditing
and n_rules != 0 *or* something has started writing an audit record).
TIF_SYSCALL_AUDIT is *never* cleared by anything other than
copy_process or __audit_syscall_exit.

This means that, if there's a pending audit record, then
TIF_SYSCALL_AUDIT will be set.  That, in turn, means that
__audit_syscall_exit will be called, which avoids the BUGs you pointed
out.

Now we get rid of __audit_syscall_entry.  (This speeds up even the
auditing-is-on case.)  Instead we have __audit_start_record, which
does more or less the same thing, except that (a) it doesn't BUG if
in_syscall and (b) it *sets* TIF_SYSCALL_AUDIT.  This relies on the
fact that syscall_get_nr and syscall_get_arguments are reliable on
x86_64.  I suspect that they're reliable everywhere else, too.  The
idea is that there's nothing wrong with calling __audit_start_record
more than once.  (Maybe it should be called
__audit_record_this_syscall.)

To finish the job, we change __audit_syscall_exit to clear
TIF_SYSCALL_AUDIT if n_rules==0.  inc_n_rules can set
TIF_SYSCALL_AUDIT (without even needing to worry about races, I
think).  dec_n_rules doesn't need to do anything special.

Benefits:
 - Removing the syscall entry hook speeds everyone up.  Even silly
people who use exit rules :)
 - There's no need to think about mismatched entry/exit hook calls,
since there is no entry hook.
 - The same mechanism could be reused for non-audit purposes.  Any
code could say, at any point hey, this syscall is interesting.  let's
record it.

Disadvantages:
 - Need to check other architectures to make sure that syscall_get_xyz
works reliably for fast path syscalls.
 - For the full performance boost, all architectures need to avoid
checking TIF_SYSCALL_AUDIT in the entry path.  I prefer not to mess
with non-x86 assembly, so I won't do that part, since it's not
required for correctness.

Eric, any thoughts?

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

Why is syscall auditing on with no rules?

2014-02-03 Thread Andy Lutomirski

On a stock Fedora installation:

$ sudo auditctl -l
No rules

Nonetheless TIF_SYSCALL_AUDIT is set and the __audit_syscall_entry and
__audit_syscall_exit account for 20% of syscall overhead according to
perf.

This sucks.  Unless I'm missing something, syscall auditing is *off*.

How hard would it be to arrange for TIF_SYSCALL_AUDIT to be cleared
when there are no syscall rules?

(This is extra bad in kernels before 3.13, where the clear call for
TIF_SYSCALL_AUDIT was completely missing.)

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

[PATCH] audit: Only use the syscall slowpath when syscall audit rules exist

2014-02-03 Thread Andy Lutomirski

This toggles TIF_SYSCALL_AUDIT as needed when rules change instead of
leaving it set whenever rules might be set in the future.  This reduces
syscall latency from 60ns to closer to 40ns on my laptop.

Cc: Oleg Nesterov o...@redhat.com
Cc: Steve Grubb sgr...@redhat.com
Cc: Eric Paris epa...@redhat.com
Signed-off-by: Andy Lutomirski l...@amacapital.net
---

This is not the best-tested patch in the world.  Someone who actually
knows how to use syscall auditing should probably give it a spin.  It
fixes an IMO huge performance issue, though.

 include/linux/audit.h |  9 +--
 kernel/auditfilter.c  |  4 ++--
 kernel/auditsc.c  | 65 ---
 3 files changed, 71 insertions(+), 7 deletions(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index a406419..534ba85 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -298,7 +298,9 @@ static inline void audit_mmap_fd(int fd, int flags)
__audit_mmap_fd(fd, flags);
 }
 
-extern int audit_n_rules;
+extern void audit_inc_n_rules(void);
+extern void audit_dec_n_rules(void);
+
 extern int audit_signals;
 #else /* CONFIG_AUDITSYSCALL */
 static inline int audit_alloc(struct task_struct *task)
@@ -404,7 +406,10 @@ static inline void audit_mmap_fd(int fd, int flags)
 { }
 static inline void audit_ptrace(struct task_struct *t)
 { }
-#define audit_n_rules 0
+static inline void audit_inc_n_rules(void)
+{ }
+static inline void audit_dec_n_rules(void)
+{ }
 #define audit_signals 0
 #endif /* CONFIG_AUDITSYSCALL */
 
diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
index 51f3fd4..0ce531d 100644
--- a/kernel/auditfilter.c
+++ b/kernel/auditfilter.c
@@ -903,7 +903,7 @@ static inline int audit_add_rule(struct audit_entry *entry)
}
 #ifdef CONFIG_AUDITSYSCALL
if (!dont_count)
-   audit_n_rules++;
+   audit_inc_n_rules();
 
if (!audit_match_signal(entry))
audit_signals++;
@@ -955,7 +955,7 @@ static inline int audit_del_rule(struct audit_entry *entry)
 
 #ifdef CONFIG_AUDITSYSCALL
if (!dont_count)
-   audit_n_rules--;
+   audit_dec_n_rules();
 
if (!audit_match_signal(entry))
audit_signals--;
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 90594c9..363d0bb 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -79,8 +79,15 @@
 /* no execve audit message should be longer than this (userspace limits) */
 #define MAX_EXECVE_AUDIT_LEN 7500
 
-/* number of audit rules */
-int audit_n_rules;
+/*
+ * number of audit rules
+ *
+ * n_rules_lock protects audit_n_rules.  It also prevents audit_alloc from
+ * racing against changes to audit_n_rules: to change someone else's
+ * audit_context or TIF_SYSCALL_AUDIT, you need write_lock(n_rules_lock).
+ */
+static DEFINE_RWLOCK(n_rules_lock);
+static int audit_n_rules;
 
 /* determines whether we collect data for signals sent */
 int audit_signals;
@@ -911,6 +918,47 @@ static inline struct audit_context 
*audit_alloc_context(enum audit_state state)
return context;
 }
 
+void audit_inc_n_rules()
+{
+   struct task_struct *p, *g;
+
+   write_lock(n_rules_lock);
+
+   if (audit_n_rules++ != 0)
+   goto out;  /* The overall state isn't changing. */
+
+   read_lock(tasklist_lock);
+   do_each_thread(g, p) {
+   if (p-audit_context)
+   set_tsk_thread_flag(p, TIF_SYSCALL_AUDIT);
+   } while_each_thread(g, p);
+   read_unlock(tasklist_lock);
+
+out:
+   write_unlock(n_rules_lock);
+}
+
+void audit_dec_n_rules()
+{
+   struct task_struct *p, *g;
+
+   write_lock(n_rules_lock);
+   --audit_n_rules;
+   BUG_ON(audit_n_rules  0);
+
+   if (audit_n_rules != 0)
+   goto out;  /* The overall state isn't changing. */
+
+   read_lock(tasklist_lock);
+   do_each_thread(g, p) {
+   clear_tsk_thread_flag(p, TIF_SYSCALL_AUDIT);
+   } while_each_thread(g, p);
+   read_unlock(tasklist_lock);
+
+out:
+   write_unlock(n_rules_lock);
+}
+
 /**
  * audit_alloc - allocate an audit context block for a task
  * @tsk: task
@@ -931,6 +979,11 @@ int audit_alloc(struct task_struct *tsk)
 
state = audit_filter_task(tsk, key);
if (state == AUDIT_DISABLED) {
+   /*
+* There is no relevant race against inc/dec_n_rules
+* here -- inc won't set TIF_SYSCALL_AUDIT, and, if dec
+* clears it redundantly, there's no harm done.
+*/
clear_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT);
return 0;
}
@@ -942,8 +995,14 @@ int audit_alloc(struct task_struct *tsk)
}
context-filterkey = key;
 
+   read_lock(n_rules_lock);
tsk-audit_context  = context;
-   set_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT);
+   if (audit_n_rules)
+   set_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT

[PATCH v2.1] audit: Only use the syscall slowpath when syscall audit rules exist

2014-02-03 Thread Andy Lutomirski

This toggles TIF_SYSCALL_AUDIT as needed when rules change instead of
leaving it set whenever rules might be set in the future.  This reduces
syscall latency from 60ns to closer to 40ns on my laptop.

Cc: Oleg Nesterov o...@redhat.com
Cc: Steve Grubb sgr...@redhat.com
Cc: Eric Paris epa...@redhat.com
Signed-off-by: Andy Lutomirski l...@amacapital.net
---

This brown paper bag release is brought to you by git commit's -a flag.

Changes from v2: Contains the correct patch

Changes from v1:
 - For new tasks, set flags in a new audit_sync_flags callback instead of
   in audit_alloc (thanks, Oleg).
 - Rework locking.
 - Use irqsave/irqrestore to avoid having to think about who else might have
   taken spinlocks.

 include/linux/audit.h | 11 --
 kernel/auditfilter.c  |  4 ++--
 kernel/auditsc.c  | 57 +--
 kernel/fork.c |  3 +++
 4 files changed, 65 insertions(+), 10 deletions(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index a406419..a81f498 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -298,7 +298,9 @@ static inline void audit_mmap_fd(int fd, int flags)
__audit_mmap_fd(fd, flags);
 }
 
-extern int audit_n_rules;
+extern void audit_inc_n_rules(void);
+extern void audit_dec_n_rules(void);
+extern void audit_sync_flags(struct task_struct *tsk);
 extern int audit_signals;
 #else /* CONFIG_AUDITSYSCALL */
 static inline int audit_alloc(struct task_struct *task)
@@ -404,7 +406,12 @@ static inline void audit_mmap_fd(int fd, int flags)
 { }
 static inline void audit_ptrace(struct task_struct *t)
 { }
-#define audit_n_rules 0
+static inline void audit_inc_n_rules(void)
+{ }
+static inline void audit_dec_n_rules(void)
+{ }
+static inline void audit_sync_flags(struct task_struct *tsk)
+{ }
 #define audit_signals 0
 #endif /* CONFIG_AUDITSYSCALL */
 
diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
index 51f3fd4..0ce531d 100644
--- a/kernel/auditfilter.c
+++ b/kernel/auditfilter.c
@@ -903,7 +903,7 @@ static inline int audit_add_rule(struct audit_entry *entry)
}
 #ifdef CONFIG_AUDITSYSCALL
if (!dont_count)
-   audit_n_rules++;
+   audit_inc_n_rules();
 
if (!audit_match_signal(entry))
audit_signals++;
@@ -955,7 +955,7 @@ static inline int audit_del_rule(struct audit_entry *entry)
 
 #ifdef CONFIG_AUDITSYSCALL
if (!dont_count)
-   audit_n_rules--;
+   audit_dec_n_rules();
 
if (!audit_match_signal(entry))
audit_signals--;
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 90594c9..cd44c88 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -79,8 +79,13 @@
 /* no execve audit message should be longer than this (userspace limits) */
 #define MAX_EXECVE_AUDIT_LEN 7500
 
-/* number of audit rules */
-int audit_n_rules;
+/*
+ * number of audit rules
+ *
+ * To change this, you must hold audit_filter_mutex *and* have a read lock
+ * on tasklist_lock.
+ */
+static int audit_n_rules;
 
 /* determines whether we collect data for signals sent */
 int audit_signals;
@@ -911,6 +916,40 @@ static inline struct audit_context 
*audit_alloc_context(enum audit_state state)
return context;
 }
 
+void audit_inc_n_rules()
+{
+   struct task_struct *p, *g;
+   unsigned long flags;
+
+   read_lock_irqsave(tasklist_lock, flags);
+   if (audit_n_rules++ == 0) {
+   do_each_thread(g, p) {
+   if (p-audit_context)
+   set_tsk_thread_flag(p, TIF_SYSCALL_AUDIT);
+   } while_each_thread(g, p);
+   }
+   read_unlock_irqrestore(tasklist_lock, flags);
+}
+
+void audit_dec_n_rules()
+{
+   struct task_struct *p, *g;
+   unsigned long flags;
+
+   read_lock_irqsave(tasklist_lock, flags);
+
+   --audit_n_rules;
+   BUG_ON(audit_n_rules  0);
+
+   if (audit_n_rules == 0) {
+   do_each_thread(g, p) {
+   clear_tsk_thread_flag(p, TIF_SYSCALL_AUDIT);
+   } while_each_thread(g, p);
+   }
+
+   read_unlock_irqrestore(tasklist_lock, flags);
+}
+
 /**
  * audit_alloc - allocate an audit context block for a task
  * @tsk: task
@@ -930,10 +969,8 @@ int audit_alloc(struct task_struct *tsk)
return 0; /* Return if not auditing. */
 
state = audit_filter_task(tsk, key);
-   if (state == AUDIT_DISABLED) {
-   clear_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT);
+   if (state == AUDIT_DISABLED)
return 0;
-   }
 
if (!(context = audit_alloc_context(state))) {
kfree(key);
@@ -943,10 +980,18 @@ int audit_alloc(struct task_struct *tsk)
context-filterkey = key;
 
tsk-audit_context  = context;
-   set_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT);
return 0;
 }
 
+void audit_sync_flags(struct task_struct *tsk)
+{
+   /* The caller has a write

[PATCH v2] audit: Only use the syscall slowpath when syscall audit rules exist

2014-02-03 Thread Andy Lutomirski

This toggles TIF_SYSCALL_AUDIT as needed when rules change instead of
leaving it set whenever rules might be set in the future.  This reduces
syscall latency from 60ns to closer to 40ns on my laptop.

Cc: Oleg Nesterov o...@redhat.com
Cc: Steve Grubb sgr...@redhat.com
Cc: Eric Paris epa...@redhat.com
Signed-off-by: Andy Lutomirski l...@amacapital.net
---

Changes from v1:
 - For new tasks, set flags in a new audit_sync_flags callback instead of
   in audit_alloc (thanks, Oleg).
 - Rework locking.
 - Use irqsave/irqrestore to avoid having to think about who else might have
   taken spinlocks.

 include/linux/audit.h |  9 +--
 kernel/auditfilter.c  |  4 ++--
 kernel/auditsc.c  | 65 ---
 3 files changed, 71 insertions(+), 7 deletions(-)

diff --git a/include/linux/audit.h b/include/linux/audit.h
index a406419..534ba85 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -298,7 +298,9 @@ static inline void audit_mmap_fd(int fd, int flags)
__audit_mmap_fd(fd, flags);
 }
 
-extern int audit_n_rules;
+extern void audit_inc_n_rules(void);
+extern void audit_dec_n_rules(void);
+
 extern int audit_signals;
 #else /* CONFIG_AUDITSYSCALL */
 static inline int audit_alloc(struct task_struct *task)
@@ -404,7 +406,10 @@ static inline void audit_mmap_fd(int fd, int flags)
 { }
 static inline void audit_ptrace(struct task_struct *t)
 { }
-#define audit_n_rules 0
+static inline void audit_inc_n_rules(void)
+{ }
+static inline void audit_dec_n_rules(void)
+{ }
 #define audit_signals 0
 #endif /* CONFIG_AUDITSYSCALL */
 
diff --git a/kernel/auditfilter.c b/kernel/auditfilter.c
index 51f3fd4..0ce531d 100644
--- a/kernel/auditfilter.c
+++ b/kernel/auditfilter.c
@@ -903,7 +903,7 @@ static inline int audit_add_rule(struct audit_entry *entry)
}
 #ifdef CONFIG_AUDITSYSCALL
if (!dont_count)
-   audit_n_rules++;
+   audit_inc_n_rules();
 
if (!audit_match_signal(entry))
audit_signals++;
@@ -955,7 +955,7 @@ static inline int audit_del_rule(struct audit_entry *entry)
 
 #ifdef CONFIG_AUDITSYSCALL
if (!dont_count)
-   audit_n_rules--;
+   audit_dec_n_rules();
 
if (!audit_match_signal(entry))
audit_signals--;
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index 90594c9..363d0bb 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -79,8 +79,15 @@
 /* no execve audit message should be longer than this (userspace limits) */
 #define MAX_EXECVE_AUDIT_LEN 7500
 
-/* number of audit rules */
-int audit_n_rules;
+/*
+ * number of audit rules
+ *
+ * n_rules_lock protects audit_n_rules.  It also prevents audit_alloc from
+ * racing against changes to audit_n_rules: to change someone else's
+ * audit_context or TIF_SYSCALL_AUDIT, you need write_lock(n_rules_lock).
+ */
+static DEFINE_RWLOCK(n_rules_lock);
+static int audit_n_rules;
 
 /* determines whether we collect data for signals sent */
 int audit_signals;
@@ -911,6 +918,47 @@ static inline struct audit_context 
*audit_alloc_context(enum audit_state state)
return context;
 }
 
+void audit_inc_n_rules()
+{
+   struct task_struct *p, *g;
+
+   write_lock(n_rules_lock);
+
+   if (audit_n_rules++ != 0)
+   goto out;  /* The overall state isn't changing. */
+
+   read_lock(tasklist_lock);
+   do_each_thread(g, p) {
+   if (p-audit_context)
+   set_tsk_thread_flag(p, TIF_SYSCALL_AUDIT);
+   } while_each_thread(g, p);
+   read_unlock(tasklist_lock);
+
+out:
+   write_unlock(n_rules_lock);
+}
+
+void audit_dec_n_rules()
+{
+   struct task_struct *p, *g;
+
+   write_lock(n_rules_lock);
+   --audit_n_rules;
+   BUG_ON(audit_n_rules  0);
+
+   if (audit_n_rules != 0)
+   goto out;  /* The overall state isn't changing. */
+
+   read_lock(tasklist_lock);
+   do_each_thread(g, p) {
+   clear_tsk_thread_flag(p, TIF_SYSCALL_AUDIT);
+   } while_each_thread(g, p);
+   read_unlock(tasklist_lock);
+
+out:
+   write_unlock(n_rules_lock);
+}
+
 /**
  * audit_alloc - allocate an audit context block for a task
  * @tsk: task
@@ -931,6 +979,11 @@ int audit_alloc(struct task_struct *tsk)
 
state = audit_filter_task(tsk, key);
if (state == AUDIT_DISABLED) {
+   /*
+* There is no relevant race against inc/dec_n_rules
+* here -- inc won't set TIF_SYSCALL_AUDIT, and, if dec
+* clears it redundantly, there's no harm done.
+*/
clear_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT);
return 0;
}
@@ -942,8 +995,14 @@ int audit_alloc(struct task_struct *tsk)
}
context-filterkey = key;
 
+   read_lock(n_rules_lock);
tsk-audit_context  = context;
-   set_tsk_thread_flag(tsk, TIF_SYSCALL_AUDIT);
+   if (audit_n_rules

Re: [PATCH] audit: Only use the syscall slowpath when syscall audit rules exist

2014-02-03 Thread Andy Lutomirski

On Mon, Feb 3, 2014 at 12:23 PM, Steve Grubb sgr...@redhat.com wrote:
 On Monday, February 03, 2014 09:53:23 AM Andy Lutomirski wrote:
 This toggles TIF_SYSCALL_AUDIT as needed when rules change instead of
 leaving it set whenever rules might be set in the future.  This reduces
 syscall latency from 60ns to closer to 40ns on my laptop.

 Does this mean that we have processes that don't have the  TIF_SYSCALL_AUDIT
 flag set? When rules get loaded, how do we get the flag put back into all
 processes?

By looping over all processes and setting the flag, which is what my patch does.


 The theory of ops is supposed to be that for anyone not needing audit, there
 is only the cost of  if (tif  TIF_SYSCALL_AUDIT).

On current kernels *all* processes have TIF_SYSCALL_AUDIT, even if
they don't need auditing because there's nothing to audit.  So
everything pays the full cost.

 That should be it. If you
 have audit enabled or had it enabled (which means it might be loaded with new
 rules), we want to inspect the syscall.


My point is that there's nothing to inspect -- there are no rules.
Unless the audit code needs to do something just in case a non-syscall
audit event gets written, in which case the audit code should IMO be
fixed.  (This is what Eric is talking about, I think.)

--Andy

--
Linux-audit mailing list
Linux-audit@redhat.com
https://www.redhat.com/mailman/listinfo/linux-audit

75 matches

Mail list logo