Re: [PATCH v3 0/7] File Sealing memfd_create()

2014-06-13 Thread Andy Lutomirski
On Fri, Jun 13, 2014 at 3:36 AM, David Herrmann dh.herrm...@gmail.com wrote:
 Hi

 This is v3 of the File-Sealing and memfd_create() patches. You can find v1 with
 a longer introduction at gmane:
   http://thread.gmane.org/gmane.comp.video.dri.devel/102241
 An LWN article about memfd+sealing is available, too:
   https://lwn.net/Articles/593918/
 v2 with some more discussions can be found here:
   http://thread.gmane.org/gmane.linux.kernel.mm/115713

 This series introduces two new APIs:
   memfd_create(): Think of this syscall as malloc(), but it returns a
   file-descriptor instead of a pointer. That file-descriptor is
   backed by anon-memory and can be memory-mapped for access.
   sealing: The sealing API can be used to prevent a specific set of
   operations on a file-descriptor. You 'seal' the file and thus
   guarantee that it cannot be modified in those specific ways.

 A short high-level introduction is also available here:
   http://dvdhrm.wordpress.com/2014/06/10/memfd_create2/

Potentially silly question: is it guaranteed that mmapping and reading
a SEAL_SHRINKed fd within size bounds will not SIGBUS?  If so, should
this be documented?  (The particular issue here would be reading
holes.  It should work by using the zero page, but, if so, we should
probably make it a real documented guarantee.)

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 0/7] File Sealing memfd_create()

2014-06-13 Thread Andy Lutomirski
On Fri, Jun 13, 2014 at 8:15 AM, David Herrmann dh.herrm...@gmail.com wrote:
 Hi

 On Fri, Jun 13, 2014 at 5:10 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Fri, Jun 13, 2014 at 3:36 AM, David Herrmann dh.herrm...@gmail.com 
 wrote:
 Hi

 This is v3 of the File-Sealing and memfd_create() patches. You can find v1 with
 a longer introduction at gmane:
   http://thread.gmane.org/gmane.comp.video.dri.devel/102241
 An LWN article about memfd+sealing is available, too:
   https://lwn.net/Articles/593918/
 v2 with some more discussions can be found here:
   http://thread.gmane.org/gmane.linux.kernel.mm/115713

 This series introduces two new APIs:
   memfd_create(): Think of this syscall as malloc(), but it returns a
   file-descriptor instead of a pointer. That file-descriptor is
   backed by anon-memory and can be memory-mapped for access.
   sealing: The sealing API can be used to prevent a specific set of
   operations on a file-descriptor. You 'seal' the file and thus
   guarantee that it cannot be modified in those specific ways.

 A short high-level introduction is also available here:
   http://dvdhrm.wordpress.com/2014/06/10/memfd_create2/

 Potentially silly question: is it guaranteed that mmapping and reading
 a SEAL_SHRINKed fd within size bounds will not SIGBUS?  If so, should
 this be documented?  (The particular issue here would be reading
 holes.  It should work by using the zero page, but, if so, we should
 probably make it a real documented guarantee.)

 No, this is not guaranteed. See the previous discussion in v2 on Patch
 2/4 between Hugh and me.

 Summary is: If you want mmap-reads to not fail, use mlock(). There are
 many situations where a fault might fail (think: OOM) and sealing is
 not meant to protect against that. Btw., holes are automatically
 filled with fresh pages by shmem. So a read only fails in OOM
 situations (or memcg limits, etc.).


Isn't the point of SEAL_SHRINK to allow servers to mmap and read
safely without worrying about SIGBUS?

--Andy

 Thanks
 David



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: vdso feature requests from the Go people

2014-06-13 Thread Andy Lutomirski
On Thu, Jun 12, 2014 at 10:39 PM, H. Peter Anvin h...@zytor.com wrote:
 On 06/12/2014 09:36 PM, Andy Lutomirski wrote:

 1. Parsing the vDSO is a PITA.  What if we bundled the reference
 parser inside the vdso?  Concretely, we could have AT_VDSO_FINDENTRY
 point to a function like:

 void *vdso_find_entry(const char *name, const char *version)

 Then things like Go and maybe even musl (and klibc?) could just call
 that function.  And we'd never have to worry about maintaining
 compatibility with more and more weird vdso parsers.

 Implementing this could be as simple as shoving parse_vdso.c into the
 vdso, although vdso2c could help and allow a really simple in-vdso
 implementation.


 I'm not really sure how much of a win that is... you have to parse
 *something*, and for the vast majority of all implementations there will
 be a dynamic linker just sitting there, and that is what it *does*.

I'm only aware of two implementations that work like that: glibc and
musl.  AFAIK neither one even tries to use the vdso when statically
linked.  IIRC, Bionic doesn't support the vdso at all, and Go has the
present issue.

And ELF parsing is a giant mess.  Currently the vdso doesn't use
DT_GNU_HASH (easy to fix) but no one can safely rely on DT_GNU_HASH
being there, and DT_GNU_HASH isn't actually easier to parse.

My point is that, for things that otherwise need to carry around a
full ELF loader, having a really easy, guaranteed-correct way to use
the vdso would be useful.

I can see how much text size it would add.  If the parser were part of
the vdso, it could probably be trimmed down a lot.  For example, there
is currently exactly one version definition, and it could be
hard-coded.  If someone were to add another version definition, they
could fix the parser at the same time.

Basically, everything except for the vdso_sym function in my parser could go.


 2. Go uses a segmented stack, and the vdso is quite unfriendly for
 segmented stack.  If we can get compiler support, is there a
 reasonable way that we could advertise the maximum stack usage of each
 vdso entry point?

 I suspect an easier way to do that would just be to define a maximum
 stack usage for *any* vdso entry point, and then enable the gcc stack
 depth warning (perhaps even with Werror)... we can do this now.

I can imagine this causing lots of pain when gcc 4.11 comes out with
some issue that blows up the stack usage.  Or when akpm compiles on
Fedora Core 6 using some ancient toolchain that spills every local
variable three or four times and assigns every possible inline
function its own non-overlapping stack range.

My copy of gcc supports -fstack-usage, which seems like an easyish way
to obtain the information.  I'm not entirely sure whether
-fstack-usage refers to the whole call tree or just to the specific
function.

--Andy


Re: [RFC v3 7/7] shm: isolate pinned pages when sealing files

2014-06-13 Thread Andy Lutomirski
On Fri, Jun 13, 2014 at 8:27 AM, David Herrmann dh.herrm...@gmail.com wrote:
 Hi

 On Fri, Jun 13, 2014 at 5:06 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Fri, Jun 13, 2014 at 3:36 AM, David Herrmann dh.herrm...@gmail.com 
 wrote:
 When setting SEAL_WRITE, we must make sure nobody has a writable reference
 to the pages (via GUP or similar). We currently check references and wait
 some time for them to be dropped. This, however, might fail for several
 reasons, including:
  - the page is pinned for longer than we wait
  - while we wait, someone takes an already pinned page for read-access

 Therefore, this patch introduces page-isolation. When sealing a file with
 SEAL_WRITE, we copy all pages that have an elevated ref-count. The new page
 is put in place atomically, the old page is detached and left alone. It
 will get reclaimed once the last external user drops it.

 Signed-off-by: David Herrmann dh.herrm...@gmail.com

 Won't this have unexpected effects?

 Thread 1:  start read into mapping backed by fd

 Thread 2:  SEAL_WRITE

 Thread 1: read finishes.  Now the page doesn't match the sealed page

 Just to be clear: you're talking about read() calls that write into
 the memfd? (like my FUSE example does) Your language might be
 ambiguous to others, as reading into a buffer actually implies a write.

 No, this does not have unexpected effects. But yes, your conclusion is
 right. To be clear, this behavior would be part of the API. Any
 asynchronous write might be cut off by SEAL_WRITE _iff_ you unmap your
 buffer before the write finishes. But you actually have to extend your
 example:

 Thread 1: p = mmap(memfd, SIZE);
 Thread 1: h = async_read(some_fd, p, SIZE);
 Thread 1: munmap(p, SIZE);
 Thread 2: SEAL_WRITE
 Thread 1: async_wait(h);

 If you don't do the unmap(), then SEAL_WRITE will fail due to an
 elevated i_mmap_writable. I think this is fine. In fact, I remember
 reading that async-IO is not required to resolve user-space addresses
 at the time of the syscall, but might delay it to the time of the
 actual write. But you're right, it would be misleading that the AIO
 operation returns success. This would be part of the memfd-API,
 though. And if you mess with your address space while running an
 async-IO operation on it, you're screwed anyway.

Ok, I missed the part where you had to munmap to trigger the oddity.
That seems fine to me.


 Btw., your sealing use-case is really odd. No-one guarantees that the
 SEAL_WRITE happens _after_ you schedule your async-read. In case you
 have some synchronization there, you just have to move it after
 waiting for your async-io to finish.

 Does that clear things up?

I think so.

--Andy


Re: [PATCH v3] x86,vdso: Fix vdso_install

2014-06-13 Thread Andy Lutomirski
On Fri, Jun 13, 2014 at 10:24 AM, H. Peter Anvin h...@zytor.com wrote:
 On 06/12/2014 10:01 AM, Josh Boyer wrote:
 On Thu, Jun 12, 2014 at 11:28 AM, Andy Lutomirski l...@amacapital.net 
 wrote:
 The filenames are different now.  Inspired by a patch from
 Sam Ravnborg.

 Acked-by: Sam Ravnborg s...@ravnborg.org
 Reported-by: Josh Boyer jwbo...@fedoraproject.org
 Signed-off-by: Andy Lutomirski l...@amacapital.net

 Things look great with this version.  Thanks much!

 Tested-by: Josh Boyer jwbo...@fedoraproject.org


 Are we okay with the 64-bit vdso now being installed as vdso64.so?

Unless someone actually complains, I think so.  Ideally, I think we'd
have a symlink from the build id, but that's a separate issue.

AFAIK, the only thing that actually references these files is Fedora's
gdb, which is finding them from Fedora's build-id symlinks, which
should work fine regardless of what the thing is called.

How's this for a patch description:

"make vdso_install" has been broken since this commit:

commit 6f121e548f83674ab4920a4e60afb58d4f61b829
Author: Andy Lutomirski l...@amacapital.net
Date:   Mon May 5 12:19:34 2014 -0700

x86, vdso: Reimplement vdso.so preparation in build-time C

because the names of the files to be installed changed. This fixes
vdso_install to install the right files.


Re: [PATCH] kbuild: Do not run modules_install and install in parallel

2014-06-13 Thread Andy Lutomirski
On Fri, Jun 13, 2014 at 2:45 AM, Michal Marek mma...@suse.cz wrote:
 On 13.6.2014 11:39, Michal Marek wrote:
 Based on a x86-only patch by Andy Lutomirski l...@amacapital.net

 With modular kernels, 'make install' is going to need the installed
 modules at some point to generate the initramfs.

 Signed-off-by: Michal Marek mma...@suse.cz
 ---
  Makefile | 6 ++
  1 file changed, 6 insertions(+)

 diff --git a/Makefile b/Makefile
 index 7680d7c..7e5e483 100644
 --- a/Makefile
 +++ b/Makefile
 @@ -503,6 +503,12 @@ ifeq ($(KBUILD_EXTMOD),)
  endif
  endif
  endif
 +# install and modules_install also need to be processed one by one
 +ifneq ($(filter install,$(MAKECMDGOALS)),)
 +ifneq ($(filter modules_install,$(MAKECMDGOALS)),)
 + mixed-targets := 1
 +endif
 +endif

 Note that this version does not enforce the ordering; it just avoids the
 interleaved execution. It can be added if desired.

Hmm.  This will fix 'make modules_install install' but will not fix
'make install modules_install'.  I don't know how many people would
type the latter.

--Andy


Re: [PATCH v6 6/9] seccomp: add seccomp syscall

2014-06-13 Thread Andy Lutomirski
On Tue, Jun 10, 2014 at 8:25 PM, Kees Cook keesc...@chromium.org wrote:
 This adds the new seccomp syscall with both an operation and flags
 parameter for future expansion. The third argument is a pointer value,
 used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
 be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).

Question for the linux-abi people:

What's the preferred way to do this these days?  This syscall is a
general purpose adjust the seccomp state thing.  The alternative
would be a specific new syscall to add a filter with a flags argument.

--Andy


Re: [PATCH v6 6/9] seccomp: add seccomp syscall

2014-06-13 Thread Andy Lutomirski
On Fri, Jun 13, 2014 at 2:22 PM, Alexei Starovoitov a...@plumgrid.com wrote:
 On Tue, Jun 10, 2014 at 8:25 PM, Kees Cook keesc...@chromium.org wrote:
 This adds the new seccomp syscall with both an operation and flags
 parameter for future expansion. The third argument is a pointer value,
 used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
 be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).

 Signed-off-by: Kees Cook keesc...@chromium.org
 Cc: linux-...@vger.kernel.org
 ---
  arch/x86/syscalls/syscall_32.tbl  |1 +
  arch/x86/syscalls/syscall_64.tbl  |1 +
  include/linux/syscalls.h  |2 ++
  include/uapi/asm-generic/unistd.h |4 ++-
  include/uapi/linux/seccomp.h  |4 +++
  kernel/seccomp.c  |   63 
 -
  kernel/sys_ni.c   |3 ++
  7 files changed, 69 insertions(+), 9 deletions(-)

 diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
 index d6b867921612..7527eac24122 100644
 --- a/arch/x86/syscalls/syscall_32.tbl
 +++ b/arch/x86/syscalls/syscall_32.tbl
 @@ -360,3 +360,4 @@
  351i386sched_setattr   sys_sched_setattr
  352i386sched_getattr   sys_sched_getattr
  353i386renameat2   sys_renameat2
 +354i386seccomp sys_seccomp
 diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
 index ec255a1646d2..16272a6c12b7 100644
 --- a/arch/x86/syscalls/syscall_64.tbl
 +++ b/arch/x86/syscalls/syscall_64.tbl
 @@ -323,6 +323,7 @@
  314common  sched_setattr   sys_sched_setattr
  315common  sched_getattr   sys_sched_getattr
  316common  renameat2   sys_renameat2
 +317common  seccomp sys_seccomp

  #
  # x32-specific system call numbers start at 512 to avoid cache impact
 diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
 index b0881a0ed322..1713977ee26f 100644
 --- a/include/linux/syscalls.h
 +++ b/include/linux/syscalls.h
 @@ -866,4 +866,6 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
  asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
  unsigned long idx1, unsigned long idx2);
 asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
 +asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
 +   const char __user *uargs);

 It looks odd to add a 'flags' argument to a syscall that doesn't even use it.
 I don't think it will be extensible this way.
 'uargs' is used only in the 2nd command as well, and it's not 'char __user *'
 but rather 'struct sock_fprog __user *'.
 I think it makes more sense to define only the first argument as 'int op' and
 the rest as a variable length array.
 Something like:
 long sys_seccomp(unsigned int op, struct nlattr *attrs, int len);
 then different commands can interpret 'attrs' differently.
 if op == mode_strict, then attrs == NULL, len == 0
 if op == mode_filter, then attrs->nla_type == seccomp_bpf_filter
 and nla_data(attrs) is 'struct sock_fprog'

Eww.  If the operation doesn't imply the type, then I think we've
totally screwed up.

 If we decide to add new types of filters or new commands, the syscall
 prototype won't need to change. New commands can be added preserving
 backward compatibility.
 The basic TLV concept has been around forever in the netlink world. imo it
 makes sense to use it with new syscalls. Passing 'struct xxx' into syscalls
 is a thing of the past. TLV style is more extensible. Fields of structures
 can become optional in the future, new fields added, etc.
 'struct nlattr' brings the same benefits to the kernel api as protobuf did
 to user land.

I see no reason to bring nl_attr into this.

Admittedly, I've never dealt with nl_attr, but everything
netlink-related I've ever been involved in has involved some sort of
API atrocity.

--Andy


Re: [PATCH v6 6/9] seccomp: add seccomp syscall

2014-06-13 Thread Andy Lutomirski
On Fri, Jun 13, 2014 at 2:37 PM, Alexei Starovoitov a...@plumgrid.com wrote:
 On Fri, Jun 13, 2014 at 2:25 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Fri, Jun 13, 2014 at 2:22 PM, Alexei Starovoitov a...@plumgrid.com 
 wrote:
 On Tue, Jun 10, 2014 at 8:25 PM, Kees Cook keesc...@chromium.org wrote:
 This adds the new seccomp syscall with both an operation and flags
 parameter for future expansion. The third argument is a pointer value,
 used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
 be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).

 Signed-off-by: Kees Cook keesc...@chromium.org
 Cc: linux-...@vger.kernel.org
 ---
  arch/x86/syscalls/syscall_32.tbl  |1 +
  arch/x86/syscalls/syscall_64.tbl  |1 +
  include/linux/syscalls.h  |2 ++
  include/uapi/asm-generic/unistd.h |4 ++-
  include/uapi/linux/seccomp.h  |4 +++
  kernel/seccomp.c  |   63 
 -
  kernel/sys_ni.c   |3 ++
  7 files changed, 69 insertions(+), 9 deletions(-)

 diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
 index d6b867921612..7527eac24122 100644
 --- a/arch/x86/syscalls/syscall_32.tbl
 +++ b/arch/x86/syscalls/syscall_32.tbl
 @@ -360,3 +360,4 @@
  351i386sched_setattr   sys_sched_setattr
  352i386sched_getattr   sys_sched_getattr
  353i386renameat2   sys_renameat2
 +354i386seccomp sys_seccomp
 diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
 index ec255a1646d2..16272a6c12b7 100644
 --- a/arch/x86/syscalls/syscall_64.tbl
 +++ b/arch/x86/syscalls/syscall_64.tbl
 @@ -323,6 +323,7 @@
  314common  sched_setattr   sys_sched_setattr
  315common  sched_getattr   sys_sched_getattr
  316common  renameat2   sys_renameat2
 +317common  seccomp sys_seccomp

  #
  # x32-specific system call numbers start at 512 to avoid cache impact
 diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
 index b0881a0ed322..1713977ee26f 100644
 --- a/include/linux/syscalls.h
 +++ b/include/linux/syscalls.h
 @@ -866,4 +866,6 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
  asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
  unsigned long idx1, unsigned long idx2);
 asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
 +asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
 +   const char __user *uargs);

 It looks odd to add a 'flags' argument to a syscall that doesn't even use it.
 I don't think it will be extensible this way.
 'uargs' is used only in the 2nd command as well, and it's not 'char __user *'
 but rather 'struct sock_fprog __user *'.
 I think it makes more sense to define only the first argument as 'int op' and
 the rest as a variable length array.
 Something like:
 long sys_seccomp(unsigned int op, struct nlattr *attrs, int len);
 then different commands can interpret 'attrs' differently.
 if op == mode_strict, then attrs == NULL, len == 0
 if op == mode_filter, then attrs->nla_type == seccomp_bpf_filter
 and nla_data(attrs) is 'struct sock_fprog'

 Eww.  If the operation doesn't imply the type, then I think we've
 totally screwed up.

 If we decide to add new types of filters or new commands, the syscall
 prototype won't need to change. New commands can be added preserving
 backward compatibility.
 The basic TLV concept has been around forever in the netlink world. imo it
 makes sense to use it with new syscalls. Passing 'struct xxx' into syscalls
 is a thing of the past. TLV style is more extensible. Fields of structures
 can become optional in the future, new fields added, etc.
 'struct nlattr' brings the same benefits to the kernel api as protobuf did
 to user land.

 I see no reason to bring nl_attr into this.

 Admittedly, I've never dealt with nl_attr, but everything
 netlink-related I've ever been involved in has involved some sort of
 API atrocity.

 netlink has a lot of legacy and there is genetlink which is not pretty
 either because of extra socket creation, binding, dealing with packet
 loss issues, but the key concept of variable length encoding is sound.
 Right now seccomp has two commands and they already don't fit
 into a single syscall neatly. Are you saying there should be two syscalls
 here? What about another seccomp related command? Another syscall?
 imo all seccomp related commands need to be mux/demux-ed under
 one syscall. What is the way to mux/demux potentially very different
 commands under one syscall? I cannot think of anything better than
 TLV style. 'struct nlattr' is what we have today and I think it works fine.
 I'm not suggesting to bring the whole netlink into the picture, but rather
 TLV style of encoding different arguments for different commands.

I'm unconvinced.  These are simple

[PATCH 4/4] x86,mm: Replace arch_vma_name with vm_ops-name for vsyscalls

2014-05-19 Thread Andy Lutomirski
arch_vma_name is now completely gone from x86.  Good riddance.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 arch/x86/mm/init_64.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 9deb59b..bdcde58 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1185,11 +1185,19 @@ int kern_addr_valid(unsigned long addr)
  * covers the 64bit vsyscall page now. 32bit has a real VMA now and does
  * not need special handling anymore:
  */
+static const char *gate_vma_name(struct vm_area_struct *vma)
+{
+   return "[vsyscall]";
+}
+static struct vm_operations_struct gate_vma_ops = {
+   .name = gate_vma_name,
+};
 static struct vm_area_struct gate_vma = {
.vm_start   = VSYSCALL_ADDR,
.vm_end = VSYSCALL_ADDR + PAGE_SIZE,
.vm_page_prot   = PAGE_READONLY_EXEC,
-   .vm_flags   = VM_READ | VM_EXEC
+   .vm_flags   = VM_READ | VM_EXEC,
+   .vm_ops = gate_vma_ops,
 };
 
 struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
@@ -1221,13 +1229,6 @@ int in_gate_area_no_mm(unsigned long addr)
	return (addr & PAGE_MASK) == VSYSCALL_ADDR;
 }
 
-const char *arch_vma_name(struct vm_area_struct *vma)
-{
-   if (vma == gate_vma)
-   return "[vsyscall]";
-   return NULL;
-}
-
 #ifdef CONFIG_X86_UV
 unsigned long memory_block_size_bytes(void)
 {
-- 
1.9.0



[PATCH 3/4] x86,mm: Improve _install_special_mapping and fix x86 vdso naming

2014-05-19 Thread Andy Lutomirski
Using arch_vma_name to give special mappings a name is awkward.  x86
currently implements it by comparing the start address of the vma to
the expected address of the vdso.  This requires tracking the start
address of special mappings and is probably buggy if a special vma
is split or moved.

Improve _install_special_mapping to just name the vma directly.  Use
it to give the x86 vvar area a name, which should make CRIU's life
easier.

As a side effect, the vvar area will show up in core dumps.  This
could be considered weird and is fixable.  Thoughts?

Cc: Cyrill Gorcunov gorcu...@openvz.org
Cc: Pavel Emelyanov xe...@parallels.com
Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 arch/x86/include/asm/vdso.h  |  6 ++-
 arch/x86/mm/init_64.c|  3 --
 arch/x86/vdso/vdso2c.h   |  5 ++-
 arch/x86/vdso/vdso32-setup.c |  7 
 arch/x86/vdso/vma.c  | 25 -
 include/linux/mm.h   |  4 +-
 include/linux/mm_types.h |  6 +++
 mm/mmap.c| 89 +---
 8 files changed, 94 insertions(+), 51 deletions(-)

diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index d0a2c90..30be253 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -7,10 +7,14 @@
 
 #ifndef __ASSEMBLER__
 
+#include <linux/mm_types.h>
+
 struct vdso_image {
void *data;
unsigned long size;   /* Always a multiple of PAGE_SIZE */
-   struct page **pages;  /* Big enough for data/size page pointers */
+
+   /* text_mapping.pages is big enough for data/size page pointers */
+   struct vm_special_mapping text_mapping;
 
unsigned long alt, alt_len;
 
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 6f88184..9deb59b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1223,9 +1223,6 @@ int in_gate_area_no_mm(unsigned long addr)
 
 const char *arch_vma_name(struct vm_area_struct *vma)
 {
-   if (vma->vm_mm && vma->vm_start ==
-   (long __force)vma->vm_mm->context.vdso)
-   return "[vdso]";
	if (vma == gate_vma)
		return "[vsyscall]";
return NULL;
diff --git a/arch/x86/vdso/vdso2c.h b/arch/x86/vdso/vdso2c.h
index ed2e894..3dcc61e 100644
--- a/arch/x86/vdso/vdso2c.h
+++ b/arch/x86/vdso/vdso2c.h
@@ -136,7 +136,10 @@ static int GOFUNC(void *addr, size_t len, FILE *outfile, const char *name)
	fprintf(outfile, "const struct vdso_image %s = {\n", name);
	fprintf(outfile, "\t.data = raw_data,\n");
	fprintf(outfile, "\t.size = %lu,\n", data_size);
-	fprintf(outfile, "\t.pages = pages,\n");
+	fprintf(outfile, "\t.text_mapping = {\n");
+	fprintf(outfile, "\t\t.name = \"[vdso]\",\n");
+	fprintf(outfile, "\t\t.pages = pages,\n");
+	fprintf(outfile, "\t},\n");
	if (alt_sec) {
		fprintf(outfile, "\t.alt = %lu,\n",
			(unsigned long)alt_sec->sh_offset);
diff --git a/arch/x86/vdso/vdso32-setup.c b/arch/x86/vdso/vdso32-setup.c
index c3ed708..e4f7781 100644
--- a/arch/x86/vdso/vdso32-setup.c
+++ b/arch/x86/vdso/vdso32-setup.c
@@ -119,13 +119,6 @@ __initcall(ia32_binfmt_init);
 
 #else  /* CONFIG_X86_32 */
 
-const char *arch_vma_name(struct vm_area_struct *vma)
-{
-	if (vma->vm_mm && vma->vm_start == (long)vma->vm_mm->context.vdso)
-		return "[vdso]";
-   return NULL;
-}
-
 struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
 {
return NULL;
diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index 8ad0081..e1513c4 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -30,7 +30,8 @@ void __init init_vdso_image(const struct vdso_image *image)
 
	BUG_ON(image->size % PAGE_SIZE != 0);
	for (i = 0; i < npages; i++)
-		image->pages[i] = virt_to_page(image->data + i*PAGE_SIZE);
+		image->text_mapping.pages[i] =
+			virt_to_page(image->data + i*PAGE_SIZE);

	apply_alternatives((struct alt_instr *)(image->data + image->alt),
			   (struct alt_instr *)(image->data + image->alt +
@@ -91,6 +92,10 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
	unsigned long addr;
	int ret = 0;
	static struct page *no_pages[] = {NULL};
+	static struct vm_special_mapping vvar_mapping = {
+		.name = "[vvar]",
+		.pages = no_pages,
+	};

	if (calculate_addr) {
		addr = vdso_addr(current->mm->start_stack,
@@ -112,21 +117,23 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
	/*
	 * MAYWRITE to allow gdb to COW and set breakpoints
	 */
-	ret = install_special_mapping(mm,
-				      addr,
-				      image->size,
-				      VM_READ|VM_EXEC|
-				      VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC

[PATCH 1/4] x86,vdso: Fix an OOPS accessing the hpet mapping w/o an hpet

2014-05-19 Thread Andy Lutomirski
The oops can be triggered in qemu using -no-hpet (but not nohpet) by
reading a couple of pages past the end of the vdso text.  This
should send SIGBUS instead of OOPSing.

The bug was introduced by:

commit 7a59ed415f5b57469e22e41fc4188d5399e0b194
Author: Stefani Seibold stef...@seibold.net
Date:   Mon Mar 17 23:22:09 2014 +0100

x86, vdso: Add 32 bit VDSO time support for 32 bit kernel

which is new in 3.15.

This will be fixed separately in 3.15, but that patch will not apply
to tip/x86/vdso.  This is the equivalent fix for tip/x86/vdso and,
presumably, 3.16.

Cc: Stefani Seibold stef...@seibold.net
Reported-by: Sasha Levin sasha.le...@oracle.com
Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 arch/x86/vdso/vma.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index e915eae..8ad0081 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -90,6 +90,7 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
	struct vm_area_struct *vma;
	unsigned long addr;
	int ret = 0;
+	static struct page *no_pages[] = {NULL};

	if (calculate_addr) {
		addr = vdso_addr(current->mm->start_stack,
@@ -125,7 +126,7 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
					   addr + image->size,
					   image->sym_end_mapping - image->size,
					   VM_READ,
-					   NULL);
+					   no_pages);
 
if (IS_ERR(vma)) {
ret = PTR_ERR(vma);
-- 
1.9.0



[PATCH 0/4] x86,mm: vdso fixes for an OOPS and /proc/PID/maps

2014-05-19 Thread Andy Lutomirski
[This applies to tip/x86/vdso.  Patch 1/4 is a resend.]

This fixes an OOPS on systems without an HPET and incomplete
information in /proc/PID/maps.

The latter is done by adding a new vm_ops callback to replace
arch_vma_name, which is inflexible and awkward to use correctly.

With this series applied, calling mremap on the vdso results in
sensible output in /proc/PID/maps and the vvar area shows up
correctly.  I don't want to guarantee that mremap on the vdso will
do anything sensible right now, but that's unchanged from before.
In fact, I suspect that mremapping the vdso on 32-bit tasks is
rather broken right now due to sigreturn.

In current kernels, mremapping the vdso blows away the name:
badc0de-badc0de2000 r-xp  00:00 0

Now it doesn't:
badc0de-badc0de1000 r-xp  00:00 0  [vdso]

As a followup, it might pay to replace install_special_mapping with
a new install_vdso_mapping function that hardcodes the [vdso]
name, to separately fix all the other arch_vma_name users (maybe
just ARM?) and then kill arch_vma_name completely.

NB: This touches core mm code.  I'd appreciate some review by the mm
folks.

Andy Lutomirski (4):
  x86,vdso: Fix an OOPS accessing the hpet mapping w/o an hpet
  mm,fs: Add vm_ops->name as an alternative to arch_vma_name
  x86,mm: Improve _install_special_mapping and fix x86 vdso naming
  x86,mm: Replace arch_vma_name with vm_ops->name for vsyscalls

 arch/x86/include/asm/vdso.h  |  6 ++-
 arch/x86/mm/init_64.c| 20 +-
 arch/x86/vdso/vdso2c.h   |  5 ++-
 arch/x86/vdso/vdso32-setup.c |  7 
 arch/x86/vdso/vma.c  | 26 -
 fs/binfmt_elf.c  |  8 
 fs/proc/task_mmu.c   |  6 +++
 include/linux/mm.h   | 10 -
 include/linux/mm_types.h |  6 +++
 mm/mmap.c| 89 +---
 10 files changed, 124 insertions(+), 59 deletions(-)

-- 
1.9.0



[PATCH 2/4] mm,fs: Add vm_ops->name as an alternative to arch_vma_name

2014-05-19 Thread Andy Lutomirski
arch_vma_name sucks.  It's a silly hack, and it's annoying to
implement correctly.  In fact, AFAICS, even the straightforward x86
implementation is incorrect (I suspect that it breaks if the vdso
mapping is split or gets remapped).

This adds a new vm_ops->name operation that can replace it.  The
followup patches will remove all uses of arch_vma_name on x86,
fixing a couple of annoyances in the process.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 fs/binfmt_elf.c| 8 
 fs/proc/task_mmu.c | 6 ++
 include/linux/mm.h | 6 ++
 3 files changed, 20 insertions(+)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index aa3cb62..df9ea41 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1108,6 +1108,14 @@ static bool always_dump_vma(struct vm_area_struct *vma)
/* Any vsyscall mappings? */
if (vma == get_gate_vma(vma->vm_mm))
return true;
+
+   /*
+* Assume that all vmas with a .name op should always be dumped.
+* If this changes, a new vm_ops field can easily be added.
+*/
+   if (vma->vm_ops && vma->vm_ops->name && vma->vm_ops->name(vma))
+   return true;
+
/*
 * arch_vma_name() returns non-NULL for special architecture mappings,
 * such as vDSO sections.
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 442177b..9b2f5d6 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -300,6 +300,12 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma, int is_pid)
goto done;
}
 
+   if (vma->vm_ops && vma->vm_ops->name) {
+   name = vma->vm_ops->name(vma);
+   if (name)
+   goto done;
+   }
+
name = arch_vma_name(vma);
if (!name) {
pid_t tid;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf9811e..63f8d4e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -239,6 +239,12 @@ struct vm_operations_struct {
 */
int (*access)(struct vm_area_struct *vma, unsigned long addr,
  void *buf, int len, int write);
+
+   /* Called by the /proc/PID/maps code to ask the vma whether it
+* has a special name.  Returning non-NULL will also cause this
+* vma to be dumped unconditionally. */
+   const char *(*name)(struct vm_area_struct *vma);
+
 #ifdef CONFIG_NUMA
/*
 * set_policy() op must add a reference to any non-NULL @new mempolicy
-- 
1.9.0



Re: [PATCH 3/4] x86,mm: Improve _install_special_mapping and fix x86 vdso naming

2014-05-20 Thread Andy Lutomirski
On Tue, May 20, 2014 at 10:21 AM, Cyrill Gorcunov gorcu...@gmail.com wrote:
 On Mon, May 19, 2014 at 03:58:33PM -0700, Andy Lutomirski wrote:
 Using arch_vma_name to give special mappings a name is awkward.  x86
 currently implements it by comparing the start address of the vma to
 the expected address of the vdso.  This requires tracking the start
 address of special mappings and is probably buggy if a special vma
 is split or moved.

 Improve _install_special_mapping to just name the vma directly.  Use
 it to give the x86 vvar area a name, which should make CRIU's life
 easier.

 As a side effect, the vvar area will show up in core dumps.  This
 could be considered weird and is fixable.  Thoughts?

 Cc: Cyrill Gorcunov gorcu...@openvz.org
 Cc: Pavel Emelyanov xe...@parallels.com
 Signed-off-by: Andy Lutomirski l...@amacapital.net

 Hi Andy, thanks a lot for this! I must confess I don't yet know how
 would we deal with compat tasks but this is 'must have' mark which
 allow us to detect vvar area!

Out of curiosity, how does CRIU currently handle checkpointing a
restored task?  In current kernels, the [vdso] name in maps goes
away after mremapping the vdso.

I suspect that you'll need kernel changes for compat tasks, since I
think that mremapping the vdso on any reasonably modern hardware in a
32-bit task will cause sigreturn to blow up.  This could be fixed by
making mremap magical, although adding a new prctl or arch_prctl to
reliably move the vdso might be a better bet.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH 3/4] x86,mm: Improve _install_special_mapping and fix x86 vdso naming

2014-05-20 Thread Andy Lutomirski
On Tue, May 20, 2014 at 10:47 AM, Cyrill Gorcunov gorcu...@gmail.com wrote:
 On Tue, May 20, 2014 at 10:24:49AM -0700, Andy Lutomirski wrote:
 On Tue, May 20, 2014 at 10:21 AM, Cyrill Gorcunov gorcu...@gmail.com wrote:
  On Mon, May 19, 2014 at 03:58:33PM -0700, Andy Lutomirski wrote:
  Using arch_vma_name to give special mappings a name is awkward.  x86
  currently implements it by comparing the start address of the vma to
  the expected address of the vdso.  This requires tracking the start
  address of special mappings and is probably buggy if a special vma
  is split or moved.
 
  Improve _install_special_mapping to just name the vma directly.  Use
  it to give the x86 vvar area a name, which should make CRIU's life
  easier.
 
  As a side effect, the vvar area will show up in core dumps.  This
  could be considered weird and is fixable.  Thoughts?
 
  Cc: Cyrill Gorcunov gorcu...@openvz.org
  Cc: Pavel Emelyanov xe...@parallels.com
  Signed-off-by: Andy Lutomirski l...@amacapital.net
 
  Hi Andy, thanks a lot for this! I must confess I don't yet know how
  would we deal with compat tasks but this is 'must have' mark which
  allow us to detect vvar area!

 Out of curiosity, how does CRIU currently handle checkpointing a
 restored task?  In current kernels, the [vdso] name in maps goes
 away after mremapping the vdso.

   We use not only [vdso] mark to detect vdso area but also page frame
 number of the living vdso. If mark is not present in procfs output
 we examine executable areas and check if pfn == vdso_pfn; it's
 a slow path because there might be a bunch of executable areas and
 touching every one of them is not fast, but we simply have no
 choice.

This patch should fix this issue, at least.  If there's still a way to
get a native vdso that doesn't say [vdso], please let me know.


   The situation gets worse when a task was dumped on one kernel and
 then restored on another kernel where the vdso content is different
 from the one saved in the image -- in such a case, as I mentioned, we
 need that named vdso proxy which redirects calls to the vdso of the
 system where the task is restoring. And when such a restored task gets
 checkpointed a second time we don't dump the new living vdso but save
 only the old vdso proxy on disk (detecting it is a different story; in
 short we inject a unique mark into the elf header).

Yuck.  But I don't know whether the kernel can help much here.



 I suspect that you'll need kernel changes for compat tasks, since I
 think that mremapping the vdso on any reasonably modern hardware in a
 32-bit task will cause sigreturn to blow up.  This could be fixed by
 making mremap magical, although adding a new prctl or arch_prctl to
 reliably move the vdso might be a better bet.

 Well, as far as I understand compat code uses abs addressing for
 vvar data and if vvar data position doesn't change we're safe,
 but at the same time, because vvar addresses are not ABI, I fear one
 day we will indeed hit problems and the only solution would be
 to use the kernel's help. But again, Andy, I didn't think much
 about implementing compat mode in criu yet so I might be
 missing some details.

Prior to 3.15, the compat code didn't have vvar data at all.  In 3.15
and up, the vvar data is accessed using PC-relative addressing, even
in compat mode (using the usual call; mov trick to read EIP).

--Andy


Re: [PATCH 3/4] x86,mm: Improve _install_special_mapping and fix x86 vdso naming

2014-05-20 Thread Andy Lutomirski
On Tue, May 20, 2014 at 11:18 AM, H. Peter Anvin h...@zytor.com wrote:
 On 05/20/2014 11:01 AM, Cyrill Gorcunov wrote:

 This patch should fix this issue, at least.  If there's still a way to
 get a native vdso that doesn't say [vdso], please let me know.

 Yes, having a native procfs way to detect vdso is much preferred!


 Is there any path by which we can end up with [vdso] without a leading
 slash in /proc/self/maps?  Otherwise, why is that not native?

Dunno.  But before this patch the reverse was possible: we can end up
with a vdso that doesn't say [vdso].


   The situation get worse when task was dumped on one kernel and
 then restored on another kernel where vdso content is different
 from one save in image -- is such case as I mentioned we need
 that named vdso proxy which redirect calls to vdso of the system
 where task is restoring. And when such restored task get checkpointed
 second time we don't dump new living vdso but save only old vdso
 proxy on disk (detecting it is a different story, in short we
 inject a unique mark into elf header).

 Yuck.  But I don't know whether the kernel can help much here.

 Some prctl which would tell the kernel to put the vdso at a specified address.
 We can live without it for now so not a big deal (yet ;)

 mremap() will do this for you.

Except that it's buggy: it doesn't change mm->context.vdso.  For
64-bit tasks, the only consumer outside exec was arch_vma_name, and
this patch removes even that.  For 32-bit tasks, though, it's needed
for signal delivery.

--Andy


Re: [PATCH 3/4] x86,mm: Improve _install_special_mapping and fix x86 vdso naming

2014-05-20 Thread Andy Lutomirski
On Tue, May 20, 2014 at 11:27 AM, H. Peter Anvin h...@zytor.com wrote:
 On 05/20/2014 11:24 AM, Andy Lutomirski wrote:
 On Tue, May 20, 2014 at 11:18 AM, H. Peter Anvin h...@zytor.com wrote:
 On 05/20/2014 11:01 AM, Cyrill Gorcunov wrote:

 This patch should fix this issue, at least.  If there's still a way to
  get a native vdso that doesn't say [vdso], please let me know.

 Yes, having a native procfs way to detect vdso is much preferred!


 Is there any path by which we can end up with [vdso] without a leading
 slash in /proc/self/maps?  Otherwise, why is that not native?

 Dunno.  But before this patch the reverse was possible: we can end up
 with a vdso that doesn't say [vdso].


 That's a bug, which is being fixed.  We can't go back in time and create
 new interfaces on old kernels.


   The situation get worse when task was dumped on one kernel and
 then restored on another kernel where vdso content is different
 from one save in image -- is such case as I mentioned we need
 that named vdso proxy which redirect calls to vdso of the system
 where task is restoring. And when such restored task get checkpointed
 second time we don't dump new living vdso but save only old vdso
 proxy on disk (detecting it is a different story, in short we
 inject a unique mark into elf header).

 Yuck.  But I don't know whether the kernel can help much here.

  Some prctl which would tell the kernel to put the vdso at a specified address.
 We can live without it for now so not a big deal (yet ;)

 mremap() will do this for you.

  Except that it's buggy: it doesn't change mm->context.vdso.  For
 64-bit tasks, the only consumer outside exec was arch_vma_name, and
 this patch removes even that.  For 32-bit tasks, though, it's needed
 for signal delivery.


 Again, a bug, let's fix it rather than saying we need a new interface.

What happens if someone remaps just part of the vdso?

Presumably we'd just track the position of the first page of the vdso,
but this might be hard to implement: I don't think there's any
callback from the core mm code for this.

--Andy


[RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-20 Thread Andy Lutomirski
Here's a real proposal for iret-less return.  If this is correct, then
NMIs will never nest, which will probably delete a lot more scariness
than is added by the code I'm describing.

The rest of this email is valid markdown :)  If I end up implementing
this, this text will go straight into Documentation/x86/x86_64.

tl;dr: The only particularly tricky cases are exit from #DB, #BP, and
#MC.  I think they're not so bad, though.

FWIW, if there's a way to read the NMI masking bit, this would be a
lot simpler.  I don't know of any way to do that, though.

`IRET`-less return
==

There are at least two ways that we can return from a trap entry:
`IRET` and `RET`.  They have a few important differences.

  * `IRET` is very slow on all current (2014) CPUs -- it seems to
take hundreds of cycles.  `RET` is fast.

  * `IRET` unconditionally unmasks NMIs.  `RET` never unmasks NMIs.

  * `IRET` can change `CS`, `RSP`, `SS`, `RIP`, and `RFLAGS`
atomically.  `RET` can't; it requires a return address on the
stack, and it can't apply anything other than a small offset to
the stack pointer.  It can, in theory, change `CS`, but this
seems unlikely to be helpful.

Times when we must use `IRET`
=

  * If we're returning to a different `CS` (i.e. if firmware is
doing something funny or if we're returning to userspace), then
`RET` won't help; we need to use `IRET` unless we're willing to
play fragile games with `SYSEXIT` or `SYSRET`.

  * If we are changing stacks, then we need to be extremely careful
about using `RET`: using `RET` requires that we put the target
`RIP` on the target stack, so the target stack must be valid.
This means that we cannot use `RET` if, for example, a `SYSCALL`
just happened.

  * If we're returning from NMI, then `IRET` is mandatory: we need to
unmask NMIs, and `IRET` is the only way to do that.

Note that, if `RFLAGS.IF` is set, then interrupts were enabled when
we trapped, so `RET` is safe.

Times when we must use `RET`


If there's an NMI on the stack, we must use `RET` until we're ready
to re-enable NMIs.

Assumptions
===

  * Neither the NMI, the MCE handler, nor anything that nests inside
them will ever change `CS` or run with an invalid stack.

  * Interrupts will never be enabled with an NMI on the stack.

  * We explicitly do not assume that we can reliably determine
whether we were on user `GS` or kernel `GS` when a trap happens.
In current (3.15) kernels we can tell, but if we ever enable
`WRGSBASE` then we will lose that ability.

  * The IST interrupts are: #DB, #BP, #NM, #DF, #SS, and #MC.

  * We can add a per-cpu variable `nmi_mce_nest_count` that is nonzero
whenever an NMI or MCE is on the stack.  We'll increment it at the
very beginning of the NMI handler and clear it at the very end.
We will also increment it in `do_machine_check` before doing
anything that can cause an interrupt.  The result is that the only
interrupt that can happen with `nmi_mce_nest_count == 0` in NMI
context is an MCE at the beginning or end of the NMI handler.


The algorithm
=

  1. If the target `CS` is not the standard 64-bit kernel CPL0
 selector, then never use `RET`.  This is safe: this will never
 happen with an NMI on the stack.

  2. If we are returning from a non-IST interrupt, then use `RET`.
 Non-IST interrupts use the interrupted code's stack, so the
 stack is always valid.

  3. If we are returning from #NM, then use `IRET`.

  4. If we are returning from #DF or #SS, then use `IRET`.  These
 interrupts cannot occur inside an NMI, or, at the very least,
 if they do happen, then they are not recoverable.

  5. If we are returning from #DB or #BP, then use `RET` if
 `nmi_mce_nest_count != 0` and `IRET` otherwise.

  6. If we are returning from #MC, use `IRET`, unless the return address is
 to the NMI entry or exit code, in which case we use `RET`.

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-20 Thread Andy Lutomirski
On Tue, May 20, 2014 at 7:27 PM, Steven Rostedt rost...@goodmis.org wrote:
 On Tue, 2014-05-20 at 17:53 -0700, Andy Lutomirski wrote:

 If there's an NMI on the stack, we must use `RET` until we're ready
 to re-enable NMIs.

 I'm a little confused by NMI on the stack. Do you mean NMI on the target
 stack? If so, please state that.

I mean that if we're in an NMI handler or in anything nested inside it.


   * We can add a per-cpu variable `nmi_mce_nest_count` that is nonzero
 whenever an NMI or MCE is on the stack.  We'll increment it at the
 very beginning of the NMI handler and clear it at the very end.
 We will also increment it in `do_machine_check` before doing
 anything that can cause an interrupt.  The result is that the only
 interrupt that can happen with `nmi_mce_nest_count == 0` in NMI
 context is an MCE at the beginning or end of the NMI handler.

 Just note that this will probably be done in the C code, as NMI has
 issues with gs being safe.

 Also, should we call it nmi specifically. Perhaps
 ist_stack_nest_count, stating that the stack is ist to match
 do_machine_check as well? Maybe that's not a good name either. Someone
 else can come up with something that's a little more generic than NMI?

So the issue here is that we can have an NMI followed immediately by
an MCE.  The MCE code can call force_sig, which could plausibly result
in a kprobe or something similar happening.  The return from that
needs to use IRET.

Since I don't see a clean way to reliably detect that we're inside an
NMI, I propose instead detecting when we're in *either* NMI or MCE,
hence the name.  As long as we mark do_machine_check and whatever asm
code calls it __kprobes, I think we'll be okay.

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On May 21, 2014 2:46 AM, Borislav Petkov b...@alien8.de wrote:

 On Tue, May 20, 2014 at 07:39:31PM -0700, Andy Lutomirski wrote:
  So the issue here is that we can have an NMI followed immediately by
  an MCE.

 That part might need clarification for me: #MC is higher prio interrupt
 than NMI so a machine check exception can interrupt the NMI handler at
 any point.

Except that NMI can interrupt #MC at any point as well, I think.


 But you're talking only about the small window when nmi_mce_nest_count
 hasn't been incremented yet, right? I.e., this:

 The result is that the only interrupt that can happen with
 `nmi_mce_nest_count == 0` in NMI context is an MCE at the beginning or
 end of the NMI handler.

 Correct?

Exactly.


 --
 Regards/Gruss,
 Boris.

 Sent from a fat crate under my desk. Formatting is fine.
 --


Re: 64bit x86: NMI nesting still buggy?

2014-05-21 Thread Andy Lutomirski
On May 21, 2014 7:58 AM, Vojtech Pavlik vojt...@suse.cz wrote:

 On Wed, May 21, 2014 at 04:20:55PM +0200, Borislav Petkov wrote:
  On Wed, May 21, 2014 at 03:42:24PM +0200, Jiri Kosina wrote:
   Alright, Andy's iret optimization efforts do immediately bring a
   followup question -- why is this not a problem with iret-based return
   from #MC possibly interrupting NMI?
 
  Yeah, and frankly, I don't see this nesting fun at all protected against
  a #MC interrupting it at any point actually. Because once the #MC
  handler returns, it goes into paranoid_exit and that place doesn't
  account for NMIs at all, AFAICS.
 
  Which would mean:
 
  * NMI goes off
  * MCE happens, we switch to machine_check which is paranoidzeroentry
  * #MC handler is done -> paranoid_exit -> IRET
  -> boom! Or if not boom, at least, the NMI gets forgotten.
 
  Am I missing something?

 I think to get a full BOOM you need a bit more complex process, namely:

 * NMI triggered
 * NMI handler starts
 * MCE happens
 * Second NMI triggered and queued
 * handler done, IRET
 * Second NMI handler starts and overwrites NMI return address on stack
 * Second NMI handler ends
 * First NMI handler ends and goes into an infinite IRET loop, always
   returning to the beginning of itself

 But you do have all the ingredients.

 And I don't see any other way out than not calling IRET for MCEs.

The MCE handler could detect this and fiddle with the IST entry.  This
sounds considerably uglier than returning via RET, though.

--Andy


 --
 Vojtech Pavlik
 Director SUSE Labs


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On May 21, 2014 5:51 AM, Jiri Kosina jkos...@suse.cz wrote:

 On Tue, 20 May 2014, Andy Lutomirski wrote:

  So the issue here is that we can have an NMI followed immediately by
  an MCE.  The MCE code can call force_sig

 This is interesting by itself. force_sig() takes siglock spinlock. This
 really looks like a deadlock sitting there waiting to happen.

ISTM the do_machine_check code ought to consider any kill-worthy MCE
from kernel space to be non-recoverable, but I want to keep the scope
of these patches under control.

That being said, if an MCE that came from CPL0 never tried to return,
this would be simpler.  I don't know enough about the machine check
architecture to know whether that's a reasonable thing to do.

--Andy


 --
 Jiri Kosina
 SUSE Labs



Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 9:30 AM, Borislav Petkov b...@alien8.de wrote:
 On Wed, May 21, 2014 at 08:21:08AM -0700, Andy Lutomirski wrote:
 On May 21, 2014 2:46 AM, Borislav Petkov b...@alien8.de wrote:
 
  On Tue, May 20, 2014 at 07:39:31PM -0700, Andy Lutomirski wrote:
   So the issue here is that we can have an NMI followed immediately by
   an MCE.
 
  That part might need clarification for me: #MC is higher prio interrupt
  than NMI so a machine check exception can interrupt the NMI handler at
  any point.

 Except that NMI can interrupt #MC at any point as well, I think.

 No, #MC is higher prio than NMI, actually even the highest along with
 RESET#. And come to think of it, all exceptions which have a higher prio
 than NMI should touch that nmi_mce_nest_count thing.

 See Table 8-8 here:

 http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/24593_APM_v21.pdf

 That's the prios before 3, i.e. the NMI one.

 HOWEVER, this all is spoken with the assumption that higher prio
 interrupts can interrupt the NMI handler too at the first instruction
 boundary after they've been recognized.

 The text is talking about simultaneous interrupts and not about
 interrupt handler preemption.

 But it must be because Steve wouldn't be dealing with exceptions in the
 NMI handler and nested NMIs otherwise...

I think that some of these exceptions are synchronous things (e.g.
int3 or page faults) that happen because the kernel caused them.

Anyway, going through the list:

Reset, INIT, and stpclk ought to be irrelevant -- we don't handle them anyway.

SMI is already supposedly correct wrt nesting inside NMI.

Debug register stuff should be handled in my outline.  Hopefully
correctly :)  We need to make sure that no breakpoints trip before the
nmi count is incremented, but that should be straightforward as long
as we don't do ridiculous things like poking at userspace addresses.
I don't know how kgdb/kdb fits in -- if someone sets a watchpoint on a
kernel address (e.g. the nesting count) or enables single-stepping,
we'll mess up.


It may pay to bump the nesting count inside the #DB and #BP handlers
and to check the RIP that we're returning to, but that starts to look
ugly, and we have to be careful about NMI, immediate breakpoint, and
them immediate MCE.  I'd rather just be able to say that there are
some very short windows in which a debug or breakpoint exception will
never happen.

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Tue, May 20, 2014 at 5:53 PM, Andy Lutomirski l...@amacapital.net wrote:
 Here's a real proposal for iret-less return.  If this is correct, then
 NMIs will never nest, which will probably delete a lot more scariness
 than is added by the code I'm describing.

OK, here's a case where I'm wrong.  An NMI interrupts userspace on a
16-bit stack.  The return from NMI goes through the espfix code.
Something interrupts while on the espfix stack.  Boom!  Neither return
style is particularly good.

More generally, if we got interrupted while on the espfix stack, we
need to return back there using IRET.  Fortunately, re-enabling NMIs
there is harmless, since we've already switched off the NMI stack.

This makes me think that maybe the logic should be turned around: have
some RIP ranges on which the kernel stack might be invalid (which
includes the espfix code and some of the syscall code) and use IRET
only on return from NMI, return to nonstandard CS, and return to these
special ranges.  The NMI code just needs to never do any of this stuff
unless it switches off the NMI stack first.

For this to work reliably, we'll probably have to change CS before
calling into EFI code.  That should be straightforward.

--Andy


Re: [PATCH 3.15] MIPS: Add new AUDIT_ARCH token for the N32 ABI on MIPS64

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 1:59 PM, Paul Moore pmo...@redhat.com wrote:
 On Monday, May 12, 2014 02:53:05 PM Paul Moore wrote:
 On Tuesday, April 22, 2014 03:40:36 PM Markos Chandras wrote:
  A MIPS64 kernel may support ELF files for all 3 MIPS ABIs
  (O32, N32, N64). Furthermore, the AUDIT_ARCH_MIPS{,EL}64 token
  does not provide enough information about the ABI for the 64-bit
  process. As a result of which, userland needs to use complex
  seccomp filters to decide whether a syscall belongs to the o32 or n32
  or n64 ABI. Therefore, a new arch token for MIPS64/n32 is added so it
  can be used by seccomp to explicitly set syscall filters for this ABI.
 
  Link: http://sourceforge.net/p/libseccomp/mailman/message/32239040/
  Cc: Andy Lutomirski l...@amacapital.net
  Cc: Eric Paris epa...@redhat.com
  Cc: Paul Moore pmo...@redhat.com
  Cc: Ralf Baechle r...@linux-mips.org
  Signed-off-by: Markos Chandras markos.chand...@imgtec.com
  ---
  Ralf, can we please have this in 3.15 (Assuming it's ACK'd)?
 
  Thanks a lot!
  ---
 
   arch/mips/include/asm/syscall.h |  2 ++
   include/uapi/linux/audit.h  | 12 
   2 files changed, 14 insertions(+)

 [NOTE: Adding lkml to the To line to hopefully spur discussion/acceptance as
 this *really* should be in 3.15]

 I'm re-replying to this patch and adding lkml to the To line because I
 believe it is very important we get this patch into 3.15.  For those who
 don't follow the MIPS architecture very closely, the upcoming 3.15 is the
 first release to include support for seccomp filters, the latest generation
 of syscall filtering, which uses a BPF-based filter language.  For reasons
 that are easy to understand, the syscall filters are ABI specific (e.g.
 syscall tables, word length, endianness) and those generating syscall
 filters in userspace (e.g. libseccomp) need to take great care to ensure
 that the generated filters take the ABI into account and fail safely in the
 case where a different ABI is used (e.g. x86, x86_64, x32).

 The patch below corrects, what is IMHO, an omission in the original MIPS
 seccomp filter patch, allowing userspace to easily separate MIPS and MIPS64.
 Without this patch we will be forced to handle MIPS/MIPS64 like we handle
 x86_64/x32 which is a royal pain and not something I want to have deal with
 again.

 Further, while I don't want to speak for the audit folks, it is my
 understanding that they want this patch for similar reasons.

 Please merge this patch for 3.15 or at least provide some feedback as to why
 this isn't a viable solution for upstream.  Once 3.15 ships, fixing this
 will require breaking the MIPS ABI which isn't something any of us want.

 Thanks,
 -Paul

 *Bump*

 I don't know what else needs to be done to get some action on this and we're
 running out of time for 3.15.

Reply to Linus' next -rc email.

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 2:25 PM, Jiri Kosina jkos...@suse.cz wrote:
 On Wed, 21 May 2014, Borislav Petkov wrote:

  ISTM the do_machine_check code ought to consider any kill-worthy MCE
  from kernel space to be non-recoverable, but I want to keep the scope
  of these patches under control.

 MCA has a bit called RIPV which, if set, signals that RIP is valid and
 it is safe to return provided we've taken proper care of handling even
 non-correctable errors (memory poisoning, etc).

 Yeah, but it tries to send SIGBUS from MCE context. And if MCE triggered
 at the time the CPU was already holding sighand->siglock for that
 particular task, it'll deadlock against itself.


If RIPV is set but we interrupted *kernel* code, SIGBUS doesn't seem
like the right solution anyway.

Are there any machine check exceptions for which it makes sense to
continue right where we left off without a signal?  Is CMIC such a
beast?  Can CMIC be delivered when interrupts are off?

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 2:45 PM, H. Peter Anvin h...@zytor.com wrote:
 Adding Tony.

 On 05/21/2014 02:43 PM, Borislav Petkov wrote:
 On Thu, May 22, 2014 at 06:37:26AM +0900, Linus Torvalds wrote:
 Seriously. If an NMI is interrupted by an MCE, you might as well
 consider the machine dead. Don't worry about it. We may or may not
 recover, but it is *not* our problem.

 I certainly like this way of handling it. We can even issue a nice
 banner saying something like You're f*cked - go change hw.


 Actually, it would be a lot better to panic than deadlock (HA systems
 tend to have something in place to catch the panic and/or reboot).  Any
 way we can see if the CPU is already holding that lock and panic in that
 case?


Is there anything actually wrong with just panicking if
!user_mode_vm(regs)?  That would make this a lot more sane.

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 2:48 PM, Borislav Petkov b...@alien8.de wrote:
 On Wed, May 21, 2014 at 02:35:59PM -0700, Andy Lutomirski wrote:
 If RIPV is set but we interrupted *kernel* code, SIGBUS doesn't seem
 like the right solution anyway.

 Are there any machine check exceptions for which it makes sense to
 continue right where we left off without a signal?  Is CMIC such a
 beast?  Can CMIC be delivered when interrupts are off?

 I think you mean CMCI and that's not even reported with an MCE exception
 - there's a separate APIC interrupt for that.

 I think this signal thing is for killing processes which have poisoned
 memory but this memory can be contained within that process and the
 physical page frame can be poisoned so that it doesn't get used ever
 again. In any case, this is an example for an uncorrectable error which
 needs action from us but doesn't necessarily have to kill the whole
 machine.

 This is supposed to be more graceful instead of consuming the corrupted
 data and sending it out to disk.

 But sending signals from #MC context is definitely a bad idea. I think
 we had addressed this with irq_work at some point but my memory is very
 hazy.

Why is it a problem if user_mode_vm(regs)?  Conversely, why is sending
a signal a remotely reasonable thing to do if !user_mode_vm(regs)?

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 2:55 PM, Borislav Petkov b...@alien8.de wrote:
 On Wed, May 21, 2014 at 02:52:55PM -0700, Andy Lutomirski wrote:
 Why is it a problem if user_mode_vm(regs)?  Conversely, why is sending
 a signal a remotely reasonable thing to do if !user_mode_vm(regs)?

 Let me quote Jiri:

 (1) task sends signal to itself
 (2) it acquires sighand->siglock so that it's able to queue the signal
 (3) MCE triggers

...and !user_mode_vm(regs), and hence we're IN_KERNEL, and we should
presumably just panic instead of trying to send a signal.

I missed the IN_KERNEL thing because I didn't realize that ->cs was
copied to struct mce.

--Andy


[PATCH 3/3] x86_64: Merge paranoidzeroentry_ist into idtentry

2014-05-21 Thread Andy Lutomirski
One more specialized entry function is now gone.  Again, this seems
to only change line numbers in entry_64.o.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 arch/x86/kernel/entry_64.S | 47 ++
 1 file changed, 22 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index eac9b81..be846d2 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1203,8 +1203,15 @@ apicinterrupt IRQ_WORK_VECTOR \
 /*
  * Exception entry points.
  */
-.macro idtentry sym do_sym has_error_code:req paranoid=0
+#define INIT_TSS_IST(x) PER_CPU_VAR(init_tss) + (TSS_ist + ((x) - 1) * 8)
+
+.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
 ENTRY(\sym)
+   /* Sanity check */
+   .if \shift_ist != -1 && \paranoid == 0
+   .error "using shift_ist requires paranoid=1"
+   .endif
+
.if \has_error_code
XCPT_FRAME
.else
@@ -1230,8 +1237,12 @@ ENTRY(\sym)
DEFAULT_FRAME 0
 
.if \paranoid
+   .if \shift_ist != -1
+   TRACE_IRQS_OFF_DEBUG/* reload IDT in case of recursion */
+   .else
TRACE_IRQS_OFF
.endif
+   .endif
 
movq %rsp,%rdi  /* pt_regs pointer */
 
@@ -1242,8 +1253,16 @@ ENTRY(\sym)
xorl %esi,%esi  /* no error code */
.endif
 
+   .if \shift_ist != -1
+   subq $EXCEPTION_STKSZ, INIT_TSS_IST(\shift_ist)
+   .endif
+
call \do_sym
 
+   .if \shift_ist != -1
+   addq $EXCEPTION_STKSZ, INIT_TSS_IST(\shift_ist)
+   .endif
+
.if \paranoid
jmp paranoid_exit   /* %ebx: no swapgs flag */
.else
@@ -1254,28 +1273,6 @@ ENTRY(\sym)
 END(\sym)
 .endm
 
-#define INIT_TSS_IST(x) PER_CPU_VAR(init_tss) + (TSS_ist + ((x) - 1) * 8)
-.macro paranoidzeroentry_ist sym do_sym ist
-ENTRY(\sym)
-   INTR_FRAME
-   ASM_CLAC
-   PARAVIRT_ADJUST_EXCEPTION_FRAME
-   pushq_cfi $-1   /* ORIG_RAX: no syscall to restart */
-   subq $ORIG_RAX-R15, %rsp
-   CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
-   call save_paranoid
-   DEFAULT_FRAME 0
-   TRACE_IRQS_OFF_DEBUG
-   movq %rsp,%rdi  /* pt_regs pointer */
-   xorl %esi,%esi  /* no error code */
-   subq $EXCEPTION_STKSZ, INIT_TSS_IST(\ist)
-   call \do_sym
-   addq $EXCEPTION_STKSZ, INIT_TSS_IST(\ist)
-   jmp paranoid_exit   /* %ebx: no swapgs flag */
-   CFI_ENDPROC
-END(\sym)
-.endm
-
 #ifdef CONFIG_TRACING
 .macro trace_idtentry sym do_sym has_error_code:req
 idtentry trace(\sym) trace(\do_sym) has_error_code=\has_error_code
@@ -1460,8 +1457,8 @@ apicinterrupt3 HYPERVISOR_CALLBACK_VECTOR \
  */
.pushsection .kprobes.text, ax
 
-paranoidzeroentry_ist debug do_debug DEBUG_STACK
-paranoidzeroentry_ist int3 do_int3 DEBUG_STACK
+idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
+idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
 idtentry stack_segment do_stack_segment has_error_code=1 paranoid=1
 #ifdef CONFIG_XEN
 idtentry xen_debug do_debug has_error_code=0
-- 
1.9.0



[PATCH 1/3] x86_64: Add missing 'DEFAULT_FRAME 0' entry annotations

2014-05-21 Thread Andy Lutomirski
The paranoidzeroentry macros were missing them.  I'm not at all
convinced that these annotations are correct and/or necessary, but
this makes the macros more consistent with each other.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 arch/x86/kernel/entry_64.S | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 1e96c36..39372ec 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1230,6 +1230,7 @@ ENTRY(\sym)
subq $ORIG_RAX-R15, %rsp
CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
call save_paranoid
+   DEFAULT_FRAME 0
TRACE_IRQS_OFF
movq %rsp,%rdi  /* pt_regs pointer */
xorl %esi,%esi  /* no error code */
@@ -1249,6 +1250,7 @@ ENTRY(\sym)
subq $ORIG_RAX-R15, %rsp
CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
call save_paranoid
+   DEFAULT_FRAME 0
TRACE_IRQS_OFF_DEBUG
movq %rsp,%rdi  /* pt_regs pointer */
xorl %esi,%esi  /* no error code */
-- 
1.9.0



[PATCH 2/3] x86_64: Merge most 64-bit asm entry macros

2014-05-21 Thread Andy Lutomirski
I haven't touched the device interrupt code, which is different
enough that it's probably not worth merging, and I haven't done
anything about paranoidzeroentry_ist yet.

This appears to produce an entry_64.o file that differs only in the
debug info line numbers.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 arch/x86/kernel/entry_64.S | 152 +++--
 1 file changed, 64 insertions(+), 88 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 39372ec..eac9b81 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -36,7 +36,7 @@
  * - FIXUP_TOP_OF_STACK/RESTORE_TOP_OF_STACK - Fix up the hardware stack
  * frame that is otherwise undefined after a SYSCALL
  * - TRACE_IRQ_* - Trace hard interrupt state for lock debugging.
- * - errorentry/paranoidentry/zeroentry - Define exception entry points.
+ * - idtentry - Define exception entry points.
  */
 
 #include linux/linkage.h
@@ -1203,39 +1203,53 @@ apicinterrupt IRQ_WORK_VECTOR \
 /*
  * Exception entry points.
  */
-.macro zeroentry sym do_sym
+.macro idtentry sym do_sym has_error_code:req paranoid=0
 ENTRY(\sym)
+   .if \has_error_code
+   XCPT_FRAME
+   .else
INTR_FRAME
-   ASM_CLAC
-   PARAVIRT_ADJUST_EXCEPTION_FRAME
-   pushq_cfi $-1   /* ORIG_RAX: no syscall to restart */
-   subq $ORIG_RAX-R15, %rsp
-   CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
-   call error_entry
-   DEFAULT_FRAME 0
-   movq %rsp,%rdi  /* pt_regs pointer */
-   xorl %esi,%esi  /* no error code */
-   call \do_sym
-   jmp error_exit  /* %ebx: no swapgs flag */
-   CFI_ENDPROC
-END(\sym)
-.endm
+   .endif
 
-.macro paranoidzeroentry sym do_sym
-ENTRY(\sym)
-   INTR_FRAME
ASM_CLAC
PARAVIRT_ADJUST_EXCEPTION_FRAME
-   pushq_cfi $-1   /* ORIG_RAX: no syscall to restart */
+
+   .ifeq \has_error_code
+   pushq_cfi $-1   /* ORIG_RAX: no syscall to restart */
+   .endif
+
subq $ORIG_RAX-R15, %rsp
CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
+
+   .if \paranoid
call save_paranoid
+   .else
+   call error_entry
+   .endif
+
DEFAULT_FRAME 0
+
+   .if \paranoid
TRACE_IRQS_OFF
-   movq %rsp,%rdi  /* pt_regs pointer */
-   xorl %esi,%esi  /* no error code */
+   .endif
+
+   movq %rsp,%rdi  /* pt_regs pointer */
+
+   .if \has_error_code
+   movq ORIG_RAX(%rsp),%rsi/* get error code */
+   movq $-1,ORIG_RAX(%rsp) /* no syscall to restart */
+   .else
+   xorl %esi,%esi  /* no error code */
+   .endif
+
call \do_sym
-   jmp paranoid_exit   /* %ebx: no swapgs flag */
+
+   .if \paranoid
+   jmp paranoid_exit   /* %ebx: no swapgs flag */
+   .else
+   jmp error_exit  /* %ebx: no swapgs flag */
+   .endif
+
CFI_ENDPROC
 END(\sym)
 .endm
@@ -1262,68 +1276,30 @@ ENTRY(\sym)
 END(\sym)
 .endm
 
-.macro errorentry sym do_sym
-ENTRY(\sym)
-   XCPT_FRAME
-   ASM_CLAC
-   PARAVIRT_ADJUST_EXCEPTION_FRAME
-   subq $ORIG_RAX-R15, %rsp
-   CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
-   call error_entry
-   DEFAULT_FRAME 0
-   movq %rsp,%rdi  /* pt_regs pointer */
-   movq ORIG_RAX(%rsp),%rsi/* get error code */
-   movq $-1,ORIG_RAX(%rsp) /* no syscall to restart */
-   call \do_sym
-   jmp error_exit  /* %ebx: no swapgs flag */
-   CFI_ENDPROC
-END(\sym)
-.endm
-
 #ifdef CONFIG_TRACING
-.macro trace_errorentry sym do_sym
-errorentry trace(\sym) trace(\do_sym)
-errorentry \sym \do_sym
+.macro trace_idtentry sym do_sym has_error_code:req
+idtentry trace(\sym) trace(\do_sym) has_error_code=\has_error_code
+idtentry \sym \do_sym has_error_code=\has_error_code
 .endm
 #else
-.macro trace_errorentry sym do_sym
-errorentry \sym \do_sym
+.macro trace_idtentry sym do_sym has_error_code:req
+idtentry \sym \do_sym has_error_code=\has_error_code
 .endm
 #endif
 
-   /* error code is on the stack already */
-.macro paranoiderrorentry sym do_sym
-ENTRY(\sym)
-   XCPT_FRAME
-   ASM_CLAC
-   PARAVIRT_ADJUST_EXCEPTION_FRAME
-   subq $ORIG_RAX-R15, %rsp
-   CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
-   call save_paranoid
-   DEFAULT_FRAME 0
-   TRACE_IRQS_OFF
-   movq %rsp,%rdi  /* pt_regs pointer */
-   movq ORIG_RAX(%rsp),%rsi/* get error code */
-   movq $-1,ORIG_RAX(%rsp) /* no syscall to restart */
-   call \do_sym
-   jmp paranoid_exit   /* %ebx: no swapgs flag */
-   CFI_ENDPROC
-END(\sym)
-.endm
-
-zeroentry divide_error do_divide_error
-zeroentry overflow do_overflow
-zeroentry bounds do_bounds
-zeroentry invalid_op do_invalid_op
-zeroentry device_not_available

[PATCH 0/3] x86_64: Merge (paranoid)?(zero|error)entry(_idt)?

2014-05-21 Thread Andy Lutomirski
Inspired by the RET vs IRET discussion, I thought about how much of a
mess the current entry macros are and I shuddered at the thought of
trying to modify them.

This little series merges them all into one idtentry macro that has
arguments that specify all the various weird behaviors.  It has a major
benefit: you can now look at the code and see how, say, the
has_error_code and !has_error_code cases differ.

Andy Lutomirski (3):
  x86_64: Add missing 'DEFAULT_FRAME 0' entry annotations
  x86_64: Merge most 64-bit asm entry macros
  x86_64: Merge paranoidzeroentry_ist into idtentry

 arch/x86/kernel/entry_64.S | 185 -
 1 file changed, 80 insertions(+), 105 deletions(-)

-- 
1.9.0



Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 3:01 PM, Luck, Tony tony.l...@intel.com wrote:
 But sending signals from #MC context is definitely a bad idea. I think
 we had addressed this with irq_work at some point but my memory is very
 hazy.

 We added code for recoverable errors to get out of the MC context
 before trying to lookup the page and send the signal.  Bottom of
 do_machine_check():

 if (cfg->tolerant < 3) {
         if (no_way_out)
                 mce_panic("Fatal machine check on current CPU", &m, msg);
         if (worst == MCE_AR_SEVERITY) {
                 /* schedule action before return to userland */
                 mce_save_info(m.addr, m.mcgstatus & MCG_STATUS_RIPV);
                 set_thread_flag(TIF_MCE_NOTIFY);
         } else if (kill_it) {
                 force_sig(SIGBUS, current);
         }
 }

 That TIF_MCE_NOTIFY prevents the return to user mode, and we end up in 
 mce_notify_process().

Why is this necessary?

If the MCE hit kernel code, then we're going to die anyway.  If the
MCE hit user code, then we should be in a completely sensible context
and we can just send the signal.

--Andy


 The force_sig() there is legacy code - and perhaps should just move off to 
 mce_notify_process()
 too (need to save worst so it will know what to do).

 -Tony



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 3:17 PM, Borislav Petkov b...@alien8.de wrote:
 On Wed, May 21, 2014 at 03:13:16PM -0700, Andy Lutomirski wrote:
 Why is this necessary?

 If the MCE hit kernel code, then we're going to die anyway.  If the
 MCE hit user code, then we should be in a completely sensible context
 and we can just send the signal.

 Are we guaranteed that the first thing the process will execute when
 scheduled back in are the signal handlers?

It's not even scheduled out, right?  This should be just like a signal
from a failed page fault, I think.


 And besides, maybe we don't even want to allow to do the switch_to() but
 kill it while it is sleeping.

What switch_to?

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 3:18 PM, Luck, Tony tony.l...@intel.com wrote:
 That TIF_MCE_NOTIFY prevents the return to user mode, and we end up in 
 mce_notify_process().

 Why is this necessary?

 The recovery path has to do more than just send a signal - it needs to walk 
 processes and
 mms to see which have mapped the physical address that the h/w told us has 
 gone bad.

I still feel like I'm missing something.  If we interrupted user space
code, then the context we're in should be identical to the context
we'll get when we're about to return to userspace.

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 3:25 PM, Andi Kleen a...@firstfloor.org wrote:

 Seems like a lot of effort and risk to essentially only optimize in kernel
 interrupt handlers.

The idea is that it might allow us to remove a bunch of scary nested
NMI code as well as speeding things up.


 AFAIK the most interesting cases (like user page faults) are not
 affected at all. Usually most workloads don't spend all that much time
 in the kernel, so it won't help most interrupts.

 I suspect the only case that's really interesting here is interrupting
 idle. Maybe it would be possible to do some fast path in this case only.

 However idle currently has so much overhead that I suspect that there
 are lower hanging fruit elsewhere.

I will gladly buy a meal or beverage for whomever fixes the ttwu stuff
to stop sending IPIs to idle CPUs, which will help a lot.

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 3:32 PM, Luck, Tony tony.l...@intel.com wrote:
 The recovery path has to do more than just send a signal - it needs to walk 
 processes and
 mms to see which have mapped the physical address that the h/w told us 
 has gone bad.

 I still feel like I'm missing something.  If we interrupted user space
 code, then the context we're in should be identical to the context
 we'll get when we're about to return to userspace.

 True. And this far along in do_machine_check() we have set all the other cpus
 free, so they are heading back to whatever context we interrupted them in. So
 we might be able to do all that other stuff inline here ... we interrupted 
 user
 mode, so we know we don't hold any locks. Other cpus are running, so they can
 complete what they are doing to release any locks we might need.

 But it will take a while (to scan all those processes). And we haven't yet
 cleared MCG_STATUS ... so another machine check before we do that
 would be fatal (x86 doesn't allow nesting).  Even if we moved the work
 after the clear of MCG_STATUS we'd still be vulnerable to a new machine
 check on x86_64 because we are sitting on the one & only machine check
 stack.

But if we get a new MCE in here, it will be an MCE from kernel context
and it's fatal.  So, yes, we'll clobber the stack, but we'll never
return (unless tolerant is set to something insane), so who cares?

Anyway, I care less about this now that I don't have to worry about it
re: IRET :)

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 3:36 PM, H. Peter Anvin h...@zytor.com wrote:
 On 05/21/2014 11:11 AM, Andy Lutomirski wrote:
 On Tue, May 20, 2014 at 5:53 PM, Andy Lutomirski l...@amacapital.net wrote:
 Here's a real proposal for iret-less return.  If this is correct, then
 NMIs will never nest, which will probably delete a lot more scariness
 than is added by the code I'm describing.

 OK, here's a case where I'm wrong.  An NMI interrupts userspace on a
 16-bit stack.  The return from NMI goes through the espfix code.
 Something interrupts while on the espfix stack.  Boom!  Neither return
 style is particularly good.

 More generally, if we got interrupted while on the espfix stack, we
 need to return back there using IRET.  Fortunately, re-enabling NMIs
 there is harmless, since we've already switched off the NMI stack.

 This makes me think that maybe the logic should be turned around: have
 some RIP ranges on which the kernel stack might be invalid (which
 includes the espfix code and some of the syscall code) and use IRET
 only on return from NMI, return to nonstandard CS, and return to these
 special ranges.  The NMI code just needs to never do any of this stuff
 unless it switches off the NMI stack first.

 For this to work reliably, we'll probably have to change CS before
 calling into EFI code.  That should be straightforward.


 I think you are onto something here.

 In particular, the key observation here is that inside the kernel, we
 can never *both* have an invalid stack *and* be inside an NMI, #MC or
 #DB handler, even if nested.

Except for espfix :)


 Now, does this prevent us from using RET in the common case?  I'm not
 sure it is a huge loss since kernel-to-kernel is relatively rare.

I don't think so.  The most common case should be plain old interrupts
and I suspect that #PF is a distant second.

In any event, plain old interrupts and #PF are non-IST interrupts and
they should be unconditionally safe for RET.

--Andy


Re: [x86, vdso] cfda7bb9ecb: +14.7% will-it-scale.per_thread_ops

2014-05-21 Thread Andy Lutomirski
On Mon, May 19, 2014 at 10:59 PM, Jet Chen jet.c...@intel.com wrote:
 Hi Andy,

 FYI, we noticed the below changes on

 git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/vdso
 commit cfda7bb9ecbf9d96264bb5bade33a842966d1062 (x86, vdso: Move syscall
 and sysenter setup into kernel/cpu/common.c)

This is really strange.  That change shouldn't have any effect except
that some kernels might boot a cycle or two faster.


 test case: nhm4/will-it-scale/sched_yield

  3d7ee969bffcc98   cfda7bb9ecbf9d96264bb5bad
  ---------------   -------------------------
      5497021 ~ 0%   +14.7%   6303424 ~ 0%   TOTAL will-it-scale.per_thread_ops
         0.54 ~ 0%    +5.6%      0.57 ~ 0%   TOTAL will-it-scale.scalability
      6209483 ~ 0%    +1.6%   6305917 ~ 0%   TOTAL will-it-scale.per_process_ops
         2455 ~ 5%   +16.9%      2870 ~ 5%   TOTAL cpuidle.C1-NHM.usage
         8829 ~ 7%   +15.2%     10169 ~10%   TOTAL slabinfo.kmalloc-64.active_objs
        24.13 ~12%   +48.9%     35.93 ~14%   TOTAL time.user_time
          393 ~ 0%    -3.0%       382 ~ 1%   TOTAL time.system_time


Is this a speedup or a slowdown?

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 3:48 PM, Borislav Petkov b...@alien8.de wrote:
 On Wed, May 21, 2014 at 03:39:11PM -0700, Andy Lutomirski wrote:
 But if we get a new MCE in here, it will be an MCE from kernel context
 and it's fatal. So, yes, we'll clobber the stack, but we'll never
 return (unless tolerant is set to something insane), so who cares?

 Ok, but we still have to do the work before returning to the process. So
 if not mce_notify_process() how else are you suggesting we do this?

I'm suggesting that you re-enable interrupts and do the work in
do_machine_check.  I think it'll just work.  It might pay to set a
flag so that you panic very loudly if do_machine_check recurses.

I suspect that, if the hardware is generating machine checks while
doing memory poisoning, the hardware is broken enough that even
panicking might not work, though :)

--Andy


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 4:05 PM, Luck, Tony tony.l...@intel.com wrote:
 On Wed, May 21, 2014 at 03:39:11PM -0700, Andy Lutomirski wrote:
 But if we get a new MCE in here, it will be an MCE from kernel context
 and it's fatal. So, yes, we'll clobber the stack, but we'll never
 return (unless tolerant is set to something insane), so who cares?

 Remember that machine checks are broadcast.  So some other cpu
 can hit a recoverable machine check in user mode ... but that int#18
 goes everywhere.  Other cpus are innocent bystanders ... they will
 see MCG_STATUS.RIPV=1, MCG_STATUS.EIPV=0 and nothing important
 in any of their machine check banks.

 But if we are still finishing off processing the previous machine check,
 this will be a nested one - and BOOM, we are dead.

Oh.  Well, crap.

FWIW, this means that there really is a problem if one of these #MC
errors hits an innocent bystander who just happens to be handling an
NMI, at least if we delete the nested NMI code.  But I think my
simplified proposal gets this right.


 -Tony

 [If you peer closely at the latest edition of the SDM - you'll see the
 bits are defined for a non-broadcast model ... e.g. LMCE_S bit in
 MCG_STATUS, but currently shipping silicon doesn't use that]



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [RFC] x86_64: A real proposal for iret-less return to kernel

2014-05-21 Thread Andy Lutomirski
On Wed, May 21, 2014 at 4:51 PM, Borislav Petkov b...@alien8.de wrote:
 On Thu, May 22, 2014 at 08:30:33AM +0900, Linus Torvalds wrote:
 If the OS then decides to take down the whole machine, the OS - not
 the hardware - can choose to do something that will punch through
 other CPU's NMI blocking (notably, init/reset), but the hardware doing
 this on its own is just broken if true.

 Not that it is any consolation but MCE is not broadcast on AMD.

 Regardless, exceptions like MCE cannot be held pending and do pierce the
 NMI handler on both.

 Now, if the NMI handler experiences a non-broadcast MCE on the same CPU,
 while running, we're simply going to panic as we're in kernel space
 anyway.

 The only problem is if the NMI handler gets interrupted while running
 on a bystander CPU. And I think we could deal with this because the
 bystander would not see an MCE and will return safely. We just need
 to make sure that it returns back to the said NMI handler and not to
 userspace. Unless I'm missing something ...

Under my "always RET unless returning from IST to weird CS or to
specific known-invalid-stack regions" proposal, this should work fine.
In the current code it'll also work fine *unless* it hits really early
in the NMI, in which case a second NMI can kill us.

--Andy


[PATCH 0/3] x86_64,vsyscall: Minor cleanups and fixes

2014-05-22 Thread Andy Lutomirski
This turns the vsyscall page all the way off in vsyscall=none mode
and fixes a bug in warn_bad_vsyscall that resulted in corrupt log
lines.

With this series applied, vsyscall=none results in a tidier
/proc/PID/maps.

Andy Lutomirski (3):
  x86_64,vsyscall: Move all of the gate_area code to vsyscall_64.c
  x86_64,vsyscall: Turn vsyscalls all the way off when vsyscall=none
  x86_64,vsyscall: Fix warn_bad_vsyscall log output

 arch/x86/kernel/vsyscall_64.c | 69 ++-
 arch/x86/mm/init_64.c | 49 --
 2 files changed, 61 insertions(+), 57 deletions(-)

-- 
1.9.0



[PATCH 3/3] x86_64,vsyscall: Fix warn_bad_vsyscall log output

2014-05-22 Thread Andy Lutomirski
This commit:

commit c767a54ba0657e52e6edaa97cbe0b0a8bf1c1655
Author: Joe Perches j...@perches.com
Date:   Mon May 21 19:50:07 2012 -0700

x86/debug: Add KERN_<LEVEL> to bare printks, convert printks to pr_<level>

caused warn_bad_vsyscall to output garbage in the middle of the
line.  Revert the bad part of it.

The printk in question isn't actually bare; the level is "%s".

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 arch/x86/kernel/vsyscall_64.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 8d38eb5..6463f9e 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -81,10 +81,10 @@ static void warn_bad_vsyscall(const char *level, struct pt_regs *regs,
if (!show_unhandled_signals)
return;
 
-	pr_notice_ratelimited("%s%s[%d] %s ip:%lx cs:%lx sp:%lx ax:%lx si:%lx di:%lx\n",
-			      level, current->comm, task_pid_nr(current),
-			      message, regs->ip, regs->cs,
-			      regs->sp, regs->ax, regs->si, regs->di);
+	printk_ratelimited("%s%s[%d] %s ip:%lx cs:%lx sp:%lx ax:%lx si:%lx di:%lx\n",
+			   level, current->comm, task_pid_nr(current),
+			   message, regs->ip, regs->cs,
+			   regs->sp, regs->ax, regs->si, regs->di);
 }
 
 static int addr_to_vsyscall_nr(unsigned long addr)
-- 
1.9.0



[PATCH 1/3] x86_64,vsyscall: Move all of the gate_area code to vsyscall_64.c

2014-05-22 Thread Andy Lutomirski
This code exists for the sole purpose of making the vsyscall page
look sort of like real userspace memory.  Move it so that it lives
with the rest of the vsyscall code.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 arch/x86/kernel/vsyscall_64.c | 49 +++
 arch/x86/mm/init_64.c | 49 ---
 2 files changed, 49 insertions(+), 49 deletions(-)

diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index ea5b570..ad84894 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -284,6 +284,55 @@ sigsegv:
 }
 
 /*
+ * A pseudo VMA to allow ptrace access for the vsyscall page.  This only
+ * covers the 64bit vsyscall page now. 32bit has a real VMA now and does
+ * not need special handling anymore:
+ */
+static const char *gate_vma_name(struct vm_area_struct *vma)
+{
+   return "[vsyscall]";
+}
+static struct vm_operations_struct gate_vma_ops = {
+   .name = gate_vma_name,
+};
+static struct vm_area_struct gate_vma = {
+   .vm_start   = VSYSCALL_ADDR,
+   .vm_end = VSYSCALL_ADDR + PAGE_SIZE,
+   .vm_page_prot   = PAGE_READONLY_EXEC,
+   .vm_flags   = VM_READ | VM_EXEC,
+   .vm_ops = &gate_vma_ops,
+};
+
+struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
+{
+#ifdef CONFIG_IA32_EMULATION
+   if (!mm || mm->context.ia32_compat)
+   return NULL;
+#endif
+   return &gate_vma;
+}
+
+int in_gate_area(struct mm_struct *mm, unsigned long addr)
+{
+   struct vm_area_struct *vma = get_gate_vma(mm);
+
+   if (!vma)
+   return 0;
+
+   return (addr >= vma->vm_start) && (addr < vma->vm_end);
+}
+
+/*
+ * Use this when you have no reliable mm, typically from interrupt
+ * context. It is less reliable than using a task's mm and may give
+ * false positives.
+ */
+int in_gate_area_no_mm(unsigned long addr)
+{
+   return (addr & PAGE_MASK) == VSYSCALL_ADDR;
+}
+
+/*
  * Assume __initcall executes before all user space. Hopefully kmod
  * doesn't violate that. We'll find out if it does.
  */
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index bdcde58..1390b7b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1180,55 +1180,6 @@ int kern_addr_valid(unsigned long addr)
return pfn_valid(pte_pfn(*pte));
 }
 
-/*
- * A pseudo VMA to allow ptrace access for the vsyscall page.  This only
- * covers the 64bit vsyscall page now. 32bit has a real VMA now and does
- * not need special handling anymore:
- */
-static const char *gate_vma_name(struct vm_area_struct *vma)
-{
-   return "[vsyscall]";
-}
-static struct vm_operations_struct gate_vma_ops = {
-   .name = gate_vma_name,
-};
-static struct vm_area_struct gate_vma = {
-   .vm_start   = VSYSCALL_ADDR,
-   .vm_end = VSYSCALL_ADDR + PAGE_SIZE,
-   .vm_page_prot   = PAGE_READONLY_EXEC,
-   .vm_flags   = VM_READ | VM_EXEC,
-   .vm_ops = &gate_vma_ops,
-};
-
-struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
-{
-#ifdef CONFIG_IA32_EMULATION
-   if (!mm || mm->context.ia32_compat)
-   return NULL;
-#endif
-   return &gate_vma;
-}
-
-int in_gate_area(struct mm_struct *mm, unsigned long addr)
-{
-   struct vm_area_struct *vma = get_gate_vma(mm);
-
-   if (!vma)
-   return 0;
-
-   return (addr >= vma->vm_start) && (addr < vma->vm_end);
-}
-
-/*
- * Use this when you have no reliable mm, typically from interrupt
- * context. It is less reliable than using a task's mm and may give
- * false positives.
- */
-int in_gate_area_no_mm(unsigned long addr)
-{
-   return (addr & PAGE_MASK) == VSYSCALL_ADDR;
-}
-
 #ifdef CONFIG_X86_UV
 unsigned long memory_block_size_bytes(void)
 {
-- 
1.9.0



[PATCH 2/3] x86_64,vsyscall: Turn vsyscalls all the way off when vsyscall=none

2014-05-22 Thread Andy Lutomirski
I see no point in having an unusable read-only page sitting at
0xffffffffff600000 when vsyscall=none.  Instead, skip mapping it and
remove it from /proc/PID/maps.

I kept the ratelimited warning when programs try to use a vsyscall
in this mode, since it may help admins avoid confusion.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 arch/x86/kernel/vsyscall_64.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index ad84894..8d38eb5 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -309,6 +309,8 @@ struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
if (!mm || mm->context.ia32_compat)
return NULL;
 #endif
+   if (vsyscall_mode == NONE)
+   return NULL;
return &gate_vma;
 }
 
@@ -329,7 +331,7 @@ int in_gate_area(struct mm_struct *mm, unsigned long addr)
  */
 int in_gate_area_no_mm(unsigned long addr)
 {
-   return (addr & PAGE_MASK) == VSYSCALL_ADDR;
+   return vsyscall_mode != NONE && (addr & PAGE_MASK) == VSYSCALL_ADDR;
 }
 
 /*
@@ -380,10 +382,12 @@ void __init map_vsyscall(void)
extern char __vsyscall_page;
unsigned long physaddr_vsyscall = __pa_symbol(__vsyscall_page);
 
-   __set_fixmap(VSYSCALL_PAGE, physaddr_vsyscall,
-vsyscall_mode == NATIVE
-? PAGE_KERNEL_VSYSCALL
-: PAGE_KERNEL_VVAR);
+   if (vsyscall_mode != NONE)
+   __set_fixmap(VSYSCALL_PAGE, physaddr_vsyscall,
+vsyscall_mode == NATIVE
+? PAGE_KERNEL_VSYSCALL
+: PAGE_KERNEL_VVAR);
+
BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
 (unsigned long)VSYSCALL_ADDR);
 }
-- 
1.9.0



Pondering per-process vsyscall disablement

2014-05-22 Thread Andy Lutomirski
It would be nice to have a way for new programs to declare that they
don't need vsyscalls.  What's the right way to do this?  An ELF header
entry in the loader?  An ELF header entry in the program?  A new
arch_prctl?

As background, there's an old part of the x86_64 ABI that allows
programs to do gettimeofday, clock_gettime, and getcpu by calling to
fixed addresses of the form 0xffffffffff600n00 where n indicates which
of those three syscalls is being invoked.  This is a security issue.

Since Linux 3.1, vsyscalls are emulated using NX and page faults.  As
a result, vsyscalls no longer offer any performance advantage over
normal syscalls; in fact, they're much slower.  As far as I know,
nothing newer than 2012 will attempt to use vsyscalls if a vdso is
present.  (Sadly, a lot of things will still fall back to the vsyscall
page if there is no vdso, but that shouldn't matter, since there is
always a vdso.)

Despite the emulation, they could still be used as a weird form of ROP
gadget that lives at a fixed address.  I'd like to offer a way for new
runtimes to indicate that they don't use vsyscalls so that the kernel
can selectively disable emulation and remove the fixed-address
executable code issue.


--Andy


Re: [PATCH v5 4/6] seccomp: move no_new_privs into seccomp

2014-05-22 Thread Andy Lutomirski
On Thu, May 22, 2014 at 4:05 PM, Kees Cook keesc...@chromium.org wrote:
 Since seccomp transitions between threads requires updates to the
 no_new_privs flag to be atomic, changes must be atomic. This moves the nnp
 flag into the seccomp field as a separate unsigned long for atomic access.

 Signed-off-by: Kees Cook keesc...@chromium.org

Acked-by: Andy Lutomirski l...@amacapital.net


Re: [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC

2014-05-22 Thread Andy Lutomirski
On Thu, May 22, 2014 at 4:05 PM, Kees Cook keesc...@chromium.org wrote:
 Applying restrictive seccomp filter programs to large or diverse
 codebases often requires handling threads which may be started early in
 the process lifetime (e.g., by code that is linked in). While it is
 possible to apply permissive programs prior to process start up, it is
 difficult to further restrict the kernel ABI to those threads after that
 point.

 This change adds a new seccomp extension action for synchronizing thread
 group seccomp filters and a prctl() for accessing that functionality,
 as well as a flag for SECCOMP_EXT_ACT_FILTER to perform sync at filter
 installation time.

 When calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER,
 flags, filter) with flags containing SECCOMP_FILTER_TSYNC, or when calling
 prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0), it
 will attempt to synchronize all threads in current's threadgroup to its
 seccomp filter program. This is possible iff all threads are using a filter
 that is an ancestor to the filter current is attempting to synchronize to.
 NULL filters (where the task is running as SECCOMP_MODE_NONE) are also
 treated as ancestors allowing threads to be transitioned into
 SECCOMP_MODE_FILTER. If prctl(PR_SET_NO_NEW_PRIVS, ...) has been set on the
 calling thread, no_new_privs will be set for all synchronized threads too.
 On success, 0 is returned. On failure, the pid of one of the failing threads
 will be returned, with as many filters installed as possible.

Is there a use case for adding a filter and synchronizing filters
being separate operations?  If not, I think this would be easier to
understand and to use if there was just a single operation.

If you did that, you'd have to decide whether to continue requiring
that all the other threads have a filter that's an ancestor of the
current thread's filter.

--Andy


Re: futex(2) man page update help request

2014-05-14 Thread Andy Lutomirski
On Wed, May 14, 2014 at 5:28 PM, H. Peter Anvin h...@zytor.com wrote:
 On 05/14/2014 01:56 PM, Davidlohr Bueso wrote:

 However, unless I'm sorely mistaken, the larger problem is that glibc
 removed the futex() call entirely, so these man pages don't describe

 I don't think futex() ever was in glibc--that's by design, and
 completely understandable: no user-space application would want to
 directly use futex().

 That's actually not quite true. There are plenty of software efforts out
 there that use futex calls directly to implement userspace serialization
 mechanisms as an alternative to the bulky sysv semaphores. I worked
 closely with an in-memory DB project that makes heavy use of them. Not
 everyone can simply rely on pthreads.


 More fundamentally, futex(2), like clone(2), are things that can be
 legitimately by user space without automatically breaking all of glibc.

I'm lost -- I think the missing verb is important :)

  There are some other things where that is *not* true, because glibc
 relies on being able to mediate all accesses to a kernel facility, but
 not here.

--Andy


Re: mm: NULL ptr deref handling mmaping of special mappings

2014-05-15 Thread Andy Lutomirski
On May 14, 2014 8:36 PM, Pavel Emelyanov xe...@parallels.com wrote:

 On 05/15/2014 02:23 AM, Andy Lutomirski wrote:
  On Wed, May 14, 2014 at 3:11 PM, Cyrill Gorcunov gorcu...@gmail.com wrote:
  On Wed, May 14, 2014 at 02:33:54PM -0700, Andy Lutomirski wrote:
  On Wed, May 14, 2014 at 2:31 PM, Andrew Morton
  a...@linux-foundation.org wrote:
  On Wed, 14 May 2014 17:11:00 -0400 Sasha Levin sasha.le...@oracle.com 
  wrote:
 
  In my linux-next all that code got deleted by Andy's x86, vdso:
  Reimplement vdso.so preparation in build-time C anyway.  What kernel
  were you looking at?
 
  Deleted? It appears in today's -next. arch/x86/vdso/vma.c:124 .
 
  I don't see Andy's patch removing that code either.
 
  ah, OK, it got moved from arch/x86/vdso/vdso32-setup.c into
  arch/x86/vdso/vma.c.
 
  Maybe you managed to take a fault against the symbol area between the
  _install_special_mapping() and the remap_pfn_range() call, but mmap_sem
  should prevent that.
 
  Or the remap_pfn_range() call never happened.  Should map_vdso() be
  running _install_special_mapping() at all if
  image-sym_vvar_page==NULL?
 
  I'm confused: are we talking about 3.15-rcsomething or linux-next?
  That code changed.
 
  Would this all make more sense if there were just a single vma in
  here?  cc: Pavel and Cyrill, who might have to deal with this stuff in
  CRIU
 
  Well, for criu we've not modified any vdso kernel's code (except
  setting VM_SOFTDIRTY for this vdso VMA in _install_special_mapping).
  And never experienced problems Sasha points. Looks like indeed in
  -next code is pretty different from mainline one. To figure out
  why I need to fetch -next branch and get some research. I would
  try to do that tomorrow (still hoping someone more experienced
  in mm system would beat me on that).
 
  I can summarize:
 
  On 3.14 and before, the vdso is just a bunch of ELF headers and
  executable data.  When executed by 64-bit binaries, it reads from the
  fixmap to do its thing.  That is, it reads from kernel addresses that
  don't have vmas.  When executed by 32-bit binaries, it doesn't read
  anything, since there was no 32-bit timing code.
 
  On 3.15, the x86_64 vdso is unchanged.  The 32-bit vdso is preceded by
  a separate vma containing two pages worth of time-varying read-only
  data.  The vdso reads those pages using PIC references.
 
  On linux-next, all vdsos work the same way.  There are two vmas.  The
  first vma is executable text, which can be poked at by ptrace, etc
  normally.  The second vma contains time-varying state, should not
  allow poking, and is accessed by PIC references.

 Is this 2nd vma seen in /proc/pid/maps? And if so, is it marked somehow?

It is in maps, and it's not marked.  I can write a patch to change
that.  I imagine it shouldn't be called [vdso], though.


  What does CRIU do to restore the vdso?  Will 3.15 and/or linux-next
  need to make some concession for CRIU?

 We detect the vdso by [vdso] mark in proc at dump time and mark it in
 the images. At restore time we check that vdso symbols layout hasn't changed
 and just remap it in proper location.

 If this remains the same in -next, then we're fine :)

If you just remap the vdso, you'll crash.

This is the case in 3.15, too, for 32-bit apps, anyway.

What happens if you try to checkpoint a program that's in the vdso or,
worse, in a signal frame with the vdso on the stack?

--Andy


 Thanks,
 Pavel


Re: mm: NULL ptr deref handling mmaping of special mappings

2014-05-15 Thread Andy Lutomirski
On Thu, May 15, 2014 at 1:45 AM, Cyrill Gorcunov gorcu...@gmail.com wrote:
 On Wed, May 14, 2014 at 03:23:27PM -0700, Andy Lutomirski wrote:
 I can summarize:

 On 3.14 and before, the vdso is just a bunch of ELF headers and
 executable data.  When executed by 64-bit binaries, it reads from the
 fixmap to do its thing.  That is, it reads from kernel addresses that
 don't have vmas.  When executed by 32-bit binaries, it doesn't read
 anything, since there was no 32-bit timing code.

 On 3.15, the x86_64 vdso is unchanged.  The 32-bit vdso is preceded by
 a separate vma containing two pages worth of time-varying read-only
 data.  The vdso reads those pages using PIC references.

 Andy, could you please point me where is the code which creates a second vma?
 latest 3.15 master branch

Search for _install_special_mapping in arch/x86/vdso.  It's in a
different place in 3.15-rc and -next.


 [root@fc ~]# cat /proc/self/maps
 ...
 7fff57b6e000-7fff57b8f000 rw-p  00:00 0  
 [stack]
 7fff57bff000-7fff57c0 r-xp  00:00 0  
 [vdso]
 ffffffffff600000-ffffffffff601000 r-xp  00:00 0  [vsyscall]
 [root@fc ~]#


What version and bitness is this?

 Or you mean vsyscall area? If yes, then in criu we don't dump vsyscall zone.
 On restore we don't touch  vsyscall either but for vdso there are two cases

vsyscalls are almost gone now :)


  - if there were no kernel change on vdso contents we simply use vdso provided
by the kernel at the moment of criu startup

  - if vdso has been changed and looks different from one saved in image during
checkpoint, we map it from image but then patch (push jmp instruction) so
when application calls for some of vdso function it jumps into vdso code
saved in image and then jumps into vdso mapped by the kernel (ie kind of
proxy calls) This force us to do own Elf parsing inside criu to calculate
proper offsets.

Yuck :)

--Andy


Re: mm: NULL ptr deref handling mmaping of special mappings

2014-05-15 Thread Andy Lutomirski
On Thu, May 15, 2014 at 12:53 PM, Cyrill Gorcunov gorcu...@gmail.com wrote:
 On Thu, May 15, 2014 at 12:46:34PM -0700, Andy Lutomirski wrote:
 On Thu, May 15, 2014 at 1:45 AM, Cyrill Gorcunov gorcu...@gmail.com wrote:
  On Wed, May 14, 2014 at 03:23:27PM -0700, Andy Lutomirski wrote:
  I can summarize:
 
  On 3.14 and before, the vdso is just a bunch of ELF headers and
  executable data.  When executed by 64-bit binaries, it reads from the
  fixmap to do its thing.  That is, it reads from kernel addresses that
  don't have vmas.  When executed by 32-bit binaries, it doesn't read
  anything, since there was no 32-bit timing code.
 
  On 3.15, the x86_64 vdso is unchanged.  The 32-bit vdso is preceded by
  a separate vma containing two pages worth of time-varying read-only
  data.  The vdso reads those pages using PIC references.
 
  Andy, could you please point me where is the code which creates a second 
  vma?
  latest 3.15 master branch

 Search for _install_special_mapping in arch/x86/vdso.  It's in a
 different place in 3.15-rc and -next.

 As far as I see _install_special_mapping allocates one vma from cache and

 vma-vm_start = addr;
 vma-vm_end = addr + len;

 so where is the second one?

Look at its callers in vdso32-setup.c and/or vma.c, depending on version.



 
  [root@fc ~]# cat /proc/self/maps
  ...
  7fff57b6e000-7fff57b8f000 rw-p  00:00 0  
  [stack]
  7fff57bff000-7fff57c0 r-xp  00:00 0  
  [vdso]
  ffffffffff600000-ffffffffff601000 r-xp  00:00 0  [vsyscall]
  [root@fc ~]#
 

 What version and bitness is this?

 x86-64, 3.15-rc5

Aha.  Give tip/x86/vdso or -next a try or boot a 32-bit 3.15-rc kernel
and you'll see it.

--Andy


Re: mm: NULL ptr deref handling mmaping of special mappings

2014-05-15 Thread Andy Lutomirski
On Thu, May 15, 2014 at 2:31 PM, Cyrill Gorcunov gorcu...@gmail.com wrote:
 On Fri, May 16, 2014 at 12:19:14AM +0400, Cyrill Gorcunov wrote:

 I see what you mean. We're rather targeting on bare x86-64 at the moment
 but compat mode is needed as well (not yet implemented though). I'll take
 a precise look into this area. Thanks!

 Indeed, because we were not running 32bit tasks 
 vdso32-setup.c::arch_setup_additional_pages
 has never been called. That's the mode we will have to implement one day.

 Looking forward the question appear -- will VDSO_PREV_PAGES and rest of 
 variables
 be kind of immutable constants? If yes, we could calculate where the 
 additional
 vma lives without requiring any kind of [vdso] mark in proc/pid/maps output.

Please don't!

These might, in principle, even vary between tasks on the same system.
 Certainly the relative positions of the vmas will be different
between 3.15 and 3.16, since we need almost my entire cleanup series
to reliably put them into their 3.16 location.  And I intend to change
the number of pages in 3.16 or 3.17.

--Andy


Re: mm: NULL ptr deref handling mmaping of special mappings

2014-05-15 Thread Andy Lutomirski
On Thu, May 15, 2014 at 2:57 PM, Cyrill Gorcunov gorcu...@gmail.com wrote:
 On Thu, May 15, 2014 at 02:42:48PM -0700, Andy Lutomirski wrote:
 
  Looking forward the question appear -- will VDSO_PREV_PAGES and rest of 
  variables
  be kind of immutable constants? If yes, we could calculate where the 
  additional
  vma lives without requiring any kind of [vdso] mark in proc/pid/maps 
  output.

 Please don't!

 These might, in principle, even vary between tasks on the same system.
  Certainly the relative positions of the vmas will be different
 between 3.15 and 3.16, since we need almost my entire cleanup series
 to reliably put them into their 3.16 location.  And I intend to change
 the number of pages in 3.16 or 3.17.

 There are other ways how to find where additional pages are laying but it
 would be great if there a straightforward interface for that (ie some mark
 in /proc/pid/maps output).

I'll try to write a patch in time for 3.15.

--Andy


Re: mm: NULL ptr deref handling mmaping of special mappings

2014-05-16 Thread Andy Lutomirski
On Thu, May 15, 2014 at 3:15 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Thu, May 15, 2014 at 2:57 PM, Cyrill Gorcunov gorcu...@gmail.com wrote:
 On Thu, May 15, 2014 at 02:42:48PM -0700, Andy Lutomirski wrote:
 
  Looking forward the question appear -- will VDSO_PREV_PAGES and rest of 
  variables
  be kind of immutable constants? If yes, we could calculate where the 
  additional
  vma lives without requiring any kind of [vdso] mark in proc/pid/maps 
  output.

 Please don't!

 These might, in principle, even vary between tasks on the same system.
  Certainly the relative positions of the vmas will be different
 between 3.15 and 3.16, since we need almost my entire cleanup series
 to reliably put them into their 3.16 location.  And I intend to change
 the number of pages in 3.16 or 3.17.

 There are other ways how to find where additional pages are laying but it
 would be great if there a straightforward interface for that (ie some mark
 in /proc/pid/maps output).

 I'll try to write a patch in time for 3.15.


My current draft is here:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=vdso/cleanups

On 64-bit userspace, it results in:

7fffa1dfd000-7fffa1dfe000 r-xp  00:00 0  [vdso]
7fffa1dfe000-7fffa1e0 r--p  00:00 0  [vvar]
ffffffffff600000-ffffffffff601000 r-xp  00:00 0  [vsyscall]

On 32-bit userspace, it results in:

f7748000-f7749000 r-xp  00:00 0  [vdso]
f7749000-f774b000 r--p  00:00 0  [vvar]
ffd94000-ffdb5000 rw-p  00:00 0  [stack]

Is this good for CRIU?  Another approach would be to name both of
these things vdso, since they are sort of both the vdso, but that
might be a bit confusing -- [vvar] is not static text the way that
[vdso] is.

If I backport this for 3.15 (which might be nasty -- I would argue
that the code change is actually a cleanup, but it's fairly
intrusive), then [vvar] will be *before* [vdso], not after it.  I'd be
very hesitant to name both of them [vdso] in that case, since there
is probably code that assumes that the beginning of [vdso] is a DSO.

Note that it is *not* safe to blindly read from [vvar].  On some
configurations you *will* get SIGBUS if you try to read from some of
the vvar pages.  (That's what started this whole thread.)  Some pages
in [vvar] may have strange caching modes, so SIGBUS might not be the
only surprising thing about poking at it.

--Andy


Re: [PATCH 1/4] SROP mitigation: Architecture independent code for signal canaries

2014-05-19 Thread Andy Lutomirski
On 05/15/2014 02:09 PM, Erik Bosman wrote:
 
 Architecture independent code for signal canaries
 
 Add support for canary values in user-space signal frames.  These canaries
 function much like stack canaries/cookies, making it harder for an attacker to
 fake a call to {rt_,}sigreturn()
 
 This patch deals with architecture independent changes needed to support
 these canaries.
 
 
 These patches are meant to make Sigreturn Oriented Programming (SROP) a much
 less attractive exploitation path.  In Sigreturn Oriented Programming, an
 attacker causes a user-space program to call the sigreturn system call in 
 order
 to get complete control over the entire userspace context in one go.
 
 ( see: http://www.cs.vu.nl/~herbertb/papers/srop_sp14.pdf )
 
 While mitigating SROP will probably not stop determined attackers from
 exploiting a program, as there's always the much more well-known Return
 Oriented Programming, we still think SROP's relative ease warrants mitigation,
 especially since the mitigation is so cheap.

If you're willing to make the mitigation a bit more sneaky, you could
make the canary value depend on the address that the canary is at.  For
example, it could be H(some per-exec secret || address) for your
favorite hash function H.

Also, I would have sigreturn clear the canary on the stack.

This would mitigate attacks based on trying to read the canary value
from some unused / leaked stack space.

--Andy


 
 
 Signed-off-by: Erik Bosman e...@minemu.org
 
 ---
  arch/Kconfig  | 16 
  fs/exec.c |  8 
  include/linux/sched.h |  5 +
  3 files changed, 29 insertions(+)
 
 diff --git a/arch/Kconfig b/arch/Kconfig
 index 97ff872..8319984 100644
 --- a/arch/Kconfig
 +++ b/arch/Kconfig
 @@ -399,6 +399,22 @@ config CC_STACKPROTECTOR_STRONG
  
  endchoice
  
 +config HAVE_SIGNAL_CANARY
 + bool
 + help
 +   An arch should select this symbol if:
 +   - its struct sigframe contains a canary field
 +   - it has implemented signal canary checking
 +
 +config SIGNAL_CANARY
 + bool "signal canary"
 + default y
 + depends on HAVE_SIGNAL_CANARY
 + help
 +   Mitigate against a userland exploitation technique called
 +   sigreturn oriented programming by putting a canary value on a
 +   signal's struct sigframe
 +
  config HAVE_CONTEXT_TRACKING
   bool
   help
 diff --git a/fs/exec.c b/fs/exec.c
 index 476f3eb..883f456 100644
 --- a/fs/exec.c
 +++ b/fs/exec.c
 @@ -56,6 +56,7 @@
  #include <linux/pipe_fs_i.h>
  #include <linux/oom.h>
  #include <linux/compat.h>
 +#include <linux/random.h>
  
  #include <asm/uaccess.h>
  #include <asm/mmu_context.h>
 @@ -1105,6 +1106,13 @@ void setup_new_exec(struct linux_binprm * bprm)
   /* This is the point of no return */
  current->sas_ss_sp = current->sas_ss_size = 0;
  
 +#ifdef CONFIG_SIGNAL_CANARY
 + /* canary value to mitigate the use of sigreturn in (userland) exploits
 +  * get_random_int() should be random enough also for 64bit
 +  */
 + current->signal_canary = (unsigned long)get_random_int();
 +#endif
 +
  if (uid_eq(current_euid(), current_uid()) && gid_eq(current_egid(), current_gid()))
  set_dumpable(current->mm, SUID_DUMP_USER);
   else
 diff --git a/include/linux/sched.h b/include/linux/sched.h
 index 25f54c7..cb8b54b 100644
 --- a/include/linux/sched.h
 +++ b/include/linux/sched.h
 @@ -1364,6 +1364,11 @@ struct task_struct {
  
   unsigned long sas_ss_sp;
   size_t sas_ss_size;
 +
 +#ifdef CONFIG_SIGNAL_CANARY
 + u32 signal_canary; /* sigreturn exploit mitigation */
 +#endif
 +
   int (*notifier)(void *priv);
   void *notifier_data;
   sigset_t *notifier_mask;
 



Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-19 Thread Andy Lutomirski
On May 15, 2014 1:26 PM, Serge E. Hallyn se...@hallyn.com wrote:

 Quoting Richard Weinberger (rich...@nod.at):
  Am 15.05.2014 21:50, schrieb Serge Hallyn:
   Quoting Richard Weinberger (richard.weinber...@gmail.com):
   On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman
   gre...@linuxfoundation.org wrote:
   Then don't use a container to build such a thing, or fix the build
   scripts to not do that :)
  
   I second this.
   To me it looks like some folks try to (ab)use Linux containers
   for purposes where KVM would much better fit in.
   Please don't put more complexity into containers. They are already
   horrible complex
   and error prone.
  
   I, naturally, disagree :)  The only use case which is inherently not
   valid for containers is running a kernel.  Practically speaking there
   are other things which likely will never be possible, but if someone
    offers a way to do something in containers, "you can't do that in
    containers" is not an apropos response.
  
    "That abstraction is wrong" is certainly valid, as when vpids were
   originally proposed and rejected, resulting in the development of
    pid namespaces.  "We have to work out (x) first" can be valid (and
   I can think of examples here), assuming it's not just trying to hide
   behind a catch-22/chicken-egg problem.
  
   Finally, saying containers are complex and error prone is conflating
   several large suites of userspace code and many kernel features which
   support them.  Being more precise would, if the argument is valid,
   lend it a lot more weight.
 
  We (my company) use Linux containers since 2011 in production. First LXC, 
  now libvirt-lxc.
  To understand the internals better I also wrote my own userspace to 
  create/start
  containers. There are so many things which can hurt you badly.
  With user namespaces we expose a really big attack surface to regular users.
  I.e. Suddenly a user is allowed to mount filesystems.

 That is currently not the case.  They can mount some virtual filesystems
 and do bind mounts, but cannot mount most real filesystems.  This keeps
 us protected (for now) from potentially unsafe superblock readers in the
 kernel.

  Ask Andy, he found already lots of nasty things...

I don't think I have anything brilliant to add to this discussion
right now, except possibly:

ISTM that Linux distributions are, in general, vulnerable to all kinds
of shenanigans that would happen if an untrusted user can cause a
block device to appear.  That user doesn't need permission to mount it
or even necessarily to change its contents on the fly.

E.g. what happens if you boot a machine that contains a malicious disk
image that has the same partition UUID as /?  Nothing good, I imagine.

So if we're going to go down this road, we really need some way to
tell the host that certain devices are not trusted.

--Andy


Re: [PATCH] x32: Mask away the x32 syscall bit in the ptrace codepath

2014-05-28 Thread Andy Lutomirski
On 05/28/2014 05:19 AM, Philipp Kern wrote:
 audit_filter_syscall uses the syscall number to reference into a
 bitmask (e->rule.mask[word]). Not removing the x32 bit before passing
 the number to this architecture-independent codepath will fail to
 look up the proper audit bit. Furthermore it will cause an invalid memory
 access in the kernel if the out-of-bounds location is not mapped:
 
   BUG: unable to handle kernel paging request at 8800e5446630
   IP: [810fcdd0] audit_filter_syscall+0x90/0xf0
 
 Together with the entrypoint in entry_64.S this change causes x32
 programs to pass in both AUDIT_ARCH_X86_64 and AUDIT_ARCH_I386 depending
 on the syscall path.
 
 Cc: linux-kernel@vger.kernel.org
 Cc: H. J. Lu hjl.to...@gmail.com
 Cc: Eric Paris epa...@redhat.com
 Signed-off-by: Philipp Kern pk...@google.com
 ---
  arch/x86/kernel/ptrace.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
 
 diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
 index 678c0ad..166a3c7 100644
 --- a/arch/x86/kernel/ptrace.c
 +++ b/arch/x86/kernel/ptrace.c
 @@ -1489,7 +1489,7 @@ long syscall_trace_enter(struct pt_regs *regs)
  
   if (IS_IA32)
   audit_syscall_entry(AUDIT_ARCH_I386,
 - regs->orig_ax,
 + regs->orig_ax & __SYSCALL_MASK,

This is weird.  Three questions:

1. How can this happen?  I thought that x32 syscalls always came in
through the syscall path, which doesn't set is_compat_task.  (Can
someone rename is_compat_task to in_compat_syscall?  Pretty please?)

2. Now audit can't tell whether a syscall is x32 or i386.  And audit is
inconsistent with seccomp.  This seems wrong.

3. The OOPS you're fixing doesn't seem like it's fixed.  What if some
other random high bits are set?

--Andy


Re: [PATCH] x32: Mask away the x32 syscall bit in the ptrace codepath

2014-05-28 Thread Andy Lutomirski
On Wed, May 28, 2014 at 2:01 PM, H. Peter Anvin h...@linux.intel.com wrote:
 On 05/28/2014 01:47 PM, Andy Lutomirski wrote:
 On 05/28/2014 05:19 AM, Philipp Kern wrote:
  audit_filter_syscall uses the syscall number to reference into a
  bitmask (e->rule.mask[word]). Not removing the x32 bit before passing
  the number to this architecture-independent codepath will fail to
  look up the proper audit bit. Furthermore it will cause an invalid memory
  access in the kernel if the out-of-bounds location is not mapped:

   BUG: unable to handle kernel paging request at 8800e5446630
   IP: [810fcdd0] audit_filter_syscall+0x90/0xf0

 Together with the entrypoint in entry_64.S this change causes x32
 programs to pass in both AUDIT_ARCH_X86_64 and AUDIT_ARCH_I386 depending
 on the syscall path.

 Cc: linux-kernel@vger.kernel.org
 Cc: H. J. Lu hjl.to...@gmail.com
 Cc: Eric Paris epa...@redhat.com
 Signed-off-by: Philipp Kern pk...@google.com
 ---
  arch/x86/kernel/ptrace.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
 index 678c0ad..166a3c7 100644
 --- a/arch/x86/kernel/ptrace.c
 +++ b/arch/x86/kernel/ptrace.c
 @@ -1489,7 +1489,7 @@ long syscall_trace_enter(struct pt_regs *regs)

  if (IS_IA32)
  audit_syscall_entry(AUDIT_ARCH_I386,
 -regs->orig_ax,
 +regs->orig_ax & __SYSCALL_MASK,

 This is weird.  Three questions:

 1. How can this happen?  I thought that x32 syscalls always came in
 through the syscall path, which doesn't set is_compat_task.  (Can
 someone rename is_compat_task to in_compat_syscall?  Pretty please?)

 The SYSCALL path doesn't set TS_COMPAT, but is_compat_task() looks at both
 TS_COMPAT and bit 30 of orig_ax.

 I think what is really needed here is IS_IA32 should use is_ia32_task()
 instead, and *that* is the context we can mask off the x32 bit in at
 all.  However, does audit not need that information?

This is thoroughly fscked up.

What *should* have happened is that there should have been an
AUDIT_ARCH_X32.  Unfortunately no one caught that in time, so the
current enshrined ABI is that, as far as seccomp is concerned, x32 is
AUDIT_ARCH_X86_64 (see syscall_get_arch) but nr has the x32 bit set.

But the audit code is different: x32 is AUDIT_ARCH_I386 and the x32
bit is set.  This may be changeable, since fixes to the audit ABI may
not break anything.  Can we just use syscall_get_arch to determine the
token to pass to the audit code?

If it makes everyone feel better, I think that every single
architecture has screwed this up.  We actually had to prevent seccomp
and audit from being configured in on any OABI-supporting ARM kernel,
and MIPS almost did the same thing that x32 did.  We caught that one
on time, and it's now fixed in Linus' tree.


 (And why the frakk does audit receive the first four syscall arguments?
  Audit seems like the worst turd ever...)

If you ever benchmark the performance impact of merely running auditd,
you might have to upgrade that insult :-/


 2. Now audit can't tell whether a syscall is x32 or i386.  And audit is
 inconsistent with seccomp.  This seems wrong.

 This is completely and totally bogus, indeed.

I'd suggest just fixing the bug in auditsc.c.


 3. The OOPS you're fixing doesn't seem like it's fixed.  What if some
 other random high bits are set?

 There is a range check in entry_*.S for the system call.

I can imagine that causing a certain amount of confusion to fancy
seccomp users.  Oh, well.  No one that I know of has complained yet.

--Andy


Re: [PATCH] x32: Mask away the x32 syscall bit in the ptrace codepath

2014-05-28 Thread Andy Lutomirski
On Wed, May 28, 2014 at 2:19 PM, H. Peter Anvin h...@linux.intel.com wrote:
 On 05/28/2014 02:15 PM, Andy Lutomirski wrote:

 3. The OOPS you're fixing doesn't seem like it's fixed.  What if some
 other random high bits are set?

 There is a range check in entry_*.S for the system call.

 I can imagine that causing a certain amount of confusion to fancy
 seccomp users.  Oh, well.  No one that I know of has complained yet.


 I don't know how seccomp or audit deal with out-of-range syscall
 numbers, so that might be worth taking a look at.

audit oopses, apparently.  seccomp will tell BPF about it and follow
directions, barring bugs.

However: are you sure that entry_64.S handles this?  It looks like
tracesys has higher priority than badsys.  And strace can certainly
see out-of-range syscalls.  And I can oops audit by doing:

auditctl -a exit,always -S open
./a.out

where a.out is this:

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main()
{
  long i;
  for (i = 1000; ; i += 64 * 4096)
syscall(i, 0, 0, 0, 0, 0, 0);
  return 0;
}

I would *love* to deprecate the entire syscall auditing mechanism.  Or
at least I'd love to have distros stop using it.

I'll go ask for a CVE number.  Sigh.


--Andy


Re: [PATCH] x32: Mask away the x32 syscall bit in the ptrace codepath

2014-05-28 Thread Andy Lutomirski
On Wed, May 28, 2014 at 2:53 PM, Philipp Kern pk...@google.com wrote:
 On Wed, May 28, 2014 at 11:43 PM, Andy Lutomirski l...@amacapital.net wrote:
 However: are you sure that entry_64.S handles this?  It looks like
 tracesys has higher priority than badsys.  And strace can certainly
 see out-of-range syscalls. […]

 Not only can it see them: It must see that this bit is set as that's
 the only identifier it has to deduce that the binary is running in x32
 mode.

 Out of range syscall numbers certainly do not work for auditing right
 now, hence my attempt to patch around it.

There appears to be a completely arbitrary limit of 32*64 syscalls.
There's also an arbitrary limit of 4 arguments.  Both are wrong.  I
have no intention of fixing either.

I'll fix the OOPS, though.


 Kind regards and thanks
 Philipp Kern



-- 
Andy Lutomirski
AMA Capital Management, LLC


[PATCH 2/2] audit: Move CONFIG_AUDITSYSCALL into staging and update help text

2014-05-28 Thread Andy Lutomirski
Here are some issues with the code:
 - It thinks that syscalls have four arguments.
 - It's a performance disaster.
 - It assumes that syscall numbers are between 0 and 2048.
 - It's unclear whether it's supposed to be reliable.
 - It's broken on things like x32.
 - It can't support ARM OABI.
 - Its approach to memory allocation is terrifying.

I considered marking it BROKEN, but that might be too harsh.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 init/Kconfig | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 9d3585b..4584f8a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -296,13 +296,16 @@ config HAVE_ARCH_AUDITSYSCALL
bool
 
 config AUDITSYSCALL
-   bool "Enable system-call auditing support"
-   depends on AUDIT && HAVE_ARCH_AUDITSYSCALL
+   bool "Enable system-call auditing support (not recommended)"
+   depends on AUDIT && HAVE_ARCH_AUDITSYSCALL && STAGING
default y if SECURITY_SELINUX
help
- Enable low-overhead system-call auditing infrastructure that
- can be used independently or with another kernel subsystem,
- such as SELinux.
+ Enable system-call auditing infrastructure that can be used
+ independently or with another kernel subsystem, such as
+ SELinux.
+
+ AUDITSYSCALL has serious performance and correctness issues.
+ Use it with extreme caution.
 
 config AUDIT_WATCH
def_bool y
-- 
1.9.3



[PATCH 0/2] Fix auditsc DoS and move it to staging

2014-05-28 Thread Andy Lutomirski
CONFIG_AUDITSYSCALL is awful.  Patch 2 enumerates some reasons.

Patch 1 fixes a nasty DoS and possible information leak.  It should
be applied and backported.

Patch 2 is optional.  I leave it to other peoples' judgment.

Andy Lutomirski (2):
  auditsc: audit_krule mask accesses need bounds checking
  audit: Move CONFIG_AUDITSYSCALL into staging and update help text

 init/Kconfig | 13 -
 kernel/auditsc.c | 27 ++-
 2 files changed, 26 insertions(+), 14 deletions(-)

-- 
1.9.3



[PATCH 1/2] auditsc: audit_krule mask accesses need bounds checking

2014-05-28 Thread Andy Lutomirski
Fixes an easy DoS and possible information disclosure.

This does nothing about the broken state of x32 auditing.

Cc: sta...@vger.kernel.org
Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 kernel/auditsc.c | 27 ++-
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index f251a5e..7ccd9db 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -728,6 +728,22 @@ static enum audit_state audit_filter_task(struct 
task_struct *tsk, char **key)
return AUDIT_BUILD_CONTEXT;
 }
 
+static bool audit_in_mask(const struct audit_krule *rule, unsigned long val)
+{
+   int word, bit;
+
+   if (val > 0xffffffff)
+   return false;
+
+   word = AUDIT_WORD(val);
+   if (word >= AUDIT_BITMASK_SIZE)
+   return false;
+
+   bit = AUDIT_BIT(val);
+
+   return rule->mask[word] & bit;
+}
+
 /* At syscall entry and exit time, this filter is called if the
  * audit_state is not low enough that auditing cannot take place, but is
  * also not high enough that we already know we have to write an audit
@@ -745,11 +761,8 @@ static enum audit_state audit_filter_syscall(struct 
task_struct *tsk,
 
rcu_read_lock();
if (!list_empty(list)) {
-   int word = AUDIT_WORD(ctx->major);
-   int bit  = AUDIT_BIT(ctx->major);
-
list_for_each_entry_rcu(e, list, list) {
-   if ((e->rule.mask[word] & bit) == bit &&
+   if (audit_in_mask(&e->rule, ctx->major) &&
audit_filter_rules(tsk, e-rule, ctx, NULL,
   state, false)) {
rcu_read_unlock();
@@ -769,20 +782,16 @@ static enum audit_state audit_filter_syscall(struct 
task_struct *tsk,
 static int audit_filter_inode_name(struct task_struct *tsk,
   struct audit_names *n,
   struct audit_context *ctx) {
-   int word, bit;
int h = audit_hash_ino((u32)n->ino);
struct list_head *list = audit_inode_hash[h];
struct audit_entry *e;
enum audit_state state;
 
-   word = AUDIT_WORD(ctx->major);
-   bit  = AUDIT_BIT(ctx->major);
-
if (list_empty(list))
return 0;
 
list_for_each_entry_rcu(e, list, list) {
-   if ((e->rule.mask[word] & bit) == bit &&
+   if (audit_in_mask(&e->rule, ctx->major) &&
audit_filter_rules(tsk, e-rule, ctx, n, state, false)) {
ctx->current_state = state;
return 1;
-- 
1.9.3



Re: [RFC][PATCH 0/8] sched,idle: need resched polling rework

2014-05-28 Thread Andy Lutomirski
On Thu, May 22, 2014 at 6:09 AM, Peter Zijlstra pet...@infradead.org wrote:
 On Thu, May 22, 2014 at 02:58:18PM +0200, Peter Zijlstra wrote:
 ---
 diff --git a/kernel/sched/core.c b/kernel/sched/core.c
 index 4ea7b3f1a087..a5da85fb3570 100644
 --- a/kernel/sched/core.c
 +++ b/kernel/sched/core.c
 @@ -546,12 +546,38 @@ static bool set_nr_and_not_polling(struct task_struct 
 *p)
   struct thread_info *ti = task_thread_info(p);
   return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
  }
 +
 +/*
 + * Atomically set TIF_NEED_RESCHED if TIF_POLLING_NRFLAG is set.
 + */
 +static bool set_nr_if_polling(struct task_struct *p)
 +{
 + struct thread_info *ti = task_thread_info(p);
 + typeof(ti->flags) old, val = ti->flags;
 +
 + for (;;) {
 + if (!(val & _TIF_POLLING_NRFLAG))
 + return false;
 + if (val & _TIF_NEED_RESCHED)
 + return true;

 Hmm, I think this is racy, false would be safer. If its already set we
 might already be past the sched_ttwu_pending() invocation, while if its
 clear and we're the one to set it, we're guaranteed not.

 + old = cmpxchg(&ti->flags, val, val | _TIF_NEED_RESCHED);
 + if (old == val)
 + return true;
 + val = old;
 + }
 +}

Do you have an updated patch?  After fixing the MIME flow damage
(sigh), it doesn't apply to sched/core, which is my best guess for
what it's based on.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC


[PATCH v2 2/2] audit: Mark CONFIG_AUDITSYSCALL BROKEN and update help text

2014-05-28 Thread Andy Lutomirski
Here are some issues with the code:
 - It thinks that syscalls have four arguments.
 - It's a performance disaster.
 - It assumes that syscall numbers are between 0 and 2048.
 - It's unclear whether it's supposed to be reliable.
 - It's broken on things like x32.
 - It can't support ARM OABI.
 - Its approach to freeing memory is terrifying.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 init/Kconfig | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 9d3585b..24d4b53 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -296,13 +296,16 @@ config HAVE_ARCH_AUDITSYSCALL
bool
 
 config AUDITSYSCALL
-   bool "Enable system-call auditing support"
-   depends on AUDIT && HAVE_ARCH_AUDITSYSCALL
+   bool "Enable system-call auditing support (not recommended)"
+   depends on AUDIT && HAVE_ARCH_AUDITSYSCALL && BROKEN
default y if SECURITY_SELINUX
help
- Enable low-overhead system-call auditing infrastructure that
- can be used independently or with another kernel subsystem,
- such as SELinux.
+ Enable system-call auditing infrastructure that can be used
+ independently or with another kernel subsystem, such as
+ SELinux.
+
+ AUDITSYSCALL has serious performance and correctness issues.
+ Use it with extreme caution.
 
 config AUDIT_WATCH
def_bool y
-- 
1.9.3



[PATCH v2 1/2] auditsc: audit_krule mask accesses need bounds checking

2014-05-28 Thread Andy Lutomirski
Fixes an easy DoS and possible information disclosure.

This does nothing about the broken state of x32 auditing.

Cc: sta...@vger.kernel.org
Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 kernel/auditsc.c | 27 ++-
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index f251a5e..7ccd9db 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -728,6 +728,22 @@ static enum audit_state audit_filter_task(struct 
task_struct *tsk, char **key)
return AUDIT_BUILD_CONTEXT;
 }
 
+static bool audit_in_mask(const struct audit_krule *rule, unsigned long val)
+{
+   int word, bit;
+
+   if (val > 0xffffffff)
+   return false;
+
+   word = AUDIT_WORD(val);
+   if (word >= AUDIT_BITMASK_SIZE)
+   return false;
+
+   bit = AUDIT_BIT(val);
+
+   return rule->mask[word] & bit;
+}
+
 /* At syscall entry and exit time, this filter is called if the
  * audit_state is not low enough that auditing cannot take place, but is
  * also not high enough that we already know we have to write an audit
@@ -745,11 +761,8 @@ static enum audit_state audit_filter_syscall(struct 
task_struct *tsk,
 
rcu_read_lock();
if (!list_empty(list)) {
-   int word = AUDIT_WORD(ctx->major);
-   int bit  = AUDIT_BIT(ctx->major);
-
list_for_each_entry_rcu(e, list, list) {
-   if ((e->rule.mask[word] & bit) == bit &&
+   if (audit_in_mask(&e->rule, ctx->major) &&
audit_filter_rules(tsk, e-rule, ctx, NULL,
   state, false)) {
rcu_read_unlock();
@@ -769,20 +782,16 @@ static enum audit_state audit_filter_syscall(struct 
task_struct *tsk,
 static int audit_filter_inode_name(struct task_struct *tsk,
   struct audit_names *n,
   struct audit_context *ctx) {
-   int word, bit;
int h = audit_hash_ino((u32)n->ino);
struct list_head *list = audit_inode_hash[h];
struct audit_entry *e;
enum audit_state state;
 
-   word = AUDIT_WORD(ctx->major);
-   bit  = AUDIT_BIT(ctx->major);
-
if (list_empty(list))
return 0;
 
list_for_each_entry_rcu(e, list, list) {
-   if ((e->rule.mask[word] & bit) == bit &&
+   if (audit_in_mask(&e->rule, ctx->major) &&
audit_filter_rules(tsk, e-rule, ctx, n, state, false)) {
ctx->current_state = state;
return 1;
-- 
1.9.3



[PATCH v2 0/2] Fix auditsc DoS and mark it BROKEN

2014-05-28 Thread Andy Lutomirski
CONFIG_AUDITSYSCALL is awful.  Patch 2 enumerates some reasons.

Patch 1 fixes a nasty DoS and possible information leak.  It should
be applied and backported.

Patch 2 is optional.  I leave it to other peoples' judgment.

Andy Lutomirski (2):
  auditsc: audit_krule mask accesses need bounds checking
  audit: Mark CONFIG_AUDITSYSCALL BROKEN and update help text

 init/Kconfig | 13 -
 kernel/auditsc.c | 27 ++-
 2 files changed, 26 insertions(+), 14 deletions(-)

-- 
1.9.3



Re: [PATCH v2 1/2] auditsc: audit_krule mask accesses need bounds checking

2014-05-28 Thread Andy Lutomirski
On Wed, May 28, 2014 at 7:23 PM, Eric Paris epa...@redhat.com wrote:
 On Wed, 2014-05-28 at 18:44 -0700, Andy Lutomirski wrote:
 Fixes an easy DoS and possible information disclosure.

 This does nothing about the broken state of x32 auditing.

 Cc: sta...@vger.kernel.org
 Signed-off-by: Andy Lutomirski l...@amacapital.net
 ---
  kernel/auditsc.c | 27 ++-
  1 file changed, 18 insertions(+), 9 deletions(-)

 diff --git a/kernel/auditsc.c b/kernel/auditsc.c
 index f251a5e..7ccd9db 100644
 --- a/kernel/auditsc.c
 +++ b/kernel/auditsc.c
 @@ -728,6 +728,22 @@ static enum audit_state audit_filter_task(struct 
 task_struct *tsk, char **key)
   return AUDIT_BUILD_CONTEXT;
  }

 +static bool audit_in_mask(const struct audit_krule *rule, unsigned long val)
 +{
 + int word, bit;
 +
 + if (val > 0xffffffff)
 + return false;

 Why is this necessary?

To avoid an integer overflow.  Admittedly, this particular overflow
won't cause a crash, but it will cause incorrect results.


 +
 + word = AUDIT_WORD(val);
 + if (word >= AUDIT_BITMASK_SIZE)
 + return false;

 Since this covers it and it extensible...

 +
 + bit = AUDIT_BIT(val);
 +
 + return rule->mask[word] & bit;

 Returning an int as a bool creates worse code than just returning the
 int.  (although in this case if the compiler chooses to inline it might
 be smart enough not to actually convert this int to a bool and make
 worse assembly...)   I'd suggest the function here return an int.  bools
 usually make the assembly worse...

I'm ambivalent.  The right assembly would use flags on x86, not an int
or a bool, and I don't know what the compiler will do.

--Andy


Re: [PATCH v2 2/2] audit: Mark CONFIG_AUDITSYSCALL BROKEN and update help text

2014-05-28 Thread Andy Lutomirski
On Wed, May 28, 2014 at 7:09 PM, Eric Paris epa...@redhat.com wrote:
 NAK

 On Wed, 2014-05-28 at 18:44 -0700, Andy Lutomirski wrote:
 Here are some issues with the code:
  - It thinks that syscalls have four arguments.

  Not true at all.  It records the registers that would hold the first 4
  arguments on syscall entry, for use later if needed, as getting those
  later on some arches is not feasible (see ia64).  It makes no assumption
  about how many arguments a syscall has.

What about a5 and a6?


  - It's a performance disaster.

 Only if you enable it.  If you don't use audit it is a single branch.
 Hardly a disaster.

It forces all syscalls into the slow path and it can do crazy things
like building audit contexts just in case actual handling of the
syscall triggers an audit condition so that the exit path can log the
syscall.  That's way worse than a single branch.

Try it: benchmark getpid on Fedora and then repeat the experiment with
syscall auditing fully disabled.  The difference is striking.


  - It assumes that syscall numbers are between 0 and 2048.

 There could well be a bug here.  Not questioning that.  Although that
 would be patch 1/2

Even with patch 1, it still doesn't handle large syscall numbers -- it
just assumes they're not audited.


  - It's unclear whether it's supposed to be reliable.

 Unclear to whom?

To me.

If some inode access or selinux rule triggers an audit, is the auditsc
code guaranteed to write an exit record?  And see below...


  - It's broken on things like x32.
  - It can't support ARM OABI.

 Some arches aren't supported?  And that makes it BROKEN?

It acts like x32 is supported.  OABI is fixable.  Admittedly, OABI not
being supported is fine, now that it's correctly disabled.


  - Its approach to freeing memory is terrifying.

 What?

 None of your reasons hold water.  Bugs need to be fixed.  Try reporting
 them...  This is just stupid.

...for reference, I've *tried* to fix the performance issues.  I've
run into all kinds of problems.

The actual context code is incredibly tangled.  It's unclear whether
it would be permissible for various rule combinations to suppress exit
records triggered by selinux.  Any effort to actually deal with this
stuff will have to deal with the fact that the audit code builds up
lots of state and frees it on syscall exit.  That means that the exit
hook needs to be actually invoked, otherwise we leak.  I was never
able to convince myself that the current code is correct wrt kernel
threads.

In summary, the code is a giant mess.  The way it works is nearly
incomprehensible.  It contains at least one severe bug.  I'd love to
see it fixed, but for now, distributions seem to think that enabling
CONFIG_AUDITSYSCALL is a reasonable thing to do, and I'd argue that
it's actually a terrible choice for anyone who doesn't actually need
syscall audit rules.  And I don't know who needs these things.

I've heard an argument that selinux benefits from this, since the
syscall exit log helps with diagnosing denials.  I think that's
absurd.  I've never found anything wrong with the denial record itself
that would be helped by seeing the syscall log.  (And there's always
syscall_get_xyz, which would cover this case splendidly with about an
order of magnitude less code.)

As I said, I'm not going to push hard for this code to be marked
BROKEN.  But I think it's rather broken, and I'm definitely not
volunteering to fix it.

--Andy


Re: [PATCH v2 1/2] auditsc: audit_krule mask accesses need bounds checking

2014-05-28 Thread Andy Lutomirski
On Wed, May 28, 2014 at 7:43 PM, Eric Paris epa...@redhat.com wrote:
 On Wed, 2014-05-28 at 19:27 -0700, Andy Lutomirski wrote:
 On Wed, May 28, 2014 at 7:23 PM, Eric Paris epa...@redhat.com wrote:
  On Wed, 2014-05-28 at 18:44 -0700, Andy Lutomirski wrote:
  Fixes an easy DoS and possible information disclosure.
 
  This does nothing about the broken state of x32 auditing.
 
  Cc: sta...@vger.kernel.org
  Signed-off-by: Andy Lutomirski l...@amacapital.net
  ---
   kernel/auditsc.c | 27 ++-
   1 file changed, 18 insertions(+), 9 deletions(-)
 
  diff --git a/kernel/auditsc.c b/kernel/auditsc.c
  index f251a5e..7ccd9db 100644
  --- a/kernel/auditsc.c
  +++ b/kernel/auditsc.c
  @@ -728,6 +728,22 @@ static enum audit_state audit_filter_task(struct 
  task_struct *tsk, char **key)
return AUDIT_BUILD_CONTEXT;
   }
 
  +static bool audit_in_mask(const struct audit_krule *rule, unsigned long 
  val)
  +{
  + int word, bit;
  +
  + if (val > 0xffffffff)
  + return false;
 
  Why is this necessary?

 To avoid an integer overflow.  Admittedly, this particular overflow
 won't cause a crash, but it will cause incorrect results.

 You know this code pre-dates git?  I admit, I'm shocked no one ever
 noticed it before!  This is ANCIENT.  And clearly broken.

  I'll likely ask Richard to add a WARN_ONCE() in both this place, and
  below in the word >= AUDIT_BITMASK_SIZE check, so we might know if we
  ever need a larger bitmask to store syscall numbers.


Not there, please!  It would make sense to check when rules are
*added*, but any user can trivially invoke the filter with pretty much
any syscall nr.

 It'd be nice if lib had a efficient bitmask implementation...

 
  +
  + word = AUDIT_WORD(val);
  + if (word >= AUDIT_BITMASK_SIZE)
  + return false;
 
  Since this covers it and it extensible...
 
  +
  + bit = AUDIT_BIT(val);
  +
  + return rule->mask[word] & bit;
 
  Returning an int as a bool creates worse code than just returning the
  int.  (although in this case if the compiler chooses to inline it might
  be smart enough not to actually convert this int to a bool and make
  worse assembly...)   I'd suggest the function here return an int.  bools
  usually make the assembly worse...

 I'm ambivalent.  The right assembly would use flags on x86, not an int
 or a bool, and I don't know what the compiler will do.

 Also, clearly X86_X32 was implemented without audit support, so we
 shouldn't config it in.  What do you think of this?

 diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
 index 25d2c6f..fa852e2 100644
 --- a/arch/x86/Kconfig
 +++ b/arch/x86/Kconfig
 @@ -129,7 +129,7 @@ config X86
 select HAVE_IRQ_EXIT_ON_IRQ_STACK if X86_64
 select HAVE_CC_STACKPROTECTOR
 select GENERIC_CPU_AUTOPROBE
 -   select HAVE_ARCH_AUDITSYSCALL
 +   select HAVE_ARCH_AUDITSYSCALL if !X86_X32

I'm fine with that, but I hope that distros will choose x32 over
auditsc, at least until someone makes enabling auditsc be less
intrusive.

Or someone could fix it, modulo the syscall nr 2048 thing.  x32
syscall nrs are *huge*.  So are lots of other architectures', a few of
which probably claim to support auditsc.

--Andy


Re: [PATCH v2 2/2] audit: Mark CONFIG_AUDITSYSCALL BROKEN and update help text

2014-05-28 Thread Andy Lutomirski
On Wed, May 28, 2014 at 7:54 PM, Eric Paris epa...@redhat.com wrote:
 On Wed, 2014-05-28 at 19:40 -0700, Andy Lutomirski wrote:
 On Wed, May 28, 2014 at 7:09 PM, Eric Paris epa...@redhat.com wrote:
  NAK
 
  On Wed, 2014-05-28 at 18:44 -0700, Andy Lutomirski wrote:
  Here are some issues with the code:
   - It thinks that syscalls have four arguments.
 
  Not true at all.  It records the registers that would hold the first 4
  entries on syscall entry, for use later if needed, as getting those
  later on some arches is not feasible (see ia64).  It makes no assumption
  about how many syscalls a function has.

 What about a5 and a6?

 On the couple of syscalls where a5 and a6 had any state that was
 actually wanted by someone (mainly just the fd on mmap) audit collects
 it later in the actual syscall.

   - It assumes that syscall numbers are between 0 and 2048.
 
  There could well be a bug here.  Not questioning that.  Although that
  would be patch 1/2

 Even with patch 1, it still doesn't handle large syscall numbers -- it
 just assumes they're not audited.

  That's because we haven't had large syscall numbers.  That's the whole
  point of an arch doing select HAVE_ARCH_AUDITSYSCALL.  If they don't
  meet the requirements, they shouldn't be selecting it.

   - It's unclear whether it's supposed to be reliable.
 
  Unclear to whom?

 To me.

 If some inode access or selinux rule triggers an audit, is the auditsc
 code guaranteed to write an exit record?  And see below...

 This is an honest question:  Do you want to discuss these things, or
 would you be happier if I shut up, fix the bugs you found, and leave
 things be?  I don't want to have an argument, I'm happy to have a
 discussion if you think that will be beneficial...


I'm happy to discuss these things.  I'd be happy if you fix things.  I
think that there are two major issues affecting users of auditsc and
that either those issues should be fixed or users of auditsc should
understand and accept these issues.

Issue 1: syscall numbering.  The syscall filters are unlikely to work
well on lots of architectures.

Issue 2: performance.  Until someone fixes it (which is probably
hard), turning this thing on hurts a lot.  Oleg added a hack awhile
ago that lets you do 'auditctl -a task,never' to get rid of the
performance hit for new tasks, but that rather defeats the point.

The scariness of the code can be lived with, but holy cow it's scary.

I'm just not going to fix the issues myself.  I tried and gave up.

--Andy


Re: [PATCH v2 2/2] audit: Mark CONFIG_AUDITSYSCALL BROKEN and update help text

2014-05-29 Thread Andy Lutomirski
On Thu, May 29, 2014 at 6:05 AM, Steve Grubb sgr...@redhat.com wrote:
 On Wednesday, May 28, 2014 07:40:57 PM Andy Lutomirski wrote:
   - It assumes that syscall numbers are between 0 and 2048.
 
  There could well be a bug here.  Not questioning that.  Although that
  would be patch 1/2

 Even with patch 1, it still doesn't handle large syscall numbers -- it
 just assumes they're not audited.

 All syscalls must be auditable, meaning that if an arch goes above 2048, then
 we'll need to do some math to get it to fall back within the range.


Off the top of my head, x32, some ARM variants, and some MIPS variants
have syscalls above 2048.  auditsc has been disabled on the offending
ARM variant (because I noticed that the same issue affects seccomp,
and no one particularly wants to fix it), but I suspect that auditsc
is enabled but completely broken on whichever MIPS ABI has the issue.
I haven't checked.

TBH, I think that it's silly to let the auditsc design be heavily
constrained by ia64 considerations -- I still think that the syscall
entry hooks could be removed completely with some effort on most
architectures and that performance would improve considerably.  For
users who don't have syscall filter rules, performance might even
improve on ia64 -- I don't care how expensive syscall_get_args is, the
actual printk/auditd thing will dominate in cases where anything is
written.

I wonder whether the syscall restart mechanism can be used to
thoroughly confuse auditsc.  I don't really know how syscall restarts
work.

My point is: I don't know what guarantees auditsc is supposed to
provide, nor do I know how to evaluate whether the code is correct.
For users who aren't bound by Common Criteria or related things, I
suspect that syscall auditing (as opposed to, say, LSM-based auditing)
will provide dubious value at best.  Keep in mind that many syscalls
take pointer arguments, and the auditsc code can't really do anything
sensible with them.


   - It's unclear whether it's supposed to be reliable.
 
  Unclear to whom?

 To me.

 If some inode access or selinux rule triggers an audit, is the auditsc
 code guaranteed to write an exit record?  And see below...

 It should or indicate that it could not.


--Andy


Re: [PATCH v2 2/2] audit: Mark CONFIG_AUDITSYSCALL BROKEN and update help text

2014-05-29 Thread Andy Lutomirski
On Thu, May 29, 2014 at 9:25 AM, Steve Grubb sgr...@redhat.com wrote:
 On Thursday, May 29, 2014 09:04:10 AM Andy Lutomirski wrote:
 On Thu, May 29, 2014 at 6:05 AM, Steve Grubb sgr...@redhat.com wrote:
  On Wednesday, May 28, 2014 07:40:57 PM Andy Lutomirski wrote:
- It assumes that syscall numbers are between 0 and 2048.
  
   There could well be a bug here.  Not questioning that.  Although that
   would be patch 1/2
 
  Even with patch 1, it still doesn't handle large syscall numbers -- it
  just assumes they're not audited.
 
  All syscalls must be auditable. Meaning that if an arch goes above 2048,
  then we'll need to do some math to get it to fall back within the range.

 Off the top of my head, x32, some ARM variants, and some MIPS variants
 have syscalls above 2048.

 That could be fixed by putting a subtraction in place to get the bit mask to
 map correctly. User space audit rules would need to fix that also.


 auditsc has been disabled on the offending
 ARM variant (because I noticed that the same issue affects seccomp,
 and no one particularly wants to fix it), but I suspect that auditsc
 is enabled but completely broken on whichever MIPS ABI has the issue.
 I haven't checked.

 TBH, I think that it's silly to let the auditsc design be heavily
 constrained by ia64 considerations -- I still think that the syscall
 entry hooks could be removed completely with some effort on most
 architectures and that performance would improve considerably.  For
 users who don't have syscall filter rules, performance might even
 improve on ia64 -- I don't care how expensive syscall_get_args is, the
 actual printk/auditd thing will dominate in cases where anything is
 written.

 The last time I heard of benchmarking being done, audit's performance hit was
 negligible. That was about 4 years ago and there has been a whole lot of code
 churn since then. But it should in general be the cost of an 'if' statement
 unless auditing is enabled. If it is enabled, then you necessarily get the
 full treatment in case you trigger an event.


The actual rule matching is reasonably inexpensive, but the killer is
the fact that every system call is forced into the slow path.  Without
audit, this cost is paid by seccomp users and tasks that are being
ptraced.  With audit, every single syscall for every single task pays
the full cost of a slow path system call.

This is extra sad because, AFAICS, most of the syscall entry audit
code shouldn't be necessary even for audit users.

 For users who aren't bound by Common Criteria or related things, I
 suspect that syscall auditing (as opposed to, say, LSM-based auditing)
 will provide dubious value at best.

 No, it works pretty good. You can see who is accessing what files pretty
 clearly.

Right, but I don't think it's the *syscall* part that's really
necessary here.  The fact that someone opened /etc/passwd is
interesting.  The fact that someone sys_opened (const char *)0x123456
is not.


 Keep in mind that many syscalls take pointer arguments, and the auditsc code
 can't really do anything sensible with them.

 All security relevant information is collected in auxiliary records if they
 are not directly readable as a syscall argument. This is a basic requirement.
 We are not required to capture all possible arguments. If you know of any
 security relevant parameters not captured, please let us know.


a5 and a6.  The cmsg data in sendmsg (which is very security-relevant
-- think of SCM_RIGHTS and SCM_CREDENTIALS).  The destination for
sendto.  The whole payload of netlink messages.  I don't know how many
of these are actually captured.

Plus, except for things like setuid and sigreturn, capturing
parameters on entry shouldn't be needed anyway.  Capturing at log time
should be fine.

If this magic capturing only happened for users who need it for
compliance reasons, that would be one thing.  But it happens for every
single system call on Fedora.

--Andy


Re: [tip:x86/vdso] x86, vdso: Reimplement vdso.so preparation in build-time C

2014-05-29 Thread Andy Lutomirski
On Thu, May 29, 2014 at 12:17 PM, Paul Gortmaker
paul.gortma...@windriver.com wrote:
 On Mon, May 5, 2014 at 6:25 PM, tip-bot for Andy Lutomirski
 tip...@zytor.com wrote:
 Commit-ID:  6f121e548f83674ab4920a4e60afb58d4f61b829
 Gitweb: 
 http://git.kernel.org/tip/6f121e548f83674ab4920a4e60afb58d4f61b829
 Author: Andy Lutomirski l...@amacapital.net
 AuthorDate: Mon, 5 May 2014 12:19:34 -0700
 Committer:  H. Peter Anvin h...@linux.intel.com
 CommitDate: Mon, 5 May 2014 13:18:51 -0700

 x86, vdso: Reimplement vdso.so preparation in build-time C

 Just a heads up in case it hasn't been mentioned already;
 the x86 builds in linux-next which IIRC are done as cross
 compile on a powerpc box are failing as follows:

   VDSO2C  arch/x86/vdso/vdso-image-64.c
 /bin/sh: line 1: 15995 Segmentation fault  arch/x86/vdso/vdso2c
 arch/x86/vdso/vdso64.so.dbg arch/x86/vdso/vdso-image-64.c
 make[3]: *** [arch/x86/vdso/vdso-image-64.c] Error 139

 http://kisskb.ellerman.id.au/kisskb/buildresult/11265454/


Egads.  This wouldn't be a big-endian host by any chance?

--Andy


Re: [tip:x86/vdso] x86, vdso: Reimplement vdso.so preparation in build-time C

2014-05-29 Thread Andy Lutomirski
On Thu, May 29, 2014 at 12:43 PM, H. Peter Anvin h...@zytor.com wrote:
 On 05/29/2014 12:32 PM, Andy Lutomirski wrote:
 On Thu, May 29, 2014 at 12:17 PM, Paul Gortmaker
 paul.gortma...@windriver.com wrote:
 On Mon, May 5, 2014 at 6:25 PM, tip-bot for Andy Lutomirski
 tip...@zytor.com wrote:
 Commit-ID:  6f121e548f83674ab4920a4e60afb58d4f61b829
 Gitweb: 
 http://git.kernel.org/tip/6f121e548f83674ab4920a4e60afb58d4f61b829
 Author: Andy Lutomirski l...@amacapital.net
 AuthorDate: Mon, 5 May 2014 12:19:34 -0700
 Committer:  H. Peter Anvin h...@linux.intel.com
 CommitDate: Mon, 5 May 2014 13:18:51 -0700

 x86, vdso: Reimplement vdso.so preparation in build-time C

 Just a heads up in case it hasn't been mentioned already;
 the x86 builds in linux-next which IIRC are done as cross
 compile on a powerpc box are failing as follows:

   VDSO2C  arch/x86/vdso/vdso-image-64.c
 /bin/sh: line 1: 15995 Segmentation fault  arch/x86/vdso/vdso2c
 arch/x86/vdso/vdso64.so.dbg arch/x86/vdso/vdso-image-64.c
 make[3]: *** [arch/x86/vdso/vdso-image-64.c] Error 139

 http://kisskb.ellerman.id.au/kisskb/buildresult/11265454/


 Egads.  This wouldn't be a big-endian host by any chance?


 He said PowerPC; most but not all PowerPC systems are bigendian.  Seems
 like a fair assumption to make.


I suppose I shouldn't have assumed that code in arch/x86 would always
run on little-endian machines.  I'll fix it.

--Andy


[PATCH 2/2] x86,vdso: Fix cross-compilation from big-endian architectures

2014-05-29 Thread Andy Lutomirski
This adds a macro GET(x) to convert x from big-endian to
little-endian.  Hopefully I put it everywhere it needs to go and got
all the cases needed for everyone's linux/elf.h.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 arch/x86/vdso/vdso2c.c | 15 +
 arch/x86/vdso/vdso2c.h | 59 --
 2 files changed, 48 insertions(+), 26 deletions(-)

diff --git a/arch/x86/vdso/vdso2c.c b/arch/x86/vdso/vdso2c.c
index fe8bfbf..de19ced 100644
--- a/arch/x86/vdso/vdso2c.c
+++ b/arch/x86/vdso/vdso2c.c
@@ -51,6 +51,21 @@ static void fail(const char *format, ...)
va_end(ap);
 }
 
+/*
+ * Evil macros to do a little-endian read.
+ */
+#define __GET_TYPE(x, type, bits, ifnot)   \
+   __builtin_choose_expr(  \
+   __builtin_types_compatible_p(typeof(x), type),  \
+   le##bits##toh((x)), ifnot)
+
+extern void bad_get(uint64_t);
+
+#define GET(x) \
+   __GET_TYPE((x), __u32, 32, __GET_TYPE((x), __u64, 64,   \
+   __GET_TYPE((x), __s32, 32, __GET_TYPE((x), __s64, 64,   \
+   __GET_TYPE((x), __u16, 16, bad_get(x))
+
 #define NSYMS (sizeof(required_syms) / sizeof(required_syms[0]))
 
 #define BITS 64
diff --git a/arch/x86/vdso/vdso2c.h b/arch/x86/vdso/vdso2c.h
index 26a7c1f..dadecb9 100644
--- a/arch/x86/vdso/vdso2c.h
+++ b/arch/x86/vdso/vdso2c.h
@@ -18,25 +18,27 @@ static void GOFUNC(void *addr, size_t len, FILE *outfile, const char *name)
const char *secstrings;
uint64_t syms[NSYMS] = {};
 
-   Elf_Phdr *pt = (Elf_Phdr *)(addr + hdr->e_phoff);
+   Elf_Phdr *pt = (Elf_Phdr *)(addr + GET(hdr->e_phoff));
 
/* Walk the segment table. */
-   for (i = 0; i < hdr->e_phnum; i++) {
-   if (pt[i].p_type == PT_LOAD) {
+   for (i = 0; i < GET(hdr->e_phnum); i++) {
+   if (GET(pt[i].p_type) == PT_LOAD) {
if (found_load)
fail("multiple PT_LOAD segs\n");
 
-   if (pt[i].p_offset != 0 || pt[i].p_vaddr != 0)
+   if (GET(pt[i].p_offset) != 0 ||
+   GET(pt[i].p_vaddr) != 0)
fail("PT_LOAD in wrong place\n");
 
-   if (pt[i].p_memsz != pt[i].p_filesz)
+   if (GET(pt[i].p_memsz) != GET(pt[i].p_filesz))
fail("cannot handle memsz != filesz\n");
 
-   load_size = pt[i].p_memsz;
+   load_size = GET(pt[i].p_memsz);
found_load = 1;
-   } else if (pt[i].p_type == PT_DYNAMIC) {
-   dyn = addr + pt[i].p_offset;
-   dyn_end = addr + pt[i].p_offset + pt[i].p_memsz;
+   } else if (GET(pt[i].p_type) == PT_DYNAMIC) {
+   dyn = addr + GET(pt[i].p_offset);
+   dyn_end = addr + GET(pt[i].p_offset) +
+   GET(pt[i].p_memsz);
}
}
if (!found_load)
@@ -44,43 +46,48 @@ static void GOFUNC(void *addr, size_t len, FILE *outfile, const char *name)
data_size = (load_size + 4095) / 4096 * 4096;
 
/* Walk the dynamic table */
-   for (i = 0; dyn + i < dyn_end && dyn[i].d_tag != DT_NULL; i++) {
-   if (dyn[i].d_tag == DT_REL || dyn[i].d_tag == DT_RELSZ ||
-   dyn[i].d_tag == DT_RELENT || dyn[i].d_tag == DT_TEXTREL)
+   for (i = 0; dyn + i < dyn_end && GET(dyn[i].d_tag) != DT_NULL; i++) {
+   typeof(dyn[i].d_tag) tag = GET(dyn[i].d_tag);
+   if (tag == DT_REL || tag == DT_RELSZ ||
+   tag == DT_RELENT || tag == DT_TEXTREL)
fail("vdso image contains dynamic relocations\n");
}
 
/* Walk the section table */
-   secstrings_hdr = addr + hdr->e_shoff + hdr->e_shentsize*hdr->e_shstrndx;
-   secstrings = addr + secstrings_hdr->sh_offset;
-   for (i = 0; i < hdr->e_shnum; i++) {
-   Elf_Shdr *sh = addr + hdr->e_shoff + hdr->e_shentsize * i;
-   if (sh->sh_type == SHT_SYMTAB)
+   secstrings_hdr = addr + GET(hdr->e_shoff) +
+   GET(hdr->e_shentsize)*GET(hdr->e_shstrndx);
+   secstrings = addr + GET(secstrings_hdr->sh_offset);
+   for (i = 0; i < GET(hdr->e_shnum); i++) {
+   Elf_Shdr *sh = addr + GET(hdr->e_shoff) +
+   GET(hdr->e_shentsize) * i;
+   if (GET(sh->sh_type) == SHT_SYMTAB)
symtab_hdr = sh;
 
-   if (!strcmp(secstrings + sh->sh_name, ".altinstructions"))
+   if (!strcmp(secstrings + GET(sh->sh_name), ".altinstructions"))
alt_sec = sh;
}
 
if (!symtab_hdr)
fail("no symbol table\n");
 
-   strtab_hdr = addr + hdr->e_shoff

[PATCH 0/2] x86,vdso: vdso build fixes and improvements

2014-05-29 Thread Andy Lutomirski
Patch 1 causes make;make to behave similarly to make if vdso2c fails.
Patch 2 hopefully fixes x86 crossbuilds on big-endian machines.  I don't
have a big-endian machine to test on, though.

Andy Lutomirski (2):
  x86,vdso: When vdso2c fails, unlink the output
  x86,vdso: Fix cross-compilation from big-endian architectures

 arch/x86/vdso/vdso2c.c | 35 ++---
 arch/x86/vdso/vdso2c.h | 69 ++
 2 files changed, 62 insertions(+), 42 deletions(-)

-- 
1.9.3



[PATCH 1/2] x86,vdso: When vdso2c fails, unlink the output

2014-05-29 Thread Andy Lutomirski
This avoids bizarre failures if make is run again.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 arch/x86/vdso/vdso2c.c | 20 +++-
 arch/x86/vdso/vdso2c.h | 10 +++---
 2 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/arch/x86/vdso/vdso2c.c b/arch/x86/vdso/vdso2c.c
index 81edd1e..fe8bfbf 100644
--- a/arch/x86/vdso/vdso2c.c
+++ b/arch/x86/vdso/vdso2c.c
@@ -14,6 +14,8 @@
 #include <linux/elf.h>
 #include <linux/types.h>
 
+const char *outfilename;
+
 /* Symbols that we need in vdso2c. */
 enum {
sym_vvar_page,
@@ -44,6 +46,7 @@ static void fail(const char *format, ...)
va_start(ap, format);
fprintf(stderr, "Error: ");
vfprintf(stderr, format, ap);
+   unlink(outfilename);
exit(1);
va_end(ap);
 }
@@ -82,17 +85,16 @@ static void fail(const char *format, ...)
 #undef Elf_Sym
 #undef Elf_Dyn
 
-static int go(void *addr, size_t len, FILE *outfile, const char *name)
+static void go(void *addr, size_t len, FILE *outfile, const char *name)
 {
Elf64_Ehdr *hdr = (Elf64_Ehdr *)addr;
 
if (hdr->e_ident[EI_CLASS] == ELFCLASS64) {
-   return go64(addr, len, outfile, name);
+   go64(addr, len, outfile, name);
} else if (hdr->e_ident[EI_CLASS] == ELFCLASS32) {
-   return go32(addr, len, outfile, name);
+   go32(addr, len, outfile, name);
} else {
-   fprintf(stderr, "Error: unknown ELF class\n");
-   return 1;
+   fail("unknown ELF class\n");
}
 }
 
@@ -102,7 +104,6 @@ int main(int argc, char **argv)
off_t len;
void *addr;
FILE *outfile;
-   int ret;
char *name, *tmp;
int namelen;
 
@@ -143,14 +144,15 @@ int main(int argc, char **argv)
if (addr == MAP_FAILED)
err(1, "mmap");
 
-   outfile = fopen(argv[2], "w");
+   outfilename = argv[2];
+   outfile = fopen(outfilename, "w");
if (!outfile)
err(1, "%s", argv[2]);
 
-   ret = go(addr, (size_t)len, outfile, name);
+   go(addr, (size_t)len, outfile, name);
 
munmap(addr, len);
fclose(outfile);
 
-   return ret;
+   return 0;
 }
diff --git a/arch/x86/vdso/vdso2c.h b/arch/x86/vdso/vdso2c.h
index 3dcc61e..26a7c1f 100644
--- a/arch/x86/vdso/vdso2c.h
+++ b/arch/x86/vdso/vdso2c.h
@@ -4,7 +4,7 @@
  * are built for 32-bit userspace.
  */
 
-static int GOFUNC(void *addr, size_t len, FILE *outfile, const char *name)
+static void GOFUNC(void *addr, size_t len, FILE *outfile, const char *name)
 {
int found_load = 0;
unsigned long load_size = -1;  /* Work around bogus warning */
@@ -62,10 +62,8 @@ static int GOFUNC(void *addr, size_t len, FILE *outfile, const char *name)
alt_sec = sh;
}
 
-   if (!symtab_hdr) {
+   if (!symtab_hdr)
fail("no symbol table\n");
-   return 1;
-   }
 
strtab_hdr = addr + hdr->e_shoff +
hdr->e_shentsize * symtab_hdr->sh_link;
@@ -112,7 +110,7 @@ static int GOFUNC(void *addr, size_t len, FILE *outfile, const char *name)
 
if (!name) {
fwrite(addr, load_size, 1, outfile);
-   return 0;
+   return;
}
 
fprintf(outfile, "/* AUTOMATICALLY GENERATED -- DO NOT EDIT */\n\n");
@@ -152,6 +150,4 @@ static int GOFUNC(void *addr, size_t len, FILE *outfile, const char *name)
required_syms[i], syms[i]);
}
fprintf(outfile, "};\n");
-
-   return 0;
 }
-- 
1.9.3



Re: [PATCH 0/2] x86,vdso: vdso build fixes and improvements

2014-05-29 Thread Andy Lutomirski
On Thu, May 29, 2014 at 3:41 PM, Paul Gortmaker
paul.gortma...@windriver.com wrote:
 On 14-05-29 05:57 PM, Andy Lutomirski wrote:
 Patch 1 causes make;make to behave similarly to make if vdso2c fails.
 Patch 2 hopefully fixes x86 crossbuilds on big-endian machines.  I don't
 have a big-endian machine to test on, though.

 Since the x86 builds are unconditionally failing as-is now in
 linux-next, perhaps Stephen [Cc'd] can layer these on the top of
 the tree he'll be making within the next couple of hours to give
 them a big-endian test.

Stephen, if you do this, could you also send me some of the build outputs:

arch/x86/vdso/*.so.dbg
arch/x86/vdso/*-image-*.c

I'd like to verify that the output is correct.  It would be
unfortunate if the cross-built kernel had some subtle error.

Thanks,
Andy


Re: [PATCH 0/2] x86,vdso: vdso build fixes and improvements

2014-05-30 Thread Andy Lutomirski
On Thu, May 29, 2014 at 10:42 PM, Stephen Rothwell s...@canb.auug.org.au 
wrote:
 Hi Andy,

 On Thu, 29 May 2014 15:49:52 -0700 Andy Lutomirski l...@amacapital.net 
 wrote:

 On Thu, May 29, 2014 at 3:41 PM, Paul Gortmaker
 paul.gortma...@windriver.com wrote:
  On 14-05-29 05:57 PM, Andy Lutomirski wrote:
  Patch 1 causes make;make to behave similarly to make if vdso2c fails.
  Patch 2 hopefully fixes x86 crossbuilds on big-endian machines.  I don't
  have a big-endian machine to test on, though.
 
  Since the x86 builds are unconditionally failing as-is now in
  linux-next, perhaps Stephen [Cc'd] can layer these on the top of
  the tree he'll be making within the next couple of hours to give
  them a big-endian test.

 I added them just after the merge of the tip tree today and will keep
 them there until they (or somo other solution) turn up some other way.

 Stephen, if you do this, could you also send me some of the build outputs:

 arch/x86/vdso/*.so.dbg
 arch/x86/vdso/*-image-*.c

 I'd like to verify that the output is correct.  It would be
 unfortunate if the cross-built kernel had some subtle error.

 OK, I will send them in a separate email.

For the benefit of the archives: the patches are bad.  New ones coming.

--Andy


 --
 Cheers,
 Stephen Rothwell s...@canb.auug.org.au



-- 
Andy Lutomirski
AMA Capital Management, LLC


[PATCH v2 2/2] x86,vdso: Fix cross-compilation from big-endian architectures

2014-05-30 Thread Andy Lutomirski
This adds a macro GET(x) to convert x from big-endian to
little-endian.  Hopefully I put it everywhere it needs to go and got
all the cases needed for everyone's linux/elf.h.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 arch/x86/vdso/vdso2c.c | 15 
 arch/x86/vdso/vdso2c.h | 63 --
 2 files changed, 50 insertions(+), 28 deletions(-)

diff --git a/arch/x86/vdso/vdso2c.c b/arch/x86/vdso/vdso2c.c
index fe8bfbf..de19ced 100644
--- a/arch/x86/vdso/vdso2c.c
+++ b/arch/x86/vdso/vdso2c.c
@@ -51,6 +51,21 @@ static void fail(const char *format, ...)
va_end(ap);
 }
 
+/*
+ * Evil macros to do a little-endian read.
+ */
+#define __GET_TYPE(x, type, bits, ifnot)   \
+   __builtin_choose_expr(  \
+   __builtin_types_compatible_p(typeof(x), type),  \
+   le##bits##toh((x)), ifnot)
+
+extern void bad_get(uint64_t);
+
+#define GET(x) \
+   __GET_TYPE((x), __u32, 32, __GET_TYPE((x), __u64, 64,   \
+   __GET_TYPE((x), __s32, 32, __GET_TYPE((x), __s64, 64,   \
+   __GET_TYPE((x), __u16, 16, bad_get(x))
+
 #define NSYMS (sizeof(required_syms) / sizeof(required_syms[0]))
 
 #define BITS 64
diff --git a/arch/x86/vdso/vdso2c.h b/arch/x86/vdso/vdso2c.h
index 26a7c1f..f0475da 100644
--- a/arch/x86/vdso/vdso2c.h
+++ b/arch/x86/vdso/vdso2c.h
@@ -18,25 +18,27 @@ static void GOFUNC(void *addr, size_t len, FILE *outfile, const char *name)
const char *secstrings;
uint64_t syms[NSYMS] = {};
 
-   Elf_Phdr *pt = (Elf_Phdr *)(addr + hdr->e_phoff);
+   Elf_Phdr *pt = (Elf_Phdr *)(addr + GET(hdr->e_phoff));
 
/* Walk the segment table. */
-   for (i = 0; i < hdr->e_phnum; i++) {
-   if (pt[i].p_type == PT_LOAD) {
+   for (i = 0; i < GET(hdr->e_phnum); i++) {
+   if (GET(pt[i].p_type) == PT_LOAD) {
if (found_load)
fail("multiple PT_LOAD segs\n");
 
-   if (pt[i].p_offset != 0 || pt[i].p_vaddr != 0)
+   if (GET(pt[i].p_offset) != 0 ||
+   GET(pt[i].p_vaddr) != 0)
fail("PT_LOAD in wrong place\n");
 
-   if (pt[i].p_memsz != pt[i].p_filesz)
+   if (GET(pt[i].p_memsz) != GET(pt[i].p_filesz))
fail("cannot handle memsz != filesz\n");
 
-   load_size = pt[i].p_memsz;
+   load_size = GET(pt[i].p_memsz);
found_load = 1;
-   } else if (pt[i].p_type == PT_DYNAMIC) {
-   dyn = addr + pt[i].p_offset;
-   dyn_end = addr + pt[i].p_offset + pt[i].p_memsz;
+   } else if (GET(pt[i].p_type) == PT_DYNAMIC) {
+   dyn = addr + GET(pt[i].p_offset);
+   dyn_end = addr + GET(pt[i].p_offset) +
+   GET(pt[i].p_memsz);
}
}
if (!found_load)
@@ -44,43 +46,48 @@ static void GOFUNC(void *addr, size_t len, FILE *outfile, const char *name)
data_size = (load_size + 4095) / 4096 * 4096;
 
/* Walk the dynamic table */
-   for (i = 0; dyn + i < dyn_end && dyn[i].d_tag != DT_NULL; i++) {
-   if (dyn[i].d_tag == DT_REL || dyn[i].d_tag == DT_RELSZ ||
-   dyn[i].d_tag == DT_RELENT || dyn[i].d_tag == DT_TEXTREL)
+   for (i = 0; dyn + i < dyn_end && GET(dyn[i].d_tag) != DT_NULL; i++) {
+   typeof(dyn[i].d_tag) tag = GET(dyn[i].d_tag);
+   if (tag == DT_REL || tag == DT_RELSZ ||
+   tag == DT_RELENT || tag == DT_TEXTREL)
fail("vdso image contains dynamic relocations\n");
}
 
/* Walk the section table */
-   secstrings_hdr = addr + hdr->e_shoff + hdr->e_shentsize*hdr->e_shstrndx;
-   secstrings = addr + secstrings_hdr->sh_offset;
-   for (i = 0; i < hdr->e_shnum; i++) {
-   Elf_Shdr *sh = addr + hdr->e_shoff + hdr->e_shentsize * i;
-   if (sh->sh_type == SHT_SYMTAB)
+   secstrings_hdr = addr + GET(hdr->e_shoff) +
+   GET(hdr->e_shentsize)*GET(hdr->e_shstrndx);
+   secstrings = addr + GET(secstrings_hdr->sh_offset);
+   for (i = 0; i < GET(hdr->e_shnum); i++) {
+   Elf_Shdr *sh = addr + GET(hdr->e_shoff) +
+   GET(hdr->e_shentsize) * i;
+   if (GET(sh->sh_type) == SHT_SYMTAB)
symtab_hdr = sh;
 
-   if (!strcmp(secstrings + sh->sh_name, ".altinstructions"))
+   if (!strcmp(secstrings + GET(sh->sh_name), ".altinstructions"))
alt_sec = sh;
}
 
if (!symtab_hdr)
fail("no symbol table\n");
 
-   strtab_hdr = addr + hdr->e_shoff

[PATCH v2 0/2] x86,vdso: vdso build fixes and improvements

2014-05-30 Thread Andy Lutomirski
Patch 1 causes make;make to behave similarly to make if vdso2c fails.
Patch 2 hopefully fixes x86 crossbuilds on big-endian machines.  I don't
have a big-endian machine to test on, though.

Changes from v1: Add two missing endian fixes

Andy Lutomirski (2):
  x86,vdso: When vdso2c fails, unlink the output
  x86,vdso: Fix cross-compilation from big-endian architectures

 arch/x86/vdso/vdso2c.c | 35 +---
 arch/x86/vdso/vdso2c.h | 73 ++
 2 files changed, 64 insertions(+), 44 deletions(-)

-- 
1.9.3



[PATCH v2 1/2] x86,vdso: When vdso2c fails, unlink the output

2014-05-30 Thread Andy Lutomirski
This avoids bizarre failures if make is run again.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 arch/x86/vdso/vdso2c.c | 20 +++-
 arch/x86/vdso/vdso2c.h | 10 +++---
 2 files changed, 14 insertions(+), 16 deletions(-)

diff --git a/arch/x86/vdso/vdso2c.c b/arch/x86/vdso/vdso2c.c
index 81edd1e..fe8bfbf 100644
--- a/arch/x86/vdso/vdso2c.c
+++ b/arch/x86/vdso/vdso2c.c
@@ -14,6 +14,8 @@
 #include <linux/elf.h>
 #include <linux/types.h>
 
+const char *outfilename;
+
 /* Symbols that we need in vdso2c. */
 enum {
sym_vvar_page,
@@ -44,6 +46,7 @@ static void fail(const char *format, ...)
va_start(ap, format);
fprintf(stderr, "Error: ");
vfprintf(stderr, format, ap);
+   unlink(outfilename);
exit(1);
va_end(ap);
 }
@@ -82,17 +85,16 @@ static void fail(const char *format, ...)
 #undef Elf_Sym
 #undef Elf_Dyn
 
-static int go(void *addr, size_t len, FILE *outfile, const char *name)
+static void go(void *addr, size_t len, FILE *outfile, const char *name)
 {
Elf64_Ehdr *hdr = (Elf64_Ehdr *)addr;
 
if (hdr->e_ident[EI_CLASS] == ELFCLASS64) {
-   return go64(addr, len, outfile, name);
+   go64(addr, len, outfile, name);
} else if (hdr->e_ident[EI_CLASS] == ELFCLASS32) {
-   return go32(addr, len, outfile, name);
+   go32(addr, len, outfile, name);
} else {
-   fprintf(stderr, "Error: unknown ELF class\n");
-   return 1;
+   fail("unknown ELF class\n");
}
 }
 
@@ -102,7 +104,6 @@ int main(int argc, char **argv)
off_t len;
void *addr;
FILE *outfile;
-   int ret;
char *name, *tmp;
int namelen;
 
@@ -143,14 +144,15 @@ int main(int argc, char **argv)
if (addr == MAP_FAILED)
err(1, "mmap");
 
-   outfile = fopen(argv[2], "w");
+   outfilename = argv[2];
+   outfile = fopen(outfilename, "w");
if (!outfile)
err(1, "%s", argv[2]);
 
-   ret = go(addr, (size_t)len, outfile, name);
+   go(addr, (size_t)len, outfile, name);
 
munmap(addr, len);
fclose(outfile);
 
-   return ret;
+   return 0;
 }
diff --git a/arch/x86/vdso/vdso2c.h b/arch/x86/vdso/vdso2c.h
index 3dcc61e..26a7c1f 100644
--- a/arch/x86/vdso/vdso2c.h
+++ b/arch/x86/vdso/vdso2c.h
@@ -4,7 +4,7 @@
  * are built for 32-bit userspace.
  */
 
-static int GOFUNC(void *addr, size_t len, FILE *outfile, const char *name)
+static void GOFUNC(void *addr, size_t len, FILE *outfile, const char *name)
 {
int found_load = 0;
unsigned long load_size = -1;  /* Work around bogus warning */
@@ -62,10 +62,8 @@ static int GOFUNC(void *addr, size_t len, FILE *outfile, const char *name)
alt_sec = sh;
}
 
-   if (!symtab_hdr) {
+   if (!symtab_hdr)
fail("no symbol table\n");
-   return 1;
-   }
 
strtab_hdr = addr + hdr->e_shoff +
hdr->e_shentsize * symtab_hdr->sh_link;
@@ -112,7 +110,7 @@ static int GOFUNC(void *addr, size_t len, FILE *outfile, const char *name)
 
if (!name) {
fwrite(addr, load_size, 1, outfile);
-   return 0;
+   return;
}
 
fprintf(outfile, "/* AUTOMATICALLY GENERATED -- DO NOT EDIT */\n\n");
@@ -152,6 +150,4 @@ static int GOFUNC(void *addr, size_t len, FILE *outfile, const char *name)
required_syms[i], syms[i]);
}
fprintf(outfile, "};\n");
-
-   return 0;
 }
-- 
1.9.3



Re: Pondering per-process vsyscall disablement

2014-05-30 Thread Andy Lutomirski
On Wed, May 28, 2014 at 2:45 PM, H. Peter Anvin h...@zytor.com wrote:
 On 05/23/2014 09:40 AM, Andy Lutomirski wrote:

 I don't think this should be something configured by the
 administrator, unless the administrator is the builder of a kiosky
 thing like Chromium OS.  In that case, the administrator can use
 vsyscall=none.

 I think this should be handled by either libc or the toolchain, hence
 the suggestions of a syscall or an ELF header.


 We could mimic the NX stack stuff, but it would have a lot of false
 negatives, simply because very few things would actually poke at the
 vsyscall page.

 The NX stuff uses a dummy program header in the ELF image.

 On the other hand, you could make the argument that anything compiled
 with a new toolchain simply should not use the vsyscall page, and just
 unconditionally set the opt-out bit (header) in question.

 It might be better to have some kind of flags field (which a number of
 architectures use) than keep using dummy program headers, though.

Do the flags go in the ELF loader or in the executable we're running?
Or both (and, if both, do we AND them or OR them)?

I think the interpreter makes a little more sense in general: for the
most part, use of vsyscalls is a property of the runtime environment,
not of the program being run.  But maybe this is naive.

--Andy


Re: [PATCH v2 2/2] x86,vdso: Fix cross-compilation from big-endian architectures

2014-05-30 Thread Andy Lutomirski
On Fri, May 30, 2014 at 1:02 PM, H. Peter Anvin h...@zytor.com wrote:
 On 05/30/2014 08:48 AM, Andy Lutomirski wrote:
 This adds a macro GET(x) to convert x from big-endian to
 little-endian.  Hopefully I put it everywhere it needs to go and got
 all the cases needed for everyone's linux/elf.h.

 Signed-off-by: Andy Lutomirski l...@amacapital.net
 ---
  arch/x86/vdso/vdso2c.c | 15 
  arch/x86/vdso/vdso2c.h | 63 --
  2 files changed, 50 insertions(+), 28 deletions(-)

 A couple of observations:

 1. We shouldn't use double-underscore in host C code.

 2. It would be nice if we can take these sort of things (host-build
helper macros) and move them to some common file in the Linux kernel
eventually, so it would be a good thing to make the naming a little
less general.

 3. Even though it isn't necessary, making it work on 8-bit values so
one doesn't have to worry about the type would seem like a good
thing.

 I came up with the following, it seems like a reasonable simplification:

 #define _LE(x, bits, ifnot)   \
   __builtin_choose_expr(  \
   (sizeof(x) == bits/8),  \
   (__typeof__(x))le##bits##toh(x), ifnot)

This will do awful things if x is a floating-point type, and, for
integers, the cast is probably unnecessary.  But it should be okay.


 extern void bad_le(uint64_t);

If this ever goes in a common header, then we should do the
__attribute__((error)) thing.  I wonder if it would ever make sense to
have __LINUX_HOSTPROG__ and make some of the basic headers work.  Hmm.

 #define _LAST_LE(x)   \
   __builtin_choose_expr(sizeof(x) == 1, (x), bad_le(x))

 #define LE(x) \
   _LE(x, 64, _LE(x, 32, _LE(x, 16, _LAST_LE(x))))

 What do you think?

My only real real objection is that _LE sounds like converting *to*
little-endian to me.  Admittedly, that's the same thing on any
remotely sane architecture, but still.

--Andy


Re: Pondering per-process vsyscall disablement

2014-05-30 Thread Andy Lutomirski
On Fri, May 30, 2014 at 1:05 PM, H. Peter Anvin h...@zytor.com wrote:
 On 05/30/2014 01:00 PM, Andy Lutomirski wrote:

 Do the flags go in the ELF loader or in the executable we're running?
 Or both (and, if both, do we and them or or them)?

 I think the interpreter makes a little more sense in general: for the
 most part, use of vsyscalls is a property of the runtime environment,
 not of the program being run.  But maybe this is naive.


 They go into each object which becomes part of the running program, i.e.
 executable, dynamic libraries, and dynamic linker.

Well, sure, but the kernel is not about to start reading ELF headers
in dynamic libraries.  So we need to make a decision based on the
interpreter and the executable.  The conservative approach is to
require both to have the flag set *and* to offer a prctl to twiddle
the flags.  Then userspace loaders can do whatever they want, and
distros get to rebuild the world :)

--Andy


Re: [PATCH v2 2/2] x86,vdso: Fix cross-compilation from big-endian architectures

2014-05-30 Thread Andy Lutomirski
On Fri, May 30, 2014 at 1:21 PM, H. Peter Anvin h...@zytor.com wrote:
 On 05/30/2014 01:09 PM, Andy Lutomirski wrote:

 I came up with the following, it seems like a reasonable simplification:

 #define _LE(x, bits, ifnot)   \
   __builtin_choose_expr(  \
   (sizeof(x) == bits/8),  \
   (__typeof__(x))le##bits##toh(x), ifnot)

 This will do awful things if x is a floating-point type, and, for
 integers, the cast is probably unnecessary.  But it should be okay.


 I mostly wanted to preserve the signedness.  Yes, if we care about
 floating-point it gets trickier.

 At some point hopefully there will be a native C feature to handle this
 crap.

 extern void bad_le(uint64_t);

 If this ever goes in a common header, then we should do the
 __attribute__((error)) thing.  I wonder if it would ever make sense to
 have __LINUX_HOSTPROG__ and make some of the basic headers work.  Hmm.

 #define _LAST_LE(x)   \
   __builtin_choose_expr(sizeof(x) == 1, (x), bad_le(x))

 #define LE(x) \
   _LE(x, 64, _LE(x, 32, _LE(x, 16, _LAST_LE(x))))

 What do you think?

 My only real real objection is that _LE sounds like converting *to*
 little-endian to me.  Admittedly, that's the same thing on any
 remotely sane architecture, but still.

 GET_LE() then?

Sounds good.

Are you planning on writing the patch?

I think my v2 is good -- the only diff I could find in my image.c
files and Stephen's was in the alt_xyz output, and I think I fixed
that in v2.

--Andy


Re: Pondering per-process vsyscall disablement

2014-05-30 Thread Andy Lutomirski
On Fri, May 30, 2014 at 1:20 PM, H. Peter Anvin h...@zytor.com wrote:
 On 05/30/2014 01:11 PM, Andy Lutomirski wrote:
 On Fri, May 30, 2014 at 1:05 PM, H. Peter Anvin h...@zytor.com wrote:
 On 05/30/2014 01:00 PM, Andy Lutomirski wrote:

 Do the flags go in the ELF loader or in the executable we're running?
 Or both (and, if both, do we and them or or them)?

 I think the interpreter makes a little more sense in general: for the
 most part, use of vsyscalls is a property of the runtime environment,
 not of the program being run.  But maybe this is naive.


 They go into each object which becomes part of the running program, i.e.
 executable, dynamic libraries, and dynamic linker.

 Well, sure, but the kernel is not about to start reading ELF headers
 in dynamic libraries.  So we need to make a decision based on the
 interpreter and the executable.  The conservative approach is to
 require both to have the flag set *and* to offer a prctl to twiddle
 the flags.  Then userspace loaders can do whatever they want, and
 distros get to rebuild the world :)


 Yes, something like that.

I'll hack something up once the merge window closes.  Or maybe sooner
if you commit my vsyscall patches from a few days ago.  Otherwise I'm
just going to confuse my git tree too much :)

--Andy


 -hpa





-- 
Andy Lutomirski
AMA Capital Management, LLC


[PATCH 1/2] x86,syscall: Add syscall_in_syscall to test whether we're in a syscall

2014-05-30 Thread Andy Lutomirski
syscall_in_syscall will return true if we're in a real syscall and
will return false if we're not in a syscall.  If we're in a bad
syscall, the return value can vary.

The idea is to use this to come up with a much simpler replacement
for syscall auditing.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 arch/x86/Kconfig   |  1 +
 arch/x86/include/asm/syscall.h | 21 +
 init/Kconfig   |  3 +++
 3 files changed, 25 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 25d2c6f..e2602d4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -130,6 +130,7 @@ config X86
select HAVE_CC_STACKPROTECTOR
select GENERIC_CPU_AUTOPROBE
select HAVE_ARCH_AUDITSYSCALL
+   select HAVE_SYSCALL_IN_SYSCALL
 
 config INSTRUCTION_DECODER
def_bool y
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index d6a756a..91e38b3 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -23,6 +23,27 @@
 typedef void (*sys_call_ptr_t)(void);
 extern const sys_call_ptr_t sys_call_table[];
 
+/**
+ * syscall_in_syscall() - are we in a syscall context?
+ * @task: The task to query.
+ * @regs: The task's pt_regs.
+ *
+ * This checks whether we are in a syscall.  If it returns true, then
+ * syscall_get_nr(), etc are usable and the current task is guaranteed
+ * to either die or to go through the syscall exit path when the syscall
+ * is done.
+ *
+ * If it returns false, no particular guarantees are made.  In
+ * particular, a malicious task can issue a syscall that causes
+ * syscall_in_syscall to return false.  Such a syscall won't do much,
+ * but it can still cause tracing code and such to run.
+ */
+static inline bool syscall_in_syscall(struct task_struct *task,
+				      struct pt_regs *regs)
+{
+	return regs->orig_ax != -1;
+}
+
 /*
  * Only the low 32 bits of orig_ax are meaningful, so we return int.
  * This importantly ignores the high bits on 64-bit, so comparisons
diff --git a/init/Kconfig b/init/Kconfig
index 9d3585b..bad2053 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -295,6 +295,9 @@ config AUDIT
 config HAVE_ARCH_AUDITSYSCALL
bool
 
+config HAVE_SYSCALL_IN_SYSCALL
+   bool
+
 config AUDITSYSCALL
	bool "Enable system-call auditing support"
	depends on AUDIT && HAVE_ARCH_AUDITSYSCALL
-- 
1.9.3



[PATCH 0/2] Syscall auditing lite

2014-05-30 Thread Andy Lutomirski
I've made no secret of the fact that I dislike syscall auditing.  As far
as I can tell, the main technical (i.e. not compliance-related) use of
syscall auditing is to supply some useful context information to go
along with events like AVC denials.

CONFIG_AUDITSYSCALL is serious overkill to do this.  kernel/auditsc.c is
~2500 lines of terror.

This patchset accomplishes the same goal, more usefully, with no
overhead at all, in under 70 lines of code.  It tries to coexist cleanly
with CONFIG_AUDITSYSCALL.

This is only implemented for x86.  Other architectures can add support
fairly easily, I think.

Andy Lutomirski (2):
  x86,syscall: Add syscall_in_syscall to test whether we're in a syscall
  audit: Syscall auditing lite

 arch/x86/Kconfig   |  1 +
 arch/x86/include/asm/syscall.h | 21 
 init/Kconfig   |  3 +++
 kernel/audit.c | 44 +-
 4 files changed, 68 insertions(+), 1 deletion(-)

-- 
1.9.3



[PATCH 2/2] audit: Syscall auditing lite

2014-05-30 Thread Andy Lutomirski
AFAICS the main use of syscall auditing is to get syscall
information for syscalls that are already causing another audit
message.

We don't need any of the fancy syscall auditing machinery for that,
though: we can just log this information directly.  This should have
essentially no overhead and it could end up being much easier to use
than auditsc.

This produces messages like this:

audit: type=1123 audit(1401485315.370:2): pid=125 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='blah blah blah' arch=c000003e syscall=44 a0=3 a1=7fff383feb60 a2=5c a3=0 a4=7fff383feb50 a5=c

The new fields (arch, syscall, and a0..a5) will only be logged if we
are in a syscall but we aren't otherwise building an auditsc context.

This is only supported on x86 for now.  Other architectures can get
this if they implement syscall_in_syscall.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 kernel/audit.c | 44 +++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/kernel/audit.c b/kernel/audit.c
index 47845c5..8509d00 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -67,6 +67,10 @@
 #include <linux/pid_namespace.h>
 #include <net/netns/generic.h>

+#ifdef CONFIG_HAVE_SYSCALL_IN_SYSCALL
+#include <asm/syscall.h>
+#endif

 #include "audit.h"
 
 /* No auditing will take place until audit_initialized == AUDIT_INITIALIZED.
@@ -1897,6 +1901,40 @@ out:
kfree(name);
 }
 
+#ifdef CONFIG_HAVE_SYSCALL_IN_SYSCALL
+/**
+ * audit_log_missing_context - append otherwise-missing context
+ * @ab: the audit_buffer
+ *
+ * If syscall auditing is unavailable, try to log syscall context
+ * information anyway.
+ */
+static void audit_log_missing_context(struct audit_buffer *ab)
+{
+   struct task_struct *tsk = current;
+   struct pt_regs *regs = current_pt_regs();
+   unsigned long args[6];
+
+   if (!syscall_in_syscall(tsk, regs))
+   return;
+
+	if (ab->ctx && ab->ctx->in_syscall)
+		return;	/* Let audit_log_exit log the context. */
+
+   syscall_get_arguments(tsk, regs, 0, 6, args);
+
+	audit_log_format(ab,
+			 " arch=%x syscall=%d a0=%lx a1=%lx a2=%lx a3=%lx a4=%lx a5=%lx",
+			 (unsigned int)syscall_get_arch(),
+			 syscall_get_nr(tsk, regs),
+			 args[0], args[1], args[2], args[3], args[4], args[5]);
+}
+#else
+static void audit_log_missing_context(struct audit_buffer *ab)
+{
+   /* We need arch support to do this reliably, so don't even try. */
+}
+#endif
+
 /**
  * audit_log_end - end one audit record
  * @ab: the audit_buffer
@@ -1913,7 +1951,11 @@ void audit_log_end(struct audit_buffer *ab)
if (!audit_rate_check()) {
audit_log_lost(rate limit exceeded);
} else {
-	struct nlmsghdr *nlh = nlmsg_hdr(ab->skb);
+	struct nlmsghdr *nlh;
+
+	audit_log_missing_context(ab);
+
+	nlh = nlmsg_hdr(ab->skb);
 	nlh->nlmsg_len = ab->skb->len - NLMSG_HDRLEN;
 
if (audit_pid) {
-- 
1.9.3



Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces

2014-05-23 Thread Andy Lutomirski
On Fri, May 23, 2014 at 6:16 AM, James Bottomley
james.bottom...@hansenpartnership.com wrote:
 On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote:
 On 05/20/2014 05:19 PM, Serge Hallyn wrote:
  Quoting Andy Lutomirski (l...@amacapital.net):
  On May 15, 2014 1:26 PM, Serge E. Hallyn se...@hallyn.com wrote:
 
  Quoting Richard Weinberger (rich...@nod.at):
  Am 15.05.2014 21:50, schrieb Serge Hallyn:
  Quoting Richard Weinberger (richard.weinber...@gmail.com):
  On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman 
  gre...@linuxfoundation.org wrote:
  Then don't use a container to build such a thing, or fix the build 
  scripts to not do that :)
 
  I second this. To me it looks like some folks try to (ab)use Linux 
  containers for purposes where KVM
  would much better fit in. Please don't put more complexity into 
  containers. They are already horrible
  complex and error prone.
 
  I, naturally, disagree :)  The only use case which is inherently not 
  valid for containers is running a
  kernel.  Practically speaking there are other things which likely will 
  never be possible, but if someone
  offers a way to do something in containers, you can't do that in 
  containers is not an apropos response.
 
  That abstraction is wrong is certainly valid, as when vpids were 
  originally proposed and rejected,
  resulting in the development of pid namespaces.  We have to work out 
  (x) first can be valid (and I can
  think of examples here), assuming it's not just trying to hide behind 
  a catch-22/chicken-egg problem.
 
  Finally, saying containers are complex and error prone is conflating 
  several large suites of userspace
  code and many kernel features which support them.  Being more precise 
  would, if the argument is valid, lend
  it a lot more weight.
 
  We (my company) use Linux containers since 2011 in production. First 
  LXC, now libvirt-lxc. To understand the
  internals better I also wrote my own userspace to create/start 
  containers. There are so many things which can
  hurt you badly. With user namespaces we expose a really big attack 
  surface to regular users. I.e. Suddenly a
  user is allowed to mount filesystems.
 
  That is currently not the case.  They can mount some virtual filesystems 
  and do bind mounts, but cannot mount
  most real filesystems.  This keeps us protected (for now) from 
  potentially unsafe superblock readers in the
  kernel.
 
  Ask Andy, he found already lots of nasty things...
 
  I don't think I have anything brilliant to add to this discussion right 
  now, except possibly:
 
  ISTM that Linux distributions are, in general, vulnerable to all kinds of 
  shenanigans that would happen if an
  untrusted user can cause a block device to appear.  That user doesn't 
  need permission to mount it
 
  Interesting point.  This would further suggest that we absolutely must 
  ensure that a loop device which shows up in
  the container does not also show up in the host.

 Can I suggest the usage of the devices cgroup to achieve that?

 Not really ... cgroups impose resource limits, it's namespaces that
 impose visibility separations.  In theory this can be done with the
 device namespace that's been proposed; however, a simpler way is simply
 to rm the device node in the host and mknod it in the guest.  I don't
 really see host visibility as a huge problem: in a shared OS
 virtualisation it's not really possible securely to separate the guest
 from the host (only vice versa).

 But I really don't think we want to do it this way.  Giving a container
 the ability to do a mount is too dangerous.  What we want to do is
 intercept the mount in the host and perform it on behalf of the guest as
 host root in the guest's mount namespace.  If you do it that way, it
 doesn't really matter what device actually shows up in the guest, as
 long as the host knows what to do when the mount request comes along.

This is only useful/safe if the host understands what's going on.  By
the host, I mean the host's udev and other system-level stuff.  This
is probably fine for disks and such, but it might not be so great for
loop devices, FUSE, etc.  I already know of one user of containers
that wants container-local FUSE mounts.  This ought to Just Work (tm),
but there's fair amount of work needed to get there.

--Andy


Re: Pondering per-process vsyscall disablement

2014-05-23 Thread Andy Lutomirski
On Thu, May 22, 2014 at 7:44 PM, Marian Marinov m...@1h.com wrote:

 On 05/23/2014 02:04 AM, Andy Lutomirski wrote:
 It would be nice to have a way for new programs to declare that they don't 
 need vsyscalls.  What's the right way to
 do this?  An ELF header entry in the loader?  An ELF header entry in the 
 program?  A new arch_prctl?

 As background, there's an old part of the x86_64 ABI that allows programs to 
 do gettimeofday, clock_gettime, and
 getcpu by calling to fixed addresses of the form 0xffffffffff600n00, where n
 indicates which of those three syscalls
 is being invoked.  This is a security issue.

 Since Linux 3.1, vsyscalls are emulated using NX and page faults.  As a 
 result, vsyscalls no longer offer any
 performance advantage over normal syscalls; in fact, they're much slower.  
 As far as I know, nothing newer than
 2012 will attempt to use vsyscalls if a vdso is present.  (Sadly, a lot of 
 things will still fall back to the
 vsyscall page if there is no vdso, but that shouldn't matter, since there is 
 always a vdso.)

 Despite the emulation, they could still be used as a weird form of ROP 
 gadget that lives at a fixed address.  I'd
 like to offer a way for new runtimes to indicate that they don't use 
 vsyscalls so that the kernel can selectively
 disable emulation and remove the fixed-address executable code issue.


 Wouldn't it be more useful if the check is against a bitmask added as 
 extended attribute for that executable?
 This way the administrators and will have the flexibility to simply add the 
 new attribute to the executable.

I don't think this should be something configured by the
administrator, unless the administrator is the builder of a kiosky
thing like Chromium OS.  In that case, the administrator can use
vsyscall=none.

I think this should be handled by either libc or the toolchain, hence
the suggestions of a syscall or an ELF header.

--Andy


Re: [PATCH] x86, MCE: Flesh out when to panic comment

2014-05-26 Thread Andy Lutomirski
On Mon, May 26, 2014 at 4:06 AM, Borislav Petkov b...@alien8.de wrote:
 On Mon, May 26, 2014 at 12:51:10PM +0200, Jiri Kosina wrote:
 I think the comment is still not explaining the big part of what the
 discussion was about -- i.e. if it was in kernel context, we always
 panic.

 I thought the pointer to mce_severity was enough? People should open an
 editor and look at the function and at its gory insanity. :-P

It may be worth at least pointing out that mce_severity looks at
whether we faulted from kernel context.  I missed that the first time
around because mce_severity doesn't take a pt_regs pointer.

--Andy


Re: [PATCH 1/9] procfs: use flags to deny or allow access to /proc/pid/$entry

2014-05-26 Thread Andy Lutomirski
On Mon, May 26, 2014 at 6:27 AM, Djalal Harouni tix...@opendz.org wrote:
 Add the deny or allow flags, so we can perform proper permission checks
 and set the result accordingly. These flags are needed in case we have
 to cache the result of permission checks that are done during ->open()
 time. Later during ->read(), we can decide to allow or deny the read().

 The pid entries that need these flags are:
 /proc/pid/stat
 /proc/pid/wchan
 /proc/pid/maps  (will be handled in next patches).

 These files are world readable, userspace depend on that. To prevent
 ASLR leaks and to avoid breaking userspace, we follow this scheme:

 a) Perform permission checks during ->open()
 b) Cache the result of a) and return success
 c) Recheck the cached result during ->read()

Why is (c) needed?


  /*
 + * Flags used to deny or allow current to access /proc/pid/$entry
 + * after proper permission checks.
 + */
 +enum {
 +	PID_ENTRY_DENY  = 0,	/* Deny access */
 +	PID_ENTRY_ALLOW = 1,	/* Allow access */
 +};

I think this would be less alarming if this were:

#define PID_ENTRY_DENY ((void *)1UL)
#define PID_ENTRY_ALLOW ((void *)2UL)

Also, I don't like DENY and ALLOW.  It's not denying and allowing.
How about PID_ENTRY_OPENER_MAY_PTRACE and
PID_ENTRY_OPENER_MAY_NOT_PTRACE?

--Andy


Re: [PATCH 2/9] procfs: add pid_entry_access() for proper checks on /proc/pid/*

2014-05-26 Thread Andy Lutomirski
On Mon, May 26, 2014 at 6:27 AM, Djalal Harouni tix...@opendz.org wrote:
 Add the helper pid_entry_access() to unify the permission checks during
 ->open()

 This is a preparation patch.

 Signed-off-by: Djalal Harouni tix...@opendz.org
 ---
  fs/proc/generic.c  | 22 ++
  fs/proc/internal.h |  2 ++
  2 files changed, 24 insertions(+)

 diff --git a/fs/proc/generic.c b/fs/proc/generic.c
 index b7f268e..98ed927 100644
 --- a/fs/proc/generic.c
 +++ b/fs/proc/generic.c
 @@ -23,6 +23,7 @@
  #include <linux/bitops.h>
  #include <linux/spinlock.h>
  #include <linux/completion.h>
 +#include <linux/ptrace.h>
  #include <asm/uaccess.h>

  #include "internal.h"
 @@ -596,3 +597,24 @@ void *PDE_DATA(const struct inode *inode)
 return __PDE_DATA(inode);
  }
  EXPORT_SYMBOL(PDE_DATA);
 +
 +int pid_entry_access(struct file *filp, unsigned int mode)

pid_entry_may_ptrace, perhaps?

--Andy

