Re: [PATCH v3 0/7] File Sealing memfd_create()
On Fri, Jun 13, 2014 at 3:36 AM, David Herrmann dh.herrm...@gmail.com wrote:
> Hi
>
> This is v3 of the File-Sealing and memfd_create() patches. You can find
> v1 with a longer introduction at gmane:
>   http://thread.gmane.org/gmane.comp.video.dri.devel/102241
> An LWN article about memfd+sealing is available, too:
>   https://lwn.net/Articles/593918/
> v2 with some more discussions can be found here:
>   http://thread.gmane.org/gmane.linux.kernel.mm/115713
>
> This series introduces two new APIs:
>
> memfd_create(): Think of this syscall as malloc(), but it returns a
> file descriptor instead of a pointer. That file descriptor is backed by
> anonymous memory and can be memory-mapped for access.
>
> sealing: The sealing API can be used to prevent a specific set of
> operations on a file descriptor. You 'seal' the file and thus get the
> guarantee that it cannot be modified in those specific ways.
>
> A short high-level introduction is also available here:
>   http://dvdhrm.wordpress.com/2014/06/10/memfd_create2/

Potentially silly question: is it guaranteed that mmapping and reading a
SEAL_SHRINKed fd within size bounds will not SIGBUS? If so, should this
be documented?

(The particular issue here would be reading holes. It should work by
using the zero page, but, if so, we should probably make it a real,
documented guarantee.)

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 0/7] File Sealing memfd_create()
On Fri, Jun 13, 2014 at 8:15 AM, David Herrmann dh.herrm...@gmail.com wrote:
> Hi
>
> On Fri, Jun 13, 2014 at 5:10 PM, Andy Lutomirski l...@amacapital.net wrote:
>> On Fri, Jun 13, 2014 at 3:36 AM, David Herrmann dh.herrm...@gmail.com wrote:
>>> [series introduction snipped]
>>
>> Potentially silly question: is it guaranteed that mmapping and reading
>> a SEAL_SHRINKed fd within size bounds will not SIGBUS? If so, should
>> this be documented?
>>
>> (The particular issue here would be reading holes. It should work by
>> using the zero page, but, if so, we should probably make it a real,
>> documented guarantee.)
>
> No, this is not guaranteed. See the previous discussion in v2 on patch
> 2/4 between Hugh and me. The summary is: if you want mmap-reads to not
> fail, use mlock(). There are many situations where a fault might fail
> (think: OOM), and sealing is not meant to protect against that.
>
> Btw., holes are automatically filled with fresh pages by shmem. So a
> read only fails in OOM situations (or on memcg limits, etc.).

Isn't the point of SEAL_SHRINK to allow servers to mmap and read safely
without worrying about SIGBUS?
--Andy

> Thanks
> David

--
Andy Lutomirski
AMA Capital Management, LLC
Re: vdso feature requests from the Go people
On Thu, Jun 12, 2014 at 10:39 PM, H. Peter Anvin h...@zytor.com wrote:
> On 06/12/2014 09:36 PM, Andy Lutomirski wrote:
>> 1. Parsing the vDSO is a PITA. What if we bundled the reference parser
>> inside the vdso? Concretely, we could have AT_VDSO_FINDENTRY point to
>> a function like:
>>
>>   void *vdso_find_entry(const char *name, const char *version)
>>
>> Then things like Go and maybe even musl (and klibc?) could just call
>> that function. And we'd never have to worry about maintaining
>> compatibility with more and more weird vdso parsers.
>>
>> Implementing this could be as simple as shoving parse_vdso.c into the
>> vdso, although vdso2c could help and allow a really simple in-vdso
>> implementation.
>
> I'm not really sure how much of a win that is... you have to parse
> *something*, and for the vast majority of all implementations there
> will be a dynamic linker just sitting there, and that is what it
> *does*.

I'm only aware of two implementations that work like that: glibc and
musl. AFAIK neither one even tries to use the vdso when statically
linked. IIRC, Bionic doesn't support the vdso at all, and Go has the
present issue.

And ELF parsing is a giant mess. Currently the vdso doesn't use
DT_GNU_HASH (easy to fix), but no one can safely rely on DT_GNU_HASH
being there, and DT_GNU_HASH isn't actually easier to parse.

My point is that, for things that otherwise need to carry around a full
ELF loader, having a really easy, guaranteed-correct way to use the
vdso would be useful. I can see how much text size it would add. If the
parser were part of the vdso, it could probably be trimmed down a lot.
For example, there is currently exactly one version definition, and it
could be hard-coded. If someone were to add another version definition,
they could fix the parser at the same time. Basically, everything
except for the vdso_sym function in my parser could go.

>> 2. Go uses a segmented stack, and the vdso is quite unfriendly for
>> segmented stacks.
>>
>> If we can get compiler support, is there a reasonable way that we
>> could advertise the maximum stack usage of each vdso entry point?
>
> I suspect an easier way to do that would just be to define a maximum
> stack usage for *any* vdso entry point, and then enable the gcc stack
> depth warning (perhaps even with Werror)... we can do this now.

I can imagine this causing lots of pain when gcc 4.11 comes out with
some issue that blows up the stack usage. Or when akpm compiles on
Fedora Core 6 using some ancient toolchain that spills every local
variable three or four times and assigns every possible inline function
its own non-overlapping stack range.

My copy of gcc supports -fstack-usage, which seems like an easyish way
to obtain the information. I'm not entirely sure whether -fstack-usage
refers to the whole call tree or just to the specific function.

--Andy
Re: [RFC v3 7/7] shm: isolate pinned pages when sealing files
On Fri, Jun 13, 2014 at 8:27 AM, David Herrmann dh.herrm...@gmail.com wrote:
> Hi
>
> On Fri, Jun 13, 2014 at 5:06 PM, Andy Lutomirski l...@amacapital.net wrote:
>> On Fri, Jun 13, 2014 at 3:36 AM, David Herrmann dh.herrm...@gmail.com wrote:
>>> When setting SEAL_WRITE, we must make sure nobody has a writable
>>> reference to the pages (via GUP or similar). We currently check
>>> references and wait some time for them to be dropped. This, however,
>>> might fail for several reasons, including:
>>>
>>>  - the page is pinned for longer than we wait
>>>  - while we wait, someone takes an already pinned page for read-access
>>>
>>> Therefore, this patch introduces page-isolation. When sealing a file
>>> with SEAL_WRITE, we copy all pages that have an elevated ref-count.
>>> The newpage is put in place atomically, the old page is detached and
>>> left alone. It will get reclaimed once the last external user dropped
>>> it.
>>>
>>> Signed-off-by: David Herrmann dh.herrm...@gmail.com
>>
>> Won't this have unexpected effects?
>>
>> Thread 1: start read into mapping backed by fd
>> Thread 2: SEAL_WRITE
>> Thread 1: read finishes. now the page doesn't match the sealed page
>
> Just to be clear: you're talking about read() calls that write into the
> memfd? (like my FUSE example does) Your language might be ambiguous to
> others, as "read into" actually implies a write.
>
> No, this does not have unexpected effects. But yes, your conclusion is
> right. To be clear, this behavior would be part of the API. Any
> asynchronous write might be cut off by SEAL_WRITE _iff_ you unmap your
> buffer before the write finishes. But you actually have to extend your
> example:
>
>   Thread 1: p = mmap(memfd, SIZE);
>   Thread 1: h = async_read(some_fd, p, SIZE);
>   Thread 1: munmap(p, SIZE);
>   Thread 2: SEAL_WRITE
>   Thread 1: async_wait(h);
>
> If you don't do the unmap(), then SEAL_WRITE will fail due to an
> elevated i_mmap_writable. I think this is fine.
>
> In fact, I remember reading that async-IO is not required to resolve
> user-space addresses at the time of the syscall, but might delay that
> to the time of the actual write. But you're right, it would be
> misleading that the AIO operation returns success. This would be part
> of the memfd API, though. And if you mess with your address space
> while running an async-IO operation on it, you're screwed anyway.

Ok, I missed the part where you had to munmap to trigger the oddity.
That seems fine to me.

> Btw., your sealing use-case is really odd. No one guarantees that the
> SEAL_WRITE happens _after_ you schedule your async-read. In case you
> have some synchronization there, you just have to move it after
> waiting for your async-io to finish.
>
> Does that clear things up?

I think so.

--Andy
Re: [PATCH v3] x86,vdso: Fix vdso_install
On Fri, Jun 13, 2014 at 10:24 AM, H. Peter Anvin h...@zytor.com wrote:
> On 06/12/2014 10:01 AM, Josh Boyer wrote:
>> On Thu, Jun 12, 2014 at 11:28 AM, Andy Lutomirski l...@amacapital.net wrote:
>>> The filenames are different now. Inspired by a patch from Sam
>>> Ravnborg.
>>>
>>> Acked-by: Sam Ravnborg s...@ravnborg.org
>>> Reported-by: Josh Boyer jwbo...@fedoraproject.org
>>> Signed-off-by: Andy Lutomirski l...@amacapital.net
>>
>> Things look great with this version. Thanks much!
>>
>> Tested-by: Josh Boyer jwbo...@fedoraproject.org
>
> Are we okay with the 64-bit vdso now being installed as vdso64.so?

Unless someone actually complains, I think so. Ideally, I think we'd
have a symlink from the build id, but that's a separate issue. AFAIK,
the only thing that actually references these files is Fedora's gdb,
which is finding them from Fedora's build-id symlinks, which should
work fine regardless of what the thing is called.

How's this for a patch description:

    "make vdso_install" has been broken since this commit:

    commit 6f121e548f83674ab4920a4e60afb58d4f61b829
    Author: Andy Lutomirski l...@amacapital.net
    Date:   Mon May 5 12:19:34 2014 -0700

        x86, vdso: Reimplement vdso.so preparation in build-time C

    because the names of the files to be installed changed. This fixes
    vdso_install to install the right files.
Re: [PATCH] kbuild: Do not run modules_install and install in paralel
On Fri, Jun 13, 2014 at 2:45 AM, Michal Marek mma...@suse.cz wrote:
> On 13.6.2014 11:39, Michal Marek wrote:
>> Based on a x86-only patch by Andy Lutomirski l...@amacapital.net
>>
>> With modular kernels, 'make install' is going to need the installed
>> modules at some point to generate the initramfs.
>>
>> Signed-off-by: Michal Marek mma...@suse.cz
>> ---
>>  Makefile | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/Makefile b/Makefile
>> index 7680d7c..7e5e483 100644
>> --- a/Makefile
>> +++ b/Makefile
>> @@ -503,6 +503,12 @@ ifeq ($(KBUILD_EXTMOD),)
>>  endif
>>  endif
>>  endif
>> +# install and modules_install also need to be processed one by one
>> +ifneq ($(filter install,$(MAKECMDGOALS)),)
>> +ifneq ($(filter modules_install,$(MAKECMDGOALS)),)
>> +	mixed-targets := 1
>> +endif
>> +endif
>
> Note that this version does not enforce the ordering, it just avoids
> the interleaved execution. It can be added if desired.

Hmm. This will fix 'make modules_install install' but will not fix
'make install modules_install'. I don't know how many people would
type the latter.

--Andy
Re: [PATCH v6 6/9] seccomp: add seccomp syscall
On Tue, Jun 10, 2014 at 8:25 PM, Kees Cook keesc...@chromium.org wrote:
> This adds the new seccomp syscall with both an operation and flags
> parameter for future expansion. The third argument is a pointer value,
> used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
> be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).

Question for the linux-abi people: what's the preferred way to do this
these days? This syscall is a general-purpose "adjust the seccomp
state" thing. The alternative would be a specific new syscall to add a
filter with a flags argument.

--Andy
Re: [PATCH v6 6/9] seccomp: add seccomp syscall
On Fri, Jun 13, 2014 at 2:22 PM, Alexei Starovoitov a...@plumgrid.com wrote:
> On Tue, Jun 10, 2014 at 8:25 PM, Kees Cook keesc...@chromium.org wrote:
>> This adds the new seccomp syscall with both an operation and flags
>> parameter for future expansion. The third argument is a pointer
>> value, used with the SECCOMP_SET_MODE_FILTER operation. Currently,
>> flags must be 0. This is functionally equivalent to
>> prctl(PR_SET_SECCOMP, ...).
>>
>> Signed-off-by: Kees Cook keesc...@chromium.org
>> Cc: linux-...@vger.kernel.org
>> ---
>>  arch/x86/syscalls/syscall_32.tbl  |  1 +
>>  arch/x86/syscalls/syscall_64.tbl  |  1 +
>>  include/linux/syscalls.h          |  2 ++
>>  include/uapi/asm-generic/unistd.h |  4 ++-
>>  include/uapi/linux/seccomp.h      |  4 +++
>>  kernel/seccomp.c                  | 63 +++++++++++-
>>  kernel/sys_ni.c                   |  3 ++
>>  7 files changed, 69 insertions(+), 9 deletions(-)
>>
>> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
>> index d6b867921612..7527eac24122 100644
>> --- a/arch/x86/syscalls/syscall_32.tbl
>> +++ b/arch/x86/syscalls/syscall_32.tbl
>> @@ -360,3 +360,4 @@
>>  351	i386	sched_setattr		sys_sched_setattr
>>  352	i386	sched_getattr		sys_sched_getattr
>>  353	i386	renameat2		sys_renameat2
>> +354	i386	seccomp			sys_seccomp
>>
>> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
>> index ec255a1646d2..16272a6c12b7 100644
>> --- a/arch/x86/syscalls/syscall_64.tbl
>> +++ b/arch/x86/syscalls/syscall_64.tbl
>> @@ -323,6 +323,7 @@
>>  314	common	sched_setattr		sys_sched_setattr
>>  315	common	sched_getattr		sys_sched_getattr
>>  316	common	renameat2		sys_renameat2
>> +317	common	seccomp			sys_seccomp
>>  #
>>  # x32-specific system call numbers start at 512 to avoid cache impact
>>
>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>> index b0881a0ed322..1713977ee26f 100644
>> --- a/include/linux/syscalls.h
>> +++ b/include/linux/syscalls.h
>> @@ -866,4 +866,6 @@ asmlinkage long sys_process_vm_writev(pid_t pid,
>>  asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
>>  			 unsigned long idx1, unsigned long idx2);
>>  asmlinkage long sys_finit_module(int fd, const char __user *uargs, int flags);
>> +asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
>> +			    const char __user *uargs);
>
> It looks odd to add a 'flags' argument to a syscall where it is not
> even used. I don't think it will be extensible this way. 'uargs' is
> used only in the 2nd command as well, and it's not 'char __user *' but
> rather 'struct sock_fprog __user *'.
>
> I think it makes more sense to define only the first argument as
> 'int op' and the rest as a variable-length array. Something like:
>
>   long sys_seccomp(unsigned int op, struct nlattr *attrs, int len);
>
> then different commands can interpret 'attrs' differently:
>
>   if op == mode_strict, then attrs == NULL, len == 0
>   if op == mode_filter, then attrs->nla_type == seccomp_bpf_filter
>   and nla_data(attrs) is 'struct sock_fprog'

Eww. If the operation doesn't imply the type, then I think we've
totally screwed up.

> If we decide to add new types of filters or new commands, the syscall
> prototype won't need to change. New commands can be added preserving
> backward compatibility.
>
> The basic TLV concept has been around forever in the netlink world.
> imo it makes sense to use it with new syscalls. Passing 'struct xxx'
> into syscalls is a thing of the past. TLV style is more extensible.
> Fields of structures can become optional in the future, new fields
> added, etc. 'struct nlattr' brings the same benefits to the kernel
> API as protobuf did to user land.

I see no reason to bring nlattr into this. Admittedly, I've never dealt
with nlattr, but everything netlink-related I've ever been involved in
has involved some sort of API atrocity.

--Andy
Re: [PATCH v6 6/9] seccomp: add seccomp syscall
On Fri, Jun 13, 2014 at 2:37 PM, Alexei Starovoitov a...@plumgrid.com wrote:
> On Fri, Jun 13, 2014 at 2:25 PM, Andy Lutomirski l...@amacapital.net wrote:
>> On Fri, Jun 13, 2014 at 2:22 PM, Alexei Starovoitov a...@plumgrid.com wrote:
>>> On Tue, Jun 10, 2014 at 8:25 PM, Kees Cook keesc...@chromium.org wrote:
>>>> This adds the new seccomp syscall with both an operation and flags
>>>> parameter for future expansion. [patch snipped]
>>>
>>> It looks odd to add a 'flags' argument to a syscall where it is not
>>> even used. [...] I think it makes more sense to define only the
>>> first argument as 'int op' and the rest as a variable-length array.
>>> Something like:
>>>
>>>   long sys_seccomp(unsigned int op, struct nlattr *attrs, int len);
>>
>> Eww. If the operation doesn't imply the type, then I think we've
>> totally screwed up.
>>
>> [...]
>>
>> I see no reason to bring nlattr into this. Admittedly, I've never
>> dealt with nlattr, but everything netlink-related I've ever been
>> involved in has involved some sort of API atrocity.
>
> netlink has a lot of legacy, and there is genetlink, which is not
> pretty either because of extra socket creation, binding, and dealing
> with packet-loss issues, but the key concept of variable-length
> encoding is sound.
>
> Right now seccomp has two commands and they already don't fit into a
> single syscall neatly. Are you saying there should be two syscalls
> here? What about another seccomp-related command? Another syscall?
> imo all seccomp-related commands need to be mux/demux-ed under one
> syscall. What is the way to mux/demux potentially very different
> commands under one syscall? I cannot think of anything better than
> TLV style. 'struct nlattr' is what we have today and I think it works
> fine. I'm not suggesting to bring the whole of netlink into the
> picture, but rather the TLV style of encoding different arguments for
> different commands.

I'm unconvinced. These are simple
[PATCH 4/4] x86,mm: Replace arch_vma_name with vm_ops-name for vsyscalls
arch_vma_name is now completely gone from x86. Good riddance.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 arch/x86/mm/init_64.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 9deb59b..bdcde58 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1185,11 +1185,19 @@ int kern_addr_valid(unsigned long addr)
  * covers the 64bit vsyscall page now. 32bit has a real VMA now and does
  * not need special handling anymore:
  */
+static const char *gate_vma_name(struct vm_area_struct *vma)
+{
+	return "[vsyscall]";
+}
+static struct vm_operations_struct gate_vma_ops = {
+	.name = gate_vma_name,
+};
 static struct vm_area_struct gate_vma = {
 	.vm_start	= VSYSCALL_ADDR,
 	.vm_end		= VSYSCALL_ADDR + PAGE_SIZE,
 	.vm_page_prot	= PAGE_READONLY_EXEC,
-	.vm_flags	= VM_READ | VM_EXEC
+	.vm_flags	= VM_READ | VM_EXEC,
+	.vm_ops		= &gate_vma_ops,
 };

 struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
@@ -1221,13 +1229,6 @@ int in_gate_area_no_mm(unsigned long addr)
 	return (addr & PAGE_MASK) == VSYSCALL_ADDR;
 }

-const char *arch_vma_name(struct vm_area_struct *vma)
-{
-	if (vma == gate_vma)
-		return "[vsyscall]";
-	return NULL;
-}
-
 #ifdef CONFIG_X86_UV
 unsigned long memory_block_size_bytes(void)
 {
--
1.9.0
[PATCH 3/4] x86,mm: Improve _install_special_mapping and fix x86 vdso naming
Using arch_vma_name to give special mappings a name is awkward. x86
currently implements it by comparing the start address of the vma to
the expected address of the vdso. This requires tracking the start
address of special mappings and is probably buggy if a special vma is
split or moved.

Improve _install_special_mapping to just name the vma directly. Use it
to give the x86 vvar area a name, which should make CRIU's life easier.

As a side effect, the vvar area will show up in core dumps. This could
be considered weird and is fixable.

Thoughts?

Cc: Cyrill Gorcunov gorcu...@openvz.org
Cc: Pavel Emelyanov xe...@parallels.com
Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 arch/x86/include/asm/vdso.h  |  6 ++-
 arch/x86/mm/init_64.c        |  3 --
 arch/x86/vdso/vdso2c.h       |  5 ++-
 arch/x86/vdso/vdso32-setup.c |  7 ----
 arch/x86/vdso/vma.c          | 25 ++++-
 include/linux/mm.h           |  4 +-
 include/linux/mm_types.h     |  6 +++
 mm/mmap.c                    | 89 +---
 8 files changed, 94 insertions(+), 51 deletions(-)

diff --git a/arch/x86/include/asm/vdso.h b/arch/x86/include/asm/vdso.h
index d0a2c90..30be253 100644
--- a/arch/x86/include/asm/vdso.h
+++ b/arch/x86/include/asm/vdso.h
@@ -7,10 +7,14 @@

 #ifndef __ASSEMBLER__

+#include <linux/mm_types.h>
+
 struct vdso_image {
 	void *data;
 	unsigned long size;   /* Always a multiple of PAGE_SIZE */
-	struct page **pages;  /* Big enough for data/size page pointers */
+
+	/* text_mapping.pages is big enough for data/size page pointers */
+	struct vm_special_mapping text_mapping;

 	unsigned long alt, alt_len;

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 6f88184..9deb59b 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1223,9 +1223,6 @@ int in_gate_area_no_mm(unsigned long addr)
 const char *arch_vma_name(struct vm_area_struct *vma)
 {
-	if (vma->vm_mm && vma->vm_start ==
-	    (long __force)vma->vm_mm->context.vdso)
-		return "[vdso]";
 	if (vma == gate_vma)
 		return "[vsyscall]";
 	return NULL;

diff --git a/arch/x86/vdso/vdso2c.h b/arch/x86/vdso/vdso2c.h
index ed2e894..3dcc61e 100644
--- a/arch/x86/vdso/vdso2c.h
+++ b/arch/x86/vdso/vdso2c.h
@@ -136,7 +136,10 @@ static int GOFUNC(void *addr, size_t len, FILE *outfile, const char *name)
 	fprintf(outfile, "const struct vdso_image %s = {\n", name);
 	fprintf(outfile, "\t.data = raw_data,\n");
 	fprintf(outfile, "\t.size = %lu,\n", data_size);
-	fprintf(outfile, "\t.pages = pages,\n");
+	fprintf(outfile, "\t.text_mapping = {\n");
+	fprintf(outfile, "\t\t.name = \"[vdso]\",\n");
+	fprintf(outfile, "\t\t.pages = pages,\n");
+	fprintf(outfile, "\t},\n");
 	if (alt_sec) {
 		fprintf(outfile, "\t.alt = %lu,\n",
 			(unsigned long)alt_sec->sh_offset);

diff --git a/arch/x86/vdso/vdso32-setup.c b/arch/x86/vdso/vdso32-setup.c
index c3ed708..e4f7781 100644
--- a/arch/x86/vdso/vdso32-setup.c
+++ b/arch/x86/vdso/vdso32-setup.c
@@ -119,13 +119,6 @@ __initcall(ia32_binfmt_init);

 #else /* CONFIG_X86_32 */

-const char *arch_vma_name(struct vm_area_struct *vma)
-{
-	if (vma->vm_mm && vma->vm_start == (long)vma->vm_mm->context.vdso)
-		return "[vdso]";
-	return NULL;
-}
-
 struct vm_area_struct *get_gate_vma(struct mm_struct *mm)
 {
 	return NULL;

diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index 8ad0081..e1513c4 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -30,7 +30,8 @@ void __init init_vdso_image(const struct vdso_image *image)
 	BUG_ON(image->size % PAGE_SIZE != 0);
 	for (i = 0; i < npages; i++)
-		image->pages[i] = virt_to_page(image->data + i*PAGE_SIZE);
+		image->text_mapping.pages[i] =
+			virt_to_page(image->data + i*PAGE_SIZE);

 	apply_alternatives((struct alt_instr *)(image->data + image->alt),
 			   (struct alt_instr *)(image->data + image->alt +
@@ -91,6 +92,10 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
 	unsigned long addr;
 	int ret = 0;
 	static struct page *no_pages[] = {NULL};
+	static struct vm_special_mapping vvar_mapping = {
+		.name = "[vvar]",
+		.pages = no_pages,
+	};

 	if (calculate_addr) {
 		addr = vdso_addr(current->mm->start_stack,
@@ -112,21 +117,23 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
 	/*
 	 * MAYWRITE to allow gdb to COW and set breakpoints
 	 */
-	ret = install_special_mapping(mm,
-				      addr,
-				      image->size,
-				      VM_READ|VM_EXEC|
-				      VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC
[PATCH 1/4] x86,vdso: Fix an OOPS accessing the hpet mapping w/o an hpet
The oops can be triggered in qemu using -no-hpet (but not nohpet) by
reading a couple of pages past the end of the vdso text. This should
send SIGBUS instead of OOPSing.

The bug was introduced by:

    commit 7a59ed415f5b57469e22e41fc4188d5399e0b194
    Author: Stefani Seibold stef...@seibold.net
    Date:   Mon Mar 17 23:22:09 2014 +0100

        x86, vdso: Add 32 bit VDSO time support for 32 bit kernel

which is new in 3.15. This will be fixed separately in 3.15, but that
patch will not apply to tip/x86/vdso. This is the equivalent fix for
tip/x86/vdso and, presumably, 3.16.

Cc: Stefani Seibold stef...@seibold.net
Reported-by: Sasha Levin sasha.le...@oracle.com
Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 arch/x86/vdso/vma.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index e915eae..8ad0081 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -90,6 +90,7 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
 	struct vm_area_struct *vma;
 	unsigned long addr;
 	int ret = 0;
+	static struct page *no_pages[] = {NULL};

 	if (calculate_addr) {
 		addr = vdso_addr(current->mm->start_stack,
@@ -125,7 +126,7 @@ static int map_vdso(const struct vdso_image *image, bool calculate_addr)
 				      addr + image->size,
 				      image->sym_end_mapping - image->size,
 				      VM_READ,
-				      NULL);
+				      no_pages);

 	if (IS_ERR(vma)) {
 		ret = PTR_ERR(vma);
--
1.9.0
[PATCH 0/4] x86,mm: vdso fixes for an OOPS and /proc/PID/maps
[This applies to tip/x86/vdso. Patch 1/4 is a resend.]

This fixes an OOPS on systems without an HPET and incomplete
information in /proc/PID/maps. The latter is done by adding a new
vm_ops callback to replace arch_vma_name, which is inflexible and
awkward to use correctly.

With this series applied, calling mremap on the vdso results in
sensible output in /proc/PID/maps and the vvar area shows up correctly.
I don't want to guarantee that mremap on the vdso will do anything
sensible right now, but that's unchanged from before. In fact, I
suspect that mremapping the vdso in 32-bit tasks is rather broken right
now due to sigreturn.

In current kernels, mremapping the vdso blows away the name:

badc0de-badc0de2000 r-xp 00:00 0

Now it doesn't:

badc0de-badc0de1000 r-xp 00:00 0                          [vdso]

As a followup, it might pay to replace install_special_mapping with a
new install_vdso_mapping function that hardcodes the [vdso] name, to
separately fix all the other arch_vma_name users (maybe just ARM?) and
then kill arch_vma_name completely.

NB: This touches core mm code. I'd appreciate some review by the mm
folks.

Andy Lutomirski (4):
  x86,vdso: Fix an OOPS accessing the hpet mapping w/o an hpet
  mm,fs: Add vm_ops->name as an alternative to arch_vma_name
  x86,mm: Improve _install_special_mapping and fix x86 vdso naming
  x86,mm: Replace arch_vma_name with vm_ops->name for vsyscalls

 arch/x86/include/asm/vdso.h  |  6 ++-
 arch/x86/mm/init_64.c        | 20 +-
 arch/x86/vdso/vdso2c.h       |  5 ++-
 arch/x86/vdso/vdso32-setup.c |  7 ----
 arch/x86/vdso/vma.c          | 26 -
 fs/binfmt_elf.c              |  8 ++++
 fs/proc/task_mmu.c           |  6 +++
 include/linux/mm.h           | 10 -
 include/linux/mm_types.h     |  6 +++
 mm/mmap.c                    | 89 +---
 10 files changed, 124 insertions(+), 59 deletions(-)

--
1.9.0
[PATCH 2/4] mm,fs: Add vm_ops-name as an alternative to arch_vma_name
arch_vma_name sucks. It's a silly hack, and it's annoying to implement
correctly. In fact, AFAICS, even the straightforward x86 implementation
is incorrect (I suspect that it breaks if the vdso mapping is split or
gets remapped).

This adds a new vm_ops->name operation that can replace it. The
followup patches will remove all uses of arch_vma_name on x86, fixing a
couple of annoyances in the process.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 fs/binfmt_elf.c    | 8 ++++++++
 fs/proc/task_mmu.c | 6 ++++++
 include/linux/mm.h | 6 ++++++
 3 files changed, 20 insertions(+)

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index aa3cb62..df9ea41 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -1108,6 +1108,14 @@ static bool always_dump_vma(struct vm_area_struct *vma)
 	/* Any vsyscall mappings? */
 	if (vma == get_gate_vma(vma->vm_mm))
 		return true;
+
+	/*
+	 * Assume that all vmas with a .name op should always be dumped.
+	 * If this changes, a new vm_ops field can easily be added.
+	 */
+	if (vma->vm_ops && vma->vm_ops->name && vma->vm_ops->name(vma))
+		return true;
+
 	/*
 	 * arch_vma_name() returns non-NULL for special architecture mappings,
 	 * such as vDSO sections.
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 442177b..9b2f5d6 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -300,6 +300,12 @@ show_map_vma(struct seq_file *m, struct vm_area_struct *vma, int is_pid)
 		goto done;
 	}

+	if (vma->vm_ops && vma->vm_ops->name) {
+		name = vma->vm_ops->name(vma);
+		if (name)
+			goto done;
+	}
+
 	name = arch_vma_name(vma);
 	if (!name) {
 		pid_t tid;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index bf9811e..63f8d4e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -239,6 +239,12 @@ struct vm_operations_struct {
 	 */
 	int (*access)(struct vm_area_struct *vma, unsigned long addr,
 		      void *buf, int len, int write);
+
+	/* Called by the /proc/PID/maps code to ask the vma whether it
+	 * has a special name.  Returning non-NULL will also cause this
+	 * vma to be dumped unconditionally. */
+	const char *(*name)(struct vm_area_struct *vma);
+
 #ifdef CONFIG_NUMA
 	/*
 	 * set_policy() op must add a reference to any non-NULL @new mempolicy
--
1.9.0
Re: [PATCH 3/4] x86,mm: Improve _install_special_mapping and fix x86 vdso naming
On Tue, May 20, 2014 at 10:21 AM, Cyrill Gorcunov gorcu...@gmail.com wrote:

On Mon, May 19, 2014 at 03:58:33PM -0700, Andy Lutomirski wrote:

Using arch_vma_name to give special mappings a name is awkward. x86 currently implements it by comparing the start address of the vma to the expected address of the vdso. This requires tracking the start address of special mappings and is probably buggy if a special vma is split or moved.

Improve _install_special_mapping to just name the vma directly. Use it to give the x86 vvar area a name, which should make CRIU's life easier. As a side effect, the vvar area will show up in core dumps. This could be considered weird and is fixable. Thoughts?

Cc: Cyrill Gorcunov gorcu...@openvz.org
Cc: Pavel Emelyanov xe...@parallels.com
Signed-off-by: Andy Lutomirski l...@amacapital.net

Hi Andy, thanks a lot for this! I must confess I don't yet know how we would deal with compat tasks, but this is the 'must have' mark which allows us to detect the vvar area!

Out of curiosity, how does CRIU currently handle checkpointing a restored task? In current kernels, the [vdso] name in maps goes away after mremapping the vdso.

I suspect that you'll need kernel changes for compat tasks, since I think that mremapping the vdso on any reasonably modern hardware in a 32-bit task will cause sigreturn to blow up. This could be fixed by making mremap magical, although adding a new prctl or arch_prctl to reliably move the vdso might be a better bet.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC
Re: [PATCH 3/4] x86,mm: Improve _install_special_mapping and fix x86 vdso naming
On Tue, May 20, 2014 at 10:47 AM, Cyrill Gorcunov gorcu...@gmail.com wrote:

On Tue, May 20, 2014 at 10:24:49AM -0700, Andy Lutomirski wrote:

[earlier quoted patch description trimmed]

Out of curiosity, how does CRIU currently handle checkpointing a restored task? In current kernels, the [vdso] name in maps goes away after mremapping the vdso.

We use not only the [vdso] mark to detect the vdso area but also the page frame number of the living vdso. If the mark is not present in the procfs output, we examine executable areas and check if pfn == vdso_pfn. It's a slow path, because there might be a bunch of executable areas and touching every one of them is not that fast, but we simply have no choice.

This patch should fix this issue, at least.
If there's still a way to get a native vdso that doesn't say [vdso], please let me know.

The situation gets worse when a task was dumped on one kernel and then restored on another kernel where the vdso content is different from the one saved in the image -- in such a case, as I mentioned, we need that named vdso proxy which redirects calls to the vdso of the system where the task is restoring. And when such a restored task gets checkpointed a second time, we don't dump the new living vdso but save only the old vdso proxy on disk (detecting it is a different story; in short, we inject a unique mark into the elf header).

Yuck. But I don't know whether the kernel can help much here.

I suspect that you'll need kernel changes for compat tasks, since I think that mremapping the vdso on any reasonably modern hardware in a 32-bit task will cause sigreturn to blow up. This could be fixed by making mremap magical, although adding a new prctl or arch_prctl to reliably move the vdso might be a better bet.

Well, as far as I understand, compat code uses absolute addressing for vvar data, and if the vvar data position doesn't change we're safe, but at the same time, because vvar addresses are not ABI, I fear one day we will indeed hit problems and the only solution would be to use the kernel's help. But again, Andy, I didn't think much about implementing compat mode in criu yet, so I might be missing some details.

Prior to 3.15, the compat code didn't have vvar data at all. In 3.15 and up, the vvar data is accessed using PC-relative addressing, even in compat mode (using the usual call; mov trick to read EIP).

--Andy
Re: [PATCH 3/4] x86,mm: Improve _install_special_mapping and fix x86 vdso naming
On Tue, May 20, 2014 at 11:18 AM, H. Peter Anvin h...@zytor.com wrote:

On 05/20/2014 11:01 AM, Cyrill Gorcunov wrote:

This patch should fix this issue, at least. If there's still a way to get a native vdso that doesn't say [vdso], please let me know.

Yes, having a native procfs way to detect the vdso is much preferred!

Is there any path by which we can end up with [vdso] without a leading slash in /proc/self/maps? Otherwise, why is that not native?

Dunno. But before this patch the reverse was possible: we can end up with a vdso that doesn't say [vdso].

[quoted discussion of the vdso proxy trimmed]

Yuck. But I don't know whether the kernel can help much here.

Some prctl which would tell the kernel to put the vdso at a specified address. We can live without it for now, so not a big deal (yet ;)

mremap() will do this for you. Except that it's buggy: it doesn't change mm->context.vdso. For 64-bit tasks, the only consumer outside exec was arch_vma_name, and this patch removes even that. For 32-bit tasks, though, it's needed for signal delivery.

--Andy
Re: [PATCH 3/4] x86,mm: Improve _install_special_mapping and fix x86 vdso naming
On Tue, May 20, 2014 at 11:27 AM, H. Peter Anvin h...@zytor.com wrote:

On 05/20/2014 11:24 AM, Andy Lutomirski wrote:

[earlier quoted discussion trimmed]

Dunno. But before this patch the reverse was possible: we can end up with a vdso that doesn't say [vdso].

That's a bug, which is being fixed. We can't go back in time and create new interfaces on old kernels.

[...]

mremap() will do this for you. Except that it's buggy: it doesn't change mm->context.vdso. For 64-bit tasks, the only consumer outside exec was arch_vma_name, and this patch removes even that. For 32-bit tasks, though, it's needed for signal delivery.

Again, a bug, let's fix it rather than saying we need a new interface.

What happens if someone remaps just part of the vdso?
Presumably we'd just track the position of the first page of the vdso, but this might be hard to implement: I don't think there's any callback from the core mm code for this.

--Andy
[RFC] x86_64: A real proposal for iret-less return to kernel
Here's a real proposal for iret-less return. If this is correct, then NMIs will never nest, which will probably delete a lot more scariness than is added by the code I'm describing. The rest of this email is valid markdown :) If I end up implementing this, this text will go straight into Documentation/x86/x86_64.

tl;dr: The only particularly tricky cases are exit from #DB, #BP, and #MC. I think they're not so bad, though. FWIW, if there's a way to read the NMI masking bit, this would be a lot simpler. I don't know of any way to do that, though.

`IRET`-less return
==================

There are at least two ways that we can return from a trap entry: `IRET` and `RET`. They have a few important differences.

 * `IRET` is very slow on all current (2014) CPUs -- it seems to take hundreds of cycles. `RET` is fast.
 * `IRET` unconditionally unmasks NMIs. `RET` never unmasks NMIs.
 * `IRET` can change `CS`, `RSP`, `SS`, `RIP`, and `RFLAGS` atomically. `RET` can't; it requires a return address on the stack, and it can't apply anything other than a small offset to the stack pointer. It can, in theory, change `CS`, but this seems unlikely to be helpful.

Times when we must use `IRET`
=============================

 * If we're returning to a different `CS` (i.e. if firmware is doing something funny or if we're returning to userspace), then `RET` won't help; we need to use `IRET` unless we're willing to play fragile games with `SYSEXIT` or `SYSRET`.
 * If we are changing stacks, then we need to be extremely careful about using `RET`: using `RET` requires that we put the target `RIP` on the target stack, so the target stack must be valid. This means that we cannot use `RET` if, for example, a `SYSCALL` just happened.
 * If we're returning from NMI, `IRET` is mandatory: we need to unmask NMIs, and `IRET` is the only way to do that.

Note that, if `RFLAGS.IF` is set, then interrupts were enabled when we trapped, so `RET` is safe.
Times when we must use `RET`
============================

If there's an NMI on the stack, we must use `RET` until we're ready to re-enable NMIs.

Assumptions
===========

 * Neither the NMI, the MCE handler, nor anything that nests inside them will ever change `CS` or run with an invalid stack.
 * Interrupts will never be enabled with an NMI on the stack.
 * We explicitly do not assume that we can reliably determine whether we were on user `GS` or kernel `GS` when a trap happens. In current (3.15) kernels we can tell, but if we ever enable `WRGSBASE` then we will lose that ability.
 * The IST interrupts are: #DB, #BP, #NM, #DF, #SS, and #MC.
 * We can add a per-cpu variable `nmi_mce_nest_count` that is nonzero whenever an NMI or MCE is on the stack. We'll increment it at the very beginning of the NMI handler and clear it at the very end. We will also increment it in `do_machine_check` before doing anything that can cause an interrupt. The result is that the only interrupt that can happen with `nmi_mce_nest_count == 0` in NMI context is an MCE at the beginning or end of the NMI handler.

The algorithm
=============

1. If the target `CS` is not the standard 64-bit kernel CPL0 selector, then never use `RET`. This is safe: this will never happen with an NMI on the stack.
2. If we are returning from a non-IST interrupt, then use `RET`. Non-IST interrupts use the interrupted code's stack, so the stack is always valid.
3. If we are returning from #NM, then use `IRET`.
4. If we are returning from #DF or #SS, then use `IRET`. These interrupts cannot occur inside an NMI, or, at the very least, if they do happen, then they are not recoverable.
5. If we are returning from #DB or #BP, then use `RET` if `nmi_mce_nest_count != 0` and `IRET` otherwise.
6. If we are returning from #MC, use `IRET`, unless the return address is to the NMI entry or exit code, in which case we use `RET`.
--Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Tue, May 20, 2014 at 7:27 PM, Steven Rostedt rost...@goodmis.org wrote:

On Tue, 2014-05-20 at 17:53 -0700, Andy Lutomirski wrote:

If there's an NMI on the stack, we must use `RET` until we're ready to re-enable NMIs.

I'm a little confused by NMI on the stack. Do you mean NMI on the target stack? If so, please state that.

I mean that we're in an NMI handler or in anything nested inside it.

[quoted description of `nmi_mce_nest_count` trimmed]

Just note that this will probably be done in the C code, as NMI has issues with gs being safe. Also, should we call it nmi specifically? Perhaps ist_stack_nest_count, stating that the stack is ist, to match do_machine_check as well? Maybe that's not a good name either. Someone else can come up with something that's a little more generic than NMI?

So the issue here is that we can have an NMI followed immediately by an MCE. The MCE code can call force_sig, which could plausibly result in a kprobe or something similar happening. The return from that needs to use IRET. Since I don't see a clean way to reliably detect that we're inside an NMI, I propose instead detecting when we're in *either* NMI or MCE, hence the name.

As long as we mark do_machine_check and whatever asm code calls it __kprobes, I think we'll be okay.

--Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On May 21, 2014 2:46 AM, Borislav Petkov b...@alien8.de wrote:

On Tue, May 20, 2014 at 07:39:31PM -0700, Andy Lutomirski wrote:

So the issue here is that we can have an NMI followed immediately by an MCE.

That part might need clarification for me: #MC is a higher prio interrupt than NMI, so a machine check exception can interrupt the NMI handler at any point.

Except that NMI can interrupt #MC at any point as well, I think.

But you're talking only about the small window when nmi_mce_nest_count hasn't been incremented yet, right? I.e., this:

The result is that the only interrupt that can happen with `nmi_mce_nest_count == 0` in NMI context is an MCE at the beginning or end of the NMI handler.

Correct?

Exactly.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
Re: 64bit x86: NMI nesting still buggy?
On May 21, 2014 7:58 AM, Vojtech Pavlik vojt...@suse.cz wrote:

On Wed, May 21, 2014 at 04:20:55PM +0200, Borislav Petkov wrote:

On Wed, May 21, 2014 at 03:42:24PM +0200, Jiri Kosina wrote:

Alright, Andy's iret optimization efforts do immediately bring a followup question -- why is this not a problem with iret-based return from #MC possibly interrupting NMI?

Yeah, and frankly, I don't see this nesting fun at all protected against a #MC interrupting it at any point actually. Because once the #MC handler returns, it goes into paranoid_exit and that place doesn't account for NMIs at all, AFAICS. Which would mean:

 * NMI goes off
 * MCE happens, we switch to machine_check which is paranoidzeroentry
 * #MC handler is done - paranoid_exit - IRET - boom!

Or if not boom, at least, the NMI gets forgotten. Am I missing something?

I think to get a full BOOM you need a bit more complex process, namely:

 * NMI triggered
 * NMI handler starts
 * MCE happens
 * Second NMI triggered and queued
 * handler done, IRET
 * Second NMI handler starts and overwrites NMI return address on stack
 * Second NMI handler ends
 * First NMI handler ends and goes into an infinite IRET loop, always returning to the beginning of itself

But you do have all the ingredients. And I don't see any other way out than not calling IRET for MCEs.

The MCE handler could detect this and fiddle with the IST entry. This sounds considerably uglier than returning via RET, though.

--Andy

--
Vojtech Pavlik
Director SUSE Labs
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On May 21, 2014 5:51 AM, Jiri Kosina jkos...@suse.cz wrote:

On Tue, 20 May 2014, Andy Lutomirski wrote:

So the issue here is that we can have an NMI followed immediately by an MCE. The MCE code can call force_sig

This is interesting by itself. force_sig() takes the siglock spinlock. This really looks like a deadlock sitting there waiting to happen.

ISTM the do_machine_check code ought to consider any kill-worthy MCE from kernel space to be non-recoverable, but I want to keep the scope of these patches under control.

That being said, if an MCE that came from CPL0 never tried to return, this would be simpler. I don't know enough about the machine check architecture to know whether that's a reasonable thing to do.

--Andy

--
Jiri Kosina
SUSE Labs
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 9:30 AM, Borislav Petkov b...@alien8.de wrote: On Wed, May 21, 2014 at 08:21:08AM -0700, Andy Lutomirski wrote: On May 21, 2014 2:46 AM, Borislav Petkov b...@alien8.de wrote: On Tue, May 20, 2014 at 07:39:31PM -0700, Andy Lutomirski wrote: So the issue here is that we can have an NMI followed immediately by an MCE. That part might need clarification for me: #MC is higher prio interrupt than NMI so a machine check exception can interrupt the NMI handler at any point. Except that NMI can interrupt #MC at any point as well, I think. No, #MC is higher prio than NMI, actually even the highest along with RESET#. And come to think of it, all exceptions which have a higher prio than NMI should touch that nmi_mce_nest_count thing. See Table 8-8 here: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2012/10/24593_APM_v21.pdf That's the prios before 3, i.e. the NMI one. HOWEVER, this all is spoken with the assumption that higher prio interrupts can interrupt the NMI handler too at the first instruction boundary they've been recognized. The text is talking about simultaneous interrupts and not about interrupt handler preemption. But it must be because Steve wouldn't be dealing with exceptions in the NMI handler and nested NMIs otherwise... I think that some of these exceptions are synchronous things (e.g. int3 or page faults) that happen because the kernel caused them. Anyway, going through the list: Reset, INIT, and stpclk ought to be irrelevant -- we don't handle them anyway. SMI is already supposedly correct wrt nesting inside NMI. Debug register stuff should be handled in my outline. Hopefully correctly :) We need to make sure that no breakpoints trip before the nmi count is incremented, but that should be straightforward as long as we don't do ridiculous things like poking at userspace addresses. I don't know how kgdb/kdb fits in -- if someone sets a watchpoint on a kernel address (e.g. 
the nesting count) or enables single-stepping, we'll mess up. It may pay to bump the nesting count inside the #DB and #BP handlers and to check the RIP that we're returning to, but that starts to look ugly, and we have to be careful about NMI, immediate breakpoint, and then immediate MCE. I'd rather just be able to say that there are some very short windows in which a debug or breakpoint exception will never happen.

--Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Tue, May 20, 2014 at 5:53 PM, Andy Lutomirski l...@amacapital.net wrote:

Here's a real proposal for iret-less return. If this is correct, then NMIs will never nest, which will probably delete a lot more scariness than is added by the code I'm describing.

OK, here's a case where I'm wrong. An NMI interrupts userspace on a 16-bit stack. The return from NMI goes through the espfix code. Something interrupts while on the espfix stack. Boom! Neither return style is particularly good.

More generally, if we got interrupted while on the espfix stack, we need to return back there using IRET. Fortunately, re-enabling NMIs there is harmless, since we've already switched off the NMI stack.

This makes me think that maybe the logic should be turned around: have some RIP ranges on which the kernel stack might be invalid (which includes the espfix code and some of the syscall code) and use IRET only on return from NMI, return to nonstandard CS, and return to these special ranges. The NMI code just needs to never do any of this stuff unless it switches off the NMI stack first.

For this to work reliably, we'll probably have to change CS before calling into EFI code. That should be straightforward.

--Andy
Re: [PATCH 3.15] MIPS: Add new AUDIT_ARCH token for the N32 ABI on MIPS64
On Wed, May 21, 2014 at 1:59 PM, Paul Moore pmo...@redhat.com wrote:

On Monday, May 12, 2014 02:53:05 PM Paul Moore wrote:

On Tuesday, April 22, 2014 03:40:36 PM Markos Chandras wrote:

A MIPS64 kernel may support ELF files for all 3 MIPS ABIs (O32, N32, N64). Furthermore, the AUDIT_ARCH_MIPS{,EL}64 token does not provide enough information about the ABI for the 64-bit process. As a result, userland needs to use complex seccomp filters to decide whether a syscall belongs to the o32, n32, or n64 ABI. Therefore, a new arch token for MIPS64/n32 is added so it can be used by seccomp to explicitly set syscall filters for this ABI.

Link: http://sourceforge.net/p/libseccomp/mailman/message/32239040/
Cc: Andy Lutomirski l...@amacapital.net
Cc: Eric Paris epa...@redhat.com
Cc: Paul Moore pmo...@redhat.com
Cc: Ralf Baechle r...@linux-mips.org
Signed-off-by: Markos Chandras markos.chand...@imgtec.com
---
Ralf, can we please have this in 3.15 (assuming it's ACK'd)? Thanks a lot!
---
 arch/mips/include/asm/syscall.h |  2 ++
 include/uapi/linux/audit.h      | 12 ++++++++++++
 2 files changed, 14 insertions(+)

[NOTE: Adding lkml to the To line to hopefully spur discussion/acceptance, as this *really* should be in 3.15]

I'm re-replying to this patch and adding lkml to the To line because I believe it is very important we get this patch into 3.15. For those who don't follow the MIPS architecture very closely, the upcoming 3.15 is the first release to include support for seccomp filters, the latest generation of syscall filtering, which uses a BPF-based filter language. For reasons that are easy to understand, the syscall filters are ABI specific (e.g. syscall tables, word length, endianness), and those generating syscall filters in userspace (e.g. libseccomp) need to take great care to ensure that the generated filters take the ABI into account and fail safely in the case where a different ABI is used (e.g. x86, x86_64, x32).
The patch below corrects what is, IMHO, an omission in the original MIPS seccomp filter patch, allowing userspace to easily separate MIPS and MIPS64. Without this patch we will be forced to handle MIPS/MIPS64 like we handle x86_64/x32, which is a royal pain and not something I want to have to deal with again. Further, while I don't want to speak for the audit folks, it is my understanding that they want this patch for similar reasons.

Please merge this patch for 3.15, or at least provide some feedback as to why this isn't a viable solution for upstream. Once 3.15 ships, fixing this will require breaking the MIPS ABI, which isn't something any of us want.

Thanks,
-Paul

*Bump* I don't know what else needs to be done to get some action on this and we're running out of time for 3.15.

Reply to Linus' next -rc email.

--Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 2:25 PM, Jiri Kosina jkos...@suse.cz wrote:

On Wed, 21 May 2014, Borislav Petkov wrote:

ISTM the do_machine_check code ought to consider any kill-worthy MCE from kernel space to be non-recoverable, but I want to keep the scope of these patches under control.

MCA has a bit called RIPV which, if set, signals that RIP is valid and it is safe to return, provided we've taken proper care of handling even non-correctable errors (memory poisoning, etc).

Yeah, but it tries to send SIGBUS from MCE context. And if the MCE triggered at the time the CPU was already holding sighand->siglock for that particular task, it'll deadlock against itself.

If RIPV is set but we interrupted *kernel* code, SIGBUS doesn't seem like the right solution anyway.

Are there any machine check exceptions for which it makes sense to continue right where we left off without a signal? Is CMIC such a beast? Can CMIC be delivered when interrupts are off?

--Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 2:45 PM, H. Peter Anvin h...@zytor.com wrote:

Adding Tony.

On 05/21/2014 02:43 PM, Borislav Petkov wrote:

On Thu, May 22, 2014 at 06:37:26AM +0900, Linus Torvalds wrote:

Seriously. If an NMI is interrupted by an MCE, you might as well consider the machine dead. Don't worry about it. We may or may not recover, but it is *not* our problem.

I certainly like this way of handling it. We can even issue a nice banner saying something like You're f*cked - go change hw.

Actually, it would be a lot better to panic than deadlock (HA systems tend to have something in place to catch the panic and/or reboot). Any way we can see if the CPU is already holding that lock and panic in that case?

Is there anything actually wrong with just panicking if !user_mode_vm(regs)? That would make this a lot more sane.

--Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 2:48 PM, Borislav Petkov b...@alien8.de wrote:

On Wed, May 21, 2014 at 02:35:59PM -0700, Andy Lutomirski wrote:

If RIPV is set but we interrupted *kernel* code, SIGBUS doesn't seem like the right solution anyway. Are there any machine check exceptions for which it makes sense to continue right where we left off without a signal? Is CMIC such a beast? Can CMIC be delivered when interrupts are off?

I think you mean CMCI, and that's not even reported with a MCE exception - there's a separate APIC interrupt for that. I think this signal thing is for killing processes which have poisoned memory, but this memory can be contained within that process, and the physical page frame can be poisoned so that it doesn't get used ever again. In any case, this is an example of an uncorrectable error which needs action from us but doesn't necessarily have to kill the whole machine. This is supposed to be more graceful instead of consuming the corrupted data and sending it out to disk.

But sending signals from #MC context is definitely a bad idea. I think we had addressed this with irq_work at some point, but my memory is very hazy.

Why is it a problem if user_mode_vm(regs)? Conversely, why is sending a signal a remotely reasonable thing to do if !user_mode_vm(regs)?

--Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 2:55 PM, Borislav Petkov b...@alien8.de wrote:

On Wed, May 21, 2014 at 02:52:55PM -0700, Andy Lutomirski wrote:

Why is it a problem if user_mode_vm(regs)? Conversely, why is sending a signal a remotely reasonable thing to do if !user_mode_vm(regs)?

Let me quote Jiri:

(1) task sends signal to itself
(2) it acquires sighand->siglock so that it's able to queue the signal
(3) MCE triggers

...and !user_mode_vm(regs), and hence we're IN_KERNEL, and we should presumably just panic instead of trying to send a signal. I missed the IN_KERNEL thing because I didn't realize that ->cs was copied to struct mce.

--Andy
[PATCH 3/3] x86_64: Merge paranoidzeroentry_ist into idtentry
One more specialized entry function is now gone. Again, this seems to only change line numbers in entry_64.o. Signed-off-by: Andy Lutomirski l...@amacapital.net --- arch/x86/kernel/entry_64.S | 47 ++ 1 file changed, 22 insertions(+), 25 deletions(-) diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index eac9b81..be846d2 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -1203,8 +1203,15 @@ apicinterrupt IRQ_WORK_VECTOR \ /* * Exception entry points. */ -.macro idtentry sym do_sym has_error_code:req paranoid=0 +#define INIT_TSS_IST(x) PER_CPU_VAR(init_tss) + (TSS_ist + ((x) - 1) * 8) + +.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ENTRY(\sym) + /* Sanity check */ + .if \shift_ist != -1 \paranoid == 0 + .error using shift_ist requires paranoid=1 + .endif + .if \has_error_code XCPT_FRAME .else @@ -1230,8 +1237,12 @@ ENTRY(\sym) DEFAULT_FRAME 0 .if \paranoid + .if \shift_ist != -1 + TRACE_IRQS_OFF_DEBUG/* reload IDT in case of recursion */ + .else TRACE_IRQS_OFF .endif + .endif movq %rsp,%rdi /* pt_regs pointer */ @@ -1242,8 +1253,16 @@ ENTRY(\sym) xorl %esi,%esi /* no error code */ .endif + .if \shift_ist != -1 + subq $EXCEPTION_STKSZ, INIT_TSS_IST(\shift_ist) + .endif + call \do_sym + .if \shift_ist != -1 + addq $EXCEPTION_STKSZ, INIT_TSS_IST(\shift_ist) + .endif + .if \paranoid jmp paranoid_exit /* %ebx: no swapgs flag */ .else @@ -1254,28 +1273,6 @@ ENTRY(\sym) END(\sym) .endm -#define INIT_TSS_IST(x) PER_CPU_VAR(init_tss) + (TSS_ist + ((x) - 1) * 8) -.macro paranoidzeroentry_ist sym do_sym ist -ENTRY(\sym) - INTR_FRAME - ASM_CLAC - PARAVIRT_ADJUST_EXCEPTION_FRAME - pushq_cfi $-1 /* ORIG_RAX: no syscall to restart */ - subq $ORIG_RAX-R15, %rsp - CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15 - call save_paranoid - DEFAULT_FRAME 0 - TRACE_IRQS_OFF_DEBUG - movq %rsp,%rdi /* pt_regs pointer */ - xorl %esi,%esi /* no error code */ - subq $EXCEPTION_STKSZ, INIT_TSS_IST(\ist) - call \do_sym - addq 
$EXCEPTION_STKSZ, INIT_TSS_IST(\ist) - jmp paranoid_exit /* %ebx: no swapgs flag */ - CFI_ENDPROC -END(\sym) -.endm - #ifdef CONFIG_TRACING .macro trace_idtentry sym do_sym has_error_code:req idtentry trace(\sym) trace(\do_sym) has_error_code=\has_error_code @@ -1460,8 +1457,8 @@ apicinterrupt3 HYPERVISOR_CALLBACK_VECTOR \ */ .pushsection .kprobes.text, ax -paranoidzeroentry_ist debug do_debug DEBUG_STACK -paranoidzeroentry_ist int3 do_int3 DEBUG_STACK +idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK +idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK idtentry stack_segment do_stack_segment has_error_code=1 paranoid=1 #ifdef CONFIG_XEN idtentry xen_debug do_debug has_error_code=0 -- 1.9.0
[PATCH 1/3] x86_64: Add missing 'DEFAULT_FRAME 0' entry annotations
The paranoidzeroentry macros were missing them. I'm not at all convinced that these annotations are correct and/or necessary, but this makes the macros more consistent with each other. Signed-off-by: Andy Lutomirski l...@amacapital.net --- arch/x86/kernel/entry_64.S | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index 1e96c36..39372ec 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -1230,6 +1230,7 @@ ENTRY(\sym) subq $ORIG_RAX-R15, %rsp CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15 call save_paranoid + DEFAULT_FRAME 0 TRACE_IRQS_OFF movq %rsp,%rdi /* pt_regs pointer */ xorl %esi,%esi /* no error code */ @@ -1249,6 +1250,7 @@ ENTRY(\sym) subq $ORIG_RAX-R15, %rsp CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15 call save_paranoid + DEFAULT_FRAME 0 TRACE_IRQS_OFF_DEBUG movq %rsp,%rdi /* pt_regs pointer */ xorl %esi,%esi /* no error code */ -- 1.9.0
[PATCH 2/3] x86_64: Merge most 64-bit asm entry macros
I haven't touched the device interrupt code, which is different enough that it's probably not worth merging, and I haven't done anything about paranoidzeroentry_ist yet. This appears to produce an entry_64.o file that differs only in the debug info line numbers. Signed-off-by: Andy Lutomirski l...@amacapital.net --- arch/x86/kernel/entry_64.S | 152 +++-- 1 file changed, 64 insertions(+), 88 deletions(-) diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index 39372ec..eac9b81 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -36,7 +36,7 @@ * - FIXUP_TOP_OF_STACK/RESTORE_TOP_OF_STACK - Fix up the hardware stack * frame that is otherwise undefined after a SYSCALL * - TRACE_IRQ_* - Trace hard interrupt state for lock debugging. - * - errorentry/paranoidentry/zeroentry - Define exception entry points. + * - idtentry - Define exception entry points. */ #include linux/linkage.h @@ -1203,39 +1203,53 @@ apicinterrupt IRQ_WORK_VECTOR \ /* * Exception entry points. 
*/ -.macro zeroentry sym do_sym +.macro idtentry sym do_sym has_error_code:req paranoid=0 ENTRY(\sym) + .if \has_error_code + XCPT_FRAME + .else INTR_FRAME - ASM_CLAC - PARAVIRT_ADJUST_EXCEPTION_FRAME - pushq_cfi $-1 /* ORIG_RAX: no syscall to restart */ - subq $ORIG_RAX-R15, %rsp - CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15 - call error_entry - DEFAULT_FRAME 0 - movq %rsp,%rdi /* pt_regs pointer */ - xorl %esi,%esi /* no error code */ - call \do_sym - jmp error_exit /* %ebx: no swapgs flag */ - CFI_ENDPROC -END(\sym) -.endm + .endif -.macro paranoidzeroentry sym do_sym -ENTRY(\sym) - INTR_FRAME ASM_CLAC PARAVIRT_ADJUST_EXCEPTION_FRAME - pushq_cfi $-1 /* ORIG_RAX: no syscall to restart */ + + .ifeq \has_error_code + pushq_cfi $-1 /* ORIG_RAX: no syscall to restart */ + .endif + subq $ORIG_RAX-R15, %rsp CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15 + + .if \paranoid call save_paranoid + .else + call error_entry + .endif + DEFAULT_FRAME 0 + + .if \paranoid TRACE_IRQS_OFF - movq %rsp,%rdi /* pt_regs pointer */ - xorl %esi,%esi /* no error code */ + .endif + + movq %rsp,%rdi /* pt_regs pointer */ + + .if \has_error_code + movq ORIG_RAX(%rsp),%rsi/* get error code */ + movq $-1,ORIG_RAX(%rsp) /* no syscall to restart */ + .else + xorl %esi,%esi /* no error code */ + .endif + call \do_sym - jmp paranoid_exit /* %ebx: no swapgs flag */ + + .if \paranoid + jmp paranoid_exit /* %ebx: no swapgs flag */ + .else + jmp error_exit /* %ebx: no swapgs flag */ + .endif + CFI_ENDPROC END(\sym) .endm @@ -1262,68 +1276,30 @@ ENTRY(\sym) END(\sym) .endm -.macro errorentry sym do_sym -ENTRY(\sym) - XCPT_FRAME - ASM_CLAC - PARAVIRT_ADJUST_EXCEPTION_FRAME - subq $ORIG_RAX-R15, %rsp - CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15 - call error_entry - DEFAULT_FRAME 0 - movq %rsp,%rdi /* pt_regs pointer */ - movq ORIG_RAX(%rsp),%rsi/* get error code */ - movq $-1,ORIG_RAX(%rsp) /* no syscall to restart */ - call \do_sym - jmp error_exit /* %ebx: no swapgs flag */ - CFI_ENDPROC -END(\sym) -.endm - #ifdef CONFIG_TRACING 
-.macro trace_errorentry sym do_sym -errorentry trace(\sym) trace(\do_sym) -errorentry \sym \do_sym +.macro trace_idtentry sym do_sym has_error_code:req +idtentry trace(\sym) trace(\do_sym) has_error_code=\has_error_code +idtentry \sym \do_sym has_error_code=\has_error_code .endm #else -.macro trace_errorentry sym do_sym -errorentry \sym \do_sym +.macro trace_idtentry sym do_sym has_error_code:req +idtentry \sym \do_sym has_error_code=\has_error_code .endm #endif - /* error code is on the stack already */ -.macro paranoiderrorentry sym do_sym -ENTRY(\sym) - XCPT_FRAME - ASM_CLAC - PARAVIRT_ADJUST_EXCEPTION_FRAME - subq $ORIG_RAX-R15, %rsp - CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15 - call save_paranoid - DEFAULT_FRAME 0 - TRACE_IRQS_OFF - movq %rsp,%rdi /* pt_regs pointer */ - movq ORIG_RAX(%rsp),%rsi/* get error code */ - movq $-1,ORIG_RAX(%rsp) /* no syscall to restart */ - call \do_sym - jmp paranoid_exit /* %ebx: no swapgs flag */ - CFI_ENDPROC -END(\sym) -.endm - -zeroentry divide_error do_divide_error -zeroentry overflow do_overflow -zeroentry bounds do_bounds -zeroentry invalid_op do_invalid_op -zeroentry device_not_available
[PATCH 0/3] x86_64: Merge (paranoid)?(zero|error)entry(_idt)?
Inspired by the RET vs IRET discussion, I thought about how much of a mess the current entry macros are and I shuddered at the thought of trying to modify them. This little series merges them all into one idtentry macro that has arguments that specify all the various weird behaviors. It has a major benefit: you can now look at the code and see how, say, the has_error_code and !has_error_code cases differ. Andy Lutomirski (3): x86_64: Add missing 'DEFAULT_FRAME 0' entry annotations x86_64: Merge most 64-bit asm entry macros x86_64: Merge paranoidzeroentry_ist into idtentry arch/x86/kernel/entry_64.S | 185 - 1 file changed, 80 insertions(+), 105 deletions(-) -- 1.9.0
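For illustration, declaring exception entry points with the unified macro ends up looking like this (the invocation lines below are taken verbatim from patches 2/3 and 3/3 of this series; has_error_code, paranoid, and shift_ist are the "weird behavior" knobs the cover letter mentions):

```
/* error code already on the stack, paranoid GS handling */
idtentry stack_segment do_stack_segment has_error_code=1 paranoid=1

/* IST entries that shift the IST slot so the exception can recurse safely */
idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
idtentry int3  do_int3  has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK

/* plain entry with no error code */
idtentry xen_debug do_debug has_error_code=0
```

The point of the merge is exactly this: the differences between entry flavors are now visible as macro arguments instead of being spread across four near-duplicate macro bodies.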
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 3:01 PM, Luck, Tony tony.l...@intel.com wrote: But sending signals from #MC context is definitely a bad idea. I think we had addressed this with irq_work at some point but my memory is very hazy. We added code for recoverable errors to get out of the MC context before trying to lookup the page and send the signal. Bottom of do_machine_check(): if (cfg->tolerant < 3) { if (no_way_out) mce_panic("Fatal machine check on current CPU", &m, msg); if (worst == MCE_AR_SEVERITY) { /* schedule action before return to userland */ mce_save_info(m.addr, m.mcgstatus & MCG_STATUS_RIPV); set_thread_flag(TIF_MCE_NOTIFY); } else if (kill_it) { force_sig(SIGBUS, current); } } That TIF_MCE_NOTIFY prevents the return to user mode, and we end up in mce_notify_process(). Why is this necessary? If the MCE hit kernel code, then we're going to die anyway. If the MCE hit user code, then we should be in a completely sensible context and we can just send the signal. --Andy The force_sig() there is legacy code - and perhaps should just move off to mce_notify_process() too (need to save worst so it will know what to do). -Tony -- Andy Lutomirski AMA Capital Management, LLC
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 3:17 PM, Borislav Petkov b...@alien8.de wrote: On Wed, May 21, 2014 at 03:13:16PM -0700, Andy Lutomirski wrote: Why is this necessary? If the MCE hit kernel code, then we're going to die anyway. If the MCE hit user code, then we should be in a completely sensible context and we can just send the signal. Are we guaranteed that the first thing the process will execute when scheduled back in are the signal handlers? It's not even scheduled out, right? This should be just like a signal from a failed page fault, I think. And besides, maybe we don't even want to allow to do the switch_to() but kill it while it is sleeping. What switch_to? --Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 3:18 PM, Luck, Tony tony.l...@intel.com wrote: That TIF_MCE_NOTIFY prevents the return to user mode, and we end up in mce_notify_process(). Why is this necessary? The recovery path has to do more than just send a signal - it needs to walk processes and mms to see which have mapped the physical address that the h/w told us has gone bad. I still feel like I'm missing something. If we interrupted user space code, then the context we're in should be identical to the context we'll get when we're about to return to userspace. --Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 3:25 PM, Andi Kleen a...@firstfloor.org wrote: Seems like a lot of effort and risk to essentially only optimize in kernel interrupt handlers. The idea is that it might allow us to remove a bunch of scary nested NMI code as well as speeding things up. AFAIK the most interesting cases (like user page faults) are not affected at all. Usually most workloads don't spend all that much time in the kernel, so it won't help most interrupts. I suspect the only case that's really interesting here is interrupting idle. Maybe it would be possible to do some fast path in this case only. However idle currently has so much overhead that I suspect that there are lower hanging fruit elsewhere. I will gladly buy a meal or beverage for whomever fixes the ttwu stuff to stop sending IPIs to idle CPUs, which will help a lot. --Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 3:32 PM, Luck, Tony tony.l...@intel.com wrote: The recovery path has to do more than just send a signal - it needs to walk processes and mms to see which have mapped the physical address that the h/w told us has gone bad. I still feel like I'm missing something. If we interrupted user space code, then the context we're in should be identical to the context we'll get when we're about to return to userspace. True. And this far along in do_machine_check() we have set all the other cpus free, so they are heading back to whatever context we interrupted them in. So we might be able to do all that other stuff inline here ... we interrupted user mode, so we know we don't hold any locks. Other cpus are running, so they can complete what they are doing to release any locks we might need. But it will take a while (to scan all those processes). And we haven't yet cleared MCG_STATUS ... so another machine check before we do that would be fatal (x86 doesn't allow nesting). Even if we moved the work after the clear of MCG_STATUS we'd still be vulnerable to a new machine check on x86_64 because we are sitting on the one and only machine check stack. But if we get a new MCE in here, it will be an MCE from kernel context and it's fatal. So, yes, we'll clobber the stack, but we'll never return (unless tolerant is set to something insane), so who cares? Anyway, I care less about this now that I don't have to worry about it re: IRET :) --Andy
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 3:36 PM, H. Peter Anvin h...@zytor.com wrote: On 05/21/2014 11:11 AM, Andy Lutomirski wrote: On Tue, May 20, 2014 at 5:53 PM, Andy Lutomirski l...@amacapital.net wrote: Here's a real proposal for iret-less return. If this is correct, then NMIs will never nest, which will probably delete a lot more scariness than is added by the code I'm describing. OK, here's a case where I'm wrong. An NMI interrupts userspace on a 16-bit stack. The return from NMI goes through the espfix code. Something interrupts while on the espfix stack. Boom! Neither return style is particularly good. More generally, if we got interrupted while on the espfix stack, we need to return back there using IRET. Fortunately, re-enabling NMIs there in harmless, since we've already switched off the NMI stack. This makes me think that maybe the logic should be turned around: have some RIP ranges on which the kernel stack might be invalid (which includes the espfix code and some of the syscall code) and use IRET only on return from NMI, return to nonstandard CS, and return to these special ranges. The NMI code just needs to never so any of this stuff unless it switches off the NMI stack first. For this to work reliably, we'll probably have to change CS before calling into EFI code. That should be straightforward. I think you are onto something here. In particular, the key observation here is that inside the kernel, we can never *both* have an invalid stack *and* be inside an NMI, #MC or #DB handler, even if nested. Except for espfix :) Now, does this prevent us from using RET in the common case? I'm not sure it is a huge loss since kernel-to-kernel is relatively rare. I don't think so. The most common case should be plain old interrupts and I suspect that #PF is a distant second. 
In any event, plain old interrupts and #PF are non-IST interrupts and they should be unconditionally safe for RET. --Andy
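The return-path policy being proposed in this subthread ("use IRET only on return from NMI, return to nonstandard CS, and return to these special ranges") can be written out as a small predicate. This is a hedged userspace sketch, not kernel code: the function name `use_iret`, the per-return `from_nmi` flag, the `KERNEL_CS` value, and the range table entries are all illustrative assumptions drawn from the discussion above.

```c
#include <stdbool.h>
#include <stdint.h>

#define KERNEL_CS 0x10 /* illustrative value for the standard 64-bit kernel CS */

/* Hypothetical RIP ranges on which the kernel stack might be invalid
 * (the espfix return path, parts of the syscall path). Addresses are made up. */
struct rip_range { uint64_t start, end; };
static const struct rip_range iret_only_ranges[] = {
    { 0xffffffff81a00000ull, 0xffffffff81a01000ull }, /* "espfix" code, fake */
    { 0xffffffff81a02000ull, 0xffffffff81a03000ull }, /* "syscall prologue", fake */
};

/* Must a kernel-to-kernel return use IRET? Yes when returning from an NMI,
 * returning to a nonstandard CS, or returning into one of the special
 * ranges; everything else can use the faster pop-and-RET sequence. */
static bool use_iret(bool from_nmi, uint16_t cs, uint64_t rip)
{
    if (from_nmi || cs != KERNEL_CS)
        return true;
    for (unsigned i = 0; i < sizeof(iret_only_ranges)/sizeof(iret_only_ranges[0]); i++)
        if (rip >= iret_only_ranges[i].start && rip < iret_only_ranges[i].end)
            return true;
    return false;
}
```

Plain old interrupts and #PF returning to ordinary kernel text hit none of these conditions, which is why they are the unconditionally-RET-safe common case.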
Re: [x86, vdso] cfda7bb9ecb: +14.7% will-it-scale.per_thread_ops
On Mon, May 19, 2014 at 10:59 PM, Jet Chen jet.c...@intel.com wrote: Hi Andy, FYI, we noticed the below changes on git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/vdso commit cfda7bb9ecbf9d96264bb5bade33a842966d1062 (x86, vdso: Move syscall and sysenter setup into kernel/cpu/common.c) This is really strange. That change shouldn't have any effect except that some kernels might boot a cycle or two faster. test case: nhm4/will-it-scale/sched_yield

3d7ee969bffcc98                    cfda7bb9ecbf9d96264bb5bad
---------------                    -------------------------
   5497021 ~ 0%   +14.7%   6303424 ~ 0%   TOTAL will-it-scale.per_thread_ops
      0.54 ~ 0%    +5.6%      0.57 ~ 0%   TOTAL will-it-scale.scalability
   6209483 ~ 0%    +1.6%   6305917 ~ 0%   TOTAL will-it-scale.per_process_ops
      2455 ~ 5%   +16.9%      2870 ~ 5%   TOTAL cpuidle.C1-NHM.usage
      8829 ~ 7%   +15.2%     10169 ~10%   TOTAL slabinfo.kmalloc-64.active_objs
     24.13 ~12%   +48.9%     35.93 ~14%   TOTAL time.user_time
       393 ~ 0%    -3.0%       382 ~ 1%   TOTAL time.system_time

Is this a speedup or a slowdown? --Andy
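As a sanity check on the report's percentages, the delta column is just the relative change between the two commits' measurements. A trivial illustration (`pct_change` is a hypothetical helper, not part of will-it-scale):

```c
/* Relative change, in percent, from an old measurement to a new one. */
static double pct_change(double old_val, double new_val)
{
    return (new_val - old_val) / old_val * 100.0;
}

/* pct_change(5497021, 6303424) is about +14.7, matching the
 * per_thread_ops row: more sched_yield operations per second after the
 * commit, i.e. the answer to Andy's question is "a speedup". */
```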
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 3:48 PM, Borislav Petkov b...@alien8.de wrote: On Wed, May 21, 2014 at 03:39:11PM -0700, Andy Lutomirski wrote: But if we get a new MCE in here, it will be an MCE from kernel context and it's fatal. So, yes, we'll clobber the stack, but we'll never return (unless tolerant is set to something insane), so who cares? Ok, but we still have to do the work before returning to the process. So if not mce_notify_process() how else are you suggesting we do this? I'm suggesting that you re-enable interrupts and do the work in do_machine_check. I think it'll just work. It might pay to set a flag so that you panic very loudly if do_machine_check recurses. I suspect that, if the hardware is generating machine checks while doing memory poisoning, the hardware is broken enough that even panicking might not work, though :) --Andy
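The "set a flag and panic loudly on recursion" idea suggested above fits in a few lines. This is a hedged userspace simulation under invented names (`mce_in_progress`, `record_panic`, `do_machine_check_sim`); a real implementation would use a per-CPU variable and call mce_panic():

```c
static int mce_in_progress; /* invented flag; the kernel would use a per-CPU var */
static int panic_count;     /* stands in for mce_panic() in this simulation */

static void record_panic(void) { panic_count++; }

static void do_machine_check_sim(void)
{
    if (mce_in_progress) {  /* recursed: a second #MC arrived mid-recovery */
        record_panic();
        return;
    }
    mce_in_progress = 1;
    /* here the real handler would re-enable interrupts and do the recovery
     * work (walk tasks, queue SIGBUS) inline, per the suggestion above */
    mce_in_progress = 0;
}
```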
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 4:05 PM, Luck, Tony tony.l...@intel.com wrote: On Wed, May 21, 2014 at 03:39:11PM -0700, Andy Lutomirski wrote: But if we get a new MCE in here, it will be an MCE from kernel context and it's fatal. So, yes, we'll clobber the stack, but we'll never return (unless tolerant is set to something insane), so who cares? Remember that machine checks are broadcast. So some other cpu can hit a recoverable machine check in user mode ... but that int#18 goes everywhere. Other cpus are innocent bystanders ... they will see MCG_STATUS.RIPV=1, MCG_STATUS.EIPV=0 and nothing important in any of their machine check banks. But if we are still finishing off processing the previous machine check, this will be a nested one - and BOOM, we are dead. Oh. Well, crap. FWIW, this means that there really is a problem if one of these #MC errors hits an innocent bystander who just happens to be handling an NMI, at least if we delete the nested NMI code. But I think my simplified proposal gets this right. -Tony [If you peer closely at the latest edition of the SDM - you'll see the bits are defined for a non-broadcast model ... e.g. LMCE_S bit in MCG_STATUS but currently shipping silicon doesn't use that] -- Andy Lutomirski AMA Capital Management, LLC
Re: [RFC] x86_64: A real proposal for iret-less return to kernel
On Wed, May 21, 2014 at 4:51 PM, Borislav Petkov b...@alien8.de wrote: On Thu, May 22, 2014 at 08:30:33AM +0900, Linus Torvalds wrote: If the OS then decides to take down the whole machine, the OS - not the hardware - can choose to do something that will punch through other CPU's NMI blocking (notably, init/reset), but the hardware doing this on its own is just broken if true. Not that it is any consolation but MCE is not broadcast on AMD. Regardless, exceptions like MCE cannot be held pending and do pierce the NMI handler on both. Now, if the NMI handler experiences a non-broadcast MCE on the same CPU, while running, we're simply going to panic as we're in kernel space anyway. The only problem is if the NMI handler gets interrupted while running on a bystander CPU. And I think we could deal with this because the bystander would not see an MCE and will return safely. We just need to make sure that it returns back to the said NMI handler and not to userspace. Unless I'm missing something ... Under my always RET unless returning from IST to weird CS or to specific known-invalid-stack regions proposal this should work fine. In the current code it'll also work fine *unless* it hits really early in the NMI, in which case a second NMI can kill us. --Andy
[PATCH 0/3] x86_64,vsyscall: Minor cleanups and fixes
This turns the vsyscall page all the way off in vsyscall=none mode and fixes a bug in warn_bad_vsyscall that resulted in corrupt log lines. With this series applied, vsyscall=none results in a tidier /proc/PID/maps. Andy Lutomirski (3): x86_64,vsyscall: Move all of the gate_area code to vsyscall_64.c x86_64,vsyscall: Turn vsyscalls all the way off when vsyscall=none x86_64,vsyscall: Fix warn_bad_vsyscall log output arch/x86/kernel/vsyscall_64.c | 69 ++- arch/x86/mm/init_64.c | 49 -- 2 files changed, 61 insertions(+), 57 deletions(-) -- 1.9.0
[PATCH 3/3] x86_64,vsyscall: Fix warn_bad_vsyscall log output
This commit: commit c767a54ba0657e52e6edaa97cbe0b0a8bf1c1655 Author: Joe Perches j...@perches.com Date: Mon May 21 19:50:07 2012 -0700 x86/debug: Add KERN_<LEVEL> to bare printks, convert printks to pr_<level> caused warn_bad_vsyscall to output garbage in the middle of the line. Revert the bad part of it. The printk in question isn't actually bare; the level is "%s". Signed-off-by: Andy Lutomirski l...@amacapital.net --- arch/x86/kernel/vsyscall_64.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c index 8d38eb5..6463f9e 100644 --- a/arch/x86/kernel/vsyscall_64.c +++ b/arch/x86/kernel/vsyscall_64.c @@ -81,10 +81,10 @@ static void warn_bad_vsyscall(const char *level, struct pt_regs *regs, if (!show_unhandled_signals) return; - pr_notice_ratelimited("%s%s[%d] %s ip:%lx cs:%lx sp:%lx ax:%lx si:%lx di:%lx\n", - level, current->comm, task_pid_nr(current), - message, regs->ip, regs->cs, - regs->sp, regs->ax, regs->si, regs->di); + printk_ratelimited("%s%s[%d] %s ip:%lx cs:%lx sp:%lx ax:%lx si:%lx di:%lx\n", + level, current->comm, task_pid_nr(current), + message, regs->ip, regs->cs, + regs->sp, regs->ax, regs->si, regs->di); } static int addr_to_vsyscall_nr(unsigned long addr) -- 1.9.0
[PATCH 1/3] x86_64,vsyscall: Move all of the gate_area code to vsyscall_64.c
This code exists for the sole purpose of making the vsyscall page look sort of like real userspace memory. Move it so that it lives with the rest of the vsyscall code. Signed-off-by: Andy Lutomirski l...@amacapital.net --- arch/x86/kernel/vsyscall_64.c | 49 +++ arch/x86/mm/init_64.c | 49 --- 2 files changed, 49 insertions(+), 49 deletions(-) diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c index ea5b570..ad84894 100644 --- a/arch/x86/kernel/vsyscall_64.c +++ b/arch/x86/kernel/vsyscall_64.c @@ -284,6 +284,55 @@ sigsegv: } /* + * A pseudo VMA to allow ptrace access for the vsyscall page. This only + * covers the 64bit vsyscall page now. 32bit has a real VMA now and does + * not need special handling anymore: + */ +static const char *gate_vma_name(struct vm_area_struct *vma) +{ + return "[vsyscall]"; +} +static struct vm_operations_struct gate_vma_ops = { + .name = gate_vma_name, +}; +static struct vm_area_struct gate_vma = { + .vm_start = VSYSCALL_ADDR, + .vm_end = VSYSCALL_ADDR + PAGE_SIZE, + .vm_page_prot = PAGE_READONLY_EXEC, + .vm_flags = VM_READ | VM_EXEC, + .vm_ops = &gate_vma_ops, +}; + +struct vm_area_struct *get_gate_vma(struct mm_struct *mm) +{ +#ifdef CONFIG_IA32_EMULATION + if (!mm || mm->context.ia32_compat) + return NULL; +#endif + return &gate_vma; +} + +int in_gate_area(struct mm_struct *mm, unsigned long addr) +{ + struct vm_area_struct *vma = get_gate_vma(mm); + + if (!vma) + return 0; + + return (addr >= vma->vm_start) && (addr < vma->vm_end); +} + +/* + * Use this when you have no reliable mm, typically from interrupt + * context. It is less reliable than using a task's mm and may give + * false positives. + */ +int in_gate_area_no_mm(unsigned long addr) +{ + return (addr & PAGE_MASK) == VSYSCALL_ADDR; +} + +/* * Assume __initcall executes before all user space. Hopefully kmod * doesn't violate that. We'll find out if it does.
*/ diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c index bdcde58..1390b7b 100644 --- a/arch/x86/mm/init_64.c +++ b/arch/x86/mm/init_64.c @@ -1180,55 +1180,6 @@ int kern_addr_valid(unsigned long addr) return pfn_valid(pte_pfn(*pte)); } -/* - * A pseudo VMA to allow ptrace access for the vsyscall page. This only - * covers the 64bit vsyscall page now. 32bit has a real VMA now and does - * not need special handling anymore: - */ -static const char *gate_vma_name(struct vm_area_struct *vma) -{ - return "[vsyscall]"; -} -static struct vm_operations_struct gate_vma_ops = { - .name = gate_vma_name, -}; -static struct vm_area_struct gate_vma = { - .vm_start = VSYSCALL_ADDR, - .vm_end = VSYSCALL_ADDR + PAGE_SIZE, - .vm_page_prot = PAGE_READONLY_EXEC, - .vm_flags = VM_READ | VM_EXEC, - .vm_ops = &gate_vma_ops, -}; - -struct vm_area_struct *get_gate_vma(struct mm_struct *mm) -{ -#ifdef CONFIG_IA32_EMULATION - if (!mm || mm->context.ia32_compat) - return NULL; -#endif - return &gate_vma; -} - -int in_gate_area(struct mm_struct *mm, unsigned long addr) -{ - struct vm_area_struct *vma = get_gate_vma(mm); - - if (!vma) - return 0; - - return (addr >= vma->vm_start) && (addr < vma->vm_end); -} - -/* - * Use this when you have no reliable mm, typically from interrupt - * context. It is less reliable than using a task's mm and may give - * false positives. - */ -int in_gate_area_no_mm(unsigned long addr) -{ - return (addr & PAGE_MASK) == VSYSCALL_ADDR; -} - #ifdef CONFIG_X86_UV unsigned long memory_block_size_bytes(void) { -- 1.9.0
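The gate-area checks moved by this patch are easy to exercise in isolation. A hedged userspace sketch of the same address tests, with the x86_64 page size and the fixed vsyscall page address hard-coded (values assumed from the ABI, not taken from this diff; the `_sk` names are invented to avoid clashing with the kernel's):

```c
#include <stdint.h>

#define PAGE_SIZE_SK     4096ull
#define PAGE_MASK_SK     (~(PAGE_SIZE_SK - 1))
#define VSYSCALL_ADDR_SK 0xffffffffff600000ull /* fixed vsyscall page address */

/* Mirrors in_gate_area_no_mm(): is addr anywhere inside the vsyscall page? */
static int in_gate_area_no_mm_sk(uint64_t addr)
{
    return (addr & PAGE_MASK_SK) == VSYSCALL_ADDR_SK;
}

/* Mirrors in_gate_area() against the pseudo-VMA [VSYSCALL_ADDR, +PAGE_SIZE) */
static int in_gate_area_sk(uint64_t addr)
{
    return addr >= VSYSCALL_ADDR_SK && addr < VSYSCALL_ADDR_SK + PAGE_SIZE_SK;
}
```

Both predicates agree for any single mapped page; the no_mm variant exists for contexts (interrupts) where no reliable mm is available, as the moved comment says.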
[PATCH 2/3] x86_64,vsyscall: Turn vsyscalls all the way off when vsyscall=none
I see no point in having an unusable read-only page sitting at 0xffffffffff600000 when vsyscall=none. Instead, skip mapping it and remove it from /proc/PID/maps. I kept the ratelimited warning when programs try to use a vsyscall in this mode, since it may help admins avoid confusion. Signed-off-by: Andy Lutomirski l...@amacapital.net --- arch/x86/kernel/vsyscall_64.c | 14 +- 1 file changed, 9 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c index ad84894..8d38eb5 100644 --- a/arch/x86/kernel/vsyscall_64.c +++ b/arch/x86/kernel/vsyscall_64.c @@ -309,6 +309,8 @@ struct vm_area_struct *get_gate_vma(struct mm_struct *mm) if (!mm || mm->context.ia32_compat) return NULL; #endif + if (vsyscall_mode == NONE) + return NULL; return &gate_vma; } @@ -329,7 +331,7 @@ int in_gate_area(struct mm_struct *mm, unsigned long addr) */ int in_gate_area_no_mm(unsigned long addr) { - return (addr & PAGE_MASK) == VSYSCALL_ADDR; + return vsyscall_mode != NONE && (addr & PAGE_MASK) == VSYSCALL_ADDR; } /* @@ -380,10 +382,12 @@ void __init map_vsyscall(void) extern char __vsyscall_page; unsigned long physaddr_vsyscall = __pa_symbol(&__vsyscall_page); - __set_fixmap(VSYSCALL_PAGE, physaddr_vsyscall, -vsyscall_mode == NATIVE -? PAGE_KERNEL_VSYSCALL -: PAGE_KERNEL_VVAR); + if (vsyscall_mode != NONE) + __set_fixmap(VSYSCALL_PAGE, physaddr_vsyscall, +vsyscall_mode == NATIVE +? PAGE_KERNEL_VSYSCALL +: PAGE_KERNEL_VVAR); + BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) != (unsigned long)VSYSCALL_ADDR); } -- 1.9.0
Pondering per-process vsyscall disablement
It would be nice to have a way for new programs to declare that they don't need vsyscalls. What's the right way to do this? An ELF header entry in the loader? An ELF header entry in the program? A new arch_prctl? As background, there's an old part of the x86_64 ABI that allows programs to do gettimeofday, clock_gettime, and getcpu by calling to fixed addresses of the form 0xffffffffff600n00 where n indicates which of those three syscalls is being invoked. This is a security issue. Since Linux 3.1, vsyscalls are emulated using NX and page faults. As a result, vsyscalls no longer offer any performance advantage over normal syscalls; in fact, they're much slower. As far as I know, nothing newer than 2012 will attempt to use vsyscalls if a vdso is present. (Sadly, a lot of things will still fall back to the vsyscall page if there is no vdso, but that shouldn't matter, since there is always a vdso.) Despite the emulation, they could still be used as a weird form of ROP gadget that lives at a fixed address. I'd like to offer a way for new runtimes to indicate that they don't use vsyscalls so that the kernel can selectively disable emulation and remove the fixed-address executable code issue. --Andy
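To make the fixed-address layout concrete, here is a hedged sketch of the vsyscall slot addresses. The base address and the 1024-byte (0x400) slot spacing match the classic x86_64 layout; the helper name `vsyscall_slot` is invented for illustration:

```c
#include <stdint.h>

#define VSYSCALL_BASE 0xffffffffff600000ull /* start of the legacy vsyscall page */

/* Address of legacy vsyscall slot n; slots are 0x400 bytes apart, which is
 * where the 0xffffffffff600n00 pattern with n in {0, 4, 8} comes from
 * (slot 0 historically being gettimeofday and slot 2 getcpu). */
static uint64_t vsyscall_slot(unsigned n)
{
    return VSYSCALL_BASE + (uint64_t)n * 0x400;
}
```

Because these three entry points sit at addresses that never change across boots or binaries, they defeat ASLR for whatever bytes live there, which is exactly the ROP-gadget concern raised above.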
Re: [PATCH v5 4/6] seccomp: move no_new_privs into seccomp
On Thu, May 22, 2014 at 4:05 PM, Kees Cook keesc...@chromium.org wrote: Since seccomp transitions between threads require updates to the no_new_privs flag to be atomic, the flag must support atomic updates. This moves the nnp flag into the seccomp field as a separate unsigned long for atomic access. Signed-off-by: Kees Cook keesc...@chromium.org Acked-by: Andy Lutomirski l...@amacapital.net
Re: [PATCH v5 6/6] seccomp: add SECCOMP_EXT_ACT_TSYNC and SECCOMP_FILTER_TSYNC
On Thu, May 22, 2014 at 4:05 PM, Kees Cook keesc...@chromium.org wrote: Applying restrictive seccomp filter programs to large or diverse codebases often requires handling threads which may be started early in the process lifetime (e.g., by code that is linked in). While it is possible to apply permissive programs prior to process start up, it is difficult to further restrict the kernel ABI to those threads after that point. This change adds a new seccomp extension action for synchronizing thread group seccomp filters and a prctl() for accessing that functionality, as well as a flag for SECCOMP_EXT_ACT_FILTER to perform sync at filter installation time. When calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_FILTER, flags, filter) with flags containing SECCOMP_FILTER_TSYNC, or when calling prctl(PR_SECCOMP_EXT, SECCOMP_EXT_ACT, SECCOMP_EXT_ACT_TSYNC, 0, 0), it will attempt to synchronize all threads in current's threadgroup to its seccomp filter program. This is possible iff all threads are using a filter that is an ancestor to the filter current is attempting to synchronize to. NULL filters (where the task is running as SECCOMP_MODE_NONE) are also treated as ancestors allowing threads to be transitioned into SECCOMP_MODE_FILTER. If prctl(PR_SET_NO_NEW_PRIVS, ...) has been set on the calling thread, no_new_privs will be set for all synchronized threads too. On success, 0 is returned. On failure, the pid of one of the failing threads will be returned, with as many filters installed as possible. Is there a use case for adding a filter and synchronizing filters being separate operations? If not, I think this would be easier to understand and to use if there was just a single operation. If you did that, you'd have to decide whether to continue requiring that all the other threads have a filter that's an ancestor of the current thread's filter.
--Andy
Re: futex(2) man page update help request
On Wed, May 14, 2014 at 5:28 PM, H. Peter Anvin h...@zytor.com wrote: On 05/14/2014 01:56 PM, Davidlohr Bueso wrote: However, unless I'm sorely mistaken, the larger problem is that glibc removed the futex() call entirely, so these man pages don't describe I don't think futex() ever was in glibc--that's by design, and completely understandable: no user-space application would want to directly use futex(). That's actually not quite true. There are plenty of software efforts out there that use futex calls directly to implement userspace serialization mechanisms as an alternative to the bulky sysv semaphores. I worked closely with an in-memory DB project that makes heavy use of them. Not everyone can simply rely on pthreads. More fundamentally, futex(2), like clone(2), are things that can be legitimately by user space without automatically breaking all of glibc. I'm lost -- I think the missing verb is important :) There are some other things where that is *not* true, because glibc relies on being able to mediate all accesses to a kernel facility, but not here. --Andy
Re: mm: NULL ptr deref handling mmaping of special mappings
On May 14, 2014 8:36 PM, Pavel Emelyanov xe...@parallels.com wrote: On 05/15/2014 02:23 AM, Andy Lutomirski wrote: On Wed, May 14, 2014 at 3:11 PM, Cyrill Gorcunov gorcu...@gmail.com wrote: On Wed, May 14, 2014 at 02:33:54PM -0700, Andy Lutomirski wrote: On Wed, May 14, 2014 at 2:31 PM, Andrew Morton a...@linux-foundation.org wrote: On Wed, 14 May 2014 17:11:00 -0400 Sasha Levin sasha.le...@oracle.com wrote: In my linux-next all that code got deleted by Andy's x86, vdso: Reimplement vdso.so preparation in build-time C anyway. What kernel were you looking at? Deleted? It appears in today's -next. arch/x86/vdso/vma.c:124 . I don't see Andy's patch removing that code either. ah, OK, it got moved from arch/x86/vdso/vdso32-setup.c into arch/x86/vdso/vma.c. Maybe you managed to take a fault against the symbol area between the _install_special_mapping() and the remap_pfn_range() call, but mmap_sem should prevent that. Or the remap_pfn_range() call never happened. Should map_vdso() be running _install_special_mapping() at all if image->sym_vvar_page == NULL? I'm confused: are we talking about 3.15-rcsomething or linux-next? That code changed. Would this all make more sense if there were just a single vma in here? cc: Pavel and Cyrill, who might have to deal with this stuff in CRIU Well, for criu we've not modified any vdso kernel's code (except setting VM_SOFTDIRTY for this vdso VMA in _install_special_mapping). And never experienced problems Sasha points. Looks like indeed in -next code is pretty different from mainline one. To figure out why I need to fetch -next branch and get some research. I would try to do that tomorrow (still hoping someone more experienced in mm system would beat me on that). I can summarize: On 3.14 and before, the vdso is just a bunch of ELF headers and executable data. When executed by 64-bit binaries, it reads from the fixmap to do its thing. That is, it reads from kernel addresses that don't have vmas.
When executed by 32-bit binaries, it doesn't read anything, since there was no 32-bit timing code. On 3.15, the x86_64 vdso is unchanged. The 32-bit vdso is preceded by a separate vma containing two pages worth of time-varying read-only data. The vdso reads those pages using PIC references. On linux-next, all vdsos work the same way. There are two vmas. The first vma is executable text, which can be poked at by ptrace, etc normally. The second vma contains time-varying state, should not allow poking, and is accessed by PIC references. Is this 2nd vma seen in /proc/pid/maps? And if so, is it marked somehow? It is in maps, and it's not marked. I can write a patch to change that. I imagine it shouldn't be called [vdso], though. What does CRIU do to restore the vdso? Will 3.15 and/or linux-next need to make some concession for CRIU? We detect the vdso by [vdso] mark in proc at dump time and mark it in the images. At restore time we check that vdso symbols layout hasn't changed and just remap it in proper location. If this remains the same in -next, then we're fine :) If you just remap the vdso, you'll crash. This is the case in 3.15, too, for 32-bit apps, anyway. What happens if you try to checkpoint a program that's in the vdso or, worse, in a signal frame with the vdso on the stack? --Andy Thanks, Pavel
Re: mm: NULL ptr deref handling mmaping of special mappings
On Thu, May 15, 2014 at 1:45 AM, Cyrill Gorcunov gorcu...@gmail.com wrote: On Wed, May 14, 2014 at 03:23:27PM -0700, Andy Lutomirski wrote: I can summarize: On 3.14 and before, the vdso is just a bunch of ELF headers and executable data. When executed by 64-bit binaries, it reads from the fixmap to do its thing. That is, it reads from kernel addresses that don't have vmas. When executed by 32-bit binaries, it doesn't read anything, since there was no 32-bit timing code. On 3.15, the x86_64 vdso is unchanged. The 32-bit vdso is preceded by a separate vma containing two pages worth of time-varying read-only data. The vdso reads those pages using PIC references. Andy, could you please point me where is the code which creates a second vma? latest 3.15 master branch Search for _install_special_mapping in arch/x86/vdso. It's in a different place in 3.15-rc and -next. [root@fc ~]# cat /proc/self/maps ... 7fff57b6e000-7fff57b8f000 rw-p 00:00 0 [stack] 7fff57bff000-7fff57c00000 r-xp 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00:00 0 [vsyscall] [root@fc ~]# What version and bitness is this? Or you mean vsyscall area? If yes, then in criu we don't dump vsyscall zone. On restore we don't touch vsyscall either but for vdso there are two cases vsyscalls are almost gone now :) - if there were no kernel change on vdso contents we simply use vdso provided by the kernel at the moment of criu startup - if vdso has been changed and looks different from one saved in image during checkpoint, we map it from image but then patch (push jmp instruction) so when application calls for some of vdso function it jumps into vdso code saved in image and then jumps into vdso mapped by the kernel (ie kind of proxy calls) This force us to do own Elf parsing inside criu to calculate proper offsets.
Yuck :) --Andy
Re: mm: NULL ptr deref handling mmaping of special mappings
On Thu, May 15, 2014 at 12:53 PM, Cyrill Gorcunov gorcu...@gmail.com wrote: On Thu, May 15, 2014 at 12:46:34PM -0700, Andy Lutomirski wrote: On Thu, May 15, 2014 at 1:45 AM, Cyrill Gorcunov gorcu...@gmail.com wrote: On Wed, May 14, 2014 at 03:23:27PM -0700, Andy Lutomirski wrote: I can summarize: On 3.14 and before, the vdso is just a bunch of ELF headers and executable data. When executed by 64-bit binaries, it reads from the fixmap to do its thing. That is, it reads from kernel addresses that don't have vmas. When executed by 32-bit binaries, it doesn't read anything, since there was no 32-bit timing code. On 3.15, the x86_64 vdso is unchanged. The 32-bit vdso is preceded by a separate vma containing two pages worth of time-varying read-only data. The vdso reads those pages using PIC references. Andy, could you please point me where is the code which creates a second vma? latest 3.15 master branch Search for _install_special_mapping in arch/x86/vdso. It's in a different place in 3.15-rc and -next. As far as I see _install_special_mapping allocates one vma from cache and vma->vm_start = addr; vma->vm_end = addr + len; so where is the second one? Look at its callers in vdso32-setup.c and/or vma.c, depending on version. [root@fc ~]# cat /proc/self/maps ... 7fff57b6e000-7fff57b8f000 rw-p 00:00 0 [stack] 7fff57bff000-7fff57c00000 r-xp 00:00 0 [vdso] ffffffffff600000-ffffffffff601000 r-xp 00:00 0 [vsyscall] [root@fc ~]# What version and bitness is this? x86-64, 3.15-rc5 Aha. Give tip/x86/vdso or -next a try or boot a 32-bit 3.15-rc kernel and you'll see it. --Andy
Re: mm: NULL ptr deref handling mmaping of special mappings
On Thu, May 15, 2014 at 2:31 PM, Cyrill Gorcunov gorcu...@gmail.com wrote: On Fri, May 16, 2014 at 12:19:14AM +0400, Cyrill Gorcunov wrote: I see what you mean. We're rather targeting on bare x86-64 at the moment but compat mode is needed as well (not yet implemented though). I'll take a precise look into this area. Thanks! Indeed, because we were not running 32bit tasks vdso32-setup.c::arch_setup_additional_pages has never been called. That's the mode we will have to implement one day. Looking forward the question appear -- will VDSO_PREV_PAGES and rest of variables be kind of immutable constants? If yes, we could calculate where the additional vma lives without requiring any kind of [vdso] mark in proc/pid/maps output. Please don't! These might, in principle, even vary between tasks on the same system. Certainly the relative positions of the vmas will be different between 3.15 and 3.16, since we need almost my entire cleanup series to reliably put them into their 3.16 location. And I intend to change the number of pages in 3.16 or 3.17. --Andy
Re: mm: NULL ptr deref handling mmaping of special mappings
On Thu, May 15, 2014 at 2:57 PM, Cyrill Gorcunov gorcu...@gmail.com wrote: On Thu, May 15, 2014 at 02:42:48PM -0700, Andy Lutomirski wrote: Looking forward the question appear -- will VDSO_PREV_PAGES and rest of variables be kind of immutable constants? If yes, we could calculate where the additional vma lives without requiring any kind of [vdso] mark in proc/pid/maps output. Please don't! These might, in principle, even vary between tasks on the same system. Certainly the relative positions of the vmas will be different between 3.15 and 3.16, since we need almost my entire cleanup series to reliably put them into their 3.16 location. And I intend to change the number of pages in 3.16 or 3.17. There are other ways how to find where additional pages are laying but it would be great if there a straightforward interface for that (ie some mark in /proc/pid/maps output). I'll try to write a patch in time for 3.15. --Andy
Re: mm: NULL ptr deref handling mmaping of special mappings
On Thu, May 15, 2014 at 3:15 PM, Andy Lutomirski l...@amacapital.net wrote: On Thu, May 15, 2014 at 2:57 PM, Cyrill Gorcunov gorcu...@gmail.com wrote: On Thu, May 15, 2014 at 02:42:48PM -0700, Andy Lutomirski wrote: Looking forward the question appear -- will VDSO_PREV_PAGES and rest of variables be kind of immutable constants? If yes, we could calculate where the additional vma lives without requiring any kind of [vdso] mark in proc/pid/maps output. Please don't! These might, in principle, even vary between tasks on the same system. Certainly the relative positions of the vmas will be different between 3.15 and 3.16, since we need almost my entire cleanup series to reliably put them into their 3.16 location. And I intend to change the number of pages in 3.16 or 3.17. There are other ways how to find where additional pages are laying but it would be great if there a straightforward interface for that (ie some mark in /proc/pid/maps output). I'll try to write a patch in time for 3.15. My current draft is here: https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=vdso/cleanups On 64-bit userspace, it results in: 7fffa1dfd000-7fffa1dfe000 r-xp 00:00 0 [vdso] 7fffa1dfe000-7fffa1e00000 r--p 00:00 0 [vvar] ffffffffff600000-ffffffffff601000 r-xp 00:00 0 [vsyscall] On 32-bit userspace, it results in: f7748000-f7749000 r-xp 00:00 0 [vdso] f7749000-f774b000 r--p 00:00 0 [vvar] ffd94000-ffdb5000 rw-p 00:00 0 [stack] Is this good for CRIU? Another approach would be to name both of these things vdso, since they are sort of both the vdso, but that might be a bit confusing -- [vvar] is not static text the way that [vdso] is. If I backport this for 3.15 (which might be nasty -- I would argue that the code change is actually a cleanup, but it's fairly intrusive), then [vvar] will be *before* [vdso], not after it. I'd be very hesitant to name both of them [vdso] in that case, since there is probably code that assumes that the beginning of [vdso] is a DSO.
Note that it is *not* safe to blindly read from [vvar]. On some configurations you *will* get SIGBUS if you try to read from some of the vvar pages. (That's what started this whole thread.) Some pages in [vvar] may have strange caching modes, so SIGBUS might not be the only surprising thing about poking at it. --Andy
Re: [PATCH 1/4] SROP mitigation: Architecture independent code for signal canaries
On 05/15/2014 02:09 PM, Erik Bosman wrote: Architecture independent code for signal canaries Add support for canary values in user-space signal frames. These canaries function much like stack canaries/cookies, making it harder for an attacker to fake a call to {rt_,}sigreturn() This patch deals with architecture independent changes needed to support these canaries. These patches are meant to make Sigreturn Oriented Programming (SROP) a much less attractive exploitation path. In Sigreturn Oriented Programming, an attacker causes a user-space program to call the sigreturn system call in order to get complete control over the entire userspace context in one go. ( see: http://www.cs.vu.nl/~herbertb/papers/srop_sp14.pdf ) While mitigating SROP will probably not stop determined attackers from exploiting a program, as there's always the much more well-known Return Oriented Programming, we still think SROP's relative ease warrants mitigation, especially since the mitigation is so cheap. If you're willing to make the mitigation a bit more sneaky, you could make the canary value depend on the address that the canary is at. For example, it could be H(some per-exec secret || address) for your favorite hash function H. Also, I would have sigreturn clear the canary on the stack. This would mitigate attacks based on trying to read the canary value from some unused / leaked stack space.
--Andy Signed-off-by: Erik Bosman e...@minemu.org --- arch/Kconfig | 16 ++++++++++++++++ fs/exec.c | 8 ++++++++ include/linux/sched.h | 5 +++++ 3 files changed, 29 insertions(+) diff --git a/arch/Kconfig b/arch/Kconfig index 97ff872..8319984 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -399,6 +399,22 @@ config CC_STACKPROTECTOR_STRONG endchoice +config HAVE_SIGNAL_CANARY + bool + help + An arch should select this symbol if: + - its struct sigframe contains a canary field + - it has implemented signal canary checking + +config SIGNAL_CANARY + bool "signal canary" + default y + depends on HAVE_SIGNAL_CANARY + help + Mitigate against a userland exploitation technique called + sigreturn oriented programming by putting a canary value on a + signal's struct sigframe + config HAVE_CONTEXT_TRACKING bool help diff --git a/fs/exec.c b/fs/exec.c index 476f3eb..883f456 100644 --- a/fs/exec.c +++ b/fs/exec.c @@ -56,6 +56,7 @@ #include <linux/pipe_fs_i.h> #include <linux/oom.h> #include <linux/compat.h> +#include <linux/random.h> #include <asm/uaccess.h> #include <asm/mmu_context.h> @@ -1105,6 +1106,13 @@ void setup_new_exec(struct linux_binprm * bprm) /* This is the point of no return */ current->sas_ss_sp = current->sas_ss_size = 0; +#ifdef CONFIG_SIGNAL_CANARY + /* canary value to mitigate the use of sigreturn in (userland) exploits + * get_random_int() should be random enough also for 64bit + */ + current->signal_canary = (unsigned long)get_random_int(); +#endif + if (uid_eq(current_euid(), current_uid()) && gid_eq(current_egid(), current_gid())) set_dumpable(current->mm, SUID_DUMP_USER); else diff --git a/include/linux/sched.h b/include/linux/sched.h index 25f54c7..cb8b54b 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1364,6 +1364,11 @@ struct task_struct { unsigned long sas_ss_sp; size_t sas_ss_size; + +#ifdef CONFIG_SIGNAL_CANARY + u32 signal_canary; /* sigreturn exploit mitigation */ +#endif + int (*notifier)(void *priv); void *notifier_data; sigset_t *notifier_mask;
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On May 15, 2014 1:26 PM, Serge E. Hallyn se...@hallyn.com wrote: Quoting Richard Weinberger (rich...@nod.at): Am 15.05.2014 21:50, schrieb Serge Hallyn: Quoting Richard Weinberger (richard.weinber...@gmail.com): On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman gre...@linuxfoundation.org wrote: Then don't use a container to build such a thing, or fix the build scripts to not do that :) I second this. To me it looks like some folks try to (ab)use Linux containers for purposes where KVM would much better fit in. Please don't put more complexity into containers. They are already horrible complex and error prone. I, naturally, disagree :) The only use case which is inherently not valid for containers is running a kernel. Practically speaking there are other things which likely will never be possible, but if someone offers a way to do something in containers, you can't do that in containers is not an apropos response. That abstraction is wrong is certainly valid, as when vpids were originally proposed and rejected, resulting in the development of pid namespaces. We have to work out (x) first can be valid (and I can think of examples here), assuming it's not just trying to hide behind a catch-22/chicken-egg problem. Finally, saying containers are complex and error prone is conflating several large suites of userspace code and many kernel features which support them. Being more precise would, if the argument is valid, lend it a lot more weight. We (my company) use Linux containers since 2011 in production. First LXC, now libvirt-lxc. To understand the internals better I also wrote my own userspace to create/start containers. There are so many things which can hurt you badly. With user namespaces we expose a really big attack surface to regular users. I.e. Suddenly a user is allowed to mount filesystems. That is currently not the case. They can mount some virtual filesystems and do bind mounts, but cannot mount most real filesystems. 
This keeps us protected (for now) from potentially unsafe superblock readers in the kernel. Ask Andy, he found already lots of nasty things... I don't think I have anything brilliant to add to this discussion right now, except possibly: ISTM that Linux distributions are, in general, vulnerable to all kinds of shenanigans that would happen if an untrusted user can cause a block device to appear. That user doesn't need permission to mount it or even necessarily to change its contents on the fly. E.g. what happens if you boot a machine that contains a malicious disk image that has the same partition UUID as /? Nothing good, I imagine. So if we're going to go down this road, we really need some way to tell the host that certain devices are not trusted. --Andy
Re: [PATCH] x32: Mask away the x32 syscall bit in the ptrace codepath
On 05/28/2014 05:19 AM, Philipp Kern wrote: audit_filter_syscall uses the syscall number to reference into a bitmask (e->rule.mask[word]). Not removing the x32 bit before passing the number to this architecture independent codepath will fail to lookup the proper audit bit. Furthermore it will cause an invalid memory access in the kernel if the out of bound location is not mapped: BUG: unable to handle kernel paging request at ffff8800e5446630 IP: [<ffffffff810fcdd0>] audit_filter_syscall+0x90/0xf0 Together with the entrypoint in entry_64.S this change causes x32 programs to pass in both AUDIT_ARCH_X86_64 and AUDIT_ARCH_I386 depending on the syscall path. Cc: linux-kernel@vger.kernel.org Cc: H. J. Lu hjl.to...@gmail.com Cc: Eric Paris epa...@redhat.com Signed-off-by: Philipp Kern pk...@google.com --- arch/x86/kernel/ptrace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 678c0ad..166a3c7 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -1489,7 +1489,7 @@ long syscall_trace_enter(struct pt_regs *regs) if (IS_IA32) audit_syscall_entry(AUDIT_ARCH_I386, - regs->orig_ax, + regs->orig_ax & __SYSCALL_MASK, This is weird. Three questions: 1. How can this happen? I thought that x32 syscalls always came in through the syscall path, which doesn't set is_compat_task. (Can someone rename is_compat_task to in_compat_syscall? Pretty please?) 2. Now audit can't tell whether a syscall is x32 or i386. And audit is inconsistent with seccomp. This seems wrong. 3. The OOPS you're fixing doesn't seem like it's fixed. What if some other random high bits are set? --Andy
Re: [PATCH] x32: Mask away the x32 syscall bit in the ptrace codepath
On Wed, May 28, 2014 at 2:01 PM, H. Peter Anvin h...@linux.intel.com wrote: On 05/28/2014 01:47 PM, Andy Lutomirski wrote: On 05/28/2014 05:19 AM, Philipp Kern wrote: audit_filter_syscall uses the syscall number to reference into a bitmask (e->rule.mask[word]). Not removing the x32 bit before passing the number to this architecture independent codepath will fail to lookup the proper audit bit. Furthermore it will cause an invalid memory access in the kernel if the out of bound location is not mapped: BUG: unable to handle kernel paging request at ffff8800e5446630 IP: [<ffffffff810fcdd0>] audit_filter_syscall+0x90/0xf0 Together with the entrypoint in entry_64.S this change causes x32 programs to pass in both AUDIT_ARCH_X86_64 and AUDIT_ARCH_I386 depending on the syscall path. Cc: linux-kernel@vger.kernel.org Cc: H. J. Lu hjl.to...@gmail.com Cc: Eric Paris epa...@redhat.com Signed-off-by: Philipp Kern pk...@google.com --- arch/x86/kernel/ptrace.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c index 678c0ad..166a3c7 100644 --- a/arch/x86/kernel/ptrace.c +++ b/arch/x86/kernel/ptrace.c @@ -1489,7 +1489,7 @@ long syscall_trace_enter(struct pt_regs *regs) if (IS_IA32) audit_syscall_entry(AUDIT_ARCH_I386, - regs->orig_ax, + regs->orig_ax & __SYSCALL_MASK, This is weird. Three questions: 1. How can this happen? I thought that x32 syscalls always came in through the syscall path, which doesn't set is_compat_task. (Can someone rename is_compat_task to in_compat_syscall? Pretty please?) The SYSCALL path doesn't set TS_COMPAT, but is_compat_task() looks both as TS_COMPAT and bit 30 of orig_ax. I think what is really needed here is IS_IA32 should use is_ia32_task() instead, and *that* is the context we can mask off the x32 bit in at all. However, does audit not need that information? This is thoroughly fscked up. What *should* have happened is that there should have been an AUDIT_ARCH_X32.
Unfortunately no one caught that in time, so the current enshrined ABI is that, as far as seccomp is concerned, x32 is AUDIT_ARCH_X86_64 (see syscall_get_arch) but nr has the x32 bit set. But the audit code is different: x32 is AUDIT_ARCH_I386 and the x32 bit is set. This may be changeable, since fixes to the audit ABI may not break anything. Can we just use syscall_get_arch to determine the token to pass to the audit code? If it makes everyone feel better, I think that every single architecture has screwed this up. We actually had to prevent seccomp and audit from being configured in on any OABI-supporting ARM kernel, and MIPS almost did the same thing that x32 did. We caught that one on time, and it's now fixed in Linus' tree. (And why the frakk does audit receive the first four syscall arguments? Audit seems like the worst turd ever...) If you ever benchmark the performance impact of merely running auditd, you might have to upgrade that insult :-/ 2. Now audit can't tell whether a syscall is x32 or i386. And audit is inconsistent with seccomp. This seems wrong. This is completely and totally bogus, indeed. I'd suggest just fixing the bug in auditsc.c. 3. The OOPS you're fixing doesn't seem like it's fixed. What if some other random high bits are set? There is a range check in entry_*.S for the system call. I can imagine that causing a certain amount of confusion to fancy seccomp users. Oh, well. No one that I know of has complained yet. --Andy
Re: [PATCH] x32: Mask away the x32 syscall bit in the ptrace codepath
On Wed, May 28, 2014 at 2:19 PM, H. Peter Anvin h...@linux.intel.com wrote: On 05/28/2014 02:15 PM, Andy Lutomirski wrote: 3. The OOPS you're fixing doesn't seem like it's fixed. What if some other random high bits are set? There is a range check in entry_*.S for the system call. I can imagine that causing a certain amount of confusion to fancy seccomp users. Oh, well. No one that I know of has complained yet. I don't know how seccomp or audit deal with out-of-range syscall numbers, so that might be worth taking a look at. audit oopses, apparently. seccomp will tell BPF about it and follow directions, barring bugs. However: are you sure that entry_64.S handles this? It looks like tracesys has higher priority than badsys. And strace can certainly see out-of-range syscalls. And I can oops audit by doing: auditctl -a exit,always -S open ./a.out where a.out is this: #include <stdio.h> #include <sys/syscall.h> int main() { long i; for (i = 1000; ; i += 64 * 4096) syscall(i, 0, 0, 0, 0, 0, 0); return 0; } I would *love* to deprecate the entire syscall auditing mechanism. Or at least I'd love to have distros stop using it. I'll go ask for a CVE number. Sigh. --Andy
Re: [PATCH] x32: Mask away the x32 syscall bit in the ptrace codepath
On Wed, May 28, 2014 at 2:53 PM, Philipp Kern pk...@google.com wrote: On Wed, May 28, 2014 at 11:43 PM, Andy Lutomirski l...@amacapital.net wrote: However: are you sure that entry_64.S handles this? It looks like tracesys has higher priority than badsys. And strace can certainly see out-of-range syscalls. […] Not only can it see them: It must see that this bit is set as that's the only identifier it has to deduce that the binary is running in x32 mode. Out of range syscall numbers certainly do not work for auditing right now, hence my attempt to patch around it. There appears to be a completely arbitrary limit of 32*64 syscalls. There's also an arbitrary limit of 4 arguments. Both are wrong. I have no intention of fixing either. I'll fix the OOPS, though. Kind regards and thanks Philipp Kern -- Andy Lutomirski AMA Capital Management, LLC
[PATCH 2/2] audit: Move CONFIG_AUDITSYSCALL into staging and update help text
Here are some issues with the code: - It thinks that syscalls have four arguments. - It's a performance disaster. - It assumes that syscall numbers are between 0 and 2048. - It's unclear whether it's supposed to be reliable. - It's broken on things like x32. - It can't support ARM OABI. - Its approach to memory allocation is terrifying. I considered marking it BROKEN, but that might be too harsh.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 init/Kconfig | 13 +++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 9d3585b..4584f8a 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -296,13 +296,16 @@ config HAVE_ARCH_AUDITSYSCALL
 	bool
 
 config AUDITSYSCALL
-	bool "Enable system-call auditing support"
-	depends on AUDIT && HAVE_ARCH_AUDITSYSCALL
+	bool "Enable system-call auditing support (not recommended)"
+	depends on AUDIT && HAVE_ARCH_AUDITSYSCALL && STAGING
 	default y if SECURITY_SELINUX
 	help
-	  Enable low-overhead system-call auditing infrastructure that
-	  can be used independently or with another kernel subsystem,
-	  such as SELinux.
+	  Enable system-call auditing infrastructure that can be used
+	  independently or with another kernel subsystem, such as
+	  SELinux.
+
+	  AUDITSYSCALL has serious performance and correctness issues.
+	  Use it with extreme caution.
 
 config AUDIT_WATCH
 	def_bool y
-- 
1.9.3
[PATCH 0/2] Fix auditsc DoS and move it to staging
CONFIG_AUDITSYSCALL is awful. Patch 2 enumerates some reasons. Patch 1 fixes a nasty DoS and possible information leak. It should be applied and backported. Patch 2 is optional. I leave it to other peoples' judgment. Andy Lutomirski (2): auditsc: audit_krule mask accesses need bounds checking audit: Move CONFIG_AUDITSYSCALL into staging and update help text init/Kconfig | 13 - kernel/auditsc.c | 27 ++- 2 files changed, 26 insertions(+), 14 deletions(-) -- 1.9.3
[PATCH 1/2] auditsc: audit_krule mask accesses need bounds checking
Fixes an easy DoS and possible information disclosure. This does nothing about the broken state of x32 auditing.

Cc: sta...@vger.kernel.org
Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 kernel/auditsc.c | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index f251a5e..7ccd9db 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -728,6 +728,22 @@ static enum audit_state audit_filter_task(struct task_struct *tsk, char **key)
 	return AUDIT_BUILD_CONTEXT;
 }
 
+static bool audit_in_mask(const struct audit_krule *rule, unsigned long val)
+{
+	int word, bit;
+
+	if (val > 0xffffffff)
+		return false;
+
+	word = AUDIT_WORD(val);
+	if (word >= AUDIT_BITMASK_SIZE)
+		return false;
+
+	bit = AUDIT_BIT(val);
+
+	return rule->mask[word] & bit;
+}
+
 /* At syscall entry and exit time, this filter is called if the
  * audit_state is not low enough that auditing cannot take place, but is
  * also not high enough that we already know we have to write an audit
@@ -745,11 +761,8 @@ static enum audit_state audit_filter_syscall(struct task_struct *tsk,
 	rcu_read_lock();
 	if (!list_empty(list)) {
-		int word = AUDIT_WORD(ctx->major);
-		int bit  = AUDIT_BIT(ctx->major);
-
 		list_for_each_entry_rcu(e, list, list) {
-			if ((e->rule.mask[word] & bit) == bit &&
+			if (audit_in_mask(&e->rule, ctx->major) &&
 			    audit_filter_rules(tsk, &e->rule, ctx, NULL,
 					       &state, false)) {
 				rcu_read_unlock();
@@ -769,20 +782,16 @@ static enum audit_state audit_filter_syscall(struct task_struct *tsk,
 static int audit_filter_inode_name(struct task_struct *tsk,
 				   struct audit_names *n,
 				   struct audit_context *ctx)
 {
-	int word, bit;
 	int h = audit_hash_ino((u32)n->ino);
 	struct list_head *list = &audit_inode_hash[h];
 	struct audit_entry *e;
 	enum audit_state state;
 
-	word = AUDIT_WORD(ctx->major);
-	bit  = AUDIT_BIT(ctx->major);
-
 	if (list_empty(list))
 		return 0;
 
 	list_for_each_entry_rcu(e, list, list) {
-		if ((e->rule.mask[word] & bit) == bit &&
+		if (audit_in_mask(&e->rule, ctx->major) &&
 		    audit_filter_rules(tsk, &e->rule,
 				       ctx, n, &state, false)) {
 			ctx->current_state = state;
 			return 1;
-- 
1.9.3
Re: [RFC][PATCH 0/8] sched,idle: need resched polling rework
On Thu, May 22, 2014 at 6:09 AM, Peter Zijlstra pet...@infradead.org wrote: On Thu, May 22, 2014 at 02:58:18PM +0200, Peter Zijlstra wrote:

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4ea7b3f1a087..a5da85fb3570 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -546,12 +546,38 @@ static bool set_nr_and_not_polling(struct task_struct *p)
 	struct thread_info *ti = task_thread_info(p);
 	return !(fetch_or(&ti->flags, _TIF_NEED_RESCHED) & _TIF_POLLING_NRFLAG);
 }
+
+/*
+ * Atomically set TIF_NEED_RESCHED if TIF_POLLING_NRFLAG is set.
+ */
+static bool set_nr_if_polling(struct task_struct *p)
+{
+	struct thread_info *ti = task_thread_info(p);
+	typeof(ti->flags) old, val = ti->flags;
+
+	for (;;) {
+		if (!(val & _TIF_POLLING_NRFLAG))
+			return false;
+		if (val & _TIF_NEED_RESCHED)
+			return true;

Hmm, I think this is racy, false would be safer. If its already set we might already be past the sched_ttwu_pending() invocation, while if its clear and we're the one to set it, we're guaranteed not.

+		old = cmpxchg(&ti->flags, val, val | _TIF_NEED_RESCHED);
+		if (old == val)
+			return true;
+		val = old;
+	}
+}

Do you have an updated patch? After fixing the MIME flow damage (sigh), it doesn't apply to sched/core, which is my best guess for what it's based on.

--Andy

-- Andy Lutomirski AMA Capital Management, LLC
[PATCH v2 2/2] audit: Mark CONFIG_AUDITSYSCALL BROKEN and update help text
Here are some issues with the code: - It thinks that syscalls have four arguments. - It's a performance disaster. - It assumes that syscall numbers are between 0 and 2048. - It's unclear whether it's supposed to be reliable. - It's broken on things like x32. - It can't support ARM OABI. - Its approach to freeing memory is terrifying.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 init/Kconfig | 13 +++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 9d3585b..24d4b53 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -296,13 +296,16 @@ config HAVE_ARCH_AUDITSYSCALL
 	bool
 
 config AUDITSYSCALL
-	bool "Enable system-call auditing support"
-	depends on AUDIT && HAVE_ARCH_AUDITSYSCALL
+	bool "Enable system-call auditing support (not recommended)"
+	depends on AUDIT && HAVE_ARCH_AUDITSYSCALL && BROKEN
 	default y if SECURITY_SELINUX
 	help
-	  Enable low-overhead system-call auditing infrastructure that
-	  can be used independently or with another kernel subsystem,
-	  such as SELinux.
+	  Enable system-call auditing infrastructure that can be used
+	  independently or with another kernel subsystem, such as
+	  SELinux.
+
+	  AUDITSYSCALL has serious performance and correctness issues.
+	  Use it with extreme caution.
 
 config AUDIT_WATCH
 	def_bool y
-- 
1.9.3
[PATCH v2 1/2] auditsc: audit_krule mask accesses need bounds checking
Fixes an easy DoS and possible information disclosure. This does nothing about the broken state of x32 auditing.

Cc: sta...@vger.kernel.org
Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 kernel/auditsc.c | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index f251a5e..7ccd9db 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -728,6 +728,22 @@ static enum audit_state audit_filter_task(struct task_struct *tsk, char **key)
 	return AUDIT_BUILD_CONTEXT;
 }
 
+static bool audit_in_mask(const struct audit_krule *rule, unsigned long val)
+{
+	int word, bit;
+
+	if (val > 0xffffffff)
+		return false;
+
+	word = AUDIT_WORD(val);
+	if (word >= AUDIT_BITMASK_SIZE)
+		return false;
+
+	bit = AUDIT_BIT(val);
+
+	return rule->mask[word] & bit;
+}
+
 /* At syscall entry and exit time, this filter is called if the
  * audit_state is not low enough that auditing cannot take place, but is
  * also not high enough that we already know we have to write an audit
@@ -745,11 +761,8 @@ static enum audit_state audit_filter_syscall(struct task_struct *tsk,
 	rcu_read_lock();
 	if (!list_empty(list)) {
-		int word = AUDIT_WORD(ctx->major);
-		int bit  = AUDIT_BIT(ctx->major);
-
 		list_for_each_entry_rcu(e, list, list) {
-			if ((e->rule.mask[word] & bit) == bit &&
+			if (audit_in_mask(&e->rule, ctx->major) &&
 			    audit_filter_rules(tsk, &e->rule, ctx, NULL,
 					       &state, false)) {
 				rcu_read_unlock();
@@ -769,20 +782,16 @@ static enum audit_state audit_filter_syscall(struct task_struct *tsk,
 static int audit_filter_inode_name(struct task_struct *tsk,
 				   struct audit_names *n,
 				   struct audit_context *ctx)
 {
-	int word, bit;
 	int h = audit_hash_ino((u32)n->ino);
 	struct list_head *list = &audit_inode_hash[h];
 	struct audit_entry *e;
 	enum audit_state state;
 
-	word = AUDIT_WORD(ctx->major);
-	bit  = AUDIT_BIT(ctx->major);
-
 	if (list_empty(list))
 		return 0;
 
 	list_for_each_entry_rcu(e, list, list) {
-		if ((e->rule.mask[word] & bit) == bit &&
+		if (audit_in_mask(&e->rule, ctx->major) &&
 		    audit_filter_rules(tsk, &e->rule,
 				       ctx, n, &state, false)) {
 			ctx->current_state = state;
 			return 1;
-- 
1.9.3
[PATCH v2 0/2] Fix auditsc DoS and mark it BROKEN
CONFIG_AUDITSYSCALL is awful. Patch 2 enumerates some reasons. Patch 1 fixes a nasty DoS and possible information leak. It should be applied and backported. Patch 2 is optional. I leave it to other peoples' judgment. Andy Lutomirski (2): auditsc: audit_krule mask accesses need bounds checking audit: Mark CONFIG_AUDITSYSCALL BROKEN and update help text init/Kconfig | 13 +++++++++----- kernel/auditsc.c | 27 ++++++++++++++++++--------- 2 files changed, 26 insertions(+), 14 deletions(-) -- 1.9.3
Re: [PATCH v2 1/2] auditsc: audit_krule mask accesses need bounds checking
On Wed, May 28, 2014 at 7:23 PM, Eric Paris epa...@redhat.com wrote: On Wed, 2014-05-28 at 18:44 -0700, Andy Lutomirski wrote: Fixes an easy DoS and possible information disclosure. This does nothing about the broken state of x32 auditing. Cc: sta...@vger.kernel.org Signed-off-by: Andy Lutomirski l...@amacapital.net --- kernel/auditsc.c | 27 ++++++++++++++++++--------- 1 file changed, 18 insertions(+), 9 deletions(-) diff --git a/kernel/auditsc.c b/kernel/auditsc.c index f251a5e..7ccd9db 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -728,6 +728,22 @@ static enum audit_state audit_filter_task(struct task_struct *tsk, char **key) return AUDIT_BUILD_CONTEXT; } +static bool audit_in_mask(const struct audit_krule *rule, unsigned long val) +{ + int word, bit; + + if (val > 0xffffffff) + return false; Why is this necessary? To avoid an integer overflow. Admittedly, this particular overflow won't cause a crash, but it will cause incorrect results. + + word = AUDIT_WORD(val); + if (word >= AUDIT_BITMASK_SIZE) + return false; Since this covers it and is extensible... + + bit = AUDIT_BIT(val); + + return rule->mask[word] & bit; Returning an int as a bool creates worse code than just returning the int. (although in this case if the compiler chooses to inline it might be smart enough not to actually convert this int to a bool and make worse assembly...) I'd suggest the function here return an int. bools usually make the assembly worse... I'm ambivalent. The right assembly would use flags on x86, not an int or a bool, and I don't know what the compiler will do. --Andy
Re: [PATCH v2 2/2] audit: Mark CONFIG_AUDITSYSCALL BROKEN and update help text
On Wed, May 28, 2014 at 7:09 PM, Eric Paris epa...@redhat.com wrote: NAK On Wed, 2014-05-28 at 18:44 -0700, Andy Lutomirski wrote: Here are some issues with the code: - It thinks that syscalls have four arguments. Not true at all. It records the registers that would hold the first 4 arguments on syscall entry, for use later if needed, as getting those later on some arches is not feasible (see ia64). It makes no assumption about how many arguments a syscall has. What about a5 and a6? - It's a performance disaster. Only if you enable it. If you don't use audit it is a single branch. Hardly a disaster. It forces all syscalls into the slow path and it can do crazy things like building audit contexts just in case actual handling of the syscall triggers an audit condition so that the exit path can log the syscall. That's way worse than a single branch. Try it: benchmark getpid on Fedora and then repeat the experiment with syscall auditing fully disabled. The difference is striking. - It assumes that syscall numbers are between 0 and 2048. There could well be a bug here. Not questioning that. Although that would be patch 1/2 Even with patch 1, it still doesn't handle large syscall numbers -- it just assumes they're not audited. - It's unclear whether it's supposed to be reliable. Unclear to whom? To me. If some inode access or selinux rule triggers an audit, is the auditsc code guaranteed to write an exit record? And see below... - It's broken on things like x32. - It can't support ARM OABI. Some arches aren't supported? And that makes it BROKEN? It acts like x32 is supported. OABI is fixable. Admittedly, OABI not being supported is fine, now that it's correctly disabled. - Its approach to freeing memory is terrifying. What? None of your reasons hold water. Bugs need to be fixed. Try reporting them... This is just stupid. ...for reference, I've *tried* to fix the performance issues. I've run into all kinds of problems. The actual context code is incredibly tangled.
It's unclear whether it would be permissible for various rule combinations to suppress exit records triggered by selinux. Any effort to actually deal with this stuff will have to deal with the fact that the audit code builds up lots of state and frees it on syscall exit. That means that the exit hook needs to be actually invoked, otherwise we leak. I was never able to convince myself that the current code is correct wrt kernel threads. In summary, the code is a giant mess. The way it works is nearly incomprehensible. It contains at least one severe bug. I'd love to see it fixed, but for now, distributions seem to think that enabling CONFIG_AUDITSYSCALL is a reasonable thing to do, and I'd argue that it's actually a terrible choice for anyone who doesn't actually need syscall audit rules. And I don't know who needs these things. I've heard an argument that selinux benefits from this, since the syscall exit log helps with diagnosing denials. I think that's absurd. I've never found anything wrong with the denial record itself that would be helped by seeing the syscall log. (And there's always syscall_get_xyz, which would cover this case splendidly with about an order of magnitude less code.) As I said, I'm not going to push hard for this code to be marked BROKEN. But I think it's rather broken, and I'm definitely not volunteering to fix it. --Andy
Re: [PATCH v2 1/2] auditsc: audit_krule mask accesses need bounds checking
On Wed, May 28, 2014 at 7:43 PM, Eric Paris epa...@redhat.com wrote: On Wed, 2014-05-28 at 19:27 -0700, Andy Lutomirski wrote: On Wed, May 28, 2014 at 7:23 PM, Eric Paris epa...@redhat.com wrote: On Wed, 2014-05-28 at 18:44 -0700, Andy Lutomirski wrote: Fixes an easy DoS and possible information disclosure. This does nothing about the broken state of x32 auditing. Cc: sta...@vger.kernel.org Signed-off-by: Andy Lutomirski l...@amacapital.net --- kernel/auditsc.c | 27 ++++++++++++++++++--------- 1 file changed, 18 insertions(+), 9 deletions(-) diff --git a/kernel/auditsc.c b/kernel/auditsc.c index f251a5e..7ccd9db 100644 --- a/kernel/auditsc.c +++ b/kernel/auditsc.c @@ -728,6 +728,22 @@ static enum audit_state audit_filter_task(struct task_struct *tsk, char **key) return AUDIT_BUILD_CONTEXT; } +static bool audit_in_mask(const struct audit_krule *rule, unsigned long val) +{ + int word, bit; + + if (val > 0xffffffff) + return false; Why is this necessary? To avoid an integer overflow. Admittedly, this particular overflow won't cause a crash, but it will cause incorrect results. You know this code pre-dates git? I admit, I'm shocked no one ever noticed it before! This is ANCIENT. And clearly broken. I'll likely ask Richard to add a WARN_ONCE() in both this place, and below where word >= AUDIT_BITMASK_SIZE, so we might know if we ever need a larger bitmask to store syscall numbers Not there, please! It would make sense to check when rules are *added*, but any user can trivially invoke the filter with pretty much any syscall nr. It'd be nice if lib had an efficient bitmask implementation... + + word = AUDIT_WORD(val); + if (word >= AUDIT_BITMASK_SIZE) + return false; Since this covers it and is extensible... + + bit = AUDIT_BIT(val); + + return rule->mask[word] & bit; Returning an int as a bool creates worse code than just returning the int. (although in this case if the compiler chooses to inline it might be smart enough not to actually convert this int to a bool and make worse assembly...)
I'd suggest the function here return an int. bools usually make the assembly worse... I'm ambivalent. The right assembly would use flags on x86, not an int or a bool, and I don't know what the compiler will do. Also, clearly X86_X32 was implemented without audit support, so we shouldn't config it in. What do you think of this?

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 25d2c6f..fa852e2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -129,7 +129,7 @@ config X86
 	select HAVE_IRQ_EXIT_ON_IRQ_STACK if X86_64
 	select HAVE_CC_STACKPROTECTOR
 	select GENERIC_CPU_AUTOPROBE
-	select HAVE_ARCH_AUDITSYSCALL
+	select HAVE_ARCH_AUDITSYSCALL if !X86_X32

I'm fine with that, but I hope that distros will choose x32 over auditsc, at least until someone makes enabling auditsc be less intrusive. Or someone could fix it, modulo the syscall nr 2048 thing. x32 syscall nrs are *huge*. So are lots of other architectures', a few of which probably claim to support auditsc.

--Andy
Re: [PATCH v2 2/2] audit: Mark CONFIG_AUDITSYSCALL BROKEN and update help text
On Wed, May 28, 2014 at 7:54 PM, Eric Paris epa...@redhat.com wrote: On Wed, 2014-05-28 at 19:40 -0700, Andy Lutomirski wrote: On Wed, May 28, 2014 at 7:09 PM, Eric Paris epa...@redhat.com wrote: NAK On Wed, 2014-05-28 at 18:44 -0700, Andy Lutomirski wrote: Here are some issues with the code: - It thinks that syscalls have four arguments. Not true at all. It records the registers that would hold the first 4 entries on syscall entry, for use later if needed, as getting those later on some arches is not feasible (see ia64). It makes no assumption about how many syscalls a function has. What about a5 and a6? On the couple of syscalls where a5 and a6 had any state that was actually wanted by someone (mainly just the fd on mmap) audit collects it later in the actual syscall. - It assumes that syscall numbers are between 0 and 2048. There could well be a bug here. Not questioning that. Although that would be patch 1/2 Even with patch 1, it still doesn't handle large syscall numbers -- it just assumes they're not audited. That's because we haven't had large syscall numbers. That's the whole point of an arch doing select HAVE_ARCH_AUDITSYSCALL. If they don't meet the requirements, they shouldn't be selecting it - It's unclear whether it's supposed to be reliable. Unclear to whom? To me. If some inode access or selinux rule triggers an audit, is the auditsc code guaranteed to write an exit record? And see below... This is an honest question: Do you want to discuss these things, or would you be happier if I shut up, fix the bugs you found, and leave things be? I don't want to have an argument, I'm happy to have a discussion if you think that will be beneficial... I'm happy to discuss these things. I'd be happy if you fix things. I think that there are two major issues affecting users of auditsc and that either those issues should be fixed or users of auditsc should understand and accept these issues. Issue 1: syscall numbering. 
The syscall filters are unlikely to work well on lots of architectures. Issue 2: performance. Until someone fixes it (which is probably hard), turning this thing on hurts a lot. Oleg added a hack awhile ago that lets you do 'auditctl -a task,never' to get rid of the performance hit for new tasks, but that rather defeats the point. The scariness of the code can be lived with, but holy cow it's scary. I'm just not going to fix the issues myself. I tried and gave up. --Andy
Re: [PATCH v2 2/2] audit: Mark CONFIG_AUDITSYSCALL BROKEN and update help text
On Thu, May 29, 2014 at 6:05 AM, Steve Grubb sgr...@redhat.com wrote: On Wednesday, May 28, 2014 07:40:57 PM Andy Lutomirski wrote: - It assumes that syscall numbers are between 0 and 2048. There could well be a bug here. Not questioning that. Although that would be patch 1/2 Even with patch 1, it still doesn't handle large syscall numbers -- it just assumes they're not audited. All syscalls must be auditable. Meaning that if an arch goes above 2048, then we'll need to do some math to get it to fall back within the range. Off the top of my head, x32, some ARM variants, and some MIPS variants have syscalls above 2048. auditsc has been disabled on the offending ARM variant (because I noticed that the same issue affects seccomp, and no one particularly wants to fix it), but I suspect that auditsc is enabled but completely broken on whichever MIPS ABI has the issue. I haven't checked. TBH, I think that it's silly to let the auditsc design be heavily constrained by ia64 considerations -- I still think that the syscall entry hooks could be removed completely with some effort on most architectures and that performance would improve considerably. For users who don't have syscall filter rules, performance might even improve on ia64 -- I don't care how expensive syscall_get_args is, the actual printk/auditd thing will dominate in cases where anything is written. I wonder whether the syscall restart mechanism can be used to thoroughly confuse auditsc. I don't really know how syscall restarts work. My point is: I don't know what guarantees auditsc is supposed to provide, nor do I know how to evaluate whether the code is correct. For users who aren't bound by Common Criteria or related things, I suspect that syscall auditing (as opposed to, say, LSM-based auditing) will provide dubious value at best. Keep in mind that many syscalls take pointer arguments, and the auditsc code can't really do anything sensible with them. - It's unclear whether it's supposed to be reliable.
Unclear to whom? To me. If some inode access or selinux rule triggers an audit, is the auditsc code guaranteed to write an exit record? And see below... It should or indicate that it could not. --Andy
Re: [PATCH v2 2/2] audit: Mark CONFIG_AUDITSYSCALL BROKEN and update help text
On Thu, May 29, 2014 at 9:25 AM, Steve Grubb sgr...@redhat.com wrote: On Thursday, May 29, 2014 09:04:10 AM Andy Lutomirski wrote: On Thu, May 29, 2014 at 6:05 AM, Steve Grubb sgr...@redhat.com wrote: On Wednesday, May 28, 2014 07:40:57 PM Andy Lutomirski wrote: - It assumes that syscall numbers are between 0 and 2048. There could well be a bug here. Not questioning that. Although that would be patch 1/2 Even with patch 1, it still doesn't handle large syscall numbers -- it just assumes they're not audited. All syscalls must be auditable. Meaning that if an arch goes above 2048, then we'll need to do some math to get it to fall back within the range. Off the top of my head, x32, some ARM variants, and some MIPS variants have syscalls above 2048. That could be fixed by putting a subtraction in place to get the bit mask to map correctly. User space audit rules would need to fix that also. auditsc has been disabled on the offending ARM variant (because I noticed that the same issue affects seccomp, and no one particularly wants to fix it), but I suspect that auditsc is enabled but completely broken on whichever MIPS ABI has the issue. I haven't checked. TBH, I think that it's silly to let the auditsc design be heavily constrained by ia64 considerations -- I still think that the syscall entry hooks could be removed completely with some effort on most architectures and that performance would improve considerably. For users who don't have syscall filter rules, performance might even improve on ia64 -- I don't care how expensive syscall_get_args is, the actual printk/auditd thing will dominate in cases where anything is written. The last time I heard of benchmarking being done, audit's performance hit was negligible. That was about 4 years ago and there has been a whole lot of code churn since then. But it should in general be the cost of an 'if' statement unless auditing is enabled. 
If it is enabled, then you necessarily get the full treatment in case you trigger an event. The actual rule matching is reasonably inexpensive, but the killer is the fact that every system call is forced into the slow path. Without audit, this cost is paid by seccomp users and tasks that are being ptraced. With audit, every single syscall for every single task pays the full cost of a slow path system call. This is extra sad because, AFAICS, most of the syscall entry audit code shouldn't be necessary even for audit users. For users who aren't bound by Common Criteria or related things, I suspect that syscall auditing (as opposed to, say, LSM-based auditing) will provide dubious value at best. No, it works pretty good. You can see who is accessing what files pretty clearly. Right, but I don't think it's the *syscall* part that's really necessary here. The fact that someone opened /etc/passwd is interesting. The fact that someone sys_opened (const char *)0x123456 is not. Keep in mind that many syscalls take pointer arguments, and the auditsc code can't really do anything sensible with them. All security relevant information is collected in auxiliary records if they are not directly readable as a syscall argument. This is a basic requirement. We are not required to capture all possible arguments. If you know of any security relevant parameters not captured, please let us know. a5 and a6. The cmsg data in sendmsg (which is very security-relevant -- think of SCM_RIGHTS and SCM_CREDENTIALS). The destination for sendto. The whole payload of netlink messages. I don't know how many of these are actually captured. Plus, except for things like setuid and sigreturn, capturing parameters on entry shouldn't be needed anyway. Capturing at log time should be fine. If this magic capturing only happened for users who need it for compliance reasons, that would be one thing. But it happens for every single system call on Fedora.
--Andy
Re: [tip:x86/vdso] x86, vdso: Reimplement vdso.so preparation in build-time C
On Thu, May 29, 2014 at 12:17 PM, Paul Gortmaker paul.gortma...@windriver.com wrote: On Mon, May 5, 2014 at 6:25 PM, tip-bot for Andy Lutomirski tip...@zytor.com wrote: Commit-ID: 6f121e548f83674ab4920a4e60afb58d4f61b829 Gitweb: http://git.kernel.org/tip/6f121e548f83674ab4920a4e60afb58d4f61b829 Author: Andy Lutomirski l...@amacapital.net AuthorDate: Mon, 5 May 2014 12:19:34 -0700 Committer: H. Peter Anvin h...@linux.intel.com CommitDate: Mon, 5 May 2014 13:18:51 -0700 x86, vdso: Reimplement vdso.so preparation in build-time C Just a heads up in case it hasn't been mentioned already; the x86 builds in linux-next which IIRC are done as cross compile on a powerpc box are failing as follows: VDSO2C arch/x86/vdso/vdso-image-64.c /bin/sh: line 1: 15995 Segmentation fault arch/x86/vdso/vdso2c arch/x86/vdso/vdso64.so.dbg arch/x86/vdso/vdso-image-64.c make[3]: *** [arch/x86/vdso/vdso-image-64.c] Error 139 http://kisskb.ellerman.id.au/kisskb/buildresult/11265454/ Egads. This wouldn't be a big-endian host by any chance? --Andy
Re: [tip:x86/vdso] x86, vdso: Reimplement vdso.so preparation in build-time C
On Thu, May 29, 2014 at 12:43 PM, H. Peter Anvin h...@zytor.com wrote: On 05/29/2014 12:32 PM, Andy Lutomirski wrote: On Thu, May 29, 2014 at 12:17 PM, Paul Gortmaker paul.gortma...@windriver.com wrote: On Mon, May 5, 2014 at 6:25 PM, tip-bot for Andy Lutomirski tip...@zytor.com wrote: Commit-ID: 6f121e548f83674ab4920a4e60afb58d4f61b829 Gitweb: http://git.kernel.org/tip/6f121e548f83674ab4920a4e60afb58d4f61b829 Author: Andy Lutomirski l...@amacapital.net AuthorDate: Mon, 5 May 2014 12:19:34 -0700 Committer: H. Peter Anvin h...@linux.intel.com CommitDate: Mon, 5 May 2014 13:18:51 -0700 x86, vdso: Reimplement vdso.so preparation in build-time C Just a heads up in case it hasn't been mentioned already; the x86 builds in linux-next which IIRC are done as cross compile on a powerpc box are failing as follows: VDSO2C arch/x86/vdso/vdso-image-64.c /bin/sh: line 1: 15995 Segmentation fault arch/x86/vdso/vdso2c arch/x86/vdso/vdso64.so.dbg arch/x86/vdso/vdso-image-64.c make[3]: *** [arch/x86/vdso/vdso-image-64.c] Error 139 http://kisskb.ellerman.id.au/kisskb/buildresult/11265454/ Egads. This wouldn't be a big-endian host by any chance? He said PowerPC; most but not all PowerPC systems are bigendian. Seems like a fair assumption to make. I suppose I shouldn't have assumed that code in arch/x86 would always run on little-endian machines. I'll fix it. --Andy
[PATCH 2/2] x86,vdso: Fix cross-compilation from big-endian architectures
This adds a macro GET(x) to convert x from big-endian to little-endian. Hopefully I put it everywhere it needs to go and got all the cases needed for everyone's linux/elf.h. Signed-off-by: Andy Lutomirski l...@amacapital.net --- arch/x86/vdso/vdso2c.c | 15 + arch/x86/vdso/vdso2c.h | 59 -- 2 files changed, 48 insertions(+), 26 deletions(-) diff --git a/arch/x86/vdso/vdso2c.c b/arch/x86/vdso/vdso2c.c index fe8bfbf..de19ced 100644 --- a/arch/x86/vdso/vdso2c.c +++ b/arch/x86/vdso/vdso2c.c @@ -51,6 +51,21 @@ static void fail(const char *format, ...) va_end(ap); } +/* + * Evil macros to do a little-endian read. + */ +#define __GET_TYPE(x, type, bits, ifnot) \ + __builtin_choose_expr( \ + __builtin_types_compatible_p(typeof(x), type), \ + le##bits##toh((x)), ifnot) + +extern void bad_get(uint64_t); + +#define GET(x) \ + __GET_TYPE((x), __u32, 32, __GET_TYPE((x), __u64, 64, \ + __GET_TYPE((x), __s32, 32, __GET_TYPE((x), __s64, 64, \ + __GET_TYPE((x), __u16, 16, bad_get(x)) + #define NSYMS (sizeof(required_syms) / sizeof(required_syms[0])) #define BITS 64 diff --git a/arch/x86/vdso/vdso2c.h b/arch/x86/vdso/vdso2c.h index 26a7c1f..dadecb9 100644 --- a/arch/x86/vdso/vdso2c.h +++ b/arch/x86/vdso/vdso2c.h @@ -18,25 +18,27 @@ static void GOFUNC(void *addr, size_t len, FILE *outfile, const char *name) const char *secstrings; uint64_t syms[NSYMS] = {}; - Elf_Phdr *pt = (Elf_Phdr *)(addr + hdr-e_phoff); + Elf_Phdr *pt = (Elf_Phdr *)(addr + GET(hdr-e_phoff)); /* Walk the segment table. 
*/ - for (i = 0; i hdr-e_phnum; i++) { - if (pt[i].p_type == PT_LOAD) { + for (i = 0; i GET(hdr-e_phnum); i++) { + if (GET(pt[i].p_type) == PT_LOAD) { if (found_load) fail(multiple PT_LOAD segs\n); - if (pt[i].p_offset != 0 || pt[i].p_vaddr != 0) + if (GET(pt[i].p_offset) != 0 || + GET(pt[i].p_vaddr) != 0) fail(PT_LOAD in wrong place\n); - if (pt[i].p_memsz != pt[i].p_filesz) + if (GET(pt[i].p_memsz) != GET(pt[i].p_filesz)) fail(cannot handle memsz != filesz\n); - load_size = pt[i].p_memsz; + load_size = GET(pt[i].p_memsz); found_load = 1; - } else if (pt[i].p_type == PT_DYNAMIC) { - dyn = addr + pt[i].p_offset; - dyn_end = addr + pt[i].p_offset + pt[i].p_memsz; + } else if (GET(pt[i].p_type) == PT_DYNAMIC) { + dyn = addr + GET(pt[i].p_offset); + dyn_end = addr + GET(pt[i].p_offset) + + GET(pt[i].p_memsz); } } if (!found_load) @@ -44,43 +46,48 @@ static void GOFUNC(void *addr, size_t len, FILE *outfile, const char *name) data_size = (load_size + 4095) / 4096 * 4096; /* Walk the dynamic table */ - for (i = 0; dyn + i dyn_end dyn[i].d_tag != DT_NULL; i++) { - if (dyn[i].d_tag == DT_REL || dyn[i].d_tag == DT_RELSZ || - dyn[i].d_tag == DT_RELENT || dyn[i].d_tag == DT_TEXTREL) + for (i = 0; dyn + i dyn_end GET(dyn[i].d_tag) != DT_NULL; i++) { + typeof(dyn[i].d_tag) tag = GET(dyn[i].d_tag); + if (tag == DT_REL || tag == DT_RELSZ || + tag == DT_RELENT || tag == DT_TEXTREL) fail(vdso image contains dynamic relocations\n); } /* Walk the section table */ - secstrings_hdr = addr + hdr-e_shoff + hdr-e_shentsize*hdr-e_shstrndx; - secstrings = addr + secstrings_hdr-sh_offset; - for (i = 0; i hdr-e_shnum; i++) { - Elf_Shdr *sh = addr + hdr-e_shoff + hdr-e_shentsize * i; - if (sh-sh_type == SHT_SYMTAB) + secstrings_hdr = addr + GET(hdr-e_shoff) + + GET(hdr-e_shentsize)*GET(hdr-e_shstrndx); + secstrings = addr + GET(secstrings_hdr-sh_offset); + for (i = 0; i GET(hdr-e_shnum); i++) { + Elf_Shdr *sh = addr + GET(hdr-e_shoff) + + GET(hdr-e_shentsize) * i; + if (GET(sh-sh_type) == 
SHT_SYMTAB) symtab_hdr = sh; - if (!strcmp(secstrings + sh-sh_name, .altinstructions)) + if (!strcmp(secstrings + GET(sh-sh_name), .altinstructions)) alt_sec = sh; } if (!symtab_hdr) fail(no symbol table\n); - strtab_hdr = addr + hdr-e_shoff
[PATCH 0/2] x86,vdso: vdso build fixes and improvements
Patch 1 causes make;make to behave similarly to make if vdso2c fails. Patch 2 hopefully fixes x86 crossbuilds on big-endian machines. I don't have a big-endian machine to test on, though. Andy Lutomirski (2): x86,vdso: When vdso2c fails, unlink the output x86,vdso: Fix cross-compilation from big-endian architectures arch/x86/vdso/vdso2c.c | 35 ++--- arch/x86/vdso/vdso2c.h | 69 ++ 2 files changed, 62 insertions(+), 42 deletions(-) -- 1.9.3
[PATCH 1/2] x86,vdso: When vdso2c fails, unlink the output
This avoids bizarre failures if make is run again. Signed-off-by: Andy Lutomirski l...@amacapital.net --- arch/x86/vdso/vdso2c.c | 20 +++- arch/x86/vdso/vdso2c.h | 10 +++--- 2 files changed, 14 insertions(+), 16 deletions(-) diff --git a/arch/x86/vdso/vdso2c.c b/arch/x86/vdso/vdso2c.c index 81edd1e..fe8bfbf 100644 --- a/arch/x86/vdso/vdso2c.c +++ b/arch/x86/vdso/vdso2c.c @@ -14,6 +14,8 @@ #include linux/elf.h #include linux/types.h +const char *outfilename; + /* Symbols that we need in vdso2c. */ enum { sym_vvar_page, @@ -44,6 +46,7 @@ static void fail(const char *format, ...) va_start(ap, format); fprintf(stderr, Error: ); vfprintf(stderr, format, ap); + unlink(outfilename); exit(1); va_end(ap); } @@ -82,17 +85,16 @@ static void fail(const char *format, ...) #undef Elf_Sym #undef Elf_Dyn -static int go(void *addr, size_t len, FILE *outfile, const char *name) +static void go(void *addr, size_t len, FILE *outfile, const char *name) { Elf64_Ehdr *hdr = (Elf64_Ehdr *)addr; if (hdr-e_ident[EI_CLASS] == ELFCLASS64) { - return go64(addr, len, outfile, name); + go64(addr, len, outfile, name); } else if (hdr-e_ident[EI_CLASS] == ELFCLASS32) { - return go32(addr, len, outfile, name); + go32(addr, len, outfile, name); } else { - fprintf(stderr, Error: unknown ELF class\n); - return 1; + fail(unknown ELF class\n); } } @@ -102,7 +104,6 @@ int main(int argc, char **argv) off_t len; void *addr; FILE *outfile; - int ret; char *name, *tmp; int namelen; @@ -143,14 +144,15 @@ int main(int argc, char **argv) if (addr == MAP_FAILED) err(1, mmap); - outfile = fopen(argv[2], w); + outfilename = argv[2]; + outfile = fopen(outfilename, w); if (!outfile) err(1, %s, argv[2]); - ret = go(addr, (size_t)len, outfile, name); + go(addr, (size_t)len, outfile, name); munmap(addr, len); fclose(outfile); - return ret; + return 0; } diff --git a/arch/x86/vdso/vdso2c.h b/arch/x86/vdso/vdso2c.h index 3dcc61e..26a7c1f 100644 --- a/arch/x86/vdso/vdso2c.h +++ b/arch/x86/vdso/vdso2c.h @@ -4,7 +4,7 @@ * are 
built for 32-bit userspace. */ -static int GOFUNC(void *addr, size_t len, FILE *outfile, const char *name) +static void GOFUNC(void *addr, size_t len, FILE *outfile, const char *name) { int found_load = 0; unsigned long load_size = -1; /* Work around bogus warning */ @@ -62,10 +62,8 @@ static int GOFUNC(void *addr, size_t len, FILE *outfile, const char *name) alt_sec = sh; } - if (!symtab_hdr) { + if (!symtab_hdr) fail(no symbol table\n); - return 1; - } strtab_hdr = addr + hdr-e_shoff + hdr-e_shentsize * symtab_hdr-sh_link; @@ -112,7 +110,7 @@ static int GOFUNC(void *addr, size_t len, FILE *outfile, const char *name) if (!name) { fwrite(addr, load_size, 1, outfile); - return 0; + return; } fprintf(outfile, /* AUTOMATICALLY GENERATED -- DO NOT EDIT */\n\n); @@ -152,6 +150,4 @@ static int GOFUNC(void *addr, size_t len, FILE *outfile, const char *name) required_syms[i], syms[i]); } fprintf(outfile, };\n); - - return 0; } -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] x86,vdso: vdso build fixes and improvements
On Thu, May 29, 2014 at 3:41 PM, Paul Gortmaker paul.gortma...@windriver.com wrote: On 14-05-29 05:57 PM, Andy Lutomirski wrote: Patch 1 causes make;make to behave similarly to make if vdso2c fails. Patch 2 hopefully fixes x86 crossbuilds on big-endian machines. I don't have a big-endian machine to test on, though. Since the x86 builds are unconditionally failing as-is now in linux-next, perhaps Stephen [Cc'd] can layer these on the top of the tree he'll be making within the next couple of hours to give them a big-endian test. Stephen, if you do this, could you also send me some of the build outputs: arch/x86/vdso/*.so.dbg arch/x86/vdso/*-image-*.c I'd like to verify that the output is correct. It would be unfortunate if the cross-built kernel had some subtle error. Thanks, Andy
Re: [PATCH 0/2] x86,vdso: vdso build fixes and improvements
On Thu, May 29, 2014 at 10:42 PM, Stephen Rothwell s...@canb.auug.org.au wrote: Hi Andy, On Thu, 29 May 2014 15:49:52 -0700 Andy Lutomirski l...@amacapital.net wrote: On Thu, May 29, 2014 at 3:41 PM, Paul Gortmaker paul.gortma...@windriver.com wrote: On 14-05-29 05:57 PM, Andy Lutomirski wrote: Patch 1 causes make;make to behave similarly to make if vdso2c fails. Patch 2 hopefully fixes x86 crossbuilds on big-endian machines. I don't have a big-endian machine to test on, though. Since the x86 builds are unconditionally failing as-is now in linux-next, perhaps Stephen [Cc'd] can layer these on the top of the tree he'll be making within the next couple of hours to give them a big-endian test. I added them just after the merge of the tip tree today and will keep them there until they (or some other solution) turn up some other way. Stephen, if you do this, could you also send me some of the build outputs: arch/x86/vdso/*.so.dbg arch/x86/vdso/*-image-*.c I'd like to verify that the output is correct. It would be unfortunate if the cross-built kernel had some subtle error. OK, I will send them in a separate email. For the benefit of the archives: the patches are bad. New ones coming. --Andy
[PATCH v2 2/2] x86,vdso: Fix cross-compilation from big-endian architectures
This adds a macro GET(x) to convert x from big-endian to little-endian. Hopefully I put it everywhere it needs to go and got all the cases needed for everyone's linux/elf.h. Signed-off-by: Andy Lutomirski l...@amacapital.net --- arch/x86/vdso/vdso2c.c | 15 arch/x86/vdso/vdso2c.h | 63 -- 2 files changed, 50 insertions(+), 28 deletions(-) diff --git a/arch/x86/vdso/vdso2c.c b/arch/x86/vdso/vdso2c.c index fe8bfbf..de19ced 100644 --- a/arch/x86/vdso/vdso2c.c +++ b/arch/x86/vdso/vdso2c.c @@ -51,6 +51,21 @@ static void fail(const char *format, ...) va_end(ap); } +/* + * Evil macros to do a little-endian read. + */ +#define __GET_TYPE(x, type, bits, ifnot) \ + __builtin_choose_expr( \ + __builtin_types_compatible_p(typeof(x), type), \ + le##bits##toh((x)), ifnot) + +extern void bad_get(uint64_t); + +#define GET(x) \ + __GET_TYPE((x), __u32, 32, __GET_TYPE((x), __u64, 64, \ + __GET_TYPE((x), __s32, 32, __GET_TYPE((x), __s64, 64, \ + __GET_TYPE((x), __u16, 16, bad_get(x)) + #define NSYMS (sizeof(required_syms) / sizeof(required_syms[0])) #define BITS 64 diff --git a/arch/x86/vdso/vdso2c.h b/arch/x86/vdso/vdso2c.h index 26a7c1f..f0475da 100644 --- a/arch/x86/vdso/vdso2c.h +++ b/arch/x86/vdso/vdso2c.h @@ -18,25 +18,27 @@ static void GOFUNC(void *addr, size_t len, FILE *outfile, const char *name) const char *secstrings; uint64_t syms[NSYMS] = {}; - Elf_Phdr *pt = (Elf_Phdr *)(addr + hdr-e_phoff); + Elf_Phdr *pt = (Elf_Phdr *)(addr + GET(hdr-e_phoff)); /* Walk the segment table. 
*/ - for (i = 0; i hdr-e_phnum; i++) { - if (pt[i].p_type == PT_LOAD) { + for (i = 0; i GET(hdr-e_phnum); i++) { + if (GET(pt[i].p_type) == PT_LOAD) { if (found_load) fail(multiple PT_LOAD segs\n); - if (pt[i].p_offset != 0 || pt[i].p_vaddr != 0) + if (GET(pt[i].p_offset) != 0 || + GET(pt[i].p_vaddr) != 0) fail(PT_LOAD in wrong place\n); - if (pt[i].p_memsz != pt[i].p_filesz) + if (GET(pt[i].p_memsz) != GET(pt[i].p_filesz)) fail(cannot handle memsz != filesz\n); - load_size = pt[i].p_memsz; + load_size = GET(pt[i].p_memsz); found_load = 1; - } else if (pt[i].p_type == PT_DYNAMIC) { - dyn = addr + pt[i].p_offset; - dyn_end = addr + pt[i].p_offset + pt[i].p_memsz; + } else if (GET(pt[i].p_type) == PT_DYNAMIC) { + dyn = addr + GET(pt[i].p_offset); + dyn_end = addr + GET(pt[i].p_offset) + + GET(pt[i].p_memsz); } } if (!found_load) @@ -44,43 +46,48 @@ static void GOFUNC(void *addr, size_t len, FILE *outfile, const char *name) data_size = (load_size + 4095) / 4096 * 4096; /* Walk the dynamic table */ - for (i = 0; dyn + i dyn_end dyn[i].d_tag != DT_NULL; i++) { - if (dyn[i].d_tag == DT_REL || dyn[i].d_tag == DT_RELSZ || - dyn[i].d_tag == DT_RELENT || dyn[i].d_tag == DT_TEXTREL) + for (i = 0; dyn + i dyn_end GET(dyn[i].d_tag) != DT_NULL; i++) { + typeof(dyn[i].d_tag) tag = GET(dyn[i].d_tag); + if (tag == DT_REL || tag == DT_RELSZ || + tag == DT_RELENT || tag == DT_TEXTREL) fail(vdso image contains dynamic relocations\n); } /* Walk the section table */ - secstrings_hdr = addr + hdr-e_shoff + hdr-e_shentsize*hdr-e_shstrndx; - secstrings = addr + secstrings_hdr-sh_offset; - for (i = 0; i hdr-e_shnum; i++) { - Elf_Shdr *sh = addr + hdr-e_shoff + hdr-e_shentsize * i; - if (sh-sh_type == SHT_SYMTAB) + secstrings_hdr = addr + GET(hdr-e_shoff) + + GET(hdr-e_shentsize)*GET(hdr-e_shstrndx); + secstrings = addr + GET(secstrings_hdr-sh_offset); + for (i = 0; i GET(hdr-e_shnum); i++) { + Elf_Shdr *sh = addr + GET(hdr-e_shoff) + + GET(hdr-e_shentsize) * i; + if (GET(sh-sh_type) == 
SHT_SYMTAB) symtab_hdr = sh; - if (!strcmp(secstrings + sh-sh_name, .altinstructions)) + if (!strcmp(secstrings + GET(sh-sh_name), .altinstructions)) alt_sec = sh; } if (!symtab_hdr) fail(no symbol table\n); - strtab_hdr = addr + hdr-e_shoff
[PATCH v2 0/2] x86,vdso: vdso build fixes and improvements
Patch 1 causes make;make to behave similarly to make if vdso2c fails. Patch 2 hopefully fixes x86 crossbuilds on big-endian machines. I don't have a big-endian machine to test on, though. Changes from v1: Add two missing endian fixes Andy Lutomirski (2): x86,vdso: When vdso2c fails, unlink the output x86,vdso: Fix cross-compilation from big-endian architectures arch/x86/vdso/vdso2c.c | 35 +--- arch/x86/vdso/vdso2c.h | 73 ++ 2 files changed, 64 insertions(+), 44 deletions(-) -- 1.9.3
[PATCH v2 1/2] x86,vdso: When vdso2c fails, unlink the output
This avoids bizarre failures if make is run again. Signed-off-by: Andy Lutomirski l...@amacapital.net --- arch/x86/vdso/vdso2c.c | 20 +++- arch/x86/vdso/vdso2c.h | 10 +++--- 2 files changed, 14 insertions(+), 16 deletions(-) diff --git a/arch/x86/vdso/vdso2c.c b/arch/x86/vdso/vdso2c.c index 81edd1e..fe8bfbf 100644 --- a/arch/x86/vdso/vdso2c.c +++ b/arch/x86/vdso/vdso2c.c @@ -14,6 +14,8 @@ #include linux/elf.h #include linux/types.h +const char *outfilename; + /* Symbols that we need in vdso2c. */ enum { sym_vvar_page, @@ -44,6 +46,7 @@ static void fail(const char *format, ...) va_start(ap, format); fprintf(stderr, Error: ); vfprintf(stderr, format, ap); + unlink(outfilename); exit(1); va_end(ap); } @@ -82,17 +85,16 @@ static void fail(const char *format, ...) #undef Elf_Sym #undef Elf_Dyn -static int go(void *addr, size_t len, FILE *outfile, const char *name) +static void go(void *addr, size_t len, FILE *outfile, const char *name) { Elf64_Ehdr *hdr = (Elf64_Ehdr *)addr; if (hdr-e_ident[EI_CLASS] == ELFCLASS64) { - return go64(addr, len, outfile, name); + go64(addr, len, outfile, name); } else if (hdr-e_ident[EI_CLASS] == ELFCLASS32) { - return go32(addr, len, outfile, name); + go32(addr, len, outfile, name); } else { - fprintf(stderr, Error: unknown ELF class\n); - return 1; + fail(unknown ELF class\n); } } @@ -102,7 +104,6 @@ int main(int argc, char **argv) off_t len; void *addr; FILE *outfile; - int ret; char *name, *tmp; int namelen; @@ -143,14 +144,15 @@ int main(int argc, char **argv) if (addr == MAP_FAILED) err(1, mmap); - outfile = fopen(argv[2], w); + outfilename = argv[2]; + outfile = fopen(outfilename, w); if (!outfile) err(1, %s, argv[2]); - ret = go(addr, (size_t)len, outfile, name); + go(addr, (size_t)len, outfile, name); munmap(addr, len); fclose(outfile); - return ret; + return 0; } diff --git a/arch/x86/vdso/vdso2c.h b/arch/x86/vdso/vdso2c.h index 3dcc61e..26a7c1f 100644 --- a/arch/x86/vdso/vdso2c.h +++ b/arch/x86/vdso/vdso2c.h @@ -4,7 +4,7 @@ * are 
built for 32-bit userspace. */ -static int GOFUNC(void *addr, size_t len, FILE *outfile, const char *name) +static void GOFUNC(void *addr, size_t len, FILE *outfile, const char *name) { int found_load = 0; unsigned long load_size = -1; /* Work around bogus warning */ @@ -62,10 +62,8 @@ static int GOFUNC(void *addr, size_t len, FILE *outfile, const char *name) alt_sec = sh; } - if (!symtab_hdr) { + if (!symtab_hdr) fail(no symbol table\n); - return 1; - } strtab_hdr = addr + hdr-e_shoff + hdr-e_shentsize * symtab_hdr-sh_link; @@ -112,7 +110,7 @@ static int GOFUNC(void *addr, size_t len, FILE *outfile, const char *name) if (!name) { fwrite(addr, load_size, 1, outfile); - return 0; + return; } fprintf(outfile, /* AUTOMATICALLY GENERATED -- DO NOT EDIT */\n\n); @@ -152,6 +150,4 @@ static int GOFUNC(void *addr, size_t len, FILE *outfile, const char *name) required_syms[i], syms[i]); } fprintf(outfile, };\n); - - return 0; } -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Pondering per-process vsyscall disablement
On Wed, May 28, 2014 at 2:45 PM, H. Peter Anvin h...@zytor.com wrote: On 05/23/2014 09:40 AM, Andy Lutomirski wrote: I don't think this should be something configured by the administrator, unless the administrator is the builder of a kiosky thing like Chromium OS. In that case, the administrator can use vsyscall=none. I think this should be handled by either libc or the toolchain, hence the suggestions of a syscall or an ELF header. We could mimic the NX stack stuff, but it would have a lot of false negatives, simply because very few things would actually poke at the vsyscall page. The NX stuff uses a dummy program header in the ELF image. On the other hand, you could make the argument that anything compiled with a new toolchain simply should not use the vsyscall page, and just unconditionally set the opt-out bit (header) in question. It might be better to have some kind of flags field (which a number of architectures use) than keep using dummy program headers, though. Do the flags go in the ELF loader or in the executable we're running? Or both (and, if both, do we AND them or OR them)? I think the interpreter makes a little more sense in general: for the most part, use of vsyscalls is a property of the runtime environment, not of the program being run. But maybe this is naive. --Andy
Re: [PATCH v2 2/2] x86,vdso: Fix cross-compilation from big-endian architectures
On Fri, May 30, 2014 at 1:02 PM, H. Peter Anvin h...@zytor.com wrote: On 05/30/2014 08:48 AM, Andy Lutomirski wrote: This adds a macro GET(x) to convert x from big-endian to little-endian. Hopefully I put it everywhere it needs to go and got all the cases needed for everyone's linux/elf.h. Signed-off-by: Andy Lutomirski l...@amacapital.net --- arch/x86/vdso/vdso2c.c | 15 arch/x86/vdso/vdso2c.h | 63 -- 2 files changed, 50 insertions(+), 28 deletions(-) A couple of observations: 1. We shouldn't use double-underscore in host C code. 2. It would be nice if we can take these sort of things (host-build helper macros) and move them to some common file in the Linux kernel eventually, so it would be a good thing to make the naming a little less general. 3. Even though it isn't necessary, making it work on 8-bit values so one doesn't have to worry about the type would seem like a good thing. I came up with the following, it seems like a reasonable simplification: #define _LE(x, bits, ifnot) \ __builtin_choose_expr( \ (sizeof(x) == bits/8), \ (__typeof__(x))le##bits##toh(x), ifnot) This will do awful things if x is a floating-point type, and, for integers, the cast is probably unnecessary. But it should be okay. extern void bad_le(uint64_t); If this ever goes in a common header, then we should do the __attribute__((error)) thing. I wonder if it would ever make sense to have __LINUX_HOSTPROG__ and make some of the basic headers work. Hmm. #define _LAST_LE(x) \ __builtin_choose_expr(sizeof(x) == 1, (x), bad_le(x)) #define LE(x) \ _LE(x, 64, _LE(x, 32, _LE(x, 16, _LAST_LE(x)))) What do you think? My only real objection is that _LE sounds like converting *to* little-endian to me. Admittedly, that's the same thing on any remotely sane architecture, but still. --Andy
Re: Pondering per-process vsyscall disablement
On Fri, May 30, 2014 at 1:05 PM, H. Peter Anvin h...@zytor.com wrote: On 05/30/2014 01:00 PM, Andy Lutomirski wrote: Do the flags go in the ELF loader or in the executable we're running? Or both (and, if both, do we AND them or OR them)? I think the interpreter makes a little more sense in general: for the most part, use of vsyscalls is a property of the runtime environment, not of the program being run. But maybe this is naive. They go into each object which becomes part of the running program, i.e. executable, dynamic libraries, and dynamic linker. Well, sure, but the kernel is not about to start reading ELF headers in dynamic libraries. So we need to make a decision based on the interpreter and the executable. The conservative approach is to require both to have the flag set *and* to offer a prctl to twiddle the flags. Then userspace loaders can do whatever they want, and distros get to rebuild the world :) --Andy
Re: [PATCH v2 2/2] x86,vdso: Fix cross-compilation from big-endian architectures
On Fri, May 30, 2014 at 1:21 PM, H. Peter Anvin h...@zytor.com wrote: On 05/30/2014 01:09 PM, Andy Lutomirski wrote: I came up with the following, it seems like a reasonable simplification: #define _LE(x, bits, ifnot) \ __builtin_choose_expr( \ (sizeof(x) == bits/8), \ (__typeof__(x))le##bits##toh(x), ifnot) This will do awful things if x is a floating-point type, and, for integers, the cast is probably unnecessary. But it should be okay. I mostly wanted to preserve the signedness. Yes, if we care about floating-point it gets trickier. At some point hopefully there will be a native C feature to handle this crap. extern void bad_le(uint64_t); If this ever goes in a common header, then we should do the __attribute__((error)) thing. I wonder if it would ever make sense to have __LINUX_HOSTPROG__ and make some of the basic headers work. Hmm. #define _LAST_LE(x) \ __builtin_choose_expr(sizeof(x) == 1, (x), bad_le(x)) #define LE(x) \ _LE(x, 64, _LE(x, 32, _LE(x, 16, _LAST_LE(x)))) What do you think? My only real objection is that _LE sounds like converting *to* little-endian to me. Admittedly, that's the same thing on any remotely sane architecture, but still. GET_LE() then? Sounds good. Are you planning on writing the patch? I think my v2 is good -- the only diff I could find in my image.c files and Stephen's was in the alt_xyz output, and I think I fixed that in v2. --Andy
Re: Pondering per-process vsyscall disablement
On Fri, May 30, 2014 at 1:20 PM, H. Peter Anvin h...@zytor.com wrote: On 05/30/2014 01:11 PM, Andy Lutomirski wrote: On Fri, May 30, 2014 at 1:05 PM, H. Peter Anvin h...@zytor.com wrote: On 05/30/2014 01:00 PM, Andy Lutomirski wrote: Do the flags go in the ELF loader or in the executable we're running? Or both (and, if both, do we AND them or OR them)? I think the interpreter makes a little more sense in general: for the most part, use of vsyscalls is a property of the runtime environment, not of the program being run. But maybe this is naive. They go into each object which becomes part of the running program, i.e. executable, dynamic libraries, and dynamic linker. Well, sure, but the kernel is not about to start reading ELF headers in dynamic libraries. So we need to make a decision based on the interpreter and the executable. The conservative approach is to require both to have the flag set *and* to offer a prctl to twiddle the flags. Then userspace loaders can do whatever they want, and distros get to rebuild the world :) Yes, something like that. I'll hack something up once the merge window closes. Or maybe sooner if you commit my vsyscall patches from a few days ago. Otherwise I'm just going to confuse my git tree too much :) --Andy
[PATCH 1/2] x86,syscall: Add syscall_in_syscall to test whether we're in a syscall
syscall_in_syscall will return true if we're in a real syscall and will return false if we're not in a syscall. If we're in a bad syscall, the return value can vary. The idea is to use this to come up with a much simpler replacement for syscall auditing.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 arch/x86/Kconfig               |  1 +
 arch/x86/include/asm/syscall.h | 21 +
 init/Kconfig                   |  3 +++
 3 files changed, 25 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 25d2c6f..e2602d4 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -130,6 +130,7 @@ config X86
 	select HAVE_CC_STACKPROTECTOR
 	select GENERIC_CPU_AUTOPROBE
 	select HAVE_ARCH_AUDITSYSCALL
+	select HAVE_SYSCALL_IN_SYSCALL
 
 config INSTRUCTION_DECODER
 	def_bool y
diff --git a/arch/x86/include/asm/syscall.h b/arch/x86/include/asm/syscall.h
index d6a756a..91e38b3 100644
--- a/arch/x86/include/asm/syscall.h
+++ b/arch/x86/include/asm/syscall.h
@@ -23,6 +23,27 @@ typedef void (*sys_call_ptr_t)(void);
 
 extern const sys_call_ptr_t sys_call_table[];
 
+/**
+ * syscall_in_syscall() - are we in a syscall context?
+ * @task: The task to query.
+ * @regs: The task's pt_regs.
+ *
+ * This checks whether we are in a syscall. If it returns true, then
+ * syscall_get_nr(), etc are usable and the current task is guaranteed
+ * to either die or to go through the syscall exit path when the syscall
+ * is done.
+ *
+ * If it returns false, no particular guarantees are made. In
+ * particular, a malicious task can issue a syscall that causes
+ * syscall_in_syscall to return false. Such a syscall won't do much,
+ * but it can still cause tracing code and such to run.
+ */
+static inline bool syscall_in_syscall(struct task_struct *task,
+				      struct pt_regs *regs)
+{
+	return regs->orig_ax != -1;
+}
+
 /*
  * Only the low 32 bits of orig_ax are meaningful, so we return int.
  * This importantly ignores the high bits on 64-bit, so comparisons
diff --git a/init/Kconfig b/init/Kconfig
index 9d3585b..bad2053 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -295,6 +295,9 @@ config AUDIT
 config HAVE_ARCH_AUDITSYSCALL
 	bool
 
+config HAVE_SYSCALL_IN_SYSCALL
+	bool
+
 config AUDITSYSCALL
 	bool "Enable system-call auditing support"
 	depends on AUDIT && HAVE_ARCH_AUDITSYSCALL
-- 
1.9.3
[PATCH 0/2] Syscall auditing lite
I've made no secret of the fact that I dislike syscall auditing. As far as I can tell, the main technical (i.e. not compliance-related) use of syscall auditing is to supply some useful context information to go along with events like AVC denials. CONFIG_AUDITSYSCALL is serious overkill to do this. kernel/auditsc.c is ~2500 lines of terror. This patchset accomplishes the same goal, more usefully, with no overhead at all, in under 70 lines of code. It tries to coexist cleanly with CONFIG_AUDITSYSCALL. This is only implemented for x86. Other architectures can add support fairly easily, I think. Andy Lutomirski (2): x86,syscall: Add syscall_in_syscall to test whether we're in a syscall audit: Syscall auditing lite arch/x86/Kconfig | 1 + arch/x86/include/asm/syscall.h | 21 init/Kconfig | 3 +++ kernel/audit.c | 44 +- 4 files changed, 68 insertions(+), 1 deletion(-) -- 1.9.3
[PATCH 2/2] audit: Syscall auditing lite
AFAICS the main use of syscall auditing is to get syscall information for syscalls that are already causing another audit message. We don't need any of the fancy syscall auditing machinery for that, though: we can just log this information directly. This should have essentially no overhead and it could end up being much easier to use than auditsc.

This produces messages like this:

audit: type=1123 audit(1401485315.370:2): pid=125 uid=0 auid=4294967295 ses=4294967295 subj=kernel msg='blah blah blah' arch=c000003e syscall=44 a0=3 a1=7fff383feb60 a2=5c a3=0 a4=7fff383feb50 a5=c

The new fields (arch, syscall, and a0..a5) will only be logged if we are in a syscall but we aren't otherwise building an auditsc context. This is only supported on x86 for now. Other architectures can get this if they implement syscall_in_syscall.

Signed-off-by: Andy Lutomirski l...@amacapital.net
---
 kernel/audit.c | 44 +++-
 1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/kernel/audit.c b/kernel/audit.c
index 47845c5..8509d00 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -67,6 +67,10 @@
 #include <linux/pid_namespace.h>
 #include <net/netns/generic.h>
 
+#ifdef CONFIG_HAVE_SYSCALL_IN_SYSCALL
+#include <asm/syscall.h>
+#endif
+
 #include "audit.h"
 
 /* No auditing will take place until audit_initialized == AUDIT_INITIALIZED.
@@ -1897,6 +1901,40 @@ out:
 	kfree(name);
 }
 
+#ifdef CONFIG_HAVE_SYSCALL_IN_SYSCALL
+/**
+ * audit_log_missing_context - append otherwise-missing context
+ * @ab: the audit_buffer
+ *
+ * If syscall auditing is unavailable, try to log syscall context
+ * information anyway.
+ */
+static void audit_log_missing_context(struct audit_buffer *ab)
+{
+	struct task_struct *tsk = current;
+	struct pt_regs *regs = current_pt_regs();
+	unsigned long args[6];
+
+	if (!syscall_in_syscall(tsk, regs))
+		return;
+
+	if (ab->ctx && ab->ctx->in_syscall)
+		return; /* Let audit_log_exit log the context. */
+
+	syscall_get_arguments(tsk, regs, 0, 6, args);
+
+	audit_log_format(ab, "arch=%x syscall=%d a0=%lx a1=%lx a2=%lx a3=%lx a4=%lx a5=%lx",
+			 (unsigned int)syscall_get_arch(),
+			 syscall_get_nr(tsk, regs),
+			 args[0], args[1], args[2], args[3], args[4], args[5]);
+}
+#else
+static void audit_log_missing_context(struct audit_buffer *ab)
+{
+	/* We need arch support to do this reliably, so don't even try. */
+}
+#endif
+
 /**
  * audit_log_end - end one audit record
  * @ab: the audit_buffer
@@ -1913,7 +1951,11 @@ void audit_log_end(struct audit_buffer *ab)
 	if (!audit_rate_check()) {
 		audit_log_lost("rate limit exceeded");
 	} else {
-		struct nlmsghdr *nlh = nlmsg_hdr(ab->skb);
+		struct nlmsghdr *nlh;
+
+		audit_log_missing_context(ab);
+
+		nlh = nlmsg_hdr(ab->skb);
 
 		nlh->nlmsg_len = ab->skb->len - NLMSG_HDRLEN;
 
 		if (audit_pid) {
-- 
1.9.3
Re: [lxc-devel] [RFC PATCH 00/11] Add support for devtmpfs in user namespaces
On Fri, May 23, 2014 at 6:16 AM, James Bottomley james.bottom...@hansenpartnership.com wrote: On Fri, 2014-05-23 at 11:20 +0300, Marian Marinov wrote: On 05/20/2014 05:19 PM, Serge Hallyn wrote: Quoting Andy Lutomirski (l...@amacapital.net): On May 15, 2014 1:26 PM, Serge E. Hallyn se...@hallyn.com wrote: Quoting Richard Weinberger (rich...@nod.at): On 15.05.2014 21:50, Serge Hallyn wrote: Quoting Richard Weinberger (richard.weinber...@gmail.com): On Thu, May 15, 2014 at 4:08 PM, Greg Kroah-Hartman gre...@linuxfoundation.org wrote: Then don't use a container to build such a thing, or fix the build scripts to not do that :) I second this. To me it looks like some folks try to (ab)use Linux containers for purposes where KVM would fit in much better. Please don't put more complexity into containers. They are already horribly complex and error-prone. I, naturally, disagree :) The only use case which is inherently not valid for containers is running a kernel. Practically speaking there are other things which likely will never be possible, but if someone offers a way to do something in containers, "you can't do that in containers" is not an apropos response. "That abstraction is wrong" is certainly valid, as when vpids were originally proposed and rejected, resulting in the development of pid namespaces. "We have to work out (x) first" can be valid (and I can think of examples here), assuming it's not just trying to hide behind a catch-22/chicken-egg problem. Finally, saying containers are complex and error-prone is conflating several large suites of userspace code and many kernel features which support them. Being more precise would, if the argument is valid, lend it a lot more weight. We (my company) have used Linux containers in production since 2011. First LXC, now libvirt-lxc. To understand the internals better I also wrote my own userspace to create/start containers. There are so many things which can hurt you badly.
With user namespaces we expose a really big attack surface to regular users. I.e. suddenly a user is allowed to mount filesystems. That is currently not the case. They can mount some virtual filesystems and do bind mounts, but cannot mount most real filesystems. This keeps us protected (for now) from potentially unsafe superblock readers in the kernel. Ask Andy, he has already found lots of nasty things... I don't think I have anything brilliant to add to this discussion right now, except possibly: ISTM that Linux distributions are, in general, vulnerable to all kinds of shenanigans that would happen if an untrusted user can cause a block device to appear. That user doesn't need permission to mount it. Interesting point. This would further suggest that we absolutely must ensure that a loop device which shows up in the container does not also show up in the host. Can I suggest the usage of the devices cgroup to achieve that? Not really ... cgroups impose resource limits; it's namespaces that impose visibility separations. In theory this can be done with the device namespace that's been proposed; however, a simpler way is simply to rm the device node in the host and mknod it in the guest. I don't really see host visibility as a huge problem: in shared-OS virtualisation it's not really possible to securely separate the guest from the host (only vice versa). But I really don't think we want to do it this way. Giving a container the ability to do a mount is too dangerous. What we want to do is intercept the mount in the host and perform it on behalf of the guest, as host root, in the guest's mount namespace. If you do it that way, it doesn't really matter what device actually shows up in the guest, as long as the host knows what to do when the mount request comes along. This is only useful/safe if the host understands what's going on. By the host, I mean the host's udev and other system-level stuff.
This is probably fine for disks and such, but it might not be so great for loop devices, FUSE, etc. I already know of one user of containers that wants container-local FUSE mounts. This ought to Just Work (tm), but there's a fair amount of work needed to get there. --Andy
Re: Pondering per-process vsyscall disablement
On Thu, May 22, 2014 at 7:44 PM, Marian Marinov m...@1h.com wrote: On 05/23/2014 02:04 AM, Andy Lutomirski wrote: It would be nice to have a way for new programs to declare that they don't need vsyscalls. What's the right way to do this? An ELF header entry in the loader? An ELF header entry in the program? A new arch_prctl? As background, there's an old part of the x86_64 ABI that allows programs to do gettimeofday, time, and getcpu by calling fixed addresses of the form 0xffffffffff600n00, where n indicates which of those three syscalls is being invoked. This is a security issue. Since Linux 3.1, vsyscalls are emulated using NX and page faults. As a result, vsyscalls no longer offer any performance advantage over normal syscalls; in fact, they're much slower. As far as I know, nothing newer than 2012 will attempt to use vsyscalls if a vdso is present. (Sadly, a lot of things will still fall back to the vsyscall page if there is no vdso, but that shouldn't matter, since there is always a vdso.) Despite the emulation, they could still be used as a weird form of ROP gadget that lives at a fixed address. I'd like to offer a way for new runtimes to indicate that they don't use vsyscalls so that the kernel can selectively disable emulation and remove the fixed-address executable code issue. Wouldn't it be more useful if the check were against a bitmask added as an extended attribute for that executable? This way the administrators will have the flexibility to simply add the new attribute to the executable. I don't think this should be something configured by the administrator, unless the administrator is the builder of a kiosky thing like Chromium OS. In that case, the administrator can use vsyscall=none. I think this should be handled by either libc or the toolchain, hence the suggestions of a syscall or an ELF header.
--Andy
Re: [PATCH] x86, MCE: Flesh out when to panic comment
On Mon, May 26, 2014 at 4:06 AM, Borislav Petkov b...@alien8.de wrote: On Mon, May 26, 2014 at 12:51:10PM +0200, Jiri Kosina wrote: I think the comment is still not explaining the big part of what the discussion was about -- i.e. if it was in kernel context, we always panic. I thought the pointer to mce_severity was enough? People should open an editor and look at the function and at its gory insanity. :-P It may be worth at least pointing out that mce_severity looks at whether we faulted from kernel context. I missed that the first time around because mce_severity doesn't take a pt_regs pointer. --Andy
Re: [PATCH 1/9] procfs: use flags to deny or allow access to /proc/pid/$entry
On Mon, May 26, 2014 at 6:27 AM, Djalal Harouni tix...@opendz.org wrote: Add the deny or allow flags, so we can perform proper permission checks and set the result accordingly. These flags are needed in case we have to cache the result of permission checks that are done during ->open() time. Later, during ->read(), we can decide to allow or deny the read(). The pid entries that need these flags are: /proc/pid/stat, /proc/pid/wchan, and /proc/pid/maps (will be handled in next patches). These files are world-readable; userspace depends on that. To prevent ASLR leaks and to avoid breaking userspace, we follow this scheme: a) perform permission checks during ->open(); b) cache the result of a) and return success; c) recheck the cached result during ->read().

Why is (c) needed?

+/*
+ * Flags used to deny or allow current to access /proc/pid/$entry
+ * after proper permission checks.
+ */
+enum {
+	PID_ENTRY_DENY	= 0,	/* Deny access */
+	PID_ENTRY_ALLOW	= 1,	/* Allow access */
+};

I think this would be less alarming if this were:

#define PID_ENTRY_DENY ((void *)1UL)
#define PID_ENTRY_ALLOW ((void *)2UL)

Also, I don't like DENY and ALLOW. It's not denying and allowing. How about PID_ENTRY_OPENER_MAY_PTRACE and PID_ENTRY_OPENER_MAY_NOT_PTRACE?

--Andy
Re: [PATCH 2/9] procfs: add pid_entry_access() for proper checks on /proc/pid/*
On Mon, May 26, 2014 at 6:27 AM, Djalal Harouni tix...@opendz.org wrote: Add the helper pid_entry_access() to unify the permission checks during ->open(). This is a preparation patch.

Signed-off-by: Djalal Harouni tix...@opendz.org
---
 fs/proc/generic.c  | 22 ++
 fs/proc/internal.h |  2 ++
 2 files changed, 24 insertions(+)

diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index b7f268e..98ed927 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -23,6 +23,7 @@
 #include <linux/bitops.h>
 #include <linux/spinlock.h>
 #include <linux/completion.h>
+#include <linux/ptrace.h>
 #include <asm/uaccess.h>
 #include "internal.h"
@@ -596,3 +597,24 @@ void *PDE_DATA(const struct inode *inode)
 	return __PDE_DATA(inode);
 }
 EXPORT_SYMBOL(PDE_DATA);
+
+int pid_entry_access(struct file *filp, unsigned int mode)

pid_entry_may_ptrace, perhaps?

--Andy