Re: [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection
On Thu, Jul 16, 2020 at 09:48:50PM -0700, Andy Lutomirski wrote: > On Thu, Jul 16, 2020 at 7:15 PM Gabriel Krisman Bertazi > wrote: > > > > Andy Lutomirski writes: > > > > > On Thu, Jul 16, 2020 at 12:31 PM Gabriel Krisman Bertazi > > > wrote: > > >> > > > > > > This is quite nice. I have a few comments, though: > > > > > > You mentioned rt_sigreturn(). Should this automatically exempt the > > > kernel-provided signal restorer on architectures (e.g. x86_32) that > > > provide one? > > > > That seems reasonable. Not sure how easy it is to do it, though. > > For better or for worse, it's currently straightforward because the code is: > > __kernel_sigreturn: > .LSTART_sigreturn: > popl %eax /* XXX does this mean it needs unwind info? */ > movl $__NR_sigreturn, %eax > SYSCALL_ENTER_KERNEL > > and SYSCALL_ENTER_KERNEL is hardwired as int $0x80. (The latter is > probably my fault, for better or for worse.) So this would change to: > > __vdso32_sigreturn_syscall: > SYSCALL_ENTER_KERNEL > > and vdso2c would wire up __vdso32_sigreturn_syscall. Then there would > be something like: > > bool arch_syscall_is_vdso_sigreturn(struct pt_regs *regs); > > and that would be that. Does anyone have an opinion as to whether > this is a good idea? Modern glibc shouldn't be using this mechanism, > I think, but I won't swear to it. On arm64 sigreturn is always through the vdso, so IIUC we'd certainly need something like this. Otherwise it'd be the user's responsibility to register the vdso sigtramp range when making the prctl, and flip the selector in each signal handler, which sounds both painful and fragile. Mark.
Re: [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection
Hi Thomas, Thanks for the valuable feedback! Thomas Gleixner writes: > Gabriel Krisman Bertazi writes: >> Introduce a mechanism to quickly disable/enable syscall handling for a >> specific process and redirect to userspace via SIGSYS. This is useful >> for processes with parts that require syscall redirection and parts that >> don't, but who need to perform this boundary crossing really fast, >> without paying the cost of a system call to reconfigure syscall handling >> on each boundary transition. This is particularly important for Windows >> games running over Wine. >> >> The proposed interface looks like this: >> >> prctl(PR_SET_SYSCALL_USER_DISPATCH, , , , >> [selector]) >> >> The range [,] is a part of the process memory map >> that is allowed to by-pass the redirection code and dispatch syscalls >> directly, such that in fast paths a process doesn't need to disable the >> trap nor the kernel has to check the selector. This is essential to >> return from SIGSYS to a blocked area without triggering another SIGSYS >> from rt_sigreturn. > > Why isn't rt_sigreturn() exempt from that redirection in the first > place? This was actually a design decision for me. The main use case I'm considering is emulation of applications written for other OSs (games over wine), which means this dispatcher code is exposed to applications built against different ABIs, who trigger syscalls with bogus parameters (from a linux perspective) In this emulation scenario, I cannot really trust the syscall number means rt_sigreturn, so I try to only base the dispatcher decision on the memory region and selector variable. I think the best we can do is what Andy said: to exempt rt_sigreturn when it comes from the vdso, for architectures that do it that way. > >> --- >> arch/Kconfig | 20 ++ >> arch/x86/Kconfig | 1 + >> arch/x86/entry/common.c | 5 ++ >> arch/x86/include/asm/thread_info.h| 4 +- >> arch/x86/kernel/signal_compat.c | 2 +- >> fs/exec.c | 2 + >> include/linux/sched.h | 3 + >> include/linux/syscall_user_dispatch.h | 50 +++ >> include/uapi/asm-generic/siginfo.h| 3 +- >> include/uapi/linux/prctl.h| 5 ++ >> kernel/Makefile | 1 + >> kernel/fork.c | 1 + >> kernel/sys.c | 5 ++ >> kernel/syscall_user_dispatch.c| 92 +++ > > A big combo patch is not how we do that. Please split it up into the > core part and a patch enabling it for a particular architexture. > > As I said in my reply to Andy, this wants to go on top of the generic > entry/exit work stuff: > > https://lore.kernel.org/r/20200716182208.180916...@linutronix.de > > and then syscall_user_dispatch.c ends up in kernel/entry/ and the > dispatching function is not exposed outside of that directory. > > I'm going to post a new version later today. Will cc you. Thanks. Will do! -- Gabriel Krisman Bertazi
Re: [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection
Gabriel, Gabriel Krisman Bertazi writes: > Introduce a mechanism to quickly disable/enable syscall handling for a > specific process and redirect to userspace via SIGSYS. This is useful > for processes with parts that require syscall redirection and parts that > don't, but who need to perform this boundary crossing really fast, > without paying the cost of a system call to reconfigure syscall handling > on each boundary transition. This is particularly important for Windows > games running over Wine. > > The proposed interface looks like this: > > prctl(PR_SET_SYSCALL_USER_DISPATCH, , , , > [selector]) > > The range [,] is a part of the process memory map > that is allowed to by-pass the redirection code and dispatch syscalls > directly, such that in fast paths a process doesn't need to disable the > trap nor the kernel has to check the selector. This is essential to > return from SIGSYS to a blocked area without triggering another SIGSYS > from rt_sigreturn. Why isn't rt_sigreturn() exempt from that redirection in the first place? > --- > arch/Kconfig | 20 ++ > arch/x86/Kconfig | 1 + > arch/x86/entry/common.c | 5 ++ > arch/x86/include/asm/thread_info.h| 4 +- > arch/x86/kernel/signal_compat.c | 2 +- > fs/exec.c | 2 + > include/linux/sched.h | 3 + > include/linux/syscall_user_dispatch.h | 50 +++ > include/uapi/asm-generic/siginfo.h| 3 +- > include/uapi/linux/prctl.h| 5 ++ > kernel/Makefile | 1 + > kernel/fork.c | 1 + > kernel/sys.c | 5 ++ > kernel/syscall_user_dispatch.c| 92 +++ A big combo patch is not how we do that. Please split it up into the core part and a patch enabling it for a particular architexture. As I said in my reply to Andy, this wants to go on top of the generic entry/exit work stuff: https://lore.kernel.org/r/20200716182208.180916...@linutronix.de and then syscall_user_dispatch.c ends up in kernel/entry/ and the dispatching function is not exposed outside of that directory. I'm going to post a new version later today. Will cc you. > --- a/arch/x86/include/asm/thread_info.h > +++ b/arch/x86/include/asm/thread_info.h > @@ -93,6 +93,7 @@ struct thread_info { > #define TIF_NOTSC16 /* TSC is not accessible in userland */ > #define TIF_IA32 17 /* IA32 compatibility process */ > #define TIF_SLD 18 /* Restore split lock detection > on context switch */ > +#define TIF_SYSCALL_USER_DISPATCH 19 /* Redirect syscall for userspace > handling */ There are two other things out there which compete about the last TIF bits on x86, so we need to clean that up first. > +static void trigger_sigsys(struct pt_regs *regs) > +{ > + struct kernel_siginfo info; > + > + clear_siginfo(); > + info.si_signo = SIGSYS; > + info.si_code = SYS_USER_DISPATCH; > + info.si_call_addr = (void __user *)KSTK_EIP(current); > + info.si_errno = 0; > + info.si_arch = syscall_get_arch(current); > + info.si_syscall = syscall_get_nr(current, regs); > + > + force_sig_info(); > +} > + > +int do_syscall_user_dispatch(struct pt_regs *regs) > +{ > + struct syscall_user_dispatch *sd = >syscall_dispatch; > + unsigned long ip = instruction_pointer(regs); > + char state; > + > + if (likely(ip >= sd->dispatcher_start && ip <= sd->dispatcher_end)) > + return 0; > + > + if (likely(sd->selector)) { > + if (unlikely(__get_user(state, sd->selector))) __get_user() mandates an explicit access_ok() which happened in the prctl(). So this wants a comment why there is none right here. > + do_exit(SIGSEGV); > + > + if (likely(state == 0)) > + return 0; > + > + if (state != 1) > + do_exit(SIGSEGV); If that happens its going to be quite interesting to debug. Also please use proper defines which are exposed to user space instead of 0/1. > + } > + > + syscall_rollback(current, regs); > + trigger_sigsys(regs); > + > + return 1; > +} > + > +int set_syscall_user_dispatch(int mode, unsigned long dispatcher_start, > + unsigned long dispatcher_end, char __user > *selector) > +{ > + switch (mode) { > + case PR_SYS_DISPATCH_OFF: > + if (dispatcher_start || dispatcher_end || selector) > + return -EINVAL; > + break; > + case PR_SYS_DISPATCH_ON: > + /* > + * Validate the direct dispatcher region just for basic > + * sanity. If the user is able to submit a syscall from > + * an address, that address is obviously valid. > + */ > + if (dispatcher_end < dispatcher_start) > +
Re: [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection
On Mon, Jul 20, 2020 at 11:23:13AM +0200, Thomas Gleixner wrote: > Andy Lutomirski writes: > > On Thu, Jul 16, 2020 at 12:31 PM Gabriel Krisman Bertazi > > wrote: > > The amount of syscall entry wiring that arches need to do is IMO > > already a bit out of hand. Should we instead rename TIF_SECCOMP to > > TIF_SYSCALL_INTERCEPTION and have one generic callback that handles > > seccomp and this new thing? > > The right way to go is to consolidate all the stupidly different > entry/exit work handling implementations and have exactly one in generic > code, i.e. what I posted a few days ago. > > Then we can make new features only available in the generic version by > hiding the new functionality in the core code and not exposing the > functions to architecture implementations. > > Making it easy for architectures to keep their own variant forever just > proliferates the mess we have right now. Couldn't agree more. We recently added PTRACE_SYSEMU to arm64 and I deeply regret doing that now that yet another way to rewrite the syscall number has come along. I only just untangled some of the mess in our entry code for that, so I can't say I'm looking forward to opening it right back up to support this new feature. Much better to do it in the core code instead. Will
Re: [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection
Andy Lutomirski writes: > On Thu, Jul 16, 2020 at 12:31 PM Gabriel Krisman Bertazi > wrote: > The amount of syscall entry wiring that arches need to do is IMO > already a bit out of hand. Should we instead rename TIF_SECCOMP to > TIF_SYSCALL_INTERCEPTION and have one generic callback that handles > seccomp and this new thing? The right way to go is to consolidate all the stupidly different entry/exit work handling implementations and have exactly one in generic code, i.e. what I posted a few days ago. Then we can make new features only available in the generic version by hiding the new functionality in the core code and not exposing the functions to architecture implementations. Making it easy for architectures to keep their own variant forever just proliferates the mess we have right now. Thanks, tglx
Re: [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection
On Thu, Jul 16, 2020 at 7:15 PM Gabriel Krisman Bertazi wrote: > > Andy Lutomirski writes: > > > On Thu, Jul 16, 2020 at 12:31 PM Gabriel Krisman Bertazi > > wrote: > >> > > > > This is quite nice. I have a few comments, though: > > > > You mentioned rt_sigreturn(). Should this automatically exempt the > > kernel-provided signal restorer on architectures (e.g. x86_32) that > > provide one? > > That seems reasonable. Not sure how easy it is to do it, though. For better or for worse, it's currently straightforward because the code is: __kernel_sigreturn: .LSTART_sigreturn: popl %eax /* XXX does this mean it needs unwind info? */ movl $__NR_sigreturn, %eax SYSCALL_ENTER_KERNEL and SYSCALL_ENTER_KERNEL is hardwired as int $0x80. (The latter is probably my fault, for better or for worse.) So this would change to: __vdso32_sigreturn_syscall: SYSCALL_ENTER_KERNEL and vdso2c would wire up __vdso32_sigreturn_syscall. Then there would be something like: bool arch_syscall_is_vdso_sigreturn(struct pt_regs *regs); and that would be that. Does anyone have an opinion as to whether this is a good idea? Modern glibc shouldn't be using this mechanism, I think, but I won't swear to it. > > > The amount of syscall entry wiring that arches need to do is IMO > > already a bit out of hand. Should we instead rename TIF_SECCOMP to > > TIF_SYSCALL_INTERCEPTION and have one generic callback that handles > > seccomp and this new thing? > > Considering the previous suggestion from Kees to hide it inside the > tracehook and Thomas rework of this path, I'm not sure what is the best > solution here, but some rework of these flags is due. Thomas suggested > expanding these flags to 64 bits and having some arch specific and > arch-agnostic flags. With the storage expansion and arch-agnostic flags, > would this still be desirable? I think it would be desirable to consolidate this to avoid having multiple arches need to separately wire up all of these mechanisms. I'm not sure that the initial upstream implementation needs this, but it might be nice to support this out of the box on all arches with seccomp support. > > >> +int do_syscall_user_dispatch(struct pt_regs *regs) > >> +{ > >> + struct syscall_user_dispatch *sd = >syscall_dispatch; > >> + unsigned long ip = instruction_pointer(regs); > >> + char state; > >> + > >> + if (likely(ip >= sd->dispatcher_start && ip <= sd->dispatcher_end)) > >> + return 0; > >> + > >> + if (likely(sd->selector)) { > >> + if (unlikely(__get_user(state, sd->selector))) > >> + do_exit(SIGSEGV); > >> + > >> + if (likely(state == 0)) > >> + return 0; > >> + > >> + if (state != 1) > >> + do_exit(SIGSEGV); > > > > This seems a bit extreme and hard to debug if it ever happens. > > Makes sense, but I don't see a better way to return the error here. > Maybe a SIGSYS with a different si_errno? Alternatively, we could > revert to the previous behavior of allowing syscalls on state != 0, that > existed in v1. What do you think? > I don't have a strong opinion. SIGSYS with different si_errno is probably reasonable. --Andy
Re: [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection
Andy Lutomirski writes: > On Thu, Jul 16, 2020 at 12:31 PM Gabriel Krisman Bertazi > wrote: >> > > This is quite nice. I have a few comments, though: > > You mentioned rt_sigreturn(). Should this automatically exempt the > kernel-provided signal restorer on architectures (e.g. x86_32) that > provide one? That seems reasonable. Not sure how easy it is to do it, though. > The amount of syscall entry wiring that arches need to do is IMO > already a bit out of hand. Should we instead rename TIF_SECCOMP to > TIF_SYSCALL_INTERCEPTION and have one generic callback that handles > seccomp and this new thing? Considering the previous suggestion from Kees to hide it inside the tracehook and Thomas rework of this path, I'm not sure what is the best solution here, but some rework of these flags is due. Thomas suggested expanding these flags to 64 bits and having some arch specific and arch-agnostic flags. With the storage expansion and arch-agnostic flags, would this still be desirable? >> +int do_syscall_user_dispatch(struct pt_regs *regs) >> +{ >> + struct syscall_user_dispatch *sd = >syscall_dispatch; >> + unsigned long ip = instruction_pointer(regs); >> + char state; >> + >> + if (likely(ip >= sd->dispatcher_start && ip <= sd->dispatcher_end)) >> + return 0; >> + >> + if (likely(sd->selector)) { >> + if (unlikely(__get_user(state, sd->selector))) >> + do_exit(SIGSEGV); >> + >> + if (likely(state == 0)) >> + return 0; >> + >> + if (state != 1) >> + do_exit(SIGSEGV); > > This seems a bit extreme and hard to debug if it ever happens. Makes sense, but I don't see a better way to return the error here. Maybe a SIGSYS with a different si_errno? Alternatively, we could revert to the previous behavior of allowing syscalls on state != 0, that existed in v1. What do you think? -- Gabriel Krisman Bertazi
Re: [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection
On Thu, Jul 16, 2020 at 12:31 PM Gabriel Krisman Bertazi wrote: > This is quite nice. I have a few comments, though: You mentioned rt_sigreturn(). Should this automatically exempt the kernel-provided signal restorer on architectures (e.g. x86_32) that provide one? The amount of syscall entry wiring that arches need to do is IMO already a bit out of hand. Should we instead rename TIF_SECCOMP to TIF_SYSCALL_INTERCEPTION and have one generic callback that handles seccomp and this new thing? > +int do_syscall_user_dispatch(struct pt_regs *regs) > +{ > + struct syscall_user_dispatch *sd = >syscall_dispatch; > + unsigned long ip = instruction_pointer(regs); > + char state; > + > + if (likely(ip >= sd->dispatcher_start && ip <= sd->dispatcher_end)) > + return 0; > + > + if (likely(sd->selector)) { > + if (unlikely(__get_user(state, sd->selector))) > + do_exit(SIGSEGV); > + > + if (likely(state == 0)) > + return 0; > + > + if (state != 1) > + do_exit(SIGSEGV); This seems a bit extreme and hard to debug if it ever happens.
Re: [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection
On Thu, Jul 16, 2020 at 10:06:01PM +0100, Matthew Wilcox wrote: > On Thu, Jul 16, 2020 at 03:31:40PM -0400, Gabriel Krisman Bertazi wrote: > > selector is an optional pointer to a char-sized userspace memory region > > that has a key switch for the mechanism. This key switch is set to > > either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the > > redirection without calling the kernel. > > > > The feature is meant to be set per-thread and it is disabled on > > fork/clone/execv. > > Disabled on exec. Disabled in the child on clone/fork (and vfork, I > think). > > That means we don't need to worry about it interacting badly with > a setuid program, right? Right, that's the intention. -- Kees Cook
Re: [PATCH v4 1/2] kernel: Implement selective syscall userspace redirection
On Thu, Jul 16, 2020 at 03:31:40PM -0400, Gabriel Krisman Bertazi wrote: > selector is an optional pointer to a char-sized userspace memory region > that has a key switch for the mechanism. This key switch is set to > either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the > redirection without calling the kernel. > > The feature is meant to be set per-thread and it is disabled on > fork/clone/execv. Disabled on exec. Disabled in the child on clone/fork (and vfork, I think). That means we don't need to worry about it interacting badly with a setuid program, right?
[PATCH v4 1/2] kernel: Implement selective syscall userspace redirection
Introduce a mechanism to quickly disable/enable syscall handling for a specific process and redirect to userspace via SIGSYS. This is useful for processes with parts that require syscall redirection and parts that don't, but who need to perform this boundary crossing really fast, without paying the cost of a system call to reconfigure syscall handling on each boundary transition. This is particularly important for Windows games running over Wine. The proposed interface looks like this: prctl(PR_SET_SYSCALL_USER_DISPATCH, , , , [selector]) The range [,] is a part of the process memory map that is allowed to by-pass the redirection code and dispatch syscalls directly, such that in fast paths a process doesn't need to disable the trap nor the kernel has to check the selector. This is essential to return from SIGSYS to a blocked area without triggering another SIGSYS from rt_sigreturn. selector is an optional pointer to a char-sized userspace memory region that has a key switch for the mechanism. This key switch is set to either PR_SYS_DISPATCH_ON, PR_SYS_DISPATCH_OFF to enable and disable the redirection without calling the kernel. The feature is meant to be set per-thread and it is disabled on fork/clone/execv. Internally, this doesn't add overhead to the syscall hot path, and it requires very little per-architecture support. I avoided using seccomp, even though it duplicates some functionality, due to previous feedback that maybe it shouldn't mix with seccomp since it is not a security mechanism. And obviously, this should never be considered a security mechanism, since any part of the program can by-pass it by using the syscall dispatcher. For the sysinfo benchmark, which measures the overhead added to executing a native syscall that doesn't require interception, the overhead using only the direct dispatcher region to issue syscalls is pretty much irrelevant. The overhead of using the selector goes around 40ns for a native (unredirected) syscall in my system, and it is (as expected) dominated by the supervisor-mode user-address access. In fact, with SMAP off, the overhead is consistently less than 5ns on my test box. Right now, it is only supported by x86_64 and x86, but it should be easily enabled for other architectures. An example code using this interface can be found at: https://gitlab.collabora.com/krisman/syscall-disable-personality Changes since v2: (Matthew Wilcox suggestions) - Drop __user on non-ptr type. - Move #define closer to similar defs - Allow a memory region that can dispatch directly (Kees Cook suggestions) - Improve kconfig summary line - Move flag cleanup on execve to begin_new_exec - Hint branch predictor in the syscall path (Me) - Convert selector to char Changes since RFC: (Kees Cook suggestions) - Don't mention personality while explaining the feature - Use syscall_get_nr - Remove header guard on several places - Convert WARN_ON to WARN_ON_ONCE - Explicit check for state values - Rename to syscall user dispatcher Cc: Matthew Wilcox Cc: Andy Lutomirski Cc: Paul Gofman Cc: Kees Cook Reviewed-by: Kees Cook Signed-off-by: Gabriel Krisman Bertazi --- arch/Kconfig | 20 ++ arch/x86/Kconfig | 1 + arch/x86/entry/common.c | 5 ++ arch/x86/include/asm/thread_info.h| 4 +- arch/x86/kernel/signal_compat.c | 2 +- fs/exec.c | 2 + include/linux/sched.h | 3 + include/linux/syscall_user_dispatch.h | 50 +++ include/uapi/asm-generic/siginfo.h| 3 +- include/uapi/linux/prctl.h| 5 ++ kernel/Makefile | 1 + kernel/fork.c | 1 + kernel/sys.c | 5 ++ kernel/syscall_user_dispatch.c| 92 +++ 14 files changed, 191 insertions(+), 3 deletions(-) create mode 100644 include/linux/syscall_user_dispatch.h create mode 100644 kernel/syscall_user_dispatch.c diff --git a/arch/Kconfig b/arch/Kconfig index 8cc35dc556c7..0ebd971d0d8f 100644 --- a/arch/Kconfig +++ b/arch/Kconfig @@ -465,6 +465,26 @@ config SECCOMP_FILTER See Documentation/userspace-api/seccomp_filter.rst for details. +config HAVE_ARCH_SYSCALL_USER_DISPATCH + bool + help + An arch should select this symbol if it provides all of these things: + - TIF_SYSCALL_USER_DISPATCH + - syscall_get_arch + - syscall_rollback + - syscall_get_nr + - SIGSYS siginfo_t support + +config SYSCALL_USER_DISPATCH + bool "Support syscall redirection to userspace dispatcher" + depends on HAVE_ARCH_SYSCALL_USER_DISPATCH + help + Enable tasks to ask the kernel to redirect syscalls not + issued from a predefined dispatcher back to userspace, + depending on a userspace memory selector. + + This option is useful to optimize games