Re: [PATCH 0/3 v2] arm64/ptrace: allow to get all registers on syscall traps
> > Could you describe the problem in more details? I wonder whether we
> > have the same thing in CRIU...

Sure, basically the issue is that orig_x0 gets recorded at the start of
syscall entry, but is not otherwise part of the ptrace state. It used to be
used primarily for resetting the argument back to its original value during
syscall restarts, but I see it's been expanded slightly with the user
dispatch mechanism (though as far as I can tell not in a way that interacts
with ptrace).

Basically the problem is that if you change the value of x0 during either
the ptrace entry stop or a seccomp stop and then incur a syscall restart,
the syscall will restart with the original x0 rather than with the modified
x0, which may be unexpected. Relatedly, if you're doing CRIU-like things
you can end up in situations where the future behavior will depend on the
orig_x0 value, which isn't restorable at the moment.

It's possible to work around all of this by keeping a local copy of orig_x0
and being very careful with the ptrace traps around restarts, but getting
the logic right is extremely tricky. My suggestion for what I thought would
be reasonable behavior was:

1. Expose orig_x0 to ptrace.
2. Set orig_x0 to x0 and set x0 to -ENOSYS at the start of the syscall
   dispatcher.
3. Use orig_x0 for syscall arguments/seccomp/restarts.

That's basically how rax works on x86_64, and it doesn't seem to cause
major problems (though of course I may be biased by having x86_64 already
work when I started the aarch64 port). Just the first item would of course
be sufficient for getting rid of most of the bookkeeping.

I should also say that, for us, the ptrace getregs call can be the
throughput-limiting operation, so it would be nice if getting the entire
basic register set only required one syscall. I won't insist on it, since
we do have a solution in place that kinda works (and only requires the one
syscall), but I thought I'd mention it.
While we're on this topic, and in case it's helpful to anybody, I should
also point out that the order of the ptrace signal stop vs. the setup for
the syscall restart differs between x86 and aarch64 (aarch64 sets up the
restart first, then delivers the ptrace trap/signal - x86 does it the
other way around). I actually think the aarch64 behavior is saner here,
but I figured I'd leave this breadcrumb for anybody who's writing a
ptracer and stumbles across this.

Keno
Re: [PATCH 0/3 v2] arm64/ptrace: allow to get all registers on syscall traps
Hi Andrei,

> This series introduces the PTRACE_O_ARM64_RAW_REGS option. If it is set,
> PTRACE_GETREGSET returns values of all registers, and PTRACE_SETREGSET
> allows to change any of them.

Thanks for picking this up. I meant to work on this, but unfortunately ran
out of time to push it through, so I'm glad you're working on it, since it
absolutely needs to get fixed.

Besides this issue, the other problem we ran into when trying to port our
ptracer to aarch64 is that orig_x0 is not accessible through the ptrace
interface, which can cause tricky behavior around restarts. We managed to
work around that in the end, but it's painful. If we're fixing the kernel
here anyway, I'm wondering if we might want to address that as well while
we're at it.

Keno
brk checks in PR_SET_MM code
Hi all,

The code in prctl(PR_SET_MM, ...) performs a number of sanity checks, among
them:

```
/*
 * @brk should be after @end_data in traditional maps.
 */
if (prctl_map->start_brk <= prctl_map->end_data ||
    prctl_map->brk <= prctl_map->end_data)
        goto out;
```

The original commit that introduced this check
(f606b77f1a9e362451aca8f81d8f36a3a112139e) says:

```
4) As in regular Elf loading procedure we require that @start_brk and
   @brk be greater than @end_data.
```

However, it does not appear that this invariant is actually enforced during
regular ELF loading. In particular, at least on my Linux distribution, it
does not appear to be satisfied when invoking the dynamic linker directly.
For example, consider the following test application:

```
#include <sys/prctl.h>
#include <unistd.h>
#include <assert.h>

int main(void) {
    int err = prctl(PR_SET_MM, PR_SET_MM_BRK, sbrk(0), 0, 0);
    assert(err == 0);
    return 0;
}
```

```
$ su
# ./a.out
# /lib64/ld-linux-x86-64.so.2 ./a.out
a.out: test.c:7: main: Assertion `err == 0' failed.
Aborted
```

I don't understand this code well enough to know what the intended behavior
is, but unfortunately this causes some processes to be non-restorable using
the PR_SET_MM mechanism, which defeats the whole purpose of that API. Could
somebody clarify whether this situation is indeed supposed to be impossible
and, if not, whether said checks in PR_SET_MM are actually supposed to be
there?

I suppose this is also technically a regression from when the old PR_SET_MM
commands were refactored to use this new validation. Previously only the
commands that changed the brk validated this invariant, but these days the
entire structure is validated at once, so all PR_SET_MM calls will fail in
a process whose layout violates the sanity check.

Thanks,
Keno
Re: ptrace: seccomp: Return value when the call was already invalid
> > Now, if we have a seccomp filter that simply does SECCOMP_RET_TRACE,
> > and a ptracer that simply does PTRACE_CONT
>
> Ok, so this means that we're _skipping_ the system call, right?

If the system call number were positive, this would result in the system
call being executed. The notion of "skipping" the syscall is a bit odd in
this situation. Having the ptracer set the syscallno to -1 is generally
accepted as the way to do it, but what happens if the syscallno is already
-1 or negative is underspecified.

> > then the assert will fire/fail on arm64, but not on x86_64.
>
> It feels weird to me that skipping the system call has any effect on the
> tracee registers...

I think the correct way to frame it is to ask whether the behavior matches
that of the tracee in the absence of the ptracer. I would argue that if the
ptracer doesn't explicitly modify register contents, then the tracee
shouldn't observe any behavioral difference.

> > Interestingly, arm64 does do something different if the syscall is -1
> > rather than -10, where early in the ptrace stop it does
> >
> > ```
> > /* set default errno for user-issued syscall(-1) */
> > if (scno == NO_SYSCALL)
> >         regs->regs[0] = -ENOSYS;
> > ```
>
> ... so I think this should be fixed too. How about the diff below?

I think the patch behavior is better overall, but I'm not sure it's ideal.
The biggest question is what the behavior should be here and whether we
want a behavioral difference between "the syscall was -1 at entry" and
"the syscall was -1 because the ptracer wanted to skip the syscall". I
think there is a bit of a semantic disconnect, because "skipping" the
syscall is not really an operation that the ptracer has at its disposal
(unless it's using SYSEMU, of course). The only thing it can do is set the
syscall to -1. However, arguably that already has semantics (of returning
-ENOSYS), so it's not at all clear to me that we should deviate from that.
Unfortunately, none of this is currently consistent across architectures,
so I think before we go changing arm64, we should decide what we'd like to
happen in theory and then see what we can do to improve the situation
without being too breaking.

Keno
Re: arm64: Register modification during syscall entry/exit stop
On Mon, Jun 1, 2020 at 5:23 AM Dave Martin wrote:
> > > Can't PTRACE_SYSEMU be emulated by using PTRACE_SYSCALL, cancelling
> > > the syscall at the syscall enter stop, then modifying the regs at the
> > > syscall exit stop?
> >
> > Yes, it can. The idea behind SYSEMU is to be able to save half the
> > ptrace traps that would require, in theory making the ptracer a decent
> > amount faster. That said, the x7 issue is orthogonal to SYSEMU; you'd
> > have the same issues if you used PTRACE_SYSCALL.
>
> Right, I just wondered whether there was some deeper difference between
> the two approaches.

You're asking about a new regset vs. trying to do it via a ptrace option?
I don't think there's anything a ptrace option can do that a new regset
replicating the same registers wouldn't be able to do (I'm going to
propose adding orig_x0 while we're at it, and changing the x0 semantics a
bit - I'll have those details with the patch).

The reason I originally thought it might have to be a ptrace option is
that the register modification currently gets applied in the syscall entry
code to the actual regs struct, so I thought you might have to know to
preserve those registers. However, then I realized that you could just
change the regset accessors to emulate the old behavior, since we already
store all the required information (what kind of stop we're currently at)
in order to be able to answer the ptrace informational queries. So doing
that is probably just all-around easier.

I guess NT_PRSTATUS might also rot, but strace doesn't really have to stop
using it, since it neither cares about the x7 value nor needs to modify it.

Keno
Re: arm64: Register modification during syscall entry/exit stop
On Mon, Jun 1, 2020 at 5:14 AM Dave Martin wrote:
> Can you explain why userspace would write a changed value for x7
> but at the same time need that new value to be thrown away?

The discarding behavior is the primary reason things aren't completely
broken at the moment. If a ptracer reads the wrong x7 value and doesn't
know about the aarch64 quirk, it's often just trying to write that same
wrong value back during the next stop, so if that write is ignored, that's
probably fine in 99% of cases, since the value in the tracee will be
undisturbed.

I don't think there's a sane way to change the aarch64 NT_PRSTATUS
semantics without just completely removing the x7 behavior, but of course
people may be relying on that (I think somebody said upthread that strace
does?).

Keno
Re: arm64: Register modification during syscall entry/exit stop
> Can't PTRACE_SYSEMU be emulated by using PTRACE_SYSCALL, cancelling the
> syscall at the syscall enter stop, then modifying the regs at the
> syscall exit stop?

Yes, it can. The idea behind SYSEMU is to be able to save half the ptrace
traps that would require, in theory making the ptracer a decent amount
faster. That said, the x7 issue is orthogonal to SYSEMU; you'd have the
same issues if you used PTRACE_SYSCALL.

Keno
Re: arm64: Register modification during syscall entry/exit stop
> Keno -- are you planning to send out a patch? You previously spoke about
> implementing this using PTRACE_SETOPTIONS.

Yes, I'll have a patch for you, though I've come to the conclusion that
introducing a new regset is probably a better way to solve it. We can then
also expose orig_x0 at the same time and give it sane semantics (there are
some problems with the way it works currently - I'll write them up
together with the patch).

Keno
mm: Behavior of process_vm_* with short local buffers
Hi everyone,

I'm in the process of trying to port a debugging tool
(http://rr-project.org/) from x86 to various other architectures. This
tool relies on noting every change that is made to the memory of the
process being debugged. As such, it has a battery of tests for corner
cases of copyin/out, and it is one of these that I saw behaving strangely
when ported to non-x86 architectures.

This particular test exercises the behavior of process_vm_readv (and
writev, but for simplicity let's assume readv here) with short local
buffers. On x86, if the buffer is short and the following page is
unmapped, the syscall will fill the remainder of the page and then return
however many bytes it actually wrote. However, on other architectures (I
mostly looked at arm64, though the same applies elsewhere), the behavior
can be quite different. In general, the behavior depends strongly on
factors like how close to the start of the copy region the page break
occurs, how many bytes were supposed to be left after the page break, and
the total size of the region to be copied. In various situations, I'm
seeing:

- Writes that end many bytes before the page break
- Bytes being modified beyond what the syscall result would indicate
  happened
- Combinations thereof

I can work around this in my port, but I thought it might be valuable to
ask where the line is between "architecture-defined behavior" and a bug
that should be reported to the appropriate architecture maintainers and
eventually fixed. For example, I think it would be nice if the syscall
result actually matched the number of bytes written in all cases.

I've written a small program [1] that sets up this situation for various
parameter values and prints the results. I have access to arm64, powerpc
and x86, so I included results for those architectures, but I suspect
other architectures have similar issues. The program should be easy to
run to get your own results for a different architecture.
[1] https://gist.github.com/Keno/b247bca85219c4e3bdde9f7d7ff36c77

Thanks,
Keno
Re: arm64: Register modification during syscall entry/exit stop
Just ran into this issue again, with what I think may be the most
compelling example yet of why this is problematic: the tracee incurred a
signal, and we PTRACE_SYSEMU'd to the rt_sigreturn, which the tracer tried
to emulate by applying the state from the signal frame. However, the
PTRACE_SYSEMU stop is a syscall stop, so the tracer's write to x7 was
ignored and x7 retained the value it had in the signal handler, which
broke the tracee.

Keno

On Sat, May 23, 2020 at 1:35 AM Keno Fischer wrote:
>
> I got bitten by this again, so I decided to write up a simple example
> that shows the problem:
>
> https://gist.github.com/Keno/cde691b26e32373307fb7449ad305739
>
> This runs the same child twice. First vanilla, where it prints "Hello
> World". The second time, using a textbook ptrace example, to only print
> "World". The problem here is that by the time the ptracer gets around to
> restoring the registers, it's no longer in a syscall stop, so the write
> to x7 does not get ignored and the correct value of x7 gets clobbered.
> I copied the syscall definition from musl, so the compiler thinks x7 is
> live, and we can see an assertion.
>
> Output on my machine (will depend on compiler version, etc.):
> ```
> $ gcc -g3 -O3 ptrace_lies.c
> $ ./a.out
> Hello World
> World
> a.out: ptrace_lies.c:49: do_child: Assertion `v3 == values[2]' failed.
> a.out: ptrace_lies.c:134: main: Assertion `WIFEXITED(status) &&
> WEXITSTATUS(status) == 0' failed.
> Aborted (core dumped)
> ```
>
> However, I don't think that whether or not the compiler thinks that x7
> is live is the problem here. The problem is entirely that this mechanism
> prevents the ptracer from precisely controlling the register state.
> While basic ptracers (strace) don't need this feature, more advanced
> ptracers (think criu, etc.) absolutely do want to precisely control what
> the register state is. The ptracer I'm working on
> (https://rr-project.org/) happens to be an extreme case of this, where
> it wants *bitwise* equivalent register states such that it can run the
> same code many times and get the exact same results.
>
> Also, if the issue were just that the kernel clobbered x7, that would be
> fine - we could deal with that, no problem. However, it's much worse
> than that, because the behavior of the kernel with respect to x7 depends
> on what kind of ptrace stop we're in and, even worse, in some kinds of
> stop there's absolutely no way to get at the actual value of x7.
>
> > Hmm, does that actually result in the SVC instruction getting
> > inlined? I think that's quite dangerous, since we document that we
> > can trash the SVE register state on a system call, for example. I'm
> > also surprised that the register variables are honoured by compilers
> > if that inlining can occur.
>
> I haven't gotten to trying SVE yet, so I appreciate the warning :). That
> said, deterministic clobbering of registers is fine. Even changing the
> registers to random junk is fine. We're happy to read those registers
> through ptrace. The problem here is that the kernel lies about what the
> contents of the x7 register are and discards any writes to it.
>
> I really hope we can come up with a solution here - I'm already dreading
> the next time I unexpectedly run into this and have to add yet another
> special case :(.
>
> Keno
[PATCH] arm64: ptrace: Fix PTRACE_SINGLESTEP into signal handler
Executing PTRACE_SINGLESTEP at a signal stop is special. It is supposed to
step merely the signal setup work that the kernel does, but not any
instructions in user space. Since not all architectures have the ability
to generate a single-step exception directly upon return from user space,
there is a generic pseudo-single-step stop that may be used for this
purpose (tracehook_signal_handler).

Now, arm64 does have the ability to generate single-step exceptions
directly upon return to userspace and was using this capability (rather
than the generic pseudo-trap) to obtain a similar effect. However, there
is actually a subtle difference that becomes noticeable when the signal
handler in question attempts to block SIGTRAP (either because it is set
in sa_mask, or because it is a handler for SIGTRAP itself and SA_NODEFER
is not set). In such a situation, a real single-step exception will cause
the SIGTRAP signal to be forcibly unblocked and the signal disposition to
be reset to SIG_DFL. The generic pseudo-single-step does not suffer from
this problem, because the SIGTRAP it delivers is not real.

The arm64 behavior is problematic, because a forced reset of the signal
disposition can be quite disruptive to the userspace program. This patch
brings the arm64 behavior in line with the other major architectures by
using the generic pseudo-single-step stop, avoiding the problematic
interaction with SIGTRAP masks.

Fixes: 2c020ed8 ("arm64: Signal handling support")
Signed-off-by: Keno Fischer
---
 arch/arm64/kernel/signal.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 339882db5a91..cf237ee9443b 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -808,14 +808,7 @@ static void handle_signal(struct ksignal *ksig, struct pt_regs *regs)
 	 */
 	ret |= !valid_user_regs(&regs->user_regs, current);
 
-	/*
-	 * Fast forward the stepping logic so we step into the signal
-	 * handler.
-	 */
-	if (!ret)
-		user_fastforward_single_step(tsk);
-
-	signal_setup_done(ret, ksig, 0);
+	signal_setup_done(ret, ksig, test_thread_flag(TIF_SINGLESTEP));
 }
 
 /*
-- 
2.25.1
Re: arm64: Register modification during syscall entry/exit stop
I got bitten by this again, so I decided to write up a simple example that
shows the problem:

https://gist.github.com/Keno/cde691b26e32373307fb7449ad305739

This runs the same child twice. First vanilla, where it prints "Hello
World". The second time, using a textbook ptrace example, to only print
"World". The problem here is that by the time the ptracer gets around to
restoring the registers, it's no longer in a syscall stop, so the write to
x7 does not get ignored and the correct value of x7 gets clobbered. I
copied the syscall definition from musl, so the compiler thinks x7 is
live, and we can see an assertion.

Output on my machine (will depend on compiler version, etc.):
```
$ gcc -g3 -O3 ptrace_lies.c
$ ./a.out
Hello World
World
a.out: ptrace_lies.c:49: do_child: Assertion `v3 == values[2]' failed.
a.out: ptrace_lies.c:134: main: Assertion `WIFEXITED(status) &&
WEXITSTATUS(status) == 0' failed.
Aborted (core dumped)
```

However, I don't think that whether or not the compiler thinks that x7 is
live is the problem here. The problem is entirely that this mechanism
prevents the ptracer from precisely controlling the register state. While
basic ptracers (strace) don't need this feature, more advanced ptracers
(think criu, etc.) absolutely do want to precisely control what the
register state is. The ptracer I'm working on (https://rr-project.org/)
happens to be an extreme case of this, where it wants *bitwise* equivalent
register states such that it can run the same code many times and get the
exact same results.

Also, if the issue were just that the kernel clobbered x7, that would be
fine - we could deal with that, no problem. However, it's much worse than
that, because the behavior of the kernel with respect to x7 depends on
what kind of ptrace stop we're in and, even worse, in some kinds of stop
there's absolutely no way to get at the actual value of x7.

> Hmm, does that actually result in the SVC instruction getting inlined? I
> think that's quite dangerous, since we document that we can trash the
> SVE register state on a system call, for example. I'm also surprised
> that the register variables are honoured by compilers if that inlining
> can occur.

I haven't gotten to trying SVE yet, so I appreciate the warning :). That
said, deterministic clobbering of registers is fine. Even changing the
registers to random junk is fine. We're happy to read those registers
through ptrace. The problem here is that the kernel lies about what the
contents of the x7 register are and discards any writes to it.

I really hope we can come up with a solution here - I'm already dreading
the next time I unexpectedly run into this and have to add yet another
special case :(.

Keno
ptrace: seccomp: Return value when the call was already invalid
I'm seeing the following while porting a ptracer from x86_64 to arm64
(cc'ing arm64 folks, but in this case x86_64 is the odd one out - I think
other archs would be consistent with arm64).

Consider userspace code like the following:

```
int ret = syscall(-10, 0);
assert(ret == -ENOSYS);
```

(Never mind the fact that this is something userspace shouldn't do; I saw
this in our test suite that tests corner cases where the ptracer shouldn't
affect behavior.)

Now, if we have a seccomp filter that simply does SECCOMP_RET_TRACE, and a
ptracer that simply does PTRACE_CONT, then the assert will fire/fail on
arm64, but not on x86_64. The reason this happens is that the return value
gets set early on x86_64, but this is not possible on arm64, because doing
so would clobber the first argument register that it shares. As a result,
no return value is set and `ret` retains the value that the first syscall
argument used to have.

I can work around this, of course, but I guess my question is whether this
is expected/ok, or whether you would expect an active ptracer that does
not touch the registers not to affect behavior.

Interestingly, arm64 does do something different if the syscall is -1
rather than -10: early in the ptrace stop it does

```
/* set default errno for user-issued syscall(-1) */
if (scno == NO_SYSCALL)
	regs->regs[0] = -ENOSYS;
```

I'm not sure that's great either, since the ptracer may want to inspect x0
and arm64 does not make orig_x0 available via ptrace. To me this indicates
that maybe this was intended to apply to any syscall skipped here, not
just -1 (the difference comes from the fact that seccomp considers any
negative syscall a skip/fail, but at syscall-entry stops arm64 only
considers a literal -1 a skip).

On the other hand, if this is deemed expected, I'll go ahead and submit a
man-page patch to at least document this architecture difference.

Keno
Re: arm64: Register modification during syscall entry/exit stop
Hi Will,

> Yes, we inherited this from ARM and I think strace relies on it. In
> hindsight, it is a little odd, although x7 is a parameter register in
> the PCS and so it won't be live on entry to a system call.

I'm not familiar with the PCS acronym, but I assume you mean the calling
convention? You have more faith in userspace than I do ;). For example,
cursory googling brought up this arm64 syscall definition in musl:

https://github.com/bminor/musl/blob/593caa456309714402ca4cb77c3770f4c24da9da/arch/aarch64/syscall_arch.h

The constraints on those asm blocks allow the compiler to assume that x7
is preserved across the syscall. If a ptracer accidentally modified it
(which is easy to do in the situations I mentioned), it could absolutely
cause incorrect execution of the userspace program.

> Although the examples you've listed above are interesting, I don't see
> why x7 is important in any of them (and we only support up to 6 system
> call arguments).

It's not so much that x7 is important; it's that lying to the ptracer is
problematic, because it might remember that lie and act on it later. I did
run into exactly this problem, where my ptracer accidentally changed the
value of x7 and caused incorrect execution in the tracee (that incorrect
execution happened to be an assertion failure, because my application is
paranoid about these kinds of issues, but it was incorrect nevertheless).

If it would be helpful, I can code up the syscall-entry -> signal-trap
example ptracer to have a concrete example.

Keno
arm64: Register modification during syscall entry/exit stop
Continuing my theme of "weird things I encounter while trying to use
ptrace on arm64", I ran into the effect of the following code in the
syscall entry/exit reporting:

```
/*
 * A scratch register (ip(r12) on AArch32, x7 on AArch64) is
 * used to denote syscall entry/exit:
 */
regno = (is_compat_task() ? 12 : 7);
saved_reg = regs->regs[regno];
regs->regs[regno] = dir;
```

This seems very weird to me. I can't think of any other architecture that
does something similar (other than unicore32, apparently, but the ptrace
support there seems like it might have just been copied from ARM). I'm
able to work around this in my application, but it adds another stumbling
block. Some examples of things that happen:

- Writes to x7 during syscall-exit stops are ignored, so if the ptracer
  tries to emulate a setjmp-type thing, it might miss this register
  (ptracers sometimes like to do this to manually serialize execution
  between different threads, puppeteering a single thread of execution
  between different register states).
- Reads from x7 are incorrect, so if the ptracer saves a register state
  and later tries to set it back on the task, it may get x7 incorrect,
  but user space may be expecting the register to be preserved. (When
  might this happen? Consider a ptracer that wants to modify some syscall
  arguments: it modifies the arguments and restarts the syscall, but then
  incurs a signal, so it tries to restore the original registers to let
  userspace deal with the signal without being confused - except signal
  traps don't ignore x7 modifications, so x7 may have been unexpectedly
  modified.)
- We now have seccomp traps, which kind of look and act like
  syscall-entry traps but don't have this behavior, so it's not
  particularly reliable for ptracers to use.

Furthermore, it seems unnecessary to me on modern kernels. We now have
PTRACE_GET_SYSCALL_INFO, which exposes this information without lying to
the ptracer about the tracee's registers.
I understand we can't just change this, since people may be relying on it,
but I would like to propose adding a ptrace option
(PTRACE_O_ARM_REGSGOOD?) that turns this behavior off. Now, I don't think
we currently have any other arch-specific ptrace options, so maybe there
is a different approach that would be preferable (e.g. it could be a
different regset), but I do think it would be good to have a way to
operate on the real x7 value. As I said, I can work around it, but
hopefully I will be able to save a future implementer some headache.

Keno
[PATCH] arm64: Fix PTRACE_SYSEMU semantics
Quoth the man page:

```
If the tracee was restarted by PTRACE_SYSCALL or PTRACE_SYSEMU, the
tracee enters syscall-enter-stop just prior to entering any system
call (which will not be executed if the restart was using
PTRACE_SYSEMU, regardless of any change made to registers at this
point or how the tracee is restarted after this stop).
```

The parenthetical comment is currently true on x86 and powerpc, but not on
arm64. arm64 re-checks the _TIF_SYSCALL_EMU flag after the syscall-entry
ptrace stop. However, at this point it reflects which method was used to
re-start the syscall at the entry stop, rather than the method that was
used to reach it. Fix that by recording the original flag before
performing the ptrace stop, bringing the behavior in line with the
documentation and with x86/powerpc.

Signed-off-by: Keno Fischer
---
 arch/arm64/kernel/ptrace.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index b3d3005d9515..b67b4d14aa17 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -1829,10 +1829,12 @@ static void tracehook_report_syscall(struct pt_regs *regs,
 
 int syscall_trace_enter(struct pt_regs *regs)
 {
-	if (test_thread_flag(TIF_SYSCALL_TRACE) ||
-	    test_thread_flag(TIF_SYSCALL_EMU)) {
+	u32 flags = READ_ONCE(current_thread_info()->flags) &
+		(_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE);
+
+	if (flags) {
 		tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
-		if (!in_syscall(regs) || test_thread_flag(TIF_SYSCALL_EMU))
+		if (!in_syscall(regs) || (flags & _TIF_SYSCALL_EMU))
 			return -1;
 	}
 
-- 
2.25.1
Re: PTRACE_SYSEMU behavior difference on arm64
On Fri, May 15, 2020 at 8:13 AM Will Deacon wrote:
> But it also means that nobody is using this on arm64, so we could also
> consider removing it entirely. Did you spot this because you are trying
> to use it for something or just by inspection/unit-testing?

No, I was trying to port a tool from x86, and nothing made sense for many
hours :). (It was quite a bit of debugging, because the syscall that the
tool was supposed to skip installed a seccomp filter, which then later
veto'd random syscalls, making the symptoms quite confusing.)

Having PTRACE_SYSEMU isn't critical, but we might as well support it. It
makes things a bit more efficient and is probably safer (if it works
correctly ;)). The patch is fairly small; I'll validate it and then send
it here for review.

Keno
PTRACE_SYSEMU behavior difference on arm64
The behavior of PTRACE_SYSEMU on arm64 appears to differ substantially
from that of x86 and powerpc (the other two architectures on which this
feature is implemented). In particular, after PTRACE_SYSEMU the syscall
will always be skipped on x86 and powerpc, but executed on arm64 unless
the syscall-entry stop was again continued using PTRACE_SYSEMU. The
skipping behavior is also documented in the man page, so I suspect this
may just be a bug (the skipping behavior makes sense to me and is what I
would expect).

The reason this happens is that `syscall_trace_enter` re-checks
TIF_SYSCALL_EMU after the ptrace stop, but at that point the flag may
already have been superseded by a new ptrace request. x86 and powerpc save
the original value of the flag, rather than acting on the new value.

I can submit a patch to fix this, but wanted to check first whether this
was intentional. If it is, I can fix the man page instead.

Keno
Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread
> So, to be useful, this interface needs to be called before an > application can run XGETBV or XSAVE for the first time and caches a > "bad" value. I think that means that it might not be feasible to use > outside of cases where you ptrace() something and inject things before > it has a chance to run any real instructions. > > Fundamentally, I think that makes _this_ interface pretty useless in > practice. The only practical option is to have a _future_ XCR0 value > set by the prctl() and then have it get made active by the kernel at > execve(). Fair enough, but it don't see this as really fundamentally different from the cpuid masking use case, which has the same problem and the same interface. I'm also not convinced that there is *no* use case where one may want to turn on certain XCR0 features while the process is running and then turn them off again. To give a concrete example in this context, it can useful to write a small program into the memory space of the replayed program and use it to analyze the memory state of the program (e.g. to checksum the memory - or in our case to perform a GC state validation). Such implants may want to use the AVX512 registers for performance, so it would be nice if that was possible. > IMNHO, if you haven't guessed yet, I think this whole exercise is a dead > end. Just boot an identical XCR0 VM on your new hardware and do replay > there. Done. I had a hunch ;). However, there are a couple considerations that make me still want this in the kernel proper: 1. The recording side application of this feature - getting our users to do everything in a VM to send us a recording is not easy. Part of the appeal of rr over VM-based record/replay techniques is that it "just works" on basically any linux hosts. 2. Starting a VM generally requires root permissions, which may not be available. 3. And probably the biggest from my perspective is performance. 
rr needs to do a lot of twiddling with the performance counters, which I've seen incur significant overhead in a virtualized environment. There's of course also the per-VM resource consumption, but presumably we could keep one VM per XCR0 value and replay multiple traces per VM.
Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread
> So we're talking about a workaround for broken software. The question
> is how wide spread is it?

For rr to work, it tries to replicate the process state *exactly*. That means:

1. The same instructions executed in the same order
2. The exact same register state at those instructions
3. The same memory state, bit-by-bit

In particular 1) means that any extra instructions executed/not executed will cause a replay divergence (in practice rr uses retired conditional branches rather than instructions, because the instruction counter is not accurate, while the branch one is). This alone causes a problem for the present case, because glibc branches on the xcr0 value before it branches on the cpuid value for AVX512. Glibc does check for the correct cpuid before calling xgetbv, so one possible thing to do is to completely disable xsave during recording by disabling it in CPUID, but that would make rr quite a bit less useful, since it wouldn't be able to record any bugs that require AVX to be used. However, the xsave problem is worse, because xcr0 determines how much memory `xsave` writes, so if we emulate cpuid to pretend that AVX512 does not exist, and the user space application uses that to determine the size of the required buffer, we now suddenly overflow that buffer (unless the user space application uses cpuid to compute a minimal RFBM for xsave, which no application seems to do).

> Do memory contents which are never read by the application matter?

In theory, no. However, in practice, I've seen most memory divergences (especially on the stack) end up causing control flow divergences down the line, because some code somewhere picks up the uninitialized memory and branches on it.

Hope that helps,
Keno
Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread
On Mon, Jun 18, 2018 at 12:16 PM, Dave Hansen wrote:
> On 06/18/2018 08:13 AM, Keno Fischer wrote:
>>>> 4) Catch the fault thrown by xsaves/xrstors in this situation, update
>>>> XCR0, redo the xsaves/xrstors, put XCR0 back and continue
>>>> execution after the faulting instruction.
>>>
>>> I'm worried about the kernel pieces that go digging in the XSAVE data
>>> getting confused more than the hardware getting confused.
>>
>> So you prefer this option? If so, I can try to have a go at implementing it
>> this way and seeing if I run into any trouble.
>
> No, I'm saying that depending on faults is not a viable solution. We
> are not guaranteed to get faults in all the cases you would need to fix up.
>
> XSAVE*/XRSTOR* are not even *called* in some of those cases.

Ah, my apologies, I was under the mistaken impression that xsaves also reads xcomp_bv to inform the layout, rather than using the RFBM and then updating the xcomp_bv field. Let me think about this some more and see what I can come up with.
Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread
>> 4) Catch the fault thrown by xsaves/xrstors in this situation, update
>> XCR0, redo the xsaves/xrstors, put XCR0 back and continue
>> execution after the faulting instruction.
>
> I'm worried about the kernel pieces that go digging in the XSAVE data
> getting confused more than the hardware getting confused.

So you prefer this option? If so, I can try to have a go at implementing it this way and seeing if I run into any trouble.

>> At least currently, it is my understanding that `xfeatures_mask` only has
>> user features, am I mistaken about that?
>
> We're slowly adding supervisor support. I think accounting for
> supervisor features is a requirement for any new XSAVE code.

Sure, I don't think this is in any way incompatible with that (though it probably also informs that we want to keep the memory layout the same if possible).
Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread
Hi Dave, thank you for your thorough comments, replies inline:

On Mon, Jun 18, 2018 at 8:47 AM, Dave Hansen wrote:
> On 06/16/2018 05:33 PM, Keno Fischer wrote:
>> For my use case, it would be sufficient to simply disallow
>> any value of XCR0 with "holes" in it,
> But what if the hardware you are migrating to/from *has* holes?

Yes, that would be the problem ;). I'm not aware of that being the case for user space states on current hardware, but perhaps I'm just missing a case.

> There's no way this is even close to viable until it has been made to
> cope with holes.

Ok, it seems to me the first decision is whether it should be allowed to have the compacted format (with holes) in an in-memory xstate image. Doing so seems possible, but would require more invasive changes to the kernel, so I'm gonna ignore it for now. If we want to keep the in-memory xstate layout constant after boot time, I see four ways to do that for this feature:

1) Set up XCR0 in a different place, so that the kernel xsaves/xrstors use an XCR0 that matches the layout the kernel expects.
2) Use xsaveopt when this particular thread flag is set and have the kernel be able to deal with non-compressed xstate images, even if xsaves is available.
3) What's in this patch now, but fix up the xstate after saving it.
4) Catch the fault thrown by xsaves/xrstors in this situation, update XCR0, redo the xsaves/xrstors, put XCR0 back and continue execution after the faulting instruction.

Option 1) seems easiest, but it would also require adding code somewhere on the common path, which I assume is a no-go ;). Option 3) seems doable entirely in the slow path for this feature. If we xsaves with the modified XCR0, we can then fix up the memory image to match the expected layout. Similarly, we could xrstors in the slow path (from the standard layout) and rely on the `fpregs_state_valid` mechanism to prevent the fault.
Option 4) seems also doable, but of course messing around with traps makes things quite a bit more complicated.

> FWIW, I just don't think this is going to be viable. I have the feeling
> that there's way too much stuff that hard-codes assumptions about XCR0
> inside the kernel and out. This is just going to make it much more fragile.
>
> Folks that want this level of container migration are probably better
> off running one of the hardware-based containers and migrating _those_.
> Or, just ensuring the places to/from they want to migrate have a
> homogeneous XCR0 mix.

Fair enough :). I figured I'd throw it out there as an RFC and see if I can get it into an acceptable shape for you to include upstream. For my particular use case (replaying old traces on newer hardware), since I control the replay hardware, I'm happy if there's a patch for me to apply locally to solve my problem. Of course there's also the opposite use case (recording a trace on a new machine, but wanting to replay on an older one), which would also require the recording machine to have this patch. The reason I make this distinction is that we don't usually control the recording machine (since those are our users' machines), so that tends to have a much more varied hardware/kernel mix and it's harder to get kernel patches applied. I'm happy to keep chipping away at this for as long as there is feedback, but I understand if it won't make it in at the end of the process.

>> @@ -252,6 +301,8 @@ void arch_setup_new_exec(void)
>>  	/* If cpuid was previously disabled for this task, re-enable it. */
>>  	if (test_thread_flag(TIF_NOCPUID))
>>  		enable_cpuid();
>> +	if (test_thread_flag(TIF_MASKXCR0))
>> +		reset_xcr0_mask();
>>  }
>
> So the mask is cleared on exec(). Does that mean that *every*
> individual process using this interface has to set up its own mask
> before anything in the C library establishes its cached value of XCR0.
> I'd want to see how that's being accomplished.

Yes, that is correct.
In my rr patch this is done by injecting an arch_prctl syscall using ptrace, at the same spot that we inject the arch_prctl syscall for enabling cpuid faulting (i.e. the interface that sets the `TIF_NOCPUID` flag in the check prior to this one). Since these two features would seem to make the most sense when used together, I figured it would be best to align semantics. Happy to entertain alternative suggestions though.

>> +static int xstate_is_initial(unsigned long mask)
>> +{
>> +	int i, j;
>> +	unsigned long max_bit = __ffs(mask);
>> +
>> +	for (i = 0; i < max_bit; ++i) {
>> +		if (mask & (1 << i)) {
>> +			char *xfeature_addr = (char *)get_xsave_addr(
>> +				&current->thread.fpu.state.xsave,
>> +				1 << i);
Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread
> Almost difference in CPU behavior
> between the replayer and the replayee.

Not sure what happened to this sentence. What I meant to say was: Almost any difference in CPU behavior between the replayer and the replayee will cause problems for the determinism of the trace. To elaborate on this, even a register whose content differs between the recording and the replay can cause problems down the line, e.g. if it is spilled to the stack and that stack address is then re-used later. In order for rr to work, we basically rely on the CPU behaving *exactly* the same during the record and the replay (down to counting performance counters the same). This works well, but there are corner cases (like this XCR0 one).
Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread
> Patch seems pointless if you can already control CPUID, which rr
> supports. Just disable AVX512 in CPUID. All code using AVX should check
> cpuid (or will fail anyways).

Unfortunately, that is insufficient. Almost any difference in CPU behavior between the replayer and the replayee will cause problems. In particular, the problems for rr here are:

1) xgetbv, which can introduce differing register contents (and if code branches on this, potentially differences in control flow).
2) xsave writing more memory than the trace expects, causing divergence in the memory contents of the process.

Robert O'Callahan has a blog post that goes into some detail here: https://robert.ocallahan.org/2018/04/cpuid-features-xsave-and-rr-trace.html

I hope that makes sense. Please let me know if you'd like me to explain the problem in more detail or give an example. FWIW, I do have a version of rr that uses the proposed feature in this patch and it does fix our problem with replaying AVX traces on Skylake machines. Without this patch, rr does emulate the cpuid response of the recording machine, but the replay fails because of the mentioned issue.
[RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread
From: Keno Fischer

The rr (http://rr-project.org/) debugger provides user space record-and-replay functionality by carefully controlling the process environment in order to ensure completely deterministic execution of recorded traces. The recently added ARCH_SET_CPUID arch_prctl allows rr to move traces across (Intel) machines, by allowing cpuid invocations to be reliably recorded and replayed. This works very well, with one catch: It is currently not possible to replay a recording from a machine supporting a smaller set of XCR0 state components on one supporting a larger set. This is because the value of XCR0 is observable in userspace (either by explicit xgetbv or by looking at the result of xsave) and since glibc does observe this value, replay divergence is almost immediate. I also suspect that people interested in process (or container) live-migration may eventually care about this if a migration happens in between a userspace xsave and a corresponding xrstor. We encounter this problem quite frequently since most of our users are using pre-Skylake systems (and thus don't support the AVX512 state components), while we recently upgraded our main development machines to Skylake.

This patch attempts to provide a way to disable XCR0 state components on a per-thread basis, such that rr may use this feature to emulate the recording machine's XCR0 for the replayed process. We do this by providing a new ARCH_SET_XCR0 arch_prctl that takes as its argument the desired XCR0 value. The system call fails if:

- XSAVE is not available
- The user attempts to enable a state component that would not regularly be enabled by the kernel.
- The value of XCR0 is illegal (in the sense that setting it would cause a fault).
- Any state component being disabled is not in its initial state.
The last restriction is debatable, but it seemed sensible to me, since it means the kernel need not be in the business of trying to maintain the disabled state components when the ordinary xsave/xrstor will no longer save/restore them.

The patch does not currently provide a corresponding ARCH_GET_XCR0, since the `xgetbv` instruction fulfills this purpose. However, we may want to provide this for consistency. It may also be useful (and perhaps more useful) to provide a mechanism to obtain the kernel's XCR0 (i.e. the upper bound on which bits are allowed to be set through this interface). On the kernel side, this value is stored as a mask, rather than a value to set. The thought behind this was to be defensive about new state components being added but forgetting to update the validity check in the arch_prctl. However, getting this wrong in either direction is probably bad (e.g. if Intel added an AVX1024 extension that always required AVX512, then this would be the wrong way to be defensive about it), so I'd appreciate some advice on which would be preferred.

Please take this patch as an RFC that I hope is sufficient to enable discussion about this feature, rather than a complete patch. In particular, this patch is missing:

- Cleanup
- A selftest exercising all the corner cases
- There are code sharing opportunities with KVM (which has to provide similar functionality for virtual machines modifying XCR0), which I did not take advantage of in this patch to keep the changes local. A full patch would probably involve some refactoring there.

There is one remaining TODO in the code, which has to do with the xcomp_bv xsave header. The `xrstors` instruction requires that no bits in this header field be set that are not active in XCR0. However, this unfortunately means that changing the value of XCR0 can change the layout of the compacted xsave area, which so far the kernel does not do anywhere else (except for some handling in the KVM context).
For my use case, it would be sufficient to simply disallow any value of XCR0 with "holes" in it, i.e. one that would change the layout to anything other than a strict prefix, but I would appreciate hearing any alternative solutions you may have. Please also let me know if I missed a fundamental way in which this causes a problem. I realize that this is a very sensitive part of the kernel code. Since this patch changes XCR0, any xsave/xrstor or any use of XCR0-enabled features in kernel space not guarded by kernel_fpu_begin() (which I modified for this patch set) would cause a problem. I don't have a sufficiently wide view of the kernel to know if there are any such uses, so please let me know if I missed something.

Signed-off-by: Keno Fischer
---
 arch/x86/include/asm/fpu/xstate.h  |   1 +
 arch/x86/include/asm/processor.h   |   2 +
 arch/x86/include/asm/thread_info.h |   5 +-
 arch/x86/include/uapi/asm/prctl.h  |   2 +
 arch/x86/kernel/fpu/core.c         |  10 ++-
 arch/x86/kernel/fpu/xstate.c       |   3 +-
 arch/x86/kernel/process.c          | 125 -
 arch/x86/kernel/process_64.c       |  16 ++---
 8 files changed, 152 insertions(+), 12
[RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread
From: Keno Fischer The rr (http://rr-project.org/) debugger provides user space record-and-replay functionality by carefully controlling the process environment in order to ensure completely deterministic execution of recorded traces. The recently added ARCH_SET_CPUID arch_prctl allows rr to move traces across (Intel) machines, by allowing cpuid invocations to be reliably recorded and replayed. This works very well, with one catch: It is currently not possible to replay a recording from a machine supporting a smaller set of XCR0 state components on one supporting a larger set. This is because the value of XCR0 is obersevable in userspace (either by explicit xgetbv or by looking at the result of xsave) and since glibc does oberseve this value, replay divergence is almost immediate. I also suspect that people intrested in process (or container) live-migration may eventually care about this if a migration happens in between a userspace xsave and a corresponding xrstor. We encounter this problem quite frequently since most of our users are using pre-Skylake systems (and thus don't support the AVX512 state components), while we recently upgraded our main development machines to Skylake. This patch attempts to provide a way disable XCR0 state components on a per-thread basis, such that rr may use this feature to emulate the recording machine's XCR0 for the replayed process. We do this by providing a new ARCH_SET_XCR0 arch_prctl that takes as its argument the desired XCR0 value. The system call fails if: - XSAVE is not available - The user attempts to enable a state component that would not regularly be enabled by the kernel. - The value of XCR0 is illegal (in the sense that setting it would cause a fault). - Any state component being disabled is not in its initial state. 
The last restriction is debatable, but it seemed sensible to me, since it means the kernel need not be in the business of trying to maintain the disabled state components when the ordinary xsave/xrestor will no longer save/restore them. The patch does not currently provide a corresponding ARCH_GET_XCR0, since the `xgetbv` instruction fulfills this purpose. However, we may want to provide this for consistency. It may also be useful (and perhaps more useful) to provide a mechanism to obtain the kernel's XCR0 (i.e. the upper bound on which bits are allowed to be set through this interface). On the kernel side, this value is stored as a mask, rather than a value to set. The thought behind this was to be defensive about new state components being added but forgetting to update the validity check in the arch_prctl. However, getting this wrong in either direction is probably bad (e.g. if Intel added an AVX1024 extension that always required AVX512, then this would be the wrong way to be defensive about it), so I'd appreciate some advice on which would be preferred. Please take this patch as an RFC that I hope is sufficient to enable discussion about this feature, rather than a complete patch. In particular, this patch is missing: - Cleanup - A selftest exercising all the corner cases - There's code sharing opportunities with KVM (which has to provide similar functionality for virtual machines modifying XCR0), which I did not take advantage of in this patch to keep the changes local. A full patch would probably involve some refactoring there. There is one remaining TODO in the code, which has to do with the xcomp_bv xsave header. The `xrstors` instructions requires that no bits in this headers field be set that are not active in XCR0. However, this unfortunately means that changing the value of XCR0 can change the layout of the compacted xsave area, which so far the kernel does not do anywhere else (except for some handling in the KVM context). 
For my use case, it would be sufficient to simply disallow any value of XCR0 with "holes" in it that would change the layout to anything other than a strict prefix, but I would appreciate hearing any alternative solutions you may have. Please also let me know if I missed a fundamental way in which this causes a problem. I realize that this is a very sensitive part of the kernel code. Since this patch changes XCR0, any xsave/xrstor or any use of XCR0-enabled features in kernel space not guarded by kernel_fpu_begin() (which I modified for this patch set) would cause a problem. I don't have a sufficiently wide view of the kernel to know if there are any such uses, so please let me know if I missed something.

Signed-off-by: Keno Fischer
---
 arch/x86/include/asm/fpu/xstate.h  |   1 +
 arch/x86/include/asm/processor.h   |   2 +
 arch/x86/include/asm/thread_info.h |   5 +-
 arch/x86/include/uapi/asm/prctl.h  |   2 +
 arch/x86/kernel/fpu/core.c         |  10 ++-
 arch/x86/kernel/fpu/xstate.c       |   3 +-
 arch/x86/kernel/process.c          | 125 -
 arch/x86/kernel/process_64.c       |  16 ++---
 8 files changed, 152 insertions(+), 12 deletions(-)
Re: [PATCH RFC] stat.2: Document that stat can fail with EINTR
Hi Michael, I was hoping to get a clear statement one way or another from the kernel maintainers as to whether an EINTR from stat() is supposed to be allowed kernel behavior (hence the RFC in the subject). If it's not, then I don't think it should be documented, even if there are buggy filesystems that do it at the moment. So I'd say let's hold off on applying this until more people have had a chance to comment. If it would be more convenient for you, feel free to drop this from your patch queue and, if appropriate, I'll resend a non-RFC version of this patch for you to apply once a conclusion has been reached. On Mon, Dec 4, 2017 at 3:58 PM, Michael Kerrisk (man-pages) <mtk.manpa...@gmail.com> wrote: > Hello Keno > > On 12/03/2017 04:15 AM, Keno Fischer wrote: >> Resending as plain text (apologies for those receiving it twice, and >> those that got >> an HTML copy; I'm used to my mail client switching that over >> automatically, which >> for some reason didn't happen here). >> >> >> This is exactly the discussion I want to generate, so thank you. >> I should point out that I'm not advocating for anything other >> than clarity of what kernel behavior user space may assume. > > So, should the documentation patch be applied at this point, or dropped? > > Thanks, > > Michael > > >> On Sat, Dec 2, 2017 at 9:25 PM, Matthew Wilcox <wi...@infradead.org> wrote: >>> On Sat, Dec 02, 2017 at 07:23:59PM -0500, Keno Fischer wrote: >>>> The catalyst for this patch was me experiencing EINTR errors when >>>> using the 9p file system. In linux commit 9523feac, the 9p file >>>> system was changed to use wait_event_killable instead of >>>> wait_event_interruptible, which does indeed address my problem, >>>> but also makes me a bit unhappy, because uninterruptible waits >>>> prevent things like ^C'ing the execution and some debugging >>>> tools which depend on being able to cancel long-running operations >>>> by sending signals. >>> >>> Wait, wait, wait. killable is not uninterruptible.
It's "can accept >>> a signal if the signal is fatal". ie userspace will never see it. >>> So, no, it doesn't prevent ^C. It does prevent the debugging tool you're >>> talking about from working, because it's handling the signal, so it's not >>> fatal. >> >> This probably shows that I've been in REPL-based environments >> that catch SIGINT for too long ;). You are of course correct that a fatal SIGINT would >> still be delivered. >> >>>> I realize I'm probably 20 years too late here, but it feels like >>>> clarification on what to expect from the kernel would still go a long >>>> way here. >>> >>> A change to user-visible behaviour has to be opt-in. >> >> I agree. However, it was my impression that stat() can return EINTR >> depending on the file system. Prior to the referenced commit, >> this was certainly true on 9p, and I suspect it's not the only network file >> system for which this is true (though prior to my experiencing this >> with 9p, the only >> time I've ever experienced it was on HPC clusters with who knows what >> code providing the network filesystem). If it is indeed the case that >> an EINTR return from stat() and similar is illegal and should be considered >> a kernel bug, a statement to that effect is all I'm looking for here. >> > > > -- > Michael Kerrisk > Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/ > Linux/UNIX System Programming Training: http://man7.org/training/
Re: [PATCH RFC] stat.2: Document that stat can fail with EINTR
Resending as plain text (apologies for those receiving it twice, and those that got an HTML copy; I'm used to my mail client switching that over automatically, which for some reason didn't happen here). This is exactly the discussion I want to generate, so thank you. I should point out that I'm not advocating for anything other than clarity of what kernel behavior user space may assume. On Sat, Dec 2, 2017 at 9:25 PM, Matthew Wilcox <wi...@infradead.org> wrote: > On Sat, Dec 02, 2017 at 07:23:59PM -0500, Keno Fischer wrote: >> The catalyst for this patch was me experiencing EINTR errors when >> using the 9p file system. In linux commit 9523feac, the 9p file >> system was changed to use wait_event_killable instead of >> wait_event_interruptible, which does indeed address my problem, >> but also makes me a bit unhappy, because uninterruptible waits >> prevent things like ^C'ing the execution and some debugging >> tools which depend on being able to cancel long-running operations >> by sending signals. > > Wait, wait, wait. killable is not uninterruptible. It's "can accept > a signal if the signal is fatal". ie userspace will never see it. > So, no, it doesn't prevent ^C. It does prevent the debugging tool you're > talking about from working, because it's handling the signal, so it's not > fatal. This probably shows that I've been in REPL-based environments that catch SIGINT for too long ;). You are of course correct that a fatal SIGINT would still be delivered. >> I realize I'm probably 20 years too late here, but it feels like >> clarification on what to expect from the kernel would still go a long >> way here. > > A change to user-visible behaviour has to be opt-in. I agree. However, it was my impression that stat() can return EINTR depending on the file system.
Prior to the referenced commit, this was certainly true on 9p, and I suspect it's not the only network file system for which this is true (though prior to my experiencing this with 9p, the only time I've ever experienced it was on HPC clusters with who knows what code providing the network filesystem). If it is indeed the case that an EINTR return from stat() and similar is illegal and should be considered a kernel bug, a statement to that effect is all I'm looking for here.
[PATCH RFC] stat.2: Document that stat can fail with EINTR
Particularly on network file systems, a stat call may require submitting a message over the network and waiting interruptibly for a reply.

Signed-off-by: Keno Fischer <k...@juliacomputing.com>
---
The catalyst for this patch was me experiencing EINTR errors when using the 9p file system. In linux commit 9523feac, the 9p file system was changed to use wait_event_killable instead of wait_event_interruptible, which does indeed address my problem, but also makes me a bit unhappy, because uninterruptible waits prevent things like ^C'ing the execution and some debugging tools which depend on being able to cancel long-running operations by sending signals. I'd like to ask the user space applications I care about to properly handle such situations (either by using SA_RESTART or by explicitly handling EINTR), but it's a bit of a hard sell if EINTR isn't documented to be a possibility. I'm hoping this doc PATCH will generate a discussion of whether EINTR is an appropriate thing for stat (as a stand-in for a file system call that's not read/write) to return. If so, I'd be happy to submit patches to other file system-related syscalls along these same lines. I realize I'm probably 20 years too late here, but it feels like clarification on what to expect from the kernel would still go a long way here.

 man2/stat.2 | 5 +
 1 file changed, 5 insertions(+)

diff --git a/man2/stat.2 b/man2/stat.2
index dad9a01..f10235a 100644
--- a/man2/stat.2
+++ b/man2/stat.2
@@ -452,6 +452,11 @@ Invalid flag specified in
 is relative and
 .I dirfd
 is a file descriptor referring to a file other than a directory.
+.TP
+.B EINTR
+The call was interrupted by delivery of a signal caught by a handler; see
+.BR signal (7).
+The possibility of this error is file-system dependent.
 .SH VERSIONS
 .BR fstatat ()
 was added to Linux in kernel 2.6.16;
--
2.8.1
Re: Yes, people use FOLL_FORCE ;)
Hi Linus, > But it sounds like your JIT case actually uses it for writing - > but if you can write a small blurb about it, that would be nice. yes, we use it for writing. Happy to describe the scheme in more detail. > (b) it would probably be nice to limit FOLL_FORCE in general as much > as possible, so if your case is about writing to your very _own_ > memory mapping, as opposed to writing to another process' memory, > maybe we can do something like > > if (mm == current->mm) > flags |= FOLL_FORCE; > > which at least avoids the whole "let's change the VM in odd ways for a > process that isn't even me". While this would fix our current use case, we do have a use case for modifying non-local address space as well (putting the JIT into a different process). Similarly, the rr use case precisely uses the remote mm case. I think in general this feature is very useful for anybody who needs to precisely control the execution of some other process. Various debuggers (gdb/lldb/rr) certainly fall into that category, but there's another class of such processes (wine, various emulators) which may want to do that kind of thing. Now, I suspect most of these will have the other process under ptrace control, so maybe allowing (same_mm || ptraced) would be ok, but at least for the sandbox/remote-jit use case, it would be perfectly reasonable to not have the jit server be a ptracer.
Yes, people use FOLL_FORCE ;)
Hi Linus et al.,

In 8ee74a91 "proc: try to remove use of FOLL_FORCE entirely", you removed the punch-through semantics of /proc/<pid>/mem. We used these semantics as a hardening mechanism in the julia JIT. By opening /proc/self/mem and using these semantics, we could avoid needing RWX pages, or a dual mapping approach. We do have fallbacks to these other methods (though getting EIO here actually causes an assert in released versions - we'll update that to make sure to take the fallback in that case). Nevertheless, the /proc/self/mem approach was our favored approach because it a) required an attacker to be able to execute syscalls, which is a taller order than getting memory write, and b) didn't double the virtual address space requirements (as a dual mapping approach would). Now, while we're probably fine with using the fallbacks, I know there are others that rely on this behavior as well (cc'ing Robert O'Callahan of the rr project, for which this change will result in significant performance degradation). Also, judging by who complained last time FOLL_FORCE was broken, I suspect the Wine people are relying on this as well. Frankly, I'm a bit surprised that this change was made in the first place. Making a userspace-breaking change on mainline and seeing if anybody complains doesn't seem like the ideal way to find out if features are used. As I said, personally we can patch our software and deal with this, but I think a change like this deserves a bit wider discussion, so may I suggest a revert of this change for the time being? Maybe there can be a syslog warning such that people who use it will notice and have their say on the mailing list.

Thanks,
Keno
[PATCH v2] mm: Respect FOLL_FORCE/FOLL_COW for thp
In 19be0eaff ("mm: remove gup_flags FOLL_WRITE games from __get_user_pages()"), the mm code was changed from unsetting FOLL_WRITE after a COW was resolved to setting the (newly introduced) FOLL_COW instead. Simultaneously, the check in gup.c was updated to still allow writes with FOLL_FORCE set if FOLL_COW had also been set. However, a similar check in huge_memory.c was forgotten. As a result, remote memory writes to ro regions of memory backed by transparent huge pages cause an infinite loop in the kernel (handle_mm_fault sets FOLL_COW and returns 0 causing a retry, but follow_trans_huge_pmd bails out immediately because `(flags & FOLL_WRITE) && !pmd_write(*pmd)` is true). While in this state, the process is still SIGKILLable, but little else works (e.g. no ptrace attach, no other signals). This is easily reproduced with the following code (assuming thp are set to always):

#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define TEST_SIZE 5 * 1024 * 1024

int main(void)
{
	int status;
	pid_t child;
	int fd = open("/proc/self/mem", O_RDWR);
	void *addr = mmap(NULL, TEST_SIZE, PROT_READ,
			  MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
	assert(addr != MAP_FAILED);
	pid_t parent_pid = getpid();
	if ((child = fork()) == 0) {
		void *addr2 = mmap(NULL, TEST_SIZE, PROT_READ | PROT_WRITE,
				   MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
		assert(addr2 != MAP_FAILED);
		memset(addr2, 'a', TEST_SIZE);
		pwrite(fd, addr2, TEST_SIZE, (uintptr_t)addr);
		return 0;
	}
	assert(child == waitpid(child, &status, 0));
	assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);
	return 0;
}

Fix this by updating follow_trans_huge_pmd in huge_memory.c analogously to the update in gup.c in the original commit. The same pattern exists in follow_devmap_pmd. However, we should not be able to reach that check with FOLL_COW set, so add WARN_ONCE to make sure we notice if we ever do.
Signed-off-by: Keno Fischer <k...@juliacomputing.com>
---
Changes since v1:
 * In follow_devmap_pmd, WARN_ONCE if FOLL_COW is encountered, rather
   than allowing it, since that situation should not happen.
   As suggested by Kirill A. Shutemov

 mm/huge_memory.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 10eedbf..f7ec01d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -783,6 +783,12 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,

 	assert_spin_locked(pmd_lockptr(mm, pmd));

+	/*
+	 * When we COW a devmap PMD entry, we split it into PTEs, so we should
+	 * not be in this function with `flags & FOLL_COW` set.
+	 */
+	WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW set");
+
 	if (flags & FOLL_WRITE && !pmd_write(*pmd))
 		return NULL;

@@ -1127,6 +1133,16 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
 	return ret;
 }

+/*
+ * FOLL_FORCE can write to even unwritable pmd's, but only
+ * after we've gone through a COW cycle and they are dirty.
+ */
+static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
+{
+	return pmd_write(pmd) ||
+	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+}
+
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 				   unsigned long addr,
 				   pmd_t *pmd,
@@ -1137,7 +1153,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,

 	assert_spin_locked(pmd_lockptr(mm, pmd));

-	if (flags & FOLL_WRITE && !pmd_write(*pmd))
+	if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
 		goto out;

 	/* Avoid dumping huge zero page */
--
2.9.3
Re: [PATCH] mm: Respect FOLL_FORCE/FOLL_COW for thp
>> @@ -783,7 +793,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct >> *vma, unsigned long addr, >> >> assert_spin_locked(pmd_lockptr(mm, pmd)); >> >> - if (flags & FOLL_WRITE && !pmd_write(*pmd)) >> + if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags)) >> return NULL; > > I don't think this part is needed: once we COW devmap PMD entry, we split > it into PTE table, so IIUC we never get here with PMD. > > Maybe we should WARN_ONCE() if have FOLL_COW here. Sounds good to me. As I said, I don't have a testcase for this code path, I just noticed the same pattern. Will send an updated patch with the WARN_ONCE there.
[PATCH] mm: Respect FOLL_FORCE/FOLL_COW for thp
In 19be0eaff ("mm: remove gup_flags FOLL_WRITE games from __get_user_pages()"), the mm code was changed from unsetting FOLL_WRITE after a COW was resolved to setting the (newly introduced) FOLL_COW instead. Simultaneously, the check in gup.c was updated to still allow writes with FOLL_FORCE set if FOLL_COW had also been set. However, a similar check in huge_memory.c was forgotten. As a result, remote memory writes to ro regions of memory backed by transparent huge pages cause an infinite loop in the kernel (handle_mm_fault sets FOLL_COW and returns 0 causing a retry, but follow_trans_huge_pmd bails out immediately because `(flags & FOLL_WRITE) && !pmd_write(*pmd)` is true). While in this state, the process is still SIGKILLable, but little else works (e.g. no ptrace attach, no other signals). This is easily reproduced with the following code (assuming thp are set to always):

```
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define TEST_SIZE 5 * 1024 * 1024

int main(void)
{
	int status;
	pid_t child;
	int fd = open("/proc/self/mem", O_RDWR);
	void *addr = mmap(NULL, TEST_SIZE, PROT_READ,
			  MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
	assert(addr != MAP_FAILED);
	pid_t parent_pid = getpid();
	if ((child = fork()) == 0) {
		void *addr2 = mmap(NULL, TEST_SIZE, PROT_READ | PROT_WRITE,
				   MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
		assert(addr2 != MAP_FAILED);
		memset(addr2, 'a', TEST_SIZE);
		pwrite(fd, addr2, TEST_SIZE, (uintptr_t)addr);
		return 0;
	}
	assert(child == waitpid(child, &status, 0));
	assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);
	return 0;
}
```

Fix this by updating the instances in huge_memory.c analogously to the update in gup.c in the original commit. The same pattern existed in follow_devmap_pmd, so I have changed that location as well. However, I do not have a test case for that code path.
Signed-off-by: Keno Fischer <k...@juliacomputing.com>
---
 mm/huge_memory.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 10eedbf..84497a8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -773,6 +773,16 @@ static void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
 	update_mmu_cache_pmd(vma, addr, pmd);
 }

+/*
+ * FOLL_FORCE can write to even unwritable pmd's, but only
+ * after we've gone through a COW cycle and they are dirty.
+ */
+static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
+{
+	return pmd_write(pmd) ||
+	       ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+}
+
 struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
 		pmd_t *pmd, int flags)
 {
@@ -783,7 +793,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,

 	assert_spin_locked(pmd_lockptr(mm, pmd));

-	if (flags & FOLL_WRITE && !pmd_write(*pmd))
+	if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
 		return NULL;

 	if (pmd_present(*pmd) && pmd_devmap(*pmd))
@@ -1137,7 +1147,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,

 	assert_spin_locked(pmd_lockptr(mm, pmd));

-	if (flags & FOLL_WRITE && !pmd_write(*pmd))
+	if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
 		goto out;

 	/* Avoid dumping huge zero page */
--
2.9.3
Why do zombie processes' /proc entries have uid 0?
This is mostly out of curiosity, but I was surprised by the behavior, so I was hoping somebody might be able to explain why this behavior was chosen. In particular, consider any zombie process, e.g.

$ cat /proc/77078/status
Name:   test
State:  Z (zombie)
Tgid:   77078
Ngid:   0
Pid:    77078
PPid:   77077
TracerPid:      0
Uid:    1000    1000    1000    1000
Gid:    1000    1000    1000    1000
[...]

Now, this process has uid 1000, as does the /proc/<pid> directory:

$ stat /proc/77078
  File: '/proc/77078'
Access: (0555/dr-xr-xr-x)  Uid: ( 1000/    keno)   Gid: ( 1000/    keno)

but most files in /proc/<pid> are owned by root:

$ stat /proc/77078/status
  File: '/proc/77078/status'
Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)

Why is this? Why don't these files remain owned by the same uid as the process itself?

Keno
Bug in fs/gs_base PTRACE_SETREGS on pre-4.7 kernels
Hi Andy, this is more of a heads up than a bug report, since it turns out you already fixed this in 731e33e ("x86/arch_prctl/64: Remove FSBASE/GSBASE < 4G optimization"). In any case, without that commit, trying to use PTRACE_SETREGS to set either fs_base or gs_base to 0 when it was previously <4G, but wasn't 0, fails to take effect in the tracee. This is caused by the `if (child->thread.fs != value)` check in `ptrace.c:putreg`, which skips the `do_arch_prctl` call. Of course the problem here is that while the optimization is in place, `fs` is set to 0 but does not actually hold the fs base, so the call is incorrectly skipped. In any case, figured you may be interested that the commit changes behavior (for the better - not complaining ;), even if user code does not go out of its way to confuse ptrace. Keno
Re: [PATCH] um: Fix compile failure due to current_text_address() definition
Just as an FYI, the linker bug has been fixed in binutils. On Fri, Nov 11, 2016 at 5:07 PM, Richard Weinberger <rich...@nod.at> wrote: > On 11.11.2016 22:03, Keno Fischer wrote: >> Did you have CONFIG_INET set? I'm attaching my full .config. This is >> on vanilla Ubuntu 16.10. > > Yes, CONFIG_INET is set. Let my try on Ubuntu. ;-\ > >> I did see the same error when building with `CONFIG_STATIC_LINK=y`. >> Note that I also, separately, ran into a linker problem, though I >> believe it is unrelated to this patch >> (though perhaps is related to the problem you're seeing?): >> https://sourceware.org/bugzilla/show_bug.cgi?id=20800. > > This seems to be an UML<->glibc issue. > memmove() is now an ifunc and for whatever reason it does not work with UML. > >> I'd also be happy to provide you with ssh access to the machine that >> I'm seeing this on if that >> would be helpful. > > Okay, let me try myself first. I think I'm able to install Ubuntu. :) > > Thanks, > //richard
[PATCH v2] gpio: Remove GPIO_DEVRES option
This option was added in 6a89a314ab107a12af08c71420c19a37a30fc2d3 to allow use of the devm_gpio_* functions without CONFIG_GPIOLIB. However, only a few months later, in b69ac52449c658b7ac40034dc3c5f5f4a71a723d, CONFIG_GPIOLIB was added as a dependency, defeating the original purpose of this option. Instead of that patch, the original commit could have just been reverted (and in fact was partially reverted in 403c1d0be5ccbd750d25c59d8358843a81e52e3b). Further, since this option depends on HAS_IOMEM even though it does not require it, it causes build failures when !HAS_IOMEM (e.g. in a uml build). Fix that by completely removing the option, in essence completing the reversion of the original commit. --- In the original version of this patch (http://marc.info/?l=linux-gpio=147874300313315=2), I had kept the option and just fixed the build failure. However, Linus Walleij pointed out (and the git history agrees) that this option is now obsolete and should just be removed. Also, while here, I should note that there was a `might_sleep();` in the stub for `devm_gpio_free` that was not reintroduced when the stub was brought back. Not sure whether this makes a difference, but it felt worth pointing out to the maintainers.
 drivers/gpio/Kconfig  | 4 ----
 drivers/gpio/Makefile | 2 +-
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig
index d011cb8..ed37e59 100644
--- a/drivers/gpio/Kconfig
+++ b/drivers/gpio/Kconfig
@@ -22,10 +22,6 @@ menuconfig GPIOLIB

 if GPIOLIB

-config GPIO_DEVRES
-	def_bool y
-	depends on HAS_IOMEM
-
 config OF_GPIO
 	def_bool y
 	depends on OF
diff --git a/drivers/gpio/Makefile b/drivers/gpio/Makefile
index ab28a2d..d074c22 100644
--- a/drivers/gpio/Makefile
+++ b/drivers/gpio/Makefile
@@ -2,7 +2,7 @@

 ccflags-$(CONFIG_DEBUG_GPIO)	+= -DDEBUG

-obj-$(CONFIG_GPIO_DEVRES)	+= devres.o
+obj-$(CONFIG_GPIOLIB)		+= devres.o
 obj-$(CONFIG_GPIOLIB)		+= gpiolib.o
 obj-$(CONFIG_GPIOLIB)		+= gpiolib-legacy.o
 obj-$(CONFIG_OF_GPIO)		+= gpiolib-of.o
--
2.9.3
Re: [PATCH] um: Fix compile failure due to current_text_address() definition
Did you have CONFIG_INET set? I'm attaching my full .config. This is on vanilla Ubuntu 16.10. I did see the same error when building with `CONFIG_STATIC_LINK=y`. Note that I also, separately, ran into a linker problem, though I believe it is unrelated to this patch (though perhaps it is related to the problem you're seeing?): https://sourceware.org/bugzilla/show_bug.cgi?id=20800. I'd also be happy to provide you with ssh access to the machine that I'm seeing this on if that would be helpful.

[Attachment: config - binary data]
Re: [PATCH] um: Fix compile failure due to current_text_address() definition
On Thu, Nov 10, 2016 at 3:19 PM, Richard Weinberger wrote: > Can you please reply to Sebastian's patch series and explain him how you > trigger > that error? > I don't have a gcc broken by Debian on my machine right now. I'm not sure how to reply to his patch series directly since I'm not subscribed to lkml (I imagine I can set headers manually, such that things get appropriately threaded, but for now let me just reply here). I don't think there's much to it. It seems like any use of `current_text_addr()` in an `ARCH=um` build will cause this problem, there's just not very many such uses in the code base. I believe this particular problem should be reproducible as long as CONFIG_INET is set for an `ARCH=um` build.
Re: [PATCH] um: Fix compile failure due to current_text_address() definition
Yes On Thu, Nov 10, 2016 at 3:14 PM, Richard Weinberger <rich...@nod.at> wrote: > Keno, > > On 10.11.2016 21:10, Keno Fischer wrote: >>> The problem is ready being solved in a generic way: >>> http://marc.info/?l=linux-kernel=147828481602561=2 >>> >>> Can you please give this patch a try? >> >> No dice. After backing out my patch and applying that one I get: >> >> /usr/bin/ld: error: net/built-in.o: requires unsupported dynamic reloc >> 11; recompile with -fPIC > > But you applied the whole series, right? > > Thanks, > //richard
Re: [PATCH] um: Fix compile failure due to current_text_address() definition
> The problem is ready being solved in a generic way: > http://marc.info/?l=linux-kernel=147828481602561=2 > > Can you please give this patch a try? No dice. After backing out my patch and applying that one I get: /usr/bin/ld: error: net/built-in.o: requires unsupported dynamic reloc 11; recompile with -fPIC
[PATCH] gpio: Guard devm_* functions behind CONFIG_GPIO_DEVRES not CONFIG_GPIOLIB
These functions are defined in devres.c, which only gets compiled with CONFIG_GPIO_DEVRES (in addition to CONFIG_GPIOLIB). However, in the header files, the difference between the declaration and the inline stub was only guarded by CONFIG_GPIOLIB, not CONFIG_GPIO_DEVRES, causing undefined symbol problems in modpost. Signed-off-by: Keno Fischer <k...@juliacomputing.com> --- I encountered this while trying to build uml in an attempt to debug some kernel behavior I don't understand. To be as close to my actual kernel as possible, I used the same .config, which of course tried to build a bunch of drivers. Arguably I should just not build those, but this seems correct nonetheless and allows the build to go through. include/linux/gpio.h | 28 +++--- include/linux/gpio/consumer.h | 193 ++ 2 files changed, 117 insertions(+), 104 deletions(-) diff --git a/include/linux/gpio.h b/include/linux/gpio.h index d12b5d5..7a1b979 100644 --- a/include/linux/gpio.h +++ b/include/linux/gpio.h @@ -77,15 +77,6 @@ static inline int irq_to_gpio(unsigned int irq) #endif /* ! CONFIG_ARCH_HAVE_CUSTOM_GPIO_H */ -/* CONFIG_GPIOLIB: bindings for managed devices that want to request gpios */ - -struct device; - -int devm_gpio_request(struct device *dev, unsigned gpio, const char *label); -int devm_gpio_request_one(struct device *dev, unsigned gpio, - unsigned long flags, const char *label); -void devm_gpio_free(struct device *dev, unsigned int gpio); - #else /* ! CONFIG_GPIOLIB */ #include @@ -253,6 +244,23 @@ gpiochip_remove_pin_ranges(struct gpio_chip *chip) WARN_ON(1); } +#endif /* ! 
CONFIG_GPIOLIB */ + +#ifdef CONFIG_GPIO_DEVRES + +/* CONFIG_GPIOLIB: bindings for managed devices that want to request gpios */ + +struct device; + +int devm_gpio_request(struct device *dev, unsigned gpio, const char *label); +int devm_gpio_request_one(struct device *dev, unsigned gpio, + unsigned long flags, const char *label); +void devm_gpio_free(struct device *dev, unsigned int gpio); + +#else + +struct device; + static inline int devm_gpio_request(struct device *dev, unsigned gpio, const char *label) { @@ -272,6 +280,6 @@ static inline void devm_gpio_free(struct device *dev, unsigned int gpio) WARN_ON(1); } -#endif /* ! CONFIG_GPIOLIB */ +#endif #endif /* __LINUX_GPIO_H */ diff --git a/include/linux/gpio/consumer.h b/include/linux/gpio/consumer.h index fb0fde6..3e84311 100644 --- a/include/linux/gpio/consumer.h +++ b/include/linux/gpio/consumer.h @@ -70,28 +70,6 @@ struct gpio_descs *__must_check gpiod_get_array_optional(struct device *dev, void gpiod_put(struct gpio_desc *desc); void gpiod_put_array(struct gpio_descs *descs); -struct gpio_desc *__must_check devm_gpiod_get(struct device *dev, - const char *con_id, - enum gpiod_flags flags); -struct gpio_desc *__must_check devm_gpiod_get_index(struct device *dev, - const char *con_id, - unsigned int idx, - enum gpiod_flags flags); -struct gpio_desc *__must_check devm_gpiod_get_optional(struct device *dev, - const char *con_id, - enum gpiod_flags flags); -struct gpio_desc *__must_check -devm_gpiod_get_index_optional(struct device *dev, const char *con_id, - unsigned int index, enum gpiod_flags flags); -struct gpio_descs *__must_check devm_gpiod_get_array(struct device *dev, -const char *con_id, -enum gpiod_flags flags); -struct gpio_descs *__must_check -devm_gpiod_get_array_optional(struct device *dev, const char *con_id, - enum gpiod_flags flags); -void devm_gpiod_put(struct device *dev, struct gpio_desc *desc); -void devm_gpiod_put_array(struct device *dev, struct gpio_descs *descs); - int 
gpiod_get_direction(struct gpio_desc *desc); int gpiod_direction_input(struct gpio_desc *desc); int gpiod_direction_output(struct gpio_desc *desc, int value); @@ -136,9 +114,7 @@ struct fwnode_handle; struct gpio_desc *fwnode_get_named_gpiod(struct fwnode_handle *fwnode, const char *propname); -struct gpio_desc *devm_get_gpiod_from_child(struct device *dev, - const char *con_id, - struct fwnode_handle *child); + #else /* CONFIG_GPIOLIB */ static inline int gpiod_count(struct device *dev, const char *con_id) @@ -205,69 +181,6 @@ static inline void gpiod_put_array(struct gpio_descs *descs) WAR
[PATCH] um: Fix compile failure due to current_text_address() definition
Fixes the following link error:

```
/usr/bin/ld: net/built-in.o: relocation R_X86_64_32S against `.text' can not be used when making a shared object; recompile with -fPIC
```

This is the same definition used on some other architectures.

Signed-off-by: Keno Fischer <k...@juliacomputing.com>
---
I am not sure this is the correct patch in the context of uml. I believe this should give the runtime ip, which may be different between runs. It may be better to use the offset in .text (e.g. by using `pc - __text_start`), which should be consistent.

 arch/x86/um/asm/processor_64.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/um/asm/processor_64.h b/arch/x86/um/asm/processor_64.h
index c3be852..6ca3304 100644
--- a/arch/x86/um/asm/processor_64.h
+++ b/arch/x86/um/asm/processor_64.h
@@ -32,7 +32,7 @@ static inline void arch_copy_thread(struct arch_thread *from,
 }

 #define current_text_addr() \
-	({ void *pc; __asm__("movq $1f,%0\n1:":"=g" (pc)); pc; })
+	({ __label__ _l; _l: &&_l; })

 #define current_sp() ({ void *sp; __asm__("movq %%rsp, %0" : "=r" (sp) : ); sp; })
 #define current_bp() ({ unsigned long bp; __asm__("movq %%rbp, %0" : "=r" (bp) : ); bp; })
--
2.9.3
Use of r10 in powerpc syscall entry
Hi Anton,

I was reading the powerpc syscall entry code and git points me at your commit 05b05f28 ("powerpc: Relocatable system call no longer uses the LR") for one part that confused me, so I hope you don't mind a quick question. In particular, that commit removed the use of r10 to restore lr, but didn't touch the earlier `mflr r10` that actually saves the lr to r10. Is r10 still required there for some reason, or is that just left over? Part of the reason I'm asking is that it seems one could use r10 instead of r13 later, i.e.

```
 #define SYSCALL_PSERIES_2_DIRECT				\
-	mflr	r10 ;						\
 	ld	r12,PACAKBASE(r13) ;				\
 	LOAD_HANDLER(r12, system_call_entry) ;			\
 	mtctr	r12 ;						\
 	mfspr	r12,SPRN_SRR1 ;					\
-	/* Re-use of r13... No spare regs to do this */		\
-	li	r13,MSR_RI ;					\
-	mtmsrd	r13,1 ;						\
+	li	r10, MSR_RI ;					\
+	mtmsrd	r10,1 ;						\
-	GET_PACA(r13) ; /* get r13 back */			\
 	bctr ;
```

Also only semi-relatedly, I was curious (if you, or anybody reading, happen to know) why SRR0 and SRR1 are being moved to registers so early in the interrupt handler. I had speculated that this was because of potential page faults on memory access that would clobber those registers, but then I noticed the `ld r12,PACAKBASE(r13)` before `mfspr r12,SPRN_SRR1`, which seemed like it could touch memory, so I was confused again.

Hope the questions make sense, and sorry if I missed something obvious - I have very little experience with ppc.

Thanks,
Keno
Re: ptrace group stop signal number not reset before PTRACE_INTERRUPT is delivered?
Hi Oleg, I have another obscure ptrace question which seems somewhat related, so let me ask it here. Consider this:

```
#include <assert.h>
#include <signal.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

static int sigchld_counter = 0;
void sigchld_handler(int sig) {
    (void)sig;
    sigchld_counter++;
}

int main(void) {
    signal(SIGCHLD, sigchld_handler);
    pid_t child;
    if (0 == (child = fork())) {
        raise(SIGSTOP);
        assert(0); // Should never be continued
    }
    // Wait until stopped
    int status;
    pid_t wret = waitpid(child, &status, __WALL | WSTOPPED);
    assert(wret == child);
    assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP);
    assert(sigchld_counter == 1);
    // Now PTRACE_SEIZE the child
    long err = ptrace(PTRACE_SEIZE, child, NULL, (void*)PTRACE_O_TRACESYSGOOD);
    assert(err == 0);
    // Make sure that didn't cause a notification
    wret = waitpid(child, &status, __WALL | WSTOPPED | WNOHANG);
    assert(wret == 0);
    assert(sigchld_counter == 1);
}
```

I wouldn't have expected the PTRACE_SEIZE to generate another SIGCHLD/be waitable again (the last two assertions fail). Is that supposed to happen? If so, I'd like to update the man page to mention this behavior, but I wanted to check with you first. Thanks, Keno

On Tue, Aug 23, 2016 at 11:34 AM, Oleg Nesterov <o...@redhat.com> wrote: > On 08/18, Keno Fischer wrote: >> >> On Thu, Aug 18, 2016 at 12:23 PM, Oleg Nesterov <o...@redhat.com> wrote: >> > >> > And you if you get PTRACE_EVENT_STOP and WSTOPSIG() == SIGTTIN after >> > PTRACE_INTERRUPT, you know that the tracee did not report the "new" >> > SIGTTIN. >> >> It seems possible to remember whether or not we injected a stopping >> signal and if so the next PTRACE_EVENT_STOP is a group-stop, otherwise >> a PTRACE_INTERRUPT stop. Currently what I do is the other way around, >> after issuing PTRACE_INTERRUPT, the first (if any) of the next two >> stops that is a PTRACE_EVENT_STOP get interpreted as a >> PTRACE_INTERRUPT stop.
I haven't thought through this fully yet, so I >> can't give you a concrete example I worried about, it just seems >> fragile compared to just checking whether WSTOPSIG() == SIGTRAP. > > Yes, I see your point. And to remind, I was confused too. > > Perhaps we can add another THIS_SIGNAL_WAS_ALREADY_REPORTED bit, but > you know, I'd prefer to avoid another subtle change in behaviour. You > can never know if it is "safe" or not when it comes to ptrace, perhaps > some application already relies on this WSTOPSIG(). > > Oleg. >
How does the size field work in IOCTL numbers?
Hi folks, this is more of a general linux question, but since I noticed it while looking at perf_events code, I'm ccing perf_events folks in case the answer is perf_events specific (hope that's ok). uapi/linux/perf_event.h has the following:

```
#define PERF_EVENT_IOC_PERIOD      _IOW('$', 4, __u64)
#define PERF_EVENT_IOC_SET_OUTPUT  _IO ('$', 5)
#define PERF_EVENT_IOC_SET_FILTER  _IOW('$', 6, char *)
#define PERF_EVENT_IOC_ID          _IOR('$', 7, __u64 *)
#define PERF_EVENT_IOC_SET_BPF     _IOW('$', 8, __u32)
```

Now, my question is how to interpret the last argument to the _IO* macros. The man page for ioctl(2) says it encodes the "size of the argument argp in bytes", but I'm not sure how to interpret that. For example, IOC_PERIOD takes a pointer to a 64-bit value, so the __u64 makes sense to me. But as far as I know, IOC_ID also takes a pointer to a 64-bit value (and writes the ID to it), yet it has `__u64 *` rather than `__u64`. Then we have IOC_SET_FILTER, which as far as I know takes a pointer to a variable-length string (but with the above definition the size field is sizeof(char *)), and IOC_SET_BPF, which doesn't take a pointer at all, but interprets the argument as a file descriptor (similar to IOC_SET_OUTPUT, which doesn't have a size at all). I don't understand what the rule is for what to put in that third argument - or is it ioctl specific? Please let me know if I missed something. Thanks, Keno
Re: ptrace group stop signal number not reset before PTRACE_INTERRUPT is delivered?
On Thu, Aug 18, 2016 at 12:23 PM, Oleg Nesterov wrote: > > And you if you get PTRACE_EVENT_STOP and WSTOPSIG() == SIGTTIN after > PTRACE_INTERRUPT, you know that the tracee did not report the "new" > SIGTTIN. It seems possible to remember whether or not we injected a stopping signal: if so, the next PTRACE_EVENT_STOP is a group-stop, otherwise a PTRACE_INTERRUPT stop. Currently what I do is the other way around: after issuing PTRACE_INTERRUPT, the first (if any) of the next two stops that is a PTRACE_EVENT_STOP gets interpreted as a PTRACE_INTERRUPT stop. I haven't thought through this fully yet, so I can't give you a concrete example I worried about; it just seems fragile compared to just checking whether WSTOPSIG() == SIGTRAP.
ptrace group stop signal number not reset before PTRACE_INTERRUPT is delivered?
Consider the test case below (not all of it is necessary for reproducing the behavior in question, but I wanted to cover related cases as well to make sure they behave as expected). In this test case, the last group-stop (after PTRACE_INTERRUPT) is delivered with a WSTOPSIG(status) of SIGTTIN, which was the signr of the previous group-stop. From reading the man page, I would have expected SIGTRAP. Now, I understand that if there is another stop pending, PTRACE_INTERRUPT will simply piggy-back off that one, but I don't believe that is happening in this case. Further, the current behavior seems to make it very hard (impossible?) to reliably tell a true group-stop from a PTRACE_INTERRUPT-generated one. Is this behavior intended? If so, I would be happy to update the man page accordingly, but would like some guidance on what the intended semantics are.

```
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>

int main() {
    pid_t child, ret;
    int err;
    int status;
    if (0 == (child = fork())) {
        kill(getpid(), SIGSTOP);
        kill(getpid(), SIGSTOP);
        kill(getpid(), SIGTTIN);
        sleep(1000);
        exit(0);
    }
    ret = waitpid(child, &status, WSTOPPED);
    assert(ret == child);
    err = ptrace(PTRACE_SEIZE, child, NULL, NULL);
    assert(err == 0);
    err = ptrace(PTRACE_CONT, child, NULL, NULL);
    assert(err == 0);
    // Should now hit SIGSTOP signal-stop
    ret = waitpid(child, &status, 0);
    assert(ret == child && WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP);
    err = ptrace(PTRACE_CONT, child, NULL, (void*)SIGSTOP);
    assert(err == 0);
    // Should now hit SIGSTOP group-stop
    ret = waitpid(child, &status, 0);
    assert(ret == child && (status>>16 == PTRACE_EVENT_STOP) && WSTOPSIG(status) == SIGSTOP);
    err = ptrace(PTRACE_CONT, child, NULL, NULL);
    assert(err == 0);
    // Should now hit SIGTTIN signal-stop
    ret = waitpid(child, &status, 0);
    assert(ret == child && WIFSTOPPED(status) && WSTOPSIG(status) == SIGTTIN);
    err = ptrace(PTRACE_CONT, child, NULL, (void*)SIGTTIN);
    assert(err == 0);
    // Should now hit SIGTTIN group-stop
    ret = waitpid(child, &status, 0);
    assert(ret == child && (status>>16 == PTRACE_EVENT_STOP) && WSTOPSIG(status) == SIGTTIN);
    err = ptrace(PTRACE_CONT, child, NULL, NULL);
    assert(err == 0);
    // Now interrupt it
    err = ptrace(PTRACE_INTERRUPT, child, NULL, NULL);
    assert(err == 0);
    // Should now hit interrupt group-stop
    ret = waitpid(child, &status, 0);
    printf("Interrupt group-stop delivered with signal %d\n", WSTOPSIG(status));
    assert(ret == child && (status>>16 == PTRACE_EVENT_STOP) && WSTOPSIG(status) == SIGTRAP);
    exit(0);
}
```