Re: [PATCH 0/3 v2] arm64/ptrace: allow to get all registers on syscall traps

2021-02-08 Thread Keno Fischer
>
> Could you describe the problem in more details? I wonder whether we have
> the same thing in CRIU...
>

Sure, basically the issue is that orig_x0 gets recorded at the start of the
syscall entry, but is not otherwise part of the ptrace state. It used
to be primarily used for resetting the argument back to its original
value during syscall restarts, but I see it's been expanded slightly
with the user dispatch mechanism (though as far as I can tell
not in a way that interacts with ptrace). Basically the problem
is that if you change the value of x0 during either the ptrace
entry stop or a seccomp stop and then incur a syscall restart,
the syscall will restart with the original x0 rather than with
the modified x0, which may be unexpected. Of course,
relatedly, if you're doing CRIU-like things you can end up
in situations where the future behavior will depend on the
orig_x0 value, which isn't restore-able at the moment. It's
possible to work around all of this by keeping a local copy
of orig_x0 and being very careful with the ptrace traps around
restarts, but getting the logic right is extremely tricky. My
suggestion for what I thought would be reasonable
behavior was:

1. Expose orig_x0 to ptrace
2. Set orig_x0 to x0 and set x0 to -ENOSYS at the start of the syscall
dispatcher
3. Use orig_x0 for syscall arguments/seccomp/restarts

That's basically how rax works on x86_64 and it doesn't
seem to cause major problems (though of course I may
be biased by having x86_64 already work when I started the
aarch64 port). Just the first item would be sufficient of course
for getting rid of most of the bookkeeping. I should also say
that, for us, the ptrace getregs call can be the throughput
limiting operation, so it would be nice if getting the entire
basic register set would only require one syscall. I won't
insist on it, since we do have a solution in place that kinda
works (and only requires the one syscall),
but I thought I'd mention it.

While we're on this topic, and in case it's helpful to anybody,
I should also point out that the order of the ptrace-signal-stop,
vs setup for the syscall restart differs between x86 and
aarch64 (aarch64 sets up the restart first then delivers the
ptrace trap/signal - x86 the other way around). I actually
think the aarch64 behavior is saner here, but I figured I'd
leave this breadcrumb for anybody who's writing a ptracer
and stumbles across this.

Keno


Re: [PATCH 0/3 v2] arm64/ptrace: allow to get all registers on syscall traps

2021-02-01 Thread Keno Fischer
Hi Andrei,

> This series introduces the PTRACE_O_ARM64_RAW_REGS option. If it is set,
> PTRACE_GETREGSET returns values of all registers, and PTRACE_SETREGSET
> allows to change any of them.

thanks for picking this up. I meant to work on this, but unfortunately ran out
of time to be able to push it through, so I'm glad you're working on
it, since it
does absolutely need to get fixed. Besides this issue, the other problem we
ran into when trying to port our ptracer to aarch64 is that orig_x0 is not
accessible through the ptrace interface on aarch64, which can cause tricky
behavior around restarts. We managed to work around that in the end,
but it's painful. If we're fixing the kernel here anyway, I'm wondering if
we might want to address that as well while we're at it.

Keno


brk checks in PR_SET_MM code

2020-12-16 Thread Keno Fischer
Hi all,

The code in prctl(PR_SET_MM, ...) performs a number of sanity checks,
among them

```
/*
 * @brk should be after @end_data in traditional maps.
 */
if (prctl_map->start_brk <= prctl_map->end_data ||
prctl_map->brk <= prctl_map->end_data)
goto out;
```

The original commit that introduces this check
(f606b77f1a9e362451aca8f81d8f36a3a112139e) says:

```
4) As in regular Elf loading procedure we require that @start_brk and
   @brk be greater than @end_data.
```

However, it does not appear that this invariant is actually
enforced during regular ELF loading. In particular, at least on my
linux distribution, it does not appear to be satisfied when
invoking the dynamic linker directly.
For example, consider the following test application:

```
#include 
#include 
#include 

int main(void) {
int err = prctl(PR_SET_MM, PR_SET_MM_BRK, sbrk(0), 0, 0);
assert(err == 0);
return 0;
}
```
```
$ su
# ./a.out
# /lib64/ld-linux-x86-64.so.2 ./a.out
a.out: test.c:7: main: Assertion `err == 0' failed.
Aborted
```

I don't understand this code well enough to know what the
intended behavior is, but unfortunately this causes some
processes to be non-restorable using the PR_SET_MM
mechanism, which defeats the whole purpose of that API.
Could somebody clarify whether this situation is indeed
supposed to be impossible and if not whether said checks
in PR_SET_MM are actually supposed to be there?
I suppose this is also technically a regression when the
old PR_SET_MM commands were refactored to use this
new validation. Previously only the commands that changed
the brk validated this invariant, but these days it tries
to validate the entire structure at once, so all the PR_SET_MM
calls will fail in a process whose layout violates the sanity
check.

Thanks,
Keno


Re: ptrace: seccomp: Return value when the call was already invalid

2020-07-03 Thread Keno Fischer
> > Now, if we have a seccomp filter that simply does
> > SECCOMP_RET_TRACE, and a ptracer that simply
> > does PTRACE_CONT
>
> Ok, so this means that we're _skipping_ the system call, right?

If the system call were positive this would result in the system call
being executed. The notion of "skipping" the syscall is a bit odd in
this situation. Having the ptracer set the syscallno to -1 is generally
accepted as the way to do it, but what happens if the syscallno is
already -1 or negative is underspecified.

> > then the assert will fire/fail on arm64, but not on x86_64.
>
> It feels weird to me that skipping the system call has any effect on the
> tracee registers...

I think the correct way to frame it is to ask whether the behavior
matches that of the tracee in absence of the ptracer. I would argue
that if the ptracer doesn't explicitly modify register contents, then
the tracee shouldn't observe any behavior difference.

> > Interestingly, arm64 does do something different
> > if the syscall is -1 rather than -10, where early
> > in the ptrace stop it does.
> > ```
> > /* set default errno for user-issued syscall(-1) */
> > if (scno == NO_SYSCALL)
> > regs->regs[0] = -ENOSYS;
>
> ... so I think this should be fixed too. How about the diff below?

I think the patch behavior is better overall, but I'm not sure it's ideal.
I think the biggest question is what the behavior should be here and
if we want a behavioral difference between *the syscall was -1 at entry*
and *the syscall was -1 because the ptracer wanted to skip the syscall*.
I think there is a bit of a semantic disconnect because "skipping" the
syscall is not really an operation that the ptracer has at its disposal
(unless it's using SYSEMU of course). The only thing it can do is set
the syscall to -1. However, arguably that already has semantics
(of returning -ENOSYS), so it's not at all clear to me that we should
deviate from that. Unfortunately, none of this is currently consistent
across architectures, so I think before we go changing arm64, we
should decide what we'd like to happen in theory and then see
what we can do to improve the situation without being too breaking.

Keno


Re: arm64: Register modification during syscall entry/exit stop

2020-06-01 Thread Keno Fischer
On Mon, Jun 1, 2020 at 5:23 AM Dave Martin  wrote:
> > > Can't PTRACE_SYSEMU be emulated by using PTRACE_SYSCALL, cancelling the
> > > syscall at the syscall enter stop, then modifying the regs at the
> > > syscall exit stop?
> >
> > Yes, it can. The idea behind SYSEMU is to be able to save half the
> > ptrace traps that would require, in theory making the ptracer
> > a decent amount faster. That said, the x7 issue is orthogonal to
> > SYSEMU, you'd have the same issues if you used PTRACE_SYSCALL.
>
> Right, I just wondered whether there was some deeper difference between
> the two approaches.

You're asking about a new regset vs trying to do it via ptrace option?
I don't think there's anything a ptrace option can do that a new regset
that replicates the same registers (I'm gonna propose adding orig_x0,
while we're at it and changing the x0 semantics a bit, will have
those details with the patch) wouldn't be able to do . The reason I
originally thought it might have to be a ptrace option is because
the register modification currently gets applied in the syscall entry
code to the actual regs struct, so I thought you might have to know
to preserve those registers. However, then I realized that you could
just change the regset accessors to emulate the old behavior, since
we do already store all the required information (what kind of stop
we're currently at) in order to be able to answer the ptrace
informational queries. So doing that it probably just all around
easier. I guess NT_PRSTATUS might also rot, but I guess strace
doesn't really have to stop using it, since it doesn't care about
the x7 value nor does it need to modify it.

Keno


Re: arm64: Register modification during syscall entry/exit stop

2020-06-01 Thread Keno Fischer
On Mon, Jun 1, 2020 at 5:14 AM Dave Martin  wrote:
> Can you explain why userspace would write a changed value for x7
> but at the same time need that new to be thrown away?

The discarding behavior is the primary reason things aren't completely
broken at the moment. If it read the wrong x7 value and didn't know about
the Aarch64 quirk, it's often just trying to write that same wrong
value back during the next stop, so if that's just ignored,
that's probably fine in 99% of cases, since the value in the
tracee will be undisturbed.

I don't think there's a sane way to change the aarch64 NT_PRSTATUS
semantics without just completely removing the x7 behavior, but of course
people may be relying on that (I think somebody said upthread that strace does?)

Keno


Re: arm64: Register modification during syscall entry/exit stop

2020-05-31 Thread Keno Fischer
> Can't PTRACE_SYSEMU be emulated by using PTRACE_SYSCALL, cancelling the
> syscall at the syscall enter stop, then modifying the regs at the
> syscall exit stop?

Yes, it can. The idea behind SYSEMU is to be able to save half the
ptrace traps that would require, in theory making the ptracer
a decent amount faster. That said, the x7 issue is orthogonal to
SYSEMU, you'd have the same issues if you used PTRACE_SYSCALL.


Keno


Re: arm64: Register modification during syscall entry/exit stop

2020-05-31 Thread Keno Fischer
> Keno -- are you planning to send out a patch? You previously spoke about
> implementing this using PTRACE_SETOPTIONS.

Yes, I'll have a patch for you. Though I've come to the conclusion
that introducing a new regset is probably a better way to solve it.
We can then also expose orig_x0 at the same time and give it sane semantics
(there's some problems with the way it works currently - I'll write it up
together with the patch).


Keno


mm: Behavior of process_vm_* with short local buffers

2020-05-24 Thread Keno Fischer
Hi everyone,

I'm in the process of trying to port a debugging tool (http://rr-project.org/)
from x86 to various other architectures. This tool relies on noting every
change that was made to the memory of the process being debugged.
As such, it has a battery of tests for corner cases of copyin/out and it
is one of these that I saw behaving strangely when ported to non-x86
architectures. This particular test was testing the behavior of
process_vm_readv (and writev, but for simplicity, let's assume readv here)
with short local buffers.

On x86 if the buffer is short and the following page is unmapped,
the syscall will fill the remainder of the page, and
then return however many bytes it actually wrote. However, on other
architectures (I mostly looked at arm64, though the same applies
elsewhere), the behavior can be quite different.
In general, the behavior depends strongly on factors like how close to
the start of the copy region the page break occurs, how many bytes
were supposed to be left after the page break and the total size of
the region to be copied. In various situations, I'm seeing:

- Writes that end many bytes before the page break
- Bytes being modified beyond what the syscall result would indicate
happened.
- Combinations thereof

I can work around this in my port, but I thought it might be valuable
to ask where the line is between "architecture-defined behavior" and
a bug that should be reported to the appropriate architecture
maintainers and eventually fixed. For example, I think it would be
nice if the syscall result actually did match the actual number of
bytes written in all cases.

I've written a small program [1] that sets up this situation for various
parameter values and prints the results. I have access to arm64,
powerpc and x86, so I included results for those architectures,
but I suspect other architectures have similar issues. The
program should be easy to run to get your own results for
a different architecture.

[1] https://gist.github.com/Keno/b247bca85219c4e3bdde9f7d7ff36c77

Thanks, Keno


Re: arm64: Register modification during syscall entry/exit stop

2020-05-24 Thread Keno Fischer
Just ran into this issue again, with what I think may be most compelling
example yet why this is problematic:

The tracee incurred a signal, we PTRACE_SYSEMU'd to the rt_sigreturn,
which the tracer tried to emulate by applying the state from the signal frame.
However, the PTRACE_SYSEMU stop is a syscall-stop, so the tracer's write
to x7 was ignored and x7 retained the value it had in the signal handler,
which broke the tracee.

Keno

On Sat, May 23, 2020 at 1:35 AM Keno Fischer  wrote:
>
> I got bitten by this again, so I decided to write up a simple example
> that shows the problem:
>
> https://gist.github.com/Keno/cde691b26e32373307fb7449ad305739
>
> This runs the same child twice. First vanilla where it prints "Hello world".
> The second time, using a textbook ptrace example, to only print "world".
> The problem here is that by the time the ptracer gets around to restoring
> the registers, it's no longer in a syscall stop, so the write to x7 does not
> get ignored and the correct value of x7 gets clobbered.
> I copied the syscall definition from musl, so the compiler thinks x7 is
> live, and we can see an assertion.
>
> Output on my machine (will depend on compiler version, etc.):
> ```
> $ gcc -g3 -O3 ptrace_lies.c
> $ ./a.out
> Hello World
> World
> a.out: ptrace_lies.c:49: do_child: Assertion `v3 == values[2]' failed.
> a.out: ptrace_lies.c:134: main: Assertion `WIFEXITED(status) &&
> WEXITSTATUS(status) == 0' failed.
> Aborted (core dumped)
> ```
>
> However, I don't think that whether or not the compiler thinks that x7 is
> live is the problem here. The problem is entirely that this mechanism
> prevents the ptracer from precisely controlling the register state. While
> basic ptracers don't need this feature (strace),
> more advanced ptracers (think criu, etc.) absolutely do want to precisely
> control what the register state is.
> The ptracer I'm working on (https://rr-project.org/)
> happens to be an extreme case of this, where it wants *bitwise* equivalent
> register states such that it can run the same code many times and get
> the exact same results.
>
> Also, if the issue was just that the kernel clobbered x7, that would be fine
> we could deal with that no problem. However, it's much worse than that,
> because the behavior of the kernel with respect to x7 depends on what
> kind of ptrace stop we're in and even worse, in some kinds of stop,
> there's absolutely no way to get at the actual value of x7.
>
> > Hmm, does that actually result in the SVC instruction getting inlined? I
> > think that's quite dangerous, since we document that we can trash the SVE
> > register state on a system call, for example. I'm also surprised that
> > the register variables are honoured by compilers if that inlining can occur.
>
> I haven't gotten to trying SVE yet, so I appreciate the warning :). That said,
> deterministic clobbering of registers is fine. Even changing the registers to
> random junk is fine. We're happy to read those registers through ptrace.
> The problem here is that the kernel lies about what the contents of the x7
> register is and discards any writes to it.
>
> I really hope we can come up with a solution here, I'm already dreading
> the next time I unexpectedly run into this and have to add yet
> another special case :(.
>
> Keno


[PATCH] arm64: ptrace: Fix PTRACE_SINGLESTEP into signal handler

2020-05-23 Thread Keno Fischer
Executing PTRACE_SINGLESTEP at a signal stop is special. It
is supposed to step merely the signal setup work that the
kernel does, but not any instructions in user space. Since
not all architectures have the ability to generate a
single-step exception directly upon return from user-space,
there is a generic pseudo-single-step-stop that may be used
for this purpose (tracehook_signal_handler). Now, arm64 does
have the ability to generate single-step exceptions directly
upon return to userspace and was using this capability (rather
than the generic pseudo-trap) to obtain a similar effect. However,
there is actually a subtle difference that becomes noticeable
when the signal handler in question attempts to block SIGTRAP
(either because it is set in sa_mask, or because it is a handler
for SIGTRAP itself and SA_NODEFER is not set). In such a
situation, a real single step exception will cause the SIGTRAP
signal to be forcibly unblocked and the signal disposition
to be reset to SIG_DFL. The generic pseudo-single-step does
not suffer from this problem, because the SIGTRAP it delivers
is not real. The arm64 behavior is problematic, because a forced
reset of the signal disposition can be quite disruptive to the
userspace program. This patch brings the arm64 behavior in line
with the other major architectures by using the generic
pseudo-single-step-stop, avoiding the problematic interaction
with SIGTRAP masks.

Fixes: 2c020ed8 ("arm64: Signal handling support")
Signed-off-by: Keno Fischer 
---
 arch/arm64/kernel/signal.c | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/arch/arm64/kernel/signal.c b/arch/arm64/kernel/signal.c
index 339882db5a91..cf237ee9443b 100644
--- a/arch/arm64/kernel/signal.c
+++ b/arch/arm64/kernel/signal.c
@@ -808,14 +808,7 @@ static void handle_signal(struct ksignal *ksig, struct 
pt_regs *regs)
 */
ret |= !valid_user_regs(>user_regs, current);
 
-   /*
-* Fast forward the stepping logic so we step into the signal
-* handler.
-*/
-   if (!ret)
-   user_fastforward_single_step(tsk);
-
-   signal_setup_done(ret, ksig, 0);
+   signal_setup_done(ret, ksig, test_thread_flag(TIF_SINGLESTEP));
 }
 
 /*
-- 
2.25.1



Re: arm64: Register modification during syscall entry/exit stop

2020-05-22 Thread Keno Fischer
I got bitten by this again, so I decided to write up a simple example
that shows the problem:

https://gist.github.com/Keno/cde691b26e32373307fb7449ad305739

This runs the same child twice. First vanilla where it prints "Hello world".
The second time, using a textbook ptrace example, to only print "world".
The problem here is that by the time the ptracer gets around to restoring
the registers, it's no longer in a syscall stop, so the write to x7 does not
get ignored and the correct value of x7 gets clobbered.
I copied the syscall definition from musl, so the compiler thinks x7 is
live, and we can see an assertion.

Output on my machine (will depend on compiler version, etc.):
```
$ gcc -g3 -O3 ptrace_lies.c
$ ./a.out
Hello World
World
a.out: ptrace_lies.c:49: do_child: Assertion `v3 == values[2]' failed.
a.out: ptrace_lies.c:134: main: Assertion `WIFEXITED(status) &&
WEXITSTATUS(status) == 0' failed.
Aborted (core dumped)
```

However, I don't think that whether or not the compiler thinks that x7 is
live is the problem here. The problem is entirely that this mechanism
prevents the ptracer from precisely controlling the register state. While
basic ptracers don't need this feature (strace),
more advanced ptracers (think criu, etc.) absolutely do want to precisely
control what the register state is.
The ptracer I'm working on (https://rr-project.org/)
happens to be an extreme case of this, where it wants *bitwise* equivalent
register states such that it can run the same code many times and get
the exact same results.

Also, if the issue was just that the kernel clobbered x7, that would be fine
we could deal with that no problem. However, it's much worse than that,
because the behavior of the kernel with respect to x7 depends on what
kind of ptrace stop we're in and even worse, in some kinds of stop,
there's absolutely no way to get at the actual value of x7.

> Hmm, does that actually result in the SVC instruction getting inlined? I
> think that's quite dangerous, since we document that we can trash the SVE
> register state on a system call, for example. I'm also surprised that
> the register variables are honoured by compilers if that inlining can occur.

I haven't gotten to trying SVE yet, so I appreciate the warning :). That said,
deterministic clobbering of registers is fine. Even changing the registers to
random junk is fine. We're happy to read those registers through ptrace.
The problem here is that the kernel lies about what the contents of the x7
register is and discards any writes to it.

I really hope we can come up with a solution here, I'm already dreading
the next time I unexpectedly run into this and have to add yet
another special case :(.

Keno


ptrace: seccomp: Return value when the call was already invalid

2020-05-22 Thread Keno Fischer
I'm seeing the following while porting a ptracer from
x86_64 to arm64 (cc'ing arm64 folks, but in this case
x86_64 is the odd one out, I think other archs would
be consistent with arm64).

Consider userspace code like the following:
```
int ret = syscall(-10, 0);
assert(ret == -ENOSYS);
```

(Never mind the fact that this is something userspace
shouldn't do, I saw this in our test suite that tests
corner cases where the ptracer shouldn't affect behavior).

Now, if we have a seccomp filter that simply does
SECCOMP_RET_TRACE, and a ptracer that simply
does PTRACE_CONT, then the assert will fire/fail on arm64,
but not on x86_64.

The reason this happens is that the return value gets set
early on x86_64, but this is not possible on arm64,
because doing so would clobber the first argument
register that it shares. As a result, no return value is
set and `ret` retains the value that the first syscall
argument used to have.

I can work around this of course, but I guess my
question is whether this is expected/ok,
or you would expect an active ptracer that does not
touch the registers not to affect behavior.

Interestingly, arm64 does do something different
if the syscall is -1 rather than -10, where early
in the ptrace stop it does.
```
/* set default errno for user-issued syscall(-1) */
if (scno == NO_SYSCALL)
regs->regs[0] = -ENOSYS;
```

I'm not sure that's great either since the ptracer
may want to inspect x0 and arm64 does not
make orig_x0 available via ptrace. To me
this indicates that maybe this was intended
to apply to any syscall skipped here, not
just -1 (the different comes from the fact
that seccomp considers any negative
syscall a skip/fail, but on syscall-entry
stops arm64 only considers a literal -1
a skip).

On the other hand if this is deemed expected,
I'll go ahead and submit a man-page patch to at
least document this architecture difference.

Keno


Re: arm64: Register modification during syscall entry/exit stop

2020-05-19 Thread Keno Fischer
Hi Will,

> Yes, we inherited this from ARM and I think strace relies on it. In
> hindsight, it is a little odd, although x7 is a parameter register in the
> PCS and so it won't be live on entry to a system call.

I'm not familiar with the PCS acronym, but I assume you mean the
calling convention? You have more faith in userspace than I do ;). For
example, cursory googling brought up this arm64 syscall definition in musl:

https://github.com/bminor/musl/blob/593caa456309714402ca4cb77c3770f4c24da9da/arch/aarch64/syscall_arch.h

The constraints on those asm blocks allow the compiler to assume that
x7 is preserved across the syscall. If a ptracer accidentally modified it
(which is easy to do in the situations that I mentioned), it could
absolutely cause incorrect execution of the userspace program.

> Although the examples you've
> listed above are interesting, I don't see why x7 is important in any of
> them (and we only support up to 6 system call arguments).

It's not so much that x7 is important, it's that lying to the ptracer is
problematic, because it might remember that lie and act on it later.
I did run into exactly this problem, where my ptracer accidentally
changed the value of x7 and caused incorrect execution in the tracee
(now that incorrect execution happened to be an assertion, because
my application is paranoid about these kinds of issues, but it was
incorrect nevertheless)

If it would be helpful, I can code up the syscall entry -> signal trap example
ptracer to have a concrete example.

Keno


arm64: Register modification during syscall entry/exit stop

2020-05-18 Thread Keno Fischer
Continuing my theme of "weird things I encounter
while trying to use ptrace on arm64", I ran into the
effect of the following code in the syscall entry/exit
reporting:

```
/*
* A scratch register (ip(r12) on AArch32, x7 on AArch64) is
* used to denote syscall entry/exit:
*/
regno = (is_compat_task() ? 12 : 7);
saved_reg = regs->regs[regno];
regs->regs[regno] = dir;
```

This seems very weird to me. I can't think of any
other architecture that does something similar
(other than unicore32 apparently, but the ptrace
support there seems like it might have just been
copied from ARM). I'm able to work around this
in my application, but it adds another stumbling block.
Some examples of things that happen:
- Writes to x7 during syscall exit stops are ignored, so
  if the ptracer tries to emulate a setjmp-type thing, it
  might miss this register (ptracers sometimes like to do
  this to manually serialize execution between different
  threads, puppeteering a single thread of execution
  between different register states).
- Reads from x7 are incorrect, so if the ptracer saves
  a register state and later tries to set it back to the task,
  it may get x7 incorrect, but user space may be expecting
  the register to be preserved (when might this happen? -
  consider a ptracer that wants to modify some syscall
  arguments, it modifies the arguments, restarts the syscall
  but then incurs a signal, so it tries to restore the original
  registers to let userspace deal with the signal without
  being confused - expect signal traps don't ignore x7
  modifications, so x7 may have been unexpectedly
  modified).
- We now have seccomp traps, which kind of look and
  act like syscall-entry traps, but don't have this behavior,
  so it's not particularly reliable for ptracers to use.

Furthermore, it seems unnecessary to me on modern
kernels. We now have PTRACE_GET_SYSCALL_INFO,
which exposes this information without lying to the ptracer
about the tracee's registers.

I understand, we can't just change this, since people may
be relying on it, but I would like to propose adding a ptrace
option (PTRACE_O_ARM_REGSGOOD?) that turns this
behavior off. Now, I don't think we currently have any other
arch-specific ptrace options, so maybe there is a different
option that would be preferable (e.g. could be a different
regset), but I do think it would be good to have a way to
operate on the real x7 value. As I said, I can work around it,
but hopefully I will be able to save a future implementer
some headache.

Keno


[PATCH] arm64: Fix PTRACE_SYSEMU semantics

2020-05-15 Thread Keno Fischer
Quoth the man page:
```
   If the tracee was restarted by PTRACE_SYSCALL or PTRACE_SYSEMU, the
   tracee enters syscall-enter-stop just prior to entering any system
   call (which will not be executed if the restart was using
   PTRACE_SYSEMU, regardless of any change made to registers at this
   point or how the tracee is restarted after this stop).
```

The parenthetical comment is currently true on x86 and powerpc,
but not currently true on arm64. arm64 re-checks the _TIF_SYSCALL_EMU
flag after the syscall entry ptrace stop. However, at this point,
it reflects which method was used to re-start the syscall
at the entry stop, rather than the method that was used to reach it.
Fix that by recording the original flag before performing the ptrace
stop, bringing the behavior in line with documentation and x86/powerpc.

Signed-off-by: Keno Fischer 
---
 arch/arm64/kernel/ptrace.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/kernel/ptrace.c b/arch/arm64/kernel/ptrace.c
index b3d3005d9515..b67b4d14aa17 100644
--- a/arch/arm64/kernel/ptrace.c
+++ b/arch/arm64/kernel/ptrace.c
@@ -1829,10 +1829,12 @@ static void tracehook_report_syscall(struct pt_regs 
*regs,
 
 int syscall_trace_enter(struct pt_regs *regs)
 {
-   if (test_thread_flag(TIF_SYSCALL_TRACE) ||
-   test_thread_flag(TIF_SYSCALL_EMU)) {
+   u32 flags = READ_ONCE(current_thread_info()->flags) &
+   (_TIF_SYSCALL_EMU | _TIF_SYSCALL_TRACE);
+
+   if (flags) {
tracehook_report_syscall(regs, PTRACE_SYSCALL_ENTER);
-   if (!in_syscall(regs) || test_thread_flag(TIF_SYSCALL_EMU))
+   if (!in_syscall(regs) || (flags & _TIF_SYSCALL_EMU))
return -1;
}
 
-- 
2.25.1



Re: PTRACE_SYSEMU behavior difference on arm64

2020-05-15 Thread Keno Fischer
On Fri, May 15, 2020 at 8:13 AM Will Deacon  wrote:
> But it also
> means that nobody is using this on arm64, so we could also consider removing
> it entirely. Did you spot this because you are trying to use it for
> something or just by inspection/unit-testing?

No, I was trying to port a tool from x86 and nothing made sense for
many hours :). (it was quite a bit of debugging, because the
syscall that it was supposed to skip installed a seccomp filter,
which then later veto'd random syscalls making the
symptoms quite confusing). Having PTRACE_SYSEMU isn't
critical, but we might as well support it.
It makes things a bit more efficient and is probably safer
(if it works correctly ;). The patch is fairly small. Will validate
and then send it here for review.


Keno


PTRACE_SYSEMU behavior difference on arm64

2020-05-15 Thread Keno Fischer
The behavior of PTRACE_SYSEMU on arm64
appears to differ substantially from that of x86 and powerpc
(the other two architectures on which this feature is implemented).
In particular, after PTRACE_SYSEMU the syscall will always
be skipped on x86 and powerpc, but executed on arm64 unless
the syscall-entry stop was again continued using PTRACE_SYSEMU.
The skipping behavior is also documented in the manpage,
so I suspect this may just be a bug (the skipping behavior
makes sense to me and is what I would expect).
The reason this happens is that `syscall_trace_enter`
re-checks TIF_SYSCALL_EMU after the ptrace stop, but at that
point it may have already been superseded by a new ptrace
request. x86 and power save the original value of the flag,
rather than acting on the new value. I can submit a patch to
fix this, but wanted to check first whether this was intentional.
If it is, I can fix the man page instead.

Keno


Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

2018-06-18 Thread Keno Fischer
> So, to be useful, this interface needs to be called before an
> application can run XGETBV or XSAVE for the first time and caches a
> "bad" value.  I think that means that it might not be feasible to use
> outside of cases where you ptrace() something and inject things before
> it has a chance to run any real instructions.
>
> Fundamentally, I think that makes _this_ interface pretty useless in
> practice.  The only practical option is to have a _future_ XCR0 value
> set by the prctl() and then have it get made active by the kernel at
> execve().

Fair enough, but it don't see this as really fundamentally different
from the cpuid masking use case, which has the same problem and
the same interface. I'm also not convinced that there is *no* use case
where one may want to turn on certain XCR0 features while the process
is running and then turn them off again. To give a concrete example in
this context, it can useful to write a small program into the memory space
of the replayed program and use it to analyze the memory state of the
program (e.g. to checksum the memory - or in our case to perform a
GC state validation). Such implants may want to use the AVX512
registers for performance, so it would be nice if that was possible.

> IMNHO, if you haven't guessed yet, I think this whole exercise is a dead
> end.  Just boot an identical XCR0 VM on your new hardware and do replay
> there.  Done.

I had a hunch ;). However, there are a couple considerations that
make me still want this in the kernel proper:
1. The recording side application of this feature - getting our users
to do everything in a VM to send us a recording is not easy. Part
of the appeal of rr over VM-based record/replay techniques
is that it "just works" on basically any linux hosts.
2. Starting a VM generally requires root permissions, which may
not be available.
3. And probably the biggest from my perspective is performance. rr
needs to do a lot twiddling with the performance counters, which
I've seen have significant performance overhead in a virtualized
environment. There's of course also a per-VM resource consumption,
but presumably we could keep one VM per-XCR0 value and replay
multiple traces per VM.


Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

2018-06-18 Thread Keno Fischer
> So, to be useful, this interface needs to be called before an
> application can run XGETBV or XSAVE for the first time and caches a
> "bad" value.  I think that means that it might not be feasible to use
> outside of cases where you ptrace() something and inject things before
> it has a chance to run any real instructions.
>
> Fundamentally, I think that makes _this_ interface pretty useless in
> practice.  The only practical option is to have a _future_ XCR0 value
> set by the prctl() and then have it get made active by the kernel at
> execve().

Fair enough, but it don't see this as really fundamentally different
from the cpuid masking use case, which has the same problem and
the same interface. I'm also not convinced that there is *no* use case
where one may want to turn on certain XCR0 features while the process
is running and then turn them off again. To give a concrete example in
this context, it can useful to write a small program into the memory space
of the replayed program and use it to analyze the memory state of the
program (e.g. to checksum the memory - or in our case to perform a
GC state validation). Such implants may want to use the AVX512
registers for performance, so it would be nice if that was possible.

> IMNHO, if you haven't guessed yet, I think this whole exercise is a dead
> end.  Just boot an identical XCR0 VM on your new hardware and do replay
> there.  Done.

I had a hunch ;). However, there are a couple considerations that
make me still want this in the kernel proper:
1. The recording side application of this feature - getting our users
to do everything in a VM to send us a recording is not easy. Part
of the appeal of rr over VM-based record/replay techniques
is that it "just works" on basically any linux hosts.
2. Starting a VM generally requires root permissions, which may
not be available.
3. And probably the biggest from my perspective is performance. rr
needs to do a lot twiddling with the performance counters, which
I've seen have significant performance overhead in a virtualized
environment. There's of course also a per-VM resource consumption,
but presumably we could keep one VM per-XCR0 value and replay
multiple traces per VM.


Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

2018-06-18 Thread Keno Fischer
> So we're talking about a workaround for broken software. The question
> is how wide spread is it?

For rr to work, it tries to replicate the process state *exactly*. That means:

1. The same instructions executed in the same order
2. The exact same register state at those instructions
3. The same memory state, bit-by-bit

In particular 1) means that any extra instructions executed/not executed
will cause a replay divergence (in practice rr uses retired conditional
branches rather than instructions, because the instruction counter is
not accurate, while the branch one is). This alone causes a problem
for the present case, because glibc branches on the xcr0 value before
it branches on the cpuid value for AVX512. Glibc does check for the
correct cpuid before calling xgetbv, so one possible thing to do is to
completely disable xsave during recording by disabling it in CPUID, but
that would make rr quite a bit less useful, since it wouldn't be able to
record any bugs that require AVX to be used. However, the xsave
problem is worse, because xcr0 determines how much memory
`xsave` writes, so if we emulate cpuid, to pretend that AVX512
does not exist, and the user space application uses that to
determine the size of the required buffer, we now suddenly
overflow that buffer (unless the user space application uses
cpuid to compute a minimal RFBM for xsave, which no application
seems to do).

> Do memory contents which are never read by the application matter?

In theory, no. However, in practice, I've seen most memory
divergences (esp if on the stack), end up causing control flow divergences
down the line, because some code somewhere picks up the uninitialized
memory and branches on it.

Hope that helps,
Keno


Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

2018-06-18 Thread Keno Fischer
> So we're talking about a workaround for broken software. The question
> is how wide spread is it?

For rr to work, it tries to replicate the process state *exactly*. That means:

1. The same instructions executed in the same order
2. The exact same register state at those instructions
3. The same memory state, bit-by-bit

In particular 1) means that any extra instructions executed/not executed
will cause a replay divergence (in practice rr uses retired conditional
branches rather than instructions, because the instruction counter is
not accurate, while the branch one is). This alone causes a problem
for the present case, because glibc branches on the xcr0 value before
it branches on the cpuid value for AVX512. Glibc does check for the
correct cpuid before calling xgetbv, so one possible thing to do is to
completely disable xsave during recording by disabling it in CPUID, but
that would make rr quite a bit less useful, since it wouldn't be able to
record any bugs that require AVX to be used. However, the xsave
problem is worse, because xcr0 determines how much memory
`xsave` writes, so if we emulate cpuid, to pretend that AVX512
does not exist, and the user space application uses that to
determine the size of the required buffer, we now suddenly
overflow that buffer (unless the user space application uses
cpuid to compute a minimal RFBM for xsave, which no application
seems to do).

> Do memory contents which are never read by the application matter?

In theory, no. However, in practice, I've seen most memory
divergences (esp if on the stack), end up causing control flow divergences
down the line, because some code somewhere picks up the uninitialized
memory and branches on it.

Hope that helps,
Keno


Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

2018-06-18 Thread Keno Fischer
On Mon, Jun 18, 2018 at 12:16 PM, Dave Hansen
 wrote:
> On 06/18/2018 08:13 AM, Keno Fischer wrote:
>>>> 4) Catch the fault thrown by xsaves/xrestors in this situation, update
>>>> XCR0, redo the xsaves/restores, put XCR0 back and continue
>>>> execution after the faulting instruction.
>>>
>>> I'm worried about the kernel pieces that go digging in the XSAVE data
>>> getting confused more than the hardware getting confused.
>>
>> So you prefer this option? If so, I can try to have a go at implementing it
>> this way and seeing if I run into any trouble.
>
> No, I'm saying that depending on faults is not a viable solution.  We
> are not guaranteed to get faults in all the cases you would need to fix up.
>
> XSAVE*/XRSTOR* are not even *called* in some of those cases.

Ah, my apologies, I was under the mistaken impression that xsaves also
read xcomp_bv to inform the layout, rather than using the RFBM and then
updating the xcomp_bv field. Let me think about this some more and see
what I can come up with.


Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

2018-06-18 Thread Keno Fischer
On Mon, Jun 18, 2018 at 12:16 PM, Dave Hansen
 wrote:
> On 06/18/2018 08:13 AM, Keno Fischer wrote:
>>>> 4) Catch the fault thrown by xsaves/xrestors in this situation, update
>>>> XCR0, redo the xsaves/restores, put XCR0 back and continue
>>>> execution after the faulting instruction.
>>>
>>> I'm worried about the kernel pieces that go digging in the XSAVE data
>>> getting confused more than the hardware getting confused.
>>
>> So you prefer this option? If so, I can try to have a go at implementing it
>> this way and seeing if I run into any trouble.
>
> No, I'm saying that depending on faults is not a viable solution.  We
> are not guaranteed to get faults in all the cases you would need to fix up.
>
> XSAVE*/XRSTOR* are not even *called* in some of those cases.

Ah, my apologies, I was under the mistaken impression that xsaves also
read xcomp_bv to inform the layout, rather than using the RFBM and then
updating the xcomp_bv field. Let me think about this some more and see
what I can come up with.


Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

2018-06-18 Thread Keno Fischer
>> 4) Catch the fault thrown by xsaves/xrestors in this situation, update
>> XCR0, redo the xsaves/restores, put XCR0 back and continue
>> execution after the faulting instruction.
>
> I'm worried about the kernel pieces that go digging in the XSAVE data
> getting confused more than the hardware getting confused.

So you prefer this option? If so, I can try to have a go at implementing it
this way and seeing if I run into any trouble.

>> At least currently, it is my understanding that `xfeatures_mask` only has
>> user features, am I mistaken about that?
>
> We're slowing adding supervisor support.  I think accounting for
> supervisor features is a requirement for any new XSAVE code.

Sure, I don't think this is in any way incompatible with that (though
probably also informs that we want to keep the memory layout the
same if possible).


Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

2018-06-18 Thread Keno Fischer
>> 4) Catch the fault thrown by xsaves/xrestors in this situation, update
>> XCR0, redo the xsaves/restores, put XCR0 back and continue
>> execution after the faulting instruction.
>
> I'm worried about the kernel pieces that go digging in the XSAVE data
> getting confused more than the hardware getting confused.

So you prefer this option? If so, I can try to have a go at implementing it
this way and seeing if I run into any trouble.

>> At least currently, it is my understanding that `xfeatures_mask` only has
>> user features, am I mistaken about that?
>
> We're slowing adding supervisor support.  I think accounting for
> supervisor features is a requirement for any new XSAVE code.

Sure, I don't think this is in any way incompatible with that (though
probably also informs that we want to keep the memory layout the
same if possible).


Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

2018-06-18 Thread Keno Fischer
Hi Dave,

thank you for your thorough comments, replies inline:

On Mon, Jun 18, 2018 at 8:47 AM, Dave Hansen
 wrote:
> On 06/16/2018 05:33 PM, Keno Fischer wrote:
>> For my use case, it would be sufficient to simply disallow
>> any value of XCR0 with "holes" in it,
> But what if the hardware you are migrating to/from *has* holes?

Yes, that would be the problem ;). I'm not aware of that being the
case for user space states on current hardware, but perhaps,
I'm just missing a case.

> There's no way this is even close to viable until it has been made to
> cope with holes.

Ok, it seems to me the first decision is whether it should be allowed
to have the compacted format (with holes) in an in-memory xstate
image. Doing so seems possible, but would require more invasive
changes to the kernel, so I'm gonna ignore it for now.

If we want to keep the in-memory xstate layout constant after boot
time, I see four ways to do that for this feature.

1) Set up XCR0 in a different place, so that the kernel xsaves/xrstors
use an XCR0 that matches the layout the kernel expects.
2) Use xsaveopt when this particular thread flag is set and have the
kernel be able to deal with non-compressed xstate images, even
if xsaves is available.
3) What's in this patch now, but fix up the xstate after saving it.
4) Catch the fault thrown by xsaves/xrestors in this situation, update
XCR0, redo the xsaves/restores, put XCR0 back and continue
execution after the faulting instruction.

Option 1) seems easiest, but it would also require adding code
somewhere on the common path, which I assume is a no-go ;).

Option 3) seems doable entirely in the slow path for this feature.
If we xsaves with the modified XCR0, we can then fix up the memory
image to match the expected layout. Similarly, we could xrestors
in the slow path (from standard layout) and rely on the
`fpregs_state_valid` mechanism to prevent the fault.

Option 4) seems also doable, but of course messing around with
traps makes things quite a bit more complicated.

> FWIW, I just don't think this is going to be viable.  I have the feeling
> that there's way too much stuff that hard-codes assumptions about XCR0
> inside the kernel and out. This is just going to make it much more fragile.
>
> Folks that want this level of container migration are probably better
> off running one of the hardware-based containers and migrating _those_.
> Or, just ensuring the places to/from they want to migrate have a
> homogeneous XCR0 mix.

Fair enough :). I figured I'd throw it out there as an RFC and see if I can
get it into an acceptable shape for you to include upstream. For my,
particular use case (replaying old traces on newer hardware),
since I control the replay hardware, I'm happy if there's a patch
for me to apply locally to solve my problem. Of course there's also
the opposite use case (recording a trace on a new machine, but
wanting to replay on an older one), which would also require the
recording machine to have this patch. The reason I make this
distinction is that we don't usually control the recording machine
(since those are our users' machines), so that tends to have a
much more varied hardware/kernel mix and it's harder to get kernel
patches applied. I'm happy to keep chipping away at this for as
long as there is feedback, but I understand if it won't make it
in at the end of the process.

>> @@ -252,6 +301,8 @@ void arch_setup_new_exec(void)
>>   /* If cpuid was previously disabled for this task, re-enable it. */
>>   if (test_thread_flag(TIF_NOCPUID))
>>   enable_cpuid();
>> + if (test_thread_flag(TIF_MASKXCR0))
>> + reset_xcr0_mask();
>>  }
>
> So the mask is cleared on exec().  Does that mean that *every*
> individual process using this interface has to set up its own mask
> before anything in the C library establishes its cached value of XCR0.
> I'd want to see how that's being accomplished.

Yes, that is correct. In my rr patch this is done by injecting an
arch_prctl syscall using ptrace, at the same spot that we inject
the arch_prctl syscall for enabling cpuid faulting (i.e. the interface
that sets the `TIF_NOCPUID` flag in the check prior to this one).
Since these two features would seem to make the most sense when
used together, I figured it would be best to align semantics.
Happy to entertain alternative suggestions though.

>> +static int xstate_is_initial(unsigned long mask)
>> +{
>> + int i, j;
>> + unsigned long max_bit = __ffs(mask);
>> +
>> + for (i = 0; i < max_bit; ++i) {
>> + if (mask & (1 << i)) {
>> + char *xfeature_addr = (char *)get_xsave_addr(
>> + >thread.fpu.state.xsave,
>> + 1 << i);

Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

2018-06-18 Thread Keno Fischer
Hi Dave,

thank you for your thorough comments, replies inline:

On Mon, Jun 18, 2018 at 8:47 AM, Dave Hansen
 wrote:
> On 06/16/2018 05:33 PM, Keno Fischer wrote:
>> For my use case, it would be sufficient to simply disallow
>> any value of XCR0 with "holes" in it,
> But what if the hardware you are migrating to/from *has* holes?

Yes, that would be the problem ;). I'm not aware of that being the
case for user space states on current hardware, but perhaps,
I'm just missing a case.

> There's no way this is even close to viable until it has been made to
> cope with holes.

Ok, it seems to me the first decision is whether it should be allowed
to have the compacted format (with holes) in an in-memory xstate
image. Doing so seems possible, but would require more invasive
changes to the kernel, so I'm gonna ignore it for now.

If we want to keep the in-memory xstate layout constant after boot
time, I see four ways to do that for this feature.

1) Set up XCR0 in a different place, so that the kernel xsaves/xrstors
use an XCR0 that matches the layout the kernel expects.
2) Use xsaveopt when this particular thread flag is set and have the
kernel be able to deal with non-compressed xstate images, even
if xsaves is available.
3) What's in this patch now, but fix up the xstate after saving it.
4) Catch the fault thrown by xsaves/xrestors in this situation, update
XCR0, redo the xsaves/restores, put XCR0 back and continue
execution after the faulting instruction.

Option 1) seems easiest, but it would also require adding code
somewhere on the common path, which I assume is a no-go ;).

Option 3) seems doable entirely in the slow path for this feature.
If we xsaves with the modified XCR0, we can then fix up the memory
image to match the expected layout. Similarly, we could xrestors
in the slow path (from standard layout) and rely on the
`fpregs_state_valid` mechanism to prevent the fault.

Option 4) seems also doable, but of course messing around with
traps makes things quite a bit more complicated.

> FWIW, I just don't think this is going to be viable.  I have the feeling
> that there's way too much stuff that hard-codes assumptions about XCR0
> inside the kernel and out. This is just going to make it much more fragile.
>
> Folks that want this level of container migration are probably better
> off running one of the hardware-based containers and migrating _those_.
> Or, just ensuring the places to/from they want to migrate have a
> homogeneous XCR0 mix.

Fair enough :). I figured I'd throw it out there as an RFC and see if I can
get it into an acceptable shape for you to include upstream. For my,
particular use case (replaying old traces on newer hardware),
since I control the replay hardware, I'm happy if there's a patch
for me to apply locally to solve my problem. Of course there's also
the opposite use case (recording a trace on a new machine, but
wanting to replay on an older one), which would also require the
recording machine to have this patch. The reason I make this
distinction is that we don't usually control the recording machine
(since those are our users' machines), so that tends to have a
much more varied hardware/kernel mix and it's harder to get kernel
patches applied. I'm happy to keep chipping away at this for as
long as there is feedback, but I understand if it won't make it
in at the end of the process.

>> @@ -252,6 +301,8 @@ void arch_setup_new_exec(void)
>>   /* If cpuid was previously disabled for this task, re-enable it. */
>>   if (test_thread_flag(TIF_NOCPUID))
>>   enable_cpuid();
>> + if (test_thread_flag(TIF_MASKXCR0))
>> + reset_xcr0_mask();
>>  }
>
> So the mask is cleared on exec().  Does that mean that *every*
> individual process using this interface has to set up its own mask
> before anything in the C library establishes its cached value of XCR0.
> I'd want to see how that's being accomplished.

Yes, that is correct. In my rr patch this is done by injecting an
arch_prctl syscall using ptrace, at the same spot that we inject
the arch_prctl syscall for enabling cpuid faulting (i.e. the interface
that sets the `TIF_NOCPUID` flag in the check prior to this one).
Since these two features would seem to make the most sense when
used together, I figured it would be best to align semantics.
Happy to entertain alternative suggestions though.

>> +static int xstate_is_initial(unsigned long mask)
>> +{
>> + int i, j;
>> + unsigned long max_bit = __ffs(mask);
>> +
>> + for (i = 0; i < max_bit; ++i) {
>> + if (mask & (1 << i)) {
>> + char *xfeature_addr = (char *)get_xsave_addr(
>> + >thread.fpu.state.xsave,
>> + 1 << i);

Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

2018-06-17 Thread Keno Fischer
> Almost difference in CPU behavior
> between the replayer and the replayee.

Not sure what happened to this sentence.
What I meant to say was:

Almost any difference in CPU behavior between
the replayer and the replayee will cause problems
for the determinism of the trace.

To elaborate on this, even if a register whose content
differs between the recording and the replay, it can
still cause problems down the line, e.g. if it is spilled
to the stack and that stack address is then re-used later.
In order for rr to work, we basically rely on the CPU
behaving *exactly* the same during the record and the
replay (down to counting performance counters the same).
This works well, but there are corner cases (like this XCR0
one).


Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

2018-06-17 Thread Keno Fischer
> Almost difference in CPU behavior
> between the replayer and the replayee.

Not sure what happened to this sentence.
What I meant to say was:

Almost any difference in CPU behavior between
the replayer and the replayee will cause problems
for the determinism of the trace.

To elaborate on this, even if a register whose content
differs between the recording and the replay, it can
still cause problems down the line, e.g. if it is spilled
to the stack and that stack address is then re-used later.
In order for rr to work, we basically rely on the CPU
behaving *exactly* the same during the record and the
replay (down to counting performance counters the same).
This works well, but there are corner cases (like this XCR0
one).


Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

2018-06-17 Thread Keno Fischer
> Patch seems pointless if you can already control CPUID, which rr
> supports. Just disable AVX512 in CPUID. All code using AVX should check
> cpuid (or will fail anyways).

Unfortunately, that is insufficient. Almost difference in CPU behavior
between the replayer
and the replayee. In particular, the problems for rr here are
1) xgetbv, which can introduce differing register contents (and if
code branches on this,
potentially differences in control flow).
2) xsave writing more memory than the trace expects, causing
divergence in the memory
contents of the process.

Robert O'Callahan has a blog post that goes into some detail here:
https://robert.ocallahan.org/2018/04/cpuid-features-xsave-and-rr-trace.html

I hope that makes sense. Please let me know if you'd like me to explain the
problem in more detail or give an example. FWIW, I do have a version of rr
that uses the proposed feature in this patch and it does fix our problem with
replaying AVX traces on skylake machines. Without this patch, rr does emulate
the cpuid response of the recording machine, but the replay fails because
of the mentioned issue.


Re: [RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

2018-06-17 Thread Keno Fischer
> Patch seems pointless if you can already control CPUID, which rr
> supports. Just disable AVX512 in CPUID. All code using AVX should check
> cpuid (or will fail anyways).

Unfortunately, that is insufficient. Almost difference in CPU behavior
between the replayer
and the replayee. In particular, the problems for rr here are
1) xgetbv, which can introduce differing register contents (and if
code branches on this,
potentially differences in control flow).
2) xsave writing more memory than the trace expects, causing
divergence in the memory
contents of the process.

Robert O'Callahan has a blog post that goes into some detail here:
https://robert.ocallahan.org/2018/04/cpuid-features-xsave-and-rr-trace.html

I hope that makes sense. Please let me know if you'd like me to explain the
problem in more detail or give an example. FWIW, I do have a version of rr
that uses the proposed feature in this patch and it does fix our problem with
replaying AVX traces on skylake machines. Without this patch, rr does emulate
the cpuid response of the recording machine, but the replay fails because
of the mentioned issue.


[RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

2018-06-16 Thread Keno Fischer
From: Keno Fischer 

The rr (http://rr-project.org/) debugger provides user space
record-and-replay functionality by carefully controlling the process
environment in order to ensure completely deterministic execution
of recorded traces. The recently added ARCH_SET_CPUID arch_prctl
allows rr to move traces across (Intel) machines, by allowing cpuid
invocations to be reliably recorded and replayed. This works very
well, with one catch: It is currently not possible to replay a
recording from a machine supporting a smaller set of XCR0 state
components on one supporting a larger set. This is because the
value of XCR0 is obersevable in userspace (either by explicit
xgetbv or by looking at the result of xsave) and since glibc
does oberseve this value, replay divergence is almost immediate.
I also suspect that people intrested in process (or container)
live-migration may eventually care about this if a migration happens
in between a userspace xsave and a corresponding xrstor.

We encounter this problem quite frequently since most of our users
are using pre-Skylake systems (and thus don't support the AVX512
state components), while we recently upgraded our main development
machines to Skylake.

This patch attempts to provide a way disable XCR0 state components
on a per-thread basis, such that rr may use this feature to emulate
the recording machine's XCR0 for the replayed process.

We do this by providing a new ARCH_SET_XCR0 arch_prctl that takes
as its argument the desired XCR0 value. The system call fails if:
 - XSAVE is not available
 - The user attempts to enable a state component that would not regularly
   be enabled by the kernel.
 - The value of XCR0 is illegal (in the sense that setting it would
   cause a fault).
 - Any state component being disabled is not in its initial state.

The last restriction is debatable, but it seemed sensible to me,
since it means the kernel need not be in the business of trying to
maintain the disabled state components when the ordinary xsave/xrestor
will no longer save/restore them.

The patch does not currently provide a corresponding ARCH_GET_XCR0,
since the `xgetbv` instruction fulfills this purpose. However, we
may want to provide this for consistency. It may also be useful (and
perhaps more useful) to provide a mechanism to obtain the kernel's
XCR0 (i.e. the upper bound on which bits are allowed to be set through
this interface).

On the kernel side, this value is stored as a mask, rather than a value
to set. The thought behind this was to be defensive about new state
components being added but forgetting to update the validity check
in the arch_prctl. However, getting this wrong in either direction
is probably bad (e.g. if Intel added an AVX1024 extension that always
required AVX512, then this would be the wrong way to be defensive about
it), so I'd appreciate some advice on which would be preferred.

Please take this patch as an RFC that I hope is sufficient to enable
discussion about this feature, rather than a complete patch.
In particular, this patch is missing:
 - Cleanup
 - A selftest exercising all the corner cases
 - There's code sharing opportunities with KVM (which has to provide
   similar functionality for virtual machines modifying XCR0), which
   I did not take advantage of in this patch to keep the changes local.
   A full patch would probably involve some refactoring there.

There is one remaining TODO in the code, which has to do with the
xcomp_bv xsave header. The `xrstors` instructions requires that
no bits in this headers field be set that are not active in XCR0.
However, this unfortunately means that changing the value of XCR0
can change the layout of the compacted xsave area, which so far the
kernel does not do anywhere else (except for some handling in the
KVM context). For my use case, it would be sufficient to simply disallow
any value of XCR0 with "holes" in it, that would change the layout
to anything other than a strict prefix, but I would appreciate
hearing any alternative solutions you may have.

Please also let me know if I missed a fundamental way in which this
causes a problem. I realize that this is a very sensitive part of
the kernel code. Since this patch changes XCR0, any xsave/xrestor or any
use of XCR0-enabled features in kernel space, not guarded by
kernel_fpu_begin() (which I modified for this patch set), would cause
a problem. I don't have a sufficiently wide view of the kernel
to know if there are any such uses, so please let me know if I missed
something.

Signed-off-by: Keno Fischer 
---
 arch/x86/include/asm/fpu/xstate.h  |   1 +
 arch/x86/include/asm/processor.h   |   2 +
 arch/x86/include/asm/thread_info.h |   5 +-
 arch/x86/include/uapi/asm/prctl.h  |   2 +
 arch/x86/kernel/fpu/core.c |  10 ++-
 arch/x86/kernel/fpu/xstate.c   |   3 +-
 arch/x86/kernel/process.c  | 125 -
 arch/x86/kernel/process_64.c   |  16 ++---
 8 files changed, 152 insertions(+), 12

[RFC PATCH] x86/arch_prctl: Add ARCH_SET_XCR0 to mask XCR0 per-thread

2018-06-16 Thread Keno Fischer
From: Keno Fischer 

The rr (http://rr-project.org/) debugger provides user space
record-and-replay functionality by carefully controlling the process
environment in order to ensure completely deterministic execution
of recorded traces. The recently added ARCH_SET_CPUID arch_prctl
allows rr to move traces across (Intel) machines, by allowing cpuid
invocations to be reliably recorded and replayed. This works very
well, with one catch: It is currently not possible to replay a
recording from a machine supporting a smaller set of XCR0 state
components on one supporting a larger set. This is because the
value of XCR0 is obersevable in userspace (either by explicit
xgetbv or by looking at the result of xsave) and since glibc
does oberseve this value, replay divergence is almost immediate.
I also suspect that people intrested in process (or container)
live-migration may eventually care about this if a migration happens
in between a userspace xsave and a corresponding xrstor.

We encounter this problem quite frequently since most of our users
are using pre-Skylake systems (and thus don't support the AVX512
state components), while we recently upgraded our main development
machines to Skylake.

This patch attempts to provide a way disable XCR0 state components
on a per-thread basis, such that rr may use this feature to emulate
the recording machine's XCR0 for the replayed process.

We do this by providing a new ARCH_SET_XCR0 arch_prctl that takes
as its argument the desired XCR0 value. The system call fails if:
 - XSAVE is not available
 - The user attempts to enable a state component that would not regularly
   be enabled by the kernel.
 - The value of XCR0 is illegal (in the sense that setting it would
   cause a fault).
 - Any state component being disabled is not in its initial state.

The last restriction is debatable, but it seemed sensible to me,
since it means the kernel need not be in the business of trying to
maintain the disabled state components when the ordinary xsave/xrestor
will no longer save/restore them.

The patch does not currently provide a corresponding ARCH_GET_XCR0,
since the `xgetbv` instruction fulfills this purpose. However, we
may want to provide this for consistency. It may also be useful (and
perhaps more useful) to provide a mechanism to obtain the kernel's
XCR0 (i.e. the upper bound on which bits are allowed to be set through
this interface).

On the kernel side, this value is stored as a mask, rather than a value
to set. The thought behind this was to be defensive about new state
components being added but forgetting to update the validity check
in the arch_prctl. However, getting this wrong in either direction
is probably bad (e.g. if Intel added an AVX1024 extension that always
required AVX512, then this would be the wrong way to be defensive about
it), so I'd appreciate some advice on which would be preferred.

Please take this patch as an RFC that I hope is sufficient to enable
discussion about this feature, rather than a complete patch.
In particular, this patch is missing:
 - Cleanup
 - A selftest exercising all the corner cases
 - There's code sharing opportunities with KVM (which has to provide
   similar functionality for virtual machines modifying XCR0), which
   I did not take advantage of in this patch to keep the changes local.
   A full patch would probably involve some refactoring there.

There is one remaining TODO in the code, which has to do with the
xcomp_bv xsave header. The `xrstors` instructions requires that
no bits in this headers field be set that are not active in XCR0.
However, this unfortunately means that changing the value of XCR0
can change the layout of the compacted xsave area, which so far the
kernel does not do anywhere else (except for some handling in the
KVM context). For my use case, it would be sufficient to simply disallow
any value of XCR0 with "holes" in it, that would change the layout
to anything other than a strict prefix, but I would appreciate
hearing any alternative solutions you may have.

Please also let me know if I missed a fundamental way in which this
causes a problem. I realize that this is a very sensitive part of
the kernel code. Since this patch changes XCR0, any xsave/xrestor or any
use of XCR0-enabled features in kernel space, not guarded by
kernel_fpu_begin() (which I modified for this patch set), would cause
a problem. I don't have a sufficiently wide view of the kernel
to know if there are any such uses, so please let me know if I missed
something.

Signed-off-by: Keno Fischer 
---
 arch/x86/include/asm/fpu/xstate.h  |   1 +
 arch/x86/include/asm/processor.h   |   2 +
 arch/x86/include/asm/thread_info.h |   5 +-
 arch/x86/include/uapi/asm/prctl.h  |   2 +
 arch/x86/kernel/fpu/core.c |  10 ++-
 arch/x86/kernel/fpu/xstate.c   |   3 +-
 arch/x86/kernel/process.c  | 125 -
 arch/x86/kernel/process_64.c   |  16 ++---
 8 files changed, 152 insertions(+), 12

Re: [PATCH RFC] stat.2: Document that stat can fail with EINTR

2017-12-04 Thread Keno Fischer
Hi Michael,

I was hoping to get a clear statement one way or another from the kernel
maintainers as to whether an EINTR from stat() is supposed to be allowed
kernel behavior (hence the RFC in the subject). If it's not, then I don't think
it should be documented, even if there is buggy filesystems that do at
the moment.
So I'd say let's hold off on applying this until more people have had a chance
to comment. If it would be more convenient for you, feel free to drop
this from your
patch queue and if appropriate, I'll resend a non-RFC version of this
patch for you
to apply, once a conclusion has been reached.


On Mon, Dec 4, 2017 at 3:58 PM, Michael Kerrisk (man-pages)
<mtk.manpa...@gmail.com> wrote:
> Hello Keno
>
> On 12/03/2017 04:15 AM, Keno Fischer wrote:
>> Resending as plain text (apologies for those receiving it twice, and
>> those that got
>> an HTML copy, I'm used to my mail client switching that over
>> automatically, which
>> for some reason didn't happen here).
>>
>>
>> This is exactly the discussion I want to generate, so thank you.
>> I should point out that I'm not advocating for anything other
>> than clarity of what kernel behavior user space may assume.
>
> So, should the documentation patch be applied at this point, or dropped?
>
> Thanks,
>
> Michael
>
>
>> On Sat, Dec 2, 2017 at 9:25 PM, Matthew Wilcox <wi...@infradead.org> wrote:
>>> On Sat, Dec 02, 2017 at 07:23:59PM -0500, Keno Fischer wrote:
>>>> The catalyst for this patch was me experiencing EINTR errors when
>>>> using the 9p file system. In linux commit 9523feac, the 9p file
>>>> system was changed to use wait_event_killable instead of
>>>> wait_event_interruptible, which does indeed address my problem,
>>>> but also makes me a bit unhappy, because uninterruptable waits
>>>> prevents things like ^C'ing the execution and some debugging
>>>> tools which depend on being able to cancel long-running operations
>>>> by sending signals.
>>>
>>> Wait, wait, wait.  killable is not uninterruptible.  It's "can accept
>>> a signal if the signal is fatal".  ie userspace will never see it.
>>> So, no, it doesn't prevent ^C.  It does prevent the debugging tool you're
>>> talking about from working, because it's handling the signal, so it's not
>>> fatal.
>>
>> This probably shows that I've been in REPL based environments too long,
>> that catch SIGINT ;). You are of course correct that a fatal SIGINT would
>> still be delivered.
>>
>>>> I realize I'm probably 20 years too late here, but it feels like
>>>> clarificaion on what to expect from the kernel would still go a long
>>>> way here.
>>>
>>> A change to user-visible behaviour has to be opt-in.
>>
>> I agree. However, it was my impression that stat() can return EINTR
>> depending on the file system. Prior to the referenced commit,
>> this was certainly true on 9p and I suspect it's not the only network file
>> system for which this is true (though prior to my experiencing this
>> with 9p, the only
>> time I've ever experienced it was on HPC clusters with who knows what
>> code providing the network filesystem). If it is indeed the case that
>> an EINTR return from stat() and similar is illegal and should be considered
>> a kernel bug, a statement to that extent all I'm looking for here.
>>
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH RFC] stat.2: Document that stat can fail with EINTR

2017-12-04 Thread Keno Fischer
Hi Michael,

I was hoping to get a clear statement one way or another from the kernel
maintainers as to whether an EINTR from stat() is supposed to be allowed
kernel behavior (hence the RFC in the subject). If it's not, then I don't think
it should be documented, even if there is buggy filesystems that do at
the moment.
So I'd say let's hold off on applying this until more people have had a chance
to comment. If it would be more convenient for you, feel free to drop
this from your
patch queue and if appropriate, I'll resend a non-RFC version of this
patch for you
to apply, once a conclusion has been reached.


On Mon, Dec 4, 2017 at 3:58 PM, Michael Kerrisk (man-pages)
 wrote:
> Hello Keno
>
> On 12/03/2017 04:15 AM, Keno Fischer wrote:
>> Resending as plain text (apologies for those receiving it twice, and
>> those that got
>> an HTML copy, I'm used to my mail client switching that over
>> automatically, which
>> for some reason didn't happen here).
>>
>>
>> This is exactly the discussion I want to generate, so thank you.
>> I should point out that I'm not advocating for anything other
>> than clarity of what kernel behavior user space may assume.
>
> So, should the documentation patch be applied at this point, or dropped?
>
> Thanks,
>
> Michael
>
>
>> On Sat, Dec 2, 2017 at 9:25 PM, Matthew Wilcox  wrote:
>>> On Sat, Dec 02, 2017 at 07:23:59PM -0500, Keno Fischer wrote:
>>>> The catalyst for this patch was me experiencing EINTR errors when
>>>> using the 9p file system. In linux commit 9523feac, the 9p file
>>>> system was changed to use wait_event_killable instead of
>>>> wait_event_interruptible, which does indeed address my problem,
>>>> but also makes me a bit unhappy, because uninterruptable waits
>>>> prevents things like ^C'ing the execution and some debugging
>>>> tools which depend on being able to cancel long-running operations
>>>> by sending signals.
>>>
>>> Wait, wait, wait.  killable is not uninterruptible.  It's "can accept
>>> a signal if the signal is fatal".  ie userspace will never see it.
>>> So, no, it doesn't prevent ^C.  It does prevent the debugging tool you're
>>> talking about from working, because it's handling the signal, so it's not
>>> fatal.
>>
>> This probably shows that I've been in REPL based environments too long,
>> that catch SIGINT ;). You are of course correct that a fatal SIGINT would
>> still be delivered.
>>
>>>> I realize I'm probably 20 years too late here, but it feels like
>>>> clarificaion on what to expect from the kernel would still go a long
>>>> way here.
>>>
>>> A change to user-visible behaviour has to be opt-in.
>>
>> I agree. However, it was my impression that stat() can return EINTR
>> depending on the file system. Prior to the referenced commit,
>> this was certainly true on 9p and I suspect it's not the only network file
>> system for which this is true (though prior to my experiencing this
>> with 9p, the only
>> time I've ever experienced it was on HPC clusters with who knows what
>> code providing the network filesystem). If it is indeed the case that
>> an EINTR return from stat() and similar is illegal and should be considered
>> a kernel bug, a statement to that extent all I'm looking for here.
>>
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/


Re: [PATCH RFC] stat.2: Document that stat can fail with EINTR

2017-12-02 Thread Keno Fischer
Resending as plain text (apologies for those receiving it twice, and
those that got
an HTML copy, I'm used to my mail client switching that over
automatically, which
for some reason didn't happen here).


This is exactly the discussion I want to generate, so thank you.
I should point out that I'm not advocating for anything other
than clarity of what kernel behavior user space may assume.


On Sat, Dec 2, 2017 at 9:25 PM, Matthew Wilcox <wi...@infradead.org> wrote:
> On Sat, Dec 02, 2017 at 07:23:59PM -0500, Keno Fischer wrote:
>> The catalyst for this patch was me experiencing EINTR errors when
>> using the 9p file system. In linux commit 9523feac, the 9p file
>> system was changed to use wait_event_killable instead of
>> wait_event_interruptible, which does indeed address my problem,
>> but also makes me a bit unhappy, because uninterruptable waits
>> prevents things like ^C'ing the execution and some debugging
>> tools which depend on being able to cancel long-running operations
>> by sending signals.
>
> Wait, wait, wait.  killable is not uninterruptible.  It's "can accept
> a signal if the signal is fatal".  ie userspace will never see it.
> So, no, it doesn't prevent ^C.  It does prevent the debugging tool you're
> talking about from working, because it's handling the signal, so it's not
> fatal.

This probably shows that I've been in REPL based environments too long,
that catch SIGINT ;). You are of course correct that a fatal SIGINT would
still be delivered.

>> I realize I'm probably 20 years too late here, but it feels like
>> clarificaion on what to expect from the kernel would still go a long
>> way here.
>
> A change to user-visible behaviour has to be opt-in.

I agree. However, it was my impression that stat() can return EINTR
depending on the file system. Prior to the referenced commit,
this was certainly true on 9p and I suspect it's not the only network file
system for which this is true (though prior to my experiencing this
with 9p, the only
time I've ever experienced it was on HPC clusters with who knows what
code providing the network filesystem). If it is indeed the case that
an EINTR return from stat() and similar is illegal and should be considered
a kernel bug, a statement to that extent all I'm looking for here.


Re: [PATCH RFC] stat.2: Document that stat can fail with EINTR

2017-12-02 Thread Keno Fischer
Resending as plain text (apologies for those receiving it twice, and
those that got
an HTML copy, I'm used to my mail client switching that over
automatically, which
for some reason didn't happen here).


This is exactly the discussion I want to generate, so thank you.
I should point out that I'm not advocating for anything other
than clarity of what kernel behavior user space may assume.


On Sat, Dec 2, 2017 at 9:25 PM, Matthew Wilcox  wrote:
> On Sat, Dec 02, 2017 at 07:23:59PM -0500, Keno Fischer wrote:
>> The catalyst for this patch was me experiencing EINTR errors when
>> using the 9p file system. In linux commit 9523feac, the 9p file
>> system was changed to use wait_event_killable instead of
>> wait_event_interruptible, which does indeed address my problem,
>> but also makes me a bit unhappy, because uninterruptable waits
>> prevents things like ^C'ing the execution and some debugging
>> tools which depend on being able to cancel long-running operations
>> by sending signals.
>
> Wait, wait, wait.  killable is not uninterruptible.  It's "can accept
> a signal if the signal is fatal".  ie userspace will never see it.
> So, no, it doesn't prevent ^C.  It does prevent the debugging tool you're
> talking about from working, because it's handling the signal, so it's not
> fatal.

This probably shows that I've been in REPL based environments too long,
that catch SIGINT ;). You are of course correct that a fatal SIGINT would
still be delivered.

>> I realize I'm probably 20 years too late here, but it feels like
>> clarificaion on what to expect from the kernel would still go a long
>> way here.
>
> A change to user-visible behaviour has to be opt-in.

I agree. However, it was my impression that stat() can return EINTR
depending on the file system. Prior to the referenced commit,
this was certainly true on 9p and I suspect it's not the only network file
system for which this is true (though prior to my experiencing this
with 9p, the only
time I've ever experienced it was on HPC clusters with who knows what
code providing the network filesystem). If it is indeed the case that
an EINTR return from stat() and similar is illegal and should be considered
a kernel bug, a statement to that extent all I'm looking for here.


[PATCH RFC] stat.2: Document that stat can fail with EINTR

2017-12-02 Thread Keno Fischer
Particularly on network file systems, a stat call may require
submitting a message over the network and waiting interruptably
for a reply.

Signed-off-by: Keno Fischer <k...@juliacomputing.com>
---

The catalyst for this patch was me experiencing EINTR errors when
using the 9p file system. In linux commit 9523feac, the 9p file
system was changed to use wait_event_killable instead of
wait_event_interruptible, which does indeed address my problem,
but also makes me a bit unhappy, because uninterruptable waits
prevents things like ^C'ing the execution and some debugging
tools which depend on being able to cancel long-running operations
by sending signals. I'd like to ask the user space applications I
care about to properly handle such situations (either by using
SA_RESTART or by explicitly handling EINTR), but it's a bit of a
hard sell if EINTR isn't documented to be a possibility. I'm hoping
this doc PATCH will generate a discussion of whether EINTR is an
appropriate thing for stat (as a stand in for a file system call that's
not read/write) to return. If so, I'd be happy to submit
patches to other file system-related syscalls along these same lines.

I realize I'm probably 20 years too late here, but it feels like
clarificaion on what to expect from the kernel would still go a long
way here.  

 man2/stat.2 | 5 +
 1 file changed, 5 insertions(+)

diff --git a/man2/stat.2 b/man2/stat.2
index dad9a01..f10235a 100644
--- a/man2/stat.2
+++ b/man2/stat.2
@@ -452,6 +452,11 @@ Invalid flag specified in
 is relative and
 .I dirfd
 is a file descriptor referring to a file other than a directory.
+.TP
+.B EINTR
+The call was interrupted by delivery of a signal caught by a handler; see
+.BR signal (7).
+The possibility of this error is file-system dependent.
 .SH VERSIONS
 .BR fstatat ()
 was added to Linux in kernel 2.6.16;
-- 
2.8.1



[PATCH RFC] stat.2: Document that stat can fail with EINTR

2017-12-02 Thread Keno Fischer
Particularly on network file systems, a stat call may require
submitting a message over the network and waiting interruptably
for a reply.

Signed-off-by: Keno Fischer 
---

The catalyst for this patch was me experiencing EINTR errors when
using the 9p file system. In linux commit 9523feac, the 9p file
system was changed to use wait_event_killable instead of
wait_event_interruptible, which does indeed address my problem,
but also makes me a bit unhappy, because uninterruptable waits
prevents things like ^C'ing the execution and some debugging
tools which depend on being able to cancel long-running operations
by sending signals. I'd like to ask the user space applications I
care about to properly handle such situations (either by using
SA_RESTART or by explicitly handling EINTR), but it's a bit of a
hard sell if EINTR isn't documented to be a possibility. I'm hoping
this doc PATCH will generate a discussion of whether EINTR is an
appropriate thing for stat (as a stand in for a file system call that's
not read/write) to return. If so, I'd be happy to submit
patches to other file system-related syscalls along these same lines.

I realize I'm probably 20 years too late here, but it feels like
clarificaion on what to expect from the kernel would still go a long
way here.  

 man2/stat.2 | 5 +
 1 file changed, 5 insertions(+)

diff --git a/man2/stat.2 b/man2/stat.2
index dad9a01..f10235a 100644
--- a/man2/stat.2
+++ b/man2/stat.2
@@ -452,6 +452,11 @@ Invalid flag specified in
 is relative and
 .I dirfd
 is a file descriptor referring to a file other than a directory.
+.TP
+.B EINTR
+The call was interrupted by delivery of a signal caught by a handler; see
+.BR signal (7).
+The possibility of this error is file-system dependent.
 .SH VERSIONS
 .BR fstatat ()
 was added to Linux in kernel 2.6.16;
-- 
2.8.1



Re: Yes, people use FOLL_FORCE ;)

2017-05-30 Thread Keno Fischer
Hi Linus,

>  But it sounds like your JIT case actually uses it for writing -
> but if you can write a small blurb about it, that would be nice.

yes, we use it for writing. Happy to describe the scheme in more detail.

>  (b) it would probably be nice to limit FOLL_FORCE in general as much
> as possible, so if your case is about writing to your very _own_
> memory mapping, as opposed to writing to another process' memory,
> maybe we can do something like
>
> if (mm == current->mm)
> flags |= FOLL_FORCE;
>
> which at least avoids the whole "let's change the VM in odd ways for a
> process that isn't even me".

While this would fix our current use case, we do have a use case for modifying
non-local address space as well (putting the JIT into a different
process). Similarly,
the rr use case precisely uses the remote mm case. I think in general
this feature
is very useful for anybody who needs to precisely control the execution of some
other process. Various debuggers (gdb/lldb/rr) certainly fall into
that category, but
there's another class of such processes (wine, various emulators) which may want
to do that kind of thing. Now, I suspect most of these will have the
other process
under ptrace control, so maybe allowing (same_mm || ptraced) would be ok, but
at least for the sandbox/remote-jit use case, it would be perfectly
reasonable to not
have the jit server be a ptracer.


Re: Yes, people use FOLL_FORCE ;)

2017-05-30 Thread Keno Fischer
Hi Linus,

>  But it sounds like your JIT case actually uses it for writing -
> but if you can write a small blurb about it, that would be nice.

yes, we use it for writing. Happy to describe the scheme in more detail.

>  (b) it would probably be nice to limit FOLL_FORCE in general as much
> as possible, so if your case is about writing to your very _own_
> memory mapping, as opposed to writing to another process' memory,
> maybe we can do something like
>
> if (mm == current->mm)
> flags |= FOLL_FORCE;
>
> which at least avoids the whole "let's change the VM in odd ways for a
> process that isn't even me".

While this would fix our current use case, we do have a use case for modifying
non-local address space as well (putting the JIT into a different
process). Similarly,
the rr use case precisely uses the remote mm case. I think in general
this feature
is very useful for anybody who needs to precisely control the execution of some
other process. Various debuggers (gdb/lldb/rr) certainly fall into
that category, but
there's another class of such processes (wine, various emulators) which may want
to do that kind of thing. Now, I suspect most of these will have the
other process
under ptrace control, so maybe allowing (same_mm || ptraced) would be ok, but
at least for the sandbox/remote-jit use case, it would be perfectly
reasonable to not
have the jit server be a ptracer.


Yes, people use FOLL_FORCE ;)

2017-05-29 Thread Keno Fischer
Hi Linus et al.,

In 8ee74a91 "proc: try to remove use of FOLL_FORCE entirely", you removed
punch through semantics of /proc//mem. We used these semantics as a
hardening mechanism in the julia JIT. By opening /proc/self/mem and using
these semantics, we could avoid needing RWX pages, or a dual mapping
approach. We do have fallbacks to these other methods (though getting
EIO here actually causes an assert in released versions - we'll updated
that to make sure to take the fall back in that case). Nevertheless the
/proc/self/mem approach was our favored approach because it
a) Required an attacker to be able to execute syscalls which is a taller
order than getting memory write and b) didn't double the virtual
address space requirements (as a dual mapping approach would).

Now, while we're probably fine with using the fallbacks, I know there's
others that rely on this behavior as well (cc'ing Robert O'Callahan of the
rr project for which this change will result in significant performance
degradation). Also, judging by who complained last time FOLL_FORCE
was broken, I suspect the Wine people are relying on this as well.
Frankly, I'm a bit surprised that this change was made in the first place.
Making a userspace-breaking change on mainline and seeing if anybody
complains doesn't seem like the ideal way to find out if features are used.

As I said, personally we can patch our software and deal with this, but I think
a change like this deserves a bit wider discussion, so may I suggest a revert
of this change for the time being? Maybe there can be a syslog warning such
that people who use it will notice and have their say on the mailing list.

Thanks,
Keno


Yes, people use FOLL_FORCE ;)

2017-05-29 Thread Keno Fischer
Hi Linus et al.,

In 8ee74a91 "proc: try to remove use of FOLL_FORCE entirely", you removed
punch through semantics of /proc//mem. We used these semantics as a
hardening mechanism in the julia JIT. By opening /proc/self/mem and using
these semantics, we could avoid needing RWX pages, or a dual mapping
approach. We do have fallbacks to these other methods (though getting
EIO here actually causes an assert in released versions - we'll updated
that to make sure to take the fall back in that case). Nevertheless the
/proc/self/mem approach was our favored approach because it
a) Required an attacker to be able to execute syscalls which is a taller
order than getting memory write and b) didn't double the virtual
address space requirements (as a dual mapping approach would).

Now, while we're probably fine with using the fallbacks, I know there's
others that rely on this behavior as well (cc'ing Robert O'Callahan of the
rr project for which this change will result in significant performance
degradation). Also, judging by who complained last time FOLL_FORCE
was broken, I suspect the Wine people are relying on this as well.
Frankly, I'm a bit surprised that this change was made in the first place.
Making a userspace-breaking change on mainline and seeing if anybody
complains doesn't seem like the ideal way to find out if features are used.

As I said, personally we can patch our software and deal with this, but I think
a change like this deserves a bit wider discussion, so may I suggest a revert
of this change for the time being? Maybe there can be a syslog warning such
that people who use it will notice and have their say on the mailing list.

Thanks,
Keno


[PATCH v2] mm: Respect FOLL_FORCE/FOLL_COW for thp

2017-01-05 Thread Keno Fischer
In 19be0eaff ("mm: remove gup_flags FOLL_WRITE games from __get_user_pages()"),
the mm code was changed from unsetting FOLL_WRITE after a COW was resolved to
setting the (newly introduced) FOLL_COW instead. Simultaneously, the check in
gup.c was updated to still allow writes with FOLL_FORCE set if FOLL_COW had
also been set. However, a similar check in huge_memory.c was forgotten. As a
result, remote memory writes to ro regions of memory backed by transparent huge
pages cause an infinite loop in the kernel (handle_mm_fault sets FOLL_COW and
returns 0 causing a retry, but follow_trans_huge_pmd bails out immidiately
because `(flags & FOLL_WRITE) && !pmd_write(*pmd)` is true. While in this
state the process is stil SIGKILLable, but little else works (e.g. no ptrace
attach, no other signals). This is easily reproduced with the following
code (assuming thp are set to always):

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define TEST_SIZE 5 * 1024 * 1024

int main(void) {
  int status;
  pid_t child;
  int fd = open("/proc/self/mem", O_RDWR);
  void *addr = mmap(NULL, TEST_SIZE, PROT_READ,
MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
  assert(addr != MAP_FAILED);
  pid_t parent_pid = getpid();
  if ((child = fork()) == 0) {
void *addr2 = mmap(NULL, TEST_SIZE, PROT_READ | PROT_WRITE,
   MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
assert(addr2 != MAP_FAILED);
memset(addr2, 'a', TEST_SIZE);
pwrite(fd, addr2, TEST_SIZE, (uintptr_t)addr);
return 0;
  }
  assert(child == waitpid(child, , 0));
  assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);
  return 0;
}

Fix this by updating follow_trans_huge_pmd in huge_memory.c analogously to
the update in gup.c in the original commit. The same pattern exists in
follow_devmap_pmd. However, we should not be able to reach that check
with FOLL_COW set, so add WARN_ONCE to make sure we notice if we ever
do.

Signed-off-by: Keno Fischer <k...@juliacomputing.com>
---
Changes since v1:
 * In follow_devmap_pmd, WARN_ONCE if FOLL_COW is encountered, rather
   than allowing it, since that situation should not happen.
   As suggested by Kirill A. Shutemov

 mm/huge_memory.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 10eedbf..f7ec01d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -783,6 +783,12 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, 
unsigned long addr,
 
assert_spin_locked(pmd_lockptr(mm, pmd));
 
+   /*
+* When we COW a devmap PMD entry, we split it into PTEs, so we should
+* not be in this function with `flags & FOLL_COW` set.
+*/
+   WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW 
set");
+
if (flags & FOLL_WRITE && !pmd_write(*pmd))
return NULL;
 
@@ -1127,6 +1133,16 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t 
orig_pmd)
return ret;
 }
 
+/*
+ * FOLL_FORCE can write to even unwritable pmd's, but only
+ * after we've gone through a COW cycle and they are dirty.
+ */
+static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
+{
+   return pmd_write(pmd) ||
+   ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+}
+
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
   unsigned long addr,
   pmd_t *pmd,
@@ -1137,7 +1153,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct 
*vma,
 
assert_spin_locked(pmd_lockptr(mm, pmd));
 
-   if (flags & FOLL_WRITE && !pmd_write(*pmd))
+   if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
goto out;
 
/* Avoid dumping huge zero page */
-- 
2.9.3



[PATCH v2] mm: Respect FOLL_FORCE/FOLL_COW for thp

2017-01-05 Thread Keno Fischer
In 19be0eaff ("mm: remove gup_flags FOLL_WRITE games from __get_user_pages()"),
the mm code was changed from unsetting FOLL_WRITE after a COW was resolved to
setting the (newly introduced) FOLL_COW instead. Simultaneously, the check in
gup.c was updated to still allow writes with FOLL_FORCE set if FOLL_COW had
also been set. However, a similar check in huge_memory.c was forgotten. As a
result, remote memory writes to ro regions of memory backed by transparent huge
pages cause an infinite loop in the kernel (handle_mm_fault sets FOLL_COW and
returns 0 causing a retry, but follow_trans_huge_pmd bails out immidiately
because `(flags & FOLL_WRITE) && !pmd_write(*pmd)` is true. While in this
state the process is stil SIGKILLable, but little else works (e.g. no ptrace
attach, no other signals). This is easily reproduced with the following
code (assuming thp are set to always):

#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define TEST_SIZE 5 * 1024 * 1024

int main(void) {
  int status;
  pid_t child;
  int fd = open("/proc/self/mem", O_RDWR);
  void *addr = mmap(NULL, TEST_SIZE, PROT_READ,
MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
  assert(addr != MAP_FAILED);
  pid_t parent_pid = getpid();
  if ((child = fork()) == 0) {
void *addr2 = mmap(NULL, TEST_SIZE, PROT_READ | PROT_WRITE,
   MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
assert(addr2 != MAP_FAILED);
memset(addr2, 'a', TEST_SIZE);
pwrite(fd, addr2, TEST_SIZE, (uintptr_t)addr);
return 0;
  }
  assert(child == waitpid(child, , 0));
  assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);
  return 0;
}

Fix this by updating follow_trans_huge_pmd in huge_memory.c analogously to
the update in gup.c in the original commit. The same pattern exists in
follow_devmap_pmd. However, we should not be able to reach that check
with FOLL_COW set, so add WARN_ONCE to make sure we notice if we ever
do.

Signed-off-by: Keno Fischer 
---
Changes since v1:
 * In follow_devmap_pmd, WARN_ONCE if FOLL_COW is encountered, rather
   than allowing it, since that situation should not happen.
   As suggested by Kirill A. Shutemov

 mm/huge_memory.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 10eedbf..f7ec01d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -783,6 +783,12 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, 
unsigned long addr,
 
assert_spin_locked(pmd_lockptr(mm, pmd));
 
+   /*
+* When we COW a devmap PMD entry, we split it into PTEs, so we should
+* not be in this function with `flags & FOLL_COW` set.
+*/
+   WARN_ONCE(flags & FOLL_COW, "mm: In follow_devmap_pmd with FOLL_COW 
set");
+
if (flags & FOLL_WRITE && !pmd_write(*pmd))
return NULL;
 
@@ -1127,6 +1133,16 @@ int do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t 
orig_pmd)
return ret;
 }
 
+/*
+ * FOLL_FORCE can write to even unwritable pmd's, but only
+ * after we've gone through a COW cycle and they are dirty.
+ */
+static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
+{
+   return pmd_write(pmd) ||
+   ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+}
+
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
   unsigned long addr,
   pmd_t *pmd,
@@ -1137,7 +1153,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct 
*vma,
 
assert_spin_locked(pmd_lockptr(mm, pmd));
 
-   if (flags & FOLL_WRITE && !pmd_write(*pmd))
+   if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
goto out;
 
/* Avoid dumping huge zero page */
-- 
2.9.3



Re: [PATCH] mm: Respect FOLL_FORCE/FOLL_COW for thp

2017-01-05 Thread Keno Fischer
>> @@ -783,7 +793,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct 
>> *vma, unsigned long addr,
>>
>>   assert_spin_locked(pmd_lockptr(mm, pmd));
>>
>> - if (flags & FOLL_WRITE && !pmd_write(*pmd))
>> + if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
>>   return NULL;
>
> I don't think this part is needed: once we COW devmap PMD entry, we split
> it into PTE table, so IIUC we never get here with PMD.
>
> Maybe we should WARN_ONCE() if have FOLL_COW here.

Sounds good to me. As I said, I don't have a testcase for this code
path, I just noticed the same pattern.
Will send an updated patch with the WARN_ONCE there.


Re: [PATCH] mm: Respect FOLL_FORCE/FOLL_COW for thp

2017-01-05 Thread Keno Fischer
>> @@ -783,7 +793,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct 
>> *vma, unsigned long addr,
>>
>>   assert_spin_locked(pmd_lockptr(mm, pmd));
>>
>> - if (flags & FOLL_WRITE && !pmd_write(*pmd))
>> + if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
>>   return NULL;
>
> I don't think this part is needed: once we COW devmap PMD entry, we split
> it into PTE table, so IIUC we never get here with PMD.
>
> Maybe we should WARN_ONCE() if have FOLL_COW here.

Sounds good to me. As I said, I don't have a testcase for this code
path, I just noticed the same pattern.
Will send an updated patch with the WARN_ONCE there.


[PATCH] mm: Respect FOLL_FORCE/FOLL_COW for thp

2017-01-04 Thread Keno Fischer
In 19be0eaff ("mm: remove gup_flags FOLL_WRITE games from __get_user_pages()"),
the mm code was changed from unsetting FOLL_WRITE after a COW was resolved to
setting the (newly introduced) FOLL_COW instead. Simultaneously, the check in
gup.c was updated to still allow writes with FOLL_FORCE set if FOLL_COW had
also been set. However, a similar check in huge_memory.c was forgotten. As a
result, remote memory writes to ro regions of memory backed by transparent huge
pages cause an infinite loop in the kernel (handle_mm_fault sets FOLL_COW and
returns 0 causing a retry, but follow_trans_huge_pmd bails out immidiately
because `(flags & FOLL_WRITE) && !pmd_write(*pmd)` is true. While in this
state, the process is stil SIGKILLable, but little else works (e.g. no ptrace
attach, no other signals). This is easily reproduced with the following
code (assuming thp are set to always):

```
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define TEST_SIZE 5 * 1024 * 1024

int main(void) {
  int status;
  pid_t child;
  int fd = open("/proc/self/mem", O_RDWR);
  void *addr =
  mmap(NULL, TEST_SIZE, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
  assert(addr != MAP_FAILED);
  pid_t parent_pid = getpid();
  if ((child = fork()) == 0) {
void *addr2 = mmap(NULL, TEST_SIZE, PROT_READ | PROT_WRITE,
   MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
assert(addr2 != MAP_FAILED);
memset(addr2, 'a', TEST_SIZE);
pwrite(fd, addr2, TEST_SIZE, (uintptr_t)addr);
return 0;
  }
  assert(child == waitpid(child, , 0));
  assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);
  return 0;
}
```

Fix this by updating the instances in huge_memory.c analogously to
the update in gup.c in the original commit. The same pattern existed in
follow_devmap_pmd, so I have changed that location as well. However,
I do not have a test case that for that code path.

Signed-off-by: Keno Fischer <k...@juliacomputing.com>
---
 mm/huge_memory.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 10eedbf..84497a8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -773,6 +773,16 @@ static void touch_pmd(struct vm_area_struct *vma, unsigned 
long addr,
update_mmu_cache_pmd(vma, addr, pmd);
 }
 
+/*
+ * FOLL_FORCE can write to even unwritable pmd's, but only
+ * after we've gone through a COW cycle and they are dirty.
+ */
+static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
+{
+   return pmd_write(pmd) ||
+   ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+}
+
 struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, int flags)
 {
@@ -783,7 +793,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, 
unsigned long addr,
 
assert_spin_locked(pmd_lockptr(mm, pmd));
 
-   if (flags & FOLL_WRITE && !pmd_write(*pmd))
+   if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
return NULL;
 
if (pmd_present(*pmd) && pmd_devmap(*pmd))
@@ -1137,7 +1147,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct 
*vma,
 
assert_spin_locked(pmd_lockptr(mm, pmd));
 
-   if (flags & FOLL_WRITE && !pmd_write(*pmd))
+   if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
goto out;
 
/* Avoid dumping huge zero page */
-- 
2.9.3



[PATCH] mm: Respect FOLL_FORCE/FOLL_COW for thp

2017-01-04 Thread Keno Fischer
In 19be0eaff ("mm: remove gup_flags FOLL_WRITE games from __get_user_pages()"),
the mm code was changed from unsetting FOLL_WRITE after a COW was resolved to
setting the (newly introduced) FOLL_COW instead. Simultaneously, the check in
gup.c was updated to still allow writes with FOLL_FORCE set if FOLL_COW had
also been set. However, a similar check in huge_memory.c was forgotten. As a
result, remote memory writes to ro regions of memory backed by transparent huge
pages cause an infinite loop in the kernel (handle_mm_fault sets FOLL_COW and
returns 0 causing a retry, but follow_trans_huge_pmd bails out immidiately
because `(flags & FOLL_WRITE) && !pmd_write(*pmd)` is true. While in this
state, the process is stil SIGKILLable, but little else works (e.g. no ptrace
attach, no other signals). This is easily reproduced with the following
code (assuming thp are set to always):

```
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define TEST_SIZE 5 * 1024 * 1024

int main(void) {
  int status;
  pid_t child;
  int fd = open("/proc/self/mem", O_RDWR);
  void *addr =
  mmap(NULL, TEST_SIZE, PROT_READ, MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
  assert(addr != MAP_FAILED);
  pid_t parent_pid = getpid();
  if ((child = fork()) == 0) {
void *addr2 = mmap(NULL, TEST_SIZE, PROT_READ | PROT_WRITE,
   MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
assert(addr2 != MAP_FAILED);
memset(addr2, 'a', TEST_SIZE);
pwrite(fd, addr2, TEST_SIZE, (uintptr_t)addr);
return 0;
  }
  assert(child == waitpid(child, , 0));
  assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);
  return 0;
}
```

Fix this by updating the instances in huge_memory.c analogously to
the update in gup.c in the original commit. The same pattern existed in
follow_devmap_pmd, so I have changed that location as well. However,
I do not have a test case that for that code path.

Signed-off-by: Keno Fischer 
---
 mm/huge_memory.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 10eedbf..84497a8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -773,6 +773,16 @@ static void touch_pmd(struct vm_area_struct *vma, unsigned 
long addr,
update_mmu_cache_pmd(vma, addr, pmd);
 }
 
+/*
+ * FOLL_FORCE can write to even unwritable pmd's, but only
+ * after we've gone through a COW cycle and they are dirty.
+ */
+static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
+{
+   return pmd_write(pmd) ||
+   ((flags & FOLL_FORCE) && (flags & FOLL_COW) && pmd_dirty(pmd));
+}
+
 struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, int flags)
 {
@@ -783,7 +793,7 @@ struct page *follow_devmap_pmd(struct vm_area_struct *vma, 
unsigned long addr,
 
assert_spin_locked(pmd_lockptr(mm, pmd));
 
-   if (flags & FOLL_WRITE && !pmd_write(*pmd))
+   if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
return NULL;
 
if (pmd_present(*pmd) && pmd_devmap(*pmd))
@@ -1137,7 +1147,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct 
*vma,
 
assert_spin_locked(pmd_lockptr(mm, pmd));
 
-   if (flags & FOLL_WRITE && !pmd_write(*pmd))
+   if (flags & FOLL_WRITE && !can_follow_write_pmd(*pmd, flags))
goto out;
 
/* Avoid dumping huge zero page */
-- 
2.9.3



Why do Zombie process' /proc entries have uid 0?

2016-12-22 Thread Keno Fischer
This is mostly out of curiosity, but I was surprised by the behavior, so
I was hoping somebody might be able to explain why this behavior was
chosen. In particular, consider any zombie process, e.g.

$ cat /proc/77078/status
Name: test
State: Z (zombie)
Tgid: 77078
Ngid: 0
Pid: 77078
PPid: 77077
TracerPid: 0
Uid: 1000 1000 1000 1000
Gid: 1000 1000 1000 1000
[...]

now, this process has uid 1000, as does the /proc/ directory

$ stat /proc/77078
  File: '/proc/77078'
Access: (0555/dr-xr-xr-x)  Uid: ( 1000/keno)   Gid: ( 1000/keno)

but most files in /proc/ are owned by root:

$ stat /proc/77078/status
  File: '/proc/77078/status'
Access: (0444/-r--r--r--)  Uid: (0/root)   Gid: (0/root)

Why is this? Why don't these files remain owned by the same uid as the
process itself?

Keno


Why do Zombie process' /proc entries have uid 0?

2016-12-22 Thread Keno Fischer
This is mostly out of curiosity, but I was surprised by the behavior, so
I was hoping somebody might be able to explain why this behavior was
chosen. In particular, consider any zombie process, e.g.

$ cat /proc/77078/status
Name: test
State: Z (zombie)
Tgid: 77078
Ngid: 0
Pid: 77078
PPid: 77077
TracerPid: 0
Uid: 1000 1000 1000 1000
Gid: 1000 1000 1000 1000
[...]

now, this process has uid 1000, as does the /proc/ directory

$ stat /proc/77078
  File: '/proc/77078'
Access: (0555/dr-xr-xr-x)  Uid: ( 1000/keno)   Gid: ( 1000/keno)

but most files in /proc/ are owned by root:

$ stat /proc/77078/status
  File: '/proc/77078/status'
Access: (0444/-r--r--r--)  Uid: (0/root)   Gid: (0/root)

Why is this? Why don't these files remain owned by the same uid as the
process itself?

Keno


Bug in fs/gs_base PTRACE_SETREGS on pre-4.7 kernels

2016-11-17 Thread Keno Fischer
Hi Andy (),

this is more of a heads up than a bug report, since it turns out you
already fixed this in

731e33e: x86/arch_prctl/64: Remove FSBASE/GSBASE < 4G optimization

In any case, without that commit, trying to use PTRACE_SETREGS to set
either fs_base, or gs_base to 0 when it was previously <4G, but wasn't
0, fails to take effect in the tracee.

This is caused by the `if (child->thread.fs != value)`, in
`ptrace.c:putreg`, which skips the `do_arch_prctl` call. Of course the
problem here is that while the optimization is in place `fs` is set to
0, but does not actually hold the fs base, so the call is incorrectly
skipped.

In any case, figured you may be interested that the commit changes behavior
(for the better - not complaining ;), even if user code does not go
out of its way to confuse ptrace.

Keno


Bug in fs/gs_base PTRACE_SETREGS on pre-4.7 kernels

2016-11-17 Thread Keno Fischer
Hi Andy (),

this is more of a heads up than a bug report, since it turns out you
already fixed this in

731e33e: x86/arch_prctl/64: Remove FSBASE/GSBASE < 4G optimization

In any case, without that commit, trying to use PTRACE_SETREGS to set
either fs_base, or gs_base to 0 when it was previously <4G, but wasn't
0, fails to take effect in the tracee.

This is caused by the `if (child->thread.fs != value)`, in
`ptrace.c:putreg`, which skips the `do_arch_prctl` call. Of course the
problem here is that while the optimization is in place `fs` is set to
0, but does not actually hold the fs base, so the call is incorrectly
skipped.

In any case, figured you may be interested that the commit changes behavior
(for the better - not complaining ;), even if user code does not go
out of its way to confuse ptrace.

Keno


Re: [PATCH] um: Fix compile failure due to current_text_address() definition

2016-11-15 Thread Keno Fischer
Just as an FYI, the linker bug has been fixed in binutils.

On Fri, Nov 11, 2016 at 5:07 PM, Richard Weinberger <rich...@nod.at> wrote:
> On 11.11.2016 22:03, Keno Fischer wrote:
>> Did you have CONFIG_INET set? I'm attaching my full .config. This is
>> on vanilla Ubuntu 16.10.
>
> Yes, CONFIG_INET is set. Let my try on Ubuntu. ;-\
>
>> I did see the same error when building with `CONFIG_STATIC_LINK=y`.
>> Note that I also, separately, ran into a linker problem, though I
>> believe it is unrelated to this patch
>> (though perhaps is related to the problem you're seeing?):
>> https://sourceware.org/bugzilla/show_bug.cgi?id=20800.
>
> This seems to be an UML<->glibc issue.
> memmove() is now an ifunc and for whatever reason it does not work with UML.
>
>> I'd also be happy to provide you with ssh access to the machine that
>> I'm seeing this on if that
>> would be helpful.
>
> Okay, let me try myself first. I think I'm able to install Ubuntu. :)
>
> Thanks,
> //richard


Re: [PATCH] um: Fix compile failure due to current_text_address() definition

2016-11-15 Thread Keno Fischer
Just as an FYI, the linker bug has been fixed in binutils.

On Fri, Nov 11, 2016 at 5:07 PM, Richard Weinberger  wrote:
> On 11.11.2016 22:03, Keno Fischer wrote:
>> Did you have CONFIG_INET set? I'm attaching my full .config. This is
>> on vanilla Ubuntu 16.10.
>
> Yes, CONFIG_INET is set. Let my try on Ubuntu. ;-\
>
>> I did see the same error when building with `CONFIG_STATIC_LINK=y`.
>> Note that I also, separately, ran into a linker problem, though I
>> believe it is unrelated to this patch
>> (though perhaps is related to the problem you're seeing?):
>> https://sourceware.org/bugzilla/show_bug.cgi?id=20800.
>
> This seems to be an UML<->glibc issue.
> memmove() is now an ifunc and for whatever reason it does not work with UML.
>
>> I'd also be happy to provide you with ssh access to the machine that
>> I'm seeing this on if that
>> would be helpful.
>
> Okay, let me try myself first. I think I'm able to install Ubuntu. :)
>
> Thanks,
> //richard


[PATCH v2] gpio: Remove GPIO_DEVRES option

2016-11-15 Thread Keno Fischer
This option was added in 6a89a314ab107a12af08c71420c19a37a30fc2d3 to allow use
of the devm_gpio_* functions without CONFIG_GPIOLIB. However, only a few months
later in b69ac52449c658b7ac40034dc3c5f5f4a71a723d, CONFIG_GPIOLIB, was added
as a dependency, defeating the original purpose of this option. Instead of
that patch, the original commit could have just been reverted (and in fact
was partially so in 403c1d0be5ccbd750d25c59d8358843a81e52e3b). Further,
since this option has a dependency on HAS_IOMEM, even though it does not
require it, it causes build failures when !HAS_IOMEM (e.g. in a uml build).
Fix that by completely removing the option, in essence completing the
reversion of the original commit.
---

In the original version of this patch 
(http://marc.info/?l=linux-gpio=147874300313315=2),
I had kept the option, and just fixed the build failure. However,
Linus Walleij pointed out (and the git history agrees) that this
option is now obsolete and should just be removed.

Also, while here I should note that there was a `might_sleep();` in the
stub for `devm_gpio_free`, that was not reintroduced when the stub was
brought back. Not sure this makes a difference, it felt worth pointing
out to the maintainers.

 drivers/gpio/Kconfig  | 4 
 drivers/gpio/Makefile | 2 +-
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig
index d011cb8..ed37e59 100644
--- a/drivers/gpio/Kconfig
+++ b/drivers/gpio/Kconfig
@@ -22,10 +22,6 @@ menuconfig GPIOLIB
 
 if GPIOLIB
 
-config GPIO_DEVRES
-   def_bool y
-   depends on HAS_IOMEM
-
 config OF_GPIO
def_bool y
depends on OF
diff --git a/drivers/gpio/Makefile b/drivers/gpio/Makefile
index ab28a2d..d074c22 100644
--- a/drivers/gpio/Makefile
+++ b/drivers/gpio/Makefile
@@ -2,7 +2,7 @@
 
 ccflags-$(CONFIG_DEBUG_GPIO)   += -DDEBUG
 
-obj-$(CONFIG_GPIO_DEVRES)  += devres.o
+obj-$(CONFIG_GPIOLIB)  += devres.o
 obj-$(CONFIG_GPIOLIB)  += gpiolib.o
 obj-$(CONFIG_GPIOLIB)  += gpiolib-legacy.o
 obj-$(CONFIG_OF_GPIO)  += gpiolib-of.o
-- 
2.9.3



[PATCH v2] gpio: Remove GPIO_DEVRES option

2016-11-15 Thread Keno Fischer
This option was added in 6a89a314ab107a12af08c71420c19a37a30fc2d3 to allow use
of the devm_gpio_* functions without CONFIG_GPIOLIB. However, only a few months
later in b69ac52449c658b7ac40034dc3c5f5f4a71a723d, CONFIG_GPIOLIB, was added
as a dependency, defeating the original purpose of this option. Instead of
that patch, the original commit could have just been reverted (and in fact
was partially so in 403c1d0be5ccbd750d25c59d8358843a81e52e3b). Further,
since this option has a dependency on HAS_IOMEM, even though it does not
require it, it causes build failures when !HAS_IOMEM (e.g. in a uml build).
Fix that by completely removing the option, in essence completing the
reversion of the original commit.
---

In the original version of this patch 
(http://marc.info/?l=linux-gpio=147874300313315=2),
I had kept the option, and just fixed the build failure. However,
Linus Walleij pointed out (and the git history agrees) that this
option is now obsolete and should just be removed.

Also, while here I should note that there was a `might_sleep();` in the
stub for `devm_gpio_free`, that was not reintroduced when the stub was
brought back. Not sure this makes a difference, it felt worth pointing
out to the maintainers.

 drivers/gpio/Kconfig  | 4 
 drivers/gpio/Makefile | 2 +-
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/gpio/Kconfig b/drivers/gpio/Kconfig
index d011cb8..ed37e59 100644
--- a/drivers/gpio/Kconfig
+++ b/drivers/gpio/Kconfig
@@ -22,10 +22,6 @@ menuconfig GPIOLIB
 
 if GPIOLIB
 
-config GPIO_DEVRES
-   def_bool y
-   depends on HAS_IOMEM
-
 config OF_GPIO
def_bool y
depends on OF
diff --git a/drivers/gpio/Makefile b/drivers/gpio/Makefile
index ab28a2d..d074c22 100644
--- a/drivers/gpio/Makefile
+++ b/drivers/gpio/Makefile
@@ -2,7 +2,7 @@
 
 ccflags-$(CONFIG_DEBUG_GPIO)   += -DDEBUG
 
-obj-$(CONFIG_GPIO_DEVRES)  += devres.o
+obj-$(CONFIG_GPIOLIB)  += devres.o
 obj-$(CONFIG_GPIOLIB)  += gpiolib.o
 obj-$(CONFIG_GPIOLIB)  += gpiolib-legacy.o
 obj-$(CONFIG_OF_GPIO)  += gpiolib-of.o
-- 
2.9.3



Re: [PATCH] um: Fix compile failure due to current_text_address() definition

2016-11-11 Thread Keno Fischer
Did you have CONFIG_INET set? I'm attaching my full .config. This is
on vanilla Ubuntu 16.10.
I did see the same error when building with `CONFIG_STATIC_LINK=y`.
Note that I also, separately, ran into a linker problem, though I
believe it is unrelated to this patch
(though perhaps is related to the problem you're seeing?):
https://sourceware.org/bugzilla/show_bug.cgi?id=20800.

I'd also be happy to provide you with ssh access to the machine that
I'm seeing this on if that
would be helpful.


config
Description: Binary data


Re: [PATCH] um: Fix compile failure due to current_text_address() definition

2016-11-11 Thread Keno Fischer
Did you have CONFIG_INET set? I'm attaching my full .config. This is
on vanilla Ubuntu 16.10.
I did see the same error when building with `CONFIG_STATIC_LINK=y`.
Note that I also, separately, ran into a linker problem, though I
believe it is unrelated to this patch
(though perhaps is related to the problem you're seeing?):
https://sourceware.org/bugzilla/show_bug.cgi?id=20800.

I'd also be happy to provide you with ssh access to the machine that
I'm seeing this on if that
would be helpful.


config
Description: Binary data


Re: [PATCH] um: Fix compile failure due to current_text_address() definition

2016-11-10 Thread Keno Fischer
On Thu, Nov 10, 2016 at 3:19 PM, Richard Weinberger  wrote:
> Can you please reply to Sebastian's patch series and explain him how you 
> trigger
> that error?
> I don't have a gcc broken by Debian on my machine right now.

I'm not sure how to reply to his patch series directly since I'm not
subscribed to lkml
(I imagine I can set headers manually, such that things get
appropriately threaded,
but for now let me just reply here). I don't think there's much to it.
It seems like any
use of `current_text_addr()` in `ARCH=um` build will cause this
problem, there's just
not very many such uses in the code base. I believe this particular
problem should
be reproducible as long as CONFIG_INET is set for an `ARCH=um` build.


Re: [PATCH] um: Fix compile failure due to current_text_address() definition

2016-11-10 Thread Keno Fischer
On Thu, Nov 10, 2016 at 3:19 PM, Richard Weinberger  wrote:
> Can you please reply to Sebastian's patch series and explain him how you 
> trigger
> that error?
> I don't have a gcc broken by Debian on my machine right now.

I'm not sure how to reply to his patch series directly since I'm not
subscribed to lkml
(I imagine I can set headers manually, such that things get
appropriately threaded,
but for now let me just reply here). I don't think there's much to it.
It seems like any
use of `current_text_addr()` in `ARCH=um` build will cause this
problem, there's just
not very many such uses in the code base. I believe this particular
problem should
be reproducible as long as CONFIG_INET is set for an `ARCH=um` build.


Re: [PATCH] um: Fix compile failure due to current_text_address() definition

2016-11-10 Thread Keno Fischer
Yes

On Thu, Nov 10, 2016 at 3:14 PM, Richard Weinberger <rich...@nod.at> wrote:
> Keno,
>
> On 10.11.2016 21:10, Keno Fischer wrote:
>>> The problem is ready being solved in a generic way:
>>> http://marc.info/?l=linux-kernel=147828481602561=2
>>>
>>> Can you please give this patch a try?
>>
>> No dice. After backing out my patch and applying that one I get:
>>
>> /usr/bin/ld: error: net/built-in.o: requires unsupported dynamic reloc
>> 11; recompile with -fPIC
>
> But you applied the whole series, right?
>
> Thanks,
> //richard


Re: [PATCH] um: Fix compile failure due to current_text_address() definition

2016-11-10 Thread Keno Fischer
Yes

On Thu, Nov 10, 2016 at 3:14 PM, Richard Weinberger  wrote:
> Keno,
>
> On 10.11.2016 21:10, Keno Fischer wrote:
>>> The problem is ready being solved in a generic way:
>>> http://marc.info/?l=linux-kernel=147828481602561=2
>>>
>>> Can you please give this patch a try?
>>
>> No dice. After backing out my patch and applying that one I get:
>>
>> /usr/bin/ld: error: net/built-in.o: requires unsupported dynamic reloc
>> 11; recompile with -fPIC
>
> But you applied the whole series, right?
>
> Thanks,
> //richard


Re: [PATCH] um: Fix compile failure due to current_text_address() definition

2016-11-10 Thread Keno Fischer
> The problem is ready being solved in a generic way:
> http://marc.info/?l=linux-kernel=147828481602561=2
>
> Can you please give this patch a try?

No dice. After backing out my patch and applying that one I get:

/usr/bin/ld: error: net/built-in.o: requires unsupported dynamic reloc
11; recompile with -fPIC


Re: [PATCH] um: Fix compile failure due to current_text_address() definition

2016-11-10 Thread Keno Fischer
> The problem is ready being solved in a generic way:
> http://marc.info/?l=linux-kernel=147828481602561=2
>
> Can you please give this patch a try?

No dice. After backing out my patch and applying that one I get:

/usr/bin/ld: error: net/built-in.o: requires unsupported dynamic reloc
11; recompile with -fPIC


[PATCH] gpio: Guard devm_* functions behind CONFIG_GPIO_DEVRES not CONFIG_GPIOLIB

2016-11-09 Thread Keno Fischer
These functions are defined in devres.c, which only gets compiled with
CONFIG_GPIO_DEVRES (in addition to CONFIG_GPIOLIB). However, in the
header files, the difference between the declaration and the inline
stub was only guarded by CONFIG_GPIOLIB, not CONFIG_GPIO_DEVRES,
causing undefined symbol problems in modpost.

Signed-off-by: Keno Fischer <k...@juliacomputing.com>
---

I encountered this while trying to build uml in an attempt to debug some kernel
behavior I don't understand. To be as close to my actual kernel as possible,
I used the same .config, which of course tried to build a bunch of drivers.
Arguably I should just not build those, but this seems correct nonetheless
and allows the build to go through.

 include/linux/gpio.h  |  28 +++---
 include/linux/gpio/consumer.h | 193 ++
 2 files changed, 117 insertions(+), 104 deletions(-)

diff --git a/include/linux/gpio.h b/include/linux/gpio.h
index d12b5d5..7a1b979 100644
--- a/include/linux/gpio.h
+++ b/include/linux/gpio.h
@@ -77,15 +77,6 @@ static inline int irq_to_gpio(unsigned int irq)
 
 #endif /* ! CONFIG_ARCH_HAVE_CUSTOM_GPIO_H */
 
-/* CONFIG_GPIOLIB: bindings for managed devices that want to request gpios */
-
-struct device;
-
-int devm_gpio_request(struct device *dev, unsigned gpio, const char *label);
-int devm_gpio_request_one(struct device *dev, unsigned gpio,
- unsigned long flags, const char *label);
-void devm_gpio_free(struct device *dev, unsigned int gpio);
-
 #else /* ! CONFIG_GPIOLIB */
 
 #include 
@@ -253,6 +244,23 @@ gpiochip_remove_pin_ranges(struct gpio_chip *chip)
WARN_ON(1);
 }
 
+#endif /* ! CONFIG_GPIOLIB */
+
+#ifdef CONFIG_GPIO_DEVRES
+
+/* CONFIG_GPIOLIB: bindings for managed devices that want to request gpios */
+
+struct device;
+
+int devm_gpio_request(struct device *dev, unsigned gpio, const char *label);
+int devm_gpio_request_one(struct device *dev, unsigned gpio,
+ unsigned long flags, const char *label);
+void devm_gpio_free(struct device *dev, unsigned int gpio);
+
+#else
+
+struct device;
+
 static inline int devm_gpio_request(struct device *dev, unsigned gpio,
const char *label)
 {
@@ -272,6 +280,6 @@ static inline void devm_gpio_free(struct device *dev, 
unsigned int gpio)
WARN_ON(1);
 }
 
-#endif /* ! CONFIG_GPIOLIB */
+#endif
 
 #endif /* __LINUX_GPIO_H */
diff --git a/include/linux/gpio/consumer.h b/include/linux/gpio/consumer.h
index fb0fde6..3e84311 100644
--- a/include/linux/gpio/consumer.h
+++ b/include/linux/gpio/consumer.h
@@ -70,28 +70,6 @@ struct gpio_descs *__must_check 
gpiod_get_array_optional(struct device *dev,
 void gpiod_put(struct gpio_desc *desc);
 void gpiod_put_array(struct gpio_descs *descs);
 
-struct gpio_desc *__must_check devm_gpiod_get(struct device *dev,
- const char *con_id,
- enum gpiod_flags flags);
-struct gpio_desc *__must_check devm_gpiod_get_index(struct device *dev,
-   const char *con_id,
-   unsigned int idx,
-   enum gpiod_flags flags);
-struct gpio_desc *__must_check devm_gpiod_get_optional(struct device *dev,
-  const char *con_id,
-  enum gpiod_flags flags);
-struct gpio_desc *__must_check
-devm_gpiod_get_index_optional(struct device *dev, const char *con_id,
- unsigned int index, enum gpiod_flags flags);
-struct gpio_descs *__must_check devm_gpiod_get_array(struct device *dev,
-const char *con_id,
-enum gpiod_flags flags);
-struct gpio_descs *__must_check
-devm_gpiod_get_array_optional(struct device *dev, const char *con_id,
- enum gpiod_flags flags);
-void devm_gpiod_put(struct device *dev, struct gpio_desc *desc);
-void devm_gpiod_put_array(struct device *dev, struct gpio_descs *descs);
-
 int gpiod_get_direction(struct gpio_desc *desc);
 int gpiod_direction_input(struct gpio_desc *desc);
 int gpiod_direction_output(struct gpio_desc *desc, int value);
@@ -136,9 +114,7 @@ struct fwnode_handle;
 
 struct gpio_desc *fwnode_get_named_gpiod(struct fwnode_handle *fwnode,
 const char *propname);
-struct gpio_desc *devm_get_gpiod_from_child(struct device *dev,
-   const char *con_id,
-   struct fwnode_handle *child);
+
 #else /* CONFIG_GPIOLIB */
 
 static inline int gpiod_count(struct device *dev, const char *con_id)
@@ -205,69 +181,6 @@ static inline void gpiod_put_array(struct gpio_descs 
*descs)
WAR

[PATCH] gpio: Guard devm_* functions behind CONFIG_GPIO_DEVRES not CONFIG_GPIOLIB

2016-11-09 Thread Keno Fischer
These functions are defined in devres.c, which only gets compiled with
CONFIG_GPIO_DEVRES (in addition to CONFIG_GPIOLIB). However, in the
header files, the difference between the declaration and the inline
stub was only guarded by CONFIG_GPIOLIB, not CONFIG_GPIO_DEVRES,
causing undefined symbol problems in modpost.

Signed-off-by: Keno Fischer 
---

I encountered this while trying to build uml in an attempt to debug some kernel
behavior I don't understand. To be as close to my actual kernel as possible,
I used the same .config, which of course tried to build a bunch of drivers.
Arguably I should just not build those, but this seems correct nonetheless
and allows the build to go through.

 include/linux/gpio.h  |  28 +++---
 include/linux/gpio/consumer.h | 193 ++
 2 files changed, 117 insertions(+), 104 deletions(-)

diff --git a/include/linux/gpio.h b/include/linux/gpio.h
index d12b5d5..7a1b979 100644
--- a/include/linux/gpio.h
+++ b/include/linux/gpio.h
@@ -77,15 +77,6 @@ static inline int irq_to_gpio(unsigned int irq)
 
 #endif /* ! CONFIG_ARCH_HAVE_CUSTOM_GPIO_H */
 
-/* CONFIG_GPIOLIB: bindings for managed devices that want to request gpios */
-
-struct device;
-
-int devm_gpio_request(struct device *dev, unsigned gpio, const char *label);
-int devm_gpio_request_one(struct device *dev, unsigned gpio,
- unsigned long flags, const char *label);
-void devm_gpio_free(struct device *dev, unsigned int gpio);
-
 #else /* ! CONFIG_GPIOLIB */
 
 #include 
@@ -253,6 +244,23 @@ gpiochip_remove_pin_ranges(struct gpio_chip *chip)
WARN_ON(1);
 }
 
+#endif /* ! CONFIG_GPIOLIB */
+
+#ifdef CONFIG_GPIO_DEVRES
+
+/* CONFIG_GPIOLIB: bindings for managed devices that want to request gpios */
+
+struct device;
+
+int devm_gpio_request(struct device *dev, unsigned gpio, const char *label);
+int devm_gpio_request_one(struct device *dev, unsigned gpio,
+ unsigned long flags, const char *label);
+void devm_gpio_free(struct device *dev, unsigned int gpio);
+
+#else
+
+struct device;
+
 static inline int devm_gpio_request(struct device *dev, unsigned gpio,
const char *label)
 {
@@ -272,6 +280,6 @@ static inline void devm_gpio_free(struct device *dev, 
unsigned int gpio)
WARN_ON(1);
 }
 
-#endif /* ! CONFIG_GPIOLIB */
+#endif
 
 #endif /* __LINUX_GPIO_H */
diff --git a/include/linux/gpio/consumer.h b/include/linux/gpio/consumer.h
index fb0fde6..3e84311 100644
--- a/include/linux/gpio/consumer.h
+++ b/include/linux/gpio/consumer.h
@@ -70,28 +70,6 @@ struct gpio_descs *__must_check 
gpiod_get_array_optional(struct device *dev,
 void gpiod_put(struct gpio_desc *desc);
 void gpiod_put_array(struct gpio_descs *descs);
 
-struct gpio_desc *__must_check devm_gpiod_get(struct device *dev,
- const char *con_id,
- enum gpiod_flags flags);
-struct gpio_desc *__must_check devm_gpiod_get_index(struct device *dev,
-   const char *con_id,
-   unsigned int idx,
-   enum gpiod_flags flags);
-struct gpio_desc *__must_check devm_gpiod_get_optional(struct device *dev,
-  const char *con_id,
-  enum gpiod_flags flags);
-struct gpio_desc *__must_check
-devm_gpiod_get_index_optional(struct device *dev, const char *con_id,
- unsigned int index, enum gpiod_flags flags);
-struct gpio_descs *__must_check devm_gpiod_get_array(struct device *dev,
-const char *con_id,
-enum gpiod_flags flags);
-struct gpio_descs *__must_check
-devm_gpiod_get_array_optional(struct device *dev, const char *con_id,
- enum gpiod_flags flags);
-void devm_gpiod_put(struct device *dev, struct gpio_desc *desc);
-void devm_gpiod_put_array(struct device *dev, struct gpio_descs *descs);
-
 int gpiod_get_direction(struct gpio_desc *desc);
 int gpiod_direction_input(struct gpio_desc *desc);
 int gpiod_direction_output(struct gpio_desc *desc, int value);
@@ -136,9 +114,7 @@ struct fwnode_handle;
 
 struct gpio_desc *fwnode_get_named_gpiod(struct fwnode_handle *fwnode,
 const char *propname);
-struct gpio_desc *devm_get_gpiod_from_child(struct device *dev,
-   const char *con_id,
-   struct fwnode_handle *child);
+
 #else /* CONFIG_GPIOLIB */
 
 static inline int gpiod_count(struct device *dev, const char *con_id)
@@ -205,69 +181,6 @@ static inline void gpiod_put_array(struct gpio_descs 
*descs)
WARN_ON(1);
 }
 
-static inline

[PATCH] um: Fix compile failure due to current_text_address() definition

2016-11-09 Thread Keno Fischer
Fixes the following link error:
```
/usr/bin/ld: net/built-in.o: relocation R_X86_64_32S against `.text'
can not be used when making a shared object; recompile with -fPIC
```

This is the same definition used on some other architectures.

Signed-off-by: Keno Fischer <k...@juliacomputing.com>
---
I am not sure this is the correct patch in the context of uml. I believe this
should give the runtime ip, which may be different between runs. It may be
better to use the offset in .text (e.g. by using `pc-__text_start`), which
should be consistent.

 arch/x86/um/asm/processor_64.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/um/asm/processor_64.h b/arch/x86/um/asm/processor_64.h
index c3be852..6ca3304 100644
--- a/arch/x86/um/asm/processor_64.h
+++ b/arch/x86/um/asm/processor_64.h
@@ -32,7 +32,7 @@ static inline void arch_copy_thread(struct arch_thread *from,
 }
 
 #define current_text_addr() \
-   ({ void *pc; __asm__("movq $1f,%0\n1:":"=g" (pc)); pc; })
+   ({ __label__ _l; _l: &&_l; })
 
 #define current_sp() ({ void *sp; __asm__("movq %%rsp, %0" : "=r" (sp) : ); 
sp; })
 #define current_bp() ({ unsigned long bp; __asm__("movq %%rbp, %0" : "=r" (bp) 
: ); bp; })
-- 
2.9.3



[PATCH] um: Fix compile failure due to current_text_address() definition

2016-11-09 Thread Keno Fischer
Fixes the following link error:
```
/usr/bin/ld: net/built-in.o: relocation R_X86_64_32S against `.text'
can not be used when making a shared object; recompile with -fPIC
```

This is the same definition used on some other architectures.

Signed-off-by: Keno Fischer 
---
I am not sure this is the correct patch in the context of uml. I believe this
should give the runtime ip, which may be different between runs. It may be
better to use the offset in .text (e.g. by using `pc-__text_start`), which
should be consistent.

 arch/x86/um/asm/processor_64.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/um/asm/processor_64.h b/arch/x86/um/asm/processor_64.h
index c3be852..6ca3304 100644
--- a/arch/x86/um/asm/processor_64.h
+++ b/arch/x86/um/asm/processor_64.h
@@ -32,7 +32,7 @@ static inline void arch_copy_thread(struct arch_thread *from,
 }
 
 #define current_text_addr() \
-   ({ void *pc; __asm__("movq $1f,%0\n1:":"=g" (pc)); pc; })
+   ({ __label__ _l; _l: &&_l; })
 
 #define current_sp() ({ void *sp; __asm__("movq %%rsp, %0" : "=r" (sp) : ); 
sp; })
 #define current_bp() ({ unsigned long bp; __asm__("movq %%rbp, %0" : "=r" (bp) 
: ); bp; })
-- 
2.9.3



Use of r10 in powerpc syscall entry

2016-10-04 Thread Keno Fischer
Hi Anton,

I was reading the powerpc syscall entry code and git points me at your commit
05b05f28 (powerpc: Relocatable system call no longer uses the LR) for one
part that confused me, so I hope you don't mind a quick question. In particular,
that commit removed the use of r10 to restore lr, but didn't touch the earlier
`mflr r10` to actually save the lr to r10. Is r10 still required there
for some reason
or is that just left over? Part of the reason I'm asking is because it seems
one could use r10, instead of r13 later, i.e.

  #define SYSCALL_PSERIES_2_DIRECT \
-  mflr r10 ; \
  ld r12,PACAKBASE(r13) ; \
  LOAD_HANDLER(r12, system_call_entry) ; \
  mtctr r12 ; \
  mfspr r12,SPRN_SRR1 ; \
- /* Re-use of r13... No spare regs to do this */ \
- li r13,MSR_RI ; \
- mtmsrd r13,1 ; \
+ li r10, MSR_RI ; \
+ mtmsrd r10,1 ; \
- GET_PACA(r13) ; /* get r13 back */ \
  bctr ;

Also only semi-relatedly, I was curious (if you, or anybody reading
happen to know) why
SRR0 and SRR1 are being moved to registers so early in the interrupt handler.
I had speculated that this was because of potential page faults on memory access
that would clobber those registers, but then I noticed the `ld
r12,PACAKBASE(r13)` before `mfspr r12,SPRN_SRR1`, which seemed like it
could touch memory, so I was confused again.

Hope the questions make sense, and sorry if I missed something obvious - I have
very little experience with ppc.

Thanks,
Keno


Use of r10 in powerpc syscall entry

2016-10-04 Thread Keno Fischer
Hi Anton,

I was reading the powerpc syscall entry code and git points me at your commit
05b05f28 (powerpc: Relocatable system call no longer uses the LR) for one
part that confused me, so I hope you don't mind a quick question. In particular,
that commit removed the use of r10 to restore lr, but didn't touch the earlier
`mflr r10` to actually save the lr to r10. Is r10 still required there
for some reason
or is that just left over? Part of the reason I'm asking is because it seems
one could use r10, instead of r13 later, i.e.

  #define SYSCALL_PSERIES_2_DIRECT \
-  mflr r10 ; \
  ld r12,PACAKBASE(r13) ; \
  LOAD_HANDLER(r12, system_call_entry) ; \
  mtctr r12 ; \
  mfspr r12,SPRN_SRR1 ; \
- /* Re-use of r13... No spare regs to do this */ \
- li r13,MSR_RI ; \
- mtmsrd r13,1 ; \
+ li r10, MSR_RI ; \
+ mtmsrd r10,1 ; \
- GET_PACA(r13) ; /* get r13 back */ \
  bctr ;

Also only semi-relatedly, I was curious (if you, or anybody reading
happen to know) why
SRR0 and SRR1 are being moved to registers so early in the interrupt handler.
I had speculated that this was because of potential page faults on memory access
that would clobber those registers, but then I noticed the `ld
r12,PACAKBASE(r13)` before `mfspr r12,SPRN_SRR1`, which seemed like it
could touch memory, so I was confused again.

Hope the questions make sense, and sorry if I missed something obvious - I have
very little experience with ppc.

Thanks,
Keno


Re: ptrace group stop signal number not reset before PTRACE_INTERRUPT is delivered?

2016-09-13 Thread Keno Fischer
Hi Oleg,

I have another obscure ptrace question which seems somewhat related,
so let me ask it here.
Consider this:

```
static int sigchld_counter = 0;
void sigchld_handler(int sig) {
  (void)sig;
  sigchld_counter++;
}

int main(void) {
  signal(SIGCHLD, sigchld_handler);

  pid_t child;
  if (0 == (child = fork())) {
raise(SIGSTOP);
assert(0); // Should never be continued
  }

  // Wait until stopped
  int status;
  pid_t wret = waitpid(child, , __WALL | WSTOPPED);
  assert(wret == child);
  assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP);
  assert(sigchld_counter == 1);

  // Now PTRACE_SEIZE the child
  long err = ptrace(PTRACE_SEIZE, child, NULL, (void*)PTRACE_O_TRACESYSGOOD);
  assert(err == 0);

  // Make sure that didn't cause a notification
  wret = waitpid(child, , __WALL | WSTOPPED | WNOHANG);
  assert(wret == 0);
  assert(sigchld_counter == 1);
}
```

I wouldn't have expected the PTRACE_SEIZE to generate another
SIGCHLD/be waitable again,
(the last two assertions fail).
Is that supposed to happen? If so, I'd like to update the man page to
mention this behavior, but
I wanted to check with you first.

Thanks,
Keno


On Tue, Aug 23, 2016 at 11:34 AM, Oleg Nesterov <o...@redhat.com> wrote:
> On 08/18, Keno Fischer wrote:
>>
>> On Thu, Aug 18, 2016 at 12:23 PM, Oleg Nesterov <o...@redhat.com> wrote:
>> >
>> > And you if you get PTRACE_EVENT_STOP and WSTOPSIG() == SIGTTIN after
>> > PTRACE_INTERRUPT, you know that the tracee did not report the "new"
>> > SIGTTIN.
>>
>> It seems possible to remember whether or not we injected a stopping
>> signal and if so the next PTRACE_EVENT_STOP is a group-stop, otherwise
>> a PTRACE_INTERRUPT stop. Currently what I do is the other way around,
>> after issuing PTRACE_INTERRUPT, the first (if any) of the next two
>> stops that is a PTRACE_EVENT_STOP get interpreted as a
>> PTRACE_INTERRUPT stop. I haven't thought through this fully yet, so I
>> can't give you a concrete example I worried about, it just seems
>> fragile compared to just checking whether WSTOPSIG() == SIGTRAP.
>
> Yes, I see your point. And to remind, I was confused too.
>
> Perhaps we can add another THIS_SIGNAL_WAS_ALREADY_REPORTED bit, but
> you know, I'd prefer to avoid another subtle change in behaviour. You
> can never know if it is "safe" or not when it comes to ptrace, perhaps
> some application already relies on this WSTOPSIG().
>
> Oleg.
>


Re: ptrace group stop signal number not reset before PTRACE_INTERRUPT is delivered?

2016-09-13 Thread Keno Fischer
Hi Oleg,

I have another obscure ptrace question which seems somewhat related,
so let me ask it here.
Consider this:

```
static int sigchld_counter = 0;
void sigchld_handler(int sig) {
  (void)sig;
  sigchld_counter++;
}

int main(void) {
  signal(SIGCHLD, sigchld_handler);

  pid_t child;
  if (0 == (child = fork())) {
raise(SIGSTOP);
assert(0); // Should never be continued
  }

  // Wait until stopped
  int status;
  pid_t wret = waitpid(child, , __WALL | WSTOPPED);
  assert(wret == child);
  assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP);
  assert(sigchld_counter == 1);

  // Now PTRACE_SEIZE the child
  long err = ptrace(PTRACE_SEIZE, child, NULL, (void*)PTRACE_O_TRACESYSGOOD);
  assert(err == 0);

  // Make sure that didn't cause a notification
  wret = waitpid(child, , __WALL | WSTOPPED | WNOHANG);
  assert(wret == 0);
  assert(sigchld_counter == 1);
}
```

I wouldn't have expected the PTRACE_SEIZE to generate another
SIGCHLD/be waitable again,
(the last two assertions fail).
Is that supposed to happen? If so, I'd like to update the man page to
mention this behavior, but
I wanted to check with you first.

Thanks,
Keno


On Tue, Aug 23, 2016 at 11:34 AM, Oleg Nesterov  wrote:
> On 08/18, Keno Fischer wrote:
>>
>> On Thu, Aug 18, 2016 at 12:23 PM, Oleg Nesterov  wrote:
>> >
>> > And you if you get PTRACE_EVENT_STOP and WSTOPSIG() == SIGTTIN after
>> > PTRACE_INTERRUPT, you know that the tracee did not report the "new"
>> > SIGTTIN.
>>
>> It seems possible to remember whether or not we injected a stopping
>> signal and if so the next PTRACE_EVENT_STOP is a group-stop, otherwise
>> a PTRACE_INTERRUPT stop. Currently what I do is the other way around,
>> after issuing PTRACE_INTERRUPT, the first (if any) of the next two
>> stops that is a PTRACE_EVENT_STOP get interpreted as a
>> PTRACE_INTERRUPT stop. I haven't thought through this fully yet, so I
>> can't give you a concrete example I worried about, it just seems
>> fragile compared to just checking whether WSTOPSIG() == SIGTRAP.
>
> Yes, I see your point. And to remind, I was confused too.
>
> Perhaps we can add another THIS_SIGNAL_WAS_ALREADY_REPORTED bit, but
> you know, I'd prefer to avoid another subtle change in behaviour. You
> can never know if it is "safe" or not when it comes to ptrace, perhaps
> some application already relies on this WSTOPSIG().
>
> Oleg.
>


How does the size field work in IOCTL numbers?

2016-09-02 Thread Keno Fischer
Hi folks,

this is more of a general linux question, but since I noticed it
while looking perf_events code, I'm ccing perf_events folks
in case the answer is perf_events specific (hope that's ok).

uapi/linux/perf_event.h has the following:
#define PERF_EVENT_IOC_PERIOD _IOW('$', 4, __u64)
#define PERF_EVENT_IOC_SET_OUTPUT _IO ('$', 5)
#define PERF_EVENT_IOC_SET_FILTER _IOW('$', 6, char *)
#define PERF_EVENT_IOC_ID _IOR('$', 7, __u64 *)
#define PERF_EVENT_IOC_SET_BPF _IOW('$', 8, __u32)

Now, my question is how to interpret the last argument to the
_IO* macros. The man page for ioctl(2) says, it encodes the
"size of the argument argp in bytes", but I'm not sure how to
interpret that.

For example, IOC_PERIOD takes a pointer to a 64-bit value,
so the __u64 makes sense for me. But as far as I know,
IOC_ID also takes a pointer to a 64-bit value
(and writes the ID to it), but it has `__64 *` rather than
`__u64`. The we have IOC_SET_FILTER which as far as I know
takes a pointer to a variable-length string (but with the
above definition the size field is sizeof(char*)), and
IOC_SET_BPF which doesn't take a pointer at all,
but interprets the argument as a file descriptor (similar to
IOC_SET_OUTPUT, which doesn't have a size at all).

I don't understand what the rule is for what to put in that third
argument, or is it ioctl specific? Please let me know if I missed
something.

Thanks,
Keno


How does the size field work in IOCTL numbers?

2016-09-02 Thread Keno Fischer
Hi folks,

this is more of a general linux question, but since I noticed it
while looking perf_events code, I'm ccing perf_events folks
in case the answer is perf_events specific (hope that's ok).

uapi/linux/perf_event.h has the following:
#define PERF_EVENT_IOC_PERIOD _IOW('$', 4, __u64)
#define PERF_EVENT_IOC_SET_OUTPUT _IO ('$', 5)
#define PERF_EVENT_IOC_SET_FILTER _IOW('$', 6, char *)
#define PERF_EVENT_IOC_ID _IOR('$', 7, __u64 *)
#define PERF_EVENT_IOC_SET_BPF _IOW('$', 8, __u32)

Now, my question is how to interpret the last argument to the
_IO* macros. The man page for ioctl(2) says, it encodes the
"size of the argument argp in bytes", but I'm not sure how to
interpret that.

For example, IOC_PERIOD takes a pointer to a 64-bit value,
so the __u64 makes sense for me. But as far as I know,
IOC_ID also takes a pointer to a 64-bit value
(and writes the ID to it), but it has `__64 *` rather than
`__u64`. The we have IOC_SET_FILTER which as far as I know
takes a pointer to a variable-length string (but with the
above definition the size field is sizeof(char*)), and
IOC_SET_BPF which doesn't take a pointer at all,
but interprets the argument as a file descriptor (similar to
IOC_SET_OUTPUT, which doesn't have a size at all).

I don't understand what the rule is for what to put in that third
argument, or is it ioctl specific? Please let me know if I missed
something.

Thanks,
Keno


Re: ptrace group stop signal number not reset before PTRACE_INTERRUPT is delivered?

2016-08-18 Thread Keno Fischer
On Thu, Aug 18, 2016 at 12:23 PM, Oleg Nesterov  wrote:
>
> And you if you get PTRACE_EVENT_STOP and WSTOPSIG() == SIGTTIN after
> PTRACE_INTERRUPT, you know that the tracee did not report the "new"
> SIGTTIN.

It seems possible to remember whether or not we injected a stopping
signal and if so the next PTRACE_EVENT_STOP is a group-stop, otherwise
a PTRACE_INTERRUPT stop. Currently what I do is the other way around,
after issuing PTRACE_INTERRUPT, the first (if any) of the next two
stops that is a PTRACE_EVENT_STOP get interpreted as a
PTRACE_INTERRUPT stop. I haven't thought through this fully yet, so I
can't give you a concrete example I worried about, it just seems
fragile compared to just checking whether WSTOPSIG() == SIGTRAP.


Re: ptrace group stop signal number not reset before PTRACE_INTERRUPT is delivered?

2016-08-18 Thread Keno Fischer
On Thu, Aug 18, 2016 at 12:23 PM, Oleg Nesterov  wrote:
>
> And you if you get PTRACE_EVENT_STOP and WSTOPSIG() == SIGTTIN after
> PTRACE_INTERRUPT, you know that the tracee did not report the "new"
> SIGTTIN.

It seems possible to remember whether or not we injected a stopping
signal and if so the next PTRACE_EVENT_STOP is a group-stop, otherwise
a PTRACE_INTERRUPT stop. Currently what I do is the other way around,
after issuing PTRACE_INTERRUPT, the first (if any) of the next two
stops that is a PTRACE_EVENT_STOP get interpreted as a
PTRACE_INTERRUPT stop. I haven't thought through this fully yet, so I
can't give you a concrete example I worried about, it just seems
fragile compared to just checking whether WSTOPSIG() == SIGTRAP.


ptrace group stop signal number not reset before PTRACE_INTERRUPT is delivered?

2016-08-17 Thread Keno Fischer
Consider below test case (not all of it is necessary for reproducing
the behavior in question, but I wanted to cover related cases as well
to make sure they behave as expected). In this test case, the last
group-stop (after PTRACE_INTERRUPT) is delivered with a
WSTOPSIG(status) of SIGTTIN, which was the signr of the previous group
stop. From reading the man-page, I would have expected SIGTRAP. Now, I
understand that if there is another stop pending, PTRACE_INTERRUPT
will simply piggy-backs off that one, but I don't believe that is
happening in this case. Further, the current behavior seems to make it
very hard (impossible?) to reliably tell a true group-stop from a
PTRACE_INTERRUPT generated one. Is this behavior intended? If so, I
would be happy to update the man page accordingly, but would like some
guidance on what the intended semantics are.

```
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

int main() {
pid_t child, ret;
int err;
int status;
if (0 == (child = fork())) {
   kill(getpid(), SIGSTOP);
   kill(getpid(), SIGSTOP);
   kill(getpid(), SIGTTIN);
   sleep(1000);
   exit(0);
}
ret = waitpid(child, , WSTOPPED);
assert(ret == child);
err = ptrace(PTRACE_SEIZE, child, NULL, NULL);
assert(err == 0);
err = ptrace(PTRACE_CONT, child, NULL, NULL);
assert(err == 0);
// Should now hit SIGSTOP signal-stop
ret = waitpid(child, , 0);
assert(ret == child && WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP);
err = ptrace(PTRACE_CONT, child, NULL, (void*)SIGSTOP);
assert(err == 0);
// Should now hit SIGSTOP group-stop
ret = waitpid(child, , 0);
assert(ret == child && (status>>16 == PTRACE_EVENT_STOP) &&
WSTOPSIG(status) == SIGSTOP);
err = ptrace(PTRACE_CONT, child, NULL, NULL);
assert(err == 0);
// Should now hit SIGTTIN signal-stop
ret = waitpid(child, , 0);
assert(ret == child && WIFSTOPPED(status) && WSTOPSIG(status) == SIGTTIN);
err = ptrace(PTRACE_CONT, child, NULL, (void*)SIGTTIN);
assert(err == 0);
// Should now hit SIGTTIN group-stop
ret = waitpid(child, , 0);
assert(ret == child && (status>>16 == PTRACE_EVENT_STOP) &&
WSTOPSIG(status) == SIGTTIN);
err = ptrace(PTRACE_CONT, child, NULL, NULL);
assert(err == 0);
// Now interrupt it
err = ptrace(PTRACE_INTERRUPT, child, NULL, NULL);
assert(err == 0);
// Should now hit interrupt group-stop
ret = waitpid(child, , 0);
printf("Interrupt group-stop delivered with signal %d\n", WSTOPSIG(status));
assert(ret == child && (status>>16 == PTRACE_EVENT_STOP) &&
WSTOPSIG(status) == SIGTRAP);
exit(0);
}
```


ptrace group stop signal number not reset before PTRACE_INTERRUPT is delivered?

2016-08-17 Thread Keno Fischer
Consider below test case (not all of it is necessary for reproducing
the behavior in question, but I wanted to cover related cases as well
to make sure they behave as expected). In this test case, the last
group-stop (after PTRACE_INTERRUPT) is delivered with a
WSTOPSIG(status) of SIGTTIN, which was the signr of the previous group
stop. From reading the man-page, I would have expected SIGTRAP. Now, I
understand that if there is another stop pending, PTRACE_INTERRUPT
will simply piggy-backs off that one, but I don't believe that is
happening in this case. Further, the current behavior seems to make it
very hard (impossible?) to reliably tell a true group-stop from a
PTRACE_INTERRUPT generated one. Is this behavior intended? If so, I
would be happy to update the man page accordingly, but would like some
guidance on what the intended semantics are.

```
#include 
#include 
#include 
#include 
#include 
#include 
#include 
#include 

int main() {
pid_t child, ret;
int err;
int status;
if (0 == (child = fork())) {
   kill(getpid(), SIGSTOP);
   kill(getpid(), SIGSTOP);
   kill(getpid(), SIGTTIN);
   sleep(1000);
   exit(0);
}
ret = waitpid(child, , WSTOPPED);
assert(ret == child);
err = ptrace(PTRACE_SEIZE, child, NULL, NULL);
assert(err == 0);
err = ptrace(PTRACE_CONT, child, NULL, NULL);
assert(err == 0);
// Should now hit SIGSTOP signal-stop
ret = waitpid(child, , 0);
assert(ret == child && WIFSTOPPED(status) && WSTOPSIG(status) == SIGSTOP);
err = ptrace(PTRACE_CONT, child, NULL, (void*)SIGSTOP);
assert(err == 0);
// Should now hit SIGSTOP group-stop
ret = waitpid(child, , 0);
assert(ret == child && (status>>16 == PTRACE_EVENT_STOP) &&
WSTOPSIG(status) == SIGSTOP);
err = ptrace(PTRACE_CONT, child, NULL, NULL);
assert(err == 0);
// Should now hit SIGTTIN signal-stop
ret = waitpid(child, , 0);
assert(ret == child && WIFSTOPPED(status) && WSTOPSIG(status) == SIGTTIN);
err = ptrace(PTRACE_CONT, child, NULL, (void*)SIGTTIN);
assert(err == 0);
// Should now hit SIGTTIN group-stop
ret = waitpid(child, , 0);
assert(ret == child && (status>>16 == PTRACE_EVENT_STOP) &&
WSTOPSIG(status) == SIGTTIN);
err = ptrace(PTRACE_CONT, child, NULL, NULL);
assert(err == 0);
// Now interrupt it
err = ptrace(PTRACE_INTERRUPT, child, NULL, NULL);
assert(err == 0);
// Should now hit interrupt group-stop
ret = waitpid(child, , 0);
printf("Interrupt group-stop delivered with signal %d\n", WSTOPSIG(status));
assert(ret == child && (status>>16 == PTRACE_EVENT_STOP) &&
WSTOPSIG(status) == SIGTRAP);
exit(0);
}
```