Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-03-31 Thread Robert O'Callahan
For the record, the benefits of dynamic XCR0 for rr recording
portability still apply. I guess it'd be useful for CRIU too. We would
also benefit from anything that incentivizes increased support for
CPUID faulting.

Rob
--
Su ot deraeppa sah dna Rehtaf eht htiw saw hcihw, efil lanrete eht uoy
ot mialcorp ew dna, ti ot yfitset dna ti nees evah ew; deraeppa efil
eht. Efil fo Drow eht gninrecnoc mialcorp ew siht - dehcuot evah sdnah
ruo dna ta dekool evah ew hcihw, seye ruo htiw nees evah ew hcihw,
draeh evah ew hcihw, gninnigeb eht morf saw hcihw taht.


Re: [REGRESSION] x86/entry: TIF_SINGLESTEP handling is still broken

2021-01-31 Thread Robert O'Callahan
On Mon, Feb 1, 2021 at 12:40 PM Andy Lutomirski wrote:
> I admit that PTRACE_SINGLESTEP seems like an odd way to spell "advance
> to the end of the syscall", but you're right, it should work.

We don't know of any better way to advance to the end of the syscall
without executing any userspace instructions. We could set a
breakpoint at the syscall return address but weird edge cases
complicate that.

Rob
-- 
"He was pierced for our transgressions, he was crushed for our
iniquities; the punishment that brought us peace was upon him, and by
his wounds we are healed. We all, like sheep, have gone astray, each
of us has turned to his own way; and the LORD has laid on him the
iniquity of us all." [Isaiah 53:5-6]


Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas

2020-06-25 Thread Robert O'Callahan
On Fri, Jun 26, 2020 at 11:49 AM Gabriel Krisman Bertazi wrote:
> We couldn't patch Windows code because of the aforementioned DRM and
> anti-cheat mechanisms, but I suppose this limitation doesn't apply to
> Wine/native code, and if this assumption is correct, this approach could
> work.
>
> One complexity might be the consistent model for the syscall live
> patching.  I don't know how much of the problem is diminished from the
> original userspace live-patching problem, but I believe at least part of
> it applies.  And fencing every thread to patch would kill performance.
> Also, we cannot just patch everything at the beginning.  How does rr
> handle that?

That's a good point. rr only allows one tracee thread to run at a time
for other reasons, so when we consider patching a syscall instruction,
we inspect all thread states to see if the patch would interfere with
any other thread, and skip patching it in that unlikely case. (We'll
try to patch it again next time that instruction is executed.) Wine
would need some other solution, but indeed that could be a
showstopper.

> Another problem is that we will want to support i386 and other
> architectures.  For int 0x80, it is trickier to encode a branch to
> another region, given the limited instruction space, and the patching
> might not be possible in hot paths.

This is no worse than for x86-64 `syscall`, which is also two bytes.
We have pretty much always been able to patch the frequently executed
syscalls by replacing both the syscall instruction and an instruction
before or after the syscall with a five-byte jump, and folding the
extra instruction into the stub.
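
To illustrate, the patch itself amounts to something like this (a
sketch with a hypothetical helper, not rr's actual code; it assumes
the patcher has already verified that the neighbouring instruction is
safe to displace and that it can write to the code, e.g. via
/proc/<pid>/mem):

#include <stdint.h>
#include <string.h>

/* Overwrite a 2-byte `syscall` (0f 05) plus the 3-byte instruction
 * beside it with a single 5-byte `jmp rel32` (e9) to the stub; the
 * stub replays the displaced instruction around the real syscall. */
static void patch_syscall_site(uint8_t *site, const uint8_t *stub)
{
    int32_t rel = (int32_t)(stub - (site + 5)); /* relative to next insn */
    site[0] = 0xE9;                             /* jmp rel32 opcode */
    memcpy(site + 1, &rel, sizeof(rel));
}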

> I did port libsyscall-intercept to
> x86-32 once and I could correctly patch glibc, but it's not guaranteed
> that an updated libc or something else won't break it.

We haven't found this to be much of a problem in rr. From time to time
we have to add new patch patterns. The good news is that if things
change so a syscall can't be patched, the result is degraded
performance, not functional breakage.

> I'm not sure the benefit of not needing enhanced kernel support
> justifies the complexity and performance cost required to make this work
> reliably, in particular since the semantics for a kernel implementation
> that we are discussing doesn't seem overly intrusive and might have
> other applications like in the generic filter Andy mentioned.

That's fair --- our solution is complex. (But worth it --- for us,
it's valuable that rr works on quite old Linux kernels.)

As for performance, our patching scheme performs well for us. I think we'd prefer our
current approach to Andy's hypothetical PR_SET_SYSCALL_THUNK because
the latter would have higher overhead (two trips into the kernel per
syscall). We definitely don't want to require CAP_SYS_ADMIN so that
rules out any eBPF-based alternative too. I would love to see a
low-overhead unprivileged syscall interception mechanism that would
obsolete our patching approach --- preferably one that's also
stackable so rr can record and replay processes that use the new
mechanism --- but I don't think any of the proposals here are that,
yet, unfortunately.

Rob
-- 
Su ot deraeppa sah dna Rehtaf eht htiw saw hcihw, efil lanrete eht uoy
ot mialcorp ew dna, ti ot yfitset dna ti nees evah ew; deraeppa efil
eht. Efil fo Drow eht gninrecnoc mialcorp ew siht - dehcuot evah sdnah
ruo dna ta dekool evah ew hcihw, seye ruo htiw nees evah ew hcihw,
draeh evah ew hcihw, gninnigeb eht morf saw hcihw taht.


Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas

2020-06-25 Thread Robert O'Callahan
rr (https://rr-project.org, https://arxiv.org/abs/1705.05937) grapples
with a similar problem. We need to intercept commonly-executed system
calls and wrap them with our own processing, with minimal overhead. I
think our basic approach might work for Wine without kernel changes.

We use SECCOMP_SET_MODE_FILTER with a simple filter that returns
SECCOMP_RET_TRAP on all syscalls except for those called from a single
specific trampoline page (which get SECCOMP_RET_ALLOW). rr ptraces its
children. So, when user-space makes a syscall, the seccomp filter
triggers a ptrace trap. The ptracer looks at the code around the
syscall and if it matches certain common patterns, the ptracer patches
the code with a jump to a stub that does extra work and issues a real
syscall via the trampoline. Thus, each library syscall instruction is
slow the first time and fast every subsequent time. "Weird" syscalls
that the ptracer chooses not to patch do incur the context-switch
penalty every time so their overhead does increase a lot ... but it
sounds like that might be OK in Wine's case?
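
For concreteness, here's a minimal sketch of such a filter (not rr's
actual filter). It assumes a little-endian 64-bit target and a
trampoline page mapped at a known fixed address below 4 GiB, so the
64-bit instruction pointer can be range-checked with 32-bit cBPF
compares, and it omits the usual seccomp_data.arch check:

#include <stddef.h>
#include <stdint.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

static int install_trampoline_filter(uint32_t tramp_page)
{
    struct sock_filter filter[] = {
        /* high word of instruction_pointer must be zero */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, instruction_pointer) + 4),
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0, 0, 4),
        /* low word must lie inside the trampoline page */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, instruction_pointer)),
        BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, tramp_page, 0, 2),
        BPF_JUMP(BPF_JMP | BPF_JGE | BPF_K, tramp_page + 4096, 1, 0),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRAP),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };
    /* Installing a filter requires no_new_privs (or CAP_SYS_ADMIN). */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
        return -1;
    return syscall(SYS_seccomp, SECCOMP_SET_MODE_FILTER, 0, &prog);
}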

A more efficient variant of this approach which would work in some
cases (but maybe not Wine?) would be to avoid using a ptracer and give
the process a SIGSYS handler which does the patching.

Rob
-- 
Su ot deraeppa sah dna Rehtaf eht htiw saw hcihw, efil lanrete eht uoy
ot mialcorp ew dna, ti ot yfitset dna ti nees evah ew; deraeppa efil
eht. Efil fo Drow eht gninrecnoc mialcorp ew siht - dehcuot evah sdnah
ruo dna ta dekool evah ew hcihw, seye ruo htiw nees evah ew hcihw,
draeh evah ew hcihw, gninnigeb eht morf saw hcihw taht.


Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation

2019-05-02 Thread Robert O'Callahan
On Fri, May 3, 2019 at 3:20 AM Ingo Molnar wrote:
> So what might work better is if we defined a Rust dialect that used C
> syntax. I.e. the end result would be something like the 'c2rust' or
> 'citrus' projects, where code like this would be directly translatable to
> Rust:
>
> void gz_compress(FILE * in, gzFile out)
> {
>     char buf[BUFLEN];
>     int len;
>     int err;
>
>     for (;;) {
>         len = fread(buf, 1, sizeof(buf), in);
>         if (ferror(in)) {
>             perror("fread");
>             exit(1);
>         }
>         if (len == 0)
>             break;
>         if (gzwrite(out, buf, (unsigned)len) != len)
>             error(gzerror(out, &err));
>     }
>     fclose(in);
>
>     if (gzclose(out) != Z_OK)
>         error("failed gzclose");
> }
>
>
> #[no_mangle]
> pub unsafe extern "C" fn gz_compress(mut in_: *mut FILE, mut out: gzFile) {
>     let mut buf: [i8; 16384];
>     let mut len;
>     let mut err;
>     loop {
>         len = fread(buf, 1, std::mem::size_of_val(&buf), in_);
>         if ferror(in_) != 0 { perror("fread"); exit(1); }
>         if len == 0 { break; }
>         if gzwrite(out, buf, len as c_uint) != len {
>             error(gzerror(out, &mut err));
>         };
>     }
>     fclose(in_);
>     if gzclose(out) != Z_OK { error("failed gzclose"); };
> }
>
> Example taken from:
>
>https://gitlab.com/citrus-rs/citrus
>
> Does this make sense?

Are you saying you want a tool like c2rust/citrus that translates some
new "looks like C, but really Rust" language into actual Rust at build
time? I guess that might work, but I suspect your "looks like C"
language isn't going to end up being much like C (e.g. it's going to
need Rust-style enums-with-fields, Rust polymorphism, Rust traits, and
Rust lifetimes), so it may not be beneficial, because you've just
created a new language no-one knows, and that has some real downsides.

If you're inspired by the dream of transitioning to safer languages,
then I think the first practical step would be to identify some part
of the kernel where the payoff of converting code would be highest.
This is probably something small, relatively isolated, that's not well
tested, generally suspicious, but still in use. Then do an experiment,
converting it to Rust (or something else) using off-the-shelf tools
and manual labor, and see where the pain points are and what benefits
accrue, if any. (Work like https://github.com/tsgates/rust.ko might be
a helpful starting point.) Then you'd have some data to start thinking
about how to reduce the costs, increase the benefits, and sell it to
the kernel community. If you reached out to the Rust community you
might find some volunteers to help with this.

Rob
-- 
Su ot deraeppa sah dna Rehtaf eht htiw saw hcihw, efil lanrete eht uoy
ot mialcorp ew dna, ti ot yfitset dna ti nees evah ew; deraeppa efil
eht. Efil fo Drow eht gninrecnoc mialcorp ew siht - dehcuot evah sdnah
ruo dna ta dekool evah ew hcihw, seye ruo htiw nees evah ew hcihw,
draeh evah ew hcihw, gninnigeb eht morf saw hcihw taht.


Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation

2019-05-02 Thread Robert O'Callahan
On Sat, Apr 27, 2019 at 10:46 PM Ingo Molnar wrote:
>  - A C language runtime that is a subset of current C syntax and
>semantics used in the kernel, and which doesn't allow access outside
>of existing objects and thus creates a strictly enforced separation
>between memory used for data, and memory used for code and control
>flow.
>
>  - This would involve, at minimum:
>
> - tracking every type and object and its inherent length and valid
>   access patterns, and never losing track of its type.
>
> - being a lot more organized about initialization, i.e. no
>   uninitialized variables/fields.
>
> - being a lot more strict about type conversions and pointers in
>   general.
>
> - ... and a metric ton of other details.

Several research groups have tried to do this, and it is very
difficult to do. In particular this was almost exactly the goal of
CCured [1]. Much more recently, there's Microsoft's Checked C [2] [3],
which is less ambitious. Check the references of the latter for lots
of relevant work. If anyone really pursues this they should talk
directly to researchers who've worked on this, e.g. George Necula; you
need to know what *didn't* work well, which is hard to glean from
papers. (Academic publishing is broken that way.)

One problem with adopting "safe C" or Rust in the kernel is that most
of your security mitigations (e.g. KASLR, CFI, other randomizations)
probably need to remain in place as long as there is a significant
amount of C in the kernel, which means the benefits from eliminating
them will be realized very far in the future, if ever, which makes the
whole exercise harder to justify.

Having said that, I think there's a good case to be made for writing
kernel code in Rust, e.g. sketchy drivers. The classes of bugs
prevented in Rust are significantly broader than your usual safe-C
dialect (e.g. data races).

[1] https://web.eecs.umich.edu/~weimerw/p/p477-necula.pdf
[2] https://www.microsoft.com/en-us/research/uploads/prod/2019/05/checkedc-post2019.pdf
[3] https://github.com/Microsoft/checkedc

Rob
-- 
Su ot deraeppa sah dna Rehtaf eht htiw saw hcihw, efil lanrete eht uoy
ot mialcorp ew dna, ti ot yfitset dna ti nees evah ew; deraeppa efil
eht. Efil fo Drow eht gninrecnoc mialcorp ew siht - dehcuot evah sdnah
ruo dna ta dekool evah ew hcihw, seye ruo htiw nees evah ew hcihw,
draeh evah ew hcihw, gninnigeb eht morf saw hcihw taht.


Re: Regression in 32-bit-compat TIOCGPTPEER ioctl due to 311fc65c9fb9c966bca8e6f3ff8132ce57344ab9

2019-01-15 Thread Robert O'Callahan
On Tue, Jan 15, 2019 at 8:25 PM Eric W. Biederman wrote:
> Can you confirm this fixes it for you?

Sorry --- I made a mistake. It looks like this is already fixed on
master, by commit e21120383f2dce32312f63ffca145ff8a87d41f5 (though I
don't think Al Viro knew he was fixing this bug). So, nothing to do
here.

Thanks,
Rob
-- 
Su ot deraeppa sah dna Rehtaf eht htiw saw hcihw, efil lanrete eht uoy
ot mialcorp ew dna, ti ot yfitset dna ti nees evah ew; deraeppa efil
eht. Efil fo Drow eht gninrecnoc mialcorp ew siht - dehcuot evah sdnah
ruo dna ta dekool evah ew hcihw, seye ruo htiw nees evah ew hcihw,
draeh evah ew hcihw, gninnigeb eht morf saw hcihw taht.


Regression in 32-bit-compat TIOCGPTPEER ioctl due to 311fc65c9fb9c966bca8e6f3ff8132ce57344ab9

2019-01-14 Thread Robert O'Callahan
This commit refactored the implementation of TIOCGPTPEER, moving "case
TIOCGPTPEER" from pty_unix98_ioctl() to tty_ioctl().
pty_unix98_ioctl() is called by pty_unix98_compat_ioctl(), so before
the commit, TIOCGPTPEER worked for 32-bit userspace. Unfortunately
tty_compat_ioctl() does not call tty_ioctl() so after the commit,
TIOCGPTPEER from 32-bit userspace fails with ENOTTY.
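
The shape of the failure is roughly this (a sketch, not the bugzilla
testcase; build with -m32 and run on a 64-bit kernel):

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>

#ifndef TIOCGPTPEER
#define TIOCGPTPEER 0x5441 /* _IO('T', 0x41) */
#endif

int main(void)
{
    int master = posix_openpt(O_RDWR | O_NOCTTY);
    if (master < 0 || grantpt(master) || unlockpt(master)) {
        perror("pty setup");
        return 1;
    }
    /* Fails with ENOTTY from 32-bit userspace on affected kernels. */
    int peer = ioctl(master, TIOCGPTPEER, O_RDWR | O_NOCTTY);
    if (peer < 0) {
        perror("TIOCGPTPEER");
        return 1;
    }
    printf("peer fd: %d\n", peer);
    return 0;
}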

Testcase in https://bugzilla.kernel.org/show_bug.cgi?id=202271.

I found this bug running the rr test suite.

Rob
-- 
Su ot deraeppa sah dna Rehtaf eht htiw saw hcihw, efil lanrete eht uoy
ot mialcorp ew dna, ti ot yfitset dna ti nees evah ew; deraeppa efil
eht. Efil fo Drow eht gninrecnoc mialcorp ew siht - dehcuot evah sdnah
ruo dna ta dekool evah ew hcihw, seye ruo htiw nees evah ew hcihw,
draeh evah ew hcihw, gninnigeb eht morf saw hcihw taht.


Re: [PATCH v1 0/2] perf: Drop leaked kernel samples

2018-06-15 Thread Robert O'Callahan
On Fri, Jun 15, 2018 at 10:16 AM, Kyle Huey wrote:
>
> If you want a sysctl for your own reasons that's fine. But we don't
> want a sysctl. We want to work without any further configuration.

Also toggling a sysctl would require root privileges, but rr does not
currently need to run as root. Thus rr users would have to either
permanently change their system configuration (and every extra
configuration step is a pain), or run rr as root so rr can toggle the
sysctl itself. Running rr as root is highly undesirable.

Rob
-- 
Su ot deraeppa sah dna Rehtaf eht htiw saw hcihw, efil lanrete eht uoy
ot mialcorp ew dna, ti ot yfitset dna ti nees evah ew; deraeppa efil
eht. Efil fo Drow eht gninrecnoc mialcorp ew siht - dehcuot evah sdnah
ruo dna ta dekool evah ew hcihw, seye ruo htiw nees evah ew hcihw,
draeh evah ew hcihw, gninnigeb eht morf saw hcihw taht.


regression in 32-bit-compat dev_ioctl due to commit bf4405737f9f85a06db2b0ce5d76a818b61992e2

2018-04-22 Thread Robert O'Callahan
The commit says

Once upon a time net/socket.c:dev_ifsioc() used to handle SIOCSHWTSTAMP and
SIOCSIFMAP.  These have different native and compat layout, so the format
conversion had been needed.  In 2009 these two cases had been taken out,
turning the rest into a convoluted way to calling sock_do_ioctl().  We copy
compat structure into native one, call sock_do_ioctl() on that and copy
the result back for the in/out ioctls.  No layout transformation anywhere,
so we might as well just call sock_do_ioctl() and skip all the headache with
copying.

However there is one problem: 32-bit 'struct ifreq' and 64-bit 'struct
ifreq' are not the same size. The former is 32 bytes and the latter is
40 bytes. Thus, if you place a 32-bit 'struct ifreq' immediately
before an unmapped page and try to pass it to one of these ioctls, the
syscall fails with EFAULT due to this commit.
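
A sketch of the failure mode (not the bugzilla test program;
SIOCGIFFLAGS stands in for any whole-ifreq ioctl routed through this
path; build with -m32 and run on a 64-bit kernel):

#include <stdio.h>
#include <string.h>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;
    mprotect(p + page, page, PROT_NONE); /* faults like unmapped memory */
    /* A 32-byte compat ifreq flush against the guard page; the 64-bit
     * kernel's 40-byte copy_from_user() runs off the end. */
    struct ifreq *ifr = (struct ifreq *)(p + page - sizeof(*ifr));
    memset(ifr, 0, sizeof(*ifr));
    strncpy(ifr->ifr_name, "lo", IFNAMSIZ);
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (ioctl(fd, SIOCGIFFLAGS, ifr) != 0)
        perror("SIOCGIFFLAGS"); /* EFAULT on affected kernels */
    else
        printf("flags: %d\n", ifr->ifr_flags);
    return 0;
}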

More details including test program in
https://bugzilla.kernel.org/show_bug.cgi?id=199469.

I found this bug running the rr test suite.

Thanks,
Rob
-- 
Su ot deraeppa sah dna Rehtaf eht htiw saw hcihw, efil lanrete eht uoy
ot mialcorp ew dna, ti ot yfitset dna ti nees evah ew; deraeppa efil
eht. Efil fo Drow eht gninrecnoc mialcorp ew siht - dehcuot evah sdnah
ruo dna ta dekool evah ew hcihw, seye ruo htiw nees evah ew hcihw,
draeh evah ew hcihw, gninnigeb eht morf saw hcihw taht.


Re: [RFC][PATCH] exec: Don't wait for ptraced threads to be reaped.

2017-09-03 Thread Robert O'Callahan
Sorry about replying to this old thread, but...

On Mon, Apr 3, 2017 at 9:07 AM, Eric W. Biederman wrote:
> I don't know who actually uses PTRACE_O_TRACEEXIT so I don't actually
> know what the implications of changing it are.  Let's see...
>
> gdb - no
> upstart - no
> lldb - yes
> strace - no

For the record, rr uses PTRACE_O_TRACEEXIT.

When a thread exits we need to examine its address space to find its
robust futex list and record the changes that will be performed by the
kernel as it cleans up that list. So, any failures to deliver
PTRACE_EVENT_EXIT are potentially problematic for us because we won't
get a chance to examine the address space before it disappears.
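
For concreteness, a hypothetical sketch of that inspection (rr's real
code differs; get_robust_list on another task needs ptrace-level
permission, and the tracee must be stopped in PTRACE_EVENT_EXIT so its
address space is still mapped):

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>
#include <linux/futex.h>

static void inspect_robust_list(pid_t pid)
{
    struct robust_list_head *remote_head;
    size_t len;
    /* Ask the kernel where the tracee's robust list head lives. */
    if (syscall(SYS_get_robust_list, pid, &remote_head, &len) != 0)
        return;
    struct robust_list_head head;
    struct iovec local = { .iov_base = &head, .iov_len = sizeof(head) };
    struct iovec remote = { .iov_base = remote_head, .iov_len = sizeof(head) };
    /* Copy it out of the tracee's (still-mapped) address space. */
    if (process_vm_readv(pid, &local, 1, &remote, 1, 0) == (ssize_t)sizeof(head))
        printf("first robust futex entry: %p\n", (void *)head.list.next);
}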

Rob
-- 
lbir ye,ea yer.tnietoehr  rdn rdsme,anea lurpr  edna e hnysnenh hhe uresyf toD
selthor  stor  edna  siewaoeodm  or v sstvr  esBa  kbvted,t rdsme,aoreseoouoto
o l euetiuruewFa  kbn e hnystoivateweh uresyf tulsa rehr  rdm  or rnea lurpr
.a war hsrer holsa rodvted,t  nenh hneireseoouot.tniesiewaoeivatewt sstvr  esn


Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)

2017-07-05 Thread Robert O'Callahan
On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland wrote:
> Should any of those be moved into the "should be dropped" pile?

Why not be conservative and clear every sample you're not sure about?

We'd appreciate a fix sooner rather than later here, since rr is
currently broken on every stable Linux kernel and our attempts to
implement a workaround have failed.

(We have separate "interrupt" and "measure" counters, and I thought we
might work around this regression by programming the "interrupt"
counter to count kernel events as well as user events (interrupting
early is OK), but that caused our (completely separate) "measure"
counter to report off-by-one results (!), which seems to be a
different bug present on a range of older kernels.)

Thanks,
Rob
-- 
lbir ye,ea yer.tnietoehr  rdn rdsme,anea lurpr  edna e hnysnenh hhe uresyf toD
selthor  stor  edna  siewaoeodm  or v sstvr  esBa  kbvted,t rdsme,aoreseoouoto
o l euetiuruewFa  kbn e hnystoivateweh uresyf tulsa rehr  rdm  or rnea lurpr
.a war hsrer holsa rodvted,t  nenh hneireseoouot.tniesiewaoeivatewt sstvr  esn


Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region

2017-06-28 Thread Robert O'Callahan
On Wed, Jun 28, 2017 at 10:19 AM, Mark Rutland wrote:
> It seems odd that an event without any samples to take has a sample
> period. I'm surprised that there's not *some* sample_type set.

Yes, it is a bit odd :-). Rather than trying to gather some limited
set of information about the target process, we want to interrupt its
execution after a particular number of events have occurred.

The bigger picture here is that we're replaying a previously recorded
process execution. We want the process to run until it reaches a
particular state that occurred during recording; that state is
identified by a value for the performance counter (retired conditional
branches) plus the values of the general-purpose registers. So our
first step is to run the process until the performance counter reaches
a value less than but "close to" the target value. (We know the
process will be interrupted after a number of extra events have
occurred, so we set the sample period to a value less than the target
by a heuristic "maximum skid size".)

Rob
-- 
lbir ye,ea yer.tnietoehr  rdn rdsme,anea lurpr  edna e hnysnenh hhe uresyf toD
selthor  stor  edna  siewaoeodm  or v sstvr  esBa  kbvted,t rdsme,aoreseoouoto
o l euetiuruewFa  kbn e hnystoivateweh uresyf tulsa rehr  rdm  or rnea lurpr
.a war hsrer holsa rodvted,t  nenh hneireseoouot.tniesiewaoeivatewt sstvr  esn


Re: Yes, people use FOLL_FORCE ;)

2017-05-29 Thread Robert O'Callahan
On Tue, May 30, 2017 at 11:08 AM, Keno Fischer <k...@juliacomputing.com> wrote:
> Now, while we're probably fine with using the fallbacks, I know there's
> others that rely on this behavior as well (cc'ing Robert O'Callahan of the
> rr project for which this change will result in significant performance
> degradation).

Yeah, this breaks rr. We write to readonly code and data mappings via
/proc/.../mem in lots of different situations, particularly when we're
adjusting program state during replay to match the recorded execution.

Like Julia, we can add workarounds, but they could be expensive. For
small writes we can fall back to PTRACE_POKEDATA without much
performance loss, but for big writes it would be cheaper to
sock-puppet the ptracee to call mprotect to temporarily make pages
writeable. Of course that takes at least four context switches, so the
lower-bound overhead of that approach is pretty bad. OTOH there are
probably tricks we could pull to mitigate the overhead some more. For
example during replay we might be able to make some pages that
"should" be read-only actually be read-write when we know the recorded
process didn't try writing to them.
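
The small-write fallback is simple enough; here's a sketch (assuming
word-aligned addr and len). PTRACE_POKEDATA moves one word per trip
into the kernel, fine for a few bytes but painful for big copies:

#include <stdint.h>
#include <sys/ptrace.h>
#include <sys/types.h>

static int poke_bytes(pid_t pid, uintptr_t addr, const void *buf, size_t len)
{
    const long *words = buf;
    for (size_t i = 0; i < len / sizeof(long); ++i) {
        /* One kernel round-trip per word written. */
        if (ptrace(PTRACE_POKEDATA, pid,
                   (void *)(addr + i * sizeof(long)),
                   (void *)words[i]) != 0)
            return -1;
    }
    return 0;
}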

So in summary: rr can be upgraded to cope with this, incurring some
unknown amount of additional overhead, but existing rr installs will
definitely be totally broken.

Thanks,
Rob


Re: [PATCH v2 00/01] KVM: x86: never specify a sample period for virtualized in_tx_cp counters

2017-02-23 Thread Robert O'Callahan
Great, thanks!

Rob
-- 
lbir ye,ea yer.tnietoehr  rdn rdsme,anea lurpr  edna e hnysnenh hhe uresyf toD
selthor  stor  edna  siewaoeodm  or v sstvr  esBa  kbvted,t rdsme,aoreseoouoto
o l euetiuruewFa  kbn e hnystoivateweh uresyf tulsa rehr  rdm  or rnea lurpr
.a war hsrer holsa rodvted,t  nenh hneireseoouot.tniesiewaoeivatewt sstvr  esn


Re: [PATCH v2 00/01] KVM: x86: never specify a sample period for virtualized in_tx_cp counters

2017-02-22 Thread Robert O'Callahan
Ping? Is there something else I need to do to get someone to look at this?

Thanks,
Rob
-- 
lbir ye,ea yer.tnietoehr  rdn rdsme,anea lurpr  edna e hnysnenh hhe uresyf toD
selthor  stor  edna  siewaoeodm  or v sstvr  esBa  kbvted,t rdsme,aoreseoouoto
o l euetiuruewFa  kbn e hnystoivateweh uresyf tulsa rehr  rdm  or rnea lurpr
.a war hsrer holsa rodvted,t  nenh hneireseoouot.tniesiewaoeivatewt sstvr  esn


[PATCH v2 00/01] KVM: x86: never specify a sample period for virtualized in_tx_cp counters

2017-01-31 Thread Robert O'Callahan
Changes in version 2:
- Fixed block comment formatting as requested by Ingo Molnar
- Changed commit summary to refer to a testcase that more easily shows the
problem.

For reference here's a complete testcase for an Intel TSX-enabled KVM guest:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>
int main(void) {
    struct perf_event_attr attr;
    int fd;
    long long count;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof(attr);
    attr.config = 0x2005101c4; // conditional branches retired IN_TXCP
    attr.sample_period = 0;
    attr.exclude_kernel = 1;
    attr.exclude_guest = 1;
    fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (int i = 0; i < 500; ++i) {
        putchar('.');
    }
    read(fd, &count, sizeof(count));
    printf("\nConditional branches: %lld\n%s\n", count,
           count < 500 ? "FAILED" : "PASSED");
    return 0;
}

Robert O'Callahan (1):
  KVM: x86: never specify a sample period for virtualized in_tx_cp
counters

 arch/x86/kvm/pmu.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

-- 
2.9.3



[PATCH v2 01/01] KVM: x86: never specify a sample period for virtualized in_tx_cp counters

2017-01-31 Thread Robert O'Callahan
pmc_reprogram_counter() always sets a sample period based on the value of
pmc->counter. However, hsw_hw_config() rejects sample periods less than
2^31 - 1. So for example, if a KVM guest does

struct perf_event_attr attr;
memset(&attr, 0, sizeof(attr));
attr.type = PERF_TYPE_RAW;
attr.size = sizeof(attr);
attr.config = 0x2005101c4; // conditional branches retired IN_TXCP
attr.sample_period = 0;
int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

the guest kernel counts some conditional branch events, then updates the
virtual PMU register with a nonzero count. The host reaches
pmc_reprogram_counter() with nonzero pmc->counter, triggers EOPNOTSUPP
in hsw_hw_config(), prints "kvm_pmu: event creation failed" in
pmc_reprogram_counter(), and silently (from the guest's point of view) stops
counting events.

We fix event counting by forcing attr.sample_period to always be zero for
in_tx_cp counters. Sampling doesn't work, but it already didn't work and
can't be fixed without major changes to the approach in hsw_hw_config().

Signed-off-by: Robert O'Callahan <rob...@ocallahan.org>
---
 arch/x86/kvm/pmu.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 06ce377..026db42 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -113,12 +113,19 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
 		.config = config,
 	};
 
+	attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);
+
 	if (in_tx)
 		attr.config |= HSW_IN_TX;
-	if (in_tx_cp)
+	if (in_tx_cp) {
+		/*
+		 * HSW_IN_TX_CHECKPOINTED is not supported with nonzero
+		 * period. Just clear the sample period so at least
+		 * allocating the counter doesn't fail.
+		 */
+		attr.sample_period = 0;
 		attr.config |= HSW_IN_TX_CHECKPOINTED;
-
-	attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);
+	}
 
 	event = perf_event_create_kernel_counter(&attr, -1, current,
 					 intr ? kvm_perf_overflow_intr :
-- 
2.9.3



[PATCH] KVM: x86: never specify a sample period for virtualized in_tx_cp counters

2017-01-30 Thread Robert O'Callahan
pmc_reprogram_counter() always sets a sample period based on the value of
pmc->counter. However, hsw_hw_config() rejects sample periods less than
2^31 - 1. So for example, a KVM guest doing
  perf stat -e r2005101c4 sleep 0
will count some conditional branch events, deschedule the task, reschedule
the task, try to restore the guest PMU state for the task, in the host
reach pmc_reprogram_counter() with nonzero pmc->counter, trigger EOPNOTSUPP
in hsw_hw_config(), print "kvm_pmu: event creation failed" in
pmc_reprogram_counter(), and silently (from the guest's point of view) stop
counting events.

We fix event counting by forcing attr.sample_period to always be zero for
in_tx_cp counters. Sampling doesn't work, but it already didn't work and
can't be fixed without major changes to the approach in hsw_hw_config().

Signed-off-by: Robert O'Callahan <rob...@ocallahan.org>
---
 arch/x86/kvm/pmu.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 06ce377..af993d7 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -113,12 +113,18 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
 		.config = config,
 	};
 
+	attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);
+
 	if (in_tx)
 		attr.config |= HSW_IN_TX;
-	if (in_tx_cp)
+	if (in_tx_cp) {
+		/* HSW_IN_TX_CHECKPOINTED is not supported with nonzero
+		 * period. Just clear the sample period so at least
+		 * allocating the counter doesn't fail.
+		 */
+		attr.sample_period = 0;
 		attr.config |= HSW_IN_TX_CHECKPOINTED;
-
-	attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);
+	}
 
 	event = perf_event_create_kernel_counter(&attr, -1, current,
 					 intr ? kvm_perf_overflow_intr :
-- 
2.9.3



Re: [PATCH] seccomp: Fix tracer exit notifications during fatal signals

2016-08-11 Thread Robert O'Callahan
Thanks!

On Fri, Aug 12, 2016 at 3:12 AM, Oleg Nesterov wrote:
> > The bug happens because when __seccomp_filter() detects
> > fatal_signal_pending(), it calls do_exit() without dequeuing the fatal
> > signal. When do_exit() sends the PTRACE_EVENT_EXIT
>
> I _never_ understood what PTRACE_EVENT_EXIT should actually do. I mean,
> when it should actually stop. This was never defined.
>
> > notification and
> > that task is descheduled, __schedule() notices that there is a fatal
> > signal pending and changes its state from TASK_TRACED to TASK_RUNNING.
>
> And this can happen anyway, with or without this change, with or without
> seccomp. Because another fatal signal can be pending. So PTRACE_EVENT_EXIT
> actually depends on /dev/random.

True. But at least (as Kees alluded to later) this patch ensures
PTRACE_EVENT_EXIT delivery when exit is due to exit_group() and no
genuine fatal signals are involved.

> Perhaps we should finally define what it should do. Say, it should only
> stop if SIGKILL was sent "implicitly" by exit/exec. But as for exec,
> there are more (off-topic) complications, not sure we actually want this...

The ptrace man page currently says:
> A SIGKILL signal may still cause a PTRACE_EVENT_EXIT stop before actual 
> signal death.  This may be changed in the future; SIGKILL is meant to always 
> immediately kill tasks even under ptrace.  Last confirmed on Linux 3.13.

In practice, a PTRACE_EVENT_EXIT is almost always observed after
SIGKILL. That's nice for rr, because it lets us observe the process's
final state. But it allows a process to stay alive indefinitely after
receiving SIGKILL, so I can see why you might want to change it.

Rob
-- 
lbir ye,ea yer.tnietoehr  rdn rdsme,anea lurpr  edna e hnysnenh hhe uresyf toD
selthor  stor  edna  siewaoeodm  or v sstvr  esBa  kbvted,t rdsme,aoreseoouoto
o l euetiuruewFa  kbn e hnystoivateweh uresyf tulsa rehr  rdm  or rnea lurpr
.a war hsrer holsa rodvted,t  nenh hneireseoouot.tniesiewaoeivatewt sstvr  esn


Re: "run seccomp after ptrace" changes expose "missing PTRACE_EVENT_EXIT" bug

2016-08-03 Thread Robert O'Callahan
I work on rr (http://rr-project.org/), a record-and-replay
reverse-execution debugger which is a heavy user of ptrace and
seccomp. The recent change to perform syscall-entry PTRACE_SYSCALL
stops before PTRACE_EVENT_SECCOMP stops broke rr, which is fine
because I'm fixing rr and this change actually makes rr faster
(thanks!). However, it exposed an existing kernel bug which creates a
problem for us, and which I'm not sure how to fix.

The problem is that if a tracee task is in a PTRACE_EVENT_SECCOMP
trap, or has been resumed after such a trap but not yet been
scheduled, and another task in the thread-group calls exit_group(),
then the tracee task exits without the ptracer receiving a
PTRACE_EVENT_EXIT notification. Small-ish testcase here:
https://gist.github.com/rocallahan/1344f7d01183c233d08a2c6b93413068.

The bug happens because when __seccomp_filter() detects
fatal_signal_pending(), it calls do_exit() without dequeuing the fatal
signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and
that task is descheduled, __schedule() notices that there is a fatal
signal pending and changes its state from TASK_TRACED to TASK_RUNNING.
That prevents the ptracer's waitpid() from returning the ptrace event.
A more detailed analysis is here:
https://github.com/mozilla/rr/issues/1762#issuecomment-237396255.

This bug has been in the kernel for a while. rr never hit it before
because we trace all threads and mostly run only one tracee thread at
a time. Immediately after each PTRACE_EVENT_SECCOMP notification we'd
issue a PTRACE_SYSCALL to get that task to the syscall-entry
PTRACE_SYSCALL stop, so there was never an opportunity for one tracee
thread to call exit_group while another tracee was in the problematic
part of __seccomp_filter(). Unfortunately now there is no way for us
to avoid that possibility.

My guess is that __seccomp_filter() should dequeue the fatal signal it
detects before calling do_exit(), to behave more like get_signal(). Is
that correct, and if so, what would be the right way to do that?

Thanks,
Robert O'Callahan
-- 
lbir ye,ea yer.tnietoehr  rdn rdsme,anea lurpr  edna e hnysnenh hhe uresyf toD
selthor  stor  edna  siewaoeodm  or v sstvr  esBa  kbvted,t rdsme,aoreseoouoto
o l euetiuruewFa  kbn e hnystoivateweh uresyf tulsa rehr  rdm  or rnea lurpr
.a war hsrer holsa rodvted,t  nenh hneireseoouot.tniesiewaoeivatewt sstvr  esn


Re: "run seccomp after ptrace" changes expose "missing PTRACE_EVENT_EXIT" bug

2016-08-03 Thread Robert O'Callahan
I work on rr (http://rr-project.org/), a record-and-replay
reverse-execution debugger which is a heavy user of ptrace and
seccomp. The recent change to perform syscall-entry PTRACE_SYSCALL
stops before PTRACE_EVENT_SECCOMP stops broke rr, which is fine
because I'm fixing rr and this change actually makes rr faster
(thanks!). However, it exposed an existing kernel bug which creates a
problem for us, and which I'm not sure how to fix.

The problem is that if a tracee task is in a PTRACE_EVENT_SECCOMP
trap, or has been resumed after such a trap but not yet been
scheduled, and another task in the thread-group calls exit_group(),
then the tracee task exits without the ptracer receiving a
PTRACE_EVENT_EXIT notification. Small-ish testcase here:
https://gist.github.com/rocallahan/1344f7d01183c233d08a2c6b93413068.

The bug happens because when __seccomp_filter() detects
fatal_signal_pending(), it calls do_exit() without dequeuing the fatal
signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and
that task is descheduled, __schedule() notices that there is a fatal
signal pending and changes its state from TASK_TRACED to TASK_RUNNING.
That prevents the ptracer's waitpid() from returning the ptrace event.
A more detailed analysis is here:
https://github.com/mozilla/rr/issues/1762#issuecomment-237396255.

This bug has been in the kernel for a while. rr never hit it before
because we trace all threads and mostly run only one tracee thread at
a time. Immediately after each PTRACE_EVENT_SECCOMP notification we'd
issue a PTRACE_SYSCALL to get that task to the syscall-entry
PTRACE_SYSCALL stop, so there was never an opportunity for one tracee
thread to call exit_group while another tracee was in the problematic
part of __seccomp_filter(). Unfortunately now there is no way for us
to avoid that possibility.

My guess is that __seccomp_filter() should dequeue the fatal signal it
detects before calling do_exit(), to behave more like get_signal(). Is
that correct, and if so, what would be the right way to do that?

Thanks,
Robert O'Callahan
-- 
lbir ye,ea yer.tnietoehr  rdn rdsme,anea lurpr  edna e hnysnenh hhe uresyf toD
selthor  stor  edna  siewaoeodm  or v sstvr  esBa  kbvted,t rdsme,aoreseoouoto
o l euetiuruewFa  kbn e hnystoivateweh uresyf tulsa rehr  rdm  or rnea lurpr
.a war hsrer holsa rodvted,t  nenh hneireseoouot.tniesiewaoeivatewt sstvr  esn