Re: Candidate Linux ABI for Intel AMX and hypothetical new related features
For the record, the benefits of dynamic XCR0 for rr recording portability still apply. I guess it'd be useful for CRIU too. We would also benefit from anything that incentivizes increased support for CPUID faulting. Rob -- Su ot deraeppa sah dna Rehtaf eht htiw saw hcihw, efil lanrete eht uoy ot mialcorp ew dna, ti ot yfitset dna ti nees evah ew; deraeppa efil eht. Efil fo Drow eht gninrecnoc mialcorp ew siht - dehcuot evah sdnah ruo dna ta dekool evah ew hcihw, seye ruo htiw nees evah ew hcihw, draeh evah ew hcihw, gninnigeb eht morf saw hcihw taht.
Re: [REGRESSION] x86/entry: TIF_SINGLESTEP handling is still broken
On Mon, Feb 1, 2021 at 12:40 PM Andy Lutomirski wrote: > I admit that PTRACE_SINGLESTEP seems like an odd way to spell "advance > to the end of the syscall", but you're right, it should work. We don't know of any better way to advance to the end of the syscall without executing any userspace instructions. We could set a breakpoint at the syscall return address but weird edge cases complicate that. Rob -- "He was pierced for our transgressions, he was crushed for our iniquities; the punishment that brought us peace was upon him, and by his wounds we are healed. We all, like sheep, have gone astray, each of us has turned to his own way; and the LORD has laid on him the iniquity of us all." [Isaiah 53:5-6]
Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas
On Fri, Jun 26, 2020 at 11:49 AM Gabriel Krisman Bertazi wrote:
> We couldn't patch Windows code because of the aforementioned DRM and
> anti-cheat mechanisms, but I suppose this limitation doesn't apply to
> Wine/native code, and if this assumption is correct, this approach could
> work.
>
> One complexity might be the consistent model for the syscall live
> patching. I don't know how much of the problem is diminished from the
> original userspace live-patching problem, but I believe at least part of
> it applies. And fencing every thread to patch would kill performance.
> Also, we cannot just patch everything at the beginning. How does rr
> handle that?

That's a good point. rr only allows one tracee thread to run at a time for other reasons, so when we consider patching a syscall instruction, we inspect all thread states to see if the patch would interfere with any other thread, and skip patching it in that unlikely case. (We'll try to patch it again next time that instruction is executed.) Wine would need some other solution, but indeed that could be a showstopper.

> Another problem is that we will want to support i386 and other
> architectures. For int 0x80, it is trickier to encode a branch to
> another region, given the limited instruction space, and the patching
> might not be possible in hot paths.

This is no worse than for x86-64 `syscall`, which is also two bytes. We have pretty much always been able to patch the frequently executed syscalls by replacing both the syscall instruction and an instruction before or after the syscall with a five-byte jump, and folding the extra instruction into the stub.

> I did port libsyscall-intercept to
> x86-32 once and I could correctly patch glibc, but it's not guaranteed
> that an updated libc or something else won't break it.

We haven't found this to be much of a problem in rr. From time to time we have to add new patch patterns.
The good news is that if things change so a syscall can't be patched, the result is degraded performance, not functional breakage.

> I'm not sure the benefit of not needing enhanced kernel support
> justifies the complexity and performance cost required to make this work
> reliably, in particular since the semantics for a kernel implementation
> that we are discussing doesn't seem overly intrusive and might have
> other applications like in the generic filter Andy mentioned.

That's fair --- our solution is complex. (But worth it --- for us, it's valuable that rr works on quite old Linux kernels.) As for performance, it performs well for us. I think we'd prefer our current approach to Andy's hypothetical PR_SET_SYSCALL_THUNK because the latter would have higher overhead (two trips into the kernel per syscall). We definitely don't want to require CAP_SYS_ADMIN, so that rules out any eBPF-based alternative too.

I would love to see a low-overhead unprivileged syscall interception mechanism that would obsolete our patching approach --- preferably one that's also stackable, so rr can record and replay processes that use the new mechanism --- but I don't think any of the proposals here are that yet, unfortunately.

Rob
Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas
rr (https://rr-project.org, https://arxiv.org/abs/1705.05937) grapples with a similar problem: we need to intercept commonly executed system calls and wrap them with our own processing, with minimal overhead. I think our basic approach might work for Wine without kernel changes.

We use SECCOMP_SET_MODE_FILTER with a simple filter that returns SECCOMP_RET_TRAP on all syscalls except those issued from a single specific trampoline page (which get SECCOMP_RET_ALLOW). rr ptraces its children, so when user-space makes a syscall, the seccomp filter triggers a ptrace trap. The ptracer looks at the code around the syscall and, if it matches certain common patterns, patches the code with a jump to a stub that does extra work and issues a real syscall via the trampoline. Thus each library syscall instruction is slow the first time and fast every subsequent time. "Weird" syscalls that the ptracer chooses not to patch incur the context-switch penalty every time, so their overhead does increase a lot ... but it sounds like that might be OK in Wine's case?

A more efficient variant of this approach, which would work in some cases (but maybe not Wine's), would be to avoid using a ptracer and give the process a SIGSYS handler which does the patching.

Rob
Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation
On Fri, May 3, 2019 at 3:20 AM Ingo Molnar wrote:
> So what might work better is if we defined a Rust dialect that used C
> syntax. I.e. the end result would be something like the 'c2rust' or
> 'citrus' projects, where code like this would be directly translatable to
> Rust:
>
>   void gz_compress(FILE *in, gzFile out)
>   {
>       char buf[BUFLEN];
>       int len;
>       int err;
>
>       for (;;) {
>           len = fread(buf, 1, sizeof(buf), in);
>           if (ferror(in)) {
>               perror("fread");
>               exit(1);
>           }
>           if (len == 0)
>               break;
>           if (gzwrite(out, buf, (unsigned)len) != len)
>               error(gzerror(out, &err));
>       }
>       fclose(in);
>
>       if (gzclose(out) != Z_OK)
>           error("failed gzclose");
>   }
>
>   #[no_mangle]
>   pub unsafe extern "C" fn gz_compress(mut in_: *mut FILE, mut out: gzFile) {
>       let mut buf: [i8; 16384];
>       let mut len;
>       let mut err;
>       loop {
>           len = fread(buf, 1, std::mem::size_of_val(&buf), in_);
>           if ferror(in_) != 0 { perror("fread"); exit(1); }
>           if len == 0 { break; }
>           if gzwrite(out, buf, len as c_uint) != len {
>               error(gzerror(out, &mut err));
>           };
>       }
>       fclose(in_);
>       if gzclose(out) != Z_OK { error("failed gzclose"); };
>   }
>
> Example taken from:
>
>   https://gitlab.com/citrus-rs/citrus
>
> Does this make sense?

Are you saying you want a tool like c2rust/citrus that translates some new "looks like C, but really Rust" language into actual Rust at build time? I guess that might work, but I suspect your "looks like C" language isn't going to end up being much like C (e.g. it's going to need Rust-style enums-with-fields, Rust polymorphism, Rust traits, and Rust lifetimes), so it may not be beneficial, because you've just created a new language no-one knows, and that has some real downsides.

If you're inspired by the dream of transitioning to safer languages, then I think the first practical step would be to identify some part of the kernel where the payoff of converting code would be highest.
This is probably something small, relatively isolated, that's not well tested, generally suspicious, but still in use. Then do an experiment, converting it to Rust (or something else) using off-the-shelf tools and manual labor, and see where the pain points are and what benefits accrue, if any. (Work like https://github.com/tsgates/rust.ko might be a helpful starting point.) Then you'd have some data to start thinking about how to reduce the costs, increase the benefits, and sell it to the kernel community. If you reached out to the Rust community you might find some volunteers to help with this.

Rob
Re: [RFC PATCH 2/7] x86/sci: add core implementation for system call isolation
On Sat, Apr 27, 2019 at 10:46 PM Ingo Molnar wrote:
> - A C language runtime that is a subset of current C syntax and
>   semantics used in the kernel, and which doesn't allow access outside
>   of existing objects and thus creates a strictly enforced separation
>   between memory used for data, and memory used for code and control
>   flow.
>
> - This would involve, at minimum:
>
>   - tracking every type and object and its inherent length and valid
>     access patterns, and never losing track of its type.
>
>   - being a lot more organized about initialization, i.e. no
>     uninitialized variables/fields.
>
>   - being a lot more strict about type conversions and pointers in
>     general.
>
>   - ... and a metric ton of other details.

Several research groups have tried to do this, and it is very difficult. In particular this was almost exactly the goal of CCured [1]. Much more recently, there's Microsoft's Checked C [2] [3], which is less ambitious. Check the references of the latter for lots of relevant work. If anyone really pursues this they should talk directly to researchers who've worked on it, e.g. George Necula; you need to know what *didn't* work well, which is hard to glean from papers. (Academic publishing is broken that way.)

One problem with adopting "safe C" or Rust in the kernel is that most of your security mitigations (e.g. KASLR, CFI, other randomizations) probably need to remain in place as long as there is a significant amount of C in the kernel, which means the benefits of eliminating them will be realized very far in the future, if ever, and that makes the whole exercise harder to justify. Having said that, I think there's a good case to be made for writing kernel code in Rust, e.g. sketchy drivers. The classes of bugs prevented in Rust are significantly broader than with your usual safe-C dialect (e.g. data races).
[1] https://web.eecs.umich.edu/~weimerw/p/p477-necula.pdf
[2] https://www.microsoft.com/en-us/research/uploads/prod/2019/05/checkedc-post2019.pdf
[3] https://github.com/Microsoft/checkedc

Rob
Re: Regression in 32-bit-compat TIOCGPTPEER ioctl due to 311fc65c9fb9c966bca8e6f3ff8132ce57344ab9
On Tue, Jan 15, 2019 at 8:25 PM Eric W. Biederman wrote:
> Can you confirm this fixes it for you?

Sorry --- I made a mistake. It looks like this is already fixed on master, by commit e21120383f2dce32312f63ffca145ff8a87d41f5 (though I don't think Al Viro knew he was fixing this bug). So, nothing to do here.

Thanks,
Rob
Regression in 32-bit-compat TIOCGPTPEER ioctl due to 311fc65c9fb9c966bca8e6f3ff8132ce57344ab9
This commit refactored the implementation of TIOCGPTPEER, moving "case TIOCGPTPEER" from pty_unix98_ioctl() to tty_ioctl(). pty_unix98_ioctl() is called by pty_unix98_compat_ioctl(), so before the commit, TIOCGPTPEER worked for 32-bit userspace. Unfortunately tty_compat_ioctl() does not call tty_ioctl(), so after the commit, TIOCGPTPEER from 32-bit userspace fails with ENOTTY. Testcase in https://bugzilla.kernel.org/show_bug.cgi?id=202271. I found this bug running the rr test suite.

Rob
Re: [PATCH v1 0/2] perf: Drop leaked kernel samples
On Fri, Jun 15, 2018 at 10:16 AM, Kyle Huey wrote:
>
> If you want a sysctl for your own reasons that's fine. But we don't
> want a sysctl. We want to work without any further configuration.

Also, toggling a sysctl would require root privileges, but rr does not currently need to run as root. Thus rr users would have to either permanently change their system configuration (and every extra configuration step is a pain) or run rr as root so rr can toggle the sysctl itself. Running rr as root is highly undesirable.

Rob
regression in 32-bit-compat dev_ioctl due to commit bf4405737f9f85a06db2b0ce5d76a818b61992e2
The commit says:

    Once upon a time net/socket.c:dev_ifsioc() used to handle SIOCSHWTSTAMP
    and SIOCSIFMAP. These have different native and compat layout, so the
    format conversion had been needed. In 2009 these two cases had been
    taken out, turning the rest into a convoluted way to calling
    sock_do_ioctl(). We copy compat structure into native one, call
    sock_do_ioctl() on that and copy the result back for the in/out ioctls.
    No layout transformation anywhere, so we might as well just call
    sock_do_ioctl() and skip all the headache with copying.

However, there is one problem: 32-bit 'struct ifreq' and 64-bit 'struct ifreq' are not the same size. The former is 32 bytes and the latter is 40 bytes. Thus, if you place a 32-bit 'struct ifreq' immediately before an unmapped page and try to pass it to one of these ioctls, the syscall fails with EFAULT due to this commit. More details, including a test program, in https://bugzilla.kernel.org/show_bug.cgi?id=199469. I found this bug running the rr test suite.

Thanks,
Rob
Re: [RFC][PATCH] exec: Don't wait for ptraced threads to be reaped.
Sorry about replying to this old thread, but...

On Mon, Apr 3, 2017 at 9:07 AM, Eric W. Biederman wrote:
> I don't know who actually useses PTRACE_O_TRACEEXIT so I don't actually
> know what the implications of changing it are. Let's see...
>
> gdb - no
> upstart - no
> lldb - yes
> strace - no

For the record, rr uses PTRACE_O_TRACEEXIT. When a thread exits we need to examine its address space to find its robust futex list and record the changes that will be performed by the kernel as it cleans up that list. So, any failure to deliver PTRACE_EVENT_EXIT is potentially problematic for us, because we won't get a chance to examine the address space before it disappears.

Rob
--
lbir ye,ea yer.tnietoehr rdn rdsme,anea lurpr edna e hnysnenh hhe uresyf toD selthor stor edna siewaoeodm or v sstvr esBa kbvted,t rdsme,aoreseoouoto o l euetiuruewFa kbn e hnystoivateweh uresyf tulsa rehr rdm or rnea lurpr .a war hsrer holsa rodvted,t nenh hneireseoouot.tniesiewaoeivatewt sstvr esn
Re: [PATCH] perf/core: generate overflow signal when samples are dropped (WAS: Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region)
On Tue, Jul 4, 2017 at 3:21 AM, Mark Rutland wrote:
> Should any of those be moved into the "should be dropped" pile?

Why not be conservative and clear every sample you're not sure about?

We'd appreciate a fix sooner rather than later here, since rr is currently broken on every stable Linux kernel and our attempts to implement a workaround have failed. (We have separate "interrupt" and "measure" counters, and I thought we might work around this regression by programming the "interrupt" counter to count kernel events as well as user events (interrupting early is OK), but that caused our (completely separate) "measure" counter to report off-by-one results (!), which seems to be a different bug present on a range of older kernels.)

Thanks,
Rob
Re: [REGRESSION] perf/core: PMU interrupts dropped if we entered the kernel in the "skid" region
On Wed, Jun 28, 2017 at 10:19 AM, Mark Rutland wrote:
> It seems odd that an event without any samples to take has a sample
> period. I'm surprised that there's not *some* sample_type set.

Yes, it is a bit odd :-). Rather than trying to gather some limited set of information about the target process, we want to interrupt its execution after a particular number of events have occurred.

The bigger picture here is that we're replaying a previously recorded process execution. We want the process to run until it reaches a particular state that occurred during recording; that state is identified by a value of the performance counter (retired conditional branches) plus the values of the general-purpose registers. So our first step is to run the process until the performance counter reaches a value less than but "close to" the target value. (We know the process will be interrupted after a number of extra events have occurred, so we set the sample period to a value less than the target by a heuristic "maximum skid size".)

Rob
Re: Yes, people use FOLL_FORCE ;)
On Tue, May 30, 2017 at 11:08 AM, Keno Fischer <k...@juliacomputing.com> wrote: > Now, while we're probably fine with using the fallbacks, I know there's > others that rely on this behavior as well (cc'ing Robert O'Callahan of the > rr project for which this change will result in significant performance > degradation). Yeah, this breaks rr. We write to readonly code and data mappings via /proc/.../mem in lots of different situations, particularly when we're adjusting program state during replay to match the recorded execution. Like Julia, we can add workarounds, but they could be expensive. For small writes we can fall back to PTRACE_POKEDATA without much performance loss, but for big writes it would be cheaper to sock-puppet the ptracee to call mprotect to temporarily make pages writeable. Of course that takes at least four context switches, so the lower-bound overhead of that approach is pretty bad. OTOH there are probably tricks we could pull to mitigate the overhead some more. For example during replay we might be able to make some pages that "should" be read-only actually be read-write when we know the recorded process didn't try writing to them. So in summary: rr can be upgraded to cope with this, incurring some unknown amount of additional overhead, but existing rr installs will definitely be totally broken. Thanks, Rob
Re: [PATCH v2 00/01] KVM: x86: never specify a sample period for virtualized in_tx_cp counters
Great, thanks!

Rob
Re: [PATCH v2 00/01] KVM: x86: never specify a sample period for virtualized in_tx_cp counters
Ping? Is there something else I need to do to get someone to look at this?

Thanks,
Rob
[PATCH v2 00/01] KVM: x86: never specify a sample period for virtualized in_tx_cp counters
Changes in version 2:
- Fixed block comment formatting as requested by Ingo Molnar
- Changed commit summary to refer to a testcase that more easily shows the problem.

For reference here's a complete testcase for an Intel TSX-enabled KVM guest:

#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
    struct perf_event_attr attr;
    int fd;
    long long count;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof(attr);
    attr.config = 0x2005101c4; // conditional branches retired IN_TXCP
    attr.sample_period = 0;
    attr.exclude_kernel = 1;
    attr.exclude_guest = 1;
    fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (int i = 0; i < 500; ++i) {
        putchar('.');
    }
    read(fd, &count, sizeof(count));
    printf("\nConditional branches: %lld\n%s\n",
           count, count < 500 ? "FAILED" : "PASSED");
    return 0;
}

Robert O'Callahan (1):
  KVM: x86: never specify a sample period for virtualized in_tx_cp counters

 arch/x86/kvm/pmu.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

--
2.9.3
[PATCH v2 01/01] KVM: x86: never specify a sample period for virtualized in_tx_cp counters
pmc_reprogram_counter() always sets a sample period based on the value
of pmc->counter. However, hsw_hw_config() rejects sample periods less
than 2^31 - 1. So for example, if a KVM guest does

	struct perf_event_attr attr;
	memset(&attr, 0, sizeof(attr));
	attr.type = PERF_TYPE_RAW;
	attr.size = sizeof(attr);
	attr.config = 0x2005101c4; // conditional branches retired IN_TXCP
	attr.sample_period = 0;
	int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
	ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
	ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

the guest kernel counts some conditional branch events, then updates
the virtual PMU register with a nonzero count. The host reaches
pmc_reprogram_counter() with nonzero pmc->counter, triggers EOPNOTSUPP
in hsw_hw_config(), prints "kvm_pmu: event creation failed" in
pmc_reprogram_counter(), and silently (from the guest's point of view)
stops counting events.

We fix event counting by forcing attr.sample_period to always be zero
for in_tx_cp counters. Sampling doesn't work, but it already didn't
work and can't be fixed without major changes to the approach in
hsw_hw_config().

Signed-off-by: Robert O'Callahan <rob...@ocallahan.org>
---
 arch/x86/kvm/pmu.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 06ce377..026db42 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -113,12 +113,19 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
 		.config = config,
 	};
 
+	attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);
+
 	if (in_tx)
 		attr.config |= HSW_IN_TX;
-	if (in_tx_cp)
+	if (in_tx_cp) {
+		/*
+		 * HSW_IN_TX_CHECKPOINTED is not supported with nonzero
+		 * period. Just clear the sample period so at least
+		 * allocating the counter doesn't fail.
+		 */
+		attr.sample_period = 0;
 		attr.config |= HSW_IN_TX_CHECKPOINTED;
-
-	attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);
+	}
 
 	event = perf_event_create_kernel_counter(&attr, -1, current,
 						 intr ? kvm_perf_overflow_intr :
--
2.9.3
[PATCH] KVM: x86: never specify a sample period for virtualized in_tx_cp counters
pmc_reprogram_counter() always sets a sample period based on the value
of pmc->counter. However, hsw_hw_config() rejects sample periods less
than 2^31 - 1. So for example, a KVM guest doing

	perf stat -e r2005101c4 sleep 0

will count some conditional branch events, deschedule the task,
reschedule the task, try to restore the guest PMU state for the task,
in the host reach pmc_reprogram_counter() with nonzero pmc->counter,
trigger EOPNOTSUPP in hsw_hw_config(), print "kvm_pmu: event creation
failed" in pmc_reprogram_counter(), and silently (from the guest's
point of view) stop counting events.

We fix event counting by forcing attr.sample_period to always be zero
for in_tx_cp counters. Sampling doesn't work, but it already didn't
work and can't be fixed without major changes to the approach in
hsw_hw_config().

Signed-off-by: Robert O'Callahan <rob...@ocallahan.org>
---
 arch/x86/kvm/pmu.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/pmu.c b/arch/x86/kvm/pmu.c
index 06ce377..af993d7 100644
--- a/arch/x86/kvm/pmu.c
+++ b/arch/x86/kvm/pmu.c
@@ -113,12 +113,18 @@ static void pmc_reprogram_counter(struct kvm_pmc *pmc, u32 type,
 		.config = config,
 	};
 
+	attr.sample_period = (-pmc->counter) & pmc_bitmask(pmc);
+
 	if (in_tx)
 		attr.config |= HSW_IN_TX;
-	if (in_tx_cp)
+	if (in_tx_cp) {
+		/* HSW_IN_TX_CHECKPOINTED is not supported with nonzero
+		 * period. Just clear the sample period so at least
+		 * allocating the counter doesn't fail.
+		 */
+		attr.sample_period = 0;
 		attr.config |= HSW_IN_TX_CHECKPOINTED;
-	}
 
 	event = perf_event_create_kernel_counter(&attr, -1, current,
 						 intr ? kvm_perf_overflow_intr :
--
2.9.3
Re: [PATCH] seccomp: Fix tracer exit notifications during fatal signals
Thanks!

On Fri, Aug 12, 2016 at 3:12 AM, Oleg Nesterov wrote:
> > The bug happens because when __seccomp_filter() detects
> > fatal_signal_pending(), it calls do_exit() without dequeuing the fatal
> > signal. When do_exit() sends the PTRACE_EVENT_EXIT
>
> I _never_ understood what PTRACE_EVENT_EXIT should actually do. I mean,
> when it should actually stop. This was never defined.
>
> > notification and
> > that task is descheduled, __schedule() notices that there is a fatal
> > signal pending and changes its state from TASK_TRACED to TASK_RUNNING.
>
> And this can happen anyway, with or without this change, with or without
> seccomp. Because another fatal signal can be pending. So PTRACE_EVENT_EXIT
> actually depends on /dev/random.

True. But at least (as Kees alluded to later) this patch ensures
PTRACE_EVENT_EXIT delivery when exit is due to exit_group() and no
genuine fatal signals are involved.

> Perhaps we should finally define what it should do. Say, it should only
> stop if SIGKILL was sent "implicitely" by exit/exec. But as for exec,
> there are more (off-topic) complications, not sure we actually want this...

The ptrace man page currently says:

> A SIGKILL signal may still cause a PTRACE_EVENT_EXIT stop before actual
> signal death. This may be changed in the future; SIGKILL is meant to always
> immediately kill tasks even under ptrace. Last confirmed on Linux 3.13.

In practice, a PTRACE_EVENT_EXIT is almost always observed after
SIGKILL. That's nice for rr, because it lets us observe the process's
final state. But it allows a process to stay alive indefinitely after
receiving SIGKILL, so I can see why you might want to change it.

Rob
--
lbir ye,ea yer.tnietoehr rdn rdsme,anea lurpr edna e hnysnenh hhe uresyf toD selthor stor edna siewaoeodm or v sstvr esBa kbvted,t rdsme,aoreseoouoto o l euetiuruewFa kbn e hnystoivateweh uresyf tulsa rehr rdm or rnea lurpr .a war hsrer holsa rodvted,t nenh hneireseoouot.tniesiewaoeivatewt sstvr esn
Re: "run seccomp after ptrace" changes expose "missing PTRACE_EVENT_EXIT" bug
I work on rr (http://rr-project.org/), a record-and-replay
reverse-execution debugger which is a heavy user of ptrace and seccomp.
The recent change to perform syscall-entry PTRACE_SYSCALL stops before
PTRACE_EVENT_SECCOMP stops broke rr, which is fine because I'm fixing
rr and this change actually makes rr faster (thanks!). However, it
exposed an existing kernel bug which creates a problem for us, and
which I'm not sure how to fix.

The problem is that if a tracee task is in a PTRACE_EVENT_SECCOMP trap,
or has been resumed after such a trap but not yet been scheduled, and
another task in the thread-group calls exit_group(), then the tracee
task exits without the ptracer receiving a PTRACE_EVENT_EXIT
notification. Small-ish testcase here:
https://gist.github.com/rocallahan/1344f7d01183c233d08a2c6b93413068

The bug happens because when __seccomp_filter() detects
fatal_signal_pending(), it calls do_exit() without dequeuing the fatal
signal. When do_exit() sends the PTRACE_EVENT_EXIT notification and
that task is descheduled, __schedule() notices that there is a fatal
signal pending and changes its state from TASK_TRACED to TASK_RUNNING.
That prevents the ptracer's waitpid() from returning the ptrace event.
A more detailed analysis is here:
https://github.com/mozilla/rr/issues/1762#issuecomment-237396255

This bug has been in the kernel for a while. rr never hit it before
because we trace all threads and mostly run only one tracee thread at
a time. Immediately after each PTRACE_EVENT_SECCOMP notification we'd
issue a PTRACE_SYSCALL to get that task to the syscall-entry
PTRACE_SYSCALL stop, so there was never an opportunity for one tracee
thread to call exit_group() while another tracee was in the
problematic part of __seccomp_filter(). Unfortunately now there is no
way for us to avoid that possibility.

My guess is that __seccomp_filter() should dequeue the fatal signal it
detects before calling do_exit(), to behave more like get_signal().
Is that correct, and if so, what would be the right way to do that?

Thanks,
Robert O'Callahan
--
lbir ye,ea yer.tnietoehr rdn rdsme,anea lurpr edna e hnysnenh hhe uresyf toD selthor stor edna siewaoeodm or v sstvr esBa kbvted,t rdsme,aoreseoouoto o l euetiuruewFa kbn e hnystoivateweh uresyf tulsa rehr rdm or rnea lurpr .a war hsrer holsa rodvted,t nenh hneireseoouot.tniesiewaoeivatewt sstvr esn