Re: [RFC PATCH] ptrace: make ptrace() fail if the tracee changed its pid unexpectedly
On 12/17/20 11:39 PM, Eric W. Biederman wrote: >> resume the old leader L, it resumes the post-exec thread T which was >> actually now stopped in PTHREAD_EVENT_EXEC. In this case the >> PTHREAD_EVENT_EXEC event is lost, and the tracer can't know that the >> tracee changed its pid. > > The change seems sensible. I don't expect this is common but it looks > painful to deal with if it happens. Yeah, the debug session becomes completely messed up, as the ptracer has no idea the process is running a new image, breakpoints were wiped out, and the post-exec process is resumed without the ptracer having had a chance to install new breakpoints. I don't see any way to deal with it without kernel help. > > Acked-by: "Eric W. Biederman" > > I am wondering if this should be expanded to all ptrace types for > consistency. Or maybe we should set a flag to make this happen for > all ptrace events. > > It just seems really odd to only worry about missing this event. > I admit this a threaded PTRACE_EVENT_EXEC is the only event we are > likely to miss but still. It's more about the tid stealing than the event itself. I mean, we lose the event because the tid changes magically without warning. An exec is the only scenario where this happens. > > Do you by any chance have any debugger/strace test cases? > > I would think that would be the way to test to see if this breaks > anything. I think I remember strace having a good test suite. I ran the GDB testsuite against this, no regressions showed up. BTW, the problem was discovered by Simon Marchi when he tried to write a GDB testcase for a multi-threaded exec scenario: https://sourceware.org/bugzilla/show_bug.cgi?id=26754 I've went through GDB's code looking for potential issues with the change and whether it would affect GDBs already in the wild. Tricky corner cases abound, but I think we're good. Feel free to add my ack: Acked-by: Pedro Alves Thanks, Pedro Alves
Re: [PATCH v3 2/3] x86/signal: Rewire the restart_block() syscall to have a constant nr
On 06/21/2016 05:32 PM, Andy Lutomirski wrote: > On Jun 21, 2016 5:40 AM, "Pedro Alves" <pal...@redhat.com> wrote: > I didn't try that particular experiment. But, from that email: > >> After that, GDB can control the stopped inferior. To call function "func1()" >> of inferior, GDB need: Step 1, save current values of registers ($rax >> 0xfe00(64 bits -512) is cut to 0xfe00(32 bits -512) because >> inferior is a 32 bits program). > > That sounds like it may be a gdb bug. Why does gdb truncate the register? Because when debugging a 32-bit program, gdb's register cache only stores 32-bit-wide registers ($eax, $eip, etc., not $rax, etc.) Let me turn this around: Why does the kernel care about the upper 32-bit bits of $orig_rax when the task is in 32-bit mode in the first place? The 32-bit syscall entry points already only care about $eax, not $rax, since $rax doesn't exist in real 32-bit CPUs. Looking at arch/x86/entry/entry_64_compat.S, all three 64-bit x 32-bit syscall entry points zero-extend $eax already, and then push that as pt_regs->orig_ax. So if the kernel is giving significance to the higher 32-bits of orig_ax at 32-bit syscall restart time, that very much looks like a kernel bug to me. > > I haven't played with it recently, but, in my experience, gdb seems to > work quite poorly in mixed-mode situations. For example, if you > attach 64-bit gdb to qemu-system-x86_64's gdbserver, boot a 64-bit > guest, and breakpoint in early 32-bit code, gdb tends to explode > pretty badly. Right, but that's a bit of a red herring, and not entirely gdb's fault. The case you mention happens because qemu does exactly the opposite of what you're suggesting below. It's qemu that changes the remote protocol's register layout (and thus size) in the remote protocol register read/write packets when the kernel changes mode, behind gdb's back, and gdb errors out because obviously it isn't expecting that. All gdb knows is that qemu is now sending bogus register read replies, different from what the target description qemu reported on initial remote connection described. I say "entirely", because gdb has its own share of fault for the remote protocol not including a some kind of standard mechanism to inform gdb of mode changes. However, the usual scenario where the program _doesn't_ change mode during execution, is supported. > > On x86_64, I think gdb should treat CPU state as 64-bit no matter > what. The fact that a 32-bit tracee code segment is in use shouldn't > change much. It's not as clear or easy as you make it sound, unfortunately. For normal userspace programs, the current design across gdb/remote protocol/ptrace/kernel/core dump format/elf formats/ is that what matters is the program's architecture, not whatever the tracer's arch is. Should core dumping dump 64-bit CPU state as well for 32-bit programs? The current core dump format dumps a 32-bit elf with notes that contain 32-bit registers. And I think it'd be a bit odd for a 32-bit program to dump different cores files depending on the bitness of the kernel. Should a gdb connected to a 64-bit gdbserver that is debugging a 32-bit program see different registers compared to a gdb that is connected to a 32-bit gdbserver that is debugging a 32-bit program? Currently, it doesn't. The architecture of gdbserver doesn't matter here, only the tracee's. > Admittedly the kernel doesn't really help. There is some questionable > code involving which regsets to show to ptrace. I don't know what code you're looking at, but I consider this mandatory reading: Roland McGrath on PTRACE_GETREGSET design: https://sourceware.org/ml/archer/2010-q3/msg00193.html "A caveat about those requests for bi-arch systems. Unlike other ptrace requests, these access the native formats of the tracee process, rather than the native formats of the debugger process. So, a 64-bit debugger process using PTRACE_GETREGSET on a 32-bit tracee process will see the 32-bit layouts (i.e. what would appear in an ELF core file if that process dumped one)." gdb currently uses PTRACE_SETREGS for the general registers, which means it currently writes those as 64-bit registers. However, if gdb or any ptracer restores/writes $eax/$orig_eax using PTRACE_SETREGSET, it's only going to pass down a 32-bit value, and again it must be the kernel that sign extends $orig_eax if it wants to interpret it as signed 64-bit internally. But actually looking at your patch 3, I'm confused, because it seems to be doing what I'm suggesting? Thanks, Pedro Alves
Re: [PATCH v3 2/3] x86/signal: Rewire the restart_block() syscall to have a constant nr
On 06/21/2016 05:32 PM, Andy Lutomirski wrote: > On Jun 21, 2016 5:40 AM, "Pedro Alves" wrote: > I didn't try that particular experiment. But, from that email: > >> After that, GDB can control the stopped inferior. To call function "func1()" >> of inferior, GDB need: Step 1, save current values of registers ($rax >> 0xfe00(64 bits -512) is cut to 0xfe00(32 bits -512) because >> inferior is a 32 bits program). > > That sounds like it may be a gdb bug. Why does gdb truncate the register? Because when debugging a 32-bit program, gdb's register cache only stores 32-bit-wide registers ($eax, $eip, etc., not $rax, etc.) Let me turn this around: Why does the kernel care about the upper 32-bit bits of $orig_rax when the task is in 32-bit mode in the first place? The 32-bit syscall entry points already only care about $eax, not $rax, since $rax doesn't exist in real 32-bit CPUs. Looking at arch/x86/entry/entry_64_compat.S, all three 64-bit x 32-bit syscall entry points zero-extend $eax already, and then push that as pt_regs->orig_ax. So if the kernel is giving significance to the higher 32-bits of orig_ax at 32-bit syscall restart time, that very much looks like a kernel bug to me. > > I haven't played with it recently, but, in my experience, gdb seems to > work quite poorly in mixed-mode situations. For example, if you > attach 64-bit gdb to qemu-system-x86_64's gdbserver, boot a 64-bit > guest, and breakpoint in early 32-bit code, gdb tends to explode > pretty badly. Right, but that's a bit of a red herring, and not entirely gdb's fault. The case you mention happens because qemu does exactly the opposite of what you're suggesting below. It's qemu that changes the remote protocol's register layout (and thus size) in the remote protocol register read/write packets when the kernel changes mode, behind gdb's back, and gdb errors out because obviously it isn't expecting that. All gdb knows is that qemu is now sending bogus register read replies, different from what the target description qemu reported on initial remote connection described. I say "entirely", because gdb has its own share of fault for the remote protocol not including a some kind of standard mechanism to inform gdb of mode changes. However, the usual scenario where the program _doesn't_ change mode during execution, is supported. > > On x86_64, I think gdb should treat CPU state as 64-bit no matter > what. The fact that a 32-bit tracee code segment is in use shouldn't > change much. It's not as clear or easy as you make it sound, unfortunately. For normal userspace programs, the current design across gdb/remote protocol/ptrace/kernel/core dump format/elf formats/ is that what matters is the program's architecture, not whatever the tracer's arch is. Should core dumping dump 64-bit CPU state as well for 32-bit programs? The current core dump format dumps a 32-bit elf with notes that contain 32-bit registers. And I think it'd be a bit odd for a 32-bit program to dump different cores files depending on the bitness of the kernel. Should a gdb connected to a 64-bit gdbserver that is debugging a 32-bit program see different registers compared to a gdb that is connected to a 32-bit gdbserver that is debugging a 32-bit program? Currently, it doesn't. The architecture of gdbserver doesn't matter here, only the tracee's. > Admittedly the kernel doesn't really help. There is some questionable > code involving which regsets to show to ptrace. I don't know what code you're looking at, but I consider this mandatory reading: Roland McGrath on PTRACE_GETREGSET design: https://sourceware.org/ml/archer/2010-q3/msg00193.html "A caveat about those requests for bi-arch systems. Unlike other ptrace requests, these access the native formats of the tracee process, rather than the native formats of the debugger process. So, a 64-bit debugger process using PTRACE_GETREGSET on a 32-bit tracee process will see the 32-bit layouts (i.e. what would appear in an ELF core file if that process dumped one)." gdb currently uses PTRACE_SETREGS for the general registers, which means it currently writes those as 64-bit registers. However, if gdb or any ptracer restores/writes $eax/$orig_eax using PTRACE_SETREGSET, it's only going to pass down a 32-bit value, and again it must be the kernel that sign extends $orig_eax if it wants to interpret it as signed 64-bit internally. But actually looking at your patch 3, I'm confused, because it seems to be doing what I'm suggesting? Thanks, Pedro Alves
Re: [PATCH v3 2/3] x86/signal: Rewire the restart_block() syscall to have a constant nr
Hi Andy, On 06/21/2016 12:39 AM, Andy Lutomirski wrote: > Suppose a 64-bit task A traces a 32-bit task B. I gave your x86/ptrace branch a try: https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/ptrace (this looks to be the same patch set.) Unfortunately, with gdb git master, I still get the 64-bit ptracer x 32-bit ptracee problem: (gdb) r Starting program: interrupt.32 talk to me baby ^C Program received signal SIGINT, Interrupt. 0xf7fd9d09 in __kernel_vsyscall () (gdb) p func1 () $1 = 4 (gdb) c Continuing. Unknown error 512 [Inferior 1 (process 2198) exited with code 01] (gdb) q Is this expected? This is the same testcase as before: https://sourceware.org/ml/gdb/2014-05/msg4.html Thanks, Pedro Alves
Re: [PATCH v3 2/3] x86/signal: Rewire the restart_block() syscall to have a constant nr
Hi Andy, On 06/21/2016 12:39 AM, Andy Lutomirski wrote: > Suppose a 64-bit task A traces a 32-bit task B. I gave your x86/ptrace branch a try: https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/log/?h=x86/ptrace (this looks to be the same patch set.) Unfortunately, with gdb git master, I still get the 64-bit ptracer x 32-bit ptracee problem: (gdb) r Starting program: interrupt.32 talk to me baby ^C Program received signal SIGINT, Interrupt. 0xf7fd9d09 in __kernel_vsyscall () (gdb) p func1 () $1 = 4 (gdb) c Continuing. Unknown error 512 [Inferior 1 (process 2198) exited with code 01] (gdb) q Is this expected? This is the same testcase as before: https://sourceware.org/ml/gdb/2014-05/msg4.html Thanks, Pedro Alves
Re: [PATCH v2] x86/ptrace: Stop setting TS_COMPAT in ptrace code
On 06/20/2016 08:37 PM, Andy Lutomirski wrote: > On Mon, Jun 20, 2016 at 9:29 AM, Andy Lutomirski <l...@kernel.org> wrote: >> Setting TS_COMPAT in ptrace is wrong: if we happen to do it during >> syscall entry, then we'll confuse seccomp and audit. (The former >> isn't a security problem: seccomp is currently entirely insecure if a >> malicious ptracer is attached.) As a minimal fix, this patch adds a >> new flag TS_I386_REGS_POKED that handles the ptrace special case. >> >> Cc: Pedro Alves <pal...@redhat.com> >> Cc: Oleg Nesterov <o...@redhat.com> >> Cc: Kees Cook <keesc...@chromium.org> >> Signed-off-by: Andy Lutomirski <l...@kernel.org> > > In case you're interested, my draft followup (definitely not for x86/urgent) > is: > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack=50d2f2a9fe1b > > Pedro, this appears to pass ptrace-tests. I need to try the 64-vs-32 > thing, but it's intended to fix it for real. Awesome, thanks much for working on this! Thanks, Pedro Alves
Re: [PATCH v2] x86/ptrace: Stop setting TS_COMPAT in ptrace code
On 06/20/2016 08:37 PM, Andy Lutomirski wrote: > On Mon, Jun 20, 2016 at 9:29 AM, Andy Lutomirski wrote: >> Setting TS_COMPAT in ptrace is wrong: if we happen to do it during >> syscall entry, then we'll confuse seccomp and audit. (The former >> isn't a security problem: seccomp is currently entirely insecure if a >> malicious ptracer is attached.) As a minimal fix, this patch adds a >> new flag TS_I386_REGS_POKED that handles the ptrace special case. >> >> Cc: Pedro Alves >> Cc: Oleg Nesterov >> Cc: Kees Cook >> Signed-off-by: Andy Lutomirski > > In case you're interested, my draft followup (definitely not for x86/urgent) > is: > > https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/vmap_stack=50d2f2a9fe1b > > Pedro, this appears to pass ptrace-tests. I need to try the 64-vs-32 > thing, but it's intended to fix it for real. Awesome, thanks much for working on this! Thanks, Pedro Alves
Re: [PATCH] x86/ptrace: Remove questionable TS_COMPAT usage in ptrace
On 06/19/2016 11:09 PM, Andy Lutomirski wrote: > > The latter bit is a mess and is probably broken on current kernels for > 64-bit gdb attached to a 32-bit process. (Is it? All of this stuff > is a bit of a pain to test.) The testcase at: https://sourceware.org/ml/gdb/2014-05/msg4.html still fails for me on Fedora 23 with git master gdb. Nevermind the misleading URL, that's a kernel patch. $ gcc -g -m32 interrupt.c -o interrupt.32 ... (gdb) r Starting program: /home/pedro/tmp/interrupt.32 talk to me baby ^C Program received signal SIGINT, Interrupt. 0xf7fd9d49 in __kernel_vsyscall () (gdb) p func1() $1 = 4 (gdb) c Continuing. Unknown error 512 [Inferior 1 (process 20252) exited with code 01] (gdb) That was a 64-bit gdb. Note it doesn't fail with fedora 23's gdb, because of a fedora-local workaround. Thanks, Pedro Alves
Re: [PATCH] x86/ptrace: Remove questionable TS_COMPAT usage in ptrace
On 06/19/2016 11:09 PM, Andy Lutomirski wrote: > > The latter bit is a mess and is probably broken on current kernels for > 64-bit gdb attached to a 32-bit process. (Is it? All of this stuff > is a bit of a pain to test.) The testcase at: https://sourceware.org/ml/gdb/2014-05/msg4.html still fails for me on Fedora 23 with git master gdb. Nevermind the misleading URL, that's a kernel patch. $ gcc -g -m32 interrupt.c -o interrupt.32 ... (gdb) r Starting program: /home/pedro/tmp/interrupt.32 talk to me baby ^C Program received signal SIGINT, Interrupt. 0xf7fd9d49 in __kernel_vsyscall () (gdb) p func1() $1 = 4 (gdb) c Continuing. Unknown error 512 [Inferior 1 (process 20252) exited with code 01] (gdb) That was a 64-bit gdb. Note it doesn't fail with fedora 23's gdb, because of a fedora-local workaround. Thanks, Pedro Alves
Re: [PATCH] x86/ptrace: Remove questionable TS_COMPAT usage in ptrace
On 06/18/2016 06:02 PM, Andy Lutomirski wrote: > Yuck. I should have dug in to the history. Why not just > unconditionally sign-extend eax when set by a 32-bit tracer? No idea. > > Do you know how to acquire a copy of erestartsys-trap.c? The old > links appear to be broken. That's part of the ptrace testsuite project, still in cvs, though the url changed: $ https://sourceware.org/systemtap/wiki/utrace/tests $ cvs -d :pserver:anoncvs:anon...@sourceware.org:/cvs/systemtap co ptrace-tests Can't seem to find a cvsweb interface for that. I think it'd be great to move these to the selftests infrastructure directly in the kernel tree. However, nobody's has ever managed to find energy for that. > > Also, while I have your attention: when gdb restores old state like > this, does it do it with individual calls to PTRACE_POKEUSER or does > it use SETREGSET or similar to do it all at once? I'm asking because > I have some other code (fsgsbase) that's on hold until I can figure > out how to keep it from breaking gdb if and when gdb writes to fs and > fs_base. > It depends on which register you're accessing, and on kernel version. But on a recent kernel, it should be using PTRACE_SETREGS / PTRACE_SETREGSET, thus storing a whole register set in one go. (And it's likely we could get rid of PTRACE_POKE fallback paths by now.) To write to the debug registers (dr0-dr7), PTRACE_POKEUSER is always used. There's code that coordinates with glibc's libthread_db.so that ends up _reading_ fs_base/gs_base, and gdb uses PTRACE_PEEKUSER for that, though there's a pending series that changes it, exposing fs_base/gs_base as just another register in gdb's register cache: https://www.sourceware.org/ml/gdb-patches/2015-11/msg00076.html https://www.sourceware.org/ml/gdb-patches/2015-11/msg00077.html Guess that makes fs_base/gs_base user-writable, if the kernel allows it. Thanks, Pedro Alves
Re: [PATCH] x86/ptrace: Remove questionable TS_COMPAT usage in ptrace
On 06/18/2016 06:02 PM, Andy Lutomirski wrote: > Yuck. I should have dug in to the history. Why not just > unconditionally sign-extend eax when set by a 32-bit tracer? No idea. > > Do you know how to acquire a copy of erestartsys-trap.c? The old > links appear to be broken. That's part of the ptrace testsuite project, still in cvs, though the url changed: $ https://sourceware.org/systemtap/wiki/utrace/tests $ cvs -d :pserver:anoncvs:anon...@sourceware.org:/cvs/systemtap co ptrace-tests Can't seem to find a cvsweb interface for that. I think it'd be great to move these to the selftests infrastructure directly in the kernel tree. However, nobody's has ever managed to find energy for that. > > Also, while I have your attention: when gdb restores old state like > this, does it do it with individual calls to PTRACE_POKEUSER or does > it use SETREGSET or similar to do it all at once? I'm asking because > I have some other code (fsgsbase) that's on hold until I can figure > out how to keep it from breaking gdb if and when gdb writes to fs and > fs_base. > It depends on which register you're accessing, and on kernel version. But on a recent kernel, it should be using PTRACE_SETREGS / PTRACE_SETREGSET, thus storing a whole register set in one go. (And it's likely we could get rid of PTRACE_POKE fallback paths by now.) To write to the debug registers (dr0-dr7), PTRACE_POKEUSER is always used. There's code that coordinates with glibc's libthread_db.so that ends up _reading_ fs_base/gs_base, and gdb uses PTRACE_PEEKUSER for that, though there's a pending series that changes it, exposing fs_base/gs_base as just another register in gdb's register cache: https://www.sourceware.org/ml/gdb-patches/2015-11/msg00076.html https://www.sourceware.org/ml/gdb-patches/2015-11/msg00077.html Guess that makes fs_base/gs_base user-writable, if the kernel allows it. Thanks, Pedro Alves
Re: [PATCH] x86/ptrace: Remove questionable TS_COMPAT usage in ptrace
On 06/18/2016 02:55 PM, Pedro Alves wrote: > This hunk being mentioned in this thread a couple years ago too: > > https://www.sourceware.org/ml/gdb/2014-04/msg00095.html > > Please don't break this use case ( and fix the one reported in > that thread :-) ). BTW, there was a follow up v2 patch to that last url, here: [PATCH v2] Fix get ERESTARTSYS with m32 in x86_64 when debug by GDB https://sourceware.org/ml/gdb/2014-05/msg4.html (for some reason, I can only find the ping on the lkml, not the original post, though it was apparently CCed.) Thanks, Pedro Alves
Re: [PATCH] x86/ptrace: Remove questionable TS_COMPAT usage in ptrace
On 06/18/2016 02:55 PM, Pedro Alves wrote: > This hunk being mentioned in this thread a couple years ago too: > > https://www.sourceware.org/ml/gdb/2014-04/msg00095.html > > Please don't break this use case ( and fix the one reported in > that thread :-) ). BTW, there was a follow up v2 patch to that last url, here: [PATCH v2] Fix get ERESTARTSYS with m32 in x86_64 when debug by GDB https://sourceware.org/ml/gdb/2014-05/msg4.html (for some reason, I can only find the ping on the lkml, not the original post, though it was apparently CCed.) Thanks, Pedro Alves
Re: [PATCH] x86/ptrace: Remove questionable TS_COMPAT usage in ptrace
On 06/18/2016 11:21 AM, Andy Lutomirski wrote: > A 32-bit tracer can set a tracee's TS_COMPAT flag by poking orig_ax. > > - If the tracee is stopped in a 32-bit syscall, this has no effect > as TS_COMPAT will already be set. > > - If the tracee is stopped on entry to a 64-bit syscall, this could > cause problems: in_compat_syscall() etc will be out of sync with > the actual syscall table in use. I can imagine this bre aking > audit. (It can't meaningfully break seccomp, as a malicious > tracer can already fully bypass seccomp.) I could also imagine it > subtly confusing the implementations of a few syscalls. > > - If the tracee is stopped in a non-syscall context, then TS_COMPAT >will end up set in user mode, which isn't supposed to happen. The results >are likely to be similar to the 64-bit syscall entry case. > > Fix it by deleting the code. > > Here's my understanding of the previous intent. Suppose a > 32-bit tracee makes a syscall that returns one of the -ERESTART > codes. A 32-bit tracer saves away its register state. The tracee > resumes, returns from the system call, and gets stopped again for a > non-syscall reason (e.g. a signal). Then the tracer tries to roll > back the tracee's state by writing all of the saved registers back. > > The result of this sequence of events will be that the tracee's > registers' low bits match what they were at the end of the syscall > but TS_COMPAT will be clear. This will cause syscall_get_error() to > return a *positive* number, because we zero-extend registers poked > by 32-bit tracers instead of sign-extending them. As a result, with > this change, syscall restart won't happen, whereas, historically, it > would have happened. > > As far as I can tell, this corner case isn't very important, and I I believe it's actually very much very important for gdb, for restoring the inferior state when the user calls a function in the inferior, with: (gdb) print foo() Some background here: http://linux-kernel.vger.kernel.narkive.com/fqrh4xKK/patch-x86-ptrace-sign-extend-eax-with-orig-eax-0 This hunk being mentioned in this thread a couple years ago too: https://www.sourceware.org/ml/gdb/2014-04/msg00095.html Please don't break this use case ( and fix the one reported in that thread :-) ). Thanks, Pedro Alves
Re: [PATCH] x86/ptrace: Remove questionable TS_COMPAT usage in ptrace
On 06/18/2016 11:21 AM, Andy Lutomirski wrote: > A 32-bit tracer can set a tracee's TS_COMPAT flag by poking orig_ax. > > - If the tracee is stopped in a 32-bit syscall, this has no effect > as TS_COMPAT will already be set. > > - If the tracee is stopped on entry to a 64-bit syscall, this could > cause problems: in_compat_syscall() etc will be out of sync with > the actual syscall table in use. I can imagine this bre aking > audit. (It can't meaningfully break seccomp, as a malicious > tracer can already fully bypass seccomp.) I could also imagine it > subtly confusing the implementations of a few syscalls. > > - If the tracee is stopped in a non-syscall context, then TS_COMPAT >will end up set in user mode, which isn't supposed to happen. The results >are likely to be similar to the 64-bit syscall entry case. > > Fix it by deleting the code. > > Here's my understanding of the previous intent. Suppose a > 32-bit tracee makes a syscall that returns one of the -ERESTART > codes. A 32-bit tracer saves away its register state. The tracee > resumes, returns from the system call, and gets stopped again for a > non-syscall reason (e.g. a signal). Then the tracer tries to roll > back the tracee's state by writing all of the saved registers back. > > The result of this sequence of events will be that the tracee's > registers' low bits match what they were at the end of the syscall > but TS_COMPAT will be clear. This will cause syscall_get_error() to > return a *positive* number, because we zero-extend registers poked > by 32-bit tracers instead of sign-extending them. As a result, with > this change, syscall restart won't happen, whereas, historically, it > would have happened. > > As far as I can tell, this corner case isn't very important, and I I believe it's actually very much very important for gdb, for restoring the inferior state when the user calls a function in the inferior, with: (gdb) print foo() Some background here: http://linux-kernel.vger.kernel.narkive.com/fqrh4xKK/patch-x86-ptrace-sign-extend-eax-with-orig-eax-0 This hunk being mentioned in this thread a couple years ago too: https://www.sourceware.org/ml/gdb/2014-04/msg00095.html Please don't break this use case ( and fix the one reported in that thread :-) ). Thanks, Pedro Alves
Re: ptrace() hangs on attempt to seize/attach stopped & frozen task
On 11/19/2015 05:47 PM, Oleg Nesterov wrote: > Thanks Pedro for your email, > >> 918 /* We need to wait for SIGSTOP before being able to make the next >> 919 ptrace call on this LWP. */ >> 920 new_lwp->must_set_ptrace_flags = 1; >> 921 >> 922 if (linux_proc_pid_is_stopped (lwpid)) > > This can't happen today. Starting from v3.0 at least. Eh, interesting. So right after PTRACE_ATTACH, we either observe "running" or "ptrace-stopped", but never "job stopped". Correct? I've actually just now tried this: diff --git c/gdb/linux-nat.c w/gdb/linux-nat.c index 841ec39..42f2b0d 100644 --- c/gdb/linux-nat.c +++ w/gdb/linux-nat.c @@ -981,6 +981,7 @@ linux_nat_post_attach_wait (ptid_t ptid, int first, int *cloned, pid_t new_pid, pid = ptid_get_lwp (ptid); int status; +#if 0 if (linux_proc_pid_is_stopped (pid)) { if (debug_linux_nat) @@ -1006,6 +1007,7 @@ linux_nat_post_attach_wait (ptid_t ptid, int first, int *cloned, (or a higher priority signal, just like normal PTRACE_ATTACH). */ ptrace (PTRACE_CONT, pid, 0, 0); } +#endif /* Make sure the initial process is stopped. The user-level threads layer might want to poke around in the inferior, and that won't and sure enough, gdb's test that covers that use case still passes, on Fedora 20 (Linux 3.19.8). And given that my Thunderbird crashed while writing this, I had sufficient time to be sure that a full test run passes cleanly too. :-P :-) >> But maybe not, if we're sure that >> that when that happens, waitpid returns for the initial >> PTRACE_ATTACH-induced SIGSTOP. > > Yes. Just you can't assume that watpid(WNOHANG) will succeed. Is it OK? Yes, assuming the ptracer is guaranteed to get a SIGCHLD to wake it up. Thanks, Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ptrace() hangs on attempt to seize/attach stopped & frozen task
On 11/17/2015 07:34 PM, Oleg Nesterov wrote: > On 11/16, Tejun Heo wrote: >>> And perhaps we can simply remove this logic? I forgot why do we hide this >>> STOPPED -> RUNNING -> TRACED transition from the attaching thread. But the >>> vague feeling tells me that we discussed this before and perhaps it was me >>> who suggested to avoid the user-visible change when you introduced this >>> transition... >> >> Heh, it was too long ago for me to remember much. :) > > Same here... > >>> Anyway, now I do not understand why do we want to hide it. Lets consider >>> the following "test-case", >>> >>> void test(int pid) >>> { >>> kill(pid, SIGSTOP); >>> waitpid(pid, NULL, WSTOPPED); >>> >>> ptrace(PTRACE_ATTACH-OR-PTRACE_SEIZE, pid, 0,0); >>> >>> assert(ptrace(PTRACE_DETACH, pid, 0,0) == 0); >>> } >>> >>> Yes, it will fail if we remove JOBCTL_TRAPPING. But it can equally fail >>> if SIGCONT comes before ATTACH, so perhaps we do not really care? >>> >>> Jan, Pedro, do you think the patch below can break gdb somehow? With this >>> patch you can never assume that waitpid(WNOHANG) or ptrace(WHATEVER) will >>> succeed right after PTRACE_ATTACH/PTRACE_SEIZE, even if you know that the >>> tracee was TASK_STOPPED before attach. Not sure, because I don't think I fully understand that proposed change. Both GDB and gdbserver have special processing for attaching to already-stopped processes. (and neither use PTRACE_SEIZE yet.) Here's the gdbserver version: https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gdb/gdbserver/linux-low.c;h=41ab510fa4ac5654f101f08efb68e26b5bc5dbd7;hb=HEAD#l903 Copied here for convenience: 907 linux_attach_lwp (ptid_t ptid) 908 { 909 struct lwp_info *new_lwp; 910 int lwpid = ptid_get_lwp (ptid); 911 912 if (ptrace (PTRACE_ATTACH, lwpid, (PTRACE_TYPE_ARG3) 0, (PTRACE_TYPE_ARG4) 0) 913 != 0) 914 return errno; 915 916 new_lwp = add_lwp (ptid); 917 918 /* We need to wait for SIGSTOP before being able to make the next 919 ptrace call on this LWP. */ 920 new_lwp->must_set_ptrace_flags = 1; 921 922 if (linux_proc_pid_is_stopped (lwpid)) 923 { 924 if (debug_threads) 925 debug_printf ("Attached to a stopped process\n"); 926 927 /* The process is definitely stopped. It is in a job control 928 stop, unless the kernel predates the TASK_STOPPED / 929 TASK_TRACED distinction, in which case it might be in a 930 ptrace stop. Make sure it is in a ptrace stop; from there we 931 can kill it, signal it, et cetera. 932 933 First make sure there is a pending SIGSTOP. Since we are 934 already attached, the process can not transition from stopped 935 to running without a PTRACE_CONT; so we know this signal will 936 go into the queue. The SIGSTOP generated by PTRACE_ATTACH is 937 probably already in the queue (unless this kernel is old 938 enough to use TASK_STOPPED for ptrace stops); but since 939 SIGSTOP is not an RT signal, it can only be queued once. */ 940 kill_lwp (lwpid, SIGSTOP); 941 942 /* Finally, resume the stopped process. This will deliver the 943 SIGSTOP (or a higher priority signal, just like normal 944 PTRACE_ATTACH), which we'll catch later on. */ 945 ptrace (PTRACE_CONT, lwpid, (PTRACE_TYPE_ARG3) 0, (PTRACE_TYPE_ARG4) 0); 946 } 947 948 /* The next time we wait for this LWP we'll see a SIGSTOP as PTRACE_ATTACH 949 brings it to a halt. 950 linux_proc_pid_is_stopped checks whether the state in /proc/pid/status is "T (stopped)". Here's the equivalent in gdb: https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gdb/linux-nat.c;h=841ec3949c37438dfba924d8db6b37ffc416dd29;hb=HEAD#l974 This queuing of a SIGSTOP + PTRACE_CONT was necessary because otherwise when gdb attaches to a job stopped process, gdb would hang in the waitpid after PTRACE_ATTACH, waiting for the initial SIGSTOP which would never arrive. If the proposed change makes it so that a new intermediate state can be observed right after PTRACE_ATTACH, and so linux_proc_pid_is_stopped can return false, then there's potential for breakage. But maybe not, if we're sure that that when that happens, waitpid returns for the initial PTRACE_ATTACH-induced SIGSTOP. Thanks, Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ptrace() hangs on attempt to seize/attach stopped & frozen task
On 11/19/2015 05:47 PM, Oleg Nesterov wrote: > Thanks Pedro for your email, > >> 918 /* We need to wait for SIGSTOP before being able to make the next >> 919 ptrace call on this LWP. */ >> 920 new_lwp->must_set_ptrace_flags = 1; >> 921 >> 922 if (linux_proc_pid_is_stopped (lwpid)) > > This can't happen today. Starting from v3.0 at least. Eh, interesting. So right after PTRACE_ATTACH, we either observe "running" or "ptrace-stopped", but never "job stopped". Correct? I've actually just now tried this: diff --git c/gdb/linux-nat.c w/gdb/linux-nat.c index 841ec39..42f2b0d 100644 --- c/gdb/linux-nat.c +++ w/gdb/linux-nat.c @@ -981,6 +981,7 @@ linux_nat_post_attach_wait (ptid_t ptid, int first, int *cloned, pid_t new_pid, pid = ptid_get_lwp (ptid); int status; +#if 0 if (linux_proc_pid_is_stopped (pid)) { if (debug_linux_nat) @@ -1006,6 +1007,7 @@ linux_nat_post_attach_wait (ptid_t ptid, int first, int *cloned, (or a higher priority signal, just like normal PTRACE_ATTACH). */ ptrace (PTRACE_CONT, pid, 0, 0); } +#endif /* Make sure the initial process is stopped. The user-level threads layer might want to poke around in the inferior, and that won't and sure enough, gdb's test that covers that use case still passes, on Fedora 20 (Linux 3.19.8). And given that my Thunderbird crashed while writing this, I had sufficient time to be sure that a full test run passes cleanly too. :-P :-) >> But maybe not, if we're sure that >> that when that happens, waitpid returns for the initial >> PTRACE_ATTACH-induced SIGSTOP. > > Yes. Just you can't assume that watpid(WNOHANG) will succeed. Is it OK? Yes, assuming the ptracer is guaranteed to get a SIGCHLD to wake it up. Thanks, Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ptrace() hangs on attempt to seize/attach stopped & frozen task
On 11/17/2015 07:34 PM, Oleg Nesterov wrote: > On 11/16, Tejun Heo wrote: >>> And perhaps we can simply remove this logic? I forgot why do we hide this >>> STOPPED -> RUNNING -> TRACED transition from the attaching thread. But the >>> vague feeling tells me that we discussed this before and perhaps it was me >>> who suggested to avoid the user-visible change when you introduced this >>> transition... >> >> Heh, it was too long ago for me to remember much. :) > > Same here... > >>> Anyway, now I do not understand why do we want to hide it. Lets consider >>> the following "test-case", >>> >>> void test(int pid) >>> { >>> kill(pid, SIGSTOP); >>> waitpid(pid, NULL, WSTOPPED); >>> >>> ptrace(PTRACE_ATTACH-OR-PTRACE_SEIZE, pid, 0,0); >>> >>> assert(ptrace(PTRACE_DETACH, pid, 0,0) == 0); >>> } >>> >>> Yes, it will fail if we remove JOBCTL_TRAPPING. But it can equally fail >>> if SIGCONT comes before ATTACH, so perhaps we do not really care? >>> >>> Jan, Pedro, do you think the patch below can break gdb somehow? With this >>> patch you can never assume that waitpid(WNOHANG) or ptrace(WHATEVER) will >>> succeed right after PTRACE_ATTACH/PTRACE_SEIZE, even if you know that the >>> tracee was TASK_STOPPED before attach. Not sure, because I don't think I fully understand that proposed change. Both GDB and gdbserver have special processing for attaching to already-stopped processes. (and neither use PTRACE_SEIZE yet.) Here's the gdbserver version: https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gdb/gdbserver/linux-low.c;h=41ab510fa4ac5654f101f08efb68e26b5bc5dbd7;hb=HEAD#l903 Copied here for convenience: 907 linux_attach_lwp (ptid_t ptid) 908 { 909 struct lwp_info *new_lwp; 910 int lwpid = ptid_get_lwp (ptid); 911 912 if (ptrace (PTRACE_ATTACH, lwpid, (PTRACE_TYPE_ARG3) 0, (PTRACE_TYPE_ARG4) 0) 913 != 0) 914 return errno; 915 916 new_lwp = add_lwp (ptid); 917 918 /* We need to wait for SIGSTOP before being able to make the next 919 ptrace call on this LWP. */ 920 new_lwp->must_set_ptrace_flags = 1; 921 922 if (linux_proc_pid_is_stopped (lwpid)) 923 { 924 if (debug_threads) 925 debug_printf ("Attached to a stopped process\n"); 926 927 /* The process is definitely stopped. It is in a job control 928 stop, unless the kernel predates the TASK_STOPPED / 929 TASK_TRACED distinction, in which case it might be in a 930 ptrace stop. Make sure it is in a ptrace stop; from there we 931 can kill it, signal it, et cetera. 932 933 First make sure there is a pending SIGSTOP. Since we are 934 already attached, the process can not transition from stopped 935 to running without a PTRACE_CONT; so we know this signal will 936 go into the queue. The SIGSTOP generated by PTRACE_ATTACH is 937 probably already in the queue (unless this kernel is old 938 enough to use TASK_STOPPED for ptrace stops); but since 939 SIGSTOP is not an RT signal, it can only be queued once. */ 940 kill_lwp (lwpid, SIGSTOP); 941 942 /* Finally, resume the stopped process. This will deliver the 943 SIGSTOP (or a higher priority signal, just like normal 944 PTRACE_ATTACH), which we'll catch later on. */ 945 ptrace (PTRACE_CONT, lwpid, (PTRACE_TYPE_ARG3) 0, (PTRACE_TYPE_ARG4) 0); 946 } 947 948 /* The next time we wait for this LWP we'll see a SIGSTOP as PTRACE_ATTACH 949 brings it to a halt. 950 linux_proc_pid_is_stopped checks whether the state in /proc/pid/status is "T (stopped)". Here's the equivalent in gdb: https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gdb/linux-nat.c;h=841ec3949c37438dfba924d8db6b37ffc416dd29;hb=HEAD#l974 This queuing of a SIGSTOP + PTRACE_CONT was necessary because otherwise when gdb attaches to a job stopped process, gdb would hang in the waitpid after PTRACE_ATTACH, waiting for the initial SIGSTOP which would never arrive. If the proposed change makes it so that a new intermediate state can be observed right after PTRACE_ATTACH, and so linux_proc_pid_is_stopped can return false, then there's potential for breakage. But maybe not, if we're sure that that when that happens, waitpid returns for the initial PTRACE_ATTACH-induced SIGSTOP. Thanks, Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] wait/ptrace: always assume __WALL if the child is traced
On 10/28/2015 04:11 PM, Oleg Nesterov wrote: > On 10/26, Pedro Alves wrote: >> >> On 10/25/2015 03:54 PM, Oleg Nesterov wrote: >>> >>> In any case, the real question is whether we should change the kernel to >>> fix the problem, or ask the distros to fix their init's. In the former >>> case 1/2 looks simpler/safer to me than the change in ptrace_traceme(), >>> and you seem to agree that 1/2 is not that bad. >> >> A risk here seems to be that waitpid will start returning unexpected >> (thread) PIDs to parent processes, > > I don't see how this change can make the things worse, > >> and it's not unreasonable to assume >> that e.g., a program asserts that waitpid either returns error or a >> known (process) PID. > > Well. /sbin/init can never assume this, obviously. Right. I was actually thinking of !init processes -- basically code that spawns helper processes, keeps a data structure indexed by pid, then discards the structure when the child exits. Something like: pid = waitpid(-1, , 0); if (pid > 0) { struct child_process *child = find_process(pid); assert (child != NULL); } As in, before your change, the child could get stuck forever, but after your change, the parent could die/assert instead. But ... > >> That's not an init-only issue, > > Yes. Because we have CLONE_PARENT. So "waitpid either returns error or a > known (process) PID" is only true if you trust your children. ... OK, that's indeed a good point. >> (Also, in the original test case, if the child gets/raises a signal or execs >> before exiting, the bash/init/whatever process won't be issuing PTRACE_CONT, >> and the child will thus end up stuck (though should be SIGKILLable, > > Oh, but if it is killable everything is fine. How does this differ from the > case when, say, you jusr reparent to init and do kill(getpid(), SIGSTOP) ? The difference is that if the child called PTRACE_TRACEME, then it goes to ptrace-stop instead and no amount of SIGCONT unstucks it -- the only way out is force killing. I agree it's not a major issue as there's a way out (and thus made it a parens), but I wouldn't call it nice either. >> All this because PTRACE_TRACEME is broken by design > > Heh. I agree. But we can't fix it now. Perhaps the man page could document it as deprecated, suggesting PTRACE_ATTACH/PTRACE_SEIZE instead? Thanks, Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] wait/ptrace: always assume __WALL if the child is traced
On 10/28/2015 04:11 PM, Oleg Nesterov wrote: > On 10/26, Pedro Alves wrote: >> >> On 10/25/2015 03:54 PM, Oleg Nesterov wrote: >>> >>> In any case, the real question is whether we should change the kernel to >>> fix the problem, or ask the distros to fix their init's. In the former >>> case 1/2 looks simpler/safer to me than the change in ptrace_traceme(), >>> and you seem to agree that 1/2 is not that bad. >> >> A risk here seems to be that waitpid will start returning unexpected >> (thread) PIDs to parent processes, > > I don't see how this change can make the things worse, > >> and it's not unreasonable to assume >> that e.g., a program asserts that waitpid either returns error or a >> known (process) PID. > > Well. /sbin/init can never assume this, obviously. Right. I was actually thinking of !init processes -- basically code that spawns helper processes, keeps a data structure indexed by pid, then discards the structure when the child exits. Something like: pid = waitpid(-1, , 0); if (pid > 0) { struct child_process *child = find_process(pid); assert (child != NULL); } As in, before your change, the child could get stuck forever, but after your change, the parent could die/assert instead. But ... > >> That's not an init-only issue, > > Yes. Because we have CLONE_PARENT. So "waitpid either returns error or a > known (process) PID" is only true if you trust your children. ... OK, that's indeed a good point. >> (Also, in the original test case, if the child gets/raises a signal or execs >> before exiting, the bash/init/whatever process won't be issuing PTRACE_CONT, >> and the child will thus end up stuck (though should be SIGKILLable, > > Oh, but if it is killable everything is fine. How does this differ from the > case when, say, you jusr reparent to init and do kill(getpid(), SIGSTOP) ? The difference is that if the child called PTRACE_TRACEME, then it goes to ptrace-stop instead and no amount of SIGCONT unstucks it -- the only way out is force killing. I agree it's not a major issue as there's a way out (and thus made it a parens), but I wouldn't call it nice either. >> All this because PTRACE_TRACEME is broken by design > > Heh. I agree. But we can't fix it now. Perhaps the man page could document it as deprecated, suggesting PTRACE_ATTACH/PTRACE_SEIZE instead? Thanks, Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] wait/ptrace: always assume __WALL if the child is traced
On 10/25/2015 03:54 PM, Oleg Nesterov wrote: > On 10/22, Denys Vlasenko wrote: >> >> On Wed, Oct 21, 2015 at 11:47 PM, Oleg Nesterov wrote: >>> On 10/21, Denys Vlasenko wrote: >>>> >>>> On 10/21/2015 09:59 PM, Denys Vlasenko wrote: >>>>> On 10/21/2015 12:31 AM, Andrew Morton wrote: >>>>>> Well, to fix this a distro needs to roll out a new kernel. Or a new >>>>>> init(8). Is there any reason to believe that distributing/deploying a >>>>>> new kernel is significantly easier for everyone? Because fixing init >>>>>> sounds like a much preferable solution to this problem. >>>>> >>>>> People will continue to write new init(8) implementations, >>>>> and they will miss this obscure case. >>>>> >>>>> Before this bug was found, it was considered possible to use >>>>> a shell script as init process. What now, every shell needs to add >>>>> __WALL to its waitpids? >>> >>> Why not? I think it can safely use __WALL too. >> >> Because having any userspace program which can happen to be init, >> which includes all shells out there in the wild >> (bash, dash, ash, ksh, zsh, msh, hush,...) >> learn about __WALL is wrong: apart from this wart, they do not have >> to use any Linux-specific code. It can all be perfectly legitimate POSIX. > > Yes, this is true. I meant that they could safely use __WALL to, but I > understand that this change can be painful. > >>> Sure. But people do the things which were never intended to be >>> used all the time. We simply can not know if this "feature" >>> already has a creative user or not. >> >> It won't be unfixable: they will just have to switch from PTRACE_TRACEME >> to PTRACE_ATTACH. >> >> As of now we do not know any people craz^W creative enough >> to create a cross between init and strace. If such specimens would >> materialize, don't they deserve to have to make that change? > > This also applies to people who use bash/whatever as /sbin/init and allow > the untrusted users to run the exploits ;) I do not know who is more crazy. > > In any case, the real question is whether we should change the kernel to > fix the problem, or ask the distros to fix their init's. In the former > case 1/2 looks simpler/safer to me than the change in ptrace_traceme(), > and you seem to agree that 1/2 is not that bad. A risk here seems to be that waitpid will start returning unexpected (thread) PIDs to parent processes, and it's not unreasonable to assume that e.g., a program asserts that waitpid either returns error or a known (process) PID. That's not an init-only issue, but something that might affect any process that runs a child that happens to decide to call PTRACE_TRACEME. The ptrace man page says: "A process can initiate a trace by calling fork(2) and having the resulting child do a PTRACE_TRACEME, followed (typically) by an execve(2)." Given that, can we instead make the kernel error out on PTRACE_TRACEME issued from a non-leader thread? Then between PTRACE_TRACEME and the parent's waitpid, __WALL or !__WALL should make no difference. (Also, in the original test case, if the child gets/raises a signal or execs before exiting, the bash/init/whatever process won't be issuing PTRACE_CONT, and the child will thus end up stuck (though should be SIGKILLable, I believe). All this because PTRACE_TRACEME is broken by design by making it be the child's choice whether to be traced...) Thanks, Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/2] wait/ptrace: always assume __WALL if the child is traced
On 10/25/2015 03:54 PM, Oleg Nesterov wrote: > On 10/22, Denys Vlasenko wrote: >> >> On Wed, Oct 21, 2015 at 11:47 PM, Oleg Nesterov <o...@redhat.com> wrote: >>> On 10/21, Denys Vlasenko wrote: >>>> >>>> On 10/21/2015 09:59 PM, Denys Vlasenko wrote: >>>>> On 10/21/2015 12:31 AM, Andrew Morton wrote: >>>>>> Well, to fix this a distro needs to roll out a new kernel. Or a new >>>>>> init(8). Is there any reason to believe that distributing/deploying a >>>>>> new kernel is significantly easier for everyone? Because fixing init >>>>>> sounds like a much preferable solution to this problem. >>>>> >>>>> People will continue to write new init(8) implementations, >>>>> and they will miss this obscure case. >>>>> >>>>> Before this bug was found, it was considered possible to use >>>>> a shell script as init process. What now, every shell needs to add >>>>> __WALL to its waitpids? >>> >>> Why not? I think it can safely use __WALL too. >> >> Because having any userspace program which can happen to be init, >> which includes all shells out there in the wild >> (bash, dash, ash, ksh, zsh, msh, hush,...) >> learn about __WALL is wrong: apart from this wart, they do not have >> to use any Linux-specific code. It can all be perfectly legitimate POSIX. > > Yes, this is true. I meant that they could safely use __WALL to, but I > understand that this change can be painful. > >>> Sure. But people do the things which were never intended to be >>> used all the time. We simply can not know if this "feature" >>> already has a creative user or not. >> >> It won't be unfixable: they will just have to switch from PTRACE_TRACEME >> to PTRACE_ATTACH. >> >> As of now we do not know any people craz^W creative enough >> to create a cross between init and strace. If such specimens would >> materialize, don't they deserve to have to make that change? > > This also applies to people who use bash/whatever as /sbin/init and allow > the untrusted users to run the exploits ;) I do not know who is more crazy. > > In any case, the real question is whether we should change the kernel to > fix the problem, or ask the distros to fix their init's. In the former > case 1/2 looks simpler/safer to me than the change in ptrace_traceme(), > and you seem to agree that 1/2 is not that bad. A risk here seems to be that waitpid will start returning unexpected (thread) PIDs to parent processes, and it's not unreasonable to assume that e.g., a program asserts that waitpid either returns error or a known (process) PID. That's not an init-only issue, but something that might affect any process that runs a child that happens to decide to call PTRACE_TRACEME. The ptrace man page says: "A process can initiate a trace by calling fork(2) and having the resulting child do a PTRACE_TRACEME, followed (typically) by an execve(2)." Given that, can we instead make the kernel error out on PTRACE_TRACEME issued from a non-leader thread? Then between PTRACE_TRACEME and the parent's waitpid, __WALL or !__WALL should make no difference. (Also, in the original test case, if the child gets/raises a signal or execs before exiting, the bash/init/whatever process won't be issuing PTRACE_CONT, and the child will thus end up stuck (though should be SIGKILLable, I believe). All this because PTRACE_TRACEME is broken by design by making it be the child's choice whether to be traced...) Thanks, Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] wait/ptrace: always assume __WALL if the child is traced
On 10/20/2015 06:17 PM, Oleg Nesterov wrote: > Jan, Pedro, could you please confirm this won't break gdb? I tried > to look into gdb-7.1, and at first glance gdb uses __WCLONE only > because __WALL doesn't work on older kernels, iow it seems to me > that gdb actually wants __WALL so this change should be fine. Right, gdb actually wants __WALL, but it doesn't use it to keep compatibility with kernels that predate it. gdb nowadays has an __WALL emulation waitpid wrapper (alternates __WCLONE+WNOHANG with 0+WNOHANG, blocks on sigsuspend/SIGCHLD): https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-waitpid.c;h=cbcdd95afa9c664993542b0b3851e79fbae4e1df;hb=HEAD#l77 Though it's not used everywhere. Some older code in the ptrace backend open codes the "try __WCLONE, then try !__WCLONE." dance. Seems like __WALL was added in Linux 2.4; gdb could probably assume it's available nowadays... In any case, to make sure existing gdb binaries would still work with your kernel change, I ran GDB's testsuite with this: ~~ diff --git a/gdb/nat/linux-waitpid.c b/gdb/nat/linux-waitpid.c index cbcdd95..864ba2e 100644 --- a/gdb/nat/linux-waitpid.c +++ b/gdb/nat/linux-waitpid.c @@ -149,3 +149,17 @@ my_waitpid (int pid, int *status, int flags) errno = out_errno; return ret; } + +#include + +pid_t +waitpid (pid_t pid, int *status, int options) +{ + static pid_t (*waitpid2) (pid_t pid, int *status, int options) = NULL; + + if (waitpid2 == NULL) +waitpid2 = dlsym (RTLD_NEXT, "waitpid"); + + options |= __WALL; + return waitpid2 (pid, status, options); +} ~~ and got no regressions. So seems like all would be well from GDB's perspective. Thanks, Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 0/2] wait/ptrace: always assume __WALL if the child is traced
On 10/20/2015 06:17 PM, Oleg Nesterov wrote: > Jan, Pedro, could you please confirm this won't break gdb? I tried > to look into gdb-7.1, and at first glance gdb uses __WCLONE only > because __WALL doesn't work on older kernels, iow it seems to me > that gdb actually wants __WALL so this change should be fine. Right, gdb actually wants __WALL, but it doesn't use it to keep compatibility with kernels that predate it. gdb nowadays has an __WALL emulation waitpid wrapper (alternates __WCLONE+WNOHANG with 0+WNOHANG, blocks on sigsuspend/SIGCHLD): https://sourceware.org/git/?p=binutils-gdb.git;a=blob;f=gdb/nat/linux-waitpid.c;h=cbcdd95afa9c664993542b0b3851e79fbae4e1df;hb=HEAD#l77 Though it's not used everywhere. Some older code in the ptrace backend open codes the "try __WCLONE, then try !__WCLONE." dance. Seems like __WALL was added in Linux 2.4; gdb could probably assume it's available nowadays... In any case, to make sure existing gdb binaries would still work with your kernel change, I ran GDB's testsuite with this: ~~ diff --git a/gdb/nat/linux-waitpid.c b/gdb/nat/linux-waitpid.c index cbcdd95..864ba2e 100644 --- a/gdb/nat/linux-waitpid.c +++ b/gdb/nat/linux-waitpid.c @@ -149,3 +149,17 @@ my_waitpid (int pid, int *status, int flags) errno = out_errno; return ret; } + +#include + +pid_t +waitpid (pid_t pid, int *status, int options) +{ + static pid_t (*waitpid2) (pid_t pid, int *status, int options) = NULL; + + if (waitpid2 == NULL) +waitpid2 = dlsym (RTLD_NEXT, "waitpid"); + + options |= __WALL; + return waitpid2 (pid, status, options); +} ~~ and got no regressions. So seems like all would be well from GDB's perspective. Thanks, Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v5 02/10] x86: Compile-time asm code validation
On 06/12/2015 03:10 PM, Josh Poimboeuf wrote: > On Fri, Jun 12, 2015 at 12:18:16PM +0100, Pedro Alves wrote: >> On 06/11/2015 03:10 PM, Josh Poimboeuf wrote: >> >>> C would definitely make more sense when analyzing object code. In fact, >>> asmvalidate is written in C. But then I guess we'd have to re-implement >>> the .cfi stuff and populate the DWARF sections manually instead of >>> letting the assembler do it. >> >> Was doing all this directly in the assembler considered? That is, >> e.g., add some knob that makes it error/warn in the same conditions >> you're making the validator catch. For tail calls, you'd e.g., add >> some new ".nonlocal" directive that you'd use to whitelist the >> following jump. And then if it's possible run a CFI generator >> as a separate step over the source, it sounds like it should also >> be possible to have the assembler do it instead too (again with >> some new high level directive to trigger/help it). > > In general I think doing these types of things in the assembler would be > a good idea. Missing or inaccurate debug data for asm code seems to be > a common problem for other projects as well. As Andy pointed out, > they're doing similar things in musl [1]. Thanks for the pointer. > So it might be useful to add > an option to the assembler which validates that the code conforms to > certain structural rules, and then inserts frame pointer and/or .cfi > directives. > That said, the kernel has much more custom features than other projects. > There are some sneaky macros, like _ASM_EXTABLE and ALTERNATIVE, which > hide code in various sections. Unless we're able to somehow either stop > using these macros or isolate them to a few places, I doubt that such a > general purpose assembler option would work. How does the asmvalidator handle these? > [1] http://www.openwall.com/lists/musl/2015/05/31/5 Thanks, Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v5 02/10] x86: Compile-time asm code validation
On 06/11/2015 03:10 PM, Josh Poimboeuf wrote: > C would definitely make more sense when analyzing object code. In fact, > asmvalidate is written in C. But then I guess we'd have to re-implement > the .cfi stuff and populate the DWARF sections manually instead of > letting the assembler do it. Was doing all this directly in the assembler considered? That is, e.g., add some knob that makes it error/warn in the same conditions you're making the validator catch. For tail calls, you'd e.g., add some new ".nonlocal" directive that you'd use to whitelist the following jump. And then if it's possible run a CFI generator as a separate step over the source, it sounds like it should also be possible to have the assembler do it instead too (again with some new high level directive to trigger/help it). Thanks, Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v5 02/10] x86: Compile-time asm code validation
On 06/11/2015 03:10 PM, Josh Poimboeuf wrote: C would definitely make more sense when analyzing object code. In fact, asmvalidate is written in C. But then I guess we'd have to re-implement the .cfi stuff and populate the DWARF sections manually instead of letting the assembler do it. Was doing all this directly in the assembler considered? That is, e.g., add some knob that makes it error/warn in the same conditions you're making the validator catch. For tail calls, you'd e.g., add some new .nonlocal directive that you'd use to whitelist the following jump. And then if it's possible run a CFI generator as a separate step over the source, it sounds like it should also be possible to have the assembler do it instead too (again with some new high level directive to trigger/help it). Thanks, Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v5 02/10] x86: Compile-time asm code validation
On 06/12/2015 03:10 PM, Josh Poimboeuf wrote: On Fri, Jun 12, 2015 at 12:18:16PM +0100, Pedro Alves wrote: On 06/11/2015 03:10 PM, Josh Poimboeuf wrote: C would definitely make more sense when analyzing object code. In fact, asmvalidate is written in C. But then I guess we'd have to re-implement the .cfi stuff and populate the DWARF sections manually instead of letting the assembler do it. Was doing all this directly in the assembler considered? That is, e.g., add some knob that makes it error/warn in the same conditions you're making the validator catch. For tail calls, you'd e.g., add some new .nonlocal directive that you'd use to whitelist the following jump. And then if it's possible run a CFI generator as a separate step over the source, it sounds like it should also be possible to have the assembler do it instead too (again with some new high level directive to trigger/help it). In general I think doing these types of things in the assembler would be a good idea. Missing or inaccurate debug data for asm code seems to be a common problem for other projects as well. As Andy pointed out, they're doing similar things in musl [1]. Thanks for the pointer. So it might be useful to add an option to the assembler which validates that the code conforms to certain structural rules, and then inserts frame pointer and/or .cfi directives. That said, the kernel has much more custom features than other projects. There are some sneaky macros, like _ASM_EXTABLE and ALTERNATIVE, which hide code in various sections. Unless we're able to somehow either stop using these macros or isolate them to a few places, I doubt that such a general purpose assembler option would work. How does the asmvalidator handle these? [1] http://www.openwall.com/lists/musl/2015/05/31/5 Thanks, Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: install_special_mapping && vm_pgoff (Was: vvar, gup && coredump)
Thanks for looking over all this, guys. Really appreciated. On 03/16/2015 07:01 PM, Oleg Nesterov wrote: > is wrong. To verify, I changed show_map_vma() to expose pgoff even if !file, > but this test-case can show the problem too: Might be good to add tests like this to selftests/ once all this is sorted. Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: install_special_mapping vm_pgoff (Was: vvar, gup coredump)
Thanks for looking over all this, guys. Really appreciated. On 03/16/2015 07:01 PM, Oleg Nesterov wrote: is wrong. To verify, I changed show_map_vma() to expose pgoff even if !file, but this test-case can show the problem too: Might be good to add tests like this to selftests/ once all this is sorted. Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: vvar, gup && coredump
On 03/12/2015 05:46 PM, Oleg Nesterov wrote: > On 03/12, Oleg Nesterov wrote: >> >> Yes, this is true. OK, lets not dump it. > > OTOH. We can probably add ->access() into special_mapping_vmops, this > way __access_remote_vm() could work even if gup() fails ? > > Jan, Sergio. How much do we want do dump this area ? The change above > should be justified. Memory mappings that weren't touched since they were initially mapped can be retrieved from the program binary and the shared libraries, even if the core dump is moved to another machine. However, in vvar case, sounds like there's nowhere to read it from offline? In that case, it could be justified to dump it. Thanks, Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: vvar, gup coredump
On 03/12/2015 05:46 PM, Oleg Nesterov wrote: On 03/12, Oleg Nesterov wrote: Yes, this is true. OK, lets not dump it. OTOH. We can probably add -access() into special_mapping_vmops, this way __access_remote_vm() could work even if gup() fails ? Jan, Sergio. How much do we want do dump this area ? The change above should be justified. Memory mappings that weren't touched since they were initially mapped can be retrieved from the program binary and the shared libraries, even if the core dump is moved to another machine. However, in vvar case, sounds like there's nowhere to read it from offline? In that case, it could be justified to dump it. Thanks, Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/1] ptrace/x86: fix the TIF_FORCED_TF logic in handle_signal()
On 11/04/2014 11:55 PM, Oleg Nesterov wrote: > On 11/03, Pedro Alves wrote: >> >> Thanks a lot Oleg. > > thanks for the detailed report ;) > >> Question - shouldn't ptrace tests be put in >> tools/testing/selftests/ptrace/ in the kernel tree nowadays? > > Oh, I do not know. Personally I am not sure that selftests/ptrace/ > makes sense and I did not even know (or forgot) we already have it. I didn't know about it either until recently. > To me it would be better to move the single peeksiginfo.c to ptrace > testsuite and remove this dir. > > That said, I certainly won't argue if you or Jan will maintain > selftests/ptrace and send the patches with the new test-cases ;) IMO, having the tests in the kernel tree might make it simpler to require fixes to come along with tests, so we're better covered against regressions, but whatever works best to whoever is maintaining ptrace is good for me. ;-) > > The only problem is that every new test-case should be justified > somehow. For example, should we add the test-case from this changelog > into selftests/ptrace/ ? IMO, we should have a ptrace test for this somewhere. I just don't know about the intention of selftests/ptrace/, but it looked like it was meant as a replacement for the out of tree testsuite. Having the ptrace testsuite in the tree makes sense to me, for making it easier to require ptrace fixes and extensions to come along with tests, and to make it easier to catch regressions closer to the source, and to serve as self-documentation on expected behavior, but I don't have time to actively pursue that myself, unfortunately. > Honestly, I do not know. This bug is minor, > and probably we do not want a test-case for every bug we fix. My view is that there was a bug, and a test case would help prevent re-introducing it. > So I'd > leave this to you, you know how ptrace is actually _used_ and what is > important for gdb. Right. FYI, I've added a testcase to gdb as well, to cover the use case from a higher level functionality view: https://sourceware.org/ml/gdb-patches/2014-10/msg00780.html (Between a bug/regression being introduced in the kernel and running the gdb testsuite against that, a lot of time may pass. Myself, I usually only test and develop against whatever kernel the distro ships, which is a released kernel. And gdb doesn't use every ptrace feature.) > > The same for other tests in ptrace testsuite. Some of them are really > "random", in any case (I think) we should not blindly put them all into > selftests/ptrace. Not to mention the coding style which should be fixed. Agreed. > > And I know that gdb has a lot of internal tests, and gdb developers > run them (and ptrace tests) to check the new kernels... Who else do > you think will use selftests/ptrace? IMO, whoever touches ptrace code in the kernel should be required to run the ptrace tests, but not the testsuites of random other tools that use ptrace (like gdb, strace and others). Having the ptrace tests in the tree sounded like might make such a requirement simpler, and I naively thought that the existence of the directory might mean that that's the direction kernel folks had agreed on, that's all. > > But again, if you/Jan want to add something into selftests - please > send a patch. I will even try to review it but only in a sense that > I will try to convince myself that I understand _what_ this test does, > not _why_ we need this test-case. You certainly understand "why" much > better. > > Add Denys, perhaps he also has some tests for strace. Thanks, Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/1] ptrace/x86: fix the TIF_FORCED_TF logic in handle_signal()
On 11/04/2014 11:55 PM, Oleg Nesterov wrote: On 11/03, Pedro Alves wrote: Thanks a lot Oleg. thanks for the detailed report ;) Question - shouldn't ptrace tests be put in tools/testing/selftests/ptrace/ in the kernel tree nowadays? Oh, I do not know. Personally I am not sure that selftests/ptrace/ makes sense and I did not even know (or forgot) we already have it. I didn't know about it either until recently. To me it would be better to move the single peeksiginfo.c to ptrace testsuite and remove this dir. That said, I certainly won't argue if you or Jan will maintain selftests/ptrace and send the patches with the new test-cases ;) IMO, having the tests in the kernel tree might make it simpler to require fixes to come along with tests, so we're better covered against regressions, but whatever works best to whoever is maintaining ptrace is good for me. ;-) The only problem is that every new test-case should be justified somehow. For example, should we add the test-case from this changelog into selftests/ptrace/ ? IMO, we should have a ptrace test for this somewhere. I just don't know about the intention of selftests/ptrace/, but it looked like it was meant as a replacement for the out of tree testsuite. Having the ptrace testsuite in the tree makes sense to me, for making it easier to require ptrace fixes and extensions to come along with tests, and to make it easier to catch regressions closer to the source, and to serve as self-documentation on expected behavior, but I don't have time to actively pursue that myself, unfortunately. Honestly, I do not know. This bug is minor, and probably we do not want a test-case for every bug we fix. My view is that there was a bug, and a test case would help prevent re-introducing it. So I'd leave this to you, you know how ptrace is actually _used_ and what is important for gdb. Right. FYI, I've added a testcase to gdb as well, to cover the use case from a higher level functionality view: https://sourceware.org/ml/gdb-patches/2014-10/msg00780.html (Between a bug/regression being introduced in the kernel and running the gdb testsuite against that, a lot of time may pass. Myself, I usually only test and develop against whatever kernel the distro ships, which is a released kernel. And gdb doesn't use every ptrace feature.) The same for other tests in ptrace testsuite. Some of them are really random, in any case (I think) we should not blindly put them all into selftests/ptrace. Not to mention the coding style which should be fixed. Agreed. And I know that gdb has a lot of internal tests, and gdb developers run them (and ptrace tests) to check the new kernels... Who else do you think will use selftests/ptrace? IMO, whoever touches ptrace code in the kernel should be required to run the ptrace tests, but not the testsuites of random other tools that use ptrace (like gdb, strace and others). Having the ptrace tests in the tree sounded like might make such a requirement simpler, and I naively thought that the existence of the directory might mean that that's the direction kernel folks had agreed on, that's all. But again, if you/Jan want to add something into selftests - please send a patch. I will even try to review it but only in a sense that I will try to convince myself that I understand _what_ this test does, not _why_ we need this test-case. You certainly understand why much better. Add Denys, perhaps he also has some tests for strace. Thanks, Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 1/1] ptrace/x86: fix the TIF_FORCED_TF logic in handle_signal()
Thanks a lot Oleg. Question - shouldn't ptrace tests be put in tools/testing/selftests/ptrace/ in the kernel tree nowadays? Thanks, Pedro Alves On 11/03/2014 08:13 PM, Oleg Nesterov wrote: > When the TIF_SINGLESTEP tracee dequeues a signal, handle_signal() > clears TIF_FORCED_TF and X86_EFLAGS_TF but leaves TIF_SINGLESTEP set. > > If the tracer does PTRACE_SINGLESTEP again, enable_single_step() sets > X86_EFLAGS_TF but not TIF_FORCED_TF. This means that the subsequent > PTRACE_CONT doesn't not clear X86_EFLAGS_TF, and the tracee gets the > wrong SIGTRAP. > > Test-case (needs -O2 to avoid prologue insns in signal handler): > > #include > #include > #include > #include > #include > #include > #include > > void handler(int n) > { > asm("nop"); > } > > int child(void) > { > assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0); > signal(SIGALRM, handler); > kill(getpid(), SIGALRM); > return 0x23; > } > > void *getip(int pid) > { > return (void*)ptrace(PTRACE_PEEKUSER, pid, > offsetof(struct user, regs.rip), 0); > } > > int main(void) > { > int pid, status; > > pid = fork(); > if (!pid) > return child(); > > assert(wait() == pid); > assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGALRM); > > assert(ptrace(PTRACE_SINGLESTEP, pid, 0, SIGALRM) == 0); > assert(wait() == pid); > assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP); > assert((getip(pid) - (void*)handler) == 0); > > assert(ptrace(PTRACE_SINGLESTEP, pid, 0, SIGALRM) == 0); > assert(wait() == pid); > assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP); > assert((getip(pid) - (void*)handler) == 1); > > assert(ptrace(PTRACE_CONT, pid, 0,0) == 0); > assert(wait() == pid); > assert(WIFEXITED(status) && WEXITSTATUS(status) == 0x23); > > return 0; > } > > The last assert() fails because PTRACE_CONT wrongly triggers another > single-step and X86_EFLAGS_TF can't be cleared by debugger until the > tracee does sys_rt_sigreturn(). > > Change handle_signal() to do user_disable_single_step() if stepping, > we do not need to preserve TIF_SINGLESTEP because we are going to do > ptrace_notify(), and it is simply wrong to leak this bit. > > While at it, change the comment to explain why we also need to clear > TF unconditionally after setup_rt_frame(). > > Note: in the longer term we should probably change setup_sigcontext() > to use get_flags() and then just remove this user_disable_single_step(). > And, the state of TIF_FORCED_TF can be wrong after restore_sigcontext() > which can set/clear TF, this needs another fix. > > Reported-by: Evan Teran > Reported-by: Pedro Alves > Signed-off-by: Oleg Nesterov > --- > arch/x86/kernel/signal.c | 22 +++--- > 1 files changed, 11 insertions(+), 11 deletions(-) > > diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c > index ed37a76..9d3a15b 100644 > --- a/arch/x86/kernel/signal.c > +++ b/arch/x86/kernel/signal.c > @@ -629,7 +629,8 @@ setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs) > static void > handle_signal(struct ksignal *ksig, struct pt_regs *regs) > { > - bool failed; > + bool stepping, failed; > + > /* Are we from a system call? */ > if (syscall_get_nr(current, regs) >= 0) { > /* If so, check system call restarting.. */ > @@ -653,12 +654,13 @@ handle_signal(struct ksignal *ksig, struct pt_regs > *regs) > } > > /* > - * If TF is set due to a debugger (TIF_FORCED_TF), clear the TF > - * flag so that register information in the sigcontext is correct. > + * If TF is set due to a debugger (TIF_FORCED_TF), clear TF now > + * so that register information in the sigcontext is correct and > + * then notify the tracer before entering the signal handler. >*/ > - if (unlikely(regs->flags & X86_EFLAGS_TF) && > - likely(test_and_clear_thread_flag(TIF_FORCED_TF))) > - regs->flags &= ~X86_EFLAGS_TF; > + stepping = test_thread_flag(TIF_SINGLESTEP); > + if (stepping) > + user_disable_single_step(current); > > failed = (s
Re: [PATCH 1/1] ptrace/x86: fix the TIF_FORCED_TF logic in handle_signal()
Thanks a lot Oleg. Question - shouldn't ptrace tests be put in tools/testing/selftests/ptrace/ in the kernel tree nowadays? Thanks, Pedro Alves On 11/03/2014 08:13 PM, Oleg Nesterov wrote: When the TIF_SINGLESTEP tracee dequeues a signal, handle_signal() clears TIF_FORCED_TF and X86_EFLAGS_TF but leaves TIF_SINGLESTEP set. If the tracer does PTRACE_SINGLESTEP again, enable_single_step() sets X86_EFLAGS_TF but not TIF_FORCED_TF. This means that the subsequent PTRACE_CONT doesn't not clear X86_EFLAGS_TF, and the tracee gets the wrong SIGTRAP. Test-case (needs -O2 to avoid prologue insns in signal handler): #include unistd.h #include stdio.h #include sys/ptrace.h #include sys/wait.h #include sys/user.h #include assert.h #include stddef.h void handler(int n) { asm(nop); } int child(void) { assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0); signal(SIGALRM, handler); kill(getpid(), SIGALRM); return 0x23; } void *getip(int pid) { return (void*)ptrace(PTRACE_PEEKUSER, pid, offsetof(struct user, regs.rip), 0); } int main(void) { int pid, status; pid = fork(); if (!pid) return child(); assert(wait(status) == pid); assert(WIFSTOPPED(status) WSTOPSIG(status) == SIGALRM); assert(ptrace(PTRACE_SINGLESTEP, pid, 0, SIGALRM) == 0); assert(wait(status) == pid); assert(WIFSTOPPED(status) WSTOPSIG(status) == SIGTRAP); assert((getip(pid) - (void*)handler) == 0); assert(ptrace(PTRACE_SINGLESTEP, pid, 0, SIGALRM) == 0); assert(wait(status) == pid); assert(WIFSTOPPED(status) WSTOPSIG(status) == SIGTRAP); assert((getip(pid) - (void*)handler) == 1); assert(ptrace(PTRACE_CONT, pid, 0,0) == 0); assert(wait(status) == pid); assert(WIFEXITED(status) WEXITSTATUS(status) == 0x23); return 0; } The last assert() fails because PTRACE_CONT wrongly triggers another single-step and X86_EFLAGS_TF can't be cleared by debugger until the tracee does sys_rt_sigreturn(). Change handle_signal() to do user_disable_single_step() if stepping, we do not need to preserve TIF_SINGLESTEP because we are going to do ptrace_notify(), and it is simply wrong to leak this bit. While at it, change the comment to explain why we also need to clear TF unconditionally after setup_rt_frame(). Note: in the longer term we should probably change setup_sigcontext() to use get_flags() and then just remove this user_disable_single_step(). And, the state of TIF_FORCED_TF can be wrong after restore_sigcontext() which can set/clear TF, this needs another fix. Reported-by: Evan Teran ete...@alum.rit.edu Reported-by: Pedro Alves pal...@redhat.com Signed-off-by: Oleg Nesterov o...@redhat.com --- arch/x86/kernel/signal.c | 22 +++--- 1 files changed, 11 insertions(+), 11 deletions(-) diff --git a/arch/x86/kernel/signal.c b/arch/x86/kernel/signal.c index ed37a76..9d3a15b 100644 --- a/arch/x86/kernel/signal.c +++ b/arch/x86/kernel/signal.c @@ -629,7 +629,8 @@ setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs) static void handle_signal(struct ksignal *ksig, struct pt_regs *regs) { - bool failed; + bool stepping, failed; + /* Are we from a system call? */ if (syscall_get_nr(current, regs) = 0) { /* If so, check system call restarting.. */ @@ -653,12 +654,13 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs) } /* - * If TF is set due to a debugger (TIF_FORCED_TF), clear the TF - * flag so that register information in the sigcontext is correct. + * If TF is set due to a debugger (TIF_FORCED_TF), clear TF now + * so that register information in the sigcontext is correct and + * then notify the tracer before entering the signal handler. */ - if (unlikely(regs-flags X86_EFLAGS_TF) - likely(test_and_clear_thread_flag(TIF_FORCED_TF))) - regs-flags = ~X86_EFLAGS_TF; + stepping = test_thread_flag(TIF_SINGLESTEP); + if (stepping) + user_disable_single_step(current); failed = (setup_rt_frame(ksig, regs) 0); if (!failed) { @@ -669,10 +671,8 @@ handle_signal(struct ksignal *ksig, struct pt_regs *regs) * it might disable possible debug exception from the * signal handler. * - * Clear TF when entering the signal handler, but - * notify any tracer that was single-stepping it. - * The tracer may want
Re: [PATCH] elf, coredump: Extract only the active register set during core dump
On 05/23/2014 06:16 AM, Anshuman Khandual wrote: > Regset active hooks provide a way to query how many registers in the > register set are active at any point of time. Currently this information > is being ignored while creating core dump sections corresponding to any > core note register set. This way the core dump will contain data which are > not part of the active context of the process and may not be useful. This > patch will make sure that only the active part of the register set are > captured during the core dump process which will reduce the core dump > size. > > Signed-off-by: Anshuman Khandual > --- > NOTE: > Pedro Alves has mentioned that producing smaller note sections in the core > dump may break some existing consumers. I request suggestions, reviews and > test reports on different architectures to prove that this patch does not > break any existing consumer. Thank you. Yeah, FYI, I mentioned that after noticing that ia64 does: mainline/linux-2.6/arch/ia64/kernel/ptrace.c: static int fpregs_active(struct task_struct *target, const struct user_regset *regset) { return (target->thread.flags & IA64_THREAD_FPH_VALID) ? 128 : 32; } And it's likely that tools expect fpregset_t to have a fixed size. include/uapi/linux/elfcore.h: 22 typedef elf_fpregset_t fpregset_t; arch/ia64/include/asm/elf.h: 186 typedef struct ia64_fpreg elf_fpreg_t; 187 typedef elf_fpreg_t elf_fpregset_t[ELF_NFPREG]; arch/ia64/include/asm/elf.h: 154 #define ELF_NGREG 128 /* we really need just 72 but let's leave some headroom... */ 155 #define ELF_NFPREG 128 /* f0 and f1 could be omitted, but so what... */ I haven't done an exhaustive look over ports. -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] elf, coredump: Extract only the active register set during core dump
On 05/23/2014 06:16 AM, Anshuman Khandual wrote: Regset active hooks provide a way to query how many registers in the register set are active at any point of time. Currently this information is being ignored while creating core dump sections corresponding to any core note register set. This way the core dump will contain data which are not part of the active context of the process and may not be useful. This patch will make sure that only the active part of the register set are captured during the core dump process which will reduce the core dump size. Signed-off-by: Anshuman Khandual khand...@linux.vnet.ibm.com --- NOTE: Pedro Alves has mentioned that producing smaller note sections in the core dump may break some existing consumers. I request suggestions, reviews and test reports on different architectures to prove that this patch does not break any existing consumer. Thank you. Yeah, FYI, I mentioned that after noticing that ia64 does: mainline/linux-2.6/arch/ia64/kernel/ptrace.c: static int fpregs_active(struct task_struct *target, const struct user_regset *regset) { return (target-thread.flags IA64_THREAD_FPH_VALID) ? 128 : 32; } And it's likely that tools expect fpregset_t to have a fixed size. include/uapi/linux/elfcore.h: 22 typedef elf_fpregset_t fpregset_t; arch/ia64/include/asm/elf.h: 186 typedef struct ia64_fpreg elf_fpreg_t; 187 typedef elf_fpreg_t elf_fpregset_t[ELF_NFPREG]; arch/ia64/include/asm/elf.h: 154 #define ELF_NGREG 128 /* we really need just 72 but let's leave some headroom... */ 155 #define ELF_NFPREG 128 /* f0 and f1 could be omitted, but so what... */ I haven't done an exhaustive look over ports. -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2 2/3] powerpc, ptrace: Enable support for transactional memory register sets
On 05/20/2014 09:14 AM, Anshuman Khandual wrote: > On 05/19/2014 08:13 PM, Pedro Alves wrote: >> On 05/19/2014 12:46 PM, Anshuman Khandual wrote: >> >>>>> I couldn't actually find any arch that currently returns -ENODEV in >>>>> the "active" hook. I see that binfmt_elf.c doesn't handle >>>>> regset->active() returning < 0. Guess that may be why. Looks like >>>>> something that could be cleaned up, to me. >>>>> >>> Also it does not consider the return value of regset->active(t->task, >>> regset) >>> (whose objective is to figure out whether we need to request regset->n >>> number >>> of elements or less than that) in the subsequent call to regset->get >>> function. >> >> Indeed. >> >> TBC, do you plan on fixing this? Otherwise ... > > Sure, thinking something like this as mentioned below. But still not sure how > to use > the return type of -ENODEV from the function regset->active(). Right now if > any > regset does have the active hook and it returns anything but positive value, > it will > be ignored and the control moves to the next regset in view. This prevents > the thread > core note type being written to the core dump. Looks to me that that's exactly what should happen for -ENODEV too. The regset should be ignored. If regset->active() returns -ENODEV, then the machine doesn't have the registers at all, so what makes sense to me is to not write the corresponding core note in the dump. IOW, on such a machine, the kernel generates a core exactly like if the support for these registers that don't make sense for this machine wasn't compiled in at all. And generates a core exactly like an older kernel that didn't know about that regset (which is fine for that same machine) yet. > > diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c > index aa3cb62..80672fb 100644 > --- a/fs/binfmt_elf.c > +++ b/fs/binfmt_elf.c > @@ -1553,7 +1553,15 @@ static int fill_thread_core_info(struct > elf_thread_core_info *t, > if (regset->core_note_type && regset->get && > (!regset->active || regset->active(t->task, regset))) { > int ret; So, here, this ? (!regset->active || regset->active(t->task, regset) > 0)) { > - size_t size = regset->n * regset->size; > + size_t size; > + > + /* Request only the active elements in the regset */ > + if (!regset->active) > + size = regset->n * regset->size; > + else > + size = regset->active(t->task, regset) > + * > regset->size; > + I wonder if it wouldn't be cleaner to add a function like: int regset_active (tast *task, regseg *regset) { if (!regset->active) return regset->n * regset->size; else return regset->active(task, regset); } And then use it like if (regset->core_note_type && regset->get) { int size = regset_active (t->task, regset); if (size > 0) { ... } Though at this point, we don't actually make use of the distinction between -ENODEV vs 0. Guess that's what we should be thinking about. Seems like there some details that need to be sorted out, and some verification that consumers aren't broken by outputting smaller notes -- e.g., ia64 makes me wonder that. Maybe we should leave this for another day, and have tm_spr_active return 0 instead of -ENODEV when the machine doesn't have the hardware, or not install that hook at all. Seems like the effect will be the same, as the note isn't output if ->get fails. > void *data = kmalloc(size, GFP_KERNEL); > if (unlikely(!data)) > return 0; > >> >>> Now coming to the installation of the .active hooks part for all the new >>> regsets, it >>> should be pretty straight forward as well. Though its optional and used for >>> elf_core_dump >>> purpose only, its worth adding them here. Example of an active function >>> should be something >>> like this. The function is inexpensive as required. >>> >>> +static int tm_spr_active(struct task_struct *target, >>> + const struct user_regset *regset) >>> +{ >>> + if (!cpu_has_feature(CPU_FTR_TM)) >>> +
Re: [PATCH V2 2/3] powerpc, ptrace: Enable support for transactional memory register sets
On 05/20/2014 09:14 AM, Anshuman Khandual wrote: On 05/19/2014 08:13 PM, Pedro Alves wrote: On 05/19/2014 12:46 PM, Anshuman Khandual wrote: I couldn't actually find any arch that currently returns -ENODEV in the active hook. I see that binfmt_elf.c doesn't handle regset-active() returning 0. Guess that may be why. Looks like something that could be cleaned up, to me. Also it does not consider the return value of regset-active(t-task, regset) (whose objective is to figure out whether we need to request regset-n number of elements or less than that) in the subsequent call to regset-get function. Indeed. TBC, do you plan on fixing this? Otherwise ... Sure, thinking something like this as mentioned below. But still not sure how to use the return type of -ENODEV from the function regset-active(). Right now if any regset does have the active hook and it returns anything but positive value, it will be ignored and the control moves to the next regset in view. This prevents the thread core note type being written to the core dump. Looks to me that that's exactly what should happen for -ENODEV too. The regset should be ignored. If regset-active() returns -ENODEV, then the machine doesn't have the registers at all, so what makes sense to me is to not write the corresponding core note in the dump. IOW, on such a machine, the kernel generates a core exactly like if the support for these registers that don't make sense for this machine wasn't compiled in at all. And generates a core exactly like an older kernel that didn't know about that regset (which is fine for that same machine) yet. diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index aa3cb62..80672fb 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -1553,7 +1553,15 @@ static int fill_thread_core_info(struct elf_thread_core_info *t, if (regset-core_note_type regset-get (!regset-active || regset-active(t-task, regset))) { int ret; So, here, this ? (!regset-active || regset-active(t-task, regset) 0)) { - size_t size = regset-n * regset-size; + size_t size; + + /* Request only the active elements in the regset */ + if (!regset-active) + size = regset-n * regset-size; + else + size = regset-active(t-task, regset) + * regset-size; + I wonder if it wouldn't be cleaner to add a function like: int regset_active (tast *task, regseg *regset) { if (!regset-active) return regset-n * regset-size; else return regset-active(task, regset); } And then use it like if (regset-core_note_type regset-get) { int size = regset_active (t-task, regset); if (size 0) { ... } Though at this point, we don't actually make use of the distinction between -ENODEV vs 0. Guess that's what we should be thinking about. Seems like there some details that need to be sorted out, and some verification that consumers aren't broken by outputting smaller notes -- e.g., ia64 makes me wonder that. Maybe we should leave this for another day, and have tm_spr_active return 0 instead of -ENODEV when the machine doesn't have the hardware, or not install that hook at all. Seems like the effect will be the same, as the note isn't output if -get fails. void *data = kmalloc(size, GFP_KERNEL); if (unlikely(!data)) return 0; Now coming to the installation of the .active hooks part for all the new regsets, it should be pretty straight forward as well. Though its optional and used for elf_core_dump purpose only, its worth adding them here. Example of an active function should be something like this. The function is inexpensive as required. +static int tm_spr_active(struct task_struct *target, + const struct user_regset *regset) +{ + if (!cpu_has_feature(CPU_FTR_TM)) + return -ENODEV; ... unfortunately this will do the wrong thing. I am not sure whether I understand this correctly. Are you saying that its wrong to return -ENODEV in this case as above ? No, sorry for not being clear. The (...)'s were connected: do you plan on fixing this? Otherwise ... ... unfortunately this will do the wrong thing. -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2 2/3] powerpc, ptrace: Enable support for transactional memory register sets
On 05/19/2014 12:46 PM, Anshuman Khandual wrote: >> > I couldn't actually find any arch that currently returns -ENODEV in >> > the "active" hook. I see that binfmt_elf.c doesn't handle >> > regset->active() returning < 0. Guess that may be why. Looks like >> > something that could be cleaned up, to me. >> > > Also it does not consider the return value of regset->active(t->task, regset) > (whose objective is to figure out whether we need to request regset->n number > of elements or less than that) in the subsequent call to regset->get function. Indeed. TBC, do you plan on fixing this? Otherwise ... > Now coming to the installation of the .active hooks part for all the new > regsets, it > should be pretty straight forward as well. Though its optional and used for > elf_core_dump > purpose only, its worth adding them here. Example of an active function > should be something > like this. The function is inexpensive as required. > > +static int tm_spr_active(struct task_struct *target, > + const struct user_regset *regset) > +{ > + if (!cpu_has_feature(CPU_FTR_TM)) > + return -ENODEV; ... unfortunately this will do the wrong thing. Thanks, -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2 2/3] powerpc, ptrace: Enable support for transactional memory register sets
On 05/19/2014 12:46 PM, Anshuman Khandual wrote: I couldn't actually find any arch that currently returns -ENODEV in the active hook. I see that binfmt_elf.c doesn't handle regset-active() returning 0. Guess that may be why. Looks like something that could be cleaned up, to me. Also it does not consider the return value of regset-active(t-task, regset) (whose objective is to figure out whether we need to request regset-n number of elements or less than that) in the subsequent call to regset-get function. Indeed. TBC, do you plan on fixing this? Otherwise ... Now coming to the installation of the .active hooks part for all the new regsets, it should be pretty straight forward as well. Though its optional and used for elf_core_dump purpose only, its worth adding them here. Example of an active function should be something like this. The function is inexpensive as required. +static int tm_spr_active(struct task_struct *target, + const struct user_regset *regset) +{ + if (!cpu_has_feature(CPU_FTR_TM)) + return -ENODEV; ... unfortunately this will do the wrong thing. Thanks, -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC/PATCH] Implement new PTRACE_EVENT_SYSCALL_{ENTER,EXIT}
On 05/15/2014 03:36 PM, Denys Vlasenko wrote: > On 05/14/2014 08:49 PM, Pedro Alves wrote: >> I realized now that I responded to this. Sorry about that. >> >> On 01/19/2014 03:29 PM, Oleg Nesterov wrote: >>> On 01/19, Sergio Durigan Junior wrote: >>>> >>>> On Friday, January 10 2014, Oleg Nesterov wrote: >>>> >>>>> So suppose that gdb does ptrace(PTRACE_SINGLESTEP) and the tracee >>>>> executes the "syscall" insn. What it should report? >>>> [...] >>>>> But what should syscall-exit do? Should it still report SIGSEGV as >>>>> it currently does, or should it report _SYSCALL_EXIT instead (if >>>>> PTRACE_O_SYSCALL_EXIT of course), or should it report both? >>>> >>>> Both only if _SYSCALL_EXIT is set. Otherwise, stick to the current >>>> behavior, I guess. >>> >>> OK, both. In which order? Probably _EXIT first. But this looks a bit >>> strange. Suppose that the tracee reports _EXIT, then debugger does >>> ptrace(PTRACE_CONT), should the tracee report SIGTRAP? >> >> Seems to me that this should be very much the same as fork/vfork/clone >> event handling. Those are triggered by a syscall anyway. So, say: >> >> - ptrace(PTRACE_SINGLESTEP) >> - the tracee executes the "syscall" insn, and the syscall is "clone". >> - PTRACE_EVENT_FORK is reported. >> - The debugger does ptrace(PTRACE_CONT). >> >> What should be reported? What is reported now? > > Yes, PTRACE_SINGLESTEP looks ambiguous. Just like PTRACE_SYSCALL is. Well, we just need to _define_ what should happen in these cases, then it's no longer ambiguous. I tried that out under GDB, using 3.14.3-200.fc20.x86_64 (Fedora 20), _without_ Sérgio's patches. Here's what I see. All this is with PTRACE_O_TRACEFORK enabled, of course. #1 - Enter the syscall with PTRACE_SINGLESTEP, then PTRACE_SINGLESTEP The first PTRACE_SINGLESTEP triggers the PTRACE_EVENT_FORK event. We see that PTRACE_EVENT_FORK leaves the PC pointing after the syscall instruction: Dump of assembler code from 0x373bcbca6a to 0x373bcbca74: 0x00373bcbca6a <__libc_fork+170>:syscall => 0x00373bcbca6c <__libc_fork+172>:cmp$0xf000,%rax 0x00373bcbca72 <__libc_fork+178>:ja 0x373bcbcbb6 <__libc_fork+502> Then after the second PTRACE_SINGLESTEP, the kernel reports a trap with the PC still pointing at the insn after the "syscall". IOW, the PC doesn't move, and the instruction after the "syscall" insn hasn't been executed yet. But the PC did move if we consider last time the thread was resumed while it was running userspace code. So, the "syscall" insn is finally fully executed. I think this is reasonable. #2 - Enter the syscall with PTRACE_SINGLESTEP, then PTRACE_CONTINUE After the PTRACE_SINGLESTEP, the state is the same as in #1. That is, PC points after "syscall". The kernel reports no trap for the latter PTRACE_CONTINUE, the process just continues free until the next event. #3 - Enter syscall with PTRACE_CONTINUE, then PTRACE_SINGLESTEP Like in #1 and #2, we see that PTRACE_EVENT_FORK leaves the PC pointing after the syscall instruction. The latter PTRACE_SINGLESTEP behaves just like in #1. IOW, we get a trap, but the PC doesn't move -- the "syscall" insn is finally fully executed. #4 - Enter syscall with PTRACE_CONTINUE, then PTRACE_CONTINUE Same as #2. IOW, no spurious trap. So it doesn't really matter whether the process enters the syscall through PTRACE_SINGLESTEP vs PTRACE_CONTINUE. What matters is the last resumption. At least from the ptracer's perspective, the kernel doesn't "remember" that the process was PTRACE_SINGLESTEP'd on entry to the syscall. I think this is the most reasonable behavior, and the simplest for a ptracer, and very much what I prefer for GDB. If that last PTRACE_CONTINUE would instead report a SIGTRAP for the previous PTRACE_SINGLESTEP, then the ptracer would need to keep track of more state, for no good reason. If the ptracer really wants the ptracee to continue single-stepping as if the fork/vfork/clone wasn't caught with a magic event, then it just continues issuing PTRACE_SINGLESTEP. Now I think PTRACE_EVENT_SYSCALL_{ENTER,EXIT} should behave just the same. That is, this is the sequence I expect: -> PTRACE_SINGLESTEP/PTRACE_CONTINUE over "syscall" <- PTRACE_EVENT_SYSCALL_ENTER -- PC points after "syscall" -> PTRACE_SINGLESTEP/PTRACE_CONTINUE <- PTRACE_EVENT_FORK -- PC points after "syscall" -> PTRACE_SINGLESTEP/PTRACE_CONTINUE <- PTRACE_EVENT_SYSCALL_EXIT -- PC points after "
Re: [RFC/PATCH] Implement new PTRACE_EVENT_SYSCALL_{ENTER,EXIT}
On 05/15/2014 03:36 PM, Denys Vlasenko wrote: On 05/14/2014 08:49 PM, Pedro Alves wrote: I realized now that I responded to this. Sorry about that. On 01/19/2014 03:29 PM, Oleg Nesterov wrote: On 01/19, Sergio Durigan Junior wrote: On Friday, January 10 2014, Oleg Nesterov wrote: So suppose that gdb does ptrace(PTRACE_SINGLESTEP) and the tracee executes the syscall insn. What it should report? [...] But what should syscall-exit do? Should it still report SIGSEGV as it currently does, or should it report _SYSCALL_EXIT instead (if PTRACE_O_SYSCALL_EXIT of course), or should it report both? Both only if _SYSCALL_EXIT is set. Otherwise, stick to the current behavior, I guess. OK, both. In which order? Probably _EXIT first. But this looks a bit strange. Suppose that the tracee reports _EXIT, then debugger does ptrace(PTRACE_CONT), should the tracee report SIGTRAP? Seems to me that this should be very much the same as fork/vfork/clone event handling. Those are triggered by a syscall anyway. So, say: - ptrace(PTRACE_SINGLESTEP) - the tracee executes the syscall insn, and the syscall is clone. - PTRACE_EVENT_FORK is reported. - The debugger does ptrace(PTRACE_CONT). What should be reported? What is reported now? Yes, PTRACE_SINGLESTEP looks ambiguous. Just like PTRACE_SYSCALL is. Well, we just need to _define_ what should happen in these cases, then it's no longer ambiguous. I tried that out under GDB, using 3.14.3-200.fc20.x86_64 (Fedora 20), _without_ Sérgio's patches. Here's what I see. All this is with PTRACE_O_TRACEFORK enabled, of course. #1 - Enter the syscall with PTRACE_SINGLESTEP, then PTRACE_SINGLESTEP The first PTRACE_SINGLESTEP triggers the PTRACE_EVENT_FORK event. We see that PTRACE_EVENT_FORK leaves the PC pointing after the syscall instruction: Dump of assembler code from 0x373bcbca6a to 0x373bcbca74: 0x00373bcbca6a __libc_fork+170:syscall = 0x00373bcbca6c __libc_fork+172:cmp$0xf000,%rax 0x00373bcbca72 __libc_fork+178:ja 0x373bcbcbb6 __libc_fork+502 Then after the second PTRACE_SINGLESTEP, the kernel reports a trap with the PC still pointing at the insn after the syscall. IOW, the PC doesn't move, and the instruction after the syscall insn hasn't been executed yet. But the PC did move if we consider last time the thread was resumed while it was running userspace code. So, the syscall insn is finally fully executed. I think this is reasonable. #2 - Enter the syscall with PTRACE_SINGLESTEP, then PTRACE_CONTINUE After the PTRACE_SINGLESTEP, the state is the same as in #1. That is, PC points after syscall. The kernel reports no trap for the latter PTRACE_CONTINUE, the process just continues free until the next event. #3 - Enter syscall with PTRACE_CONTINUE, then PTRACE_SINGLESTEP Like in #1 and #2, we see that PTRACE_EVENT_FORK leaves the PC pointing after the syscall instruction. The latter PTRACE_SINGLESTEP behaves just like in #1. IOW, we get a trap, but the PC doesn't move -- the syscall insn is finally fully executed. #4 - Enter syscall with PTRACE_CONTINUE, then PTRACE_CONTINUE Same as #2. IOW, no spurious trap. So it doesn't really matter whether the process enters the syscall through PTRACE_SINGLESTEP vs PTRACE_CONTINUE. What matters is the last resumption. At least from the ptracer's perspective, the kernel doesn't remember that the process was PTRACE_SINGLESTEP'd on entry to the syscall. I think this is the most reasonable behavior, and the simplest for a ptracer, and very much what I prefer for GDB. If that last PTRACE_CONTINUE would instead report a SIGTRAP for the previous PTRACE_SINGLESTEP, then the ptracer would need to keep track of more state, for no good reason. If the ptracer really wants the ptracee to continue single-stepping as if the fork/vfork/clone wasn't caught with a magic event, then it just continues issuing PTRACE_SINGLESTEP. Now I think PTRACE_EVENT_SYSCALL_{ENTER,EXIT} should behave just the same. That is, this is the sequence I expect: - PTRACE_SINGLESTEP/PTRACE_CONTINUE over syscall - PTRACE_EVENT_SYSCALL_ENTER -- PC points after syscall - PTRACE_SINGLESTEP/PTRACE_CONTINUE - PTRACE_EVENT_FORK -- PC points after syscall - PTRACE_SINGLESTEP/PTRACE_CONTINUE - PTRACE_EVENT_SYSCALL_EXIT -- PC points after syscall And then after, either: a) PTRACE_SINGLESTEP - SIGTRAP -- PC points after syscall b) PTRACE_CONTINUE - *nothing* Of course, skip the PTRACE_EVENT_FORK step in the middle of the sequence if PTRACE_O_TRACEFORK is not enabled. Thanks, -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2 2/3] powerpc, ptrace: Enable support for transactional memory register sets
On 05/15/2014 09:25 AM, Anshuman Khandual wrote: > On 05/14/2014 04:45 PM, Pedro Alves wrote: >> On 05/14/14 06:46, Anshuman Khandual wrote: >>> On 05/13/2014 10:43 PM, Pedro Alves wrote: >>>> On 05/05/14 08:54, Anshuman Khandual wrote: >>>>> This patch enables get and set of transactional memory related register >>>>> sets through PTRACE_GETREGSET/PTRACE_SETREGSET interface by implementing >>>>> four new powerpc specific register sets i.e REGSET_TM_SPR, REGSET_TM_CGPR, >>>>> REGSET_TM_CFPR, REGSET_CVMX support corresponding to these following new >>>>> ELF core note types added previously in this regard. >>>>> >>>>> (1) NT_PPC_TM_SPR >>>>> (2) NT_PPC_TM_CGPR >>>>> (3) NT_PPC_TM_CFPR >>>>> (4) NT_PPC_TM_CVMX >>>> >>>> Sorry that I couldn't tell this from the code, but, what does the >>>> kernel return when the ptracer requests these registers and the >>>> program is not in a transaction? Specifically I'm wondering whether >>>> this follows the same semantics as the s390 port. >>>> >>> >>> Right now, it still returns the saved state of the registers from thread >>> struct. I had assumed that the user must know the state of the transaction >>> before initiating the ptrace request. I guess its better to check for >>> the transaction status before processing the request. In case if TM is not >>> active on that thread, we should return -EINVAL. >> >> I think s390 returns ENODATA in that case. >> >> https://sourceware.org/ml/gdb-patches/2013-06/msg00273.html >> >> We'll want some way to tell whether the system actually >> supports this. That could be ENODATA vs something-else (EINVAL >> or perhaps better EIO for "request is invalid"). > > As Mickey has pointed out, the transaction memory support in the system can be > checked from the HWCAP2 flags. So when the transaction is not active, we will > return ENODATA instead for TM related ptrace regset requests. Returning ENODATA when the transaction is not active, like s390 is great. Thank you. But I think it's worth it to consider what should the kernel return when the machine doesn't have these registers at all. Sure, for this case we happen to have the hwcap flag. But in general, I don't know whether we will always have a hwcap bit for each register set that is added. Maybe we will, so that the info ends up in core dumps. Still, I think it's worth to consider this case in the general sense, irrespective of hwcap. That is, what should PTRACE_GETREGSET/PTRACE_SETREGSET return when the machine doesn't have the registers at all. We shouldn't need to consult something elsewhere (like hwcap) to determine what ENODATA means. The kernel knows it right there. I think s390 goofed here. Taking a look at x86, for example, we see: [REGSET_XSTATE] = { .core_note_type = NT_X86_XSTATE, .size = sizeof(u64), .align = sizeof(u64), .active = xstateregs_active, .get = xstateregs_get, .set = xstateregs_set }, Note that it installs the ".active" hook. 24 /** 25 * user_regset_active_fn - type of @active function in user_regset 26 * @target: thread being examined 27 * @regset: regset being examined 28 * 29 * Return -%ENODEV if not available on the hardware found. 30 * Return %0 if no interesting state in this thread. 31 * Return >%0 number of @size units of interesting state. 32 * Any get call fetching state beyond that number will 33 * see the default initialization state for this data, 34 * so a caller that knows what the default state is need 35 * not copy it all out. 36 * This call is optional; the pointer is %NULL if there 37 * is no inexpensive check to yield a value < @n. 38 */ 39 typedef int user_regset_active_fn(struct task_struct *target, 40 const struct user_regset *regset); 41 Note the mention of ENODEV. I couldn't actually find any arch that currently returns -ENODEV in the "active" hook. I see that binfmt_elf.c doesn't handle regset->active() returning < 0. Guess that may be why. Looks like something that could be cleaned up, to me. Anyway, notice x86's REGSET_XSTATE regset->get hook: int xstateregs_get(struct task_struct *target, const struct user_regset *regset, unsigned int pos, unsigned int count, void *kbuf, void __user *ubuf) { int ret; if (!cpu_has_xsave) return -ENODEV; ^^ And then we see that xfpregs_get has a similar ENODEV case. So in sum, it very much looks like the intention is
Re: [PATCH V2 2/3] powerpc, ptrace: Enable support for transactional memory register sets
On 05/15/2014 09:25 AM, Anshuman Khandual wrote: On 05/14/2014 04:45 PM, Pedro Alves wrote: On 05/14/14 06:46, Anshuman Khandual wrote: On 05/13/2014 10:43 PM, Pedro Alves wrote: On 05/05/14 08:54, Anshuman Khandual wrote: This patch enables get and set of transactional memory related register sets through PTRACE_GETREGSET/PTRACE_SETREGSET interface by implementing four new powerpc specific register sets i.e REGSET_TM_SPR, REGSET_TM_CGPR, REGSET_TM_CFPR, REGSET_CVMX support corresponding to these following new ELF core note types added previously in this regard. (1) NT_PPC_TM_SPR (2) NT_PPC_TM_CGPR (3) NT_PPC_TM_CFPR (4) NT_PPC_TM_CVMX Sorry that I couldn't tell this from the code, but, what does the kernel return when the ptracer requests these registers and the program is not in a transaction? Specifically I'm wondering whether this follows the same semantics as the s390 port. Right now, it still returns the saved state of the registers from thread struct. I had assumed that the user must know the state of the transaction before initiating the ptrace request. I guess its better to check for the transaction status before processing the request. In case if TM is not active on that thread, we should return -EINVAL. I think s390 returns ENODATA in that case. https://sourceware.org/ml/gdb-patches/2013-06/msg00273.html We'll want some way to tell whether the system actually supports this. That could be ENODATA vs something-else (EINVAL or perhaps better EIO for request is invalid). As Mickey has pointed out, the transaction memory support in the system can be checked from the HWCAP2 flags. So when the transaction is not active, we will return ENODATA instead for TM related ptrace regset requests. Returning ENODATA when the transaction is not active, like s390 is great. Thank you. But I think it's worth it to consider what should the kernel return when the machine doesn't have these registers at all. Sure, for this case we happen to have the hwcap flag. But in general, I don't know whether we will always have a hwcap bit for each register set that is added. Maybe we will, so that the info ends up in core dumps. Still, I think it's worth to consider this case in the general sense, irrespective of hwcap. That is, what should PTRACE_GETREGSET/PTRACE_SETREGSET return when the machine doesn't have the registers at all. We shouldn't need to consult something elsewhere (like hwcap) to determine what ENODATA means. The kernel knows it right there. I think s390 goofed here. Taking a look at x86, for example, we see: [REGSET_XSTATE] = { .core_note_type = NT_X86_XSTATE, .size = sizeof(u64), .align = sizeof(u64), .active = xstateregs_active, .get = xstateregs_get, .set = xstateregs_set }, Note that it installs the .active hook. 24 /** 25 * user_regset_active_fn - type of @active function in struct user_regset 26 * @target: thread being examined 27 * @regset: regset being examined 28 * 29 * Return -%ENODEV if not available on the hardware found. 30 * Return %0 if no interesting state in this thread. 31 * Return %0 number of @size units of interesting state. 32 * Any get call fetching state beyond that number will 33 * see the default initialization state for this data, 34 * so a caller that knows what the default state is need 35 * not copy it all out. 36 * This call is optional; the pointer is %NULL if there 37 * is no inexpensive check to yield a value @n. 38 */ 39 typedef int user_regset_active_fn(struct task_struct *target, 40 const struct user_regset *regset); 41 Note the mention of ENODEV. I couldn't actually find any arch that currently returns -ENODEV in the active hook. I see that binfmt_elf.c doesn't handle regset-active() returning 0. Guess that may be why. Looks like something that could be cleaned up, to me. Anyway, notice x86's REGSET_XSTATE regset-get hook: int xstateregs_get(struct task_struct *target, const struct user_regset *regset, unsigned int pos, unsigned int count, void *kbuf, void __user *ubuf) { int ret; if (!cpu_has_xsave) return -ENODEV; ^^ And then we see that xfpregs_get has a similar ENODEV case. So in sum, it very much looks like the intention is for PTRACE_GETREGSET/PTRACE_SETREGSET to return ENODEV in the case the regset doesn't exist on the running machine, and then it looks like at least x86 works that way. -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC/PATCH] Implement new PTRACE_EVENT_SYSCALL_{ENTER,EXIT}
I realized now that I responded to this. Sorry about that. On 01/19/2014 03:29 PM, Oleg Nesterov wrote: > On 01/19, Sergio Durigan Junior wrote: >> >> On Friday, January 10 2014, Oleg Nesterov wrote: >> >>> So suppose that gdb does ptrace(PTRACE_SINGLESTEP) and the tracee >>> executes the "syscall" insn. What it should report? >> [...] >>> But what should syscall-exit do? Should it still report SIGSEGV as >>> it currently does, or should it report _SYSCALL_EXIT instead (if >>> PTRACE_O_SYSCALL_EXIT of course), or should it report both? >> >> Both only if _SYSCALL_EXIT is set. Otherwise, stick to the current >> behavior, I guess. > > OK, both. In which order? Probably _EXIT first. But this looks a bit > strange. Suppose that the tracee reports _EXIT, then debugger does > ptrace(PTRACE_CONT), should the tracee report SIGTRAP? Seems to me that this should be very much the same as fork/vfork/clone event handling. Those are triggered by a syscall anyway. So, say: - ptrace(PTRACE_SINGLESTEP) - the tracee executes the "syscall" insn, and the syscall is "clone". - PTRACE_EVENT_FORK is reported. - The debugger does ptrace(PTRACE_CONT). What should be reported? What is reported now? -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2 2/3] powerpc, ptrace: Enable support for transactional memory register sets
On 05/14/14 12:18, Michael Neuling wrote: > >> s390 actually screwed that, though it got away because >> there's a bit in HWCAP to signal transactions support. See: >> >> https://sourceware.org/ml/gdb-patches/2013-11/msg00080.html >> >> Are you adding something to HWCAP too? > > Yes but it's in HWCAP2 That's fine. -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2 2/3] powerpc, ptrace: Enable support for transactional memory register sets
On 05/14/14 06:46, Anshuman Khandual wrote: > On 05/13/2014 10:43 PM, Pedro Alves wrote: >> On 05/05/14 08:54, Anshuman Khandual wrote: >>> This patch enables get and set of transactional memory related register >>> sets through PTRACE_GETREGSET/PTRACE_SETREGSET interface by implementing >>> four new powerpc specific register sets i.e REGSET_TM_SPR, REGSET_TM_CGPR, >>> REGSET_TM_CFPR, REGSET_CVMX support corresponding to these following new >>> ELF core note types added previously in this regard. >>> >>> (1) NT_PPC_TM_SPR >>> (2) NT_PPC_TM_CGPR >>> (3) NT_PPC_TM_CFPR >>> (4) NT_PPC_TM_CVMX >> >> Sorry that I couldn't tell this from the code, but, what does the >> kernel return when the ptracer requests these registers and the >> program is not in a transaction? Specifically I'm wondering whether >> this follows the same semantics as the s390 port. >> > > Right now, it still returns the saved state of the registers from thread > struct. I had assumed that the user must know the state of the transaction > before initiating the ptrace request. I guess its better to check for > the transaction status before processing the request. In case if TM is not > active on that thread, we should return -EINVAL. I think s390 returns ENODATA in that case. https://sourceware.org/ml/gdb-patches/2013-06/msg00273.html We'll want some way to tell whether the system actually supports this. That could be ENODATA vs something-else (EINVAL or perhaps better EIO for "request is invalid"). s390 actually screwed that, though it got away because there's a bit in HWCAP to signal transactions support. See: https://sourceware.org/ml/gdb-patches/2013-11/msg00080.html Are you adding something to HWCAP too? > > I am not familiar with the s390 side of code. But if we look at the > s390_tdb_get function it checks for (regs->int_code & 0x200) before > processing the request. Not sure what 0x200 signifies though. -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2] ptrace: Clarify PTRACE_GETREGSET/PTRACE_SETREGSET, documentation in uapi header
On 05/14/14 08:10, Anshuman Khandual wrote: > On 05/13/2014 11:39 PM, Pedro Alves wrote: >> On 05/05/14 05:10, Anshuman Khandual wrote: >>> On 05/01/2014 07:43 PM, Pedro Alves wrote: >> OK, then this is what I suggest instead: ... >>> Shall I resend the patch with the your proposed changes and your >>> "Signed-off-by" and >>> moving myself as "Reported-by" ? >> >> No idea of the actual policy to follow. Feel free to do that if that's the >> standard procedure. > > Even I am not sure about this, so to preserve the correct authorship, would > you > mind sending this patch ? Here you go. This is against current Linus'. Please take it from here if necessary. 8<------ >From 1237f5ac5896f3910f66df83a5093bb548006188 Mon Sep 17 00:00:00 2001 From: Pedro Alves Date: Wed, 14 May 2014 11:05:07 +0100 Subject: [PATCH] ptrace: Clarify PTRACE_GETREGSET/PTRACE_SETREGSET documentation in uapi header The current comments don't explicitly state in plain words that iov.len must be set to the buffer's length prior to the ptrace call. A user might get confused and leave that uninitialized. In the ptrace_regset function (snippet below) we see that the buffer length has to be a multiple of the slot/register size for the given NT_XXX_TYPE: if (!regset || (kiov->iov_len % regset->size) != 0) return -EINVAL; Note regset->size is the size of each slot/register in the set, not the size of the whole set. And then, we see here: kiov->iov_len = min(kiov->iov_len, (__kernel_size_t) (regset->n * regset->size)); that the kernel takes care of capping the requested length to the size of the whole regset. Signed-off-by: Pedro Alves Reported-by: Anshuman Khandual --- include/uapi/linux/ptrace.h | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h index cf1019e..30836b9 100644 --- a/include/uapi/linux/ptrace.h +++ b/include/uapi/linux/ptrace.h @@ -39,12 +39,17 @@ * payload are exactly the same layout. * * This interface usage is as follows: - * struct iovec iov = { buf, len}; + * struct iovec iov = { buf, len }; * * ret = ptrace(PTRACE_GETREGSET/PTRACE_SETREGSET, pid, NT_XXX_TYPE, ); * - * On the successful completion, iov.len will be updated by the kernel, - * specifying how much the kernel has written/read to/from the user's iov.buf. + * On entry, iov describes the buffer's address and length. The buffer's length + * must be a multiple of the size of a single register in the register set. The + * kernel never reads or writes more than iov.len, and caps the buffer length to + * the register set's size. In other words, the kernel reads or writes + * min(iov.len, regset size). On successful completion, iov.len is updated by + * the kernel, specifying how much the kernel has read from / written to the + * user's iov.buf. */ #define PTRACE_GETREGSET 0x4204 #define PTRACE_SETREGSET 0x4205 -- 1.9.0 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[PATCH v2] ptrace: Clarify PTRACE_GETREGSET/PTRACE_SETREGSET, documentation in uapi header
On 05/14/14 08:10, Anshuman Khandual wrote: On 05/13/2014 11:39 PM, Pedro Alves wrote: On 05/05/14 05:10, Anshuman Khandual wrote: On 05/01/2014 07:43 PM, Pedro Alves wrote: OK, then this is what I suggest instead: ... Shall I resend the patch with the your proposed changes and your Signed-off-by and moving myself as Reported-by ? No idea of the actual policy to follow. Feel free to do that if that's the standard procedure. Even I am not sure about this, so to preserve the correct authorship, would you mind sending this patch ? Here you go. This is against current Linus'. Please take it from here if necessary. 8-- From 1237f5ac5896f3910f66df83a5093bb548006188 Mon Sep 17 00:00:00 2001 From: Pedro Alves pal...@redhat.com Date: Wed, 14 May 2014 11:05:07 +0100 Subject: [PATCH] ptrace: Clarify PTRACE_GETREGSET/PTRACE_SETREGSET documentation in uapi header The current comments don't explicitly state in plain words that iov.len must be set to the buffer's length prior to the ptrace call. A user might get confused and leave that uninitialized. In the ptrace_regset function (snippet below) we see that the buffer length has to be a multiple of the slot/register size for the given NT_XXX_TYPE: if (!regset || (kiov-iov_len % regset-size) != 0) return -EINVAL; Note regset-size is the size of each slot/register in the set, not the size of the whole set. And then, we see here: kiov-iov_len = min(kiov-iov_len, (__kernel_size_t) (regset-n * regset-size)); that the kernel takes care of capping the requested length to the size of the whole regset. Signed-off-by: Pedro Alves pal...@redhat.com Reported-by: Anshuman Khandual khand...@linux.vnet.ibm.com --- include/uapi/linux/ptrace.h | 11 --- 1 file changed, 8 insertions(+), 3 deletions(-) diff --git a/include/uapi/linux/ptrace.h b/include/uapi/linux/ptrace.h index cf1019e..30836b9 100644 --- a/include/uapi/linux/ptrace.h +++ b/include/uapi/linux/ptrace.h @@ -39,12 +39,17 @@ * payload are exactly the same layout. * * This interface usage is as follows: - * struct iovec iov = { buf, len}; + * struct iovec iov = { buf, len }; * * ret = ptrace(PTRACE_GETREGSET/PTRACE_SETREGSET, pid, NT_XXX_TYPE, iov); * - * On the successful completion, iov.len will be updated by the kernel, - * specifying how much the kernel has written/read to/from the user's iov.buf. + * On entry, iov describes the buffer's address and length. The buffer's length + * must be a multiple of the size of a single register in the register set. The + * kernel never reads or writes more than iov.len, and caps the buffer length to + * the register set's size. In other words, the kernel reads or writes + * min(iov.len, regset size). On successful completion, iov.len is updated by + * the kernel, specifying how much the kernel has read from / written to the + * user's iov.buf. */ #define PTRACE_GETREGSET 0x4204 #define PTRACE_SETREGSET 0x4205 -- 1.9.0 -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2 2/3] powerpc, ptrace: Enable support for transactional memory register sets
On 05/14/14 06:46, Anshuman Khandual wrote: On 05/13/2014 10:43 PM, Pedro Alves wrote: On 05/05/14 08:54, Anshuman Khandual wrote: This patch enables get and set of transactional memory related register sets through PTRACE_GETREGSET/PTRACE_SETREGSET interface by implementing four new powerpc specific register sets i.e REGSET_TM_SPR, REGSET_TM_CGPR, REGSET_TM_CFPR, REGSET_CVMX support corresponding to these following new ELF core note types added previously in this regard. (1) NT_PPC_TM_SPR (2) NT_PPC_TM_CGPR (3) NT_PPC_TM_CFPR (4) NT_PPC_TM_CVMX Sorry that I couldn't tell this from the code, but, what does the kernel return when the ptracer requests these registers and the program is not in a transaction? Specifically I'm wondering whether this follows the same semantics as the s390 port. Right now, it still returns the saved state of the registers from thread struct. I had assumed that the user must know the state of the transaction before initiating the ptrace request. I guess its better to check for the transaction status before processing the request. In case if TM is not active on that thread, we should return -EINVAL. I think s390 returns ENODATA in that case. https://sourceware.org/ml/gdb-patches/2013-06/msg00273.html We'll want some way to tell whether the system actually supports this. That could be ENODATA vs something-else (EINVAL or perhaps better EIO for request is invalid). s390 actually screwed that, though it got away because there's a bit in HWCAP to signal transactions support. See: https://sourceware.org/ml/gdb-patches/2013-11/msg00080.html Are you adding something to HWCAP too? I am not familiar with the s390 side of code. But if we look at the s390_tdb_get function it checks for (regs-int_code 0x200) before processing the request. Not sure what 0x200 signifies though. -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2 2/3] powerpc, ptrace: Enable support for transactional memory register sets
On 05/14/14 12:18, Michael Neuling wrote: s390 actually screwed that, though it got away because there's a bit in HWCAP to signal transactions support. See: https://sourceware.org/ml/gdb-patches/2013-11/msg00080.html Are you adding something to HWCAP too? Yes but it's in HWCAP2 That's fine. -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] ptrace: Fix PTRACE_GETREGSET/PTRACE_SETREGSET in code documentation
On 05/05/14 05:10, Anshuman Khandual wrote: > On 05/01/2014 07:43 PM, Pedro Alves wrote: >> On 04/28/2014 12:00 PM, Anshuman Khandual wrote: >>> The current documentation is bit misleading and does not explicitly >>> specify that iov.len need to be initialized failing which kernel >>> may just ignore the ptrace request and never read from/write into >>> the user specified buffer. This patch fixes the documentation. >> >> Well, it kind of does, here: >> >> * struct iovec iov = { buf, len}; > > :) Thats not explicit enough. > >> >>> @@ -43,8 +43,12 @@ >>> * >>> * ret = ptrace(PTRACE_GETREGSET/PTRACE_SETREGSET, pid, NT_XXX_TYPE, ); >>> * >>> - * On the successful completion, iov.len will be updated by the kernel, >>> - * specifying how much the kernel has written/read to/from the user's >>> iov.buf. >>> + * A non-zero value upto the max size of data expected to be written/read >>> by the >>> + * kernel in response to any NT_XXX_TYPE request type must be assigned to >>> iov.len >>> + * before initiating the ptrace call. If iov.len is 0, then kernel will >>> neither >>> + * read from or write into the user buffer specified. On successful >>> completion, >>> + * iov.len will be updated by the kernel, specifying how much the kernel >>> has >>> + * written/read to/from the user's iov.buf. >> >> I really appreciate that you're trying to make this clearer, but I >> find the new sentence very hard to read/reason. :-/ >> >> I suggest: >> >> * This interface usage is as follows: >> - * struct iovec iov = { buf, len}; >> + * struct iovec iov = { buf, len }; >> * >> * ret = ptrace(PTRACE_GETREGSET/PTRACE_SETREGSET, pid, NT_XXX_TYPE, >> ); >> * >> - * On the successful completion, iov.len will be updated by the kernel, >> - * specifying how much the kernel has written/read to/from the user's >> iov.buf. >> + * On entry, iov describes the buffer's address and length. The buffer's >> + * length must be equal to or shorter than the size of the NT_XXX_TYPE >> regset. >> + * On successful completion, iov.len is updated by the kernel, specifying >> how >> + * much the kernel has written/read to/from the user's iov.buf. >> > > Yeah, sounds better. I may add "If the length is zero, the kernel will > neither read > from or write into the buffer" Well, I think that much should be obvious. What's not obvious is whether that is considered success or error (what is the return code?) I suspect and expect success return if the regset type is known, and error otherwise. So that could be used as a way to probe for support for a given regset without using stack or heap space, if it ever matters. The kernel never reads/writes beyond iov.len, so better say that, and then it automatically gets the 0 case handled too, right? >> I'm not sure I understood what you're saying correctly, though. >> Specifically, >> I don't know whether the buffer's length must really be shorter than the >> size of the NT_XXX_TYPE regset. > > No, it does not have to. From the code snippet below (ptrace_regset function) > the buffer length has to be multiple of regset->size for the given NT_XXX_TYPE > upto the max regset size for the user to see any valid data. Ah, I guess one could call it a bug. If the passed in len is bigger than the whole register set size, then there seems to be no point in validating whether the length is multiple of a single register's size. That unnecessarily prevents coming up with a register set in the future that has registers of different sizes... But given that that's how things are today, I suppose we should document it... The problem what I > faced was when you use any iovec structure with the length parameter > uninitialized, > the kernel simply ignores and does not return anything. Ah. Well, saying "does not return anything" is quite confusing. It does return something -- -EINVAL. > > if (!regset || (kiov->iov_len % regset->size) != 0) > return -EINVAL; > >> >>> The current documentation is bit misleading and does not explicitly >>> specify that iov.len need to be initialized failing which kernel >>> may just ignore the ptrace request and never read from/write into >>> the user specified buffer. >> >> You're saying that if iov.len is larger than the NT_XXX_TYPE regset, >> then the kernel returns _success_, but actually doesn't fill the >> buffer? That sounds like a bug to me. > > No, I am not
Re: [PATCH V2 2/3] powerpc, ptrace: Enable support for transactional memory register sets
I wonder whether people are getting Roland's address from? It's frequent that ptrace related patches end up CCed to rol...@redhat.com, but, he's not been at Red Hat for a few years now. Roland, do you still want to be CCed on ptrace-related issues? If so, there's probably a script somewhere in the kernel that needs updating. If not, well, it'd be good if it were updated anyway. :-) It's a little annoying, as Red Hat's servers outright reject email sent from a @redhat.com address if one tries to send an email that includes a CC/FROM to a user that no longer exists in the @redhat.com domain. -- Pedro Alves On 05/05/14 08:54, Anshuman Khandual wrote: > This patch enables get and set of transactional memory related register > sets through PTRACE_GETREGSET/PTRACE_SETREGSET interface by implementing > four new powerpc specific register sets i.e REGSET_TM_SPR, REGSET_TM_CGPR, > REGSET_TM_CFPR, REGSET_CVMX support corresponding to these following new > ELF core note types added previously in this regard. > > (1) NT_PPC_TM_SPR > (2) NT_PPC_TM_CGPR > (3) NT_PPC_TM_CFPR > (4) NT_PPC_TM_CVMX > > Signed-off-by: Anshuman Khandual > --- > arch/powerpc/include/asm/switch_to.h | 8 + > arch/powerpc/kernel/process.c| 24 ++ > arch/powerpc/kernel/ptrace.c | 683 > +-- > 3 files changed, 687 insertions(+), 28 deletions(-) > > diff --git a/arch/powerpc/include/asm/switch_to.h > b/arch/powerpc/include/asm/switch_to.h > index 0e83e7d..2737f46 100644 > --- a/arch/powerpc/include/asm/switch_to.h > +++ b/arch/powerpc/include/asm/switch_to.h > @@ -80,6 +80,14 @@ static inline void flush_spe_to_thread(struct task_struct > *t) > } > #endif > > +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM > +extern void flush_tmregs_to_thread(struct task_struct *); > +#else > +static inline void flush_tmregs_to_thread(struct task_struct *t) > +{ > +} > +#endif /* CONFIG_PPC_TRANSACTIONAL_MEM */ > + > static inline void clear_task_ebb(struct task_struct *t) > { > #ifdef CONFIG_PPC_BOOK3S_64 > diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c > index 31d0215..e247898 100644 > --- a/arch/powerpc/kernel/process.c > +++ b/arch/powerpc/kernel/process.c > @@ -695,6 +695,30 @@ static inline void __switch_to_tm(struct task_struct > *prev) > } > } > > +void flush_tmregs_to_thread(struct task_struct *tsk) > +{ > + /* > + * If task is not current, it should have been flushed > + * already to it's thread_struct during __switch_to(). > + */ > + if (tsk != current) > + return; > + > + preempt_disable(); > + if (tsk->thread.regs) { > + /* > + * If we are still current, the TM state need to > + * be flushed to thread_struct as it will be still > + * present in the current cpu. > + */ > + if (MSR_TM_ACTIVE(tsk->thread.regs->msr)) { > + __switch_to_tm(tsk); > + tm_recheckpoint_new_task(tsk); > + } > + } > + preempt_enable(); > +} > + > /* > * This is called if we are on the way out to userspace and the > * TIF_RESTORE_TM flag is set. It checks if we need to reload > diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c > index 2e3d2bf..92faded 100644 > --- a/arch/powerpc/kernel/ptrace.c > +++ b/arch/powerpc/kernel/ptrace.c > @@ -357,6 +357,17 @@ static int gpr_set(struct task_struct *target, const > struct user_regset *regset, > return ret; > } > > +/* > + * When any transaction is active, "thread_struct->transact_fp" holds > + * the current running value of all FPR registers and "thread_struct-> > + * fp_state" holds the last checkpointed FPR registers state for the > + * current transaction. > + * > + * struct data { > + * u64 fpr[32]; > + * u64 fpscr; > + * }; > + */ > static int fpr_get(struct task_struct *target, const struct user_regset > *regset, > unsigned int pos, unsigned int count, > void *kbuf, void __user *ubuf) > @@ -365,21 +376,41 @@ static int fpr_get(struct task_struct *target, const > struct user_regset *regset, > u64 buf[33]; > int i; > #endif > - flush_fp_to_thread(target); > + if (MSR_TM_ACTIVE(target->thread.regs->msr)) { > + flush_fp_to_thread(target); > + flush_altivec_to_thread(target); > + flush_tmregs_to_thread(target); > + } else { > + flush_fp_to_thread(target); > + } > > #ifdef CONFIG_V
Re: [PATCH V2 2/3] powerpc, ptrace: Enable support for transactional memory register sets
On 05/05/14 08:54, Anshuman Khandual wrote: > This patch enables get and set of transactional memory related register > sets through PTRACE_GETREGSET/PTRACE_SETREGSET interface by implementing > four new powerpc specific register sets i.e REGSET_TM_SPR, REGSET_TM_CGPR, > REGSET_TM_CFPR, REGSET_CVMX support corresponding to these following new > ELF core note types added previously in this regard. > > (1) NT_PPC_TM_SPR > (2) NT_PPC_TM_CGPR > (3) NT_PPC_TM_CFPR > (4) NT_PPC_TM_CVMX Sorry that I couldn't tell this from the code, but, what does the kernel return when the ptracer requests these registers and the program is not in a transaction? Specifically I'm wondering whether this follows the same semantics as the s390 port. -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2 2/3] powerpc, ptrace: Enable support for transactional memory register sets
On 05/05/14 08:54, Anshuman Khandual wrote: This patch enables get and set of transactional memory related register sets through PTRACE_GETREGSET/PTRACE_SETREGSET interface by implementing four new powerpc specific register sets i.e REGSET_TM_SPR, REGSET_TM_CGPR, REGSET_TM_CFPR, REGSET_CVMX support corresponding to these following new ELF core note types added previously in this regard. (1) NT_PPC_TM_SPR (2) NT_PPC_TM_CGPR (3) NT_PPC_TM_CFPR (4) NT_PPC_TM_CVMX Sorry that I couldn't tell this from the code, but, what does the kernel return when the ptracer requests these registers and the program is not in a transaction? Specifically I'm wondering whether this follows the same semantics as the s390 port. -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH V2 2/3] powerpc, ptrace: Enable support for transactional memory register sets
I wonder whether people are getting Roland's address from? It's frequent that ptrace related patches end up CCed to rol...@redhat.com, but, he's not been at Red Hat for a few years now. Roland, do you still want to be CCed on ptrace-related issues? If so, there's probably a script somewhere in the kernel that needs updating. If not, well, it'd be good if it were updated anyway. :-) It's a little annoying, as Red Hat's servers outright reject email sent from a @redhat.com address if one tries to send an email that includes a CC/FROM to a user that no longer exists in the @redhat.com domain. -- Pedro Alves On 05/05/14 08:54, Anshuman Khandual wrote: This patch enables get and set of transactional memory related register sets through PTRACE_GETREGSET/PTRACE_SETREGSET interface by implementing four new powerpc specific register sets i.e REGSET_TM_SPR, REGSET_TM_CGPR, REGSET_TM_CFPR, REGSET_CVMX support corresponding to these following new ELF core note types added previously in this regard. (1) NT_PPC_TM_SPR (2) NT_PPC_TM_CGPR (3) NT_PPC_TM_CFPR (4) NT_PPC_TM_CVMX Signed-off-by: Anshuman Khandual khand...@linux.vnet.ibm.com --- arch/powerpc/include/asm/switch_to.h | 8 + arch/powerpc/kernel/process.c| 24 ++ arch/powerpc/kernel/ptrace.c | 683 +-- 3 files changed, 687 insertions(+), 28 deletions(-) diff --git a/arch/powerpc/include/asm/switch_to.h b/arch/powerpc/include/asm/switch_to.h index 0e83e7d..2737f46 100644 --- a/arch/powerpc/include/asm/switch_to.h +++ b/arch/powerpc/include/asm/switch_to.h @@ -80,6 +80,14 @@ static inline void flush_spe_to_thread(struct task_struct *t) } #endif +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM +extern void flush_tmregs_to_thread(struct task_struct *); +#else +static inline void flush_tmregs_to_thread(struct task_struct *t) +{ +} +#endif /* CONFIG_PPC_TRANSACTIONAL_MEM */ + static inline void clear_task_ebb(struct task_struct *t) { #ifdef CONFIG_PPC_BOOK3S_64 diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c index 31d0215..e247898 100644 --- a/arch/powerpc/kernel/process.c +++ b/arch/powerpc/kernel/process.c @@ -695,6 +695,30 @@ static inline void __switch_to_tm(struct task_struct *prev) } } +void flush_tmregs_to_thread(struct task_struct *tsk) +{ + /* + * If task is not current, it should have been flushed + * already to it's thread_struct during __switch_to(). + */ + if (tsk != current) + return; + + preempt_disable(); + if (tsk-thread.regs) { + /* + * If we are still current, the TM state need to + * be flushed to thread_struct as it will be still + * present in the current cpu. + */ + if (MSR_TM_ACTIVE(tsk-thread.regs-msr)) { + __switch_to_tm(tsk); + tm_recheckpoint_new_task(tsk); + } + } + preempt_enable(); +} + /* * This is called if we are on the way out to userspace and the * TIF_RESTORE_TM flag is set. It checks if we need to reload diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c index 2e3d2bf..92faded 100644 --- a/arch/powerpc/kernel/ptrace.c +++ b/arch/powerpc/kernel/ptrace.c @@ -357,6 +357,17 @@ static int gpr_set(struct task_struct *target, const struct user_regset *regset, return ret; } +/* + * When any transaction is active, thread_struct-transact_fp holds + * the current running value of all FPR registers and thread_struct- + * fp_state holds the last checkpointed FPR registers state for the + * current transaction. + * + * struct data { + * u64 fpr[32]; + * u64 fpscr; + * }; + */ static int fpr_get(struct task_struct *target, const struct user_regset *regset, unsigned int pos, unsigned int count, void *kbuf, void __user *ubuf) @@ -365,21 +376,41 @@ static int fpr_get(struct task_struct *target, const struct user_regset *regset, u64 buf[33]; int i; #endif - flush_fp_to_thread(target); + if (MSR_TM_ACTIVE(target-thread.regs-msr)) { + flush_fp_to_thread(target); + flush_altivec_to_thread(target); + flush_tmregs_to_thread(target); + } else { + flush_fp_to_thread(target); + } #ifdef CONFIG_VSX /* copy to local buffer then write that out */ - for (i = 0; i 32 ; i++) - buf[i] = target-thread.TS_FPR(i); - buf[32] = target-thread.fp_state.fpscr; + if (MSR_TM_ACTIVE(target-thread.regs-msr)) { + for (i = 0; i 32 ; i++) + buf[i] = target-thread.TS_TRANS_FPR(i); + buf[32] = target-thread.transact_fp.fpscr; + } else { + for (i = 0; i 32 ; i++) + buf[i] = target
Re: [PATCH] ptrace: Fix PTRACE_GETREGSET/PTRACE_SETREGSET in code documentation
On 05/05/14 05:10, Anshuman Khandual wrote: On 05/01/2014 07:43 PM, Pedro Alves wrote: On 04/28/2014 12:00 PM, Anshuman Khandual wrote: The current documentation is bit misleading and does not explicitly specify that iov.len need to be initialized failing which kernel may just ignore the ptrace request and never read from/write into the user specified buffer. This patch fixes the documentation. Well, it kind of does, here: * struct iovec iov = { buf, len}; :) Thats not explicit enough. @@ -43,8 +43,12 @@ * * ret = ptrace(PTRACE_GETREGSET/PTRACE_SETREGSET, pid, NT_XXX_TYPE, iov); * - * On the successful completion, iov.len will be updated by the kernel, - * specifying how much the kernel has written/read to/from the user's iov.buf. + * A non-zero value upto the max size of data expected to be written/read by the + * kernel in response to any NT_XXX_TYPE request type must be assigned to iov.len + * before initiating the ptrace call. If iov.len is 0, then kernel will neither + * read from or write into the user buffer specified. On successful completion, + * iov.len will be updated by the kernel, specifying how much the kernel has + * written/read to/from the user's iov.buf. I really appreciate that you're trying to make this clearer, but I find the new sentence very hard to read/reason. :-/ I suggest: * This interface usage is as follows: - * struct iovec iov = { buf, len}; + * struct iovec iov = { buf, len }; * * ret = ptrace(PTRACE_GETREGSET/PTRACE_SETREGSET, pid, NT_XXX_TYPE, iov); * - * On the successful completion, iov.len will be updated by the kernel, - * specifying how much the kernel has written/read to/from the user's iov.buf. + * On entry, iov describes the buffer's address and length. The buffer's + * length must be equal to or shorter than the size of the NT_XXX_TYPE regset. + * On successful completion, iov.len is updated by the kernel, specifying how + * much the kernel has written/read to/from the user's iov.buf. Yeah, sounds better. I may add If the length is zero, the kernel will neither read from or write into the buffer Well, I think that much should be obvious. What's not obvious is whether that is considered success or error (what is the return code?) I suspect and expect success return if the regset type is known, and error otherwise. So that could be used as a way to probe for support for a given regset without using stack or heap space, if it ever matters. The kernel never reads/writes beyond iov.len, so better say that, and then it automatically gets the 0 case handled too, right? I'm not sure I understood what you're saying correctly, though. Specifically, I don't know whether the buffer's length must really be shorter than the size of the NT_XXX_TYPE regset. No, it does not have to. From the code snippet below (ptrace_regset function) the buffer length has to be multiple of regset-size for the given NT_XXX_TYPE upto the max regset size for the user to see any valid data. Ah, I guess one could call it a bug. If the passed in len is bigger than the whole register set size, then there seems to be no point in validating whether the length is multiple of a single register's size. That unnecessarily prevents coming up with a register set in the future that has registers of different sizes... But given that that's how things are today, I suppose we should document it... The problem what I faced was when you use any iovec structure with the length parameter uninitialized, the kernel simply ignores and does not return anything. Ah. Well, saying does not return anything is quite confusing. It does return something -- -EINVAL. if (!regset || (kiov-iov_len % regset-size) != 0) return -EINVAL; The current documentation is bit misleading and does not explicitly specify that iov.len need to be initialized failing which kernel may just ignore the ptrace request and never read from/write into the user specified buffer. You're saying that if iov.len is larger than the NT_XXX_TYPE regset, then the kernel returns _success_, but actually doesn't fill the buffer? That sounds like a bug to me. No, I am not saying that. The kernel takes care of that situation by capping the length to regset size of the NT_XXX_TYPE. kiov-iov_len = min(kiov-iov_len, (__kernel_size_t) (regset-n * regset-size)); OK, then this is what I suggest instead: * This interface usage is as follows: - * struct iovec iov = { buf, len}; + * struct iovec iov = { buf, len }; * * ret = ptrace(PTRACE_GETREGSET/PTRACE_SETREGSET, pid, NT_XXX_TYPE, iov); * * On the successful completion, iov.len will be updated by the kernel, - * specifying how much the kernel has written/read to/from the user's iov.buf. + * On entry, iov describes the buffer's address and length. The buffer's + * length
Re: [PATCH] ptrace: Fix PTRACE_GETREGSET/PTRACE_SETREGSET in code documentation
On 04/28/2014 12:00 PM, Anshuman Khandual wrote: > The current documentation is bit misleading and does not explicitly > specify that iov.len need to be initialized failing which kernel > may just ignore the ptrace request and never read from/write into > the user specified buffer. This patch fixes the documentation. Well, it kind of does, here: * struct iovec iov = { buf, len}; > @@ -43,8 +43,12 @@ > * > * ret = ptrace(PTRACE_GETREGSET/PTRACE_SETREGSET, pid, NT_XXX_TYPE, ); > * > - * On the successful completion, iov.len will be updated by the kernel, > - * specifying how much the kernel has written/read to/from the user's > iov.buf. > + * A non-zero value upto the max size of data expected to be written/read by > the > + * kernel in response to any NT_XXX_TYPE request type must be assigned to > iov.len > + * before initiating the ptrace call. If iov.len is 0, then kernel will > neither > + * read from or write into the user buffer specified. On successful > completion, > + * iov.len will be updated by the kernel, specifying how much the kernel has > + * written/read to/from the user's iov.buf. I really appreciate that you're trying to make this clearer, but I find the new sentence very hard to read/reason. :-/ I suggest: * This interface usage is as follows: - * struct iovec iov = { buf, len}; + * struct iovec iov = { buf, len }; * * ret = ptrace(PTRACE_GETREGSET/PTRACE_SETREGSET, pid, NT_XXX_TYPE, ); * - * On the successful completion, iov.len will be updated by the kernel, - * specifying how much the kernel has written/read to/from the user's iov.buf. + * On entry, iov describes the buffer's address and length. The buffer's + * length must be equal to or shorter than the size of the NT_XXX_TYPE regset. + * On successful completion, iov.len is updated by the kernel, specifying how + * much the kernel has written/read to/from the user's iov.buf. I'm not sure I understood what you're saying correctly, though. Specifically, I don't know whether the buffer's length must really be shorter than the size of the NT_XXX_TYPE regset. > The current documentation is bit misleading and does not explicitly > specify that iov.len need to be initialized failing which kernel > may just ignore the ptrace request and never read from/write into > the user specified buffer. You're saying that if iov.len is larger than the NT_XXX_TYPE regset, then the kernel returns _success_, but actually doesn't fill the buffer? That sounds like a bug to me. -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] powerpc, ptrace: Add new ptrace request macros for transactional memory
On 04/28/2014 11:30 AM, Anshuman Khandual wrote: > On 04/26/2014 05:12 AM, Pedro Alves wrote: >> On 04/02/2014 08:02 AM, Anshuman Khandual wrote: >>> This patch adds following new sets of ptrace request macros for >>> transactional >>> memory expanding the existing ptrace ABI on PowerPC. >>> >>> /* TM special purpose registers */ >>> PTRACE_GETTM_SPRREGS >>> PTRACE_SETTM_SPRREGS >>> >>> /* TM checkpointed GPR registers */ >>> PTRACE_GETTM_CGPRREGS >>> PTRACE_SETTM_CGPRREGS >>> >>> /* TM checkpointed FPR registers */ >>> PTRACE_GETTM_CFPRREGS >>> PTRACE_SETTM_CFPRREGS >>> >>> /* TM checkpointed VMX registers */ >>> PTRACE_GETTM_CVMXREGS >>> PTRACE_SETTM_CVMXREGS >> >> Urgh, we're _still_ adding specialized register specific calls? >> Why aren't these exported as new register sets, accessible through >> PTRACE_GETREGSET / PTRACE_SETREGSET? That's supposed to be the >> Modern Way to do things. > > All these new register sets can be accessed through PTRACE_GETREGSET > /SETREGSET requests with the new NT_PPC_* core note types added in the > previous patch. PowerPC already has some register specific ptrace > requests, so thought of adding some new requests for transactional > memory purpose. But yes these are redundant and can be dropped. Thank you. I assume you guys will be working on gdb support for this, and I'm really hoping that transaction support in the kernel is being made uniform across archs. E.g., on s390, PTRACE_GETREGSET returns ENODATA if the ptracee was interrupted outside transactions. https://sourceware.org/ml/gdb-patches/2013-06/msg00273.html -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] powerpc, ptrace: Add new ptrace request macros for transactional memory
On 04/28/2014 11:30 AM, Anshuman Khandual wrote: On 04/26/2014 05:12 AM, Pedro Alves wrote: On 04/02/2014 08:02 AM, Anshuman Khandual wrote: This patch adds following new sets of ptrace request macros for transactional memory expanding the existing ptrace ABI on PowerPC. /* TM special purpose registers */ PTRACE_GETTM_SPRREGS PTRACE_SETTM_SPRREGS /* TM checkpointed GPR registers */ PTRACE_GETTM_CGPRREGS PTRACE_SETTM_CGPRREGS /* TM checkpointed FPR registers */ PTRACE_GETTM_CFPRREGS PTRACE_SETTM_CFPRREGS /* TM checkpointed VMX registers */ PTRACE_GETTM_CVMXREGS PTRACE_SETTM_CVMXREGS Urgh, we're _still_ adding specialized register specific calls? Why aren't these exported as new register sets, accessible through PTRACE_GETREGSET / PTRACE_SETREGSET? That's supposed to be the Modern Way to do things. All these new register sets can be accessed through PTRACE_GETREGSET /SETREGSET requests with the new NT_PPC_* core note types added in the previous patch. PowerPC already has some register specific ptrace requests, so thought of adding some new requests for transactional memory purpose. But yes these are redundant and can be dropped. Thank you. I assume you guys will be working on gdb support for this, and I'm really hoping that transaction support in the kernel is being made uniform across archs. E.g., on s390, PTRACE_GETREGSET returns ENODATA if the ptracee was interrupted outside transactions. https://sourceware.org/ml/gdb-patches/2013-06/msg00273.html -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] ptrace: Fix PTRACE_GETREGSET/PTRACE_SETREGSET in code documentation
On 04/28/2014 12:00 PM, Anshuman Khandual wrote: The current documentation is bit misleading and does not explicitly specify that iov.len need to be initialized failing which kernel may just ignore the ptrace request and never read from/write into the user specified buffer. This patch fixes the documentation. Well, it kind of does, here: * struct iovec iov = { buf, len}; @@ -43,8 +43,12 @@ * * ret = ptrace(PTRACE_GETREGSET/PTRACE_SETREGSET, pid, NT_XXX_TYPE, iov); * - * On the successful completion, iov.len will be updated by the kernel, - * specifying how much the kernel has written/read to/from the user's iov.buf. + * A non-zero value upto the max size of data expected to be written/read by the + * kernel in response to any NT_XXX_TYPE request type must be assigned to iov.len + * before initiating the ptrace call. If iov.len is 0, then kernel will neither + * read from or write into the user buffer specified. On successful completion, + * iov.len will be updated by the kernel, specifying how much the kernel has + * written/read to/from the user's iov.buf. I really appreciate that you're trying to make this clearer, but I find the new sentence very hard to read/reason. :-/ I suggest: * This interface usage is as follows: - * struct iovec iov = { buf, len}; + * struct iovec iov = { buf, len }; * * ret = ptrace(PTRACE_GETREGSET/PTRACE_SETREGSET, pid, NT_XXX_TYPE, iov); * - * On the successful completion, iov.len will be updated by the kernel, - * specifying how much the kernel has written/read to/from the user's iov.buf. + * On entry, iov describes the buffer's address and length. The buffer's + * length must be equal to or shorter than the size of the NT_XXX_TYPE regset. + * On successful completion, iov.len is updated by the kernel, specifying how + * much the kernel has written/read to/from the user's iov.buf. I'm not sure I understood what you're saying correctly, though. Specifically, I don't know whether the buffer's length must really be shorter than the size of the NT_XXX_TYPE regset. The current documentation is bit misleading and does not explicitly specify that iov.len need to be initialized failing which kernel may just ignore the ptrace request and never read from/write into the user specified buffer. You're saying that if iov.len is larger than the NT_XXX_TYPE regset, then the kernel returns _success_, but actually doesn't fill the buffer? That sounds like a bug to me. -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Fix get ERESTARTSYS with m32 in x86_64 when debug by GDB
On 04/30/2014 02:35 PM, Mark Kettenis wrote: >> Date: Tue, 29 Apr 2014 22:10:15 -0700 >> From: "H. Peter Anvin" >> >> On 04/29/2014 10:08 PM, Andrew Pinski wrote: >>> >>> restoring the values is hard since even the ptrace interface does not >>> allow for that. >>> >> >> So that begs the ultimate question, which is: given the fact that there >> is *state missing* from the state vector (this is the core of the >> problem), is there a way we can add that state so that gdb will be able >> to save and restore it? > > Carrying around additional state in GDB is complicated; I'd rather > avoid it. Agreed very much. > arch/x86/kernel/ptrace.c:putreg32() has this bit of code: > > case offsetof(struct user32, regs.orig_eax): > /* > * A 32-bit debugger setting orig_eax means to restore > * the state of the task restarting a 32-bit syscall. > * Make sure we interpret the -ERESTART* codes correctly > * in case the task is not actually still sitting at the > * exit from a 32-bit syscall with TS_COMPAT still set. > */ > regs->orig_ax = value; > if (syscall_get_nr(child, regs) >= 0) > task_thread_info(child)->status |= TS_COMPAT; > break; > > which gets used for 32-bit compat ptrace(2). Perhaps the same logic > should be added to putreg() if the child is a 32-bit process? Sounds like the right fix to me. -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] Fix get ERESTARTSYS with m32 in x86_64 when debug by GDB
On 04/30/2014 02:35 PM, Mark Kettenis wrote: Date: Tue, 29 Apr 2014 22:10:15 -0700 From: H. Peter Anvin h...@zytor.com On 04/29/2014 10:08 PM, Andrew Pinski wrote: restoring the values is hard since even the ptrace interface does not allow for that. So that begs the ultimate question, which is: given the fact that there is *state missing* from the state vector (this is the core of the problem), is there a way we can add that state so that gdb will be able to save and restore it? Carrying around additional state in GDB is complicated; I'd rather avoid it. Agreed very much. arch/x86/kernel/ptrace.c:putreg32() has this bit of code: case offsetof(struct user32, regs.orig_eax): /* * A 32-bit debugger setting orig_eax means to restore * the state of the task restarting a 32-bit syscall. * Make sure we interpret the -ERESTART* codes correctly * in case the task is not actually still sitting at the * exit from a 32-bit syscall with TS_COMPAT still set. */ regs-orig_ax = value; if (syscall_get_nr(child, regs) = 0) task_thread_info(child)-status |= TS_COMPAT; break; which gets used for 32-bit compat ptrace(2). Perhaps the same logic should be added to putreg() if the child is a 32-bit process? Sounds like the right fix to me. -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 25/28] nios2: ptrace support
Does this support PTRACE_GETREGSET / PTRACE_SETREGSET ? IMO, the kernel shouldn't accept ports without those anymore. And IMHO, we shouldn't even allow new ports having PTRACE_GETREGS. http://sourceware.org/ml/archer/2010-q3/msg00193.html -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] powerpc, ptrace: Add new ptrace request macros for transactional memory
On 04/02/2014 08:02 AM, Anshuman Khandual wrote: > This patch adds following new sets of ptrace request macros for transactional > memory expanding the existing ptrace ABI on PowerPC. > > /* TM special purpose registers */ > PTRACE_GETTM_SPRREGS > PTRACE_SETTM_SPRREGS > > /* TM checkpointed GPR registers */ > PTRACE_GETTM_CGPRREGS > PTRACE_SETTM_CGPRREGS > > /* TM checkpointed FPR registers */ > PTRACE_GETTM_CFPRREGS > PTRACE_SETTM_CFPRREGS > > /* TM checkpointed VMX registers */ > PTRACE_GETTM_CVMXREGS > PTRACE_SETTM_CVMXREGS Urgh, we're _still_ adding specialized register specific calls? Why aren't these exported as new register sets, accessible through PTRACE_GETREGSET / PTRACE_SETREGSET? That's supposed to be the Modern Way to do things. -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/3] powerpc, ptrace: Add new ptrace request macros for transactional memory
On 04/02/2014 08:02 AM, Anshuman Khandual wrote: This patch adds following new sets of ptrace request macros for transactional memory expanding the existing ptrace ABI on PowerPC. /* TM special purpose registers */ PTRACE_GETTM_SPRREGS PTRACE_SETTM_SPRREGS /* TM checkpointed GPR registers */ PTRACE_GETTM_CGPRREGS PTRACE_SETTM_CGPRREGS /* TM checkpointed FPR registers */ PTRACE_GETTM_CFPRREGS PTRACE_SETTM_CFPRREGS /* TM checkpointed VMX registers */ PTRACE_GETTM_CVMXREGS PTRACE_SETTM_CVMXREGS Urgh, we're _still_ adding specialized register specific calls? Why aren't these exported as new register sets, accessible through PTRACE_GETREGSET / PTRACE_SETREGSET? That's supposed to be the Modern Way to do things. -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 25/28] nios2: ptrace support
Does this support PTRACE_GETREGSET / PTRACE_SETREGSET ? IMO, the kernel shouldn't accept ports without those anymore. And IMHO, we shouldn't even allow new ports having PTRACE_GETREGS. http://sourceware.org/ml/archer/2010-q3/msg00193.html -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC/PATCH] Implement new PTRACE_EVENT_SYSCALL_{ENTER,EXIT}
On 01/07/2014 03:30 PM, Oleg Nesterov wrote: > > If we add the new API, perhaps we should change ptrace_resume ? > I mean, > > --- x/kernel/ptrace.c > +++ x/kernel/ptrace.c > @@ -723,7 +723,9 @@ static int ptrace_resume(struct task_str > if (!valid_signal(data)) > return -EIO; > > - if (request == PTRACE_SYSCALL) > + if (request == PTRACE_SYSCALL || > + ptrace_event_enabled(PTRACE_EVENT_SYSCALL_ENTER) || > + ptrace_event_enabled(PTRACE_EVENT_SYSCALL_EXIT)) > set_tsk_thread_flag(child, TIF_SYSCALL_TRACE); > else > clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE); > > > This way PTRACE_O_SYSCALL_* will work like other ptrace options which > ask to report an event. +10^6. With PTRACE_SYSCALL/sysgood, we don't have a way to trace syscalls when single-stepping, which isn't much of a problem for strace, but of course is for GDB. That is one of the things the new API should definitely sort out. -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC/PATCH] Implement new PTRACE_EVENT_SYSCALL_{ENTER,EXIT}
On 01/07/2014 03:30 PM, Oleg Nesterov wrote: If we add the new API, perhaps we should change ptrace_resume ? I mean, --- x/kernel/ptrace.c +++ x/kernel/ptrace.c @@ -723,7 +723,9 @@ static int ptrace_resume(struct task_str if (!valid_signal(data)) return -EIO; - if (request == PTRACE_SYSCALL) + if (request == PTRACE_SYSCALL || + ptrace_event_enabled(PTRACE_EVENT_SYSCALL_ENTER) || + ptrace_event_enabled(PTRACE_EVENT_SYSCALL_EXIT)) set_tsk_thread_flag(child, TIF_SYSCALL_TRACE); else clear_tsk_thread_flag(child, TIF_SYSCALL_TRACE); This way PTRACE_O_SYSCALL_* will work like other ptrace options which ask to report an event. +10^6. With PTRACE_SYSCALL/sysgood, we don't have a way to trace syscalls when single-stepping, which isn't much of a problem for strace, but of course is for GDB. That is one of the things the new API should definitely sort out. -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] ptrace: make PTRACE_DETACH work on non-stopped tracees.
On 06/19/2013 05:09 PM, Jan Kratochvil wrote: > On Wed, 19 Jun 2013 17:15:36 +0200, Denys Vlasenko wrote: >> CCing Jan to hear his comments from gdb side. PTRACE_DETACH takes a signal number in the data parameter. What happens to if the tracer passes a non-zero signal? -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] ptrace: make PTRACE_DETACH work on non-stopped tracees.
On 06/19/2013 05:09 PM, Jan Kratochvil wrote: On Wed, 19 Jun 2013 17:15:36 +0200, Denys Vlasenko wrote: CCing Jan to hear his comments from gdb side. PTRACE_DETACH takes a signal number in the data parameter. What happens to if the tracer passes a non-zero signal? -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] ptrace: add ability to retrieve signals without removing them from a queue
Forgot to reply to this bit: On 02/15/2013 07:43 PM, Oleg Nesterov wrote: >> We'd miss the poke >> > variant, but that looks like something that could be always be added >> > later. > Yes. _POKE_ or _QUEUE_ or _DEQUEUE_, we can add more features if user- > space wants them. In general, IMO, I agree with Roland at https://lkml.org/lkml/2002/12/20/84 in that it's good to have setters for completeness, so that you can change all the state via ptrace that you can read via ptrace. But I'm not doing any of this work, hence my "could always be added later" comment instead of actually requesting it. But if we had it, we could make e.g., gdb inspect the signal queues, and then be able to tweak a realtime signal before it is delivered. -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] ptrace: add ability to retrieve signals without removing them from a queue
On 02/15/2013 07:43 PM, Oleg Nesterov wrote: > On 02/15, Pedro Alves wrote: >> >> Not sure I'm reading the patch right, but it looks like GDB would >> be able to use this as alternative to PTRACE_GET_SIGINFO variant > > No, it is different. PTRACE_GETSIGINFO reports the siginfo for the signal > which was already dequeued (ignoring the fact ->last_siginfo != NULL doesn't > necessarily mean we report a signal), while this patch allows to look at > the pending signals which were not reported yet. Ah, I had assumed queue position 0 would be the dequeued signal. >> I wouldn't mind if this was added unconditionally >> instead of wrapped on CONFIG_CHECKPOINT_RESTORE. > > I agree. If you think gdb can use this new feature, CONFIG_ can go away. I think so -- we can add a gdb command to dump the list of pending signals, which I think is something that'd be useful. > >> We'd miss the poke >> variant, but that looks like something that could be always be added >> later. > > Yes. _POKE_ or _QUEUE_ or _DEQUEUE_, we can add more features if user- > space wants them. -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] ptrace: add ability to retrieve signals without removing them from a queue
On 02/13/2013 03:16 PM, Andrey Vagin wrote: > This patch adds a new ptrace request PTRACE_PEEKSIGINFO. > > This request is used to retrieve information about a signal with the > specified sequence number. A siginfo_t structure is copied from the child > to location data in the parent. > > The low 16 bits of addr contains a sequence number of signal in a queue. > All other bits of addr is used for flags. Currently here is only one > flag PTRACE_PEEK_SHARED for dumping signals from process-wide shared > queue. If this flag is not set, a signal is read from a per-thread > queue. A result siginfo contains a kernel part of si_code which usually > striped, but it's required for queuing the same siginfo back during > restore of pending signals. > > If a signal with the specified sequence number doesn't exist, ptrace > returns ENOENT. > > This functionality is required for checkpointing pending signals. > > The prototype of this code was developed by Oleg Nesterov. Not sure I'm reading the patch right, but it looks like GDB would be able to use this as alternative to PTRACE_GET_SIGINFO variant that returns the siginfo_t object in the architecture/bitness of the tracee, rather than the architecture of the kernel, right? So it no longer would need to try and replicate the kernel's siginfo conversions. I wouldn't mind if this was added unconditionally instead of wrapped on CONFIG_CHECKPOINT_RESTORE. We'd miss the poke variant, but that looks like something that could be always be added later. -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] ptrace: add ability to retrieve signals without removing them from a queue
On 02/13/2013 03:16 PM, Andrey Vagin wrote: This patch adds a new ptrace request PTRACE_PEEKSIGINFO. This request is used to retrieve information about a signal with the specified sequence number. A siginfo_t structure is copied from the child to location data in the parent. The low 16 bits of addr contains a sequence number of signal in a queue. All other bits of addr is used for flags. Currently here is only one flag PTRACE_PEEK_SHARED for dumping signals from process-wide shared queue. If this flag is not set, a signal is read from a per-thread queue. A result siginfo contains a kernel part of si_code which usually striped, but it's required for queuing the same siginfo back during restore of pending signals. If a signal with the specified sequence number doesn't exist, ptrace returns ENOENT. This functionality is required for checkpointing pending signals. The prototype of this code was developed by Oleg Nesterov. Not sure I'm reading the patch right, but it looks like GDB would be able to use this as alternative to PTRACE_GET_SIGINFO variant that returns the siginfo_t object in the architecture/bitness of the tracee, rather than the architecture of the kernel, right? So it no longer would need to try and replicate the kernel's siginfo conversions. I wouldn't mind if this was added unconditionally instead of wrapped on CONFIG_CHECKPOINT_RESTORE. We'd miss the poke variant, but that looks like something that could be always be added later. -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] ptrace: add ability to retrieve signals without removing them from a queue
On 02/15/2013 07:43 PM, Oleg Nesterov wrote: On 02/15, Pedro Alves wrote: Not sure I'm reading the patch right, but it looks like GDB would be able to use this as alternative to PTRACE_GET_SIGINFO variant No, it is different. PTRACE_GETSIGINFO reports the siginfo for the signal which was already dequeued (ignoring the fact -last_siginfo != NULL doesn't necessarily mean we report a signal), while this patch allows to look at the pending signals which were not reported yet. Ah, I had assumed queue position 0 would be the dequeued signal. I wouldn't mind if this was added unconditionally instead of wrapped on CONFIG_CHECKPOINT_RESTORE. I agree. If you think gdb can use this new feature, CONFIG_ can go away. I think so -- we can add a gdb command to dump the list of pending signals, which I think is something that'd be useful. We'd miss the poke variant, but that looks like something that could be always be added later. Yes. _POKE_ or _QUEUE_ or _DEQUEUE_, we can add more features if user- space wants them. -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] ptrace: add ability to retrieve signals without removing them from a queue
Forgot to reply to this bit: On 02/15/2013 07:43 PM, Oleg Nesterov wrote: We'd miss the poke variant, but that looks like something that could be always be added later. Yes. _POKE_ or _QUEUE_ or _DEQUEUE_, we can add more features if user- space wants them. In general, IMO, I agree with Roland at https://lkml.org/lkml/2002/12/20/84 in that it's good to have setters for completeness, so that you can change all the state via ptrace that you can read via ptrace. But I'm not doing any of this work, hence my could always be added later comment instead of actually requesting it. But if we had it, we could make e.g., gdb inspect the signal queues, and then be able to tweak a realtime signal before it is delivered. -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Redefinition of struct in6_addr in and
On 01/18/2013 02:24 PM, YOSHIFUJI Hideaki wrote: >>>> It's simple enough to move all of the __GLIBC__ uses into libc-compat.h, >>>> then you control userspace libc coordination from one file. >>> >>> How about just deciding on a single macro/symbol both the >>> kernel and libc (any libc that needs this) define? Something >>> like both the kernel and userland doing: >>> >>> #ifndef __IPV6_BITS_DEFINED >>> #define __IPV6_BITS_DEFINED >>> ... >>> define in6_addr, sockaddr_in6, ipv6_mreq, whatnot >>> #endif > > Hmm, how should we handle future structs/enums then? > For example, if I want to have in6_flowlabel_req{} defined in > glibc, what should we do? Include the glibc header first? Or is this some other use case? The point wasn't that you'd have only one macro for all structs/enums (you could split into __IPV6_IN6_ADDR_DEFINED, __IPV6_SOCKADDR_IN6_DEFINED, etc.) but to have the kernel and libc agree on guard macros, instead of having the kernel do #ifdef __GLIBC__ and glibc doing #ifdef _NETINET_IN_H. But as Carlos says, the devil is in the details, and I sure am not qualified on the details here. -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Redefinition of struct in6_addr in and
On 01/18/2013 04:22 AM, Carlos O'Donell wrote: > On Thu, Jan 17, 2013 at 11:20 PM, Mike Frysinger wrote: >> On Wednesday 16 January 2013 22:15:38 David Miller wrote: >>> From: Carlos O'Donell >>> Date: Wed, 16 Jan 2013 21:15:03 -0500 >>> >>>> +/* If a glibc-based userspace has already included in.h, then we will >>>> not + * define in6_addr (nor the defines), sockaddr_in6, or ipv6_mreq. >>>> The + * ABI used by the kernel and by glibc match exactly. Neither the >>>> kernel + * nor glibc should break this ABI without coordination. >>>> + */ >>>> +#ifndef _NETINET_IN_H >>>> + >>> >>> I think we should shoot for a non-glibc-centric solution. >>> >>> I can't imagine that other libc's won't have the same exact problem >>> with their netinet/in.h conflicting with the kernel's, redefining >>> structures like in6_addr, that we'd want to provide a protection >>> scheme for here as well. >> >> yes, the kernel's use of __GLIBC__ in exported headers has already caused >> problems in the past. fortunately, it's been reduced down to just one case >> now (stat.h). let's not balloon it back up. >> -mike > > I also see coda.h has grown a __GLIBC__ usage. > > In the next revision of the patch I created a single libc-compat.h header > which encompasses the logic for any libc that wants to coordinate with > the kernel headers. > It's simple enough to move all of the __GLIBC__ uses into libc-compat.h, > then you control userspace libc coordination from one file. How about just deciding on a single macro/symbol both the kernel and libc (any libc that needs this) define? Something like both the kernel and userland doing: #ifndef __IPV6_BITS_DEFINED #define __IPV6_BITS_DEFINED ... define in6_addr, sockaddr_in6, ipv6_mreq, whatnot #endif So whichever the application includes first, wins. Too naive? I didn't see this option being discarded, so not sure it was considered. -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Redefinition of struct in6_addr in netinet/in.h and linux/in6.h
On 01/18/2013 04:22 AM, Carlos O'Donell wrote: On Thu, Jan 17, 2013 at 11:20 PM, Mike Frysinger vap...@gentoo.org wrote: On Wednesday 16 January 2013 22:15:38 David Miller wrote: From: Carlos O'Donell car...@systemhalted.org Date: Wed, 16 Jan 2013 21:15:03 -0500 +/* If a glibc-based userspace has already included in.h, then we will not + * define in6_addr (nor the defines), sockaddr_in6, or ipv6_mreq. The + * ABI used by the kernel and by glibc match exactly. Neither the kernel + * nor glibc should break this ABI without coordination. + */ +#ifndef _NETINET_IN_H + I think we should shoot for a non-glibc-centric solution. I can't imagine that other libc's won't have the same exact problem with their netinet/in.h conflicting with the kernel's, redefining structures like in6_addr, that we'd want to provide a protection scheme for here as well. yes, the kernel's use of __GLIBC__ in exported headers has already caused problems in the past. fortunately, it's been reduced down to just one case now (stat.h). let's not balloon it back up. -mike I also see coda.h has grown a __GLIBC__ usage. In the next revision of the patch I created a single libc-compat.h header which encompasses the logic for any libc that wants to coordinate with the kernel headers. It's simple enough to move all of the __GLIBC__ uses into libc-compat.h, then you control userspace libc coordination from one file. How about just deciding on a single macro/symbol both the kernel and libc (any libc that needs this) define? Something like both the kernel and userland doing: #ifndef __IPV6_BITS_DEFINED #define __IPV6_BITS_DEFINED ... define in6_addr, sockaddr_in6, ipv6_mreq, whatnot #endif So whichever the application includes first, wins. Too naive? I didn't see this option being discarded, so not sure it was considered. -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: Redefinition of struct in6_addr in netinet/in.h and linux/in6.h
On 01/18/2013 02:24 PM, YOSHIFUJI Hideaki wrote: It's simple enough to move all of the __GLIBC__ uses into libc-compat.h, then you control userspace libc coordination from one file. How about just deciding on a single macro/symbol both the kernel and libc (any libc that needs this) define? Something like both the kernel and userland doing: #ifndef __IPV6_BITS_DEFINED #define __IPV6_BITS_DEFINED ... define in6_addr, sockaddr_in6, ipv6_mreq, whatnot #endif Hmm, how should we handle future structs/enums then? For example, if I want to have in6_flowlabel_req{} defined in glibc, what should we do? Include the glibc header first? Or is this some other use case? The point wasn't that you'd have only one macro for all structs/enums (you could split into __IPV6_IN6_ADDR_DEFINED, __IPV6_SOCKADDR_IN6_DEFINED, etc.) but to have the kernel and libc agree on guard macros, instead of having the kernel do #ifdef __GLIBC__ and glibc doing #ifdef _NETINET_IN_H. But as Carlos says, the devil is in the details, and I sure am not qualified on the details here. -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check)
On 12/04/2012 05:59 PM, Oleg Nesterov wrote: > But If we want to allow to trace vsyscall's, hw bp doesn't look very > nice imo. HBP_NUM = 4 and you need to setup 3 bp's to trace them all. Irrespective of the whole syscall tracing issue, allowing HW bkpts in the vsyscall just seems like a bug fix to me. > That is why I think PTRACE_SYSCALL should "simply work" somehow. And > so far I think that "just report syscall_exit with orig_ax = -1" is > the best (and simple) solution. If you report exit alone, you'll confuse current GDB into mistaking it for an enter, and all following enter/exits end up swapped/confused. GDB assumes trap/sysgood alternates between enter/exit, and always starts with enter. (Mildly related, GDB has an old comment in the code (linux-nat.c) that says: "However, most architectures can't handle a syscall being traced on the way out if it wasn't traced on the way in." I've no clue if that's still true nowadays.) > OK. We can do more. We can report both syscall_enter/exit and we can > change orig_ax/ax temporary to "fool" the tracer, so that everything > will look as a "normal" syscall. Like vsyscall_seccomp() does. > > But this needs much more changes. I'd just like to add, that if any new syscall related option is to be added, can we please just go all the way and add PTRACE_EVENT_SYSCALL_ENTER|PTRACE_EVENT_SYSCALL_EXIT instead? http://sourceware.org/gdb/wiki/LinuxKernelWishList -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PTRACE_SYSCALL vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check)
On 12/04/2012 05:59 PM, Oleg Nesterov wrote: But If we want to allow to trace vsyscall's, hw bp doesn't look very nice imo. HBP_NUM = 4 and you need to setup 3 bp's to trace them all. Irrespective of the whole syscall tracing issue, allowing HW bkpts in the vsyscall just seems like a bug fix to me. That is why I think PTRACE_SYSCALL should simply work somehow. And so far I think that just report syscall_exit with orig_ax = -1 is the best (and simple) solution. If you report exit alone, you'll confuse current GDB into mistaking it for an enter, and all following enter/exits end up swapped/confused. GDB assumes trap/sysgood alternates between enter/exit, and always starts with enter. (Mildly related, GDB has an old comment in the code (linux-nat.c) that says: However, most architectures can't handle a syscall being traced on the way out if it wasn't traced on the way in. I've no clue if that's still true nowadays.) OK. We can do more. We can report both syscall_enter/exit and we can change orig_ax/ax temporary to fool the tracer, so that everything will look as a normal syscall. Like vsyscall_seccomp() does. But this needs much more changes. I'd just like to add, that if any new syscall related option is to be added, can we please just go all the way and add PTRACE_EVENT_SYSCALL_ENTER|PTRACE_EVENT_SYSCALL_EXIT instead? http://sourceware.org/gdb/wiki/LinuxKernelWishList -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PT_EXITKILL (Was: pdeath_signal)
On 11/08/2012 12:44 PM, Oleg Nesterov wrote: > On 11/08, Pedro Alves wrote: >> If this isn't inherited by the ptrace child's children, a fork child can >> end up detached if the tracer dies before it had a chance of setting >> the PTRACE_O_EXITKILL on the new auto-attached child. > > It is copied like the other options. Oh, you're right. I got confused - GDB has code to always set options on the fork children after PTRACE_EVENT_(V)FORK. Dunno where that came from. >> Which sounds like another argument for PTRACE_O_INHERIT, as in: >> http://sourceware.org/ml/archer/2011-q1/msg00026.html > > The point of PTRACE_O_INHERIT would be to attach newly-created threads > and > children without causing an event stop and the attendant overhead. > > this is another thing, I guess. Yes, yes. -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PT_EXITKILL (Was: pdeath_signal)
On 11/07/2012 03:09 PM, Oleg Nesterov wrote: >> > > > > What I would IDEALLY like to have is a call, probably a ptrace >> > > > > option, >> > > > > where the parent can request: "If I am ever to terminate or be >> > > > > killed, >> > > > > then my ptraced son MUST die as well". >> > > >> > > Perhaps this makes sense... >> > > >> > > Chris, iirc you also suggested something like this? And the patch is >> > > trivial. (...) > OK. Please see the untested/uncompiled (but trivial) patch below > > - it adds PTRACE_O_EXITKILL. A better name? > > - A better numeric value? Note that the new option is not equal to > the last-ptrace-option << 1. Because currently all options have > the event, and the new one starts the eventless group. 1 << 16 > means we have the room for 8 more events. > > - it needs the convincing changelog for akpm If this isn't inherited by the ptrace child's children, a fork child can end up detached if the tracer dies before it had a chance of setting the PTRACE_O_EXITKILL on the new auto-attached child. Which sounds like another argument for PTRACE_O_INHERIT, as in: http://sourceware.org/ml/archer/2011-q1/msg00026.html (it sounds like you need to use PTRACE_SEIZE+options too to plug the race between PTRACE_ME/PTRACE_ATTACH and setting PTRACE_SETOPTIONS). (For completeness, Windows' age old equivalent, DebugSetProcessKillOnExit, it a tracer option, not tracee option, though that's not as flexible.) -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PT_EXITKILL (Was: pdeath_signal)
On 11/07/2012 03:09 PM, Oleg Nesterov wrote: What I would IDEALLY like to have is a call, probably a ptrace option, where the parent can request: If I am ever to terminate or be killed, then my ptraced son MUST die as well. Perhaps this makes sense... Chris, iirc you also suggested something like this? And the patch is trivial. (...) OK. Please see the untested/uncompiled (but trivial) patch below - it adds PTRACE_O_EXITKILL. A better name? - A better numeric value? Note that the new option is not equal to the last-ptrace-option 1. Because currently all options have the event, and the new one starts the eventless group. 1 16 means we have the room for 8 more events. - it needs the convincing changelog for akpm If this isn't inherited by the ptrace child's children, a fork child can end up detached if the tracer dies before it had a chance of setting the PTRACE_O_EXITKILL on the new auto-attached child. Which sounds like another argument for PTRACE_O_INHERIT, as in: http://sourceware.org/ml/archer/2011-q1/msg00026.html (it sounds like you need to use PTRACE_SEIZE+options too to plug the race between PTRACE_ME/PTRACE_ATTACH and setting PTRACE_SETOPTIONS). (For completeness, Windows' age old equivalent, DebugSetProcessKillOnExit, it a tracer option, not tracee option, though that's not as flexible.) -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: PT_EXITKILL (Was: pdeath_signal)
On 11/08/2012 12:44 PM, Oleg Nesterov wrote: On 11/08, Pedro Alves wrote: If this isn't inherited by the ptrace child's children, a fork child can end up detached if the tracer dies before it had a chance of setting the PTRACE_O_EXITKILL on the new auto-attached child. It is copied like the other options. Oh, you're right. I got confused - GDB has code to always set options on the fork children after PTRACE_EVENT_(V)FORK. Dunno where that came from. Which sounds like another argument for PTRACE_O_INHERIT, as in: http://sourceware.org/ml/archer/2011-q1/msg00026.html The point of PTRACE_O_INHERIT would be to attach newly-created threads and children without causing an event stop and the attendant overhead. this is another thing, I guess. Yes, yes. -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] coredump: add a new elf note with siginfo fields of the signal
On 09/13/2012 06:25 PM, Roland McGrath wrote: > I agree with Oleg. If there is an NT_SIGINFO note, it should contain > exactly the siginfo_t layout and data that we otherwise expose to userland > already. That is, it must match what PTRACE_GETSIGINFO reports, which (I > think) also matches exactly what appears on the stack for a signal > delivery. Agreed. Please. One note though: PTRACE_GETSIGINFO always returns the siginfo_t in the architecture/bitness of the kernel. IOW, PTRACE_GETSIGINFO on a 32-bit debuggee, running on a 64-bit kernel, returns the siginfo_t in 64-bit layout. IOW, copy_siginfo_to_user32 doesn't apply). Or maybe that's the architecture of the tracer, instead of the kernel. Haven't really checked. In any case, GDB has to basically duplicate the kernel's copy_siginfo_to_user/from_user for each supported arch (that does biarch) because of this, for "(gdb) print $_siginfo". I wish we had a PTRACE_GETSIGINFO variant that returned the siginfo_t in layout of the program, not the kernel's.. > Note also that compat_binfmt_elf must deliver the 32-bit version > of siginfo_t, i.e. as copy_siginfo_to_user32 does. -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 2/2] coredump: add a new elf note with siginfo fields of the signal
On 09/13/2012 06:25 PM, Roland McGrath wrote: I agree with Oleg. If there is an NT_SIGINFO note, it should contain exactly the siginfo_t layout and data that we otherwise expose to userland already. That is, it must match what PTRACE_GETSIGINFO reports, which (I think) also matches exactly what appears on the stack for a signal delivery. Agreed. Please. One note though: PTRACE_GETSIGINFO always returns the siginfo_t in the architecture/bitness of the kernel. IOW, PTRACE_GETSIGINFO on a 32-bit debuggee, running on a 64-bit kernel, returns the siginfo_t in 64-bit layout. IOW, copy_siginfo_to_user32 doesn't apply). Or maybe that's the architecture of the tracer, instead of the kernel. Haven't really checked. In any case, GDB has to basically duplicate the kernel's copy_siginfo_to_user/from_user for each supported arch (that does biarch) because of this, for (gdb) print $_siginfo. I wish we had a PTRACE_GETSIGINFO variant that returned the siginfo_t in layout of the program, not the kernel's.. Note also that compat_binfmt_elf must deliver the 32-bit version of siginfo_t, i.e. as copy_siginfo_to_user32 does. -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 01/17] hashtable: introduce a small and naive hashtable
On 09/06/2012 02:53 PM, Sasha Levin wrote: > So I think that for the hash iterator it might actually be simpler. > > My solution to making 'break' work in the iterator is: > > for (bkt = 0, node = NULL; bkt < HASH_SIZE(name) && node == NULL; bkt++) > hlist_for_each_entry(obj, node, [bkt], member) > > We initialize our node loop cursor with NULL in the external loop, and the > external loop will have a new condition to loop while that cursor is NULL. > > My logic is that we can only 'break' when we are iterating over an object in > the > internal loop. If we're iterating over an object in that loop then 'node != > NULL'. > > This way, if we broke from within the internal loop, the external loop will > see > node as not NULL, and so it will stop looping itself. On the other hand, if > the > internal loop has actually ended, then node will be NULL, and the outer loop > will keep running. > > Is there anything I've missed? Looks right to me, from a cursory look at hlist_for_each_entry. That's exactly what I meant with this most often being trivial when the inner loop's iterator is a pointer that goes NULL at the end. -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 01/17] hashtable: introduce a small and naive hashtable
On 09/06/2012 02:53 PM, Sasha Levin wrote: So I think that for the hash iterator it might actually be simpler. My solution to making 'break' work in the iterator is: for (bkt = 0, node = NULL; bkt HASH_SIZE(name) node == NULL; bkt++) hlist_for_each_entry(obj, node, name[bkt], member) We initialize our node loop cursor with NULL in the external loop, and the external loop will have a new condition to loop while that cursor is NULL. My logic is that we can only 'break' when we are iterating over an object in the internal loop. If we're iterating over an object in that loop then 'node != NULL'. This way, if we broke from within the internal loop, the external loop will see node as not NULL, and so it will stop looping itself. On the other hand, if the internal loop has actually ended, then node will be NULL, and the outer loop will keep running. Is there anything I've missed? Looks right to me, from a cursory look at hlist_for_each_entry. That's exactly what I meant with this most often being trivial when the inner loop's iterator is a pointer that goes NULL at the end. -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 01/17] hashtable: introduce a small and naive hashtable
On 09/04/2012 11:41 PM, Steven Rostedt wrote: > Ah, I missed the condition with the rec == >records[pg->index]. But > if ftrace_pages_start is NULL, the rec = >records[pg->index] will > fault. Right. > > You could do something like rec = pg ? >records[pg->index] : NULL, Right. > but IIRC, the comma operator does not guarantee order evaluation. That > is, the compiler is allowed to process "a , b" as "b; a;" and not "a; > b;". Not true. The comma operator introduces a sequence point. It's the comma that separates function parameters that doesn't guarantee ordering. -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 01/17] hashtable: introduce a small and naive hashtable
On 09/04/2012 09:59 PM, Steven Rostedt wrote: > On Tue, 2012-09-04 at 18:21 +0100, Pedro Alves wrote: >> On 09/04/2012 06:17 PM, Steven Rostedt wrote: >>> On Tue, 2012-09-04 at 17:40 +0100, Pedro Alves wrote: >>> >>>> BTW, you can also go a step further and remove the need to close with >>>> double }}, >>>> with something like: >>>> >>>> #define do_for_each_ftrace_rec(pg, rec) >>>> \ >>>> for (pg = ftrace_pages_start, rec = >records[pg->index]; >>>> \ >>>> pg && rec == >records[pg->index]; >>>> \ >>>> pg = pg->next) >>>> \ >>>> for (rec = pg->records; rec < >records[pg->index]; rec++) >>>> >>> >>> Yeah, but why bother? It's hidden in a macro, and the extra '{ }' shows >>> that this is something "special". >> >> The point of both changes is that there's nothing special in the end >> at all. It all just works... >> > > It would still fail on a 'break'. The 'while' macro tells us that it is > special, because in the end, it wont work. Please explain why it would fail on a 'break'. -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 01/17] hashtable: introduce a small and naive hashtable
On 09/04/2012 06:17 PM, Steven Rostedt wrote: > On Tue, 2012-09-04 at 17:40 +0100, Pedro Alves wrote: > >> BTW, you can also go a step further and remove the need to close with double >> }}, >> with something like: >> >> #define do_for_each_ftrace_rec(pg, rec) >> \ >> for (pg = ftrace_pages_start, rec = >records[pg->index]; >> \ >> pg && rec == >records[pg->index]; >> \ >> pg = pg->next) >> \ >> for (rec = pg->records; rec < >records[pg->index]; rec++) >> > > Yeah, but why bother? It's hidden in a macro, and the extra '{ }' shows > that this is something "special". The point of both changes is that there's nothing special in the end at all. It all just works... -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 01/17] hashtable: introduce a small and naive hashtable
On 09/04/2012 05:30 PM, Pedro Alves wrote: > On 09/04/2012 04:35 PM, Steven Rostedt wrote: >> On Tue, 2012-08-28 at 19:00 -0400, Mathieu Desnoyers wrote: >> >>> Looking again at: >>> >>> +#define hash_for_each_size(name, bits, bkt, node, obj, member) >>> \ >>> + for (bkt = 0; bkt < HASH_SIZE(bits); bkt++) >>> \ >>> + hlist_for_each_entry(obj, node, [bkt], member) >>> >>> you will notice that a "break" or "continue" in the inner loop will not >>> affect the outer loop, which is certainly not what the programmer would >>> expect! >>> >>> I advise strongly against creating such error-prone construct. >>> >> >> A few existing loop macros do this. But they require a do { } while () >> approach, and all have a comment. >> >> It's used by do_each_thread() in sched.h and ftrace does this as well. >> Look at kernel/trace/ftrace.c at do_for_each_ftrace_rec(). >> >> Yes it breaks 'break' but it does not break 'continue' as it would just >> go to the next item that would have been found (like a normal for >> would). > > /* > * This is a double for. Do not use 'break' to break out of the loop, > * you must use a goto. > */ > #define do_for_each_ftrace_rec(pg, rec) \ > for (pg = ftrace_pages_start; pg; pg = pg->next) { \ > int _i; \ > for (_i = 0; _i < pg->index; _i++) {\ > rec = >records[_i]; > > > > You can make 'break' also work as expected if you can embed a little knowledge > of the inner loop's condition in the outer loop's condition. Sometimes it's > trivial, most often when the inner loop's iterator is a pointer that goes > NULL at the end, but other times not so much. Something like (completely > untested): > > #define do_for_each_ftrace_rec(pg, rec) \ > for (pg = ftrace_pages_start, rec = >records[pg->index];\ > pg && rec == >records[pg->index]; \ > pg = pg->next) { \ > int _i; \ > for (_i = 0; _i < pg->index; _i++) {\ > rec = >records[_i]; > > > (other variants possible) > > IOW, the outer loop only iterates if the inner loop completes. If there's > a break in the inner loop, then the outer loop breaks too. Of course, it > all depends on whether the generated code looks sane or hideous, if > the uses of the macro care for it over bug avoidance. > BTW, you can also go a step further and remove the need to close with double }}, with something like: #define do_for_each_ftrace_rec(pg, rec) \ for (pg = ftrace_pages_start, rec = >records[pg->index]; \ pg && rec == >records[pg->index]; \ pg = pg->next) \ for (rec = pg->records; rec < >records[pg->index]; rec++) -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 01/17] hashtable: introduce a small and naive hashtable
On 09/04/2012 04:35 PM, Steven Rostedt wrote: > On Tue, 2012-08-28 at 19:00 -0400, Mathieu Desnoyers wrote: > >> Looking again at: >> >> +#define hash_for_each_size(name, bits, bkt, node, obj, member) >>\ >> + for (bkt = 0; bkt < HASH_SIZE(bits); bkt++) >>\ >> + hlist_for_each_entry(obj, node, [bkt], member) >> >> you will notice that a "break" or "continue" in the inner loop will not >> affect the outer loop, which is certainly not what the programmer would >> expect! >> >> I advise strongly against creating such error-prone construct. >> > > A few existing loop macros do this. But they require a do { } while () > approach, and all have a comment. > > It's used by do_each_thread() in sched.h and ftrace does this as well. > Look at kernel/trace/ftrace.c at do_for_each_ftrace_rec(). > > Yes it breaks 'break' but it does not break 'continue' as it would just > go to the next item that would have been found (like a normal for > would). /* * This is a double for. Do not use 'break' to break out of the loop, * you must use a goto. */ #define do_for_each_ftrace_rec(pg, rec) \ for (pg = ftrace_pages_start; pg; pg = pg->next) { \ int _i; \ for (_i = 0; _i < pg->index; _i++) {\ rec = >records[_i]; You can make 'break' also work as expected if you can embed a little knowledge of the inner loop's condition in the outer loop's condition. Sometimes it's trivial, most often when the inner loop's iterator is a pointer that goes NULL at the end, but other times not so much. Something like (completely untested): #define do_for_each_ftrace_rec(pg, rec) \ for (pg = ftrace_pages_start, rec = >records[pg->index];\ pg && rec == >records[pg->index]; \ pg = pg->next) { \ int _i; \ for (_i = 0; _i < pg->index; _i++) {\ rec = >records[_i]; (other variants possible) IOW, the outer loop only iterates if the inner loop completes. If there's a break in the inner loop, then the outer loop breaks too. Of course, it all depends on whether the generated code looks sane or hideous, if the uses of the macro care for it over bug avoidance. -- Pedro Alves -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 01/17] hashtable: introduce a small and naive hashtable
On 09/04/2012 04:35 PM, Steven Rostedt wrote: On Tue, 2012-08-28 at 19:00 -0400, Mathieu Desnoyers wrote: Looking again at: +#define hash_for_each_size(name, bits, bkt, node, obj, member) \ + for (bkt = 0; bkt HASH_SIZE(bits); bkt++) \ + hlist_for_each_entry(obj, node, name[bkt], member) you will notice that a break or continue in the inner loop will not affect the outer loop, which is certainly not what the programmer would expect! I advise strongly against creating such error-prone construct. A few existing loop macros do this. But they require a do { } while () approach, and all have a comment. It's used by do_each_thread() in sched.h and ftrace does this as well. Look at kernel/trace/ftrace.c at do_for_each_ftrace_rec(). Yes it breaks 'break' but it does not break 'continue' as it would just go to the next item that would have been found (like a normal for would). /* * This is a double for. Do not use 'break' to break out of the loop, * you must use a goto. */ #define do_for_each_ftrace_rec(pg, rec) \ for (pg = ftrace_pages_start; pg; pg = pg-next) { \ int _i; \ for (_i = 0; _i pg-index; _i++) {\ rec = pg-records[_i]; You can make 'break' also work as expected if you can embed a little knowledge of the inner loop's condition in the outer loop's condition. Sometimes it's trivial, most often when the inner loop's iterator is a pointer that goes NULL at the end, but other times not so much. Something like (completely untested): #define do_for_each_ftrace_rec(pg, rec) \ for (pg = ftrace_pages_start, rec = pg-records[pg-index];\ pg rec == pg-records[pg-index]; \ pg = pg-next) { \ int _i; \ for (_i = 0; _i pg-index; _i++) {\ rec = pg-records[_i]; (other variants possible) IOW, the outer loop only iterates if the inner loop completes. If there's a break in the inner loop, then the outer loop breaks too. Of course, it all depends on whether the generated code looks sane or hideous, if the uses of the macro care for it over bug avoidance. -- Pedro Alves -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/