Re: PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check)

2013-01-13 Thread u3557
Hi,

> I would not say this is a bug but let me repeat, no need to convince me.
>
> Please feel free to re-send the patch(es) I sent to maintainers. Sorry,
> I can't push these changes into Linus's tree.

So here again is the patch that I need so badly - clearly it fixes a bug
and harms nobody:

---
diff -Naur before/arch/x86/kernel/hw_breakpoint.c
after/arch/x86/kernel/hw_breakpoint.c
--- before/arch/x86/kernel/hw_breakpoint.c  2013-01-14 12:45:20.0
+1030
+++ after/arch/x86/kernel/hw_breakpoint.c   2013-01-14 12:46:24.0 
+1030
@@ -200,7 +200,8 @@
va = info->address;
len = get_hbp_len(info->len);

-   return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE);
+   return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE) &&
+   !((va >= VSYSCALL_START) && ((va + len - 1) <= VSYSCALL_END));
 }

 int arch_bp_generic_fields(int x86_len, int x86_type,
---

Where else can I send it?
Amnon.

> On 01/10, u3...@miso.sublimeip.com wrote:
>>
>> Hi Everyone,
>>
>> > On 01/08, Pedro Alves wrote:
>> >>
>> >> On 12/04/2012 05:59 PM, Oleg Nesterov wrote:
>> >>
>> >> > But If we want to allow to trace vsyscall's, hw bp doesn't look
>> very
>> >> > nice imo. HBP_NUM = 4 and you need to setup 3 bp's to trace them
>> all.
>> >>
>> >> Irrespective of the whole syscall tracing issue, allowing HW bkpts in
>> >> the vsyscall just seems like a bug fix to me.
>> >
>> > And I never argued. I sent the patch iirc ;)
>>
>> Exactly, it is a bug and I am still waiting for it to be fixed in the
>> Linux kernel.
>
> I would not say this is a bug but let me repeat, no need to convince me.
>
> Please feel free to re-send the patch(es) I sent to maintainers. Sorry,
> I can't push these changes into Linus's tree.
>
> Oleg.
>
>


patch
Description: Binary data


Re: PTRACE_SYSCALL vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check)

2013-01-13 Thread u3557
Hi,

 I would not say this is a bug but let me repeat, no need to convince me.

 Please feel free to re-send the patch(es) I sent to maintainers. Sorry,
 I can't push these changes into Linus's tree.

So here again is the patch that I need so badly - clearly it fixes a bug
and harms nobody:

---
diff -Naur before/arch/x86/kernel/hw_breakpoint.c
after/arch/x86/kernel/hw_breakpoint.c
--- before/arch/x86/kernel/hw_breakpoint.c  2013-01-14 12:45:20.0
+1030
+++ after/arch/x86/kernel/hw_breakpoint.c   2013-01-14 12:46:24.0 
+1030
@@ -200,7 +200,8 @@
va = info-address;
len = get_hbp_len(info-len);

-   return (va = TASK_SIZE)  ((va + len - 1) = TASK_SIZE);
+   return (va = TASK_SIZE)  ((va + len - 1) = TASK_SIZE) 
+   !((va = VSYSCALL_START)  ((va + len - 1) = VSYSCALL_END));
 }

 int arch_bp_generic_fields(int x86_len, int x86_type,
---

Where else can I send it?
Amnon.

 On 01/10, u3...@miso.sublimeip.com wrote:

 Hi Everyone,

  On 01/08, Pedro Alves wrote:
 
  On 12/04/2012 05:59 PM, Oleg Nesterov wrote:
 
   But If we want to allow to trace vsyscall's, hw bp doesn't look
 very
   nice imo. HBP_NUM = 4 and you need to setup 3 bp's to trace them
 all.
 
  Irrespective of the whole syscall tracing issue, allowing HW bkpts in
  the vsyscall just seems like a bug fix to me.
 
  And I never argued. I sent the patch iirc ;)

 Exactly, it is a bug and I am still waiting for it to be fixed in the
 Linux kernel.

 I would not say this is a bug but let me repeat, no need to convince me.

 Please feel free to re-send the patch(es) I sent to maintainers. Sorry,
 I can't push these changes into Linus's tree.

 Oleg.




patch
Description: Binary data


Re: PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check)

2013-01-09 Thread u3557
Hi Everyone,

> On 01/08, Pedro Alves wrote:
>>
>> On 12/04/2012 05:59 PM, Oleg Nesterov wrote:
>>
>> > But If we want to allow to trace vsyscall's, hw bp doesn't look very
>> > nice imo. HBP_NUM = 4 and you need to setup 3 bp's to trace them all.
>>
>> Irrespective of the whole syscall tracing issue, allowing HW bkpts in
>> the vsyscall just seems like a bug fix to me.
>
> And I never argued. I sent the patch iirc ;)

Exactly, it is a bug and I am still waiting for it to be fixed in the
Linux kernel.

Fully emulating PTRACE_SYSCALL could also provide a suitable way to
fix my problem, and it may also help others by saving them the need
to program and waste x86 debug registers, but it doesn't change the
fact that my problem is caused by a bug in the first place, which
should be fixed in any case.

Best Regards,
Amnon.


>
>> > That is why I think PTRACE_SYSCALL should "simply work" somehow. And
>> > so far I think that "just report syscall_exit with orig_ax = -1" is
>> > the best (and simple) solution.
>>
>> If you report exit alone, you'll confuse current GDB into mistaking
>> it for an enter,
>
> Sure. That is why I asked Jan.
>
>> > OK. We can do more. We can report both syscall_enter/exit and we can
>> > change orig_ax/ax temporary to "fool" the tracer, so that everything
>> > will look as a "normal" syscall. Like vsyscall_seccomp() does.
>> >
>> > But this needs much more changes.
>>
>> I'd just like to add, that if any new syscall related option is
>> to be added, can we please just go all the way and add
>> PTRACE_EVENT_SYSCALL_ENTER|PTRACE_EVENT_SYSCALL_EXIT instead?
>
> Oh yes, this was suggested many times.
>
> Oleg.
>
>


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PTRACE_SYSCALL vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check)

2013-01-09 Thread u3557
Hi Everyone,

 On 01/08, Pedro Alves wrote:

 On 12/04/2012 05:59 PM, Oleg Nesterov wrote:

  But If we want to allow to trace vsyscall's, hw bp doesn't look very
  nice imo. HBP_NUM = 4 and you need to setup 3 bp's to trace them all.

 Irrespective of the whole syscall tracing issue, allowing HW bkpts in
 the vsyscall just seems like a bug fix to me.

 And I never argued. I sent the patch iirc ;)

Exactly, it is a bug and I am still waiting for it to be fixed in the
Linux kernel.

Fully emulating PTRACE_SYSCALL could also provide a suitable way to
fix my problem, and it may also help others by saving them the need
to program and waste x86 debug registers, but it doesn't change the
fact that my problem is caused by a bug in the first place, which
should be fixed in any case.

Best Regards,
Amnon.



  That is why I think PTRACE_SYSCALL should simply work somehow. And
  so far I think that just report syscall_exit with orig_ax = -1 is
  the best (and simple) solution.

 If you report exit alone, you'll confuse current GDB into mistaking
 it for an enter,

 Sure. That is why I asked Jan.

  OK. We can do more. We can report both syscall_enter/exit and we can
  change orig_ax/ax temporary to fool the tracer, so that everything
  will look as a normal syscall. Like vsyscall_seccomp() does.
 
  But this needs much more changes.

 I'd just like to add, that if any new syscall related option is
 to be added, can we please just go all the way and add
 PTRACE_EVENT_SYSCALL_ENTER|PTRACE_EVENT_SYSCALL_EXIT instead?

 Oh yes, this was suggested many times.

 Oleg.




--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check)

2012-12-05 Thread u3557
Dear Jan,

> x86 debug registers are already very scarce.  Besides that userland
> applications know they have 4 of them available so it would also break
> them.

If a userland application wants to cheat, then it has no need to bypass
the debug registers: even if there were 4096 of them, covering the whole
vsyscall page, it could simply copy the whole vsyscall page somewhere
else, then run (or emulate) it, or look directly at the raw data within
the vsyscall page.  The only way to overcome that would be to make the
vsyscall page either non-existent or unreadable.

Personally, allowing userland applications to read the vsyscall page
can't hurt me and my applications, but if someone else is concerned with
such malicious programs (does anyone?), if they need to construct the
strictest-of-strict jail, where jailed applications cannot glimpse
any information from the kernel they run on no matter how hard they try,
then they must at least make the vsyscall page unreadable, then rely
either on kernel emulation or a SIGSEGV (the later would be quite
sufficient for my own needs as a substitute for debug-registers,
but unfortunately not for the current version of "strace").

If, as I was told, it's too hard to remove the vsyscall page on a
per-process basis, then it's sufficient to make it unreadable on
context-switch.

My concern, however, is not with the bad guys, but with good honest
programs that would run incorrectly if allowed to call "time()" or
"gettimeofday()" unsupervised.  No good program or library jumps into
the vsyscall page except into its 3 official entry points.

In summary, it should be decided:

If it is important enough for Linux to support paranoidically strict
jails, then full kernel emulation of PTRACE_SYSCALL (and PTRACE_SYSEMU)
is inescapable.

If, on the other hand, there is only a need to allow applications such as
mine and "strace"/"gdb" to trap system-calls that occur via the vsyscall
page, then in principle a variety of options are possible:

1. Allow setting the x86 hardware-debug registers into the vsyscall page.
2. Optional (per-process) removal of execute-permission from the vsyscall
   page.
3. Optional (per-process) removal of both read and execute permissions
   from the vsyscall page.
4. Optional (per-process) elimination of the vsyscall page altogether.
5. Kernel vsyscall emulation code to send some signal or event to traced
   processes if the ptracer asked for it (using a new ptrace option).
6. Complete and transparent emulation of PTRACE_SYSCALL/PTRACE_SYSEMU.

Option #1 requires the least effort (a 2-line fix).
Option #6 requires the most effort.

Best Regards,
Amnon.

> On Sun, 02 Dec 2012 20:30:58 +0100, Oleg Nesterov wrote:
>> Yes, that is why I said this needs the new option.
>
> I do not mind new options although personally I do not find them
> meaningful
> for an already deprecated ABI compatibility-only issue.
>
>
>> If the tracer does PTRACE_SYSCALL the tracee reports syscall exit
>> _after_ gettimeofday/etc. The tracer can look at regs->orig_ax == -1
>> and detect that this is not syscall but vsyscall, it can look at
>> regs->ip then (not with the patch below).
>
> I believe applications just call PTRACE_SYSCALL twice, without checking
> orig_eax.  At least strace and its TCB_INSYSCALL looks so.
>
>
> On Mon, 03 Dec 2012 00:54:58 +0100, u3...@miso.sublimeip.com wrote:
>> The beauty of using the x86 debug-registers,
>
> x86 debug registers are already very scarce.  Besides that userland
> applications know they have 4 of them available so it would also break
> them.
>
>
> Regards,
> Jan
>


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PTRACE_SYSCALL vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check)

2012-12-05 Thread u3557
Dear Jan,

 x86 debug registers are already very scarce.  Besides that userland
 applications know they have 4 of them available so it would also break
 them.

If a userland application wants to cheat, then it has no need to bypass
the debug registers: even if there were 4096 of them, covering the whole
vsyscall page, it could simply copy the whole vsyscall page somewhere
else, then run (or emulate) it, or look directly at the raw data within
the vsyscall page.  The only way to overcome that would be to make the
vsyscall page either non-existent or unreadable.

Personally, allowing userland applications to read the vsyscall page
can't hurt me and my applications, but if someone else is concerned with
such malicious programs (does anyone?), if they need to construct the
strictest-of-strict jail, where jailed applications cannot glimpse
any information from the kernel they run on no matter how hard they try,
then they must at least make the vsyscall page unreadable, then rely
either on kernel emulation or a SIGSEGV (the later would be quite
sufficient for my own needs as a substitute for debug-registers,
but unfortunately not for the current version of strace).

If, as I was told, it's too hard to remove the vsyscall page on a
per-process basis, then it's sufficient to make it unreadable on
context-switch.

My concern, however, is not with the bad guys, but with good honest
programs that would run incorrectly if allowed to call time() or
gettimeofday() unsupervised.  No good program or library jumps into
the vsyscall page except into its 3 official entry points.

In summary, it should be decided:

If it is important enough for Linux to support paranoidically strict
jails, then full kernel emulation of PTRACE_SYSCALL (and PTRACE_SYSEMU)
is inescapable.

If, on the other hand, there is only a need to allow applications such as
mine and strace/gdb to trap system-calls that occur via the vsyscall
page, then in principle a variety of options are possible:

1. Allow setting the x86 hardware-debug registers into the vsyscall page.
2. Optional (per-process) removal of execute-permission from the vsyscall
   page.
3. Optional (per-process) removal of both read and execute permissions
   from the vsyscall page.
4. Optional (per-process) elimination of the vsyscall page altogether.
5. Kernel vsyscall emulation code to send some signal or event to traced
   processes if the ptracer asked for it (using a new ptrace option).
6. Complete and transparent emulation of PTRACE_SYSCALL/PTRACE_SYSEMU.

Option #1 requires the least effort (a 2-line fix).
Option #6 requires the most effort.

Best Regards,
Amnon.

 On Sun, 02 Dec 2012 20:30:58 +0100, Oleg Nesterov wrote:
 Yes, that is why I said this needs the new option.

 I do not mind new options although personally I do not find them
 meaningful
 for an already deprecated ABI compatibility-only issue.


 If the tracer does PTRACE_SYSCALL the tracee reports syscall exit
 _after_ gettimeofday/etc. The tracer can look at regs-orig_ax == -1
 and detect that this is not syscall but vsyscall, it can look at
 regs-ip then (not with the patch below).

 I believe applications just call PTRACE_SYSCALL twice, without checking
 orig_eax.  At least strace and its TCB_INSYSCALL looks so.


 On Mon, 03 Dec 2012 00:54:58 +0100, u3...@miso.sublimeip.com wrote:
 The beauty of using the x86 debug-registers,

 x86 debug registers are already very scarce.  Besides that userland
 applications know they have 4 of them available so it would also break
 them.


 Regards,
 Jan



--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check)

2012-12-04 Thread u3557
Dear Oleg,

> Yes, I understand, so DR_RW_EXECUTE should probably work. And I even
> sent the patch (untested/uncompiled). But given that even the simple
> bugfix which started this thread was ignored by maintainers, I am
> not sure how we can convince them this change makes sense ;)

Just to confirm, DR_RW_EXECUTE won't only "probably" work - it DOES work,
I have tested it.

A fully super-duper automatic and transparent emulation of PTRACE_SYSCALL,
including the faking of user-registers, would undoubtedly be great, but
it's complex and will require a large patch, while here is a trivial 1-2
line fix which doesn't harm anyone and allows ptracers to trap a vsyscall
in no time.

> But If we want to allow to trace vsyscall's, hw bp doesn't look very
> nice imo. HBP_NUM = 4 and you need to setup 3 bp's to trace them all.

The next solution in line, in terms of patch-size, if we don't want to
waste debug registers, is to have the execute permission of the vsyscall
page changed on context switches when a process/ptracer requests so, in a
manner similar to prctl(PR_SET_TSC).

Best Regards,
Amnon.

>>
>> > However. Of course it would be nice to avoid the new option. IMO it
>> > would be better to do nothing ;) vsyscall is deprecated, and EMULATE
>> > is x86-specific.
>>
>> The problem is that the current static glibc invokes the vsyscall page,
>
> Yes I know.
>
> Still I'd like to avoid to change the ptrace API, even if the change is
> simple. This emulate_vsyscall() is too "exotic" imho.
>
>> > You forgot again that EMULATE does not execute the code in the
>> > vsyscall page.
>>
>> The beauty of using the x86 debug-registers, is that they do not
>> trap the instruction, but rather the fact that the program-counter
>> has a given value.
>
> Yes, I understand, so DR_RW_EXECUTE should probably work. And I even
> sent the patch (untested/uncompiled). But given that even the simple
> bugfix which started this thread was ignored by maintainers, I am
> not sure how we can convince them this change makes sense ;)
>
> However. This looks like a hack to me, because this code is never
> executed. But this is sudjective and I am not saying this can't work.
> And yes, this doesn't add new ptrace hacks.
>
>
>
> But If we want to allow to trace vsyscall's, hw bp doesn't look very
> nice imo. HBP_NUM = 4 and you need to setup 3 bp's to trace them all.
>
> And what about strace? It won't be easy to change it to use hwbp.
>
>
> That is why I think PTRACE_SYSCALL should "simply work" somehow. And
> so far I think that "just report syscall_exit with orig_ax = -1" is
> the best (and simple) solution.
>
> OK. We can do more. We can report both syscall_enter/exit and we can
> change orig_ax/ax temporary to "fool" the tracer, so that everything
> will look as a "normal" syscall. Like vsyscall_seccomp() does.
>
> But this needs much more changes.
>
> Oleg.
>
>


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PTRACE_SYSCALL vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check)

2012-12-04 Thread u3557
Dear Oleg,

 Yes, I understand, so DR_RW_EXECUTE should probably work. And I even
 sent the patch (untested/uncompiled). But given that even the simple
 bugfix which started this thread was ignored by maintainers, I am
 not sure how we can convince them this change makes sense ;)

Just to confirm, DR_RW_EXECUTE won't only probably work - it DOES work,
I have tested it.

A fully super-duper automatic and transparent emulation of PTRACE_SYSCALL,
including the faking of user-registers, would undoubtedly be great, but
it's complex and will require a large patch, while here is a trivial 1-2
line fix which doesn't harm anyone and allows ptracers to trap a vsyscall
in no time.

 But If we want to allow to trace vsyscall's, hw bp doesn't look very
 nice imo. HBP_NUM = 4 and you need to setup 3 bp's to trace them all.

The next solution in line, in terms of patch-size, if we don't want to
waste debug registers, is to have the execute permission of the vsyscall
page changed on context switches when a process/ptracer requests so, in a
manner similar to prctl(PR_SET_TSC).

Best Regards,
Amnon.


  However. Of course it would be nice to avoid the new option. IMO it
  would be better to do nothing ;) vsyscall is deprecated, and EMULATE
  is x86-specific.

 The problem is that the current static glibc invokes the vsyscall page,

 Yes I know.

 Still I'd like to avoid to change the ptrace API, even if the change is
 simple. This emulate_vsyscall() is too exotic imho.

  You forgot again that EMULATE does not execute the code in the
  vsyscall page.

 The beauty of using the x86 debug-registers, is that they do not
 trap the instruction, but rather the fact that the program-counter
 has a given value.

 Yes, I understand, so DR_RW_EXECUTE should probably work. And I even
 sent the patch (untested/uncompiled). But given that even the simple
 bugfix which started this thread was ignored by maintainers, I am
 not sure how we can convince them this change makes sense ;)

 However. This looks like a hack to me, because this code is never
 executed. But this is sudjective and I am not saying this can't work.
 And yes, this doesn't add new ptrace hacks.



 But If we want to allow to trace vsyscall's, hw bp doesn't look very
 nice imo. HBP_NUM = 4 and you need to setup 3 bp's to trace them all.

 And what about strace? It won't be easy to change it to use hwbp.


 That is why I think PTRACE_SYSCALL should simply work somehow. And
 so far I think that just report syscall_exit with orig_ax = -1 is
 the best (and simple) solution.

 OK. We can do more. We can report both syscall_enter/exit and we can
 change orig_ax/ax temporary to fool the tracer, so that everything
 will look as a normal syscall. Like vsyscall_seccomp() does.

 But this needs much more changes.

 Oleg.




--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PTRACE_SYSCALL && vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check)

2012-12-02 Thread u3557
Hi Oleg,

> However. Of course it would be nice to avoid the new option. IMO it
> would be better to do nothing ;) vsyscall is deprecated, and EMULATE
> is x86-specific.

The problem is that the current static glibc invokes the vsyscall page,
so statically-linked 3rd-party executables that were distributed only
in binary form are not going to stop using vsyscall even in 100 years.

Note also that even now, while the dynamic glibc library uses vDSO
for "gettimeofday()" it still uses the vsyscall page for "time()",
so even when fixed, it would still take years in any case until all
Linux distributions are free of references to the vsyscall page.

> You forgot again that EMULATE does not execute the code in the
> vsyscall page.

The beauty of using the x86 debug-registers, is that they do not
trap the instruction, but rather the fact that the program-counter
has a given value.  They work like single-stepping, so the condition
of attempting to access the vsyscall page is detected in hardware
BEFORE attempting to access the vsyscall page itself, BEFORE even
discovering that the vsyscall page is inaccessible (in EMULATE mode).

Once trapped of course, the program-counter will be changed by the ptracer,
so neither NATIVE nor EMULATE will ever be invoked.

Current versions of "strace" and "gdb" will not automatically benefit
from this, but wouldn't be harmed either.  Future versions can then
be written to make use of the debug-registers to detect a vsyscall.

Best Regards,
Amnon.

> Amnon, sorry for delay...
>
> On 11/26, Amnon Shiloh wrote:
>>
>> > Why do you need to _prevent_, say, sys_gettimeofday()? Why we can't
>> > change emulate_vsyscall() to respect PTRACE_SYSCALL and report
>> > TRAP_VSYSCALL or PTRACE_EVENT_VSYSCALL as I tried to suggest in
>> > http://marc.info/?l=linux-kernel=135343635523715 ?
>> >
>> > Oleg.
>> >
>>
>> For my own application, I would be happy with this.
>
> OK, good.
>
>> But I suspect it might break current versions of "strace",
>> ...
>> I think it COULD work, but not based on PTRACE_SYSCALL
>> (or PTRACE_SYSEMU) alone.  A new ptrace option will be needed, saying:
>> "Yes, I am aware of TRAP_VSYSCALL and I know how to handle it."
>
> Yes, that is why I said this needs the new option.
>
> However. Of course it would be nice to avoid the new option. IMO it
> would be better to do nothing ;) vsyscall is deprecated, and EMULATE
> is x86-specific.
>
>
> May be we could simply do something like the patch below? (Just in
> case, this hack is only for illustration, it is not complete).
>
> If the tracer does PTRACE_SYSCALL the tracee reports syscall exit
> _after_ gettimeofday/etc. The tracer can look at regs->orig_ax == -1
> and detect that this is not syscall but vsyscall, it can look at
> regs->ip then (not with the patch below).
>
> Denys, Jan, Pedro. Do you think this change can break/confuse
> gdb/strace ?
>
>
>> While for my own application, just fixing the range-check in
>> arch_check_bp_in_kernelspace will do,
>
> You forgot again that EMULATE does not execute the code in the
> vsyscall page.
>
> Oleg.
>
> --- a/arch/x86/kernel/vsyscall_64.c
> +++ b/arch/x86/kernel/vsyscall_64.c
> @@ -186,6 +186,8 @@ static bool write_ok_or_segv(unsigned long ptr, size_t
> size)
>   }
>  }
>
> +#include 
> +
>  bool emulate_vsyscall(struct pt_regs *regs, unsigned long address)
>  {
>   struct task_struct *tsk;
> @@ -312,6 +314,8 @@ do_ret:
>   regs->ip = caller;
>   regs->sp += 8;
>  done:
> + if (test_thread_flag(TIF_SYSCALL_TRACE))
> + ptrace_report_syscall(regs);
>   return true;
>
>  sigsegv:
>
>


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PTRACE_SYSCALL vsyscall (Was: arch_check_bp_in_kernelspace: fix the range check)

2012-12-02 Thread u3557
Hi Oleg,

 However. Of course it would be nice to avoid the new option. IMO it
 would be better to do nothing ;) vsyscall is deprecated, and EMULATE
 is x86-specific.

The problem is that the current static glibc invokes the vsyscall page,
so statically-linked 3rd-party executables that were distributed only
in binary form are not going to stop using vsyscall even in 100 years.

Note also that even now, while the dynamic glibc library uses vDSO
for gettimeofday() it still uses the vsyscall page for time(),
so even when fixed, it would still take years in any case until all
Linux distributions are free of references to the vsyscall page.

 You forgot again that EMULATE does not execute the code in the
 vsyscall page.

The beauty of using the x86 debug-registers, is that they do not
trap the instruction, but rather the fact that the program-counter
has a given value.  They work like single-stepping, so the condition
of attempting to access the vsyscall page is detected in hardware
BEFORE attempting to access the vsyscall page itself, BEFORE even
discovering that the vsyscall page is inaccessible (in EMULATE mode).

Once trapped of course, the program-counter will be changed by the ptracer,
so neither NATIVE nor EMULATE will ever be invoked.

Current versions of strace and gdb will not automatically benefit
from this, but wouldn't be harmed either.  Future versions can then
be written to make use of the debug-registers to detect a vsyscall.

Best Regards,
Amnon.

 Amnon, sorry for delay...

 On 11/26, Amnon Shiloh wrote:

  Why do you need to _prevent_, say, sys_gettimeofday()? Why we can't
  change emulate_vsyscall() to respect PTRACE_SYSCALL and report
  TRAP_VSYSCALL or PTRACE_EVENT_VSYSCALL as I tried to suggest in
  http://marc.info/?l=linux-kernelm=135343635523715 ?
 
  Oleg.
 

 For my own application, I would be happy with this.

 OK, good.

 But I suspect it might break current versions of strace,
 ...
 I think it COULD work, but not based on PTRACE_SYSCALL
 (or PTRACE_SYSEMU) alone.  A new ptrace option will be needed, saying:
 Yes, I am aware of TRAP_VSYSCALL and I know how to handle it.

 Yes, that is why I said this needs the new option.

 However. Of course it would be nice to avoid the new option. IMO it
 would be better to do nothing ;) vsyscall is deprecated, and EMULATE
 is x86-specific.


 May be we could simply do something like the patch below? (Just in
 case, this hack is only for illustration, it is not complete).

 If the tracer does PTRACE_SYSCALL the tracee reports syscall exit
 _after_ gettimeofday/etc. The tracer can look at regs-orig_ax == -1
 and detect that this is not syscall but vsyscall, it can look at
 regs-ip then (not with the patch below).

 Denys, Jan, Pedro. Do you think this change can break/confuse
 gdb/strace ?


 While for my own application, just fixing the range-check in
 arch_check_bp_in_kernelspace will do,

 You forgot again that EMULATE does not execute the code in the
 vsyscall page.

 Oleg.

 --- a/arch/x86/kernel/vsyscall_64.c
 +++ b/arch/x86/kernel/vsyscall_64.c
 @@ -186,6 +186,8 @@ static bool write_ok_or_segv(unsigned long ptr, size_t
 size)
   }
  }

 +#include linux/tracehook.h
 +
  bool emulate_vsyscall(struct pt_regs *regs, unsigned long address)
  {
   struct task_struct *tsk;
 @@ -312,6 +314,8 @@ do_ret:
   regs-ip = caller;
   regs-sp += 8;
  done:
 + if (test_thread_flag(TIF_SYSCALL_TRACE))
 + ptrace_report_syscall(regs);
   return true;

  sigsegv:




--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] arch_check_bp_in_kernelspace: fix the range check

2012-11-20 Thread u3557
Hi Oleg,

> Or. Perhaps we can define TRAP_VSYSCALL and change emulate_vsyscall() to
> do
>
>
>   if (current->ptrace && test_thread_flag(TIF_SYSCALL_TRACE))
>   send_sigtrap(TRAP_VSYSCALL, ...);
>
> if it returns true?
>

I wish it were possible, but the vsyscall page is entered in user-mode,
where it is not possible to read either "current" or the per-thread flags.

One nice alternative, however, is to take away the execute permission from
the vsyscall page of the traced process, so it will incur a SIGSEGV (which
the ptracer can easily handle and cancel).

The drawbacks of this nice-to-have alternative are:
1) It may require a bigger patch.
2) It needs to use a separate ptrace option, because current applications
   that use PTRACE_SYSCALL (or PTRACE_SYSEMU), including "strace", will not
   be aware of this new feature, so they can be broken.

> Yes, the tracee won't report _before_ it calls the function, but perhaps
> this is enough? Amnon?

The vsyscall page was designed in order to avoid user/kernel context
switches, which is possible when a system-call is only reading kernel
global information, not modifying any.  In most cases therefore (with
a few exceptions, such as when the hardware clock is broken or
misconfigured), system-calls in the vsyscall page never invoke the kernel
at all - so no trap would be generated either before or after.

Best Regards,
Amnon.

> On 11/20, Oleg Nesterov wrote:
>>
>> On 11/19, Steven Rostedt wrote:
>> >
>> > Is this really what we want?
>>
>> I am not sure, that is why I am asking.
>
> Yes.
>
> And,
>
>> Well yes, with the new implementation __vsyscall_page simply does
>> syscall, so I guess (say) strace will "just work". But, afaics, only
>> if vsyscall_mode == NATIVE.
>>
>> So perhaps bp in vsyscall_page still makes sense...
>
> Or. Perhaps we can define TRAP_VSYSCALL and change emulate_vsyscall() to
> do
>
>
>   if (current->ptrace && test_thread_flag(TIF_SYSCALL_TRACE))
>   send_sigtrap(TRAP_VSYSCALL, ...);
>
> if it returns true?
>
> This way the debugger doesn't need to play with the debug registers,
> it can use ptrace(PTRACE_SYSCALL). The tracee will report SIGTRAP after
> vsyscall, debugger can detect this case looking at info->si_code.
>
> Yes, the tracee won't report _before_ it calls the function, but perhaps
> this is enough? Amnon?
>
> Oleg.
>
>


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] arch_check_bp_in_kernelspace: fix the range check

2012-11-20 Thread u3557
Dear Steve,

> But here, there's no prejudice between tasks. All tasks will now hit the
> breakpoint regardless of if it is being traced or not.

Just to clarify, there is no intention to allow conventional breakpoints
in the vsyscall page - that would indeed be a disaster affecting all other
processes.

The aim of this patch is to allow ptracers to set the x86 debug-registers
of their traced process, so that it stops when it reaches the entry points
of those (few) system-calls that are in the vsyscall page.  Note that those
debug registers are anyway swapped at context-switch, so no other processes
are affected.

> Is this really what we want? I mean, isn't syscall tracing in the kernel
> done by a flag set to the task's structure. If the task has a flag set,
> then it does the slow (ptrace) path which handles the breakpoint for the
> user.

Most system-calls can be trapped by the PTRACE_SYSCALL option (and the
later PTRACE_SYSEMU), but those few system-calls on the vsyscall page
defy the intention to trap ALL system-calls.  They also cannot be
checkpointed by inserting a trap-instruction (as that, if allowed, would
break all other processes), hence the only alternative left is to have
this patch that fixes the oversight in the design of
PTRACE_SYSCALL/PTRACE_SYSEMU in the presence of a vsyscall page.

Best Regards,
Amnon.

> On Mon, 2012-11-19 at 18:47 +0100, Oleg Nesterov wrote:
>> On 11/09, Oleg Nesterov wrote:
>> >
>> > On 11/09, Oleg Nesterov wrote:
>> > >
>> > > Note: TASK_SIZE doesn't look right at least on x86, I think it
>> should
>> > > be replaced by TASK_SIZE_MAX.
>> > > ...
>> > > --- x/arch/x86/kernel/hw_breakpoint.c
>> > > +++ x/arch/x86/kernel/hw_breakpoint.c
>> > > @@ -200,7 +200,7 @@ int arch_check_bp_in_kernelspace(struct
>> > >  va = info->address;
>> > >  len = get_hbp_len(info->len);
>> > >
>> > > -return (va >= TASK_SIZE) && ((va + len - 1) >= TASK_SIZE);
>> > > +return (va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE);
>> >
>> > But actully I'd like to ask another question.
>> >
>> > Can't we add the additional !in_gate_area_no_mm() check to allow
>> > the hw breakpoints in vsyscall?
>> >
>> > Quoting Amnon:
>> >
>> >My service needs to catch the system-calls of its traced son.
>> >Almost all system-calls are caught with PTRACE_SYSCALL, but not those
>> >using the vsyscall page - especially "gettimeofday()" and "time()".
>> >
>> >...> Is this really what we want? I mean, isn't syscall tracing in
the kernel
> done by a flag set to the task's structure. If the task has a flag set,
> then it does the slow (ptrace) path which handles the breakpoint for the
> user.
>> >
>> >However, the code in "arch/x86/kernel/ptrace.c" forbids catching
>> kernel
>> >addresses.
>> >
>> > I tend to agree with Amnon...
>> >
>> > What do you think?
>>
>> ping ;)
>>
>> I agree the patch I sent is very minor (although it looks like the
>> trivial
>> bugfix to me).
>>
>> Just I think Amnon has a point, shouldn't we change this change like
>> below?
>> (on top of this fixlet).
>>
>> Oleg.
>>
>> --- x/arch/x86/kernel/hw_breakpoint.c
>> +++ x/arch/x86/kernel/hw_breakpoint.c
>> @@ -188,6 +188,11 @@ static int get_hbp_len(u8 hbp_len)
>>  return len_in_bytes;
>>  }
>>
>> +static inline bool bp_in_gate(unsigned long start, unsigned long end)
>> +{
>> +return in_gate_area_no_mm(start) && in_gate_area_no_mm(end);
>> +}
>> +
>>  /*
>>   * Check for virtual address in kernel space.
>>   */
>> @@ -200,7 +205,8 @@ int arch_check_bp_in_kernelspace(struct
>>  va = info->address;
>>  len = get_hbp_len(info->len);
>>
>> -return (va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE);
>> +return ((va >= TASK_SIZE) || ((va + len - 1) >= TASK_SIZE)) &&
>> +!bp_in_gate(va, va + len - 1);
>
> So we want to allow non privileged to add a breakpoint in a vsyscall
> area even if it happens to lay in kernel space?
>
> Is this really what we want? I mean, isn't syscall tracing in the kernel
> done by a flag set to the task's structure. If the task has a flag set,
> then it does the slow (ptrace) path which handles the breakpoint for the
> user.
>
> But here, there's no prejudice between tasks. All tasks will now hit the
> breakpoint regardless of if it is being traced or not.
>
> -- Steve
>
>
>>  }
>>
>>  int arch_bp_generic_fields(int x86_len, int x86_type,
>
>
>


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] arch_check_bp_in_kernelspace: fix the range check

2012-11-20 Thread u3557
Dear Steve,

 But here, there's no prejudice between tasks. All tasks will now hit the
 breakpoint regardless of if it is being traced or not.

Just to clarify, there is no intention to allow conventional breakpoints
in the vsyscall page - that would indeed be a disaster affecting all other
processes.

The aim of this patch is to allow ptracers to set the x86 debug-registers
of their traced process, so that it stops when it reaches the entry points
of those (few) system-calls that are in the vsyscall page.  Note that those
debug registers are anyway swapped at context-switch, so no other processes
are affected.

 Is this really what we want? I mean, isn't syscall tracing in the kernel
 done by a flag set to the task's structure. If the task has a flag set,
 then it does the slow (ptrace) path which handles the breakpoint for the
 user.

Most system-calls can be trapped by the PTRACE_SYSCALL option (and the
later PTRACE_SYSEMU), but those few system-calls on the vsyscall page
defy the intention to trap ALL system-calls.  They also cannot be
checkpointed by inserting a trap-instruction (as that, if allowed, would
break all other processes), hence the only alternative left is to have
this patch that fixes the oversight in the design of
PTRACE_SYSCALL/PTRACE_SYSEMU in the presence of a vsyscall page.

Best Regards,
Amnon.

 On Mon, 2012-11-19 at 18:47 +0100, Oleg Nesterov wrote:
 On 11/09, Oleg Nesterov wrote:
 
  On 11/09, Oleg Nesterov wrote:
  
   Note: TASK_SIZE doesn't look right at least on x86, I think it
 should
   be replaced by TASK_SIZE_MAX.
   ...
   --- x/arch/x86/kernel/hw_breakpoint.c
   +++ x/arch/x86/kernel/hw_breakpoint.c
   @@ -200,7 +200,7 @@ int arch_check_bp_in_kernelspace(struct
va = info-address;
len = get_hbp_len(info-len);
  
   -return (va = TASK_SIZE)  ((va + len - 1) = TASK_SIZE);
   +return (va = TASK_SIZE) || ((va + len - 1) = TASK_SIZE);
 
  But actully I'd like to ask another question.
 
  Can't we add the additional !in_gate_area_no_mm() check to allow
  the hw breakpoints in vsyscall?
 
  Quoting Amnon:
 
 My service needs to catch the system-calls of its traced son.
 Almost all system-calls are caught with PTRACE_SYSCALL, but not those
 using the vsyscall page - especially gettimeofday() and time().
 
 ... Is this really what we want? I mean, isn't syscall tracing in
the kernel
 done by a flag set to the task's structure. If the task has a flag set,
 then it does the slow (ptrace) path which handles the breakpoint for the
 user.
 
 However, the code in arch/x86/kernel/ptrace.c forbids catching
 kernel
 addresses.
 
  I tend to agree with Amnon...
 
  What do you think?

 ping ;)

 I agree the patch I sent is very minor (although it looks like the
 trivial
 bugfix to me).

 Just I think Amnon has a point, shouldn't we change this change like
 below?
 (on top of this fixlet).

 Oleg.

 --- x/arch/x86/kernel/hw_breakpoint.c
 +++ x/arch/x86/kernel/hw_breakpoint.c
 @@ -188,6 +188,11 @@ static int get_hbp_len(u8 hbp_len)
  return len_in_bytes;
  }

 +static inline bool bp_in_gate(unsigned long start, unsigned long end)
 +{
 +return in_gate_area_no_mm(start)  in_gate_area_no_mm(end);
 +}
 +
  /*
   * Check for virtual address in kernel space.
   */
 @@ -200,7 +205,8 @@ int arch_check_bp_in_kernelspace(struct
  va = info-address;
  len = get_hbp_len(info-len);

 -return (va = TASK_SIZE) || ((va + len - 1) = TASK_SIZE);
 +return ((va = TASK_SIZE) || ((va + len - 1) = TASK_SIZE)) 
 +!bp_in_gate(va, va + len - 1);

 So we want to allow non privileged to add a breakpoint in a vsyscall
 area even if it happens to lay in kernel space?

 Is this really what we want? I mean, isn't syscall tracing in the kernel
 done by a flag set to the task's structure. If the task has a flag set,
 then it does the slow (ptrace) path which handles the breakpoint for the
 user.

 But here, there's no prejudice between tasks. All tasks will now hit the
 breakpoint regardless of if it is being traced or not.

 -- Steve


  }

  int arch_bp_generic_fields(int x86_len, int x86_type,





--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] arch_check_bp_in_kernelspace: fix the range check

2012-11-20 Thread u3557
Hi Oleg,

 Or. Perhaps we can define TRAP_VSYSCALL and change emulate_vsyscall() to
 do


   if (current-ptrace  test_thread_flag(TIF_SYSCALL_TRACE))
   send_sigtrap(TRAP_VSYSCALL, ...);

 if it returns true?


I wish it were possible, but the vsyscall page is entered in user-mode,
where it is not possible to read either current or the per-thread flags.

One nice alternative, however, is to take away the execute permission from
the vsyscall page of the traced process, so it will incur a SIGSEGV (which
the ptracer can easily handle and cancel).

The drawbacks of this nice-to-have alternative are:
1) It may require a bigger patch.
2) It needs to use a separate ptrace option, because current applications
   that use PTRACE_SYSCALL (or PTRACE_SYSEMU), including strace, will not
   be aware of this new feature, so they can be broken.

 Yes, the tracee won't report _before_ it calls the function, but perhaps
 this is enough? Amnon?

The vsyscall page was designed in order to avoid user/kernel context
switches, which is possible when a system-call is only reading kernel
global information, not modifying any.  In most cases therefore (with
a few exceptions, such as when the hardware clock is broken or
misconfigured), system-calls in the vsyscall page never invoke the kernel
at all - so no trap would be generated either before or after.

Best Regards,
Amnon.

 On 11/20, Oleg Nesterov wrote:

 On 11/19, Steven Rostedt wrote:
 
  Is this really what we want?

 I am not sure, that is why I am asking.

 Yes.

 And,

 Well yes, with the new implementation __vsyscall_page simply does
 syscall, so I guess (say) strace will just work. But, afaics, only
 if vsyscall_mode == NATIVE.

 So perhaps bp in vsyscall_page still makes sense...

 Or. Perhaps we can define TRAP_VSYSCALL and change emulate_vsyscall() to
 do


   if (current-ptrace  test_thread_flag(TIF_SYSCALL_TRACE))
   send_sigtrap(TRAP_VSYSCALL, ...);

 if it returns true?

 This way the debugger doesn't need to play with the debug registers,
 it can use ptrace(PTRACE_SYSCALL). The tracee will report SIGTRAP after
 vsyscall, debugger can detect this case looking at info-si_code.

 Yes, the tracee won't report _before_ it calls the function, but perhaps
 this is enough? Amnon?

 Oleg.




--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/