Re: [PATCH v2] x86/vsyscall: allow seccomp filter in vsyscall=emulate

2012-07-13 Thread Will Drewry
On Fri, Jul 13, 2012 at 7:48 PM, Will Drewry  wrote:
> On Fri, Jul 13, 2012 at 6:00 PM, Andrew Lutomirski  wrote:
>> On Fri, Jul 13, 2012 at 10:06 AM, Will Drewry  wrote:
>>> If a seccomp filter program is installed, older static binaries and
>>> distributions with older libc implementations (glibc 2.13 and earlier)
>>> that rely on vsyscall use will be terminated regardless of the filter
>>> program policy when executing time, gettimeofday, or getcpu.  This is
>>> only the case when vsyscall emulation is in use (vsyscall=emulate is the
>>> default).
>>>
>>> This patch emulates system call entry inside a vsyscall=emulate by
>>> populating regs->ax and regs->orig_ax with the system call number prior
>>> to calling into seccomp such that all seccomp-dependencies function
>>> normally.  Additionally, system call return behavior is emulated in line
>>> with other vsyscall entrypoints for the trace/trap cases.
>>>
>>> Reported-by: Owen Kibel 
>>> Signed-off-by: Will Drewry 
>>>
>>> v2: - fixed ip and sp on SECCOMP_RET_TRAP/TRACE (thanks to l...@mit.edu)
>>
>>> @@ -253,6 +273,12 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned 
>>> long address)
>>>
>>> current_thread_info()->sig_on_uaccess_error = 
>>> prev_sig_on_uaccess_error;
>>>
>>> +   if (skip) {
>>> +   if ((long)regs->ax <= 0L) /* seccomp errno emulation */
>>> +   goto do_ret;
>>> +   goto done; /* seccomp trace/trap */
>>> +   }
>>> +
>>> if (ret == -EFAULT) {
>>> /* Bad news -- userspace fed a bad pointer to a vsyscall. */
>>> warn_bad_vsyscall(KERN_INFO, regs,
>>> @@ -271,10 +297,11 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned 
>>> long address)
>>>
>>> regs->ax = ret;
>>>
>>> +do_ret:
>>> /* Emulate a ret instruction. */
>>> regs->ip = caller;
>>> regs->sp += 8;
>>> -
>>> +done:
>>> return true;
>>>
>>>  sigsegv:
>>> --
>>> 1.7.9.5
>>>
>>
>> This has the same odd property as the sigsegv path that the faulting
>> instruction will appear to be the mov, not the syscall.  That seems to
>> be okay, though -- various pieces of code that try to restart the segv
>> are okay with that.
>
> Yeah - I would otherwise do
>   regs->ip += 9;
> but I wanted to match the code that was therefor SIGSEGV.  If regs->ip
> += 9 _just_ for the SIGSYS case is fine, then I'll make that change
> shortly.  Since any code that sees the vsyscall address should be wise
> enough to avoid it, perhaps that's why the SIGSEGV hasn't had a
> problem so far.

I dashed this off without more thought. It's best to leave it as is
because any return to the emulated page will cause a vsyscall fault
event.

>> Is there any code that assumes that changing rax (i.e. the syscall
>> number) and restarting a syscall after SIGSYS will invoke the new
>> syscall?  (The RET_TRACE path might be similar -- does the
>> ptrace_event(PTRACE_EVENT_SECCOMP, data) in seccomp.c give a debugger
>> a chance to synchronously cancel or change the syscall?
>
> Unfortunately, it does in normal interception. I don't see any way out
> of that quirk with vsyscall=emulate.  As is without seccomp,
> vsyscall=emulate doesn't allow ptrace interception (or syscall
> auditing for that matter) while vsyscall=native does.   So the option
> here is to document the quirky interaction in
> Documentation/prctl/seccomp_filter.txt.  In particular, if the tracer
> sees either (time|gettimeofday|getcpu) and rip in the vsyscall page,
> it will know it can't rewrite or bypass the call.Is there a better
> option?
>
> Given that, I will include a tweak to the documentation to indicate
> that behavior so that userspace authors of BPF programs that use
> SECCOMP_RET_TRACE will be aware of the behavior.
>
>> If those issues aren't problems, then:
>>
>> Reviewed-by: Andy Lutomirski 
>>
>> (If the syscall number needs to change after the fact in the
>> SECCOMP_RET_TRAP case, it'll be a mess.)
>
> Nah - traps are delivered like the forced sigsegv path.
>
> I'll spin a v3 soon including the documentation tweak and the ip
> offset to match vsyscall=native behavior (regs->ip += 9 _just_ for the
> skip case).  Of course, any better ideas for the trace-case will be
> more than welcome, but it seems to me to be an acceptable tradeoff - I
> hope others agree.
>
> I'll make the changes and then put it through its paces to see if any
> other little idiosyncrasies emerge.

I've written up a documentation patch to accompany this one. It
reflects one more change I've made in a v3 of the patch, but it is
optional.  I've added support for SECCOMP_RET_TRACE to still
skip/emulate the system call if it desires.  In v2 it can't.  Either
way is fine in practice, but I'd need to change the accompanying
documentation.

thanks again!
will
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

Re: [PATCH v2] x86/vsyscall: allow seccomp filter in vsyscall=emulate

2012-07-13 Thread Will Drewry
On Fri, Jul 13, 2012 at 6:00 PM, Andrew Lutomirski  wrote:
> On Fri, Jul 13, 2012 at 10:06 AM, Will Drewry  wrote:
>> If a seccomp filter program is installed, older static binaries and
>> distributions with older libc implementations (glibc 2.13 and earlier)
>> that rely on vsyscall use will be terminated regardless of the filter
>> program policy when executing time, gettimeofday, or getcpu.  This is
>> only the case when vsyscall emulation is in use (vsyscall=emulate is the
>> default).
>>
>> This patch emulates system call entry inside a vsyscall=emulate by
>> populating regs->ax and regs->orig_ax with the system call number prior
>> to calling into seccomp such that all seccomp-dependencies function
>> normally.  Additionally, system call return behavior is emulated in line
>> with other vsyscall entrypoints for the trace/trap cases.
>>
>> Reported-by: Owen Kibel 
>> Signed-off-by: Will Drewry 
>>
>> v2: - fixed ip and sp on SECCOMP_RET_TRAP/TRACE (thanks to l...@mit.edu)
>
>> @@ -253,6 +273,12 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned 
>> long address)
>>
>> current_thread_info()->sig_on_uaccess_error = 
>> prev_sig_on_uaccess_error;
>>
>> +   if (skip) {
>> +   if ((long)regs->ax <= 0L) /* seccomp errno emulation */
>> +   goto do_ret;
>> +   goto done; /* seccomp trace/trap */
>> +   }
>> +
>> if (ret == -EFAULT) {
>> /* Bad news -- userspace fed a bad pointer to a vsyscall. */
>> warn_bad_vsyscall(KERN_INFO, regs,
>> @@ -271,10 +297,11 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned 
>> long address)
>>
>> regs->ax = ret;
>>
>> +do_ret:
>> /* Emulate a ret instruction. */
>> regs->ip = caller;
>> regs->sp += 8;
>> -
>> +done:
>> return true;
>>
>>  sigsegv:
>> --
>> 1.7.9.5
>>
>
> This has the same odd property as the sigsegv path that the faulting
> instruction will appear to be the mov, not the syscall.  That seems to
> be okay, though -- various pieces of code that try to restart the segv
> are okay with that.

Yeah - I would otherwise do
  regs->ip += 9;
but I wanted to match the code that was therefor SIGSEGV.  If regs->ip
+= 9 _just_ for the SIGSYS case is fine, then I'll make that change
shortly.  Since any code that sees the vsyscall address should be wise
enough to avoid it, perhaps that's why the SIGSEGV hasn't had a
problem so far.

> Is there any code that assumes that changing rax (i.e. the syscall
> number) and restarting a syscall after SIGSYS will invoke the new
> syscall?  (The RET_TRACE path might be similar -- does the
> ptrace_event(PTRACE_EVENT_SECCOMP, data) in seccomp.c give a debugger
> a chance to synchronously cancel or change the syscall?

Unfortunately, it does in normal interception. I don't see any way out
of that quirk with vsyscall=emulate.  As is without seccomp,
vsyscall=emulate doesn't allow ptrace interception (or syscall
auditing for that matter) while vsyscall=native does.   So the option
here is to document the quirky interaction in
Documentation/prctl/seccomp_filter.txt.  In particular, if the tracer
sees either (time|gettimeofday|getcpu) and rip in the vsyscall page,
it will know it can't rewrite or bypass the call.Is there a better
option?

Given that, I will include a tweak to the documentation to indicate
that behavior so that userspace authors of BPF programs that use
SECCOMP_RET_TRACE will be aware of the behavior.

> If those issues aren't problems, then:
>
> Reviewed-by: Andy Lutomirski 
>
> (If the syscall number needs to change after the fact in the
> SECCOMP_RET_TRAP case, it'll be a mess.)

Nah - traps are delivered like the forced sigsegv path.

I'll spin a v3 soon including the documentation tweak and the ip
offset to match vsyscall=native behavior (regs->ip += 9 _just_ for the
skip case).  Of course, any better ideas for the trace-case will be
more than welcome, but it seems to me to be an acceptable tradeoff - I
hope others agree.

I'll make the changes and then put it through its paces to see if any
other little idiosyncrasies emerge.

Thanks for the close review!
will
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] x86/vsyscall: allow seccomp filter in vsyscall=emulate

2012-07-13 Thread Andrew Lutomirski
On Fri, Jul 13, 2012 at 10:06 AM, Will Drewry  wrote:
> If a seccomp filter program is installed, older static binaries and
> distributions with older libc implementations (glibc 2.13 and earlier)
> that rely on vsyscall use will be terminated regardless of the filter
> program policy when executing time, gettimeofday, or getcpu.  This is
> only the case when vsyscall emulation is in use (vsyscall=emulate is the
> default).
>
> This patch emulates system call entry inside a vsyscall=emulate by
> populating regs->ax and regs->orig_ax with the system call number prior
> to calling into seccomp such that all seccomp-dependencies function
> normally.  Additionally, system call return behavior is emulated in line
> with other vsyscall entrypoints for the trace/trap cases.
>
> Reported-by: Owen Kibel 
> Signed-off-by: Will Drewry 
>
> v2: - fixed ip and sp on SECCOMP_RET_TRAP/TRACE (thanks to l...@mit.edu)

> @@ -253,6 +273,12 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned 
> long address)
>
> current_thread_info()->sig_on_uaccess_error = 
> prev_sig_on_uaccess_error;
>
> +   if (skip) {
> +   if ((long)regs->ax <= 0L) /* seccomp errno emulation */
> +   goto do_ret;
> +   goto done; /* seccomp trace/trap */
> +   }
> +
> if (ret == -EFAULT) {
> /* Bad news -- userspace fed a bad pointer to a vsyscall. */
> warn_bad_vsyscall(KERN_INFO, regs,
> @@ -271,10 +297,11 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned 
> long address)
>
> regs->ax = ret;
>
> +do_ret:
> /* Emulate a ret instruction. */
> regs->ip = caller;
> regs->sp += 8;
> -
> +done:
> return true;
>
>  sigsegv:
> --
> 1.7.9.5
>

This has the same odd property as the sigsegv path that the faulting
instruction will appear to be the mov, not the syscall.  That seems to
be okay, though -- various pieces of code that try to restart the segv
are okay with that.

Is there any code that assumes that changing rax (i.e. the syscall
number) and restarting a syscall after SIGSYS will invoke the new
syscall?  (The RET_TRACE path might be similar -- does the
ptrace_event(PTRACE_EVENT_SECCOMP, data) in seccomp.c give a debugger
a chance to synchronously cancel or change the syscall?

If those issues aren't problems, then:

Reviewed-by: Andy Lutomirski 

(If the syscall number needs to change after the fact in the
SECCOMP_RET_TRAP case, it'll be a mess.)

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] x86/vsyscall: allow seccomp filter in vsyscall=emulate

2012-07-13 Thread Will Drewry
If a seccomp filter program is installed, older static binaries and
distributions with older libc implementations (glibc 2.13 and earlier)
that rely on vsyscall use will be terminated regardless of the filter
program policy when executing time, gettimeofday, or getcpu.  This is
only the case when vsyscall emulation is in use (vsyscall=emulate is the
default).

This patch emulates system call entry inside a vsyscall=emulate by
populating regs->ax and regs->orig_ax with the system call number prior
to calling into seccomp such that all seccomp-dependencies function
normally.  Additionally, system call return behavior is emulated in line
with other vsyscall entrypoints for the trace/trap cases.

Reported-by: Owen Kibel 
Signed-off-by: Will Drewry 

v2: - fixed ip and sp on SECCOMP_RET_TRAP/TRACE (thanks to l...@mit.edu)
---
 arch/x86/kernel/vsyscall_64.c |   35 +++
 1 file changed, 31 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 7515cf0..08a18d0 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -139,6 +139,15 @@ static int addr_to_vsyscall_nr(unsigned long addr)
return nr;
 }
 
+static int vsyscall_seccomp(struct task_struct *tsk, int syscall_nr)
+{
+   if (!seccomp_mode(>seccomp))
+   return 0;
+   task_pt_regs(tsk)->orig_ax = syscall_nr;
+   task_pt_regs(tsk)->ax = syscall_nr;
+   return __secure_computing(syscall_nr);
+}
+
 static bool write_ok_or_segv(unsigned long ptr, size_t size)
 {
/*
@@ -174,6 +183,7 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long 
address)
int vsyscall_nr;
int prev_sig_on_uaccess_error;
long ret;
+   int skip;
 
/*
 * No point in checking CS -- the only way to get here is a user mode
@@ -205,9 +215,6 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long 
address)
}
 
tsk = current;
-   if (seccomp_mode(>seccomp))
-   do_exit(SIGKILL);
-
/*
 * With a real vsyscall, page faults cause SIGSEGV.  We want to
 * preserve that behavior to make writing exploits harder.
@@ -222,8 +229,13 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long 
address)
 * address 0".
 */
ret = -EFAULT;
+   skip = 0;
switch (vsyscall_nr) {
case 0:
+   skip = vsyscall_seccomp(tsk, __NR_gettimeofday);
+   if (skip)
+   break;
+
if (!write_ok_or_segv(regs->di, sizeof(struct timeval)) ||
!write_ok_or_segv(regs->si, sizeof(struct timezone)))
break;
@@ -234,6 +246,10 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long 
address)
break;
 
case 1:
+   skip = vsyscall_seccomp(tsk, __NR_time);
+   if (skip)
+   break;
+
if (!write_ok_or_segv(regs->di, sizeof(time_t)))
break;
 
@@ -241,6 +257,10 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long 
address)
break;
 
case 2:
+   skip = vsyscall_seccomp(tsk, __NR_getcpu);
+   if (skip)
+   break;
+
if (!write_ok_or_segv(regs->di, sizeof(unsigned)) ||
!write_ok_or_segv(regs->si, sizeof(unsigned)))
break;
@@ -253,6 +273,12 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long 
address)
 
current_thread_info()->sig_on_uaccess_error = prev_sig_on_uaccess_error;
 
+   if (skip) {
+   if ((long)regs->ax <= 0L) /* seccomp errno emulation */
+   goto do_ret;
+   goto done; /* seccomp trace/trap */
+   }
+
if (ret == -EFAULT) {
/* Bad news -- userspace fed a bad pointer to a vsyscall. */
warn_bad_vsyscall(KERN_INFO, regs,
@@ -271,10 +297,11 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long 
address)
 
regs->ax = ret;
 
+do_ret:
/* Emulate a ret instruction. */
regs->ip = caller;
regs->sp += 8;
-
+done:
return true;
 
 sigsegv:
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH v2] x86/vsyscall: allow seccomp filter in vsyscall=emulate

2012-07-13 Thread Will Drewry
If a seccomp filter program is installed, older static binaries and
distributions with older libc implementations (glibc 2.13 and earlier)
that rely on vsyscall use will be terminated regardless of the filter
program policy when executing time, gettimeofday, or getcpu.  This is
only the case when vsyscall emulation is in use (vsyscall=emulate is the
default).

This patch emulates system call entry inside a vsyscall=emulate by
populating regs-ax and regs-orig_ax with the system call number prior
to calling into seccomp such that all seccomp-dependencies function
normally.  Additionally, system call return behavior is emulated in line
with other vsyscall entrypoints for the trace/trap cases.

Reported-by: Owen Kibel qme...@gmail.com
Signed-off-by: Will Drewry w...@chromium.org

v2: - fixed ip and sp on SECCOMP_RET_TRAP/TRACE (thanks to l...@mit.edu)
---
 arch/x86/kernel/vsyscall_64.c |   35 +++
 1 file changed, 31 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
index 7515cf0..08a18d0 100644
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -139,6 +139,15 @@ static int addr_to_vsyscall_nr(unsigned long addr)
return nr;
 }
 
+static int vsyscall_seccomp(struct task_struct *tsk, int syscall_nr)
+{
+   if (!seccomp_mode(tsk-seccomp))
+   return 0;
+   task_pt_regs(tsk)-orig_ax = syscall_nr;
+   task_pt_regs(tsk)-ax = syscall_nr;
+   return __secure_computing(syscall_nr);
+}
+
 static bool write_ok_or_segv(unsigned long ptr, size_t size)
 {
/*
@@ -174,6 +183,7 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long 
address)
int vsyscall_nr;
int prev_sig_on_uaccess_error;
long ret;
+   int skip;
 
/*
 * No point in checking CS -- the only way to get here is a user mode
@@ -205,9 +215,6 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long 
address)
}
 
tsk = current;
-   if (seccomp_mode(tsk-seccomp))
-   do_exit(SIGKILL);
-
/*
 * With a real vsyscall, page faults cause SIGSEGV.  We want to
 * preserve that behavior to make writing exploits harder.
@@ -222,8 +229,13 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long 
address)
 * address 0.
 */
ret = -EFAULT;
+   skip = 0;
switch (vsyscall_nr) {
case 0:
+   skip = vsyscall_seccomp(tsk, __NR_gettimeofday);
+   if (skip)
+   break;
+
if (!write_ok_or_segv(regs-di, sizeof(struct timeval)) ||
!write_ok_or_segv(regs-si, sizeof(struct timezone)))
break;
@@ -234,6 +246,10 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long 
address)
break;
 
case 1:
+   skip = vsyscall_seccomp(tsk, __NR_time);
+   if (skip)
+   break;
+
if (!write_ok_or_segv(regs-di, sizeof(time_t)))
break;
 
@@ -241,6 +257,10 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long 
address)
break;
 
case 2:
+   skip = vsyscall_seccomp(tsk, __NR_getcpu);
+   if (skip)
+   break;
+
if (!write_ok_or_segv(regs-di, sizeof(unsigned)) ||
!write_ok_or_segv(regs-si, sizeof(unsigned)))
break;
@@ -253,6 +273,12 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long 
address)
 
current_thread_info()-sig_on_uaccess_error = prev_sig_on_uaccess_error;
 
+   if (skip) {
+   if ((long)regs-ax = 0L) /* seccomp errno emulation */
+   goto do_ret;
+   goto done; /* seccomp trace/trap */
+   }
+
if (ret == -EFAULT) {
/* Bad news -- userspace fed a bad pointer to a vsyscall. */
warn_bad_vsyscall(KERN_INFO, regs,
@@ -271,10 +297,11 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long 
address)
 
regs-ax = ret;
 
+do_ret:
/* Emulate a ret instruction. */
regs-ip = caller;
regs-sp += 8;
-
+done:
return true;
 
 sigsegv:
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] x86/vsyscall: allow seccomp filter in vsyscall=emulate

2012-07-13 Thread Andrew Lutomirski
On Fri, Jul 13, 2012 at 10:06 AM, Will Drewry w...@chromium.org wrote:
 If a seccomp filter program is installed, older static binaries and
 distributions with older libc implementations (glibc 2.13 and earlier)
 that rely on vsyscall use will be terminated regardless of the filter
 program policy when executing time, gettimeofday, or getcpu.  This is
 only the case when vsyscall emulation is in use (vsyscall=emulate is the
 default).

 This patch emulates system call entry inside a vsyscall=emulate by
 populating regs-ax and regs-orig_ax with the system call number prior
 to calling into seccomp such that all seccomp-dependencies function
 normally.  Additionally, system call return behavior is emulated in line
 with other vsyscall entrypoints for the trace/trap cases.

 Reported-by: Owen Kibel qme...@gmail.com
 Signed-off-by: Will Drewry w...@chromium.org

 v2: - fixed ip and sp on SECCOMP_RET_TRAP/TRACE (thanks to l...@mit.edu)

 @@ -253,6 +273,12 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned 
 long address)

 current_thread_info()-sig_on_uaccess_error = 
 prev_sig_on_uaccess_error;

 +   if (skip) {
 +   if ((long)regs-ax = 0L) /* seccomp errno emulation */
 +   goto do_ret;
 +   goto done; /* seccomp trace/trap */
 +   }
 +
 if (ret == -EFAULT) {
 /* Bad news -- userspace fed a bad pointer to a vsyscall. */
 warn_bad_vsyscall(KERN_INFO, regs,
 @@ -271,10 +297,11 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned 
 long address)

 regs-ax = ret;

 +do_ret:
 /* Emulate a ret instruction. */
 regs-ip = caller;
 regs-sp += 8;
 -
 +done:
 return true;

  sigsegv:
 --
 1.7.9.5


This has the same odd property as the sigsegv path that the faulting
instruction will appear to be the mov, not the syscall.  That seems to
be okay, though -- various pieces of code that try to restart the segv
are okay with that.

Is there any code that assumes that changing rax (i.e. the syscall
number) and restarting a syscall after SIGSYS will invoke the new
syscall?  (The RET_TRACE path might be similar -- does the
ptrace_event(PTRACE_EVENT_SECCOMP, data) in seccomp.c give a debugger
a chance to synchronously cancel or change the syscall?

If those issues aren't problems, then:

Reviewed-by: Andy Lutomirski l...@amacapital.net

(If the syscall number needs to change after the fact in the
SECCOMP_RET_TRAP case, it'll be a mess.)

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] x86/vsyscall: allow seccomp filter in vsyscall=emulate

2012-07-13 Thread Will Drewry
On Fri, Jul 13, 2012 at 6:00 PM, Andrew Lutomirski l...@mit.edu wrote:
 On Fri, Jul 13, 2012 at 10:06 AM, Will Drewry w...@chromium.org wrote:
 If a seccomp filter program is installed, older static binaries and
 distributions with older libc implementations (glibc 2.13 and earlier)
 that rely on vsyscall use will be terminated regardless of the filter
 program policy when executing time, gettimeofday, or getcpu.  This is
 only the case when vsyscall emulation is in use (vsyscall=emulate is the
 default).

 This patch emulates system call entry inside a vsyscall=emulate by
 populating regs-ax and regs-orig_ax with the system call number prior
 to calling into seccomp such that all seccomp-dependencies function
 normally.  Additionally, system call return behavior is emulated in line
 with other vsyscall entrypoints for the trace/trap cases.

 Reported-by: Owen Kibel qme...@gmail.com
 Signed-off-by: Will Drewry w...@chromium.org

 v2: - fixed ip and sp on SECCOMP_RET_TRAP/TRACE (thanks to l...@mit.edu)

 @@ -253,6 +273,12 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned 
 long address)

 current_thread_info()-sig_on_uaccess_error = 
 prev_sig_on_uaccess_error;

 +   if (skip) {
 +   if ((long)regs-ax = 0L) /* seccomp errno emulation */
 +   goto do_ret;
 +   goto done; /* seccomp trace/trap */
 +   }
 +
 if (ret == -EFAULT) {
 /* Bad news -- userspace fed a bad pointer to a vsyscall. */
 warn_bad_vsyscall(KERN_INFO, regs,
 @@ -271,10 +297,11 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned 
 long address)

 regs-ax = ret;

 +do_ret:
 /* Emulate a ret instruction. */
 regs-ip = caller;
 regs-sp += 8;
 -
 +done:
 return true;

  sigsegv:
 --
 1.7.9.5


 This has the same odd property as the sigsegv path that the faulting
 instruction will appear to be the mov, not the syscall.  That seems to
 be okay, though -- various pieces of code that try to restart the segv
 are okay with that.

Yeah - I would otherwise do
  regs-ip += 9;
but I wanted to match the code that was therefor SIGSEGV.  If regs-ip
+= 9 _just_ for the SIGSYS case is fine, then I'll make that change
shortly.  Since any code that sees the vsyscall address should be wise
enough to avoid it, perhaps that's why the SIGSEGV hasn't had a
problem so far.

 Is there any code that assumes that changing rax (i.e. the syscall
 number) and restarting a syscall after SIGSYS will invoke the new
 syscall?  (The RET_TRACE path might be similar -- does the
 ptrace_event(PTRACE_EVENT_SECCOMP, data) in seccomp.c give a debugger
 a chance to synchronously cancel or change the syscall?

Unfortunately, it does in normal interception. I don't see any way out
of that quirk with vsyscall=emulate.  As is without seccomp,
vsyscall=emulate doesn't allow ptrace interception (or syscall
auditing for that matter) while vsyscall=native does.   So the option
here is to document the quirky interaction in
Documentation/prctl/seccomp_filter.txt.  In particular, if the tracer
sees either (time|gettimeofday|getcpu) and rip in the vsyscall page,
it will know it can't rewrite or bypass the call.Is there a better
option?

Given that, I will include a tweak to the documentation to indicate
that behavior so that userspace authors of BPF programs that use
SECCOMP_RET_TRACE will be aware of the behavior.

 If those issues aren't problems, then:

 Reviewed-by: Andy Lutomirski l...@amacapital.net

 (If the syscall number needs to change after the fact in the
 SECCOMP_RET_TRAP case, it'll be a mess.)

Nah - traps are delivered like the forced sigsegv path.

I'll spin a v3 soon including the documentation tweak and the ip
offset to match vsyscall=native behavior (regs-ip += 9 _just_ for the
skip case).  Of course, any better ideas for the trace-case will be
more than welcome, but it seems to me to be an acceptable tradeoff - I
hope others agree.

I'll make the changes and then put it through its paces to see if any
other little idiosyncrasies emerge.

Thanks for the close review!
will
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v2] x86/vsyscall: allow seccomp filter in vsyscall=emulate

2012-07-13 Thread Will Drewry
On Fri, Jul 13, 2012 at 7:48 PM, Will Drewry w...@chromium.org wrote:
 On Fri, Jul 13, 2012 at 6:00 PM, Andrew Lutomirski l...@mit.edu wrote:
 On Fri, Jul 13, 2012 at 10:06 AM, Will Drewry w...@chromium.org wrote:
 If a seccomp filter program is installed, older static binaries and
 distributions with older libc implementations (glibc 2.13 and earlier)
 that rely on vsyscall use will be terminated regardless of the filter
 program policy when executing time, gettimeofday, or getcpu.  This is
 only the case when vsyscall emulation is in use (vsyscall=emulate is the
 default).

 This patch emulates system call entry inside a vsyscall=emulate by
 populating regs-ax and regs-orig_ax with the system call number prior
 to calling into seccomp such that all seccomp-dependencies function
 normally.  Additionally, system call return behavior is emulated in line
 with other vsyscall entrypoints for the trace/trap cases.

 Reported-by: Owen Kibel qme...@gmail.com
 Signed-off-by: Will Drewry w...@chromium.org

 v2: - fixed ip and sp on SECCOMP_RET_TRAP/TRACE (thanks to l...@mit.edu)

 @@ -253,6 +273,12 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned 
 long address)

 current_thread_info()-sig_on_uaccess_error = 
 prev_sig_on_uaccess_error;

 +   if (skip) {
 +   if ((long)regs-ax = 0L) /* seccomp errno emulation */
 +   goto do_ret;
 +   goto done; /* seccomp trace/trap */
 +   }
 +
 if (ret == -EFAULT) {
 /* Bad news -- userspace fed a bad pointer to a vsyscall. */
 warn_bad_vsyscall(KERN_INFO, regs,
 @@ -271,10 +297,11 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned 
 long address)

 regs-ax = ret;

 +do_ret:
 /* Emulate a ret instruction. */
 regs-ip = caller;
 regs-sp += 8;
 -
 +done:
 return true;

  sigsegv:
 --
 1.7.9.5


 This has the same odd property as the sigsegv path that the faulting
 instruction will appear to be the mov, not the syscall.  That seems to
 be okay, though -- various pieces of code that try to restart the segv
 are okay with that.

 Yeah - I would otherwise do
   regs-ip += 9;
 but I wanted to match the code that was therefor SIGSEGV.  If regs-ip
 += 9 _just_ for the SIGSYS case is fine, then I'll make that change
 shortly.  Since any code that sees the vsyscall address should be wise
 enough to avoid it, perhaps that's why the SIGSEGV hasn't had a
 problem so far.

I dashed this off without more thought. It's best to leave it as is
because any return to the emulated page will cause a vsyscall fault
event.

 Is there any code that assumes that changing rax (i.e. the syscall
 number) and restarting a syscall after SIGSYS will invoke the new
 syscall?  (The RET_TRACE path might be similar -- does the
 ptrace_event(PTRACE_EVENT_SECCOMP, data) in seccomp.c give a debugger
 a chance to synchronously cancel or change the syscall?

 Unfortunately, it does in normal interception. I don't see any way out
 of that quirk with vsyscall=emulate.  As is without seccomp,
 vsyscall=emulate doesn't allow ptrace interception (or syscall
 auditing for that matter) while vsyscall=native does.   So the option
 here is to document the quirky interaction in
 Documentation/prctl/seccomp_filter.txt.  In particular, if the tracer
 sees either (time|gettimeofday|getcpu) and rip in the vsyscall page,
 it will know it can't rewrite or bypass the call.Is there a better
 option?

 Given that, I will include a tweak to the documentation to indicate
 that behavior so that userspace authors of BPF programs that use
 SECCOMP_RET_TRACE will be aware of the behavior.

 If those issues aren't problems, then:

 Reviewed-by: Andy Lutomirski l...@amacapital.net

 (If the syscall number needs to change after the fact in the
 SECCOMP_RET_TRAP case, it'll be a mess.)

 Nah - traps are delivered like the forced sigsegv path.

 I'll spin a v3 soon including the documentation tweak and the ip
 offset to match vsyscall=native behavior (regs-ip += 9 _just_ for the
 skip case).  Of course, any better ideas for the trace-case will be
 more than welcome, but it seems to me to be an acceptable tradeoff - I
 hope others agree.

 I'll make the changes and then put it through its paces to see if any
 other little idiosyncrasies emerge.

I've written up a documentation patch to accompany this one. It
reflects one more change I've made in a v3 of the patch, but it is
optional.  I've added support for SECCOMP_RET_TRACE to still
skip/emulate the system call if it desires.  In v2 it can't.  Either
way is fine in practice, but I'd need to change the accompanying
documentation.

thanks again!
will
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/