Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-29 Thread H. Peter Anvin
On 04/28/2014 09:36 PM, H. Peter Anvin wrote:
> 
> There are still things that need fixing: we need to go through the
> espfix path even when returning from NMI/MC (which fortunately can't
> nest with taking an NMI/MC on the espfix path itself, since in that case
> we will have been interrupted while running in the kernel with a kernel
> stack.)
> 
> (Cc: Rostedt because of the NMI issue.)
> 

NMI is fine: we go through irq_return except for nested NMI.  There are
only three IRETs in the kernel (irq_return, nested_nmi_out, and the
early trap handler) and all of them are good.

I think we just need to clean up the PV aspects of this and then we
should be in good shape.  I have done a bunch of cleanups to the
development git tree.

I'm considering making 16-bit segment support an EXPERT config option for
both 32- and 64-bit kernels, as it seems like a bit of a waste for
embedded systems which don't need this kind of backward compatibility.
Maybe that is something that can be left for someone else to implement
if they feel like it.

-hpa

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread H. Peter Anvin
On 04/28/2014 08:45 PM, H. Peter Anvin wrote:
> 
> OK, so I found a bug in ldttest.c -- it sets CS to an LDT segment, but
> it never sets SS to an LDT segment.  This means that it should really
> have zero footprint versus the espfix code, and implies that we instead
> have another bug involved.  Why the espfix code should have any effect
> whatsoever is a mystery, however... if it indeed does?
> 
> I have uploaded a fixed ldttest.c, but it seems we might be chasing more
> than that...
> 

With the test fixed, the bug was easy to find: we can't compare against
__KERNEL_DS in the doublefault handler, because both the live SS and the
SS image on the stack hold the NULL selector (zero).
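
The check described here can be sketched in C; the helper names and the
selector value are my illustration, not code from the actual patch:

```c
#include <assert.h>

/* Illustrative sketch of the doublefault-handler test described above.
 * When the IRET from the espfix stack faults, both the live SS and the
 * SS image on the stack hold the NULL selector (0), so a comparison
 * against __KERNEL_DS can never match; the handler has to test for
 * zero instead.  KERNEL_DS_SEL is an assumed value. */
#define KERNEL_DS_SEL 0x18

static int espfix_doublefault_wrong(unsigned short saved_ss)
{
    return saved_ss == KERNEL_DS_SEL;   /* the bug: never true here */
}

static int espfix_doublefault_fixed(unsigned short saved_ss)
{
    return saved_ss == 0;               /* NULL selector after the fault */
}
```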

With that both ldttest and run16 pass with the doublefault code, even
with randomization turned back on.

I have pushed out the fix.

There are still things that need fixing: we need to go through the
espfix path even when returning from NMI/MC (which fortunately can't
nest with taking an NMI/MC on the espfix path itself, since in that case
we will have been interrupted while running in the kernel with a kernel
stack.)
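
For context, whether a returning SS can need the espfix fixup at all comes
down to the table-indicator bit of the selector; a minimal sketch (my own
illustration, not code from the patch):

```c
#include <assert.h>

/* An x86 segment selector is (index << 3) | TI | RPL; TI (bit 2) set
 * means the selector indexes the LDT rather than the GDT.  Only LDT
 * descriptors can be the 16-bit stack segments espfix exists for, so
 * an exit path -- including the NMI/MC return mentioned above -- can
 * key off this bit. */
static int ss_is_ldt_selector(unsigned short ss)
{
    return (ss & 0x4) != 0;
}
```

For example, a GDT selector such as 0x2b has the TI bit clear, while any
LDT selector has it set.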

(Cc: Rostedt because of the NMI issue.)

-hpa



Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread H. Peter Anvin
On 04/28/2014 08:45 PM, H. Peter Anvin wrote:
> 
> OK, so I found a bug in ldttest.c -- it sets CS to an LDT segment, but
> it never sets SS to an LDT segment.  This means that it should really
> have zero footprint versus the espfix code, and implies that we instead
> have another bug involved.  Why the espfix code should have any effect
> whatsoever is a mystery, however... if it indeed does?
> 
> I have uploaded a fixed ldttest.c, but it seems we might be chasing more
> than that...
> 

In particular, I had already wondered how we avoid an "upside-down
swapgs" with a #GP on IRET.  The answer might be that we don't...
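
The hazard can be pictured with the usual entry-path heuristic; this is my
sketch of the failure mode, not code from the tree:

```c
#include <assert.h>

/* Exception entry normally decides whether to SWAPGS by looking at the
 * RPL of the interrupted CS: RPL 3 means we arrived from user mode and
 * GSBASE still holds the user value.  A #GP taken on the exit IRET is
 * the awkward case: the saved CS is a kernel selector (RPL 0), so the
 * heuristic says "don't swap" -- yet SWAPGS already ran on the way out,
 * leaving the user GSBASE live while executing kernel code. */
static int entry_does_swapgs(unsigned short saved_cs)
{
    return (saved_cs & 3) == 3;   /* heuristic: swap only if from CPL 3 */
}
```

The IRET-#GP case defeats this heuristic, which is exactly the
"upside-down" scenario.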

-hpa




Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread H. Peter Anvin
On 04/28/2014 07:38 PM, H. Peter Anvin wrote:
> On 04/28/2014 05:20 PM, Andrew Lutomirski wrote:
>>
>> ldttest segfaults on 3.13 and 3.14 for me.  It reboots (triple fault?)
>> on your branch.  It even said this:
>>
>> qemu-system-x86_64: 9pfs:virtfs_reset: One or more uncluncked fids
>> found during reset
>>
>> I have no idea what an uncluncked fd is :)
>>
> 
> It means 9p wasn't properly shut down.
> 

OK, so I found a bug in ldttest.c -- it sets CS to an LDT segment, but
it never sets SS to an LDT segment.  This means that it should really
have zero footprint versus the espfix code, and implies that we instead
have another bug involved.  Why the espfix code should have any effect
whatsoever is a mystery, however... if it indeed does?

I have uploaded a fixed ldttest.c, but it seems we might be chasing more
than that...
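
For reference, the selector a test like ldttest.c has to load into SS
(after installing an entry with modify_ldt(2)) is formed as sketched below;
the helper is illustrative, not code from the test itself:

```c
#include <assert.h>

/* Builds a user-mode LDT selector: (index << 3) | TI | RPL, with
 * TI = 4 (LDT rather than GDT) and RPL = 3 (ring 3).  Loading such a
 * value into SS -- not just CS -- is what actually exercises the
 * espfix code, which is the part the original test missed. */
static unsigned short ldt_selector(unsigned int index)
{
    return (unsigned short)((index << 3) | 0x4 | 0x3);
}
```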

-hpa




Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread H. Peter Anvin
On 04/28/2014 07:38 PM, H. Peter Anvin wrote:
> On 04/28/2014 05:20 PM, Andrew Lutomirski wrote:
>>
>> ldttest segfaults on 3.13 and 3.14 for me.  It reboots (triple fault?)
>> on your branch.  It even said this:
>>
>> qemu-system-x86_64: 9pfs:virtfs_reset: One or more uncluncked fids
>> found during reset
>>
>> I have no idea what an uncluncked fd is :)
>>
> 
> It means 9p wasn't properly shut down.
> 

(A "fid" is like the 9p version of a file descriptor.  Sort of.)

-hpa




Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread H. Peter Anvin
On 04/28/2014 05:20 PM, Andrew Lutomirski wrote:
> 
> ldttest segfaults on 3.13 and 3.14 for me.  It reboots (triple fault?)
> on your branch.  It even said this:
> 
> qemu-system-x86_64: 9pfs:virtfs_reset: One or more uncluncked fids
> found during reset
> 
> I have no idea what an uncluncked fd is :)
> 

It means 9p wasn't properly shut down.

-hpa



Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread Andrew Lutomirski
On Mon, Apr 28, 2014 at 5:02 PM, Andrew Lutomirski  wrote:
> On Mon, Apr 28, 2014 at 4:08 PM, H. Peter Anvin  wrote:
>> On 04/28/2014 04:05 PM, H. Peter Anvin wrote:
>>>
>>> So I tried writing this bit up, but it fails in some rather spectacular
>>> ways.  Furthermore, I have been unable to debug it under Qemu, because
>>> breakpoints don't work right (common Qemu problem, sadly.)
>>>
>>> The kernel code is at:
>>>
>>> https://git.kernel.org/cgit/linux/kernel/git/hpa/espfix64.git/
>>>
>>> There are two tests:
>>>
>>> git://git.zytor.com/users/hpa/test16/test16.git, build it, and run
>>> ./run16 test/hello.elf
>>> http://www.zytor.com/~hpa/ldttest.c
>>>
>>> The former will exercise the irq_return_ldt path, but not the fault
>>> path; the latter will exercise the fault path, but doesn't actually use
>>> a 16-bit segment.
>>>
>>> Under the 3.14 stock kernel, the former should die with SIGBUS and the
>>> latter should pass.
>>>
>>
>> Current status of the above code: if I remove the randomization in
>> espfix_64.c then the first test passes; the second generally crashes the
>> machine.  With the randomization there, both generally crash the machine.
>>
>> All my testing so far has been under KVM or Qemu, so there is always the
>> possibility that I'm chasing a KVM/Qemu bug, but I suspect it is
>> something simpler than that.
>
> I'm compiling your branch.  In the mean time, two possibly stupid questions:
>
> What's the assembly code in the double-fault entry for?
>
> Have you tried hbreak in qemu?  I've had better luck with hbreak than
> regular break in the past.
>

ldttest segfaults on 3.13 and 3.14 for me.  It reboots (triple fault?)
on your branch.  It even said this:

qemu-system-x86_64: 9pfs:virtfs_reset: One or more uncluncked fids
found during reset

I have no idea what an uncluncked fd is :)

hello.elf fails to sigbus.  weird.  gdb says:

1: x/i $pc
=> 0x8170559c <irq_return_ldt+90>:	jmp    0x81705537 <irq_return_iret>
(gdb) si
<signal handler called>
1: x/i $pc
=> 0x81705537 <irq_return_iret>:	iretq
(gdb) si
Cannot access memory at address 0xf000f
(gdb) info registers
rax            0xffe4000f1000      -7881234923384832
rbx            0x10001068719476752
rcx            0xffe4f558f000      -7611541041909760
rdx            0x805d000           134598656
rsi            0x10217ffe3283772784279523
rdi            0xf000764424509447
rbp            0xf000f             0xf000f
rsp            0xf000f             0xf000f
r8             0x0                 0
r9             0x0                 0
r10            0x0                 0
r11            0x0                 0
r12            0x0                 0
r13            0x0                 0
r14            0x0                 0
r15            0x0                 0
rip            0x0                 0x0 <irq_stack_union>
eflags         0x0                 [ ]
cs             0x0                 0
ss             0x37f895
ds             0x0                 0
es             0x0                 0
fs             0x0                 0
---Type <return> to continue, or q <return> to quit---
gs             0x0                 0

I got this with 'hbreak irq_return_ldt' using 'target remote :1234'
and virtme-run --console --kimg
~/apps/linux-devel/arch/x86/boot/bzImage --qemu-opts -s

This set of registers looks thoroughly bogus.  I don't trust it.  I'm
now stuck -- single-stepping stays exactly where it started.
Something is rather screwed up here.  Telling gdb to continue causes
gdb to explode and 'Hello, Afterworld!' to be displayed.

I was not able to get a breakpoint on __do_double_fault to hit.

FWIW, I think that gdb is known to have issues debugging a guest that
switches bitness.

--Andy


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread H. Peter Anvin
On 04/28/2014 05:02 PM, Andrew Lutomirski wrote:
> 
> I'm compiling your branch.  In the mean time, two possibly stupid questions:
> 
> What's the assembly code in the double-fault entry for?
> 

It was easier for me to add it there than adding all the glue
(prototypes and so on) to put it into C code... can convert it to C when
it works.

> Have you tried hbreak in qemu?  I've had better luck with hbreak than
> regular break in the past.

Yes, no real change.

-hpa




Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread Andrew Lutomirski
On Mon, Apr 28, 2014 at 4:08 PM, H. Peter Anvin  wrote:
> On 04/28/2014 04:05 PM, H. Peter Anvin wrote:
>>
>> So I tried writing this bit up, but it fails in some rather spectacular
>> ways.  Furthermore, I have been unable to debug it under Qemu, because
>> breakpoints don't work right (common Qemu problem, sadly.)
>>
>> The kernel code is at:
>>
>> https://git.kernel.org/cgit/linux/kernel/git/hpa/espfix64.git/
>>
>> There are two tests:
>>
>> git://git.zytor.com/users/hpa/test16/test16.git, build it, and run
>> ./run16 test/hello.elf
>> http://www.zytor.com/~hpa/ldttest.c
>>
>> The former will exercise the irq_return_ldt path, but not the fault
>> path; the latter will exercise the fault path, but doesn't actually use
>> a 16-bit segment.
>>
>> Under the 3.14 stock kernel, the former should die with SIGBUS and the
>> latter should pass.
>>
>
> Current status of the above code: if I remove the randomization in
> espfix_64.c then the first test passes; the second generally crashes the
> machine.  With the randomization there, both generally crash the machine.
>
> All my testing so far has been under KVM or Qemu, so there is always the
> possibility that I'm chasing a KVM/Qemu bug, but I suspect it is
> something simpler than that.

I'm compiling your branch.  In the mean time, two possibly stupid questions:

What's the assembly code in the double-fault entry for?

Have you tried hbreak in qemu?  I've had better luck with hbreak than
regular break in the past.

--Andy


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread H. Peter Anvin
On 04/28/2014 04:05 PM, H. Peter Anvin wrote:
> 
> So I tried writing this bit up, but it fails in some rather spectacular
> ways.  Furthermore, I have been unable to debug it under Qemu, because
> breakpoints don't work right (common Qemu problem, sadly.)
> 
> The kernel code is at:
> 
> https://git.kernel.org/cgit/linux/kernel/git/hpa/espfix64.git/
> 
> There are two tests:
> 
> git://git.zytor.com/users/hpa/test16/test16.git, build it, and run
> ./run16 test/hello.elf
> http://www.zytor.com/~hpa/ldttest.c
> 
> The former will exercise the irq_return_ldt path, but not the fault
> path; the latter will exercise the fault path, but doesn't actually use
> a 16-bit segment.
> 
> Under the 3.14 stock kernel, the former should die with SIGBUS and the
> latter should pass.
> 

Current status of the above code: if I remove the randomization in
espfix_64.c then the first test passes; the second generally crashes the
machine.  With the randomization there, both generally crash the machine.

All my testing so far has been under KVM or Qemu, so there is always the
possibility that I'm chasing a KVM/Qemu bug, but I suspect it is
something simpler than that.

-hpa




Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread H. Peter Anvin
On 04/23/2014 09:53 PM, Andrew Lutomirski wrote:
> 
> This particular vector hurts: you can safely keep trying until it works.
> 
> This just gave me an evil idea: what if we make the whole espfix area
> read-only?  This has some weird effects.  To switch to the espfix
> stack, you have to write to an alias.  That's a little strange but
> harmless and barely complicates the implementation.  If the iret
> faults, though, I think the result will be a #DF.  This may actually
> be a good thing: if the #DF handler detects that the cause was a bad
> espfix iret, it could just return directly to bad_iret or send the
> signal itself the same way that do_stack_segment does.  This could
> even be written in C :)
> 
> Peter, is this idea completely nuts?  The only exceptions that can
> happen there are NMI, MCE, #DB, #SS, and #GP.  The first four use IST,
> so they won't double-fault.
> 

So I tried writing this bit up, but it fails in some rather spectacular
ways.  Furthermore, I have been unable to debug it under Qemu, because
breakpoints don't work right (common Qemu problem, sadly.)

The kernel code is at:

https://git.kernel.org/cgit/linux/kernel/git/hpa/espfix64.git/

There are two tests:

git://git.zytor.com/users/hpa/test16/test16.git, build it, and run
./run16 test/hello.elf
http://www.zytor.com/~hpa/ldttest.c

The former will exercise the irq_return_ldt path, but not the fault
path; the latter will exercise the fault path, but doesn't actually use
a 16-bit segment.

Under the 3.14 stock kernel, the former should die with SIGBUS and the
latter should pass.

-hpa



Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread Konrad Rzeszutek Wilk
On Wed, Apr 23, 2014 at 09:56:00AM -0700, H. Peter Anvin wrote:
> On 04/23/2014 07:24 AM, Boris Ostrovsky wrote:
> >>
> >> Konrad - I really could use some help figuring out what needs to be done
> >> for this not to break Xen.
> > 
> > This does break Xen PV:
> > 
> 
> I know it does.  This is why I asked for help.

This week is chaotic for me, but I am taking a stab at it.  Should have
something by the end of the week on top of your patch.

> 
> This is fundamentally the problem with PV and *especially* the way Xen
> PV was integrated into Linux: it acts as a development brake for native
> hardware.  Fortunately, Konrad has been quite responsive to that kind of
> problems, which hasn't always been true of the Xen community in the past.

Thank you for such kind words!

I hope that in Chicago you will have a chance to meet other folks who
are involved in Xen and form a similar opinion of them.

Cheers!

> 
>   -hpa
> 


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread Konrad Rzeszutek Wilk
On Wed, Apr 23, 2014 at 09:56:00AM -0700, H. Peter Anvin wrote:
 On 04/23/2014 07:24 AM, Boris Ostrovsky wrote:
 
  Konrad - I really could use some help figuring out what needs to be done
  for this not to break Xen.
  
  This does break Xen PV:
  
 
 I know it does.  This is why I asked for help.

This week is chaotic for me but I taking a stab at it. Should have
something by the end of the week on top of your patch.

 
 This is fundamentally the problem with PV and *especially* the way Xen
 PV was integrated into Linux: it acts as a development brake for native
 hardware.  Fortunately, Konrad has been quite responsive to that kind of
 problems, which hasn't always been true of the Xen community in the past.

Thank you for such kind words!

I hope that in Chicago you will be have a chance to meet other folks that
are involved in Xen and formulate a similar opinion of them.

Cheers!

 
   -hpa
 
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread H. Peter Anvin
On 04/23/2014 09:53 PM, Andrew Lutomirski wrote:
 
 This particular vector hurts: you can safely keep trying until it works.
 
 This just gave me an evil idea: what if we make the whole espfix area
 read-only?  This has some weird effects.  To switch to the espfix
 stack, you have to write to an alias.  That's a little strange but
 harmless and barely complicates the implementation.  If the iret
 faults, though, I think the result will be a #DF.  This may actually
 be a good thing: if the #DF handler detects that the cause was a bad
 espfix iret, it could just return directly to bad_iret or send the
 signal itself the same way that do_stack_segment does.  This could
 even be written in C :)
 
 Peter, is this idea completely nuts?  The only exceptions that can
 happen there are NMI, MCE, #DB, #SS, and #GP.  The first four use IST,
 so they won't double-fault.
 

So I tried writing this bit up, but it fails in some rather spectacular
ways.  Furthermore, I have been unable to debug it under Qemu, because
breakpoints don't work right (common Qemu problem, sadly.)

The kernel code is at:

https://git.kernel.org/cgit/linux/kernel/git/hpa/espfix64.git/

There are two tests:

git://git.zytor.com/users/hpa/test16/test16.git, build it, and run
./run16 test/hello.elf
http://www.zytor.com/~hpa/ldttest.c

The former will exercise the irq_return_ldt path, but not the fault
path; the latter will exercise the fault path, but doesn't actually use
a 16-bit segment.

Under the 3.14 stock kernel, the former should die with SIGBUS and the
latter should pass.

-hpa

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread H. Peter Anvin
On 04/28/2014 04:05 PM, H. Peter Anvin wrote:
 
 So I tried writing this bit up, but it fails in some rather spectacular
 ways.  Furthermore, I have been unable to debug it under Qemu, because
 breakpoints don't work right (common Qemu problem, sadly.)
 
 The kernel code is at:
 
 https://git.kernel.org/cgit/linux/kernel/git/hpa/espfix64.git/
 
 There are two tests:
 
 git://git.zytor.com/users/hpa/test16/test16.git, build it, and run
 ./run16 test/hello.elf
 http://www.zytor.com/~hpa/ldttest.c
 
 The former will exercise the irq_return_ldt path, but not the fault
 path; the latter will exercise the fault path, but doesn't actually use
 a 16-bit segment.
 
 Under the 3.14 stock kernel, the former should die with SIGBUS and the
 latter should pass.
 

Current status of the above code: if I remove the randomization in
espfix_64.c then the first test passes; the second generally crashes the
machine.  With the randomization there, both generally crash the machine.

All my testing so far has been under KVM or Qemu, so there is always the
possibility that I'm chasing a KVM/Qemu bug, but I suspect it is
something simpler than that.

-hpa


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread Andrew Lutomirski
On Mon, Apr 28, 2014 at 4:08 PM, H. Peter Anvin h...@linux.intel.com wrote:
 On 04/28/2014 04:05 PM, H. Peter Anvin wrote:

 So I tried writing this bit up, but it fails in some rather spectacular
 ways.  Furthermore, I have been unable to debug it under Qemu, because
 breakpoints don't work right (common Qemu problem, sadly.)

 The kernel code is at:

 https://git.kernel.org/cgit/linux/kernel/git/hpa/espfix64.git/

 There are two tests:

 git://git.zytor.com/users/hpa/test16/test16.git, build it, and run
 ./run16 test/hello.elf
 http://www.zytor.com/~hpa/ldttest.c

 The former will exercise the irq_return_ldt path, but not the fault
 path; the latter will exercise the fault path, but doesn't actually use
 a 16-bit segment.

 Under the 3.14 stock kernel, the former should die with SIGBUS and the
 latter should pass.


 Current status of the above code: if I remove the randomization in
 espfix_64.c then the first test passes; the second generally crashes the
 machine.  With the randomization there, both generally crash the machine.

 All my testing so far has been under KVM or Qemu, so there is always the
 possibility that I'm chasing a KVM/Qemu bug, but I suspect it is
 something simpler than that.

I'm compiling your branch.  In the mean time, two possibly stupid questions:

What's the assembly code in the double-fault entry for?

Have you tried hbreak in qemu?  I've had better luck with hbreak than
regular break in the past.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread H. Peter Anvin
On 04/28/2014 05:02 PM, Andrew Lutomirski wrote:
 
 I'm compiling your branch.  In the mean time, two possibly stupid questions:
 
 What's the assembly code in the double-fault entry for?
 

It was easier for me to add it there than adding all the glue
(prototypes and so on) to put it into C code... can convert it to C when
it works.

 Have you tried hbreak in qemu?  I've had better luck with hbreak than
 regular break in the past.

Yes, no real change.

-hpa


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread Andrew Lutomirski
On Mon, Apr 28, 2014 at 5:02 PM, Andrew Lutomirski aml...@gmail.com wrote:
 On Mon, Apr 28, 2014 at 4:08 PM, H. Peter Anvin h...@linux.intel.com wrote:
 On 04/28/2014 04:05 PM, H. Peter Anvin wrote:

 So I tried writing this bit up, but it fails in some rather spectacular
 ways.  Furthermore, I have been unable to debug it under Qemu, because
 breakpoints don't work right (common Qemu problem, sadly.)

 The kernel code is at:

 https://git.kernel.org/cgit/linux/kernel/git/hpa/espfix64.git/

 There are two tests:

 git://git.zytor.com/users/hpa/test16/test16.git, build it, and run
 ./run16 test/hello.elf
 http://www.zytor.com/~hpa/ldttest.c

 The former will exercise the irq_return_ldt path, but not the fault
 path; the latter will exercise the fault path, but doesn't actually use
 a 16-bit segment.

 Under the 3.14 stock kernel, the former should die with SIGBUS and the
 latter should pass.


 Current status of the above code: if I remove the randomization in
 espfix_64.c then the first test passes; the second generally crashes the
 machine.  With the randomization there, both generally crash the machine.

 All my testing so far has been under KVM or Qemu, so there is always the
 possibility that I'm chasing a KVM/Qemu bug, but I suspect it is
 something simpler than that.

 I'm compiling your branch.  In the mean time, two possibly stupid questions:

 What's the assembly code in the double-fault entry for?

 Have you tried hbreak in qemu?  I've had better luck with hbreak than
 regular break in the past.


ldttest segfaults on 3.13 and 3.14 for me.  It reboots (triple fault?)
on your branch.  It even said this:

qemu-system-x86_64: 9pfs:virtfs_reset: One or more uncluncked fids
found during reset

I have no idea what an uncluncked fd is :)

hello.elf fails to sigbus.  weird.  gdb says:

1: x/i $pc
= 0x8170559c irq_return_ldt+90:
jmp0x81705537 irq_return_iret
(gdb) si
signal handler called
1: x/i $pc
= 0x81705537 irq_return_iret:iretq
(gdb) si
Cannot access memory at address 0xf000f
(gdb) info registers
rax0xffe4000f1000-7881234923384832
rbx0x10001068719476752
rcx0xffe4f558f000-7611541041909760
rdx0x805d000134598656
rsi0x10217ffe3283772784279523
rdi0xf000764424509447
rbp0xf000f0xf000f
rsp0xf000f0xf000f
r8 0x00
r9 0x00
r100x00
r110x00
r120x00
r130x00
r140x00
r150x00
rip0x00x0 irq_stack_union
eflags 0x0[ ]
cs 0x00
ss 0x37f895
ds 0x00
es 0x00
fs 0x00
---Type return to continue, or q return to quit---
gs 0x00

I got this with 'hbreak irq_return_ldt' using 'target remote :1234'
and virtme-run --console --kimg
~/apps/linux-devel/arch/x86/boot/bzImage --qemu-opts -s

This set of registers looks thoroughly bogus.  I don't trust it.  I'm
now stuck -- single-stepping stays exactly where it started.
Something is rather screwed up here.  Telling gdb to continue causes
gdb to explode and 'Hello, Afterworld!' to be displayed.

I was not able to get a breakpoint on __do_double_fault to hit.

FWIW, I think that gdb is known to have issues debugging a guest that
switches bitness.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread H. Peter Anvin
On 04/28/2014 05:20 PM, Andrew Lutomirski wrote:
 
 ldttest segfaults on 3.13 and 3.14 for me.  It reboots (triple fault?)
 on your branch.  It even said this:
 
 qemu-system-x86_64: 9pfs:virtfs_reset: One or more uncluncked fids
 found during reset
 
 I have no idea what an uncluncked fd is :)
 

It means 9p wasn't properly shut down.

-hpa

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread H. Peter Anvin
On 04/28/2014 07:38 PM, H. Peter Anvin wrote:
 On 04/28/2014 05:20 PM, Andrew Lutomirski wrote:

 ldttest segfaults on 3.13 and 3.14 for me.  It reboots (triple fault?)
 on your branch.  It even said this:

 qemu-system-x86_64: 9pfs:virtfs_reset: One or more uncluncked fids
 found during reset

 I have no idea what an uncluncked fd is :)

 
 It means 9p wasn't properly shut down.
 

(A fid is like the 9p version of a file descriptor.  Sort of.)

-hpa


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread H. Peter Anvin
On 04/28/2014 07:38 PM, H. Peter Anvin wrote:
 On 04/28/2014 05:20 PM, Andrew Lutomirski wrote:

 ldttest segfaults on 3.13 and 3.14 for me.  It reboots (triple fault?)
 on your branch.  It even said this:

 qemu-system-x86_64: 9pfs:virtfs_reset: One or more uncluncked fids
 found during reset

 I have no idea what an uncluncked fd is :)

 
 It means 9p wasn't properly shut down.
 

OK, so I found a bug in ldttest.c -- it sets CS to an LDT segment, but
it never sets SS to an LDT segment.  This means that it should really
have zero footprint versus the espfix code, and implies that we instead
have another bug involved.  Why the espfix code should have any effect
whatsoever is a mystery, however... if it indeed does?

I have uploaded a fixed ldttest.c, but it seems we might be chasing more
than that...

-hpa


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread H. Peter Anvin
On 04/28/2014 08:45 PM, H. Peter Anvin wrote:
> 
> OK, so I found a bug in ldttest.c -- it sets CS to an LDT segment, but
> it never sets SS to an LDT segment.  This means that it should really
> have zero footprint versus the espfix code, and implies that we instead
> have another bug involved.  Why the espfix code should have any effect
> whatsoever is a mystery, however... if it indeed does?
> 
> I have uploaded a fixed ldttest.c, but it seems we might be chasing more
> than that...
> 

In particular, I was already wondering how we avoid an upside-down
swapgs with a #GP on IRET.  The answer might be that we don't...

-hpa




Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-28 Thread H. Peter Anvin
On 04/28/2014 08:45 PM, H. Peter Anvin wrote:
> 
> OK, so I found a bug in ldttest.c -- it sets CS to an LDT segment, but
> it never sets SS to an LDT segment.  This means that it should really
> have zero footprint versus the espfix code, and implies that we instead
> have another bug involved.  Why the espfix code should have any effect
> whatsoever is a mystery, however... if it indeed does?
> 
> I have uploaded a fixed ldttest.c, but it seems we might be chasing more
> than that...
> 

With the test fixed, the bug was easy to find: we can't compare against
__KERNEL_DS in the doublefault handler, because both SS and the image on
the stack have the stack segment set to zero (NULL).

With that both ldttest and run16 pass with the doublefault code, even
with randomization turned back on.

I have pushed out the fix.

There are still things that need fixing: we need to go through the
espfix path even when returning from NMI/MC (which fortunately can't
nest with taking an NMI/MC on the espfix path itself, since in that case
we will have been interrupted while running in the kernel with a kernel
stack.)

(Cc: Rostedt because of the NMI issue.)

-hpa



Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-25 Thread H. Peter Anvin
On 04/25/2014 05:02 AM, Pavel Machek wrote:
> 
> Just to understand the consequences -- we leak 16 bit of kernel data
> to the userspace, right? Because it is %esp, we know that we leak
> stack address, which is not too sensitive, but will make kernel
> address randomization less useful...?
> 

It is rather sensitive, in fact.

>> The 64-bit implementation works like this:
>>
>> Set up a ministack for each CPU, which is then mapped 65536 times
>> using the page tables.  This implementation uses the second-to-last
>> PGD slot for this; with a 64-byte espfix stack this is sufficient for
>> 2^18 CPUs (currently we support a max of 2^13 CPUs.)
> 
> 16-bit stack segments on 64-bit machine. Who still uses it? Dosemu?
> Wine? Would the solution be to disallow that?

Welcome to the show.  We do, in fact, disallow it now in the 3.15-rc
series.  The Wine guys are complaining.

-hpa




Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-25 Thread H. Peter Anvin
On 04/25/2014 02:02 PM, Konrad Rzeszutek Wilk wrote:
> 
> Any particular reason you are using __pgd
> 
> __pud
> 
> and __pmd?
> 
> and __pte instead of the 'pgd', 'pud', 'pmd' and 'pte' macros?
> 

Not that I know of other than that the semantics of the various macros
are not described anywhere to the best of my knowledge.

-hpa




Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-25 Thread Konrad Rzeszutek Wilk
On Tue, Apr 22, 2014 at 06:17:21PM -0700, H. Peter Anvin wrote:
> Another spin of the prototype.  This one avoids the espfix for anything
> but #GP, and avoids save/restore/saving registers... one can wonder,
> though, how much that actually matters in practice.
> 
> It still does redundant SWAPGS on the slow path.  I'm not sure I
> personally care enough to optimize that, as it means some fairly
> significant restructuring of some of the code paths.  Some of that
> restructuring might actually be beneficial, but still...

Sorry about being late to the party.


 .. snip..
> diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
> new file mode 100644
> index ..05567d706f92
> --- /dev/null
> +++ b/arch/x86/kernel/espfix_64.c
> @@ -0,0 +1,136 @@
> +/* --- *
> + *
> + *   Copyright 2014 Intel Corporation; author: H. Peter Anvin
> + *
> + *   This file is part of the Linux kernel, and is made available under
> + *   the terms of the GNU General Public License version 2 or (at your
> + *   option) any later version; incorporated herein by reference.
> + *
> + * --- */
> +
> +#include <linux/init.h>
> +#include <linux/kernel.h>
> +#include <linux/percpu.h>
> +#include <linux/gfp.h>
> +#include <asm/pgtable.h>
> +
> +#define ESPFIX_STACK_SIZE	64UL
> +#define ESPFIX_STACKS_PER_PAGE   (PAGE_SIZE/ESPFIX_STACK_SIZE)
> +
> +#define ESPFIX_MAX_CPUS (ESPFIX_STACKS_PER_PAGE << 
> (PGDIR_SHIFT-PAGE_SHIFT-16))
> +#if CONFIG_NR_CPUS > ESPFIX_MAX_CPUS
> +# error "Need more than one PGD for the ESPFIX hack"
> +#endif
> +
> +#define ESPFIX_BASE_ADDR (-2UL << PGDIR_SHIFT)
> +
> +#define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_REPEAT | __GFP_ZERO)
> +
> +/* This contains the *bottom* address of the espfix stack */
> +DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
> +
> +/* Initialization mutex - should this be a spinlock? */
> +static DEFINE_MUTEX(espfix_init_mutex);
> +
> +/* Page allocation bitmap - each page serves ESPFIX_STACKS_PER_PAGE CPUs */
> +#define ESPFIX_MAX_PAGES  DIV_ROUND_UP(CONFIG_NR_CPUS, 
> ESPFIX_STACKS_PER_PAGE)
> +#define ESPFIX_MAP_SIZE   DIV_ROUND_UP(ESPFIX_MAX_PAGES, BITS_PER_LONG)
> +static unsigned long espfix_page_alloc_map[ESPFIX_MAP_SIZE];
> +
> +static __page_aligned_bss pud_t espfix_pud_page[PTRS_PER_PUD]
> + __aligned(PAGE_SIZE);
> +
> +/*
> + * This returns the bottom address of the espfix stack for a specific CPU.
> + * The math allows for a non-power-of-two ESPFIX_STACK_SIZE, in which case
> + * we have to account for some amount of padding at the end of each page.
> + */
> +static inline unsigned long espfix_base_addr(unsigned int cpu)
> +{
> + unsigned long page, addr;
> +
> + page = (cpu / ESPFIX_STACKS_PER_PAGE) << PAGE_SHIFT;
> + addr = page + (cpu % ESPFIX_STACKS_PER_PAGE) * ESPFIX_STACK_SIZE;
> + addr = (addr & 0xffffUL) | ((addr & ~0xffffUL) << 16);
> + addr += ESPFIX_BASE_ADDR;
> + return addr;
> +}
> +
> +#define PTE_STRIDE	(65536/PAGE_SIZE)
> +#define ESPFIX_PTE_CLONES (PTRS_PER_PTE/PTE_STRIDE)
> +#define ESPFIX_PMD_CLONES PTRS_PER_PMD
> +#define ESPFIX_PUD_CLONES (65536/(ESPFIX_PTE_CLONES*ESPFIX_PMD_CLONES))
> +
> +void init_espfix_this_cpu(void)
> +{
> + unsigned int cpu, page;
> + unsigned long addr;
> + pgd_t pgd, *pgd_p;
> + pud_t pud, *pud_p;
> + pmd_t pmd, *pmd_p;
> + pte_t pte, *pte_p;
> + int n;
> + void *stack_page;
> + pteval_t ptemask;
> +
> + /* We only have to do this once... */
> + if (likely(this_cpu_read(espfix_stack)))
> + return; /* Already initialized */
> +
> + cpu = smp_processor_id();
> + addr = espfix_base_addr(cpu);
> + page = cpu/ESPFIX_STACKS_PER_PAGE;
> +
> + /* Did another CPU already set this up? */
> + if (likely(test_bit(page, espfix_page_alloc_map)))
> + goto done;
> +
> + mutex_lock(&espfix_init_mutex);
> +
> + /* Did we race on the lock? */
> + if (unlikely(test_bit(page, espfix_page_alloc_map)))
> + goto unlock_done;
> +
> + ptemask = __supported_pte_mask;
> +
> + pgd_p = &init_level4_pgt[pgd_index(addr)];
> + pgd = *pgd_p;
> + if (!pgd_present(pgd)) {
> + /* This can only happen on the BSP */
> + pgd = __pgd(__pa_symbol(espfix_pud_page) |

Any particular reason you are using __pgd

> + (_KERNPG_TABLE & ptemask));
> + set_pgd(pgd_p, pgd);
> + }
> +
> + pud_p = &espfix_pud_page[pud_index(addr)];
> + pud = *pud_p;
> + if (!pud_present(pud)) {
> + pmd_p = (pmd_t *)__get_free_page(PGALLOC_GFP);
> + pud = __pud(__pa(pmd_p) | (_KERNPG_TABLE & ptemask));

__pud
> + for (n = 0; n < ESPFIX_PUD_CLONES; n++)
> + set_pud(&pud_p[n], pud);
> + }
> +
> + pmd_p = pmd_offset(&pud, addr);
> + pmd = *pmd_p;
> + if (!pmd_present(pmd)) {
> + pte_p = (pte_t *)__get_free_page(PGALLOC_GFP);

Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-25 Thread Pavel Machek
Hi!

> This is a prototype of espfix for the 64-bit kernel.  espfix is a
> workaround for the architectural definition of IRET, which fails to
> restore bits [31:16] of %esp when returning to a 16-bit stack
> segment.  We have a workaround for the 32-bit kernel, but that
> implementation doesn't work for 64 bits.

Just to understand the consequences -- we leak 16 bit of kernel data
to the userspace, right? Because it is %esp, we know that we leak
stack address, which is not too sensitive, but will make kernel
address randomization less useful...?

> The 64-bit implementation works like this:
> 
> Set up a ministack for each CPU, which is then mapped 65536 times
> using the page tables.  This implementation uses the second-to-last
> PGD slot for this; with a 64-byte espfix stack this is sufficient for
> 2^18 CPUs (currently we support a max of 2^13 CPUs.)

16-bit stack segments on 64-bit machine. Who still uses it? Dosemu?
Wine? Would the solution be to disallow that?
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html




Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-24 Thread Andrew Lutomirski
On Thu, Apr 24, 2014 at 3:37 PM, H. Peter Anvin <h...@linux.intel.com> wrote:
> On 04/24/2014 03:31 PM, Andrew Lutomirski wrote:
>>
>> I was imagining just randomizing a couple of high bits so the whole
>> espfix area moves as a unit.
>>
>>> We could XOR with a random constant with no penalty at all.  Only
>>> problem is that this happens early, so the entropy system is not yet
>>> available.  Fine if we have RDRAND, but...
>>
>> How many people have SMAP and not RDRAND?  I think this is a complete
>> nonissue for non-SMAP systems.
>>
>
> Most likely none, unless some "clever" virtualizer turns off RDRAND out
> of spite.
>
>>>> Peter, is this idea completely nuts?  The only exceptions that can
>>>> happen there are NMI, MCE, #DB, #SS, and #GP.  The first four use IST,
>>>> so they won't double-fault.
>>>
>>> It is completely nuts, but sometimes completely nuts is actually useful.
>>>  It is more complexity, to be sure, but it doesn't seem completely out
>>> of the realm of reason, and avoids having to unwind the ministack except
>>> in the normally-fatal #DF handler.  #DFs are documented as not
>>> recoverable, but we might be able to do something here.
>>>
>>> The only real disadvantage I see is the need for more bookkeeping
>>> metadata.  Basically the bitmask in espfix_64.c now needs to turn into
>>> an array, plus we need a second percpu variable.  Given that if
>>> CONFIG_NR_CPUS=8192 the array has 128 entries I think we can survive that.
>>
>> Doing something in #DF needs percpu data?  What am I missing?
>
> You need the second percpu variable in the espfix setup code so you have
> both the write address and the target rsp (read address).
>

Duh. :)

--Andy


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-24 Thread H. Peter Anvin
On 04/24/2014 03:31 PM, Andrew Lutomirski wrote:
> 
> I was imagining just randomizing a couple of high bits so the whole
> espfix area moves as a unit.
> 
>> We could XOR with a random constant with no penalty at all.  Only
>> problem is that this happens early, so the entropy system is not yet
>> available.  Fine if we have RDRAND, but...
> 
> How many people have SMAP and not RDRAND?  I think this is a complete
> nonissue for non-SMAP systems.
> 

Most likely none, unless some "clever" virtualizer turns off RDRAND out
of spite.

>>> Peter, is this idea completely nuts?  The only exceptions that can
>>> happen there are NMI, MCE, #DB, #SS, and #GP.  The first four use IST,
>>> so they won't double-fault.
>>
>> It is completely nuts, but sometimes completely nuts is actually useful.
>>  It is more complexity, to be sure, but it doesn't seem completely out
>> of the realm of reason, and avoids having to unwind the ministack except
>> in the normally-fatal #DF handler.  #DFs are documented as not
>> recoverable, but we might be able to do something here.
>>
>> The only real disadvantage I see is the need for more bookkeeping
>> metadata.  Basically the bitmask in espfix_64.c now needs to turn into
>> an array, plus we need a second percpu variable.  Given that if
>> CONFIG_NR_CPUS=8192 the array has 128 entries I think we can survive that.
> 
> Doing something in #DF needs percpu data?  What am I missing?

You need the second percpu variable in the espfix setup code so you have
both the write address and the target rsp (read address).

-hpa




Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-24 Thread Andrew Lutomirski
On Thu, Apr 24, 2014 at 3:24 PM, H. Peter Anvin <h...@linux.intel.com> wrote:
> On 04/23/2014 09:53 PM, Andrew Lutomirski wrote:
>>>
>>> - The user can put arbitrary data in registers before returning to the
>>> LDT in order to get it saved at a known address accessible from the
>>> kernel.  With SMAP and KASLR this might otherwise be difficult.
>>
>> For one thing, this only matters on Haswell.  Otherwise the user can
>> put arbitrary data in userspace.
>>
>> On Haswell, the HPET fixmap is currently a much simpler vector that
>> can do much the same thing, as long as you're willing to wait for the
>> HPET counter to contain some particular value.  I have patches that
>> will fix that as a side effect.
>>
>> Would it pay to randomize the location of the espfix area?  Another
>> somewhat silly idea is to add some random offset to the CPU number mod
>> NR_CPUS so that an attacker won't know which ministack is which.
>
> Since we store the espfix stack location explicitly, as long as the
> scrambling happens in the initialization code that's fine.  However, we
> don't want to reduce locality lest we massively blow up the memory
> requirements.

I was imagining just randomizing a couple of high bits so the whole
espfix area moves as a unit.

>
> We could XOR with a random constant with no penalty at all.  Only
> problem is that this happens early, so the entropy system is not yet
> available.  Fine if we have RDRAND, but...

How many people have SMAP and not RDRAND?  I think this is a complete
nonissue for non-SMAP systems.

>> Peter, is this idea completely nuts?  The only exceptions that can
>> happen there are NMI, MCE, #DB, #SS, and #GP.  The first four use IST,
>> so they won't double-fault.
>
> It is completely nuts, but sometimes completely nuts is actually useful.
>  It is more complexity, to be sure, but it doesn't seem completely out
> of the realm of reason, and avoids having to unwind the ministack except
> in the normally-fatal #DF handler.  #DFs are documented as not
> recoverable, but we might be able to do something here.
>
> The only real disadvantage I see is the need for more bookkeeping
> metadata.  Basically the bitmask in espfix_64.c now needs to turn into
> an array, plus we need a second percpu variable.  Given that if
> CONFIG_NR_CPUS=8192 the array has 128 entries I think we can survive that.

Doing something in #DF needs percpu data?  What am I missing?

--Andy


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-24 Thread H. Peter Anvin
On 04/23/2014 09:53 PM, Andrew Lutomirski wrote:
>>
>> - The user can put arbitrary data in registers before returning to the
>> LDT in order to get it saved at a known address accessible from the
>> kernel.  With SMAP and KASLR this might otherwise be difficult.
> 
> For one thing, this only matters on Haswell.  Otherwise the user can
> put arbitrary data in userspace.
> 
> On Haswell, the HPET fixmap is currently a much simpler vector that
> can do much the same thing, as long as you're willing to wait for the
> HPET counter to contain some particular value.  I have patches that
> will fix that as a side effect.
> 
> Would it pay to randomize the location of the espfix area?  Another
> somewhat silly idea is to add some random offset to the CPU number mod
> NR_CPUS so that an attacker won't know which ministack is which.

Since we store the espfix stack location explicitly, as long as the
scrambling happens in the initialization code that's fine.  However, we
don't want to reduce locality lest we massively blow up the memory
requirements.

We could XOR with a random constant with no penalty at all.  Only
problem is that this happens early, so the entropy system is not yet
available.  Fine if we have RDRAND, but...

>> - If the iret faults, kernel addresses will get stored there (and not
>> cleared).  If a vulnerability could return data from an arbitrary
>> specified address to the user, this would be harmful.
> 
> Can this be fixed by clearing the ministack in bad_iret?  There will
> still be a window in which the kernel address is in there, but it'll
> be short.

We could, if anyone thinks this is actually beneficial.

I'm trying to dig into some of the deeper semantics of IRET to figure
out another issue (a much bigger potential problem), this would affect
that as well.  My current belief is that we don't actually have a
problem here.

>> - If a vulnerability allowed overwriting data at an arbitrary
>> specified address, the exception frame could get overwritten at
>> exactly the right moment between the copy and iret (or right after the
>> iret to mess up fixup_exception)?  You probably know better than I
>> whether or not caches prevent this from actually being possible.
> 
> To attack this, you'd change the saved CS value.  I don't think caches
> would make a difference.
> 
> This particular vector hurts: you can safely keep trying until it works.
> 
> This just gave me an evil idea: what if we make the whole espfix area
> read-only?  This has some weird effects.  To switch to the espfix
> stack, you have to write to an alias.  That's a little strange but
> harmless and barely complicates the implementation.  If the iret
> faults, though, I think the result will be a #DF.  This may actually
> be a good thing: if the #DF handler detects that the cause was a bad
> espfix iret, it could just return directly to bad_iret or send the
> signal itself the same way that do_stack_segment does.  This could
> even be written in C :)
>
> Peter, is this idea completely nuts?  The only exceptions that can
> happen there are NMI, MCE, #DB, #SS, and #GP.  The first four use IST,
> so they won't double-fault.

It is completely nuts, but sometimes completely nuts is actually useful.
 It is more complexity, to be sure, but it doesn't seem completely out
of the realm of reason, and avoids having to unwind the ministack except
in the normally-fatal #DF handler.  #DFs are documented as not
recoverable, but we might be able to do something here.

The only real disadvantage I see is the need for more bookkeeping
metadata.  Basically the bitmask in espfix_64.c now needs to turn into
an array, plus we need a second percpu variable.  Given that if
CONFIG_NR_CPUS=8192 the array has 128 entries I think we can survive that.

-hpa



Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-24 Thread H. Peter Anvin
On 04/23/2014 09:53 PM, Andrew Lutomirski wrote:

 - The user can put arbitrary data in registers before returning to the
 LDT in order to get it saved at a known address accessible from the
 kernel.  With SMAP and KASLR this might otherwise be difficult.
 
 For one thing, this only matters on Haswell.  Otherwise the user can
 put arbitrary data in userspace.
 
 On Haswell, the HPET fixmap is currently a much simpler vector that
 can do much the same thing, as long as you're willing to wait for the
 HPET counter to contain some particular value.  I have patches that
 will fix that as a side effect.
 
 Would it pay to randomize the location of the espfix area?  Another
 somewhat silly idea is to add some random offset to the CPU number mod
 NR_CPUS so that at attacker won't know which ministack is which.

Since we store the espfix stack location explicitly, as long as the
scrambling happens in the initialization code that's fine.  However, we
don't want to reduce locality lest we massively blow up the memory
requirements.

We could XOR with a random constant with no penalty at all.  Only
problem is that this happens early, so the entropy system is not yet
available.  Fine if we have RDRAND, but...

 - If the iret faults, kernel addresses will get stored there (and not
 cleared).  If a vulnerability could return data from an arbitrary
 specified address to the user, this would be harmful.
 
 Can this be fixed by clearing the ministack in bad_iret?  There will
 still be a window in which the kernel address is in there, but it'll
 be short.

We could, if anyone thinks this is actually beneficial.

I'm trying to dig into some of the deeper semantics of IRET to figure
out another issue (a much bigger potential problem), this would affect
that as well.  My current belief is that we don't actually have a
problem here.

 - If a vulnerability allowed overwriting data at an arbitrary
 specified address, the exception frame could get overwritten at
 exactly the right moment between the copy and iret (or right after the
 iret to mess up fixup_exception)?  You probably know better than I
 whether or not caches prevent this from actually being possible.
 
 To attack this, you'd change the saved CS value.  I don't think caches
 would make a difference.
 
 This particular vector hurts: you can safely keep trying until it works.
 
 This just gave me an evil idea: what if we make the whole espfix area
 read-only?  This has some weird effects.  To switch to the espfix
 stack, you have to write to an alias.  That's a little strange but
 harmless and barely complicates the implementation.  If the iret
 faults, though, I think the result will be a #DF.  This may actually
 be a good thing: if the #DF handler detects that the cause was a bad
 espfix iret, it could just return directly to bad_iret or send the
 signal itself the same way that do_stack_segment does.  This could
 even be written in C :)

 Peter, is this idea completely nuts?  The only exceptions that can
 happen there are NMI, MCE, #DB, #SS, and #GP.  The first four use IST,
 so they won't double-fault.

It is completely nuts, but sometimes completely nuts is actually useful.
 It is more complexity, to be sure, but it doesn't seem completely out
of the realm of reason, and avoids having to unwind the ministack except
in the normally-fatal #DF handler.  #DFs are documented as not
recoverable, but we might be able to do something here.

The only real disadvantage I see is the need for more bookkeeping
metadata.  Basically the bitmask in espfix_64.c now needs to turn into
an array, plus we need a second percpu variable.  Given that if
CONFIG_NR_CPUS=8192 the array has 128 entries I think we can survive that.
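The bookkeeping change described above might look roughly like the following sketch.  The names, the 64-byte slot size, and the layout are assumptions for illustration, not the real espfix_64.c code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: with a read-only espfix area, each CPU records
 * both the IRET target rsp (in the read-only alias) and the write
 * address (in the writable alias), and the old "page initialized"
 * bitmask becomes an array of page pointers so setup code can find
 * the pages again. */

#define NR_CPUS          8192
#define SLOTS_PER_PAGE   64                           /* 64-byte ministacks */
#define ESPFIX_MAX_PAGES (NR_CPUS / SLOTS_PER_PAGE)   /* = 128 */

static uint64_t espfix_stack[NR_CPUS];        /* percpu: read-only alias rsp */
static uint64_t espfix_waddr[NR_CPUS];        /* percpu: writable alias      */
static void *espfix_pages[ESPFIX_MAX_PAGES];  /* was a bitmask               */

static void init_espfix_cpu(int cpu, uint64_t ro_base, uint64_t rw_base)
{
    size_t page = cpu / SLOTS_PER_PAGE;
    size_t slot = cpu % SLOTS_PER_PAGE;
    size_t off  = page * 4096 + slot * 64;

    /* espfix_pages[page] would be allocated and double-mapped here on
     * first use; omitted in this sketch. */
    espfix_stack[cpu] = ro_base + off;
    espfix_waddr[cpu] = rw_base + off;
}
```

With 64 ministacks per page, 8192 CPUs indeed need a 128-entry page array.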

-hpa

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-24 Thread Andrew Lutomirski
On Thu, Apr 24, 2014 at 3:24 PM, H. Peter Anvin h...@linux.intel.com wrote:
> On 04/23/2014 09:53 PM, Andrew Lutomirski wrote:
>
>>> - The user can put arbitrary data in registers before returning to the
>>> LDT in order to get it saved at a known address accessible from the
>>> kernel.  With SMAP and KASLR this might otherwise be difficult.
>>
>> For one thing, this only matters on Haswell.  Otherwise the user can
>> put arbitrary data in userspace.
>>
>> On Haswell, the HPET fixmap is currently a much simpler vector that
>> can do much the same thing, as long as you're willing to wait for the
>> HPET counter to contain some particular value.  I have patches that
>> will fix that as a side effect.
>>
>> Would it pay to randomize the location of the espfix area?  Another
>> somewhat silly idea is to add some random offset to the CPU number mod
>> NR_CPUS so that an attacker won't know which ministack is which.
>
> Since we store the espfix stack location explicitly, as long as the
> scrambling happens in the initialization code that's fine.  However, we
> don't want to reduce locality lest we massively blow up the memory
> requirements.

I was imagining just randomizing a couple of high bits so the whole
espfix area moves as a unit.


> We could XOR with a random constant with no penalty at all.  Only
> problem is that this happens early, so the entropy system is not yet
> available.  Fine if we have RDRAND, but...

How many people have SMAP and not RDRAND?  I think this is a complete
nonissue for non-SMAP systems.

>> Peter, is this idea completely nuts?  The only exceptions that can
>> happen there are NMI, MCE, #DB, #SS, and #GP.  The first four use IST,
>> so they won't double-fault.
>
> It is completely nuts, but sometimes completely nuts is actually useful.
>  It is more complexity, to be sure, but it doesn't seem completely out
> of the realm of reason, and avoids having to unwind the ministack except
> in the normally-fatal #DF handler.  #DFs are documented as not
> recoverable, but we might be able to do something here.
>
> The only real disadvantage I see is the need for more bookkeeping
> metadata.  Basically the bitmask in espfix_64.c now needs to turn into
> an array, plus we need a second percpu variable.  Given that if
> CONFIG_NR_CPUS=8192 the array has 128 entries I think we can survive that.

Doing something in #DF needs percpu data?  What am I missing?

--Andy


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-24 Thread H. Peter Anvin
On 04/24/2014 03:31 PM, Andrew Lutomirski wrote:
 
> I was imagining just randomizing a couple of high bits so the whole
> espfix area moves as a unit.
>
>> We could XOR with a random constant with no penalty at all.  Only
>> problem is that this happens early, so the entropy system is not yet
>> available.  Fine if we have RDRAND, but...
>
> How many people have SMAP and not RDRAND?  I think this is a complete
> nonissue for non-SMAP systems.
>

Most likely none, unless some clever virtualizer turns off RDRAND out
of spite.

>>> Peter, is this idea completely nuts?  The only exceptions that can
>>> happen there are NMI, MCE, #DB, #SS, and #GP.  The first four use IST,
>>> so they won't double-fault.

>> It is completely nuts, but sometimes completely nuts is actually useful.
>>  It is more complexity, to be sure, but it doesn't seem completely out
>> of the realm of reason, and avoids having to unwind the ministack except
>> in the normally-fatal #DF handler.  #DFs are documented as not
>> recoverable, but we might be able to do something here.

>> The only real disadvantage I see is the need for more bookkeeping
>> metadata.  Basically the bitmask in espfix_64.c now needs to turn into
>> an array, plus we need a second percpu variable.  Given that if
>> CONFIG_NR_CPUS=8192 the array has 128 entries I think we can survive that.

> Doing something in #DF needs percpu data?  What am I missing?

You need the second percpu variable in the espfix setup code so you have
both the write address and the target rsp (read address).

-hpa




Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-24 Thread Andrew Lutomirski
On Thu, Apr 24, 2014 at 3:37 PM, H. Peter Anvin h...@zytor.com wrote:
> On 04/24/2014 03:31 PM, Andrew Lutomirski wrote:
>
>> I was imagining just randomizing a couple of high bits so the whole
>> espfix area moves as a unit.
>>
>>> We could XOR with a random constant with no penalty at all.  Only
>>> problem is that this happens early, so the entropy system is not yet
>>> available.  Fine if we have RDRAND, but...
>>
>> How many people have SMAP and not RDRAND?  I think this is a complete
>> nonissue for non-SMAP systems.
>
> Most likely none, unless some clever virtualizer turns off RDRAND out
> of spite.
>
>>>> Peter, is this idea completely nuts?  The only exceptions that can
>>>> happen there are NMI, MCE, #DB, #SS, and #GP.  The first four use IST,
>>>> so they won't double-fault.
>>
>>> It is completely nuts, but sometimes completely nuts is actually useful.
>>>  It is more complexity, to be sure, but it doesn't seem completely out
>>> of the realm of reason, and avoids having to unwind the ministack except
>>> in the normally-fatal #DF handler.  #DFs are documented as not
>>> recoverable, but we might be able to do something here.
>>
>>> The only real disadvantage I see is the need for more bookkeeping
>>> metadata.  Basically the bitmask in espfix_64.c now needs to turn into
>>> an array, plus we need a second percpu variable.  Given that if
>>> CONFIG_NR_CPUS=8192 the array has 128 entries I think we can survive that.
>>
>> Doing something in #DF needs percpu data?  What am I missing?
>
> You need the second percpu variable in the espfix setup code so you have
> both the write address and the target rsp (read address).


Duh. :)

--Andy


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread Andrew Lutomirski
On Wed, Apr 23, 2014 at 9:13 PM, comex  wrote:
> On Mon, Apr 21, 2014 at 6:47 PM, H. Peter Anvin  wrote:
>> This is a prototype of espfix for the 64-bit kernel.  espfix is a
>> workaround for the architectural definition of IRET, which fails to
>> restore bits [31:16] of %esp when returning to a 16-bit stack
>> segment.  We have a workaround for the 32-bit kernel, but that
>> implementation doesn't work for 64 bits.
>
> Hi,
>
> A comment: The main purpose of espfix is to prevent attackers from
> learning sensitive addresses, right?  But as far as I can tell, this
> mini-stack becomes itself somewhat sensitive:
>
> - The user can put arbitrary data in registers before returning to the
> LDT in order to get it saved at a known address accessible from the
> kernel.  With SMAP and KASLR this might otherwise be difficult.

For one thing, this only matters on Haswell.  Otherwise the user can
put arbitrary data in userspace.

On Haswell, the HPET fixmap is currently a much simpler vector that
can do much the same thing, as long as you're willing to wait for the
HPET counter to contain some particular value.  I have patches that
will fix that as a side effect.

Would it pay to randomize the location of the espfix area?  Another
somewhat silly idea is to add some random offset to the CPU number mod
NR_CPUS so that an attacker won't know which ministack is which.
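The CPU-number scrambling floated above could be as simple as the following sketch (the offset source and the NR_CPUS value are illustrative):

```c
#include <assert.h>

/* Hypothetical sketch: add a boot-time random offset to the CPU number
 * modulo NR_CPUS so a leaked ministack address does not reveal which
 * CPU it belongs to.  The mapping is a bijection, so every CPU still
 * gets a unique slot. */

#define NR_CPUS 8192

static unsigned int espfix_slot_offset;  /* chosen once at boot */

static unsigned int espfix_slot(unsigned int cpu)
{
    return (cpu + espfix_slot_offset) % NR_CPUS;
}
```

Because the offset is constant, locality between adjacent CPUs is preserved, which matters for the memory-usage concern raised in the reply.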

> - If the iret faults, kernel addresses will get stored there (and not
> cleared).  If a vulnerability could return data from an arbitrary
> specified address to the user, this would be harmful.

Can this be fixed by clearing the ministack in bad_iret?  There will
still be a window in which the kernel address is in there, but it'll
be short.
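The clearing suggested above, sketched hypothetically (the 64-byte slot size and the names are assumptions): the bad_iret fixup would zero this CPU's ministack so the faulted frame's kernel addresses don't persist at a user-predictable location.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of clearing the ministack from the bad_iret
 * fixup path.  "ministack" stands in for this CPU's espfix slot; by
 * the time bad_iret runs, the copied frame is dead, so zeroing it
 * only shortens the window in which kernel addresses sit there. */

#define MINISTACK_BYTES 64

static uint8_t ministack[MINISTACK_BYTES];

static void bad_iret_clear_ministack(void)
{
    memset(ministack, 0, MINISTACK_BYTES);
}
```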

>
> I guess with the current KASLR implementation you could get the same
> effects via brute force anyway, by filling up and browsing memory,
> respectively, but ideally there wouldn't be any virtual addresses
> guaranteed not to fault.
>
> - If a vulnerability allowed overwriting data at an arbitrary
> specified address, the exception frame could get overwritten at
> exactly the right moment between the copy and iret (or right after the
> iret to mess up fixup_exception)?  You probably know better than I
> whether or not caches prevent this from actually being possible.

To attack this, you'd change the saved CS value.  I don't think caches
would make a difference.

This particular vector hurts: you can safely keep trying until it works.

This just gave me an evil idea: what if we make the whole espfix area
read-only?  This has some weird effects.  To switch to the espfix
stack, you have to write to an alias.  That's a little strange but
harmless and barely complicates the implementation.  If the iret
faults, though, I think the result will be a #DF.  This may actually
be a good thing: if the #DF handler detects that the cause was a bad
espfix iret, it could just return directly to bad_iret or send the
signal itself the same way that do_stack_segment does.  This could
even be written in C :)

Peter, is this idea completely nuts?  The only exceptions that can
happen there are NMI, MCE, #DB, #SS, and #GP.  The first four use IST,
so they won't double-fault.
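The C handler imagined above might be shaped like this sketch; the saved-RIP comparison, the symbol names, and the recovery action are all assumptions:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: if the double fault's saved RIP is the espfix
 * IRET instruction, the #DF was really a user IRET failing off the
 * read-only ministack, so instead of panicking the handler can behave
 * like bad_iret and deliver the signal the way do_stack_segment would. */

struct pt_regs { uint64_t ip; };

static uint64_t espfix_iret_rip;   /* recorded at boot; illustrative */

static const char *double_fault(struct pt_regs *regs)
{
    if (regs->ip == espfix_iret_rip)
        return "espfix-iret: recover via bad_iret";  /* send signal */
    return "panic";  /* any other #DF stays non-recoverable */
}
```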

--Andy


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread comex
On Mon, Apr 21, 2014 at 6:47 PM, H. Peter Anvin  wrote:
> This is a prototype of espfix for the 64-bit kernel.  espfix is a
> workaround for the architectural definition of IRET, which fails to
> restore bits [31:16] of %esp when returning to a 16-bit stack
> segment.  We have a workaround for the 32-bit kernel, but that
> implementation doesn't work for 64 bits.

Hi,

A comment: The main purpose of espfix is to prevent attackers from
learning sensitive addresses, right?  But as far as I can tell, this
mini-stack becomes itself somewhat sensitive:

- The user can put arbitrary data in registers before returning to the
LDT in order to get it saved at a known address accessible from the
kernel.  With SMAP and KASLR this might otherwise be difficult.
- If the iret faults, kernel addresses will get stored there (and not
cleared).  If a vulnerability could return data from an arbitrary
specified address to the user, this would be harmful.

I guess with the current KASLR implementation you could get the same
effects via brute force anyway, by filling up and browsing memory,
respectively, but ideally there wouldn't be any virtual addresses
guaranteed not to fault.

- If a vulnerability allowed overwriting data at an arbitrary
specified address, the exception frame could get overwritten at
exactly the right moment between the copy and iret (or right after the
iret to mess up fixup_exception)?  You probably know better than I
whether or not caches prevent this from actually being possible.

Just raising the issue.


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread Andrew Lutomirski
On Wed, Apr 23, 2014 at 10:28 AM, H. Peter Anvin  wrote:
> On 04/23/2014 10:25 AM, Andrew Lutomirski wrote:
>> On Wed, Apr 23, 2014 at 10:16 AM, H. Peter Anvin  wrote:
>>> On 04/23/2014 10:08 AM, Andrew Lutomirski wrote:

 The only way I can see to trigger the race is with sigreturn, but it's
 still there.  Sigh.
>>>
>>> I don't see why sigreturn needs to be involved... all you need is
>>> modify_ldt() on one CPU while the other is in the middle of an IRET
>>> return.  Small window, so hard to hit, but still.
>>
>> If you set the flag as soon as anyone calls modify_ldt, before any
>> descriptor is installed, then I don't think this can happen.  But
>> there's still sigreturn, and I don't think this is worth all the
>> complexity to save a single branch on #GP.
>>
>
> Who cares?  Since we only need to enter the fixup path for LDT
> selectors, anything that is dependent on having called modify_ldt() is
> already redundant.

But you still have to test this, and folding it into the existing
check for thread flags would eliminate that.  Still, I think this
would not be worth it, even if it were correct.

>
> In some ways that is the saving grace.  SS being an LDT selector is
> fortunately a rare case.
>
>> I do mean intra-kernel.  And yes, this has nothing to do with espfix,
>> but it would make write_msr_safe fail more quickly :)
>
> And, pray tell, how important is that?

Not very.

Page faults may be a different story for some workloads, particularly
if they are IO-heavy.  Returning to preempted kernel threads may also
matter.

For my particular workload, returns from rescheduling interrupts
delivered to idle cpus probably also matters, but the fact that those
interrupts are happening at all is a bug that tglx is working on.

--Andy


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread H. Peter Anvin
On 04/23/2014 10:25 AM, Andrew Lutomirski wrote:
> On Wed, Apr 23, 2014 at 10:16 AM, H. Peter Anvin  wrote:
>> On 04/23/2014 10:08 AM, Andrew Lutomirski wrote:
>>>
>>> The only way I can see to trigger the race is with sigreturn, but it's
>>> still there.  Sigh.
>>
>> I don't see why sigreturn needs to be involved... all you need is
>> modify_ldt() on one CPU while the other is in the middle of an IRET
>> return.  Small window, so hard to hit, but still.
> 
> If you set the flag as soon as anyone calls modify_ldt, before any
> descriptor is installed, then I don't think this can happen.  But
> there's still sigreturn, and I don't think this is worth all the
> complexity to save a single branch on #GP.
> 

Who cares?  Since we only need to enter the fixup path for LDT
selectors, anything that is dependent on having called modify_ldt() is
already redundant.

In some ways that is the saving grace.  SS being an LDT selector is
fortunately a rare case.

> I do mean intra-kernel.  And yes, this has nothing to do with espfix,
> but it would make write_msr_safe fail more quickly :)

And, pray tell, how important is that?

-hpa




Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread Andrew Lutomirski
On Wed, Apr 23, 2014 at 10:16 AM, H. Peter Anvin  wrote:
> On 04/23/2014 10:08 AM, Andrew Lutomirski wrote:
>>
>> The only way I can see to trigger the race is with sigreturn, but it's
>> still there.  Sigh.
>>
>
> I don't see why sigreturn needs to be involved... all you need is
> modify_ldt() on one CPU while the other is in the middle of an IRET
> return.  Small window, so hard to hit, but still.
>

If you set the flag as soon as anyone calls modify_ldt, before any
descriptor is installed, then I don't think this can happen.  But
there's still sigreturn, and I don't think this is worth all the
complexity to save a single branch on #GP.
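The flag idea could be sketched as follows.  This is purely illustrative (and, as noted, still doesn't cover sigreturn):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of arming the IRET slow path the moment
 * modify_ldt() is entered, before any descriptor can be installed,
 * so no 16-bit SS can exist while the fast path is still active. */

static bool ldt_slow_path_armed;   /* stand-in for a per-mm flag */

static long sys_modify_ldt_sketch(void)
{
    ldt_slow_path_armed = true;    /* set before touching the LDT */
    /* ... install or modify descriptors here ... */
    return 0;
}

static bool iret_use_slow_path(void)
{
    return ldt_slow_path_armed;    /* would fold into thread-flag checks */
}
```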

>> 2. I've often pondered changing the way we return *to* CPL 0 to bypass
>> iret entirely.  It could be something like:
>>
>> SS
>> RSP
>> EFLAGS
>> CS
>> RIP
>>
>> push 16($rsp)
>> popfq [does this need to force rex.w somehow?]
>> ret $64
>
> When you say return to CPL 0 you mean intra-kernel return?  That isn't
> really the problem here, though.  I think this will also break the
> kernel debugger since it will have the wrong behavior for TF and RF.

I do mean intra-kernel.  And yes, this has nothing to do with espfix,
but it would make write_msr_safe fail more quickly :)

--Andy


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread H. Peter Anvin
On 04/23/2014 10:08 AM, Andrew Lutomirski wrote:
> 
> The only way I can see to trigger the race is with sigreturn, but it's
> still there.  Sigh.
> 

I don't see why sigreturn needs to be involved... all you need is
modify_ldt() on one CPU while the other is in the middle of an IRET
return.  Small window, so hard to hit, but still.

> 2. I've often pondered changing the way we return *to* CPL 0 to bypass
> iret entirely.  It could be something like:
> 
> SS
> RSP
> EFLAGS
> CS
> RIP
> 
> push 16($rsp)
> popfq [does this need to force rex.w somehow?]
> ret $64

When you say return to CPL 0 you mean intra-kernel return?  That isn't
really the problem here, though.  I think this will also break the
kernel debugger since it will have the wrong behavior for TF and RF.

>>> The other question I have is - is there any reason we can't fix up the
>>> IRET to do a 32bit return into a vsyscall type userspace page which then
>>> does a long jump or retf to the right place ?
>>
>> I did a writeup on this a while ago.  It does have the problem that you
>> need additional memory in userspace, which is per-thread and in the
>> right region of userspace; this pretty much means you have to muck about
>> with the user space stack when user space is running in weird modes.
>> This gets complex very quickly and does have some "footprint".
>> Furthermore, on some CPUs (not including any recent Intel CPUs) there is
>> still a way to leak bits [63:32].  I believe the in-kernel solution is
>> actually simpler.
>>
> 
> There's also no real guarantee that user code won't unmap the vdso.

There is, but there is also at some point a "don't do that, then" aspect
to it all.

-hpa




Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread Andrew Lutomirski
On Wed, Apr 23, 2014 at 8:53 AM, H. Peter Anvin  wrote:
> On 04/23/2014 02:54 AM, One Thousand Gnomes wrote:
>>> Ideally the tests should be doable such that on a normal machine the
>>> tests can be overlapped with the other things we have to do on that
>>> path.  The exit branch will be strongly predicted in the negative
>>> direction, so it shouldn't be a significant problem.
>>>
>>> Again, this is not the case in the current prototype.
>>
>> Or you make sure that you switch to those code paths only after software
>> has executed syscalls that make it possible it will use a 16bit ss.
>>
>
> Which, again, would introduce a race, I believe, at least if we have an
> LDT at all (and since we only enter these code paths for LDT descriptors
> in the first place, it is equivalent to the current code minus the filters.)

The only way I can see to trigger the race is with sigreturn, but it's
still there.  Sigh.

Here are two semi-related things:

1. The Intel manual's description of iretq seems like it forgot to
mention that iret restores the stack pointer in anything except vm86
mode.  Fortunately, the AMD manual seems to think that, when returning
*from* 64-bit mode, RSP is always restored, which I think is necessary
for this patch to work correctly.

2. I've often pondered changing the way we return *to* CPL 0 to bypass
iret entirely.  It could be something like:

SS
RSP
EFLAGS
CS
RIP

push 16($rsp)
popfq [does this need to force rex.w somehow?]
ret $64

This may break backtraces if cfi isn't being used and we get an NMI
just before the popfq.  I'm not quite sure how that works.

I haven't benchmarked this at all, but the only slow part should be
the popfq, and I doubt it's anywhere near as slow as iret.

>
>> The other question I have is - is there any reason we can't fix up the
>> IRET to do a 32bit return into a vsyscall type userspace page which then
>> does a long jump or retf to the right place ?
>
> I did a writeup on this a while ago.  It does have the problem that you
> need additional memory in userspace, which is per-thread and in the
> right region of userspace; this pretty much means you have to muck about
> with the user space stack when user space is running in weird modes.
> This gets complex very quickly and does have some "footprint".
> Furthermore, on some CPUs (not including any recent Intel CPUs) there is
> still a way to leak bits [63:32].  I believe the in-kernel solution is
> actually simpler.
>

There's also no real guarantee that user code won't unmap the vdso.

--Andy


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread H. Peter Anvin
On 04/23/2014 07:24 AM, Boris Ostrovsky wrote:
>>
>> Konrad - I really could use some help figuring out what needs to be done
>> for this not to break Xen.
> 
> This does break Xen PV:
> 

I know it does.  This is why I asked for help.

This is fundamentally the problem with PV and *especially* the way Xen
PV was integrated into Linux: it acts as a development brake for native
hardware.  Fortunately, Konrad has been quite responsive to that kind of
problems, which hasn't always been true of the Xen community in the past.

-hpa



Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread H. Peter Anvin
On 04/23/2014 02:54 AM, One Thousand Gnomes wrote:
>> Ideally the tests should be doable such that on a normal machine the
>> tests can be overlapped with the other things we have to do on that
>> path.  The exit branch will be strongly predicted in the negative
>> direction, so it shouldn't be a significant problem.
>>
>> Again, this is not the case in the current prototype.
> 
> Or you make sure that you switch to those code paths only after software
> has executed syscalls that make it possible it will use a 16bit ss. 
> 

Which, again, would introduce a race, I believe, at least if we have an
LDT at all (and since we only enter these code paths for LDT descriptors
in the first place, it is equivalent to the current code minus the filters.)

> The other question I have is - is there any reason we can't fix up the
> IRET to do a 32bit return into a vsyscall type userspace page which then
> does a long jump or retf to the right place ?

I did a writeup on this a while ago.  It does have the problem that you
need additional memory in userspace, which is per-thread and in the
right region of userspace; this pretty much means you have to muck about
with the user space stack when user space is running in weird modes.
This gets complex very quickly and does have some "footprint".
Furthermore, on some CPUs (not including any recent Intel CPUs) there is
still a way to leak bits [63:32].  I believe the in-kernel solution is
actually simpler.

-hpa




Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread Boris Ostrovsky

On 04/22/2014 09:42 PM, H. Peter Anvin wrote:

> On 04/22/2014 06:23 PM, Andrew Lutomirski wrote:
>
>> What's the to_dmesg thing for?
>
> It's for debugging... the espfix page tables generate so many duplicate
> entries that trying to output it via a seqfile runs out of memory.  I
> suspect we need to do something like skip the espfix range or some other
> hack.
>
>> It looks sane, although I haven't checked the detailed register manipulation.
>>
>> Users of big systems may complain when every single CPU lines up for
>> that mutex.  Maybe no one cares.
>
> Right now the whole smpboot sequence is fully serialized... that needs
> to be fixed.
>
> Konrad - I really could use some help figuring out what needs to be done
> for this not to break Xen.


This does break Xen PV:

[3.683735] ------------[ cut here ]------------
[3.683807] WARNING: CPU: 0 PID: 0 at arch/x86/xen/multicalls.c:129 xen_mc_flush+0x1c8/0x1d0()
[3.683903] Modules linked in:
[3.684006] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.15.0-rc2 #2
[3.684176]  0009 81c01de0 816cfb15
[3.684416]  81c01e18 81084abd  0001
[3.684654]   88023da0b180 0010 81c01e28
[3.684893] Call Trace:
[3.684962]  [] dump_stack+0x45/0x56
[3.685032]  [] warn_slowpath_common+0x7d/0xa0
[3.685102]  [] warn_slowpath_null+0x1a/0x20
[3.685171]  [] xen_mc_flush+0x1c8/0x1d0
[3.685240]  [] xen_set_pgd+0x1f5/0x220
[3.685310]  [] init_espfix_this_cpu+0x36a/0x380
[3.685379]  [] ? acpi_tb_initialize_facs+0x31/0x33
[3.685450]  [] start_kernel+0x37f/0x411
[3.685517]  [] ? repair_env_string+0x5c/0x5c
[3.685586]  [] x86_64_start_reservations+0x2a/0x2c
[3.685654]  [] xen_start_kernel+0x594/0x5a0
[3.685728] ---[ end trace a2cf2d7b2ecab826 ]---

But then I think we may want to rearrange preempt_enable/disable in 
xen_set_pgd().


-boris


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread One Thousand Gnomes
> Ideally the tests should be doable such that on a normal machine the
> tests can be overlapped with the other things we have to do on that
> path.  The exit branch will be strongly predicted in the negative
> direction, so it shouldn't be a significant problem.
> 
> Again, this is not the case in the current prototype.

Or you make sure that you switch to those code paths only after software
has executed syscalls that make it possible for it to use a 16-bit SS.

The other question I have is - is there any reason we can't fix up the
IRET to do a 32bit return into a vsyscall type userspace page which then
does a long jump or retf to the right place ?

Alan


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread Alexandre Julliard
"H. Peter Anvin"  writes:

> Does anyone have any idea if there is a real use case for non-16-bit
> LDT segments used as the stack segment?  Does Wine use anything like
> that?

Wine uses them for DPMI support, though that would only get used when
vm86 mode is available.

-- 
Alexandre Julliard
julli...@winehq.org


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread H. Peter Anvin

On 04/22/2014 10:04 AM, Linus Torvalds wrote:


> The segment table is shared for a process. So you can have one thread
> doing a load_ldt() that invalidates a segment, while another thread is
> busy taking a page fault. The segment was valid at page fault time and
> is saved on the kernel stack, but by the time the page fault returns,
> it is no longer valid and the iretq will fault.
>
> Anyway, if done correctly, this whole espfix should be totally free
> for normal processes, since it should only trigger if SS is a LDT
> entry (bit #2 set in the segment descriptor). So the normal fast-path
> should just have a simple test for that.
>
> And if you have a SS that is a descriptor in the LDT, nobody cares
> about performance any more.
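The fast-path test described in the quote, as a sketch.  Bit 2 of a segment selector is the Table Indicator, set for LDT selectors; the additional RPL == 3 check is an assumption beyond what is stated (a return to user mode):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the fast-path filter: espfix work is only needed when the
 * return SS is an LDT selector.  Bit 2 (TI) of a selector chooses the
 * LDT; bits 0-1 are the RPL. */

static bool ss_needs_espfix(uint16_t ss)
{
    return (ss & 0x4) != 0      /* TI = 1: LDT selector   */
        && (ss & 0x3) == 0x3;   /* RPL 3: user segment    */
}
```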



I just realized that with the LDT being a process-level object (unlike 
the GDT), we need to remove the filtering on the espfix hack, both for 
32-bit and 64-bit kernels.  Otherwise there is a race condition between 
executing the LAR instruction in the filter and the IRET, which could 
allow the leak to become manifest.


The "good" part is that I think the espfix hack is harmless even with a 
32/64-bit stack segment, although it has a substantial performance penalty.


Does anyone have any idea if there is a real use case for non-16-bit LDT 
segments used as the stack segment?  Does Wine use anything like that?


Very old NPTL Linux binaries use LDT segments, but only for data segments.

-hpa




Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread H. Peter Anvin
On 04/23/2014 07:24 AM, Boris Ostrovsky wrote:

 Konrad - I really could use some help figuring out what needs to be done
 for this not to break Xen.
 
 This does break Xen PV:
 

I know it does.  This is why I asked for help.

This is fundamentally the problem with PV and *especially* the way Xen
PV was integrated into Linux: it acts as a development brake for native
hardware.  Fortunately, Konrad has been quite responsive to that kind of
problems, which hasn't always been true of the Xen community in the past.

-hpa

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread Andrew Lutomirski
On Wed, Apr 23, 2014 at 8:53 AM, H. Peter Anvin h...@zytor.com wrote:
 On 04/23/2014 02:54 AM, One Thousand Gnomes wrote:
 Ideally the tests should be doable such that on a normal machine the
 tests can be overlapped with the other things we have to do on that
 path.  The exit branch will be strongly predicted in the negative
 direction, so it shouldn't be a significant problem.

 Again, this is not the case in the current prototype.

 Or you make sure that you switch to those code paths only after software
 has executed syscalls that make it possible it will use a 16bit ss.


 Which, again, would introduce a race, I believe, at least if we have an
 LDT at all (and since we only enter these code paths for LDT descriptors
 in the first place, it is equivalent to the current code minus the filters.)

The only way I can see to trigger the race is with sigreturn, but it's
still there.  Sigh.

Here are two semi-related things:

1. The Intel manual's description of iretq seems to forget to mention
that iret restores the stack pointer in anything except vm86 mode.
Fortunately, the AMD manual seems to think that, when returning *from*
64-bit mode, RSP is always restored, which I think is necessary for
this patch to work correctly.

2. I've often pondered changing the way we return *to* CPL 0 to bypass
iret entirely.  It could be something like:

SS
RSP
EFLAGS
CS
RIP

push 16(%rsp)
popfq [does this need to force rex.w somehow?]
ret $64

This may break backtraces if cfi isn't being used and we get an NMI
just before the popfq.  I'm not quite sure how that works.

I haven't benchmarked this at all, but the only slow part should be
the popfq, and I doubt it's anywhere near as slow as iret.


 The other question I have is - is there any reason we can't fix up the
 IRET to do a 32bit return into a vsyscall type userspace page which then
 does a long jump or retf to the right place ?

 I did a writeup on this a while ago.  It does have the problem that you
 need additional memory in userspace, which is per-thread and in the
 right region of userspace; this pretty much means you have to muck about
 with the user space stack when user space is running in weird modes.
 This gets complex very quickly and does have some footprint.
 Furthermore, on some CPUs (not including any recent Intel CPUs) there is
 still a way to leak bits [63:32].  I believe the in-kernel solution is
 actually simpler.


There's also no real guarantee that user code won't unmap the vdso.

--Andy


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread H. Peter Anvin
On 04/23/2014 10:08 AM, Andrew Lutomirski wrote:
 
 The only way I can see to trigger the race is with sigreturn, but it's
 still there.  Sigh.
 

I don't see why sigreturn needs to be involved... all you need is
modify_ldt() on one CPU while the other is in the middle of an IRET
return.  Small window, so hard to hit, but still.

 2. I've often pondered changing the way we return *to* CPL 0 to bypass
 iret entirely.  It could be something like:
 
 SS
 RSP
 EFLAGS
 CS
 RIP
 
 push 16(%rsp)
 popfq [does this need to force rex.w somehow?]
 ret $64

When you say return to CPL 0 you mean intra-kernel return?  That isn't
really the problem here, though.  I think this will also break the
kernel debugger since it will have the wrong behavior for TF and RF.

 The other question I have is - is there any reason we can't fix up the
 IRET to do a 32bit return into a vsyscall type userspace page which then
 does a long jump or retf to the right place ?

 I did a writeup on this a while ago.  It does have the problem that you
 need additional memory in userspace, which is per-thread and in the
 right region of userspace; this pretty much means you have to muck about
 with the user space stack when user space is running in weird modes.
 This gets complex very quickly and does have some footprint.
 Furthermore, on some CPUs (not including any recent Intel CPUs) there is
 still a way to leak bits [63:32].  I believe the in-kernel solution is
 actually simpler.

 
 There's also no real guarantee that user code won't unmap the vdso.

There is, but there is also at some point a "don't do that, then" aspect
to it all.

-hpa




Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread Andrew Lutomirski
On Wed, Apr 23, 2014 at 10:16 AM, H. Peter Anvin h...@zytor.com wrote:
 On 04/23/2014 10:08 AM, Andrew Lutomirski wrote:

 The only way I can see to trigger the race is with sigreturn, but it's
 still there.  Sigh.


 I don't see why sigreturn needs to be involved... all you need is
 modify_ldt() on one CPU while the other is in the middle of an IRET
 return.  Small window, so hard to hit, but still.


If you set the flag as soon as anyone calls modify_ldt, before any
descriptor is installed, then I don't think this can happen.  But
there's still sigreturn, and I don't think this is worth all the
complexity to save a single branch on #GP.

 2. I've often pondered changing the way we return *to* CPL 0 to bypass
 iret entirely.  It could be something like:

 SS
 RSP
 EFLAGS
 CS
 RIP

 push 16(%rsp)
 popfq [does this need to force rex.w somehow?]
 ret $64

 When you say return to CPL 0 you mean intra-kernel return?  That isn't
 really the problem here, though.  I think this will also break the
 kernel debugger since it will have the wrong behavior for TF and RF.

I do mean intra-kernel.  And yes, this has nothing to do with espfix,
but it would make write_msr_safe fail more quickly :)

--Andy


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread H. Peter Anvin
On 04/23/2014 10:25 AM, Andrew Lutomirski wrote:
 On Wed, Apr 23, 2014 at 10:16 AM, H. Peter Anvin h...@zytor.com wrote:
 On 04/23/2014 10:08 AM, Andrew Lutomirski wrote:

 The only way I can see to trigger the race is with sigreturn, but it's
 still there.  Sigh.

 I don't see why sigreturn needs to be involved... all you need is
 modify_ldt() on one CPU while the other is in the middle of an IRET
 return.  Small window, so hard to hit, but still.
 
 If you set the flag as soon as anyone calls modify_ldt, before any
 descriptor is installed, then I don't think this can happen.  But
 there's still sigreturn, and I don't think this is worth all the
 complexity to save a single branch on #GP.
 

Who cares?  Since we only need to enter the fixup path for LDT
selectors, anything that is dependent on having called modify_ldt() is
already redundant.

In some ways that is the saving grace.  SS being an LDT selector is
fortunately a rare case.

 I do mean intra-kernel.  And yes, this has nothing to do with espfix,
 but it would make write_msr_safe fail more quickly :)

And, pray tell, how important is that?

-hpa




Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread Andrew Lutomirski
On Wed, Apr 23, 2014 at 10:28 AM, H. Peter Anvin h...@zytor.com wrote:
 On 04/23/2014 10:25 AM, Andrew Lutomirski wrote:
 On Wed, Apr 23, 2014 at 10:16 AM, H. Peter Anvin h...@zytor.com wrote:
 On 04/23/2014 10:08 AM, Andrew Lutomirski wrote:

 The only way I can see to trigger the race is with sigreturn, but it's
 still there.  Sigh.

 I don't see why sigreturn needs to be involved... all you need is
 modify_ldt() on one CPU while the other is in the middle of an IRET
 return.  Small window, so hard to hit, but still.

 If you set the flag as soon as anyone calls modify_ldt, before any
 descriptor is installed, then I don't think this can happen.  But
 there's still sigreturn, and I don't think this is worth all the
 complexity to save a single branch on #GP.


 Who cares?  Since we only need to enter the fixup path for LDT
 selectors, anything that is dependent on having called modify_ldt() is
 already redundant.

But you still have to test this, and folding it into the existing
check for thread flags would eliminate that.  Still, I think this
would not be worth it, even if it were correct.


 In some ways that is the saving grace.  SS being an LDT selector is
 fortunately a rare case.

 I do mean intra-kernel.  And yes, this has nothing to do with espfix,
 but it would make write_msr_safe fail more quickly :)

 And, pray tell, how important is that?

Not very.

Page faults may be a different story for some workloads, particularly
if they are IO-heavy.  Returning to preempted kernel threads may also
matter.

For my particular workload, returns from rescheduling interrupts
delivered to idle cpus probably also matters, but the fact that those
interrupts are happening at all is a bug that tglx is working on.

--Andy



Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread comex
On Mon, Apr 21, 2014 at 6:47 PM, H. Peter Anvin h...@linux.intel.com wrote:
 This is a prototype of espfix for the 64-bit kernel.  espfix is a
 workaround for the architectural definition of IRET, which fails to
 restore bits [31:16] of %esp when returning to a 16-bit stack
 segment.  We have a workaround for the 32-bit kernel, but that
 implementation doesn't work for 64 bits.

Hi,

A comment: The main purpose of espfix is to prevent attackers from
learning sensitive addresses, right?  But as far as I can tell, this
mini-stack becomes itself somewhat sensitive:

- The user can put arbitrary data in registers before returning to the
LDT in order to get it saved at a known address accessible from the
kernel.  With SMAP and KASLR this might otherwise be difficult.
- If the iret faults, kernel addresses will get stored there (and not
cleared).  If a vulnerability could return data from an arbitrary
specified address to the user, this would be harmful.

I guess with the current KASLR implementation you could get the same
effects via brute force anyway, by filling up and browsing memory,
respectively, but ideally there wouldn't be any virtual addresses
guaranteed not to fault.

- If a vulnerability allowed overwriting data at an arbitrary
specified address, the exception frame could get overwritten at
exactly the right moment between the copy and iret (or right after the
iret to mess up fixup_exception)?  You probably know better than I
whether or not caches prevent this from actually being possible.

Just raising the issue.


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread Andrew Lutomirski
On Wed, Apr 23, 2014 at 9:13 PM, comex com...@gmail.com wrote:
 On Mon, Apr 21, 2014 at 6:47 PM, H. Peter Anvin h...@linux.intel.com wrote:
 This is a prototype of espfix for the 64-bit kernel.  espfix is a
 workaround for the architectural definition of IRET, which fails to
 restore bits [31:16] of %esp when returning to a 16-bit stack
 segment.  We have a workaround for the 32-bit kernel, but that
 implementation doesn't work for 64 bits.

 Hi,

 A comment: The main purpose of espfix is to prevent attackers from
 learning sensitive addresses, right?  But as far as I can tell, this
 mini-stack becomes itself somewhat sensitive:

 - The user can put arbitrary data in registers before returning to the
 LDT in order to get it saved at a known address accessible from the
 kernel.  With SMAP and KASLR this might otherwise be difficult.

For one thing, this only matters on Haswell.  Otherwise the user can
put arbitrary data in userspace.

On Haswell, the HPET fixmap is currently a much simpler vector that
can do much the same thing, as long as you're willing to wait for the
HPET counter to contain some particular value.  I have patches that
will fix that as a side effect.

Would it pay to randomize the location of the espfix area?  Another
somewhat silly idea is to add some random offset to the CPU number mod
NR_CPUS so that an attacker won't know which ministack is which.

 - If the iret faults, kernel addresses will get stored there (and not
 cleared).  If a vulnerability could return data from an arbitrary
 specified address to the user, this would be harmful.

Can this be fixed by clearing the ministack in bad_iret?  There will
still be a window in which the kernel address is in there, but it'll
be short.


 I guess with the current KASLR implementation you could get the same
 effects via brute force anyway, by filling up and browsing memory,
 respectively, but ideally there wouldn't be any virtual addresses
 guaranteed not to fault.

 - If a vulnerability allowed overwriting data at an arbitrary
 specified address, the exception frame could get overwritten at
 exactly the right moment between the copy and iret (or right after the
 iret to mess up fixup_exception)?  You probably know better than I
 whether or not caches prevent this from actually being possible.

To attack this, you'd change the saved CS value.  I don't think caches
would make a difference.

This particular vector hurts: you can safely keep trying until it works.

This just gave me an evil idea: what if we make the whole espfix area
read-only?  This has some weird effects.  To switch to the espfix
stack, you have to write to an alias.  That's a little strange but
harmless and barely complicates the implementation.  If the iret
faults, though, I think the result will be a #DF.  This may actually
be a good thing: if the #DF handler detects that the cause was a bad
espfix iret, it could just return directly to bad_iret or send the
signal itself the same way that do_stack_segment does.  This could
even be written in C :)

Peter, is this idea completely nuts?  The only exceptions that can
happen there are NMI, MCE, #DB, #SS, and #GP.  The first four use IST,
so they won't double-fault.

--Andy


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread H. Peter Anvin

On 04/22/2014 10:04 AM, Linus Torvalds wrote:


The segment table is shared for a process. So you can have one thread
doing a load_ldt() that invalidates a segment, while another thread is
busy taking a page fault. The segment was valid at page fault time and
is saved on the kernel stack, but by the time the page fault returns,
it is no longer valid and the iretq will fault.

Anyway, if done correctly, this whole espfix should be totally free
for normal processes, since it should only trigger if SS is a LDT
entry (bit #2 set in the segment descriptor). So the normal fast-path
should just have a simple test for that.

And if you have a SS that is a descriptor in the LDT, nobody cares
about performance any more.



I just realized that with the LDT being a process-level object (unlike 
the GDT), we need to remove the filtering on the espfix hack, both for 
32-bit and 64-bit kernels.  Otherwise there is a race condition between 
executing the LAR instruction in the filter and the IRET, which could 
allow the leak to become manifest.


The "good" part is that I think the espfix hack is harmless even with a
32/64-bit stack segment, although it has a substantial performance penalty.


Does anyone have any idea if there is a real use case for non-16-bit LDT 
segments used as the stack segment?  Does Wine use anything like that?


Very old NPTL Linux binaries use LDT segments, but only for data segments.

-hpa




Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread Alexandre Julliard
H. Peter Anvin h...@zytor.com writes:

 Does anyone have any idea if there is a real use case for non-16-bit
 LDT segments used as the stack segment?  Does Wine use anything like
 that?

Wine uses them for DPMI support, though that would only get used when
vm86 mode is available.

-- 
Alexandre Julliard
julli...@winehq.org


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread One Thousand Gnomes
 Ideally the tests should be doable such that on a normal machine the
 tests can be overlapped with the other things we have to do on that
 path.  The exit branch will be strongly predicted in the negative
 direction, so it shouldn't be a significant problem.
 
 Again, this is not the case in the current prototype.

Or you make sure that you switch to those code paths only after software
has executed syscalls that make it possible it will use a 16bit ss. 

The other question I have is - is there any reason we can't fix up the
IRET to do a 32bit return into a vsyscall type userspace page which then
does a long jump or retf to the right place ?

Alan


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread Boris Ostrovsky

On 04/22/2014 09:42 PM, H. Peter Anvin wrote:

On 04/22/2014 06:23 PM, Andrew Lutomirski wrote:

What's the to_dmesg thing for?


It's for debugging... the espfix page tables generate so many duplicate
entries that trying to output it via a seqfile runs out of memory.  I
suspect we need to do something like skip the espfix range or some other
hack.


It looks sane, although I haven't checked the detailed register manipulation.

Users of big systems may complain when every single CPU lines up for
that mutex.  Maybe no one cares.

Right now the whole smpboot sequence is fully serialized... that needs
to be fixed.

Konrad - I really could use some help figuring out what needs to be done
for this not to break Xen.


This does break Xen PV:

[3.683735] [ cut here ]
[3.683807] WARNING: CPU: 0 PID: 0 at arch/x86/xen/multicalls.c:129 
xen_mc_flush+0x1c8/0x1d0()

[3.683903] Modules linked in:
[3.684006] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.15.0-rc2 #2

[3.684176]  0009 81c01de0 816cfb15 

[3.684416]  81c01e18 81084abd  
0001
[3.684654]   88023da0b180 0010 
81c01e28

[3.684893] Call Trace:
[3.684962]  [816cfb15] dump_stack+0x45/0x56
[3.685032]  [81084abd] warn_slowpath_common+0x7d/0xa0
[3.685102]  [81084b9a] warn_slowpath_null+0x1a/0x20
[3.685171]  [810050a8] xen_mc_flush+0x1c8/0x1d0
[3.685240]  [81008155] xen_set_pgd+0x1f5/0x220
[3.685310]  [8101975a] init_espfix_this_cpu+0x36a/0x380
[3.685379]  [813cb559] ? acpi_tb_initialize_facs+0x31/0x33
[3.685450]  [81d27ec6] start_kernel+0x37f/0x411
[3.685517]  [81d27950] ? repair_env_string+0x5c/0x5c
[3.685586]  [81d27606] x86_64_start_reservations+0x2a/0x2c
[3.685654]  [81d2a6df] xen_start_kernel+0x594/0x5a0
[3.685728] ---[ end trace a2cf2d7b2ecab826 ]---

But then I think we may want to rearrange preempt_enable/disable in 
xen_set_pgd().


-boris


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-23 Thread H. Peter Anvin
On 04/23/2014 02:54 AM, One Thousand Gnomes wrote:
 Ideally the tests should be doable such that on a normal machine the
 tests can be overlapped with the other things we have to do on that
 path.  The exit branch will be strongly predicted in the negative
 direction, so it shouldn't be a significant problem.

 Again, this is not the case in the current prototype.
 
 Or you make sure that you switch to those code paths only after software
 has executed syscalls that make it possible it will use a 16bit ss. 
 

Which, again, would introduce a race, I believe, at least if we have an
LDT at all (and since we only enter these code paths for LDT descriptors
in the first place, it is equivalent to the current code minus the filters.)

 The other question I have is - is there any reason we can't fix up the
 IRET to do a 32bit return into a vsyscall type userspace page which then
 does a long jump or retf to the right place ?

I did a writeup on this a while ago.  It does have the problem that you
need additional memory in userspace, which is per-thread and in the
right region of userspace; this pretty much means you have to muck about
with the user space stack when user space is running in weird modes.
This gets complex very quickly and does have some footprint.
Furthermore, on some CPUs (not including any recent Intel CPUs) there is
still a way to leak bits [63:32].  I believe the in-kernel solution is
actually simpler.

-hpa




Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread H. Peter Anvin
On 04/22/2014 06:23 PM, Andrew Lutomirski wrote:
> 
> What's the to_dmesg thing for?
> 

It's for debugging... the espfix page tables generate so many duplicate
entries that trying to output it via a seqfile runs out of memory.  I
suspect we need to do something like skip the espfix range or some other
hack.

> It looks sane, although I haven't checked the detailed register manipulation.
> 
> Users of big systems may complain when every single CPU lines up for
> that mutex.  Maybe no one cares.

Right now the whole smpboot sequence is fully serialized... that needs
to be fixed.

Konrad - I really could use some help figuring out what needs to be done
for this not to break Xen.

-hpa



Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Andrew Lutomirski
On Tue, Apr 22, 2014 at 6:17 PM, H. Peter Anvin  wrote:
> Another spin of the prototype.  This one avoids the espfix for anything
> but #GP, and avoids save/restore/saving registers... one can wonder,
> though, how much that actually matters in practice.
>
> It still does redundant SWAPGS on the slow path.  I'm not sure I
> personally care enough to optimize that, as it means some fairly
> significant restructuring of some of the code paths.  Some of that
> restructuring might actually be beneficial, but still...
>

What's the to_dmesg thing for?

It looks sane, although I haven't checked the detailed register manipulation.

Users of big systems may complain when every single CPU lines up for
that mutex.  Maybe no one cares.

--Andy


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread H. Peter Anvin
Another spin of the prototype.  This one avoids the espfix for anything
but #GP, and avoids save/restore/saving registers... one can wonder,
though, how much that actually matters in practice.

It still does redundant SWAPGS on the slow path.  I'm not sure I
personally care enough to optimize that, as it means some fairly
significant restructuring of some of the code paths.  Some of that
restructuring might actually be beneficial, but still...

-hpa

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 9264f04a4c55..cea5b9b517f2 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -57,6 +57,8 @@ extern void x86_ce4100_early_setup(void);
 static inline void x86_ce4100_early_setup(void) { }
 #endif
 
+extern void init_espfix_this_cpu(void);
+
 #ifndef _SETUP
 
 /*
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index f4d96000d33a..1cc3789d99d9 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -29,6 +29,7 @@ obj-$(CONFIG_X86_64)  += sys_x86_64.o x8664_ksyms_64.o
 obj-y  += syscall_$(BITS).o vsyscall_gtod.o
 obj-$(CONFIG_X86_64)   += vsyscall_64.o
 obj-$(CONFIG_X86_64)   += vsyscall_emu_64.o
+obj-$(CONFIG_X86_64)   += espfix_64.o
 obj-$(CONFIG_SYSFS)+= ksysfs.o
 obj-y  += bootflag.o e820.o
 obj-y  += pci-dma.o quirks.o topology.o kdebugfs.o
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 1e96c3628bf2..7f71c97f59c0 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -58,6 +58,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /* Avoid __ASSEMBLER__'ifying  just for this.  */
@@ -1040,8 +1041,16 @@ restore_args:
RESTORE_ARGS 1,8,1
 
 irq_return:
+   /*
+* Are we returning to the LDT?  Note: in 64-bit mode
+* SS:RSP on the exception stack is always valid.
+*/
+   testb $4,(SS-RIP)(%rsp)
+   jnz irq_return_ldt
+
+irq_return_iret:
INTERRUPT_RETURN
-   _ASM_EXTABLE(irq_return, bad_iret)
+   _ASM_EXTABLE(irq_return_iret, bad_iret)
 
 #ifdef CONFIG_PARAVIRT
 ENTRY(native_iret)
@@ -1049,6 +1058,34 @@ ENTRY(native_iret)
_ASM_EXTABLE(native_iret, bad_iret)
 #endif
 
+irq_return_ldt:
+   pushq_cfi %rcx
+   larl (CS-RIP+8)(%rsp), %ecx
+   jnz 1f  /* Invalid segment - will #GP at IRET time */
+   testl $0x0020, %ecx
+   jnz 1f  /* Returning to 64-bit mode */
+   larl (SS-RIP+8)(%rsp), %ecx
+   jnz 1f  /* Invalid segment - will #SS at IRET time */
+   testl $0x0040, %ecx
+   jnz 1f  /* Not a 16-bit stack segment */
+   pushq_cfi %rsi
+   pushq_cfi %rdi
+   SWAPGS
+   movq PER_CPU_VAR(espfix_stack),%rdi
+   movl (RSP-RIP+3*8)(%rsp),%esi
+   xorw %si,%si
+   orq %rsi,%rdi
+   movq %rsp,%rsi
+   movl $8,%ecx
+   rep;movsq
+   leaq -(8*8)(%rdi),%rsp
+   SWAPGS
+   popq_cfi %rdi
+   popq_cfi %rsi
+1:
+   popq_cfi %rcx
+   jmp irq_return_iret
+
.section .fixup,"ax"
 bad_iret:
/*
@@ -1058,6 +1095,7 @@ bad_iret:
 * So pretend we completed the iret and took the #GPF in user mode.
 *
 * We are now running with the kernel GS after exception recovery.
+* Exception entry will have removed us from the espfix stack.
 * But error_entry expects us to have user GS to match the user %cs,
 * so swap back.
 */
@@ -1278,6 +1316,62 @@ ENTRY(\sym)
 END(\sym)
 .endm
 
+/*
+ * Same as errorentry, except use for #GP in case we take the exception
+ * while on the espfix stack.  All other exceptions that are possible while
+ * on the espfix stack use IST, but that is not really practical for #GP
+ * for nesting reasons.
+ */
+.macro errorentry_espfix sym do_sym
+ENTRY(\sym)
+   XCPT_FRAME
+   ASM_CLAC
+   PARAVIRT_ADJUST_EXCEPTION_FRAME
+   /* Check if we are on the espfix stack */
+   pushq_cfi %rdi
+   pushq_cfi %rsi
+   movq %rsp,%rdi
+   sarq $PGDIR_SHIFT,%rdi
+   cmpl $-2,%edi   /* Are we on the espfix stack? */
+   CFI_REMEMBER_STATE
+   je 1f
+2:
+   subq $RSI-R15, %rsp
+   CFI_ADJUST_CFA_OFFSET RSI-R15
+   call error_entry_rdi_rsi_saved
+   DEFAULT_FRAME 0
+   movq %rsp,%rdi  /* pt_regs pointer */
+   movq ORIG_RAX(%rsp),%rsi/* get error code */
+   movq $-1,ORIG_RAX(%rsp) /* no syscall to restart */
+   call \do_sym
+   jmp error_exit  /* %ebx: no swapgs flag */
+1:
+   CFI_RESTORE_STATE
+   SWAPGS
+   movq PER_CPU_VAR(kernel_stack),%rdi
+   SWAPGS
+   /* Copy data from the espfix stack to the real stack */
+   movq %rsi,-64(%rdi) /* Saved value of %rsi already */
+   movq 8(%rsp),%rsi
+   movq %rsi,-56(%rdi)
+   movq 16(%rsp),%rsi
+   movq %rsi,-48(%rdi)
+   

Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread H. Peter Anvin
On 04/22/2014 04:39 PM, Andi Kleen wrote:
>> That simply will not work if you can take a #GP due to the "safe" MSR
>> functions from NMI and #MC context, which would be my main concern.
> 
> At some point the IST entry functions subtracted 1K while the
> handler ran to handle simple nesting cases.
> 
> Not sure that code is still there.

Doesn't help if you take an NMI on the first instruction of the #GP handler.

-hpa




Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Andi Kleen
> That simply will not work if you can take a #GP due to the "safe" MSR
> functions from NMI and #MC context, which would be my main concern.

At some point the IST entry functions subtracted 1K while the
handler ran to handle simple nesting cases.

Not sure that code is still there.

-Andi


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Brian Gerst
On Tue, Apr 22, 2014 at 4:17 PM, H. Peter Anvin  wrote:
> On 04/22/2014 12:55 PM, Brian Gerst wrote:
>> On Tue, Apr 22, 2014 at 2:51 PM, H. Peter Anvin  wrote:
>>> On 04/22/2014 11:17 AM, Brian Gerst wrote:
>
> That is the entry condition that we have to deal with.  The fact that
> the switch to the IST is unconditional is what makes ISTs hard to deal 
> with.

 Right, that is why you switch away from the IST as soon as possible,
 copying the data that is already pushed there to another stack so it
 won't be overwritten by a recursive fault.

>>>
>>> That simply will not work if you can take a #GP due to the "safe" MSR
>>> functions from NMI and #MC context, which would be my main concern.
>>
>> In that case (#2 above), you would switch to the previous %rsp (in the
>> NMI/MC stack), copy the exception frame from the IST, and continue
>> with the #GP handler.  That effectively is the same as it is today,
>> where no stack switch occurs on the #GP fault.
>>
>
> 1. You take #GP.  This causes an IST stack switch.
> 2. You immediately thereafter take an NMI.  This switches stacks again.
> 3. Now you take another #GP.  This causes another IST stack, and now you
> have clobbered your return information, and cannot resume.

You are right.  That will not work.


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread H. Peter Anvin
On 04/22/2014 12:55 PM, Brian Gerst wrote:
> On Tue, Apr 22, 2014 at 2:51 PM, H. Peter Anvin  wrote:
>> On 04/22/2014 11:17 AM, Brian Gerst wrote:

>>>> That is the entry condition that we have to deal with.  The fact that
>>>> the switch to the IST is unconditional is what makes ISTs hard to deal
>>>> with.
>>>
>>> Right, that is why you switch away from the IST as soon as possible,
>>> copying the data that is already pushed there to another stack so it
>>> won't be overwritten by a recursive fault.
>>>
>>
>> That simply will not work if you can take a #GP due to the "safe" MSR
>> functions from NMI and #MC context, which would be my main concern.
> 
> In that case (#2 above), you would switch to the previous %rsp (in the
> NMI/MC stack), copy the exception frame from the IST, and continue
> with the #GP handler.  That effectively is the same as it is today,
> where no stack switch occurs on the #GP fault.
> 

1. You take #GP.  This causes an IST stack switch.
2. You immediately thereafter take an NMI.  This switches stacks again.
3. Now you take another #GP.  This causes another IST stack, and now you
have clobbered your return information, and cannot resume.

-hpa



Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Brian Gerst
On Tue, Apr 22, 2014 at 2:51 PM, H. Peter Anvin  wrote:
> On 04/22/2014 11:17 AM, Brian Gerst wrote:
>>>
>>> That is the entry condition that we have to deal with.  The fact that
>>> the switch to the IST is unconditional is what makes ISTs hard to deal with.
>>
>> Right, that is why you switch away from the IST as soon as possible,
>> copying the data that is already pushed there to another stack so it
>> won't be overwritten by a recursive fault.
>>
>
> That simply will not work if you can take a #GP due to the "safe" MSR
> functions from NMI and #MC context, which would be my main concern.

In that case (#2 above), you would switch to the previous %rsp (in the
NMI/MC stack), copy the exception frame from the IST, and continue
with the #GP handler.  That effectively is the same as it is today,
where no stack switch occurs on the #GP fault.


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Borislav Petkov
On Tue, Apr 22, 2014 at 10:29:45AM -0700, Andrew Lutomirski wrote:
> Or we could add a TIF_NEEDS_ESPFIX that gets set once you have a
> 16-bit LDT entry.

Or something like that, yep.

> But I think it makes sense to nail down everything else first. I
> suspect that a single test-and-branch in the iret path will be lost in
> the noise from iret itself.

The cumulative effects of such additions here and there are nasty
though. If we can make the general path free relatively painlessly, we
should do it, IMO.

But yeah, later.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread H. Peter Anvin
On 04/22/2014 11:17 AM, Brian Gerst wrote:
>>
>> That is the entry condition that we have to deal with.  The fact that
>> the switch to the IST is unconditional is what makes ISTs hard to deal with.
> 
> Right, that is why you switch away from the IST as soon as possible,
> copying the data that is already pushed there to another stack so it
> won't be overwritten by a recursive fault.
> 

That simply will not work if you can take a #GP due to the "safe" MSR
functions from NMI and #MC context, which would be my main concern.

-hpa



Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Brian Gerst
On Tue, Apr 22, 2014 at 2:06 PM, H. Peter Anvin  wrote:
> On 04/22/2014 11:03 AM, Brian Gerst wrote:
>>
>> Maybe make the #GP handler check what the previous stack was at the start:
>> 1) If we came from userspace, switch to the top of the process stack.
>> 2) If the previous stack was not the espfix stack, switch back to that stack.
>> 3) Switch to the top of the process stack (espfix case)
>>
>> This leaves the IST available for any recursive faults.
>>
>
> Do you actually know what the IST is?  If so, you should realize the
> above is nonsense.
>
> The *hardware* switches stack on an exception; if the vector is set up
> as an IST, then we *always* switch to the IST stack, unconditionally.
> If the vector is not, then we switch to the process stack if we came
> from userspace.
>
> That is the entry condition that we have to deal with.  The fact that
> the switch to the IST is unconditional is what makes ISTs hard to deal with.

Right, that is why you switch away from the IST as soon as possible,
copying the data that is already pushed there to another stack so it
won't be overwritten by a recursive fault.


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread H. Peter Anvin
On 04/22/2014 11:03 AM, Brian Gerst wrote:
> 
> Maybe make the #GP handler check what the previous stack was at the start:
> 1) If we came from userspace, switch to the top of the process stack.
> 2) If the previous stack was not the espfix stack, switch back to that stack.
> 3) Switch to the top of the process stack (espfix case)
> 
> This leaves the IST available for any recursive faults.
> 

Do you actually know what the IST is?  If so, you should realize the
above is nonsense.

The *hardware* switches stack on an exception; if the vector is set up
as an IST, then we *always* switch to the IST stack, unconditionally.
If the vector is not, then we switch to the process stack if we came
from userspace.

That is the entry condition that we have to deal with.  The fact that
the switch to the IST is unconditional is what makes ISTs hard to deal with.

-hpa



Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Brian Gerst
On Tue, Apr 22, 2014 at 1:46 PM, Andrew Lutomirski  wrote:
> On Tue, Apr 22, 2014 at 10:29 AM, H. Peter Anvin  wrote:
>> On 04/22/2014 10:19 AM, Linus Torvalds wrote:
>>> On Tue, Apr 22, 2014 at 10:11 AM, Andrew Lutomirski  
>>> wrote:

>>>>>
>>>>> Anyway, if done correctly, this whole espfix should be totally free
>>>>> for normal processes, since it should only trigger if SS is a LDT
>>>>> entry (bit #2 set in the segment descriptor). So the normal fast-path
>>>>> should just have a simple test for that.
>>>>
>>>> How?  Doesn't something still need to check whether SS is funny before
>>>> doing iret?
>>>
>>> Just test bit #2. Don't do anything else if it's clear, because you
>>> should be done. You don't need to do anything special if it's clear,
>>> because I don't *think* we have any 16-bit data segments in the GDT on
>>> x86-64.
>>>
>>
>> And we don't (neither do we on i386, and we depend on that invariance.)
>>
>> Hence:
>>
>>  irq_return:
>> +   /*
>> +* Are we returning to the LDT?  Note: in 64-bit mode
>> +* SS:RSP on the exception stack is always valid.
>> +*/
>> +   testb $4,(SS-RIP)(%rsp)
>> +   jnz irq_return_ldt
>> +
>> +irq_return_iret:
>> INTERRUPT_RETURN
>> -   _ASM_EXTABLE(irq_return, bad_iret)
>> +   _ASM_EXTABLE(irq_return_iret, bad_iret)
>>
>>
>> That is the whole impact of the IRET path.
>>
>> If using IST for #GP won't cause trouble (ISTs don't nest, so we need to
>> make sure there is absolutely no way we could end up nested) then the
>> rest of the fixup code can go away and we kill the common path
>> exception-handling overhead; the only new overhead is the IST
>> indirection for #GP, which isn't a performance-critical fault (good
>> thing, because untangling #GP faults is a major effort.)
>
> I'd be a bit nervous about read_msr_safe and friends.  Also, what
> happens if userspace triggers a #GP and the signal stack setup causes
> a page fault?
>
> --Andy

Maybe make the #GP handler check what the previous stack was at the start:
1) If we came from userspace, switch to the top of the process stack.
2) If the previous stack was not the espfix stack, switch back to that stack.
3) Switch to the top of the process stack (espfix case)

This leaves the IST available for any recursive faults.


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread H. Peter Anvin
On 04/22/2014 10:46 AM, Andrew Lutomirski wrote:
>>
>> That is the whole impact of the IRET path.
>>
>> If using IST for #GP won't cause trouble (ISTs don't nest, so we need to
>> make sure there is absolutely no way we could end up nested) then the
>> rest of the fixup code can go away and we kill the common path
>> exception-handling overhead; the only new overhead is the IST
>> indirection for #GP, which isn't a performance-critical fault (good
>> thing, because untangling #GP faults is a major effort.)
> 
> I'd be a bit nervous about read_msr_safe and friends.  Also, what
> happens if userspace triggers a #GP and the signal stack setup causes
> a page fault?
> 

Yes, #GPs inside the kernel could be a real problem.  MSRs generally
don't trigger #SS.  The second scenario shouldn't be a problem, the #PF
will be delivered on the currently active stack.

On the other hand, doing the espfix fixup only for #GP might be an
alternative, as long as we can convince ourselves that it really is the
only fault that could possibly be delivered on the espfix path.

-hpa




Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Andrew Lutomirski
On Tue, Apr 22, 2014 at 10:29 AM, H. Peter Anvin  wrote:
> On 04/22/2014 10:19 AM, Linus Torvalds wrote:
>> On Tue, Apr 22, 2014 at 10:11 AM, Andrew Lutomirski  wrote:
>>>

>>>> Anyway, if done correctly, this whole espfix should be totally free
>>>> for normal processes, since it should only trigger if SS is a LDT
>>>> entry (bit #2 set in the segment descriptor). So the normal fast-path
>>>> should just have a simple test for that.
>>>
>>> How?  Doesn't something still need to check whether SS is funny before
>>> doing iret?
>>
>> Just test bit #2. Don't do anything else if it's clear, because you
>> should be done. You don't need to do anything special if it's clear,
>> because I don't *think* we have any 16-bit data segments in the GDT on
>> x86-64.
>>
>
> And we don't (neither do we on i386, and we depend on that invariance.)
>
> Hence:
>
>  irq_return:
> +   /*
> +* Are we returning to the LDT?  Note: in 64-bit mode
> +* SS:RSP on the exception stack is always valid.
> +*/
> +   testb $4,(SS-RIP)(%rsp)
> +   jnz irq_return_ldt
> +
> +irq_return_iret:
> INTERRUPT_RETURN
> -   _ASM_EXTABLE(irq_return, bad_iret)
> +   _ASM_EXTABLE(irq_return_iret, bad_iret)
>
>
> That is the whole impact of the IRET path.
>
> If using IST for #GP won't cause trouble (ISTs don't nest, so we need to
> make sure there is absolutely no way we could end up nested) then the
> rest of the fixup code can go away and we kill the common path
> exception-handling overhead; the only new overhead is the IST
> indirection for #GP, which isn't a performance-critical fault (good
> thing, because untangling #GP faults is a major effort.)

I'd be a bit nervous about read_msr_safe and friends.  Also, what
happens if userspace triggers a #GP and the signal stack setup causes
a page fault?

--Andy


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Andrew Lutomirski
On Tue, Apr 22, 2014 at 10:26 AM, Borislav Petkov  wrote:
> On Tue, Apr 22, 2014 at 10:11:27AM -0700, H. Peter Anvin wrote:
>> The fastpath interference is:
>>
>> 1. Testing for an LDT SS selector before IRET.  This is actually easier
>> than on 32 bits, because on 64 bits the SS:RSP on the stack is always valid.
>>
>> 2. Testing for an RSP inside the espfix region on exception entry, so we
>> can switch back the stack.  This has to be done very early in the
>> exception entry since the espfix stack is so small.  If NMI and #MC
>> didn't use IST it wouldn't work at all (but neither would SYSCALL).
>>
>> #2 currently saves/restores %rdi, which is also saved further down.
>> This is obviously wasteful.
>
> Btw, can we runtime-patch the fastpath interference chunk the moment we
> see a 16-bit segment? I.e., connect it to the write_idt() path, i.e. in
> the hunk you've removed in there and enable the espfix checks there the
> moment we load a 16-bit segment.
>
> I know, I know, this is not so important right now but let me put it out
> there just the same.

Or we could add a TIF_NEEDS_ESPFIX that gets set once you have a
16-bit LDT entry.

But I think it makes sense to nail down everything else first.  I
suspect that a single test-and-branch in the iret path will be lost in
the noise from iret itself.

--Andy


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread H. Peter Anvin
On 04/22/2014 10:19 AM, Linus Torvalds wrote:
> On Tue, Apr 22, 2014 at 10:11 AM, Andrew Lutomirski  wrote:
>>
>>>
>>> Anyway, if done correctly, this whole espfix should be totally free
>>> for normal processes, since it should only trigger if SS is a LDT
>>> entry (bit #2 set in the segment descriptor). So the normal fast-path
>>> should just have a simple test for that.
>>
>> How?  Doesn't something still need to check whether SS is funny before
>> doing iret?
> 
> Just test bit #2. Don't do anything else if it's clear, because you
> should be done. You don't need to do anything special if it's clear,
> because I don't *think* we have any 16-bit data segments in the GDT on
> x86-64.
> 

And we don't (neither do we on i386, and we depend on that invariance.)

Hence:

 irq_return:
+   /*
+* Are we returning to the LDT?  Note: in 64-bit mode
+* SS:RSP on the exception stack is always valid.
+*/
+   testb $4,(SS-RIP)(%rsp)
+   jnz irq_return_ldt
+
+irq_return_iret:
INTERRUPT_RETURN
-   _ASM_EXTABLE(irq_return, bad_iret)
+   _ASM_EXTABLE(irq_return_iret, bad_iret)


That is the whole impact of the IRET path.

If using IST for #GP won't cause trouble (ISTs don't nest, so we need to
make sure there is absolutely no way we could end up nested) then the
rest of the fixup code can go away and we kill the common path
exception-handling overhead; the only new overhead is the IST
indirection for #GP, which isn't a performance-critical fault (good
thing, because untangling #GP faults is a major effort.)

-hpa



Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Borislav Petkov
On Tue, Apr 22, 2014 at 10:11:27AM -0700, H. Peter Anvin wrote:
> The fastpath interference is:
> 
> 1. Testing for an LDT SS selector before IRET.  This is actually easier
> than on 32 bits, because on 64 bits the SS:RSP on the stack is always valid.
> 
> 2. Testing for an RSP inside the espfix region on exception entry, so we
> can switch back the stack.  This has to be done very early in the
> exception entry since the espfix stack is so small.  If NMI and #MC
> didn't use IST it wouldn't work at all (but neither would SYSCALL).
> 
> #2 currently saves/restores %rdi, which is also saved further down.
> This is obviously wasteful.

Btw, can we runtime-patch the fastpath interference chunk the moment we
see a 16-bit segment? I.e., connect it to the write_idt() path, i.e. in
the hunk you've removed in there and enable the espfix checks there the
moment we load a 16-bit segment.

I know, I know, this is not so important right now but let me put it out
there just the same.

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread H. Peter Anvin
On 04/22/2014 10:20 AM, Andrew Lutomirski wrote:
> 
> It won't, given the above.  I misunderstood what you were checking.
> 
> It still seems to me that only #GP needs this special handling.  The
> IST entries should never run on the espfix stack, and #MC, #DB, #NM,
> and #SS (I missed that one earlier) use IST.
> 
> Would it ever make sense to make #GP use IST as well?  That might
> allow espfix_adjust_stack to be removed entirely.  I don't know how
> much other fiddling would be needed to make that work.
> 

Interesting thought.  It might even be able to share an IST entry with #SS.

-hpa




Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Andrew Lutomirski
On Tue, Apr 22, 2014 at 10:09 AM, H. Peter Anvin  wrote:
>
> As for Andy's questions:
>
>> What happens on the IST entries?  If I've read your patch right,
>> you're still switching back to the normal stack, which looks
>> questionable.
>
> No, in that case %rsp won't point into the espfix region, and the switch
> will be bypassed.  We will resume back into the espfix region on IRET,
> which is actually required e.g. if we take an NMI in the middle of the
> espfix setup.

Aha.  I misread that.  Would it be worth adding a comment along the lines of

/* Check whether we are running on the espfix stack.  This is
different from checking whether we faulted from the espfix stack,
since an ist exception will have switched us off of the espfix stack.
*/

>
>> Also, if you want to save some register abuse on each exception entry,
>> could you check the saved RIP instead of the current RSP?  I.e. use
>> the test instruction with offset(%rsp)?  Maybe there are multiple
>> possible values, though, and just testing some bits doesn't help.
>
> I don't see how that would work.

It won't, given the above.  I misunderstood what you were checking.

It still seems to me that only #GP needs this special handling.  The
IST entries should never run on the espfix stack, and #MC, #DB, #NM,
and #SS (I missed that one earlier) use IST.

Would it ever make sense to make #GP use IST as well?  That might
allow espfix_adjust_stack to be removed entirely.  I don't know how
much other fiddling would be needed to make that work.

--Andy


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Linus Torvalds
On Tue, Apr 22, 2014 at 10:11 AM, Andrew Lutomirski  wrote:
>
>>
>> Anyway, if done correctly, this whole espfix should be totally free
>> for normal processes, since it should only trigger if SS is a LDT
>> entry (bit #2 set in the segment descriptor). So the normal fast-path
>> should just have a simple test for that.
>
> How?  Doesn't something still need to check whether SS is funny before
> doing iret?

Just test bit #2. Don't do anything else if it's clear, because you
should be done. You don't need to do anything special if it's clear,
because I don't *think* we have any 16-bit data segments in the GDT on
x86-64.

  Linus


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread H. Peter Anvin
On 04/22/2014 10:04 AM, Linus Torvalds wrote:
> 
> The segment table is shared for a process. So you can have one thread
> doing a load_ldt() that invalidates a segment, while another thread is
> busy taking a page fault. The segment was valid at page fault time and
> is saved on the kernel stack, but by the time the page fault returns,
> it is no longer valid and the iretq will fault.
> 
> Anyway, if done correctly, this whole espfix should be totally free
> for normal processes, since it should only trigger if SS is a LDT
> entry (bit #2 set in the segment descriptor). So the normal fast-path
> should just have a simple test for that.
> 
> And if you have a SS that is a descriptor in the LDT, nobody cares
> about performance any more.
> 

The fastpath interference is:

1. Testing for an LDT SS selector before IRET.  This is actually easier
than on 32 bits, because on 64 bits the SS:RSP on the stack is always valid.

2. Testing for an RSP inside the espfix region on exception entry, so we
can switch back the stack.  This has to be done very early in the
exception entry since the espfix stack is so small.  If NMI and #MC
didn't use IST it wouldn't work at all (but neither would SYSCALL).

#2 currently saves/restores %rdi, which is also saved further down.
This is obviously wasteful.

-hpa




Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread H. Peter Anvin
On 04/22/2014 10:11 AM, Andrew Lutomirski wrote:
>>
>> Anyway, if done correctly, this whole espfix should be totally free
>> for normal processes, since it should only trigger if SS is a LDT
>> entry (bit #2 set in the segment descriptor). So the normal fast-path
>> should just have a simple test for that.
> 
> How?  Doesn't something still need to check whether SS is funny before
> doing iret?
> 

Ideally the tests should be doable such that on a normal machine the
tests can be overlapped with the other things we have to do on that
path.  The exit branch will be strongly predicted in the negative
direction, so it shouldn't be a significant problem.

Again, this is not the case in the current prototype.

-hpa




Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Andrew Lutomirski
On Tue, Apr 22, 2014 at 10:04 AM, Linus Torvalds
 wrote:
> On Tue, Apr 22, 2014 at 10:00 AM, Andrew Lutomirski  wrote:
>>
>> My point is that it may be safe to remove the special espfix fixup
>> from #PF, which is probably the most performance-critical piece here,
>> aside from iret itself.
>
> Actually, even that is unsafe.
>
> Why?
>
> The segment table is shared for a process. So you can have one thread
> doing a load_ldt() that invalidates a segment, while another thread is
> busy taking a page fault. The segment was valid at page fault time and
> is saved on the kernel stack, but by the time the page fault returns,
> it is no longer valid and the iretq will fault.

Let me try that again: I think it should be safe to remove the check
for "did we fault from the espfix stack" from the #PF entry.  You can
certainly have all kinds of weird things happen on return from #PF,
but the overhead that I'm talking about is a test on exception *entry*
to see whether the fault happened on the espfix stack so that we can
switch back to running on a real stack.

If the espfix code and the iret at the end can't cause #PF, then the
check in #PF entry can be removed, I think.

>
> Anyway, if done correctly, this whole espfix should be totally free
> for normal processes, since it should only trigger if SS is a LDT
> entry (bit #2 set in the segment descriptor). So the normal fast-path
> should just have a simple test for that.

How?  Doesn't something still need to check whether SS is funny before
doing iret?

--Andy


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread H. Peter Anvin
On 04/22/2014 10:00 AM, Andrew Lutomirski wrote:
>>
>> Yes, you can very much trigger GP deliberately.
>>
>> The way to do it is to just make an invalid segment descriptor on the
>> iret stack. Or make it a valid 16-bit one, but make it a code segment
>> for the stack pointer, or read-only, or whatever. All of which is
>> trivial to do with a sigreturn system call. But you can do it other
>> ways too - enter with a SS that is valid, but do a load_ldt() system
>> call that makes it invalid, so that by the time you exit it is no
>> longer valid etc.
>>
>> There's a reason we mark that "iretq" as taking faults with that
>>
>> _ASM_EXTABLE(native_iret, bad_iret)
>>
>> and that "bad_iret" creates a GP fault.
>>
>> And that's a lot of kernel stack. The whole initial GP fault path,
>> which goes to the C code that finds the exception table etc. See
>> do_general_protection_fault() and fixup_exception().
> 
> My point is that it may be safe to remove the special espfix fixup
> from #PF, which is probably the most performance-critical piece here,
> aside from iret itself.
> 

It *might* even be plausible to do full manual sanitization, so that the
IRET cannot fault, but I have to admit to that being somewhat daunting,
especially given the thread/process distinction.  I wasn't actually sure
about the status of the LDT on the thread vs process scale (the GDT is
per-CPU, but has some entries that are context-switched per *thread*,
but I hadn't looked at the LDT recently.)

As for Andy's questions:

> What happens on the IST entries?  If I've read your patch right,
> you're still switching back to the normal stack, which looks
> questionable.

No, in that case %rsp won't point into the espfix region, and the switch
will be bypassed.  We will resume back into the espfix region on IRET,
which is actually required e.g. if we take an NMI in the middle of the
espfix setup.

> Also, if you want to save some register abuse on each exception entry,
> could you check the saved RIP instead of the current RSP?  I.e. use
> the test instruction with offset(%rsp)?  Maybe there are multiple
> possible values, though, and just testing some bits doesn't help.

I don't see how that would work.

-hpa



Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Linus Torvalds
On Tue, Apr 22, 2014 at 10:00 AM, Andrew Lutomirski  wrote:
>
> My point is that it may be safe to remove the special espfix fixup
> from #PF, which is probably the most performance-critical piece here,
> aside from iret itself.

Actually, even that is unsafe.

Why?

The segment table is shared for a process. So you can have one thread
doing a load_ldt() that invalidates a segment, while another thread is
busy taking a page fault. The segment was valid at page fault time and
is saved on the kernel stack, but by the time the page fault returns,
it is no longer valid and the iretq will fault.

Anyway, if done correctly, this whole espfix should be totally free
for normal processes, since it should only trigger if SS is a LDT
entry (bit #2 set in the segment descriptor). So the normal fast-path
should just have a simple test for that.

And if you have a SS that is a descriptor in the LDT, nobody cares
about performance any more.

 Linus


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Andrew Lutomirski
On Tue, Apr 22, 2014 at 9:43 AM, Linus Torvalds
 wrote:
> On Tue, Apr 22, 2014 at 9:33 AM, Andrew Lutomirski  wrote:
>>
>> For the espfix_adjust_stack thing, when can it actually need to do
>> anything?  irqs should be off, I think, and MCE, NMI, and debug
>> exceptions use ist, so that leaves just #SS and #GP, I think.  How can
>> those actually occur?  Is there a way to trigger them deliberately
>> from userspace?  Why do you have three espfix_adjust_stack
>
> Yes, you can very much trigger GP deliberately.
>
> The way to do it is to just make an invalid segment descriptor on the
> iret stack. Or make it a valid 16-bit one, but make it a code segment
> for the stack pointer, or read-only, or whatever. All of which is
> trivial to do with a sigreturn system call. But you can do it other
> ways too - enter with a SS that is valid, but do a load_ldt() system
> call that makes it invalid, so that by the time you exit it is no
> longer valid etc.
>
> There's a reason we mark that "iretq" as taking faults with that
>
> _ASM_EXTABLE(native_iret, bad_iret)
>
> and that "bad_iret" creates a GP fault.
>
> And that's a lot of kernel stack. The whole initial GP fault path,
> which goes to the C code that finds the exception table etc. See
> do_general_protection_fault() and fixup_exception().

My point is that it may be safe to remove the special espfix fixup
from #PF, which is probably the most performance-critical piece here,
aside from iret itself.

--Andy


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Linus Torvalds
On Tue, Apr 22, 2014 at 9:33 AM, Andrew Lutomirski  wrote:
>
> For the espfix_adjust_stack thing, when can it actually need to do
> anything?  irqs should be off, I think, and MCE, NMI, and debug
> exceptions use ist, so that leaves just #SS and #GP, I think.  How can
> those actually occur?  Is there a way to trigger them deliberately
> from userspace?  Why do you have three espfix_adjust_stack

Yes, you can very much trigger GP deliberately.

The way to do it is to just make an invalid segment descriptor on the
iret stack. Or make it a valid 16-bit one, but make it a code segment
for the stack pointer, or read-only, or whatever. All of which is
trivial to do with a sigreturn system call. But you can do it other
ways too - enter with an SS that is valid, but do a load_ldt() system
call that makes it invalid, so that by the time you exit it is no
longer valid etc.

There's a reason we mark that "iretq" as taking faults with that

_ASM_EXTABLE(native_iret, bad_iret)

and that "bad_iret" creates a GP fault.

And that's a lot of kernel stack. The whole initial GP fault path,
which goes to the C code that finds the exception table etc. See
do_general_protection_fault() and fixup_exception().

Linus


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Andrew Lutomirski
On Tue, Apr 22, 2014 at 9:10 AM, H. Peter Anvin  wrote:
> Honestly, guys... you're painting the bikeshed at the moment.
>
> Initialization is the easiest bit of all this code.  The tricky part is
> *the rest of the code*, i.e. the stuff in entry_64.S.

That's because the initialization code is much simpler, so it's easy
to pick on :)  Sorry.

For the espfix_adjust_stack thing, when can it actually need to do
anything?  irqs should be off, I think, and MCE, NMI, and debug
exceptions use ist, so that leaves just #SS and #GP, I think.  How can
those actually occur?  Is there a way to trigger them deliberately
from userspace?  Why do you have three espfix_adjust_stack

What happens on the IST entries?  If I've read your patch right,
you're still switching back to the normal stack, which looks
questionable.

Also, if you want to save some register abuse on each exception entry,
could you check the saved RIP instead of the current RSP?  I.e. use
the test instruction with offset(%rsp)?  Maybe there are multiple
possible values, though, and just testing some bits doesn't help.

--Andy


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread H. Peter Anvin
Honestly, guys... you're painting the bikeshed at the moment.

Initialization is the easiest bit of all this code.  The tricky part is
*the rest of the code*, i.e. the stuff in entry_64.S.

Also, the code is butt-ugly at the moment.  Aesthetics have not been
dealt with.

-hpa



Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Andrew Lutomirski
On Tue, Apr 22, 2014 at 7:46 AM, Borislav Petkov  wrote:
> On Tue, Apr 22, 2014 at 01:23:12PM +0200, Borislav Petkov wrote:
>> I wonder if it would be workable to use a bit in the espfix PGD to
>> denote that it has been initialized already... I hear, near NX there's
>> some room :-)
>
> Ok, I realized this won't work when I hit send... Oh well.
>
> Anyway, another dumb idea: have we considered making this lazy? I.e.,
> preallocate pages to fit the stack of NR_CPUS after smp init is done but
> not set up the percpu espfix stack. Only do that in espfix_fix_stack the
> first time we land there and haven't been set up yet on this cpu.
>
> This should cover the 1% out there who still use 16-bit segments; the
> rest simply don't use it and get to save themselves the PT-walk in
> start_secondary().
>
> Hmmm...

I'm going to try to do the math to see what's actually going on.

Each 4G slice contains 64kB of ministacks, which corresponds to 1024
ministacks.  Virtual addresses are divided up as:

12 bits (0..11): address within page.
9 bits (12..20): identifies the PTE within the level 1 directory
9 bits (21..29): identifies the level 1 directory (pmd?) within the
level 2 directory
9 bits (30..38): identifies the level 2 directory (pud) within the
level 3 directory

Critically, each 1024 CPUs can share the same level 1 directory --
there are just a bunch of copies of the same thing in there.
Similarly, they can share the same level 2 directory, and each slot in
that directory will point to the same level 1 directory.

For the level 3 directory, there is only one globally.  It needs 8
entries per 1024 CPUs.

I imagine there's a scalability problem here, too: it's okay if each
of a very large number of CPUs waits while shared structures are
allocated, but owners of big systems won't like it if they all
serialize on the way out.

So maybe it would make sense to refactor this into two separate
functions.  First, before we start the first non-boot CPU:

static pte_t *slice_pte_tables[NR_CPUS / 1024];
Allocate and initialize them all;

It might even make sense to do this at build time instead of run time.
I can't imagine that parallelizing this would provide any benefit
unless it were done *very* carefully and there were hundreds of
thousands of CPUs.  At worst, we're wasting 4 bytes per CPU not
present.

Then, for the per-CPU part, have one init-once structure (please tell
me the kernel has one of these) per 64 possible CPUs.  Each CPU will
make sure that its group of 64 cpus is initialized, using the init
once mechanism, and then it will set its percpu variable accordingly.

There are only 64 CPUs per slice, so mutexes may not be so bad here.

--Andy


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Borislav Petkov
On Tue, Apr 22, 2014 at 01:23:12PM +0200, Borislav Petkov wrote:
> I wonder if it would be workable to use a bit in the espfix PGD to
> denote that it has been initialized already... I hear, near NX there's
> some room :-)

Ok, I realized this won't work when I hit send... Oh well.

Anyway, another dumb idea: have we considered making this lazy? I.e.,
preallocate pages to fit the stack of NR_CPUS after smp init is done but
not set up the percpu espfix stack. Only do that in espfix_fix_stack the
first time we land there and haven't been set up yet on this cpu.

This should cover the 1% out there who still use 16-bit segments; the
rest simply don't use it and get to save themselves the PT-walk in
start_secondary().

Hmmm...

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.


Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Borislav Petkov
Just nitpicks below:

On Mon, Apr 21, 2014 at 03:47:52PM -0700, H. Peter Anvin wrote:
> This is a prototype of espfix for the 64-bit kernel.  espfix is a
> workaround for the architectural definition of IRET, which fails to
> restore bits [31:16] of %esp when returning to a 16-bit stack
> segment.  We have a workaround for the 32-bit kernel, but that
> implementation doesn't work for 64 bits.
> 
> The 64-bit implementation works like this:
> 
> Set up a ministack for each CPU, which is then mapped 65536 times
> using the page tables.  This implementation uses the second-to-last
> PGD slot for this; with a 64-byte espfix stack this is sufficient for
> 2^18 CPUs (currently we support a max of 2^13 CPUs.)

I wish we'd put this description in the code instead of in a commit
message as those can get lost in git history over time.

> 64 bytes appear to be sufficient, because NMI and #MC cause a task
> switch.
> 
> THIS IS A PROTOTYPE AND IS NOT COMPLETE.  We need to make sure all
> code paths that can interrupt userspace execute this code.
> Fortunately we never need to use the espfix stack for nested faults,
> so one per CPU is guaranteed to be safe.
> 
> Furthermore, this code adds unnecessary instructions to the common
> path.  For example, on exception entry we push %rdi, pop %rdi, and
> then save away %rdi.  Ideally we should do this in such a way that we
> avoid unnecessary swapgs, especially on the IRET path (the exception
> path is going to be very rare, and so is less critical.)
> 
> Putting this version out there for people to look at/laugh at/play
> with.
> 
> Signed-off-by: H. Peter Anvin 
> Link: http://lkml.kernel.org/r/tip-kicdm89kzw9lldryb1br9...@git.kernel.org
> Cc: Linus Torvalds 
> Cc: Ingo Molnar 
> Cc: Alexander van Heukelum 
> Cc: Andy Lutomirski 
> Cc: Konrad Rzeszutek Wilk 
> Cc: Boris Ostrovsky 
> Cc: Borislav Petkov 
> Cc: Arjan van de Ven 
> Cc: Brian Gerst 
> Cc: Alexandre Julliard 
> Cc: Andi Kleen 
> Cc: Thomas Gleixner 

...

> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 1e96c3628bf2..7cc01770bf21 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -58,6 +58,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  
>  /* Avoid __ASSEMBLER__'ifying  just for this.  */
> @@ -1040,8 +1041,16 @@ restore_args:
>   RESTORE_ARGS 1,8,1
>  
>  irq_return:
> + /*
> +  * Are we returning to the LDT?  Note: in 64-bit mode
> +  * SS:RSP on the exception stack is always valid.
> +  */
> + testb $4,(SS-RIP)(%rsp)
> + jnz irq_return_ldt
> +
> +irq_return_iret:
>   INTERRUPT_RETURN
> - _ASM_EXTABLE(irq_return, bad_iret)
> + _ASM_EXTABLE(irq_return_iret, bad_iret)
>  
>  #ifdef CONFIG_PARAVIRT
>  ENTRY(native_iret)
> @@ -1049,6 +1058,34 @@ ENTRY(native_iret)
>   _ASM_EXTABLE(native_iret, bad_iret)
>  #endif
>  
> +irq_return_ldt:
> + pushq_cfi %rcx
> + larl (CS-RIP+8)(%rsp), %ecx
> + jnz 1f  /* Invalid segment - will #GP at IRET time */
> + testl $0x0020, %ecx
> + jnz 1f  /* Returning to 64-bit mode */
> + larl (SS-RIP+8)(%rsp), %ecx
> + jnz 1f  /* Invalid segment - will #SS at IRET time */

You mean " ... will #GP at IRET time"? But you're right, you're looking
at SS :-)

> + testl $0x0040, %ecx
> + jnz 1f  /* Not a 16-bit stack segment */
> + pushq_cfi %rsi
> + pushq_cfi %rdi
> + SWAPGS
> + movq PER_CPU_VAR(espfix_stack),%rdi
> + movl (RSP-RIP+3*8)(%rsp),%esi
> + xorw %si,%si
> + orq %rsi,%rdi
> + movq %rsp,%rsi
> + movl $8,%ecx
> + rep;movsq
> + leaq -(8*8)(%rdi),%rsp
> + SWAPGS
> + popq_cfi %rdi
> + popq_cfi %rsi
> +1:
> + popq_cfi %rcx
> + jmp irq_return_iret
> +
>   .section .fixup,"ax"
>  bad_iret:
>   /*

...

> diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
> index 85126ccbdf6b..dc2d8afcafe9 100644
> --- a/arch/x86/kernel/head64.c
> +++ b/arch/x86/kernel/head64.c
> @@ -32,6 +32,7 @@
>   * Manage page tables very early on.
>   */
>  extern pgd_t early_level4_pgt[PTRS_PER_PGD];
> +extern pud_t espfix_pud_page[PTRS_PER_PUD];

I guess you don't need the "extern" here.

>  extern pmd_t early_dynamic_pgts[EARLY_DYNAMIC_PAGE_TABLES][PTRS_PER_PMD];
>  static unsigned int __initdata next_early_pgt = 2;
>  pmdval_t early_pmd_flags = __PAGE_KERNEL_LARGE & ~(_PAGE_GLOBAL | _PAGE_NX);
> diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
> index af1d14a9ebda..ebc987398923 100644
> --- a/arch/x86/kernel/ldt.c
> +++ b/arch/x86/kernel/ldt.c
> @@ -229,17 +229,6 @@ static int write_ldt(void __user *ptr, unsigned long bytecount, int oldmode)
>   }
>   }
>  
> - /*
> -  * On x86-64 we do not support 16-bit segments due to
> -  * IRET leaking the high bits of the kernel stack address.
> -  */
> -#ifdef CONFIG_X86_64
> - if (!ldt_info.seg_32bit) {
> - error = -EINVAL;
> 

Re: [PATCH] x86-64: espfix for 64-bit mode *PROTOTYPE*

2014-04-22 Thread Borislav Petkov
On Mon, Apr 21, 2014 at 06:53:36PM -0700, Andrew Lutomirski wrote:
> On Mon, Apr 21, 2014 at 6:47 PM, H. Peter Anvin  wrote:
> > Race condition (although with x86 being globally ordered, it probably can't 
> > actually happen.) The bitmask is probably the way to go.
> 
> Does the race matter?  In the worst case you take the lock
> unnecessarily.  But yes, the bitmask is easy.

I wonder if it would be workable to use a bit in the espfix PGD to
denote that it has been initialized already... I hear, near NX there's
some room :-)

-- 
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.

