Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-29 Thread Andy Lutomirski
On Mar 28, 2015 6:17 AM, "Denys Vlasenko"  wrote:
>
> On Sat, Mar 28, 2015 at 10:46 AM, Ingo Molnar  wrote:
> > * Denys Vlasenko  wrote:
> >> This is a C function. [...]
> >
> > Arguably that's a self-inflicted wound of uclibc: nothing keeps it
> > from taking advantage of the syscall ABI and avoiding the double
> > save/restores.
>
> It's not uclibc who calls write(), it's user program.
>
> IIRC uclibc can't make user program aware that write()
> is not clobbering registers.
>
> Even if it could do that via an __attribute__ somehow,
> it would be a violation of standards: write() is supposed to be
> an ordinary C function, users must be able to take its address
> and assign it to a pointer declared as
>
> ssize_t (*ptr)(int, const void *, size_t);
>
> Slapping __attribute__((different_abi))
> onto write() makes that impossible, the signature
> no longer matches.
>
> Let me go back from hypothetics to the actual situation.
> We can't do such a drastic ABI change now, it's too big.
>
> But is looks like we can relax ABI wrt saving flags,
> because it's broken for some time, and no one complains.
>
> If we say that arith flags and DF are not saved, it may
> mean that we don't need to kill ourselves with
> awkward and costly (popf is 20 cycles) EFLAGS
> manipulations.

Given that I just posted a patch that removes the popf by using
sysret, I'm unconvinced this is worthwhile.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-29 Thread Andy Lutomirski
On Mar 28, 2015 6:17 AM, Denys Vlasenko vda.li...@googlemail.com wrote:

 On Sat, Mar 28, 2015 at 10:46 AM, Ingo Molnar mi...@kernel.org wrote:
  * Denys Vlasenko vda.li...@googlemail.com wrote:
  This is a C function. [...]
 
  Arguably that's a self-inflicted wound of uclibc: nothing keeps it
  from taking advantage of the syscall ABI and avoiding the double
  save/restores.

 It's not uclibc who calls write(), it's user program.

 IIRC uclibc can't make user program aware that write()
 is not clobbering registers.

 Even if it could do that via an __attribute__ somehow,
 it would be a violation of standards: write() is supposed to be
 an ordinary C function, users must be able to take its address
 and assign it to a pointer declared as

 ssize_t (*ptr)(int, const void *, size_t);

 Slapping __attribute__((different_abi))
 onto write() makes that impossible, the signature
 no longer matches.

 Let me go back from hypothetics to the actual situation.
 We can't do such a drastic ABI change now, it's too big.

 But is looks like we can relax ABI wrt saving flags,
 because it's broken for some time, and no one complains.

 If we say that arith flags and DF are not saved, it may
 mean that we don't need to kill ourselves with
 awkward and costly (popf is 20 cycles) EFLAGS
 manipulations.

Given that I just posted a patch that removes the popf by using
sysret, I'm unconvinced this is worthwhile.

--Andy
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-28 Thread Denys Vlasenko
On Sat, Mar 28, 2015 at 10:46 AM, Ingo Molnar  wrote:
> * Denys Vlasenko  wrote:
>> This is a C function. [...]
>
> Arguably that's a self-inflicted wound of uclibc: nothing keeps it
> from taking advantage of the syscall ABI and avoiding the double
> save/restores.

It's not uclibc who calls write(), it's user program.

IIRC uclibc can't make user program aware that write()
is not clobbering registers.

Even if it could do that via an __attribute__ somehow,
it would be a violation of standards: write() is supposed to be
an ordinary C function, users must be able to take its address
and assign it to a pointer declared as

ssize_t (*ptr)(int, const void *, size_t);

Slapping __attribute__((different_abi))
onto write() makes that impossible, the signature
no longer matches.

Let me go back from hypothetics to the actual situation.
We can't do such a drastic ABI change now, it's too big.

But is looks like we can relax ABI wrt saving flags,
because it's broken for some time, and no one complains.

If we say that arith flags and DF are not saved, it may
mean that we don't need to kill ourselves with
awkward and costly (popf is 20 cycles) EFLAGS
manipulations.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-28 Thread Denys Vlasenko
On Sat, Mar 28, 2015 at 1:39 AM, Linus Torvalds
 wrote:
> What part of "don't leak kernel data" did you have trouble understanding?
>
> IOW, this is a *security* issue. Stop arguing for crazy shit.

We can zero the registers instead of saving/restoring them.
push/pop pair takes 1-2 cycles at best, whereas
on four-issue CPUs "xor reg,reg" has throughput of four
instructions at a time, often even not requiring
an execution unit.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-28 Thread Ingo Molnar

* Denys Vlasenko  wrote:

> On Fri, Mar 27, 2015 at 9:00 PM, Linus Torvalds
>  wrote:
> > On Fri, Mar 27, 2015 at 7:25 AM, Denys Vlasenko  wrote:
> >>
> >> Apparently, users *don't* depend on arithmetic flags
> >> to survive over syscall. They also okay with DF flag
> >> being cleared.
> >
> > Generally, users probably dont' care about many registers at all being
> > saved, but it's worth noting that the reason system calls save/restore
> > even caller-saved registers is at least partly in order to avoid any
> > kernel information leaks.
> >
> > I don't believe that user mode will ever reasonably care about the
> > arithmetic flags being changed, but at the same time I also don't it
> > is something we should ever consider a "feature" we should try to take
> > advantage of. Generally we should try to not mess with the flag state,
> > and I'd *much* rather make the rule be that all the system call return
> > paths restore flags as much as possible.
> 
> "We don't clobber anything" ABI has its appeal.
> OTOH, fulfilling ABI's promises has cost which hast to be paid
> on every syscall, regardless whether userspace needed it or not.
> 
> Example. This is the uclibc implementation of write():
> 
> 004acfc4 <__libc_write>:
>   4acfc4:   53  push   %rbx
>   4acfc5:   48 63 ffmovslq %edi,%rdi
>   4acfc8:   b8 01 00 00 00  mov$0x1,%eax
>   4acfcd:   0f 05   syscall
>   4acfcf:   48 89 c3mov%rax,%rbx
>   4acfd2:   48 81 fb 00 f0 ff ffcmp$0xf000,%rbx
>   4acfd9:   76 0f   jbe4acfea <__libc_write+0x26>
>   4acfdb:   e8 64 15 00 00  callq  4ae544 <__GI___errno_location>
>   4acfe0:   89 da   mov%ebx,%edx
>   4acfe2:   f7 da   neg%edx
>   4acfe4:   89 10   mov%edx,(%rax)
>   4acfe6:   48 83 c8 ff or $0x,%rax
>   4acfea:   5b  pop%rbx
>   4acfeb:   c3  retq
> 
> This is a C function. [...]

Arguably that's a self-inflicted wound of uclibc: nothing keeps it 
from taking advantage of the syscall ABI and avoiding the double 
save/restores.

> [...] Therefore any its caller assumes that C-clobbered registers 
> can be, indeed, clobbered here, so if that caller uses any of them, 
> it saves/restores them.
> 
> All efforts by kernel code to save/restore C-clobbered registers, 
> eight of them, are in vain. It's just useless work. Userspace does 
> not benefit from that effort.

That's true only in this particular uclibc case, where user-space 
decided to not take advantage of the save/restore property of the 
kernel.

> If our syscall ABI would say that those regs are not preserved, we 
> could have a bit faster syscalls. Any userspace code which really 
> had to have those registers preserved across a particular syscall, 
> could push/pop them itself.

We'd at minimum have to zero out the registers to avoid the 
information leak and at that point it's in fact faster to just 
save/restore in the syscall and allow user-space to take advantage of 
that, if it wishes to.

We cannot do it the other way around.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-28 Thread Borislav Petkov
On Fri, Mar 27, 2015 at 01:09:34PM -0700, Linus Torvalds wrote:
> I think AMD documented that the sti "interrupt shadow" shadows even
> NMI.

Hmm, official docs says this:

"15.21.5 Interrupt Shadows

The x86 architecture defines the notion of an interrupt shadow—a
single-instruction window during which interrupts are not recognized.
For example, the instruction after an STI instruction that sets
EFLAGS.IF (from zero to one) does not recognize interrupts or certain
debug traps."

And I think with "interrupts" this means maskable interrupts, i.e. not
NMI.

STI description itself:

"Sets the interrupt flag (IF) in the rFLAGS register to 1, thereby
allowing external interrupts received on the INTR input. Interrupts
received on the non-maskable interrupt (NMI) input are not affected by
this instruction."

but there might be some other documentation which I cannot find right
now.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-28 Thread Ingo Molnar

* Linus Torvalds  wrote:

> On Fri, Mar 27, 2015 at 1:53 PM, Brian Gerst  wrote:
> >> <-- IRQ.  Boom
> >
> > The sti will delay interrupts for one instruction, and that should include 
> > NMIs.
> 
> Nope. Intel explicitly documents the NMI case only for mov->ss and popss.
> 
> > The Intel SDM states for STI:
> > "The IF flag and the STI and CLI instructions do not prohibit the
> > generation of exceptions and NMI interrupts. NMI
> > interrupts (and SMIs) may be blocked for one macroinstruction following an 
> > STI."
> 
> Note the *may*. For movss and popss the software developer guide 
> explicitly says that NMI's are also blocked.
> 
> For plain sti, it seems to be dependent on microarchitecture.

Well, how about 'STI+HLT' aka safe_halt()?

If an NMI is allowed after that STI then we might lose wakeups, or in 
extreme cases (with full-dynticks) might lock up for a long time until 
the next irq comes, even with runnable tasks around?

Arguably that's a race condition that is not very easy to notice on a 
typical system.

Random hypothesis: maybe Intel just messed up their STI shadow in a 
single (possibly ancient) microarchitecture in some rare situations 
and figured it could fix it cheaply via updating the documentation to 
match the breakage, not via actually fixing the CPU?

Might be useful if someone from Intel could chime in.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-28 Thread Olivier Galibert
  Hi,

Beware that could be opening the door to information leaks for a very
small gain (most syscalls are not getuid).

Best,

  OG.


On Sat, Mar 28, 2015 at 1:34 AM, Denys Vlasenko
 wrote:
> On Fri, Mar 27, 2015 at 9:00 PM, Linus Torvalds
>  wrote:
>> On Fri, Mar 27, 2015 at 7:25 AM, Denys Vlasenko  wrote:
>>>
>>> Apparently, users *don't* depend on arithmetic flags
>>> to survive over syscall. They also okay with DF flag
>>> being cleared.
>>
>> Generally, users probably dont' care about many registers at all being
>> saved, but it's worth noting that the reason system calls save/restore
>> even caller-saved registers is at least partly in order to avoid any
>> kernel information leaks.
>>
>> I don't believe that user mode will ever reasonably care about the
>> arithmetic flags being changed, but at the same time I also don't it
>> is something we should ever consider a "feature" we should try to take
>> advantage of. Generally we should try to not mess with the flag state,
>> and I'd *much* rather make the rule be that all the system call return
>> paths restore flags as much as possible.
>
> "We don't clobber anything" ABI has its appeal.
> OTOH, fulfilling ABI's promises has cost which hast to be paid
> on every syscall, regardless whether userspace needed it or not.
>
> Example. This is the uclibc implementation of write():
>
> 004acfc4 <__libc_write>:
>   4acfc4:   53  push   %rbx
>   4acfc5:   48 63 ffmovslq %edi,%rdi
>   4acfc8:   b8 01 00 00 00  mov$0x1,%eax
>   4acfcd:   0f 05   syscall
>   4acfcf:   48 89 c3mov%rax,%rbx
>   4acfd2:   48 81 fb 00 f0 ff ffcmp$0xf000,%rbx
>   4acfd9:   76 0f   jbe4acfea <__libc_write+0x26>
>   4acfdb:   e8 64 15 00 00  callq  4ae544 <__GI___errno_location>
>   4acfe0:   89 da   mov%ebx,%edx
>   4acfe2:   f7 da   neg%edx
>   4acfe4:   89 10   mov%edx,(%rax)
>   4acfe6:   48 83 c8 ff or $0x,%rax
>   4acfea:   5b  pop%rbx
>   4acfeb:   c3  retq
>
> This is a C function. Therefore any its caller assumes that C-clobbered
> registers can be, indeed, clobbered here, so if that caller uses any
> of them, it saves/restores them.
>
> All efforts by kernel code to save/restore C-clobbered registers,
> eight of them, are in vain. It's just useless work. Userspace
> does not benefit from that effort.
>
> If our syscall ABI would say that those regs are not preserved,
> we could have a bit faster syscalls. Any userspace code which
> really had to have those registers preserved across a particular
> syscall, could push/pop them itself.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-28 Thread Ingo Molnar

* Linus Torvalds  wrote:

> On Fri, Mar 27, 2015 at 1:53 PM, Brian Gerst  wrote:
> >> <-- IRQ.  Boom
> >
> > The sti will delay interrupts for one instruction, and that should include 
> > NMIs.
> 
> Nope. Intel explicitly documents the NMI case only for mov->ss and popss.

Interestingly, I still see a STI 'NMI shadow' even on Intel CPUs.

Try something like this as root on a system with Intel CPUs (running 
recent tools/perf), with high-freq NMI sampling:

   perf top -F 1

execute a tight syscall loop on all CPUs (getppid() loop for example), 
and you'll see something like this in the profile:

Samples: 1M of event 'cycles', Event count (approx.): 377899840545  

  
Overhead  Shared Object  Symbol 

 ◆
  27.67%  libc-2.19.so   [.] __GI___getppid 

 ▒
  21.34%  [kernel]   [k] system_call

 ▒
  17.42%  [kernel]   [k] system_call_after_swapgs   

 ▒
  12.00%  [kernel]   [k] pid_vnr

 ▒
   7.49%  [kernel]   [k] sys_getppid

 ▒
   5.49%  [kernel]   [k] sysret_check   

 ▒
   5.34%  loop-getppid   [.] main   

 ▒
   1.56%  [kernel]   [k] system_call_fastpath   

 ▒
   0.36%  loop-getppid   [.] getppid@plt

 ▒

Note the very high sample count (due to sampling at 10 KHz).

Now if you hit '' twice to annotate system_call_after_swapgs 
you should see something like this (the live kernel image disassembly, 
annotated):

  system_call_after_swapgs  /proc/kcore 


   │ 
   │ 
   │ 
   │  Disassembly of section load0:  
   │ 
   │  8178b3f3 :  
  9.72 │8178b3f3:   mov%rsp,%gs:0xb040
 44.24 │8178b3fc:   mov%gs:0xb888,%rsp
  0.02 │8178b405:   sti  
   │8178b406:   nopl   0x0(%rax) 
 16.04 │8178b40d:   sub$0x50,%rsp
   │8178b411:   mov%rdi,0x40(%rsp)   
  6.51 │8178b416:   mov%rsi,0x38(%rsp)
  5.81 │8178b41b:   mov%rdx,0x30(%rsp)
  2.22 │8178b420:   mov%rax,0x20(%rsp)
  2.16 │8178b425:   mov%r8,0x18(%rsp)
  0.93 │8178b42a:   mov%r9,0x10(%rsp)
  1.57 │8178b42f:   mov%r10,0x8(%rsp)
  3.70 │8178b434:   mov%r11,(%rsp)
  2.27 │8178b438:   mov%rax,0x48(%rsp)

Note how the 7-byte NOP after the STI did not get a single profiler 
hit.

This is with the default '-e cycles', not '-e cycles:pp', so what we 
see as profiler hits should be the raw NMI entry RIPs.

Arguably this could be just the decoder hiding the NOP efficiently, 
I'll try to run some more experiments ...

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-28 Thread Denys Vlasenko
On Sat, Mar 28, 2015 at 10:46 AM, Ingo Molnar mi...@kernel.org wrote:
 * Denys Vlasenko vda.li...@googlemail.com wrote:
 This is a C function. [...]

 Arguably that's a self-inflicted wound of uclibc: nothing keeps it
 from taking advantage of the syscall ABI and avoiding the double
 save/restores.

It's not uclibc who calls write(), it's user program.

IIRC uclibc can't make user program aware that write()
is not clobbering registers.

Even if it could do that via an __attribute__ somehow,
it would be a violation of standards: write() is supposed to be
an ordinary C function, users must be able to take its address
and assign it to a pointer declared as

ssize_t (*ptr)(int, const void *, size_t);

Slapping __attribute__((different_abi))
onto write() makes that impossible, the signature
no longer matches.

Let me go back from hypothetics to the actual situation.
We can't do such a drastic ABI change now, it's too big.

But is looks like we can relax ABI wrt saving flags,
because it's broken for some time, and no one complains.

If we say that arith flags and DF are not saved, it may
mean that we don't need to kill ourselves with
awkward and costly (popf is 20 cycles) EFLAGS
manipulations.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-28 Thread Denys Vlasenko
On Sat, Mar 28, 2015 at 1:39 AM, Linus Torvalds
torva...@linux-foundation.org wrote:
 What part of don't leak kernel data did you have trouble understanding?

 IOW, this is a *security* issue. Stop arguing for crazy shit.

We can zero the registers instead of saving/restoring them.
push/pop pair takes 1-2 cycles at best, whereas
on four-issue CPUs xor reg,reg has throughput of four
instructions at a time, often even not requiring
an execution unit.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-28 Thread Ingo Molnar

* Linus Torvalds torva...@linux-foundation.org wrote:

 On Fri, Mar 27, 2015 at 1:53 PM, Brian Gerst brge...@gmail.com wrote:
  -- IRQ.  Boom
 
  The sti will delay interrupts for one instruction, and that should include 
  NMIs.
 
 Nope. Intel explicitly documents the NMI case only for mov-ss and popss.

Interestingly, I still see a STI 'NMI shadow' even on Intel CPUs.

Try something like this as root on a system with Intel CPUs (running 
recent tools/perf), with high-freq NMI sampling:

   perf top -F 1

execute a tight syscall loop on all CPUs (getppid() loop for example), 
and you'll see something like this in the profile:

Samples: 1M of event 'cycles', Event count (approx.): 377899840545  

  
Overhead  Shared Object  Symbol 

 ◆
  27.67%  libc-2.19.so   [.] __GI___getppid 

 ▒
  21.34%  [kernel]   [k] system_call

 ▒
  17.42%  [kernel]   [k] system_call_after_swapgs   

 ▒
  12.00%  [kernel]   [k] pid_vnr

 ▒
   7.49%  [kernel]   [k] sys_getppid

 ▒
   5.49%  [kernel]   [k] sysret_check   

 ▒
   5.34%  loop-getppid   [.] main   

 ▒
   1.56%  [kernel]   [k] system_call_fastpath   

 ▒
   0.36%  loop-getppid   [.] getppid@plt

 ▒

Note the very high sample count (due to sampling at 10 KHz).

Now if you hit 'Enter' twice to annotate system_call_after_swapgs 
you should see something like this (the live kernel image disassembly, 
annotated):

  system_call_after_swapgs  /proc/kcore 


   │ 
   │ 
   │ 
   │  Disassembly of section load0:  
   │ 
   │  8178b3f3 load0:  
  9.72 │8178b3f3:   mov%rsp,%gs:0xb040
 44.24 │8178b3fc:   mov%gs:0xb888,%rsp
  0.02 │8178b405:   sti  
   │8178b406:   nopl   0x0(%rax) 
 16.04 │8178b40d:   sub$0x50,%rsp
   │8178b411:   mov%rdi,0x40(%rsp)   
  6.51 │8178b416:   mov%rsi,0x38(%rsp)
  5.81 │8178b41b:   mov%rdx,0x30(%rsp)
  2.22 │8178b420:   mov%rax,0x20(%rsp)
  2.16 │8178b425:   mov%r8,0x18(%rsp)
  0.93 │8178b42a:   mov%r9,0x10(%rsp)
  1.57 │8178b42f:   mov%r10,0x8(%rsp)
  3.70 │8178b434:   mov%r11,(%rsp)
  2.27 │8178b438:   mov%rax,0x48(%rsp)

Note how the 7-byte NOP after the STI did not get a single profiler 
hit.

This is with the default '-e cycles', not '-e cycles:pp', so what we 
see as profiler hits should be the raw NMI entry RIPs.

Arguably this could be just the decoder hiding the NOP efficiently, 
I'll try to run some more experiments ...

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-28 Thread Olivier Galibert
  Hi,

Beware that could be opening the door to information leaks for a very
small gain (most syscalls are not getuid).

Best,

  OG.


On Sat, Mar 28, 2015 at 1:34 AM, Denys Vlasenko
vda.li...@googlemail.com wrote:
 On Fri, Mar 27, 2015 at 9:00 PM, Linus Torvalds
 torva...@linux-foundation.org wrote:
 On Fri, Mar 27, 2015 at 7:25 AM, Denys Vlasenko dvlas...@redhat.com wrote:

 Apparently, users *don't* depend on arithmetic flags
 to survive over syscall. They also okay with DF flag
 being cleared.

 Generally, users probably dont' care about many registers at all being
 saved, but it's worth noting that the reason system calls save/restore
 even caller-saved registers is at least partly in order to avoid any
 kernel information leaks.

 I don't believe that user mode will ever reasonably care about the
 arithmetic flags being changed, but at the same time I also don't it
 is something we should ever consider a feature we should try to take
 advantage of. Generally we should try to not mess with the flag state,
 and I'd *much* rather make the rule be that all the system call return
 paths restore flags as much as possible.

 We don't clobber anything ABI has its appeal.
 OTOH, fulfilling ABI's promises has cost which hast to be paid
 on every syscall, regardless whether userspace needed it or not.

 Example. This is the uclibc implementation of write():

 004acfc4 __libc_write:
   4acfc4:   53  push   %rbx
   4acfc5:   48 63 ffmovslq %edi,%rdi
   4acfc8:   b8 01 00 00 00  mov$0x1,%eax
   4acfcd:   0f 05   syscall
   4acfcf:   48 89 c3mov%rax,%rbx
   4acfd2:   48 81 fb 00 f0 ff ffcmp$0xf000,%rbx
   4acfd9:   76 0f   jbe4acfea __libc_write+0x26
   4acfdb:   e8 64 15 00 00  callq  4ae544 __GI___errno_location
   4acfe0:   89 da   mov%ebx,%edx
   4acfe2:   f7 da   neg%edx
   4acfe4:   89 10   mov%edx,(%rax)
   4acfe6:   48 83 c8 ff or $0x,%rax
   4acfea:   5b  pop%rbx
   4acfeb:   c3  retq

 This is a C function. Therefore any its caller assumes that C-clobbered
 registers can be, indeed, clobbered here, so if that caller uses any
 of them, it saves/restores them.

 All efforts by kernel code to save/restore C-clobbered registers,
 eight of them, are in vain. It's just useless work. Userspace
 does not benefit from that effort.

 If our syscall ABI would say that those regs are not preserved,
 we could have a bit faster syscalls. Any userspace code which
 really had to have those registers preserved across a particular
 syscall, could push/pop them itself.
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-28 Thread Borislav Petkov
On Fri, Mar 27, 2015 at 01:09:34PM -0700, Linus Torvalds wrote:
 I think AMD documented that the sti interrupt shadow shadows even
 NMI.

Hmm, official docs says this:

15.21.5 Interrupt Shadows

The x86 architecture defines the notion of an interrupt shadow—a
single-instruction window during which interrupts are not recognized.
For example, the instruction after an STI instruction that sets
EFLAGS.IF (from zero to one) does not recognize interrupts or certain
debug traps.

And I think with interrupts this means maskable interrupts, i.e. not
NMI.

STI description itself:

Sets the interrupt flag (IF) in the rFLAGS register to 1, thereby
allowing external interrupts received on the INTR input. Interrupts
received on the non-maskable interrupt (NMI) input are not affected by
this instruction.

but there might be some other documentation which I cannot find right
now.

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-28 Thread Ingo Molnar

* Denys Vlasenko vda.li...@googlemail.com wrote:

 On Fri, Mar 27, 2015 at 9:00 PM, Linus Torvalds
 torva...@linux-foundation.org wrote:
  On Fri, Mar 27, 2015 at 7:25 AM, Denys Vlasenko dvlas...@redhat.com wrote:
 
  Apparently, users *don't* depend on arithmetic flags
  to survive over syscall. They also okay with DF flag
  being cleared.
 
  Generally, users probably dont' care about many registers at all being
  saved, but it's worth noting that the reason system calls save/restore
  even caller-saved registers is at least partly in order to avoid any
  kernel information leaks.
 
  I don't believe that user mode will ever reasonably care about the
  arithmetic flags being changed, but at the same time I also don't it
  is something we should ever consider a feature we should try to take
  advantage of. Generally we should try to not mess with the flag state,
  and I'd *much* rather make the rule be that all the system call return
  paths restore flags as much as possible.
 
 We don't clobber anything ABI has its appeal.
 OTOH, fulfilling ABI's promises has cost which hast to be paid
 on every syscall, regardless whether userspace needed it or not.
 
 Example. This is the uclibc implementation of write():
 
 004acfc4 __libc_write:
   4acfc4:   53  push   %rbx
   4acfc5:   48 63 ffmovslq %edi,%rdi
   4acfc8:   b8 01 00 00 00  mov$0x1,%eax
   4acfcd:   0f 05   syscall
   4acfcf:   48 89 c3mov%rax,%rbx
   4acfd2:   48 81 fb 00 f0 ff ffcmp$0xf000,%rbx
   4acfd9:   76 0f   jbe4acfea __libc_write+0x26
   4acfdb:   e8 64 15 00 00  callq  4ae544 __GI___errno_location
   4acfe0:   89 da   mov%ebx,%edx
   4acfe2:   f7 da   neg%edx
   4acfe4:   89 10   mov%edx,(%rax)
   4acfe6:   48 83 c8 ff or $0x,%rax
   4acfea:   5b  pop%rbx
   4acfeb:   c3  retq
 
 This is a C function. [...]

Arguably that's a self-inflicted wound of uclibc: nothing keeps it 
from taking advantage of the syscall ABI and avoiding the double 
save/restores.

 [...] Therefore any its caller assumes that C-clobbered registers 
 can be, indeed, clobbered here, so if that caller uses any of them, 
 it saves/restores them.
 
 All efforts by kernel code to save/restore C-clobbered registers, 
 eight of them, are in vain. It's just useless work. Userspace does 
 not benefit from that effort.

That's true only in this particular uclibc case, where user-space 
decided to not take advantage of the save/restore property of the 
kernel.

 If our syscall ABI would say that those regs are not preserved, we 
 could have a bit faster syscalls. Any userspace code which really 
 had to have those registers preserved across a particular syscall, 
 could push/pop them itself.

We'd at minimum have to zero out the registers to avoid the 
information leak and at that point it's in fact faster to just 
save/restore in the syscall and allow user-space to take advantage of 
that, if it wishes to.

We cannot do it the other way around.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-28 Thread Ingo Molnar

* Linus Torvalds torva...@linux-foundation.org wrote:

 On Fri, Mar 27, 2015 at 1:53 PM, Brian Gerst brge...@gmail.com wrote:
  -- IRQ.  Boom
 
  The sti will delay interrupts for one instruction, and that should include 
  NMIs.
 
 Nope. Intel explicitly documents the NMI case only for mov-ss and popss.
 
  The Intel SDM states for STI:
  The IF flag and the STI and CLI instructions do not prohibit the
  generation of exceptions and NMI interrupts. NMI
  interrupts (and SMIs) may be blocked for one macroinstruction following an 
  STI.
 
 Note the *may*. For movss and popss the software developer guide 
 explicitly says that NMI's are also blocked.
 
 For plain sti, it seems to be dependent on microarchitecture.

Well, how about 'STI+HLT' aka safe_halt()?

If an NMI is allowed after that STI then we might lose wakeups, or in 
extreme cases (with full-dynticks) might lock up for a long time until 
the next irq comes, even with runnable tasks around?

Arguably that's a race condition that is not very easy to notice on a 
typical system.

Random hypothesis: maybe Intel just messed up their STI shadow in a 
single (possibly ancient) microarchitecture in some rare situations 
and figured it could fix it cheaply via updating the documentation to 
match the breakage, not via actually fixing the CPU?

Might be useful if someone from Intel could chime in.

Thanks,

Ingo
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-27 Thread Denys Vlasenko
On Fri, Mar 27, 2015 at 9:00 PM, Linus Torvalds
 wrote:
> On Fri, Mar 27, 2015 at 7:25 AM, Denys Vlasenko  wrote:
>>
>> Apparently, users *don't* depend on arithmetic flags
>> to survive over syscall. They also okay with DF flag
>> being cleared.
>
> Generally, users probably dont' care about many registers at all being
> saved, but it's worth noting that the reason system calls save/restore
> even caller-saved registers is at least partly in order to avoid any
> kernel information leaks.
>
> I don't believe that user mode will ever reasonably care about the
> arithmetic flags being changed, but at the same time I also don't it
> is something we should ever consider a "feature" we should try to take
> advantage of. Generally we should try to not mess with the flag state,
> and I'd *much* rather make the rule be that all the system call return
> paths restore flags as much as possible.

"We don't clobber anything" ABI has its appeal.
OTOH, fulfilling ABI's promises has cost which hast to be paid
on every syscall, regardless whether userspace needed it or not.

Example. This is the uclibc implementation of write():

004acfc4 <__libc_write>:
  4acfc4:   53  push   %rbx
  4acfc5:   48 63 ffmovslq %edi,%rdi
  4acfc8:   b8 01 00 00 00  mov$0x1,%eax
  4acfcd:   0f 05   syscall
  4acfcf:   48 89 c3mov%rax,%rbx
  4acfd2:   48 81 fb 00 f0 ff ffcmp$0xf000,%rbx
  4acfd9:   76 0f   jbe4acfea <__libc_write+0x26>
  4acfdb:   e8 64 15 00 00  callq  4ae544 <__GI___errno_location>
  4acfe0:   89 da   mov%ebx,%edx
  4acfe2:   f7 da   neg%edx
  4acfe4:   89 10   mov%edx,(%rax)
  4acfe6:   48 83 c8 ff or $0x,%rax
  4acfea:   5b  pop%rbx
  4acfeb:   c3  retq

This is a C function. Therefore any its caller assumes that C-clobbered
registers can be, indeed, clobbered here, so if that caller uses any
of them, it saves/restores them.

All efforts by kernel code to save/restore C-clobbered registers,
eight of them, are in vain. It's just useless work. Userspace
does not benefit from that effort.

If our syscall ABI would say that those regs are not preserved,
we could have a bit faster syscalls. Any userspace code which
really had to have those registers preserved across a particular
syscall, could push/pop them itself.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-27 Thread Andy Lutomirski
On Fri, Mar 27, 2015 at 1:31 PM, Linus Torvalds
 wrote:
> On Fri, Mar 27, 2015 at 1:16 PM, Andy Lutomirski  wrote:
>>
>> Does it matter on 32-bit kernels?  There's no swapgs, so IRQs should
>> still be safe, and we have a real stack pointer before sysexit.
>
> Fair enough.  On 32-bit, the only worry is the race between "return to
> user space" and "something set a thread flag", resulting in delayed
> signals and/or higher scheduling latency etc. So on 32-bit, the bug is
> much less of an issue, I agree.

Right, except for one nasty case: KVM user return notifiers.  It's
possible we'd re-enter user mode with some MSRs set wrong.  Yuck.

--Andy

>
> So yeah, using sysretl instead of sti+sysexit on 64-bit sounds more
> reasonable given the potential worry about sti+sysexit atomicity in
> the presense of nmi's.
>
>   Linus



-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-27 Thread Linus Torvalds
On Fri, Mar 27, 2015 at 1:53 PM, Brian Gerst  wrote:
>> <-- IRQ.  Boom
>
> The sti will delay interrupts for one instruction, and that should include 
> NMIs.

Nope. Intel explicitly documents the NMI case only for mov->ss and popss.

> The Intel SDM states for STI:
> "The IF flag and the STI and CLI instructions do not prohibit the
> generation of exceptions and NMI interrupts. NMI
> interrupts (and SMIs) may be blocked for one macroinstruction following an 
> STI."

Note the *may*. For movss and popss the software developer guide
explicitly says that NMI's are also blocked.

For plain sti, it seems to be dependent on microarchitecture.

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-27 Thread Brian Gerst
On Fri, Mar 27, 2015 at 2:37 PM, Andy Lutomirski  wrote:
> On Mar 27, 2015 7:26 AM, "Denys Vlasenko"  wrote:
>>
>> Hi,
>>
>> While running some tests I noticed that EFLAGS
>> is not saved across syscalls if I use 32-bit
>> userspace, use SYSENTER, and paravirt is active.
>>
>> Looking at the code, it's actually clear why that happens.
>>
>> /*
>>  * SYSENTER loads ss, rsp, cs, and rip from previously programmed MSRs.
>>  * IF and VM in rflags are cleared (IOW: interrupts are off).
>>  * SYSENTER does not save anything on the stack,
>>  * and does not save old rip (!!!) and rflags.
>>  */
>> ENTRY(ia32_sysenter_target)
>> SWAPGS_UNSAFE_STACK  <
>> movqPER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
>> ENABLE_INTERRUPTS(CLBR_NONE)
>>
>> movl%ebp, %ebp
>> movl%eax, %eax
>> movlASM_THREAD_INFO(TI_sysenter_return, %rsp, 0), %r10d
>>
>> /* Construct struct pt_regs on stack */
>> pushq_cfi   $__USER32_DS/* pt_regs->ss */
>> pushq_cfi   %rbp/* pt_regs->sp */
>> CFI_REL_OFFSET  rsp,0
>> pushfq_cfi  /* pt_regs->flags */
>>
>> The SWAPGS_UNSAFE_STACK, it's it involves paravirt callbacks,
>> will change EFLAGS, and it *can't* save/restore them -
>> there is no place to save it, since neither stack nor
>> PER_CPU() is usable at that point.
>>
>> Interestingly, *no one ever complained*!
>>
>> Apparently, users *don't* depend on arithmetic flags
>> to survive over syscall. They also okay with DF flag
>> being cleared.
>>
>> Let's go flag-by-flag.
>>
>> ID - probably no one depends on it
>> VIP,VIF,VM - v86 stuff, not supported in 64bit
>> AC - someone probably do use this
>> RF - should be cleared to 0
>> NT - iret via task gate, not supported in 64bit
>> IOPL - usually 00, sys_iopl() can change it
>> DF - according to C ABI, should be 0
>> IF - should be preserved (but almost always 1)
>> TF - should be preserved
>> arith flags - probably no one cares
>>
>> IOW. Bits to be preseved are only AC, IOPL, TF, and _maybe_
>> IF.
>>
>> AC and IOPL are preserved even with this paravirt quirk
>> because paravirt hooks do not mangle them.
>>
>> TF preservation and proper restoration is handled by
>> do_debug + syscall_trace_enter_phase2 + iret
>> combo.
>>
>> We unconditionally set IF. This is only a problem for applications
>> which use sys_iopl(3) and, disable IRQs in userspace and perform
>> syscalls. The set of such apps is probably empty.
>> (This "bug" exists even for non-paravirt case).
>>
>> So, formally, we have a bug: we do not preserve IF,
>> DF and arith flags.
>>
>> I'm proposing to use this opportunity to amend syscall ABI
>> to say that arith flags are not preserved across syscalls,
>> and DF can be cleared to 0 by syscalls (but can't be set to 1).
>> Evidently, it's broken for some time for some virtualized
>> setups and users are okay.
>
> I think I'd rather fix it.  I want to give x86_64 a sysenter stack
> like x86_32's, since AFAICT the only reason that #DF needs to use IST
> is because sysenter with TF set is the only way I can see that #DF
> could happen with an invalid stack.

What if RSP gets corrupted in the kernel?  That would cause a fault
that gets promoted to #DF, since the iret frame can't be pushed.  You
would at least get an oops out instead of a triple fault reset.

> Also, Houston, we have a bug, probably rootable, and probably damn
> near impossible to exploit without crashing your system:
>
> User does sysenter.  We end up in native_irq_enable_sysexit.  We do:
>
> swapgs
> sti
>
> <-- NMI here can happen on some (all?) cpus, returns successfully
> *with interrupts unmasked*
>
> <-- IRQ.  Boom

The sti will delay interrupts for one instruction, and that should include NMIs.

The Intel SDM states for STI:
"The IF flag and the STI and CLI instructions do not prohibit the
generation of exceptions and NMI interrupts. NMI
interrupts (and SMIs) may be blocked for one macroinstruction following an STI."

> My preferred fix would be to use sysretl instead of sysexit.  As far
> as I know, there are no 64-bit CPUs at all that don't support sysretl.

It would be nice to have one less return path.  I'm curious to know
why Intel doesn't support syscall from compatibility mode but does
support sysret to compatibility mode.

--
Brian Gerst
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-27 Thread Linus Torvalds
On Fri, Mar 27, 2015 at 1:16 PM, Andy Lutomirski  wrote:
>
> Does it matter on 32-bit kernels?  There's no swapgs, so IRQs should
> still be safe, and we have a real stack pointer before sysexit.

Fair enough.  On 32-bit, the only worry is the race between "return to
user space" and "something set a thread flag", resulting in delayed
signals and/or higher scheduling latency etc. So on 32-bit, the bug is
much less of an issue, I agree.

So yeah, using sysretl instead of sti+sysexit on 64-bit sounds more
reasonable given the potential worry about sti+sysexit atomicity in
the presense of nmi's.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-27 Thread Andy Lutomirski
On Fri, Mar 27, 2015 at 1:09 PM, Linus Torvalds
 wrote:
> On Fri, Mar 27, 2015 at 11:37 AM, Andy Lutomirski  wrote:
>>
>> User does sysenter.  We end up in native_irq_enable_sysexit.  We do:
>>
>> swapgs
>> sti
>>
>> <-- NMI here can happen on some (all?) cpus, returns successfully
>> *with interrupts unmasked*
>
> I think AMD documented that the sti "interrupt shadow" shadows even
> NMI. And for Intel, it definitely does not (but "mov ss" and "pop ss"
> masks even NMI for the next instruction - so the interrupt shadow is
> different for these cases).
>
> That said, it wasn't 100% clear that the "NMI return to immediate
> regular interrupt" can actually happen even on Intel. Iirc, there was
> some discussion about when the CPU actually tests the IRQ line after
> an 'iret'. It might end up testing the IF only after executing the
> instruction it returns to, so it's possible that the sequence
>
> .. interrupts disabled ..
> sti
> sysexit
>
> can not be interrupted by regular interrupts between the 'sti' and the
> 'sysexit' even if an NMI were to have happened between the two.
>
>> My preferred fix would be to use sysretl instead of sysexit.  As far
>> as I know, there are no 64-bit CPUs at all that don't support sysretl.
>
> That 'sti+sysexit" is used for the native 32-bit case too, not just
> the compat mode for x86-64. So I don't necessarily disagree with using
> sysretl, but..

Does it matter on 32-bit kernels?  There's no swapgs, so IRQs should
still be safe, and we have a real stack pointer before sysexit.

--Andy

>
>Linus



-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-27 Thread Linus Torvalds
On Fri, Mar 27, 2015 at 11:37 AM, Andy Lutomirski  wrote:
>
> User does sysenter.  We end up in native_irq_enable_sysexit.  We do:
>
> swapgs
> sti
>
> <-- NMI here can happen on some (all?) cpus, returns successfully
> *with interrupts unmasked*

I think AMD documented that the sti "interrupt shadow" shadows even
NMI. And for Intel, it definitely does not (but "mov ss" and "pop ss"
masks even NMI for the next instruction - so the interrupt shadow is
different for these cases).

That said, it wasn't 100% clear that the "NMI return to immediate
regular interrupt" can actually happen even on Intel. Iirc, there was
some discussion about when the CPU actually tests the IRQ line after
an 'iret'. It might end up testing the IF only after executing the
instruction it returns to, so it's possible that the sequence

.. interrupts disabled ..
sti
sysexit

can not be interrupted by regular interrupts between the 'sti' and the
'sysexit' even if an NMI were to have happened between the two.

> My preferred fix would be to use sysretl instead of sysexit.  As far
> as I know, there are no 64-bit CPUs at all that don't support sysretl.

That 'sti+sysexit" is used for the native 32-bit case too, not just
the compat mode for x86-64. So I don't necessarily disagree with using
sysretl, but..

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-27 Thread Linus Torvalds
On Fri, Mar 27, 2015 at 7:25 AM, Denys Vlasenko  wrote:
>
> Apparently, users *don't* depend on arithmetic flags
> to survive over syscall. They also okay with DF flag
> being cleared.

Generally, users probably dont' care about many registers at all being
saved, but it's worth noting that the reason system calls save/restore
even caller-saved registers is at least partly in order to avoid any
kernel information leaks.

I don't believe that user mode will ever reasonably care about the
arithmetic flags being changed, but at the same time I also don't it
is something we should ever consider a "feature" we should try to take
advantage of. Generally we should try to not mess with the flag state,
and I'd *much* rather make the rule be that all the system call return
paths restore flags as much as possible.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-27 Thread Andy Lutomirski
On Mar 27, 2015 7:26 AM, "Denys Vlasenko"  wrote:
>
> Hi,
>
> While running some tests I noticed that EFLAGS
> is not saved across syscalls if I use 32-bit
> userspace, use SYSENTER, and paravirt is active.
>
> Looking at the code, it's actually clear why that happens.
>
> /*
>  * SYSENTER loads ss, rsp, cs, and rip from previously programmed MSRs.
>  * IF and VM in rflags are cleared (IOW: interrupts are off).
>  * SYSENTER does not save anything on the stack,
>  * and does not save old rip (!!!) and rflags.
>  */
> ENTRY(ia32_sysenter_target)
> SWAPGS_UNSAFE_STACK  <
> movqPER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
> ENABLE_INTERRUPTS(CLBR_NONE)
>
> movl%ebp, %ebp
> movl%eax, %eax
> movlASM_THREAD_INFO(TI_sysenter_return, %rsp, 0), %r10d
>
> /* Construct struct pt_regs on stack */
> pushq_cfi   $__USER32_DS/* pt_regs->ss */
> pushq_cfi   %rbp/* pt_regs->sp */
> CFI_REL_OFFSET  rsp,0
> pushfq_cfi  /* pt_regs->flags */
>
> The SWAPGS_UNSAFE_STACK, it's it involves paravirt callbacks,
> will change EFLAGS, and it *can't* save/restore them -
> there is no place to save it, since neither stack nor
> PER_CPU() is usable at that point.
>
> Interestingly, *no one ever complained*!
>
> Apparently, users *don't* depend on arithmetic flags
> to survive over syscall. They also okay with DF flag
> being cleared.
>
> Let's go flag-by-flag.
>
> ID - probably no one depends on it
> VIP,VIF,VM - v86 stuff, not supported in 64bit
> AC - someone probably do use this
> RF - should be cleared to 0
> NT - iret via task gate, not supported in 64bit
> IOPL - usually 00, sys_iopl() can change it
> DF - according to C ABI, should be 0
> IF - should be preserved (but almost always 1)
> TF - should be preserved
> arith flags - probably no one cares
>
> IOW. Bits to be preseved are only AC, IOPL, TF, and _maybe_
> IF.
>
> AC and IOPL are preserved even with this paravirt quirk
> because paravirt hooks do not mangle them.
>
> TF preservation and proper restoration is handled by
> do_debug + syscall_trace_enter_phase2 + iret
> combo.
>
> We unconditionally set IF. This is only a problem for applications
> which use sys_iopl(3) and, disable IRQs in userspace and perform
> syscalls. The set of such apps is probably empty.
> (This "bug" exists even for non-paravirt case).
>
> So, formally, we have a bug: we do not preserve IF,
> DF and arith flags.
>
> I'm proposing to use this opportunity to amend syscall ABI
> to say that arith flags are not preserved across syscalls,
> and DF can be cleared to 0 by syscalls (but can't be set to 1).
> Evidently, it's broken for some time for some virtualized
> setups and users are okay.

I think I'd rather fix it.  I want to give x86_64 a sysenter stack
like x86_32's, since AFAICT the only reason that #DF needs to use IST
is because sysenter with TF set is the only way I can see that #DF
could happen with an invalid stack.

Also, Houston, we have a bug, probably rootable, and probably damn
near impossible to exploit without crashing your system:

User does sysenter.  We end up in native_irq_enable_sysexit.  We do:

swapgs
sti

<-- NMI here can happen on some (all?) cpus, returns successfully
*with interrupts unmasked*

<-- IRQ.  Boom

My preferred fix would be to use sysretl instead of sysexit.  As far
as I know, there are no 64-bit CPUs at all that don't support sysretl.

--Andy

>
> I'm not sure what to do with the "bug" of forcing IF=1.
> Fix it? Or also declare that syscalls can set IF=1?
> Do you think this is a legitimate userspace code?
>
> sys_iopl(3);
> cli;
> syscall();
> /* expects irqs still disabled */
>
> --
> vda
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-27 Thread Borislav Petkov
+ Linus.

On Fri, Mar 27, 2015 at 03:25:47PM +0100, Denys Vlasenko wrote:
> Hi,
> 
> While running some tests I noticed that EFLAGS
> is not saved across syscalls if I use 32-bit
> userspace, use SYSENTER, and paravirt is active.
> 
> Looking at the code, it's actually clear why that happens.
> 
> /*
>  * SYSENTER loads ss, rsp, cs, and rip from previously programmed MSRs.
>  * IF and VM in rflags are cleared (IOW: interrupts are off).
>  * SYSENTER does not save anything on the stack,
>  * and does not save old rip (!!!) and rflags.
>  */
> ENTRY(ia32_sysenter_target)
> SWAPGS_UNSAFE_STACK  <
> movqPER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
> ENABLE_INTERRUPTS(CLBR_NONE)
> 
> movl%ebp, %ebp
> movl%eax, %eax
> movlASM_THREAD_INFO(TI_sysenter_return, %rsp, 0), %r10d
> 
> /* Construct struct pt_regs on stack */
> pushq_cfi   $__USER32_DS/* pt_regs->ss */
> pushq_cfi   %rbp/* pt_regs->sp */
> CFI_REL_OFFSET  rsp,0
> pushfq_cfi  /* pt_regs->flags */
> 
> The SWAPGS_UNSAFE_STACK, it's it involves paravirt callbacks,
> will change EFLAGS, and it *can't* save/restore them -
> there is no place to save it, since neither stack nor
> PER_CPU() is usable at that point.
> 
> Interestingly, *no one ever complained*!
> 
> Apparently, users *don't* depend on arithmetic flags
> to survive over syscall. They also okay with DF flag
> being cleared.
> 
> Let's go flag-by-flag.
> 
> ID - probably no one depends on it

It is used as a toggle to detect CPUID support. Can a SYSENTER happen
while something toggles it? Probably...

> VIP,VIF,VM - v86 stuff, not supported in 64bit
> AC - someone probably do use this
> RF - should be cleared to 0
> NT - iret via task gate, not supported in 64bit
> IOPL - usually 00, sys_iopl() can change it
> DF - according to C ABI, should be 0
> IF - should be preserved (but almost always 1)
> TF - should be preserved
> arith flags - probably no one cares
> 
> IOW. Bits to be preseved are only AC, IOPL, TF, and _maybe_
> IF.
> 
> AC and IOPL are preserved even with this paravirt quirk
> because paravirt hooks do not mangle them.
> 
> TF preservation and proper restoration is handled by
>   do_debug + syscall_trace_enter_phase2 + iret
> combo.
> 
> We unconditionally set IF. This is only a problem for applications
> which use sys_iopl(3) and, disable IRQs in userspace and perform
> syscalls. The set of such apps is probably empty.
> (This "bug" exists even for non-paravirt case).
> 
> So, formally, we have a bug: we do not preserve IF,
> DF and arith flags.
> 
> I'm proposing to use this opportunity to amend syscall ABI
> to say that arith flags are not preserved across syscalls,
> and DF can be cleared to 0 by syscalls (but can't be set to 1).
> Evidently, it's broken for some time for some virtualized
> setups and users are okay.
> 
> I'm not sure what to do with the "bug" of forcing IF=1.
> Fix it? Or also declare that syscalls can set IF=1?
> Do you think this is a legitimate userspace code?
> 
>   sys_iopl(3);
>   cli;
>   syscall();
>   /* expects irqs still disabled */
> 
> -- 
> vda
> 

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


ia32_sysenter_target does not preserve EFLAGS

2015-03-27 Thread Denys Vlasenko
Hi,

While running some tests I noticed that EFLAGS
is not saved across syscalls if I use 32-bit
userspace, use SYSENTER, and paravirt is active.

Looking at the code, it's actually clear why that happens.

/*
 * SYSENTER loads ss, rsp, cs, and rip from previously programmed MSRs.
 * IF and VM in rflags are cleared (IOW: interrupts are off).
 * SYSENTER does not save anything on the stack,
 * and does not save old rip (!!!) and rflags.
 */
ENTRY(ia32_sysenter_target)
SWAPGS_UNSAFE_STACK  <
movqPER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
ENABLE_INTERRUPTS(CLBR_NONE)

movl%ebp, %ebp
movl%eax, %eax
movlASM_THREAD_INFO(TI_sysenter_return, %rsp, 0), %r10d

/* Construct struct pt_regs on stack */
pushq_cfi   $__USER32_DS/* pt_regs->ss */
pushq_cfi   %rbp/* pt_regs->sp */
CFI_REL_OFFSET  rsp,0
pushfq_cfi  /* pt_regs->flags */

The SWAPGS_UNSAFE_STACK, it's it involves paravirt callbacks,
will change EFLAGS, and it *can't* save/restore them -
there is no place to save it, since neither stack nor
PER_CPU() is usable at that point.

Interestingly, *no one ever complained*!

Apparently, users *don't* depend on arithmetic flags
to survive over syscall. They also okay with DF flag
being cleared.

Let's go flag-by-flag.

ID - probably no one depends on it
VIP,VIF,VM - v86 stuff, not supported in 64bit
AC - someone probably do use this
RF - should be cleared to 0
NT - iret via task gate, not supported in 64bit
IOPL - usually 00, sys_iopl() can change it
DF - according to C ABI, should be 0
IF - should be preserved (but almost always 1)
TF - should be preserved
arith flags - probably no one cares

IOW. Bits to be preseved are only AC, IOPL, TF, and _maybe_
IF.

AC and IOPL are preserved even with this paravirt quirk
because paravirt hooks do not mangle them.

TF preservation and proper restoration is handled by
do_debug + syscall_trace_enter_phase2 + iret
combo.

We unconditionally set IF. This is only a problem for applications
which use sys_iopl(3) and, disable IRQs in userspace and perform
syscalls. The set of such apps is probably empty.
(This "bug" exists even for non-paravirt case).

So, formally, we have a bug: we do not preserve IF,
DF and arith flags.

I'm proposing to use this opportunity to amend syscall ABI
to say that arith flags are not preserved across syscalls,
and DF can be cleared to 0 by syscalls (but can't be set to 1).
Evidently, it's broken for some time for some virtualized
setups and users are okay.

I'm not sure what to do with the "bug" of forcing IF=1.
Fix it? Or also declare that syscalls can set IF=1?
Do you think this is a legitimate userspace code?

sys_iopl(3);
cli;
syscall();
/* expects irqs still disabled */

-- 
vda
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


ia32_sysenter_target does not preserve EFLAGS

2015-03-27 Thread Denys Vlasenko
Hi,

While running some tests I noticed that EFLAGS
is not saved across syscalls if I use 32-bit
userspace, use SYSENTER, and paravirt is active.

Looking at the code, it's actually clear why that happens.

/*
 * SYSENTER loads ss, rsp, cs, and rip from previously programmed MSRs.
 * IF and VM in rflags are cleared (IOW: interrupts are off).
 * SYSENTER does not save anything on the stack,
 * and does not save old rip (!!!) and rflags.
 */
ENTRY(ia32_sysenter_target)
SWAPGS_UNSAFE_STACK  
movqPER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
ENABLE_INTERRUPTS(CLBR_NONE)

movl%ebp, %ebp
movl%eax, %eax
movlASM_THREAD_INFO(TI_sysenter_return, %rsp, 0), %r10d

/* Construct struct pt_regs on stack */
pushq_cfi   $__USER32_DS/* pt_regs-ss */
pushq_cfi   %rbp/* pt_regs-sp */
CFI_REL_OFFSET  rsp,0
pushfq_cfi  /* pt_regs-flags */

The SWAPGS_UNSAFE_STACK, it's it involves paravirt callbacks,
will change EFLAGS, and it *can't* save/restore them -
there is no place to save it, since neither stack nor
PER_CPU() is usable at that point.

Interestingly, *no one ever complained*!

Apparently, users *don't* depend on arithmetic flags
to survive over syscall. They also okay with DF flag
being cleared.

Let's go flag-by-flag.

ID - probably no one depends on it
VIP,VIF,VM - v86 stuff, not supported in 64bit
AC - someone probably do use this
RF - should be cleared to 0
NT - iret via task gate, not supported in 64bit
IOPL - usually 00, sys_iopl() can change it
DF - according to C ABI, should be 0
IF - should be preserved (but almost always 1)
TF - should be preserved
arith flags - probably no one cares

IOW. Bits to be preseved are only AC, IOPL, TF, and _maybe_
IF.

AC and IOPL are preserved even with this paravirt quirk
because paravirt hooks do not mangle them.

TF preservation and proper restoration is handled by
do_debug + syscall_trace_enter_phase2 + iret
combo.

We unconditionally set IF. This is only a problem for applications
which use sys_iopl(3) and, disable IRQs in userspace and perform
syscalls. The set of such apps is probably empty.
(This bug exists even for non-paravirt case).

So, formally, we have a bug: we do not preserve IF,
DF and arith flags.

I'm proposing to use this opportunity to amend syscall ABI
to say that arith flags are not preserved across syscalls,
and DF can be cleared to 0 by syscalls (but can't be set to 1).
Evidently, it's broken for some time for some virtualized
setups and users are okay.

I'm not sure what to do with the bug of forcing IF=1.
Fix it? Or also declare that syscalls can set IF=1?
Do you think this is a legitimate userspace code?

sys_iopl(3);
cli;
syscall();
/* expects irqs still disabled */

-- 
vda
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-27 Thread Borislav Petkov
+ Linus.

On Fri, Mar 27, 2015 at 03:25:47PM +0100, Denys Vlasenko wrote:
 Hi,
 
 While running some tests I noticed that EFLAGS
 is not saved across syscalls if I use 32-bit
 userspace, use SYSENTER, and paravirt is active.
 
 Looking at the code, it's actually clear why that happens.
 
 /*
  * SYSENTER loads ss, rsp, cs, and rip from previously programmed MSRs.
  * IF and VM in rflags are cleared (IOW: interrupts are off).
  * SYSENTER does not save anything on the stack,
  * and does not save old rip (!!!) and rflags.
  */
 ENTRY(ia32_sysenter_target)
 SWAPGS_UNSAFE_STACK  
 movqPER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
 ENABLE_INTERRUPTS(CLBR_NONE)
 
 movl%ebp, %ebp
 movl%eax, %eax
 movlASM_THREAD_INFO(TI_sysenter_return, %rsp, 0), %r10d
 
 /* Construct struct pt_regs on stack */
 pushq_cfi   $__USER32_DS/* pt_regs-ss */
 pushq_cfi   %rbp/* pt_regs-sp */
 CFI_REL_OFFSET  rsp,0
 pushfq_cfi  /* pt_regs-flags */
 
 The SWAPGS_UNSAFE_STACK, it's it involves paravirt callbacks,
 will change EFLAGS, and it *can't* save/restore them -
 there is no place to save it, since neither stack nor
 PER_CPU() is usable at that point.
 
 Interestingly, *no one ever complained*!
 
 Apparently, users *don't* depend on arithmetic flags
 to survive over syscall. They also okay with DF flag
 being cleared.
 
 Let's go flag-by-flag.
 
 ID - probably no one depends on it

It is used as a toggle to detect CPUID support. Can a SYSENTER happen
while something toggles it? Probably...

 VIP,VIF,VM - v86 stuff, not supported in 64bit
 AC - someone probably do use this
 RF - should be cleared to 0
 NT - iret via task gate, not supported in 64bit
 IOPL - usually 00, sys_iopl() can change it
 DF - according to C ABI, should be 0
 IF - should be preserved (but almost always 1)
 TF - should be preserved
 arith flags - probably no one cares
 
 IOW. Bits to be preseved are only AC, IOPL, TF, and _maybe_
 IF.
 
 AC and IOPL are preserved even with this paravirt quirk
 because paravirt hooks do not mangle them.
 
 TF preservation and proper restoration is handled by
   do_debug + syscall_trace_enter_phase2 + iret
 combo.
 
 We unconditionally set IF. This is only a problem for applications
 which use sys_iopl(3) and, disable IRQs in userspace and perform
 syscalls. The set of such apps is probably empty.
 (This bug exists even for non-paravirt case).
 
 So, formally, we have a bug: we do not preserve IF,
 DF and arith flags.
 
 I'm proposing to use this opportunity to amend syscall ABI
 to say that arith flags are not preserved across syscalls,
 and DF can be cleared to 0 by syscalls (but can't be set to 1).
 Evidently, it's broken for some time for some virtualized
 setups and users are okay.
 
 I'm not sure what to do with the bug of forcing IF=1.
 Fix it? Or also declare that syscalls can set IF=1?
 Do you think this is a legitimate userspace code?
 
   sys_iopl(3);
   cli;
   syscall();
   /* expects irqs still disabled */
 
 -- 
 vda
 

-- 
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-27 Thread Linus Torvalds
On Fri, Mar 27, 2015 at 7:25 AM, Denys Vlasenko dvlas...@redhat.com wrote:

 Apparently, users *don't* depend on arithmetic flags
 to survive over syscall. They also okay with DF flag
 being cleared.

Generally, users probably dont' care about many registers at all being
saved, but it's worth noting that the reason system calls save/restore
even caller-saved registers is at least partly in order to avoid any
kernel information leaks.

I don't believe that user mode will ever reasonably care about the
arithmetic flags being changed, but at the same time I also don't it
is something we should ever consider a feature we should try to take
advantage of. Generally we should try to not mess with the flag state,
and I'd *much* rather make the rule be that all the system call return
paths restore flags as much as possible.

  Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-27 Thread Linus Torvalds
On Fri, Mar 27, 2015 at 11:37 AM, Andy Lutomirski l...@amacapital.net wrote:

 User does sysenter.  We end up in native_irq_enable_sysexit.  We do:

 swapgs
 sti

 -- NMI here can happen on some (all?) cpus, returns successfully
 *with interrupts unmasked*

I think AMD documented that the sti interrupt shadow shadows even
NMI. And for Intel, it definitely does not (but mov ss and pop ss
masks even NMI for the next instruction - so the interrupt shadow is
different for these cases).

That said, it wasn't 100% clear that the NMI return to immediate
regular interrupt can actually happen even on Intel. Iirc, there was
some discussion about when the CPU actually tests the IRQ line after
an 'iret'. It might end up testing the IF only after executing the
instruction it returns to, so it's possible that the sequence

.. interrupts disabled ..
sti
sysexit

can not be interrupted by regular interrupts between the 'sti' and the
'sysexit' even if an NMI were to have happened between the two.

 My preferred fix would be to use sysretl instead of sysexit.  As far
 as I know, there are no 64-bit CPUs at all that don't support sysretl.

That 'sti+sysexit is used for the native 32-bit case too, not just
the compat mode for x86-64. So I don't necessarily disagree with using
sysretl, but..

   Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-27 Thread Linus Torvalds
On Fri, Mar 27, 2015 at 1:16 PM, Andy Lutomirski l...@amacapital.net wrote:

 Does it matter on 32-bit kernels?  There's no swapgs, so IRQs should
 still be safe, and we have a real stack pointer before sysexit.

Fair enough.  On 32-bit, the only worry is the race between return to
user space and something set a thread flag, resulting in delayed
signals and/or higher scheduling latency etc. So on 32-bit, the bug is
much less of an issue, I agree.

So yeah, using sysretl instead of sti+sysexit on 64-bit sounds more
reasonable given the potential worry about sti+sysexit atomicity in
the presense of nmi's.

  Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-27 Thread Andy Lutomirski
On Fri, Mar 27, 2015 at 1:09 PM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Fri, Mar 27, 2015 at 11:37 AM, Andy Lutomirski l...@amacapital.net wrote:

 User does sysenter.  We end up in native_irq_enable_sysexit.  We do:

 swapgs
 sti

 -- NMI here can happen on some (all?) cpus, returns successfully
 *with interrupts unmasked*

 I think AMD documented that the sti interrupt shadow shadows even
 NMI. And for Intel, it definitely does not (but mov ss and pop ss
 masks even NMI for the next instruction - so the interrupt shadow is
 different for these cases).

 That said, it wasn't 100% clear that the NMI return to immediate
 regular interrupt can actually happen even on Intel. Iirc, there was
 some discussion about when the CPU actually tests the IRQ line after
 an 'iret'. It might end up testing the IF only after executing the
 instruction it returns to, so it's possible that the sequence

 .. interrupts disabled ..
 sti
 sysexit

 can not be interrupted by regular interrupts between the 'sti' and the
 'sysexit' even if an NMI were to have happened between the two.

 My preferred fix would be to use sysretl instead of sysexit.  As far
 as I know, there are no 64-bit CPUs at all that don't support sysretl.

 That 'sti+sysexit is used for the native 32-bit case too, not just
 the compat mode for x86-64. So I don't necessarily disagree with using
 sysretl, but..

Does it matter on 32-bit kernels?  There's no swapgs, so IRQs should
still be safe, and we have a real stack pointer before sysexit.

--Andy


Linus



-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-27 Thread Andy Lutomirski
On Mar 27, 2015 7:26 AM, Denys Vlasenko dvlas...@redhat.com wrote:

 Hi,

 While running some tests I noticed that EFLAGS
 is not saved across syscalls if I use 32-bit
 userspace, use SYSENTER, and paravirt is active.

 Looking at the code, it's actually clear why that happens.

 /*
  * SYSENTER loads ss, rsp, cs, and rip from previously programmed MSRs.
  * IF and VM in rflags are cleared (IOW: interrupts are off).
  * SYSENTER does not save anything on the stack,
  * and does not save old rip (!!!) and rflags.
  */
 ENTRY(ia32_sysenter_target)
 SWAPGS_UNSAFE_STACK  
 movqPER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
 ENABLE_INTERRUPTS(CLBR_NONE)

 movl%ebp, %ebp
 movl%eax, %eax
 movlASM_THREAD_INFO(TI_sysenter_return, %rsp, 0), %r10d

 /* Construct struct pt_regs on stack */
 pushq_cfi   $__USER32_DS/* pt_regs-ss */
 pushq_cfi   %rbp/* pt_regs-sp */
 CFI_REL_OFFSET  rsp,0
 pushfq_cfi  /* pt_regs-flags */

 The SWAPGS_UNSAFE_STACK, it's it involves paravirt callbacks,
 will change EFLAGS, and it *can't* save/restore them -
 there is no place to save it, since neither stack nor
 PER_CPU() is usable at that point.

 Interestingly, *no one ever complained*!

 Apparently, users *don't* depend on arithmetic flags
 to survive over syscall. They also okay with DF flag
 being cleared.

 Let's go flag-by-flag.

 ID - probably no one depends on it
 VIP,VIF,VM - v86 stuff, not supported in 64bit
 AC - someone probably do use this
 RF - should be cleared to 0
 NT - iret via task gate, not supported in 64bit
 IOPL - usually 00, sys_iopl() can change it
 DF - according to C ABI, should be 0
 IF - should be preserved (but almost always 1)
 TF - should be preserved
 arith flags - probably no one cares

 IOW. Bits to be preseved are only AC, IOPL, TF, and _maybe_
 IF.

 AC and IOPL are preserved even with this paravirt quirk
 because paravirt hooks do not mangle them.

 TF preservation and proper restoration is handled by
 do_debug + syscall_trace_enter_phase2 + iret
 combo.

 We unconditionally set IF. This is only a problem for applications
 which use sys_iopl(3) and, disable IRQs in userspace and perform
 syscalls. The set of such apps is probably empty.
 (This bug exists even for non-paravirt case).

 So, formally, we have a bug: we do not preserve IF,
 DF and arith flags.

 I'm proposing to use this opportunity to amend syscall ABI
 to say that arith flags are not preserved across syscalls,
 and DF can be cleared to 0 by syscalls (but can't be set to 1).
 Evidently, it's broken for some time for some virtualized
 setups and users are okay.

I think I'd rather fix it.  I want to give x86_64 a sysenter stack
like x86_32's, since AFAICT the only reason that #DF needs to use IST
is because sysenter with TF set is the only way I can see that #DF
could happen with an invalid stack.

Also, Houston, we have a bug, probably rootable, and probably damn
near impossible to exploit without crashing your system:

User does sysenter.  We end up in native_irq_enable_sysexit.  We do:

swapgs
sti

-- NMI here can happen on some (all?) cpus, returns successfully
*with interrupts unmasked*

-- IRQ.  Boom

My preferred fix would be to use sysretl instead of sysexit.  As far
as I know, there are no 64-bit CPUs at all that don't support sysretl.

--Andy


 I'm not sure what to do with the bug of forcing IF=1.
 Fix it? Or also declare that syscalls can set IF=1?
 Do you think this is a legitimate userspace code?

 sys_iopl(3);
 cli;
 syscall();
 /* expects irqs still disabled */

 --
 vda
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-27 Thread Linus Torvalds
On Fri, Mar 27, 2015 at 1:53 PM, Brian Gerst brge...@gmail.com wrote:
 -- IRQ.  Boom

 The sti will delay interrupts for one instruction, and that should include 
 NMIs.

Nope. Intel explicitly documents the NMI case only for mov-ss and popss.

 The Intel SDM states for STI:
 The IF flag and the STI and CLI instructions do not prohibit the
 generation of exceptions and NMI interrupts. NMI
 interrupts (and SMIs) may be blocked for one macroinstruction following an 
 STI.

Note the *may*. For movss and popss the software developer guide
explicitly says that NMI's are also blocked.

For plain sti, it seems to be dependent on microarchitecture.

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-27 Thread Andy Lutomirski
On Fri, Mar 27, 2015 at 1:31 PM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Fri, Mar 27, 2015 at 1:16 PM, Andy Lutomirski l...@amacapital.net wrote:

 Does it matter on 32-bit kernels?  There's no swapgs, so IRQs should
 still be safe, and we have a real stack pointer before sysexit.

 Fair enough.  On 32-bit, the only worry is the race between return to
 user space and something set a thread flag, resulting in delayed
 signals and/or higher scheduling latency etc. So on 32-bit, the bug is
 much less of an issue, I agree.

Right, except for one nasty case: KVM user return notifiers.  It's
possible we'd re-enter user mode with some MSRs set wrong.  Yuck.

--Andy


 So yeah, using sysretl instead of sti+sysexit on 64-bit sounds more
 reasonable given the potential worry about sti+sysexit atomicity in
 the presense of nmi's.

   Linus



-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-27 Thread Brian Gerst
On Fri, Mar 27, 2015 at 2:37 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Mar 27, 2015 7:26 AM, Denys Vlasenko dvlas...@redhat.com wrote:

 Hi,

 While running some tests I noticed that EFLAGS
 is not saved across syscalls if I use 32-bit
 userspace, use SYSENTER, and paravirt is active.

 Looking at the code, it's actually clear why that happens.

 /*
  * SYSENTER loads ss, rsp, cs, and rip from previously programmed MSRs.
  * IF and VM in rflags are cleared (IOW: interrupts are off).
  * SYSENTER does not save anything on the stack,
  * and does not save old rip (!!!) and rflags.
  */
 ENTRY(ia32_sysenter_target)
 SWAPGS_UNSAFE_STACK  
 movqPER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
 ENABLE_INTERRUPTS(CLBR_NONE)

 movl%ebp, %ebp
 movl%eax, %eax
 movlASM_THREAD_INFO(TI_sysenter_return, %rsp, 0), %r10d

 /* Construct struct pt_regs on stack */
 pushq_cfi   $__USER32_DS/* pt_regs-ss */
 pushq_cfi   %rbp/* pt_regs-sp */
 CFI_REL_OFFSET  rsp,0
 pushfq_cfi  /* pt_regs-flags */

 The SWAPGS_UNSAFE_STACK, it's it involves paravirt callbacks,
 will change EFLAGS, and it *can't* save/restore them -
 there is no place to save it, since neither stack nor
 PER_CPU() is usable at that point.

 Interestingly, *no one ever complained*!

 Apparently, users *don't* depend on arithmetic flags
 to survive over syscall. They also okay with DF flag
 being cleared.

 Let's go flag-by-flag.

 ID - probably no one depends on it
 VIP,VIF,VM - v86 stuff, not supported in 64bit
 AC - someone probably do use this
 RF - should be cleared to 0
 NT - iret via task gate, not supported in 64bit
 IOPL - usually 00, sys_iopl() can change it
 DF - according to C ABI, should be 0
 IF - should be preserved (but almost always 1)
 TF - should be preserved
 arith flags - probably no one cares

 IOW. Bits to be preseved are only AC, IOPL, TF, and _maybe_
 IF.

 AC and IOPL are preserved even with this paravirt quirk
 because paravirt hooks do not mangle them.

 TF preservation and proper restoration is handled by
 do_debug + syscall_trace_enter_phase2 + iret
 combo.

 We unconditionally set IF. This is only a problem for applications
 which use sys_iopl(3) and, disable IRQs in userspace and perform
 syscalls. The set of such apps is probably empty.
 (This bug exists even for non-paravirt case).

 So, formally, we have a bug: we do not preserve IF,
 DF and arith flags.

 I'm proposing to use this opportunity to amend syscall ABI
 to say that arith flags are not preserved across syscalls,
 and DF can be cleared to 0 by syscalls (but can't be set to 1).
 Evidently, it's broken for some time for some virtualized
 setups and users are okay.

 I think I'd rather fix it.  I want to give x86_64 a sysenter stack
 like x86_32's, since AFAICT the only reason that #DF needs to use IST
 is because sysenter with TF set is the only way I can see that #DF
 could happen with an invalid stack.

What if RSP gets corrupted in the kernel?  That would cause a fault
that gets promoted to #DF, since the iret frame can't be pushed.  You
would at least get an oops out instead of a triple fault reset.

 Also, Houston, we have a bug, probably rootable, and probably damn
 near impossible to exploit without crashing your system:

 User does sysenter.  We end up in native_irq_enable_sysexit.  We do:

 swapgs
 sti

 -- NMI here can happen on some (all?) cpus, returns successfully
 *with interrupts unmasked*

 -- IRQ.  Boom

The sti will delay interrupts for one instruction, and that should include NMIs.

The Intel SDM states for STI:
The IF flag and the STI and CLI instructions do not prohibit the
generation of exceptions and NMI interrupts. NMI
interrupts (and SMIs) may be blocked for one macroinstruction following an STI.

 My preferred fix would be to use sysretl instead of sysexit.  As far
 as I know, there are no 64-bit CPUs at all that don't support sysretl.

It would be nice to have one less return path.  I'm curious to know
why Intel doesn't support syscall from compatibility mode but does
support sysret to compatibility mode.

--
Brian Gerst
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: ia32_sysenter_target does not preserve EFLAGS

2015-03-27 Thread Denys Vlasenko
On Fri, Mar 27, 2015 at 9:00 PM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Fri, Mar 27, 2015 at 7:25 AM, Denys Vlasenko dvlas...@redhat.com wrote:

 Apparently, users *don't* depend on arithmetic flags
 to survive over syscall. They also okay with DF flag
 being cleared.

 Generally, users probably dont' care about many registers at all being
 saved, but it's worth noting that the reason system calls save/restore
 even caller-saved registers is at least partly in order to avoid any
 kernel information leaks.

 I don't believe that user mode will ever reasonably care about the
 arithmetic flags being changed, but at the same time I also don't it
 is something we should ever consider a feature we should try to take
 advantage of. Generally we should try to not mess with the flag state,
 and I'd *much* rather make the rule be that all the system call return
 paths restore flags as much as possible.

We don't clobber anything ABI has its appeal.
OTOH, fulfilling ABI's promises has cost which hast to be paid
on every syscall, regardless whether userspace needed it or not.

Example. This is the uclibc implementation of write():

004acfc4 __libc_write:
  4acfc4:   53  push   %rbx
  4acfc5:   48 63 ffmovslq %edi,%rdi
  4acfc8:   b8 01 00 00 00  mov$0x1,%eax
  4acfcd:   0f 05   syscall
  4acfcf:   48 89 c3mov%rax,%rbx
  4acfd2:   48 81 fb 00 f0 ff ffcmp$0xf000,%rbx
  4acfd9:   76 0f   jbe4acfea __libc_write+0x26
  4acfdb:   e8 64 15 00 00  callq  4ae544 __GI___errno_location
  4acfe0:   89 da   mov%ebx,%edx
  4acfe2:   f7 da   neg%edx
  4acfe4:   89 10   mov%edx,(%rax)
  4acfe6:   48 83 c8 ff or $0x,%rax
  4acfea:   5b  pop%rbx
  4acfeb:   c3  retq

This is a C function. Therefore any its caller assumes that C-clobbered
registers can be, indeed, clobbered here, so if that caller uses any
of them, it saves/restores them.

All efforts by kernel code to save/restore C-clobbered registers,
eight of them, are in vain. It's just useless work. Userspace
does not benefit from that effort.

If our syscall ABI would say that those regs are not preserved,
we could have a bit faster syscalls. Any userspace code which
really had to have those registers preserved across a particular
syscall, could push/pop them itself.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/