Re: ia32_sysenter_target does not preserve EFLAGS
On Mar 28, 2015 6:17 AM, "Denys Vlasenko" wrote: > > On Sat, Mar 28, 2015 at 10:46 AM, Ingo Molnar wrote: > > * Denys Vlasenko wrote: > >> This is a C function. [...] > > > > Arguably that's a self-inflicted wound of uclibc: nothing keeps it > > from taking advantage of the syscall ABI and avoiding the double > > save/restores. > > It's not uclibc who calls write(), it's user program. > > IIRC uclibc can't make user program aware that write() > is not clobbering registers. > > Even if it could do that via an __attribute__ somehow, > it would be a violation of standards: write() is supposed to be > an ordinary C function, users must be able to take its address > and assign it to a pointer declared as > > ssize_t (*ptr)(int, const void *, size_t); > > Slapping __attribute__((different_abi)) > onto write() makes that impossible, the signature > no longer matches. > > Let me go back from hypothetics to the actual situation. > We can't do such a drastic ABI change now, it's too big. > > But is looks like we can relax ABI wrt saving flags, > because it's broken for some time, and no one complains. > > If we say that arith flags and DF are not saved, it may > mean that we don't need to kill ourselves with > awkward and costly (popf is 20 cycles) EFLAGS > manipulations. Given that I just posted a patch that removes the popf by using sysret, I'm unconvinced this is worthwhile. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
On Mar 28, 2015 6:17 AM, Denys Vlasenko vda.li...@googlemail.com wrote: On Sat, Mar 28, 2015 at 10:46 AM, Ingo Molnar mi...@kernel.org wrote: * Denys Vlasenko vda.li...@googlemail.com wrote: This is a C function. [...] Arguably that's a self-inflicted wound of uclibc: nothing keeps it from taking advantage of the syscall ABI and avoiding the double save/restores. It's not uclibc who calls write(), it's user program. IIRC uclibc can't make user program aware that write() is not clobbering registers. Even if it could do that via an __attribute__ somehow, it would be a violation of standards: write() is supposed to be an ordinary C function, users must be able to take its address and assign it to a pointer declared as ssize_t (*ptr)(int, const void *, size_t); Slapping __attribute__((different_abi)) onto write() makes that impossible, the signature no longer matches. Let me go back from hypothetics to the actual situation. We can't do such a drastic ABI change now, it's too big. But is looks like we can relax ABI wrt saving flags, because it's broken for some time, and no one complains. If we say that arith flags and DF are not saved, it may mean that we don't need to kill ourselves with awkward and costly (popf is 20 cycles) EFLAGS manipulations. Given that I just posted a patch that removes the popf by using sysret, I'm unconvinced this is worthwhile. --Andy -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
On Sat, Mar 28, 2015 at 10:46 AM, Ingo Molnar wrote: > * Denys Vlasenko wrote: >> This is a C function. [...] > > Arguably that's a self-inflicted wound of uclibc: nothing keeps it > from taking advantage of the syscall ABI and avoiding the double > save/restores. It's not uclibc who calls write(), it's user program. IIRC uclibc can't make user program aware that write() is not clobbering registers. Even if it could do that via an __attribute__ somehow, it would be a violation of standards: write() is supposed to be an ordinary C function, users must be able to take its address and assign it to a pointer declared as ssize_t (*ptr)(int, const void *, size_t); Slapping __attribute__((different_abi)) onto write() makes that impossible, the signature no longer matches. Let me go back from hypothetics to the actual situation. We can't do such a drastic ABI change now, it's too big. But is looks like we can relax ABI wrt saving flags, because it's broken for some time, and no one complains. If we say that arith flags and DF are not saved, it may mean that we don't need to kill ourselves with awkward and costly (popf is 20 cycles) EFLAGS manipulations. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
On Sat, Mar 28, 2015 at 1:39 AM, Linus Torvalds wrote: > What part of "don't leak kernel data" did you have trouble understanding? > > IOW, this is a *security* issue. Stop arguing for crazy shit. We can zero the registers instead of saving/restoring them. push/pop pair takes 1-2 cycles at best, whereas on four-issue CPUs "xor reg,reg" has throughput of four instructions at a time, often even not requiring an execution unit. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
* Denys Vlasenko wrote: > On Fri, Mar 27, 2015 at 9:00 PM, Linus Torvalds > wrote: > > On Fri, Mar 27, 2015 at 7:25 AM, Denys Vlasenko wrote: > >> > >> Apparently, users *don't* depend on arithmetic flags > >> to survive over syscall. They also okay with DF flag > >> being cleared. > > > > Generally, users probably dont' care about many registers at all being > > saved, but it's worth noting that the reason system calls save/restore > > even caller-saved registers is at least partly in order to avoid any > > kernel information leaks. > > > > I don't believe that user mode will ever reasonably care about the > > arithmetic flags being changed, but at the same time I also don't it > > is something we should ever consider a "feature" we should try to take > > advantage of. Generally we should try to not mess with the flag state, > > and I'd *much* rather make the rule be that all the system call return > > paths restore flags as much as possible. > > "We don't clobber anything" ABI has its appeal. > OTOH, fulfilling ABI's promises has cost which hast to be paid > on every syscall, regardless whether userspace needed it or not. > > Example. This is the uclibc implementation of write(): > > 004acfc4 <__libc_write>: > 4acfc4: 53 push %rbx > 4acfc5: 48 63 ffmovslq %edi,%rdi > 4acfc8: b8 01 00 00 00 mov$0x1,%eax > 4acfcd: 0f 05 syscall > 4acfcf: 48 89 c3mov%rax,%rbx > 4acfd2: 48 81 fb 00 f0 ff ffcmp$0xf000,%rbx > 4acfd9: 76 0f jbe4acfea <__libc_write+0x26> > 4acfdb: e8 64 15 00 00 callq 4ae544 <__GI___errno_location> > 4acfe0: 89 da mov%ebx,%edx > 4acfe2: f7 da neg%edx > 4acfe4: 89 10 mov%edx,(%rax) > 4acfe6: 48 83 c8 ff or $0x,%rax > 4acfea: 5b pop%rbx > 4acfeb: c3 retq > > This is a C function. [...] Arguably that's a self-inflicted wound of uclibc: nothing keeps it from taking advantage of the syscall ABI and avoiding the double save/restores. > [...] Therefore any its caller assumes that C-clobbered registers > can be, indeed, clobbered here, so if that caller uses any of them, > it saves/restores them. > > All efforts by kernel code to save/restore C-clobbered registers, > eight of them, are in vain. It's just useless work. Userspace does > not benefit from that effort. That's true only in this particular uclibc case, where user-space decided to not take advantage of the save/restore property of the kernel. > If our syscall ABI would say that those regs are not preserved, we > could have a bit faster syscalls. Any userspace code which really > had to have those registers preserved across a particular syscall, > could push/pop them itself. We'd at minimum have to zero out the registers to avoid the information leak and at that point it's in fact faster to just save/restore in the syscall and allow user-space to take advantage of that, if it wishes to. We cannot do it the other way around. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
On Fri, Mar 27, 2015 at 01:09:34PM -0700, Linus Torvalds wrote: > I think AMD documented that the sti "interrupt shadow" shadows even > NMI. Hmm, official docs says this: "15.21.5 Interrupt Shadows The x86 architecture defines the notion of an interrupt shadow—a single-instruction window during which interrupts are not recognized. For example, the instruction after an STI instruction that sets EFLAGS.IF (from zero to one) does not recognize interrupts or certain debug traps." And I think with "interrupts" this means maskable interrupts, i.e. not NMI. STI description itself: "Sets the interrupt flag (IF) in the rFLAGS register to 1, thereby allowing external interrupts received on the INTR input. Interrupts received on the non-maskable interrupt (NMI) input are not affected by this instruction." but there might be some other documentation which I cannot find right now. -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
* Linus Torvalds wrote: > On Fri, Mar 27, 2015 at 1:53 PM, Brian Gerst wrote: > >> <-- IRQ. Boom > > > > The sti will delay interrupts for one instruction, and that should include > > NMIs. > > Nope. Intel explicitly documents the NMI case only for mov->ss and popss. > > > The Intel SDM states for STI: > > "The IF flag and the STI and CLI instructions do not prohibit the > > generation of exceptions and NMI interrupts. NMI > > interrupts (and SMIs) may be blocked for one macroinstruction following an > > STI." > > Note the *may*. For movss and popss the software developer guide > explicitly says that NMI's are also blocked. > > For plain sti, it seems to be dependent on microarchitecture. Well, how about 'STI+HLT' aka safe_halt()? If an NMI is allowed after that STI then we might lose wakeups, or in extreme cases (with full-dynticks) might lock up for a long time until the next irq comes, even with runnable tasks around? Arguably that's a race condition that is not very easy to notice on a typical system. Random hypothesis: maybe Intel just messed up their STI shadow in a single (possibly ancient) microarchitecture in some rare situations and figured it could fix it cheaply via updating the documentation to match the breakage, not via actually fixing the CPU? Might be useful if someone from Intel could chime in. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
Hi, Beware that could be opening the door to information leaks for a very small gain (most syscalls are not getuid). Best, OG. On Sat, Mar 28, 2015 at 1:34 AM, Denys Vlasenko wrote: > On Fri, Mar 27, 2015 at 9:00 PM, Linus Torvalds > wrote: >> On Fri, Mar 27, 2015 at 7:25 AM, Denys Vlasenko wrote: >>> >>> Apparently, users *don't* depend on arithmetic flags >>> to survive over syscall. They also okay with DF flag >>> being cleared. >> >> Generally, users probably dont' care about many registers at all being >> saved, but it's worth noting that the reason system calls save/restore >> even caller-saved registers is at least partly in order to avoid any >> kernel information leaks. >> >> I don't believe that user mode will ever reasonably care about the >> arithmetic flags being changed, but at the same time I also don't it >> is something we should ever consider a "feature" we should try to take >> advantage of. Generally we should try to not mess with the flag state, >> and I'd *much* rather make the rule be that all the system call return >> paths restore flags as much as possible. > > "We don't clobber anything" ABI has its appeal. > OTOH, fulfilling ABI's promises has cost which hast to be paid > on every syscall, regardless whether userspace needed it or not. > > Example. This is the uclibc implementation of write(): > > 004acfc4 <__libc_write>: > 4acfc4: 53 push %rbx > 4acfc5: 48 63 ffmovslq %edi,%rdi > 4acfc8: b8 01 00 00 00 mov$0x1,%eax > 4acfcd: 0f 05 syscall > 4acfcf: 48 89 c3mov%rax,%rbx > 4acfd2: 48 81 fb 00 f0 ff ffcmp$0xf000,%rbx > 4acfd9: 76 0f jbe4acfea <__libc_write+0x26> > 4acfdb: e8 64 15 00 00 callq 4ae544 <__GI___errno_location> > 4acfe0: 89 da mov%ebx,%edx > 4acfe2: f7 da neg%edx > 4acfe4: 89 10 mov%edx,(%rax) > 4acfe6: 48 83 c8 ff or $0x,%rax > 4acfea: 5b pop%rbx > 4acfeb: c3 retq > > This is a C function. Therefore any its caller assumes that C-clobbered > registers can be, indeed, clobbered here, so if that caller uses any > of them, it saves/restores them. > > All efforts by kernel code to save/restore C-clobbered registers, > eight of them, are in vain. It's just useless work. Userspace > does not benefit from that effort. > > If our syscall ABI would say that those regs are not preserved, > we could have a bit faster syscalls. Any userspace code which > really had to have those registers preserved across a particular > syscall, could push/pop them itself. > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
* Linus Torvalds wrote: > On Fri, Mar 27, 2015 at 1:53 PM, Brian Gerst wrote: > >> <-- IRQ. Boom > > > > The sti will delay interrupts for one instruction, and that should include > > NMIs. > > Nope. Intel explicitly documents the NMI case only for mov->ss and popss. Interestingly, I still see a STI 'NMI shadow' even on Intel CPUs. Try something like this as root on a system with Intel CPUs (running recent tools/perf), with high-freq NMI sampling: perf top -F 1 execute a tight syscall loop on all CPUs (getppid() loop for example), and you'll see something like this in the profile: Samples: 1M of event 'cycles', Event count (approx.): 377899840545 Overhead Shared Object Symbol ◆ 27.67% libc-2.19.so [.] __GI___getppid ▒ 21.34% [kernel] [k] system_call ▒ 17.42% [kernel] [k] system_call_after_swapgs ▒ 12.00% [kernel] [k] pid_vnr ▒ 7.49% [kernel] [k] sys_getppid ▒ 5.49% [kernel] [k] sysret_check ▒ 5.34% loop-getppid [.] main ▒ 1.56% [kernel] [k] system_call_fastpath ▒ 0.36% loop-getppid [.] getppid@plt ▒ Note the very high sample count (due to sampling at 10 KHz). Now if you hit '' twice to annotate system_call_after_swapgs you should see something like this (the live kernel image disassembly, annotated): system_call_after_swapgs /proc/kcore │ │ │ │ Disassembly of section load0: │ │ 8178b3f3 : 9.72 │8178b3f3: mov%rsp,%gs:0xb040 44.24 │8178b3fc: mov%gs:0xb888,%rsp 0.02 │8178b405: sti │8178b406: nopl 0x0(%rax) 16.04 │8178b40d: sub$0x50,%rsp │8178b411: mov%rdi,0x40(%rsp) 6.51 │8178b416: mov%rsi,0x38(%rsp) 5.81 │8178b41b: mov%rdx,0x30(%rsp) 2.22 │8178b420: mov%rax,0x20(%rsp) 2.16 │8178b425: mov%r8,0x18(%rsp) 0.93 │8178b42a: mov%r9,0x10(%rsp) 1.57 │8178b42f: mov%r10,0x8(%rsp) 3.70 │8178b434: mov%r11,(%rsp) 2.27 │8178b438: mov%rax,0x48(%rsp) Note how the 7-byte NOP after the STI did not get a single profiler hit. This is with the default '-e cycles', not '-e cycles:pp', so what we see as profiler hits should be the raw NMI entry RIPs. Arguably this could be just the decoder hiding the NOP efficiently, I'll try to run some more experiments ... Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
On Sat, Mar 28, 2015 at 10:46 AM, Ingo Molnar mi...@kernel.org wrote: * Denys Vlasenko vda.li...@googlemail.com wrote: This is a C function. [...] Arguably that's a self-inflicted wound of uclibc: nothing keeps it from taking advantage of the syscall ABI and avoiding the double save/restores. It's not uclibc who calls write(), it's user program. IIRC uclibc can't make user program aware that write() is not clobbering registers. Even if it could do that via an __attribute__ somehow, it would be a violation of standards: write() is supposed to be an ordinary C function, users must be able to take its address and assign it to a pointer declared as ssize_t (*ptr)(int, const void *, size_t); Slapping __attribute__((different_abi)) onto write() makes that impossible, the signature no longer matches. Let me go back from hypothetics to the actual situation. We can't do such a drastic ABI change now, it's too big. But is looks like we can relax ABI wrt saving flags, because it's broken for some time, and no one complains. If we say that arith flags and DF are not saved, it may mean that we don't need to kill ourselves with awkward and costly (popf is 20 cycles) EFLAGS manipulations. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
On Sat, Mar 28, 2015 at 1:39 AM, Linus Torvalds torva...@linux-foundation.org wrote: What part of don't leak kernel data did you have trouble understanding? IOW, this is a *security* issue. Stop arguing for crazy shit. We can zero the registers instead of saving/restoring them. push/pop pair takes 1-2 cycles at best, whereas on four-issue CPUs xor reg,reg has throughput of four instructions at a time, often even not requiring an execution unit. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
* Linus Torvalds torva...@linux-foundation.org wrote: On Fri, Mar 27, 2015 at 1:53 PM, Brian Gerst brge...@gmail.com wrote: -- IRQ. Boom The sti will delay interrupts for one instruction, and that should include NMIs. Nope. Intel explicitly documents the NMI case only for mov-ss and popss. Interestingly, I still see a STI 'NMI shadow' even on Intel CPUs. Try something like this as root on a system with Intel CPUs (running recent tools/perf), with high-freq NMI sampling: perf top -F 1 execute a tight syscall loop on all CPUs (getppid() loop for example), and you'll see something like this in the profile: Samples: 1M of event 'cycles', Event count (approx.): 377899840545 Overhead Shared Object Symbol ◆ 27.67% libc-2.19.so [.] __GI___getppid ▒ 21.34% [kernel] [k] system_call ▒ 17.42% [kernel] [k] system_call_after_swapgs ▒ 12.00% [kernel] [k] pid_vnr ▒ 7.49% [kernel] [k] sys_getppid ▒ 5.49% [kernel] [k] sysret_check ▒ 5.34% loop-getppid [.] main ▒ 1.56% [kernel] [k] system_call_fastpath ▒ 0.36% loop-getppid [.] getppid@plt ▒ Note the very high sample count (due to sampling at 10 KHz). Now if you hit 'Enter' twice to annotate system_call_after_swapgs you should see something like this (the live kernel image disassembly, annotated): system_call_after_swapgs /proc/kcore │ │ │ │ Disassembly of section load0: │ │ 8178b3f3 load0: 9.72 │8178b3f3: mov%rsp,%gs:0xb040 44.24 │8178b3fc: mov%gs:0xb888,%rsp 0.02 │8178b405: sti │8178b406: nopl 0x0(%rax) 16.04 │8178b40d: sub$0x50,%rsp │8178b411: mov%rdi,0x40(%rsp) 6.51 │8178b416: mov%rsi,0x38(%rsp) 5.81 │8178b41b: mov%rdx,0x30(%rsp) 2.22 │8178b420: mov%rax,0x20(%rsp) 2.16 │8178b425: mov%r8,0x18(%rsp) 0.93 │8178b42a: mov%r9,0x10(%rsp) 1.57 │8178b42f: mov%r10,0x8(%rsp) 3.70 │8178b434: mov%r11,(%rsp) 2.27 │8178b438: mov%rax,0x48(%rsp) Note how the 7-byte NOP after the STI did not get a single profiler hit. This is with the default '-e cycles', not '-e cycles:pp', so what we see as profiler hits should be the raw NMI entry RIPs. Arguably this could be just the decoder hiding the NOP efficiently, I'll try to run some more experiments ... Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
Hi, Beware that could be opening the door to information leaks for a very small gain (most syscalls are not getuid). Best, OG. On Sat, Mar 28, 2015 at 1:34 AM, Denys Vlasenko vda.li...@googlemail.com wrote: On Fri, Mar 27, 2015 at 9:00 PM, Linus Torvalds torva...@linux-foundation.org wrote: On Fri, Mar 27, 2015 at 7:25 AM, Denys Vlasenko dvlas...@redhat.com wrote: Apparently, users *don't* depend on arithmetic flags to survive over syscall. They also okay with DF flag being cleared. Generally, users probably dont' care about many registers at all being saved, but it's worth noting that the reason system calls save/restore even caller-saved registers is at least partly in order to avoid any kernel information leaks. I don't believe that user mode will ever reasonably care about the arithmetic flags being changed, but at the same time I also don't it is something we should ever consider a feature we should try to take advantage of. Generally we should try to not mess with the flag state, and I'd *much* rather make the rule be that all the system call return paths restore flags as much as possible. We don't clobber anything ABI has its appeal. OTOH, fulfilling ABI's promises has cost which hast to be paid on every syscall, regardless whether userspace needed it or not. Example. This is the uclibc implementation of write(): 004acfc4 __libc_write: 4acfc4: 53 push %rbx 4acfc5: 48 63 ffmovslq %edi,%rdi 4acfc8: b8 01 00 00 00 mov$0x1,%eax 4acfcd: 0f 05 syscall 4acfcf: 48 89 c3mov%rax,%rbx 4acfd2: 48 81 fb 00 f0 ff ffcmp$0xf000,%rbx 4acfd9: 76 0f jbe4acfea __libc_write+0x26 4acfdb: e8 64 15 00 00 callq 4ae544 __GI___errno_location 4acfe0: 89 da mov%ebx,%edx 4acfe2: f7 da neg%edx 4acfe4: 89 10 mov%edx,(%rax) 4acfe6: 48 83 c8 ff or $0x,%rax 4acfea: 5b pop%rbx 4acfeb: c3 retq This is a C function. Therefore any its caller assumes that C-clobbered registers can be, indeed, clobbered here, so if that caller uses any of them, it saves/restores them. All efforts by kernel code to save/restore C-clobbered registers, eight of them, are in vain. It's just useless work. Userspace does not benefit from that effort. If our syscall ABI would say that those regs are not preserved, we could have a bit faster syscalls. Any userspace code which really had to have those registers preserved across a particular syscall, could push/pop them itself. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
On Fri, Mar 27, 2015 at 01:09:34PM -0700, Linus Torvalds wrote: I think AMD documented that the sti interrupt shadow shadows even NMI. Hmm, official docs says this: 15.21.5 Interrupt Shadows The x86 architecture defines the notion of an interrupt shadow—a single-instruction window during which interrupts are not recognized. For example, the instruction after an STI instruction that sets EFLAGS.IF (from zero to one) does not recognize interrupts or certain debug traps. And I think with interrupts this means maskable interrupts, i.e. not NMI. STI description itself: Sets the interrupt flag (IF) in the rFLAGS register to 1, thereby allowing external interrupts received on the INTR input. Interrupts received on the non-maskable interrupt (NMI) input are not affected by this instruction. but there might be some other documentation which I cannot find right now. -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. -- -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
* Denys Vlasenko vda.li...@googlemail.com wrote: On Fri, Mar 27, 2015 at 9:00 PM, Linus Torvalds torva...@linux-foundation.org wrote: On Fri, Mar 27, 2015 at 7:25 AM, Denys Vlasenko dvlas...@redhat.com wrote: Apparently, users *don't* depend on arithmetic flags to survive over syscall. They also okay with DF flag being cleared. Generally, users probably dont' care about many registers at all being saved, but it's worth noting that the reason system calls save/restore even caller-saved registers is at least partly in order to avoid any kernel information leaks. I don't believe that user mode will ever reasonably care about the arithmetic flags being changed, but at the same time I also don't it is something we should ever consider a feature we should try to take advantage of. Generally we should try to not mess with the flag state, and I'd *much* rather make the rule be that all the system call return paths restore flags as much as possible. We don't clobber anything ABI has its appeal. OTOH, fulfilling ABI's promises has cost which hast to be paid on every syscall, regardless whether userspace needed it or not. Example. This is the uclibc implementation of write(): 004acfc4 __libc_write: 4acfc4: 53 push %rbx 4acfc5: 48 63 ffmovslq %edi,%rdi 4acfc8: b8 01 00 00 00 mov$0x1,%eax 4acfcd: 0f 05 syscall 4acfcf: 48 89 c3mov%rax,%rbx 4acfd2: 48 81 fb 00 f0 ff ffcmp$0xf000,%rbx 4acfd9: 76 0f jbe4acfea __libc_write+0x26 4acfdb: e8 64 15 00 00 callq 4ae544 __GI___errno_location 4acfe0: 89 da mov%ebx,%edx 4acfe2: f7 da neg%edx 4acfe4: 89 10 mov%edx,(%rax) 4acfe6: 48 83 c8 ff or $0x,%rax 4acfea: 5b pop%rbx 4acfeb: c3 retq This is a C function. [...] Arguably that's a self-inflicted wound of uclibc: nothing keeps it from taking advantage of the syscall ABI and avoiding the double save/restores. [...] Therefore any its caller assumes that C-clobbered registers can be, indeed, clobbered here, so if that caller uses any of them, it saves/restores them. All efforts by kernel code to save/restore C-clobbered registers, eight of them, are in vain. It's just useless work. Userspace does not benefit from that effort. That's true only in this particular uclibc case, where user-space decided to not take advantage of the save/restore property of the kernel. If our syscall ABI would say that those regs are not preserved, we could have a bit faster syscalls. Any userspace code which really had to have those registers preserved across a particular syscall, could push/pop them itself. We'd at minimum have to zero out the registers to avoid the information leak and at that point it's in fact faster to just save/restore in the syscall and allow user-space to take advantage of that, if it wishes to. We cannot do it the other way around. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
* Linus Torvalds torva...@linux-foundation.org wrote: On Fri, Mar 27, 2015 at 1:53 PM, Brian Gerst brge...@gmail.com wrote: -- IRQ. Boom The sti will delay interrupts for one instruction, and that should include NMIs. Nope. Intel explicitly documents the NMI case only for mov-ss and popss. The Intel SDM states for STI: The IF flag and the STI and CLI instructions do not prohibit the generation of exceptions and NMI interrupts. NMI interrupts (and SMIs) may be blocked for one macroinstruction following an STI. Note the *may*. For movss and popss the software developer guide explicitly says that NMI's are also blocked. For plain sti, it seems to be dependent on microarchitecture. Well, how about 'STI+HLT' aka safe_halt()? If an NMI is allowed after that STI then we might lose wakeups, or in extreme cases (with full-dynticks) might lock up for a long time until the next irq comes, even with runnable tasks around? Arguably that's a race condition that is not very easy to notice on a typical system. Random hypothesis: maybe Intel just messed up their STI shadow in a single (possibly ancient) microarchitecture in some rare situations and figured it could fix it cheaply via updating the documentation to match the breakage, not via actually fixing the CPU? Might be useful if someone from Intel could chime in. Thanks, Ingo -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
On Fri, Mar 27, 2015 at 9:00 PM, Linus Torvalds wrote: > On Fri, Mar 27, 2015 at 7:25 AM, Denys Vlasenko wrote: >> >> Apparently, users *don't* depend on arithmetic flags >> to survive over syscall. They also okay with DF flag >> being cleared. > > Generally, users probably dont' care about many registers at all being > saved, but it's worth noting that the reason system calls save/restore > even caller-saved registers is at least partly in order to avoid any > kernel information leaks. > > I don't believe that user mode will ever reasonably care about the > arithmetic flags being changed, but at the same time I also don't it > is something we should ever consider a "feature" we should try to take > advantage of. Generally we should try to not mess with the flag state, > and I'd *much* rather make the rule be that all the system call return > paths restore flags as much as possible. "We don't clobber anything" ABI has its appeal. OTOH, fulfilling ABI's promises has cost which hast to be paid on every syscall, regardless whether userspace needed it or not. Example. This is the uclibc implementation of write(): 004acfc4 <__libc_write>: 4acfc4: 53 push %rbx 4acfc5: 48 63 ffmovslq %edi,%rdi 4acfc8: b8 01 00 00 00 mov$0x1,%eax 4acfcd: 0f 05 syscall 4acfcf: 48 89 c3mov%rax,%rbx 4acfd2: 48 81 fb 00 f0 ff ffcmp$0xf000,%rbx 4acfd9: 76 0f jbe4acfea <__libc_write+0x26> 4acfdb: e8 64 15 00 00 callq 4ae544 <__GI___errno_location> 4acfe0: 89 da mov%ebx,%edx 4acfe2: f7 da neg%edx 4acfe4: 89 10 mov%edx,(%rax) 4acfe6: 48 83 c8 ff or $0x,%rax 4acfea: 5b pop%rbx 4acfeb: c3 retq This is a C function. Therefore any its caller assumes that C-clobbered registers can be, indeed, clobbered here, so if that caller uses any of them, it saves/restores them. All efforts by kernel code to save/restore C-clobbered registers, eight of them, are in vain. It's just useless work. Userspace does not benefit from that effort. If our syscall ABI would say that those regs are not preserved, we could have a bit faster syscalls. Any userspace code which really had to have those registers preserved across a particular syscall, could push/pop them itself. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
On Fri, Mar 27, 2015 at 1:31 PM, Linus Torvalds wrote: > On Fri, Mar 27, 2015 at 1:16 PM, Andy Lutomirski wrote: >> >> Does it matter on 32-bit kernels? There's no swapgs, so IRQs should >> still be safe, and we have a real stack pointer before sysexit. > > Fair enough. On 32-bit, the only worry is the race between "return to > user space" and "something set a thread flag", resulting in delayed > signals and/or higher scheduling latency etc. So on 32-bit, the bug is > much less of an issue, I agree. Right, except for one nasty case: KVM user return notifiers. It's possible we'd re-enter user mode with some MSRs set wrong. Yuck. --Andy > > So yeah, using sysretl instead of sti+sysexit on 64-bit sounds more > reasonable given the potential worry about sti+sysexit atomicity in > the presense of nmi's. > > Linus -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
On Fri, Mar 27, 2015 at 1:53 PM, Brian Gerst wrote: >> <-- IRQ. Boom > > The sti will delay interrupts for one instruction, and that should include > NMIs. Nope. Intel explicitly documents the NMI case only for mov->ss and popss. > The Intel SDM states for STI: > "The IF flag and the STI and CLI instructions do not prohibit the > generation of exceptions and NMI interrupts. NMI > interrupts (and SMIs) may be blocked for one macroinstruction following an > STI." Note the *may*. For movss and popss the software developer guide explicitly says that NMI's are also blocked. For plain sti, it seems to be dependent on microarchitecture. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
On Fri, Mar 27, 2015 at 2:37 PM, Andy Lutomirski wrote: > On Mar 27, 2015 7:26 AM, "Denys Vlasenko" wrote: >> >> Hi, >> >> While running some tests I noticed that EFLAGS >> is not saved across syscalls if I use 32-bit >> userspace, use SYSENTER, and paravirt is active. >> >> Looking at the code, it's actually clear why that happens. >> >> /* >> * SYSENTER loads ss, rsp, cs, and rip from previously programmed MSRs. >> * IF and VM in rflags are cleared (IOW: interrupts are off). >> * SYSENTER does not save anything on the stack, >> * and does not save old rip (!!!) and rflags. >> */ >> ENTRY(ia32_sysenter_target) >> SWAPGS_UNSAFE_STACK < >> movqPER_CPU_VAR(cpu_tss + TSS_sp0), %rsp >> ENABLE_INTERRUPTS(CLBR_NONE) >> >> movl%ebp, %ebp >> movl%eax, %eax >> movlASM_THREAD_INFO(TI_sysenter_return, %rsp, 0), %r10d >> >> /* Construct struct pt_regs on stack */ >> pushq_cfi $__USER32_DS/* pt_regs->ss */ >> pushq_cfi %rbp/* pt_regs->sp */ >> CFI_REL_OFFSET rsp,0 >> pushfq_cfi /* pt_regs->flags */ >> >> The SWAPGS_UNSAFE_STACK, it's it involves paravirt callbacks, >> will change EFLAGS, and it *can't* save/restore them - >> there is no place to save it, since neither stack nor >> PER_CPU() is usable at that point. >> >> Interestingly, *no one ever complained*! >> >> Apparently, users *don't* depend on arithmetic flags >> to survive over syscall. They also okay with DF flag >> being cleared. >> >> Let's go flag-by-flag. >> >> ID - probably no one depends on it >> VIP,VIF,VM - v86 stuff, not supported in 64bit >> AC - someone probably do use this >> RF - should be cleared to 0 >> NT - iret via task gate, not supported in 64bit >> IOPL - usually 00, sys_iopl() can change it >> DF - according to C ABI, should be 0 >> IF - should be preserved (but almost always 1) >> TF - should be preserved >> arith flags - probably no one cares >> >> IOW. Bits to be preseved are only AC, IOPL, TF, and _maybe_ >> IF. >> >> AC and IOPL are preserved even with this paravirt quirk >> because paravirt hooks do not mangle them. >> >> TF preservation and proper restoration is handled by >> do_debug + syscall_trace_enter_phase2 + iret >> combo. >> >> We unconditionally set IF. This is only a problem for applications >> which use sys_iopl(3) and, disable IRQs in userspace and perform >> syscalls. The set of such apps is probably empty. >> (This "bug" exists even for non-paravirt case). >> >> So, formally, we have a bug: we do not preserve IF, >> DF and arith flags. >> >> I'm proposing to use this opportunity to amend syscall ABI >> to say that arith flags are not preserved across syscalls, >> and DF can be cleared to 0 by syscalls (but can't be set to 1). >> Evidently, it's broken for some time for some virtualized >> setups and users are okay. > > I think I'd rather fix it. I want to give x86_64 a sysenter stack > like x86_32's, since AFAICT the only reason that #DF needs to use IST > is because sysenter with TF set is the only way I can see that #DF > could happen with an invalid stack. What if RSP gets corrupted in the kernel? That would cause a fault that gets promoted to #DF, since the iret frame can't be pushed. You would at least get an oops out instead of a triple fault reset. > Also, Houston, we have a bug, probably rootable, and probably damn > near impossible to exploit without crashing your system: > > User does sysenter. We end up in native_irq_enable_sysexit. We do: > > swapgs > sti > > <-- NMI here can happen on some (all?) cpus, returns successfully > *with interrupts unmasked* > > <-- IRQ. Boom The sti will delay interrupts for one instruction, and that should include NMIs. The Intel SDM states for STI: "The IF flag and the STI and CLI instructions do not prohibit the generation of exceptions and NMI interrupts. NMI interrupts (and SMIs) may be blocked for one macroinstruction following an STI." > My preferred fix would be to use sysretl instead of sysexit. As far > as I know, there are no 64-bit CPUs at all that don't support sysretl. It would be nice to have one less return path. I'm curious to know why Intel doesn't support syscall from compatibility mode but does support sysret to compatibility mode. -- Brian Gerst -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
On Fri, Mar 27, 2015 at 1:16 PM, Andy Lutomirski wrote: > > Does it matter on 32-bit kernels? There's no swapgs, so IRQs should > still be safe, and we have a real stack pointer before sysexit. Fair enough. On 32-bit, the only worry is the race between "return to user space" and "something set a thread flag", resulting in delayed signals and/or higher scheduling latency etc. So on 32-bit, the bug is much less of an issue, I agree. So yeah, using sysretl instead of sti+sysexit on 64-bit sounds more reasonable given the potential worry about sti+sysexit atomicity in the presense of nmi's. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
On Fri, Mar 27, 2015 at 1:09 PM, Linus Torvalds wrote: > On Fri, Mar 27, 2015 at 11:37 AM, Andy Lutomirski wrote: >> >> User does sysenter. We end up in native_irq_enable_sysexit. We do: >> >> swapgs >> sti >> >> <-- NMI here can happen on some (all?) cpus, returns successfully >> *with interrupts unmasked* > > I think AMD documented that the sti "interrupt shadow" shadows even > NMI. And for Intel, it definitely does not (but "mov ss" and "pop ss" > masks even NMI for the next instruction - so the interrupt shadow is > different for these cases). > > That said, it wasn't 100% clear that the "NMI return to immediate > regular interrupt" can actually happen even on Intel. Iirc, there was > some discussion about when the CPU actually tests the IRQ line after > an 'iret'. It might end up testing the IF only after executing the > instruction it returns to, so it's possible that the sequence > > .. interrupts disabled .. > sti > sysexit > > can not be interrupted by regular interrupts between the 'sti' and the > 'sysexit' even if an NMI were to have happened between the two. > >> My preferred fix would be to use sysretl instead of sysexit. As far >> as I know, there are no 64-bit CPUs at all that don't support sysretl. > > That 'sti+sysexit" is used for the native 32-bit case too, not just > the compat mode for x86-64. So I don't necessarily disagree with using > sysretl, but.. Does it matter on 32-bit kernels? There's no swapgs, so IRQs should still be safe, and we have a real stack pointer before sysexit. --Andy > >Linus -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
On Fri, Mar 27, 2015 at 11:37 AM, Andy Lutomirski wrote: > > User does sysenter. We end up in native_irq_enable_sysexit. We do: > > swapgs > sti > > <-- NMI here can happen on some (all?) cpus, returns successfully > *with interrupts unmasked* I think AMD documented that the sti "interrupt shadow" shadows even NMI. And for Intel, it definitely does not (but "mov ss" and "pop ss" masks even NMI for the next instruction - so the interrupt shadow is different for these cases). That said, it wasn't 100% clear that the "NMI return to immediate regular interrupt" can actually happen even on Intel. Iirc, there was some discussion about when the CPU actually tests the IRQ line after an 'iret'. It might end up testing the IF only after executing the instruction it returns to, so it's possible that the sequence .. interrupts disabled .. sti sysexit can not be interrupted by regular interrupts between the 'sti' and the 'sysexit' even if an NMI were to have happened between the two. > My preferred fix would be to use sysretl instead of sysexit. As far > as I know, there are no 64-bit CPUs at all that don't support sysretl. That 'sti+sysexit" is used for the native 32-bit case too, not just the compat mode for x86-64. So I don't necessarily disagree with using sysretl, but.. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
On Fri, Mar 27, 2015 at 7:25 AM, Denys Vlasenko wrote: > > Apparently, users *don't* depend on arithmetic flags > to survive over syscall. They also okay with DF flag > being cleared. Generally, users probably dont' care about many registers at all being saved, but it's worth noting that the reason system calls save/restore even caller-saved registers is at least partly in order to avoid any kernel information leaks. I don't believe that user mode will ever reasonably care about the arithmetic flags being changed, but at the same time I also don't it is something we should ever consider a "feature" we should try to take advantage of. Generally we should try to not mess with the flag state, and I'd *much* rather make the rule be that all the system call return paths restore flags as much as possible. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
On Mar 27, 2015 7:26 AM, "Denys Vlasenko" wrote: > > Hi, > > While running some tests I noticed that EFLAGS > is not saved across syscalls if I use 32-bit > userspace, use SYSENTER, and paravirt is active. > > Looking at the code, it's actually clear why that happens. > > /* > * SYSENTER loads ss, rsp, cs, and rip from previously programmed MSRs. > * IF and VM in rflags are cleared (IOW: interrupts are off). > * SYSENTER does not save anything on the stack, > * and does not save old rip (!!!) and rflags. > */ > ENTRY(ia32_sysenter_target) > SWAPGS_UNSAFE_STACK < > movqPER_CPU_VAR(cpu_tss + TSS_sp0), %rsp > ENABLE_INTERRUPTS(CLBR_NONE) > > movl%ebp, %ebp > movl%eax, %eax > movlASM_THREAD_INFO(TI_sysenter_return, %rsp, 0), %r10d > > /* Construct struct pt_regs on stack */ > pushq_cfi $__USER32_DS/* pt_regs->ss */ > pushq_cfi %rbp/* pt_regs->sp */ > CFI_REL_OFFSET rsp,0 > pushfq_cfi /* pt_regs->flags */ > > The SWAPGS_UNSAFE_STACK, it's it involves paravirt callbacks, > will change EFLAGS, and it *can't* save/restore them - > there is no place to save it, since neither stack nor > PER_CPU() is usable at that point. > > Interestingly, *no one ever complained*! > > Apparently, users *don't* depend on arithmetic flags > to survive over syscall. They also okay with DF flag > being cleared. > > Let's go flag-by-flag. > > ID - probably no one depends on it > VIP,VIF,VM - v86 stuff, not supported in 64bit > AC - someone probably do use this > RF - should be cleared to 0 > NT - iret via task gate, not supported in 64bit > IOPL - usually 00, sys_iopl() can change it > DF - according to C ABI, should be 0 > IF - should be preserved (but almost always 1) > TF - should be preserved > arith flags - probably no one cares > > IOW. Bits to be preseved are only AC, IOPL, TF, and _maybe_ > IF. > > AC and IOPL are preserved even with this paravirt quirk > because paravirt hooks do not mangle them. > > TF preservation and proper restoration is handled by > do_debug + syscall_trace_enter_phase2 + iret > combo. > > We unconditionally set IF. This is only a problem for applications > which use sys_iopl(3) and, disable IRQs in userspace and perform > syscalls. The set of such apps is probably empty. > (This "bug" exists even for non-paravirt case). > > So, formally, we have a bug: we do not preserve IF, > DF and arith flags. > > I'm proposing to use this opportunity to amend syscall ABI > to say that arith flags are not preserved across syscalls, > and DF can be cleared to 0 by syscalls (but can't be set to 1). > Evidently, it's broken for some time for some virtualized > setups and users are okay. I think I'd rather fix it. I want to give x86_64 a sysenter stack like x86_32's, since AFAICT the only reason that #DF needs to use IST is because sysenter with TF set is the only way I can see that #DF could happen with an invalid stack. Also, Houston, we have a bug, probably rootable, and probably damn near impossible to exploit without crashing your system: User does sysenter. We end up in native_irq_enable_sysexit. We do: swapgs sti <-- NMI here can happen on some (all?) cpus, returns successfully *with interrupts unmasked* <-- IRQ. Boom My preferred fix would be to use sysretl instead of sysexit. As far as I know, there are no 64-bit CPUs at all that don't support sysretl. --Andy > > I'm not sure what to do with the "bug" of forcing IF=1. > Fix it? Or also declare that syscalls can set IF=1? > Do you think this is a legitimate userspace code? > > sys_iopl(3); > cli; > syscall(); > /* expects irqs still disabled */ > > -- > vda -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
+ Linus. On Fri, Mar 27, 2015 at 03:25:47PM +0100, Denys Vlasenko wrote: > Hi, > > While running some tests I noticed that EFLAGS > is not saved across syscalls if I use 32-bit > userspace, use SYSENTER, and paravirt is active. > > Looking at the code, it's actually clear why that happens. > > /* > * SYSENTER loads ss, rsp, cs, and rip from previously programmed MSRs. > * IF and VM in rflags are cleared (IOW: interrupts are off). > * SYSENTER does not save anything on the stack, > * and does not save old rip (!!!) and rflags. > */ > ENTRY(ia32_sysenter_target) > SWAPGS_UNSAFE_STACK < > movqPER_CPU_VAR(cpu_tss + TSS_sp0), %rsp > ENABLE_INTERRUPTS(CLBR_NONE) > > movl%ebp, %ebp > movl%eax, %eax > movlASM_THREAD_INFO(TI_sysenter_return, %rsp, 0), %r10d > > /* Construct struct pt_regs on stack */ > pushq_cfi $__USER32_DS/* pt_regs->ss */ > pushq_cfi %rbp/* pt_regs->sp */ > CFI_REL_OFFSET rsp,0 > pushfq_cfi /* pt_regs->flags */ > > The SWAPGS_UNSAFE_STACK, it's it involves paravirt callbacks, > will change EFLAGS, and it *can't* save/restore them - > there is no place to save it, since neither stack nor > PER_CPU() is usable at that point. > > Interestingly, *no one ever complained*! > > Apparently, users *don't* depend on arithmetic flags > to survive over syscall. They also okay with DF flag > being cleared. > > Let's go flag-by-flag. > > ID - probably no one depends on it It is used as a toggle to detect CPUID support. Can a SYSENTER happen while something toggles it? Probably... > VIP,VIF,VM - v86 stuff, not supported in 64bit > AC - someone probably do use this > RF - should be cleared to 0 > NT - iret via task gate, not supported in 64bit > IOPL - usually 00, sys_iopl() can change it > DF - according to C ABI, should be 0 > IF - should be preserved (but almost always 1) > TF - should be preserved > arith flags - probably no one cares > > IOW. Bits to be preseved are only AC, IOPL, TF, and _maybe_ > IF. > > AC and IOPL are preserved even with this paravirt quirk > because paravirt hooks do not mangle them. > > TF preservation and proper restoration is handled by > do_debug + syscall_trace_enter_phase2 + iret > combo. > > We unconditionally set IF. This is only a problem for applications > which use sys_iopl(3) and, disable IRQs in userspace and perform > syscalls. The set of such apps is probably empty. > (This "bug" exists even for non-paravirt case). > > So, formally, we have a bug: we do not preserve IF, > DF and arith flags. > > I'm proposing to use this opportunity to amend syscall ABI > to say that arith flags are not preserved across syscalls, > and DF can be cleared to 0 by syscalls (but can't be set to 1). > Evidently, it's broken for some time for some virtualized > setups and users are okay. > > I'm not sure what to do with the "bug" of forcing IF=1. > Fix it? Or also declare that syscalls can set IF=1? > Do you think this is a legitimate userspace code? > > sys_iopl(3); > cli; > syscall(); > /* expects irqs still disabled */ > > -- > vda > -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. -- -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
ia32_sysenter_target does not preserve EFLAGS
Hi, While running some tests I noticed that EFLAGS is not saved across syscalls if I use 32-bit userspace, use SYSENTER, and paravirt is active. Looking at the code, it's actually clear why that happens. /* * SYSENTER loads ss, rsp, cs, and rip from previously programmed MSRs. * IF and VM in rflags are cleared (IOW: interrupts are off). * SYSENTER does not save anything on the stack, * and does not save old rip (!!!) and rflags. */ ENTRY(ia32_sysenter_target) SWAPGS_UNSAFE_STACK < movqPER_CPU_VAR(cpu_tss + TSS_sp0), %rsp ENABLE_INTERRUPTS(CLBR_NONE) movl%ebp, %ebp movl%eax, %eax movlASM_THREAD_INFO(TI_sysenter_return, %rsp, 0), %r10d /* Construct struct pt_regs on stack */ pushq_cfi $__USER32_DS/* pt_regs->ss */ pushq_cfi %rbp/* pt_regs->sp */ CFI_REL_OFFSET rsp,0 pushfq_cfi /* pt_regs->flags */ The SWAPGS_UNSAFE_STACK, it's it involves paravirt callbacks, will change EFLAGS, and it *can't* save/restore them - there is no place to save it, since neither stack nor PER_CPU() is usable at that point. Interestingly, *no one ever complained*! Apparently, users *don't* depend on arithmetic flags to survive over syscall. They also okay with DF flag being cleared. Let's go flag-by-flag. ID - probably no one depends on it VIP,VIF,VM - v86 stuff, not supported in 64bit AC - someone probably do use this RF - should be cleared to 0 NT - iret via task gate, not supported in 64bit IOPL - usually 00, sys_iopl() can change it DF - according to C ABI, should be 0 IF - should be preserved (but almost always 1) TF - should be preserved arith flags - probably no one cares IOW. Bits to be preseved are only AC, IOPL, TF, and _maybe_ IF. AC and IOPL are preserved even with this paravirt quirk because paravirt hooks do not mangle them. TF preservation and proper restoration is handled by do_debug + syscall_trace_enter_phase2 + iret combo. We unconditionally set IF. This is only a problem for applications which use sys_iopl(3) and, disable IRQs in userspace and perform syscalls. The set of such apps is probably empty. (This "bug" exists even for non-paravirt case). So, formally, we have a bug: we do not preserve IF, DF and arith flags. I'm proposing to use this opportunity to amend syscall ABI to say that arith flags are not preserved across syscalls, and DF can be cleared to 0 by syscalls (but can't be set to 1). Evidently, it's broken for some time for some virtualized setups and users are okay. I'm not sure what to do with the "bug" of forcing IF=1. Fix it? Or also declare that syscalls can set IF=1? Do you think this is a legitimate userspace code? sys_iopl(3); cli; syscall(); /* expects irqs still disabled */ -- vda -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
ia32_sysenter_target does not preserve EFLAGS
Hi, While running some tests I noticed that EFLAGS is not saved across syscalls if I use 32-bit userspace, use SYSENTER, and paravirt is active. Looking at the code, it's actually clear why that happens. /* * SYSENTER loads ss, rsp, cs, and rip from previously programmed MSRs. * IF and VM in rflags are cleared (IOW: interrupts are off). * SYSENTER does not save anything on the stack, * and does not save old rip (!!!) and rflags. */ ENTRY(ia32_sysenter_target) SWAPGS_UNSAFE_STACK movqPER_CPU_VAR(cpu_tss + TSS_sp0), %rsp ENABLE_INTERRUPTS(CLBR_NONE) movl%ebp, %ebp movl%eax, %eax movlASM_THREAD_INFO(TI_sysenter_return, %rsp, 0), %r10d /* Construct struct pt_regs on stack */ pushq_cfi $__USER32_DS/* pt_regs-ss */ pushq_cfi %rbp/* pt_regs-sp */ CFI_REL_OFFSET rsp,0 pushfq_cfi /* pt_regs-flags */ The SWAPGS_UNSAFE_STACK, it's it involves paravirt callbacks, will change EFLAGS, and it *can't* save/restore them - there is no place to save it, since neither stack nor PER_CPU() is usable at that point. Interestingly, *no one ever complained*! Apparently, users *don't* depend on arithmetic flags to survive over syscall. They also okay with DF flag being cleared. Let's go flag-by-flag. ID - probably no one depends on it VIP,VIF,VM - v86 stuff, not supported in 64bit AC - someone probably do use this RF - should be cleared to 0 NT - iret via task gate, not supported in 64bit IOPL - usually 00, sys_iopl() can change it DF - according to C ABI, should be 0 IF - should be preserved (but almost always 1) TF - should be preserved arith flags - probably no one cares IOW. Bits to be preseved are only AC, IOPL, TF, and _maybe_ IF. AC and IOPL are preserved even with this paravirt quirk because paravirt hooks do not mangle them. TF preservation and proper restoration is handled by do_debug + syscall_trace_enter_phase2 + iret combo. We unconditionally set IF. This is only a problem for applications which use sys_iopl(3) and, disable IRQs in userspace and perform syscalls. The set of such apps is probably empty. (This bug exists even for non-paravirt case). So, formally, we have a bug: we do not preserve IF, DF and arith flags. I'm proposing to use this opportunity to amend syscall ABI to say that arith flags are not preserved across syscalls, and DF can be cleared to 0 by syscalls (but can't be set to 1). Evidently, it's broken for some time for some virtualized setups and users are okay. I'm not sure what to do with the bug of forcing IF=1. Fix it? Or also declare that syscalls can set IF=1? Do you think this is a legitimate userspace code? sys_iopl(3); cli; syscall(); /* expects irqs still disabled */ -- vda -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
+ Linus. On Fri, Mar 27, 2015 at 03:25:47PM +0100, Denys Vlasenko wrote: Hi, While running some tests I noticed that EFLAGS is not saved across syscalls if I use 32-bit userspace, use SYSENTER, and paravirt is active. Looking at the code, it's actually clear why that happens. /* * SYSENTER loads ss, rsp, cs, and rip from previously programmed MSRs. * IF and VM in rflags are cleared (IOW: interrupts are off). * SYSENTER does not save anything on the stack, * and does not save old rip (!!!) and rflags. */ ENTRY(ia32_sysenter_target) SWAPGS_UNSAFE_STACK movqPER_CPU_VAR(cpu_tss + TSS_sp0), %rsp ENABLE_INTERRUPTS(CLBR_NONE) movl%ebp, %ebp movl%eax, %eax movlASM_THREAD_INFO(TI_sysenter_return, %rsp, 0), %r10d /* Construct struct pt_regs on stack */ pushq_cfi $__USER32_DS/* pt_regs-ss */ pushq_cfi %rbp/* pt_regs-sp */ CFI_REL_OFFSET rsp,0 pushfq_cfi /* pt_regs-flags */ The SWAPGS_UNSAFE_STACK, it's it involves paravirt callbacks, will change EFLAGS, and it *can't* save/restore them - there is no place to save it, since neither stack nor PER_CPU() is usable at that point. Interestingly, *no one ever complained*! Apparently, users *don't* depend on arithmetic flags to survive over syscall. They also okay with DF flag being cleared. Let's go flag-by-flag. ID - probably no one depends on it It is used as a toggle to detect CPUID support. Can a SYSENTER happen while something toggles it? Probably... VIP,VIF,VM - v86 stuff, not supported in 64bit AC - someone probably do use this RF - should be cleared to 0 NT - iret via task gate, not supported in 64bit IOPL - usually 00, sys_iopl() can change it DF - according to C ABI, should be 0 IF - should be preserved (but almost always 1) TF - should be preserved arith flags - probably no one cares IOW. Bits to be preseved are only AC, IOPL, TF, and _maybe_ IF. AC and IOPL are preserved even with this paravirt quirk because paravirt hooks do not mangle them. TF preservation and proper restoration is handled by do_debug + syscall_trace_enter_phase2 + iret combo. We unconditionally set IF. This is only a problem for applications which use sys_iopl(3) and, disable IRQs in userspace and perform syscalls. The set of such apps is probably empty. (This bug exists even for non-paravirt case). So, formally, we have a bug: we do not preserve IF, DF and arith flags. I'm proposing to use this opportunity to amend syscall ABI to say that arith flags are not preserved across syscalls, and DF can be cleared to 0 by syscalls (but can't be set to 1). Evidently, it's broken for some time for some virtualized setups and users are okay. I'm not sure what to do with the bug of forcing IF=1. Fix it? Or also declare that syscalls can set IF=1? Do you think this is a legitimate userspace code? sys_iopl(3); cli; syscall(); /* expects irqs still disabled */ -- vda -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply. -- -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
On Fri, Mar 27, 2015 at 7:25 AM, Denys Vlasenko dvlas...@redhat.com wrote: Apparently, users *don't* depend on arithmetic flags to survive over syscall. They also okay with DF flag being cleared. Generally, users probably dont' care about many registers at all being saved, but it's worth noting that the reason system calls save/restore even caller-saved registers is at least partly in order to avoid any kernel information leaks. I don't believe that user mode will ever reasonably care about the arithmetic flags being changed, but at the same time I also don't it is something we should ever consider a feature we should try to take advantage of. Generally we should try to not mess with the flag state, and I'd *much* rather make the rule be that all the system call return paths restore flags as much as possible. Linus -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
On Fri, Mar 27, 2015 at 11:37 AM, Andy Lutomirski l...@amacapital.net wrote: User does sysenter. We end up in native_irq_enable_sysexit. We do: swapgs sti -- NMI here can happen on some (all?) cpus, returns successfully *with interrupts unmasked* I think AMD documented that the sti interrupt shadow shadows even NMI. And for Intel, it definitely does not (but mov ss and pop ss masks even NMI for the next instruction - so the interrupt shadow is different for these cases). That said, it wasn't 100% clear that the NMI return to immediate regular interrupt can actually happen even on Intel. Iirc, there was some discussion about when the CPU actually tests the IRQ line after an 'iret'. It might end up testing the IF only after executing the instruction it returns to, so it's possible that the sequence .. interrupts disabled .. sti sysexit can not be interrupted by regular interrupts between the 'sti' and the 'sysexit' even if an NMI were to have happened between the two. My preferred fix would be to use sysretl instead of sysexit. As far as I know, there are no 64-bit CPUs at all that don't support sysretl. That 'sti+sysexit is used for the native 32-bit case too, not just the compat mode for x86-64. So I don't necessarily disagree with using sysretl, but.. Linus -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
On Fri, Mar 27, 2015 at 1:16 PM, Andy Lutomirski l...@amacapital.net wrote: Does it matter on 32-bit kernels? There's no swapgs, so IRQs should still be safe, and we have a real stack pointer before sysexit. Fair enough. On 32-bit, the only worry is the race between return to user space and something set a thread flag, resulting in delayed signals and/or higher scheduling latency etc. So on 32-bit, the bug is much less of an issue, I agree. So yeah, using sysretl instead of sti+sysexit on 64-bit sounds more reasonable given the potential worry about sti+sysexit atomicity in the presense of nmi's. Linus -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
On Fri, Mar 27, 2015 at 1:09 PM, Linus Torvalds torva...@linux-foundation.org wrote: On Fri, Mar 27, 2015 at 11:37 AM, Andy Lutomirski l...@amacapital.net wrote: User does sysenter. We end up in native_irq_enable_sysexit. We do: swapgs sti -- NMI here can happen on some (all?) cpus, returns successfully *with interrupts unmasked* I think AMD documented that the sti interrupt shadow shadows even NMI. And for Intel, it definitely does not (but mov ss and pop ss masks even NMI for the next instruction - so the interrupt shadow is different for these cases). That said, it wasn't 100% clear that the NMI return to immediate regular interrupt can actually happen even on Intel. Iirc, there was some discussion about when the CPU actually tests the IRQ line after an 'iret'. It might end up testing the IF only after executing the instruction it returns to, so it's possible that the sequence .. interrupts disabled .. sti sysexit can not be interrupted by regular interrupts between the 'sti' and the 'sysexit' even if an NMI were to have happened between the two. My preferred fix would be to use sysretl instead of sysexit. As far as I know, there are no 64-bit CPUs at all that don't support sysretl. That 'sti+sysexit is used for the native 32-bit case too, not just the compat mode for x86-64. So I don't necessarily disagree with using sysretl, but.. Does it matter on 32-bit kernels? There's no swapgs, so IRQs should still be safe, and we have a real stack pointer before sysexit. --Andy Linus -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
On Mar 27, 2015 7:26 AM, Denys Vlasenko dvlas...@redhat.com wrote: Hi, While running some tests I noticed that EFLAGS is not saved across syscalls if I use 32-bit userspace, use SYSENTER, and paravirt is active. Looking at the code, it's actually clear why that happens. /* * SYSENTER loads ss, rsp, cs, and rip from previously programmed MSRs. * IF and VM in rflags are cleared (IOW: interrupts are off). * SYSENTER does not save anything on the stack, * and does not save old rip (!!!) and rflags. */ ENTRY(ia32_sysenter_target) SWAPGS_UNSAFE_STACK movqPER_CPU_VAR(cpu_tss + TSS_sp0), %rsp ENABLE_INTERRUPTS(CLBR_NONE) movl%ebp, %ebp movl%eax, %eax movlASM_THREAD_INFO(TI_sysenter_return, %rsp, 0), %r10d /* Construct struct pt_regs on stack */ pushq_cfi $__USER32_DS/* pt_regs-ss */ pushq_cfi %rbp/* pt_regs-sp */ CFI_REL_OFFSET rsp,0 pushfq_cfi /* pt_regs-flags */ The SWAPGS_UNSAFE_STACK, it's it involves paravirt callbacks, will change EFLAGS, and it *can't* save/restore them - there is no place to save it, since neither stack nor PER_CPU() is usable at that point. Interestingly, *no one ever complained*! Apparently, users *don't* depend on arithmetic flags to survive over syscall. They also okay with DF flag being cleared. Let's go flag-by-flag. ID - probably no one depends on it VIP,VIF,VM - v86 stuff, not supported in 64bit AC - someone probably do use this RF - should be cleared to 0 NT - iret via task gate, not supported in 64bit IOPL - usually 00, sys_iopl() can change it DF - according to C ABI, should be 0 IF - should be preserved (but almost always 1) TF - should be preserved arith flags - probably no one cares IOW. Bits to be preseved are only AC, IOPL, TF, and _maybe_ IF. AC and IOPL are preserved even with this paravirt quirk because paravirt hooks do not mangle them. TF preservation and proper restoration is handled by do_debug + syscall_trace_enter_phase2 + iret combo. We unconditionally set IF. This is only a problem for applications which use sys_iopl(3) and, disable IRQs in userspace and perform syscalls. The set of such apps is probably empty. (This bug exists even for non-paravirt case). So, formally, we have a bug: we do not preserve IF, DF and arith flags. I'm proposing to use this opportunity to amend syscall ABI to say that arith flags are not preserved across syscalls, and DF can be cleared to 0 by syscalls (but can't be set to 1). Evidently, it's broken for some time for some virtualized setups and users are okay. I think I'd rather fix it. I want to give x86_64 a sysenter stack like x86_32's, since AFAICT the only reason that #DF needs to use IST is because sysenter with TF set is the only way I can see that #DF could happen with an invalid stack. Also, Houston, we have a bug, probably rootable, and probably damn near impossible to exploit without crashing your system: User does sysenter. We end up in native_irq_enable_sysexit. We do: swapgs sti -- NMI here can happen on some (all?) cpus, returns successfully *with interrupts unmasked* -- IRQ. Boom My preferred fix would be to use sysretl instead of sysexit. As far as I know, there are no 64-bit CPUs at all that don't support sysretl. --Andy I'm not sure what to do with the bug of forcing IF=1. Fix it? Or also declare that syscalls can set IF=1? Do you think this is a legitimate userspace code? sys_iopl(3); cli; syscall(); /* expects irqs still disabled */ -- vda -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
On Fri, Mar 27, 2015 at 1:53 PM, Brian Gerst brge...@gmail.com wrote: -- IRQ. Boom The sti will delay interrupts for one instruction, and that should include NMIs. Nope. Intel explicitly documents the NMI case only for mov-ss and popss. The Intel SDM states for STI: The IF flag and the STI and CLI instructions do not prohibit the generation of exceptions and NMI interrupts. NMI interrupts (and SMIs) may be blocked for one macroinstruction following an STI. Note the *may*. For movss and popss the software developer guide explicitly says that NMI's are also blocked. For plain sti, it seems to be dependent on microarchitecture. Linus -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
On Fri, Mar 27, 2015 at 1:31 PM, Linus Torvalds torva...@linux-foundation.org wrote: On Fri, Mar 27, 2015 at 1:16 PM, Andy Lutomirski l...@amacapital.net wrote: Does it matter on 32-bit kernels? There's no swapgs, so IRQs should still be safe, and we have a real stack pointer before sysexit. Fair enough. On 32-bit, the only worry is the race between return to user space and something set a thread flag, resulting in delayed signals and/or higher scheduling latency etc. So on 32-bit, the bug is much less of an issue, I agree. Right, except for one nasty case: KVM user return notifiers. It's possible we'd re-enter user mode with some MSRs set wrong. Yuck. --Andy So yeah, using sysretl instead of sti+sysexit on 64-bit sounds more reasonable given the potential worry about sti+sysexit atomicity in the presense of nmi's. Linus -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
On Fri, Mar 27, 2015 at 2:37 PM, Andy Lutomirski l...@amacapital.net wrote: On Mar 27, 2015 7:26 AM, Denys Vlasenko dvlas...@redhat.com wrote: Hi, While running some tests I noticed that EFLAGS is not saved across syscalls if I use 32-bit userspace, use SYSENTER, and paravirt is active. Looking at the code, it's actually clear why that happens. /* * SYSENTER loads ss, rsp, cs, and rip from previously programmed MSRs. * IF and VM in rflags are cleared (IOW: interrupts are off). * SYSENTER does not save anything on the stack, * and does not save old rip (!!!) and rflags. */ ENTRY(ia32_sysenter_target) SWAPGS_UNSAFE_STACK movqPER_CPU_VAR(cpu_tss + TSS_sp0), %rsp ENABLE_INTERRUPTS(CLBR_NONE) movl%ebp, %ebp movl%eax, %eax movlASM_THREAD_INFO(TI_sysenter_return, %rsp, 0), %r10d /* Construct struct pt_regs on stack */ pushq_cfi $__USER32_DS/* pt_regs-ss */ pushq_cfi %rbp/* pt_regs-sp */ CFI_REL_OFFSET rsp,0 pushfq_cfi /* pt_regs-flags */ The SWAPGS_UNSAFE_STACK, it's it involves paravirt callbacks, will change EFLAGS, and it *can't* save/restore them - there is no place to save it, since neither stack nor PER_CPU() is usable at that point. Interestingly, *no one ever complained*! Apparently, users *don't* depend on arithmetic flags to survive over syscall. They also okay with DF flag being cleared. Let's go flag-by-flag. ID - probably no one depends on it VIP,VIF,VM - v86 stuff, not supported in 64bit AC - someone probably do use this RF - should be cleared to 0 NT - iret via task gate, not supported in 64bit IOPL - usually 00, sys_iopl() can change it DF - according to C ABI, should be 0 IF - should be preserved (but almost always 1) TF - should be preserved arith flags - probably no one cares IOW. Bits to be preseved are only AC, IOPL, TF, and _maybe_ IF. AC and IOPL are preserved even with this paravirt quirk because paravirt hooks do not mangle them. TF preservation and proper restoration is handled by do_debug + syscall_trace_enter_phase2 + iret combo. We unconditionally set IF. This is only a problem for applications which use sys_iopl(3) and, disable IRQs in userspace and perform syscalls. The set of such apps is probably empty. (This bug exists even for non-paravirt case). So, formally, we have a bug: we do not preserve IF, DF and arith flags. I'm proposing to use this opportunity to amend syscall ABI to say that arith flags are not preserved across syscalls, and DF can be cleared to 0 by syscalls (but can't be set to 1). Evidently, it's broken for some time for some virtualized setups and users are okay. I think I'd rather fix it. I want to give x86_64 a sysenter stack like x86_32's, since AFAICT the only reason that #DF needs to use IST is because sysenter with TF set is the only way I can see that #DF could happen with an invalid stack. What if RSP gets corrupted in the kernel? That would cause a fault that gets promoted to #DF, since the iret frame can't be pushed. You would at least get an oops out instead of a triple fault reset. Also, Houston, we have a bug, probably rootable, and probably damn near impossible to exploit without crashing your system: User does sysenter. We end up in native_irq_enable_sysexit. We do: swapgs sti -- NMI here can happen on some (all?) cpus, returns successfully *with interrupts unmasked* -- IRQ. Boom The sti will delay interrupts for one instruction, and that should include NMIs. The Intel SDM states for STI: The IF flag and the STI and CLI instructions do not prohibit the generation of exceptions and NMI interrupts. NMI interrupts (and SMIs) may be blocked for one macroinstruction following an STI. My preferred fix would be to use sysretl instead of sysexit. As far as I know, there are no 64-bit CPUs at all that don't support sysretl. It would be nice to have one less return path. I'm curious to know why Intel doesn't support syscall from compatibility mode but does support sysret to compatibility mode. -- Brian Gerst -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: ia32_sysenter_target does not preserve EFLAGS
On Fri, Mar 27, 2015 at 9:00 PM, Linus Torvalds torva...@linux-foundation.org wrote: On Fri, Mar 27, 2015 at 7:25 AM, Denys Vlasenko dvlas...@redhat.com wrote: Apparently, users *don't* depend on arithmetic flags to survive over syscall. They also okay with DF flag being cleared. Generally, users probably dont' care about many registers at all being saved, but it's worth noting that the reason system calls save/restore even caller-saved registers is at least partly in order to avoid any kernel information leaks. I don't believe that user mode will ever reasonably care about the arithmetic flags being changed, but at the same time I also don't it is something we should ever consider a feature we should try to take advantage of. Generally we should try to not mess with the flag state, and I'd *much* rather make the rule be that all the system call return paths restore flags as much as possible. We don't clobber anything ABI has its appeal. OTOH, fulfilling ABI's promises has cost which hast to be paid on every syscall, regardless whether userspace needed it or not. Example. This is the uclibc implementation of write(): 004acfc4 __libc_write: 4acfc4: 53 push %rbx 4acfc5: 48 63 ffmovslq %edi,%rdi 4acfc8: b8 01 00 00 00 mov$0x1,%eax 4acfcd: 0f 05 syscall 4acfcf: 48 89 c3mov%rax,%rbx 4acfd2: 48 81 fb 00 f0 ff ffcmp$0xf000,%rbx 4acfd9: 76 0f jbe4acfea __libc_write+0x26 4acfdb: e8 64 15 00 00 callq 4ae544 __GI___errno_location 4acfe0: 89 da mov%ebx,%edx 4acfe2: f7 da neg%edx 4acfe4: 89 10 mov%edx,(%rax) 4acfe6: 48 83 c8 ff or $0x,%rax 4acfea: 5b pop%rbx 4acfeb: c3 retq This is a C function. Therefore any its caller assumes that C-clobbered registers can be, indeed, clobbered here, so if that caller uses any of them, it saves/restores them. All efforts by kernel code to save/restore C-clobbered registers, eight of them, are in vain. It's just useless work. Userspace does not benefit from that effort. If our syscall ABI would say that those regs are not preserved, we could have a bit faster syscalls. Any userspace code which really had to have those registers preserved across a particular syscall, could push/pop them itself. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/