Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
* Denys Vlasenko wrote:

> > - but making it slower without really good reasons isn't good.
>
> The thinking here is that cleaning up entry.S is a good reason.
>
> We won't do anything which would slow it down by, say, 5%,
> but one cycle may be considered acceptable loss.

Ok, so I've applied this particular cleanup, in the hope that we can
recover those cycles on the now cleaner base.

If that doesn't work out then we can still re-introduce this
complication and see its maintainability in isolation, on a clean
base.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
On Tue, Mar 10, 2015 at 3:02 PM, Andy Lutomirski wrote:
> On Tue, Mar 10, 2015 at 7:00 AM, Denys Vlasenko wrote:
>> On Tue, Mar 10, 2015 at 2:26 PM, Andy Lutomirski wrote:
>>> usersp is IMO tolerable. The nasty thing is the FIXUP_TOP_OF_STACK /
>>> RESTORE_TOP_OF_STACK garbage, and this patch is the main step toward
>>> killing that off completely. I've still never convinced myself that
>>> there aren't ptrace-related info leaks in there.
>>>
>>> Denys, did you ever benchmark what happens if we use push instead of
>>> mov? I bet that we get that cycle back and more, not to mention much
>>> less icache usage.
>>
>> Yes, I did.
>> The push conversion seems to perform the same as the current, MOV-based code.
>>
>> The expected win there is that we lose two huge 12-byte insns
>> which store __USER_CS and __USER_DS in the iret frame.
>>
>> MOVQ imm,ofs(%rsp) has a very unfortunate encoding in x86:
>> - needs a REX prefix
>> - no sign-extending imm8 form exists for it
>> - ofs in our case can't fit into 8 bits
>> - (%rsp) requires a SIB byte
>>
>> In my tests, each such instruction adds one cycle.
>>
>> Compare this to PUSH imm8, which is 2 bytes only.
>
> Does that mean that using push on top of this patch gets us our cycle back?

Maybe. I can't be sure about it.

In general I see a jitter of 1-2, sometimes 3 cycles even when I make
changes which merely change code size (e.g. replacing equivalent
insns). This may be caused by jump targets getting aligned differently
wrt cacheline boundaries: if the second/third/fourth insn after the
current one is not fetched because it did not fit into the cacheline,
then some insn decoders don't get anything to chew on.
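[Editorial note: Denys's point about alignment jitter can be illustrated with a toy model. This is a hypothetical sketch, not a benchmark or a cycle-accurate simulation; it only counts how many whole instructions of a given size sequence fit before the next 64-byte cacheline boundary, depending on where the code happens to land.]

```python
# Toy model of Denys's alignment-jitter explanation (hypothetical sketch):
# the same instruction sequence can feed the decoders very differently
# depending on its offset within a 64-byte cacheline.

LINE = 64  # cacheline size in bytes

def insns_in_first_fetch(start, sizes):
    """Count whole instructions that fit between `start` and the first
    cacheline boundary after it. Instructions straddling the boundary
    (and everything after them) must wait for the next fetch."""
    boundary = (start // LINE + 1) * LINE
    count, pos = 0, start
    for size in sizes:
        if pos + size > boundary:
            break
        count += 1
        pos += size
    return count

# A sequence led by two 12-byte MOVQ imm32,disp32(%rsp) insns plus a few
# small ops, placed at two different alignments of a jump target:
sizes = [12, 12, 3, 3, 3]
print(insns_in_first_fetch(0, sizes))   # at the start of a line: all 5 fit
print(insns_in_first_fetch(54, sizes))  # near the end of a line: none fit
```

The model is consistent with the observation that merely replacing instructions with equivalent ones of a different size can shift later jump targets and cost (or save) a cycle or two.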
Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
On Tue, Mar 10, 2015 at 7:00 AM, Denys Vlasenko wrote:
> On Tue, Mar 10, 2015 at 2:26 PM, Andy Lutomirski wrote:
>> usersp is IMO tolerable. The nasty thing is the FIXUP_TOP_OF_STACK /
>> RESTORE_TOP_OF_STACK garbage, and this patch is the main step toward
>> killing that off completely. I've still never convinced myself that
>> there aren't ptrace-related info leaks in there.
>>
>> Denys, did you ever benchmark what happens if we use push instead of
>> mov? I bet that we get that cycle back and more, not to mention much
>> less icache usage.
>
> Yes, I did.
> The push conversion seems to perform the same as the current, MOV-based code.
>
> The expected win there is that we lose two huge 12-byte insns
> which store __USER_CS and __USER_DS in the iret frame.
>
> MOVQ imm,ofs(%rsp) has a very unfortunate encoding in x86:
> - needs a REX prefix
> - no sign-extending imm8 form exists for it
> - ofs in our case can't fit into 8 bits
> - (%rsp) requires a SIB byte
>
> In my tests, each such instruction adds one cycle.
>
> Compare this to PUSH imm8, which is 2 bytes only.

Does that mean that using push on top of this patch gets us our cycle back?

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC
Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
On Tue, Mar 10, 2015 at 2:26 PM, Andy Lutomirski wrote:
> usersp is IMO tolerable. The nasty thing is the FIXUP_TOP_OF_STACK /
> RESTORE_TOP_OF_STACK garbage, and this patch is the main step toward
> killing that off completely. I've still never convinced myself that
> there aren't ptrace-related info leaks in there.
>
> Denys, did you ever benchmark what happens if we use push instead of
> mov? I bet that we get that cycle back and more, not to mention much
> less icache usage.

Yes, I did.
The push conversion seems to perform the same as the current, MOV-based code.

The expected win there is that we lose two huge 12-byte insns
which store __USER_CS and __USER_DS in the iret frame.

MOVQ imm,ofs(%rsp) has a very unfortunate encoding in x86:
- needs a REX prefix
- no sign-extending imm8 form exists for it
- ofs in our case can't fit into 8 bits
- (%rsp) requires a SIB byte

In my tests, each such instruction adds one cycle.

Compare this to PUSH imm8, which is 2 bytes only.
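[Editorial note: the size gap Denys describes can be tallied from the x86-64 encoding components. A hand-derived sketch follows; the per-component byte counts are worth double-checking against an assembler or the Intel SDM instruction-format tables.]

```python
# Hand-tallied x86-64 encoding sizes for the two instruction forms
# Denys compares (a sketch; verify with an assembler/disassembler).

# movq $imm32, disp32(%rsp) -- the form that stores __USER_CS/__USER_DS:
movq_imm_to_stack = (
    1    # REX.W prefix (needed for 64-bit operand size)
    + 1  # C7 opcode (MOV r/m64, imm32)
    + 1  # ModRM byte
    + 1  # SIB byte (required whenever %rsp is the base register)
    + 4  # disp32 (the pt_regs offset does not fit in 8 bits)
    + 4  # imm32 (C7 has no sign-extending imm8 form)
)

# pushq $imm8 -- opcode 6A followed by the 8-bit immediate:
push_imm8 = 1 + 1

print(movq_imm_to_stack, push_imm8)  # 12 2
```

This reproduces the "huge 12-byte insns" versus "PUSH imm8, which is 2 bytes only" comparison from the message, a 6x size difference per stored constant.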
Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
On Tue, Mar 10, 2015 at 2:21 PM, Ingo Molnar wrote:
>> Since this patch does add two extra MOVs,
>> I did benchmark these patches. They add exactly one cycle
>> to the system call code path on my Sandy Bridge CPU.
>
> Hm, but that's the wrong direction, we should try to make it faster,
> and to clean it up

As you know, the goals of "faster" and "cleaner" are often mutually
exclusive. C'est la vie :(

entry.S seems to lean towards "faster" to the extent where it became a
tangled maze of obscure optimizations. Such as the mysterious, and not
at all obvious, existence of an 8-byte "safety padding" at the top of
the 32-bit kernel stack. Before Andy stumbled into it, it was not at
all obvious that it is even there. I had to write a test patch to
verify it. There is a long-standing latent bug in the code where this
padding is not accounted for.

> - but making it slower without really good reasons isn't good.

The thinking here is that cleaning up entry.S is a good reason.

We won't do anything which would slow it down by, say, 5%,
but one cycle may be considered acceptable loss.
Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
> * Denys Vlasenko wrote:
>
> > > So there are now +2 instructions (5 instead of 3) in the
> > > system_call path, but there are -2 instructions in the SYSRETQ
> > > path,
> >
> > Unfortunately, no. [...]
>
> So I assumed that it was an equivalent transformation, given that
> none of the changelogs spelled out the increase in overhead ...

Also, for future reference, you need to spell out performance costs
of fast path patches very clearly - and that wasn't done here.

That's a big no-no, please don't do it again ...

Thanks,

	Ingo
Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
On Tue, Mar 10, 2015 at 6:21 AM, Ingo Molnar wrote:
>
> * Denys Vlasenko wrote:
>
>> > So there are now +2 instructions (5 instead of 3) in the
>> > system_call path, but there are -2 instructions in the SYSRETQ
>> > path,
>>
>> Unfortunately, no. [...]
>
> So I assumed that it was an equivalent transformation, given that none
> of the changelogs spelled out the increase in overhead ...
>
>> [...] There is only this change in the SYSRETQ path, which simply
>> changes where we get RSP from:
>>
>> @@ -293,7 +289,7 @@ ret_from_sys_call:
>>      CFI_REGISTER    rip,rcx
>>      movq    EFLAGS(%rsp),%r11
>>      /*CFI_REGISTER rflags,r11*/
>> -    movq    PER_CPU_VAR(old_rsp), %rsp
>> +    movq    RSP(%rsp),%rsp
>>      /*
>>       * 64bit SYSRET restores rip from rcx,
>>       * rflags from r11 (but RF and VM bits are forced to 0),
>>
>> Most likely, no change in execution speed here.
>> At best, it is one cycle faster somewhere in the address generation
>> unit, because for PER_CPU_VAR() address evaluation, the GS base is
>> nonzero.
>>
>> Since this patch does add two extra MOVs,
>> I did benchmark these patches. They add exactly one cycle
>> to the system call code path on my Sandy Bridge CPU.
>
> Hm, but that's the wrong direction, we should try to make it faster,
> and to clean it up - but making it slower without really good reasons
> isn't good.
>
> Is 'usersp' really that much of a complication?

usersp is IMO tolerable. The nasty thing is the FIXUP_TOP_OF_STACK /
RESTORE_TOP_OF_STACK garbage, and this patch is the main step toward
killing that off completely. I've still never convinced myself that
there aren't ptrace-related info leaks in there.

Denys, did you ever benchmark what happens if we use push instead of
mov? I bet that we get that cycle back and more, not to mention much
less icache usage.

The reason I think that is that this patch changes us from writing
nothing to the RIP slot to writing something there. If we push our
stack frame instead of moving it into place, we must write to every
slot, and writing the right value is probably no more expensive than
writing, say, zero (the load / forwarding units are unlikely to be
very busy at that point). So this change could actually be a step
toward getting faster.

--Andy

> Thanks,
>
>	Ingo

--
Andy Lutomirski
AMA Capital Management, LLC
Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
* Andy Lutomirski wrote:

> > Since this patch does add two extra MOVs,
> > I did benchmark these patches. They add exactly one cycle
> > to system call code path on my Sandy Bridge CPU.
>
> Personally, I'm willing to pay that cycle. It could be a bigger
> savings on context switch, and the simplification it enables is
> pretty good.

But, but ... context switches are a relatively slow path, compared to
system calls. And I say this with the scheduler maintainer hat on as
well.

So this is not a good bargain IMHO, assuming it's not some _huge_
difference in maintainability - but having an extra percpu field isn't
really much of a problem.

I don't claim that we couldn't in some other situation decide that a
certain type of speedup isn't worth it - but what's the big problem
here? A bit of arithmetics shouldn't be a problem?

Thanks,

	Ingo
Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
* Denys Vlasenko wrote:

> > So there are now +2 instructions (5 instead of 3) in the
> > system_call path, but there are -2 instructions in the SYSRETQ
> > path,
>
> Unfortunately, no. [...]

So I assumed that it was an equivalent transformation, given that none
of the changelogs spelled out the increase in overhead ...

> [...] There is only this change in the SYSRETQ path, which simply
> changes where we get RSP from:
>
> @@ -293,7 +289,7 @@ ret_from_sys_call:
>      CFI_REGISTER    rip,rcx
>      movq    EFLAGS(%rsp),%r11
>      /*CFI_REGISTER rflags,r11*/
> -    movq    PER_CPU_VAR(old_rsp), %rsp
> +    movq    RSP(%rsp),%rsp
>      /*
>       * 64bit SYSRET restores rip from rcx,
>       * rflags from r11 (but RF and VM bits are forced to 0),
>
> Most likely, no change in execution speed here.
> At best, it is one cycle faster somewhere in the address generation unit,
> because for PER_CPU_VAR() address evaluation, the GS base is nonzero.
>
> Since this patch does add two extra MOVs,
> I did benchmark these patches. They add exactly one cycle
> to the system call code path on my Sandy Bridge CPU.

Hm, but that's the wrong direction, we should try to make it faster,
and to clean it up - but making it slower without really good reasons
isn't good.

Is 'usersp' really that much of a complication?

Thanks,

	Ingo
Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
On Tue, Mar 10, 2015 at 6:18 AM, Denys Vlasenko wrote:
> On 03/10/2015 01:51 PM, Ingo Molnar wrote:
>>
>> * Denys Vlasenko wrote:
>>
>>> PER_CPU(old_rsp) usage is simplified - now it is used only
>>> as temp storage, and userspace stack pointer is immediately stored
>>> in pt_regs->sp on syscall entry, instead of being used later,
>>> on syscall exit.
>>>
>>> Instead of PER_CPU(old_rsp) and task->thread.usersp, C code
>>> uses pt_regs->sp now.
>>>
>>> FIXUP/RESTORE_TOP_OF_STACK are simplified.
>>
>> Just trying to judge the performance impact:
>>
>>> --- a/arch/x86/kernel/entry_64.S
>>> +++ b/arch/x86/kernel/entry_64.S
>>> @@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64)
>>>   * manipulation.
>>>   */
>>>  .macro FIXUP_TOP_OF_STACK tmp offset=0
>>> -    movq PER_CPU_VAR(old_rsp),\tmp
>>> -    movq \tmp,RSP+\offset(%rsp)
>>>      movq $__USER_DS,SS+\offset(%rsp)
>>>      movq $__USER_CS,CS+\offset(%rsp)
>>>      movq RIP+\offset(%rsp),\tmp    /* get rip */
>>> @@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64)
>>>  .endm
>>>
>>>  .macro RESTORE_TOP_OF_STACK tmp offset=0
>>> -    movq RSP+\offset(%rsp),\tmp
>>> -    movq \tmp,PER_CPU_VAR(old_rsp)
>>> +    /* nothing to do */
>>>  .endm
>>>
>>>  /*
>>> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
>>>       */
>>>      ENABLE_INTERRUPTS(CLBR_NONE)
>>>      ALLOC_PT_GPREGS_ON_STACK 8    /* +8: space for orig_ax */
>>> +    movq    %rcx,RIP(%rsp)
>>> +    movq    PER_CPU_VAR(old_rsp),%rcx
>>> +    movq    %r11,EFLAGS(%rsp)
>>> +    movq    %rcx,RSP(%rsp)
>>> +    movq_cfi rax,ORIG_RAX
>>>      SAVE_C_REGS_EXCEPT_RAX_RCX_R11
>>>      movq    $-ENOSYS,RAX(%rsp)
>>> -    movq_cfi rax,ORIG_RAX
>>> -    movq    %r11,EFLAGS(%rsp)
>>> -    movq    %rcx,RIP(%rsp)
>>>      CFI_REL_OFFSET rip,RIP
>>>      testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
>>>      jnz tracesys
>>
>> So there are now +2 instructions (5 instead of 3) in the system_call
>> path, but there are -2 instructions in the SYSRETQ path,
>
> Unfortunately, no. There is only this change in the SYSRETQ path,
> which simply changes where we get RSP from:
>
> @@ -293,7 +289,7 @@ ret_from_sys_call:
>      CFI_REGISTER    rip,rcx
>      movq    EFLAGS(%rsp),%r11
>      /*CFI_REGISTER rflags,r11*/
> -    movq    PER_CPU_VAR(old_rsp), %rsp
> +    movq    RSP(%rsp),%rsp
>      /*
>       * 64bit SYSRET restores rip from rcx,
>       * rflags from r11 (but RF and VM bits are forced to 0),
>
> Most likely, no change in execution speed here.
> At best, it is one cycle faster somewhere in the address generation unit,
> because for PER_CPU_VAR() address evaluation, the GS base is nonzero.
>
> Since this patch does add two extra MOVs,
> I did benchmark these patches. They add exactly one cycle
> to the system call code path on my Sandy Bridge CPU.

Personally, I'm willing to pay that cycle. It could be a bigger
savings on context switch, and the simplification it enables is
pretty good.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC
Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
On 03/10/2015 01:51 PM, Ingo Molnar wrote:
>
> * Denys Vlasenko wrote:
>
>> PER_CPU(old_rsp) usage is simplified - now it is used only
>> as temp storage, and userspace stack pointer is immediately stored
>> in pt_regs->sp on syscall entry, instead of being used later,
>> on syscall exit.
>>
>> Instead of PER_CPU(old_rsp) and task->thread.usersp, C code
>> uses pt_regs->sp now.
>>
>> FIXUP/RESTORE_TOP_OF_STACK are simplified.
>
> Just trying to judge the performance impact:
>
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64)
>>   * manipulation.
>>   */
>>  .macro FIXUP_TOP_OF_STACK tmp offset=0
>> -    movq PER_CPU_VAR(old_rsp),\tmp
>> -    movq \tmp,RSP+\offset(%rsp)
>>      movq $__USER_DS,SS+\offset(%rsp)
>>      movq $__USER_CS,CS+\offset(%rsp)
>>      movq RIP+\offset(%rsp),\tmp    /* get rip */
>> @@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64)
>>  .endm
>>
>>  .macro RESTORE_TOP_OF_STACK tmp offset=0
>> -    movq RSP+\offset(%rsp),\tmp
>> -    movq \tmp,PER_CPU_VAR(old_rsp)
>> +    /* nothing to do */
>>  .endm
>>
>>  /*
>> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
>>       */
>>      ENABLE_INTERRUPTS(CLBR_NONE)
>>      ALLOC_PT_GPREGS_ON_STACK 8    /* +8: space for orig_ax */
>> +    movq    %rcx,RIP(%rsp)
>> +    movq    PER_CPU_VAR(old_rsp),%rcx
>> +    movq    %r11,EFLAGS(%rsp)
>> +    movq    %rcx,RSP(%rsp)
>> +    movq_cfi rax,ORIG_RAX
>>      SAVE_C_REGS_EXCEPT_RAX_RCX_R11
>>      movq    $-ENOSYS,RAX(%rsp)
>> -    movq_cfi rax,ORIG_RAX
>> -    movq    %r11,EFLAGS(%rsp)
>> -    movq    %rcx,RIP(%rsp)
>>      CFI_REL_OFFSET rip,RIP
>>      testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
>>      jnz tracesys
>
> So there are now +2 instructions (5 instead of 3) in the system_call
> path, but there are -2 instructions in the SYSRETQ path,

Unfortunately, no. There is only this change in the SYSRETQ path,
which simply changes where we get RSP from:

@@ -293,7 +289,7 @@ ret_from_sys_call:
     CFI_REGISTER    rip,rcx
     movq    EFLAGS(%rsp),%r11
     /*CFI_REGISTER rflags,r11*/
-    movq    PER_CPU_VAR(old_rsp), %rsp
+    movq    RSP(%rsp),%rsp
     /*
      * 64bit SYSRET restores rip from rcx,
      * rflags from r11 (but RF and VM bits are forced to 0),

Most likely, no change in execution speed here.
At best, it is one cycle faster somewhere in the address generation unit,
because for PER_CPU_VAR() address evaluation, the GS base is nonzero.

Since this patch does add two extra MOVs,
I did benchmark these patches. They add exactly one cycle
to the system call code path on my Sandy Bridge CPU.
Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
On Tue, Mar 10, 2015 at 5:51 AM, Ingo Molnar wrote:
>
> * Denys Vlasenko wrote:
>
>> PER_CPU(old_rsp) usage is simplified - now it is used only
>> as temp storage, and userspace stack pointer is immediately stored
>> in pt_regs->sp on syscall entry, instead of being used later,
>> on syscall exit.
>>
>> Instead of PER_CPU(old_rsp) and task->thread.usersp, C code
>> uses pt_regs->sp now.
>>
>> FIXUP/RESTORE_TOP_OF_STACK are simplified.
>
> Just trying to judge the performance impact:
>
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64)
>>   * manipulation.
>>   */
>>  .macro FIXUP_TOP_OF_STACK tmp offset=0
>> -    movq PER_CPU_VAR(old_rsp),\tmp
>> -    movq \tmp,RSP+\offset(%rsp)
>>      movq $__USER_DS,SS+\offset(%rsp)
>>      movq $__USER_CS,CS+\offset(%rsp)
>>      movq RIP+\offset(%rsp),\tmp    /* get rip */
>> @@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64)
>>  .endm
>>
>>  .macro RESTORE_TOP_OF_STACK tmp offset=0
>> -    movq RSP+\offset(%rsp),\tmp
>> -    movq \tmp,PER_CPU_VAR(old_rsp)
>> +    /* nothing to do */
>>  .endm
>>
>>  /*
>> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
>>       */
>>      ENABLE_INTERRUPTS(CLBR_NONE)
>>      ALLOC_PT_GPREGS_ON_STACK 8    /* +8: space for orig_ax */
>> +    movq    %rcx,RIP(%rsp)
>> +    movq    PER_CPU_VAR(old_rsp),%rcx
>> +    movq    %r11,EFLAGS(%rsp)
>> +    movq    %rcx,RSP(%rsp)
>> +    movq_cfi rax,ORIG_RAX
>>      SAVE_C_REGS_EXCEPT_RAX_RCX_R11
>>      movq    $-ENOSYS,RAX(%rsp)
>> -    movq_cfi rax,ORIG_RAX
>> -    movq    %r11,EFLAGS(%rsp)
>> -    movq    %rcx,RIP(%rsp)
>>      CFI_REL_OFFSET rip,RIP
>>      testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
>>      jnz tracesys
>
> So there are now +2 instructions (5 instead of 3) in the system_call
> path, but there are -2 instructions in the SYSRETQ path, so combined
> it's a wash performance-wise, right?

I think that the SYSRETQ path is the same number of instructions --
RESTORE_TOP_OF_STACK is only used in somewhat unusual circumstances.

On SYSRETQ, I think we touch one fewer cache line, although that cache
line will be in L1 unless the syscall pushed it out.

The added instructions on entry are probably nearly free, though, as
one of them is a load from a forwarded address and the other is a
store to a hot cacheline.

--Andy
Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
* Denys Vlasenko wrote:

> PER_CPU(old_rsp) usage is simplified - now it is used only
> as temp storage, and userspace stack pointer is immediately stored
> in pt_regs->sp on syscall entry, instead of being used later,
> on syscall exit.
>
> Instead of PER_CPU(old_rsp) and task->thread.usersp, C code
> uses pt_regs->sp now.
>
> FIXUP/RESTORE_TOP_OF_STACK are simplified.

Just trying to judge the performance impact:

> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64)
>   * manipulation.
>   */
>  .macro FIXUP_TOP_OF_STACK tmp offset=0
> -    movq PER_CPU_VAR(old_rsp),\tmp
> -    movq \tmp,RSP+\offset(%rsp)
>      movq $__USER_DS,SS+\offset(%rsp)
>      movq $__USER_CS,CS+\offset(%rsp)
>      movq RIP+\offset(%rsp),\tmp    /* get rip */
> @@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64)
>  .endm
>
>  .macro RESTORE_TOP_OF_STACK tmp offset=0
> -    movq RSP+\offset(%rsp),\tmp
> -    movq \tmp,PER_CPU_VAR(old_rsp)
> +    /* nothing to do */
>  .endm
>
>  /*
> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
>       */
>      ENABLE_INTERRUPTS(CLBR_NONE)
>      ALLOC_PT_GPREGS_ON_STACK 8    /* +8: space for orig_ax */
> +    movq    %rcx,RIP(%rsp)
> +    movq    PER_CPU_VAR(old_rsp),%rcx
> +    movq    %r11,EFLAGS(%rsp)
> +    movq    %rcx,RSP(%rsp)
> +    movq_cfi rax,ORIG_RAX
>      SAVE_C_REGS_EXCEPT_RAX_RCX_R11
>      movq    $-ENOSYS,RAX(%rsp)
> -    movq_cfi rax,ORIG_RAX
> -    movq    %r11,EFLAGS(%rsp)
> -    movq    %rcx,RIP(%rsp)
>      CFI_REL_OFFSET rip,RIP
>      testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
>      jnz tracesys

So there are now +2 instructions (5 instead of 3) in the system_call
path, but there are -2 instructions in the SYSRETQ path, so combined
it's a wash performance-wise, right?

Thanks,

	Ingo
Re: [PATCH 3/4] x86: save user rsp in pt_regs-sp on SYSCALL64 fastpath
On Tue, Mar 10, 2015 at 5:51 AM, Ingo Molnar mi...@kernel.org wrote: * Denys Vlasenko dvlas...@redhat.com wrote: PER_CPU(old_rsp) usage is simplified - now it is used only as temp storage, and userspace stack pointer is immediately stored in pt_regs-sp on syscall entry, instead of being used later, on syscall exit. Instead of PER_CPU(old_rsp) and task-thread.usersp, C code uses pt_regs-sp now. FIXUP/RESTORE_TOP_OF_STACK are simplified. Just trying to judge the performance impact: --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64) * manipulation. */ .macro FIXUP_TOP_OF_STACK tmp offset=0 - movq PER_CPU_VAR(old_rsp),\tmp - movq \tmp,RSP+\offset(%rsp) movq $__USER_DS,SS+\offset(%rsp) movq $__USER_CS,CS+\offset(%rsp) movq RIP+\offset(%rsp),\tmp /* get rip */ @@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64) .endm .macro RESTORE_TOP_OF_STACK tmp offset=0 - movq RSP+\offset(%rsp),\tmp - movq \tmp,PER_CPU_VAR(old_rsp) + /* nothing to do */ .endm /* @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs) */ ENABLE_INTERRUPTS(CLBR_NONE) ALLOC_PT_GPREGS_ON_STACK 8 /* +8: space for orig_ax */ + movq%rcx,RIP(%rsp) + movqPER_CPU_VAR(old_rsp),%rcx + movq%r11,EFLAGS(%rsp) + movq%rcx,RSP(%rsp) + movq_cfi rax,ORIG_RAX SAVE_C_REGS_EXCEPT_RAX_RCX_R11 movq$-ENOSYS,RAX(%rsp) - movq_cfi rax,ORIG_RAX - movq%r11,EFLAGS(%rsp) - movq%rcx,RIP(%rsp) CFI_REL_OFFSET rip,RIP testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP) jnz tracesys So there are now +2 instructions (5 instead of 3) in the system_call path, but there are -2 instructions in the SYSRETQ path, so combined it's a wash performance-wise, right? I think that the SYSRETQ path is the same number of instructions -- RESTORE_TOP_OF_STACK is only used in somewhat unusual circumstances. On SYSRETQ, I think we touch one fewer cache line, although that cache line will be in L1 unless the syscall pushed it out. 
The added instructions on entry are probably nearly free, though, as one of them is a load from a forwarded address and the other is a store to a hot cacheline. --And -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/4] x86: save user rsp in pt_regs-sp on SYSCALL64 fastpath
On Tue, Mar 10, 2015 at 7:00 AM, Denys Vlasenko vda.li...@googlemail.com wrote: On Tue, Mar 10, 2015 at 2:26 PM, Andy Lutomirski l...@amacapital.net wrote: usersp is IMO tolerable. The nasty thing is the FIXUP_TOP_OF_STACK / RESTORE_TOP_OF_STACK garbage, and this patch is the main step toward killing that off completely. I've still never convinced myself that there aren't ptrace-related info leaks in there. Denys, did you ever benchmark what happens if we use push instead of mov? I bet that we get that cycle back and more, not to mention much less icache usage. Yes, I did. Push conversion seems to perform the same as current, MOV-based code. The expected win there that we lose two huge 12-byte insns which store __USER_CS and __USER_DS in iret frame. MOVQ imm,ofs(%rsp) has a very unfortunate encoding in x86: - needs REX prefix - no sing-extending imm8 form exists for it - ofs in our case can't fit into 8 bits - (%esp) requires SIB byte In my tests, each such instruction adds one cycle. Compare this to PUSH imm8, which is 2 bytes only. Does that mean that using push on top of this patch gets us our cycle back? --Andy -- Andy Lutomirski AMA Capital Management, LLC -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/4] x86: save user rsp in pt_regs-sp on SYSCALL64 fastpath
On Tue, Mar 10, 2015 at 2:21 PM, Ingo Molnar mi...@kernel.org wrote: Since this patch does add two extra MOVs, I did benchmark these patches. They add exactly one cycle to system call code path on my Sandy Bridge CPU. Hm, but that's the wrong direction, we should try to make it faster, and to clean it up As you know, goals of faster and cleaner are often mutually exclusive. C'est la vie :( entry.S seem to lean towards faster to the extent where it became a tangled maze of obscure optimizations. Such as the mysterious, and not at all obvious existence of 8 byte safety padding at the top of the 32-bit kernel stack. Before Andy stumbled into it, it was not at all obvious that it is even there. I had to write a test patch to verify it. There is a long-standing latent bug in the code where this padding is not accounted for. - but making it slower without really good reasons isn't good. The thinking here is that cleaning up entry.S is a good reason. We won't do anything which would slow it down by, say, 5%, but one cycle may be considered acceptable loss. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/4] x86: save user rsp in pt_regs-sp on SYSCALL64 fastpath
On Tue, Mar 10, 2015 at 2:26 PM, Andy Lutomirski l...@amacapital.net wrote: usersp is IMO tolerable. The nasty thing is the FIXUP_TOP_OF_STACK / RESTORE_TOP_OF_STACK garbage, and this patch is the main step toward killing that off completely. I've still never convinced myself that there aren't ptrace-related info leaks in there. Denys, did you ever benchmark what happens if we use push instead of mov? I bet that we get that cycle back and more, not to mention much less icache usage. Yes, I did. Push conversion seems to perform the same as current, MOV-based code. The expected win there that we lose two huge 12-byte insns which store __USER_CS and __USER_DS in iret frame. MOVQ imm,ofs(%rsp) has a very unfortunate encoding in x86: - needs REX prefix - no sing-extending imm8 form exists for it - ofs in our case can't fit into 8 bits - (%esp) requires SIB byte In my tests, each such instruction adds one cycle. Compare this to PUSH imm8, which is 2 bytes only. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH 3/4] x86: save user rsp in pt_regs-sp on SYSCALL64 fastpath
On 03/10/2015 01:51 PM, Ingo Molnar wrote: * Denys Vlasenko dvlas...@redhat.com wrote: PER_CPU(old_rsp) usage is simplified - now it is used only as temp storage, and userspace stack pointer is immediately stored in pt_regs-sp on syscall entry, instead of being used later, on syscall exit. Instead of PER_CPU(old_rsp) and task-thread.usersp, C code uses pt_regs-sp now. FIXUP/RESTORE_TOP_OF_STACK are simplified. Just trying to judge the performance impact: --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64) * manipulation. */ .macro FIXUP_TOP_OF_STACK tmp offset=0 -movq PER_CPU_VAR(old_rsp),\tmp -movq \tmp,RSP+\offset(%rsp) movq $__USER_DS,SS+\offset(%rsp) movq $__USER_CS,CS+\offset(%rsp) movq RIP+\offset(%rsp),\tmp /* get rip */ @@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64) .endm .macro RESTORE_TOP_OF_STACK tmp offset=0 -movq RSP+\offset(%rsp),\tmp -movq \tmp,PER_CPU_VAR(old_rsp) +/* nothing to do */ .endm /* @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs) */ ENABLE_INTERRUPTS(CLBR_NONE) ALLOC_PT_GPREGS_ON_STACK 8 /* +8: space for orig_ax */ +movq%rcx,RIP(%rsp) +movqPER_CPU_VAR(old_rsp),%rcx +movq%r11,EFLAGS(%rsp) +movq%rcx,RSP(%rsp) +movq_cfi rax,ORIG_RAX SAVE_C_REGS_EXCEPT_RAX_RCX_R11 movq$-ENOSYS,RAX(%rsp) -movq_cfi rax,ORIG_RAX -movq%r11,EFLAGS(%rsp) -movq%rcx,RIP(%rsp) CFI_REL_OFFSET rip,RIP testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP) jnz tracesys So there are now +2 instructions (5 instead of 3) in the system_call path, but there are -2 instructions in the SYSRETQ path, Unfortunately, no. 
There is only this change in the SYSRETQ path, which simply changes
where we get RSP from:

@@ -293,7 +289,7 @@ ret_from_sys_call:
 	CFI_REGISTER	rip,rcx
 	movq	EFLAGS(%rsp),%r11
 	/*CFI_REGISTER	rflags,r11*/
-	movq	PER_CPU_VAR(old_rsp), %rsp
+	movq	RSP(%rsp),%rsp
 	/*
 	 * 64bit SYSRET restores rip from rcx,
 	 * rflags from r11 (but RF and VM bits are forced to 0),

Most likely, no change in execution speed here.
At best, it is one cycle faster somewhere in the address generation
unit, because for PER_CPU_VAR() address evaluation, the GS base is
nonzero.

Since this patch does add two extra MOVs, I did benchmark these
patches. They add exactly one cycle to the system call code path on
my Sandy Bridge CPU.
Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
On Tue, Mar 10, 2015 at 6:21 AM, Ingo Molnar <mi...@kernel.org> wrote:
> * Denys Vlasenko <dvlas...@redhat.com> wrote:
>
>> > So there are now +2 instructions (5 instead of 3) in the
>> > system_call path, but there are -2 instructions in the SYSRETQ
>> > path,
>>
>> Unfortunately, no. [...]
>
> So I assumed that it was an equivalent transformation, given that
> none of the changelogs spelled out the increase in overhead ... [...]
>
>> There is only this change in the SYSRETQ path, which simply changes
>> where we get RSP from:
>>
>> @@ -293,7 +289,7 @@ ret_from_sys_call:
>>  	CFI_REGISTER	rip,rcx
>>  	movq	EFLAGS(%rsp),%r11
>>  	/*CFI_REGISTER	rflags,r11*/
>> -	movq	PER_CPU_VAR(old_rsp), %rsp
>> +	movq	RSP(%rsp),%rsp
>>  	/*
>>  	 * 64bit SYSRET restores rip from rcx,
>>  	 * rflags from r11 (but RF and VM bits are forced to 0),
>>
>> Most likely, no change in execution speed here.
>> At best, it is one cycle faster somewhere in the address generation
>> unit, because for PER_CPU_VAR() address evaluation, the GS base is
>> nonzero.
>>
>> Since this patch does add two extra MOVs, I did benchmark these
>> patches. They add exactly one cycle to the system call code path on
>> my Sandy Bridge CPU.
>
> Hm, but that's the wrong direction, we should try to make it faster,
> and to clean it up - but making it slower without really good
> reasons isn't good. Is 'usersp' really that much of a complication?

usersp is IMO tolerable.  The nasty thing is the FIXUP_TOP_OF_STACK /
RESTORE_TOP_OF_STACK garbage, and this patch is the main step toward
killing that off completely.  I've still never convinced myself that
there aren't ptrace-related info leaks in there.

Denys, did you ever benchmark what happens if we use push instead of
mov?  I bet that we get that cycle back and more, not to mention much
less icache usage.

The reason I think that is that this patch changes us from writing
nothing to the RIP slot to writing something there.
If we push our stack frame instead of moving it into place, we must
write to every slot, and writing the right value is probably no more
expensive than writing, say, zero (the load / forwarding units are
unlikely to be very busy at that point).  So this change could
actually be a step toward getting faster.

--Andy

> Thanks,
>
> 	Ingo

--
Andy Lutomirski
AMA Capital Management, LLC
Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
On Tue, Mar 10, 2015 at 6:18 AM, Denys Vlasenko <dvlas...@redhat.com> wrote:
> On 03/10/2015 01:51 PM, Ingo Molnar wrote:
>> * Denys Vlasenko <dvlas...@redhat.com> wrote:
>>
>>> PER_CPU(old_rsp) usage is simplified - now it is used only
>>> as temp storage, and userspace stack pointer is immediately stored
>>> in pt_regs->sp on syscall entry, instead of being used later,
>>> on syscall exit.
>>>
>>> Instead of PER_CPU(old_rsp) and task->thread.usersp, C code
>>> uses pt_regs->sp now.
>>>
>>> FIXUP/RESTORE_TOP_OF_STACK are simplified.
>>
>> Just trying to judge the performance impact:
>>
>>> --- a/arch/x86/kernel/entry_64.S
>>> +++ b/arch/x86/kernel/entry_64.S
>>> @@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64)
>>>   * manipulation.
>>>   */
>>>  .macro FIXUP_TOP_OF_STACK tmp offset=0
>>> -	movq PER_CPU_VAR(old_rsp),\tmp
>>> -	movq \tmp,RSP+\offset(%rsp)
>>>  	movq $__USER_DS,SS+\offset(%rsp)
>>>  	movq $__USER_CS,CS+\offset(%rsp)
>>>  	movq RIP+\offset(%rsp),\tmp  /* get rip */
>>> @@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64)
>>>  .endm
>>>
>>>  .macro RESTORE_TOP_OF_STACK tmp offset=0
>>> -	movq RSP+\offset(%rsp),\tmp
>>> -	movq \tmp,PER_CPU_VAR(old_rsp)
>>> +	/* nothing to do */
>>>  .endm
>>>
>>> /*
>>> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
>>>  	 */
>>>  	ENABLE_INTERRUPTS(CLBR_NONE)
>>>  	ALLOC_PT_GPREGS_ON_STACK 8	/* +8: space for orig_ax */
>>> +	movq	%rcx,RIP(%rsp)
>>> +	movq	PER_CPU_VAR(old_rsp),%rcx
>>> +	movq	%r11,EFLAGS(%rsp)
>>> +	movq	%rcx,RSP(%rsp)
>>> +	movq_cfi rax,ORIG_RAX
>>>  	SAVE_C_REGS_EXCEPT_RAX_RCX_R11
>>>  	movq	$-ENOSYS,RAX(%rsp)
>>> -	movq_cfi rax,ORIG_RAX
>>> -	movq	%r11,EFLAGS(%rsp)
>>> -	movq	%rcx,RIP(%rsp)
>>>  	CFI_REL_OFFSET rip,RIP
>>>  	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
>>>  	jnz tracesys
>>
>> So there are now +2 instructions (5 instead of 3) in the
>> system_call path, but there are -2 instructions in the SYSRETQ path,
>
> Unfortunately, no.
> There is only this change in the SYSRETQ path, which simply changes
> where we get RSP from:
>
> @@ -293,7 +289,7 @@ ret_from_sys_call:
>  	CFI_REGISTER	rip,rcx
>  	movq	EFLAGS(%rsp),%r11
>  	/*CFI_REGISTER	rflags,r11*/
> -	movq	PER_CPU_VAR(old_rsp), %rsp
> +	movq	RSP(%rsp),%rsp
>  	/*
>  	 * 64bit SYSRET restores rip from rcx,
>  	 * rflags from r11 (but RF and VM bits are forced to 0),
>
> Most likely, no change in execution speed here.
> At best, it is one cycle faster somewhere in the address generation
> unit, because for PER_CPU_VAR() address evaluation, the GS base is
> nonzero.
>
> Since this patch does add two extra MOVs, I did benchmark these
> patches. They add exactly one cycle to the system call code path on
> my Sandy Bridge CPU.

Personally, I'm willing to pay that cycle.  It could be a bigger
savings on context switch, and the simplification it enables is
pretty good.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC
Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
* Denys Vlasenko <dvlas...@redhat.com> wrote:

> > So there are now +2 instructions (5 instead of 3) in the
> > system_call path, but there are -2 instructions in the SYSRETQ
> > path,
>
> Unfortunately, no. [...]

So I assumed that it was an equivalent transformation, given that
none of the changelogs spelled out the increase in overhead ... [...]

> There is only this change in the SYSRETQ path, which simply changes
> where we get RSP from:
>
> @@ -293,7 +289,7 @@ ret_from_sys_call:
>  	CFI_REGISTER	rip,rcx
>  	movq	EFLAGS(%rsp),%r11
>  	/*CFI_REGISTER	rflags,r11*/
> -	movq	PER_CPU_VAR(old_rsp), %rsp
> +	movq	RSP(%rsp),%rsp
>  	/*
>  	 * 64bit SYSRET restores rip from rcx,
>  	 * rflags from r11 (but RF and VM bits are forced to 0),
>
> Most likely, no change in execution speed here.
> At best, it is one cycle faster somewhere in the address generation
> unit, because for PER_CPU_VAR() address evaluation, the GS base is
> nonzero.
>
> Since this patch does add two extra MOVs, I did benchmark these
> patches. They add exactly one cycle to the system call code path on
> my Sandy Bridge CPU.

Hm, but that's the wrong direction, we should try to make it faster,
and to clean it up - but making it slower without really good reasons
isn't good.

Is 'usersp' really that much of a complication?

Thanks,

	Ingo
Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
* Denys Vlasenko <dvlas...@redhat.com> wrote:

> PER_CPU(old_rsp) usage is simplified - now it is used only
> as temp storage, and userspace stack pointer is immediately stored
> in pt_regs->sp on syscall entry, instead of being used later,
> on syscall exit.
>
> Instead of PER_CPU(old_rsp) and task->thread.usersp, C code
> uses pt_regs->sp now.
>
> FIXUP/RESTORE_TOP_OF_STACK are simplified.

Just trying to judge the performance impact:

> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64)
>   * manipulation.
>   */
>  .macro FIXUP_TOP_OF_STACK tmp offset=0
> -	movq PER_CPU_VAR(old_rsp),\tmp
> -	movq \tmp,RSP+\offset(%rsp)
>  	movq $__USER_DS,SS+\offset(%rsp)
>  	movq $__USER_CS,CS+\offset(%rsp)
>  	movq RIP+\offset(%rsp),\tmp  /* get rip */
> @@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64)
>  .endm
>
>  .macro RESTORE_TOP_OF_STACK tmp offset=0
> -	movq RSP+\offset(%rsp),\tmp
> -	movq \tmp,PER_CPU_VAR(old_rsp)
> +	/* nothing to do */
>  .endm
>
> /*
> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
>  	 */
>  	ENABLE_INTERRUPTS(CLBR_NONE)
>  	ALLOC_PT_GPREGS_ON_STACK 8	/* +8: space for orig_ax */
> +	movq	%rcx,RIP(%rsp)
> +	movq	PER_CPU_VAR(old_rsp),%rcx
> +	movq	%r11,EFLAGS(%rsp)
> +	movq	%rcx,RSP(%rsp)
> +	movq_cfi rax,ORIG_RAX
>  	SAVE_C_REGS_EXCEPT_RAX_RCX_R11
>  	movq	$-ENOSYS,RAX(%rsp)
> -	movq_cfi rax,ORIG_RAX
> -	movq	%r11,EFLAGS(%rsp)
> -	movq	%rcx,RIP(%rsp)
>  	CFI_REL_OFFSET rip,RIP
>  	testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
>  	jnz tracesys

So there are now +2 instructions (5 instead of 3) in the system_call
path, but there are -2 instructions in the SYSRETQ path, so combined
it's a wash performance-wise, right?

Thanks,

	Ingo
Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
* Andy Lutomirski <l...@amacapital.net> wrote:

> > Since this patch does add two extra MOVs, I did benchmark these
> > patches. They add exactly one cycle to the system call code path
> > on my Sandy Bridge CPU.
>
> Personally, I'm willing to pay that cycle.  It could be a bigger
> savings on context switch, and the simplification it enables is
> pretty good.

But, but ... context switches are a relatively slow path, compared to
system calls. And I say this with the scheduler maintainer hat on as
well.

So this is not a good bargain IMHO, assuming it's not some _huge_
difference in maintainability - but having an extra percpu field
isn't really much of a problem.

I don't claim that we couldn't in some other situation decide that a
certain type of speedup isn't worth it - but what's the big problem
here? A bit of arithmetics shouldn't be a problem?

Thanks,

	Ingo
Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
* Denys Vlasenko <dvlas...@redhat.com> wrote:

> > So there are now +2 instructions (5 instead of 3) in the
> > system_call path, but there are -2 instructions in the SYSRETQ
> > path,
>
> Unfortunately, no. [...]

So I assumed that it was an equivalent transformation, given that
none of the changelogs spelled out the increase in overhead ...

Also, for future reference, you need to spell out the performance
costs of fast path patches very clearly - and that wasn't done here.

That's a big no-no, please don't do it again ...

Thanks,

	Ingo
Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
On Tue, Mar 10, 2015 at 3:02 PM, Andy Lutomirski <l...@amacapital.net> wrote:
> On Tue, Mar 10, 2015 at 7:00 AM, Denys Vlasenko
> <vda.li...@googlemail.com> wrote:
>> On Tue, Mar 10, 2015 at 2:26 PM, Andy Lutomirski <l...@amacapital.net> wrote:
>>> usersp is IMO tolerable.  The nasty thing is the FIXUP_TOP_OF_STACK /
>>> RESTORE_TOP_OF_STACK garbage, and this patch is the main step toward
>>> killing that off completely.  I've still never convinced myself that
>>> there aren't ptrace-related info leaks in there.
>>>
>>> Denys, did you ever benchmark what happens if we use push instead of
>>> mov?  I bet that we get that cycle back and more, not to mention much
>>> less icache usage.
>>
>> Yes, I did.
>> Push conversion seems to perform the same as the current, MOV-based code.
>>
>> The expected win there is that we lose two huge 12-byte insns
>> which store __USER_CS and __USER_DS in the iret frame.
>>
>> MOVQ imm,ofs(%rsp) has a very unfortunate encoding in x86:
>> - needs REX prefix
>> - no sign-extending imm8 form exists for it
>> - ofs in our case can't fit into 8 bits
>> - (%rsp) requires SIB byte
>>
>> In my tests, each such instruction adds one cycle.
>>
>> Compare this to PUSH imm8, which is 2 bytes only.
>
> Does that mean that using push on top of this patch gets us our cycle back?

Maybe. I can't be sure about it.

In general I see a jitter of 1-2, sometimes 3 cycles even when I do
changes which merely change code size (e.g. replacing equivalent
insns). This may be caused by jump targets getting aligned
differently wrt cacheline boundaries.

If the second/third/fourth insn after the current one is not fetched
because it did not fit into the cacheline, then some insn decoders
don't get anything to chew on.
Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
On Mon, Mar 9, 2015 at 1:32 PM, Denys Vlasenko wrote:
> On Mon, Mar 9, 2015 at 9:11 PM, Andy Lutomirski wrote:
>>> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
>>>  	 */
>>>  	ENABLE_INTERRUPTS(CLBR_NONE)
>>>  	ALLOC_PT_GPREGS_ON_STACK 8	/* +8: space for orig_ax */
>>> +	movq	%rcx,RIP(%rsp)
>>> +	movq	PER_CPU_VAR(old_rsp),%rcx
>>> +	movq	%r11,EFLAGS(%rsp)
>>> +	movq	%rcx,RSP(%rsp)
>>> +	movq_cfi rax,ORIG_RAX
>>>  	SAVE_C_REGS_EXCEPT_RAX_RCX_R11
>>>  	movq	$-ENOSYS,RAX(%rsp)
>>> -	movq_cfi rax,ORIG_RAX
>>> -	movq	%r11,EFLAGS(%rsp)
>>> -	movq	%rcx,RIP(%rsp)
>>
>> Why the reordering?
>
> No strong reason.
>
> iret stack is "above" the rest of pt_regs.
>
> This does not matter now, but when/if we convert to PUSHes
> for register saving, pushes which build iret frame
> will have to be before "save C-clobbered registers" part,
> exactly as in this patch.

Fair enough.

Acked-by: Andy Lutomirski

--
Andy Lutomirski
AMA Capital Management, LLC
Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
On Mon, Mar 9, 2015 at 9:11 PM, Andy Lutomirski wrote:
>> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
>>  	 */
>>  	ENABLE_INTERRUPTS(CLBR_NONE)
>>  	ALLOC_PT_GPREGS_ON_STACK 8	/* +8: space for orig_ax */
>> +	movq	%rcx,RIP(%rsp)
>> +	movq	PER_CPU_VAR(old_rsp),%rcx
>> +	movq	%r11,EFLAGS(%rsp)
>> +	movq	%rcx,RSP(%rsp)
>> +	movq_cfi rax,ORIG_RAX
>>  	SAVE_C_REGS_EXCEPT_RAX_RCX_R11
>>  	movq	$-ENOSYS,RAX(%rsp)
>> -	movq_cfi rax,ORIG_RAX
>> -	movq	%r11,EFLAGS(%rsp)
>> -	movq	%rcx,RIP(%rsp)
>
> Why the reordering?

No strong reason.

iret stack is "above" the rest of pt_regs.

This does not matter now, but when/if we convert to PUSHes
for register saving, pushes which build iret frame
will have to be before "save C-clobbered registers" part,
exactly as in this patch.
Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath
On Mon, Mar 9, 2015 at 11:39 AM, Denys Vlasenko wrote:
> PER_CPU(old_rsp) usage is simplified - now it is used only
> as temp storage, and userspace stack pointer is immediately stored
> in pt_regs->sp on syscall entry, instead of being used later,
> on syscall exit.
>
> Instead of PER_CPU(old_rsp) and task->thread.usersp, C code
> uses pt_regs->sp now.
>
> FIXUP/RESTORE_TOP_OF_STACK are simplified.
>
> Signed-off-by: Denys Vlasenko
> CC: Linus Torvalds
> CC: Steven Rostedt
> CC: Ingo Molnar
> CC: Borislav Petkov
> CC: "H. Peter Anvin"
> CC: Andy Lutomirski
> CC: Oleg Nesterov
> CC: Frederic Weisbecker
> CC: Alexei Starovoitov
> CC: Will Drewry
> CC: Kees Cook
> CC: x...@kernel.org
> CC: linux-kernel@vger.kernel.org

Looks correct.

> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
>  	 */
>  	ENABLE_INTERRUPTS(CLBR_NONE)
>  	ALLOC_PT_GPREGS_ON_STACK 8	/* +8: space for orig_ax */
> +	movq	%rcx,RIP(%rsp)
> +	movq	PER_CPU_VAR(old_rsp),%rcx
> +	movq	%r11,EFLAGS(%rsp)
> +	movq	%rcx,RSP(%rsp)
> +	movq_cfi rax,ORIG_RAX
>  	SAVE_C_REGS_EXCEPT_RAX_RCX_R11
>  	movq	$-ENOSYS,RAX(%rsp)
> -	movq_cfi rax,ORIG_RAX
> -	movq	%r11,EFLAGS(%rsp)
> -	movq	%rcx,RIP(%rsp)

Why the reordering?

--Andy