Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath

2015-03-16 Thread Ingo Molnar

* Denys Vlasenko  wrote:

> > - but making it slower without really good reasons isn't good.
> 
> The thinking here is that cleaning up entry.S is a good reason.
> 
> We won't do anything which would slow it down by, say, 5%,
> but one cycle may be considered acceptable loss.

Ok, so I've applied this particular cleanup, in the hope that we can 
recover those cycles on the now cleaner base.

If that doesn't work out then we can still re-introduce this 
complication and see its maintainability in isolation, on a clean 
base.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath

2015-03-10 Thread Denys Vlasenko
On Tue, Mar 10, 2015 at 3:02 PM, Andy Lutomirski  wrote:
> On Tue, Mar 10, 2015 at 7:00 AM, Denys Vlasenko
>  wrote:
>> On Tue, Mar 10, 2015 at 2:26 PM, Andy Lutomirski  wrote:
>>> usersp is IMO tolerable.  The nasty thing is the FIXUP_TOP_OF_STACK /
>>> RESTORE_TOP_OF_STACK garbage, and this patch is the main step toward
>>> killing that off completely.  I've still never convinced myself that
>>> there aren't ptrace-related info leaks in there.
>>>
>>> Denys, did you ever benchmark what happens if we use push instead of
>>> mov?  I bet that we get that cycle back and more, not to mention much
>>> less icache usage.
>>
>> Yes, I did.
>> Push conversion seems to perform the same as current, MOV-based code.
>>
>> The expected win there is that we lose two huge 12-byte insns
>> which store __USER_CS and __USER_DS in the iret frame.
>>
>> MOVQ imm,ofs(%rsp) has a very unfortunate encoding in x86:
>> - needs REX prefix
>> - no sign-extending imm8 form exists for it
>> - ofs in our case can't fit into 8 bits
>> - (%rsp) requires a SIB byte
>>
>> In my tests, each such instruction adds one cycle.
>>
>> Compare this to PUSH imm8, which is 2 bytes only.
>
> Does that mean that using push on top of this patch gets us our cycle back?

Maybe. I can't be sure about it.

In general I see a jitter of 1-2, sometimes 3 cycles even when I do changes
which merely change code size (e.g. replacing equivalent insns).
This may be caused by jump targets getting aligned differently
wrt cacheline boundaries. If the second/third/fourth insn after the current one
is not fetched because it did not fit into the cacheline,
then some insn decoders don't get anything to chew on.


Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath

2015-03-10 Thread Andy Lutomirski
On Tue, Mar 10, 2015 at 7:00 AM, Denys Vlasenko
 wrote:
> On Tue, Mar 10, 2015 at 2:26 PM, Andy Lutomirski  wrote:
>> usersp is IMO tolerable.  The nasty thing is the FIXUP_TOP_OF_STACK /
>> RESTORE_TOP_OF_STACK garbage, and this patch is the main step toward
>> killing that off completely.  I've still never convinced myself that
>> there aren't ptrace-related info leaks in there.
>>
>> Denys, did you ever benchmark what happens if we use push instead of
>> mov?  I bet that we get that cycle back and more, not to mention much
>> less icache usage.
>
> Yes, I did.
> Push conversion seems to perform the same as current, MOV-based code.
>
> The expected win there is that we lose two huge 12-byte insns
> which store __USER_CS and __USER_DS in the iret frame.
>
> MOVQ imm,ofs(%rsp) has a very unfortunate encoding in x86:
> - needs REX prefix
> - no sign-extending imm8 form exists for it
> - ofs in our case can't fit into 8 bits
> - (%rsp) requires a SIB byte
>
> In my tests, each such instruction adds one cycle.
>
> Compare this to PUSH imm8, which is 2 bytes only.

Does that mean that using push on top of this patch gets us our cycle back?

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath

2015-03-10 Thread Denys Vlasenko
On Tue, Mar 10, 2015 at 2:26 PM, Andy Lutomirski  wrote:
> usersp is IMO tolerable.  The nasty thing is the FIXUP_TOP_OF_STACK /
> RESTORE_TOP_OF_STACK garbage, and this patch is the main step toward
> killing that off completely.  I've still never convinced myself that
> there aren't ptrace-related info leaks in there.
>
> Denys, did you ever benchmark what happens if we use push instead of
> mov?  I bet that we get that cycle back and more, not to mention much
> less icache usage.

Yes, I did.
Push conversion seems to perform the same as current, MOV-based code.

The expected win there is that we lose two huge 12-byte insns
which store __USER_CS and __USER_DS in the iret frame.

MOVQ imm,ofs(%rsp) has a very unfortunate encoding in x86:
- needs REX prefix
- no sign-extending imm8 form exists for it
- ofs in our case can't fit into 8 bits
- (%rsp) requires a SIB byte

In my tests, each such instruction adds one cycle.

Compare this to PUSH imm8, which is 2 bytes only.


Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath

2015-03-10 Thread Denys Vlasenko
On Tue, Mar 10, 2015 at 2:21 PM, Ingo Molnar  wrote:
>> Since this patch does add two extra MOVs,
>> I did benchmark these patches. They add exactly one cycle
>> to system call code path on my Sandy Bridge CPU.
>
> Hm, but that's the wrong direction, we should try to make it faster,
> and to clean it up

As you know, goals of "faster" and "cleaner" are often mutually exclusive.
C'est la vie :(

entry.S seems to lean towards "faster" to the extent that it has become
a tangled maze of obscure optimizations.

Such as the mysterious and not at all obvious existence of the 8-byte
"safety padding" at the top of the 32-bit kernel stack. Before Andy
stumbled into it, it was not at all obvious that it was even there.
I had to write a test patch to verify it.
There is a long-standing latent bug in the code where this padding
is not accounted for.

> - but making it slower without really good reasons isn't good.

The thinking here is that cleaning up entry.S is a good reason.

We won't do anything which would slow it down by, say, 5%,
but one cycle may be considered acceptable loss.


Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath

2015-03-10 Thread Ingo Molnar

> * Denys Vlasenko  wrote:
> 
> > > So there are now +2 instructions (5 instead of 3) in the 
> > > system_call path, but there are -2 instructions in the SYSRETQ 
> > > path,
> > 
> > Unfortunately, no. [...]
> 
> So I assumed that it was an equivalent transformation, given that 
> none of the changelogs spelled out the increase in overhead ...

Also, for future reference, you need to spell out performance costs of 
fast path patches very clearly - and that wasn't done here.

That's a big no-no, please don't do it again ...

Thanks,

Ingo


Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath

2015-03-10 Thread Andy Lutomirski
On Tue, Mar 10, 2015 at 6:21 AM, Ingo Molnar  wrote:
>
> * Denys Vlasenko  wrote:
>
>> > So there are now +2 instructions (5 instead of 3) in the
>> > system_call path, but there are -2 instructions in the SYSRETQ
>> > path,
>>
>> Unfortunately, no. [...]
>
> So I assumed that it was an equivalent transformation, given that none
> of the changelogs spelled out the increase in overhead ...
>
>> [...] There is only this change in SYSRETQ path, which simply
>> changes where we get RSP from:
>>
>> @@ -293,7 +289,7 @@ ret_from_sys_call:
>>   CFI_REGISTERrip,rcx
>>   movqEFLAGS(%rsp),%r11
>>   /*CFI_REGISTER  rflags,r11*/
>> - movqPER_CPU_VAR(old_rsp), %rsp
>> + movqRSP(%rsp),%rsp
>>   /*
>>* 64bit SYSRET restores rip from rcx,
>>* rflags from r11 (but RF and VM bits are forced to 0),
>>
>> Most likely, no change in execution speed here.
>> At best, it is one cycle faster somewhere in address generation unit
>> because for PER_CPU_VAR() address evaluation, GS base is nonzero.
>>
>> Since this patch does add two extra MOVs,
>> I did benchmark these patches. They add exactly one cycle
>> to system call code path on my Sandy Bridge CPU.
>
> Hm, but that's the wrong direction, we should try to make it faster,
> and to clean it up - but making it slower without really good reasons
> isn't good.
>
> Is 'usersp' really that much of a complication?

usersp is IMO tolerable.  The nasty thing is the FIXUP_TOP_OF_STACK /
RESTORE_TOP_OF_STACK garbage, and this patch is the main step toward
killing that off completely.  I've still never convinced myself that
there aren't ptrace-related info leaks in there.

Denys, did you ever benchmark what happens if we use push instead of
mov?  I bet that we get that cycle back and more, not to mention much
less icache usage.

The reason I think that is that this patch changes us from writing
nothing to the RIP slot to writing something there.  If we push our
stack frame instead of moving it into place, we must write to every
slot, and writing the right value is probably no more expensive than
writing, say, zero (the load / forwarding units are unlikely to be
very busy at that point).  So this change could actually be a step
toward getting faster.

--Andy

>
> Thanks,
>
> Ingo



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath

2015-03-10 Thread Ingo Molnar

* Andy Lutomirski  wrote:

> > Since this patch does add two extra MOVs,
> > I did benchmark these patches. They add exactly one cycle
> > to system call code path on my Sandy Bridge CPU.
> 
> Personally, I'm willing to pay that cycle.  It could be a bigger 
> savings on context switch, and the simplification it enables is 
> pretty good.

But, but ... context switches are a relative slow path, compared to 
system calls. And I say this with the scheduler maintainer hat on as 
well.

So this is not a good bargain IMHO, assuming it's not some _huge_ 
difference in maintainability - but having an extra percpu field
isn't really much of a problem.

I don't claim that we couldn't in some other situation decide that a 
certain type of speedup isn't worth it - but what's the big problem 
here? A bit of arithmetic shouldn't be a problem?

Thanks,

Ingo


Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath

2015-03-10 Thread Ingo Molnar

* Denys Vlasenko  wrote:

> > So there are now +2 instructions (5 instead of 3) in the 
> > system_call path, but there are -2 instructions in the SYSRETQ 
> > path,
> 
> Unfortunately, no. [...]

So I assumed that it was an equivalent transformation, given that none 
of the changelogs spelled out the increase in overhead ...

> [...] There is only this change in SYSRETQ path, which simply 
> changes where we get RSP from:
> 
> @@ -293,7 +289,7 @@ ret_from_sys_call:
>   CFI_REGISTERrip,rcx
>   movqEFLAGS(%rsp),%r11
>   /*CFI_REGISTER  rflags,r11*/
> - movqPER_CPU_VAR(old_rsp), %rsp
> + movqRSP(%rsp),%rsp
>   /*
>* 64bit SYSRET restores rip from rcx,
>* rflags from r11 (but RF and VM bits are forced to 0),
> 
> Most likely, no change in execution speed here.
> At best, it is one cycle faster somewhere in address generation unit
> because for PER_CPU_VAR() address evaluation, GS base is nonzero.
> 
> Since this patch does add two extra MOVs,
> I did benchmark these patches. They add exactly one cycle
> to system call code path on my Sandy Bridge CPU.

Hm, but that's the wrong direction, we should try to make it faster, 
and to clean it up - but making it slower without really good reasons 
isn't good.

Is 'usersp' really that much of a complication?

Thanks,

Ingo


Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath

2015-03-10 Thread Andy Lutomirski
On Tue, Mar 10, 2015 at 6:18 AM, Denys Vlasenko  wrote:
> On 03/10/2015 01:51 PM, Ingo Molnar wrote:
>>
>> * Denys Vlasenko  wrote:
>>
>>> PER_CPU(old_rsp) usage is simplified - now it is used only
>>> as temp storage, and userspace stack pointer is immediately stored
>>> in pt_regs->sp on syscall entry, instead of being used later,
>>> on syscall exit.
>>>
>>> Instead of PER_CPU(old_rsp) and task->thread.usersp, C code
>>> uses pt_regs->sp now.
>>>
>>> FIXUP/RESTORE_TOP_OF_STACK are simplified.
>>
>> Just trying to judge the performance impact:
>>
>>> --- a/arch/x86/kernel/entry_64.S
>>> +++ b/arch/x86/kernel/entry_64.S
>>> @@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64)
>>>   * manipulation.
>>>   */
>>>  .macro FIXUP_TOP_OF_STACK tmp offset=0
>>> -movq PER_CPU_VAR(old_rsp),\tmp
>>> -movq \tmp,RSP+\offset(%rsp)
>>>  movq $__USER_DS,SS+\offset(%rsp)
>>>  movq $__USER_CS,CS+\offset(%rsp)
>>>  movq RIP+\offset(%rsp),\tmp  /* get rip */
>>> @@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64)
>>>  .endm
>>>
>>>  .macro RESTORE_TOP_OF_STACK tmp offset=0
>>> -movq RSP+\offset(%rsp),\tmp
>>> -movq \tmp,PER_CPU_VAR(old_rsp)
>>> +/* nothing to do */
>>>  .endm
>>>
>>>  /*
>>> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
>>>   */
>>>  ENABLE_INTERRUPTS(CLBR_NONE)
>>>  ALLOC_PT_GPREGS_ON_STACK 8  /* +8: space for orig_ax */
>>> +movq%rcx,RIP(%rsp)
>>> +movqPER_CPU_VAR(old_rsp),%rcx
>>> +movq%r11,EFLAGS(%rsp)
>>> +movq%rcx,RSP(%rsp)
>>> +movq_cfi rax,ORIG_RAX
>>>  SAVE_C_REGS_EXCEPT_RAX_RCX_R11
>>>  movq$-ENOSYS,RAX(%rsp)
>>> -movq_cfi rax,ORIG_RAX
>>> -movq%r11,EFLAGS(%rsp)
>>> -movq%rcx,RIP(%rsp)
>>>  CFI_REL_OFFSET rip,RIP
>>>  testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
>>>  jnz tracesys
>>
>> So there are now +2 instructions (5 instead of 3) in the system_call
>> path, but there are -2 instructions in the SYSRETQ path,
>
> Unfortunately, no. There is only this change in SYSRETQ path,
> which simply changes where we get RSP from:
>
> @@ -293,7 +289,7 @@ ret_from_sys_call:
> CFI_REGISTERrip,rcx
> movqEFLAGS(%rsp),%r11
> /*CFI_REGISTER  rflags,r11*/
> -   movqPER_CPU_VAR(old_rsp), %rsp
> +   movqRSP(%rsp),%rsp
> /*
>  * 64bit SYSRET restores rip from rcx,
>  * rflags from r11 (but RF and VM bits are forced to 0),
>
> Most likely, no change in execution speed here.
> At best, it is one cycle faster somewhere in address generation unit
> because for PER_CPU_VAR() address evaluation, GS base is nonzero.
>
>
> Since this patch does add two extra MOVs,
> I did benchmark these patches. They add exactly one cycle
> to system call code path on my Sandy Bridge CPU.
>

Personally, I'm willing to pay that cycle.  It could be a bigger
savings on context switch, and the simplification it enables is pretty
good.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath

2015-03-10 Thread Denys Vlasenko
On 03/10/2015 01:51 PM, Ingo Molnar wrote:
> 
> * Denys Vlasenko  wrote:
> 
>> PER_CPU(old_rsp) usage is simplified - now it is used only
>> as temp storage, and userspace stack pointer is immediately stored
>> in pt_regs->sp on syscall entry, instead of being used later,
>> on syscall exit.
>>
>> Instead of PER_CPU(old_rsp) and task->thread.usersp, C code
>> uses pt_regs->sp now.
>>
>> FIXUP/RESTORE_TOP_OF_STACK are simplified.
> 
> Just trying to judge the performance impact:
> 
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64)
>>   * manipulation.
>>   */
>>  .macro FIXUP_TOP_OF_STACK tmp offset=0
>> -movq PER_CPU_VAR(old_rsp),\tmp
>> -movq \tmp,RSP+\offset(%rsp)
>>  movq $__USER_DS,SS+\offset(%rsp)
>>  movq $__USER_CS,CS+\offset(%rsp)
>>  movq RIP+\offset(%rsp),\tmp  /* get rip */
>> @@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64)
>>  .endm
>>  
>>  .macro RESTORE_TOP_OF_STACK tmp offset=0
>> -movq RSP+\offset(%rsp),\tmp
>> -movq \tmp,PER_CPU_VAR(old_rsp)
>> +/* nothing to do */
>>  .endm
>>  
>>  /*
>> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
>>   */
>>  ENABLE_INTERRUPTS(CLBR_NONE)
>>  ALLOC_PT_GPREGS_ON_STACK 8  /* +8: space for orig_ax */
>> +movq%rcx,RIP(%rsp)
>> +movqPER_CPU_VAR(old_rsp),%rcx
>> +movq%r11,EFLAGS(%rsp)
>> +movq%rcx,RSP(%rsp)
>> +movq_cfi rax,ORIG_RAX
>>  SAVE_C_REGS_EXCEPT_RAX_RCX_R11
>>  movq$-ENOSYS,RAX(%rsp)
>> -movq_cfi rax,ORIG_RAX
>> -movq%r11,EFLAGS(%rsp)
>> -movq%rcx,RIP(%rsp)
>>  CFI_REL_OFFSET rip,RIP
>>  testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
>>  jnz tracesys
> 
> So there are now +2 instructions (5 instead of 3) in the system_call 
> path, but there are -2 instructions in the SYSRETQ path,

Unfortunately, no. There is only this change in SYSRETQ path,
which simply changes where we get RSP from:

@@ -293,7 +289,7 @@ ret_from_sys_call:
CFI_REGISTERrip,rcx
movqEFLAGS(%rsp),%r11
/*CFI_REGISTER  rflags,r11*/
-   movqPER_CPU_VAR(old_rsp), %rsp
+   movqRSP(%rsp),%rsp
/*
 * 64bit SYSRET restores rip from rcx,
 * rflags from r11 (but RF and VM bits are forced to 0),

Most likely, no change in execution speed here.
At best, it is one cycle faster somewhere in address generation unit
because for PER_CPU_VAR() address evaluation, GS base is nonzero.


Since this patch does add two extra MOVs,
I did benchmark these patches. They add exactly one cycle
to system call code path on my Sandy Bridge CPU.



Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath

2015-03-10 Thread Andy Lutomirski
On Tue, Mar 10, 2015 at 5:51 AM, Ingo Molnar  wrote:
>
> * Denys Vlasenko  wrote:
>
>> PER_CPU(old_rsp) usage is simplified - now it is used only
>> as temp storage, and userspace stack pointer is immediately stored
>> in pt_regs->sp on syscall entry, instead of being used later,
>> on syscall exit.
>>
>> Instead of PER_CPU(old_rsp) and task->thread.usersp, C code
>> uses pt_regs->sp now.
>>
>> FIXUP/RESTORE_TOP_OF_STACK are simplified.
>
> Just trying to judge the performance impact:
>
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64)
>>   * manipulation.
>>   */
>>   .macro FIXUP_TOP_OF_STACK tmp offset=0
>> - movq PER_CPU_VAR(old_rsp),\tmp
>> - movq \tmp,RSP+\offset(%rsp)
>>   movq $__USER_DS,SS+\offset(%rsp)
>>   movq $__USER_CS,CS+\offset(%rsp)
>>   movq RIP+\offset(%rsp),\tmp  /* get rip */
>> @@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64)
>>   .endm
>>
>>   .macro RESTORE_TOP_OF_STACK tmp offset=0
>> - movq RSP+\offset(%rsp),\tmp
>> - movq \tmp,PER_CPU_VAR(old_rsp)
>> + /* nothing to do */
>>   .endm
>>
>>  /*
>> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
>>*/
>>   ENABLE_INTERRUPTS(CLBR_NONE)
>>   ALLOC_PT_GPREGS_ON_STACK 8  /* +8: space for orig_ax */
>> + movq%rcx,RIP(%rsp)
>> + movqPER_CPU_VAR(old_rsp),%rcx
>> + movq%r11,EFLAGS(%rsp)
>> + movq%rcx,RSP(%rsp)
>> + movq_cfi rax,ORIG_RAX
>>   SAVE_C_REGS_EXCEPT_RAX_RCX_R11
>>   movq$-ENOSYS,RAX(%rsp)
>> - movq_cfi rax,ORIG_RAX
>> - movq%r11,EFLAGS(%rsp)
>> - movq%rcx,RIP(%rsp)
>>   CFI_REL_OFFSET rip,RIP
>>   testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
>>   jnz tracesys
>
> So there are now +2 instructions (5 instead of 3) in the system_call
> path, but there are -2 instructions in the SYSRETQ path, so combined
> it's a wash performance-wise, right?

I think that the SYSRETQ path is the same number of instructions --
RESTORE_TOP_OF_STACK is only used in somewhat unusual circumstances.

On SYSRETQ, I think we touch one fewer cache line, although that cache
line will be in L1 unless the syscall pushed it out.  The added
instructions on entry are probably nearly free, though, as one of them
is a load from a forwarded address and the other is a store to a hot
cacheline.

--Andy


Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath

2015-03-10 Thread Ingo Molnar

* Denys Vlasenko  wrote:

> PER_CPU(old_rsp) usage is simplified - now it is used only
> as temp storage, and userspace stack pointer is immediately stored
> in pt_regs->sp on syscall entry, instead of being used later,
> on syscall exit.
> 
> Instead of PER_CPU(old_rsp) and task->thread.usersp, C code
> uses pt_regs->sp now.
> 
> FIXUP/RESTORE_TOP_OF_STACK are simplified.

Just trying to judge the performance impact:

> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64)
>   * manipulation.
>   */
>   .macro FIXUP_TOP_OF_STACK tmp offset=0
> - movq PER_CPU_VAR(old_rsp),\tmp
> - movq \tmp,RSP+\offset(%rsp)
>   movq $__USER_DS,SS+\offset(%rsp)
>   movq $__USER_CS,CS+\offset(%rsp)
>   movq RIP+\offset(%rsp),\tmp  /* get rip */
> @@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64)
>   .endm
>  
>   .macro RESTORE_TOP_OF_STACK tmp offset=0
> - movq RSP+\offset(%rsp),\tmp
> - movq \tmp,PER_CPU_VAR(old_rsp)
> + /* nothing to do */
>   .endm
>  
>  /*
> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
>*/
>   ENABLE_INTERRUPTS(CLBR_NONE)
>   ALLOC_PT_GPREGS_ON_STACK 8  /* +8: space for orig_ax */
> + movq%rcx,RIP(%rsp)
> + movqPER_CPU_VAR(old_rsp),%rcx
> + movq%r11,EFLAGS(%rsp)
> + movq%rcx,RSP(%rsp)
> + movq_cfi rax,ORIG_RAX
>   SAVE_C_REGS_EXCEPT_RAX_RCX_R11
>   movq$-ENOSYS,RAX(%rsp)
> - movq_cfi rax,ORIG_RAX
> - movq%r11,EFLAGS(%rsp)
> - movq%rcx,RIP(%rsp)
>   CFI_REL_OFFSET rip,RIP
>   testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
>   jnz tracesys

So there are now +2 instructions (5 instead of 3) in the system_call 
path, but there are -2 instructions in the SYSRETQ path, so combined 
it's a wash performance-wise, right?

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/4] x86: save user rsp in pt_regs-sp on SYSCALL64 fastpath

2015-03-10 Thread Andy Lutomirski
On Tue, Mar 10, 2015 at 5:51 AM, Ingo Molnar mi...@kernel.org wrote:

 * Denys Vlasenko dvlas...@redhat.com wrote:

 PER_CPU(old_rsp) usage is simplified - now it is used only
 as temp storage, and userspace stack pointer is immediately stored
 in pt_regs-sp on syscall entry, instead of being used later,
 on syscall exit.

 Instead of PER_CPU(old_rsp) and task-thread.usersp, C code
 uses pt_regs-sp now.

 FIXUP/RESTORE_TOP_OF_STACK are simplified.

 Just trying to judge the performance impact:

 --- a/arch/x86/kernel/entry_64.S
 +++ b/arch/x86/kernel/entry_64.S
 @@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64)
   * manipulation.
   */
   .macro FIXUP_TOP_OF_STACK tmp offset=0
 - movq PER_CPU_VAR(old_rsp),\tmp
 - movq \tmp,RSP+\offset(%rsp)
   movq $__USER_DS,SS+\offset(%rsp)
   movq $__USER_CS,CS+\offset(%rsp)
   movq RIP+\offset(%rsp),\tmp  /* get rip */
 @@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64)
   .endm

   .macro RESTORE_TOP_OF_STACK tmp offset=0
 - movq RSP+\offset(%rsp),\tmp
 - movq \tmp,PER_CPU_VAR(old_rsp)
 + /* nothing to do */
   .endm

  /*
 @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
*/
   ENABLE_INTERRUPTS(CLBR_NONE)
   ALLOC_PT_GPREGS_ON_STACK 8  /* +8: space for orig_ax */
 + movq%rcx,RIP(%rsp)
 + movqPER_CPU_VAR(old_rsp),%rcx
 + movq%r11,EFLAGS(%rsp)
 + movq%rcx,RSP(%rsp)
 + movq_cfi rax,ORIG_RAX
   SAVE_C_REGS_EXCEPT_RAX_RCX_R11
   movq$-ENOSYS,RAX(%rsp)
 - movq_cfi rax,ORIG_RAX
 - movq%r11,EFLAGS(%rsp)
 - movq%rcx,RIP(%rsp)
   CFI_REL_OFFSET rip,RIP
   testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
   jnz tracesys

 So there are now +2 instructions (5 instead of 3) in the system_call
 path, but there are -2 instructions in the SYSRETQ path, so combined
 it's a wash performance-wise, right?

I think that the SYSRETQ path is the same number of instructions --
RESTORE_TOP_OF_STACK is only used in somewhat unusual circumstances.

On SYSRETQ, I think we touch one fewer cache line, although that cache
line will be in L1 unless the syscall pushed it out.  The added
instructions on entry are probably nearly free, though, as one of them
is a load from a forwarded address and the other is a store to a hot
cacheline.

--And
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/4] x86: save user rsp in pt_regs-sp on SYSCALL64 fastpath

2015-03-10 Thread Andy Lutomirski
On Tue, Mar 10, 2015 at 7:00 AM, Denys Vlasenko
vda.li...@googlemail.com wrote:
 On Tue, Mar 10, 2015 at 2:26 PM, Andy Lutomirski l...@amacapital.net wrote:
 usersp is IMO tolerable.  The nasty thing is the FIXUP_TOP_OF_STACK /
 RESTORE_TOP_OF_STACK garbage, and this patch is the main step toward
 killing that off completely.  I've still never convinced myself that
 there aren't ptrace-related info leaks in there.

 Denys, did you ever benchmark what happens if we use push instead of
 mov?  I bet that we get that cycle back and more, not to mention much
 less icache usage.

 Yes, I did.
 Push conversion seems to perform the same as current, MOV-based code.

 The expected win there that we lose two huge 12-byte insns
 which store __USER_CS and __USER_DS in iret frame.

 MOVQ imm,ofs(%rsp) has a very unfortunate encoding in x86:
 - needs REX prefix
 - no sing-extending imm8 form exists for it
 - ofs in our case can't fit into 8 bits
 - (%esp) requires SIB byte

 In my tests, each such instruction adds one cycle.

 Compare this to PUSH imm8, which is 2 bytes only.

Does that mean that using push on top of this patch gets us our cycle back?

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/4] x86: save user rsp in pt_regs-sp on SYSCALL64 fastpath

2015-03-10 Thread Denys Vlasenko
On Tue, Mar 10, 2015 at 2:21 PM, Ingo Molnar mi...@kernel.org wrote:
 Since this patch does add two extra MOVs,
 I did benchmark these patches. They add exactly one cycle
 to system call code path on my Sandy Bridge CPU.

 Hm, but that's the wrong direction, we should try to make it faster,
 and to clean it up

As you know, goals of faster and cleaner are often mutually exclusive.
C'est la vie :(

entry.S seem to lean towards faster to the extent where it became
a tangled maze of obscure optimizations.

Such as the mysterious, and not at all obvious existence of 8 byte
safety padding at the top of the 32-bit kernel stack. Before Andy
stumbled into it, it was not at all obvious that it is even there.
I had to write a test patch to verify it.
There is a long-standing latent bug in the code where this padding
is not accounted for.

 - but making it slower without really good reasons isn't good.

The thinking here is that cleaning up entry.S is a good reason.

We won't do anything which would slow it down by, say, 5%,
but one cycle may be considered acceptable loss.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH 3/4] x86: save user rsp in pt_regs-sp on SYSCALL64 fastpath

2015-03-10 Thread Denys Vlasenko
On Tue, Mar 10, 2015 at 2:26 PM, Andy Lutomirski l...@amacapital.net wrote:
 usersp is IMO tolerable.  The nasty thing is the FIXUP_TOP_OF_STACK /
 RESTORE_TOP_OF_STACK garbage, and this patch is the main step toward
 killing that off completely.  I've still never convinced myself that
 there aren't ptrace-related info leaks in there.

 Denys, did you ever benchmark what happens if we use push instead of
 mov?  I bet that we get that cycle back and more, not to mention much
 less icache usage.

Yes, I did.
Push conversion seems to perform the same as current, MOV-based code.

The expected win there that we lose two huge 12-byte insns
which store __USER_CS and __USER_DS in iret frame.

MOVQ imm,ofs(%rsp) has a very unfortunate encoding in x86:
- needs REX prefix
- no sing-extending imm8 form exists for it
- ofs in our case can't fit into 8 bits
- (%esp) requires SIB byte

In my tests, each such instruction adds one cycle.

Compare this to PUSH imm8, which is 2 bytes only.


Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath

2015-03-10 Thread Denys Vlasenko
On 03/10/2015 01:51 PM, Ingo Molnar wrote:
 
 * Denys Vlasenko dvlas...@redhat.com wrote:
 
 PER_CPU(old_rsp) usage is simplified - now it is used only
 as temp storage, and userspace stack pointer is immediately stored
in pt_regs->sp on syscall entry, instead of being used later,
 on syscall exit.

Instead of PER_CPU(old_rsp) and task->thread.usersp, C code
uses pt_regs->sp now.

 FIXUP/RESTORE_TOP_OF_STACK are simplified.
 
 Just trying to judge the performance impact:
 
 --- a/arch/x86/kernel/entry_64.S
 +++ b/arch/x86/kernel/entry_64.S
 @@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64)
   * manipulation.
   */
  .macro FIXUP_TOP_OF_STACK tmp offset=0
 -movq PER_CPU_VAR(old_rsp),\tmp
 -movq \tmp,RSP+\offset(%rsp)
  movq $__USER_DS,SS+\offset(%rsp)
  movq $__USER_CS,CS+\offset(%rsp)
  movq RIP+\offset(%rsp),\tmp  /* get rip */
 @@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64)
  .endm
  
  .macro RESTORE_TOP_OF_STACK tmp offset=0
 -movq RSP+\offset(%rsp),\tmp
 -movq \tmp,PER_CPU_VAR(old_rsp)
 +/* nothing to do */
  .endm
  
  /*
 @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
   */
  ENABLE_INTERRUPTS(CLBR_NONE)
  ALLOC_PT_GPREGS_ON_STACK 8  /* +8: space for orig_ax */
 +movq%rcx,RIP(%rsp)
 +movqPER_CPU_VAR(old_rsp),%rcx
 +movq%r11,EFLAGS(%rsp)
 +movq%rcx,RSP(%rsp)
 +movq_cfi rax,ORIG_RAX
  SAVE_C_REGS_EXCEPT_RAX_RCX_R11
  movq$-ENOSYS,RAX(%rsp)
 -movq_cfi rax,ORIG_RAX
 -movq%r11,EFLAGS(%rsp)
 -movq%rcx,RIP(%rsp)
  CFI_REL_OFFSET rip,RIP
  testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
  jnz tracesys
 
 So there are now +2 instructions (5 instead of 3) in the system_call 
 path, but there are -2 instructions in the SYSRETQ path,

Unfortunately, no. There is only this change in SYSRETQ path,
which simply changes where we get RSP from:

@@ -293,7 +289,7 @@ ret_from_sys_call:
CFI_REGISTERrip,rcx
movqEFLAGS(%rsp),%r11
/*CFI_REGISTER  rflags,r11*/
-   movqPER_CPU_VAR(old_rsp), %rsp
+   movqRSP(%rsp),%rsp
/*
 * 64bit SYSRET restores rip from rcx,
 * rflags from r11 (but RF and VM bits are forced to 0),

Most likely, no change in execution speed here.
At best, it is one cycle faster somewhere in address generation unit
because for PER_CPU_VAR() address evaluation, GS base is nonzero.


Since this patch does add two extra MOVs,
I did benchmark these patches. They add exactly one cycle
to system call code path on my Sandy Bridge CPU.



Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath

2015-03-10 Thread Andy Lutomirski
On Tue, Mar 10, 2015 at 6:21 AM, Ingo Molnar mi...@kernel.org wrote:

 * Denys Vlasenko dvlas...@redhat.com wrote:

  So there are now +2 instructions (5 instead of 3) in the
  system_call path, but there are -2 instructions in the SYSRETQ
  path,

 Unfortunately, no. [...]

 So I assumed that it was an equivalent transformation, given that none
 of the changelogs spelled out the increase in overhead ...

 [...] There is only this change in SYSRETQ path, which simply
 changes where we get RSP from:

 @@ -293,7 +289,7 @@ ret_from_sys_call:
   CFI_REGISTERrip,rcx
   movqEFLAGS(%rsp),%r11
   /*CFI_REGISTER  rflags,r11*/
 - movqPER_CPU_VAR(old_rsp), %rsp
 + movqRSP(%rsp),%rsp
   /*
* 64bit SYSRET restores rip from rcx,
* rflags from r11 (but RF and VM bits are forced to 0),

 Most likely, no change in execution speed here.
 At best, it is one cycle faster somewhere in address generation unit
 because for PER_CPU_VAR() address evaluation, GS base is nonzero.

 Since this patch does add two extra MOVs,
 I did benchmark these patches. They add exactly one cycle
 to system call code path on my Sandy Bridge CPU.

 Hm, but that's the wrong direction, we should try to make it faster,
 and to clean it up - but making it slower without really good reasons
 isn't good.

 Is 'usersp' really that much of a complication?

usersp is IMO tolerable.  The nasty thing is the FIXUP_TOP_OF_STACK /
RESTORE_TOP_OF_STACK garbage, and this patch is the main step toward
killing that off completely.  I've still never convinced myself that
there aren't ptrace-related info leaks in there.

Denys, did you ever benchmark what happens if we use push instead of
mov?  I bet that we get that cycle back and more, not to mention much
less icache usage.

The reason I think that is that this patch changes us from writing
nothing to the RIP slot to writing something there.  If we push our
stack frame instead of moving it into place, we must write to every
slot, and writing the right value is probably no more expensive than
writing, say, zero (the load / forwarding units are unlikely to be
very busy at that point).  So this change could actually be a step
toward getting faster.
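
As a rough illustration of that idea, a push-built frame might look
something like this (the register order and the use of old_rsp here are my
sketch of the approach, not the actual later patch):

```asm
	/* Hypothetical push-based construction of the iret frame + orig_ax.
	 * PUSH imm8 / PUSH r64 / PUSH m64 all have short encodings, and
	 * every slot gets written, so nothing stale can leak to userspace. */
	pushq	$__USER_DS			/* pt_regs->ss */
	pushq	PER_CPU_VAR(old_rsp)		/* pt_regs->sp: user rsp saved at entry */
	pushq	%r11				/* pt_regs->flags: SYSCALL saved rflags here */
	pushq	$__USER_CS			/* pt_regs->cs */
	pushq	%rcx				/* pt_regs->ip: SYSCALL saved return rip here */
	pushq	%rax				/* pt_regs->orig_ax: syscall number */
```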

--Andy


 Thanks,

 Ingo



-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath

2015-03-10 Thread Andy Lutomirski
On Tue, Mar 10, 2015 at 6:18 AM, Denys Vlasenko dvlas...@redhat.com wrote:
 On 03/10/2015 01:51 PM, Ingo Molnar wrote:

 * Denys Vlasenko dvlas...@redhat.com wrote:

 PER_CPU(old_rsp) usage is simplified - now it is used only
 as temp storage, and userspace stack pointer is immediately stored
 in pt_regs->sp on syscall entry, instead of being used later,
 on syscall exit.

 Instead of PER_CPU(old_rsp) and task->thread.usersp, C code
 uses pt_regs->sp now.

 FIXUP/RESTORE_TOP_OF_STACK are simplified.

 Just trying to judge the performance impact:

 --- a/arch/x86/kernel/entry_64.S
 +++ b/arch/x86/kernel/entry_64.S
 @@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64)
   * manipulation.
   */
  .macro FIXUP_TOP_OF_STACK tmp offset=0
 -movq PER_CPU_VAR(old_rsp),\tmp
 -movq \tmp,RSP+\offset(%rsp)
  movq $__USER_DS,SS+\offset(%rsp)
  movq $__USER_CS,CS+\offset(%rsp)
  movq RIP+\offset(%rsp),\tmp  /* get rip */
 @@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64)
  .endm

  .macro RESTORE_TOP_OF_STACK tmp offset=0
 -movq RSP+\offset(%rsp),\tmp
 -movq \tmp,PER_CPU_VAR(old_rsp)
 +/* nothing to do */
  .endm

  /*
 @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
   */
  ENABLE_INTERRUPTS(CLBR_NONE)
  ALLOC_PT_GPREGS_ON_STACK 8  /* +8: space for orig_ax */
 +movq%rcx,RIP(%rsp)
 +movqPER_CPU_VAR(old_rsp),%rcx
 +movq%r11,EFLAGS(%rsp)
 +movq%rcx,RSP(%rsp)
 +movq_cfi rax,ORIG_RAX
  SAVE_C_REGS_EXCEPT_RAX_RCX_R11
  movq$-ENOSYS,RAX(%rsp)
 -movq_cfi rax,ORIG_RAX
 -movq%r11,EFLAGS(%rsp)
 -movq%rcx,RIP(%rsp)
  CFI_REL_OFFSET rip,RIP
  testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
  jnz tracesys

 So there are now +2 instructions (5 instead of 3) in the system_call
 path, but there are -2 instructions in the SYSRETQ path,

 Unfortunately, no. There is only this change in SYSRETQ path,
 which simply changes where we get RSP from:

 @@ -293,7 +289,7 @@ ret_from_sys_call:
 CFI_REGISTERrip,rcx
 movqEFLAGS(%rsp),%r11
 /*CFI_REGISTER  rflags,r11*/
 -   movqPER_CPU_VAR(old_rsp), %rsp
 +   movqRSP(%rsp),%rsp
 /*
  * 64bit SYSRET restores rip from rcx,
  * rflags from r11 (but RF and VM bits are forced to 0),

 Most likely, no change in execution speed here.
 At best, it is one cycle faster somewhere in address generation unit
 because for PER_CPU_VAR() address evaluation, GS base is nonzero.


 Since this patch does add two extra MOVs,
 I did benchmark these patches. They add exactly one cycle
 to system call code path on my Sandy Bridge CPU.


Personally, I'm willing to pay that cycle.  It could be a bigger
savings on context switch, and the simplification it enables is pretty
good.

--Andy

-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath

2015-03-10 Thread Ingo Molnar

* Denys Vlasenko dvlas...@redhat.com wrote:

  So there are now +2 instructions (5 instead of 3) in the 
  system_call path, but there are -2 instructions in the SYSRETQ 
  path,
 
 Unfortunately, no. [...]

So I assumed that it was an equivalent transformation, given that none 
of the changelogs spelled out the increase in overhead ...

 [...] There is only this change in SYSRETQ path, which simply 
 changes where we get RSP from:
 
 @@ -293,7 +289,7 @@ ret_from_sys_call:
   CFI_REGISTERrip,rcx
   movqEFLAGS(%rsp),%r11
   /*CFI_REGISTER  rflags,r11*/
 - movqPER_CPU_VAR(old_rsp), %rsp
 + movqRSP(%rsp),%rsp
   /*
* 64bit SYSRET restores rip from rcx,
* rflags from r11 (but RF and VM bits are forced to 0),
 
 Most likely, no change in execution speed here.
 At best, it is one cycle faster somewhere in address generation unit
 because for PER_CPU_VAR() address evaluation, GS base is nonzero.
 
 Since this patch does add two extra MOVs,
 I did benchmark these patches. They add exactly one cycle
 to system call code path on my Sandy Bridge CPU.

Hm, but that's the wrong direction, we should try to make it faster, 
and to clean it up - but making it slower without really good reasons 
isn't good.

Is 'usersp' really that much of a complication?

Thanks,

Ingo


Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath

2015-03-10 Thread Ingo Molnar

* Denys Vlasenko dvlas...@redhat.com wrote:

 PER_CPU(old_rsp) usage is simplified - now it is used only
 as temp storage, and userspace stack pointer is immediately stored
 in pt_regs->sp on syscall entry, instead of being used later,
 on syscall exit.
 
 Instead of PER_CPU(old_rsp) and task->thread.usersp, C code
 uses pt_regs->sp now.
 
 FIXUP/RESTORE_TOP_OF_STACK are simplified.

Just trying to judge the performance impact:

 --- a/arch/x86/kernel/entry_64.S
 +++ b/arch/x86/kernel/entry_64.S
 @@ -128,8 +128,6 @@ ENDPROC(native_usergs_sysret64)
   * manipulation.
   */
   .macro FIXUP_TOP_OF_STACK tmp offset=0
 - movq PER_CPU_VAR(old_rsp),\tmp
 - movq \tmp,RSP+\offset(%rsp)
   movq $__USER_DS,SS+\offset(%rsp)
   movq $__USER_CS,CS+\offset(%rsp)
   movq RIP+\offset(%rsp),\tmp  /* get rip */
 @@ -139,8 +137,7 @@ ENDPROC(native_usergs_sysret64)
   .endm
  
   .macro RESTORE_TOP_OF_STACK tmp offset=0
 - movq RSP+\offset(%rsp),\tmp
 - movq \tmp,PER_CPU_VAR(old_rsp)
 + /* nothing to do */
   .endm
  
  /*
 @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
*/
   ENABLE_INTERRUPTS(CLBR_NONE)
   ALLOC_PT_GPREGS_ON_STACK 8  /* +8: space for orig_ax */
 + movq%rcx,RIP(%rsp)
 + movqPER_CPU_VAR(old_rsp),%rcx
 + movq%r11,EFLAGS(%rsp)
 + movq%rcx,RSP(%rsp)
 + movq_cfi rax,ORIG_RAX
   SAVE_C_REGS_EXCEPT_RAX_RCX_R11
   movq$-ENOSYS,RAX(%rsp)
 - movq_cfi rax,ORIG_RAX
 - movq%r11,EFLAGS(%rsp)
 - movq%rcx,RIP(%rsp)
   CFI_REL_OFFSET rip,RIP
   testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags+THREAD_INFO(%rsp,RIP)
   jnz tracesys

So there are now +2 instructions (5 instead of 3) in the system_call 
path, but there are -2 instructions in the SYSRETQ path, so combined 
it's a wash performance-wise, right?

Thanks,

Ingo


Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath

2015-03-10 Thread Ingo Molnar

* Andy Lutomirski l...@amacapital.net wrote:

  Since this patch does add two extra MOVs,
  I did benchmark these patches. They add exactly one cycle
  to system call code path on my Sandy Bridge CPU.
 
 Personally, I'm willing to pay that cycle.  It could be a bigger 
 savings on context switch, and the simplification it enables is 
 pretty good.

But, but ... context switches are a relative slow path, compared to 
system calls. And I say this with the scheduler maintainer hat on as 
well.

So this is not a good bargain IMHO, assuming it's not some _huge_ 
difference in maintainability - but having an extra percpu field
isn't really much of a problem.

I don't claim that we couldn't in some other situation decide that a 
certain type of speedup isn't worth it - but what's the big problem 
here? A bit of arithmetics shouldn't be a problem?

Thanks,

Ingo


Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath

2015-03-10 Thread Ingo Molnar

 * Denys Vlasenko dvlas...@redhat.com wrote:
 
   So there are now +2 instructions (5 instead of 3) in the 
   system_call path, but there are -2 instructions in the SYSRETQ 
   path,
  
  Unfortunately, no. [...]
 
 So I assumed that it was an equivalent transformation, given that 
 none of the changelogs spelled out the increase in overhead ...

Also, for future reference, you need to spell out performance costs of 
fast path patches very clearly - and that wasn't done here.

That's a big no-no, please don't do it again ...

Thanks,

Ingo


Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath

2015-03-10 Thread Denys Vlasenko
On Tue, Mar 10, 2015 at 3:02 PM, Andy Lutomirski l...@amacapital.net wrote:
 On Tue, Mar 10, 2015 at 7:00 AM, Denys Vlasenko
 vda.li...@googlemail.com wrote:
 On Tue, Mar 10, 2015 at 2:26 PM, Andy Lutomirski l...@amacapital.net wrote:
 usersp is IMO tolerable.  The nasty thing is the FIXUP_TOP_OF_STACK /
 RESTORE_TOP_OF_STACK garbage, and this patch is the main step toward
 killing that off completely.  I've still never convinced myself that
 there aren't ptrace-related info leaks in there.

 Denys, did you ever benchmark what happens if we use push instead of
 mov?  I bet that we get that cycle back and more, not to mention much
 less icache usage.

 Yes, I did.
 Push conversion seems to perform the same as current, MOV-based code.

 The expected win there is that we lose the two huge 12-byte insns
 which store __USER_CS and __USER_DS in the iret frame.

 MOVQ imm,ofs(%rsp) has a very unfortunate encoding in x86:
 - needs REX prefix
 - no sign-extending imm8 form exists for it
 - ofs in our case can't fit into 8 bits
 - (%rsp) requires SIB byte

 In my tests, each such instruction adds one cycle.

 Compare this to PUSH imm8, which is 2 bytes only.

 Does that mean that using push on top of this patch gets us our cycle back?

Maybe. I can't be sure about it.

In general I see a jitter of 1-2, sometimes 3 cycles even when I do changes
which merely change code size (e.g. replacing equivalent insns).
This may be caused by jump targets getting aligned differently
wrt cacheline boundaries. If the second/third/fourth insn after the current one
is not fetched because it did not fit into the cacheline,
then some of the insn decoders get nothing to chew on.
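
One common way to take that variable out of such measurements is to pad the
hot label explicitly — a hypothetical example, not something done in these
patches:

```asm
	/* Hypothetical: start the benchmarked entry label on a fresh
	 * 64-byte cacheline, so every variant under test sees the same
	 * fetch-window alignment and the decoders are fed identically. */
	.p2align 6
benchmarked_entry_label:
```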


Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath

2015-03-09 Thread Andy Lutomirski
On Mon, Mar 9, 2015 at 1:32 PM, Denys Vlasenko  wrote:
> On Mon, Mar 9, 2015 at 9:11 PM, Andy Lutomirski  wrote:
>>> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
>>>  */
>>> ENABLE_INTERRUPTS(CLBR_NONE)
>>> ALLOC_PT_GPREGS_ON_STACK 8  /* +8: space for orig_ax */
>>> +   movq%rcx,RIP(%rsp)
>>> +   movqPER_CPU_VAR(old_rsp),%rcx
>>> +   movq%r11,EFLAGS(%rsp)
>>> +   movq%rcx,RSP(%rsp)
>>> +   movq_cfi rax,ORIG_RAX
>>> SAVE_C_REGS_EXCEPT_RAX_RCX_R11
>>> movq$-ENOSYS,RAX(%rsp)
>>> -   movq_cfi rax,ORIG_RAX
>>> -   movq%r11,EFLAGS(%rsp)
>>> -   movq%rcx,RIP(%rsp)
>>
>> Why the reordering?
>
> No strong reason.
>
> iret stack is "above" the rest of pt_regs.
>
> This does not matter now, but when/if we convert to PUSHes
> for register saving, pushes which build iret frame
> will have to be before "save C-clobbered registers" part,
> exactly as in this patch.

Fair enough.

Acked-by: Andy Lutomirski 

-- 
Andy Lutomirski
AMA Capital Management, LLC


Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath

2015-03-09 Thread Denys Vlasenko
On Mon, Mar 9, 2015 at 9:11 PM, Andy Lutomirski  wrote:
>> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
>>  */
>> ENABLE_INTERRUPTS(CLBR_NONE)
>> ALLOC_PT_GPREGS_ON_STACK 8  /* +8: space for orig_ax */
>> +   movq%rcx,RIP(%rsp)
>> +   movqPER_CPU_VAR(old_rsp),%rcx
>> +   movq%r11,EFLAGS(%rsp)
>> +   movq%rcx,RSP(%rsp)
>> +   movq_cfi rax,ORIG_RAX
>> SAVE_C_REGS_EXCEPT_RAX_RCX_R11
>> movq$-ENOSYS,RAX(%rsp)
>> -   movq_cfi rax,ORIG_RAX
>> -   movq%r11,EFLAGS(%rsp)
>> -   movq%rcx,RIP(%rsp)
>
> Why the reordering?

No strong reason.

iret stack is "above" the rest of pt_regs.

This does not matter now, but when/if we convert to PUSHes
for register saving, pushes which build iret frame
will have to be before "save C-clobbered registers" part,
exactly as in this patch.


Re: [PATCH 3/4] x86: save user rsp in pt_regs->sp on SYSCALL64 fastpath

2015-03-09 Thread Andy Lutomirski
On Mon, Mar 9, 2015 at 11:39 AM, Denys Vlasenko  wrote:
> PER_CPU(old_rsp) usage is simplified - now it is used only
> as temp storage, and userspace stack pointer is immediately stored
> in pt_regs->sp on syscall entry, instead of being used later,
> on syscall exit.
>
> Instead of PER_CPU(old_rsp) and task->thread.usersp, C code
> uses pt_regs->sp now.
>
> FIXUP/RESTORE_TOP_OF_STACK are simplified.
>
> Signed-off-by: Denys Vlasenko 
> CC: Linus Torvalds 
> CC: Steven Rostedt 
> CC: Ingo Molnar 
> CC: Borislav Petkov 
> CC: "H. Peter Anvin" 
> CC: Andy Lutomirski 
> CC: Oleg Nesterov 
> CC: Frederic Weisbecker 
> CC: Alexei Starovoitov 
> CC: Will Drewry 
> CC: Kees Cook 
> CC: x...@kernel.org
> CC: linux-kernel@vger.kernel.org

Looks correct.

> @@ -253,11 +247,13 @@ GLOBAL(system_call_after_swapgs)
>  */
> ENABLE_INTERRUPTS(CLBR_NONE)
> ALLOC_PT_GPREGS_ON_STACK 8  /* +8: space for orig_ax */
> +   movq%rcx,RIP(%rsp)
> +   movqPER_CPU_VAR(old_rsp),%rcx
> +   movq%r11,EFLAGS(%rsp)
> +   movq%rcx,RSP(%rsp)
> +   movq_cfi rax,ORIG_RAX
> SAVE_C_REGS_EXCEPT_RAX_RCX_R11
> movq$-ENOSYS,RAX(%rsp)
> -   movq_cfi rax,ORIG_RAX
> -   movq%r11,EFLAGS(%rsp)
> -   movq%rcx,RIP(%rsp)

Why the reordering?

--Andy

