Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files

2017-09-18 Thread Herbert Xu
On Fri, Sep 15, 2017 at 11:06:29PM +0200, Ingo Molnar wrote:
>
> Indeed, I suspect they should go through the crypto tree, these fixes are 
> independent, they don't depend on anything in the x86 tree.

Sure I can pick them up through cryptodev.

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files

2017-09-15 Thread Ingo Molnar

* Eric Biggers  wrote:

> On Fri, Sep 15, 2017 at 07:34:31AM +0200, Ingo Molnar wrote:
> > 
> > * Eric Biggers  wrote:
> > 
> > > Hi Josh,
> > > 
> > > On Wed, Sep 13, 2017 at 05:33:03PM -0500, Josh Poimboeuf wrote:
> > > > And here's v2 of the sha512-avx2 patch.  It should hopefully gain back
> > > > most of the performance lost by v1.
> > > > 
> > > > From: Josh Poimboeuf 
> > > > Subject: [PATCH] x86/crypto: Fix RBP usage in sha512-avx2-asm.S
> > > > 
> > > > Using RBP as a temporary register breaks frame pointer convention and
> > > > breaks stack traces when unwinding from an interrupt in the crypto code.
> > > > 
> > > > Mix things up a little bit to get rid of the RBP usage, without
> > > > destroying performance.  Use RDI instead of RBP for the TBL pointer.
> > > > That will clobber CTX, so save CTX on the stack and use RDI as CTX
> > > > before it gets clobbered, and R12 as CTX after it gets clobbered.
> > > > 
> > > > Also remove the unused y4 variable.
> > > > 
> > > 
> > > I tested the v2 patches for both sha256-avx2 and sha512-avx2 on Skylake.  
> > > They
> > > both pass the crypto self-tests, and there was no noticeable performance
> > > difference compared to the unpatched versions.  Thanks!
> > 
> > Cool, thanks for review and the testing! Can we add your Tested-by + 
> > Acked-by tags 
> > to the patches?
> > 
> 
> Yes, that's fine for all the patches in the series.
> 
> Will these patches go in through the crypto tree or through the x86 tree?

Indeed, I suspect they should go through the crypto tree, these fixes are 
independent, they don't depend on anything in the x86 tree.

Thanks,

Ingo
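
The frame-pointer convention that the quoted commit message refers to looks
roughly like the minimal sketch below (illustrative only, GNU as syntax; the
function name is made up and the code is not taken from the kernel sources).
The point is simply that the frame-pointer unwinder walks the chain of saved
RBP values, so once RBP is borrowed as a scratch register inside the body, an
interrupt that lands there makes the unwinder follow a bogus "saved RBP" and
the stack trace breaks.

        # Minimal frame-pointer prologue/epilogue sketch (illustrative only).
        .text
        .globl  example_frame
        .type   example_frame, @function
example_frame:
        pushq   %rbp                    # save the caller's frame pointer
        movq    %rsp, %rbp              # RBP now anchors this function's frame
        # ... body: the unwinder assumes RBP keeps pointing at this frame,
        # so it must not be reused as a temporary register ...
        movq    %rbp, %rsp
        popq    %rbp                    # restore the caller's frame pointer
        ret
        .size   example_frame, .-example_frame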


Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files

2017-09-15 Thread Eric Biggers
On Fri, Sep 15, 2017 at 07:34:31AM +0200, Ingo Molnar wrote:
> 
> * Eric Biggers  wrote:
> 
> > Hi Josh,
> > 
> > On Wed, Sep 13, 2017 at 05:33:03PM -0500, Josh Poimboeuf wrote:
> > > And here's v2 of the sha512-avx2 patch.  It should hopefully gain back
> > > most of the performance lost by v1.
> > > 
> > > From: Josh Poimboeuf 
> > > Subject: [PATCH] x86/crypto: Fix RBP usage in sha512-avx2-asm.S
> > > 
> > > Using RBP as a temporary register breaks frame pointer convention and
> > > breaks stack traces when unwinding from an interrupt in the crypto code.
> > > 
> > > Mix things up a little bit to get rid of the RBP usage, without
> > > destroying performance.  Use RDI instead of RBP for the TBL pointer.
> > > That will clobber CTX, so save CTX on the stack and use RDI as CTX
> > > before it gets clobbered, and R12 as CTX after it gets clobbered.
> > > 
> > > Also remove the unused y4 variable.
> > > 
> > 
> > I tested the v2 patches for both sha256-avx2 and sha512-avx2 on Skylake.  
> > They
> > both pass the crypto self-tests, and there was no noticeable performance
> > difference compared to the unpatched versions.  Thanks!
> 
> Cool, thanks for review and the testing! Can we add your Tested-by + Acked-by 
> tags 
> to the patches?
> 

Yes, that's fine for all the patches in the series.

Will these patches go in through the crypto tree or through the x86 tree?

Eric


Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files

2017-09-14 Thread Ingo Molnar

* Josh Poimboeuf  wrote:

> > So if this observation of mine is true we could go back to the old code for 
> > the 
> > hotpath, but use RDI for TBL and not reload it in the hotpath.
> 
> Thanks for the excellent breakdown.
> 
> When I looked at the patch again, I came to the same conclusion as your
> #4, which is that RDI isn't being used in the inner loops.  It *is* used
> in the outermost loop, however.
> 
> So v2 of my sha512-avx2-asm.S patch spilled CTX onto the stack, instead
> of TBL:
> 
>   https://lkml.kernel.org/r/20170913223303.pskmy2v7nto6rvtg@treble

Indeed - I should have checked your v2 patch, but somehow missed it. Would have 
saved me some looking+typing ;-)

Thanks,

Ingo


Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files

2017-09-14 Thread Ingo Molnar

* Eric Biggers  wrote:

> Hi Josh,
> 
> On Wed, Sep 13, 2017 at 05:33:03PM -0500, Josh Poimboeuf wrote:
> > And here's v2 of the sha512-avx2 patch.  It should hopefully gain back
> > most of the performance lost by v1.
> > 
> > From: Josh Poimboeuf 
> > Subject: [PATCH] x86/crypto: Fix RBP usage in sha512-avx2-asm.S
> > 
> > Using RBP as a temporary register breaks frame pointer convention and
> > breaks stack traces when unwinding from an interrupt in the crypto code.
> > 
> > Mix things up a little bit to get rid of the RBP usage, without
> > destroying performance.  Use RDI instead of RBP for the TBL pointer.
> > That will clobber CTX, so save CTX on the stack and use RDI as CTX
> > before it gets clobbered, and R12 as CTX after it gets clobbered.
> > 
> > Also remove the unused y4 variable.
> > 
> 
> I tested the v2 patches for both sha256-avx2 and sha512-avx2 on Skylake.  They
> both pass the crypto self-tests, and there was no noticeable performance
> difference compared to the unpatched versions.  Thanks!

Cool, thanks for review and the testing! Can we add your Tested-by + Acked-by 
tags 
to the patches?

Thanks,

Ingo


Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files

2017-09-14 Thread Eric Biggers
Hi Josh,

On Wed, Sep 13, 2017 at 05:33:03PM -0500, Josh Poimboeuf wrote:
> And here's v2 of the sha512-avx2 patch.  It should hopefully gain back
> most of the performance lost by v1.
> 
> From: Josh Poimboeuf 
> Subject: [PATCH] x86/crypto: Fix RBP usage in sha512-avx2-asm.S
> 
> Using RBP as a temporary register breaks frame pointer convention and
> breaks stack traces when unwinding from an interrupt in the crypto code.
> 
> Mix things up a little bit to get rid of the RBP usage, without
> destroying performance.  Use RDI instead of RBP for the TBL pointer.
> That will clobber CTX, so save CTX on the stack and use RDI as CTX
> before it gets clobbered, and R12 as CTX after it gets clobbered.
> 
> Also remove the unused y4 variable.
> 

I tested the v2 patches for both sha256-avx2 and sha512-avx2 on Skylake.  They
both pass the crypto self-tests, and there was no noticeable performance
difference compared to the unpatched versions.  Thanks!

Eric


Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files

2017-09-14 Thread Josh Poimboeuf
On Thu, Sep 14, 2017 at 11:16:12AM +0200, Ingo Molnar wrote:
> 
> * Josh Poimboeuf  wrote:
> 
> > I'm still looking at the other one (sha512-avx2), but so far I haven't
> > found a way to speed it back up.
> 
> Here's a couple of very quick observations with possible optimizations:
> 
> AFAICS the main effect of the RBP fixes is the introduction of a memory load
> into the critical path, into the body of the unrolled loop:
> 
> +   mov frame_TBL(%rsp), TBL
> vpaddq  (TBL), Y_0, XFER
> vmovdqa XFER, frame_XFER(%rsp)
> FOUR_ROUNDS_AND_SCHED
> 
> Both 'TBL' and 'T1' are mapped to R12, which is why TBL has to be spilled to
> be reloaded from the stack.
> 
> 1)
> 
> Note how R12 is used immediately, right in the next instruction:
> 
> vpaddq  (TBL), Y_0, XFER
> 
> I.e. the RBP fixes lengthen the program order data dependencies - that's a 
> new 
> constraint and a few extra cycles per loop iteration if the workload is 
> address-generator bandwidth limited on that.
> 
> A simple way to ease that constraint would be to move the 'TBL' load up into
> the loop body, to the point where 'T1' is used for the last time - which is:
> 
> 
> mov a, T1   # T1 = a# MAJB
> and c, T1   # T1 = a&c  # MAJB
> 
> add y0, y2  # y2 = S1 + CH  # --
> or  T1, y3  # y3 = MAJ = (a|c)&b)|(a&c) # MAJ
> 
> +   mov frame_TBL(%rsp), TBL
> 
> add y1, h   # h = k + w + h + S0# --
> 
> add y2, d   # d = k + w + h + d + S1 + CH = d + t1  # --
> 
> add y2, h   # h = k + w + h + S0 + S1 + CH = t1 + S0# --
> add y3, h   # h = t1 + S0 + MAJ # --
> 
> Note how this moves up the 'TBL' reload by 4 instructions.
> 
> 2)
> 
> If this does not get back performance, then maybe another reason is that it's 
> cache access latency limited, in which case a more involved optimization 
> would be 
> to look at the register patterns and usages:
> 
> 
>   first-use   last-useuse-length
>   a:  #10 #29 20
>   b:  #24 #24  1
>   c:  #14 #30 17
>   d:  #23 #34 12
>   e:  #11 #20 10
>   f:  #15 #15  1
>   g:  #18 #27 10
>   h:  #13 #36 24
> 
>   y0: #11 #31 21
>   y1: #12 #33 22
>   y2: #15 #35 21
>   y3: #10 #36 27
> 
>   T1: #16 #32 17
> 
> The 'first-use' column shows the number of the instruction within the loop
> body at which the register first gets used - with '#1' denoting the first
> instruction and #36 the last instruction - the 'last-use' column shows the
> last instruction, and the 'use-length' column shows the 'window' in which a
> register is used.
> 
> What we want are the registers that are used the most tightly, i.e. these two:
> 
>   b:  #24 #24  1
>   f:  #15 #15  1
> 
> Of these two 'f' is the best one, because it has an earlier use and longer 
> cooldown.
> 
> If we alias 'TBL' with 'f' then we could reload 'TBL' for the next iteration
> very early on:
> 
> mov f, y2   # y2 = f# CH
> +   mov frame_TBL(%rsp), TBL
> rorx$34, a, T1  # T1 = a >> 34  # S0B
> 
> And there will be 21 instructions that don't depend on TBL after this, plenty
> of time for the load to be generated and propagated.
> 
> NOTE: my pseudo-patch is naive, due to the complication caused by the
> RotateState macro name rotation. It's still fundamentally possible I believe,
> it's just that 'TBL' has to be rotated too, together with the other variables.
> 
> 3)
> 
> If even this does not help, because the workload is ucode-cache limited, and 
> the 
> extra reloads pushed the critical path just beyond some cache limit, then 
> another 
> experiment to try would be to roll _back_ the loop some more: instead of 4x 
> FOUR_ROUNDS_AND_SCHED unrolled loops, try just having 2.
> 
> The CPU should still be smart enough with 2x interleaving of the loop body, 
> and 
> the extra branches should be relatively small and we could get back some 
> performance.
> 
> In theory ...
> 
> 4)
> 
> If the workload is 

Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files

2017-09-14 Thread Ingo Molnar

* Ingo Molnar  wrote:

> 1)
> 
> Note how R12 is used immediately, right in the next instruction:
> 
> vpaddq  (TBL), Y_0, XFER
> 
> I.e. the RBP fixes lengthen the program order data dependencies - that's a 
> new 
> constraint and a few extra cycles per loop iteration if the workload is 
> address-generator bandwidth limited on that.
> 
> A simple way to ease that constraint would be to move the 'TBL' load up into
> the loop body, to the point where 'T1' is used for the last time - which is:
> 
> 
> mov a, T1   # T1 = a# MAJB
> and c, T1   # T1 = a&c  # MAJB
> 
> add y0, y2  # y2 = S1 + CH  # --
> or  T1, y3  # y3 = MAJ = (a|c)&b)|(a&c) # MAJ
> 
> +   mov frame_TBL(%rsp), TBL
> 
> add y1, h   # h = k + w + h + S0# --
> 
> add y2, d   # d = k + w + h + d + S1 + CH = d + t1  # --
> 
> add y2, h   # h = k + w + h + S0 + S1 + CH = t1 + S0# --
> add y3, h   # h = t1 + S0 + MAJ # --
> 
> Note how this moves up the 'TBL' reload by 4 instructions.

Note that in this case 'TBL' would have to be initialized before the 1st 
iteration, via something like:

movq$4, frame_SRND(%rsp)

+   mov frame_TBL(%rsp), TBL

.align 16
loop1:
vpaddq  (TBL), Y_0, XFER
vmovdqa XFER, frame_XFER(%rsp)
FOUR_ROUNDS_AND_SCHED

Thanks,

Ingo


Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files

2017-09-14 Thread Ingo Molnar

* Josh Poimboeuf  wrote:

> I'm still looking at the other one (sha512-avx2), but so far I haven't
> found a way to speed it back up.

Here's a couple of very quick observations with possible optimizations:

AFAICS the main effect of the RBP fixes is the introduction of a memory load
into the critical path, into the body of the unrolled loop:

+   mov frame_TBL(%rsp), TBL
vpaddq  (TBL), Y_0, XFER
vmovdqa XFER, frame_XFER(%rsp)
FOUR_ROUNDS_AND_SCHED

Both 'TBL' and 'T1' are mapped to R12, which is why TBL has to be spilled to be
reloaded from the stack.

1)

Note how R12 is used immediately, right in the next instruction:

vpaddq  (TBL), Y_0, XFER

I.e. the RBP fixes lengthen the program order data dependencies - that's a new 
constraint and a few extra cycles per loop iteration if the workload is 
address-generator bandwidth limited on that.

A simple way to ease that constraint would be to move the 'TBL' load up into
the loop body, to the point where 'T1' is used for the last time - which is:


mov a, T1   # T1 = a# MAJB
and c, T1   # T1 = a&c  # MAJB

add y0, y2  # y2 = S1 + CH  # --
or  T1, y3  # y3 = MAJ = (a|c)&b)|(a&c) # MAJ

+   mov frame_TBL(%rsp), TBL

add y1, h   # h = k + w + h + S0# --

add y2, d   # d = k + w + h + d + S1 + CH = d + t1  # --

add y2, h   # h = k + w + h + S0 + S1 + CH = t1 + S0# --
add y3, h   # h = t1 + S0 + MAJ # --

Note how this moves up the 'TBL' reload by 4 instructions.

2)

If this does not get back performance, then maybe another reason is that it's 
cache access latency limited, in which case a more involved optimization would 
be 
to look at the register patterns and usages:


first-use   last-useuse-length
a:  #10 #29 20
b:  #24 #24  1
c:  #14 #30 17
d:  #23 #34 12
e:  #11 #20 10
f:  #15 #15  1
g:  #18 #27 10
h:  #13 #36 24

y0: #11 #31 21
y1: #12 #33 22
y2: #15 #35 21
y3: #10 #36 27

T1: #16 #32 17

The 'first-use' column shows the number of the instruction within the loop body
at which the register first gets used - with '#1' denoting the first instruction
and #36 the last instruction - the 'last-use' column shows the last instruction,
and the 'use-length' column shows the 'window' in which a register is used.

What we want are the registers that are used the most tightly, i.e. these two:

b:  #24 #24  1
f:  #15 #15  1

Of these two 'f' is the best one, because it has an earlier use and longer 
cooldown.

If we alias 'TBL' with 'f' then we could reload 'TBL' for the next iteration
very early on:

mov f, y2   # y2 = f# CH
+   mov frame_TBL(%rsp), TBL
rorx$34, a, T1  # T1 = a >> 34  # S0B

And there will be 21 instructions that don't depend on TBL after this, plenty
of time for the load to be generated and propagated.

NOTE: my pseudo-patch is naive, due to the complication caused by the
RotateState macro name rotation. It's still fundamentally possible I believe,
it's just that 'TBL' has to be rotated too, together with the other variables.

3)

If even this does not help, because the workload is ucode-cache limited, and 
the 
extra reloads pushed the critical path just beyond some cache limit, then 
another 
experiment to try would be to roll _back_ the loop some more: instead of 4x 
FOUR_ROUNDS_AND_SCHED unrolled loops, try just having 2.

The CPU should still be smart enough with 2x interleaving of the loop body, and 
the extra branches should be relatively small and we could get back some 
performance.

In theory ...

4)

If the workload is fundamentally cache-port bandwidth limited, then the extra
loads from memory to reload 'TBL' take away valuable bandwidth. There's no easy
fix for that, other than to find an unused register.

Here's the (initial, pre-rotation) integer register mappings:

a:  RAX
b:  RBX

Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files

2017-09-13 Thread Josh Poimboeuf
On Wed, Sep 13, 2017 at 04:24:28PM -0500, Josh Poimboeuf wrote:
> On Fri, Sep 08, 2017 at 10:57:05AM -0700, Eric Biggers wrote:
> > On Thu, Sep 07, 2017 at 11:26:47PM +0200, Ingo Molnar wrote:
> > > 
> > > * Eric Biggers  wrote:
> > > 
> > > > On Thu, Sep 07, 2017 at 09:15:34AM +0200, Ingo Molnar wrote:
> > > > > 
> > > > > * Eric Biggers  wrote:
> > > > > 
> > > > > > Thanks for fixing these!  I don't have time to review these in 
> > > > > > detail, but I ran
> > > > > > the crypto self-tests on the affected algorithms, and they all 
> > > > > > pass.  I also
> > > > > > benchmarked them before and after; the only noticeable performance
> > > > > > difference was
> > > > > > that sha256-avx2 and sha512-avx2 became a few percent slower.  I 
> > > > > > don't suppose
> > > > > > there is a way around that?  Otherwise it's probably not a big deal.
> > > > > 
> > > > > Which CPU model did you use for the test?
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > >   Ingo
> > > > 
> > > > This was on Haswell, "Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz".
> > > 
> > > Any chance to test this with the latest microarchitecture - any Skylake 
> > > derivative 
> > > Intel CPU you have access to?
> > > 
> > > Thanks,
> > > 
> > >   Ingo
> > 
> > Tested with Skylake, "Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz".  The 
> > results
> > were the following which seemed a bit worse than Haswell:
> > 
> > sha256-avx2 became 3.5% slower
> > sha512-avx2 became 7.5% slower
> > 
> > Note: it's tricky to benchmark this, especially with just a few percent
> > difference, so don't read too much into the exact numbers.
> 
> Here's a v2 for the sha256-avx2 patch, would you mind seeing if this
> closes the performance gap?
> 
> I'm still looking at the other one (sha512-avx2), but so far I haven't
> found a way to speed it back up.

And here's v2 of the sha512-avx2 patch.  It should hopefully gain back
most of the performance lost by v1.

From: Josh Poimboeuf 
Subject: [PATCH] x86/crypto: Fix RBP usage in sha512-avx2-asm.S

Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Mix things up a little bit to get rid of the RBP usage, without
destroying performance.  Use RDI instead of RBP for the TBL pointer.
That will clobber CTX, so save CTX on the stack and use RDI as CTX
before it gets clobbered, and R12 as CTX after it gets clobbered.

Also remove the unused y4 variable.

Reported-by: Eric Biggers 
Reported-by: Peter Zijlstra 
Signed-off-by: Josh Poimboeuf 
---
 arch/x86/crypto/sha512-avx2-asm.S | 75 ---
 1 file changed, 39 insertions(+), 36 deletions(-)

diff --git a/arch/x86/crypto/sha512-avx2-asm.S 
b/arch/x86/crypto/sha512-avx2-asm.S
index 7f5f6c6ec72e..b16d56005162 100644
--- a/arch/x86/crypto/sha512-avx2-asm.S
+++ b/arch/x86/crypto/sha512-avx2-asm.S
@@ -69,8 +69,9 @@ XFER  = YTMP0
 
 BYTE_FLIP_MASK  = %ymm9
 
-# 1st arg
-CTX = %rdi
+# 1st arg is %rdi, which is saved to the stack and accessed later via %r12
+CTX1= %rdi
+CTX2= %r12
 # 2nd arg
 INP = %rsi
 # 3rd arg
@@ -81,7 +82,7 @@ d   = %r8
 e   = %rdx
 y3  = %rsi
 
-TBL   = %rbp
+TBL   = %rdi # clobbers CTX1
 
 a = %rax
 b = %rbx
@@ -91,26 +92,26 @@ g = %r10
 h = %r11
 old_h = %r11
 
-T1= %r12
+T1= %r12 # clobbers CTX2
 y0= %r13
 y1= %r14
 y2= %r15
 
-y4= %r12
-
 # Local variables (stack frame)
 XFER_SIZE = 4*8
 SRND_SIZE = 1*8
 INP_SIZE = 1*8
 INPEND_SIZE = 1*8
+CTX_SIZE = 1*8
 RSPSAVE_SIZE = 1*8
-GPRSAVE_SIZE = 6*8
+GPRSAVE_SIZE = 5*8
 
 frame_XFER = 0
 frame_SRND = frame_XFER + XFER_SIZE
 frame_INP = frame_SRND + SRND_SIZE
 frame_INPEND = frame_INP + INP_SIZE
-frame_RSPSAVE = frame_INPEND + INPEND_SIZE
+frame_CTX = frame_INPEND + INPEND_SIZE
+frame_RSPSAVE = frame_CTX + CTX_SIZE
 frame_GPRSAVE = frame_RSPSAVE + RSPSAVE_SIZE
 frame_size = frame_GPRSAVE + GPRSAVE_SIZE
 
@@ -576,12 +577,11 @@ ENTRY(sha512_transform_rorx)
mov %rax, frame_RSPSAVE(%rsp)
 
# Save GPRs
-   mov %rbp, frame_GPRSAVE(%rsp)
-   mov %rbx, 8*1+frame_GPRSAVE(%rsp)
-   mov %r12, 8*2+frame_GPRSAVE(%rsp)
-   mov %r13, 8*3+frame_GPRSAVE(%rsp)
-   mov %r14, 8*4+frame_GPRSAVE(%rsp)
-   mov %r15, 8*5+frame_GPRSAVE(%rsp)
+   mov %rbx, 8*0+frame_GPRSAVE(%rsp)
+   mov %r12, 8*1+frame_GPRSAVE(%rsp)
+   mov %r13, 8*2+frame_GPRSAVE(%rsp)
+   mov %r14, 8*3+frame_GPRSAVE(%rsp)
+   mov %r15, 8*4+frame_GPRSAVE(%rsp)
 
shl $7, NUM_BLKS# convert to bytes
jz  done_hash
@@ -589,14 +589,17 @@ ENTRY(sha512_transform_rorx)
mov NUM_BLKS, frame_INPEND(%rsp)
 
## load initial digest
-   mov 8*0(CTX),a
-   mov 8*1(CTX),b
-   mov 8*2(CTX),c
-   mov 8*3(CTX),d
-   mov 8*4(CTX),e
-   mov 8*5(CTX),f
-   
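
The register shuffle described in the commit message above boils down to the
pattern sketched below (a stand-alone illustration with made-up labels and a
made-up single stack slot, not the actual sha512-avx2 code): the first argument
arrives in RDI, is saved to a stack slot before RDI is repurposed as the table
pointer, and is reloaded into R12 whenever the context has to be touched again.

        # Sketch of the CTX spill/reload pattern (illustrative only; k_table
        # and the single stack slot stand in for the real frame layout).
        .text
        .globl  ctx_spill_sketch
        .type   ctx_spill_sketch, @function
ctx_spill_sketch:
        pushq   %r12                    # R12 is callee-saved, preserve it
        subq    $8, %rsp                # one slot, playing the role of frame_CTX
        movq    %rdi, (%rsp)            # save CTX (1st arg) before RDI is clobbered
        leaq    k_table(%rip), %rdi     # RDI now serves as the table pointer
        movq    (%rdi), %rax            # ... hot path indexes the table via RDI ...
        movq    (%rsp), %r12            # reload CTX into R12 when it is needed again
        addq    (%r12), %rax            # ... context accesses now go through R12 ...
        addq    $8, %rsp
        popq    %r12
        ret
        .size   ctx_spill_sketch, .-ctx_spill_sketch

        .section .rodata
k_table:
        .quad   0                       # placeholder round constant

Called with a pointer to a u64, the sketch just returns k_table[0] plus the
pointed-to value; the real code additionally saves and restores every GPR it
touches, as the GPRSAVE hunk above does.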

Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files

2017-09-13 Thread Josh Poimboeuf
On Fri, Sep 08, 2017 at 10:57:05AM -0700, Eric Biggers wrote:
> On Thu, Sep 07, 2017 at 11:26:47PM +0200, Ingo Molnar wrote:
> > 
> > * Eric Biggers  wrote:
> > 
> > > On Thu, Sep 07, 2017 at 09:15:34AM +0200, Ingo Molnar wrote:
> > > > 
> > > > * Eric Biggers  wrote:
> > > > 
> > > > > Thanks for fixing these!  I don't have time to review these in 
> > > > > detail, but I ran
> > > > > the crypto self-tests on the affected algorithms, and they all pass.  
> > > > > I also
> > > > > benchmarked them before and after; the only noticeable performance
> > > > > difference was
> > > > > that sha256-avx2 and sha512-avx2 became a few percent slower.  I 
> > > > > don't suppose
> > > > > there is a way around that?  Otherwise it's probably not a big deal.
> > > > 
> > > > Which CPU model did you use for the test?
> > > > 
> > > > Thanks,
> > > > 
> > > > Ingo
> > > 
> > > This was on Haswell, "Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz".
> > 
> > Any chance to test this with the latest microarchitecture - any Skylake 
> > derivative 
> > Intel CPU you have access to?
> > 
> > Thanks,
> > 
> > Ingo
> 
> Tested with Skylake, "Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz".  The results
> were the following which seemed a bit worse than Haswell:
> 
>   sha256-avx2 became 3.5% slower
>   sha512-avx2 became 7.5% slower
> 
> Note: it's tricky to benchmark this, especially with just a few percent
> difference, so don't read too much into the exact numbers.

Here's a v2 for the sha256-avx2 patch, would you mind seeing if this
closes the performance gap?

I'm still looking at the other one (sha512-avx2), but so far I haven't
found a way to speed it back up.

From: Josh Poimboeuf 
Subject: [PATCH] x86/crypto: Fix RBP usage in sha256-avx2-asm.S

Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

There's no need to use RBP as a temporary register for the TBL value,
because it always stores the same value: the address of the K256 table.
Instead just reference the address of K256 directly.

Reported-by: Eric Biggers 
Reported-by: Peter Zijlstra 
Signed-off-by: Josh Poimboeuf 
---
 arch/x86/crypto/sha256-avx2-asm.S | 22 +++---
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/arch/x86/crypto/sha256-avx2-asm.S 
b/arch/x86/crypto/sha256-avx2-asm.S
index 89c8f09787d2..1420db15dcdd 100644
--- a/arch/x86/crypto/sha256-avx2-asm.S
+++ b/arch/x86/crypto/sha256-avx2-asm.S
@@ -98,8 +98,6 @@ d = %r8d
 e   = %edx # clobbers NUM_BLKS
 y3 = %esi  # clobbers INP
 
-
-TBL= %rbp
 SRND   = CTX   # SRND is same register as CTX
 
 a = %eax
@@ -531,7 +529,6 @@ STACK_SIZE  = _RSP  + _RSP_SIZE
 ENTRY(sha256_transform_rorx)
 .align 32
pushq   %rbx
-   pushq   %rbp
pushq   %r12
pushq   %r13
pushq   %r14
@@ -568,8 +565,6 @@ ENTRY(sha256_transform_rorx)
mov CTX, _CTX(%rsp)
 
 loop0:
-   lea K256(%rip), TBL
-
## Load first 16 dwords from two blocks
VMOVDQ  0*32(INP),XTMP0
VMOVDQ  1*32(INP),XTMP1
@@ -597,19 +592,19 @@ last_block_enter:
 
 .align 16
 loop1:
-   vpaddd  0*32(TBL, SRND), X0, XFER
+   vpaddd  K256+0*32(SRND), X0, XFER
vmovdqa XFER, 0*32+_XFER(%rsp, SRND)
FOUR_ROUNDS_AND_SCHED   _XFER + 0*32
 
-   vpaddd  1*32(TBL, SRND), X0, XFER
+   vpaddd  K256+1*32(SRND), X0, XFER
vmovdqa XFER, 1*32+_XFER(%rsp, SRND)
FOUR_ROUNDS_AND_SCHED   _XFER + 1*32
 
-   vpaddd  2*32(TBL, SRND), X0, XFER
+   vpaddd  K256+2*32(SRND), X0, XFER
vmovdqa XFER, 2*32+_XFER(%rsp, SRND)
FOUR_ROUNDS_AND_SCHED   _XFER + 2*32
 
-   vpaddd  3*32(TBL, SRND), X0, XFER
+   vpaddd  K256+3*32(SRND), X0, XFER
vmovdqa XFER, 3*32+_XFER(%rsp, SRND)
FOUR_ROUNDS_AND_SCHED   _XFER + 3*32
 
@@ -619,10 +614,11 @@ loop1:
 
 loop2:
## Do last 16 rounds with no scheduling
-   vpaddd  0*32(TBL, SRND), X0, XFER
+   vpaddd  K256+0*32(SRND), X0, XFER
vmovdqa XFER, 0*32+_XFER(%rsp, SRND)
DO_4ROUNDS  _XFER + 0*32
-   vpaddd  1*32(TBL, SRND), X1, XFER
+
+   vpaddd  K256+1*32(SRND), X1, XFER
vmovdqa XFER, 1*32+_XFER(%rsp, SRND)
DO_4ROUNDS  _XFER + 1*32
add $2*32, SRND
@@ -676,9 +672,6 @@ loop3:
ja  done_hash
 
 do_last_block:
-    do last block
-   lea K256(%rip), TBL
-
VMOVDQ  0*16(INP),XWORD0
VMOVDQ  1*16(INP),XWORD1
VMOVDQ  2*16(INP),XWORD2
@@ -718,7 +711,6 @@ done_hash:
popq%r14
popq%r13
popq%r12
-   popq%rbp
popq%rbx
ret
 ENDPROC(sha256_transform_rorx)
-- 
2.13.5
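
The essence of the sha256-avx2 change above is an addressing-mode swap: instead
of keeping the table base in a dedicated register (TBL) and using base+index
addressing, the K256 symbol is folded into the displacement so that only the
round-offset register stays live. A minimal stand-alone illustration of the
same idea is below (table name and function name made up; like the kernel's .S
files it assumes a non-PIC/small code model, so it needs -no-pie if built in
userspace).

        # Illustrative only: freeing a base register by folding the table
        # symbol into the displacement.
        .text
        .globl  table_index_sketch
        .type   table_index_sketch, @function
table_index_sketch:
        # Before (two registers live):
        #       leaq    my_table(%rip), %rbp
        #       movl    0*4(%rbp, %rsi), %eax
        # After (only the byte-offset register %rsi is live):
        movl    my_table+0*4(%rsi), %eax
        ret
        .size   table_index_sketch, .-table_index_sketch

        .section .rodata
my_table:
        .long   1, 2, 3, 4

Here %rsi carries a byte offset into the table, much as SRND does in the patch
above.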



Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files

2017-09-08 Thread Eric Biggers
On Thu, Sep 07, 2017 at 11:26:47PM +0200, Ingo Molnar wrote:
> 
> * Eric Biggers  wrote:
> 
> > On Thu, Sep 07, 2017 at 09:15:34AM +0200, Ingo Molnar wrote:
> > > 
> > > * Eric Biggers  wrote:
> > > 
> > > > Thanks for fixing these!  I don't have time to review these in detail, 
> > > > but I ran
> > > > the crypto self-tests on the affected algorithms, and they all pass.  I 
> > > > also
> > > > benchmarked them before and after; the only noticeable performance
> > > > difference was
> > > > that sha256-avx2 and sha512-avx2 became a few percent slower.  I don't 
> > > > suppose
> > > > there is a way around that?  Otherwise it's probably not a big deal.
> > > 
> > > Which CPU model did you use for the test?
> > > 
> > > Thanks,
> > > 
> > >   Ingo
> > 
> > This was on Haswell, "Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz".
> 
> Any chance to test this with the latest microarchitecture - any Skylake 
> derivative 
> Intel CPU you have access to?
> 
> Thanks,
> 
>   Ingo

Tested with Skylake, "Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz".  The results
were the following which seemed a bit worse than Haswell:

sha256-avx2 became 3.5% slower
sha512-avx2 became 7.5% slower

Note: it's tricky to benchmark this, especially with just a few percent
difference, so don't read too much into the exact numbers.

Eric


Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files

2017-09-07 Thread Ingo Molnar

* Eric Biggers  wrote:

> On Thu, Sep 07, 2017 at 09:15:34AM +0200, Ingo Molnar wrote:
> > 
> > * Eric Biggers  wrote:
> > 
> > > Thanks for fixing these!  I don't have time to review these in detail, 
> > > but I ran
> > > the crypto self-tests on the affected algorithms, and they all pass.  I 
> > > also
> > > benchmarked them before and after; the only noticeable performance
> > > difference was
> > > that sha256-avx2 and sha512-avx2 became a few percent slower.  I don't 
> > > suppose
> > > there is a way around that?  Otherwise it's probably not a big deal.
> > 
> > Which CPU model did you use for the test?
> > 
> > Thanks,
> > 
> > Ingo
> 
> This was on Haswell, "Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz".

Any chance to test this with the latest microarchitecture - any Skylake 
derivative 
Intel CPU you have access to?

Thanks,

Ingo


Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files

2017-09-07 Thread Eric Biggers
On Thu, Sep 07, 2017 at 09:15:34AM +0200, Ingo Molnar wrote:
> 
> * Eric Biggers  wrote:
> 
> > Thanks for fixing these!  I don't have time to review these in detail, but 
> > I ran
> > the crypto self-tests on the affected algorithms, and they all pass.  I also
> > benchmarked them before and after; the only noticeable performance
> > difference was
> > that sha256-avx2 and sha512-avx2 became a few percent slower.  I don't 
> > suppose
> > there is a way around that?  Otherwise it's probably not a big deal.
> 
> Which CPU model did you use for the test?
> 
> Thanks,
> 
>   Ingo

This was on Haswell, "Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz".

Eric


Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files

2017-09-07 Thread Ingo Molnar

* Eric Biggers  wrote:

> Thanks for fixing these!  I don't have time to review these in detail, but I 
> ran
> the crypto self-tests on the affected algorithms, and they all pass.  I also
> benchmarked them before and after; the only noticeable performance difference
> was
> that sha256-avx2 and sha512-avx2 became a few percent slower.  I don't suppose
> there is a way around that?  Otherwise it's probably not a big deal.

Which CPU model did you use for the test?

Thanks,

Ingo


Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files

2017-09-06 Thread Josh Poimboeuf
On Fri, Sep 01, 2017 at 05:09:19PM -0700, Eric Biggers wrote:
> Hi Josh,
> 
> On Tue, Aug 29, 2017 at 01:05:33PM -0500, Josh Poimboeuf wrote:
> > Many of the x86 crypto functions use RBP as a temporary register.  This
> > breaks frame pointer convention, and breaks stack traces when unwinding
> > from an interrupt in the crypto code.
> > 
> > Convert most* of them to leave RBP alone.
> > 
> > These pass the crypto boot tests for me.  Any further testing would be
> > appreciated!
> > 
> > [*] There are still a few crypto files left that need fixing, but the
> > fixes weren't trivial and nobody reported unwinder warnings about
> > them yet, so I'm skipping them for now.
> > 
> > Josh Poimboeuf (12):
> >   x86/crypto: Fix RBP usage in blowfish-x86_64-asm_64.S
> >   x86/crypto: Fix RBP usage in camellia-x86_64-asm_64.S
> >   x86/crypto: Fix RBP usage in cast5-avx-x86_64-asm_64.S
> >   x86/crypto: Fix RBP usage in cast6-avx-x86_64-asm_64.S
> >   x86/crypto: Fix RBP usage in des3_ede-asm_64.S
> >   x86/crypto: Fix RBP usage in sha1_avx2_x86_64_asm.S
> >   x86/crypto: Fix RBP usage in sha1_ssse3_asm.S
> >   x86/crypto: Fix RBP usage in sha256-avx-asm.S
> >   x86/crypto: Fix RBP usage in sha256-avx2-asm.S
> >   x86/crypto: Fix RBP usage in sha256-ssse3-asm.S
> >   x86/crypto: Fix RBP usage in sha512-avx2-asm.S
> >   x86/crypto: Fix RBP usage in twofish-avx-x86_64-asm_64.S
> > 
> 
> Thanks for fixing these!  I don't have time to review these in detail, but
> I ran the crypto self-tests on the affected algorithms, and they all pass.
> I also benchmarked them before and after; the only noticeable performance
> difference was that sha256-avx2 and sha512-avx2 became a few percent
> slower.  I don't suppose there is a way around that?  Otherwise it's
> probably not a big deal.

Thanks for testing.  I might have a way to make sha256-avx2 faster, but
not sha512-avx2.  I'll let you know when I have a patch.

-- 
Josh


Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files

2017-09-01 Thread Eric Biggers
Hi Josh,

On Tue, Aug 29, 2017 at 01:05:33PM -0500, Josh Poimboeuf wrote:
> Many of the x86 crypto functions use RBP as a temporary register.  This
> breaks frame pointer convention, and breaks stack traces when unwinding
> from an interrupt in the crypto code.
> 
> Convert most* of them to leave RBP alone.
> 
> These pass the crypto boot tests for me.  Any further testing would be
> appreciated!
> 
> [*] There are still a few crypto files left that need fixing, but the
> fixes weren't trivial and nobody reported unwinder warnings about
> them yet, so I'm skipping them for now.
> 
> Josh Poimboeuf (12):
>   x86/crypto: Fix RBP usage in blowfish-x86_64-asm_64.S
>   x86/crypto: Fix RBP usage in camellia-x86_64-asm_64.S
>   x86/crypto: Fix RBP usage in cast5-avx-x86_64-asm_64.S
>   x86/crypto: Fix RBP usage in cast6-avx-x86_64-asm_64.S
>   x86/crypto: Fix RBP usage in des3_ede-asm_64.S
>   x86/crypto: Fix RBP usage in sha1_avx2_x86_64_asm.S
>   x86/crypto: Fix RBP usage in sha1_ssse3_asm.S
>   x86/crypto: Fix RBP usage in sha256-avx-asm.S
>   x86/crypto: Fix RBP usage in sha256-avx2-asm.S
>   x86/crypto: Fix RBP usage in sha256-ssse3-asm.S
>   x86/crypto: Fix RBP usage in sha512-avx2-asm.S
>   x86/crypto: Fix RBP usage in twofish-avx-x86_64-asm_64.S
> 

Thanks for fixing these!  I don't have time to review these in detail, but I ran
the crypto self-tests on the affected algorithms, and they all pass.  I also
benchmarked them before and after; the only noticeable performance difference was
that sha256-avx2 and sha512-avx2 became a few percent slower.  I don't suppose
there is a way around that?  Otherwise it's probably not a big deal.
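
A hit like that is roughly what you'd expect: with %rbp off limits, the hot
loop has one less general-purpose register to work with, so a value that used
to sit in a register can end up in a stack slot and get reloaded every round.
As a made-up minimal illustration (not the real sha256-avx2/sha512-avx2 code,
just the generic effect):

	.text

	/* Pointer kept in a register for the whole loop. */
	.globl	sum_reg
sum_reg:
	xor	%eax, %eax
	xor	%ecx, %ecx
1:
	add	(%rdi,%rcx,8), %rax
	inc	%ecx
	cmp	$64, %ecx
	jne	1b
	ret

	/*
	 * Same loop, but the pointer is parked on the stack and reloaded
	 * every iteration -- the extra memory traffic that tends to show
	 * up when a register has to be given back.
	 */
	.globl	sum_spill
sum_spill:
	xor	%eax, %eax
	xor	%ecx, %ecx
	sub	$8, %rsp
	mov	%rdi, (%rsp)
1:
	mov	(%rsp), %rdx		/* reload once per round */
	add	(%rdx,%rcx,8), %rax
	inc	%ecx
	cmp	$64, %ecx
	jne	1b
	add	$8, %rsp
	ret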

For reference, this was the list of affected algorithms I used:

shash: sha1-ssse3, sha1-avx, sha1-avx2, sha256-ssse3, sha256-avx, sha256-avx2,
sha512-ssse3, sha512-avx, sha512-avx2

cipher: blowfish-asm, camellia-asm, des3_ede-asm

skcipher: ecb-blowfish-asm, cbc-blowfish-asm, ctr-blowfish-asm, ecb-cast5-avx,
cbc-cast5-avx, ctr-cast5-avx, ecb-cast6-avx, cbc-cast6-avx, ctr-cast6-avx,
lrw-cast6-avx, xts-cast6-avx, ecb-camellia-asm, cbc-camellia-asm,
ctr-camellia-asm, lrw-camellia-asm, xts-camellia-asm, ecb-des3_ede-asm,
cbc-des3_ede-asm, ctr-des3_ede-asm, ecb-twofish-avx, cbc-twofish-avx,
ctr-twofish-avx, lrw-twofish-avx, xts-twofish-avx

Eric


[PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files

2017-08-29 Thread Josh Poimboeuf
Many of the x86 crypto functions use RBP as a temporary register.  This
breaks frame pointer convention, and breaks stack traces when unwinding
from an interrupt in the crypto code.

Convert most* of them to leave RBP alone.
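
To make that concrete, here is a minimal standalone sketch -- the functions,
labels and register choices are made up for illustration and are NOT taken
from any of the .S files touched by this series -- of what "RBP as a
temporary" looks like and the general shape of the rewrite:

	.text

	/*
	 * Before: %rbp is saved and restored, so the C ABI is respected,
	 * but while the loop runs it holds a data pointer rather than a
	 * frame pointer.  An interrupt landing inside the loop hands the
	 * frame-pointer unwinder garbage and the stack trace breaks.
	 */
	.globl	round_loop_rbp
round_loop_rbp:
	push	%rbp
	mov	%rdi, %rbp		/* %rbp = round-constant pointer */
	xor	%eax, %eax
	mov	$16, %ecx
1:
	add	(%rbp), %rax		/* "round": fold in the next constant */
	add	$8, %rbp
	dec	%ecx
	jnz	1b
	pop	%rbp
	ret

	/*
	 * After: park the pointer in a different callee-saved register
	 * (%r12 here) and leave %rbp alone, at the cost of an extra
	 * push/pop.
	 */
	.globl	round_loop_r12
round_loop_r12:
	push	%r12
	mov	%rdi, %r12		/* %r12 = round-constant pointer */
	xor	%eax, %eax
	mov	$16, %ecx
1:
	add	(%r12), %rax
	add	$8, %r12
	dec	%ecx
	jnz	1b
	pop	%r12
	ret

The real files differ in the details, but that is the general shape of the
change.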

These pass the crypto boot tests for me.  Any further testing would be
appreciated!

[*] There are still a few crypto files left that need fixing, but the
fixes weren't trivial and nobody reported unwinder warnings about
them yet, so I'm skipping them for now.

Josh Poimboeuf (12):
  x86/crypto: Fix RBP usage in blowfish-x86_64-asm_64.S
  x86/crypto: Fix RBP usage in camellia-x86_64-asm_64.S
  x86/crypto: Fix RBP usage in cast5-avx-x86_64-asm_64.S
  x86/crypto: Fix RBP usage in cast6-avx-x86_64-asm_64.S
  x86/crypto: Fix RBP usage in des3_ede-asm_64.S
  x86/crypto: Fix RBP usage in sha1_avx2_x86_64_asm.S
  x86/crypto: Fix RBP usage in sha1_ssse3_asm.S
  x86/crypto: Fix RBP usage in sha256-avx-asm.S
  x86/crypto: Fix RBP usage in sha256-avx2-asm.S
  x86/crypto: Fix RBP usage in sha256-ssse3-asm.S
  x86/crypto: Fix RBP usage in sha512-avx2-asm.S
  x86/crypto: Fix RBP usage in twofish-avx-x86_64-asm_64.S

 arch/x86/crypto/blowfish-x86_64-asm_64.S| 48 ++-
 arch/x86/crypto/camellia-x86_64-asm_64.S| 26 +++
 arch/x86/crypto/cast5-avx-x86_64-asm_64.S   | 47 +--
 arch/x86/crypto/cast6-avx-x86_64-asm_64.S   | 50 -
 arch/x86/crypto/des3_ede-asm_64.S   | 15 +
 arch/x86/crypto/sha1_avx2_x86_64_asm.S  |  4 +--
 arch/x86/crypto/sha1_ssse3_asm.S| 11 +++
 arch/x86/crypto/sha256-avx-asm.S| 15 -
 arch/x86/crypto/sha256-avx2-asm.S   | 16 -
 arch/x86/crypto/sha256-ssse3-asm.S  | 15 -
 arch/x86/crypto/sha512-avx2-asm.S   | 21 
 arch/x86/crypto/twofish-avx-x86_64-asm_64.S | 12 +++
 12 files changed, 160 insertions(+), 120 deletions(-)

-- 
2.13.5


