Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
On Fri, Sep 15, 2017 at 11:06:29PM +0200, Ingo Molnar wrote:
>
> Indeed, I suspect they should go through the crypto tree, these fixes are
> independent, they don't depend on anything in the x86 tree.

Sure I can pick them up through cryptodev.

Thanks,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
* Eric Biggers wrote:

> On Fri, Sep 15, 2017 at 07:34:31AM +0200, Ingo Molnar wrote:
> >
> > * Eric Biggers wrote:
> >
> > > Hi Josh,
> > >
> > > On Wed, Sep 13, 2017 at 05:33:03PM -0500, Josh Poimboeuf wrote:
> > > > And here's v2 of the sha512-avx2 patch.  It should hopefully gain back
> > > > most of the performance lost by v1.
> > > >
> > > > From: Josh Poimboeuf
> > > > Subject: [PATCH] x86/crypto: Fix RBP usage in sha512-avx2-asm.S
> > > >
> > > > Using RBP as a temporary register breaks frame pointer convention and
> > > > breaks stack traces when unwinding from an interrupt in the crypto code.
> > > >
> > > > Mix things up a little bit to get rid of the RBP usage, without
> > > > destroying performance.  Use RDI instead of RBP for the TBL pointer.
> > > > That will clobber CTX, so save CTX on the stack and use RDI as CTX
> > > > before it gets clobbered, and R12 as CTX after it gets clobbered.
> > > >
> > > > Also remove the unused y4 variable.
> > >
> > > I tested the v2 patches for both sha256-avx2 and sha512-avx2 on Skylake.
> > > They both pass the crypto self-tests, and there was no noticeable
> > > performance difference compared to the unpatched versions.  Thanks!
> >
> > Cool, thanks for review and the testing! Can we add your Tested-by +
> > Acked-by tags to the patches?
>
> Yes, that's fine for all the patches in the series.
>
> Will these patches go in through the crypto tree or through the x86 tree?

Indeed, I suspect they should go through the crypto tree, these fixes are
independent, they don't depend on anything in the x86 tree.

Thanks,

	Ingo
Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
On Fri, Sep 15, 2017 at 07:34:31AM +0200, Ingo Molnar wrote:
>
> * Eric Biggers wrote:
>
> > Hi Josh,
> >
> > On Wed, Sep 13, 2017 at 05:33:03PM -0500, Josh Poimboeuf wrote:
> > > And here's v2 of the sha512-avx2 patch.  It should hopefully gain back
> > > most of the performance lost by v1.
> > >
> > > From: Josh Poimboeuf
> > > Subject: [PATCH] x86/crypto: Fix RBP usage in sha512-avx2-asm.S
> > >
> > > Using RBP as a temporary register breaks frame pointer convention and
> > > breaks stack traces when unwinding from an interrupt in the crypto code.
> > >
> > > Mix things up a little bit to get rid of the RBP usage, without
> > > destroying performance.  Use RDI instead of RBP for the TBL pointer.
> > > That will clobber CTX, so save CTX on the stack and use RDI as CTX
> > > before it gets clobbered, and R12 as CTX after it gets clobbered.
> > >
> > > Also remove the unused y4 variable.
> >
> > I tested the v2 patches for both sha256-avx2 and sha512-avx2 on Skylake.
> > They both pass the crypto self-tests, and there was no noticeable
> > performance difference compared to the unpatched versions.  Thanks!
>
> Cool, thanks for review and the testing! Can we add your Tested-by +
> Acked-by tags to the patches?

Yes, that's fine for all the patches in the series.

Will these patches go in through the crypto tree or through the x86 tree?

Eric
Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
* Josh Poimboeuf wrote:

> > So if this observation of mine is true we could go back to the old code
> > for the hotpath, but use RDI for TBL and not reload it in the hotpath.
>
> Thanks for the excellent breakdown.
>
> When I looked at the patch again, I came to the same conclusion as your
> #4, which is that RDI isn't being used in the inner loops.  It *is* used
> in the outermost loop, however.
>
> So v2 of my sha512-avx2-asm.S patch spilled CTX onto the stack, instead
> of TBL:
>
> https://lkml.kernel.org/r/20170913223303.pskmy2v7nto6rvtg@treble

Indeed - I should have checked your v2 patch, but somehow missed it. Would
have saved me some looking+typing ;-)

Thanks,

	Ingo
Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
* Eric Biggers wrote:

> Hi Josh,
>
> On Wed, Sep 13, 2017 at 05:33:03PM -0500, Josh Poimboeuf wrote:
> > And here's v2 of the sha512-avx2 patch.  It should hopefully gain back
> > most of the performance lost by v1.
> >
> > From: Josh Poimboeuf
> > Subject: [PATCH] x86/crypto: Fix RBP usage in sha512-avx2-asm.S
> >
> > Using RBP as a temporary register breaks frame pointer convention and
> > breaks stack traces when unwinding from an interrupt in the crypto code.
> >
> > Mix things up a little bit to get rid of the RBP usage, without
> > destroying performance.  Use RDI instead of RBP for the TBL pointer.
> > That will clobber CTX, so save CTX on the stack and use RDI as CTX
> > before it gets clobbered, and R12 as CTX after it gets clobbered.
> >
> > Also remove the unused y4 variable.
>
> I tested the v2 patches for both sha256-avx2 and sha512-avx2 on Skylake.
> They both pass the crypto self-tests, and there was no noticeable
> performance difference compared to the unpatched versions.  Thanks!

Cool, thanks for review and the testing! Can we add your Tested-by + Acked-by
tags to the patches?

Thanks,

	Ingo
Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
Hi Josh,

On Wed, Sep 13, 2017 at 05:33:03PM -0500, Josh Poimboeuf wrote:
> And here's v2 of the sha512-avx2 patch.  It should hopefully gain back
> most of the performance lost by v1.
>
> From: Josh Poimboeuf
> Subject: [PATCH] x86/crypto: Fix RBP usage in sha512-avx2-asm.S
>
> Using RBP as a temporary register breaks frame pointer convention and
> breaks stack traces when unwinding from an interrupt in the crypto code.
>
> Mix things up a little bit to get rid of the RBP usage, without
> destroying performance.  Use RDI instead of RBP for the TBL pointer.
> That will clobber CTX, so save CTX on the stack and use RDI as CTX
> before it gets clobbered, and R12 as CTX after it gets clobbered.
>
> Also remove the unused y4 variable.

I tested the v2 patches for both sha256-avx2 and sha512-avx2 on Skylake.
They both pass the crypto self-tests, and there was no noticeable
performance difference compared to the unpatched versions.  Thanks!

Eric
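[Editor's note: to illustrate the "breaks stack traces" point in the changelog above, here is a toy Python model (not kernel code; the function names are made up) of the frame-pointer chain that the unwinder walks. Each frame's prologue saves the caller's %rbp, forming a linked list; asm code that reuses %rbp as a scratch register corrupts that link.]

```python
class Frame:
    """One stack frame as a frame-pointer unwinder sees it."""
    def __init__(self, next_frame, ret):
        self.next = next_frame   # saved %rbp: link to the caller's frame
        self.ret = ret           # return address stored next to it

def unwind(fp):
    """Follow the saved-%rbp chain, collecting return addresses."""
    trace = []
    while fp is not None:
        trace.append(fp.ret)
        fp = fp.next
    return trace

# A healthy chain: each function's prologue saved the caller's %rbp.
outer = Frame(None, "caller")
mid = Frame(outer, "sha512_update")
inner = Frame(mid, "sha512_transform_rorx")
print(unwind(inner))  # full trace, innermost frame first

# If the innermost code used %rbp as a scratch register, the saved link is
# garbage and the trace is cut short (or walks into random memory).
inner.next = None  # simulate the clobbered link
print(unwind(inner))  # truncated trace
```

An interrupt can arrive at any instruction, so even a temporary clobber of %rbp inside the hot loop is enough to truncate a trace sampled at that moment.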
Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
On Thu, Sep 14, 2017 at 11:16:12AM +0200, Ingo Molnar wrote:
>
> * Josh Poimboeuf wrote:
>
> > I'm still looking at the other one (sha512-avx2), but so far I haven't
> > found a way to speed it back up.
>
> Here's a couple of very quick observations with possible optimizations:
>
> AFAICS the main effect of the RBP fixes is the introduction of a memory load
> into the critical path, into the body of the unrolled loop:
>
> +	mov	frame_TBL(%rsp), TBL
> 	vpaddq	(TBL), Y_0, XFER
> 	vmovdqa	XFER, frame_XFER(%rsp)
> 	FOUR_ROUNDS_AND_SCHED
>
> Both 'TBL' and 'T1' are mapped to R12, which is why TBL has to be spilled to
> the stack and reloaded from there.
>
> 1)
>
> Note how R12 is used immediately, right in the next instruction:
>
> 	vpaddq	(TBL), Y_0, XFER
>
> I.e. the RBP fixes lengthen the program order data dependencies - that's a
> new constraint and a few extra cycles per loop iteration if the workload is
> address-generator bandwidth limited on that.
>
> A simple way to ease that constraint would be to move the 'TBL' load up into
> the loop body, to the point where 'T1' is used for the last time - which is:
>
> 	mov	a, T1		# T1 = a		# MAJB
> 	and	c, T1		# T1 = a&c		# MAJB
>
> 	add	y0, y2		# y2 = S1 + CH		# --
> 	or	T1, y3		# y3 = MAJ = ((a|c)&b)|(a&c)	# MAJ
>
> +	mov	frame_TBL(%rsp), TBL
>
> 	add	y1, h		# h = k + w + h + S0	# --
>
> 	add	y2, d		# d = k + w + h + d + S1 + CH = d + t1	# --
>
> 	add	y2, h		# h = k + w + h + S0 + S1 + CH = t1 + S0	# --
> 	add	y3, h		# h = t1 + S0 + MAJ	# --
>
> Note how this moves up the 'TBL' reload by 4 instructions.
>
> 2)
>
> If this does not get back performance, then maybe another reason is that
> it's cache access latency limited, in which case a more involved
> optimization would be to look at the register patterns and usages:
>
> 			first-use	last-use	use-length
> 	a:		#10		#29		20
> 	b:		#24		#24		 1
> 	c:		#14		#30		17
> 	d:		#23		#34		12
> 	e:		#11		#20		10
> 	f:		#15		#15		 1
> 	g:		#18		#27		10
> 	h:		#13		#36		24
>
> 	y0:		#11		#31		21
> 	y1:		#12		#33		22
> 	y2:		#15		#35		21
> 	y3:		#10		#36		27
>
> 	T1:		#16		#32		17
>
> The 'first-use' column shows the number of the instruction within the loop
> body where the register first gets used - with '#1' denoting the first
> instruction and #36 the last instruction; the 'last-use' column shows the
> last instruction, and the 'use-length' column shows the 'window' in which a
> register is used.
>
> What we want are the registers that are used the most tightly, i.e. these
> two:
>
> 	b:		#24		#24		 1
> 	f:		#15		#15		 1
>
> Of these two 'f' is the best one, because it has an earlier use and a longer
> cooldown.
>
> If we alias 'TBL' with 'f' then we could reload 'TBL' for the next iteration
> very early on:
>
> 	mov	f, y2		# y2 = f		# CH
> +	mov	frame_TBL(%rsp), TBL
> 	rorx	$34, a, T1	# T1 = a >> 34		# S0B
>
> And there will be 21 instructions that don't depend on TBL after this,
> plenty of time for the load to be generated and propagated.
>
> NOTE: my pseudo-patch is naive, due to the complication caused by the
> RotateState macro name rotation. It's still fundamentally possible I
> believe, it's just that 'TBL' has to be rotated too, together with the other
> variables.
>
> 3)
>
> If even this does not help, because the workload is ucode-cache limited, and
> the extra reloads pushed the critical path just beyond some cache limit,
> then another experiment to try would be to roll _back_ the loop some more:
> instead of 4x FOUR_ROUNDS_AND_SCHED unrolled loops, try just having 2.
>
> The CPU should still be smart enough with 2x interleaving of the loop body,
> and the extra branches should be relatively small and we could get back some
> performance.
>
> In theory ...
>
> 4)
>
> If the workload is
Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
* Ingo Molnar wrote:

> 1)
>
> Note how R12 is used immediately, right in the next instruction:
>
> 	vpaddq	(TBL), Y_0, XFER
>
> I.e. the RBP fixes lengthen the program order data dependencies - that's a
> new constraint and a few extra cycles per loop iteration if the workload is
> address-generator bandwidth limited on that.
>
> A simple way to ease that constraint would be to move the 'TBL' load up into
> the loop body, to the point where 'T1' is used for the last time - which is:
>
> 	mov	a, T1		# T1 = a		# MAJB
> 	and	c, T1		# T1 = a&c		# MAJB
>
> 	add	y0, y2		# y2 = S1 + CH		# --
> 	or	T1, y3		# y3 = MAJ = ((a|c)&b)|(a&c)	# MAJ
>
> +	mov	frame_TBL(%rsp), TBL
>
> 	add	y1, h		# h = k + w + h + S0	# --
>
> 	add	y2, d		# d = k + w + h + d + S1 + CH = d + t1	# --
>
> 	add	y2, h		# h = k + w + h + S0 + S1 + CH = t1 + S0	# --
> 	add	y3, h		# h = t1 + S0 + MAJ	# --
>
> Note how this moves up the 'TBL' reload by 4 instructions.

Note that in this case 'TBL' would have to be initialized before the 1st
iteration, via something like:

	movq	$4, frame_SRND(%rsp)

+	mov	frame_TBL(%rsp), TBL

.align 16
loop1:
	vpaddq	(TBL), Y_0, XFER
	vmovdqa	XFER, frame_XFER(%rsp)
	FOUR_ROUNDS_AND_SCHED

Thanks,

	Ingo
Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
* Josh Poimboeuf wrote:

> I'm still looking at the other one (sha512-avx2), but so far I haven't
> found a way to speed it back up.

Here's a couple of very quick observations with possible optimizations:

AFAICS the main effect of the RBP fixes is the introduction of a memory load
into the critical path, into the body of the unrolled loop:

+	mov	frame_TBL(%rsp), TBL
	vpaddq	(TBL), Y_0, XFER
	vmovdqa	XFER, frame_XFER(%rsp)
	FOUR_ROUNDS_AND_SCHED

Both 'TBL' and 'T1' are mapped to R12, which is why TBL has to be spilled to
the stack and reloaded from there.

1)

Note how R12 is used immediately, right in the next instruction:

	vpaddq	(TBL), Y_0, XFER

I.e. the RBP fixes lengthen the program order data dependencies - that's a new
constraint and a few extra cycles per loop iteration if the workload is
address-generator bandwidth limited on that.

A simple way to ease that constraint would be to move the 'TBL' load up into
the loop body, to the point where 'T1' is used for the last time - which is:

	mov	a, T1		# T1 = a		# MAJB
	and	c, T1		# T1 = a&c		# MAJB

	add	y0, y2		# y2 = S1 + CH		# --
	or	T1, y3		# y3 = MAJ = ((a|c)&b)|(a&c)	# MAJ

+	mov	frame_TBL(%rsp), TBL

	add	y1, h		# h = k + w + h + S0	# --

	add	y2, d		# d = k + w + h + d + S1 + CH = d + t1	# --

	add	y2, h		# h = k + w + h + S0 + S1 + CH = t1 + S0	# --
	add	y3, h		# h = t1 + S0 + MAJ	# --

Note how this moves up the 'TBL' reload by 4 instructions.
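[Editor's note: the benefit of hoisting the reload can be sketched with a toy in-order latency model (not from the thread; the 4-cycle load latency is an assumed round number). Instructions issued between a load and its consumer hide part of the load's latency:]

```python
def stall_cycles(load_pos, use_pos, load_latency=4):
    """Cycles a naive in-order pipeline stalls waiting for a load's result.

    load_pos/use_pos are instruction positions in program order; each
    intervening independent instruction hides one cycle of load latency.
    """
    return max(0, load_latency - (use_pos - load_pos))

# Reload immediately before the vpaddq that consumes TBL: almost nothing
# hides the load latency, so the consumer stalls.
print(stall_cycles(load_pos=0, use_pos=1))  # 3

# Hoisted 4 instructions earlier, as in the pseudo-patch above: the
# intervening independent instructions fully hide the latency.
print(stall_cycles(load_pos=0, use_pos=5))  # 0
```

Real out-of-order cores hide much of this on their own, which is why the observed cost is a few percent rather than a hard stall per iteration; the model only illustrates the direction of the effect.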
2) If this does not get back performance, then maybe another reason is that it's cache access latency limited, in which case a more involved optimization would be to look at the register patterns and usages: first-use last-useuse-length a: #10 #29 20 b: #24 #24 1 c: #14 #30 17 d: #23 #34 12 e: #11 #20 10 f: #15 #15 1 g: #18 #27 10 h: #13 #36 24 y0: #11 #31 21 y1: #12 #33 22 y2: #15 #35 21 y3: #10 #36 27 T1: #16 #32 17 The 'first-use' colums shows the number of the instruction within the loop body that the register gets used - with '#1' denoting the first instruction ad #36 the last instruction, the 'last-use' column is showing the last instruction, and the 'use-length' colum shows the 'window' in which a register is used. What we want are the registers that are used the most tightly, i.e. these two: b: #24 #24 1 f: #15 #15 1 Of these two 'f' is the best one, because it has an earlier use and longer cooldown. If alias 'TBL' with 'f' then we could reload 'TLB' for the next iteration very early on: mov f, y2 # y2 = f# CH + mov frame_TBL(%rsp), TBL rorx$34, a, T1 # T1 = a >> 34 # S0B And there will be 21 instructions that don't depend on TLB after this, plenty of time for the load to be generated and propagated. NOTE: my pseudo-patch is naive, due to the complication caused by the RotateState macro name rotation. It's still fundamentally possible I believe, it's just that 'TBL' has to be rotated too, together with the other varibles. 3) If even this does not help, because the workload is ucode-cache limited, and the extra reloads pushed the critical path just beyond some cache limit, then another experiment to try would be to roll _back_ the loop some more: instead of 4x FOUR_ROUNDS_AND_SCHED unrolled loops, try just having 2. The CPU should still be smart enough with 2x interleaving of the loop body, and the extra branches should be relatively small and we could get back some performance. In theory ... 
4)

If the workload is fundamentally cache-port bandwidth limited, then the extra
loads from memory to reload 'TBL' take away valuable bandwidth. There's no
easy fix for that, other than to find an unused register.

Here's the (initial, pre-rotation) integer register mappings:

	a:	RAX
	b:	RBX
Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
On Wed, Sep 13, 2017 at 04:24:28PM -0500, Josh Poimboeuf wrote:
> On Fri, Sep 08, 2017 at 10:57:05AM -0700, Eric Biggers wrote:
> > On Thu, Sep 07, 2017 at 11:26:47PM +0200, Ingo Molnar wrote:
> > >
> > > * Eric Biggers wrote:
> > >
> > > > On Thu, Sep 07, 2017 at 09:15:34AM +0200, Ingo Molnar wrote:
> > > > >
> > > > > * Eric Biggers wrote:
> > > > >
> > > > > > Thanks for fixing these!  I don't have time to review these in
> > > > > > detail, but I ran the crypto self-tests on the affected
> > > > > > algorithms, and they all pass.  I also benchmarked them before
> > > > > > and after; the only noticeable performance difference was that
> > > > > > sha256-avx2 and sha512-avx2 became a few percent slower.  I
> > > > > > don't suppose there is a way around that?  Otherwise it's
> > > > > > probably not a big deal.
> > > > >
> > > > > Which CPU model did you use for the test?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > 	Ingo
> > > >
> > > > This was on Haswell, "Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz".
> > >
> > > Any chance to test this with the latest microarchitecture - any
> > > Skylake derivative Intel CPU you have access to?
> > >
> > > Thanks,
> > >
> > > 	Ingo
> >
> > Tested with Skylake, "Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz".  The
> > results were the following, which seemed a bit worse than Haswell:
> >
> > 	sha256-avx2 became 3.5% slower
> > 	sha512-avx2 became 7.5% slower
> >
> > Note: it's tricky to benchmark this, especially with just a few percent
> > difference, so don't read too much into the exact numbers.
>
> Here's a v2 for the sha256-avx2 patch, would you mind seeing if this
> closes the performance gap?
>
> I'm still looking at the other one (sha512-avx2), but so far I haven't
> found a way to speed it back up.

And here's v2 of the sha512-avx2 patch.  It should hopefully gain back
most of the performance lost by v1.
From: Josh Poimboeuf Subject: [PATCH] x86/crypto: Fix RBP usage in sha512-avx2-asm.S Using RBP as a temporary register breaks frame pointer convention and breaks stack traces when unwinding from an interrupt in the crypto code. Mix things up a little bit to get rid of the RBP usage, without destroying performance. Use RDI instead of RBP for the TBL pointer. That will clobber CTX, so save CTX on the stack and use RDI as CTX before it gets clobbered, and R12 as CTX after it gets clobbered. Also remove the unused y4 variable. Reported-by: Eric Biggers Reported-by: Peter Zijlstra Signed-off-by: Josh Poimboeuf --- arch/x86/crypto/sha512-avx2-asm.S | 75 --- 1 file changed, 39 insertions(+), 36 deletions(-) diff --git a/arch/x86/crypto/sha512-avx2-asm.S b/arch/x86/crypto/sha512-avx2-asm.S index 7f5f6c6ec72e..b16d56005162 100644 --- a/arch/x86/crypto/sha512-avx2-asm.S +++ b/arch/x86/crypto/sha512-avx2-asm.S @@ -69,8 +69,9 @@ XFER = YTMP0 BYTE_FLIP_MASK = %ymm9 -# 1st arg -CTX = %rdi +# 1st arg is %rdi, which is saved to the stack and accessed later via %r12 +CTX1= %rdi +CTX2= %r12 # 2nd arg INP = %rsi # 3rd arg @@ -81,7 +82,7 @@ d = %r8 e = %rdx y3 = %rsi -TBL = %rbp +TBL = %rdi # clobbers CTX1 a = %rax b = %rbx @@ -91,26 +92,26 @@ g = %r10 h = %r11 old_h = %r11 -T1= %r12 +T1= %r12 # clobbers CTX2 y0= %r13 y1= %r14 y2= %r15 -y4= %r12 - # Local variables (stack frame) XFER_SIZE = 4*8 SRND_SIZE = 1*8 INP_SIZE = 1*8 INPEND_SIZE = 1*8 +CTX_SIZE = 1*8 RSPSAVE_SIZE = 1*8 -GPRSAVE_SIZE = 6*8 +GPRSAVE_SIZE = 5*8 frame_XFER = 0 frame_SRND = frame_XFER + XFER_SIZE frame_INP = frame_SRND + SRND_SIZE frame_INPEND = frame_INP + INP_SIZE -frame_RSPSAVE = frame_INPEND + INPEND_SIZE +frame_CTX = frame_INPEND + INPEND_SIZE +frame_RSPSAVE = frame_CTX + CTX_SIZE frame_GPRSAVE = frame_RSPSAVE + RSPSAVE_SIZE frame_size = frame_GPRSAVE + GPRSAVE_SIZE @@ -576,12 +577,11 @@ ENTRY(sha512_transform_rorx) mov %rax, frame_RSPSAVE(%rsp) # Save GPRs - mov %rbp, frame_GPRSAVE(%rsp) - mov %rbx, 
8*1+frame_GPRSAVE(%rsp) - mov %r12, 8*2+frame_GPRSAVE(%rsp) - mov %r13, 8*3+frame_GPRSAVE(%rsp) - mov %r14, 8*4+frame_GPRSAVE(%rsp) - mov %r15, 8*5+frame_GPRSAVE(%rsp) + mov %rbx, 8*0+frame_GPRSAVE(%rsp) + mov %r12, 8*1+frame_GPRSAVE(%rsp) + mov %r13, 8*2+frame_GPRSAVE(%rsp) + mov %r14, 8*3+frame_GPRSAVE(%rsp) + mov %r15, 8*4+frame_GPRSAVE(%rsp) shl $7, NUM_BLKS# convert to bytes jz done_hash @@ -589,14 +589,17 @@ ENTRY(sha512_transform_rorx) mov NUM_BLKS, frame_INPEND(%rsp) ## load initial digest - mov 8*0(CTX),a - mov
Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
On Wed, Sep 13, 2017 at 04:24:28PM -0500, Josh Poimboeuf wrote: > On Fri, Sep 08, 2017 at 10:57:05AM -0700, Eric Biggers wrote: > > On Thu, Sep 07, 2017 at 11:26:47PM +0200, Ingo Molnar wrote: > > > > > > * Eric Biggers wrote: > > > > > > > On Thu, Sep 07, 2017 at 09:15:34AM +0200, Ingo Molnar wrote: > > > > > > > > > > * Eric Biggers wrote: > > > > > > > > > > > Thanks for fixing these! I don't have time to review these in > > > > > > detail, but I ran > > > > > > the crypto self-tests on the affected algorithms, and they all > > > > > > pass. I also > > > > > > benchmarked them before and after; the only noticable performance > > > > > > difference was > > > > > > that sha256-avx2 and sha512-avx2 became a few percent slower. I > > > > > > don't suppose > > > > > > there is a way around that? Otherwise it's probably not a big deal. > > > > > > > > > > Which CPU model did you use for the test? > > > > > > > > > > Thanks, > > > > > > > > > > Ingo > > > > > > > > This was on Haswell, "Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz". > > > > > > Any chance to test this with the latest microarchitecture - any Skylake > > > derivative > > > Intel CPU you have access to? > > > > > > Thanks, > > > > > > Ingo > > > > Tested with Skylake, "Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz". The > > results > > were the following which seemed a bit worse than Haswell: > > > > sha256-avx2 became 3.5% slower > > sha512-avx2 became 7.5% slower > > > > Note: it's tricky to benchmark this, especially with just a few percent > > difference, so don't read too much into the exact numbers. > > Here's a v2 for the sha256-avx2 patch, would you mind seeing if this > closes the performance gap? > > I'm still looking at the other one (sha512-avx2), but so far I haven't > found a way to speed it back up. And here's v2 of the sha512-avx2 patch. It should hopefully gain back most of the performance lost by v1. 
From: Josh Poimboeuf Subject: [PATCH] x86/crypto: Fix RBP usage in sha512-avx2-asm.S Using RBP as a temporary register breaks frame pointer convention and breaks stack traces when unwinding from an interrupt in the crypto code. Mix things up a little bit to get rid of the RBP usage, without destroying performance. Use RDI instead of RBP for the TBL pointer. That will clobber CTX, so save CTX on the stack and use RDI as CTX before it gets clobbered, and R12 as CTX after it gets clobbered. Also remove the unused y4 variable. Reported-by: Eric Biggers Reported-by: Peter Zijlstra Signed-off-by: Josh Poimboeuf --- arch/x86/crypto/sha512-avx2-asm.S | 75 --- 1 file changed, 39 insertions(+), 36 deletions(-) diff --git a/arch/x86/crypto/sha512-avx2-asm.S b/arch/x86/crypto/sha512-avx2-asm.S index 7f5f6c6ec72e..b16d56005162 100644 --- a/arch/x86/crypto/sha512-avx2-asm.S +++ b/arch/x86/crypto/sha512-avx2-asm.S @@ -69,8 +69,9 @@ XFER = YTMP0 BYTE_FLIP_MASK = %ymm9 -# 1st arg -CTX = %rdi +# 1st arg is %rdi, which is saved to the stack and accessed later via %r12 +CTX1= %rdi +CTX2= %r12 # 2nd arg INP = %rsi # 3rd arg @@ -81,7 +82,7 @@ d = %r8 e = %rdx y3 = %rsi -TBL = %rbp +TBL = %rdi # clobbers CTX1 a = %rax b = %rbx @@ -91,26 +92,26 @@ g = %r10 h = %r11 old_h = %r11 -T1= %r12 +T1= %r12 # clobbers CTX2 y0= %r13 y1= %r14 y2= %r15 -y4= %r12 - # Local variables (stack frame) XFER_SIZE = 4*8 SRND_SIZE = 1*8 INP_SIZE = 1*8 INPEND_SIZE = 1*8 +CTX_SIZE = 1*8 RSPSAVE_SIZE = 1*8 -GPRSAVE_SIZE = 6*8 +GPRSAVE_SIZE = 5*8 frame_XFER = 0 frame_SRND = frame_XFER + XFER_SIZE frame_INP = frame_SRND + SRND_SIZE frame_INPEND = frame_INP + INP_SIZE -frame_RSPSAVE = frame_INPEND + INPEND_SIZE +frame_CTX = frame_INPEND + INPEND_SIZE +frame_RSPSAVE = frame_CTX + CTX_SIZE frame_GPRSAVE = frame_RSPSAVE + RSPSAVE_SIZE frame_size = frame_GPRSAVE + GPRSAVE_SIZE @@ -576,12 +577,11 @@ ENTRY(sha512_transform_rorx) mov %rax, frame_RSPSAVE(%rsp) # Save GPRs - mov %rbp, frame_GPRSAVE(%rsp) - mov %rbx, 
8*1+frame_GPRSAVE(%rsp) - mov %r12, 8*2+frame_GPRSAVE(%rsp) - mov %r13, 8*3+frame_GPRSAVE(%rsp) - mov %r14, 8*4+frame_GPRSAVE(%rsp) - mov %r15, 8*5+frame_GPRSAVE(%rsp) + mov %rbx, 8*0+frame_GPRSAVE(%rsp) + mov %r12, 8*1+frame_GPRSAVE(%rsp) + mov %r13, 8*2+frame_GPRSAVE(%rsp) + mov %r14, 8*3+frame_GPRSAVE(%rsp) + mov %r15, 8*4+frame_GPRSAVE(%rsp) shl $7, NUM_BLKS# convert to bytes jz done_hash @@ -589,14 +589,17 @@ ENTRY(sha512_transform_rorx) mov NUM_BLKS, frame_INPEND(%rsp) ## load initial digest - mov 8*0(CTX),a - mov 8*1(CTX),b - mov 8*2(CTX),c - mov 8*3(CTX),d - mov 8*4(CTX),e - mov 8*5(CTX),f -
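For readers skimming the diff above, the core register-juggling move of the v2 sha512-avx2 patch can be sketched as follows. This is simplified pseudo-assembly distilled from the hunks above, not a literal excerpt; the label and symbol names (CTX1, CTX2, frame_CTX) are the patch's own:

```asm
	# On entry the SHA-512 state pointer arrives in %rdi (CTX1).
	# %rdi is later reused as the TBL pointer, so the state pointer
	# is stashed in the new frame_CTX stack slot first:
	mov	CTX1, frame_CTX(%rsp)	# save ctx before %rdi is clobbered

	# ... the rounds then run with TBL = %rdi ...

	# Whenever the digest must be accessed again, the saved pointer
	# is reloaded into %r12 (CTX2) instead of touching %rbp:
	mov	frame_CTX(%rsp), CTX2
	mov	8*0(CTX2), a		# digest accesses now go via %r12
```

The design choice here is that %rdi is dead as an argument register once CTX is spilled, so reusing it for TBL frees %rbp without needing any additional callee-saved register beyond %r12.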
Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
On Fri, Sep 08, 2017 at 10:57:05AM -0700, Eric Biggers wrote: > On Thu, Sep 07, 2017 at 11:26:47PM +0200, Ingo Molnar wrote: > > > > * Eric Biggers wrote: > > > > > On Thu, Sep 07, 2017 at 09:15:34AM +0200, Ingo Molnar wrote: > > > > > > > > * Eric Biggers wrote: > > > > > > > > > Thanks for fixing these! I don't have time to review these in > > > > > detail, but I ran > > > > > the crypto self-tests on the affected algorithms, and they all pass. > > > > > I also > > > > > benchmarked them before and after; the only noticable performance > > > > > difference was > > > > > that sha256-avx2 and sha512-avx2 became a few percent slower. I > > > > > don't suppose > > > > > there is a way around that? Otherwise it's probably not a big deal. > > > > > > > > Which CPU model did you use for the test? > > > > > > > > Thanks, > > > > > > > > Ingo > > > > > > This was on Haswell, "Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz". > > > > Any chance to test this with the latest microarchitecture - any Skylake > > derivative > > Intel CPU you have access to? > > > > Thanks, > > > > Ingo > > Tested with Skylake, "Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz". The results > were the following which seemed a bit worse than Haswell: > > sha256-avx2 became 3.5% slower > sha512-avx2 became 7.5% slower > > Note: it's tricky to benchmark this, especially with just a few percent > difference, so don't read too much into the exact numbers. Here's a v2 for the sha256-avx2 patch, would you mind seeing if this closes the performance gap? I'm still looking at the other one (sha512-avx2), but so far I haven't found a way to speed it back up. From: Josh Poimboeuf Subject: [PATCH] x86/crypto: Fix RBP usage in sha256-avx2-asm.S Using RBP as a temporary register breaks frame pointer convention and breaks stack traces when unwinding from an interrupt in the crypto code. 
There's no need to use RBP as a temporary register for the TBL value, because it always stores the same value: the address of the K256 table. Instead just reference the address of K256 directly. Reported-by: Eric Biggers Reported-by: Peter Zijlstra Signed-off-by: Josh Poimboeuf --- arch/x86/crypto/sha256-avx2-asm.S | 22 +++--- 1 file changed, 7 insertions(+), 15 deletions(-) diff --git a/arch/x86/crypto/sha256-avx2-asm.S b/arch/x86/crypto/sha256-avx2-asm.S index 89c8f09787d2..1420db15dcdd 100644 --- a/arch/x86/crypto/sha256-avx2-asm.S +++ b/arch/x86/crypto/sha256-avx2-asm.S @@ -98,8 +98,6 @@ d = %r8d e = %edx # clobbers NUM_BLKS y3 = %esi # clobbers INP - -TBL= %rbp SRND = CTX # SRND is same register as CTX a = %eax @@ -531,7 +529,6 @@ STACK_SIZE = _RSP + _RSP_SIZE ENTRY(sha256_transform_rorx) .align 32 pushq %rbx - pushq %rbp pushq %r12 pushq %r13 pushq %r14 @@ -568,8 +565,6 @@ ENTRY(sha256_transform_rorx) mov CTX, _CTX(%rsp) loop0: - lea K256(%rip), TBL - ## Load first 16 dwords from two blocks VMOVDQ 0*32(INP),XTMP0 VMOVDQ 1*32(INP),XTMP1 @@ -597,19 +592,19 @@ last_block_enter: .align 16 loop1: - vpaddd 0*32(TBL, SRND), X0, XFER + vpaddd K256+0*32(SRND), X0, XFER vmovdqa XFER, 0*32+_XFER(%rsp, SRND) FOUR_ROUNDS_AND_SCHED _XFER + 0*32 - vpaddd 1*32(TBL, SRND), X0, XFER + vpaddd K256+1*32(SRND), X0, XFER vmovdqa XFER, 1*32+_XFER(%rsp, SRND) FOUR_ROUNDS_AND_SCHED _XFER + 1*32 - vpaddd 2*32(TBL, SRND), X0, XFER + vpaddd K256+2*32(SRND), X0, XFER vmovdqa XFER, 2*32+_XFER(%rsp, SRND) FOUR_ROUNDS_AND_SCHED _XFER + 2*32 - vpaddd 3*32(TBL, SRND), X0, XFER + vpaddd K256+3*32(SRND), X0, XFER vmovdqa XFER, 3*32+_XFER(%rsp, SRND) FOUR_ROUNDS_AND_SCHED _XFER + 3*32 @@ -619,10 +614,11 @@ loop1: loop2: ## Do last 16 rounds with no scheduling - vpaddd 0*32(TBL, SRND), X0, XFER + vpaddd K256+0*32(SRND), X0, XFER vmovdqa XFER, 0*32+_XFER(%rsp, SRND) DO_4ROUNDS _XFER + 0*32 - vpaddd 1*32(TBL, SRND), X1, XFER + + vpaddd K256+1*32(SRND), X1, XFER vmovdqa XFER, 1*32+_XFER(%rsp, SRND) 
DO_4ROUNDS _XFER + 1*32 add $2*32, SRND @@ -676,9 +672,6 @@ loop3: ja done_hash do_last_block: - do last block - lea K256(%rip), TBL - VMOVDQ 0*16(INP),XWORD0 VMOVDQ 1*16(INP),XWORD1 VMOVDQ 2*16(INP),XWORD2 @@ -718,7 +711,6 @@ done_hash: popq%r14 popq%r13 popq%r12 - popq%rbp popq%rbx ret ENDPROC(sha256_transform_rorx) -- 2.13.5
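The sha256-avx2 change above boils down to folding the constant table's address into the addressing mode, so no register has to hold it at all. Schematically (distilled from the hunks above; TBL and SRND are the file's own symbols):

```asm
	# Before: a dedicated base register (TBL = %rbp) held the address
	# of the K256 round-constant table:
	lea	K256(%rip), TBL
	vpaddd	0*32(TBL, SRND), X0, XFER

	# After: K256 is a link-time constant, so its address can be
	# encoded as an absolute displacement, with SRND alone doing the
	# indexing -- no base register, and %rbp stays untouched:
	vpaddd	K256+0*32(SRND), X0, XFER
```

This is why the sha256-avx2 fix costs so little: the `lea` disappears entirely and the loads merely trade one addressing form for another, rather than spilling anything to the stack.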
Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
On Thu, Sep 07, 2017 at 11:26:47PM +0200, Ingo Molnar wrote: > > * Eric Biggers wrote: > > > On Thu, Sep 07, 2017 at 09:15:34AM +0200, Ingo Molnar wrote: > > > > > > * Eric Biggers wrote: > > > > > > > Thanks for fixing these! I don't have time to review these in detail, > > > > but I ran > > > > the crypto self-tests on the affected algorithms, and they all pass. I > > > > also > > > > benchmarked them before and after; the only noticable performance > > > > difference was > > > > that sha256-avx2 and sha512-avx2 became a few percent slower. I don't > > > > suppose > > > > there is a way around that? Otherwise it's probably not a big deal. > > > > > > Which CPU model did you use for the test? > > > > > > Thanks, > > > > > > Ingo > > > > This was on Haswell, "Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz". > > Any chance to test this with the latest microarchitecture - any Skylake > derivative > Intel CPU you have access to? > > Thanks, > > Ingo Tested with Skylake, "Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz". The results were the following which seemed a bit worse than Haswell: sha256-avx2 became 3.5% slower sha512-avx2 became 7.5% slower Note: it's tricky to benchmark this, especially with just a few percent difference, so don't read too much into the exact numbers. Eric
Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
* Eric Biggers wrote: > On Thu, Sep 07, 2017 at 09:15:34AM +0200, Ingo Molnar wrote: > > > > * Eric Biggers wrote: > > > > > Thanks for fixing these! I don't have time to review these in detail, > > > but I ran > > > the crypto self-tests on the affected algorithms, and they all pass. I > > > also > > > benchmarked them before and after; the only noticable performance > > > difference was > > > that sha256-avx2 and sha512-avx2 became a few percent slower. I don't > > > suppose > > > there is a way around that? Otherwise it's probably not a big deal. > > > > Which CPU model did you use for the test? > > > > Thanks, > > > > Ingo > > This was on Haswell, "Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz". Any chance to test this with the latest microarchitecture - any Skylake derivative Intel CPU you have access to? Thanks, Ingo
Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
On Thu, Sep 07, 2017 at 09:15:34AM +0200, Ingo Molnar wrote: > > * Eric Biggers wrote: > > > Thanks for fixing these! I don't have time to review these in detail, but > > I ran > > the crypto self-tests on the affected algorithms, and they all pass. I also > > benchmarked them before and after; the only noticable performance > > difference was > > that sha256-avx2 and sha512-avx2 became a few percent slower. I don't > > suppose > > there is a way around that? Otherwise it's probably not a big deal. > > Which CPU model did you use for the test? > > Thanks, > > Ingo This was on Haswell, "Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz". Eric
Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
* Eric Biggers wrote: > Thanks for fixing these! I don't have time to review these in detail, but I > ran > the crypto self-tests on the affected algorithms, and they all pass. I also > benchmarked them before and after; the only noticable performance difference > was > that sha256-avx2 and sha512-avx2 became a few percent slower. I don't suppose > there is a way around that? Otherwise it's probably not a big deal. Which CPU model did you use for the test? Thanks, Ingo
Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
On Fri, Sep 01, 2017 at 05:09:19PM -0700, Eric Biggers wrote: > Hi Josh, > > On Tue, Aug 29, 2017 at 01:05:33PM -0500, Josh Poimboeuf wrote: > > Many of the x86 crypto functions use RBP as a temporary register. This > > breaks frame pointer convention, and breaks stack traces when unwinding > > from an interrupt in the crypto code. > > > > Convert most* of them to leave RBP alone. > > > > These pass the crypto boot tests for me. Any further testing would be > > appreciated! > > > > [*] There are still a few crypto files left that need fixing, but the > > fixes weren't trivial and nobody reported unwinder warnings about > > them yet, so I'm skipping them for now. > > > > Josh Poimboeuf (12): > > x86/crypto: Fix RBP usage in blowfish-x86_64-asm_64.S > > x86/crypto: Fix RBP usage in camellia-x86_64-asm_64.S > > x86/crypto: Fix RBP usage in cast5-avx-x86_64-asm_64.S > > x86/crypto: Fix RBP usage in cast6-avx-x86_64-asm_64.S > > x86/crypto: Fix RBP usage in des3_ede-asm_64.S > > x86/crypto: Fix RBP usage in sha1_avx2_x86_64_asm.S > > x86/crypto: Fix RBP usage in sha1_ssse3_asm.S > > x86/crypto: Fix RBP usage in sha256-avx-asm.S > > x86/crypto: Fix RBP usage in sha256-avx2-asm.S > > x86/crypto: Fix RBP usage in sha256-ssse3-asm.S > > x86/crypto: Fix RBP usage in sha512-avx2-asm.S > > x86/crypto: Fix RBP usage in twofish-avx-x86_64-asm_64.S > > > > Thanks for fixing these! I don't have time to review these in detail, but I > ran > the crypto self-tests on the affected algorithms, and they all pass. I also > benchmarked them before and after; the only noticable performance difference > was > that sha256-avx2 and sha512-avx2 became a few percent slower. I don't suppose > there is a way around that? Otherwise it's probably not a big deal. Thanks for testing. I might have a way to make sha256-avx2 faster, but not sha512-avx2. I'll let you know when I have a patch. -- Josh
Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
Hi Josh, On Tue, Aug 29, 2017 at 01:05:33PM -0500, Josh Poimboeuf wrote: > Many of the x86 crypto functions use RBP as a temporary register. This > breaks frame pointer convention, and breaks stack traces when unwinding > from an interrupt in the crypto code. > > Convert most* of them to leave RBP alone. > > These pass the crypto boot tests for me. Any further testing would be > appreciated! > > [*] There are still a few crypto files left that need fixing, but the > fixes weren't trivial and nobody reported unwinder warnings about > them yet, so I'm skipping them for now. > > Josh Poimboeuf (12): > x86/crypto: Fix RBP usage in blowfish-x86_64-asm_64.S > x86/crypto: Fix RBP usage in camellia-x86_64-asm_64.S > x86/crypto: Fix RBP usage in cast5-avx-x86_64-asm_64.S > x86/crypto: Fix RBP usage in cast6-avx-x86_64-asm_64.S > x86/crypto: Fix RBP usage in des3_ede-asm_64.S > x86/crypto: Fix RBP usage in sha1_avx2_x86_64_asm.S > x86/crypto: Fix RBP usage in sha1_ssse3_asm.S > x86/crypto: Fix RBP usage in sha256-avx-asm.S > x86/crypto: Fix RBP usage in sha256-avx2-asm.S > x86/crypto: Fix RBP usage in sha256-ssse3-asm.S > x86/crypto: Fix RBP usage in sha512-avx2-asm.S > x86/crypto: Fix RBP usage in twofish-avx-x86_64-asm_64.S > Thanks for fixing these! I don't have time to review these in detail, but I ran the crypto self-tests on the affected algorithms, and they all pass. I also benchmarked them before and after; the only noticeable performance difference was that sha256-avx2 and sha512-avx2 became a few percent slower. I don't suppose there is a way around that? Otherwise it's probably not a big deal. 
For reference, this was the list of affected algorithms I used: shash: sha1-ssse3, sha1-avx, sha1-avx2, sha256-ssse3, sha256-avx, sha256-avx2, sha512-ssse3, sha512-avx, sha512-avx2 cipher: blowfish-asm, camellia-asm, des3_ede-asm skcipher: ecb-blowfish-asm, cbc-blowfish-asm, ctr-blowfish-asm, ecb-cast5-avx, cbc-cast5-avx, ctr-cast5-avx, ecb-cast6-avx, cbc-cast6-avx, ctr-cast6-avx, lrw-cast6-avx, xts-cast6-avx, ecb-camellia-asm, cbc-camellia-asm, ctr-camellia-asm, lrw-camellia-asm, xts-camellia-asm, ecb-des3_ede-asm, cbc-des3_ede-asm, ctr-des3_ede-asm, ecb-twofish-avx, cbc-twofish-avx, ctr-twofish-avx, lrw-twofish-avx, xts-twofish-avx Eric
[PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files
Many of the x86 crypto functions use RBP as a temporary register. This breaks frame pointer convention, and breaks stack traces when unwinding from an interrupt in the crypto code. Convert most* of them to leave RBP alone. These pass the crypto boot tests for me. Any further testing would be appreciated! [*] There are still a few crypto files left that need fixing, but the fixes weren't trivial and nobody reported unwinder warnings about them yet, so I'm skipping them for now. Josh Poimboeuf (12): x86/crypto: Fix RBP usage in blowfish-x86_64-asm_64.S x86/crypto: Fix RBP usage in camellia-x86_64-asm_64.S x86/crypto: Fix RBP usage in cast5-avx-x86_64-asm_64.S x86/crypto: Fix RBP usage in cast6-avx-x86_64-asm_64.S x86/crypto: Fix RBP usage in des3_ede-asm_64.S x86/crypto: Fix RBP usage in sha1_avx2_x86_64_asm.S x86/crypto: Fix RBP usage in sha1_ssse3_asm.S x86/crypto: Fix RBP usage in sha256-avx-asm.S x86/crypto: Fix RBP usage in sha256-avx2-asm.S x86/crypto: Fix RBP usage in sha256-ssse3-asm.S x86/crypto: Fix RBP usage in sha512-avx2-asm.S x86/crypto: Fix RBP usage in twofish-avx-x86_64-asm_64.S arch/x86/crypto/blowfish-x86_64-asm_64.S| 48 ++- arch/x86/crypto/camellia-x86_64-asm_64.S| 26 +++ arch/x86/crypto/cast5-avx-x86_64-asm_64.S | 47 +-- arch/x86/crypto/cast6-avx-x86_64-asm_64.S | 50 - arch/x86/crypto/des3_ede-asm_64.S | 15 + arch/x86/crypto/sha1_avx2_x86_64_asm.S | 4 +-- arch/x86/crypto/sha1_ssse3_asm.S| 11 +++ arch/x86/crypto/sha256-avx-asm.S| 15 - arch/x86/crypto/sha256-avx2-asm.S | 16 - arch/x86/crypto/sha256-ssse3-asm.S | 15 - arch/x86/crypto/sha512-avx2-asm.S | 21 arch/x86/crypto/twofish-avx-x86_64-asm_64.S | 12 +++ 12 files changed, 160 insertions(+), 120 deletions(-) -- 2.13.5
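As background for why the cover letter speaks of "frame pointer convention": with CONFIG_FRAME_POINTER, the kernel's unwinder expects %rbp in every frame to point at the caller's saved %rbp, forming a linked list it can walk. A hedged sketch of how clobbering %rbp breaks this (illustrative only, not taken from any of the patched files):

```asm
	# Expected shape of a frame-pointer-preserving function:
func:
	push	%rbp			# save caller's frame pointer
	mov	%rsp, %rbp		# %rbp now anchors this frame;
					# (%rbp) holds the caller's %rbp,
					# so frames form a walkable chain
	# ...
	# If the body instead borrows %rbp as a scratch register:
	mov	$0x1234, %rbp		# frame chain is now broken
	# ...and an interrupt lands here, the unwinder dereferences
	# whatever scratch value %rbp holds and the trace stops (or
	# reads garbage).
	pop	%rbp
	ret
```

The fixes in this series therefore either move the temporary into an unused caller-clobbered register, or spill it to a stack slot, as the sha256/sha512 patches in this thread illustrate.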