Re: WARNING: kernel stack regs has bad 'bp' value (3)

2018-05-14 Thread Josh Poimboeuf
On Sat, May 12, 2018 at 12:11:17PM +0200, Ard Biesheuvel wrote:
> On 12 May 2018 at 11:50, Dmitry Vyukov  wrote:
> > On Sat, May 12, 2018 at 11:09 AM, Ard Biesheuvel
> >  wrote:
> >> (+ Arnd)
> >>
> >> On 12 May 2018 at 10:43, Dmitry Vyukov  wrote:
> >>> On Fri, Feb 2, 2018 at 11:18 PM, Eric Biggers  wrote:
>  On Fri, Feb 02, 2018 at 02:57:32PM +0100, Dmitry Vyukov wrote:
> > On Fri, Feb 2, 2018 at 2:48 PM, syzbot
> >  wrote:
> > > Hello,
> > >
> > > syzbot hit the following crash on upstream commit
> > > 7109a04eae81c41ed529da9f3c48c3655ccea741 (Thu Feb 1 17:37:30 2018 +0000)
> > > Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide
> > >
> > > So far this crash happened 4 times on net-next, upstream.
> > > C reproducer is attached.
> > > syzkaller reproducer is attached.
> > > Raw console output is attached.
> > > compiler: gcc (GCC) 7.1.1 20170620
> > > .config is attached.
> >
> >
> > From suspicious frames I see salsa20_asm_crypt there, so +crypto 
> > maintainers.
> >
> 
>  Looks like the x86 implementations of Salsa20 (both i586 and x86_64) 
>  need to be
>  updated to not use %ebp/%rbp.
> >>>
> >>> Ard,
> >>>
> >>> This was bisected as introduced by:
> >>>
> >>> commit 83dee2ce1ae791c3dc0c9d4d3a8d42cb109613f6
> >>> Author: Ard Biesheuvel 
> >>> Date:   Fri Jan 19 12:04:34 2018 +
> >>>
> >>> crypto: sha3-generic - rewrite KECCAK transform to help the
> >>> compiler optimize
> >>>
> >>> https://gist.githubusercontent.com/dvyukov/47f93f5a0679170dddf93bc019b42f6d/raw/65beac8ddd30003bbd4e9729236dc8572094abf7/gistfile1.txt
> >>
> >> Ouch.
> >>
> >> I'm not an expert in x86 assembly. Could someone please check the
> >> generated code to see what's going on? The C code changes are not that
> >> intricate; they basically unroll a loop, replacing accesses to
> >> 'array[indirect_index[i]]' with 'array[constant]'.
> >>
> >> As mentioned in the commit log, the speedup is more than significant
> >> for architectures with lots of GPRs so I'd prefer fixing the patch
> >> over reverting it (if there is anything wrong with the code in the
> >> first place)
> >
> > I suspect the problem is with __attribute__((__optimize__("O3"))). It
> > makes the compiler use the rbp register, which must not be used.
> 
> IIRC, the additional speedup from adding that was significant but not
> huge. Given that we don't use O3 anywhere else, I guess we should just
> remove it.
> 
> Could you please check whether that makes the issue go away?
> 
> If so,
> 
> Acked-by: Ard Biesheuvel 
> 
> for any patch that removes the O3 attribute override from keccakf()

The issue only affects CONFIG_FRAME_POINTER (which is no longer the
default on x86-64), so maybe -O3 could only be enabled for
CONFIG_FRAME_POINTER=n, in which case you'd still get the speedup with
the default ORC unwinder.

-- 
Josh
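
For illustration, a minimal sketch of the suggestion above (only apply the O3
override when frame pointers are disabled), assuming the keccakf(u64 st[25])
signature from crypto/sha3_generic.c; the SHA3_KECCAKF_ATTR macro name is made
up here:

#include <linux/types.h>

/*
 * Hypothetical sketch only: gate the -O3 override on the frame pointer
 * config, so CONFIG_FRAME_POINTER=y builds keep %rbp as a frame pointer
 * while the default ORC-unwinder config keeps the speedup.
 */
#ifdef CONFIG_FRAME_POINTER
#define SHA3_KECCAKF_ATTR	/* normal optimization level, %rbp left alone */
#else
#define SHA3_KECCAKF_ATTR __attribute__((__optimize__("O3")))
#endif

static void SHA3_KECCAKF_ATTR keccakf(u64 st[25])
{
	/* KECCAK round function body unchanged */
}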


Re: [PATCH] crypto: x86/twofish-3way - Fix %rbp usage

2017-12-19 Thread Josh Poimboeuf
On Mon, Dec 18, 2017 at 04:40:26PM -0800, Eric Biggers wrote:
> From: Eric Biggers <ebigg...@google.com>
> 
> Using %rbp as a temporary register breaks frame pointer convention and
> breaks stack traces when unwinding from an interrupt in the crypto code.
> 
> In twofish-3way, we can't simply replace %rbp with another register
> because there are none available.  Instead, we use the stack to hold the
> values that %rbp, %r11, and %r12 were holding previously.  Each of these
> values represents the half of the output from the previous Feistel round
> that is being passed on unchanged to the following round.  They are only
> used once per round, when they are exchanged with %rax, %rbx, and %rcx.
> 
> As a result, we free up 3 registers (one per block) and can reassign
> them so that %rbp is not used, and additionally %r14 and %r15 are not
> used so they do not need to be saved/restored.
> 
> There may be a small overhead caused by replacing 'xchg REG, REG' with
> the needed sequence 'mov MEM, REG; mov REG, MEM; mov REG, REG' once per
> round.  But, counterintuitively, when I tested "ctr-twofish-3way" on a
> Haswell processor, the new version was actually about 2% faster.
> (Perhaps 'xchg' is not as well optimized as plain moves.)
> 
> Reported-by: syzbot <syzkal...@googlegroups.com>
> Signed-off-by: Eric Biggers <ebigg...@google.com>

Thanks a lot for fixing this!

Reviewed-by: Josh Poimboeuf <jpoim...@redhat.com>

-- 
Josh
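
For illustration, a rough C analogue of the substitution Eric describes above
(the helper and its names are made up; the real change is in the twofish-3way
assembly, not C):

#include <linux/types.h>

/*
 * Sketch only: the old code exchanged two register-resident round values
 * with 'xchg REG, REG'; with one of them now living in a stack slot, each
 * round instead does 'mov MEM, REG; mov REG, MEM; mov REG, REG'.
 */
static inline void round_halves_swap(u64 *stack_slot, u64 *live_reg)
{
	u64 tmp = *stack_slot;		/* mov MEM, REG */

	*stack_slot = *live_reg;	/* mov REG, MEM */
	*live_reg = tmp;		/* mov REG, REG */
}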


Re: [lkp-robot] [x86/kconfig] 81d3871900: BUG:unable_to_handle_kernel

2017-10-13 Thread Josh Poimboeuf
On Fri, Oct 13, 2017 at 04:56:43PM +0300, Andrey Ryabinin wrote:
> On 10/13/2017 07:45 AM, Josh Poimboeuf wrote:
> > On Thu, Oct 12, 2017 at 12:05:04PM -0500, Christopher Lameter wrote:
> >> On Wed, 11 Oct 2017, Josh Poimboeuf wrote:
> >>
> >>> I failed to add the slab maintainers to CC on the last attempt.  Trying
> >>> again.
> >>
> >>
> >> Hmmm... Yea. SLOB is rarely used and tested. Good illustration of a simple
> >> allocator and the K&R mechanism that was used in the early kernels.
> >>
> >>>> Adding the slub maintainers.  Is slob still supposed to work?
> >>
> >> Have not seen anyone using it in a decade or so.
> >>
> >> Does the same config with SLUB and slub_debug on the commandline run
> >> cleanly?
> >>
> >>>> I have no idea how that crypto panic could be related to slob, but
> >>>> at least it goes away when I switch to slub.
> >>
> >> Can you run SLUB with full debug? specify slub_debug on the commandline or
> >> set CONFIG_SLUB_DEBUG_ON
> > 
> > Oddly enough, with CONFIG_SLUB+slub_debug, I get the same crypto panic I
> > got with CONFIG_SLOB.  The trapping instruction is:
> > 
> >   vmovdqa 0x140(%rdi),%xmm0
> 
> 
> It's an unaligned access. Look at %rdi. vmovdqa requires 16-byte alignment.
> Apparently, something fed kmalloc()'ed data here. But kmalloc() guarantees
> only sizeof(unsigned long) alignment. slub_debug changes SLUB's object
> layout, so what happened to be 16-byte aligned without slub_debug may
> become 8-byte aligned with slub_debug on.
> 
>
> > I'll try to bisect it tomorrow.  It at least goes back to v4.10.
> 
> Probably no point. I bet this bug was always here (since this code was added).
> 
> This could be fixed by an s/vmovdqa/vmovdqu/ change like below, but maybe
> the right fix would be to align the data properly?
> 
> ---
>  arch/x86/crypto/sha256-mb/sha256_mb_mgr_flush_avx2.S | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/crypto/sha256-mb/sha256_mb_mgr_flush_avx2.S 
> b/arch/x86/crypto/sha256-mb/sha256_mb_mgr_flush_avx2.S
> index 8fe6338bcc84..7fd5d9b568c7 100644
> --- a/arch/x86/crypto/sha256-mb/sha256_mb_mgr_flush_avx2.S
> +++ b/arch/x86/crypto/sha256-mb/sha256_mb_mgr_flush_avx2.S
> @@ -155,8 +155,8 @@ LABEL skip_ %I
>  .endr
>  
>   # Find min length
> - vmovdqa _lens+0*16(state), %xmm0
> - vmovdqa _lens+1*16(state), %xmm1
> + vmovdqu _lens+0*16(state), %xmm0
> + vmovdqu _lens+1*16(state), %xmm1
>  
>   vpminud %xmm1, %xmm0, %xmm2 # xmm2 has {D,C,B,A}
>   vpalignr $8, %xmm2, %xmm3, %xmm3# xmm3 has {x,x,D,C}
> @@ -176,8 +176,8 @@ LABEL skip_ %I
>   vpsubd  %xmm2, %xmm0, %xmm0
>   vpsubd  %xmm2, %xmm1, %xmm1
>  
> - vmovdqa %xmm0, _lens+0*16(state)
> - vmovdqa %xmm1, _lens+1*16(state)
> + vmovdqu %xmm0, _lens+0*16(state)
> + vmovdqu %xmm1, _lens+1*16(state)
>  
>   # "state" and "args" are the same address, arg1
>   # len is arg2
> -- 
> 2.13.6

Makes sense.  I can confirm that the above patch fixes the panic.

-- 
Josh
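
For illustration, a minimal C sketch of the alignment issue Andrey describes
(the struct and field names are made up, not the real sha256-mb layout):

#include <linux/types.h>

/*
 * Sketch only: a field read with vmovdqa must be 16-byte aligned.  kmalloc()
 * alone only guarantees ARCH_KMALLOC_MINALIGN, and slub_debug can shift the
 * object layout, so a field that happened to be 16-byte aligned may stop
 * being so.  Either the field is aligned explicitly, or the loads/stores
 * switch to the unaligned vmovdqu as in the patch above.
 */
struct mb_mgr_sketch {
	u32 lens[16] __attribute__((aligned(16)));	/* read with vmovdqa */
	u64 unused_lanes;
};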


[PATCH v2 01/12] x86/crypto: Fix RBP usage in blowfish-x86_64-asm_64.S

2017-09-18 Thread Josh Poimboeuf
Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Use R12 instead of RBP.  R12 can't be used as the RT0 register because
of x86 instruction encoding limitations.  So use R12 for CTX and RDI for
RT0.  This means that CTX is no longer an implicit function argument.
Instead it needs to be explicitly copied from RDI.

Reported-by: Eric Biggers <ebigg...@google.com>
Reported-by: Peter Zijlstra <pet...@infradead.org>
Tested-by: Eric Biggers <ebigg...@google.com>
Acked-by: Eric Biggers <ebigg...@google.com>
Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
---
 arch/x86/crypto/blowfish-x86_64-asm_64.S | 48 +---
 1 file changed, 26 insertions(+), 22 deletions(-)

diff --git a/arch/x86/crypto/blowfish-x86_64-asm_64.S 
b/arch/x86/crypto/blowfish-x86_64-asm_64.S
index 246c67006ed0..8c1fcb6bad21 100644
--- a/arch/x86/crypto/blowfish-x86_64-asm_64.S
+++ b/arch/x86/crypto/blowfish-x86_64-asm_64.S
@@ -33,7 +33,7 @@
 #define s3 ((16 + 2 + (3 * 256)) * 4)
 
 /* register macros */
-#define CTX %rdi
+#define CTX %r12
 #define RIO %rsi
 
 #define RX0 %rax
@@ -56,12 +56,12 @@
 #define RX2bh %ch
 #define RX3bh %dh
 
-#define RT0 %rbp
+#define RT0 %rdi
 #define RT1 %rsi
 #define RT2 %r8
 #define RT3 %r9
 
-#define RT0d %ebp
+#define RT0d %edi
 #define RT1d %esi
 #define RT2d %r8d
 #define RT3d %r9d
@@ -120,13 +120,14 @@
 
 ENTRY(__blowfish_enc_blk)
/* input:
-*  %rdi: ctx, CTX
+*  %rdi: ctx
 *  %rsi: dst
 *  %rdx: src
 *  %rcx: bool, if true: xor output
 */
-   movq %rbp, %r11;
+   movq %r12, %r11;
 
+   movq %rdi, CTX;
movq %rsi, %r10;
movq %rdx, RIO;
 
@@ -142,7 +143,7 @@ ENTRY(__blowfish_enc_blk)
round_enc(14);
add_roundkey_enc(16);
 
-   movq %r11, %rbp;
+   movq %r11, %r12;
 
movq %r10, RIO;
test %cl, %cl;
@@ -157,12 +158,13 @@ ENDPROC(__blowfish_enc_blk)
 
 ENTRY(blowfish_dec_blk)
/* input:
-*  %rdi: ctx, CTX
+*  %rdi: ctx
 *  %rsi: dst
 *  %rdx: src
 */
-   movq %rbp, %r11;
+   movq %r12, %r11;
 
+   movq %rdi, CTX;
movq %rsi, %r10;
movq %rdx, RIO;
 
@@ -181,7 +183,7 @@ ENTRY(blowfish_dec_blk)
movq %r10, RIO;
write_block();
 
-   movq %r11, %rbp;
+   movq %r11, %r12;
 
ret;
 ENDPROC(blowfish_dec_blk)
@@ -298,20 +300,21 @@ ENDPROC(blowfish_dec_blk)
 
 ENTRY(__blowfish_enc_blk_4way)
/* input:
-*  %rdi: ctx, CTX
+*  %rdi: ctx
 *  %rsi: dst
 *  %rdx: src
 *  %rcx: bool, if true: xor output
 */
-   pushq %rbp;
+   pushq %r12;
pushq %rbx;
pushq %rcx;
 
-   preload_roundkey_enc(0);
-
+   movq %rdi, CTX
movq %rsi, %r11;
movq %rdx, RIO;
 
+   preload_roundkey_enc(0);
+
read_block4();
 
round_enc4(0);
@@ -324,39 +327,40 @@ ENTRY(__blowfish_enc_blk_4way)
round_enc4(14);
add_preloaded_roundkey4();
 
-   popq %rbp;
+   popq %r12;
movq %r11, RIO;
 
-   test %bpl, %bpl;
+   test %r12b, %r12b;
jnz .L__enc_xor4;
 
write_block4();
 
popq %rbx;
-   popq %rbp;
+   popq %r12;
ret;
 
 .L__enc_xor4:
xor_block4();
 
popq %rbx;
-   popq %rbp;
+   popq %r12;
ret;
 ENDPROC(__blowfish_enc_blk_4way)
 
 ENTRY(blowfish_dec_blk_4way)
/* input:
-*  %rdi: ctx, CTX
+*  %rdi: ctx
 *  %rsi: dst
 *  %rdx: src
 */
-   pushq %rbp;
+   pushq %r12;
pushq %rbx;
-   preload_roundkey_dec(17);
 
-   movq %rsi, %r11;
+   movq %rdi, CTX;
+   movq %rsi, %r11
movq %rdx, RIO;
 
+   preload_roundkey_dec(17);
read_block4();
 
round_dec4(17);
@@ -373,7 +377,7 @@ ENTRY(blowfish_dec_blk_4way)
write_block4();
 
popq %rbx;
-   popq %rbp;
+   popq %r12;
 
ret;
 ENDPROC(blowfish_dec_blk_4way)
-- 
2.13.5



[PATCH v2 04/12] x86/crypto: Fix RBP usage in cast6-avx-x86_64-asm_64.S

2017-09-18 Thread Josh Poimboeuf
Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Use R15 instead of RBP.  R15 can't be used as the RID1 register because
of x86 instruction encoding limitations.  So use R15 for CTX and RDI for
RID1.  This means that CTX is no longer an implicit function argument.
Instead it needs to be explicitly copied from RDI.

Reported-by: Eric Biggers <ebigg...@google.com>
Reported-by: Peter Zijlstra <pet...@infradead.org>
Tested-by: Eric Biggers <ebigg...@google.com>
Acked-by: Eric Biggers <ebigg...@google.com>
Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
---
 arch/x86/crypto/cast6-avx-x86_64-asm_64.S | 50 +--
 1 file changed, 34 insertions(+), 16 deletions(-)

diff --git a/arch/x86/crypto/cast6-avx-x86_64-asm_64.S 
b/arch/x86/crypto/cast6-avx-x86_64-asm_64.S
index 952d3156a933..7f30b6f0d72c 100644
--- a/arch/x86/crypto/cast6-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/cast6-avx-x86_64-asm_64.S
@@ -47,7 +47,7 @@
 /**
   8-way AVX cast6
  **/
-#define CTX %rdi
+#define CTX %r15
 
 #define RA1 %xmm0
 #define RB1 %xmm1
@@ -70,8 +70,8 @@
 
 #define RTMP %xmm15
 
-#define RID1  %rbp
-#define RID1d %ebp
+#define RID1  %rdi
+#define RID1d %edi
 #define RID2  %rsi
 #define RID2d %esi
 
@@ -264,15 +264,17 @@
 .align 8
 __cast6_enc_blk8:
/* input:
-*  %rdi: ctx, CTX
+*  %rdi: ctx
 *  RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2: blocks
 * output:
 *  RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2: encrypted blocks
 */
 
-   pushq %rbp;
+   pushq %r15;
pushq %rbx;
 
+   movq %rdi, CTX;
+
vmovdqa .Lbswap_mask, RKM;
vmovd .Lfirst_mask, R1ST;
vmovd .L32_mask, R32;
@@ -297,7 +299,7 @@ __cast6_enc_blk8:
QBAR(11);
 
popq %rbx;
-   popq %rbp;
+   popq %r15;
 
vmovdqa .Lbswap_mask, RKM;
 
@@ -310,15 +312,17 @@ ENDPROC(__cast6_enc_blk8)
 .align 8
 __cast6_dec_blk8:
/* input:
-*  %rdi: ctx, CTX
+*  %rdi: ctx
 *  RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2: encrypted blocks
 * output:
 *  RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2: decrypted blocks
 */
 
-   pushq %rbp;
+   pushq %r15;
pushq %rbx;
 
+   movq %rdi, CTX;
+
vmovdqa .Lbswap_mask, RKM;
vmovd .Lfirst_mask, R1ST;
vmovd .L32_mask, R32;
@@ -343,7 +347,7 @@ __cast6_dec_blk8:
QBAR(0);
 
popq %rbx;
-   popq %rbp;
+   popq %r15;
 
vmovdqa .Lbswap_mask, RKM;
outunpack_blocks(RA1, RB1, RC1, RD1, RTMP, RX, RKRF, RKM);
@@ -354,12 +358,14 @@ ENDPROC(__cast6_dec_blk8)
 
 ENTRY(cast6_ecb_enc_8way)
/* input:
-*  %rdi: ctx, CTX
+*  %rdi: ctx
 *  %rsi: dst
 *  %rdx: src
 */
FRAME_BEGIN
+   pushq %r15;
 
+   movq %rdi, CTX;
movq %rsi, %r11;
 
load_8way(%rdx, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2);
@@ -368,18 +374,21 @@ ENTRY(cast6_ecb_enc_8way)
 
store_8way(%r11, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2);
 
+   popq %r15;
FRAME_END
ret;
 ENDPROC(cast6_ecb_enc_8way)
 
 ENTRY(cast6_ecb_dec_8way)
/* input:
-*  %rdi: ctx, CTX
+*  %rdi: ctx
 *  %rsi: dst
 *  %rdx: src
 */
FRAME_BEGIN
+   pushq %r15;
 
+   movq %rdi, CTX;
movq %rsi, %r11;
 
load_8way(%rdx, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2);
@@ -388,20 +397,22 @@ ENTRY(cast6_ecb_dec_8way)
 
store_8way(%r11, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2);
 
+   popq %r15;
FRAME_END
ret;
 ENDPROC(cast6_ecb_dec_8way)
 
 ENTRY(cast6_cbc_dec_8way)
/* input:
-*  %rdi: ctx, CTX
+*  %rdi: ctx
 *  %rsi: dst
 *  %rdx: src
 */
FRAME_BEGIN
-
pushq %r12;
+   pushq %r15;
 
+   movq %rdi, CTX;
movq %rsi, %r11;
movq %rdx, %r12;
 
@@ -411,8 +422,8 @@ ENTRY(cast6_cbc_dec_8way)
 
store_cbc_8way(%r12, %r11, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2);
 
+   popq %r15;
popq %r12;
-
FRAME_END
ret;
 ENDPROC(cast6_cbc_dec_8way)
@@ -425,9 +436,10 @@ ENTRY(cast6_ctr_8way)
 *  %rcx: iv (little endian, 128bit)
 */
FRAME_BEGIN
-
pushq %r12;
+   pushq %r15
 
+   movq %rdi, CTX;
movq %rsi, %r11;
movq %rdx, %r12;
 
@@ -438,8 +450,8 @@ ENTRY(cast6_ctr_8way)
 
store_ctr_8way(%r12, %r11, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2);
 
+   popq %r15;
popq %r12;
-
FRAME_END
ret;
 ENDPROC(cast6_ctr_8way)
@@ -452,7 +464,9 @@ ENTRY(cast6_xts_enc_8way)
 

[PATCH v2 02/12] x86/crypto: Fix RBP usage in camellia-x86_64-asm_64.S

2017-09-18 Thread Josh Poimboeuf
Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Use R12 instead of RBP.  Both are callee-saved registers, so the
substitution is straightforward.

Reported-by: Eric Biggers <ebigg...@google.com>
Reported-by: Peter Zijlstra <pet...@infradead.org>
Tested-by: Eric Biggers <ebigg...@google.com>
Acked-by: Eric Biggers <ebigg...@google.com>
Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
---
 arch/x86/crypto/camellia-x86_64-asm_64.S | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/x86/crypto/camellia-x86_64-asm_64.S 
b/arch/x86/crypto/camellia-x86_64-asm_64.S
index 310319c601ed..95ba6956a7f6 100644
--- a/arch/x86/crypto/camellia-x86_64-asm_64.S
+++ b/arch/x86/crypto/camellia-x86_64-asm_64.S
@@ -75,17 +75,17 @@
 #define RCD1bh %dh
 
 #define RT0 %rsi
-#define RT1 %rbp
+#define RT1 %r12
 #define RT2 %r8
 
 #define RT0d %esi
-#define RT1d %ebp
+#define RT1d %r12d
 #define RT2d %r8d
 
 #define RT2bl %r8b
 
 #define RXOR %r9
-#define RRBP %r10
+#define RR12 %r10
 #define RDST %r11
 
 #define RXORd %r9d
@@ -197,7 +197,7 @@ ENTRY(__camellia_enc_blk)
 *  %rdx: src
 *  %rcx: bool xor
 */
-   movq %rbp, RRBP;
+   movq %r12, RR12;
 
movq %rcx, RXOR;
movq %rsi, RDST;
@@ -227,13 +227,13 @@ ENTRY(__camellia_enc_blk)
 
enc_outunpack(mov, RT1);
 
-   movq RRBP, %rbp;
+   movq RR12, %r12;
ret;
 
 .L__enc_xor:
enc_outunpack(xor, RT1);
 
-   movq RRBP, %rbp;
+   movq RR12, %r12;
ret;
 ENDPROC(__camellia_enc_blk)
 
@@ -248,7 +248,7 @@ ENTRY(camellia_dec_blk)
movl $24, RXORd;
cmovel RXORd, RT2d; /* max */
 
-   movq %rbp, RRBP;
+   movq %r12, RR12;
movq %rsi, RDST;
movq %rdx, RIO;
 
@@ -271,7 +271,7 @@ ENTRY(camellia_dec_blk)
 
dec_outunpack();
 
-   movq RRBP, %rbp;
+   movq RR12, %r12;
ret;
 ENDPROC(camellia_dec_blk)
 
@@ -433,7 +433,7 @@ ENTRY(__camellia_enc_blk_2way)
 */
pushq %rbx;
 
-   movq %rbp, RRBP;
+   movq %r12, RR12;
movq %rcx, RXOR;
movq %rsi, RDST;
movq %rdx, RIO;
@@ -461,14 +461,14 @@ ENTRY(__camellia_enc_blk_2way)
 
enc_outunpack2(mov, RT2);
 
-   movq RRBP, %rbp;
+   movq RR12, %r12;
popq %rbx;
ret;
 
 .L__enc2_xor:
enc_outunpack2(xor, RT2);
 
-   movq RRBP, %rbp;
+   movq RR12, %r12;
popq %rbx;
ret;
 ENDPROC(__camellia_enc_blk_2way)
@@ -485,7 +485,7 @@ ENTRY(camellia_dec_blk_2way)
cmovel RXORd, RT2d; /* max */
 
movq %rbx, RXOR;
-   movq %rbp, RRBP;
+   movq %r12, RR12;
movq %rsi, RDST;
movq %rdx, RIO;
 
@@ -508,7 +508,7 @@ ENTRY(camellia_dec_blk_2way)
 
dec_outunpack2();
 
-   movq RRBP, %rbp;
+   movq RR12, %r12;
movq RXOR, %rbx;
ret;
 ENDPROC(camellia_dec_blk_2way)
-- 
2.13.5



[PATCH v2 05/12] x86/crypto: Fix RBP usage in des3_ede-asm_64.S

2017-09-18 Thread Josh Poimboeuf
Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Use RSI instead of RBP for RT1.  Since RSI is also used as the 'dst'
function argument, it needs to be saved on the stack until the argument
is needed.

Reported-by: Eric Biggers <ebigg...@google.com>
Reported-by: Peter Zijlstra <pet...@infradead.org>
Tested-by: Eric Biggers <ebigg...@google.com>
Acked-by: Eric Biggers <ebigg...@google.com>
Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
---
 arch/x86/crypto/des3_ede-asm_64.S | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/x86/crypto/des3_ede-asm_64.S 
b/arch/x86/crypto/des3_ede-asm_64.S
index f3e91647ca27..8e49ce117494 100644
--- a/arch/x86/crypto/des3_ede-asm_64.S
+++ b/arch/x86/crypto/des3_ede-asm_64.S
@@ -64,12 +64,12 @@
 #define RW2bh %ch
 
 #define RT0 %r15
-#define RT1 %rbp
+#define RT1 %rsi
 #define RT2 %r14
 #define RT3 %rdx
 
 #define RT0d %r15d
-#define RT1d %ebp
+#define RT1d %esi
 #define RT2d %r14d
 #define RT3d %edx
 
@@ -177,13 +177,14 @@ ENTRY(des3_ede_x86_64_crypt_blk)
 *  %rsi: dst
 *  %rdx: src
 */
-   pushq %rbp;
pushq %rbx;
pushq %r12;
pushq %r13;
pushq %r14;
pushq %r15;
 
+   pushq %rsi; /* dst */
+
read_block(%rdx, RL0, RR0);
initial_permutation(RL0, RR0);
 
@@ -241,6 +242,8 @@ ENTRY(des3_ede_x86_64_crypt_blk)
round1(32+15, RL0, RR0, dummy2);
 
final_permutation(RR0, RL0);
+
+   popq %rsi /* dst */
write_block(%rsi, RR0, RL0);
 
popq %r15;
@@ -248,7 +251,6 @@ ENTRY(des3_ede_x86_64_crypt_blk)
popq %r13;
popq %r12;
popq %rbx;
-   popq %rbp;
 
ret;
 ENDPROC(des3_ede_x86_64_crypt_blk)
@@ -432,13 +434,14 @@ ENTRY(des3_ede_x86_64_crypt_blk_3way)
 *  %rdx: src (3 blocks)
 */
 
-   pushq %rbp;
pushq %rbx;
pushq %r12;
pushq %r13;
pushq %r14;
pushq %r15;
 
+   pushq %rsi /* dst */
+
/* load input */
movl 0 * 4(%rdx), RL0d;
movl 1 * 4(%rdx), RR0d;
@@ -520,6 +523,7 @@ ENTRY(des3_ede_x86_64_crypt_blk_3way)
bswapl RR2d;
bswapl RL2d;
 
+   popq %rsi /* dst */
movl RR0d, 0 * 4(%rsi);
movl RL0d, 1 * 4(%rsi);
movl RR1d, 2 * 4(%rsi);
@@ -532,7 +536,6 @@ ENTRY(des3_ede_x86_64_crypt_blk_3way)
popq %r13;
popq %r12;
popq %rbx;
-   popq %rbp;
 
ret;
 ENDPROC(des3_ede_x86_64_crypt_blk_3way)
-- 
2.13.5



[PATCH v2 09/12] x86/crypto: Fix RBP usage in sha256-avx2-asm.S

2017-09-18 Thread Josh Poimboeuf
Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

There's no need to use RBP as a temporary register for the TBL value,
because it always stores the same value: the address of the K256 table.
Instead just reference the address of K256 directly.

Reported-by: Eric Biggers <ebigg...@google.com>
Reported-by: Peter Zijlstra <pet...@infradead.org>
Tested-by: Eric Biggers <ebigg...@google.com>
Acked-by: Eric Biggers <ebigg...@google.com>
Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
---
 arch/x86/crypto/sha256-avx2-asm.S | 22 +++---
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/arch/x86/crypto/sha256-avx2-asm.S 
b/arch/x86/crypto/sha256-avx2-asm.S
index 89c8f09787d2..1420db15dcdd 100644
--- a/arch/x86/crypto/sha256-avx2-asm.S
+++ b/arch/x86/crypto/sha256-avx2-asm.S
@@ -98,8 +98,6 @@ d = %r8d
 e   = %edx # clobbers NUM_BLKS
 y3 = %esi  # clobbers INP
 
-
-TBL= %rbp
 SRND   = CTX   # SRND is same register as CTX
 
 a = %eax
@@ -531,7 +529,6 @@ STACK_SIZE  = _RSP  + _RSP_SIZE
 ENTRY(sha256_transform_rorx)
 .align 32
pushq   %rbx
-   pushq   %rbp
pushq   %r12
pushq   %r13
pushq   %r14
@@ -568,8 +565,6 @@ ENTRY(sha256_transform_rorx)
mov CTX, _CTX(%rsp)
 
 loop0:
-   lea K256(%rip), TBL
-
## Load first 16 dwords from two blocks
VMOVDQ  0*32(INP),XTMP0
VMOVDQ  1*32(INP),XTMP1
@@ -597,19 +592,19 @@ last_block_enter:
 
 .align 16
 loop1:
-   vpaddd  0*32(TBL, SRND), X0, XFER
+   vpaddd  K256+0*32(SRND), X0, XFER
vmovdqa XFER, 0*32+_XFER(%rsp, SRND)
FOUR_ROUNDS_AND_SCHED   _XFER + 0*32
 
-   vpaddd  1*32(TBL, SRND), X0, XFER
+   vpaddd  K256+1*32(SRND), X0, XFER
vmovdqa XFER, 1*32+_XFER(%rsp, SRND)
FOUR_ROUNDS_AND_SCHED   _XFER + 1*32
 
-   vpaddd  2*32(TBL, SRND), X0, XFER
+   vpaddd  K256+2*32(SRND), X0, XFER
vmovdqa XFER, 2*32+_XFER(%rsp, SRND)
FOUR_ROUNDS_AND_SCHED   _XFER + 2*32
 
-   vpaddd  3*32(TBL, SRND), X0, XFER
+   vpaddd  K256+3*32(SRND), X0, XFER
vmovdqa XFER, 3*32+_XFER(%rsp, SRND)
FOUR_ROUNDS_AND_SCHED   _XFER + 3*32
 
@@ -619,10 +614,11 @@ loop1:
 
 loop2:
## Do last 16 rounds with no scheduling
-   vpaddd  0*32(TBL, SRND), X0, XFER
+   vpaddd  K256+0*32(SRND), X0, XFER
vmovdqa XFER, 0*32+_XFER(%rsp, SRND)
DO_4ROUNDS  _XFER + 0*32
-   vpaddd  1*32(TBL, SRND), X1, XFER
+
+   vpaddd  K256+1*32(SRND), X1, XFER
vmovdqa XFER, 1*32+_XFER(%rsp, SRND)
DO_4ROUNDS  _XFER + 1*32
add $2*32, SRND
@@ -676,9 +672,6 @@ loop3:
ja  done_hash
 
 do_last_block:
-    do last block
-   lea K256(%rip), TBL
-
VMOVDQ  0*16(INP),XWORD0
VMOVDQ  1*16(INP),XWORD1
VMOVDQ  2*16(INP),XWORD2
@@ -718,7 +711,6 @@ done_hash:
popq%r14
popq%r13
popq%r12
-   popq%rbp
popq%rbx
ret
 ENDPROC(sha256_transform_rorx)
-- 
2.13.5



[PATCH v2 07/12] x86/crypto: Fix RBP usage in sha1_ssse3_asm.S

2017-09-18 Thread Josh Poimboeuf
Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Swap the usages of R12 and RBP.  Use R12 for the REG_D register, and use
RBP to store the pre-aligned stack pointer.

Reported-by: Eric Biggers <ebigg...@google.com>
Reported-by: Peter Zijlstra <pet...@infradead.org>
Tested-by: Eric Biggers <ebigg...@google.com>
Acked-by: Eric Biggers <ebigg...@google.com>
Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
---
 arch/x86/crypto/sha1_ssse3_asm.S | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/crypto/sha1_ssse3_asm.S b/arch/x86/crypto/sha1_ssse3_asm.S
index a4109506a5e8..6204bd53528c 100644
--- a/arch/x86/crypto/sha1_ssse3_asm.S
+++ b/arch/x86/crypto/sha1_ssse3_asm.S
@@ -37,7 +37,7 @@
 #define REG_A  %ecx
 #define REG_B  %esi
 #define REG_C  %edi
-#define REG_D  %ebp
+#define REG_D  %r12d
 #define REG_E  %edx
 
 #define REG_T1 %eax
@@ -74,10 +74,10 @@
ENTRY(\name)
 
push%rbx
-   push%rbp
push%r12
+   push%rbp
+   mov %rsp, %rbp
 
-   mov %rsp, %r12
sub $64, %rsp   # allocate workspace
and $~15, %rsp  # align stack
 
@@ -99,10 +99,9 @@
xor %rax, %rax
rep stosq
 
-   mov %r12, %rsp  # deallocate workspace
-
-   pop %r12
+   mov %rbp, %rsp  # deallocate workspace
pop %rbp
+   pop %r12
pop %rbx
ret
 
-- 
2.13.5



[PATCH v2 08/12] x86/crypto: Fix RBP usage in sha256-avx-asm.S

2017-09-18 Thread Josh Poimboeuf
Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Swap the usages of R12 and RBP.  Use R12 for the TBL register, and use
RBP to store the pre-aligned stack pointer.

Reported-by: Eric Biggers <ebigg...@google.com>
Reported-by: Peter Zijlstra <pet...@infradead.org>
Tested-by: Eric Biggers <ebigg...@google.com>
Acked-by: Eric Biggers <ebigg...@google.com>
Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
---
 arch/x86/crypto/sha256-avx-asm.S | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/x86/crypto/sha256-avx-asm.S b/arch/x86/crypto/sha256-avx-asm.S
index e0a1a5f2..001bbcf93c79 100644
--- a/arch/x86/crypto/sha256-avx-asm.S
+++ b/arch/x86/crypto/sha256-avx-asm.S
@@ -103,7 +103,7 @@ SRND = %rsi   # clobbers INP
 c = %ecx
 d = %r8d
 e = %edx
-TBL = %rbp
+TBL = %r12
 a = %eax
 b = %ebx
 
@@ -350,13 +350,13 @@ a = TMP_
 ENTRY(sha256_transform_avx)
 .align 32
pushq   %rbx
-   pushq   %rbp
+   pushq   %r12
pushq   %r13
pushq   %r14
pushq   %r15
-   pushq   %r12
+   pushq   %rbp
+   movq%rsp, %rbp
 
-   mov %rsp, %r12
subq$STACK_SIZE, %rsp   # allocate stack space
and $~15, %rsp  # align stack pointer
 
@@ -452,13 +452,12 @@ loop2:
 
 done_hash:
 
-   mov %r12, %rsp
-
-   popq%r12
+   mov %rbp, %rsp
+   popq%rbp
popq%r15
popq%r14
popq%r13
-   popq%rbp
+   popq%r12
popq%rbx
ret
 ENDPROC(sha256_transform_avx)
-- 
2.13.5



[PATCH v2 12/12] x86/crypto: Fix RBP usage in twofish-avx-x86_64-asm_64.S

2017-09-18 Thread Josh Poimboeuf
Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Use R13 instead of RBP.  Both are callee-saved registers, so the
substitution is straightforward.

Reported-by: Eric Biggers <ebigg...@google.com>
Reported-by: Peter Zijlstra <pet...@infradead.org>
Tested-by: Eric Biggers <ebigg...@google.com>
Acked-by: Eric Biggers <ebigg...@google.com>
Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
---
 arch/x86/crypto/twofish-avx-x86_64-asm_64.S | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S 
b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
index b3f49d286348..73b471da3622 100644
--- a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
@@ -76,8 +76,8 @@
 #define RT %xmm14
 #define RR %xmm15
 
-#define RID1  %rbp
-#define RID1d %ebp
+#define RID1  %r13
+#define RID1d %r13d
 #define RID2  %rsi
 #define RID2d %esi
 
@@ -259,7 +259,7 @@ __twofish_enc_blk8:
 
vmovdqu w(CTX), RK1;
 
-   pushq %rbp;
+   pushq %r13;
pushq %rbx;
pushq %rcx;
 
@@ -282,7 +282,7 @@ __twofish_enc_blk8:
 
popq %rcx;
popq %rbx;
-   popq %rbp;
+   popq %r13;
 
outunpack_blocks(RC1, RD1, RA1, RB1, RK1, RX0, RY0, RK2);
outunpack_blocks(RC2, RD2, RA2, RB2, RK1, RX0, RY0, RK2);
@@ -301,7 +301,7 @@ __twofish_dec_blk8:
 
vmovdqu (w+4*4)(CTX), RK1;
 
-   pushq %rbp;
+   pushq %r13;
pushq %rbx;
 
inpack_blocks(RC1, RD1, RA1, RB1, RK1, RX0, RY0, RK2);
@@ -322,7 +322,7 @@ __twofish_dec_blk8:
vmovdqu (w)(CTX), RK1;
 
popq %rbx;
-   popq %rbp;
+   popq %r13;
 
outunpack_blocks(RA1, RB1, RC1, RD1, RK1, RX0, RY0, RK2);
outunpack_blocks(RA2, RB2, RC2, RD2, RK1, RX0, RY0, RK2);
-- 
2.13.5



[PATCH v2 11/12] x86/crypto: Fix RBP usage in sha512-avx2-asm.S

2017-09-18 Thread Josh Poimboeuf
Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Mix things up a little bit to get rid of the RBP usage, without hurting
performance too much.  Use RDI instead of RBP for the TBL pointer.  That
will clobber CTX, so spill CTX onto the stack and use R12 to read it in
the outer loop.  R12 is used as a non-persistent temporary variable
elsewhere, so it's safe to use.

Also remove the unused y4 variable.

Reported-by: Eric Biggers <ebigge...@gmail.com>
Reported-by: Peter Zijlstra <pet...@infradead.org>
Tested-by: Eric Biggers <ebigg...@google.com>
Acked-by: Eric Biggers <ebigg...@google.com>
Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
---
 arch/x86/crypto/sha512-avx2-asm.S | 75 ---
 1 file changed, 39 insertions(+), 36 deletions(-)

diff --git a/arch/x86/crypto/sha512-avx2-asm.S 
b/arch/x86/crypto/sha512-avx2-asm.S
index 7f5f6c6ec72e..b16d56005162 100644
--- a/arch/x86/crypto/sha512-avx2-asm.S
+++ b/arch/x86/crypto/sha512-avx2-asm.S
@@ -69,8 +69,9 @@ XFER  = YTMP0
 
 BYTE_FLIP_MASK  = %ymm9
 
-# 1st arg
-CTX = %rdi
+# 1st arg is %rdi, which is saved to the stack and accessed later via %r12
+CTX1= %rdi
+CTX2= %r12
 # 2nd arg
 INP = %rsi
 # 3rd arg
@@ -81,7 +82,7 @@ d   = %r8
 e   = %rdx
 y3  = %rsi
 
-TBL   = %rbp
+TBL   = %rdi # clobbers CTX1
 
 a = %rax
 b = %rbx
@@ -91,26 +92,26 @@ g = %r10
 h = %r11
 old_h = %r11
 
-T1= %r12
+T1= %r12 # clobbers CTX2
 y0= %r13
 y1= %r14
 y2= %r15
 
-y4= %r12
-
 # Local variables (stack frame)
 XFER_SIZE = 4*8
 SRND_SIZE = 1*8
 INP_SIZE = 1*8
 INPEND_SIZE = 1*8
+CTX_SIZE = 1*8
 RSPSAVE_SIZE = 1*8
-GPRSAVE_SIZE = 6*8
+GPRSAVE_SIZE = 5*8
 
 frame_XFER = 0
 frame_SRND = frame_XFER + XFER_SIZE
 frame_INP = frame_SRND + SRND_SIZE
 frame_INPEND = frame_INP + INP_SIZE
-frame_RSPSAVE = frame_INPEND + INPEND_SIZE
+frame_CTX = frame_INPEND + INPEND_SIZE
+frame_RSPSAVE = frame_CTX + CTX_SIZE
 frame_GPRSAVE = frame_RSPSAVE + RSPSAVE_SIZE
 frame_size = frame_GPRSAVE + GPRSAVE_SIZE
 
@@ -576,12 +577,11 @@ ENTRY(sha512_transform_rorx)
mov %rax, frame_RSPSAVE(%rsp)
 
# Save GPRs
-   mov %rbp, frame_GPRSAVE(%rsp)
-   mov %rbx, 8*1+frame_GPRSAVE(%rsp)
-   mov %r12, 8*2+frame_GPRSAVE(%rsp)
-   mov %r13, 8*3+frame_GPRSAVE(%rsp)
-   mov %r14, 8*4+frame_GPRSAVE(%rsp)
-   mov %r15, 8*5+frame_GPRSAVE(%rsp)
+   mov %rbx, 8*0+frame_GPRSAVE(%rsp)
+   mov %r12, 8*1+frame_GPRSAVE(%rsp)
+   mov %r13, 8*2+frame_GPRSAVE(%rsp)
+   mov %r14, 8*3+frame_GPRSAVE(%rsp)
+   mov %r15, 8*4+frame_GPRSAVE(%rsp)
 
shl $7, NUM_BLKS# convert to bytes
jz  done_hash
@@ -589,14 +589,17 @@ ENTRY(sha512_transform_rorx)
mov NUM_BLKS, frame_INPEND(%rsp)
 
## load initial digest
-   mov 8*0(CTX),a
-   mov 8*1(CTX),b
-   mov 8*2(CTX),c
-   mov 8*3(CTX),d
-   mov 8*4(CTX),e
-   mov 8*5(CTX),f
-   mov 8*6(CTX),g
-   mov 8*7(CTX),h
+   mov 8*0(CTX1), a
+   mov 8*1(CTX1), b
+   mov 8*2(CTX1), c
+   mov 8*3(CTX1), d
+   mov 8*4(CTX1), e
+   mov 8*5(CTX1), f
+   mov 8*6(CTX1), g
+   mov 8*7(CTX1), h
+
+   # save %rdi (CTX) before it gets clobbered
+   mov %rdi, frame_CTX(%rsp)
 
vmovdqa PSHUFFLE_BYTE_FLIP_MASK(%rip), BYTE_FLIP_MASK
 
@@ -652,14 +655,15 @@ loop2:
subq$1, frame_SRND(%rsp)
jne loop2
 
-   addm8*0(CTX),a
-   addm8*1(CTX),b
-   addm8*2(CTX),c
-   addm8*3(CTX),d
-   addm8*4(CTX),e
-   addm8*5(CTX),f
-   addm8*6(CTX),g
-   addm8*7(CTX),h
+   mov frame_CTX(%rsp), CTX2
+   addm8*0(CTX2), a
+   addm8*1(CTX2), b
+   addm8*2(CTX2), c
+   addm8*3(CTX2), d
+   addm8*4(CTX2), e
+   addm8*5(CTX2), f
+   addm8*6(CTX2), g
+   addm8*7(CTX2), h
 
mov frame_INP(%rsp), INP
add $128, INP
@@ -669,12 +673,11 @@ loop2:
 done_hash:
 
 # Restore GPRs
-   mov frame_GPRSAVE(%rsp) ,%rbp
-   mov 8*1+frame_GPRSAVE(%rsp) ,%rbx
-   mov 8*2+frame_GPRSAVE(%rsp) ,%r12
-   mov 8*3+frame_GPRSAVE(%rsp) ,%r13
-   mov 8*4+frame_GPRSAVE(%rsp) ,%r14
-   mov 8*5+frame_GPRSAVE(%rsp) ,%r15
+   mov 8*0+frame_GPRSAVE(%rsp), %rbx
+   mov 8*1+frame_GPRSAVE(%rsp), %r12
+   mov 8*2+frame_GPRSAVE(%rsp), %r13
+   mov 8*3+frame_GPRSAVE(%rsp), %r14
+   mov 8*4+frame_GPRSAVE(%rsp), %r15
 
# Restore Stack Pointer
mov frame_RSPSAVE(%rsp), %rsp
-- 
2.13.5



[PATCH v2 00/12] x86/crypto: Fix RBP usage in several crypto .S files

2017-09-18 Thread Josh Poimboeuf
v2:
- fix performance issues in sha256-avx2-asm.S and sha512-avx2-asm.S
  (Eric)

Many of the x86 crypto functions use RBP as a temporary register.  This
breaks frame pointer convention, and breaks stack traces when unwinding
from an interrupt in the crypto code.

Convert most* of them to leave RBP alone.

These pass the crypto boot tests for me.  Any further testing would be
appreciated!

[*] There are still a few crypto files left that need fixing, but the
fixes weren't trivial and nobody reported unwinder warnings about
them yet, so I'm skipping them for now.

Josh Poimboeuf (12):
  x86/crypto: Fix RBP usage in blowfish-x86_64-asm_64.S
  x86/crypto: Fix RBP usage in camellia-x86_64-asm_64.S
  x86/crypto: Fix RBP usage in cast5-avx-x86_64-asm_64.S
  x86/crypto: Fix RBP usage in cast6-avx-x86_64-asm_64.S
  x86/crypto: Fix RBP usage in des3_ede-asm_64.S
  x86/crypto: Fix RBP usage in sha1_avx2_x86_64_asm.S
  x86/crypto: Fix RBP usage in sha1_ssse3_asm.S
  x86/crypto: Fix RBP usage in sha256-avx-asm.S
  x86/crypto: Fix RBP usage in sha256-avx2-asm.S
  x86/crypto: Fix RBP usage in sha256-ssse3-asm.S
  x86/crypto: Fix RBP usage in sha512-avx2-asm.S
  x86/crypto: Fix RBP usage in twofish-avx-x86_64-asm_64.S

 arch/x86/crypto/blowfish-x86_64-asm_64.S| 48 +-
 arch/x86/crypto/camellia-x86_64-asm_64.S| 26 +-
 arch/x86/crypto/cast5-avx-x86_64-asm_64.S   | 47 +++---
 arch/x86/crypto/cast6-avx-x86_64-asm_64.S   | 50 +--
 arch/x86/crypto/des3_ede-asm_64.S   | 15 +++---
 arch/x86/crypto/sha1_avx2_x86_64_asm.S  |  4 +-
 arch/x86/crypto/sha1_ssse3_asm.S| 11 ++---
 arch/x86/crypto/sha256-avx-asm.S| 15 +++---
 arch/x86/crypto/sha256-avx2-asm.S   | 22 +++--
 arch/x86/crypto/sha256-ssse3-asm.S  | 15 +++---
 arch/x86/crypto/sha512-avx2-asm.S   | 75 +++--
 arch/x86/crypto/twofish-avx-x86_64-asm_64.S | 12 ++---
 12 files changed, 184 insertions(+), 156 deletions(-)

-- 
2.13.5
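
As background, a rough C sketch of why a clobbered RBP breaks frame-pointer
unwinding; this is conceptual only, not the kernel's actual unwinder code:

#include <linux/printk.h>

/*
 * Conceptual sketch: with CONFIG_FRAME_POINTER, each function saves the
 * caller's %rbp, so the saved values form a linked list with the return
 * address stored just above each saved %rbp.  A function that repurposes
 * %rbp as a scratch register breaks the chain at its frame, so an interrupt
 * landing in that code produces a bogus or truncated stack trace.
 */
struct fp_frame {
	struct fp_frame *next_bp;	/* caller's saved %rbp */
	unsigned long ret_addr;		/* return address into the caller */
};

static void walk_frames_sketch(const struct fp_frame *bp)
{
	while (bp) {
		pr_info("  %pS\n", (void *)bp->ret_addr);
		bp = bp->next_bp;
	}
}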



Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files

2017-09-14 Thread Josh Poimboeuf
On Thu, Sep 14, 2017 at 11:16:12AM +0200, Ingo Molnar wrote:
> 
> * Josh Poimboeuf <jpoim...@redhat.com> wrote:
> 
> > I'm still looking at the other one (sha512-avx2), but so far I haven't
> > found a way to speed it back up.
> 
> Here's a couple of very quick observations with possible optimizations:
> 
> AFAICS the main effect of the RBP fixes is the introduction of a memory load
> into the critical path, into the body of the unrolled loop:
> 
> +   mov frame_TBL(%rsp), TBL
> vpaddq  (TBL), Y_0, XFER
> vmovdqa XFER, frame_XFER(%rsp)
> FOUR_ROUNDS_AND_SCHED
> 
> Both 'TBL' and 'T1' are mapped to R12, which is why TBL has to be spilled
> to be reloaded from the stack.
> 
> 1)
> 
> Note how R12 is used immediately, right in the next instruction:
> 
> vpaddq  (TBL), Y_0, XFER
> 
> I.e. the RBP fixes lengthen the program order data dependencies - that's a 
> new 
> constraint and a few extra cycles per loop iteration if the workload is 
> address-generator bandwidth limited on that.
> 
> A simple way to ease that constraint would be to move the 'TBL' load up into
> the loop body, to the point where 'T1' is used for the last time - which is:
> 
> 
> mov a, T1   # T1 = a# MAJB
> and c, T1   # T1 = a&c  # MAJB
> 
> add y0, y2  # y2 = S1 + CH  # --
> or  T1, y3  # y3 = MAJ = (a|c)&b)|(a&c) # MAJ
> 
> +   mov frame_TBL(%rsp), TBL
> 
> add y1, h   # h = k + w + h + S0# --
> 
> add y2, d   # d = k + w + h + d + S1 + CH = d + t1  # --
> 
> add y2, h   # h = k + w + h + S0 + S1 + CH = t1 + S0# --
> add y3, h   # h = t1 + S0 + MAJ # --
> 
> Note how this moves up the 'TBL' reload by 4 instructions.
> 
> 2)
> 
> If this does not get back performance, then maybe another reason is that it's 
> cache access latency limited, in which case a more involved optimization 
> would be 
> to look at the register patterns and usages:
> 
> 
>   first-use   last-useuse-length
>   a:  #10 #29 20
>   b:  #24 #24  1
>   c:  #14 #30 17
>   d:  #23 #34 12
>   e:  #11 #20 10
>   f:  #15 #15  1
>   g:  #18 #27 10
>   h:  #13 #36 24
> 
>   y0: #11 #31 21
>   y1: #12 #33 22
>   y2: #15 #35 21
>   y3: #10 #36 27
> 
>   T1: #16 #32 17
> 
> The 'first-use' column shows the number of the instruction within the loop
> body at which the register gets used - with '#1' denoting the first
> instruction and '#36' the last instruction - the 'last-use' column shows the
> last instruction, and the 'use-length' column shows the 'window' in which a
> register is used.
> 
> What we want are the registers that are used the most tightly, i.e. these two:
> 
>   b:  #24 #24  1
>   f:  #15 #15  1
> 
> Of these two 'f' is the best one, because it has an earlier use and longer 
> cooldown.
> 
> If we alias 'TBL' with 'f' then we could reload 'TBL' for the next iteration
> very early on:
> 
> mov f, y2   # y2 = f# CH
> +   mov frame_TBL(%rsp), TBL
> rorx$34, a, T1  # T1 = a >> 34  # S0B
> 
> And there will be 21 instructions that don't depend on TBL after this, plenty
> of time for the load to be generated and propagated.
> 
> NOTE: my pseudo-patch is naive, due to the complication caused by the
> RotateState macro name rotation. It's still fundamentally possible I believe,
> it's just that 'TBL' has to be rotated too, together with the other variables.
> 
> 3)
> 
> If even this does not help, because the workload is ucode-cache limited, and 
> the 
> extra reloads pushed the critical path just beyond some c

Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files

2017-09-13 Thread Josh Poimboeuf
On Wed, Sep 13, 2017 at 04:24:28PM -0500, Josh Poimboeuf wrote:
> On Fri, Sep 08, 2017 at 10:57:05AM -0700, Eric Biggers wrote:
> > On Thu, Sep 07, 2017 at 11:26:47PM +0200, Ingo Molnar wrote:
> > > 
> > > * Eric Biggers <ebigge...@gmail.com> wrote:
> > > 
> > > > On Thu, Sep 07, 2017 at 09:15:34AM +0200, Ingo Molnar wrote:
> > > > > 
> > > > > * Eric Biggers <ebigge...@gmail.com> wrote:
> > > > > 
> > > > > > Thanks for fixing these!  I don't have time to review these in 
> > > > > > detail, but I ran
> > > > > > the crypto self-tests on the affected algorithms, and they all 
> > > > > > pass.  I also
> > > > > > benchmarked them before and after; the only noticeable performance 
> > > > > > difference was
> > > > > > that sha256-avx2 and sha512-avx2 became a few percent slower.  I 
> > > > > > don't suppose
> > > > > > there is a way around that?  Otherwise it's probably not a big deal.
> > > > > 
> > > > > Which CPU model did you use for the test?
> > > > > 
> > > > > Thanks,
> > > > > 
> > > > >   Ingo
> > > > 
> > > > This was on Haswell, "Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz".
> > > 
> > > Any chance to test this with the latest microarchitecture - any Skylake 
> > > derivative 
> > > Intel CPU you have access to?
> > > 
> > > Thanks,
> > > 
> > >   Ingo
> > 
> > Tested with Skylake, "Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz".  The 
> > results
> > were the following which seemed a bit worse than Haswell:
> > 
> > sha256-avx2 became 3.5% slower
> > sha512-avx2 became 7.5% slower
> > 
> > Note: it's tricky to benchmark this, especially with just a few percent
> > difference, so don't read too much into the exact numbers.
> 
> Here's a v2 for the sha256-avx2 patch, would you mind seeing if this
> closes the performance gap?
> 
> I'm still looking at the other one (sha512-avx2), but so far I haven't
> found a way to speed it back up.

And here's v2 of the sha512-avx2 patch.  It should hopefully gain back
most of the performance lost by v1.

From: Josh Poimboeuf <jpoim...@redhat.com>
Subject: [PATCH] x86/crypto: Fix RBP usage in sha512-avx2-asm.S

Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Mix things up a little bit to get rid of the RBP usage, without
destroying performance.  Use RDI instead of RBP for the TBL pointer.
That will clobber CTX, so save CTX on the stack and use RDI as CTX
before it gets clobbered, and R12 as CTX after it gets clobbered.

Also remove the unused y4 variable.

Reported-by: Eric Biggers <ebigge...@gmail.com>
Reported-by: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
---
 arch/x86/crypto/sha512-avx2-asm.S | 75 ---
 1 file changed, 39 insertions(+), 36 deletions(-)

diff --git a/arch/x86/crypto/sha512-avx2-asm.S 
b/arch/x86/crypto/sha512-avx2-asm.S
index 7f5f6c6ec72e..b16d56005162 100644
--- a/arch/x86/crypto/sha512-avx2-asm.S
+++ b/arch/x86/crypto/sha512-avx2-asm.S
@@ -69,8 +69,9 @@ XFER  = YTMP0
 
 BYTE_FLIP_MASK  = %ymm9
 
-# 1st arg
-CTX = %rdi
+# 1st arg is %rdi, which is saved to the stack and accessed later via %r12
+CTX1= %rdi
+CTX2= %r12
 # 2nd arg
 INP = %rsi
 # 3rd arg
@@ -81,7 +82,7 @@ d   = %r8
 e   = %rdx
 y3  = %rsi
 
-TBL   = %rbp
+TBL   = %rdi # clobbers CTX1
 
 a = %rax
 b = %rbx
@@ -91,26 +92,26 @@ g = %r10
 h = %r11
 old_h = %r11
 
-T1= %r12
+T1= %r12 # clobbers CTX2
 y0= %r13
 y1= %r14
 y2= %r15
 
-y4= %r12
-
 # Local variables (stack frame)
 XFER_SIZE = 4*8
 SRND_SIZE = 1*8
 INP_SIZE = 1*8
 INPEND_SIZE = 1*8
+CTX_SIZE = 1*8
 RSPSAVE_SIZE = 1*8
-GPRSAVE_SIZE = 6*8
+GPRSAVE_SIZE = 5*8
 
 frame_XFER = 0
 frame_SRND = frame_XFER + XFER_SIZE
 frame_INP = frame_SRND + SRND_SIZE
 frame_INPEND = frame_INP + INP_SIZE
-frame_RSPSAVE = frame_INPEND + INPEND_SIZE
+frame_CTX = frame_INPEND + INPEND_SIZE
+frame_RSPSAVE = frame_CTX + CTX_SIZE
 frame_GPRSAVE = frame_RSPSAVE + RSPSAVE_SIZE
 frame_size = frame_GPRSAVE + GPRSAVE_SIZE
 
@@ -576,12 +577,11 @@ ENTRY(sha512_transform_rorx)
mov %rax, frame_RSPSAVE(%rsp)
 
# Save GPRs
-   mov %rbp, frame_GPRSAVE(%rsp)
-   mov %rbx, 8*1+frame_GPRSAVE(%rsp)
-   mov %r12, 8*2+frame_GPRSAVE(%rsp)
-   mov %r13, 8*3+frame_GPRSAVE(%rsp)

Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files

2017-09-13 Thread Josh Poimboeuf
On Fri, Sep 08, 2017 at 10:57:05AM -0700, Eric Biggers wrote:
> On Thu, Sep 07, 2017 at 11:26:47PM +0200, Ingo Molnar wrote:
> > 
> > * Eric Biggers <ebigge...@gmail.com> wrote:
> > 
> > > On Thu, Sep 07, 2017 at 09:15:34AM +0200, Ingo Molnar wrote:
> > > > 
> > > > * Eric Biggers <ebigge...@gmail.com> wrote:
> > > > 
> > > > > Thanks for fixing these!  I don't have time to review these in 
> > > > > detail, but I ran
> > > > > the crypto self-tests on the affected algorithms, and they all pass.  
> > > > > I also
> > > > > benchmarked them before and after; the only noticeable performance 
> > > > > difference was
> > > > > that sha256-avx2 and sha512-avx2 became a few percent slower.  I 
> > > > > don't suppose
> > > > > there is a way around that?  Otherwise it's probably not a big deal.
> > > > 
> > > > Which CPU model did you use for the test?
> > > > 
> > > > Thanks,
> > > > 
> > > > Ingo
> > > 
> > > This was on Haswell, "Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz".
> > 
> > Any chance to test this with the latest microarchitecture - any Skylake 
> > derivative 
> > Intel CPU you have access to?
> > 
> > Thanks,
> > 
> > Ingo
> 
> Tested with Skylake, "Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz".  The results
> were the following which seemed a bit worse than Haswell:
> 
>   sha256-avx2 became 3.5% slower
>   sha512-avx2 became 7.5% slower
> 
> Note: it's tricky to benchmark this, especially with just a few percent
> difference, so don't read too much into the exact numbers.

Here's a v2 for the sha256-avx2 patch, would you mind seeing if this
closes the performance gap?

I'm still looking at the other one (sha512-avx2), but so far I haven't
found a way to speed it back up.

From: Josh Poimboeuf <jpoim...@redhat.com>
Subject: [PATCH] x86/crypto: Fix RBP usage in sha256-avx2-asm.S

Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

There's no need to use RBP as a temporary register for the TBL value,
because it always stores the same value: the address of the K256 table.
Instead just reference the address of K256 directly.

Reported-by: Eric Biggers <ebigg...@google.com>
Reported-by: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
---
 arch/x86/crypto/sha256-avx2-asm.S | 22 +++---
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/arch/x86/crypto/sha256-avx2-asm.S 
b/arch/x86/crypto/sha256-avx2-asm.S
index 89c8f09787d2..1420db15dcdd 100644
--- a/arch/x86/crypto/sha256-avx2-asm.S
+++ b/arch/x86/crypto/sha256-avx2-asm.S
@@ -98,8 +98,6 @@ d = %r8d
 e   = %edx # clobbers NUM_BLKS
 y3 = %esi  # clobbers INP
 
-
-TBL= %rbp
 SRND   = CTX   # SRND is same register as CTX
 
 a = %eax
@@ -531,7 +529,6 @@ STACK_SIZE  = _RSP  + _RSP_SIZE
 ENTRY(sha256_transform_rorx)
 .align 32
pushq   %rbx
-   pushq   %rbp
pushq   %r12
pushq   %r13
pushq   %r14
@@ -568,8 +565,6 @@ ENTRY(sha256_transform_rorx)
mov CTX, _CTX(%rsp)
 
 loop0:
-   lea K256(%rip), TBL
-
## Load first 16 dwords from two blocks
VMOVDQ  0*32(INP),XTMP0
VMOVDQ  1*32(INP),XTMP1
@@ -597,19 +592,19 @@ last_block_enter:
 
 .align 16
 loop1:
-   vpaddd  0*32(TBL, SRND), X0, XFER
+   vpaddd  K256+0*32(SRND), X0, XFER
vmovdqa XFER, 0*32+_XFER(%rsp, SRND)
FOUR_ROUNDS_AND_SCHED   _XFER + 0*32
 
-   vpaddd  1*32(TBL, SRND), X0, XFER
+   vpaddd  K256+1*32(SRND), X0, XFER
vmovdqa XFER, 1*32+_XFER(%rsp, SRND)
FOUR_ROUNDS_AND_SCHED   _XFER + 1*32
 
-   vpaddd  2*32(TBL, SRND), X0, XFER
+   vpaddd  K256+2*32(SRND), X0, XFER
vmovdqa XFER, 2*32+_XFER(%rsp, SRND)
FOUR_ROUNDS_AND_SCHED   _XFER + 2*32
 
-   vpaddd  3*32(TBL, SRND), X0, XFER
+   vpaddd  K256+3*32(SRND), X0, XFER
vmovdqa XFER, 3*32+_XFER(%rsp, SRND)
FOUR_ROUNDS_AND_SCHED   _XFER + 3*32
 
@@ -619,10 +614,11 @@ loop1:
 
 loop2:
## Do last 16 rounds with no scheduling
-   vpaddd  0*32(TBL, SRND), X0, XFER
+   vpaddd  K256+0*32(SRND), X0, XFER
vmovdqa XFER, 0*32+_XFER(%rsp, SRND)
DO_4ROUNDS  _XFER + 0*32
-   vpaddd  1*32(TBL, SRND), X1, XFER
+
+   vpaddd  K256+1*32(SRND), X1, XFER
vmovdqa XFER, 1*32+_XFER(%rsp, SRND)
DO_4ROUNDS  _XFER + 1*32
add $2*32, SRND
@@ -676,9 +672,6 @@ loop3:
ja  done_hash
 
 do_last_block:
-    do last block
-   lea K256(%rip), TBL
-
VMOVDQ  0*16(INP),XWORD0
VMOVDQ  1*16(INP),XWORD1
VMOVDQ  2*16(INP),XWORD2
@@ -718,7 +711,6 @@ done_hash:
popq%r14
popq%r13
popq%r12
-   popq%rbp
popq%rbx
ret
 ENDPROC(sha256_transform_rorx)
-- 
2.13.5



Re: [PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files

2017-09-06 Thread Josh Poimboeuf
On Fri, Sep 01, 2017 at 05:09:19PM -0700, Eric Biggers wrote:
> Hi Josh,
> 
> On Tue, Aug 29, 2017 at 01:05:33PM -0500, Josh Poimboeuf wrote:
> > Many of the x86 crypto functions use RBP as a temporary register.  This
> > breaks frame pointer convention, and breaks stack traces when unwinding
> > from an interrupt in the crypto code.
> > 
> > Convert most* of them to leave RBP alone.
> > 
> > These pass the crypto boot tests for me.  Any further testing would be
> > appreciated!
> > 
> > [*] There are still a few crypto files left that need fixing, but the
> > fixes weren't trivial and nobody reported unwinder warnings about
> > them yet, so I'm skipping them for now.
> > 
> > Josh Poimboeuf (12):
> >   x86/crypto: Fix RBP usage in blowfish-x86_64-asm_64.S
> >   x86/crypto: Fix RBP usage in camellia-x86_64-asm_64.S
> >   x86/crypto: Fix RBP usage in cast5-avx-x86_64-asm_64.S
> >   x86/crypto: Fix RBP usage in cast6-avx-x86_64-asm_64.S
> >   x86/crypto: Fix RBP usage in des3_ede-asm_64.S
> >   x86/crypto: Fix RBP usage in sha1_avx2_x86_64_asm.S
> >   x86/crypto: Fix RBP usage in sha1_ssse3_asm.S
> >   x86/crypto: Fix RBP usage in sha256-avx-asm.S
> >   x86/crypto: Fix RBP usage in sha256-avx2-asm.S
> >   x86/crypto: Fix RBP usage in sha256-ssse3-asm.S
> >   x86/crypto: Fix RBP usage in sha512-avx2-asm.S
> >   x86/crypto: Fix RBP usage in twofish-avx-x86_64-asm_64.S
> > 
> 
> Thanks for fixing these!  I don't have time to review these in detail, but I 
> ran
> the crypto self-tests on the affected algorithms, and they all pass.  I also
> benchmarked them before and after; the only noticeable performance difference 
> was
> that sha256-avx2 and sha512-avx2 became a few percent slower.  I don't suppose
> there is a way around that?  Otherwise it's probably not a big deal.

Thanks for testing.  I might have a way to make sha256-avx2 faster, but
not sha512-avx2.  I'll let you know when I have a patch.

-- 
Josh


[PATCH 01/12] x86/crypto: Fix RBP usage in blowfish-x86_64-asm_64.S

2017-08-29 Thread Josh Poimboeuf
Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Use R12 instead of RBP.  R12 can't be used as the RT0 register because
of x86 instruction encoding limitations.  So use R12 for CTX and RDI for
CTX.  This means that CTX is no longer an implicit function argument.
Instead it needs to be explicitly copied from RDI.

Reported-by: Eric Biggers <ebigg...@google.com>
Reported-by: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
---
 arch/x86/crypto/blowfish-x86_64-asm_64.S | 48 +---
 1 file changed, 26 insertions(+), 22 deletions(-)

diff --git a/arch/x86/crypto/blowfish-x86_64-asm_64.S 
b/arch/x86/crypto/blowfish-x86_64-asm_64.S
index 246c67006ed0..8c1fcb6bad21 100644
--- a/arch/x86/crypto/blowfish-x86_64-asm_64.S
+++ b/arch/x86/crypto/blowfish-x86_64-asm_64.S
@@ -33,7 +33,7 @@
 #define s3 ((16 + 2 + (3 * 256)) * 4)
 
 /* register macros */
-#define CTX %rdi
+#define CTX %r12
 #define RIO %rsi
 
 #define RX0 %rax
@@ -56,12 +56,12 @@
 #define RX2bh %ch
 #define RX3bh %dh
 
-#define RT0 %rbp
+#define RT0 %rdi
 #define RT1 %rsi
 #define RT2 %r8
 #define RT3 %r9
 
-#define RT0d %ebp
+#define RT0d %edi
 #define RT1d %esi
 #define RT2d %r8d
 #define RT3d %r9d
@@ -120,13 +120,14 @@
 
 ENTRY(__blowfish_enc_blk)
/* input:
-*  %rdi: ctx, CTX
+*  %rdi: ctx
 *  %rsi: dst
 *  %rdx: src
 *  %rcx: bool, if true: xor output
 */
-   movq %rbp, %r11;
+   movq %r12, %r11;
 
+   movq %rdi, CTX;
movq %rsi, %r10;
movq %rdx, RIO;
 
@@ -142,7 +143,7 @@ ENTRY(__blowfish_enc_blk)
round_enc(14);
add_roundkey_enc(16);
 
-   movq %r11, %rbp;
+   movq %r11, %r12;
 
movq %r10, RIO;
test %cl, %cl;
@@ -157,12 +158,13 @@ ENDPROC(__blowfish_enc_blk)
 
 ENTRY(blowfish_dec_blk)
/* input:
-*  %rdi: ctx, CTX
+*  %rdi: ctx
 *  %rsi: dst
 *  %rdx: src
 */
-   movq %rbp, %r11;
+   movq %r12, %r11;
 
+   movq %rdi, CTX;
movq %rsi, %r10;
movq %rdx, RIO;
 
@@ -181,7 +183,7 @@ ENTRY(blowfish_dec_blk)
movq %r10, RIO;
write_block();
 
-   movq %r11, %rbp;
+   movq %r11, %r12;
 
ret;
 ENDPROC(blowfish_dec_blk)
@@ -298,20 +300,21 @@ ENDPROC(blowfish_dec_blk)
 
 ENTRY(__blowfish_enc_blk_4way)
/* input:
-*  %rdi: ctx, CTX
+*  %rdi: ctx
 *  %rsi: dst
 *  %rdx: src
 *  %rcx: bool, if true: xor output
 */
-   pushq %rbp;
+   pushq %r12;
pushq %rbx;
pushq %rcx;
 
-   preload_roundkey_enc(0);
-
+   movq %rdi, CTX
movq %rsi, %r11;
movq %rdx, RIO;
 
+   preload_roundkey_enc(0);
+
read_block4();
 
round_enc4(0);
@@ -324,39 +327,40 @@ ENTRY(__blowfish_enc_blk_4way)
round_enc4(14);
add_preloaded_roundkey4();
 
-   popq %rbp;
+   popq %r12;
movq %r11, RIO;
 
-   test %bpl, %bpl;
+   test %r12b, %r12b;
jnz .L__enc_xor4;
 
write_block4();
 
popq %rbx;
-   popq %rbp;
+   popq %r12;
ret;
 
 .L__enc_xor4:
xor_block4();
 
popq %rbx;
-   popq %rbp;
+   popq %r12;
ret;
 ENDPROC(__blowfish_enc_blk_4way)
 
 ENTRY(blowfish_dec_blk_4way)
/* input:
-*  %rdi: ctx, CTX
+*  %rdi: ctx
 *  %rsi: dst
 *  %rdx: src
 */
-   pushq %rbp;
+   pushq %r12;
pushq %rbx;
-   preload_roundkey_dec(17);
 
-   movq %rsi, %r11;
+   movq %rdi, CTX;
+   movq %rsi, %r11
movq %rdx, RIO;
 
+   preload_roundkey_dec(17);
read_block4();
 
round_dec4(17);
@@ -373,7 +377,7 @@ ENTRY(blowfish_dec_blk_4way)
write_block4();
 
popq %rbx;
-   popq %rbp;
+   popq %r12;
 
ret;
 ENDPROC(blowfish_dec_blk_4way)
-- 
2.13.5



[PATCH 03/12] x86/crypto: Fix RBP usage in cast5-avx-x86_64-asm_64.S

2017-08-29 Thread Josh Poimboeuf
Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Use R15 instead of RBP.  R15 can't be used as the RID1 register because
of x86 instruction encoding limitations.  So use R15 for CTX and RDI for
RID1.  This means that CTX is no longer an implicit function argument.
Instead it needs to be explicitly copied from RDI.

Reported-by: Eric Biggers <ebigg...@google.com>
Reported-by: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
---
 arch/x86/crypto/cast5-avx-x86_64-asm_64.S | 47 ---
 1 file changed, 30 insertions(+), 17 deletions(-)

diff --git a/arch/x86/crypto/cast5-avx-x86_64-asm_64.S 
b/arch/x86/crypto/cast5-avx-x86_64-asm_64.S
index b4a8806234ea..86107c961bb4 100644
--- a/arch/x86/crypto/cast5-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/cast5-avx-x86_64-asm_64.S
@@ -47,7 +47,7 @@
 /**
   16-way AVX cast5
  **/
-#define CTX %rdi
+#define CTX %r15
 
 #define RL1 %xmm0
 #define RR1 %xmm1
@@ -70,8 +70,8 @@
 
 #define RTMP %xmm15
 
-#define RID1  %rbp
-#define RID1d %ebp
+#define RID1  %rdi
+#define RID1d %edi
 #define RID2  %rsi
 #define RID2d %esi
 
@@ -226,7 +226,7 @@
 .align 16
 __cast5_enc_blk16:
/* input:
-*  %rdi: ctx, CTX
+*  %rdi: ctx
 *  RL1: blocks 1 and 2
 *  RR1: blocks 3 and 4
 *  RL2: blocks 5 and 6
@@ -246,9 +246,11 @@ __cast5_enc_blk16:
 *  RR4: encrypted blocks 15 and 16
 */
 
-   pushq %rbp;
+   pushq %r15;
pushq %rbx;
 
+   movq %rdi, CTX;
+
vmovdqa .Lbswap_mask, RKM;
vmovd .Lfirst_mask, R1ST;
vmovd .L32_mask, R32;
@@ -283,7 +285,7 @@ __cast5_enc_blk16:
 
 .L__skip_enc:
popq %rbx;
-   popq %rbp;
+   popq %r15;
 
vmovdqa .Lbswap_mask, RKM;
 
@@ -298,7 +300,7 @@ ENDPROC(__cast5_enc_blk16)
 .align 16
 __cast5_dec_blk16:
/* input:
-*  %rdi: ctx, CTX
+*  %rdi: ctx
 *  RL1: encrypted blocks 1 and 2
 *  RR1: encrypted blocks 3 and 4
 *  RL2: encrypted blocks 5 and 6
@@ -318,9 +320,11 @@ __cast5_dec_blk16:
 *  RR4: decrypted blocks 15 and 16
 */
 
-   pushq %rbp;
+   pushq %r15;
pushq %rbx;
 
+   movq %rdi, CTX;
+
vmovdqa .Lbswap_mask, RKM;
vmovd .Lfirst_mask, R1ST;
vmovd .L32_mask, R32;
@@ -356,7 +360,7 @@ __cast5_dec_blk16:
 
vmovdqa .Lbswap_mask, RKM;
popq %rbx;
-   popq %rbp;
+   popq %r15;
 
outunpack_blocks(RR1, RL1, RTMP, RX, RKM);
outunpack_blocks(RR2, RL2, RTMP, RX, RKM);
@@ -372,12 +376,14 @@ ENDPROC(__cast5_dec_blk16)
 
 ENTRY(cast5_ecb_enc_16way)
/* input:
-*  %rdi: ctx, CTX
+*  %rdi: ctx
 *  %rsi: dst
 *  %rdx: src
 */
FRAME_BEGIN
+   pushq %r15;
 
+   movq %rdi, CTX;
movq %rsi, %r11;
 
vmovdqu (0*4*4)(%rdx), RL1;
@@ -400,18 +406,22 @@ ENTRY(cast5_ecb_enc_16way)
vmovdqu RR4, (6*4*4)(%r11);
vmovdqu RL4, (7*4*4)(%r11);
 
+   popq %r15;
FRAME_END
ret;
 ENDPROC(cast5_ecb_enc_16way)
 
 ENTRY(cast5_ecb_dec_16way)
/* input:
-*  %rdi: ctx, CTX
+*  %rdi: ctx
 *  %rsi: dst
 *  %rdx: src
 */
 
FRAME_BEGIN
+   pushq %r15;
+
+   movq %rdi, CTX;
movq %rsi, %r11;
 
vmovdqu (0*4*4)(%rdx), RL1;
@@ -434,20 +444,22 @@ ENTRY(cast5_ecb_dec_16way)
vmovdqu RR4, (6*4*4)(%r11);
vmovdqu RL4, (7*4*4)(%r11);
 
+   popq %r15;
FRAME_END
ret;
 ENDPROC(cast5_ecb_dec_16way)
 
 ENTRY(cast5_cbc_dec_16way)
/* input:
-*  %rdi: ctx, CTX
+*  %rdi: ctx
 *  %rsi: dst
 *  %rdx: src
 */
FRAME_BEGIN
-
pushq %r12;
+   pushq %r15;
 
+   movq %rdi, CTX;
movq %rsi, %r11;
movq %rdx, %r12;
 
@@ -483,23 +495,24 @@ ENTRY(cast5_cbc_dec_16way)
vmovdqu RR4, (6*16)(%r11);
vmovdqu RL4, (7*16)(%r11);
 
+   popq %r15;
popq %r12;
-
FRAME_END
ret;
 ENDPROC(cast5_cbc_dec_16way)
 
 ENTRY(cast5_ctr_16way)
/* input:
-*  %rdi: ctx, CTX
+*  %rdi: ctx
 *  %rsi: dst
 *  %rdx: src
 *  %rcx: iv (big endian, 64bit)
 */
FRAME_BEGIN
-
pushq %r12;
+   pushq %r15;
 
+   movq %rdi, CTX;
movq %rsi, %r11;
movq %rdx, %r12;
 
@@ -558,8 +571,8 @@ ENTRY(cast5_ctr_16way)
vmovdqu RR4, (6*16)(%r11);
vmovdqu RL4, (7*16)(%r11);
 
+   popq %r15;
popq %r12;
-
FRAME_END
ret;
 ENDPR

[PATCH 02/12] x86/crypto: Fix RBP usage in camellia-x86_64-asm_64.S

2017-08-29 Thread Josh Poimboeuf
Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Use R12 instead of RBP.  Both are callee-saved registers, so the
substitution is straightforward.

Reported-by: Eric Biggers <ebigg...@google.com>
Reported-by: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
---
 arch/x86/crypto/camellia-x86_64-asm_64.S | 26 +-
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/x86/crypto/camellia-x86_64-asm_64.S 
b/arch/x86/crypto/camellia-x86_64-asm_64.S
index 310319c601ed..95ba6956a7f6 100644
--- a/arch/x86/crypto/camellia-x86_64-asm_64.S
+++ b/arch/x86/crypto/camellia-x86_64-asm_64.S
@@ -75,17 +75,17 @@
 #define RCD1bh %dh
 
 #define RT0 %rsi
-#define RT1 %rbp
+#define RT1 %r12
 #define RT2 %r8
 
 #define RT0d %esi
-#define RT1d %ebp
+#define RT1d %r12d
 #define RT2d %r8d
 
 #define RT2bl %r8b
 
 #define RXOR %r9
-#define RRBP %r10
+#define RR12 %r10
 #define RDST %r11
 
 #define RXORd %r9d
@@ -197,7 +197,7 @@ ENTRY(__camellia_enc_blk)
 *  %rdx: src
 *  %rcx: bool xor
 */
-   movq %rbp, RRBP;
+   movq %r12, RR12;
 
movq %rcx, RXOR;
movq %rsi, RDST;
@@ -227,13 +227,13 @@ ENTRY(__camellia_enc_blk)
 
enc_outunpack(mov, RT1);
 
-   movq RRBP, %rbp;
+   movq RR12, %r12;
ret;
 
 .L__enc_xor:
enc_outunpack(xor, RT1);
 
-   movq RRBP, %rbp;
+   movq RR12, %r12;
ret;
 ENDPROC(__camellia_enc_blk)
 
@@ -248,7 +248,7 @@ ENTRY(camellia_dec_blk)
movl $24, RXORd;
cmovel RXORd, RT2d; /* max */
 
-   movq %rbp, RRBP;
+   movq %r12, RR12;
movq %rsi, RDST;
movq %rdx, RIO;
 
@@ -271,7 +271,7 @@ ENTRY(camellia_dec_blk)
 
dec_outunpack();
 
-   movq RRBP, %rbp;
+   movq RR12, %r12;
ret;
 ENDPROC(camellia_dec_blk)
 
@@ -433,7 +433,7 @@ ENTRY(__camellia_enc_blk_2way)
 */
pushq %rbx;
 
-   movq %rbp, RRBP;
+   movq %r12, RR12;
movq %rcx, RXOR;
movq %rsi, RDST;
movq %rdx, RIO;
@@ -461,14 +461,14 @@ ENTRY(__camellia_enc_blk_2way)
 
enc_outunpack2(mov, RT2);
 
-   movq RRBP, %rbp;
+   movq RR12, %r12;
popq %rbx;
ret;
 
 .L__enc2_xor:
enc_outunpack2(xor, RT2);
 
-   movq RRBP, %rbp;
+   movq RR12, %r12;
popq %rbx;
ret;
 ENDPROC(__camellia_enc_blk_2way)
@@ -485,7 +485,7 @@ ENTRY(camellia_dec_blk_2way)
cmovel RXORd, RT2d; /* max */
 
movq %rbx, RXOR;
-   movq %rbp, RRBP;
+   movq %r12, RR12;
movq %rsi, RDST;
movq %rdx, RIO;
 
@@ -508,7 +508,7 @@ ENTRY(camellia_dec_blk_2way)
 
dec_outunpack2();
 
-   movq RRBP, %rbp;
+   movq RR12, %r12;
movq RXOR, %rbx;
ret;
 ENDPROC(camellia_dec_blk_2way)
-- 
2.13.5



[PATCH 04/12] x86/crypto: Fix RBP usage in cast6-avx-x86_64-asm_64.S

2017-08-29 Thread Josh Poimboeuf
Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Use R15 instead of RBP.  R15 can't be used as the RID1 register because
of x86 instruction encoding limitations.  So use R15 for CTX and RDI for
RID1.  This means that CTX is no longer an implicit function argument.
Instead it needs to be explicitly copied from RDI.
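
To illustrate the encoding limitation (not part of the patch): the s-box
lookups extract bytes through the legacy high-byte registers, and those
cannot appear in any instruction that carries a REX prefix, which
%r8d-%r15d always require:

	movzbl	%ah, %edi	/* encodable: no REX prefix needed */
#	movzbl	%ah, %r15d	-- not encodable: %r15d forces a REX prefix,
#				   and with REX there is no %ah at all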

Reported-by: Eric Biggers <ebigg...@google.com>
Reported-by: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
---
 arch/x86/crypto/cast6-avx-x86_64-asm_64.S | 50 +--
 1 file changed, 34 insertions(+), 16 deletions(-)

diff --git a/arch/x86/crypto/cast6-avx-x86_64-asm_64.S 
b/arch/x86/crypto/cast6-avx-x86_64-asm_64.S
index 952d3156a933..7f30b6f0d72c 100644
--- a/arch/x86/crypto/cast6-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/cast6-avx-x86_64-asm_64.S
@@ -47,7 +47,7 @@
 /**
   8-way AVX cast6
  **/
-#define CTX %rdi
+#define CTX %r15
 
 #define RA1 %xmm0
 #define RB1 %xmm1
@@ -70,8 +70,8 @@
 
 #define RTMP %xmm15
 
-#define RID1  %rbp
-#define RID1d %ebp
+#define RID1  %rdi
+#define RID1d %edi
 #define RID2  %rsi
 #define RID2d %esi
 
@@ -264,15 +264,17 @@
 .align 8
 __cast6_enc_blk8:
/* input:
-*  %rdi: ctx, CTX
+*  %rdi: ctx
 *  RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2: blocks
 * output:
 *  RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2: encrypted blocks
 */
 
-   pushq %rbp;
+   pushq %r15;
pushq %rbx;
 
+   movq %rdi, CTX;
+
vmovdqa .Lbswap_mask, RKM;
vmovd .Lfirst_mask, R1ST;
vmovd .L32_mask, R32;
@@ -297,7 +299,7 @@ __cast6_enc_blk8:
QBAR(11);
 
popq %rbx;
-   popq %rbp;
+   popq %r15;
 
vmovdqa .Lbswap_mask, RKM;
 
@@ -310,15 +312,17 @@ ENDPROC(__cast6_enc_blk8)
 .align 8
 __cast6_dec_blk8:
/* input:
-*  %rdi: ctx, CTX
+*  %rdi: ctx
 *  RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2: encrypted blocks
 * output:
 *  RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2: decrypted blocks
 */
 
-   pushq %rbp;
+   pushq %r15;
pushq %rbx;
 
+   movq %rdi, CTX;
+
vmovdqa .Lbswap_mask, RKM;
vmovd .Lfirst_mask, R1ST;
vmovd .L32_mask, R32;
@@ -343,7 +347,7 @@ __cast6_dec_blk8:
QBAR(0);
 
popq %rbx;
-   popq %rbp;
+   popq %r15;
 
vmovdqa .Lbswap_mask, RKM;
outunpack_blocks(RA1, RB1, RC1, RD1, RTMP, RX, RKRF, RKM);
@@ -354,12 +358,14 @@ ENDPROC(__cast6_dec_blk8)
 
 ENTRY(cast6_ecb_enc_8way)
/* input:
-*  %rdi: ctx, CTX
+*  %rdi: ctx
 *  %rsi: dst
 *  %rdx: src
 */
FRAME_BEGIN
+   pushq %r15;
 
+   movq %rdi, CTX;
movq %rsi, %r11;
 
load_8way(%rdx, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2);
@@ -368,18 +374,21 @@ ENTRY(cast6_ecb_enc_8way)
 
store_8way(%r11, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2);
 
+   popq %r15;
FRAME_END
ret;
 ENDPROC(cast6_ecb_enc_8way)
 
 ENTRY(cast6_ecb_dec_8way)
/* input:
-*  %rdi: ctx, CTX
+*  %rdi: ctx
 *  %rsi: dst
 *  %rdx: src
 */
FRAME_BEGIN
+   pushq %r15;
 
+   movq %rdi, CTX;
movq %rsi, %r11;
 
load_8way(%rdx, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2);
@@ -388,20 +397,22 @@ ENTRY(cast6_ecb_dec_8way)
 
store_8way(%r11, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2);
 
+   popq %r15;
FRAME_END
ret;
 ENDPROC(cast6_ecb_dec_8way)
 
 ENTRY(cast6_cbc_dec_8way)
/* input:
-*  %rdi: ctx, CTX
+*  %rdi: ctx
 *  %rsi: dst
 *  %rdx: src
 */
FRAME_BEGIN
-
pushq %r12;
+   pushq %r15;
 
+   movq %rdi, CTX;
movq %rsi, %r11;
movq %rdx, %r12;
 
@@ -411,8 +422,8 @@ ENTRY(cast6_cbc_dec_8way)
 
store_cbc_8way(%r12, %r11, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2);
 
+   popq %r15;
popq %r12;
-
FRAME_END
ret;
 ENDPROC(cast6_cbc_dec_8way)
@@ -425,9 +436,10 @@ ENTRY(cast6_ctr_8way)
 *  %rcx: iv (little endian, 128bit)
 */
FRAME_BEGIN
-
pushq %r12;
+   pushq %r15
 
+   movq %rdi, CTX;
movq %rsi, %r11;
movq %rdx, %r12;
 
@@ -438,8 +450,8 @@ ENTRY(cast6_ctr_8way)
 
store_ctr_8way(%r12, %r11, RA1, RB1, RC1, RD1, RA2, RB2, RC2, RD2);
 
+   popq %r15;
popq %r12;
-
FRAME_END
ret;
 ENDPROC(cast6_ctr_8way)
@@ -452,7 +464,9 @@ ENTRY(cast6_xts_enc_8way)
 *  %rcx: iv (t ⊕ αⁿ ∈ GF(2¹²⁸))
 */
FRAME_BEGIN
+   pushq %r15;
 
+   

[PATCH 06/12] x86/crypto: Fix RBP usage in sha1_avx2_x86_64_asm.S

2017-08-29 Thread Josh Poimboeuf
Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Use R11 instead of RBP.  Since R11 isn't a callee-saved register, it
doesn't need to be saved and restored on the stack.
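
For contrast with the callee-saved swaps elsewhere in this series, a minimal
sketch (example_func is made up):

ENTRY(example_func)
	movl	$1, %r11d	/* caller-saved: clobber freely, no push/pop */
	leal	41(%r11), %eax	/* (%r12-%r15, %rbx, %rbp would need saving) */
	ret
ENDPROC(example_func)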

Reported-by: Eric Biggers <ebigg...@google.com>
Reported-by: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
---
 arch/x86/crypto/sha1_avx2_x86_64_asm.S | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/x86/crypto/sha1_avx2_x86_64_asm.S 
b/arch/x86/crypto/sha1_avx2_x86_64_asm.S
index 1eab79c9ac48..9f712a7dfd79 100644
--- a/arch/x86/crypto/sha1_avx2_x86_64_asm.S
+++ b/arch/x86/crypto/sha1_avx2_x86_64_asm.S
@@ -89,7 +89,7 @@
 #defineREG_RE  %rdx
 #defineREG_RTA %r12
 #defineREG_RTB %rbx
-#defineREG_T1  %ebp
+#defineREG_T1  %r11d
 #definexmm_mov vmovups
 #defineavx2_zeroupper  vzeroupper
 #defineRND_F1  1
@@ -637,7 +637,6 @@ _loop3:
ENTRY(\name)
 
push%rbx
-   push%rbp
push%r12
push%r13
push%r14
@@ -673,7 +672,6 @@ _loop3:
pop %r14
pop %r13
pop %r12
-   pop %rbp
pop %rbx
 
ret
-- 
2.13.5



[PATCH 05/12] x86/crypto: Fix RBP usage in des3_ede-asm_64.S

2017-08-29 Thread Josh Poimboeuf
Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Use RSI instead of RBP for RT1.  Since RSI is also used as the 'dst'
function argument, it needs to be saved on the stack until the argument
is needed.

Reported-by: Eric Biggers <ebigg...@google.com>
Reported-by: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
---
 arch/x86/crypto/des3_ede-asm_64.S | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/x86/crypto/des3_ede-asm_64.S 
b/arch/x86/crypto/des3_ede-asm_64.S
index f3e91647ca27..8e49ce117494 100644
--- a/arch/x86/crypto/des3_ede-asm_64.S
+++ b/arch/x86/crypto/des3_ede-asm_64.S
@@ -64,12 +64,12 @@
 #define RW2bh %ch
 
 #define RT0 %r15
-#define RT1 %rbp
+#define RT1 %rsi
 #define RT2 %r14
 #define RT3 %rdx
 
 #define RT0d %r15d
-#define RT1d %ebp
+#define RT1d %esi
 #define RT2d %r14d
 #define RT3d %edx
 
@@ -177,13 +177,14 @@ ENTRY(des3_ede_x86_64_crypt_blk)
 *  %rsi: dst
 *  %rdx: src
 */
-   pushq %rbp;
pushq %rbx;
pushq %r12;
pushq %r13;
pushq %r14;
pushq %r15;
 
+   pushq %rsi; /* dst */
+
read_block(%rdx, RL0, RR0);
initial_permutation(RL0, RR0);
 
@@ -241,6 +242,8 @@ ENTRY(des3_ede_x86_64_crypt_blk)
round1(32+15, RL0, RR0, dummy2);
 
final_permutation(RR0, RL0);
+
+   popq %rsi /* dst */
write_block(%rsi, RR0, RL0);
 
popq %r15;
@@ -248,7 +251,6 @@ ENTRY(des3_ede_x86_64_crypt_blk)
popq %r13;
popq %r12;
popq %rbx;
-   popq %rbp;
 
ret;
 ENDPROC(des3_ede_x86_64_crypt_blk)
@@ -432,13 +434,14 @@ ENTRY(des3_ede_x86_64_crypt_blk_3way)
 *  %rdx: src (3 blocks)
 */
 
-   pushq %rbp;
pushq %rbx;
pushq %r12;
pushq %r13;
pushq %r14;
pushq %r15;
 
+   pushq %rsi /* dst */
+
/* load input */
movl 0 * 4(%rdx), RL0d;
movl 1 * 4(%rdx), RR0d;
@@ -520,6 +523,7 @@ ENTRY(des3_ede_x86_64_crypt_blk_3way)
bswapl RR2d;
bswapl RL2d;
 
+   popq %rsi /* dst */
movl RR0d, 0 * 4(%rsi);
movl RL0d, 1 * 4(%rsi);
movl RR1d, 2 * 4(%rsi);
@@ -532,7 +536,6 @@ ENTRY(des3_ede_x86_64_crypt_blk_3way)
popq %r13;
popq %r12;
popq %rbx;
-   popq %rbp;
 
ret;
 ENDPROC(des3_ede_x86_64_crypt_blk_3way)
-- 
2.13.5



[PATCH 09/12] x86/crypto: Fix RBP usage in sha256-avx2-asm.S

2017-08-29 Thread Josh Poimboeuf
Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Use R12 instead of RBP for the TBL register.  Since R12 is also used as
another temporary register (T1), it gets clobbered in each round of
computation.  So the table address needs to be freshly reloaded into R12
each time it's used.

Reported-by: Eric Biggers <ebigg...@google.com>
Reported-by: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
---
 arch/x86/crypto/sha256-avx2-asm.S | 16 
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/crypto/sha256-avx2-asm.S 
b/arch/x86/crypto/sha256-avx2-asm.S
index 89c8f09787d2..cdd647231fa9 100644
--- a/arch/x86/crypto/sha256-avx2-asm.S
+++ b/arch/x86/crypto/sha256-avx2-asm.S
@@ -99,7 +99,7 @@ e   = %edx# clobbers NUM_BLKS
 y3 = %esi  # clobbers INP
 
 
-TBL= %rbp
+TBL= %r12  # clobbered by T1
 SRND   = CTX   # SRND is same register as CTX
 
 a = %eax
@@ -531,7 +531,6 @@ STACK_SIZE  = _RSP  + _RSP_SIZE
 ENTRY(sha256_transform_rorx)
 .align 32
pushq   %rbx
-   pushq   %rbp
pushq   %r12
pushq   %r13
pushq   %r14
@@ -568,8 +567,6 @@ ENTRY(sha256_transform_rorx)
mov CTX, _CTX(%rsp)
 
 loop0:
-   lea K256(%rip), TBL
-
## Load first 16 dwords from two blocks
VMOVDQ  0*32(INP),XTMP0
VMOVDQ  1*32(INP),XTMP1
@@ -597,18 +594,22 @@ last_block_enter:
 
 .align 16
 loop1:
+   lea K256(%rip), TBL
vpaddd  0*32(TBL, SRND), X0, XFER
vmovdqa XFER, 0*32+_XFER(%rsp, SRND)
FOUR_ROUNDS_AND_SCHED   _XFER + 0*32
 
+   lea K256(%rip), TBL
vpaddd  1*32(TBL, SRND), X0, XFER
vmovdqa XFER, 1*32+_XFER(%rsp, SRND)
FOUR_ROUNDS_AND_SCHED   _XFER + 1*32
 
+   lea K256(%rip), TBL
vpaddd  2*32(TBL, SRND), X0, XFER
vmovdqa XFER, 2*32+_XFER(%rsp, SRND)
FOUR_ROUNDS_AND_SCHED   _XFER + 2*32
 
+   lea K256(%rip), TBL
vpaddd  3*32(TBL, SRND), X0, XFER
vmovdqa XFER, 3*32+_XFER(%rsp, SRND)
FOUR_ROUNDS_AND_SCHED   _XFER + 3*32
@@ -619,9 +620,12 @@ loop1:
 
 loop2:
## Do last 16 rounds with no scheduling
+   lea K256(%rip), TBL
vpaddd  0*32(TBL, SRND), X0, XFER
vmovdqa XFER, 0*32+_XFER(%rsp, SRND)
DO_4ROUNDS  _XFER + 0*32
+
+   lea K256(%rip), TBL
vpaddd  1*32(TBL, SRND), X1, XFER
vmovdqa XFER, 1*32+_XFER(%rsp, SRND)
DO_4ROUNDS  _XFER + 1*32
@@ -676,9 +680,6 @@ loop3:
ja  done_hash
 
 do_last_block:
-    do last block
-   lea K256(%rip), TBL
-
VMOVDQ  0*16(INP),XWORD0
VMOVDQ  1*16(INP),XWORD1
VMOVDQ  2*16(INP),XWORD2
@@ -718,7 +719,6 @@ done_hash:
popq%r14
popq%r13
popq%r12
-   popq%rbp
popq%rbx
ret
 ENDPROC(sha256_transform_rorx)
-- 
2.13.5



[PATCH 10/12] x86/crypto: Fix RBP usage in sha256-ssse3-asm.S

2017-08-29 Thread Josh Poimboeuf
Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Swap the usages of R12 and RBP.  Use R12 for the TBL register, and use
RBP to store the pre-aligned stack pointer.
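
Sketched in isolation (illustrative only), the resulting pattern lets %rbp act
as both the saved pre-aligned stack pointer and a conventional frame pointer:

	pushq	%rbp
	mov	%rsp, %rbp	/* save the pre-aligned stack pointer     */
	subq	$64, %rsp	/* allocate workspace                     */
	and	$~15, %rsp	/* align it                               */

	mov	%rbp, %rsp	/* undo alignment and allocation together */
	popq	%rbp
	ret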

Reported-by: Eric Biggers <ebigg...@google.com>
Reported-by: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
---
 arch/x86/crypto/sha256-ssse3-asm.S | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/x86/crypto/sha256-ssse3-asm.S 
b/arch/x86/crypto/sha256-ssse3-asm.S
index 39b83c93e7fd..c6c05ed2c16a 100644
--- a/arch/x86/crypto/sha256-ssse3-asm.S
+++ b/arch/x86/crypto/sha256-ssse3-asm.S
@@ -95,7 +95,7 @@ SRND = %rsi   # clobbers INP
 c = %ecx
 d = %r8d
 e = %edx
-TBL = %rbp
+TBL = %r12
 a = %eax
 b = %ebx
 
@@ -356,13 +356,13 @@ a = TMP_
 ENTRY(sha256_transform_ssse3)
 .align 32
pushq   %rbx
-   pushq   %rbp
+   pushq   %r12
pushq   %r13
pushq   %r14
pushq   %r15
-   pushq   %r12
+   pushq   %rbp
+   mov %rsp, %rbp
 
-   mov %rsp, %r12
subq$STACK_SIZE, %rsp
and $~15, %rsp
 
@@ -462,13 +462,12 @@ loop2:
 
 done_hash:
 
-   mov %r12, %rsp
-
-   popq%r12
+   mov %rbp, %rsp
+   popq%rbp
popq%r15
popq%r14
popq%r13
-   popq%rbp
+   popq%r12
popq%rbx
 
ret
-- 
2.13.5



[PATCH 07/12] x86/crypto: Fix RBP usage in sha1_ssse3_asm.S

2017-08-29 Thread Josh Poimboeuf
Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Swap the usages of R12 and RBP.  Use R12 for the REG_D register, and use
RBP to store the pre-aligned stack pointer.

Reported-by: Eric Biggers <ebigg...@google.com>
Reported-by: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
---
 arch/x86/crypto/sha1_ssse3_asm.S | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/crypto/sha1_ssse3_asm.S b/arch/x86/crypto/sha1_ssse3_asm.S
index a4109506a5e8..6204bd53528c 100644
--- a/arch/x86/crypto/sha1_ssse3_asm.S
+++ b/arch/x86/crypto/sha1_ssse3_asm.S
@@ -37,7 +37,7 @@
 #define REG_A  %ecx
 #define REG_B  %esi
 #define REG_C  %edi
-#define REG_D  %ebp
+#define REG_D  %r12d
 #define REG_E  %edx
 
 #define REG_T1 %eax
@@ -74,10 +74,10 @@
ENTRY(\name)
 
push%rbx
-   push%rbp
push%r12
+   push%rbp
+   mov %rsp, %rbp
 
-   mov %rsp, %r12
sub $64, %rsp   # allocate workspace
and $~15, %rsp  # align stack
 
@@ -99,10 +99,9 @@
xor %rax, %rax
rep stosq
 
-   mov %r12, %rsp  # deallocate workspace
-
-   pop %r12
+   mov %rbp, %rsp  # deallocate workspace
pop %rbp
+   pop %r12
pop %rbx
ret
 
-- 
2.13.5



[PATCH 11/12] x86/crypto: Fix RBP usage in sha512-avx2-asm.S

2017-08-29 Thread Josh Poimboeuf
Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Use R12 instead of RBP for the TBL register.  Since R12 is also used as
another temporary register (T1), it gets clobbered in each round of
computation.  So the TBL value needs to be freshly reloaded into R12
each time it's used.  Since the value of TBL can change, store its
permanent value on the stack at the frame_TBL offset.

Also remove the unused y4 variable.

Reported-by: Eric Biggers <ebigge...@gmail.com>
Reported-by: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
---
 arch/x86/crypto/sha512-avx2-asm.S | 21 ++---
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/arch/x86/crypto/sha512-avx2-asm.S 
b/arch/x86/crypto/sha512-avx2-asm.S
index 7f5f6c6ec72e..37cfc2004abd 100644
--- a/arch/x86/crypto/sha512-avx2-asm.S
+++ b/arch/x86/crypto/sha512-avx2-asm.S
@@ -81,7 +81,7 @@ d   = %r8
 e   = %rdx
 y3  = %rsi
 
-TBL   = %rbp
+TBL   = %r12 # clobbered by T1
 
 a = %rax
 b = %rbx
@@ -96,11 +96,10 @@ y0= %r13
 y1= %r14
 y2= %r15
 
-y4= %r12
-
 # Local variables (stack frame)
 XFER_SIZE = 4*8
 SRND_SIZE = 1*8
+TBL_SIZE = 1*8
 INP_SIZE = 1*8
 INPEND_SIZE = 1*8
 RSPSAVE_SIZE = 1*8
@@ -108,7 +107,8 @@ GPRSAVE_SIZE = 6*8
 
 frame_XFER = 0
 frame_SRND = frame_XFER + XFER_SIZE
-frame_INP = frame_SRND + SRND_SIZE
+frame_TBL = frame_SRND + SRND_SIZE
+frame_INP = frame_TBL + TBL_SIZE
 frame_INPEND = frame_INP + INP_SIZE
 frame_RSPSAVE = frame_INPEND + INPEND_SIZE
 frame_GPRSAVE = frame_RSPSAVE + RSPSAVE_SIZE
@@ -601,7 +601,7 @@ ENTRY(sha512_transform_rorx)
vmovdqa PSHUFFLE_BYTE_FLIP_MASK(%rip), BYTE_FLIP_MASK
 
 loop0:
-   lea K512(%rip), TBL
+   movq$K512, frame_TBL(%rsp)
 
## byte swap first 16 dwords
COPY_YMM_AND_BSWAP  Y_0, (INP), BYTE_FLIP_MASK
@@ -616,39 +616,46 @@ loop0:
 
 .align 16
 loop1:
+   mov frame_TBL(%rsp), TBL
vpaddq  (TBL), Y_0, XFER
vmovdqa XFER, frame_XFER(%rsp)
FOUR_ROUNDS_AND_SCHED
 
+   mov frame_TBL(%rsp), TBL
vpaddq  1*32(TBL), Y_0, XFER
vmovdqa XFER, frame_XFER(%rsp)
FOUR_ROUNDS_AND_SCHED
 
+   mov frame_TBL(%rsp), TBL
vpaddq  2*32(TBL), Y_0, XFER
vmovdqa XFER, frame_XFER(%rsp)
FOUR_ROUNDS_AND_SCHED
 
+   mov frame_TBL(%rsp), TBL
vpaddq  3*32(TBL), Y_0, XFER
vmovdqa XFER, frame_XFER(%rsp)
-   add $(4*32), TBL
FOUR_ROUNDS_AND_SCHED
 
+   addq$(4*32), frame_TBL(%rsp)
subq$1, frame_SRND(%rsp)
jne loop1
 
movq$2, frame_SRND(%rsp)
 loop2:
+   mov frame_TBL(%rsp), TBL
vpaddq  (TBL), Y_0, XFER
vmovdqa XFER, frame_XFER(%rsp)
DO_4ROUNDS
+
+   mov frame_TBL(%rsp), TBL
vpaddq  1*32(TBL), Y_1, XFER
vmovdqa XFER, frame_XFER(%rsp)
-   add $(2*32), TBL
DO_4ROUNDS
 
vmovdqa Y_2, Y_0
vmovdqa Y_3, Y_1
 
+   add $(2*32), frame_TBL(%rsp)
subq$1, frame_SRND(%rsp)
jne loop2
 
-- 
2.13.5



[PATCH 12/12] x86/crypto: Fix RBP usage in twofish-avx-x86_64-asm_64.S

2017-08-29 Thread Josh Poimboeuf
Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Use R13 instead of RBP.  Both are callee-saved registers, so the
substitution is straightforward.

Reported-by: Eric Biggers <ebigg...@google.com>
Reported-by: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
---
 arch/x86/crypto/twofish-avx-x86_64-asm_64.S | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S 
b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
index b3f49d286348..73b471da3622 100644
--- a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
@@ -76,8 +76,8 @@
 #define RT %xmm14
 #define RR %xmm15
 
-#define RID1  %rbp
-#define RID1d %ebp
+#define RID1  %r13
+#define RID1d %r13d
 #define RID2  %rsi
 #define RID2d %esi
 
@@ -259,7 +259,7 @@ __twofish_enc_blk8:
 
vmovdqu w(CTX), RK1;
 
-   pushq %rbp;
+   pushq %r13;
pushq %rbx;
pushq %rcx;
 
@@ -282,7 +282,7 @@ __twofish_enc_blk8:
 
popq %rcx;
popq %rbx;
-   popq %rbp;
+   popq %r13;
 
outunpack_blocks(RC1, RD1, RA1, RB1, RK1, RX0, RY0, RK2);
outunpack_blocks(RC2, RD2, RA2, RB2, RK1, RX0, RY0, RK2);
@@ -301,7 +301,7 @@ __twofish_dec_blk8:
 
vmovdqu (w+4*4)(CTX), RK1;
 
-   pushq %rbp;
+   pushq %r13;
pushq %rbx;
 
inpack_blocks(RC1, RD1, RA1, RB1, RK1, RX0, RY0, RK2);
@@ -322,7 +322,7 @@ __twofish_dec_blk8:
vmovdqu (w)(CTX), RK1;
 
popq %rbx;
-   popq %rbp;
+   popq %r13;
 
outunpack_blocks(RA1, RB1, RC1, RD1, RK1, RX0, RY0, RK2);
outunpack_blocks(RA2, RB2, RC2, RD2, RK1, RX0, RY0, RK2);
-- 
2.13.5



[PATCH 08/12] x86/crypto: Fix RBP usage in sha256-avx-asm.S

2017-08-29 Thread Josh Poimboeuf
Using RBP as a temporary register breaks frame pointer convention and
breaks stack traces when unwinding from an interrupt in the crypto code.

Swap the usages of R12 and RBP.  Use R12 for the TBL register, and use
RBP to store the pre-aligned stack pointer.

Reported-by: Eric Biggers <ebigg...@google.com>
Reported-by: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
---
 arch/x86/crypto/sha256-avx-asm.S | 15 +++
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/arch/x86/crypto/sha256-avx-asm.S b/arch/x86/crypto/sha256-avx-asm.S
index e0a1a5f2..001bbcf93c79 100644
--- a/arch/x86/crypto/sha256-avx-asm.S
+++ b/arch/x86/crypto/sha256-avx-asm.S
@@ -103,7 +103,7 @@ SRND = %rsi   # clobbers INP
 c = %ecx
 d = %r8d
 e = %edx
-TBL = %rbp
+TBL = %r12
 a = %eax
 b = %ebx
 
@@ -350,13 +350,13 @@ a = TMP_
 ENTRY(sha256_transform_avx)
 .align 32
pushq   %rbx
-   pushq   %rbp
+   pushq   %r12
pushq   %r13
pushq   %r14
pushq   %r15
-   pushq   %r12
+   pushq   %rbp
+   movq%rsp, %rbp
 
-   mov %rsp, %r12
subq$STACK_SIZE, %rsp   # allocate stack space
and $~15, %rsp  # align stack pointer
 
@@ -452,13 +452,12 @@ loop2:
 
 done_hash:
 
-   mov %r12, %rsp
-
-   popq%r12
+   mov %rbp, %rsp
+   popq%rbp
popq%r15
popq%r14
popq%r13
-   popq%rbp
+   popq%r12
popq%rbx
ret
 ENDPROC(sha256_transform_avx)
-- 
2.13.5



[PATCH 00/12] x86/crypto: Fix RBP usage in several crypto .S files

2017-08-29 Thread Josh Poimboeuf
Many of the x86 crypto functions use RBP as a temporary register.  This
breaks frame pointer convention, and breaks stack traces when unwinding
from an interrupt in the crypto code.

Convert most* of them to leave RBP alone.

These pass the crypto boot tests for me.  Any further testing would be
appreciated!

[*] There are still a few crypto files left that need fixing, but the
fixes weren't trivial and nobody reported unwinder warnings about
them yet, so I'm skipping them for now.

Josh Poimboeuf (12):
  x86/crypto: Fix RBP usage in blowfish-x86_64-asm_64.S
  x86/crypto: Fix RBP usage in camellia-x86_64-asm_64.S
  x86/crypto: Fix RBP usage in cast5-avx-x86_64-asm_64.S
  x86/crypto: Fix RBP usage in cast6-avx-x86_64-asm_64.S
  x86/crypto: Fix RBP usage in des3_ede-asm_64.S
  x86/crypto: Fix RBP usage in sha1_avx2_x86_64_asm.S
  x86/crypto: Fix RBP usage in sha1_ssse3_asm.S
  x86/crypto: Fix RBP usage in sha256-avx-asm.S
  x86/crypto: Fix RBP usage in sha256-avx2-asm.S
  x86/crypto: Fix RBP usage in sha256-ssse3-asm.S
  x86/crypto: Fix RBP usage in sha512-avx2-asm.S
  x86/crypto: Fix RBP usage in twofish-avx-x86_64-asm_64.S

 arch/x86/crypto/blowfish-x86_64-asm_64.S| 48 ++-
 arch/x86/crypto/camellia-x86_64-asm_64.S| 26 +++
 arch/x86/crypto/cast5-avx-x86_64-asm_64.S   | 47 +--
 arch/x86/crypto/cast6-avx-x86_64-asm_64.S   | 50 -
 arch/x86/crypto/des3_ede-asm_64.S   | 15 +
 arch/x86/crypto/sha1_avx2_x86_64_asm.S  |  4 +--
 arch/x86/crypto/sha1_ssse3_asm.S| 11 +++
 arch/x86/crypto/sha256-avx-asm.S| 15 -
 arch/x86/crypto/sha256-avx2-asm.S   | 16 -
 arch/x86/crypto/sha256-ssse3-asm.S  | 15 -
 arch/x86/crypto/sha512-avx2-asm.S   | 21 
 arch/x86/crypto/twofish-avx-x86_64-asm_64.S | 12 +++
 12 files changed, 160 insertions(+), 120 deletions(-)

-- 
2.13.5



Re: [PATCH] crypto: x86/aes - Don't use %rbp as temporary register

2017-05-18 Thread Josh Poimboeuf
On Wed, May 17, 2017 at 03:21:41PM -0700, Eric Biggers wrote:
> On Wed, May 17, 2017 at 03:44:27PM -0500, Josh Poimboeuf wrote:
> > On Tue, May 16, 2017 at 09:03:08PM -0700, Eric Biggers wrote:
> > > From: Eric Biggers <ebigg...@google.com>
> > > 
> > > When using the "aes-asm" implementation of AES (*not* the AES-NI
> > > implementation) on an x86_64, v4.12-rc1 kernel with lockdep enabled, the
> > > following warning was reported, along with a long unwinder dump:
> > > 
> > >   WARNING: kernel stack regs at c9643558 in kworker/u4:2:155 has 
> > > bad 'bp' value 001c
> > > 
> > > The problem is that aes_enc_block() and aes_dec_block() use %rbp as a
> > > temporary register, which breaks stack traces if an interrupt occurs.
> > > 
> > > Fix this by replacing %rbp with %r9, which was being used to hold the
> > > saved value of %rbp.  This required rearranging the AES round macro
> > > slightly since %r9d cannot be used as the target of a move from %ah-%dh.
> > > 
> > > Performance is essentially unchanged --- actually about 0.2% faster than
> > > before.  Interestingly, I also measured aes-generic as being nearly 7%
> > > faster than aes-asm, so perhaps aes-asm has outlived its usefulness...
> > > 
> > > Signed-off-by: Eric Biggers <ebigg...@google.com>
> > 
> > Reviewed-by: Josh Poimboeuf <jpoim...@redhat.com>
> > 
> 
> Hmm, it looks like a number of other algorithms in arch/x86/crypto/ use %rbp 
> (or
> %ebp), e.g. blowfish, camellia, cast5, and aes-i586.  Presumably they have the
> same problem.  I'm a little confused: do these all need to be fixed, and
> when/why did this start being considered broken?

This warning was only recently added, with the goal of flushing out
these types of issues with hand-coded asm to make frame pointer based
stack traces more reliable.  I can take a look at fixing the rest of
them if you want.

-- 
Josh


Re: [PATCH] crypto: x86/aes - Don't use %rbp as temporary register

2017-05-17 Thread Josh Poimboeuf
On Tue, May 16, 2017 at 09:03:08PM -0700, Eric Biggers wrote:
> From: Eric Biggers <ebigg...@google.com>
> 
> When using the "aes-asm" implementation of AES (*not* the AES-NI
> implementation) on an x86_64, v4.12-rc1 kernel with lockdep enabled, the
> following warning was reported, along with a long unwinder dump:
> 
>   WARNING: kernel stack regs at c9643558 in kworker/u4:2:155 has 
> bad 'bp' value 001c
> 
> The problem is that aes_enc_block() and aes_dec_block() use %rbp as a
> temporary register, which breaks stack traces if an interrupt occurs.
> 
> Fix this by replacing %rbp with %r9, which was being used to hold the
> saved value of %rbp.  This required rearranging the AES round macro
> slightly since %r9d cannot be used as the target of a move from %ah-%dh.
> 
> Performance is essentially unchanged --- actually about 0.2% faster than
> before.  Interestingly, I also measured aes-generic as being nearly 7%
> faster than aes-asm, so perhaps aes-asm has outlived its usefulness...
> 
> Signed-off-by: Eric Biggers <ebigg...@google.com>

Reviewed-by: Josh Poimboeuf <jpoim...@redhat.com>

-- 
Josh


Re: [PATCH] x86/crypto: fix %progbits -> @progbits

2017-01-19 Thread Josh Poimboeuf
On Thu, Jan 19, 2017 at 10:28:05PM +0100, Denys Vlasenko wrote:
> %progbits form is used on ARM (where @ is a comment char).
> 
> x86 consistently uses @progbits everywhere else.
> 
> Signed-off-by: Denys Vlasenko <dvlas...@redhat.com>

Reviewed-by: Josh Poimboeuf <jpoim...@redhat.com>

-- 
Josh


Re: x86-64: Maintain 16-byte stack alignment

2017-01-13 Thread Josh Poimboeuf
On Fri, Jan 13, 2017 at 04:36:48PM +0800, Herbert Xu wrote:
> On Thu, Jan 12, 2017 at 12:08:07PM -0800, Andy Lutomirski wrote:
> >
> > I think we have some inline functions that do asm volatile ("call
> > ..."), and I don't see any credible way of forcing alignment short of
> > generating an entirely new stack frame and aligning that.  Ick.  This
> 
> A straight asm call from C should always work because gcc keeps
> the stack aligned in the prologue.
> 
> The only problem with inline assembly is when you start pushing
> things onto the stack directly.

I tried another approach.  I rebuilt the kernel with
-mpreferred-stack-boundary=4 and used awk (poor man's objtool) to find
all leaf functions with misaligned stacks.

  objdump -d ~/k/vmlinux | awk '/>:/ { f=$2; call=0; push=0 } /fentry/ { next } 
/callq/ { call=1 } /push/ { push=!push } /sub.*8,%rsp/ { push=!push } /^$/ && 
call == 0 && push == 0 { print f }'

It found a lot of functions.  Here's one of them:

  814ab450 :
  814ab450: 55  push   %rbp
  814ab451: f7 d9   neg%ecx
  814ab453: 31 c0   xor%eax,%eax
  814ab455: 4c 63 c1movslq %ecx,%r8
  814ab458: 48 89 e5mov%rsp,%rbp
  814ab45b: 53  push   %rbx
  814ab45c: 4a 8d 1c c5 00 00 00lea0x0(,%r8,8),%rbx
  814ab463: 00 
  814ab464: eb 03   jmp814ab469 

  814ab466: 4c 63 c1movslq %ecx,%r8
  814ab469: 49 c1 e0 03 shl$0x3,%r8
  814ab46d: 45 31 c9xor%r9d,%r9d
  814ab470: 49 29 d8sub%rbx,%r8
  814ab473: 4a 03 04 02 add(%rdx,%r8,1),%rax
  814ab477: 41 0f 92 c1 setb   %r9b
  814ab47b: 4a 03 04 06 add(%rsi,%r8,1),%rax
  814ab47f: 41 0f 92 c2 setb   %r10b
  814ab483: 49 89 c3mov%rax,%r11
  814ab486: 83 c1 01add$0x1,%ecx
  814ab489: 45 0f b6 d2 movzbl %r10b,%r10d
  814ab48d: 4e 89 1c 07 mov%r11,(%rdi,%r8,1)
  814ab491: 4b 8d 04 0a lea(%r10,%r9,1),%rax
  814ab495: 75 cf   jne814ab466 

  814ab497: 5b  pop%rbx
  814ab498: 5d  pop%rbp
  814ab499: c3  retq   
  814ab49a: 66 0f 1f 44 00 00   nopw   0x0(%rax,%rax,1)

That's a leaf function which, as far as I can tell, doesn't use any
inline asm, but its prologue produces a misaligned stack.

I added inline asm with a call instruction and no operands or clobbers,
and got the same result.

So Andy's theory seems to be correct.  As long as we allow calls from
inline asm, we can't rely on aligned stacks.

-- 
Josh


Re: x86-64: Maintain 16-byte stack alignment

2017-01-12 Thread Josh Poimboeuf
On Thu, Jan 12, 2017 at 07:23:18PM -0800, Andy Lutomirski wrote:
> On Thu, Jan 12, 2017 at 7:11 PM, Josh Poimboeuf <jpoim...@redhat.com> wrote:
> > On Thu, Jan 12, 2017 at 05:46:55PM -0800, Andy Lutomirski wrote:
> >> On Thu, Jan 12, 2017 at 12:15 PM, Josh Poimboeuf <jpoim...@redhat.com> 
> >> wrote:
> >> > On Thu, Jan 12, 2017 at 12:08:07PM -0800, Andy Lutomirski wrote:
> >> >> On Thu, Jan 12, 2017 at 11:51 AM, Linus Torvalds
> >> >> <torva...@linux-foundation.org> wrote:
> >> >> > On Thu, Jan 12, 2017 at 6:02 AM, Josh Poimboeuf <jpoim...@redhat.com> 
> >> >> > wrote:
> >> >> >>
> >> >> >> Just to clarify, I think you're asking if, for versions of gcc which
> >> >> >> don't support -mpreferred-stack-boundary=3, objtool can analyze all C
> >> >> >> functions to ensure their stacks are 16-byte aligned.
> >> >> >>
> >> >> >> It's certainly possible, but I don't see how that solves the problem.
> >> >> >> The stack will still be misaligned by entry code.  Or am I missing
> >> >> >> something?
> >> >> >
> >> >> > I think the argument is that we *could* try to align things, if we
> >> >> > just had some tool that actually then verified that we aren't missing
> >> >> > anything.
> >> >> >
> >> >> > I'm not entirely happy with checking the generated code, though,
> >> >> > because as Ingo says, you have a 50:50 chance of just getting it right
> >> >> > by mistake. So I'd much rather have some static tool that checks
> >> >> > things at a code level (ie coccinelle or sparse).
> >> >>
> >> >> What I meant was checking the entry code to see if it aligns stack
> >> >> frames, and good luck getting sparse to do that.  Hmm, getting 16-byte
> >> >> alignment for real may actually be entirely a lost cause.  After all,
> >> >> I think we have some inline functions that do asm volatile ("call
> >> >> ..."), and I don't see any credible way of forcing alignment short of
> >> >> generating an entirely new stack frame and aligning that.
> >> >
> >> > Actually we already found all such cases and fixed them by forcing a new
> >> > stack frame, thanks to objtool.  For example, see 55a76b59b5fe.
> >>
> >> What I mean is: what guarantees that the stack is properly aligned for
> >> the subroutine call?  gcc promises to set up a stack frame, but does
> >> it promise that rsp will be properly aligned to call a C function?
> >
> > Yes, I did an experiment and you're right.  I had naively assumed that
> > all stack frames would be aligned.
> 
> Just to check: did you do your experiment with -mpreferred-stack-boundary=4?

Yes, but it's too late for me to be doing hard stuff and I think my
first experiment was bogus.  I didn't use all the other kernel-specific
gcc options.

I tried again with all the kernel gcc options, except with
-mpreferred-stack-boundary=4 instead of 3, and actually came up with the
opposite conclusion.

I used the following code:

void otherfunc(void);

static inline void bar(long *f)
{
	asm volatile("call otherfunc" : : "m" (f) : );
}

void foo(void)
{
	long buf[3] = {0, 0, 0};
	bar(buf);
}

The stack frame was always 16-byte aligned regardless of whether the
buf array size was even or odd.

So my half-asleep brain is telling me that my original assumption was
right.

-- 
Josh


Re: x86-64: Maintain 16-byte stack alignment

2017-01-12 Thread Josh Poimboeuf
On Thu, Jan 12, 2017 at 05:46:55PM -0800, Andy Lutomirski wrote:
> On Thu, Jan 12, 2017 at 12:15 PM, Josh Poimboeuf <jpoim...@redhat.com> wrote:
> > On Thu, Jan 12, 2017 at 12:08:07PM -0800, Andy Lutomirski wrote:
> >> On Thu, Jan 12, 2017 at 11:51 AM, Linus Torvalds
> >> <torva...@linux-foundation.org> wrote:
> >> > On Thu, Jan 12, 2017 at 6:02 AM, Josh Poimboeuf <jpoim...@redhat.com> 
> >> > wrote:
> >> >>
> >> >> Just to clarify, I think you're asking if, for versions of gcc which
> >> >> don't support -mpreferred-stack-boundary=3, objtool can analyze all C
> >> >> functions to ensure their stacks are 16-byte aligned.
> >> >>
> >> >> It's certainly possible, but I don't see how that solves the problem.
> >> >> The stack will still be misaligned by entry code.  Or am I missing
> >> >> something?
> >> >
> >> > I think the argument is that we *could* try to align things, if we
> >> > just had some tool that actually then verified that we aren't missing
> >> > anything.
> >> >
> >> > I'm not entirely happy with checking the generated code, though,
> >> > because as Ingo says, you have a 50:50 chance of just getting it right
> >> > by mistake. So I'd much rather have some static tool that checks
> >> > things at a code level (ie coccinelle or sparse).
> >>
> >> What I meant was checking the entry code to see if it aligns stack
> >> frames, and good luck getting sparse to do that.  Hmm, getting 16-byte
> >> alignment for real may actually be entirely a lost cause.  After all,
> >> I think we have some inline functions that do asm volatile ("call
> >> ..."), and I don't see any credible way of forcing alignment short of
> >> generating an entirely new stack frame and aligning that.
> >
> > Actually we already found all such cases and fixed them by forcing a new
> > stack frame, thanks to objtool.  For example, see 55a76b59b5fe.
> 
> What I mean is: what guarantees that the stack is properly aligned for
> the subroutine call?  gcc promises to set up a stack frame, but does
> it promise that rsp will be properly aligned to call a C function?

Yes, I did an experiment and you're right.  I had naively assumed that
all stack frames would be aligned.

-- 
Josh


Re: x86-64: Maintain 16-byte stack alignment

2017-01-12 Thread Josh Poimboeuf
On Thu, Jan 12, 2017 at 02:15:11PM -0600, Josh Poimboeuf wrote:
> On Thu, Jan 12, 2017 at 12:08:07PM -0800, Andy Lutomirski wrote:
> > On Thu, Jan 12, 2017 at 11:51 AM, Linus Torvalds
> > <torva...@linux-foundation.org> wrote:
> > > On Thu, Jan 12, 2017 at 6:02 AM, Josh Poimboeuf <jpoim...@redhat.com> 
> > > wrote:
> > >>
> > >> Just to clarify, I think you're asking if, for versions of gcc which
> > >> don't support -mpreferred-stack-boundary=3, objtool can analyze all C
> > >> functions to ensure their stacks are 16-byte aligned.
> > >>
> > >> It's certainly possible, but I don't see how that solves the problem.
> > >> The stack will still be misaligned by entry code.  Or am I missing
> > >> something?
> > >
> > > I think the argument is that we *could* try to align things, if we
> > > just had some tool that actually then verified that we aren't missing
> > > anything.
> > >
> > > I'm not entirely happy with checking the generated code, though,
> > > because as Ingo says, you have a 50:50 chance of just getting it right
> > > by mistake. So I'd much rather have some static tool that checks
> > > things at a code level (ie coccinelle or sparse).
> > 
> > What I meant was checking the entry code to see if it aligns stack
> > frames, and good luck getting sparse to do that.  Hmm, getting 16-byte
> > alignment for real may actually be entirely a lost cause.  After all,
> > I think we have some inline functions that do asm volatile ("call
> > ..."), and I don't see any credible way of forcing alignment short of
> > generating an entirely new stack frame and aligning that.
> 
> Actually we already found all such cases and fixed them by forcing a new
> stack frame, thanks to objtool.  For example, see 55a76b59b5fe.
> 
> > Ick.  This
> > whole situation stinks, and I wish that the gcc developers had been
> > less daft here in the first place or that we'd noticed and gotten it
> > fixed much longer ago.
> > 
> > Can we come up with a macro like STACK_ALIGN_16 that turns into
> > __aligned__(32) on bad gcc versions and combine that with your sparse
> > patch?

This could work.  Only concerns I'd have are:

- Are there (or will there be in the future) any asm functions which
  assume a 16-byte aligned stack?  (Seems unlikely.  Stack alignment is
  common in the crypto code but they do the alignment manually.)

- Who's going to run sparse all the time to catch unauthorized users of
  __aligned__(16)?

-- 
Josh


Re: x86-64: Maintain 16-byte stack alignment

2017-01-12 Thread Josh Poimboeuf
On Thu, Jan 12, 2017 at 12:08:07PM -0800, Andy Lutomirski wrote:
> On Thu, Jan 12, 2017 at 11:51 AM, Linus Torvalds
> <torva...@linux-foundation.org> wrote:
> > On Thu, Jan 12, 2017 at 6:02 AM, Josh Poimboeuf <jpoim...@redhat.com> wrote:
> >>
> >> Just to clarify, I think you're asking if, for versions of gcc which
> >> don't support -mpreferred-stack-boundary=3, objtool can analyze all C
> >> functions to ensure their stacks are 16-byte aligned.
> >>
> >> It's certainly possible, but I don't see how that solves the problem.
> >> The stack will still be misaligned by entry code.  Or am I missing
> >> something?
> >
> > I think the argument is that we *could* try to align things, if we
> > just had some tool that actually then verified that we aren't missing
> > anything.
> >
> > I'm not entirely happy with checking the generated code, though,
> > because as Ingo says, you have a 50:50 chance of just getting it right
> > by mistake. So I'd much rather have some static tool that checks
> > things at a code level (ie coccinelle or sparse).
> 
> What I meant was checking the entry code to see if it aligns stack
> frames, and good luck getting sparse to do that.  Hmm, getting 16-byte
> alignment for real may actually be entirely a lost cause.  After all,
> I think we have some inline functions that do asm volatile ("call
> ..."), and I don't see any credible way of forcing alignment short of
> generating an entirely new stack frame and aligning that.

Actually we already found all such cases and fixed them by forcing a new
stack frame, thanks to objtool.  For example, see 55a76b59b5fe.

> Ick.  This
> whole situation stinks, and I wish that the gcc developers had been
> less daft here in the first place or that we'd noticed and gotten it
> fixed much longer ago.
> 
> Can we come up with a macro like STACK_ALIGN_16 that turns into
> __aligned__(32) on bad gcc versions and combine that with your sparse
> patch?

-- 
Josh


Re: x86-64: Maintain 16-byte stack alignment

2017-01-12 Thread Josh Poimboeuf
On Thu, Jan 12, 2017 at 11:06:50PM +0800, Herbert Xu wrote:
> On Thu, Jan 12, 2017 at 09:03:18AM -0600, Josh Poimboeuf wrote:
> > 
> > For the entry code, could we just replace all calls with CALL_ALIGNED?
> > That might be less intrusive than trying to adjust all the pt_regs
> > accesses.
> > 
> > Then to ensure that nobody ever uses 'call' directly:
> > 
> >   '#define call please-use-CALL-ALIGNED-instead-of-call'
> > 
> > I think that would be possible if CALL_ALIGNED were a ".macro".
> 
> The trouble with that is that you have got things like TRACE_IRQS_OFF
> which are also used outside of the entry code.

Where?  As far as I can tell, TRACE_IRQS_OFF is used exclusively by entry
code.

-- 
Josh


Re: x86-64: Maintain 16-byte stack alignment

2017-01-12 Thread Josh Poimboeuf
On Thu, Jan 12, 2017 at 09:03:18AM -0600, Josh Poimboeuf wrote:
> On Wed, Jan 11, 2017 at 11:51:10PM -0800, Andy Lutomirski wrote:
> > On Wed, Jan 11, 2017 at 11:05 PM, Herbert Xu
> > <herb...@gondor.apana.org.au> wrote:
> > > On Tue, Jan 10, 2017 at 09:05:28AM -0800, Linus Torvalds wrote:
> > >>
> > >> I'm pretty sure we have random asm code that may not maintain a
> > >> 16-byte stack alignment when it calls other code (including, in some
> > >> cases, calling C code).
> > >>
> > >> So I'm not at all convinced that this is a good idea. We shouldn't
> > >> expect 16-byte alignment to be something trustworthy.
> > >
> > > So what if we audited all the x86 assembly code to fix this? Would
> > > it then be acceptable to do a 16-byte aligned stack?
> > >
> > > On the face of it it doesn't seem to be a huge amount of code
> > > assuming they mostly live under arch/x86.
> > 
> > The problem is that we have nasties like TRACE_IRQS_OFF.  Performance
> > doesn't really matter for these macros, so we could probably rig up a
> > helper for forcibly align the stack there.  Maybe
> > FRAME_BEGIN_FORCE_ALIGN?  I also think I'd rather not to modify
> > pt_regs.  We should just fix the small number of code paths that
> > create a pt_regs and then call into C code to align the stack.
> > 
> > But if we can't do this with automatic verification, then I'm not sure
> > I want to do it at all.  The asm is already more precarious than I'd
> > like, and having a code path that is misaligned is asking for obscure
> > bugs down the road.
> 
> For the entry code, could we just replace all calls with CALL_ALIGNED?
> That might be less intrusive than trying to adjust all the pt_regs
> accesses.
> 
> Then to ensure that nobody ever uses 'call' directly:
> 
>   '#define call please-use-CALL-ALIGNED-instead-of-call'
> 
> I think that would be possible if CALL_ALIGNED were a ".macro".

To clarify, CALL_ALIGNED could be (completely untested):

.macro CALL_ALIGNED func
	push	%rbp
	movq	%rsp, %rbp
	and	$0xfffffffffffffff0, %rsp
	call	\func
	movq	%rbp, %rsp
	pop	%rbp
.endm
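
A hypothetical call site, picking trace_hardirqs_off() only because
TRACE_IRQS_OFF came up earlier in the thread:

	/* instead of:	call	trace_hardirqs_off */
	CALL_ALIGNED	trace_hardirqs_off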

-- 
Josh


Re: x86-64: Maintain 16-byte stack alignment

2017-01-12 Thread Josh Poimboeuf
On Wed, Jan 11, 2017 at 11:51:10PM -0800, Andy Lutomirski wrote:
> On Wed, Jan 11, 2017 at 11:05 PM, Herbert Xu
>  wrote:
> > On Tue, Jan 10, 2017 at 09:05:28AM -0800, Linus Torvalds wrote:
> >>
> >> I'm pretty sure we have random asm code that may not maintain a
> >> 16-byte stack alignment when it calls other code (including, in some
> >> cases, calling C code).
> >>
> >> So I'm not at all convinced that this is a good idea. We shouldn't
> >> expect 16-byte alignment to be something trustworthy.
> >
> > So what if we audited all the x86 assembly code to fix this? Would
> > it then be acceptable to do a 16-byte aligned stack?
> >
> > On the face of it it doesn't seem to be a huge amount of code
> > assuming they mostly live under arch/x86.
> 
> The problem is that we have nasties like TRACE_IRQS_OFF.  Performance
> doesn't really matter for these macros, so we could probably rig up a
> helper for forcibly align the stack there.  Maybe
> FRAME_BEGIN_FORCE_ALIGN?  I also think I'd rather not to modify
> pt_regs.  We should just fix the small number of code paths that
> create a pt_regs and then call into C code to align the stack.
> 
> But if we can't do this with automatic verification, then I'm not sure
> I want to do it at all.  The asm is already more precarious than I'd
> like, and having a code path that is misaligned is asking for obscure
> bugs down the road.

For the entry code, could we just replace all calls with CALL_ALIGNED?
That might be less intrusive than trying to adjust all the pt_regs
accesses.

Then to ensure that nobody ever uses 'call' directly:

  '#define call please-use-CALL-ALIGNED-instead-of-call'

I think that would be possible if CALL_ALIGNED were a ".macro".

-- 
Josh


Re: x86-64: Maintain 16-byte stack alignment

2017-01-12 Thread Josh Poimboeuf
On Thu, Jan 12, 2017 at 08:46:01AM +0100, Ingo Molnar wrote:
> 
> * Herbert Xu  wrote:
> 
> > On Tue, Jan 10, 2017 at 09:05:28AM -0800, Linus Torvalds wrote:
> > >
> > > I'm pretty sure we have random asm code that may not maintain a
> > > 16-byte stack alignment when it calls other code (including, in some
> > > cases, calling C code).
> > > 
> > > So I'm not at all convinced that this is a good idea. We shouldn't
> > > expect 16-byte alignment to be something trustworthy.
> > 
> > So what if we audited all the x86 assembly code to fix this? Would
> > it then be acceptable to do a 16-byte aligned stack?
> 
> Audits for small but deadly details that isn't checked automatically by 
> tooling 
> would inevitably bitrot again - and in this particular case there's a 50% 
> chance 
> that a new, buggy change would test out to be 'fine' on a kernel developer's 
> own 
> box - and break on different configs, different hw or with unrelated (and 
> innocent) kernel changes, sometime later - spreading the pain unnecessarily.
> 
> So my feeling is that we really need improved tooling for this (and yes, the 
> GCC 
> toolchain should have handled this correctly).
> 
> But fortunately we have related tooling in the kernel: could objtool handle 
> this? 
> My secret hope was always that objtool would grow into a kind of life 
> insurance 
> against toolchain bogosities (which is a must for things like livepatching or 
> a 
> DWARF unwinder - but I digress).

Are we talking about entry code, or other asm code?  Because objtool
audits *most* asm code, but entry code is way too specialized for
objtool to understand.

(I do have a pending objtool rewrite which would make it very easy to
ensure 16-byte stack alignment.  But again, objtool can only understand
callable C or asm functions, not entry code.)

Another approach would be to solve this problem with unwinder warnings,
*if* there's enough test coverage.

I recently made some changes to try to standardize the "end" of the
stack, so that the stack pointer is always a certain value before
calling into C code.  I also added some warnings to the unwinder to
ensure that it always reaches that point on the stack.  So if the "end"
of the stack were adjusted by a word by adding padding to pt_regs, the
unwinder warnings could help preserve that.

We could take that a step further by adding an unwinder check to ensure
that *every* frame is 16-byte aligned if -mpreferred-stack-boundary=3
isn't used.

Yet another step would be to add a debug feature which does stack sanity
checking from a periodic NMI, to flush out these unwinder warnings.

(Though I've found that current 0-day and fuzzing efforts, combined with
lockdep and perf's frequent unwinder usage, are already doing a great
job at flushing out unwinder warnings.)

The only question is if there would be enough test coverage,
particularly with those versions of gcc which don't have
-mpreferred-stack-boundary=3.

-- 
Josh


Re: x86-64: Maintain 16-byte stack alignment

2017-01-12 Thread Josh Poimboeuf
On Wed, Jan 11, 2017 at 10:21:07PM -0800, Andy Lutomirski wrote:
> On Tue, Jan 10, 2017 at 10:01 PM, Andy Lutomirski  wrote:
> > On Tue, Jan 10, 2017 at 8:35 PM, Herbert Xu  
> > wrote:
> >> On Tue, Jan 10, 2017 at 08:17:17PM -0800, Linus Torvalds wrote:
> >>>
> >>> That said, I do think that the "don't assume stack alignment, do it by
> >>> hand" may be the safer thing. Because who knows what the random rules
> >>> will be on other architectures.
> >>
> >> Sure we can ban the use of attribute aligned on stacks.  But
> >> what about indirect uses through structures?  For example, if
> >> someone does
> >>
> >> struct foo {
> >> } __attribute__ ((__aligned__(16)));
> >>
> >> int bar(...)
> >> {
> >> struct foo f;
> >>
> >> return baz();
> >> }
> >>
> >> then baz will end up with an unaligned argument.  The worst part
> >> is that it is not at all obvious to the person writing the function
> >> bar.
> >
> > Linus, I'm starting to lean toward agreeing with Herbert here, except
> > that we should consider making it conditional on having a silly GCC
> > version.  After all, the silly GCC versions are wasting space and time
> > with alignment instructions no matter what we do, so this would just
> > mean tweaking the asm and adding some kind of check_stack_alignment()
> > helper to throw out a WARN_ONCE() if we miss one.  The problem with
> > making it conditional is that making pt_regs effectively live at a
> > variable offset from %rsp is just nasty.
> 
> So actually doing this is gross because we have calls from asm to C
> all over the place.  But... maybe we can automate all the testing.
> Josh, how hard would it be to teach objtool to (if requested by an
> option) check that stack frames with statically known size preserve
> 16-byte stack alignment?
> 
> I find it rather annoying that gcc before 4.8 malfunctions when it
> sees __aligned__(16) on x86_64 kernels.  Sigh.

Just to clarify, I think you're asking if, for versions of gcc which
don't support -mpreferred-stack-boundary=3, objtool can analyze all C
functions to ensure their stacks are 16-byte aligned.

It's certainly possible, but I don't see how that solves the problem.
The stack will still be misaligned by entry code.  Or am I missing
something?

-- 
Josh


Re: [PATCH] crypto/sha1-mb: make sha1_x8_avx2() conform to C function ABI

2016-05-16 Thread Josh Poimboeuf
On Mon, May 16, 2016 at 02:39:06PM -0700, Megha Dey wrote:
> On Mon, 2016-05-16 at 15:16 -0500, Josh Poimboeuf wrote:
> > On Mon, May 16, 2016 at 11:31:12AM -0700, Megha Dey wrote:
> > > On Mon, 2016-05-16 at 09:44 -0500, Josh Poimboeuf wrote:
> > > > On Fri, May 13, 2016 at 10:32:26AM -0700, Megha Dey wrote:
> > > > > On Fri, 2016-05-13 at 07:51 +0200, Ingo Molnar wrote:
> > > > > > * Herbert Xu <herb...@gondor.apana.org.au> wrote:
> > > > > > 
> > > > > > > On Thu, May 12, 2016 at 04:31:06PM -0700, Megha Dey wrote:
> > > > > > > > Hi,
> > > > > > > >  
> > > > > > > > When booting latest kernel with the CONFIG_CRYPTO_SHA1_MB 
> > > > > > > > enabled, I
> > > > > > > > observe a panic.
> > > > > > > >  
> > > > > > > > After having a quick look, on reverting the following patches, 
> > > > > > > > I am able
> > > > > > > > to complete the booting process.
> > > > > > > > aec4d0e301f17bb143341c82cc44685b8af0b945
> > > > > > > > 8691ccd764f9ecc69a6812dfe76214c86ac9ba06
> > > > > > > > 68874ac3304ade7ed5ebb12af00d6b9bbbca0a16
> > > > > > > >  
> > > > > > > > Of the 3 patches, aec4d0e301f17bb143341c82cc44685b8af0b945 
> > > > > > > > seems wrong.
> > > > > > > > The r10 to r15 registers are used in sha1_x8_avx2.S, which is 
> > > > > > > > called
> > > > > > > > from sha1_mb_mgr_flush_avx2.S.
> > > > > > > >
> > > > > > > > I do not think the functionality of the SHA1-MB crypto 
> > > > > > > > algorithm has
> > > > > > > > been tested after applying these changes. (I am not sure if any 
> > > > > > > > of the
> > > > > > > > other crypto algorithms have been affected by these changes).
> > > > > > > 
> > > > > > > Josh, Ingo:
> > > > > > > 
> > > > > > > Any ideas on this? Should we revert?
> > > > > > 
> > > > > > Yeah, I think so - although another option would be to standardize 
> > > > > > sha1_x8_avx2() 
> > > > > > - the problem is that it is a function that clobbers registers 
> > > > > > without 
> > > > > > saving/restoring them, breaking the C function ABI. I realize it's 
> > > > > > written in 
> > > > > > assembly, but unless there are strong performance reasons to 
> > > > > > deviate from the 
> > > > > > regular calling convention it might make sense to fix that.
> > > > > > 
> > > > > > Do any warnings get generated after the revert, if you enable 
> > > > > > CONFIG_STACK_VALIDATION=y?
> > > > > 
> > > > > After the revert and enabling CONFIG_STACK_VALIDATION:
> > > > > arch/x86/crypto/sha1-mb/sha1_mb_mgr_flush_avx2.o: warning: objtool:
> > > > > sha1_mb_mgr_flush_avx2()+0x20d: call without frame pointer save/setup
> > > > > 
> > > > > arch/x86/crypto/sha1-mb/sha1_mb_mgr_submit_avx2.o: warning: objtool:
> > > > > sha1_mb_mgr_submit_avx2()+0x115: call without frame pointer save/setup
> > > > 
> > > > Megha,
> > > > 
> > > > Sorry for breaking it.  I completely missed the fact that the function
> > > > calls sha1_x8_avx2() which clobbers registers.
> > > > 
> > > > If the performance penalty isn't too bad, I'll submit a patch to
> > > > standardize sha1_x8_avx2() to follow the C ABI.
> > > > 
> > > > Do you have any tips for testing this code?  I've tried using the tcrypt
> > > > module, but no luck.
> > > > 
> > > Josh,
> > > Build the kernel with the following configs:
> > > CONFIG_CRYPTO_SHA1_MB=y
> > > CONFIG_CRYPTO_TEST=m
> > > CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=n
> > > There was a kernel panic while booting. 
> > > So if after applying your new patch, we are able to get complete the
> > > boot, then we are good.
> > > 
> > > Could you please send a copy of the patch, I could test it on my end
> > > too. 
> > 
> > Thanks.  I was able to run the tes

[PATCH] crypto/sha1-mb: make sha1_x8_avx2() conform to C function ABI

2016-05-16 Thread Josh Poimboeuf
On Mon, May 16, 2016 at 11:31:12AM -0700, Megha Dey wrote:
> On Mon, 2016-05-16 at 09:44 -0500, Josh Poimboeuf wrote:
> > On Fri, May 13, 2016 at 10:32:26AM -0700, Megha Dey wrote:
> > > On Fri, 2016-05-13 at 07:51 +0200, Ingo Molnar wrote:
> > > > * Herbert Xu <herb...@gondor.apana.org.au> wrote:
> > > > 
> > > > > On Thu, May 12, 2016 at 04:31:06PM -0700, Megha Dey wrote:
> > > > > > Hi,
> > > > > >  
> > > > > > When booting latest kernel with the CONFIG_CRYPTO_SHA1_MB enabled, I
> > > > > > observe a panic.
> > > > > >  
> > > > > > After having a quick look, on reverting the following patches, I am 
> > > > > > able
> > > > > > to complete the booting process.
> > > > > > aec4d0e301f17bb143341c82cc44685b8af0b945
> > > > > > 8691ccd764f9ecc69a6812dfe76214c86ac9ba06
> > > > > > 68874ac3304ade7ed5ebb12af00d6b9bbbca0a16
> > > > > >  
> > > > > > Of the 3 patches, aec4d0e301f17bb143341c82cc44685b8af0b945 seems 
> > > > > > wrong.
> > > > > > The r10 to r15 registers are used in sha1_x8_avx2.S, which is called
> > > > > > from sha1_mb_mgr_flush_avx2.S.
> > > > > >
> > > > > > I do not think the functionality of the SHA1-MB crypto algorithm has
> > > > > > been tested after applying these changes. (I am not sure if any of 
> > > > > > the
> > > > > > other crypto algorithms have been affected by these changes).
> > > > > 
> > > > > Josh, Ingo:
> > > > > 
> > > > > Any ideas on this? Should we revert?
> > > > 
> > > > Yeah, I think so - although another option would be to standardize 
> > > > sha1_x8_avx2() 
> > > > - the problem is that it is a function that clobbers registers without 
> > > > saving/restoring them, breaking the C function ABI. I realize it's 
> > > > written in 
> > > > assembly, but unless there are strong performance reasons to deviate 
> > > > from the 
> > > > regular calling convention it might make sense to fix that.
> > > > 
> > > > Do any warnings get generated after the revert, if you enable 
> > > > CONFIG_STACK_VALIDATION=y?
> > > 
> > > After the revert and enabling CONFIG_STACK_VALIDATION:
> > > arch/x86/crypto/sha1-mb/sha1_mb_mgr_flush_avx2.o: warning: objtool:
> > > sha1_mb_mgr_flush_avx2()+0x20d: call without frame pointer save/setup
> > > 
> > > arch/x86/crypto/sha1-mb/sha1_mb_mgr_submit_avx2.o: warning: objtool:
> > > sha1_mb_mgr_submit_avx2()+0x115: call without frame pointer save/setup
> > 
> > Megha,
> > 
> > Sorry for breaking it.  I completely missed the fact that the function
> > calls sha1_x8_avx2() which clobbers registers.
> > 
> > If the performance penalty isn't too bad, I'll submit a patch to
> > standardize sha1_x8_avx2() to follow the C ABI.
> > 
> > Do you have any tips for testing this code?  I've tried using the tcrypt
> > module, but no luck.
> > 
> Josh,
> Build the kernel with the following configs:
> CONFIG_CRYPTO_SHA1_MB=y
> CONFIG_CRYPTO_TEST=m
> CONFIG_CRYPTO_MANAGER_DISABLE_TESTS=n
> There was a kernel panic while booting. 
> So if after applying your new patch, we are able to complete the
> boot, then we are good.
> 
> Could you please send a copy of the patch, I could test it on my end
> too. 

Thanks.  I was able to run the tests, though I didn't see a panic.  Can
you test with this patch?



From: Josh Poimboeuf <jpoim...@redhat.com>
Subject: [PATCH] crypto/sha1-mb: make sha1_x8_avx2() conform to C function ABI

Megha Dey reported a kernel panic in crypto code.  The problem is that
sha1_x8_avx2() clobbers registers r12-r15 without saving and restoring
them.

Before commit aec4d0e301f1 ("x86/asm/crypto: Simplify stack usage in
sha-mb functions"), those registers were saved and restored by the
callers of the function.  I removed them with that commit because I
didn't realize sha1_x8_avx2() clobbered them.

Fix the potential undefined behavior associated with clobbering the
registers and make the behavior less surprising by changing the
registers to be callee saved/restored to conform with the C function
call ABI.

Also, rdx (aka RSP_SAVE) doesn't need to be saved: I verified that none
of the callers rely on it being saved, and it's not a callee-saved
register in the C ABI.

Fixes: aec4d0e301f1 ("x86/asm
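
The patch body is truncated in this archive entry. For illustration only, here
is a sketch of the shape of the fix the commit message describes: saving and
restoring the clobbered callee-saved registers inside sha1_x8_avx2() itself.
The register list follows the r12-r15 set named above; the body is omitted and
the rest is an assumption, not the actual patch:

#include <linux/linkage.h>

ENTRY(sha1_x8_avx2)
	push	%r12			# callee-saved in the C ABI
	push	%r13
	push	%r14
	push	%r15

	# ... the existing eight-lane AVX2 SHA-1 rounds go here ...

	pop	%r15			# restore in reverse order
	pop	%r14
	pop	%r13
	pop	%r12
	ret
ENDPROC(sha1_x8_avx2)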

Re: SHA1-MB algorithm broken on latest kernel

2016-05-16 Thread Josh Poimboeuf
On Fri, May 13, 2016 at 10:32:26AM -0700, Megha Dey wrote:
> On Fri, 2016-05-13 at 07:51 +0200, Ingo Molnar wrote:
> > > * Herbert Xu <herb...@gondor.apana.org.au> wrote:
> > 
> > > On Thu, May 12, 2016 at 04:31:06PM -0700, Megha Dey wrote:
> > > > Hi,
> > > >  
> > > > When booting latest kernel with the CONFIG_CRYPTO_SHA1_MB enabled, I
> > > > observe a panic.
> > > >  
> > > > After having a quick look, on reverting the following patches, I am able
> > > > to complete the booting process.
> > > > aec4d0e301f17bb143341c82cc44685b8af0b945
> > > > 8691ccd764f9ecc69a6812dfe76214c86ac9ba06
> > > > 68874ac3304ade7ed5ebb12af00d6b9bbbca0a16
> > > >  
> > > > Of the 3 patches, aec4d0e301f17bb143341c82cc44685b8af0b945 seems wrong.
> > > > The r10 to r15 registers are used in sha1_x8_avx2.S, which is called
> > > > from sha1_mb_mgr_flush_avx2.S.
> > > >
> > > > I do not think the functionality of the SHA1-MB crypto algorithm has
> > > > been tested after applying these changes. (I am not sure if any of the
> > > > other crypto algorithms have been affected by these changes).
> > > 
> > > Josh, Ingo:
> > > 
> > > Any ideas on this? Should we revert?
> > 
> > Yeah, I think so - although another option would be to standardize 
> > sha1_x8_avx2() 
> > - the problem is that it is a function that clobbers registers without 
> > saving/restoring them, breaking the C function ABI. I realize it's written 
> > in 
> > assembly, but unless there are strong performance reasons to deviate from 
> > the 
> > regular calling convention it might make sense to fix that.
> > 
> > Do any warnings get generated after the revert, if you enable 
> > CONFIG_STACK_VALIDATION=y?
> 
> After the revert and enabling CONFIG_STACK_VALIDATION:
> arch/x86/crypto/sha1-mb/sha1_mb_mgr_flush_avx2.o: warning: objtool:
> sha1_mb_mgr_flush_avx2()+0x20d: call without frame pointer save/setup
> 
> arch/x86/crypto/sha1-mb/sha1_mb_mgr_submit_avx2.o: warning: objtool:
> sha1_mb_mgr_submit_avx2()+0x115: call without frame pointer save/setup

Megha,

Sorry for breaking it.  I completely missed the fact that the function
calls sha1_x8_avx2() which clobbers registers.

If the performance penalty isn't too bad, I'll submit a patch to
standardize sha1_x8_avx2() to follow the C ABI.

Do you have any tips for testing this code?  I've tried using the tcrypt
module, but no luck.

-- 
Josh
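
For reference, the test setup Megha describes earlier in the thread, written
out as a sketch (illustrative only; the reported panic was hit during boot
while the crypto self-tests run):

# .config fragment per the instructions above
CONFIG_CRYPTO_SHA1_MB=y
CONFIG_CRYPTO_TEST=m
# CONFIG_CRYPTO_MANAGER_DISABLE_TESTS is not set

# After booting the rebuilt kernel the self-tests have already run as the
# algorithms registered; "modprobe tcrypt" can be used to exercise further
# tests (the module intentionally fails to stay loaded once they finish).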


[PATCH v5 04/10] x86/asm/crypto: Fix asmvalidate warnings for aesni-intel_asm.S

2015-06-10 Thread Josh Poimboeuf
Fix the following asmvalidate warnings:

   asmvalidate: arch/x86/crypto/aesni-intel_asm.o: aesni_set_key(): missing FP_SAVE/RESTORE macros
   asmvalidate: arch/x86/crypto/aesni-intel_asm.o: aesni_enc(): missing FP_SAVE/RESTORE macros
   asmvalidate: arch/x86/crypto/aesni-intel_asm.o: aesni_dec(): missing FP_SAVE/RESTORE macros
   asmvalidate: arch/x86/crypto/aesni-intel_asm.o: aesni_ecb_enc(): missing FP_SAVE/RESTORE macros
   asmvalidate: arch/x86/crypto/aesni-intel_asm.o: aesni_ecb_dec(): missing FP_SAVE/RESTORE macros
   asmvalidate: arch/x86/crypto/aesni-intel_asm.o: aesni_cbc_enc(): missing FP_SAVE/RESTORE macros
   asmvalidate: arch/x86/crypto/aesni-intel_asm.o: aesni_cbc_dec(): missing FP_SAVE/RESTORE macros
   asmvalidate: arch/x86/crypto/aesni-intel_asm.o: aesni_ctr_enc(): missing FP_SAVE/RESTORE macros
   asmvalidate: arch/x86/crypto/aesni-intel_asm.o: aesni_xts_crypt8(): missing FP_SAVE/RESTORE macros

These are all non-leaf callable functions, so save/restore the frame
pointer with FP_SAVE/RESTORE.

Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
Cc: linux-crypto@vger.kernel.org
Cc: Herbert Xu <herb...@gondor.apana.org.au>
Cc: David S. Miller <da...@davemloft.net>
---
 arch/x86/crypto/aesni-intel_asm.S | 19 +++
 1 file changed, 19 insertions(+)

diff --git a/arch/x86/crypto/aesni-intel_asm.S 
b/arch/x86/crypto/aesni-intel_asm.S
index 6bd2c6c..83465f9a 100644
--- a/arch/x86/crypto/aesni-intel_asm.S
+++ b/arch/x86/crypto/aesni-intel_asm.S
@@ -31,6 +31,7 @@
 
 #include <linux/linkage.h>
 #include <asm/inst.h>
+#include <asm/func.h>
 
 /*
  * The following macros are used to move an (un)aligned 16 byte value to/from
@@ -1800,6 +1801,7 @@ ENDPROC(_key_expansion_256b)
  *   unsigned int key_len)
  */
 ENTRY(aesni_set_key)
+   FP_SAVE
 #ifndef __x86_64__
pushl KEYP
movl 8(%esp), KEYP  # ctx
@@ -1905,6 +1907,7 @@ ENTRY(aesni_set_key)
 #ifndef __x86_64__
popl KEYP
 #endif
+   FP_RESTORE
ret
 ENDPROC(aesni_set_key)
 
@@ -1912,6 +1915,7 @@ ENDPROC(aesni_set_key)
  * void aesni_enc(struct crypto_aes_ctx *ctx, u8 *dst, const u8 *src)
  */
 ENTRY(aesni_enc)
+   FP_SAVE
 #ifndef __x86_64__
pushl KEYP
pushl KLEN
@@ -1927,6 +1931,7 @@ ENTRY(aesni_enc)
popl KLEN
popl KEYP
 #endif
+   FP_RESTORE
ret
 ENDPROC(aesni_enc)
 
@@ -2101,6 +2106,7 @@ ENDPROC(_aesni_enc4)
  * void aesni_dec (struct crypto_aes_ctx *ctx, u8 *dst, const u8 *src)
  */
 ENTRY(aesni_dec)
+   FP_SAVE
 #ifndef __x86_64__
pushl KEYP
pushl KLEN
@@ -2117,6 +2123,7 @@ ENTRY(aesni_dec)
popl KLEN
popl KEYP
 #endif
+   FP_RESTORE
ret
 ENDPROC(aesni_dec)
 
@@ -2292,6 +2299,7 @@ ENDPROC(_aesni_dec4)
  *   size_t len)
  */
 ENTRY(aesni_ecb_enc)
+   FP_SAVE
 #ifndef __x86_64__
pushl LEN
pushl KEYP
@@ -2342,6 +2350,7 @@ ENTRY(aesni_ecb_enc)
popl KEYP
popl LEN
 #endif
+   FP_RESTORE
ret
 ENDPROC(aesni_ecb_enc)
 
@@ -2350,6 +2359,7 @@ ENDPROC(aesni_ecb_enc)
  *   size_t len);
  */
 ENTRY(aesni_ecb_dec)
+   FP_SAVE
 #ifndef __x86_64__
pushl LEN
pushl KEYP
@@ -2401,6 +2411,7 @@ ENTRY(aesni_ecb_dec)
popl KEYP
popl LEN
 #endif
+   FP_RESTORE
ret
 ENDPROC(aesni_ecb_dec)
 
@@ -2409,6 +2420,7 @@ ENDPROC(aesni_ecb_dec)
  *   size_t len, u8 *iv)
  */
 ENTRY(aesni_cbc_enc)
+   FP_SAVE
 #ifndef __x86_64__
pushl IVP
pushl LEN
@@ -2443,6 +2455,7 @@ ENTRY(aesni_cbc_enc)
popl LEN
popl IVP
 #endif
+   FP_RESTORE
ret
 ENDPROC(aesni_cbc_enc)
 
@@ -2451,6 +2464,7 @@ ENDPROC(aesni_cbc_enc)
  *   size_t len, u8 *iv)
  */
 ENTRY(aesni_cbc_dec)
+   FP_SAVE
 #ifndef __x86_64__
pushl IVP
pushl LEN
@@ -2534,6 +2548,7 @@ ENTRY(aesni_cbc_dec)
popl LEN
popl IVP
 #endif
+   FP_RESTORE
ret
 ENDPROC(aesni_cbc_dec)
 
@@ -2598,6 +2613,7 @@ ENDPROC(_aesni_inc)
  *   size_t len, u8 *iv)
  */
 ENTRY(aesni_ctr_enc)
+   FP_SAVE
cmp $16, LEN
jb .Lctr_enc_just_ret
mov 480(KEYP), KLEN
@@ -2651,6 +2667,7 @@ ENTRY(aesni_ctr_enc)
 .Lctr_enc_ret:
movups IV, (IVP)
 .Lctr_enc_just_ret:
+   FP_RESTORE
ret
 ENDPROC(aesni_ctr_enc)
 
@@ -2677,6 +2694,7 @@ ENDPROC(aesni_ctr_enc)
  *  bool enc, u8 *iv)
  */
 ENTRY(aesni_xts_crypt8)
+   FP_SAVE
cmpb $0, %cl
movl $0, %ecx
movl $240, %r10d
@@ -2777,6 +2795,7 @@ ENTRY(aesni_xts_crypt8)
pxor INC, STATE4
movdqu STATE4, 0x70(OUTP)
 
+   FP_RESTORE
ret
 ENDPROC(aesni_xts_crypt8)
 
-- 
2.1.0



[PATCH v5 05/10] x86/asm/crypto: Fix asmvalidate warnings for ghash-clmulni-intel_asm.S

2015-06-10 Thread Josh Poimboeuf
Fix the following asmvalidate warnings:

   asmvalidate: arch/x86/crypto/ghash-clmulni-intel_asm.o: clmul_ghash_mul(): missing FP_SAVE/RESTORE macros
   asmvalidate: arch/x86/crypto/ghash-clmulni-intel_asm.o: clmul_ghash_update(): missing FP_SAVE/RESTORE macros

These are non-leaf callable functions, so save/restore the frame pointer
with FP_SAVE/RESTORE.

Signed-off-by: Josh Poimboeuf <jpoim...@redhat.com>
Cc: linux-crypto@vger.kernel.org
Cc: Herbert Xu <herb...@gondor.apana.org.au>
Cc: David S. Miller <da...@davemloft.net>
---
 arch/x86/crypto/ghash-clmulni-intel_asm.S | 5 +
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/crypto/ghash-clmulni-intel_asm.S 
b/arch/x86/crypto/ghash-clmulni-intel_asm.S
index 5d1e007..e5b76b1 100644
--- a/arch/x86/crypto/ghash-clmulni-intel_asm.S
+++ b/arch/x86/crypto/ghash-clmulni-intel_asm.S
@@ -18,6 +18,7 @@
 
 #include <linux/linkage.h>
 #include <asm/inst.h>
+#include <asm/func.h>
 
 .data
 
@@ -94,6 +95,7 @@ ENDPROC(__clmul_gf128mul_ble)
 
 /* void clmul_ghash_mul(char *dst, const u128 *shash) */
 ENTRY(clmul_ghash_mul)
+   FP_SAVE
movups (%rdi), DATA
movups (%rsi), SHASH
movaps .Lbswap_mask, BSWAP
@@ -101,6 +103,7 @@ ENTRY(clmul_ghash_mul)
call __clmul_gf128mul_ble
PSHUFB_XMM BSWAP DATA
movups DATA, (%rdi)
+   FP_RESTORE
ret
 ENDPROC(clmul_ghash_mul)
 
@@ -109,6 +112,7 @@ ENDPROC(clmul_ghash_mul)
  *const u128 *shash);
  */
 ENTRY(clmul_ghash_update)
+   FP_SAVE
cmp $16, %rdx
jb .Lupdate_just_ret# check length
movaps .Lbswap_mask, BSWAP
@@ -128,5 +132,6 @@ ENTRY(clmul_ghash_update)
PSHUFB_XMM BSWAP DATA
movups DATA, (%rdi)
 .Lupdate_just_ret:
+   FP_RESTORE
ret
 ENDPROC(clmul_ghash_update)
-- 
2.1.0
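
The FP_SAVE/FP_RESTORE macros used by this patch and the aesni one above come
from asm/func.h, which is not included in this archive. The expansion below is
an assumption based on how the commit messages describe them (save and restore
the frame pointer), not the actual header:

/* Sketch only: assumed bodies for the asmvalidate frame-pointer helpers. */
.macro FP_SAVE
	push	%rbp			# preserve the caller's frame pointer
	mov	%rsp, %rbp		# establish this function's frame
.endm

.macro FP_RESTORE
	pop	%rbp			# restore the caller's frame pointer
.endm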
