Re: [PATCH] crypto: x86/twofish-3way - Fix %rbp usage

2017-12-27 Thread Herbert Xu
On Mon, Dec 18, 2017 at 04:40:26PM -0800, Eric Biggers wrote:
> From: Eric Biggers 
> 
> Using %rbp as a temporary register breaks frame pointer convention and
> breaks stack traces when unwinding from an interrupt in the crypto code.
> 
> In twofish-3way, we can't simply replace %rbp with another register
> because there are none available.  Instead, we use the stack to hold the
> values that %rbp, %r11, and %r12 were holding previously.  Each of these
> values represents the half of the output from the previous Feistel round
> that is being passed on unchanged to the following round.  They are only
> used once per round, when they are exchanged with %rax, %rbx, and %rcx.
> 
> As a result, we free up 3 registers (one per block) and can reassign
> them so that %rbp is not used, and additionally %r14 and %r15 are not
> used so they do not need to be saved/restored.
> 
> There may be a small overhead caused by replacing 'xchg REG, REG' with
> the needed sequence 'mov MEM, REG; mov REG, MEM; mov REG, REG' once per
> round.  But, counterintuitively, when I tested "ctr-twofish-3way" on a
> Haswell processor, the new version was actually about 2% faster.
> (Perhaps 'xchg' is not as well optimized as plain moves.)
> 
> Reported-by: syzbot 
> Signed-off-by: Eric Biggers 

Patch applied.  Thanks.
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH] crypto: x86/twofish-3way - Fix %rbp usage

2017-12-19 Thread Josh Poimboeuf
On Mon, Dec 18, 2017 at 04:40:26PM -0800, Eric Biggers wrote:
> From: Eric Biggers 
> 
> Using %rbp as a temporary register breaks frame pointer convention and
> breaks stack traces when unwinding from an interrupt in the crypto code.
> 
> In twofish-3way, we can't simply replace %rbp with another register
> because there are none available.  Instead, we use the stack to hold the
> values that %rbp, %r11, and %r12 were holding previously.  Each of these
> values represents the half of the output from the previous Feistel round
> that is being passed on unchanged to the following round.  They are only
> used once per round, when they are exchanged with %rax, %rbx, and %rcx.
> 
> As a result, we free up 3 registers (one per block) and can reassign
> them so that %rbp is not used, and additionally %r14 and %r15 are not
> used so they do not need to be saved/restored.
> 
> There may be a small overhead caused by replacing 'xchg REG, REG' with
> the needed sequence 'mov MEM, REG; mov REG, MEM; mov REG, REG' once per
> round.  But, counterintuitively, when I tested "ctr-twofish-3way" on a
> Haswell processor, the new version was actually about 2% faster.
> (Perhaps 'xchg' is not as well optimized as plain moves.)
> 
> Reported-by: syzbot 
> Signed-off-by: Eric Biggers 

Thanks a lot for fixing this!

Reviewed-by: Josh Poimboeuf 

-- 
Josh


Re: [PATCH] crypto: x86/twofish-3way - Fix %rbp usage

2017-12-19 Thread Ingo Molnar

* Ingo Molnar  wrote:

> 
> * Eric Biggers  wrote:
> 
> > There may be a small overhead caused by replacing 'xchg REG, REG' with
> > the needed sequence 'mov MEM, REG; mov REG, MEM; mov REG, REG' once per
> > round.  But, counterintuitively, when I tested "ctr-twofish-3way" on a
> > Haswell processor, the new version was actually about 2% faster.
> > (Perhaps 'xchg' is not as well optimized as plain moves.)
> 
> XCHG has implicit LOCK semantics on all x86 CPUs, so that's not a surprising 
> result I think.

Correction: I think XCHG only implies LOCK if there's a memory operand involved 
- 
register-register XCHG should not imply any barriers.

So the result is indeed unintuitive.

Thanks,

Ingo


RE: [PATCH] crypto: x86/twofish-3way - Fix %rbp usage

2017-12-19 Thread David Laight
From: Juergen Gross
> Sent: 19 December 2017 08:05
..
> 
> Exchanging 2 registers can be done without memory access via:
> 
> xor reg1, reg2
> xor reg2, reg1
> xor reg1, reg2

That'll generate horrid data dependencies.
ISTR that there are some optimisations for the stack,
so even 'push reg1', 'mov reg2,reg1', 'pop reg2' might
be faster than the above.

David



Re: [PATCH] crypto: x86/twofish-3way - Fix %rbp usage

2017-12-19 Thread Juergen Gross
On 19/12/17 08:54, Ingo Molnar wrote:
> 
> * Eric Biggers  wrote:
> 
>> There may be a small overhead caused by replacing 'xchg REG, REG' with
>> the needed sequence 'mov MEM, REG; mov REG, MEM; mov REG, REG' once per
>> round.  But, counterintuitively, when I tested "ctr-twofish-3way" on a
>> Haswell processor, the new version was actually about 2% faster.
>> (Perhaps 'xchg' is not as well optimized as plain moves.)
> 
> XCHG has implicit LOCK semantics on all x86 CPUs, so that's not a surprising 
> result I think.

Exchanging 2 registers can be done without memory access via:

xor reg1, reg2
xor reg2, reg1
xor reg1, reg2


Juergen


Re: [PATCH] crypto: x86/twofish-3way - Fix %rbp usage

2017-12-18 Thread Ingo Molnar

* Eric Biggers  wrote:

> There may be a small overhead caused by replacing 'xchg REG, REG' with
> the needed sequence 'mov MEM, REG; mov REG, MEM; mov REG, REG' once per
> round.  But, counterintuitively, when I tested "ctr-twofish-3way" on a
> Haswell processor, the new version was actually about 2% faster.
> (Perhaps 'xchg' is not as well optimized as plain moves.)

XCHG has implicit LOCK semantics on all x86 CPUs, so that's not a surprising 
result I think.

Thanks,

Ingo