Re: [PATCH 4/4] ARM: add support for bit sliced AES using NEON instructions

2013-09-23 Thread Ard Biesheuvel
On 22 September 2013 13:12, Jussi Kivilinna jussi.kivili...@iki.fi wrote:

[...]

 Decryption can probably be made faster by implementing InvMixColumns slightly
 differently. Instead of implementing the inverse MixColumns matrix directly,
 use a preprocessing step followed by MixColumns, as described in section 4.1.3
 "Decryption" of The Design of Rijndael: AES - The Advanced Encryption Standard
 (J. Daemen, V. Rijmen, 2002).

 In short, the MixColumns and InvMixColumns matrices have the following relation:
  | 0e 0b 0d 09 |   | 02 03 01 01 |   | 05 00 04 00 |
  | 09 0e 0b 0d | = | 01 02 03 01 | x | 00 05 00 04 |
  | 0d 09 0e 0b |   | 01 01 02 03 |   | 04 00 05 00 |
  | 0b 0d 09 0e |   | 03 01 01 02 |   | 00 04 00 05 |

 A bit-sliced implementation of the 05-00-04-00 matrix is much shorter than that
 of the 0e-0b-0d-09 matrix, so even when combined with MixColumns, the total
 instruction count for InvMixColumns implemented this way should be nearly half
 of the current one.


That is a very useful tip, thank you. I will have a go at it and
follow up later.

Regards,
Ard.


Re: [PATCH 4/4] ARM: add support for bit sliced AES using NEON instructions

2013-09-22 Thread Jussi Kivilinna
On 20.09.2013 21:46, Ard Biesheuvel wrote:
 This implementation of the AES algorithm gives around 45% speedup on
 Cortex-A15 for CTR mode and for XTS in encryption mode. Both CBC and XTS in
 decryption mode are slightly faster (5 - 10% on Cortex-A15). [As CBC in
 encryption mode can only be performed sequentially, there is no speedup in
 this case.]
 
 Unlike the core AES cipher (on which this module also depends), this algorithm
 uses bit slicing to process up to 8 blocks in parallel in constant time. This
 algorithm does not rely on any lookup tables so it is believed to be
 invulnerable to cache timing attacks.
 
 The core code has been adopted from the OpenSSL project (in collaboration
 with the original author, on cc). For ease of maintenance, this version is
 identical to the upstream OpenSSL code, i.e., all modifications that were
 required to make it suitable for inclusion into the kernel have already been
 merged upstream.
 
 Cc: Andy Polyakov ap...@openssl.org
 Signed-off-by: Ard Biesheuvel ard.biesheu...@linaro.org
 ---
[..snip..]
 + bcc .Ldec_done
 + @ multiplication by 0x0e
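For readers unfamiliar with the technique described in the commit message above,
here is a rough, standalone C sketch of the bit-slicing idea: bytes taken from
several blocks are transposed into per-bit planes, after which a fixed sequence
of XORs (here, GF(2^8) multiplication by 0x02) operates on all blocks at once,
in constant time and without any table lookups. This is only an illustration,
not the patch's NEON code; the helper names (slice, unslice, xtime_sliced) are
invented for this sketch.

/*
 * Toy bit-slicing demo in plain C: eight bytes, one per "block", are
 * transposed into eight one-bit-per-block planes, and multiplication
 * by x in GF(2^8) is then a handful of XORs on whole planes.
 */
#include <stdint.h>
#include <stdio.h>

/* plane[i] bit j == bit i of byte[j] */
static void slice(const uint8_t byte[8], uint8_t plane[8])
{
	for (int i = 0; i < 8; i++) {
		plane[i] = 0;
		for (int j = 0; j < 8; j++)
			plane[i] |= ((byte[j] >> i) & 1) << j;
	}
}

static void unslice(const uint8_t plane[8], uint8_t byte[8])
{
	for (int j = 0; j < 8; j++) {
		byte[j] = 0;
		for (int i = 0; i < 8; i++)
			byte[j] |= ((plane[i] >> j) & 1) << i;
	}
}

/* multiply all eight blocks by 0x02 in GF(2^8) at once (in != out) */
static void xtime_sliced(const uint8_t in[8], uint8_t out[8])
{
	out[0] = in[7];
	out[1] = in[0] ^ in[7];
	out[2] = in[1];
	out[3] = in[2] ^ in[7];
	out[4] = in[3] ^ in[7];
	out[5] = in[4];
	out[6] = in[5];
	out[7] = in[6];
}

static uint8_t xtime(uint8_t x)		/* scalar reference */
{
	return (x << 1) ^ ((x & 0x80) ? 0x1b : 0x00);
}

int main(void)
{
	uint8_t blocks[8] = { 0x00, 0x01, 0x53, 0x80, 0xca, 0xfe, 0xaa, 0xff };
	uint8_t planes[8], mult[8], result[8];

	slice(blocks, planes);
	xtime_sliced(planes, mult);
	unslice(mult, result);

	for (int j = 0; j < 8; j++)
		printf("%02x -> %02x (scalar %02x)\n",
		       blocks[j], result[j], xtime(blocks[j]));
	return 0;
}

The actual NEON code applies the same idea to eight 128-bit blocks at a time
and, roughly speaking, evaluates the whole S-box as a boolean circuit on such
planes instead of using lookup tables.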

Decryption can probably be made faster by implementing InvMixColumns slightly
differently. Instead of implementing the inverse MixColumns matrix directly,
use a preprocessing step followed by MixColumns, as described in section 4.1.3
"Decryption" of The Design of Rijndael: AES - The Advanced Encryption Standard
(J. Daemen, V. Rijmen, 2002).

In short, the MixColumns and InvMixColumns matrices have the following relation:
 | 0e 0b 0d 09 |   | 02 03 01 01 |   | 05 00 04 00 |
 | 09 0e 0b 0d | = | 01 02 03 01 | x | 00 05 00 04 |
 | 0d 09 0e 0b |   | 01 01 02 03 |   | 04 00 05 00 |
 | 0b 0d 09 0e |   | 03 01 01 02 |   | 00 04 00 05 |

A bit-sliced implementation of the 05-00-04-00 matrix is much shorter than that
of the 0e-0b-0d-09 matrix, so even when combined with MixColumns, the total
instruction count for InvMixColumns implemented this way should be nearly half
of the current one.
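
For reference, the identity can be checked with a few lines of ordinary C. This
is just a standalone sanity check with ad hoc names (gmul, mat_col, ...), not
kernel or NEON code; it also shows that the 05-00-04-00 preprocessing reduces
to one XOR and two xtime operations per byte:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint8_t xtime(uint8_t x)			/* multiply by 0x02 in GF(2^8) */
{
	return (x << 1) ^ ((x & 0x80) ? 0x1b : 0x00);
}

static uint8_t gmul(uint8_t a, uint8_t b)	/* generic GF(2^8) multiply */
{
	uint8_t r = 0;

	while (b) {
		if (b & 1)
			r ^= a;
		a = xtime(a);
		b >>= 1;
	}
	return r;
}

/* multiply one AES column by a 4x4 matrix over GF(2^8) */
static void mat_col(const uint8_t m[4][4], const uint8_t in[4], uint8_t out[4])
{
	for (int i = 0; i < 4; i++)
		out[i] = gmul(m[i][0], in[0]) ^ gmul(m[i][1], in[1]) ^
			 gmul(m[i][2], in[2]) ^ gmul(m[i][3], in[3]);
}

int main(void)
{
	static const uint8_t mix[4][4] = {
		{ 0x02, 0x03, 0x01, 0x01 }, { 0x01, 0x02, 0x03, 0x01 },
		{ 0x01, 0x01, 0x02, 0x03 }, { 0x03, 0x01, 0x01, 0x02 },
	};
	static const uint8_t inv_mix[4][4] = {
		{ 0x0e, 0x0b, 0x0d, 0x09 }, { 0x09, 0x0e, 0x0b, 0x0d },
		{ 0x0d, 0x09, 0x0e, 0x0b }, { 0x0b, 0x0d, 0x09, 0x0e },
	};

	for (int n = 0; n < 100000; n++) {
		uint8_t a[4], pre[4], res1[4], res2[4];

		for (int i = 0; i < 4; i++)
			a[i] = rand() & 0xff;

		/* preprocessing with the 05-00-04-00 matrix:
		 * pre[i] = 05*a[i] ^ 04*a[i+2] = a[i] ^ 04*(a[i] ^ a[i+2]) */
		for (int i = 0; i < 4; i++)
			pre[i] = a[i] ^ xtime(xtime(a[i] ^ a[(i + 2) % 4]));

		mat_col(mix, pre, res1);	/* MixColumns(P(a)) */
		mat_col(inv_mix, a, res2);	/* InvMixColumns(a) */

		for (int i = 0; i < 4; i++)
			if (res1[i] != res2[i]) {
				printf("mismatch\n");
				return 1;
			}
	}
	printf("identity holds\n");
	return 0;
}

The per-byte form pre[i] = a[i] ^ 04*(a[i] ^ a[(i+2) % 4]) is what makes the
preprocessing so cheap compared to multiplying by 0e/0b/0d/09 directly.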

Check [1] for an implementation of this approach using the AVX instruction set.

-Jussi

[1] https://github.com/jkivilin/supercop-blockciphers/blob/beyond_master/crypto_stream/aes128ctr/avx/aes_asm_bitslice_avx.S#L234
