Re: [PATCH v2 1/1] crypto: AES CTR x86_64 "by8" AVX optimization

2014-06-04 Thread chandramouli narayanan
On Wed, 2014-06-04 at 08:53 +0200, Mathias Krause wrote:
> On Tue, Jun 03, 2014 at 05:41:14PM -0700, chandramouli narayanan wrote:
> > This patch introduces "by8" AES CTR mode AVX optimization inspired by
> > Intel Optimized IPSEC Cryptographic library. For additional information,
> > please see:
> > http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=22972
> > 
> > The functions aes_ctr_enc_128_avx_by8(), aes_ctr_enc_192_avx_by8() and
> > aes_ctr_enc_256_avx_by8() are adapted from
> > Intel Optimized IPSEC Cryptographic library. When both AES and AVX features
> > are enabled in a platform, the glue code in the AESNI module overrides the
> > existing "by4" CTR mode en/decryption with the "by8"
> > AES CTR mode en/decryption.
> > 
> > On a Haswell desktop, with turbo disabled and all cpus running
> > at maximum frequency, the "by8" CTR mode optimization
> > shows better performance results across data & key sizes
> > as measured by tcrypt.
> > 
> > The average performance improvement of the "by8" version over the "by4"
> > version is as follows:
> > 
> > For 128 bit key and data sizes >= 256 bytes, there is a 10-16% improvement.
> > For 192 bit key and data sizes >= 256 bytes, there is a 20-22% improvement.
> > For 256 bit key and data sizes >= 256 bytes, there is a 20-25% improvement.
> 
> Nice improvement :)
> 
> How does it perform on older processors that do have a penalty for
> unaligned loads (vmovdqu), e.g. SandyBridge? If those perform worse it
> might be wise to extend the CPU feature test in the glue code by a model
> test to enable the "by8" variant only for Haswell and newer processors
> that don't have such a penalty.

Good point. I will check it out and add the needed test to enable the
optimization on processors where it shines. 
> 
> > 
> > A typical run of tcrypt with AES CTR mode encryption of the "by4" and "by8"
> > optimization shows the following results:
> > 
> > tcrypt with "by4" AES CTR mode encryption optimization on a Haswell Desktop:
> > ---
> > 
> > testing speed of __ctr-aes-aesni encryption
> > test 0 (128 bit key, 16 byte blocks): 1 operation in 343 cycles (16 bytes)
> > test 1 (128 bit key, 64 byte blocks): 1 operation in 336 cycles (64 bytes)
> > test 2 (128 bit key, 256 byte blocks): 1 operation in 491 cycles (256 bytes)
> > test 3 (128 bit key, 1024 byte blocks): 1 operation in 1130 cycles (1024 bytes)
> > test 4 (128 bit key, 8192 byte blocks): 1 operation in 7309 cycles (8192 bytes)
> > test 5 (192 bit key, 16 byte blocks): 1 operation in 346 cycles (16 bytes)
> > test 6 (192 bit key, 64 byte blocks): 1 operation in 361 cycles (64 bytes)
> > test 7 (192 bit key, 256 byte blocks): 1 operation in 543 cycles (256 bytes)
> > test 8 (192 bit key, 1024 byte blocks): 1 operation in 1321 cycles (1024 bytes)
> > test 9 (192 bit key, 8192 byte blocks): 1 operation in 9649 cycles (8192 bytes)
> > test 10 (256 bit key, 16 byte blocks): 1 operation in 369 cycles (16 bytes)
> > test 11 (256 bit key, 64 byte blocks): 1 operation in 366 cycles (64 bytes)
> > test 12 (256 bit key, 256 byte blocks): 1 operation in 595 cycles (256 bytes)
> > test 13 (256 bit key, 1024 byte blocks): 1 operation in 1531 cycles (1024 bytes)
> > test 14 (256 bit key, 8192 byte blocks): 1 operation in 10522 cycles (8192 bytes)
> > 
> > testing speed of __ctr-aes-aesni decryption
> > test 0 (128 bit key, 16 byte blocks): 1 operation in 336 cycles (16 bytes)
> > test 1 (128 bit key, 64 byte blocks): 1 operation in 350 cycles (64 bytes)
> > test 2 (128 bit key, 256 byte blocks): 1 operation in 487 cycles (256 bytes)
> > test 3 (128 bit key, 1024 byte blocks): 1 operation in 1129 cycles (1024 bytes)
> > test 4 (128 bit key, 8192 byte blocks): 1 operation in 7287 cycles (8192 bytes)
> > test 5 (192 bit key, 16 byte blocks): 1 operation in 350 cycles (16 bytes)
> > test 6 (192 bit key, 64 byte blocks): 1 operation in 359 cycles (64 bytes)
> > test 7 (192 bit key, 256 byte blocks): 1 operation in 635 cycles (256 bytes)
> > test 8 (192 bit key, 1024 byte blocks): 1 operation in 1324 cycles (1024 bytes)
> > test 9 (192 bit key, 8192 byte blocks): 1 operation in 9595 cycles (8192 bytes)
> > test 10 (256 bit key, 16 byte blocks): 1 operation in 364 cycles (16 bytes)
> > test 11 (256 bit key, 64 byte blocks): 1 operation in 377 cycles (64 bytes)
> > test 12 (256 bit key, 256 byte blocks): 1 operation in 604 cycles (256 bytes)
> > test 13 (256 bit key, 1024 byte blocks): 1 operation in 1527 cycles (1024 bytes)
> > test 14 (256 bit key, 8192 byte blocks): 1 operation in 10549 cycles (8192 bytes)
> > 
> > tcrypt with "by8" AES CTR mode encryption optimization on a Haswell Desktop:
> > ---
> > 
> > testing speed of __ctr-aes-aesni encryption
> > test 0 (128 bit key, 16 byte blocks): 1 operation in 340 cycles (16 bytes)

Re: [PATCH v2 1/1] crypto: AES CTR x86_64 "by8" AVX optimization

2014-06-03 Thread Mathias Krause
On Tue, Jun 03, 2014 at 05:41:14PM -0700, chandramouli narayanan wrote:
> This patch introduces "by8" AES CTR mode AVX optimization inspired by
> Intel Optimized IPSEC Cryptographic library. For additional information,
> please see:
> http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=22972
> 
> The functions aes_ctr_enc_128_avx_by8(), aes_ctr_enc_192_avx_by8() and
> aes_ctr_enc_256_avx_by8() are adapted from
> Intel Optimized IPSEC Cryptographic library. When both AES and AVX features
> are enabled in a platform, the glue code in the AESNI module overrides the
> existing "by4" CTR mode en/decryption with the "by8"
> AES CTR mode en/decryption.
> 
> On a Haswell desktop, with turbo disabled and all cpus running
> at maximum frequency, the "by8" CTR mode optimization
> shows better performance results across data & key sizes
> as measured by tcrypt.
> 
> The average performance improvement of the "by8" version over the "by4"
> version is as follows:
> 
> For 128 bit key and data sizes >= 256 bytes, there is a 10-16% improvement.
> For 192 bit key and data sizes >= 256 bytes, there is a 20-22% improvement.
> For 256 bit key and data sizes >= 256 bytes, there is a 20-25% improvement.

Nice improvement :)

How does it perform on older processors that do have a penalty for
unaligned loads (vmovdqu), e.g. SandyBridge? If those perform worse it
might be wise to extend the CPU feature test in the glue code by a model
test to enable the "by8" variant only for Haswell and newer processors
that don't have such a penalty.

> 
> A typical run of tcrypt with AES CTR mode encryption of the "by4" and "by8"
> optimization shows the following results:
> 
> tcrypt with "by4" AES CTR mode encryption optimization on a Haswell Desktop:
> ---
> 
> testing speed of __ctr-aes-aesni encryption
> test 0 (128 bit key, 16 byte blocks): 1 operation in 343 cycles (16 bytes)
> test 1 (128 bit key, 64 byte blocks): 1 operation in 336 cycles (64 bytes)
> test 2 (128 bit key, 256 byte blocks): 1 operation in 491 cycles (256 bytes)
> test 3 (128 bit key, 1024 byte blocks): 1 operation in 1130 cycles (1024 bytes)
> test 4 (128 bit key, 8192 byte blocks): 1 operation in 7309 cycles (8192 bytes)
> test 5 (192 bit key, 16 byte blocks): 1 operation in 346 cycles (16 bytes)
> test 6 (192 bit key, 64 byte blocks): 1 operation in 361 cycles (64 bytes)
> test 7 (192 bit key, 256 byte blocks): 1 operation in 543 cycles (256 bytes)
> test 8 (192 bit key, 1024 byte blocks): 1 operation in 1321 cycles (1024 bytes)
> test 9 (192 bit key, 8192 byte blocks): 1 operation in 9649 cycles (8192 bytes)
> test 10 (256 bit key, 16 byte blocks): 1 operation in 369 cycles (16 bytes)
> test 11 (256 bit key, 64 byte blocks): 1 operation in 366 cycles (64 bytes)
> test 12 (256 bit key, 256 byte blocks): 1 operation in 595 cycles (256 bytes)
> test 13 (256 bit key, 1024 byte blocks): 1 operation in 1531 cycles (1024 bytes)
> test 14 (256 bit key, 8192 byte blocks): 1 operation in 10522 cycles (8192 bytes)
> 
> testing speed of __ctr-aes-aesni decryption
> test 0 (128 bit key, 16 byte blocks): 1 operation in 336 cycles (16 bytes)
> test 1 (128 bit key, 64 byte blocks): 1 operation in 350 cycles (64 bytes)
> test 2 (128 bit key, 256 byte blocks): 1 operation in 487 cycles (256 bytes)
> test 3 (128 bit key, 1024 byte blocks): 1 operation in 1129 cycles (1024 bytes)
> test 4 (128 bit key, 8192 byte blocks): 1 operation in 7287 cycles (8192 bytes)
> test 5 (192 bit key, 16 byte blocks): 1 operation in 350 cycles (16 bytes)
> test 6 (192 bit key, 64 byte blocks): 1 operation in 359 cycles (64 bytes)
> test 7 (192 bit key, 256 byte blocks): 1 operation in 635 cycles (256 bytes)
> test 8 (192 bit key, 1024 byte blocks): 1 operation in 1324 cycles (1024 bytes)
> test 9 (192 bit key, 8192 byte blocks): 1 operation in 9595 cycles (8192 bytes)
> test 10 (256 bit key, 16 byte blocks): 1 operation in 364 cycles (16 bytes)
> test 11 (256 bit key, 64 byte blocks): 1 operation in 377 cycles (64 bytes)
> test 12 (256 bit key, 256 byte blocks): 1 operation in 604 cycles (256 bytes)
> test 13 (256 bit key, 1024 byte blocks): 1 operation in 1527 cycles (1024 bytes)
> test 14 (256 bit key, 8192 byte blocks): 1 operation in 10549 cycles (8192 bytes)
> 
> tcrypt with "by8" AES CTR mode encryption optimization on a Haswell Desktop:
> ---
> 
> testing speed of __ctr-aes-aesni encryption
> test 0 (128 bit key, 16 byte blocks): 1 operation in 340 cycles (16 bytes)
> test 1 (128 bit key, 64 byte blocks): 1 operation in 330 cycles (64 bytes)
> test 2 (128 bit key, 256 byte blocks): 1 operation in 450 cycles (256 bytes)
> test 3 (128 bit key, 1024 byte blocks): 1 operation in 1043 cycles (1024 bytes)
> test 4 (128 bit key, 8192 byte blocks): 1 operation in 6597 cycles (8192 bytes)
> test 5 (192 bit key, 16 byte blocks): 1 operation in 339 cycles (16 bytes)

[PATCH v2 1/1] crypto: AES CTR x86_64 "by8" AVX optimization

2014-06-03 Thread chandramouli narayanan
This patch introduces "by8" AES CTR mode AVX optimization inspired by
Intel Optimized IPSEC Cryptographic library. For additional information,
please see:
http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=22972

The functions aes_ctr_enc_128_avx_by8(), aes_ctr_enc_192_avx_by8() and
aes_ctr_enc_256_avx_by8() are adapted from
Intel Optimized IPSEC Cryptographic library. When both AES and AVX features
are enabled in a platform, the glue code in the AESNI module overrides the
existing "by4" CTR mode en/decryption with the "by8"
AES CTR mode en/decryption.

On a Haswell desktop, with turbo disabled and all cpus running
at maximum frequency, the "by8" CTR mode optimization
shows better performance results across data & key sizes
as measured by tcrypt.

The average performance improvement of the "by8" version over the "by4"
version is as follows:

For 128 bit key and data sizes >= 256 bytes, there is a 10-16% improvement.
For 192 bit key and data sizes >= 256 bytes, there is a 20-22% improvement.
For 256 bit key and data sizes >= 256 bytes, there is a 20-25% improvement.

A typical run of tcrypt with AES CTR mode encryption of the "by4" and "by8"
optimization shows the following results:

tcrypt with "by4" AES CTR mode encryption optimization on a Haswell Desktop:
---

testing speed of __ctr-aes-aesni encryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 343 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 336 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 491 cycles (256 bytes)
test 3 (128 bit key, 1024 byte blocks): 1 operation in 1130 cycles (1024 bytes)
test 4 (128 bit key, 8192 byte blocks): 1 operation in 7309 cycles (8192 bytes)
test 5 (192 bit key, 16 byte blocks): 1 operation in 346 cycles (16 bytes)
test 6 (192 bit key, 64 byte blocks): 1 operation in 361 cycles (64 bytes)
test 7 (192 bit key, 256 byte blocks): 1 operation in 543 cycles (256 bytes)
test 8 (192 bit key, 1024 byte blocks): 1 operation in 1321 cycles (1024 bytes)
test 9 (192 bit key, 8192 byte blocks): 1 operation in 9649 cycles (8192 bytes)
test 10 (256 bit key, 16 byte blocks): 1 operation in 369 cycles (16 bytes)
test 11 (256 bit key, 64 byte blocks): 1 operation in 366 cycles (64 bytes)
test 12 (256 bit key, 256 byte blocks): 1 operation in 595 cycles (256 bytes)
test 13 (256 bit key, 1024 byte blocks): 1 operation in 1531 cycles (1024 bytes)
test 14 (256 bit key, 8192 byte blocks): 1 operation in 10522 cycles (8192 bytes)

testing speed of __ctr-aes-aesni decryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 336 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 350 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 487 cycles (256 bytes)
test 3 (128 bit key, 1024 byte blocks): 1 operation in 1129 cycles (1024 bytes)
test 4 (128 bit key, 8192 byte blocks): 1 operation in 7287 cycles (8192 bytes)
test 5 (192 bit key, 16 byte blocks): 1 operation in 350 cycles (16 bytes)
test 6 (192 bit key, 64 byte blocks): 1 operation in 359 cycles (64 bytes)
test 7 (192 bit key, 256 byte blocks): 1 operation in 635 cycles (256 bytes)
test 8 (192 bit key, 1024 byte blocks): 1 operation in 1324 cycles (1024 bytes)
test 9 (192 bit key, 8192 byte blocks): 1 operation in 9595 cycles (8192 bytes)
test 10 (256 bit key, 16 byte blocks): 1 operation in 364 cycles (16 bytes)
test 11 (256 bit key, 64 byte blocks): 1 operation in 377 cycles (64 bytes)
test 12 (256 bit key, 256 byte blocks): 1 operation in 604 cycles (256 bytes)
test 13 (256 bit key, 1024 byte blocks): 1 operation in 1527 cycles (1024 bytes)
test 14 (256 bit key, 8192 byte blocks): 1 operation in 10549 cycles (8192 bytes)

tcrypt with "by8" AES CTR mode encryption optimization on a Haswell Desktop:
---

testing speed of __ctr-aes-aesni encryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 340 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 330 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 450 cycles (256 bytes)
test 3 (128 bit key, 1024 byte blocks): 1 operation in 1043 cycles (1024 bytes)
test 4 (128 bit key, 8192 byte blocks): 1 operation in 6597 cycles (8192 bytes)
test 5 (192 bit key, 16 byte blocks): 1 operation in 339 cycles (16 bytes)
test 6 (192 bit key, 64 byte blocks): 1 operation in 352 cycles (64 bytes)
test 7 (192 bit key, 256 byte blocks): 1 operation in 539 cycles (256 bytes)
test 8 (192 bit key, 1024 byte blocks): 1 operation in 1153 cycles (1024 bytes)
test 9 (192 bit key, 8192 byte blocks): 1 operation in 8458 cycles (8192 bytes)
test 10 (256 bit key, 16 byte blocks): 1 operation in 353 cycles (16 bytes)
test 11 (256 bit key, 64 byte blocks): 1 operation in 360 cycles (64 bytes)
test 12 (256 bit key, 256 byte blocks): 1 operation in 512 cycles (256 bytes)
test 13