Re: [PATCH v2 1/1] crypto: AES CTR x86_64 "by8" AVX optimization
On Wed, 2014-06-04 at 08:53 +0200, Mathias Krause wrote:
> On Tue, Jun 03, 2014 at 05:41:14PM -0700, chandramouli narayanan wrote:
> > This patch introduces "by8" AES CTR mode AVX optimization inspired by
> > the Intel Optimized IPSEC Cryptographic library. For additional
> > information, please see:
> > http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=22972
> >
> > The functions aes_ctr_enc_128_avx_by8(), aes_ctr_enc_192_avx_by8() and
> > aes_ctr_enc_256_avx_by8() are adapted from the Intel Optimized IPSEC
> > Cryptographic library. When both AES and AVX features are enabled on a
> > platform, the glue code in the AESNI module overrides the existing
> > "by4" CTR mode en/decryption with the "by8" AES CTR mode en/decryption.
> >
> > On a Haswell desktop, with turbo disabled and all CPUs running
> > at maximum frequency, the "by8" CTR mode optimization
> > shows better performance results across data & key sizes
> > as measured by tcrypt.
> >
> > The average performance improvement of the "by8" version over the
> > "by4" version is as follows:
> >
> > For a 128 bit key and data sizes >= 256 bytes, there is a 10-16% improvement.
> > For a 192 bit key and data sizes >= 256 bytes, there is a 20-22% improvement.
> > For a 256 bit key and data sizes >= 256 bytes, there is a 20-25% improvement.
>
> Nice improvement :)
>
> How does it perform on older processors that do have a penalty for
> unaligned loads (vmovdqu), e.g. Sandy Bridge? If those perform worse, it
> might be wise to extend the CPU feature test in the glue code by a model
> test to enable the "by8" variant only for Haswell and newer processors
> that don't have such a penalty.

Good point. I will check it out and add the needed test to enable the
optimization on processors where it shines.
> >
> > A typical run of tcrypt with AES CTR mode encryption of the "by4" and
> > "by8" optimizations shows the following results:
> >
> > tcrypt with "by4" AES CTR mode encryption optimization on a Haswell Desktop:
> > ---
> >
> > testing speed of __ctr-aes-aesni encryption
> > test 0 (128 bit key, 16 byte blocks): 1 operation in 343 cycles (16 bytes)
> > test 1 (128 bit key, 64 byte blocks): 1 operation in 336 cycles (64 bytes)
> > test 2 (128 bit key, 256 byte blocks): 1 operation in 491 cycles (256 bytes)
> > test 3 (128 bit key, 1024 byte blocks): 1 operation in 1130 cycles (1024 bytes)
> > test 4 (128 bit key, 8192 byte blocks): 1 operation in 7309 cycles (8192 bytes)
> > test 5 (192 bit key, 16 byte blocks): 1 operation in 346 cycles (16 bytes)
> > test 6 (192 bit key, 64 byte blocks): 1 operation in 361 cycles (64 bytes)
> > test 7 (192 bit key, 256 byte blocks): 1 operation in 543 cycles (256 bytes)
> > test 8 (192 bit key, 1024 byte blocks): 1 operation in 1321 cycles (1024 bytes)
> > test 9 (192 bit key, 8192 byte blocks): 1 operation in 9649 cycles (8192 bytes)
> > test 10 (256 bit key, 16 byte blocks): 1 operation in 369 cycles (16 bytes)
> > test 11 (256 bit key, 64 byte blocks): 1 operation in 366 cycles (64 bytes)
> > test 12 (256 bit key, 256 byte blocks): 1 operation in 595 cycles (256 bytes)
> > test 13 (256 bit key, 1024 byte blocks): 1 operation in 1531 cycles (1024 bytes)
> > test 14 (256 bit key, 8192 byte blocks): 1 operation in 10522 cycles (8192 bytes)
> >
> > testing speed of __ctr-aes-aesni decryption
> > test 0 (128 bit key, 16 byte blocks): 1 operation in 336 cycles (16 bytes)
> > test 1 (128 bit key, 64 byte blocks): 1 operation in 350 cycles (64 bytes)
> > test 2 (128 bit key, 256 byte blocks): 1 operation in 487 cycles (256 bytes)
> > test 3 (128 bit key, 1024 byte blocks): 1 operation in 1129 cycles (1024 bytes)
> > test 4 (128 bit key, 8192 byte blocks): 1 operation in 7287 cycles (8192 bytes)
> > test 5 (192 bit key, 16 byte blocks): 1 operation in 350 cycles (16 bytes)
> > test 6 (192 bit key, 64 byte blocks): 1 operation in 359 cycles (64 bytes)
> > test 7 (192 bit key, 256 byte blocks): 1 operation in 635 cycles (256 bytes)
> > test 8 (192 bit key, 1024 byte blocks): 1 operation in 1324 cycles (1024 bytes)
> > test 9 (192 bit key, 8192 byte blocks): 1 operation in 9595 cycles (8192 bytes)
> > test 10 (256 bit key, 16 byte blocks): 1 operation in 364 cycles (16 bytes)
> > test 11 (256 bit key, 64 byte blocks): 1 operation in 377 cycles (64 bytes)
> > test 12 (256 bit key, 256 byte blocks): 1 operation in 604 cycles (256 bytes)
> > test 13 (256 bit key, 1024 byte blocks): 1 operation in 1527 cycles (1024 bytes)
> > test 14 (256 bit key, 8192 byte blocks): 1 operation in 10549 cycles (8192 bytes)
> >
> > tcrypt with "by8" AES CTR mode encryption optimization on a Haswell Desktop:
> > ---
> >
> > testing speed of __ctr-aes-aesni encryption
> > test 0 (128 bit key, 16 byte blocks): 1 operation in 340 cycles (
Re: [PATCH v2 1/1] crypto: AES CTR x86_64 "by8" AVX optimization
On Tue, Jun 03, 2014 at 05:41:14PM -0700, chandramouli narayanan wrote:
> This patch introduces "by8" AES CTR mode AVX optimization inspired by
> the Intel Optimized IPSEC Cryptographic library. For additional
> information, please see:
> http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=22972
>
> The functions aes_ctr_enc_128_avx_by8(), aes_ctr_enc_192_avx_by8() and
> aes_ctr_enc_256_avx_by8() are adapted from the Intel Optimized IPSEC
> Cryptographic library. When both AES and AVX features are enabled on a
> platform, the glue code in the AESNI module overrides the existing
> "by4" CTR mode en/decryption with the "by8" AES CTR mode en/decryption.
>
> On a Haswell desktop, with turbo disabled and all CPUs running
> at maximum frequency, the "by8" CTR mode optimization
> shows better performance results across data & key sizes
> as measured by tcrypt.
>
> The average performance improvement of the "by8" version over the
> "by4" version is as follows:
>
> For a 128 bit key and data sizes >= 256 bytes, there is a 10-16% improvement.
> For a 192 bit key and data sizes >= 256 bytes, there is a 20-22% improvement.
> For a 256 bit key and data sizes >= 256 bytes, there is a 20-25% improvement.

Nice improvement :)

How does it perform on older processors that do have a penalty for
unaligned loads (vmovdqu), e.g. Sandy Bridge? If those perform worse, it
might be wise to extend the CPU feature test in the glue code by a model
test to enable the "by8" variant only for Haswell and newer processors
that don't have such a penalty.
>
> A typical run of tcrypt with AES CTR mode encryption of the "by4" and
> "by8" optimizations shows the following results:
>
> tcrypt with "by4" AES CTR mode encryption optimization on a Haswell Desktop:
> ---
>
> testing speed of __ctr-aes-aesni encryption
> test 0 (128 bit key, 16 byte blocks): 1 operation in 343 cycles (16 bytes)
> test 1 (128 bit key, 64 byte blocks): 1 operation in 336 cycles (64 bytes)
> test 2 (128 bit key, 256 byte blocks): 1 operation in 491 cycles (256 bytes)
> test 3 (128 bit key, 1024 byte blocks): 1 operation in 1130 cycles (1024 bytes)
> test 4 (128 bit key, 8192 byte blocks): 1 operation in 7309 cycles (8192 bytes)
> test 5 (192 bit key, 16 byte blocks): 1 operation in 346 cycles (16 bytes)
> test 6 (192 bit key, 64 byte blocks): 1 operation in 361 cycles (64 bytes)
> test 7 (192 bit key, 256 byte blocks): 1 operation in 543 cycles (256 bytes)
> test 8 (192 bit key, 1024 byte blocks): 1 operation in 1321 cycles (1024 bytes)
> test 9 (192 bit key, 8192 byte blocks): 1 operation in 9649 cycles (8192 bytes)
> test 10 (256 bit key, 16 byte blocks): 1 operation in 369 cycles (16 bytes)
> test 11 (256 bit key, 64 byte blocks): 1 operation in 366 cycles (64 bytes)
> test 12 (256 bit key, 256 byte blocks): 1 operation in 595 cycles (256 bytes)
> test 13 (256 bit key, 1024 byte blocks): 1 operation in 1531 cycles (1024 bytes)
> test 14 (256 bit key, 8192 byte blocks): 1 operation in 10522 cycles (8192 bytes)
>
> testing speed of __ctr-aes-aesni decryption
> test 0 (128 bit key, 16 byte blocks): 1 operation in 336 cycles (16 bytes)
> test 1 (128 bit key, 64 byte blocks): 1 operation in 350 cycles (64 bytes)
> test 2 (128 bit key, 256 byte blocks): 1 operation in 487 cycles (256 bytes)
> test 3 (128 bit key, 1024 byte blocks): 1 operation in 1129 cycles (1024 bytes)
> test 4 (128 bit key, 8192 byte blocks): 1 operation in 7287 cycles (8192 bytes)
> test 5 (192 bit key, 16 byte blocks): 1 operation in 350 cycles (16 bytes)
> test 6 (192 bit key, 64 byte blocks): 1 operation in 359 cycles (64 bytes)
> test 7 (192 bit key, 256 byte blocks): 1 operation in 635 cycles (256 bytes)
> test 8 (192 bit key, 1024 byte blocks): 1 operation in 1324 cycles (1024 bytes)
> test 9 (192 bit key, 8192 byte blocks): 1 operation in 9595 cycles (8192 bytes)
> test 10 (256 bit key, 16 byte blocks): 1 operation in 364 cycles (16 bytes)
> test 11 (256 bit key, 64 byte blocks): 1 operation in 377 cycles (64 bytes)
> test 12 (256 bit key, 256 byte blocks): 1 operation in 604 cycles (256 bytes)
> test 13 (256 bit key, 1024 byte blocks): 1 operation in 1527 cycles (1024 bytes)
> test 14 (256 bit key, 8192 byte blocks): 1 operation in 10549 cycles (8192 bytes)
>
> tcrypt with "by8" AES CTR mode encryption optimization on a Haswell Desktop:
> ---
>
> testing speed of __ctr-aes-aesni encryption
> test 0 (128 bit key, 16 byte blocks): 1 operation in 340 cycles (16 bytes)
> test 1 (128 bit key, 64 byte blocks): 1 operation in 330 cycles (64 bytes)
> test 2 (128 bit key, 256 byte blocks): 1 operation in 450 cycles (256 bytes)
> test 3 (128 bit key, 1024 byte blocks): 1 operation in 1043 cycles (1024 bytes)
> test 4 (128 bit key, 8192 byte blocks): 1 operation in 6597 cycles (8192 bytes)
> test 5 (192 bit key, 16 byt
[PATCH v2 1/1] crypto: AES CTR x86_64 "by8" AVX optimization
This patch introduces "by8" AES CTR mode AVX optimization inspired by the
Intel Optimized IPSEC Cryptographic library. For additional information,
please see:
http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=22972

The functions aes_ctr_enc_128_avx_by8(), aes_ctr_enc_192_avx_by8() and
aes_ctr_enc_256_avx_by8() are adapted from the Intel Optimized IPSEC
Cryptographic library. When both AES and AVX features are enabled on a
platform, the glue code in the AESNI module overrides the existing "by4"
CTR mode en/decryption with the "by8" AES CTR mode en/decryption.

On a Haswell desktop, with turbo disabled and all CPUs running at maximum
frequency, the "by8" CTR mode optimization shows better performance
results across data & key sizes as measured by tcrypt.

The average performance improvement of the "by8" version over the "by4"
version is as follows:

For a 128 bit key and data sizes >= 256 bytes, there is a 10-16% improvement.
For a 192 bit key and data sizes >= 256 bytes, there is a 20-22% improvement.
For a 256 bit key and data sizes >= 256 bytes, there is a 20-25% improvement.
A typical run of tcrypt with AES CTR mode encryption of the "by4" and
"by8" optimizations shows the following results:

tcrypt with "by4" AES CTR mode encryption optimization on a Haswell Desktop:
---

testing speed of __ctr-aes-aesni encryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 343 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 336 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 491 cycles (256 bytes)
test 3 (128 bit key, 1024 byte blocks): 1 operation in 1130 cycles (1024 bytes)
test 4 (128 bit key, 8192 byte blocks): 1 operation in 7309 cycles (8192 bytes)
test 5 (192 bit key, 16 byte blocks): 1 operation in 346 cycles (16 bytes)
test 6 (192 bit key, 64 byte blocks): 1 operation in 361 cycles (64 bytes)
test 7 (192 bit key, 256 byte blocks): 1 operation in 543 cycles (256 bytes)
test 8 (192 bit key, 1024 byte blocks): 1 operation in 1321 cycles (1024 bytes)
test 9 (192 bit key, 8192 byte blocks): 1 operation in 9649 cycles (8192 bytes)
test 10 (256 bit key, 16 byte blocks): 1 operation in 369 cycles (16 bytes)
test 11 (256 bit key, 64 byte blocks): 1 operation in 366 cycles (64 bytes)
test 12 (256 bit key, 256 byte blocks): 1 operation in 595 cycles (256 bytes)
test 13 (256 bit key, 1024 byte blocks): 1 operation in 1531 cycles (1024 bytes)
test 14 (256 bit key, 8192 byte blocks): 1 operation in 10522 cycles (8192 bytes)

testing speed of __ctr-aes-aesni decryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 336 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 350 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 487 cycles (256 bytes)
test 3 (128 bit key, 1024 byte blocks): 1 operation in 1129 cycles (1024 bytes)
test 4 (128 bit key, 8192 byte blocks): 1 operation in 7287 cycles (8192 bytes)
test 5 (192 bit key, 16 byte blocks): 1 operation in 350 cycles (16 bytes)
test 6 (192 bit key, 64 byte blocks): 1 operation in 359 cycles (64 bytes)
test 7 (192 bit key, 256 byte blocks): 1 operation in 635 cycles (256 bytes)
test 8 (192 bit key, 1024 byte blocks): 1 operation in 1324 cycles (1024 bytes)
test 9 (192 bit key, 8192 byte blocks): 1 operation in 9595 cycles (8192 bytes)
test 10 (256 bit key, 16 byte blocks): 1 operation in 364 cycles (16 bytes)
test 11 (256 bit key, 64 byte blocks): 1 operation in 377 cycles (64 bytes)
test 12 (256 bit key, 256 byte blocks): 1 operation in 604 cycles (256 bytes)
test 13 (256 bit key, 1024 byte blocks): 1 operation in 1527 cycles (1024 bytes)
test 14 (256 bit key, 8192 byte blocks): 1 operation in 10549 cycles (8192 bytes)

tcrypt with "by8" AES CTR mode encryption optimization on a Haswell Desktop:
---

testing speed of __ctr-aes-aesni encryption
test 0 (128 bit key, 16 byte blocks): 1 operation in 340 cycles (16 bytes)
test 1 (128 bit key, 64 byte blocks): 1 operation in 330 cycles (64 bytes)
test 2 (128 bit key, 256 byte blocks): 1 operation in 450 cycles (256 bytes)
test 3 (128 bit key, 1024 byte blocks): 1 operation in 1043 cycles (1024 bytes)
test 4 (128 bit key, 8192 byte blocks): 1 operation in 6597 cycles (8192 bytes)
test 5 (192 bit key, 16 byte blocks): 1 operation in 339 cycles (16 bytes)
test 6 (192 bit key, 64 byte blocks): 1 operation in 352 cycles (64 bytes)
test 7 (192 bit key, 256 byte blocks): 1 operation in 539 cycles (256 bytes)
test 8 (192 bit key, 1024 byte blocks): 1 operation in 1153 cycles (1024 bytes)
test 9 (192 bit key, 8192 byte blocks): 1 operation in 8458 cycles (8192 bytes)
test 10 (256 bit key, 16 byte blocks): 1 operation in 353 cycles (16 bytes)
test 11 (256 bit key, 64 byte blocks): 1 operation in 360 cycles (64 bytes)
test 12 (256 bit key, 256 byte blocks): 1 operation in 512 cycles (256 bytes)
test 13