Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation
On Wed, Aug 15, 2012 at 08:34:25PM +0300, Jussi Kivilinna wrote:
> About ~5% slower, probably because I was tuning for sandy-bridge and introduced more FPU<=>CPU register moves.
>
> Here's new version of patch, with FPU<=>CPU moves from original implementation.
>
> (Note: also changes encryption function to inline all code in to main function, decryption still places common code to separate function to reduce object size. This is to measure the difference.)

Yep, looks better than the previous run and also a bit better or on par with the initial run I did.

The thing is, I'm not sure whether optimizing the thing for each uarch is a workable solution software-wise or maybe having a single version which performs sufficiently ok on all uarches is easier/better to maintain without causing code bloat. Hmmm...

4th: ran like 1st.

[ 1014.074150]
[ 1014.074150] testing speed of async ecb(twofish) encryption
[ 1014.083829] test 0 (128 bit key, 16 byte blocks): 4870055 operations in 1 seconds (77920880 bytes)
[ 1015.092757] test 1 (128 bit key, 64 byte blocks): 2043828 operations in 1 seconds (130804992 bytes)
[ 1016.099441] test 2 (128 bit key, 256 byte blocks): 606400 operations in 1 seconds (155238400 bytes)
[ 1017.105939] test 3 (128 bit key, 1024 byte blocks): 168939 operations in 1 seconds (172993536 bytes)
[ 1018.112517] test 4 (128 bit key, 8192 byte blocks): 21777 operations in 1 seconds (178397184 bytes)
[ 1019.119035] test 5 (192 bit key, 16 byte blocks): 4882254 operations in 1 seconds (78116064 bytes)
[ 1020.125716] test 6 (192 bit key, 64 byte blocks): 2043230 operations in 1 seconds (130766720 bytes)
[ 1021.132391] test 7 (192 bit key, 256 byte blocks): 607477 operations in 1 seconds (155514112 bytes)
[ 1022.138889] test 8 (192 bit key, 1024 byte blocks): 168743 operations in 1 seconds (172792832 bytes)
[ 1023.145476] test 9 (192 bit key, 8192 byte blocks): 21442 operations in 1 seconds (175652864 bytes)
[ 1024.152012] test 10 (256 bit key, 16 byte blocks): 4891863 operations in 1 seconds (78269808 bytes)
[ 1025.158684] test 11 (256 bit key, 64 byte blocks): 2049390 operations in 1 seconds (131160960 bytes)
[ 1026.165366] test 12 (256 bit key, 256 byte blocks): 606847 operations in 1 seconds (155352832 bytes)
[ 1027.171841] test 13 (256 bit key, 1024 byte blocks): 169228 operations in 1 seconds (173289472 bytes)
[ 1028.178436] test 14 (256 bit key, 8192 byte blocks): 21773 operations in 1 seconds (178364416 bytes)
[ 1029.184981]
[ 1029.184981] testing speed of async ecb(twofish) decryption
[ 1029.194508] test 0 (128 bit key, 16 byte blocks): 4931065 operations in 1 seconds (78897040 bytes)
[ 1030.199640] test 1 (128 bit key, 64 byte blocks): 2056931 operations in 1 seconds (131643584 bytes)
[ 1031.206303] test 2 (128 bit key, 256 byte blocks): 589409 operations in 1 seconds (150888704 bytes)
[ 1032.212832] test 3 (128 bit key, 1024 byte blocks): 163681 operations in 1 seconds (167609344 bytes)
[ 1033.219443] test 4 (128 bit key, 8192 byte blocks): 21062 operations in 1 seconds (172539904 bytes)
[ 1034.225979] test 5 (192 bit key, 16 byte blocks): 4931537 operations in 1 seconds (78904592 bytes)
[ 1035.232608] test 6 (192 bit key, 64 byte blocks): 2053989 operations in 1 seconds (131455296 bytes)
[ 1036.239289] test 7 (192 bit key, 256 byte blocks): 589591 operations in 1 seconds (150935296 bytes)
[ 1037.241784] test 8 (192 bit key, 1024 byte blocks): 163565 operations in 1 seconds (167490560 bytes)
[ 1038.244387] test 9 (192 bit key, 8192 byte blocks): 20899 operations in 1 seconds (171204608 bytes)
[ 1039.250923] test 10 (256 bit key, 16 byte blocks): 4937343 operations in 1 seconds (78997488 bytes)
[ 1040.257589] test 11 (256 bit key, 64 byte blocks): 2050678 operations in 1 seconds (131243392 bytes)
[ 1041.264262] test 12 (256 bit key, 256 byte blocks): 586869 operations in 1 seconds (150238464 bytes)
[ 1042.270753] test 13 (256 bit key, 1024 byte blocks): 163548 operations in 1 seconds (167473152 bytes)
[ 1043.277365] test 14 (256 bit key, 8192 byte blocks): 21053 operations in 1 seconds (172466176 bytes)
[ 1044.283892]
[ 1044.283892] testing speed of async cbc(twofish) encryption
[ 1044.293349] test 0 (128 bit key, 16 byte blocks): 5186240 operations in 1 seconds (82979840 bytes)
[ 1045.298534] test 1 (128 bit key, 64 byte blocks): 1921034 operations in 1 seconds (122946176 bytes)
[ 1046.305207] test 2 (128 bit key, 256 byte blocks): 542787 operations in 1 seconds (138953472 bytes)
[ 1047.311699] test 3 (128 bit key, 1024 byte blocks): 141399 operations in 1 seconds (144792576 bytes)
[ 1048.318312] test 4 (128 bit key, 8192 byte blocks): 17755 operations in 1 seconds (145448960 bytes)
[ 1049.324829] test 5 (192 bit key, 16 byte blocks): 5196441 operations in 1 seconds (83143056 bytes)
[ 1050.331485] test 6 (192 bit key, 64 byte blocks): 1921456 operations in 1 seconds (122973184 bytes)
[ 1051.338157] test 7 (192 bit key, 256 byte blocks): 543581
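
For anyone comparing runs by hand: each result line reports operations per interval and the total bytes, and the byte count is just operations times block size. A minimal sketch of that arithmetic, assuming nothing beyond the log format shown above (the helper names `bytes_per_second` and `percent_change` are made up for illustration and are not part of tcrypt):

```c
#include <assert.h>
#include <stdint.h>

/* Bytes processed per second for one result line of the speed test:
 * operations * block size, divided by the measurement interval. */
uint64_t bytes_per_second(uint64_t operations, uint64_t block_bytes,
                          uint64_t seconds)
{
    assert(seconds > 0);
    return operations * block_bytes / seconds;
}

/* Relative difference of `value` versus `baseline`, in percent.
 * Negative means `value` is slower than `baseline`. */
double percent_change(uint64_t baseline, uint64_t value)
{
    assert(baseline > 0);
    return ((double)value - (double)baseline) * 100.0 / (double)baseline;
}
```

With test 4 of the ecb encryption run (21777 operations on 8192-byte blocks) this reproduces the logged 178397184 bytes; the matching decryption line, 172539904 bytes, comes out roughly 3.3% lower.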
Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation
Quoting Borislav Petkov b...@alien8.de:

> On Wed, Aug 15, 2012 at 08:34:25PM +0300, Jussi Kivilinna wrote:
> > About ~5% slower, probably because I was tuning for sandy-bridge and introduced more FPU<=>CPU register moves.
> >
> > Here's new version of patch, with FPU<=>CPU moves from original implementation.
> >
> > (Note: also changes encryption function to inline all code in to main function, decryption still places common code to separate function to reduce object size. This is to measure the difference.)
>
> Yep, looks better than the previous run and also a bit better or on par with the initial run I did.

Thanks again. The speed gain with this patch is ~8%, which is enough for twofish-avx to pass twofish-3way.

> The thing is, I'm not sure whether optimizing the thing for each uarch is a workable solution software-wise or maybe having a single version which performs sufficiently ok on all uarches is easier/better to maintain without causing code bloat. Hmmm...

Agreed, testing on multiple CPUs to get a single well-working version is what I have done in the past. But purchasing all the latest CPUs on the market isn't an option for me, and for testing AVX I'm stuck with sandy-bridge :)

-Jussi

> 4th: ran like 1st.
Re: [PATCH 1/3] crypto: twofish-avx - tune assembler code for ~10% more performance
Please, ignore this patchset as it causes performance regression on Bulldozer. I'll make new patchset with this issue fixed.

-Jussi

Quoting Jussi Kivilinna jussi.kivili...@mbnet.fi:

> Patch replaces 'movb' instructions with 'movzbl' to break false register dependencies and interleaves instructions better for out-of-order scheduling.
>
> Also move common round code to separate function to reduce object size.
>
> Tested on Core i5-2450M.
>
> Cc: Johannes Goetzfried johannes.goetzfr...@informatik.stud.uni-erlangen.de
> Signed-off-by: Jussi Kivilinna jussi.kivili...@mbnet.fi
> ---
>  arch/x86/crypto/twofish-avx-x86_64-asm_64.S | 144 +--
>  1 file changed, 92 insertions(+), 52 deletions(-)
>
> diff --git a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
> index 35f4557..42b27b7 100644
> --- a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
> +++ b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
> @@ -47,15 +47,22 @@
>  #define RC2 %xmm6
>  #define RD2 %xmm7
>
> -#define RX %xmm8
> -#define RY %xmm9
> +#define RX0 %xmm8
> +#define RY0 %xmm9
>
> -#define RK1 %xmm10
> -#define RK2 %xmm11
> +#define RX1 %xmm10
> +#define RY1 %xmm11
> +
> +#define RK1 %xmm12
> +#define RK2 %xmm13
> +
> +#define RT %xmm14
>
>  #define RID1 %rax
> +#define RID1d %eax
>  #define RID1b %al
>  #define RID2 %rbx
> +#define RID2d %ebx
>  #define RID2b %bl
>
>  #define RGI1 %rdx
> @@ -73,40 +80,45 @@
>  #define RGS3d %r10d
>
>
> -#define lookup_32bit(t0, t1, t2, t3, src, dst) \
> -	movb src ## bl, RID1b; \
> -	movb src ## bh, RID2b; \
> +#define lookup_32bit(t0, t1, t2, t3, src, dst, interleave_op, il_reg) \
> +	movzbl src ## bl, RID1d; \
> +	movzbl src ## bh, RID2d; \
> +	shrq $16, src; \
>  	movl t0(CTX, RID1, 4), dst ## d; \
>  	xorl t1(CTX, RID2, 4), dst ## d; \
> -	shrq $16, src; \
> -	movb src ## bl, RID1b; \
> -	movb src ## bh, RID2b; \
> +	movzbl src ## bl, RID1d; \
> +	movzbl src ## bh, RID2d; \
> +	interleave_op(il_reg); \
>  	xorl t2(CTX, RID1, 4), dst ## d; \
>  	xorl t3(CTX, RID2, 4), dst ## d;
>
> +#define dummy(d) /* do nothing */
> +
> +#define shr_next(reg) \
> +	shrq $16, reg;
> +
>  #define G(a, x, t0, t1, t2, t3) \
>  	vmovq a, RGI1; \
> -	vpsrldq $8, a, x; \
> -	vmovq x, RGI2; \
> +	vpextrq $1, a, RGI2; \
>  	\
> -	lookup_32bit(t0, t1, t2, t3, RGI1, RGS1); \
> -	shrq $16, RGI1; \
> -	lookup_32bit(t0, t1, t2, t3, RGI1, RGS2); \
> -	shlq $32, RGS2; \
> -	orq RGS1, RGS2; \
> +	lookup_32bit(t0, t1, t2, t3, RGI1, RGS1, shr_next, RGI1); \
> +	vmovd RGS1d, x; \
> +	lookup_32bit(t0, t1, t2, t3, RGI1, RGS2, dummy, none); \
> +	vpinsrd $1, RGS2d, x, x; \
>  	\
> -	lookup_32bit(t0, t1, t2, t3, RGI2, RGS1); \
> -	shrq $16, RGI2; \
> -	lookup_32bit(t0, t1, t2, t3, RGI2, RGS3); \
> -	shlq $32, RGS3; \
> -	orq RGS1, RGS3; \
> -	\
> -	vmovq RGS2, x; \
> -	vpinsrq $1, RGS3, x, x;
> +	lookup_32bit(t0, t1, t2, t3, RGI2, RGS1, shr_next, RGI2); \
> +	vpinsrd $2, RGS1d, x, x; \
> +	lookup_32bit(t0, t1, t2, t3, RGI2, RGS3, dummy, none); \
> +	vpinsrd $3, RGS3d, x, x;
> +
> +#define encround_g1g2(a, b, c, d, x, y) \
> +	G(a, x, s0, s1, s2, s3); \
> +	G(b, y, s1, s2, s3, s0);
>
> -#define encround(a, b, c, d, x, y) \
> -	G(a, x, s0, s1, s2, s3); \
> -	G(b, y, s1, s2, s3, s0); \
> +#define encround_end(a, b, c, d, x, y) \
> +	vpslld $1, d, RT; \
> +	vpsrld $(32 - 1), d, d; \
> +	vpor d, RT, d; \
>  	vpaddd x, y, x; \
>  	vpaddd y, x, y; \
>  	vpaddd x, RK1, x; \
> @@ -115,14 +127,16 @@
>  	vpsrld $1, c, x; \
>  	vpslld $(32 - 1), c, c; \
>  	vpor c, x, c; \
> -	vpslld $1, d, x; \
> -	vpsrld $(32 - 1), d, d; \
> -	vpor d, x, d; \
>  	vpxor d, y, d;
>
> -#define decround(a, b, c, d, x, y) \
> -	G(a, x, s0, s1, s2, s3); \
> -	G(b, y, s1, s2, s3, s0); \
> +#define decround_g1g2(a, b, c, d, x, y) \
> +	G(a, x, s0, s1, s2, s3); \
> +	G(b, y, s1, s2, s3, s0);
> +
> +#define decround_end(a, b, c, d, x, y) \
> +	vpslld $1, c, RT;
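
For readers who do not read AT&T-syntax macros fluently, here is a rough C model of what one lookup_32bit expansion and the vpslld/vpsrld/vpor triple compute. It is only a sketch: the four table pointers stand in for the t0..t3 offsets into CTX, the function always performs the trailing 16-bit shift (the macro does so only when invoked with shr_next), and it says nothing about the scheduling and false-dependency effects the patch is actually about.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by one bit. The patch spells this per lane as
 * vpslld $1 / vpsrld $(32 - 1) / vpor. */
uint32_t rol32_by1(uint32_t v)
{
    return (v << 1) | (v >> (32 - 1));
}

/*
 * Model of one lookup_32bit step: consume the low four bytes of *src,
 * index one 256-entry table per byte, and XOR the four results together.
 * On return *src has been shifted right by 32 bits in total, so the next
 * call sees the following four bytes.
 */
uint32_t lookup_32bit(const uint32_t *t0, const uint32_t *t1,
                      const uint32_t *t2, const uint32_t *t3,
                      uint64_t *src)
{
    uint32_t i1 = (uint8_t)(*src);        /* movzbl src##bl, RID1d */
    uint32_t i2 = (uint8_t)(*src >> 8);   /* movzbl src##bh, RID2d */
    uint32_t dst;

    *src >>= 16;                          /* shrq $16, src */
    dst = t0[i1] ^ t1[i2];                /* movl t0(...), xorl t1(...) */

    i1 = (uint8_t)(*src);
    i2 = (uint8_t)(*src >> 8);
    *src >>= 16;                          /* interleave_op == shr_next */
    dst ^= t2[i1];                        /* xorl t2(...) */
    dst ^= t3[i2];                        /* xorl t3(...) */
    return dst;
}
```

Zero-extending each index into a full 32-bit variable is the C analogue of movzbl writing the whole register, where movb leaves the upper bits of the old value in place.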
Re: [PATCH 1/3] crypto: twofish-avx - tune assembler code for ~10% more performance
On Thu, Aug 16, 2012 at 05:30:49PM +0300, Jussi Kivilinna wrote:
> Please, ignore this patchset as it causes performance regression on Bulldozer. I'll make new patchset with this issue fixed.

OK.
--
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: on stack dynamic allocations
On 08/16/2012 02:20 PM, Kasatkin, Dmitry wrote:
> Hello,
>
> Some places in the code uses variable-size allocation on stack..
> For example from hmac_setkey():
>
>	struct {
>		struct shash_desc shash;
>		char ctx[crypto_shash_descsize(hash)];
>	} desc;
>
> sparse complains
>
>	CHECK crypto/hmac.c
>	crypto/hmac.c:57:47: error: bad constant expression
>
> I like it instead of kmalloc..
>
> But what is position of kernel community about it?

If you know that the range of crypto_shash_descsize(hash) is bounded, just use the upper bound.

If the range of crypto_shash_descsize(hash) is unbounded, then the stack will overflow and ... BOOM!

David Daney
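
The "just use the upper bound" suggestion could look roughly like the following userspace sketch. Everything named here is hypothetical: MAX_DESCSIZE is an assumed bound picked for illustration (the thread does not say what the real bound is), and struct hash_desc / bounded_desc_init() are stand-ins for the kernel types, not the crypto API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed compile-time upper bound on the per-algorithm state size.
 * Hypothetical value, chosen only for this sketch. */
#define MAX_DESCSIZE 512

/* Stand-in for the algorithm handle; descsize plays the role of the
 * value crypto_shash_descsize(hash) would return. */
struct hash_desc {
    size_t descsize;
};

/* Fixed-size replacement for the variable-length on-stack struct:
 * sizeof() is a constant expression, so static checkers can reason
 * about the stack usage. */
struct bounded_desc {
    struct hash_desc *hash;
    char ctx[MAX_DESCSIZE];
};

/* Returns 0 on success, -1 if the algorithm needs more state than the
 * bound allows (instead of silently overrunning the stack). */
int bounded_desc_init(struct bounded_desc *desc, struct hash_desc *hash)
{
    if (hash->descsize > sizeof(desc->ctx))
        return -1;
    desc->hash = hash;
    memset(desc->ctx, 0, hash->descsize);
    return 0;
}
```

The cost is that every caller pays for the worst case on the stack, so this only makes sense when the bound is small; otherwise a heap allocation is the usual alternative.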