Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-16 Thread Borislav Petkov
On Wed, Aug 15, 2012 at 08:34:25PM +0300, Jussi Kivilinna wrote:
> About ~5% slower, probably because I was tuning for sandy-bridge and
> introduced more FPU<=>CPU register moves.
>
> Here's new version of patch, with FPU<=>CPU moves from original
> implementation.
>
> (Note: also changes encryption function to inline all code in to main
> function, decryption still places common code to separate function to
> reduce object size. This is to measure the difference.)

Yep, looks better than the previous run and also a bit better or on par
with the initial run I did.

The thing is, I'm not sure whether optimizing the thing for each uarch
is a workable solution software-wise or maybe having a single version
which performs sufficiently ok on all uarches is easier/better to
maintain without causing code bloat. Hmmm...

4th:

ran like 1st.

[ 1014.074150] 
[ 1014.074150] testing speed of async ecb(twofish) encryption
[ 1014.083829] test 0 (128 bit key, 16 byte blocks): 4870055 operations in 1 seconds (77920880 bytes)
[ 1015.092757] test 1 (128 bit key, 64 byte blocks): 2043828 operations in 1 seconds (130804992 bytes)
[ 1016.099441] test 2 (128 bit key, 256 byte blocks): 606400 operations in 1 seconds (155238400 bytes)
[ 1017.105939] test 3 (128 bit key, 1024 byte blocks): 168939 operations in 1 seconds (172993536 bytes)
[ 1018.112517] test 4 (128 bit key, 8192 byte blocks): 21777 operations in 1 seconds (178397184 bytes)
[ 1019.119035] test 5 (192 bit key, 16 byte blocks): 4882254 operations in 1 seconds (78116064 bytes)
[ 1020.125716] test 6 (192 bit key, 64 byte blocks): 2043230 operations in 1 seconds (130766720 bytes)
[ 1021.132391] test 7 (192 bit key, 256 byte blocks): 607477 operations in 1 seconds (155514112 bytes)
[ 1022.138889] test 8 (192 bit key, 1024 byte blocks): 168743 operations in 1 seconds (172792832 bytes)
[ 1023.145476] test 9 (192 bit key, 8192 byte blocks): 21442 operations in 1 seconds (175652864 bytes)
[ 1024.152012] test 10 (256 bit key, 16 byte blocks): 4891863 operations in 1 seconds (78269808 bytes)
[ 1025.158684] test 11 (256 bit key, 64 byte blocks): 2049390 operations in 1 seconds (131160960 bytes)
[ 1026.165366] test 12 (256 bit key, 256 byte blocks): 606847 operations in 1 seconds (155352832 bytes)
[ 1027.171841] test 13 (256 bit key, 1024 byte blocks): 169228 operations in 1 seconds (173289472 bytes)
[ 1028.178436] test 14 (256 bit key, 8192 byte blocks): 21773 operations in 1 seconds (178364416 bytes)
[ 1029.184981] 
[ 1029.184981] testing speed of async ecb(twofish) decryption
[ 1029.194508] test 0 (128 bit key, 16 byte blocks): 4931065 operations in 1 seconds (78897040 bytes)
[ 1030.199640] test 1 (128 bit key, 64 byte blocks): 2056931 operations in 1 seconds (131643584 bytes)
[ 1031.206303] test 2 (128 bit key, 256 byte blocks): 589409 operations in 1 seconds (150888704 bytes)
[ 1032.212832] test 3 (128 bit key, 1024 byte blocks): 163681 operations in 1 seconds (167609344 bytes)
[ 1033.219443] test 4 (128 bit key, 8192 byte blocks): 21062 operations in 1 seconds (172539904 bytes)
[ 1034.225979] test 5 (192 bit key, 16 byte blocks): 4931537 operations in 1 seconds (78904592 bytes)
[ 1035.232608] test 6 (192 bit key, 64 byte blocks): 2053989 operations in 1 seconds (131455296 bytes)
[ 1036.239289] test 7 (192 bit key, 256 byte blocks): 589591 operations in 1 seconds (150935296 bytes)
[ 1037.241784] test 8 (192 bit key, 1024 byte blocks): 163565 operations in 1 seconds (167490560 bytes)
[ 1038.244387] test 9 (192 bit key, 8192 byte blocks): 20899 operations in 1 seconds (171204608 bytes)
[ 1039.250923] test 10 (256 bit key, 16 byte blocks): 4937343 operations in 1 seconds (78997488 bytes)
[ 1040.257589] test 11 (256 bit key, 64 byte blocks): 2050678 operations in 1 seconds (131243392 bytes)
[ 1041.264262] test 12 (256 bit key, 256 byte blocks): 586869 operations in 1 seconds (150238464 bytes)
[ 1042.270753] test 13 (256 bit key, 1024 byte blocks): 163548 operations in 1 seconds (167473152 bytes)
[ 1043.277365] test 14 (256 bit key, 8192 byte blocks): 21053 operations in 1 seconds (172466176 bytes)
[ 1044.283892] 
[ 1044.283892] testing speed of async cbc(twofish) encryption
[ 1044.293349] test 0 (128 bit key, 16 byte blocks): 5186240 operations in 1 seconds (82979840 bytes)
[ 1045.298534] test 1 (128 bit key, 64 byte blocks): 1921034 operations in 1 seconds (122946176 bytes)
[ 1046.305207] test 2 (128 bit key, 256 byte blocks): 542787 operations in 1 seconds (138953472 bytes)
[ 1047.311699] test 3 (128 bit key, 1024 byte blocks): 141399 operations in 1 seconds (144792576 bytes)
[ 1048.318312] test 4 (128 bit key, 8192 byte blocks): 17755 operations in 1 seconds (145448960 bytes)
[ 1049.324829] test 5 (192 bit key, 16 byte blocks): 5196441 operations in 1 seconds (83143056 bytes)
[ 1050.331485] test 6 (192 bit key, 64 byte blocks): 1921456 operations in 1 seconds (122973184 bytes)
[ 1051.338157] test 7 (192 bit key, 256 byte blocks): 543581 
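
A note on reading these numbers: each line reports the operations completed in the one-second window and the bytes processed, where bytes = operations x block size (for test 0 of the ecb encryption run, 4870055 x 16 = 77920880). Below is a minimal C sketch of that arithmetic, plus a helper for comparing two runs. The function names are made up for this note and do not come from the kernel speed-test code or from the patch.

```c
#include <stdint.h>

/* Bytes processed in the measurement window: operations * block size. */
uint64_t demo_bytes_processed(uint64_t operations, uint64_t block_bytes)
{
    return operations * block_bytes;
}

/* Throughput in MiB/s over a window of the given length in seconds. */
double demo_mib_per_sec(uint64_t bytes, double seconds)
{
    return (double)bytes / (1024.0 * 1024.0) / seconds;
}

/* Relative change from 'before' to 'after', in percent. */
double demo_percent_change(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

For example, demo_bytes_processed(21777, 8192) reproduces the 178397184 bytes reported for test 4 of the ecb encryption run, which is roughly 170 MiB/s. Comparing two runs with demo_percent_change() on the byte counts of matching tests is one way to arrive at figures like the "~5% slower" quoted above.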

Re: [PATCH] crypto: twofish - add x86_64/avx assembler implementation

2012-08-16 Thread Jussi Kivilinna

Quoting Borislav Petkov b...@alien8.de:


> On Wed, Aug 15, 2012 at 08:34:25PM +0300, Jussi Kivilinna wrote:
>> About ~5% slower, probably because I was tuning for sandy-bridge and
>> introduced more FPU<=>CPU register moves.
>>
>> Here's new version of patch, with FPU<=>CPU moves from original
>> implementation.
>>
>> (Note: also changes encryption function to inline all code in to main
>> function, decryption still places common code to separate function to
>> reduce object size. This is to measure the difference.)
>
> Yep, looks better than the previous run and also a bit better or on par
> with the initial run I did.

Thanks again. The speed gained with the patch is ~8%, which is enough for
twofish-avx to pass twofish-3way.

> The thing is, I'm not sure whether optimizing the thing for each uarch
> is a workable solution software-wise or maybe having a single version
> which performs sufficiently ok on all uarches is easier/better to
> maintain without causing code bloat. Hmmm...

Agreed, testing on multiple CPUs to get a single well-working version is
what I have done in the past. But purchasing all the latest CPUs on
the market isn't an option for me, and for testing AVX I'm stuck with
sandy-bridge :)


-Jussi



Re: [PATCH 1/3] crypto: twofish-avx - tune assembler code for ~10% more performance

2012-08-16 Thread Jussi Kivilinna
Please ignore this patchset, as it causes a performance regression on
Bulldozer. I'll send a new patchset with this issue fixed.


-Jussi

Quoting Jussi Kivilinna jussi.kivili...@mbnet.fi:


The patch replaces 'movb' instructions with 'movzbl' to break false register
dependencies and interleaves instructions better for out-of-order scheduling.

It also moves the common round code to a separate function to reduce object size.

Tested on Core i5-2450M.

Cc: Johannes Goetzfried <johannes.goetzfr...@informatik.stud.uni-erlangen.de>
Signed-off-by: Jussi Kivilinna <jussi.kivili...@mbnet.fi>
---
 arch/x86/crypto/twofish-avx-x86_64-asm_64.S |  144 +--
 1 file changed, 92 insertions(+), 52 deletions(-)

diff --git a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
index 35f4557..42b27b7 100644
--- a/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
+++ b/arch/x86/crypto/twofish-avx-x86_64-asm_64.S
@@ -47,15 +47,22 @@
 #define RC2 %xmm6
 #define RD2 %xmm7

-#define RX %xmm8
-#define RY %xmm9
+#define RX0 %xmm8
+#define RY0 %xmm9

-#define RK1 %xmm10
-#define RK2 %xmm11
+#define RX1 %xmm10
+#define RY1 %xmm11
+
+#define RK1 %xmm12
+#define RK2 %xmm13
+
+#define RT %xmm14

 #define RID1  %rax
+#define RID1d %eax
 #define RID1b %al
 #define RID2  %rbx
+#define RID2d %ebx
 #define RID2b %bl

 #define RGI1   %rdx
@@ -73,40 +80,45 @@
 #define RGS3d %r10d


-#define lookup_32bit(t0, t1, t2, t3, src, dst) \
-   movb    src ## bl, RID1b; \
-   movb    src ## bh, RID2b; \
+#define lookup_32bit(t0, t1, t2, t3, src, dst, interleave_op, il_reg) \
+   movzbl  src ## bl, RID1d; \
+   movzbl  src ## bh, RID2d; \
+   shrq $16,   src; \
    movl    t0(CTX, RID1, 4), dst ## d;  \
    xorl    t1(CTX, RID2, 4), dst ## d;  \
-   shrq $16,   src; \
-   movb    src ## bl, RID1b; \
-   movb    src ## bh, RID2b; \
+   movzbl  src ## bl, RID1d; \
+   movzbl  src ## bh, RID2d; \
+   interleave_op(il_reg);   \
    xorl    t2(CTX, RID1, 4), dst ## d;  \
    xorl    t3(CTX, RID2, 4), dst ## d;

+#define dummy(d) /* do nothing */
+
+#define shr_next(reg) \
+   shrq $16,   reg;
+
 #define G(a, x, t0, t1, t2, t3) \
    vmovq   a, RGI1;   \
-   vpsrldq $8, a, x;  \
-   vmovq   x, RGI2;   \
+   vpextrq $1, a, RGI2;   \
    \
-   lookup_32bit(t0, t1, t2, t3, RGI1, RGS1); \
-   shrq $16,   RGI1; \
-   lookup_32bit(t0, t1, t2, t3, RGI1, RGS2); \
-   shlq $32,   RGS2; \
-   orq RGS1, RGS2;   \
+   lookup_32bit(t0, t1, t2, t3, RGI1, RGS1, shr_next, RGI1); \
+   vmovd   RGS1d, x;\
+   lookup_32bit(t0, t1, t2, t3, RGI1, RGS2, dummy, none); \
+   vpinsrd $1, RGS2d, x, x; \
\
-   lookup_32bit(t0, t1, t2, t3, RGI2, RGS1); \
-   shrq $16,   RGI2; \
-   lookup_32bit(t0, t1, t2, t3, RGI2, RGS3); \
-   shlq $32,   RGS3; \
-   orq RGS1, RGS3;   \
-   \
-   vmovq   RGS2, x;  \
-   vpinsrq $1, RGS3, x, x;
+   lookup_32bit(t0, t1, t2, t3, RGI2, RGS1, shr_next, RGI2); \
+   vpinsrd $2, RGS1d, x, x; \
+   lookup_32bit(t0, t1, t2, t3, RGI2, RGS3, dummy, none); \
+   vpinsrd $3, RGS3d, x, x;
+
+#define encround_g1g2(a, b, c, d, x, y) \
+   G(a, x, s0, s1, s2, s3); \
+   G(b, y, s1, s2, s3, s0);

-#define encround(a, b, c, d, x, y) \
-   G(a, x, s0, s1, s2, s3);   \
-   G(b, y, s1, s2, s3, s0);   \
+#define encround_end(a, b, c, d, x, y) \
+   vpslld $1,  d, RT; \
+   vpsrld $(32 - 1),   d, d;  \
+   vpor    d, RT,  d; \
    vpaddd  x, y,   x; \
    vpaddd  y, x,   y; \
    vpaddd  x, RK1, x; \
@@ -115,14 +127,16 @@
    vpsrld $1,  c, x;  \
    vpslld $(32 - 1),   c, c;  \
    vpor    c, x,   c; \
-   vpslld $1,  d, x;  \
-   vpsrld $(32 - 1),   d, d;  \
-   vpor    d, x,   d; \
    vpxor   d, y,   d;

-#define decround(a, b, c, d, x, y) \
-   G(a, x, s0, s1, s2, s3);   \
-   G(b, y, s1, s2, s3, s0);   \
+#define decround_g1g2(a, b, c, d, x, y) \
+   G(a, x, s0, s1, s2, s3); \
+   G(b, y, s1, s2, s3, s0);
+
+#define decround_end(a, b, c, d, x, y) \
+   vpslld $1,  c, RT;
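
For readers following the macros in the diff above, here is a scalar C sketch of what lookup_32bit and the shift/or triples appear to compute, as I read the assembler: lookup_32bit XORs together four 32-bit table entries, one indexed by each byte of the source word (low byte first), and the vpslld/vpsrld/vpor sequences are 32-bit rotates by one bit. The demo_* names and the plain-array table layout are assumptions made for illustration; the real code indexes tables inside the cipher context (CTX) and works on several blocks at once in xmm registers.

```c
#include <stdint.h>

/*
 * Scalar model of lookup_32bit: XOR of four table lookups, one per byte
 * of src. t0..t3 each point to a 256-entry table of 32-bit words
 * (hypothetical layout, standing in for the tables reached through CTX).
 */
uint32_t demo_lookup_32bit(const uint32_t *t0, const uint32_t *t1,
                           const uint32_t *t2, const uint32_t *t3,
                           uint32_t src)
{
    uint32_t dst;

    dst  = t0[src & 0xff];          /* movzbl src##bl ; movl t0(...) */
    dst ^= t1[(src >> 8) & 0xff];   /* movzbl src##bh ; xorl t1(...) */
    dst ^= t2[(src >> 16) & 0xff];  /* after shrq $16 ; xorl t2(...) */
    dst ^= t3[(src >> 24) & 0xff];  /* after shrq $16 ; xorl t3(...) */
    return dst;
}

/* Rotate left by one bit: models vpslld $1 / vpsrld $(32 - 1) / vpor. */
uint32_t demo_rol32_1(uint32_t v)
{
    return (v << 1) | (v >> 31);
}

/* Rotate right by one bit: models vpsrld $1 / vpslld $(32 - 1) / vpor. */
uint32_t demo_ror32_1(uint32_t v)
{
    return (v >> 1) | (v << 31);
}
```

In the patch, G(a, x, s0, s1, s2, s3) and G(b, y, s1, s2, s3, s0) pass the same four tables in rotated order, which in this model corresponds to calling demo_lookup_32bit() with the table pointers rotated by one position.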

Re: [PATCH 1/3] crypto: twofish-avx - tune assembler code for ~10% more performance

2012-08-16 Thread Herbert Xu
On Thu, Aug 16, 2012 at 05:30:49PM +0300, Jussi Kivilinna wrote:
> Please, ignore this patchset as it causes performance regression on
> Bulldozer. I'll make new patchset with this issue fixed.

OK.
-- 
Email: Herbert Xu herb...@gondor.apana.org.au
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line unsubscribe linux-crypto in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: on stack dynamic allocations

2012-08-16 Thread David Daney

On 08/16/2012 02:20 PM, Kasatkin, Dmitry wrote:

> Hello,
>
> Some places in the code use variable-size allocations on the stack.
> For example, from hmac_setkey():
>
>     struct {
>         struct shash_desc shash;
>         char ctx[crypto_shash_descsize(hash)];
>     } desc;
>
> sparse complains:
>
>     CHECK   crypto/hmac.c
>     crypto/hmac.c:57:47: error: bad constant expression
>
> I prefer it to kmalloc.
>
> But what is the kernel community's position on it?


If you know that the range of crypto_shash_descsize(hash) is bounded, 
just use the upper bound.


If the range of crypto_shash_descsize(hash) is unbounded, then the stack 
will overflow and ... BOOM!
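
The "use the upper bound" suggestion can be sketched in plain C as follows. This is a userspace illustration under stated assumptions, not kernel code: DEMO_MAX_DESCSIZE is a hypothetical bound (the thread does not name an actual value), and demo_desc / demo_desc_init() are invented names. The idea is a fixed-size array in place of the variable-length one, with an explicit check that rejects sizes above the bound.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical upper bound; the thread does not give a real value. */
#define DEMO_MAX_DESCSIZE 512

/* Fixed-size replacement for the variable-length on-stack descriptor. */
struct demo_desc {
    size_t used;                          /* bytes of ctx actually needed */
    unsigned char ctx[DEMO_MAX_DESCSIZE]; /* constant size, no VLA */
};

/*
 * Prepare a descriptor for a hash that needs 'descsize' bytes of state.
 * Returns 0 on success, -1 if the request exceeds the fixed bound.
 */
int demo_desc_init(struct demo_desc *d, size_t descsize)
{
    if (descsize > DEMO_MAX_DESCSIZE)
        return -1;
    memset(d->ctx, 0, sizeof(d->ctx));
    d->used = descsize;
    return 0;
}
```

The stack usage is then known at compile time, at the cost of always reserving the worst case; a caller that gets -1 back would have to fall back to some other allocation strategy.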


David Daney


