Re: [PATCH] SSE2 inner loop for bn_mul_add_words
On Fri, 20 Jun 2003, Ben Laurie wrote:

> dean gaudet wrote:
> > hi there, i tried sending this ages ago but i guess some spam
> > filters probably lost it... i see i have to be subscribed to post
> > stuff.
>
> Actually, I've been sitting on it waiting for some free time to take
> a look :-)

cool :)

sorry for the hack of a patch against a generated .s file... i didn't
have time to clean it up.

i've also got a SHA1 re-implementation which uses a mixture of SIMD and
ALU operations to achieve greater throughput... but so far this code is
exceptionally sensitive to the compiler/processor combo, so i'm probably
going to sit on it until it has wider applicability.

-dean

p.s. can someone point me at where the little/big-endian conversion is
done in openssl for SHA1? by my reading of FIPS 180-2 there has to be a
conversion step when run on little-endian boxes... so either i'm wrong
or i missed it in my perusal of the openssl code.

__
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       [EMAIL PROTECTED]
Automated List Manager                           [EMAIL PROTECTED]
[PATCH] SSE2 inner loop for bn_mul_add_words
hi there, i tried sending this ages ago but i guess some spam filters
probably lost it... i see i have to be subscribed to post stuff.

on a p4 2.4GHz, here's the before:

                  sign   verify    sign/s verify/s
rsa 1024 bits  0.0044s  0.0002s     225.5   4139.4

and after:

                  sign   verify    sign/s verify/s
rsa 1024 bits  0.0033s  0.0002s     306.7   6264.2

see hacked patch below.

-dean

Date: Sun, 23 Mar 2003 22:08:25 -0800 (PST)
From: dean gaudet [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: SSE2 inner loop for bn_mul_add_words

for kicks i decided to see if it really was possible to get RSA speedups
using the SSE2 PMULUDQ and PADDQ instructions ... and i'm seeing a 30%+
1024-bit sign/s improvement on the p4 systems i've measured on.

but i'm too lazy to try to understand the perl asm generation crud, and
don't want to figure out how/if you folks would prefer to switch between
routines optimized for specific platforms. so what i present here is a
total hack patch against a .s file. do with it as you please :)

note that i use %mm registers rather than %xmm registers because this
code is completely dominated by the carry propagation, which is a series
of 64-bit adds, each followed by a shift-right by 32 ... if you attempt
to do this with 128-bit registers you waste a lot of slots mucking about
with packing and shuffling. even so, this is SSE2-only code because the
PMULUDQ and PADDQ instructions don't exist in MMX/SSE. (which means the
only released processors it will run on are the p4 and
banias^Wpentium-m... it shows similar improvements on unreleased
processors i can't talk about :)

if you look closely i'm doing only 32-bit loads and stores ... the
implicit zero-extension on the 32-bit load beats any creative attempt to
do 64-bit loads and shuffle the halves around.

it's unlikely that this technique can speed up the simple add/sub
routines -- unless there are situations where multiple add/subs could be
done in parallel... in the MMX hardware you can effectively parallelize
non-dependent carry propagation -- something you can't do in the ALUs
due to the conflict on EFLAGS.CF.

this code probably still has slack which could be improved on... such as
moving the emms somewhere much higher in the call stack (it's required
before any fp code is run), and rearranging the loop so that it better
overlaps multiplication with the carry-chain propagation.

-dean

p.s. i'm not on the mailing list, so please CC me in any reply.

--- openssl-0.9.7a/crypto/bn/asm/bn86-elf.s	2003-03-23 21:29:16.0 -0800
+++ openssl-0.9.7a/crypto/bn/asm/bn86-elf.s.dg2	2003-03-23 21:18:05.0 -0800
@@ -26,94 +26,76 @@
 	movl	32(%esp), %ebp
 	pushl	%ecx
 	jz	.L000maw_finish
-.L001maw_loop:
-	movl	%ecx, (%esp)
-	movl	(%ebx), %eax
-	mull	%ebp
-	addl	%esi, %eax
-	movl	(%edi), %esi
-	adcl	$0, %edx
-	addl	%esi, %eax
-	adcl	$0, %edx
-	movl	%eax, (%edi)
-	movl	%edx, %esi
+	movd %ebp,%mm0
+	pxor %mm1,%mm1		/* mm1 = carry in */
-	movl	4(%ebx), %eax
-	mull	%ebp
-	addl	%esi, %eax
-	movl	4(%edi), %esi
-	adcl	$0, %edx
-	addl	%esi, %eax
-	adcl	$0, %edx
-	movl	%eax, 4(%edi)
-	movl	%edx, %esi
-
-	movl	8(%ebx), %eax
-	mull	%ebp
-	addl	%esi, %eax
-	movl	8(%edi), %esi
-	adcl	$0, %edx
-	addl	%esi, %eax
-	adcl	$0, %edx
-	movl	%eax, 8(%edi)
-	movl	%edx, %esi
-
-	movl	12(%ebx), %eax
-	mull	%ebp
-	addl	%esi, %eax
-	movl	12(%edi), %esi
-	adcl	$0, %edx
-	addl	%esi, %eax
-	adcl	$0, %edx
-	movl	%eax, 12(%edi)
-	movl	%edx, %esi
-
-	movl	16(%ebx), %eax
-	mull	%ebp
-	addl	%esi, %eax
-	movl	16(%edi), %esi
-	adcl	$0, %edx
-	addl	%esi, %eax
-	adcl	$0, %edx
-	movl	%eax, 16(%edi)
-	movl	%edx, %esi
-
-	movl	20(%ebx), %eax
-	mull	%ebp
-	addl	%esi, %eax
-	movl	20(%edi), %esi
-	adcl	$0, %edx
-	addl	%esi, %eax
-	adcl	$0, %edx
-	movl	%eax, 20(%edi)
-	movl	%edx, %esi
-
-	movl	24(%ebx), %eax
-	mull	%ebp
-	addl	%esi, %eax
-	movl	24(%edi), %esi
Re: [PATCH] SSE2 inner loop for bn_mul_add_words
dean gaudet wrote:
> hi there, i tried sending this ages ago but i guess some spam filters
> probably lost it... i see i have to be subscribed to post stuff.

Actually, I've been sitting on it waiting for some free time to take a
look :-)

Cheers,

Ben.

--
http://www.apache-ssl.org/ben.html       http://www.thebunker.net/

There is no limit to what a man can do or how far he can go if he
doesn't mind who gets the credit. - Robert Woodruff
SSE2 inner loop for bn_mul_add_words
for kicks i decided to see if it really was possible to get RSA speedups
using the SSE2 PMULUDQ and PADDQ instructions ... and i'm seeing a 30%+
1024-bit sign/s improvement on the p4 systems i've measured on.

but i'm too lazy to try to understand the perl asm generation crud, and
don't want to figure out how/if you folks would prefer to switch between
routines optimized for specific platforms. so what i present here is a
total hack patch against a .s file. do with it as you please :)

note that i use %mm registers rather than %xmm registers because this
code is completely dominated by the carry propagation, which is a series
of 64-bit adds, each followed by a shift-right by 32 ... if you attempt
to do this with 128-bit registers you waste a lot of slots mucking about
with packing and shuffling. even so, this is SSE2-only code because the
PMULUDQ and PADDQ instructions don't exist in MMX/SSE. (which means the
only released processors it will run on are the p4 and
banias^Wpentium-m... it shows similar improvements on unreleased
processors i can't talk about :)

if you look closely i'm doing only 32-bit loads and stores ... the
implicit zero-extension on the 32-bit load beats any creative attempt to
do 64-bit loads and shuffle the halves around.

it's unlikely that this technique can speed up the simple add/sub
routines -- unless there are situations where multiple add/subs could be
done in parallel... in the MMX hardware you can effectively parallelize
non-dependent carry propagation -- something you can't do in the ALUs
due to the conflict on EFLAGS.CF.

this code probably still has slack which could be improved on... such as
moving the emms somewhere much higher in the call stack (it's required
before any fp code is run), and rearranging the loop so that it better
overlaps multiplication with the carry-chain propagation.

-dean

p.s. i'm not on the mailing list, so please CC me in any reply.
--- openssl-0.9.7a/crypto/bn/asm/bn86-elf.s	2003-03-23 21:29:16.0 -0800
+++ openssl-0.9.7a/crypto/bn/asm/bn86-elf.s.dg2	2003-03-23 21:18:05.0 -0800
@@ -26,94 +26,76 @@
 	movl	32(%esp), %ebp
 	pushl	%ecx
 	jz	.L000maw_finish
-.L001maw_loop:
-	movl	%ecx, (%esp)
-	movl	(%ebx), %eax
-	mull	%ebp
-	addl	%esi, %eax
-	movl	(%edi), %esi
-	adcl	$0, %edx
-	addl	%esi, %eax
-	adcl	$0, %edx
-	movl	%eax, (%edi)
-	movl	%edx, %esi
+	movd %ebp,%mm0
+	pxor %mm1,%mm1		/* mm1 = carry in */
-	movl	4(%ebx), %eax
-	mull	%ebp
-	addl	%esi, %eax
-	movl	4(%edi), %esi
-	adcl	$0, %edx
-	addl	%esi, %eax
-	adcl	$0, %edx
-	movl	%eax, 4(%edi)
-	movl	%edx, %esi
-
-	movl	8(%ebx), %eax
-	mull	%ebp
-	addl	%esi, %eax
-	movl	8(%edi), %esi
-	adcl	$0, %edx
-	addl	%esi, %eax
-	adcl	$0, %edx
-	movl	%eax, 8(%edi)
-	movl	%edx, %esi
-
-	movl	12(%ebx), %eax
-	mull	%ebp
-	addl	%esi, %eax
-	movl	12(%edi), %esi
-	adcl	$0, %edx
-	addl	%esi, %eax
-	adcl	$0, %edx
-	movl	%eax, 12(%edi)
-	movl	%edx, %esi
-
-	movl	16(%ebx), %eax
-	mull	%ebp
-	addl	%esi, %eax
-	movl	16(%edi), %esi
-	adcl	$0, %edx
-	addl	%esi, %eax
-	adcl	$0, %edx
-	movl	%eax, 16(%edi)
-	movl	%edx, %esi
-
-	movl	20(%ebx), %eax
-	mull	%ebp
-	addl	%esi, %eax
-	movl	20(%edi), %esi
-	adcl	$0, %edx
-	addl	%esi, %eax
-	adcl	$0, %edx
-	movl	%eax, 20(%edi)
-	movl	%edx, %esi
-
-	movl	24(%ebx), %eax
-	mull	%ebp
-	addl	%esi, %eax
-	movl	24(%edi), %esi
-	adcl	$0, %edx
-	addl	%esi, %eax
-	adcl	$0, %edx
-	movl	%eax, 24(%edi)
-	movl	%edx, %esi
-
-	movl	28(%ebx), %eax
-	mull	%ebp
-	addl	%esi, %eax
-	movl	28(%edi), %esi
-	adcl	$0, %edx
-	addl	%esi, %eax
-	adcl	$0, %edx
-	movl	%eax, 28(%edi)
-	movl	%edx, %esi
+.L001maw_loop:
+	movd (%edi),%mm3	/* mm3 = C[0]