Re: [PATCH] SSE2 inner loop for bn_mul_add_words

2003-06-23 Thread dean gaudet
On Fri, 20 Jun 2003, Ben Laurie wrote:

> dean gaudet wrote:
>
>> hi there, i tried sending this ages ago but i guess some spam filters
>> probably lost it... i see i have to be subscribed to post stuff.
>
> Actually, I've been sitting on it waiting for some free time to take a
> look :-)

cool :)  sorry for the hack of a patch against a generated .s file... i
didn't have time to clean it up.

i've also got a SHA1 re-implementation which uses a mixture of SIMD and
ALU operations to achieve greater throughput... but so far this code is
exceptionally sensitive to the compiler/processor combo, i'm probably
going to sit on it until it has wider applicability.

-dean

p.s. can someone point me at where the little/big endian conversion is
done in openssl for SHA1?  by my reading of FIPS 180-2 there has to be a
conversion step when run on little endian boxes... so either i'm wrong or
i missed it in my perusal of the openssl code.
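(for reference, the conversion FIPS 180-2 implies on little-endian boxes amounts to loading each 32-bit message word big-endian before the round function sees it... a minimal C sketch, illustrative helper name, not OpenSSL's actual code:)

```c
#include <stdint.h>

/* load a 32-bit word big-endian, as FIPS 180-2 specifies for the
   SHA-1 message schedule, regardless of host byte order */
static uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}
```

on a big-endian machine this is a plain load; on little-endian it is the byte swap the standard requires.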
______________________________________________________________________
OpenSSL Project http://www.openssl.org
Development Mailing List   [EMAIL PROTECTED]
Automated List Manager   [EMAIL PROTECTED]


[PATCH] SSE2 inner loop for bn_mul_add_words

2003-06-20 Thread dean gaudet
hi there, i tried sending this ages ago but i guess some spam filters
probably lost it... i see i have to be subscribed to post stuff.

on a p4 2.4GHz, here's the before:

                  sign    verify    sign/s verify/s
rsa 1024 bits   0.0044s   0.0002s    225.5   4139.4

and after:

                  sign    verify    sign/s verify/s
rsa 1024 bits   0.0033s   0.0002s    306.7   6264.2

see hacked patch below.

-dean

Date: Sun, 23 Mar 2003 22:08:25 -0800 (PST)
From: dean gaudet [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Subject: SSE2 inner loop for bn_mul_add_words

for kicks i decided to see if it really was possible to get RSA speedups
using the SSE2 PMULUDQ and PADDQ instructions ... and i'm seeing a 30%+
1024-bit sign/s improvement on the p4 systems i've measured on.

but i'm too lazy to try to understand the perl asm generation crud, and
don't want to figure out how/if you folks would prefer to switch between
routines optimized for specific platforms.  so what i present here is a
total hack patch against a .s file.  do with as you please :)

note that i use %mm registers rather than %xmm registers because this
code is completely dominated by the carry-propagation, which is a series
of 64-bit adds followed by shift-right 32 ... and if you attempt to
do this with 128-bit registers you waste a lot of slots mucking about
packing and shuffling and such.

even still, this is SSE2-only code because the PMULUDQ and PADDQ
instructions don't exist in MMX/SSE.  (which means the only released
processors it will run on are p4 and banias^Wpentium-m... it shows
similar improvements on unreleased processors i can't talk about :)

if you look closely i'm doing only 32-bit loads and stores ... the
implicit zero-extension on the 32-bit load beats out any sort of creative
attempt to do 64-bit loads and shuffle the halves around.
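(in C terms, what the loop computes is the classic word-by-word multiply-accumulate with a 64-bit running carry -- the 32-bit loads above zero-extend for free, just like a cast to a wider type.  a minimal sketch of what bn_mul_add_words does for 32-bit words, illustrative names, not OpenSSL's actual implementation:)

```c
#include <stdint.h>

/* r[i] += a[i] * w across n words, propagating carries.
   this is the serial carry chain the SSE2 loop mimics with
   PMULUDQ/PADDQ plus a shift-right-32 per word. */
static uint32_t mul_add_words(uint32_t *r, const uint32_t *a,
                              int n, uint32_t w)
{
    uint64_t carry = 0;                      /* plays the role of %mm1 */
    for (int i = 0; i < n; i++) {
        /* the 32-bit operands zero-extend to 64 bits, as in the asm */
        uint64_t t = (uint64_t)a[i] * w + r[i] + carry;
        r[i]  = (uint32_t)t;                 /* store low 32 bits */
        carry = t >> 32;                     /* PSRLQ $32 equivalent */
    }
    return (uint32_t)carry;
}
```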

it's unlikely that this technique can speed up the simple add/sub
routines -- unless there are situations where multiple add/sub could be
done in parallel... in the MMX hardware you can effectively parallelize
non-dependent carry propagation -- something you can't do in the ALUs
due to the conflict on EFLAGS.CF.
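(the point about non-dependent carry chains, sketched in C: with each carry held in an ordinary variable -- as in an %mm register -- rather than in the single EFLAGS.CF bit, two independent chains can interleave without serializing.  illustrative code, not from the patch:)

```c
#include <stdint.h>

/* two independent add-with-carry chains, interleaved.  each carry
   lives in its own variable, so neither chain stalls waiting for
   the other's flags -- the thing adc/adc cannot do in the ALUs. */
static void add_two_chains(uint32_t *r0, const uint32_t *a0,
                           uint32_t *r1, const uint32_t *a1, int n)
{
    uint64_t c0 = 0, c1 = 0;                 /* independent carries */
    for (int i = 0; i < n; i++) {
        uint64_t s0 = (uint64_t)r0[i] + a0[i] + c0;
        uint64_t s1 = (uint64_t)r1[i] + a1[i] + c1;
        r0[i] = (uint32_t)s0;  c0 = s0 >> 32;
        r1[i] = (uint32_t)s1;  c1 = s1 >> 32;
    }
}
```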

this code probably still has slack which could be improved on...  such as
moving the emms somewhere much higher in the call stack... it's required
before any fp code is run.  and rearranging the loop so that it overlaps
multiplication better with the carry chain propagation.

-dean

p.s. i'm not on the mailing list, so please CC me in any reply.

--- openssl-0.9.7a/crypto/bn/asm/bn86-elf.s 2003-03-23 21:29:16.0 -0800
+++ openssl-0.9.7a/crypto/bn/asm/bn86-elf.s.dg2 2003-03-23 21:18:05.0 -0800
@@ -26,94 +26,76 @@
	movl	32(%esp),	%ebp
	pushl	%ecx
	jz	.L000maw_finish
-.L001maw_loop:
-	movl	%ecx,	(%esp)

-	movl	(%ebx),	%eax
-	mull	%ebp
-	addl	%esi,	%eax
-	movl	(%edi),	%esi
-	adcl	$0,	%edx
-	addl	%esi,	%eax
-	adcl	$0,	%edx
-	movl	%eax,	(%edi)
-	movl	%edx,	%esi
+	movd %ebp,%mm0
+	pxor %mm1,%mm1		/* mm1 = carry in */

-	movl	4(%ebx),	%eax
-	mull	%ebp
-	addl	%esi,	%eax
-	movl	4(%edi),	%esi
-	adcl	$0,	%edx
-	addl	%esi,	%eax
-	adcl	$0,	%edx
-	movl	%eax,	4(%edi)
-	movl	%edx,	%esi
-
-	movl	8(%ebx),	%eax
-	mull	%ebp
-	addl	%esi,	%eax
-	movl	8(%edi),	%esi
-	adcl	$0,	%edx
-	addl	%esi,	%eax
-	adcl	$0,	%edx
-	movl	%eax,	8(%edi)
-	movl	%edx,	%esi
-
-	movl	12(%ebx),	%eax
-	mull	%ebp
-	addl	%esi,	%eax
-	movl	12(%edi),	%esi
-	adcl	$0,	%edx
-	addl	%esi,	%eax
-	adcl	$0,	%edx
-	movl	%eax,	12(%edi)
-	movl	%edx,	%esi
-
-	movl	16(%ebx),	%eax
-	mull	%ebp
-	addl	%esi,	%eax
-	movl	16(%edi),	%esi
-	adcl	$0,	%edx
-	addl	%esi,	%eax
-	adcl	$0,	%edx
-	movl	%eax,	16(%edi)
-	movl	%edx,	%esi
-
-	movl	20(%ebx),	%eax
-	mull	%ebp
-	addl	%esi,	%eax
-	movl	20(%edi),	%esi
-	adcl	$0,	%edx
-	addl	%esi,	%eax
-	adcl	$0,	%edx
-	movl	%eax,	20(%edi)
-	movl	%edx,	%esi
-
-	movl	24(%ebx),	%eax
-	mull	%ebp
-	addl	%esi,	%eax
-	movl	24(%edi),	%esi

Re: [PATCH] SSE2 inner loop for bn_mul_add_words

2003-06-20 Thread Ben Laurie
dean gaudet wrote:

> hi there, i tried sending this ages ago but i guess some spam filters
> probably lost it... i see i have to be subscribed to post stuff.

Actually, I've been sitting on it waiting for some free time to take a
look :-)

Cheers,

Ben.

-- 
http://www.apache-ssl.org/ben.html   http://www.thebunker.net/

There is no limit to what a man can do or how far he can go if he
doesn't mind who gets the credit. - Robert Woodruff



SSE2 inner loop for bn_mul_add_words

2003-04-04 Thread dean gaudet
for kicks i decided to see if it really was possible to get RSA speedups
using the SSE2 PMULUDQ and PADDQ instructions ... and i'm seeing a 30%+
1024-bit sign/s improvement on the p4 systems i've measured on.

but i'm too lazy to try to understand the perl asm generation crud, and
don't want to figure out how/if you folks would prefer to switch between
routines optimized for specific platforms.  so what i present here is a
total hack patch against a .s file.  do with as you please :)

note that i use %mm registers rather than %xmm registers because this
code is completely dominated by the carry-propagation, which is a series
of 64-bit adds followed by shift-right 32 ... and if you attempt to
do this with 128-bit registers you waste a lot of slots mucking about
packing and shuffling and such.

even still, this is SSE2-only code because the PMULUDQ and PADDQ
instructions don't exist in MMX/SSE.  (which means the only released
processors it will run on are p4 and banias^Wpentium-m... it shows
similar improvements on unreleased processors i can't talk about :)

if you look closely i'm doing only 32-bit loads and stores ... the
implicit zero-extension on the 32-bit load beats out any sort of creative
attempt to do 64-bit loads and shuffle the halves around.

it's unlikely that this technique can speed up the simple add/sub
routines -- unless there are situations where multiple add/sub could be
done in parallel... in the MMX hardware you can effectively parallelize
non-dependent carry propagation -- something you can't do in the ALUs
due to the conflict on EFLAGS.CF.

this code probably still has slack which could be improved on...  such as
moving the emms somewhere much higher in the call stack... it's required
before any fp code is run.  and rearranging the loop so that it overlaps
multiplication better with the carry chain propagation.

-dean

p.s. i'm not on the mailing list, so please CC me in any reply.

--- openssl-0.9.7a/crypto/bn/asm/bn86-elf.s 2003-03-23 21:29:16.0 -0800
+++ openssl-0.9.7a/crypto/bn/asm/bn86-elf.s.dg2 2003-03-23 21:18:05.0 -0800
@@ -26,94 +26,76 @@
	movl	32(%esp),	%ebp
	pushl	%ecx
	jz	.L000maw_finish
-.L001maw_loop:
-	movl	%ecx,	(%esp)

-	movl	(%ebx),	%eax
-	mull	%ebp
-	addl	%esi,	%eax
-	movl	(%edi),	%esi
-	adcl	$0,	%edx
-	addl	%esi,	%eax
-	adcl	$0,	%edx
-	movl	%eax,	(%edi)
-	movl	%edx,	%esi
+	movd %ebp,%mm0
+	pxor %mm1,%mm1		/* mm1 = carry in */

-	movl	4(%ebx),	%eax
-	mull	%ebp
-	addl	%esi,	%eax
-	movl	4(%edi),	%esi
-	adcl	$0,	%edx
-	addl	%esi,	%eax
-	adcl	$0,	%edx
-	movl	%eax,	4(%edi)
-	movl	%edx,	%esi
-
-	movl	8(%ebx),	%eax
-	mull	%ebp
-	addl	%esi,	%eax
-	movl	8(%edi),	%esi
-	adcl	$0,	%edx
-	addl	%esi,	%eax
-	adcl	$0,	%edx
-	movl	%eax,	8(%edi)
-	movl	%edx,	%esi
-
-	movl	12(%ebx),	%eax
-	mull	%ebp
-	addl	%esi,	%eax
-	movl	12(%edi),	%esi
-	adcl	$0,	%edx
-	addl	%esi,	%eax
-	adcl	$0,	%edx
-	movl	%eax,	12(%edi)
-	movl	%edx,	%esi
-
-	movl	16(%ebx),	%eax
-	mull	%ebp
-	addl	%esi,	%eax
-	movl	16(%edi),	%esi
-	adcl	$0,	%edx
-	addl	%esi,	%eax
-	adcl	$0,	%edx
-	movl	%eax,	16(%edi)
-	movl	%edx,	%esi
-
-	movl	20(%ebx),	%eax
-	mull	%ebp
-	addl	%esi,	%eax
-	movl	20(%edi),	%esi
-	adcl	$0,	%edx
-	addl	%esi,	%eax
-	adcl	$0,	%edx
-	movl	%eax,	20(%edi)
-	movl	%edx,	%esi
-
-	movl	24(%ebx),	%eax
-	mull	%ebp
-	addl	%esi,	%eax
-	movl	24(%edi),	%esi
-	adcl	$0,	%edx
-	addl	%esi,	%eax
-	adcl	$0,	%edx
-	movl	%eax,	24(%edi)
-	movl	%edx,	%esi
-
-	movl	28(%ebx),	%eax
-	mull	%ebp
-	addl	%esi,	%eax
-	movl	28(%edi),	%esi
-	adcl	$0,	%edx
-	addl	%esi,	%eax
-	adcl	$0,	%edx
-	movl	%eax,	28(%edi)
-	movl	%edx,	%esi
+.L001maw_loop:
+	movd (%edi),%mm3	/* mm3 = C[0]