Re: Rijndael in Assembler for x86?

2001-10-14 Thread Sidney Markowitz

A little over a month ago Perry Metzger asked about free assembler
language implementations of Rijndael for x86. Helger Lipmaa, whose
commercial assembler language version seems to be the fastest,
mentioned Brian Gladman as having the best free C implementation.
Gladman's web page now says that he has a free assembler language
version. For comparison, Lipmaa says that his runs at 230 cycles per
block, Gladman's C version runs at 360 cycles per block, and Gladman's
assembler language version runs at 300 cycles/block. It is not yet a
complete implementation. Here's a quote from the website:

Here is a preliminary version of this code in assembler for the
Pentium family with MMX (Pentium II/III/IV). This only implements the
standard block size of 128 bits but is 15-20% faster than the C code.
It achieves a maximum speed with a fully primed processor cache of
about 300 cycles/block, which is around 50 Mbytes/second on a 1GHz
processor.  This version has not been extensively tested so please be
aware that there may be bugs in it. Note also that it uses the
Microsoft VC++ register saving conventions and may hence need changes
if used with other C/C++ compilers.

The URL is
http://fp.gladman.plus.com/cryptography_technology/rijndael/


 -- sidney





-
The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]



Re: Rijndael in Assembler for x86?

2001-09-25 Thread Vipul Ved Prakash

[Moderator's note: This is getting off topic --Perry]

On Tue, Sep 25, 2001 at 03:06:57PM +0100, Ben Laurie wrote:
 
 By far the best assember I ever saw from a C compiler came from
 Watcom's. I'm sure I remember hearing they open sourced it a while back.
 Or did I dream that?

No, you heard right, http://www.openwatcom.com

best,
vipul.

-- 

Vipul Ved Prakash |   I almost died, but I made it, so I'm not so
Software Design Artist|serious about formal-wear anymore.
http://vipul.net/ | -- Gene Boggs




-
The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]



Re: Rijndael in Assembler for x86?

2001-09-25 Thread Ben Laurie

Ian Goldberg wrote something above this:
 [Moderator's note: The best DES implementations for i386s in assembler
 are several times faster than the best in C. I'm not sure about AES
 but I'd prefer to try and see. Perhaps it's a feature of DES's odd bit
 manipulation patterns, perhaps not. I have yet to see GCC produce code
 for almost anything that was just as fast as hand tuned assembler,
 though. --Perry]

By far the best assember I ever saw from a C compiler came from
Watcom's. I'm sure I remember hearing they open sourced it a while back.
Or did I dream that?

Cheers,

Ben.

--
http://www.apache-ssl.org/ben.html

There is no limit to what a man can do or how far he can go if he
doesn't mind who gets the credit. - Robert Woodruff



-
The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]



Re: Rijndael in Assembler for x86?

2001-09-19 Thread Eric Young

[EMAIL PROTECTED] wrote:

  Perry E. Metzger [EMAIL PROTECTED] wrote:
  Because it is typically slower by many times than hand
  tuned assembler.

 On 14 Sep 2001, at 14:24, Ian Goldberg wrote:
  Are you sure?  For general code, that certainly hasn't been
  true in a long time; optimizing compilers nowadays can
  often do *better* then hand-coded assembler.

 So say compiler writers.

 I have not found this to be true.  Perhaps it is true of some
 compilers and some people's assembler, and some code.

I've done quite a bit of assembler for crypto
in the last few years and it very much depends
on the CPU/compiler (obviously).  The only platform that I have not
been able to beat the compiler by %10 or more for an algorithm is
PA-RISC.  Either my C code is good enough to give the compiler all
the help it needs or I need to revisit the architecture :-).
The biggest wins are for algorithms where the C compiler does not
give access to underlying primitives, ie 32*32-64, or where special tricks
can be used due to data relationships that the compiler cannot know about.

For x86, there are too few registers and lots of black magic going on.
At least for the pentium VTune would reveal everything.  For the Ppro/II/III,
depending on the compiler (gcc vs VisualC) the C code could sometimes get
within %30 of the ASM.
For the Pentium 4, ASM is good again.  It seems to be a very 'brittle' CPU.
Eg. for sha1 (The numbers are relative but may or may not have any relation
to something in the real world :-)
| P4 |Athalon   |P2 Celeron
| 1.7ghz | 1.4ghz   | 333mhz|
|lnx gcc |cygwin gcc|lnx gcc|
SHA1 586|  78.594| 135.937  | 26.038|
SHA1 686|  81.986| 141.996  | 32.481|
SHA1 786| 135.419| 137.804  | 29.106|
SHA1 fast   |  47.864|  83.846  | 20.828|
SHA1 small  |  54.322|  62.599  | 12.534|

Notice how the different assembler version are all around the same
speed for the P2 and Athalon. Even the ratio between the
C code versions is similar.  But now look at the P4.  Special
magic can be used to make things very fast, and the 'small' C code
version is faster than the loop unrolled version
(trace cache thrashing?)

For PA-RISC, I've done 1.1, 2.0 and 2.0W code and for some algorithms I cannot
beat the optimizer.  For others, specifically bignum stuff, 2 to 4 times faster.  In
this case
all multiples are done in the FP unit and data has to be swapped between CPU and FPU

via memory so there are lots of opportunities to use 64bit loads etc.
HP has good optimizers for a rather tricky architecture.

Sparc, I've only done digests, 30-40% speedup seem normal.  Simple architecture,
simple for the compiler to do a good job.

ARM, the compilers are good and the CPUs are simple, I consistently only get
20-40%.  The real win comes when trying to get fast with small code size.
It is possible to make things much faster for the same reduced footprint.  Xscale
could
be interesting since there are now inter instruction register dependencies.  I've
normally worked
on StrongARM where there are none.

Itanium.  Amazing speedups with assembler, but hard to write.  It is a vector
processor if there ever was one.  Anything that can do two 64*64-128
multiplies per cycle but with a 15 cycle latency, and no OOO  is going to be tricky.

For both ARM and PA-RISC I've been able to use instruction set
features to improve performance.  In theory the algorithms could be
coded in C, but it takes CPU architecture/compiler knowledge to write the code :-).

eric





-
The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]



Re: Rijndael in Assembler for x86?

2001-09-15 Thread jamesd

--
 Perry E. Metzger [EMAIL PROTECTED] wrote:
 Because it is typically slower by many times than hand
 tuned assembler.

On 14 Sep 2001, at 14:24, Ian Goldberg wrote:
 Are you sure?  For general code, that certainly hasn't been
 true in a long time; optimizing compilers nowadays can
 often do *better* then hand-coded assembler.

So say compiler writers.

I have not found this to be true.  Perhaps it is true of some
compilers and some people's assembler, and some code.

--digsig
 James A. Donald
 6YeGpsZR+nOTh/cGwvITnSR3TdzclVpR0+pr3YYQdkG
 R+xhXGtvscaNbOpfLSnwjeziDpDOv2XtF4/h1ST9
 4Haf1Gw4kSOsLRysU1Atpc78QFbNBjP0Dr0J4Ji3I




-
The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]



Re: Rijndael in Assembler for x86?

2001-09-15 Thread Helger Lipmaa

First, my question was caused since Perry(?) did not originally specify
*why* he needs an assembly code; and secondly, since the referred 186
assembly code might be slower than the best C codes for Pentium. On the
other hand, the best (commercial) assembly implementation of Rijndael for
P3 is 50% faster (~230 cycles per block versus ~360 cycles per
block) than Brian Gladman's (free) C implementation. Brian's
implementation seems to be almost optimal for a C-code. The reasons why
assembly code achieves such a speedup was somewhat explained in

* Kazumaro Aoki, Helger Lipmaa, Fast Implementations of AES Candidates,
  AES 3 conference, 2000.

Both this paper, and a compendium of AES implementations are available
from http://www.tcs.hut.fi/~helger/aes (if you have anything to add there,
feel free to email me!). I am *not* aware of any free Rijndael assembly
implementations that are faster than 300 cycles per block on P3. I know
that there exist some non-free (including mine) implementations that are
faster, though.

Helger

On 14 Sep 2001, Ian Goldberg wrote:

   Does anyone have an open source implementation of Rijndael in
   assembler for the Pentium?
  
  Why just not to use a C code?
 
 Because it is typically slower by many times than hand tuned assembler.
 
 Are you sure?  For general code, that certainly hasn't been true in a
 long time; optimizing compilers nowadays can often do *better* then
 hand-coded assembler.  However, for encryption code in particular,
 I can imagine the C primitives (which usually lack rotate, etc.
 instructions) may be suboptimal.
 
 That being said, back when I wrote the 40-bit RC5 breaker for the RSA
 challenge, I thought the same thing.  I figured I would first write a C
 version, and then tune the resulting assembler.  When I looked at what
 gcc had output, it had already done all the tricks I had in mind.
 
 I would severely doubt a slowdown of many times.  I'm more likely to
 believe a few percent, and would not be surprised if the compiler's
 optimizer is smarter than most people's.
 
- Ian
 
 [Moderator's note: The best DES implementations for i386s in assembler
 are several times faster than the best in C. I'm not sure about AES
 but I'd prefer to try and see. Perhaps it's a feature of DES's odd bit
 manipulation patterns, perhaps not. I have yet to see GCC produce code
 for almost anything that was just as fast as hand tuned assembler,
 though. --Perry]
 
 -
 The Cryptography Mailing List
 Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]
 




-
The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]