Re: Rijndael in Assembler for x86?
A little over a month ago Perry Metzger asked about free assembler language implementations of Rijndael for x86. Helger Lipmaa, whose commercial assembler language version seems to be the fastest, mentioned Brian Gladman as having the best free C implementation. Gladman's web page now says that he has a free assembler language version. For comparison, Lipmaa says that his runs at 230 cycles per block, Gladman's C version runs at 360 cycles per block, and Gladman's assembler language version runs at 300 cycles/block. It is not yet a complete implementation. Here's a quote from the website: Here is a preliminary version of this code in assembler for the Pentium family with MMX (Pentium II/III/IV). This only implements the standard block size of 128 bits but is 15-20% faster than the C code. It achieves a maximum speed with a fully primed processor cache of about 300 cycles/block, which is around 50 Mbytes/second on a 1GHz processor. This version has not been extensively tested so please be aware that there may be bugs in it. Note also that it uses the Microsoft VC++ register saving conventions and may hence need changes if used with other C/C++ compilers. The URL is http://fp.gladman.plus.com/cryptography_technology/rijndael/ -- sidney - The Cryptography Mailing List Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]
Re: Rijndael in Assembler for x86?
[Moderator's note: This is getting off topic --Perry] On Tue, Sep 25, 2001 at 03:06:57PM +0100, Ben Laurie wrote: By far the best assember I ever saw from a C compiler came from Watcom's. I'm sure I remember hearing they open sourced it a while back. Or did I dream that? No, you heard right, http://www.openwatcom.com best, vipul. -- Vipul Ved Prakash | I almost died, but I made it, so I'm not so Software Design Artist|serious about formal-wear anymore. http://vipul.net/ | -- Gene Boggs - The Cryptography Mailing List Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]
Re: Rijndael in Assembler for x86?
Ian Goldberg wrote something above this: [Moderator's note: The best DES implementations for i386s in assembler are several times faster than the best in C. I'm not sure about AES but I'd prefer to try and see. Perhaps it's a feature of DES's odd bit manipulation patterns, perhaps not. I have yet to see GCC produce code for almost anything that was just as fast as hand tuned assembler, though. --Perry] By far the best assember I ever saw from a C compiler came from Watcom's. I'm sure I remember hearing they open sourced it a while back. Or did I dream that? Cheers, Ben. -- http://www.apache-ssl.org/ben.html There is no limit to what a man can do or how far he can go if he doesn't mind who gets the credit. - Robert Woodruff - The Cryptography Mailing List Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]
Re: Rijndael in Assembler for x86?
[EMAIL PROTECTED] wrote: Perry E. Metzger [EMAIL PROTECTED] wrote: Because it is typically slower by many times than hand tuned assembler. On 14 Sep 2001, at 14:24, Ian Goldberg wrote: Are you sure? For general code, that certainly hasn't been true in a long time; optimizing compilers nowadays can often do *better* then hand-coded assembler. So say compiler writers. I have not found this to be true. Perhaps it is true of some compilers and some people's assembler, and some code. I've done quite a bit of assembler for crypto in the last few years and it very much depends on the CPU/compiler (obviously). The only platform that I have not been able to beat the compiler by %10 or more for an algorithm is PA-RISC. Either my C code is good enough to give the compiler all the help it needs or I need to revisit the architecture :-). The biggest wins are for algorithms where the C compiler does not give access to underlying primitives, ie 32*32-64, or where special tricks can be used due to data relationships that the compiler cannot know about. For x86, there are too few registers and lots of black magic going on. At least for the pentium VTune would reveal everything. For the Ppro/II/III, depending on the compiler (gcc vs VisualC) the C code could sometimes get within %30 of the ASM. For the Pentium 4, ASM is good again. It seems to be a very 'brittle' CPU. Eg. for sha1 (The numbers are relative but may or may not have any relation to something in the real world :-) | P4 |Athalon |P2 Celeron | 1.7ghz | 1.4ghz | 333mhz| |lnx gcc |cygwin gcc|lnx gcc| SHA1 586| 78.594| 135.937 | 26.038| SHA1 686| 81.986| 141.996 | 32.481| SHA1 786| 135.419| 137.804 | 29.106| SHA1 fast | 47.864| 83.846 | 20.828| SHA1 small | 54.322| 62.599 | 12.534| Notice how the different assembler version are all around the same speed for the P2 and Athalon. Even the ratio between the C code versions is similar. But now look at the P4. Special magic can be used to make things very fast, and the 'small' C code version is faster than the loop unrolled version (trace cache thrashing?) For PA-RISC, I've done 1.1, 2.0 and 2.0W code and for some algorithms I cannot beat the optimizer. For others, specifically bignum stuff, 2 to 4 times faster. In this case all multiples are done in the FP unit and data has to be swapped between CPU and FPU via memory so there are lots of opportunities to use 64bit loads etc. HP has good optimizers for a rather tricky architecture. Sparc, I've only done digests, 30-40% speedup seem normal. Simple architecture, simple for the compiler to do a good job. ARM, the compilers are good and the CPUs are simple, I consistently only get 20-40%. The real win comes when trying to get fast with small code size. It is possible to make things much faster for the same reduced footprint. Xscale could be interesting since there are now inter instruction register dependencies. I've normally worked on StrongARM where there are none. Itanium. Amazing speedups with assembler, but hard to write. It is a vector processor if there ever was one. Anything that can do two 64*64-128 multiplies per cycle but with a 15 cycle latency, and no OOO is going to be tricky. For both ARM and PA-RISC I've been able to use instruction set features to improve performance. In theory the algorithms could be coded in C, but it takes CPU architecture/compiler knowledge to write the code :-). eric - The Cryptography Mailing List Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]
Re: Rijndael in Assembler for x86?
-- Perry E. Metzger [EMAIL PROTECTED] wrote: Because it is typically slower by many times than hand tuned assembler. On 14 Sep 2001, at 14:24, Ian Goldberg wrote: Are you sure? For general code, that certainly hasn't been true in a long time; optimizing compilers nowadays can often do *better* then hand-coded assembler. So say compiler writers. I have not found this to be true. Perhaps it is true of some compilers and some people's assembler, and some code. --digsig James A. Donald 6YeGpsZR+nOTh/cGwvITnSR3TdzclVpR0+pr3YYQdkG R+xhXGtvscaNbOpfLSnwjeziDpDOv2XtF4/h1ST9 4Haf1Gw4kSOsLRysU1Atpc78QFbNBjP0Dr0J4Ji3I - The Cryptography Mailing List Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]
Re: Rijndael in Assembler for x86?
First, my question was caused since Perry(?) did not originally specify *why* he needs an assembly code; and secondly, since the referred 186 assembly code might be slower than the best C codes for Pentium. On the other hand, the best (commercial) assembly implementation of Rijndael for P3 is 50% faster (~230 cycles per block versus ~360 cycles per block) than Brian Gladman's (free) C implementation. Brian's implementation seems to be almost optimal for a C-code. The reasons why assembly code achieves such a speedup was somewhat explained in * Kazumaro Aoki, Helger Lipmaa, Fast Implementations of AES Candidates, AES 3 conference, 2000. Both this paper, and a compendium of AES implementations are available from http://www.tcs.hut.fi/~helger/aes (if you have anything to add there, feel free to email me!). I am *not* aware of any free Rijndael assembly implementations that are faster than 300 cycles per block on P3. I know that there exist some non-free (including mine) implementations that are faster, though. Helger On 14 Sep 2001, Ian Goldberg wrote: Does anyone have an open source implementation of Rijndael in assembler for the Pentium? Why just not to use a C code? Because it is typically slower by many times than hand tuned assembler. Are you sure? For general code, that certainly hasn't been true in a long time; optimizing compilers nowadays can often do *better* then hand-coded assembler. However, for encryption code in particular, I can imagine the C primitives (which usually lack rotate, etc. instructions) may be suboptimal. That being said, back when I wrote the 40-bit RC5 breaker for the RSA challenge, I thought the same thing. I figured I would first write a C version, and then tune the resulting assembler. When I looked at what gcc had output, it had already done all the tricks I had in mind. I would severely doubt a slowdown of many times. I'm more likely to believe a few percent, and would not be surprised if the compiler's optimizer is smarter than most people's. - Ian [Moderator's note: The best DES implementations for i386s in assembler are several times faster than the best in C. I'm not sure about AES but I'd prefer to try and see. Perhaps it's a feature of DES's odd bit manipulation patterns, perhaps not. I have yet to see GCC produce code for almost anything that was just as fast as hand tuned assembler, though. --Perry] - The Cryptography Mailing List Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED] - The Cryptography Mailing List Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]