[PATCH RFC] Add support to Intel AES-NI instruction set for x86_64 platform

2008-12-09 Thread Huang Ying
This patch adds support for the Intel AES-NI instruction set on the
x86_64 platform.

Intel AES-NI is a new set of Single Instruction Multiple Data (SIMD)
instructions that are going to be introduced in the next generation of
Intel processors, as of 2009. These instructions enable fast and secure
data encryption and decryption using the Advanced Encryption Standard
(AES), defined by FIPS Publication number 197. The architecture
introduces six instructions that offer full hardware support for
AES. Four of them support high performance data encryption and
decryption, and the other two support the AES key expansion
procedure.

The white paper can be downloaded from:

http://softwarecommunity.intel.com/isn/downloads/intelavx/AES-Instructions-Set_WP.pdf


- The AES implementation based on AES-NI is in crypto/aes/asm/aes-intel.S.

- AES-NI operates on XMM registers, so the key structure needs to be
  128-bit aligned. A pad field is added to AES_KEY, and the key structure
  is aligned to a 128-bit boundary on entry to the AES-NI implementation.

- At the entry point of the AES algorithm in crypto/aes/asm/aes-x86_64.pl,
  OPENSSL_ia32cap_P is checked; if the corresponding bit (57) is set,
  execution branches into the AES-NI based implementation.

- The AES-NI based implementation cannot benefit from a specialized
  AES_cbc_encrypt, so the general C implementation is used. To resolve
  the name conflict, the original AES_cbc_encrypt is renamed to
  AES_cbc_encrypt_def and put in crypto/aes/aes_cbc_def.c.
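
The capability check described in the list above can be sketched in C. This is only an illustration, not code from the patch: the helper names are hypothetical, and it assumes the 64-bit capability word holds CPUID.1:EDX in its low half and CPUID.1:ECX in its high half, so that bit 57 corresponds to ECX bit 25 (the AES-NI feature flag).

```c
#include <stdint.h>

/* Bit index quoted in the patch description. */
#define AESNI_CAP_BIT 57

/* Hypothetical helper (not OpenSSL API): pack CPUID leaf 1 results into a
 * 64-bit word, assuming EDX occupies bits 0..31 and ECX bits 32..63. */
static uint64_t cap_from_cpuid1(uint32_t ecx, uint32_t edx)
{
    return ((uint64_t)ecx << 32) | (uint64_t)edx;
}

/* Hypothetical helper: returns 1 if the AES-NI bit is set, 0 otherwise. */
static int cap_has_aesni(uint64_t cap)
{
    return (int)((cap >> AESNI_CAP_BIT) & 1u);
}
```

A dispatcher would call cap_has_aesni() once per entry and fall through to the existing table-based code when it returns 0.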


Signed-off-by: Huang Ying [EMAIL PROTECTED]

---
 Configure                    |   20 +-
 crypto/aes/Makefile          |    9 -
 crypto/aes/aes.h             |    5 
 crypto/aes/aes_cbc.c         |   66 ---
 crypto/aes/aes_cbc_def.c     |  130 ++
 crypto/aes/asm/aes-intel.S   |  374 +++
 crypto/aes/asm/aes-x86_64.pl |   20 ++
 7 files changed, 546 insertions(+), 78 deletions(-)

--- /dev/null
+++ b/crypto/aes/asm/aes-intel.S
@@ -0,0 +1,374 @@
+/*
+ * 
+ * Written by Huang Ying [EMAIL PROTECTED] for the OpenSSL
+ * project to add support for Intel new AES instructions. Rights for
+ * redistribution and usage in source and binary forms are granted
+ * according to the OpenSSL license.
+ * 
+ */
+
+.align 16
+key_expansion_128:
+   movaps %xmm1, %xmm4
+   psrldq $12, %xmm1
+   pxor %xmm0, %xmm1
+   palignr $12, %xmm4, %xmm1
+   pxor %xmm0, %xmm1
+   palignr $12, %xmm4, %xmm1
+   pxor %xmm0, %xmm1
+   palignr $12, %xmm4, %xmm1
+   pxor %xmm1, %xmm0
+
+   movaps %xmm0, (%rcx)
+   add $0x10, %rcx
+   ret
+
+.align 16
+key_expansion_192:
+   pshufd $0b01010101, %xmm1, %xmm1
+   movaps %xmm1, %xmm4
+   pxor %xmm0, %xmm1
+   palignr $12, %xmm4, %xmm1
+   pxor %xmm0, %xmm1
+   palignr $12, %xmm4, %xmm1
+   pxor %xmm0, %xmm1
+   palignr $12, %xmm4, %xmm1
+   pxor %xmm1, %xmm0
+
+   pshufd $0b, %xmm0, %xmm3
+   pxor %xmm2, %xmm3
+   palignr $12, %xmm0, %xmm3
+   pxor %xmm2, %xmm3
+
+   test %r9, %r9
+   not %r9
+   jnz 1f
+
+   movaps %xmm0, %xmm1
+   pslldq $8, %xmm2
+   palignr $8, %xmm2, %xmm1
+   movaps %xmm1, (%rcx)
+   add $0x10, %rcx
+   movaps %xmm3, %xmm2
+   palignr $8, %xmm0, %xmm3
+   movaps %xmm3, (%rcx)
+   add $0x10, %rcx
+   ret
+1:
+   movaps %xmm0, (%rcx)
+   add $0x10, %rcx
+   movaps %xmm3, %xmm2
+   ret
+
+.align 16
+key_expansion_256:
+   movaps %xmm1, %xmm4
+   psrldq $12, %xmm1
+   pxor %xmm0, %xmm1
+   palignr $12, %xmm4, %xmm1
+   pxor %xmm0, %xmm1
+   palignr $12, %xmm4, %xmm1
+   pxor %xmm0, %xmm1
+   palignr $12, %xmm4, %xmm1
+   pxor %xmm1, %xmm0
+
+   movaps %xmm0, (%rcx)
+   add $0x10, %rcx
+
+   test %r9, %r9
+   jnz 1f
+
+   # aeskeygenassist $0x1, %xmm0, %xmm1
+   .byte 0x66, 0x0f, 0x3a, 0xdf, 0xc8, 0x01
+
+   pshufd $0b10101010, %xmm1, %xmm1
+   movaps %xmm1, %xmm4
+   pxor %xmm2, %xmm1
+   palignr $12, %xmm4, %xmm1
+   pxor %xmm2, %xmm1
+   palignr $12, %xmm4, %xmm1
+   pxor %xmm2, %xmm1
+   palignr $12, %xmm4, %xmm1
+   pxor %xmm1, %xmm2
+
+   movaps %xmm2, (%rcx)
+   add $0x10, %rcx
+1:
+   ret
+
+.align 16
+.global intel_AES_set_encrypt_key
+intel_AES_set_encrypt_key:
+   test %rdi, %rdi
+   jz 3f
+   test %rdx, %rdx
+   jz 3f
+   add $0xf, %rdx  # make key struct 128-bit aligned
+   and $-16, %rdx
+   movups (%rdi), %xmm0        # user key (first 16 bytes)
+   movaps %xmm0, (%rdx)
+   lea 0x10(%rdx), %rcx        # key addr
+   cmp $256, %esi
+   jnz 1f
+   mov $14, %esi
+   movl %esi, 240(%rdx)        # 14 rounds for 256
+   movups 0x10(%rdi), %xmm2    #

Re: [PATCH RFC] Add support to Intel AES-NI instruction set for x86_64 platform

2008-12-09 Thread Andy Polyakov
As for RFC part. NO! This is NOT the way to do it. For several reasons
(in ascending order of importance):

- OpenSSL assembler modules are maintained as dual-ABI, i.e. suitable
for both Unix and Win64;
- and $-16, %rdx is unacceptable in this context. The relevant
interface is exposed to the end user, and we have to allow for the
possibility that the key schedule is memcpy-ed to a location with a
different alignment;
- a zero-copy CBC routine gives a fair performance improvement even in
the ordinary case, and driving an ultra-fast block function from C would
be just wasteful. In other words, AESENC/DEC would benefit more from a
dedicated CBC routine (see also the comment below);
- the implementation should allow for pipelining;

As for the latter: I am referring to the possibility of scheduling
multiple AESENC/DEC instructions with the same key schedule element and
multiple data chunks. It's possible in modes that allow for
parallelization (e.g. ECB, CBC decrypt, CTR), and as far as I understand
it is even recommended. So we are kind of obliged to allow for this
option.
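
To illustrate the scheduling pattern described here, the following is a minimal C sketch. It is not AES and not taken from the patch: toy_round is a hypothetical stand-in for one AESENC step, and ROUNDS and LANES are arbitrary assumed values. The point is only the loop order, where each round key is fetched once and applied to several independent blocks before moving to the next round.

```c
#include <stddef.h>
#include <stdint.h>

#define ROUNDS 10 /* assumed round count for the toy cipher */
#define LANES  4  /* assumed number of independent blocks in flight */

/* Toy stand-in for one cipher round (NOT AES): xor with the round key,
 * then rotate left by 5 bits. */
static uint32_t toy_round(uint32_t state, uint32_t rk)
{
    state ^= rk;
    return (state << 5) | (state >> 27);
}

/* One block at a time: every round waits for the previous round. */
static uint32_t encrypt_one(uint32_t block, const uint32_t rk[ROUNDS])
{
    for (size_t r = 0; r < ROUNDS; r++)
        block = toy_round(block, rk[r]);
    return block;
}

/* Interleaved: the same key schedule element is applied to LANES
 * independent blocks, so their rounds can overlap in a pipeline. */
static void encrypt_interleaved(uint32_t blocks[LANES],
                                const uint32_t rk[ROUNDS])
{
    for (size_t r = 0; r < ROUNDS; r++) {
        uint32_t k = rk[r];
        for (size_t i = 0; i < LANES; i++)
            blocks[i] = toy_round(blocks[i], k);
    }
}
```

Both functions compute the same result per block; the interleaved form only works when the blocks do not depend on each other, which is why it applies to ECB, CBC decrypt and CTR but not to CBC encrypt.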

The answer is an engine. I mean this should preferably be implemented as
an engine that will be able to take full advantage of the architecture,
not as a patch to the general purpose block function.

> This patch adds support to Intel AES-NI instruction set for x86_64
> platform.
>
> Intel AES-NI is a new set of Single Instruction Multiple Data (SIMD)
> instructions that are going to be introduced in the next generation of
> Intel processor, as of 2009.

Hardware however is not expected before 2010, right? A.
__
OpenSSL Project http://www.openssl.org
Development Mailing List   openssl-dev@openssl.org
Automated List Manager   [EMAIL PROTECTED]


Re: [PATCH RFC] Add support to Intel AES-NI instruction set for x86_64 platform

2008-12-09 Thread Peter Waltenberg
If you want this in the mainstream code, you'll need to detect the
capability at runtime and use your alternate code paths only if the
hardware is present. It's not even to Intel's advantage if OpenSSL crashes
and burns on older Intel CPUs, and most bulk users of OpenSSL (OS vendors)
won't want to mess around installing different OpenSSL versions for
different hardware.

Autodetection is the best option if the detection overhead is reasonable -
take a look at crypto/x86_64cpuid.pl for how to do the detection logic
neatly.
There are advantages in this being present all the time and dynamically
enabled if it can be done; most users/OS vendors wouldn't bother to
configure an engine backend anyway.

I'll disagree with Andy on that aspect only. The engine modules aren't
particularly useful for this situation, where the function is inherent in
some subset of CPUs; the engines will only get used by the few end users
who can be bothered to configure them. I doubt the OS vendors would bother
to enable an engine by default: testing the possible configurations is
expensive, and the cost of support calls if they mess up makes
autodetecting the engine to use a very unattractive proposition.
(For example, you get scenarios like building an image on a system with
the new hardware and then cloning it across large numbers of machines.)

Peter




 


Re: [PATCH RFC] Add support to Intel AES-NI instruction set for x86_64 platform

2008-12-09 Thread Andy Polyakov
> If you want this in the mainstream code, you'll need to detect the
> capability at runtime and use your alternate code paths only if the
> hardware is present.

He did. It wouldn't work on Win64, but on Unix detection would actually
work.

> There are advantages in this being present all the time/dynamically enabled
> if it can be done, most users/OS vendors wouldn't bother to configure an
> engine backend anyway.
>
> I'll disagree with Andy on that aspect only. The engine modules aren't
> particularly useful for this situation where the function is inherent in
> some subset of CPU's, the engines will only get used by a few end users
> that can be bothered to configure them.

As mentioned, in order to fully utilize the pipelined architecture one
would have to implement a number of mode-specific subroutines, most
notably Nx-interleaved ones plus non-interleaved ones for short input and
tail processing, and wrap them in specific C logic. Putting all this in
general purpose code serves no purpose. Of course one could argue that
the improvement from patching the single block function would be
impressive enough, ~4x(?), but why stop there if you can reach for ~20x?
This is my main argument for an engine.
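
As a rough illustration of the "specific C logic" mentioned above, the sketch below splits an input into full interleaved groups and a non-interleaved tail. The names split_input and struct work_split are hypothetical, and the interleave factor of 4 is an arbitrary assumption rather than a recommendation from this thread.

```c
#include <stddef.h>

#define BLOCK_SIZE 16 /* AES block size in bytes */
#define LANES      4  /* assumed interleave factor (the "N" in Nx) */

/* How much work goes to each code path. */
struct work_split {
    size_t interleaved_groups; /* groups of LANES blocks for the Nx path */
    size_t tail_blocks;        /* leftover blocks for the one-block path */
};

/* Hypothetical helper: decide how many blocks each path handles.
 * Any partial block at the end of the input is ignored in this sketch. */
static struct work_split split_input(size_t len_bytes)
{
    size_t blocks = len_bytes / BLOCK_SIZE;
    struct work_split s;

    s.interleaved_groups = blocks / LANES;
    s.tail_blocks = blocks % LANES;
    return s;
}
```

Short inputs (fewer than LANES blocks) end up entirely in the tail path, which is why a non-interleaved routine is still needed next to the interleaved one.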

> I doubt the OS vendors would bother
> to enable an engine by default, testing of the possible configurations is
> expensive and the costs of support calls if they mess up makes
> autodetecting the engine to use a very unattractive proposition.

One can discuss loading selected engines by default, i.e. you'd have to
work to not load it:-) Then it wouldn't be any different, yet provide
proper isolation for specific pipeline-enabling logic, would it? Either
way, there were more points:-) A.


Re: [PATCH RFC] Add support to Intel AES-NI instruction set for x86_64 platform

2008-12-09 Thread Huang Ying
On Wed, 2008-12-10 at 03:40 +0800, Andy Polyakov wrote:
> As for RFC part. NO! This is NOT the way to do it. For several reasons
> (in ascending order of importance):
>
> - OpenSSL assembler modules are maintained as dual-ABI, i.e. suitable
> for both Unix and Win64;

OK. I will follow the approach used in aes-x86_64.pl to deal with the
ABI issue.

> - and $-16, %rdx is unacceptable in this context. The relevant
> interface is exposed to end-user and we have to reserve for possibility
> that key schedule is memcpy-ed to location with alternative alignment;

Is there any other mechanism to deal with the alignment issue in OpenSSL?
Would it be better to declare AES_KEY as follows:

struct aes_key_st {
    unsigned int rd_key[4 * (AES_MAXNR + 1)];
    int rounds;
} __attribute__ ((aligned (16)));

And how should memory allocated with malloc() be handled?

> - zero-copy CBC routine gives a fair performance improvement even in
> ordinary case, and driving ultra-fast block function from C would be
> just wasteful. In other words AESENC/DEC would benefit more from
> dedicated CBC routine (see even comment below);

I will do more investigation on that.

> - implementation should allow for pipelining;
>
> As for the latter. I refer to possibility of scheduling of multiple
> AESENC/DEC with same key schedule element and multiple data chunks. It's
> possible in modes that allow for parallelization (e.g. ECB, CBC decrypt,
> CTR), and as far as I understand it is even recommended. So we are kind
> of obliged to reserve for this option.
>
> The answer is engine. I mean this preferably should be implemented as
> engine that will be able to take full advantage of architecture, not as
> patch to general purpose block function.

But as Peter Waltenberg said, the engine approach has its issues too. At
least we should have a branch-based version (maybe slower) to benefit
most users, until we can make the engine version usable by most users.

> > This patch adds support to Intel AES-NI instruction set for x86_64
> > platform.
> >
> > Intel AES-NI is a new set of Single Instruction Multiple Data (SIMD)
> > instructions that are going to be introduced in the next generation of
> > Intel processor, as of 2009.
>
> Hardware however is not expected before 2010, right? A.

Maybe 2009 or 2010; I don't know exactly either.

Best Regards,
Huang Ying





Re: [PATCH RFC] Add support to Intel AES-NI instruction set for x86_64 platform

2008-12-09 Thread Huang Ying
On Wed, 2008-12-10 at 04:58 +0800, Peter Waltenberg wrote:
> If you want this in the mainstream code, you'll need to detect the
> capability at runtime and use your alternate code paths only if the
> hardware is present. It's not even to Intels advantage if OpenSSL crashes
> and burns on older Intel CPU's and most bulk users of OpenSSL (OS vendors)
> won't want to mess around installing different OpenSSL versions for
> different hardware.
>
> Autodetection is the best option if the detection overhead is reasonable -
> take a look at crypto/x86_64cpuid.pl for how to do the detection logic
> neatly.
> There are advantages in this being present all the time/dynamically enabled
> if it can be done, most users/OS vendors wouldn't bother to configure an
> engine backend anyway.

Auto-detection has already been implemented in the patch:

- In entry point of AES algorithm in crypto/aes/asm/aes-x86_64.pl,
  OPENSSL_ia32cap_P is checked, if corresponding bit (57) is set,
  branch into AES-NI based implementation.

Best Regards,
Huang Ying





Re: [PATCH RFC] Add support to Intel AES-NI instruction set for x86_64 platform

2008-12-09 Thread Huang Ying
On Wed, 2008-12-10 at 05:47 +0800, Andy Polyakov wrote:
> > I doubt the OS vendors would bother
> > to enable an engine by default, testing of the possible configurations is
> > expensive and the costs of support calls if they mess up makes
> > autodetecting the engine to use a very unattractive proposition.
>
> One can discuss loading selected engines by default, i.e. you'd have to
> work to not load it:-) Then it wouldn't be any different, yet provide

I am new to OpenSSL. Can you tell me how to do that? How can the proper
engine be used automatically?

Best Regards,
Huang Ying





Re: [PATCH RFC] Add support to Intel AES-NI instruction set for x86_64 platform

2008-12-09 Thread Andy Polyakov

> > - and $-16, %rdx is unacceptable in this context. The relevant
> > interface is exposed to end-user and we have to reserve for possibility
> > that key schedule is memcpy-ed to location with alternative alignment;
>
> Does there any other mechanism to deal with alignment issue in OpenSSL?


The answer is engine.


> Is it better to declare AES_KEY as follow:
>
> struct aes_key_st {
>     unsigned int rd_key[4 *(AES_MAXNR + 1)];
>     int rounds;
> } __attribute__ ((aligned (16)));


This is a gcc-ism and we support other compilers, so no.


> And how to deal with memory allocated with malloc()?


An implementation aiming to complement the interface exposed by
crypto/aes/asm should allow for a non-16-byte-aligned key schedule.
Period. One can use movups, or check alignment and choose between movups
and movaps code paths, or copy the key schedule to an aligned location on
the stack.
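
The second and third options can be sketched in C as follows. This is an illustration only, with hypothetical helper names; KS_WORDS assumes the rd_key size from the struct quoted above with AES_MAXNR equal to 14, and the caller is assumed to supply a 16-byte-aligned scratch buffer (for example a local array declared with C11 _Alignas(16)).

```c
#include <stdint.h>
#include <string.h>

/* 4 * (AES_MAXNR + 1) words, assuming AES_MAXNR == 14. */
#define KS_WORDS 60

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
static int is_aligned_16(const void *p)
{
    return ((uintptr_t)p & 0xfu) == 0;
}

/* Hypothetical helper: return a 16-byte-aligned view of the key schedule.
 * If ks is already aligned it is used in place (the movaps path);
 * otherwise it is copied into the caller's aligned scratch buffer. */
static const uint32_t *aligned_schedule(const void *ks,
                                        uint32_t scratch[KS_WORDS])
{
    if (is_aligned_16(ks))
        return (const uint32_t *)ks;
    memcpy(scratch, ks, KS_WORDS * sizeof(uint32_t));
    return scratch;
}
```

Because the decision is made from the address at call time, a key schedule that was memcpy-ed to a differently aligned location is still handled correctly. A real implementation would also want to wipe the scratch copy after use, since it holds key material.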



> > - implementation should allow for pipelining;
> >
> > As for the latter. I refer to possibility of scheduling of multiple
> > AESENC/DEC with same key schedule element and multiple data chunks. It's
> > possible in modes that allow for parallelization (e.g. ECB, CBC decrypt,
> > CTR), and as far as I understand it is even recommended. So we are kind
> > of obliged to reserve for this option.
> >
> > The answer is engine. I mean this preferably should be implemented as
> > engine that will be able to take full advantage of architecture, not as
> > patch to general purpose block function.
>
> But as Peter Waltenberg said, engine has its issue too.


Yes, and the relevant question is whether it is worth it.


> At least we
> should have a branch based version (may be slower) to benefit most
> users, until we can make engine version usable by most users.


There is no hardware in sight, so "until" is not really an argument. One
can reserve the branch version as a back-up/exit plan, i.e. "in case",
but not "until". A.



Re: [PATCH RFC] Add support to Intel AES-NI instruction set for x86_64 platform

2008-12-09 Thread Andy Polyakov

> > > I doubt the OS vendors would bother
> > > to enable an engine by default, testing of the possible configurations is
> > > expensive and the costs of support calls if they mess up makes
> > > autodetecting the engine to use a very unattractive proposition.
> >
> > One can discuss loading selected engines by default, i.e. you'd have to
> > work to not load it:-) Then it wouldn't be any different, yet provide
>
> I am new to OpenSSL. Can you tell me how to do that? how to use the
> proper engine automatically?


I said one can discuss it; there is no way currently, but as it's
*soft*ware there is hardly a limit to what one can do. A.
