[openssl.org #2365] Limitations of ENGINE interface hamper performance on modern hardware

2011-12-04 Thread Andrey Kulikov via RT
 3) An accelerator device directly supports TLS/SSL record
 encryption/decryption and the handshake operation itself.

 We do many bus transactions to the accelerator (and
 possibly system calls into the OS kernel) where we
 could do one, doing every single basic cryptographic
 operation individually when we could actually amortize
 the cost over the entire record or handshake operation.

 This is the case for most modern accelerators used with
 general-purpose CPUs.


This technique is not limited to hardware accelerators.
Another example is a service that accepts the whole record together
with the encryption and MAC keys and processes it in a single call.
This is used when, for security reasons, all cryptographic
operations are performed in a separate process/driver/VM,
and the client works only with handles to keys.

I have seen this implemented as an extension to MS CryptoAPI.
Even without such extensions, the CryptEncrypt function is able to encrypt
and hash data at the same time.
The extension I'm talking about adds the ability to pass pointers to the
header, the body, and the place where the tail (i.e. the MAC value) is written.

The inability to process a TLS record in a single call makes it necessary
to pass the same data over IPC twice.
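
To make the point concrete, here is a minimal sketch in C. Everything in it
is hypothetical: the toy_* names, the XOR "cipher" and the additive "MAC" are
stand-ins, not OpenSSL or CryptoAPI interfaces. It only counts round trips and
bytes sent to a simulated crypto service, comparing a MAC call followed by an
encrypt call against one call that takes header, body and tail pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* All names here are hypothetical; this is not a real ENGINE or CryptoAPI. */
#define TOY_MAC_LEN 4

static unsigned toy_round_trips; /* calls crossing the IPC/bus boundary */
static size_t   toy_bytes_sent;  /* payload bytes pushed across it      */

struct toy_record {
    const uint8_t *header; size_t header_len;
    uint8_t       *body;   size_t body_len;
    uint8_t       *tail;   /* receives TOY_MAC_LEN bytes of MAC */
};

/* Toy primitives (NOT secure): additive MAC and XOR cipher. */
static uint32_t toy_mac(uint32_t key, const uint8_t *hdr, size_t hlen,
                        const uint8_t *body, size_t blen)
{
    uint32_t acc = key;
    for (size_t i = 0; i < hlen; i++) acc = acc * 31u + hdr[i];
    for (size_t i = 0; i < blen; i++) acc = acc * 31u + body[i];
    return acc;
}

static void toy_xor(uint8_t key, uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) buf[i] ^= key;
}

static void toy_put_mac(uint8_t *tail, uint32_t mac)
{
    for (int i = 0; i < TOY_MAC_LEN; i++) tail[i] = (uint8_t)(mac >> (8 * i));
}

/* Simulated service entry points: each call is one round trip. */
static uint32_t toy_dev_mac(uint32_t key, const uint8_t *hdr, size_t hlen,
                            const uint8_t *body, size_t blen)
{
    toy_round_trips++;
    toy_bytes_sent += hlen + blen;
    return toy_mac(key, hdr, hlen, body, blen);
}

static void toy_dev_encrypt(uint8_t key, uint8_t *buf, size_t len)
{
    toy_round_trips++;
    toy_bytes_sent += len;
    toy_xor(key, buf, len);
}

static void toy_dev_seal_record(uint32_t mac_key, uint8_t enc_key,
                                struct toy_record *r)
{
    toy_round_trips++;
    toy_bytes_sent += r->header_len + r->body_len;
    toy_put_mac(r->tail, toy_mac(mac_key, r->header, r->header_len,
                                 r->body, r->body_len));
    toy_xor(enc_key, r->body, r->body_len);
    toy_xor(enc_key, r->tail, TOY_MAC_LEN);
}

/* Two calls: the body crosses the boundary once for MAC, once for encrypt. */
static void toy_seal_two_calls(uint32_t mac_key, uint8_t enc_key,
                               struct toy_record *r)
{
    assert(r->tail == r->body + r->body_len); /* body and tail contiguous */
    toy_put_mac(r->tail, toy_dev_mac(mac_key, r->header, r->header_len,
                                     r->body, r->body_len));
    toy_dev_encrypt(enc_key, r->body, r->body_len + TOY_MAC_LEN);
}

/* One call: header, body and tail pointers are handed over together. */
static void toy_seal_one_call(uint32_t mac_key, uint8_t enc_key,
                              struct toy_record *r)
{
    toy_dev_seal_record(mac_key, enc_key, r);
}
```

Both paths produce the same bytes in this toy model; the difference is only
in how often, and with how much data, the boundary is crossed.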

Andrey.




major ssl read/ write performance improvement - updated

2011-12-04 Thread Deng Michael
Hi,
 I have changed the MAC code, which gives a substantial improvement for both read
and write (not handshake).

 The saving is fairly major: on a CPU with crypto acceleration, the change
can more than double the overall SSL read/write speed for 1K records,
excluding OS I/O time. This implies the change removes the majority of the
code overhead for read and write.

 The basic idea
is to remove all the EVP_MD_CTX duplications (which are very CPU
intensive) during read and write. The original code involves numerous
memory allocations and frees for each read or write, all due to the ctx's
deep copy.

 The new way of keeping the ctx is to
make it do a state checkpoint and restore instead of a deep copy. After
this change there is NO memory allocation or free for read and write. The
changes are also not too big.
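
 As an illustration only (this is not the patch), here is a minimal C sketch
of the difference. The toy_* names and the toy digest are hypothetical
stand-ins for EVP_MD_CTX and its private state; the counter simply records how
many heap allocations each approach performs per record.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical toy digest context; not the OpenSSL EVP_MD_CTX layout. */
struct toy_md_state { uint32_t acc; uint64_t len; };

struct toy_md_ctx {
    struct toy_md_state *state;      /* heap data, like a digest's private state */
    struct toy_md_state  checkpoint; /* saved inline, so restore needs no heap   */
};

static unsigned toy_alloc_count;     /* number of malloc calls made so far */

static int toy_ctx_init(struct toy_md_ctx *c, uint32_t key)
{
    c->state = malloc(sizeof *c->state);
    if (c->state == NULL) return 0;
    toy_alloc_count++;
    c->state->acc = key;
    c->state->len = 0;
    return 1;
}

static void toy_ctx_update(struct toy_md_ctx *c, const uint8_t *p, size_t n)
{
    for (size_t i = 0; i < n; i++) c->state->acc = c->state->acc * 31u + p[i];
    c->state->len += n;
}

static uint32_t toy_ctx_final(const struct toy_md_ctx *c)
{
    return c->state->acc ^ (uint32_t)c->state->len;
}

static void toy_ctx_cleanup(struct toy_md_ctx *c)
{
    free(c->state);
    c->state = NULL;
}

/* Deep copy: one allocation (and later one free) per record. */
static int toy_ctx_copy(struct toy_md_ctx *dst, const struct toy_md_ctx *src)
{
    dst->state = malloc(sizeof *dst->state);
    if (dst->state == NULL) return 0;
    toy_alloc_count++;
    *dst->state = *src->state;
    dst->checkpoint = src->checkpoint;
    return 1;
}

/* Checkpoint/restore: plain struct assignments, no heap traffic. */
static void toy_ctx_checkpoint(struct toy_md_ctx *c) { c->checkpoint = *c->state; }
static void toy_ctx_restore(struct toy_md_ctx *c)    { *c->state = c->checkpoint; }

/* Per-record MAC using a temporary deep copy of the keyed context. */
static uint32_t toy_mac_by_copy(const struct toy_md_ctx *keyed,
                                const uint8_t *rec, size_t n)
{
    struct toy_md_ctx tmp;
    uint32_t out;
    if (!toy_ctx_copy(&tmp, keyed)) return 0;
    toy_ctx_update(&tmp, rec, n);
    out = toy_ctx_final(&tmp);
    toy_ctx_cleanup(&tmp);
    return out;
}

/* Per-record MAC that works in place and rolls back to the checkpoint. */
static uint32_t toy_mac_by_restore(struct toy_md_ctx *keyed,
                                   const uint8_t *rec, size_t n)
{
    uint32_t out;
    toy_ctx_update(keyed, rec, n);
    out = toy_ctx_final(keyed);
    toy_ctx_restore(keyed);
    return out;
}
```

 Note that the restore variant modifies the shared context while a record is
being processed, so in this sketch a context must not be used by two threads
at once.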

 One catch (it should not
really be a catch) is that at the application level NO MORE than one thread
can work on the SAME SSL/TLS connection for read or for write (a read and a
write can be done at the same time). But I would think most apps would NEVER
allow more than one thread to read or write on the same connection (I
don't think it would work if you did that anyway, even without my
change).

 The patch file I attached is based on version 1.0.0e.


Andrey found a problem in the original version of the patch when a PKEY_METHS
engine is used, so this is an updated patch (complete, not incremental)
that fixes it.

This checkpoint/restore is enabled if a PKEY_METHS engine is used UNLESS the
engine code implements the control interface to do the checkpointing/restore.

As pointed out by others, there can be other ways to achieve a similar result, and
the saving also depends on your system's memory allocation routines. Also, part
of the patch looks a bit like a hack.


Thanks to Andrey!

Regards,
Michael


checkpoint.patch
Description: Binary data