Re: [cryptography] 5x speedup for AES using SSE5?

2008-08-26 Thread Eric Young
Hovav Shacham wrote:
 On Aug 24, 2008, at 5:20 AM, Peter Gutmann wrote:

 Speaking of CPU-specific optimisations, I've seen a few algorithm
 proposals
 from the last few years that assume that an algorithm can be scaled
 linearly
 in the number of CPU cores, treating a multicore CPU as some kind of
 SIMD
 engine with all cores operating in lock-step, or at least engaging in
 some
 kind of rendezvous every couple of cycles (for example the
 recently-discussed
 MD6 uses a round of 16 steps, if I read the description correctly)

 My impressions from Ron's talk were different.  For multicore systems,
 the tree structure of the hash allows parallelism at a much higher
 granularity.  For hardware implementation, the feedback-register
 structure of the round function means that 16 steps can be computed in
 parallel.  I didn't get the sense that Ron intends for the second kind
 of parallelism to be used in software implementations.

 Hovav.

From the MD6 powerpoint, it does look good for parallelism.  When using
SSE5 (to get back on topic :-), you should be able to do 2 blocks in the
one instruction stream.  I can't remember enough of the other SSE
instructions to know if the relevant 64bit shifts are present before SSE5.

The only place where I've used multiple CPUs in crypto so far has been
in RSA's CRT, where, due to the magic of
OpenMP support, and a little bit of state splitting, I get the following
throughput numbers for dual core 2.5ghz, athlon64
doing 1024-2 RSA private key operations (number per second)

For normal single threaded, 4650 per cpu second and wall clock second.
OpenMP, 4330 per CPU second, 7360 wall clock second.

So in this case, the OpenMP overhead is about 8% CPU.  MD6 has smaller
chunks, and lots of them, so it will probably scale quite well.

OpenMP, it makes it very easy to put in parallelism.  In this CRT
implementation, it was a simple
#pragma omp parallel for
for (i=0; i2; i++)
 /* CRT code */
A few changes were made to make sure the structures were not shared, but
nothing that affects performance.
OpenMP is now in gcc 4.2 which is nice.

MD6, should be just as stupidly easy,

#pragma omp parallel for
for (block_num=0; block_num(data_len/512); block_num++) {
   MD6_block((ret_st[block_num]), input + block_num*512, block_num,
level, not_root, )
}

Repeat up the levels (depending of memory availability).

eric

-
The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]


Re: [cryptography] 5x speedup for AES using SSE5?

2008-08-24 Thread Eric Young
Paul Crowley wrote:
 http://www.ddj.com/hpc-high-performance-computing/201803067

 In the above Dr Dobb's article from a little over a year ago, AMD
 Senior Fellow Leendert vanDoorn states the Advanced Encryption
 Standard (AES) algorithm gets a factor of 5 performance improvement by
 using the new SSE5 extension.  However, glancing through the SSE5
 specification, I can't see at all how such a dramatic speedup might be
 achieved.  Does anyone know any more, or can anyone see more than I
 can in the spec?

 http://developer.amd.com/cpu/SSE5/Pages/default.aspx

I've only just seen this, but I've been playing with the VIA's AES and
looking at Intels AES instructions.

I believe the PPERM instruction will be rather important.  Combined with
the packed byte rotate and shift some rather
interesting SIMD byte fiddles should be possible.

From my initial look, it should be possible to implement AES without
tables, doing SIMD operations on all 16 bytes at once.
I've not looked at it enough yet, but currently I'm doing an AES round
in about 140 cycles a block (call it 13 per round plus overhead) on a
AMD64, (220e6 bytes/sec on a 2ghz cpu) using normal instructions.  I
don't believe they will be taking 30 instructions , so they probably
have 4-8 SSE instructions per round, it then comes down to how many SSE
execution units there are to execute in parallel.

As for VIA, on a 1ghz C7 part, cbc mode, 128bit key, for 16byte aligned,
I'm getting about 24 cycles per block, for unaligned, about 67 cycles. 
The chip does ECB mode at 12.6 cycles a block if aligned (2 at a time). 
It does not handle unaligned ECB, so with manual alignment, 75 cycles. 
Not bad for a single issue cpu considering the x86 instruction version
of AES I have
takes 1010 cycles per block.

For the intel AES instructions, from my readings, it will be able to do
a single AES (128bit) in a bit more that 60 cycles
(10 rounds, 6 cycle latency for the instructions).  The good part is
that they will pipeline.  So if you say do 6
AES ecb blocks at once, you can get a throughput of about 12 cycles a
block (intel's figures).  This is obviously of relevance for counter
mode, cbc decrypt and more recent standards like xts and gcm mode.

Part of the intel justification for the AES instruction seems to stop
cache timing attacks.  If the SSE5 instructions allow AES
to be done with SIMD instead of tables, they will achieve the same
affect, but without as much parallel upside.

It also looks like the  GF(2^8) maths will also benefit.


eric (who has only been able to play with via hardware :-(

-
The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]


Re:5x speedup for AES using SSE5?

2008-08-24 Thread Eric Young
Eric Young wrote:
 I've not looked at it enough yet, but currently I'm doing an AES round
 in about 140 cycles a block (call it 13 per round plus overhead) on a
 AMD64, (220e6 bytes/sec on a 2ghz cpu) using normal instructions. 
Urk, correction, I forgot I've recently upgraded from a 2ghz machine to
2.5ghz.
So that should read about 182 cycles per block, and 18 cycles per round.
I though the number seems strange :-(.  I tent to always quote numbers
from a 2-3 second run encrypting a 4k buffer, not a machine cycle
counter over one or two blocks, so I leave myself open to this kind of
error :-(

Still, looking further at the various SSE5 instructions, I'm having
difficultly seeing how
to avoid instruction dependencies when using the SIMD instructions
(specifically using PPERM to implement the sbox).

eric

-
The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]


Re: The perils of security tools

2008-05-24 Thread Eric Young

   #ifndef PURIFY
 MD_Update(m,buf,j); /* purify complains */
   #endif

   

I just re-checked, this code was from SSLeay, so it pre-dates OpenSSL
taking over from me
(about 10 years ago, after I was assimilated by RSA Security).

So in some ways I'm the one at fault for not being clear enough about
why 'purify complains' and why it was not relevant.
Purify also incorrectly companied about a construct used in the digest
gathering code which functioned correctly, but purify was
also correct (a byte in a read word was uninitialised, but it was later
overwritten by a shifted byte).

One of the more insidious things about Purify is that once its
complaints are investigated, and deemed irrelevant (but left in the
library),
anyone who subsequently runs purify on an application linking in the
library will get the same purify warning.
This leads to rather distressed application developers.  Especially if
their company has a policy of 'no purify warnings'.

One needs to really ship the 'warning ignore' file for purify (does
valgrind have one?).

I personally do wonder why, if the original author had purify related
comments, which means he was aware of the issues,
but had still left the code in place, the reviewer would not consider
that the code did some-thing important enough to
ignore purify's complaints.

eric

-
The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]


Re: [cryptography] Re: Why the exponent 3 error happened:

2006-09-17 Thread Eric Young

James A. Donald wrote:

--
James A. Donald wrote:
 Code is going wrong because ASN.1 can contain
 complicated malicious information to cause code to go
 wrong.  If we do not have that information, or simply
 ignore it, no problem.

Ben Laurie wrote:
 This is incorrect. The simple form of the attack is
 exactly as described above - implementations ignore
 extraneous data after the hash. This extraneous data
 is _not_ part of the ASN.1 data.

But it is only extraneous because ASN.1 *says* it is
extraneous.

If you ignore the ASN.1 stuff, treat it as just
arbitrary padding, you will not get this problem.  You
will look at the rightmost part of the data, the low
order part of the data, for the hash, and lo, the hash
will be wrong!
This is a question I would not mind having answered; while the exponent 
3 attack works when there are low bits to 'modify', there has been talk 
of an attack where the ASN.1 is correctly right justified (hash is the 
least significant bytes), but incorrect ASN.1 encoding is used to add 
'arbitrary' bytes before the hash.  So in this case some of the most 
significant bytes are fixed, the least significant bytes are fixed, but 
some in the middle can be modified.  Does the exponent 3 attack work in 
this case?  My personal feel is that his would be much harder, but is 
such an attack infeasible?


This issue about ASN.1 parameters being an evil concept goes away if the 
attack can only work when the least significant bytes need to be modifiable.


eric

-
The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]


Re: Exponent 3 damage spreads...

2006-09-12 Thread Eric Young

Jostein Tveit wrote:

Anyone got a test key with a real and a forged signature to test
other implementations than OpenSSL?
Well, since this in not really an issue about forging signatures, rather 
invalid verification,
I've appended 2 self-signed certs (resigned apps/server.pem), one with a 
valid signature,
and one with a signature block with an extra byte appended after the 
ASN.1 (but before signing).

For openssl 0.9.8a
[EMAIL PROTECTED] ~/workopenssl verify -CAfile cert-ok.pem cert-ok.pem
cert-ok.pem: OK
[EMAIL PROTECTED] ~/workopenssl verify -CAfile cert-bad.pem cert-bad.pem
cert-bad.pem: OK

For openssl 0.9.8c
[EMAIL PROTECTED] ~/workopenssl-0.9.8c/apps/openssl verify -CAfile cert-ok.pem 
cert-ok.pem

cert-ok.pem: OK
[EMAIL PROTECTED] ~/workopenssl-0.9.8c/apps/openssl verify -CAfile cert-bad.pem 
cert-bad.pem
cert-bad.pem: /C=AU/ST=Queensland/O=CryptSoft Pty Ltd/CN=Server test 
cert (512 bit)

error 7 at 0 depth lookup:certificate signature failure
28900:error:04077068:rsa routines:RSA_verify:bad signature:rsa_sign.c:192:
28900:error:0D0C5006:asn1 encoding routines:ASN1_item_verify:EVP 
lib:a_verify.c:168:


so this appears to trigger the relevant condition.
For my own recent pkcs#1 implementations, they do not ASN.1 decode the 
signature block, rather then generate a signature block and do a memcmp 
with the output from the RSA decrypt.  I did this since it is easy to 
generate small amounts of ASN.1 relative to parsing and checking all the 
boundary cases.  In this case this 'simpler' approach seems to have paid 
off :-).


eric

[EMAIL PROTECTED] ~/workcat cert-ok.pem
-BEGIN CERTIFICATE-
MIIBsDCCAVoCAQYwDQYJKoZIhvcNAQEFBQAwYzELMAkGA1UEBhMCQVUxEzARBgNV
BAgTClF1ZWVuc2xhbmQxGjAYBgNVBAoTEUNyeXB0U29mdCBQdHkgTHRkMSMwIQYD
VQQDExpTZXJ2ZXIgdGVzdCBjZXJ0ICg1MTIgYml0KTAeFw0wNjA5MTEyMzU5MDJa
Fw0wNjEwMTEyMzU5MDJaMGMxCzAJBgNVBAYTAkFVMRMwEQYDVQQIEwpRdWVlbnNs
YW5kMRowGAYDVQQKExFDcnlwdFNvZnQgUHR5IEx0ZDEjMCEGA1UEAxMaU2VydmVy
IHRlc3QgY2VydCAoNTEyIGJpdCkwXDANBgkqhkiG9w0BAQEFAANLADBIAkEAn7PD
hCeV/xIxUg8V70YRxK2A5jZbD92A12GN4PxyRQk0/lVmRUNMaJdq/qigpd9feP/u
12S4PwTLb/8q/v657QIDAQABMA0GCSqGSIb3DQEBBQUAA0EAc+fnj0rB2CYautG2
4itiMOU4SN6JFTFDCTU/Gb5aR/Fiu7HJkuE5yGEnTdnwcId/T9sTW251yzCc1e2z
rHX/kw==
-END CERTIFICATE-
[EMAIL PROTECTED] ~/workcat cert-bad.pem
-BEGIN CERTIFICATE-
MIIBsDCCAVoCAQYwDQYJKoZIhvcNAQEFBQAwYzELMAkGA1UEBhMCQVUxEzARBgNV
BAgTClF1ZWVuc2xhbmQxGjAYBgNVBAoTEUNyeXB0U29mdCBQdHkgTHRkMSMwIQYD
VQQDExpTZXJ2ZXIgdGVzdCBjZXJ0ICg1MTIgYml0KTAeFw0wNjA5MTEyMzU4NTVa
Fw0wNjEwMTEyMzU4NTVaMGMxCzAJBgNVBAYTAkFVMRMwEQYDVQQIEwpRdWVlbnNs
YW5kMRowGAYDVQQKExFDcnlwdFNvZnQgUHR5IEx0ZDEjMCEGA1UEAxMaU2VydmVy
IHRlc3QgY2VydCAoNTEyIGJpdCkwXDANBgkqhkiG9w0BAQEFAANLADBIAkEAn7PD
hCeV/xIxUg8V70YRxK2A5jZbD92A12GN4PxyRQk0/lVmRUNMaJdq/qigpd9feP/u
12S4PwTLb/8q/v657QIDAQABMA0GCSqGSIb3DQEBBQUAA0EAbynCRIlUQgaqyNgU
DF6P14yRKUtX8akOP2TwStaSiVf/akYqfLFm3UGka5XbPj4rifrZ0/sOoZEEBvHQ
e20sRA==
-END CERTIFICATE-


-
The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]


Re: AES Modes

2004-10-19 Thread Eric Young
Quoting Brian Gladman [EMAIL PROTECTED]:

 Ian Grigg wrote:
 
  Jack Lloyd also passed along lots of good comments I'd
  like to forward (having gained permission) FTR.  I've
  edited them for brevity and pertinence.
 
 [snip]
   I'm obviously being naive here ... I had thought that the combined 
  mode would
be faster, as it would run through the data once only, and that AES 
  seems to
clip along faster than SHA1.
  
  AFAIK all of the modes that use only one block cipher invocation per 
  block of
  input are patented. EAX+CCM both use two AES operations per block, and
  byte-for-byte SHA-1 is 2-5x faster than AES (at least in the 
  implementations
  I've seen/used/written), so using AES+HMAC is probably going to be 
  faster than
  AES/EAX or AES/CCM. The obvious exception being boxes with hardware AES 
  chips
  and slow CPUs (eg, an ARM7 with an AES coprocessor), where AES will of 
  course
  be much faster than SHA-1.
 
 Maybe my C implementation of SHA1 is hopeless but I get SHA1 on an x86 
 at about 17 cycles per byte (over 100,000 bytes) and AES in C at 21 
 cycles per byte.
 
 So I would put these two algorihms at about the same speed in C. In 
 consequence I rather suspect that the 'two encryptions per block' cost 
 might also apply to combined modes when AES is used with HMAC-SHA1.

Are you running on a P4?  ASM for sha1 on a P4 takes about 11.9 cycles 
per byte.  The P4 is a very touchy x86 implementation.
On most other architectures I nearly always see a bit less than 2 times faster
sha1 vs AES.  On AMD64, asm, I have
AES-cbc at 12.2 cycles per byte and sha1 at 6.8.  This is about
as good a CPU as it gets (IPC near 3 for both implementations).

eric


-
The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]