Re: [cryptography] 5x speedup for AES using SSE5?

2008-08-26 Thread Eric Young
Hovav Shacham wrote:
 On Aug 24, 2008, at 5:20 AM, Peter Gutmann wrote:

 Speaking of CPU-specific optimisations, I've seen a few algorithm
 proposals
 from the last few years that assume that an algorithm can be scaled
 linearly
 in the number of CPU cores, treating a multicore CPU as some kind of
 SIMD
 engine with all cores operating in lock-step, or at least engaging in
 some
 kind of rendezvous every couple of cycles (for example the
 recently-discussed
 MD6 uses a round of 16 steps, if I read the description correctly)

 My impressions from Ron's talk were different.  For multicore systems,
 the tree structure of the hash allows parallelism at a much higher
 granularity.  For hardware implementation, the feedback-register
 structure of the round function means that 16 steps can be computed in
 parallel.  I didn't get the sense that Ron intends for the second kind
 of parallelism to be used in software implementations.

 Hovav.

From the MD6 powerpoint, it does look good for parallelism.  When using
SSE5 (to get back on topic :-), you should be able to do 2 blocks in the
one instruction stream.  I can't remember enough of the other SSE
instructions to know if the relevant 64bit shifts are present before SSE5.

The only place where I've used multiple CPUs in crypto so far has been
in RSA's CRT, where, due to the magic of
OpenMP support, and a little bit of state splitting, I get the following
throughput numbers for dual core 2.5ghz, athlon64
doing 1024-2 RSA private key operations (number per second)

For normal single threaded, 4650 per cpu second and wall clock second.
OpenMP, 4330 per CPU second, 7360 wall clock second.

So in this case, the OpenMP overhead is about 8% CPU.  MD6 has smaller
chunks, and lots of them, so it will probably scale quite well.

OpenMP, it makes it very easy to put in parallelism.  In this CRT
implementation, it was a simple
#pragma omp parallel for
for (i=0; i2; i++)
 /* CRT code */
A few changes were made to make sure the structures were not shared, but
nothing that affects performance.
OpenMP is now in gcc 4.2 which is nice.

MD6, should be just as stupidly easy,

#pragma omp parallel for
for (block_num=0; block_num(data_len/512); block_num++) {
   MD6_block((ret_st[block_num]), input + block_num*512, block_num,
level, not_root, )
}

Repeat up the levels (depending of memory availability).

eric

-
The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]


Re: 5x speedup for AES using SSE5?

2008-08-26 Thread Ilya Levin
Brian Gladman wrote:
 But a fully byte oriented implementation runs at about 140 cycles/byte
 and here the S-Box substitution step is a significant bottleneck.
 ...
 It is also possible that the PPERM instruction could be used to speed up
 the Galois field calculations to produce the S-Box mathematically rather
 than by table lookup. I have tried this in the past but it has not
 proved competitive.  But PPERM looks interesting here as well.

This is where the following may be handy:
http://www.literatecode.com/2007/11/11/aes256/

It is a byte-oriented AES-256 implementation without S-box tables.
Although I doubt it can be speeded up that much.

Regards,
Ilya
-- 
http://www.literatecode.com

-
The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]


Re: 5x speedup for AES using SSE5?

2008-08-25 Thread Brian Gladman
Eric Young wrote:
 Eric Young wrote:
 I've not looked at it enough yet, but currently I'm doing an AES round
 in about 140 cycles a block (call it 13 per round plus overhead) on a
 AMD64, (220e6 bytes/sec on a 2ghz cpu) using normal instructions. 
 Urk, correction, I forgot I've recently upgraded from a 2ghz machine to
 2.5ghz.
 So that should read about 182 cycles per block, and 18 cycles per round.
 I though the number seems strange :-(.  I tent to always quote numbers
 from a 2-3 second run encrypting a 4k buffer, not a machine cycle
 counter over one or two blocks, so I leave myself open to this kind of
 error :-(

The best figure I obtain on an AMD64 system is 11 cycles/byte, which
matches your results (you had me worried for a while with 9 cycles/byte!)

To go 5 times faster than this would mean close to 2 cycles/byte, a
speed that I find hard to believe without hardware acceleration

But a fully byte oriented implementation runs at about 140 cycles/byte
and here the S-Box substitution step is a significant bottleneck.  I too
think the PPERM instruction could be used for this and it seems possible
that this would produce large savings.  So 30 cycles/byte might well be
achievable in this case.

I hence wonder whether this is the comparison that AMD are making.

It is also possible that the PPERM instruction could be used to speed up
the Galois field calculations to produce the S-Box mathematically rather
than by table lookup. I have tried this in the past but it has not
proved competitive.  But PPERM looks interesting here as well.

   Brian Gladman

-
The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]


multicore hash functions (was: 5x speedup for AES using SSE5?)

2008-08-25 Thread zooko

Hello Peter Gutmann.

I'm working on a contribution to the SHA-3 process, and I've been  
using exactly the sort of abstraction that you describe -- counting  
one computation of a hash compression function as a unit of work  
which could be computed concurrently by some sort of parallel computer.


I vaguely think that once I get this level of analysis done, I should  
add some terms to show how the velocity of data into the computer and  
from core to core is not infinite.


I certainly think that I should code up some actual implementations  
and benchmark them.  However, I don't have a machine available with  
lots of cores -- I'm considering requesting of Sun.com that they lend  
me a T2.  (Despite my earlier declaration to Sun that I had lost  
interest in their stupid architecture since they wouldn't release the  
source to the crypto module.)


Anyway, if you have a better way to think about parallelism of hash  
functions, I'm all ears.


Thanks,

Zooko
---
http://allmydata.org -- Tahoe, the Least-Authority Filesystem
http://allmydata.com -- back up all your files for $5/month

-
The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]


Re: [cryptography] 5x speedup for AES using SSE5?

2008-08-25 Thread Kevin Brock

Peter Gutmann wrote:

Is there some feature of multicore CPUs that I'm missing, or is it a case of
cryptographers abstracting a bit too much away?  And if it's the latter,
should someone tell them that multicore CPUs don't actually work that way?
  
I can't speak to the former issue, but I seem to remember that the 
numbers Rivest showed at the talk were actual performance measurements, 
not projections, and that it was basically linear.


Kevin

-
The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]


Re: [cryptography] 5x speedup for AES using SSE5?

2008-08-25 Thread Hovav Shacham

On Aug 24, 2008, at 5:20 AM, Peter Gutmann wrote:

Speaking of CPU-specific optimisations, I've seen a few algorithm  
proposals
from the last few years that assume that an algorithm can be scaled  
linearly
in the number of CPU cores, treating a multicore CPU as some kind  
of SIMD
engine with all cores operating in lock-step, or at least engaging  
in some
kind of rendezvous every couple of cycles (for example the recently- 
discussed

MD6 uses a round of 16 steps, if I read the description correctly)


My impressions from Ron's talk were different.  For multicore  
systems, the tree structure of the hash allows parallelism at a much  
higher granularity.  For hardware implementation, the feedback- 
register structure of the round function means that 16 steps can be  
computed in parallel.  I didn't get the sense that Ron intends for  
the second kind of parallelism to be used in software implementations.


Hovav.

-
The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]


Re: [cryptography] 5x speedup for AES using SSE5?

2008-08-24 Thread Eric Young
Paul Crowley wrote:
 http://www.ddj.com/hpc-high-performance-computing/201803067

 In the above Dr Dobb's article from a little over a year ago, AMD
 Senior Fellow Leendert vanDoorn states the Advanced Encryption
 Standard (AES) algorithm gets a factor of 5 performance improvement by
 using the new SSE5 extension.  However, glancing through the SSE5
 specification, I can't see at all how such a dramatic speedup might be
 achieved.  Does anyone know any more, or can anyone see more than I
 can in the spec?

 http://developer.amd.com/cpu/SSE5/Pages/default.aspx

I've only just seen this, but I've been playing with the VIA's AES and
looking at Intels AES instructions.

I believe the PPERM instruction will be rather important.  Combined with
the packed byte rotate and shift some rather
interesting SIMD byte fiddles should be possible.

From my initial look, it should be possible to implement AES without
tables, doing SIMD operations on all 16 bytes at once.
I've not looked at it enough yet, but currently I'm doing an AES round
in about 140 cycles a block (call it 13 per round plus overhead) on a
AMD64, (220e6 bytes/sec on a 2ghz cpu) using normal instructions.  I
don't believe they will be taking 30 instructions , so they probably
have 4-8 SSE instructions per round, it then comes down to how many SSE
execution units there are to execute in parallel.

As for VIA, on a 1ghz C7 part, cbc mode, 128bit key, for 16byte aligned,
I'm getting about 24 cycles per block, for unaligned, about 67 cycles. 
The chip does ECB mode at 12.6 cycles a block if aligned (2 at a time). 
It does not handle unaligned ECB, so with manual alignment, 75 cycles. 
Not bad for a single issue cpu considering the x86 instruction version
of AES I have
takes 1010 cycles per block.

For the intel AES instructions, from my readings, it will be able to do
a single AES (128bit) in a bit more that 60 cycles
(10 rounds, 6 cycle latency for the instructions).  The good part is
that they will pipeline.  So if you say do 6
AES ecb blocks at once, you can get a throughput of about 12 cycles a
block (intel's figures).  This is obviously of relevance for counter
mode, cbc decrypt and more recent standards like xts and gcm mode.

Part of the intel justification for the AES instruction seems to stop
cache timing attacks.  If the SSE5 instructions allow AES
to be done with SIMD instead of tables, they will achieve the same
affect, but without as much parallel upside.

It also looks like the  GF(2^8) maths will also benefit.


eric (who has only been able to play with via hardware :-(

-
The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]


Re: [cryptography] 5x speedup for AES using SSE5?

2008-08-24 Thread Peter Gutmann
Speaking of CPU-specific optimisations, I've seen a few algorithm proposals
from the last few years that assume that an algorithm can be scaled linearly
in the number of CPU cores, treating a multicore CPU as some kind of SIMD
engine with all cores operating in lock-step, or at least engaging in some
kind of rendezvous every couple of cycles (for example the recently-discussed
MD6 uses a round of 16 steps, if I read the description correctly) to exchange
data.  This abstraction seems to be particularly convenient when dealing with
things like hash trees.  However I'm not aware of any multicore CPU that
actually works this way, you'd need to have exclusive use of each core by one
thread and use incredibly expensive (compared to the other primitive CPU
operations used in hashing) barriers or something similar to ensure
synchronisation.

Is there some feature of multicore CPUs that I'm missing, or is it a case of
cryptographers abstracting a bit too much away?  And if it's the latter,
should someone tell them that multicore CPUs don't actually work that way?

Peter.

-
The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]


Re:5x speedup for AES using SSE5?

2008-08-24 Thread Eric Young
Eric Young wrote:
 I've not looked at it enough yet, but currently I'm doing an AES round
 in about 140 cycles a block (call it 13 per round plus overhead) on a
 AMD64, (220e6 bytes/sec on a 2ghz cpu) using normal instructions. 
Urk, correction, I forgot I've recently upgraded from a 2ghz machine to
2.5ghz.
So that should read about 182 cycles per block, and 18 cycles per round.
I though the number seems strange :-(.  I tent to always quote numbers
from a 2-3 second run encrypting a 4k buffer, not a machine cycle
counter over one or two blocks, so I leave myself open to this kind of
error :-(

Still, looking further at the various SSE5 instructions, I'm having
difficultly seeing how
to avoid instruction dependencies when using the SIMD instructions
(specifically using PPERM to implement the sbox).

eric

-
The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]


5x speedup for AES using SSE5?

2008-08-23 Thread Paul Crowley

http://www.ddj.com/hpc-high-performance-computing/201803067

In the above Dr Dobb's article from a little over a year ago, AMD Senior 
Fellow Leendert vanDoorn states the Advanced Encryption Standard (AES) 
algorithm gets a factor of 5 performance improvement by using the new 
SSE5 extension.  However, glancing through the SSE5 specification, I 
can't see at all how such a dramatic speedup might be achieved.  Does 
anyone know any more, or can anyone see more than I can in the spec?


http://developer.amd.com/cpu/SSE5/Pages/default.aspx
--
  __
\/ o\ Paul Crowley
/\__/ www.ciphergoth.org

-
The Cryptography Mailing List
Unsubscribe by sending unsubscribe cryptography to [EMAIL PROTECTED]