Re: MONTMUL performance: t4 engine vs inlined t4

2013-06-30 Thread Andy Polyakov

Hi,


Thank you so much for looking into the issue with Ferenc!

I'll incorporate the change into Solaris to verify the 20-30% 
performance improvement.

The conservative approach sounds like the best approach at this point.

Once the performance improvement is verified, can you commit the change 
to 1.0.2?


The changes were committed to 1.0.2 the same day. I've also addressed the
alloca problem.

__
OpenSSL Project http://www.openssl.org
Development Mailing List   openssl-dev@openssl.org
Automated List Manager   majord...@openssl.org


Re: MONTMUL performance: t4 engine vs inlined t4

2013-06-21 Thread Misaki.Miyashita

Hi Andy,

Thank you so much for looking into the issue with Ferenc!

I'll incorporate the change into Solaris to verify the 20-30% 
performance improvement.

The conservative approach sounds like the best approach at this point.

Once the performance improvement is verified, can you commit the change 
to 1.0.2?


Thanks again.

-- misaki

On 06/18/13 04:10, Andy Polyakov wrote:

Misaki,


The measurement I sent yesterday for OpenSSL (with inlined T4
instruction support) was not quite accurate.
Some of the T4 specific code you committed was not enabled when we
tested, and I realized that __sparc__ was not defined on our system.
Thus, I changed #if defined(__sparc__) to #if defined(__sparc).
Now, we are seeing better numbers with OpenSSL.

                  sign      verify    sign/s  verify/s
rsa 1024 bits 0.000351s 0.000024s    2852.9   42311.0
rsa 2048 bits 0.001258s 0.000047s     795.1   21128.6
rsa 4096 bits 0.006240s 0.000395s     160.3    2533.3

Which is virtually identical to Linux results. So one mystery solved.
I'll commit the fix at some later point.


which is still slower than our t4 engine for 1k and 2k bit RSA sign:
                  sign      verify    sign/s  verify/s
rsa 1024 bits 0.000237s 0.000028s    4221.9   36119.8
rsa 2048 bits 0.000876s 0.000075s    1141.7   13285.6
rsa 4096 bits 0.006341s 0.002139s     157.7     467.5

As mentioned the problem seems to be multi-layer and we are moving in
right direction.

http://git.openssl.org/gitweb/?p=openssl.git;a=commitdiff;h=4ddacd9921f48013b5cd011e4b93b294c14db1c2
improves RSA sign performance by 20-30%:

rsa 1024 bits 0.000256s 0.000016s    3904.4   61411.9
rsa 2048 bits 0.000946s 0.000029s    1056.8   34292.7
rsa 4096 bits 0.005061s 0.000340s     197.6    2940.5

This is still slower than your code, but the conclusion we have to draw is
that the difference is intentional, in the sense that the discrepancy is
accounted for by the fact that OpenSSL implements counter-measures against
cache-timing attacks and takes a rather conservative approach. It remains
to be seen whether a platform-specific and faster counter-measure will be
suggested at a later point. Meanwhile, please double-check on Solaris.
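[Editorial aside: the counter-measure in question is, at its core, a cache-timing-neutral gather — rather than loading powers[idx] directly, the code touches every table entry and keeps the wanted one with bitwise masks. A minimal C sketch of that idea follows; it is illustrative only, OpenSSL's actual code does this on FP registers in SPARC assembly.]

```c
#include <stdint.h>
#include <stddef.h>

/* Cache-timing-neutral table lookup: every entry is read regardless
 * of idx, so the cache access pattern does not depend on the secret
 * index.  mask is all-ones only for the wanted entry.  Sketch of
 * the technique, not OpenSSL's implementation. */
uint64_t ct_gather(const uint64_t *table, size_t nelems, size_t idx)
{
    uint64_t acc = 0;
    size_t i;

    for (i = 0; i < nelems; i++) {
        uint64_t mask = (uint64_t)0 - (uint64_t)(i == idx);
        acc |= table[i] & mask;
    }
    return acc;
}
```

This is also why the hardened path is inherently slower than a direct indexed load: the whole powers table is swept once per window.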


Re: MONTMUL performance: t4 engine vs inlined t4

2013-06-18 Thread Andy Polyakov
Misaki,

 The measurement I sent yesterday for OpenSSL (with inlined T4
 instruction support) was not quite accurate.
 Some of the T4 specific code you committed was not enabled when we
 tested, and I realized that __sparc__ was not defined on our system.
 Thus, I changed #if defined(__sparc__) to #if defined(__sparc).
 Now, we are seeing better numbers with OpenSSL.

                   sign      verify    sign/s  verify/s
 rsa 1024 bits 0.000351s 0.000024s    2852.9   42311.0
 rsa 2048 bits 0.001258s 0.000047s     795.1   21128.6
 rsa 4096 bits 0.006240s 0.000395s     160.3    2533.3
 
 Which is virtually identical to Linux results. So one mystery solved.
 I'll commit the fix at some later point.
 
 which is still slower than our t4 engine for 1k and 2k bit RSA sign:
                   sign      verify    sign/s  verify/s
 rsa 1024 bits 0.000237s 0.000028s    4221.9   36119.8
 rsa 2048 bits 0.000876s 0.000075s    1141.7   13285.6
 rsa 4096 bits 0.006341s 0.002139s     157.7     467.5
 
 As mentioned the problem seems to be multi-layer and we are moving in
 right direction.

http://git.openssl.org/gitweb/?p=openssl.git;a=commitdiff;h=4ddacd9921f48013b5cd011e4b93b294c14db1c2
improves RSA sign performance by 20-30%:

rsa 1024 bits 0.000256s 0.000016s    3904.4   61411.9
rsa 2048 bits 0.000946s 0.000029s    1056.8   34292.7
rsa 4096 bits 0.005061s 0.000340s     197.6    2940.5

This is still slower than your code, but the conclusion we have to draw is
that the difference is intentional, in the sense that the discrepancy is
accounted for by the fact that OpenSSL implements counter-measures against
cache-timing attacks and takes a rather conservative approach. It remains
to be seen whether a platform-specific and faster counter-measure will be
suggested at a later point. Meanwhile, please double-check on Solaris.


Re: MONTMUL performance: t4 engine vs inlined t4

2013-06-01 Thread Andy Polyakov

Another question is about the suitability of the floating-point fcmps and
fmovd instructions. These are used to pick a vector from the powers table
in a cache-timing-neutral manner. I have to admit I haven't done due
research on whether or not they are the optimal choice in this context,
and/or whether or not we are better off using the fand and for
instructions for this purpose. As the instructions in question are
floating-point, they might be executed by the *shared* FPU and not by the
individual core [which might be disruptive for the pipeline?]...


fcmps has an 11-cycle latency and executes in the external FPU.

Likewise for floating-point conditional moves of floating-point registers.

Floating-point conditional moves of integer registers are the worst: the
instruction is split into two micro-ops and it breaks the instruction
decode group.

Plain fmovd you should never use; it goes into the external FPU because
it affects the condition codes in the %fsr.  Use fsrc2 instead, which has
a 1-cycle latency and executes in the front end of the CPU.


I wonder about an integer conditional move on an integer condition. It
should be noted that sheer latency is of lesser concern, as long as the
processor can efficiently handle several of them in the pipeline. The
Condition Codes Register would remain constant throughout the execution
of the conditional-move segment.



Re: MONTMUL performance: t4 engine vs inlined t4

2013-06-01 Thread David Miller
From: Andy Polyakov ap...@openssl.org
Date: Sat, 01 Jun 2013 09:38:18 +0200

 I wonder about an integer conditional move on an integer condition. It
 should be noted that sheer latency is of lesser concern, as long as the
 processor can efficiently handle several of them in the pipeline. The
 Condition Codes Register would remain constant throughout the execution
 of the conditional-move segment.

movcc is fine, has a 1 cycle latency, and pipelines quite well.

movr breaks the decode group, but still has a 1 cycle latency.



Re: MONTMUL performance: t4 engine vs inlined t4

2013-06-01 Thread David Miller

I forgot to mention, the out of order execution unit renames
the condition codes just like any other register.


Re: MONTMUL performance: t4 engine vs inlined t4

2013-05-31 Thread Andy Polyakov
For public reference. To a certain degree it's apparent from the context,
but the report is about the RSA sign performance difference between the
OpenSSL SPARC T4 Montgomery multiplication module and the corresponding
Solaris T4 module, with OpenSSL being significantly slower. The least one
can say [at this point] is that the problem appears to be multi-layered,
in the sense that there are different factors in play. The first question
in line is how the same code can perform that differently on Solaris and
Linux. OpenSSL on Linux delivers ~70% more RSA1024 signs than on Solaris
(if we assume that both systems operate at the same frequency, which is
supported by the fact that the verify results were virtually identical).

Misaki,

 I used 64-bit openssl binary to measure the performance.

With the above in mind, here is something to test. In
crypto/bn/asm/sparct4-mont.pl there is a register-window warm-up
sequence that is executed in 32-bit application context only
(benchmarking on Linux had shown that it's not necessary in 64-bit
application context). Could you test engaging it even in 64-bit
application context? I.e. open crypto/bn/asm/sparct4-mont.pl in a text
editor, locate the "warm it up" comment and replace #ifndef __arch64__
in the preceding line with #if 1.

 Let me talk to our performance engineer to see if we can collect some
 performance profile on the sign operations.

One should probably note that openssl.org has a quite low maximum e-mail
message size limit. In other words, if a message is big enough, it will
bounce. This naturally applies even to ap...@openssl.org, in case you
reckon that the results are not of interest to the general public [or
choose not to share them for another reason]. So if it bounces from
openssl.org, drop me a note, and I'll provide an alternative address for
delivery.


Re: MONTMUL performance: t4 engine vs inlined t4

2013-05-31 Thread Andy Polyakov
 For public reference. To a certain degree it's apparent from the context,
 but the report is about the RSA sign performance difference between the
 OpenSSL SPARC T4 Montgomery multiplication module and the corresponding
 Solaris T4 module, with OpenSSL being significantly slower. The least one
 can say [at this point] is that the problem appears to be multi-layered,
 in the sense that there are different factors in play. The first question
 in line is how the same code can perform that differently on Solaris and
 Linux. OpenSSL on Linux delivers ~70% more RSA1024 signs than on Solaris
 (if we assume that both systems operate at the same frequency, which is
 supported by the fact that the verify results were virtually identical).

Another question is about the suitability of the floating-point fcmps and
fmovd instructions. These are used to pick a vector from the powers table
in a cache-timing-neutral manner. I have to admit I haven't done due
research on whether or not they are the optimal choice in this context,
and/or whether or not we are better off using the fand and for
instructions for this purpose. As the instructions in question are
floating-point, they might be executed by the *shared* FPU and not by the
individual core [which might be disruptive for the pipeline?]...


Re: MONTMUL performance: t4 engine vs inlined t4

2013-05-31 Thread David Miller
From: Andy Polyakov ap...@openssl.org
Date: Fri, 31 May 2013 10:29:37 +0200

 Another question is about the suitability of the floating-point fcmps and
 fmovd instructions. These are used to pick a vector from the powers table
 in a cache-timing-neutral manner. I have to admit I haven't done due
 research on whether or not they are the optimal choice in this context,
 and/or whether or not we are better off using the fand and for
 instructions for this purpose. As the instructions in question are
 floating-point, they might be executed by the *shared* FPU and not by the
 individual core [which might be disruptive for the pipeline?]...

fcmps has an 11-cycle latency and executes in the external FPU.

Likewise for floating-point conditional moves of floating-point registers.

Floating-point conditional moves of integer registers are the worst: the
instruction is split into two micro-ops and it breaks the instruction
decode group.

Plain fmovd you should never use; it goes into the external FPU because
it affects the condition codes in the %fsr.  Use fsrc2 instead, which has
a 1-cycle latency and executes in the front end of the CPU.


Re: MONTMUL performance: t4 engine vs inlined t4

2013-05-31 Thread Andy Polyakov
 Another question is about the suitability of the floating-point fcmps and
 fmovd instructions. These are used to pick a vector from the powers table
 in a cache-timing-neutral manner. I have to admit I haven't done due
 research on whether or not they are the optimal choice in this context,
 and/or whether or not we are better off using the fand and for
 instructions for this purpose. As the instructions in question are
 floating-point, they might be executed by the *shared* FPU and not by the
 individual core [which might be disruptive for the pipeline?]...
 
 fcmps has an 11-cycle latency and executes in the external FPU.
 
 Likewise for floating-point conditional moves of floating-point registers.
 
 Floating-point conditional moves of integer registers are the worst: the
 instruction is split into two micro-ops and it breaks the instruction
 decode group.
 
 Plain fmovd you should never use; it goes into the external FPU because
 it affects the condition codes in the %fsr.  Use fsrc2 instead, which has
 a 1-cycle latency and executes in the front end of the CPU.

Thanks! Even though the question was inadequately formulated (it was not
about plain fmovd, but about *conditional* fmovd on a floating-point
condition, sorry), I get the picture. The conclusion seems to be that we
should bet on the logical operations, fand and for, which have a 3-cycle
latency and [more importantly?] are handled by private per-core
resources. Thanks again.
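[Editorial aside: the fand/for approach boils down to a masked select. In C terms, with 64-bit integers standing in for the double-precision registers, it looks like the sketch below — an illustration under that analogy, not actual SPARC code.]

```c
#include <stdint.h>

/* Masked select, the integer analogue of picking between two FP
 * registers with fand/for: mask must be all-ones (take a) or
 * all-zeros (take b), so the choice is pure bitwise logic with no
 * branch and no secret-dependent load.  Illustrative sketch. */
uint64_t masked_select(uint64_t mask, uint64_t a, uint64_t b)
{
    return (a & mask) | (b & ~mask);
}
```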



Re: MONTMUL performance: t4 engine vs inlined t4

2013-05-31 Thread Andy Polyakov
 ... here is something to test. In
 crypto/bn/asm/sparct4-mont.pl there is a register-window warm-up
 sequence that is executed in 32-bit application context only
 (benchmarking on Linux had shown that it's not necessary in 64-bit
 application context). Could you test engaging it even in 64-bit
 application context? I.e. open crypto/bn/asm/sparct4-mont.pl in a text
 editor, locate the "warm it up" comment and replace #ifndef __arch64__
 in the preceding line with #if 1.

Forgot to mention that there are *two* occurrences of the "warm it up"
sequence. Both need to be engaged for the test.



Re: MONTMUL performance: t4 engine vs inlined t4

2013-05-31 Thread Misaki.Miyashita

Hi Andy,

The measurement I sent yesterday for OpenSSL (with inlined T4
instruction support) was not quite accurate.
Some of the T4-specific code you committed was not enabled when we
tested, and I realized that __sparc__ was not defined on our system.

Thus, I changed #if defined(__sparc__) to #if defined(__sparc).
Now, we are seeing better numbers with OpenSSL.

                  sign      verify    sign/s  verify/s
rsa 1024 bits 0.000351s 0.000024s    2852.9   42311.0
rsa 2048 bits 0.001258s 0.000047s     795.1   21128.6
rsa 4096 bits 0.006240s 0.000395s     160.3    2533.3

which is still slower than our t4 engine for 1k and 2k bit RSA sign:
                  sign      verify    sign/s  verify/s
rsa 1024 bits 0.000237s 0.000028s    4221.9   36119.8
rsa 2048 bits 0.000876s 0.000075s    1141.7   13285.6
rsa 4096 bits 0.006341s 0.002139s     157.7     467.5


So, I enabled the warm-up as suggested by you, but the performance
numbers still look the same.


Here is the new bn_mul_mont_t4_8():

bn_mul_mont_t4_8()
bn_mul_mont_t4_8:   8a 10 20 00  clr   %g5
bn_mul_mont_t4_8+0x4:   88 10 3f 80  mov   -0x80, %g4
bn_mul_mont_t4_8+0x8:   8b 29 70 20  sllx  %g5, 0x20, %g5
bn_mul_mont_t4_8+0xc:   9d e3 80 04  save  %sp, %g4, %sp
bn_mul_mont_t4_8+0x10:  9d e3 bf 80  save  %sp, -0x80, %sp
bn_mul_mont_t4_8+0x14:  9d e3 bf 80  save  %sp, -0x80, %sp
bn_mul_mont_t4_8+0x18:  9d e3 bf 80  save  %sp, -0x80, %sp
bn_mul_mont_t4_8+0x1c:  9d e3 bf 80  save  %sp, -0x80, %sp
bn_mul_mont_t4_8+0x20:  9d e3 bf 80  save  %sp, -0x80, %sp
bn_mul_mont_t4_8+0x24:  9d e3 bf 80  save  %sp, -0x80, %sp
bn_mul_mont_t4_8+0x28:  81 e8 00 00  restore
bn_mul_mont_t4_8+0x2c:  81 e8 00 00  restore
bn_mul_mont_t4_8+0x30:  81 e8 00 00  restore
bn_mul_mont_t4_8+0x34:  81 e8 00 00  restore
bn_mul_mont_t4_8+0x38:  81 e8 00 00  restore
bn_mul_mont_t4_8+0x3c:  81 e8 00 00  restore
bn_mul_mont_t4_8+0x40:  88 0b a0 01  and   %sp, 0x1, %g4
bn_mul_mont_t4_8+0x44:  bc 11 40 1e  or%g5, %fp, %fp
bn_mul_mont_t4_8+0x48:  8a 11 00 05  or%g4, %g5, %g5


I realized that, in sparct4-mont.pl, I see some 64-bit sparcv9-specific
code, but my 64-bit library doesn't have those instructions.
It looks like the __arch64__ branch was taken.  Did you expect to have
the SPARCV9_64BIT_STACK section compiled in?


.globl  bn_mul_mont_t4_$NUM
.align  32
bn_mul_mont_t4_$NUM:
#ifdef  __arch64__
mov 0,$sentinel
mov -128,%g4
#elif defined(SPARCV9_64BIT_STACK)
SPARC_LOAD_ADDRESS_LEAF(OPENSSL_sparcv9cap_P,%g1,%g5)
ld  [%g1+0],%g1 ! OPENSSL_sparcv9cap_P[0]
mov -2047,%g4
and %g1,SPARCV9_64BIT_STACK,%g1
movrz   %g1,0,%g4
mov -1,$sentinel
add %g4,-128,%g4
#else
mov -1,$sentinel
mov -128,%g4
#endif
sllx$sentinel,32,$sentinel
save%sp,%g4,%sp
#if 1
save%sp,-128,%sp! warm it up
save%sp,-128,%sp
--- snip ---

Thank you,

-- misaki


I used 64-bit openssl binary to measure the performance.

With the above in mind, here is something to test. In
crypto/bn/asm/sparct4-mont.pl there is a register-window warm-up
sequence that is executed in 32-bit application context only
(benchmarking on Linux had shown that it's not necessary in 64-bit
application context). Could you test engaging it even in 64-bit
application context? I.e. open crypto/bn/asm/sparct4-mont.pl in a text
editor, locate the "warm it up" comment and replace #ifndef __arch64__
in the preceding line with #if 1.


Re: MONTMUL performance: t4 engine vs inlined t4

2013-05-31 Thread Andy Polyakov

Hi,

The measurement I sent yesterday for OpenSSL (with inlined T4 
instruction support) was not quite accurate.
Some of the T4 specific code you committed was not enabled when we 
tested, and I realized that __sparc__ was not defined on our system.

Thus, I changed #if defined(__sparc__) to #if defined(__sparc).
Now, we are seeing better numbers with OpenSSL.

                  sign      verify    sign/s  verify/s
rsa 1024 bits 0.000351s 0.000024s    2852.9   42311.0
rsa 2048 bits 0.001258s 0.000047s     795.1   21128.6
rsa 4096 bits 0.006240s 0.000395s     160.3    2533.3


Which is virtually identical to Linux results. So one mystery solved. 
I'll commit the fix at some later point.



which is still slower than our t4 engine for 1k and 2k bit RSA sign:
                  sign      verify    sign/s  verify/s
rsa 1024 bits 0.000237s 0.000028s    4221.9   36119.8
rsa 2048 bits 0.000876s 0.000075s    1141.7   13285.6
rsa 4096 bits 0.006341s 0.002139s     157.7     467.5


As mentioned the problem seems to be multi-layer and we are moving in 
right direction.


So, I enabled the warm-up as suggested by you, but the performance
numbers still look the same.


Well, the suggestion was of a what-if character, a product of slight
desperation :-) But it appears to be unnecessary, so we leave it as it is.


I realized that, in sparct4-mont.pl, I see some 64-bit sparcv9-specific
code, but my 64-bit library doesn't have those instructions.
It looks like the __arch64__ branch was taken.  Did you expect to have
the SPARCV9_64BIT_STACK section compiled in?


No. SPARCV9_64BIT_STACK is a Linux-specific thing. In the commentary
section in crypto/bn/asm/sparct4-mont.pl you see a paragraph that starts
with "32-bit code is prone to performance degradation". This is what
SPARCV9_64BIT_STACK is about.



Re: MONTMUL performance: t4 engine vs inlined t4

2013-05-30 Thread Misaki.Miyashita

Hi Andy,

On 05/30/13 15:08, Ferenc Rakoczi wrote:

Hi, Andy,

Andy Polyakov wrote:


First of all, RSA512 is essentially irrelevant and no attempt was
made to optimize it. So let's just disregard the RSA512 results (I have
even removed them from the above-quoted part). Secondly, note that our
RSA verify is faster.

I never thought verify could be a bottleneck anywhere. So we always
concentrated on sign.

Verify is dominated by a single-op subroutine and we've got it more
right. So we have only RSA sign to figure out. The first thing one
notices is the difference between your and our results from a 2.85GHz T4
running Linux:


# rsa 1024 bits 0.000341s 0.000021s   2931.5  46873.8
# rsa 2048 bits 0.001244s 0.000044s    803.9  22569.1
# rsa 4096 bits 0.006203s 0.000387s    161.2   2586.3

Yes, it's not as fast as your engine (except for RSA4096), but the
difference for the 1024- and 2048-bit results is significant enough to
make the "how come" question relevant. Is it a 32- or 64-bit build you
are referring to? If 32, can you collect results for a 64-bit build?
./Configure solaris64-sparcv9-[g]cc. One should keep in mind that if the
32-bit subroutine is hit by an interrupt/exception it has to be
restarted. Though it's the longer keys that should be affected more...
But please test. If the 64-bit code delivers the same performance as on
Linux, the question would be why a Solaris 32-bit application is hit by
interrupts/exceptions more than a Linux one.
Misaki ran the tests now, but the default openssl on Solaris is
64-bit, so I think her results are 64-bit.
On the other hand, from my experience, on an empty system, interrupts
causing recomputation in the 32-bit version are very rare.


As Ferenc noted, I used 64-bit openssl binary to measure the performance.

Let me talk to our performance engineer to see if we can collect some
performance profile on the sign operations.


Thank you

-- misaki



As for RSA sign performance in general: OpenSSL doesn't actually use the
fastest possible algorithm for exponentiation, but rather a more secure
one, more resistant to side-channel attacks (which should be taken very
seriously on a massively SMT platform such as T4). There is also the
possibility that your engine doesn't perform blinding. These are likely
to be another bit of the explanation for why it's slower.
I understand that, but these don't contribute enough to the ~2x speed
difference. The engine code only replaces the modular exponentiation, so
the blinding is decided by the openssl code. That is, either both runs
are with blinding or both are without it.

Then there is also the risk that I was effectively blinded by the fact
that I managed to significantly improve the original result, and as a
result stopped looking for ways to improve even further. One thing that
I could/should have wondered about, and wonder about now: I'm using
conditional fmovd instructions, but how fast are they in this context?


I asked Ferenc Rakoczi (Oracle's engineer who is most familiar with the
T4 instructions and crypto algorithms) to look at the code.  The
response from Ferenc is attached below.  According to Ferenc, the T4
engine code gets rid of a lot of copying, and probably that made the
difference.


Yes, OpenSSL copies data, *but* it's one copy-in and copy-out (with
conversion in assembly) per exponentiation, and an exponentiation is
half-the-number-of-bits Montgomery operations. I mean, for a 1024-bit
key there is one copy-in and copy-out per 512 montsqr/montmul
instructions. I find it hard to believe it would be a problem.

I was referring to the copies from the registers to memory and back
after each multiplication (a lot of which is unnecessary, because the
instructions replace the multiplicand with the result, so repeated
squarings don't need any setup, and for a multiplication step one only
has to load the new multiplier before issuing the instruction).
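[Editorial aside: the op-count argument can be made concrete with a fixed 5-bit-window exponentiation — the same "runs of squarings plus an occasional multiplication" shape discussed later in the thread. Plain integers stand in for montsqr/montmul; this is a sketch of the algorithm's shape, not OpenSSL's code, and it assumes a compiler with unsigned __int128 (gcc/clang).]

```c
#include <stdint.h>

/* Fixed-window (2^5) modular exponentiation: the table of
 * base^0..base^31 is computed once per exponentiation (the
 * "copy-in" analogue), then every 5-bit window of the exponent
 * costs exactly 5 squarings plus 1 multiplication.  uint64
 * arithmetic stands in for montsqr/montmul; sketch only. */
uint64_t modexp_win5(uint64_t base, uint64_t exp, uint64_t mod)
{
    uint64_t table[32];
    uint64_t acc = 1 % mod;
    int i, j;

    table[0] = 1 % mod;
    for (i = 1; i < 32; i++)
        table[i] = (uint64_t)((unsigned __int128)table[i - 1] * base % mod);

    /* 64-bit exponent -> 13 windows of 5 bits, top window padded */
    for (i = 12; i >= 0; i--) {
        for (j = 0; j < 5; j++)                   /* 5 squarings */
            acc = (uint64_t)((unsigned __int128)acc * acc % mod);
        acc = (uint64_t)((unsigned __int128)acc
              * table[(exp >> (5 * i)) & 31] % mod);  /* 1 multiply */
    }
    return acc;
}
```

For a 512-bit exponent this shape means roughly 512 montsqr plus ~103 montmul per exponentiation against a single copy-in/copy-out, which is Andy's ratio argument; Ferenc's objection concerns per-instruction register-to-memory traffic inside that loop, not the per-exponentiation copies.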



===  Email from Ferenc Rakoczi ===

...

This code does not have the kind of precautions against timing- and
cache-based attacks that the openssl code has - I think on the T4
the timing depends on so many factors that even if the attacker
runs on the same core they could not get accurate enough timings
for the attacks to succeed


I'd argue that it's easier on T4. An SMT attack works by instrumenting
memory timings. The victim thread accesses memory in a very compact
sequence and then goes on calculating without any references to memory.
A high ratio between the calculation and memory-reference phases works
in the attacker's favour. Yes, the attacker thread would have to end up
on the same core, but once it does, it gets a very good chance to deduce
the access pattern.

It might work with 2 threads on a core, when you know when the other one
is doing the exponentiation. With 8 threads, doing all kinds of things,
I would bet against it. But as I said, the straight-line program
operations can be modified to use scattered data without much loss of
performance.



- but for the extra paranoid those
defenses can be built in - one can change the algorithm slightly
so that there is always 5 squarings followed by a