Re: MONTMUL performance: t4 engine vs inlined t4
> Hi, Thank you so much for looking into the issue with Ferenc! I'll
> incorporate the change into Solaris to verify the 20-30% performance
> improvement. The conservative approach sounds like the best approach at
> this point. Once the performance improvement is verified, can you commit
> the change to 1.0.2?

The changes were committed to 1.0.2 the same day. I've also addressed the
alloca problem.
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
Development Mailing List                       openssl-dev@openssl.org
Automated List Manager                           majord...@openssl.org
Re: MONTMUL performance: t4 engine vs inlined t4
Hi Andy,

Thank you so much for looking into the issue with Ferenc! I'll incorporate
the change into Solaris to verify the 20-30% performance improvement. The
conservative approach sounds like the best approach at this point. Once
the performance improvement is verified, can you commit the change to
1.0.2?

Thanks again.
-- misaki

On 06/18/13 04:10, Andy Polyakov wrote:
> Misaki,
>
>> The measurement I sent yesterday for OpenSSL (with inlined T4
>> instruction support) was not quite accurate. Some of the T4-specific
>> code you committed was not enabled when we tested, and I realized that
>> __sparc__ was not defined on our system. Thus, I changed
>> #if defined(__sparc__) to #if defined(__sparc). Now, we are seeing
>> better numbers with OpenSSL.
>>
>>                     sign      verify    sign/s  verify/s
>> rsa 1024 bits   0.000351s  0.000024s   2852.9   42311.0
>> rsa 2048 bits   0.001258s  0.000047s    795.1   21128.6
>> rsa 4096 bits   0.006240s  0.000395s    160.3    2533.3
>
> Which is virtually identical to the Linux results. So one mystery is
> solved. I'll commit the fix at some later point.
>
>> which is still slower than our t4 engine for 1k and 2k bit RSA sign:
>>
>>                     sign      verify    sign/s  verify/s
>> rsa 1024 bits   0.000237s  0.000028s   4221.9   36119.8
>> rsa 2048 bits   0.000876s  0.000075s   1141.7   13285.6
>> rsa 4096 bits   0.006341s  0.002139s    157.7     467.5
>
> As mentioned, the problem seems to be multi-layer, and we are moving in
> the right direction.
> http://git.openssl.org/gitweb/?p=openssl.git;a=commitdiff;h=4ddacd9921f48013b5cd011e4b93b294c14db1c2
> improves RSA sign performance by 20-30%:
>
>                     sign      verify    sign/s  verify/s
> rsa 1024 bits   0.000256s  0.000016s   3904.4   61411.9
> rsa 2048 bits   0.000946s  0.000029s   1056.8   34292.7
> rsa 4096 bits   0.005061s  0.000340s    197.6    2940.5
>
> This is still slower than your code, but the conclusion we have to draw
> is that it's intentional, in the sense that the discrepancy is accounted
> for by the fact that OpenSSL implements counter-measures against
> cache-timing attacks and takes a rather conservative approach. It
> remains to be seen whether a platform-specific and faster
> counter-measure will be suggested at a later point.
> Meanwhile, please double-check on Solaris.
Re: MONTMUL performance: t4 engine vs inlined t4
Misaki,

> The measurement I sent yesterday for OpenSSL (with inlined T4
> instruction support) was not quite accurate. Some of the T4-specific
> code you committed was not enabled when we tested, and I realized that
> __sparc__ was not defined on our system. Thus, I changed
> #if defined(__sparc__) to #if defined(__sparc). Now, we are seeing
> better numbers with OpenSSL.
>
>                     sign      verify    sign/s  verify/s
> rsa 1024 bits   0.000351s  0.000024s   2852.9   42311.0
> rsa 2048 bits   0.001258s  0.000047s    795.1   21128.6
> rsa 4096 bits   0.006240s  0.000395s    160.3    2533.3

Which is virtually identical to the Linux results. So one mystery is
solved. I'll commit the fix at some later point.

> which is still slower than our t4 engine for 1k and 2k bit RSA sign:
>
>                     sign      verify    sign/s  verify/s
> rsa 1024 bits   0.000237s  0.000028s   4221.9   36119.8
> rsa 2048 bits   0.000876s  0.000075s   1141.7   13285.6
> rsa 4096 bits   0.006341s  0.002139s    157.7     467.5

As mentioned, the problem seems to be multi-layer, and we are moving in
the right direction.
http://git.openssl.org/gitweb/?p=openssl.git;a=commitdiff;h=4ddacd9921f48013b5cd011e4b93b294c14db1c2
improves RSA sign performance by 20-30%:

                    sign      verify    sign/s  verify/s
rsa 1024 bits   0.000256s  0.000016s   3904.4   61411.9
rsa 2048 bits   0.000946s  0.000029s   1056.8   34292.7
rsa 4096 bits   0.005061s  0.000340s    197.6    2940.5

This is still slower than your code, but the conclusion we have to draw is
that it's intentional, in the sense that the discrepancy is accounted for
by the fact that OpenSSL implements counter-measures against cache-timing
attacks and takes a rather conservative approach. It remains to be seen
whether a platform-specific and faster counter-measure will be suggested
at a later point. Meanwhile, please double-check on Solaris.
Re: MONTMUL performance: t4 engine vs inlined t4
>> Another question is about the suitability of the floating-point fcmps
>> and fmovd instructions. These are used to pick a vector from the powers
>> table in a cache-timing-neutral manner. I have to admit I haven't done
>> due research on whether or not they are the optimal choice in this
>> context, and/or whether or not we are better off using the fand and for
>> instructions for this purpose. As the instructions in question are
>> floating-point, they might be executed by the *shared* FPU and not by
>> an individual core [which might be disruptive for the pipeline?]...
>
> fcmps has an 11-cycle latency and executes in the external FPU. Likewise
> for floating-point conditional moves of floating-point registers.
> Floating-point conditional move of integer registers is the worst: it is
> split into two micro-ops and it breaks the instruction decode group.
> Plain fmovd you should never use; it goes into the external FPU because
> it affects the condition codes in the %fsr. Use fsrc2 instead, which has
> a 1-cycle latency and executes in the front end of the cpu.

I wonder about integer conditional move on integer condition. It should be
noted that sheer latency is of lesser concern, as long as the processor
can efficiently handle several of them in the pipeline. The Condition
Codes Register would remain constant throughout the conditional-move
segment's execution.
Re: MONTMUL performance: t4 engine vs inlined t4
From: Andy Polyakov <ap...@openssl.org>
Date: Sat, 01 Jun 2013 09:38:18 +0200

> I wonder about integer conditional move on integer condition. It should
> be noted that sheer latency is of lesser concern, as long as the
> processor can efficiently handle several of them in the pipeline. The
> Condition Codes Register would remain constant throughout the
> conditional-move segment's execution.

movcc is fine, has a 1-cycle latency, and pipelines quite well. movr
breaks the decode group, but still has a 1-cycle latency.
Re: MONTMUL performance: t4 engine vs inlined t4
I forgot to mention: the out-of-order execution unit renames the condition
codes just like any other register.
Re: MONTMUL performance: t4 engine vs inlined t4
For public reference. To a certain degree it's apparent from the context,
but the report is about the RSA sign performance difference between the
OpenSSL SPARC T4 Montgomery multiplication module and the corresponding
Solaris T4 module, with OpenSSL being significantly slower. The least one
can say [at this point] is that the problem appears to be multi-layer, in
the sense that there are different factors in play.

The first question in line is how come the same code performs that
differently on Solaris and Linux. OpenSSL on Linux delivers ~70% more
RSA1024 signs than on Solaris (if we assume that both systems operate at
the same frequency, which is supported by the fact that the verify results
were virtually identical).

> I used a 64-bit openssl binary to measure the performance.

With the above in mind, here is something to test. In
crypto/bn/asm/sparct4-mont.pl there is a register-windows warm-up sequence
that is executed in 32-bit application context only (benchmarking on Linux
had shown that it's not necessary in 64-bit application context). Could
you test engaging it even in 64-bit application context? I.e. open
crypto/bn/asm/sparct4-mont.pl in a text editor, locate the "warm it up"
comment and replace #ifndef __arch64__ on the preceding line with #if 1.

> Let me talk to our performance engineer to see if we can collect some
> performance profile on sign operations.

One should probably note that openssl.org has a quite low maximum e-mail
message size limit. In other words, if a message is big enough, it will
bounce. This naturally applies even to ap...@openssl.org, in case you
reckon that the results are not of interest to the general public [or
choose not to share them for another reason]. So if it bounces from
openssl.org, drop me a note, and I'll provide an alternative address for
delivery.
Re: MONTMUL performance: t4 engine vs inlined t4
For public reference. To a certain degree it's apparent from the context,
but the report is about the RSA sign performance difference between the
OpenSSL SPARC T4 Montgomery multiplication module and the corresponding
Solaris T4 module, with OpenSSL being significantly slower. The least one
can say [at this point] is that the problem appears to be multi-layer, in
the sense that there are different factors in play.

The first question in line is how come the same code performs that
differently on Solaris and Linux. OpenSSL on Linux delivers ~70% more
RSA1024 signs than on Solaris (if we assume that both systems operate at
the same frequency, which is supported by the fact that the verify results
were virtually identical).

Another question is about the suitability of the floating-point fcmps and
fmovd instructions. These are used to pick a vector from the powers table
in a cache-timing-neutral manner. I have to admit I haven't done due
research on whether or not they are the optimal choice in this context,
and/or whether or not we are better off using the fand and for
instructions for this purpose. As the instructions in question are
floating-point, they might be executed by the *shared* FPU and not by an
individual core [which might be disruptive for the pipeline?]...
Re: MONTMUL performance: t4 engine vs inlined t4
From: Andy Polyakov <ap...@openssl.org>
Date: Fri, 31 May 2013 10:29:37 +0200

> Another question is about the suitability of the floating-point fcmps
> and fmovd instructions. These are used to pick a vector from the powers
> table in a cache-timing-neutral manner. I have to admit I haven't done
> due research on whether or not they are the optimal choice in this
> context, and/or whether or not we are better off using the fand and for
> instructions for this purpose. As the instructions in question are
> floating-point, they might be executed by the *shared* FPU and not by an
> individual core [which might be disruptive for the pipeline?]...

fcmps has an 11-cycle latency and executes in the external FPU. Likewise
for floating-point conditional moves of floating-point registers.
Floating-point conditional move of integer registers is the worst: it is
split into two micro-ops and it breaks the instruction decode group. Plain
fmovd you should never use; it goes into the external FPU because it
affects the condition codes in the %fsr. Use fsrc2 instead, which has a
1-cycle latency and executes in the front end of the cpu.
Re: MONTMUL performance: t4 engine vs inlined t4
>> Another question is about the suitability of the floating-point fcmps
>> and fmovd instructions. These are used to pick a vector from the powers
>> table in a cache-timing-neutral manner. I have to admit I haven't done
>> due research on whether or not they are the optimal choice in this
>> context, and/or whether or not we are better off using the fand and for
>> instructions for this purpose. As the instructions in question are
>> floating-point, they might be executed by the *shared* FPU and not by
>> an individual core [which might be disruptive for the pipeline?]...
>
> fcmps has an 11-cycle latency and executes in the external FPU. Likewise
> for floating-point conditional moves of floating-point registers.
> Floating-point conditional move of integer registers is the worst: it is
> split into two micro-ops and it breaks the instruction decode group.
> Plain fmovd you should never use; it goes into the external FPU because
> it affects the condition codes in the %fsr. Use fsrc2 instead, which has
> a 1-cycle latency and executes in the front end of the cpu.

Thanks! Even though the question was inadequately formulated (it was not
about just fmovd, but about *conditional* fmovd on a floating-point
condition, sorry), I get the picture. The conclusion seems to be that we
should bet on the logical operations, fand and for, which are 3 cycles and
[more importantly?] are handled by private core resources. Thanks again.
Re: MONTMUL performance: t4 engine vs inlined t4
> ... here is something to test. In crypto/bn/asm/sparct4-mont.pl there is
> a register-windows warm-up sequence that is executed in 32-bit
> application context only (benchmarking on Linux had shown that it's not
> necessary in 64-bit application context). Could you test engaging it
> even in 64-bit application context? I.e. open
> crypto/bn/asm/sparct4-mont.pl in a text editor, locate the "warm it up"
> comment and replace #ifndef __arch64__ on the preceding line with #if 1.

Forgot to mention: there are *two* occurrences of the "warm it up"
sequence. Both need to be engaged to test.
Re: MONTMUL performance: t4 engine vs inlined t4
Hi Andy,

The measurement I sent yesterday for OpenSSL (with inlined T4 instruction
support) was not quite accurate. Some of the T4-specific code you
committed was not enabled when we tested, and I realized that __sparc__
was not defined on our system. Thus, I changed #if defined(__sparc__) to
#if defined(__sparc). Now, we are seeing better numbers with OpenSSL:

                    sign      verify    sign/s  verify/s
rsa 1024 bits   0.000351s  0.000024s   2852.9   42311.0
rsa 2048 bits   0.001258s  0.000047s    795.1   21128.6
rsa 4096 bits   0.006240s  0.000395s    160.3    2533.3

which is still slower than our t4 engine for 1k and 2k bit RSA sign:

                    sign      verify    sign/s  verify/s
rsa 1024 bits   0.000237s  0.000028s   4221.9   36119.8
rsa 2048 bits   0.000876s  0.000075s   1141.7   13285.6
rsa 4096 bits   0.006341s  0.002139s    157.7     467.5

So, I enabled warm-up as suggested by you, but the performance numbers
still look the same. Here is the new bn_mul_mont_t4_8():

bn_mul_mont_t4_8:        8a 10 20 00  clr    %g5
bn_mul_mont_t4_8+0x4:    88 10 3f 80  mov    -0x80, %g4
bn_mul_mont_t4_8+0x8:    8b 29 70 20  sllx   %g5, 0x20, %g5
bn_mul_mont_t4_8+0xc:    9d e3 80 04  save   %sp, %g4, %sp
bn_mul_mont_t4_8+0x10:   9d e3 bf 80  save   %sp, -0x80, %sp
bn_mul_mont_t4_8+0x14:   9d e3 bf 80  save   %sp, -0x80, %sp
bn_mul_mont_t4_8+0x18:   9d e3 bf 80  save   %sp, -0x80, %sp
bn_mul_mont_t4_8+0x1c:   9d e3 bf 80  save   %sp, -0x80, %sp
bn_mul_mont_t4_8+0x20:   9d e3 bf 80  save   %sp, -0x80, %sp
bn_mul_mont_t4_8+0x24:   9d e3 bf 80  save   %sp, -0x80, %sp
bn_mul_mont_t4_8+0x28:   81 e8 00 00  restore
bn_mul_mont_t4_8+0x2c:   81 e8 00 00  restore
bn_mul_mont_t4_8+0x30:   81 e8 00 00  restore
bn_mul_mont_t4_8+0x34:   81 e8 00 00  restore
bn_mul_mont_t4_8+0x38:   81 e8 00 00  restore
bn_mul_mont_t4_8+0x3c:   81 e8 00 00  restore
bn_mul_mont_t4_8+0x40:   88 0b a0 01  and    %sp, 0x1, %g4
bn_mul_mont_t4_8+0x44:   bc 11 40 1e  or     %g5, %fp, %fp
bn_mul_mont_t4_8+0x48:   8a 11 00 05  or     %g4, %g5, %g5

I realized that, in sparct4-mont.pl, I see some 64-bit sparcv9-specific
code, but my 64-bit library doesn't have those instructions. It looks like
the __arch64__ branch was taken. Did you expect to have the
SPARCV9_64BIT_STACK section compiled in?

.globl  bn_mul_mont_t4_$NUM
.align  32
bn_mul_mont_t4_$NUM:
#ifdef  __arch64__
        mov     0,$sentinel
        mov     -128,%g4
#elif defined(SPARCV9_64BIT_STACK)
        SPARC_LOAD_ADDRESS_LEAF(OPENSSL_sparcv9cap_P,%g1,%g5)
        ld      [%g1+0],%g1     ! OPENSSL_sparcv9_P[0]
        mov     -2047,%g4
        and     %g1,SPARCV9_64BIT_STACK,%g1
        movrz   %g1,0,%g4
        mov     -1,$sentinel
        add     %g4,-128,%g4
#else
        mov     -1,$sentinel
        mov     -128,%g4
#endif
        sllx    $sentinel,32,$sentinel
        save    %sp,%g4,%sp
#if 1
        save    %sp,-128,%sp    ! warm it up
        save    %sp,-128,%sp
--- snip ---

Thank you,
-- misaki

>> I used a 64-bit openssl binary to measure the performance.
>
> With the above in mind, here is something to test. In
> crypto/bn/asm/sparct4-mont.pl there is a register-windows warm-up
> sequence that is executed in 32-bit application context only
> (benchmarking on Linux had shown that it's not necessary in 64-bit
> application context). Could you test engaging it even in 64-bit
> application context? I.e. open crypto/bn/asm/sparct4-mont.pl in a text
> editor, locate the "warm it up" comment and replace #ifndef __arch64__
> on the preceding line with #if 1.
Re: MONTMUL performance: t4 engine vs inlined t4
Hi,

> The measurement I sent yesterday for OpenSSL (with inlined T4
> instruction support) was not quite accurate. Some of the T4-specific
> code you committed was not enabled when we tested, and I realized that
> __sparc__ was not defined on our system. Thus, I changed
> #if defined(__sparc__) to #if defined(__sparc). Now, we are seeing
> better numbers with OpenSSL:
>
>                     sign      verify    sign/s  verify/s
> rsa 1024 bits   0.000351s  0.000024s   2852.9   42311.0
> rsa 2048 bits   0.001258s  0.000047s    795.1   21128.6
> rsa 4096 bits   0.006240s  0.000395s    160.3    2533.3

Which is virtually identical to the Linux results. So one mystery is
solved. I'll commit the fix at some later point.

> which is still slower than our t4 engine for 1k and 2k bit RSA sign:
>
>                     sign      verify    sign/s  verify/s
> rsa 1024 bits   0.000237s  0.000028s   4221.9   36119.8
> rsa 2048 bits   0.000876s  0.000075s   1141.7   13285.6
> rsa 4096 bits   0.006341s  0.002139s    157.7     467.5

As mentioned, the problem seems to be multi-layer, and we are moving in
the right direction.

> So, I enabled warm-up as suggested by you, but the performance numbers
> still look the same.

Well, the suggestion was of a what-if character, a product of slight
desperation :-) But it appears to be unnecessary, so we leave it as it is.

> I realized that, in sparct4-mont.pl, I see some 64-bit sparcv9-specific
> code, but my 64-bit library doesn't have those instructions. It looks
> like the __arch64__ branch was taken. Did you expect to have the
> SPARCV9_64BIT_STACK section compiled in?

No. SPARCV9_64BIT_STACK is a Linux-specific thing. In the commentary
section in crypto/bn/asm/sparct4-mont.pl you'll see a paragraph that
starts with "32-bit code is prone to performance degradation". This is
what SPARCV9_64BIT_STACK is about.
Re: MONTMUL performance: t4 engine vs inlined t4
Hi Andy,

On 05/30/13 15:08, Ferenc Rakoczi wrote:
> Hi, Andy,
>
> Andy Polyakov wrote:
>> First of all, RSA512 is essentially irrelevant and no attempt was made
>> to optimize it. So let's just disregard the RSA512 results (I have even
>> removed them from the above quoted part). Secondly, note that our RSA
>> verify is faster. I never thought verify could be a bottleneck
>> anywhere, so we always concentrated on sign. Verify is dominated by a
>> single-op subroutine and we've got it more right. So we have only RSA
>> sign to figure out.
>>
>> The first thing one notices is the difference between your and our
>> results from a 2.85GHz T4 running Linux:
>>
>> # rsa 1024 bits   0.000341s  0.000021s   2931.5   46873.8
>> # rsa 2048 bits   0.001244s  0.000044s    803.9   22569.1
>> # rsa 4096 bits   0.006203s  0.000387s    161.2    2586.3
>>
>> Yes, it's not as fast as your engine (except for RSA4096), but the
>> difference for the 1024- and 2048-bit results is significant enough to
>> make the "how come" question relevant. Is it a 32- or 64-bit build you
>> are referring to? If 32, can you collect results for a 64-bit build?
>> ./Configure solaris64-sparcv9-[g]cc. One should keep in mind that if
>> the 32-bit subroutine is hit by an interrupt/exception it has to be
>> restarted. Though it's longer keys that should be affected more... But
>> please test. If the 64-bit code delivers the same performance as on
>> Linux, the question would be why a Solaris 32-bit application is hit by
>> interrupts/exceptions more than a Linux one.
>
> Misaki ran the tests now, but the default openssl on solaris is 64-bit,
> so I think her results are 64-bit. On the other hand, from my
> experience, on an empty system, interrupts causing recomputation in the
> 32-bit version are very rare.

As Ferenc noted, I used a 64-bit openssl binary to measure the
performance. Let me talk to our performance engineer to see if we can
collect some performance profile on sign operations.

Thank you
-- misaki

As for RSA sign performance in general.
OpenSSL doesn't actually use the fastest possible algorithm for
exponentiation, but rather a more secure one, more resistant to
side-channel attacks (which should be taken very seriously on a massive
SMT platform such as T4). There is also the possibility that your engine
doesn't perform blinding. These are likely to be another bit of the
explanation for why it's slower.

I understand that, but these don't contribute enough to the ~2x speed
difference. The engine code only replaces the modular exponentiation, so
the blinding is decided by the openssl code. That is, either both runs are
with blinding or both are without it.

Then there is also the risk that I was effectively blinded by the fact
that I managed to significantly improve the original result, and as a
result stopped looking for ways to improve even further. One thing that I
could/should have wondered about and wonder about now: I'm using
conditional fmovd instructions, but how fast are they in this context?

I asked Ferenc Rakoczi (Oracle's engineer who is most familiar with T4
instructions and crypto algorithms) to look at the code. The response from
Ferenc is attached below. According to Ferenc, the T4 engine code gets rid
of a lot of copying, and probably that made the difference.

Yes, OpenSSL copies data, *but* it's copy-in and copy-out (with conversion
in assembly) per exponentiation, and an exponentiation is half the key's
bit count in Montgomery operations. I mean, for a 1024-bit key there is
one copy-in and copy-out per 512 montsqr/montmul instructions. I find it
hard to believe it would be a problem.

I was referring to the copies from the registers to memory and back after
each multiplication (a lot of which is unnecessary, because the
instructions replace the multiplicand with the result, so repeated
squarings don't need any setup, and for a multiplication step one only has
to load the new multiplier before issuing the instruction).

=== Email from Ferenc Rakoczi ===

...

This code does not have the kind of precautions against timing- and
cache-based attacks as the openssl code - I think on the T4 the timing
depends on so many factors that even if the attacker runs on the same core
they could not get accurate enough timings for the attacks to succeed

I'd argue that it's easier on T4. An SMT attack works by instrumenting
memory timings. The victim thread accesses memory in a very compact
sequence and then goes on calculating without any references to memory. A
high ratio between calculation and memory-reference phases works in the
attacker's favour. Yes, the attacker thread would have to end up on the
same core, but once it does, it gets a very good chance to deduce the
access pattern.

It might work with 2 threads on a core, when you know when the other one
is doing the exponentiation. With 8 threads, doing all kinds of things, I
would bet against it. But as I said, the straight-line program operations
can be modified to use scattered data without much loss of performance.

- but for the extra paranoid those defenses can be built in - one can
change the algorithm slightly so that there is always 5 squarings followed
by a