[openssl.org #3149] [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms
For the record, I have reviewed Adam's versions of the code before these were posted here, and Adam has resolved the problems that I pointed out. As of the latest patch, I think the code is suitable for inclusion in OpenSSL. The final missing part is support that makes it easy to build with or without this NIST P-256 implementation, depending on whether the target platform supports it, similar to the enable-ec_nistp_64_gcc_128 config option for the 64-bit optimized implementations using type __uint128_t. (The current patch unconditionally links in the new files, but we may not even be able to process the new assembler files.) Also, it would be nice to have Shay review our changes to his contribution (the March 11 patch.bz2 as further changed by the April 10 patch) to make sure that while fixing the problems we found, we didn't do unwanted changes. __ OpenSSL Project http://www.openssl.org Development Mailing List openssl-dev@openssl.org Automated List Manager majord...@openssl.org
RE: [openssl.org #3149] [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms
Do you have any comment from Intel on the concerns regarding the scattering technique (http://cryptojedi.org/peter/data/chesrump-20130822.pdf)? First, a comment: it is difficult to actually understand the precise claim by the authors, from these 6 slides. The code snippet does not include sufficient information to determine what the experiment really is. With this disclaimer, my understanding is this: the authors of the presentation claim that accessing different locations on a cache line shows different latencies – depending on the location (within the cache line). The graph shows a *significantly* larger latency for doubleword #0. As a result, they insinuate that the scattering-gathering method for accessing tables does not protect against side channel analysis. This, however, seems to me to be incorrect. Perhaps what the experiment shows is due to “cache line splits”, where the data that is being accessed does not reside completely in a single cahce line. For example, when aligned on 56 bytes, and reading 64-bit quadwords, the first one (say Q0) is in one cache line and the rest are in another. So, when flushing (only) the cache line that holds Q0, subsequent reading of Q0 would be “slow” but reading the rest would be fast. It is not possible to see how the array was aligned in the discussed presentation, because the code snippet does not include this part. Anyway, this phenomenon is not a vulnerability - it is only about a proper implementation of the gather/scatter method: the elements of the table simply need to be naturally aligned in a way that every single element belongs to the same cache line. The programmer needs make sure that the elements of the table are naturally aligned (e.g., aligned (64)). As long as there are no cache line splits, there should be no timing difference between accesses to a cache line. Shay -Original Message- From: Bodo Moeller via RT [mailto:r...@openssl.org] Sent: Friday, November 08, 2013 12:09 To: Gueron, Shay Cc: openssl-dev@openssl.org Subject: Re: [openssl.org #3149] [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms Here is an updated version of the patch. Addressing a) pointer to the function (to select ADCX/ADOX) and b) multiple points addition There is (only) ~1% performance deterioration in due to the pointer being passed now, instead of (originally) being static. You can choose which style is preferable. Thanks! Alternatives would be (a) using a new lock for safe static initialization, or (b) more code duplication to avoid the need for an explicit pointer (there could be two separate implementations for the higher-level routines). However, given the 1% performance penalty, that's a minor issue at this point. Do you have any comment from Intel on the concerns regarding the scattering technique (http://cryptojedi.org/peter/data/chesrump-20130822.pdf)? Bodo - Intel Israel (74) Limited This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies. __ OpenSSL Project http://www.openssl.org Development Mailing List openssl-dev@openssl.org Automated List Manager majord...@openssl.org
Re: [openssl.org #3149] [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms
Do you have any comment from Intel on the concerns regarding the scattering technique (http://cryptojedi.org/peter/data/chesrump-20130822.pdf)? As discussed off-list in this case the discrepancy is because so called memory disambiguation logic attempting to move loads ahead of stores, and failing when the least significant bits are same. Naturally load ought to be given opportunity to *try* to get ahead. I mean if there is enough work done between store and potentially conflicting load, then load won't try to get ahead of the store and variation . And indeed, if you add instructions to the test program the variation disappears. On Intel CPUs amount worth 5 cycles appears to be sufficient. This might be misinterpreted. It's not necessarily that 5 cycles is sufficient for load not to try to get ahead, but it's sufficient to amortize eventual variations. __ OpenSSL Project http://www.openssl.org Development Mailing List openssl-dev@openssl.org Automated List Manager majord...@openssl.org
Re: [openssl.org #3149] [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms
Not sent to RT, to openssl-dev only. Alternatives would be (a) using a new lock for safe static initialization, or (b) more code duplication to avoid the need for an explicit pointer (there could be two separate implementations for the higher-level routines). However, given the 1% performance penalty, that's a minor issue at this point. While if (functiona==NULL || functionb==NULL) { asssign functiona, functionb } can be unsafe, I'd argue that if (functiona==NULL) { assign functiona } followed by if (functionb) { assign functionb } is. That's not thread-safe. There's no memory barrier so the writes can be reordered. The compiler itself could reorder those instructions. I want to remind that in this case assignments are constant in sense that all threads will assign same value. So that disjoint if's work out even if threads are reading/writing out of order. As for reorder by compiler. No, it won't do that. If it actually would have this option, then if (ptr) *ptr wouldn't be safe. If compiler [or hardware] does that kind of speculation it is also obliged to verify the assumption and act accordingly. Or do you mean that it could reorder it as if-if-assign-assign? This is not a problem. The only *formally* valid concerns are pointed by Bodo in another message, and they are that function[ab] are misaligned (and therefore access is not atomic), and that they are used for something else earlier. As for latter. If compiler was to use explicitly initialized shared variable for something else, it would have to do in a way invisible to program code. It is only possible after *full* inter-procedural analysis for given *binary*. But then analyzer would have to be aware of transition from single- to multi-threaded operation and stop using variable for something else before multi-threaded part [and even initialize it accordingly]. And as for former. In the context one should recognize that we are not talking about *universally* portable code. The code in question *is* platform-specific and we are free to take into consideration such implementation specifics as object alignment [in supported compilers]. __ OpenSSL Project http://www.openssl.org Development Mailing List openssl-dev@openssl.org Automated List Manager majord...@openssl.org
Re: [openssl.org #3149] [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms
Here is an updated version of the patch. Addressing a) pointer to the function (to select ADCX/ADOX) and b) multiple points addition There is (only) ~1% performance deterioration in due to the pointer being passed now, instead of (originally) being static. You can choose which style is preferable. Thanks! Alternatives would be (a) using a new lock for safe static initialization, or (b) more code duplication to avoid the need for an explicit pointer (there could be two separate implementations for the higher-level routines). However, given the 1% performance penalty, that's a minor issue at this point. Do you have any comment from Intel on the concerns regarding the scattering technique (http://cryptojedi.org/peter/data/chesrump-20130822.pdf)? Bodo __ OpenSSL Project http://www.openssl.org Development Mailing List openssl-dev@openssl.org Automated List Manager majord...@openssl.org
Re: [openssl.org #3149] [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms
On Fri, Nov 8, 2013 at 4:08 AM, Bodo Moeller via RT r...@openssl.org wrote: Alternatives would be (a) using a new lock for safe static initialization, Maybe you could try my patches on my thread_safety branch of my github clone of OpenSSL? (https://github.com/nicowilliams/openssl) Nico -- __ OpenSSL Project http://www.openssl.org Development Mailing List openssl-dev@openssl.org Automated List Manager majord...@openssl.org
Re: [openssl.org #3149] [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms
Here is an updated version of the patch. Addressing a) pointer to the function (to select ADCX/ADOX) and b) multiple points addition There is (only) ~1% performance deterioration in due to the pointer being passed now, instead of (originally) being static. You can choose which style is preferable. Thanks! Alternatives would be (a) using a new lock for safe static initialization, or (b) more code duplication to avoid the need for an explicit pointer (there could be two separate implementations for the higher-level routines). However, given the 1% performance penalty, that's a minor issue at this point. While if (functiona==NULL || functionb==NULL) { asssign functiona, functionb } can be unsafe, I'd argue that if (functiona==NULL) { assign functiona } followed by if (functionb) { assign functionb } is. Do you have any comment from Intel on the concerns regarding the scattering technique (http://cryptojedi.org/peter/data/chesrump-20130822.pdf)? As discussed off-list in this case the discrepancy is because so called memory disambiguation logic attempting to move loads ahead of stores, and failing when the least significant bits are same. Naturally load ought to be given opportunity to *try* to get ahead. I mean if there is enough work done between store and potentially conflicting load, then load won't try to get ahead of the store and variation . And indeed, if you add instructions to the test program the variation disappears. On Intel CPUs amount worth 5 cycles appears to be sufficient. For record, update for x86_64-mont modules is being prepared... __ OpenSSL Project http://www.openssl.org Development Mailing List openssl-dev@openssl.org Automated List Manager majord...@openssl.org
Re: [openssl.org #3149] [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms
On Fri, Nov 8, 2013 at 2:43 PM, Andy Polyakov via RT r...@openssl.org wrote: Alternatives would be (a) using a new lock for safe static initialization, or (b) more code duplication to avoid the need for an explicit pointer (there could be two separate implementations for the higher-level routines). However, given the 1% performance penalty, that's a minor issue at this point. While if (functiona==NULL || functionb==NULL) { asssign functiona, functionb } can be unsafe, I'd argue that if (functiona==NULL) { assign functiona } followed by if (functionb) { assign functionb } is. That's not thread-safe. There's no memory barrier so the writes can be reordered. The compiler itself could reorder those instructions. Nico -- __ OpenSSL Project http://www.openssl.org Development Mailing List openssl-dev@openssl.org Automated List Manager majord...@openssl.org
Re: [openssl.org #3149] [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms
While if (functiona==NULL || functionb==NULL) { asssign functiona, functionb } can be unsafe, I'd argue that if (functiona==NULL) { assign functiona } followed by if (functionb) { assign functionb } is. We're implicitly assuming here that (thanks to alignment, etc.) each pointer can be accessed atomically, which so far seems reasonable given the particular platform this code is for. However, the C11 memory model also allows the compiler to assume there's no write race, and it thus could, for example, use the same memory location to hold other temporary values, which could then be misinterpreted as the function pointer by concurrent threads. See http://static.usenix.org/event/hotpar11/tech/final_files/Boehm.pdf for ideas how this might break -- maybe not right now, but possibly with future compilers, possibly after this code has evolved a bit. (I'm not promising that it will actually break, but thread-safety analysis tools are likely to complain loudly. And at some point the code might actually fail spectacularly.) __ OpenSSL Project http://www.openssl.org Development Mailing List openssl-dev@openssl.org Automated List Manager majord...@openssl.org
RE: [openssl.org #3149] [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms
Thanks you Bodo, for the comments. Here are some quick answers It seems that the BN_MONT_CTX-related code The optimization made for the computation of the modular inverse in the ECDSA sigh, is using const-time mod-exp. Indeed, this is independent of the rest of the patch, and it can be used independently (for other usages of the library). We included this addition in the patch for the particular usage in ECDSA. The paper: it will be posted soon. Note that in your code, OPENSSL_ia32cap_P-dependent initialization of global variables is not done in a thread-safe way. This initialization is used for selecting a code path that would use ADCX/ADOX instructions when the processor supports them. The outcome depends only on the appropriate CPUID bits. Therefore, there is no “thread-safe” issue (because any thread would select the same path). Of course, feel free to use the patch code and modify this initialization to match OpenSSL conventions. Your ec_p256_points_mul implementation is much worse than necessary when then input comprises many points Indeed right. However, this patch is intended to optimize ECDSA sign/verify (and ECDH). This usage does not require adding more than a single point. If there are interesting cases - optimized multi-point addition can be added. Regards, Shay Gueron -Original Message- From: Bodo Moeller via RT [mailto:r...@openssl.org] Sent: Thursday, October 24, 2013 19:18 To: Gueron, Shay Cc: openssl-dev@openssl.org Subject: [openssl.org #3149] [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms Thanks for the submission! It seems that the BN_MONT_CTX-related code (used in crypto/ecdsa for constant-time signing) is entirely independent of the remainder of the patch, and should be considered separately. Regarding your reference 'S.Gueron and V.Krasnov, Fast Prime Field Elliptic Curve Cryptography with 256 Bit Primes' for you NIST P-256 code, is that document available? (Web search only pointed me back to your patch.) I've noticed that for secret-independent constant-time memory access, your code relies on the scattering approach. However http://cryptojedi.org/peter/data/chesrump-20130822.pdf points out that apparently this doesn't actually work as intended. (Dan Bernstein's earlier references: Sections 14, 15 in http://cr.yp.to/papers.html#cachetiming; http://cr.yp.to/mac/athlon.html.) Note that in your code, OPENSSL_ia32cap_P-dependent initialization of global variables is not done in a thread-safe way. How about entirely avoiding this global state, and passing pointers down to the implementations? Your ec_p256_points_mul implementation is much worse than necessary when then input comprises many points (more precisely, more than one point other than the group generator), because you call ec_p256_windowed_mul multiple times separately and add the results. I'd suggest instead to implement this modeled on ec_GFp_nistp256_points_mul instead to benefit from interleaved left-to-right point multiplication. (This avoids the additional point-double operations from the separate point multiplication algorithm executions going through each additional scalar.) Your approach for precomputation also is different (using fewer point operations based on a larger precomputed table than the one we currently use in ec_GFp_nistp256_points_mul) -- that table size still seems appropriate, so keeping that probably makes sense. - Intel Israel (74) Limited This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies. __ OpenSSL Project http://www.openssl.org Development Mailing List openssl-dev@openssl.org Automated List Manager majord...@openssl.org
[openssl.org #3149] [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms
This initialization is used for selecting a code path that would use ADCX/ADOX instructions when the processor supports them. The outcome depends only on the appropriate CPUID bits. Therefore, there is no “thread-safe” issue (because any thread would select the same path). I understand that that's the idea, and would have considered the code to be safe a while ago (and might have written the initialization just like that), but actually compiler transformations that are legal with the C memory model could break this. See http://static.usenix.org/event/hotpar11/tech/final_files/Boehm.pdf for inspiration. Your ec_p256_points_mul implementation is much worse than necessary when then input comprises many points Indeed right. However, this patch is intended to optimize ECDSA sign/verify (and ECDH). This usage does not require adding more than a single point. Sure, but there's no compelling reason to make the other (rarer) use cases slower. Also, adapting the addition/subtraction chain used in the existing crypto/ec/ecp_nistp256.c (modified Booth encoding instead of unsigned windows) could make the new implementation even faster. __ OpenSSL Project http://www.openssl.org Development Mailing List openssl-dev@openssl.org Automated List Manager majord...@openssl.org
RE: [openssl.org #3149] [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms
Thanks you Bodo, for the comments. Here are some quick answers It seems that the BN_MONT_CTX-related code The optimization made for the computation of the modular inverse in the ECDSA sigh, is using const-time mod-exp. Indeed, this is independent of the rest of the patch, and it can be used independently (for other usages of the library). We included this addition in the patch for the particular usage in ECDSA. The paper: it will be posted soon. Note that in your code, OPENSSL_ia32cap_P-dependent initialization of global variables is not done in a thread-safe way. This initialization is used for selecting a code path that would use ADCX/ADOX instructions when the processor supports them. The outcome depends only on the appropriate CPUID bits. Therefore, there is no “thread-safe” issue (because any thread would select the same path). Of course, feel free to use the patch code and modify this initialization to match OpenSSL conventions. Your ec_p256_points_mul implementation is much worse than necessary when then input comprises many points Indeed right. However, this patch is intended to optimize ECDSA sign/verify (and ECDH). This usage does not require adding more than a single point. If there are interesting cases - optimized multi-point addition can be added. Regards, Shay Gueron -Original Message- From: Bodo Moeller via RT [mailto:r...@openssl.org] Sent: Thursday, October 24, 2013 19:18 To: Gueron, Shay Cc: openssl-dev@openssl.org Subject: [openssl.org #3149] [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms Thanks for the submission! It seems that the BN_MONT_CTX-related code (used in crypto/ecdsa for constant-time signing) is entirely independent of the remainder of the patch, and should be considered separately. Regarding your reference 'S.Gueron and V.Krasnov, Fast Prime Field Elliptic Curve Cryptography with 256 Bit Primes' for you NIST P-256 code, is that document available? (Web search only pointed me back to your patch.) I've noticed that for secret-independent constant-time memory access, your code relies on the scattering approach. However http://cryptojedi.org/peter/data/chesrump-20130822.pdf points out that apparently this doesn't actually work as intended. (Dan Bernstein's earlier references: Sections 14, 15 in http://cr.yp.to/papers.html#cachetiming; http://cr.yp.to/mac/athlon.html.) Note that in your code, OPENSSL_ia32cap_P-dependent initialization of global variables is not done in a thread-safe way. How about entirely avoiding this global state, and passing pointers down to the implementations? Your ec_p256_points_mul implementation is much worse than necessary when then input comprises many points (more precisely, more than one point other than the group generator), because you call ec_p256_windowed_mul multiple times separately and add the results. I'd suggest instead to implement this modeled on ec_GFp_nistp256_points_mul instead to benefit from interleaved left-to-right point multiplication. (This avoids the additional point-double operations from the separate point multiplication algorithm executions going through each additional scalar.) Your approach for precomputation also is different (using fewer point operations based on a larger precomputed table than the one we currently use in ec_GFp_nistp256_points_mul) -- that table size still seems appropriate, so keeping that probably makes sense. - Intel Israel (74) Limited This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies.
[openssl.org #3149] [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms
Thanks for the submission! It seems that the BN_MONT_CTX-related code (used in crypto/ecdsa for constant-time signing) is entirely independent of the remainder of the patch, and should be considered separately. Regarding your reference 'S.Gueron and V.Krasnov, Fast Prime Field Elliptic Curve Cryptography with 256 Bit Primes' for you NIST P-256 code, is that document available? (Web search only pointed me back to your patch.) I've noticed that for secret-independent constant-time memory access, your code relies on the scattering approach. However http://cryptojedi.org/peter/data/chesrump-20130822.pdf points out that apparently this doesn't actually work as intended. (Dan Bernstein's earlier references: Sections 14, 15 in http://cr.yp.to/papers.html#cachetiming; http://cr.yp.to/mac/athlon.html.) Note that in your code, OPENSSL_ia32cap_P-dependent initialization of global variables is not done in a thread-safe way. How about entirely avoiding this global state, and passing pointers down to the implementations? Your ec_p256_points_mul implementation is much worse than necessary when then input comprises many points (more precisely, more than one point other than the group generator), because you call ec_p256_windowed_mul multiple times separately and add the results. I'd suggest instead to implement this modeled on ec_GFp_nistp256_points_mul instead to benefit from interleaved left-to-right point multiplication. (This avoids the additional point-double operations from the separate point multiplication algorithm executions going through each additional scalar.) Your approach for precomputation also is different (using fewer point operations based on a larger precomputed table than the one we currently use in ec_GFp_nistp256_points_mul) -- that table size still seems appropriate, so keeping that probably makes sense. __ OpenSSL Project http://www.openssl.org Development Mailing List openssl-dev@openssl.org Automated List Manager majord...@openssl.org
[openssl.org #3149] [patch] Fast and side channel protected implementation of the NIST P-256 Elliptic Curve, for x86-64 platforms
Hello all, This patch is a contribution to OpenSSL. It offers an efficient and constant-time implementation of EC multiplication for NIST P256 curve. It accelerated ECDSA (sign and verify) as well as ECDH (compute and generate key), for the P256 curve. The implementation is based on Montgomery multiplication, specifically optimized for the P-256 prime, using x86-64 assembly. It is intended to run on any x86-64 CPU (running Linux). We measured the performance benefit of this patch on Intel(R) CPUs Codename Haswell and Sandy Bridge, using 'openssl speed'. Here are the numbers. Single thread, single core: Sandy Bridge@3.4GHz: ECDSA sign: OpenSSL - 8,759 sign/sec OpenSSL with NISTP enabled - 8,822 sign/sec This patch - 17,111 sign/sec 1.95/1.94X speedup ECDSA verify: OpenSSL - 2,303 verify/sec OpenSSL with NISTP enabled - 3,831 verify/sec This patch - 6,193 verify/sec 2.69/1.62X speedup ECDH: OpenSSL - 2,795 op/sec OpenSSL with NISTP enabled - 5,189 op/sec This patch - 8,087 op/sec 2.89X/1.56X speedup Haswell@3.4GHz: ECDSA sign: OpenSSL - 10,212sign/sec OpenSSL with NISTP enabled - 10,499 sign/sec This patch - 22,747 sign/sec 2.22X/2.16X speedup ECDSA verify: OpenSSL - 2,647 verify/sec OpenSSL with NISTP enabled - 4,384 verify/sec This patch - 7,622 verify/sec 2.88X/1.74X speedup ECDH: OpenSSL - 3,193 op/sec OpenSSL with NISTP enabled - 5,932 op/sec This patch - 10,288 op/sec 3.22X/1.73X speedup Two threads, single core: Sandy Bridge@3.4GHz: ECDSA sign: OpenSSL - 10,076 sign/sec OpenSSL with NISTP enabled - 8,911 sign/sec This patch - 19,901 sign/sec 2/2.2X speedup ECDSA verify: OpenSSL - 2,480 verify/sec OpenSSL with NISTP enabled - 3,840 verify/sec This patch - 7,056 verify/sec 2.8/1.8X speedup ECDH: OpenSSL - 2,983 op/sec OpenSSL with NISTP enabled - 5,290 op/sec This patch - 9,657 op/sec 3.2X/1.8X speedup Haswell@3.4GHz: ECDSA sign: OpenSSL - 11,004sign/sec OpenSSL with NISTP enabled - 10,153 sign/sec This patch - 25,094 sign/sec 2.3X/2.5X speedup ECDSA verify: OpenSSL - 2,598 verify/sec OpenSSL with NISTP enabled - 3,947 verify/sec This patch - 7,705 verify/sec 3X/2X speedup ECDH: OpenSSL - 3,166 op/sec OpenSSL with NISTP enabled - 5,387 op/sec This patch - 10,438 op/sec 3.3X/1.9X speedup Reference: [1] S. Gueron, V. Krasnov: Fast Prime Field Elliptic Curve Cryptography with 256 Bit Primes Developers and authors: *** Shay Gueron (1, 2), and Vlad Krasnov (1) (1) Intel Corporation, Israel Development Center, Haifa, Israel (2) University of Haifa, Israel *** Copyright(c) 2013, Intel Corp. - Intel Israel (74) Limited This e-mail and any attachments may contain confidential material for the sole use of the intended recipient(s). Any review or distribution by others is strictly prohibited. If you are not the intended recipient, please contact the sender and delete all copies. __ OpenSSL Project http://www.openssl.org Development Mailing List openssl-dev@openssl.org Automated List Manager majord...@openssl.org