Re: [openssl-dev] [openssl.org #3843] OpenSSL 1.0.1* and below: incorrect use of _lrotl()
Hi,

r...@openssl.org via RT wrote:
>>>> And linux-x86_64 won't work here, since it uses some instructions not supported by MIC.
>>>
>>> But all x86_64 modules feature run-time switch, when processor capabilities are detected [with cpuid] and code that can't be executed on any particular processor won't execute. Or do you mean that fails to *compile* it with -mmic? Or do you mean that cpuid doesn't work on mic? But I recall that there is cpuid...
>>
>> It fails to compile with -mmic:
>>
>> x86_64cpuid.s:165: Error: `pxor' is not supported on `k1om'
>
> I see, thanks. In other words, as it turns out my suggestion about run-time switch does not apply in this case, because minimum of SSE2 is actually *assumed* for x86_64 platform. And this doesn't hold true for Knights Corner. But it does hold true for Knights Landing, doesn't it? I see no point in attempting to accommodate assembler support for Knights Corner (too rare processor) and would appreciate if you could confirm if following works with 1.0.2:
>
> ./Configure linux-x86_64-icc no-asm -mmic
>
> BTW, _lrotl fix is applied to 1.0.1, but not earlier versions, which are open for security fixes only.

I can confirm that a clean build of openssl 1.0.2a using the above ./Configure line works for me. The resulting binary runs without issues.

JJK
___
openssl-dev mailing list
To unsubscribe: https://mta.openssl.org/mailman/listinfo/openssl-dev
Re: [openssl-dev] [openssl.org #3843] OpenSSL 1.0.1* and below: incorrect use of _lrotl()
>>> And linux-x86_64 won't work here, since it uses some instructions not supported by MIC.
>>
>> But all x86_64 modules feature run-time switch, when processor capabilities are detected [with cpuid] and code that can't be executed on any particular processor won't execute. Or do you mean that fails to *compile* it with -mmic? Or do you mean that cpuid doesn't work on mic? But I recall that there is cpuid...
>
> It fails to compile with -mmic:
>
> x86_64cpuid.s:165: Error: `pxor' is not supported on `k1om'

I see, thanks. In other words, as it turns out my suggestion about run-time switch does not apply in this case, because minimum of SSE2 is actually *assumed* for x86_64 platform. And this doesn't hold true for Knights Corner. But it does hold true for Knights Landing, doesn't it? I see no point in attempting to accommodate assembler support for Knights Corner (too rare processor) and would appreciate if you could confirm if following works with 1.0.2:

./Configure linux-x86_64-icc no-asm -mmic

BTW, _lrotl fix is applied to 1.0.1, but not earlier versions, which are open for security fixes only.
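The distinction drawn above, between a run-time switch and an instruction set that is assumed as the build baseline, can be sketched in C. This is only an illustration and not OpenSSL's actual dispatch code: the function names are hypothetical, the "SSE2" variant is a plain-C placeholder, and `__builtin_cpu_supports` is a GCC/Clang builtin used here as a stand-in for the cpuid probing mentioned in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*xor_fold_fn)(const uint32_t *, size_t);

/* Generic path: uses nothing beyond plain C. */
uint32_t xor_fold_generic(const uint32_t *p, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc ^= p[i];
    return acc;
}

/* Hypothetical placeholder for an SSE2 path. A real one would contain SSE2
 * instructions, and those must still assemble for the target even if this
 * function is never selected at run time. */
uint32_t xor_fold_sse2_placeholder(const uint32_t *p, size_t n)
{
    return xor_fold_generic(p, n);
}

/* Capability probe. Assumption: GCC or Clang on x86; elsewhere report 0. */
int cpu_has_sse2(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("sse2") != 0;
#else
    return 0;
#endif
}

/* The run-time switch: it only decides which code executes. */
xor_fold_fn select_xor_fold(void)
{
    return cpu_has_sse2() ? xor_fold_sse2_placeholder : xor_fold_generic;
}

/* Both paths must agree whatever the probe reports. */
void dispatch_self_check(void)
{
    const uint32_t d[3] = { 0x0fu, 0xf0u, 0xffu };
    assert(select_xor_fold()(d, 3) == 0u);
}
```

The switch protects execution, not translation: every variant is still compiled and assembled into the binary, which is why an assembler that rejects `pxor` for `k1om` stops the build before any run-time check can help, and why `no-asm` sidesteps the problem.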
Re: [openssl-dev] [openssl.org #3843] OpenSSL 1.0.1* and below: incorrect use of _lrotl()
On May 26, 2015, at 4:57 PM, Andy Polyakov ap...@openssl.org wrote:
>>>> And linux-x86_64 won't work here, since it uses some instructions not supported by MIC.
>>>
>>> But all x86_64 modules feature run-time switch, when processor capabilities are detected [with cpuid] and code that can't be executed on any particular processor won't execute. Or do you mean that fails to *compile* it with -mmic? Or do you mean that cpuid doesn't work on mic? But I recall that there is cpuid...
>>
>> It fails to compile with -mmic:
>>
>> x86_64cpuid.s:165: Error: `pxor' is not supported on `k1om'
>
> I see, thanks. In other words, as it turns out my suggestion about run-time switch does not apply in this case, because minimum of SSE2 is actually *assumed* for x86_64 platform. And this doesn't hold true for Knights Corner. But it does hold true for Knights Landing, doesn't it?

Yes, Knights Landing supposedly implements AVX512, which is backward compatible with older SIMD instructions.

> I see no point in attempting to accommodate assembler support for Knights Corner (too rare processor) and would appreciate if you could confirm if following works with 1.0.2:
>
> ./Configure linux-x86_64-icc no-asm -mmic

Yes, it works.

Solar, should I update JtR's README-MIC to switch back to using OpenSSL? BTW, I'm not sure if switching between OpenSSL and LibreSSL would cause performance variation.

Lei
Re: [openssl-dev] [openssl.org #3843] OpenSSL 1.0.1* and below: incorrect use of _lrotl()
Hi,

Thanks for tips and pointers. As for getting off-topic, I'm the one to blame anyway. So I'm going to strip most of message and comment on points that still might be of public interest.

>> (*) BTW, did you try existing [multi-block SHA]?
>
> No, totally missed it! Found it now, good work!
>
> $ find -name 'sha*-mb*'
> ./crypto/sha/asm/sha256-mb-x86_64.pl
> ./crypto/sha/asm/sha1-mb-x86_64.pl
>
> How is an application using OpenSSL supposed to access this functionality? Is there documentation? So far, I only found uses in OpenSSL's own e_aes_cbc_hmac_sha*.c and no export of these symbols.

Well, you have to admit that it's a bit too special to provide general-purpose interface to it. Which is why application-specific interface is provided instead, TLS-oriented one in e_aes_cbc_hmac_sha*.c. Mention of multi-block SHA was not really of the "go ahead and use it" kind, but rather "is it interesting?", with implied "if it is interesting, then we can discuss how to interface your application to it". Note that it's even possible to take those modules out of OpenSSL context...

> You could want to add optional use of XOP there - rotates and vcmov. For SHA-1, F() is just one vcmov and H() is vcmov/andnot/xor (see sse-intrinsics.c above). For SHA-2, we use:
>
> #define Maj(x,y,z) vcmov(x, y, vxor(z, y))
> #define Ch(x,y,z) vcmov(y, z, x)

As for XOP. Motto is to provide near-optimal performance with minimum code. That means that if some processor-specific optimization provides just little improvement, then it's likely to be omitted. I don't recall attempting XOP specifically in multi-block SHA256, but it was attempted in SHA1 and it wasn't impressive. I even recall XOP-rotates delivering worse performance in some case. It likely was some instruction alignment issue (at least I ran into some anomaly with ChaCha code when merely flipping order of instruction input arguments affected performance). Another case of XOP omission is plain SHA256. Point there is that execution is dominated by scalar part and reducing number of vector instructions has no effect whatsoever. Anyway, XOP is considered, but so far was not found worthy. But it makes sense to double-check specifically multi-block SHA256...

> We're also experimenting with instruction interleaving. Sometimes, especially when running only 1 thread/core (such as on cheaper Intel CPUs without HT, or when there's no thread-level parallelism in the application - not our case, though), it's optimal to interleave several SIMD computations, for even wider virtual SIMD vectors than the CPU supports natively. e.g. for MD5 on AVX (64-bit builds only, since need 16 registers for interleaving), we currently interleave 3 of those (so 12 MD5's in parallel per thread).

It's not uncommon that cryptographic algorithms have short dependency chains and consequently limited ILP, instruction-level parallelism. But then processors have limited resources too, and question is if those resources are sufficient to sustain the algorithmic ILP. Or rather vice versa, if processor has more resources than ILP, then resources will run underutilized. And naturally only then it makes sense to interleave instructions. Processor resources can be characterized by IPC, instructions per cycle, limit, and maximum possible improvement would be IPC/ILP. But one should remember that IPC is not just amount of execution ports, for example 4 on Haswell. Some instructions are port-specific and if algorithm uses such instructions a lot, you'll be limited by that port.

Anyway, MD5 is known for its low ILP and it does make sense to interleave it (with itself or other algorithm). This doesn't apply to SHA. It has higher ILP and no contemporary processor has capacity to fully utilize this parallelism. Actually it's a bit worse in practice, because thing about multi-block is that it's limited by shifts, which are port-specific. This is why you observe virtually no difference among desktop/server processors.

As for 4 Haswell ports. Of the 4 only 3 can execute vector instructions. So that absolutely best results can be achieved when you mix scalar integer-only and vector instructions, e.g. in addition to MD5 on AVX, mix in even scalar thread. Well, gain would have to be divided by ratio between how many blocks vector part processes vs. how many blocks scalar parts adds. So gain would be too little to care about. So it's more of a fun fact in the context.
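The Maj and Ch macros quoted above rewrite the two SHA-2 boolean functions as a single bit-select each. The scalar C sketch below checks that identity against the textbook definitions. It assumes that vcmov(a, b, sel) takes bits from a where sel is 1 and from b where sel is 0; `cmov32` is a hypothetical scalar stand-in for the vector instruction, not code from JtR or OpenSSL.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for a vector bit-select (assumed operand order):
 * result takes bits of a where sel is 1, bits of b where sel is 0. */
uint32_t cmov32(uint32_t a, uint32_t b, uint32_t sel)
{
    return (a & sel) | (b & ~sel);
}

/* Textbook SHA-2 definitions. */
uint32_t ch_ref(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) ^ (~x & z);
}

uint32_t maj_ref(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) ^ (x & z) ^ (y & z);
}

/* Bit-select forms, mirroring the macros quoted in the message. */
uint32_t ch_cmov(uint32_t x, uint32_t y, uint32_t z)
{
    return cmov32(y, z, x);      /* x chooses between y and z */
}

uint32_t maj_cmov(uint32_t x, uint32_t y, uint32_t z)
{
    return cmov32(x, y, z ^ y);  /* where y and z differ, x decides; else y */
}

/* These three words present all eight (x, y, z) bit combinations, so one
 * comparison per function covers the whole truth table. */
void cmov_self_check(void)
{
    const uint32_t x = 0xF0F0F0F0u, y = 0xCCCCCCCCu, z = 0xAAAAAAAAu;
    assert(ch_cmov(x, y, z) == ch_ref(x, y, z));
    assert(maj_cmov(x, y, z) == maj_ref(x, y, z));
}
```

With a native bit-select, Ch costs one instruction and Maj two (an xor plus the select), compared with several and/xor operations in the textbook form.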
Re: [openssl-dev] [openssl.org #3843] OpenSSL 1.0.1* and below: incorrect use of _lrotl()
> Yes, I added a new target linux-mic into Configure, which is slightly modified from linux-generic64. From the original patch:
>
> (...)
> linux-generic64,gcc:-DTERMIO -O3 -Wall::-D_REENTRANT::-ldl:SIXTY_FOUR_BIT_LONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR:${no_asm}:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR),
> +linux-mic,icc:-mmic -DTERMIO -O3 -Wall::-D_REENTRANT::-ldl:SIXTY_FOUR_BIT_LONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR:${no_asm}:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR),
> (...)

But what prevents you from 'env CC=icc ./Configure linux-generic64 -mmic'? Or same with linux-x86_64? Can you confirm if './Configure linux-x86_64-icc -mmic' works in 1.0.2?
Re: [openssl-dev] [openssl.org #3843] OpenSSL 1.0.1* and below: incorrect use of _lrotl()
On May 25, 2015, at 6:01 PM, Andy Polyakov ap...@openssl.org wrote:
>> Yes, I added a new target linux-mic into Configure, which is slightly modified from linux-generic64. From the original patch:
>>
>> (...)
>> linux-generic64,gcc:-DTERMIO -O3 -Wall::-D_REENTRANT::-ldl:SIXTY_FOUR_BIT_LONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR:${no_asm}:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR),
>> +linux-mic,icc:-mmic -DTERMIO -O3 -Wall::-D_REENTRANT::-ldl:SIXTY_FOUR_BIT_LONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR:${no_asm}:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR),
>> (...)
>
> But what prevents you from 'env CC=icc ./Configure linux-generic64 -mmic'? Or same with linux-x86_64? Can you confirm if './Configure linux-x86_64-icc -mmic' works in 1.0.2?

'CC="icc -mmic" ./Configure shared linux-generic64' works in 1.0.0. It's better than modifying Configure. I just didn't think of it. But it doesn't work in 1.0.2, getting some link error:

../libcrypto.so: undefined reference to `rc4_md5_enc'

And linux-x86_64 won't work here, since it uses some instructions not supported by MIC.

Lei
Re: [openssl-dev] [openssl.org #3843] OpenSSL 1.0.1* and below: incorrect use of _lrotl()
On May 26, 2015, at 12:01 AM, Andy Polyakov ap...@openssl.org wrote:
>>>> Yes, I added a new target linux-mic into Configure, which is slightly modified from linux-generic64. From the original patch:
>>>>
>>>> (...)
>>>> linux-generic64,gcc:-DTERMIO -O3 -Wall::-D_REENTRANT::-ldl:SIXTY_FOUR_BIT_LONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR:${no_asm}:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR),
>>>> +linux-mic,icc:-mmic -DTERMIO -O3 -Wall::-D_REENTRANT::-ldl:SIXTY_FOUR_BIT_LONG RC4_CHAR RC4_CHUNK DES_INT DES_UNROLL BF_PTR:${no_asm}:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR),
>>>> (...)
>>>
>>> But what prevents you from 'env CC=icc ./Configure linux-generic64 -mmic'? Or same with linux-x86_64? Can you confirm if './Configure linux-x86_64-icc -mmic' works in 1.0.2?
>>
>> 'CC="icc -mmic" ./Configure shared linux-generic64' works in 1.0.0. It's better than modifying Configure. I just didn't think of it. But it doesn't work in 1.0.2, getting some link error:
>>
>> ../libcrypto.so: undefined reference to `rc4_md5_enc'
>
> Yes, similar issue was reported in another context and it will be resolved. Meanwhile could you pass explicit no-asm to confirm that it's in *general* viable option for you.
>
>> And linux-x86_64 won't work here, since it uses some instructions not supported by MIC.
>
> But all x86_64 modules feature run-time switch, when processor capabilities are detected [with cpuid] and code that can't be executed on any particular processor won't execute. Or do you mean that fails to *compile* it with -mmic? Or do you mean that cpuid doesn't work on mic? But I recall that there is cpuid...

It fails to compile with -mmic:

x86_64cpuid.s:165: Error: `pxor' is not supported on `k1om'
(...)

Here 'pxor' is an MMX instruction, but MIC doesn't support MMX. MIC has its own 512-bit SIMD instruction set, which is not backward-compatible like AVX512.

Lei
Re: [openssl-dev] [openssl.org #3843] OpenSSL 1.0.1* and below: incorrect use of _lrotl()
Hi Andy,

Thank you for your reply! I am CC'ing Lei on mine.

On Wed, May 20, 2015 at 12:55:10PM +0200, Andy Polyakov via RT wrote:
> For reference. icc was not cared for for quite some time. Initially it was possible for me, by then university employee, to use it, but then they changed terms and it became impossible for me to maintain it. But I've just noticed they provide some starter version of something, I'll see...

Yes, this might be usable for you:

https://software.intel.com/en-us/qualify-for-free-software/opensourcecontributor

Intel provides select Intel Software Development Products at no cost to qualified open source contributors who are working on open source projects compliant with the Open Source Initiative (OSI).

> But linux-x86_64-icc is not present in and was never supported in pre-1.0.2.

Oh, I didn't realize that. Like I mentioned, we're actually building with icc for MIC. When we build with icc for x86_64 host, we typically simply link against the distro's gcc-built OpenSSL, so didn't run into this issue ourselves until we started building for MIC and thus had to make our own OpenSSL build with icc. (Indeed, I've been building OpenSSL from source on many other occasions, and as part of a distro too, but that's not with icc and unrelated to JtR project.)

> So you ought to provide custom line. This remark doesn't mean that fix can't be backported, but out of curiosity, what's your config line?

Currently, Lei put this into JtR -jumbo README-MIC:

Build LibreSSL (version 2.1.6):
$ cd libressl-2.1.6
$ ./configure CC="icc -mmic" --host=k1om-linux --prefix=$MIC
$ make && make install

The previous instructions were:

Build OpenSSL (version 1.0.0q):
$ cd openssl-1.0.0q
$ patch Configure $JOHN/src/unused/openssl.patch
$ ./Configure linux-mic shared --prefix=$MIC
$ make && make install

I'm not sure what was in $JOHN/src/unused/openssl.patch - I guess it had to add linux-mic support. Lei, please reply to all.

> Is assembly engaged? If so, how fast is it? Or is it so that you count on compiler to produce vector code that would process multiple inputs in parallel with SIMD?

We're using OpenSSL (or LibreSSL) as an easy but slower option, replacing it with our own SIMD code right in JtR tree whenever we can and where this makes sense. So we're not trying to optimize OpenSSL's code. It remains scalar and unmodified, and our use of it is just to have things working where we do not have optimized code yet or where we prefer simpler rather than faster code (such as for some lightweight precomputation in some rare cases where this makes sense). This varies by crypto primitive, but overall we currently have SIMD intrinsics code for MMX, SSE2+/AVX, XOP, AVX2, MIC/AVX-512, and for bitslice DES also for AltiVec and NEON.

One thing for which we still use OpenSSL's code in performance-critical manner is SSH key passphrase cracking (which involves RSA). There are probably many more examples like this, but this is a prominent one that comes to mind. There must be a lot of room for optimization here.

As to compiler auto-vectorization - no, we are not relying on it.

> On related note. What's Xeon Phi in this context? I mean are we talking about Knights Corner

Unfortunately, yes. BTW, you're welcome to play with it if you like:

http://openwall.info/wiki/HPC/Village

> (that features own compatible-with-nothing SIMD instruction set)

Yes, but at source code level many intrinsics match AVX-512. So we use it as a way to prepare for AVX-512. In many cases, it's just a recompile away. There are some notable exceptions to this, though - in fact, you happened to list some below.

> or Knights Landing (that features AVX512)? If latter, it might be interesting to extend multi-block SHA support(*), which should allow to achieve pretty cool results (with vector rotate and ternary logic instructions, not to mention 16 lanes:-). [As for interesting. It's possible but not really interesting in Knights Corner case, because effort is too specific, just a single obscure and hardly available CPU, while AVX512 is planned even for other processors so that code will be reusable.]

This will take some #ifdef's to provide vector rotates as a macro when building for MIC and to use the ternary logic intrinsics only when building for true AVX-512 - nasty, but I think reasonable. For now, we're simply using the common subset between MIC and AVX-512:

https://github.com/magnumripper/JohnTheRipper/blob/bleeding-jumbo/src/pseudo_intrinsics.h
https://github.com/magnumripper/JohnTheRipper/blob/bleeding-jumbo/src/sse-intrinsics.c

> (*) BTW, did you try existing one?

No, totally missed it! Found it now, good work!

$ find -name 'sha*-mb*'
./crypto/sha/asm/sha256-mb-x86_64.pl
./crypto/sha/asm/sha1-mb-x86_64.pl

How is an application using OpenSSL supposed to access this functionality? Is there documentation? So far, I only found uses in OpenSSL's own e_aes_cbc_hmac_sha*.c and no export of these symbols.

You could
Re: [openssl-dev] [openssl.org #3843] OpenSSL 1.0.1* and below: incorrect use of _lrotl()
Hi,

For reference. icc was not cared for for quite some time. Initially it was possible for me, by then university employee, to use it, but then they changed terms and it became impossible for me to maintain it. But I've just noticed they provide some starter version of something, I'll see...

> Lei Zhang (re)discovered that OpenSSL 1.0.1* and below gets miscompiled, resulting in incorrect computation of at least SHA-1 hashes (and probably SHA-0, MD4, MD5) when it's compiled with icc for 64-bit Linux (x86_64 or mic), but not for Windows. The problem is already fixed in 1.0.2 and in LibreSSL.
>
> The problem is that OpenSSL uses the _lrotl() intrinsic to rotate 32-bit integers, whereas it is defined to operate on unsigned long, which obviously is 64-bit on many platforms.
>
> Lei's report:
> http://www.openwall.com/lists/john-dev/2015/03/26/1
>
> A previous report (from 2011):
> https://software.intel.com/en-us/articles/openssl-generates-incorrect-shamd5-value-if-built-with-icc-compiler
>
> I suggest that this be fixed for all currently supported branches of OpenSSL. For now, Lei switched to using LibreSSL in our John the Ripper -jumbo builds for Xeon Phi, but we'd like to (re-)include instructions for building with OpenSSL as well.

But linux-x86_64-icc is not present in and was never supported in pre-1.0.2. So you ought to provide custom line. This remark doesn't mean that fix can't be backported, but out of curiosity, what's your config line? Is assembly engaged? If so, how fast is it? Or is it so that you count on compiler to produce vector code that would process multiple inputs in parallel with SIMD?

On related note. What's Xeon Phi in this context? I mean are we talking about Knights Corner (that features own compatible-with-nothing SIMD instruction set) or Knights Landing (that features AVX512)? If latter, it might be interesting to extend multi-block SHA support(*), which should allow to achieve pretty cool results (with vector rotate and ternary logic instructions, not to mention 16 lanes:-). [As for interesting. It's possible but not really interesting in Knights Corner case, because effort is too specific, just a single obscure and hardly available CPU, while AVX512 is planned even for other processors so that code will be reusable.]

(*) BTW, did you try existing one?
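The miscompilation described in the quoted report comes down to rotating a 32-bit value as if it were 64 bits wide. The C sketch below models that effect without calling the real intrinsic: `rotl64_truncated` is a hypothetical helper that imitates a rotate on a 64-bit unsigned long whose result is then stored back into 32 bits, and `rotl32` is one common portable form of the intended operation, not necessarily the one OpenSSL adopted.

```c
#include <assert.h>
#include <stdint.h>

/* Portable 32-bit rotate-left. The count is masked so both shifts stay
 * within the defined range, including n == 0. */
uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Model of the bug: rotate in a 64-bit container, then truncate.
 * Bits shifted out of bit 31 land in the upper half and are lost. */
uint32_t rotl64_truncated(uint32_t x, unsigned n)
{
    uint64_t wide = x;
    n &= 63;
    wide = (wide << n) | (wide >> ((64 - n) & 63));
    return (uint32_t)wide;
}

/* The two only agree when no set bit crosses the 32-bit boundary. */
void rotate_self_check(void)
{
    assert(rotl32(0x00000001u, 4) == rotl64_truncated(0x00000001u, 4));
    assert(rotl32(0x80000001u, 1) != rotl64_truncated(0x80000001u, 1));
}
```

For `0x80000001` rotated by 1, the 32-bit rotate wraps the top bit around and yields `0x00000003`, while the widened version yields `0x00000002`. A hash round function built on the second form would silently produce wrong digests, which is consistent with the symptom reported for the icc builds on 64-bit Linux.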