Re: [openssl-dev] [openssl.org #3843] OpenSSL 1.0.1* and below: incorrect use of _lrotl()

2015-05-27 Thread Jan Just Keijser via RT
Hi,

r...@openssl.org via RT wrote:
 And linux-x86_64 won't work here, since it uses some instructions not 
 supported by MIC. 
 
 But all x86_64 modules feature run-time switch, when processor
 capabilities are detected [with cpuid] and code that can't be executed
 on any particular processor won't execute. Or do you mean that fails to
 *compile* it with -mmic? Or do you mean that cpuid doesn't work on mic?
 But I recall that there is cpuid...
   
 It fails to compile with -mmic:
 x86_64cpuid.s:165: Error: `pxor' is not supported on `k1om'
 

 I see, thanks. In other words, as it turns out my suggestion about
 run-time switch does not apply in this case, because minimum of SSE2 is
 actually *assumed* for x86_64 platform. And this doesn't hold true for
 Knights Corner. But it does hold true for Knights Landing, doesn't it? I
 see no point in attempting to accommodate assembler support for Knights
 Corner (too rare processor) and would appreciate if you could confirm if
 following works with 1.0.2:

 ./Configure linux-x86_64-icc no-asm -mmic

 BTW, _lrotl fix is applied to 1.0.1, but not earlier versions, which are
 open for security fixes only.

   
I can confirm that a clean build of openssl 1.0.2a using the above 
./Configure line works for me. The resulting binary runs without issues.

JJK


___
openssl-dev mailing list
To unsubscribe: https://mta.openssl.org/mailman/listinfo/openssl-dev


Re: [openssl-dev] [openssl.org #3843] OpenSSL 1.0.1* and below: incorrect use of _lrotl()

2015-05-26 Thread r...@openssl.org via RT
 And linux-x86_64 won't work here, since it uses some instructions not 
 supported by MIC. 

 But all x86_64 modules feature run-time switch, when processor
 capabilities are detected [with cpuid] and code that can't be executed
 on any particular processor won't execute. Or do you mean that fails to
 *compile* it with -mmic? Or do you mean that cpuid doesn't work on mic?
 But I recall that there is cpuid...
 
 It fails to compile with -mmic:
 x86_64cpuid.s:165: Error: `pxor' is not supported on `k1om'

I see, thanks. In other words, as it turns out my suggestion about
run-time switch does not apply in this case, because minimum of SSE2 is
actually *assumed* for x86_64 platform. And this doesn't hold true for
Knights Corner. But it does hold true for Knights Landing, doesn't it? I
see no point in attempting to accommodate assembler support for Knights
Corner (too rare processor) and would appreciate if you could confirm if
following works with 1.0.2:

./Configure linux-x86_64-icc no-asm -mmic

BTW, _lrotl fix is applied to 1.0.1, but not earlier versions, which are
open for security fixes only.




Re: [openssl-dev] [openssl.org #3843] OpenSSL 1.0.1* and below: incorrect use of _lrotl()

2015-05-26 Thread Lei Zhang via RT

 On May 26, 2015, at 4:57 PM, Andy Polyakov ap...@openssl.org wrote:
 
 And linux-x86_64 won't work here, since it uses some instructions not 
 supported by MIC. 
 
 But all x86_64 modules feature run-time switch, when processor
 capabilities are detected [with cpuid] and code that can't be executed
 on any particular processor won't execute. Or do you mean that fails to
 *compile* it with -mmic? Or do you mean that cpuid doesn't work on mic?
 But I recall that there is cpuid...
 
 It fails to compile with -mmic:
 x86_64cpuid.s:165: Error: `pxor' is not supported on `k1om'
 
 I see, thanks. In other words, as it turns out my suggestion about
 run-time switch does not apply in this case, because minimum of SSE2 is
 actually *assumed* for x86_64 platform. And this doesn't hold true for
 Knights Corner. But it does hold true for Knights Landing, doesn't it?

Yes, Knights Landing supposedly implements AVX512, which is backward compatible 
with older SIMD instructions.

 I see no point in attempting to accommodate assembler support for Knights
 Corner (too rare processor) and would appreciate if you could confirm if
 following works with 1.0.2:
 
 ./Configure linux-x86_64-icc no-asm -mmic

Yes, it works. 

Solar, should I update JtR's README-MIC to switch back to using OpenSSL? BTW, I'm 
not sure if switching between OpenSSL and LibreSSL would cause performance 
variation.


Lei




Re: [openssl-dev] [openssl.org #3843] OpenSSL 1.0.1* and below: incorrect use of _lrotl()

2015-05-25 Thread Andy Polyakov via RT
Hi,

Thanks for tips and pointers. As for getting off-topic, I'm the one to
blame anyway. So I'm going to strip most of message and comment on
points that still might be of public interest.

 (*) BTW, did you try existing [multi-block SHA]?
 
 No, totally missed it!  Found it now, good work!
 
 $ find -name 'sha*-mb*'
 ./crypto/sha/asm/sha256-mb-x86_64.pl
 ./crypto/sha/asm/sha1-mb-x86_64.pl
 
 How is an application using OpenSSL supposed to access this
 functionality?  Is there documentation?  So far, I only found uses in
 OpenSSL's own e_aes_cbc_hmac_sha*.c and no export of these symbols.

Well, you have to admit that it's a bit too special to provide
general-purpose interface to it. Which is why application-specific
interface is provided instead, TLS-oriented one in
e_aes_cbc_hmac_sha*.c. Mention of multi-block SHA was not really of the
"go ahead and use it" kind, but rather "is it interesting?", with an
implied "if it is interesting, then we can discuss how to interface your
application to it". Note that it's even possible to take those modules
out of OpenSSL context...

 You could want to add optional use of XOP there - rotates and vcmov.
 For SHA-1, F() is just one vcmov and H() is vcmov/andnot/xor (see
 sse-intrinsics.c above).  For SHA-2, we use:
 
 #define Maj(x,y,z) vcmov(x, y, vxor(z, y))
 #define Ch(x,y,z) vcmov(y, z, x)
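
For reference, the vcmov-based macros above can be checked against the textbook SHA-2 definitions with a scalar model. This is only a sketch: sel() is a hypothetical stand-in that assumes vcmov(a, b, c) computes (a & c) | (b & ~c), as XOP's VPCMOV does, and all function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of vcmov(a, b, c): take bits of a where the
 * selector c is 1 and bits of b where it is 0 (assumed VPCMOV semantics). */
static uint32_t sel(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & c) | (b & ~c);
}

/* Textbook SHA-2 definitions. */
static uint32_t ch_ref(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) ^ (~x & z);
}

static uint32_t maj_ref(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) ^ (x & z) ^ (y & z);
}

/* Select-based forms mirroring the quoted macros. */
static uint32_t ch_sel(uint32_t x, uint32_t y, uint32_t z)
{
    return sel(y, z, x);        /* x chooses between y and z */
}

static uint32_t maj_sel(uint32_t x, uint32_t y, uint32_t z)
{
    return sel(x, y, z ^ y);    /* where y and z disagree, x decides */
}

/* One triple whose bit patterns cover all eight input combinations. */
static void check_identities(void)
{
    const uint32_t x = 0xF0F0F0F0u, y = 0xCCCCCCCCu, z = 0xAAAAAAAAu;
    assert(ch_sel(x, y, z) == ch_ref(x, y, z));
    assert(maj_sel(x, y, z) == maj_ref(x, y, z));
}
```

Because the three constants in check_identities() enumerate every combination of input bits, agreement on that single triple covers the whole truth table of both functions.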

As for XOP. Motto is to provide near-optimal performance with minimum
code. That means that if some processor-specific optimization provides
just little improvement, then it's likely to be omitted. I don't recall
attempting XOP specifically in multi-block SHA256, but it was attempted
in SHA1 and it wasn't impressive. I even recall XOP-rotates delivering
worse performance in some case. It likely was some instruction alignment
issue (at least I ran into some anomaly with ChaCha code when merely
flipping order of instruction input arguments affected performance).
Another case of XOP omission is plain SHA256. Point there is that
execution is dominated by scalar part and reducing number of
vector instruction has no effect whatsoever. Anyway, XOP is considered,
but so far was not found worthy. But it makes sense to double-check
specifically multi-block SHA256...

 We're also experimenting with instruction interleaving.  Sometimes,
 especially when running only 1 thread/core (such as on cheaper Intel
 CPUs without HT, or when there's no thread-level parallelism in the
 application - not our case, though), it's optimal to interleave several
 SIMD computations, for even wider virtual SIMD vectors than the CPU
 supports natively.  e.g. for MD5 on AVX (64-bit builds only, since need
 16 registers for interleaving), we currently interleave 3 of those (so
 12 MD5's in parallel per thread).

It's not uncommon that cryptographic algorithms have short dependency
chains and consequently limited ILP, instruction-level parallelism. But
then processors have limited resources too, and question is if those
resources are sufficient to sustain the algorithmic ILP. Or rather vice
versa, if processor has more resources than ILP, then resources will run
underutilized. And naturally only then it makes sense to interleave
instructions. Processor resources can be characterized by IPC,
instructions per cycle, limit, and maximum possible improvement would be
IPC/ILP. But one should remember that IPC is not just amount of
execution ports, for example 4 on Haswell. Some instructions are
port-specific and if algorithm uses such instructions a lot, you'll be
limited by that port. Anyway, MD5 is known for its low ILP and it does
make sense to interleave it (with itself or other algorithm). This
doesn't apply to SHA. It has higher ILP and no contemporary processor
has capacity to fully utilize this parallelism. Actually it's a bit
worse in practice, because thing about multi-block is that it's limited
by shifts, which are port-specific. This is why you observe virtually no
difference among desktop/server processors.
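
As a source-level illustration of interleaving, here is a minimal C sketch. toy_round() is a made-up serial mixing step, not MD5 or any real hash, and all names are hypothetical. The point is only that two independent dependency chains placed in one loop give the processor work to overlap; whether that helps in practice depends on the compiler and on the port limits described above.

```c
#include <assert.h>
#include <stdint.h>

/* Made-up serial mixing step: each call depends on the previous state. */
static uint32_t toy_round(uint32_t s, uint32_t m)
{
    s += m;
    s = (s << 7) | (s >> 25);   /* 32-bit rotate left by 7 */
    return s ^ 0x9e3779b9u;
}

/* A single dependency chain: every round waits for the one before it. */
static uint32_t toy_chain(uint32_t s, const uint32_t *msg, int n)
{
    for (int i = 0; i < n; i++)
        s = toy_round(s, msg[i]);
    return s;
}

/* Two independent chains interleaved in one loop body. The rounds for
 * s0 and s1 do not depend on each other, so they can overlap. */
static void toy_chain_x2(uint32_t out[2], uint32_t s0, uint32_t s1,
                         const uint32_t *m0, const uint32_t *m1, int n)
{
    for (int i = 0; i < n; i++) {
        s0 = toy_round(s0, m0[i]);
        s1 = toy_round(s1, m1[i]);
    }
    out[0] = s0;
    out[1] = s1;
}

/* Interleaving must not change the results, only the schedule. */
static void check_interleave(void)
{
    const uint32_t a[4] = {1, 2, 3, 4};
    const uint32_t b[4] = {5, 6, 7, 8};
    uint32_t out[2];

    toy_chain_x2(out, 0x67452301u, 0xefcdab89u, a, b, 4);
    assert(out[0] == toy_chain(0x67452301u, a, 4));
    assert(out[1] == toy_chain(0xefcdab89u, b, 4));
}
```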

As for 4 Haswell ports. Of the 4 only 3 can execute vector instructions.
So that absolutely best results can be achieved when you mix scalar
integer-only and vector instructions, e.g. in addition to MD5 on AVX,
mix in even a scalar thread. Well, the gain would have to be divided by
the ratio between how many blocks the vector part processes vs. how many
blocks the scalar part adds. So the gain would be too little to care about. So it's
more of a fun fact in the context.




Re: [openssl-dev] [openssl.org #3843] OpenSSL 1.0.1* and below: incorrect use of _lrotl()

2015-05-25 Thread r...@openssl.org via RT
 Yes, I added a new target linux-mic into Configure, which is slightly 
 modified from linux-generic64.
 
 From the original patch:
 
 (...)
  linux-generic64,gcc:-DTERMIO -O3 
 -Wall::-D_REENTRANT::-ldl:SIXTY_FOUR_BIT_LONG RC4_CHAR RC4_CHUNK DES_INT 
 DES_UNROLL 
 BF_PTR:${no_asm}:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR),
 +linux-mic,icc:-mmic -DTERMIO -O3 
 -Wall::-D_REENTRANT::-ldl:SIXTY_FOUR_BIT_LONG RC4_CHAR RC4_CHUNK DES_INT 
 DES_UNROLL 
 BF_PTR:${no_asm}:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR),
 (...)

But what prevents you from 'env CC=icc ./Configure linux-generic64
-mmic'? Or same with linux-x86_64? Can you confirm if './Configure
linux-x86_64-icc -mmic' works in 1.0.2?




Re: [openssl-dev] [openssl.org #3843] OpenSSL 1.0.1* and below: incorrect use of _lrotl()

2015-05-25 Thread Lei Zhang via RT

 On May 25, 2015, at 6:01 PM, Andy Polyakov ap...@openssl.org wrote:
 
 Yes, I added a new target linux-mic into Configure, which is slightly 
 modified from linux-generic64.
 
 From the original patch:
 
 (...)
 linux-generic64,gcc:-DTERMIO -O3 
 -Wall::-D_REENTRANT::-ldl:SIXTY_FOUR_BIT_LONG RC4_CHAR RC4_CHUNK DES_INT 
 DES_UNROLL 
 BF_PTR:${no_asm}:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR),
 +linux-mic,icc:-mmic -DTERMIO -O3 
 -Wall::-D_REENTRANT::-ldl:SIXTY_FOUR_BIT_LONG RC4_CHAR RC4_CHUNK DES_INT 
 DES_UNROLL 
 BF_PTR:${no_asm}:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR),
 (...)
 
 But what prevents you from 'env CC=icc ./Configure linux-generic64
 -mmic'? Or same with linux-x86_64? Can you confirm if './Configure
 linux-x86_64-icc -mmic' works in 1.0.2?

'CC="icc -mmic" ./Configure shared linux-generic64' works in 1.0.0. It's better 
than modifying Configure. I just didn't think of it. 

But it doesn't work in 1.0.2, getting some link error:
../libcrypto.so: undefined reference to `rc4_md5_enc'

And linux-x86_64 won't work here, since it uses some instructions not supported 
by MIC. 


Lei



Re: [openssl-dev] [openssl.org #3843] OpenSSL 1.0.1* and below: incorrect use of _lrotl()

2015-05-25 Thread Lei Zhang via RT

 On May 26, 2015, at 12:01 AM, Andy Polyakov ap...@openssl.org wrote:
 
 Yes, I added a new target linux-mic into Configure, which is slightly 
 modified from linux-generic64.
 
 From the original patch:
 
 (...)
 linux-generic64,gcc:-DTERMIO -O3 
 -Wall::-D_REENTRANT::-ldl:SIXTY_FOUR_BIT_LONG RC4_CHAR RC4_CHUNK DES_INT 
 DES_UNROLL 
 BF_PTR:${no_asm}:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR),
 +linux-mic,icc:-mmic -DTERMIO -O3 
 -Wall::-D_REENTRANT::-ldl:SIXTY_FOUR_BIT_LONG RC4_CHAR RC4_CHUNK DES_INT 
 DES_UNROLL 
 BF_PTR:${no_asm}:dlfcn:linux-shared:-fPIC::.so.\$(SHLIB_MAJOR).\$(SHLIB_MINOR),
 (...)
 
 But what prevents you from 'env CC=icc ./Configure linux-generic64
 -mmic'? Or same with linux-x86_64? Can you confirm if './Configure
 linux-x86_64-icc -mmic' works in 1.0.2?
 
 'CC="icc -mmic" ./Configure shared linux-generic64' works in 1.0.0. It's 
 better than modifying Configure. I just didn't think of it. 
 
 But it doesn't work in 1.0.2, getting some link error:
 ../libcrypto.so: undefined reference to `rc4_md5_enc'
 
 Yes, similar issue was reported in another context and it will be
 resolved. Meanwhile could you pass explicit no-asm to confirm that it's
 in *general* viable option for you.
 
 And linux-x86_64 won't work here, since it uses some instructions not 
 supported by MIC. 
 
 But all x86_64 modules feature run-time switch, when processor
 capabilities are detected [with cpuid] and code that can't be executed
 on any particular processor won't execute. Or do you mean that fails to
 *compile* it with -mmic? Or do you mean that cpuid doesn't work on mic?
 But I recall that there is cpuid...

It fails to compile with -mmic:
x86_64cpuid.s:165: Error: `pxor' is not supported on `k1om'
(...)

Here 'pxor' is an MMX instruction, but MIC doesn't support MMX. MIC has its own 
512-bit SIMD instruction set which, unlike AVX512, is not backward-compatible.


Lei




Re: [openssl-dev] [openssl.org #3843] OpenSSL 1.0.1* and below: incorrect use of _lrotl()

2015-05-24 Thread Solar Designer via RT
Hi Andy,

Thank you for your reply!  I am CC'ing Lei on mine.

On Wed, May 20, 2015 at 12:55:10PM +0200, Andy Polyakov via RT wrote:
 For reference. icc was not cared for for quite some time. Initially it
 was possible for me, by then university employee, to use it, but then
 they changed terms and it became impossible for me to maintain it. But
 I've just noticed they provide some starter version of something, I'll
 see...

Yes, this might be usable for you:

https://software.intel.com/en-us/qualify-for-free-software/opensourcecontributor

Intel provides select Intel Software Development Products at no cost to
qualified open source contributors who are working on open source
projects compliant with the Open Source Initiative (OSI).

 But linux-x86_64-icc is not present in and was never supported in
 pre-1.0.2.

Oh, I didn't realize that.  Like I mentioned, we're actually building
with icc for MIC.  When we build with icc for x86_64 host, we typically
simply link against the distro's gcc-built OpenSSL, so didn't run into
this issue ourselves until we started building for MIC and thus had to
make our own OpenSSL build with icc.  (Indeed, I've been building
OpenSSL from source on many other occasions, and as part of a distro
too, but that's not with icc and unrelated to JtR project.)

 So you ought to provide custom line. This remark doesn't mean
 that fix can't be backported, but out of curiosity, what's your config
 line?

Currently, Lei put this into JtR -jumbo README-MIC:

Build LibreSSL (version 2.1.6):
$ cd libressl-2.1.6
$ ./configure CC="icc -mmic" --host=k1om-linux --prefix=$MIC
$ make && make install

The previous instructions were:

Build OpenSSL (version 1.0.0q):
$ cd openssl-1.0.0q
$ patch Configure < $JOHN/src/unused/openssl.patch
$ ./Configure linux-mic shared --prefix=$MIC
$ make && make install

I'm not sure what was in $JOHN/src/unused/openssl.patch - I guess it had
to add linux-mic support.  Lei, please reply to all.

 Is assembly engaged? If so, how fast is it? Or is it so that you
 count on compiler to produce vector code that would process multiple
 inputs in parallel with SIMD?

We're using OpenSSL (or LibreSSL) as an easy but slower option,
replacing it with our own SIMD code right in JtR tree whenever we can
and where this makes sense.  So we're not trying to optimize OpenSSL's
code.  It remains scalar and unmodified, and our use of it is just to
have things working where we do not have optimized code yet or where we
prefer simpler rather than faster code (such as for some lightweight
precomputation in some rare cases where this makes sense).

This varies by crypto primitive, but overall we currently have SIMD
intrinsics code for MMX, SSE2+/AVX, XOP, AVX2, MIC/AVX-512, and for
bitslice DES also for AltiVec and NEON.

One thing for which we still use OpenSSL's code in performance-critical
manner is SSH key passphrase cracking (which involves RSA).  There are
probably many more examples like this, but this is a prominent one that
comes to mind.  There must be a lot of room for optimization here.

As to compiler auto-vectorization - no, we are not relying on it.

 On related note. What's Xeon Phi in this context? I mean are we talking
 about Knights Corner

Unfortunately, yes.  BTW, you're welcome to play with it if you like:

http://openwall.info/wiki/HPC/Village

 (that features own compatible-with-nothing SIMD instruction set)

Yes, but at source code level many intrinsics match AVX-512.  So we use
it as a way to prepare for AVX-512.  In many cases, it's just a
recompile away.  There are some notable exceptions to this, though - in
fact, you happened to list some below.

 or Knights Landing (that features AVX512)? If latter,
 it might be interesting to extend multi-block SHA support(*), which
 should allow to achieve pretty cool results (with vector rotate and
 ternary logic instructions, not to mention 16 lanes:-). [As for
 interesting. It's possible but not really interesting in Knights
 Corner case, because effort is too specific, just a single obscure and
 hardly available CPU, while AVX512 is planned even for other processors
 so that code will be reusable.]

This will take some #ifdef's to provide vector rotates as a macro when
building for MIC and to use the ternary logic intrinsics only when
building for true AVX-512 - nasty, but I think reasonable.  For now,
we're simply using the common subset between MIC and AVX-512:

https://github.com/magnumripper/JohnTheRipper/blob/bleeding-jumbo/src/pseudo_intrinsics.h
https://github.com/magnumripper/JohnTheRipper/blob/bleeding-jumbo/src/sse-intrinsics.c
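
The #ifdef arrangement described above might look roughly like the following C sketch. It is not the code from the linked files: the macro names V_ROTL32 and V_SELECT are hypothetical, the __MIC__ branch assumes (per the discussion) that Knights Corner has neither a vector rotate nor ternary logic and so builds both from shifts and logic, and the scalar fallback exists only so that the sketch compiles and can be checked on any machine.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__AVX512F__)
#include <immintrin.h>
typedef __m512i vec;
/* True AVX-512: native 32-bit vector rotate (n must be a constant). */
#define V_ROTL32(x, n)    _mm512_rol_epi32((x), (n))
/* Ternary logic with truth table 0xCA computes bitwise (m ? a : b). */
#define V_SELECT(m, a, b) _mm512_ternarylogic_epi32((m), (a), (b), 0xCA)

#elif defined(__MIC__)
#include <immintrin.h>
typedef __m512i vec;
/* Knights Corner (assumption): emulate rotate with two shifts and an OR. */
#define V_ROTL32(x, n) \
    _mm512_or_epi32(_mm512_slli_epi32((x), (n)), \
                    _mm512_srli_epi32((x), 32 - (n)))
/* Emulate select with and / andnot / or. */
#define V_SELECT(m, a, b) \
    _mm512_or_epi32(_mm512_and_epi32((m), (a)), \
                    _mm512_andnot_epi32((m), (b)))

#else
/* Scalar fallback with one 32-bit lane, valid for 0 < n < 32. */
typedef uint32_t vec;
#define V_ROTL32(x, n)    ((vec)(((x) << (n)) | ((x) >> (32 - (n)))))
#define V_SELECT(m, a, b) ((vec)(((m) & (a)) | (~(m) & (b))))

/* Sanity check for the fallback definitions only. */
static void check_scalar_fallback(void)
{
    assert(V_ROTL32((vec)0x80000001u, 1) == 0x00000003u);
    assert(V_SELECT((vec)0xFFFF0000u, (vec)0x12345678u,
                    (vec)0x9ABCDEF0u) == 0x1234DEF0u);
}
#endif
```

Code written against V_ROTL32 and V_SELECT would then stay the same across the three builds, with only the macro bodies changing.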

 (*) BTW, did you try existing one?

No, totally missed it!  Found it now, good work!

$ find -name 'sha*-mb*'
./crypto/sha/asm/sha256-mb-x86_64.pl
./crypto/sha/asm/sha1-mb-x86_64.pl

How is an application using OpenSSL supposed to access this
functionality?  Is there documentation?  So far, I only found uses in
OpenSSL's own e_aes_cbc_hmac_sha*.c and no export of these symbols.

You could 

Re: [openssl-dev] [openssl.org #3843] OpenSSL 1.0.1* and below: incorrect use of _lrotl()

2015-05-20 Thread Andy Polyakov via RT
Hi,

For reference. icc was not cared for for quite some time. Initially it
was possible for me, by then university employee, to use it, but then
they changed terms and it became impossible for me to maintain it. But
I've just noticed they provide some starter version of something, I'll
see...

 Lei Zhang (re)discovered that OpenSSL 1.0.1* and below gets miscompiled,
 resulting in incorrect computation of at least SHA-1 hashes (and probably
 SHA-0, MD4, MD5) when it's compiled with icc for 64-bit Linux (x86_64 or
 mic), but not for Windows. The problem is already fixed in 1.0.2 and in
 LibreSSL.
 
 The problem is that OpenSSL uses the _lrotl() intrinsic to rotate 32-bit
 integers, whereas it is defined to operate on unsigned long, which
 obviously is 64-bit on many platforms.
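
To make the failure mode concrete, here is a minimal C sketch. It does not call _lrotl() itself: rotl_via_64() is a hypothetical model of what happens when a 32-bit value is rotated as a 64-bit unsigned long and the result is truncated back to 32 bits, and rotl32() is a portable 32-bit rotate for comparison.

```c
#include <assert.h>
#include <stdint.h>

/* Portable 32-bit rotate left; the masks keep every shift count in range. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
}

/* Model (assumption): rotate performed on a 64-bit operand, then the
 * result is truncated to the 32 bits the caller expected. */
static uint32_t rotl_via_64(uint32_t x, unsigned n)
{
    uint64_t w = x;             /* upper 32 bits are zero */
    n &= 63;
    w = (w << n) | (w >> ((64 - n) & 63));
    return (uint32_t)w;         /* bits shifted past bit 31 are lost */
}

/* The two rotates disagree as soon as a set bit crosses bit 31. */
static void check_rotate_mismatch(void)
{
    assert(rotl32(0x80000001u, 1) == 0x00000003u);
    assert(rotl_via_64(0x80000001u, 1) == 0x00000002u);
}
```

For x = 0x80000001 and n = 1, the 32-bit rotate wraps the top bit around to bit 0, while the 64-bit model moves it into bit 32, where truncation discards it. A hash round function that relies on the wrapped bit would then produce a wrong digest.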
 
 Lei's report:
 
 http://www.openwall.com/lists/john-dev/2015/03/26/1
 
 A previous report (from 2011):
 
 https://software.intel.com/en-us/articles/openssl-generates-incorrect-shamd5-value-if-built-with-icc-compiler
 
 I suggest that this be fixed for all currently supported branches of
 OpenSSL.  For now, Lei switched to using LibreSSL in our John the Ripper
 -jumbo builds for Xeon Phi, but we'd like to (re-)include instructions
 for building with OpenSSL as well.

But linux-x86_64-icc is not present in and was never supported in
pre-1.0.2. So you ought to provide custom line. This remark doesn't mean
that fix can't be backported, but out of curiosity, what's your config
line? Is assembly engaged? If so, how fast is it? Or is it so that you
count on compiler to produce vector code that would process multiple
inputs in parallel with SIMD?

On related note. What's Xeon Phi in this context? I mean are we talking
about Knights Corner (that features own compatible-with-nothing SIMD
instruction set) or Knights Landing (that features AVX512)? If latter,
it might be interesting to extend multi-block SHA support(*), which
should allow to achieve pretty cool results (with vector rotate and
ternary logic instructions, not to mention 16 lanes:-). [As for
interesting. It's possible but not really interesting in Knights
Corner case, because effort is too specific, just a single obscure and
hardly available CPU, while AVX512 is planned even for other processors
so that code will be reusable.]

(*) BTW, did you try existing one?

