Re: [openssl.org #1447] [bug] 0.9.8d: rc4 cpuid test broken on dual core cpus

2007-01-09 Thread Andy Polyakov

it seems somewhat fortunate that core2 CPUs track the p4 behaviour
w.r.t. these two rc4 implementations.  here are the core2 results with the
stock code / HT test:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
rc4            166799.58k   180552.87k   182437.93k   183381.67k   183206.87k

for the record, core2 64-bit code seriously underperforming the 32-bit
code...  here's the 32-bit results (with cpuid test enabled):

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
rc4            254164.64k   279901.10k   279364.38k   283617.62k   276690.26k

... The key feature in 32-bit code with cpuid test is that corresponding loop
is not unrolled. Can you test following in *64-bit* build on Core2 hardware.
Open rc4-x86_64.pl in text editor and make jump to .Lcloop1 at line 154
unconditional, i.e. replace jz to jmp. make, benchmark and report back. A.


small improvement...

i think this hints that the problem with the unrolled code is the manual
load/store alias avoidance -- there's fancy new hardware in core2 for
dealing with this (obviously it's not fancy enough :)... and it seems
the 32-bit code pushes the alias problem onto the hardware.


But .Lcloop1 is folded and doesn't avoid aliasing.


oh and i tried using cmove with no luck either.

bizarre... i think i copied the 32-bit code into the 64-bit Lcloop1 case 
and it's still not performing like it does in 32-bit...


Fresh optimization manual says that targeting 32 bits of a register and 
then using all 64 bits incurs extra μop (look for sign extension to 
full 64-bits). Could you try to remove occurrences of #d in movzb 
instructions in .Lcloop1 body? Naturally keeping unconditional jmp 
.Lcloop1 as suggested above. It's also possible to compress the loop 
body by moving variables to upper register half, ax-dx,si,di,bp to 
minimize usage of the rex prefix. It shouldn't make a difference though, 
not in .Lcloop1, as it won't reduce amount of cache-lines used. A.


__
OpenSSL Project http://www.openssl.org
Development Mailing List   openssl-dev@openssl.org
Automated List Manager   [EMAIL PROTECTED]


RE: [openssl.org #1447] [bug] 0.9.8d: rc4 cpuid test broken on dual core cpus

2007-01-05 Thread David Schwartz

  So HT flag is no longer HyperThreading, but something else... Will look
  into it... There is another place HTT flag is checked and it's AES...

 yeah HT flag now basically means multi-threading or multi-core
 package... because when amd/intel went dual core they didn't want silly
 license managers to charge for every core.

I was under the impression that the HT flag meant that the CPU supported the
HT probe commands, which includes ways to determine if the CPU actually
supports hyper-threading, has multiple cores, and so on.

From the horse's mouth:

Note that support for Hyper-Threading technology on the processor does not
necessarily mean that the processor supports more than one logical
processor, that the BIOS has enabled the feature, or that the operating
system is utilizing the extra logical processors. Note that additional steps
are required to determine the number of logical processors supported by the
physical processor, as well as querying the operating system to determine
the logical-to-physical processor mapping.

Basically, the HT bit set in the CPUID means you can proceed to the next
step. If it's clear, then there is no HT or multi-core. (At least, not the
Intel variety.)
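
To make the "next step" concrete, here is a minimal sketch in C. It assumes the standard CPUID leaf 1 layout (HTT flag in EDX bit 28, logical processor count field in EBX bits 23:16) and takes the register values as plain arguments, so it makes no claim about how OpenSSL itself reads them. The names cpuid_has_htt and cpuid_logical_count are made up for illustration.

```c
#include <stdint.h>

/* CPUID leaf 1: EDX bit 28 is the HTT flag. */
#define CPUID1_EDX_HTT (UINT32_C(1) << 28)

/* Hypothetical helper: nonzero when the HTT flag is set in leaf-1 EDX. */
static int cpuid_has_htt(uint32_t edx)
{
    return (edx & CPUID1_EDX_HTT) != 0;
}

/*
 * Hypothetical helper: logical processor count field, EBX[23:16] of leaf 1.
 * The field is only meaningful when the HTT flag is set; with the flag
 * clear this sketch simply reports one logical processor.
 */
static unsigned cpuid_logical_count(uint32_t ebx, uint32_t edx)
{
    if (!cpuid_has_htt(edx))
        return 1;
    return (ebx >> 16) & 0xff;
}
```

A count of 2 from this field is what both a hyper-threaded p4 and a dual-core package without SMT can report, which is why the flag alone cannot tell the two apart.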

DS




Re: [openssl.org #1447] [bug] 0.9.8d: rc4 cpuid test broken on dual core cpus

2007-01-04 Thread Andy Polyakov
there is a cpuid test in rc4_skey.c which tests the hyperthreading cpuid 
bit to distinguish between two implementations of rc4... unfortunately 
this fails to properly distinguish the cpus.  all dual core cpus (intel or 
amd) report HT support even if they don't use symmetric-multithreading 
like some p4 do.


So HT flag is no longer HyperThreading, but something else... Will look 
into it... There is another place HTT flag is checked and it's AES...



it seems somewhat fortunate that core2 CPUs track the p4 behaviour
w.r.t. these two rc4 implementations.  here are the core2 results with the
stock code / HT test:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
rc4            166799.58k   180552.87k   182437.93k   183381.67k   183206.87k

and with cpuid test disabled:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
rc4            123361.30k   128102.17k   129876.57k   128787.22k   129419.95k

for the record, core2 64-bit code seriously underperforming the 32-bit
code...  here's the 32-bit results (with cpuid test enabled):

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
rc4            254164.64k   279901.10k   279364.38k   283617.62k   276690.26k


... The key feature in 32-bit code with cpuid test is that corresponding 
loop is not unrolled. Can you test following in *64-bit* build on Core2 
hardware. Open rc4-x86_64.pl in text editor and make jump to .Lcloop1 at 
line 154 unconditional, i.e. replace jz to jmp. make, benchmark and 
report back. A.



Re: [openssl.org #1447] [bug] 0.9.8d: rc4 cpuid test broken on dual core cpus

2007-01-04 Thread dean gaudet
On Fri, 5 Jan 2007, Andy Polyakov wrote:

  there is a cpuid test in rc4_skey.c which tests the hyperthreading cpuid bit
  to distinguish between two implementations of rc4... unfortunately this
  fails to properly distinguish the cpus.  all dual core cpus (intel or amd)
  report HT support even if they don't use symmetric-multithreading like some
  p4 do.
 
 So HT flag is no longer HyperThreading, but something else... Will look into
 it... There is another place HTT flag is checked and it's AES...

yeah HT flag now basically means multi-threading or multi-core
package... because when amd/intel went dual core they didn't want silly
license managers to charge for every core.

hmm i don't see any OPENSSL_ia32cap_P test for AES in 0.9.8d ... maybe
i should be looking at the cvs?  i'm seeing 17.5 cycles per byte for
aes-128-cbc on core2, which is pretty good.


  it seems somewhat fortunate that core2 CPUs track the p4 behaviour
  w.r.t. these two rc4 implementations.  here are the core2 results with the
  stock code / HT test:
  
  type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
  rc4            166799.58k   180552.87k   182437.93k   183381.67k   183206.87k
  
  and with cpuid test disabled:
  
  type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
  rc4            123361.30k   128102.17k   129876.57k   128787.22k   129419.95k
  
  for the record, core2 64-bit code seriously underperforming the 32-bit
  code...  here's the 32-bit results (with cpuid test enabled):
  
  type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
  rc4            254164.64k   279901.10k   279364.38k   283617.62k   276690.26k
 
 ... The key feature in 32-bit code with cpuid test is that corresponding loop
 is not unrolled. Can you test following in *64-bit* build on Core2 hardware.
 Open rc4-x86_64.pl in text editor and make jump to .Lcloop1 at line 154
 unconditional, i.e. replace jz to jmp. make, benchmark and report back. A.

small improvement...

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
rc4            174197.47k   182564.34k   184536.23k   185292.63k   186258.77k

i think this hints that the problem with the unrolled code is the manual
load/store alias avoidance -- there's fancy new hardware in core2 for
dealing with this (obviously it's not fancy enough :)... and it seems
the 32-bit code pushes the alias problem onto the hardware.

oh and i tried using cmove with no luck either.

bizarre... i think i copied the 32-bit code into the 64-bit Lcloop1 case 
and it's still not performing like it does in 32-bit... maybe i screwed up 
though.

-dean


[openssl.org #1447] [bug] 0.9.8d: rc4 cpuid test broken on dual core cpus

2006-12-27 Thread dean gaudet via RT

there is a cpuid test in rc4_skey.c which tests the hyperthreading cpuid 
bit to distinguish between two implementations of rc4... unfortunately 
this fails to properly distinguish the cpus.  all dual core cpus (intel or 
amd) report HT support even if they don't use symmetric-multithreading 
like some p4 do.

on a dual-core k8 revF i see the following performance from a 0.9.8d build 
without any changes:

% ./openssl-0.9.8d speed rc4
Doing rc4 for 3s on 16 size blocks: 51091562 rc4's in 3.01s
Doing rc4 for 3s on 64 size blocks: 15937508 rc4's in 3.00s
Doing rc4 for 3s on 256 size blocks: 4190704 rc4's in 3.00s
Doing rc4 for 3s on 1024 size blocks: 1062795 rc4's in 3.00s
Doing rc4 for 3s on 8192 size blocks: 133319 rc4's in 3.01s
OpenSSL 0.9.8d 28 Sep 2006
built on: Tue Dec 26 17:40:14 PST 2006
options:bn(64,64) md2(int) rc4(ptr,int) des(idx,cisc,16,int) aes(partial) idea(int) blowfish(ptr2)
compiler: gcc -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -static -m64 -DL_ENDIAN -DTERMIO -O3 -Wall -DMD32_REG_T=int -DMD5_ASM
available timing options: TIMES TIMEB HZ=100 [sysconf value]
timing function used: times
The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
rc4            271583.05k   340000.17k   357606.74k   362767.36k   362840.28k
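
For reference, each figure in the summary row can be recomputed from the matching "Doing rc4 for 3s" line above it. The sketch below is only the arithmetic implied by "1000s of bytes per second"; the function name speed_kbytes_per_sec is made up here and is not taken from the openssl speed source.

```c
/*
 * Hypothetical helper: throughput in thousands of bytes per second,
 * given the operation count, the block size in bytes and the elapsed
 * time in seconds as printed by "openssl speed".
 */
static double speed_kbytes_per_sec(double count, double block_bytes,
                                   double seconds)
{
    return count * block_bytes / seconds / 1000.0;
}
```

With the 64-byte line (15937508 rc4's in 3.00s) this gives about 340000.17k, and with the 16-byte line (51091562 rc4's in 3.01s) about 271583.05k.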

if i disable the cpuid test in rc4_skey.c i get these much improved
numbers:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
rc4            408832.88k   463675.26k   474736.30k   481802.21k   484870.83k

i see the same difference on dual-core k8 revE as well.


it seems somewhat fortunate that core2 CPUs track the p4 behaviour
w.r.t. these two rc4 implementations.  here are the core2 results with the
stock code / HT test:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
rc4            166799.58k   180552.87k   182437.93k   183381.67k   183206.87k

and with cpuid test disabled:

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
rc4            123361.30k   128102.17k   129876.57k   128787.22k   129419.95k


i understand from the comments in rc4_skey.c that you're attempting to
distinguish between {p3, k8} and {p4}... with this updated information
it seems you want to distinguish {p3, k8} and {p4, core2}.  to do this
i'd suggest decoding the cpuid vendor, family and model values... but
this becomes unmaintainable really quickly:

if (vendor == intel && (family == 15 || (family == 6 && model >= 15))) {
    // intel p4 and core2 only (and likely follow-ons to core2)
    // XXX: need to test if core (model 14) should be here
}
else {
    // everyone else
}
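
A slightly fuller sketch of that decode, in C. It assumes the Intel rules for the CPUID leaf 1 EAX signature (extended family is added only when the base family is 15, extended model is used when the base family is 6 or 15) and reads the model comparison as "model >= 15", which is a guess based on the "follow-ons to core2" comment. The names cpu_sig, decode_cpuid_sig and wants_p4_style_rc4 are made up for illustration and do not exist in OpenSSL.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the decoded display family and model. */
struct cpu_sig {
    unsigned family;
    unsigned model;
};

/* Decode the CPUID leaf 1 EAX signature using the Intel rules. */
static struct cpu_sig decode_cpuid_sig(uint32_t eax)
{
    unsigned base_family = (eax >> 8) & 0xf;
    unsigned base_model  = (eax >> 4) & 0xf;
    unsigned ext_family  = (eax >> 20) & 0xff;
    unsigned ext_model   = (eax >> 16) & 0xf;
    struct cpu_sig s;

    s.family = base_family;
    if (base_family == 15)
        s.family += ext_family;           /* extended family applies */

    s.model = base_model;
    if (base_family == 6 || base_family == 15)
        s.model |= ext_model << 4;        /* extended model applies */

    return s;
}

/*
 * Nonzero for {p4, core2 and later}, zero for everyone else.
 * vendor is the 12-character CPUID leaf 0 vendor string.
 */
static int wants_p4_style_rc4(const char *vendor, uint32_t eax)
{
    struct cpu_sig s = decode_cpuid_sig(eax);

    if (strcmp(vendor, "GenuineIntel") != 0)
        return 0;
    return s.family == 15 || (s.family == 6 && s.model >= 15);
}
```

With this reading a core2 signature such as 0x6F6 (family 6, model 15) selects the p4 path, while core (family 6, model 14) and any AuthenticAMD part fall through to the other one. Every new model needs a fresh look at the table, which is the maintenance problem.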

it seems a more sustainable solution would be some sort of
/etc/openssl.conf and an openssl speed --generate-conf option used at
package install time to test several implementations.

for the record, core2 64-bit code seriously underperforming the 32-bit
code...  here's the 32-bit results (with cpuid test enabled):

type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
rc4            254164.64k   279901.10k   279364.38k   283617.62k   276690.26k

sorry, i haven't developed patches to fix this... i just wanted to record
these results somewhere for now... i'm not even sure which approach is
the best to fix this.

-dean
