Re: [openssl.org #1447] [bug] 0.9.8d: rc4 cpuid test broken on dual core cpus
> > > it seems somewhat fortunate that core2 CPUs track the p4 behaviour
> > > w.r.t. these two rc4 implementations. here are the core2 results
> > > with the stock code / HT test:
> > >
> > > type         16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> > > rc4        166799.58k   180552.87k   182437.93k   183381.67k   183206.87k
> > >
> > > for the record, core2 64-bit code seriously underperforming the
> > > 32-bit code... here's the 32-bit results (with cpuid test enabled):
> > >
> > > type         16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> > > rc4        254164.64k   279901.10k   279364.38k   283617.62k   276690.26k
> > >
> > > ...
> >
> > The key feature in 32-bit code with cpuid test is that corresponding
> > loop is not unrolled. Can you test following in *64-bit* build on
> > Core2 hardware. Open rc4-x86_64.pl in text editor and make jump to
> > .Lcloop1 at line 154 unconditional, i.e. replace jz to jmp. make,
> > benchmark and report back.
> >
> > A.
>
> small improvement...
>
> i think this hints that the problem with the unrolled code is the
> manual load/store alias avoidance -- there's fancy new hardware in
> core2 for dealing with this (obviously it's not fancy enough :)...
> and it seems the 32-bit code pushes the alias problem onto the
> hardware.

But .Lcloop1 is folded and doesn't avoid aliasing.

> oh and i tried using cmove with no luck either.
>
> bizarre... i think i copied the 32-bit code into the 64-bit Lcloop1
> case and it's still not performing like it does in 32-bit...

Fresh optimization manual says that targeting 32 bits of a register and
then using all 64 bits incurs an extra μop (look for sign extension to
full 64-bits). Could you try to remove occurrences of #d in movzb
instructions in .Lcloop1 body? Naturally keeping unconditional jmp
.Lcloop1 as suggested above.

It's also possible to compress the loop body by moving variables to
upper register half, ax-dx, si, di, bp, to minimize usage of rex prefix.
It shouldn't make a difference though, not in .Lcloop1, as it won't
reduce amount of cache-lines used.

A.
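For readers following the .Lcloop1 discussion, the following is a minimal C sketch of a "folded" (one byte per iteration, not unrolled) RC4 loop. It is a textbook rendering of RC4, not the code emitted by rc4-x86_64.pl, and the names rc4_state, rc4_init and rc4_crypt are hypothetical. The comment in the loop marks where the aliasing discussed above can arise, on the assumption that the concern is the store to S[y] overlapping the next iteration's load of S[x+1].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state layout for illustration; OpenSSL's RC4_KEY differs. */
typedef struct {
    uint8_t x, y;
    uint8_t S[256];
} rc4_state;

/* Standard RC4 key schedule. */
static void rc4_init(rc4_state *st, const uint8_t *key, size_t keylen)
{
    assert(keylen > 0);
    for (int i = 0; i < 256; i++)
        st->S[i] = (uint8_t)i;

    uint8_t j = 0;
    for (int i = 0; i < 256; i++) {
        j = (uint8_t)(j + st->S[i] + key[(size_t)i % keylen]);
        uint8_t t = st->S[i];
        st->S[i] = st->S[j];
        st->S[j] = t;
    }
    st->x = 0;
    st->y = 0;
}

/* Folded loop: each iteration produces exactly one output byte. */
static void rc4_crypt(rc4_state *st, const uint8_t *in, uint8_t *out,
                      size_t len)
{
    uint8_t x = st->x, y = st->y;
    uint8_t *S = st->S;

    for (size_t n = 0; n < len; n++) {
        x = (uint8_t)(x + 1);
        uint8_t tx = S[x];              /* load S[x] */
        y = (uint8_t)(y + tx);
        uint8_t ty = S[y];              /* load S[y] */
        /* This store hits the slot the next iteration loads from
         * whenever y == x + 1. An unrolled loop that preloads S[x+1]
         * has to detect that case itself; this folded loop leaves it
         * to the hardware's store-to-load forwarding. */
        S[y] = tx;
        S[x] = ty;
        out[n] = (uint8_t)(in[n] ^ S[(uint8_t)(tx + ty)]);
    }
    st->x = x;
    st->y = y;
}
```

Because RC4 is an XOR stream cipher, running rc4_crypt over the output with a freshly initialised state and the same key recovers the input.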
__ OpenSSL Project http://www.openssl.org Development Mailing List openssl-dev@openssl.org Automated List Manager [EMAIL PROTECTED]
RE: [openssl.org #1447] [bug] 0.9.8d: rc4 cpuid test broken on dual core cpus
> > So HT flag is no longer HyperThreading, but something else... Will
> > look into it... There is another place HTT flag is checked and it's
> > AES...
>
> yeah HT flag now basically means multi-threading or multi-core
> package... because when amd/intel went dual core they didn't want
> silly license managers to charge for every core.

I was under the impression that the HT flag meant that the CPU supported
the HT probe commands, which includes ways to determine if the CPU
actually supports hyper-threading, has multiple cores, and so on.

From the horse's mouth:

    Note that support for Hyper-Threading technology on the processor
    does not necessarily mean that the processor supports more than one
    logical processor, that the BIOS has enabled the feature, or that
    the operating system is utilizing the extra logical processors.

    Note that additional steps are required to determine the number of
    logical processors supported by the physical processor, as well as
    querying the operating system to determine the logical-to-physical
    processor mapping.

Basically, the HT bit set in the CPUID means you can proceed to the next
step. If it's clear, then there is no HT or multi-core. (At least, not
the Intel variety.)

DS
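The "next step" described above can be sketched in C as follows. CPUID leaf 1 reports the HTT flag in EDX bit 28 and, when that flag is set, a per-package logical processor count in EBX bits 23:16. The helper names and the register-array layout are hypothetical, and the sketch decodes register values that the caller has already obtained rather than executing CPUID itself, so it stays portable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout for the four registers returned by CPUID leaf 1. */
enum { REG_EAX = 0, REG_EBX = 1, REG_ECX = 2, REG_EDX = 3 };

#define CPUID1_EDX_HTT (UINT32_C(1) << 28)

/* Returns 1 if the HTT flag (leaf 1, EDX bit 28) is set. */
static int cpuid1_has_htt(const uint32_t regs[4])
{
    assert(regs != NULL);
    return (regs[REG_EDX] & CPUID1_EDX_HTT) != 0;
}

/* Logical processor count per package (leaf 1, EBX bits 23:16).
 * The field is only meaningful when HTT is set, so report 1 otherwise. */
static unsigned cpuid1_logical_count(const uint32_t regs[4])
{
    if (!cpuid1_has_htt(regs))
        return 1;
    return (unsigned)((regs[REG_EBX] >> 16) & 0xFF);
}
```

A count of 2 with HTT set does not say whether the package has two cores or one core with two hardware threads; telling those apart needs further CPUID leaves or help from the operating system, which is why the bit alone cannot select an RC4 implementation by microarchitecture.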
Re: [openssl.org #1447] [bug] 0.9.8d: rc4 cpuid test broken on dual core cpus
> there is a cpuid test in rc4_skey.c which tests the hyperthreading
> cpuid bit to distinguish between two implementations of rc4...
> unfortunately this fails to properly distinguish the cpus. all dual
> core cpus (intel or amd) report HT support even if they don't use
> symmetric-multithreading like some p4 do.

So HT flag is no longer HyperThreading, but something else... Will look
into it... There is another place HTT flag is checked and it's AES...

> it seems somewhat fortunate that core2 CPUs track the p4 behaviour
> w.r.t. these two rc4 implementations. here are the core2 results
> with the stock code / HT test:
>
> type         16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> rc4        166799.58k   180552.87k   182437.93k   183381.67k   183206.87k
>
> and with cpuid test disabled:
>
> type         16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> rc4        123361.30k   128102.17k   129876.57k   128787.22k   129419.95k
>
> for the record, core2 64-bit code seriously underperforming the
> 32-bit code... here's the 32-bit results (with cpuid test enabled):
>
> type         16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> rc4        254164.64k   279901.10k   279364.38k   283617.62k   276690.26k
>
> ...

The key feature in 32-bit code with cpuid test is that corresponding
loop is not unrolled. Can you test following in *64-bit* build on Core2
hardware. Open rc4-x86_64.pl in text editor and make jump to .Lcloop1 at
line 154 unconditional, i.e. replace jz to jmp. make, benchmark and
report back.

A.
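To put the quoted figures side by side, here is a small C sketch that computes the ratio between two `openssl speed` throughput numbers. The function name is hypothetical, and both inputs are assumed to be in the same unit (the "k" columns above).

```c
#include <assert.h>

/* Ratio of two throughput figures expressed in the same unit. */
static double throughput_ratio(double faster, double slower)
{
    assert(slower > 0.0);
    return faster / slower;
}
```

Using the 8192-byte column, throughput_ratio(183206.87, 129419.95) comes to roughly 1.42 for the 64-bit build with the HT test versus without it, and throughput_ratio(276690.26, 183206.87) comes to roughly 1.51 for the 32-bit build versus the 64-bit build.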
Re: [openssl.org #1447] [bug] 0.9.8d: rc4 cpuid test broken on dual core cpus
On Fri, 5 Jan 2007, Andy Polyakov wrote:

> > there is a cpuid test in rc4_skey.c which tests the hyperthreading
> > cpuid bit to distinguish between two implementations of rc4...
> > unfortunately this fails to properly distinguish the cpus. all dual
> > core cpus (intel or amd) report HT support even if they don't use
> > symmetric-multithreading like some p4 do.
>
> So HT flag is no longer HyperThreading, but something else... Will
> look into it... There is another place HTT flag is checked and it's
> AES...

yeah HT flag now basically means multi-threading or multi-core
package... because when amd/intel went dual core they didn't want silly
license managers to charge for every core.

hmm i don't see any OPENSSL_ia32cap_P test for AES in 0.9.8d ... maybe i
should be looking at the cvs? i'm seeing 17.5 cycles per byte for
aes-128-cbc on core2, which is pretty good.

> > it seems somewhat fortunate that core2 CPUs track the p4 behaviour
> > w.r.t. these two rc4 implementations. here are the core2 results
> > with the stock code / HT test:
> >
> > type         16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> > rc4        166799.58k   180552.87k   182437.93k   183381.67k   183206.87k
> >
> > and with cpuid test disabled:
> >
> > type         16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> > rc4        123361.30k   128102.17k   129876.57k   128787.22k   129419.95k
> >
> > for the record, core2 64-bit code seriously underperforming the
> > 32-bit code... here's the 32-bit results (with cpuid test enabled):
> >
> > type         16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> > rc4        254164.64k   279901.10k   279364.38k   283617.62k   276690.26k
> >
> > ...
>
> The key feature in 32-bit code with cpuid test is that corresponding
> loop is not unrolled. Can you test following in *64-bit* build on
> Core2 hardware. Open rc4-x86_64.pl in text editor and make jump to
> .Lcloop1 at line 154 unconditional, i.e. replace jz to jmp. make,
> benchmark and report back.
>
> A.

small improvement...

type         16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
rc4        174197.47k   182564.34k   184536.23k   185292.63k   186258.77k

i think this hints that the problem with the unrolled code is the manual
load/store alias avoidance -- there's fancy new hardware in core2 for
dealing with this (obviously it's not fancy enough :)... and it seems
the 32-bit code pushes the alias problem onto the hardware.

oh and i tried using cmove with no luck either.

bizarre... i think i copied the 32-bit code into the 64-bit Lcloop1 case
and it's still not performing like it does in 32-bit... maybe i screwed
up though.

-dean
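The thread mixes two units: `openssl speed` reports thousands of bytes per second, while the AES figure is given in cycles per byte. A minimal C sketch of the conversion follows. The function names are hypothetical, and the clock frequency is an input the caller must supply, since the thread never states the clock speed of the test machine.

```c
#include <assert.h>

/* Convert an `openssl speed` figure (thousands of bytes per second)
 * into cycles per byte for a given core clock in Hz. */
static double cycles_per_byte(double clock_hz, double kbytes_per_sec)
{
    assert(clock_hz > 0.0 && kbytes_per_sec > 0.0);
    return clock_hz / (kbytes_per_sec * 1000.0);
}

/* Inverse: the `openssl speed` figure implied by a cycles-per-byte cost. */
static double kbytes_per_sec(double clock_hz, double cpb)
{
    assert(clock_hz > 0.0 && cpb > 0.0);
    return clock_hz / cpb / 1000.0;
}
```

Purely as an illustration with an assumed 2.4 GHz clock, cycles_per_byte(2.4e9, 186258.77) is about 12.9 for the modified 64-bit loop, and kbytes_per_sec(2.4e9, 17.5) is about 137143k for the quoted AES cost. The real figures depend on the actual clock of the machine used.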