[osol-discuss] Re: LAMP for Solaris aka SAMP
On Wed, 11 Oct 2006, Chris Lemire wrote:
Hey, I was reading that PHP for Solaris is 64-bit? What about 32-bit?

Why does it matter? If you have a 64-bit x86 CPU, then you really should be using it with 64-bit software for maximum performance.

I guess you forgot that 32-bit code is actually faster in most cases than 64-bit code? There are some very limited scenarios where 64-bit code is faster, and they usually involve manipulating very large quantities of data.

Also, if you have an i86pc system with an older Pentium or Athlon CPU, the kernel boots in 32-bit mode, since the CPU can't execute 64-bit instructions. So if you have a 64-bit binary, shared library or driver, you're stuck. The same goes for 32-bit drivers on 64-bit platforms: no go.

This message posted from opensolaris.org
___
opensolaris-discuss mailing list
opensolaris-discuss@opensolaris.org
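Whether a given binary, shared library or driver is 32-bit or 64-bit is recorded in its ELF identification header: byte 4 (`EI_CLASS`) is 1 for ELFCLASS32 and 2 for ELFCLASS64. The following is a minimal Python sketch of that check. The helper names are hypothetical (this is not a Solaris API), and the loading rules are a simplified reading of the message above: a 64-bit kernel runs both 32-bit and 64-bit programs, a 32-bit kernel runs only 32-bit ones, and drivers must match the kernel.

```python
# Minimal sketch (hypothetical helpers, not a Solaris API): classify an ELF
# object as 32- or 64-bit from its identification header, then apply the
# simplified compatibility rules discussed in this thread.
ELF_MAGIC = b"\x7fELF"
EI_CLASS = 4                     # offset of the class byte in e_ident
ELF_CLASS_BITS = {1: 32, 2: 64}  # ELFCLASS32, ELFCLASS64


def elf_bits(header: bytes) -> int:
    """Return 32 or 64 for the first bytes of an ELF file."""
    if len(header) <= EI_CLASS or header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF object")
    bits = ELF_CLASS_BITS.get(header[EI_CLASS])
    if bits is None:
        raise ValueError("unknown ELF class")
    return bits


def elf_bits_of_file(path: str) -> int:
    """Read just enough of a file on disk to classify it."""
    with open(path, "rb") as f:
        return elf_bits(f.read(EI_CLASS + 1))


def can_run_program(kernel_bits: int, program_bits: int) -> bool:
    # Assumption: a 64-bit kernel runs 32- and 64-bit user programs,
    # while a 32-bit kernel runs only 32-bit ones.
    return program_bits <= kernel_bits


def can_load_driver(kernel_bits: int, driver_bits: int) -> bool:
    # Drivers are loaded into the kernel itself, so their word size must match.
    return driver_bits == kernel_bits


print(elf_bits(b"\x7fELF\x02"))  # 64
print(can_run_program(32, 64))   # False
print(can_load_driver(64, 32))   # False
```

Shared libraries follow the same matching rule as drivers, except that they must match the process loading them rather than the kernel.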
Re: [osol-discuss] Re: LAMP for Solaris aka SAMP
On 2006-10-19 09:21:46 +0200 UNIX admin [EMAIL PROTECTED] wrote:
I guess you forgot that 32-bit code is actually faster in most cases than 64-bit code? There are some very limited scenarios where 64-bit code is faster, and they usually involve manipulating very large quantities of data.

The biggest advantage of x86-64 over x86 is that it has a bunch more registers. So code tends to end up being more compact and faster in 64-bit mode than in 32-bit mode on x86. And yes, that's not the usual case (e.g. on SPARC), but merely an effect of x86-64 fixing a couple of x86's issues.

patrick mauritz
Re: [osol-discuss] Re: LAMP for Solaris aka SAMP
Patrick Mauritz wrote:
The biggest advantage of x86-64 over x86 is that it has a bunch more registers. So code tends to end up being more compact and faster in 64-bit mode than in 32-bit mode on x86.

Faster: Yes, x86-64 code is faster than ia32 code most of the time.
Smaller: Usually no. Some examples:

$ size /kernel/genunix /kernel/amd64/genunix
/kernel/genunix: 1097477 + 40705 + 184544 = 1322726
/kernel/amd64/genunix: 1629491 + 53808 + 284888 = 1968187
$ size /usr/sfw/lib/libcrypto.so /usr/sfw/lib/amd64/libcrypto.so
/usr/sfw/lib/libcrypto.so: 944748 + 84582 + 8310 = 1037640
/usr/sfw/lib/amd64/libcrypto.so: 1237360 + 123896 + 10540 = 1371796

So x86-64 code is usually ~30% larger than ia32 code. This could also hurt performance in some benchmarks (negative cache effects).

I tried a simple benchmark (openssl speed) compiled with Sun Studio 11:

32 bit:
compiler: cc -KPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -fast -O -Xa
available timing options: TIMES TIMEB HZ=1000 [sysconf value]
timing function used: times
The 'numbers' are in 1000s of bytes per second processed.

type                 16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md2                  1355.59k     2862.13k     3971.95k     4391.52k     4536.13k
mdc2                     0.00         0.00         0.00         0.00         0.00
md4                 14415.52k    49963.59k   147409.25k   287259.35k   398234.25k
md5                 11809.87k    39453.73k   109647.76k   197091.34k   257964.24k
hmac(md5)           15609.16k    48985.32k   127089.03k   210243.13k   260602.33k
sha1                13064.55k    42881.78k   109946.85k   180015.07k   221802.32k
rmd160              10302.32k    29856.92k    67258.51k    97641.59k   112474.32k
rc4                 94902.10k   104165.04k   108189.22k   109169.61k   109425.24k
des cbc             36492.43k    38349.58k    38847.14k    38664.25k    39016.97k
des ede3            14029.63k    14316.70k    14404.98k    14249.21k    14405.30k
idea cbc            39330.38k    41825.71k    42741.53k    42785.97k    42911.13k
rc2 cbc             21974.04k    22727.73k    22935.10k    22958.94k    22988.77k
rc5-32/12 cbc            0.00         0.00         0.00         0.00         0.00
blowfish cbc        66036.54k    72483.74k    74132.32k    74722.80k    74838.59k
cast cbc            41895.25k    44178.37k    44798.45k    45101.02k    45219.29k
aes-128 cbc         39793.82k    43434.97k    44639.73k    44914.96k    44925.55k
aes-192 cbc         34567.00k    37394.44k    38244.70k    38494.36k    38628.45k
aes-256 cbc         31409.13k    32771.61k    33463.78k    33622.82k    33709.01k
sha256              11090.84k    27899.81k    54231.09k    71305.92k    78475.95k
sha512               3701.83k    14766.45k    24035.07k    34668.49k    39845.03k

                       sign     verify     sign/s   verify/s
rsa 512 bits      0.001101s  0.000075s      908.2    13369.4
rsa 1024 bits     0.004872s  0.000206s      205.3     4853.2
rsa 2048 bits     0.026451s  0.000624s       37.8     1602.9
rsa 4096 bits     0.156266s  0.002043s        6.4      489.5

                       sign     verify     sign/s   verify/s
dsa 512 bits      0.000786s  0.000937s     1271.6     1067.4
dsa 1024 bits     0.002128s  0.002564s      469.9      390.1
dsa 2048 bits     0.006429s  0.007673s      155.5      130.3

64 bit:
compiler: cc -KPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -fast -xarch=amd64 -xstrconst -Xa -DL_ENDIAN
available timing options: TIMES TIMEB HZ=1000 [sysconf value]
timing function used: times
The 'numbers' are in 1000s of bytes per second processed.

type                 16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md2                  1659.88k     3489.66k     4827.13k     5362.44k     5509.88k
mdc2                     0.00         0.00         0.00         0.00         0.00
md4                 18718.17k    60346.24k   153484.15k   249352.42k   305693.55k
md5                 15092.03k    46804.15k   113282.40k   175620.12k   209357.83k
hmac(md5)           17113.37k    51404.40k   120048.88k   179077.59k   210206.45k
sha1                14774.09k    42463.94k    76415.66k   120985.27k   146305.33k
rmd160              11647.51k    31071.76k    61971.67k    82660.30k    91280.71k
rc4                169194.41k   183235.03k   185723.58k   187306.07k   188022.50k
des cbc             38621.38k    39945.91k    40263.18k    40329.16k    40383.68k
des ede3            15407.90k    15598.96k    15654.06k    15672.38k    15675.24k
idea cbc            43711.29k    46672.59k    47439.44k    47668.87k    47704.25k
rc2 cbc             23168.34k    23851.05k    24037.38k    24081.51k    24055.92k
rc5-32/12 cbc            0.00         0.00         0.00         0.00         0.00
blowfish cbc        65418.92k    70249.71k    71613.96k    71999.04k    72241.55k
cast cbc            43763.67k    45880.14k    46417.32k    46572.90k    46605.58k
aes-128 cbc         87854.71k    96515.10k    99639.62k   100672.92k   101320.07k
aes-192 cbc         78534.91k    85301.65k    87745.73k    88271.48k    88410.94k
aes-256 cbc         70460.15k    76202.03k    78129.60k    78656.47k    78688.77k
sha256               9523.14k    22442.72k    40073.15k
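The "~30% larger" figure is a rounded summary; the growth of each object can be computed directly from the `size` totals (text + data + bss) posted above. A short Python sketch, with the totals copied from that output and a made-up helper name:

```python
# Totals (text + data + bss, in bytes) copied from the `size` output above.
SIZES = {
    "genunix": (1322726, 1968187),
    "libcrypto.so": (1037640, 1371796),
}


def growth_percent(size_ia32: int, size_amd64: int) -> float:
    """Percent by which the amd64 object is larger than the ia32 one."""
    return (size_amd64 / size_ia32 - 1.0) * 100.0


for name, (ia32, amd64) in SIZES.items():
    print(f"{name}: +{growth_percent(ia32, amd64):.1f}%")
# genunix: +48.8%
# libcrypto.so: +32.2%
```

By this measure genunix grew by close to 49% and libcrypto.so by about 32%, so "~30%" sits at the low end of these two samples.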
Re: [osol-discuss] Re: LAMP for Solaris aka SAMP
You're comparing a bit apples and oranges there with the compiler options you used. For IA32 code, SSE/SSE2 extensions and conditional moves are disabled because not every x86 CPU has them; you need to instruct the compiler explicitly to generate code for a CPU that has these. In AMD64 mode, on the other hand, code may safely assume that SSE/SSE2, conditional moves and a few other things are available by default. The difference between ia32 + extensions and amd64 would be smaller, as you found out on UltraSPARC, where 32-bit and 64-bit code differ only in register width.

FrankH.

On Thu, 19 Oct 2006, Daniel Rock wrote:
[...]
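Frank's objection comes down to a set difference: which CPU features the code generator may use without being told, versus which ones a 32-bit build only gets with explicit target options. The x86-64 baseline guarantees CMOV, SSE and SSE2 (among others), while a generic 32-bit x86 target guarantees none of them. A Python sketch of that reasoning; the feature sets are limited to a few baseline entries, and the helper name is hypothetical:

```python
# Features every AMD64 CPU provides (a subset of the x86-64 baseline).
AMD64_BASELINE = frozenset({"fpu", "cx8", "cmov", "mmx", "fxsr", "sse", "sse2"})
# Assumption: a generic 386-class target guarantees none of the features
# considered in this example.
IA32_GENERIC_BASELINE = frozenset()


def needs_explicit_flags(wanted: set, baseline: frozenset) -> list:
    """Features the optimizer would like to use but may not assume by default."""
    return sorted(wanted - baseline)


wanted = {"cmov", "sse", "sse2"}
print(needs_explicit_flags(wanted, IA32_GENERIC_BASELINE))  # ['cmov', 'sse', 'sse2']
print(needs_explicit_flags(wanted, AMD64_BASELINE))         # []
```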
Re: [osol-discuss] Re: LAMP for Solaris aka SAMP
Frank Hofmann wrote:
You're comparing a bit apples and oranges there with the compiler options you used.
32 bit:
compiler: cc [...] -fast -O

-fast implies -xtarget=native, which implies -xchip=native. The -O after -fast shouldn't negate the -xchip=XXX selection.

Daniel
Re: [osol-discuss] Re: LAMP for Solaris aka SAMP
Daniel Rock [EMAIL PROTECTED] wrote:
compiler: cc [...] -fast -O
-fast implies -xtarget=native, which implies -xchip=native. The -O after -fast shouldn't negate the -xchip=XXX selection.

-fast is a macro that gets expanded. It will most likely override everything to its left.

Jörg

-- 
EMail:[EMAIL PROTECTED] (home) Jörg Schilling D-13353 Berlin
      [EMAIL PROTECTED] (uni)
      [EMAIL PROTECTED] (work) Blog: http://schily.blogspot.com/
URL: http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily
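Both statements fit a simple rule: the driver expands -fast in place and processes options left to right, with a later setting of the same kind replacing an earlier one. Under that rule, -fast overrides options to its left, and an -O to its right lowers the optimization level while leaving the -xtarget selection alone. Daniel's follow-up later in this thread reports that the 32-bit build (-fast -O) ended up at -xO3 while the 64-bit build ran at -xO5, which is consistent with this rule. The Python sketch below models only that rule. `FAST_EXPANSION` is a deliberately abbreviated, hypothetical subset of what -fast really expands to, and the -O to -xO3 mapping is taken from that follow-up, not from the compiler documentation.

```python
# Hypothetical, abbreviated expansion of -fast (the real macro sets many more options).
FAST_EXPANSION = ["-xO5", "-xtarget=native"]
# Assumption taken from this thread: a plain -O ended up as -xO3.
ALIASES = {"-O": "-xO3"}


def effective_settings(args):
    """Process options left to right; a later setting of the same kind wins."""
    settings = {}
    for arg in args:
        expanded = FAST_EXPANSION if arg == "-fast" else [ALIASES.get(arg, arg)]
        for opt in expanded:
            if opt.startswith("-xO"):
                settings["opt"] = opt
            elif opt.startswith("-xtarget="):
                settings["target"] = opt
    return settings


print(effective_settings(["-fast", "-O"]))  # {'opt': '-xO3', 'target': '-xtarget=native'}
print(effective_settings(["-O", "-fast"]))  # {'opt': '-xO5', 'target': '-xtarget=native'}
```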
Re: [osol-discuss] Re: LAMP for Solaris aka SAMP
On Thu, 19 Oct 2006, UNIX admin wrote:
I guess you forgot that 32-bit code is actually faster in most cases than 64-bit code?

On SPARC, agreed, but on x86 it ain't necessarily so. 64-bit code on AMD processors has access to more registers, so it tends to be faster than 32-bit code.

-- 
Rich Teer, SCNA, SCSA, OpenSolaris CAB member

President,
Rite Online Inc.

Voice: +1 (250) 979-1638
URL: http://www.rite-group.com/rich
Re: [osol-discuss] Re: LAMP for Solaris aka SAMP
Rich Teer [EMAIL PROTECTED] wrote:
On Thu, 19 Oct 2006, UNIX admin wrote:
I guess you forgot that 32-bit code is actually faster in most cases than 64-bit code?
On SPARC, agreed, but on x86 it ain't necessarily so. 64-bit code on AMD processors has access to more registers, so it tends to be faster than 32-bit code.

64-bit code on SPARC is typically 5-10% slower; AMD64 code is typically 30% faster because there are twice as many registers.

Jörg
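"Twice as many registers" refers to the general-purpose register file: 8 registers in 32-bit x86 versus 16 in x86-64 (the XMM register count likewise doubles from 8 to 16). For the register allocator the relative gain is a bit larger than the raw count suggests, because the stack pointer is always reserved and the frame pointer often is too. A small Python sketch of that arithmetic; the helper is hypothetical and ignores other reservations, such as the PIC base register that 32-bit position-independent code needs:

```python
IA32_GPRS = ["eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"]
AMD64_GPRS = ["rax", "rbx", "rcx", "rdx", "rsi", "rdi", "rbp", "rsp"] + [
    f"r{n}" for n in range(8, 16)
]


def allocatable(gprs, stack_ptr, frame_ptr, omit_frame_pointer=False):
    """Registers left for variables once the stack pointer (and, unless it is
    omitted, the frame pointer) are reserved."""
    reserved = {stack_ptr} if omit_frame_pointer else {stack_ptr, frame_ptr}
    return [r for r in gprs if r not in reserved]


print(len(IA32_GPRS), len(AMD64_GPRS))                  # 8 16
print(len(allocatable(IA32_GPRS, "esp", "ebp")))        # 6
print(len(allocatable(AMD64_GPRS, "rsp", "rbp")))       # 14
print(len(allocatable(IA32_GPRS, "esp", "ebp", True)))  # 7
```

Omitting the frame pointer (gcc's -fomit-frame-pointer) frees one more register in either mode.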
Re: [osol-discuss] Re: LAMP for Solaris aka SAMP
Joerg Schilling wrote:
64-bit code on SPARC is typically 5-10% slower; AMD64 code is typically 30% faster because there are twice as many registers.

What's more, AMD64 supports various memory models, so an application need not always address the full 64-bit address space, even when it is compiled as 64-bit.
http://developers.sun.com/sunstudio/articles/mmodel.html

-- 
Derek E. Lewis
[EMAIL PROTECTED]
http://riemann.solnetworks.net/~dlewis
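The memory (code) models matter because most x86-64 instructions can only encode a signed 32-bit displacement or immediate. In the small model the compiler assumes that all code and static data lie in the low 2 GB of the address space, so RIP-relative operands with a 32-bit displacement always reach their target. Larger models must materialize full 64-bit addresses instead, which costs extra instructions. A Python sketch of the reachability test; the addresses are made up for illustration:

```python
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def fits_rel32(target: int, next_insn: int) -> bool:
    """True if `target` is reachable with a signed 32-bit displacement measured
    from the address of the next instruction, as RIP-relative addressing does."""
    return INT32_MIN <= target - next_insn <= INT32_MAX


code_end = 0x401000        # hypothetical address just past an instruction
static_var = 0x600000      # hypothetical static variable in the low 2 GB
far_data = 0x7F0000000000  # hypothetical object far outside the low 2 GB

print(fits_rel32(static_var, code_end))  # True
print(fits_rel32(far_data, code_end))    # False
```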
Re: [osol-discuss] Re: LAMP for Solaris aka SAMP
Joerg Schilling wrote:
64-bit code on SPARC is typically 5-10% slower; AMD64 code is typically 30% faster because there are twice as many registers.

30% is very optimistic. My test results vary between 30% slower and 200% faster depending on the application and compiler. On average I'd say AMD64 code will be ~10% faster.

My previously posted results with openssl speed are void: the 32-bit code was compiled with -xO3 while the 64-bit code was compiled with -xO5. I reran the tests, which on average still favour 64-bit code, but to a lesser extent.

Test environment:
cc: Sun C 5.8 Patch 121016-03 2006/06/07
ube: Sun Compiler Common 11 Patch 120759-08 2006/08/08

../gcc-4.1.1/configure --with-system-zlib --with-gnu-as --with-as=/usr/sfw/bin/gas --without-included-gettext --without-libiconv-prefix --enable-languages=c,c++,ada,fortran,objc --with-x --enable-java-awt=xlib
Thread model: posix
gcc version 4.1.1

AMD Athlon(tm) 64 X2 Dual Core Processor 4400+ ( == Opteron 175)
2x1 GB RAM Dual Channel DDR400 CL3 ECC

Numbers below are relative performance AMD64 vs. IA32 (< 0%: IA32 faster, > 0%: AMD64 faster).

(1) OpenSSL 0.9.8d

Studio 11, 32 vs. 64 bits:
./Configure no-asm solaris-x86-cc
./Configure no-asm solaris64-x86_64-cc
cc -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -fast -xstrconst [ -xarch=amd64 -Xa -DL_ENDIAN ]

type                   16B       64B      256B     1024B     8192B
md2                -10.05%   -11.15%   -11.74%   -12.12%   -12.34%
md4                  8.86%     6.05%     0.52%    -5.88%   -10.08%
md5                 16.38%    10.94%     0.72%    -8.88%   -14.06%
hmac(md5)           -2.63%    -5.14%    -8.79%   -12.78%   -14.68%
sha1                 4.24%   -11.21%   -21.91%   -26.25%   -28.58%
rmd160              -1.22%   -10.99%   -20.55%   -26.60%   -29.35%
rc4                 78.52%    82.69%    80.98%    81.75%    81.79%
des cbc             -8.77%    -9.57%    -9.63%    -9.69%    -9.63%
idea cbc             6.43%     6.04%     6.02%     6.10%     5.85%
rc2 cbc             -0.68%    -1.16%    -1.18%    -1.27%    -1.46%
blowfish cbc        -7.59%    -9.09%    -9.35%    -9.42%    -9.98%
cast cbc           -23.04%   -24.26%   -24.59%   -25.31%   -24.85%
aes-128 cbc         60.48%    61.71%    61.91%    62.32%    62.27%
aes-192 cbc         64.41%    63.91%    64.31%    65.11%    65.13%
aes-256 cbc         65.03%    66.60%    67.89%    67.40%    67.45%
sha256             -16.11%   -19.27%   -23.42%   -25.54%   -26.56%
sha512              82.83%    83.21%   112.24%   129.11%   137.42%

                      sign    verify
rsa 512 bits        40.73%    28.55%
rsa 1024 bits       28.89%    17.55%
rsa 2048 bits       15.93%     3.47%
rsa 4096 bits        7.69%    -3.87%
dsa 512 bits        29.38%    30.25%
dsa 1024 bits       20.51%    21.10%
dsa 2048 bits        7.06%     7.65%

gcc 4.1.1, 32 vs. 64 bits:
./Configure no-asm solaris-x86-gcc
./Configure no-asm solaris64-x86_64-gcc
gcc -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -O3 -fomit-frame-pointer -DL_ENDIAN { -march=pentium -DOPENSSL_NO_INLINE_ASM | -m64 -DL_ENDIAN -DMD32_REG_T=int }

type                   16B       64B      256B     1024B     8192B
md2                 -7.40%    -9.46%   -10.21%    -9.36%    -8.95%
md4                 26.77%    24.25%    19.80%    14.47%    11.19%
md5                 18.20%    16.72%    11.50%     6.06%     2.69%
hmac(md5)           19.03%    16.02%    10.69%     5.95%     2.59%
sha1                16.22%    13.13%    16.53%    20.12%    22.24%
rmd160              24.41%    17.51%    12.67%     8.13%     6.07%
rc4                 22.65%    22.98%    23.10%    23.19%    23.17%
des cbc             38.35%    37.66%    37.36%    37.29%    37.11%
idea cbc            10.96%     6.71%     3.94%     3.69%     3.33%
rc2 cbc              1.53%     0.27%    -0.23%    -0.22%    -0.33%
blowfish cbc         1.14%    -1.38%    -1.93%    -2.16%    -2.19%
cast cbc            95.12%    97.09%    97.57%    97.94%    98.07%
aes-128 cbc         76.22%    82.13%    83.89%    84.50%    84.79%
aes-192 cbc         84.24%    86.69%    88.12%    88.91%    89.08%
aes-256 cbc         83.59%    90.55%    91.96%    92.34%    92.52%
sha256              -3.48%    -2.75%    -1.07%    -0.29%     0.09%
sha512             177.33%   177.60%   242.40%   279.34%   301.04%

                      sign    verify
rsa 512 bits        94.92%   109.87%
rsa 1024 bits      124.20%   123.21%
rsa 2048 bits      136.36%   130.01%
rsa 4096 bits      142.86%   129.65%
dsa 512 bits       117.52%   114.45%
dsa 1024 bits      137.08%   128.02%
dsa 2048 bits      134.24%   130.59%

(2) gzip/bzip2

I also measured compression/decompression speed with gzip and bzip2 (test file: gcc-4.1.1.tar):

            Studio 11, 32 vs. 64  gcc 4.1.1, 32 vs. 64
gzip                     -5.78 %               23.69 %
gunzip2                   2.46 %                2.26 %
bzip2                     3.47 %
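The post does not spell out how the percentages were computed. One formula consistent with the stated sign convention (> 0% means AMD64 is faster) is the ratio of the two results minus one, with the ratio inverted when the metric is a time rather than a rate. A Python sketch under that assumption; the input figures are invented for illustration and are not taken from these runs:

```python
def relative_perf(ia32: float, amd64: float, higher_is_better: bool = True) -> float:
    """Percent advantage of the AMD64 result over the IA32 one.

    > 0 means AMD64 is faster, < 0 means IA32 is faster. Use
    higher_is_better=True for rates (bytes/s, sign/s) and False for
    times (seconds per operation).
    """
    ratio = amd64 / ia32 if higher_is_better else ia32 / amd64
    return (ratio - 1.0) * 100.0


print(f"{relative_perf(100.0, 150.0):.1f}%")                          # 50.0%
print(f"{relative_perf(0.002, 0.001, higher_is_better=False):.1f}%")  # 100.0%
print(f"{relative_perf(100.0, 80.0):.1f}%")                           # -20.0%
```

Under this formula a 2x speedup shows up as +100%, so a figure such as sha512's 301.04% would correspond to roughly 4x the IA32 throughput.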