[osol-discuss] Re: LAMP for Solaris aka SAMP

2006-10-19 Thread UNIX admin
 On Wed, 11 Oct 2006, Chris Lemire wrote:
 
  Hey, I was reading that PHP for Solaris is 64-bit? What about 32-bit?
 
 Why does it matter? If you have a 64-bit x86 CPU, then you really should be
 using it with 64-bit software for maximum performance.

I guess you forgot that 32-bit code is actually faster in most cases than 
64-bit code?
There are some very limited scenarios where 64-bit code is faster, and they 
usually involve manipulating very large quantities of data.

Also, if you have an i86pc system with an older Pentium or Athlon CPU, the 
kernel boots in 32-bit mode, since the CPU isn't capable of executing 64-bit 
instructions. So if you only have a 64-bit binary, shared library, or 
driver, you're stuck.

The same goes for 32-bit drivers on a 64-bit kernel: no go.
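Because of this, it helps to know which kind of object you actually have. On Solaris, `isainfo -b` prints the number of bits in the running kernel's native instruction set, and `file` reports whether an object is 32-bit or 64-bit ELF. As a portable sketch (not a Solaris tool), the Python below reads the same information straight from the ELF header: byte 4 of `e_ident` (EI_CLASS) is 1 for 32-bit objects and 2 for 64-bit ones. The helper names `elf_bits` and `elf_bits_of_file` are hypothetical.

```python
import sys

EI_CLASS = 4     # index of the class byte within e_ident
ELFCLASS32 = 1   # e_ident[EI_CLASS] value for 32-bit objects
ELFCLASS64 = 2   # e_ident[EI_CLASS] value for 64-bit objects


def elf_bits(header: bytes) -> int:
    """Return 32 or 64 for an ELF header; raise ValueError otherwise."""
    if len(header) <= EI_CLASS or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF object")
    cls = header[EI_CLASS]
    if cls == ELFCLASS32:
        return 32
    if cls == ELFCLASS64:
        return 64
    raise ValueError(f"unknown ELF class {cls}")


def elf_bits_of_file(path: str) -> int:
    # Only the 16-byte e_ident array is needed to tell the class apart.
    with open(path, "rb") as f:
        return elf_bits(f.read(16))


if __name__ == "__main__":
    for p in sys.argv[1:]:
        print(f"{p}: {elf_bits_of_file(p)}-bit")
```

For example, on a system where those paths exist, passing `/usr/sfw/lib/libcrypto.so` and `/usr/sfw/lib/amd64/libcrypto.so` as arguments should report 32-bit and 64-bit respectively.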
 
 
This message posted from opensolaris.org
___
opensolaris-discuss mailing list
opensolaris-discuss@opensolaris.org


Re: [osol-discuss] Re: LAMP for Solaris aka SAMP

2006-10-19 Thread Patrick Mauritz

On 2006-10-19 09:21:46 +0200 UNIX admin [EMAIL PROTECTED] wrote:
I guess you forgot that 32-bit code is actually faster in most cases than 
64-bit code?
There are some very limited scenarios where 64-bit code is faster, and they 
usually involve manipulating very large quantities of data.
The biggest advantage of x86-64 over x86 is that it has a bunch more 
registers, so code tends to end up more compact and faster in 64-bit mode 
than in 32-bit mode on x86.


And yes, that's not the usual case (e.g. on SPARC); it's merely a side 
effect of x86-64 fixing a couple of x86's shortcomings.



patrick mauritz



Re: [osol-discuss] Re: LAMP for Solaris aka SAMP

2006-10-19 Thread Daniel Rock

Patrick Mauritz wrote:
the biggest advantage of x86-64 over x86 is that it has a bunch 
registers more.
so code tends to end up being more compact and faster in 64bit mode than 
in 32bit mode on x86.


Faster: Yes, x86-64 is most of the time faster than ia32 code.
Smaller: Usually no.

Some examples:

$ size /kernel/genunix /kernel/amd64/genunix
/kernel/genunix: 1097477 + 40705 + 184544 = 1322726
/kernel/amd64/genunix: 1629491 + 53808 + 284888 = 1968187

$ size /usr/sfw/lib/libcrypto.so /usr/sfw/lib/amd64/libcrypto.so
/usr/sfw/lib/libcrypto.so: 944748 + 84582 + 8310 = 1037640
/usr/sfw/lib/amd64/libcrypto.so: 1237360 + 123896 + 10540 = 1371796


So x86-64 code is usually at least ~30% larger than ia32 code (in total size, 
about 32% for libcrypto.so and 49% for genunix). This can also hurt 
performance in some benchmarks, because larger code puts more pressure on 
the caches.
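The growth figures can be checked directly from the `size` lines quoted above. The Python sketch below assumes the Solaris `size` output format shown there ("path: text + data + bss = total"); the parser and helper names are illustrative, not part of any Solaris tool.

```python
import re

# One line of Solaris `size` output: "path: text + data + bss = total".
SIZE_LINE = re.compile(
    r"^(?P<path>[^:\s]+): (?P<text>\d+) \+ (?P<data>\d+) \+ (?P<bss>\d+) = (?P<total>\d+)$"
)


def parse_size(line: str) -> dict:
    """Parse one `size` line into its path and integer segment sizes."""
    m = SIZE_LINE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised size line: {line!r}")
    sizes = {k: int(m.group(k)) for k in ("text", "data", "bss", "total")}
    # Sanity check: the three segments should add up to the reported total.
    if sizes["text"] + sizes["data"] + sizes["bss"] != sizes["total"]:
        raise ValueError(f"segments do not sum to total: {line!r}")
    sizes["path"] = m.group("path")
    return sizes


def growth_percent(old: int, new: int) -> float:
    """Percentage by which `new` is larger than `old`."""
    return (new / old - 1.0) * 100.0


# The two ia32/amd64 pairs quoted in the message.
PAIRS = [
    ("/kernel/genunix: 1097477 + 40705 + 184544 = 1322726",
     "/kernel/amd64/genunix: 1629491 + 53808 + 284888 = 1968187"),
    ("/usr/sfw/lib/libcrypto.so: 944748 + 84582 + 8310 = 1037640",
     "/usr/sfw/lib/amd64/libcrypto.so: 1237360 + 123896 + 10540 = 1371796"),
]

for ia32_line, amd64_line in PAIRS:
    ia32, amd64 = parse_size(ia32_line), parse_size(amd64_line)
    print(f"{amd64['path']}: text {growth_percent(ia32['text'], amd64['text']):+.1f}%, "
          f"total {growth_percent(ia32['total'], amd64['total']):+.1f}%")
```

Run as-is, this prints a total growth of +48.8% for genunix and +32.2% for libcrypto.so, so the kernel grows noticeably more than the library.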


I tried a simple benchmark (openssl speed) compiled with Sun Studio 11:


32 bit:

compiler: cc -KPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -fast -O -Xa

available timing options: TIMES TIMEB HZ=1000 [sysconf value]
timing function used: times
The 'numbers' are in 1000s of bytes per second processed.
type              16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes
md2               1355.59k    2862.13k    3971.95k    4391.52k    4536.13k
mdc2                  0.00        0.00        0.00        0.00        0.00
md4              14415.52k   49963.59k  147409.25k  287259.35k  398234.25k
md5              11809.87k   39453.73k  109647.76k  197091.34k  257964.24k
hmac(md5)        15609.16k   48985.32k  127089.03k  210243.13k  260602.33k
sha1             13064.55k   42881.78k  109946.85k  180015.07k  221802.32k
rmd160           10302.32k   29856.92k   67258.51k   97641.59k  112474.32k
rc4              94902.10k  104165.04k  108189.22k  109169.61k  109425.24k
des cbc          36492.43k   38349.58k   38847.14k   38664.25k   39016.97k
des ede3         14029.63k   14316.70k   14404.98k   14249.21k   14405.30k
idea cbc         39330.38k   41825.71k   42741.53k   42785.97k   42911.13k
rc2 cbc          21974.04k   22727.73k   22935.10k   22958.94k   22988.77k
rc5-32/12 cbc         0.00        0.00        0.00        0.00        0.00
blowfish cbc     66036.54k   72483.74k   74132.32k   74722.80k   74838.59k
cast cbc         41895.25k   44178.37k   44798.45k   45101.02k   45219.29k
aes-128 cbc      39793.82k   43434.97k   44639.73k   44914.96k   44925.55k
aes-192 cbc      34567.00k   37394.44k   38244.70k   38494.36k   38628.45k
aes-256 cbc      31409.13k   32771.61k   33463.78k   33622.82k   33709.01k
sha256           11090.84k   27899.81k   54231.09k   71305.92k   78475.95k
sha512            3701.83k   14766.45k   24035.07k   34668.49k   39845.03k
                    sign    verify   sign/s verify/s
rsa  512 bits  0.001101s 0.000075s    908.2  13369.4
rsa 1024 bits  0.004872s 0.000206s    205.3   4853.2
rsa 2048 bits  0.026451s 0.000624s     37.8   1602.9
rsa 4096 bits  0.156266s 0.002043s      6.4    489.5
                    sign    verify   sign/s verify/s
dsa  512 bits  0.000786s 0.000937s   1271.6   1067.4
dsa 1024 bits  0.002128s 0.002564s    469.9    390.1
dsa 2048 bits  0.006429s 0.007673s    155.5    130.3


64 bit:

compiler: cc -KPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -fast -xarch=amd64 -xstrconst -Xa -DL_ENDIAN

available timing options: TIMES TIMEB HZ=1000 [sysconf value]
timing function used: times
The 'numbers' are in 1000s of bytes per second processed.
type              16 bytes    64 bytes   256 bytes  1024 bytes  8192 bytes
md2               1659.88k    3489.66k    4827.13k    5362.44k    5509.88k
mdc2                  0.00        0.00        0.00        0.00        0.00
md4              18718.17k   60346.24k  153484.15k  249352.42k  305693.55k
md5              15092.03k   46804.15k  113282.40k  175620.12k  209357.83k
hmac(md5)        17113.37k   51404.40k  120048.88k  179077.59k  210206.45k
sha1             14774.09k   42463.94k   76415.66k  120985.27k  146305.33k
rmd160           11647.51k   31071.76k   61971.67k   82660.30k   91280.71k
rc4             169194.41k  183235.03k  185723.58k  187306.07k  188022.50k
des cbc          38621.38k   39945.91k   40263.18k   40329.16k   40383.68k
des ede3         15407.90k   15598.96k   15654.06k   15672.38k   15675.24k
idea cbc         43711.29k   46672.59k   47439.44k   47668.87k   47704.25k
rc2 cbc          23168.34k   23851.05k   24037.38k   24081.51k   24055.92k
rc5-32/12 cbc         0.00        0.00        0.00        0.00        0.00
blowfish cbc     65418.92k   70249.71k   71613.96k   71999.04k   72241.55k
cast cbc         43763.67k   45880.14k   46417.32k   46572.90k   46605.58k
aes-128 cbc      87854.71k   96515.10k   99639.62k  100672.92k  101320.07k
aes-192 cbc      78534.91k   85301.65k   87745.73k   88271.48k   88410.94k
aes-256 cbc      70460.15k   76202.03k   78129.60k   78656.47k   78688.77k
sha256            9523.14k   22442.72k   40073.15k

Re: [osol-discuss] Re: LAMP for Solaris aka SAMP

2006-10-19 Thread Frank Hofmann


You're comparing apples and oranges a bit there, given the compiler options 
you used. For IA32 code, SSE/SSE2 extensions and conditional moves are 
disabled by default because not every x86 CPU has them; you need to tell the 
compiler explicitly to generate code for a CPU that supports them. In AMD64 
mode, on the other hand, the compiler can safely assume that SSE/SSE2, 
conditional moves and a few other features are available.


The difference between ia32 with these extensions enabled and amd64 would be 
smaller, as you found out on UltraSPARC, where 32-bit and 64-bit code differ 
only in register width.


FrankH.


Re: [osol-discuss] Re: LAMP for Solaris aka SAMP

2006-10-19 Thread Daniel Rock

Frank Hofmann wrote:


You're comparing a bit apples and oranges there with the compiler 
options you used.


32 bit:

compiler: cc [...] -fast -O

-fast implies -xtarget=native, which implies -xchip=native.

The -O after -fast shouldn't negate the -xchip=XXX selection.


Daniel


Re: [osol-discuss] Re: LAMP for Solaris aka SAMP

2006-10-19 Thread Joerg Schilling
Daniel Rock [EMAIL PROTECTED] wrote:

 compiler: cc [...] -fast -O

 -fast implies -xtarget=native which implies -xchip=native

 the -O after -fast shouldn't negate the -xchip=XXX selection.

-fast is a macro that gets expanded.
It will most likely override everything to its left.
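Joerg's point, together with Daniel's later finding that the 32-bit build ended up at -xO3 and the 64-bit build at -xO5, amounts to "expand -fast in place, then later options override earlier ones." The Python below is a toy model of that rule, not Sun Studio's actual driver logic. `FAST_SUBSET` is an assumption holding only two of the options -fast expands to, and mapping -O to -xO3 is taken from the follow-up in this thread rather than from the compiler documentation.

```python
from typing import List, Optional

# ASSUMPTION: only two of the many options -fast expands to; the real
# expansion is longer and varies between compiler releases.
FAST_SUBSET = ["-xO5", "-xtarget=native"]

# ASSUMPTION: -O behaves as -xO3, as reported later in this thread.
ALIASES = {"-O": "-xO3"}


def expand(flags: List[str]) -> List[str]:
    """Expand -fast in place and rewrite aliased flags."""
    out: List[str] = []
    for flag in flags:
        if flag == "-fast":
            out.extend(FAST_SUBSET)
        else:
            out.append(ALIASES.get(flag, flag))
    return out


def effective_opt_level(flags: List[str]) -> Optional[int]:
    """Return the -xO level that wins; later options override earlier ones."""
    level = None
    for flag in expand(flags):
        if flag.startswith("-xO") and flag[3:].isdigit():
            level = int(flag[3:])
    return level


if __name__ == "__main__":
    ia32 = ["-KPIC", "-fast", "-O", "-Xa"]
    amd64 = ["-KPIC", "-fast", "-xarch=amd64", "-xstrconst", "-Xa"]
    print("32-bit build:", effective_opt_level(ia32))   # -O after -fast wins
    print("64-bit build:", effective_opt_level(amd64))  # -fast's -xO5 stands
```

Under these assumptions the two command lines from the original benchmark resolve to levels 3 and 5, which is why the first openssl comparison was skewed.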

Jörg

-- 
 EMail:[EMAIL PROTECTED] (home) Jörg Schilling D-13353 Berlin
   [EMAIL PROTECTED](uni)  
   [EMAIL PROTECTED] (work) Blog: http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/old/private/ ftp://ftp.berlios.de/pub/schily


Re: [osol-discuss] Re: LAMP for Solaris aka SAMP

2006-10-19 Thread Rich Teer
On Thu, 19 Oct 2006, UNIX admin wrote:

 I guess you forgot that 32-bit code is actually faster in most cases than 
 64-bit code?

On SPARC, agreed, but on x86 it ain't necessarily so.  64-bit code
on AMD processors has access to more registers, so it tends to be
faster than 32-bit code.

-- 
Rich Teer, SCNA, SCSA, OpenSolaris CAB member

President,
Rite Online Inc.

Voice: +1 (250) 979-1638
URL: http://www.rite-group.com/rich


Re: [osol-discuss] Re: LAMP for Solaris aka SAMP

2006-10-19 Thread Joerg Schilling
Rich Teer [EMAIL PROTECTED] wrote:

 On Thu, 19 Oct 2006, UNIX admin wrote:

  I guess you forgot that 32-bit code is actually faster in most cases than 
  64-bit code?

 On SPARC, agreed, but on x86 it ain't necessarily so.  64-bit code
 on AMD processors has access to more registers, so it tends to be
 faster than 32-bit code.

64-bit code on SPARC is typically 5-10% slower; AMD64 code is typically 30%
faster because there are twice as many registers.


Jörg



Re: [osol-discuss] Re: LAMP for Solaris aka SAMP

2006-10-19 Thread Derek E. Lewis

Joerg Schilling wrote:

64 Bit code on Sparc is typically 5-10% slower, AMD64 code is typically 30%
faster because there are twice as much registers.
  


Moreover, AMD64 supports various memory models, so even when an application 
is compiled as 64-bit, it need not always address the full 64-bit address 
space.


http://developers.sun.com/sunstudio/articles/mmodel.html

--
Derek E. Lewis
[EMAIL PROTECTED]
http://riemann.solnetworks.net/~dlewis



Re: [osol-discuss] Re: LAMP for Solaris aka SAMP

2006-10-19 Thread Daniel Rock

Joerg Schilling wrote:

64 Bit code on Sparc is typically 5-10% slower, AMD64 code is typically 30%
faster because there are twice as much registers.


30% is very optimistic. My test results vary between 30% slower and 200% 
faster depending on the application and compiler. On average I'd say AMD64 
code will be ~10% faster.


My previously posted results with openssl speed are void: the 32-bit code was 
compiled with -xO3, while the 64-bit code was compiled with -xO5. I reran the 
tests, which on average still favour 64-bit code, but to a lesser extent.


Test environment:

cc: Sun C 5.8 Patch 121016-03 2006/06/07
ube: Sun Compiler Common 11 Patch 120759-08 2006/08/08
../gcc-4.1.1/configure --with-system-zlib --with-gnu-as 
--with-as=/usr/sfw/bin/gas --without-included-gettext 
--without-libiconv-prefix --enable-languages=c,c++,ada,fortran,objc --with-x 
--enable-java-awt=xlib

Thread model: posix
gcc version 4.1.1
AMD Athlon(tm) 64 X2 Dual Core Processor 4400+  ( == Opteron 175)
2x1 GB RAM Dual Channel DDR400 CL3 ECC


Numbers below are the relative performance of AMD64 vs. IA32 (< 0%: IA32 
faster, > 0%: AMD64 faster).
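The sign convention above can be written as a small helper. The exact formula behind the tables is not stated, so this Python sketch assumes percentages are computed as AMD64 rate / IA32 rate - 1 for higher-is-better numbers (openssl speed throughput, sign/s), and via 1/time for per-operation times. The function names are hypothetical.

```python
def relative_from_rates(ia32_rate: float, amd64_rate: float) -> float:
    """Relative performance for a higher-is-better metric such as
    openssl speed throughput (1000s of bytes/s) or sign/s.
    Positive: AMD64 faster; negative: IA32 faster."""
    return (amd64_rate / ia32_rate - 1.0) * 100.0


def relative_from_times(ia32_seconds: float, amd64_seconds: float) -> float:
    """Same convention for a lower-is-better metric such as seconds per
    signature: each time is converted to a rate (1/t) first."""
    return relative_from_rates(1.0 / ia32_seconds, 1.0 / amd64_seconds)


if __name__ == "__main__":
    # Illustrative inputs only, not measurements from this thread.
    print(f"{relative_from_rates(100.0, 150.0):+.2f}%")  # AMD64 50% faster
    print(f"{relative_from_times(2.0, 1.0):+.2f}%")      # AMD64 twice as fast
```

Read this way, +100% means twice as fast, so a gcc entry such as "rsa 1024 bits 124.20%" would mean the AMD64 build signed about 2.24 times as fast as the IA32 build.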



(1) OpenSSL 0.9.8d

Studio 11   32 vs. 64 bits
./Configure no-asm solaris-x86-cc
./Configure no-asm solaris64-x86_64-cc
cc -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -fast
-xstrconst [ -xarch=amd64 -Xa -DL_ENDIAN ]

type                16B      64B     256B    1024B    8192B
md2             -10.05%  -11.15%  -11.74%  -12.12%  -12.34%
md4               8.86%    6.05%    0.52%   -5.88%  -10.08%
md5              16.38%   10.94%    0.72%   -8.88%  -14.06%
hmac(md5)        -2.63%   -5.14%   -8.79%  -12.78%  -14.68%
sha1              4.24%  -11.21%  -21.91%  -26.25%  -28.58%
rmd160           -1.22%  -10.99%  -20.55%  -26.60%  -29.35%
rc4              78.52%   82.69%   80.98%   81.75%   81.79%
des cbc          -8.77%   -9.57%   -9.63%   -9.69%   -9.63%
idea cbc          6.43%    6.04%    6.02%    6.10%    5.85%
rc2 cbc          -0.68%   -1.16%   -1.18%   -1.27%   -1.46%
blowfish cbc     -7.59%   -9.09%   -9.35%   -9.42%   -9.98%
cast cbc        -23.04%  -24.26%  -24.59%  -25.31%  -24.85%
aes-128 cbc      60.48%   61.71%   61.91%   62.32%   62.27%
aes-192 cbc      64.41%   63.91%   64.31%   65.11%   65.13%
aes-256 cbc      65.03%   66.60%   67.89%   67.40%   67.45%
sha256          -16.11%  -19.27%  -23.42%  -25.54%  -26.56%
sha512           82.83%   83.21%  112.24%  129.11%  137.42%

                   sign   verify
rsa 512 bits     40.73%   28.55%
rsa 1024 bits    28.89%   17.55%
rsa 2048 bits    15.93%    3.47%
rsa 4096 bits     7.69%   -3.87%
dsa 512 bits     29.38%   30.25%
dsa 1024 bits    20.51%   21.10%
dsa 2048 bits     7.06%    7.65%


gcc 4.1.1   32 vs. 64 bits
./Configure no-asm solaris-x86-gcc
./Configure no-asm solaris64-x86_64-gcc
gcc -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -O3
-fomit-frame-pointer -DL_ENDIAN
{ -march=pentium -DOPENSSL_NO_INLINE_ASM |
  -m64 -DL_ENDIAN -DMD32_REG_T=int }

type                16B      64B     256B    1024B    8192B
md2              -7.40%   -9.46%  -10.21%   -9.36%   -8.95%
md4              26.77%   24.25%   19.80%   14.47%   11.19%
md5              18.20%   16.72%   11.50%    6.06%    2.69%
hmac(md5)        19.03%   16.02%   10.69%    5.95%    2.59%
sha1             16.22%   13.13%   16.53%   20.12%   22.24%
rmd160           24.41%   17.51%   12.67%    8.13%    6.07%
rc4              22.65%   22.98%   23.10%   23.19%   23.17%
des cbc          38.35%   37.66%   37.36%   37.29%   37.11%
idea cbc         10.96%    6.71%    3.94%    3.69%    3.33%
rc2 cbc           1.53%    0.27%   -0.23%   -0.22%   -0.33%
blowfish cbc      1.14%   -1.38%   -1.93%   -2.16%   -2.19%
cast cbc         95.12%   97.09%   97.57%   97.94%   98.07%
aes-128 cbc      76.22%   82.13%   83.89%   84.50%   84.79%
aes-192 cbc      84.24%   86.69%   88.12%   88.91%   89.08%
aes-256 cbc      83.59%   90.55%   91.96%   92.34%   92.52%
sha256           -3.48%   -2.75%   -1.07%   -0.29%    0.09%
sha512          177.33%  177.60%  242.40%  279.34%  301.04%

                   sign   verify
rsa 512 bits     94.92%  109.87%
rsa 1024 bits   124.20%  123.21%
rsa 2048 bits   136.36%  130.01%
rsa 4096 bits   142.86%  129.65%
dsa 512 bits    117.52%  114.45%
dsa 1024 bits   137.08%  128.02%
dsa 2048 bits   134.24%  130.59%




(2) gzip/bzip2

I also measured compression/decompression speed with gzip and bzip2 (test 
file: gcc-4.1.1.tar):

Studio 11 32 vs. 64    gcc 4.1.1 32 vs. 64
gzip -5.78 % 23.69 %
gunzip2   2.46 %  2.26 %
bzip2 3.47 %