Re: benchmarking - it's now all(-1,0,1,5,6)% faster
On Tue, Feb 11, 2003 at 09:24:21AM +0100, A. Bergman wrote:
> On måndag, feb 10, 2003, at 23:03 Europe/Stockholm, [EMAIL PROTECTED] wrote:

I've never heard myself called that before :-)

> > This is probably the right default for the general case, but it is
> > counterproductive for benchmarking small code changes. So on gcc 2.95
> > I'm compiling with:
> >
> >   -O -malign-loops=3 -malign-jumps=3 -malign-functions=3
> >   -mpreferred-stack-boundary=3 -march=i686
> >
> > (that's 2**3, ie 8) and on gcc 3.2 on a different machine:
> >
> >   -O3 -falign-loops=16 -falign-jumps=16 -falign-functions=16
> >   -mpreferred-stack-boundary=3 -march=i586
>
> Does compiling with these settings make general perl faster?

No. All together about 2% slower, at least according to perlbench. I'm not
that surprised. I didn't experiment with trying each option independently,
as I was looking for deterministic benchmarks when code changed slightly,
rather than a faster speed from aggressive compiler options.

I expect that the loop alignment forcing options will only really help on
tight loops, such as the substring finding loop or the hashing loop,
although the hashing loop is inlined everywhere it is needed, so that's a
pain.

If someone has the time they could experiment with which options might
help, and whether pulling some code out into a separate file that benefits
from (say) loop alignment helps measurably. Although I suspect that anyone
with time to do this would be better off spending it making better
benchmarks.

Nicholas Clark
Re: benchmarking - it's now all(-1,0,1,5,6)% faster
On måndag, feb 10, 2003, at 23:03 Europe/Stockholm, [EMAIL PROTECTED] wrote:
> 50% of the time your function/label/loop/jump is 16 byte aligned.
> 50% of the time your function/label/loop/jump is randomly aligned.
>
> So, a slight code size change early on in a file can cause the remaining
> functions to ping either onto, or off, alignment. Hence later loops in
> completely unrelated code can happen to become optimally aligned, and go
> faster. And similarly other loops which were optimally aligned will now
> go unaligned, and go more slowly.
>
> This is probably the right default for the general case, but it is
> counterproductive for benchmarking small code changes. So on gcc 2.95
> I'm compiling with:
>
>   -O -malign-loops=3 -malign-jumps=3 -malign-functions=3
>   -mpreferred-stack-boundary=3 -march=i686
>
> (that's 2**3, ie 8) and on gcc 3.2 on a different machine:
>
>   -O3 -falign-loops=16 -falign-jumps=16 -falign-functions=16
>   -mpreferred-stack-boundary=3 -march=i586

Does compiling with these settings make general perl faster?

Arthur
Re: benchmarking - it's now all(-1,0,1,5,6)% faster
On Sun, Jan 12, 2003 at 10:24:23AM +0100, Leopold Toetsch wrote:
> In perl.perl6.internals, you wrote:
> > --- Leopold Toetsch [EMAIL PROTECTED] wrote:
> > > * SLOW (same slow with register or odd aligned)
> > > * 0x818118a jit_func+194:  sub  0x8164cac,%ebx
> > > * 0x8181190 jit_func+200:  jne  0x818118a jit_func+194
> >
> > The slow one has the loop crossing over a 16 byte boundary. Try moving
> > it over a bit.
>
> Yep, actually it looks like an 8 byte boundary. Following program:
> And here is the output:
>
>    0  790.826400 M op/s
>    1  523.305494 M op/s
>    2  788.544190 M op/s
>    3  783.447189 M op/s
>    4  783.975462 M op/s
>    5  788.208178 M op/s
>    6  782.466484 M op/s
>    7  788.059343 M op/s
>    8  788.836349 M op/s
>    9  522.986581 M op/s
>   10  788.895326 M op/s
>   11  784.021624 M op/s
>   12  789.773978 M op/s
>   13  788.065635 M op/s
>   14  783.558056 M op/s
>   15  789.010709 M op/s
>   16  782.463565 M op/s
>   17  523.049517 M op/s
>   18  781.350657 M op/s
>   etc
>
> This of course has the assumption that the program did run at the same
> address, which is - from my experience with gdb - usually true. So
> moving the critical part of a program by just one byte can cause a huge
> slowdown.

I don't think that I ever mailed what seemed to be the answer back to p5p
or p6i. Thanks to Leo's suggestions I went hunting in the gcc man pages.
2.95 and 3.0 are quite informative.

  -falign-functions -falign-labels -falign-loops -falign-jumps

all default to a machine dependent default. This default isn't documented
explicitly, but I presume that on x86 it's the same as the x86 specific -m
options of the same name (deprecated in gcc 3.0, removed along with their
documentation by 3.2). *Their* alignment defaults are:

  `-malign-loops=NUM'
       Align loops to a 2 raised to a NUM byte boundary. If
       `-malign-loops' is not specified, the default is 2 unless gas 2.8
       (or later) is being used in which case the default is to align the
       loop on a 16 byte boundary if it is less than 8 bytes away.

so

50% of the time your function/label/loop/jump is 16 byte aligned.
50% of the time your function/label/loop/jump is randomly aligned.

So, a slight code size change early on in a file can cause the remaining
functions to ping either onto, or off, alignment. Hence later loops in
completely unrelated code can happen to become optimally aligned, and go
faster. And similarly other loops which were optimally aligned will now go
unaligned, and go more slowly.

This is probably the right default for the general case, but it is
counterproductive for benchmarking small code changes. So on gcc 2.95 I'm
compiling with:

  -O -malign-loops=3 -malign-jumps=3 -malign-functions=3
  -mpreferred-stack-boundary=3 -march=i686

(that's 2**3, ie 8) and on gcc 3.2 on a different machine:

  -O3 -falign-loops=16 -falign-jumps=16 -falign-functions=16
  -mpreferred-stack-boundary=3 -march=i586

This seems to smooth out the jumps. In the end copy on write regexps are
on average 0% faster on the fast PIII machine with gcc 2.95, and about 2%
faster on the slower Cyrix with gcc 3.2, based on what perlbench thinks.

Nicholas Clark
Re: benchmarking - it's now all(-1,0,1,5,6)% faster
Dunno where this 'from' line came from, but it says here:

[EMAIL PROTECTED] wrote:
:On Sun, Jan 12, 2003 at 10:24:23AM +0100, Leopold Toetsch wrote:
:
:all default to a machine dependent default. This default isn't documented
:explicitly, but I presume that on x86 it's the same as the x86 specific -m
:options of the same name (deprecated in gcc 3.0, removed along with their
:documentation by 3.2)
:
:*Their* alignment defaults are:
:
:`-malign-loops=NUM'
:     Align loops to a 2 raised to a NUM byte boundary. If
:     `-malign-loops' is not specified, the default is 2 unless gas 2.8
:     (or later) is being used in which case the default is to align the
:     loop on a 16 byte boundary if it is less than 8 bytes away.
:
:so
:
:50% of the time your function/label/loop/jump is 16 byte aligned.
:50% of the time your function/label/loop/jump is randomly aligned.

I read this differently: 16n+7 should be aligned to 16n, because it is
less than 8 bytes away; 16n+9 should be aligned to 16n+16 similarly. Only
16n+8 would be unaligned, so that only in 1/16 random cases would it fail
to be 16-byte aligned, and then it would still be 8-byte aligned.

That doesn't necessarily invalidate any of the rest of what was said.

Hugo
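The two readings of the gas default can be checked mechanically. Below is a
small Python sketch of the arithmetic only; it is an illustration of the two
interpretations, not of actual assembler behaviour (a real assembler can
only pad forward with nops, never move code backwards):

```python
# Two readings of gcc's -malign-loops default ("align the loop on a 16
# byte boundary if it is less than 8 bytes away"), checked over every
# possible offset of a loop within a 16-byte line.

def forward_reading(off):
    # Pad forward with nops: align iff fewer than 8 pad bytes are
    # needed (pad == 0 means the loop is already aligned).
    pad = (16 - off) % 16
    return pad < 8

def nearest_reading(off):
    # Hugo's reading: "less than 8 bytes away" in either direction.
    return min(off, (16 - off) % 16) < 8

unaligned_fwd = [off for off in range(16) if not forward_reading(off)]
unaligned_near = [off for off in range(16) if not nearest_reading(off)]
print(unaligned_fwd)   # offsets left unaligned under the forward reading
print(unaligned_near)  # offsets left unaligned under Hugo's reading
```

Under the forward-only reading, offsets 16n+1 through 16n+8 stay unaligned
(8 of 16, matching the "50% of the time" estimate); under the
nearest-boundary reading only 16n+8 does (1/16, matching Hugo's count).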
Re: benchmarking - it's now all(-1,0,1,5,6)% faster
On 01/12/2003 4:41 AM, Leopold Toetsch wrote:
> There might be additional problems with glibc, but the deviations in JIT
> code timings are only caused by moving the loop by one byte (crossing an
> 8 byte boundary).

Do we have enough metadata at JIT-time to pad locations that get jmp'd to
to an 8-byte boundary in memory?

BTW, I legitimately don't know. I have a sinking suspicion that the only
way to know if something is a jump target is to scan through the entire
bytecode and check if it gets used as one. (For that matter, you can jump
to the value of an Ix reg, which makes even that infeasible, no?)

(BTW, I removed p5p from the CC list, since I don't think this makes sense
for non-JIT targets... and since p5 doesn't JIT...)

-=- James Mastros
Re: benchmarking - it's now all(-1,0,1,5,6)% faster
At 7:43 AM +0100 1/14/03, Leopold Toetsch wrote:
> BTW, I legitimately don't know. I have a sinking suspicion that the only
> way to know if something is a jump target is to scan through the entire
> bytecode and check if it gets used as one.

I'm all for having an optional jump/branch target section in the bytecode
that can be filled in, and have performance suffer if something decides to
jump/branch to an un-noted location.

--
                                      --it's like this---
Dan Sugalski                           even samurai
[EMAIL PROTECTED]                      have teddy bears and even
                                       teddy bears get drunk
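The target-section idea can be sketched in miniature. The Python toy below
is purely illustrative: the opcode names, operand encoding, and fixed
3-byte op size are all invented, and this is not Parrot's bytecode format.
It shows the two pieces discussed above: a prepass that collects noted
branch targets, and a layout pass that nop-pads only those locations to an
8-byte boundary:

```python
# Toy model: record every branch target in a side table, then pad those
# (and only those) locations to an alignment boundary when laying out
# the emitted code.  Jumps to un-noted locations would simply land on
# unpadded, possibly slow, addresses.

BRANCH_OPS = {"jmp", "jnz"}   # invented ops whose operand is a target index

def collect_targets(code):
    """code: list of (op, operand) tuples; returns the set of target indices."""
    return {arg for op, arg in code if op in BRANCH_OPS}

def emit(code, base=0, align=8):
    """Assign fake addresses, pretending each op encodes to 3 bytes,
    and nop-pad every known branch target up to `align`."""
    targets = collect_targets(code)
    addrs, addr = [], base
    for i, (op, arg) in enumerate(code):
        if i in targets and addr % align:
            addr += align - addr % align   # insert nop padding
        addrs.append(addr)
        addr += 3
    return addrs

# A one-instruction countdown loop: "jnz" branches back to index 2.
code = [("set", 0), ("set", 1), ("sub", 2), ("jnz", 2), ("end", 0)]
addrs = emit(code, base=1)
print(addrs)          # [1, 4, 8, 11, 14] -- the loop head was padded to 8
```

The cost of the prepass is one linear scan; the cost of a missing entry is
only speed, not correctness, which matches the "performance suffers for
un-noted locations" trade-off.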
Re: benchmarking - it's now all(-1,0,1,5,6)% faster
In perl.perl6.internals, you wrote:
> --- Leopold Toetsch [EMAIL PROTECTED] wrote:
> > * SLOW (same slow with register or odd aligned)
> > * 0x818118a jit_func+194:  sub  0x8164cac,%ebx
> > * 0x8181190 jit_func+200:  jne  0x818118a jit_func+194
>
> The slow one has the loop crossing over a 16 byte boundary. Try moving
> it over a bit.

Yep, actually it looks like an 8 byte boundary. Following program:

#!/usr/bin/perl -w
use strict;

for (my $i = 0; $i < 100; $i++) {
    printf "%3d\t", $i;
    open(P, "> m.pasm");
    for (0..$i) {
        print P <<'ENOP';
  noop
ENOP
    }
    print P <<'EOF';
  set   I3, 1
  set   I4, 1
  set   I5, I4
  time  N1
REDO:
  sub   I4, I4, I3
  if    I4, REDO
  time  N5
  sub   N2, N5, N1
  set   N1, I5
  mul   N1, 2
  div   N1, N2
  set   N2, 100.0
  div   N1, N2
  print N1
  print " M op/s\n"
  end
EOF
    close(P);
    system("perl assemble.pl m.pasm | parrot -j -");
}

And here is the output:

  0  790.826400 M op/s
  1  523.305494 M op/s
  2  788.544190 M op/s
  3  783.447189 M op/s
  4  783.975462 M op/s
  5  788.208178 M op/s
  6  782.466484 M op/s
  7  788.059343 M op/s
  8  788.836349 M op/s
  9  522.986581 M op/s
 10  788.895326 M op/s
 11  784.021624 M op/s
 12  789.773978 M op/s
 13  788.065635 M op/s
 14  783.558056 M op/s
 15  789.010709 M op/s
 16  782.463565 M op/s
 17  523.049517 M op/s
 18  781.350657 M op/s
 19  784.184698 M op/s
 20  789.683646 M op/s
 21  781.362666 M op/s
 22  783.994146 M op/s
 23  789.100887 M op/s
 24  783.990848 M op/s
 25  370.620840 M op/s
 26  786.862561 M op/s
 27  784.092342 M op/s
 28  789.106826 M op/s
 29  784.027852 M op/s
 30  780.688935 M op/s
 31  787.913154 M op/s
 32  783.576354 M op/s
 33  526.877272 M op/s
 34  780.493905 M op/s
 35  790.339116 M op/s
 36  789.166586 M op/s
 37  782.154592 M op/s
 38  786.902789 M op/s
 39  783.834446 M op/s
 40  784.003305 M op/s
 41  522.135984 M op/s
 42  780.618829 M op/s
 43  790.167145 M op/s
 44  783.284786 M op/s
 45  790.363689 M op/s
 46  781.002931 M op/s
 47  783.720572 M op/s
 48  789.774350 M op/s
 49  523.933363 M op/s
 50  786.970706 M op/s
 51  780.966576 M op/s
 52  789.234894 M op/s
 53  784.317040 M op/s
 54  780.993842 M op/s
 55  789.914164 M op/s
 56  783.705196 M op/s
 57  291.958023 M op/s
 58  783.653215 M op/s
 59  788.739927 M op/s
 60  784.599837 M op/s
 61  783.917218 M op/s
 62  790.051795 M op/s
 63  782.589121 M op/s
 64  784.846120 M op/s
 65  523.988181 M op/s
 66  788.746231 M op/s
 67  781.811980 M op/s
 68  786.188159 M op/s
 69  790.023521 M op/s
 70  783.149502 M op/s
 71  786.531300 M op/s
 72  781.711076 M op/s
 73  527.106372 M op/s
 74  783.735948 M op/s
 75  788.491194 M op/s
 76  782.442035 M op/s
 77  780.387170 M op/s
 78  789.259770 M op/s
 79  779.781801 M op/s
 80  788.186701 M op/s
 81  523.328673 M op/s
 82  790.407627 M op/s
 83  782.751235 M op/s
 84  788.410417 M op/s
 85  782.625627 M op/s
 86  782.056516 M op/s
 87  787.631292 M op/s
 88  782.218409 M op/s
 89  425.664145 M op/s
 90  778.734333 M op/s
 91  787.851363 M op/s
 92  784.661485 M op/s
 93  788.292247 M op/s
 94  783.754621 M op/s
 95  789.181805 M op/s
 96  788.326694 M op/s
 97  523.357568 M op/s
 98  782.105369 M op/s
 99  781.796679 M op/s

This of course has the assumption that the program did run at the same
address, which is - from my experience with gdb - usually true. So moving
the critical part of a program by just one byte can cause a huge slowdown.

(This is an Athlon 800, i386/linux)

leo
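The slow rows in the table above recur with a fixed stride. A quick Python
check over a subset of the measured values (transcribed from the output;
the 600 M op/s cut-off is an arbitrary threshold separating the fast ~780
runs from the slow ones) shows the slow nop counts are exactly 8 apart,
consistent with the loop being dragged across an 8-byte boundary once per
8 added bytes:

```python
# Subset of Leo's measurements: {number of leading noops: M op/s}.
mops = {0: 790.8, 1: 523.3, 2: 788.5, 8: 788.8, 9: 523.0, 16: 782.5,
        17: 523.0, 25: 370.6, 33: 526.9, 41: 522.1, 49: 523.9,
        57: 292.0, 65: 524.0, 73: 527.1, 81: 523.3, 89: 425.7, 97: 523.4}

# Pick out the slow runs and measure the spacing between them.
slow = sorted(n for n, v in mops.items() if v < 600)
gaps = [b - a for a, b in zip(slow, slow[1:])]
print(slow)   # slow positions: 1, 9, 17, 25, ...
print(gaps)   # every gap is 8
```

(Each noop presumably adds one byte of code here, so a period of 8 slow
positions is exactly what an 8-byte boundary effect predicts.)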
Re: benchmarking - it's now all(-1,0,1,5,6)% faster
Andreas J. Koenig wrote:
> On Sat, 11 Jan 2003 22:26:39 +0100, Leopold Toetsch [EMAIL PROTECTED] said:
> > Nicholas Clark wrote:
> > > So I'm confused. It looks like some bits of perl are incredibly
> > > sensitive to cache alignment, or something similar.
> >
> > This reminds me of my remarks on JITed mops.pasm which varied ~50%
>
> And it reminds me of my postings to p5p about glibc being very buggy up
> to 2.3 (posted during last October). I came to the conclusion that perl
> cannot be benchmarked at all with glibc before v2.3.

There might be additional problems with glibc, but the deviations in JIT
code timings are only caused by moving the loop by one byte (crossing an
8 byte boundary).

leo
Re: benchmarking - it's now all(-1,0,1,5,6)% faster
[EMAIL PROTECTED] wrote:
> Nicholas Clark [EMAIL PROTECTED] wrote:
> :So I'm confused. It looks like some bits of perl are incredibly
> :sensitive to cache alignment, or something similar. And as a
> :consequence, perlbench is reliably reporting wildly varying timings
> :because of this, and because it only tries a few, very specific things.
>
> Does this mean that it's still useful?
>
> I think I remember seeing a profiler that emulates the x86 instruction
> set, and so can give theoretically exact timings. Does this ring a bell
> for anyone? I don't know if the emulation extended to details such as
> RAM and cache sizes ...
>
> Hugo

Do you mean Valgrind? You can get it from http://developer.kde.org/~sewardj/

Peter
Re: benchmarking - it's now all(-1,0,1,5,6)% faster
On Sat, Jan 11, 2003 at 07:05:22PM +, Nicholas Clark wrote:
> I was getting about 5% speedups on penfold against vanilla development
> perl. Penfold is an x86 box (actually a Cyrix chip, which may be
> important) running Debian unstable, with gcc 3.2.1 and 256M of RAM.
>
> I tried the same tests on mirth, a ppc box, again Debian unstable, gcc
> 3.2.1, but 128M of RAM. This time I saw 1% slowdowns.

FWIW, in the past I've noticed that x86 and PPC do react differently to
optimizations. I've had cases where things ran at the same speed on PPC
yet showed large differences on x86.

--
Michael G. Schwern  [EMAIL PROTECTED]  http://www.pobox.com/~schwern/
Perl Quality Assurance  [EMAIL PROTECTED]  Kwalitee Is Job One
Re: benchmarking - it's now all(-1,0,1,5,6)% faster
On Sat, Jan 11, 2003 at 11:17:57PM +0100, Andreas J. Koenig wrote:
> And it reminds me of my postings to p5p about glibc being very buggy up
> to 2.3 (posted during last October). I came to the conclusion that perl
> cannot be benchmarked at all with glibc before v2.3.

I remember your posting, but not the details. Did it relate to glibc's
malloc and how long it took to free things? If so, surely benchmarking
using perl's malloc would work with earlier glibc's?

Anyway, on the two Debian systems I tested:

  nick@penfold:~/5.8.0-i-g/t$ ls -l /lib/libc.so.6
  lrwxrwxrwx  1 root  root  13 Jan  2 08:46 /lib/libc.so.6 -> libc-2.3.1.so
  nick@mirth:~$ ls -l /lib/libc.so.6
  lrwxrwxrwx  1 root  root  13 Jan  7 16:20 /lib/libc.so.6 -> libc-2.3.1.so

And (obviously) the FreeBSD has BSD's libc.

Thanks for the reminder. It's only good luck that I (well, Richard) had
2.3.1 on them.

Nicholas Clark
Re: benchmarking - it's now all(-1,0,1,5,6)% faster
--- Leopold Toetsch [EMAIL PROTECTED] wrote:
> Nicholas Clark wrote:
> > So I'm confused. It looks like some bits of perl are incredibly
> > sensitive to cache alignment, or something similar.
>
> This reminds me of my remarks on JITed mops.pasm which varied ~50% (or
> more) depending on the position of the loop in memory. s. near the end
> of jit/i386/jit_emit.h. And no, I still don't know what's going on.
>
> (The story for perl5-porters + my comment: the loop is just 1
> subtraction and a conditional jump. Inserting nops before this loop has
> drastic impact on performance. Below is the gdb output of the loop.)
>
> /* my i386/athlon has a drastic speed penalty for what?
>  * not for unaligned odd jump targets
>  *
>  * But:
>  * mops.pbc 790 = 300-530 if code gets just 4 bytes bigger
>  * (loop is at 200 instead of 196 ???)
>  *
>  * FAST:
>  * 0x818100a jit_func+194:  sub  %edi,%ebx
>  * 0x818100c jit_func+196:  jne  0x818100a jit_func+194
>  *
>  * Same fast speed w/o 2nd register
>  * 0x8181102 jit_func+186:  sub  0x8164c2c,%ebx
>  * 0x8181108 jit_func+192:  jne  0x8181102 jit_func+186
>  *
>  * SLOW (same slow with register or odd aligned)
>  * 0x818118a jit_func+194:  sub  0x8164cac,%ebx
>  * 0x8181190 jit_func+200:  jne  0x818118a jit_func+194
>  */
>
> Nicholas Clark
> leo

The slow one has the loop crossing over a 16 byte boundary. Try moving it
over a bit.
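From the addresses in that comment one can check directly which loop body
straddles a boundary. A small Python sketch; the instruction lengths below
are inferred from the address gaps in the disassembly (the register
`sub`/short `jne` pair is 2+2 = 4 bytes; in the slow case the
memory-operand `sub` spans +194 to +200, i.e. 6 bytes, plus a 2-byte
`jne`), so treat them as an assumption rather than measured fact:

```python
# Does a code span of `length` bytes starting at `start` cross a
# boundary of the given line size?
def spans(start, length, line):
    return start // line != (start + length - 1) // line

fast = (0x818100a, 4)    # sub %edi,%ebx ; jne       -> 4 bytes total
slow = (0x818118a, 8)    # sub 0x8164cac,%ebx ; jne  -> 8 bytes total

for name, (addr, size) in [("fast", fast), ("slow", slow)]:
    print(name, spans(addr, size, 8), spans(addr, size, 16))
```

The fast loop fits entirely inside one 8-byte (and 16-byte) line, while
the slow one crosses both an 8-byte and a 16-byte boundary, so these two
data points alone cannot distinguish the 8-byte from the 16-byte theory;
Leo's nop-sweep output (period 8) is what settles it.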
Re: benchmarking - it's now all(-1,0,1,5,6)% faster
Nicholas Clark [EMAIL PROTECTED] wrote:
:So I'm confused. It looks like some bits of perl are incredibly sensitive
:to cache alignment, or something similar. And as a consequence, perlbench
:is reliably reporting wildly varying timings because of this, and because
:it only tries a few, very specific things.

Does this mean that it's still useful?

I think I remember seeing a profiler that emulates the x86 instruction
set, and so can give theoretically exact timings. Does this ring a bell
for anyone? I don't know if the emulation extended to details such as RAM
and cache sizes ...

Hugo
Re: benchmarking - it's now all(-1,0,1,5,6)% faster
On Sat, 11 Jan 2003 22:31:42 +, Nicholas Clark [EMAIL PROTECTED] said:
> On Sat, Jan 11, 2003 at 11:17:57PM +0100, Andreas J. Koenig wrote:
> > And it reminds me of my postings to p5p about glibc being very buggy
> > up to 2.3 (posted during last October). I came to the conclusion that
> > perl cannot be benchmarked at all with glibc before v2.3.
>
> I remember your posting, but not the details. Did it relate to glibc's
> malloc and how long it took to free things?

Yes.

> If so, surely benchmarking using perl's malloc would work with earlier
> glibc's?

I saw the erratic speed behaviour with 2.2.3, 2.2.4, and 2.2.5 and didn't
test earlier ones. glibc 2.3 had malloc rewritten from scratch, and with
my limited testing it seemed to have this problem fixed.

> Anyway, on the two Debian systems I tested:
>
>   nick@penfold:~/5.8.0-i-g/t$ ls -l /lib/libc.so.6
>   lrwxrwxrwx  1 root  root  13 Jan  2 08:46 /lib/libc.so.6 -> libc-2.3.1.so
>   nick@mirth:~$ ls -l /lib/libc.so.6
>   lrwxrwxrwx  1 root  root  13 Jan  7 16:20 /lib/libc.so.6 -> libc-2.3.1.so
>
> And (obviously) the FreeBSD has BSD's libc.
>
> Thanks for the reminder. It's only good luck that I (well Richard) had
> 2.3.1 on them.

Well, then my findings don't solve the puzzle.

--
andreas