Re: benchmarking - it's now all(-1,0,1,5,6)% faster

2003-02-16 Thread Nicholas Clark
On Tue, Feb 11, 2003 at 09:24:21AM +0100, A. Bergman wrote:
 
 On måndag, feb 10, 2003, at 23:03 Europe/Stockholm, 
 [EMAIL PROTECTED] wrote:

I've never heard myself called that before :-)

 This is probably the right default for the general case, but it is
 counterproductive for benchmarking small code changes. So on gcc 2.95
 I'm compiling with:
 
 -O -malign-loops=3 -malign-jumps=3 -malign-functions=3
 -mpreferred-stack-boundary=3 -march=i686
 
 (that's 2**3, i.e. 8)
 
 and on gcc 3.2 on a different machine:
 -O3 -falign-loops=16 -falign-jumps=16 -falign-functions=16 
 -mpreferred-stack-boundary=3 -march=i586
 
 
 Does compiling with these settings make general perl faster?

No. All together about 2% slower, at least according to perlbench.  I'm not
that surprised.  I didn't experiment with trying each independently, as I
was looking for deterministic benchmarks when code changed slightly, rather
than a faster speed from aggressive compiler options. I expect that the
loop-alignment-forcing options will only really help on the tight loops, such
as the substring-finding loop or the hashing loop, although the hashing loop
is inlined everywhere it is needed, so that's a pain. If someone has the time
they could experiment with which options might help, and with whether pulling
some code that benefits from (say) loop alignment out into a separate file
helps measurably.

Although I suspect that anyone with time to do this would be better spending
it making better benchmarks.

Nicholas Clark



Re: benchmarking - it's now all(-1,0,1,5,6)% faster

2003-02-11 Thread A . Bergman

On måndag, feb 10, 2003, at 23:03 Europe/Stockholm, 
[EMAIL PROTECTED] wrote:


50% of the time your function/label/loop/jump is 16 byte aligned.
50% of the time your function/label/loop/jump is randomly aligned

So, a slight code size change early on in a file can cause the remaining
functions to ping either onto, or off, alignment. Hence later loops in
completely unrelated code can happen to become optimally aligned, and go
faster. And similarly other loops which were optimally aligned will now
go unaligned, and go more slowly.

This is probably the right default for the general case, but it is
counterproductive for benchmarking small code changes. So on gcc 2.95
I'm compiling with:

-O -malign-loops=3 -malign-jumps=3 -malign-functions=3
-mpreferred-stack-boundary=3 -march=i686

(that's 2**3, i.e. 8)

and on gcc 3.2 on a different machine:
-O3 -falign-loops=16 -falign-jumps=16 -falign-functions=16 
-mpreferred-stack-boundary=3 -march=i586


Does compiling with these settings make general perl faster?

Arthur



Re: benchmarking - it's now all(-1,0,1,5,6)% faster

2003-02-10 Thread perl6-internals-return-14948-archive=jab . org
On Sun, Jan 12, 2003 at 10:24:23AM +0100, Leopold Toetsch wrote:
 In perl.perl6.internals, you wrote:
  --- Leopold Toetsch [EMAIL PROTECTED] wrote:
* SLOW (same slow with register or odd aligned)
* 0x818118a <jit_func+194>:	sub	0x8164cac,%ebx
* 0x8181190 <jit_func+200>:	jne	0x818118a <jit_func+194>
 
  The slow one has the loop crossing over a 16 byte boundary. Try moving it
  over a bit.
 
 Yep, actually it looks like an 8-byte boundary.
 The following program:

 And here is the output:
 
   0   790.826400 M op/s
   1   523.305494 M op/s
   2   788.544190 M op/s
   3   783.447189 M op/s
   4   783.975462 M op/s
   5   788.208178 M op/s
   6   782.466484 M op/s
   7   788.059343 M op/s
   8   788.836349 M op/s
   9   522.986581 M op/s
  10   788.895326 M op/s
  11   784.021624 M op/s
  12   789.773978 M op/s
  13   788.065635 M op/s
  14   783.558056 M op/s
  15   789.010709 M op/s
  16   782.463565 M op/s
  17   523.049517 M op/s
  18   781.350657 M op/s

etc

 This of course assumes that the program ran at the same address,
 which - from my experience with gdb - is usually true.
 
 So moving the critical part of a program by just one byte can cause a
 huge slowdown.

I don't think that I ever mailed what seemed to be the answer back to p5p
or p6i. Thanks to Leo's suggestions I went hunting in the gcc man pages.
2.95 and 3.0 are quite informative.

-falign-functions
-falign-labels
-falign-loops
-falign-jumps

all default to a machine-dependent value. This default isn't documented
explicitly, but I presume that on x86 it's the same as the x86-specific -m
options of the same name (deprecated in gcc 3.0, and removed along with their
documentation by 3.2).

*Their* alignment defaults are:

`-malign-loops=NUM'
 Align loops to a 2 raised to a NUM byte boundary.  If
 `-malign-loops' is not specified, the default is 2 unless gas 2.8
 (or later) is being used in which case the default is to align the
 loop on a 16 byte boundary if it is less than 8 bytes away.

so

50% of the time your function/label/loop/jump is 16 byte aligned.
50% of the time your function/label/loop/jump is randomly aligned

So, a slight code size change early on in a file can cause the remaining
functions to ping either onto, or off, alignment. Hence later loops in
completely unrelated code can happen to become optimally aligned, and go
faster. And similarly other loops which were optimally aligned will now
go unaligned, and go more slowly.

This is probably the right default for the general case, but it is
counterproductive for benchmarking small code changes. So on gcc 2.95 I'm
compiling with:

-O -malign-loops=3 -malign-jumps=3 -malign-functions=3
-mpreferred-stack-boundary=3 -march=i686

(that's 2**3, i.e. 8)

and on gcc 3.2 on a different machine:
-O3 -falign-loops=16 -falign-jumps=16 -falign-functions=16 
-mpreferred-stack-boundary=3 -march=i586

This seems to smooth out the jumps.
In the end, copy-on-write regexps are on average 0% faster on the fast PIII
machine with gcc 2.95, and about 2% faster on the slower Cyrix with gcc 3.2,
based on what perlbench thinks.

Nicholas Clark



Re: benchmarking - it's now all(-1,0,1,5,6)% faster

2003-02-10 Thread hv
Dunno where this 'from' line came from, but it says here:
[EMAIL PROTECTED] wrote:
:On Sun, Jan 12, 2003 at 10:24:23AM +0100, Leopold Toetsch wrote:
:all default to a machine dependent default. This default isn't documented
:explicitly, but I presume that on x86 it's the same as the x86 specific -m
:options of the same name (deprecated in gcc 3.0, removed along with their
:documentation by 3.2)
:
:*Their* alignment defaults are:
:
:`-malign-loops=NUM'
: Align loops to a 2 raised to a NUM byte boundary.  If
: `-malign-loops' is not specified, the default is 2 unless gas 2.8
: (or later) is being used in which case the default is to align the
: loop on a 16 byte boundary if it is less than 8 bytes away.
:
:so
:
:50% of the time your function/label/loop/jump is 16 byte aligned.
:50% of the time your function/label/loop/jump is randomly aligned

I read this differently: 16n+7 should be aligned to 16n, because it
is less than 8 bytes away; 16n+9 should similarly be aligned to 16n+16.
Only 16n+8 would be unaligned, so only in 1/16 of random cases
would it fail to be 16-byte aligned, and even then it would still be
8-byte aligned.

That doesn't necessarily invalidate any of the rest of what was said.

Hugo



Re: benchmarking - it's now all(-1,0,1,5,6)% faster

2003-01-14 Thread James Mastros
On 01/12/2003 4:41 AM, Leopold Toetsch wrote:

There might be additional problems with glibc, but the deviations in JIT
code timings are only caused by moving the loop by one byte (crossing an
8-byte boundary).

Do we have enough metadata at JIT time to pad locations that get jmp'd to
so that they land on an 8-byte boundary in memory?

BTW, I legitimately don't know.  I have a sinking suspicion that the 
only way to know if something is a jump target is to scan through the 
entire bytecode and check if it gets used as one.  (For that matter, you 
can jump to the value of an Ix reg, which makes even that infeasible, no?)

(BTW, I removed p5p from the CC list, since I don't think this makes 
sense for non-JIT targets... and since p5 doesn't JIT...)

	-=- James Mastros



Re: benchmarking - it's now all(-1,0,1,5,6)% faster

2003-01-14 Thread Dan Sugalski
At 7:43 AM +0100 1/14/03, Leopold Toetsch wrote:

BTW, I legitimately don't know.  I have a sinking suspicion that 
the only way to know if something is a jump target is to scan through 
the entire bytecode and check if it gets used as one.

I'm all for having an optional jump/branch target section in the 
bytecode that can be filled in, and have performance suffer if 
something decides to jump/branch to an un-noted location.
--
Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: benchmarking - it's now all(-1,0,1,5,6)% faster

2003-01-12 Thread Leopold Toetsch
In perl.perl6.internals, you wrote:
 --- Leopold Toetsch [EMAIL PROTECTED] wrote:
   * SLOW (same slow with register or odd aligned)
   * 0x818118a <jit_func+194>:	sub	0x8164cac,%ebx
   * 0x8181190 <jit_func+200>:	jne	0x818118a <jit_func+194>

 The slow one has the loop crossing over a 16 byte boundary. Try moving it
 over a bit.

Yep, actually it looks like an 8-byte boundary.
The following program:

#!/usr/bin/perl -w
use strict;

for (my $i = 0; $i < 100; $i++) {
    printf "%3d\t", $i;
    open(P, ">m.pasm");
    for (0..$i) {
        print(P <<'ENOP');
	noop
ENOP
    }
    print(P <<'EOF');
	set	I3, 1
	set	I4, 1
	set	I5, I4
	time	N1
REDO:	sub	I4, I4, I3
	if	I4, REDO
	time	N5
	sub	N2, N5, N1
	set	N1, I5
	mul	N1, 2
	div	N1, N2
	set	N2, 100.0
	div	N1, N2
	print	N1
	print	" M op/s\n"
	end
EOF
    close(P);
    system("perl assemble.pl m.pasm | parrot -j -");
}

And here is the output:

  0 790.826400 M op/s
  1 523.305494 M op/s
  2 788.544190 M op/s
  3 783.447189 M op/s
  4 783.975462 M op/s
  5 788.208178 M op/s
  6 782.466484 M op/s
  7 788.059343 M op/s
  8 788.836349 M op/s
  9 522.986581 M op/s
 10 788.895326 M op/s
 11 784.021624 M op/s
 12 789.773978 M op/s
 13 788.065635 M op/s
 14 783.558056 M op/s
 15 789.010709 M op/s
 16 782.463565 M op/s
 17 523.049517 M op/s
 18 781.350657 M op/s
 19 784.184698 M op/s
 20 789.683646 M op/s
 21 781.362666 M op/s
 22 783.994146 M op/s
 23 789.100887 M op/s
 24 783.990848 M op/s
 25 370.620840 M op/s
 26 786.862561 M op/s
 27 784.092342 M op/s
 28 789.106826 M op/s
 29 784.027852 M op/s
 30 780.688935 M op/s
 31 787.913154 M op/s
 32 783.576354 M op/s
 33 526.877272 M op/s
 34 780.493905 M op/s
 35 790.339116 M op/s
 36 789.166586 M op/s
 37 782.154592 M op/s
 38 786.902789 M op/s
 39 783.834446 M op/s
 40 784.003305 M op/s
 41 522.135984 M op/s
 42 780.618829 M op/s
 43 790.167145 M op/s
 44 783.284786 M op/s
 45 790.363689 M op/s
 46 781.002931 M op/s
 47 783.720572 M op/s
 48 789.774350 M op/s
 49 523.933363 M op/s
 50 786.970706 M op/s
 51 780.966576 M op/s
 52 789.234894 M op/s
 53 784.317040 M op/s
 54 780.993842 M op/s
 55 789.914164 M op/s
 56 783.705196 M op/s
 57 291.958023 M op/s
 58 783.653215 M op/s
 59 788.739927 M op/s
 60 784.599837 M op/s
 61 783.917218 M op/s
 62 790.051795 M op/s
 63 782.589121 M op/s
 64 784.846120 M op/s
 65 523.988181 M op/s
 66 788.746231 M op/s
 67 781.811980 M op/s
 68 786.188159 M op/s
 69 790.023521 M op/s
 70 783.149502 M op/s
 71 786.531300 M op/s
 72 781.711076 M op/s
 73 527.106372 M op/s
 74 783.735948 M op/s
 75 788.491194 M op/s
 76 782.442035 M op/s
 77 780.387170 M op/s
 78 789.259770 M op/s
 79 779.781801 M op/s
 80 788.186701 M op/s
 81 523.328673 M op/s
 82 790.407627 M op/s
 83 782.751235 M op/s
 84 788.410417 M op/s
 85 782.625627 M op/s
 86 782.056516 M op/s
 87 787.631292 M op/s
 88 782.218409 M op/s
 89 425.664145 M op/s
 90 778.734333 M op/s
 91 787.851363 M op/s
 92 784.661485 M op/s
 93 788.292247 M op/s
 94 783.754621 M op/s
 95 789.181805 M op/s
 96 788.326694 M op/s
 97 523.357568 M op/s
 98 782.105369 M op/s
 99 781.796679 M op/s

This of course assumes that the program ran at the same address,
which - from my experience with gdb - is usually true.

So moving the critical part of a program by just one byte can cause a
huge slowdown.

(This is an Athlon 800, i386/linux)

leo



Re: benchmarking - it's now all(-1,0,1,5,6)% faster

2003-01-12 Thread Leopold Toetsch
Andreas J. Koenig wrote:


On Sat, 11 Jan 2003 22:26:39 +0100, Leopold Toetsch [EMAIL PROTECTED] said:



   Nicholas Clark wrote:
  So I'm confused. It looks like some bits of perl are incredibly sensitive to
  cache alignment, or something similar.

   This reminds me of my remarks on JITed mops.pasm, which varied ~50%

And it reminds me of my postings to p5p about glibc being very buggy
up to 2.3 (posted during last October). I came to the conclusion that
perl cannot be benchmarked at all with glibc before v2.3.


There might be additional problems with glibc, but the deviations in JIT 
code timings are only caused by moving the loop by one byte (crossing an 
8-byte boundary).

leo




Re: benchmarking - it's now all(-1,0,1,5,6)% faster

2003-01-12 Thread Peter Nimmervoll
[EMAIL PROTECTED] wrote:


Nicholas Clark [EMAIL PROTECTED] wrote:
:So I'm confused. It looks like some bits of perl are incredibly sensitive to
:cache alignment, or something similar. And as a consequence, perlbench is
:reliably reporting wildly varying timings because of this, and because it
:only tries a few, very specific things. Does this mean that it's still useful?

I think I remember seeing a profiler that emulates the x86 instruction set,
and so can give theoretically exact timings. Does this ring a bell for
anyone? I don't know if the emulation extended to details such as RAM
and cache sizes ...
 

Do you mean Valgrind?
You can get it from http://developer.kde.org/~sewardj/

Peter



Re: benchmarking - it's now all(-1,0,1,5,6)% faster

2003-01-12 Thread Michael G Schwern
On Sat, Jan 11, 2003 at 07:05:22PM +, Nicholas Clark wrote:
 I was getting about 5% speedups on penfold against vanilla development perl.
 Penfold is an x86 box (actually a Cyrix chip, which may be important) running
 Debian unstable, with gcc 3.2.1 and 256M of RAM.
 
 I tried the same tests on mirth, a ppc box, again Debian unstable, gcc 3.2.1,
 but 128M of RAM. This time I saw 1% slowdowns.

FWIW, in the past I've noticed that x86 and PPC do react differently to
optimizations.  I've had cases where things ran at the same speed on PPC
yet showed large differences on x86.


-- 

Michael G. Schwern   [EMAIL PROTECTED]http://www.pobox.com/~schwern/
Perl Quality Assurance  [EMAIL PROTECTED] Kwalitee Is Job One



Re: benchmarking - it's now all(-1,0,1,5,6)% faster

2003-01-11 Thread Nicholas Clark
On Sat, Jan 11, 2003 at 11:17:57PM +0100, Andreas J. Koenig wrote:

 And it reminds me of my postings to p5p about glibc being very buggy
 up to 2.3 (posted during last October). I came to the conclusion that
 perl cannot be benchmarked at all with glibc before v2.3.

I remember your posting, but not the details. Did it relate to glibc's malloc
and how long it took to free things? If so, surely benchmarking using perl's
malloc would work with earlier glibcs?

Anyway, on the two Debian systems I tested:

nick@penfold:~/5.8.0-i-g/t$ ls -l /lib/libc.so.6
lrwxrwxrwx    1 root     root           13 Jan  2 08:46 /lib/libc.so.6 -> libc-2.3.1.so
nick@mirth:~$ ls -l /lib/libc.so.6
lrwxrwxrwx    1 root     root           13 Jan  7 16:20 /lib/libc.so.6 -> libc-2.3.1.so

And (obviously) the FreeBSD box has BSD's libc.

Thanks for the reminder. It's only good luck that I (well Richard) had 2.3.1
on them.

Nicholas Clark



Re: benchmarking - it's now all(-1,0,1,5,6)% faster

2003-01-11 Thread Mr. Nobody
--- Leopold Toetsch [EMAIL PROTECTED] wrote:
 Nicholas Clark wrote:
 
 
  So I'm confused. It looks like some bits of perl are incredibly sensitive to
  cache alignment, or something similar.
 
 
 This reminds me of my remarks on JITed mops.pasm, which varied ~50% (or
 more) depending on the position of the loop in memory; see near the end
 of jit/i386/jit_emit.h.
 
 
 And no, I still don't know what's going on.
 
 
 (The story for perl5-porters + my comment:
   the loop is just 1 subtraction and a conditional jump. Inserting nops 
 before this loop has a drastic impact on performance. Below is the gdb 
 output of the loop.)
 
 /* my i386/athlon has a drastic speed penalty for what?
   * not for unaligned odd jump targets
   *
   * But:
   * mops.pbc 790 => 300-530 if code gets just 4 bytes bigger
   * (loop is at 200 instead of 196 ???)
   *
   * FAST:
   * 0x818100a <jit_func+194>:	sub	%edi,%ebx
   * 0x818100c <jit_func+196>:	jne	0x818100a <jit_func+194>
   *
   * Same fast speed w/o 2nd register
   * 0x8181102 <jit_func+186>:	sub	0x8164c2c,%ebx
   * 0x8181108 <jit_func+192>:	jne	0x8181102 <jit_func+186>
   *
   * SLOW (same slow with register or odd aligned)
   * 0x818118a <jit_func+194>:	sub	0x8164cac,%ebx
   * 0x8181190 <jit_func+200>:	jne	0x818118a <jit_func+194>
   *
   */

  Nicholas Clark

 leo

The slow one has the loop crossing over a 16 byte boundary. Try moving it
over a bit.




Re: benchmarking - it's now all(-1,0,1,5,6)% faster

2003-01-11 Thread hv
Nicholas Clark [EMAIL PROTECTED] wrote:
:So I'm confused. It looks like some bits of perl are incredibly sensitive to
:cache alignment, or something similar. And as a consequence, perlbench is
:reliably reporting wildly varying timings because of this, and because it
:only tries a few, very specific things. Does this mean that it's still useful?

I think I remember seeing a profiler that emulates the x86 instruction set,
and so can give theoretically exact timings. Does this ring a bell for
anyone? I don't know if the emulation extended to details such as RAM
and cache sizes ...

Hugo



Re: benchmarking - it's now all(-1,0,1,5,6)% faster

2003-01-11 Thread Andreas J. Koenig
 On Sat, 11 Jan 2003 22:31:42 +, Nicholas Clark [EMAIL PROTECTED] said:

   On Sat, Jan 11, 2003 at 11:17:57PM +0100, Andreas J. Koenig wrote:
  And it reminds me of my postings to p5p about glibc being very buggy
  up to 2.3 (posted during last October). I came to the conclusion that
  perl cannot be benchmarked at all with glibc before v2.3.

   I remember your posting, but not the details. Did it relate to glibc's malloc
   and how long it took to free things?

Yes.

   If so, surely benchmarking using perl's malloc would work with
   earlier glibc's?

I saw the erratic speed behaviour with 2.2.3, 2.2.4, and 2.2.5, and
didn't test earlier ones. glibc 2.3 had malloc rewritten from scratch,
and in my limited testing it seemed to have this problem fixed.

   Anyway, on the two Debian systems I tested:

   nick@penfold:~/5.8.0-i-g/t$ ls -l /lib/libc.so.6
   lrwxrwxrwx    1 root     root           13 Jan  2 08:46 /lib/libc.so.6 -> libc-2.3.1.so
   nick@mirth:~$ ls -l /lib/libc.so.6
   lrwxrwxrwx    1 root     root           13 Jan  7 16:20 /lib/libc.so.6 -> libc-2.3.1.so

   And (obviously) the FreeBSD has BSD's libc

   Thanks for the reminder. It's only good luck that I (well Richard) had 2.3.1
   on them.

Well, then my findings don't solve the puzzle.

-- 
andreas