Re: LOCK overheads (was Re: objtrm problem probably found)

1999-07-19 Thread Julian Elischer

A bit late, but some more data points.


90MHz Pentium, FreeBSD 2.2.7
mode  0  60.80 ns/loop nproc=1 lcks=EMPTY
mode  1  91.13 ns/loop nproc=1 lcks=no
mode  2  91.11 ns/loop nproc=2 lcks=no
mode  3 242.59 ns/loop nproc=1 lcks=yes
mode  4 242.69 ns/loop nproc=2 lcks=yes
mode  5 586.27 ns/loop nproc=1 lcks=no
mode  6 586.91 ns/loop nproc=2 lcks=no
mode  7 749.28 ns/loop nproc=1 lcks=yes
mode  8 746.70 ns/loop nproc=2 lcks=yes
mode  9 181.96 ns/loop nproc=1 lcks=EMPTY
mode 10 242.56 ns/loop nproc=1 lcks=no
mode 11 242.69 ns/loop nproc=2 lcks=no
mode 12 343.80 ns/loop nproc=1 lcks=yes
mode 13 343.77 ns/loop nproc=2 lcks=yes
mode 14 727.79 ns/loop nproc=1 lcks=no
mode 15 729.95 ns/loop nproc=2 lcks=no
mode 16 850.10 ns/loop nproc=1 lcks=yes
mode 17 848.02 ns/loop nproc=2 lcks=yes


200MHz Pentium Pro, -current, same binary as above:
mode  0  42.76 ns/loop nproc=1 lcks=EMPTY
mode  1  32.01 ns/loop nproc=1 lcks=no
mode  2  33.30 ns/loop nproc=2 lcks=no
mode  3 191.30 ns/loop nproc=1 lcks=yes
mode  4 191.62 ns/loop nproc=2 lcks=yes
mode  5  93.12 ns/loop nproc=1 lcks=no
mode  6  94.54 ns/loop nproc=2 lcks=no
mode  7 195.16 ns/loop nproc=1 lcks=yes
mode  8 200.91 ns/loop nproc=2 lcks=yes
mode  9  65.83 ns/loop nproc=1 lcks=EMPTY
mode 10  90.32 ns/loop nproc=1 lcks=no
mode 11  90.33 ns/loop nproc=2 lcks=no
mode 12 236.61 ns/loop nproc=1 lcks=yes
mode 13 236.70 ns/loop nproc=2 lcks=yes
mode 14 120.83 ns/loop nproc=1 lcks=no
mode 15 122.12 ns/loop nproc=2 lcks=no
mode 16 276.92 ns/loop nproc=1 lcks=yes
mode 17 277.19 ns/loop nproc=2 lcks=yes

200MHz Pentium Pro, -current, compiled with -current compiler:
mode  0  35.30 ns/loop nproc=1 lcks=EMPTY
mode  1  22.13 ns/loop nproc=1 lcks=no
mode  2  22.31 ns/loop nproc=2 lcks=no
mode  3 186.26 ns/loop nproc=1 lcks=yes
mode  4 186.39 ns/loop nproc=2 lcks=yes
mode  5  75.61 ns/loop nproc=1 lcks=no
mode  6  78.52 ns/loop nproc=2 lcks=no
mode  7 191.46 ns/loop nproc=1 lcks=yes
mode  8 191.65 ns/loop nproc=2 lcks=yes
mode  9  69.34 ns/loop nproc=1 lcks=EMPTY
mode 10  86.68 ns/loop nproc=1 lcks=no
mode 11  86.49 ns/loop nproc=2 lcks=no
mode 12 237.49 ns/loop nproc=1 lcks=yes
mode 13 236.67 ns/loop nproc=2 lcks=yes
mode 14 134.96 ns/loop nproc=1 lcks=no
mode 15 134.99 ns/loop nproc=2 lcks=no
mode 16 276.90 ns/loop nproc=1 lcks=yes
mode 17 277.33 ns/loop nproc=2 lcks=yes



Not exactly sure what all this means, but whatever mode 17 is, it can
sure be expensive.

of course this is a UP machine...
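
One way to read the tables is to subtract each lcks=no figure from the
lcks=yes figure that follows it. The sketch below does that for the 90MHz
Pentium numbers. The pairing (mode N with mode N+2) is an assumption taken
from the nproc/lcks labels only; the thread does not say what each mode
actually runs, and the names P90, PAIRS and lock_deltas are made up for
illustration.

```python
# ns/loop for the nproc=1 rows of the 90MHz Pentium table above.
P90 = {
    1: 91.13, 3: 242.59,
    5: 586.27, 7: 749.28,
    10: 242.56, 12: 343.80,
    14: 727.79, 16: 850.10,
}

# ASSUMPTION: each lcks=no mode pairs with the lcks=yes mode two rows later.
PAIRS = [(1, 3), (5, 7), (10, 12), (14, 16)]


def lock_deltas(times, pairs):
    """Return {(unlocked_mode, locked_mode): extra ns/loop with locks}."""
    return {(a, b): round(times[b] - times[a], 2) for a, b in pairs}


if __name__ == "__main__":
    for (a, b), delta in lock_deltas(P90, PAIRS).items():
        print(f"mode {a:2d} -> mode {b:2d}: +{delta:.2f} ns/loop")
```

On these numbers the extra cost with locks comes out between roughly 100
and 165 ns/loop, whatever the underlying loop body is.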

julian


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-current" in the body of the message



Re: LOCK overheads (was Re: objtrm problem probably found)

1999-07-13 Thread Ollivier Robert

According to Matthew Dillon:
> Wow, now that *is* expensive!  The K6 must be implementing it in
> microcode for it to be that bad.

K6-200:

244 [21:57] roberto@keltia:src/C ./locktest  0
...
empty 26.84 ns/loop
1proc 22.62 ns/loop
2proc 22.64 ns/loop
empty w/locks 17.58 ns/loop
1proc w/locks 288.28 ns/loop
2proc w/locks 288.16 ns/loop

It hurts :(
-- 
Ollivier ROBERT -=- FreeBSD: The Power to Serve! -=- [EMAIL PROTECTED]
FreeBSD keltia.freenix.fr 4.0-CURRENT #72: Mon Jul 12 08:26:43 CEST 1999






Re: LOCK overheads (was Re: objtrm problem probably found)

1999-07-13 Thread Peter Jeremy

Matthew Dillon [EMAIL PROTECTED] wrote:
> :mode 1   17.99 ns/loop nproc=1 lcks=no
> :mode 3  166.33 ns/loop nproc=1 lcks=yes
> ...
> :This is a K6-2 350. Locks are pretty expensive on them.
> Wow, now that *is* expensive!  The K6 must be implementing it in
> microcode for it to be that bad.

I wouldn't be surprised if lock prefixes did result in microcode
execution.  As I stated yesterday, I don't believe locked instructions
are used frequently enough to warrant special handling, and are
therefore likely to be implemented in whichever way needs the least
chip area.

Since you need to be able to track and mark the memory references
associated with the instruction, the cheapest implementation (in terms
of dedicated chip area) is likely to be something like: wait until all
currently executing instructions are complete, wait until all pending
memory writes are complete (at least to L1 cache), assert the lock pin
and execute the RMW instruction without allowing any other instructions
to commence, deassert the lock pin.  This is (of course) probably the
worst case for execution time as seen by that CPU - though it's not
far off optimal as seen by other CPUs.

(`Assert lock pin' should also be mapped into a `begin locked memory
reference' using whatever cache coherency protocol is being used).

I'm not saying that you can't implement a locked RMW sequence a lot
better, but until the chip architects decide that the performance is
an issue, they aren't likely to spend any silicon on it.  The big
IA-32 market is UP systems running games - and locked RMW instructions
don't affect this market.  Intel see the high-end of the market (where
SMP and hence locked RMW is more of an issue) moving to Merced.  This
suggests that it's unlikely that the IA-32 will ever acquire a decent
lock capability (though at least the PIII is no worse than the PII).

That said, the above timings make a lock prefix cost over 50 core
clocks (or 15 bus clocks) - even microcode couldn't be that bad.  My
other timings (core/bus cycles) were: 486DX2: 20/10, Pentium: 28/7,
P-II: 34/8.5, P-III: 34/7.5.
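
For anyone wanting to check the 50/15 figure, here is a rough sketch of the
arithmetic using the quoted mode 1 and mode 3 timings. The 350 MHz core
clock comes from the "K6-2 350" label; the 100 MHz bus clock is an
assumption (it is not stated in the thread, though it is what the 15 bus
clocks figure implies). The function name lock_cost_clocks is made up for
illustration.

```python
def lock_cost_clocks(locked_ns, unlocked_ns, clock_mhz):
    """Extra cost of the locked loop over the unlocked one, in clock cycles."""
    delta_ns = locked_ns - unlocked_ns
    # 1 MHz is one cycle per microsecond, so cycles = ns * MHz / 1000.
    return delta_ns * clock_mhz / 1000.0


if __name__ == "__main__":
    # Quoted timings: mode 1 (lcks=no) 17.99 ns, mode 3 (lcks=yes) 166.33 ns.
    core = lock_cost_clocks(166.33, 17.99, 350)  # core clock from "K6-2 350"
    bus = lock_cost_clocks(166.33, 17.99, 100)   # ASSUMED 100 MHz bus clock
    print(f"{core:.1f} core clocks, {bus:.1f} bus clocks")
```

This gives a little under 52 core clocks and a little under 15 bus clocks
for the 148.34 ns difference, in line with the figures above.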

I suspect that these timings are a combination of inefficient on-chip
implementation of the lock prefix (see above for my reasoning behind
this), together with poor off-chip handling of locked cycles.

Peter

