Re: LOCK overheads (was Re: objtrm problem probably found)

1999-07-19 Thread Julian Elischer

A bit late, but some more data points.


90MHz Pentium, FreeBSD 2.2.7
mode  0  60.80 ns/loop nproc=1 lcks=EMPTY
mode  1  91.13 ns/loop nproc=1 lcks=no
mode  2  91.11 ns/loop nproc=2 lcks=no
mode  3 242.59 ns/loop nproc=1 lcks=yes
mode  4 242.69 ns/loop nproc=2 lcks=yes
mode  5 586.27 ns/loop nproc=1 lcks=no
mode  6 586.91 ns/loop nproc=2 lcks=no
mode  7 749.28 ns/loop nproc=1 lcks=yes
mode  8 746.70 ns/loop nproc=2 lcks=yes
mode  9 181.96 ns/loop nproc=1 lcks=EMPTY
mode 10 242.56 ns/loop nproc=1 lcks=no
mode 11 242.69 ns/loop nproc=2 lcks=no
mode 12 343.80 ns/loop nproc=1 lcks=yes
mode 13 343.77 ns/loop nproc=2 lcks=yes
mode 14 727.79 ns/loop nproc=1 lcks=no
mode 15 729.95 ns/loop nproc=2 lcks=no
mode 16 850.10 ns/loop nproc=1 lcks=yes
mode 17 848.02 ns/loop nproc=2 lcks=yes


200MHz Pentium Pro, -current, same binary as above;
mode  0  42.76 ns/loop nproc=1 lcks=EMPTY
mode  1  32.01 ns/loop nproc=1 lcks=no
mode  2  33.30 ns/loop nproc=2 lcks=no
mode  3 191.30 ns/loop nproc=1 lcks=yes
mode  4 191.62 ns/loop nproc=2 lcks=yes
mode  5  93.12 ns/loop nproc=1 lcks=no
mode  6  94.54 ns/loop nproc=2 lcks=no
mode  7 195.16 ns/loop nproc=1 lcks=yes
mode  8 200.91 ns/loop nproc=2 lcks=yes
mode  9  65.83 ns/loop nproc=1 lcks=EMPTY
mode 10  90.32 ns/loop nproc=1 lcks=no
mode 11  90.33 ns/loop nproc=2 lcks=no
mode 12 236.61 ns/loop nproc=1 lcks=yes
mode 13 236.70 ns/loop nproc=2 lcks=yes
mode 14 120.83 ns/loop nproc=1 lcks=no
mode 15 122.12 ns/loop nproc=2 lcks=no
mode 16 276.92 ns/loop nproc=1 lcks=yes
mode 17 277.19 ns/loop nproc=2 lcks=yes

200MHz Pentium Pro, -current, compiled with -current compiler
mode  0  35.30 ns/loop nproc=1 lcks=EMPTY
mode  1  22.13 ns/loop nproc=1 lcks=no
mode  2  22.31 ns/loop nproc=2 lcks=no
mode  3 186.26 ns/loop nproc=1 lcks=yes
mode  4 186.39 ns/loop nproc=2 lcks=yes
mode  5  75.61 ns/loop nproc=1 lcks=no
mode  6  78.52 ns/loop nproc=2 lcks=no
mode  7 191.46 ns/loop nproc=1 lcks=yes
mode  8 191.65 ns/loop nproc=2 lcks=yes
mode  9  69.34 ns/loop nproc=1 lcks=EMPTY
mode 10  86.68 ns/loop nproc=1 lcks=no
mode 11  86.49 ns/loop nproc=2 lcks=no
mode 12 237.49 ns/loop nproc=1 lcks=yes
mode 13 236.67 ns/loop nproc=2 lcks=yes
mode 14 134.96 ns/loop nproc=1 lcks=no
mode 15 134.99 ns/loop nproc=2 lcks=no
mode 16 276.90 ns/loop nproc=1 lcks=yes
mode 17 277.33 ns/loop nproc=2 lcks=yes



Not exactly sure what all this means, but whatever mode 17 is, it can
sure be expensive...

of course this is a UP machine...

julian





Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-13 Thread Alan Cox

Before this thread on "cache coherence" and "memory consistency" goes
any further, I'd like to suggest a time-out to read something like
http://www-ece.rice.edu/~sarita/Publications/models_tutorial.ps.
A lot of what I'm reading has a grain of truth but isn't quite
right.  This paper appeared as a tutorial in IEEE Computer a couple
years ago, and is fairly accessible to a general audience.

Alan





Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-13 Thread Mike Smith

 On Mon, Jul 12, 1999 at 10:38:03PM -0700, Mike Smith wrote:
  I said:
   than indirect function calls on some architectures: inline
   branched code.  So you still have a global variable selecting
   locked/non-locked, but it's a boolean, rather than a pointer.
   Your atomic macros are then { if (atomic_lock) asm("lock;foo");
   else asm ("foo"); }
  
  This requires you to have all the methods present at compile time, 
  which defeats the entire purpose of dynamic method loading.
 
 Pardon?  I didn't see a discussion of dynamic loading anywhere
 here.  We were referring to tiny inlined assembly language routines.
 The existing implementation is #defines in a C header file.

Uh, no, I was asking about the overhead involved in indirect function 
calls specifically because there are instances where fast indirect 
methods would allow us to greatly improve the generality of parts of 
the FreeBSD kernel and modules.  This includes code where we are 
currently using compile-time defines that require us to trade 
performance for generality.

-- 
\\  The mind's the standard   \\  Mike Smith
\\  of the man.   \\  [EMAIL PROTECTED]
\\-- Joseph Merrick   \\  [EMAIL PROTECTED]







Re: LOCK overheads (was Re: objtrm problem probably found)

1999-07-13 Thread Ollivier Robert

According to Matthew Dillon:
 Wow, now that *is* expensive!  The K6 must be implementing it in
 microcode for it to be that bad.

K6-200:

244 [21:57] roberto@keltia:src/C ./locktest  0
...
empty 26.84 ns/loop
1proc 22.62 ns/loop
2proc 22.64 ns/loop
empty w/locks 17.58 ns/loop
1proc w/locks 288.28 ns/loop
2proc w/locks 288.16 ns/loop

It hurts :(
-- 
Ollivier ROBERT -=- FreeBSD: The Power to Serve! -=- [EMAIL PROTECTED]
FreeBSD keltia.freenix.fr 4.0-CURRENT #72: Mon Jul 12 08:26:43 CEST 1999






Re: LOCK overheads (was Re: objtrm problem probably found)

1999-07-13 Thread Peter Jeremy

Matthew Dillon [EMAIL PROTECTED] wrote:
:mode 1   17.99 ns/loop nproc=1 lcks=no
:mode 3  166.33 ns/loop nproc=1 lcks=yes
...
:This is a K6-2 350. Locks are pretty expensive on them.
Wow, now that *is* expensive!  The K6 must be implementing it in
microcode for it to be that bad.

I wouldn't be surprised if lock prefixes did result in microcode
execution.  As I stated yesterday, I don't believe locked instructions
are executed frequently enough to warrant special handling, and are
therefore likely to be implemented in whichever way needs the least
chip area.

Since you need to be able to track and mark the memory references
associated with the instruction, the cheapest implementation (in terms
of dedicated chip area) is likely to be something like: wait until all
currently executing instructions are complete, wait until all pending
memory writes are complete (at least to L1 cache), assert the lock pin
and execute the RMW instruction without allowing any other instructions to
commence, deassert lock pin.  This is (of course) probably the worst
case as far as execution time as seen by that CPU - though it's not
far off optimal as seen by other CPUs.

(`Assert lock pin' should also be mapped into a `begin locked memory
reference' using whatever cache coherency protocol is being used).

I'm not saying that you can't implement a locked RMW sequence a lot
better, but until the chip architects decide that the performance is
an issue, they aren't likely to spend any silicon on it.  The big
IA-32 market is UP systems running games - and locked RMW instructions
don't affect this market.  Intel see the high-end of the market (where
SMP and hence locked RMW is more of an issue) moving to Merced.  This
suggests that it's unlikely that the IA-32 will ever acquire a decent
lock capability (though at least the PIII is no worse than the PII).

That said, the above timings make a lock prefix cost over 50 core
clocks (or 15 bus clocks) - even microcode couldn't be that bad.  My
other timings (core/bus cycles) were: 486DX2: 20/10, Pentium: 28/7,
P-II: 34/8.5, P-III 34/7.5.

I suspect that these timings are a combination of inefficient on-chip
implementation of the lock prefix (see above for my reasoning behind
this), together with poor off-chip handling of locked cycles.

Peter





Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-12 Thread Peter Wemm

Doug Rabson wrote:
 On Mon, 12 Jul 1999, Peter Jeremy wrote:
 
  Mike Haertel [EMAIL PROTECTED] wrote:
  Um.  FYI on x86, even if the compiler generates the RMW
  form "addl $1, foo", it's not atomic.  If you want it to
  be atomic you have to precede the opcode with a LOCK
  prefix 0xF0.
  
  I'd noticed that point as well.  The top of sys/i386/include/atomic.h
_does_ make it clear that they aren't SMP safe:
  
  /*
   * Various simple arithmetic on memory which is atomic in the presence
   * of interrupts.
   *
   * Note: these versions are not SMP safe.
   */
  
  That said, it should be fairly simple to change Matt's new in-line
  assembler versions to insert LOCK prefixes when building an SMP
  kernel.  (Although I don't know that this is necessary yet, given
  the `Big Giant Lock').
 
 We don't need the lock prefix for the current SMP implementation. A lock
 prefix would be needed in a multithreaded implementation but should not be
 added unless the kernel is an SMP kernel otherwise UP performance would
 suffer.

I was under the impression that a locked instruction was essentially free
at runtime, with the sole exception of being one byte larger.

If there is any chance of this stuff getting exported to modules via
inlines, it should support both as we don't need any more differences between
SMP and non-SMP modules than we've already got. :-(

Also, with the posted example, atomic_cmpex() is #ifndef I386_CPU - that
means it won't be available on any of the default kernels, either at
install time or for 99% of post-install systems built from a release.

Cheers,
-Peter
--
Peter Wemm - [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]






Re: objtrm problem probably found

1999-07-12 Thread Peter Jeremy

Doug Rabson [EMAIL PROTECTED] wrote:
We don't need the lock prefix for the current SMP implementation. A lock
prefix would be needed in a multithreaded implementation but should not be
added unless the kernel is an SMP kernel otherwise UP performance would
suffer.

Modulo the issue of UP vs SMP modules, the code would seem to be simply:

#ifdef SMP
#define ATOMIC_ASM(type,op) \
__asm __volatile ("lock; " op : "=m" (*(type *)p) : "ir" (v), "0" (*(type *)p))
#else
#define ATOMIC_ASM(type,op) \
__asm __volatile (op : "=m" (*(type *)p) : "ir" (v), "0" (*(type *)p))
#endif

Or (maybe more clearly):

#ifdef SMP
#define SMP_LOCK"lock; "
#else
#define SMP_LOCK
#endif

#define ATOMIC_ASM(type,op) \
__asm __volatile (SMP_LOCK op : "=m" (*(type *)p) : "ir" (v), "0" (*(type *)p))
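
To make the effect concrete, here is a sketch (mine, not the committed
header) of how SMP_LOCK might be folded into a named operation in the
style of i386/include/atomic.h; the function name and signature are
illustrative only:

/*
 * Sketch: SMP_LOCK folded into a concrete atomic op.  Compiled
 * without SMP the lock prefix vanishes entirely, so a UP kernel
 * pays nothing for it.
 */
#ifdef SMP
#define SMP_LOCK        "lock; "
#else
#define SMP_LOCK
#endif

static __inline void
atomic_add_int(volatile unsigned int *p, unsigned int v)
{
        __asm __volatile (SMP_LOCK "addl %1,%0"
                          : "=m" (*p)
                          : "ir" (v), "0" (*p));
}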


John-Mark Gurney [EMAIL PROTECTED] wrote:
actually, I'm not so sure, it guarantees that NO other bus operation
will succeed while this is happening... what happens if a pci bus
mastering card makes a modification to this value?

This is a valid point, but I don't think it's directly related to this
thread.  I think it's up to the various PCI device driver writers to
ensure that objects shared between a PCI device and driver are
correctly locked.  The mechanism to do this is likely to be device-
specific: Lock prefixes only protect objects no larger than 32 or 64
bits (depending on the processor), cards may require locked accesses
to much larger structures.

I believe the API to the PCI-locking routines should be distinct from
the API for SMP locks - even though the underlying implementation may
be common.


Oliver Fromme [EMAIL PROTECTED] wrote:
In my list of i386 clock cycles, the lock prefix is listed with
0 (zero) cycles.
My i486 book states 1 cycle, although that cycle can be overlapped with
several other combinations that add a cycle to the basic instruction
execution time.  I don't know about the Pentium and beyond timings.  In
any case, we have real-world timings, which are more useful.


Mike Haertel [EMAIL PROTECTED] wrote:
Although function calls are more expensive than inline code,
they aren't necessarily a lot more so, and function calls to
non-locked RMW operations are certainly much cheaper than
inline locked RMW operations.
Based on my timings below, this is correct, though counter-intuitive.
Given the substantial cost of indirect function calls, I don't
think this would be acceptable, though.  I think compiling modules
separately for UP/SMP is a better choice.


In Message-id: 19990 [EMAIL PROTECTED],
Matthew Dillon [EMAIL PROTECTED] provided some hard figures
for a dual PIII-450.  Expanding those figures for a range of machines
(all are UP except the PIII-450, which are Matt's SMP figures), and
adding the cost of using indirect function calls (patches to Matt's
code at the end):

           i386SX-25    P-133  PII-266  PIII-450  nproc  locks
mode  0      1950.23    39.65    26.31      9.21         EMPTY
mode  1      3340.59    71.74    24.45     16.48    1    no    tight
mode  2      3237.57    71.18    25.27     23.65    2    no    tight
mode  3      3367.65   282.31   153.29     93.02    1    yes   tight
mode  4      3263.64   285.58   152.94    160.82    2    yes   tight
mode  5      9439.15   446.16    60.40     37.64    1    no    spread
mode  6     10231.96   467.39    60.16     89.28    2    no    spread
mode  7     10660.05   725.80   153.18     88.32    1    yes   spread
mode  8      9990.18   755.87   155.18    161.08    2    yes   spread

mode  9      5544.82   131.31    49.96        ?          EMPTY
mode 10      7234.97   174.20    64.81        ?     1    no    tight
mode 11      7212.14   178.72    64.87        ?     2    no    tight
mode 12      7355.46   304.74   182.75        ?     1    yes   tight
mode 13      6956.54   327.11   180.21        ?     2    yes   tight
mode 14     13603.72   582.02   100.10        ?     1    no    spread
mode 15     13443.54   543.97   101.13        ?     2    no    spread
mode 16     13731.94   717.31   207.12        ?     1    yes   spread
mode 17     13379.62   800.31   207.70        ?     2    yes   spread

Modes 9 through 17 are equivalent to modes 0-8 except that the
operation is performed via a call thru a pointer-to-function.
(Mode 9 is a pointer to a nop).

Apart from the noticeable improvement in overall speed from left to
right, this shows that the lock prefix is _very_ expensive on
Pentium and above, even in a UP configuration.  It makes no difference
on a 386.  (I can produce the 486 figures tonight after I get home).
It also suggests that moving to an indirect function call (to allow
run-time UP/SMP selection) will be quite painful.

As you can see, the lock prefix creates a stall condition on the locked
memory, but does *NOT* stall other memory.
This is at least CPU dependent (and may also depend on the motherboard
chipset).  The i486 documentation states that `all memory is locked'.

Therefore I believe the impact will be unnoticeable.  On a dual 
450MHz P-III we are talking 37 ns vs 88 ns - 

Re: objtrm problem probably found

1999-07-12 Thread Matthew Dillon

:Or (maybe more clearly):
:
:#ifdef SMP
:#define SMP_LOCK       "lock; "
:#else
:#define SMP_LOCK
:#endif
:
:#define ATOMIC_ASM(type,op)\
:__asm __volatile (SMP_LOCK op : "=m" (*(type *)p) : "ir" (v), "0" (*(type *)p))

 Yes, precisely.

:I believe the API to the PCI-locking routines should be distinct from
:the API for SMP locks - even though the underlying implementation may
:be common.

This is more a driver issue than anything else.

:Based on the impact above, I believe the lock prefixes should not
:be inserted until they are necessary - even if it does mean we wind
:up with /modules and /modules.smp.  I don't believe that moving
:to indirect function pointers is a reasonable work-around.
:
:Peter

We can certainly remove it for the UP case, but it needs to stay in
for the SMP case because there is a 100% chance that these routines
are going to have to be SMP safe when the big giant lock hits up against
the VM system.  It will become even more important after interrupts are
moved into their own threads if we want to be able to run interrupts
in parallel with other supervisor code (e.g. network interrupt and network
stack).

-Matt
Matthew Dillon 
[EMAIL PROTECTED]





Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-12 Thread Mike Smith

 Although function calls are more expensive than inline code,
 they aren't necessarily a lot more so, and function calls to
 non-locked RMW operations are certainly much cheaper than
 inline locked RMW operations.

This is a fairly key statement in context, and an opinion here would 
count for a lot; are function calls likely to become more or less 
expensive in time?

-- 
\\  The mind's the standard   \\  Mike Smith
\\  of the man.   \\  [EMAIL PROTECTED]
\\-- Joseph Merrick   \\  [EMAIL PROTECTED]







Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-12 Thread Matthew Dillon


:
: Although function calls are more expensive than inline code,
: they aren't necessarily a lot more so, and function calls to
: non-locked RMW operations are certainly much cheaper than
: inline locked RMW operations.
:
:This is a fairly key statement in context, and an opinion here would 
:count for a lot; are function calls likely to become more or less 
:expensive in time?

In terms of cycles, either less or the same.  Certainly not more.
If you think about it, a function call is nothing more than a save, a jump,
and a retrieval and jump later on.  On Intel the save is a push, on other
cpu's the save may be to a register (which is pushed later if necessary).

The change in code flow used to be the expensive piece, but not any
more.  You typically either see a branch prediction cache (Intel)
offering a best-case of 0-cycle latency, or a single-cycle latency 
that is slot-fillable (MIPS).

Since the jump portion of a subroutine call to a direct label is nothing
more than a deterministic branch, the branch prediction cache actually
operates in this case.  You do not quite get 0-cycle latency due to
the push/pop, and potential arguments, but it is very fast.
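
To illustrate (my sketch, not Matt's code), roughly what a direct leaf
call costs on i386; the generated code varies by compiler and flags:

int add1(int x) { return x + 1; }

int
caller(int y)
{
        /*
         * The call boils down to approximately:
         *
         *     pushl %eax      # save/pass the argument (the "save")
         *     call  add1      # push return address and jump
         *     ...             # add1 runs, then:
         *     ret             # pop return address and jump back
         *
         * The target is a fixed label, so the branch predictor
         * covers the jump; the residual cost is mostly the
         * push/pop traffic.
         */
        return add1(y);
}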

-Matt
Matthew Dillon 
[EMAIL PROTECTED]





Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-12 Thread Mike Smith

 
 :
 : Although function calls are more expensive than inline code,
 : they aren't necessarily a lot more so, and function calls to
 : non-locked RMW operations are certainly much cheaper than
 : inline locked RMW operations.
 :
 :This is a fairly key statement in context, and an opinion here would 
 :count for a lot; are function calls likely to become more or less 
 :expensive in time?
...
 The change in code flow used to be the expensive piece, but not any
 more.  You typically either see a branch prediction cache (Intel)
 offering a best-case of 0-cycle latency, or a single-cycle latency 
 that is slot-fillable (MIPS).

I assumed too much in asking the question; I was specifically 
interested in indirect function calls, since this has a direct impact 
on method-style implementations.

-- 
\\  The mind's the standard   \\  Mike Smith
\\  of the man.   \\  [EMAIL PROTECTED]
\\-- Joseph Merrick   \\  [EMAIL PROTECTED]







Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-12 Thread Matthew Dillon


:I assumed too much in asking the question; I was specifically 
:interested in indirect function calls, since this has a direct impact 
:on method-style implementations.

Branch prediction caches are typically PC-sensitive.  An indirect method
call will never be as fast as a direct call, but if the indirect address
is the same the branch prediction cache will work.

If the indirect address changes at the PC where the call is being made,
the branch cache may create a penalty.

Try this core in one of the cases in that test program, and add two nop
subroutines: void nop1(void) { } and void nop2(void) { }.

Compile this code without any optimizations!  *no* optimizations or
the test will not demonstrate the problem :-)

In this case the branch prediction succeeds because the indirect 
address does not change at the PC where func() is called.  I get 34 ns
per loop.

{
    void (*func)(void) = nop1;
    for (i = 0; i < LOOPS; ++i) {
        func();
        if (i & 1)
            func = nop1;
        else
            func = nop1;
    }
}

In this case the branch prediction fails because the indirect address
is different at the PC each time func() is called.  I get 61ns.

{
    void (*func)(void) = nop1;
    for (i = 0; i < LOOPS; ++i) {
        func();
        if (i & 1)
            func = nop1;
        else
            func = nop2;
    }
}

In this case we simulate a mix: (i & 1) becomes (i & 7).  I get 47 ns.

{
    void (*func)(void) = nop1;
    for (i = 0; i < LOOPS; ++i) {
        func();
        if (i & 7)
            func = nop1;
        else
            func = nop2;
    }
}
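
For reference, here is a minimal standalone harness for the three cases
above (my reconstruction, not Matt's actual test program).  Build it
with no optimization, e.g. "cc -O0 harness.c", or the compiler may
defeat the indirect call:

#include <stdio.h>
#include <sys/time.h>

#define LOOPS   10000000

void nop1(void) { }
void nop2(void) { }

int
main(void)
{
        struct timeval t0, t1;
        void (*func)(void) = nop1;
        int i;

        gettimeofday(&t0, NULL);
        for (i = 0; i < LOOPS; ++i) {
                func();
                if (i & 1)              /* change to (i & 7), or assign */
                        func = nop1;    /* nop1 in both branches, for   */
                else                    /* the other two cases          */
                        func = nop2;
        }
        gettimeofday(&t1, NULL);

        printf("%.2f ns/loop\n",
            ((t1.tv_sec - t0.tv_sec) * 1e9 +
             (t1.tv_usec - t0.tv_usec) * 1e3) / LOOPS);
        return (0);
}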

Ok, so what does this mean for method calls?  If the method call is
INLINED, then the branch prediction cache will tend to work because the
method call will tend to call the same address at any given PC.  If
the method call is doubly-indirect, where a routine is called which 
calculates the method address and then calls it, the branch prediction
cache will tend to fail because a different address will tend to be
called at the PC of the call.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]






Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-12 Thread Peter Jeremy

Mike Smith [EMAIL PROTECTED] wrote:
 Although function calls are more expensive than inline code,
 they aren't necessarily a lot more so, and function calls to
 non-locked RMW operations are certainly much cheaper than
 inline locked RMW operations.

This is a fairly key statement in context, and an opinion here would 
count for a lot; are function calls likely to become more or less 
expensive in time?

Based on general computer architecture principles, I'd say that a lock
prefix is likely to become more expensive[1], whilst a function call
will become cheaper[2] over time.

I'm not sure that this is an important issue here.  The sole advantage
of moving to indirect function calls would be that the same object
code could be used on both UP and SMP configurations, without
incurring the overhead of the lock prefix in the UP configuration.
(At the expense of an additional function call in all configurations).
We can't avoid the lock prefix overhead in the SMP case.

Based on the timings I did this morning, function calls are
(unacceptably, IMHO) expensive on all the CPU's I have to hand (i386,
Pentium and P-II) - the latter two presumably comprising the bulk of
current FreeBSD use.

Currently the UP/SMP decision is made at compile time (and has
significant and widespread impact) - therefore there seems little (if
any) benefit in using function calls within the main kernel.

I believe that Matt's patched i386/include/atomic.h, with the addition
of code to only include the lock prefix when SMP is defined, is
currently the optimal approach for the kernel - and I can't see any
way a future IA-32 implementation could change that.

The only benefit could be for kernel modules - a module could possibly
be compiled so the same LKM would run on either UP or SMP.  Note that
function calls for atomic operations may not be sufficient (by
themselves) to achieve this: One of the SMP gurus may be able to
confirm whether anything else prevents an SMP-compiled LKM running
with a UP kernel.

If the lock prefix overhead becomes an issue for LKMs, then we could
define a variant of i386/include/atomic.h (eg by using a #define which
is only true for compiling LKMs) which does use indirect function
calls (and add the appropriate initialisation code).  This is a
trivial exercise (which I'll demonstrate on request).
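
For illustration, a minimal sketch of such a variant (mine; the
_KLD_MODULE guard, the atomic_init() hook and the smp_active flag as
used here are assumed names, not existing interfaces):

/* Locked and unlocked flavours, in the style of atomic.h. */
static void
atomic_add_int_unlocked(volatile unsigned int *p, unsigned int v)
{
        __asm __volatile ("addl %1,%0" : "=m" (*p) : "ir" (v), "0" (*p));
}

static void
atomic_add_int_locked(volatile unsigned int *p, unsigned int v)
{
        __asm __volatile ("lock; addl %1,%0" : "=m" (*p) : "ir" (v), "0" (*p));
}

/* A module compiled with -D_KLD_MODULE calls through a pointer ... */
void (*atomic_add_int_p)(volatile unsigned int *, unsigned int);
#define atomic_add_int(p, v)    (*atomic_add_int_p)((p), (v))

/* ... which the kernel fills in once, early in boot. */
void
atomic_init(int smp_active)
{
        atomic_add_int_p = smp_active ? atomic_add_int_locked
                                      : atomic_add_int_unlocked;
}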

[1] A locked instruction implies a synchronous RMW cycle.  In order
to meet write-ordering guarantees (without which, a locked RMW
cycle would be useless as a semaphore primitive), it implies a
complete write serialization, and probably some level of
instruction serialisation.  Since write-back pipelines will get
longer and parallel execution units more numerous, the cost of
a serialisation operation will get relatively higher.  Also,
lock instructions are relatively infrequent, therefore there is
little incentive to expend valuable silicon on trying to make
them more efficient (at least as seen by the executing CPU).

[2] Function calls _are_ fairly common, therefore it probably is
worthwhile expending some effort in optimising them - and the
stack updates associated with a leaf subroutine are fairly
easy to totally hide in an on-chip write pipeline/cache.
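
As a footnote to [1]: the ordering guarantee matters because a locked
RMW is the building block for semaphores.  A minimal sketch (mine, not
kernel code) of a spinlock built on one:

static __inline int
try_lock(volatile int *lp)
{
        int old = 1;

        /* xchgl carries an implicit LOCK prefix on x86, making the
         * read-modify-write atomic across CPUs. */
        __asm __volatile ("xchgl %0,%1" : "+r" (old), "+m" (*lp));
        return (old == 0);              /* got it iff it was free */
}

static __inline void
spin_lock(volatile int *lp)
{
        while (!try_lock(lp))
                ;                       /* spin until released */
}

static __inline void
spin_unlock(volatile int *lp)
{
        *lp = 0;        /* x86's ordered stores make a plain store
                         * sufficient for release */
}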

Peter





Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-12 Thread Matthew Dillon


:
:I'm not sure there's any reason why you shouldn't.  If you changed the
:semantics of a stack segment so that memory addresses below the stack
:pointer were irrelevant, you could implement a small, 0-cycle, on-chip
:stack (that overflowed into memory).  I don't know whether this
:semantic change would be allowable (and whether the associated silicon
:could be justified) for the IA-32.
:
:Peter

This would be relatively complex and also results in cache coherency
problems.  A solution already exists:  It's called branch-and-link,
but Intel cpu's do not use it because Intel cpu's do not have enough
registers (makes you just want to throw up -- all that MMX junk and they
couldn't add a branch and link register!).  The key with branch-and-link
is that the lowest subroutine level does not have to save/restore the 
register, making entry and return two or three times faster than 
subroutine calls that make other subroutine calls.

The big problem with implementing complex caches is that it takes up
a serious amount of die space and power.  Most modern cpu's revolve 
almost entirely around their L1 cache and their register file.  The
remaining caches tend to be ad-hoc.  Intel's branch prediction cache
is like this.  

In order for a memory-prediction cache to be useful, it really needs
to be cache-coherent, which basically kills the idea of having a separate
little special case for the stack.  Only the L1 cache is coherent.  If
you wanted you could implement multiple L1 data caches on-chip - that 
might be of some benefit, but otherwise branch-and-link is the better
way to do it.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]





Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-12 Thread Matthew Dillon

:
:Based on general computer architecture principles, I'd say that a lock
:prefix is likely to become more expensive[1], whilst a function call
:will become cheaper[2] over time.
:...
:
:[1] A locked instruction implies a synchronous RMW cycle.  In order
:to meet write-ordering guarantees (without which, a locked RMW
:cycle would be useless as a semaphore primitive), it implies a
:complete write serialization, and probably some level of
:instruction serialisation.  Since write-back pipelines will get

A locked instruction only implies cache coherency across the 
instruction.  It does not imply serialization.  Intel blows it
big time, but that's intel for you.

:longer and parallel execution units more numerous, the cost of
:a serialisation operation will get relatively higher.  Also,
:Peter

It is not a large number of execution units that implies a higher
cost of serialization but rather data interdependencies.  A 
locked instruction does not have to imply serialization.

Modern cache coherency protocols do not have a problem with 
a large number of caches in a parallel processing subsystem.

This is how a typical two-level cache coherency protocol works with an
L1 and L2 cache:

* The L2 cache is usually a superset of the L1 cache.  That is,
  all cache lines in the L1 cache also exist in the L2 cache.

* Both the L1 and L2 caches implement a shared and an exclusive
  bit, usually on a per-cache-line basis.

* When a processor, A, issues a memory op that is not in its cache,
  it can request the cache line from main memory either shared 
  (unlocked) or exclusive (locked).

* All other processors snoop the main memory access.

* If another processor's L2 cache contains the cache line being
  requested by processor A, it can provide the cache line to
  processor A and no main memory op actually occurs.

  In this case, the shared and exclusive bits in processor B's L1 
  and L2 caches are updated appropriately.

  - if A is requesting shared (unlocked) access, both A and B will
set the shared bit and B will clear the exclusive bit.  (B will
cause A to stall if it is in the middle of operating on the
locked cache line).

  - if A is requesting exclusive (locked) access, B will invalidate
its cache line and clear both the shared and exclusive bits,
and A will set its exclusive bit.  A now owns that cache line.

  - If no other processor owns the cache line, A obtains the data
from main memory and other processors have the option to snoop
the access in their L2 caches.

So, using the above rules as an example, a locked instruction can cost
as little as 0 extra cycles no matter how many cpu's you have running
in parallel.  There is no need to serialize or synchronize anything.

The worst case is actually not even as bad as a complete cache-miss
case.  The cost of snooping another cpu's L2 cache is much less than
the cost of going to main memory.
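
To make the rules above concrete, here is a toy model (mine, grossly
simplified: one cache line, two CPUs, no L1/L2 split) of the
shared/exclusive bookkeeping:

#include <stdio.h>

struct cacheline {
        int valid, shared, exclusive;
        int data;
};

static int main_memory = 42;
static struct cacheline cache[2];       /* one line per cpu */

static int
request(int cpu, int exclusive)
{
        struct cacheline *mine = &cache[cpu];
        struct cacheline *other = &cache[cpu ^ 1];

        if (other->valid) {             /* snoop hit: no memory op */
                mine->data = other->data;
                if (exclusive) {        /* locked: other invalidates */
                        other->valid = other->shared =
                            other->exclusive = 0;
                        mine->shared = 0;
                } else {                /* unlocked: both go shared */
                        mine->shared = other->shared = 1;
                        other->exclusive = 0;
                }
        } else {                        /* miss: go to main memory */
                mine->data = main_memory;
                mine->shared = !exclusive;
        }
        mine->valid = 1;
        mine->exclusive = exclusive;
        return (mine->data);
}

int
main(void)
{
        request(0, 0);                  /* cpu 0 reads, line shared */
        request(1, 1);                  /* cpu 1 locks: steals the line */
        printf("cpu0 valid=%d cpu1 exclusive=%d\n",
            cache[0].valid, cache[1].exclusive);
        return (0);
}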

-Matt
Matthew Dillon 
[EMAIL PROTECTED]





Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-12 Thread Andrew Reilly

On Mon, Jul 12, 1999 at 07:09:58PM -0700, Mike Smith wrote:
  Although function calls are more expensive than inline code,
  they aren't necessarily a lot more so, and function calls to
  non-locked RMW operations are certainly much cheaper than
  inline locked RMW operations.
 
 This is a fairly key statement in context, and an opinion here would 
 count for a lot; are function calls likely to become more or less 
 expensive in time?

Others have answered this question, but I thought I'd point out
that there is another alternative that is likely to be faster
than indirect function calls on some architectures: inline
branched code.  So you still have a global variable selecting
locked/non-locked, but it's a boolean, rather than a pointer.
Your atomic macros are then { if (atomic_lock) asm("lock;foo");
else asm ("foo"); }
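
A minimal sketch of that idea in the style of atomic.h (mine; the
atomic_lock global set once at boot is an assumption, not an existing
interface):

extern int atomic_lock;         /* nonzero iff running SMP */

static __inline void
atomic_add_int(volatile unsigned int *p, unsigned int v)
{
        if (atomic_lock)
                __asm __volatile ("lock; addl %1,%0"
                                  : "=m" (*p) : "ir" (v), "0" (*p));
        else
                __asm __volatile ("addl %1,%0"
                                  : "=m" (*p) : "ir" (v), "0" (*p));
}

Since atomic_lock never changes after boot, the branch predicts
essentially perfectly, so the UP cost stays close to the plain inline
while one module binary serves both configurations.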

You might be interested in the paper:

"Efficient Dynamic Dispatch without Virtual Function Tables. The
SmallEiffel Compiler." Olivier ZENDRA, Dominique COLNET, Suzanne
COLLIN. 12th Annual ACM SIGPLAN Conference on Object-Oriented
Programming, Systems, Languages, and Applications (OOPSLA'97),
Volume 32, Issue 10 - Atlanta, GA, USA, October 1997, pages 125-141.

http://SmallEiffel.loria.fr/papers/papers.html#OOPSLA97

as a justification for that approach.

-- 
Andrew





Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-12 Thread Matthew Dillon


:...

I would also like to add a few more notes in regards to write pipelines.

Write pipelines are not used any more, at least not long ones.  The
reason is simply the cache coherency issue again.  Until the data is
actually written into the L1 cache, it is acoherent.

Acoherency is no longer allowed :-).  That kind of thing creates a problem
for the old write pipeline.

Thus, modern cpu's have begun removing their write pipelines in favor
of a synchronous write to the L1 cache.  This typically occurs in the 
"write-back" phase of the instruction pipeline.  It does not cause
a processor to stall.

There are several advantages to this:

* First, since the write is going directly into the L1 cache,
  it can partake of the cache coherency protocols.

* Second, since the L1 cache is large compared to the older
  write-pipeline schema, this effectively gives us a
(for all intents and purposes other than a memory copy)
  infinitely-sized write pipeline.

* Third, we can throw away all the write pipeline address snooping 
  junk on the chip, making the chip smaller.

The cache coherency protocol may cause side effects when a memory
write is issued.  Typically the write is issued into the L1 cache
and requires what is known as a write-through operation into the L2
cache in order to guarantee cache coherency.  This is necessary because
it is the L2 cache which is performing the bulk of the bus protocol
to implement the cache coherency, and it needs to know when you've
modified something.

"write through" *used* to mean writing through to main memory, but with
the advent of modern cache coherency protocols, the L2 cache *IS* main
memory for all intents and purposes.  Dirty data contained in the L2
cache inherently invalidates the associated memory address in main memory
simply by being present in the L2 cache.  Another processor attempting
to access that location in main memory will instead obtain the data
directly from our processor's L2 cache.

One of the interesting benefits from this is that dirty data in the L2
cache can be flushed to main memory completely asynchronously without
screwing up cache coherency, serialization, or anything else.

Final note: memory mapped I/O spaces do not apply here.  These are usually
marked uncacheable and have nothing to do with the cache.  But DMA is
different.  DMA works just like any other memory access in that it 
can be made to run through the L2 cache coherency protocols.  So you 
can often safely initiate DMA without having to flush your caches.  This
is not true of all processors.  I'm not sure about the Pentiums.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]





Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-12 Thread Mike Haertel

Second answer: in the real world, we're nearly always hitting the
cache on stack operations associated with calls and argument passing,
but not less often on operations in the procedure body.  So, in
 ^^^ typo

Urk.  I meant to say "less often", delete the "not".





Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-12 Thread Mike Haertel

This is a fairly key statement in context, and an opinion here would 
count for a lot; are function calls likely to become more or less 
expensive in time?

Ambiguous question.

First answer: Assume we're hitting the cache, taking no branch
mispredicts, and everything is generally going at "the speed of
light".  The cost of a call, relative to the cost of "basic
operations" like loads and adds, will stay about the same in Intel's
next-generation x86 processors as it is now.  We made some tradeoffs,
some things go a little faster and others a little slower, but in
the end the function call overhead averages out to about the same.

Second answer: in the real world, we're nearly always hitting the
cache on stack operations associated with calls and argument passing,
but not less often on operations in the procedure body.  So, in
the real world, as memory latencies get worse and worse relative
to processor frequency, the overhead for procedure calling represents
an increasingly smaller percentage of the total execution time.

Summary: So, correctly predicted calls are staying the same or
getting cheaper.  Certainly calls are not getting worse, unless
perhaps you have lots of mispredicted calls.  As pipelines get
deeper, mispredicts get worse.  But that's not a problem just for
calls, it's branches in general.

