Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-13 Thread Alan Cox

Before this thread on "cache coherence" and "memory consistency" goes
any further, I'd like to suggest a time-out to read something like
http://www-ece.rice.edu/~sarita/Publications/models_tutorial.ps.
A lot of what I'm reading has a grain of truth but isn't quite
right.  This paper appeared as a tutorial in IEEE Computer a couple
years ago, and is fairly accessible to a general audience.

Alan





Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-13 Thread Mike Smith

> On Mon, Jul 12, 1999 at 10:38:03PM -0700, Mike Smith wrote:
> > I said:
> > > than indirect function calls on some architectures: inline
> > > branched code.  So you still have a global variable selecting
> > > locked/non-locked, but it's a boolean, rather than a pointer.
> > > Your atomic macros are then { if (atomic_lock) asm("lock;foo");
> > > else asm ("foo"); }
> > 
> > This requires you to have all the methods present at compile time, 
> > which defeats the entire purpose of dynamic method loading.
> 
> Pardon?  I didn't see a discussion of dynamic loading anywhere
> here.  We were referring to tiny inlined assembly language routines.
> The existing implementation is #defines in a C header file.

Uh, no, I was asking about the overhead involved in indirect function 
calls specifically because there are instances where fast indirect 
methods would allow us to greatly improve the generality of parts of 
the FreeBSD kernel and modules.  This includes code where we are 
currently using compile-time defines that require us to trade 
performance for generality.

-- 
\\  The mind's the standard   \\  Mike Smith
\\  of the man.   \\  [EMAIL PROTECTED]
\\-- Joseph Merrick   \\  [EMAIL PROTECTED]







Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-12 Thread Peter Wemm

Doug Rabson wrote:
> On Mon, 12 Jul 1999, Peter Jeremy wrote:
> 
> > Mike Haertel [EMAIL PROTECTED] wrote:
> > > Um.  FYI on x86, even if the compiler generates the RMW
> > > form "addl $1, foo", it's not atomic.  If you want it to
> > > be atomic you have to precede the opcode with a LOCK
> > > prefix 0xF0.
> > 
> > I'd noticed that point as well.  The top of sys/i386/include/atomic.h
> > _does_ make it clear that they aren't SMP safe:
> > 
> > /*
> >  * Various simple arithmetic on memory which is atomic in the presence
> >  * of interrupts.
> >  *
> >  * Note: these versions are not SMP safe.
> >  */
> > 
> > That said, it should be fairly simple to change Matt's new in-line
> > assembler versions to insert LOCK prefixes when building an SMP
> > kernel.  (Although I don't know that this is necessary yet, given
> > the `Big Giant Lock').
> 
> We don't need the lock prefix for the current SMP implementation.  A lock
> prefix would be needed in a multithreaded implementation, but should not be
> added unless the kernel is an SMP kernel, otherwise UP performance would
> suffer.

I was under the impression that a locked instruction was essentially free
at runtime, with the sole exception of being one byte larger.

If there is any chance of this stuff getting exported to modules via
inlines, it should support both, as we don't need any more differences between
SMP and non-SMP modules than we've already got. :-(

Also, with the posted example, atomic_cmpex() is #ifndef I386_CPU - that
means it won't be available on any of the default kernels, either at
install time or for 99% of post-install systems built from a release.

Cheers,
-Peter
--
Peter Wemm - [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]






Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-12 Thread Mike Smith

> Although function calls are more expensive than inline code,
> they aren't necessarily a lot more so, and function calls to
> non-locked RMW operations are certainly much cheaper than
> inline locked RMW operations.

This is a fairly key statement in context, and an opinion here would 
count for a lot; are function calls likely to become more or less 
expensive in time?

-- 
\\  The mind's the standard   \\  Mike Smith
\\  of the man.   \\  [EMAIL PROTECTED]
\\-- Joseph Merrick   \\  [EMAIL PROTECTED]







Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-12 Thread Matthew Dillon


:
: Although function calls are more expensive than inline code,
: they aren't necessarily a lot more so, and function calls to
: non-locked RMW operations are certainly much cheaper than
: inline locked RMW operations.
:
:This is a fairly key statement in context, and an opinion here would 
:count for a lot; are function calls likely to become more or less 
:expensive in time?

In terms of cycles, either less or the same.  Certainly not more.
If you think about it, a function call is nothing more than a save, a jump,
and a retrieval and jump later on.  On Intel the save is a push; on other
cpu's the save may be to a register (which is pushed later if necessary).

The change in code flow used to be the expensive piece, but not any
more.  You typically either see a branch prediction cache (Intel)
offering a best-case of 0-cycle latency, or a single-cycle latency 
that is slot-fillable (MIPS).

Since the jump portion of a subroutine call to a direct label is nothing
more than a deterministic branch, the branch prediction cache actually
operates in this case.  You do not quite get 0-cycle latency due to
the push/pop, and potential arguments, but it is very fast.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]





Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-12 Thread Mike Smith

 
> :
> : Although function calls are more expensive than inline code,
> : they aren't necessarily a lot more so, and function calls to
> : non-locked RMW operations are certainly much cheaper than
> : inline locked RMW operations.
> :
> :This is a fairly key statement in context, and an opinion here would 
> :count for a lot; are function calls likely to become more or less 
> :expensive in time?
...
> The change in code flow used to be the expensive piece, but not any
> more.  You typically either see a branch prediction cache (Intel)
> offering a best-case of 0-cycle latency, or a single-cycle latency 
> that is slot-fillable (MIPS).

I assumed too much in asking the question; I was specifically 
interested in indirect function calls, since this has a direct impact 
on method-style implementations.

-- 
\\  The mind's the standard   \\  Mike Smith
\\  of the man.   \\  [EMAIL PROTECTED]
\\-- Joseph Merrick   \\  [EMAIL PROTECTED]







Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-12 Thread Matthew Dillon


:I assumed too much in asking the question; I was specifically 
:interested in indirect function calls, since this has a direct impact 
:on method-style implementations.

Branch prediction caches are typically PC-sensitive.  An indirect method
call will never be as fast as a direct call, but if the indirect address
is the same the branch prediction cache will work.

If the indirect address changes at the PC where the call is being made,
the branch cache may create a penalty.

Try this code in one of the cases of that test program, and add two nop
subroutines: void nop1(void) { } and void nop2(void) { }.

Compile this code without any optimizations!  *no* optimizations, or
the test will not demonstrate the problem :-)

In this case the branch prediction succeeds because the indirect 
address does not change at the PC where func() is called.  I get 34 ns
per loop.

{
void (*func)(void) = nop1;
for (i = 0; i < LOOPS; ++i) {
func();
if (i & 1)
func = nop1;
else
func = nop1;
}
}

In this case the branch prediction fails because the indirect address
is different at the PC each time func() is called.  I get 61ns.

{
void (*func)(void) = nop1;
for (i = 0; i < LOOPS; ++i) {
func();
if (i & 1)
func = nop1;
else
func = nop2;
}
}

In this case we simulate a mix: (i & 1) becomes (i & 7).  I get 47 ns.

{
void (*func)(void) = nop1;
for (i = 0; i < LOOPS; ++i) {
func();
if (i & 7)
func = nop1;
else
func = nop2;
}
}
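
(For reference, one way to flesh these fragments out into a complete,
self-contained timing harness.  This is a reconstruction: the original
test program was posted separately, and LOOPS and the use of clock()
here are assumptions, not Matt's actual code.)

#include <stdio.h>
#include <time.h>

#define LOOPS   100000000

static void nop1(void) { }
static void nop2(void) { }

int
main(void)
{
    void (*func)(void) = nop1;
    clock_t t0, t1;
    int i;

    t0 = clock();
    for (i = 0; i < LOOPS; ++i) {
        func();
        if (i & 1)              /* swap in (i & 7), or nop1/nop1, to */
            func = nop1;        /* reproduce the other two cases above */
        else
            func = nop2;
    }
    t1 = clock();
    /* Remember: compile without optimization (-O0), per the above. */
    printf("%.1f ns per loop\n",
        (double)(t1 - t0) * 1e9 / CLOCKS_PER_SEC / LOOPS);
    return (0);
}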

Ok, so what does this mean for method calls?  If the method call is
INLINED, then the branch prediction cache will tend to work because the
method call will tend to call the same address at any given PC.  If
the method call is doubly-indirect, where a routine is called which 
calculates the method address and then calls it, the branch prediction
cache will tend to fail because a different address will tend to be
called at the PC of the call.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]






Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-12 Thread Peter Jeremy

Mike Smith [EMAIL PROTECTED] wrote:
> > Although function calls are more expensive than inline code,
> > they aren't necessarily a lot more so, and function calls to
> > non-locked RMW operations are certainly much cheaper than
> > inline locked RMW operations.
> 
> This is a fairly key statement in context, and an opinion here would 
> count for a lot; are function calls likely to become more or less 
> expensive in time?

Based on general computer architecture principles, I'd say that a lock
prefix is likely to become more expensive[1], whilst a function call
will become cheaper[2] over time.

I'm not sure that this is an important issue here.  The sole advantage
of moving to indirect function calls would be that the same object
code could be used on both UP and SMP configurations, without
incurring the overhead of the lock prefix in the UP configuration.
(At the expense of an additional function call in all configurations).
We can't avoid the lock prefix overhead in the SMP case.

Based on the timings I did this morning, function calls are
(unacceptably, IMHO) expensive on all the CPU's I have to hand (i386,
Pentium and P-II) - the latter two presumably comprising the bulk of
current FreeBSD use.

Currently the UP/SMP decision is made at compile time (and has
significant and widespread impact) - therefore there seems little (if
any) benefit in using function calls within the main kernel.

I believe that Matt's patched i386/include/atomic.h, with the addition
of code to only include the lock prefix when SMP is defined, is
currently the optimal approach for the kernel - and I can't see any
way a future IA-32 implementation could change that.
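
(Concretely, that approach amounts to something like the sketch below.
The MPLOCKED macro name is illustrative, not necessarily the spelling
the real atomic.h would use.)

#ifdef SMP
#define MPLOCKED        "lock; "
#else
#define MPLOCKED        ""
#endif

/* Example: the increment picks up the lock prefix only in SMP kernels. */
static __inline void
atomic_add_int(volatile unsigned int *p, unsigned int v)
{
    __asm __volatile(MPLOCKED "addl %1,%0"
        : "+m" (*p) : "ir" (v) : "cc");
}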

The only benefit could be for kernel modules - a module could possibly
be compiled so the same LKM would run on either UP or SMP.  Note that
function calls for atomic operations may not be sufficient (by
themselves) to achieve this: One of the SMP gurus may be able to
confirm whether anything else prevents an SMP-compiled LKM running
with a UP kernel.

If the lock prefix overhead becomes an issue for LKMs, then we could
define a variant of i386/include/atomic.h (eg by using a #define which
is only true for compiling LKMs) which does use indirect function
calls (and add the appropriate initialisation code).  This is a
trivial exercise (which I'll demonstrate on request).
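
(Such a variant might look roughly like the following.  All names here,
atomic_add_int_locked, atomic_init and so on, are illustrative, not
existing kernel symbols.)

/* Both implementations are compiled unconditionally. */
static void
atomic_add_int_locked(volatile unsigned int *p, unsigned int v)
{
    __asm __volatile("lock; addl %1,%0" : "+m" (*p) : "ir" (v) : "cc");
}

static void
atomic_add_int_unlocked(volatile unsigned int *p, unsigned int v)
{
    __asm __volatile("addl %1,%0" : "+m" (*p) : "ir" (v) : "cc");
}

/* The kernel aims this at the right implementation at boot time. */
void (*atomic_add_int_ptr)(volatile unsigned int *, unsigned int) =
    atomic_add_int_unlocked;

void
atomic_init(int smp_active)
{
    atomic_add_int_ptr = smp_active ?
        atomic_add_int_locked : atomic_add_int_unlocked;
}

/* LKM code then calls through the pointer: */
#define atomic_add_int(p, v)    (*atomic_add_int_ptr)((p), (v))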

[1] A locked instruction implies a synchronous RMW cycle.  In order
to meet write-ordering guarantees (without which, a locked RMW
cycle would be useless as a semaphore primitive), it implies a
complete write serialisation, and probably some level of
instruction serialisation.  Since write-back pipelines will get
longer and parallel execution units more numerous, the cost of
a serialisation operation will get relatively higher.  Also,
lock instructions are relatively infrequent, therefore there is
little incentive to expend valuable silicon on trying to make
them more efficient (at least as seen by the executing CPU).

[2] Function calls _are_ fairly common, therefore it probably is
worthwhile expending some effort in optimising them - and the
stack updates associated with a leaf subroutine are fairly
easy to totally hide in an on-chip write pipeline/cache.

Peter





Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-12 Thread Matthew Dillon


:
:I'm not sure there's any reason why you shouldn't.  If you changed the
:semantics of a stack segment so that memory addresses below the stack
:pointer were irrelevant, you could implement a small, 0-cycle, on-chip
:stack (that overflowed into memory).  I don't know whether this
:semantic change would be allowable (and whether the associated silicon
:could be justified) for the IA-32.
:
:Peter

This would be relatively complex and also results in cache coherency
problems.  A solution already exists:  It's called branch-and-link,
but Intel cpu's do not use it because Intel cpu's do not have enough
registers (makes you just want to throw up -- all that MMX junk and they
couldn't add a branch-and-link register!).  The key with branch-and-link
is that the lowest subroutine level does not have to save/restore the
register, making entry and return two or three times faster than
subroutine calls that make other subroutine calls.

The big problem with implementing complex caches is that it takes up
a serious amount of die space and power.  Most modern cpu's revolve 
almost entirely around their L1 cache and their register file.  The
remaining caches tend to be ad-hoc.  Intel's branch prediction cache
is like this.  

In order for a memory-prediction cache to be useful, it really needs
to be cache-coherent, which basically kills the idea of having a separate
little special case for the stack.  Only the L1 cache is coherent.  If
you wanted you could implement multiple L1 data caches on-chip - that 
might be of some benefit, but otherwise branch-and-link is the better
way to do it.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]





Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-12 Thread Matthew Dillon

:
:Based on general computer architecture principles, I'd say that a lock
:prefix is likely to become more expensive[1], whilst a function call
:will become cheaper[2] over time.
:...
:
:[1] A locked instruction implies a synchronous RMW cycle.  In order
:to meet write-ordering guarantees (without which, a locked RMW
:cycle would be useless as a semaphore primitive), it implies a
:complete write serialization, and probably some level of
:instruction serialisation.  Since write-back pipelines will get

A locked instruction only implies cache coherency across the 
instruction.  It does not imply serialization.  Intel blows it
big time, but that's intel for you.

:longer and parallel execution units more numerous, the cost of
:a serialisation operation will get relatively higher.  Also,
:Peter

It is not a large number of execution units that implies a higher
cost of serialization, but rather data interdependencies.  A
locked instruction does not have to imply serialization.

Modern cache coherency protocols do not have a problem with 
a large number of caches in a parallel processing subsystem.

This is how a typical two-level cache coherency protocol works with an
L1 and L2 cache:

* The L2 cache is usually a superset of the L1 cache.  That is,
  all cache lines in the L1 cache also exist in the L2 cache.

* Both the L1 and L2 caches implement a shared and an exclusive
  bit, usually on a per-cache-line basis.

* When a processor, A, issues a memory op that is not in its cache,
  it can request the cache line from main memory either shared 
  (unlocked) or exclusive (locked).

* All other processors snoop the main memory access.

* If another processor's L2 cache contains the cache line being
  requested by processor A, it can provide the cache line to
  processor A and no main memory op actually occurs.

  In this case, the shared and exclusive bits in processor B's L1 
  and L2 caches are updated appropriately.

  - if A is requesting shared (unlocked) access, both A and B will
set the shared bit and B will clear the exclusive bit.  (B will
cause A to stall if it is in the middle of operating on the
locked cache line).

  - if A is requesting exclusive (locked) access, B will invalidate
its cache line and clear both the shared and exclusive bits,
and A will set its exclusive bit.  A now owns that cache line.

  - If no other processor owns the cache line, A obtains the data
from main memory and other processors have the option to snoop
the access in their L2 caches.
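
(Condensed into code, the rules above become a toy transition function.
This is an editorial model of the description, not real hardware or
kernel code.)

#include <stdio.h>

/* Per-cache-line state, per the shared/exclusive bits above. */
enum line_state { INVALID, SHARED, EXCLUSIVE };

/*
 * Processor A requests a cache line; *b is processor B's current
 * state for that line.  Returns A's resulting state and updates B's
 * in place, following the rules above.
 */
static enum line_state
request_line(enum line_state *b, int want_exclusive)
{
    if (*b != INVALID) {
        /* B snoops the request and supplies the line itself;
         * no main memory operation actually occurs. */
        if (want_exclusive) {
            *b = INVALID;       /* B invalidates its copy */
            return (EXCLUSIVE); /* A now owns the cache line */
        }
        *b = SHARED;            /* both end up holding it shared */
        return (SHARED);
    }
    /* No other owner: A obtains the data from main memory. */
    return (want_exclusive ? EXCLUSIVE : SHARED);
}

int
main(void)
{
    enum line_state b = EXCLUSIVE;              /* B owns the line */
    enum line_state a = request_line(&b, 1);    /* A requests it locked */

    printf("A: %d (2=exclusive), B: %d (0=invalid)\n", a, b);
    return (0);
}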

So, using the above rules as an example, a locked instruction can cost
as little as 0 extra cycles no matter how many cpu's you have running
in parallel.  There is no need to serialize or synchronize anything.

The worst case is actually not even as bad as a complete cache-miss
case.  The cost of snooping another cpu's L2 cache is much less than
the cost of going to main memory.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]





Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-12 Thread Andrew Reilly

On Mon, Jul 12, 1999 at 07:09:58PM -0700, Mike Smith wrote:
> > Although function calls are more expensive than inline code,
> > they aren't necessarily a lot more so, and function calls to
> > non-locked RMW operations are certainly much cheaper than
> > inline locked RMW operations.
> 
> This is a fairly key statement in context, and an opinion here would 
> count for a lot; are function calls likely to become more or less 
> expensive in time?

Others have answered this question, but I thought I'd point out
that there is another alternative that is likely to be faster
than indirect function calls on some architectures: inline
branched code.  So you still have a global variable selecting
locked/non-locked, but it's a boolean, rather than a pointer.
Your atomic macros are then { if (atomic_lock) asm("lock;foo");
else asm ("foo"); }

You might be interested in the paper:

"Efficient Dynamic Dispatch without Virtual Function Tables. The
SmallEiffel Compiler." Olivier ZENDRA, Dominique COLNET, Suzanne
COLLIN. 12th Annual ACM SIGPLAN Conference on Object-Oriented
Programming, Systems, Languages, and Applications (OOPSLA'97),
Volume 32, Issue 10 - Atlanta, GA, USA, October 1997, pages 125-141.

http://SmallEiffel.loria.fr/papers/papers.html#OOPSLA97

as a justification for that approach.

-- 
Andrew





Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-12 Thread Matthew Dillon


:...

I would also like to add a few more notes in regards to write pipelines.

Write pipelines are not used any more, at least not long ones.  The
reason is simply the cache coherency issue again.  Until the data is
actually written into the L1 cache, it is acoherent.

Acoherency is no longer allowed :-).  That kind of thing creates a problem
for the old write pipeline.

Thus, modern cpu's have begun removing their write pipelines in favor
of a synchronous write to the L1 cache.  This typically occurs in the 
"write-back" phase of the instruction pipeline.  It does not cause
a processor to stall.

There are several advantages to this:

* First, since the write is going directly into the L1 cache,
  it can partake of the cache coherency protocols.

* Second, since the L1 cache is large compared to the older
  write-pipeline schema, this effectively gives us a
  (for all intents and purposes other than a memory copy)
  infinitely-sized write pipeline.

* Third, we can throw away all the write pipeline address snooping 
  junk on the chip, making the chip smaller.

The cache coherency protocol may cause side effects when a memory
write is issued.  Typically the write is issued into the L1 cache
and requires what is known as a write-through operation into the L2
cache in order to guarantee cache coherency.  This is necessary because
it is the L2 cache which is performing the bulk of the bus protocol
to implement the cache coherency, and it needs to know when you've
modified something.

"write through" *used* to mean writing through to main memory, but with
the advent of modern cache coherency protocols, the L2 cache *IS* main
memory for all intents and purposes.  Dirty data contained in the L2
cache inherently invalidates the associated memory address in main memory
simply by being present in the L2 cache.  Another processor attempting
to access that location in main memory will instead obtain the data
directly from our processor's L2 cache.

One of the interesting benefits from this is that dirty data in the L2
cache can be flushed to main memory completely asynchronously without
screwing up cache coherency, serialization, or anything else.

Final note: memory mapped I/O spaces do not apply here.  These are usually
marked uncacheable and have nothing to do with the cache.  But DMA is
different.  DMA works just like any other memory access in that it 
can be made to run through the L2 cache coherency protocols.  So you 
can often safely initiate DMA without having to flush your caches.  This
is not true of all processors.  I'm not sure about the Pentium's.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]





Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-12 Thread Mike Haertel

> Second answer: in the real world, we're nearly always hitting the
> cache on stack operations associated with calls and argument passing,
> but not less often on operations in the procedure body.  So, in
      ^^^ typo

Urk.  I meant to say "less often", delete the "not".





Re: objtrm problem probably found (was Re: Stuck in objtrm)

1999-07-12 Thread Mike Haertel

> This is a fairly key statement in context, and an opinion here would 
> count for a lot; are function calls likely to become more or less 
> expensive in time?

Ambiguous question.

First answer: Assume we're hitting the cache, taking no branch
mispredicts, and everything is generally going at "the speed of
light".  The cost of a call, relative to the cost of "basic
operations" like loads and adds, will stay about the same in Intel's
next-generation x86 processors as it is now.  We made some tradeoffs,
some things go a little faster and others a little slower, but in
the end the function call overhead averages out to about the same.

Second answer: in the real world, we're nearly always hitting the
cache on stack operations associated with calls and argument passing,
but not less often on operations in the procedure body.  So, in
the real world, as memory latencies get worse and worse relative
to processor frequency, the overhead for procedure calling represents
an increasingly smaller percentage of the total execution time.

Summary: So, correctly predicted calls are staying the same or
getting cheaper.  Certainly calls are not getting worse, unless
perhaps you have lots of mispredicted calls.  As pipelines get
deeper, mispredicts get worse.  But that's not a problem just for
calls, it's branches in general.

