Re: objtrm problem probably found (was Re: Stuck in objtrm)
Before this thread on "cache coherence" and "memory consistency" goes
any further, I'd like to suggest a time-out to read something like
http://www-ece.rice.edu/~sarita/Publications/models_tutorial.ps. A lot
of what I'm reading has a grain of truth but isn't quite right. This
paper appeared as a tutorial in IEEE Computer a couple of years ago,
and is fairly accessible to a general audience.

Alan

To Unsubscribe: send mail to [EMAIL PROTECTED] with "unsubscribe freebsd-current" in the body of the message
Re: objtrm problem probably found (was Re: Stuck in objtrm)
> On Mon, Jul 12, 1999 at 10:38:03PM -0700, Mike Smith wrote:
>
> I said:
>
> > > than indirect function calls on some architectures: inline
> > > branched code. So you still have a global variable selecting
> > > locked/non-locked, but it's a boolean, rather than a pointer.
> > > Your atomic macros are then:
> > >
> > >     { if (atomic_lock) asm("lock;foo"); else asm("foo"); }
> >
> > This requires you to have all the methods present at compile time,
> > which defeats the entire purpose of dynamic method loading.
>
> Pardon? I didn't see a discussion of dynamic loading anywhere here.
> We were referring to tiny inlined assembly language routines. The
> existing implementation is #defines in a C header file.

Uh, no, I was asking about the overhead involved in indirect function
calls specifically because there are instances where fast indirect
methods would allow us to greatly improve the generality of parts of
the FreeBSD kernel and modules. This includes code where we are
currently using compile-time defines that require us to trade
performance for generality.

-- 
\\ The mind's the standard  \\ Mike Smith
\\ of the man.              \\ [EMAIL PROTECTED]
\\ -- Joseph Merrick        \\ [EMAIL PROTECTED]
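[For readers following along, here is a minimal sketch of the
"inline branched code" idea under discussion, assuming GCC-style
inline assembly on i386. The names `atomic_lock` and `atomic_add_int`
are illustrative only, not the actual sys/i386/include/atomic.h
interface.]

```c
/* Hypothetical run-time flag, set once at boot: non-zero on an SMP
 * system, zero on UP.  (Invented name for this sketch.) */
static int atomic_lock = 1;

/* One boolean test selects the locked or unlocked form of the same
 * read-modify-write instruction, instead of an indirect function
 * call through a pointer. */
static inline void
atomic_add_int(volatile int *p, int v)
{
	if (atomic_lock)
		__asm__ __volatile__("lock; addl %1,%0"
		    : "+m" (*p) : "ir" (v));
	else
		__asm__ __volatile__("addl %1,%0"
		    : "+m" (*p) : "ir" (v));
}
```

[Both asm bodies must be present at compile time, which is what
prompts the "all the methods present at compile time" objection
above; the branch itself is cheap and well-predicted since the flag
never changes after boot.]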
Re: objtrm problem probably found (was Re: Stuck in objtrm)
Doug Rabson wrote:
> On Mon, 12 Jul 1999, Peter Jeremy wrote:
> > Mike Haertel [EMAIL PROTECTED] wrote:
> > > Um. FYI on x86, even if the compiler generates the RMW form
> > > "addl $1, foo", it's not atomic. If you want it to be atomic you
> > > have to precede the opcode with a LOCK prefix 0xF0.
> >
> > I'd noticed that point as well. The top of
> > sys/i386/include/atomic.h _does_ make it clear that they aren't
> > SMP safe:
> >
> >     /*
> >      * Various simple arithmetic on memory which is atomic in the
> >      * presence of interrupts.
> >      *
> >      * Note: these versions are not SMP safe.
> >      */
> >
> > That said, it should be fairly simple to change Matt's new in-line
> > assembler versions to insert LOCK prefixes when building an SMP
> > kernel. (Although I don't know that this is necessary yet, given
> > the `Big Giant Lock'.)
>
> We don't need the lock prefix for the current SMP implementation. A
> lock prefix would be needed in a multithreaded implementation but
> should not be added unless the kernel is an SMP kernel; otherwise UP
> performance would suffer.

I was under the impression that a locked instruction was essentially
free at runtime, with the sole exception of being one byte larger.

If there is any chance of this stuff getting exported to modules via
inlines, it should support both, as we don't need any more differences
between SMP and non-SMP modules than we've already got. :-(

Also, with the posted example, atomic_cmpex() is #ifndef I386_CPU -
that means it won't be available on any of the default kernels, either
at install time or for 99% of post-install systems built from a
release.

Cheers,
-Peter
-- 
Peter Wemm - [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Re: objtrm problem probably found (was Re: Stuck in objtrm)
> Although function calls are more expensive than inline code, they
> aren't necessarily a lot more so, and function calls to non-locked
> RMW operations are certainly much cheaper than inline locked RMW
> operations.

This is a fairly key statement in context, and an opinion here would
count for a lot; are function calls likely to become more or less
expensive in time?

-- 
\\ The mind's the standard  \\ Mike Smith
\\ of the man.              \\ [EMAIL PROTECTED]
\\ -- Joseph Merrick        \\ [EMAIL PROTECTED]
Re: objtrm problem probably found (was Re: Stuck in objtrm)
:
: Although function calls are more expensive than inline code,
: they aren't necessarily a lot more so, and function calls to
: non-locked RMW operations are certainly much cheaper than
: inline locked RMW operations.
:
:This is a fairly key statement in context, and an opinion here would
:count for a lot; are function calls likely to become more or less
:expensive in time?

In terms of cycles, either less or the same. Certainly not more. If
you think about it, a function call is nothing more than a save, a
jump, and a retrieval and jump later on. On Intel the save is a push;
on other CPUs the save may be to a register (which is pushed later if
necessary).

The change in code flow used to be the expensive piece, but not any
more. You typically either see a branch prediction cache (Intel)
offering a best-case of 0-cycle latency, or a single-cycle latency
that is slot-fillable (MIPS). Since the jump portion of a subroutine
call to a direct label is nothing more than a deterministic branch,
the branch prediction cache actually operates in this case. You do
not quite get 0-cycle latency due to the push/pop, and potential
arguments, but it is very fast.

-Matt
Matthew Dillon
[EMAIL PROTECTED]
Re: objtrm problem probably found (was Re: Stuck in objtrm)
:
: Although function calls are more expensive than inline code,
: they aren't necessarily a lot more so, and function calls to
: non-locked RMW operations are certainly much cheaper than
: inline locked RMW operations.
:
:This is a fairly key statement in context, and an opinion here would
:count for a lot; are function calls likely to become more or less
:expensive in time?

...

The change in code flow used to be the expensive piece, but not any
more. You typically either see a branch prediction cache (Intel)
offering a best-case of 0-cycle latency, or a single-cycle latency
that is slot-fillable (MIPS).

I assumed too much in asking the question; I was specifically
interested in indirect function calls, since this has a direct impact
on method-style implementations.

-- 
\\ The mind's the standard  \\ Mike Smith
\\ of the man.              \\ [EMAIL PROTECTED]
\\ -- Joseph Merrick        \\ [EMAIL PROTECTED]
Re: objtrm problem probably found (was Re: Stuck in objtrm)
:I assumed too much in asking the question; I was specifically
:interested in indirect function calls, since this has a direct impact
:on method-style implementations.

Branch prediction caches are typically PC-sensitive. An indirect
method call will never be as fast as a direct call, but if the
indirect address is the same, the branch prediction cache will work.
If the indirect address changes at the PC where the call is being
made, the branch cache may create a penalty.

Try adding the following cases to that test program, along with two
nop subroutines: void nop1(void) { } and void nop2(void) { }. Compile
this code without any optimizations! *no* optimizations or the test
will not demonstrate the problem :-)

In this case the branch prediction succeeds because the indirect
address does not change at the PC where func() is called. I get 34 ns
per loop.

    {
        void (*func)(void) = nop1;
        for (i = 0; i < LOOPS; ++i) {
            func();
            if (i & 1)
                func = nop1;
            else
                func = nop1;
        }
    }

In this case the branch prediction fails because the indirect address
is different at the PC each time func() is called. I get 61 ns.

    {
        void (*func)(void) = nop1;
        for (i = 0; i < LOOPS; ++i) {
            func();
            if (i & 1)
                func = nop1;
            else
                func = nop2;
        }
    }

In this case we simulate a mix: (i & 1) becomes (i & 7). I get 47 ns.

    {
        void (*func)(void) = nop1;
        for (i = 0; i < LOOPS; ++i) {
            func();
            if (i & 7)
                func = nop1;
            else
                func = nop2;
        }
    }

Ok, so what does this mean for method calls? If the method call is
INLINED, then the branch prediction cache will tend to work because
the method call will tend to call the same address at any given PC.
If the method call is doubly-indirect, where a routine is called which
calculates the method address and then calls it, the branch prediction
cache will tend to fail because a different address will tend to be
called at the PC of the call.

-Matt
Matthew Dillon
[EMAIL PROTECTED]
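[A self-contained version of the experiment above, for anyone who
wants to reproduce it. The timing harness here uses clock(), which is
an assumption; Matt's original harness is not shown. As he notes, the
effect only shows up when compiled without optimization.]

```c
#include <time.h>

#define LOOPS 1000000L

/* Two empty subroutines at distinct addresses, as in the test. */
void nop1(void) {}
void nop2(void) {}

/* Time LOOPS indirect calls and return nanoseconds per call.
 * 'mask' plays the role of the (i & 1) / (i & 7) selector above;
 * 'alternate' chooses whether the target address ever changes at the
 * call site, i.e. whether branch target prediction can succeed. */
double
ns_per_call(long mask, int alternate)
{
	void (*func)(void) = nop1;
	clock_t t0 = clock();

	for (long i = 0; i < LOOPS; ++i) {
		func();
		if (i & mask)
			func = nop1;
		else
			func = alternate ? nop2 : nop1;
	}
	return (double)(clock() - t0) * 1e9 / CLOCKS_PER_SEC / LOOPS;
}
```

[Calling ns_per_call(1, 0), ns_per_call(1, 1), and ns_per_call(7, 1)
corresponds to Matt's three cases; absolute numbers will vary by CPU,
but the stable-target case should be the fastest at -O0.]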
Re: objtrm problem probably found (was Re: Stuck in objtrm)
Mike Smith [EMAIL PROTECTED] wrote:
> > Although function calls are more expensive than inline code, they
> > aren't necessarily a lot more so, and function calls to non-locked
> > RMW operations are certainly much cheaper than inline locked RMW
> > operations.
>
> This is a fairly key statement in context, and an opinion here would
> count for a lot; are function calls likely to become more or less
> expensive in time?

Based on general computer architecture principles, I'd say that a lock
prefix is likely to become more expensive[1], whilst a function call
will become cheaper[2] over time.

I'm not sure that this is an important issue here. The sole advantage
of moving to indirect function calls would be that the same object
code could be used on both UP and SMP configurations, without
incurring the overhead of the lock prefix in the UP configuration (at
the expense of an additional function call in all configurations). We
can't avoid the lock prefix overhead in the SMP case.

Based on the timings I did this morning, function calls are
(unacceptably, IMHO) expensive on all the CPUs I have to hand (i386,
Pentium and P-II) - the latter two presumably comprising the bulk of
current FreeBSD use. Currently the UP/SMP decision is made at compile
time (and has significant and widespread impact) - therefore there
seems little (if any) benefit in using function calls within the main
kernel.

I believe that Matt's patched i386/include/atomic.h, with the addition
of code to only include the lock prefix when SMP is defined, is
currently the optimal approach for the kernel - and I can't see any
way a future IA-32 implementation could change that.

The only benefit could be for kernel modules - a module could possibly
be compiled so the same LKM would run on either UP or SMP. Note that
function calls for atomic operations may not be sufficient (by
themselves) to achieve this: one of the SMP gurus may be able to
confirm whether anything else prevents an SMP-compiled LKM running
with a UP kernel.

If the lock prefix overhead becomes an issue for LKMs, then we could
define a variant of i386/include/atomic.h (e.g. by using a #define
which is only true for compiling LKMs) which does use indirect
function calls (and add the appropriate initialisation code). This is
a trivial exercise (which I'll demonstrate on request).

[1] A locked instruction implies a synchronous RMW cycle. In order to
    meet write-ordering guarantees (without which, a locked RMW cycle
    would be useless as a semaphore primitive), it implies a complete
    write serialization, and probably some level of instruction
    serialisation. Since write-back pipelines will get longer and
    parallel execution units more numerous, the cost of a
    serialisation operation will get relatively higher. Also, lock
    instructions are relatively infrequent, therefore there is little
    incentive to expend valuable silicon on trying to make them more
    efficient (at least as seen by the executing CPU).

[2] Function calls _are_ fairly common, therefore it probably is
    worthwhile expending some effort in optimising them - and the
    stack updates associated with a leaf subroutine are fairly easy to
    totally hide in an on-chip write pipeline/cache.

Peter
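[The "lock prefix only when SMP is defined" change Peter describes can
be sketched roughly as follows. MPLOCKED is an assumed macro name for
illustration; the actual patched atomic.h may differ in detail.]

```c
/* Choose the lock prefix at compile time: SMP kernels pay for bus
 * locking, UP kernels emit the bare instruction.  (Illustrative
 * names; GCC-style inline asm on i386 assumed.) */
#if defined(SMP)
#define MPLOCKED "lock; "
#else
#define MPLOCKED ""
#endif

static inline void
atomic_add_int(volatile unsigned int *p, unsigned int v)
{
	__asm__ __volatile__(MPLOCKED "addl %1,%0"
	    : "+m" (*p) : "ir" (v));
}
```

[Since MPLOCKED is resolved by the preprocessor, the UP build contains
no trace of the prefix; this is exactly why the same object code
cannot serve both configurations, which is the LKM problem raised
above.]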
Re: objtrm problem probably found (was Re: Stuck in objtrm)
:
:I'm not sure there's any reason why you shouldn't. If you changed the
:semantics of a stack segment so that memory addresses below the stack
:pointer were irrelevant, you could implement a small, 0-cycle, on-chip
:stack (that overflowed into memory). I don't know whether this
:semantic change would be allowable (and whether the associated silicon
:could be justified) for the IA-32.
:
:Peter

This would be relatively complex and also results in cache coherency
problems. A solution already exists: it's called branch-and-link, but
Intel CPUs do not use it because Intel CPUs do not have enough
registers (makes you just want to throw up -- all that MMX junk and
they couldn't add a branch-and-link register!). The key with
branch-and-link is that the lowest subroutine level does not have to
save/restore the register, making entry and return two or three times
faster than subroutine calls that make other subroutine calls.

The big problem with implementing complex caches is that it takes up a
serious amount of die space and power. Most modern CPUs revolve
almost entirely around their L1 cache and their register file. The
remaining caches tend to be ad-hoc. Intel's branch prediction cache
is like this. In order for a memory-prediction cache to be useful, it
really needs to be cache-coherent, which basically kills the idea of
having a separate little special case for the stack. Only the L1
cache is coherent. If you wanted, you could implement multiple L1
data caches on-chip - that might be of some benefit, but otherwise
branch-and-link is the better way to do it.

-Matt
Matthew Dillon
[EMAIL PROTECTED]
Re: objtrm problem probably found (was Re: Stuck in objtrm)
:
:Based on general computer architecture principles, I'd say that a lock
:prefix is likely to become more expensive[1], whilst a function call
:will become cheaper[2] over time.
:...
:[1] A locked instruction implies a synchronous RMW cycle. In order
:to meet write-ordering guarantees (without which, a locked RMW
:cycle would be useless as a semaphore primitive), it implies a
:complete write serialization, and probably some level of
:instruction serialisation. Since write-back pipelines will get

A locked instruction only implies cache coherency across the
instruction. It does not imply serialization. Intel blows it big
time, but that's Intel for you.

:longer and parallel execution units more numerous, the cost of
:a serialisation operation will get relatively higher. Also,
:Peter

It is not a large number of execution units that implies a higher cost
of serialization, but instead data interdependencies. A locked
instruction does not have to imply serialization. Modern cache
coherency protocols do not have a problem with a large number of
caches in a parallel processing subsystem.

This is how a typical two-level cache coherency protocol works with an
L1 and L2 cache:

* The L2 cache is usually a superset of the L1 cache. That is, all
  cache lines in the L1 cache also exist in the L2 cache.

* Both the L1 and L2 caches implement a shared and an exclusive bit,
  usually on a per-cache-line basis.

* When a processor, A, issues a memory op that is not in its cache,
  it can request the cache line from main memory either shared
  (unlocked) or exclusive (locked).

* All other processors snoop the main memory access.

* If another processor's L2 cache contains the cache line being
  requested by processor A, it can provide the cache line to
  processor A and no main memory op actually occurs. In this case,
  the shared and exclusive bits in processor B's L1 and L2 caches are
  updated appropriately.

  - If A is requesting shared (unlocked) access, both A and B will
    set the shared bit and B will clear the exclusive bit. (B will
    cause A to stall if it is in the middle of operating on the
    locked cache line.)

  - If A is requesting exclusive (locked) access, B will invalidate
    its cache line and clear both the shared and exclusive bits, and
    A will set its exclusive bit. A now owns that cache line.

  - If no other processor owns the cache line, A obtains the data
    from main memory and other processors have the option to snoop
    the access in their L2 caches.

So, using the above rules as an example, a locked instruction can cost
as little as 0 extra cycles no matter how many CPUs you have running
in parallel. There is no need to serialize or synchronize anything.
The worst case is actually not even as bad as a complete cache-miss
case. The cost of snooping another CPU's L2 cache is much less than
the cost of going to main memory.

-Matt
Matthew Dillon
[EMAIL PROTECTED]
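[As an illustration only, the shared/exclusive bookkeeping described
above can be sketched as a toy state machine for one cache line seen
by two caches, A and B. This is a simplification of real MESI-style
protocols, and every name in it is invented for the sketch.]

```c
/* Per-cache state of a single cache line. */
enum line_state { INVALID, SHARED_S, EXCLUSIVE_S };

/* One cache line as observed by caches A and B. */
struct cacheline {
	enum line_state a;
	enum line_state b;
};

/* A requests the line shared (an unlocked read): if B holds it,
 * B supplies the data and downgrades to shared; no main memory op
 * actually occurs. */
static void
req_shared_A(struct cacheline *l)
{
	if (l->b == EXCLUSIVE_S)
		l->b = SHARED_S;	/* B clears its exclusive bit */
	l->a = SHARED_S;
}

/* A requests the line exclusive (a locked RMW): B invalidates its
 * copy and A now owns the line outright. */
static void
req_exclusive_A(struct cacheline *l)
{
	l->b = INVALID;
	l->a = EXCLUSIVE_S;
}
```

[The point of the sketch is the cost model: both transitions are pure
cache-to-cache bookkeeping, which is why a locked instruction need not
serialize anything or touch main memory at all.]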
Re: objtrm problem probably found (was Re: Stuck in objtrm)
On Mon, Jul 12, 1999 at 07:09:58PM -0700, Mike Smith wrote:
> > Although function calls are more expensive than inline code, they
> > aren't necessarily a lot more so, and function calls to non-locked
> > RMW operations are certainly much cheaper than inline locked RMW
> > operations.
>
> This is a fairly key statement in context, and an opinion here would
> count for a lot; are function calls likely to become more or less
> expensive in time?

Others have answered this question, but I thought I'd point out that
there is another alternative that is likely to be faster than indirect
function calls on some architectures: inline branched code. So you
still have a global variable selecting locked/non-locked, but it's a
boolean, rather than a pointer. Your atomic macros are then:

    { if (atomic_lock) asm("lock;foo"); else asm("foo"); }

You might be interested in the paper:

    "Efficient Dynamic Dispatch without Virtual Function Tables. The
    SmallEiffel Compiler." Olivier ZENDRA, Dominique COLNET, Suzanne
    COLLIN. 12th Annual ACM SIGPLAN Conference on Object-Oriented
    Programming, Systems, Languages, and Applications (OOPSLA'97),
    Volume 32, Issue 10, Atlanta, GA, USA, October 1997, pages
    125-141. http://SmallEiffel.loria.fr/papers/papers.html#OOPSLA97

as a justification for that approach.

-- 
Andrew
Re: objtrm problem probably found (was Re: Stuck in objtrm)
:...

I would also like to add a few more notes in regards to write
pipelines. Write pipelines are not used any more, at least not long
ones. The reason is simply the cache coherency issue again. Until
the data is actually written into the L1 cache, it is acoherent.
Acoherency is no longer allowed :-). That creates a problem for the
old write pipeline. Thus, modern CPUs have begun removing their write
pipelines in favor of a synchronous write to the L1 cache. This
typically occurs in the "write-back" phase of the instruction
pipeline. It does not cause a processor to stall.

There are several advantages to this:

* First, since the write is going directly into the L1 cache, it can
  partake of the cache coherency protocols.

* Second, since the L1 cache is large compared to the older
  write-pipeline schema, this effectively gives us a (for all intents
  and purposes other than a memory copy) infinitely-sized write
  pipeline.

* Third, we can throw away all the write pipeline address snooping
  junk on the chip, making the chip smaller.

The cache coherency protocol may cause side effects when a memory
write is issued. Typically the write is issued into the L1 cache and
requires what is known as a write-through operation into the L2 cache
in order to guarantee cache coherency. This is necessary because it
is the L2 cache which is performing the bulk of the bus protocol to
implement the cache coherency, and it needs to know when you've
modified something.

"Write through" *used* to mean writing through to main memory, but
with the advent of modern cache coherency protocols, the L2 cache *IS*
main memory for all intents and purposes. Dirty data contained in the
L2 cache inherently invalidates the associated memory address in main
memory simply by being present in the L2 cache. Another processor
attempting to access that location in main memory will instead obtain
the data directly from our processor's L2 cache.

One of the interesting benefits from this is that dirty data in the L2
cache can be flushed to main memory completely asynchronously without
screwing up cache coherency, serialization, or anything else.

Final note: memory-mapped I/O spaces do not apply here. These are
usually marked uncacheable and have nothing to do with the cache. But
DMA is different. DMA works just like any other memory access in that
it can be made to run through the L2 cache coherency protocols. So
you can often safely initiate DMA without having to flush your caches.
This is not true of all processors. I'm not sure about the Pentium's.

-Matt
Matthew Dillon
[EMAIL PROTECTED]
Re: objtrm problem probably found (was Re: Stuck in objtrm)
> Second answer: in the real world, we're nearly always hitting the
> cache on stack operations associated with calls and argument
> passing, but not less often on operations in the procedure body.
        ^^^ typo

Urk. I meant to say "less often"; delete the "not".
Re: objtrm problem probably found (was Re: Stuck in objtrm)
> This is a fairly key statement in context, and an opinion here would
> count for a lot; are function calls likely to become more or less
> expensive in time?

Ambiguous question.

First answer: assume we're hitting the cache, taking no branch
mispredicts, and everything is generally going at "the speed of
light". The cost of a call, relative to the cost of "basic
operations" like loads and adds, will stay about the same in Intel's
next-generation x86 processors as it is now. We made some tradeoffs,
some things go a little faster and others a little slower, but in the
end the function call overhead averages out to about the same.

Second answer: in the real world, we're nearly always hitting the
cache on stack operations associated with calls and argument passing,
but not less often on operations in the procedure body. So, in the
real world, as memory latencies get worse and worse relative to
processor frequency, the overhead for procedure calling represents an
increasingly smaller percentage of the total execution time.

Summary: correctly predicted calls are staying the same or getting
cheaper. Certainly calls are not getting worse, unless perhaps you
have lots of mispredicted calls. As pipelines get deeper, mispredicts
get worse. But that's not a problem just for calls; it's branches in
general.