Re: [RFC] Fix full memory barrier on SPARC-V8
> Linux doesn't ever run the cpu in the RMO memory model any more. All sparc64 chips run only in TSO now. All of the Niagara chips implement an even stricter than TSO memory model, and the membars we used to have all over the kernel to handle that properly were just wasted I-cache space. So I just moved unilaterally to TSO everywhere and killed off the membars necessitated by RMO.

OK, thanks for the clarification. That's also fine from GCC's viewpoint.

-- Eric Botcazou
Re: [RFC] Fix full memory barrier on SPARC-V8
> Let's clarify something, did you run your testcase that triggered this bug on a v8 or a v9 machine?

Sun UltraSPARC, so V9 of course. The point is that Solaris is TSO (TSO as defined for the V9 architecture, i.e. backward compatible with V8) so you have a V8-compatible TSO implementation, in particular not a Strong Consistency V8. It is perfectly valid to compile with -mcpu=v8 on Solaris and expect to get a working program.

Now if you start to play seriously with __sync_synchronize, you conclude that it doesn't implement a full memory barrier with -mcpu=v8. The V8 architecture manual is quite clear about it: TSO allows stores to be reordered after subsequent loads (it's the only difference between TSO and Strong Consistency) so you need to do something to have a full memory barrier. As there is no specific instruction to that effect in V8, you need to do what is done for pre-SSE2 x86, i.e. use an atomic instruction.

-- Eric Botcazou
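A minimal sketch of the approach described above, i.e. falling back to an atomic instruction where no dedicated barrier instruction exists. The preprocessor tests, the scratch byte and the inline assembly are illustrative assumptions and have not been checked on SPARC hardware; this is not the code from the patch. `full_barrier` and `announce_and_check` are hypothetical names.

```c
#include <assert.h>

/* Full memory barrier sketch. Which macros identify a V8-only build is an
   assumption here; adjust for the actual toolchain. */
static inline void full_barrier(void)
{
#if defined(__sparc__) && !defined(__sparc_v9__) && !defined(__arch64__)
    /* V8: stbar orders stores against stores (needed under PSO); the atomic
       ldstub on a scratch byte forces earlier stores to be visible before
       later loads (needed under TSO). The loaded value is discarded. */
    unsigned char scratch = 0;
    __asm__ __volatile__("stbar\n\tldstub %0, %%g0"
                         : "+m"(scratch) : : "memory");
#elif defined(__i386__) && !defined(__SSE2__)
    /* Pre-SSE2 x86 has no mfence; a locked read-modify-write on the
       stack acts as the barrier. */
    __asm__ __volatile__("lock; orl $0, (%%esp)" : : : "memory", "cc");
#else
    /* Elsewhere, rely on the compiler builtin. */
    __sync_synchronize();
#endif
}

static volatile int flag_mine, flag_theirs;

/* Store our flag, then read the other flag; the barrier keeps the load
   from being satisfied before the store is globally visible. */
static int announce_and_check(void)
{
    flag_mine = 1;
    full_barrier();
    return flag_theirs;
}
```

On targets that match neither special case, the sketch simply defers to `__sync_synchronize`, so it compiles unchanged on other hosts.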
Re: [RFC] Fix full memory barrier on SPARC-V8
From: Eric Botcazou ebotca...@adacore.com
Date: Tue, 28 Jun 2011 10:11:03 +0200

> The V8 architecture manual is quite clear about it: TSO allows stores to be reordered after subsequent loads (it's the only difference between TSO and Strong Consistency) so you need to do something to have a full memory barrier. As there is no specific instruction to that effect in V8, you need to do what is done for pre-SSE2 x86, i.e. use an atomic instruction.

Fair enough, you can add this code if you want.
Re: [RFC] Fix full memory barrier on SPARC-V8
> Fair enough, you can add this code if you want.

Thanks. Note that this is marginal for Solaris as GCC defaults to -mcpu=v9 on Solaris but, in all other cases, it defaults to -mcpu=v8. I can reproduce the problem on the SPARC/Linux machine 'grobluk' of the CompileFarm:

cpu          : TI UltraSparc II (BlackBird)
fpu          : UltraSparc II integrated FPU
prom         : OBP 3.2.30 2002/10/25 14:03
type         : sun4u
ncpus probed : 4
ncpus active : 4

Linux grobluk 2.6.26-2-sparc64-smp #1 SMP Thu Nov 5 03:34:29 UTC 2009 sparc64 GNU/Linux

With the pristine compiler, the test passes with -mcpu=v9 but fails otherwise. It passes with the patched compiler. However, I suspect that we would still have problems with newer UltraSparc CPUs supporting full RMO, because the new insn membar_v8 is only half a memory barrier for V9.

-- Eric Botcazou
Re: [RFC] Fix full memory barrier on SPARC-V8
From: Eric Botcazou ebotca...@adacore.com
Date: Tue, 28 Jun 2011 23:27:43 +0200

> With the pristine compiler, the test passes with -mcpu=v9 but fails otherwise. It passes with the patched compiler. However, I suspect that we would still have problems with newer UltraSparc CPUs supporting full RMO, because the new insn membar_v8 is only half a memory barrier for V9.

Linux doesn't ever run the cpu in the RMO memory model any more. All sparc64 chips run only in TSO now. All of the Niagara chips implement an even stricter than TSO memory model, and the membars we used to have all over the kernel to handle that properly were just wasted I-cache space. So I just moved unilaterally to TSO everywhere and killed off the membars necessitated by RMO.
Re: [RFC] Fix full memory barrier on SPARC-V8
From: Eric Botcazou ebotca...@adacore.com
Date: Mon, 27 Jun 2011 18:11:10 +0200

>     * config/sparc/sync.md (*stbar): Delete.
>     (*membar_v8): New insn to implement UNSPEC_MEMBAR in SPARC-V8.

Code which cares about memory ordering etc. really has to know the kind of cpu it is running on. This is why atomic and synchronization primitives are typically restricted to shared libraries and similar, where the dynamic linker can vet out what is the correct implementation on a given piece of hardware.

V8 can only reorder stores, that's why it only has a 'stbar' instruction. I'm not so sure I agree with trying to paper over the fact that someone has compiled code for v8 that's going to run on a v9 cpu.
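A minimal sketch of the kind of run-time selection described above, where a library picks a barrier implementation once it knows the hardware. Everything here is hypothetical: the `has_native_barrier` flag stands in for whatever CPU detection the library or dynamic linker performs, and none of the names come from GCC or any real loader.

```c
#include <assert.h>
#include <stdbool.h>

typedef void (*barrier_fn)(void);

/* Variant for CPUs where the compiler builtin emits a real barrier. */
static void barrier_builtin(void)
{
    __sync_synchronize();
}

/* Variant that leans on an atomic read-modify-write of a dummy word,
   for CPUs that have no dedicated barrier instruction. */
static void barrier_atomic_rmw(void)
{
    static int dummy;
    (void)__atomic_exchange_n(&dummy, 1, __ATOMIC_SEQ_CST);
}

/* Choose the implementation once, based on a capability flag that the
   caller is assumed to have detected by some other means. */
static barrier_fn select_barrier(bool has_native_barrier)
{
    return has_native_barrier ? barrier_builtin : barrier_atomic_rmw;
}
```

A caller would store the returned pointer at start-up and invoke it wherever a full barrier is needed, paying for an indirect call in exchange for one binary that is correct on both kinds of hardware.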
Re: [RFC] Fix full memory barrier on SPARC-V8
On Jun 27, 2011, at 19:00, David Miller da...@davemloft.net wrote:

> V8 can only reorder stores, that's why it only has a 'stbar' instruction. I'm not so sure I agree with trying to paper over the fact that someone has compiled code for v8 that's going to run on a v9 cpu.

That's not the issue. While it is true that all stores will be submitted in order, this does not guarantee store-load consistency. In particular on a multiprocessor, each individual processor has its own store buffers and cannot see what is in the other CPUs' store buffers. In the end all stores will be committed to memory in a sequential order, but that is not sufficient. The use of a load-store instruction is needed to achieve a full barrier. The SPARC architecture manuals describe this in detail.

-Geert
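The store-buffer effect described above can be illustrated with a toy, single-threaded model. This is only a sketch under simplifying assumptions: two simulated CPUs, a small FIFO store buffer each, and a `cpu_drain` step standing in for what an atomic load-store instruction forces. All names are hypothetical and nothing here models real SPARC hardware timing.

```c
#include <assert.h>
#include <stdbool.h>

enum { NLOC = 2, BUFSZ = 4 };

struct store { int addr, val; };
struct cpu { struct store buf[BUFSZ]; int n; };

static int memory[NLOC];

/* A store goes into the CPU-local FIFO buffer, not straight to memory. */
static void cpu_store(struct cpu *c, int addr, int val)
{
    assert(c->n < BUFSZ);
    c->buf[c->n].addr = addr;
    c->buf[c->n].val = val;
    c->n++;
}

/* A load sees the CPU's own buffered stores first (local bypass), then
   memory. It never sees another CPU's buffer. */
static int cpu_load(const struct cpu *c, int addr)
{
    for (int i = c->n - 1; i >= 0; i--)
        if (c->buf[i].addr == addr)
            return c->buf[i].val;
    return memory[addr];
}

/* Commit buffered stores to memory in program order. */
static void cpu_drain(struct cpu *c)
{
    for (int i = 0; i < c->n; i++)
        memory[c->buf[i].addr] = c->buf[i].val;
    c->n = 0;
}

/* CPU0: X = 1; r0 = Y.   CPU1: Y = 1; r1 = X.
   Returns true if both loads observed 0. */
static bool both_saw_zero(bool drain_before_load)
{
    struct cpu c0 = {0}, c1 = {0};
    memory[0] = memory[1] = 0;
    cpu_store(&c0, 0, 1);
    cpu_store(&c1, 1, 1);
    if (drain_before_load) {
        cpu_drain(&c0);
        cpu_drain(&c1);
    }
    int r0 = cpu_load(&c0, 1);
    int r1 = cpu_load(&c1, 0);
    cpu_drain(&c0);
    cpu_drain(&c1);
    return r0 == 0 && r1 == 0;
}
```

In this model `both_saw_zero(false)` reports the outcome that in-order stores alone do not rule out, while `both_saw_zero(true)` shows it disappearing once each buffer is drained before the subsequent load.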
Re: [RFC] Fix full memory barrier on SPARC-V8
From: Geert Bosch bo...@adacore.com
Date: Mon, 27 Jun 2011 19:36:06 -0400

> On Jun 27, 2011, at 19:00, David Miller da...@davemloft.net wrote:
>
>> V8 can only reorder stores, that's why it only has a 'stbar' instruction. I'm not so sure I agree with trying to paper over the fact that someone has compiled code for v8 that's going to run on a v9 cpu.
>
> That's not the issue. While it is true that all stores will be submitted in order, this does not guarantee store-load consistency. In particular on a multiprocessor, each individual processor has its own store buffers and cannot see what is in the other CPUs' store buffers. In the end all stores will be committed to memory in a sequential order, but that is not sufficient. The use of a load-store instruction is needed to achieve a full barrier. The SPARC architecture manuals describe this in detail.

I'm trying to find the part of the v8 manual that says there is a situation where we should use stbar and a ldstub to implement proper memory barriers. In particular I'm looking in Appendix J, Programming with the memory models. Where is the description?

Adding a ldstub here is going to be really expensive, on UltraSparc that can be 36+ cycles even on a cache hit.

Also, the more I think about it, the issue really is that one is trying to run v8 code on a v9 cpu. And this is because no v8 cpu ever implemented anything other than Strong Consistency, so on a v8 cpu you would never run into this problem.

I really think the answer in this situation is to compile code for the actual processor you're targeting, especially if you want features with processor-specific behaviors, such as atomics and memory barriers, to work properly.
Re: [RFC] Fix full memory barrier on SPARC-V8
On Jun 27, 2011, at 19:53, David Miller wrote:

> I'm trying to find the part of the v8 manual that says there is a situation where we should use stbar and a ldstub to implement proper memory barriers. In particular I'm looking in Appendix J, Programming with the memory models. Where is the description?

See J.7, and study why the store instructions are replaced by SWAP.

> Adding a ldstub here is going to be really expensive, on UltraSparc that can be 36+ cycles even on a cache hit.

Yes, synchronization in multi-CPU systems is expensive. If it's really cheap, you're probably doing something wrong.

> Also, the more I think about it, the issue really is that one is trying to run v8 code on a v9 cpu.

Double no:

1. No, my primary concern is about v8 code running on multiprocessor systems implementing the SPARC v8 architecture (LEON3 in particular).
2. No, a SPARCv8 compliant binary should run correctly on both SPARCv8 and SPARCv9.

The entire raison d'être for the SPARC architecture is so we can write code based on the architecture, and have it run correctly on all implementations.

-Geert
Re: [RFC] Fix full memory barrier on SPARC-V8
From: Geert Bosch bo...@adacore.com
Date: Mon, 27 Jun 2011 22:21:47 -0400

> On Jun 27, 2011, at 19:53, David Miller wrote:
>
>> Adding a ldstub here is going to be really expensive, on UltraSparc that can be 36+ cycles even on a cache hit.
>
> Yes, synchronization in multi-CPU systems is expensive. If it's really cheap, you're probably doing something wrong.

First, I fundamentally disagree with this assertion. The reason proper memory barriers exist is so that you don't need nonsense like these proposed atomics to get proper memory operation ordering.

A proper membar on your v9 test system is orders of magnitude cheaper than this stbar+ldstub business.

You then go on to speak about LEON, does LEON implement PSO?
Re: [RFC] Fix full memory barrier on SPARC-V8
On Jun 27, 2011, at 22:45, David Miller wrote:

> From: Geert Bosch bo...@adacore.com
> Date: Mon, 27 Jun 2011 22:21:47 -0400
>
>> On Jun 27, 2011, at 19:53, David Miller wrote:
>>
>>> Adding a ldstub here is going to be really expensive, on UltraSparc that can be 36+ cycles even on a cache hit.
>>
>> Yes, synchronization in multi-CPU systems is expensive. If it's really cheap, you're probably doing something wrong.
>
> First, I fundamentally disagree with this assertion. The reason proper memory barriers exist is so that you don't need nonsense like these proposed atomics to get proper memory operation ordering.

Sorry, I see now I phrased this poorly, no offense intended.

We both agree that with TSO there is never a need for any STBAR instructions on SPARCv8. The point is that TSO is not sufficient for strong consistency. The reason for this is the existence of write buffers (see fig 6.1, or K-1 of the SPARC v8 architecture manual). In particular, note the CPU-local bypass from the store buffer.

Two processors both storing a value X in location Y and then reading from Y might each see their own value. In the end, one will reach memory first and the stores will be ordered there. The load-store instructions are necessary to ensure the store will be seen by the memory system before subsequent loads can use them.

The main issue is that SPARC's TSO does not guarantee Store-Load ordering. So, only by issuing a SWAP(A) or LDSTUB(A) instruction can total ordering of all loads and stores be guaranteed.

> A proper membar on your v9 test system is orders of magnitude cheaper than this stbar+ldstub business.

That's true, but membar is a SPARC v9 instruction. The issue Eric and I are addressing is only about SPARCv8.

> You then go on to speak about LEON, does LEON implement PSO?

No, I'm not talking about PSO or SPARCv9 anywhere. Just plain old SPARCv8, using the TSO model. This requires a load-store instruction to guarantee a full memory barrier.

I'm not making this up, that is why I refer to the examples in the SPARC v8 architecture manual that specifically state that SWAP instructions need to be used instead of store instructions to make Dekker's algorithm work.

-Geert
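A rough sketch of the entry step of a Dekker-style protocol, written with C11 atomics, where the flag is set with an atomic exchange (the portable analogue of SWAP) rather than a plain store. This is a simplified try-and-back-off variant without the turn variable, not the manual's example; `try_enter` and `leave` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int flag[2];   /* flag[i] != 0 means CPU i wants to enter */

/* Attempt to enter the critical section as participant `me` (0 or 1).
   Returns false and backs off if the other participant is contending. */
static bool try_enter(int me)
{
    int other = 1 - me;

    /* Atomic read-modify-write instead of a plain store: the store must be
       globally visible before the following load is performed. */
    (void)atomic_exchange(&flag[me], 1);

    if (atomic_load(&flag[other]) != 0) {
        atomic_store(&flag[me], 0);   /* contention: withdraw our claim */
        return false;
    }
    return true;
}

static void leave(int me)
{
    atomic_store(&flag[me], 0);
}
```

If the exchange were replaced by a store that may sit in a store buffer while the load executes, both participants could read the other's flag as 0 and enter together, which is the failure the store-load ordering requirement rules out.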
Re: [RFC] Fix full memory barrier on SPARC-V8
From: Geert Bosch bo...@adacore.com
Date: Mon, 27 Jun 2011 23:17:18 -0400

>> You then go on to speak about LEON, does LEON implement PSO?
>
> No, I'm not talking about PSO or SPARCv9 anywhere. Just plain old SPARCv8, using the TSO model. This requires a load-store instruction to guarantee a full memory barrier.
>
> I'm not making this up, that is why I refer to the examples in the SPARC v8 architecture manual that specifically state that SWAP instructions need to be used instead of store instructions to make Dekker's algorithm work.

All v8 processors that I am aware of implement strong consistency, and if so discussions about TSO are not relevant. Is LEON an exception?

Let's clarify something, did you run your testcase that triggered this bug on a v8 or a v9 machine?