Re: [RFC] Fix full memory barrier on SPARC-V8

2011-06-29 Thread Eric Botcazou
 Linux doesn't ever run the cpu in the RMO memory model any more.  All
 sparc64 chips run only in TSO now.

 All of the Niagara chips implement an even stricter than TSO memory
 model, and the membars we used to have all over the kernel to handle
 that properly were just wasted I-cache space.  So I just moved
 unilaterally to TSO everywhere and killed off the membars necessitated
 by RMO.

OK, thanks for the clarification.  That's also fine from GCC's viewpoint.

-- 
Eric Botcazou


Re: [RFC] Fix full memory barrier on SPARC-V8

2011-06-28 Thread Eric Botcazou
 Let's clarify something, did you run your testcase that triggered this
 bug on a v8 or a v9 machine?

Sun UltraSPARC, so V9 of course.  The point is that Solaris is TSO (TSO as 
defined for the V9 architecture, i.e. backward compatible with V8) so you have 
a V8-compatible TSO implementation, in particular not a Strong Consistency V8.

It is perfectly valid to compile with -mcpu=v8 on Solaris and expect to get a 
working program.  Now if you start to play seriously with __sync_synchronize, 
you conclude that it doesn't implement a full memory barrier with -mcpu=v8.

The V8 architecture manual is quite clear about it: TSO allows stores to be 
reordered after subsequent loads (it's the only difference between TSO and 
Strong Consistency), so you need to do something to get a full memory barrier.  
As there is no specific instruction to that effect in V8, you need to do what 
is done for pre-SSE2 x86, i.e. use an atomic instruction.
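
As a rough illustration of the idea only (this is not the patch itself,
and the names full_barrier_via_rmw and publish_then_check are
hypothetical), a full barrier can be borrowed from an atomic
read-modify-write on a dummy location.  The sketch assumes a host where
GCC provides __sync_fetch_and_or, which the GCC manual documents as a
full barrier:

```c
#include <assert.h>

/* Hypothetical dummy location; only the ordering side effect of the
   atomic operation on it is of interest, never its value. */
static int barrier_dummy;

/* Full barrier borrowed from an atomic read-modify-write.  OR-ing in 0
   leaves the dummy unchanged; GCC documents the __sync_fetch_and_*
   builtins as full barriers. */
static inline void full_barrier_via_rmw(void)
{
    (void)__sync_fetch_and_or(&barrier_dummy, 0);
}

/* Store-then-load sequence that needs store-load ordering: publish our
   flag, then look at the peer's flag. */
static int publish_then_check(volatile int *mine, volatile int *theirs)
{
    *mine = 1;               /* store */
    full_barrier_via_rmw();  /* keep the load below behind the store */
    return *theirs;          /* load */
}
```

Whether such a sequence is cheap enough is a separate question; the
sketch only shows where the atomic operation has to sit.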

-- 
Eric Botcazou


Re: [RFC] Fix full memory barrier on SPARC-V8

2011-06-28 Thread David Miller
From: Eric Botcazou ebotca...@adacore.com
Date: Tue, 28 Jun 2011 10:11:03 +0200

 The V8 architecture manual is quite clear about it: TSO allows stores to be 
 reordered after subsequent loads (it's the only difference between TSO and 
 Strong Consistency), so you need to do something to get a full memory barrier.  
 As there is no specific instruction to that effect in V8, you need to do what 
 is done for pre-SSE2 x86, i.e. use an atomic instruction.

Fair enough, you can add this code if you want.





Re: [RFC] Fix full memory barrier on SPARC-V8

2011-06-28 Thread Eric Botcazou
 Fair enough, you can add this code if you want.

Thanks.  Note that this is marginal for Solaris as GCC defaults to -mcpu=v9 on 
Solaris but, in all other cases, it defaults to -mcpu=v8.  I can reproduce the 
problem on the SPARC/Linux machine 'grobluk' of the CompileFarm:

cpu : TI UltraSparc II  (BlackBird)
fpu : UltraSparc II integrated FPU
prom: OBP 3.2.30 2002/10/25 14:03
type: sun4u
ncpus probed: 4
ncpus active: 4

Linux grobluk 2.6.26-2-sparc64-smp #1 SMP Thu Nov 5 03:34:29 UTC 2009 sparc64 
GNU/Linux

With the pristine compiler, the test passes with -mcpu=v9 but fails otherwise.
It passes with the patched compiler.  However, I suspect that we would still 
have problems with newer UltraSparc CPUs supporting full RMO, because the new 
insn membar_v8 is only half a memory barrier for V9.

-- 
Eric Botcazou


Re: [RFC] Fix full memory barrier on SPARC-V8

2011-06-28 Thread David Miller
From: Eric Botcazou ebotca...@adacore.com
Date: Tue, 28 Jun 2011 23:27:43 +0200

 With the pristine compiler, the test passes with -mcpu=v9 but fails otherwise.
 It passes with the patched compiler.  However, I suspect that we would still 
 have problems with newer UltraSparc CPUs supporting full RMO, because the new 
 insn membar_v8 is only half a memory barrier for V9.

Linux doesn't ever run the cpu in the RMO memory model any more.  All
sparc64 chips run only in TSO now.

All of the Niagara chips implement an even stricter than TSO memory
model, and the membars we used to have all over the kernel to handle
that properly were just wasted I-cache space.  So I just moved
unilaterally to TSO everywhere and killed off the membars necessitated
by RMO.


Re: [RFC] Fix full memory barrier on SPARC-V8

2011-06-27 Thread David Miller
From: Eric Botcazou ebotca...@adacore.com
Date: Mon, 27 Jun 2011 18:11:10 +0200

   * config/sparc/sync.md (*stbar): Delete.
   (*membar_v8): New insn to implement UNSPEC_MEMBAR in SPARC-V8.

Code which cares about memory ordering etc. really has to know
the kind of cpu it is running on.

This is why atomic and synchronization primitives are typically
restricted to shared libraries and similar, where the dynamic
linker can vet out what is the correct implementation on a
given piece of hardware.

V8 can only reorder stores, that's why it only has a 'stbar'
instruction.  I'm not so sure I agree with trying to paper over the
fact that someone has compiled code for v8 that's going to run on a v9
cpu.


Re: [RFC] Fix full memory barrier on SPARC-V8

2011-06-27 Thread Geert Bosch

On Jun 27, 2011, at 19:00, David Miller da...@davemloft.net wrote:

 V8 can only reorder stores, that's why it only has a 'stbar'
 instruction.  I'm not so sure I agree with trying to paper over the
 fact that someone has compiled code for v8 that's going to run on a v9
 cpu.

That's not the issue. While it is true that all stores will be submitted in 
order, this does not guarantee store-load consistency. In particular, on a 
multiprocessor each individual processor has its own store buffer and cannot 
see what is in the other CPUs' store buffers. In the end all stores will be 
committed to memory in a sequential order, but that is not sufficient. The use 
of a load-store instruction is needed to achieve a full barrier.  The SPARC 
architecture manuals describe this in detail. 
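
To make the store-buffer effect concrete, here is a toy single-threaded
model, a sketch under simplifying assumptions rather than a description
of any real SPARC implementation: each CPU has a one-entry store buffer,
loads bypass from the CPU's own buffer only, and sb_drain stands in for
whatever a full barrier must achieve.  All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPU 2
#define NLOC 2

/* One pending store per CPU (toy simplification). */
struct sb_cpu {
    bool pending;  /* is a store waiting in the buffer? */
    int  addr;     /* location of the buffered store */
    int  value;    /* value of the buffered store */
};

static int sb_memory[NLOC];
static struct sb_cpu sb_cpus[NCPU];

static void sb_reset(void)
{
    for (int i = 0; i < NLOC; i++) sb_memory[i] = 0;
    for (int c = 0; c < NCPU; c++) sb_cpus[c].pending = false;
}

/* Commit the CPU's buffered store to shared memory. */
static void sb_drain(int c)
{
    if (sb_cpus[c].pending) {
        sb_memory[sb_cpus[c].addr] = sb_cpus[c].value;
        sb_cpus[c].pending = false;
    }
}

/* A store lands in the local buffer, not in memory. */
static void sb_store(int c, int addr, int value)
{
    sb_drain(c);  /* one-entry buffer: commit the older store first */
    sb_cpus[c].pending = true;
    sb_cpus[c].addr = addr;
    sb_cpus[c].value = value;
}

/* A load sees the CPU's own buffered store, otherwise shared memory. */
static int sb_load(int c, int addr)
{
    if (sb_cpus[c].pending && sb_cpus[c].addr == addr)
        return sb_cpus[c].value;
    return sb_memory[addr];
}

/* Each CPU sets its own flag, then reads the other one's flag.
   Returns true if both loads observed 0. */
static bool sb_both_loads_miss(bool with_barrier)
{
    sb_reset();
    sb_store(0, 0, 1);                 /* CPU 0: flag[0] = 1 */
    sb_store(1, 1, 1);                 /* CPU 1: flag[1] = 1 */
    if (with_barrier) {
        sb_drain(0);
        sb_drain(1);
    }
    int r0 = sb_load(0, 1);            /* CPU 0 reads flag[1] */
    int r1 = sb_load(1, 0);            /* CPU 1 reads flag[0] */
    return r0 == 0 && r1 == 0;
}
```

In this model sb_both_loads_miss(false) is true and
sb_both_loads_miss(true) is false: the stores are never reordered with
each other, yet both CPUs can still miss each other's store.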

  -Geert


Re: [RFC] Fix full memory barrier on SPARC-V8

2011-06-27 Thread David Miller
From: Geert Bosch bo...@adacore.com
Date: Mon, 27 Jun 2011 19:36:06 -0400

 On Jun 27, 2011, at 19:00, David Miller da...@davemloft.net wrote:
 
 V8 can only reorder stores, that's why it only has a 'stbar'
 instruction.  I'm not so sure I agree with trying to paper over the
 fact that someone has compiled code for v8 that's going to run on a v9
 cpu.

 That's not the issue. While it is true that all stores will be
 submitted in order, this does not guarantee store-load
 consistency. In particular, on a multiprocessor each individual
 processor has its own store buffer and cannot see what is in the
 other CPUs' store buffers. In the end all stores will be committed to
 memory in a sequential order, but that is not sufficient. The use of
 a load-store instruction is needed to achieve a full barrier.  The
 SPARC architecture manuals describe this in detail.

I'm trying to find the part of the v8 manual that says there is
a situation where we should use stbar and a ldstub to implement
proper memory barriers.  In particular I'm looking in Appendix J,
Programming with the memory models.  Where is the description?

Adding a ldstub here is going to be really expensive, on UltraSparc
that can be 36+ cycles even on a cache hit.

Also, the more I think about it, the issue really is that one is
trying to run v8 code on a v9 cpu.

And this is because no v8 cpu ever implemented anything other than
Strong Consistency, so on a v8 cpu you would never run into this
problem.

I really think the answer in this situation is compile code for the
actual processor you're targeting, especially if you want features
with processor specific behaviors, such as atomics and memory
barriers, to work properly.


Re: [RFC] Fix full memory barrier on SPARC-V8

2011-06-27 Thread Geert Bosch

On Jun 27, 2011, at 19:53, David Miller wrote:

 I'm trying to find the part of the v8 manual that says there is
 a situation where we should use stbar and a ldstub to implement
 proper memory barriers.  In particular I'm looking in Appendix J,
 Programming with the memory models.  Where is the description?
See J.7, and study why the store instructions are replaced by SWAP.
 
 Adding a ldstub here is going to be really expensive, on UltraSparc
 that can be 36+ cycles even on a cache hit.

Yes, synchronization in multi-CPU systems is expensive.
If it's really cheap, you're probably doing something wrong.

 Also, the more I think about it, the issue really is that one is
 trying to run v8 code on a v9 cpu.

Double no:
   1. No, my primary concern is about v8 code running on multiprocessor
  systems implementing the SPARC v8 architecture (LEON3 in particular)
   2. No, a SPARCv8 compliant binary should run correctly on both
  SPARCv8 and SPARCv9. The entire raison d'être for the SPARC
  architecture is so we can write code based on the architecture,
  and have it run correctly on all implementations.

  -Geert


Re: [RFC] Fix full memory barrier on SPARC-V8

2011-06-27 Thread David Miller
From: Geert Bosch bo...@adacore.com
Date: Mon, 27 Jun 2011 22:21:47 -0400

 On Jun 27, 2011, at 19:53, David Miller wrote:
 
 Adding a ldstub here is going to be really expensive, on UltraSparc
 that can be 36+ cycles even on a cache hit.
 
 Yes, synchronization in multi-CPU systems is expensive.
 If it's really cheap, you're probably doing something wrong.

First, I fundamentally disagree with this assertion.  The reason
proper memory barriers exist is so that you don't need nonsense like
these proposed atomics to get proper memory operation ordering.

A proper membar on your v9 test system is orders of magnitude cheaper
than this stbar+ldstub business.

You then go on to speak about LEON, does LEON implement PSO?


Re: [RFC] Fix full memory barrier on SPARC-V8

2011-06-27 Thread Geert Bosch

On Jun 27, 2011, at 22:45, David Miller wrote:

 From: Geert Bosch bo...@adacore.com
 Date: Mon, 27 Jun 2011 22:21:47 -0400
 
 On Jun 27, 2011, at 19:53, David Miller wrote:
 
 Adding a ldstub here is going to be really expensive, on UltraSparc
 that can be 36+ cycles even on a cache hit.
 
 Yes, synchronization in multi-CPU systems is expensive.
 If it's really cheap, you're probably doing something wrong.
 
 First, I fundamentally disagree with this assertion.  The reason
 proper memory barriers exist is so that you don't need nonsense like
 these proposed atomics to get proper memory operation ordering.
Sorry, I see now I phrased this poorly, no offense intended. We both 
agree that with TSO there is never a need for any STBAR instructions on 
SPARCv8. The point is that TSO is not sufficient for strong consistency.
The reason for this is the existence of write buffers (see fig 6.1, or K-1
of the SPARC v8 architecture manual). In particular, note the CPU-local
bypass from the store buffer. Two processors both storing a value X in 
location Y and then reading from Y might each see their own value. In
the end, one will reach memory first and the stores will be ordered
there. The load-store instructions are necessary to ensure the store
will be seen by the memory system before subsequent loads can use them.

The main issue is that SPARC's TSO does not guarantee Store-Load ordering.
So, only by issuing a SWAP(A) or LDSTUB(A) instruction can total ordering
of all loads and stores be guaranteed. 
 
 A proper membar on your v9 test system is orders of magnitude cheaper
 than this stbar+ldstub business.
That's true, but membar is a SPARC v9 instruction. The issue Eric and I
are addressing is only about SPARCv8. 

 You then go on to speak about LEON, does LEON implement PSO?
No, I'm not talking about PSO or SPARCv9 anywhere. 
Just plain old SPARCv8, using the TSO model. This requires a
load-store instruction to guarantee a full memory barrier.

I'm not making this up, that is why I refer to the examples in
the SPARC v8 architecture manual that specifically state that
SWAP instructions need to be used instead of store instructions
to make Dekker's algorithm work.
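
A minimal sketch of that substitution in C, not the manual's code and
not the complete Dekker algorithm (only the entry handshake, with no
turn variable): the flag is published with an atomic exchange, the role
SWAP plays, instead of a plain store.  It assumes the __atomic builtins
of GCC 4.7 and later; try_enter and leave_section are hypothetical
names.

```c
#include <assert.h>

/* dekker_flag[i] != 0 means thread i wants to enter. */
static int dekker_flag[2];

/* Announce intent with an atomic exchange rather than a plain store,
   then inspect the other thread's flag.  Returns 1 if the caller saw
   the other flag clear and may proceed, 0 otherwise. */
static int try_enter(int me)
{
    int other = 1 - me;

    /* Sequentially consistent exchange: the store half cannot be
       passed by the load that follows. */
    (void)__atomic_exchange_n(&dekker_flag[me], 1, __ATOMIC_SEQ_CST);

    return __atomic_load_n(&dekker_flag[other], __ATOMIC_SEQ_CST) == 0;
}

/* Withdraw the request. */
static void leave_section(int me)
{
    __atomic_store_n(&dekker_flag[me], 0, __ATOMIC_SEQ_CST);
}
```

With a plain store in try_enter, two threads calling it concurrently
could both get 1 under TSO; with the exchange at most one of them can.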

  -Geert



Re: [RFC] Fix full memory barrier on SPARC-V8

2011-06-27 Thread David Miller
From: Geert Bosch bo...@adacore.com
Date: Mon, 27 Jun 2011 23:17:18 -0400

 You then go on to speak about LEON, does LEON implement PSO?
 No, I'm not talking about PSO or SPARCv9 anywhere. 
 Just plain old SPARCv8, using the TSO model. This requires a
 load-store instruction to guarantee a full memory barrier.
 
 I'm not making this up, that is why I refer to the examples in
 the SPARC v8 architecture manual that specifically state that
 SWAP instructions need to be used instead of store instructions
 to make Dekker's algorithm work.

All v8 processors that I am aware of implement strong consistency, and
if so discussions about TSO are not relevant.  Is LEON an exception?

Let's clarify something, did you run your testcase that triggered this
bug on a v8 or a v9 machine?