Re: GCC libatomic ABI specification draft

2017-01-23 Thread Michael Matz
Hi,

On Fri, 20 Jan 2017, Richard Henderson wrote:

> > You can't have a 4-aligned type of size 3.  Sizes must be multiples of 
> > alignment (otherwise arrays don't work).  The type of a 3-sized field 
> > in a packed struct that syntactically might be a 4-aligned type (e.g. 
> > by using attributes on char-array types) is actually a different type 
> > having an alignment of 1.  It's easier to simply regard all types 
> > inside packed structs as 1-aligned (which is IMO what we try to do).
> > 
> > That is, the byte after a 4-aligned "3-sized" type is always padding.
> 
> [ I read Bin Fan's original email some months ago, but I don't have it handy
> now.  Take faulty memory with a grain of salt. ]
> 
> I thought this was about libatomic being presented with an unaligned 3-byte
> structure that happens to sit within an aligned 4-byte word, and choosing to
> atomically operate on the 4-byte word instead of taking a lock on the side.

Ah well, in that case I lost context as well ;)


Ciao,
Michael.


Re: GCC libatomic ABI specification draft

2017-01-20 Thread Richard Henderson

On 01/20/2017 05:41 AM, Michael Matz wrote:
> Hi,
>
> On Wed, 18 Jan 2017, Richard Henderson wrote:
>
> > > Section 3 Rationale, alternative 1: I'm wondering if the example is
> > > correct.  For a 4-byte-aligned type of size 3, the implementation
> > > cannot simply use 4-byte hardware-backed atomics because this will
> > > inevitably touch the 4th byte I think, and the implementation can't
> > > know whether this is padding or not.  Or do we expect that things like
> > > packed structs are disallowed?
> >
> > If we atomically store an unchanged value into the 4th byte, can we
> > tell?
>
> You can't have a 4-aligned type of size 3.  Sizes must be multiples of
> alignment (otherwise arrays don't work).  The type of a 3-sized field in
> a packed struct that syntactically might be a 4-aligned type (e.g. by
> using attributes on char-array types) is actually a different type having
> an alignment of 1.  It's easier to simply regard all types inside packed
> structs as 1-aligned (which is IMO what we try to do).
>
> That is, the byte after a 4-aligned "3-sized" type is always padding.


[ I read Bin Fan's original email some months ago, but I don't have it handy 
now.  Take faulty memory with a grain of salt. ]


I thought this was about libatomic being presented with an unaligned 3-byte 
structure that happens to sit within an aligned 4-byte word, and choosing to 
atomically operate on the 4-byte word instead of taking a lock on the side.



r~
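
A minimal sketch of the scenario described above, assuming GCC/Clang `__atomic` builtins: a 3-byte object that lies inside a naturally aligned 4-byte word is updated with a compare-and-swap on the whole word, and the neighbouring byte is written back with the value that was just read. The names `triple_t` and `triple_store_via_word` are hypothetical, and this is an illustration rather than what libatomic actually does.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical 3-byte object; the name is an assumption for illustration. */
typedef struct { unsigned char b[3]; } triple_t;

/* Atomically replace the 3 bytes at byte offset `off` (0 or 1) inside the
   naturally aligned 4-byte word `word`.  The remaining byte is stored back
   with the value that was read, so its value never changes. */
static void triple_store_via_word(uint32_t *word, unsigned off, triple_t val)
{
    assert(off <= 1);
    uint32_t old = __atomic_load_n(word, __ATOMIC_RELAXED);
    uint32_t desired;
    do {
        desired = old;  /* start from the current word, keeping the 4th byte */
        memcpy((unsigned char *)&desired + off, val.b, sizeof val.b);
        /* On failure, `old` is refreshed with the current contents. */
    } while (!__atomic_compare_exchange_n(word, &old, desired, 1,
                                          __ATOMIC_SEQ_CST, __ATOMIC_RELAXED));
}
```

The value of the fourth byte is preserved, but the word-sized store still touches it, which is the part that race detectors or hardware breakpoints could observe.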



Re: GCC libatomic ABI specification draft

2017-01-20 Thread Michael Matz
Hi,

On Wed, 18 Jan 2017, Richard Henderson wrote:

> > Section 3 Rationale, alternative 1: I'm wondering if the example is 
> > correct.  For a 4-byte-aligned type of size 3, the implementation 
> > cannot simply use 4-byte hardware-backed atomics because this will 
> > inevitably touch the 4th byte I think, and the implementation can't 
> > know whether this is padding or not.  Or do we expect that things like 
> > packed structs are disallowed?
> 
> If we atomically store an unchanged value into the 4th byte, can we 
> tell?

You can't have a 4-aligned type of size 3.  Sizes must be multiples of 
alignment (otherwise arrays don't work).  The type of a 3-sized field in 
a packed struct that syntactically might be a 4-aligned type (e.g. by 
using attributes on char-array types) is actually a different type having 
an alignment of 1.  It's easier to simply regard all types inside packed 
structs as 1-aligned (which is IMO what we try to do).

That is, the byte after a 4-aligned "3-sized" type is always padding.


Ciao,
Michael.
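
The size/alignment rule above can be checked with a small sketch, assuming the GNU `aligned` attribute as accepted by GCC and Clang. The struct name `aligned3` and the helper `aligned3_stride` are hypothetical. Raising the alignment of a 3-byte payload to 4 also rounds its size up to 4, so the fourth byte is padding and array elements stay aligned.

```c
#include <assert.h>
#include <stddef.h>

/* Three bytes of payload with the alignment raised to 4 (GNU attribute). */
struct aligned3 { char c[3]; } __attribute__((aligned(4)));

/* The size is rounded up to a multiple of the alignment. */
static_assert(sizeof(struct aligned3) % _Alignof(struct aligned3) == 0,
              "size must be a multiple of alignment");

/* Distance in bytes between consecutive array elements. */
static size_t aligned3_stride(void)
{
    struct aligned3 arr[2];
    return (size_t)((char *)&arr[1] - (char *)&arr[0]);
}
```

With these definitions `sizeof(struct aligned3)` is 4, not 3, which is why every element of an array of such objects remains 4-aligned.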


Re: GCC libatomic ABI specification draft

2017-01-19 Thread Torvald Riegel
On Thu, 2016-12-22 at 11:37 -0600, Segher Boessenkool wrote:
> On Thu, Dec 22, 2016 at 03:28:56PM +0100, Ulrich Weigand wrote:
> > However, there still seems to be a problem, but this time related to
> > alignment issues.  We do have the 16-byte atomic instructions, but they
> > only work on 16-byte aligned data.  This is a problem in particular
> > since the default alignment of 16-byte data types is still 8 bytes
> > on our platform (since the ABI only guarantees 8-byte stack alignment).
> > 
> > That's why the libatomic configure check thinks it cannot use the
> > atomic instructions when building on z, and generates code that uses
> > the separate lock.  However, *if* a particular object can be proven
> > by the compiler to be 16-byte aligned, it will emit the inline
> > atomic instruction.  This means there is indeed a bug if that same
> > object is also operated on via the library routine.
> > 
> > Andreas suggested that the best way to fix this would be to add a
> > runtime alignment check to the libatomic routines and also use the
> > atomic instructions in the library whenever the object actually
> > happens to be correctly aligned.  It seems that this should indeed
> > fix the problem (and also use the most efficient way in all cases).
> > 
> > 
> > Not sure about Power -- adding David and Segher on CC ...
> 
> > We do not always have all atomic instructions.  Not all processors have
> > all of them, and which ones are used depends on the compiler flags.  How would
> libatomic know what compiler flags are used to compile the program it is
> linked to?

I think the approach would be to require the user to always use a
suitably built libatomic that's at least as capable as the code that
will use it (e.g., see Richard Henderson's comments).  Thus, if the
program uses some flags to enable a certain set of HW instructions, the
program also should use a libatomic that is built with the same (or
stronger) flags.  That keeps old code working, and new code that uses
the HW instructions directly can interoperate with old code that still
calls libatomic.

If we find consensus to follow this approach, this requirement on
libatomic builds should be made explicit in the ABI spec.



Re: GCC libatomic ABI specification draft

2017-01-19 Thread Torvald Riegel
On Wed, 2017-01-18 at 14:23 -0800, Richard Henderson wrote:
> On 01/17/2017 09:00 AM, Torvald Riegel wrote:
> > I think the ABI should set a baseline for each architecture, and the
> > baseline decides whether something is inlinable or not.  Thus, the
> > x86_64 ABI would make __int128 operations not inlinable (because of the
> > issues with cmpxchg16b, see above).
> >
> > If users want to use capabilities beyond the baseline, they can choose
> > to use flags that alter/extend the ABI.  For example, if they use a flag
> > that explicitly enables the use of cmpxchg16b for atomics, they also
> > need to use a libatomic implementation built in the same way (if
> > possible).  This then creates a new ABI(-variant), basically.
> 
> Yes.  Other examples here are power7/power8 and armv6/armv7.
> 
> In both cases, the architecture added double-word load(-locked) and 
> store(-conditional) instructions.  In order for us to use these new 
> instructions inline, libatomic must be updated to use them as well.
> 
> The general principle, in my opinion, is that extensions to the ISA should 
> require that libatomic either be re-built, or perform runtime detection in 
> order to select the internal algorithm used.

That sounds okay to me.  I think we would have to make that clear in
the ABI specification though, because this also includes requirements
for the user of the ABI (eg, if you compile for power8, you need to use
a suitably built libatomic) and for distributions.

> In the case of arm, distributions normally either (1) build for a specific
> cpu revision, (2) build for old-arm + soft-fpu, (3) build for armv7 +
> hard-fpu.  So most distributions would not actually require a runtime check
> for arm.
> 
> In the case of power, I assume it's possible to run ppc64 on power8, but
> every power8 system to which I have access has ppc64le deployed.  Certainly
> ppc64le would not need a runtime check, but it would seem prudent for ppc64
> to gain a runtime check for the power8 insns.

OK.  I think it would be good if ARM/Power people could contribute to
the ABI specification and extend it to also cover ARM/Power.

> > I've made a few tests on my x86_64 machine a few weeks ago, and I didn't
> > see cmpxchg16b being used.  IIRC, I also looked at libatomic and didn't
> > see it (but I don't remember for sure).  Either way, if I should have
> > been wrong, and we are using cmpxchg16b for loads, this should be fixed.
> > Ideally, this should be fixed before the stage 3 deadline this Friday.
> > Such a fix might potentially break existing uses, but the earlier we fix
> > this, the better.
> 
> You needed to use -mcx16, or any other option (such as -march=native) that 
> implies that.  And, you will find that expand_atomic_load does have a 
> larger-than-word-size fallback path that does use 
> expand_atomic_compare_and_swap.
> 
> So, yes, there's something here that needs adjustment.

I'll send a separate email describing the options I see currently.

> > Section 3 Rationale, alternative 1: I'm wondering if the example is
> > correct.  For a 4-byte-aligned type of size 3, the implementation cannot
> > simply use 4-byte hardware-backed atomics because this will inevitably
> > touch the 4th byte I think, and the implementation can't know whether
> > this is padding or not.  Or do we expect that things like packed structs
> > are disallowed?
> 
> If we atomically store an unchanged value into the 4th byte, can we tell?

Probably not in terms of the value.  But race detectors, HW breakpoints
etc. could observe the store.  I'm not sure whether potentially having
to adapt these is justified by being able to optimize atomic access to
3-byte structs...

> > N3.1:  Why do you assume that 8-byte HW atomics are available on i386?
> > Because cmpxchg8b is available for CPUs that are the lowest i?86 we
> > still intend to support?
> 
> For various definitions of "we", I suppose.  Red Hat certainly does not
> support anything lower than i686, which does have cmpxchg8b.
> 
> I suspect that the GNU project still supports i486.  I do know that glibc has 
> dropped support for i386.
> 
> I should note that supporting 64-bit atomics on i686 *is* possible, without
> the CAS problem that you describe for cmpxchg16b, because we *are* guaranteed
> that the FPU supports a 64-bit atomic load/store.  And we do already handle
> this; see the atomic_loaddi_fpu and atomic_storedi_fpu patterns.
> 
> I'll also note that, as per above, this implies that if we build for i586-*,
> libatomic should provide runtime paths that detect and use i686 insns, so
> that the library is compatible with what the compiler will generate inline
> given appropriate command-line options.

OK.  So these rules should be added to the ABI spec too, I suppose.



Re: GCC libatomic ABI specification draft

2017-01-18 Thread Richard Henderson

On 01/17/2017 09:00 AM, Torvald Riegel wrote:

> I think the ABI should set a baseline for each architecture, and the
> baseline decides whether something is inlinable or not.  Thus, the
> x86_64 ABI would make __int128 operations not inlinable (because of the
> issues with cmpxchg16b, see above).
>
> If users want to use capabilities beyond the baseline, they can choose
> to use flags that alter/extend the ABI.  For example, if they use a flag
> that explicitly enables the use of cmpxchg16b for atomics, they also
> need to use a libatomic implementation built in the same way (if
> possible).  This then creates a new ABI(-variant), basically.


Yes.  Other examples here are power7/power8 and armv6/armv7.

In both cases, the architecture added double-word load(-locked) and 
store(-conditional) instructions.  In order for us to use these new 
instructions inline, libatomic must be updated to use them as well.


The general principle, in my opinion, is that extensions to the ISA should 
require that libatomic either be re-built, or perform runtime detection in 
order to select the internal algorithm used.


In the case of arm, distributions normally either (1) build for a specific cpu 
revision, (2) build for old-arm + soft-fpu, (3) build for armv7 + hard-fpu.  So 
most distributions would not actually require a runtime check for arm.


In the case of power, I assume it's possible to run ppc64 on power8, but every 
power8 system to which I have access has ppc64le deployed.  Certainly ppc64le 
would not need a runtime check, but it would seem prudent for ppc64 to gain a 
runtime check for the power8 insns.



> I've made a few tests on my x86_64 machine a few weeks ago, and I didn't
> see cmpxchg16b being used.  IIRC, I also looked at libatomic and didn't
> see it (but I don't remember for sure).  Either way, if I should have
> been wrong, and we are using cmpxchg16b for loads, this should be fixed.
> Ideally, this should be fixed before the stage 3 deadline this Friday.
> Such a fix might potentially break existing uses, but the earlier we fix
> this, the better.


You needed to use -mcx16, or any other option (such as -march=native) that 
implies that.  And, you will find that expand_atomic_load does have a 
larger-than-word-size fallback path that does use expand_atomic_compare_and_swap.


So, yes, there's something here that needs adjustment.


> Section 3 Rationale, alternative 1: I'm wondering if the example is
> correct.  For a 4-byte-aligned type of size 3, the implementation cannot
> simply use 4-byte hardware-backed atomics because this will inevitably
> touch the 4th byte I think, and the implementation can't know whether
> this is padding or not.  Or do we expect that things like packed structs
> are disallowed?


If we atomically store an unchanged value into the 4th byte, can we tell?


> N3.1:  Why do you assume that 8-byte HW atomics are available on i386?
> Because cmpxchg8b is available for CPUs that are the lowest i?86 we
> still intend to support?


For various definitions of "we", I suppose.  Red Hat certainly does not support 
anything lower than i686, which does have cmpxchg8b.


I suspect that the GNU project still supports i486.  I do know that glibc has 
dropped support for i386.


I should note that supporting 64-bit atomics on i686 *is* possible, without the 
CAS problem that you describe for cmpxchg16b, because we *are* guaranteed that 
the FPU supports a 64-bit atomic load/store.  And we do already handle this; 
see the atomic_loaddi_fpu and atomic_storedi_fpu patterns.


I'll also note that, as per above, this implies that if we build for i586-*, 
libatomic should provide runtime paths that detect and use i686 insns, so that 
the library is compatible with what the compiler will generate inline given 
appropriate command-line options.



r~


Re: GCC libatomic ABI specification draft

2017-01-17 Thread Torvald Riegel
On Thu, 2016-11-17 at 12:12 -0800, Bin Fan wrote:
> On 11/14/2016 4:34 PM, Bin Fan wrote:
> > Hi All,
> >
> > I have an updated version of libatomic ABI specification draft. Please 
> > take a look to see if it matches GCC implementation. The purpose of 
> > this document is to establish an official GCC libatomic ABI, and allow 
> > compatible compiler and runtime implementations on the affected 
> > platforms.

Thanks for the update, and sorry for the late reply.  Comments below.

> > - Rewrite section 3 to replace "lock-free" operations with "hardware 
> > backed" instructions. The digest of this section is: 1) inlineable 
> > atomics must be implemented with the hardware backed atomic 
> > instructions. 2) for non-inlineable atomics, the compiler must 
> > generate a runtime call, and the runtime support function is free to 
> > use any implementation.

OK.

I still think that using hardware-backed instructions for a particular
type requires that there is a true atomic load instruction for that
type.  Emulating a load with an idempotent store (eg, cmpxchg16b) is not
useful, overall.

One could argue that an idempotent atomic HW store such as a cmpxchg16b
in a loop is indeed lock-free.  However, IMO the intention behind
"lock-free" atomics in C and C++ is to offer atomics that are both
lock-free *and* as fast as one would assume for a fully HW-backed
solution for atomic accesses.  This includes that loads must be cheaper
than stores, in particular under contention / concurrent accesses by
several threads.
I believe that "fast" is much more often part of the motivation for
using lock-free atomics than the actual "lock-free", that is, the
progress-guarantee aspect (which isn't even lock-free but
obstruction-free, see below).  If we do see a sufficiently strong need
for lock-free atomics, we should build something just for that (eg,
if removing the address-free requirement, we can support lock-free (in
the progress-guarantee sense) operations for a lot more types).

Also, while that previous issue is "just" a performance issue, the fact
that we could issue a store when calling to atomic_load() is a
correctness issue, I think.
One example is volatile atomic loads; while C/C++ don't really
constrain what a volatile load needs to be in the underlying
implementation, I think most users would assume that a load really means
a hardware load instruction of some sort, and nothing else.  cmpxchg16b
conflicts with such an assumption.
Another example is read-only mapped memory.

Bottom line: we shouldn't rely solely on cmpxchg16b and similar.
(Though this doesn't necessarily mean that there can't be compiler flags
that enable its use.)
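
A minimal sketch of the concern, assuming GCC/Clang `__atomic` builtins and using a 64-bit value so that it runs on targets without a 16-byte CAS. The function name `load_via_cas` is hypothetical and this is an illustration of the technique, not GCC's actual expansion. The "load" is a compare-and-swap of 0 with 0: it either performs an idempotent store or fails and reports the current value.

```c
#include <stdint.h>

/* Emulate an atomic load with a compare-and-swap.  If *p is 0, the CAS
   succeeds and stores 0 again; otherwise it fails and copies the current
   value into `expected`.  Either way the current value is returned, but
   the operation is a read-modify-write, not a plain load. */
static uint64_t load_via_cas(uint64_t *p)
{
    uint64_t expected = 0;
    __atomic_compare_exchange_n(p, &expected, 0, 0,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;
}
```

Note that the parameter cannot be a pointer to const: the operation needs write access to the location, which is what rules it out for read-only mappings. On x86 the locked compare-exchange needs that access even when the comparison fails.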


I think the ABI should set a baseline for each architecture, and the
baseline decides whether something is inlinable or not.  Thus, the
x86_64 ABI would make __int128 operations not inlinable (because of the
issues with cmpxchg16b, see above).

If users want to use capabilities beyond the baseline, they can choose
to use flags that alter/extend the ABI.  For example, if they use a flag
that explicitly enables the use of cmpxchg16b for atomics, they also
need to use a libatomic implementation built in the same way (if
possible).  This then creates a new ABI(-variant), basically.
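
As a rough way to see what a given compiler and flag set treats as inlinable, the GCC/Clang builtin `__atomic_always_lock_free` can be queried at compile time. This is only a sketch: the helper names are hypothetical, and "always lock-free" is assumed here to approximate the draft's notion of "inlinable", which the ABI text may define differently.

```c
#include <stdbool.h>

/* The size argument must be a compile-time constant; a null object pointer
   asks about a typically aligned object of that size. */
static bool inlinable_4(void)  { return __atomic_always_lock_free(4, 0); }
static bool inlinable_8(void)  { return __atomic_always_lock_free(8, 0); }
static bool inlinable_16(void) { return __atomic_always_lock_free(16, 0); }
```

The answer for 16 bytes depends on the target, the compiler version and flags such as -mcx16, which is the kind of ABI variation discussed above.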


I've made a few tests on my x86_64 machine a few weeks ago, and I didn't
see cmpxchg16b being used.  IIRC, I also looked at libatomic and didn't
see it (but I don't remember for sure).  Either way, if I should have
been wrong, and we are using cmpxchg16b for loads, this should be fixed.
Ideally, this should be fixed before the stage 3 deadline this Friday.
Such a fix might potentially break existing uses, but the earlier we fix
this, the better.


Section 3 Rationale, alternative 1: I'm wondering if the example is
correct.  For a 4-byte-aligned type of size 3, the implementation cannot
simply use 4-byte hardware-backed atomics because this will inevitably
touch the 4th byte I think, and the implementation can't know whether
this is padding or not.  Or do we expect that things like packed structs
are disallowed?

N3.1:  Why do you assume that 8-byte HW atomics are available on i386?
Because cmpxchg8b is available for CPUs that are the lowest i?86 we
still intend to support?

I'd also use "hardware-backed" instead of "hardware backed".

> > - The Rationale section in section 3 is also revised to remove the 
> > mention of "lock-free", but there is no major change of concept.
> >
> > - Add note N3.1 to emphasize the assumption of general hardware 
> > supported atomic instruction
> >
> > - Add note N3.2 to discuss the issues of cmpxchg16b

See above.

> > - Add a paragraph in section 4.1 to specify memory_order_consume must 
> > be implemented through memory_order_acquire. Section 4.2 emphasizes it 
> > again.
> >
> > - The specification of each runtime function mostly maps to the 
> > corresponding generic functions in the C11 standard. Two functions are 
> > worth noting:
> > 1) C11 atomic_compare_exchange 

Re: GCC libatomic ABI specification draft

2017-01-04 Thread Szabolcs Nagy
On 22/12/16 17:37, Segher Boessenkool wrote:
> We do not always have all atomic instructions.  Not all processors have
> all of them, and which ones are used depends on the compiler flags.  How would
> libatomic know what compiler flags are used to compile the program it is
> linked to?
> 
> Sounds like a job for multilibs?

x86_64 uses ifunc dispatch to always use atomic
instructions if available (which is bad because
ifunc is not supported on all platforms).

either such runtime feature detection and dispatch
is needed in libatomic or different abis have to
be supported (with the usual hassle).



Re: GCC libatomic ABI specification draft

2016-12-22 Thread Segher Boessenkool
On Thu, Dec 22, 2016 at 03:28:56PM +0100, Ulrich Weigand wrote:
> However, there still seems to be a problem, but this time related to
> alignment issues.  We do have the 16-byte atomic instructions, but they
> only work on 16-byte aligned data.  This is a problem in particular
> since the default alignment of 16-byte data types is still 8 bytes
> on our platform (since the ABI only guarantees 8-byte stack alignment).
> 
> That's why the libatomic configure check thinks it cannot use the
> atomic instructions when building on z, and generates code that uses
> the separate lock.  However, *if* a particular object can be proven
> by the compiler to be 16-byte aligned, it will emit the inline
> atomic instruction.  This means there is indeed a bug if that same
> object is also operated on via the library routine.
> 
> Andreas suggested that the best way to fix this would be to add a
> runtime alignment check to the libatomic routines and also use the
> atomic instructions in the library whenever the object actually
> happens to be correctly aligned.  It seems that this should indeed
> fix the problem (and also use the most efficient way in all cases).
> 
> 
> Not sure about Power -- adding David and Segher on CC ...

We do not always have all atomic instructions.  Not all processors have
all of them, and which ones are used depends on the compiler flags.  How would
libatomic know what compiler flags are used to compile the program it is
linked to?

Sounds like a job for multilibs?


Segher


Re: GCC libatomic ABI specification draft

2016-12-22 Thread Ulrich Weigand
Szabolcs Nagy wrote:
> On 20/12/16 13:26, Ulrich Weigand wrote:
> > I may have missed the context of the discussion, but just on the
> > specific ISA question here: both Power and z not only have the
> > 16-byte CAS (or load-and-reserve/store-conditional), but they also both
> > have specific 16-byte atomic load and store instructions (lpq/stpq
> > on z, lq/stq on Power).
> > 
> > Those are available on any system supporting z/Architecture (z900 and up),
> > and on any Power system supporting the V2.07 ISA (POWER8 and up).  GCC
> > does in fact use those instructions to implement atomic operations on
> > 16-byte data types on those machines.
> 
> that's a bug.
> 
> at least i don't see how gcc makes sure the libatomic
> calls can interoperate with inlined atomics.

Hmm, interesting.  On z, there is no issue with ISA levels, since *all*
64-bit platforms support the 16-byte atomics (and on non-64-bit platforms,
16-byte data types are not supported at all).

However, there still seems to be a problem, but this time related to
alignment issues.  We do have the 16-byte atomic instructions, but they
only work on 16-byte aligned data.  This is a problem in particular
since the default alignment of 16-byte data types is still 8 bytes
on our platform (since the ABI only guarantees 8-byte stack alignment).

That's why the libatomic configure check thinks it cannot use the
atomic instructions when building on z, and generates code that uses
the separate lock.  However, *if* a particular object can be proven
by the compiler to be 16-byte aligned, it will emit the inline
atomic instruction.  This means there is indeed a bug if that same
object is also operated on via the library routine.

Andreas suggested that the best way to fix this would be to add a
runtime alignment check to the libatomic routines and also use the
atomic instructions in the library whenever the object actually
happens to be correctly aligned.  It seems that this should indeed
fix the problem (and also use the most efficient way in all cases).


Not sure about Power -- adding David and Segher on CC ...


Bye,
Ulrich

-- 
  Dr. Ulrich Weigand
  GNU/Linux compilers and toolchain
  ulrich.weig...@de.ibm.com
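
A minimal sketch of the runtime alignment check suggested above. All names (`aligned_16`, `choose_path_16`, the `path` enum) are hypothetical, and the sketch only shows the dispatch decision, not the actual 16-byte instructions or libatomic's lock implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Which implementation a hypothetical library routine would pick. */
enum path { PATH_HW_INSN, PATH_LOCK };

/* True when `p` is aligned to 16 bytes. */
static bool aligned_16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Use the hardware instruction only when the object really is 16-byte
   aligned; otherwise fall back to the separate lock. */
static enum path choose_path_16(const void *obj)
{
    return aligned_16(obj) ? PATH_HW_INSN : PATH_LOCK;
}
```

Because alignment is a property of the object's address, every access to a given object takes the same path, so a library call on an aligned object would use the same instructions as the inlined code.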



Re: GCC libatomic ABI specification draft

2016-12-20 Thread Szabolcs Nagy
On 20/12/16 13:26, Ulrich Weigand wrote:
> Torvald Riegel wrote:
>> On Fri, 2016-12-02 at 12:13 +0100, Gabriel Paubert wrote:
>>> On Thu, Dec 01, 2016 at 11:13:37AM -0800, Bin Fan at Work wrote:
>>>> Thanks for the comment. Yes, the ABI requires that libatomic query the
>>>> hardware. This is necessary if we want the compiler to generate inlined
>>>> code for 16-byte atomics. Note that this particular issue only affects
>>>> x86.
>>>
>>> Why? Power (at least recent ones) has 128 bit atomic instructions
>>> (lqarx/stqcx.) and Z has 128 bit compare and swap. 
>>
>> That's not the only factor affecting whether cmpxchg16b or such is used
>> for atomics.  If the HW just offers a wide CAS but no wide atomic load,
>> then even an atomic load is not truly just a load, which breaks (1)
>> atomic loads on read-only mapped memory and (2) volatile atomic loads
>> (unless we claim that an idempotent store is like a load, which is quite
>> a stretch for volatile I think).
> 
> I may have missed the context of the discussion, but just on the
> specific ISA question here: both Power and z not only have the
> 16-byte CAS (or load-and-reserve/store-conditional), but they also both
> have specific 16-byte atomic load and store instructions (lpq/stpq
> on z, lq/stq on Power).
> 
> Those are available on any system supporting z/Architecture (z900 and up),
> and on any Power system supporting the V2.07 ISA (POWER8 and up).  GCC
> does in fact use those instructions to implement atomic operations on
> 16-byte data types on those machines.

that's a bug.

at least i don't see how gcc makes sure the libatomic
calls can interoperate with inlined atomics.



Re: GCC libatomic ABI specification draft

2016-12-20 Thread Ulrich Weigand
Torvald Riegel wrote:
> On Fri, 2016-12-02 at 12:13 +0100, Gabriel Paubert wrote:
> > On Thu, Dec 01, 2016 at 11:13:37AM -0800, Bin Fan at Work wrote:
> > > Thanks for the comment. Yes, the ABI requires that libatomic query the
> > > hardware. This is necessary if we want the compiler to generate inlined
> > > code for 16-byte atomics. Note that this particular issue only affects
> > > x86.
> > 
> > Why? Power (at least recent ones) has 128 bit atomic instructions
> > (lqarx/stqcx.) and Z has 128 bit compare and swap. 
> 
> That's not the only factor affecting whether cmpxchg16b or such is used
> for atomics.  If the HW just offers a wide CAS but no wide atomic load,
> then even an atomic load is not truly just a load, which breaks (1)
> atomic loads on read-only mapped memory and (2) volatile atomic loads
> (unless we claim that an idempotent store is like a load, which is quite
> a stretch for volatile I think).

I may have missed the context of the discussion, but just on the
specific ISA question here: both Power and z not only have the
16-byte CAS (or load-and-reserve/store-conditional), but they also both
have specific 16-byte atomic load and store instructions (lpq/stpq
on z, lq/stq on Power).

Those are available on any system supporting z/Architecture (z900 and up),
and on any Power system supporting the V2.07 ISA (POWER8 and up).  GCC
does in fact use those instructions to implement atomic operations on
16-byte data types on those machines.

Bye,
Ulrich

-- 
  Dr. Ulrich Weigand
  GNU/Linux compilers and toolchain
  ulrich.weig...@de.ibm.com



Re: GCC libatomic ABI specification draft

2016-12-19 Thread Torvald Riegel
On Fri, 2016-12-02 at 12:13 +0100, Gabriel Paubert wrote:
> On Thu, Dec 01, 2016 at 11:13:37AM -0800, Bin Fan at Work wrote:
> > Hi Szabolcs,
> > 
> > > On Nov 29, 2016, at 3:11 AM, Szabolcs Nagy  wrote:
> > > 
> > > On 17/11/16 20:12, Bin Fan wrote:
> > >> 
> > > >> Although this ABI specification specifies that 16-byte properly
> > > >> aligned atomics are inlineable on platforms supporting cmpxchg16b, we
> > > >> document the caveats here for further discussion. If we decide to
> > > >> change the inlineable attribute for those atomics, then this ABI, the
> > > >> compiler and the runtime implementation should be updated together at
> > > >> the same time.
> > > >> 
> > > >> 
> > > >> The compiler and runtime need to check the availability of cmpxchg16b
> > > >> to implement this ABI specification. Here is how it would work: The
> > > >> compiler can get the information either from the compiler flags or by
> > > >> inquiring the hardware capabilities. When the information is not
> > > >> available, the compiler should assume that cmpxchg16b instruction is
> > > >> not supported. The runtime library implementation can also query the
> > > >> hardware compatibility and choose the implementation at runtime.
> > > >> Assuming the user provides correct compiler options
> > > 
> > > with this abi the runtime implementation *must* query the hardware
> > > (because there might be inlined cmpxchg16b in use in another module
> > > on a hardware that supports it and the runtime must be able to sync
> > > with it).
> > 
> > Thanks for the comment. Yes, the ABI requires that libatomic query the
> > hardware. This is necessary if we want the compiler to generate inlined
> > code for 16-byte atomics. Note that this particular issue only affects
> > x86.
> 
> Why? Power (at least recent ones) has 128 bit atomic instructions
> (lqarx/stqcx.) and Z has 128 bit compare and swap. 

That's not the only factor affecting whether cmpxchg16b or such is used
for atomics.  If the HW just offers a wide CAS but no wide atomic load,
then even an atomic load is not truly just a load, which breaks (1)
atomic loads on read-only mapped memory and (2) volatile atomic loads
(unless we claim that an idempotent store is like a load, which is quite
a stretch for volatile I think).




Re: GCC libatomic ABI specification draft

2016-12-02 Thread Gabriel Paubert
On Thu, Dec 01, 2016 at 11:13:37AM -0800, Bin Fan at Work wrote:
> Hi Szabolcs,
> 
> > On Nov 29, 2016, at 3:11 AM, Szabolcs Nagy  wrote:
> > 
> > On 17/11/16 20:12, Bin Fan wrote:
> >> 
> >> Although this ABI specification specifies that 16-byte properly
> >> aligned atomics are inlineable on platforms supporting cmpxchg16b, we
> >> document the caveats here for further discussion. If we decide to
> >> change the inlineable attribute for those atomics, then this ABI, the
> >> compiler and the runtime implementation should be updated together at
> >> the same time.
> >> 
> >> 
> >> The compiler and runtime need to check the availability of cmpxchg16b
> >> to implement this ABI specification. Here is how it would work: The
> >> compiler can get the information either from the compiler flags or by
> >> inquiring the hardware capabilities. When the information is not
> >> available, the compiler should assume that cmpxchg16b instruction is
> >> not supported. The runtime library implementation can also query the
> >> hardware compatibility and choose the implementation at runtime.
> >> Assuming the user provides correct compiler options
> > 
> > with this abi the runtime implementation *must* query the hardware
> > (because there might be inlined cmpxchg16b in use in another module
> > on a hardware that supports it and the runtime must be able to sync
> > with it).
> 
> Thanks for the comment. Yes, the ABI requires that libatomic query the
> hardware. This is necessary if we want the compiler to generate inlined
> code for 16-byte atomics. Note that this particular issue only affects
> x86.

Why? Power (at least recent ones) has 128 bit atomic instructions
(lqarx/stqcx.) and Z has 128 bit compare and swap. 

Gabriel


Re: GCC libatomic ABI specification draft

2016-12-01 Thread Bin Fan at Work
Hi Szabolcs,

> On Nov 29, 2016, at 3:11 AM, Szabolcs Nagy wrote:
> 
> On 17/11/16 20:12, Bin Fan wrote:
>> 
>> Although this ABI specification specifies that 16-byte properly aligned 
>> atomics are inlineable on platforms
>> supporting cmpxchg16b, we document the caveats here for further discussion. 
>> If we decide to change the
>> inlineable attribute for those atomics, then this ABI, the compiler and the 
>> runtime implementation should be
>> updated together at the same time.
>> 
>> 
>> The compiler and runtime need to check the availability of cmpxchg16b to 
>> implement this ABI specification.
>> Here is how it would work: The compiler can get the information either from 
>> the compiler flags or by
>> inquiring the hardware capabilities. When the information is not available, 
>> the compiler should assume that
>> cmpxchg16b instruction is not supported. The runtime library implementation 
>> can also query the hardware
>> compatibility and choose the implementation at runtime. Assuming the user 
>> provides correct compiler options
> 
> with this abi the runtime implementation *must* query the hardware
> (because there might be inlined cmpxchg16b in use in another module
> on a hardware that supports it and the runtime must be able to sync
> with it).

Thanks for the comment. Yes, the ABI requires that libatomic query the
hardware. This is necessary if we want the compiler to generate inlined code
for 16-byte atomics. Note that this particular issue only affects x86. I notice
that GCC already has a few builtins declared in cpuid.h. The functions are x86
specific. So couldn’t the query be done by those functions?
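
A minimal sketch of such a query, for illustration only. It assumes the
__get_cpuid helper from GCC's cpuid.h and the CPUID convention that leaf 1
reports CMPXCHG16B support in ECX bit 13. The function name
cpu_has_cmpxchg16b and the macro CX16_ECX_BIT are hypothetical, not something
libatomic defines, and on non-x86 targets the sketch simply reports "not
supported".

```c
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang header that provides __get_cpuid */
#endif

/* CPUID leaf 1, ECX bit 13 reports CMPXCHG16B support (hypothetical name). */
#define CX16_ECX_BIT (1u << 13)

/* Hypothetical helper: returns true only if the running CPU reports
   support for cmpxchg16b. */
static bool cpu_has_cmpxchg16b(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;

    /* __get_cpuid returns 0 if the requested leaf is unsupported. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return (ecx & CX16_ECX_BIT) != 0;
#else
    /* Not an x86 target: assume the instruction is unavailable. */
    return false;
#endif
}
```

A runtime library could call something like this once at startup and cache
the answer, which matches the draft's rule of assuming no cmpxchg16b when the
information is not available.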

> 
> currently gcc libatomic does not guarantee this which is dangerously
> broken: if gcc is configured with --disable-gnu-indirect-function
> (or on targets without ifunc support: solaris, bsd, android, musl,..)
> the compiler may inline cmpxchg16b in one translation unit but use
> incompatible runtime function in another.
> 
> there is PR 70191 but this issue has wider scope.

This issue was actually found by us while we were working on the ABI draft. So
we filed the bug and we think it should be fixed.

Compiler inlining of 16-byte atomics has other issues as noted in the ABI draft.
So the alternative is to stop inlining those atomics, but that would need a
compiler fix.

Thanks,
- Bin

> 
>> and the inquiry returns the correct information, on a platform that supports 
>> cmpxchg16b, the code generated
>> by the compiler will both use cmpxchg16b; on a platform that does not 
>> support cmpxchg16b, the code generated
>> by the compiler, including the code generated for a generic platform, always 
>> call the support function, so
>> there is no compatibility problem.
> 



Re: GCC libatomic ABI specification draft

2016-11-29 Thread Szabolcs Nagy
On 17/11/16 20:12, Bin Fan wrote:
> 
> Although this ABI specification specifies that 16-byte properly aligned 
> atomics are inlineable on platforms
> supporting cmpxchg16b, we document the caveats here for further discussion. 
> If we decide to change the
> inlineable attribute for those atomics, then this ABI, the compiler and the 
> runtime implementation should be
> updated together at the same time.
> 
> 
> The compiler and runtime need to check the availability of cmpxchg16b to 
> implement this ABI specification.
> Here is how it would work: The compiler can get the information either from 
> the compiler flags or by
> inquiring the hardware capabilities. When the information is not available, 
> the compiler should assume that
> cmpxchg16b instruction is not supported. The runtime library implementation 
> can also query the hardware
> compatibility and choose the implementation at runtime. Assuming the user 
> provides correct compiler options

with this abi the runtime implementation *must* query the hardware
(because there might be inlined cmpxchg16b in use in another module
on a hardware that supports it and the runtime must be able to sync
with it).

currently gcc libatomic does not guarantee this which is dangerously
broken: if gcc is configured with --disable-gnu-indirect-function
(or on targets without ifunc support: solaris, bsd, android, musl,..)
the compiler may inline cmpxchg16b in one translation unit but use
incompatible runtime function in another.

there is PR 70191 but this issue has wider scope.

> and the inquiry returns the correct information, on a platform that supports 
> cmpxchg16b, the code generated
> by the compiler will both use cmpxchg16b; on a platform that does not support 
> cmpxchg16b, the code generated
> by the compiler, including the code generated for a generic platform, always 
> call the support function, so
> there is no compatibility problem.
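
The compatibility argument in this exchange can be modelled as a small
decision function. This is only a toy model, not libatomic code: the names
impl, runtime_impl, module_impl and modules_compatible are hypothetical, and
it assumes, as the draft does, that a module is built with cmpxchg16b enabled
only when the hardware really supports it.

```c
#include <stdbool.h>

/* Which mechanism ends up implementing a 16-byte atomic operation. */
enum impl { IMPL_CMPXCHG16B, IMPL_LOCK };

/* The runtime uses cmpxchg16b only if it queries the hardware and the
   hardware supports it; otherwise it falls back to a lock. */
static enum impl runtime_impl(bool runtime_queries_hw, bool hw_has_cx16)
{
    return (runtime_queries_hw && hw_has_cx16) ? IMPL_CMPXCHG16B : IMPL_LOCK;
}

/* A module built with cmpxchg16b enabled inlines the instruction; a
   generic module always calls the runtime support function. */
static enum impl module_impl(bool built_with_cx16,
                             bool runtime_queries_hw, bool hw_has_cx16)
{
    return built_with_cx16 ? IMPL_CMPXCHG16B
                           : runtime_impl(runtime_queries_hw, hw_has_cx16);
}

/* Two modules sharing an object can synchronize only if both end up
   using the same mechanism. */
static bool modules_compatible(bool a_cx16, bool b_cx16,
                               bool runtime_queries_hw, bool hw_has_cx16)
{
    return module_impl(a_cx16, runtime_queries_hw, hw_has_cx16)
        == module_impl(b_cx16, runtime_queries_hw, hw_has_cx16);
}
```

In this model a module with inlined cmpxchg16b and a generic module agree
only when the runtime queries the hardware, which is the point made above
about builds without ifunc support.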