[Bug target/114576] [14 regression] VEX-prefixed AES instruction without AVX enabled

2024-04-03 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114576

--- Comment #4 from Thiago Macieira  ---
(In reply to Jakub Jelinek from comment #3)
> vaesenc etc. instructions can be used even if just -maes -mavx, not just
> -mvaes -mavx512vl.

Correct, that's just VEX-prefixed AESNI instructions.

VAES added the 256-bit and 512-bit versions of those instructions. The table at
felix's website is accurate: https://www.felixcloutier.com/x86/aesenc

This is actually similar to GFNI:
* GFNI: 128-bit only, non-VEX, non-EVEX
* GFNI+AVX: VEX allowed, 128- and 256-bit; no EVEX
* GFNI+AVX512F: 128- and 256-bit with VEX, 512-bit with EVEX
* GFNI+AVX512VL: 128- and 256-bit with VEX, all with EVEX
* GFNI+AVX10 without EVEX512: 128- and 256-bit with VEX and EVEX, no 512-bit

The F-no-VL case does not exist in practice.

> But, it is especially messy because -mvaes doesn't imply -maes, so IMHO if
> somebody e.g. asks for -mvaes -mavx512vl -mno-aes and the insns don't use
> any xmm16+ register, it would emit the insn using VEX encoding rather than
> EVEX, so I think we need to use {evex} prefixes.

Would it be simpler to just imply that VAES includes AESNI? There are no
processors that have VAES without AESNI and it doesn't make sense for there to
be one.

[Bug target/114576] New: [13 regression][config/i386] GCC 14/trunk emits VEX-prefixed AES instruction without AVX enabled

2024-04-03 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114576

Bug ID: 114576
   Summary: [13 regression][config/i386] GCC 14/trunk emits
VEX-prefixed AES instruction without AVX enabled
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

Re: https://bugreports.qt.io/browse/QTBUG-123965
Re: https://bugzilla.redhat.com/show_bug.cgi?id=2262640,
https://bugzilla.redhat.com/show_bug.cgi?id=2272758
Godbolt link: https://gcc.godbolt.org/z/6P9fMvoxW

Found while compiling Qt 6.6 or 6.7 with GCC 14 (current trunk). This is a
regression from GCC 13.

This function from qhash.cpp
<https://github.com/qt/qtbase/blob/v6.7.0/src/corelib/tools/qhash.cpp#L581-L588>:

Q_ALWAYS_INLINE __m128i AESHashSeed::state1() const
{
{
// unlike the Go code, we don't have more per-process seed
__m128i state1 = _mm_aesenc_si128(state0, mseed2);
return state1;
}
}

Is apparently getting assembled to:
.L2:
leaq(%rdi,%rsi), %rdx
vaesenc %xmm1, %xmm0, %xmm1

Though there's no AVX enabled in this code (the original version in Qt has some
AVX/VAES and AVX512 code but the reduced example does not).

This function:
// hash twice 16 bytes, running 2 scramble rounds of AES on itself
static void QT_FUNCTION_TARGET(AES) QT_VECTORCALL
hash2x16bytes(__m128i , __m128i , const __m128i *src0, const
__m128i *src1)
{
__m128i data0 = _mm_loadu_si128(src0);
__m128i data1 = _mm_loadu_si128(src1);
state0 = _mm_xor_si128(data0, state0);
state1 = _mm_xor_si128(data1, state1);
state0 = _mm_aesenc_si128(state0, state0);
state1 = _mm_aesenc_si128(state1, state1);
state0 = _mm_aesenc_si128(state0, state0);
state1 = _mm_aesenc_si128(state1, state1);
}

Is even emitting:
.L20:
movdqu  (%rax), %xmm2
pxor%xmm0, %xmm2
movdqu  -16(%rdx), %xmm0
pxor%xmm0, %xmm1
vaesenc %xmm2, %xmm2, %xmm0
aesenc  %xmm1, %xmm1
aesenc  %xmm0, %xmm0
aesenc  %xmm1, %xmm1

and that makes no sense to use AVX for one of four instructions alone, called
from the same source function.

For reference, GCC 13 generates respectively:

.L2:
movdqa  %xmm0, %xmm1
leaq(%rdi,%rsi), %rdx
aesenc  %xmm2, %xmm1
and
.L20:
movdqu  (%rax), %xmm2
pxor%xmm0, %xmm2
movdqu  -16(%rdx), %xmm0
aesenc  %xmm2, %xmm2
pxor%xmm0, %xmm1
movdqa  %xmm2, %xmm0
aesenc  %xmm1, %xmm1
aesenc  %xmm2, %xmm0
aesenc  %xmm1, %xmm1

You can tell that they are the same source block because the labels are the
same.

Sources:

#include 
#ifdef _MSC_VER
#  define Q_ALWAYS_INLINE __forceinline
#  define QT_VECTORCALL __vectorcall
#  define QT_FUNCTION_TARGET(x)
#else
#  define Q_ALWAYS_INLINE inline __attribute__((always_inline))
#  define QT_VECTORCALL
#  define QT_FUNCTION_TARGET(x) __attribute__((target(QT_FUNCTION_TARGET_##x)))
#  define QT_FUNCTION_TARGET_AES"sse4.2,aes"
//#  define qCpuHasFeature(x) __builtin_cpu_supports(QT_FUNCTION_TARGET_ ## x)
#endif
#define QT_COMPILER_SUPPORTS_HERE(x)true
#define mm_set1_epz _mm_set1_epi64x
#define mm_cvtsz_si128  _mm_cvtsi64_si128
#define mm_cvtsi128_sz  _mm_cvtsi128_si64
#define mm256_set1_epz  _mm256_set1_epi64x
extern bool qCpuHasFeature(const char *) noexcept;
#define qCpuHasFeature(x) qCpuHasFeature(#x)

using uchar = unsigned char;
using quintptr = unsigned long long;
using qint8 = signed char;

// hash 16 bytes, running 3 scramble rounds of AES on itself (like label
"final1")
static void Q_ALWAYS_INLINE QT_FUNCTION_TARGET(AES) QT_VECTORCALL
hash16bytes(__m128i , __m128i data)
{
state0 = _mm_xor_si128(state0, data);
state0 = _mm_aesenc_si128(state0, state0);
state0 = _mm_aesenc_si128(state0, state0);
state0 = _mm_aesenc_si128(state0, state0);
}

// hash twice 16 bytes, running 2 scramble rounds of AES on itself
static void QT_FUNCTION_TARGET(AES) QT_VECTORCALL
hash2x16bytes(__m128i , __m128i , const __m128i *src0, const
__m128i *src1)
{
__m128i data0 = _mm_loadu_si128(src0);
__m128i data1 = _mm_loadu_si128(src1);
state0 = _mm_xor_si128(data0, state0);
state1 = _mm_xor_si128(data1, state1);
state0 = _mm_aesenc_si128(state0, state0);
state1 = _mm_aesenc_si128(state1, state1);
state0 = _mm_aesenc_si128(state0, state0);
state1 = _mm_aesenc_si128(state1, state1);
}

struct AESHashSeed
{
__m128i state0;
__m128i mseed2;
AESHashSeed(size_t seed, size_t seed2) QT_FUNCTION_TARGET(AES);
__m128i state1() co

[Bug c/114088] Please provide __builtin_c16slen and __builtin_c32slen to complement __builtin_wcslenw

2024-02-24 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114088

--- Comment #3 from Thiago Macieira  ---
> But __builtin_strlen *does* get optimized when the input is a string literal. 
>  Not sure about wcslen though.

It appears not to, in the test above. std::char_trait::length() calls
wcslen() whereas the char specialisation uses __builtin_strlen() explicitly.
But if the intrinsics are enabled, the two would be the same, wouldn't they?

Anyway, in the absence of a library function to call, inserting the loop is
fine; it's what is there already.

Though it would be nice to be able to provide such a function. I wrote it for
Qt (it's called qustrlen). I would try with __builtin_constant_p first to see
if the string is a literal.

[Bug c/114088] New: Please provide __builtin_c16slen and __builtin_c32slen to complement __builtin_wcslenw

2024-02-24 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114088

Bug ID: 114088
   Summary: Please provide __builtin_c16slen and __builtin_c32slen
to complement __builtin_wcslenw
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

Actually, GCC doesn't have __builtin_wcslen, but Clang does. Providing these
extra two builtins would allow implementing __builtin_wcslen too. The names are
not part of the C standard, but follow the current naming construction rules
for it, similar to how "mbrtowc" and "wcslen" parallel.

My specific need is actually to implement char16_t string containers in C++.
I'm particularly interested in QString/QStringView, but this applies to
std::basic_string{_view} too.

For example:

std::string_view f1() { return "Hello"; }
std::wstring_view fw() { return L"Hello"; }
std::u16string_view f16() { return u"Hello"; }
std::u32string_view f32() { return U"Hello"; }

With GCC and libstdc++, the first function produces optimal code:
movl$5, %eax
leaq.LC0(%rip), %rdx
ret

For wchar_t case, GCC emits an out-of-line call to wcslen:
pushq   %rbx
leaq.LC2(%rip), %rbx
movq%rbx, %rdi
callwcslen@PLT
movq%rbx, %rdx
popq%rbx
ret

The next two, because of the absence of a C library function, emit a loop:
xorl%eax, %eax
leaq.LC1(%rip), %rcx
.L4:
incq%rax
cmpw$0, (%rcx,%rax,2)
jne .L4
movq%rcx, %rdx
ret

Clang, meanwhile, emits optimal code for all four and so did the pre-Clang
Intel compiler. See https://gcc.godbolt.org/z/qvj7qnYbz. MSVC emits optimal for
the char and wchar_t versions, but loops for the other two.

Clang gives up when the string gets longer, though. See
https://gcc.godbolt.org/z/54j3zr6e6. That indicates that it gave up on guessing
the loop run and would do better if the intrinsic were present.

[Bug target/113465] [mingw-w64] dllexported constexpr (inline) variables not automatically emitted

2024-02-03 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113465

--- Comment #6 from Thiago Macieira  ---
Mind if I ask you reconsider the decision for inline variables (which all
constexpr ones are)?

[Bug c++/54483] undefined reference to static constexpr in .so

2024-01-17 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54483

--- Comment #13 from Thiago Macieira  ---
(In reply to Andrew Pinski from comment #11)
> You still need:
> constexpr float A::val;

In C++11 mode, yes.

C++17 made all static constexpr data members implicitly inline, which change
the situation. Inline variables ought to be emitted on use and merged at
runtime.

This explanation does not change the resolution of this bug report. But if you
can update your code to use -std=c++17, gnu++17 or later, then the problem goes
away.

[Bug target/113465] [mingw-w64] dllexported constexpr (inline) variables not automatically emitted

2024-01-17 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113465

--- Comment #5 from Thiago Macieira  ---
> I don't think that's the same. That situation over there is C++11, where the
> constexpr variable is *not* static.

I meant not *inline*.

[Bug target/113465] [mingw-w64] dllexported constexpr (inline) variables not automatically emitted

2024-01-17 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113465

--- Comment #4 from Thiago Macieira  ---
(In reply to Andrew Pinski from comment #3)
> See PR 54483 .
> 
> *** This bug has been marked as a duplicate of bug 54483 ***

I don't think that's the same. That situation over there is C++11, where the
constexpr variable is *not* static.

I forgot to say that in my case it is inline because it's C++17.

On use, GCC emits a copy of the variable:
struct __declspec(dllimport) QLocale
{
static constexpr inline int FirstTwoDigitYear = 1900;
};

template void f(const T &);
void f() { f(QLocale::FirstTwoDigitYear); }

results in:

_Z1fv:
movq__imp__ZN7QLocale17FirstTwoDigitYearE(%rip), %rcx
jmp _Z1fIiEvRKT_
_ZN7QLocale17FirstTwoDigitYearE:
.long   1900

This copy is useless. The equivalent code for ELF and Mach-O ABIs is fine
because the relocation would find it, if the original variable doesn't exist in
the .so. But on Windows, that __imp_ prefix implies it's an import from another
DLL, which *must* have emitted its copy and exported it.

Clang and MSVC also do the import, but don't emit that copy. See
https://mingw.godbolt.org/z/aKsaYKThT.

[Bug target/113465] New: [mingw-w64] dllexported constexpr (inline) variables not automatically emitted

2024-01-17 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113465

Bug ID: 113465
   Summary: [mingw-w64] dllexported constexpr (inline) variables
not automatically emitted
   Product: gcc
   Version: 13.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

Related to explicit instantiation of templates bugs:
Bug 89088, Bug 109380
though I'd argue that since that has a special syntax, it's different.

Testcase:
struct __declspec(dllexport) QLocale
{
static constexpr int FirstTwoDigitYear = 1900;
};

With GCC, this produces:
.file   "example.cpp"
.text
.ident  "GCC: (MinGW-W64 x86_64-ucrt-mcf-seh, built by Brecht Sanders)
13.1.0"

That is, nothing.

With Clang, that emits (simplified):
.text
.section   
.rdata$_ZN7QLocale17FirstTwoDigitYearE,"dr",discard,_ZN7QLocale17FirstTwoDigitYearE
.globl  _ZN7QLocale17FirstTwoDigitYearE #
@_ZN7QLocale17FirstTwoDigitYearE
.p2align2, 0x0
_ZN7QLocale17FirstTwoDigitYearE:
.long   1900# 0x76c

.section.drectve,"yni"
.ascii  " -export:_ZN7QLocale17FirstTwoDigitYearE,data"
.addrsig

MSVC also emits the variable, though how it causes the export to happen isn't
clear.

https://mingw.godbolt.org/z/ErbfdPaf8

This can be worked around by explicitly declaring the variable as if it were
not inline (before C++17):

constexpr int QLocale::FirstTwoDigitYear;

However, since this isn't required in any other platform and other compilers on
Windows don't require it either, developers are going to forget it. And very
likely, this issue is going to show up only when users compile their code,
depending on varying levels of optimisation and inlining.

[Bug libstdc++/111244] std::filesystem::path encoding mismatches locale on Windows

2023-08-30 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111244

--- Comment #7 from Thiago Macieira  ---
(In reply to Costas Argyris from comment #6)
> At this point I just meant embedding it in your example a.out executable
> file, just to check if it will work correctly.

Ah, got it. But that is not the conditions of the issue at hand, so proving it
works doesn't help me in the conditions that do apply.

> But yes, assuming this even works, embedding the UTF-8 manifest is part of
> the build process of the application, so it would have to be accounted for
> in the Makefiles etc.

And I can't force my users to do that.

If libstdc++ wants to enforce that or require it for use of std::filesystem,
it's your choice.

[Bug libstdc++/111244] std::filesystem::path encoding mismatches locale on Windows

2023-08-30 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111244

--- Comment #5 from Thiago Macieira  ---
(In reply to Jonathan Wakely from comment #3)
> Somebody else will have to fix this, I've already wasted too much of my life
> making std:: filesystem (mostly) work on Windows.

Same here.

(In reply to Costas Argyris from comment #4)
> I'm wondering if it will work after embedding a UTF-8 manifest into your
> a.out executable, as described here:
> 
> https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-
> code-page

I can't embed a UTF-8 manifest in my DLL and much less in my .a. As a library
writer (I'm the QtCore maintainer), that's out of my hands - it is an
application decision.

If GCC+Binutils team wants to enforce that for the future, be my guest. I'd
support your decision; I think it's high time this happened. But I'm sure there
would be a lot of push-back from people who can't do that because their
existing Windows applications rely on the legacy encodings or those who deploy
to Windows versions that didn't have such support. I have a vague memory of
discussing this in the Qt development mailing list, but can't find it.

A softer approach is for std::filesystem to declare that it only supports
UTF-8-manifested applications (closing this bug as WONTFIX /
working-as-designed). I'd again support your decision and will simply pass the
requirement along to my users.

[Bug libstdc++/111244] std::filesystem::path encoding mismatches locale on Windows

2023-08-30 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111244

--- Comment #2 from Thiago Macieira  ---
(In reply to Andrew Pinski from comment #1)
> Except the code page could be tuned via a manifest file even.
> For an example GCC embeds a manifest into its own compiler to work around
> this issue and just use UTF8 always.
> 
> So ...

Indeed, but won't MultiByteToWideChar() adapt to that and correctly convert
from UTF-8 to UTF-16?

[Bug c++/111244] New: std::filesystem::path encoding mismatches locale on Windows

2023-08-30 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111244

Bug ID: 111244
   Summary: std::filesystem::path encoding mismatches locale on
Windows
   Product: gcc
   Version: 13.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

Test:
$ cat fstest.cpp 
#include 
#include 

int main(int argc, char **argv)
{
for (int i = 1; i < argc; ++i) {
std::filesystem::path p(argv[i]);
if (std::filesystem::exists(p)) {
printf("%s %llu\n", argv[1], (unsigned long
long)std::filesystem::file_size(p));
} else {
printf("%s does not exist\n", argv[1]);
}
}
}
$ touch filæ
$ g++ fstest.cpp
$ ./a.out fstest.cpp filæ

On Linux (and any other Unix):
fstest.cpp 377
fstest.cpp 0

On Windows with libc++ or MS STL:
fstest.cpp 377
fstest.cpp 0

On Windows with libstdc++:
fstest.cpp 377
terminate called after throwing an instance of
'std::filesystem::__cxx11::filesystem_error'
  what():  filesystem error: Cannot convert character sequence: Illegal byte
sequence

This is caused by std::filesystem::path interpreting the input as UTF-8. On
Windows, it's not; it must be decoded using the locale codec. 

Strictly speaking, the same should apply to the conversion to Unicode on Unix
systems too, but a) they're almost all UTF-8 these days, so the corner cases
may be ignored by a policy decision and b) the mismatch of input does not lead
to inability to refer to files by fs::path alone.

[Bug c++/111105] New: [12/13/14 regression] __attribute__((malloc)) can no longer name a C++ member function

2023-08-22 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=05

Bug ID: 05
   Summary: [12/13/14 regression] __attribute__((malloc)) can no
longer name a C++ member function
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

This compiles with GCC 11:

struct QArrayData
{
static void free(void *);
__attribute__((malloc(QArrayData::free))) static void *allocate();
};

But fails since 12.0.

error: 'malloc' attribute argument 1 does not name a function

See https://conformance.godbolt.org/z/n76jTsahT

[Bug target/110591] New: [i386] (Maybe) Missed optimisation: _cmpccxadd sets flags

2023-07-07 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110591

Bug ID: 110591
   Summary: [i386] (Maybe) Missed optimisation: _cmpccxadd sets
flags
   Product: gcc
   Version: 13.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

In:
#include 

bool increment_if(int *ptr, int v)
{
return _cmpccxadd_epi32(ptr, v, 1, _CMPCCX_Z) == v;
}

GCC generates (and current Clang does the same):

increment_if(int*, int):
movl$1, %edx
movl%esi, %eax
cmpzxadd%edx, %eax, (%rdi)
cmpl%eax, %esi
sete%al
ret

The CMPccXADD instructions set EFLAGS to the result of the comparison of their
memory operand to the middle one, which will get the current value of that
memory location whether the comparison succeeded or not. That means the CMP
instruction on the next line is superfluous, since it'll set the flags to
exactly what they are already set to. That means this particular example could
be written:

movl$1, %edx
cmpzxadd%edx, %esi, (%rdi)
sete%al
ret

Saving 2 retire slots and 1 uop. This can be done every time the result of the
intrinsic is compared to the same value that was passed as the intrinsic's
second parameter.

However, in a real workload, this function is likely to be inlined, where the
extra MOV may not be present at all and the CMP is likely to be followed by a
Jcc instead of a SETcc. For the latter case, the CMP+Jcc would be macro-fused,
so there would be no 1-uop gain. Moreover, this atomic operation is likely
going to be multiple cycles long and the conditional code after it probably
can't be speculated very well either.

I'll leave it up to you to decide whether it's worth pursuing this.

[Bug target/110184] New: [i386] Missed optimisation: atomic operations should use PF, ZF and SF

2023-06-08 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110184

Bug ID: 110184
   Summary: [i386] Missed optimisation: atomic operations should
use PF, ZF and SF
   Product: gcc
   Version: 13.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

Follow up from https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102566

The x86 locked ALU operations always set PF, ZF and SF, so the atomic builtins
could use those to emit more optimal code instead of a cmpxchg loop.

Given:
template  int atomic_rmw_op(std::atomic_int )
{
int old = Op(i);
if (old == 0)
return 1;
if (old < 0)
return 2;
return 0;
}

---
Starting with the non-standard __atomic_OP_fetch, the current code for 

inline int andn_fetch_1(std::atomic_int )
{
return __atomic_and_fetch((int *), ~1, 0);
}

is

L33:
movl%eax, %edx
andl$-2, %edx
lock cmpxchgl   %edx, (%rdi)
jne .L33
movl%edx, %eax
shrl$31, %eax
addl%eax, %eax  // eax = 2 if edx < 0
testl   %edx, %edx
movl$1, %edx
cmove   %edx, %eax

But it could be more optimally written as:

movl%ecx, 1
movl%edx, 2
xorl%eax, %eax
lock andl$-2, (%rdi)
cmove   %ecx, %eax
cmovs   %edx, %eax

The other __atomic_OP_fetch operations are very similar. I note that GCC
already realises that if you perform __atomic_and_fetch(ptr, 1), the result
can't have the sign bit set.

---
For the standard atomic_fetch_OP operations, there are a couple of caveats:

fetch_and: if the retrieved value is ANDed again with the same pattern; for
example:
int pattern = 0x8001;
return i.fetch_and(pattern, std::memory_order_relaxed) & pattern;
This appears to be partially implemented, depending on what the pattern is. For
example, it generates the optimal code for pattern = 3, 15, 0x7fff,
0x8000. It appears to be related to testing for either SF or ZF, but not
both.

fetch_or: always for SF, for the useful case when the pattern being ORed
doesn't already contain the sign bit. If it does (a "non-useful case"), then
the comparison is a constant, and likewise for ZF because it's never set if the
pattern isn't zero.

fetch_xor: always, because the original value is reconstructible. Avoid
generating unnecessary code in case the code already does the XOR itself, as
in:

return i.fetch_xor(1, std::memory_order_relaxed) ^ 1;


See https://gcc.godbolt.org/z/n9bMnaE4e for full results.

[Bug target/109896] Missed optimisation: overflow detection in multiplication instructions for operator new

2023-05-18 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109896

--- Comment #7 from Thiago Macieira  ---
(In reply to Jonathan Wakely from comment #6)
> With placement-new there's no allocation:
> https://gcc.godbolt.org/z/68e4PaeYz

Is the exception expected there, though?

[Bug target/109896] Missed optimisation: overflow detection in multiplication instructions for operator new

2023-05-17 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109896

--- Comment #5 from Thiago Macieira  ---
(In reply to Andrew Pinski from comment #4)
> If you are that picky for cycles, these cycles are not going to be a problem
> compared to the dynamic allocation that is just about to happen ..

Yeah, I realised that after I posted the reply. If the calculation is
successful, we're going to allocate memory and that's neither fast nor
determinstic. If it overflows, we're going to unwind the stack, which is even
worse. I had only looked at the multiplication and failed to consider what
comes after it.

So, yeah, do this if it's a low-hanging fruit.

[Bug target/109896] Missed optimisation: overflow detection in multiplication instructions for operator new

2023-05-17 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109896

--- Comment #3 from Thiago Macieira  ---
(In reply to H.J. Lu from comment #2)
> (In reply to Andrew Pinski from comment #1)
> > I suspect the overflow code was added before __builtin_*_overflow were added
> > which is why the generated code is this way.
> 
> Should the C++ front-end use __builtin_mul_overflow?

That's what that code is doing, yes.

But mind you that not all examples are doing actual multiplications. That's why
I had the weird size of 47.

A size that is a power of 2 is just doing bit checks. For example, 16:
movq%rdi, %rax
shrq$59, %rax
jne .L2

Other sizes do the compare, but there's no multiplication involved. For 24:
movabsq $384307168202282325, %rax
cmpq%rdi, %rax
jb  .L2
leaq(%rdi,%rdi,2), %rdi
salq$3, %rdi
5 instructions, 4 cycles (not including front-end decode), so roughly the same
as the imulq example above (4 cycles), but with far more ports to dispatch to.

[Bug tree-optimization/106409] GCC with LTO: Warning: argument 1 value ‘18...615’ (SIZE_MAX) exceeds maximum object size with new

2023-05-17 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106409

--- Comment #8 from Thiago Macieira  ---
(In reply to Andrew Pinski from comment #7)
> See PR 58525 also which added that code path.

That explains why it won't call __cxa_throw_bad_array_new_length, but not why
it will call operator new[](-1). My suggestion is to keep
__cxa_throw_bad_array_new_length for the exceptions case and add a new function
for the non-exceptional case. This function could:
* call operator new[], which would probably cause the stack unwinder to
terminate the application
* call std::terminate() directly, possibly after printing something to stderr
* return null pointer
* something else (generate debug break, raise(SIGKILL), etc.)

[Bug tree-optimization/106409] GCC with LTO: Warning: argument 1 value ‘18...615’ (SIZE_MAX) exceeds maximum object size with new

2023-05-17 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106409

--- Comment #6 from Thiago Macieira  ---
Suggestion: add a function to libgcc to be called instead of
__cxa_throw_bad_array_new_length when exceptions are disabled. That function
can be a mere two instructions, but it provides two advantages:
* no need to stream something into LTO
* allows post-compilation tools to know what's happened (Valgrind, debuggers,
etc.)

I don't know if this is an acceptable solution, but I thought I'd make the
suggestion.

[Bug target/109896] New: Missed optimisation: overflow detection in multiplication instructions for operator new

2023-05-17 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109896

Bug ID: 109896
   Summary: Missed optimisation: overflow detection in
multiplication instructions for operator new
   Product: gcc
   Version: 13.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

In the following code:
struct S
{
char buf[47];   // weird size
};

void *f(unsigned long paramCount)
{
return new S[paramCount];
}

GCC generates (see https://gcc.godbolt.org/z/o5eocj5n9):
movabsq $196241958230952676, %rax
cmpq%rdi, %rax
jb  .L2
imulq   $47, %rdi, %rdi
jmp operator new[](unsigned long)
f(unsigned long) [clone .cold]:
.L2:
pushq   %rax
call__cxa_throw_bad_array_new_length

That's a slight pessimisation of the typical, non-exceptional case because of
the presence of the compare instructions. On modern x86, that's 3 retire slots
and 2 uops, in addition to the multiplication's 3 cycles (which may be
speculated and start early). But the presence of a 10-byte instruction and the
fact that the jump is further than 8-bit displacement range mean those three
instructions occupy 18 bytes, meaning the front-end is sub-utilised, requiring
2 cycles to decode the 5 instructions (pre-GLC [I think] CPUs decode 4
instructions in 16 bytes per cycle).

Instead, GCC should emit the multiplication and check if the overflow flag was
set. I believe the optimal code for GCC would be:

imulq   $47, %rdi, %rdi
jo  .L2
jmp operator new[](unsigned long)

That's 15 bytes, so 1 cycle for the decoder to decode all 3 instructions.
That's 3+1 cycles and 2 retire slots before the JMP.

In the Godbolt link above, Clang and MSVC emitted a CMOV:

mulq%rcx
movq$-1, %rdi
cmovnoq %rax, %rdi
jmp operator new[](unsigned long)@PLT

This is slightly worse (19 bytes, 4 instructions, though also 3+1 cycles). For
GCC's -fno-exceptions case, I recommend keeping the IMUL+JO case and only load
-1 in the .text.unlikely section. But see
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109895

[Bug c++/109895] New: -Walloc-size-larger-than complains about code it generated itself under -flto -fno-exceptions

2023-05-17 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109895

Bug ID: 109895
   Summary: -Walloc-size-larger-than complains about code it
generated itself under -flto -fno-exceptions
   Product: gcc
   Version: 13.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

Reference: https://bugreports.qt.io/browse/QTBUG-113603

Code in question:

const auto paramCount = mysql_stmt_param_count(d->stmt);
if (paramCount > 0) // allocate memory for outvalues
d->outBinds = new MYSQL_BIND[paramCount]();

mysql_stmt_param_count returns unsigned long.

GCC 13.1 with -flto -fno-exceptions produced:

src/plugins/sqldrivers/mysql/qsql_mysql.cpp: In member function ‘prepare’:
src/plugins/sqldrivers/mysql/qsql_mysql.cpp:891:50: warning: argument 1 value
‘18446744073709551615’ exceeds maximum object size 9223372036854775807
[-Walloc-size-larger-than=]
  891 | d->outBinds = new MYSQL_BIND[paramCount]();
  |  ^
/usr/include/c++/13/new:128:26: note: in a call to allocation function
‘operator new []’ declared here
  128 | _GLIBCXX_NODISCARD void* operator new[](std::size_t) _GLIBCXX_THROW
(std::bad_alloc)
  |  ^

Disassembling the code shows it looks similar to this:
https://godbolt.org/z/9eKPxbEMY
movq$-1, %rdi
cmpq%rbx, %rax
jb  .L2
...
.L2:
calloperator new[](unsigned long)@PLT

So that 18446744073709551615 value is the -1 inserted by GCC itself to deal
with the multiplication overflow.

[Bug libstdc++/99277] C++2a synchronisation is inefficient in GCC 11

2023-05-08 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99277

--- Comment #21 from Thiago Macieira  ---
I understand that. I don't think it's a reason to repeat the policy, though.

Anyway, I don't have any new arguments than when we discussed this two years
ago, so I won't pursue this matter further.

[Bug libstdc++/99277] C++2a synchronisation is inefficient in GCC 11

2023-05-08 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99277

--- Comment #19 from Thiago Macieira  ---
(In reply to Jonathan Wakely from comment #18)
> We have not committed to a stable ABI for C++20 yet.

That was my argument when creating this bug report two years ago: if it's
available in the standard headers to be used without explicit, opt-in actions
by the users, it's committed. The change may not break the ABI inside of
libstdc++.so.6, but so long as it breaks ABI of some library, it's an ABI
break. I understand that's not what your current policy says, but it doesn't
change reality. So please don't do it.

[Bug libstdc++/99277] C++2a synchronisation is inefficient in GCC 11

2023-05-08 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99277

--- Comment #17 from Thiago Macieira  ---
(In reply to Thomas Rodgers from comment #16)
> The original implementation came from Olvier Giroux and is part of libc++.
> The libc++ implementation also does not use a type that futex or
> ulock_wait/wake (uint64_t) can handle. I have discussed this in the past
> with Olivier, the choice of char was deliberate on his part. The
> implementation has been tested on a number of platforms (including time on
> ORNL's Summit). 

I remember our discussion on this. But libc++ isn't trying to be optimal and it
never supports direct futex. The fact that they chose this path does not mean
libstdc++ must too.


> The following comment, preserved from libc++ should be
> considered carefully before any change here -
> 
> " 2. A great deal of attention has been paid to avoid cache line thrashing
> by flattening the tree structure into cache-line sized arrays, that
> are indexed in an efficient way."
> 
> It is my opinion that the bar for making a change here is high. I would need
> to see benchmark numbers that illustrate the performance differences under
> various contention scenarios vs impact on caches by being able to fit the
> entire tree in a single cache line using char, vs four or eight cache lines
> using the type favored by futex or ulock_wait/wake.

Indeed. My other $DAYJOB involves a lot of cacheline thrashing up to and
including current 480-core machines, so I appreciate the thought there.

In any case, we can't change the design even if we turn up new data showing
that there's benefit or a bottleneck somewhere.

[Bug libstdc++/99277] C++2a synchronisation is inefficient in GCC 11

2023-05-08 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99277

--- Comment #15 from Thiago Macieira  ---
> >  5) std::barrier implementation also uses a type that futex(2) can't handle

> barrier still uses a 1-byte enum for the atomic waits.

That can only now be fixed for libstdc++.so.7, then.

[Bug libstdc++/99277] C++2a synchronisation is inefficient in GCC 11

2023-05-08 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99277

Thiago Macieira  changed:

   What|Removed |Added

 Resolution|--- |FIXED
 Status|UNCONFIRMED |RESOLVED

--- Comment #13 from Thiago Macieira  ---
I believe this was fixed before release. Just the issue not closed.

[Bug tree-optimization/108980] [13 Regression] Warning text missing the warning itself (GCC 13)

2023-03-01 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108980

--- Comment #9 from Thiago Macieira  ---
Ah, got it. That also explains why I couldn't find anything wrong with my code,
and nothing I did that could likely be it made the warning go away.

Thanks for the quick turnaround.

[Bug tree-optimization/108980] [13 Regression] Warning text missing the warning itself (GCC 13)

2023-03-01 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108980

--- Comment #7 from Thiago Macieira  ---
The duplicate "note:" disappeared. But now there's no warning at all on the
same file, with the same options. Was that intended?

[Bug tree-optimization/108980] [13 Regression] Warning text missing the warning itself (GCC 13)

2023-03-01 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108980

--- Comment #6 from Thiago Macieira  ---
Testing.

[Bug c++/108980] Warning text missing the warning itself (GCC 13)

2023-02-28 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108980

--- Comment #1 from Thiago Macieira  ---
GCC 13 (trunk) built today.

[Bug c++/108980] New: Warning text missing the warning itself (GCC 13)

2023-02-28 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108980

Bug ID: 108980
   Summary: Warning text missing the warning itself (GCC 13)
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

Warning options:
-Wall -Wextra -fno-exceptions -mno-direct-extern-access -Werror -Wno-error=cpp
-Wno-error=deprecated-declarations -Wno-error=strict-overflow
-Wno-error=implicit-fallthrough -Wno-error=deprecated-copy
-Wno-error=redundant-move -Wno-error=init-list-lifetime
-Wno-error=format-overflow -Wno-error=stringop-overflow
-Wno-error=deprecated-enum-enum-conversion
-Wno-error=deprecated-enum-float-conversion -Wsuggest-override

Printed:

In file included from
/home/tjmaciei/src/qt/qt6-release/qtdeclarative/src/qmlmodels/qqmllistcompositor.cpp:4:
/home/tjmaciei/src/qt/qt6-release/qtdeclarative/src/qmlmodels/qqmllistcompositor_p.h:
In member function ‘void QQmlListCompositor::move(Group, int, Group, int, int,
Group, QVector*, QVector*)’:
/home/tjmaciei/src/qt/qt6-release/qtdeclarative/src/qmlmodels/qqmllistcompositor_p.h:115:13:
note: while referencing ‘QQmlListCompositor::iterator::index’
  115 | int index[MaximumGroupCount] = { 0 };
  | ^
/home/tjmaciei/src/qt/qt6-release/qtdeclarative/src/qmlmodels/qqmllistcompositor_p.h:115:13:
note: while referencing ‘QQmlListCompositor::iterator::index’

How can I help find out what the issue here is?

[Bug preprocessor/108372] New: [12 regression] -E -fdirectives-only crash

2023-01-11 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108372

Bug ID: 108372
   Summary: [12 regression] -E -fdirectives-only crash
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: preprocessor
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

Probably similar to many other bugs related to -E -fdirectives-only. This
option is used by icecc <https://github.com/icecc/icecream>.

Test:
g++ -std=c++17 -include type_traits -E -xc++ /dev/null -o /dev/null
-fdirectives-only

Tested with:
gcc version 13.0.0 20230110 (experimental) (GCC)

Output:
In file included from :
/home/tjmaciei/dev/gcc/include/c++/13.0.0/type_traits:2948:25: error: missing
binary operator before token "("
 2948 |bool _Nothrow = noexcept(_S_conv<_Tp>(_S_get())),
  | ^
/home/tjmaciei/dev/gcc/include/c++/13.0.0/type_traits:3033:27: internal
compiler error: unspellable token PRAGMA_EOL
 3033 | ~__nonesuch() = delete;
  |   ^
0xc4c472 c_cpp_diagnostic(cpp_reader*, cpp_diagnostic_level,
cpp_warning_reason, rich_location*, char const*, __va_list_tag (*) [1])
/home/tjmaciei/src/gcc/gcc/c-family/c-common.cc:6694
0x229a914 cpp_diagnostic_at
/home/tjmaciei/src/gcc/libcpp/errors.cc:67
0x229a914 cpp_diagnostic
/home/tjmaciei/src/gcc/libcpp/errors.cc:82
0x229aa73 cpp_error(cpp_reader*, cpp_diagnostic_level, char const*, ...)
/home/tjmaciei/src/gcc/libcpp/errors.cc:96
0x22a58d3 cpp_spell_token(cpp_reader*, cpp_token const*, unsigned char*, bool)
/home/tjmaciei/src/gcc/libcpp/lex.cc:4426
0x22a663a cpp_token_as_text(cpp_reader*, cpp_token const*)
/home/tjmaciei/src/gcc/libcpp/lex.cc:4442
0x229e43c _cpp_parse_expr
/home/tjmaciei/src/gcc/libcpp/expr.cc:1389
0x2296981 do_if
/home/tjmaciei/src/gcc/libcpp/directives.cc:2076
0x2298b68 _cpp_handle_directive
/home/tjmaciei/src/gcc/libcpp/directives.cc:572
0x22a6e7d cpp_directive_only_process(cpp_reader*, void*, void (*)(cpp_reader*,
CPP_DO_task, void*, ...))
/home/tjmaciei/src/gcc/libcpp/lex.cc:5272
0xc76faf scan_translation_unit_directives_only
/home/tjmaciei/src/gcc/gcc/c-family/c-ppoutput.cc:431
0xc76faf preprocess_file(cpp_reader*)
/home/tjmaciei/src/gcc/gcc/c-family/c-ppoutput.cc:104
0xc750b8 c_common_init()
/home/tjmaciei/src/gcc/gcc/c-family/c-opts.cc:1227
0xa6d0ce cxx_init()
/home/tjmaciei/src/gcc/gcc/cp/lex.cc:338
0x95d1c1 lang_dependent_init
/home/tjmaciei/src/gcc/gcc/toplev.cc:1815
0x95d1c1 do_compile
/home/tjmaciei/src/gcc/gcc/toplev.cc:2110
Please submit a full bug report, with preprocessed source.
Please include the complete backtrace with any bug report.
See <https://gcc.gnu.org/bugs/> for instructions.

[Bug target/98112] Add -f[no-]direct-access-external-data & drop HAVE_LD_PIE_COPYRELOC

2023-01-04 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98112

--- Comment #9 from Thiago Macieira  ---
I can't be certain for other architectures' performance, but my feeling is that
indeed they would benefit from this. The option that was added as an -m should
be an -f (and match Clang's option).

However, maintainers of other architectures need to step up to help this.

Aside from that, yes, this task can be closed as it's implemented.

[Bug c++/108216] Wrong offset for (already-constructed) virtual base during construction of full object

2022-12-23 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108216

--- Comment #3 from Thiago Macieira  ---
In bug 70644, the pointer to Base was passed to Base's constructor, so the
conversion from the derived type to the virtual base Base happened clearly
before said base was constructed.

In this example here, the conversion happens inside C's constructor body, where
C's direct (but virtual) base A must be fully initialised, notwithstanding the
fact that it was initialised by D's in-charge constructor.

I'm not making a conclusion that this is or isn't UB. I'm saying that it can't
be UB for the explanation offered in that bug.

[Bug tree-optimization/104475] [12/13 Regression] Wstringop-overflow + atomics incorrect warning on dynamic object

2022-12-06 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104475

--- Comment #19 from Thiago Macieira  ---
(In reply to Richard Biener from comment #15)
> Thanks, it's still the same reason - we isolate a nullptr case and end up
> with
> 
> __atomic_or_fetch_4 (184B, 64, 0); [tail call]
> 
> The path we isolate is d->m_mutex == nullptr && !enable in
> 
> void QFutureInterfaceBase::setThrottled(bool enable)
> {
> QMutexLocker lock(>m_mutex);

Thank you for the analysis, Richard. But do note that it's >m_mutex, not
d->m_mutex that is passed to the locker. C++ says that if you do d-> then d !=
nullptr, so >m_mutex can't be nullptr either.

However, I guess GCC thinks it can be because the offset of m_mutex in QFIBP is
zero. pahole says:

public:
void QFutureInterfaceBasePrivate(class QFutureInterfaceBasePrivate *,
enum State);
void ~QFutureInterfaceBasePrivate(class QFutureInterfaceBasePrivate *,
int);

class QMutex  m_mutex;   /* 0 8 */
class QBasicMutex continuationMutex; /* 8 8 */

So there's a missed optimisation here. But it doesn't look like GCC is the only
one to miss it, see https://gcc.godbolt.org/z/WW5hbW6sW. Maybe it's an
intentional choice?

> we predict the path to be unlikely but the adjustment to the threader
> covered probably never executed paths (with probability zero).  The
> threading opportunity arises because the DTOR calls
> 
> inline void unlock() noexcept
> {   
> if (!isLocked)
> return;
> m->unlock();
> isLocked = false;
> }
> 
> and we know isLocked on the nullptr path.

We know it can't be true.

> I thought we could maybe enhance prediction to look for nullptr based
> accesses but at the time we estimate probabilities the QMutexLocker
> CTOR isn't yet inlined (the DTOR is partially inlined, exposing the
> isLocked check).
> 
> Note the "impossible" path is actually in the sources - so there might
> be a missing conditional somewhere.

I don't see it, but that's probably because I'm looking at it from the C++
side. If the mutex pointer that was passed is null, then isLocked is never set
to true. What you're saying is that the unlock() function above was inlined and
that GCC knew m to be nullptr, but didn't know isLocked's value... which makes
no sense to me. If the constructor wasn't inlined, it couldn't know the value
of m either. If the constructor was inlined, then it should know the value of
both.

Anyway, this discussion made me realise there's a series of changes to
QMutexLocker ending in "QMutexLocker: strenghten the locking operations"
(https://code.qt.io/cgit/qt/qtbase.git/commit/?id=1b1456975347b044c11169458b53c9f6083dbc59).
This probably did change how the optimiser works, explaining why the warnings
went away.

But it shouldn't have. We went from

inline ~QMutexLocker() {
unlock();
}
inline void unlock() noexcept
{
if (!isLocked)
return;
m->unlock();
isLocked = false;
}

to

inline ~QMutexLocker()
{
if (m_isLocked)
unlock();
}
inline void unlock() noexcept
{
Q_ASSERT(m_isLocked);
m_mutex->unlock();
m_isLocked = false;
}

with the Q_ASSERT expanding to nothing in release builds, it should be
effectively identical code.

[Bug tree-optimization/104475] [12/13 Regression] Wstringop-overflow + atomics incorrect warning on dynamic object

2022-12-05 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104475

--- Comment #14 from Thiago Macieira  ---
Created attachment 54015
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54015=edit
qfutureinterface.cpp preprocessed [gcc trunk-20221205]

(In reply to Richard Biener from comment #13)
> There's been some changes on trunk but the preprocessed source doesn't work
> there.

Uploaded the updated preprocessed source with current trunk, from roughly the
same Qt commit (I chose a date just before this bug report was opened).

I can still reproduce this issue with the minimal original sources and these
preprocessed, but somehow not with the actual qfutureinterface.cpp, either then
or now.

[Bug target/107456] std::atomic::fetch_xxx generate LOCK CMPXCHG instead of simpler LOCK instructions

2022-11-01 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107456

--- Comment #4 from Thiago Macieira  ---
(In reply to Thiago Macieira from comment #3)
> With the Remote Atomic Operations (RAO) of AAND, AOR and AXOR, we can do
> something.

Correcting myself: the RAO instructions don't give us the result back either.

[Bug target/107456] std::atomic::fetch_xxx generate LOCK CMPXCHG instead of simpler LOCK instructions

2022-10-31 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107456

Thiago Macieira  changed:

   What|Removed |Added

 CC||thiago at kde dot org

--- Comment #3 from Thiago Macieira  ---
(In reply to Andrew Pinski from comment #1)
> I was going to say the exact same comment as on LLVM bug report:
> https://github.com/llvm/llvm-project/issues/58685#issuecomment-1295829030
> 
> There is no way atomically fetch and add without xadd.
> 
> There is no "x"and/"x"or  instruction on x86 (note the x here stands for
> exchange rather than exclusive as there is an xor but that is an "exclusive
> or").

With the Remote Atomic Operations (RAO) of AAND, AOR and AXOR, we can do
something.

[Bug c++/106395] New: [10/11 regression] [mingw] "redeclared without dllimport attribute: previous dllimport ignored" on C++ friend

2022-07-21 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106395

Bug ID: 106395
   Summary: [10/11 regression] [mingw] "redeclared without
dllimport attribute: previous dllimport ignored" on
C++ friend
   Product: gcc
   Version: 12.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

__attribute__((dllimport)) void f();
class S
{
private:
int i;
friend __attribute__((dllimport)) void f();
};

With GCC 10.3.0 (OpenSUSE Tumbleweed):
$ x86_64-w64-mingw32-gcc -c -Werror /tmp/test.cpp

With GCC 11.2.1 (Fedora 35 & 36):
$ x86_64-w64-mingw32-gcc -Wall -Wextra -Werror -c /tmp/test.cpp

With GCC 12.1.0 (Arch Linux, self built):
$ x86_64-w64-mingw32-g++ -Werror -c /tmp/test.cpp
/tmp/test.cpp:6:44: error: ‘void f()’ redeclared without dllimport attribute:
previous dllimport ignored [-Werror=attributes]
6 | friend __attribute__((dllimport)) void f();
  |^
cc1plus: all warnings being treated as errors

GCC 12 is obviously wrong because it is complaining that the attribute is
missing and then shows that the attribute is right there.

[Bug c++/77306] Unable to specify visibility for explicit template instantiations

2022-06-19 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77306

Thiago Macieira  changed:

   What|Removed |Added

 CC||thiago at kde dot org

--- Comment #3 from Thiago Macieira  ---
*** Bug 106023 has been marked as a duplicate of this bug. ***

[Bug c++/106023] Would like to control the ELF visibility of template explicit instantiations

2022-06-19 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106023

Thiago Macieira  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |DUPLICATE

--- Comment #2 from Thiago Macieira  ---
Ah, so it is. Thanks for putting up with my being lazy.

*** This bug has been marked as a duplicate of bug 77306 ***

[Bug c++/106023] New: Would like to control the ELF visibility of template explicit instantiations

2022-06-18 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106023

Bug ID: 106023
   Summary: Would like to control the ELF visibility of template
explicit instantiations
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

Given a template like:

template  struct __attribute__((visibility("hidden"))) S
{
static constexpr int n = 0;
};

I would like to mark an explicit instantiation (not a specialisation!) with a
different visibility:

template struct __attribute__((visibility("default"))) S;
template __attribute__((visibility("default"))) const int S::n;

Either solution would help me, but the first would be preferable. Both would be
best.

Clang does support both forms properly. Visual Studio supports the former only.
See https://gcc.godbolt.org/z/T7sdzTnbG and note how GCC adds ".hidden" to the
two "::n" symbols and Clang doesn't, while Clang properly refers to those
symbols using @GOTPCREL and GCC doesn't. The MVSC example uses dllimport only
to highlight the difference; it wouldn't be written like this in real code.

Please ensure "protected" visibility also works. See
https://gcc.godbolt.org/z/o3W4x5YsG

[Bug middle-end/105348] Overly aggressive -Warray-bounds after conditional

2022-05-31 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105348

--- Comment #4 from Thiago Macieira  ---
One more Qt workaround, for the record:
https://codereview.qt-project.org/c/qt/qtbase/+/413730

[Bug c++/105509] New: [compatibility] f16 suffix not supported in C++ mode - unable to find numeric literal operator ‘operator""f16’

2022-05-06 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105509

Bug ID: 105509
   Summary: [compatibility] f16 suffix not supported in C++ mode -
unable to find numeric literal operator
‘operator""f16’
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

The following code:

_Float16 f = 12.34f16;

compiles as C in GCC12, and as both C and C++ with Clang 14 and ICX 2022.

But with GCC 12 as C++, it generates:

error: unable to find numeric literal operator ‘operator""f16’

See also https://wg21.link/p1467, notably section 5.10 Literal suffixes

[Bug middle-end/105348] Overly aggressive -Warray-bounds after conditional

2022-04-25 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105348

--- Comment #3 from Thiago Macieira  ---
I understand. I'm just trying to avoid having to add code for a corner-case.
People don't usually parse empty buffers, so it's usually fine to allow it to
proceed and discover an EOF condition.

Anyway, worked around. Feel free to close if this is too hard to fix.

[Bug middle-end/105348] Overly aggressive -Warray-bounds after conditional

2022-04-22 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105348

--- Comment #1 from Thiago Macieira  ---
Qt workaround: https://codereview.qt-project.org/c/qt/qtbase/+/407217

[Bug middle-end/105348] New: Overly aggressive -Warray-bounds after conditional

2022-04-22 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105348

Bug ID: 105348
   Summary: Overly aggressive -Warray-bounds after conditional
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

Testcase:

#include 
char empty;

void sink(int);
bool cond(size_t);
void f(const char *s, size_t l)
{
int n;
if (cond(l)) {
memcpy(, s, sizeof(n));
sink(n);
}
}

void g()
{
f(, 1);
}

#ifdef EXPAND
bool cond(size_t l)
{
return l >= sizeof(int);
}
#endif

$ gcc -DEXPAND -O3 -c -Wall -Wextra -Werror test.cpp && echo $?
0
$ gcc -O3 -c -Wall -Wextra -Werror test.cpp && echo $? 
In function ‘void f(const char*, size_t)’,
inlined from ‘void f(const char*, size_t)’ at test.cpp:6:6,
inlined from ‘void g()’ at test.cpp:17:6:
test.cpp:10:15: error: array subscript ‘unsigned int[0]’ is partly outside
array bounds of ‘char [1]’ [-Werror=array-bounds]
   10 | memcpy(, s, sizeof(n));
  | ~~^~
test.cpp: In function ‘void g()’:
test.cpp:2:6: note: object ‘empty’ of size 1
2 | char empty;
  |  ^
cc1plus: all warnings being treated as errors

I've noticed this even when the other function was present and available for
inlining. Unfortunately, for reasons outside of my direct control, GCC decided
not to inline that function, which meant it considers the condition bad.

[Bug target/103069] cmpxchg isn't optimized

2022-02-22 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #20 from Thiago Macieira  ---
I think there will be cases where the relaxation makes sense and others where
it doesn't because the surrounding code already does it. So I'd like to control
per emission.

If I can't do it per code block, I suppose I could make a lambda block

  [&]() __attribute__((target("relax-cmpxchg-loop"))) { 
return __atomic_compare_exchange_weak();
  }();

Of course, it might be easier to simply relax it myself at that point.

[Bug target/103069] cmpxchg isn't optimized

2022-02-22 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #18 from Thiago Macieira  ---
(In reply to Jakub Jelinek from comment #17)
> _Pragma("GCC target \"relax-cmpxchg-loop\"")
> should do that (ditto target("relax-cmpxchg-loop") attribute).

The attribute is applied to a function. I'm hoping to do it for s block of
code:

 _Pragma("GCC push_options")
 _Pragma("GCC target \"relax-cmpxchg-loop\"")
 __atomic_compare_exchange_weak();
 _Pragma("GCC pop_options")

[Bug target/103069] cmpxchg isn't optimized

2022-02-22 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #16 from Thiago Macieira  ---
Can this option be enabled and disabled with a _Pragma?

[Bug target/103069] cmpxchg isn't optimized

2022-02-21 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #14 from Thiago Macieira  ---
I'd restrict relaxations to loops emitted by the compiler. All other atomic
operations shouldn't be modified at all, unless the user asks for it. That
includes non-looping atomic operations (like LOCK BTC, LOCK XADD) as well as a
pure LOCK CMPXCHG that came from a single __atomic_compare_exchange by the
user.

I'd welcome the ability to relax the latter, especially if with one codebase I
could be efficient in CAS architectures as well as LL/SC ones.

[Bug c++/104492] New: Bogus dangling pointer warning (dangling pointer to ‘candidates’ may be used [-Werror=dangling-pointer=])

2022-02-10 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104492

Bug ID: 104492
   Summary: Bogus dangling pointer warning (dangling pointer to
‘candidates’ may be used [-Werror=dangling-pointer=])
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

Created attachment 52409
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52409=edit
Preprocessed sources for qlibrary.cpp

Workaround: https://codereview.qt-project.org/c/qt/qtbase/+/394894
Original sources: 
https://code.qt.io/cgit/qt/qtbase.git/tree/src/corelib/plugin/qlibrary.cpp?id=5e378aaff61c9708e0657f3ee29517c06cb075fa#n618
https://github.com/qt/qtbase/blob/5e378aaff61c9708e0657f3ee29517c06cb075fa/src/corelib/plugin/qlibrary.cpp#L618-L677

Code snippet:

auto isValidSuffix = [](QStringView s) {
const QLatin1String candidates[] = {
QLatin1String("so"),
};
return std::find(std::begin(candidates), std::end(candidates), s) !=
std::end(candidates);
};
auto suffixes = qTokenize(completeSuffix, u'.');
auto it = suffixes.begin();
const auto end = suffixes.end();
while (it != end) {
if (isValidSuffix(*it++))
  return true;
}
return false;

This code above produces the warning:
qlibrary.cpp:114:9: error: dangling pointer to ‘candidates’ may be used
[-Werror=dangling-pointer=]  
  114 | if (isValidSuffix(*it++))   
  | ^~  
qlibrary.cpp:76:29: note: ‘candidates’ declared here
   76 | const QLatin1String candidates[] = {
  | ^~  

The 'candidates' variable has indeed gone out of scope, but there's no use of
its pointer where the lambda is used. The lambda does a std::find, which
returns a pointer into the array, but that pointer is compared to std::end and
the lambda returns a boolean.

Compile the attached preprocessed sources with:
 g++ -march=x86-64-v3 -Wall -Wextra -O3 -o /dev/null -c qlibrary.cpp.ii

GCC commit 1ce5395977f37e8d0c03394f7b932a584ce85cc7 (master branch dated
2022-02-09).

[Bug c++/104475] New: Wstringop-overflow + atomics incorrect warning on dynamic object

2022-02-09 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104475

Bug ID: 104475
   Summary: Wstringop-overflow + atomics incorrect warning on
dynamic object
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

Created attachment 52399
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52399=edit
qfutureinterface.cpp preprocessed

In:
static inline int switch_on(QAtomicInt , int which)
{
return a.fetchAndOrRelaxed(which) | which;
}

static inline int switch_off(QAtomicInt , int which)
{
return a.fetchAndAndRelaxed(~which) & ~which;
}

void QFutureInterfaceBase::setThrottled(bool enable)
{
QMutexLocker lock(>m_mutex);
if (enable) {
switch_on(d->state, Throttled);
} else {
switch_off(d->state, Throttled);
if (!(d->state.loadRelaxed() & suspendingOrSuspended))
d->pausedWaitCondition.wakeAll();
}
}

Compiling the attached preprocessed sources with:

g++ -Wall -Wextra -march=haswell -O2 -c -o /dev/null qfutureinterface.cpp.ii

Produces:

In member function ‘std::__atomic_base<_IntTp>::__int_type
std::__atomic_base<_IntTp>::fetch_or(__int_type, std::memory_order) [with _ITp
= int]’,
inlined from ‘static T QAtomicOps::fetchAndOrRelaxed(std::atomic&,
typename QAtomicAdditiveType::AdditiveT) [with T = int; X = int]’ at
/home/tjmaciei/obj/qt/qt6/qtbase/include/QtCore/../../../../../../src/qt/qt6/qtbase/src/corelib/thread/qatomic_cxx11.h:449:33,
inlined from ‘T QBasicAtomicInteger::fetchAndOrRelaxed(T) [with T =
int]’ at
/home/tjmaciei/obj/qt/qt6/qtbase/include/QtCore/../../../../../../src/qt/qt6/qtbase/src/corelib/thread/qbasicatomic.h:168:36,
inlined from ‘int switch_on(QAtomicInt&, int)’ at
/home/tjmaciei/src/qt/qt6/qtbase/src/corelib/thread/qfutureinterface.cpp:59:31,
inlined from ‘void QFutureInterfaceBase::setThrottled(bool)’ at
/home/tjmaciei/src/qt/qt6/qtbase/src/corelib/thread/qfutureinterface.cpp:71:18:
/home/tjmaciei/dev/gcc/include/c++/12.0.1/bits/atomic_base.h:648:33: warning:
‘unsigned int __atomic_or_fetch_4(volatile void*, unsigned int, int)’ writing 4
bytes into a region of size 0 overflows the destination [-Wstringop-overflow=]
  648 |   { return __atomic_fetch_or(&_M_i, __i, int(__m)); }
  |~^~
In member function ‘std::__atomic_base<_IntTp>::__int_type
std::__atomic_base<_IntTp>::fetch_and(__int_type, std::memory_order) [with _ITp
= int]’,
inlined from ‘static T QAtomicOps::fetchAndAndRelaxed(std::atomic&,
typename QAtomicAdditiveType::AdditiveT) [with T = int; X = int]’ at
/home/tjmaciei/obj/qt/qt6/qtbase/include/QtCore/../../../../../../src/qt/qt6/qtbase/src/corelib/thread/qatomic_cxx11.h:425:34,
inlined from ‘T QBasicAtomicInteger::fetchAndAndRelaxed(T) [with T =
int]’ at
/home/tjmaciei/obj/qt/qt6/qtbase/include/QtCore/../../../../../../src/qt/qt6/qtbase/src/corelib/thread/qbasicatomic.h:159:37,
inlined from ‘int switch_off(QAtomicInt&, int)’ at
/home/tjmaciei/src/qt/qt6/qtbase/src/corelib/thread/qfutureinterface.cpp:64:32,
inlined from ‘void QFutureInterfaceBase::setThrottled(bool)’ at
/home/tjmaciei/src/qt/qt6/qtbase/src/corelib/thread/qfutureinterface.cpp:73:19:
/home/tjmaciei/dev/gcc/include/c++/12.0.1/bits/atomic_base.h:638:34: warning:
‘unsigned int __atomic_fetch_and_4(volatile void*, unsigned int, int)’ writing
4 bytes into a region of size 0 overflows the destination
[-Wstringop-overflow=]
  638 |   { return __atomic_fetch_and(&_M_i, __i, int(__m)); }
  |~~^~

GCC Git commit 1ce5395977f37e8d0c03394f7b932a584ce85cc7, built today.

[Bug c++/104243] Optimization requires __sync_synchronize

2022-01-27 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104243

--- Comment #7 from Thiago Macieira  ---
(In reply to Martin Liška from comment #6)
> Anyway, upstream removed the pure attribute as we suggested:
> https://codereview.qt-project.org/c/qt/qtbase/+/392357

Can we be assured the pure attribute will work for complex return types?

https://gcc.godbolt.org/z/KE4s74od3
struct S
{
bool *ptr;
S();
S(const S &);
};

#ifdef __GNUC__
__attribute__((pure))
#endif
S f1();
bool f2()
{
return *f1().ptr;
}

[Bug target/104250] New: [i386] GCC may want to use 32-bit (I)DIV if it can for 64-bit operands

2022-01-26 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104250

Bug ID: 104250
   Summary: [i386] GCC may want to use 32-bit (I)DIV if it can for
64-bit operands
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

In
long long f1(long long n, long long d)
{
return n / d;
}

GCC generates:

movq%rdi, %rax
cqto
idivq   %rsi
ret

Which is fine, except that the 64-bit IDIV instruction is significantly slower
than the 32-bit (I)DIV. In recent CPUs (such as PMC, SNC, WLC, GLC), that's 18
vs 14 cycles, but it was much worse in older CPUs. There's still a significant
difference for Atom cores, such as used in Alder Lake-E.

Clang generates:
movq%rdi, %rax
movq%rdi, %rcx
orq %rsi, %rcx
shrq$32, %rcx
je  .LBB0_1
cqto
idivq   %rsi
retq
.LBB0_1:
xorl%edx, %edx
divl%esi
retq

That is, it ORs the two operands and checks if any bit in the upper half is
set. If so, it performs the 64-bit division; otherwise, it performs the 32-bit
one.

References:
https://gcc.godbolt.org/z/385a3da8q
https://uops.info/html-instr/IDIV_R32.html
https://uops.info/html-instr/IDIV_R64.html

[Bug target/103069] cmpxchg isn't optimized

2022-01-24 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #10 from Thiago Macieira  ---
(In reply to H.J. Lu from comment #9)
> nptl/nptl_setxid.c in glibc has
> 
>   do  
> {   
>   flags = THREAD_GETMEM (self, cancelhandling);
>   newval = THREAD_ATOMIC_CMPXCHG_VAL (self, cancelhandling,
>   flags & ~SETXID_BITMASK, flags);
> }   
>   while (flags != newval);
> 
> GCC 12 generates:
> 
>899f0: 64 8b 14 25 08 03 00mov%fs:0x308,%edx
>899f8: 89 d6   mov%edx,%esi
>899fa: 89 d0   mov%edx,%eax
>899fc: 83 e6 bfand$0xffbf,%esi
>899ff: f0 0f b1 31 lock cmpxchg %esi,(%rcx)   
>89a03: 75 eb   jne899f0 <__GI___nptl_setxid_sighand
> ler+0x90>

This one is a single bit. This one should be replaced with a LOCK BTC and no
loop.

[Bug target/49001] GCC uses VMOVAPS/PD AVX instructions to access stack variables that are not 32-byte aligned

2021-12-21 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49001

--- Comment #7 from Thiago Macieira  ---
Hack to workaround:

asm(
".macro vmovapd args:vararg\n"
"vmovupd \\args\n"
".endm\n"
".macro vmovaps args:vararg\n"
"vmovups \\args\n"
".endm\n"
".macro vmovdqa args:vararg\n"
"vmovdqu \\args\n"
".endm\n"
".macro vmovdqa32 args:vararg\n"
"vmovdqu32 \\args\n"
".endm\n"
".macro vmovdqa64 args:vararg\n"
"vmovdqu64 \\args\n"
".endm\n"
);

See:
https://github.com/opendcdiag/opendcdiag/blob/main/framework/sysdeps/windows/win32_stdlib.h#L11-L34

[Bug target/103774] [i386] GCC should swap the arguments to certain functions to generate a single instruction

2021-12-20 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103774

Thiago Macieira  changed:

   What|Removed |Added

 CC||hjl.tools at gmail dot com

--- Comment #1 from Thiago Macieira  ---
This is a very minor thing because I expect that, at the uop level, the two
code sequences are identical. There are two more macro-instructions to retire
on the front-end, though.

You can lower the priority.

[Bug target/103774] New: [i386] GCC should swap the arguments to certain functions to generate a single instruction

2021-12-20 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103774

Bug ID: 103774
   Summary: [i386] GCC should swap the arguments to certain
functions to generate a single instruction
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

I don't know how widespread this is. Seen in the code generated at
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750.

This code:
__m256i data1 = _mm256_loadu_si256(reinterpret_cast(n));
__m256i data2 = _mm256_loadu_si256(reinterpret_cast(n)
+ 1);
__mmask16 mask1 = _mm256_cmpeq_epu16_mask(data1, mch256);
__mmask16 mask2 = _mm256_cmpeq_epu16_mask(data2, mch256);
Generates:
vmovdqu16   (%rdi), %ymm1
vmovdqu16   32(%rdi), %ymm2
vpcmpuw $0, %ymm0, %ymm1, %k0
vpcmpuw $0, %ymm0, %ymm2, %k1

While if you invert the two operands in the cmpeq intrinsics, as in:
__m256i data1 = _mm256_loadu_si256(reinterpret_cast(n));
__m256i data2 = _mm256_loadu_si256(reinterpret_cast(n)
+ 1);
__mmask16 mask1 = _mm256_cmpeq_epu16_mask(mch256, data1);
__mmask16 mask2 = _mm256_cmpeq_epu16_mask(mch256, data2);
You get:
vpcmpuw $0, (%rdi), %ymm0, %k0
vpcmpuw $0, 32(%rdi), %ymm0, %k1


Godbolt link with full copileable source code:
https://gcc.godbolt.org/z/rKo666MM7

Clang, ICC (Clang-based) do this. MSVC behaves like GCC.

[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop

2021-12-17 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #8 from Thiago Macieira  ---
Update again: looks like the issue was the next line I didn't paste, which was
performing _kortestz_mask32_u8 on an __mmask16. The type mismatch was causing
this problem.

If I Use the correct _kortestz_maskXX_u8, I'm getting:

vmovdqu8(%rsi), %ymm2
vmovdqu832(%rsi), %ymm3
vpcmpub $6, %ymm0, %ymm2, %k0
vpcmpub $6, %ymm0, %ymm3, %k1
kortestd%k1, %k0
je  .L794


vmovdqu16   (%rsi), %ymm2
vmovdqu16   32(%rsi), %ymm3
vpcmpuw $6, %ymm0, %ymm2, %k0
vpcmpuw $6, %ymm0, %ymm3, %k1
kortestw%k1, %k0
je  .L807

So it looks like GCC is not completely wrong, but it could be more lenient
(Clang is). You can lower the severity of this issue.

[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop

2021-12-17 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #7 from Thiago Macieira  ---
I should add the same is not happening for Char == char, meaning the returned
type is an __mmask32 (unsigned)

vmovdqu8(%rsi), %ymm2
vmovdqu832(%rsi), %ymm3
vpcmpub $6, %ymm0, %ymm2, %k0
vpcmpub $6, %ymm0, %ymm3, %k1
kortestd%k1, %k0
je  .L792

[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop

2021-12-17 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #6 from Thiago Macieira  ---
It got worse. Now I'm seeing:

.L807:
vmovdqu16   (%rsi), %ymm2
vmovdqu16   32(%rsi), %ymm3
vpcmpuw $6, %ymm0, %ymm2, %k2
vpcmpuw $6, %ymm0, %ymm3, %k3
kmovw   %k2, %eax
kmovw   %k3, %edx
kmovd   %eax, %k4
kmovd   %edx, %k5
kortestd%k5, %k4
je  .L814

Code snippet:

auto loadAndCompare = [maxval](const Char *ptr, unsigned mask = ~0U) {
if constexpr (sizeof(Char) == 1) {
__m256i mval = _mm256_set1_epi8(maxval);
__m256i data = _mm256_maskz_loadu_epi8(mask, ptr);
return _mm256_cmpgt_epu8_mask(data, mval);
} else if constexpr (sizeof(Char) == 2) {
__m256i mval = _mm256_set1_epi16(maxval);
__m256i data = _mm256_maskz_loadu_epi16(mask, ptr);
return _mm256_cmpgt_epu16_mask(data, mval);
} else if constexpr (sizeof(Char) == 4) {
__m256i mval = _mm256_set1_epi32(maxval);
__m256i data = _mm256_maskz_loadu_epi32(mask, ptr);
return _mm256_cmpgt_epu32_mask(data, mval);
}
};
/*...*/
auto mask1 = loadAndCompare(n1);
auto mask2 = loadAndCompare(n2);

I can make a compilable version if you need me to

[Bug target/103750] [i386] GCC schedules KMOV instructions that destroys performance in loop

2021-12-17 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

--- Comment #5 from Thiago Macieira  ---
Maybe this is running afoul of GCC's thinking that a simple register-register
move is free? I've seen it save a constant in an opmask register, but kmov{d,q}
is not free like mov{l,q} is.

[Bug target/103750] New: [i386] GCC schedules KMOV instructions that destroys performance in loop

2021-12-16 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103750

Bug ID: 103750
   Summary: [i386] GCC schedules KMOV instructions that destroys
performance in loop
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

Testcase:

const char16_t *qustrchr(char16_t *n, char16_t *e, char16_t c) noexcept
{
__m256i mch256 = _mm256_set1_epi16(c);
for ( ; n < e; n += 32) {
__m256i data1 = _mm256_loadu_si256(reinterpret_cast(n));
__m256i data2 = _mm256_loadu_si256(reinterpret_cast(n)
+ 1);
__mmask16 mask1 = _mm256_cmpeq_epu16_mask(data1, mch256);
__mmask16 mask2 = _mm256_cmpeq_epu16_mask(data2, mch256);
if (_kortestz_mask16_u8(mask1, mask2))
continue;

unsigned idx = _tzcnt_u32(mask1);
if (mask1 == 0) {
idx = __tzcnt_u16(mask2);
n += 16;
}
return n + idx;
}
return e;
}

The assembly for this produces:

vmovdqu16   (%rdi), %ymm1
vmovdqu16   32(%rdi), %ymm2
vpcmpuw $0, %ymm0, %ymm1, %k0
vpcmpuw $0, %ymm0, %ymm2, %k1
kmovw   %k0, %edx
kmovw   %k1, %eax
kortestw%k1, %k0
je  .L10

Those two KMOVW instructions aren't required for the check that follows.
They're also dispatched on port 0, same as the KORTESTW, meaning the KORTEST
can't be dispatched until those two have executed, thus introducing a 2-cycle
delay in this loop.

Clang generates:

.LBB0_2:# =>This Inner Loop Header: Depth=1
vpcmpeqw(%rdi), %ymm0, %k0
vpcmpeqw32(%rdi), %ymm0, %k1
kortestw%k0, %k1
jne .LBB0_3

ICC inserts one KMOVW, but not the other.

Godbolt build link: https://gcc.godbolt.org/z/cc3heo48M

LLVM-MCA analysis: https://analysis.godbolt.org/z/dGvY1Wj78
It shows the Clang loop runs on average 2.0 cycles per loop, whereas the GCC
code is 3 cycles/loop.

LLVM-MCA says the ICC loop with one of the two KMOV also runs at 2.0 cycles per
loop, because it can run in parallel with the second load, given that the loads
are ports 2 and 3.

[Bug target/103066] __sync_val_compare_and_swap/__sync_bool_compare_and_swap aren't optimized

2021-11-06 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103066

--- Comment #10 from Thiago Macieira  ---
You're right that emitting more penalises those who have done their job and
written proper code.

The problem we're seeing is that such code appears to be the minority. Or,
maybe put differently, the bad code is showing up a lot in our benchmarks,
especially on very big multi-core and multi-socket systems. "Fixing" the
compiler would make a broad update to the industry -- so long as the code is
recompiled with new compilers. Fixing the actual code would make it better even
if used with old ones.

Does anyone have a suggestion on how to get best "bang for buck"? (Biggest
benefit for smallest effort) This is a sincere question. I'm not trying to be
ironic or sarcastic. How can we help the most, the quickest, for the limited
amount of resources we can marshal?

Also, and I've been hitting this key for a few years, how can we do better at
teaching people how to use the tools at their disposal at the proper way? A
very good counter-example is C++11's std::atomic_flag: you MUST NEVER use it
(at least until C++20, where it got a test() member).

[Bug target/103090] [i386] GCC should use the SF and ZF flags in some atomic_fetch_op sequences

2021-11-04 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103090

--- Comment #1 from Thiago Macieira  ---
One more:

bool tsign3(std::atomic )
{
// any two or more bits, so long as the sign bit is one of them 
// (or the compiler doesn't know what's in the variable)
int bits = 1 | signbit; 
return i.fetch_and(bits, std::memory_order_relaxed) & signbit;
}

[Bug target/103069] cmpxchg isn't optimized

2021-11-04 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #2 from Thiago Macieira  ---
See also bug 103090 for a few more (restricted) possibilities to replace a
cmpxchg loop with a LOCK RMW operation.

[Bug middle-end/102566] [i386] GCC should emit LOCK BTS for simple bit-test-and-set operations with std::atomic

2021-11-04 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102566

--- Comment #29 from Thiago Macieira  ---
New suggestion in bug 103090

[Bug middle-end/103090] New: [i386] GCC should use the SF and ZF flags in some atomic_fetch_op sequences

2021-11-04 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103090

Bug ID: 103090
   Summary: [i386] GCC should use the SF and ZF flags in some
atomic_fetch_op sequences
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

Disclaimer: I don't know this code actually exists anywhere. But I've just come
up with it.

In Bug 102566, we optimised:

bool tbit(std::atomic )
{
return i.fetch_or(1, std::memory_order_relaxed) & 1;
}

To emit LOCK BTS. Similarly, fetch_xor got LOCK BTC and fetch_and got LOCK BTR.
These all work because CF is set by the bit-test-and-op instructions.

It occurs to me that LOCK AND, LOCK OR and LOCK XOR reliably set the SF, ZF,
and PF flags according to the result, so they may be used too. I can't think of
any time the PF flag would be useful and obviously ZF will not be set after a
LOCK OR (unless you OR'ed zero, but why would you do that?).

So possibilities are:

static constexpr int signbit = 0x8000;
bool tsign1(std::atomic )
{
int bit = 1; // any one or more bits, except for a constant sign bit
return i.fetch_or(bit, std::memory_order_relaxed) & signbit;
}
bool tsign2(std::atomic )
{
int bit = 1; // any one or more bits, except for a constant sign bit
return i.fetch_xor(bit, std::memory_order_relaxed) & signbit;
}
bool tzero1(std::atomic )
{
int bits = 1; // any one or more bits
return i.fetch_and(bit, std::memory_order_relaxed) == 0;
}
bool tzero2(std::atomic )
{
int bits = 1; // any one or more bits
return i.fetch_xor(bit, std::memory_order_relaxed) == 0;
}

all of the above can be negated too (op != 0 and (op & signbit) == 0).

[Bug target/103069] cmpxchg isn't optimized

2021-11-03 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103069

--- Comment #1 from Thiago Macieira  ---
(the assembly doesn't match the source code, but we got your point)

Another possible improvement for the __atomic_fetch_{and,nand,or} functions is
that it can check whether the fetched value is already correct and branch out.
In your example, the __atomic_fetch_or with 0x4000 can check if that bit is
already set and, if so, not execute the CMPXCHG at all.

This is a valid solution for x86 on memory orderings up to acq_rel. For other
architectures, they may still need barriers. For seq_cst, we either need a
barrier or we need to execute the CMPXCHG at least once. 

Therefore, the emitted code might want to optimistically execute the operation
once and, if it fails, enter the load loop. That's a slightly longer codegen.
Whether we want that under -Os or not, you'll have to be the judge.

Prior art: glibc/sysdeps/x86_64/nptl/pthread_spin_lock.S:
ENTRY(__pthread_spin_lock)
1:  LOCK
decl0(%rdi)
jne 2f
xor %eax, %eax
ret

.align  16
2:  rep
nop
cmpl$0, 0(%rdi)
jg  1b
jmp 2b
END(__pthread_spin_lock)

This does the atomic operation once, hoping it'll succeed. If it fails, it
enters the PAUSE+CMP+JG loop until the value is suitable.

[Bug libstdc++/101583] [12 Regression] error: use of deleted function when building gold

2021-10-14 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101583

Thiago Macieira  changed:

   What|Removed |Added

 CC||thiago at kde dot org

--- Comment #10 from Thiago Macieira  ---
(In reply to Jonathan Wakely from comment #9)
> Thanks for noticing I missed it.

Well, I had to clear C++'s good name when Arjan complained that he couldn't
understand what the compiler was telling him about  :)

[Bug middle-end/102566] [i386] GCC should emit LOCK BTS for simple bit-test-and-set operations with std::atomic

2021-10-07 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102566

--- Comment #26 from Thiago Macieira  ---
(In reply to H.J. Lu from comment #25)
> Can you get some performance improvement data on real workloads?

Will ask.

[Bug middle-end/102566] [i386] GCC should emit LOCK BTS for simple bit-test-and-set operations with std::atomic

2021-10-07 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102566

--- Comment #24 from Thiago Macieira  ---
(In reply to H.J. Lu from comment #23)
> I renamed the commit title.  The new v3 is the v6 + fixes.

Got it. Still no issues.

[Bug middle-end/102566] [i386] GCC should emit LOCK BTS for simple bit-test-and-set operations with std::atomic

2021-10-06 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102566

--- Comment #22 from Thiago Macieira  ---
(In reply to H.J. Lu from comment #21)
> Created attachment 51559 [details]
> The new v3 patch
> 
> The new v3 patch to check invalid mask.

v3? We were already up to v6.

[Bug middle-end/102566] [i386] GCC should emit LOCK BTS for simple bit-test-and-set operations with std::atomic

2021-10-06 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102566

--- Comment #20 from Thiago Macieira  ---
And:

$ cat /tmp/test.cpp 
#include 
bool tbit(std::atomic )
{
  return i.fetch_xor(CONSTANT, std::memory_order_relaxed) & (CONSTANT);
}
$ ~/dev/gcc/bin/gcc "-DCONSTANT=(1LL<<63)" -S -o - -O2 /tmp/test.cpp | sed
'1,/startproc/d;/endproc/,$d'
lock btcq   $63, (%rdi)
setc%al
ret

Nice!

[Bug middle-end/102566] [i386] GCC should emit LOCK BTS for simple bit-test-and-set operations with std::atomic

2021-10-05 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102566

--- Comment #19 from Thiago Macieira  ---
(In reply to H.J. Lu from comment #17)
> Created attachment 51558 [details]
> The v6 patch
> 
> Please try this.

Confirmed for all inputs.

[Bug middle-end/102566] [i386] GCC should emit LOCK BTS for simple bit-test-and-set operations with std::atomic

2021-10-05 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102566

--- Comment #15 from Thiago Macieira  ---
Works now for the failing case. Additionally:

bool tbit(std::atomic )
{
  return i.fetch_and(~CONSTANT, std::memory_order_relaxed) & (CONSTANT);
}

Will properly produce LOCK BTR (CONSTANT=2):

lock btrq   $1, (%rdi)
setc%al
ret

CONSTANT=(1L<<62):

lock btrq   $62, (%rdi)
setc%al
ret

But not for CONSTANT=1 or CONSTANT=(1L<<63):
movq(%rdi), %rax
.L2:
movq%rax, %rcx
movq%rax, %rdx
andq$-2, %rcx
lock cmpxchgq   %rcx, (%rdi)
jne .L2
movl%edx, %eax
andl$1, %eax
ret

Same applies to 1<<31 for atomic.

[Bug middle-end/102566] [i386] GCC should emit LOCK BTS for simple bit-test-and-set operations with std::atomic

2021-10-04 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102566

--- Comment #12 from Thiago Macieira  ---
Commit 7e0c0500808d58bca5b8e23cbd474022c32234e4 + your patch.

[Bug middle-end/102566] [i386] GCC should emit LOCK BTS for simple bit-test-and-set operations with std::atomic

2021-10-04 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102566

--- Comment #11 from Thiago Macieira  ---
$ for ((i=0;i<32;++i)); do ~/dev/gcc/bin/gcc "-DCONSTANT=(1<<$i)" -S -o - -O2
/tmp/test.cpp | grep bts; done 
lock btsl   $0, (%rdi)
lock btsl   $1, (%rdi)
lock btsl   $2, (%rdi)
lock btsl   $3, (%rdi)
lock btsl   $4, (%rdi)
lock btsl   $5, (%rdi)
lock btsl   $6, (%rdi)
lock btsl   $7, (%rdi)
lock btsl   $8, (%rdi)
lock btsl   $9, (%rdi)
lock btsl   $10, (%rdi)
lock btsl   $11, (%rdi)
lock btsl   $12, (%rdi)
lock btsl   $13, (%rdi)
lock btsl   $14, (%rdi)
lock btsl   $15, (%rdi)
lock btsl   $16, (%rdi)
lock btsl   $17, (%rdi)
lock btsl   $18, (%rdi)
lock btsl   $19, (%rdi)
lock btsl   $20, (%rdi)
lock btsl   $21, (%rdi)
lock btsl   $22, (%rdi)
lock btsl   $23, (%rdi)
lock btsl   $24, (%rdi)
lock btsl   $25, (%rdi)
lock btsl   $26, (%rdi)
lock btsl   $27, (%rdi)
lock btsl   $28, (%rdi)
lock btsl   $29, (%rdi)
lock btsl   $30, (%rdi)
lock btsl   $31, (%rdi)

And after changing to long:

$ for ((i=32;i<64;++i)); do ~/dev/gcc/bin/gcc "-DCONSTANT=(1L<<$i)" -S -o - -O2
/tmp/test.cpp | grep bts; done
lock btsq   $32, (%rdi)
lock btsq   $33, (%rdi)
lock btsq   $34, (%rdi)
lock btsq   $35, (%rdi)
lock btsq   $36, (%rdi)
lock btsq   $37, (%rdi)
lock btsq   $38, (%rdi)
lock btsq   $39, (%rdi)
lock btsq   $40, (%rdi)
lock btsq   $41, (%rdi)
lock btsq   $42, (%rdi)
lock btsq   $43, (%rdi)
lock btsq   $44, (%rdi)
lock btsq   $45, (%rdi)
lock btsq   $46, (%rdi)
lock btsq   $47, (%rdi)
lock btsq   $48, (%rdi)
lock btsq   $49, (%rdi)
lock btsq   $50, (%rdi)
lock btsq   $51, (%rdi)
lock btsq   $52, (%rdi)
lock btsq   $53, (%rdi)
lock btsq   $54, (%rdi)
lock btsq   $55, (%rdi)
lock btsq   $56, (%rdi)
lock btsq   $57, (%rdi)
lock btsq   $58, (%rdi)
lock btsq   $59, (%rdi)
lock btsq   $60, (%rdi)
lock btsq   $61, (%rdi)
lock btsq   $62, (%rdi)
lock btsq   $63, (%rdi)

But:

$ cat /tmp/test2.cpp 
#include 
bool tbit(std::atomic )
{
  return i.fetch_or(1, std::memory_order_relaxed) & (~1);
}
$ ~/dev/gcc/bin/gcc -S -o - -O2 /tmp/test2.cpp
.file   "test.cpp"
.text
/tmp/test.cpp: In function ‘bool tbit(std::atomic&)’:
/tmp/test.cpp:2:6: error: type mismatch in binary expression
2 | bool tbit(std::atomic )
  |  ^~~~
long int

long unsigned int

__int_type

_9 = _6 & -2;
during GIMPLE pass: fab
/tmp/test.cpp:2:6: internal compiler error: verify_gimple failed
0x119fbba verify_gimple_in_cfg(function*, bool)
/home/tjmaciei/src/gcc/gcc/tree-cfg.c:5576
0x106ced7 execute_function_todo
/home/tjmaciei/src/gcc/gcc/passes.c:2042
0x106d8fb execute_todo
/home/tjmaciei/src/gcc/gcc/passes.c:2096
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See  for instructions.

[Bug middle-end/102566] [i386] GCC should emit LOCK BTS for simple bit-test-and-set operations with std::atomic

2021-10-04 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102566

--- Comment #9 from Thiago Macieira  ---
Looks like it doesn't work for the sign bit.

$ cat /tmp/test.cpp 
#include 
bool tbit(std::atomic )
{
return i.fetch_or(CONSTANT, std::memory_order_relaxed) & CONSTANT;
}
$ ~/dev/gcc/bin/gcc -DCONSTANT='(1<<30)' -S -o - -O2 /tmp/test.cpp | sed -n
'/startproc/,/endproc/p'
.cfi_startproc
lock btsl   $30, (%rdi)
setc%al
ret
.cfi_endproc
$ ~/dev/gcc/bin/gcc -DCONSTANT='(1<<31)' -S -o - -O2 /tmp/test.cpp | sed -n
'/startproc/,/endproc/p'
.cfi_startproc
movl(%rdi), %eax
.L2:
movl%eax, %ecx
movl%eax, %edx
orl $-2147483648, %ecx
lock cmpxchgl   %ecx, (%rdi)
jne .L2
shrl$31, %edx
movl%edx, %eax
ret
.cfi_endproc

Changing to std::atomic makes no difference.

[Bug middle-end/102566] [i386] GCC should emit LOCK BTS for simple bit-test-and-set operations with std::atomic

2021-10-04 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102566

--- Comment #8 from Thiago Macieira  ---
$ cat /tmp/test.cpp  
#include 
bool tbit(std::atomic )
{
   return i.fetch_or(1, std::memory_order_relaxed) & 1;
}
$ ~/dev/gcc/bin/gcc -S -o - -O2 /tmp/test.cpp  
   .file   "test.cpp"
   .text
   .p2align 4
   .globl  _Z4tbitRSt6atomicIiE
   .type   _Z4tbitRSt6atomicIiE, @function
_Z4tbitRSt6atomicIiE:
.LFB339:
   .cfi_startproc
   lock btsl   $0, (%rdi)
   setc%al
   ret
   .cfi_endproc
.LFE339:
   .size   _Z4tbitRSt6atomicIiE, .-_Z4tbitRSt6atomicIiE
   .ident  "GCC: (GNU) 12.0.0 20211004 (experimental)"
   .section.note.GNU-stack,"",@progbits

+1

[Bug middle-end/102566] [i386] GCC should emit LOCK BTS for simple bit-test-and-set operations with std::atomic

2021-10-04 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102566

--- Comment #7 from Thiago Macieira  ---
(In reply to H.J. Lu from comment #5)
> Created attachment 51536 [details]
> A patch
> 
> Please try this.

Give me an hour (will try v2).

[Bug target/102566] New: [i386] GCC should emit LOCK BTS for simple bit-test-and-set operations with std::atomic

2021-10-02 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102566

Bug ID: 102566
   Summary: [i386] GCC should emit LOCK BTS for simple
bit-test-and-set operations with std::atomic
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

Simple test:

$ cat test.cpp
#include 
bool tbit(std::atomic )
{
return i.fetch_or(1, std::memory_order_relaxed) & 1;
}

The sequence x.fetch_or(singlebit_constant) & singlebit_constant can be
implemented by a LOCK BTS sequence. The above should emit:

lock bts $1, (%rdi)
setb %al
ret

But instead it emits a cmpxchg loop - see https://gcc.godbolt.org/z/99enKaffa.

This was found reviewing MariaDB lightweight-mutex code, which uses the sign
bit to indicate a contended mutex. See this commit[1] by one of their
maintainers for the removal of fetch_or because it emits an extra loop.

Bonus: LOCK BTR can be used in the sequence x.fetch_and(~single_bit_constant) &
single_bit_constant

[1]
https://github.com/dr-m/atomic_sync/commit/d5e22b2d42cdbac7a15d242bf1446377555c4041

[Bug target/102166] [i386] AMX intrinsics and macros not defined in C++

2021-09-01 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102166

--- Comment #9 from Thiago Macieira  ---
> clang defines them as intrinsic because they support AMX register allocation
> (a lot of effort), gcc does not support AMX register allocation for now, and
> defining them as intrinsic + builtin doesn't seem to do much good except
> provide some error messages.

If you can implement them as macros, I don't see why you need register
allocation in the first place. Just emit the same assembly that is being
emitted now by the inline assembly.

Anyway, I suggest at a minimum removing the #define check. There's little harm
in having no diagnostic on misuse: misuses are probably going to be seen when
testing. Until GCC is able to generate AMX code on its own, the missing
__attribute__ is superfluous anyway.

[Bug target/102166] [i386] AMX intrinsics and macros not defined in C++

2021-09-01 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102166

--- Comment #6 from Thiago Macieira  ---
> I suggest doing as Clang did and make it an intrinsic.

Or even a __builtin_ia32_markamxtile(); intrinsic, which produces the error if
misused and does add the necessary bits to the .note.gnu.property section

[Bug target/102166] [i386] AMX intrinsics and macros not defined in C++

2021-09-01 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102166

--- Comment #5 from Thiago Macieira  ---
(In reply to Hongtao.liu from comment #4)
> Because _tile_loadd is implemented as embedded assembly plus macros, if
> __AMX_TILE__ is removed, no error will be reported if the user does not use
> the -mamx option, So this macro is added here, but obviously this is not
> convenient for target_attribute. I think we'd better remove __AMX_TILE__,
> (not sure why c doesn't report the error).

I suggest doing as Clang did and make it an intrinsic.

[Bug target/102166] [i386] AMX intrinsics and macros not defined in C++

2021-09-01 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102166

--- Comment #3 from Thiago Macieira  ---
There appears to be some preprocessor magic behind the scenes because the
preprocessed output can't be compiled either:

$ gcc -no-integrated-cpp -Werror=implicit-function-declaration -c -xc test.cpp
test.cpp: In function ‘amx’:
test.cpp:10:5: error: implicit declaration of function ‘_tile_loadd’
[-Werror=implicit-function-declaration]
   10 | _tile_loadd(0, 0, 0);
  | ^~~
test.cpp:11:5: error: implicit declaration of function ‘_tile_release’
[-Werror=implicit-function-declaration]
   11 | _tile_release();
  | ^
cc1: some warnings being treated as errors

[Bug target/102166] [i386] AMX intrinsics and macros not defined in C++

2021-09-01 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102166

--- Comment #2 from Thiago Macieira  ---
FYI:

$ cat test.cpp
#include 

__attribute__((target("avx"))) void avx()
{
_mm256_zeroall();
}

#ifndef __INTEL_COMPILER
__attribute__((target("amx-tile")))
#endif
void amx()
{
_tile_loadd(0, 0, 0);
_tile_release();
}
$ icc -c test.cpp && echo success
success
$ icc -c -xc test.cpp && echo success
success
$ clang -c -xc test.cpp && echo success
success
$ clang -c test.cpp && echo success
success

$ clang --version
clang version 11.1.0
Target: x86_64-generic-linux
Thread model: posix
InstalledDir: /usr/bin
$ icc --version
icc (ICC) 19.1.3.304 20200925
Copyright (C) 1985-2020 Intel Corporation.  All rights reserved.

[Bug target/102166] [i386] AMX intrinsics and macros not defined in C++

2021-09-01 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102166

Thiago Macieira  changed:

   What|Removed |Added

 CC||hjl.tools at gmail dot com

--- Comment #1 from Thiago Macieira  ---
I don't understand how this compiles in C mode:

$ gcc -O2 -Werror=implicit-function-declaration -c -xc test.cpp
$ gcc -O2 -fno-asynchronous-unwind-tables  -S -o - -xc test.cpp
.file   "test.cpp"
.text
.p2align 4
.globl  avx
.type   avx, @function
avx:
vzeroall
ret
.size   avx, .-avx
.p2align 4
.globl  amx
.type   amx, @function
amx:
xorl%eax, %eax
#APP
# 10 "test.cpp" 1
tileloadd   (%rax,%rax,1), %tmm0
# 0 "" 2
# 56 "/usr/lib64/gcc/x86_64-generic-linux/11/include/amxtileintrin.h" 1
tilerelease
# 0 "" 2
#NO_APP
ret
.size   amx, .-amx

The comments in the assembly output indicate that _tile_loadd is a macro that
was expanded from test.cpp and that _tile_release is a function in
amxtileintrin.h.

But neither is defined.

$ gcc -E -xc test.cpp | grep -e _tile_loadd -e _tile_release
_tile_loadd(0, 0, 0);
_tile_release();

Whatever they are, they are not visible to the preprocessor.

[Bug target/102166] New: [i386] AMX intrinsics and macros not defined in C++

2021-09-01 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102166

Bug ID: 102166
   Summary: [i386] AMX intrinsics and macros not defined in C++
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

$ cat test.cpp
#include 

__attribute__((target("avx"))) void avx()
{
_mm256_zeroall();
}

__attribute__((target("amx-tile"))) void amx()
{
_tile_loadd(0, 0, 0);
_tile_release();
}
$ gcc -c test.cpp
test.cpp: In function ‘void amx()’:
test.cpp:10:5: error: ‘_tile_loadd’ was not declared in this scope
   10 | _tile_loadd(0, 0, 0);
  | ^~~
test.cpp:11:5: error: ‘_tile_release’ was not declared in this scope
   11 | _tile_release();
  | ^

That's because the macros and intrinsics in amxtileintrin.h are only defined
behind:

#if defined(__x86_64__) && defined(__AMX_TILE__)

The __AMX_TILE__ macro isn't defined and doesn't need to be. None of the other
itnrinsics require compiling with -m options. In fact, code shouldn't use -m
options for things that are detected at runtime, like AMX inevitably has to be.

[Bug libstdc++/99277] C++2a synchronisation is inefficient in GCC 11

2021-04-27 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99277

--- Comment #8 from Thiago Macieira  ---
This one is probably 12.0.

[Bug target/100005] undefined reference to `_rdrand64_step'

2021-04-12 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=15

--- Comment #14 from Thiago Macieira  ---
(In reply to Jakub Jelinek from comment #13)
> The same like in C.
> I.e.
> extern inline __attribute__((gnu_inline, always_inline, artificial)) int foo
> (int x) { return x; }
> // The above is typically from some header
> int foo (int x) { return x; }
> // The above is the out of line function definition

Thanks, Jakub. At first sight that's not valid C++, but then since it's an
extension it doesn't have to be. ICC even accepts the same syntax and generates
the same non-weak symbol.

Any way to do that without repeating the body, thus potentially causing an ODR
violation? I'm not likely to use this feature, but asking for a rainy day.

https://gcc.godbolt.org/z/96qW9ExcG

[Bug target/100005] undefined reference to `_rdrand64_step'

2021-04-12 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=15

--- Comment #12 from Thiago Macieira  ---
(In reply to Richard Biener from comment #11)
> Invalid.  Note we can't really diagnose GNU extern inline address-taking
> since
> by definition that's allowed (just the definition needs to come from
> elsewhere).

Understood. Thanks for looking into the report.

Out of curiosity, how does one provide an extern inline's out-of-line copy in
C++?

[Bug c/100005] undefined reference to `_rdrand64_step'

2021-04-09 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=15

--- Comment #6 from Thiago Macieira  ---
(In reply to Jakub Jelinek from comment #5)
> then one would get an out of line copy when taking their address, but it 
> would 
> duplicated in all the TUs that did this.

That's not a problem, since that's only for debug mode builds. In release
builds, they should get properly inlined.

> Anyway, your assumption that intrinsics can be used the way you expect them
> is just wrong.

If you say so, then please close as WONTFIX or NOTABUG. And indeed the ones
that are implemented as macros can't have their address taken anyway, since
macros don't have address.

I would suggest a better error message, though, if "just works" is not
possible.

[Bug c/100005] undefined reference to `_rdrand64_step'

2021-04-09 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=15

--- Comment #4 from Thiago Macieira  ---
That's an artificial (pun intended) limitation.

In C++:

template 
int fill_array(Generator generator, unsigned long long *rand_array)

Also errors out with the same error, but works if you do:

fill_array([](auto x) { return _rdrand64_step(x); }, rand_array);

The extra indirection shouldn't be required.

PS: clang compiles the same code just fine.

[Bug c/100005] New: undefined reference to `_rdrand64_step'

2021-04-09 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=15

Bug ID: 15
   Summary: undefined reference to `_rdrand64_step'
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: thiago at kde dot org
  Target Milestone: ---

$ cat rdrand.c
#include 

#define NUM_RANDOM_NUMBERS_TO_GENERATE  1024

typedef int (*Generator)(unsigned long long *);

int fill_array(Generator generator, unsigned long long *rand_array)
{
for (int i = 0; i < NUM_RANDOM_NUMBERS_TO_GENERATE; i++) {
// fast attempt once:
if (__builtin_expect(generator(_array[i]), 1))
continue;

// retry up to 16 times
int j;
for (j = 0; j < 16; ++j) {
if (generator(_array[i]))
break;
}
if (j == 16) {
// failed, the RNG is out of entropy
return -1;
}
}

return 0;
}

int main()
{
unsigned long long rand_array[NUM_RANDOM_NUMBERS_TO_GENERATE];
fill_array(_rdrand64_step, rand_array);
}

$ ~/dev/gcc/bin/gcc -march=haswell -O2 rdrand.c 
/usr/bin/ld: /tmp/ccTlQIsV.o: in function `main':
rdrand.c:(.text.startup+0x8): undefined reference to `_rdrand64_step'
collect2: error: ld returned 1 exit status

$ ~/dev/gcc/bin/gcc --version  
gcc (GCC) 11.0.1 20210325 (experimental)
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Happens in C++ too, including passing as a template parameter.

[Bug c++/69549] Named Address Spaces does not compile in C++

2021-03-22 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69549

--- Comment #7 from Thiago Macieira  ---
(In reply to Andrew Pinski from comment #6)
> The above is not the reason why namespaces are not handled in GCC's C++
> front-end.  The reason why they are not handled in C++ is because you need
> to handle them in overloads and templates correctly.  Does clang handle
> those correctly or does it ignore that issue?

It handles them:

$ clang -O2 -S -o - -include stdint.h -xc++ - <<<'template   void
f(T); void f() { auto tib = (void * __seg_fs*)(0); f(tib); }' | c++filt 
.text
.file   "-"
.globl  f()   # -- Begin function f()
.p2align4, 0x90
.type   f(),@function
f():  # @f()
.cfi_startproc
# %bb.0:
xorl%edi, %edi
jmp void f(void* AS257*) # TAILCALL

The mangled symbol was _Z1fIPU5AS257PvEvT_. That "AS257" is encoded as U5AS257,
which is an extended qualifier.

:
5.1.5.1 Qualified types


   ::=  

   ::= * 
   ::= U  [] # vendor extended
type qualifier
::= [r] [V] [K]# restrict (C99), volatile, const

::= R  # & ref-qualifier
::= O  # && ref-qualifier

[Bug c++/69549] Named Address Spaces does not compile in C++

2021-03-22 Thread thiago at kde dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69549

--- Comment #5 from Thiago Macieira  ---
BTW, Clang solved this by making __seg_fs, __seg_gs macros that resolve to
__attribute__:

$ clang -dM -E -xc /dev/null | grep __seg_.s
#define __seg_fs __attribute__((address_space(257)))
#define __seg_gs __attribute__((address_space(256)))

That way, they don't need to be deduced as qualifiers in C, like const,
volatile and _Atomic.

So this compiles with Clang in C++:

void *tid() { auto tib = (void * __seg_fs*)(0); return *tid; }

_Z3tibv:# @_Z3tibv
movq%fs:0, %rax
retq

  1   2   3   >