[Bug d/113125] New: [D] internal compiler error: in make_import, at d/imports.cc:48

2023-12-23 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113125

Bug ID: 113125
   Summary: [D] internal compiler error: in make_import, at
d/imports.cc:48
   Product: gcc
   Version: 13.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: d
  Assignee: ibuclaw at gdcproject dot org
  Reporter: witold.baryluk+gcc at gmail dot com
  Target Milestone: ---

Debian testing, amd64, gcc version 13.2.0 (Debian 13.2.0-7) 


meta.d:

```
module objc.meta;
struct A;
```


runtime.d:

```
module objc.runtime;
public import meta : A;
```


gdc -v -c -I. runtime.d

```
$ gdc -v -c -I. runtime.d 
Using built-in specs.
COLLECT_GCC=gdc
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 13.2.0-7'
--with-bugurl=file:///usr/share/doc/gcc-13/README.Bugs
--enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr
--with-gcc-major-version-only --program-suffix=-13
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/libexec --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib
--enable-libphobos-checking=release --with-target-system-zlib=auto
--enable-objc-gc=auto --enable-multiarch --disable-werror --enable-cet
--with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32
--enable-multilib --with-tune=generic
--enable-offload-targets=nvptx-none=/build/reproducible-path/gcc-13-13.2.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/reproducible-path/gcc-13-13.2.0/debian/tmp-gcn/usr
--enable-offload-defaulted --without-cuda-driver --enable-checking=release
--build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
--with-build-config=bootstrap-lto-lean --enable-link-serialization=3
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 13.2.0 (Debian 13.2.0-7) 
COLLECT_GCC_OPTIONS='-v' '-c' '-I' '.' '-o' 'runtime.o' '-shared-libgcc'
'-mtune=generic' '-march=x86-64'
 /usr/libexec/gcc/x86_64-linux-gnu/13/d21 runtime.d -quiet -dumpbase runtime.d
-dumpbase-ext .d -mtune=generic -march=x86-64 -version -imultiarch
x86_64-linux-gnu -I . -v -o /tmp/ccPyiN0m.s
GNU D (Debian 13.2.0-7) version 13.2.0 (x86_64-linux-gnu)
compiled by GNU C version 13.2.0, GMP version 6.3.0, MPFR version
4.2.1, MPC version 1.3.1, isl version isl-0.26-GMP

GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
binary    /usr/libexec/gcc/x86_64-linux-gnu/13/d21
version   v2.103.1

predefs   GNU D_Version2 LittleEndian GNU_DWARF2_Exceptions GNU_StackGrowsDown
GNU_InlineAsm D_LP64 D_PIC D_PIE assert D_PreConditions D_PostConditions
D_Invariants D_ModuleInfo D_Exceptions D_TypeInfo all X86_64 D_HardFloat Posix
linux CRuntime_Glibc CppRuntime_Gcc
parse     runtime
importall runtime
import    meta  (meta.d)
import    object  (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/object.d)
import    core.attribute  (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/attribute.d)
import    gcc.attributes  (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/gcc/attributes.d)
import    core.internal.hash  (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/internal/hash.d)
import    core.internal.traits  (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/internal/traits.d)
import    core.internal.entrypoint  (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/internal/entrypoint.d)
import    core.internal.array.appending  (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/internal/array/appending.d)
import    core.internal.array.comparison  (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/internal/array/comparison.d)
import    core.internal.array.equality  (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/internal/array/equality.d)
import    core.internal.array.casting  (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/internal/array/casting.d)
import    core.internal.array.concatenation  (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/internal/array/concatenation.d)
import    core.internal.array.construction  (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/internal/array/construction.d)
import    core.internal.array.arrayassign  (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/internal/array/arrayassign.d)
import    core.internal.array.capacity  (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/internal/array/capacity.d)
import    core.internal.dassert  (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/internal/dassert.d)
import    core.atomic  (/usr/lib/gcc/x86_64-linux-gnu/13/include/d/core/atomic.d)
import    core.internal.attributes  (/usr/lib/gcc/x86_64-linux
```

[Bug d/110516] core.volatile.volatileLoad discarded if result is unused

2023-07-01 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110516

--- Comment #9 from Witold Baryluk  ---
Thank you for a quick fix Iain!

[Bug d/110516] core.volatile.volatileLoad discarded if result is unused

2023-07-01 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110516

--- Comment #8 from Witold Baryluk  ---
I see.

Point 1 is definitely incorrect. I misread the assembler:

void example.actualRun(ubyte*):
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
nop
pop rbp
ret


The move there is just stack manipulation (the pointer argument being stored
to the stack); it has nothing to do with volatileLoad.



You are right about the side effect visibility and volatileStore.

Still, there should be a way to express a real memory read whose result is not
stored anywhere in the program (just written to a register, then discarded).

This has some (not very common) uses in memory-mapped IO, i.e. in drivers for
devices where the read itself can signal something (this of course usually
also requires setting proper page table attributes to disable caching or other
optimizations, not just a volatile load in the machine code). I do not have
specific examples at hand, but afaik I saw some in the past (mostly on older
architectures), as well as some watchdog chips that reset their timer on read.

Another use is memory and cache read benchmarking and profiling. We want to
issue a read (to a register) from some memory location, but we do not need
the value for anything else.

A more esoteric use is memory probing. On some low-level systems, the kernel
or bootloader might not know the memory layout, and resorts to just doing
reads and relying on CPU fault handlers to report invalid ones.

And some people might use a load without a destination as a prefetch hint, or
to prefault some memory pages.
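
For comparison only, here is a minimal C++ sketch (not D, and not the druntime implementation) of the "read and discard" pattern described above. The names `read_and_discard` and `touch_pages` are hypothetical; the only assumption is the usual C++ rule that an access through a volatile-qualified lvalue has to be performed even if the value is never used.

```cpp
#include <cstddef>
#include <cstdint>

// Performs one load from *p and throws the value away.
// The access goes through a volatile-qualified lvalue, so the compiler
// is expected to keep the load even though the result is unused.
inline void read_and_discard(const volatile std::uint8_t* p) {
    std::uint8_t value = *p;  // volatile read
    (void)value;              // result intentionally ignored
}

// Prefault / probe sketch: reads one byte every `stride` bytes in
// [base, base + len) and returns how many reads were issued.
std::size_t touch_pages(const volatile std::uint8_t* base,
                        std::size_t len, std::size_t stride) {
    if (stride == 0) return 0;  // avoid an endless loop
    std::size_t reads = 0;
    for (std::size_t off = 0; off < len; off += stride) {
        read_and_discard(base + off);
        ++reads;
    }
    return reads;
}
```

With a 4096-byte stride this would touch one byte per page on a typical x86-64 Linux configuration; whether that is a useful prefault strategy depends on the system.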

[Bug d/110516] New: core.volatile.volatileLoad is broken

2023-07-01 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110516

Bug ID: 110516
   Summary: core.volatile.volatileLoad is broken
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: d
  Assignee: ibuclaw at gdcproject dot org
  Reporter: witold.baryluk+gcc at gmail dot com
  Target Milestone: ---

gcc 12.2.0 (from Debian stable) and gcc trunk 14.0.0 (in godbolt) tested.

core.volatile.volatileLoad simply does not work.

1) It merges loads.
2) It removes unused loads at -O1 and higher.

Example:

void actualRun(ubyte* ptr1) {
  import core.volatile : volatileLoad;
  volatileLoad(ptr1);
  volatileLoad(ptr1);
  volatileLoad(ptr1);
  volatileLoad(ptr1);
}


Without optimisations:

void example.actualRun(ubyte*):
push rbp
mov rbp, rsp
mov QWORD PTR [rbp-8], rdi
nop
pop rbp
ret


Incorrect.



With optimisations:

void example.actualRun(ubyte*):
ret

Incorrect.


Expected:

void example.actualRun(ubyte*):
movzx   eax, byte ptr [rdi]
movzx   eax, byte ptr [rdi]
movzx   eax, byte ptr [rdi]
movzx   eax, byte ptr [rdi]
ret



dmd and ldc behave properly.


It looks like it never worked properly.

It would be good to have a test case for this, so it does not regress later.


I did not test volatileStore, but I would not be surprised if it is also
broken.

[Bug d/110113] gdc -fpreview=dip1021 crash in d/dmd/root/aav.d:127 dmd_aaGetRvalue from DsymbolTable::lookup(Identifier const*)

2023-06-11 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110113

--- Comment #10 from Witold Baryluk  ---
Thank you Iain. Amazing debugging skills.

BTW. `import std;` was because dustmite reduced original import to just that.
Original import was `import std.math.algebraic : sqrt;`

But you already figured this out without even using Phobos.

[Bug d/110113] gdc -fpreview=dip1021 crash in d/dmd/root/aav.d:127 dmd_aaGetRvalue from DsymbolTable::lookup(Identifier const*)

2023-06-04 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110113

--- Comment #2 from Witold Baryluk  ---
Also FYI, I was not able to trigger this on DMD64 D Compiler v2.104.0

[Bug d/110113] gdc -fpreview=dip1021 crash in d/dmd/root/aav.d:127 dmd_aaGetRvalue from DsymbolTable::lookup(Identifier const*)

2023-06-04 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110113

--- Comment #1 from Witold Baryluk  ---
BTW. Adding return statement in `raytrace`, does not change anything:

```
user@debian:~$ gdc-13 -c -fpreview=dip1021 lup.d
user@debian:~$ gdc-13 -c -fpreview=dip1021 lup.d
user@debian:~$ gdc-13 -c -fpreview=dip1021 lup.d
user@debian:~$ gdc-13 -c -fpreview=dip1021 lup.d
user@debian:~$ gdc-13 -c -fpreview=dip1021 lup.d
/usr/lib/gcc/x86_64-linux-gnu/13/include/d/std/math/algebraic.d:968:47:
internal compiler error: Segmentation fault
  968 | return cast(Unqual!T) (T(1) << bsr(val) + type);
  |   ^
0xd32f86 crash_signal
../../src/gcc/toplev.cc:314
0x7f7144273f8f ???
./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
0x17f7d10 _D3dmd4root3aav15dmd_aaGetRvalueFNaNbNiPSQBnQBmQBk2AAPvZQd
../../src/gcc/d/dmd/root/aav.d:127
0x1706b25 DsymbolTable::lookup(Identifier const*)
../../src/gcc/d/dmd/dsymbol.d:2408
0x1706b25 ScopeDsymbol::search(Loc const&, Identifier*, int)
../../src/gcc/d/dmd/dsymbol.d:1470
...
...
```

[Bug d/110113] New: gdc -fpreview=dip1021 crash in d/dmd/root/aav.d:127 dmd_aaGetRvalue from DsymbolTable::lookup(Identifier const*)

2023-06-04 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110113

Bug ID: 110113
   Summary: gdc -fpreview=dip1021 crash in d/dmd/root/aav.d:127
dmd_aaGetRvalue from DsymbolTable::lookup(Identifier
const*)
   Product: gcc
   Version: 13.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: d
  Assignee: ibuclaw at gdcproject dot org
  Reporter: witold.baryluk+gcc at gmail dot com
  Target Milestone: ---

Created attachment 55254
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55254&action=edit
Minimized test case with dustmite

Debian Linux amd64, experimental gcc-13, gdc 13.1.0-3


This is not very deterministic. Run it a few times to trigger the crash.

```
user@debian:~$ cat lup.d
class LUBench {
}
float lup(ulong , ulong , int , int = 1) {
double[] solution;
new LUBench;
return solution[0] ;
}
float lup_3200(ulong iters, ulong flops) {
return lup(iters, flops, 3200);
}
float raytrace() {
struct V {
float x, y, z;
auto normalize() {
}
import std;
auto cross() {
}
auto norm2() {
}
auto norm() {
}
auto opBinary(){
}
}
}
user@debian:~$ gdc-13 -c -fpreview=dip1021 lup.d
lup.d:11:7: error: function ‘lup.raytrace’ has no ‘return’ statement, but is
expected to return a value of type ‘float’
   11 | float raytrace() {
  |   ^
user@debian:~$ gdc-13 -c -fpreview=dip1021 lup.d
lup.d:11:7: error: function ‘lup.raytrace’ has no ‘return’ statement, but is
expected to return a value of type ‘float’
   11 | float raytrace() {
  |   ^
user@debian:~$ gdc-13 -c -fpreview=dip1021 lup.d
lup.d:11:7: error: function ‘lup.raytrace’ has no ‘return’ statement, but is
expected to return a value of type ‘float’
   11 | float raytrace() {
  |   ^
user@debian:~$ gdc-13 -c -fpreview=dip1021 lup.d
lup.d:11:7: error: function ‘lup.raytrace’ has no ‘return’ statement, but is
expected to return a value of type ‘float’
   11 | float raytrace() {
  |   ^
user@debian:~$ gdc-13 -c -fpreview=dip1021 lup.d
/usr/lib/gcc/x86_64-linux-gnu/13/include/d/std/math/algebraic.d:968:47:
internal compiler error: Segmentation fault
  968 | return cast(Unqual!T) (T(1) << bsr(val) + type);
  |   ^
0xd32f86 crash_signal
../../src/gcc/toplev.cc:314
0x7f53b651cf8f ???
./signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
0x17f7d10 _D3dmd4root3aav15dmd_aaGetRvalueFNaNbNiPSQBnQBmQBk2AAPvZQd
../../src/gcc/d/dmd/root/aav.d:127
0x1706b25 DsymbolTable::lookup(Identifier const*)
../../src/gcc/d/dmd/dsymbol.d:2408
0x1706b25 ScopeDsymbol::search(Loc const&, Identifier*, int)
../../src/gcc/d/dmd/dsymbol.d:1470
0x17ef5b3
_D3dmd6opover15search_functionFCQBe7dsymbol12ScopeDsymbolCQCe10identifier10IdentifierZCQDhQCd7Dsymbol
../../src/gcc/d/dmd/opover.d:1435
0x1701fe0 search_toString(StructDeclaration*)
../../src/gcc/d/dmd/dstruct.d:51
0x180310a semanticTypeInfoMembers(StructDeclaration*)
../../src/gcc/d/dmd/semantic3.d:1650
0x1803394 Semantic3Visitor::visit(AggregateDeclaration*)
../../src/gcc/d/dmd/semantic3.d:1590
0x17fef19 semantic3(Dsymbol*, Scope*)
../../src/gcc/d/dmd/semantic3.d:83
0x175dc89 ExpressionSemanticVisitor::visit(DeclarationExp*)
../../src/gcc/d/dmd/expressionsem.d:5572
0x175dc89 ExpressionSemanticVisitor::visit(DeclarationExp*)
../../src/gcc/d/dmd/expressionsem.d:5407
0x175eb82 expressionSemantic(Expression*, Scope*)
../../src/gcc/d/dmd/expressionsem.d:12706
0x18096fa StatementSemanticVisitor::visit(ExpStatement*)
../../src/gcc/d/dmd/statementsem.d:207
0x18228c1 statementSemantic(Statement*, Scope*)
../../src/gcc/d/dmd/statementsem.d:149
0x18228c1 StatementSemanticVisitor::visit(CompoundStatement*)
../../src/gcc/d/dmd/statementsem.d:270
0x1809112 statementSemantic(Statement*, Scope*)
../../src/gcc/d/dmd/statementsem.d:149
0x18002a1 Semantic3Visitor::visit(FuncDeclaration*)
../../src/gcc/d/dmd/semantic3.d:598
0x17feae4 semantic3(Dsymbol*, Scope*)
../../src/gcc/d/dmd/semantic3.d:83
0x17feae4 Semantic3Visitor::visit(Module*)
../../src/gcc/d/dmd/semantic3.d:205
Please submit a full bug report, with preprocessed source (by using
-freport-bug).
Please include the complete backtrace with any bug report.
See  for instructions.
user@debian:~$ 
```


Could not reduce further, as it is sensitive to identifiers, and due to
non-deterministic nature testing requires many repetitions.

[Bug d/109221] std.math.floor, core.math.ldexp, std.math.poly poor inlining

2023-03-20 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109221

--- Comment #2 from Witold Baryluk  ---
Interestingly enough, GDC 10.2 does inline the `poly` instantiation with all
the constants.

[Bug d/109221] std.math.floor, core.math.ldexp, std.math.poly poor inlining

2023-03-20 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109221

--- Comment #1 from Witold Baryluk  ---
PS. LDC 1.23.0 - 1.32.0 produce optimal code. LDC 1.22.0 a bit worse (due to
use of x87 codegen), and 1.21 and older fail to inline `ldexp`, but still
inline `poly` and `floor` perfectly.

[Bug d/109221] New: std.math.floor, core.math.ldexp, std.math.poly poor inlining

2023-03-20 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109221

Bug ID: 109221
   Summary: std.math.floor, core.math.ldexp, std.math.poly poor
inlining
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: d
  Assignee: ibuclaw at gdcproject dot org
  Reporter: witold.baryluk+gcc at gmail dot com
  Target Milestone: ---

Example:

static float sRGB_case4(float x) {
// import std.math : exp;
return 1.055f * expImpl(x) - 0.055f;  // expImpl not inlined by default
// (inlined when using pragma(inline, true), but that fails to inline in DMD)
}


// pragma(inline, true)
// This is borrowed from phobos/exponential.d to help gcc inline it fully.
// Only T == float case is here (as some traits are private to phobos).
// Also isNaN and range checks are removed, as sRGB performs own checks.
static private T expImpl(T)(T x) @safe pure nothrow @nogc
{
//import std.math : floatTraits, RealFormat;
//import std.math.traits : isNaN;
//import std.math.rounding : floor;
//import std.math.algebraic : poly;
//import std.math.constants : LOG2E;
import std.math;
import core.math;

static immutable T[6] P = [
5.001201E-1,
1.665459E-1,
4.1665795894E-2,
8.3334519073E-3,
1.3981999507E-3,
1.9875691500E-4,
];

enum T C1 = 0.693359375;
enum T C2 = -2.12194440e-4;

// Overflow and Underflow limits.
enum T OF = 88.72283905206835;
enum T UF = -103.278929903431851103; // ln(2^-149)

// Special cases.
//if (isNaN(x))
//return x;
//if (x > OF)
//return real.infinity;
//if (x < UF)
//return 0.0;

// Express: e^^x = e^^g * 2^^n
//   = e^^g * e^^(n * LOG2E)
//   = e^^(g + n * LOG2E)
T xx = floor((cast(T) LOG2E) * x + cast(T) 0.5);   // NOT INLINED!
const int n = cast(int) xx;
x -= xx * C1;
x -= xx * C2;

xx = x * x;
x = poly(x, P) * xx + x + 1.0f; // poly is generated optimally, but not inlined

// Scale by power of 2.
x = core.math.ldexp(x, n);// NOT INLINED

return x;
}


gdc gdc
(Compiler-Explorer-Build-gcc-454a4d5041f53cd1f7d902f6c0017b7ce95b36df-binutils-2.38)
13.0.1 20230318 (experimental)
gdc -O3 -march=znver2 -frelease -fbounds-check=off


pure nothrow @nogc @safe float std.math.algebraic.poly!(float, float,
6).poly(float, ref const(float[6])):
vmovss  xmm1, DWORD PTR [rdi+20]
vfmadd213ss xmm1, xmm0, DWORD PTR [rdi+16]
vfmadd213ss xmm1, xmm0, DWORD PTR [rdi+12]
vfmadd213ss xmm1, xmm0, DWORD PTR [rdi+8]
vfmadd213ss xmm1, xmm0, DWORD PTR [rdi+4]
vfmadd213ss xmm0, xmm1, DWORD PTR [rdi]
ret
pure nothrow @nogc @safe float example.expImpl!(float).expImpl(float):
push rbx
vmovaps xmm1, xmm0
sub rsp, 16
vmovss  xmm0, DWORD PTR .LC0[rip]
vfmadd213ss xmm0, xmm1, DWORD PTR .LC1[rip]
vmovss  DWORD PTR [rsp+8], xmm1
call pure nothrow @nogc @trusted float std.math.rounding.floor(float)
vmovss  xmm1, DWORD PTR [rsp+8]
mov edi, OFFSET FLAT:immutable(float[6]) example.expImpl!(float).expImpl(float).P
vfnmadd231ss xmm1, xmm0, DWORD PTR .LC2[rip]
vmovss  DWORD PTR [rsp+12], xmm0
vfnmadd231ss xmm1, xmm0, DWORD PTR .LC3[rip]
vmulss  xmm3, xmm1, xmm1
vmovaps xmm0, xmm1
vmovss  DWORD PTR [rsp+8], xmm1
vmovd   ebx, xmm3
call pure nothrow @nogc @safe float std.math.algebraic.poly!(float, float, 6).poly(float, ref const(float[6]))
vmovss  xmm1, DWORD PTR [rsp+8]
vmovd   xmm4, ebx
vmovss  xmm2, DWORD PTR [rsp+12]
vfmadd132ss xmm0, xmm1, xmm4
vaddss  xmm0, xmm0, DWORD PTR .LC4[rip]
add rsp, 16
pop rbx
vcvttss2si  edi, xmm2
jmp ldexpf
float example.sRGB_case4(float):
sub rsp, 8
call pure nothrow @nogc @safe float example.expImpl!(float).expImpl(float)
vmovss  xmm1, DWORD PTR .LC6[rip]
vfmadd132ss xmm0, xmm1, DWORD PTR .LC5[rip]
add rsp, 8
ret


https://godbolt.org/z/YMoMPdjn5


Additionally, std.math.exp itself is never inlined by gcc. This is important,
as some early checks (isNaN, OF, UF checks) in exp could be removed by proper
inlining.

[Bug c/108255] New: Repeated address-of (lea) not optimized for size.

2022-12-30 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108255

Bug ID: 108255
   Summary: Repeated address-of (lea) not optimized for size.
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: witold.baryluk+gcc at gmail dot com
  Target Milestone: ---

https://godbolt.org/z/q5sx9e49j


void f(int *);

int g(int of) {
int x = 13;
f(&x);
f(&x);
f(&x);
f(&x);
f(&x);
f(&x);
f(&x);
f(&x);
return 0;
}


Got:

g(int):
sub rsp, 24
lea rdi, [rsp+12]
mov DWORD PTR [rsp+12], 13
call f(int*)
lea rdi, [rsp+12] # compute, 5 bytes
call f(int*)
lea rdi, [rsp+12] # recompute, 5 bytes
call f(int*)
lea rdi, [rsp+12] # recompute, 5 bytes
call f(int*)
lea rdi, [rsp+12]
call f(int*)
lea rdi, [rsp+12]
call f(int*)
lea rdi, [rsp+12]
call f(int*)
lea rdi, [rsp+12]
call f(int*)
xor eax, eax
add rsp, 24
ret


But, note that lea is 5 bytes.

Expected (generated by clang 3.0 - 15.0):

g(int):  # @g(int)
push rbx  # extra, but just 1 byte
sub rsp, 16
mov dword ptr [rsp + 12], 13 # CSE temp
lea rbx, [rsp + 12]
mov rdi, rbx # use
call f(int*)@PLT
mov rdi, rbx # reuse, 3 bytes
call f(int*)@PLT
mov rdi, rbx # reuse, 3 bytes
call f(int*)@PLT
mov rdi, rbx
call f(int*)@PLT
mov rdi, rbx
call f(int*)@PLT
mov rdi, rbx
call f(int*)@PLT
mov rdi, rbx
call f(int*)@PLT
mov rdi, rbx
call f(int*)@PLT
xor eax, eax
add rsp, 16
pop rbx  # extra, but just 1 byte
ret


Technically this is more instructions, but `mov rdi, rbx` is 3 bytes, which is
shorter than the 5 bytes of lea. This comes at the minor expense of needing to
save and restore rbx.

PS. The same happens when using a temporary `int *const y = &x;`

The same also happens when optimizing for size (`-Os`).

It looks like gcc 4.8.5 produced the expected code, but gcc 4.9.0 does not.

It is possible that the code produced by gcc 4.9.0 is faster, but it is also
likely that it contributes quite a bit to binary size.

clang uses CSE even if there are just two uses of `&x` in the above example.
It is likely that a slightly higher threshold (3 or 4) is actually optimal
(this can be calculated knowing the encoding sizes).
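
As a rough check of that threshold, the arithmetic can be written down in C++ using only the encoding sizes quoted in this report (5 bytes per `lea`, 3 bytes per `mov`, 1 byte each for `push rbx` and `pop rbx`). The function names are hypothetical, and the sketch assumes everything else in the two listings costs the same number of bytes; it says nothing about speed.

```cpp
#include <cstddef>

// Encoding sizes as quoted in the report (x86-64).
constexpr std::size_t kLeaBytes = 5;      // lea rdi, [rsp+12]
constexpr std::size_t kMovBytes = 3;      // mov rdi, rbx
constexpr std::size_t kPushPopBytes = 2;  // push rbx + pop rbx, 1 byte each

// Bytes spent on the argument when the address is recomputed for every call.
constexpr std::size_t recompute_bytes(std::size_t uses) {
    return kLeaBytes * uses;
}

// Bytes spent when the address is computed once into rbx and copied per call.
constexpr std::size_t reuse_bytes(std::size_t uses) {
    return kPushPopBytes + kLeaBytes + kMovBytes * uses;
}

// Smallest number of uses at which reusing the register is strictly smaller.
constexpr std::size_t break_even_uses() {
    std::size_t n = 1;
    while (reuse_bytes(n) >= recompute_bytes(n)) ++n;
    return n;
}

static_assert(break_even_uses() == 4, "7 + 3n < 5n first holds at n = 4");
```

Under these assumptions the register version is 19 bytes against 20 at four uses, and 31 against 40 for the eight calls in the example, which agrees with the "3 or 4" estimate.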


Weirdly though, gcc -m32 does this:

g():
push ebp
mov ebp, esp
push ebx
lea ebx, [ebp-12]
sub esp, 32
mov DWORD PTR [ebp-12], 13
push ebx
call f(int*)
mov DWORD PTR [esp], ebx
call f(int*)
mov DWORD PTR [esp], ebx
call f(int*)
mov ebx, DWORD PTR [ebp-4]
xor eax, eax
leave
ret

Here it does compute the address once and stores it in a temporary, but it
does so on the stack instead of in a register (my guess is that there is no
free register to hold it, so it is spilled). In fact lea would likely be
faster here: `mov DWORD PTR [esp], ebx` requires a memory/cache access,
whereas lea is 5 bytes but does not require a memory access.

[Bug middle-end/35560] Missing CSE/PRE for memory operations involved in virtual call.

2022-12-30 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=35560

Witold Baryluk  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |witold.baryluk+gcc at gmail dot com

--- Comment #15 from Witold Baryluk  ---
I know this is a pretty old bug, but I was exploring some assembly of gcc and
clang on godbolt, and also stumbled into same issue.

https://godbolt.org/z/qPzMhWse1

class A {
public:
virtual int f7(int x) const;
};

int g(const A * const a, int x) {
int r = 0;
for (int i = 0; i < 1; i++)
r += a->f7(x);
return r;
}

(same happens without loop, when just calling a->f7 multiple times)



g(A const*, int):
push r13
mov r13d, esi
push r12
xor r12d, r12d
push rbp
mov rbp, rdi
push rbx
mov ebx, 1
sub rsp, 8
.L2:
mov rax, QWORD PTR [rbp+0]   # a vtable deref
mov esi, r13d
mov rdi, rbp
call [QWORD PTR [rax]]   # f7 indirect call
add r12d, eax
dec ebx
jne .L2

add rsp, 8
pop rbx
pop rbp
mov eax, r12d
pop r12
pop r13
ret


I was expecting `mov rax, QWORD PTR [rbp+0]` and `call [QWORD PTR [rax]]`
to be hoisted out of the loop (call converted to lea, and call register).


A bit sad.

Is there some recent work done on this optimization?

Are there at least some cases where it is valid to do CSE, or change code so it
is moved out of the loop?

[Bug d/107241] New: std.bitmanip.bigEndianToNative et al not inlined

2022-10-12 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107241

Bug ID: 107241
   Summary: std.bitmanip.bigEndianToNative et al not inlined
   Product: gcc
   Version: 12.2.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: d
  Assignee: ibuclaw at gdcproject dot org
  Reporter: witold.baryluk+gcc at gmail dot com
  Target Milestone: ---

gdc fails to inline a number of small functions that should fully inline and
end up as a single instruction.

On amd64 / x86, for example, std.bitmanip.bigEndianToNative causes a chain of
calls / jumps, even with @attribute("flatten").



import std.bitmanip;
import gcc.attributes;

@attribute("flatten")
size_t f(char[] b) {
return std.bitmanip.bigEndianToNative!(size_t,
8)(cast(ubyte[8])(b[2..10]));
}




gcc -O3 -march=znver2 -frelease


pure nothrow @nogc @safe ulong
std.bitmanip.swapEndian!(ulong).swapEndian(const(ulong)):
mov rax, rdi
bswap   rax
ret
pure nothrow @nogc @safe ulong std.bitmanip.endianToNativeImpl!(true, ulong,
8uL).endianToNativeImpl(ubyte[8]):
jmp pure nothrow @nogc @safe ulong
std.bitmanip.swapEndian!(ulong).swapEndian(const(ulong))
pure nothrow @nogc @safe ulong std.bitmanip.bigEndianToNative!(ulong,
8uL).bigEndianToNative(ubyte[8]):
jmp pure nothrow @nogc @safe ulong
std.bitmanip.endianToNativeImpl!(true, ulong, 8uL).endianToNativeImpl(ubyte[8])
ulong example.f(char[]):
mov rdi, QWORD PTR [rsi+2]
jmp pure nothrow @nogc @safe ulong
std.bitmanip.bigEndianToNative!(ulong, 8uL).bigEndianToNative(ubyte[8])




No issues with LDC.

ulong example.f(char[]):
mov rax, qword ptr [rsi + 2]
bswap   rax
ret



godbolt: https://godbolt.org/z/Pj3f7oGso
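
For readers less familiar with Phobos, the operation being requested is just a big-endian 64-bit load. A portable C++ sketch of the same computation is below; `load_be64` is a hypothetical name, and this is an analogue for comparison, not the Phobos implementation. Optimizing compilers on x86-64 typically reduce this loop to a load plus `bswap`, which is the shape of the LDC output above.

```cpp
#include <cstdint>

// Reads the 8 bytes at `bytes` as a big-endian 64-bit unsigned integer.
// Written byte by byte so it is independent of the host endianness.
inline std::uint64_t load_be64(const unsigned char* bytes) {
    std::uint64_t v = 0;
    for (int i = 0; i < 8; ++i) {
        v = (v << 8) | bytes[i];  // most significant byte comes first
    }
    return v;
}
```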

[Bug d/105413] gdc extended assembler cannot constraints r8 - r15

2022-10-08 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105413

--- Comment #3 from Witold Baryluk  ---
It works. Thank you.

Any chance this will be in gcc 12.x? I work a lot on Debian Linux, and I doubt
I will have gcc trunk or gcc 13 available any time soon.


Also weirdly gcc does not inline this function, unless I add
@attribute("always_inline") on syscall, or @attribute("flatten") on
openatdummy.

[Bug d/105413] New: gdc extended assembler cannot constraints r8 - r15

2022-04-27 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105413

Bug ID: 105413
   Summary: gdc extended assembler cannot constraints r8 - r15
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: d
  Assignee: ibuclaw at gdcproject dot org
  Reporter: witold.baryluk+gcc at gmail dot com
  Target Milestone: ---

gcc in C does not directly support register constraints for the x86_64
registers r8 - r15.

In C this can be done, however, using local register variables with asm
register specifiers.

https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html
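
A minimal sketch of that C/C++ workaround, assuming GCC or Clang on x86_64 Linux: a local variable is pinned to r10 with the explicit register variable extension and then passed to the asm with a plain "r" constraint. `raw_syscall4` is a hypothetical helper name; it uses a "memory" clobber rather than a dummy "m" input to keep the example short, and falls back to the libc `syscall()` wrapper on other targets.

```cpp
#include <fcntl.h>        // AT_FDCWD, O_RDONLY
#include <sys/syscall.h>  // SYS_openat, SYS_close
#include <unistd.h>       // syscall()

// Issues a 4-argument Linux system call.
// On x86_64 the raw convention is: rax = number, rdi, rsi, rdx, r10 = args;
// the kernel clobbers rcx and r11 and returns the result in rax.
static long raw_syscall4(long nr, long a1, long a2, long a3, long a4) {
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    // There is no constraint letter for r10, so pin a local variable to it.
    register long r10 asm("r10") = a4;
    asm volatile("syscall"
                 : "=a"(ret)
                 : "a"(nr), "D"(a1), "S"(a2), "d"(a3), "r"(r10)
                 : "rcx", "r11", "memory");
    return ret;  // negative errno value on failure
#else
    // Other targets: let libc handle the calling convention.
    return syscall(nr, a1, a2, a3, a4);
#endif
}
```

For example, `raw_syscall4(SYS_openat, AT_FDCWD, reinterpret_cast<long>("/dev/null"), O_RDONLY, 0)` should return a non-negative file descriptor on a typical Linux system.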

There is no way to use this in GDC extended assembler.

version (linux) {
version (GNU) {

enum SYSCALL {
  OPENAT = 56,
}

@nogc:
nothrow:

size_t syscall(SYSCALL ident)(size_t arg1, size_t arg2, size_t arg3, size_t
arg4) {
version (X86_64) {
   asm @nogc nothrow {
 "syscall"
 // output:
 : "=a" (arg1)
 // inputs:
 : "a" (ident),  // rax - syscall number
   "D" (arg1),   // rdi - arg1
   "S" (arg2),   // rsi - arg2
   "d" (arg3),   // rdx - arg3
   "r10" (arg4),  // r10 - arg4
   "m"( *cast(ubyte*)arg1)   // "dummy" input instead of full memory clobber
 // clobbers
 : "c", "r11";  // Clobbers rax, and rcx and r11.
   }
   return arg1;
   } else {
   static assert(false, "This platform/architecture is not supported when using GDC compiler");
   } 
}

}

private int openatdummy() @nogc nothrow {
  return cast(int)syscall!(SYSCALL.OPENAT)(0, 0, 0, 0);
}

}



myio.d: In function ‘syscall’:
myio.d:232:10: error: matching constraint references invalid operand number
  232 |  ;



https://godbolt.org/z/xGzxa6orc

[Bug d/105360] Inlined lazy parameters / delegate literals, still emitted

2022-04-23 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105360

--- Comment #1 from Witold Baryluk  ---
https://godbolt.org/z/c8oT6E4cf

[Bug d/105360] New: Inlined lazy parameters / delegate literals, still emitted

2022-04-23 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105360

Bug ID: 105360
   Summary: Inlined lazy parameters / delegate literals, still
emitted
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: d
  Assignee: ibuclaw at gdcproject dot org
  Reporter: witold.baryluk+gcc at gmail dot com
  Target Milestone: ---

```
extern bool g();
extern void f(int n);

void log(lazy int num) {
if (g()) {
const n = num();
f(n);
}
}

void p(int n) {
log(n * 137);
}
```


This should emit the same (or close to the same) code as a version with no
`lazy` on the `log` function (and the `num` reference changed accordingly),
because the compiler knows that `num` is called once, has no side effects, is
moderately expensive, etc.

And the code for `p` is indeed exactly the same: `log` and `n * 137` are fully
inlined.

However, the anonymous dgliteral code is still emitted, despite not being
referenced anywhere:

```
pure nothrow @nogc @safe int example.p(int).__dgliteral2():  #   < This should not be in object file
imul eax, DWORD PTR [rdi], 137
ret
```


Rest of the object file is correct and optimal:

```
void example.log(lazy int):
push rbp
push rbx
mov rbp, rdi
mov rbx, rsi
sub rsp, 8
call bool example.g()
test al, al
je  .L3
mov rdi, rbp
call rbx
add rsp, 8
pop rbx
pop rbp
mov edi, eax
jmp void example.f(int)
.L3:
add rsp, 8
pop rbx
pop rbp
ret
void example.p(int):
push rbx
mov ebx, edi
call bool example.g()
test al, al
je  .L6
imul edi, ebx, 137
pop rbx
jmp void example.f(int)
.L6:
pop rbx
ret
```


gdc
(Compiler-Explorer-Build-gcc-748d46cd049c89a799f99f14547267ebae915af6-binutils-2.36.1)
12.0.1 20220421 (experimental)  via godbolt.org


For code passing reasonably big delegate literals, this can lead to code
duplication in the object file.

ldc2 shows no such problem.

[Bug c++/103966] std::atomic relaxed load, inc, store sub-optimal codegen

2022-01-10 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103966

--- Comment #2 from Witold Baryluk  ---
Similarly, dec, add and sub are affected, as well as mul.

Example:

#include <atomic>
#include <cstdint>

uint64_t x;
void add_a() {
x += 5;
}

std::atomic<uint64_t> y;

void add_b_non_atomic() {
y.store(y.load(std::memory_order_relaxed) + 5, std::memory_order_relaxed);
}



Producing:

add_a():
add QWORD PTR x[rip], 5
ret
add_b_non_atomic():
mov rax, QWORD PTR y[rip]
add rax, 5
mov QWORD PTR y[rip], rax
ret
y:
.zero   8
x:
.zero   8

[Bug c++/103966] std::atomic relaxed load, inc, store sub-optimal codegen

2022-01-10 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103966

--- Comment #1 from Witold Baryluk  ---
Current codegen on gcc 12 on 64-bit x86:

inc_a():
inc QWORD PTR x[rip]
ret
inc_b_non_atomic():
mov rax, QWORD PTR y[rip]
inc rax
mov QWORD PTR y[rip], rax
ret
y:
.zero   8
x:
.zero   8

[Bug c++/103966] New: std::atomic relaxed load, inc, store sub-optimal codegen

2022-01-10 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103966

Bug ID: 103966
   Summary: std::atomic relaxed load, inc, store sub-optimal
codegen
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: witold.baryluk+gcc at gmail dot com
  Target Milestone: ---

Both functions below should compile to the same assembly on x86:

#include <atomic>
#include <cstdint>

uint64_t x;
void inc_a() {
x++;
}

std::atomic<uint64_t> y;

void inc_b_non_atomic() {
y.store(y.load(std::memory_order_relaxed) + 1, std::memory_order_relaxed);
}


and it does so in clang.

It does not in gcc 12 (and earlier).

https://godbolt.org/z/GcM67xz8T



This pattern is very popular in approximate statistical counters / metrics,
where the flow of information is unidirectional (i.e. from one thread that does
updates, to another thread that only reads the counters), and its performance
is critical in many codebases.
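
A minimal sketch of the single-writer counter pattern described above. The names `events`, `record_event` and `read_events` are hypothetical and not from the report; the sketch assumes exactly one thread ever calls `record_event`, which is what makes the separate relaxed load and store safe.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical statistics counter with exactly one writer thread.
std::atomic<std::uint64_t> events{0};

// Writer side: a relaxed load followed by a relaxed store. This is
// deliberately not an atomic read-modify-write, so it is only correct
// when a single thread performs the updates.
void record_event() {
    events.store(events.load(std::memory_order_relaxed) + 1,
                 std::memory_order_relaxed);
}

// Reader side: may run on any other thread and may observe a slightly
// stale value, which is acceptable for approximate metrics.
std::uint64_t read_events() {
    return events.load(std::memory_order_relaxed);
}
```

Because no other thread writes the counter, nothing can be lost between the load and the store, which is why a plain `inc`/`add` on memory would be an acceptable lowering on x86.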

[Bug d/100769] [D] memcmp() == 0 for small constant strings not folded

2021-05-26 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100769

Witold Baryluk  changed:

   What|Removed |Added

 Resolution|FIXED   |INVALID

[Bug d/100769] [D] memcmp() == 0 for small constant strings not folded

2021-05-26 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100769

Witold Baryluk  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |FIXED

--- Comment #4 from Witold Baryluk  ---
Ok. That makes sense. Thanks.

[Bug d/100769] [D] memcmp() == 0 for small constant strings not folded

2021-05-26 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100769

--- Comment #2 from Witold Baryluk  ---
Hmm. It appears that using `import core.stdc.string : memcmp;` actually
resolves the problem. It looks like my manual declaration of memcmp for some
reason disabled optimisations for memcmp.

[Bug d/100769] [D] memcmp() == 0 for small constant strings not folded

2021-05-26 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100769

--- Comment #1 from Witold Baryluk  ---
There is a typo in the example (the godbolt version is correct): I forgot the `.ptr`:

extern(C) int memcmp(const void *s1, const void *s2, size_t n);

int recognize3(const char* s) {
return memcmp(s, "stract class".ptr, 12) == 0;
}

Casting to ubyte* or void* doesn't really change anything.

options: -O3 -frelease -fno-semantic-interposition 

tested on amd64, Debian / Linux.

[Bug d/100769] New: [D] memcmp() == 0 for small constant strings not folded

2021-05-26 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100769

Bug ID: 100769
   Summary: [D] memcmp() == 0 for small constant strings not
folded
   Product: gcc
   Version: 10.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: d
  Assignee: ibuclaw at gdcproject dot org
  Reporter: witold.baryluk+gcc at gmail dot com
  Target Milestone: ---

I expect this D code to be quite optimal, but it isn't.

```
extern(C) int memcmp(const void *s1, const void *s2, size_t n);

int recognize3(const char* s) {
return memcmp(s, "stract class", 12) == 0;
}
```

https://godbolt.org/z/vx17WK9rs


It produces a call to memcmp, instead of inlining and specializing the code for
this specific case.

int example.recognize3(const(char*)):
sub rsp, 8
mov edx, 12
mov esi, OFFSET FLAT:.LC0
call    memcmp
test    eax, eax
sete    al
add rsp, 8
movzx   eax, al
ret



ldc2 1.24.0 (for D), clang 11.0.1-2 (for C and C++), and gcc 10.2.1 (for C
and C++) produce close to optimal code. The same goes for ldc2 1.26.0 (for D)
and gcc 11.1 (for C and C++):

int example.recognize3(const(char*)):
movabs  rcx, 7142836979195081843
xor rcx, qword ptr [rdi]
mov edx, dword ptr [rdi + 8]
xor rdx, 1936941420
xor eax, eax
or  rdx, rcx
sete    al
ret

and

recognize3:
movabs  rax, 7142836979195081843
cmp QWORD PTR [rdi], rax
je  .L6
.L2:
mov eax, 1
xor eax, 1
ret
.L6:
xor eax, eax
cmp DWORD PTR [rdi+8], 1936941420
jne .L2
xor eax, 1
ret


Notice how gcc, clang and ldc2 all compare the first 8 bytes of input, then the
next 4 bytes. clang and ldc2 just xor/or the results and then return, with no
conditional jumps. gcc does a bit worse, with more conditionals and more
jumps, but it is still pretty good and uses the same idea.

gdc, however, calls the generic memcmp, which loops and has about 12
jumps and/or 13 exits.
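
For illustration, the folded comparison can be written by hand roughly as follows. This is a sketch in C++ rather than D, the name `recognize3_folded` is hypothetical, and it assumes `s` points to at least 12 readable bytes. On a little-endian target the two expected words are the constants 7142836979195081843 and 1936941420 seen in the listings above.

```cpp
#include <cstdint>
#include <cstring>

// Hand-written equivalent of memcmp(s, "stract class", 12) == 0.
// Assumption: s points to at least 12 readable bytes.
bool recognize3_folded(const char* s) {
    static const char kLit[] = "stract class";  // 12 characters plus NUL

    std::uint64_t want_lo, got_lo;  // bytes 0..7
    std::uint32_t want_hi, got_hi;  // bytes 8..11

    // memcpy keeps the loads free of alignment and aliasing problems.
    std::memcpy(&want_lo, kLit, 8);
    std::memcpy(&want_hi, kLit + 8, 4);
    std::memcpy(&got_lo, s, 8);
    std::memcpy(&got_hi, s + 8, 4);

    // Branch-free combine, as in the clang/ldc2 listing: xor each part
    // with its expected value and OR the results together.
    return ((got_lo ^ want_lo) | (got_hi ^ want_hi)) == 0;
}
```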

[Bug c/100257] New: poor codegen with vcvtph2ps / stride of 6

2021-04-25 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100257

Bug ID: 100257
   Summary: poor codegen with vcvtph2ps / stride of 6
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: witold.baryluk+gcc at gmail dot com
  Target Milestone: ---

gcc (Compiler-Explorer-Build) 12.0.0 20210424 (experimental)


https://godbolt.org/z/n6ooMdnz8


This C code:

```
#include <stdint.h>
#include <string.h>
#include <immintrin.h>

struct float3 {
float f1;
float f2;
float f3;
};

struct util_format_r16g16b16_float {
   uint16_t r;
   uint16_t g;
   uint16_t b;
};

static inline struct float3 _mesa_half3_to_float3(uint16_t val_0, uint16_t
val_1, uint16_t val_2) {
#if defined(__F16C__)
  //const __m128i in = {val_0, val_1, val_2};
  //__m128 out;
  //__asm volatile("vcvtph2ps %1, %0" : "=v"(out) : "v"(in));

  const __m128i in = _mm_setr_epi16(val_0, val_1, val_2, 0, 0, 0, 0, 0);
  const __m128 out = _mm_cvtph_ps(in);

  const struct float3 r = {out[0], out[1], out[2]};
  return r;
#endif
}


void
util_format_r16g16b16_float_unpack_rgba_float(void *restrict dst_row, const
uint8_t *restrict src, unsigned width)
{
   float *dst = dst_row;
   for (unsigned x = 0; x < width; x += 1) {
const struct util_format_r16g16b16_float pixel;
memcpy(&pixel, src, sizeof pixel);

struct float3 r = _mesa_half3_to_float3(pixel.r, pixel.g, pixel.b);
dst[0] = r.f1; /* r */
dst[1] = r.f2; /* g */
dst[2] = r.f3; /* b */
dst[3] = 1; /* a */

src += 6;
dst += 4;
   }
}

```

Is compiled "poorly" by gcc, and even worse when compiled for i386 (with -mf16c
enabled) when using -fPIE.

Example:


gcc -O3 -m32 -march=znver2 -mfpmath=sse -fPIE

util_format_r16g16b16_float_unpack_rgba_float:
push    ebp
push    edi
push    esi
push    ebx
sub esp, 28
mov ecx, DWORD PTR 56[esp]
mov edx, DWORD PTR 48[esp]
call    __x86.get_pc_thunk.ax
add eax, OFFSET FLAT:_GLOBAL_OFFSET_TABLE_
mov ebx, DWORD PTR 52[esp]
test    ecx, ecx
je  .L8
vmovss  xmm3, DWORD PTR .LC0@GOTOFF[eax]
xor esi, esi
xor ebp, ebp
vpxor   xmm2, xmm2, xmm2
.L3:
mov eax, DWORD PTR [ebx]
vmovss  DWORD PTR 12[edx], xmm3
add ebx, 6
add edx, 16
inc esi
mov ecx, eax
vmovd   xmm0, eax
shr ecx, 16
mov edi, ecx
movzx   ecx, WORD PTR -2[ebx]
vpinsrw xmm0, xmm0, edi, 1
vmovd   xmm1, ecx
vpinsrw xmm1, xmm1, ebp, 1
vpunpckldq  xmm0, xmm0, xmm1
vpunpcklqdq xmm0, xmm0, xmm2
vcvtph2ps   xmm0, xmm0
vmovss  DWORD PTR -16[edx], xmm0
vextractps  DWORD PTR -12[edx], xmm0, 1
vextractps  DWORD PTR -8[edx], xmm0, 2
cmp DWORD PTR 56[esp], esi
jne .L3
.L8:
add esp, 28
pop ebx
pop esi
pop edi
pop ebp
ret
.LC0:
.long   1065353216
__x86.get_pc_thunk.ax:
mov eax, DWORD PTR [esp]
ret



clang:

util_format_r16g16b16_float_unpack_rgba_float: #
@util_format_r16g16b16_float_unpack_rgba_float
mov eax, dword ptr [esp + 12]
test    eax, eax
je  .LBB0_3
mov ecx, dword ptr [esp + 8]
mov edx, dword ptr [esp + 4]
.LBB0_2:# =>This Inner Loop Header: Depth=1
vmovd   xmm0, dword ptr [ecx]   # xmm0 = mem[0],zero,zero,zero
vpinsrw xmm0, xmm0, word ptr [ecx + 4], 2
add ecx, 6
vcvtph2ps   xmm0, xmm0
vmovss  dword ptr [edx], xmm0
vextractps  dword ptr [edx + 4], xmm0, 1
vextractps  dword ptr [edx + 8], xmm0, 2
mov dword ptr [edx + 12], 1065353216
add edx, 16
dec eax
jne .LBB0_2
.LBB0_3:
ret


clang code is essentially optimal.


The issue persists whether I use `vcvtph2ps` directly via asm or via intrinsics.

The issue might be the src stride of 6, instead of 8, that is confusing gcc.

Additionally, the constant 1065353216 (which is weird, I would expect it to be 0)
is stored in the data section instead of being inlined as an immediate. This
makes the code larger, and in PIE mode it requires extra pointer trickery and,
with -m32, even an extra function call.
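
For reference, 1065353216 is 0x3F800000, the IEEE-754 single-precision bit pattern of 1.0f, which matches the `dst[3] = 1` store in the loop. A small sketch to check this (the helper name `float_bits` is hypothetical, and a 32-bit IEEE-754 `float` is assumed):

```cpp
#include <cstdint>
#include <cstring>

// Returns the raw 32-bit pattern of a float.
// Assumption: float is a 32-bit IEEE-754 single-precision value.
std::uint32_t float_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch expects a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);  // well-defined type punning
    return bits;
}
```

Under that assumption, `float_bits(1.0f)` yields 1065353216, the value clang stores as an immediate and gcc loads from `.LC0`.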

Even without -fPIE, the main loop has poor codegen even on x86-64 / amd64
compared to clang or what I would consider good code.

gcc -m64 -O3 -march=native

util_format_r16g16b16_float_unpack_rgba_float:
test    edx, edx
je  .L8
mov edx, edx
sal rdx, 4
vmovss  xmm3, DWO

[Bug d/98494] New: libphobos: std.process Config.stderrPassThrough missing

2020-12-31 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98494

Bug ID: 98494
   Summary: libphobos: std.process Config.stderrPassThrough
missing
   Product: gcc
   Version: 10.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: d
  Assignee: ibuclaw at gdcproject dot org
  Reporter: witold.baryluk+gcc at gmail dot com
  Target Milestone: ---

It appears that the gdc version of libphobos is lagging behind upstream in
some aspects.

One of the things I see missing is `Config.stderrPassThrough` in std.process.
I see it was added upstream about 12 months ago:

enum Config {
...
/**
By default, the $(LREF execute) and $(LREF executeShell) functions
will capture child processes' both stdout and stderr. This can be
undesirable if the standard output is to be processed or otherwise
used by the invoking program, as `execute`'s result would then
contain a mix of output and warning/error messages.

Specify this flag when calling `execute` or `executeShell` to
cause invoked processes' stderr stream to be sent to $(REF stderr,
std,stdio), and only capture and return standard output.

This flag has no effect on $(LREF spawnProcess) or $(LREF spawnShell).
*/
stderrPassThrough = 128,
}

The implementation change that uses this flag is relatively small and easy to backport:

in executeImpl:

-auto p = pipeFunc(commandLine, Redirect.stdout | Redirect.stderrToStdout,
-  env, config, workDir, extraArgs);
+auto redirect = (config & Config.stderrPassThrough)
+? Redirect.stdout
+: Redirect.stdout | Redirect.stderrToStdout;
+
+auto p = pipeFunc(commandLine, redirect,
+  env, config, workDir, extraArgs);



There are some other minor changes there, but nothing functionally significant.
Mostly unittests and minor signature changes (adding `scope` to many input
parameters).
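
The flag test in the diff above can be sketched as follows, in C++ rather than D. Only the value 128 for `stderrPassThrough` comes from the report; the redirect constants and the name `choose_redirect` are placeholders chosen for illustration and are not the actual Phobos values.

```cpp
#include <cstdint>

// 128 is the value quoted in the report for Config.stderrPassThrough.
constexpr std::uint32_t kStderrPassThrough = 128;

// Placeholder values for illustration only, not Phobos' real Redirect flags.
constexpr std::uint32_t kRedirectStdout = 1u << 0;
constexpr std::uint32_t kRedirectStderrToStdout = 1u << 1;

// Mirrors the ternary in the diff: capture only stdout when the
// pass-through flag is set, otherwise also fold stderr into stdout.
constexpr std::uint32_t choose_redirect(std::uint32_t config) {
    return (config & kStderrPassThrough)
               ? kRedirectStdout
               : (kRedirectStdout | kRedirectStderrToStdout);
}
```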

Thank you.

[Bug d/98457] [d] writef!"%s" doesn't work with MonoTime / SysTick

2020-12-27 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98457

--- Comment #1 from Witold Baryluk  ---
Godbolt link: https://godbolt.org/z/q3bzhP

With gcc trunk 20201227 and a bit more diagnostic output:

/opt/compiler-explorer/gcc-trunk-20201227/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/core/time.d:2405:16:
error: static variable _ticksPerSecond cannot be read at compile time
 2405 | return _ticksPerSecond[_clockIdx];
  |^
/opt/compiler-explorer/gcc-trunk-20201227/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/core/time.d:2418:99:
note: called from here: ticksPerSecond()
 2418 | return "MonoTime(" ~ signedToTempString(_ticks, 10) ~ "
ticks, " ~ signedToTempString(ticksPerSecond, 10) ~ " ticks per second)";
  |
  ^
/opt/compiler-explorer/gcc-trunk-20201227/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/core/time.d:2418:98:
note: called from here: signedToTempString(ticksPerSecond(), 10u)
 2418 | return "MonoTime(" ~ signedToTempString(_ticks, 10) ~ "
ticks, " ~ signedToTempString(ticksPerSecond, 10) ~ " ticks per second)";
  |
 ^
/opt/compiler-explorer/gcc-trunk-20201227/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/std/format.d:3353:28:
note: called from here: val.toString()
 3353 | put(w, val.toString());
  |^
/opt/compiler-explorer/gcc-trunk-20201227/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/std/format.d:3353:12:
note: called from here: put(w, val.toString())
 3353 | put(w, val.toString());
  |^
/opt/compiler-explorer/gcc-trunk-20201227/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/std/format.d:3672:21:
note: called from here: formatObject(w, val, f)
 3672 | formatObject(w, val, f);
  | ^
/opt/compiler-explorer/gcc-trunk-20201227/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/std/format.d:568:28:
note: called from here: formatValue(w, _param_2, spec)
  568 | formatValue(w, args[i], spec);
  |^
/opt/compiler-explorer/gcc-trunk-20201227/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/std/format.d:5767:28:
note: called from here: formattedWrite(w, fmt, _param_1)
 5767 | auto n = formattedWrite(w, fmt, args);
  |^
/opt/compiler-explorer/gcc-trunk-20201227/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/std/format.d:5729:16:
note: called from here: format("%s", MonoTimeImpl(0L))
 5729 | .format(fmt, Args.init);
  |^
/opt/compiler-explorer/gcc-trunk-20201227/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/std/format.d:5733:2:
note: called from here: (*function () => null)()
 5733 | }();
  |  ^
/opt/compiler-explorer/gcc-trunk-20201227/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/std/stdio.d:3754:15:
error: template instance std.format.checkFormatException!("�}�",
MonoTimeImpl!cast(ClockType)0) error instantiating
 3754 | alias e = checkFormatException!(fmt, A);
  |   ^
<source>:4:14: note: instantiated from here: writef!("%s",
MonoTimeImpl!cast(ClockType)0)
4 |   writef!"%s"(MonoTime.currTime());
  |  ^
/opt/compiler-explorer/gcc-trunk-20201227/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/std/stdio.d:3755:5:
note: while evaluating: static assert(!e)
 3755 | static assert(!e, e.msg);
  | ^
Compiler returned: 1

[Bug d/98457] New: [d] writef!"%s" doesn't work with MonoTime / SysTick

2020-12-27 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98457

Bug ID: 98457
   Summary: [d] writef!"%s" doesn't work with MonoTime / SysTick
   Product: gcc
   Version: 10.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: d
  Assignee: ibuclaw at gdcproject dot org
  Reporter: witold.baryluk+gcc at gmail dot com
  Target Milestone: ---

void main() {
  import std.stdio;
  import core.time : MonoTime;
  writef!"%s"(MonoTime.currTime());
}


Doesn't compile with gdc 10.2.1:

$ gdc test_monotime.d 
/usr/lib/gcc/x86_64-linux-gnu/10/include/d/core/time.d:2405:16: error: static
variable _ticksPerSecond cannot be read at compile time
 2405 | return _ticksPerSecond[_clockIdx];
  |^
/usr/lib/gcc/x86_64-linux-gnu/10/include/d/core/time.d:2418:99: note: called
from here: ticksPerSecond()
 2418 | return "MonoTime(" ~ signedToTempString(_ticks, 10) ~ "
ticks, " ~ signedToTempString(ticksPerSecond, 10) ~ " ticks per second)";
  |
  ^
/usr/lib/gcc/x86_64-linux-gnu/10/include/d/core/time.d:2418:98: note: called
from here: signedToTempString(ticksPerSecond(), 10u)
 2418 | return "MonoTime(" ~ signedToTempString(_ticks, 10) ~ "
ticks, " ~ signedToTempString(ticksPerSecond, 10) ~ " ticks per second)";
  |
 ^
/usr/lib/gcc/x86_64-linux-gnu/10/include/d/std/format.d:3353:28: note: called
from here: val.toString()
 3353 | put(w, val.toString());
  |^
/usr/lib/gcc/x86_64-linux-gnu/10/include/d/std/format.d:3353:12: note: called
from here: put(w, val.toString())
 3353 | put(w, val.toString());
  |^
/usr/lib/gcc/x86_64-linux-gnu/10/include/d/std/format.d:3672:21: note: called
from here: formatObject(w, val, f)
 3672 | formatObject(w, val, f);
  | ^
/usr/lib/gcc/x86_64-linux-gnu/10/include/d/std/format.d:568:28: note: called
from here: formatValue(w, _param_2, spec)
  568 | formatValue(w, args[i], spec);
  |^
/usr/lib/gcc/x86_64-linux-gnu/10/include/d/std/format.d:5767:28: note: called
from here: formattedWrite(w, fmt, _param_1)
 5767 | auto n = formattedWrite(w, fmt, args);
  |^
/usr/lib/gcc/x86_64-linux-gnu/10/include/d/std/format.d:5729:16: note: called
from here: format("%s", MonoTimeImpl(0L))
 5729 | .format(fmt, Args.init);
  |^
/usr/lib/gcc/x86_64-linux-gnu/10/include/d/std/format.d:5733:2: note: called
from here: (*function () => null)()
 5733 | }();
  |  ^

(null):0: confused by earlier errors, bailing out



Manually adding .toString() makes it work (at the expense of a possible extra
allocation).

No issues in ldc2 1.24.0 or dmd2 2.095.0-beta.1.

It doesn't look like an issue in Phobos, but something deeper.

[Bug tree-optimization/96275] Vectorizer doesn't take into account bitmask condition from branch conditions.

2020-12-27 Thread witold.baryluk+gcc at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96275

--- Comment #3 from Witold Baryluk  ---
Thanks for looking into that. I just wanted to update that this is still
suboptimal in current gcc trunk 20201226, while clang produces superior code.

[Bug c/96275] Vectorizer doesn't take into account bitmask condition from branch conditions.

2020-07-21 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96275

--- Comment #1 from Witold Baryluk  ---
FYI.

clang trunk 12 / 76a0c0ee6ffa9c38485776921948d8f930109674, doesn't do that
either:

fillArray:  # @fillArray
test    dil, 31
jne .LBB0_8
test    edi, edi
je  .LBB0_8
vmovss  xmm0, dword ptr [rdx]   # xmm0 = mem[0],zero,zero,zero
mov eax, edi
cmp edi, 32
jae .LBB0_4
xor edx, edx
jmp .LBB0_7
.LBB0_4:
vbroadcastss    ymm1, xmm0
mov edx, eax
xor edi, edi
and edx, -32
.LBB0_5:# =>This Inner Loop Header: Depth=1
vmulps  ymm2, ymm1, ymmword ptr [rcx + 4*rdi]
vmulps  ymm3, ymm1, ymmword ptr [rcx + 4*rdi + 32]
vmulps  ymm4, ymm1, ymmword ptr [rcx + 4*rdi + 64]
vmulps  ymm5, ymm1, ymmword ptr [rcx + 4*rdi + 96]
vmovups ymmword ptr [rsi + 4*rdi], ymm2
vmovups ymmword ptr [rsi + 4*rdi + 32], ymm3
vmovups ymmword ptr [rsi + 4*rdi + 64], ymm4
vmovups ymmword ptr [rsi + 4*rdi + 96], ymm5
add rdi, 32
cmp rdx, rdi
jne .LBB0_5
cmp rdx, rax
je  .LBB0_8
.LBB0_7:# =>This Inner Loop Header: Depth=1
vmulss  xmm1, xmm0, dword ptr [rcx + 4*rdx]
vmovss  dword ptr [rsi + 4*rdx], xmm1
inc rdx
cmp rax, rdx
jne .LBB0_7
.LBB0_8:
vzeroupper
ret


The main inner loop is unrolled / pipelined more aggressively, and the fallback
code is simpler (it just handles the leftover elements one scalar at a time),
which is unrelated. But the fallback code is still there.



Changing to different variations of the condition, like `if ((N/32)*32 == N)
{`, `if ((N % 32) == 0) {`, `if ((N & ~31u) == N) {`, `if ((N >> 5) << 5 == N)
{`,  doesn't make any difference.
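
For unsigned `N`, the condition variants listed above are all spellings of the same predicate, "N is a multiple of 32". A small C++ sketch that checks they agree (function names are hypothetical):

```cpp
#include <cstdint>

// Each predicate is one spelling of "n is a multiple of 32".
bool by_mask(std::uint32_t n)      { return (n & 31u) == 0; }
bool by_div_mul(std::uint32_t n)   { return (n / 32) * 32 == n; }
bool by_mod(std::uint32_t n)       { return (n % 32) == 0; }
bool by_clear_low(std::uint32_t n) { return (n & ~31u) == n; }
bool by_shifts(std::uint32_t n)    { return ((n >> 5) << 5) == n; }

// Returns true if all five predicates agree for every n in [0, limit).
bool all_agree(std::uint32_t limit) {
    for (std::uint32_t n = 0; n < limit; ++n) {
        const bool expected = by_mask(n);
        if (by_div_mul(n) != expected || by_mod(n) != expected ||
            by_clear_low(n) != expected || by_shifts(n) != expected) {
            return false;
        }
    }
    return true;
}
```

Since the predicates are equivalent, whichever one guards the loop, a scalar remainder loop after a 32-element-wide body is unreachable.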

I tried with signed int, and unsigned int. Same effect.

Reassigning to N (after removing constness), i.e. `N = N & ~31u`, or `N = (N >>
5) << 5`, does appear to do something, but if it is inside the condition it is
already too late.

[Bug c/96275] New: Vectorizer doesn't take into account bitmask condition from branch conditions.

2020-07-21 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96275

Bug ID: 96275
   Summary: Vectorizer doesn't take into account bitmask condition
from branch conditions.
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: witold.baryluk+gcc at gmail dot com
  Target Milestone: ---

https://godbolt.org/z/Gfebjd

With gcc trunk 20200720

If the loop to be vectorized is inside an if condition that checks the loop
count, or there is a preceding assert / function return on such a condition,
gcc seems to forget about it and does not take it into account in the
optimizer / vectorizer. It still emits the backup scalar code to take care of
stragglers, despite it being dead code.

#include "assert.h"

void fillArray(const unsigned int N, float * restrict a, const float* restrict
b, const float* restrict c) {
//assert(N >= 1024);
for (int i = 0; i < (N & ~31u); i++) {
a[i] = b[0] * c[i];
}
}


produces:

fillArray:
and edi, -32
je  .L8
shr edi, 3
vbroadcastss    ymm1, DWORD PTR [rdx]
xor eax, eax
mov edx, edi
sal rdx, 5
.L3:
vmulps  ymm0, ymm1, YMMWORD PTR [rcx+rax]
vmovups YMMWORD PTR [rsi+rax], ymm0
add rax, 32
cmp rax, rdx
jne .L3
vzeroupper
.L8:
ret




but:

#include "assert.h"

void fillArray(const unsigned int N, float * restrict a, const float* restrict
b, const float* restrict c) {
//assert(N >= 1024);
if ((N & 31u) == 0) {
for (int i = 0; i < N; i++) {
a[i] = b[0] * c[i];
}
}
}

produces this sub-optimal code:

fillArray:
mov eax, edi
and eax, 31
jne .L14
test    edi, edi
je  .L14
lea r8d, [rdi-1]
vmovss  xmm1, DWORD PTR [rdx]
cmp r8d, 6
jbe .L8
mov edx, edi
vbroadcastss    ymm2, xmm1
xor eax, eax
shr edx, 3
sal rdx, 5
.L4:
vmulps  ymm0, ymm2, YMMWORD PTR [rcx+rax]
vmovups YMMWORD PTR [rsi+rax], ymm0
add rax, 32
cmp rdx, rax
jne .L4
mov eax, edi
and eax, -8
mov edx, eax
cmp edi, eax
je  .L16
vzeroupper
.L3:
mov r9d, edi
sub r8d, eax
sub r9d, eax
cmp r8d, 2
jbe .L6
mov eax, eax
vshufps xmm0, xmm1, xmm1, 0
vmulps  xmm0, xmm0, XMMWORD PTR [rcx+rax*4]
vmovups XMMWORD PTR [rsi+rax*4], xmm0
mov eax, r9d
and eax, -4
add edx, eax
cmp r9d, eax
je  .L14
.L6:
movsx   rax, edx
vmulss  xmm0, xmm1, DWORD PTR [rcx+rax*4]
vmovss  DWORD PTR [rsi+rax*4], xmm0
lea eax, [rdx+1]
cmp edi, eax
jbe .L14
cdqe
add edx, 2
vmulss  xmm0, xmm1, DWORD PTR [rcx+rax*4]
vmovss  DWORD PTR [rsi+rax*4], xmm0
cmp edi, edx
jbe .L14
movsx   rdx, edx
vmulss  xmm1, xmm1, DWORD PTR [rcx+rdx*4]
vmovss  DWORD PTR [rsi+rdx*4], xmm1
.L14:
ret
.L16:
vzeroupper
ret
.L8:
xor edx, edx
jmp .L3


Adding `assert(N == (N & ~31u));` doesn't help.

[Bug d/95250] New: [D] ICE instead of error when trying to use bad template type inside template

2020-05-20 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95250

Bug ID: 95250
   Summary: [D] ICE instead of error when trying to use bad
template type inside template
   Product: gcc
   Version: 10.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: d
  Assignee: ibuclaw at gdcproject dot org
  Reporter: witold.baryluk+gcc at gmail dot com
  Target Milestone: ---

https://godbolt.org/z/xWrXP5

Minimized version

```
module m;

import std.traits : Unsigned;

void* f(T)(T a, T b) {
alias UnsignedVoid = Unsigned!(T);
return cast(T)(cast(T)(cast(UnsignedVoid)(a-b) / 2));
}
//static assert(is(typeof(f(null, null)) == void*));   // ICE
static assert(is(typeof(f!(void*)(null, null)) == void*));  // ICE
```

The code is not correct, but on DMD v2.092.0 and LDC 1.20.1 (LLVM 9.0.1) it
does say static assert is false (which is also incorrect), and doesn't crash.

Instead, it should say something like this:

/usr/lib/gcc/x86_64-linux-gnu/11.0.0/include/d/std/traits.d:7163:13: error:
static assert  "Type void* does not have an Unsigned counterpart"

 7163 | static assert(false, "Type " ~ T.stringof ~

  | ^


Here is a local run, on Linux, amd64.

$ gdc gdc_ice.d
d21: internal compiler error: Segmentation fault
0xbd63ef crash_signal
../../src/gcc/toplev.c:328
0x7f31b746c7ff ???
./signal/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
0x71ed1e isAggregate(Type*)
../../src/gcc/d/dmd/opover.c:161
0x71ed1e visit
../../src/gcc/d/dmd/opover.c:586
0x71e935 op_overload(Expression*, Scope*)
../../src/gcc/d/dmd/opover.c:1385
0x6d0b88 Expression::op_overload(Scope*)
../../src/gcc/d/dmd/expression.h:213
0x6d0b88 ExpressionSemanticVisitor::visit(DivExp*)
../../src/gcc/d/dmd/expressionsem.c:6891
0x6cd7b4 semantic(Expression*, Scope*)
../../src/gcc/d/dmd/expressionsem.c:8214
0x6cd7b4 unaSemantic(UnaExp*, Scope*)
../../src/gcc/d/dmd/expressionsem.c:8164
0x6cd7b4 ExpressionSemanticVisitor::visit(CastExp*)
../../src/gcc/d/dmd/expressionsem.c:4203
0x6cd7b4 semantic(Expression*, Scope*)
../../src/gcc/d/dmd/expressionsem.c:8214
0x6cd7b4 unaSemantic(UnaExp*, Scope*)
../../src/gcc/d/dmd/expressionsem.c:8164
0x6cd7b4 ExpressionSemanticVisitor::visit(CastExp*)
../../src/gcc/d/dmd/expressionsem.c:4203
0x6c5a45 semantic(Expression*, Scope*)
../../src/gcc/d/dmd/expressionsem.c:8214
0x74795f StatementSemanticVisitor::visit(ReturnStatement*)
../../src/gcc/d/dmd/statementsem.c:2757
0x74a949 semantic(Statement*, Scope*)
../../src/gcc/d/dmd/statementsem.c:3782
0x74a949 StatementSemanticVisitor::visit(CompoundStatement*)
../../src/gcc/d/dmd/statementsem.c:142
0x743755 semantic(Statement*, Scope*)
../../src/gcc/d/dmd/statementsem.c:3782
0x6e8ba9 FuncDeclaration::semantic3(Scope*)
../../src/gcc/d/dmd/func.c:1711
0x6e8ba9 FuncDeclaration::semantic3(Scope*)
../../src/gcc/d/dmd/func.c:1354
$

$ gdc -v
Using built-in specs.
COLLECT_GCC=gdc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/10/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa:hsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 10.1.0-1'
--with-bugurl=file:///usr/share/doc/gcc-10/README.Bugs
--enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr
--with-gcc-major-version-only --program-suffix=-10
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug
--enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new
--enable-gnu-unique-object --disable-vtable-verify --enable-plugin
--enable-default-pie --with-system-zlib --enable-libphobos-checking=release
--with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch
--disable-werror --with-arch-32=i686 --with-abi=m64
--with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic
--enable-offload-targets=nvptx-none,amdgcn-amdhsa,hsa --without-cuda-driver
--enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu
--target=x86_64-linux-gnu
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.1.0 (Debian 10.1.0-1) 
$

[Bug d/95198] [D] extern(C) private final functions should use 'local' linker attribute

2020-05-20 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95198

--- Comment #3 from Witold Baryluk  ---
> The main example to demonstrate the current behaviour is correct would be the 
> following:

```
extern(C)
private final int f() {
  return 5;
}

auto pubf()() {
  return f();
}
```

I see, I guess you are right. I don't know how one would go about fixing this
to work correctly with existing linkers without breaking other code.

Thanks for the clarifications.

[Bug d/95198] [D] extern(C) private final functions should use 'local' linker attribute

2020-05-18 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95198

--- Comment #1 from Witold Baryluk  ---
BTW.

Using:

```
extern(C) private final static int f() { ... }
```


doesn't change anything.

[Bug d/95198] New: [D] extern(C) private final functions should use 'local' linker attribute

2020-05-18 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95198

Bug ID: 95198
   Summary: [D] extern(C) private final functions should use
'local' linker attribute
   Product: gcc
   Version: 10.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: d
  Assignee: ibuclaw at gdcproject dot org
  Reporter: witold.baryluk+gcc at gmail dot com
  Target Milestone: ---

```
module t1;

extern(C)
private final int f() {
  return 5;
}
pragma(msg, f.mangleof);

```

`gdc -c t1.d -o t1.o` results in an object with these symbols:

 D _D2t111__moduleRefZ
 D _D2t112__ModuleInfoZ
 U _d_dso_registry
 T f
 W gdc.dso_ctor
 W gdc.dso_dtor
 u gdc.dso_initialized
 u gdc.dso_slot
0016 t _GLOBAL__D_2t1
000b t _GLOBAL__I_2t1
 U _GLOBAL_OFFSET_TABLE_
 U __start_minfo
 U __stop_minfo



The symbol ' T f' should instead be ' t f'.

Additionally, when using optimizations, I would expect f not to be emitted at
all (unless the compiler decides not to inline it, or its address is taken and
passed around), but it is still there, even with `gdc -O3`.

gcc for C does use LOCAL for static functions and variables in a translation
unit. The same probably applies to C++ symbols in anonymous namespaces.



Example of linking issues:

t1.d:
```
module t1;

extern(C)
private final int f() {
  return 5;
}
```

t2.d:
```
module t2;

extern(C)
private final int f() {
  return 10;
}
```

tm.d:
```
module tm;

void main() {
}
```

$ gdc -O0 -c t1.d -o t1.o
$ gdc -O0 -c t2.d -o t2.o
$ gdc t1.o t2.o tm.d -o t12
/usr/bin/ld: t2.o: in function `f':
t2.d:(.text+0x0): multiple definition of `f'; t1.o:t1.d:(.text+0x0): first
defined here
collect2: error: ld returned 1 exit status
$


This code should link, similar to the equivalent code in C. The use case is a
local function that is passed by some other function or method in the module
(or by a static module constructor, for example) to C libraries or other
modules as a callback, or, for variables, as a return value.
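
The C/C++ behaviour this is compared against can be sketched as follows (in C++, with hypothetical names): a `static` function has internal linkage, so another translation unit can define its own `compare_ints` without a link-time clash, while the function can still be handed to a C library as a callback.

```cpp
#include <cstdlib>

// Internal linkage: `static` keeps this symbol local to the translation
// unit, so the same name may be defined again in another object file.
static int compare_ints(const void* a, const void* b) {
    const int x = *static_cast<const int*>(a);
    const int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);  // negative, zero or positive
}

// The local function is still usable as a callback for a C API.
void sort_ints(int* data, std::size_t n) {
    std::qsort(data, n, sizeof data[0], compare_ints);
}
```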

[Bug d/95174] [D] Incorrect compiled functions involving const fixed size arrays

2020-05-18 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95174

--- Comment #2 from Witold Baryluk  ---
Doh. Of course. My bad. Sorry.


Static arrays are value types; dynamic arrays are reference types.

Changing signature to:

```
void f(immutable(float[64]) x, float[] o);
```


solves the problem.
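
The same distinction exists in C++, which may make the behaviour easier to see. This is an analogy only, not D code, and the function names are hypothetical: `std::array` passed by value is copied, like a D static array, while a reference parameter writes through to the caller, like a D slice.

```cpp
#include <array>
#include <cstddef>

// By value: `o` is a private copy, so the caller never sees these writes
// and the compiler is free to drop them entirely.
void scale_by_value(const std::array<float, 4>& x, std::array<float, 4> o) {
    for (std::size_t i = 0; i < x.size(); ++i) o[i] = x[i] * 2.0f;
}

// By reference: the writes land in the caller's array.
void scale_by_ref(const std::array<float, 4>& x, std::array<float, 4>& o) {
    for (std::size_t i = 0; i < x.size(); ++i) o[i] = x[i] * 2.0f;
}
```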

[Bug d/95174] New: [D] Incorrect compiled functions involving const fixed size arrays

2020-05-17 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95174

Bug ID: 95174
   Summary: [D] Incorrect compiled functions involving const fixed
size arrays
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: d
  Assignee: ibuclaw at gdcproject dot org
  Reporter: witold.baryluk+gcc at gmail dot com
  Target Milestone: ---

https://explore.dgnu.org/z/LppySp


```
void f(immutable(float[64]) x, float[64] o) {
   o[] = x[] * 2.0f;
}
```

and

```
void f(immutable(float[64]) x, float[64] o) {
foreach (i; 0 .. 64) {
o[i] = x[i] * 2.0f;
}
}
```

and

```
void f(immutable(float[64]) x, float[64] o) {
o[1] = x[5] + x[7];
}

```

All three are incorrectly compiled to 'nop; ret'.


It appears DMD (v2.092) essentially does the same, and does not perform any
computations in the function.

LDC2 (1.20.1, based on DMD v2.090.1) does generate correct code in some cases
(fully unrolled and fully vectorized in this specific case), but in some others
it also does nothing and simply emits 'ret' in the function.



As a bonus:

```
void f(immutable(float[4]) x, float[4] o) {
   o[2] = x[1] + x[3];
}

import std.stdio : writeln;

void main() {
  immutable(float[4]) k = [7.0f, 5.3f, 1.2f, 3.2f];
  float[4] o;
  f(k, o);
  writeln(o);
}

```

prints '[nan, nan, nan, nan]', but it should print '[nan, nan, 8.5, nan]'.


I got the same results using my local gdc version 10.1.0 on amd64.

[Bug d/95173] New: [D] ICE on some architecture targets when trying to use unknown attribute

2020-05-17 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95173

Bug ID: 95173
   Summary: [D] ICE on some architecture targets when trying to
use unknown attribute
   Product: gcc
   Version: 9.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: d
  Assignee: ibuclaw at gdcproject dot org
  Reporter: witold.baryluk+gcc at gmail dot com
  Target Milestone: ---

https://explore.dgnu.org/z/bseyKQ


```
import gcc.attribute;

@attribute("foo")
void f() {}
```

This code crashes the compiler when compiling for alpha-linux-gnu, sparc64-elf,
hppa-linux-gnu, mmix-knuth-mmixware, pdp11-aout, lm32-elf and possibly more,
with variations. (But sparc64-sun-solaris2.11, for example, works fine.)

The code compiles correctly for known attributes.

Other targets do compile without a crash, just with a warning about unknown
attribute. That is correct behaviour.

[Bug d/94496] [D] Use aggressive optimizations in release mode

2020-05-15 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94496

--- Comment #3 from Witold Baryluk  ---
Also about 'nothrow' and Errors.

I would really welcome a compiler flag that simply terminates all threads
immediately, at the throw location, whenever any Error is thrown. They aren't
really recoverable. The only options are to either catch them really high in
the stack or terminate the program (in D this will, I think, unwind the whole
stack and destroy scoped structs, also run a full GC collection, optionally
call all class destructors, and module destructors). But in many cases
terminating the program on the spot would be preferable (_exit(2) or _Exit(2),
from glibc (not the kernel), to terminate all threads via exit_group).

As for 'nothrow' itself: I believe it doesn't mean there are 'no thrown
exceptions in the call tree'. I think it means there are no 'uncaught
exceptions possibly thrown by a call to this function'.

```d
extern int g(int x);  // not nothrow

int f(int x) nothrow {
  try {
 return g(x);
 throw new MyException("ble");
  } catch (Exception e) {
return 1;
  }
  return 0;
}
```

https://gcc.godbolt.org/z/Y3vNQr
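
A loosely analogous sketch in C++, where `noexcept` plays a role similar to (but not identical with) D's `nothrow`. The function bodies are hypothetical. The point is the same as in the D example: a function can promise not to let exceptions escape while still calling code that throws, as long as it catches them.

```cpp
#include <stdexcept>

// May throw; stands in for the non-nothrow `g` in the D example.
int g(int x) {
    if (x < 0) throw std::runtime_error("negative input");
    return x * 2;
}

// noexcept describes what can escape f, not what happens inside it:
// exceptions derived from std::exception thrown by g are caught here.
// (Anything else escaping would call std::terminate.)
int f(int x) noexcept {
    try {
        return g(x);
    } catch (const std::exception&) {
        return 1;
    }
}
```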


As for asm pure: considering there is asm volatile, wouldn't it make sense
to not allow 'asm pure volatile' in the first place in the source?

Strict aliasing should be enabled for dynamic arrays, static arrays and normal
pointers to other types, i.e.

```d
void f(int* x, float[] y);  // x, y, y.ptr should not alias.
```



```d
void f(int* x, int[] y);  // x, y and y.ptr can alias.
```

Also how about using `restrict` automatically for transitively const types?
I.e. 

```d
// a and b can't alias; if b aliases a, then it is UB.
void f(const scope int[] a, int *b);
```
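
For comparison, here is a rough C analogue of what such an implicit `restrict` would promise the optimizer. This is only a sketch: the name `sum_into` is made up for illustration, and nothing here reflects what gdc currently does for D code.

```c
#include <stddef.h>

/* Adds every element of the read-only array a[0..n) into *b.
 * Both pointers are `restrict`: the caller promises that *b does not
 * overlap a[0..n), so the compiler may keep *b in a register for the
 * whole loop instead of reloading it after every store.
 * Passing overlapping arguments is undefined behaviour. */
void sum_into(const int *restrict a, size_t n, int *restrict b) {
  for (size_t i = 0; i < n; i++) {
    *b += a[i];
  }
}
```

With a `const scope int[] a` parameter in D, the compiler could in principle derive the same promise from the types alone, which is the suggestion above.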

[Bug d/95120] [D] Incorrectly allows fqdn access to imported symbols when doing selective imports.

2020-05-13 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95120

--- Comment #2 from Witold Baryluk  ---
Created attachment 48530
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48530&action=edit
Minimized example

[Bug d/95120] [D] Incorrectly allows fqdn access to imported symbols when doing selective imports.

2020-05-13 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95120

--- Comment #1 from Witold Baryluk  ---
Further minimized:

==
import std.stdio;
import std.algorithm.comparison : min;

int main() {
  return std.algorithm.comparison.min(3, 2);
}
==


Removing `import std.stdio;` results in the same error message in gdc-10, dmd
and ldc2.

$ gdc badimport.d
badimport.d:5:10: error: undefined identifier ‘std’
    5 |   return std.algorithm.comparison.min(3, 2);
      |          ^
$
$ ldc2 badimport.d
badimport.d(5): Error: undefined identifier std
$ dmd badimport.d
badimport.d(5): Error: undefined identifier std
$


All three complain about an undefined `std`.

When I keep `import std.stdio;` at the start, dmd and ldc instead complain
about an undefined `algorithm` in package `std`.

Not sure if this is maybe caused by something in the `std.stdio` package.

[Bug d/95120] New: [D] Incorrectly allows fqdn access to imported symbols when doing selective imports.

2020-05-13 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95120

Bug ID: 95120
   Summary: [D] Incorrectly allows fqdn access to imported symbols
when doing selective imports.
   Product: gcc
   Version: 10.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: d
  Assignee: ibuclaw at gdcproject dot org
  Reporter: witold.baryluk+gcc at gmail dot com
  Target Milestone: ---

Created attachment 48529
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48529&action=edit
Example of incorrectly accepted d source by gdc-10

gdc violates the D language spec:

https://dlang.org/spec/module.html#selective_imports


4.7 Selective Imports

Specific symbols can be exclusively imported from a module and bound into the
current namespace:

import std.stdio : writeln, foo = write;

void main()
{
    std.stdio.writeln("hello!"); // error, std is undefined
    writeln("hello!");           // ok, writeln bound into current namespace
    write("world");              // error, write is undefined
    foo("world");                // ok, calls std.stdio.write()
    fwritefln(stdout, "abc");    // error, fwritefln undefined
}
=


I found that in some weird situations the gdc-10 does behave differently
than dmd and ldc2.

Here are the versions I used:

$ dmd --version
DMD64 D Compiler v2.092.0
Copyright (C) 1999-2020 by The D Language Foundation, All Rights Reserved
written by Walter Bright
$ ldc2 --version
LDC - the LLVM D compiler (1.20.1):
  based on DMD v2.090.1 and LLVM 9.0.1
  built with LDC - the LLVM D compiler (1.20.1)
  Default target: x86_64-pc-linux-gnu
  Host CPU: znver1
$ gdc-10 --version
gdc-10 (Debian 10.1.0-1) 10.1.0
$


All on Debian testing/unstable, amd64.

 badimport.d =
void main() {
  import std.stdio;
  import std.algorithm.comparison : min;

  static struct S {
    int min_;

    // int min() { return min_; }

    void opOpAssign(string op)(const S other) if (op == "+") {
      min_ = std.algorithm.comparison.min(min_, other.min_);
    }
  }

  S x = {3};
  x += x;
}
=

(the intention was to use fqdn here, to not reference struct member function
min; using `.min(min_, other.min_)`, is another option, but it actually
shouldn't work either, due to other reasons).

Anyway:

$ gdc-10 badimport.d   # Compiles.
$


$ ldc2 badimport.d # Correct error.
badimport.d(11): Error: undefined identifier algorithm in package std, perhaps
add static import std.algorithm;
badimport.d(16): Error: template instance badimport.main.S.opOpAssign!"+" error
instantiating
$

$ dmd badimport.d  # Correct error.
badimport.d(11): Error: undefined identifier algorithm in package std, perhaps
add static import std.algorithm;
badimport.d(16): Error: template instance badimport.main.S.opOpAssign!"+" error
instantiating
$


The code produced by gdc-10 does work correctly. However, it shouldn't compile
at all.

From what I can see, it is some kind of interaction with the preceding import,
that is, `import std.stdio;`. Removing `import std.stdio;` makes gdc-10
correctly report the error and stop compilation.

The test case can be minimized further; the minimized version is attached to
the bug.

Same behaviour:

==
import std.stdio;
import std.algorithm.comparison : min;

struct S {
  int min_;
  void add(const S other) {
    min_ = std.algorithm.comparison.min(min_, other.min_);
  }
}

void main() {
  S x = {3};
  x.add(x);
}
==

[Bug d/94496] [D] Use aggressive optimizations in release mode

2020-05-13 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94496

Witold Baryluk  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |witold.baryluk+gcc at gmail dot com

--- Comment #1 from Witold Baryluk  ---
We are close to making 'in' mean 'scope const', it is already available as a
preview in dmd 2.092: https://dlang.org/changelog/2.092.0.html#preview-in

[Bug c/83584] "ISO C forbids conversion of object pointer to function pointer type" -- no, not really

2019-10-20 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83584

--- Comment #20 from Witold Baryluk  ---
FYI. http://austingroupbugs.net/view.php?id=74#c205

says

Note that conversion from a void * pointer to a function pointer
as in:

fptr = (int (*)(int))dlsym(handle, "my_function");

is not defined by the ISO C Standard.  This standard requires
this conversion to work correctly on conforming implementations.



This is published now as IEEE Std 1003.1-2017, aka POSIX.1-2017:

https://pubs.opengroup.org/onlinepubs/9699919799/functions/dlsym.html

POSIX standard is free to do so.
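
To make the disputed conversion concrete, here is a small self-contained sketch. It deliberately does not call the real `dlsym`: `lookup_symbol` is a hypothetical stand-in that returns a `void *` the same way, and `my_function`, `int_fn` and `call_by_name` are likewise made-up names. The cast in `call_by_name` is the one gcc diagnoses under -pedantic.

```c
#include <stddef.h>
#include <string.h>

typedef int (*int_fn)(int);

/* Example target function that the lookup below can find. */
int my_function(int x) {
  return x + 1;
}

/* Hypothetical stand-in for dlsym(3): returns the address as void *,
 * or NULL when the name is unknown. */
void *lookup_symbol(const char *name) {
  if (strcmp(name, "my_function") == 0) {
    /* Function pointer -> void *: fine on POSIX systems, not ISO C. */
    return (void *)my_function;
  }
  return NULL;
}

/* Looks the function up by name and calls it.
 * Returns 0 on success and -1 if the symbol was not found. */
int call_by_name(const char *name, int arg, int *result) {
  void *sym = lookup_symbol(name);
  if (sym == NULL) {
    return -1;
  }
  /* void * -> function pointer: the conversion the POSIX text quoted
   * above requires to work on conforming implementations. */
  int_fn fptr = (int_fn)sym;
  *result = fptr(arg);
  return 0;
}
```

With the real API the only difference would be that `sym` comes from `dlsym(handle, "my_function")`.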

[Bug c/83584] "ISO C forbids conversion of object pointer to function pointer type" -- no, not really

2019-10-20 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83584

Witold Baryluk  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |witold.baryluk+gcc at gmail dot com

--- Comment #19 from Witold Baryluk  ---
This is still happening, even when using -std=c11 with gcc 9.2.1

C11 does state in annex J.5.7:

http://port70.net/~nsz/c/c11/n1570.html#J.5.7p1

"""
J.5.7 Function pointer casts

1 A pointer to an object or to void may be cast to a pointer to a function,
allowing data to be invoked as a function (6.5.4). 
"""


I am not sure how else I am supposed to use `dlsym(3)`. Maybe if there was a
version of dlsym that instead of returning (void*), would return (void(*)()) or
(void (*)(void)) it would help. Maybe it is a POSIX bug then?


This issue however is not a duplicate of bug 11234. And the error message is
incorrect anyway.

Sorry if this was mentioned before.

[Bug tree-optimization/92130] Missed vectorization for iteration dependent loads and simple multiplicative accumulators

2019-10-17 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130

--- Comment #9 from Witold Baryluk  ---
Indeed, passing -fno-tree-pre in the first example does get it vectorized.

In mesh_simple.c this corresponds to ONTHEFLY_CONSTANTS being defined and
USE_LOOP_CONSTANTS not being defined. SIMPLIFIED can be defined or not; the
loop now vectorizes in both cases.

Targeting -march=knm.

This is with #define OCTAVES 12, a compile-time constant, so the compiler fully
unrolls the innermost loop.

Without -fno-tree-pre:

1230 :
1230:   41 57                  push   %r15
1232:   62 a1 7d 40 ef c0      vpxord %zmm16,%zmm16,%zmm16
1238:   49 ba 53 ec 85 1a fe   movabs $0xc4ceb9fe1a85ec53,%r10
123f:   b9 ce c4
1242:   41 56                  push   %r14
1244:   c5 7a 10 0d f8 0d 00   vmovss 0xdf8(%rip),%xmm9        # 2044 <_IO_stdin_used+0x44>
124b:   00
124c:   62 31 7c 48 28 d0      vmovaps %zmm16,%zmm10
1252:   41 55                  push   %r13
1254:   c5 7a 10 3d ec 0d 00   vmovss 0xdec(%rip),%xmm15        # 2048 <_IO_stdin_used+0x48>
125b:   00
125c:   62 a1 7c 48 28 d0      vmovaps %zmm16,%zmm18
1262:   41 54                  push   %r12
1264:   c5 7a 10 35 e0 0d 00   vmovss 0xde0(%rip),%xmm14        # 204c <_IO_stdin_used+0x4c>
126b:   00
126c:   49 b9 cd 8c 55 ed d7   movabs $0xff51afd7ed558ccd,%r9
1273:   af 51 ff
1276:   55                     push   %rbp
1277:   c5 7a 10 2d d1 0d 00   vmovss 0xdd1(%rip),%xmm13        # 2050 <_IO_stdin_used+0x50>
127e:   00
127f:   49 be 68 66 ac 6a bf   movabs $0xfa8d7ebf6aac6668,%r14
1286:   7e 8d fa
1289:   53                     push   %rbx
128a:   c5 7a 10 25 c2 0d 00   vmovss 0xdc2(%rip),%xmm12        # 2054 <_IO_stdin_used+0x54>
1291:   00
1292:   48 89 7c 24 f8         mov    %rdi,-0x8(%rsp)
1297:   c7 44 24 f0 00 00 00   movl   $0x0,-0x10(%rsp)
129e:   00
129f:   c7 44 24 f4 00 00 00   movl   $0x0,-0xc(%rsp)
12a6:   00
12a7:   c5 7a 10 1d a9 0d 00   vmovss 0xda9(%rip),%xmm11        # 2058 <_IO_stdin_used+0x58>
12ae:   00
12af:   62 e1 7e 08 10 0d a3   vmovss 0xda3(%rip),%xmm17        # 205c <_IO_stdin_used+0x5c>
12b6:   0d 00 00
12b9:   0f 1f 80 00 00 00 00   nopl   0x0(%rax)
12c0:   48 8b 6c 24 f8         mov    -0x8(%rsp),%rbp
12c5:   31 f6                  xor    %esi,%esi
12c7:   31 db                  xor    %ebx,%ebx
12c9:   62 31 7c 48 28 c2      vmovaps %zmm18,%zmm8
12cf:   90                     nop
12d0:   8b 54 24 f0            mov    -0x10(%rsp),%edx
12d4:   45 31 e4               xor    %r12d,%r12d
12d7:   62 b1 7c 48 28 f8      vmovaps %zmm16,%zmm7
12dd:   62 c1 7c 48 28 d9      vmovaps %zmm9,%zmm19
12e3:   c5 32 11 cc            vmovss %xmm9,%xmm9,%xmm4
12e7:   eb 26                  jmp    130f
12e9:   0f 1f 80 00 00 00 00   nopl   0x0(%rax)
12f0:   c5 ba 59 c4            vmulss %xmm4,%xmm8,%xmm0
12f4:   62 f3 7d 08 0a c0 09   vrndscaless $0x9,%xmm0,%xmm0,%xmm0
12fb:   c5 fa 2c f0            vcvttss2si %xmm0,%esi
12ff:   c4 c1 5a 59 c2         vmulss %xmm10,%xmm4,%xmm0
1304:   62 f3 7d 08 0a c0 09   vrndscaless $0x9,%xmm0,%xmm0,%xmm0
130b:   c5 fa 2c d0            vcvttss2si %xmm0,%edx
130f:   4c 89 e1               mov    %r12,%rcx
1312:   62 c1 7c 48 28 e8      vmovaps %zmm8,%zmm21
1318:   48 c1 e9 21            shr    $0x21,%rcx
131c:   62 e1 7c 48 28 e4      vmovaps %zmm4,%zmm20
1322:   c5 d2 2a ea            vcvtsi2ss %edx,%xmm5,%xmm5
1326:   4c 31 e1               xor    %r12,%rcx
1329:   49 0f af ca            imul   %r10,%rcx
132d:   48 63 d2               movslq %edx,%rdx
1330:   c5 e2 2a de            vcvtsi2ss %esi,%xmm3,%xmm3
1334:   4f 8d 24 0c            lea    (%r12,%r9,1),%r12
1338:   48 69 d2 53 42 41 4e   imul   $0x4e414253,%rdx,%rdx
133f:   62 c2 55 08 9b e2      vfmsub132ss %xmm10,%xmm5,%xmm20
1345:   c4 c1 52 58 e9         vaddss %xmm9,%xmm5,%xmm5
134a:   48 8d 01               lea    (%rcx),%rax
134d:   48 c1 e8 21            shr    $0x21,%rax
1351:   62 e2 65 08 9b ec      vfmsub132ss %xmm4,%xmm3,%xmm21
1357:   48 31 c1               xor    %rax,%rcx
135a:   4c 8d ba 53 42 41 4e   lea    0x4e414253(%rdx),%r15
1361:   48 89 cf               mov    %rcx,%rdi
1364:   48 89 c8               mov    %rcx,%rax
1367:   48 81 f7 70 46 ab 58   xor    $0x58ab4670,%rdi
136e:   c4 c1 62 58 d9         vaddss %xmm9,%xmm3,%xmm3
1373:   48 c1 e8 21            shr

[Bug tree-optimization/92130] Missed vectorization for iteration dependent loads and simple multiplicative accumulators

2019-10-16 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130

--- Comment #7 from Witold Baryluk  ---
Online examples: https://gcc.godbolt.org/z/Nyjty3

[Bug tree-optimization/92130] Missed vectorization for iteration dependent loads and simple multiplicative accumulators

2019-10-16 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130

--- Comment #6 from Witold Baryluk  ---
I also tested clang with LLVM 10~svn374655 and it does vectorize the loop
properly, even when both the frequency and amplitude variables are updated on
every iteration.

It still doesn't inline calls to sinf, even if I set -fno-math-errno and other
things from -ffast-math. My random guess is that this is because there is no
hardware support for vectorized sinf, and there is no vectorized variant of the
sinf software implementation either. If I provide my own version of sinf using
a simple Taylor expansion, clang fully vectorizes the code:



  401320:   62 e1 7d 58 fe 3d 56   vpaddd 0xd56(%rip){1to16},%zmm0,%zmm23        # 402080 <_IO_stdin_used+0x80>
  401327:   0d 00 00
  40132a:   62 61 7c 48 5b c0      vcvtdq2ps %zmm0,%zmm24
  401330:   62 a1 7c 48 5b ff      vcvtdq2ps %zmm23,%zmm23
  401336:   62 f1 7c 48 10 4c 24   vmovups 0x140(%rsp),%zmm1
  40133d:   05
  40133e:   62 61 3c 40 59 d1      vmulps %zmm1,%zmm24,%zmm26
  401344:   62 61 44 40 59 f9      vmulps %zmm1,%zmm23,%zmm31
  40134a:   62 f1 7c 48 10 4c 24   vmovups 0x100(%rsp),%zmm1
  401351:   04
  401352:   62 61 3c 40 59 d9      vmulps %zmm1,%zmm24,%zmm27
  401358:   62 f1 44 40 59 c9      vmulps %zmm1,%zmm23,%zmm1
  40135e:   62 01 2c 40 59 ca      vmulps %zmm26,%zmm26,%zmm25
  401364:   62 f1 7c 48 10 54 24   vmovups 0x80(%rsp),%zmm2
  40136b:   02
  40136c:   62 61 3c 40 59 e2      vmulps %zmm2,%zmm24,%zmm28
  401372:   62 f1 44 40 59 d2      vmulps %zmm2,%zmm23,%zmm2
  401378:   62 02 25 40 ac ca      vfnmadd213ps %zmm26,%zmm27,%zmm25
  40137e:   62 f1 7c 48 10 5c 24   vmovups 0x40(%rsp),%zmm3
  401385:   01
  401386:   62 61 3c 40 59 eb      vmulps %zmm3,%zmm24,%zmm29
  40138c:   62 f1 44 40 59 db      vmulps %zmm3,%zmm23,%zmm3
  401392:   62 01 1c 40 59 d4      vmulps %zmm28,%zmm28,%zmm26
  401398:   62 01 04 40 59 df      vmulps %zmm31,%zmm31,%zmm27
  40139e:   62 02 15 40 ac d4      vfnmadd213ps %zmm28,%zmm29,%zmm26
  4013a4:   62 f1 7c 48 10 6c 24   vmovups -0x40(%rsp),%zmm5
  4013ab:   ff
  4013ac:   62 f1 3c 40 59 e5      vmulps %zmm5,%zmm24,%zmm4
  4013b2:   62 f1 44 40 59 ed      vmulps %zmm5,%zmm23,%zmm5
  4013b8:   62 61 6c 48 59 e2      vmulps %zmm2,%zmm2,%zmm28
  4013be:   62 f1 7c 48 10 7c 24   vmovups -0x80(%rsp),%zmm7
  4013c5:   fe
  4013c6:   62 f1 3c 40 59 f7      vmulps %zmm7,%zmm24,%zmm6
  4013cc:   62 f1 44 40 59 ff      vmulps %zmm7,%zmm23,%zmm7
  4013d2:   62 61 5c 48 59 ec      vmulps %zmm4,%zmm4,%zmm29
  4013d8:   62 61 54 48 59 f5      vmulps %zmm5,%zmm5,%zmm30
  4013de:   62 62 4d 48 ac ec      vfnmadd213ps %zmm4,%zmm6,%zmm29
  4013e4:   62 d1 3c 40 59 e3      vmulps %zmm11,%zmm24,%zmm4
  4013ea:   62 d1 44 40 59 f3      vmulps %zmm11,%zmm23,%zmm6
  4013f0:   62 02 75 48 ac df      vfnmadd213ps %zmm31,%zmm1,%zmm27
  4013f6:   62 d1 3c 40 59 cc      vmulps %zmm12,%zmm24,%zmm1
  4013fc:   62 41 44 40 59 fc      vmulps %zmm12,%zmm23,%zmm31
  401402:   62 71 5c 48 59 c4      vmulps %zmm4,%zmm4,%zmm8
  401408:   62 62 65 48 ac e2      vfnmadd213ps %zmm2,%zmm3,%zmm28
  40140e:   62 72 75 48 ac c4      vfnmadd213ps %zmm4,%zmm1,%zmm8
  401414:   62 d1 3c 40 59 ce      vmulps %zmm14,%zmm24,%zmm1
  40141a:   62 d1 44 40 59 d6      vmulps %zmm14,%zmm23,%zmm2
  401420:   62 62 45 48 ac f5      vfnmadd213ps %zmm5,%zmm7,%zmm30
  401426:   62 d1 3c 40 59 df      vmulps %zmm15,%zmm24,%zmm3
  40142c:   62 d1 44 40 59 e7      vmulps %zmm15,%zmm23,%zmm4
  401432:   62 f1 74 48 59 e9      vmulps %zmm1,%zmm1,%zmm5
  401438:   62 f1 4c 48 59 fe      vmulps %zmm6,%zmm6,%zmm7
  40143e:   62 71 6c 48 59 ca      vmulps %zmm2,%zmm2,%zmm9
  401444:   62 f2 65 48 ac e9      vfnmadd213ps %zmm1,%zmm3,%zmm5
  40144a:   62 b1 3c 40 59 c9      vmulps %zmm17,%zmm24,%zmm1
  401450:   62 f2 05 40 ac fe      vfnmadd213ps %zmm6,%zmm31,%zmm7
  401456:   62 b1 44 40 59 d9      vmulps %zmm17,%zmm23,%zmm3
  40145c:   62 b1 3c 40 59 f2      vmulps %zmm18,%zmm24,%zmm6
  401462:   62 21 44 40 59 fa      vmulps %zmm18,%zmm23,%zmm31
  401468:   62 72 5d 48 ac ca      vfnmadd213ps %zmm2,%zmm4,%zmm9
  40146e:   62 f1 74 48 59 d1      vmulps %zmm1,%zmm1,%zmm2
  401474:   62 f1 64 48 59 e3      vmulps %zmm3,%zmm3,%zmm4
  40147a:   62 f2 4d 48 ac d1      vfnmadd213ps %zmm1,%zmm6,%zmm2
  401480:   62 f2 05 40 ac e3      vfnmadd213ps %zmm3,%zmm31,%zmm4
  401486:   62 b1 3c 40 59 cc      vmulps %zmm20,%zmm24,%zmm1
  40148c:   62 b1 3c 40 59 dd      vmulps %zmm21,%zmm24,%zmm3
  401492:   62 f1 74 48 59 f1      vmulps %zmm1,%zmm1,%zmm6
  401498:   62 21 44 40 59 fc      vmulps %zmm20,%zmm23,%zmm31
  40149e:   62 f2 65 48 ac f1      vfnmadd213ps

[Bug tree-optimization/92130] Missed vectorization for iteration dependent loads and simple multiplicative accumulators

2019-10-16 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130

--- Comment #5 from Witold Baryluk  ---
As a bonus:


static float perlin1d(float x) {
  float accum = 0.0f;
  for (int i = 0; i < 8; i++) {
    accum += powf(0.781f, i) * sinf(x * powf(2.131f, i));
  }
  return accum;
}


is reported as vectorized, but really isn't: it still has calls to sinf and
expf_finite that are neither inlined nor lowered.

[Bug tree-optimization/92130] Missed vectorization for iteration dependent loads and simple multiplicative accumulators

2019-10-16 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130

--- Comment #4 from Witold Baryluk  ---
If I reduce the minimized test case even further:

only frequency update: VECTORIZED:

static float perlin1d(float x) {
  float accum = 0.0f;
  float amplitude = 1.0f;
  float frequency = 1.0f;
  for (int i = 0; i < 8; i++) {
    accum += amplitude * sinf(x * frequency);
    frequency *= 2.131f;
  }
  return accum;
}

__attribute__((noinline))
static void fill_data(int width, float * __restrict__ height_data, float scale)
{
  for (int i = 0; i < width; i++) {
    height_data[i] = perlin1d(i);
  }
}


only amplitude update: VECTORIZED:

static float perlin1d(float x) {
  float accum = 0.0f;
  float amplitude = 1.0f;
  float frequency = 1.0f;
  for (int i = 0; i < 8; i++) {
    accum += amplitude * sinf(x * frequency);
    amplitude *= 0.781f;
  }
  return accum;
}

__attribute__((noinline))
static void fill_data(int width, float * __restrict__ height_data, float scale)
{
  for (int i = 0; i < width; i++) {
    height_data[i] = perlin1d(i);
  }
}

both frequency and amplitude update: NOT VECTORIZED:

static float perlin1d(float x) {
  float accum = 0.0f;
  float amplitude = 1.0f;
  float frequency = 1.0f;
  for (int i = 0; i < 8; i++) {
    accum += amplitude * sinf(x * frequency);
    amplitude *= 0.781f;
    frequency *= 2.131f;
  }
  return accum;
}

__attribute__((noinline))
static void fill_data(int width, float * __restrict__ height_data, float scale)
{
  for (int i = 0; i < width; i++) {
    height_data[i] = perlin1d(i);
  }
}

[Bug tree-optimization/92130] Missed vectorization for iteration dependent loads and simple multiplicative accumulators

2019-10-16 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130

--- Comment #3 from Witold Baryluk  ---
If only the frequency is updated in the inner loop:

frequency *= 2.131f;

function fill_data is vectorized:

mesh_minimal.c:34:3: optimized: loop vectorized using 64 byte vectors
mesh_minimal.c:33:13: note: vectorized 1 loops in function.


However if amplitude is updated in the inner loop:

amplitude *= 0.781f;

function fill_data is NOT vectorized.

mesh_minimal.c:34:3: missed: couldn't vectorize loop
mesh_minimal.c:34:3: missed: not vectorized: latch block not empty.
mesh_minimal.c:33:13: note: vectorized 0 loops in function.


Here for reference:


/* line 20 */ static float perlin1d(float x) {
  float accum = 0.0;
  float frequency = 1.0;
  float amplitude = 1.0;
  for (int i = 0; i < 8; i++) {
    accum += amplitude * (sinf(x * frequency + (float)i));
    frequency *= 2.131f;
    amplitude *= 0.781f;
  }
  return accum;
}

__attribute__((noinline))
/* line 33 */ static void fill_data(int width, float * __restrict__ height_data, float scale) {
  /* line 34 */ for (int i = 0; i < width; i++) {
    height_data[i] = perlin1d(i);
  }
}

[Bug tree-optimization/92130] Missed vectorization for iteration dependent loads and simple multiplicative accumulators

2019-10-16 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130

--- Comment #2 from Witold Baryluk  ---
Added a minimized test case that has only one outer loop, with f and h removed
and replaced by simple inlined code.

Example diagnostic:

$ gcc -std=c17 -march=knm -O3 -ffast-math -fassociative-math
-ftree-vectorizer-verbose=2 -fopt-info-vec-all -ggdb -Wall mesh_minimal.c -o
mesh_minimal_knm -lm

mesh_minimal.c:34:3: missed: couldn't vectorize loop
mesh_minimal.c:34:3: missed: not vectorized: latch block not empty.
mesh_minimal.c:33:13: note: vectorized 0 loops in function.

[Bug tree-optimization/92130] Missed vectorization for iteration dependent loads and simple multiplicative accumulators

2019-10-16 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130

--- Comment #1 from Witold Baryluk  ---
Created attachment 47052
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47052&action=edit
Minimized test case

[Bug tree-optimization/92130] New: Missed vectorization for iteration dependent loads and simple multiplicative accumulators

2019-10-16 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92130

Bug ID: 92130
   Summary: Missed vectorization for iteration dependent loads and
simple multiplicative accumulators
   Product: gcc
   Version: 9.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: witold.baryluk+gcc at gmail dot com
  Target Milestone: ---

Created attachment 47051
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47051&action=edit
Perlin2D noise mesh generation

So,

I have a pretty complex multi-level loop spread across many functions. All of
it can be vectorized, but under certain scenarios gcc 9.2.1 does not vectorize
it.

I am attaching somewhat simplified code with a few defines inside to play with.

The configuration enabled by default presents the biggest challenge to gcc,
even though I am able to vectorize it manually.

I tested this on SSE2, AVX2 (cascadelake and znver2), AVX512 (-march=knm and
-march=skylake-avx512) and ARM SVE, all with the same result. I am using
associative math and the other flags mentioned at the top of the source file.

The high level overview is like this:

input: A, F, W, maxO, sufficiently aligned d.

foreach y:
  foreach x:
    float v = 0.0
    float a = 1.0
    float f = 1.0
    foreach o in [0, maxO):
      v += a * g(f * x, f * y, o, h(o, p))
      a *= A
      f *= F
    d[y*W + x] = v

where both g and h are pure functions (relatively complex, though) with no
control flow or data-dependent flow.

In some situations, if a and f are replaced by precomputed tables of
coefficients for every o, used as v += a[o] * g(f[o] * x, f[o] * y, o,
h(o, p)), the loop does vectorize, but not always. h(o, p) could also be
precomputed, but I didn't bother, as it appears not to have any bad effect on
the vectorizer.
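
A minimal C sketch of that table-based variant, reduced to one dimension. All names are made up for illustration (`OCTAVES`, `g_stub`, `init_tables`, `noise_table`), and `g_stub` is just a truncated Taylor series of sin standing in for the real g, so this is not the attached code.

```c
#define OCTAVES 8

/* Stand-in for the pure function g: a truncated Taylor series of sin,
 * used here only to keep the sketch self-contained. */
float g_stub(float t) {
  return t - (t * t * t) / 6.0f;
}

/* Fills amp[o] = A^o and freq[o] = F^o once, outside the hot loops. */
void init_tables(float amp[OCTAVES], float freq[OCTAVES], float A, float F) {
  float a = 1.0f;
  float f = 1.0f;
  for (int o = 0; o < OCTAVES; o++) {
    amp[o] = a;
    freq[o] = f;
    a *= A;
    f *= F;
  }
}

/* Octave loop without loop-carried multiplications: every iteration
 * reads its coefficients from the tables instead of updating a and f. */
float noise_table(float x, const float amp[OCTAVES], const float freq[OCTAVES]) {
  float accum = 0.0f;
  for (int o = 0; o < OCTAVES; o++) {
    accum += amp[o] * g_stub(x * freq[o]);
  }
  return accum;
}
```

The only loop-carried dependency left in `noise_table` is the accumulation into `accum`; the multiplicative updates of a and f have moved into `init_tables`.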

The vectorizer should vectorize along 'foreach x' and compute multiple x-s,
one per lane, completely independently. It is true that when updating a and f
the value needs to be duplicated into each lane, but that can be done by
computing it as a scalar and then broadcasting, or by repeating the same
constant updates in each lane.
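
One way to express the "compute a and f as scalars, then broadcast" idea by hand is to move the octave loop outside the x loop. The sketch below does that for a single row in one dimension. It is a manual rearrangement with hypothetical names (`g_stub`, `fill_row`), not what gcc generates and not the attached code.

```c
/* Stand-in for the pure function g (truncated Taylor series of sin). */
float g_stub(float t) {
  return t - (t * t * t) / 6.0f;
}

/* Computes one output row d[0..width).
 * The octave loop is outermost, so a and f are plain scalars that are
 * updated once per octave and shared by every x. The inner loop is an
 * element-wise update with no dependency between iterations, which is
 * the shape a vectorizer handles by broadcasting a and f to all lanes. */
void fill_row(int width, float *restrict d, float A, float F, int max_o) {
  for (int x = 0; x < width; x++) {
    d[x] = 0.0f;
  }
  float a = 1.0f;
  float f = 1.0f;
  for (int o = 0; o < max_o; o++) {
    for (int x = 0; x < width; x++) {
      d[x] += a * g_stub(f * (float)x);
    }
    a *= A;
    f *= F;
  }
}
```

For each x the octave contributions are still added in increasing o, as in the pseudocode above. The price is that d is read and written once per octave instead of once overall.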

[Bug tree-optimization/63945] Missing vectorization optimization

2019-10-16 Thread witold.baryluk+gcc at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63945

Witold Baryluk  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |witold.baryluk+gcc at gmail dot com

--- Comment #1 from Witold Baryluk  ---
It does vectorize for me on gcc 9.2.1:

-march=skylake-avx512

aa.cpp:34:29: optimized: loop vectorized using 32 byte vectors
aa.cpp:25:27: optimized: loop vectorized using 32 byte vectors


  if (val<100.)
1279:   c5 fb 10 0b            vmovsd (%rbx),%xmm1
127d:   c5 fb 10 05 8b 0d 00   vmovsd 0xd8b(%rip),%xmm0        # 2010 <_IO_stdin_used+0x10>
1284:   00
1285:   c5 f9 2f c1            vcomisd %xmm1,%xmm0
1289:   76 2b                  jbe    12b6 <_ZN4TEST4testEv+0xc6>
128b:   c4 e2 7d 19 c9         vbroadcastsd %xmm1,%ymm1
1290:   31 c0                  xor    %eax,%eax
1292:   66 0f 1f 44 00 00      nopw   0x0(%rax,%rax,1)
  c[i] = val*a[i]+b[i];
1298:   c4 c1 7d 10 04 04      vmovupd (%r12,%rax,1),%ymm0
129e:   c4 c2 f5 a8 44 05 00   vfmadd213pd 0x0(%r13,%rax,1),%ymm1,%ymm0
12a5:   c5 fd 11 04 07         vmovupd %ymm0,(%rdi,%rax,1)
for (unsigned int i=0; i
::operator delete(__p);
12b6:   c5 f8 77               vzeroupper


Similarly:

-march=knm

aa.cpp:34:29: optimized: loop vectorized using 64 byte vectors
aa.cpp:25:27: optimized: loop vectorized using 64 byte vectors

  if (val<100.)
15bc:   31 c0                  xor    %eax,%eax
15be:   66 90                  xchg   %ax,%ax
  c[i] = val*a[i]+b[i];
15c0:   62 f1 fd 48 28 04 01   vmovapd (%rcx,%rax,1),%zmm0
15c7:   62 f2 ed 48 a8 04 06   vfmadd213pd (%rsi,%rax,1),%zmm2,%zmm0
15ce:   62 d1 fd 48 11 04 01   vmovupd %zmm0,(%r9,%rax,1)
for (unsigned int i=0; i

(plus a lot of handling for unaligned stack).

-march=znver2

aa.cpp:34:29: optimized: loop vectorized using 32 byte vectors
aa.cpp:25:27: optimized: loop vectorized using 32 byte vectors

  if (val<100.)
1279:   c5 fb 10 0b            vmovsd (%rbx),%xmm1
127d:   c5 fb 10 05 8b 0d 00   vmovsd 0xd8b(%rip),%xmm0        # 2010 <_IO_stdin_used+0x10>
1284:   00
1285:   c5 f9 2f c1            vcomisd %xmm1,%xmm0
1289:   76 33                  jbe    12be <_ZN4TEST4testEv+0xce>
128b:   c4 e2 7d 19 c9         vbroadcastsd %xmm1,%ymm1
1290:   31 c0                  xor    %eax,%eax
1292:   66 66 2e 0f 1f 84 00   data16 nopw %cs:0x0(%rax,%rax,1)
1299:   00 00 00 00
129d:   0f 1f 00               nopl   (%rax)
  c[i] = val*a[i]+b[i];
12a0:   c4 c1 7d 10 04 04      vmovupd (%r12,%rax,1),%ymm0
12a6:   c4 c2 f5 a8 44 05 00   vfmadd213pd 0x0(%r13,%rax,1),%ymm1,%ymm0
12ad:   c5 fd 11 04 07         vmovupd %ymm0,(%rdi,%rax,1)
for (unsigned int i=0; i

-march=core2

aa.cpp:34:29: optimized: loop vectorized using 16 byte vectors
aa.cpp:25:27: optimized: loop vectorized using 16 byte vectors

  if (val<100.)
1276:   f2 0f 10 13            movsd  (%rbx),%xmm2
127a:   f2 0f 10 05 8e 0d 00   movsd  0xd8e(%rip),%xmm0        # 2010 <_IO_stdin_used+0x10>
1281:   00
1282:   66 0f 2f c2            comisd %xmm2,%xmm0
1286:   76 40                  jbe    12c8 <_ZN4TEST4testEv+0xd8>
1288:   31 c0                  xor    %eax,%eax
128a:   66 0f 14 d2            unpcklpd %xmm2,%xmm2
128e:   66 90                  xchg   %ax,%ax
  c[i] = val*a[i]+b[i];
1290:   f3 0f 7e 44 05 00      movq   0x0(%rbp,%rax,1),%xmm0
1296:   f3 41 0f 7e 0c 04      movq   (%r12,%rax,1),%xmm1
129c:   66 0f 16 44 05 08      movhpd 0x8(%rbp,%rax,1),%xmm0
12a2:   66 0f 59 c2            mulpd  %xmm2,%xmm0
12a6:   66 41 0f 16 4c 04 08   movhpd 0x8(%r12,%rax,1),%xmm1
12ad:   66 0f 58 c1            addpd  %xmm1,%xmm0
12b1:   66 0f 13 04 07         movlpd %xmm0,(%rdi,%rax,1)
12b6:   66 0f 17 44 07 08      movhpd %xmm0,0x8(%rdi,%rax,1)
for (unsigned int i=0; i



It all looks pretty optimally vectorized to me.

The code can be made even better if you ensure proper alignment of the
std::vector arrays, which may not be guaranteed at the moment.
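
As a rough illustration of what "ensuring alignment" could look like, here is a plain C analogue rather than a std::vector solution. The names `alloc_doubles_64` and `axpy_aligned` are made up; `aligned_alloc` is C11, and `__builtin_assume_aligned` is a GCC/Clang builtin. The 64-byte figure is an assumption matching the widest vectors shown above.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocates n doubles on a 64-byte boundary (one AVX-512 vector).
 * aligned_alloc wants the size to be a multiple of the alignment,
 * so the byte count is rounded up. Release the result with free(). */
double *alloc_doubles_64(size_t n) {
  size_t bytes = n * sizeof(double);
  size_t rounded = (bytes + 63u) / 64u * 64u;
  if (rounded == 0) {
    rounded = 64;
  }
  return aligned_alloc(64, rounded);
}

/* c[i] = val * a[i] + b[i], with the compiler told that all three
 * arrays start on a 64-byte boundary. Passing pointers that are not
 * aligned like that is undefined behaviour. */
void axpy_aligned(size_t n, double val, const double *a, const double *b,
                  double *c) {
  const double *pa = __builtin_assume_aligned(a, 64);
  const double *pb = __builtin_assume_aligned(b, 64);
  double *pc = __builtin_assume_aligned(c, 64);
  for (size_t i = 0; i < n; i++) {
    pc[i] = val * pa[i] + pb[i];
  }
}
```

In the C++ test case the same effect would need an aligned allocator for std::vector. Whether it changes the generated code for this loop is something to check with -fopt-info-vec, not something this sketch demonstrates.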