[Bug target/97304] Boostrap failure on freebsd: ld: error: unable to find library -lc

2024-04-11 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97304

--- Comment #15 from Thomas Koenig  ---
(In reply to Andrew Pinski from comment #14)
> (In reply to Jonathan Wakely from comment #10)
> > If --with-as=/usr/local/bin/as --with-ld=/usr/local/bin/ld is required then
> > it needs to be documented at
> > https://gcc.gnu.org/install/specific.html#x-x-freebsd
> 
> So what I think is happening is the ld (LLVM's lld) does not include
> /usr/lib by default in the library search path and gcc's driver does not
> pass -L/usr/lib -L/lib on to ld because it assumes all ld normally search
> there by default (which most unix ld did before lld and mold come around).
>

[...]

> I am suspect we might be able to remove this and it will work but there
> needs to be a lot of testing on many different targets and such.

A configure test, maybe?

[Bug fortran/111938] Missing OpenACC/Fortran handling in 'gcc/fortran/frontend-passes.c'

2024-01-07 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111938

Thomas Koenig  changed:

            What|Removed                     |Added
----------------------------------------------------------------------------
        Keywords|                            |missed-optimization
        Severity|normal                      |enhancement
  Ever confirmed|0                           |1
              CC|                            |tkoenig at gcc dot gnu.org
Last reconfirmed|                            |2024-01-07
          Status|UNCONFIRMED                 |NEW

--- Comment #2 from Thomas Koenig  ---
I know next to nothing about OpenACC, so I cannot really do this
(but I know frontend-passes.cc).

Could you maybe provide a patch, or a list of what should go where?

Confirmed.

[Bug rtl-optimization/110390] ICE on valid code on x86_64-linux-gnu with sel-scheduling: in av_set_could_be_blocked_by_bookkeeping_p, at sel-sched.cc:3609 since r13-3596-ge7310e24b1c0ca

2023-11-13 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110390

Thomas Koenig  changed:

            What|Removed                     |Added
----------------------------------------------------------------------------
             URL|                            |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110390#
         Summary|ICE on valid code on        |ICE on valid code on
                |x86_64-linux-gnu with       |x86_64-linux-gnu with
                |sel-scheduling: in          |sel-scheduling: in
                |av_set_could_be_blocked_by_ |av_set_could_be_blocked_by_
                |bookkeeping_p, at           |bookkeeping_p, at
                |sel-sched.cc:3609           |sel-sched.cc:3609 since
                |                            |r13-3596-ge7310e24b1c0ca
              CC|                            |amacleod at redhat dot com
        Keywords|needs-bisection             |

--- Comment #5 from Thomas Koenig  ---
Bisects to r13-3596-ge7310e24b1c0ca.

No idea if this just exposed a latent bug, or introduced it.

e7310e24b1c0ca67b1bb507c1330b2bf39e59e32 is the first bad commit
commit e7310e24b1c0ca67b1bb507c1330b2bf39e59e32
Author: Andrew MacLeod 
Date:   Tue Oct 25 16:42:41 2022 -0400

Make ranger vrp1 default.

Turn on ranger as the default vrp1 pass and adjust testcases.

gcc/
* params.opt (param_vrp1_mode): Make ranger default.

gcc/testsuite/
* gcc.dg/pr68217.c: Test [-INF, -INF][0, 0] instead of [-INF, 0].
* gcc.dg/tree-ssa/vrp-unreachable.c: New.  Test unreachable
removal.

 gcc/params.opt  |  2 +-
 gcc/testsuite/gcc.dg/pr68217.c  |  2 +-
 gcc/testsuite/gcc.dg/tree-ssa/vrp-unreachable.c | 42 +
 3 files changed, 44 insertions(+), 2 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/vrp-unreachable.c

[Bug fortran/106402] half preicision is not supported by gfortran(real*2).

2023-11-13 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106402

Thomas Koenig  changed:

            What|Removed                     |Added
----------------------------------------------------------------------------
Last reconfirmed|                            |2023-11-13
  Ever confirmed|0                           |1
          Status|UNCONFIRMED                 |NEW
              CC|                            |tkoenig at gcc dot gnu.org
        Severity|normal                      |enhancement

--- Comment #2 from Thomas Koenig  ---
It would make sense to have it, I guess.  If somebody has access
to the relevant hardware, it could also be tested :-)

[Bug libfortran/110966] should matmul_c8_avx512f be updated with matmul_c8_x86-64-v4.

2023-11-13 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110966

Thomas Koenig  changed:

            What|Removed                     |Added
----------------------------------------------------------------------------
  Ever confirmed|0                           |1
              CC|                            |tkoenig at gcc dot gnu.org
          Status|UNCONFIRMED                 |WAITING
Last reconfirmed|                            |2023-11-13

--- Comment #5 from Thomas Koenig  ---
(In reply to Hongtao.liu from comment #4)
> (In reply to anlauf from comment #3)
> > (In reply to Hongtao.liu from comment #2)
> > > (In reply to Richard Biener from comment #1)
> > > > I think matmul is fine with avx512f or avx, so requiring/using only the 
> > > > base
> > > > ISA level sounds fine to me.
> > > 
> > > Could be potential miss-optimization.
> > 
> > Do you mean a missed optimzation?
> > 
> > Or really wrong code?
> 
> a missed optimzation.

Are there benchmarks which show that the code would indeed run
faster?

[Bug rtl-optimization/97756] [11/12/13/14 Regression] Inefficient handling of 128-bit arguments

2023-11-13 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

--- Comment #15 from Thomas Koenig  ---
(In reply to CVS Commits from comment #14)

> Admittedly a single "mov" isn't much of a saving on modern architectures,
> but as demonstrated by the PR, people still track the number of them.

Thanks :-)

[Bug rtl-optimization/110390] ICE on valid code on x86_64-linux-gnu with sel-scheduling: in av_set_could_be_blocked_by_bookkeeping_p, at sel-sched.cc:3609

2023-11-12 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110390

--- Comment #3 from Thomas Koenig  ---
Fixed by r14-3414-g0cfc9c953d0221:

0cfc9c953d0221ec3971a25e6509ebe1041f142e is the first new commit
commit 0cfc9c953d0221ec3971a25e6509ebe1041f142e
Author: Andrew MacLeod 
Date:   Thu Aug 17 12:34:59 2023 -0400

Phi analyzer - Initialize with range instead of a tree.

Rangers PHI analyzer currently only allows a single initializer to a group.
This patch changes that to use an inialization range, which is
cumulative of all integer constants, plus a single symbolic value.
There is no other change to group functionality.

This patch also changes the way PHI groups are printed so they show up in
the listing as they are encountered, rather than as a list at the end.  It
was more difficult to see what was going on previously.

[Bug rtl-optimization/110390] ICE on valid code on x86_64-linux-gnu with sel-scheduling: in av_set_could_be_blocked_by_bookkeeping_p, at sel-sched.cc:3609

2023-11-12 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110390

Thomas Koenig  changed:

            What|Removed                     |Added
----------------------------------------------------------------------------
              CC|                            |tkoenig at gcc dot gnu.org

--- Comment #2 from Thomas Koenig  ---
Seems to be fixed on current trunk as of r14-5226-g0b94e9cc060906.

[Bug modula2/111956] Many powerpc platforms do _not_ have support for IEEE754 long double

2023-11-09 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111956

Thomas Koenig  changed:

            What|Removed                     |Added
----------------------------------------------------------------------------
              CC|                            |tkoenig at gcc dot gnu.org

--- Comment #11 from Thomas Koenig  ---
A remark: gfortran handles 128-bit reals on POWER as well, so it might be a
good idea to look into libgfortran's configure scripts.

[Bug rtl-optimization/97756] [11/12/13/14 Regression] Inefficient handling of 128-bit arguments

2023-11-07 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

--- Comment #13 from Thomas Koenig  ---
(In reply to Patrick Palka from comment #3)
> Perhaps related to this PR: On x86_64, the following basic wrapper around
> int128 addition
> 
>   __uint128_t f(__uint128_t x, __uint128_t y) { return x + y; }
> 
> gets compiled (/w -O3, -O2 or -Os) to the seemingly suboptimal
> 
>         movq    %rdi, %r9
>         movq    %rdx, %rax
>         movq    %rsi, %r8
>         movq    %rcx, %rdx
>         addq    %r9, %rax
>         adcq    %r8, %rdx
>         ret
> 
> Clang does:
> 
>         movq    %rdi, %rax
>         addq    %rdx, %rax
>         adcq    %rcx, %rsi
>         movq    %rsi, %rdx
>         retq

With current trunk, this is now

        movq    %rdx, %rax
        movq    %rcx, %rdx
        addq    %rdi, %rax
        adcq    %rsi, %rdx
        ret

so it looks OK.

The original test case regressed a bit; it is now 39 instructions.
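
For readers less familiar with the addq/adcq pair in the listings above, here
is a rough C sketch of what it computes: the low 64-bit halves are added
first, and the carry out of that addition is fed into the addition of the high
halves.  The type u128_parts and the function add128 are names made up for
this illustration; they do not come from the PR or from GCC.

```c
#include <stdint.h>

/* Hypothetical helper type: a 128-bit unsigned value as two 64-bit halves. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} u128_parts;

/* Add two 128-bit values the way an addq/adcq pair does. */
u128_parts add128(u128_parts x, u128_parts y)
{
    u128_parts r;
    uint64_t carry;

    r.lo = x.lo + y.lo;          /* addq: wraps modulo 2^64 */
    carry = r.lo < x.lo;         /* 1 if the low addition wrapped around */
    r.hi = x.hi + y.hi + carry;  /* adcq: add the high halves plus carry */
    return r;
}
```

Only the two additions are essential; the remaining movq instructions in the
listings just shuffle the halves into the registers the calling convention
expects.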

[Bug tree-optimization/105558] simple 8-bit integer calculation fails with -O3 / march=core-avx2 on some gfortran 8/9/10 versions

2023-11-06 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105558

--- Comment #8 from Thomas Koenig  ---
(In reply to Andrew Pinski from comment #6)
> Would be interesting to find what patch broke this and what patch fixed the
> -mtune=generic case.

It is not easy bisecting with old compilers - compilation issues keep
coming up on more modern systems, and sometimes newer compilers do
not compile older compilers...

[Bug tree-optimization/105834] [13/14 Regression] Dead Code Elimination Regression at -O2 (trunk vs. 12.1.0)

2023-11-05 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105834

Thomas Koenig  changed:

            What|Removed                     |Added
----------------------------------------------------------------------------
        See Also|                            |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109695
        Keywords|needs-bisection             |

--- Comment #6 from Thomas Koenig  ---
On trunk, this was fixed by r14-1163-gd8b058d3ca4ebb, one of the
patches which fixed PR 109695:

d8b058d3ca4ebbef5575105164417f125696f5ce is the first new commit
commit d8b058d3ca4ebbef5575105164417f125696f5ce
Author: Andrew MacLeod 
Date:   Tue May 23 15:11:44 2023 -0400

Choose better initial values for ranger.

Instead of defaulting to VARYING, fold the stmt using just global ranges.

PR tree-optimization/109695
* gimple-range-cache.cc (ranger_cache::get_global_range): Call
fold_range with global query to choose an initial value.

 gcc/gimple-range-cache.cc | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

Would this patch be something that could reasonably be backported?

[Bug tree-optimization/110903] [12/13 Regression] Dead Code Elimination Regression

2023-11-04 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110903

--- Comment #6 from Thomas Koenig  ---
The original regression was caused by r12-4526-gd8edfadfc7a979 .

d8edfadfc7a9795b65177a50ce44fd348858e844 is the first bad commit
commit d8edfadfc7a9795b65177a50ce44fd348858e844
Author: Aldy Hernandez 
Date:   Mon Oct 4 09:47:02 2021 +0200

Disallow loop rotation and loop header crossing in jump threaders.

There is a lot of fall-out from this patch, as there were many threading
tests that assumed the restrictions introduced by this patch were valid.
Some tests have merely shifted the threading to after loop
optimizations, but others ended up with no threading opportunities at
all.  Surprisingly some tests ended up with more total threads.  It was
a crapshoot all around.

On a postive note, there are 6 tests that no longer XFAIL, and one
guality test which now passes.

I felt a bit queasy about such a fundamental change wrt threading, so I
ran it through my callgrind test harness (.ii files from a bootstrap).
There was no change in overall compilation, DOM, or the VRP threaders.

However, there was a slight increase of 1.63% in the backward threader.
I'm pretty sure we could reduce this if we incorporated the restrictions
into their profitability code.  This way we could stop the search when
we ran into one of these restrictions.  Not sure it's worth it at this
point.

Tested on x86-64 Linux.

Co-authored-by: Richard Biener 

[Bug tree-optimization/110903] [12/13 Regression] Dead Code Elimination Regression

2023-11-04 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110903

Thomas Koenig  changed:

            What|Removed                     |Added
----------------------------------------------------------------------------
         Summary|[12/13/14 Regression] Dead  |[12/13 Regression] Dead
                |Code Elimination Regression |Code Elimination Regression

--- Comment #5 from Thomas Koenig  ---
(In reply to Andrew Pinski from comment #4)
> (In reply to Thomas Koenig from comment #3)
> > The code from comment#2 and from comment#3 no longer calls foo
> > with current trunk, r14-5108-g751fc7bcdcdf25 .
> > 
> > Now, to see which commit fixed it...
> 
> My bet is on r14-4089-gd45ddc2c04e471 .

Weird thing is that I do see this on POWER (I often use gcc120 for
compiling because it is the beefiest machine I can lay my hands on).

However, this was actually fixed by r14-4141-gbf6b107e2a3423:

bf6b107e2a342319b3787ec960fc8014ef3aff91 is the first new commit
commit bf6b107e2a342319b3787ec960fc8014ef3aff91
Author: Andrew MacLeod 
Date:   Wed Sep 13 11:52:15 2023 -0400

New early __builtin_unreachable processing.

in VRP passes before __builtin_unreachable MUST be removed, only remove it
if all exports affected by the unreachable can have global values updated,
and do not involve loads from memory.

PR tree-optimization/110080
PR tree-optimization/110249
gcc/
* tree-vrp.cc (remove_unreachable::final_p): New.
(remove_unreachable::maybe_register): Rename from
maybe_register_block and call early or final routine.
(fully_replaceable): New.
(remove_unreachable::handle_early): New.
(remove_unreachable::remove_and_update_globals): Remove
non-final processing.
(rvrp_folder::rvrp_folder): Add final flag to constructor.
(rvrp_folder::post_fold_bb): Remove unreachable registration.
(rvrp_folder::pre_fold_stmt): Move unreachable processing to here.
(execute_ranger_vrp): Adjust some call parameters.

gcc/testsuite/
* g++.dg/pr110249.C: New.
* gcc.dg/pr110080.c: New.
* gcc.dg/pr93917.c: Adjust.

[Bug tree-optimization/110903] [12/13/14 Regression] Dead Code Elimination Regression

2023-11-03 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110903

--- Comment #3 from Thomas Koenig  ---
The code from comment#2 and from comment#3 no longer calls foo
with current trunk, r14-5108-g751fc7bcdcdf25 .

Now, to see which commit fixed it...

[Bug tree-optimization/110116] [12/13 Regression] ICE on valid code at -O3 on x86_64-linux-gnu: verify_gimple failed

2023-11-02 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110116

Thomas Koenig  changed:

            What|Removed                     |Added
----------------------------------------------------------------------------
         Summary|[12/13/14 Regression] ICE   |[12/13 Regression] ICE on
                |on valid code at -O3 on     |valid code at -O3 on
                |x86_64-linux-gnu:           |x86_64-linux-gnu:
                |verify_gimple failed        |verify_gimple failed
        See Also|                            |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111614
   Known to work|                            |14.0

--- Comment #3 from Thomas Koenig  ---
r14-4303-g88d79b9b03eccf fixed it:

88d79b9b03eccf39921d13c2cbd1acc50aeda126 is the first fixed commit
commit 88d79b9b03eccf39921d13c2cbd1acc50aeda126
Author: Richard Biener 
Date:   Thu Sep 28 09:41:30 2023 +0200

tree-optimization/111614 - missing convert in
undistribute_bitref_for_vector

The following adjusts a flawed guard for converting the first vector
of the sum we create in undistribute_bitref_for_vector.

PR tree-optimization/111614
* tree-ssa-reassoc.cc (undistribute_bitref_for_vector): Properly
convert the first vector when required.

* gcc.dg/torture/pr111614.c: New testcase.

 gcc/testsuite/gcc.dg/torture/pr111614.c | 23 +++
 gcc/tree-ssa-reassoc.cc | 27 +++
 2 files changed, 38 insertions(+), 12 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/torture/pr111614.c

Maybe a candidate for backporting?  Unlike PR111614, this does not appear
to be latent.

[Bug tree-optimization/110116] [12/13/14 Regression] ICE on valid code at -O3 on x86_64-linux-gnu: verify_gimple failed

2023-11-01 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110116

--- Comment #2 from Thomas Koenig  ---
Looks like this has been fixed in the meantime:

tkoenig@gcc188:~> gcc -O3 small.c 
small.c: In function 'main':
small.c:6:21: warning: iteration 2147483646 invokes undefined behavior
[-Waggressive-loop-optimizations]
    6 |     for (b = 1; b; b++)
      |                    ~^~
small.c:6:17: note: within this loop
    6 |     for (b = 1; b; b++)
      |                ^
tkoenig@gcc188:~> gcc --version
gcc (GCC) 14.0.0 20231101 (experimental) [master r13-4915-g9b111debbfb]
Copyright (C) 2023 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

tkoenig@gcc188:~> cat small.c 
unsigned char a[5];
int b, d;
char c;
int main() {
  if (d) {
    for (b = 1; b; b++)
      c &= d = 1;
    for (; d < 5; d++)
      c &= a[d];
  }
  return 0;
}

It would still be interesting to know which revision fixed it.

[Bug middle-end/111921] [11/12/13/14 Regression] ICE with nested function after an error since r6-205-g5c4abbb8e80153

2023-11-01 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111921

Thomas Koenig  changed:

            What|Removed                     |Added
----------------------------------------------------------------------------
         Summary|[11/12/13/14 Regression]    |[11/12/13/14 Regression]
                |ICE with nested function    |ICE with nested function
                |after an error              |after an error since
                |                            |r6-205-g5c4abbb8e80153
        Keywords|needs-bisection             |
              CC|                            |mpolacek at gcc dot gnu.org

--- Comment #4 from Thomas Koenig  ---
Bisection finally found the relevant patch: r6-205-g5c4abbb8e80153

5c4abbb8e80153999b0298e4b2fe81d512f133c8 is the first bad commit
commit 5c4abbb8e80153999b0298e4b2fe81d512f133c8
Author: Marek Polacek 
Date:   Thu Apr 23 14:35:12 2015 +

re PR c/65345 (ICE with _Generic selection on _Atomic int)

PR c/65345
* c-decl.c (set_labels_context_r): New function.
(store_parm_decls): Call it via walk_tree_without_duplicates.
* c-typeck.c (convert_lvalue_to_rvalue): Use create_tmp_var_raw
instead of create_tmp_var.  Build TARGET_EXPR instead of
COMPOUND_EXPR.
(build_atomic_assign): Use create_tmp_var_raw instead of
create_tmp_var.  Build TARGET_EXPRs instead of MODIFY_EXPR.

* gcc.dg/pr65345-1.c: New test.
* gcc.dg/pr65345-2.c: New test.

From-SVN: r222370

Bisection actually needed a patch for bootstrap to succeed:

diff --git a/gcc/cp/cfns.gperf b/gcc/cp/cfns.gperf
index 68acd3d..5ecf86a 100644
--- a/gcc/cp/cfns.gperf
+++ b/gcc/cp/cfns.gperf
@@ -23,7 +23,7 @@ static unsigned int hash (const char *, unsigned int);
 #ifdef __GNUC__
 __inline
 #endif
-const char * libc_name_p (const char *, unsigned int);
+# const char * libc_name_p (const char *, unsigned int);
 %}
 %%
 # The standard C library functions, for feeding to gperf; the result is used
diff --git a/gcc/cp/cfns.h b/gcc/cp/cfns.h
index 1c6665d..ee38f6a 100644
--- a/gcc/cp/cfns.h
+++ b/gcc/cp/cfns.h
@@ -51,9 +51,6 @@ along with GCC; see the file COPYING3.  If not see
 __inline
 #endif
 static unsigned int hash (const char *, unsigned int);
-#ifdef __GNUC__
-__inline
-#endif
 const char * libc_name_p (const char *, unsigned int);
 /* maximum key range = 391, duplicates = 0 */

[Bug target/112112] Improper Arithmetic Type Conversion in s390x-linux-gnu-gcc

2023-11-01 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112112

Thomas Koenig  changed:

            What|Removed                     |Added
----------------------------------------------------------------------------
Last reconfirmed|                            |2023-11-01
  Ever confirmed|0                           |1
          Status|UNCONFIRMED                 |WAITING

--- Comment #7 from Thomas Koenig  ---
(In reply to 김대영 from comment #6)
> ```
> z3rodae0@z3rodae0:~$ ./sh.sh
> result for -O0 "signed" = 1
> result for -O1 "signed" = 1
> result for -O2 "signed" = 1
> result for -O3 "signed" = 1
> result for -O0 "unsigned" = 0
> result for -O1 "unsigned" = 0
> result for -O2 "unsigned" = 0
> result for -O3 "unsigned" = 0
> result for -O0 "" = 1
> result for -O1 "" = 1
> result for -O2 "" = 1
> result for -O3 "" = 1
> ```
> 
> That's correct. I ran your code and script in my environment, and it
> produced the same results

That is weird.

I don't see a meaningful difference between the version without signed or
unsigned and your program, and you get inconsistent results with your
original program and consistent results with the other one.

Or am I missing something?

[Bug middle-end/111921] [11/12/13/14 Regression] ICE with nested function after an error

2023-10-31 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111921

Thomas Koenig  changed:

            What|Removed                     |Added
----------------------------------------------------------------------------
              CC|                            |tkoenig at gcc dot gnu.org

--- Comment #2 from Thomas Koenig  ---
gcc 6 to 13 have "confused by earlier errors, bailing out".

The segfault starts occurring in gcc-14.

[Bug target/112276] [14 Regression] wrong code with -O2 -msse4.2 since r14-4964-g7eed861e8ca3f5

2023-10-29 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112276

Thomas Koenig  changed:

            What|Removed                     |Added
----------------------------------------------------------------------------
              CC|                            |liuhongt at gcc dot gnu.org
         Summary|[14 Regression] wrong code  |[14 Regression] wrong code
                |with -O2 -msse4.2           |with -O2 -msse4.2 since
                |                            |r14-4964-g7eed861e8ca3f5

--- Comment #2 from Thomas Koenig  ---
Bisected to r14-4964-g7eed861e8ca3f5 .

[Bug target/112112] Improper Arithmetic Type Conversion in s390x-linux-gnu-gcc

2023-10-29 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112112

Thomas Koenig  changed:

            What|Removed                     |Added
----------------------------------------------------------------------------
              CC|                            |tkoenig at gcc dot gnu.org

--- Comment #5 from Thomas Koenig  ---
(In reply to 김대영 from comment #4)
> From your perspective, do you think this could be a compiler bug? When
> tested with various compiler options following the GCC bug reporting
> guidelines, the binary compiles without any warnings, yet exhibits these
> behaviors

It definitely sounds wrong; there should be consistent results.

Just to make the effect of the signs clear: Could you maybe run the
program

$ cat a.c
#include <stdio.h>
SIGN char v1 = -1;
short v2 = 1;
int main()
{
    printf("result for " OPT " \"" STR "\" = %d\n", v1 <= v2);
    return 0;
}

with the shell script

$ cat do_all.sh 
for s in signed unsigned ""
do
    for o in -O0 -O1 -O2 -O3
    do
        gcc $o -DOPT='"'$o'"' -DSTR='"'$s'"' -DSIGN=$s a.c && ./a.out
    done
done

and post the results?  For reference, on x86_64 (which has signed
chars) this gets

result for -O0 "signed" = 1
result for -O1 "signed" = 1
result for -O2 "signed" = 1
result for -O3 "signed" = 1
result for -O0 "unsigned" = 0
result for -O1 "unsigned" = 0
result for -O2 "unsigned" = 0
result for -O3 "unsigned" = 0
result for -O0 "" = 1
result for -O1 "" = 1
result for -O2 "" = 1
result for -O3 "" = 1
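
The three groups in that output can also be cross-checked from within C
itself.  The sketch below is only an illustration (the function names are
made up): it performs the same comparison for all three flavours of char and
derives the expected answer for plain char from CHAR_MIN, so the check does
not depend on whether the target's plain char is signed.

```c
#include <limits.h>

/* Same comparison as in a.c, once per flavour of char.  Both operands
   are promoted to int before the comparison is done. */
int le_signed(void)
{
    signed char v1 = -1;
    short v2 = 1;
    return v1 <= v2;
}

int le_unsigned(void)
{
    unsigned char v1 = -1;   /* wraps to UCHAR_MAX */
    short v2 = 1;
    return v1 <= v2;
}

int le_plain(void)
{
    char v1 = -1;
    short v2 = 1;
    return v1 <= v2;
}

/* Nonzero if plain char is a signed type on this target. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```

On any target, le_plain() should agree with plain_char_is_signed(), whatever
the optimization level; a mismatch would point at the compiler rather than
at the test program.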

[Bug tree-optimization/112113] [14 Regression] wrong code at -O3 on x86_64-linux-gnu

2023-10-28 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112113

--- Comment #3 from Thomas Koenig  ---
(In reply to Thomas Koenig from comment #2)
> According to bisection, f5fb9ff2396fd41fdd2e6d35a412e936d2d42f75
> is the first bad commit.

Or, if anybody wants a link,
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=f5fb9ff2396fd41fdd2e6d35a412e936d2d42f75
.

[Bug tree-optimization/112113] [14 Regression] wrong code at -O3 on x86_64-linux-gnu

2023-10-28 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=112113

Thomas Koenig  changed:

            What|Removed                     |Added
----------------------------------------------------------------------------
              CC|                            |hubicka at gcc dot gnu.org

--- Comment #2 from Thomas Koenig  ---
According to bisection, f5fb9ff2396fd41fdd2e6d35a412e936d2d42f75
is the first bad commit.

[Bug tree-optimization/111917] [11/12/13/14 Regression] ICE in as_a, at is-a.h:255 since GCC-7

2023-10-23 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111917

--- Comment #5 from Thomas Koenig  ---

> It does not ICE with aa90195, for which the original test case ICEs,
> so it is something else (although probably related).

... or at least it does not ICE with checking disabled (to be exact).

[Bug tree-optimization/111917] [11/12/13/14 Regression] ICE in as_a, at is-a.h:255 since GCC-7

2023-10-23 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111917

--- Comment #4 from Thomas Koenig  ---
(In reply to Andrew Pinski from comment #3)
> If someone is worried about uninitialized variables or an executed infinite
> loop, this also ICEs at -O3:
> ```
> long t;
> long a() {
>   long b = t, c = t;
>   for (; b < 31; b++)
>     c <<= 1;
>   return c;
> }
> long long t1;
> static 
> int d() {
>   if (!t1)
>     return 0;
> e:
> f:
>   for (; a();)
>     ;
>   goto f;
>   return 0;
> }
> int main() { d(); }
> ```

It does not ICE with aa90195, for which the original test case ICEs,
so it is something else (although probably related).

[Bug fortran/30409] [fortran] missed optimization with pure function arguments

2023-10-22 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=30409

Thomas Koenig  changed:

            What|Removed                     |Added
----------------------------------------------------------------------------
      Depends on|                            |21046

--- Comment #9 from Thomas Koenig  ---
(In reply to anlauf from comment #8)

> I wonder how much (or little) really needs to be done here, or if the task
> can be split in a suitable way between FE and ME.
> 
> The tree-dump shows a __builtin_malloc/__builtin_free for the temporary
> *within* the i-loop.  Would it be possible to move this *management* just
> one loop level up, if the size of the temporary is known to be constant?
> (Which is the case here).  I mean attach it to the outer scope?
> Maybe the middle end then better "sees" what can reasonably be done?

A lot of it can probably be done in the middle end.

For memory allocation, this would be PR21046 (first variant), which
would be highly useful already.
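
To make the idea concrete, here is a hand-written C sketch of the kind of
transformation PR21046 is about, assuming the size of the temporary does not
change inside the loop.  It is not what gfortran emits; the function names and
the work() stand-in are invented for the illustration, and m is assumed to be
positive.

```c
#include <stdlib.h>

/* Stand-in for the real loop body: fill the temporary and reduce it. */
static double work(double *tmp, const double *x, int m, int i)
{
    double s = 0.0;
    for (int j = 0; j < m; j++) {
        tmp[j] = x[j] * i;
        s += tmp[j];
    }
    return s;
}

/* Before: the temporary is allocated and freed in every iteration. */
double sum_alloc_inside(const double *x, int m, int n)
{
    double total = 0.0;
    for (int i = 0; i < n; i++) {
        double *tmp = malloc(m * sizeof *tmp);
        if (tmp == NULL)
            return -1.0;   /* simplistic error handling for the sketch */
        total += work(tmp, x, m, i);
        free(tmp);
    }
    return total;
}

/* After: the size is loop-invariant, so one allocation in the
   enclosing scope is enough. */
double sum_alloc_hoisted(const double *x, int m, int n)
{
    double total = 0.0;
    double *tmp = malloc(m * sizeof *tmp);
    if (tmp == NULL)
        return -1.0;
    for (int i = 0; i < n; i++)
        total += work(tmp, x, m, i);
    free(tmp);
    return total;
}
```

Both variants compute the same value; the second one simply pays for
malloc/free once instead of n times.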


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=21046
[Bug 21046] move memory allocation out of a loop

[Bug tree-optimization/111916] [14 Regression] wrong code at -O1 and above on x86_64-linux-gnu (the generated code hangs)

2023-10-22 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111916

Thomas Koenig  changed:

   What|Removed |Added

Summary|wrong code at -O1 and above |[14 Regression] wrong code
   |on x86_64-linux-gnu (the|at -O1 and above on
   |generated code hangs)   |x86_64-linux-gnu (the
   ||generated code hangs)
 CC||tkoenig at gcc dot gnu.org
   Keywords||wrong-code
   Target Milestone|--- |14.0

--- Comment #1 from Thomas Koenig  ---
Also occurs on POWER, so likely target-independent.  Does not happen
for 13.2.

[Bug tree-optimization/111917] [11/12/13/14 Regression] ICE in as_a, at is-a.h:255 since GCC-8

2023-10-22 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111917

Thomas Koenig  changed:

            What|Removed                     |Added
----------------------------------------------------------------------------
        Keywords|                            |ice-on-valid-code
         Summary|ICE in as_a, at is-a.h:255  |[11/12/13/14 Regression]
                |since GCC-8                 |ICE in as_a, at is-a.h:255
                |                            |since GCC-8
Target Milestone|---                         |14.0
              CC|                            |tkoenig at gcc dot gnu.org

--- Comment #1 from Thomas Koenig  ---
Works for 4.8.5, must be a not-so-recent regression.

Note that with gcc (GCC) 11.3.1 20221121 (Red Hat 11.3.1-4) on POWER,
the error is different:

x.c: In function ‘main’:
x.c:15:5: internal compiler error: in mark_stmt_if_obviously_necessary, at
tree-ssa-dce.c:295
   15 | int main() { d(); }
      |     ^~~~
Please submit a full bug report,

[Bug tree-optimization/111652] [14 Regression] wrong code at -O3 on x86_64-linux-gnu

2023-10-02 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111652

Thomas Koenig  changed:

            What|Removed                     |Added
----------------------------------------------------------------------------
              CC|                            |carll at gcc dot gnu.org

--- Comment #2 from Thomas Koenig  ---
I ran git bisect on POWER (gcc120) and strangely got this as the
first bad commit:

b51795c832cf6e724d61919eb18a383223b76694 is the first bad commit
commit b51795c832cf6e724d61919eb18a383223b76694
Author: Carl Love 
Date:   Wed Jul 26 11:31:53 2023 -0400

rs6000, fix vec_replace_unaligned built-in arguments

The first argument of the vec_replace_unaligned built-in should always be
of type vector unsigned char, as specified in gcc/doc/extend.texi.

This patch fixes the builtin definitions and updates the test cases to use
the correct arguments.  The original test file is renamed and a second test
file is added for a new test case.

This is weird because the problem also occurs on x86_64.

[Bug fortran/90608] Inline non-scalar minloc/maxloc calls

2023-09-28 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90608

Thomas Koenig  changed:

            What|Removed                     |Added
----------------------------------------------------------------------------
              CC|                            |mikael at gcc dot gnu.org

--- Comment #7 from Thomas Koenig  ---
(In reply to Tamar Christina from comment #6)
> This is the ticket I meant toon.
> 
> Do you or Thomas have any ideas how we can inline this?

Two options, in principle.

One option is to extend use of the scalarizer for these cases. Mikael knows
this best, so I am putting him in CC.

The other is to use front-end optimization and basically generate DO
loops, like we do for matmul.  A lot of the infrastructure is in place
in frontend-passes.cc, but there would be some restrictions because
it is not possible to put DO loops everywhere.

[Bug rtl-optimization/111373] New: Register moves right before stores and return

2023-09-11 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111373

Bug ID: 111373
   Summary: Register moves right before stores and return
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

The code

#define SWAP(i,j) do { \
    if (v[i] > v[j]) { \
      tmp_v = v[i]; v[i] = v[j]; v[j] = tmp_v; \
      tmp_p = a[i]; a[i] = a[j]; a[j] = tmp_p; \
    } \
  } while(0)

void s3 (long int *p[3])
{
  long int v[3];
  long int *a[3];
  long int tmp_v;
  long int *tmp_p;
  a[0] = p[0];
  v[0] = *p[0];
  a[1] = p[1];
  v[1] = *p[1];
  a[2] = p[2];
  v[2] = *p[2];
  SWAP (0,1);
  SWAP (0,2);
  SWAP (1,2);
  p[0] = a[0];
  p[1] = a[1];
  p[2] = a[2];
}

yields, with reasonably recent trunk with -O3, code where there are
register moves right before the results are stored, for example on x86_64:

s3:
.LFB0:
        .cfi_startproc
        movq    (%rdi), %rax
        movq    8(%rdi), %rcx
        movq    16(%rdi), %rdx
        movq    (%rax), %r8
        movq    (%rcx), %rsi
        movq    (%rdx), %r9
        cmpq    %rsi, %r8
        jg      .L2
        cmpq    %r9, %r8
        jle     .L3
        movq    %rax, %r9
        movq    %rdx, %rax
        movq    %r9, %rdx
        movq    %r8, %r9
.L3:
        cmpq    %rsi, %r9
        jl      .L10
.L4:
        movq    %rax, (%rdi)
        movq    %rcx, 8(%rdi)
        movq    %rdx, 16(%rdi)
        ret
        .p2align 4,,10
        .p2align 3
.L2:
        cmpq    %r9, %rsi
        jle     .L11
        movq    %rdx, %rsi
        movq    %rax, %rdx
        movq    %rcx, 8(%rdi)
        movq    %rsi, %rax
        movq    %rdx, 16(%rdi)
        movq    %rax, (%rdi)
        ret
        .p2align 4,,10
        .p2align 3
.L11:
        movq    %r8, %rsi
        movq    %rax, %r8
        movq    %rcx, %rax
        movq    %r8, %rcx
        cmpq    %rsi, %r9
        jge     .L4
.L10:
        movq    %rcx, %rsi
        movq    %rdx, %rcx
        movq    %rax, (%rdi)
        movq    %rsi, %rdx
        movq    %rcx, 8(%rdi)
        movq    %rdx, 16(%rdi)
        ret

This seems to be a general phenomenon, see https://godbolt.org/z/xW9x75qbf for
RISC-V (POWER is similar).

[Bug target/106271] Bootstrap on RISC-V on Ubuntu 22.04 LTS: bits/libc-header-start.h: No such file or directory

2023-08-30 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106271

--- Comment #7 from Thomas Koenig  ---
(In reply to Thomas Schwinge from comment #6)
> I noticed recent commit r14-3387-g47f95bc4be4eb14730ab3eaaaf8f6e71fda47690
> "RISC-V: Add multiarch support on riscv-linux-gnu" -- but can't tell
> off-hand whether that fixed all the issues here?

As soon as gcc92 is back up, we can test...

https://lists.tetaneutral.net/pipermail/cfarm-users/2023-August/000975.html

[Bug tree-optimization/111221] New: Floating point handling a*1.0 vs. a+0.0

2023-08-28 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111221

Bug ID: 111221
   Summary: Floating point handling a*1.0 vs. a+0.0
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

I just noticed that gcc will optimize away the multiplication of a
floating-point number by 1.0, but will not do so for an addition of 0.0.
Example, with -O3,

double add0 (double a)
{
  return a + 0.0;
}

double mul1 (double a)
{
  return a * 1.0;
}

yields

add0:
.LFB0:
        .cfi_startproc
        pxor    %xmm1, %xmm1
        addsd   %xmm1, %xmm0
        ret

vs.

mul1:
.LFB1:
        .cfi_startproc
        ret

which seems inconsistent.  If this is the result of a deliberate design
decision, feel free to close as WONTFIX.
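
A possible explanation for the difference, sketched here under the assumption
of default IEEE 754 semantics (round-to-nearest, no -ffast-math or
-fno-signed-zeros): adding +0.0 to -0.0 gives +0.0, so the addition changes
the sign of a negative zero, while multiplying by 1.0 preserves it.  The
helper is_negative_zero is a name made up for this illustration.

```c
#include <math.h>

/* Same functions as in the report. */
double add0(double a) { return a + 0.0; }
double mul1(double a) { return a * 1.0; }

/* Hypothetical helper: returns 1 if x is a zero with the sign bit set. */
int is_negative_zero(double x)
{
    return x == 0.0 && signbit(x) != 0;
}

/* Returns 1 if add0 loses the sign of -0.0 while mul1 keeps it. */
int signed_zero_differs(void)
{
    volatile double neg_zero = -0.0;  /* volatile: force a run-time value */
    return !is_negative_zero(add0(neg_zero))
        && is_negative_zero(mul1(neg_zero));
}
```

If that is the reason, the addition should disappear once -fno-signed-zeros
is given.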

[Bug target/111096] Frame pointer is not used even when -fomit-frame-pointer is specified

2023-08-25 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111096

--- Comment #9 from Thomas Koenig  ---
(In reply to Richard Earnshaw from comment #8)
> (In reply to Thomas Koenig from comment #7)
> > Would it make sense to document this somewhere?  Or did I just miss it? :-)
> 
> Possibly, but I've no idea where.  It's too target-specific to put under the
> generic documentation for -fomit-frame-pointer and I don't think there's a
> section in the manual that really documents the target-specific behaviours
> of generic options.

Hm, maybe a chapter "Architecture-specific implementation choices"
to document those cases where the ABI gives some leeway could be a
place to put it.  It could have one section per architecture.

[Bug target/111096] Frame pointer is not used even when -fomit-frame-pointer is specified

2023-08-23 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111096

--- Comment #7 from Thomas Koenig  ---
(In reply to Richard Earnshaw from comment #5)
> This was a deliberate design choice.  Although the frame chain is not set up
> by code that omits the frame pointer, the chain of frames that are set up by
> other functions is still valid this way.  This ensures that any code that
> does try to walk the frame chain will not crash.  If we reused the frame
> pointer for other purposes, then any code trying to walk the frame chain (eg
> backtrace()) would encounter an invalid record and likely crash.


Would it make sense to document this somewhere?  Or did I just miss it? :-)

[Bug target/111096] Frame pointer is not used even when -fomit-frame-pointer is specified

2023-08-22 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111096

--- Comment #3 from Thomas Koenig  ---
(In reply to Andrew Pinski from comment #2)
> See https://gcc.gnu.org/pipermail/gcc-patches/2016-September/456662.html
> 
> I think this is by design of the ABI ...

The workaround mentioned in the thread does not appear to work,
 -O3 -fomit-frame-pointer -fcall-used-x29 yields
cc1: error: cannot use 'x29' as a call-used register

[Bug rtl-optimization/111096] New: Frame pointer is not used even when -fomit-frame-pointer is specified

2023-08-21 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111096

Bug ID: 111096
   Summary: Frame pointer is not used even when
-fomit-frame-pointer is specified
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

The code (by Kent Dickey, posted to comp.arch)

typedef unsigned int u32;
typedef unsigned long long u64;

u64 do_op(u64 out0, u64 in0, u64 in1, u32 opcode, int size);

void
calc_loop(u64 *optr, u64 *iptr0, u64 *iptr1, u32 opcode, int size, int len)
{
u64 o0, i0, i1, val, result;
int num, shift, pos;
int i, j;

// size is 0,1,2,3 representing 8,16,32,64 bytes
num = 8 >> size;// 8,4,2,1
shift = 8 << size;  // 8,16,32,64
for(i = 0; i < len; i++) {
o0 = optr[i];
i0 = iptr0[i];
i1 = iptr1[i];
result = 0;
pos = 0;
for(j = 0; j < num; j++) {
val = do_op(o0, i0, i1, opcode, size);
result = result | (val << pos);
pos += shift;
if(shift < 64) {
o0 = o0 >> shift;
i0 = i0 >> shift;
i1 = i1 >> shift;
}
}
optr[i] = result;
}
}

compiled for aarch64 on godbolt with recent trunk and -O3 -fomit-frame-pointer
(see https://godbolt.org/z/5bKPeGWrK ) does not set up the frame pointer,
but it also does not use it for avoiding spill/restore pairs:

calc_loop:
stp x19, x20, [sp, -144]!
mov w6, 8
asr w19, w6, w4
stp x27, x28, [sp, 64]
lsl w27, w6, w4
str x30, [sp, 80]
stp x0, x1, [sp, 112]
str x2, [sp, 128]
cmp w5, 0
ble .L1
sbfiz   x0, x5, 3, 32
stp x21, x22, [sp, 16]
mov w20, w4
stp x23, x24, [sp, 32]
mov w21, w3
stp x25, x26, [sp, 48]
str x0, [sp, 136]
cmp w27, 63
ble .L3
mov x25, 0
.L6:
ldr x0, [sp, 112]
ldr x23, [x0, x25]
ldr x0, [sp, 120]
ldr x0, [x0, x25]
str x0, [sp, 104]
ldr x0, [sp, 128]
ldr x24, [x0, x25]
cbz w19, .L10
mov w22, 0
mov w28, 0
mov x26, 0
.L5:
ldr x1, [sp, 104]
mov w4, w20
mov w3, w21
mov x2, x24
mov x0, x23
add w22, w22, 1
bl  do_op
lsl x0, x0, x28
add w28, w28, w27
orr x26, x26, x0
cmp w19, w22
bne .L5
ldr x0, [sp, 112]
str x26, [x0, x25]
add x25, x25, 8
ldr x0, [sp, 136]
cmp x0, x25
bne .L6
.L17:
ldp x21, x22, [sp, 16]
ldp x23, x24, [sp, 32]
ldp x25, x26, [sp, 48]
.L1:
ldp x27, x28, [sp, 64]
ldr x30, [sp, 80]
ldp x19, x20, [sp], 144
ret
.L3:
str xzr, [sp, 104]
ldp x0, x1, [sp, 104]
ldr x24, [x1, x0]
ldr x1, [sp, 120]
ldr x25, [x1, x0]
ldr x1, [sp, 128]
ldr x22, [x1, x0]
cbz w19, .L11
.L20:
mov w26, 0
mov w28, 0
mov x23, 0
.L8:
mov x2, x22
mov x1, x25
mov x0, x24
mov w4, w20
mov w3, w21
add w26, w26, 1
bl  do_op
lsr x24, x24, x27
lsl x0, x0, x28
add w28, w28, w27
orr x23, x23, x0
lsr x25, x25, x27
lsr x22, x22, x27
cmp w19, w26
bne .L8
ldp x0, x1, [sp, 104]
str x23, [x1, x0]
add x0, x0, 8
ldr x1, [sp, 136]
str x0, [sp, 104]
cmp x1, x0
beq .L17
.L19:
ldp x0, x1, [sp, 104]
ldr x24, [x1, x0]
ldr x1, [sp, 120]
ldr x25, [x1, x0]
ldr x1, [sp, 128]
ldr x22, [x1, x0]
cbnz    w19, .L20
.L11:
ldp x0, x1, [sp, 104]
mov x23, 0
str x23, [x1, x0]
add x0, x0, 8
ldr x1, [sp, 136]
str x0, [sp, 104]
cmp x1, x0
bne .L19
b   .L17
.L10:
ldr x0, [sp, 112]
mov x26, 0
str x26, [x0, x25]
add x25, x25

[Bug fortran/110888] Missing optimization for trivial MATMUL cases, requires -fno-signed-zeros

2023-08-04 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110888

Thomas Koenig  changed:

   What|Removed |Added

  Component|middle-end  |fortran

--- Comment #4 from Thomas Koenig  ---
Hm, on second thoughts, signed zeros are an issue, resetting to Fortran.

Generally, we are in an intrinsic, so we can do whatever we please
(we certainly do in the library case, and this is expected behavior).

Having -ffast-math applied locally to the BLOCK that the matmul
is executed in would be a possibility.

[Bug middle-end/110888] Missing optimization for trivial MATMUL cases, requires -fno-signed-zeros

2023-08-04 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110888

Thomas Koenig  changed:

   What|Removed |Added

  Component|fortran |middle-end

--- Comment #3 from Thomas Koenig  ---
Interesting problem.

For

  _19 = (*x_13(D))[0];
  _20 = (*y_14(D))[0];
  _21 = _19 * _20;
  _22 = _21 + 0.0;

the multiplication cannot produce a signalling NaN, so the addition
of zero should always be a no-op. For this, a simpler test case would
be

double add(double a, double b)
{
  return a*b + 0.0;
}

which gets me, on x86_64, 

mulsd   %xmm1, %xmm0
pxor    %xmm1, %xmm1
addsd   %xmm1, %xmm0
ret

According to godbolt, icc produces

add:
mulsd %xmm1, %xmm0  #3.12
ret   

which should be fine.

So, an issue for tree optimization?

[Bug libgomp/110842] [14 Regression] Openmp loops with KIND=16 DO loops

2023-07-28 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110842

--- Comment #3 from Thomas Koenig  ---
(In reply to Jakub Jelinek from comment #2)
> Why a regression?

It worked before (if only by accident), hence I put "Regression" there.

> libgomp has no support for loop iterators larger than 64-bit unsigned, and I
> believe in OpenMP it is implementation defined which iterator type is used.
> C/C++ OpenMP loops with __int128 or unsigned __int128 iterator will not work
> either (nor with _BitInt(575) or similar).

If it is illegal, then the best way to handle this would probably be an
error message instead of silent wrong code.

[Bug libgomp/110842] [14 Regression] Openmp loops with KIND=16 DO loops

2023-07-28 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110842

Thomas Koenig  changed:

   What|Removed |Added

   Target Milestone|--- |14.0
   Keywords||needs-bisection, wrong-code

--- Comment #1 from Thomas Koenig  ---
I just tested trunk, this might have happened earlier.

[Bug libgomp/110842] New: [14 Regression] Openmp loops with KIND=16 DO loops

2023-07-28 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110842

Bug ID: 110842
   Summary: [14 Regression] Openmp loops with KIND=16 DO loops
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libgomp
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tkoenig at gcc dot gnu.org
CC: jakub at gcc dot gnu.org
  Target Milestone: ---

gfortran with a reasonably current trunk gives wrong
results for omp parallel:

$ cat dynamic.f90 
program main
  implicit none
  integer(kind=16) :: i, anfang, ende, delta
  anfang = 0
  ende = 2**10
  delta = 2**6
  !$omp parallel do default(private) schedule(dynamic,1)
  do i=anfang, ende, delta
 !$omp critical
 print *,i
 !$omp end critical
  end do
end program main
$ gfortran dynamic.f90 
$ ./a.out
0
   64
  128
  192
  256
  320
  384
  448
  512
  576
  640
  704
  768
  832
  896
  960
 1024

Without openmp, no problem.

With openmp, some values are garbage:

$ gfortran -fopenmp dynamic.f90 
$ ./a.out
   2584020860371700504877596135129104
  768
  192
  704
  960
  128
  896
   64
   2584020860371700504877596135130128
  384
  448
  576
  640
  320
  512
  256
  832
$ gfortran -v
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=/home/ig25/libexec/gcc/x86_64-pc-linux-gnu/14.0.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: ../trunk/configure --prefix=/home/ig25
--enable-languages=c,c++,fortran --disable-multilib
Thread model: posix
Supported LTO compression algorithms: zlib
gcc version 14.0.0 20230722 (experimental) [master r14-2725-g73cc6ce1294] (GCC) 

System compiler works fine:

$ /usr/bin/gfortran -fopenmp dynamic.f90 
$ ./a.out
0
   64
  832
  256
  320
  960
  576
  192
  896
  640
  128
  384
  704
  768
  512
  448
 1024
$ /usr/bin/gfortran -v
Using built-in specs.
COLLECT_GCC=/usr/bin/gfortran
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-linux-gnu/11/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu
11.3.0-1ubuntu1~22.04.1' --with-bugurl=file:///usr/share/doc/gcc-11/README.Bugs
--enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,m2 --prefix=/usr
--with-gcc-major-version-only --program-suffix=-11
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu
--enable-libstdcxx-debug --enable-libstdcxx-time=yes
--with-default-libstdcxx-abi=new --enable-gnu-unique-object
--disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib
--enable-libphobos-checking=release --with-target-system-zlib=auto
--enable-objc-gc=auto --enable-multiarch --disable-werror --enable-cet
--with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32
--enable-multilib --with-tune=generic
--enable-offload-targets=nvptx-none=/build/gcc-11-aYxV0E/gcc-11-11.3.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-11-aYxV0E/gcc-11-11.3.0/debian/tmp

[Bug middle-end/68360] GCC bitfield processing code is very inefficient

2023-07-16 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68360

Thomas Koenig  changed:

   What|Removed |Added

   Last reconfirmed|2015-11-16 00:00:00 |2023-7-16
 CC||tkoenig at gcc dot gnu.org

--- Comment #7 from Thomas Koenig  ---
Just stumbled across this.

A maybe simpler testcase:

typedef struct
{
  unsigned long x: 42;
  unsigned b: 1;
  unsigned long y: 42;
} myfield;

typedef struct
{
   unsigned long x: 7;  
   unsigned b: 1;
   unsigned long y: 42;
} yourfield;

void foo(myfield *x)
{
  x->b = 1;
}

void bar (yourfield *x)
{
x->b = 1;
}

gets, on RISC-V,

foo:
ld  a5,0(a0)
li  a4,1
slli    a4,a4,42
or  a5,a5,a4
sd  a5,0(a0)
ret
bar:
ld  a5,0(a0)
ori a5,a5,128
sd  a5,0(a0)
ret

Using an indexed load byte/store byte would be an advantage for foo, at least.
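
To see where the single bit ends up, here is a small sketch that sets only b
and looks at the first 64-bit word of each struct.  Bit-field layout is
implementation-defined; the values in the comments assume a little-endian
LP64 target with GCC's usual layout (as on x86_64 and RV64), and the function
names are made up for the illustration.

```c
#include <stdint.h>
#include <string.h>

typedef struct
{
  unsigned long x: 42;
  unsigned b: 1;
  unsigned long y: 42;
} myfield;

typedef struct
{
  unsigned long x: 7;
  unsigned b: 1;
  unsigned long y: 42;
} yourfield;

/* Set only b in a zeroed myfield and return its first 64-bit word.
   Assumed result on little-endian LP64: bit 42 set.  */
uint64_t first_word_myfield(void)
{
  myfield f;
  uint64_t w;
  memset(&f, 0, sizeof f);
  f.b = 1;
  memcpy(&w, &f, sizeof w);
  return w;
}

/* Same for yourfield.  Assumed result: bit 7 set, i.e. 128.  */
uint64_t first_word_yourfield(void)
{
  yourfield f;
  uint64_t w;
  memset(&f, 0, sizeof f);
  f.b = 1;
  memcpy(&w, &f, sizeof w);
  return w;
}
```

Under those assumptions the bit for foo sits in byte 5 of the struct, so a
byte-sized read-modify-write at offset 5 would not need the 1 << 42 constant.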

[Bug rtl-optimization/97756] [11/12/13/14 Regression] Inefficient handling of 128-bit arguments

2023-07-16 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97756

--- Comment #12 from Thomas Koenig  ---
(In reply to Andrew Pinski from comment #11)
> This seems to be improved on trunk ...

gcc is down to 37 instructions now for the original test case with -O3.
icc, which appears to be best, has 33, see https://godbolt.org/z/461jeozs9 .

[Bug rtl-optimization/110479] Unnecessary register move

2023-06-29 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110479

Thomas Koenig  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |INVALID

--- Comment #2 from Thomas Koenig  ---
(In reply to Uroš Bizjak from comment #1)
> (In reply to Thomas Koenig from comment #0)
> 
> > movl    %edi, %ecx
> 
> This one? It is needed because SAL wants its count argument in %cl and first
> argument is passed in %edi (mandated by x86_64 ABI).
> 
> With -mbmi2, one gets:
> 
> shrl    $10, %edi
> movl    $1, %eax
> andl    $3, %edi
> addl    $3, %edi
> shlx    %edi, %eax, %eax
> ret

Hm, you're right.  The intricacies of x86...

Closing.

[Bug tree-optimization/110481] New: Possible improvements in dense switch statement returning values

2023-06-29 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110481

Bug ID: 110481
   Summary: Possible improvements in dense switch statement
returning values
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

Putting this provisionally into tree-optimization, although there may
be other aspects.

Consider

unsigned int foo(unsigned int a)
{
  switch ((a >> 10) & 3)
{
case 0:
  return 8;
case 1:
  return 16;
case 2:
  return 32;
case 3:
  return 64;
}
}

unsigned int bar(unsigned int a)
{
  return 1u << (((a >> 10) & 3) + 3);
}

unsigned int baz (unsigned int a)
{
  switch (a & (3 << 10))
{
case 0:
  return 8;
case 1 << 10:
  return 16;
case 2 << 10:
  return 32;
case 3 << 10:
  return 64;
}
}

which all do the same thing.
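
The equivalence can be checked with a brute-force sketch like the following.
It repeats the three functions, adding a __builtin_unreachable() default so
that control cannot fall off the end, and compares them for every value of
the low 16 bits, which covers bits 10 and 11.  The name all_equivalent is
made up for the illustration.

```c
unsigned int foo(unsigned int a)
{
  switch ((a >> 10) & 3)
    {
    case 0:
      return 8;
    case 1:
      return 16;
    case 2:
      return 32;
    case 3:
      return 64;
    default:
      __builtin_unreachable();  /* added for this sketch */
    }
}

unsigned int bar(unsigned int a)
{
  return 1u << (((a >> 10) & 3) + 3);
}

unsigned int baz(unsigned int a)
{
  switch (a & (3 << 10))
    {
    case 0:
      return 8;
    case 1 << 10:
      return 16;
    case 2 << 10:
      return 32;
    case 3 << 10:
      return 64;
    default:
      __builtin_unreachable();  /* added for this sketch */
    }
}

/* Returns 1 if foo, bar and baz agree for all a in [0, 2^16). */
int all_equivalent(void)
{
  for (unsigned int a = 0; a < (1u << 16); a++)
    if (foo(a) != bar(a) || bar(a) != baz(a))
      return 0;
  return 1;
}
```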

The code for bar is

bar:
.LFB1:
.cfi_startproc
shrl    $10, %edi
movl    $1, %eax
movl    %edi, %ecx
andl    $3, %ecx
addl    $3, %ecx
sall    %cl, %eax
ret

which is optimum except for the register move (submitted as PR110479).

The compiler does not recognize that foo or baz are equivalent to bar,
but that may be too much of a special case to really consider. 

The code for foo is

foo:
.LFB0:
.cfi_startproc
shrl    $10, %edi
movl    $8, %eax
andl    $3, %edi
decl    %edi
cmpl    $2, %edi
ja  .L1
movzbl  CSWTCH.1(%rdi), %eax
.L1:
ret
.cfi_endproc

[...]

CSWTCH.1:
.byte   16
.byte   32
.byte   64

where it seems strange that there is a comparison and conditional
jump around the load.  A look at *.optimized shows

  <bb 2> [local count: 1073741824]:
  _1 = a_4(D) >> 10;
  _2 = _1 & 3;
  _8 = _2 + 4294967295;
  if (_8 <= 2)
    goto <bb 3>; [50.00%]
  else
    goto <bb 4>; [50.00%]

  <bb 3> [local count: 536870913]:
  _6 = CSWTCH.1[_8];
  _5 = (unsigned int) _6;

  <bb 4> [local count: 1073741824]:
  # _3 = PHI <_5(3), 8(2)>
  return _3;

which assigns a probability of 50% to (a>>10)& 3 being zero.
Where this comes from is unclear to me.  A straightforward load
from a table which also includes the 8 seems more logical to me
(especially with -Os).

Finally, baz generates

baz:
.LFB2:
.cfi_startproc
andl    $3072, %edi
movl    $32, %eax
cmpl    $2048, %edi
je  .L6
ja  .L8
movl    $8, %eax
testl   %edi, %edi
je  .L6
movl    $16, %eax
ret
.L8:
movl    $64, %eax
.L6:
ret

when transforming into something equivalent to foo (or even bar)
would seem advantageous.

[Bug rtl-optimization/110479] New: Unnecessary register move

2023-06-29 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110479

Bug ID: 110479
   Summary: Unnecessary register move
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

May be related to / a dup of PR110240.

The function

unsigned int bar(unsigned int a)
{
  return 1u << (((a >> 10) & 3) + 3);
}

is compiled, with a relatively recent trunk and -O3, to

bar:
.LFB12:
.cfi_startproc
shrl    $10, %edi
movl    $1, %eax
movl    %edi, %ecx
andl    $3, %ecx
addl    $3, %ecx
sall    %cl, %eax
ret

where the register move seems unnecessary.

[Bug target/110240] New: Unnecessary register move in indexed swap routine

2023-06-13 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110240

Bug ID: 110240
   Summary: Unnecessary register move in indexed swap routine
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

void swap (unsigned int * restrict a, unsigned int * restrict b)
{
  if (a[b[0]] > a[b[1]])
{
  unsigned int tmp = b[0];
  b[0] = b[1];
  b[1] = tmp;
}
}
$ gcc -O3 -S swap.c

gets me

swap:
.LFB0:
.cfi_startproc
movl    (%rsi), %ecx
movl    4(%rsi), %r8d
movq    %rcx, %rax
movl    (%rdi,%rcx,4), %ecx
cmpl    %ecx, (%rdi,%r8,4)
jnb .L1
movl    %r8d, (%rsi)
movl    %eax, 4(%rsi)
.L1:
ret
.cfi_endproc

where the

movq    %rcx, %rax

is unneeded, because rcs is not overwritten.

(It is probably also a zero-latency operation due to register renaming,
but still).

[Bug fortran/98577] Wrong "count_rate" values with int32 and real32 if the "count" argument is int64.

2023-05-14 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98577

Thomas Koenig  changed:

   What|Removed |Added

 Resolution|WONTFIX |INVALID

[Bug fortran/109659] New: gcc_builtin module for gfortran

2023-04-27 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109659

Bug ID: 109659
   Summary: gcc_builtin module for gfortran
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: fortran
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

There are lots of useful builtin functions in gcc which Fortran currently
does not have access to.  Just think of checking for integer overflow,
which gcc offers as __builtin_add_overflow().

Extending the language with new intrinsics is probably not a good idea
because, if things like that are later taken up, they will be made
incompatible by the committee.

So, I propose to add an intrinsic gcc_builtin module, which could then
export Fortran versions of those builtin functions that we think are
useful.
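
For reference, this is roughly how the builtin is used from C; a Fortran
version exported by the proposed module would presumably look similar.  The
wrapper names checked_add and demo_overflow_detected are made up for this
sketch.

```c
#include <limits.h>
#include <stdbool.h>

/* Adds a and b.  The wrapped sum is stored in *sum; the return value
   is true if the mathematical result does not fit into an int.  */
bool checked_add(int a, int b, int *sum)
{
  return __builtin_add_overflow(a, b, sum);
}

/* Returns 1 if INT_MAX + 1 is reported as an overflow. */
int demo_overflow_detected(void)
{
  int sum;
  return checked_add(INT_MAX, 1, &sum) ? 1 : 0;
}
```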

Thoughts? Comments?

[Bug tree-optimization/109075] [13 Regression] rnflow hangs at -O3

2023-03-09 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109075

Thomas Koenig  changed:

   What|Removed |Added

  Known to work||12.2.0

--- Comment #7 from Thomas Koenig  ---
I've just checked 12.2.0, and the code does not hang there.

(In reply to Jakub Jelinek from comment #6)
> jmul is used just once, so I wonder if the easiest solution wouldn't be to
> make jmul
> PARAMETER kind=8.

We can change the benchmark source, but we cannot change the
existing code base using the same idiom out there :-|

> Anyway, does -fwrapv work around it too?

Yes, -fwrapv works.

We could just include that in -std=legacy (it really is used for
legacy code) and mention it in the release notes, then.

How does that sound?

[Bug tree-optimization/109075] [13 Regression] rnflow hangs at -O3

2023-03-09 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109075

--- Comment #5 from Thomas Koenig  ---
Might be invalid code, see

https://gcc.gnu.org/pipermail/fortran/2023-March/059062.html

That appears to be a problem with widely used old-style linear congruential
random number generators, which expect overflow to just silently
truncate.

Looking at the test case with nagfor -C=all shows the problem:

  0: 0: 0.000 -> Read sequence
  0: 0: 0.150 -> extract extrema
  0: 0: 0.159 -> Generate raw transitions counts
  0: 0: 0.183 -> Compute Markov matrix
  0: 0: 0.184 -> Calculate theoretical rainflow
  0: 0:43.286 -> Simulate random markov sequences
Runtime Error: rnflow.f90, line 902: INTEGER(int32) overflow for 843314861 *
1993
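
The same multiplication can be reproduced in C as a sketch.  The operands are
the ones from the nagfor message; the function names are made up.  The signed
32-bit product overflows, while the unsigned version shows the silent
truncation that such generators rely on.

```c
#include <stdint.h>

/* Exact product, computed in 64 bits. */
int64_t wide_product(int32_t seed, int32_t mult)
{
  return (int64_t) seed * mult;
}

/* Returns 1 if seed * mult does not fit into a signed 32-bit integer. */
int mul_overflows_int32(int32_t seed, int32_t mult)
{
  int32_t result;
  return __builtin_mul_overflow(seed, mult, &result) ? 1 : 0;
}

/* What the legacy code expects: keep the low 32 bits.  Done in
   unsigned arithmetic, where wrap-around is well defined in C.  */
uint32_t wrapping_mul(uint32_t seed, uint32_t mult)
{
  return seed * mult;
}
```

For 843314861 * 1993, mul_overflows_int32 reports an overflow, and
wrapping_mul returns the low 32 bits of the exact product.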

The issue with the patch is that it is also illegal Fortran, because the
assignment outside the value range of a default integer is also illegal.
Again, nagfor catches this:

  0: 0: 0.000 -> Read sequence
  0: 0: 0.140 -> extract extrema
  0: 0: 0.150 -> Generate raw transitions counts
  0: 0: 0.175 -> Compute Markov matrix
  0: 0: 0.175 -> Calculate theoretical rainflow
  0: 0:44.032 -> Simulate random markov sequences
Runtime Error: rnflow.f90, line 905: Overflow converting 1681180334666 to
INTEGER(int32)

So, what to do?  I think we need to mention this in the release notes,
and also a workaround which gives the same result.

If there is a flag which suppresses whatever this does, we could also
set this with -std=legacy (and also mention this in the release notes).

[Bug tree-optimization/109075] [13 Regression] rnflow hangs at -O3

2023-03-08 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109075

Thomas Koenig  changed:

   What|Removed |Added

   Keywords||needs-bisection
   Target Milestone|--- |13.0

--- Comment #4 from Thomas Koenig  ---
As also confirmed by Paul Thomas, the program works fine at -O2.

[Bug tree-optimization/109075] [13 Regression] rnflow hangs at -O3

2023-03-08 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109075

--- Comment #3 from Thomas Koenig  ---
Created attachment 54619
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54619&action=edit
Compressed input file

[Bug tree-optimization/109075] [13 Regression] rnflow hangs at -O3

2023-03-08 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109075

--- Comment #2 from Thomas Koenig  ---
Created attachment 54618
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54618&action=edit
Header file needed for compilation

[Bug tree-optimization/109075] [13 Regression] rnflow hangs at -O3

2023-03-08 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109075

--- Comment #1 from Thomas Koenig  ---
Created attachment 54617
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54617&action=edit
rnflow.f90

[Bug tree-optimization/109075] New: [13 Regression] rnflow hangs at -O3

2023-03-08 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109075

Bug ID: 109075
   Summary: [13 Regression] rnflow hangs at -O3
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

rnflow from the pb11 Polyhedron benchmark hangs at -O3 with recent trunk,

gcc-Version 13.0.1 20230308 (experimental) [master revision
e87559d202d:f4e6da6e8ac:36ec54aac7da134441c83248e14825381b8d6f17] (GCC)

Compiling with -O3 -g, running under gdb and interrupting the program 
sometime after

(gdb) r rnflow.in 
Starting program: /tmp/a.out rnflow.in
  0: 0: 0.000 -> Read sequence
  0: 0: 0.213 -> extract extrema
  0: 0: 0.215 -> Generate raw transitions counts
  0: 0: 0.221 -> Compute Markov matrix
  0: 0: 0.221 -> Calculate theoretical rainflow
  0: 0: 7.487 -> Simulate random markov sequences
^C
Program received signal SIGINT, Interrupt.
0x00402149 in minlst (ipos2=, ipos1=) at
rnflow.f90:3698
3698 if (xxtrt (ipos) < xxtrt (minlst)) then
(gdb) l
3693!
3694! .. dernier minimum de xxtrt entre ipos1 et ipos2
3695!
3696  minlst = ipos2
3697  do ipos = ipos2 - 1, ipos1, -1
3698 if (xxtrt (ipos) < xxtrt (minlst)) then
3699minlst = ipos
3700 endif
3701  enddo
3702  end function minlst

where it goes into an endless loop.  This happens both on x86_64 and
on POWER.

[Bug rtl-optimization/109019] Failure to optimize b + c -1

2023-03-03 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109019

Thomas Koenig  changed:

   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution|--- |INVALID

--- Comment #2 from Thomas Koenig  ---
Urgh, too late in the evening, I guess. Sorry for the noise.

[Bug rtl-optimization/109019] New: Failure to optimize b + c -1

2023-03-03 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109019

Bug ID: 109019
   Summary: Failure to optimize b + c -1
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

Looks like a general RTL issue, I see this on POWER, RV64 and ARM64 (the
latter two on godbolt).


[tkoenig@gcc135 ~]$ cat c.c
long foo (long b, long c)
{
  return b + c - 1;
}
[tkoenig@gcc135 ~]$ gcc -O3 -S c.c
[tkoenig@gcc135 ~]$ cat c.s
.file   "c.c"
.machine power8
.abiversion 2
.section    ".text"
.align 2
.p2align 4,,15
.globl foo
.type   foo, @function
foo:
.LFB0:
.cfi_startproc
add 3,3,4
addi 3,3,-1
blr
.long 0
.byte 0,0,0,0,0,0,0,0
.cfi_endproc
.LFE0:
.size   foo,.-foo
.ident  "GCC: (GNU) 13.0.1 20230215 (experimental)"
.section    .note.GNU-stack,"",@progbits

This should be

addi 3,4,-1
ret

[Bug tree-optimization/108863] Unrolling could use range information

2023-02-20 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108863

Thomas Koenig  changed:

   What|Removed |Added

   Severity|normal  |enhancement
   Keywords||missed-optimization

[Bug tree-optimization/108863] New: Unrolling could use range information

2023-02-20 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108863

Bug ID: 108863
   Summary: Unrolling could use range information
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

Created attachment 54497
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54497&action=edit
Assembly code generated by test case

Looking a bit more at the code generated for the test code of PR108839.

For the test
$ cat u2.c
void foo(double *const restrict dx, double *dy, double da, long int n)
{
  long int m = n % 4;
  for (unsigned long i = 0; i < m; i++ )
dy[i] = dy[i] + da * dx[i];
}

a recent-ish trunk gives, with

$ gcc -S -O3  -funroll-all-loops -fno-tree-vectorize u2.c

far too much unrolling for a loop which can only be executed, at
most, four times (see attachment).

The range information about m does not appear to be propagated to
the unroll passes.

[Bug tree-optimization/108844] New: sincos opportunity missed

2023-02-18 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108844

Bug ID: 108844
   Summary: sincos opportunity missed
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

Two related test cases (which do the same, but are handled differently).

This is code for calculating a Jacobian, a frequent task in solving
non-linear systems of equations.  (I am using C instead of Fortran because
Fortran does not support fallthrough).

$ cat a.c
#include <math.h>

void f1 (double x, double y, double f[2], double fjac[2][2], int flag)
{
  switch (flag)
{
case 1:
  f[0] = x * y;
  f[1] = sin(x)*y*y;
  break;
case 2:
  fjac[0][0] = y;
  fjac[1][0] = cos(x)*y*y;
  fjac[0][1] = x;
  fjac[1][1] = 2*sin(x)*y;
  break;
case 3:
  f[0] = x * y;
  f[1] = sin(x)*y*y;
  fjac[0][0] = y;
  fjac[1][0] = cos(x)*y*y;
  fjac[0][1] = x;
  fjac[1][1] = 2*sin(x)*y;
  break;
 default:
  __builtin_unreachable();
}
}
$ cat b.c
#include <math.h>

void f1 (double x, double y, double f[2], double fjac[2][2], int flag)
{
  switch (flag)
{
case 1:
case 3:
  f[0] = x * y;
  f[1] = sin(x)*y*y;
  if (flag != 3)
break;
  /* Fallthrough */
case 2:
  fjac[0][0] = y;
  fjac[1][0] = cos(x)*y*y;
  fjac[0][1] = x;
  fjac[1][1] = 2*sin(x)*y;
  break;
default:
  __builtin_unreachable();
}
}
$ gcc -O3 -S a.c b.c

a.s looks good for flag=3:
leaq    64(%rsp), %rsi
leaq    72(%rsp), %rdi
movaps  %xmm3, 48(%rsp)
movsd   %xmm1, 32(%rsp)
movsd   %xmm0, 24(%rsp)
call    sincos

but the code for flag=2 looks like

cmpl    $2, %edx
je  .L2
[...]
.L2:
.cfi_restore_state
movaps  %xmm3, 32(%rsp)
movsd   %xmm1, 24(%rsp)
movsd   %xmm0, (%rsp)
call    cos
movsd   (%rsp), %xmm2
movq    %xmm0, %rbx
movapd  %xmm2, %xmm0
call    sin

b.s generates no call to sincos:
$ egrep  '(sin|cos)' b.s
call    sin
call    sin
call    cos

[Bug tree-optimization/108839] New: Option for rerolling loops

2023-02-17 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108839

Bug ID: 108839
   Summary: Option for rerolling loops
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

Code sometimes contains manual unrolling.  For example, the BLAS
reference implementation, subroutine DSCAL, has

  IF (INCX.EQ.1) THEN
*
*code for increment equal to 1
*
*
*clean-up loop
*
 M = MOD(N,5)
 IF (M.NE.0) THEN
DO I = 1,M
   DX(I) = DA*DX(I)
END DO
IF (N.LT.5) RETURN
 END IF
 MP1 = M + 1
 DO I = MP1,N,5
DX(I) = DA*DX(I)
DX(I+1) = DA*DX(I+1)
DX(I+2) = DA*DX(I+2)
DX(I+3) = DA*DX(I+3)
DX(I+4) = DA*DX(I+4)
 END DO
  ELSE

While such code may have been beneficial on old architectures, by
now this disturbs the compiler's own unrolling and vectorization,
and it increases code size.

It could be beneficial to have a -freroll-loops option, which
undid the manual unrolling of the code above. This could be
stand-alone, or included in options such as -Os.
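
To illustrate what such an option would have to recognize, here is a C sketch
of the unit-increment DSCAL path in its manually unrolled form next to the
rolled loop it is equivalent to.  The function names are made up, and the C
translation (0-based indices, size_t lengths) is an assumption of this
sketch, not the reference code.

```c
#include <stddef.h>

/* Manually unrolled by 5, following the structure of the Fortran above. */
void dscal_unrolled(size_t n, double da, double *dx)
{
  size_t m = n % 5;
  /* Clean-up loop for the first n mod 5 elements. */
  for (size_t i = 0; i < m; i++)
    dx[i] = da * dx[i];
  /* Main loop, five elements per iteration. */
  for (size_t i = m; i < n; i += 5)
    {
      dx[i] = da * dx[i];
      dx[i + 1] = da * dx[i + 1];
      dx[i + 2] = da * dx[i + 2];
      dx[i + 3] = da * dx[i + 3];
      dx[i + 4] = da * dx[i + 4];
    }
}

/* What a rerolling pass would ideally turn the above into. */
void dscal_rolled(size_t n, double da, double *dx)
{
  for (size_t i = 0; i < n; i++)
    dx[i] = da * dx[i];
}

/* Returns 1 if both versions give identical results for length n <= 64. */
int dscal_versions_agree(size_t n)
{
  double a[64], b[64];
  if (n > 64)
    return 0;
  for (size_t i = 0; i < n; i++)
    a[i] = b[i] = (double) i + 0.5;
  dscal_unrolled(n, 3.0, a);
  dscal_rolled(n, 3.0, b);
  for (size_t i = 0; i < n; i++)
    if (a[i] != b[i])
      return 0;
  return 1;
}
```

Each element is scaled exactly once in both versions, so the results are
bit-identical and the transformation needs no floating-point flags.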

[Bug rtl-optimization/108826] New: Inefficient address generation on POWER and RISC-V

2023-02-16 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108826

Bug ID: 108826
   Summary: Inefficient address generation on POWER and RISC-V
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

For the code (reduced from embench)

struct {
  unsigned int table[4][100];
} * _nettle_aes_decrypt_T;
unsigned int _nettle_aes_decrypt_w1;
void _nettle_aes_decrypt() {
  _nettle_aes_decrypt_T->table[2][0] =
  _nettle_aes_decrypt_T->table[2][_nettle_aes_decrypt_w1 >> 6 & 5];
}

current trunk generates

0:  addis 2,12,.TOC.-.LCF0@ha
addi 2,2,.TOC.-.LCF0@l
.localentry _nettle_aes_decrypt,.-_nettle_aes_decrypt
addis 9,2,.LANCHOR0+8@toc@ha
lwz 9,.LANCHOR0+8@toc@l(9)
addis 10,2,.LANCHOR0@toc@ha
ld 10,.LANCHOR0@toc@l(10)
srwi 9,9,6
andi. 9,9,0x5
addi 9,9,200
sldi 9,9,2
lwzx 9,10,9
stw 9,800(10)
blr

After the TOC loading, this shifts the value right, applies the AND, adds 200
and then shifts the value left again. These two shifts are not necessary.

A better alternative would be something like (please excuse any errors)

srwi 9,9,4
andi 9,9,20
add  9,9,2
lwz  9,800(9)
stw  9,800(9)

saving an instruction.

RISC-V does something similar.  According to godbolt:

lui a5,%hi(_nettle_aes_decrypt_w1)
lw  a5,%lo(_nettle_aes_decrypt_w1)(a5)
lui a4,%hi(_nettle_aes_decrypt_T)
ld  a4,%lo(_nettle_aes_decrypt_T)(a4)
srliw   a5,a5,6
andi    a5,a5,5
addi    a5,a5,200
slli    a5,a5,2
add a5,a4,a5
lw  a5,0(a5)
sw  a5,800(a4)
ret


(which is why I think this is a general RTL optimization issue).
x86 is much better:

movl    _nettle_aes_decrypt_w1(%rip), %eax
movq    _nettle_aes_decrypt_T(%rip), %rdx
shrl    $6, %eax
andl    $5, %eax
movl    800(%rdx,%rax,4), %eax
movl    %eax, 800(%rdx)
ret

but it can use the complex addressing modes on x86.
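
The arithmetic behind the suggested shorter sequence can be checked in
isolation.  The following is only a sketch in C with made-up function
names: it compares the byte offset as the POWER and RISC-V code computes it
(shift, mask, add 200 elements, scale by 4) with the folded form (shift by
4, mask with 20, add 800 bytes).

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset as in the generated code: shift right, mask, add the
   row offset in elements, then shift left to scale by 4. */
uint32_t offset_shift_twice(uint32_t w1)
{
  return (((w1 >> 6) & 5) + 200) << 2;
}

/* Byte offset with the scaling folded into the shift and the mask. */
uint32_t offset_folded(uint32_t w1)
{
  return ((w1 >> 4) & 20) + 800;
}

void check_offsets(void)
{
  /* Only bits 6 and 8 of w1 matter, so 0..1023 covers every case. */
  for (uint32_t w1 = 0; w1 < 1024; w1++)
    assert(offset_shift_twice(w1) == offset_folded(w1));
}
```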

[Bug tree-optimization/108710] Recognizing "rounding down to the nearest power of two"

2023-02-08 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108710

--- Comment #1 from Thomas Koenig  ---
Actually, register allocation is OK for an architecture with destructive shifts
only.

[Bug tree-optimization/108710] New: Recognizing "rounding down to the nearest power of two"

2023-02-07 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108710

Bug ID: 108710
   Summary: Recognizing "rounding down to the nearest power of
two"
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

In the code

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

uint64_t foo (uint64_t x)
{
  x = x | (x >> 1);
  x = x | (x >> 2);
  x = x | (x >> 4);
  x = x | (x >> 8);
  x = x | (x >> 16);
  x = x | (x >> 32);
  return x - (x >> 1);
}

uint64_t bar (uint64_t x)
{
  if (x == 0)
    return 0;
  else
    return 1ul << (63 - __builtin_clzl(x));
}

void tst (uint64_t a)
{
  uint64_t r_foo, r_bar;
  r_foo = foo(a);
  r_bar = bar(a);
  printf ("%20lu %20lu %20lu\n", a, r_foo, r_bar);
  if (r_foo != r_bar)
    abort();
}

int main()
{
  tst(0ul);
  for (uint64_t i = 1; i<64; i++) {
for (uint64_t j = 0; j
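
The test driver above is cut off in this archive.  Independently of it (this
is not the original driver), the equivalence of the two formulations can be
spot-checked with a sketch like the following.  The function names are made
up, and __builtin_clzll is used so that the check does not depend on long
being 64 bits wide.

```c
#include <assert.h>
#include <stdint.h>

/* Bit-smearing formulation, as in foo. */
uint64_t round_down_pow2_smear(uint64_t x)
{
  x |= x >> 1;
  x |= x >> 2;
  x |= x >> 4;
  x |= x >> 8;
  x |= x >> 16;
  x |= x >> 32;
  return x - (x >> 1);
}

/* Count-leading-zeros formulation, as in bar. */
uint64_t round_down_pow2_clz(uint64_t x)
{
  if (x == 0)
    return 0;
  return UINT64_C(1) << (63 - __builtin_clzll(x));
}

void check_round_down_pow2(void)
{
  assert(round_down_pow2_smear(0) == 0);
  assert(round_down_pow2_clz(0) == 0);
  for (int i = 0; i < 64; i++) {
    uint64_t p = UINT64_C(1) << i;
    /* Smallest, an intermediate, and the largest value rounding down to p. */
    uint64_t samples[3] = { p, p | (p >> 1), p | (p - 1) };
    for (int k = 0; k < 3; k++) {
      assert(round_down_pow2_smear(samples[k]) == p);
      assert(round_down_pow2_clz(samples[k]) == p);
    }
  }
}
```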

[Bug fortran/108665] New: Depenency checking: Run-time loop reversal instead of creating a temporary

2023-02-03 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108665

Bug ID: 108665
   Summary: Depenency checking: Run-time loop reversal instead of
creating a temporary
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: fortran
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

In the Fortran front end, we could sometimes reverse loops at runtime if
dependency analysis shows that either a forward or a backward loop would have
no dependencies.

Example code:

module x
  implicit none
contains
  subroutine foo(a,i,j,n)
    integer, intent(in) :: i, j, n
    real, dimension(:), intent(inout) :: a
    a(i:i+n-1) = a(j:j+n-1) + 10.
  end subroutine foo
  subroutine bar(a,i,j,n)
    real, dimension(:), intent(inout) :: a
    integer,intent(in) :: i, j, n
    integer :: k
    if (i <= j) then
       do k=0, n-1, 1
          a(i+k) = a(j+k) + 10.
       end do
    else
       do k=n-1,0,-1
          a(i+k) = a(j+k) + 10.
       end do
    end if
  end subroutine bar
end module x

where we create a temporary in foo.
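
For illustration only, the same idea can be written in C (hypothetical
function names, fixed-size temporary just to keep the sketch short).  The
version with run-time direction selection gives the same result as the
version that goes through a temporary, for every overlap of source and
destination:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* a[i..i+n-1] = a[j..j+n-1] + 10 via a temporary, as foo does today. */
void shift_add_temp(float *a, size_t i, size_t j, size_t n)
{
  float tmp[64];                 /* sketch only: assumes n <= 64 */
  assert(n <= 64);
  for (size_t k = 0; k < n; k++)
    tmp[k] = a[j + k] + 10.0f;
  memcpy(a + i, tmp, n * sizeof *tmp);
}

/* Loop direction chosen at run time, as in bar: no temporary needed. */
void shift_add_reversal(float *a, size_t i, size_t j, size_t n)
{
  if (i <= j) {
    /* Destination is at or below the source: a forward loop never
       overwrites an element before it has been read. */
    for (size_t k = 0; k < n; k++)
      a[i + k] = a[j + k] + 10.0f;
  } else {
    /* Destination is above the source: run backward instead. */
    for (size_t k = n; k-- > 0; )
      a[i + k] = a[j + k] + 10.0f;
  }
}

void check_reversal(void)
{
  for (size_t i = 0; i < 8; i++)
    for (size_t j = 0; j < 8; j++) {
      float x[16], y[16];
      for (size_t k = 0; k < 16; k++)
        x[k] = y[k] = (float) k;
      shift_add_temp(x, i, j, 8);
      shift_add_reversal(y, i, j, 8);
      assert(memcmp(x, y, sizeof x) == 0);
    }
}
```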

[Bug fortran/108592] In IF statements -Winteger-division is repeated 4 times

2023-01-30 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108592

--- Comment #2 from Thomas Koenig  ---
(In reply to anlauf from comment #1)

> @Thomas: do you remember the reason you chose the "_now" version?

I'm not sure any more.  It's been a few years :-)

[Bug fortran/103506] [10/11/12/13 Regression] ICE in gfc_free_namespace, at fortran/symbol.c:4039 since r10-2798-ge68a35ae4a65d2b3

2023-01-27 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103506

--- Comment #12 from Thomas Koenig  ---
(In reply to anlauf from comment #11)
> (In reply to Jerry DeLisle from comment #8)
> > Doing the search in bugzilla, 137 bugs are marked as ic-on-invalid-code.  I
> > suggest we make all of these P5 or Wont fix.
> 
> Please don't make them wont-fix.

I concur.

While it is annoying to have that many bugs, everybody has their
own priorities, and if somebody wants to spend their time fixing
them, there is no reason not to, and certainly no reason to remove
them from the search (which is what marking them as wont-fix would do).

> An ICE is always a bug, as we used to explain everybody who asked,
> and I still consider it that way.

Agreed.

> If you must, make them P5, but didn't we have P5 for enhancements,
> or (missing/wrong) diagnostics?

Not sure about that.  I hardly look at priorities, anyway, except for
the really high ones.

> Also, an ICE-on-invalid input that confuses the parser points at a
> different part of the compiler than an ICE that happens during resolution,
> simplification, frontend-optimization, translation.
> 
> In several cases, an ICE in one of the trans*.cc was caused by an issue
> much earlier in the process.
> 
> One problem is that there are lots of PRs that are - either seemingly or
> likely - related.  A good (better?) classification of bugs to find those
> that might be connected or near-duplicates would be helpful.
> 
> I once tried to edit the summary of some bugs that were e.g. coarray-related,
> or OOP; not sure if that was appreciated.

We have some meta-bugs for coarray-related stuff.  Coarray-related bugs
can be set as blocking Coarray (aka PR83700).

> (We could more aggressively mark PRs as F2018 or F2023.)

There is the F2018 meta-bug, aka 85836, and I have just created F2023,
aka PR108577.  This is probably the best way to track these PRs - just
mark them as blocking the relevant standard PR.  People who are interested
in following those can just put themselves on the CC list.

> Also, there are several bugs pertaining only to CLASS.  Some of those
> would be addressed along with the fix for PR106856.  Tobias' patch plus
> some minor fixup to it seems to solve many of them.

I don't think we have a class meta-bug, but I'm not sure.

[Bug fortran/108577] New: [meta-bug] Fortran 2023 support

2023-01-27 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108577

Bug ID: 108577
   Summary: [meta-bug] Fortran 2023 support
   Product: gcc
   Version: 13.0
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: fortran
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

A meta-bug to hang Fortran 2023 support PRs on.

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-15 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #14 from Thomas Koenig  ---
Seems that libquadmath is not built on that particular Linux/CPU variant,
for whatever reason. At least I cannot find any '*quadmath*' files
in the build directory.

/proc/cpuinfo tells me that

processor   : 0
BogoMIPS: 48.00
Features: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp
asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 asimdfhm dit
uscat ilrcpc flagm ssbs sb paca pacg dcpodp flagm2 frint
CPU implementer : 0x61
CPU architecture: 8
CPU variant : 0x1
CPU part: 0x022
CPU revision: 1

[...]

and uname -a is

Linux gcc103.fsffrance.org 6.0.0-rc5-asahi-1-gc62bd3fe430f #1 SMP Sun Sep
18 18:07:57 CEST 2022 aarch64 GNU/Linux

So much for testing on Apple silicon.

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-15 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #13 from Thomas Koenig  ---
I tried compiling your tests on Apple silicon using Asahi Linux, but
without success. A first step was rather easy; replacing __float128 by
_Float128 was required.  I then bootstrapped gcc on that machine and
added the (private) include path for , and am now hitting missing
__float128 in quadmath.h.  Not sure how to proceed from here.

The machine is gcc103.fsffrance.org, by the way, of the GCC compile farm.

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-14 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #10 from Thomas Koenig  ---
What we would need for incorporation into gcc is to have several
functions, which would then be called depending on which floating point
options are in force at the time of invocation.

So, let's go through the gcc options, to see what would fit where. Walking
down the options tree, depth first.

From the gcc docs:

'-ffast-math'
 Sets the options '-fno-math-errno', '-funsafe-math-optimizations',
 '-ffinite-math-only', '-fno-rounding-math', '-fno-signaling-nans',
 '-fcx-limited-range' and '-fexcess-precision=fast'.

-fno-math-errno is irrelevant in this context, no need to look at that.

'-funsafe-math-optimizations'

 Allow optimizations for floating-point arithmetic that (a) assume
 that arguments and results are valid and (b) may violate IEEE or
 ANSI standards.  When used at link time, it may include libraries
 or startup files that change the default FPU control word or other
 similar optimizations.

 This option is not turned on by any '-O' option since it can result
 in incorrect output for programs that depend on an exact
 implementation of IEEE or ISO rules/specifications for math
 functions.  It may, however, yield faster code for programs that do
 not require the guarantees of these specifications.  Enables
 '-fno-signed-zeros', '-fno-trapping-math', '-fassociative-math' and
 '-freciprocal-math'.

'-fno-signed-zeros'
 Allow optimizations for floating-point arithmetic that ignore the
 signedness of zero.  IEEE arithmetic specifies the behavior of
 distinct +0.0 and -0.0 values, which then prohibits simplification
 of expressions such as x+0.0 or 0.0*x (even with
 '-ffinite-math-only').  This option implies that the sign of a zero
 result isn't significant.

 The default is '-fsigned-zeros'.

I don't think this option is relevant.

'-fno-trapping-math'
 Compile code assuming that floating-point operations cannot
 generate user-visible traps.  These traps include division by zero,
 overflow, underflow, inexact result and invalid operation.  This
 option requires that '-fno-signaling-nans' be in effect.  Setting
 this option may allow faster code if one relies on "non-stop" IEEE
 arithmetic, for example.

 This option should never be turned on by any '-O' option since it
 can result in incorrect output for programs that depend on an exact
 implementation of IEEE or ISO rules/specifications for math
 functions.

 The default is '-ftrapping-math'.

Relevant.

'-ffinite-math-only'
 Allow optimizations for floating-point arithmetic that assume that
 arguments and results are not NaNs or +-Infs.

 This option is not turned on by any '-O' option since it can result
 in incorrect output for programs that depend on an exact
 implementation of IEEE or ISO rules/specifications for math
 functions.  It may, however, yield faster code for programs that do
 not require the guarantees of these specifications.

This does not have further suboptions. Relevant.

'-fassociative-math'

 Allow re-association of operands in series of floating-point
 operations.  This violates the ISO C and C++ language standard by
 possibly changing computation result.  NOTE: re-ordering may change
 the sign of zero as well as ignore NaNs and inhibit or create
 underflow or overflow (and thus cannot be used on code that relies
 on rounding behavior like '(x + 2**52) - 2**52'.  May also reorder
 floating-point comparisons and thus may not be used when ordered
 comparisons are required.  This option requires that both
 '-fno-signed-zeros' and '-fno-trapping-math' be in effect.
 Moreover, it doesn't make much sense with '-frounding-math'.  For
 Fortran the option is automatically enabled when both
 '-fno-signed-zeros' and '-fno-trapping-math' are in effect.

 The default is '-fno-associative-math'.

Not relevant, I think - this influences compiler optimizations.

'-freciprocal-math'

 Allow the reciprocal of a value to be used instead of dividing by
 the value if this enables optimizations.  For example 'x / y' can
 be replaced with 'x * (1/y)', which is useful if '(1/y)' is subject
 to common subexpression elimination.  Note that this loses
 precision and increases the number of flops operating on the value.

 The default is '-fno-reciprocal-math'.

Again, not relevant.


'-frounding-math'
 Disable transformations and optimizations that assume default
 floating-point rounding behavior.  This is round-to-zero for all
 floating point to integer conversions, and round-to-nearest for all
 other arithmetic truncations.  This option should be specified for
 programs that change the FP rounding mode dynamically, or that may
 be executed with a non-default rounding mode.  This option disables
 

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-14 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #9 from Thomas Koenig  ---
Created attachment 54273
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54273&action=edit
matmul_r16.i

Here is matmul_r16.i from a relatively recent trunk.

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-12 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #6 from Thomas Koenig  ---
(In reply to Michael_S from comment #5)
> Hi Thomas
> Are you in or out?

Depends a bit on what exactly you want to do, and if there is
a chance that what you want to do will be incorporated into gcc.

If you want to replace the soft-float routines, you will have to
replace them with the full functionality.

And there will have to be a decision about 32-bit targets.

> If you are still in, I can use your help on several issues.
> 
> 1. Torture. 
> See if Invalid Operand exception raised properly now. Also if there are
> still remaining problems with NaN.

I've put your addition/subtraction routines in as a replacement
and am running a regression test.  We'll see when that finishes.

> 2. Run my correction tests on as many non-AMD64 targets as you can.
> Preferably, with 100,000,000 iterations, but on weaker HW 10,000,000 will do.

This will take some time.

> 3. Run my speed tests (tests/matmulq/mm_speed_ma) on more diverse set of
> AMD64 computers than I did.
> Of special interest are
> - AMD Zen3 on Linux running on bare metal
> - Intel Skylake, SkylakeX, Tiger/Rocket Lake and Alder Lake on Linux running
> on bare metal
> I realize that doing speed tests is not nearly as simple as correctness
> tests.
> We need non-busy (preferably almost idle) machines that have stable CPU
> clock rate. It's not easy to find machines like that nowadays. But, may be,
> you can find at least some from the list.

I currently have no access to that sort of hardware (I'm just a volunteer,
and my home box is Zen-1).

> 4. Run my speed tests on as many non-obsolete ARM64 computers as you can
> find.
> Well, probably a wishful thinking on my part.
> 
> 
> Also off topic but of interest: postprocessed source of matmul_r16.c

Where should I send that to?

[Bug other/89204] -floop-interchange has no effect on Fortran code

2023-01-10 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89204

Thomas Koenig  changed:

   What|Removed |Added

 Resolution|INVALID |DUPLICATE

--- Comment #8 from Thomas Koenig  ---
Actually a duplicate.

*** This bug has been marked as a duplicate of bug 31756 ***

[Bug tree-optimization/31756] -floop-interchange is not working on some fortran loops

2023-01-10 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=31756

Thomas Koenig  changed:

   What|Removed |Added

 CC||mehdi.chinoune at hotmail dot com

--- Comment #7 from Thomas Koenig  ---
*** Bug 89204 has been marked as a duplicate of this bug. ***

[Bug fortran/108329] IEEE_SET_ROUNDING_MODE ineffective with common subexpression elimination

2023-01-09 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108329

Thomas Koenig  changed:

   What|Removed |Added

 Status|ASSIGNED|NEW
   Assignee|tkoenig at gcc dot gnu.org |unassigned at gcc dot gnu.org

--- Comment #3 from Thomas Koenig  ---
Seems to be much more complicated than I thought, see the thread starting at
https://gcc.gnu.org/pipermail/gcc-patches/2023-January/609532.html

[Bug fortran/108329] IEEE_SET_ROUNDING_MODE ineffective with common subexpression elimination

2023-01-07 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108329

Thomas Koenig  changed:

   What|Removed |Added

 Ever confirmed|0   |1
   Last reconfirmed||2023-01-07
   Assignee|unassigned at gcc dot gnu.org  |tkoenig at gcc dot gnu.org
 Status|UNCONFIRMED |ASSIGNED

[Bug fortran/108329] IEEE_SET_ROUNDING_MODE ineffective with common subexpression elimination

2023-01-07 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108329

--- Comment #2 from Thomas Koenig  ---
(In reply to Thomas Koenig from comment #1)
> As long as PR 36678

That should be PR 34678 .

[Bug fortran/108329] IEEE_SET_ROUNDING_MODE ineffective with common subexpression elimination

2023-01-07 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108329

Thomas Koenig  changed:

   What|Removed |Added

Version|unknown |13.0
 Depends on||34678
   Keywords||wrong-code
 Blocks||105105

--- Comment #1 from Thomas Koenig  ---
As long as PR 36678 is not fixed, I see one possible solution in
putting a memory barrier after ieee_set_rounding_mode.

This is a rather big hammer, but as long as the middle-end issue
is not fixed, I do not see an alternative.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=34678
[Bug 34678] Optimization generates incorrect code with -frounding-math option
(#pragma STDC FENV_ACCESS not implemented)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105105
[Bug 105105] [Meta] Fortran IEEE support

[Bug fortran/108329] New: IEEE_SET_ROUNDING_MODE ineffective with common subexpression elimination

2023-01-07 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108329

Bug ID: 108329
   Summary: IEEE_SET_ROUNDING_MODE ineffective with common
subexpression elimination
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: fortran
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

Split from https://gcc.gnu.org/bugzilla/show_bug.cgi?id=34678#c47 .

The test case

$ cat y.f90
module y
  implicit none
  integer, parameter :: wp = selected_real_kind(15)
contains
  subroutine foo(a,b,c)
    use ieee_arithmetic
    real(kind=wp), dimension(4), intent(out) :: a
    real(kind=wp), intent(in) :: b, c
    type (ieee_round_type), dimension(4), parameter :: mode = &
         [ieee_nearest, ieee_to_zero, ieee_up, ieee_down]
    call ieee_set_rounding_mode (mode(1))
    a(1) = b + c
    call ieee_set_rounding_mode (mode(2))
    a(2) = b + c
    call ieee_set_rounding_mode (mode(3))
    a(3) = b + c
    call ieee_set_rounding_mode (mode(4))
    a(4) = b + c
  end subroutine foo
end module y

program main
  use y
  real(kind=wp), dimension(4) :: a
  call foo(a, 0.1_wp, 0.2_wp)
  print *,a
end program main
$ gfortran -O  y.f90 && ./a.out
  0.30004   0.30004   0.30004  
0.30004 
$ gfortran y.f90 && ./a.out
  0.30004   0.2   0.30004  
0.2

shows that common subexpression removal causes the addition to be performed
only once.

[Bug middle-end/34678] Optimization generates incorrect code with -frounding-math option (#pragma STDC FENV_ACCESS not implemented)

2023-01-07 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=34678

--- Comment #49 from Thomas Koenig  ---
(In reply to Thomas Koenig from comment #48)
> Clang gets this right, even without the pragma;

The "even without the pragma" part is wrong.

[Bug middle-end/34678] Optimization generates incorrect code with -frounding-math option (#pragma STDC FENV_ACCESS not implemented)

2023-01-07 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=34678

--- Comment #48 from Thomas Koenig  ---
Clang gets this right, even without the pragma; the original test case is
compiled to

pushq   %r14
pushq   %rbx
subq    $24, %rsp
movq    %rsi, %r14
movq    %rdi, %rbx
movsd   %xmm1, 16(%rsp) # 8-byte Spill
movsd   %xmm0, 8(%rsp)  # 8-byte Spill
movl    $1024, %edi # imm = 0x400
callq   fesetround@PLT
movsd   8(%rsp), %xmm0  # 8-byte Reload
divsd   16(%rsp), %xmm0 # 8-byte Folded Reload
movsd   %xmm0, (%rbx)
movl    $2048, %edi # imm = 0x800
callq   fesetround@PLT
movsd   8(%rsp), %xmm0  # 8-byte Reload
divsd   16(%rsp), %xmm0 # 8-byte Folded Reload
movsd   %xmm0, (%r14)
addq    $24, %rsp
popq    %rbx
popq    %r14
retq

[Bug middle-end/34678] Optimization generates incorrect code with -frounding-math option (#pragma STDC FENV_ACCESS not implemented)

2023-01-06 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=34678

Thomas Koenig  changed:

   What|Removed |Added

 Blocks||105105

--- Comment #47 from Thomas Koenig  ---
(In reply to Thomas Koenig from comment #46)
> Fortran gets this right:

... but only by accident. This test case shows that it doesn't:

$ cat y.f90
module y
  implicit none
  integer, parameter :: wp = selected_real_kind(15)
contains
  subroutine foo(a,b,c)
    use ieee_arithmetic
    real(kind=wp), dimension(4), intent(out) :: a
    real(kind=wp), intent(in) :: b, c
    type (ieee_round_type), dimension(4), parameter :: mode = &
         [ieee_nearest, ieee_to_zero, ieee_up, ieee_down]
    call ieee_set_rounding_mode (mode(1))
    a(1) = b + c
    call ieee_set_rounding_mode (mode(2))
    a(2) = b + c
    call ieee_set_rounding_mode (mode(3))
    a(3) = b + c
    call ieee_set_rounding_mode (mode(4))
    a(4) = b + c
  end subroutine foo
end module y

program main
  use y
  real(kind=wp), dimension(4) :: a
  call foo(a, 0.1_wp, 0.2_wp)
  print *,a
end program main
$ gfortran -O  y.f90 && ./a.out
  0.30004   0.30004   0.30004  
0.30004 
$ gfortran y.f90 && ./a.out
  0.30004   0.2   0.30004  
0.2


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105105
[Bug 105105] [Meta] Fortran IEEE support

[Bug middle-end/34678] Optimization generates incorrect code with -frounding-math option (#pragma STDC FENV_ACCESS not implemented)

2023-01-06 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=34678

--- Comment #46 from Thomas Koenig  ---
Fortran gets this right:

$ cat set_rounding_mode.f90
module x
  implicit none
  integer, parameter :: wp = selected_real_kind(15)
contains
  subroutine foo(a,b,c)
    use ieee_arithmetic
    real(kind=wp), dimension(4), intent(out) :: a
    real(kind=wp), intent(in) :: b, c
    type (ieee_round_type), dimension(4), parameter :: mode = &
         [ieee_nearest, ieee_to_zero, ieee_up, ieee_down]
    integer :: i
    do i=1,4
       call ieee_set_rounding_mode (mode(i))
       a(i) = b + c
    end do
  end subroutine foo
end module x

program main
  use x
  real(kind=wp), dimension(4) :: a
  call foo(a, 0.1_wp, 0.2_wp)
  print *,a
end program main
$ gfortran -O3 set_rounding_mode.f90
$ ./a.out
  0.30004   0.2   0.30004  
0.2

[Bug rtl-optimization/108318] New: Floating point calculation moved out of loop despite fesetround

2023-01-06 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108318

Bug ID: 108318
   Summary: Floating point calculation moved out of loop despite
fesetround
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

#include <fenv.h>
void
foo (double res[4], double a, double b)
{
  static const int rm[4]
    = { FE_DOWNWARD, FE_TONEAREST, FE_TOWARDZERO, FE_UPWARD };
  for (int i = 0; i < 4; ++i)
    {
      fesetround (rm[i]);
      res[i] = a + b;
    }
  fesetround (FE_TONEAREST); // restore default
}

when compiled with recent trunk and -O3, yields

addsd   %xmm1, %xmm0
pushq   %r14
.cfi_def_cfa_offset 16
.cfi_offset 14, -16
pushq   %rbp
.cfi_def_cfa_offset 24
.cfi_offset 6, -24
movq    %rdi, %rbp
pushq   %rbx
.cfi_def_cfa_offset 32
.cfi_offset 3, -32
xorl    %ebx, %ebx
movq    %xmm0, %r14
.L2:
movl    rm.0(,%rbx,4), %edi
call    fesetround
movq    %r14, 0(%rbp,%rbx,8)
addq    $1, %rbx
cmpq    $4, %rbx
jne     .L2
popq    %rbx
.cfi_def_cfa_offset 24
xorl    %edi, %edi
popq    %rbp
.cfi_def_cfa_offset 16
popq    %r14
.cfi_def_cfa_offset 8
jmp     fesetround
.cfi_endproc

Seems all right after tree optimization, the *.optimized dump looks OK:

  [local count: 858993457]:
  # ivtmp.5_16 = PHI 
  _1 = MEM[(int *) + ivtmp.5_16 * 4];
  fesetround (_1);
  _5 = a_12(D) + b_13(D);
  MEM[(double *)res_11(D) + ivtmp.5_16 * 8] = _5;
  ivtmp.5_7 = ivtmp.5_16 + 1;
  if (ivtmp.5_7 != 4)
goto ; [80.00%]
  else
goto ; [20.00%]

   [local count: 214748368]:
  fesetround (0); [tail call]
  return;


This does not seem to be a recent regression, this goes back to at
least gcc 4.1.2.

Noted by Michael S on comp.arch, on
https://groups.google.com/g/comp.arch/c/Izheu-k00Nw/m/oljg70SBBwAJ .

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-04 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #3 from Thomas Koenig  ---
(In reply to Jakub Jelinek from comment #2)
> From what I can see, they are certainly not portable.
> E.g. the relying on __int128 rules out various arches (basically all 32-bit
> arches,
> ia32, powerpc 32-bit among others).

For this kind of performance improvement on 64-bit systems, we could probably
introduce an appropriate #ifdef. Regarding x86 intrinsics, maybe they
can be replaced by gcc's vector extension.

> Not handling exceptions is a show
> stopper too.

Agreed, we should not be replacing the soft-fp that way.

> Guess better time investment would be to improve performance of the soft-fp
> versions.

I'm not sure, I think we could get an appreciable benefit if we
only invoke this kind of routine behind the appropriate sub-flags
of -ffast-math.

For a general-purpose code, I see at least no way around the bottleneck
of querying the processor status on each invocation, and that is a waste
if the program does not care.

[Bug libgcc/108279] Improved speed for float128 routines

2023-01-03 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

--- Comment #1 from Thomas Koenig  ---
Created attachment 54183
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=54183&action=edit
Example patch with Michael S's code just pasted over the libgcc implementation,
for a test

A benchmark: Just pasting over the code from the github
repo yields an improvement of gfortran's matmul by almost a factor of two,
so significant speedups are possible:

module tick
interface
function rdtsc() bind(C,name="rdtsc")
use iso_c_binding
integer(kind=c_long) :: rdtsc
end function rdtsc
end interface
end module tick

program main
use tick
use iso_c_binding
implicit none
integer, parameter :: wp = selected_real_kind(30)
! integer, parameter :: n=5000, p=4000, m=3666
integer, parameter :: n = 1000, p = 1000, m = 1000
real (kind=wp) :: c(n,p), a(n,m), b(m, p)
character(len=80) :: line
integer(c_long) :: t1, t2, t3
real (kind=wp) :: fl = 2.d0*n*m*p
integer :: i,j

print *,wp

line = '10 10'
call random_number(a)
call random_number(b)
t1 = rdtsc()
t2 = rdtsc()
t3 = t2-t1
print *,t3
t1 = rdtsc()
c = matmul(a,b)
t2 = rdtsc()
print *,1/(fl/(t2-t1-t3)),"Cycles per operation"
read (unit=line,fmt=*) i,j
write (unit=line,fmt=*) c(i,j)
end program main

showed

tkoenig@gcc188:~> ./original
16
32
^C
tkoenig@gcc188:~> time ./original
16
32
90.5696151957 Cycles per operation

real 1m2,148s
user 1m2,123s
sys 0m0,008s
tkoenig@gcc188:~> time ./modified
16
32
52.81483917199957 Cycles per operation

real 0m36,296s
user 0m36,278s
sys 0m0,008s 

where "original" is the current libgcc soft-float implementation, and
"modified" is with the code from the repo.

It does not handle exceptions, so this causes a few regressions, but certainly
shows the potential.

[Bug libgcc/108279] New: Improved speed for float128 routines

2023-01-03 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108279

Bug ID: 108279
   Summary: Improved speed for float128 routines
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: enhancement
  Priority: P3
 Component: libgcc
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

Our soft-float routines, which are used for the basic float128 arithmetic
(__addtf3, __subtf3, etc) are much slower than they need to be.

Michael S has some routines which are considerably faster, at
https://github.com/already5chosen/extfloat, which he would like to
contribute to gcc.  There is a rather lengthy thread in comp.arch
starting with https://groups.google.com/g/comp.arch/c/Izheu-k00Nw .

Current status of the discussion:

The routines currently do not support rounding modes, they support round to
nearest with tie even only. Adding such support would be feasible.

Handling the rounding mode is currently done in libgcc, by
querying the hardware, leading to a high overhead for each
call. This would not be needed if -ffast-math (or a relevant
suboption) is specified.

It would also be suitable as is (with a different name) for Fortran
intrinsics such as matmul.

Fortran is a bit special because rounding modes are default on procedure
entry and are restored on procedure exit (which is why setting rounding
modes in a subroutine is a no-op). This would make it possible to keep
track of the rounding mode in a local variable.

The current idea would be something like this:

The current behavior of __addtf3 and friends could remain as is;
its speed could be improved, but it would still query the
hardware.

There can be two additional routines for each arithmetic operation. One
of them would implement the operation given a specified rounding mode
(to be called from Fortran when the correct IEEE module is in
use).

The other one would just implement round-to-nearest, for use from
Fortran intrinsics and from all other languages if the right flags
are given. It would be good to bolt this onto some flag which is
used for libgfortran, to make it accessible from C.

Probably gcc14 material.
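
To make the shape of the three entry points a bit more concrete, here is a
toy sketch in C.  It is not float128 code and not a proposed interface: it
adds two unsigned integers and drops the low four bits of the sum with a
selectable rounding, and a global variable stands in for the hardware
status that libgcc has to query.  All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical rounding modes for the toy example. */
enum toy_round { TOY_NEAREST_EVEN, TOY_TOWARD_ZERO, TOY_UP, TOY_DOWN };

/* Stands in for the processor status register. */
static enum toy_round toy_hw_mode = TOY_NEAREST_EVEN;

/* Drop the low 4 bits of an unsigned value, rounding as requested. */
static uint64_t toy_round4(uint64_t v, enum toy_round mode)
{
  uint64_t q = v >> 4, rem = v & 15;
  switch (mode) {
  case TOY_UP:
    return q + (rem != 0);
  case TOY_NEAREST_EVEN:
    if (rem > 8 || (rem == 8 && (q & 1)))
      return q + 1;
    return q;
  default:  /* TOY_TOWARD_ZERO and TOY_DOWN agree for unsigned values */
    return q;
  }
}

/* Rounding mode passed explicitly by the caller. */
uint64_t toy_add_mode(uint64_t a, uint64_t b, enum toy_round mode)
{
  return toy_round4(a + b, mode);
}

/* Round-to-nearest only: no mode query at all. */
uint64_t toy_add_nearest(uint64_t a, uint64_t b)
{
  return toy_round4(a + b, TOY_NEAREST_EVEN);
}

/* Existing-style entry point: queries the "hardware" on every call. */
uint64_t toy_add(uint64_t a, uint64_t b)
{
  return toy_add_mode(a, b, toy_hw_mode);
}

void check_toy_variants(void)
{
  assert(toy_add_nearest(16, 8) == 2);   /* 24/16 = 1.5, tie goes to even */
  assert(toy_add_nearest(32, 8) == 2);   /* 40/16 = 2.5, tie goes to even */
  assert(toy_add_mode(16, 8, TOY_TOWARD_ZERO) == 1);
  assert(toy_add_mode(16, 1, TOY_UP) == 2);
  toy_hw_mode = TOY_UP;                  /* the "hardware" mode changes */
  assert(toy_add(16, 1) == 2);
  toy_hw_mode = TOY_NEAREST_EVEN;
  assert(toy_add(16, 1) == 1);
}
```

Only the last variant pays for the mode query; the other two leave the
choice to the caller or to the compiler flags in force.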

[Bug tree-optimization/108227] Unnecessary division when looping over array with size of elements not a power of two

2022-12-26 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108227

Thomas Koenig  changed:

   What|Removed |Added

   Keywords||missed-optimization

--- Comment #1 from Thomas Koenig  ---
This could also impact Fortran's array descriptor reform if we ever switch to
the specified BIND(C) descriptors as our native format - there, we would have
to generate loops just like that, preferably without division.

[Bug tree-optimization/108227] New: Unnecessary division when looping over array with size of elements not a power of two

2022-12-26 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108227

            Bug ID: 108227
           Summary: Unnecessary division when looping over array with size
                    of elements not a power of two
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

Consider

typedef struct coord {
  double x, y, z;
} coord;

void foo(coord *from, coord *to)
{
  unsigned long int n = to - from;
  for (unsigned long int i=0; i < n; i++)
    {
      from[i].x = from[i].x + 1.0;
    }
}

void bar (coord *from, coord *to)
{
  char *c_from = (char *) from, *c_to = (char *) to;
  coord *p = from;
  long int c_n = c_to - c_from;
  for (long int i=0; i < c_n; i+= sizeof(coord))
    {
      p->x = p->x + 1.0;
      p++;
    }
}

The code is functionally equivalent, but the assembly is somewhat different:

foo has

foo:
.LFB0:
        .cfi_startproc
        movabsq $-6148914691236517205, %rax
        movq    %rsi, %rdx
        subq    %rdi, %rdx
        sarq    $3, %rdx
        imulq   %rax, %rdx
        cmpq    %rdi, %rsi
        je      .L1
        movsd   .LC0(%rip), %xmm1
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L3:
        movsd   (%rdi), %xmm0
        addq    $1, %rax
        addq    $24, %rdi
        addsd   %xmm1, %xmm0
        movsd   %xmm0, -24(%rdi)
        cmpq    %rdx, %rax
        jb      .L3
.L1:
        ret

so it first divides by 24, the size of coord (efficiently, with a shift
and a multiplication), to determine n. There are 7 instructions in the
loop itself.
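
The movabsq/sarq/imulq sequence is an exact division of the byte
difference by sizeof(coord), which is 24 here: shift right by 3, then
multiply by the inverse of 3 modulo 2^64. A small sketch of the same
arithmetic in C follows. The names exact_div24 and INV3 are made up for
illustration, and the final conversion to int64_t relies on the wrapping
behavior that GCC documents for out-of-range conversions.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplicative inverse of 3 modulo 2^64: 3 * INV3 == 1 (mod 2^64).
   Read as a signed 64-bit value, this is the constant
   -6148914691236517205 loaded by movabsq.  */
#define INV3 UINT64_C (0xAAAAAAAAAAAAAAAB)

/* Divide BYTE_DIFF by 24, assuming it is an exact multiple of 24
   (true for the difference of two pointers into one coord array).  */
static int64_t
exact_div24 (int64_t byte_diff)
{
  /* Exact division by 8; for multiples of 8 this matches sarq $3.  */
  int64_t eighths = byte_diff / 8;

  /* Unsigned multiplication wraps modulo 2^64, which is what imulq
     computes in the low 64 bits.  */
  return (int64_t) ((uint64_t) eighths * INV3);
}
```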

bar has

bar:
.LFB1:
        .cfi_startproc
        subq    %rdi, %rsi
        testq   %rsi, %rsi
        jle     .L6
        movsd   .LC0(%rip), %xmm1
        xorl    %eax, %eax
        .p2align 4,,10
        .p2align 3
.L8:
        movsd   (%rdi,%rax), %xmm0
        addsd   %xmm1, %xmm0
        movsd   %xmm0, (%rdi,%rax)
        addq    $24, %rax
        cmpq    %rax, %rsi
        jg      .L8
.L6:
        ret

no need to divide, and one instruction less in the loop.

I would expect foo to match bar.
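
For comparison, a third source-level variant (hypothetical, not part of
the report) walks a pointer up to the end pointer and so never needs an
element count. It assumes from <= to and that both point into the same
array; no claim is made here about the code GCC generates for it.

```c
#include <assert.h>

typedef struct coord {
  double x, y, z;
} coord;

/* Hypothetical variant: iterate by pointer comparison, so neither
   the element count nor the byte count is ever computed.  */
static void
baz (coord *from, coord *to)
{
  for (coord *p = from; p != to; p++)
    p->x = p->x + 1.0;
}
```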

[Bug fortran/106576] Finalization of temporaries from functions not occuring

2022-12-04 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106576

--- Comment #6 from Thomas Koenig  ---

> I hope that you are well and that the lack of time is for a good cause?

Hi Paul,

yes, I'm well, and the lack of time is indeed for a good cause :-)

> I have just returned to my finalizer patch. With it applied, your testcase
> produces the same output as NAG.

That's great!

> I will attach the present version of the patch to this PR.

Is there a chance that we will see this patch in gcc13?  Even if it
does not fix every last bug in finalizers in gfortran, it would still
be a very large improvement compared to the current condition.

[Bug fortran/106576] Finalization of temporaries from functions not occuring

2022-11-12 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106576

Thomas Koenig  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEW

--- Comment #3 from Thomas Koenig  ---
No time to work on this at the moment.

[Bug fortran/107317] [10/11/12/13 Regression] ICE in emit_redzone_byte, at asan.cc:1508

2022-10-22 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107317

Thomas Koenig  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|P2                          |P3

[Bug fortran/107317] [10/11/12/13 Regression] ICE in emit_redzone_byte, at asan.cc:1508

2022-10-20 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107317

--- Comment #3 from Thomas Koenig  ---
As this is invalid code (and in Fortran), should this actually be P2?

[Bug fortran/41453] use INTENT(out) for optimization

2022-09-25 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=41453

--- Comment #16 from Thomas Koenig  ---
(In reply to Mikael Morin from comment #15)
> Status update:

A lot of progress :-)

> (In reply to Thomas Koenig from comment #5)
> > Still missing: To clobber
> > 
> > - variables passed by reference to the caller
> > - saved variables
> > - associated variables (these are passed as pointers to
> >   the associate blocks)
> These have been done now.
> 
> Still missing: pointer or allocatable dummy.
> Seems doable, probably a low hanging fruit.

For an allocatable dummy, we have to deallocate on intent(out)
anyway, and we do this on the caller's side, so we should not clobber. 
For pointers, it could be an advantage.

> > - intent(out) variables on entry to the procedure.
> This remains to do.

Again, sounds doable.


> Another case that could be handled is the case of arrays:
> when the full array is passed as argument, and it is contiguous, and maybe
> some other condition, we can clobber its decl.  The hard part is the "maybe
> some other condition".

Not sure what that other condition could be.  If we have a full array ref, as
per gfc_full_array_ref_p, and we pass this to an intent(out) argument,
then that should be enough.

> Not sure it's worth keeping this PR open.
> Surely the initial test works as expected, and has been working for a long
> time.

There are still a few open points in relation to this.  I would be in
favor of keeping this open (to not lose the discussion) until we have
them all fixed, or decide not to fix some or all of them.

[Bug tree-optimization/104265] Missed vectorization in 526.blender_r

2022-08-30 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104265

Thomas Koenig  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tkoenig at gcc dot gnu.org

--- Comment #2 from Thomas Koenig  ---
Created attachment 53521
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53521&action=edit
Assembly code generated by aocc

FWIW, here is the assembler code as generated by aocc version 13.0.0 with -O3.

[Bug rtl-optimization/106678] New: Inefficiency in long integer multiplication

2022-08-18 Thread tkoenig at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106678

            Bug ID: 106678
           Summary: Inefficiency in long integer multiplication
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tkoenig at gcc dot gnu.org
  Target Milestone: ---

The code from PR 103109

#include <stdint.h>

void Long_multiplication( uint64_t multiplicand[],
                          uint64_t multiplier[],
                          uint64_t sum[],
                          uint64_t ilength, uint64_t jlength )
{
  uint64_t acarry, mcarry, product;

  for( uint64_t i = 0;
       i < (ilength + jlength);
       i++ )
    sum[i] = 0;

  acarry = 0;
  for( uint64_t j = 0; j < jlength; j++ )
    {
      mcarry = 0;
      for( uint64_t i = 0; i < ilength; i++ )
        {
          __uint128_t mcarry_prod;
          __uint128_t acarry_sum;
          mcarry_prod = ((__uint128_t) multiplicand[i])
                        * ((__uint128_t) multiplier[j])
                        + (__uint128_t) mcarry;
          mcarry = mcarry_prod >> 64;
          product = mcarry_prod;
          acarry_sum = ((__uint128_t) sum[i+j]) + ((__uint128_t) acarry)
                       + product;
          sum[i+j] += acarry_sum;
          acarry = acarry_sum >> 64;
          //  {mcarry, product} = multiplicand[i]*multiplier[j]
          //                      + mcarry;
          //  {acarry,sum[i+j]} = {sum[i+j]+acarry} + product;
        }
    }
}

still shows some inefficiency after r13-2107.

Compiling the function with gcc 13.0.0 20220818, with

$ gcc  -mcpu=power9 -O3 -c loop.c

and disassembling the output (for easier reading) gives (looking only
at the main part)

  7c:   00 00 80 38 li  r4,0
  80:   00 00 80 3b li  r28,0
  84:   00 00 60 38 li  r3,0
  88:   00 00 00 38 li  r0,0
  8c:   ff ff c0 38 li  r6,-1
  90:   00 00 e0 38 li  r7,0
  94:   20 00 c1 fa std r22,32(r1)
  98:   28 00 e1 fa std r23,40(r1)
  9c:   60 00 c1 fb std r30,96(r1)
  a0:   68 00 e1 fb std r31,104(r1)
  a4:   00 00 00 60 nop
  a8:   00 00 00 60 nop
  ac:   00 00 42 60 ori r2,r2,0
  b0:   a6 03 49 7f mtctr   r26
  b4:   78 c3 0c 7f mr  r12,r24
  b8:   14 22 b9 7c add r5,r25,r4
  bc:   00 00 00 39 li  r8,0
  c0:   09 00 6c e9 ldu r11,8(r12)
  c4:   2a 20 5d 7d ldx r10,r29,r4
  c8:   09 00 25 e9 ldu r9,8(r5)
  cc:   33 52 cb 13 maddld  r30,r11,r10,r8
  d0:   31 52 eb 13 maddhdu r31,r11,r10,r8
  d4:   38 30 d6 7f and r22,r30,r6
  d8:   38 38 f7 7f and r23,r31,r7
  dc:   78 fb e8 7f mr  r8,r31
  e0:   14 48 56 7d addc    r10,r22,r9
  e4:   14 01 77 7d adde    r11,r23,r0
  e8:   14 18 4a 7d addc    r10,r10,r3
  ec:   14 52 29 7d add r9,r9,r10
  f0:   94 01 6b 7c addze   r3,r11
  f4:   00 00 25 f9 std r9,0(r5)
  f8:   c8 ff 00 42 bdnz    c0
  fc:   01 00 9c 3b addi    r28,r28,1
 100:   08 00 84 38 addi    r4,r4,8
 104:   40 e0 3b 7c cmpld   r27,r28
 108:   a8 ff 82 40 bne b0 

In these two nested loops, r6 is not changed, so it is always -1.

  d4:   38 30 d6 7f and r22,r30,r6

just assigns r30 to r22, so r30 could have been used instead of
r22.

Similarly,

  d8:   38 38 f7 7f and r23,r31,r7

just sets r23 to zero because r7 is always zero.
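
One plausible reading, which is an assumption and not something stated in
the report, is that the two "and" instructions are what remains of
zero-extending the 64-bit product back to 128 bits: a 128-bit AND with a
mask whose low half is all ones and whose high half is zero, applied to
the register pair r31:r30. The sketch below models that with a
hypothetical pair type and function name.

```c
#include <assert.h>
#include <stdint.h>

/* A 128-bit value held as two 64-bit halves, like the pair r31:r30.  */
struct u128_pair { uint64_t hi, lo; };

/* Truncate V to 64 bits and widen it again, written as a 128-bit AND
   with the mask 0x0000000000000000ffffffffffffffff, one half at a time.  */
static struct u128_pair
truncate_and_widen (struct u128_pair v)
{
  struct u128_pair r;
  r.lo = v.lo & UINT64_MAX;   /* and r22,r30,r6 with r6 == -1: just v.lo */
  r.hi = v.hi & 0;            /* and r23,r31,r7 with r7 == 0: always 0 */
  return r;
}
```

Both operations fold to trivial results (a copy and a constant zero),
which is why they look redundant in the listing.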
