[wwwdocs] gcc-14/changes.html (AMD GCN): Mention gfx90c support

2024-04-26 Thread Andrew Stubbs
I will push this shortly. I think the gfx90c patch just made the cut for
the GCC-14 branch!

Andrew

---
 htdocs/gcc-14/changes.html | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html
index fce0fb44..47fef32d 100644
--- a/htdocs/gcc-14/changes.html
+++ b/htdocs/gcc-14/changes.html
@@ -726,10 +726,10 @@ a work-in-progress.
 AMD Radeon (GCN)
 
 
-  Initial support for the AMD Radeon gfx1030,
-gfx1036 (RDNA2), gfx1100 and
-gfx1103 (RDNA3) devices has been
-added. LLVM 15+ (assembler and linker) is
+  Initial support for the AMD Radeon gfx90c (GCN5),
+gfx1030, gfx1036 (RDNA2), gfx1100
+and gfx1103 (RDNA3) devices has been added. LLVM 15+
+(assembler and linker) is
 <a href="https://gcc.gnu.org/install/specific.html#amdgcn-x-amdhsa">required</a>
 to support GFX11.
   Improved register usage and performance on CDNA Instinct MI100
-- 
2.41.0




Re: [PATCH] amdgcn: Add gfx90c target

2024-04-26 Thread Andrew Stubbs

On 25/04/2024 19:37, Frederik Harwath wrote:

Hi Andrew,
this patch adds support for gfx90c GCN5 APU integrated graphics devices.
The LLVM AMDGPU documentation (https://llvm.org/docs/AMDGPUUsage.html)
lists those devices as unsupported by rocm-amdhsa.
As we have discussed elsewhere, I have tested the patch on an AMD Ryzen
5 5500U that I have (also with different xnack settings), and it passes
most libgomp offloading tests.
Although those APUs are very constrained compared to dGPUs, I think
they might be interesting for learning, experimentation, and testing.


Can I commit the patch to the master branch?


OK, please go ahead.

Thanks for expanding our device support even further! :)

Andrew


Re: [patch] [gcn][nvptx] Add warning to mkoffload for 32bit host code

2024-04-25 Thread Andrew Stubbs

On 25/04/2024 11:51, Tobias Burnus wrote:

Motivated by a colleague's surprise that with -m32 no offload dumps
were created; that's because mkoffload does not process host binaries
when they are 32-bit (i.e. ilp32).

Internally, that's done as follows: the host compiler passes the used
host ABI to 'mkoffload', i.e. -foffload-abi=ilp32 or -foffload-abi=lp64.

That's done via TARGET_OFFLOAD_OPTIONS, which is supported by aarch64, i386, 
and rs6000.

While it is sensible (albeit not strictly required) that GCC requires that
the host and device sides agree, and that only 64-bit is implemented for the
device side, it can be confusing that silently no offloading code is generated.


Hence, I propose to print a warning in that case - as implemented in the 
attached patch:

$ gcc -fopenmp -m32 test.c
nvptx mkoffload: warning: offload code generation skipped: offloading with 
32-bit host code is currently not supported
gcn mkoffload: warning: offload code generation skipped: offloading with 32-bit 
host code is currently not supported
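
For illustration only, here is a standalone sketch of the mechanism
described above: the host compiler tells mkoffload the host ABI via
-foffload-abi=, and the proposal is to warn and skip code generation
unless it is lp64.  The enum mirrors the one mkoffload already uses;
the parsing helper and the exit path are assumptions for this example,
not the actual patch.

/* Standalone sketch (not the real mkoffload.cc): warn and skip offload
   code generation unless the host ABI is lp64.  */
#include <stdio.h>
#include <string.h>

enum offload_abi { OFFLOAD_ABI_UNSET, OFFLOAD_ABI_LP64, OFFLOAD_ABI_ILP32 };

static enum offload_abi
parse_offload_abi (const char *arg)
{
  if (strcmp (arg, "-foffload-abi=lp64") == 0)
    return OFFLOAD_ABI_LP64;
  if (strcmp (arg, "-foffload-abi=ilp32") == 0)
    return OFFLOAD_ABI_ILP32;
  return OFFLOAD_ABI_UNSET;
}

int
main (void)
{
  enum offload_abi abi = parse_offload_abi ("-foffload-abi=ilp32");

  if (abi != OFFLOAD_ABI_LP64)
    {
      fprintf (stderr, "gcn mkoffload: warning: offload code generation "
               "skipped: offloading with 32-bit host code is currently "
               "not supported\n");
      return 0;  /* Exit without producing an offload image.  */
    }

  /* ... normal offload code generation would continue here ...  */
  return 0;
}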

* * *

This shouldn't have any effect on offload builds using -m64
and non-offload builds – while several testcases already have
issues with '-m32' when offloading is enabled or an offloading
device is available.

To make it not worse, this patch adds some pruning, and for
a subset of the failing testcases I added code to avoid FAILs.
There are some more FAILs, but those aren't new.

Comments, remarks, suggestions?
Is the mkoffload.cc part okay?


The mkoffload part looks reasonable to me. I'm not sure if there are 
other ABIs we might want to warn about, but this is definitely an 
improvement.


Andrew


Re: GCN: Enable effective-target 'vect_long_long'

2024-04-17 Thread Andrew Stubbs

On 16/04/2024 20:01, Thomas Schwinge wrote:

Hi!

OK to push the attached "GCN: Enable effective-target 'vect_long_long'"?
(Or is that not what you'd expect to see for GCN?  I haven't checked the
actual back end code...)


I think if there are still missing int64 vector operations then they're 
exceptions, not the rule.


The patch looks good to me.

Andrew



Re: [wwwdocs] gcc-14/changes.html (AMD GCN): Mention gfx1036 support

2024-04-15 Thread Andrew Stubbs

On 15/04/2024 13:00, Richard Biener wrote:

On Mon, Apr 15, 2024 at 12:04 PM Tobias Burnus  wrote:


I experimented with some variants to make it clearer that each of RDNA2 and
RDNA3 applies to two card types, but in the end I settled on the
fewest-word version.

Comments, remarks, suggestions? (To this change or in general?)

Current version: https://gcc.gnu.org/gcc-14/changes.html#amdgcn

Compiler flags, listing the gfx* cards:
https://gcc.gnu.org/onlinedocs/gcc/AMD-GCN-Options.html

Tobias

PS: On the compiler side, I am looking forward to a .def file which
reduces the number of files to change when adding a new gfx* card, given
that we have doubled the number of entries. [Well, 1 missing but I know
of one WIP addition.]


I do wonder whether hot-patching the ELF header from the libgomp plugin
with the actual micro-subarch would be possible to make the driver happy.
We do query the device ISA when initializing the device so we should
be able to massage the ELF header of the object in GOMP_OFFLOAD_load_image
at least within some constraints (ideally we'd mark the ELF object as to
be matched with a device in some group).
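
A minimal, purely hypothetical sketch of what such e_flags massaging
could look like (the 0xff mask follows LLVM's EF_AMDGPU_MACH field; how
the plugin obtains the image and the device's mach code is assumed):

/* Hypothetical sketch only: patch the EF_AMDGPU_MACH field of an
   in-memory ELF object so it matches the device that was actually
   found.  Copying the header out and back avoids making alignment
   assumptions about IMAGE.  */
#include <elf.h>
#include <string.h>

void
patch_amdgpu_mach (void *image, unsigned int device_mach)
{
  Elf64_Ehdr ehdr;

  memcpy (&ehdr, image, sizeof ehdr);
  ehdr.e_flags = (ehdr.e_flags & ~0xffu) | (device_mach & 0xffu);
  memcpy (image, &ehdr, sizeof ehdr);
}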


This might work in some limited cases, especially if you limit the 
codegen to some subset of the ISA, but in general the metadata on the 
kernel entry-points is device-specific. For example, the gfx908 and 
gfx90a have different granularity on the VGPR count settings. It would 
probably be possible to generate some matching sets.


However, there's probably no need to do that ourselves because the LLVM 
tools now have new generic ELF flags "gfx9-generic", "gfx10-1-generic", 
"gfx10-3-generic", and "gfx11-generic" which supposedly do what you 
want.  I've not experimented with them. I don't know if libraries can 
have the generic variant and still link with the specific variant (the 
only libraries with kernel entry-points are the libgcc init_array and 
fini_array). If not it becomes yet another multilib.


I'm very sure there's no one binary that will run anywhere for real 
usecases.


Andrew


Re: [wwwdocs] gcc-14/changes.html (AMD GCN): Mention gfx1036 support

2024-04-15 Thread Andrew Stubbs

On 15/04/2024 11:03, Tobias Burnus wrote:
I experimented with some variants to make it clearer that each of RDNA2 and 
RDNA3 applies to two card types, but in the end I settled on the 
fewest-word version.


Comments, remarks, suggestions? (To this change or in general?)

Current version: https://gcc.gnu.org/gcc-14/changes.html#amdgcn

Compiler flags, listing the gfx* cards: 
https://gcc.gnu.org/onlinedocs/gcc/AMD-GCN-Options.html


Tobias

PS: On the compiler side, I am looking forward to a .def file which 
reduces the number of files to change when adding a new gfx* card, given 
that we have doubled the number of entries. [Well, 1 missing but I know 
of one WIP addition.]


LGTM

Andrew


Re: GCN: '--param=gcn-preferred-vector-lane-width=[default,32,64]'

2024-04-08 Thread Andrew Stubbs

On 08/04/2024 11:45, Thomas Schwinge wrote:

Hi!

On 2024-03-28T08:00:50+0100, I wrote:

On 2024-03-22T15:54:48+, Andrew Stubbs  wrote:

This patch alters the default (preferred) vector size to 32 on RDNA devices to
better match the actual hardware.  64-lane vectors will continue to be
used where they are hard-coded (such as function prologues).

We run these devices in wavefrontsize64 for compatibility, but they actually
only have 32-lane vectors, natively.  If the upper part of a V64 is masked
off (as it is in V32) then RDNA devices will skip execution of the upper part
for most operations, so this adjustment shouldn't leave too much performance on
the table.  One exception is memory instructions, so full wavefrontsize32
support would be better.

The advantage is that we avoid the missing V64 operations (such as permute and
vec_extract).

Committed to mainline.


In my GCN target '-march=gfx1100' testing, this commit
"amdgcn: Prefer V32 on RDNA devices" does resolve (or, make latent?) a
number of execution test FAILs (that is, regressions compared to earlier
'-march=gfx90a' etc. testing).

This commit also resolves (for my '-march=gfx1100' testing) one
pre-existing FAIL (that is, already seen in '-march=gfx90a' earlier
etc. testing):

 PASS: gcc.dg/tree-ssa/scev-14.c (test for excess errors)
 [-FAIL:-]{+PASS:+} gcc.dg/tree-ssa/scev-14.c scan-tree-dump ivopts 
"Overflowness wrto loop niter:\tNo-overflow"

That means, this test case specifically (or, just its 'scan-tree-dump'?)
needs to be adjusted for GCN V64 testing?

This commit, as you'd also mentioned elsewhere, however also causes a
number of regressions in 'gcc.target/gcn/gcn.exp', see list below.

Those can be "fixed" with 'dg-additional-options -march=gfx90a' (or
similar) in the affected test cases (let me know if you'd like me to
'git push' that), but I suppose something more elaborate may be in order?
(Conditionalize those on 'target { ! gcn_rdna }', and add respective
scanning for 'target gcn_rdna'?  I can help with effective-target
'gcn_rdna' (or similar), if you'd like me to.)

And/or, have a '-mpreferred-simd-mode=v64' (or similar) to be used for
such test cases, to override 'if (TARGET_RDNA2_PLUS)' etc. in
'gcn_vectorize_preferred_simd_mode'?
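
For the affected gcc.target/gcn tests, either suggestion amounts to a
one-line directive; for example (the --param name is the one proposed
in the attached patch, and a rename is requested in the reply below):

/* Either pin the old code generation so the existing scans keep
   matching ...  */
/* { dg-additional-options "-march=gfx90a" } */

/* ... or request 64-lane vectors explicitly via the new option.  */
/* { dg-additional-options "--param=gcn-preferred-vector-lane-width=64" } */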


The latter I have quickly implemented, see attached
"GCN: '--param=gcn-preferred-vector-lane-width=[default,32,64]'".  OK to
push to trunk branch?

(This '--param' will also be useful for another bug/regression I'm about
to file.)


Best, probably, both these things, to properly test both V32 and V64?


That part remains to be done, but is best done by someone who actually
knows "GCN" assembly/GCC back end -- that is, not me.


I'm not sure that this is the *best* solution to the problem (in general, 
it's probably best to test the actual code that will be generated in 
practice), but I think this option will be useful for testing 
performance in each configuration and other correctness issues, and 
these tests are not testing that feature.


However, "vector lane width" sounds like it's configuring the number of 
bits in each lane. I think "vectorization factor" is unambigous.


OK to commit, with the name change.

Andrew




Grüße
  Thomas



 PASS: gcc.target/gcn/cond_fmaxnm_1.c (test for excess errors)
 [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_1.c scan-assembler-not 
\\tv_writelane_b32\\tv[0-9]+, vcc_..
 [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_1.c scan-assembler-times 
smaxv64df3_exec 3
 [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_1.c scan-assembler-times 
smaxv64sf3_exec 3
 PASS: gcc.target/gcn/cond_fmaxnm_1_run.c (test for excess errors)
 PASS: gcc.target/gcn/cond_fmaxnm_1_run.c execution test

 PASS: gcc.target/gcn/cond_fmaxnm_2.c (test for excess errors)
 [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_2.c scan-assembler-not 
\\tv_writelane_b32\\tv[0-9]+, vcc_..
 [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_2.c scan-assembler-times 
smaxv64df3_exec 3
 [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_2.c scan-assembler-times 
smaxv64sf3_exec 3
 PASS: gcc.target/gcn/cond_fmaxnm_2_run.c (test for excess errors)
 PASS: gcc.target/gcn/cond_fmaxnm_2_run.c execution test

 PASS: gcc.target/gcn/cond_fmaxnm_3.c (test for excess errors)
 [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_3.c scan-assembler-not 
\\tv_writelane_b32\\tv[0-9]+, vcc_..
 [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_3.c scan-assembler-times 
movv64df_exec 3
 [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_3.c scan-assembler-times 
movv64sf_exec 3
 [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_3.c scan-assembler-times 
smaxv64sf3 3
 [-PASS:-]{+FAIL:+} gcc.target/gcn/cond_fmaxnm_3.c scan-assembler-times 
smaxv64sf3 3
 PASS: gcc.target/gcn/cond_fmaxnm_3_run.c (test for excess errors)
 PASS: gcc.target/gcn/cond_fmaxnm_3_run.c execution test

 PASS: gcc.ta

Re: [Patch] GCN: install.texi update for Newlib change and LLVM 18 release

2024-04-03 Thread Andrew Stubbs

On 03/04/2024 10:27, Jakub Jelinek wrote:

On Wed, Apr 03, 2024 at 11:09:19AM +0200, Tobias Burnus wrote:

@@ -3954,8 +3956,8 @@ on the GPU.
  To enable support for GCN3 Fiji devices (gfx803), GCC has to be configured 
with
  @option{--with-arch=@code{fiji}} or
  @option{--with-multilib-list=@code{fiji},...}.  Note that support for Fiji
-devices has been removed in ROCm 4.0 and support in LLVM is deprecated and will
-be removed in LLVM 18.
+devices has been removed in ROCm 4.0 and support in LLVM is deprecated and has
+been removed in LLVM 18.


Shouldn't we at configure time then detect the case where fiji can't be
supported and either error if it is included explicitly in multilib list, or
implicitly take it out from that list and arrange error to be emitted when
using -march=fiji/gfx803 ?
Sure, if one configures against LLVM 17 and then updates LLVM to 18, it will
still result in weird errors/LLVM ICEs, but at least in the common case when
one configures GCC 14 against LLVM 18 one won't suffer from those ICEs and
get clear diagnostics that fiji is sadly no longer supported.


One additional point: since our departure from Siemens, we no longer 
have access to any Fiji devices ourselves. I plan to rip that stuff out 
the first chance I get (not necessarily very soon).


In the meantime, Fiji is not included in the default configuration of 
GCC 14, so anyone enabling it is doing so explicitly and a) will have 
read the documentation, and b) would be surprised if Fiji were 
automatically excluded.


We could emit an error at configure time if an unsuitable LLVM is 
detected, but I don't think it's worth the effort for what is a niche 
product that requires drivers so old they were only supported on now-EOL 
OS versions.


I'm happy with Tobias's patch with s/LLVM is deprecated/LLVM was 
deprecated/. The Newlib versions are a bit awkward, but we can't 
recommend 4.5 until it exists.


Andrew


Re: [Patch] GCN: Fix --with-arch= handling in mkoffload [PR111966]

2024-04-03 Thread Andrew Stubbs

On 03/04/2024 10:05, Tobias Burnus wrote:

This patch handles --with-arch= in GCN's mkoffload.cc

While mkoffload mostly does not know this and passes it through to the 
GCN lto1 compiler,
it writes an .o file with debug information - and here the -march= in 
the ELF flags must
agree with the one in the other files. Hence, it now uses the 
--with-arch= config argument.


Doing so, there is now a diagnostic if the -march= or --with-arch= is 
unknown. While the latter should be rejected at GCC build time, the 
former was not diagnosed in mkoffload but only later in GCN's compiler.

But as there is now a fatal_error in mkoffload, which comes before the 
GCN-compiler call, the 'note:' listing which devices are available was 
lost. This has been reinstated by using the multilib settings. (That's 
not identical to the compiler-supported flags, but the output is 
reasonable, arguably better or worse than lto1's.)

Advantage: The output is less cluttered than a later fail.

To make mkoffload errors - and especially this one - more useful, it now 
also initializes

the colorization / bold.

OK for mainline?


OK. Thanks for fixing this.

Andrew



* * *

Example error:

gcn mkoffload: error: unrecognized argument in option '-march=gfx'
gcn mkoffload: note: valid arguments to '-march=' are: gfx906, gfx908, 
gfx90a, gfx1030, gfx1036, gfx1100, gfx1103


where on my TERM=xterm-256color,  'gcn mkoffload:' and the quoted texts 
are in bold,

'error:' is red and 'note:' is cyan.

Compared to cc1, the 'note:' lacks 'fiji', the list is separated by ', '
instead of ' ', and cc1 has a "; did you mean 'gfx1100'?".
And the program name is 'gcn mkoffload' instead of 'cc1'.

Tobias

PS: The generated multilib list could be later changed to be based on 
the gcn-.def file;

or we just keep the multiconfig variant of this patch.


I think a .def file would be more future-proof if we ever have multilibs 
for options other than -march, but this works for now.


Andrew


Re: [PATCH] amdgcn: Add gfx1036 target

2024-03-25 Thread Andrew Stubbs

On 25/03/2024 11:27, Richard Biener wrote:

Add support for the gfx1036 RDNA2 APU integrated graphics devices.  The ROCm
documentation warns that these may not be supported, but it seems to work
at least partially.

x86 host bootstrap/regtest running, target-libgomp testing for the
offload produces results comparable to those of gfx1030.  The nice
thing is that gfx1036 is inside every Zen4 desktop CPU (Ryzen 7xxx)
and testing on that doesn't interfere with a separate GPU used for
your desktop (where I experienced crashes when using the GPU for both
offload and graphics).

I'll note that while gfx1030 works with llvm14 gfx1036 needs llvm15
as minimum version for the assembler.

OK for trunk?


OK.



I'll follow up with the libgomp testing test summary for archival
purposes.  I still see linker errors for testcases using -g
(the ld: ^[[0;31merror: ^[[0mincompatible mach:
/tmp/ccr0oDpD.mkoffload.dbg.o^M kind)


This is caused by the --with-arch=gfx1036 not being picked up by 
mkoffload. It works fine if you use the default configuration or specify 
the -march explicitly. Either way, the bug is not in your patch.


For now, please test like this:

   RUNTESTFLAGS=--target_board=unix/-foffload=-march=gfx1036

Andrew


Thanks,
Richard.

gcc/ChangeLog:

* config.gcc (amdgcn): Add gfx1036 entries.
* config/gcn/gcn-hsa.h (NO_XNACK): Likewise.
(gcn_local_sym_hash): Likewise.
* config/gcn/gcn-opts.h (enum processor_type): Likewise.
(TARGET_GFX1036): New macro.
* config/gcn/gcn.cc (gcn_option_override): Handle gfx1036.
(gcn_omp_device_kind_arch_isa): Likewise.
(output_file_start): Likewise.
* config/gcn/gcn.h (TARGET_CPU_CPP_BUILTINS): Add __gfx1036__.
(TARGET_CPU_CPP_BUILTINS): Rename __gfx1030 to __gfx1030__.
* config/gcn/gcn.opt: Add gfx1036.
* config/gcn/mkoffload.cc (EF_AMDGPU_MACH_AMDGCN_GFX1036): New.
(main): Handle gfx1036.
* config/gcn/t-omp-device: Add gfx1036 isa.
* doc/install.texi (amdgcn): Add gfx1036.
* doc/invoke.texi (-march): Likewise.

libgomp/ChangeLog:

* plugin/plugin-gcn.c (EF_AMDGPU_MACH): GFX1036.
(gcn_gfx1036_s): New.
(isa_hsa_name): Handle gfx1036.
(isa_code): Likewise.
(max_isa_vgprs): Likewise.
---
  gcc/config.gcc  |  4 ++--
  gcc/config/gcn/gcn-hsa.h|  6 +++---
  gcc/config/gcn/gcn-opts.h   |  2 ++
  gcc/config/gcn/gcn.cc   | 10 ++
  gcc/config/gcn/gcn.h|  4 +++-
  gcc/config/gcn/gcn.opt  |  3 +++
  gcc/config/gcn/mkoffload.cc |  5 +
  gcc/config/gcn/t-omp-device |  2 +-
  gcc/doc/install.texi|  3 ++-
  gcc/doc/invoke.texi |  3 +++
  libgomp/plugin/plugin-gcn.c |  8 
  11 files changed, 42 insertions(+), 8 deletions(-)

diff --git a/gcc/config.gcc b/gcc/config.gcc
index 87a5c92b6e3..17873ac2103 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -4560,7 +4560,7 @@ case "${target}" in
for which in arch tune; do
eval "val=\$with_$which"
case ${val} in
-   "" | fiji | gfx900 | gfx906 | gfx908 | gfx90a | gfx1030 
| gfx1100 | gfx1103)
+   "" | fiji | gfx900 | gfx906 | gfx908 | gfx90a | gfx1030 
| gfx1036 | gfx1100 | gfx1103)
# OK
;;
*)
@@ -4576,7 +4576,7 @@ case "${target}" in
TM_MULTILIB_CONFIG=
;;
xdefault | xyes)
-   TM_MULTILIB_CONFIG=`echo 
"gfx900,gfx906,gfx908,gfx90a,gfx1030,gfx1100,gfx1103" | sed 
"s/${with_arch},\?//;s/,$//"`
+   TM_MULTILIB_CONFIG=`echo 
"gfx900,gfx906,gfx908,gfx90a,gfx1030,gfx1036,gfx1100,gfx1103" | sed 
"s/${with_arch},\?//;s/,$//"`
;;
*)
TM_MULTILIB_CONFIG="${with_multilib_list}"
diff --git a/gcc/config/gcn/gcn-hsa.h b/gcc/config/gcn/gcn-hsa.h
index ac32b8a328f..7d6e3141cea 100644
--- a/gcc/config/gcn/gcn-hsa.h
+++ b/gcc/config/gcn/gcn-hsa.h
@@ -90,7 +90,7 @@ extern unsigned int gcn_local_sym_hash (const char *name);
 the ELF flags (e_flags) of that generated file must be identical to those
 generated by the compiler.  */
  
-#define NO_XNACK "march=fiji:;march=gfx1030:;march=gfx1100:;march=gfx1103:;" \

+#define NO_XNACK 
"march=fiji:;march=gfx1030:;march=gfx1036:;march=gfx1100:;march=gfx1103:;" \
  /* These match the defaults set in gcn.cc.  */ \
  
"!mxnack*|mxnack=default:%{march=gfx900|march=gfx906|march=gfx908:-mattr=-xnack};"
  #define NO_SRAM_ECC "!march=*:;march=fiji:;march=gfx900:;march=gfx906:;"
@@ -106,8 +106,8 @@ extern unsigned int gcn_local_sym_hash (const char *name);
  "%{" ABI_VERSION_SPEC "} " \
  "%{" NO_XNACK XNACKOPT "} " \
  "%{" NO_SRAM_ECC SRAMOPT "} " \
- 

Re: GCN: Enable effective-target 'vect_long_mult'

2024-03-25 Thread Andrew Stubbs

On 21/03/2024 10:41, Thomas Schwinge wrote:

Hi!

OK to push the attached "GCN: Enable effective-target 'vect_long_mult'"?
(Or is that not what you'd expect to see for GCN?  I haven't checked the
actual back end code...)


OK.

Andrew



Re: GCN: Enable effective-target 'vect_hw_misalign'

2024-03-25 Thread Andrew Stubbs

On 21/03/2024 10:41, Thomas Schwinge wrote:

Hi!

OK to push the attached
"GCN: Enable effective-target 'vect_hw_misalign'"?  (Or is that not what
you'd expect to see for GCN?  I haven't checked the actual back end
code...)


OK.

Andrew.


[wwwdocs, committed] gcc-14: amdgcn: Add gfx1103

2024-03-22 Thread Andrew Stubbs
I added a note about gfx1103 to the existing text for gfx1100.

Andrew

---
 htdocs/gcc-14/changes.html | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html
index d88fbc96..880b9195 100644
--- a/htdocs/gcc-14/changes.html
+++ b/htdocs/gcc-14/changes.html
@@ -343,11 +343,11 @@ a work-in-progress.
 AMD Radeon (GCN)
 
 
-  Initial support for the AMD Radeon gfx1030 (RDNA2) and
-gfx1100 (RDNA3) devices has been added. LLVM 15+ (assembler
-and linker) is
+  Initial support for the AMD Radeon gfx1030 (RDNA2),
+gfx1100 and gfx1103 (RDNA3) devices has been
+added. LLVM 15+ (assembler and linker) is
 <a href="https://gcc.gnu.org/install/specific.html#amdgcn-x-amdhsa">required</a>
-to support gfx1100.
+to support GFX11.
   Improved register usage and performance on CDNA Instinct MI100
 and MI200 series devices.
   The default device architecture is now gfx900 (Vega).
-- 
2.41.0



[committed] amdgcn: Adjust GFX10/GFX11 cache coherency

2024-03-22 Thread Andrew Stubbs
The RDNA devices have different cache architectures to the CDNA devices, and
the differences go deeper than just the assembler mnemonics, so we
probably need to generate different code to maintain coherency across
the whole device.

I believe this patch is correct according to the documentation in the LLVM
AMDGPU user guide (the ISA manual is less instructive), but I hadn't observed
any real problems before (or after).

Committed to mainline.

Andrew

gcc/ChangeLog:

* config/gcn/gcn.md (*memory_barrier): Split into RDNA and !RDNA.
(atomic_load): Adjust RDNA cache settings.
(atomic_store): Likewise.
(atomic_exchange): Likewise.
---
 gcc/config/gcn/gcn.md | 86 +++
 1 file changed, 55 insertions(+), 31 deletions(-)

diff --git a/gcc/config/gcn/gcn.md b/gcc/config/gcn/gcn.md
index 3b51453aaca..574c2f87e8c 100644
--- a/gcc/config/gcn/gcn.md
+++ b/gcc/config/gcn/gcn.md
@@ -1960,11 +1960,19 @@
 (define_insn "*memory_barrier"
   [(set (match_operand:BLK 0)
(unspec:BLK [(match_dup 0)] UNSPEC_MEMORY_BARRIER))]
-  ""
-  "{buffer_wbinvl1_vol|buffer_gl0_inv}"
+  "!TARGET_RDNA2_PLUS"
+  "buffer_wbinvl1_vol"
   [(set_attr "type" "mubuf")
(set_attr "length" "4")])
 
+(define_insn "*memory_barrier"
+  [(set (match_operand:BLK 0)
+   (unspec:BLK [(match_dup 0)] UNSPEC_MEMORY_BARRIER))]
+  "TARGET_RDNA2_PLUS"
+  "buffer_gl1_inv\;buffer_gl0_inv"
+  [(set_attr "type" "mult")
+   (set_attr "length" "8")])
+
 ; FIXME: These patterns have been disabled as they do not seem to work
 ; reliably - they can cause hangs or incorrect results.
 ; TODO: flush caches according to memory model
@@ -2094,9 +2102,13 @@
  case 0:
return "s_load%o0\t%0, %A1 glc\;s_waitcnt\tlgkmcnt(0)";
  case 1:
-   return "flat_load%o0\t%0, %A1%O1 glc\;s_waitcnt\t0";
+   return (TARGET_RDNA2 /* Not GFX11.  */
+   ? "flat_load%o0\t%0, %A1%O1 glc dlc\;s_waitcnt\t0"
+   : "flat_load%o0\t%0, %A1%O1 glc\;s_waitcnt\t0");
  case 2:
-   return "global_load%o0\t%0, %A1%O1 glc\;s_waitcnt\tvmcnt(0)";
+   return (TARGET_RDNA2 /* Not GFX11.  */
+   ? "global_load%o0\t%0, %A1%O1 glc dlc\;s_waitcnt\tvmcnt(0)"
+   : "global_load%o0\t%0, %A1%O1 glc\;s_waitcnt\tvmcnt(0)");
  }
break;
   case MEMMODEL_CONSUME:
@@ -2108,15 +2120,21 @@
return "s_load%o0\t%0, %A1 glc\;s_waitcnt\tlgkmcnt(0)\;"
   "s_dcache_wb_vol";
  case 1:
-   return (TARGET_RDNA2_PLUS
+   return (TARGET_RDNA2
+   ? "flat_load%o0\t%0, %A1%O1 glc dlc\;s_waitcnt\t0\;"
+ "buffer_gl1_inv\;buffer_gl0_inv"
+   : TARGET_RDNA3
? "flat_load%o0\t%0, %A1%O1 glc\;s_waitcnt\t0\;"
- "buffer_gl0_inv"
+ "buffer_gl1_inv\;buffer_gl0_inv"
: "flat_load%o0\t%0, %A1%O1 glc\;s_waitcnt\t0\;"
  "buffer_wbinvl1_vol");
  case 2:
-   return (TARGET_RDNA2_PLUS
+   return (TARGET_RDNA2
+   ? "global_load%o0\t%0, %A1%O1 glc 
dlc\;s_waitcnt\tvmcnt(0)\;"
+ "buffer_gl1_inv\;buffer_gl0_inv"
+   : TARGET_RDNA3
? "global_load%o0\t%0, %A1%O1 glc\;s_waitcnt\tvmcnt(0)\;"
- "buffer_gl0_inv"
+ "buffer_gl1_inv\;buffer_gl0_inv"
: "global_load%o0\t%0, %A1%O1 glc\;s_waitcnt\tvmcnt(0)\;"
  "buffer_wbinvl1_vol");
  }
@@ -2130,15 +2148,21 @@
return "s_dcache_wb_vol\;s_load%o0\t%0, %A1 glc\;"
   "s_waitcnt\tlgkmcnt(0)\;s_dcache_inv_vol";
  case 1:
-   return (TARGET_RDNA2_PLUS
-   ? "buffer_gl0_inv\;flat_load%o0\t%0, %A1%O1 glc\;"
- "s_waitcnt\t0\;buffer_gl0_inv"
+   return (TARGET_RDNA2
+   ? "buffer_gl1_inv\;buffer_gl0_inv\;flat_load%o0\t%0, %A1%O1 
glc dlc\;"
+ "s_waitcnt\t0\;buffer_gl1_inv\;buffer_gl0_inv"
+   : TARGET_RDNA3
+   ? "buffer_gl1_inv\;buffer_gl0_inv\;flat_load%o0\t%0, %A1%O1 
glc\;"
+ "s_waitcnt\t0\;buffer_gl1_inv\;buffer_gl0_inv"
: "buffer_wbinvl1_vol\;flat_load%o0\t%0, %A1%O1 glc\;"
  "s_waitcnt\t0\;buffer_wbinvl1_vol");
  case 2:
-   return (TARGET_RDNA2_PLUS
-   ? "buffer_gl0_inv\;global_load%o0\t%0, %A1%O1 glc\;"
- "s_waitcnt\tvmcnt(0)\;buffer_gl0_inv"
+   return (TARGET_RDNA2
+   ? "buffer_gl1_inv\;buffer_gl0_inv\;global_load%o0\t%0, 
%A1%O1 glc dlc\;"
+ "s_waitcnt\tvmcnt(0)\;buffer_gl1_inv\;buffer_gl0_inv"
+   : TARGET_RDNA3
+   ? "buffer_gl1_inv\;buffer_gl0_inv\;global_load%o0\t%0, 

[committed] amdgcn: Prefer V32 on RDNA devices

2024-03-22 Thread Andrew Stubbs
This patch alters the default (preferred) vector size to 32 on RDNA devices to
better match the actual hardware.  64-lane vectors will continue to be
used where they are hard-coded (such as function prologues).

We run these devices in wavefrontsize64 for compatibility, but they actually
only have 32-lane vectors, natively.  If the upper part of a V64 is masked
off (as it is in V32) then RDNA devices will skip execution of the upper part
for most operations, so this adjustment shouldn't leave too much performance on
the table.  One exception is memory instructions, so full wavefrontsize32
support would be better.

The advantage is that we avoid the missing V64 operations (such as permute and
vec_extract).

Committed to mainline.

Andrew

gcc/ChangeLog:

* config/gcn/gcn.cc (gcn_vectorize_preferred_simd_mode): Prefer V32 on
RDNA devices.
---
 gcc/config/gcn/gcn.cc | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
index 498146dcde9..efb73af50c4 100644
--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc
@@ -5226,6 +5226,32 @@ gcn_vector_mode_supported_p (machine_mode mode)
 static machine_mode
 gcn_vectorize_preferred_simd_mode (scalar_mode mode)
 {
+  /* RDNA devices have 32-lane vectors with limited support for 64-bit vectors
+ (in particular, permute operations are only available for cases that don't
+ span the 32-lane boundary).
+
+ From the RDNA3 manual: "Hardware may choose to skip either half if the
+ EXEC mask for that half is all zeros...". This means that preferring
+ 32-lanes is a good stop-gap until we have proper wave32 support.  */
+  if (TARGET_RDNA2_PLUS)
+switch (mode)
+  {
+  case E_QImode:
+   return V32QImode;
+  case E_HImode:
+   return V32HImode;
+  case E_SImode:
+   return V32SImode;
+  case E_DImode:
+   return V32DImode;
+  case E_SFmode:
+   return V32SFmode;
+  case E_DFmode:
+   return V32DFmode;
+  default:
+   return word_mode;
+  }
+
   switch (mode)
 {
 case E_QImode:
-- 
2.41.0



[committed] amdgcn: Add gfx1103 target

2024-03-22 Thread Andrew Stubbs
This patch adds support for the gfx1103 RDNA3 APU integrated graphics
devices.  The ROCm documentation warns that these may not be supported,
but it seems to work at least partially.

This device should be considered "Experimental" at this point, although
so far it seems to be at least as functional as gfx1100.

Committed to mainline.

Andrew

gcc/ChangeLog:

* config.gcc (amdgcn): Add gfx1103 entries.
* config/gcn/gcn-hsa.h (NO_XNACK): Likewise.
(gcn_local_sym_hash): Likewise.
* config/gcn/gcn-opts.h (enum processor_type): Likewise.
(TARGET_GFX1103): New macro.
* config/gcn/gcn.cc (gcn_option_override): Handle gfx1103.
(gcn_omp_device_kind_arch_isa): Likewise.
(output_file_start): Likewise.
(gcn_hsa_declare_function_name): Use TARGET_RDNA3, not just gfx1100.
* config/gcn/gcn.h (TARGET_CPU_CPP_BUILTINS): Add __gfx1103__.
* config/gcn/gcn.opt: Add gfx1103.
* config/gcn/mkoffload.cc (EF_AMDGPU_MACH_AMDGCN_GFX1103): New.
(main): Handle gfx1103.
* config/gcn/t-omp-device: Add gfx1103 isa.
* doc/install.texi (amdgcn): Add gfx1103.
* doc/invoke.texi (-march): Likewise.

libgomp/ChangeLog:

* plugin/plugin-gcn.c (EF_AMDGPU_MACH): GFX1103.
(gcn_gfx1103_s): New.
(isa_hsa_name): Handle gfx1103.
(isa_code): Likewise.
(max_isa_vgprs): Likewise.
---
 gcc/config.gcc  |  4 ++--
 gcc/config/gcn/gcn-hsa.h|  6 +++---
 gcc/config/gcn/gcn-opts.h   |  4 +++-
 gcc/config/gcn/gcn.cc   | 14 --
 gcc/config/gcn/gcn.h|  2 ++
 gcc/config/gcn/gcn.opt  |  3 +++
 gcc/config/gcn/mkoffload.cc |  5 +
 gcc/config/gcn/t-omp-device |  2 +-
 gcc/doc/install.texi| 13 +++--
 gcc/doc/invoke.texi |  3 +++
 libgomp/plugin/plugin-gcn.c | 10 +-
 11 files changed, 50 insertions(+), 16 deletions(-)

diff --git a/gcc/config.gcc b/gcc/config.gcc
index 040afabd9ec..87a5c92b6e3 100644
--- a/gcc/config.gcc
+++ b/gcc/config.gcc
@@ -4560,7 +4560,7 @@ case "${target}" in
for which in arch tune; do
eval "val=\$with_$which"
case ${val} in
-   "" | fiji | gfx900 | gfx906 | gfx908 | gfx90a | gfx1030 
| gfx1100)
+   "" | fiji | gfx900 | gfx906 | gfx908 | gfx90a | gfx1030 
| gfx1100 | gfx1103)
# OK
;;
*)
@@ -4576,7 +4576,7 @@ case "${target}" in
TM_MULTILIB_CONFIG=
;;
xdefault | xyes)
-   TM_MULTILIB_CONFIG=`echo 
"gfx900,gfx906,gfx908,gfx90a,gfx1030,gfx1100" | sed 
"s/${with_arch},\?//;s/,$//"`
+   TM_MULTILIB_CONFIG=`echo 
"gfx900,gfx906,gfx908,gfx90a,gfx1030,gfx1100,gfx1103" | sed 
"s/${with_arch},\?//;s/,$//"`
;;
*)
TM_MULTILIB_CONFIG="${with_multilib_list}"
diff --git a/gcc/config/gcn/gcn-hsa.h b/gcc/config/gcn/gcn-hsa.h
index c75256dbac3..ac32b8a328f 100644
--- a/gcc/config/gcn/gcn-hsa.h
+++ b/gcc/config/gcn/gcn-hsa.h
@@ -90,7 +90,7 @@ extern unsigned int gcn_local_sym_hash (const char *name);
the ELF flags (e_flags) of that generated file must be identical to those
generated by the compiler.  */
 
-#define NO_XNACK "march=fiji:;march=gfx1030:;march=gfx1100:;" \
+#define NO_XNACK "march=fiji:;march=gfx1030:;march=gfx1100:;march=gfx1103:;" \
 /* These match the defaults set in gcn.cc.  */ \
 
"!mxnack*|mxnack=default:%{march=gfx900|march=gfx906|march=gfx908:-mattr=-xnack};"
 #define NO_SRAM_ECC "!march=*:;march=fiji:;march=gfx900:;march=gfx906:;"
@@ -106,8 +106,8 @@ extern unsigned int gcn_local_sym_hash (const char *name);
  "%{" ABI_VERSION_SPEC "} " \
  "%{" NO_XNACK XNACKOPT "} " \
  "%{" NO_SRAM_ECC SRAMOPT "} " \
- "%{march=gfx1030|march=gfx1100:-mattr=+wavefrontsize64} " \
- "%{march=gfx1030|march=gfx1100:-mattr=+cumode} " \
+ 
"%{march=gfx1030|march=gfx1100|march=gfx1103:-mattr=+wavefrontsize64} " \
+ "%{march=gfx1030|march=gfx1100|march=gfx1103:-mattr=+cumode} 
" \
  "-filetype=obj"
 #define LINK_SPEC "--pie --export-dynamic"
 #define LIB_SPEC  "-lc"
diff --git a/gcc/config/gcn/gcn-opts.h b/gcc/config/gcn/gcn-opts.h
index 6be2c9204fa..285746f7f4d 100644
--- a/gcc/config/gcn/gcn-opts.h
+++ b/gcc/config/gcn/gcn-opts.h
@@ -26,7 +26,8 @@ enum processor_type
   PROCESSOR_GFX908,
   PROCESSOR_GFX90a,
   PROCESSOR_GFX1030,
-  PROCESSOR_GFX1100
+  PROCESSOR_GFX1100,
+  PROCESSOR_GFX1103
 };
 
 #define TARGET_FIJI (gcn_arch == PROCESSOR_FIJI)
@@ -36,6 +37,7 @@ enum processor_type
 #define TARGET_GFX90a (gcn_arch == PROCESSOR_GFX90a)
 #define TARGET_GFX1030 (gcn_arch == PROCESSOR_GFX1030)
 #define TARGET_GFX1100 (gcn_arch 

Re: [PATCH] vect: more oversized bitmask fixups

2024-03-22 Thread Andrew Stubbs

On 22/03/2024 08:43, Richard Biener wrote:


  I'll note that we don't pass 'val' there and
'val' is unfortunately
not documented - what's it supposed to be?  I think I placed the original fix in
do_compare_and_jump because we have the full info available there.  So
what's the
do_compare_rtx_and_jump caller that needs fixing as well?  (IMHO keying on 'val'
looks fragile)


"val" is the tree expression from which the rtx op0 was expanded. It's
optional, but it's used in emit_cmp_and_jump_insns to determine whether
the target supports tbranch (according to a comment).

I think it would be safe to remove your code as that path does pass
"treeop0" to "val".

WDYT?


Looks like a bit of a mess, but yes, I think that sounds good.


Thanks, here's what I pushed.

Andrew
vect: more oversized bitmask fixups

These patches fix up a failure in testcase vect/tsvc/vect-tsvc-s278.c when
configured to use V32 instead of V64 (I plan to do this for RDNA devices).

The problem was that a "not" operation on the mask inadvertently enabled
inactive lanes 31-63 and corrupted the output.  The fix is to adjust the mask
when calling internal functions (in this case COND_MINUS), when doing masked
loads and stores, and when doing conditional jumps (some cases were already
handled).
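
As a standalone illustration of the failure mode (not GCC code, just the
arithmetic): a 32-lane boolean vector mask held in a 64-bit scalar gains
32 spurious lanes under a plain bitwise "not", and ANDing with
(1 << nunits) - 1 afterwards restores a valid mask.

/* Minimal demo, assuming a 32-lane mask stored in a 64-bit scalar.  */
#include <inttypes.h>
#include <stdio.h>

int
main (void)
{
  const unsigned nunits = 32;                    /* active vector lanes */
  const uint64_t valid = (UINT64_C (1) << nunits) - 1;

  uint64_t mask = UINT64_C (0x00000000f0f0f0f0); /* lanes 0-31 only */
  uint64_t inverted = ~mask;                     /* lanes 32-63 now set!  */
  uint64_t fixed = ~mask & valid;                /* excess bits cleared */

  printf ("inverted = %016" PRIx64 "\n", inverted);
  printf ("fixed    = %016" PRIx64 "\n", fixed);
  return 0;
}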

gcc/ChangeLog:

	* dojump.cc (do_compare_rtx_and_jump): Clear excess bits in vector
	bitmasks.
	(do_compare_and_jump): Remove now-redundant similar code.
	* internal-fn.cc (expand_fn_using_insn): Clear excess bits in vector
	bitmasks.
	(add_mask_and_len_args): Likewise.

diff --git a/gcc/dojump.cc b/gcc/dojump.cc
index 88600cb42d3..5f74b696b41 100644
--- a/gcc/dojump.cc
+++ b/gcc/dojump.cc
@@ -1235,6 +1235,24 @@ do_compare_rtx_and_jump (rtx op0, rtx op1, enum rtx_code code, int unsignedp,
 	}
 	}
 
+  /* For boolean vectors with less than mode precision
+	 make sure to fill padding with consistent values.  */
+  if (val
+	  && VECTOR_BOOLEAN_TYPE_P (TREE_TYPE (val))
+	  && SCALAR_INT_MODE_P (mode))
+	{
+	  auto nunits = TYPE_VECTOR_SUBPARTS (TREE_TYPE (val)).to_constant ();
+	  if (maybe_ne (GET_MODE_PRECISION (mode), nunits))
+	{
+	  op0 = expand_binop (mode, and_optab, op0,
+  GEN_INT ((HOST_WIDE_INT_1U << nunits) - 1),
+  NULL_RTX, true, OPTAB_WIDEN);
+	  op1 = expand_binop (mode, and_optab, op1,
+  GEN_INT ((HOST_WIDE_INT_1U << nunits) - 1),
+  NULL_RTX, true, OPTAB_WIDEN);
+	}
+	}
+
   emit_cmp_and_jump_insns (op0, op1, code, size, mode, unsignedp, val,
 			   if_true_label, prob);
 }
@@ -1266,7 +1284,6 @@ do_compare_and_jump (tree treeop0, tree treeop1, enum rtx_code signed_code,
   machine_mode mode;
   int unsignedp;
   enum rtx_code code;
-  unsigned HOST_WIDE_INT nunits;
 
   /* Don't crash if the comparison was erroneous.  */
   op0 = expand_normal (treeop0);
@@ -1309,21 +1326,6 @@ do_compare_and_jump (tree treeop0, tree treeop1, enum rtx_code signed_code,
   emit_insn (targetm.gen_canonicalize_funcptr_for_compare (new_op1, op1));
   op1 = new_op1;
 }
-  /* For boolean vectors with less than mode precision
- make sure to fill padding with consistent values.  */
-  else if (VECTOR_BOOLEAN_TYPE_P (type)
-	   && SCALAR_INT_MODE_P (mode)
-	   && TYPE_VECTOR_SUBPARTS (type).is_constant ()
-	   && maybe_ne (GET_MODE_PRECISION (mode), nunits))
-{
-  gcc_assert (code == EQ || code == NE);
-  op0 = expand_binop (mode, and_optab, op0,
-			  GEN_INT ((HOST_WIDE_INT_1U << nunits) - 1), NULL_RTX,
-			  true, OPTAB_WIDEN);
-  op1 = expand_binop (mode, and_optab, op1,
-			  GEN_INT ((HOST_WIDE_INT_1U << nunits) - 1), NULL_RTX,
-			  true, OPTAB_WIDEN);
-}
 
   do_compare_rtx_and_jump (op0, op1, code, unsignedp, treeop0, mode,
 			   ((mode == BLKmode)
diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index fcf47c7fa12..5269f0ac528 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -245,6 +245,18 @@ expand_fn_using_insn (gcall *stmt, insn_code icode, unsigned int noutputs,
 	   && SSA_NAME_IS_DEFAULT_DEF (rhs)
 	   && VAR_P (SSA_NAME_VAR (rhs)))
 	create_undefined_input_operand ([opno], TYPE_MODE (rhs_type));
+  else if (VECTOR_BOOLEAN_TYPE_P (rhs_type)
+	   && SCALAR_INT_MODE_P (TYPE_MODE (rhs_type))
+	   && maybe_ne (GET_MODE_PRECISION (TYPE_MODE (rhs_type)),
+			TYPE_VECTOR_SUBPARTS (rhs_type).to_constant ()))
+	{
+	  /* Ensure that the vector bitmasks do not have excess bits.  */
+	  int nunits = TYPE_VECTOR_SUBPARTS (rhs_type).to_constant ();
+	  rtx tmp = expand_binop (TYPE_MODE (rhs_type), and_optab, rhs_rtx,
+  GEN_INT ((HOST_WIDE_INT_1U << nunits) - 1),
+  NULL_RTX, true, OPTAB_WIDEN);
+	  create_input_operand ([opno], tmp, TYPE_MODE (rhs_type));
+	}
   else
 	create_input_operand ([opno], rhs_rtx, TYPE_MODE (rhs_type));
   opno += 1;
@@ -312,6 +324,20 @@ add_mask_and_len_args (expand_operand *ops, unsigned int opno, gcall *stmt)
 {
   tree mask = gimple_call_arg (stmt, mask_index);
   rtx 

Re: [committed] amdgcn: Ensure gfx11 is running in cumode

2024-03-22 Thread Andrew Stubbs

On 22/03/2024 11:56, Thomas Schwinge wrote:

Hi Andrew!

On 2024-03-21T13:39:53+, Andrew Stubbs  wrote:

CUmode "on" is the setting for compatibility with GCN and CDNA devices.



--- a/gcc/config/gcn/gcn-hsa.h
+++ b/gcc/config/gcn/gcn-hsa.h
@@ -107,6 +107,7 @@ extern unsigned int gcn_local_sym_hash (const char *name);
  "%{" NO_XNACK XNACKOPT "} " \
  "%{" NO_SRAM_ECC SRAMOPT "} " \
  "%{march=gfx1030|march=gfx1100:-mattr=+wavefrontsize64} " \
+ "%{march=gfx1030|march=gfx1100:-mattr=+cumode} " \
  "-filetype=obj"


Is this just general housekeeping, or should I be seeing any kind of
change in the GCN target '-march=gfx1100' test results?  (I'm not.)


I'm pretty sure cumode is the default, but defaults can change and now 
we're future-proof. The option doesn't change the ELF flags at all.


The opposite of cumode allows more than 16 wavefronts in a workgroup, 
but they can't physically share a single LDS memory so it would break 
OpenACC broadcasting and reductions, and OpenMP libgomp team metadata. 
Also "cgroup" low-latency memory allocation.


Andrew


Re: [PATCH] vect: more oversized bitmask fixups

2024-03-21 Thread Andrew Stubbs

On 21/03/2024 15:18, Richard Biener wrote:

On Thu, Mar 21, 2024 at 3:23 PM Andrew Stubbs  wrote:


My previous patch to fix this problem with xor was rejected because we
want to fix these issues only at the point of use.  That patch produced
slightly better code, in this example, but this works too

These patches fix up a failure in testcase vect/tsvc/vect-tsvc-s278.c when
configured to use V32 instead of V64 (I plan to do this for RDNA devices).

The problem was that a "not" operation on the mask inadvertently enabled
inactive lanes 31-63 and corrupted the output.  The fix is to adjust the mask
when calling internal functions (in this case COND_MINUS), when doing masked
loads and stores, and when doing conditional jumps.

OK for mainline?

Andrew

gcc/ChangeLog:

 * dojump.cc (do_compare_rtx_and_jump): Clear excess bits in vector
 bitmaps.
 * internal-fn.cc (expand_fn_using_insn): Likewise.
 (add_mask_and_len_args): Likewise.
---
  gcc/dojump.cc  | 16 
  gcc/internal-fn.cc | 26 ++
  2 files changed, 42 insertions(+)

diff --git a/gcc/dojump.cc b/gcc/dojump.cc
index 88600cb42d3..8df86957e83 100644
--- a/gcc/dojump.cc
+++ b/gcc/dojump.cc
@@ -1235,6 +1235,22 @@ do_compare_rtx_and_jump (rtx op0, rtx op1, enum rtx_code 
code, int unsignedp,
 }
 }

+  if (val
+ && VECTOR_BOOLEAN_TYPE_P (TREE_TYPE (val))
+ && SCALAR_INT_MODE_P (mode))
+   {
+ auto nunits = TYPE_VECTOR_SUBPARTS (TREE_TYPE (val)).to_constant ();
+ if (maybe_ne (GET_MODE_PRECISION (mode), nunits))
+   {
+ op0 = expand_binop (mode, and_optab, op0,
+ GEN_INT ((HOST_WIDE_INT_1U << nunits) - 1),
+ NULL_RTX, true, OPTAB_WIDEN);
+ op1 = expand_binop (mode, and_optab, op1,
+ GEN_INT ((HOST_WIDE_INT_1U << nunits) - 1),
+ NULL_RTX, true, OPTAB_WIDEN);
+   }
+   }
+


Can we then remove the same code from do_compare_and_jump before the call to
do_compare_rtx_and_jump?


It's called from do_jump.


 I'll note that we don't pass 'val' there and
'val' is unfortunately
not documented - what's it supposed to be?  I think I placed the original fix in
do_compare_and_jump because we have the full info available there.  So
what's the
do_compare_rtx_and_jump caller that needs fixing as well?  (IMHO keying on 'val'
looks fragile)


"val" is the tree expression from which the rtx op0 was expanded. It's 
optional, but it's used in emit_cmp_and_jump_insns to determine whether 
the target supports tbranch (according to a comment).


I think it would be safe to remove your code as that path does pass 
"treeop0" to "val".


WDYT?


The other hunks below are OK.


Thanks.

Andrew


Thanks,
Richard.


emit_cmp_and_jump_insns (op0, op1, code, size, mode, unsignedp, val,
if_true_label, prob);
  }
diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index fcf47c7fa12..5269f0ac528 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -245,6 +245,18 @@ expand_fn_using_insn (gcall *stmt, insn_code icode, 
unsigned int noutputs,
&& SSA_NAME_IS_DEFAULT_DEF (rhs)
&& VAR_P (SSA_NAME_VAR (rhs)))
 create_undefined_input_operand ([opno], TYPE_MODE (rhs_type));
+  else if (VECTOR_BOOLEAN_TYPE_P (rhs_type)
+  && SCALAR_INT_MODE_P (TYPE_MODE (rhs_type))
+  && maybe_ne (GET_MODE_PRECISION (TYPE_MODE (rhs_type)),
+   TYPE_VECTOR_SUBPARTS (rhs_type).to_constant ()))
+   {
+ /* Ensure that the vector bitmasks do not have excess bits.  */
+ int nunits = TYPE_VECTOR_SUBPARTS (rhs_type).to_constant ();
+ rtx tmp = expand_binop (TYPE_MODE (rhs_type), and_optab, rhs_rtx,
+ GEN_INT ((HOST_WIDE_INT_1U << nunits) - 1),
+ NULL_RTX, true, OPTAB_WIDEN);
+ create_input_operand ([opno], tmp, TYPE_MODE (rhs_type));
+   }
else
 create_input_operand ([opno], rhs_rtx, TYPE_MODE (rhs_type));
opno += 1;
@@ -312,6 +324,20 @@ add_mask_and_len_args (expand_operand *ops, unsigned int 
opno, gcall *stmt)
  {
tree mask = gimple_call_arg (stmt, mask_index);
rtx mask_rtx = expand_normal (mask);
+
+  tree mask_type = TREE_TYPE (mask);
+  if (VECTOR_BOOLEAN_TYPE_P (mask_type)
+ && SCALAR_INT_MODE_P (TYPE_MODE (mask_type))
+ && maybe_ne (GET_MODE_PRECISION (TYPE_MODE (mask_type)),
+  TYPE_VECTOR_SUBPARTS (mask_type).to_constant ()))
+   {
+ /* Ensure that the vector bitmasks do not have excess bits.  */
+ int nunits = TYPE

[PATCH] vect: more oversized bitmask fixups

2024-03-21 Thread Andrew Stubbs
My previous patch to fix this problem with xor was rejected because we
want to fix these issues only at the point of use.  That patch produced
slightly better code, in this example, but this works too

These patches fix up a failure in testcase vect/tsvc/vect-tsvc-s278.c when
configured to use V32 instead of V64 (I plan to do this for RDNA devices).

The problem was that a "not" operation on the mask inadvertently enabled
inactive lanes 31-63 and corrupted the output.  The fix is to adjust the mask
when calling internal functions (in this case COND_MINUS), when doing masked
loads and stores, and when doing conditional jumps.

OK for mainline?

Andrew

gcc/ChangeLog:

* dojump.cc (do_compare_rtx_and_jump): Clear excess bits in vector
bitmaps.
* internal-fn.cc (expand_fn_using_insn): Likewise.
(add_mask_and_len_args): Likewise.
---
 gcc/dojump.cc  | 16 
 gcc/internal-fn.cc | 26 ++
 2 files changed, 42 insertions(+)

diff --git a/gcc/dojump.cc b/gcc/dojump.cc
index 88600cb42d3..8df86957e83 100644
--- a/gcc/dojump.cc
+++ b/gcc/dojump.cc
@@ -1235,6 +1235,22 @@ do_compare_rtx_and_jump (rtx op0, rtx op1, enum rtx_code 
code, int unsignedp,
}
}
 
+  if (val
+ && VECTOR_BOOLEAN_TYPE_P (TREE_TYPE (val))
+ && SCALAR_INT_MODE_P (mode))
+   {
+ auto nunits = TYPE_VECTOR_SUBPARTS (TREE_TYPE (val)).to_constant ();
+ if (maybe_ne (GET_MODE_PRECISION (mode), nunits))
+   {
+ op0 = expand_binop (mode, and_optab, op0,
+ GEN_INT ((HOST_WIDE_INT_1U << nunits) - 1),
+ NULL_RTX, true, OPTAB_WIDEN);
+ op1 = expand_binop (mode, and_optab, op1,
+ GEN_INT ((HOST_WIDE_INT_1U << nunits) - 1),
+ NULL_RTX, true, OPTAB_WIDEN);
+   }
+   }
+
   emit_cmp_and_jump_insns (op0, op1, code, size, mode, unsignedp, val,
   if_true_label, prob);
 }
diff --git a/gcc/internal-fn.cc b/gcc/internal-fn.cc
index fcf47c7fa12..5269f0ac528 100644
--- a/gcc/internal-fn.cc
+++ b/gcc/internal-fn.cc
@@ -245,6 +245,18 @@ expand_fn_using_insn (gcall *stmt, insn_code icode, 
unsigned int noutputs,
   && SSA_NAME_IS_DEFAULT_DEF (rhs)
   && VAR_P (SSA_NAME_VAR (rhs)))
create_undefined_input_operand ([opno], TYPE_MODE (rhs_type));
+  else if (VECTOR_BOOLEAN_TYPE_P (rhs_type)
+  && SCALAR_INT_MODE_P (TYPE_MODE (rhs_type))
+  && maybe_ne (GET_MODE_PRECISION (TYPE_MODE (rhs_type)),
+   TYPE_VECTOR_SUBPARTS (rhs_type).to_constant ()))
+   {
+ /* Ensure that the vector bitmasks do not have excess bits.  */
+ int nunits = TYPE_VECTOR_SUBPARTS (rhs_type).to_constant ();
+ rtx tmp = expand_binop (TYPE_MODE (rhs_type), and_optab, rhs_rtx,
+ GEN_INT ((HOST_WIDE_INT_1U << nunits) - 1),
+ NULL_RTX, true, OPTAB_WIDEN);
+ create_input_operand ([opno], tmp, TYPE_MODE (rhs_type));
+   }
   else
create_input_operand ([opno], rhs_rtx, TYPE_MODE (rhs_type));
   opno += 1;
@@ -312,6 +324,20 @@ add_mask_and_len_args (expand_operand *ops, unsigned int 
opno, gcall *stmt)
 {
   tree mask = gimple_call_arg (stmt, mask_index);
   rtx mask_rtx = expand_normal (mask);
+
+  tree mask_type = TREE_TYPE (mask);
+  if (VECTOR_BOOLEAN_TYPE_P (mask_type)
+ && SCALAR_INT_MODE_P (TYPE_MODE (mask_type))
+ && maybe_ne (GET_MODE_PRECISION (TYPE_MODE (mask_type)),
+  TYPE_VECTOR_SUBPARTS (mask_type).to_constant ()))
+   {
+ /* Ensure that the vector bitmasks do not have excess bits.  */
+ int nunits = TYPE_VECTOR_SUBPARTS (mask_type).to_constant ();
+ mask_rtx = expand_binop (TYPE_MODE (mask_type), and_optab, mask_rtx,
+  GEN_INT ((HOST_WIDE_INT_1U << nunits) - 1),
+  NULL_RTX, true, OPTAB_WIDEN);
+   }
+
   create_input_operand ([opno++], mask_rtx,
TYPE_MODE (TREE_TYPE (mask)));
 }
-- 
2.41.0



[committed] amdgcn: Ensure gfx11 is running in cumode

2024-03-21 Thread Andrew Stubbs
CUmode "on" is the setting for compatibility with GCN and CDNA devices.

Committed to mainline.

gcc/ChangeLog:

* config/gcn/gcn-hsa.h (ASM_SPEC): Pass -mattr=+cumode.
---
 gcc/config/gcn/gcn-hsa.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/gcc/config/gcn/gcn-hsa.h b/gcc/config/gcn/gcn-hsa.h
index 9cf181f52a4..c75256dbac3 100644
--- a/gcc/config/gcn/gcn-hsa.h
+++ b/gcc/config/gcn/gcn-hsa.h
@@ -107,6 +107,7 @@ extern unsigned int gcn_local_sym_hash (const char *name);
  "%{" NO_XNACK XNACKOPT "} " \
  "%{" NO_SRAM_ECC SRAMOPT "} " \
  "%{march=gfx1030|march=gfx1100:-mattr=+wavefrontsize64} " \
+ "%{march=gfx1030|march=gfx1100:-mattr=+cumode} " \
  "-filetype=obj"
 #define LINK_SPEC "--pie --export-dynamic"
 #define LIB_SPEC  "-lc"
-- 
2.41.0



[committed] amdgcn: Comment correction

2024-03-21 Thread Andrew Stubbs
The location of the marker was changed, but the comment wasn't updated.
Fixed now.

Committed to mainline

gcc/ChangeLog:

* config/gcn/gcn.cc (gcn_expand_builtin_1): Comment correction.
---
 gcc/config/gcn/gcn.cc | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
index bc076d1120d..fca001811e5 100644
--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc
@@ -4932,8 +4932,8 @@ gcn_expand_builtin_1 (tree exp, rtx target, rtx 
/*subtarget */ ,
   }
 case GCN_BUILTIN_FIRST_CALL_THIS_THREAD_P:
   {
-   /* Stash a marker in the unused upper 16 bits of s[0:1] to indicate
-  whether it was the first call.  */
+   /* Stash a marker in the unused upper 16 bits of QUEUE_PTR_ARG to
+  indicate whether it was the first call.  */
rtx result = gen_reg_rtx (BImode);
emit_move_insn (result, const0_rtx);
if (cfun->machine->args.reg[QUEUE_PTR_ARG] >= 0)
-- 
2.41.0



[committed] amdgcn: Clean up device memory in gcn-run

2024-03-21 Thread Andrew Stubbs
There are some stability issues in the ROC runtime or drivers when we
run too many tests in quick succession.  I was hoping this patch might
fix it, but no; still good to fix the omissions though.

Committed to mainline.

gcc/ChangeLog:

* config/gcn/gcn-run.cc (main): Add an hsa_memory_free calls for each
device_malloc call.
---
 gcc/config/gcn/gcn-run.cc | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/gcc/config/gcn/gcn-run.cc b/gcc/config/gcn/gcn-run.cc
index d45ff3e6c2ba..2f3ed2d41d2f 100644
--- a/gcc/config/gcn/gcn-run.cc
+++ b/gcc/config/gcn/gcn-run.cc
@@ -755,7 +755,13 @@ main (int argc, char *argv[])
 
   /* Clean shut down.  */
   XHSA (hsa_fns.hsa_memory_free_fn (kernargs),
-   "Clean up device memory");
+   "Clean up device kernargs memory");
+  XHSA (hsa_fns.hsa_memory_free_fn (args),
+   "Clean up device args memory");
+  XHSA (hsa_fns.hsa_memory_free_fn (heap),
+   "Clean up device heap memory");
+  XHSA (hsa_fns.hsa_memory_free_fn (stack),
+   "Clean up device stack memory");
   XHSA (hsa_fns.hsa_executable_destroy_fn (executable),
"Clean up GCN executable");
   XHSA (hsa_fns.hsa_queue_destroy_fn (queue),
-- 
2.41.0



Re: GCN: Enable effective-target 'vect_early_break', 'vect_early_break_hw'

2024-03-21 Thread Andrew Stubbs

On 21/03/2024 10:41, Thomas Schwinge wrote:

Hi!

On 2024-01-12T15:02:35+0100, I wrote:

OK to push the attached
"GCN: Enable effective-target 'vect_early_break', 'vect_early_break_hw'"?


Ping.  (Or is that not what you'd expect to see for GCN?  I haven't
checked the actual back end code...)


Sorry, I missed this during the transition.

I think early break just means conditional/masked vectors, so it should 
work.


OK to commit.


("The relevant test cases are all-PASS with just [two] exceptions, to be
looked into individually, later on."  I'm not currently planning to look
into that.)


(One of those actually going to be fixed by a different patch to be
posted in a moment.)


Nice. :)

Andrew



Re: [Patch][RFC] GCN: Define ISA archs in gcn-devices.def and use it

2024-03-15 Thread Andrew Stubbs

On 15/03/2024 13:56, Tobias Burnus wrote:

Hi Andrew,

Andrew Stubbs wrote:
This is more-or-less what I was planning to do myself, but as I want 
to include all the other features that get parametrized in gcn.cc, 
gcn.h, gcn-hsa.h, gcn-opts.h, I hadn't got around to it yet. 
Unfortunately, I think the gcn.opt and config.gcc will always need 
manually updating, but if that's all it'll be an improvement.


Well, for .opt see how nvptx does it – it actually generates an .opt file.


I don't like the idea of including AMDGPU_ISA_UNSUPPORTED;


I concur – I was initially thinking of reporting the device name 
("Unsupported %s"), but then I realized that the agent returns a string, 
while the hex code is only used for GCC-generated files (→ eflag). Thus, 
I ended up not using it.


Ultimately, I want to replace many of the conditionals like 
"TARGET_CDNA2_PLUS" from the code and replace them with feature flags 
derived from a def file, or at least a header file. We've acquired too 
many places where there are unsearchable conditionals that need 
finding and fixing every time a new device comes along.
I was thinking of having more flags, but those where the only ones 
required for the two files.
I had imagined that this .def file would exist in gcc/config/gcn, but 
you've placed it in libgomp; maybe it makes sense to have multiple 
such files if they contain very different data, but I had imagined one 
file, and I'm not sure that the compiler definitions belong in libgomp.


There is already:

gcc/config/darwin-c.cc:#include "../../libcpp/internal.h"

gcc/config/gcn/gcn-run.cc:#include 
"../../../libgomp/config/gcn/libgomp-gcn.h"


gcc/fortran/cpp.cc:#include "../../libcpp/internal.h"

gcc/fortran/trigd_fe.inc:#include "../../libgfortran/intrinsics/trigd.inc"

But there is also the reverse:

libcpp/lex.cc:#include "../gcc/config/i386/cpuid.h"

libgfortran/libgfortran.h:#include "../gcc/fortran/libgfortran.h"

lto-plugin/lto-plugin.c:#include "../gcc/lto/common.h"

If you add more items, it is probably better to have it under 
gcc/config/gcn/ - and I really prefer a single file for all.


* * *

Talking about feature sets: this would be a bit like LLVM (see below), 
but I think they have a bit too much indirection. But I do concur that 
we need to consolidate the current support – and hopefully make it 
easier to keep adding more GPU support; we seem to have already covered 
a larger chunk :-)


I also did wonder whether we should support, e.g., running a gfx1100 
code (or a gfx11-generic one) on, e.g., a gfx1103 device. Alternatively, 
we could keep the current check which requires an exact match.


We didn't invent that restriction; the runtime won't let you do it. We 
only have the check because the HSA/ROCr error message is not very 
user-friendly.


BTW: I do note, looking at the feature sets in LLVM, that all GFX110x 
GPUs seem to have common silicon bugs: FeatureMSAALoadDstSelBug and 
FeatureMADIntraFwdBug, while 1100 and 1102 additionally have the 
FeatureUserSGPRInit16Bug but 1101 and 1103 don't. — For some reason, 
FeatureISAVersion11_Generic only consists of two of those bugs (it 
doesn't have FeatureMADIntraFwdBug), which doesn't seem to be that 
consistent. Maybe the workaround has issues elsewhere? If so, a generic 
-march=gfx11 might be not as useful as one might hope for.


* * *

If I look at LLVM's 
https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/AMDGPU.td ,


they first define several features – like 'FeatureUnalignedScratchAccess'.

Then they combine them like in:

def FeatureISAVersion11_Common ... [FeatureGFX11, ... 
FeatureAtomicFaddRtnInsts ...


And then they use those to map them to feature sets like:

def FeatureISAVersion11_0_Common ... 
listconcat(FeatureISAVersion11_Common.Features,

     [FeatureMSAALoadDstSelBug ...

And for gfx1103:

def FeatureISAVersion11_0_3 : FeatureSet<
   !listconcat(FeatureISAVersion11_0_Common.Features,
     [])>;

The mapping to gfx... names then happens in 
https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/GCNProcessors.td such as:


def : ProcessorModel<"gfx1103", GFX11SpeedModel,
   FeatureISAVersion11_0_3.Features
 >;

Or for the generic one, i.e.:

// [gfx1100, gfx1101, gfx1102, gfx1103, gfx1150, gfx1151]
def : ProcessorModel<"gfx11-generic", GFX11SpeedModel,
   FeatureISAVersion11_Generic.Features

LLVM also has some generic flags like the following in 
https://github.com/llvm/llvm-project/blob/main/llvm/lib/TargetParser/TargetParser.cpp


     {{"gfx1013"},   {"gfx1013"}, GK_GFX1013, 
FEATURE_FAST_FMA_F32|FEATURE_FAST_DENORMAL_F32|FEATURE_WAVE32|FEATURE_XNACK|FEATURE_WGP},


I hope that this will give some inspiration – but I assume that at least 
the initial implementation will be much shorter.


Yeah, we can have one macro for each arch, or multiple macros for 

Re: [Patch][RFC] GCN: Define ISA archs in gcn-devices.def and use it

2024-03-15 Thread Andrew Stubbs

On 15/03/2024 12:21, Tobias Burnus wrote:
Given the large number of AMD GPU ISAs and the number of files which 
have to be adapted, I wonder whether it makes sense to consolidate this 
a bit, especially in light of the fact that we may want to support more 
in the future.


Besides using some macros, I also improved the diagnostic if the object 
code couldn't be recognized (shouldn't happen) or if the GPU is 
unsupported (likely; it now prints the GPU string). I was initially 
thinking of resolving the arch encoded in the eflag to a string, but as 
this is about GCC-generated code, it seemed to be unlikely of much use. 
[It should that rare that we might also go back to the static string 
instead of outputting the hex value of the eflag.]


Note: I only modified mkoffload.cc and plugin-gcn.c, but with some 
tweaks it could also be used for other files in gcc/config/gcn/.


If you add a new ISA, you still need to update plugin-gcn.c's 
max_isa_vgprs and the xnack/sram-ecc handling in mkoffload.c's main, but 
that should be all for those two files.


Thoughts?


This is more-or-less what I was planning to do myself, but as I want to 
include all the other features that get parametrized in gcn.cc, gcn.h, 
gcn-hsa.h, gcn-opts.h, I hadn't got around to it yet.  Unfortunately, I 
think the gcn.opt and config.gcc will always need manually updating, but 
if that's all it'll be an improvement.


I don't like the idea of including AMDGPU_ISA_UNSUPPORTED; that list is 
going to be permanently out of date, and even if we maintain it 
fastidiously last year's release isn't going to have the updated list in 
the wild. I think it's not actually active in this patch in any case.


Instead of AMDGPU_ISA, I think "AMDGPU_ELF" makes more sense. The ISA is 
"CDNA2" or "RDNA3", etc., and the compiler needs to know about that.


Ultimately, I want to replace many of the conditionals like 
"TARGET_CDNA2_PLUS" from the code and replace them with feature flags 
derived from a def file, or at least a header file. We've acquired too 
many places where there are unsearchable conditionals that need finding 
and fixing every time a new device comes along.


I had imagined that this .def file would exist in gcc/config/gcn, but 
you've placed it in libgomp. Maybe it makes sense to have multiple 
such files if they contain very different data, but I had imagined one 
file, and I'm not sure that the compiler definitions should live in libgomp.



Tobias

PS: I think the patch is fine and builds, but I have not tested it on an 
AMD GPU machine, yet.


PPS: For using for other files, see also in config/nvptx which uses 
nvptx-sm.def to generate several files.




Andrew


Re: [PATCH] vect: Use xor to invert oversized vector masks

2024-03-15 Thread Andrew Stubbs

On 15/03/2024 07:35, Richard Biener wrote:

On Fri, Mar 15, 2024 at 4:35 AM Hongtao Liu  wrote:


On Thu, Mar 14, 2024 at 11:42 PM Andrew Stubbs  wrote:


Don't enable excess lanes when inverting vector bit-masks smaller than the
integer mode.  This is yet another case of wrong-code due to mishandling
of oversized bitmasks.

This issue shows up in vect/tsvc/vect-tsvc-s278.c and
vect/tsvc/vect-tsvc-s279.c if I set the preferred vector size to V32
(down from V64) on amdgcn.

OK for mainline?

Andrew

gcc/ChangeLog:

 * expr.cc (expand_expr_real_2): Use xor to invert vector masks.
---
  gcc/expr.cc | 11 +++
  1 file changed, 11 insertions(+)

diff --git a/gcc/expr.cc b/gcc/expr.cc
index 403eeaa108e4..3540327d879e 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -10497,6 +10497,17 @@ expand_expr_real_2 (sepops ops, rtx target, 
machine_mode tmode,
immed_wide_int_const (mask, int_mode),
target, 1, OPTAB_LIB_WIDEN);
 }
+  /* If it's a vector mask don't enable excess bits.  */
+  else if (VECTOR_BOOLEAN_TYPE_P (type)
+  && SCALAR_INT_MODE_P (mode)
+  && maybe_ne (GET_MODE_PRECISION (mode),
+   TYPE_VECTOR_SUBPARTS (type).to_constant ()))
+   {
+ auto nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
+ temp = expand_binop (mode, xor_optab, op0,
+  GEN_INT ((HOST_WIDE_INT_1U << nunits) - 1),
+  target, true, OPTAB_WIDEN);
+   }

Not review, just curious, should the issue be fixed by the commit in PR113576.
Also wonder, besides cbranch, whether excess lane bits also matter?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576#c35


Yes, you patch BIT_NOT but we decided to patch final compares.  Is it that
we need to fixup every mask use in a .COND_* expansion as well?  If so
we should do it there.


I thought that the "not" to "xor" change was nice and there was already 
code there for fixing bitfields, but OK, I take your point.


The .COND_* statements are handled as internal function calls that are 
expanded directly via the optab with no special cases for different call 
types. This is because the "expand_cond_*_optab_fn" functions just map 
straight to "expand_direct_optab_fn". Would that be the right place 
to insert a special-case handler that adds the "and" operations?


Andrew
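
(As an aside for readers following the thread: a tiny host-side illustration
of the excess-bits problem the patch addresses.  This is plain C, not GCC
internals, and the 4-lane mask width is just an example.)

  #include <stdio.h>
  #include <stdint.h>

  int main (void)
  {
    /* A 4-lane boolean vector mask stored in a 32-bit integer mode.  */
    uint32_t mask = 0x5;               /* lanes 0 and 2 active */
    uint32_t not_all = ~mask;          /* one's complement also sets bits 4..31 */
    uint32_t xor_low = mask ^ 0xf;     /* xor inverts only the 4 live lanes */
    printf ("~mask      = 0x%08x\n", not_all);   /* 0xfffffffa */
    printf ("mask ^ 0xf = 0x%08x\n", xor_low);   /* 0x0000000a */
    return 0;
  }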


Re: [PATCH] vect: Use xor to invert oversized vector masks

2024-03-15 Thread Andrew Stubbs

On 15/03/2024 03:45, Hongtao Liu wrote:

On Thu, Mar 14, 2024 at 11:42 PM Andrew Stubbs  wrote:


Don't enable excess lanes when inverting vector bit-masks smaller than the
integer mode.  This is yet another case of wrong-code due to mishandling
of oversized bitmasks.

This issue shows up in vect/tsvc/vect-tsvc-s278.c and
vect/tsvc/vect-tsvc-s279.c if I set the preferred vector size to V32
(down from V64) on amdgcn.

OK for mainline?

Andrew

gcc/ChangeLog:

 * expr.cc (expand_expr_real_2): Use xor to invert vector masks.
---
  gcc/expr.cc | 11 +++
  1 file changed, 11 insertions(+)

diff --git a/gcc/expr.cc b/gcc/expr.cc
index 403eeaa108e4..3540327d879e 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -10497,6 +10497,17 @@ expand_expr_real_2 (sepops ops, rtx target, 
machine_mode tmode,
immed_wide_int_const (mask, int_mode),
target, 1, OPTAB_LIB_WIDEN);
 }
+  /* If it's a vector mask don't enable excess bits.  */
+  else if (VECTOR_BOOLEAN_TYPE_P (type)
+  && SCALAR_INT_MODE_P (mode)
+  && maybe_ne (GET_MODE_PRECISION (mode),
+   TYPE_VECTOR_SUBPARTS (type).to_constant ()))
+   {
+ auto nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
+ temp = expand_binop (mode, xor_optab, op0,
+  GEN_INT ((HOST_WIDE_INT_1U << nunits) - 1),
+  target, true, OPTAB_WIDEN);
+   }

Not review, just curious, should the issue be fixed by the commit in PR113576.
Also wonder, besides cbranch, whether excess lane bits also matter?
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113576#c35


It does seem to be another case of the same problem, but those commits 
are long enough ago that I do have them, and still saw a problem.


Andrew



[PATCH] vect: Use xor to invert oversized vector masks

2024-03-14 Thread Andrew Stubbs
Don't enable excess lanes when inverting vector bit-masks smaller than the
integer mode.  This is yet another case of wrong-code due to mishandling
of oversized bitmasks.

This issue shows up in vect/tsvc/vect-tsvc-s278.c and
vect/tsvc/vect-tsvc-s279.c if I set the preferred vector size to V32
(down from V64) on amdgcn.

OK for mainline?

Andrew

gcc/ChangeLog:

* expr.cc (expand_expr_real_2): Use xor to invert vector masks.
---
 gcc/expr.cc | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/gcc/expr.cc b/gcc/expr.cc
index 403eeaa108e4..3540327d879e 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -10497,6 +10497,17 @@ expand_expr_real_2 (sepops ops, rtx target, 
machine_mode tmode,
   immed_wide_int_const (mask, int_mode),
   target, 1, OPTAB_LIB_WIDEN);
}
+  /* If it's a vector mask don't enable excess bits.  */
+  else if (VECTOR_BOOLEAN_TYPE_P (type)
+  && SCALAR_INT_MODE_P (mode)
+  && maybe_ne (GET_MODE_PRECISION (mode),
+   TYPE_VECTOR_SUBPARTS (type).to_constant ()))
+   {
+ auto nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
+ temp = expand_binop (mode, xor_optab, op0,
+  GEN_INT ((HOST_WIDE_INT_1U << nunits) - 1),
+  target, true, OPTAB_WIDEN);
+   }
   else
temp = expand_unop (mode, one_cmpl_optab, op0, target, 1);
   gcc_assert (temp);
-- 
2.41.0



Re: GCN: The original meaning of 'GCN_SUPPRESS_HOST_FALLBACK' isn't applicable (non-shared memory system)

2024-03-08 Thread Andrew Stubbs

On 08/03/2024 10:16, Thomas Schwinge wrote:

Hi!

So, attached here is now a different patch
"GCN: The original meaning of 'GCN_SUPPRESS_HOST_FALLBACK' isn't applicable 
(non-shared memory system)",
that takes a different approach re clarifying the two orthogonal aspects
that the 'GCN_SUPPRESS_HOST_FALLBACK' environment variable controls:
(a) the *original* meaning via 'HSA_SUPPRESS_HOST_FALLBACK', and
(b) the *additional*/*new* meaning to report as fatal certain errors
during device probing.

As you requested, (b) remains as it is (with just the diagnostic message
clarified).  Re (a):

On 2024-03-07T14:37:10+0100, I wrote:

On 2024-03-07T12:43:07+0100, Tobias Burnus  wrote:

Thomas Schwinge wrote:

[...] libgomp GCN plugin 'GCN_SUPPRESS_HOST_FALLBACK' [...]

[...] originates in the libgomp HSA plugin, where the idea was -- in my
understanding -- that you wouldn't have device code available for all
'fn_ptr's, and in that case transparently (shared-memory system!) do
host-fallback execution.  Or, with 'GCN_SUPPRESS_HOST_FALLBACK' set,
you'd get those diagnosed.

This has then been copied into the libgomp GCN plugin (see above).
However, is it really still applicable there; don't we assume that we're
generating device code for all relevant functions?



And, one step back: how is (the original meaning of)
'suppress_host_fallback = false' even supposed to work on non-shared
memory systems as currently implemented by the libgomp GCN plugin?



[...] this whole concept of dynamic plugin-level
host-fallback execution being in conflict with our current non-shared
memory system configurations?


I therefore suggest to get rid of (a).

OK to push?


I wasn't aware that things could be broken when fallback-suppression 
*wasn't* set. I agree that we don't need that "feature".


As far as I knew this feature was merely an older implementation of the 
now-standard OMP_TARGET_OFFLOAD=mandatory with the additional advantage 
that we could make it do whatever we want for our test and debug needs 
(i.e. no target independent "smarts").


This patch looks good, thanks.

Andrew


Re: GCN: Even with 'GCN_SUPPRESS_HOST_FALLBACK' set, failure to 'init_hsa_runtime_functions' is not fatal

2024-03-07 Thread Andrew Stubbs

On 07/03/2024 13:37, Thomas Schwinge wrote:

Hi Andrew!

On 2024-03-07T11:38:27+, Andrew Stubbs  wrote:

On 07/03/2024 11:29, Thomas Schwinge wrote:

On 2019-11-12T13:29:16+, Andrew Stubbs  wrote:

This patch contributes the GCN libgomp plugin, with the various
configure and make bits to go with it.


An issue with libgomp GCN plugin 'GCN_SUPPRESS_HOST_FALLBACK' (which is
different from the libgomp-level host-fallback execution):


--- /dev/null
+++ b/libgomp/plugin/plugin-gcn.c



+/* Flag to decide if the runtime should suppress a possible fallback to host
+   execution.  */
+
+static bool suppress_host_fallback;



+static void
+init_environment_variables (void)
+{
+  [...]
+  if (secure_getenv ("GCN_SUPPRESS_HOST_FALLBACK"))
+suppress_host_fallback = true;
+  else
+suppress_host_fallback = false;



+/* Return true if the HSA runtime can run function FN_PTR.  */
+
+bool
+GOMP_OFFLOAD_can_run (void *fn_ptr)
+{
+  struct kernel_info *kernel = (struct kernel_info *) fn_ptr;
+
+  init_kernel (kernel);
+  if (kernel->initialization_failed)
+goto failure;
+
+  return true;
+
+failure:
+  if (suppress_host_fallback)
+GOMP_PLUGIN_fatal ("GCN host fallback has been suppressed");
+  GCN_WARNING ("GCN target cannot be launched, doing a host fallback\n");
+  return false;
+}


This originates in the libgomp HSA plugin, where the idea was -- in my
understanding -- that you wouldn't have device code available for all
'fn_ptr's, and in that case transparently (shared-memory system!) do
host-fallback execution.  Or, with 'GCN_SUPPRESS_HOST_FALLBACK' set,
you'd get those diagnosed.

This has then been copied into the libgomp GCN plugin (see above).
However, is it really still applicable there; don't we assume that we're
generating device code for all relevant functions?  (I suppose everyone
really is testing with 'GCN_SUPPRESS_HOST_FALLBACK' set?)  Should we thus
actually remove 'suppress_host_fallback' (that is, make it
always-'true'), including removal of the 'can_run' hook?  (I suppose that
even in a future shared-memory "GCN" configuration, we're not expecting
to use this again; expecting always-'true' for 'can_run'?)


Now my actual issue: the libgomp GCN plugin then invented an additional
use of 'GCN_SUPPRESS_HOST_FALLBACK':


+/* Initialize hsa_context if it has not already been done.
+   Return TRUE on success.  */
+
+static bool
+init_hsa_context (void)
+{
+  hsa_status_t status;
+  int agent_index = 0;
+
+  if (hsa_context.initialized)
+return true;
+  init_environment_variables ();
+  if (!init_hsa_runtime_functions ())
+{
+  GCN_WARNING ("Run-time could not be dynamically opened\n");
+  if (suppress_host_fallback)
+   GOMP_PLUGIN_fatal ("GCN host fallback has been suppressed");
+  return false;
+}


That is, if 'GCN_SUPPRESS_HOST_FALLBACK' is (routinely) set (for its
original purpose), and you have the libgomp GCN plugin configured, but
don't have 'libhsa-runtime64.so.1' available, you run into a fatal error.

The libgomp nvptx plugin in such cases silently disables the
plugin/device (and thus lets libgomp proper do its thing), and I propose
we do the same here.  OK to push the attached
"GCN: Even with 'GCN_SUPPRESS_HOST_FALLBACK' set, failure to 
'init_hsa_runtime_functions' is not fatal"?


If you try to run the offload testsuite on a device that is not properly
configured then we want FAIL


Exactly, and that's what I'm working towards.  (Currently we're not
implementing that properly.)

But why is 'GCN_SUPPRESS_HOST_FALLBACK' controlling
'init_hsa_runtime_functions' relevant for that?  As you know, that
function only deals with dynamically loading 'libhsa-runtime64.so.1', and
failure to load that one (because it doesn't exist) should have the
agreed-upon behavior of *not* raising an error.  (Any other, later errors
should be fatal, I certainly agree.)


not pass-via-fallback. You're breaking that.


Sorry, I don't follow, please explain?


If the plugin load fails then libgomp will run in host-fallback. In that 
case, IIRC, this is the *only* opportunity we get to enforce 
GCN_SUPPRESS_HOST_FALLBACK. As far as I'm aware, that variable is 
internal, undocumented, meant for dev testing only. It says "I'm testing 
GCN features and if they're not working then I want to know about it."


Users should be using official OMP features.

Andrew
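
(For completeness, a minimal example of the "official OMP features" route.
This is only a sketch using standard OpenMP API calls, nothing specific to
the GCN plugin, and it assumes compilation with -fopenmp and offloading
configured; running with OMP_TARGET_OFFLOAD=mandatory makes libgomp itself
error out instead of silently falling back to the host.)

  #include <omp.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main (void)
  {
    if (omp_get_num_devices () == 0)
      {
        fprintf (stderr, "no offload device available\n");
        return EXIT_FAILURE;
      }

    int on_device = 0;
  #pragma omp target map(from:on_device)
    on_device = !omp_is_initial_device ();

    printf ("ran on device: %d\n", on_device);
    return on_device ? EXIT_SUCCESS : EXIT_FAILURE;
  }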


Re: GCN: Even with 'GCN_SUPPRESS_HOST_FALLBACK' set, failure to 'init_hsa_runtime_functions' is not fatal

2024-03-07 Thread Andrew Stubbs

On 07/03/2024 11:29, Thomas Schwinge wrote:

Hi!

On 2019-11-12T13:29:16+, Andrew Stubbs  wrote:

This patch contributes the GCN libgomp plugin, with the various
configure and make bits to go with it.


An issue with libgomp GCN plugin 'GCN_SUPPRESS_HOST_FALLBACK' (which is
different from the libgomp-level host-fallback execution):


--- /dev/null
+++ b/libgomp/plugin/plugin-gcn.c



+/* Flag to decide if the runtime should suppress a possible fallback to host
+   execution.  */
+
+static bool suppress_host_fallback;



+static void
+init_environment_variables (void)
+{
+  [...]
+  if (secure_getenv ("GCN_SUPPRESS_HOST_FALLBACK"))
+suppress_host_fallback = true;
+  else
+suppress_host_fallback = false;



+/* Return true if the HSA runtime can run function FN_PTR.  */
+
+bool
+GOMP_OFFLOAD_can_run (void *fn_ptr)
+{
+  struct kernel_info *kernel = (struct kernel_info *) fn_ptr;
+
+  init_kernel (kernel);
+  if (kernel->initialization_failed)
+goto failure;
+
+  return true;
+
+failure:
+  if (suppress_host_fallback)
+GOMP_PLUGIN_fatal ("GCN host fallback has been suppressed");
+  GCN_WARNING ("GCN target cannot be launched, doing a host fallback\n");
+  return false;
+}


This originates in the libgomp HSA plugin, where the idea was -- in my
understanding -- that you wouldn't have device code available for all
'fn_ptr's, and in that case transparently (shared-memory system!) do
host-fallback execution.  Or, with 'GCN_SUPPRESS_HOST_FALLBACK' set,
you'd get those diagnosed.

This has then been copied into the libgomp GCN plugin (see above).
However, is it really still applicable there; don't we assume that we're
generating device code for all relevant functions?  (I suppose everyone
really is testing with 'GCN_SUPPRESS_HOST_FALLBACK' set?)  Should we thus
actually remove 'suppress_host_fallback' (that is, make it
always-'true'), including removal of the 'can_run' hook?  (I suppose that
even in a future shared-memory "GCN" configuration, we're not expecting
to use this again; expecting always-'true' for 'can_run'?)


Now my actual issue: the libgomp GCN plugin then invented an additional
use of 'GCN_SUPPRESS_HOST_FALLBACK':


+/* Initialize hsa_context if it has not already been done.
+   Return TRUE on success.  */
+
+static bool
+init_hsa_context (void)
+{
+  hsa_status_t status;
+  int agent_index = 0;
+
+  if (hsa_context.initialized)
+return true;
+  init_environment_variables ();
+  if (!init_hsa_runtime_functions ())
+{
+  GCN_WARNING ("Run-time could not be dynamically opened\n");
+  if (suppress_host_fallback)
+   GOMP_PLUGIN_fatal ("GCN host fallback has been suppressed");
+  return false;
+}


That is, if 'GCN_SUPPRESS_HOST_FALLBACK' is (routinely) set (for its
original purpose), and you have the libgomp GCN plugin configured, but
don't have 'libhsa-runtime64.so.1' available, you run into a fatal error.

The libgomp nvptx plugin in such cases silently disables the
plugin/device (and thus lets libgomp proper do its thing), and I propose
we do the same here.  OK to push the attached
"GCN: Even with 'GCN_SUPPRESS_HOST_FALLBACK' set, failure to 
'init_hsa_runtime_functions' is not fatal"?


If you try to run the offload testsuite on a device that is not properly 
configured then we want FAIL, not pass-via-fallback. You're breaking that.


Andrew


Re: amdgcn: additional gfx1030/gfx1100 support: adjust test cases

2024-03-06 Thread Andrew Stubbs

On 06/03/2024 13:49, Thomas Schwinge wrote:

Hi!

On 2024-01-24T12:43:04+, Andrew Stubbs  wrote:

This [...]


... became commit 99890e15527f1f04caef95ecdd135c9f1a077f08
"amdgcn: additional gfx1030/gfx1100 support", and included the following:


--- a/gcc/config/gcn/gcn-valu.md
+++ b/gcc/config/gcn/gcn-valu.md
@@ -3555,30 +3555,63 @@
  ;; }}}
  ;; {{{ Int/int conversions
  
+(define_code_iterator all_convert [truncate zero_extend sign_extend])

  (define_code_iterator zero_convert [truncate zero_extend])
  (define_code_attr convop [
(sign_extend "extend")
(zero_extend "zero_extend")
(truncate "trunc")])
  
-(define_insn "2"

+(define_expand "2"
+  [(set (match_operand:V_INT_1REG 0 "register_operand"  "=v")
+(all_convert:V_INT_1REG
+ (match_operand:V_INT_1REG_ALT 1 "gcn_alu_operand" " v")))]
+  "")
+
+(define_insn "*_sdwa"
[(set (match_operand:V_INT_1REG 0 "register_operand"  "=v")
  (zero_convert:V_INT_1REG
  (match_operand:V_INT_1REG_ALT 1 "gcn_alu_operand" " v")))]
-  ""
+  "!TARGET_RDNA3"
"v_mov_b32_sdwa\t%0, %1 dst_sel: dst_unused:UNUSED_PAD 
src0_sel:"
[(set_attr "type" "vop_sdwa")
 (set_attr "length" "8")])
  
-(define_insn "extend2"

+(define_insn "extend_sdwa"
[(set (match_operand:V_INT_1REG 0 "register_operand"  "=v")
  (sign_extend:V_INT_1REG
  (match_operand:V_INT_1REG_ALT 1 "gcn_alu_operand" " v")))]
-  ""
+  "!TARGET_RDNA3"
"v_mov_b32_sdwa\t%0, sext(%1) src0_sel:"
[(set_attr "type" "vop_sdwa")
 (set_attr "length" "8")])
  
+(define_insn "*_shift"

+  [(set (match_operand:V_INT_1REG 0 "register_operand"  "=v")
+(all_convert:V_INT_1REG
+ (match_operand:V_INT_1REG_ALT 1 "gcn_alu_operand" " v")))]
+  "TARGET_RDNA3"
+  {
+enum {extend, zero_extend, trunc};
+rtx shiftwidth = (mode == QImode
+ || mode == QImode
+ ? GEN_INT (24)
+ : mode == HImode
+   || mode == HImode
+ ? GEN_INT (16)
+ : NULL);
+operands[2] = shiftwidth;
+
+if (!shiftwidth)
+  return "v_mov_b32 %0, %1";
+else if ( == extend ||  == trunc)
+  return "v_lshlrev_b32\t%0, %2, %1\;v_ashrrev_i32\t%0, %2, %0";
+else
+  return "v_lshlrev_b32\t%0, %2, %1\;v_lshrrev_b32\t%0, %2, %0";
+  }
+  [(set_attr "type" "mult")
+   (set_attr "length" "8")])


OK to push the attached
"amdgcn: additional gfx1030/gfx1100 support: adjust test cases"?
Tested 'gcn.exp' for all '-march'es.


LGTM.

Andrew
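
(A small host-side illustration of what the RDNA3 fall-back in the hunk
quoted above computes: without SDWA, a QImode-to-SImode sign extension can
be done with a left/arithmetic-right shift pair, and a zero extension with
a left/logical-right shift pair.  Plain C; the shift amount 24 corresponds
to the QImode case.)

  #include <stdio.h>
  #include <stdint.h>

  int main (void)
  {
    uint32_t x = 0xf0;                            /* byte value -16 */
    int32_t  sext = (int32_t) (x << 24) >> 24;    /* v_lshlrev_b32 ; v_ashrrev_i32 */
    uint32_t zext = (x << 24) >> 24;              /* v_lshlrev_b32 ; v_lshrrev_b32 */
    printf ("sext = %d, zext = %u\n", sext, zext);  /* -16, 240 */
    return 0;
  }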



Re: Stabilize flaky GCN target/offloading testing

2024-03-06 Thread Andrew Stubbs

On 06/03/2024 12:09, Thomas Schwinge wrote:

Hi!

On 2024-02-21T17:32:13+0100, Richard Biener  wrote:

Am 21.02.2024 um 13:34 schrieb Thomas Schwinge :

[...] per my work on 
"libgomp make check time is excessive", all execution testing in libgomp
is serialized in 'libgomp/testsuite/lib/libgomp.exp:libgomp_load'.  [...]
(... with the caveat that execution tests for
effective-targets are *not* governed by that, as I've found yesterday.
I have a WIP hack for that, too.)



What disturbs the testing a lot is that the GPU may get into a bad
state, upon which any use either fails with a
'HSA_STATUS_ERROR_OUT_OF_RESOURCES' error -- or by just hanging, deep in
'libhsa-runtime64.so.1'...

I've now tried to debug the latter case (hang).  When the GPU gets into
this bad state (whatever exactly that is),
'hsa_executable_load_code_object' still returns 'HSA_STATUS_SUCCESS', but
then GCN target execution ('gcn-run') hangs in 'hsa_executable_freeze'
vs. GCN offloading execution ('libgomp-plugin-gcn.so.1') hangs right
before 'hsa_executable_freeze', in the GCN heap setup 'hsa_memory_copy'.
There it hangs until killed (for example, until DejaGnu's timeout
mechanism kills the process -- just that the next GPU-using execution
test then runs into the same thing again...).

In this state (and also the 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' state),
we're able to recover via:

$ flock /tmp/gpu.lock sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover
0


At least most of the times.  I've found that -- sometimes... ;-( -- if
you run into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES', then do
'amdgpu_gpu_recover', and then immediately re-execute, you'll again run
into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.  That appears to be avoidable
by injecting some artificial "cool-down period"...  (The latter I've not
yet tested extensively.)


This is, obviously, a hack, probably needs a serial lock to not disturb
other things, has hard-coded 'dri/0', and as I said in

"GCN RDNA2+ vs. GCC SLP vectorizer":

| I've no idea what
| 'amdgpu_gpu_recover' would do if the GPU is also used for display.


It ends up terminating your X session…


Eh  ;'-|


(there’s some automatic driver recovery that’s also sometimes triggered which 
sounds like the same thing).



I need to try using the integrated graphics for X11 to see if that avoids the 
issue.


A few years ago, I tried that for a Nvidia GPU laptop, and -- if I now
remember correctly -- basically got it to work, via hand-editing
'/etc/X11/xorg.conf' and all that...  But: I couldn't get external HDMI
to work in that setup, and therefore reverted to "standard".


Guess AMD needs to improve the driver/runtime (or we - it’s open source at 
least up to the firmware).



However, it's very useful in my testing.  :-|

The question is, how to detect the "hang" state without first running
into a timeout (and disambiguating such a timeout from a user code
timeout)?  Add a watchdog: call 'alarm([a few seconds])' before device
initialization, and before the actual GPU kernel launch cancel it with
'alarm(0)'?  (..., and add a handler for 'SIGALRM' to print a distinct
error message that we can then react on, like for
'HSA_STATUS_ERROR_OUT_OF_RESOURCES'.)  Probably 'alarm'/'SIGALRM' is a
no-go in libgomp -- instead, use a helper thread to similarly implement a
watchdog?  ('libgomp/plugin/plugin-gcn.c' already is using pthreads for
other purposes.)  Any other clever ideas?  What's a suitable value for
"a few seconds"?


I'm attaching my current "GCN: Watchdog for device image load", covering
both 'gcc/config/gcn/gcn-run.cc' and 'libgomp/plugin/plugin-gcn.c'.
(That's using 'timer_create' etc. instead of 'alarm'/'SIGALRM'. )

That, plus routing *all* potential GPU usage (in particular: including
execution tests for effective-targets, see above) through a serial lock
('flock', implemented in DejaGnu board file, outside of the the
"DejaGnu timeout domain", similar to
'libgomp/testsuite/lib/libgomp.exp:libgomp_load', see above), plus
catching 'HSA_STATUS_ERROR_OUT_OF_RESOURCES' (both the "real" ones and
the "fake" ones via "GCN: Watchdog for device image load") and in that
case 'amdgpu_gpu_recover' and re-execution of the respective executable,
does greatly stabilize flaky GCN target/offloading testing.

Do we have consensus to move forward with this approach, generally?


I've also observed a number of random hangs in host-side code outside 
our control, but after the kernel has exited. In general this watchdog 
approach might help with these. I do feel like it's "papering over the 
cracks", but if we can't fix it at the end of the day it's just a 
little extra code.


My only concern is that it might actually cause failures, perhaps on 
heavily loaded systems, or with network filesystems, or during debugging.


Andrew
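
(To make the discussion concrete, here is a minimal sketch of the
timer_create-based watchdog idea; the timeout value, exit code and message
are placeholders, the real patch of course wraps the actual
hsa_executable_freeze / hsa_memory_copy calls rather than a comment, and
older glibc needs -lrt for the timer functions.)

  #include <signal.h>
  #include <stdlib.h>
  #include <time.h>
  #include <unistd.h>

  static void
  watchdog_fired (int sig)
  {
    (void) sig;
    /* A distinct message the test harness can match on, to tell a hung
       device image load apart from a user-code timeout.  */
    write (2, "device image load timed out\n", 28);
    _exit (124);
  }

  int
  main (void)
  {
    struct sigaction sa = { .sa_handler = watchdog_fired };
    sigaction (SIGALRM, &sa, NULL);

    timer_t timer;
    struct sigevent sev = { .sigev_notify = SIGEV_SIGNAL,
                            .sigev_signo = SIGALRM };
    timer_create (CLOCK_MONOTONIC, &sev, &timer);

    struct itimerspec its = { .it_value = { .tv_sec = 10 } };
    timer_settime (timer, 0, &its, NULL);     /* arm before image load */

    /* ... load and freeze the device image here ... */

    its.it_value.tv_sec = 0;
    timer_settime (timer, 0, &its, NULL);     /* disarm on success */
    timer_delete (timer);
    return 0;
  }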


Re: [PATCH] vect: Fix integer overflow calculating mask

2024-03-04 Thread Andrew Stubbs

On 23/02/2024 15:13, Richard Biener wrote:

On Fri, 23 Feb 2024, Jakub Jelinek wrote:


On Fri, Feb 23, 2024 at 02:22:19PM +, Andrew Stubbs wrote:

On 23/02/2024 13:02, Jakub Jelinek wrote:

On Fri, Feb 23, 2024 at 12:58:53PM +, Andrew Stubbs wrote:

This is a follow-up to the previous patch to ensure that integer vector
bit-masks do not have excess bits set. It fixes a bug, observed on
amdgcn, in which the mask could be incorrectly set to zero, resulting in
wrong-code.

The mask was broken when nunits==32. The patched version will probably
be broken for nunits==64, but I don't think any current targets have
masks with more than 64 bits.

OK for mainline?

Andrew

gcc/ChangeLog:

* expr.cc (store_constructor): Use 64-bit shifts.


No, this isn't 64-bit shift on all hosts.
Use HOST_WIDE_INT_1U instead.


OK, I did wonder if there was a proper way to do it. :)

How about this?


If you change the other two GEN_INT ((1 << nunits) - 1) occurrences in
expr.cc the same way, then LGTM.


There's also two in dojump.cc


This patch should fix all the cases, I think.

I have not observed any further test result changes.

OK?

Andrew
vect: Fix integer overflow calculating mask

The masks and bitvectors were broken when nunits==32 on hosts where int is
32-bit.

gcc/ChangeLog:

* dojump.cc (do_compare_and_jump): Use full-width integers for shifts.
* expr.cc (store_constructor): Likewise.
(do_store_flag): Likewise.

diff --git a/gcc/dojump.cc b/gcc/dojump.cc
index ac744e54cf8..88600cb42d3 100644
--- a/gcc/dojump.cc
+++ b/gcc/dojump.cc
@@ -1318,10 +1318,10 @@ do_compare_and_jump (tree treeop0, tree treeop1, enum 
rtx_code signed_code,
 {
   gcc_assert (code == EQ || code == NE);
   op0 = expand_binop (mode, and_optab, op0,
- GEN_INT ((1 << nunits) - 1), NULL_RTX,
+ GEN_INT ((HOST_WIDE_INT_1U << nunits) - 1), NULL_RTX,
  true, OPTAB_WIDEN);
   op1 = expand_binop (mode, and_optab, op1,
- GEN_INT ((1 << nunits) - 1), NULL_RTX,
+ GEN_INT ((HOST_WIDE_INT_1U << nunits) - 1), NULL_RTX,
  true, OPTAB_WIDEN);
 }
 
diff --git a/gcc/expr.cc b/gcc/expr.cc
index 8d34d024c9c..f7d74525c15 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -7879,8 +7879,8 @@ store_constructor (tree exp, rtx target, int cleared, 
poly_int64 size,
auto nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
if (maybe_ne (GET_MODE_PRECISION (mode), nunits))
  tmp = expand_binop (mode, and_optab, tmp,
- GEN_INT ((1 << nunits) - 1), target,
- true, OPTAB_WIDEN);
+ GEN_INT ((HOST_WIDE_INT_1U << nunits) - 1),
+ target, true, OPTAB_WIDEN);
if (tmp != target)
  emit_move_insn (target, tmp);
break;
@@ -13707,11 +13707,11 @@ do_store_flag (sepops ops, rtx target, machine_mode 
mode)
 {
   gcc_assert (code == EQ || code == NE);
   op0 = expand_binop (mode, and_optab, op0,
- GEN_INT ((1 << nunits) - 1), NULL_RTX,
- true, OPTAB_WIDEN);
+ GEN_INT ((HOST_WIDE_INT_1U << nunits) - 1),
+ NULL_RTX, true, OPTAB_WIDEN);
   op1 = expand_binop (mode, and_optab, op1,
- GEN_INT ((1 << nunits) - 1), NULL_RTX,
- true, OPTAB_WIDEN);
+ GEN_INT ((HOST_WIDE_INT_1U << nunits) - 1),
+ NULL_RTX, true, OPTAB_WIDEN);
 }
 
   if (target == 0)
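
(A host-side aside on why the width of the shifted 1 matters here: with a
32-bit int, 1 << 32 is undefined behaviour and in practice the mask
collapses, which is exactly the nunits == 32 failure described above;
shifting an unsigned 64-bit 1, which is what HOST_WIDE_INT_1U amounts to on
these hosts, is well defined.)

  #include <stdio.h>
  #include <inttypes.h>

  int main (void)
  {
    unsigned nunits = 32;
    /* (1 << nunits) - 1 would be undefined behaviour for a 32-bit int.  */
    uint64_t mask = (UINT64_C (1) << nunits) - 1;
    printf ("mask = 0x%" PRIx64 "\n", mask);   /* 0xffffffff */
    return 0;
  }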


Re: [PATCH] vect: Fix integer overflow calculating mask

2024-02-23 Thread Andrew Stubbs

On 23/02/2024 13:02, Jakub Jelinek wrote:

On Fri, Feb 23, 2024 at 12:58:53PM +, Andrew Stubbs wrote:

This is a follow-up to the previous patch to ensure that integer vector
bit-masks do not have excess bits set. It fixes a bug, observed on
amdgcn, in which the mask could be incorrectly set to zero, resulting in
wrong-code.

The mask was broken when nunits==32. The patched version will probably
be broken for nunits==64, but I don't think any current targets have
masks with more than 64 bits.

OK for mainline?

Andrew

gcc/ChangeLog:

* expr.cc (store_constructor): Use 64-bit shifts.


No, this isn't 64-bit shift on all hosts.
Use HOST_WIDE_INT_1U instead.


OK, I did wonder if there was a proper way to do it. :)

How about this?

Andrew
vect: Fix integer overflow calculating mask

The mask was broken when nunits==32 on hosts where int is 32-bit.

gcc/ChangeLog:

* expr.cc (store_constructor): Use 64-bit shifts.

diff --git a/gcc/expr.cc b/gcc/expr.cc
index e23880e..6bd16ac7f49 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -7879,8 +7879,8 @@ store_constructor (tree exp, rtx target, int cleared, 
poly_int64 size,
auto nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
if (maybe_ne (GET_MODE_PRECISION (mode), nunits))
  tmp = expand_binop (mode, and_optab, tmp,
- GEN_INT ((1 << nunits) - 1), target,
- true, OPTAB_WIDEN);
+ GEN_INT ((HOST_WIDE_INT_1U << nunits) - 1),
+ target, true, OPTAB_WIDEN);
if (tmp != target)
  emit_move_insn (target, tmp);
break;


[PATCH] vect: Fix integer overflow calculating mask

2024-02-23 Thread Andrew Stubbs
This is a follow-up to the previous patch to ensure that integer vector
bit-masks do not have excess bits set. It fixes a bug, observed on
amdgcn, in which the mask could be incorrectly set to zero, resulting in
wrong-code.

The mask was broken when nunits==32. The patched version will probably
be broken for nunits==64, but I don't think any current targets have
masks with more than 64 bits.

OK for mainline?

Andrew

gcc/ChangeLog:

* expr.cc (store_constructor): Use 64-bit shifts.
---
 gcc/expr.cc | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/gcc/expr.cc b/gcc/expr.cc
index e23880e..90de5decee3 100644
--- a/gcc/expr.cc
+++ b/gcc/expr.cc
@@ -7879,7 +7879,7 @@ store_constructor (tree exp, rtx target, int cleared, 
poly_int64 size,
auto nunits = TYPE_VECTOR_SUBPARTS (type).to_constant ();
if (maybe_ne (GET_MODE_PRECISION (mode), nunits))
  tmp = expand_binop (mode, and_optab, tmp,
- GEN_INT ((1 << nunits) - 1), target,
+ GEN_INT ((1UL << nunits) - 1), target,
  true, OPTAB_WIDEN);
if (tmp != target)
  emit_move_insn (target, tmp);
-- 
2.41.0



Re: GCN: Conditionalize 'define_expand "reduc__scal_"' on '!TARGET_RDNA2_PLUS' [PR113615]

2024-02-16 Thread Andrew Stubbs

On 16/02/2024 14:34, Thomas Schwinge wrote:

Hi!

On 2024-01-29T11:34:05+0100, Tobias Burnus  wrote:

Andrew wrote off list:
"Vector reductions don't work on RDNA, as is, but they're
 supposed to be disabled by the insn condition"

This patch disables "fold_left_plus_", which is about
vectorization and in the code path shown in the backtrace.
I can also confirm manually that it fixes the ICE I saw and
also the ICE for the testfile that Richard's PR shows at the
end of his backtrace.  (-O3 is needed to trigger the ICE.)


On top of that, OK to push the attached
"GCN: Conditionalize 'define_expand "reduc__scal_"' on 
'!TARGET_RDNA2_PLUS' [PR113615]"?

Which of the 'assert's are worth keeping?

Only tested 'vect.exp' for 'check-gcc-c' so far; full testing to run
later.

Please confirm I'm understanding this correctly:

Andrew's original commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
"amdgcn: add -march=gfx1030 EXPERIMENTAL" did this:

  (define_expand "reduc__scal_"
[(set (match_operand: 0 "register_operand")
 (unspec:
   [(match_operand:V_ALL 1 "register_operand")]
   REDUC_UNSPEC))]
 -  ""
 +  "!TARGET_RDNA2" [later '!TARGET_RDNA2_PLUS']
{
  [...]

This conditional, however, does *not* govern any explicit
'gen_reduc_plus_scal_', and therefore Tobias in
commit 7cc2262ec9a410dc56d1c1c6b950c922e14f621d
"gcn/gcn-valu.md: Disable fold_left_plus for TARGET_RDNA2_PLUS [PR113615]"
had to replicate the '!TARGET_RDNA2_PLUS' condition:


@@ -4274,7 +4274,8 @@ (define_expand "fold_left_plus_"
   [(match_operand: 0 "register_operand")
(match_operand: 1 "gcn_alu_operand")
(match_operand:V_FP 2 "gcn_alu_operand")]
-  "can_create_pseudo_p ()
+  "!TARGET_RDNA2_PLUS
+   && can_create_pseudo_p ()
 && (flag_openacc || flag_openmp
 || flag_associative_math)"
{

|  rtx dest = operands[0];
|  rtx scalar = operands[1];
|  rtx vector = operands[2];
|  rtx tmp = gen_reg_rtx (mode);
|
|  emit_insn (gen_reduc_plus_scal_ (tmp, vector));
|  [...]

..., and I thus now have to do similar for
'gen_reduc__scal_' use in here:

  (define_expand "reduc__scal_"
[(match_operand: 0 "register_operand")
 (fminmaxop:V_FP
   (match_operand:V_FP 1 "register_operand"))]
 -  ""
 +  "!TARGET_RDNA2_PLUS"
{
  /* fmin/fmax are identical to smin/smax.  */
  emit_insn (gen_reduc__scal_ (operands[0], 
operands[1]));
  [...]


OK. I don't mind the asserts. Hopefully they're redundant, but I suppose 
it's better than tracking down an unrecognised instruction in a later pass.


Andrew
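
(A toy model, not real GCC internals, of the pitfall discussed above: the
condition on a define_expand only gates the named-pattern/optab path, while
a direct call to the generator function bypasses it, so each such caller
has to repeat the target check itself.  Here it is modelled with plain C
functions and a global flag.)

  #include <stdbool.h>
  #include <stdio.h>

  static bool target_rdna2_plus = true;   /* stands in for TARGET_RDNA2_PLUS */

  /* Stands in for gen_reduc_plus_scal_<mode>: always callable.  */
  static void gen_reduc_plus (void) { puts ("emit DPP reduction"); }

  /* Stands in for the optab path, which honours the pattern condition.  */
  static bool maybe_expand_reduc_plus (void)
  {
    if (target_rdna2_plus)               /* the "!TARGET_RDNA2_PLUS" gate */
      return false;
    gen_reduc_plus ();
    return true;
  }

  int main (void)
  {
    /* A direct gen_* call, as in fold_left_plus_<mode>, needs its own
       guard; without it we would "emit" an insn RDNA2+ cannot execute.  */
    if (!target_rdna2_plus)
      gen_reduc_plus ();
    else if (!maybe_expand_reduc_plus ())
      puts ("fall back to scalar/vector-shift reduction");
    return 0;
  }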



Re: GCN RDNA2+ vs. GCC SLP vectorizer

2024-02-16 Thread Andrew Stubbs

On 16/02/2024 12:26, Richard Biener wrote:

On Fri, 16 Feb 2024, Andrew Stubbs wrote:


On 16/02/2024 10:17, Richard Biener wrote:

On Fri, 16 Feb 2024, Thomas Schwinge wrote:


Hi!

On 2023-10-20T12:51:03+0100, Andrew Stubbs  wrote:

I've committed this patch


... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
"amdgcn: add -march=gfx1030 EXPERIMENTAL", which the later RDNA3/gfx1100
support builds on top of, and that's what I'm currently working on
getting proper GCC/GCN target (not offloading) results for.

Now looking at 'gcc.dg/vect/bb-slp-cond-1.c', which is reasonably simple,
and hopefully representative for other SLP execution test FAILs
(regressions compared to my earlier non-gfx1100 testing).

  $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/
  source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c
  --sysroot=install/amdgcn-amdhsa -ftree-vectorize
  -fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common
  -O2 -fdump-tree-slp-details -fdump-tree-vect-details -isystem
  build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem
  source-gcc/newlib/libc/include
  -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
  -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -wrapper
  setarch,--addr-no-randomize -fdump-tree-all-all -fdump-ipa-all-all
  -fdump-rtl-all-all -save-temps -march=gfx1100

The '-march=gfx1030' 'a-bb-slp-cond-1.s' is identical (apart from
'TARGET_PACKED_WORK_ITEMS' in 'gcn_target_asm_function_prologue'), so I
suppose will also exhibit the same failure mode, once again?

Compared to '-march=gfx90a', the differences begin in
'a-bb-slp-cond-1.c.266r.expand' (only!), down to 'a-bb-slp-cond-1.s'.

Changed like:

  @@ -38,10 +38,10 @@ int main ()
   #pragma GCC novector
 for (i = 1; i < N; i++)
   if (a[i] != i%4 + 1)
  -  abort ();
  +  __builtin_printf("%d %d != %d\n", i, a[i], i%4 + 1);
   
 if (a[0] != 5)

  -abort ();
  +__builtin_printf("%d %d != %d\n", 0, a[0], 5);

..., we see:

  $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
  40 5 != 1
  41 6 != 2
  42 7 != 3
  43 8 != 4
  44 5 != 1
  45 6 != 2
  46 7 != 3
  47 8 != 4

'40..47' are the 'i = 10..11' in 'foo', and the expectation is
'a[i * stride + 0..3] != 0'.  So, either some earlier iteration has
scribbled zero values over these (vector lane masking issue, perhaps?),
or some other code generation issue?


So we're indeed BB vectorizing this to

_54 = MEM  [(int *)_14];
vect_iftmp.12_56 = .VCOND (_54, { 0, 0, 0, 0 }, { 1, 2, 3, 4 }, { 5, 6,
7, 8 }, 115);
MEM  [(int *)_14] = vect_iftmp.12_56;

I don't understand the assembly very well but it might be that
the mask computation for the .VCOND scribbles the mask used
to constrain operation to 4 lanes?

.L3:
  s_mov_b64   exec, 15
  v_add_co_u32v4, s[22:23], s32, v3
  v_mov_b32   v5, s33
  v_add_co_ci_u32 v5, s[22:23], 0, v5, s[22:23]
  flat_load_dword v7, v[4:5] offset:0
  s_waitcnt   0
  flat_load_dword v0, v[10:11] offset:0
  s_waitcnt   0
  flat_load_dword v6, v[8:9] offset:0
  s_waitcnt   0
  v_cmp_ne_u32s[18:19], v7, 0
  v_cndmask_b32   v0, v6, v0, s[18:19]
  flat_store_dwordv[4:5], v0 offset:0
  s_add_i32   s12, s12, 1
  s_add_u32   s32, s32, s28
  s_addc_u32  s33, s33, s29
  s_cmp_lg_u32s12, s13
  s_cbranch_scc1  .L3


This basic block has EXEC set to 15 (4 lanes) throughout. The mask for the
VCOND a.k.a. v_cndmask_b32 is in s[18:19]. Those things seem OK.

I see the testcase avoids vec_extract V64SI to V4SI for gfx1100, even though
it would be a no-op conversion, because the general case requires a permute
instruction and named pattern insns can't have non-constant conditions. Is
vec_extract allowed to FAIL? That might give a better result in this case.


I found that vec_extract is not allowed to FAIL. I guess the only way to 
allow the no-op conversions is to implement manual fall-back code-gen 
for the broken cases.




However, I must be doing something different because vect/bb-slp-cond-1.c
passes for me, on gfx1100.


I didn't try to run it - when doing make check-gcc it fails to use
gcn-run for test invocation; what's the trick to make it do that?


There's a config file for nvptx here: 
https://github.com/SourceryTools/nvptx-tools/blob/master/nvptx-none-run.exp


You can probably make the obvious adjustments. I think Thomas has a GCN 
version with a few more features.


I usually use the CodeSourcery magic stack of scripts for testing 
installed toolchains on remote devices, so I'm not too familiar with 
using Dejagnu directly.


Andrew



Re: GCN RDNA2+ vs. GCC SLP vectorizer

2024-02-16 Thread Andrew Stubbs

On 16/02/2024 10:17, Richard Biener wrote:

On Fri, 16 Feb 2024, Thomas Schwinge wrote:


Hi!

On 2023-10-20T12:51:03+0100, Andrew Stubbs  wrote:

I've committed this patch


... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
"amdgcn: add -march=gfx1030 EXPERIMENTAL", which the later RDNA3/gfx1100
support builds on top of, and that's what I'm currently working on
getting proper GCC/GCN target (not offloading) results for.

Now looking at 'gcc.dg/vect/bb-slp-cond-1.c', which is reasonably simple,
and hopefully representative for other SLP execution test FAILs
(regressions compared to my earlier non-gfx1100 testing).

 $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ 
source-gcc/gcc/testsuite/gcc.dg/vect/bb-slp-cond-1.c 
--sysroot=install/amdgcn-amdhsa -ftree-vectorize 
-fno-tree-loop-distribute-patterns -fno-vect-cost-model -fno-common -O2 
-fdump-tree-slp-details -fdump-tree-vect-details -isystem 
build-gcc/amdgcn-amdhsa/gfx1100/newlib/targ-include -isystem 
source-gcc/newlib/libc/include -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ 
-Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -wrapper setarch,--addr-no-randomize 
-fdump-tree-all-all -fdump-ipa-all-all -fdump-rtl-all-all -save-temps 
-march=gfx1100

The '-march=gfx1030' 'a-bb-slp-cond-1.s' is identical (apart from
'TARGET_PACKED_WORK_ITEMS' in 'gcn_target_asm_function_prologue'), so I
suppose will also exhibit the same failure mode, once again?

Compared to '-march=gfx90a', the differences begin in
'a-bb-slp-cond-1.c.266r.expand' (only!), down to 'a-bb-slp-cond-1.s'.

Changed like:

 @@ -38,10 +38,10 @@ int main ()
  #pragma GCC novector
for (i = 1; i < N; i++)
  if (a[i] != i%4 + 1)
 -  abort ();
 +  __builtin_printf("%d %d != %d\n", i, a[i], i%4 + 1);
  
if (a[0] != 5)

 -abort ();
 +__builtin_printf("%d %d != %d\n", 0, a[0], 5);

..., we see:

 $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
 40 5 != 1
 41 6 != 2
 42 7 != 3
 43 8 != 4
 44 5 != 1
 45 6 != 2
 46 7 != 3
 47 8 != 4

'40..47' are the 'i = 10..11' in 'foo', and the expectation is
'a[i * stride + 0..3] != 0'.  So, either some earlier iteration has
scribbled zero values over these (vector lane masking issue, perhaps?),
or some other code generation issue?


So we're indeed BB vectorizing this to

   _54 = MEM  [(int *)_14];
   vect_iftmp.12_56 = .VCOND (_54, { 0, 0, 0, 0 }, { 1, 2, 3, 4 }, { 5, 6,
7, 8 }, 115);
   MEM  [(int *)_14] = vect_iftmp.12_56;

I don't understand the assembly very well but it might be that
the mask computation for the .VCOND scribbles the mask used
to constrain operation to 4 lanes?

.L3:
 s_mov_b64   exec, 15
 v_add_co_u32v4, s[22:23], s32, v3
 v_mov_b32   v5, s33
 v_add_co_ci_u32 v5, s[22:23], 0, v5, s[22:23]
 flat_load_dword v7, v[4:5] offset:0
 s_waitcnt   0
 flat_load_dword v0, v[10:11] offset:0
 s_waitcnt   0
 flat_load_dword v6, v[8:9] offset:0
 s_waitcnt   0
 v_cmp_ne_u32s[18:19], v7, 0
 v_cndmask_b32   v0, v6, v0, s[18:19]
 flat_store_dwordv[4:5], v0 offset:0
 s_add_i32   s12, s12, 1
 s_add_u32   s32, s32, s28
 s_addc_u32  s33, s33, s29
 s_cmp_lg_u32s12, s13
 s_cbranch_scc1  .L3


This basic block has EXEC set to 15 (4 lanes) throughout. The mask for 
the VCOND a.k.a. v_cndmask_b32 is in s[18:19]. Those things seem OK.


I see the testcase avoids vec_extract V64SI to V4SI for gfx1100, even 
though it would be a no-op conversion, because the general case requires 
a permute instruction and named pattern insns can't have non-constant 
conditions. Is vec_extract allowed to FAIL? That might give a better 
result in this case.


However, I must be doing something different because 
vect/bb-slp-cond-1.c passes for me, on gfx1100.


Andrew


Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

2024-02-15 Thread Andrew Stubbs

On 15/02/2024 10:23, Thomas Schwinge wrote:

Hi!

On 2024-02-15T08:49:17+0100, Richard Biener  wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:

On 14/02/2024 13:43, Richard Biener wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:

On 14/02/2024 13:27, Richard Biener wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:

On 13/02/2024 08:26, Richard Biener wrote:

On Mon, 12 Feb 2024, Thomas Schwinge wrote:

On 2023-10-20T12:51:03+0100, Andrew Stubbs 
wrote:

I've committed this patch


... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
"amdgcn: add -march=gfx1030 EXPERIMENTAL".

The RDNA2 ISA variant doesn't support certain instructions previously
implemented in GCC/GCN, so a number of patterns etc. had to be
disabled:


[...] Vector
reductions will need to be reworked for RDNA2.  [...]



* config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
(addc3): Add RDNA2 syntax variant.
(subc3): Likewise.
(2_exec): Add RDNA2 alternatives.
(vec_cmpdi): Likewise.
(vec_cmpdi): Likewise.
(vec_cmpdi_exec): Likewise.
(vec_cmpdi_exec): Likewise.
(vec_cmpdi_dup): Likewise.
(vec_cmpdi_dup_exec): Likewise.
(reduc__scal_): Disable for RDNA2.
(*_dpp_shr_): Likewise.
(*plus_carry_dpp_shr_): Likewise.
(*plus_carry_in_dpp_shr_): Likewise.


Etc.  The expectation being that GCC middle end copes with this, and
synthesizes some less ideal yet still functional vector code, I presume.

The later RDNA3/gfx1100 support builds on top of this, and that's what
I'm currently working on getting proper GCC/GCN target (not offloading)
results for.

I'm seeing a good number of execution test FAILs (regressions compared to
my earlier non-gfx1100 testing), and I've now tracked down where one
large class of those comes into existence -- [...]



With the following hack applied to 'gcc/tree-vect-loop.cc':

@@ -6687,8 +6687,9 @@ vect_create_epilog_for_reduction
(loop_vec_info
loop_vinfo,
   reduce_with_shift = have_whole_vector_shift (mode1);
   if (!VECTOR_MODE_P (mode1)
  || !directly_supported_p (code, vectype1))
reduce_with_shift = false;
+  reduce_with_shift = false;

..., I'm able to work around those regressions: by means of forcing
"Reduce using scalar code" instead of "Reduce using vector shifts".



The attached not-well-tested patch should allow only valid permutations.
Hopefully we go back to working code, but there'll be things that won't
vectorize. That said, the new "dump" output code has fewer and probably
cheaper instructions, so hmmm.


This fixes the reduced builtin-bitops-1.c on RDNA2.


I confirm that "amdgcn: Disallow unsupported permute on RDNA devices"
also obsoletes my 'reduce_with_shift = false;' hack -- and also cures a
good number of additional FAILs (regressions), where presumably we
permute via different code paths.  Thanks!

There also are a few regressions, but only minor:

 PASS: gcc.dg/vect/no-vfa-vect-depend-3.c (test for excess errors)
 PASS: gcc.dg/vect/no-vfa-vect-depend-3.c execution test
 PASS: gcc.dg/vect/no-vfa-vect-depend-3.c scan-tree-dump-times vect "vectorized 
1 loops" 4
 [-PASS:-]{+FAIL:+} gcc.dg/vect/no-vfa-vect-depend-3.c scan-tree-dump-times vect 
"dependence distance negative" 4

..., because:

 gcc.dg/vect/no-vfa-vect-depend-3.c: pattern found 6 times
 FAIL: gcc.dg/vect/no-vfa-vect-depend-3.c scan-tree-dump-times vect "dependence 
distance negative" 4

 PASS: gcc.dg/vect/vect-119.c (test for excess errors)
 [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-119.c scan-tree-dump-times vect "Detected 
interleaving load of size 2" 1
 PASS: gcc.dg/vect/vect-119.c scan-tree-dump-not optimized "Invalid sum"

..., because:

 gcc.dg/vect/vect-119.c: pattern found 3 times
 FAIL: gcc.dg/vect/vect-119.c scan-tree-dump-times vect "Detected interleaving 
load of size 2" 1

 PASS: gcc.dg/vect/vect-reduc-mul_1.c (test for excess errors)
 PASS: gcc.dg/vect/vect-reduc-mul_1.c execution test
 [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-mul_1.c scan-tree-dump vect "Reduce 
using vector shifts"

 PASS: gcc.dg/vect/vect-reduc-mul_2.c (test for excess errors)
 PASS: gcc.dg/vect/vect-reduc-mul_2.c execution test
 [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-mul_2.c scan-tree-dump vect "Reduce 
using vector shifts"

..., plus the following, in combination with the earlier changes
disabling patterns:

 PASS: gcc.dg/vect/vect-reduc-or_1.c (test for excess errors)
 PASS: gcc.dg/vect/vect-reduc-or_1.c execution test
 [-PASS:-]{+FAIL:+} gcc.dg/vect/vect-reduc-or_1.c scan-tree-dump vect "Reduce 
using direct vector reduction"

 PASS: gcc.dg/vect/vect-reduc-or_2.c (test for excess errors)
 PASS: gcc.dg/vect/vect-reduc-or_2.c execution test
 [-PASS:-]{+FAIL:+} gcc.dg/

Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

2024-02-15 Thread Andrew Stubbs

On 15/02/2024 10:21, Richard Biener wrote:
[snip]

I suppse if RDNA really only has 32 lane vectors (it sounds like it,
even if it can "simulate" 64 lane ones?) then it might make sense to
vectorize for 32 lanes?  That said, with variable-length it likely
doesn't matter but I'd not expose fixed-size modes with 64 lanes then?


For most operations, wavefrontsize=64 works just fine; the GPU runs each
instruction twice and presents a pair of hardware registers as a logical
64-lane register. This breaks down for permutations and reductions, and is
obviously inefficient when the vectors are not fully utilized, but is
otherwise compatible with the GCN/CDNA compiler.

I didn't want to invest all the effort it would take to support
wavefrontsize=32, which would be the natural mode for these devices; the
number of places that have "64" hard-coded is just too big. Not only that, but
the EXEC and VCC registers change from DImode to SImode and that's going to
break a lot of stuff. (And we have no paying customer for this.)

I'm open to patch submissions. :)


OK, I see ;)  As said for fully masked that's a good answer.  I'd
probably still not expose V64mode modes in the RTL expanders for the
vect_* patterns?  Or, what happens if you change
gcn_vectorize_preferred_simd_mode to return 32 lane modes for RDNA
and omit 64 lane modes from gcn_autovectorize_vector_modes for RDNA?


Changing the preferred mode probably would fix permute.


Does that possibly leave performance on the table? (not sure if there's
any documents about choosing wavefrontsize=64 vs 32 with regard to
performance)

Note it would entirely forbid the vectorizer from using larger modes,
it just makes it prefer the smaller ones.  OTOH if you then run
wavefrontsize=64 ontop of it it's probably wasting the 2nd instruction
by always masking it?


Right, the GPU will continue to process the "top half" of the vector as 
an additional step, regardless of whether you put anything useful there, or 
not.



So yeah.  Guess a s/64/wavefrontsize/ would be a first step towards
allowing 32 there ...


I think the DImode to SImode change is the most difficult fix. Unless 
you know of a cunning trick, that's going to mean a lot of changes to a 
lot of the machine description; substitutions, duplications, iterators, 
indirections, etc., etc., etc.


The "64" substitution would be tedious but less hairy. I did a lot of 
those when I created the fake vector sizes.



Anyway, the fix works, so that's the most important thing ;)


:)

Andrew


Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

2024-02-15 Thread Andrew Stubbs

On 15/02/2024 07:49, Richard Biener wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:


On 14/02/2024 13:43, Richard Biener wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:


On 14/02/2024 13:27, Richard Biener wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:


On 13/02/2024 08:26, Richard Biener wrote:

On Mon, 12 Feb 2024, Thomas Schwinge wrote:


Hi!

On 2023-10-20T12:51:03+0100, Andrew Stubbs 
wrote:

I've committed this patch


... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
"amdgcn: add -march=gfx1030 EXPERIMENTAL".

The RDNA2 ISA variant doesn't support certain instructions previously
implemented in GCC/GCN, so a number of patterns etc. had to be
disabled:


[...] Vector
reductions will need to be reworked for RDNA2.  [...]



* config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
(addc3): Add RDNA2 syntax variant.
(subc3): Likewise.
(2_exec): Add RDNA2 alternatives.
(vec_cmpdi): Likewise.
(vec_cmpdi): Likewise.
(vec_cmpdi_exec): Likewise.
(vec_cmpdi_exec): Likewise.
(vec_cmpdi_dup): Likewise.
(vec_cmpdi_dup_exec): Likewise.
(reduc__scal_): Disable for RDNA2.
(*_dpp_shr_): Likewise.
(*plus_carry_dpp_shr_): Likewise.
(*plus_carry_in_dpp_shr_): Likewise.


Etc.  The expectation being that GCC middle end copes with this, and
synthesizes some less ideal yet still functional vector code, I
presume.

The later RDNA3/gfx1100 support builds on top of this, and that's what
I'm currently working on getting proper GCC/GCN target (not offloading)
results for.

I'm seeing a good number of execution test FAILs (regressions compared
to
my earlier non-gfx1100 testing), and I've now tracked down where one
large class of those comes into existence -- not yet how to resolve,
unfortunately.  But maybe, with you guys' combined vectorizer and back
end experience, the latter will be done quickly?

Richard, I don't know if you've ever run actual GCC/GCN target (not
offloading) testing; let me know if you have any questions about that.


I've only done offload testing - in the x86_64 build tree run
check-target-libgomp.  If you can tell me how to do GCN target testing
(maybe document it on the wiki even!) I can try do that as well.


Given that (at least largely?) the same patterns etc. are disabled as
in
my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
same issues.  You can build GCC/GCN target like you build the
offloading
one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
even use a offloading GCC/GCN build to reproduce the issue below.

One example is the attached 'builtin-bitops-1.c', reduced from
'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
miscompiled as soon as '-ftree-vectorize' is effective:

$ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c
-Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
-Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all
-fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100
-O1
-ftree-vectorize

In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
'-march=gfx90a' vs. '-march=gfx1100', we see:

+builtin-bitops-1.c:7:17: missed:   reduc op not supported by
target.

..., and therefore:

-builtin-bitops-1.c:7:17: note:  Reduce using direct vector
reduction.
+builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
+builtin-bitops-1.c:7:17: note:  extract scalar result

That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build
a
chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
generated:

$ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
i=1, ints[i]=0x1 a=1, b=2
i=2, ints[i]=0x8000 a=1, b=2
i=3, ints[i]=0x2 a=1, b=2
i=4, ints[i]=0x4000 a=1, b=2
i=5, ints[i]=0x1 a=1, b=2
i=6, ints[i]=0x8000 a=1, b=2
i=7, ints[i]=0xa5a5a5a5 a=16, b=32
i=8, ints[i]=0x5a5a5a5a a=16, b=32
i=9, ints[i]=0xcafe a=11, b=22
i=10, ints[i]=0xcafe00 a=11, b=22
i=11, ints[i]=0xcafe a=11, b=22
i=12, ints[i]=0x a=32, b=64

(I can't tell if the 'b = 2 * a' pattern is purely coincidental?)

I don't speak enough "vectorization" to fully understand the generic
vectorized algorithm and its implementation.  It appears that the
"Reduce using vector shifts" code has been around for a very long time,
but also has gone through a number of changes.  I can't tell which GCC
targets/configurations it's actually used for (in the same way as for
GCN gfx1100), and thus whether there's an issue in that vectorizer
code,
or rather in the GCN back end, or GCN back end parameterizing the
generic
code?


The "shift" reduction is basically doing reduction by repeatedly
adding the upper to the lower half of the vector (each time halving
the vector size).
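
(Illustrative only -- a scalar model of that halving scheme for an 8-lane
vector, not code from the vectorizer: each step adds the "shifted-down"
upper half onto the lower half, so after log2(8) = 3 steps lane 0 holds
the full sum.)

  #include <stdio.h>

  int main (void)
  {
    int v[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    for (int half = 8 / 2; half >= 1; half /= 2)
      for (int i = 0; i < half; i++)
        v[i] += v[i + half];              /* add upper half onto lower half */
    printf ("reduction = %d\n", v[0]);    /* 36 */
    return 0;
  }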


Manually working through the 'a

Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

2024-02-14 Thread Andrew Stubbs

On 14/02/2024 13:43, Richard Biener wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:


On 14/02/2024 13:27, Richard Biener wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:


On 13/02/2024 08:26, Richard Biener wrote:

On Mon, 12 Feb 2024, Thomas Schwinge wrote:


Hi!

On 2023-10-20T12:51:03+0100, Andrew Stubbs  wrote:

I've committed this patch


... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
"amdgcn: add -march=gfx1030 EXPERIMENTAL".

The RDNA2 ISA variant doesn't support certain instructions previously
implemented in GCC/GCN, so a number of patterns etc. had to be disabled:


[...] Vector
reductions will need to be reworked for RDNA2.  [...]



   * config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
   (addc3): Add RDNA2 syntax variant.
   (subc3): Likewise.
   (2_exec): Add RDNA2 alternatives.
   (vec_cmpdi): Likewise.
   (vec_cmpdi): Likewise.
   (vec_cmpdi_exec): Likewise.
   (vec_cmpdi_exec): Likewise.
   (vec_cmpdi_dup): Likewise.
   (vec_cmpdi_dup_exec): Likewise.
   (reduc__scal_): Disable for RDNA2.
   (*_dpp_shr_): Likewise.
   (*plus_carry_dpp_shr_): Likewise.
   (*plus_carry_in_dpp_shr_): Likewise.


Etc.  The expectation being that GCC middle end copes with this, and
synthesizes some less ideal yet still functional vector code, I presume.

The later RDNA3/gfx1100 support builds on top of this, and that's what
I'm currently working on getting proper GCC/GCN target (not offloading)
results for.

I'm seeing a good number of execution test FAILs (regressions compared to
my earlier non-gfx1100 testing), and I've now tracked down where one
large class of those comes into existence -- not yet how to resolve,
unfortunately.  But maybe, with you guys' combined vectorizer and back
end experience, the latter will be done quickly?

Richard, I don't know if you've ever run actual GCC/GCN target (not
offloading) testing; let me know if you have any questions about that.


I've only done offload testing - in the x86_64 build tree run
check-target-libgomp.  If you can tell me how to do GCN target testing
(maybe document it on the wiki even!) I can try do that as well.


Given that (at least largely?) the same patterns etc. are disabled as in
my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
same issues.  You can build GCC/GCN target like you build the offloading
one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
even use a offloading GCC/GCN build to reproduce the issue below.

One example is the attached 'builtin-bitops-1.c', reduced from
'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
miscompiled as soon as '-ftree-vectorize' is effective:

   $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c
   -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
   -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all
   -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100
   -O1
   -ftree-vectorize

In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
'-march=gfx90a' vs. '-march=gfx1100', we see:

   +builtin-bitops-1.c:7:17: missed:   reduc op not supported by
   target.

..., and therefore:

   -builtin-bitops-1.c:7:17: note:  Reduce using direct vector
   reduction.
   +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
   +builtin-bitops-1.c:7:17: note:  extract scalar result

That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a
chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
generated:

   $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
   i=1, ints[i]=0x1 a=1, b=2
   i=2, ints[i]=0x8000 a=1, b=2
   i=3, ints[i]=0x2 a=1, b=2
   i=4, ints[i]=0x4000 a=1, b=2
   i=5, ints[i]=0x1 a=1, b=2
   i=6, ints[i]=0x8000 a=1, b=2
   i=7, ints[i]=0xa5a5a5a5 a=16, b=32
   i=8, ints[i]=0x5a5a5a5a a=16, b=32
   i=9, ints[i]=0xcafe a=11, b=22
   i=10, ints[i]=0xcafe00 a=11, b=22
   i=11, ints[i]=0xcafe a=11, b=22
   i=12, ints[i]=0x a=32, b=64

(I can't tell if the 'b = 2 * a' pattern is purely coincidental?)

I don't speak enough "vectorization" to fully understand the generic
vectorized algorithm and its implementation.  It appears that the
"Reduce using vector shifts" code has been around for a very long time,
but also has gone through a number of changes.  I can't tell which GCC
targets/configurations it's actually used for (in the same way as for
GCN gfx1100), and thus whether there's an issue in that vectorizer code,
or rather in the GCN back end, or GCN back end parameterizing the generic
code?


The "shift" reduction is basically doing reduction by repeatedly
adding the upper to the lower half of the vector (each time halving
the vector size).


Manually working through the 'a-builtin-bitops-1.c.265t.optimized' code:

   int my_popcount (unsigned int x)
   {
 int stmp__12.12;
 vector

Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

2024-02-14 Thread Andrew Stubbs

On 14/02/2024 13:27, Richard Biener wrote:

On Wed, 14 Feb 2024, Andrew Stubbs wrote:


On 13/02/2024 08:26, Richard Biener wrote:

On Mon, 12 Feb 2024, Thomas Schwinge wrote:


Hi!

On 2023-10-20T12:51:03+0100, Andrew Stubbs  wrote:

I've committed this patch


... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
"amdgcn: add -march=gfx1030 EXPERIMENTAL".

The RDNA2 ISA variant doesn't support certain instructions previously
implemented in GCC/GCN, so a number of patterns etc. had to be disabled:


[...] Vector
reductions will need to be reworked for RDNA2.  [...]



  * config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
  (addc3): Add RDNA2 syntax variant.
  (subc3): Likewise.
  (2_exec): Add RDNA2 alternatives.
  (vec_cmpdi): Likewise.
  (vec_cmpdi): Likewise.
  (vec_cmpdi_exec): Likewise.
  (vec_cmpdi_exec): Likewise.
  (vec_cmpdi_dup): Likewise.
  (vec_cmpdi_dup_exec): Likewise.
  (reduc__scal_): Disable for RDNA2.
  (*_dpp_shr_): Likewise.
  (*plus_carry_dpp_shr_): Likewise.
  (*plus_carry_in_dpp_shr_): Likewise.


Etc.  The expectation being that GCC middle end copes with this, and
synthesizes some less ideal yet still functional vector code, I presume.

The later RDNA3/gfx1100 support builds on top of this, and that's what
I'm currently working on getting proper GCC/GCN target (not offloading)
results for.

I'm seeing a good number of execution test FAILs (regressions compared to
my earlier non-gfx1100 testing), and I've now tracked down where one
large class of those comes into existence -- not yet how to resolve,
unfortunately.  But maybe, with you guys' combined vectorizer and back
end experience, the latter will be done quickly?

Richard, I don't know if you've ever run actual GCC/GCN target (not
offloading) testing; let me know if you have any questions about that.


I've only done offload testing - in the x86_64 build tree run
check-target-libgomp.  If you can tell me how to do GCN target testing
(maybe document it on the wiki even!) I can try do that as well.


Given that (at least largely?) the same patterns etc. are disabled as in
my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
same issues.  You can build GCC/GCN target like you build the offloading
one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
even use an offloading GCC/GCN build to reproduce the issue below.

One example is the attached 'builtin-bitops-1.c', reduced from
'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
miscompiled as soon as '-ftree-vectorize' is effective:

  $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c
  -Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/
  -Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all
  -fdump-ipa-all-all -fdump-rtl-all-all -save-temps -march=gfx1100 -O1
  -ftree-vectorize

In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
'-march=gfx90a' vs. '-march=gfx1100', we see:

  +builtin-bitops-1.c:7:17: missed:   reduc op not supported by target.

..., and therefore:

  -builtin-bitops-1.c:7:17: note:  Reduce using direct vector reduction.
  +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
  +builtin-bitops-1.c:7:17: note:  extract scalar result

That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a
chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
generated:

  $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
  i=1, ints[i]=0x1 a=1, b=2
  i=2, ints[i]=0x8000 a=1, b=2
  i=3, ints[i]=0x2 a=1, b=2
  i=4, ints[i]=0x4000 a=1, b=2
  i=5, ints[i]=0x1 a=1, b=2
  i=6, ints[i]=0x8000 a=1, b=2
  i=7, ints[i]=0xa5a5a5a5 a=16, b=32
  i=8, ints[i]=0x5a5a5a5a a=16, b=32
  i=9, ints[i]=0xcafe a=11, b=22
  i=10, ints[i]=0xcafe00 a=11, b=22
  i=11, ints[i]=0xcafe a=11, b=22
  i=12, ints[i]=0x a=32, b=64

(I can't tell if the 'b = 2 * a' pattern is purely coincidental?)

I don't speak enough "vectorization" to fully understand the generic
vectorized algorithm and its implementation.  It appears that the
"Reduce using vector shifts" code has been around for a very long time,
but also has gone through a number of changes.  I can't tell which GCC
targets/configurations it's actually used for (in the same way as for
GCN gfx1100), and thus whether there's an issue in that vectorizer code,
or rather in the GCN back end, or GCN back end parameterizing the generic
code?


The "shift" reduction is basically doing reduction by repeatedly
adding the upper to the lower half of the vector (each time halving
the vector size).


Manually working through the 'a-builtin-bitops-1.c.265t.optimized' code:

  int my_popcount (unsigned int x)
  {
int stmp__12.12;
vector(64) int vect__12.11;
vector(64) unsigned int vect__1.8;
vector(64) unsigned int _13;
vector(64) unsigned int vect_cst

Re: GCN RDNA2+ vs. GCC vectorizer "Reduce using vector shifts"

2024-02-14 Thread Andrew Stubbs

On 13/02/2024 08:26, Richard Biener wrote:

On Mon, 12 Feb 2024, Thomas Schwinge wrote:


Hi!

On 2023-10-20T12:51:03+0100, Andrew Stubbs  wrote:

I've committed this patch


... as commit c7ec7bd1c6590cf4eed267feab490288e0b8d691
"amdgcn: add -march=gfx1030 EXPERIMENTAL".

The RDNA2 ISA variant doesn't support certain instructions previously
implemented in GCC/GCN, so a number of patterns etc. had to be disabled:


[...] Vector
reductions will need to be reworked for RDNA2.  [...]



* config/gcn/gcn-valu.md (@dpp_move): Disable for RDNA2.
(addc3): Add RDNA2 syntax variant.
(subc3): Likewise.
(2_exec): Add RDNA2 alternatives.
(vec_cmpdi): Likewise.
(vec_cmpdi): Likewise.
(vec_cmpdi_exec): Likewise.
(vec_cmpdi_exec): Likewise.
(vec_cmpdi_dup): Likewise.
(vec_cmpdi_dup_exec): Likewise.
(reduc__scal_): Disable for RDNA2.
(*_dpp_shr_): Likewise.
(*plus_carry_dpp_shr_): Likewise.
(*plus_carry_in_dpp_shr_): Likewise.


Etc.  The expectation being that GCC middle end copes with this, and
synthesizes some less ideal yet still functional vector code, I presume.

The later RDNA3/gfx1100 support builds on top of this, and that's what
I'm currently working on getting proper GCC/GCN target (not offloading)
results for.

I'm seeing a good number of execution test FAILs (regressions compared to
my earlier non-gfx1100 testing), and I've now tracked down where one
large class of those comes into existence -- not yet how to resolve,
unfortunately.  But maybe, with you guys' combined vectorizer and back
end experience, the latter will be done quickly?

Richard, I don't know if you've ever run actual GCC/GCN target (not
offloading) testing; let me know if you have any questions about that.


I've only done offload testing - in the x86_64 build tree run
check-target-libgomp.  If you can tell me how to do GCN target testing
(maybe document it on the wiki even!) I can try do that as well.


Given that (at least largely?) the same patterns etc. are disabled as in
my gfx1100 configuration, I suppose your gfx1030 one would exhibit the
same issues.  You can build GCC/GCN target like you build the offloading
one, just remove '--enable-as-accelerator-for=[...]'.  Likely, you can
even use an offloading GCC/GCN build to reproduce the issue below.

One example is the attached 'builtin-bitops-1.c', reduced from
'gcc.c-torture/execute/builtin-bitops-1.c', where 'my_popcount' is
miscompiled as soon as '-ftree-vectorize' is effective:

 $ build-gcc/gcc/xgcc -Bbuild-gcc/gcc/ builtin-bitops-1.c 
-Bbuild-gcc/amdgcn-amdhsa/gfx1100/newlib/ 
-Lbuild-gcc/amdgcn-amdhsa/gfx1100/newlib -fdump-tree-all-all -fdump-ipa-all-all 
-fdump-rtl-all-all -save-temps -march=gfx1100 -O1 -ftree-vectorize

In the 'diff' of 'a-builtin-bitops-1.c.179t.vect', for example, for
'-march=gfx90a' vs. '-march=gfx1100', we see:

 +builtin-bitops-1.c:7:17: missed:   reduc op not supported by target.

..., and therefore:

 -builtin-bitops-1.c:7:17: note:  Reduce using direct vector reduction.
 +builtin-bitops-1.c:7:17: note:  Reduce using vector shifts
 +builtin-bitops-1.c:7:17: note:  extract scalar result

That is, instead of one '.REDUC_PLUS' for gfx90a, for gfx1100 we build a
chain of summation of 'VEC_PERM_EXPR's.  However, there's wrong code
generated:

 $ flock /tmp/gcn.lock build-gcc/gcc/gcn-run a.out
 i=1, ints[i]=0x1 a=1, b=2
 i=2, ints[i]=0x8000 a=1, b=2
 i=3, ints[i]=0x2 a=1, b=2
 i=4, ints[i]=0x4000 a=1, b=2
 i=5, ints[i]=0x1 a=1, b=2
 i=6, ints[i]=0x8000 a=1, b=2
 i=7, ints[i]=0xa5a5a5a5 a=16, b=32
 i=8, ints[i]=0x5a5a5a5a a=16, b=32
 i=9, ints[i]=0xcafe a=11, b=22
 i=10, ints[i]=0xcafe00 a=11, b=22
 i=11, ints[i]=0xcafe a=11, b=22
 i=12, ints[i]=0x a=32, b=64

(I can't tell if the 'b = 2 * a' pattern is purely coincidental?)

I don't speak enough "vectorization" to fully understand the generic
vectorized algorithm and its implementation.  It appears that the
"Reduce using vector shifts" code has been around for a very long time,
but also has gone through a number of changes.  I can't tell which GCC
targets/configurations it's actually used for (in the same way as for
GCN gfx1100), and thus whether there's an issue in that vectorizer code,
or rather in the GCN back end, or GCN back end parameterizing the generic
code?


The "shift" reduction is basically doing reduction by repeatedly
adding the upper to the lower half of the vector (each time halving
the vector size).


Manually working through the 'a-builtin-bitops-1.c.265t.optimized' code:

 int my_popcount (unsigned int x)
 {
   int stmp__12.12;
   vector(64) int vect__12.11;
   vector(64) unsigned int vect__1.8;
   vector(64) unsigned int _13;
   vector(64) unsigned int vect_cst__18;
   vector(64) int [all others];
 
[loca

Re: [PATCH] libgomp: testsuite: Don't XPASS libgomp.c/alloc-pinned-1.c etc. on non-Linux targets [PR113448]

2024-02-12 Thread Andrew Stubbs

On 05/02/2024 13:04, Rainer Orth wrote:

Two libgomp tests XPASS on Solaris (any non-Linux target actually) since
their introduction:

XPASS: libgomp.c/alloc-pinned-1.c execution test
XPASS: libgomp.c/alloc-pinned-2.c execution test

The problem is that the test just prints

OS unsupported

and exits successfully, while the test is XFAILed:

/* { dg-xfail-run-if "Pinning not implemented on this host" { ! *-*-linux-gnu } 
} */

Fixed by aborting immediately after the message above in the non-Linux
case.

Tested on i386-pc-solaris2.11 and i686-pc-linux-gnu.

Ok for trunk?


OK with me, FWIW.

Andrew



Re: GCN: Don't hard-code number of SGPR/VGPR/AVGPR registers

2024-02-01 Thread Andrew Stubbs

On 01/02/2024 13:49, Thomas Schwinge wrote:

Hi!

On 2018-12-12T11:52:52+, Andrew Stubbs  wrote:

This patch contains the major part of the GCN back-end.  [...]



--- /dev/null
+++ b/gcc/config/gcn/gcn.c



+void
+gcn_hsa_declare_function_name (FILE *file, const char *name, tree)
+{



+  /* Determine count of sgpr/vgpr registers by looking for last
+ one used.  */
+  for (sgpr = 101; sgpr >= 0; sgpr--)
+if (df_regs_ever_live_p (FIRST_SGPR_REG + sgpr))
+  break;
+  sgpr++;
+  for (vgpr = 255; vgpr >= 0; vgpr--)
+if (df_regs_ever_live_p (FIRST_VGPR_REG + vgpr))
+  break;
+  vgpr++;



--- /dev/null
+++ b/gcc/config/gcn/gcn.h



+#define FIRST_SGPR_REG 0
+#define SGPR_REGNO(N)  ((N)+FIRST_SGPR_REG)
+#define LAST_SGPR_REG  101



+#define FIRST_VGPR_REG 160
+#define VGPR_REGNO(N)  ((N)+FIRST_VGPR_REG)
+#define LAST_VGPR_REG  415


OK to push "GCN: Don't hard-code number of SGPR/VGPR/AVGPR registers",
see attached?


OK.

Andrew


Re: GCN, RDNA 3: Adjust 'sync_compare_and_swap_lds_insn'

2024-02-01 Thread Andrew Stubbs

On 01/02/2024 11:36, Thomas Schwinge wrote:

Hi!

On 2024-01-31T11:31:00+, Andrew Stubbs  wrote:

On 31/01/2024 10:36, Thomas Schwinge wrote:

OK to push "GCN, RDNA 3: Adjust 'sync_compare_and_swap_lds_insn'",
see attached?

In pre-RDNA 3 ISA manuals, there are notes for 'DS_CMPST_[...]', like:

  Caution, the order of src and cmp are the *opposite* of the 
BUFFER_ATOMIC_CMPSWAP opcode.

..., and conversely in the RDNA 3 ISA manual, for 'DS_CMPSTORE_[...]':

  In this architecture the order of src and cmp agree with the 
BUFFER_ATOMIC_CMPSWAP opcode.

Is my understanding correct, that this isn't something we have to worry
about at the GCC machine description level; that's resolved at the
assembler level?


Right, the IR uses GCC's operand order and has nothing to do with the
assembler syntax; the output template does the mapping.


--- a/gcc/config/gcn/gcn.md
+++ b/gcc/config/gcn/gcn.md
@@ -2095,7 +2095,12 @@
   (match_operand:SIDI 3 "register_operand" "  v")]
  UNSPECV_ATOMIC))]
""
-  "ds_cmpst_rtn_b %0, %1, %2, %3\;s_waitcnt\tlgkmcnt(0)"
+  {
+if (TARGET_RDNA3)
+  return "ds_cmpstore_rtn_b %0, %1, %2, 
%3\;s_waitcnt\tlgkmcnt(0)";
+else
+  return "ds_cmpst_rtn_b %0, %1, %2, %3\;s_waitcnt\tlgkmcnt(0)";
+  }
[(set_attr "type" "ds")
 (set_attr "length" "12")])


I think you need to swap %2 and %3 in the new format. ds_cmpst matches
GCC operand order, but ds_cmpstore has "cmp" and "src" reversed.


OK, thanks.  That was my actual question -- so, we do need to swap, and
indeed, most of the affected libgomp OpenACC test cases then PASS their
execution test.  With that changed, I've pushed to master branch
commit 6c2a40f4f4577f5d0f7bd1cfda48a5701b75744c
"GCN, RDNA 3: Adjust 'sync_compare_and_swap_lds_insn'", see
attached.


OK to commit.

Andrew


Re: GCN: Remove 'FIRST_{SGPR,VGPR,AVGPR}_REG', 'LAST_{SGPR,VGPR,AVGPR}_REG' from machine description

2024-01-31 Thread Andrew Stubbs

On 31/01/2024 17:21, Thomas Schwinge wrote:

Hi!

On 2018-12-12T11:52:23+, Andrew Stubbs  wrote:

This patch contains the machine description portion of the GCN back-end.  [...]



--- /dev/null
+++ b/gcc/config/gcn/gcn.md



+;; {{{ Constants and enums
+
+; Named registers
+(define_constants
+  [(FIRST_SGPR_REG  0)
+   (LAST_SGPR_REG   101)
+   (FLAT_SCRATCH_REG102)
+   (FLAT_SCRATCH_LO_REG 102)
+   (FLAT_SCRATCH_HI_REG 103)
+   (XNACK_MASK_REG  104)
+   (XNACK_MASK_LO_REG   104)
+   (XNACK_MASK_HI_REG   105)
+   (VCC_REG 106)
+   (VCC_LO_REG  106)
+   (VCC_HI_REG  107)
+   (VCCZ_REG108)
+   (TBA_REG 109)
+   (TBA_LO_REG  109)
+   (TBA_HI_REG  110)
+   (TMA_REG 111)
+   (TMA_LO_REG  111)
+   (TMA_HI_REG  112)
+   (TTMP0_REG   113)
+   (TTMP11_REG  124)
+   (M0_REG  125)
+   (EXEC_REG126)
+   (EXEC_LO_REG 126)
+   (EXEC_HI_REG 127)
+   (EXECZ_REG   128)
+   (SCC_REG 129)
+   (FIRST_VGPR_REG  160)
+   (LAST_VGPR_REG   415)])
+
+(define_constants
+  [(SP_REGNUM 16)
+   (LR_REGNUM 18)
+   (AP_REGNUM 416)
+   (FP_REGNUM 418)])


Oops, these last two are actually wrong, since AVGPRs were inserted!



Generally, shouldn't there be a better way, that avoids duplication and
instead shares such definitions between 'gcn.h' and 'gcn.md'?


I think this is stuff we originally inherited from Honza's partial 
port and I just never questioned it?


If the definitions are unused then it's fine to remove them (I imagine 
the TBA, TMA, and TTMP registers are also unused). Is there something 
about define_constants that is different to external macros? Does it 
affect the mddump? -fdump output? ICE messages? If there's no difference 
then I'm happy with just deleting the lot and using the gcn.h definitions 
exclusively.



Until that's settled, OK to push the attached
"GCN: Remove 'FIRST_{SGPR,VGPR,AVGPR}_REG', 'LAST_{SGPR,VGPR,AVGPR}_REG' from 
machine description"?
(I assume "still builds" is sufficient validation of this change.)


The patch is OK.

Andrew


Re: GCN: Remove 'SGPR_OR_VGPR_REGNO_P' definition

2024-01-31 Thread Andrew Stubbs

On 31/01/2024 17:12, Thomas Schwinge wrote:

Hi!

On 2018-12-12T11:52:52+, Andrew Stubbs  wrote:

This patch contains the major part of the GCN back-end.  [...]



--- /dev/null
+++ b/gcc/config/gcn/gcn.h



+#define FIRST_SGPR_REG 0
+#define SGPR_REGNO(N)  ((N)+FIRST_SGPR_REG)
+#define LAST_SGPR_REG  101



+#define FIRST_VGPR_REG 160
+#define VGPR_REGNO(N)  ((N)+FIRST_VGPR_REG)
+#define LAST_VGPR_REG  415



+#define SGPR_OR_VGPR_REGNO_P(N) ((N)>=FIRST_VGPR_REG && (N) <= LAST_SGPR_REG)


OK to push the attached "GCN: Remove 'SGPR_OR_VGPR_REGNO_P' definition"?


Seems like it qualifies as "obvious". :)

Andrew



Re: GCN, RDNA 3: Adjust 'sync_compare_and_swap_lds_insn'

2024-01-31 Thread Andrew Stubbs

On 31/01/2024 10:36, Thomas Schwinge wrote:

Hi!

OK to push "GCN, RDNA 3: Adjust 'sync_compare_and_swap_lds_insn'",
see attached?

In pre-RDNA 3 ISA manuals, there are notes for 'DS_CMPST_[...]', like:

 Caution, the order of src and cmp are the *opposite* of the 
BUFFER_ATOMIC_CMPSWAP opcode.

..., and conversely in the RDNA 3 ISA manual, for 'DS_CMPSTORE_[...]':

 In this architecture the order of src and cmp agree with the 
BUFFER_ATOMIC_CMPSWAP opcode.

Is my understanding correct, that this isn't something we have to worry
about at the GCC machine description level; that's resolved at the
assembler level?


Right, the IR uses GCC's operand order and has nothing to do with the 
assembler syntax; the output template does the mapping.



--- a/gcc/config/gcn/gcn.md
+++ b/gcc/config/gcn/gcn.md
@@ -2095,7 +2095,12 @@
   (match_operand:SIDI 3 "register_operand" "  v")]
  UNSPECV_ATOMIC))]
   ""
-  "ds_cmpst_rtn_b %0, %1, %2, %3\;s_waitcnt\tlgkmcnt(0)"
+  {
+if (TARGET_RDNA3)
+  return "ds_cmpstore_rtn_b %0, %1, %2, 
%3\;s_waitcnt\tlgkmcnt(0)";
+else
+  return "ds_cmpst_rtn_b %0, %1, %2, %3\;s_waitcnt\tlgkmcnt(0)";
+  }
   [(set_attr "type" "ds")
(set_attr "length" "12")])


I think you need to swap %2 and %3 in the new format. ds_cmpst matches 
GCC operand order, but ds_cmpstore has "cmp" and "src" reversed.
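
Concretely, the RDNA3 arm would then look something like this (a sketch
only, with the mnemonic's mode suffixes omitted):

    if (TARGET_RDNA3)
      /* ds_cmpstore takes src before cmp, so %3 and %2 are swapped
         relative to the ds_cmpst form.  */
      return "ds_cmpstore_rtn_b %0, %1, %3, %2\;s_waitcnt\tlgkmcnt(0)";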


Andrew


Re: [patch] gcn/gcn-valu.md: Disable fold_left_plus for TARGET_RDNA2_PLUS [PR113615]

2024-01-29 Thread Andrew Stubbs

On 29/01/2024 12:50, Tobias Burnus wrote:

Andrew Stubbs wrote:

/tmp/ccrsHfVQ.mkoffload.2.s:788736:27: error: value out of range
   .amdhsa_next_free_vgpr    516 
^~~ [Obviously, likewise 
for libgomp.c++/..
Hmm, supposedly there are 768 registers allocated in groups of 12, on 
gfx1100 (8 on other devices), which number you have to double on 
wavefrontsize64 because that field actually counts the number of 
32-lane registers. The ISA can only actually reference 256 registers, 
so the limit here should be 512. (The remaining registers are intended 
for other wavefronts to use.)


But 256 is not divisible by 12, and it looks like we've rounded up. I 
guess we need to set the limit at 252 (504), for gfx1100.


BTW: The LLVM source code has,
https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/Utils/AMDGPUBaseInfo.cpp#L1066

unsigned getTotalNumVGPRs(const MCSubtargetInfo *STI) {
   if (STI->getFeatureBits().test(FeatureGFX90AInsts))
     return 512;
   if (!isGFX10Plus(*STI))
     return 256;
   bool IsWave32 = STI->getFeatureBits().test(FeatureWavefrontSize32);
   if (STI->getFeatureBits().test(FeatureGFX11FullVGPRs))
     return IsWave32 ? 1536 : 768;
   return IsWave32 ? 1024 : 512;
}


That matches what we have in libgomp.

LLVM must have another configuration somewhere for how many registers it 
can actually use in code (the ISA can encode 256, but that doesn't mean 
it should always do so). This may be a moot point because allowing too 
many registers limits how many threads can run in parallel, so they may 
have chosen to impose an artificial limit at all times.


In GCC, non-kernel functions are limited to 24 registers (for maximum 
occupancy -- we could probably increase that 50% on "GFX11Full" 
devices), but the kernel entry point is permitted to go crazy.


Andrew


Re: [patch] gcn/gcn-valu.md: Disable fold_left_plus for TARGET_RDNA2_PLUS [PR113615]

2024-01-29 Thread Andrew Stubbs

On 29/01/2024 10:34, Tobias Burnus wrote:

Andrew wrote off list:
   "Vector reductions don't work on RDNA, as is, but they're
    supposed to be disabled by the insn condition"

This patch disables "fold_left_plus_", which is about
vectorization and in the code path shown in the backtrace.
I can also confirm manually that it fixes the ICE I saw and
also the ICE for the testfile that Richard's PR shows at the
end of his backtrace.  (-O3 is needed to trigger the ICE.)

OK for mainline?


OK.


Tobias

* * *

PS: We could add testcase(s) that is/are explicitly compiled with
gfx1100 and/or gfx1030 + '-O3' to ensure that this gets tested
with AMDGPU enabled, but I am not sure whether it is really worthwhile.


PPS: Running the testsuite, I see the following fails with
gfx1100 offloading:

FAIL: libgomp.c/../libgomp.c-c++-common/for-5.c (test for excess errors)
Excess errors:
/tmp/ccrsHfVQ.mkoffload.2.s:788736:27: error: value out of range
   .amdhsa_next_free_vgpr    516 
    ^~~
[Obviously, likewise for libgomp.c++/../libgomp.c-c++-common/for-5.c]

FAIL: libgomp.c/pr104783-2.c execution test
FAIL: libgomp.c/pr104783.c execution test
(The .log unfortunately does not show more details)

FAIL: libgomp.fortran/optional-map.f90   -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions  (test for excess errors)
FAIL: libgomp.fortran/optional-map.f90   -O3 -g  (test for excess errors)
FAIL: libgomp.fortran/target1.f90   -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions  (test for excess errors)
FAIL: libgomp.fortran/target1.f90   -O3 -g  (test for excess errors)

Same 'out of range' as above.

* * *

Manual testing shows for the two execution fails:

  Memory access fault by GPU node-1 (Agent handle: 0x8d1aa0) on address (nil).
  Reason: Page not present or supervisor privilege.

Interestingly, it only fails with -O1 or higher, for -O0 it works.

Tobias


Hmm, supposedly there are 768 registers allocated in groups of 12, on 
gfx1100 (8 on other devices), which number you have to double on 
wavefrontsize64 because that field actually counts the number of 32-lane 
registers. The ISA can only actually reference 256 registers, so the 
limit here should be 512. (The remaining registers are intended for 
other wavefronts to use.)


But 256 is not divisible by 12, and it looks like we've rounded up. I 
guess we need to set the limit at 252 (504), for gfx1100.
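
A quick back-of-envelope check of those numbers (illustration only; the
macro names here are invented for the example, not GCC source):

  #define GFX1100_GRANULE  12    /* gfx1100 allocates VGPRs in groups of 12 */
  #define MAX_ENCODABLE    512   /* 256 wave64 VGPRs, counted in 32-lane units */

  /* Rounding 512 up to a granule boundary gives the 516 the assembler
     rejects above; rounding down gives the safe cap.  */
  int rounded_up = ((MAX_ENCODABLE + GFX1100_GRANULE - 1) / GFX1100_GRANULE)
                   * GFX1100_GRANULE;              /* == 516, out of range */
  int capped     = (MAX_ENCODABLE / GFX1100_GRANULE)
                   * GFX1100_GRANULE;              /* == 504 == 2 * 252    */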


for-5.c is a register allocation nightmare!

Andrew


Re: [wwwdocs][patch] gcc-14/changes.html (amdgcn): Update for gfx1030/gfx1100

2024-01-29 Thread Andrew Stubbs

On 26/01/2024 17:06, Tobias Burnus wrote:

Mention that gfx1030/gfx1100 are now supported.

As noted in another thread, LLVM 15's assembler is now required; before, 
LLVM 13.0.1 would do. (Alternatively, disabling gfx1100 support would 
do.) Hence, the added link to the install documentation.


Comments, suggestions?


I'm happy with the technical correctness of this, but I'm uncertain if 
"which required an update of the default build requirements" is the sort 
of wording we like in the changelog?


Perhaps like this?

  Initial support for the AMD Radeon gfx1030 (RDNA2) and
  gfx1100 (RDNA3) devices has been added.  LLVM 15+
  (assembler and linker) is required to support gfx1100.

Andrew


Re: [patch] install.texi: For gcn, recommend LLVM 15, unless gfx1100 is disabled

2024-01-29 Thread Andrew Stubbs

On 26/01/2024 16:45, Tobias Burnus wrote:

Hi,

Thomas Schwinge wrote:
amdgcn: config.gcc - enable gfx1030 and gfx1100 multilib; add them to 
the docs

...
Further down in that file, we state:
 @anchor{amdgcn-x-amdhsa}
 @heading amdgcn-*-amdhsa
 AMD GCN GPU target.
 
 Instead of GNU Binutils, you will need to install LLVM 13.0.1, or later, [...]


LLVM 13.0.1 may still be fine for gfx1030
('[...]/amdgcn-amdhsa/gfx1030/libgcc' does get built; I've not further
tested), but it's not sufficient for gfx1100 anymore:


Testing with the system compilers here, llvm-mc-14.0.6 also fails while 
llvm-mc-15.0.7 accepts it.



Which version of LLVM should we be recommending?


 >= LLVM 15, I think. How about the following wording? It still mentions 
LLVM 13.0.1 for those that really need it, but for the default setup it 
requires 15+.


OK.

Andrew




Re: [patch][v2] gcn/mkoffload.cc: Fix SRAM_ECC and XNACK handling [PR111966]

2024-01-29 Thread Andrew Stubbs

On 25/01/2024 15:11, Tobias Burnus wrote:

Updated patch enclosed.

Tobias Burnus wrote:
I have now run the attached script; the results are from running yesterday's 
build with both my patch and your patch applied.


(And the now committed gcn-hsa.h patch)

Now the result with the testscript is:

* fiji, gfx1030, gfx1100 work, except for "error: '-mxnack=on' is 
incompatible with ..."
(and link errors for fiji as libgomp is not built, which makes the 
testing a tad less reliable but should be fine).


* (default)/gfx900/gfx906/gfx908: Works, except for -mxnack=on/any due 
to .target / -mattr= mismatch


* gfx90a: simply works

OK for mainline?

Tobias

PS: For the test script, see previous email in the thread; for the 
output of that script, see attachment.


PPS: I hope I got everything right.


OK.

Andrew


Re: [PATCH] Avoid registering unsupported OMP offload devices

2024-01-26 Thread Andrew Stubbs

On 26/01/2024 14:21, Richard Biener wrote:

On Fri, 26 Jan 2024, Jakub Jelinek wrote:


On Fri, Jan 26, 2024 at 03:04:11PM +0100, Richard Biener wrote:

Otherwise it looks reasoanble to me, but let's see what Andrew thinks.


'n' before 'a', please. ;-)


?!


I've misspelled a word.


@@ -1443,6 +1445,16 @@ suitable_hsa_agent_p (hsa_agent_t agent)
switch (device_type)
  {
  case HSA_DEVICE_TYPE_GPU:
+  {
+   char name[64] = "nil";
+   if ((hsa_fns.hsa_agent_get_info_fn (agent, HSA_AGENT_INFO_NAME, name)
+!= HSA_STATUS_SUCCESS)
+   || isa_code (name) == EF_AMDGPU_MACH_UNSUPPORTED)
+ {
+   GCN_DEBUG ("Ignoring unsupported agent '%s'\n", name);
+   return false;
+ }


I must say I know nothing about HSA libraries, but generally if a function
that is supposed to fill some buffer fails the content of the buffer is
undefined/unpredictable.
So it might be better to not initialize name before calling the function
(unless it has to be initialized) and strcpy it to nil or something similar
if it fails.


Yeah, sorry.  Here's a properly engineered variant.  I don't expect that
function to ever fail of course.


I keep crossing emails with you today. :(

This version looks good to me. It has to work for all versions of ROCm 
from maybe 3.8 (the last version known to work with the Fiji devices) 
onwards to forever.


OK.

Andrew


 From 445891ba57e858d980441bd63249e3bc94632db3 Mon Sep 17 00:00:00 2001
From: Richard Biener 
Date: Fri, 26 Jan 2024 12:57:10 +0100
Subject: [PATCH] Avoid registering unsupported OMP offload devices
To: gcc-patches@gcc.gnu.org

The following avoids registering unsupported GCN offload devices
when iterating over available ones.  With a Zen4 desktop CPU
you will have an IGPU (unspported) which will otherwise be made
available.  This causes testcases like
libgomp.c-c++-common/non-rect-loop-1.c which iterate over all
decives to FAIL.

OK?

libgomp/
* plugin/plugin-gcn.c (suitable_hsa_agent_p): Filter out
agents with unsupported ISA.
---
  libgomp/plugin/plugin-gcn.c | 14 ++
  1 file changed, 14 insertions(+)

diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 588358bbbf9..2771123252a 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -1427,6 +1427,8 @@ init_hsa_runtime_functions (void)
  #undef DLSYM_FN
  }
  
+static gcn_isa isa_code (const char *isa);

+
  /* Return true if the agent is a GPU and can accept of concurrent submissions
 from different threads.  */
  
@@ -1443,6 +1445,18 @@ suitable_hsa_agent_p (hsa_agent_t agent)

switch (device_type)
  {
  case HSA_DEVICE_TYPE_GPU:
+  {
+   char name[64];
+   hsa_status_t status
+ = hsa_fns.hsa_agent_get_info_fn (agent, HSA_AGENT_INFO_NAME, name);
+   if (status != HSA_STATUS_SUCCESS
+   || isa_code (name) == EF_AMDGPU_MACH_UNSUPPORTED)
+ {
+   GCN_DEBUG ("Ignoring unsupported agent '%s'\n",
+  status == HSA_STATUS_SUCCESS ? name : "invalid");
+   return false;
+ }
+  }
break;
  case HSA_DEVICE_TYPE_CPU:
if (!support_cpu_devices)




Re: [PATCH] Avoid registering unsupported OMP offload devices

2024-01-26 Thread Andrew Stubbs

On 26/01/2024 14:04, Richard Biener wrote:

On Fri, 26 Jan 2024, Andrew Stubbs wrote:


On 26/01/2024 12:06, Jakub Jelinek wrote:

On Fri, Jan 26, 2024 at 01:00:28PM +0100, Richard Biener wrote:

The following avoids registering unsupported GCN offload devices
when iterating over available ones.  With a Zen4 desktop CPU
you will have an IGPU (unspported) which will otherwise be made
available.  This causes testcases like
libgomp.c-c++-common/non-rect-loop-1.c which iterate over all
decives to FAIL.

I'll run a bootstrap with both pending changes and will do
another round of full libgomp testing with this.

OK if that succeeds?

Thanks,
Richard.

libgomp/
  * plugin/plugin-gcn.c (suitable_hsa_agent_p): Filter out
  agents with unsupported ISA.
---
   libgomp/plugin/plugin-gcn.c | 9 +
   1 file changed, 9 insertions(+)

diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 588358bbbf9..88ed77ff049 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -1427,6 +1427,8 @@ init_hsa_runtime_functions (void)
   #undef DLSYM_FN
   }
   
+static gcn_isa isa_code(const char *isa);


Space before ( please.


+
   /* Return true if the agent is a GPU and can accept of concurrent
   submissions
  from different threads.  */
   @@ -1443,6 +1445,13 @@ suitable_hsa_agent_p (hsa_agent_t agent)
 switch (device_type)
   {
   case HSA_DEVICE_TYPE_GPU:
+  {
+   char name[64];
+   if ((hsa_fns.hsa_agent_get_info_fn (agent, HSA_AGENT_INFO_NAME, name)
+!= HSA_STATUS_SUCCESS)
+   || isa_code (name) == EF_AMDGPU_MACH_UNSUPPORTED)
+ return false;
+  }
 break;
   case HSA_DEVICE_TYPE_CPU:
 if (!support_cpu_devices)


Otherwise it looks reasoanble to me, but let's see what Andrew thinks.


'n' before 'a', please. ;-)


?!


I think we need at least a GCN_DEBUG message when we ignore a GPU device.
Possibly gomp_debug also.


Like the following?  This will do

GCN debug: HSA run-time initialized for GCN
GCN debug: HSA_SYSTEM_INFO_ENDIANNESS: LITTLE
GCN debug: HSA_SYSTEM_INFO_EXTENSIONS: IMAGES
GCN debug: Ignoring unsupported agent 'gfx1036'
GCN debug: There are 1 GCN GPU devices.
GCN debug: Ignoring unsupported agent 'gfx1036'
GCN debug: HSA_AGENT_INFO_NAME: AMD Ryzen 9 7900X 12-Core Processor
...

for debug it's probably not too important to say this twice.

That said, no idea how to do gomp_debug.

OK?


I'm fairly comfortable with the repeat in debug output.

I mentioned gomp_debug because the target-independent GOMP_DEBUG=1 is a 
lot less noisy and actually documented where end-users might find it. 
From the plugin you would call GOMP_PLUGIN_debug (there are examples in 
plugin-nvptx.c). Probably the repeat is less welcome in that case 
though, so perhaps good for a follow-up.
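
As a sketch of what that follow-up could look like (the helper name and
message text here are illustrative, not taken from any actual patch):

  #include "libgomp-plugin.h"  /* declares GOMP_PLUGIN_debug (int, const char *, ...) */

  /* Mention a skipped agent in the GOMP_DEBUG=1 output.  */
  static void
  note_ignored_agent (const char *name)
  {
    GOMP_PLUGIN_debug (0, "GCN: ignoring unsupported agent '%s'\n", name);
  }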




Thanks,
Richard.


 From 7462a8ac70aeebc231c65456b9060d8cbf7d4c50 Mon Sep 17 00:00:00 2001
From: Richard Biener 
Date: Fri, 26 Jan 2024 12:57:10 +0100
Subject: [PATCH] Avoid registering unsupported OMP offload devices
To: gcc-patches@gcc.gnu.org

The following avoids registering unsupported GCN offload devices
when iterating over available ones.  With a Zen4 desktop CPU
you will have an IGPU (unspported) which will otherwise be made
available.  This causes testcases like
libgomp.c-c++-common/non-rect-loop-1.c which iterate over all
decives to FAIL.

libgomp/
* plugin/plugin-gcn.c (suitable_hsa_agent_p): Filter out
agents with unsupported ISA.
---
  libgomp/plugin/plugin-gcn.c | 12 
  1 file changed, 12 insertions(+)

diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 588358bbbf9..2a17dc8accc 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -1427,6 +1427,8 @@ init_hsa_runtime_functions (void)
  #undef DLSYM_FN
  }
  
+static gcn_isa isa_code (const char *isa);

+
  /* Return true if the agent is a GPU and can accept of concurrent submissions
 from different threads.  */
  
@@ -1443,6 +1445,16 @@ suitable_hsa_agent_p (hsa_agent_t agent)

switch (device_type)
  {
  case HSA_DEVICE_TYPE_GPU:
+  {
+   char name[64] = "nil";
+   if ((hsa_fns.hsa_agent_get_info_fn (agent, HSA_AGENT_INFO_NAME, name)
+!= HSA_STATUS_SUCCESS)
+   || isa_code (name) == EF_AMDGPU_MACH_UNSUPPORTED)
+ {
+   GCN_DEBUG ("Ignoring unsupported agent '%s'\n", name);
+   return false;
+ }
+  }
break;
  case HSA_DEVICE_TYPE_CPU:
if (!support_cpu_devices)


Like Jakub says, I think it needs to be like this, to be safe:

  status = hsa_fns.hsa_agent_get_info_fn (...)
  if (status unsuccessful || name unsupported)
if (status successful) output debug
return false

Andrew


Re: [PATCH] Avoid registering unsupported OMP offload devices

2024-01-26 Thread Andrew Stubbs

On 26/01/2024 12:06, Jakub Jelinek wrote:

On Fri, Jan 26, 2024 at 01:00:28PM +0100, Richard Biener wrote:

The following avoids registering unsupported GCN offload devices
when iterating over available ones.  With a Zen4 desktop CPU
you will have an IGPU (unspported) which will otherwise be made
available.  This causes testcases like
libgomp.c-c++-common/non-rect-loop-1.c which iterate over all
decives to FAIL.

I'll run a bootstrap with both pending changes and will do
another round of full libgomp testing with this.

OK if that succeeds?

Thanks,
Richard.

libgomp/
* plugin/plugin-gcn.c (suitable_hsa_agent_p): Filter out
agents with unsupported ISA.
---
  libgomp/plugin/plugin-gcn.c | 9 +
  1 file changed, 9 insertions(+)

diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index 588358bbbf9..88ed77ff049 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -1427,6 +1427,8 @@ init_hsa_runtime_functions (void)
  #undef DLSYM_FN
  }
  
+static gcn_isa isa_code(const char *isa);


Space before ( please.


+
  /* Return true if the agent is a GPU and can accept of concurrent submissions
 from different threads.  */
  
@@ -1443,6 +1445,13 @@ suitable_hsa_agent_p (hsa_agent_t agent)

switch (device_type)
  {
  case HSA_DEVICE_TYPE_GPU:
+  {
+   char name[64];
+   if ((hsa_fns.hsa_agent_get_info_fn (agent, HSA_AGENT_INFO_NAME, name)
+!= HSA_STATUS_SUCCESS)
+   || isa_code (name) == EF_AMDGPU_MACH_UNSUPPORTED)
+ return false;
+  }
break;
  case HSA_DEVICE_TYPE_CPU:
if (!support_cpu_devices)


Otherwise it looks reasoanble to me, but let's see what Andrew thinks.


'n' before 'a', please. ;-)

I think we need at least a GCN_DEBUG message when we ignore a GPU 
device. Possibly gomp_debug also.


Andrew


Re: [PATCH] Fix architecture support in OMP_OFFLOAD_init_device for gcn

2024-01-26 Thread Andrew Stubbs

On 26/01/2024 11:42, Richard Biener wrote:

The following makes the existing architecture support check work
instead of being optimized away (enum vs. -1).  This avoids
later asserts when we assume such devices are never actually
used.

Tested as previously, now the error is

libgomp: GCN fatal error: Unknown GCN agent architecture
Runtime message: HSA_STATUS_ERROR: A generic error has occurred.

now I will figure out why we try to initialize that device.

OK?


OK.



libgomp/
* plugin/plugin-gcn.c
(EF_AMDGPU_MACH::EF_AMDGPU_MACH_UNSUPPORTED): Add.
(isa_code): Return that instead of -1.
(GOMP_OFFLOAD_init_device): Adjust.
---
  libgomp/plugin/plugin-gcn.c | 5 +++--
  1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index db28781dedb..588358bbbf9 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -384,6 +384,7 @@ struct gcn_image_desc
 See https://llvm.org/docs/AMDGPUUsage.html#amdgpu-ef-amdgpu-mach-table */
  
  typedef enum {

+  EF_AMDGPU_MACH_UNSUPPORTED = -1,
EF_AMDGPU_MACH_AMDGCN_GFX803 = 0x02a,
EF_AMDGPU_MACH_AMDGCN_GFX900 = 0x02c,
EF_AMDGPU_MACH_AMDGCN_GFX906 = 0x02f,
@@ -1727,7 +1728,7 @@ isa_code(const char *isa) {
if (!strncmp (isa, gcn_gfx1100_s, gcn_isa_name_len))
  return EF_AMDGPU_MACH_AMDGCN_GFX1100;
  
-  return -1;

+  return EF_AMDGPU_MACH_UNSUPPORTED;
  }
  
  /* CDNA2 devices have twice as many VGPRs compared to older devices.  */

@@ -3374,7 +3375,7 @@ GOMP_OFFLOAD_init_device (int n)
  return hsa_error ("Error querying the name of the agent", status);
  
agent->device_isa = isa_code (agent->name);

-  if (agent->device_isa < 0)
+  if (agent->device_isa == EF_AMDGPU_MACH_UNSUPPORTED)
  return hsa_error ("Unknown GCN agent architecture", HSA_STATUS_ERROR);
  
status = hsa_fns.hsa_agent_get_info_fn (agent->id, HSA_AGENT_INFO_VENDOR_NAME,




Re: [patch] gcn/gcn-hsa.h: Always pass --amdhsa-code-object-version= in ASM_SPEC

2024-01-26 Thread Andrew Stubbs

On 26/01/2024 10:39, Tobias Burnus wrote:

Hi all,

Andrew Stubbs wrote:

On 26/01/2024 07:29, Richard Biener wrote:
If you link against prebuilt objects with COV 5 it seems there's no 
way to

override the COV version GCC uses?  That is, do we want to add
a -mcode-object-version=... option to allow the user to override this
(and ABI_VERSION_SPEC honoring that, if specified and of course
mkoffload following suit)?


For completeness, I added such a feature, see attachment. (Actually, 
'=0' could be permitted for mkoffload without "-g" debugging enabled.)


However, the real problem is that one usually also has libraries built 
with the default, such as libc, libm, libgomp, ... Thus, specifying 
anything else but GCC's default is likely to break.


Hence and also because of the following, I think it doesn't make sense 
to add:


We don't have a stable ABI, so trying to link against foreign binaries 
is already a problem. Most recently, the SIMD clone implementation 
required a change to the procedure calling ABI, the reverse-offload 
changes reimplemented the stack setup, and the low-latency memory 
patches changed the way we use local memories and needed more info 
passed into the device runtime. I expect more of this in future.



PS: The original patch has been committed as r14-8449-g4b5650acb31072.

Tobias


Agreed, there's no point in having a knob that only has one valid 
setting, especially when we want to be able to change that setting 
without breaking third-party scripts that choose to set that knob "for 
completeness", or something.


The toolchain can have an opinion about which is the correct COV, and I 
think not relying on the LLVM default choice makes sense also. I think 
COV 5 is only a small update, but there's no reason to imagine that COV6 
will Just Work (COV2 to COV3, and COV3 to COV4 required real effort).


We can move on to COV5 for GCC 15, probably. I'm not aware of any great 
blocker, but it sets a minimum LLVM version.


Andrew


Re: [PATCH] Avoid using an unsupported agent when offloading to GCN

2024-01-26 Thread Andrew Stubbs

On 26/01/2024 10:40, Richard Biener wrote:

The following avoids selecting an unsupported agent early, avoiding
later asserts when we rely on it being supported.

tested on x86_64-unknown-linux-gnu -> amdhsa-gcn on gfx1060

that's the alternative to the other patch.  I do indeed seem to get
the other (unsupported) agent selected somehow after the other supported
agent finished a kernel run.  Not sure if it's the CPU or the IGPU though.

OK?  Which variant?


So, looking at it again, the original intent of the assert was to alert 
toolchain developers that they missed adding a new name when porting to 
a new device, but I concur that it's not ideal when the assert 
encounters an unknown device in the wild.


However, if we're trying to do something more useful than merely fixing 
an ugly error message, maybe we should look at removing unsupported 
devices in "suitable_hsa_agent_p" instead? Unsupported GPUs wouldn't be 
assigned a device number at all.


Probably devices that are GPUs but skipped because they are unsupported 
should be mentioned on GOMP_DEBUG (as well as GCN_DEBUG)?


The goal should be that folks with your twin-GPU setup shouldn't have to 
work around it, but I don't really want to remove the message for people 
who only have one device but don't realize it is unsupported.


On the other hand, if a user has two devices that *are* supported, but 
the second one is preferred, they'll have to set OMP_DEFAULT_DEVICE 
explicitly, and is this so different?
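
For reference, that explicit selection can be done either via the
OMP_DEFAULT_DEVICE environment variable or from user code; a minimal
sketch of the latter:

  #include <omp.h>

  int
  main (void)
  {
    if (omp_get_num_devices () > 1)
      omp_set_default_device (1);   /* prefer the second supported device */
    /* ... offload regions now target device 1 by default ... */
    return 0;
  }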


As a user, WDYT?

Andrew



libgomp/
* plugin/plugin-gcn.c (get_agent_info): When the agent isn't supported
return NULL.
---
  libgomp/plugin/plugin-gcn.c | 7 +++
  1 file changed, 7 insertions(+)

diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index d8c3907c108..f453f630e06 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -1036,6 +1036,8 @@ print_kernel_dispatch (struct kernel_dispatch *dispatch, 
unsigned indent)
  /* }}}  */
  /* {{{ Utility functions  */
  
+static const char* isa_hsa_name (int isa);

+
  /* Cast the thread local storage to gcn_thread.  */
  
  static inline struct gcn_thread *

@@ -1589,6 +1591,11 @@ get_agent_info (int n)
GOMP_PLUGIN_error ("Attempt to use an uninitialized GCN agent.");
return NULL;
  }
+  if (!isa_hsa_name (hsa_context.agents[n].device_isa))
+{
+  GOMP_PLUGIN_error ("Attempt to use an unsupported GCN agent.");
+  return NULL;
+}
return _context.agents[n];
  }
  




Re: [PATCH] Avoid assert for unknown device ISAs in GCN libgomp plugin

2024-01-26 Thread Andrew Stubbs

On 26/01/2024 10:30, Richard Biener wrote:

When the agent reports a device ISA we don't support avoid hitting
an assert, instead report the raw integers as error.  I'm not sure
whether -1 is special as I didn't figure where that field is
initialized.  But I guess since agents are not rejected upfront
when registering them I might be able to force execution to an
unsupported one.

An alternative would maybe be to change get_agent_info () to return NULL
for unsupported ISAs?

Tested on x86_64-unknown-linux-gnu -> amdgcn-hsa with gfx1060

OK?


OK, thanks.

Andrew



Thanks,
Richard.

libgomp/
* plugin/plugin-gcn.c (isa_matches_agent): Avoid asserting we
only get supported device ISAs.  Report raw numbers when not.
---
  libgomp/plugin/plugin-gcn.c | 16 +---
  1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/libgomp/plugin/plugin-gcn.c b/libgomp/plugin/plugin-gcn.c
index db28781dedb..d8c3907c108 100644
--- a/libgomp/plugin/plugin-gcn.c
+++ b/libgomp/plugin/plugin-gcn.c
@@ -2459,13 +2459,15 @@ isa_matches_agent (struct agent_info *agent, Elf64_Ehdr 
*image)
char msg[120];
const char *agent_isa_s = isa_hsa_name (agent->device_isa);
const char *agent_isa_gcc_s = isa_gcc_name (agent->device_isa);
-  assert (agent_isa_s);
-  assert (agent_isa_gcc_s);
-
-  snprintf (msg, sizeof msg,
-   "GCN code object ISA '%s' does not match GPU ISA '%s'.\n"
-   "Try to recompile with '-foffload-options=-march=%s'.\n",
-   isa_s, agent_isa_s, agent_isa_gcc_s);
+  if (agent_isa_s && agent_isa_gcc_s)
+   snprintf (msg, sizeof msg,
+ "GCN code object ISA '%s' does not match GPU ISA '%s'.\n"
+ "Try to recompile with '-foffload-options=-march=%s'.\n",
+ isa_s, agent_isa_s, agent_isa_gcc_s);
+  else
+   snprintf (msg, sizeof msg,
+ "GCN code object ISA '%s' (%d) does not match GPU ISA %d.\n",
+ isa_s, isa_field, agent->device_isa);
  
hsa_error (msg, HSA_STATUS_ERROR);

return false;




Re: [PATCH] amdgcn: additional gfx1100 support

2024-01-26 Thread Andrew Stubbs

On 26/01/2024 10:22, Richard Biener wrote:

On Fri, 26 Jan 2024, Andrew Stubbs wrote:


On 26/01/2024 09:45, Richard Biener wrote:

On Fri, 26 Jan 2024, Richard Biener wrote:

  === libgomp Summary ===

# of expected passes29126
# of unexpected failures697
# of unexpected successes   1
# of expected failures  703
# of unresolved testcases   318
# of unsupported tests  766

full summary attached (compressed).  Even compressed libgomp.log is
too big to send.

Richard.


I think this is good enough to start with. PA reported clean results for
everything except gfx900 (looks like an unrelated issue).

I'll go ahead and commit the patch.

Hopefully Tobias's patch has already trimmed all the "-g" failures from that
list.


Should I open a bug for the ICE?  That's responsible for quite a number
of failures as well.


The broken vector reduction instruction? It's a known issue (RDNA 
doesn't support those instructions anymore, and somehow disabling the 
insn isn't enough to stop them being generated), but it doesn't have a 
tracking number, so why not?


Thanks

Andrew



Re: [PATCH] amdgcn: additional gfx1100 support

2024-01-26 Thread Andrew Stubbs

On 26/01/2024 09:45, Richard Biener wrote:

On Fri, 26 Jan 2024, Richard Biener wrote:

 === libgomp Summary ===

# of expected passes29126
# of unexpected failures697
# of unexpected successes   1
# of expected failures  703
# of unresolved testcases   318
# of unsupported tests  766

full summary attached (compressed).  Even compressed libgomp.log is
too big to send.

Richard.


I think this is good enough to start with. PA reported clean results for 
everything except gfx900 (looks like an unrelated issue).


I'll go ahead and commit the patch.

Hopefully Tobias's patch has already trimmed all the "-g" failures from 
that list.


Andrew


Re: [patch] gcn/gcn-hsa.h: Always pass --amdhsa-code-object-version= in ASM_SPEC

2024-01-26 Thread Andrew Stubbs

On 26/01/2024 07:29, Richard Biener wrote:

On Fri, Jan 26, 2024 at 12:04 AM Tobias Burnus  wrote:


When targeting AMD GPUs, the LLVM assembler (and linker) are used.

Two days ago LLVM changed the default for the AMDHSA code object
version (COV) from 4 to 5.

In principle, we do not care which COV is used as long as it works;
unfortunately, "mkoffload.cc" also generates an object file directly,
bypassing the AMD GPU compiler as it copies debugging data to that
file. That object file must have the same COV version (ELF ABI version)
as compiler + llvm-mc assembler generated files.

In order to ensure those are the same, this patch forces the use of
COV 4 instead of using the default. Once GCC requires LLVM >= 14
instead of LLVM >= 13.0.1 we could change it. (Assuming that COV 5
is sufficiently stable in LLVM 14.) - But for now COV 4 will do.

If you wonder how this LLVM issue shows up, simply compile any OpenMP
or OpenACC program with AMD GPU offloading and enable debugging ("-g"),
e.g.
   gcc -fopenmp -g test.f90 -foffload=amdgcn-amdhsa 
-foffload-options=-march=gfx908

With LLVM main (to become LLVM 18), you will then get the error:

   ld: error: incompatible ABI version: /tmp/ccAKx5cz.mkoffload.dbg.o

OK for mainline?


If you link against prebuilt objects with COV 5 it seems there's no way to
override the COV version GCC uses?  That is, do we want to add
a -mcode-object-version=... option to allow the user to override this
(and ABI_VERSION_SPEC honoring that, if specified and of course
mkoffload following suit)?

Otherwise looks OK in the meantime.


We don't have a stable ABI, so trying to link against foreign binaries 
is already a problem. Most recently, the SIMD clone implementation 
required a change to the procedure calling ABI, the reverse-offload 
changes reimplemented the stack setup, and the low-latency memory 
patches changed the way we use local memories and needed more info 
passed into the device runtime. I expect more of this in future.


Compatibility across GCC versions doesn't really exist, and 
compatibility with LLVM-binaries is a non-starter.


Andrew


Re: [patch] gcn/gcn-hsa.h: Always pass --amdhsa-code-object-version= in ASM_SPEC

2024-01-26 Thread Andrew Stubbs

On 25/01/2024 23:03, Tobias Burnus wrote:

When targeting AMD GPUs, the LLVM assembler (and linker) are used.

Two days ago LLVM changed the default for the AMDHSA code object version (COV) from 4 to 5. In principle, we do not 
care which COV is used as long as it works; unfortunately, 
"mkoffload.cc" also generates an object file directly, bypassing the AMD 
GPU compiler as it copies debugging data to that file. That object file 
must have the same COV version (ELF ABI version) as compiler + llvm-mc 
assembler generated files. In order to ensure those are the same, this 
patch forces the use of COV 4 instead of using the default. Once GCC 
requires LLVM >= 14 instead of LLVM >= 13.0.1 we could change it. 
(Assuming that COV 5 is sufficiently stable in LLVM 14.) - But for now 
COV 4 will do.

If you wonder how this LLVM issue shows up, simply compile any OpenMP
or OpenACC program with AMD GPU offloading and enable debugging ("-g"),
e.g.
   gcc -fopenmp -g test.f90 -foffload=amdgcn-amdhsa 
-foffload-options=-march=gfx908

With LLVM main (to become LLVM 18), you will then get the error:

   ld: error: incompatible ABI version: /tmp/ccAKx5cz.mkoffload.dbg.o

OK for mainline?


Looks good to me.

The alternative would be to copy the elf flags from another object file; 
that probably has its own pitfalls.


OK.

Andrew


Re: [patch] gcn: Add missing space to ASM_SPEC in gcn-hsa.h

2024-01-25 Thread Andrew Stubbs

On 25/01/2024 12:44, Tobias Burnus wrote:

This patch avoids assembler warnings for gfx908 and gfx90a such as
   '-xnack-mattr=-sramecc' is not a recognized feature for this target (ignoring feature)
as we pass   -mattr=-xnack-mattr=-sramecc  to the llvm-mc assembler.

Solution: Add a space before the second '-mattr='.

OK for mainline?


OK.

Andrew


Re: [patch] gcn/mkoffload.cc: Fix SRAM_ECC and XNACK handling [PR111966]

2024-01-25 Thread Andrew Stubbs

On 24/01/2024 22:12, Tobias Burnus wrote:

This patch fixes "-g" debug compilation for gfx1100 and gfx1030,
which fail to link when "-g" is specified. The reason is:

When using gfx1100 and compiling with '-g' I was running into an error
because the eflags (ELF flags) used for the debugger file have additional
flags set, contrary to the compiled files; mkoffload writes files
itself, hence it also needs to get the elf flags right.

It turned out that the ASM_SPEC handling was insufficiently replicated
in mkoffload, leading to issues with gfx1100 and gfx1030. I think in
some corner case, gfx906 also behaved differently; for gfx900 and fiji,
the eflags were different before, but got reset inside
copy_early_debug_info such that those difference did not matter.

OK for mainline?


I've got so confused trying to figure out this stuff and how it works 
with different LLVM, different defaults, different devices.


I think this patch is fine, but we should wait until we can test it on 
all those devices.


Andrew


Tobias

PS: I tried hard to look at the ASM_SPEC and played with different
options, looking at what really got passed to the assembler, but I
might have missed something as the code is somewhat confusing. Naming
wise, there is both UNSUPPORTED and UNSET for the same thing; it should
be a tad more consistent (flag = UNSUPPORTED, SET/TEST functions: UNSET),
still, one could also argue that a single name would do.


Sometimes not passing the -mattr flag gives "any", and sometimes 
"unsupported", and sometimes leaves the flag unset. I think it's changed 
over time as well, but mkoffload has to match precisely or it won't link. :(



PPS: I think the PR is about other things in addition, but it also
kind of covers this "-g" issue and the one of the previous commit. Even
if not directly addressing the issue, it is related, and having the
commits listed there makes sense IMHO.


[PATCH] amdgcn: additional gfx1100 support

2024-01-24 Thread Andrew Stubbs
This is enough to get gfx1100 working for most purposes, on top of the
patch that Tobias committed a week or so ago; there are still some test
failures to investigate, and probably some tuning to do.

It might also get gfx1030 working too. @Richi, could you test it,
please?

I can't test the other multilibs right now. @PA, can you test it please?

I can self-approve the patch, but I'll hold off the commit until the
test results come back.

Andrew

gcc/ChangeLog:

* config/gcn/gcn-opts.h (TARGET_PACKED_WORK_ITEMS): Add TARGET_RDNA3.
* config/gcn/gcn-valu.md (all_convert): New iterator.
(2): New
define_expand, and rename the old one to ...
(*_sdwa): ... this.
(extend2): Likewise, to ...
(extend_sdwa): .. this.
(*_shift): New.
* config/gcn/gcn.cc (gcn_global_address_p): Use "offsetbits" correctly.
(gcn_hsa_declare_function_name): Update the vgpr counting for gfx1100.
* config/gcn/gcn.md (mulhisi3): Disable on RDNA3.
(mulqihi3_scalar): Likewise.

libgcc/ChangeLog:

* config/gcn/amdgcn_veclib.h (CDNA3_PLUS): Handle RDNA3.

libgomp/ChangeLog:

* config/gcn/time.c (RTC_TICKS): Configure RDNA3.
(omp_get_wtime): Add RDNA3-compatible variant.
* plugin/plugin-gcn.c (max_isa_vgprs): Tune for gfx1030 and gfx1100.

Signed-off-by:  Andrew Stubbs 
---
 gcc/config/gcn/gcn-opts.h |  2 +-
 gcc/config/gcn/gcn-valu.md| 41 ---
 gcc/config/gcn/gcn.cc | 31 ---
 gcc/config/gcn/gcn.md |  4 +--
 libgcc/config/gcn/amdgcn_veclib.h |  2 +-
 libgomp/config/gcn/time.c | 10 
 libgomp/plugin/plugin-gcn.c   |  6 +++--
 7 files changed, 77 insertions(+), 19 deletions(-)

diff --git a/gcc/config/gcn/gcn-opts.h b/gcc/config/gcn/gcn-opts.h
index 79fbda3ab25..6be2c9204fa 100644
--- a/gcc/config/gcn/gcn-opts.h
+++ b/gcc/config/gcn/gcn-opts.h
@@ -62,7 +62,7 @@ extern enum gcn_isa {
 
 
 #define TARGET_M0_LDS_LIMIT (TARGET_GCN3)
-#define TARGET_PACKED_WORK_ITEMS (TARGET_CDNA2_PLUS)
+#define TARGET_PACKED_WORK_ITEMS (TARGET_CDNA2_PLUS || TARGET_RDNA3)
 
 #define TARGET_XNACK (flag_xnack != HSACO_ATTR_OFF)
 
diff --git a/gcc/config/gcn/gcn-valu.md b/gcc/config/gcn/gcn-valu.md
index 3d5b6271ee6..cd027f8b369 100644
--- a/gcc/config/gcn/gcn-valu.md
+++ b/gcc/config/gcn/gcn-valu.md
@@ -3555,30 +3555,63 @@
 ;; }}}
 ;; {{{ Int/int conversions
 
+(define_code_iterator all_convert [truncate zero_extend sign_extend])
 (define_code_iterator zero_convert [truncate zero_extend])
 (define_code_attr convop [
(sign_extend "extend")
(zero_extend "zero_extend")
(truncate "trunc")])
 
-(define_insn "2"
+(define_expand "2"
+  [(set (match_operand:V_INT_1REG 0 "register_operand"  "=v")
+(all_convert:V_INT_1REG
+ (match_operand:V_INT_1REG_ALT 1 "gcn_alu_operand" " v")))]
+  "")
+
+(define_insn "*_sdwa"
   [(set (match_operand:V_INT_1REG 0 "register_operand"  "=v")
 (zero_convert:V_INT_1REG
  (match_operand:V_INT_1REG_ALT 1 "gcn_alu_operand" " v")))]
-  ""
+  "!TARGET_RDNA3"
   "v_mov_b32_sdwa\t%0, %1 dst_sel: dst_unused:UNUSED_PAD 
src0_sel:"
   [(set_attr "type" "vop_sdwa")
(set_attr "length" "8")])
 
-(define_insn "extend2"
+(define_insn "extend_sdwa"
   [(set (match_operand:V_INT_1REG 0 "register_operand" "=v")
 (sign_extend:V_INT_1REG
  (match_operand:V_INT_1REG_ALT 1 "gcn_alu_operand" " v")))]
-  ""
+  "!TARGET_RDNA3"
   "v_mov_b32_sdwa\t%0, sext(%1) src0_sel:"
   [(set_attr "type" "vop_sdwa")
(set_attr "length" "8")])
 
+(define_insn "*_shift"
+  [(set (match_operand:V_INT_1REG 0 "register_operand"  "=v")
+(all_convert:V_INT_1REG
+ (match_operand:V_INT_1REG_ALT 1 "gcn_alu_operand" " v")))]
+  "TARGET_RDNA3"
+  {
+enum {extend, zero_extend, trunc};
+rtx shiftwidth = (mode == QImode
+ || mode == QImode
+ ? GEN_INT (24)
+ : mode == HImode
+   || mode == HImode
+ ? GEN_INT (16)
+ : NULL);
+operands[2] = shiftwidth;
+
+if (!shiftwidth)
+  return "v_mov_b32 %0, %1";
+else if ( == extend ||  == trunc)
+  return "v_lshlrev_b32\t%0, %2, %1\;v_ashrrev_i32\t%0, %2, %0";
+else
+  return "v_lshlrev_b32\t%0, %2, %1\;v_lshrrev_b32\t%0, %2, %0";
+  }
+  [(set_attr "type" "mult")
+   (set_attr "length" "8&quo

[PATCH] Update my email in MAINTAINERS

2024-01-23 Thread Andrew Stubbs
I've moved to BayLibre and don't have access to my codesourcery.com
address, at least for a while.

ChangeLog:

* MAINTAINERS: Update

Signed-off-by:  Andrew Stubbs 
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index cb5a42501dd..547237e0cf8 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -55,7 +55,7 @@ aarch64 port  Marcus Shawcroft

 aarch64 port   Kyrylo Tkachov  
 alpha port Richard Henderson   
 amdgcn portJulian Brown
-amdgcn portAndrew Stubbs   
+amdgcn portAndrew Stubbs   
 arc port   Joern Rennecke  
 arc port   Claudiu Zissulescu  
 arm port   Nick Clifton
-- 
2.41.0



Re: [PATCH] gcn: Fix a warning

2024-01-23 Thread Andrew Stubbs
On Tue, 23 Jan 2024 at 10:01, Jakub Jelinek  wrote:

> Hi!
>
> I see
> ../../gcc/config/gcn/gcn.cc: In function ‘void
> gcn_hsa_declare_function_name(FILE*, const char*, tree)’:
> ../../gcc/config/gcn/gcn.cc:6568:67: warning: unused parameter ‘decl’
> [-Wunused-parameter]
>  6568 | gcn_hsa_declare_function_name (FILE *file, const char *name, tree
> decl)
>   |
> ~^~~~
> warning presumably since r14-6945-gc659dd8bfb55e02a1b97407c1c28f7a0e8f7f09b
> Previously, the argument was anonymous, but now it is passed to a macro
> which ignores it, so I think we should go with ATTRIBUTE_UNUSED.
>
> Ok for trunk?
>

OK.

Andrew


Re: [Patch] xfail libgomp.c/declare-variant-4-{fiji,gfx803}.c

2024-01-22 Thread Andrew Stubbs
On Fri, 19 Jan 2024 at 18:27, Tobias Burnus  wrote:

> The problem is as described at
> https://gcc.gnu.org/install/specific.html#amdgcn-x-amdhsa
>
> "Note that support for Fiji devices has been removed in ROCm 4.0 and
> support in LLVM is deprecated and will be removed in LLVM 18."
>
> Therefore, GCC is no longer build with Fiji (gfx803) support by default
> – and the -march=fiji testcases now fails as the -lgomp multilib for
> Fiji is not available. (That is: It fails, unless Fiji support has been
> enabled manually.)
>
> Andrew mentioned that there is a PR about this, but I couldn't find it.
> If someone can, I am happy to add it to the changelog.
>
> OK for mainline?
>

OK. There's probably a bikeshed to paint here, but the tests are destined
to get deleted, so whatever.

Andrew


Re: [Patch] GCN: Add pre-initial support for gfx1100

2024-01-08 Thread Andrew Stubbs

On 07/01/2024 19:20, Tobias Burnus wrote:
ROCm meanwhile also supports some consumer cards; besides the semi-new 
gfx1030, support for gfx1100 was added more recently (in ROCm 5.7.1 for 
"Ubuntu 22.04 only", and without that parenthetical restriction since ROCm 6.0.0).


GCC already has very limited support for gfx1030 - whose multilib support 
is - on purpose - not yet enabled by default and is WIP.


The attached patch now adds gfx1100 on top of it, assuming that it 
mostly behaves the same as gfx1030. This is really WIP as there are 
known build (assembly) issues (see below) and not only "just" runtime 
issues.


gfx1100 differs at least in the following aspects from the previously 
supported cards:


* gfx1100 has an 'architected flat scratch' which is different from 
'absolute flat scratch' which all others (but fiji: 'offset flat 
scratch') have. Hence, '.amdhsa_reserve_flat_scratch 0'

has to be excluded to avoid assembly errors.

* gfx1100 also does not support 'v_mov_b32_sdwa', failing to assemble
   libc/argz/libc_a-argz_stringify.o with:
   "sdwa variant of this instruction is not supported"
→ This has not been addressed in the patch, hence, specifying gfx1100 in 
--with-multilib-list= will fail to build when an in-tree newlib is built.


* * *

The attached patch additionally fixes one issue in libgomp (the string-length 
'len' constant is too short for gfx1030 (and gfx1100) = 7 characters) and 
it includes the fix for __gfx1030__ not being defined, which I have 
submitted separately (yesterday).


With the caveat that gfx1100 is even less usable than gfx1030 and it 
won't build newlib, is it nonetheless


   OK for mainline ?

(As gfx1100 is not enabled by default in multilib, a regular build 
will not fail and I think the *.md issue can be addressed separately.)


This looks fine to me. I know there will be things that need fixing for 
both experimental architectures.


Andrew

P.S. Apologies, but I think my commits today conflict a little; you 
should be able to drop the hunks that patch the now-deleted code.


[committed] amdgcn: Match new XNACK defaults in mkoffload

2024-01-08 Thread Andrew Stubbs
This patch fixes build failures with the offload toolchain since my 
recent XNACK patch. The problem was simply that mkoffload made 
out-of-date assumptions about the -mxnack defaults. This patch fixes the 
mismatch.


Committed to mainline.

Andrew

[committed] amdgcn: Don't double-count AVGPRs

2024-01-08 Thread Andrew Stubbs
This patch fixes a runtime error with offload kernels that use a lot of 
registers, such as libgomp.fortran/target1.f90.


Committed to mainline.

Andrew

amdgcn: Don't double-count AVGPRs

CDNA2 devices have VGPRs and AVGPRs combined into a single hardware register
file (they're separate in CDNA1).  I originally thought they were counted
separately in the vgpr_count and agpr_count metadata fields, and therefore
mkoffload had to account for this when passing the values to libgomp.  However,
that wasn't the case, and this code should have been removed when I corrected
the calculations in gcn.cc.  Fixing the error now.

gcc/ChangeLog:

* config/gcn/mkoffload.cc (isa_has_combined_avgprs): Delete.
(process_asm): Don't count avgprs.

diff --git a/gcc/config/gcn/mkoffload.cc b/gcc/config/gcn/mkoffload.cc
index 3341c0d34eb..03cd040dbd2 100644
--- a/gcc/config/gcn/mkoffload.cc
+++ b/gcc/config/gcn/mkoffload.cc
@@ -471,26 +471,6 @@ copy_early_debug_info (const char *infile, const char 
*outfile)
   return true;
 }
 
-/* CDNA2 devices have twice as many VGPRs compared to older devices,
-   but the AVGPRS are allocated from the same pool.  */
-
-static int
-isa_has_combined_avgprs (int isa)
-{
-  switch (isa)
-{
-case EF_AMDGPU_MACH_AMDGCN_GFX803:
-case EF_AMDGPU_MACH_AMDGCN_GFX900:
-case EF_AMDGPU_MACH_AMDGCN_GFX906:
-case EF_AMDGPU_MACH_AMDGCN_GFX908:
-case EF_AMDGPU_MACH_AMDGCN_GFX1030:
-  return false;
-case EF_AMDGPU_MACH_AMDGCN_GFX90a:
-  return true;
-}
-  fatal_error (input_location, "unhandled ISA in isa_has_combined_avgprs");
-}
-
 /* Parse an input assembler file, extract the offload tables etc.,
and output (1) the assembler code, minus the tables (which can contain
problematic relocations), and (2) a C file with the offload tables
@@ -516,7 +496,6 @@ process_asm (FILE *in, FILE *out, FILE *cfile)
   {
 int sgpr_count;
 int vgpr_count;
-int avgpr_count;
 char *kernel_name;
   } regcount = { -1, -1, NULL };
 
@@ -564,12 +543,6 @@ process_asm (FILE *in, FILE *out, FILE *cfile)
gcc_assert (regcount.kernel_name);
break;
  }
-   else if (sscanf (buf, " .agpr_count: %d\n",
-_count) == 1)
- {
-   gcc_assert (regcount.kernel_name);
-   break;
- }
 
break;
  }
@@ -712,8 +685,6 @@ process_asm (FILE *in, FILE *out, FILE *cfile)
  {
sgpr_count = regcounts[j].sgpr_count;
vgpr_count = regcounts[j].vgpr_count;
-   if (isa_has_combined_avgprs (elf_arch))
- vgpr_count += regcounts[j].avgpr_count;
break;
  }
 


Re: [Patch] gcn.h: Add builtin_define ("__gfx1030")

2024-01-08 Thread Andrew Stubbs

On 06/01/2024 21:20, Tobias Burnus wrote:

Hi Andrew,

I just spotted that this define was missing.

OK for mainline?


OK.

Andrew


[committed] amdgcn: XNACK support

2023-12-13 Thread Andrew Stubbs
Some AMD GCN devices support an "XNACK" mode in which the device can 
handle page-misses (and maybe other traps in memory instructions), but 
it's not completely invisible to software.


We need this now to support OpenMP Unified Shared Memory (I plan to post 
updated patches for that in January), and in future it may enable 
support for APU devices (such as MI300).


The first patch ensures that load instructions are "restartable", 
meaning that the outputs do not overwrite the input registers (address 
and offsets). This maps pretty much exactly to the GCC "early-clobber" 
concept, so we just need to add additional alternatives and then not 
generate problem instructions explicitly.
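
(A sketch of the idea in GCC constraint syntax, for illustration only; it is 
not code from the patch and the function is invented.  The "&" early-clobber 
marker tells the register allocator that the output is written before the 
inputs are finished with, so the load result can never be assigned to the 
register that holds the address, which is what makes the load restartable 
after an XNACK page fault.)

  /* Hypothetical inline-asm illustration of an early-clobber VGPR output.  */
  void
  xnack_load_sketch (int *addr, int *out)
  {
    asm volatile ("flat_load_dword\t%0, %1 glc\n\t"
                  "s_waitcnt\tvmcnt(0)"
                  : "=&v" (*out)   /* "&" = early-clobber: never reuse %1 */
                  : "v" (addr));
  }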


The second patch is a workaround for the register allocation problem I 
asked about on gcc@ yesterday.  The early clobber increases register 
pressure which causes compile failure when LRA is unable to spill 
additional registers without needing yet more registers. This doesn't 
become a problem on gfx90a (MI200) so soon due to the additional AVGPR 
spill registers, and that's the only device that really supports USM, so 
far, so limiting XNACK to that device will work for now.


The -mxnack option was already added as a placeholder, so not much is 
needed there.


Committed to master. An older version of these patches is already 
committed to devel/omp/gcc-13 (OG13).


Andrew

amdgcn: Work around XNACK register allocation problem

The extra register pressure is causing infinite loops in some cases, especially
at -O0.  I have not yet observed any issue on devices that have AVGPRs for
spilling, and XNACK is only really useful on those devices anyway, so change
the defaults.

gcc/ChangeLog:

* config/gcn/gcn-hsa.h (NO_XNACK): Change the defaults.
* config/gcn/gcn-opts.h (enum hsaco_attr_type): Add HSACO_ATTR_DEFAULT.
* config/gcn/gcn.cc (gcn_option_override): Set the default flag_xnack.
* config/gcn/gcn.opt: Add -mxnack=default.
* doc/invoke.texi: Document the -mxnack default.

diff --git a/gcc/config/gcn/gcn-hsa.h b/gcc/config/gcn/gcn-hsa.h
index bfb104526c5..b44d42b02d6 100644
--- a/gcc/config/gcn/gcn-hsa.h
+++ b/gcc/config/gcn/gcn-hsa.h
@@ -75,7 +75,9 @@ extern unsigned int gcn_local_sym_hash (const char *name);
supported for gcn.  */
 #define GOMP_SELF_SPECS ""
 
-#define NO_XNACK "march=fiji:;march=gfx1030:;"
+#define NO_XNACK "march=fiji:;march=gfx1030:;" \
+/* These match the defaults set in gcn.cc.  */ \
+
"!mxnack*|mxnack=default:%{march=gfx900|march=gfx906|march=gfx908:-mattr=-xnack};"
 #define NO_SRAM_ECC "!march=*:;march=fiji:;march=gfx900:;march=gfx906:;"
 
 /* In HSACOv4 no attribute setting means the binary supports "any" hardware
diff --git a/gcc/config/gcn/gcn-opts.h b/gcc/config/gcn/gcn-opts.h
index b4f494d868c..634cec6d832 100644
--- a/gcc/config/gcn/gcn-opts.h
+++ b/gcc/config/gcn/gcn-opts.h
@@ -65,7 +65,8 @@ enum hsaco_attr_type
 {
   HSACO_ATTR_OFF,
   HSACO_ATTR_ON,
-  HSACO_ATTR_ANY
+  HSACO_ATTR_ANY,
+  HSACO_ATTR_DEFAULT
 };
 
 #endif
diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
index d92cd01d03f..b67551a2e8e 100644
--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc
@@ -172,6 +172,29 @@ gcn_option_override (void)
   /* Allow HSACO_ATTR_ANY silently because that's the default.  */
   flag_xnack = HSACO_ATTR_OFF;
 }
+
+  /* There's no need for XNACK on devices without USM, and there are register
+ allocation problems caused by the early-clobber when AVGPR spills are not
+ available.
+ FIXME: can the regalloc mean the default can be really "any"?  */
+  if (flag_xnack == HSACO_ATTR_DEFAULT)
+switch (gcn_arch)
+  {
+  case PROCESSOR_FIJI:
+  case PROCESSOR_VEGA10:
+  case PROCESSOR_VEGA20:
+  case PROCESSOR_GFX908:
+   flag_xnack = HSACO_ATTR_OFF;
+   break;
+  case PROCESSOR_GFX90a:
+   flag_xnack = HSACO_ATTR_ANY;
+   break;
+  default:
+   gcc_unreachable ();
+  }
+
+  if (flag_sram_ecc == HSACO_ATTR_DEFAULT)
+flag_sram_ecc = HSACO_ATTR_ANY;
 }
 
 /* }}}  */
diff --git a/gcc/config/gcn/gcn.opt b/gcc/config/gcn/gcn.opt
index c356a0cbb08..32486d9615f 100644
--- a/gcc/config/gcn/gcn.opt
+++ b/gcc/config/gcn/gcn.opt
@@ -97,9 +97,12 @@ Enum(hsaco_attr_type) String(on) Value(HSACO_ATTR_ON)
 EnumValue
 Enum(hsaco_attr_type) String(any) Value(HSACO_ATTR_ANY)
 
+EnumValue
+Enum(hsaco_attr_type) String(default) Value(HSACO_ATTR_DEFAULT)
+
 mxnack=
-Target RejectNegative Joined ToLower Enum(hsaco_attr_type) Var(flag_xnack) 
Init(HSACO_ATTR_ANY)
-Compile for devices requiring XNACK enabled. Default \"any\".
+Target RejectNegative Joined ToLower Enum(hsaco_attr_type) Var(flag_xnack) 
Init(HSACO_ATTR_DEFAULT)
+Compile for devices requiring XNACK enabled. Default \"any\" if USM is 
supported.
 
 msram-ecc=
 Target RejectNegative Joined ToLower Enum(hsaco_attr_type) Var(flag_sram_ecc) 
Init(HSACO_ATTR_ANY)
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 

Re: [PATCH v3 1/6] libgomp: basic pinned memory on Linux

2023-12-13 Thread Andrew Stubbs

On 12/12/2023 09:02, Tobias Burnus wrote:

On 11.12.23 18:04, Andrew Stubbs wrote:

Implement the OpenMP pinned memory trait on Linux hosts using the mlock
syscall.  Pinned allocations are performed using mmap, not malloc, to ensure
that they can be unpinned safely when freed.

This implementation will work OK for page-scale allocations, and finer-grained
allocations will be implemented in a future patch.


LGTM.

Thanks,

Tobias


Thank you, this one is now pushed.

Andrew


Re: [PATCH v3 2/6] libgomp, openmp: Add ompx_pinned_mem_alloc

2023-12-12 Thread Andrew Stubbs

On 12/12/2023 10:05, Tobias Burnus wrote:

Hi Andrew,

On 11.12.23 18:04, Andrew Stubbs wrote:

This creates a new predefined allocator as a shortcut for using pinned
memory with OpenMP.  The name uses the OpenMP extension space and is
intended to be consistent with other OpenMP implementations currently in
development.


Discussed this with Jakub - and 9 does not permit having a contiguous
range of numbers if OpenMP ever extends this.

Thus, maybe start the ompx_ values at 100.


These numbers are not defined in any standard, are they? We can use 
whatever enumeration we choose.


I'm happy to change them, but the *_mem_alloc numbers are used as an 
index into a constant table to map them to the corresponding 
*_mem_space, so do we really want to make it a sparse table?



We were also pondering whether it should be ompx_gnu_pinned_mem_alloc or
ompx_pinned_mem_alloc.


It's a long time ago now, and I'm struggling to remember, but I think 
those names were agreed with some other parties (can't remember who 
though, and I may be thinking of the ompx_unified_shared_mem_alloc that 
is still to come).



The only other compiler supporting this flag seems to be IBM; their
compiler uses ompx_pinned_mem_alloc with the same meaning:
https://www.ibm.com/support/pages/system/files/inline-files/OMP5_User_Reference.pdf
(page 5)

As the obvious meaning is what both compilers have, I am fine without
the "gnu" infix, which Jakub accepted.


Good.



* * *

And you have not updated the compiler itself to support this new
allocator. Cf.

https://github.com/gcc-mirror/gcc/blob/master/gcc/testsuite/c-c++-common/gomp/allocate-9.c#L23-L28

That file gives an overview what needs to be changed:

* The check functions mentioned there (seemingly for two ranges now)

* Update the OMP_ALLOCATOR env var parser in env.c

* That linked testcase (and possibly some more) should be updated,
also to ensure that the new allocator is accepted + to check for new
unsupported values (99, 101 ?)

If we now leave gaps, the const_assert in libgomp/allocator.c probably
needs to be updated as well.

* * *

Glancing through the patches, for test cases, I think you should
'abort()' in CHECK_SIZE if it fails (rlimit issue or not supported
system). Or do you think that the results could still make sense
when continuing and possibly failing later?


Those were not meant to be part of the test, really, but rather to 
demystify failures for future maintainers.




Tobias


Thanks for the review.

Andrew


[PATCH v3 6/6] libgomp: fine-grained pinned memory allocator

2023-12-11 Thread Andrew Stubbs

This patch introduces a new custom memory allocator for use with pinned
memory (in the case where the Cuda allocator isn't available).  In future,
this allocator will also be used for Unified Shared Memory.  Both memories
are incompatible with the system malloc because allocated memory cannot
share a page with memory allocated for other purposes.

This means that small allocations will no longer consume an entire page of
pinned memory.  Unfortunately, it also means that pinned memory pages will
never be unmapped (although they may be reused).

The implementation is not perfect; there are various corner cases (especially
related to extending onto new pages) where allocations and reallocations may
be sub-optimal, but it should still be a step forward in support for small
allocations.

I have considered using libmemkind's "fixed" memory but rejected it for three
reasons: 1) libmemkind may not always be present at runtime, 2) there's no
currently documented means to extend a "fixed" kind one page at a time
(although the code appears to have an undocumented function that may do the
job, and/or extending libmemkind to support the MAP_LOCKED mmap flag with its
regular kinds would be straight-forward), 3) USM benefits from having the
metadata located in different memory and using an external implementation makes
it hard to guarantee this.

libgomp/ChangeLog:

* Makefile.am (libgomp_la_SOURCES): Add usmpin-allocator.c.
* Makefile.in: Regenerate.
* config/linux/allocator.c: Include unistd.h.
(pin_ctx): New variable.
(ctxlock): New variable.
(linux_init_pin_ctx): New function.
(linux_memspace_alloc): Use usmpin-allocator for pinned memory.
(linux_memspace_free): Likewise.
(linux_memspace_realloc): Likewise.
* libgomp.h (usmpin_init_context): New prototype.
(usmpin_register_memory): New prototype.
(usmpin_alloc): New prototype.
(usmpin_free): New prototype.
(usmpin_realloc): New prototype.
* testsuite/libgomp.c/alloc-pinned-1.c: Adjust for new behaviour.
* testsuite/libgomp.c/alloc-pinned-2.c: Likewise.
* testsuite/libgomp.c/alloc-pinned-5.c: Likewise.
* testsuite/libgomp.c/alloc-pinned-8.c: New test.
* usmpin-allocator.c: New file.
---
 libgomp/Makefile.am  |   2 +-
 libgomp/Makefile.in  |   7 +-
 libgomp/config/linux/allocator.c |  91 --
 libgomp/libgomp.h|  10 +
 libgomp/testsuite/libgomp.c/alloc-pinned-8.c | 127 
 libgomp/usmpin-allocator.c   | 319 +++
 6 files changed, 523 insertions(+), 33 deletions(-)
 create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-8.c
 create mode 100644 libgomp/usmpin-allocator.c

diff --git a/libgomp/Makefile.am b/libgomp/Makefile.am
index 1871590596d..9d41ed886d1 100644
--- a/libgomp/Makefile.am
+++ b/libgomp/Makefile.am
@@ -72,7 +72,7 @@ libgomp_la_SOURCES = alloc.c atomic.c barrier.c critical.c env.c error.c \
 	target.c splay-tree.c libgomp-plugin.c oacc-parallel.c oacc-host.c \
 	oacc-init.c oacc-mem.c oacc-async.c oacc-plugin.c oacc-cuda.c \
 	priority_queue.c affinity-fmt.c teams.c allocator.c oacc-profiling.c \
-	oacc-target.c target-indirect.c
+	oacc-target.c target-indirect.c usmpin-allocator.c
 
 include $(top_srcdir)/plugin/Makefrag.am
 
diff --git a/libgomp/Makefile.in b/libgomp/Makefile.in
index 56a6beab867..96fa9faf6a4 100644
--- a/libgomp/Makefile.in
+++ b/libgomp/Makefile.in
@@ -219,7 +219,8 @@ am_libgomp_la_OBJECTS = alloc.lo atomic.lo barrier.lo critical.lo \
 	oacc-parallel.lo oacc-host.lo oacc-init.lo oacc-mem.lo \
 	oacc-async.lo oacc-plugin.lo oacc-cuda.lo priority_queue.lo \
 	affinity-fmt.lo teams.lo allocator.lo oacc-profiling.lo \
-	oacc-target.lo target-indirect.lo $(am__objects_1)
+	oacc-target.lo target-indirect.lo usmpin-allocator.lo \
+	$(am__objects_1)
 libgomp_la_OBJECTS = $(am_libgomp_la_OBJECTS)
 AM_V_P = $(am__v_P_@AM_V@)
 am__v_P_ = $(am__v_P_@AM_DEFAULT_V@)
@@ -552,7 +553,8 @@ libgomp_la_SOURCES = alloc.c atomic.c barrier.c critical.c env.c \
 	oacc-parallel.c oacc-host.c oacc-init.c oacc-mem.c \
 	oacc-async.c oacc-plugin.c oacc-cuda.c priority_queue.c \
 	affinity-fmt.c teams.c allocator.c oacc-profiling.c \
-	oacc-target.c target-indirect.c $(am__append_3)
+	oacc-target.c target-indirect.c usmpin-allocator.c \
+	$(am__append_3)
 
 # Nvidia PTX OpenACC plugin.
 @PLUGIN_NVPTX_TRUE@libgomp_plugin_nvptx_version_info = -version-info $(libtool_VERSION)
@@ -786,6 +788,7 @@ distclean-compile:
 @AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/team.Plo@am__quote@
 @AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/teams.Plo@am__quote@
 @AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/time.Plo@am__quote@
+@AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/usmpin-allocator.Plo@am__quote@
 @AMDEP_TRUE@@am__include@ @am__quote@./$(DEPDIR)/work.Plo@am__quote@
 
 .c.o:
diff 

[PATCH v3 4/6] openmp: -foffload-memory=pinned

2023-12-11 Thread Andrew Stubbs

Implement the -foffload-memory=pinned option such that libgomp is
instructed to enable fully-pinned memory at start-up.  The option is
intended to provide a performance boost to certain offload programs without
modifying the code.

This feature only works on Linux, at present, and simply calls mlockall to
enable always-on memory pinning.  It requires that the ulimit feature is
set high enough to accommodate all the program's memory usage.

In this mode the ompx_pinned_mem_alloc feature is disabled as it is not
needed and may conflict.
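
(A minimal sketch of the mechanism on a Linux host; the real entry point added
by this patch is GOMP_enable_pinned_mode, called from the emitted constructor,
and the function name below is invented.)

  #include <sys/mman.h>

  /* Pin all current and future mappings of the process.  This is what
     "always-on" pinning boils down to; it fails when RLIMIT_MEMLOCK
     (ulimit -l) is too low.  */
  static void
  enable_pinned_mode_sketch (void)
  {
    if (mlockall (MCL_CURRENT | MCL_FUTURE) != 0)
      ; /* a real implementation would report the failure */
  }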

gcc/ChangeLog:

* omp-builtins.def (BUILT_IN_GOMP_ENABLE_PINNED_MODE): New.
* omp-low.cc (omp_enable_pinned_mode): New function.
(execute_lower_omp): Call omp_enable_pinned_mode.

libgomp/ChangeLog:

* config/linux/allocator.c (always_pinned_mode): New variable.
(GOMP_enable_pinned_mode): New function.
(linux_memspace_alloc): Disable pinning when always_pinned_mode set.
(linux_memspace_calloc): Likewise.
(linux_memspace_free): Likewise.
(linux_memspace_realloc): Likewise.
* libgomp.map: Add GOMP_enable_pinned_mode.
* testsuite/libgomp.c/alloc-pinned-7.c: New test.
* testsuite/libgomp.c-c++-common/alloc-pinned-1.c: New test.
---
 gcc/omp-builtins.def  |  3 +
 gcc/omp-low.cc| 66 +++
 libgomp/config/linux/allocator.c  | 26 
 libgomp/libgomp.map   |  1 +
 .../libgomp.c-c++-common/alloc-pinned-1.c | 28 
 libgomp/testsuite/libgomp.c/alloc-pinned-7.c  | 63 ++
 6 files changed, 187 insertions(+)
 create mode 100644 libgomp/testsuite/libgomp.c-c++-common/alloc-pinned-1.c
 create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-7.c

diff --git a/gcc/omp-builtins.def b/gcc/omp-builtins.def
index ed78d49d205..54ea7380722 100644
--- a/gcc/omp-builtins.def
+++ b/gcc/omp-builtins.def
@@ -473,3 +473,6 @@ DEF_GOMP_BUILTIN (BUILT_IN_GOMP_WARNING, "GOMP_warning",
 		  BT_FN_VOID_CONST_PTR_SIZE, ATTR_NOTHROW_LEAF_LIST)
 DEF_GOMP_BUILTIN (BUILT_IN_GOMP_ERROR, "GOMP_error",
 		  BT_FN_VOID_CONST_PTR_SIZE, ATTR_COLD_NORETURN_NOTHROW_LEAF_LIST)
+DEF_GOMP_BUILTIN (BUILT_IN_GOMP_ENABLE_PINNED_MODE,
+		  "GOMP_enable_pinned_mode",
+		  BT_FN_VOID, ATTR_NOTHROW_LIST)
diff --git a/gcc/omp-low.cc b/gcc/omp-low.cc
index dd802ca37a6..455c5897577 100644
--- a/gcc/omp-low.cc
+++ b/gcc/omp-low.cc
@@ -14592,6 +14592,68 @@ lower_omp (gimple_seq *body, omp_context *ctx)
   input_location = saved_location;
 }
 
+/* Emit a constructor function to enable -foffload-memory=pinned
+   at runtime.  Libgomp handles the OS mode setting, but we need to trigger
+   it by calling GOMP_enable_pinned mode before the program proper runs.  */
+
+static void
+omp_enable_pinned_mode ()
+{
+  static bool visited = false;
+  if (visited)
+return;
+  visited = true;
+
+  /* Create a new function like this:
+ 
+   static void __attribute__((constructor))
+   __set_pinned_mode ()
+   {
+ GOMP_enable_pinned_mode ();
+   }
+  */
+
+  tree name = get_identifier ("__set_pinned_mode");
+  tree voidfntype = build_function_type_list (void_type_node, NULL_TREE);
+  tree decl = build_decl (UNKNOWN_LOCATION, FUNCTION_DECL, name, voidfntype);
+
+  TREE_STATIC (decl) = 1;
+  TREE_USED (decl) = 1;
+  DECL_ARTIFICIAL (decl) = 1;
+  DECL_IGNORED_P (decl) = 0;
+  TREE_PUBLIC (decl) = 0;
+  DECL_UNINLINABLE (decl) = 1;
+  DECL_EXTERNAL (decl) = 0;
+  DECL_CONTEXT (decl) = NULL_TREE;
+  DECL_INITIAL (decl) = make_node (BLOCK);
+  BLOCK_SUPERCONTEXT (DECL_INITIAL (decl)) = decl;
+  DECL_STATIC_CONSTRUCTOR (decl) = 1;
+  DECL_ATTRIBUTES (decl) = tree_cons (get_identifier ("constructor"),
+  NULL_TREE, NULL_TREE);
+
+  tree t = build_decl (UNKNOWN_LOCATION, RESULT_DECL, NULL_TREE,
+		   void_type_node);
+  DECL_ARTIFICIAL (t) = 1;
+  DECL_IGNORED_P (t) = 1;
+  DECL_CONTEXT (t) = decl;
+  DECL_RESULT (decl) = t;
+
+  push_struct_function (decl);
+  init_tree_ssa (cfun);
+
+  tree calldecl = builtin_decl_explicit (BUILT_IN_GOMP_ENABLE_PINNED_MODE);
+  gcall *call = gimple_build_call (calldecl, 0);
+
+  gimple_seq seq = NULL;
+  gimple_seq_add_stmt (, call);
+  gimple_set_body (decl, gimple_build_bind (NULL_TREE, seq, NULL));
+
+  cfun->function_end_locus = UNKNOWN_LOCATION;
+  cfun->curr_properties |= PROP_gimple_any;
+  pop_cfun ();
+  cgraph_node::add_new_function (decl, true);
+}
+
 /* Main entry point.  */
 
 static unsigned int
@@ -14648,6 +14710,10 @@ execute_lower_omp (void)
   for (auto task_stmt : task_cpyfns)
 finalize_task_copyfn (task_stmt);
   task_cpyfns.release ();
+
+  if (flag_offload_memory == OFFLOAD_MEMORY_PINNED)
+omp_enable_pinned_mode ();
+
   return 0;
 }
 
diff --git a/libgomp/config/linux/allocator.c b/libgomp/config/linux/allocator.c
index 269d0d607d8..57278b1af91 100644
--- a/libgomp/config/linux/allocator.c
+++ 

[PATCH v3 5/6] libgomp, nvptx: Cuda pinned memory

2023-12-11 Thread Andrew Stubbs

Use Cuda to pin memory, instead of Linux mlock, when available.

There are two advantages: firstly, this gives a significant speed boost for
NVPTX offloading, and secondly, it side-steps the usual OS ulimit/rlimit
setting.

The design adds a device independent plugin API for allocating pinned memory,
and then implements it for NVPTX.  At present, the other supported devices do
not have equivalent capabilities (or requirements).
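
(Illustrative sketch only: roughly what a host-pinning plugin hook could look
like using the CUDA driver API.  The hook names in this series are
GOMP_OFFLOAD_page_locked_host_alloc/_free; whether cuMemHostAlloc with this
flag is the exact call used is an assumption.)

  #include <cuda.h>
  #include <stdbool.h>
  #include <stddef.h>

  static bool
  page_locked_host_alloc_sketch (void **ptr, size_t size)
  {
    /* Page-locked host memory, usable from any CUDA context.  */
    return cuMemHostAlloc (ptr, size, CU_MEMHOSTALLOC_PORTABLE) == CUDA_SUCCESS;
  }

  static bool
  page_locked_host_free_sketch (void *ptr)
  {
    return cuMemFreeHost (ptr) == CUDA_SUCCESS;
  }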

libgomp/ChangeLog:

* config/linux/allocator.c: Include assert.h.
(using_device_for_page_locked): New variable.
(linux_memspace_alloc): Add init0 parameter. Support device pinning.
(linux_memspace_calloc): Set init0 to true.
(linux_memspace_free): Support device pinning.
(linux_memspace_realloc): Support device pinning.
(MEMSPACE_ALLOC): Set init0 to false.
* libgomp-plugin.h
(GOMP_OFFLOAD_page_locked_host_alloc): New prototype.
(GOMP_OFFLOAD_page_locked_host_free): Likewise.
* libgomp.h (gomp_page_locked_host_alloc): Likewise.
(gomp_page_locked_host_free): Likewise.
(struct gomp_device_descr): Add page_locked_host_alloc_func and
page_locked_host_free_func.
* libgomp.texi: Adjust the docs for the pinned trait.
* libgomp_g.h (GOMP_enable_pinned_mode): New prototype.
* plugin/plugin-nvptx.c
(GOMP_OFFLOAD_page_locked_host_alloc): New function.
(GOMP_OFFLOAD_page_locked_host_free): Likewise.
* target.c (device_for_page_locked): New variable.
(get_device_for_page_locked): New function.
(gomp_page_locked_host_alloc): Likewise.
(gomp_page_locked_host_free): Likewise.
(gomp_load_plugin_for_device): Add page_locked_host_alloc and
page_locked_host_free.
* testsuite/libgomp.c/alloc-pinned-1.c: Change expectations for NVPTX
devices.
* testsuite/libgomp.c/alloc-pinned-2.c: Likewise.
* testsuite/libgomp.c/alloc-pinned-3.c: Likewise.
* testsuite/libgomp.c/alloc-pinned-4.c: Likewise.
* testsuite/libgomp.c/alloc-pinned-5.c: Likewise.
* testsuite/libgomp.c/alloc-pinned-6.c: Likewise.

Co-Authored-By: Thomas Schwinge 
---
 libgomp/config/linux/allocator.c | 137 ++-
 libgomp/libgomp-plugin.h |   2 +
 libgomp/libgomp.h|   4 +
 libgomp/libgomp.texi |  11 +-
 libgomp/libgomp_g.h  |   1 +
 libgomp/plugin/plugin-nvptx.c|  42 ++
 libgomp/target.c | 136 ++
 libgomp/testsuite/libgomp.c/alloc-pinned-1.c |  26 
 libgomp/testsuite/libgomp.c/alloc-pinned-2.c |  26 
 libgomp/testsuite/libgomp.c/alloc-pinned-3.c |  45 +-
 libgomp/testsuite/libgomp.c/alloc-pinned-4.c |  44 +-
 libgomp/testsuite/libgomp.c/alloc-pinned-5.c |  26 
 libgomp/testsuite/libgomp.c/alloc-pinned-6.c |  35 -
 13 files changed, 487 insertions(+), 48 deletions(-)

diff --git a/libgomp/config/linux/allocator.c b/libgomp/config/linux/allocator.c
index 57278b1af91..8d681b5ec50 100644
--- a/libgomp/config/linux/allocator.c
+++ b/libgomp/config/linux/allocator.c
@@ -36,6 +36,11 @@
 
 /* Implement malloc routines that can handle pinned memory on Linux.

+   Given that pinned memory is typically used to help host <-> device memory
+   transfers, we attempt to allocate such memory using a device (really:
+   libgomp plugin), but fall back to mmap plus mlock if no suitable device is
+   available.
+
It's possible to use mlock on any heap memory, but using munlock is
problematic if there are multiple pinned allocations on the same page.
Tracking all that manually would be possible, but adds overhead. This may
@@ -49,6 +54,7 @@
 #define _GNU_SOURCE
 #include 
 #include 
+#include 
 #include "libgomp.h"
 
 static bool always_pinned_mode = false;
@@ -65,45 +71,87 @@ GOMP_enable_pinned_mode ()
 always_pinned_mode = true;
 }
 
+static int using_device_for_page_locked
+  = /* uninitialized */ -1;
+
 static void *
-linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin)
+linux_memspace_alloc (omp_memspace_handle_t memspace, size_t size, int pin,
+		  bool init0)
 {
-  (void)memspace;
+  gomp_debug (0, "%s: memspace=%llu, size=%llu, pin=%d, init0=%d\n",
+	  __FUNCTION__, (unsigned long long) memspace,
+	  (unsigned long long) size, pin, init0);
+
+  void *addr;
 
   /* Explicit pinning may not be required.  */
   pin = pin && !always_pinned_mode;
 
   if (pin)
 {
-  /* Note that mmap always returns zeroed memory and is therefore also a
-	 suitable implementation of calloc.  */
-  void *addr = mmap (NULL, size, PROT_READ | PROT_WRITE,
-			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
-  if (addr == MAP_FAILED)
-	return NULL;
-
-  if (mlock (addr, size))
+  int using_device
+	= __atomic_load_n (_device_for_page_locked,
+			   

[PATCH v3 2/6] libgomp, openmp: Add ompx_pinned_mem_alloc

2023-12-11 Thread Andrew Stubbs

This creates a new predefined allocator as a shortcut for using pinned
memory with OpenMP.  The name uses the OpenMP extension space and is
intended to be consistent with other OpenMP implementations currently in
development.

The allocator is equivalent to using a custom allocator with the pinned
trait and the null fallback trait.
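
(For illustration: the equivalence described above, written out with the
standard OpenMP allocator-trait API; only the helper function name is
invented.)

  #include <omp.h>

  static omp_allocator_handle_t
  make_pinned_allocator (void)
  {
    omp_alloctrait_t traits[] = {
      { omp_atk_pinned,   omp_atv_true },
      { omp_atk_fallback, omp_atv_null_fb }
    };
    return omp_init_allocator (omp_default_mem_space, 2, traits);
  }

  /* ...so  omp_alloc (n, ompx_pinned_mem_alloc)  behaves like
     omp_alloc (n, make_pinned_allocator ()):  pinned memory or NULL,
     with no fallback to ordinary heap memory.  */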

libgomp/ChangeLog:

* allocator.c (omp_max_predefined_alloc): Update.
(predefined_alloc_mapping): Add ompx_pinned_mem_alloc entry.
(omp_aligned_alloc): Support ompx_pinned_mem_alloc.
(omp_free): Likewise.
(omp_aligned_calloc): Likewise.
(omp_realloc): Likewise.
* libgomp.texi: Document ompx_pinned_mem_alloc.
* omp.h.in (omp_allocator_handle_t): Add ompx_pinned_mem_alloc.
* omp_lib.f90.in: Add ompx_pinned_mem_alloc.
* testsuite/libgomp.c/alloc-pinned-5.c: New test.
* testsuite/libgomp.c/alloc-pinned-6.c: New test.
* testsuite/libgomp.fortran/alloc-pinned-1.f90: New test.

Co-Authored-By: Thomas Schwinge 
---
 libgomp/allocator.c   |  58 ++
 libgomp/libgomp.texi  |   7 +-
 libgomp/omp.h.in  |   1 +
 libgomp/omp_lib.f90.in|   2 +
 libgomp/testsuite/libgomp.c/alloc-pinned-5.c  | 103 ++
 libgomp/testsuite/libgomp.c/alloc-pinned-6.c  | 101 +
 .../libgomp.fortran/alloc-pinned-1.f90|  16 +++
 7 files changed, 268 insertions(+), 20 deletions(-)
 create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-5.c
 create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-6.c
 create mode 100644 libgomp/testsuite/libgomp.fortran/alloc-pinned-1.f90

diff --git a/libgomp/allocator.c b/libgomp/allocator.c
index 666adf9a3a9..6c69c4f008f 100644
--- a/libgomp/allocator.c
+++ b/libgomp/allocator.c
@@ -35,7 +35,7 @@
 #include 
 #endif
 
-#define omp_max_predefined_alloc omp_thread_mem_alloc
+#define omp_max_predefined_alloc ompx_pinned_mem_alloc
 
 /* These macros may be overridden in config//allocator.c.
The defaults (no override) are to return NULL for pinned memory requests
@@ -78,6 +78,7 @@ static const omp_memspace_handle_t predefined_alloc_mapping[] = {
   omp_low_lat_mem_space,   /* omp_cgroup_mem_alloc (implementation defined). */
   omp_low_lat_mem_space,   /* omp_pteam_mem_alloc (implementation defined). */
   omp_low_lat_mem_space,   /* omp_thread_mem_alloc (implementation defined). */
+  omp_default_mem_space,   /* ompx_pinned_mem_alloc. */
 };
 
 #define ARRAY_SIZE(A) (sizeof (A) / sizeof ((A)[0]))
@@ -623,8 +624,10 @@ retry:
 	  memspace = (allocator_data
 		  ? allocator_data->memspace
 		  : predefined_alloc_mapping[allocator]);
-	  ptr = MEMSPACE_ALLOC (memspace, new_size,
-allocator_data && allocator_data->pinned);
+	  int pinned = (allocator_data
+			? allocator_data->pinned
+			: allocator == ompx_pinned_mem_alloc);
+	  ptr = MEMSPACE_ALLOC (memspace, new_size, pinned);
 	}
   if (ptr == NULL)
 	goto fail;
@@ -645,7 +648,8 @@ retry:
 fail:;
   int fallback = (allocator_data
 		  ? allocator_data->fallback
-		  : allocator == omp_default_mem_alloc
+		  : (allocator == omp_default_mem_alloc
+		 || allocator == ompx_pinned_mem_alloc)
 		  ? omp_atv_null_fb
 		  : omp_atv_default_mem_fb);
   switch (fallback)
@@ -760,6 +764,7 @@ omp_free (void *ptr, omp_allocator_handle_t allocator)
 #endif
 
   memspace = predefined_alloc_mapping[data->allocator];
+  pinned = (data->allocator == ompx_pinned_mem_alloc);
 }
 
   MEMSPACE_FREE (memspace, data->ptr, data->size, pinned);
@@ -933,8 +938,10 @@ retry:
 	  memspace = (allocator_data
 		  ? allocator_data->memspace
 		  : predefined_alloc_mapping[allocator]);
-	  ptr = MEMSPACE_CALLOC (memspace, new_size,
- allocator_data && allocator_data->pinned);
+	  int pinned = (allocator_data
+			? allocator_data->pinned
+			: allocator == ompx_pinned_mem_alloc);
+	  ptr = MEMSPACE_CALLOC (memspace, new_size, pinned);
 	}
   if (ptr == NULL)
 	goto fail;
@@ -955,7 +962,8 @@ retry:
 fail:;
   int fallback = (allocator_data
 		  ? allocator_data->fallback
-		  : allocator == omp_default_mem_alloc
+		  : (allocator == omp_default_mem_alloc
+		 || allocator == ompx_pinned_mem_alloc)
 		  ? omp_atv_null_fb
 		  : omp_atv_default_mem_fb);
   switch (fallback)
@@ -1165,11 +1173,14 @@ retry:
   else
 #endif
   if (prev_size)
-	new_ptr = MEMSPACE_REALLOC (allocator_data->memspace, data->ptr,
-data->size, new_size,
-(free_allocator_data
- && free_allocator_data->pinned),
-allocator_data->pinned);
+	{
+	  int was_pinned = (free_allocator_data
+			? free_allocator_data->pinned
+			: free_allocator == ompx_pinned_mem_alloc);
+	  new_ptr = MEMSPACE_REALLOC (allocator_data->memspace, data->ptr,
+  data->size, new_size, was_pinned,
+  allocator_data->pinned);
+	}
   else
 	new_ptr = MEMSPACE_ALLOC 

[PATCH v3 3/6] openmp: Add -foffload-memory

2023-12-11 Thread Andrew Stubbs

Add a new option.  It's inactive until I add some follow-up patches.

gcc/ChangeLog:

* common.opt: Add -foffload-memory and its enum values.
* coretypes.h (enum offload_memory): New.
* doc/invoke.texi: Document -foffload-memory.
---
 gcc/common.opt  | 16 
 gcc/coretypes.h |  7 +++
 gcc/doc/invoke.texi | 16 +++-
 3 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/gcc/common.opt b/gcc/common.opt
index 5eb5ecff04b..a008827cfa2 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -2332,6 +2332,22 @@ Enum(offload_abi) String(ilp32) Value(OFFLOAD_ABI_ILP32)
 EnumValue
 Enum(offload_abi) String(lp64) Value(OFFLOAD_ABI_LP64)
 
+foffload-memory=
+Common Joined RejectNegative Enum(offload_memory) Var(flag_offload_memory) Init(OFFLOAD_MEMORY_NONE)
+-foffload-memory=[none|unified|pinned]	Use an offload memory optimization.
+
+Enum
+Name(offload_memory) Type(enum offload_memory) UnknownError(Unknown offload memory option %qs)
+
+EnumValue
+Enum(offload_memory) String(none) Value(OFFLOAD_MEMORY_NONE)
+
+EnumValue
+Enum(offload_memory) String(unified) Value(OFFLOAD_MEMORY_UNIFIED)
+
+EnumValue
+Enum(offload_memory) String(pinned) Value(OFFLOAD_MEMORY_PINNED)
+
 fomit-frame-pointer
 Common Var(flag_omit_frame_pointer) Optimization
 When possible do not generate stack frames.
diff --git a/gcc/coretypes.h b/gcc/coretypes.h
index fe5b868fb4f..fb4bf37ba24 100644
--- a/gcc/coretypes.h
+++ b/gcc/coretypes.h
@@ -218,6 +218,13 @@ enum offload_abi {
   OFFLOAD_ABI_ILP32
 };
 
+/* Types of memory optimization for an offload device.  */
+enum offload_memory {
+  OFFLOAD_MEMORY_NONE,
+  OFFLOAD_MEMORY_UNIFIED,
+  OFFLOAD_MEMORY_PINNED
+};
+
 /* Types of profile update methods.  */
 enum profile_update {
   PROFILE_UPDATE_SINGLE,
diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi
index 43341fe6e5e..f6a7459bda7 100644
--- a/gcc/doc/invoke.texi
+++ b/gcc/doc/invoke.texi
@@ -202,7 +202,7 @@ in the following sections.
 -fno-builtin  -fno-builtin-@var{function}  -fcond-mismatch
 -ffreestanding  -fgimple  -fgnu-tm  -fgnu89-inline  -fhosted
 -flax-vector-conversions  -fms-extensions
--foffload=@var{arg}  -foffload-options=@var{arg}
+-foffload=@var{arg}  -foffload-options=@var{arg} -foffload-memory=@var{arg} 
 -fopenacc  -fopenacc-dim=@var{geom}
 -fopenmp  -fopenmp-simd  -fopenmp-target-simd-clone@r{[}=@var{device-type}@r{]}
 -fpermitted-flt-eval-methods=@var{standard}
@@ -2766,6 +2766,20 @@ Typical command lines are
 -foffload-options=amdgcn-amdhsa=-march=gfx906
 @end smallexample
 
+@opindex foffload-memory
+@cindex OpenMP offloading memory modes
+@item -foffload-memory=none
+@itemx -foffload-memory=unified
+@itemx -foffload-memory=pinned
+Enable a memory optimization mode to use with OpenMP.  The default behavior,
+@option{-foffload-memory=none}, is to do nothing special (unless enabled via
+a requires directive in the code).  @option{-foffload-memory=unified} is
+equivalent to @code{#pragma omp requires unified_shared_memory}.
+@option{-foffload-memory=pinned} forces all host memory to be pinned (this
+mode may require the user to increase the ulimit setting for locked memory).
+All translation units must select the same setting to avoid undefined
+behavior.
+
 @opindex fopenacc
 @cindex OpenACC accelerator programming
 @item -fopenacc


[PATCH v3 0/6] libgomp: OpenMP pinned memory omp_alloc

2023-12-11 Thread Andrew Stubbs
This patch series is a rework of the v2 series I posted in August:

https://patchwork.sourceware.org/project/gcc/list/?series=23763=%2A=both

This version addresses most of the review comments from Tobias, but
after discussion with Tobias and Thomas we've decided to skip the
nice-to-have proposed initialization improvement in the interest of
getting the job done, for now.

Otherwise, some bugs have been fixed and few other clean-ups have been
made, but the series retains the same purpose and structure.

This series no longer has any out-of-tree dependencies, now that the
low-latency allocator patch have been committed.

An older, less compact, version of these patches is already applied to
the devel/omp/gcc-13 (OG13) branch.

OK for mainline?

Andrew

Andrew Stubbs (5):
  libgomp: basic pinned memory on Linux
  libgomp, openmp: Add ompx_pinned_mem_alloc
  openmp: Add -foffload-memory
  openmp: -foffload-memory=pinned
  libgomp: fine-grained pinned memory allocator

Thomas Schwinge (1):
  libgomp, nvptx: Cuda pinned memory

 gcc/common.opt|  16 +
 gcc/coretypes.h   |   7 +
 gcc/doc/invoke.texi   |  16 +-
 gcc/omp-builtins.def  |   3 +
 gcc/omp-low.cc|  66 
 libgomp/Makefile.am   |   2 +-
 libgomp/Makefile.in   |   7 +-
 libgomp/allocator.c   |  95 --
 libgomp/config/gcn/allocator.c|  21 +-
 libgomp/config/linux/allocator.c  | 243 +
 libgomp/config/nvptx/allocator.c  |  21 +-
 libgomp/libgomp-plugin.h  |   2 +
 libgomp/libgomp.h |  14 +
 libgomp/libgomp.map   |   1 +
 libgomp/libgomp.texi  |  17 +-
 libgomp/libgomp_g.h   |   1 +
 libgomp/omp.h.in  |   1 +
 libgomp/omp_lib.f90.in|   2 +
 libgomp/plugin/plugin-nvptx.c |  42 +++
 libgomp/target.c  | 136 
 .../libgomp.c-c++-common/alloc-pinned-1.c |  28 ++
 libgomp/testsuite/libgomp.c/alloc-pinned-1.c  | 141 
 libgomp/testsuite/libgomp.c/alloc-pinned-2.c  | 146 
 libgomp/testsuite/libgomp.c/alloc-pinned-3.c  | 189 +++
 libgomp/testsuite/libgomp.c/alloc-pinned-4.c  | 184 ++
 libgomp/testsuite/libgomp.c/alloc-pinned-5.c  | 129 +++
 libgomp/testsuite/libgomp.c/alloc-pinned-6.c  | 128 +++
 libgomp/testsuite/libgomp.c/alloc-pinned-7.c  |  63 
 libgomp/testsuite/libgomp.c/alloc-pinned-8.c  | 127 +++
 .../libgomp.fortran/alloc-pinned-1.f90|  16 +
 libgomp/usmpin-allocator.c| 319 ++
 31 files changed, 2127 insertions(+), 56 deletions(-)
 create mode 100644 libgomp/testsuite/libgomp.c-c++-common/alloc-pinned-1.c
 create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-1.c
 create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-2.c
 create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-3.c
 create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-4.c
 create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-5.c
 create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-6.c
 create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-7.c
 create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-8.c
 create mode 100644 libgomp/testsuite/libgomp.fortran/alloc-pinned-1.f90
 create mode 100644 libgomp/usmpin-allocator.c

-- 
2.41.0



[PATCH v3 1/6] libgomp: basic pinned memory on Linux

2023-12-11 Thread Andrew Stubbs

Implement the OpenMP pinned memory trait on Linux hosts using the mlock
syscall.  Pinned allocations are performed using mmap, not malloc, to ensure
that they can be unpinned safely when freed.

This implementation will work OK for page-scale allocations, and finer-grained
allocations will be implemented in a future patch.
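
(Minimal sketch of the approach, not the libgomp code itself: pinning a
dedicated mapping means munlock/munmap cannot disturb unrelated allocations
that would otherwise share a page.)

  #include <stddef.h>
  #include <sys/mman.h>

  static void *
  pinned_alloc_sketch (size_t size)
  {
    /* mmap returns zeroed, page-aligned memory, so this also covers calloc.  */
    void *addr = mmap (NULL, size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (addr == MAP_FAILED)
      return NULL;
    if (mlock (addr, size))
      {
        munmap (addr, size);
        return NULL;
      }
    return addr;
  }

  static void
  pinned_free_sketch (void *addr, size_t size)
  {
    munlock (addr, size);   /* safe: no other allocation shares these pages */
    munmap (addr, size);
  }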

libgomp/ChangeLog:

* allocator.c (MEMSPACE_ALLOC): Add PIN.
(MEMSPACE_CALLOC): Add PIN.
(MEMSPACE_REALLOC): Add PIN.
(MEMSPACE_FREE): Add PIN.
(MEMSPACE_VALIDATE): Add PIN.
(omp_init_allocator): Use MEMSPACE_VALIDATE to check pinning.
(omp_aligned_alloc): Add pinning to all MEMSPACE_* calls.
(omp_aligned_calloc): Likewise.
(omp_realloc): Likewise.
(omp_free): Likewise.
* config/linux/allocator.c: New file.
* config/nvptx/allocator.c (MEMSPACE_ALLOC): Add PIN.
(MEMSPACE_CALLOC): Add PIN.
(MEMSPACE_REALLOC): Add PIN.
(MEMSPACE_FREE): Add PIN.
(MEMSPACE_VALIDATE): Add PIN.
* config/gcn/allocator.c (MEMSPACE_ALLOC): Add PIN.
(MEMSPACE_CALLOC): Add PIN.
(MEMSPACE_REALLOC): Add PIN.
(MEMSPACE_FREE): Add PIN.
* libgomp.texi: Switch pinned trait to supported.
(MEMSPACE_VALIDATE): Add PIN.
* testsuite/libgomp.c/alloc-pinned-1.c: New test.
* testsuite/libgomp.c/alloc-pinned-2.c: New test.
* testsuite/libgomp.c/alloc-pinned-3.c: New test.
* testsuite/libgomp.c/alloc-pinned-4.c: New test.

Co-Authored-By: Thomas Schwinge 
---
 libgomp/allocator.c  |  65 +---
 libgomp/config/gcn/allocator.c   |  21 +--
 libgomp/config/linux/allocator.c | 111 +
 libgomp/config/nvptx/allocator.c |  21 +--
 libgomp/libgomp.texi |   3 +-
 libgomp/testsuite/libgomp.c/alloc-pinned-1.c | 115 ++
 libgomp/testsuite/libgomp.c/alloc-pinned-2.c | 120 ++
 libgomp/testsuite/libgomp.c/alloc-pinned-3.c | 156 +++
 libgomp/testsuite/libgomp.c/alloc-pinned-4.c | 150 ++
 9 files changed, 716 insertions(+), 46 deletions(-)
 create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-1.c
 create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-2.c
 create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-3.c
 create mode 100644 libgomp/testsuite/libgomp.c/alloc-pinned-4.c

diff --git a/libgomp/allocator.c b/libgomp/allocator.c
index a8a80f8028d..666adf9a3a9 100644
--- a/libgomp/allocator.c
+++ b/libgomp/allocator.c
@@ -38,27 +38,30 @@
 #define omp_max_predefined_alloc omp_thread_mem_alloc
 
 /* These macros may be overridden in config//allocator.c.
+   The defaults (no override) are to return NULL for pinned memory requests
+   and pass through to the regular OS calls otherwise.
The following definitions (ab)use comma operators to avoid unused
variable errors.  */
 #ifndef MEMSPACE_ALLOC
-#define MEMSPACE_ALLOC(MEMSPACE, SIZE) \
-  malloc (((void)(MEMSPACE), (SIZE)))
+#define MEMSPACE_ALLOC(MEMSPACE, SIZE, PIN) \
+  (PIN ? NULL : malloc (((void)(MEMSPACE), (SIZE
 #endif
 #ifndef MEMSPACE_CALLOC
-#define MEMSPACE_CALLOC(MEMSPACE, SIZE) \
-  calloc (1, (((void)(MEMSPACE), (SIZE
+#define MEMSPACE_CALLOC(MEMSPACE, SIZE, PIN) \
+  (PIN ? NULL : calloc (1, (((void)(MEMSPACE), (SIZE)
 #endif
 #ifndef MEMSPACE_REALLOC
-#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE) \
-  realloc (ADDR, (((void)(MEMSPACE), (void)(OLDSIZE), (SIZE
+#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE, OLDPIN, PIN) \
+   ((PIN) || (OLDPIN) ? NULL \
+   : realloc (ADDR, (((void)(MEMSPACE), (void)(OLDSIZE), (SIZE)
 #endif
 #ifndef MEMSPACE_FREE
-#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE) \
-  free (((void)(MEMSPACE), (void)(SIZE), (ADDR)))
+#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE, PIN) \
+  if (PIN) free (((void)(MEMSPACE), (void)(SIZE), (ADDR)))
 #endif
 #ifndef MEMSPACE_VALIDATE
-#define MEMSPACE_VALIDATE(MEMSPACE, ACCESS) \
-  (((void)(MEMSPACE), (void)(ACCESS), 1))
+#define MEMSPACE_VALIDATE(MEMSPACE, ACCESS, PIN) \
+  (PIN ? 0 : ((void)(MEMSPACE), (void)(ACCESS), 1))
 #endif
 
 /* Map the predefined allocators to the correct memory space.
@@ -439,12 +442,8 @@ omp_init_allocator (omp_memspace_handle_t memspace, int ntraits,
 }
 #endif
 
-  /* No support for this so far.  */
-  if (data.pinned)
-return omp_null_allocator;
-
   /* Reject unsupported memory spaces.  */
-  if (!MEMSPACE_VALIDATE (data.memspace, data.access))
+  if (!MEMSPACE_VALIDATE (data.memspace, data.access, data.pinned))
 return omp_null_allocator;
 
   ret = gomp_malloc (sizeof (struct omp_allocator_data));
@@ -586,7 +585,8 @@ retry:
 	}
   else
 #endif
-	ptr = MEMSPACE_ALLOC (allocator_data->memspace, new_size);
+	ptr = MEMSPACE_ALLOC (allocator_data->memspace, new_size,
+			  allocator_data->pinned);
   if (ptr == NULL)
 	{
 #ifdef HAVE_SYNC_BUILTINS
@@ -623,7 

Re: [PATCH v2 5/6] libgomp, nvptx: Cuda pinned memory

2023-12-07 Thread Andrew Stubbs

@Thomas, there are questions for you below

On 22/11/2023 17:07, Tobias Burnus wrote:

Note before: Starting with TR11 alias OpenMP 6.0, OpenMP supports handling
multiple devices for allocation. It seems as if after using:

   my_memspace = omp_get_device_and_host_memspace( 5 , 
omp_default_mem_space)

   my_alloc = omp_init_allocator (my_memspace, my_traits_with_pinning);

The pinning should be done via device '5' if possible.


This patch is intended to get us to 5.0. These are future features for 
future patches.



However, I believe that it shouldn't really matter for now, given that CUDA
has no special handling of NUMA hierarchy on the host nor for specific
devices and GCN has none.

It only becomes interesting if mmap/mlock memory is (measurably) faster than
CUDA allocated memory when accessed from the host or, for USM, from GCN.


I don't believe there's any issue here (yet).


Let's start with the patch itself:

--- a/libgomp/target.c
+++ b/libgomp/target.c
...
+static struct gomp_device_descr *
+get_device_for_page_locked (void)
+{
+ gomp_debug (0, "%s\n",
+ __FUNCTION__);
+
+ struct gomp_device_descr *device;
+#ifdef HAVE_SYNC_BUILTINS
+ device
+   = __atomic_load_n (_for_page_locked, MEMMODEL_RELAXED);
+ if (device == (void *) -1)
+   {
+ gomp_debug (0, " init\n");
+
+ gomp_init_targets_once ();
+
+ device = NULL;
+ for (int i = 0; i < num_devices; ++i)


Given that this function just sets a single variable based on whether the
page_locked_host_alloc_func function pointer exists, wouldn't it be much
simpler to just do all this handling in   gomp_target_init  ?


@Thomas, care to comment on this?


+ for (int i = 0; i < num_devices; ++i)
...
+/* We consider only the first device of potentially several of the
+   same type as this functionality is not specific to an individual
+   offloading device, but instead relates to the host-side
+   implementation of the respective offloading implementation. */
+if (devices[i].target_id != 0)
+  continue;
+
+if (!devices[i].page_locked_host_alloc_func)
+  continue;
...
+if (device)
+  gomp_fatal ("Unclear how %s and %s libgomp plugins may"
+  " simultaneously provide functionality"
+  " for page-locked memory",
+  device->name, devices[i].name);
+else
+  device = [i];


I find this a bit inconsistent: If - let's say - GCN does not provide its
own pinning, the code assumes that CUDA pinning is just fine.  However, if
both support it, CUDA pinning suddenly is not fine for GCN.


I think it means that we need to revisit this code if that situation 
ever occurs. Again, @Thomas?


Additionally, all wording suggests that it does not matter for CUDA for which
device access we want to optimize the pinning. But the code above also fails if
I have a system with two Nvidia cards.  From the wording, it sounds as if just
checking whether the  device->type  is different would do.


But all in all, I wonder whether it wouldn't be much simpler to state something
like the following (where applicable):

The first device that provides pinning support is used; the assumption is that
all other devices and the host can access this memory without measurable
performance penalty compared to a normal page lock and that having multiple
device types or host/device NUMA aware pinning support in the plugin is not
available.
NOTE: For OpenMP 6.0's OMP_AVAILABLE_DEVICES environment variable and
device-set memory spaces, this might need to be revisited.


This seems reasonable to me, until the user can specify.

(I'm going to go look at the other review points now)

Andrew


[committed v4 3/3] amdgcn, libgomp: low-latency allocator

2023-12-06 Thread Andrew Stubbs

This implements the OpenMP low-latency memory allocator for AMD GCN using the
small per-team LDS memory (Local Data Store).

Since addresses can now refer to LDS space, the "Global" address space is
no longer compatible.  This patch therefore switches the backend to use
entirely "Flat" addressing (which supports both memories).  A future patch
will re-enable "global" instructions for cases where it is known to be safe
to do so.
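
(Usage sketch, not part of the patch: requesting team-local low-latency memory
from inside an offloaded region.  Note that the predefined
omp_low_lat_mem_alloc allocator may fall back to ordinary global memory when
the small LDS pool cannot satisfy the request.)

  #include <omp.h>

  void
  low_lat_example (void)
  {
  #pragma omp target
  #pragma omp teams
    {
      int *scratch = (int *) omp_alloc (64 * sizeof (int),
                                        omp_low_lat_mem_alloc);
      if (scratch)
        {
          /* ...fast per-team scratch space (LDS on GCN)... */
          omp_free (scratch, omp_low_lat_mem_alloc);
        }
    }
  }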

gcc/ChangeLog:

* config/gcn/gcn-builtins.def (DISPATCH_PTR): New built-in.
* config/gcn/gcn.cc (gcn_init_machine_status): Disable global
addressing.
(gcn_expand_builtin_1): Implement GCN_BUILTIN_DISPATCH_PTR.

libgomp/ChangeLog:

* config/gcn/libgomp-gcn.h (TEAM_ARENA_START): Move to here.
(TEAM_ARENA_FREE): Likewise.
(TEAM_ARENA_END): Likewise.
(GCN_LOWLAT_HEAP): New.
* config/gcn/team.c (LITTLEENDIAN_CPU): New, and import hsa.h.
(__gcn_lowlat_init): New prototype.
(gomp_gcn_enter_kernel): Initialize the low-latency heap.
* libgomp.h (TEAM_ARENA_START): Move to libgomp.h.
(TEAM_ARENA_FREE): Likewise.
(TEAM_ARENA_END): Likewise.
* plugin/plugin-gcn.c (lowlat_size): New variable.
(print_kernel_dispatch): Label the group_segment_size purpose.
(init_environment_variables): Read GOMP_GCN_LOWLAT_POOL.
(create_kernel_dispatch): Pass low-latency head allocation to kernel.
(run_kernel): Use shadow; don't assume values.
* testsuite/libgomp.c/omp_alloc-traits.c: Enable for amdgcn.
* config/gcn/allocator.c: New file.
* libgomp.texi: Document low-latency implementation details.
---
 gcc/config/gcn/gcn-builtins.def   |   2 +
 gcc/config/gcn/gcn.cc |  16 ++-
 libgomp/config/gcn/allocator.c| 127 ++
 libgomp/config/gcn/libgomp-gcn.h  |   6 +
 libgomp/config/gcn/team.c |  12 ++
 libgomp/libgomp.h |   3 -
 libgomp/libgomp.texi  |  13 ++
 libgomp/plugin/plugin-gcn.c   |  35 -
 .../testsuite/libgomp.c/omp_alloc-traits.c|   2 +-
 9 files changed, 205 insertions(+), 11 deletions(-)
 create mode 100644 libgomp/config/gcn/allocator.c

diff --git a/gcc/config/gcn/gcn-builtins.def b/gcc/config/gcn/gcn-builtins.def
index 636a8e7a1a9..471457d7c23 100644
--- a/gcc/config/gcn/gcn-builtins.def
+++ b/gcc/config/gcn/gcn-builtins.def
@@ -164,6 +164,8 @@ DEF_BUILTIN (FIRST_CALL_THIS_THREAD_P, -1, "first_call_this_thread_p", B_INSN,
 	 _A1 (GCN_BTI_BOOL), gcn_expand_builtin_1)
 DEF_BUILTIN (KERNARG_PTR, -1, "kernarg_ptr", B_INSN, _A1 (GCN_BTI_VOIDPTR),
 	 gcn_expand_builtin_1)
+DEF_BUILTIN (DISPATCH_PTR, -1, "dispatch_ptr", B_INSN, _A1 (GCN_BTI_VOIDPTR),
+	 gcn_expand_builtin_1)
 DEF_BUILTIN (GET_STACK_LIMIT, -1, "get_stack_limit", B_INSN,
 	 _A1 (GCN_BTI_VOIDPTR), gcn_expand_builtin_1)
 
diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
index 0781c2a47c2..031b405e810 100644
--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc
@@ -110,7 +110,8 @@ gcn_init_machine_status (void)
 
   f = ggc_cleared_alloc ();
 
-  if (TARGET_GCN3)
+  // FIXME: re-enable global addressing with safety for LDS-flat addresses
+  //if (TARGET_GCN3)
 f->use_flat_addressing = true;
 
   return f;
@@ -4879,6 +4880,19 @@ gcn_expand_builtin_1 (tree exp, rtx target, rtx /*subtarget */ ,
 	  }
 	return ptr;
   }
+case GCN_BUILTIN_DISPATCH_PTR:
+  {
+	rtx ptr;
+	if (cfun->machine->args.reg[DISPATCH_PTR_ARG] >= 0)
+	   ptr = gen_rtx_REG (DImode,
+			  cfun->machine->args.reg[DISPATCH_PTR_ARG]);
+	else
+	  {
+	ptr = gen_reg_rtx (DImode);
+	emit_move_insn (ptr, const0_rtx);
+	  }
+	return ptr;
+  }
 case GCN_BUILTIN_FIRST_CALL_THIS_THREAD_P:
   {
 	/* Stash a marker in the unused upper 16 bits of s[0:1] to indicate
diff --git a/libgomp/config/gcn/allocator.c b/libgomp/config/gcn/allocator.c
new file mode 100644
index 000..e9a95d683f9
--- /dev/null
+++ b/libgomp/config/gcn/allocator.c
@@ -0,0 +1,127 @@
+/* Copyright (C) 2023 Free Software Foundation, Inc.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should 

[committed v4 1/3] libgomp, nvptx: low-latency memory allocator

2023-12-06 Thread Andrew Stubbs

This patch adds support for allocating low-latency ".shared" memory on
the NVPTX GPU device, via the omp_low_lat_mem_space and omp_alloc.  The memory
can be allocated, reallocated, and freed using a basic but fast algorithm, the
allocator is thread safe, and the size of the low-latency heap can be
configured using the GOMP_NVPTX_LOWLAT_POOL environment variable.

The use of the PTX dynamic_smem_size feature means that low-latency allocator
will not work with the PTX 3.1 multilib.

For now, the omp_low_lat_mem_alloc allocator also works, but that will change
when I implement the access traits.
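
For reference, a minimal sketch (not part of the patch) of how the new pool
is exercised from user code; the thread count, allocation size, and the
GOMP_NVPTX_LOWLAT_POOL value in the comment are all illustrative:

#include <omp.h>

int
main (void)
{
  int fail = 0;
  /* Run with e.g. GOMP_NVPTX_LOWLAT_POOL=16384 to enlarge the .shared pool.  */
  #pragma omp target map(tofrom: fail)
  #pragma omp parallel num_threads(32)
  {
    /* omp_low_lat_mem_alloc maps to omp_low_lat_mem_space, which this
       patch backs with .shared memory on the device.  */
    double *v = (double *) omp_alloc (8 * sizeof (double),
                                      omp_low_lat_mem_alloc);
    if (v == NULL)
      {
        #pragma omp atomic write
        fail = 1;
      }
    else
      omp_free (v, omp_low_lat_mem_alloc);
  }
  return fail;
}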

libgomp/ChangeLog:

* allocator.c (MEMSPACE_ALLOC): New macro.
(MEMSPACE_CALLOC): New macro.
(MEMSPACE_REALLOC): New macro.
(MEMSPACE_FREE): New macro.
(predefined_alloc_mapping): New array.  Add _Static_assert to match.
(ARRAY_SIZE): New macro.
(omp_aligned_alloc): Use MEMSPACE_ALLOC.
Implement fall-backs for predefined allocators.  Simplify existing
fall-backs.
(omp_free): Use MEMSPACE_FREE.
(omp_calloc): Use MEMSPACE_CALLOC. Implement fall-backs for
predefined allocators.  Simplify existing fall-backs.
(omp_realloc): Use MEMSPACE_REALLOC, MEMSPACE_ALLOC, and MEMSPACE_FREE.
Implement fall-backs for predefined allocators.  Simplify existing
fall-backs.
* config/nvptx/team.c (__nvptx_lowlat_pool): New asm variable.
(__nvptx_lowlat_init): New prototype.
(gomp_nvptx_main): Call __nvptx_lowlat_init.
* libgomp.texi: Update memory space table.
* plugin/plugin-nvptx.c (lowlat_pool_size): New variable.
(GOMP_OFFLOAD_init_device): Read the GOMP_NVPTX_LOWLAT_POOL envvar.
(GOMP_OFFLOAD_run): Apply lowlat_pool_size.
* basic-allocator.c: New file.
* config/nvptx/allocator.c: New file.
* testsuite/libgomp.c/omp_alloc-1.c: New test.
* testsuite/libgomp.c/omp_alloc-2.c: New test.
* testsuite/libgomp.c/omp_alloc-3.c: New test.
* testsuite/libgomp.c/omp_alloc-4.c: New test.
* testsuite/libgomp.c/omp_alloc-5.c: New test.
* testsuite/libgomp.c/omp_alloc-6.c: New test.

Co-authored-by: Kwok Cheung Yeung  
Co-Authored-By: Thomas Schwinge 
---
 libgomp/allocator.c   | 246 --
 libgomp/basic-allocator.c | 382 ++
 libgomp/config/nvptx/allocator.c  | 120 +++
 libgomp/config/nvptx/team.c   |  18 +
 libgomp/libgomp.texi  |  11 +-
 libgomp/plugin/plugin-nvptx.c |  23 +-
 libgomp/testsuite/libgomp.c/omp_alloc-1.c |  56 
 libgomp/testsuite/libgomp.c/omp_alloc-2.c |  64 
 libgomp/testsuite/libgomp.c/omp_alloc-3.c |  42 +++
 libgomp/testsuite/libgomp.c/omp_alloc-4.c | 199 +++
 libgomp/testsuite/libgomp.c/omp_alloc-5.c |  63 
 libgomp/testsuite/libgomp.c/omp_alloc-6.c | 120 +++
 12 files changed, 1239 insertions(+), 105 deletions(-)
 create mode 100644 libgomp/basic-allocator.c
 create mode 100644 libgomp/config/nvptx/allocator.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-1.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-2.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-3.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-4.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-5.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-6.c

diff --git a/libgomp/allocator.c b/libgomp/allocator.c
index b4e50e2ad72..fa398128368 100644
--- a/libgomp/allocator.c
+++ b/libgomp/allocator.c
@@ -37,6 +37,47 @@
 
 #define omp_max_predefined_alloc omp_thread_mem_alloc
 
+/* These macros may be overridden in config/<target>/allocator.c.
+   The following definitions (ab)use comma operators to avoid unused
+   variable errors.  */
+#ifndef MEMSPACE_ALLOC
+#define MEMSPACE_ALLOC(MEMSPACE, SIZE) \
+  malloc (((void)(MEMSPACE), (SIZE)))
+#endif
+#ifndef MEMSPACE_CALLOC
+#define MEMSPACE_CALLOC(MEMSPACE, SIZE) \
+  calloc (1, (((void)(MEMSPACE), (SIZE))))
+#endif
+#ifndef MEMSPACE_REALLOC
+#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE) \
+  realloc (ADDR, (((void)(MEMSPACE), (void)(OLDSIZE), (SIZE))))
+#endif
+#ifndef MEMSPACE_FREE
+#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE) \
+  free (((void)(MEMSPACE), (void)(SIZE), (ADDR)))
+#endif
+
+/* Map the predefined allocators to the correct memory space.
+   The index to this table is the omp_allocator_handle_t enum value.
+   When the user calls omp_alloc with a predefined allocator this
+   table determines what memory they get.  */
+static const omp_memspace_handle_t predefined_alloc_mapping[] = {
+  omp_default_mem_space,   /* omp_null_allocator doesn't actually use this. */
+  omp_default_mem_space,   /* omp_default_mem_alloc. */
+  omp_large_cap_mem_space, /* omp_large_cap_mem_alloc. */
+  omp_const_mem_space, /* omp_const_mem_alloc. */
+  omp_high_bw_mem_space,   /* omp_high_bw_mem_alloc. */

[committed v4 2/3] openmp, nvptx: low-lat memory access traits

2023-12-06 Thread Andrew Stubbs

The NVPTX low latency memory is not accessible outside the team that allocates
it, and therefore should be unavailable for allocators with the access trait
"all".  This change means that the omp_low_lat_mem_alloc predefined
allocator no longer works (but omp_cgroup_mem_alloc still does).
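
As a sketch of the user-visible effect (not taken from the patch; the shape
mirrors the omp_alloc-traits.c test): constructing an allocator over
omp_low_lat_mem_space now only succeeds on the device when the access trait
is narrower than "all":

#include <omp.h>
#include <stdio.h>

int
main (void)
{
  int ok_cgroup = 0, ok_all = 0;
  #pragma omp target map(from: ok_cgroup, ok_all)
  {
    omp_alloctrait_t traits[] = { { omp_atk_access, omp_atv_cgroup } };
    omp_allocator_handle_t team_only
      = omp_init_allocator (omp_low_lat_mem_space, 1, traits);
    omp_allocator_handle_t all_access
      = omp_init_allocator (omp_low_lat_mem_space, 0, NULL);

    ok_cgroup = (team_only != omp_null_allocator);
    ok_all = (all_access != omp_null_allocator);  /* expected 0 on NVPTX */

    if (team_only != omp_null_allocator)
      omp_destroy_allocator (team_only);
    if (all_access != omp_null_allocator)
      omp_destroy_allocator (all_access);
  }
  printf ("cgroup access: %d, all access: %d\n", ok_cgroup, ok_all);
  return 0;
}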

libgomp/ChangeLog:

* allocator.c (MEMSPACE_VALIDATE): New macro.
(omp_init_allocator): Use MEMSPACE_VALIDATE.
(omp_aligned_alloc): Use OMP_LOW_LAT_MEM_ALLOC_INVALID.
(omp_aligned_calloc): Likewise.
(omp_realloc): Likewise.
* config/nvptx/allocator.c (nvptx_memspace_validate): New function.
(MEMSPACE_VALIDATE): New macro.
(OMP_LOW_LAT_MEM_ALLOC_INVALID): New define.
* libgomp.texi: Document low-latency implementation details.
* testsuite/libgomp.c/omp_alloc-1.c (main): Add gnu_lowlat.
* testsuite/libgomp.c/omp_alloc-2.c (main): Add gnu_lowlat.
* testsuite/libgomp.c/omp_alloc-3.c (main): Add gnu_lowlat.
* testsuite/libgomp.c/omp_alloc-4.c (main): Add access trait.
* testsuite/libgomp.c/omp_alloc-5.c (main): Add gnu_lowlat.
* testsuite/libgomp.c/omp_alloc-6.c (main): Add access trait.
* testsuite/libgomp.c/omp_alloc-traits.c: New test.
---
 libgomp/allocator.c   | 20 ++
 libgomp/config/nvptx/allocator.c  | 21 ++
 libgomp/libgomp.texi  | 18 +
 libgomp/testsuite/libgomp.c/omp_alloc-1.c | 10 +++
 libgomp/testsuite/libgomp.c/omp_alloc-2.c |  8 +++
 libgomp/testsuite/libgomp.c/omp_alloc-3.c |  7 ++
 libgomp/testsuite/libgomp.c/omp_alloc-4.c |  7 +-
 libgomp/testsuite/libgomp.c/omp_alloc-5.c |  8 +++
 libgomp/testsuite/libgomp.c/omp_alloc-6.c |  7 +-
 .../testsuite/libgomp.c/omp_alloc-traits.c| 66 +++
 10 files changed, 166 insertions(+), 6 deletions(-)
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-traits.c

diff --git a/libgomp/allocator.c b/libgomp/allocator.c
index fa398128368..a8a80f8028d 100644
--- a/libgomp/allocator.c
+++ b/libgomp/allocator.c
@@ -56,6 +56,10 @@
 #define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE) \
   free (((void)(MEMSPACE), (void)(SIZE), (ADDR)))
 #endif
+#ifndef MEMSPACE_VALIDATE
+#define MEMSPACE_VALIDATE(MEMSPACE, ACCESS) \
+  (((void)(MEMSPACE), (void)(ACCESS), 1))
+#endif
 
 /* Map the predefined allocators to the correct memory space.
The index to this table is the omp_allocator_handle_t enum value.
@@ -439,6 +443,10 @@ omp_init_allocator (omp_memspace_handle_t memspace, int ntraits,
   if (data.pinned)
 return omp_null_allocator;
 
+  /* Reject unsupported memory spaces.  */
+  if (!MEMSPACE_VALIDATE (data.memspace, data.access))
+return omp_null_allocator;
+
   ret = gomp_malloc (sizeof (struct omp_allocator_data));
   *ret = data;
 #ifndef HAVE_SYNC_BUILTINS
@@ -522,6 +530,10 @@ retry:
 new_size += new_alignment - sizeof (void *);
   if (__builtin_add_overflow (size, new_size, &new_size))
 goto fail;
+#ifdef OMP_LOW_LAT_MEM_ALLOC_INVALID
+  if (allocator == omp_low_lat_mem_alloc)
+goto fail;
+#endif
 
   if (__builtin_expect (allocator_data
 			&& allocator_data->pool_size < ~(uintptr_t) 0, 0))
@@ -820,6 +832,10 @@ retry:
 goto fail;
   if (__builtin_add_overflow (size_temp, new_size, &new_size))
 goto fail;
+#ifdef OMP_LOW_LAT_MEM_ALLOC_INVALID
+  if (allocator == omp_low_lat_mem_alloc)
+goto fail;
+#endif
 
   if (__builtin_expect (allocator_data
 			&& allocator_data->pool_size < ~(uintptr_t) 0, 0))
@@ -1054,6 +1070,10 @@ retry:
   if (__builtin_add_overflow (size, new_size, &new_size))
 goto fail;
   old_size = data->size;
+#ifdef OMP_LOW_LAT_MEM_ALLOC_INVALID
+  if (allocator == omp_low_lat_mem_alloc)
+goto fail;
+#endif
 
   if (__builtin_expect (allocator_data
 			&& allocator_data->pool_size < ~(uintptr_t) 0, 0))
diff --git a/libgomp/config/nvptx/allocator.c b/libgomp/config/nvptx/allocator.c
index 6014fba177f..a3302411bcb 100644
--- a/libgomp/config/nvptx/allocator.c
+++ b/libgomp/config/nvptx/allocator.c
@@ -108,6 +108,21 @@ nvptx_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
 return realloc (addr, size);
 }
 
+static inline int
+nvptx_memspace_validate (omp_memspace_handle_t memspace, unsigned access)
+{
+#if __PTX_ISA_VERSION_MAJOR__ > 4 \
+|| (__PTX_ISA_VERSION_MAJOR__ == 4 && __PTX_ISA_VERSION_MINOR >= 1)
+  /* Disallow use of low-latency memory when it must be accessible by
+ all threads.  */
+  return (memspace != omp_low_lat_mem_space
+	  || access != omp_atv_all);
+#else
+  /* Low-latency memory is not available before PTX 4.1.  */
+  return (memspace != omp_low_lat_mem_space);
+#endif
+}
+
 #define MEMSPACE_ALLOC(MEMSPACE, SIZE) \
   nvptx_memspace_alloc (MEMSPACE, SIZE)
 #define MEMSPACE_CALLOC(MEMSPACE, SIZE) \
@@ -116,5 +131,11 @@ nvptx_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
   nvptx_memspace_realloc (MEMSPACE, ADDR, OLDSIZE, SIZE)
 #define 

[committed v4 0/3] libgomp: OpenMP low-latency omp_alloc

2023-12-06 Thread Andrew Stubbs
Thank you, Tobias, for approving the v3 patch series with minor changes.

https://patchwork.sourceware.org/project/gcc/list/?series=27815=%2A=both

These patches are what I've actually committed.  Besides the requested
changes there were one or two bug fixes and minor tweaks, but otherwise
the patches are the same.

The series implements device-specific allocators and adds a low-latency
allocator for both GPU architectures.

Andrew Stubbs (3):
  libgomp, nvptx: low-latency memory allocator
  openmp, nvptx: low-lat memory access traits
  amdgcn, libgomp: low-latency allocator

 gcc/config/gcn/gcn-builtins.def   |   2 +
 gcc/config/gcn/gcn.cc |  16 +-
 libgomp/allocator.c   | 266 +++-
 libgomp/basic-allocator.c | 382 ++
 libgomp/config/gcn/allocator.c| 127 ++
 libgomp/config/gcn/libgomp-gcn.h  |   6 +
 libgomp/config/gcn/team.c |  12 +
 libgomp/config/nvptx/allocator.c  | 141 +++
 libgomp/config/nvptx/team.c   |  18 +
 libgomp/libgomp.h |   3 -
 libgomp/libgomp.texi  |  42 +-
 libgomp/plugin/plugin-gcn.c   |  35 +-
 libgomp/plugin/plugin-nvptx.c |  23 +-
 libgomp/testsuite/libgomp.c/omp_alloc-1.c |  66 +++
 libgomp/testsuite/libgomp.c/omp_alloc-2.c |  72 
 libgomp/testsuite/libgomp.c/omp_alloc-3.c |  49 +++
 libgomp/testsuite/libgomp.c/omp_alloc-4.c | 200 +
 libgomp/testsuite/libgomp.c/omp_alloc-5.c |  71 
 libgomp/testsuite/libgomp.c/omp_alloc-6.c | 121 ++
 .../testsuite/libgomp.c/omp_alloc-traits.c|  66 +++
 20 files changed, 1603 insertions(+), 115 deletions(-)
 create mode 100644 libgomp/basic-allocator.c
 create mode 100644 libgomp/config/gcn/allocator.c
 create mode 100644 libgomp/config/nvptx/allocator.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-1.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-2.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-3.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-4.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-5.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-6.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-traits.c

-- 
2.41.0



Re: [PATCH v3 1/3] libgomp, nvptx: low-latency memory allocator

2023-12-05 Thread Andrew Stubbs

On 04/12/2023 16:04, Tobias Burnus wrote:

On 03.12.23 01:32, Andrew Stubbs wrote:

This patch adds support for allocating low-latency ".shared" memory on
NVPTX GPU device, via the omp_low_lat_mem_space and omp_alloc.  The memory
can be allocated, reallocated, and freed using a basic but fast algorithm,
is thread safe and the size of the low-latency heap can be configured using
the GOMP_NVPTX_LOWLAT_POOL environment variable.

The use of the PTX dynamic_smem_size feature means that low-latency allocator
will not work with the PTX 3.1 multilib.

For now, the omp_low_lat_mem_alloc allocator also works, but that will change
when I implement the access traits.


...

LGTM, however, I wonder about the following:


diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
index e5fe7af76af..39d0749e7b3 100644
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -3012,11 +3012,14 @@ value.
  @item omp_const_mem_alloc   @tab omp_const_mem_space
  @item omp_high_bw_mem_alloc @tab omp_high_bw_mem_space
  @item omp_low_lat_mem_alloc @tab omp_low_lat_mem_space
-@item omp_cgroup_mem_alloc  @tab --
-@item omp_pteam_mem_alloc   @tab --
-@item omp_thread_mem_alloc  @tab --
+@item omp_cgroup_mem_alloc  @tab omp_low_lat_mem_space (implementation defined)
+@item omp_pteam_mem_alloc   @tab omp_low_lat_mem_space (implementation defined)
+@item omp_thread_mem_alloc  @tab omp_low_lat_mem_space (implementation defined)

  @end multitable

+The @code{omp_low_lat_mem_space} is only available on supported devices.
+See @ref{Offload-Target Specifics}.
+


Whether it would be clearer to have this wording not here for the 
OMP_ALLOCATOR env, i.e.

https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fALLOCATOR.html
but just a simple crossref like:

--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -3061,5 +3061,5 @@ OMP_ALLOCATOR=omp_low_lat_mem_space:pinned=true,partition=nearest

  @item @emph{See also}:
  @ref{Memory allocation}, @ref{omp_get_default_allocator},
-@ref{omp_set_default_allocator}
+@ref{omp_set_default_allocator}, @ref{Offload-Target Specifics}

  @item @emph{Reference}:


And add your wording to:
   https://gcc.gnu.org/onlinedocs/libgomp/Memory-allocation.html

As this section mentions that "omp_low_lat_mem_space maps to
omp_default_mem_space" in general, mentioning there in addition that
omp_low_lat_mem_space is honored on devices seems to be the better location.


How about this?

--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -3012,9 +3012,9 @@ value.
 @item omp_const_mem_alloc   @tab omp_const_mem_space
 @item omp_high_bw_mem_alloc @tab omp_high_bw_mem_space
 @item omp_low_lat_mem_alloc @tab omp_low_lat_mem_space
-@item omp_cgroup_mem_alloc  @tab --
-@item omp_pteam_mem_alloc   @tab --
-@item omp_thread_mem_alloc  @tab --
+@item omp_cgroup_mem_alloc  @tab omp_low_lat_mem_space (implementation defined)
+@item omp_pteam_mem_alloc   @tab omp_low_lat_mem_space (implementation defined)
+@item omp_thread_mem_alloc  @tab omp_low_lat_mem_space (implementation defined)

 @end multitable

 The predefined allocators use the default values for the traits,
@@ -3060,7 +3060,7 @@ OMP_ALLOCATOR=omp_low_lat_mem_space:pinned=true,partition=nearest


 @item @emph{See also}:
 @ref{Memory allocation}, @ref{omp_get_default_allocator},
-@ref{omp_set_default_allocator}
+@ref{omp_set_default_allocator}, @ref{Offload-Target Specifics}

 @item @emph{Reference}:
 @uref{https://www.openmp.org, OpenMP specification v5.0}, Section 6.21
@@ -5710,7 +5710,8 @@ For the memory spaces, the following applies:
 @itemize
 @item @code{omp_default_mem_space} is supported
 @item @code{omp_const_mem_space} maps to @code{omp_default_mem_space}
-@item @code{omp_low_lat_mem_space} maps to @code{omp_default_mem_space}
+@item @code{omp_low_lat_mem_space} is only available on supported devices,
+  and maps to @code{omp_default_mem_space} otherwise.
 @item @code{omp_large_cap_mem_space} maps to @code{omp_default_mem_space},
   unless the memkind library is available
 @item @code{omp_high_bw_mem_space} maps to @code{omp_default_mem_space},
@@ -5766,6 +5767,9 @@ Additional notes regarding the traits:
 @item The @code{sync_hint} trait has no effect.
 @end itemize

+See also:
+@ref{Offload-Target Specifics}
+
 @c -
 @c Offload-Target Specifics
 @c -



Tobias

-
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 
80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: 
Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; 
Registergericht München, HRB 106955






[PATCH v3 3/3] amdgcn, libgomp: low-latency allocator

2023-12-02 Thread Andrew Stubbs

This implements the OpenMP low-latency memory allocator for AMD GCN using the
small per-team LDS memory (Local Data Store).

Since addresses can now refer to LDS space, the "Global" address space is
no longer compatible.  This patch therefore switches the backend to use
entirely "Flat" addressing (which supports both memories).  A future patch
will re-enable "global" instructions for cases where it is known to be safe
to do so.
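
For illustration only (not part of the patch): each team can carve scratch
space out of its own LDS through the usual OpenMP allocator API.  The team
count and sizes below are arbitrary and assume the pool configured via
GOMP_GCN_LOWLAT_POOL is large enough.

#include <omp.h>

int
main (void)
{
  int ok = 0;
  #pragma omp target teams num_teams(2) map(tofrom: ok)
  {
    /* omp_cgroup_mem_alloc maps to omp_low_lat_mem_space, which this
       patch carves out of the per-team LDS.  */
    int *scratch = (int *) omp_alloc (32 * sizeof (int), omp_cgroup_mem_alloc);
    if (scratch)
      {
        scratch[0] = omp_get_team_num ();
        #pragma omp atomic
        ok += 1;
        omp_free (scratch, omp_cgroup_mem_alloc);
      }
  }
  return ok > 0 ? 0 : 1;
}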

gcc/ChangeLog:

* config/gcn/gcn-builtins.def (DISPATCH_PTR): New built-in.
* config/gcn/gcn.cc (gcn_init_machine_status): Disable global
addressing.
(gcn_expand_builtin_1): Implement GCN_BUILTIN_DISPATCH_PTR.

libgomp/ChangeLog:

* config/gcn/libgomp-gcn.h (TEAM_ARENA_START): Move to here.
(TEAM_ARENA_FREE): Likewise.
(TEAM_ARENA_END): Likewise.
(GCN_LOWLAT_HEAP): New.
* config/gcn/team.c (LITTLEENDIAN_CPU): New, and import hsa.h.
(__gcn_lowlat_init): New prototype.
(gomp_gcn_enter_kernel): Initialize the low-latency heap.
* libgomp.h (TEAM_ARENA_START): Move to libgomp.h.
(TEAM_ARENA_FREE): Likewise.
(TEAM_ARENA_END): Likewise.
* plugin/plugin-gcn.c (lowlat_size): New variable.
(print_kernel_dispatch): Label the group_segment_size purpose.
(init_environment_variables): Read GOMP_GCN_LOWLAT_POOL.
(create_kernel_dispatch): Pass low-latency head allocation to kernel.
(run_kernel): Use shadow; don't assume values.
* testsuite/libgomp.c/omp_alloc-traits.c: Enable for amdgcn.
* config/gcn/allocator.c: New file.
* libgomp.texi: Document low-latency implementation details.
---
 gcc/config/gcn/gcn-builtins.def   |   2 +
 gcc/config/gcn/gcn.cc |  16 ++-
 libgomp/config/gcn/allocator.c| 127 ++
 libgomp/config/gcn/libgomp-gcn.h  |   6 +
 libgomp/config/gcn/team.c |  12 ++
 libgomp/libgomp.h |   3 -
 libgomp/libgomp.texi  |  13 ++
 libgomp/plugin/plugin-gcn.c   |  35 -
 .../testsuite/libgomp.c/omp_alloc-traits.c|   2 +-
 9 files changed, 205 insertions(+), 11 deletions(-)
 create mode 100644 libgomp/config/gcn/allocator.c

diff --git a/gcc/config/gcn/gcn-builtins.def b/gcc/config/gcn/gcn-builtins.def
index 636a8e7a1a9..471457d7c23 100644
--- a/gcc/config/gcn/gcn-builtins.def
+++ b/gcc/config/gcn/gcn-builtins.def
@@ -164,6 +164,8 @@ DEF_BUILTIN (FIRST_CALL_THIS_THREAD_P, -1, "first_call_this_thread_p", B_INSN,
 	 _A1 (GCN_BTI_BOOL), gcn_expand_builtin_1)
 DEF_BUILTIN (KERNARG_PTR, -1, "kernarg_ptr", B_INSN, _A1 (GCN_BTI_VOIDPTR),
 	 gcn_expand_builtin_1)
+DEF_BUILTIN (DISPATCH_PTR, -1, "dispatch_ptr", B_INSN, _A1 (GCN_BTI_VOIDPTR),
+	 gcn_expand_builtin_1)
 DEF_BUILTIN (GET_STACK_LIMIT, -1, "get_stack_limit", B_INSN,
 	 _A1 (GCN_BTI_VOIDPTR), gcn_expand_builtin_1)
 
diff --git a/gcc/config/gcn/gcn.cc b/gcc/config/gcn/gcn.cc
index 22d2b6ebf6d..d70238820dd 100644
--- a/gcc/config/gcn/gcn.cc
+++ b/gcc/config/gcn/gcn.cc
@@ -110,7 +110,8 @@ gcn_init_machine_status (void)
 
   f = ggc_cleared_alloc<machine_function> ();
 
-  if (TARGET_GCN3)
+  // FIXME: re-enable global addressing with safety for LDS-flat addresses
+  //if (TARGET_GCN3)
 f->use_flat_addressing = true;
 
   return f;
@@ -4881,6 +4882,19 @@ gcn_expand_builtin_1 (tree exp, rtx target, rtx /*subtarget */ ,
 	  }
 	return ptr;
   }
+case GCN_BUILTIN_DISPATCH_PTR:
+  {
+	rtx ptr;
+	if (cfun->machine->args.reg[DISPATCH_PTR_ARG] >= 0)
+	   ptr = gen_rtx_REG (DImode,
+			  cfun->machine->args.reg[DISPATCH_PTR_ARG]);
+	else
+	  {
+	ptr = gen_reg_rtx (DImode);
+	emit_move_insn (ptr, const0_rtx);
+	  }
+	return ptr;
+  }
 case GCN_BUILTIN_FIRST_CALL_THIS_THREAD_P:
   {
 	/* Stash a marker in the unused upper 16 bits of s[0:1] to indicate
diff --git a/libgomp/config/gcn/allocator.c b/libgomp/config/gcn/allocator.c
new file mode 100644
index 000..e9a95d683f9
--- /dev/null
+++ b/libgomp/config/gcn/allocator.c
@@ -0,0 +1,127 @@
+/* Copyright (C) 2023 Free Software Foundation, Inc.
+
+   This file is part of the GNU Offloading and Multi Processing Library
+   (libgomp).
+
+   Libgomp is free software; you can redistribute it and/or modify it
+   under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3, or (at your option)
+   any later version.
+
+   Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY
+   WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
+   FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+   more details.
+
+   Under Section 7 of GPL version 3, you are granted additional
+   permissions described in the GCC Runtime Library Exception, version
+   3.1, as published by the Free Software Foundation.
+
+   You should 

[PATCH v3 1/3] libgomp, nvptx: low-latency memory allocator

2023-12-02 Thread Andrew Stubbs

This patch adds support for allocating low-latency ".shared" memory on
NVPTX GPU device, via the omp_low_lat_mem_space and omp_alloc.  The memory
can be allocated, reallocated, and freed using a basic but fast algorithm,
is thread safe and the size of the low-latency heap can be configured using
the GOMP_NVPTX_LOWLAT_POOL environment variable.

The use of the PTX dynamic_smem_size feature means that low-latency allocator
will not work with the PTX 3.1 multilib.

For now, the omp_low_lat_mem_alloc allocator also works, but that will change
when I implement the access traits.
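
A hedged sketch of the other way to reach the new memory space (not from
the patch; the trait values are arbitrary): build an allocator over
omp_low_lat_mem_space with omp_init_allocator and use it on the device.

#include <omp.h>

int
main (void)
{
  int got = 0;
  #pragma omp target map(from: got)
  {
    omp_alloctrait_t traits[] = {
      { omp_atk_pool_size, 4096 },
      { omp_atk_fallback, omp_atv_null_fb }
    };
    omp_allocator_handle_t lowlat
      = omp_init_allocator (omp_low_lat_mem_space, 2, traits);
    char *buf = (char *) omp_alloc (256, lowlat);
    got = (buf != NULL);
    omp_free (buf, lowlat);
    omp_destroy_allocator (lowlat);
  }
  return got ? 0 : 1;
}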

libgomp/ChangeLog:

* allocator.c (MEMSPACE_ALLOC): New macro.
(MEMSPACE_CALLOC): New macro.
(MEMSPACE_REALLOC): New macro.
(MEMSPACE_FREE): New macro.
(predefined_alloc_mapping): New array.  Add _Static_assert to match.
(ARRAY_SIZE): New macro.
(omp_aligned_alloc): Use MEMSPACE_ALLOC.
Implement fall-backs for predefined allocators.  Simplify existing
fall-backs.
(omp_free): Use MEMSPACE_FREE.
(omp_calloc): Use MEMSPACE_CALLOC. Implement fall-backs for
predefined allocators.  Simplify existing fall-backs.
(omp_realloc): Use MEMSPACE_REALLOC, MEMSPACE_ALLOC, and MEMSPACE_FREE.
Implement fall-backs for predefined allocators.  Simplify existing
fall-backs.
* config/nvptx/team.c (__nvptx_lowlat_pool): New asm variable.
(__nvptx_lowlat_init): New prototype.
(gomp_nvptx_main): Call __nvptx_lowlat_init.
* libgomp.texi: Update memory space table.
* plugin/plugin-nvptx.c (lowlat_pool_size): New variable.
(GOMP_OFFLOAD_init_device): Read the GOMP_NVPTX_LOWLAT_POOL envvar.
(GOMP_OFFLOAD_run): Apply lowlat_pool_size.
* basic-allocator.c: New file.
* config/nvptx/allocator.c: New file.
* testsuite/libgomp.c/omp_alloc-1.c: New test.
* testsuite/libgomp.c/omp_alloc-2.c: New test.
* testsuite/libgomp.c/omp_alloc-3.c: New test.
* testsuite/libgomp.c/omp_alloc-4.c: New test.
* testsuite/libgomp.c/omp_alloc-5.c: New test.
* testsuite/libgomp.c/omp_alloc-6.c: New test.

Co-authored-by: Kwok Cheung Yeung  
Co-Authored-By: Thomas Schwinge 
---
 libgomp/allocator.c   | 246 --
 libgomp/basic-allocator.c | 380 ++
 libgomp/config/nvptx/allocator.c  | 120 +++
 libgomp/config/nvptx/team.c   |  18 +
 libgomp/libgomp.texi  |   9 +-
 libgomp/plugin/plugin-nvptx.c |  23 +-
 libgomp/testsuite/libgomp.c/omp_alloc-1.c |  56 
 libgomp/testsuite/libgomp.c/omp_alloc-2.c |  64 
 libgomp/testsuite/libgomp.c/omp_alloc-3.c |  42 +++
 libgomp/testsuite/libgomp.c/omp_alloc-4.c | 196 +++
 libgomp/testsuite/libgomp.c/omp_alloc-5.c |  63 
 libgomp/testsuite/libgomp.c/omp_alloc-6.c | 117 +++
 12 files changed, 1231 insertions(+), 103 deletions(-)
 create mode 100644 libgomp/basic-allocator.c
 create mode 100644 libgomp/config/nvptx/allocator.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-1.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-2.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-3.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-4.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-5.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-6.c

diff --git a/libgomp/allocator.c b/libgomp/allocator.c
index b4e50e2ad72..fa398128368 100644
--- a/libgomp/allocator.c
+++ b/libgomp/allocator.c
@@ -37,6 +37,47 @@
 
 #define omp_max_predefined_alloc omp_thread_mem_alloc
 
+/* These macros may be overridden in config/<target>/allocator.c.
+   The following definitions (ab)use comma operators to avoid unused
+   variable errors.  */
+#ifndef MEMSPACE_ALLOC
+#define MEMSPACE_ALLOC(MEMSPACE, SIZE) \
+  malloc (((void)(MEMSPACE), (SIZE)))
+#endif
+#ifndef MEMSPACE_CALLOC
+#define MEMSPACE_CALLOC(MEMSPACE, SIZE) \
+  calloc (1, (((void)(MEMSPACE), (SIZE))))
+#endif
+#ifndef MEMSPACE_REALLOC
+#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE) \
+  realloc (ADDR, (((void)(MEMSPACE), (void)(OLDSIZE), (SIZE))))
+#endif
+#ifndef MEMSPACE_FREE
+#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE) \
+  free (((void)(MEMSPACE), (void)(SIZE), (ADDR)))
+#endif
+
+/* Map the predefined allocators to the correct memory space.
+   The index to this table is the omp_allocator_handle_t enum value.
+   When the user calls omp_alloc with a predefined allocator this
+   table determines what memory they get.  */
+static const omp_memspace_handle_t predefined_alloc_mapping[] = {
+  omp_default_mem_space,   /* omp_null_allocator doesn't actually use this. */
+  omp_default_mem_space,   /* omp_default_mem_alloc. */
+  omp_large_cap_mem_space, /* omp_large_cap_mem_alloc. */
+  omp_const_mem_space, /* omp_const_mem_alloc. */
+  omp_high_bw_mem_space,   /* omp_high_bw_mem_alloc. */

[PATCH v3 0/3] libgomp: OpenMP low-latency omp_alloc

2023-12-02 Thread Andrew Stubbs
This patch series is a rework of the patch series posted in August.

https://patchwork.sourceware.org/project/gcc/list/?series=23045=%2A=both

The series implements device-specific allocators and adds a low-latency
allocator for both GPU architectures.

This time the omp_low_lat_mem_alloc does not work because the default
traits are incompatible (GPU low-latency memory is not accessible to
other teams).  I've also included documentation and addressed the
comments from Tobias's review.

Andrew

Andrew Stubbs (3):
  libgomp, nvptx: low-latency memory allocator
  openmp, nvptx: low-lat memory access traits
  amdgcn, libgomp: low-latency allocator

 gcc/config/gcn/gcn-builtins.def   |   2 +
 gcc/config/gcn/gcn.cc |  16 +-
 libgomp/allocator.c   | 266 +++-
 libgomp/basic-allocator.c | 380 ++
 libgomp/config/gcn/allocator.c| 127 ++
 libgomp/config/gcn/libgomp-gcn.h  |   6 +
 libgomp/config/gcn/team.c |  12 +
 libgomp/config/nvptx/allocator.c  | 141 +++
 libgomp/config/nvptx/team.c   |  18 +
 libgomp/libgomp.h |   3 -
 libgomp/libgomp.texi  |  40 +-
 libgomp/plugin/plugin-gcn.c   |  35 +-
 libgomp/plugin/plugin-nvptx.c |  23 +-
 libgomp/testsuite/libgomp.c/omp_alloc-1.c |  66 +++
 libgomp/testsuite/libgomp.c/omp_alloc-2.c |  72 
 libgomp/testsuite/libgomp.c/omp_alloc-3.c |  49 +++
 libgomp/testsuite/libgomp.c/omp_alloc-4.c | 197 +
 libgomp/testsuite/libgomp.c/omp_alloc-5.c |  71 
 libgomp/testsuite/libgomp.c/omp_alloc-6.c | 118 ++
 .../testsuite/libgomp.c/omp_alloc-traits.c|  66 +++
 20 files changed, 1595 insertions(+), 113 deletions(-)
 create mode 100644 libgomp/basic-allocator.c
 create mode 100644 libgomp/config/gcn/allocator.c
 create mode 100644 libgomp/config/nvptx/allocator.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-1.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-2.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-3.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-4.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-5.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-6.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-traits.c

-- 
2.41.0



[PATCH v3 2/3] openmp, nvptx: low-lat memory access traits

2023-12-02 Thread Andrew Stubbs

The NVPTX low latency memory is not accessible outside the team that allocates
it, and therefore should be unavailable for allocators with the access trait
"all".  This change means that the omp_low_lat_mem_alloc predefined
allocator no longer works (but omp_cgroup_mem_alloc still does).
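
In practice that means the two predefined allocators now behave differently
on the device; a sketch (not from the patch, and the helper name is made up):

#include <omp.h>

#pragma omp declare target
void
device_buffers (void)
{
  /* omp_cgroup_mem_alloc has a narrower access trait than "all", so it
     may still return .shared memory.  */
  int *team_local = (int *) omp_alloc (64, omp_cgroup_mem_alloc);

  /* omp_low_lat_mem_alloc implies access(all); the request is quietly
     redirected to the default memory space by the fall-back handling.  */
  int *redirected = (int *) omp_alloc (64, omp_low_lat_mem_alloc);

  omp_free (team_local, omp_cgroup_mem_alloc);
  omp_free (redirected, omp_low_lat_mem_alloc);
}
#pragma omp end declare target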

libgomp/ChangeLog:

* allocator.c (MEMSPACE_VALIDATE): New macro.
(omp_init_allocator): Use MEMSPACE_VALIDATE.
(omp_aligned_alloc): Use OMP_LOW_LAT_MEM_ALLOC_INVALID.
(omp_aligned_calloc): Likewise.
(omp_realloc): Likewise.
* config/nvptx/allocator.c (nvptx_memspace_validate): New function.
(MEMSPACE_VALIDATE): New macro.
(OMP_LOW_LAT_MEM_ALLOC_INVALID): New define.
* libgomp.texi: Document low-latency implementation details.
* testsuite/libgomp.c/omp_alloc-1.c (main): Add gnu_lowlat.
* testsuite/libgomp.c/omp_alloc-2.c (main): Add gnu_lowlat.
* testsuite/libgomp.c/omp_alloc-3.c (main): Add gnu_lowlat.
* testsuite/libgomp.c/omp_alloc-4.c (main): Add access trait.
* testsuite/libgomp.c/omp_alloc-5.c (main): Add gnu_lowlat.
* testsuite/libgomp.c/omp_alloc-6.c (main): Add access trait.
* testsuite/libgomp.c/omp_alloc-traits.c: New test.
---
 libgomp/allocator.c   | 20 ++
 libgomp/config/nvptx/allocator.c  | 21 ++
 libgomp/libgomp.texi  | 18 +
 libgomp/testsuite/libgomp.c/omp_alloc-1.c | 10 +++
 libgomp/testsuite/libgomp.c/omp_alloc-2.c |  8 +++
 libgomp/testsuite/libgomp.c/omp_alloc-3.c |  7 ++
 libgomp/testsuite/libgomp.c/omp_alloc-4.c |  7 +-
 libgomp/testsuite/libgomp.c/omp_alloc-5.c |  8 +++
 libgomp/testsuite/libgomp.c/omp_alloc-6.c |  7 +-
 .../testsuite/libgomp.c/omp_alloc-traits.c| 66 +++
 10 files changed, 166 insertions(+), 6 deletions(-)
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-traits.c

diff --git a/libgomp/allocator.c b/libgomp/allocator.c
index fa398128368..a8a80f8028d 100644
--- a/libgomp/allocator.c
+++ b/libgomp/allocator.c
@@ -56,6 +56,10 @@
 #define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE) \
   free (((void)(MEMSPACE), (void)(SIZE), (ADDR)))
 #endif
+#ifndef MEMSPACE_VALIDATE
+#define MEMSPACE_VALIDATE(MEMSPACE, ACCESS) \
+  (((void)(MEMSPACE), (void)(ACCESS), 1))
+#endif
 
 /* Map the predefined allocators to the correct memory space.
The index to this table is the omp_allocator_handle_t enum value.
@@ -439,6 +443,10 @@ omp_init_allocator (omp_memspace_handle_t memspace, int ntraits,
   if (data.pinned)
 return omp_null_allocator;
 
+  /* Reject unsupported memory spaces.  */
+  if (!MEMSPACE_VALIDATE (data.memspace, data.access))
+return omp_null_allocator;
+
   ret = gomp_malloc (sizeof (struct omp_allocator_data));
   *ret = data;
 #ifndef HAVE_SYNC_BUILTINS
@@ -522,6 +530,10 @@ retry:
 new_size += new_alignment - sizeof (void *);
   if (__builtin_add_overflow (size, new_size, &new_size))
 goto fail;
+#ifdef OMP_LOW_LAT_MEM_ALLOC_INVALID
+  if (allocator == omp_low_lat_mem_alloc)
+goto fail;
+#endif
 
   if (__builtin_expect (allocator_data
 			&& allocator_data->pool_size < ~(uintptr_t) 0, 0))
@@ -820,6 +832,10 @@ retry:
 goto fail;
   if (__builtin_add_overflow (size_temp, new_size, &new_size))
 goto fail;
+#ifdef OMP_LOW_LAT_MEM_ALLOC_INVALID
+  if (allocator == omp_low_lat_mem_alloc)
+goto fail;
+#endif
 
   if (__builtin_expect (allocator_data
 			&& allocator_data->pool_size < ~(uintptr_t) 0, 0))
@@ -1054,6 +1070,10 @@ retry:
   if (__builtin_add_overflow (size, new_size, &new_size))
 goto fail;
   old_size = data->size;
+#ifdef OMP_LOW_LAT_MEM_ALLOC_INVALID
+  if (allocator == omp_low_lat_mem_alloc)
+goto fail;
+#endif
 
   if (__builtin_expect (allocator_data
 			&& allocator_data->pool_size < ~(uintptr_t) 0, 0))
diff --git a/libgomp/config/nvptx/allocator.c b/libgomp/config/nvptx/allocator.c
index 6014fba177f..a3302411bcb 100644
--- a/libgomp/config/nvptx/allocator.c
+++ b/libgomp/config/nvptx/allocator.c
@@ -108,6 +108,21 @@ nvptx_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
 return realloc (addr, size);
 }
 
+static inline int
+nvptx_memspace_validate (omp_memspace_handle_t memspace, unsigned access)
+{
+#if __PTX_ISA_VERSION_MAJOR__ > 4 \
+|| (__PTX_ISA_VERSION_MAJOR__ == 4 && __PTX_ISA_VERSION_MINOR >= 1)
+  /* Disallow use of low-latency memory when it must be accessible by
+ all threads.  */
+  return (memspace != omp_low_lat_mem_space
+	  || access != omp_atv_all);
+#else
+  /* Low-latency memory is not available before PTX 4.1.  */
+  return (memspace != omp_low_lat_mem_space);
+#endif
+}
+
 #define MEMSPACE_ALLOC(MEMSPACE, SIZE) \
   nvptx_memspace_alloc (MEMSPACE, SIZE)
 #define MEMSPACE_CALLOC(MEMSPACE, SIZE) \
@@ -116,5 +131,11 @@ nvptx_memspace_realloc (omp_memspace_handle_t memspace, void *addr,
   nvptx_memspace_realloc (MEMSPACE, ADDR, OLDSIZE, SIZE)
 #define 

Re: [PATCH v2 1/6] libgomp: basic pinned memory on Linux

2023-11-29 Thread Andrew Stubbs

On 22/11/2023 14:26, Tobias Burnus wrote:

Hi Andrew,

Side remark:


-#define MEMSPACE_CALLOC(MEMSPACE, SIZE) \
-  calloc (1, (((void)(MEMSPACE), (SIZE))))


This fits a bit more to previous patch, but I wonder whether that should
use (MEMSPACE, NMEMB, SIZE) instead - to fit to the actual calloc 
arguments.


I think the main/only difference between passing SIZE and passing NMEMB and SIZE is that
"If the multiplication of nmemb and size would result in integer overflow,
then calloc() returns an error." (Linux manpage)

However, this wording seems to be neither in POSIX nor in the OpenMP
spec.  There was some alignment discussion at https://gcc.gnu.org/PR112364
regarding whether C (since C23) has a different alignment for
calloc(1, n) vs. calloc(n, 1), but Joseph believes it doesn't.

Thus, this is more bikeshedding than making a real difference.


[Addressing this point separately to the others]

The size has already been calculated, aligned, and padded, before we get 
to calling MEMSPACE_CALLOC. I don't think we can revert to "nmemb, size" 
without breaking that.
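
A minimal sketch of that calling sequence (not from the thread; plain calloc
stands in for MEMSPACE_CALLOC): the nmemb * size product is overflow-checked
and padded by the generic code, so the hook only ever sees one pre-computed
byte count.

#include <stddef.h>
#include <stdlib.h>

static void *
sketch_calloc_path (size_t nmemb, size_t size, size_t header)
{
  size_t bytes, total;
  if (__builtin_mul_overflow (nmemb, size, &bytes))
    return NULL;              /* the nmemb * size overflow check */
  if (__builtin_add_overflow (bytes, header, &total))
    return NULL;              /* header/alignment padding added here */
  return calloc (1, total);   /* MEMSPACE_CALLOC (memspace, total) */
}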


Andrew

