Re: [PATCH v3 1/3] libgomp, nvptx: low-latency memory allocator

2023-12-05 Thread Tobias Burnus

On 05.12.23 16:39, Andrew Stubbs wrote:

Hence, mentioning in this section in addition that
omp_low_lat_mem_space  is honored on devices
seems to be the better location.


How about this?


LGTM – Thanks!

Tobias


--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -3012,9 +3012,9 @@ value.
 @item omp_const_mem_alloc   @tab omp_const_mem_space
 @item omp_high_bw_mem_alloc @tab omp_high_bw_mem_space
 @item omp_low_lat_mem_alloc @tab omp_low_lat_mem_space
-@item omp_cgroup_mem_alloc  @tab --
-@item omp_pteam_mem_alloc   @tab --
-@item omp_thread_mem_alloc  @tab --
+@item omp_cgroup_mem_alloc  @tab omp_low_lat_mem_space
(implementation defined)
+@item omp_pteam_mem_alloc   @tab omp_low_lat_mem_space
(implementation defined)
+@item omp_thread_mem_alloc  @tab omp_low_lat_mem_space
(implementation defined)
 @end multitable

 The predefined allocators use the default values for the traits,
@@ -3060,7 +3060,7 @@
OMP_ALLOCATOR=omp_low_lat_mem_space:pinned=true,partition=nearest

 @item @emph{See also}:
 @ref{Memory allocation}, @ref{omp_get_default_allocator},
-@ref{omp_set_default_allocator}
+@ref{omp_set_default_allocator}, @ref{Offload-Target Specific}

 @item @emph{Reference}:
 @uref{https://www.openmp.org/, OpenMP specification v5.0}, Section 6.21
@@ -5710,7 +5710,8 @@ For the memory spaces, the following applies:
 @itemize
 @item @code{omp_default_mem_space} is supported
 @item @code{omp_const_mem_space} maps to @code{omp_default_mem_space}
-@item @code{omp_low_lat_mem_space} maps to @code{omp_default_mem_space}
+@item @code{omp_low_lat_mem_space} is only available on supported
devices,
+  and maps to @code{omp_default_mem_space} otherwise.
 @item @code{omp_large_cap_mem_space} maps to
@code{omp_default_mem_space},
   unless the memkind library is available
 @item @code{omp_high_bw_mem_space} maps to @code{omp_default_mem_space},
@@ -5766,6 +5767,9 @@ Additional notes regarding the traits:
 @item The @code{sync_hint} trait has no effect.
 @end itemize

+See also:
+@ref{Offload-Target Specifics}
+

-
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 
München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas 
Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht 
München, HRB 106955


Re: [PATCH v3 1/3] libgomp, nvptx: low-latency memory allocator

2023-12-05 Thread Andrew Stubbs

On 04/12/2023 16:04, Tobias Burnus wrote:

On 03.12.23 01:32, Andrew Stubbs wrote:

This patch adds support for allocating low-latency ".shared" memory on
NVPTX GPU device, via the omp_low_lat_mem_space and omp_alloc.  The 
memory
can be allocated, reallocated, and freed using a basic but fast 
algorithm,
is thread safe and the size of the low-latency heap can be configured 
using

the GOMP_NVPTX_LOWLAT_POOL environment variable.

The use of the PTX dynamic_smem_size feature means that low-latency 
allocator

will not work with the PTX 3.1 multilib.

For now, the omp_low_lat_mem_alloc allocator also works, but that will 
change

when I implement the access traits.


...

LGTM, however, I about the following:


diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
index e5fe7af76af..39d0749e7b3 100644
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -3012,11 +3012,14 @@ value.
  @item omp_const_mem_alloc   @tab omp_const_mem_space
  @item omp_high_bw_mem_alloc @tab omp_high_bw_mem_space
  @item omp_low_lat_mem_alloc @tab omp_low_lat_mem_space
-@item omp_cgroup_mem_alloc  @tab --
-@item omp_pteam_mem_alloc   @tab --
-@item omp_thread_mem_alloc  @tab --
+@item omp_cgroup_mem_alloc  @tab omp_low_lat_mem_space 
(implementation defined)
+@item omp_pteam_mem_alloc   @tab omp_low_lat_mem_space 
(implementation defined)
+@item omp_thread_mem_alloc  @tab omp_low_lat_mem_space 
(implementation defined)

  @end multitable

+The @code{omp_low_lat_mem_space} is only available on supported devices.
+See @ref{Offload-Target Specifics}.
+


Whether it would be clearer to have this wording not here for the 
OMP_ALLOCATOR env, i.e.

https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fALLOCATOR.html
but just a simple crossref like:

--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -3061,5 +3061,5 @@ 
OMP_ALLOCATOR=omp_low_lat_mem_space:pinned=true,partition=nearest

  @item @emph{See also}:
  @ref{Memory allocation}, @ref{omp_get_default_allocator},
-@ref{omp_set_default_allocator}
+@ref{omp_set_default_allocator}, @ref{Offload-Target Specifics}

  @item @emph{Reference}:


And add your wording to:
   https://gcc.gnu.org/onlinedocs/libgomp/Memory-allocation.html

As this sections mentions that "omp_low_lat_mem_space maps to 
omp_default_mem_space" in general.
Hence, mentioning in this section in addition that  
omp_low_lat_mem_space  is honored on devices

seems to be the better location.


How about this?

--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -3012,9 +3012,9 @@ value.
 @item omp_const_mem_alloc   @tab omp_const_mem_space
 @item omp_high_bw_mem_alloc @tab omp_high_bw_mem_space
 @item omp_low_lat_mem_alloc @tab omp_low_lat_mem_space
-@item omp_cgroup_mem_alloc  @tab --
-@item omp_pteam_mem_alloc   @tab --
-@item omp_thread_mem_alloc  @tab --
+@item omp_cgroup_mem_alloc  @tab omp_low_lat_mem_space 
(implementation defined)
+@item omp_pteam_mem_alloc   @tab omp_low_lat_mem_space 
(implementation defined)
+@item omp_thread_mem_alloc  @tab omp_low_lat_mem_space 
(implementation defined)

 @end multitable

 The predefined allocators use the default values for the traits,
@@ -3060,7 +3060,7 @@ 
OMP_ALLOCATOR=omp_low_lat_mem_space:pinned=true,partition=nearest


 @item @emph{See also}:
 @ref{Memory allocation}, @ref{omp_get_default_allocator},
-@ref{omp_set_default_allocator}
+@ref{omp_set_default_allocator}, @ref{Offload-Target Specific}

 @item @emph{Reference}:
 @uref{https://www.openmp.org, OpenMP specification v5.0}, Section 6.21
@@ -5710,7 +5710,8 @@ For the memory spaces, the following applies:
 @itemize
 @item @code{omp_default_mem_space} is supported
 @item @code{omp_const_mem_space} maps to @code{omp_default_mem_space}
-@item @code{omp_low_lat_mem_space} maps to @code{omp_default_mem_space}
+@item @code{omp_low_lat_mem_space} is only available on supported devices,
+  and maps to @code{omp_default_mem_space} otherwise.
 @item @code{omp_large_cap_mem_space} maps to @code{omp_default_mem_space},
   unless the memkind library is available
 @item @code{omp_high_bw_mem_space} maps to @code{omp_default_mem_space},
@@ -5766,6 +5767,9 @@ Additional notes regarding the traits:
 @item The @code{sync_hint} trait has no effect.
 @end itemize

+See also:
+@ref{Offload-Target Specifics}
+
 @c -
 @c Offload-Target Specifics
 @c -



Tobias

-
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 
80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: 
Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; 
Registergericht München, HRB 106955






Re: [PATCH v3 1/3] libgomp, nvptx: low-latency memory allocator

2023-12-05 Thread Tobias Burnus

Hi Andrew,

I now looked at the whole series - and the series LGTM,
except for some testcase issues, as outlined below.


First, I notice that there is no call to:
  omp_destroy_allocator (gpu_lowlat);

While it might make sense to leave some of the testcases
without that call for testing purpose, I think at least one
it not all (but one?) should call it for completeness/testing
purpose.

However, the real issue is:

On 03.12.23 01:32, Andrew Stubbs wrote:

  * testsuite/libgomp.c/omp_alloc-4.c: New test.
  * testsuite/libgomp.c/omp_alloc-6.c: New test.


In particular: If you run this without offloading
(not configured, -foffload=disable, or no hardware available)


Result: This will fail all over the place.
I am not sure whether some tests should remain with hostfallback.

If yes, I think you need a check whether
  omp_get_initial_device() == omp_get_default_device()
or something similar.

If not, you should expect it to fail unless
   { target offload_device }


In any case, I get with host fallback the following for the -4.c test:

62: allocate did not coalesce first two chunks
66: allocate did not split first chunk (1)
68: allocate did not split first chunk (2)
73: allocate did not coalesce middle two chunks
77: allocate did not split second chunk (1)
79: allocate did not split second chunk (2)
84: allocate did not coalesce first two chunks, reverse free
95: allocate did not coalesce second two chunks, reverse free
107: allocate did not coalesce first three chunks
111: allocate did not split first chunk (1)
115: allocate did not split first chunk (3)
121: allocate did not coalesce last three chunks
125: allocate did not split second chunk (1)
129: allocate did not split second chunk (3)
135: allocate did not coalesce first three chunks, reverse free
149: allocate did not coalesce second three chunks, reverse free
163: allocate did not coalesce first three chunks, mixed free
167: allocate did not split first chunk (1), mixed free
169: allocate did not split first chunk (2), mixed free
177: allocate did not coalesce second three chunks, mixed free
181: allocate did not split second chunk (1), mixed free
183: allocate did not split second chunk (2), mixed free
192: allocate did not coalesce all memory

And with ASAN already for:
49: allocate did not reuse first chunk
53: allocate did not reuse second chunk
57: allocate did not reuse third chunk
AddressSanitizer:DEADLYSIGNAL
=
==1395708==ERROR: AddressSanitizer: SEGV on unknown address 0xfffb 
(pc 0x7ffa2f649025 bp 0xfffb sp 0x7ffe22a90910 T0)
...
#4 0x401748 in main._omp_fn.0 libgomp/testsuite/libgomp.c/omp_alloc-4.c:59



And the -6 tests fails with:

72: realloc did not extend into whole next chunk
free(): invalid pointer
Segmentation fault (core dumped)

where the free() crash is at:
74  p = omp_realloc (b, size3, lowlat, lowlat);

And with ASAN it fails already with:

48: realloc did not reuse same size chunk, no space after
=
==1396453==ERROR: AddressSanitizer: heap-use-after-free on address 
0x504000b0 at pc 0x7f83c3ef9406 bp 0x7fffc48d7e10 sp 0x7fffc48d75d0
...
READ of size 8 at 0x504000b0 thread T0
...
#5 0x401358 in main libgomp/testsuite/libgomp.c/omp_alloc-6.c:23
freed by thread T0 here:
#1 0x7f83c459af6a in omp_realloc 
../../../repos/gcc/libgomp/config/linux/../../allocator.c:1219
previously allocated by thread T0 here:
#1 0x7f83c459a13b in omp_aligned_alloc 
../../../repos/gcc/libgomp/config/linux/../../allocator.c:626


Tobias

-
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 
München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas 
Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht 
München, HRB 106955


Re: [PATCH v3 1/3] libgomp, nvptx: low-latency memory allocator

2023-12-04 Thread Tobias Burnus

On 03.12.23 01:32, Andrew Stubbs wrote:

This patch adds support for allocating low-latency ".shared" memory on
NVPTX GPU device, via the omp_low_lat_mem_space and omp_alloc.  The memory
can be allocated, reallocated, and freed using a basic but fast algorithm,
is thread safe and the size of the low-latency heap can be configured using
the GOMP_NVPTX_LOWLAT_POOL environment variable.

The use of the PTX dynamic_smem_size feature means that low-latency allocator
will not work with the PTX 3.1 multilib.

For now, the omp_low_lat_mem_alloc allocator also works, but that will change
when I implement the access traits.


...

LGTM, however, I about the following:


diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi
index e5fe7af76af..39d0749e7b3 100644
--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -3012,11 +3012,14 @@ value.
  @item omp_const_mem_alloc   @tab omp_const_mem_space
  @item omp_high_bw_mem_alloc @tab omp_high_bw_mem_space
  @item omp_low_lat_mem_alloc @tab omp_low_lat_mem_space
-@item omp_cgroup_mem_alloc  @tab --
-@item omp_pteam_mem_alloc   @tab --
-@item omp_thread_mem_alloc  @tab --
+@item omp_cgroup_mem_alloc  @tab omp_low_lat_mem_space (implementation 
defined)
+@item omp_pteam_mem_alloc   @tab omp_low_lat_mem_space (implementation 
defined)
+@item omp_thread_mem_alloc  @tab omp_low_lat_mem_space (implementation 
defined)
  @end multitable

+The @code{omp_low_lat_mem_space} is only available on supported devices.
+See @ref{Offload-Target Specifics}.
+


Whether it would be clearer to have this wording not here for the OMP_ALLOCATOR 
env, i.e.
https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fALLOCATOR.html
but just a simple crossref like:

--- a/libgomp/libgomp.texi
+++ b/libgomp/libgomp.texi
@@ -3061,5 +3061,5 @@ 
OMP_ALLOCATOR=omp_low_lat_mem_space:pinned=true,partition=nearest
 @item @emph{See also}:
 @ref{Memory allocation}, @ref{omp_get_default_allocator},
-@ref{omp_set_default_allocator}
+@ref{omp_set_default_allocator}, @ref{Offload-Target Specifics}

 @item @emph{Reference}:


And add your wording to:
  https://gcc.gnu.org/onlinedocs/libgomp/Memory-allocation.html

As this sections mentions that "omp_low_lat_mem_space maps to 
omp_default_mem_space" in general.
Hence, mentioning in this section in addition that  omp_low_lat_mem_space  is 
honored on devices
seems to be the better location.

Tobias

-
Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 
München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas 
Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht 
München, HRB 106955


[PATCH v3 1/3] libgomp, nvptx: low-latency memory allocator

2023-12-02 Thread Andrew Stubbs

This patch adds support for allocating low-latency ".shared" memory on
NVPTX GPU device, via the omp_low_lat_mem_space and omp_alloc.  The memory
can be allocated, reallocated, and freed using a basic but fast algorithm,
is thread safe and the size of the low-latency heap can be configured using
the GOMP_NVPTX_LOWLAT_POOL environment variable.

The use of the PTX dynamic_smem_size feature means that low-latency allocator
will not work with the PTX 3.1 multilib.

For now, the omp_low_lat_mem_alloc allocator also works, but that will change
when I implement the access traits.

libgomp/ChangeLog:

* allocator.c (MEMSPACE_ALLOC): New macro.
(MEMSPACE_CALLOC): New macro.
(MEMSPACE_REALLOC): New macro.
(MEMSPACE_FREE): New macro.
(predefined_alloc_mapping): New array.  Add _Static_assert to match.
(ARRAY_SIZE): New macro.
(omp_aligned_alloc): Use MEMSPACE_ALLOC.
Implement fall-backs for predefined allocators.  Simplify existing
fall-backs.
(omp_free): Use MEMSPACE_FREE.
(omp_calloc): Use MEMSPACE_CALLOC. Implement fall-backs for
predefined allocators.  Simplify existing fall-backs.
(omp_realloc): Use MEMSPACE_REALLOC, MEMSPACE_ALLOC, and MEMSPACE_FREE.
Implement fall-backs for predefined allocators.  Simplify existing
fall-backs.
* config/nvptx/team.c (__nvptx_lowlat_pool): New asm variable.
(__nvptx_lowlat_init): New prototype.
(gomp_nvptx_main): Call __nvptx_lowlat_init.
* libgomp.texi: Update memory space table.
* plugin/plugin-nvptx.c (lowlat_pool_size): New variable.
(GOMP_OFFLOAD_init_device): Read the GOMP_NVPTX_LOWLAT_POOL envvar.
(GOMP_OFFLOAD_run): Apply lowlat_pool_size.
* basic-allocator.c: New file.
* config/nvptx/allocator.c: New file.
* testsuite/libgomp.c/omp_alloc-1.c: New test.
* testsuite/libgomp.c/omp_alloc-2.c: New test.
* testsuite/libgomp.c/omp_alloc-3.c: New test.
* testsuite/libgomp.c/omp_alloc-4.c: New test.
* testsuite/libgomp.c/omp_alloc-5.c: New test.
* testsuite/libgomp.c/omp_alloc-6.c: New test.

Co-authored-by: Kwok Cheung Yeung  
Co-Authored-By: Thomas Schwinge 
---
 libgomp/allocator.c   | 246 --
 libgomp/basic-allocator.c | 380 ++
 libgomp/config/nvptx/allocator.c  | 120 +++
 libgomp/config/nvptx/team.c   |  18 +
 libgomp/libgomp.texi  |   9 +-
 libgomp/plugin/plugin-nvptx.c |  23 +-
 libgomp/testsuite/libgomp.c/omp_alloc-1.c |  56 
 libgomp/testsuite/libgomp.c/omp_alloc-2.c |  64 
 libgomp/testsuite/libgomp.c/omp_alloc-3.c |  42 +++
 libgomp/testsuite/libgomp.c/omp_alloc-4.c | 196 +++
 libgomp/testsuite/libgomp.c/omp_alloc-5.c |  63 
 libgomp/testsuite/libgomp.c/omp_alloc-6.c | 117 +++
 12 files changed, 1231 insertions(+), 103 deletions(-)
 create mode 100644 libgomp/basic-allocator.c
 create mode 100644 libgomp/config/nvptx/allocator.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-1.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-2.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-3.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-4.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-5.c
 create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-6.c

diff --git a/libgomp/allocator.c b/libgomp/allocator.c
index b4e50e2ad72..fa398128368 100644
--- a/libgomp/allocator.c
+++ b/libgomp/allocator.c
@@ -37,6 +37,47 @@
 
 #define omp_max_predefined_alloc omp_thread_mem_alloc
 
+/* These macros may be overridden in config//allocator.c.
+   The following definitions (ab)use comma operators to avoid unused
+   variable errors.  */
+#ifndef MEMSPACE_ALLOC
+#define MEMSPACE_ALLOC(MEMSPACE, SIZE) \
+  malloc (((void)(MEMSPACE), (SIZE)))
+#endif
+#ifndef MEMSPACE_CALLOC
+#define MEMSPACE_CALLOC(MEMSPACE, SIZE) \
+  calloc (1, (((void)(MEMSPACE), (SIZE
+#endif
+#ifndef MEMSPACE_REALLOC
+#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE) \
+  realloc (ADDR, (((void)(MEMSPACE), (void)(OLDSIZE), (SIZE
+#endif
+#ifndef MEMSPACE_FREE
+#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE) \
+  free (((void)(MEMSPACE), (void)(SIZE), (ADDR)))
+#endif
+
+/* Map the predefined allocators to the correct memory space.
+   The index to this table is the omp_allocator_handle_t enum value.
+   When the user calls omp_alloc with a predefined allocator this
+   table determines what memory they get.  */
+static const omp_memspace_handle_t predefined_alloc_mapping[] = {
+  omp_default_mem_space,   /* omp_null_allocator doesn't actually use this. */
+  omp_default_mem_space,   /* omp_default_mem_alloc. */
+  omp_large_cap_mem_space, /* omp_large_cap_mem_alloc. */
+  omp_const_mem_space, /* omp_const_mem_alloc. */
+  omp_high_bw_mem_space,   /* omp_high_bw_mem_alloc. */