Re: [PATCH v3 1/3] libgomp, nvptx: low-latency memory allocator
On 05.12.23 16:39, Andrew Stubbs wrote: Hence, mentioning in this section in addition that omp_low_lat_mem_space is honored on devices seems to be the better location. How about this? LGTM – Thanks! Tobias --- a/libgomp/libgomp.texi +++ b/libgomp/libgomp.texi @@ -3012,9 +3012,9 @@ value. @item omp_const_mem_alloc @tab omp_const_mem_space @item omp_high_bw_mem_alloc @tab omp_high_bw_mem_space @item omp_low_lat_mem_alloc @tab omp_low_lat_mem_space -@item omp_cgroup_mem_alloc @tab -- -@item omp_pteam_mem_alloc @tab -- -@item omp_thread_mem_alloc @tab -- +@item omp_cgroup_mem_alloc @tab omp_low_lat_mem_space (implementation defined) +@item omp_pteam_mem_alloc @tab omp_low_lat_mem_space (implementation defined) +@item omp_thread_mem_alloc @tab omp_low_lat_mem_space (implementation defined) @end multitable The predefined allocators use the default values for the traits, @@ -3060,7 +3060,7 @@ OMP_ALLOCATOR=omp_low_lat_mem_space:pinned=true,partition=nearest @item @emph{See also}: @ref{Memory allocation}, @ref{omp_get_default_allocator}, -@ref{omp_set_default_allocator} +@ref{omp_set_default_allocator}, @ref{Offload-Target Specific} @item @emph{Reference}: @uref{https://www.openmp.org/, OpenMP specification v5.0}, Section 6.21 @@ -5710,7 +5710,8 @@ For the memory spaces, the following applies: @itemize @item @code{omp_default_mem_space} is supported @item @code{omp_const_mem_space} maps to @code{omp_default_mem_space} -@item @code{omp_low_lat_mem_space} maps to @code{omp_default_mem_space} +@item @code{omp_low_lat_mem_space} is only available on supported devices, + and maps to @code{omp_default_mem_space} otherwise. @item @code{omp_large_cap_mem_space} maps to @code{omp_default_mem_space}, unless the memkind library is available @item @code{omp_high_bw_mem_space} maps to @code{omp_default_mem_space}, @@ -5766,6 +5767,9 @@ Additional notes regarding the traits: @item The @code{sync_hint} trait has no effect. @end itemize +See also: +@ref{Offload-Target Specifics} + - Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
Re: [PATCH v3 1/3] libgomp, nvptx: low-latency memory allocator
On 04/12/2023 16:04, Tobias Burnus wrote: On 03.12.23 01:32, Andrew Stubbs wrote: This patch adds support for allocating low-latency ".shared" memory on NVPTX GPU device, via the omp_low_lat_mem_space and omp_alloc. The memory can be allocated, reallocated, and freed using a basic but fast algorithm, is thread safe and the size of the low-latency heap can be configured using the GOMP_NVPTX_LOWLAT_POOL environment variable. The use of the PTX dynamic_smem_size feature means that low-latency allocator will not work with the PTX 3.1 multilib. For now, the omp_low_lat_mem_alloc allocator also works, but that will change when I implement the access traits. ... LGTM, however, I about the following: diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi index e5fe7af76af..39d0749e7b3 100644 --- a/libgomp/libgomp.texi +++ b/libgomp/libgomp.texi @@ -3012,11 +3012,14 @@ value. @item omp_const_mem_alloc @tab omp_const_mem_space @item omp_high_bw_mem_alloc @tab omp_high_bw_mem_space @item omp_low_lat_mem_alloc @tab omp_low_lat_mem_space -@item omp_cgroup_mem_alloc @tab -- -@item omp_pteam_mem_alloc @tab -- -@item omp_thread_mem_alloc @tab -- +@item omp_cgroup_mem_alloc @tab omp_low_lat_mem_space (implementation defined) +@item omp_pteam_mem_alloc @tab omp_low_lat_mem_space (implementation defined) +@item omp_thread_mem_alloc @tab omp_low_lat_mem_space (implementation defined) @end multitable +The @code{omp_low_lat_mem_space} is only available on supported devices. +See @ref{Offload-Target Specifics}. + Whether it would be clearer to have this wording not here for the OMP_ALLOCATOR env, i.e. https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fALLOCATOR.html but just a simple crossref like: --- a/libgomp/libgomp.texi +++ b/libgomp/libgomp.texi @@ -3061,5 +3061,5 @@ OMP_ALLOCATOR=omp_low_lat_mem_space:pinned=true,partition=nearest @item @emph{See also}: @ref{Memory allocation}, @ref{omp_get_default_allocator}, -@ref{omp_set_default_allocator} +@ref{omp_set_default_allocator}, @ref{Offload-Target Specifics} @item @emph{Reference}: And add your wording to: https://gcc.gnu.org/onlinedocs/libgomp/Memory-allocation.html As this sections mentions that "omp_low_lat_mem_space maps to omp_default_mem_space" in general. Hence, mentioning in this section in addition that omp_low_lat_mem_space is honored on devices seems to be the better location. How about this? --- a/libgomp/libgomp.texi +++ b/libgomp/libgomp.texi @@ -3012,9 +3012,9 @@ value. @item omp_const_mem_alloc @tab omp_const_mem_space @item omp_high_bw_mem_alloc @tab omp_high_bw_mem_space @item omp_low_lat_mem_alloc @tab omp_low_lat_mem_space -@item omp_cgroup_mem_alloc @tab -- -@item omp_pteam_mem_alloc @tab -- -@item omp_thread_mem_alloc @tab -- +@item omp_cgroup_mem_alloc @tab omp_low_lat_mem_space (implementation defined) +@item omp_pteam_mem_alloc @tab omp_low_lat_mem_space (implementation defined) +@item omp_thread_mem_alloc @tab omp_low_lat_mem_space (implementation defined) @end multitable The predefined allocators use the default values for the traits, @@ -3060,7 +3060,7 @@ OMP_ALLOCATOR=omp_low_lat_mem_space:pinned=true,partition=nearest @item @emph{See also}: @ref{Memory allocation}, @ref{omp_get_default_allocator}, -@ref{omp_set_default_allocator} +@ref{omp_set_default_allocator}, @ref{Offload-Target Specific} @item @emph{Reference}: @uref{https://www.openmp.org, OpenMP specification v5.0}, Section 6.21 @@ -5710,7 +5710,8 @@ For the memory spaces, the following applies: @itemize @item @code{omp_default_mem_space} is supported @item @code{omp_const_mem_space} maps to @code{omp_default_mem_space} -@item @code{omp_low_lat_mem_space} maps to @code{omp_default_mem_space} +@item @code{omp_low_lat_mem_space} is only available on supported devices, + and maps to @code{omp_default_mem_space} otherwise. @item @code{omp_large_cap_mem_space} maps to @code{omp_default_mem_space}, unless the memkind library is available @item @code{omp_high_bw_mem_space} maps to @code{omp_default_mem_space}, @@ -5766,6 +5767,9 @@ Additional notes regarding the traits: @item The @code{sync_hint} trait has no effect. @end itemize +See also: +@ref{Offload-Target Specifics} + @c - @c Offload-Target Specifics @c - Tobias - Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
Re: [PATCH v3 1/3] libgomp, nvptx: low-latency memory allocator
Hi Andrew, I now looked at the whole series - and the series LGTM, except for some testcase issues, as outlined below. First, I notice that there is no call to: omp_destroy_allocator (gpu_lowlat); While it might make sense to leave some of the testcases without that call for testing purpose, I think at least one it not all (but one?) should call it for completeness/testing purpose. However, the real issue is: On 03.12.23 01:32, Andrew Stubbs wrote: * testsuite/libgomp.c/omp_alloc-4.c: New test. * testsuite/libgomp.c/omp_alloc-6.c: New test. In particular: If you run this without offloading (not configured, -foffload=disable, or no hardware available) Result: This will fail all over the place. I am not sure whether some tests should remain with hostfallback. If yes, I think you need a check whether omp_get_initial_device() == omp_get_default_device() or something similar. If not, you should expect it to fail unless { target offload_device } In any case, I get with host fallback the following for the -4.c test: 62: allocate did not coalesce first two chunks 66: allocate did not split first chunk (1) 68: allocate did not split first chunk (2) 73: allocate did not coalesce middle two chunks 77: allocate did not split second chunk (1) 79: allocate did not split second chunk (2) 84: allocate did not coalesce first two chunks, reverse free 95: allocate did not coalesce second two chunks, reverse free 107: allocate did not coalesce first three chunks 111: allocate did not split first chunk (1) 115: allocate did not split first chunk (3) 121: allocate did not coalesce last three chunks 125: allocate did not split second chunk (1) 129: allocate did not split second chunk (3) 135: allocate did not coalesce first three chunks, reverse free 149: allocate did not coalesce second three chunks, reverse free 163: allocate did not coalesce first three chunks, mixed free 167: allocate did not split first chunk (1), mixed free 169: allocate did not split first chunk (2), mixed free 177: allocate did not coalesce second three chunks, mixed free 181: allocate did not split second chunk (1), mixed free 183: allocate did not split second chunk (2), mixed free 192: allocate did not coalesce all memory And with ASAN already for: 49: allocate did not reuse first chunk 53: allocate did not reuse second chunk 57: allocate did not reuse third chunk AddressSanitizer:DEADLYSIGNAL = ==1395708==ERROR: AddressSanitizer: SEGV on unknown address 0xfffb (pc 0x7ffa2f649025 bp 0xfffb sp 0x7ffe22a90910 T0) ... #4 0x401748 in main._omp_fn.0 libgomp/testsuite/libgomp.c/omp_alloc-4.c:59 And the -6 tests fails with: 72: realloc did not extend into whole next chunk free(): invalid pointer Segmentation fault (core dumped) where the free() crash is at: 74 p = omp_realloc (b, size3, lowlat, lowlat); And with ASAN it fails already with: 48: realloc did not reuse same size chunk, no space after = ==1396453==ERROR: AddressSanitizer: heap-use-after-free on address 0x504000b0 at pc 0x7f83c3ef9406 bp 0x7fffc48d7e10 sp 0x7fffc48d75d0 ... READ of size 8 at 0x504000b0 thread T0 ... #5 0x401358 in main libgomp/testsuite/libgomp.c/omp_alloc-6.c:23 freed by thread T0 here: #1 0x7f83c459af6a in omp_realloc ../../../repos/gcc/libgomp/config/linux/../../allocator.c:1219 previously allocated by thread T0 here: #1 0x7f83c459a13b in omp_aligned_alloc ../../../repos/gcc/libgomp/config/linux/../../allocator.c:626 Tobias - Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
Re: [PATCH v3 1/3] libgomp, nvptx: low-latency memory allocator
On 03.12.23 01:32, Andrew Stubbs wrote: This patch adds support for allocating low-latency ".shared" memory on NVPTX GPU device, via the omp_low_lat_mem_space and omp_alloc. The memory can be allocated, reallocated, and freed using a basic but fast algorithm, is thread safe and the size of the low-latency heap can be configured using the GOMP_NVPTX_LOWLAT_POOL environment variable. The use of the PTX dynamic_smem_size feature means that low-latency allocator will not work with the PTX 3.1 multilib. For now, the omp_low_lat_mem_alloc allocator also works, but that will change when I implement the access traits. ... LGTM, however, I about the following: diff --git a/libgomp/libgomp.texi b/libgomp/libgomp.texi index e5fe7af76af..39d0749e7b3 100644 --- a/libgomp/libgomp.texi +++ b/libgomp/libgomp.texi @@ -3012,11 +3012,14 @@ value. @item omp_const_mem_alloc @tab omp_const_mem_space @item omp_high_bw_mem_alloc @tab omp_high_bw_mem_space @item omp_low_lat_mem_alloc @tab omp_low_lat_mem_space -@item omp_cgroup_mem_alloc @tab -- -@item omp_pteam_mem_alloc @tab -- -@item omp_thread_mem_alloc @tab -- +@item omp_cgroup_mem_alloc @tab omp_low_lat_mem_space (implementation defined) +@item omp_pteam_mem_alloc @tab omp_low_lat_mem_space (implementation defined) +@item omp_thread_mem_alloc @tab omp_low_lat_mem_space (implementation defined) @end multitable +The @code{omp_low_lat_mem_space} is only available on supported devices. +See @ref{Offload-Target Specifics}. + Whether it would be clearer to have this wording not here for the OMP_ALLOCATOR env, i.e. https://gcc.gnu.org/onlinedocs/libgomp/OMP_005fALLOCATOR.html but just a simple crossref like: --- a/libgomp/libgomp.texi +++ b/libgomp/libgomp.texi @@ -3061,5 +3061,5 @@ OMP_ALLOCATOR=omp_low_lat_mem_space:pinned=true,partition=nearest @item @emph{See also}: @ref{Memory allocation}, @ref{omp_get_default_allocator}, -@ref{omp_set_default_allocator} +@ref{omp_set_default_allocator}, @ref{Offload-Target Specifics} @item @emph{Reference}: And add your wording to: https://gcc.gnu.org/onlinedocs/libgomp/Memory-allocation.html As this sections mentions that "omp_low_lat_mem_space maps to omp_default_mem_space" in general. Hence, mentioning in this section in addition that omp_low_lat_mem_space is honored on devices seems to be the better location. Tobias - Siemens Electronic Design Automation GmbH; Anschrift: Arnulfstraße 201, 80634 München; Gesellschaft mit beschränkter Haftung; Geschäftsführer: Thomas Heurung, Frank Thürauf; Sitz der Gesellschaft: München; Registergericht München, HRB 106955
[PATCH v3 1/3] libgomp, nvptx: low-latency memory allocator
This patch adds support for allocating low-latency ".shared" memory on NVPTX GPU device, via the omp_low_lat_mem_space and omp_alloc. The memory can be allocated, reallocated, and freed using a basic but fast algorithm, is thread safe and the size of the low-latency heap can be configured using the GOMP_NVPTX_LOWLAT_POOL environment variable. The use of the PTX dynamic_smem_size feature means that low-latency allocator will not work with the PTX 3.1 multilib. For now, the omp_low_lat_mem_alloc allocator also works, but that will change when I implement the access traits. libgomp/ChangeLog: * allocator.c (MEMSPACE_ALLOC): New macro. (MEMSPACE_CALLOC): New macro. (MEMSPACE_REALLOC): New macro. (MEMSPACE_FREE): New macro. (predefined_alloc_mapping): New array. Add _Static_assert to match. (ARRAY_SIZE): New macro. (omp_aligned_alloc): Use MEMSPACE_ALLOC. Implement fall-backs for predefined allocators. Simplify existing fall-backs. (omp_free): Use MEMSPACE_FREE. (omp_calloc): Use MEMSPACE_CALLOC. Implement fall-backs for predefined allocators. Simplify existing fall-backs. (omp_realloc): Use MEMSPACE_REALLOC, MEMSPACE_ALLOC, and MEMSPACE_FREE. Implement fall-backs for predefined allocators. Simplify existing fall-backs. * config/nvptx/team.c (__nvptx_lowlat_pool): New asm variable. (__nvptx_lowlat_init): New prototype. (gomp_nvptx_main): Call __nvptx_lowlat_init. * libgomp.texi: Update memory space table. * plugin/plugin-nvptx.c (lowlat_pool_size): New variable. (GOMP_OFFLOAD_init_device): Read the GOMP_NVPTX_LOWLAT_POOL envvar. (GOMP_OFFLOAD_run): Apply lowlat_pool_size. * basic-allocator.c: New file. * config/nvptx/allocator.c: New file. * testsuite/libgomp.c/omp_alloc-1.c: New test. * testsuite/libgomp.c/omp_alloc-2.c: New test. * testsuite/libgomp.c/omp_alloc-3.c: New test. * testsuite/libgomp.c/omp_alloc-4.c: New test. * testsuite/libgomp.c/omp_alloc-5.c: New test. * testsuite/libgomp.c/omp_alloc-6.c: New test. Co-authored-by: Kwok Cheung Yeung Co-Authored-By: Thomas Schwinge --- libgomp/allocator.c | 246 -- libgomp/basic-allocator.c | 380 ++ libgomp/config/nvptx/allocator.c | 120 +++ libgomp/config/nvptx/team.c | 18 + libgomp/libgomp.texi | 9 +- libgomp/plugin/plugin-nvptx.c | 23 +- libgomp/testsuite/libgomp.c/omp_alloc-1.c | 56 libgomp/testsuite/libgomp.c/omp_alloc-2.c | 64 libgomp/testsuite/libgomp.c/omp_alloc-3.c | 42 +++ libgomp/testsuite/libgomp.c/omp_alloc-4.c | 196 +++ libgomp/testsuite/libgomp.c/omp_alloc-5.c | 63 libgomp/testsuite/libgomp.c/omp_alloc-6.c | 117 +++ 12 files changed, 1231 insertions(+), 103 deletions(-) create mode 100644 libgomp/basic-allocator.c create mode 100644 libgomp/config/nvptx/allocator.c create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-1.c create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-2.c create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-3.c create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-4.c create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-5.c create mode 100644 libgomp/testsuite/libgomp.c/omp_alloc-6.c diff --git a/libgomp/allocator.c b/libgomp/allocator.c index b4e50e2ad72..fa398128368 100644 --- a/libgomp/allocator.c +++ b/libgomp/allocator.c @@ -37,6 +37,47 @@ #define omp_max_predefined_alloc omp_thread_mem_alloc +/* These macros may be overridden in config//allocator.c. + The following definitions (ab)use comma operators to avoid unused + variable errors. */ +#ifndef MEMSPACE_ALLOC +#define MEMSPACE_ALLOC(MEMSPACE, SIZE) \ + malloc (((void)(MEMSPACE), (SIZE))) +#endif +#ifndef MEMSPACE_CALLOC +#define MEMSPACE_CALLOC(MEMSPACE, SIZE) \ + calloc (1, (((void)(MEMSPACE), (SIZE +#endif +#ifndef MEMSPACE_REALLOC +#define MEMSPACE_REALLOC(MEMSPACE, ADDR, OLDSIZE, SIZE) \ + realloc (ADDR, (((void)(MEMSPACE), (void)(OLDSIZE), (SIZE +#endif +#ifndef MEMSPACE_FREE +#define MEMSPACE_FREE(MEMSPACE, ADDR, SIZE) \ + free (((void)(MEMSPACE), (void)(SIZE), (ADDR))) +#endif + +/* Map the predefined allocators to the correct memory space. + The index to this table is the omp_allocator_handle_t enum value. + When the user calls omp_alloc with a predefined allocator this + table determines what memory they get. */ +static const omp_memspace_handle_t predefined_alloc_mapping[] = { + omp_default_mem_space, /* omp_null_allocator doesn't actually use this. */ + omp_default_mem_space, /* omp_default_mem_alloc. */ + omp_large_cap_mem_space, /* omp_large_cap_mem_alloc. */ + omp_const_mem_space, /* omp_const_mem_alloc. */ + omp_high_bw_mem_space, /* omp_high_bw_mem_alloc. */