Re: [PATCH 2/2] libgomp/gcn: cache kernel argument allocations
Hi Arsen, On 10/03/2026 10:20, Arsen Arsenović wrote: I'm happy with this patch for mainline, as long as it's properly tested with kernels that have a lot of mappings. I've ran it through OpenMP VV, and omptests, neither of which regressed. I'm not aware of any especially good real-world testcases that I could use, so I'm open to suggestions on that front. FWIW, SPEC HPC's clvleaf has a lot of mappings. See for instance hydro.F90. Hope this helps! -- PA
Re: [PATCH 2/2] libgomp/gcn: cache kernel argument allocations
Andrew Stubbs writes: > I can't see any great problems with this. Obviously it would be nice > if the "ephemeral memories" and kernargs caches shared the same > implementation, instead of being almost-but-not-quite the same, but > that's more work. Yeah, I initially tried to fit the kernargs into the ephemeral memory code, but it'd have been a bigger change because the ephemeral memory allocation code depends on the kernel arguments already having been partially populated, or something of such nature (it's been a while since I looked at it so I've forgotten the details - sorry). The data structure here is probably the same as the ephemeral memories one, except with a different size-checking mechanism, but that can probably be remedied. > Again, it's not specifically this patch, but did you consider finding > a way to avoid the "shadow" allocation? That's been there since we > first copied the code from some HSA example somewhere, but I don't > know why we need it. We don't, really. It's very superfluous, I could allot some time to removing it and cleaning that up, but it is a larger change than this patch would've needed. It could at least be turned into an automatic storage variable, really, since its lifetime is entirely constrained to run_kernel. Maybe GCC already does that transform? ;) > I'm happy with this patch for mainline, as long as it's properly > tested with kernels that have a lot of mappings. I've ran it through OpenMP VV, and omptests, neither of which regressed. I'm not aware of any especially good real-world testcases that I could use, so I'm open to suggestions on that front. I could also just synthesize some trivial test with, like, 80 mappings to make sure that works. That should hit the concerning test case and could fit in the testsuite, but really it depends on what size of stack people run their OpenMP/ACC programs with and how deep their call stacks are, so I'm not sure it'd be very representative. -- Arsen Arsenović signature.asc Description: PGP signature
Re: [PATCH 2/2] libgomp/gcn: cache kernel argument allocations
On 06/03/2026 14:59, Arsen Arsenović wrote: On AMD GCN, for each kernel that we execute on the GPUs, the vast majority of the time preparing the kernel for execution is spent in memory allocation and deallocation for the kernel arguments. Out of the total execution time of run_kernel, which is the GCN plugin function that actually performs launching a kernel, ~83.5% of execution time is spent in these (de)allocation routines. Obviously, then, these calls should be elliminated. However, it is not possible to avoid needing to allocate kernel arguments. To this end, this patch implements a cache of kernel argument allocations. We expect this cache to be of size T where T is the maximum number of kernels being launched in parallel. This should be a fairly small number, as there isn't much benefit to (or, to my awareness, real world code that) executing very many kernels in parallel. As the kernel argument allocations are of differing sizes, there is benefit in not making too many disparate classes of sizes. To prevent this happening, the plugin rounds up the size of the varying part of the kernel argument allocation (the target variable table) to a multiple of sixty-four pointers, and ensures that this size is never zero. This should result in the vast majority of allocations being able to reuse the same cache nodes. In my experiments (with BabelStream, though this should by no means be improvements specific to it as run_kernel is used for all kernels and branches very little), this was able to cut the non-kernel-wait runtime of run_kernel by a factor of 5.5x. Cumulatively, this patch and the previous improved BabelStream results in the default configuration by 47% in the Copy kernel, and 35.5% on average across all kernels. libgomp/ChangeLog: * plugin/plugin-gcn.c (struct kernel_dispatch): Add a field to hold a pointer to the allocation cache node this dispatch is holding for kernel arguments, replacing kernarg_address. (print_kernel_dispatch): Print the allocation pointer from that node as kernargs address. (struct agent_info): Add in an allocation cache field. (alloc_kernargs_on_agent): New function. Pulls kernel arguments from the cache, or, if no appropriate node is found, allocates new ones. (create_kernel_dispatch): Round HOST_VARS_SIZE to a multiple of 64 pointers, and ensure it is nonzero. Use alloc_kernargs_on_agent to allocate kernargs. (release_kernel_dispatch): Use release_alloc_cache_node to release kernargs. (run_kernel): Update usages of kernarg_address to use the kernel arguments cache node. (GOMP_OFFLOAD_fini_device): Clean up kernargs cache. (GOMP_OFFLOAD_init_device): Initialize kernargs cache. * alloc_cache.h: New file. * testsuite/libgomp.c/alloc_cache-1.c: New test. I can't see any great problems with this. Obviously it would be nice if the "ephemeral memories" and kernargs caches shared the same implementation, instead of being almost-but-not-quite the same, but that's more work. Again, it's not specifically this patch, but did you consider finding a way to avoid the "shadow" allocation? That's been there since we first copied the code from some HSA example somewhere, but I don't know why we need it. I'm happy with this patch for mainline, as long as it's properly tested with kernels that have a lot of mappings. Andrew --- libgomp/alloc_cache.h | 144 libgomp/plugin/plugin-gcn.c | 87 ++-- libgomp/testsuite/libgomp.c/alloc_cache-1.c | 62 + 3 files changed, 279 insertions(+), 14 deletions(-) create mode 100644 libgomp/alloc_cache.h create mode 100644 libgomp/testsuite/libgomp.c/alloc_cache-1.c diff --git a/libgomp/alloc_cache.h b/libgomp/alloc_cache.h new file mode 100644 index ..782569c1faec --- /dev/null +++ b/libgomp/alloc_cache.h @@ -0,0 +1,144 @@ +/* A simple allocation cache. + Copyright (C) 2026 Free Software Foundation, Inc. + + This file is part of the GNU Offloading and Multi Processing Library + (libgomp). + + Libgomp is free software; you can redistribute it and/or modify it + under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3, or (at your option) + any later version. + + Libgomp is distributed in the hope that it will be useful, but WITHOUT ANY + WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS + FOR A PARTICULAR PURPOSE. See the GNU General Public License for + more details. + + Under Section 7 of GPL version 3, you are granted additional + permissions described in the GCC Runtime Library Exception, version + 3.1, as published by the Free Software Foundation. + + You should have received a copy of the GNU General Public License and + a copy of the GCC Runtime Librar
Re: [PATCH 2/2] libgomp/gcn: cache kernel argument allocations
Arsen Arsenović wrote:
Given that the kernel argument allocation cache is a change isolate to
GCN, I think it might be safe to merge even now (despite being late),
but I reckon it's probably way too far along in this stage to merge the
target variable table allocation change.
I think it would have helped, if you had reversed the order - as
only applying 2/2 will fail as 'create_kernel_dispatch' has been
modified by 1/2 – and
'sizeof (struct kernargs) + host_vars_size'
explicitly requires the host_vars_size part.
I concur that the 2/2 patch is of low risk and has definitely
benefits. For 1/2, it seems as if both waiting for Stage 1 and
possibly some additional modifications would be useful
(as you alluded in the description; I have not read that patch).
* * *
Arsen Arsenović wrote:
On AMD GCN, for each kernel that we execute on the GPUs, the vast
majority of the time preparing the kernel for execution is spent in
memory allocation and deallocation for the kernel arguments.
...
To this end, this patch implements a cache of kernel argument
allocations.
We expect this cache to be of size T where T is the maximum number of
kernels being launched in parallel. This should be a fairly small
number, as there isn't much benefit to (or, to my awareness, real world
code that) executing very many kernels in parallel.
It seems as if a dozen to a hundred seems to be a sensible upper bound,
even if more are in theory possibly (and I bet some odd program uses
a thousand).
Only looking at 2/2, i.e. without looking at host_vars_size, the
kernel args is sizeof(struct kernargs), i.e.:
struct kernargs {
struct kernargs_abi abi;
struct output output_data;
};
The first one contains 6× int64_t + 2× int = 52 bytes.
The second one contains:
- 2× int
- 1024× struct printf_data
{ int, char[128], int, {char[128]/uint64_t[16]}
= 4 + 128 + 4 + 128 bytes = 264 bytes
→ 316 bytes, assuming 32bit alignment / no padding.
This seems to get allocated in non-pageable host memory and the system
should have quite a lot of memory.
Without the need to handle host_vars_size, the number of caches should
be exactly the same as the number of concurrent asynchronous kernels,
this number seems to be fine - not that I have checked how the
non-pagable flag will be handled (i.e. in how far data can be
coallacent on a single page vs. using multiple pages).
I have not looked at the host_vars_size code, i.e. how that handles
different sizes. (I have the feeling that especially with iterators,
this can also grow quite a bit - on the other hand, the larger the
kernels / more complex data mapping, the less likely there are many
large kernels. - Still several small ones in one code part and fewer
large ones elsewhere might exist.)
On top comes the data structure of the caching code, but that's
unsurprisingly not that much:
sizeof(pthread_mutex_t) + 2*sizeof(void*) + sizeof (size_t).
The current chaching is a single-linked list, with new elements
added to the front. Thus, it grows with O(#cached-kern_args);
As new elements get added to the front: If a new kernel needs to
be created because the host_vars_size is too large, the large
vars ones will be created at the start of the list, reducing the
need of walking the whole list by favoring (unused) large-size
nodes.
* * *
The caching code in alloc_cache.h uses pthread_mutex_t – but as
it is only included in plugin/plugin-gcn.c, which already uses
pthreads, that's fine.
* * *
As the kernel argument allocations are of differing sizes, there is
benefit in not making too many disparate classes of sizes. To prevent
this happening, the plugin rounds up the size of the varying part of the
kernel argument allocation (the target variable table) to a multiple of
sixty-four pointers, and ensures that this size is never zero. This
should result in the vast majority of allocations being able to reuse
the same cache nodes.
This only becomes active together with patch 1/2. I think when only
applying 2/2, there is nothing to worry about – albeit the review of
this code should be done with the 1/2 patch in mind and we may need
to revisit this when eventually adding a patch like 1/2, assuming that
2/2 is indeed committed first.
* * *
I think except that it needs to be rediffed to be applicable without
patch 1/2, the patch is okay.
I still want to re-read the patch once (possibly the rediffed one but
the change is obvious) - and I would like to wait a few days for
comments by others like Andrew and Jakub, in particular.
libgomp/ChangeLog:
* plugin/plugin-gcn.c (struct kernel_dispatch): Add a field to
hold a pointer to the allocation cache node this dispatch is
holding for kernel arguments, replacing kernarg_address.
(print_kernel_dispatch): Print the allocation pointer from that
node as kernargs address.
(struct agent_info): Add in an allocation cache field.
(alloc_kernargs_on_agent): New function. Pulls kernel arguments
