[PATCH] D153883: [Clang][OpenMP] Enable use of __kmpc_alloc_shared for VLAs defined in AMD GPU offloaded regions

2023-06-28 Thread Alexey Bataev via Phabricator via cfe-commits
ABataev added a comment.

In D153883#4456342, @tianshilei1992 wrote:

> I think it's better to just limit it to AMDGPU for now.

I rather doubt this is a good decision; better to support it for all targets.
NVPTX supported (IIRC) static allocation and internal management of the shared
memory (not sure this is true for the new library). If not, then we need at
least to diagnose that this feature is not supported.

> BTW, it might be worth checking if heap-to-stack will push it back to the stack.




Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D153883/new/

https://reviews.llvm.org/D153883

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits


[PATCH] D153883: [Clang][OpenMP] Enable use of __kmpc_alloc_shared for VLAs defined in AMD GPU offloaded regions

2023-06-28 Thread Matt Arsenault via Phabricator via cfe-commits
arsenm added a comment.

In D153883#4456342, @tianshilei1992 wrote:

> I think it's better to just limit it to AMDGPU for now.
> BTW, it might be worth checking if heap-to-stack will push it back to the stack.

If you're really going to go for backend workarounds, it should be
special-casing the known-broken targets, with a FIXME explaining why, not a
positive check for where it's enabled.




[PATCH] D153883: [Clang][OpenMP] Enable use of __kmpc_alloc_shared for VLAs defined in AMD GPU offloaded regions

2023-06-28 Thread Shilei Tian via Phabricator via cfe-commits
tianshilei1992 added a comment.

I think it's better to just limit it to AMDGPU for now.
BTW, it might be worth checking if heap-to-stack will push it back to the stack.




[PATCH] D153883: [Clang][OpenMP] Enable use of __kmpc_alloc_shared for VLAs defined in AMD GPU offloaded regions

2023-06-28 Thread Matt Arsenault via Phabricator via cfe-commits
arsenm added inline comments.



Comment at: clang/lib/CodeGen/CGDecl.cpp:1603
+// deallocation call of __kmpc_free_shared() is emitted later.
+if (getLangOpts().OpenMP && getTarget().getTriple().isAMDGCN()) {
+  // Emit call to __kmpc_alloc_shared() instead of the alloca.

doru1004 wrote:
> arsenm wrote:
> > ABataev wrote:
> > > doru1004 wrote:
> > > > jhuber6 wrote:
> > > > > ABataev wrote:
> > > > > > OpenMPIsDevice?
> > > > > Does NVPTX handle this already? If not, is there a compelling reason 
> > > > > to exclude NVPTX? Otherwise we should check if we are the OpenMP 
> > > > > device.
> > > > Does NVPTX support dynamic allocas?
> > > It does not matter here; it depends on the runtime library
> > > implementation. The compiler just has to emit the proper runtime calls;
> > > everything else is part of the runtime support.
> > I think I heard recent PTX introduced new instructions for it. AMDGPU
> > codegen just happens to be broken because we don't properly restore the
> > stack afterwards. When I added the support we had no way of testing (and
> > still don't, really; __builtin_alloca doesn't handle a non-0 stack address
> > space correctly).
> If NVPTX supports that, then there is no reason to have NVPTX avoid emitting
> allocas (i.e., the condition stays as it is right now), but I am willing to
> reach a consensus, so please let me know what you would all prefer.
Frontends seem to have a tradition of working around missing features in
codegen; I think you should just pass through the correct IR and leave the
backend bugs to the backends.




[PATCH] D153883: [Clang][OpenMP] Enable use of __kmpc_alloc_shared for VLAs defined in AMD GPU offloaded regions

2023-06-28 Thread Gheorghe-Teodor Bercea via Phabricator via cfe-commits
doru1004 added inline comments.



Comment at: clang/lib/CodeGen/CGDecl.cpp:1603
+// deallocation call of __kmpc_free_shared() is emitted later.
+if (getLangOpts().OpenMP && getTarget().getTriple().isAMDGCN()) {
+  // Emit call to __kmpc_alloc_shared() instead of the alloca.

arsenm wrote:
> ABataev wrote:
> > doru1004 wrote:
> > > jhuber6 wrote:
> > > > ABataev wrote:
> > > > > OpenMPIsDevice?
> > > > Does NVPTX handle this already? If not, is there a compelling reason to 
> > > > exclude NVPTX? Otherwise we should check if we are the OpenMP device.
> > > Does NVPTX support dynamic allocas?
> > It does not matter here; it depends on the runtime library implementation.
> > The compiler just has to emit the proper runtime calls; everything else is
> > part of the runtime support.
> I think I heard recent PTX introduced new instructions for it. AMDGPU
> codegen just happens to be broken because we don't properly restore the
> stack afterwards. When I added the support we had no way of testing (and
> still don't, really; __builtin_alloca doesn't handle a non-0 stack address
> space correctly).
If NVPTX supports that, then there is no reason to have NVPTX avoid emitting
allocas (i.e., the condition stays as it is right now), but I am willing to
reach a consensus, so please let me know what you would all prefer.




[PATCH] D153883: [Clang][OpenMP] Enable use of __kmpc_alloc_shared for VLAs defined in AMD GPU offloaded regions

2023-06-28 Thread Matt Arsenault via Phabricator via cfe-commits
arsenm added inline comments.



Comment at: clang/lib/CodeGen/CGDecl.cpp:1603
+// deallocation call of __kmpc_free_shared() is emitted later.
+if (getLangOpts().OpenMP && getTarget().getTriple().isAMDGCN()) {
+  // Emit call to __kmpc_alloc_shared() instead of the alloca.

ABataev wrote:
> doru1004 wrote:
> > jhuber6 wrote:
> > > ABataev wrote:
> > > > OpenMPIsDevice?
> > > Does NVPTX handle this already? If not, is there a compelling reason to 
> > > exclude NVPTX? Otherwise we should check if we are the OpenMP device.
> > Does NVPTX support dynamic allocas?
> It does not matter here; it depends on the runtime library implementation.
> The compiler just has to emit the proper runtime calls; everything else is
> part of the runtime support.
I think I heard recent PTX introduced new instructions for it. AMDGPU codegen
just happens to be broken because we don't properly restore the stack
afterwards. When I added the support we had no way of testing (and still
don't, really; __builtin_alloca doesn't handle a non-0 stack address space
correctly).




[PATCH] D153883: [Clang][OpenMP] Enable use of __kmpc_alloc_shared for VLAs defined in AMD GPU offloaded regions

2023-06-27 Thread Alexey Bataev via Phabricator via cfe-commits
ABataev added inline comments.



Comment at: clang/lib/CodeGen/CGDecl.cpp:1603
+// deallocation call of __kmpc_free_shared() is emitted later.
+if (getLangOpts().OpenMP && getTarget().getTriple().isAMDGCN()) {
+  // Emit call to __kmpc_alloc_shared() instead of the alloca.

doru1004 wrote:
> jhuber6 wrote:
> > ABataev wrote:
> > > OpenMPIsDevice?
> > Does NVPTX handle this already? If not, is there a compelling reason to 
> > exclude NVPTX? Otherwise we should check if we are the OpenMP device.
> Does NVPTX support dynamic allocas?
It does not matter here; it depends on the runtime library implementation. The
compiler just has to emit the proper runtime calls; everything else is part of
the runtime support.




[PATCH] D153883: [Clang][OpenMP] Enable use of __kmpc_alloc_shared for VLAs defined in AMD GPU offloaded regions

2023-06-27 Thread Gheorghe-Teodor Bercea via Phabricator via cfe-commits
doru1004 added inline comments.



Comment at: clang/lib/CodeGen/CGDecl.cpp:1603
+// deallocation call of __kmpc_free_shared() is emitted later.
+if (getLangOpts().OpenMP && getTarget().getTriple().isAMDGCN()) {
+  // Emit call to __kmpc_alloc_shared() instead of the alloca.

jhuber6 wrote:
> ABataev wrote:
> > OpenMPIsDevice?
> Does NVPTX handle this already? If not, is there a compelling reason to 
> exclude NVPTX? Otherwise we should check if we are the OpenMP device.
Does NVPTX support dynamic allocas?




[PATCH] D153883: [Clang][OpenMP] Enable use of __kmpc_alloc_shared for VLAs defined in AMD GPU offloaded regions

2023-06-27 Thread Gheorghe-Teodor Bercea via Phabricator via cfe-commits
doru1004 updated this revision to Diff 535186.
doru1004 marked 3 inline comments as done.


Files:
  clang/lib/CodeGen/CGDecl.cpp
  clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
  clang/lib/CodeGen/CGOpenMPRuntimeGPU.h
  clang/lib/CodeGen/CodeGenFunction.cpp
  clang/lib/CodeGen/CodeGenFunction.h
  clang/test/OpenMP/amdgcn_target_device_vla.cpp

Index: clang/test/OpenMP/amdgcn_target_device_vla.cpp
===
--- /dev/null
+++ clang/test/OpenMP/amdgcn_target_device_vla.cpp
@@ -0,0 +1,1258 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --function-signature --include-generated-funcs --replace-value-regex "__omp_offloading_[0-9a-z]+_[0-9a-z]+" "reduction_size[.].+[.]" "pl_cond[.].+[.|,]" --prefix-filecheck-ir-name _
+// REQUIRES: amdgpu-registered-target
+
+// RUN: %clang_cc1 -fopenmp -x c++ -std=c++11 -triple x86_64-unknown-unknown -fopenmp-targets=amdgcn-amd-amdhsa -emit-llvm-bc %s -o %t-ppc-host.bc
+// RUN: %clang_cc1 -fopenmp -x c++ -std=c++11 -triple amdgcn-amd-amdhsa -fopenmp-targets=amdgcn-amd-amdhsa -emit-llvm %s -fopenmp-is-device -fopenmp-host-ir-file-path %t-ppc-host.bc -o - | FileCheck %s
+// expected-no-diagnostics
+#ifndef HEADER
+#define HEADER
+
+int foo1() {
+  int sum = 0.0;
+  #pragma omp target map(tofrom: sum)
+  {
+int N = 10;
+int A[N];
+
+for (int i = 0; i < N; i++)
+  A[i] = i;
+
+for (int i = 0; i < N; i++)
+  sum += A[i];
+  }
+  return sum;
+}
+
+int foo2() {
+  int sum = 0.0;
+  int M = 12;
+  int result[M];
+  #pragma omp target teams distribute parallel for map(from: result[:M])
+  for (int i = 0; i < M; i++) {
+int N = 10;
+int A[N];
+result[i] = i;
+
+for (int j = 0; j < N; j++)
+  A[j] = j;
+
+for (int j = 0; j < N; j++)
+  result[i] += A[j];
+  }
+
+  for (int i = 0; i < M; i++)
+sum += result[i];
+  return sum;
+}
+
+int foo3() {
+  int sum = 0.0;
+  int M = 12;
+  int result[M];
+  #pragma omp target teams distribute map(from: result[:M])
+  for (int i = 0; i < M; i++) {
+int N = 10;
+int A[N];
+result[i] = i;
+
+#pragma omp parallel for
+for (int j = 0; j < N; j++)
+  A[j] = j;
+
+for (int j = 0; j < N; j++)
+  result[i] += A[j];
+  }
+
+  for (int i = 0; i < M; i++)
+sum += result[i];
+  return sum;
+}
+
+int foo4() {
+  int sum = 0.0;
+  int M = 12;
+  int result[M];
+  int N = 10;
+  #pragma omp target teams distribute map(from: result[:M])
+  for (int i = 0; i < M; i++) {
+int A[N];
+result[i] = i;
+
+#pragma omp parallel for
+for (int j = 0; j < N; j++)
+  A[j] = j;
+
+for (int j = 0; j < N; j++)
+  result[i] += A[j];
+  }
+
+  for (int i = 0; i < M; i++)
+sum += result[i];
+  return sum;
+}
+
+int main() {
+  return foo1() + foo2() + foo3() + foo4();
+}
+
+#endif
+// CHECK-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo1v_l12
+// CHECK-SAME: (ptr noundef nonnull align 4 dereferenceable(4) [[SUM:%.*]]) #[[ATTR0:[0-9]+]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:[[SUM_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
+// CHECK-NEXT:[[N:%.*]] = alloca i32, align 4, addrspace(5)
+// CHECK-NEXT:[[__VLA_EXPR0:%.*]] = alloca i64, align 8, addrspace(5)
+// CHECK-NEXT:[[I:%.*]] = alloca i32, align 4, addrspace(5)
+// CHECK-NEXT:[[I1:%.*]] = alloca i32, align 4, addrspace(5)
+// CHECK-NEXT:[[SUM_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[SUM_ADDR]] to ptr
+// CHECK-NEXT:[[N_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[N]] to ptr
+// CHECK-NEXT:[[__VLA_EXPR0_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[__VLA_EXPR0]] to ptr
+// CHECK-NEXT:[[I_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[I]] to ptr
+// CHECK-NEXT:[[I1_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[I1]] to ptr
+// CHECK-NEXT:store ptr [[SUM]], ptr [[SUM_ADDR_ASCAST]], align 8
+// CHECK-NEXT:[[TMP0:%.*]] = load ptr, ptr [[SUM_ADDR_ASCAST]], align 8
+// CHECK-NEXT:[[TMP1:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @[[GLOB1:[0-9]+]] to ptr), i8 1, i1 true)
+// CHECK-NEXT:[[EXEC_USER_CODE:%.*]] = icmp eq i32 [[TMP1]], -1
+// CHECK-NEXT:br i1 [[EXEC_USER_CODE]], label [[USER_CODE_ENTRY:%.*]], label [[WORKER_EXIT:%.*]]
+// CHECK:   user_code.entry:
+// CHECK-NEXT:store i32 10, ptr [[N_ASCAST]], align 4
+// CHECK-NEXT:[[TMP2:%.*]] = load i32, ptr [[N_ASCAST]], align 4
+// CHECK-NEXT:[[TMP3:%.*]] = zext i32 [[TMP2]] to i64
+// CHECK-NEXT:[[TMP4:%.*]] = mul nuw i64 [[TMP3]], 4
+// CHECK-NEXT:[[TMP5:%.*]] = add nuw i64 [[TMP4]], 3
+// CHECK-NEXT:[[TMP6:%.*]] = udiv i64 [[TMP5]], 4
+// CHECK-NEXT:[[TMP7:%.*]] = mul nuw i64 [[TMP6]], 4
+// CHECK-NEXT:[[A:%.*]] = call align 4 ptr @__kmpc_alloc_shared(i64 

[PATCH] D153883: [Clang][OpenMP] Enable use of __kmpc_alloc_shared for VLAs defined in AMD GPU offloaded regions

2023-06-27 Thread Gheorghe-Teodor Bercea via Phabricator via cfe-commits
doru1004 added inline comments.



Comment at: clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp:1085
   }
-  for (const auto *VD : I->getSecond().EscapedVariableLengthDecls) {
-// Use actual memory size of the VLA object including the padding

doru1004 wrote:
> ABataev wrote:
> > jhuber6 wrote:
> > > doru1004 wrote:
> > > > ABataev wrote:
> > > > > Why this code is removed?
> > > > I could not understand why this code is here in the first place since 
> > > > it doesn't seem that it could ever work correctly (and it doesn't seem 
> > > > to be covered by any existing tests). Maybe I'm wrong but that was my 
> > > > understanding of it. So what seems to happen is that this code attempts 
> > > > to emit a kmpc_alloc_shared before the actual size calculation is 
> > > > emitted. So if the VLA size is something that the user defines such as 
> > > > `int N = 10;` then that code will not have been emitted at this point. 
> > > > When the expression computing the size of the VLA uses `N`, the code 
> > > > that is deleted here would just fail to find the VLA size in the 
> > > > attempt to emit the kmpc_alloc_shared. The emission of the VLA as 
> > > > kmpc_alloc_shared needs to happen after the expression of the size is 
> > > > emitted.
> > > I'm pretty sure I was the one that wrote this code, and at the time I 
> > > don't recall it really working. I remember there was something else that 
> > > expected this to be here, but for what utility I do not recall. VLAs were 
> > > never tested or used.
> > They are tested; check
> > test/OpenMP/nvptx_target_teams_distribute_parallel_for_codegen.cpp, for
> > example, where a VLA is captured implicitly. I assume this should not be
> > AMDGCN specific.
> Oh, I see, so this code path would cover the case where the VLA is defined
> outside the target region? I'm surprised I haven't seen any lit test
> failures for AMD GPUs; maybe this kind of test only exists for NVPTX. I'll
> add a test for AMD GPUs in that case.
Edit: the VLA is defined outside the target region => the VLA //size// is 
defined outside the target region




[PATCH] D153883: [Clang][OpenMP] Enable use of __kmpc_alloc_shared for VLAs defined in AMD GPU offloaded regions

2023-06-27 Thread Gheorghe-Teodor Bercea via Phabricator via cfe-commits
doru1004 added inline comments.



Comment at: clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp:1085
   }
-  for (const auto *VD : I->getSecond().EscapedVariableLengthDecls) {
-// Use actual memory size of the VLA object including the padding

ABataev wrote:
> jhuber6 wrote:
> > doru1004 wrote:
> > > ABataev wrote:
> > > > Why this code is removed?
> > > I could not understand why this code is here in the first place since it 
> > > doesn't seem that it could ever work correctly (and it doesn't seem to be 
> > > covered by any existing tests). Maybe I'm wrong but that was my 
> > > understanding of it. So what seems to happen is that this code attempts 
> > > to emit a kmpc_alloc_shared before the actual size calculation is 
> > > emitted. So if the VLA size is something that the user defines such as 
> > > `int N = 10;` then that code will not have been emitted at this point. 
> > > When the expression computing the size of the VLA uses `N`, the code that 
> > > is deleted here would just fail to find the VLA size in the attempt to 
> > > emit the kmpc_alloc_shared. The emission of the VLA as kmpc_alloc_shared 
> > > needs to happen after the expression of the size is emitted.
> > I'm pretty sure I was the one that wrote this code, and at the time I don't 
> > recall it really working. I remember there was something else that expected 
> > this to be here, but for what utility I do not recall. VLAs were never 
> > tested or used.
> They are tested; check
> test/OpenMP/nvptx_target_teams_distribute_parallel_for_codegen.cpp, for
> example, where a VLA is captured implicitly. I assume this should not be
> AMDGCN specific.
Oh, I see, so this code path would cover the case where the VLA is defined
outside the target region? I'm surprised I haven't seen any lit test failures
for AMD GPUs; maybe this kind of test only exists for NVPTX. I'll add a test
for AMD GPUs in that case.




[PATCH] D153883: [Clang][OpenMP] Enable use of __kmpc_alloc_shared for VLAs defined in AMD GPU offloaded regions

2023-06-27 Thread Alexey Bataev via Phabricator via cfe-commits
ABataev added inline comments.



Comment at: clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp:1085
   }
-  for (const auto *VD : I->getSecond().EscapedVariableLengthDecls) {
-// Use actual memory size of the VLA object including the padding

jhuber6 wrote:
> doru1004 wrote:
> > ABataev wrote:
> > > Why this code is removed?
> > I could not understand why this code is here in the first place since it 
> > doesn't seem that it could ever work correctly (and it doesn't seem to be 
> > covered by any existing tests). Maybe I'm wrong but that was my 
> > understanding of it. So what seems to happen is that this code attempts to 
> > emit a kmpc_alloc_shared before the actual size calculation is emitted. So 
> > if the VLA size is something that the user defines such as `int N = 10;` 
> > then that code will not have been emitted at this point. When the 
> > expression computing the size of the VLA uses `N`, the code that is deleted 
> > here would just fail to find the VLA size in the attempt to emit the 
> > kmpc_alloc_shared. The emission of the VLA as kmpc_alloc_shared needs to 
> > happen after the expression of the size is emitted.
> I'm pretty sure I was the one that wrote this code, and at the time I don't 
> recall it really working. I remember there was something else that expected 
> this to be here, but for what utility I do not recall. VLAs were never tested 
> or used.
They are tested; check
test/OpenMP/nvptx_target_teams_distribute_parallel_for_codegen.cpp, for
example, where a VLA is captured implicitly. I assume this should not be
AMDGCN specific.




[PATCH] D153883: [Clang][OpenMP] Enable use of __kmpc_alloc_shared for VLAs defined in AMD GPU offloaded regions

2023-06-27 Thread Joseph Huber via Phabricator via cfe-commits
jhuber6 added inline comments.



Comment at: clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp:1085
   }
-  for (const auto *VD : I->getSecond().EscapedVariableLengthDecls) {
-// Use actual memory size of the VLA object including the padding

doru1004 wrote:
> ABataev wrote:
> > Why this code is removed?
> I could not understand why this code is here in the first place since it 
> doesn't seem that it could ever work correctly (and it doesn't seem to be 
> covered by any existing tests). Maybe I'm wrong but that was my understanding 
> of it. So what seems to happen is that this code attempts to emit a 
> kmpc_alloc_shared before the actual size calculation is emitted. So if the 
> VLA size is something that the user defines such as `int N = 10;` then that 
> code will not have been emitted at this point. When the expression computing 
> the size of the VLA uses `N`, the code that is deleted here would just fail 
> to find the VLA size in the attempt to emit the kmpc_alloc_shared. The 
> emission of the VLA as kmpc_alloc_shared needs to happen after the expression 
> of the size is emitted.
I'm pretty sure I was the one that wrote this code, and at the time I don't 
recall it really working. I remember there was something else that expected 
this to be here, but for what utility I do not recall. VLAs were never tested 
or used.




[PATCH] D153883: [Clang][OpenMP] Enable use of __kmpc_alloc_shared for VLAs defined in AMD GPU offloaded regions

2023-06-27 Thread Gheorghe-Teodor Bercea via Phabricator via cfe-commits
doru1004 added inline comments.



Comment at: clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp:1085
   }
-  for (const auto *VD : I->getSecond().EscapedVariableLengthDecls) {
-// Use actual memory size of the VLA object including the padding

ABataev wrote:
> Why this code is removed?
I could not understand why this code is here in the first place since it 
doesn't seem that it could ever work correctly (and it doesn't seem to be 
covered by any existing tests). Maybe I'm wrong but that was my understanding 
of it. So what seems to happen is that this code attempts to emit a 
kmpc_alloc_shared before the actual size calculation is emitted. So if the VLA 
size is something that the user defines such as `int N = 10;` then that code 
will not have been emitted at this point. When the expression computing the 
size of the VLA uses `N`, the code that is deleted here would just fail to find 
the VLA size in the attempt to emit the kmpc_alloc_shared. The emission of the 
VLA as kmpc_alloc_shared needs to happen after the expression of the size is 
emitted.




[PATCH] D153883: [Clang][OpenMP] Enable use of __kmpc_alloc_shared for VLAs defined in AMD GPU offloaded regions

2023-06-27 Thread Alexey Bataev via Phabricator via cfe-commits
ABataev added a comment.

Add the runtime test?




Comment at: clang/lib/CodeGen/CGDecl.cpp:587
+std::pair<llvm::Value *, llvm::Value *> AddrSizePair;
+KmpcAllocFree(std::pair<llvm::Value *, llvm::Value *> AddrSizePair)
+: AddrSizePair(AddrSizePair) {}

Better to pass it as const reference



Comment at: clang/lib/CodeGen/CGDecl.cpp:589
+: AddrSizePair(AddrSizePair) {}
+void Emit(CodeGenFunction &CGF, Flags flags) override {
+  CGOpenMPRuntimeGPU &RT =

Wrong param name; use CamelCase.



Comment at: clang/lib/CodeGen/CGDecl.cpp:1603
+// deallocation call of __kmpc_free_shared() is emitted later.
+if (getLangOpts().OpenMP && getTarget().getTriple().isAMDGCN()) {
+  // Emit call to __kmpc_alloc_shared() instead of the alloca.

OpenMPIsDevice?



Comment at: clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp:1085
   }
-  for (const auto *VD : I->getSecond().EscapedVariableLengthDecls) {
-// Use actual memory size of the VLA object including the padding

Why this code is removed?



Comment at: clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp:1112
+
+  return std::pair<llvm::Value *, llvm::Value *>({VoidPtr, Size});
+}

Use `std::make_pair(VoidPtr, Size)`.




[PATCH] D153883: [Clang][OpenMP] Enable use of __kmpc_alloc_shared for VLAs defined in AMD GPU offloaded regions

2023-06-27 Thread Joseph Huber via Phabricator via cfe-commits
jhuber6 added a comment.

So this is implementing `stacksave` using `__kmpc_alloc_shared` instead? It
makes sense, since the OpenMP standard expects sharing of the stack. I wonder
how this interfaces with `-fopenmp-cuda-mode`.




Comment at: clang/lib/CodeGen/CGDecl.cpp:1603
+// deallocation call of __kmpc_free_shared() is emitted later.
+if (getLangOpts().OpenMP && getTarget().getTriple().isAMDGCN()) {
+  // Emit call to __kmpc_alloc_shared() instead of the alloca.

Does NVPTX handle this already? If not, is there a compelling reason to exclude 
NVPTX? Otherwise we should check if we are the OpenMP device.




[PATCH] D153883: [Clang][OpenMP] Enable use of __kmpc_alloc_shared for VLAs defined in AMD GPU offloaded regions

2023-06-27 Thread Gheorghe-Teodor Bercea via Phabricator via cfe-commits
doru1004 created this revision.
doru1004 added reviewers: ronlieb, gregrodgers, carlo.bertolli, arsenm, 
jdoerfert, dhruvachak, ABataev.
doru1004 added a project: OpenMP.
Herald added subscribers: sunshaoce, guansong, yaxunl, jvesely.
Herald added a project: All.
doru1004 requested review of this revision.
Herald added subscribers: cfe-commits, jplehr, sstefan1, wdng.
Herald added a project: clang.

This patch enables the use of `__kmpc_alloc_shared` for dynamically sized
allocations on AMD GPUs. For example:

  #pragma omp target
  {
int N = 10;
double A[N];
...
  }

This will generate a pair of `__kmpc_alloc_shared / __kmpc_free_shared` to 
handle the allocation and deallocation of `A` inside the target region.



Files:
  clang/lib/CodeGen/CGDecl.cpp
  clang/lib/CodeGen/CGOpenMPRuntimeGPU.cpp
  clang/lib/CodeGen/CGOpenMPRuntimeGPU.h
  clang/lib/CodeGen/CodeGenFunction.h
  clang/test/OpenMP/amdgcn_target_device_vla.cpp

Index: clang/test/OpenMP/amdgcn_target_device_vla.cpp
===
--- /dev/null
+++ clang/test/OpenMP/amdgcn_target_device_vla.cpp
@@ -0,0 +1,869 @@
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py UTC_ARGS: --function-signature --include-generated-funcs --replace-value-regex "__omp_offloading_[0-9a-z]+_[0-9a-z]+" "reduction_size[.].+[.]" "pl_cond[.].+[.|,]" --prefix-filecheck-ir-name _
+// REQUIRES: amdgpu-registered-target
+
+// RUN: %clang_cc1 -fopenmp -x c++ -std=c++11 -triple x86_64-unknown-unknown -fopenmp-targets=amdgcn-amd-amdhsa -emit-llvm-bc %s -o %t-ppc-host.bc
+// RUN: %clang_cc1 -fopenmp -x c++ -std=c++11 -triple amdgcn-amd-amdhsa -fopenmp-targets=amdgcn-amd-amdhsa -emit-llvm %s -fopenmp-is-device -fopenmp-host-ir-file-path %t-ppc-host.bc -o - | FileCheck %s
+// expected-no-diagnostics
+#ifndef HEADER
+#define HEADER
+
+int foo1() {
+  int sum = 0.0;
+  #pragma omp target map(tofrom: sum)
+  {
+int N = 10;
+int A[N];
+
+for (int i = 0; i < N; i++)
+  A[i] = i;
+
+for (int i = 0; i < N; i++)
+  sum += A[i];
+  }
+  return sum;
+}
+
+int foo2() {
+  int sum = 0.0;
+  int M = 12;
+  int result[M];
+  #pragma omp target teams distribute parallel for map(from: result[:M])
+  for (int i = 0; i < M; i++) {
+int N = 10;
+int A[N];
+result[i] = i;
+
+for (int j = 0; j < N; j++)
+  A[j] = j;
+
+for (int j = 0; j < N; j++)
+  result[i] += A[j];
+  }
+
+  for (int i = 0; i < M; i++)
+sum += result[i];
+  return sum;
+}
+
+int foo3() {
+  int sum = 0.0;
+  int M = 12;
+  int result[M];
+  #pragma omp target teams distribute map(from: result[:M])
+  for (int i = 0; i < M; i++) {
+int N = 10;
+int A[N];
+result[i] = i;
+
+#pragma omp parallel for
+for (int j = 0; j < N; j++)
+  A[j] = j;
+
+for (int j = 0; j < N; j++)
+  result[i] += A[j];
+  }
+
+  for (int i = 0; i < M; i++)
+sum += result[i];
+  return sum;
+}
+
+int main() {
+  return foo1() + foo2() + foo3();
+}
+
+#endif
+// CHECK-LABEL: define {{[^@]+}}@{{__omp_offloading_[0-9a-z]+_[0-9a-z]+}}__Z4foo1v_l12
+// CHECK-SAME: (ptr noundef nonnull align 4 dereferenceable(4) [[SUM:%.*]]) #[[ATTR0:[0-9]+]] {
+// CHECK-NEXT:  entry:
+// CHECK-NEXT:[[SUM_ADDR:%.*]] = alloca ptr, align 8, addrspace(5)
+// CHECK-NEXT:[[N:%.*]] = alloca i32, align 4, addrspace(5)
+// CHECK-NEXT:[[__VLA_EXPR0:%.*]] = alloca i64, align 8, addrspace(5)
+// CHECK-NEXT:[[I:%.*]] = alloca i32, align 4, addrspace(5)
+// CHECK-NEXT:[[I1:%.*]] = alloca i32, align 4, addrspace(5)
+// CHECK-NEXT:[[SUM_ADDR_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[SUM_ADDR]] to ptr
+// CHECK-NEXT:[[N_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[N]] to ptr
+// CHECK-NEXT:[[__VLA_EXPR0_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[__VLA_EXPR0]] to ptr
+// CHECK-NEXT:[[I_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[I]] to ptr
+// CHECK-NEXT:[[I1_ASCAST:%.*]] = addrspacecast ptr addrspace(5) [[I1]] to ptr
+// CHECK-NEXT:store ptr [[SUM]], ptr [[SUM_ADDR_ASCAST]], align 8
+// CHECK-NEXT:[[TMP0:%.*]] = load ptr, ptr [[SUM_ADDR_ASCAST]], align 8
+// CHECK-NEXT:[[TMP1:%.*]] = call i32 @__kmpc_target_init(ptr addrspacecast (ptr addrspace(1) @[[GLOB1:[0-9]+]] to ptr), i8 1, i1 true)
+// CHECK-NEXT:[[EXEC_USER_CODE:%.*]] = icmp eq i32 [[TMP1]], -1
+// CHECK-NEXT:br i1 [[EXEC_USER_CODE]], label [[USER_CODE_ENTRY:%.*]], label [[WORKER_EXIT:%.*]]
+// CHECK:   user_code.entry:
+// CHECK-NEXT:store i32 10, ptr [[N_ASCAST]], align 4
+// CHECK-NEXT:[[TMP2:%.*]] = load i32, ptr [[N_ASCAST]], align 4
+// CHECK-NEXT:[[TMP3:%.*]] = zext i32 [[TMP2]] to i64
+// CHECK-NEXT:[[TMP4:%.*]] = mul nuw i64 [[TMP3]], 4
+// CHECK-NEXT:[[TMP5:%.*]] = add nuw i64 [[TMP4]], 3
+// CHECK-NEXT:[[TMP6:%.*]] = udiv i64 [[TMP5]], 4
+// CHECK-NEXT:[[TMP7:%.*]]