[PATCH] D52434: [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode achieve coalescing

2018-09-29 Thread Jonas Hahnfeld via Phabricator via cfe-commits
Hahnfeld added a comment.

In https://reviews.llvm.org/D52434#1249399, @gtbercea wrote:

> In https://reviews.llvm.org/D52434#1249186, @Hahnfeld wrote:
>
> > In https://reviews.llvm.org/D52434#1249102, @gtbercea wrote:
> >
> > > You report a slowdown which I am actually not able to reproduce. Do you
> > > use any additional clauses not present in your previous post?
> >
> >
> > No, only `dist_schedule(static)`, which is faster. Tested on a `Tesla P100`
> > with today's trunk version:
> >
> > | `#pragma omp target teams distribute parallel for` (new defaults)         | 190 - 250 GB/s |
> > | adding clauses for old defaults: `schedule(static) dist_schedule(static)` | 30 - 50 GB/s   |
> > | same directive with only `dist_schedule(static)` added (fewer registers)  | 320 - 400 GB/s |
>
>
> Which loop size are you using? What runtime does nvprof report for these
> kernels?


Sorry, forgot to mention: I'm using the original STREAM code with 80,000,000 
`double` elements in each vector.
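
For reference, a minimal sketch of the shape of the measured kernels (the array
size is the one above; the names follow the public STREAM sources, while the
map clauses and everything else are illustrative rather than the exact code I
ran):

  #define STREAM_ARRAY_SIZE 80000000L
  static double a[STREAM_ARRAY_SIZE], b[STREAM_ARRAY_SIZE], c[STREAM_ARRAY_SIZE];

  /* Triad kernel with the new defaults: no schedule/dist_schedule clauses. */
  void triad(double scalar) {
  #pragma omp target teams distribute parallel for map(tofrom: a) map(to: b, c)
    for (long j = 0; j < STREAM_ARRAY_SIZE; j++)
      a[j] = b[j] + scalar * c[j];
  }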

Output from `nvprof`:

             Type  Time(%)      Time  Calls       Avg       Min       Max  Name
   GPU activities:   70.05%  676.71ms      9  75.191ms  1.3760us  248.09ms  [CUDA memcpy DtoH]
                      7.67%  74.102ms     10  7.4102ms  7.3948ms  7.4220ms  __omp_offloading_34_b871a7d5_main_l307
                      7.63%  73.679ms     10  7.3679ms  7.3457ms  7.3811ms  __omp_offloading_34_b871a7d5_main_l301
                      6.78%  65.516ms     10  6.5516ms  6.5382ms  6.5763ms  __omp_offloading_34_b871a7d5_main_l295
                      6.77%  65.399ms     10  6.5399ms  6.5319ms  6.5495ms  __omp_offloading_34_b871a7d5_main_l289
                      0.68%  6.6106ms      1  6.6106ms  6.6106ms  6.6106ms  __omp_offloading_34_b871a7d5_main_l264
                      0.41%  3.9659ms      1  3.9659ms  3.9659ms  3.9659ms  __omp_offloading_34_b871a7d5_main_l245
                      0.00%  1.1200us      1  1.1200us  1.1200us  1.1200us  [CUDA memcpy HtoD]
        API calls:   51.12%  678.90ms      9  75.434ms  24.859us  248.70ms  cuMemcpyDtoH
                     22.40%  297.51ms     42  7.0835ms  4.0042ms  7.6802ms  cuCtxSynchronize
                     20.31%  269.72ms      1  269.72ms  269.72ms  269.72ms  cuCtxCreate
                      5.32%  70.631ms      1  70.631ms  70.631ms  70.631ms  cuCtxDestroy
                      0.46%  6.1607ms      1  6.1607ms  6.1607ms  6.1607ms  cuModuleLoadDataEx
                      0.28%  3.7628ms      1  3.7628ms  3.7628ms  3.7628ms  cuModuleUnload
                      0.10%  1.2977ms     42  30.898us  13.930us  60.092us  cuLaunchKernel
                      0.00%  56.142us     42  1.3360us     677ns  2.0930us  cuFuncGetAttribute
                      0.00%  43.957us     46     955ns     454ns  1.7670us  cuCtxSetCurrent
                      0.00%  15.179us      1  15.179us  15.179us  15.179us  cuMemcpyHtoD
                      0.00%  7.2780us     10     727ns     358ns  1.4760us  cuModuleGetGlobal
                      0.00%  6.9910us      2  3.4950us  2.2660us  4.7250us  cuDeviceGetPCIBusId
                      0.00%  5.7500us      6     958ns     333ns  3.5270us  cuModuleGetFunction
                      0.00%  3.7530us      9     417ns     184ns  1.0850us  cuDeviceGetAttribute
                      0.00%  2.6790us      3     893ns     370ns  1.9300us  cuDeviceGetCount
                      0.00%  2.0090us      3     669ns     484ns     767ns  cuDeviceGet

The memcpy comes from a `target update` to verify the results on the host. It's 
not included in the measurement itself, so STREAM only evaluates the kernel 
execution time:

  Function     Best Rate MB/s   Avg time   Min time   Max time
  Copy:             190819.6    0.006781   0.006708   0.006841
  Scale:            189065.7    0.006800   0.006770   0.006831
  Add:              253831.7    0.007616   0.007564   0.007646
  Triad:            253432.3    0.007668   0.007576   0.007737
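
As a quick cross-check of those rates against the kernel times (STREAM counts
two arrays for Copy/Scale and three for Add/Triad, 80,000,000 doubles each, so
this is just arithmetic on the numbers above):

  Copy : 2 * 80e6 * 8 B = 1.28 GB;  1.28 GB / 0.006708 s ~= 190.8 GB/s
  Triad: 3 * 80e6 * 8 B = 1.92 GB;  1.92 GB / 0.007576 s ~= 253.4 GB/s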


Repository:
  rC Clang

https://reviews.llvm.org/D52434





[PATCH] D52434: [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode achieve coalescing

2018-09-28 Thread Gheorghe-Teodor Bercea via Phabricator via cfe-commits
gtbercea added a comment.

In https://reviews.llvm.org/D52434#1249186, @Hahnfeld wrote:

> In https://reviews.llvm.org/D52434#1249102, @gtbercea wrote:
>
> > You report a slowdown which I am actually not able to reproduce. Do you
> > use any additional clauses not present in your previous post?
>
>
> No, only `dist_schedule(static)`, which is faster. Tested on a `Tesla P100`
> with today's trunk version:
>
> | `#pragma omp target teams distribute parallel for` (new defaults)         | 190 - 250 GB/s |
> | adding clauses for old defaults: `schedule(static) dist_schedule(static)` | 30 - 50 GB/s   |
> | same directive with only `dist_schedule(static)` added (fewer registers)  | 320 - 400 GB/s |


Which loop size are you using? What runtime does nvprof report for these
kernels?


Repository:
  rC Clang

https://reviews.llvm.org/D52434





[PATCH] D52434: [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode achieve coalescing

2018-09-28 Thread Jonas Hahnfeld via Phabricator via cfe-commits
Hahnfeld added a comment.

In https://reviews.llvm.org/D52434#1249102, @gtbercea wrote:

> You report a slowdown which I am actually not able to reproduce. Do you use
> any additional clauses not present in your previous post?


No, only `dist_schedule(static)`, which is faster. Tested on a `Tesla P100` with
today's trunk version (the directive variants are spelled out in the sketch
below the table):

| `#pragma omp target teams distribute parallel for` (new defaults)         | 190 - 250 GB/s |
| adding clauses for old defaults: `schedule(static) dist_schedule(static)` | 30 - 50 GB/s   |
| same directive with only `dist_schedule(static)` added (fewer registers)  | 320 - 400 GB/s |


Repository:
  rC Clang

https://reviews.llvm.org/D52434





[PATCH] D52434: [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode achieve coalescing

2018-09-28 Thread Gheorghe-Teodor Bercea via Phabricator via cfe-commits
gtbercea added a comment.

In https://reviews.llvm.org/D52434#1248975, @Hahnfeld wrote:

> In https://reviews.llvm.org/D52434#1248974, @gtbercea wrote:
>
> > One big problem your code has is that the trip count is incredibly small,
> > especially for STREAM and especially on GPUs. You need a much larger loop
> > size, otherwise the timings will be dominated by OpenMP setup costs.
>
>
> > Sure, I'm not that dumb. The real code has larger loops; this was just for
> > demonstration purposes. I don't expect the register count to change based on
> > loop size - is that too optimistic?


I checked the different combinations of schedules, and the current default is
the fastest of them. The old defaults are about 10x slower than the current set
of defaults (dist_schedule(static, ) and schedule(static, 1)). The register
allocation looks strange, but it's just a consequence of using different
schedules.
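
Written out explicitly, the current defaults correspond roughly to the
following sketch (NUM_THREADS stands in for the per-team thread count that the
runtime actually picks; the kernel and all names are made up for illustration):

  #define NUM_THREADS 128 /* assumption for illustration only */

  void axpy(double *y, const double *x, double s, long n) {
  #pragma omp target teams distribute parallel for \
      dist_schedule(static, NUM_THREADS) schedule(static, 1) \
      map(tofrom: y[0:n]) map(to: x[0:n])
    for (long i = 0; i < n; i++)
      y[i] += s * x[i];
  }

With this combination, consecutive threads touch consecutive elements in every
chunk, which is what gives the coalesced accesses this patch is after.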

You report a slowdown which I am actually not able to reproduce. Do you use
any additional clauses not present in your previous post?


Repository:
  rC Clang

https://reviews.llvm.org/D52434





[PATCH] D52434: [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode achieve coalescing

2018-09-28 Thread Gheorghe-Teodor Bercea via Phabricator via cfe-commits
gtbercea added a comment.

In https://reviews.llvm.org/D52434#1248975, @Hahnfeld wrote:

> In https://reviews.llvm.org/D52434#1248974, @gtbercea wrote:
>
> > One big problem your code has is that the trip count is incredibly small,
> > especially for STREAM and especially on GPUs. You need a much larger loop
> > size, otherwise the timings will be dominated by OpenMP setup costs.
>
>
> > Sure, I'm not that dumb. The real code has larger loops; this was just for
> > demonstration purposes. I don't expect the register count to change based on
> > loop size - is that too optimistic?


The register count will of course not change with loop size.


Repository:
  rC Clang

https://reviews.llvm.org/D52434





[PATCH] D52434: [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode achieve coalescing

2018-09-28 Thread Jonas Hahnfeld via Phabricator via cfe-commits
Hahnfeld added a comment.

In https://reviews.llvm.org/D52434#1248974, @gtbercea wrote:

> One big problem your code has is that the trip count is incredibly small,
> especially for STREAM and especially on GPUs. You need a much larger loop
> size, otherwise the timings will be dominated by OpenMP setup costs.


Sure, I'm not that dumb. The real code has larger loops; this was just for
demonstration purposes. I don't expect the register count to change based on
loop size - is that too optimistic?


Repository:
  rC Clang

https://reviews.llvm.org/D52434





[PATCH] D52434: [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode achieve coalescing

2018-09-28 Thread Gheorghe-Teodor Bercea via Phabricator via cfe-commits
gtbercea added a comment.

In https://reviews.llvm.org/D52434#1248844, @Hahnfeld wrote:

> Just tested this and got very weird results for register usage:
>
> >   void func(double *a) {
> >   #pragma omp target teams distribute parallel for map(a[0:100]) // dist_schedule(static)
> >     for (int i = 0; i < 100; i++) {
> >       a[i]++;
> >     }
> >   }
>
>
> Compiling with current trunk for `sm_60` (Pascal): 29 registers
>  Adding `dist_schedule(static)` (the previous default): 19 registers
>  For reference: `dist_schedule(static, 128)` also uses 29 registers
>
> Any ideas? This significantly slows down STREAM...


Jonas, without an explicit dist_schedule clause the program will run with 
schedule(static, ). It looks like that happens fine 
since you get the same register count in the explicit static chunk variant as 
in the default case.

The difference you see in register count is (I suspect) driven by the runtime
code (fewer registers for non-chunked than for chunked). I am currently
investigating this and trying to find ways to reduce this number.
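
To illustrate where the extra registers plausibly come from, here is a rough
per-thread view of the two loop shapes (a hand-written sketch, not actual
Clang codegen):

  static void body(double *a, long i) { a[i] += 1.0; }

  /* Non-chunked static: one contiguous block per thread, a single loop. */
  void per_thread_nonchunked(double *a, long lb, long ub) {
    for (long i = lb; i <= ub; ++i)
      body(a, i);
  }

  /* Chunked static (e.g. schedule(static, 1) across the team): an outer loop
     over chunks plus stride/chunk bookkeeping, i.e. more values kept live. */
  void per_thread_chunked(double *a, long lb, long global_ub, long chunk, long stride) {
    for (long chunk_lb = lb; chunk_lb <= global_ub; chunk_lb += stride)
      for (long i = chunk_lb; i < chunk_lb + chunk && i <= global_ub; ++i)
        body(a, i);
  }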

One big problem your code has is that the trip count is incredibly small,
especially for STREAM and especially on GPUs. You need a much larger loop size,
otherwise the timings will be dominated by OpenMP setup costs.


Repository:
  rC Clang

https://reviews.llvm.org/D52434





[PATCH] D52434: [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode achieve coalescing

2018-09-28 Thread Jonas Hahnfeld via Phabricator via cfe-commits
Hahnfeld added a comment.

Just tested this and got very weird results for register usage:

  void func(double *a) {
  #pragma omp target teams distribute parallel for map(a[0:100]) // dist_schedule(static)
    for (int i = 0; i < 100; i++) {
      a[i]++;
    }
  }

Compiling with current trunk for `sm_60` (Pascal): 29 registers
Adding `dist_schedule(static)` (the previous default): 19 registers
For reference: `dist_schedule(static, 128)` also uses 29 registers

Any ideas? This significantly slows down STREAM...


Repository:
  rC Clang

https://reviews.llvm.org/D52434





[PATCH] D52434: [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode achieve coalescing

2018-09-27 Thread Phabricator via Phabricator via cfe-commits
This revision was automatically updated to reflect the committed changes.
Closed by commit rC343253: [OpenMP] Make default distribute schedule for NVPTX
target regions in SPMD mode… (authored by gbercea, committed by ).

Changed prior to commit:
  https://reviews.llvm.org/D52434?vs=167326&id=167373#toc

Repository:
  rC Clang

https://reviews.llvm.org/D52434

Files:
  lib/CodeGen/CGOpenMPRuntime.h
  lib/CodeGen/CGOpenMPRuntimeNVPTX.cpp
  lib/CodeGen/CGOpenMPRuntimeNVPTX.h
  lib/CodeGen/CGStmtOpenMP.cpp
  test/OpenMP/nvptx_target_teams_distribute_parallel_for_codegen.cpp
  test/OpenMP/nvptx_target_teams_distribute_parallel_for_simd_codegen.cpp

Index: test/OpenMP/nvptx_target_teams_distribute_parallel_for_simd_codegen.cpp
===================================================================
--- test/OpenMP/nvptx_target_teams_distribute_parallel_for_simd_codegen.cpp
+++ test/OpenMP/nvptx_target_teams_distribute_parallel_for_simd_codegen.cpp
@@ -33,7 +33,7 @@
 l = i;
   }
 
-  #pragma omp target teams distribute parallel for simd map(tofrom: aa) num_teams(M) thread_limit(64)
+ #pragma omp target teams distribute parallel for simd map(tofrom: aa) num_teams(M) thread_limit(64)
   for(int i = 0; i < n; i++) {
 aa[i] += 1;
   }
@@ -82,7 +82,7 @@
 // CHECK-LABEL: define {{.*}}void {{@__omp_offloading_.+}}(
 // CHECK-DAG: [[THREAD_LIMIT:%.+]] = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
 // CHECK: call void @__kmpc_spmd_kernel_init(i32 [[THREAD_LIMIT]], i16 0, i16 0)
-// CHECK: call void @__kmpc_for_static_init_4({{.+}}, {{.+}}, {{.+}} 92,
+// CHECK: call void @__kmpc_for_static_init_4({{.+}}, {{.+}}, {{.+}} 91,
 // CHECK: {{call|invoke}} void [[OUTL2:@.+]](
 // CHECK: call void @__kmpc_for_static_fini(
 // CHECK: call void @__kmpc_spmd_kernel_deinit()
@@ -96,7 +96,7 @@
 // CHECK-LABEL: define {{.*}}void {{@__omp_offloading_.+}}(
 // CHECK-DAG: [[THREAD_LIMIT:%.+]] = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
 // CHECK: call void @__kmpc_spmd_kernel_init(i32 [[THREAD_LIMIT]], i16 0, i16 0)
-// CHECK: call void @__kmpc_for_static_init_4({{.+}}, {{.+}}, {{.+}} 92,
+// CHECK: call void @__kmpc_for_static_init_4({{.+}}, {{.+}}, {{.+}} 91,
 // CHECK: {{call|invoke}} void [[OUTL3:@.+]](
 // CHECK: call void @__kmpc_for_static_fini(
 // CHECK: call void @__kmpc_spmd_kernel_deinit()
@@ -112,7 +112,7 @@
 // CHECK-DAG: [[THREAD_LIMIT:%.+]] = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
 // CHECK: call void @__kmpc_spmd_kernel_init(i32 [[THREAD_LIMIT]], i16 0, i16 0)
 // CHECK: store {{.+}} 99, {{.+}}* [[COMB_UB:%.+]], align
-// CHECK: call void @__kmpc_for_static_init_4({{.+}}, {{.+}}, {{.+}} 92, {{.+}}, {{.+}}, {{.+}}* [[COMB_UB]],
+// CHECK: call void @__kmpc_for_static_init_4({{.+}}, {{.+}}, {{.+}} 91, {{.+}}, {{.+}}, {{.+}}* [[COMB_UB]],
 // CHECK: {{call|invoke}} void [[OUTL4:@.+]](
 // CHECK: call void @__kmpc_for_static_fini(
 // CHECK: call void @__kmpc_spmd_kernel_deinit()
Index: test/OpenMP/nvptx_target_teams_distribute_parallel_for_codegen.cpp
===================================================================
--- test/OpenMP/nvptx_target_teams_distribute_parallel_for_codegen.cpp
+++ test/OpenMP/nvptx_target_teams_distribute_parallel_for_codegen.cpp
@@ -35,7 +35,7 @@
 l = i;
   }
 
-  #pragma omp target teams distribute parallel for map(tofrom: aa) num_teams(M) thread_limit(64)
+#pragma omp target teams distribute parallel for map(tofrom: aa) num_teams(M) thread_limit(64)
   for(int i = 0; i < n; i++) {
 aa[i] += 1;
   }
@@ -87,7 +87,7 @@
 // CHECK-LABEL: define {{.*}}void {{@__omp_offloading_.+}}(
 // CHECK-DAG: [[THREAD_LIMIT:%.+]] = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
 // CHECK: call void @__kmpc_spmd_kernel_init(i32 [[THREAD_LIMIT]], i16 0, i16 0)
-// CHECK: call void @__kmpc_for_static_init_4({{.+}}, {{.+}}, {{.+}} 92,
+// CHECK: call void @__kmpc_for_static_init_4({{.+}}, {{.+}}, {{.+}} 91,
 // CHECK: {{call|invoke}} void [[OUTL2:@.+]](
 // CHECK: call void @__kmpc_for_static_fini(
 // CHECK: call void @__kmpc_spmd_kernel_deinit()
@@ -101,7 +101,7 @@
 // CHECK-LABEL: define {{.*}}void {{@__omp_offloading_.+}}(
 // CHECK-DAG: [[THREAD_LIMIT:%.+]] = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
 // CHECK: call void @__kmpc_spmd_kernel_init(i32 [[THREAD_LIMIT]], i16 0, i16 0)
-// CHECK: call void @__kmpc_for_static_init_4({{.+}}, {{.+}}, {{.+}} 92,
+// CHECK: call void @__kmpc_for_static_init_4({{.+}}, {{.+}}, {{.+}} 91,
 // CHECK: {{call|invoke}} void [[OUTL3:@.+]](
 // CHECK: call void @__kmpc_for_static_fini(
 // CHECK: call void @__kmpc_spmd_kernel_deinit()
@@ -117,7 +117,7 @@
 // CHECK-DAG: [[THREAD_LIMIT:%.+]] = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
 // CHECK: call void @__kmpc_spmd_kernel_init(i32 [[THREAD_LIMIT]], i16 0, i16 0)
 // CHECK: store {{.+}} 99, {{.+}}* [[COMB_UB:%.+]], align
-// CHECK: call void @__kmpc_for_static_init_4({{.+}}, {{.+}}, {{.+}} 92, {{.+}}, {{.+}}, {{.+}}* [[COMB_UB]],
+// CHECK: call void @__kmpc_for_static_init_4({{.+}}, {{.+}}, {{.+}} 91, {{.+}}, {{.+}},