[PATCH] D52434: [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode achieve coalescing
Hahnfeld added a comment.

In https://reviews.llvm.org/D52434#1249399, @gtbercea wrote:

> In https://reviews.llvm.org/D52434#1249186, @Hahnfeld wrote:
>
> > In https://reviews.llvm.org/D52434#1249102, @gtbercea wrote:
> >
> > > You report a slowdown which I am not able to reproduce, actually. Do you
> > > use any additional clauses not present in your previous post?
> >
> > No, only `dist_schedule(static)`, which is faster. Tested on a `Tesla P100`
> > with today's trunk version:
> >
> > | `#pragma omp target teams distribute parallel for` (new defaults)         | 190 - 250 GB/s |
> > | adding clauses for old defaults: `schedule(static) dist_schedule(static)` | 30 - 50 GB/s   |
> > | same directive with only `dist_schedule(static)` added (fewer registers)  | 320 - 400 GB/s |
>
> Which loop size are you using? What runtime does nvprof report for these
> kernels?

Sorry, forgot to mention: I'm using the original STREAM code with 80,000,000
`double` elements in each vector. Output from `nvprof`:

              Type  Time(%)      Time  Calls       Avg       Min       Max  Name
   GPU activities:   70.05%  676.71ms      9  75.191ms  1.3760us  248.09ms  [CUDA memcpy DtoH]
                      7.67%  74.102ms     10  7.4102ms  7.3948ms  7.4220ms  __omp_offloading_34_b871a7d5_main_l307
                      7.63%  73.679ms     10  7.3679ms  7.3457ms  7.3811ms  __omp_offloading_34_b871a7d5_main_l301
                      6.78%  65.516ms     10  6.5516ms  6.5382ms  6.5763ms  __omp_offloading_34_b871a7d5_main_l295
                      6.77%  65.399ms     10  6.5399ms  6.5319ms  6.5495ms  __omp_offloading_34_b871a7d5_main_l289
                      0.68%  6.6106ms      1  6.6106ms  6.6106ms  6.6106ms  __omp_offloading_34_b871a7d5_main_l264
                      0.41%  3.9659ms      1  3.9659ms  3.9659ms  3.9659ms  __omp_offloading_34_b871a7d5_main_l245
                      0.00%  1.1200us      1  1.1200us  1.1200us  1.1200us  [CUDA memcpy HtoD]
         API calls:   51.12%  678.90ms      9  75.434ms  24.859us  248.70ms  cuMemcpyDtoH
                     22.40%  297.51ms     42  7.0835ms  4.0042ms  7.6802ms  cuCtxSynchronize
                     20.31%  269.72ms      1  269.72ms  269.72ms  269.72ms  cuCtxCreate
                      5.32%  70.631ms      1  70.631ms  70.631ms  70.631ms  cuCtxDestroy
                      0.46%  6.1607ms      1  6.1607ms  6.1607ms  6.1607ms  cuModuleLoadDataEx
                      0.28%  3.7628ms      1  3.7628ms  3.7628ms  3.7628ms  cuModuleUnload
                      0.10%  1.2977ms     42  30.898us  13.930us  60.092us  cuLaunchKernel
                      0.00%  56.142us     42  1.3360us     677ns  2.0930us  cuFuncGetAttribute
                      0.00%  43.957us     46     955ns     454ns  1.7670us  cuCtxSetCurrent
                      0.00%  15.179us      1  15.179us  15.179us  15.179us  cuMemcpyHtoD
                      0.00%  7.2780us     10     727ns     358ns  1.4760us  cuModuleGetGlobal
                      0.00%  6.9910us      2  3.4950us  2.2660us  4.7250us  cuDeviceGetPCIBusId
                      0.00%  5.7500us      6     958ns     333ns  3.5270us  cuModuleGetFunction
                      0.00%  3.7530us      9     417ns     184ns  1.0850us  cuDeviceGetAttribute
                      0.00%  2.6790us      3     893ns     370ns  1.9300us  cuDeviceGetCount
                      0.00%  2.0090us      3     669ns     484ns     767ns  cuDeviceGet

The memcpy comes from a `target update` to verify the results on the host. It's
not included in the measurement itself, so STREAM only evaluates the kernel
execution time:

  Function    Best Rate MB/s  Avg time   Min time   Max time
  Copy:            190819.6   0.006781   0.006708   0.006841
  Scale:           189065.7   0.006800   0.006770   0.006831
  Add:             253831.7   0.007616   0.007564   0.007646
  Triad:           253432.3   0.007668   0.007576   0.007737

Repository:
  rC Clang

https://reviews.llvm.org/D52434

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
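As a quick cross-check of the Copy row above (illustrative only, not part of the
original message): STREAM's Copy kernel moves two vectors of 80,000,000 doubles,
so the reported rate follows directly from the minimum kernel time.

  #include <stdio.h>

  /* Sketch: how the Copy "Best Rate" above follows from the minimum time.
   * All values are taken from the message; nothing here is measured. */
  int main(void) {
    const double n = 80000000.0;       /* doubles per vector                 */
    const double bytes = 2.0 * n * 8;  /* Copy reads one vector, writes one  */
    const double min_time = 0.006708;  /* seconds, Copy "Min time" above     */
    printf("%.1f MB/s\n", bytes / min_time / 1e6); /* ~190817, vs. 190819.6 reported */
    return 0;
  }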
[PATCH] D52434: [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode achieve coalescing
gtbercea added a comment.

In https://reviews.llvm.org/D52434#1249186, @Hahnfeld wrote:

> In https://reviews.llvm.org/D52434#1249102, @gtbercea wrote:
>
> > You report a slowdown which I am not able to reproduce, actually. Do you
> > use any additional clauses not present in your previous post?
>
> No, only `dist_schedule(static)`, which is faster. Tested on a `Tesla P100`
> with today's trunk version:
>
> | `#pragma omp target teams distribute parallel for` (new defaults)         | 190 - 250 GB/s |
> | adding clauses for old defaults: `schedule(static) dist_schedule(static)` | 30 - 50 GB/s   |
> | same directive with only `dist_schedule(static)` added (fewer registers)  | 320 - 400 GB/s |

Which loop size are you using? What runtime does nvprof report for these
kernels?

Repository:
  rC Clang

https://reviews.llvm.org/D52434

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[PATCH] D52434: [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode achieve coalescing
Hahnfeld added a comment.

In https://reviews.llvm.org/D52434#1249102, @gtbercea wrote:

> You report a slowdown which I am not able to reproduce, actually. Do you use
> any additional clauses not present in your previous post?

No, only `dist_schedule(static)`, which is faster. Tested on a `Tesla P100`
with today's trunk version:

| `#pragma omp target teams distribute parallel for` (new defaults)         | 190 - 250 GB/s |
| adding clauses for old defaults: `schedule(static) dist_schedule(static)` | 30 - 50 GB/s   |
| same directive with only `dist_schedule(static)` added (fewer registers)  | 320 - 400 GB/s |

Repository:
  rC Clang

https://reviews.llvm.org/D52434

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
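For context, the rates in the table come from the STREAM kernels. A minimal
sketch of a Triad-style kernel using the fastest variant above (the directive
with only `dist_schedule(static)` added) might look like the following; the
array length matches the 80,000,000-element vectors mentioned elsewhere in the
thread, and the function and array names are assumptions for illustration:

  #define STREAM_ARRAY_SIZE 80000000L  /* matches the vector length used in the thread */

  static double a[STREAM_ARRAY_SIZE], b[STREAM_ARRAY_SIZE], c[STREAM_ARRAY_SIZE];

  /* Triad-style kernel with the fastest configuration from the table:
   * the new default schedules plus an explicit dist_schedule(static). */
  void triad(double scalar) {
  #pragma omp target teams distribute parallel for dist_schedule(static) \
      map(to: b[0:STREAM_ARRAY_SIZE], c[0:STREAM_ARRAY_SIZE])            \
      map(tofrom: a[0:STREAM_ARRAY_SIZE])
    for (long i = 0; i < STREAM_ARRAY_SIZE; i++)
      a[i] = b[i] + scalar * c[i];
  }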
[PATCH] D52434: [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode achieve coalescing
gtbercea added a comment.

In https://reviews.llvm.org/D52434#1248975, @Hahnfeld wrote:

> In https://reviews.llvm.org/D52434#1248974, @gtbercea wrote:
>
> > One big problem your code has is that the trip count is incredibly small,
> > especially for STREAM and especially on GPUs. You need a much larger loop
> > size, otherwise the timings will be dominated by OpenMP setup costs.
>
> Sure, I'm not that dumb. The real code has larger loops; this was just for
> demonstration purposes. I don't expect the register count to change based on
> loop size - is that too optimistic?

I checked the different combinations of schedules and the current default is
the fastest compared to the previous defaults. The old defaults are about 10x
slower than the current set of defaults (dist_schedule(static, ) and
schedule(static, 1)). The register allocation looks strange, but it's just a
consequence of using different schedules.

You report a slowdown which I am not able to reproduce, actually. Do you use
any additional clauses not present in your previous post?

Repository:
  rC Clang

https://reviews.llvm.org/D52434

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[PATCH] D52434: [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode achieve coalescing
gtbercea added a comment.

In https://reviews.llvm.org/D52434#1248975, @Hahnfeld wrote:

> In https://reviews.llvm.org/D52434#1248974, @gtbercea wrote:
>
> > One big problem your code has is that the trip count is incredibly small,
> > especially for STREAM and especially on GPUs. You need a much larger loop
> > size, otherwise the timings will be dominated by OpenMP setup costs.
>
> Sure, I'm not that dumb. The real code has larger loops; this was just for
> demonstration purposes. I don't expect the register count to change based on
> loop size - is that too optimistic?

The register count will of course not change with loop size.

Repository:
  rC Clang

https://reviews.llvm.org/D52434

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[PATCH] D52434: [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode achieve coalescing
Hahnfeld added a comment.

In https://reviews.llvm.org/D52434#1248974, @gtbercea wrote:

> One big problem your code has is that the trip count is incredibly small,
> especially for STREAM and especially on GPUs. You need a much larger loop
> size, otherwise the timings will be dominated by OpenMP setup costs.

Sure, I'm not that dumb. The real code has larger loops; this was just for
demonstration purposes. I don't expect the register count to change based on
loop size - is that too optimistic?

Repository:
  rC Clang

https://reviews.llvm.org/D52434

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[PATCH] D52434: [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode achieve coalescing
gtbercea added a comment.

In https://reviews.llvm.org/D52434#1248844, @Hahnfeld wrote:

> Just tested this and got very weird results for register usage:
>
>   void func(double *a) {
>   #pragma omp target teams distribute parallel for map(a[0:100]) // dist_schedule(static)
>     for (int i = 0; i < 100; i++) {
>       a[i]++;
>     }
>   }
>
> Compiling with current trunk for `sm_60` (Pascal): 29 registers
> Adding `dist_schedule(static)` (the previous default): 19 registers
> For reference: `dist_schedule(static, 128)` also uses 29 registers
>
> Any ideas? This significantly slows down STREAM...

Jonas, without an explicit dist_schedule clause the program will run with
schedule(static, ). It looks like that happens fine, since you get the same
register count in the explicit static chunk variant as in the default case.
The difference you see in register count is (I suspect) driven by the runtime
code (fewer registers for non-chunked than for chunked). I am currently
investigating this and trying to find ways to reduce this number.

One big problem your code has is that the trip count is incredibly small,
especially for STREAM and especially on GPUs. You need a much larger loop
size, otherwise the timings will be dominated by OpenMP setup costs.

Repository:
  rC Clang

https://reviews.llvm.org/D52434

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
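To make the coalescing idea behind the new defaults concrete, here is a minimal
sketch of roughly what they correspond to when written out by hand. The chunk
size of 128 is only an assumption for illustration (standing in for the elided
per-team value above); the actual chunk is chosen by the compiler and runtime.

  /* Roughly what the new SPMD defaults amount to, written explicitly.
   * With a distribute chunk equal to the number of threads in a team and
   * schedule(static, 1), thread t of a team handles element chunk_base + t of
   * each chunk assigned to its team, so consecutive threads access consecutive
   * elements, i.e. coalesced global-memory accesses. */
  void scale(double *a, long n) {
  #pragma omp target teams distribute parallel for \
      dist_schedule(static, 128) schedule(static, 1) map(tofrom: a[0:n])
    for (long i = 0; i < n; i++)
      a[i] *= 2.0;
  }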
[PATCH] D52434: [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode achieve coalescing
Hahnfeld added a comment.

Just tested this and got very weird results for register usage:

  void func(double *a) {
  #pragma omp target teams distribute parallel for map(a[0:100]) // dist_schedule(static)
    for (int i = 0; i < 100; i++) {
      a[i]++;
    }
  }

Compiling with current trunk for `sm_60` (Pascal): 29 registers
Adding `dist_schedule(static)` (the previous default): 19 registers
For reference: `dist_schedule(static, 128)` also uses 29 registers

Any ideas? This significantly slows down STREAM...

Repository:
  rC Clang

https://reviews.llvm.org/D52434

___
cfe-commits mailing list
cfe-commits@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
[PATCH] D52434: [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode achieve coalescing
This revision was automatically updated to reflect the committed changes.
Closed by commit rC343253: [OpenMP] Make default distribute schedule for NVPTX target regions in SPMD mode… (authored by gbercea, committed by ).

Changed prior to commit:
  https://reviews.llvm.org/D52434?vs=167326&id=167373#toc

Repository:
  rC Clang

https://reviews.llvm.org/D52434

Files:
  lib/CodeGen/CGOpenMPRuntime.h
  lib/CodeGen/CGOpenMPRuntimeNVPTX.cpp
  lib/CodeGen/CGOpenMPRuntimeNVPTX.h
  lib/CodeGen/CGStmtOpenMP.cpp
  test/OpenMP/nvptx_target_teams_distribute_parallel_for_codegen.cpp
  test/OpenMP/nvptx_target_teams_distribute_parallel_for_simd_codegen.cpp


Index: test/OpenMP/nvptx_target_teams_distribute_parallel_for_simd_codegen.cpp
===================================================================
--- test/OpenMP/nvptx_target_teams_distribute_parallel_for_simd_codegen.cpp
+++ test/OpenMP/nvptx_target_teams_distribute_parallel_for_simd_codegen.cpp
@@ -33,7 +33,7 @@
     l = i;
   }
 
-  #pragma omp target teams distribute parallel for simd map(tofrom: aa) num_teams(M) thread_limit(64)
+  #pragma omp target teams distribute parallel for simd map(tofrom: aa) num_teams(M) thread_limit(64)
   for(int i = 0; i < n; i++) {
     aa[i] += 1;
   }
@@ -82,7 +82,7 @@
 // CHECK-LABEL: define {{.*}}void {{@__omp_offloading_.+}}(
 // CHECK-DAG: [[THREAD_LIMIT:%.+]] = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
 // CHECK: call void @__kmpc_spmd_kernel_init(i32 [[THREAD_LIMIT]], i16 0, i16 0)
-// CHECK: call void @__kmpc_for_static_init_4({{.+}}, {{.+}}, {{.+}} 92,
+// CHECK: call void @__kmpc_for_static_init_4({{.+}}, {{.+}}, {{.+}} 91,
 // CHECK: {{call|invoke}} void [[OUTL2:@.+]](
 // CHECK: call void @__kmpc_for_static_fini(
 // CHECK: call void @__kmpc_spmd_kernel_deinit()
@@ -96,7 +96,7 @@
 // CHECK-LABEL: define {{.*}}void {{@__omp_offloading_.+}}(
 // CHECK-DAG: [[THREAD_LIMIT:%.+]] = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
 // CHECK: call void @__kmpc_spmd_kernel_init(i32 [[THREAD_LIMIT]], i16 0, i16 0)
-// CHECK: call void @__kmpc_for_static_init_4({{.+}}, {{.+}}, {{.+}} 92,
+// CHECK: call void @__kmpc_for_static_init_4({{.+}}, {{.+}}, {{.+}} 91,
 // CHECK: {{call|invoke}} void [[OUTL3:@.+]](
 // CHECK: call void @__kmpc_for_static_fini(
 // CHECK: call void @__kmpc_spmd_kernel_deinit()
@@ -112,7 +112,7 @@
 // CHECK-DAG: [[THREAD_LIMIT:%.+]] = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
 // CHECK: call void @__kmpc_spmd_kernel_init(i32 [[THREAD_LIMIT]], i16 0, i16 0)
 // CHECK: store {{.+}} 99, {{.+}}* [[COMB_UB:%.+]], align
-// CHECK: call void @__kmpc_for_static_init_4({{.+}}, {{.+}}, {{.+}} 92, {{.+}}, {{.+}}, {{.+}}* [[COMB_UB]],
+// CHECK: call void @__kmpc_for_static_init_4({{.+}}, {{.+}}, {{.+}} 91, {{.+}}, {{.+}}, {{.+}}* [[COMB_UB]],
 // CHECK: {{call|invoke}} void [[OUTL4:@.+]](
 // CHECK: call void @__kmpc_for_static_fini(
 // CHECK: call void @__kmpc_spmd_kernel_deinit()
Index: test/OpenMP/nvptx_target_teams_distribute_parallel_for_codegen.cpp
===================================================================
--- test/OpenMP/nvptx_target_teams_distribute_parallel_for_codegen.cpp
+++ test/OpenMP/nvptx_target_teams_distribute_parallel_for_codegen.cpp
@@ -35,7 +35,7 @@
     l = i;
   }
 
-  #pragma omp target teams distribute parallel for map(tofrom: aa) num_teams(M) thread_limit(64)
+#pragma omp target teams distribute parallel for map(tofrom: aa) num_teams(M) thread_limit(64)
   for(int i = 0; i < n; i++) {
     aa[i] += 1;
   }
@@ -87,7 +87,7 @@
 // CHECK-LABEL: define {{.*}}void {{@__omp_offloading_.+}}(
 // CHECK-DAG: [[THREAD_LIMIT:%.+]] = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
 // CHECK: call void @__kmpc_spmd_kernel_init(i32 [[THREAD_LIMIT]], i16 0, i16 0)
-// CHECK: call void @__kmpc_for_static_init_4({{.+}}, {{.+}}, {{.+}} 92,
+// CHECK: call void @__kmpc_for_static_init_4({{.+}}, {{.+}}, {{.+}} 91,
 // CHECK: {{call|invoke}} void [[OUTL2:@.+]](
 // CHECK: call void @__kmpc_for_static_fini(
 // CHECK: call void @__kmpc_spmd_kernel_deinit()
@@ -101,7 +101,7 @@
 // CHECK-LABEL: define {{.*}}void {{@__omp_offloading_.+}}(
 // CHECK-DAG: [[THREAD_LIMIT:%.+]] = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
 // CHECK: call void @__kmpc_spmd_kernel_init(i32 [[THREAD_LIMIT]], i16 0, i16 0)
-// CHECK: call void @__kmpc_for_static_init_4({{.+}}, {{.+}}, {{.+}} 92,
+// CHECK: call void @__kmpc_for_static_init_4({{.+}}, {{.+}}, {{.+}} 91,
 // CHECK: {{call|invoke}} void [[OUTL3:@.+]](
 // CHECK: call void @__kmpc_for_static_fini(
 // CHECK: call void @__kmpc_spmd_kernel_deinit()
@@ -117,7 +117,7 @@
 // CHECK-DAG: [[THREAD_LIMIT:%.+]] = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
 // CHECK: call void @__kmpc_spmd_kernel_init(i32 [[THREAD_LIMIT]], i16 0, i16 0)
 // CHECK: store {{.+}} 99, {{.+}}* [[COMB_UB:%.+]], align
-// CHECK: call void @__kmpc_for_static_init_4({{.+}}, {{.+}}, {{.+}} 92, {{.+}}, {{.+}}, {{.+}}* [[COMB_UB]],
+// CHECK: call void @__kmpc_for_static_init_4({{.+}}, {{.+}}, {{.+}} 91, {{.+}}, {{.+}},
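A note on the `92` to `91` change in the CHECK lines above: these are the
scheduling-kind constants passed to `__kmpc_for_static_init_4` for the
distribute loop. Assuming the usual numbering in clang's OpenMP codegen (the
enum below is a paraphrased sketch, not quoted from the patch), the switch
reflects the default distribute schedule going from non-chunked to chunked
static:

  /* Paraphrased sketch of the scheduling-kind values involved (assumed from
   * clang's OpenMP codegen numbering; see CGOpenMPRuntime for the real enum). */
  enum omp_dist_sched_kind {
    dist_sch_static_chunked = 91, /* new default: chunked static distribute     */
    dist_sch_static         = 92, /* old default: one contiguous block per team */
  };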