Re: [PATCH 0/8] OpenMP offloading to NVPTX: backend patches

On 11/11/2016 04:35 PM, Alexander Monakov wrote:
> For the avoidance of doubt, is this a statement of intent, or an actual
> approval for the patchset? After these backend modifications and the rest
> of libgomp/middle-end changes are applied, trunk will need the following
> flip-the-switch patch to allow OpenMP offloading for NVPTX. OK?

Ok for everything.

Bernd
Re: [PATCH 0/8] OpenMP offloading to NVPTX: backend patches

On Fri, 11 Nov 2016, Bernd Schmidt wrote:
> On 10/19/2016 12:39 PM, Bernd Schmidt wrote:
> > I'll refrain from any further comments on the topic. The ptx patches
> > don't look unreasonable iff someone else decides that this version of
> > OpenMP support should be merged and I'll look into them in more detail
> > if that happens. Patch 2/8 is ok now.
>
> Sounds like Jakub has made that decision. So I'll get out of the way and
> just approve all these.

For the avoidance of doubt, is this a statement of intent, or an actual
approval for the patchset?

After these backend modifications and the rest of libgomp/middle-end changes
are applied, trunk will need the following flip-the-switch patch to allow
OpenMP offloading for NVPTX. OK?

Thanks.
Alexander

	PR target/67822
	* config/nvptx/mkoffload.c (main): Allow -fopenmp.

diff --git a/gcc/config/nvptx/mkoffload.c b/gcc/config/nvptx/mkoffload.c
index c8eed45..e99ef37 100644
--- a/gcc/config/nvptx/mkoffload.c
+++ b/gcc/config/nvptx/mkoffload.c
@@ -517,8 +524,8 @@ main (int argc, char **argv)
     fatal_error (input_location, "cannot open '%s'", ptx_cfile_name);
 
   /* PR libgomp/65099: Currently, we only support offloading in 64-bit
-     configurations.  PR target/67822: OpenMP offloading to nvptx fails.  */
-  if (offload_abi == OFFLOAD_ABI_LP64 && !fopenmp)
+     configurations.  */
+  if (offload_abi == OFFLOAD_ABI_LP64)
     {
       ptx_name = make_temp_file (".mkoffload");
       obstack_ptr_grow (&argv_obstack, "-o");
Re: [PATCH 0/8] OpenMP offloading to NVPTX: backend patches

On 10/19/2016 12:39 PM, Bernd Schmidt wrote:
> I'll refrain from any further comments on the topic. The ptx patches don't
> look unreasonable iff someone else decides that this version of OpenMP
> support should be merged and I'll look into them in more detail if that
> happens. Patch 2/8 is ok now.

Sounds like Jakub has made that decision. So I'll get out of the way and
just approve all these.

Bernd
Re: [PATCH 0/8] OpenMP offloading to NVPTX: backend patches

On Tue, Oct 18, 2016 at 07:58:49PM +0300, Alexander Monakov wrote:
> On Tue, 18 Oct 2016, Bernd Schmidt wrote:
> > The performance I saw was lower by a factor of 80 or so compared to their
> > CUDA version, and even lower than OpenMP on the host.
>
> The currently published OpenMP version of LULESH simply doesn't use
> openmp-simd anywhere. This should make it obvious that it won't be anywhere
> near any reasonable CUDA implementation, and also bound to be below host
> performance. Besides, it's common for such benchmark suites to have very
> different levels of hand tuning for the native-CUDA implementation vs the
> OpenMP implementation, sometimes to the point of significant algorithmic
> differences. So you're making an invalid comparison here.

This is related to the independent clause/construct (or whatever other
names) discussions. The problem with LULESH's
  #pragma omp distribute parallel for
rather than
  #pragma omp distribute parallel for simd
is that it usually calls (inlined) functions, and distribute parallel for,
even with the implementation-defined default for the schedule() clause, does
not simply let the implementation choose whatever distribution between
teams/threads/simd it likes. For loops which don't call any functions we can
scan the loop body and figure out whether it could, e.g. through various
omp_* calls, observe anything that would reveal how it is distributed among
teams/threads/simd; but for loops that can call other functions that is hard
to do, especially as early as during omp lowering/expansion.

OpenMP 5.0 is likely going to have some clause or similar that will simply
say the loop iterations are completely independent, but until then the
programmer uses more prescriptive pragmas and needs to be careful about what
exactly they want.

But, certainly we should collect some OpenMP/OpenACC offloading benchmarks
or write our own and use that to compare GCC with other compilers.

	Jakub
Re: [PATCH 0/8] OpenMP offloading to NVPTX: backend patches

On 10/18/2016 06:58 PM, Alexander Monakov wrote:
> The currently published OpenMP version of LULESH simply doesn't use
> openmp-simd anywhere. This should make it obvious that it won't be anywhere
> near any reasonable CUDA implementation, and also bound to be below host
> performance. Besides, it's common for such benchmark suites to have very
> different levels of hand tuning for the native-CUDA implementation vs the
> OpenMP implementation, sometimes to the point of significant algorithmic
> differences. So you're making an invalid comparison here.

The information I have is that the LULESH code is representative of how at
least some groups on the HPC side expect to write OpenMP code. It's the
biggest real-world piece of code that I'm aware of that's available for
testing, so it seemed like a good thing to try. If you have other real-world
tests available, please let us know.

If you can demonstrate good performance by modifying the LULESH sources,
that would also be a good step, although maybe not the ideal case. But I
think it's not unreasonable to look for a demonstration that reasonable
performance is achievable on something that isn't just a microbenchmark.

I'll refrain from any further comments on the topic. The ptx patches don't
look unreasonable iff someone else decides that this version of OpenMP
support should be merged, and I'll look into them in more detail if that
happens. Patch 2/8 is ok now.

Bernd
Re: [PATCH 0/8] OpenMP offloading to NVPTX: backend patches

On Tue, 18 Oct 2016, Bernd Schmidt wrote:
> [...] but then I think we shouldn't repeat the mistakes we made with OpenACC

I think it would be good if you'd mention for posterity what, specifically,
the mistakes were, in particular if you want those not to be repeated in the
context of OpenMP offloading.

Alexander
Re: [PATCH 0/8] OpenMP offloading to NVPTX: backend patches

On Tue, Oct 18, 2016 at 07:58:49PM +0300, Alexander Monakov wrote:
> On Tue, 18 Oct 2016, Bernd Schmidt wrote:
> > The performance I saw was lower by a factor of 80 or so compared to their
> > CUDA version, and even lower than OpenMP on the host.
>
> The currently published OpenMP version of LULESH simply doesn't use
> openmp-simd anywhere. This should make it obvious that it won't be anywhere
> near any reasonable CUDA implementation, and also bound to be below host
> performance.

Yeah, perhaps just changing some or all
  #pragma omp distribute parallel for
into
  #pragma omp distribute parallel for simd
could do something. Of course, one should actually analyze what it does, but
if it is valid for distribute without a dist_schedule clause, then the loops
ought to be without forward or backward lexical dependencies (teams can't
really synchronize, though they can use some atomics).

That said, the OpenMP port of LULESH doesn't seem to be done very carefully;
e.g. in CalcHourglassControlForElems I see:

   /* Do a check for negative volumes */
   if ( v[i] <= Real_t(0.0) ) {
      vol_error = i;
   }

There is not any kind of explicit mapping of vol_error nor a reduction over
it. In OpenMP 4.0 this is a possible data race (the variable would be
map(tofrom: vol_error) by default and shared between teams/threads, so if
more than one thread decides to write it, it is a data race); in OpenMP 4.5
it is implicitly firstprivate(vol_error), and thus the (still racy) changes
to the variable would simply never be propagated back to the caller.

For the missing simd regions, it might be helpful if we were able to
"autovectorize" into the SIMT, but I guess that might be quite a lot of work.

	Jakub
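[Editorial note: one way to make the vol_error handling well-defined is to
state the sharing explicitly instead of relying on the version-dependent
implicit rules Jakub describes. The sketch below is hypothetical — the
function name, shape, and the choice of a max-reduction over the offending
index are mine, not LULESH's or GCC's.]

```c
#include <assert.h>

/* Sketch: combine "bad volume" indices with a declared reduction instead
   of racing on a shared flag.  Returns the largest offending index, or -1
   if all volumes are positive.  Hypothetical example.  */
int
check_volumes (const double *v, int n)
{
  int vol_error = -1;  /* -1: no non-positive volume seen */

  /* reduction(max:) gives each thread a private copy and a well-defined
     combined result, even if several iterations detect a bad volume;
     under a 'target' construct, map(tofrom: vol_error) would be the
     explicit-mapping counterpart.  */
#pragma omp parallel for reduction(max: vol_error)
  for (int i = 0; i < n; i++)
    if (v[i] <= 0.0 && i > vol_error)
      vol_error = i;

  return vol_error;
}
```

Without -fopenmp the pragma is ignored and the sequential loop computes the
same maximum, so the function's result does not depend on the threading.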
Re: [PATCH 0/8] OpenMP offloading to NVPTX: backend patches

On Tue, 18 Oct 2016, Bernd Schmidt wrote:
> The performance I saw was lower by a factor of 80 or so compared to their
> CUDA version, and even lower than OpenMP on the host.

The currently published OpenMP version of LULESH simply doesn't use
openmp-simd anywhere. This should make it obvious that it won't be anywhere
near any reasonable CUDA implementation, and is also bound to be below host
performance. Besides, it's common for such benchmark suites to have very
different levels of hand tuning for the native-CUDA implementation vs the
OpenMP implementation, sometimes to the point of significant algorithmic
differences. So you're making an invalid comparison here.

Internally at ISP RAS we used a small set of microbenchmarks implemented in
CUDA/OpenACC/OpenMP specifically for the purpose of evaluating the exact
same computations implemented in terms of different APIs. We got close
performance in all three.

The biggest issue is visible on short-running OpenMP target regions: the
startup cost (going through libgomp) is non-trivial. That can be improved
with further changes in the libgomp port, notably avoiding malloc, shaving
off more code, perhaps inlining more code (e.g. via LTO eventually). There's
also an avoidable cuMemAlloc/cuMemFree on the libgomp plugin side. For
example, there's this patch on the branch:

    libgomp: avoid malloc calls in gomp_nvptx_main

    Avoid calling malloc where it's easy to use stack storage instead:
    device malloc is very slow in CUDA. This cuts about 60-80 microseconds
    from target region entry/exit time, slimming down empty target regions
    from ~95 to ~17 microseconds (as measured on a GTX Titan).

(An empty CUDA kernel is ~5 microseconds; all figures are taken via nvprof.)

> To me this kind of performance doesn't look like something that will be
> fixed by fine-tuning; it leaves me undecided whether the chosen approach
> (what you call the fundamentals) is viable at all.

If you try to draw conclusions just from comparing the performance you got
on LULESH, without looking at the benchmark's source (otherwise you would
have acknowledged the lack of openmp-simd and the significant source-level
differences between the CUDA and OpenMP implementations, like the use of
__shared__ in the CUDA algorithms), I am sorry to say, but that is just
ridiculous. The implementation on the branch is far from ideal, but your
method of evaluation is nonsensical.

> Performance is still better than the OpenACC version of the benchmark, but
> then I think we shouldn't repeat the mistakes we made with OpenACC and
> avoid merging something until we're sure it's ready and of benefit to
> users.

Would you kindly try and keep your commentary constructive? It's frustrating
to me to have to tolerate hostilities like an ad hominem attack, ignored
nvptx-backend-related questions, etc. How can the work get ready if all you
do is passively push back? Please trust me, I have experience with GPUs and
GCC.

There should be a process for getting this gradually reviewed, with
fundamental design decisions acked and patches reviewed before all tweaks
and optimizations are in place. If you suggest that the work needs to
proceed on the branch without any kind of interim review, and then be
reviewed in one go after satisfying some unspecified criteria of being
"ready and of benefit", that doesn't sound right to me.

Alexander
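[Editorial note: the malloc-avoidance change quoted in the commit message
above can be sketched as follows. This is a hedged illustration of the
general transformation — the function, names, and sizes are hypothetical,
not the actual gomp_nvptx_main code.]

```c
#include <assert.h>
#include <string.h>

/* Replace device malloc (very slow in CUDA) with automatic storage when
   the allocation is small and bounded.

   Before:
     int *vars = malloc (n * sizeof *vars);
     ...
     free (vars);

   After:  */
int
sum_copied_args (const int *args, int n)
{
  int vars[32];   /* bounded stack buffer, no device malloc */

  assert (n <= (int) (sizeof vars / sizeof vars[0]));
  memcpy (vars, args, n * sizeof vars[0]);

  int s = 0;
  for (int i = 0; i < n; i++)
    s += vars[i];
  return s;
}
```

The trade-off is a fixed upper bound on the allocation size; in exchange the
per-call cost drops from a device-heap allocation to a stack-pointer bump.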
Re: [PATCH 0/8] OpenMP offloading to NVPTX: backend patches

On 10/17/2016 07:06 PM, Alexander Monakov wrote:
> I've just pushed two commits to the branch to fix this issue. Before those,
> the last commit left the branch in a state where an incremental build
> seemed ok (because libgcc/libgomp weren't rebuilt with the new cc1), but a
> from-scratch build was broken like you've shown. LULESH is known to work.
> I also intend to perform a trunk merge soon.

Ok, that did work, however...

> > I think before merging this work we'll need to have some idea of how
> > well it works on real-world code.
>
> This patchset and the branch lay the foundation, there's more work to be
> done, in particular on the performance improvements side. There should be
> an agreement on these fundamental bits first, before moving on to
> fine-tuning.

The performance I saw was lower by a factor of 80 or so compared to their
CUDA version, and even lower than OpenMP on the host. Does this match what
you are seeing? Do you have a clear plan for how this can be improved?

To me this kind of performance doesn't look like something that will be
fixed by fine-tuning; it leaves me undecided whether the chosen approach
(what you call the fundamentals) is viable at all. Performance is still
better than the OpenACC version of the benchmark, but then I think we
shouldn't repeat the mistakes we made with OpenACC, and should avoid merging
something until we're sure it's ready and of benefit to users.

Bernd
Re: [PATCH 0/8] OpenMP offloading to NVPTX: backend patches

On Mon, 17 Oct 2016, Bernd Schmidt wrote:
> On 10/14/2016 06:39 PM, Alexander Monakov wrote:
> > I'm resending the patch series with backend prerequisites for OpenMP
> > offloading to the NVIDIA PTX ISA. The patches are rebased on trunk.
>
> What's the status of the branch? Is it expected to work? I'm trying to
> compile the OpenMP version of these benchmarks:
> https://codesign.llnl.gov/lulesh.php
>
> and the resulting binary fails as follows:
>
> libgomp: Link error log
> error   : Size doesn't match for '__nvptx_stacks' in 'Input 8', first
> specified in 'Input 8'
> error   : Multiple definition of '__nvptx_stacks' in 'Input 8', first
> defined in 'Input 8'

I've just pushed two commits to the branch to fix this issue. Before those,
the last commit left the branch in a state where an incremental build seemed
ok (because libgcc/libgomp weren't rebuilt with the new cc1), but a
from-scratch build was broken like you've shown. LULESH is known to work.
I also intend to perform a trunk merge soon.

> I think before merging this work we'll need to have some idea of how well
> it works on real-world code.

This patchset and the branch lay the foundation; there's more work to be
done, in particular on the performance improvements side. There should be an
agreement on these fundamental bits first, before moving on to fine-tuning.

Alexander
Re: [PATCH 0/8] OpenMP offloading to NVPTX: backend patches

On 10/14/2016 06:39 PM, Alexander Monakov wrote:
> I'm resending the patch series with backend prerequisites for OpenMP
> offloading to the NVIDIA PTX ISA. The patches are rebased on trunk.

What's the status of the branch? Is it expected to work? I'm trying to
compile the OpenMP version of these benchmarks:
https://codesign.llnl.gov/lulesh.php

and the resulting binary fails as follows:

libgomp: Link error log
error   : Size doesn't match for '__nvptx_stacks' in 'Input 8', first
specified in 'Input 8'
error   : Multiple definition of '__nvptx_stacks' in 'Input 8', first
defined in 'Input 8'

I think before merging this work we'll need to have some idea of how well it
works on real-world code.

Bernd
[PATCH 0/8] OpenMP offloading to NVPTX: backend patches

Hi,

I'm resending the patch series with backend prerequisites for OpenMP
offloading to the NVIDIA PTX ISA. The patches are rebased on trunk.

Could a global reviewer have a look at patch 6 (new TARGET_SIMT_VF hook),
please? Documentation changes in doc/invoke.texi have already been reviewed
by Sandra Loosemore (thank you!).

Alexander
Re: [PATCH 0/8] OpenMP offloading to NVPTX: backend patches

On Thu, Jun 09, 2016 at 07:53:52PM +0300, Alexander Monakov wrote:
> I'm sending updated patch series with backend prerequisites for OpenMP
> offloading to the NVIDIA PTX ISA. The first patch has already received
> some comments and this version reflects review feedback. The other patches
> have been adjusted for clarity and re-cut in a more rigorous manner. All
> patches are rebased onto current trunk.
>
> Jakub, can you offer wishes/recommendations for sending the rest of
> (middle-end and libgomp) patches? As you know there's a branch with
> development history; is that of interest, or would it be easier if I
> rebased all stuff anew on current trunk?

Once all the prerequisites are in (I assume the patches depend on the NVPTX
backend patches you've just posted), then I'd prefer if you rebase the rest
to current trunk and post in reasonably reviewable chunks (that can be all
of the middle-end changes in one patch, all of the libgomp plugin changes,
all of the other libgomp changes, or, if some of those would be too large,
split that a little bit).

	Jakub
[PATCH 0/8] OpenMP offloading to NVPTX: backend patches

Hi,

I'm sending an updated patch series with backend prerequisites for OpenMP
offloading to the NVIDIA PTX ISA. The first patch has already received some
comments, and this version reflects review feedback. The other patches have
been adjusted for clarity and re-cut in a more rigorous manner. All patches
are rebased onto current trunk.

Jakub, can you offer wishes/recommendations for sending the rest of
(middle-end and libgomp) patches? As you know there's a branch with
development history; is that of interest, or would it be easier if I rebased
all stuff anew on current trunk?

Thanks.
Alexander