Re: [PATCH 0/8] OpenMP offloading to NVPTX: backend patches

2016-11-11 Thread Bernd Schmidt

On 11/11/2016 04:35 PM, Alexander Monakov wrote:


> For the avoidance of doubt, is this a statement of intent, or an actual approval
> for the patchset?
> 
> After these backend modifications and the rest of libgomp/middle-end changes are
> applied, trunk will need the following flip-the-switch patch to allow OpenMP
> offloading for NVPTX.  OK?


Ok for everything.


Bernd



Re: [PATCH 0/8] OpenMP offloading to NVPTX: backend patches

2016-11-11 Thread Alexander Monakov
On Fri, 11 Nov 2016, Bernd Schmidt wrote:
> On 10/19/2016 12:39 PM, Bernd Schmidt wrote:
> > I'll refrain from any further comments on the topic. The ptx patches
> > don't look unreasonable iff someone else decides that this version of
> > OpenMP support should be merged and I'll look into them in more detail
> > if that happens. Patch 2/8 is ok now.
> 
> Sounds like Jakub has made that decision. So I'll get out of the way and just
> approve all these.

For the avoidance of doubt, is this a statement of intent, or an actual approval
for the patchset?

After these backend modifications and the rest of libgomp/middle-end changes are
applied, trunk will need the following flip-the-switch patch to allow OpenMP
offloading for NVPTX.  OK?

Thanks.
Alexander

	PR target/67822
	* config/nvptx/mkoffload.c (main): Allow -fopenmp.

diff --git a/gcc/config/nvptx/mkoffload.c b/gcc/config/nvptx/mkoffload.c
index c8eed45..e99ef37 100644
--- a/gcc/config/nvptx/mkoffload.c
+++ b/gcc/config/nvptx/mkoffload.c
@@ -517,8 +524,8 @@ main (int argc, char **argv)
     fatal_error (input_location, "cannot open '%s'", ptx_cfile_name);
 
   /* PR libgomp/65099: Currently, we only support offloading in 64-bit
-     configurations.  PR target/67822: OpenMP offloading to nvptx fails.  */
-  if (offload_abi == OFFLOAD_ABI_LP64 && !fopenmp)
+     configurations.  */
+  if (offload_abi == OFFLOAD_ABI_LP64)
     {
       ptx_name = make_temp_file (".mkoffload");
       obstack_ptr_grow (&argv_obstack, "-o");
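
(For context, a hedged sketch of how the flipped switch would be exercised
once the whole series is merged; -foffload=nvptx-none is GCC's standard
offload-target spelling, and the source file name here is made up:)

  gcc -O2 -fopenmp -foffload=nvptx-none test.c -o test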



Re: [PATCH 0/8] OpenMP offloading to NVPTX: backend patches

2016-11-11 Thread Bernd Schmidt

On 10/19/2016 12:39 PM, Bernd Schmidt wrote:

> I'll refrain from any further comments on the topic. The ptx patches
> don't look unreasonable iff someone else decides that this version of
> OpenMP support should be merged and I'll look into them in more detail
> if that happens. Patch 2/8 is ok now.


Sounds like Jakub has made that decision. So I'll get out of the way and 
just approve all these.



Bernd



Re: [PATCH 0/8] OpenMP offloading to NVPTX: backend patches

2016-10-19 Thread Jakub Jelinek
On Tue, Oct 18, 2016 at 07:58:49PM +0300, Alexander Monakov wrote:
> On Tue, 18 Oct 2016, Bernd Schmidt wrote:
> > The performance I saw was lower by a factor of 80 or so compared to their
> > CUDA version, and even lower than OpenMP on the host.
> 
> The currently published OpenMP version of LULESH simply doesn't use
> openmp-simd anywhere. This should make it obvious that it won't be anywhere
> near any reasonable CUDA implementation, and is also bound to be below host
> performance. Besides, it's common for such benchmark suites to have very
> different levels of hand tuning for the native-CUDA implementation vs the
> OpenMP implementation, sometimes to the point of significant algorithmic
> differences. So you're making an invalid comparison here.

This is related to the discussions about an independent clause/construct (or
whatever other name it ends up with).  The problem with LULESH's
#pragma omp distribute parallel for
rather than
#pragma omp distribute parallel for simd
is that the loop bodies usually call (inline) functions, and distribute
parallel for, even with the implementation-defined default for the schedule()
clause, doesn't simply let the implementation choose whatever distribution
between teams/threads/simd it likes.  For loops that don't call any functions
we can scan the loop body and figure out whether it could, e.g. through
various omp_* calls, observe anything that would reveal how it is distributed
among teams/threads/simd; but for loops that can call other functions that is
hard to do, especially as early as during omp lowering/expansion.
OpenMP 5.0 is likely going to have some clause or construct that simply
states that the loop iterations are completely independent, but until then
the programmer has to use the more prescriptive pragmas and be careful about
what exactly they ask for.
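
A minimal sketch of the distinction (not from the original mail; helper() is
a hypothetical out-of-line function):

  /* The compiler cannot see into helper(), so it cannot prove that the loop
     body never observes (e.g. via omp_get_thread_num()) how iterations are
     distributed among teams/threads/simd.  */
  #pragma omp distribute parallel for
  for (int i = 0; i < n; i++)
    a[i] = helper (b[i]);

  /* With the explicit 'simd', the programmer asserts the independence that
     lets the implementation map iterations onto SIMT lanes on NVPTX.  */
  #pragma omp distribute parallel for simd
  for (int i = 0; i < n; i++)
    a[i] = b[i] + c[i];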

But, certainly we should collect some OpenMP/OpenACC offloading benchmarks
or write our own and use that to compare GCC with other compilers.

Jakub


Re: [PATCH 0/8] OpenMP offloading to NVPTX: backend patches

2016-10-19 Thread Bernd Schmidt

On 10/18/2016 06:58 PM, Alexander Monakov wrote:


> The currently published OpenMP version of LULESH simply doesn't use
> openmp-simd anywhere. This should make it obvious that it won't be anywhere
> near any reasonable CUDA implementation, and is also bound to be below host
> performance. Besides, it's common for such benchmark suites to have very
> different levels of hand tuning for the native-CUDA implementation vs the
> OpenMP implementation, sometimes to the point of significant algorithmic
> differences. So you're making an invalid comparison here.


The information I have is that the LULESH code is representative of how 
at least some groups on the HPC side expect to write OpenMP code. It's 
the biggest real-world piece of code that I'm aware of that's available 
for testing, so it seemed like a good thing to try. If you have other 
real-world tests available, please let us know. If you can demonstrate 
good performance by modifying LULESH sources, that would also be a good 
step, although maybe not the ideal case. But I think it's not 
unreasonable to look for a demonstration that reasonable performance is 
achievable on something that isn't just a microbenchmark.


I'll refrain from any further comments on the topic. The ptx patches 
don't look unreasonable iff someone else decides that this version of 
OpenMP support should be merged and I'll look into them in more detail 
if that happens. Patch 2/8 is ok now.



Bernd


Re: [PATCH 0/8] OpenMP offloading to NVPTX: backend patches

2016-10-19 Thread Alexander Monakov
On Tue, 18 Oct 2016, Bernd Schmidt wrote:
> [...] but then I think we shouldn't repeat the mistakes we made with OpenACC

I think it would be good if you'd mention for posterity what, specifically,
the mistakes were, in particular if you want those not to be repeated in the
context of OpenMP offloading.

Alexander


Re: [PATCH 0/8] OpenMP offloading to NVPTX: backend patches

2016-10-18 Thread Jakub Jelinek
On Tue, Oct 18, 2016 at 07:58:49PM +0300, Alexander Monakov wrote:
> On Tue, 18 Oct 2016, Bernd Schmidt wrote:
> > The performance I saw was lower by a factor of 80 or so compared to their
> > CUDA version, and even lower than OpenMP on the host.
> 
> The currently published OpenMP version of LULESH simply doesn't use
> openmp-simd anywhere. This should make it obvious that it won't be anywhere
> near any reasonable CUDA implementation, and is also bound to be below host
> performance.

Yeah, perhaps just changing some or all #pragma omp distribute parallel for
into #pragma omp distribute parallel for simd could do something (of course,
one should actually analyze what each loop does, but if it is valid for
distribute without a dist_schedule clause, then the loops ought to be free of
forward or backward lexical dependencies; teams can't really synchronize,
though they can use some atomics).
That said, the OpenMP port of LULESH doesn't seem to be done very carefully;
e.g. in CalcHourglassControlForElems I see:
  /* Do a check for negative volumes */
  if ( v[i] <= Real_t(0.0) ) {
    vol_error = i;
  }
There is no explicit mapping of vol_error, nor any reduction over it.  In
OpenMP 4.0 this would be just a possible data race (the variable would be
map(tofrom: vol_error) by default and shared between teams/threads, so if
more than one thread decides to write it, it is a data race); in OpenMP 4.5
it is implicitly firstprivate(vol_error), and thus the changes to the
variable (still racy) would simply never be propagated back to the caller.
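
A hedged sketch of one way to make this well-defined (not from the original
mail or from LULESH; the max-reduction is an illustrative choice, and
Index_t/Real_t are LULESH's own typedefs):

  Index_t vol_error = -1;
  /* Map the flag explicitly and combine per-thread writes with a reduction,
     so the result is race-free and propagated back to the host.  */
  #pragma omp target teams distribute parallel for \
      map(tofrom: vol_error) reduction(max: vol_error)
  for (Index_t i = 0; i < numElem; ++i)
    if (v[i] <= Real_t(0.0))
      vol_error = i;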

For the missing simd regions, it might be helpful if we were able to
"autovectorize" into SIMT code, but I guess that might be quite a lot of
work.

Jakub


Re: [PATCH 0/8] OpenMP offloading to NVPTX: backend patches

2016-10-18 Thread Alexander Monakov
On Tue, 18 Oct 2016, Bernd Schmidt wrote:
> The performance I saw was lower by a factor of 80 or so compared to their CUDA
> version, and even lower than OpenMP on the host.

The currently published OpenMP version of LULESH simply doesn't use
openmp-simd anywhere. This should make it obvious that it won't be anywhere
near any reasonable CUDA implementation, and is also bound to be below host
performance. Besides, it's common for such benchmark suites to have very
different levels of hand tuning for the native-CUDA implementation vs the
OpenMP implementation, sometimes to the point of significant algorithmic
differences. So you're making an invalid comparison here.

Internally at ISP RAS we used a small set of microbenchmarks implemented in
CUDA/OpenACC/OpenMP specifically for the purpose of evaluating the exact same
computations implemented in terms of different APIs.  We got close performance
from all three.  The biggest issue is visible on short-running OpenMP target
regions: the startup cost (going through libgomp) is non-trivial.  That can be
improved with further changes in the libgomp port, notably avoiding malloc,
shaving off more code, and perhaps inlining more code (e.g. via LTO
eventually).  There's also an avoidable cuMemAlloc/cuMemFree pair on the
libgomp plugin side.

For example, there's this patch on the branch:

libgomp: avoid malloc calls in gomp_nvptx_main

Avoid calling malloc where it's easy to use stack storage instead: device
malloc is very slow in CUDA.  This cuts about 60-80 microseconds from target
region entry/exit time, slimming down empty target regions from ~95 to ~17
microseconds (as measured on a GTX Titan).

(an empty CUDA kernel is ~5 microseconds; all figures were taken via nvprof)
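
A hedged sketch of how such launch-overhead figures are typically measured
(not the actual ISP RAS harness; only standard OpenMP API calls are used):

  #include <omp.h>
  #include <stdio.h>

  int main (void)
  {
    const int iters = 1000;
    double t0 = omp_get_wtime ();
    for (int i = 0; i < iters; i++)
      {
        #pragma omp target
        { /* empty region: measures pure entry/exit overhead */ }
      }
    double t1 = omp_get_wtime ();
    printf ("avg target region overhead: %g us\n", (t1 - t0) / iters * 1e6);
    return 0;
  }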

> To me this kind of performance doesn't look like something that will be fixed
> by fine-tuning; it leaves me undecided whether the chosen approach (what you
> call the fundamentals) is viable at all.

If you try to draw conclusions just from comparing the performance you got on
LULESH, without looking at the benchmark's source (otherwise you would have
acknowledged the lack of openmp-simd and the significant source-level
differences between the CUDA and OpenMP implementations, like the use of
__shared__ in the CUDA algorithms), I am sorry to say, but that is just
ridiculous.  The implementation on the branch is far from ideal, but your
method of evaluation is nonsensical.

> Performance is still better than the OpenACC version of the benchmark, but
> then I think we shouldn't repeat the mistakes we made with OpenACC, and
> should avoid merging something until we're sure it's ready and of benefit to
> users.

Would you kindly try to keep your commentary constructive?  It's frustrating
to have to tolerate hostilities like ad hominem attacks, ignored
nvptx-backend-related questions, etc.  How can the work get ready if all you
do is passively push back?  Please trust me, I have experience with GPUs and
GCC.  There should be a process for getting this reviewed gradually, with the
fundamental design decisions acked and patches reviewed before all the tweaks
and optimizations are in place.  If you are suggesting that the work should
proceed on the branch without any kind of interim review, and then be
reviewed in one go after satisfying some unspecified criteria of being "ready
and of benefit", that doesn't sound right to me.

Alexander


Re: [PATCH 0/8] OpenMP offloading to NVPTX: backend patches

2016-10-18 Thread Bernd Schmidt

On 10/17/2016 07:06 PM, Alexander Monakov wrote:


> I've just pushed two commits to the branch to fix this issue.  Before those, the
> last commit left the branch in a state where an incremental build seemed ok
> (because libgcc/libgomp weren't rebuilt with the new cc1), but a from-scratch
> build was broken like you've shown.  LULESH is known to work.  I also intend to
> perform a trunk merge soon.


Ok that did work, however...


> > I think before merging this work we'll need to have some idea of how well it
> > works on real-world code.


> This patchset and the branch lay the foundation; there's more work to be
> done, in particular on the performance side.  There should be agreement on
> these fundamental bits first, before moving on to fine-tuning.


The performance I saw was lower by a factor of 80 or so compared to 
their CUDA version, and even lower than OpenMP on the host. Does this 
match what you are seeing? Do you have a clear plan how this can be 
improved?


To me this kind of performance doesn't look like something that will be
fixed by fine-tuning; it leaves me undecided whether the chosen approach
(what you call the fundamentals) is viable at all. Performance is still
better than the OpenACC version of the benchmark, but then I think we
shouldn't repeat the mistakes we made with OpenACC, and should avoid merging
something until we're sure it's ready and of benefit to users.



Bernd


Re: [PATCH 0/8] OpenMP offloading to NVPTX: backend patches

2016-10-17 Thread Alexander Monakov
On Mon, 17 Oct 2016, Bernd Schmidt wrote:
> On 10/14/2016 06:39 PM, Alexander Monakov wrote:
> > I'm resending the patch series with backend prerequisites for OpenMP
> > offloading to the NVIDIA PTX ISA.  The patches are rebased on trunk.
> 
> What's the status of the branch? Is it expected to work? I'm trying to compile
> the OpenMP version of these benchmarks:
>   https://codesign.llnl.gov/lulesh.php
> 
> and the resulting binary fails as follows:
> 
> libgomp: Link error log error   : Size doesn't match for '__nvptx_stacks' in
> 'Input 8', first specified in 'Input 8'
> error   : Multiple definition of '__nvptx_stacks' in 'Input 8', first defined
> in 'Input 8'

I've just pushed two commits to the branch to fix this issue.  Before those, the
last commit left the branch in a state where an incremental build seemed ok
(because libgcc/libgomp weren't rebuilt with the new cc1), but a from-scratch
build was broken like you've shown.  LULESH is known to work.  I also intend to
perform a trunk merge soon.

> I think before merging this work we'll need to have some idea of how well it
> works on real-world code.

This patchset and the branch lay the foundation; there's more work to be
done, in particular on the performance side.  There should be agreement on
these fundamental bits first, before moving on to fine-tuning.

Alexander


Re: [PATCH 0/8] OpenMP offloading to NVPTX: backend patches

2016-10-17 Thread Bernd Schmidt

On 10/14/2016 06:39 PM, Alexander Monakov wrote:

> I'm resending the patch series with backend prerequisites for OpenMP
> offloading to the NVIDIA PTX ISA.  The patches are rebased on trunk.


What's the status of the branch? Is it expected to work? I'm trying to 
compile the OpenMP version of these benchmarks:

  https://codesign.llnl.gov/lulesh.php

and the resulting binary fails as follows:

libgomp: Link error log error   : Size doesn't match for '__nvptx_stacks'
in 'Input 8', first specified in 'Input 8'
error   : Multiple definition of '__nvptx_stacks' in 'Input 8', first
defined in 'Input 8'


I think before merging this work we'll need to have some idea of how 
well it works on real-world code.



Bernd


[PATCH 0/8] OpenMP offloading to NVPTX: backend patches

2016-10-14 Thread Alexander Monakov
Hi,

I'm resending the patch series with backend prerequisites for OpenMP
offloading to the NVIDIA PTX ISA.  The patches are rebased on trunk.

Could a global reviewer have a look at patch 6 (new TARGET_SIMT_VF hook) please?
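
For reference, a hedged sketch of the likely shape of that hook on nvptx (the
exact definition is in patch 6 itself; PTX_WARP_SIZE is nvptx's macro for the
32-thread warp):

  /* In gcc/config/nvptx/nvptx.c: the SIMT vectorization factor is the warp
     width, so each logical 'simd' lane maps onto one PTX thread.  */
  static int
  nvptx_simt_vf (void)
  {
    return PTX_WARP_SIZE;
  }

  #undef TARGET_SIMT_VF
  #define TARGET_SIMT_VF nvptx_simt_vf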

Documentation changes in doc/invoke.texi have already been reviewed
by Sandra Loosemore (thank you!).

Alexander


Re: [PATCH 0/8] OpenMP offloading to NVPTX: backend patches

2016-06-09 Thread Jakub Jelinek
On Thu, Jun 09, 2016 at 07:53:52PM +0300, Alexander Monakov wrote:
> I'm sending an updated patch series with backend prerequisites for OpenMP
> offloading to the NVIDIA PTX ISA.  The first patch has already received some
> comments and this version reflects the review feedback.  The other patches
> have been adjusted for clarity and re-cut in a more rigorous manner.  All
> patches are rebased onto current trunk.
> 
> Jakub, can you offer wishes/recommendations for sending the rest of the
> (middle-end and libgomp) patches?  As you know, there's a branch with

Once all the prerequisites are in (I assume the patches depend on the NVPTX
backend patches you've just posted), I'd prefer if you rebase the rest onto
current trunk and post it in reasonably reviewable chunks (that can be all of
the middle-end changes in one patch, all of the libgomp plugin changes, all
of the other libgomp changes, or, if some of those would be too large, split
a little bit further).

Jakub


[PATCH 0/8] OpenMP offloading to NVPTX: backend patches

2016-06-09 Thread Alexander Monakov
Hi,

I'm sending an updated patch series with backend prerequisites for OpenMP
offloading to the NVIDIA PTX ISA.  The first patch has already received some
comments and this version reflects the review feedback.  The other patches
have been adjusted for clarity and re-cut in a more rigorous manner.  All
patches are rebased onto current trunk.

Jakub, can you offer wishes/recommendations for sending the rest of the
(middle-end and libgomp) patches?  As you know, there's a branch with the
development history; is that of interest, or would it be easier if I rebased
everything anew onto current trunk?

Thanks.
Alexander