Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-07 Thread Nathan Sidwell
On 12/01/15 11:01, Bernd Schmidt wrote: On 12/01/2015 04:28 PM, Alexander Monakov wrote: I'm taking a different approach. I want to execute all insns in all warp members, while ensuring that the effect (on global and local state) is the same as if a single thread were executing that instruction.

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-03 Thread Alexander Monakov
On Wed, 2 Dec 2015, Nathan Sidwell wrote: > On 12/02/15 12:09, Alexander Monakov wrote: > > > I meant the PTX linked (post PTX-JIT link) image, so regardless of support, > > it's not an issue. E.g. check early in gomp_nvptx_main if .weak > > __nvptx_has_simd != 0. It would only break if there

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Nathan Sidwell
On 12/02/15 09:22, Jakub Jelinek wrote: I believe Alex' testing revealed that if you take address of the same .local objects in several threads, the addresses are the same, and therefore you refer to your own .local space rather than the other thread's. Before or after applying cvta? nathan

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Jakub Jelinek
On Wed, Dec 02, 2015 at 09:23:11AM -0500, Nathan Sidwell wrote: > On 12/02/15 09:22, Jakub Jelinek wrote: > > >I believe Alex' testing revealed that if you take address of the same .local > >objects in several threads, the addresses are the same, and therefore you > >refer to your own .local

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Nathan Sidwell
On 12/02/15 10:12, Jakub Jelinek wrote: If we have a reasonable IPA pass to discover which addressable variables can be shared by multiple threads and which can't, then we could use soft-stack for those that can be shared by multiple PTX threads (different warps, or same warp, different threads

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Jakub Jelinek
On Wed, Dec 02, 2015 at 09:14:03AM -0500, Nathan Sidwell wrote: > On 12/02/15 08:46, Jakub Jelinek wrote: > > >Or does the OpenACC execution model not allow anything like that, i.e. > >have some function with an automatic variable pass the address of that > >variable to some other function and

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Jakub Jelinek
On Wed, Dec 02, 2015 at 05:54:51PM +0300, Alexander Monakov wrote: > On Wed, 2 Dec 2015, Jakub Jelinek wrote: > > > On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote: > > > On 12/02/15 05:40, Jakub Jelinek wrote: > > > > Don't know the HW well enough, is there any power consumption,

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Alexander Monakov
On Wed, 2 Dec 2015, Jakub Jelinek wrote: > On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote: > > On 12/02/15 05:40, Jakub Jelinek wrote: > > > Don't know the HW well enough, is there any power consumption, heat etc. > > >difference between the two approaches? I mean does the HW

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Nathan Sidwell
On 12/02/15 09:24, Jakub Jelinek wrote: On Wed, Dec 02, 2015 at 09:23:11AM -0500, Nathan Sidwell wrote: On 12/02/15 09:22, Jakub Jelinek wrote: I believe Alex' testing revealed that if you take address of the same .local objects in several threads, the addresses are the same, and therefore

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Jakub Jelinek
On Wed, Dec 02, 2015 at 06:44:11PM +0300, Alexander Monakov wrote: > > But you never know if people actually use #pragma omp simd regions or not, > > sometimes they will, sometimes they won't, and if the uniform SIMT > increases > > power consumption, it might not be desirable. > > It's easy to

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Alexander Monakov
On Wed, 2 Dec 2015, Jakub Jelinek wrote: > On Wed, Dec 02, 2015 at 09:23:11AM -0500, Nathan Sidwell wrote: > > On 12/02/15 09:22, Jakub Jelinek wrote: > > > > >I believe Alex' testing revealed that if you take address of the same > > >.local > > >objects in several threads, the addresses are

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Nathan Sidwell
On 12/02/15 09:41, Alexander Monakov wrote: On Wed, 2 Dec 2015, Nathan Sidwell wrote: On 12/02/15 05:40, Jakub Jelinek wrote: Don't know the HW well enough, is there any power consumption, heat etc. difference between the two approaches? I mean does the HW consume a different amount of power

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Alexander Monakov
On Wed, 2 Dec 2015, Nathan Sidwell wrote: > On 12/02/15 05:40, Jakub Jelinek wrote: > > Don't know the HW well enough, is there any power consumption, heat etc. > > difference between the two approaches? I mean does the HW consume a different > > amount of power if only one thread in a warp

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Nathan Sidwell
On 12/02/15 11:35, Jakub Jelinek wrote: On Wed, Dec 02, 2015 at 06:44:11PM +0300, Alexander Monakov wrote: But you never know if people actually use #pragma omp simd regions or not, sometimes they will, sometimes they won't, and if the uniform SIMT increases power consumption, it might not be

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Alexander Monakov
On Wed, 2 Dec 2015, Jakub Jelinek wrote: > > It's easy to address: just terminate threads 1-31 if the linked image has > > no SIMD regions, like my pre-simd libgomp was doing. > > Well, can't say the linked image in one shared library call a function > in another linked image in another shared

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Nathan Sidwell
On 12/02/15 12:09, Alexander Monakov wrote: I meant the PTX linked (post PTX-JIT link) image, so regardless of support, it's not an issue. E.g. check early in gomp_nvptx_main if .weak __nvptx_has_simd != 0. It would only break if there was dlopen on PTX. Note I found a bug in .weak support.

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Jakub Jelinek
On Tue, Dec 01, 2015 at 06:28:20PM +0300, Alexander Monakov wrote: > The approach in OpenACC is to, outside of "vector" loops, 1) make threads 1-31 > "slaves" which just follow branches without any computation -- that requires > extra jumps and broadcasting branch predicates, -- and 2) broadcast

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Jakub Jelinek
On Wed, Dec 02, 2015 at 08:38:56AM -0500, Nathan Sidwell wrote: > On 12/02/15 08:10, Jakub Jelinek wrote: > >On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote: > > >Always the whole stack, from the current stack pointer up to top of the > >stack, so sometimes a few bytes, sometimes a

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Nathan Sidwell
On 12/02/15 05:40, Jakub Jelinek wrote: Don't know the HW well enough, is there any power consumption, heat etc. difference between the two approaches? I mean does the HW consume a different amount of power if only one thread in a warp executes code and the other threads in the same warp just

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Jakub Jelinek
On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote: > On 12/02/15 05:40, Jakub Jelinek wrote: > > Don't know the HW well enough, is there any power consumption, heat etc. > >difference between the two approaches? I mean does the HW consume a different > >amount of power if only one

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Nathan Sidwell
On 12/02/15 08:10, Jakub Jelinek wrote: On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote: Always the whole stack, from the current stack pointer up to top of the stack, so sometimes a few bytes, sometimes a few kilobytes or more each time? The frame of the current function.

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Bernd Schmidt
On 12/02/2015 02:46 PM, Jakub Jelinek wrote: Or does the OpenACC execution model not allow anything like that, i.e. have some function with an automatic variable pass the address of that variable to some other function and that other function use #acc loop kind that expects the caller to be at

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-02 Thread Nathan Sidwell
On 12/02/15 08:46, Jakub Jelinek wrote: Or does the OpenACC execution model not allow anything like that, i.e. have some function with an automatic variable pass the address of that variable to some other function and that other function use #acc loop kind that expects the caller to be at the

[gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-01 Thread Alexander Monakov
This patch introduces a code generation variant for NVPTX that I'm using for SIMD work in OpenMP offloading. Let me try to explain the idea behind it... In place of SIMD vectorization, NVPTX uses SIMT (single instruction/multiple threads) execution: groups of 32 threads execute the same

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-01 Thread Bernd Schmidt
On 12/01/2015 04:28 PM, Alexander Monakov wrote: I'm taking a different approach. I want to execute all insns in all warp members, while ensuring that the effect (on global and local state) is the same as if a single thread were executing that instruction. Most instructions automatically satisfy

Re: [gomp-nvptx 2/9] nvptx backend: new "uniform SIMT" codegen variant

2015-12-01 Thread Alexander Monakov
On Tue, 1 Dec 2015, Bernd Schmidt wrote: > > Didn't we also conclude that address-taking (let's say for stack addresses) is > also an operation that does not result in the same state? This is intended to be used with soft-stacks in OpenMP offloading, and soft-stacks are per-warp outside of SIMD