On 12/02/15 09:22, Jakub Jelinek wrote:
I believe Alex' testing revealed that if you take the address of the same .local
objects in several threads, the addresses are the same, and therefore you
refer to your own .local space rather than the other thread's.
Before or after applying cvta?
nathan
On 12/02/15 10:12, Jakub Jelinek wrote:
If we have a reasonable IPA pass to discover which addressable variables can
be shared by multiple threads and which can't, then we could use soft-stack
for those that can be shared by multiple PTX threads (different warps, or
same warp, different threads
On Wed, Dec 02, 2015 at 06:44:11PM +0300, Alexander Monakov wrote:
> > But you never know if people actually use #pragma omp simd regions or not,
> > sometimes they will, sometimes they won't, and if the uniform SIMT
> > increases power consumption, it might not be desirable.
>
> It's easy to address: just terminate threads 1-31 if the linked image has
> no SIMD regions, like my pre-simd libgomp was doing.
On Wed, 2 Dec 2015, Jakub Jelinek wrote:
> > It's easy to address: just terminate threads 1-31 if the linked image has
> > no SIMD regions, like my pre-simd libgomp was doing.
>
> Well, can't the linked image in one shared library call a function
> in another linked image in another shared
On 12/02/15 12:09, Alexander Monakov wrote:
I meant the PTX linked (post PTX-JIT link) image, so regardless of support,
it's not an issue. E.g. check early in gomp_nvptx_main if .weak
__nvptx_has_simd != 0. It would only break if there was dlopen on PTX.
Note I found a bug in .weak support.
On Tue, Dec 01, 2015 at 06:28:20PM +0300, Alexander Monakov wrote:
> The approach in OpenACC is to, outside of "vector" loops, 1) make threads 1-31
> "slaves" which just follow branches without any computation -- that requires
> extra jumps and broadcasting branch predicates, -- and 2) broadcast
On 12/02/15 05:40, Jakub Jelinek wrote:
Don't know the HW good enough, is there any power consumption, heat etc.
difference between the two approaches? I mean does the HW consume different
amount of power if only one thread in a warp executes code and the other
threads in the same warp just
On 12/02/15 08:10, Jakub Jelinek wrote:
On Wed, Dec 02, 2015 at 08:02:47AM -0500, Nathan Sidwell wrote:
Always the whole stack, from the current stack pointer up to top of the
stack, so sometimes a few bytes, sometimes a few kilobytes or more each time?
The frame of the current function.
On 12/02/15 08:46, Jakub Jelinek wrote:
Or does the OpenACC execution model not allow anything like that, i.e.
have some function with an automatic variable pass the address of that
variable to some other function and that other function use #acc loop kind
that expects the caller to be at the
This patch introduces a code generation variant for NVPTX that I'm using for
SIMD work in OpenMP offloading. Let me try to explain the idea behind it...
In place of SIMD vectorization, NVPTX uses SIMT (single
instruction/multiple threads) execution: groups of 32 threads execute the same
On 12/01/2015 04:28 PM, Alexander Monakov wrote:
I'm taking a different approach. I want to execute all insns in all warp
members, while ensuring that the effect (on global and local state) is the same
as if any single thread was executing that instruction. Most instructions
automatically satisfy
On Tue, 1 Dec 2015, Bernd Schmidt wrote:
>
> Didn't we also conclude that address-taking (let's say for stack addresses) is
> also an operation that does not result in the same state?
This is intended to be used with soft-stacks in OpenMP offloading, and
soft-stacks are per-warp outside of SIMD