Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-04-05 Thread Tom de Vries
On 03/30/2018 05:14 PM, Tom de Vries wrote: On 03/30/2018 05:00 PM, Cesar Philippidis wrote: I should have checked that patch with the vector length fallback disabled. Right. The patch series introduces a lot of code that is not exercised. I've added an -mlong-vector-in-workers option in my

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-04-05 Thread Tom de Vries
On 04/03/2018 05:00 PM, Tom de Vries wrote: + unsigned int psize = ROUND_UP (data.offset, oacc_bcast_align); + unsigned int pnum = (nvptx_mach_vector_length () > PTX_WARP_SIZE + ? nvptx_mach_max_workers () + 1 + : 1); This claims too
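The sizing arithmetic in the quoted hunk can be sketched in standalone C. This is an illustration only, not the GCC source: `round_up`, `bcast_partition_size`, and `bcast_partition_count` are hypothetical stand-ins for `ROUND_UP`, `psize`, and `pnum`, and the value 32 for `PTX_WARP_SIZE` is the CUDA warp size.

```c
/* CUDA warp size; named after the PTX_WARP_SIZE macro in the quoted hunk. */
#define PTX_WARP_SIZE 32u

/* Hypothetical stand-in for ROUND_UP: round x up to the next multiple
   of align.  Assumes align > 0.  */
static unsigned int
round_up (unsigned int x, unsigned int align)
{
  return (x + align - 1) / align * align;
}

/* Mirrors the psize computation: the size of one broadcast partition is
   the accumulated data offset rounded up to the broadcast alignment.  */
static unsigned int
bcast_partition_size (unsigned int offset, unsigned int align)
{
  return round_up (offset, align);
}

/* Mirrors the pnum computation: a single partition when the vector fits
   in a warp, otherwise max_workers + 1 partitions.  */
static unsigned int
bcast_partition_count (unsigned int vector_length, unsigned int max_workers)
{
  return vector_length > PTX_WARP_SIZE ? max_workers + 1 : 1;
}
```

With these definitions, a 10-byte offset at 8-byte alignment gives a 16-byte partition, and a vector length of 128 with 16 workers gives 17 partitions, so the total claimed would be the product of the two.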

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-04-05 Thread Tom de Vries
On 04/03/2018 05:00 PM, Tom de Vries wrote: On 03/02/2018 05:55 PM, Cesar Philippidis wrote: * config/nvptx/nvptx.c (oacc_bcast_partition): Declare. One last thing: this variable needs to be reset to zero for every function. Without this reset, we can generate different code for a

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-04-03 Thread Tom de Vries
On 03/02/2018 05:55 PM, Cesar Philippidis wrote: * config/nvptx/nvptx.c (oacc_bcast_partition): Declare. One last thing: this variable needs to be reset to zero for every function. Without this reset, we can generate different code for a function depending on whether there's another

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-03-30 Thread Tom de Vries
On 03/30/2018 05:00 PM, Cesar Philippidis wrote: I should have checked that patch with the vector length fallback disabled. Right. The patch series introduces a lot of code that is not exercised. I've added an -mlong-vector-in-workers option in my local branch and added 3 test-cases to

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-03-30 Thread Cesar Philippidis
On 03/30/2018 07:45 AM, Tom de Vries wrote: > On 03/30/2018 03:07 AM, Tom de Vries wrote: >> On 03/02/2018 05:55 PM, Cesar Philippidis wrote: >>> As a follow up patch will show, the nvptx BE falls back to using >>> vector_length = 32 when a vector loop is nested inside a worker loop. >> >> I

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-03-30 Thread Tom de Vries
On 03/30/2018 03:07 AM, Tom de Vries wrote: On 03/02/2018 05:55 PM, Cesar Philippidis wrote: As a follow up patch will show, the nvptx BE falls back to using vector_length = 32 when a vector loop is nested inside a worker loop. I disabled the fallback, and analyzed the vred2d-128.c illegal

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-03-29 Thread Tom de Vries
On 03/02/2018 05:55 PM, Cesar Philippidis wrote: As a follow up patch will show, the nvptx BE falls back to using vector_length = 32 when a vector loop is nested inside a worker loop. I disabled the fallback, and analyzed the vred2d-128.c illegal memory access execution failure. I minimized

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-03-23 Thread Tom de Vries
On 03/02/2018 05:55 PM, Cesar Philippidis wrote: + if (cfun->machine->sync_bar) +fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; " +"// vector synchronization barrier\n", +REGNO (cfun->machine->sync_bar)); I realize that atm we don't support large vector length
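To make the quoted `fprintf` easier to read, here is a minimal standalone sketch of the line it would emit. `format_sync_bar_init` is a hypothetical helper, not a function in nvptx.c, and the register number is passed in directly rather than taken from `cfun->machine->sync_bar`. The emitted PTX sets the register to `%tidy + 1`. Reading that as a per-worker barrier number is an assumption based on the patch description of one barrier per logical vector.

```c
#include <stdio.h>

/* Hypothetical helper: write the PTX line from the quoted hunk into buf.
   The emitted instruction computes regno = %tidy + 1.
   Returns the snprintf result (length of the formatted line).  */
static int
format_sync_bar_init (char *buf, size_t size, unsigned int regno)
{
  return snprintf (buf, size,
                   "\t\tadd.u32\t\t%%r%u, %%tidy, 1; "
                   "// vector synchronization barrier\n",
                   regno);
}
```

Calling the helper with register number 5 produces a line that starts with two tabs, followed by `add.u32`, two more tabs, and `%r5, %tidy, 1;`.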

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-03-23 Thread Tom de Vries
On 03/22/2018 06:24 PM, Cesar Philippidis wrote: On 03/22/2018 09:18 AM, Tom de Vries wrote: That's obviously not good enough. When I compile this test-case: ... int main (void) {   int a[10]; #pragma acc parallel num_workers (16) #pragma acc loop worker   for (int i = 0; i < 10; i++)    

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-03-23 Thread Tom de Vries
On 03/02/2018 05:55 PM, Cesar Philippidis wrote: diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md index 28ae263c867..ac2731233dd 100644 --- a/gcc/config/nvptx/nvptx.md +++ b/gcc/config/nvptx/nvptx.md @@ -1418,10 +1418,16 @@ [(set_attr "atomic" "true")]) (define_insn

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-03-23 Thread Tom de Vries
On 03/02/2018 05:55 PM, Cesar Philippidis wrote: +/* Loop structure of the function. The entire function is described as + a NULL loop. */ + struct parallel { /* Parent parallel. */ You dropped this comment in "vector_length extension part 1: generalize function and variable

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-03-23 Thread Tom de Vries
On 03/02/2018 05:55 PM, Cesar Philippidis wrote: @@ -4115,13 +4225,23 @@ nvptx_single (unsigned mask, basic_block from, basic_block to) pred = gen_reg_rtx (BImode); cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER] = pred; } - + It's fine to clean

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-03-23 Thread Tom de Vries
On 03/22/2018 08:04 PM, Cesar Philippidis wrote: I'm going to retest the variable vector length changes without it and see if it's still necessary. On one hand, maxntid should be fairly innocuous, but I don't like how it can mask other PTX JIT bugs. At this point, I'm leaning towards dropping it

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-03-22 Thread Cesar Philippidis
On 03/22/2018 10:51 AM, Tom de Vries wrote: > On 03/22/2018 06:24 PM, Cesar Philippidis wrote: >> On 03/22/2018 09:18 AM, Tom de Vries wrote: >> >>> That's obviously not good enough. >>> >>> When I compile this test-case: >>> ... >>> int >>> main (void) >>> { >>>    int a[10]; >>> #pragma acc

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-03-22 Thread Tom de Vries
On 03/22/2018 06:47 PM, Cesar Philippidis wrote: On 03/22/2018 10:39 AM, Tom de Vries wrote: On 03/02/2018 05:55 PM, Cesar Philippidis wrote: +  rtx red_partition; /* Similar to bcast_partition, except for vector +    reductions.  */ Shouldn't this be in "[og7] vector_length

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-03-22 Thread Tom de Vries
On 03/22/2018 06:24 PM, Cesar Philippidis wrote: On 03/22/2018 09:18 AM, Tom de Vries wrote: That's obviously not good enough. When I compile this test-case: ... int main (void) {   int a[10]; #pragma acc parallel num_workers (16) #pragma acc loop worker   for (int i = 0; i < 10; i++)    

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-03-22 Thread Cesar Philippidis
On 03/22/2018 10:39 AM, Tom de Vries wrote: > On 03/02/2018 05:55 PM, Cesar Philippidis wrote: >> +  rtx red_partition; /* Similar to bcast_partition, except for vector >> +    reductions.  */ > > Shouldn't this be in "[og7] vector_length extension part 3: reductions"? Maybe. But keep in

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-03-22 Thread Tom de Vries
On 03/02/2018 05:55 PM, Cesar Philippidis wrote: + rtx red_partition; /* Similar to bcast_partition, except for vector + reductions. */ Shouldn't this be in "[og7] vector_length extension part 3: reductions"? Thanks, - Tom

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-03-22 Thread Cesar Philippidis
On 03/22/2018 09:18 AM, Tom de Vries wrote: > That's obviously not good enough. > > When I compile this test-case: > ... > int > main (void) > { >   int a[10]; > #pragma acc parallel num_workers (16) > #pragma acc loop worker >   for (int i = 0; i < 10; i++) >     a[i] = i; > >   return 0; > }

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-03-22 Thread Cesar Philippidis
On 03/22/2018 07:44 AM, Tom de Vries wrote: > On 03/02/2018 05:55 PM, Cesar Philippidis wrote: >> The attached patch generalizes the worker state propagation and >> synchronization code to handle large vectors. When the vector_length is >> larger than a CUDA warp, the nvptx BE will now use

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-03-22 Thread Tom de Vries
On 03/22/2018 04:11 PM, Cesar Philippidis wrote: On 03/22/2018 07:23 AM, Tom de Vries wrote: On 03/02/2018 05:55 PM, Cesar Philippidis wrote: (nvptx_declare_function_name): Emit a .maxntid directive hint and call nvptx_init_oacc_workers. + +  /* Emit a .maxntid hint to help the PTX

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-03-22 Thread Cesar Philippidis
On 03/22/2018 07:23 AM, Tom de Vries wrote: > On 03/02/2018 05:55 PM, Cesar Philippidis wrote: > >> (nvptx_declare_function_name): Emit a .maxntid directive hint and >> call nvptx_init_oacc_workers. > >> + >> +  /* Emit a .maxntid hint to help the PTX JIT emit SYNC branches.  */ >> +  if

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-03-22 Thread Tom de Vries
On 03/02/2018 05:55 PM, Cesar Philippidis wrote: The attached patch generalizes the worker state propagation and synchronization code to handle large vectors. When the vector_length is larger than a CUDA warp, the nvptx BE will now use shared-memory to spill-and-fill vector state when

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-03-22 Thread Cesar Philippidis
On 03/22/2018 06:43 AM, Tom de Vries wrote: > On 03/22/2018 04:59 AM, Cesar Philippidis wrote: >> On 03/21/2018 10:10 AM, Tom de Vries wrote: >>> Changing the code generation scheme for workers is fine, but obviously >>> that should be a minimal, separate patch that we can bisect back to. >> >>

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-03-22 Thread Tom de Vries
On 03/02/2018 05:55 PM, Cesar Philippidis wrote: (nvptx_declare_function_name): Emit a .maxntid directive hint and call nvptx_init_oacc_workers. + + /* Emit a .maxntid hint to help the PTX JIT emit SYNC branches. */ + if (lookup_attribute ("omp target entrypoint",

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-03-22 Thread Tom de Vries
On 03/22/2018 04:59 AM, Cesar Philippidis wrote: On 03/21/2018 10:10 AM, Tom de Vries wrote: On 03/02/2018 05:55 PM, Cesar Philippidis wrote: In addition, nvptx_cta_sync and the corresponding nvptx_barsync insn, have been extended to take a barrier ID and a thread count. The idea here is to

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-03-21 Thread Cesar Philippidis
On 03/21/2018 10:10 AM, Tom de Vries wrote: > On 03/02/2018 05:55 PM, Cesar Philippidis wrote: >> In addition, nvptx_cta_sync and the corresponding nvptx_barsync insn, >> have been extended to take a barrier ID and a thread count. The idea >> here is to assign one barrier for each logical vector.

Re: [og7] vector_length extension part 2: Generalize state propagation and synchronization

2018-03-21 Thread Tom de Vries
On 03/02/2018 05:55 PM, Cesar Philippidis wrote: In addition, nvptx_cta_sync and the corresponding nvptx_barsync insn, have been extended to take a barrier ID and a thread count. The idea here is to assign one barrier for each logical vector. Worker-single synchronization is controlled by