On 03/30/2018 05:14 PM, Tom de Vries wrote:
On 03/30/2018 05:00 PM, Cesar Philippidis wrote:
I should
have checked that patch with the vector length fallback disabled.
Right. The patch series introduces a lot of code that is not exercised.
I've added an -mlong-vector-in-workers option in my
On 04/03/2018 05:00 PM, Tom de Vries wrote:
+ unsigned int psize = ROUND_UP (data.offset, oacc_bcast_align);
+ unsigned int pnum = (nvptx_mach_vector_length () > PTX_WARP_SIZE
+ ? nvptx_mach_max_workers () + 1
+ : 1);
This claims too
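For reference, the arithmetic in the quoted hunk can be sketched in isolation. This is only an illustration of what the hunk computes, not a statement about whether that amount is appropriate. The helper names round_up and bcast_partition_count are hypothetical, and max_workers stands in for whatever nvptx_mach_max_workers () returns, which the snippet does not show.

```c
#include <assert.h>

#define PTX_WARP_SIZE 32  /* Threads per CUDA warp.  */

/* Hypothetical stand-in for ROUND_UP: round X up to the next
   multiple of ALIGN.  */
static unsigned int
round_up (unsigned int x, unsigned int align)
{
  assert (align != 0);
  return (x + align - 1) / align * align;
}

/* Hypothetical helper mirroring the quoted pnum expression: one
   partition when the vector fits in a warp, otherwise
   max_workers + 1 partitions.  MAX_WORKERS is an assumed input.  */
static unsigned int
bcast_partition_count (unsigned int vector_length, unsigned int max_workers)
{
  return vector_length > PTX_WARP_SIZE ? max_workers + 1 : 1;
}
```

With vector_length = 128 and an assumed max_workers of 4, bcast_partition_count yields 5; with vector_length = 32 it yields 1 regardless of the worker count.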
On 04/03/2018 05:00 PM, Tom de Vries wrote:
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
* config/nvptx/nvptx.c (oacc_bcast_partition): Declare.
One last thing: this variable needs to be reset to zero for every function.
Without this reset, we can generate different code for a
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
* config/nvptx/nvptx.c (oacc_bcast_partition): Declare.
One last thing: this variable needs to be reset to zero for every function.
Without this reset, we can generate different code for a function
depending on whether there's another
On 03/30/2018 05:00 PM, Cesar Philippidis wrote:
I should
have checked that patch with the vector length fallback disabled.
Right. The patch series introduces a lot of code that is not exercised.
I've added an -mlong-vector-in-workers option in my local branch and
added 3 test-cases to
On 03/30/2018 07:45 AM, Tom de Vries wrote:
> On 03/30/2018 03:07 AM, Tom de Vries wrote:
>> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
>>> As a follow up patch will show, the nvptx BE falls back to using
>>> vector_length = 32 when a vector loop is nested inside a worker loop.
>>
>> I
On 03/30/2018 03:07 AM, Tom de Vries wrote:
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
As a follow up patch will show, the nvptx BE falls back to using
vector_length = 32 when a vector loop is nested inside a worker loop.
I disabled the fallback, and analyzed the vred2d-128.c illegal
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
As a follow up patch will show, the nvptx BE falls back to using
vector_length = 32 when a vector loop is nested inside a worker loop.
I disabled the fallback, and analyzed the vred2d-128.c illegal memory
access execution failure.
I minimized
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
+  if (cfun->machine->sync_bar)
+    fprintf (file, "\t\tadd.u32\t\t%%r%d, %%tidy, 1; "
+	     "// vector synchronization barrier\n",
+	     REGNO (cfun->machine->sync_bar));
I realize that atm we don't support large vector length
On 03/22/2018 06:24 PM, Cesar Philippidis wrote:
On 03/22/2018 09:18 AM, Tom de Vries wrote:
That's obviously not good enough.
When I compile this test-case:
...
int
main (void)
{
int a[10];
#pragma acc parallel num_workers (16)
#pragma acc loop worker
for (int i = 0; i < 10; i++)
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
diff --git a/gcc/config/nvptx/nvptx.md b/gcc/config/nvptx/nvptx.md
index 28ae263c867..ac2731233dd 100644
--- a/gcc/config/nvptx/nvptx.md
+++ b/gcc/config/nvptx/nvptx.md
@@ -1418,10 +1418,16 @@
[(set_attr "atomic" "true")])
(define_insn
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
+/* Loop structure of the function. The entire function is described as
+ a NULL loop. */
+
struct parallel
{
/* Parent parallel. */
You dropped this comment in "vector_length extension part 1: generalize
function and variable
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
@@ -4115,13 +4225,23 @@ nvptx_single (unsigned mask, basic_block from, basic_block to)
pred = gen_reg_rtx (BImode);
cfun->machine->axis_predicate[mode - GOMP_DIM_WORKER] = pred;
}
-
+
It's fine to clean
On 03/22/2018 08:04 PM, Cesar Philippidis wrote:
I'm going to retest the variable vector length changes without it and
see if it's still necessary. On one hand, maxntid should be fairly
innocuous, but I don't like how it can mask other PTX JIT bugs. At this
point, I'm leaning towards dropping it
On 03/22/2018 10:51 AM, Tom de Vries wrote:
> On 03/22/2018 06:24 PM, Cesar Philippidis wrote:
>> On 03/22/2018 09:18 AM, Tom de Vries wrote:
>>
>>> That's obviously not good enough.
>>>
>>> When I compile this test-case:
>>> ...
>>> int
>>> main (void)
>>> {
>>> int a[10];
>>> #pragma acc
On 03/22/2018 06:47 PM, Cesar Philippidis wrote:
On 03/22/2018 10:39 AM, Tom de Vries wrote:
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
+ rtx red_partition; /* Similar to bcast_partition, except for vector
+ reductions. */
Shouldn't this be in "[og7] vector_length
On 03/22/2018 06:24 PM, Cesar Philippidis wrote:
On 03/22/2018 09:18 AM, Tom de Vries wrote:
That's obviously not good enough.
When I compile this test-case:
...
int
main (void)
{
int a[10];
#pragma acc parallel num_workers (16)
#pragma acc loop worker
for (int i = 0; i < 10; i++)
On 03/22/2018 10:39 AM, Tom de Vries wrote:
> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
>> + rtx red_partition; /* Similar to bcast_partition, except for vector
>> + reductions. */
>
> Shouldn't this be in "[og7] vector_length extension part 3: reductions"?
Maybe. But keep in
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
+ rtx red_partition; /* Similar to bcast_partition, except for vector
+ reductions. */
Shouldn't this be in "[og7] vector_length extension part 3: reductions"?
Thanks,
- Tom
On 03/22/2018 09:18 AM, Tom de Vries wrote:
> That's obviously not good enough.
>
> When I compile this test-case:
> ...
> int
> main (void)
> {
> int a[10];
> #pragma acc parallel num_workers (16)
> #pragma acc loop worker
> for (int i = 0; i < 10; i++)
> a[i] = i;
>
> return 0;
> }
On 03/22/2018 07:44 AM, Tom de Vries wrote:
> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
>> The attached patch generalizes the worker state propagation and
>> synchronization code to handle large vectors. When the vector_length is
>> larger than a CUDA warp, the nvptx BE will now use
On 03/22/2018 04:11 PM, Cesar Philippidis wrote:
On 03/22/2018 07:23 AM, Tom de Vries wrote:
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
(nvptx_declare_function_name): Emit a .maxntid directive hint and
call nvptx_init_oacc_workers.
+
+ /* Emit a .maxntid hint to help the PTX
On 03/22/2018 07:23 AM, Tom de Vries wrote:
> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
>
>> (nvptx_declare_function_name): Emit a .maxntid directive hint and
>> call nvptx_init_oacc_workers.
>
>> +
>> + /* Emit a .maxntid hint to help the PTX JIT emit SYNC branches. */
>> + if
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
The attached patch generalizes the worker state propagation and
synchronization code to handle large vectors. When the vector_length is
larger than a CUDA warp, the nvptx BE will now use shared-memory to
spill-and-fill vector state when
On 03/22/2018 06:43 AM, Tom de Vries wrote:
> On 03/22/2018 04:59 AM, Cesar Philippidis wrote:
>> On 03/21/2018 10:10 AM, Tom de Vries wrote:
>>> Changing the code generation scheme for workers is fine, but obviously
>>> that should be a minimal, separate patch that we can bisect back to.
>>
>>
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
(nvptx_declare_function_name): Emit a .maxntid directive hint and
call nvptx_init_oacc_workers.
+
+ /* Emit a .maxntid hint to help the PTX JIT emit SYNC branches. */
+ if (lookup_attribute ("omp target entrypoint",
On 03/22/2018 04:59 AM, Cesar Philippidis wrote:
On 03/21/2018 10:10 AM, Tom de Vries wrote:
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
In addition, nvptx_cta_sync and the corresponding nvptx_barsync insn
have been extended to take a barrier ID and a thread count. The idea
here is to
On 03/21/2018 10:10 AM, Tom de Vries wrote:
> On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
>> In addition, nvptx_cta_sync and the corresponding nvptx_barsync insn
>> have been extended to take a barrier ID and a thread count. The idea
>> here is to assign one barrier for each logical vector.
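The one-barrier-per-logical-vector idea quoted above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patch's code: the helper names are hypothetical, the ID mapping worker + 1 is inferred from the "add.u32 %r, %tidy, 1" line quoted elsewhere in the thread, and the rounding reflects the PTX rule that an explicit bar.sync thread count must be a multiple of the warp size.

```c
#include <assert.h>

#define PTX_WARP_SIZE 32     /* Threads per CUDA warp.  */
#define PTX_NUM_BARRIERS 16  /* PTX provides barriers 0..15 per CTA.  */

/* Hypothetical helper: barrier ID for the logical vector owned by
   WORKER (the %tid.y value).  Assumes the worker + 1 mapping, which
   leaves barrier 0 free for other uses.  */
static unsigned int
vector_barrier_id (unsigned int worker)
{
  assert (worker + 1 < PTX_NUM_BARRIERS);
  return worker + 1;
}

/* Hypothetical helper: thread count operand for the barrier, i.e.
   VECTOR_LENGTH rounded up to a whole number of warps.  */
static unsigned int
vector_barrier_threads (unsigned int vector_length)
{
  return (vector_length + PTX_WARP_SIZE - 1) / PTX_WARP_SIZE * PTX_WARP_SIZE;
}
```

Under these assumptions, worker 0 with vector_length = 128 would synchronize on barrier 1 with a thread count of 128, so each worker's vector can synchronize without stalling the others.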
On 03/02/2018 05:55 PM, Cesar Philippidis wrote:
In addition, nvptx_cta_sync and the corresponding nvptx_barsync insn
have been extended to take a barrier ID and a thread count. The idea
here is to assign one barrier for each logical vector. Worker-single
synchronization is controlled by
29 matches