https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84871

            Bug ID: 84871
           Summary: libgomp examples-4/declare_target-[12].f90 fail with
                    nvptx Titan V offloading
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: minor
          Priority: P3
         Component: libgomp
          Assignee: unassigned at gcc dot gnu.org
          Reporter: cesar at gcc dot gnu.org
                CC: jakub at gcc dot gnu.org
  Target Milestone: ---

Both libgomp.fortran/examples-4/declare_target-1.f90 and
libgomp.fortran/examples-4/declare_target-2.f90 fail when offloaded on Nvidia
Titan V (or Volta family) GPUs running Nvidia driver 390.25. The failure
appears to be the result of a limited per-CUDA-thread stack size of 1024 bytes,
as reported by cuCtxGetLimit (..., CU_LIMIT_STACK_SIZE).

Those tests only fail at -O1, -O2 and -Os. Furthermore, all of the tests pass
on older Nvidia GPUs, including Kepler (K80s) and Pascal (GeForce 1080).

One thing I noticed was that ptxas reports that it is spilling more registers
to the stack for the Volta GPUs than it is for Pascal GPUs. Here are the
relevant statistics for Pascal:

ptxas info    : Function properties for __e_53_1_mod_MOD_fib
    24 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads

Here are the corresponding statistics for Volta:

ptxas info    : Function properties for __e_53_1_mod_MOD_fib
    40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads
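
As a rough back-of-the-envelope check, the frame sizes above can be compared
against the 1024-byte stack limit. The sketch below is not part of the test
suite; the function names are made up for illustration, and it assumes that
each active call to fib occupies exactly the stack frame size ptxas reports
and that nothing else lives on the per-thread stack. Real usage is likely
higher, so these numbers are upper bounds at best.

```c
#include <stddef.h>

/* Hypothetical helper: upper bound on how many nested calls fit in the
   per-thread stack.  Assumes each call uses exactly frame_bytes (the
   "stack frame" figure from ptxas) and the stack holds nothing else.  */
size_t max_nested_frames (size_t stack_limit_bytes, size_t frame_bytes)
{
  if (frame_bytes == 0)
    return 0;  /* No frame information; avoid dividing by zero.  */
  return stack_limit_bytes / frame_bytes;
}

/* Hypothetical helper: stack bytes consumed by `depth' nested calls
   under the same assumption.  */
size_t stack_bytes_needed (size_t depth, size_t frame_bytes)
{
  return depth * frame_bytes;
}
```

Under those assumptions, max_nested_frames (1024, 24) gives 42 nested calls
for the Pascal frame size, while max_nested_frames (1024, 40) gives only 25
for Volta. That would leave much less headroom for a deeply recursive fib on
Volta, although the estimate ignores any other per-call stack usage.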

Given that we can't control the PTX driver JIT, maybe we should either reduce
the recursion depth in declare_target-[12].f90 to 20 (actually fib (22) works,
but I don't want a newer driver to break it again), or just xfail those tests
for nvptx targets.

The CUDA driver API provides the cuCtxSetLimit function to adjust the stack
limit, but apparently that only adjusts the upper bound.
