https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84871
Bug ID: 84871
Summary: libgomp examples-4/declare_target-[12].f90 fail with nvptx Titan V offloading
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Severity: minor
Priority: P3
Component: libgomp
Assignee: unassigned at gcc dot gnu.org
Reporter: cesar at gcc dot gnu.org
CC: jakub at gcc dot gnu.org
Target Milestone: ---

Both libgomp.fortran/examples-4/declare_target-1.f90 and libgomp.fortran/examples-4/declare_target-2.f90 fail when offloaded on Nvidia Titan V (or Volta family) GPUs running Nvidia driver 390.25. The failure appears to be the result of a limited per-CUDA-thread stack size of 1024 bytes, as reported by cuCtxGetLimit (..., CU_LIMIT_STACK_SIZE). Those tests only fail at -O1, -O2 and -Os. Furthermore, all of the tests pass on older Nvidia GPUs, including Kepler (K80s) and Pascal (GeForce 1080).

One thing I noticed was that ptxas reports that it is spilling more registers to the stack for the Volta GPUs than it is for Pascal GPUs. Here are the relevant statistics for Pascal:

  ptxas info : Function properties for __e_53_1_mod_MOD_fib
    24 bytes stack frame, 24 bytes spill stores, 24 bytes spill loads

Here are the corresponding statistics for Volta:

  ptxas info : Function properties for __e_53_1_mod_MOD_fib
    40 bytes stack frame, 40 bytes spill stores, 40 bytes spill loads

Given that we can't control the PTX driver JIT, maybe we should either reduce the recursion depth in declare_target-[12].f90 to 20 (actually fib (22) works, but I don't want a newer driver to break it again), or just xfail those tests for nvptx targets. The CUDA driver API provides the cuCtxSetLimit function to adjust the stack limit, but apparently that only adjusts the upper bound.
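
To put the two ptxas frame sizes in perspective, here is a rough back-of-the-envelope sketch in Python. It is not part of the test suite, and it rests on simplifying assumptions: each active call of fib costs exactly the reported stack frame, and nothing else uses the 1024-byte per-thread stack. The helper name max_nested_calls is made up for illustration.

```python
# Rough estimate only; this is not a measurement of the real GPU stack.
STACK_LIMIT_BYTES = 1024  # per-thread limit reported via CU_LIMIT_STACK_SIZE

# Stack frame sizes ptxas reported for __e_53_1_mod_MOD_fib.
FRAME_BYTES = {"Pascal": 24, "Volta": 40}


def max_nested_calls(stack_limit, frame_bytes):
    """Hypothetical helper: how many frames fit in the stack limit.

    Assumes every nested call costs exactly one frame and that no
    other code uses the per-thread stack.
    """
    return stack_limit // frame_bytes


for gpu, frame in FRAME_BYTES.items():
    depth = max_nested_calls(STACK_LIMIT_BYTES, frame)
    print(f"{gpu}: about {depth} nested calls")
```

Under these assumptions roughly 42 frames fit on Pascal but only about 25 on Volta. That is at least consistent with fib (22) still working on Volta, although the real limit also depends on whatever else the kernel places on the stack.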