This will be fixed in version 14.11.5. It affects the case where an exec fails while srun is being run under a debugger. The commit with the fix is here:
https://github.com/SchedMD/slurm/commit/49770e20b6c18e4aedd3fe2567505bbcc8247451


Quoting Dirk Schubert <dschub...@allinea.com>:

Hi Slurm developers!

When running srun under GDB to use the MPIR Process Acquisition Interface, the following strange behaviour occurs when the executable does not exist:

*srun without debugger*

$ srun -n 2 ./does-not-exist # exits "immediately"
slurmstepd: execve(): XYZ/does-not-exist: No such file or directory
slurmstepd: execve(): XYZ/does-not-exist: No such file or directory
srun: error: jelly: tasks 0-1: Exited with exit code 2
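
[Note: the exit code 2 reported above equals the Linux value of ENOENT ("No such file or directory"), which is consistent with a child process exiting with errno after a failed exec. That slurmstepd does exactly this is an assumption here, not something the output confirms. The following is a minimal sketch of that pattern in Python; the function name run_missing_executable is hypothetical and is not part of Slurm.]

```python
import errno
import os


def run_missing_executable(path):
    """Fork, try to exec `path` in the child, and return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: attempt the exec. If it fails, exit with the errno value,
        # which is the behaviour assumed for the launcher in this sketch.
        try:
            os.execv(path, [path])
        except OSError as exc:
            os._exit(exc.errno)
        os._exit(0)  # only reached if execv returns without raising
    # Parent: wait for the child and decode its exit status.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    code = run_missing_executable("./does-not-exist")
    print("child exit code:", code, "ENOENT:", errno.ENOENT)
```

[On a system where ./does-not-exist is absent, the child's exit code should equal errno.ENOENT, mirroring the "Exited with exit code 2" line in the srun output.]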

*srun running under GDB*

_Expected_: srun exits "immediately" or hits MPIR_Breakpoint soon
_Actual_: srun hits MPIR_Breakpoint after around 3 minutes.

$ gdb srun
(gdb) break main
(gdb) run -n 2 ./does-not-exist

# This should run to: Breakpoint 1, 0x0000000000423ba0 in main ()

(gdb) set MPIR_being_debugged=1
(gdb) break MPIR_Breakpoint
(gdb) cont
Continuing.
[New Thread 0x7ffff7fd6700 (LWP 5605)]
[New Thread 0x7ffff6190700 (LWP 5610)]
[New Thread 0x7ffff608f700 (LWP 5611)]
[New Thread 0x7ffff5f8e700 (LWP 5612)]
[New Thread 0x7ffff5e8d700 (LWP 5613)]
[Thread 0x7ffff5e8d700 (LWP 5613) exited]
slurmstepd: execve(): XYZ/does-not-exist: No such file or directory
slurmstepd: execve(): XYZ/does-not-exist: No such file or directory
slurmstepd: pdebug_trace_process WIFSTOPPED false for pid 5623
slurmstepd: Process 5623 exited "normally" with return code 2
slurmstepd: pdebug_trace_process WIFSTOPPED false for pid 5624
slurmstepd: Process 5624 exited "normally" with return code 2

# after around 3 minutes

srun: error: task 0 launch failed: Slurmd could not execve job
srun: error: task 1 launch failed: Slurmd could not execve job

Breakpoint 2, MPIR_Breakpoint (job=0x76c590) at debugger.c:81
81    debugger.c: No such file or directory.
(gdb) cont

srun: error: _server_read: fd 19 got error or unexpected eof reading header
srun: error: step_launch_notify_io_failure: aborting, io error with slurmstepd on node 0

Regards,
Dirk

--
Dirk Schubert - Lead Software Developer || Allinea Software


--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support
