Hi all,
(Slurm 2.2.7, large cluster.)
I need a little help.
First, does anyone have experience with this type of message?
step_launch_notify_io_failure() seems to be reachable from many
types of I/O operations, so it is difficult to narrow down
what is going on.
One clue is that _step_missing_handler() is being called, and
srun_step_missing() is then called to tell srun via
_srun_agent_launch().
If I understand it correctly, this could be related to the
DEFAULT_BATCH_START_TIMEOUT = 10 default timeout, or maybe to
DEFAULT_MSG_TIMEOUT = 10. Beyond that, I am lost :)
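One experiment I could try, assuming those defaults correspond to the
BatchStartTimeout and MessageTimeout options in slurm.conf (the values
below are arbitrary guesses, not recommendations), is to raise both and
see whether the message goes away:

    # slurm.conf fragment (sketch only, values are guesses)
    BatchStartTimeout=30
    MessageTimeout=30
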
Am I on the right track, or could someone suggest a better
troubleshooting patch?
Thanks in advance.
--
-----------------------------------------------------------
Michel Bourget - SGI - Linux Software Engineering
"Past BIOS POST, everything else is extra" (travis)
-----------------------------------------------------------