Hi all,

(slurm 2.2.7, large cluster.)

I need a little help.

First, does anyone have experience with this type of message?

step_launch_notify_io_failure() seems to be reachable from many
types of I/O operations, so it's difficult to narrow down
what's going on.

A clue is that _step_missing_handler() is being called, and
srun_step_missing() is called to tell srun via _srun_agent_launch().

If I understand it correctly, this could be related to the default
timeout DEFAULT_BATCH_START_TIMEOUT = 10, or maybe
DEFAULT_MSG_TIMEOUT = 10. Beyond that, I am lost :)

Am I on the right track, or could a better troubleshooting patch be
suggested?

Thanks in advance.

-- 

-----------------------------------------------------------
      Michel Bourget - SGI - Linux Software Engineering
     "Past BIOS POST, everything else is extra" (travis)
-----------------------------------------------------------
