Hi,

We're trying to run some large (>500-node, 6000-core) jobs. However, we're seeing pretty high failure rates (I'd say 7 out of 8 jobs fail). I've tried twiddling a lot of knobs with no success.
The errors occur during process start-up; if we get past that step, the programs run fine.

With Intel MPI (this is using mpiexec.hydra with slurm as the bootstrap), I see the following error:

[proxy:0:166@node0924] got pmi command (from 7): barrier_in
[proxy:0:166@node0924] forwarding command (cmd=barrier_in) upstream
[mpiexec@node0224] control_cb (../../pm/pmiserv/pmiserv_cb.c:773): connection to proxy 26 at host node0250 failed
[mpiexec@node0224] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@node0224] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@node0224] main (../../ui/mpich/mpiexec.c:1059): process manager error waiting for completion

If I switch to OpenMPI, the error is:

[node0204:24847] [[43101,0],0]->[[43101,0],128] mca_oob_tcp_msg_send_bytes: write failed: Broken pipe (32) [sd = 382]

If I try a PMI hello world job, I get:

srun: error: Task launch for 975055.0 failed on node node0336: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

Looking at the slurmd logs for this job, we have:

Sep 21 12:50:06 node0270 slurmd[2650]: error: _step_connect: connect() failed dir /tmp/node0270 node node0270 job 975055 step 0 No such file or directory
Sep 21 12:50:06 node0270 slurmd[2650]: error: stepd_connect to 975055.0 failed: No such file or directory
[SNIP LAST 2 LINES REPEATED]
Sep 21 12:50:33 node0270 slurmd[2650]: error: slurm_receive_msgs: Socket timed out on send/recv operation
Sep 21 12:50:43 node0336 slurmd[2649]: error: _rpc_launch_tasks: unable to send return code to address:port=10.134.2.70:39613 msg_type=6001: Transport endpoint is not connected

Everything works perfectly fine for jobs under 300 nodes. Googling these errors turns up a lot of talk about increasing MessageTimeout, but we've already got ours set to 60.

Does anybody have any thoughts?

Thanks!
Timothy
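P.S. A few concrete details, in case they help. The Intel MPI jobs are launched from a batch script, roughly like this (the node/task counts and binary name here are just illustrative of the failing jobs):

#!/bin/bash
#SBATCH -N 500
#SBATCH -n 6000
# hydra bootstraps its proxies via srun when -bootstrap slurm is given
mpiexec.hydra -bootstrap slurm -n 6000 ./our_app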
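The PMI hello world is nothing special, just a standard MPI hello world along these lines, launched with plain srun:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}

It never gets as far as printing anything; the step dies at launch.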
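And for reference, the timeout-related pieces of our slurm.conf. MessageTimeout=60 is what we actually run; the other lines are, as far as I know, just the defaults, so take them as illustrative of what we have not yet tuned:

# slurm.conf (excerpt)
MessageTimeout=60    # raised from the default of 10 seconds
TCPTimeout=2         # default
SlurmdTimeout=300    # default
TreeWidth=50         # default slurmd fanout; maybe relevant at >500 nodes?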
