Hi,

We're trying to run some large (>500-node, 6000-core) jobs. However, we're seeing pretty high failure rates (I'd say 7 out of 8 jobs fail). I've tried twiddling a lot of knobs with no success.
The errors occur during process start-up; if we get past that step, the programs run fine.

With Intel MPI (this is using mpiexec.hydra with slurm as the bootstrap), I see the following error:

[proxy:0:166@node0924] got pmi command (from 7): barrier_in
[proxy:0:166@node0924] forwarding command (cmd=barrier_in) upstream
[mpiexec@node0224] control_cb (../../pm/pmiserv/pmiserv_cb.c:773): connection to proxy 26 at host node0250 failed
[mpiexec@node0224] HYDT_dmxu_poll_wait_for_event (../../tools/demux/demux_poll.c:76): callback returned error status
[mpiexec@node0224] HYD_pmci_wait_for_completion (../../pm/pmiserv/pmiserv_pmci.c:501): error waiting for event
[mpiexec@node0224] main (../../ui/mpich/mpiexec.c:1059): process manager error waiting for completion

If I switch to OpenMPI, the error is:

[node0204:24847] [[43101,0],0]->[[43101,0],128] mca_oob_tcp_msg_send_bytes: write failed: Broken pipe (32) [sd = 382]

If I try a PMI hello world job, I get:

srun: error: Task launch for 975055.0 failed on node node0336: Socket timed out on send/recv operation
srun: error: Application launch failed: Socket timed out on send/recv operation
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

Looking at the slurmd logs for this job, we have:

Sep 21 12:50:06 node0270 slurmd[2650]: error: _step_connect: connect() failed dir /tmp/node0270 node node0270 job 975055 step 0 No such file or directory
Sep 21 12:50:06 node0270 slurmd[2650]: error: stepd_connect to 975055.0 failed: No such file or directory
[SNIP LAST 2 LINES REPEATED]
Sep 21 12:50:33 node0270 slurmd[2650]: error: slurm_receive_msgs: Socket timed out on send/recv operation
Sep 21 12:50:43 node0336 slurmd[2649]: error: _rpc_launch_tasks: unable to send return code to address:port=10.134.2.70:39613 msg_type=6001: Transport endpoint is not connected

Everything works perfectly fine for jobs under 300 nodes. Googling these errors turns up a lot of talk about increasing MessageTimeout, but we've already got ours set to 60.

Does anybody have any thoughts?

Thanks!
Timothy
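P.S. A few concrete details, in case they help. The Intel MPI jobs are launched from a batch script, roughly like this (the node/task counts and binary name here are just illustrative of the failing jobs):

#!/bin/bash
#SBATCH -N 500
#SBATCH -n 6000
# hydra bootstraps its proxies via srun when -bootstrap slurm is given
mpiexec.hydra -bootstrap slurm -n 6000 ./our_app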
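The PMI hello world is nothing special, just a standard MPI hello world along these lines, launched with plain srun:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    printf("rank %d of %d on %s\n", rank, size, name);
    MPI_Finalize();
    return 0;
}

It never gets as far as printing anything; the step dies at launch.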
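And for reference, the timeout-related pieces of our slurm.conf. MessageTimeout=60 is what we actually run; the other lines are, as far as I know, just the defaults, so take them as illustrative of what we have not yet tuned:

# slurm.conf (excerpt)
MessageTimeout=60    # raised from the default of 10 seconds
TCPTimeout=2         # default
SlurmdTimeout=300    # default
TreeWidth=50         # default slurmd fanout; maybe relevant at >500 nodes?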
