Hi Paul,

Paul Edmon <[email protected]> writes:

> For a relevant error:
>
> Apr 20 17:45:19 itc041 slurmd[50099]: launch task 9285091.51 request from [email protected] (port 704)
> Apr 20 17:45:19 itc041 slurmstepd[9850]: switch NONE plugin loaded
> Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherProfile NONE plugin loaded
> Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherEnergy NONE plugin loaded
> Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherInfiniband NONE plugin loaded
> Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherFilesystem NONE plugin loaded
> Apr 20 17:45:19 itc041 slurmstepd[9850]: Job accounting gather LINUX plugin loaded
> Apr 20 17:45:19 itc041 slurmstepd[9850]: task NONE plugin loaded
> Apr 20 17:45:19 itc041 slurmstepd[9850]: Checkpoint plugin loaded: checkpoint/none
> Apr 20 17:45:19 itc041 slurmd[itc041][9850]: debug level = 2
> Apr 20 17:45:19 itc041 slurmd[itc041][9850]: task 0 (9855) started 2014-04-20T17:45:19
> Apr 20 17:45:19 itc041 slurmd[itc041][9850]: auth plugin for Munge (http://code.google.com/p/munge/) loaded
> Apr 20 17:51:54 itc041 slurmd[itc041][9850]: task 0 (9855) exited with exit code 0.
> Apr 20 17:53:35 itc041 slurmd[itc041][9850]: error: slurm_receive_msg: Socket timed out on send/recv operation
> Apr 20 17:53:35 itc041 slurmd[itc041][9850]: error: Rank 0 failed sending step completion message directly to slurmctld (0.0.0.0:0), retrying
> Apr 20 17:53:59 itc041 slurmd[itc041][9850]: error: Abandoning IO 60 secs after job shutdown initiated
> Apr 20 17:54:35 itc041 slurmd[itc041][9850]: Rank 0 sent step completion message directly to slurmctld (0.0.0.0:0)
> Apr 20 17:54:35 itc041 slurmd[itc041][9850]: done with job
>
> Is there any way to prevent this? When this fails it creates a zombie task
> that holds the job still open. I think part of the reason why is that the
> user is looping over mpirun's like this:
>
> do i=1,1000
>   mpirun -np 64 ./executable
> enddo
>
> Each run lasts about 5 minutes.
> If one of the mpirun's fails to launch, the entire thing hangs. It would be
> better if srun kept trying instead of just failing.
>
> -Paul Edmon-
>
> On 4/16/2014 11:16 PM, Paul Edmon wrote:
>> Occasionally when we reset the master some of our nodes go into an unknown
>> state or take a bit to get back in contact with the master. If srun is
>> being launched on the nodes at that time it tends to make it hang, which
>> causes the mpirun dependent on the srun being launched to fail. Even
>> stranger, the sbatch that originally launched the srun keeps running and
>> does not fail outright.
>>
>> Is there a way to prevent srun from failing but rather just have it wait
>> until the master comes back? Or is the timeout the only way to set this?
>> Or if this isn't possible can we have the parent sbatch die with an error
>> rather than have srun just hang up?
>>
>> Thanks for any insight.
>>
>> -Paul Edmon-

Did you ever get to the bottom of this? We are seeing something similar
with Slurm 2.4.5 and a user running a script which generates batch
scripts and submits them within a loop.

Cheers,

Loris

--
This signature is currently under construction.
