[slurm-dev] Re: srun and node unknown state

Paul Edmon Mon, 21 Apr 2014 13:16:27 -0700


For a relevant error:

Apr 20 17:45:19 itc041 slurmd[50099]: launch task 9285091.51 request from [email protected] (port 704)

Apr 20 17:45:19 itc041 slurmstepd[9850]: switch NONE plugin loaded

Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherProfile NONE plugin loaded

Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherEnergy NONE plugin loaded

Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherInfiniband NONE plugin loaded

Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherFilesystem NONE plugin loaded

Apr 20 17:45:19 itc041 slurmstepd[9850]: Job accounting gather LINUX plugin loaded

Apr 20 17:45:19 itc041 slurmstepd[9850]: task NONE plugin loaded

Apr 20 17:45:19 itc041 slurmstepd[9850]: Checkpoint plugin loaded: checkpoint/none

Apr 20 17:45:19 itc041 slurmd[itc041][9850]: debug level = 2

Apr 20 17:45:19 itc041 slurmd[itc041][9850]: task 0 (9855) started 2014-04-20T17:45:19

Apr 20 17:45:19 itc041 slurmd[itc041][9850]: auth plugin for Munge (http://code.google.com/p/munge/) loaded

Apr 20 17:51:54 itc041 slurmd[itc041][9850]: task 0 (9855) exited with exit code 0.

Apr 20 17:53:35 itc041 slurmd[itc041][9850]: error: slurm_receive_msg: Socket timed out on send/recv operation

Apr 20 17:53:35 itc041 slurmd[itc041][9850]: error: Rank 0 failed sending step completion message directly to slurmctld (0.0.0.0:0), retrying

Apr 20 17:53:59 itc041 slurmd[itc041][9850]: error: Abandoning IO 60 secs after job shutdown initiated

Apr 20 17:54:35 itc041 slurmd[itc041][9850]: Rank 0 sent step completion message directly to slurmctld (0.0.0.0:0)

Apr 20 17:54:35 itc041 slurmd[itc041][9850]: done with job

Is there any way to prevent this? When this fails it creates a zombie task that holds the job open. I think part of the reason is that the user is looping over mpirun calls like this:


do i=1,1000
    mpirun -np 64 ./executable
enddo

Each run lasts about 5 minutes. If one of the mpirun calls fails to launch, the entire thing hangs. It would be better if srun kept trying instead of just failing.
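
Until srun itself can do this, one possible stopgap is to retry from inside the batch script. The following is only a minimal sketch: `retry` is a hypothetical helper, not anything Slurm provides, and the attempt count and delay in the usage comment are arbitrary assumptions.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): rerun a command until it
# succeeds or the attempt limit is reached.
# Usage: retry MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
retry() {
    max_tries=$1
    delay=$2
    shift 2
    attempt=1
    status=1
    while [ "$attempt" -le "$max_tries" ]; do
        "$@"
        status=$?
        if [ "$status" -eq 0 ]; then
            return 0
        fi
        echo "attempt $attempt/$max_tries failed with status $status: $*" >&2
        attempt=$((attempt + 1))
        # Pause before the next try, but not after the final failure.
        if [ "$attempt" -le "$max_tries" ]; then
            sleep "$delay"
        fi
    done
    # Report the exit status of the last failed attempt.
    return "$status"
}

# Example use in the loop above (5 tries, 60 s apart, both assumed values):
#   retry 5 60 mpirun -np 64 ./executable
```

This only helps when the failed launch actually returns a nonzero status. A launch that hangs never reaches the retry, so it would also need a time limit, for example via coreutils `timeout`.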


-Paul Edmon-

On 4/16/2014 11:16 PM, Paul Edmon wrote:

Occasionally when we reset the master, some of our nodes go into an unknown state or take a while to get back in contact with the master. If srun is being launched on the nodes at that time, it tends to hang, which causes the mpirun that depends on that srun to fail. Even stranger, the sbatch that originally launched the srun keeps running rather than failing outright.
Is there a way to prevent srun from failing and instead have it wait until the master comes back? Or is the timeout the only way to set this? If this isn't possible, can we have the parent sbatch die with an error rather than have srun just hang?
Thanks for any insight.
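
For the last option, a rough sketch of what "die with an error" could look like from the batch script side, assuming GNU coreutils `timeout` is available on the compute nodes. `run_with_limit` is a hypothetical helper and the 15-minute limit is an assumed value. Whether killing a hung srun this way also cleans up the step on the node is not something this sketch guarantees.

```shell
#!/bin/sh
# Hypothetical helper: run one command under a wall-clock limit so that a
# hung launch turns into a nonzero exit status instead of blocking forever.
# Usage: run_with_limit DURATION COMMAND [ARGS...]
run_with_limit() {
    limit=$1
    shift
    timeout "$limit" "$@"
    status=$?
    # GNU timeout exits with 124 when the command ran past the limit.
    if [ "$status" -eq 124 ]; then
        echo "timed out after $limit: $*" >&2
    fi
    return "$status"
}

# Example use inside the sbatch script (limit is an assumed value):
#   run_with_limit 15m srun ./executable || exit 1
```

With `|| exit 1` after each step, the parent batch script stops at the first step that fails or exceeds the limit rather than continuing to run.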

-Paul Edmon-
