Hi Paul,

Paul Edmon <[email protected]> writes:

> For a relevant error:
>
> Apr 20 17:45:19 itc041 slurmd[50099]: launch task 9285091.51 request from [email protected] (port 704)
> Apr 20 17:45:19 itc041 slurmstepd[9850]: switch NONE plugin loaded
> Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherProfile NONE plugin loaded
> Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherEnergy NONE plugin loaded
> Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherInfiniband NONE plugin loaded
> Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherFilesystem NONE plugin loaded
> Apr 20 17:45:19 itc041 slurmstepd[9850]: Job accounting gather LINUX plugin loaded
> Apr 20 17:45:19 itc041 slurmstepd[9850]: task NONE plugin loaded
> Apr 20 17:45:19 itc041 slurmstepd[9850]: Checkpoint plugin loaded: checkpoint/none
> Apr 20 17:45:19 itc041 slurmd[itc041][9850]: debug level = 2
> Apr 20 17:45:19 itc041 slurmd[itc041][9850]: task 0 (9855) started 2014-04-20T17:45:19
> Apr 20 17:45:19 itc041 slurmd[itc041][9850]: auth plugin for Munge (http://code.google.com/p/munge/) loaded
> Apr 20 17:51:54 itc041 slurmd[itc041][9850]: task 0 (9855) exited with exit code 0.
> Apr 20 17:53:35 itc041 slurmd[itc041][9850]: error: slurm_receive_msg: Socket timed out on send/recv operation
> Apr 20 17:53:35 itc041 slurmd[itc041][9850]: error: Rank 0 failed sending step completion message directly to slurmctld (0.0.0.0:0), retrying
> Apr 20 17:53:59 itc041 slurmd[itc041][9850]: error: Abandoning IO 60 secs after job shutdown initiated
> Apr 20 17:54:35 itc041 slurmd[itc041][9850]: Rank 0 sent step completion message directly to slurmctld (0.0.0.0:0)
> Apr 20 17:54:35 itc041 slurmd[itc041][9850]: done with job
>
> Is there any way to prevent this?  When this fails it creates a zombie task that
> holds the job open.  I think part of the reason is that the user is looping
> over mpirun calls like this:
>
> do i=1,1000
>     mpirun -np 64 ./executable
> enddo
>
> Each run lasts about 5 minutes.  If one of the mpirun calls fails to launch, the
> entire thing hangs.  It would be better if srun kept trying instead of just
> failing.
>
> -Paul Edmon-
>
> On 4/16/2014 11:16 PM, Paul Edmon wrote:
>> Occasionally when we reset the master, some of our nodes go into an unknown
>> state or take a while to get back in contact with the master.  If srun is
>> being launched on the nodes at that time, it tends to hang, which causes the
>> mpirun that depends on that srun to fail.  Even stranger, the sbatch that
>> originally launched the srun keeps running instead of failing outright.
>>
>> Is there a way to prevent srun from failing and instead have it wait until
>> the master comes back?  Or is the timeout the only way to set this?  Or, if
>> this isn't possible, can we have the parent sbatch die with an error rather
>> than have srun just hang?
>>
>> Thanks for any insight.
>>
>> -Paul Edmon-
>

Did you ever get to the bottom of this?  We are seeing something similar
with Slurm 2.4.5 and a user running a script which generates batch
scripts and submits them within a loop.
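
In case it helps as a stopgap while the underlying cause is unknown, here
is a rough sketch of how the loop quoted above could be wrapped so that a
hung or failed launch is retried rather than stalling the whole job.  This
is only an illustration, not something we have tested against this
problem: the retry() helper is hypothetical, the 5 attempts, 60 second
pause and 900 second limit are guesses based on the 5 minute run time
mentioned above, and it assumes bash, the coreutils timeout command, and
that rerunning ./executable is harmless.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): run a command, and if it fails,
# retry it up to max_tries times with a pause of delay seconds in between.
retry() {
    local max_tries=$1 delay=$2
    shift 2
    local attempt=1
    while true; do
        # Success: stop retrying.
        "$@" && return 0
        if [ "$attempt" -ge "$max_tries" ]; then
            echo "retry: giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "retry: attempt $attempt failed, sleeping ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}

# Possible use inside the batch script (left commented out so the sketch
# can be sourced anywhere).  timeout turns a hang into a non-zero exit,
# which retry() then treats like any other failure; the numbers are assumed.
#
# for i in $(seq 1 1000); do
#     retry 5 60 timeout 900 mpirun -np 64 ./executable || exit 1
# done
```

The "|| exit 1" makes the batch script itself die with an error once the
retries are exhausted, instead of carrying on or hanging.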

Cheers,

Loris

-- 
This signature is currently under construction.
