[slurm-dev] Re: srun and node unknown state

2014-06-26 Thread Loris Bennett

Hi Paul,

Paul Edmon ped...@cfa.harvard.edu writes:

 Here is a relevant error:

 Apr 20 17:45:19 itc041 slurmd[50099]: launch task 9285091.51 request from 56441.33234@10.242.58.34 (port 704)
 Apr 20 17:45:19 itc041 slurmstepd[9850]: switch NONE plugin loaded
 Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherProfile NONE plugin loaded
 Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherEnergy NONE plugin loaded
 Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherInfiniband NONE plugin loaded
 Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherFilesystem NONE plugin loaded
 Apr 20 17:45:19 itc041 slurmstepd[9850]: Job accounting gather LINUX plugin loaded
 Apr 20 17:45:19 itc041 slurmstepd[9850]: task NONE plugin loaded
 Apr 20 17:45:19 itc041 slurmstepd[9850]: Checkpoint plugin loaded: checkpoint/none
 Apr 20 17:45:19 itc041 slurmd[itc041][9850]: debug level = 2
 Apr 20 17:45:19 itc041 slurmd[itc041][9850]: task 0 (9855) started 2014-04-20T17:45:19
 Apr 20 17:45:19 itc041 slurmd[itc041][9850]: auth plugin for Munge (http://code.google.com/p/munge/) loaded
 Apr 20 17:51:54 itc041 slurmd[itc041][9850]: task 0 (9855) exited with exit code 0.
 Apr 20 17:53:35 itc041 slurmd[itc041][9850]: error: slurm_receive_msg: Socket timed out on send/recv operation
 Apr 20 17:53:35 itc041 slurmd[itc041][9850]: error: Rank 0 failed sending step completion message directly to slurmctld (0.0.0.0:0), retrying
 Apr 20 17:53:59 itc041 slurmd[itc041][9850]: error: Abandoning IO 60 secs after job shutdown initiated
 Apr 20 17:54:35 itc041 slurmd[itc041][9850]: Rank 0 sent step completion message directly to slurmctld (0.0.0.0:0)
 Apr 20 17:54:35 itc041 slurmd[itc041][9850]: done with job

 Is there any way to prevent this?  When this fails, it creates a zombie task that holds the job open.  I think part of the reason is that the user is looping over mpirun invocations like this:

 for i in $(seq 1 1000); do
     mpirun -np 64 ./executable
 done

 Each run lasts about 5 minutes.  If one of the mpirun invocations fails to launch, the entire loop hangs.  It would be better if srun kept trying instead of just failing.

 -Paul Edmon-

 On 4/16/2014 11:16 PM, Paul Edmon wrote:
 Occasionally when we reset the master, some of our nodes go into an unknown state or take a while to get back in contact with the master.  If srun is launched on a node at that time, it tends to hang, which causes the mpirun that depends on the srun to fail.  Stranger still, the sbatch that originally launched the srun keeps running rather than failing outright.

 Is there a way to prevent srun from failing and instead have it wait until the master comes back?  Or is the timeout the only way to control this?  If that isn't possible, can we have the parent sbatch die with an error rather than have srun just hang?

 Thanks for any insight.

 -Paul Edmon-


Did you ever get to the bottom of this?  We are seeing something similar
with Slurm 2.4.5 and a user running a script that generates batch
scripts and submits them in a loop.
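
For context, the pattern looks roughly like this (a hypothetical
reconstruction; the file names and #SBATCH directives are invented,
not taken from the user's actual script):

#!/bin/bash
# Generate a batch script per iteration, then submit it with sbatch.
for i in $(seq 1 100); do
    cat > job_${i}.sh <<EOF
#!/bin/bash
#SBATCH --job-name=run_${i}
#SBATCH --ntasks=64
srun ./executable
EOF
    sbatch job_${i}.sh
done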

Cheers,

Loris

-- 
This signature is currently under construction.


[slurm-dev] Re: srun and node unknown state

2014-04-21 Thread Paul Edmon


Here is a relevant error:

Apr 20 17:45:19 itc041 slurmd[50099]: launch task 9285091.51 request from 56441.33234@10.242.58.34 (port 704)
Apr 20 17:45:19 itc041 slurmstepd[9850]: switch NONE plugin loaded
Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherProfile NONE plugin loaded
Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherEnergy NONE plugin loaded
Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherInfiniband NONE plugin loaded
Apr 20 17:45:19 itc041 slurmstepd[9850]: AcctGatherFilesystem NONE plugin loaded
Apr 20 17:45:19 itc041 slurmstepd[9850]: Job accounting gather LINUX plugin loaded
Apr 20 17:45:19 itc041 slurmstepd[9850]: task NONE plugin loaded
Apr 20 17:45:19 itc041 slurmstepd[9850]: Checkpoint plugin loaded: checkpoint/none
Apr 20 17:45:19 itc041 slurmd[itc041][9850]: debug level = 2
Apr 20 17:45:19 itc041 slurmd[itc041][9850]: task 0 (9855) started 2014-04-20T17:45:19
Apr 20 17:45:19 itc041 slurmd[itc041][9850]: auth plugin for Munge (http://code.google.com/p/munge/) loaded
Apr 20 17:51:54 itc041 slurmd[itc041][9850]: task 0 (9855) exited with exit code 0.
Apr 20 17:53:35 itc041 slurmd[itc041][9850]: error: slurm_receive_msg: Socket timed out on send/recv operation
Apr 20 17:53:35 itc041 slurmd[itc041][9850]: error: Rank 0 failed sending step completion message directly to slurmctld (0.0.0.0:0), retrying
Apr 20 17:53:59 itc041 slurmd[itc041][9850]: error: Abandoning IO 60 secs after job shutdown initiated
Apr 20 17:54:35 itc041 slurmd[itc041][9850]: Rank 0 sent step completion message directly to slurmctld (0.0.0.0:0)
Apr 20 17:54:35 itc041 slurmd[itc041][9850]: done with job

Is there any way to prevent this?  When this fails, it creates a zombie
task that holds the job open.  I think part of the reason is that the
user is looping over mpirun invocations like this:


for i in $(seq 1 1000); do
    mpirun -np 64 ./executable
done

Each run lasts about 5 minutes.  If one of the mpirun invocations fails
to launch, the entire loop hangs.  It would be better if srun kept
trying instead of just failing.
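
One script-level workaround is a sketch along these lines (it assumes
GNU coreutils timeout is available on the nodes; the 600-second bound
and the retry count are invented for illustration, not a tested fix):

#!/bin/bash
for i in $(seq 1 1000); do
    for attempt in 1 2 3; do
        # Bound each run: timeout kills mpirun after 600 seconds,
        # so one hung launch cannot stall the remaining iterations.
        timeout 600 mpirun -np 64 ./executable && break
        echo "run $i, attempt $attempt failed; retrying" >&2
        sleep 30
    done
done

This does not address the underlying step-completion timeout, but it
keeps a single bad launch from wedging the whole batch job.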


-Paul Edmon-

On 4/16/2014 11:16 PM, Paul Edmon wrote:
Occasionally when we reset the master, some of our nodes go into an
unknown state or take a while to get back in contact with the master.
If srun is launched on a node at that time, it tends to hang, which
causes the mpirun that depends on the srun to fail.  Stranger still,
the sbatch that originally launched the srun keeps running rather than
failing outright.


Is there a way to prevent srun from failing and instead have it wait
until the master comes back?  Or is the timeout the only way to control
this?  If that isn't possible, can we have the parent sbatch die with
an error rather than have srun just hang?


Thanks for any insight.

-Paul Edmon-