Amit,

This sounds a fair amount like something I reported. I
believe that the problem is described at this link:

  
https://github.com/SchedMD/slurm/commit/af1163a20e1f82db6e177b13584de398c48fa9fe

Bob

On Fri, 11 Sep 2015, Kumar, Amit wrote:

Dear All,

We are noticing some strange behavior. We have jobs that, within a single run, 
launch multiple parallel steps after making sure all dependencies are met.

In short:

#!/bin/bash
#SBATCH ...
...
# check that each step went well; if so continue, else fail
srun namd2 xyz || exit 1
srun namd2 abc || exit 1
# ...continue this for 5 different configs...
# end
Alternatively, we could do this by submitting separate jobs with dependencies, 
but the volume of jobs is deterring, and we cannot manually check whether the 
dependencies are satisfied.
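For reference, the dependency-based alternative mentioned above could be sketched roughly as below. This is only a sketch under assumptions: the `submit` function is a hypothetical stand-in for `sbatch --parsable` (which prints the new job ID) so the script runs without Slurm, and the config names and script filenames are made up for illustration.

```shell
#!/bin/bash
# Sketch: submit each config as its own job with
# --dependency=afterok:<previous job id>, so a job starts only if the
# previous one exited 0. `submit` is a hypothetical stand-in for sbatch;
# with Slurm you would instead use something like:
#   jobid=$(sbatch --parsable --dependency=afterok:$prev job.sbatch)
next_id=100
submit() {
  next_id=$((next_id + 1))   # fake job id; sbatch --parsable prints the real one
  jobid=$next_id
}

prev=""
for config in xyz abc; do    # hypothetical config names from the example above
  if [ -z "$prev" ]; then
    submit "job_${config}.sbatch"
  else
    # afterok: run only after job $prev completes with exit code 0
    submit --dependency="afterok:${prev}" "job_${config}.sbatch"
  fi
  echo "submitted ${config} as job ${jobid} (after job ${prev:-none})"
  prev=$jobid
done
```

With real sbatch, Slurm enforces the afterok check itself, so no manual verification between steps is needed; the trade-off is one job submission per config instead of one batch script.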

My issue here is that we randomly see the tasks launched by srun fail or get 
killed in one of the intermediate steps above. Since we are running the tasks 
on the same set of nodes, I wonder why they would fail on the next launch. I 
have confirmed it is not application related: I am repeatedly using an example 
that has already run successfully, and we still see this behavior. Could I be 
running into a timeout between one launch and the next?

Any thoughts will be greatly appreciated.
Regards,
Amit



--
Bob Moench (rwm); PE Debugger Development; 605-9034; 354-7895; SP 24227
