Hi Bob, Interesting! Although I don't fully understand this. Just so I understand: the GitHub commit refers to "gres tracking for multiple steps," but I am not scheduling any GPUs or special resources. I understood that GRES was designed to handle those special kinds of resources, or am I getting that wrong?
I will patch gres.c as pointed out on GitHub and see if that solves my problem...

Thank you,
Amit

________________________________________
From: Bob Moench [[email protected]]
Sent: Friday, September 11, 2015 10:28 AM
To: slurm-dev
Subject: [slurm-dev] Re: Multiple srun commands within a job script

Amit,

This sounds a fair amount like something I reported. I believe the problem is described at this link:

https://github.com/SchedMD/slurm/commit/af1163a20e1f82db6e177b13584de398c48fa9fe

Bob

On Fri, 11 Sep 2015, Kumar, Amit wrote:
> Dear All,
>
> We are noticing somewhat strange behavior. We have some jobs that launch
> multiple parallel steps within a single run, after making sure all
> dependencies are met.
>
> In short:
>
> #!/bin/bash
> #SBATCH ...
> ...
> srun namd2 xyz
> check that all went well; if true, continue, else fail
> srun namd2 abc
> check that all went well; if true, continue, else fail
> ...continue this for 5 different configs...
> //end
>
> Alternatively we could do this by adding job dependencies, but the volume
> of jobs is deterring, and we cannot manually check whether dependencies
> are satisfied.
>
> My issue is that we randomly see the launch of tasks by srun fail or get
> killed in one of the intermediate steps above. Since we are running the
> tasks on the same set of nodes, I wonder why they would fail on the next
> launch. I have confirmed it is not application-related: I am repeatedly
> using an already-run example and still see this behavior. Could I be
> running into a timeout between launches?
>
> Any thoughts will be greatly appreciated.
>
> Regards,
> Amit

--
Bob Moench (rwm); PE Debugger Development; 605-9034; 354-7895; SP 24227
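For what it's worth, the "check that all went well; if true, continue, else fail" pattern from the quoted script can be written as a small bash helper. This is a hypothetical sketch, not the original script: the `run_step` function name is invented here, and the commented-out `srun namd2` lines stand in for the actual job steps.

```shell
#!/bin/bash
#SBATCH ...

# Hypothetical helper: run one job step and abort the whole batch
# script if it exits non-zero, reporting which step failed.
run_step() {
    "$@"
    local rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "step failed (exit $rc): $*" >&2
        exit "$rc"
    fi
}

# Each srun step only runs if the previous one succeeded:
# run_step srun namd2 xyz
# run_step srun namd2 abc
# ...repeat for the remaining configs...
```

Because `run_step` exits the script on the first failure, later `srun` launches are never attempted after a failed step, which also makes it easier to see exactly which intermediate step was killed.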
