Hi Bob,

Interesting! Although I don't fully understand this. Just so I'm clear: the 
GitHub commit refers to "gres tracking for multiple steps," but I am not 
scheduling any GPUs or other special resources. My understanding was that GRES 
was designed to handle those special kinds of resources; have I got that wrong?

I will patch gres.c as described in the GitHub commit and see if that solves 
my problem.

Thank you,
Amit
  
________________________________________
From: Bob Moench [[email protected]]
Sent: Friday, September 11, 2015 10:28 AM
To: slurm-dev
Subject: [slurm-dev] Re: Multiple srun commands within a job script

Amit,

This sounds a fair amount like something I reported. I
believe that the problem is described at this link:

   
https://github.com/SchedMD/slurm/commit/af1163a20e1f82db6e177b13584de398c48fa9fe

Bob

On Fri, 11 Sep 2015, Kumar, Amit wrote:

> Dear All,
>
> We are noticing some strange behavior. We have jobs that, within a single 
> run, launch multiple parallel steps after making sure all dependencies are met.
>
> In short
>
> #!/bin/bash
> #SBATCH ...
> ...
> srun namd2 xyz
> # check that all went well; if true, continue, else fail
> srun namd2 abc
> # check that all went well; if true, continue, else fail
> ...continue this for 5 different configs...
> # end
> Alternatively, we could do this with job dependencies, but the volume of jobs 
> makes that impractical, and we cannot manually check whether the dependencies 
> are satisfied.
>
> My issue is that we randomly see srun fail (or its tasks get killed) in one 
> of the intermediate steps above. Since we are running all the steps on the 
> same set of nodes, I wonder why a subsequent launch would fail. I have 
> confirmed it is not application related: I am repeatedly rerunning an example 
> that has already run successfully, and we still see this behavior. Could I be 
> hitting a timeout between one launch and the next?
>
> Any thoughts will be greatly appreciated.
> Regards,
> Amit
>
>

--
Bob Moench (rwm); PE Debugger Development; 605-9034; 354-7895; SP 24227
