That patch only relates to running multiple simultaneous steps which use GRES.
Quoting "Kumar, Amit" <[email protected]>:
Hi Bob,
Interesting! I don't fully understand this, though. Just so I
understand: GitHub points to "gres tracking for multiple steps", but
I am not scheduling any GPUs or special resources. I understood
that GRES was designed to handle those special kinds of resources,
or am I getting that wrong?
I will patch gres.c as pointed out on GitHub and see if that solves
my problems ...
Thank you,
Amit
________________________________________
From: Bob Moench [[email protected]]
Sent: Friday, September 11, 2015 10:28 AM
To: slurm-dev
Subject: [slurm-dev] Re: Multiple srun commands within a job script
Amit,
This sounds a fair amount like something I reported. I
believe that the problem is described at this link:
https://github.com/SchedMD/slurm/commit/af1163a20e1f82db6e177b13584de398c48fa9fe
Bob
On Fri, 11 Sep 2015, Kumar, Amit wrote:
Dear All,
I am noticing some strange behavior. We have some jobs that, within
a single run, launch multiple parallel jobs after making sure all
dependencies are met.
In short:
#!/bin/bash
#SBATCH ...
...
srun namd2 xyz
# check that all went well; if true continue, else fail
srun namd2 abc
# check that all went well; if true continue, else fail
# ... continue this for 5 different configs ...
# end
Alternatively, we could do this by adding job dependencies, but the
volume of jobs is a deterrent and we cannot manually check whether
the dependencies are satisfied.
My issue is that we randomly see the tasks launched by srun fail or
get killed in one of the intermediate steps above. Since we are
running the tasks on the same set of nodes, I wonder why they would
fail on the next launch. I have confirmed it is not application
related: I am repeatedly running an example that has already run
successfully, and we still see this behavior. Could I be running
into a timeout between one launch and the next?
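To narrow down which step fails and with what status, the step sequence
outlined above could be wrapped so that every launch records its exit
code. This is only a minimal sketch: LAUNCHER, run_step and run_all are
hypothetical names that do not appear in the original script, and it
does not address the cause of the failures, it only reports them.

```shell
#!/bin/bash
# Sketch only: LAUNCHER, run_step and run_all are hypothetical helpers.
# LAUNCHER defaults to srun; set LAUNCHER=env to dry-run the logic
# outside a Slurm allocation.
LAUNCHER="${LAUNCHER:-srun}"

# Run one job step and report its exit status on stderr.
# Usage: run_step NAME COMMAND [ARGS...]
run_step() {
    local name="$1"
    shift
    local rc=0
    $LAUNCHER "$@" || rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "step ${name} failed with exit code ${rc}" >&2
        return "$rc"
    fi
    echo "step ${name} ok" >&2
    return 0
}

# Run PROGRAM once per config, in order, stopping at the first failure.
# Usage: run_all PROGRAM CONFIG [CONFIG...]
run_all() {
    local prog="$1"
    shift
    local cfg
    for cfg in "$@"; do
        run_step "$cfg" "$prog" "$cfg" || return $?
    done
}
```

In the batch script this would be called as, for example,
run_all namd2 xyz abc || exit 1, which runs "srun namd2 xyz" and then
"srun namd2 abc". After the job ends, the per-step states can be
compared with the accounting records, e.g.
sacct -j <jobid> --format=JobID,State,ExitCode, assuming accounting is
enabled on the cluster.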
Any thoughts will be greatly appreciated.
Regards,
Amit
--
Bob Moench (rwm); PE Debugger Development; 605-9034; 354-7895; SP 24227
--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support
===============================================================
Slurm User Group Meeting, 15-16 September 2015, Washington D.C.
http://slurm.schedmd.com/slurm_ug_agenda.html