That patch only relates to running multiple simultaneous steps which use GRES.

Quoting "Kumar, Amit" <[email protected]>:
Hi Bob,

Interesting! I don't fully understand this, though. Just so I am clear: the GitHub link points to "gres tracking for multiple steps", but I am not scheduling any GPUs or other special resources. My understanding was that GRES is designed to handle those special kinds of resources, or am I getting that wrong?

I will patch gres.c as pointed out on GitHub and see if that solves my problems ...

Thank you,
Amit

________________________________________
From: Bob Moench [[email protected]]
Sent: Friday, September 11, 2015 10:28 AM
To: slurm-dev
Subject: [slurm-dev] Re: Multiple srun commands within a job script

Amit,

This sounds a fair amount like something I reported. I
believe that the problem is described at this link:

https://github.com/SchedMD/slurm/commit/af1163a20e1f82db6e177b13584de398c48fa9fe

Bob

On Fri, 11 Sep 2015, Kumar, Amit wrote:

Dear All,

I am noticing some strange behavior. We have some jobs that, within a single run, launch multiple parallel jobs after making sure all dependencies are met.

In short

#!/bin/bash
#SBATCH ...
...
srun namd2 xyz
# check that all went well; if true, continue, else fail
srun namd2 abc
# check that all went well; if true, continue, else fail
# ... continue this for 5 different configs ...
# end
Alternatively, we could do this by adding job dependencies, but the volume of jobs is a deterrent, and we cannot manually check whether the dependencies are satisfied.
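
For illustration, the "check, then continue or fail" pattern in the script above could be written roughly as follows. This is only a sketch: the names LAUNCHER, run_step and run_all are hypothetical and do not come from the original script, and it assumes that a non-zero exit status from srun is what "fail" means for each step.

```shell
#!/bin/bash
# Sketch only. LAUNCHER, run_step and run_all are hypothetical names
# introduced for illustration; they are not part of the original job script.
LAUNCHER=${LAUNCHER:-srun}

# Run one job step and report a non-zero exit status on stderr.
run_step() {
    local config=$1
    $LAUNCHER namd2 "$config"
    local status=$?
    if [ "$status" -ne 0 ]; then
        echo "step '$config' failed with exit status $status" >&2
    fi
    return "$status"
}

# Run the configs strictly one after another; stop at the first failure.
run_all() {
    local config
    for config in "$@"; do
        run_step "$config" || return $?
    done
}

# In the batch script this would be used as, for example:
#   run_all xyz abc || exit 1
```

Recording the exit status of each step this way at least shows which step failed and with what status, which may help tell an application failure apart from a launch failure.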

My issue is that we are randomly seeing the tasks launched by srun fail or get killed in one of the intermediate steps above. Since we are running the tasks on the same set of nodes, I wonder why they would fail on the next launch. I have confirmed it is not application related: I am repeatedly running an example that has already run successfully, and we still see this behavior. Could I be running into a timeout between launches?

Any thoughts will be greatly appreciated.
Regards,
Amit



--
Bob Moench (rwm); PE Debugger Development; 605-9034; 354-7895; SP 24227


--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support
===============================================================
Slurm User Group Meeting, 15-16 September 2015, Washington D.C.
http://slurm.schedmd.com/slurm_ug_agenda.html
