To be fixed in SLURM v2.3.6. See attached patch or
https://github.com/SchedMD/slurm/commit/0caecbc53dd476cca866b8a48421162b5a25aa2c
Establishes a new configuration parameter, the SchedulerParameters option max_depend_depth, giving the maximum number of jobs to check for circular dependencies. The default value is 10 jobs.
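
With the patch applied, the limit is set through SchedulerParameters in slurm.conf; for example (the value shown is simply the default, and if other SchedulerParameters are already configured the option should be appended to that comma-separated list):

SchedulerParameters=max_depend_depth=10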

Quoting Matthieu Hautreux <matthieu.hautr...@gmail.com>:


Alejandro, I am okay with what you said, but the main problem is that
Paul has provided what we could call a 0-day for doing a DOS on the
SLURM controller in a few seconds... At a minimum, a mechanism should be
added to protect the controller from entering an exponential loop and
being unresponsive for a very long time because of the circular
dependency check.

My 2 cents
Matthieu

2012/3/26 Alejandro Lucero Palau <alejandro.luc...@bsc.es>:

As I said before, we have similar problems with a machine which is not
used the way "standard" HPC software expects. This is more about job
throughput, a.k.a. HTC, than job performance, and software like Slurm was
not designed for this use, although it can do a really good job within
some limits.

I wonder if we could talk about this issue at the next Slurm Users Meeting.
Slurm is doing well with current big clusters, and some work is being
done on scalability, but I guess HTC needs another sort of algorithm.
Since Slurm is so dynamically configurable thanks to plugins, the first
thing that comes to mind is to make the main scheduler a configurable
option as well. I think this is a better idea than adding corner cases
for HTC purposes.


On 03/25/2012 09:47 PM, Paul Mezzanini wrote:
Matthieu,

I know I lack the programming skill to implement the patch myself.  I could probably hack it together but it would be far from pretty.  I'm also hesitant to patch the controller for one user.  For the time being, I think I will have my user modify her submit scripts to only have one or two generations "on deck".  She will be defending soon so rewriting her workflow to utilize steps just isn't worth it.

This issue should be addressed in future revisions.  It came as quite a surprise when a user was able to bring the controller to its knees.  I can also see it cropping up every year or so.

The kids are getting restless; I'd better get off the laptop.

Thanks for the programming offer, I may still take you up on it.

Paul

-------------------------------
From: Matthieu Hautreux [mailto:matthieu.hautr...@gmail.com]
Sent: Saturday, March 24, 2012 7:40 PM
To: slurm-dev
Subject: [slurm-dev] Re: Complex job dependency issues

2012/3/23 Paul Mezzanini <pfm...@rit.edu>
To add insult to injury, this logic works perfectly in SGE.

Hum, that is surely not a fair comment :)

Your reproducer works as expected and I see the same exact behavior at generation 7.

Alejandro is correct, and profiling the slurmctld while submitting the 7th generation, we can see that the _scan_depend logic is taking far too much time. This is because SLURM walks through all the dependencies recursively, and at generation 7 your workload corresponds to 16^6 = 16777216 items... That starts to become unmanageable. I will let you imagine the result at generation 128... I am glad that SGE is able to manage that; it certainly does not check for circular dependencies the way SLURM does. Maybe it does not do it at all, or only for direct ancestors.
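
As a rough back-of-the-envelope illustration (a throwaway shell snippet, not SLURM code), the number of dependency paths walked for a single generation-G job in this reproducer is 16^(G-1):

for G in $(seq 2 8) ; do
        # 16 dependencies per level, G-1 levels below generation G
        echo "generation ${G}: $(( 16 ** (G - 1) )) dependency paths"
done
# generation 7 -> 16^6 = 16777216, the figure above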

As the problem comes from the fact that SLURM is trying to detect a problem that could become too hard to solve, maybe it should stop before killing itself on the complexity of the thing and assume that, when too many layers of dependencies are used, the user knows what he is doing. Maybe a counter should be added to the logic and the recursion stopped when a maximum value is reached.

In your workload, every job reuses the dependencies of the other jobs of the same generation. The dependency check could be reduced from 16^6 to 16*6 by merging all the dependencies at each level before doing the recursion.

To sum up, IMHO, SLURM should (1) use a counter to stop trying to solve the _real-time_ unsolvable, and (2) merge the dependencies at each dependency level before entering a recursion (to eliminate redundancy).

In the meantime, if you are confident that you will not have circular dependencies, you (or I, if you want!) can do a quick patch to remove the check using an env var and see if there is another problem a few steps later.

Regards,
Matthieu



-------- Original message --------
Subject: [slurm-dev] Re: Complex job dependency issues
From: Moe Jette <je...@schedmd.com>
To: slurm-dev <slurm-dev@schedmd.com>
CC:




I'm not sure if this will satisfy your user's needs, but SLURM can
handle job steps a lot more efficiently than jobs. If each generation
can be combined into a single job with a bunch of job steps, that may
help. There is also some information about high throughput computing
on-line that may be helpful:
http://www.schedmd.com/slurmdocs/high_throughput.html
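
For illustration, a single per-generation job that replaces 16 separate jobs with 16 job steps might look roughly like the sketch below (partition, QOS, time limit, and payload script are placeholders borrowed from the reproducer; on some configurations srun may also need --exclusive so that concurrent steps get distinct CPUs):

#!/bin/bash
#SBATCH -p work -n 16 -t 0:30:00 --qos=rc-normal -J dep-generation
# Hypothetical sketch: one job per generation, 16 job steps instead of 16 jobs.
for WORKER in $(seq 1 16) ; do
        # Each srun launches one job step on a single task, in the background.
        srun -n 1 ./slurm-payload.sh &
done
# Block until every step of this generation has finished.
wait

Each generation is then a single job, so generation G only needs
--dependency=afterok:<previous generation's job ID> instead of a list of 16
job IDs, which keeps the dependency graph linear.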

Quoting Paul Mezzanini <pfm...@rit.edu>:


Are there known issues with large dependency lists?  We have a user who is
doing a fairly large number of generational jobs.  The basic view of her
workflow is that she spawns N workers that need to complete before the
next generation of N workers can start.

Her current set is 16 workers and I have no idea how many generations.
She can submit up to around generation 7 before things really go south.
We start to see the effects around generation 4 (submits slow down
slightly).  The moment generation 7 begins submitting, the speed drops
significantly.  Slurmctld's CPU usage goes to 100% and I begin to get
warning messages about processing time in the slurmctld logs (slurmctld:
Warning: Note very large processing time from _slurm_rpc_submit_batch_job:
usec=2735283).  Turning the verbosity up yielded no obvious issues.

Eventually sbatch fails with timeouts and that kills the rest of the
submits.

As a test we slowed her submit script down with a few sleep calls to see
if we were overwhelming slurmctld.  The same slowdown occurred at
generation 7.

I have created a very simplified version of her submit scripts for
testing.  It shows the same issues.

Important info:
SLURM 2.3.1.
Controller is a KVM VM with 2 processors (AMD, 2.8 GHz) and 14 GB RAM.
No memory or disk limits appear to be the issue.
Generation G's jobs only have generation G-1's jobs listed as dependencies.

My submit scripts for testing:

####BEGIN CONSOLE DUMP####

[pfmeec@tropos submitotron []]# cat submit-many-jobs.sh
#!/bin/bash

# Just a constant variable used throughout the script to name our jobs
#   in a meaningful way.
BASEJOBNAME="dep"

# Another constant variable used to name the slurm submission file that
#   this script is going to submit to slurm.
JOBFILE="slurm-payload.sh"

#Generations requested
NUMBEROFGENERATIONS=16
#Workers per generation
NUMBEROFWORKERS=16


#The first generation has no dependency so it has its own loop.
#
#We capture the job number slurm spits out and put it into an array with
#the index being the generation.
#Future jobs can then reference $GENERATION - 1 to set dependency.

for GENERATION in $(seq 1 ${NUMBEROFGENERATIONS}) ; do
        if [ ${GENERATION} -eq 1 ]  ; then
                for WORKER in $(seq 1 ${NUMBEROFWORKERS}) ; do
                        echo GENERATION/WORKER: ${GENERATION}/${WORKER}
                        WORKERLIST[${GENERATION}]=$(sbatch --qos=rc-normal -o /dev/null \
                                -J ${BASEJOBNAME}-${GENERATION}-${WORKER} ${JOBFILE} \
                                | awk '{ print $4 }'):${WORKERLIST[${GENERATION}]}
                done
        else
                for WORKER in $(seq 1 ${NUMBEROFWORKERS}) ; do
                        echo GENERATION/WORKER: ${GENERATION}/${WORKER}
                        WORKERLIST[${GENERATION}]=$(sbatch --qos=rc-normal -o /dev/null \
                                --dependency=afterok:${WORKERLIST[$(expr ${GENERATION} - 1)]%\:} \
                                -J ${BASEJOBNAME}-${GENERATION}-${WORKER} ${JOBFILE} \
                                | awk '{ print $4 }'):${WORKERLIST[${GENERATION}]}
                done
        fi
done
[pfmeec@tropos submitotron []]# cat slurm-payload.sh
#!/bin/bash -l
# NOTE the -l flag!
#

# Where to send mail...
#SBATCH --mail-user pfm...@rit.edu

# notify on state change: BEGIN, END, FAIL or ALL
#SBATCH --mail-type=FAIL

# Request max run time H:M:S; anything over will be KILLED
#SBATCH -t 0:1:30

# valid partitions are "work" and "debug"
#SBATCH -p work -n 1

# Job memory requirements in MB
#SBATCH --mem=30

#Just a quick sleep.
sleep 60

[pfmeec@tropos submitotron []]#



####END CONSOLE DUMP####

Wow, that totally killed my indentation.  Github version:
https://github.com/paulmezz/SlurmThings



I know there are ways I could clean up the loops but for this test I just
don't care :)

Any ideas?  (and thanks!)
-paul

--
paul.mezzan...@rit.edu
Sr Systems Administrator/Engineer
Research Computing at RIT
585.475.3245








diff --git a/NEWS b/NEWS
index 198f6df..8c8f90c 100644
--- a/NEWS
+++ b/NEWS
@@ -19,6 +19,10 @@ documents those changes that are of interest to users and admins.
  -- Fix bug in select/cons_res plugin when used with topology/tree and a node
     range count in job allocation request.
  -- Fixed moab_2_slurmdb.pl script to correctly work for end records.
+ -- Add support for new SchedulerParameters of max_depend_depth defining the
+    maximum number of jobs to test for circular dependencies (i.e. job A waits
+    for job B to start and job B waits for job A to start). Default value is
+    10 jobs.
 
 * Changes in SLURM 2.3.4
 ========================
diff --git a/doc/man/man5/slurm.conf.5 b/doc/man/man5/slurm.conf.5
index 5461cb8..d895fb6 100644
--- a/doc/man/man5/slurm.conf.5
+++ b/doc/man/man5/slurm.conf.5
@@ -1394,6 +1394,11 @@ In the case of large clusters (more than 1000 nodes) configured with
 desirable.
 This option applies only to \fBSchedulerType=sched/backfill\fR.
 .TP
+\fBmax_depend_depth=#\fR
+Maximum number of jobs to test for a circular job dependency. Stop testing
+after this number of job dependencies have been tested. The default value is
+10 jobs.
+.TP
 \fBmax_switch_wait=#\fR
 Maximum number of seconds that a job can delay execution waiting for the
 specified desired switch count. The default value is 60 seconds.
diff --git a/src/slurmctld/job_scheduler.c b/src/slurmctld/job_scheduler.c
index ac0b3bb..481c451 100644
--- a/src/slurmctld/job_scheduler.c
+++ b/src/slurmctld/job_scheduler.c
@@ -1158,6 +1158,7 @@ extern int update_job_dependency(struct job_record *job_ptr, char *new_depend)
 
 	if (rc == SLURM_SUCCESS) {
 		/* test for circular dependencies (e.g. A -> B -> A) */
+		(void) _scan_depend(NULL, job_ptr->job_id);
 		if (_scan_depend(new_depend_list, job_ptr->job_id))
 			rc = ESLURM_CIRCULAR_DEPENDENCY;
 	}
@@ -1178,16 +1179,44 @@ extern int update_job_dependency(struct job_record *job_ptr, char *new_depend)
 }
 
 /* Return TRUE if job_id is found in dependency_list.
+ * Pass NULL dependency list to clear the counter.
  * Execute recursively for each dependent job */
 static bool _scan_depend(List dependency_list, uint32_t job_id)
 {
+	static time_t sched_update = 0;
+	static int max_depend_depth = 10;
+	static int job_counter = 0;
 	bool rc = false;
 	ListIterator iter;
 	struct depend_spec *dep_ptr;
 
-	xassert(job_id);
-	xassert(dependency_list);
+	if (sched_update != slurmctld_conf.last_update) {
+		char *sched_params, *tmp_ptr;
+
+		sched_params = slurm_get_sched_params();
+		if (sched_params &&
+		    (tmp_ptr = strstr(sched_params, "max_depend_depth="))) {
+		/*                                   01234567890123456 */
+			int i = atoi(tmp_ptr + 17);
+			if (i < 0) {
+				error("ignoring SchedulerParameters: "
+				      "max_depend_depth value of %d", i);
+			} else {
+				      max_depend_depth = i;
+			}
+		}
+		xfree(sched_params);
+		sched_update = slurmctld_conf.last_update;
+	}
+
+	if (dependency_list == NULL) {
+		job_counter = 0;
+		return FALSE;
+	} else if (job_counter++ >= max_depend_depth) {
+		return FALSE;
+	}
 
+	xassert(job_id);
 	iter = list_iterator_create(dependency_list);
 	if (iter == NULL)
 		fatal("list_iterator_create malloc failure");
