Lindsay, here is a patch that fixes this. It is already in 2.2.4. Sorry for
the issues...
Danny
Index: src/plugins/select/bluegene/plugin/bg_job_place.c
===================================================================
--- src/plugins/select/bluegene/plugin/bg_job_place.c (revision 22654)
+++ src/plugins/select/bluegene/plugin/bg_job_place.c (revision 22655)
@@ -604,7 +604,10 @@
             }
         }
-        if (!SELECT_IS_CHECK_FULL_SET(query_mode)
+        if (((bg_conf->layout_mode == LAYOUT_DYNAMIC)
+             || ((!SELECT_IS_CHECK_FULL_SET(query_mode)
+                  || SELECT_IS_MODE_RUN_NOW(query_mode))
+                 && (bg_conf->layout_mode != LAYOUT_DYNAMIC)))
             && ((found_record->job_running != NO_JOB_RUNNING)
                 || (found_record->state
                     == RM_PARTITION_ERROR))) {
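
To make the effect of the changed condition easier to see, here is a minimal standalone sketch of the old and new tests. It is not the plugin code: layout_mode_t, branch_taken_old(), branch_taken_new() and self_check() are hypothetical names, the three macro results are passed in as plain booleans, and "busy" stands for the unchanged second half of the condition (a job is running on the block, or the block is in the error state). What the guarded branch then does with the block is not shown in the hunk, so the sketch only reports whether the branch is entered.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for bg_conf->layout_mode (assumption, not the
 * real SLURM definition). */
typedef enum { LAYOUT_STATIC, LAYOUT_OVERLAP, LAYOUT_DYNAMIC } layout_mode_t;

/* Condition before the patch: only the CHECK_FULL_SET flag matters. */
static bool branch_taken_old(bool check_full_set, bool busy)
{
    return !check_full_set && busy;
}

/* Condition after the patch, written term for term as in the diff. */
static bool branch_taken_new(layout_mode_t layout, bool check_full_set,
                             bool run_now, bool busy)
{
    bool mode_applies = (layout == LAYOUT_DYNAMIC)
        || ((!check_full_set || run_now) && (layout != LAYOUT_DYNAMIC));
    return mode_applies && busy;
}

/* A few cases that show where the two versions differ. */
static void self_check(void)
{
    /* OVERLAP layout, full-set check, run-now query, busy block:
     * the old test did not enter the branch, the new one does. */
    assert(!branch_taken_old(true, true));
    assert(branch_taken_new(LAYOUT_OVERLAP, true, true, true));

    /* Same, but not a run-now query: still not entered. */
    assert(!branch_taken_new(LAYOUT_OVERLAP, true, false, true));

    /* DYNAMIC layout: entered for any busy block, whatever the flags. */
    assert(branch_taken_new(LAYOUT_DYNAMIC, true, false, true));

    /* A block that is not busy never enters the branch. */
    assert(!branch_taken_new(LAYOUT_OVERLAP, false, true, false));
}
```

Note that the trailing "layout_mode != LAYOUT_DYNAMIC" term is logically redundant: A || (B && !A) reduces to A || B, so the first half of the new test is equivalent to "DYNAMIC, or not a full-set check, or a run-now query".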
> Folks: A few days ago we moved our large Blue Gene/L from slurm 2.2.1
> (which worked well) to slurm 2.2.3. Since that time, we have had
> problems with jobs failing very quickly.
>
> From the user perspective, jobs are queued. Then instead of going to
> the R state, they go to the CG state, linger a few seconds, and are
> gone. After a few times, the user sees: "sbatch: error: Batch job
> submission failed: Job violates accounting policy (job submit limit,
> user's size and/or time limits)".
>
> In the slurmctld.log I see messages like this:
> [2011-03-10T13:24:06] Queue start of job 82001 in BG block RMP08Mr143502133
> [2011-03-10T13:24:06] error: Trying to start job 82001 on block
> RMP08Mr143502133, but there is a job 81997 running on an overlapping
> block RMP08Mr143503361 it will not end until 1299821676. This should
> never happen.
> [2011-03-10T13:24:06] sched: Allocate JobId=82001 BPList=bp[110x111]
> [2011-03-10T13:24:06] Queue termination of job 82001 in BG block
> RMP08Mr143502133
> [2011-03-10T13:24:09] error: slurmd error 4008 running JobId=82001 on
> node(s)=bp[110x111]: Job credential revoked
> [2011-03-10T13:24:09] completing job 82001
> [2011-03-10T13:24:09] job_signal of requeuing job 82001 successful
> [2011-03-10T13:24:09] sched: Cancel of JobId=82001 by UID=0, usec=66
>
> In our slurmd.log I see messages like these:
> [2011-03-10T13:24:06] debug: [job 82001] attempting to run prolog
> [/bgl/local/slurm/etc/prolog.sh]
> [2011-03-10T13:24:06] debug2: got this type of message 6011
> [2011-03-10T13:24:06] debug2: Processing RPC: REQUEST_TERMINATE_JOB
> [2011-03-10T13:24:06] debug: _rpc_terminate_job, uid = 188
> [2011-03-10T13:24:06] debug: task_slurmd_release_resources: 82001
> [2011-03-10T13:24:06] debug: credential for job 82001 revoked
> [2011-03-10T13:24:06] debug2: No steps in jobid 82001 to send signal 18
> [2011-03-10T13:24:06] debug2: No steps in jobid 82001 to send signal 15
> [2011-03-10T13:24:06] debug2: set revoke expiration for jobid 82001 to
> 110310134406
> [2011-03-10T13:24:06] debug: Waiting for job 82001's prolog to complete
> [2011-03-10T13:24:09] debug: Finished wait for job 82001's prolog to complete
> [2011-03-10T13:24:09] debug: [job 82001] attempting to run epilog
> [/bgl/local/slurm/etc/epilog.sh]
> [2011-03-10T13:24:09] Job 82001 already killed, do not launch batch job
> [2011-03-10T13:24:09] debug: completed epilog for jobid 82001
> [2011-03-10T13:24:09] debug: Job 82001: sent epilog complete msg: rc = 0
>
> I've attached our bluegene.conf file -- you'll see it specifies
> OVERLAP LayoutMode, which according to the release notes received some
> attention in this update. I suspect something isn't quite right here.
> In any event, after reverting to 2.2.1, we seem to have happy users
> again...
>
> /Lindsay
> --
> R. Lindsay Todd, PhD email: [email protected]
> Senior Systems Programmer phone: 518-276-2605
> Rensselaer Polytechnic Institute fax: 518-276-2809
> Troy, NY 12180-3590 WWW: http://www.rpi.edu/~toddr
>
> The views, opinions, and judgments expressed in this message are
> solely those of the author. The message contents have not been
> reviewed or approved by Rensselaer.
>