Lindsay, here is a patch that fixes this.  It is already in 2.2.4.  Sorry for 
the issues...

Danny

Index: src/plugins/select/bluegene/plugin/bg_job_place.c
===================================================================
--- src/plugins/select/bluegene/plugin/bg_job_place.c   (revision 22654)
+++ src/plugins/select/bluegene/plugin/bg_job_place.c   (revision 22655)
@@ -604,7 +604,10 @@
                                }
                        }
 
-                       if (!SELECT_IS_CHECK_FULL_SET(query_mode)
+                       if (((bg_conf->layout_mode == LAYOUT_DYNAMIC)
+                            || ((!SELECT_IS_CHECK_FULL_SET(query_mode)
+                                 || SELECT_IS_MODE_RUN_NOW(query_mode))
+                                && (bg_conf->layout_mode != LAYOUT_DYNAMIC)))
                            && ((found_record->job_running != NO_JOB_RUNNING)
                                || (found_record->state
                                    == RM_PARTITION_ERROR))) {
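
For readers following the logic of the change, here is a minimal sketch of just the first half of the patched condition, the part that decides whether the "job running or block in error" test is applied to a candidate block at all. It is not SLURM code: the enum and the function names are stand-ins, and the two SELECT_IS_* macros are modelled as plain booleans. The body of the if statement is not shown in the hunk, so the sketch does not say what happens once the condition holds.

```c
#include <stdbool.h>

/* Stand-in for bg_conf->layout_mode; these values are assumptions
 * made for this sketch, not the SLURM definitions. */
typedef enum { LAYOUT_STATIC, LAYOUT_OVERLAP, LAYOUT_DYNAMIC } layout_mode_t;

/* Before the patch: the busy/error test was applied only when the
 * CHECK_FULL_SET flag was not set in query_mode. */
static bool busy_test_applies_old(bool check_full_set)
{
    return !check_full_set;
}

/* After the patch: the test always applies in DYNAMIC layout, and in
 * the other layouts it also applies whenever the query is "run now".
 * The expression mirrors the patched condition term by term. */
static bool busy_test_applies_new(layout_mode_t layout,
                                  bool check_full_set, bool run_now)
{
    return (layout == LAYOUT_DYNAMIC)
        || ((!check_full_set || run_now) && (layout != LAYOUT_DYNAMIC));
}
```

Under these assumptions, busy_test_applies_old(true) is false, while busy_test_applies_new(LAYOUT_OVERLAP, true, true) is true. That difference is only relevant if a run-now query can reach this point with CHECK_FULL_SET set under OVERLAP layout, which would fit the "overlapping block" error in the report below but is an inference, not something the patch states. The new expression is logically equivalent to (layout == LAYOUT_DYNAMIC) || !check_full_set || run_now.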


> Folks: A few days ago we moved our large Blue Gene/L from slurm 2.2.1
> (which worked well) to slurm 2.2.3.  Since that time, we have had
> problems with jobs failing very quickly.
> 
> From the user perspective, jobs are queued.  Then instead of going to
> the R state, they go to the CG state, linger a few seconds, and are
> gone.  After a few times, the user sees:  "sbatch: error: Batch job
> submission failed: Job violates accounting policy (job submit limit,
> user's size and/or time limits)".
> 
> In the slurmctld.log I see messages like this:
> [2011-03-10T13:24:06] Queue start of job 82001 in BG block RMP08Mr143502133
> [2011-03-10T13:24:06] error: Trying to start job 82001 on block
> RMP08Mr143502133, but there is a job 81997 running on an overlapping
> block RMP08Mr143503361 it will not end until 1299821676.  This should
> never happen.
> [2011-03-10T13:24:06] sched: Allocate JobId=82001 BPList=bp[110x111]
> [2011-03-10T13:24:06] Queue termination of job 82001 in BG block
> RMP08Mr143502133
> [2011-03-10T13:24:09] error: slurmd error 4008 running JobId=82001 on
> node(s)=bp[110x111]: Job credential revoked
> [2011-03-10T13:24:09] completing job 82001
> [2011-03-10T13:24:09] job_signal of requeuing job 82001 successful
> [2011-03-10T13:24:09] sched: Cancel of JobId=82001 by UID=0, usec=66
> 
> In our slurmd.log I see messages like these:
> [2011-03-10T13:24:06] debug:  [job 82001] attempting to run prolog
> [/bgl/local/slurm/etc/prolog.sh]
> [2011-03-10T13:24:06] debug2: got this type of message 6011
> [2011-03-10T13:24:06] debug2: Processing RPC: REQUEST_TERMINATE_JOB
> [2011-03-10T13:24:06] debug:  _rpc_terminate_job, uid = 188
> [2011-03-10T13:24:06] debug:  task_slurmd_release_resources: 82001
> [2011-03-10T13:24:06] debug:  credential for job 82001 revoked
> [2011-03-10T13:24:06] debug2: No steps in jobid 82001 to send signal 18
> [2011-03-10T13:24:06] debug2: No steps in jobid 82001 to send signal 15
> [2011-03-10T13:24:06] debug2: set revoke expiration for jobid 82001 to
> 110310134406
> [2011-03-10T13:24:06] debug:  Waiting for job 82001's prolog to complete
> [2011-03-10T13:24:09] debug:  Finished wait for job 82001's prolog to complete
> [2011-03-10T13:24:09] debug:  [job 82001] attempting to run epilog
> [/bgl/local/slurm/etc/epilog.sh]
> [2011-03-10T13:24:09] Job 82001 already killed, do not launch batch job
> [2011-03-10T13:24:09] debug:  completed epilog for jobid 82001
> [2011-03-10T13:24:09] debug:  Job 82001: sent epilog complete msg: rc = 0
> 
> I've attached our bluegene.conf file -- you'll see it specifies
> OVERLAP LayoutMode, which according to the release notes received some
> attention in this update.  I suspect something isn't quite right here.
> In any event, after reverting to 2.2.1, we seem to have happy users
> again...
> 
> /Lindsay
> --
> R. Lindsay Todd, PhD                 email: [email protected]
> Senior Systems Programmer            phone: 518-276-2605
> Rensselaer Polytechnic Institute     fax:   518-276-2809
> Troy, NY 12180-3590                  WWW:   http://www.rpi.edu/~toddr
> 
> The views, opinions, and judgments expressed in this message are
> solely those of the author. The message contents have not been
> reviewed or approved by Rensselaer.
> 
