Folks: A few days ago we moved our large Blue Gene/L from slurm 2.2.1
(which worked well) to slurm 2.2.3.  Since that time, we have had
problems with jobs failing very quickly.

From the user's perspective, jobs are queued.  Then instead of going to
the R state, they go to the CG state, linger a few seconds, and are
gone.  After a few times, the user sees:  "sbatch: error: Batch job
submission failed: Job violates accounting policy (job submit limit,
user's size and/or time limits)".

In slurmctld.log I see messages like these:
[2011-03-10T13:24:06] Queue start of job 82001 in BG block RMP08Mr143502133
[2011-03-10T13:24:06] error: Trying to start job 82001 on block
RMP08Mr143502133, but there is a job 81997 running on an overlapping
block RMP08Mr143503361 it will not end until 1299821676.  This should
never happen.
[2011-03-10T13:24:06] sched: Allocate JobId=82001 BPList=bp[110x111]
[2011-03-10T13:24:06] Queue termination of job 82001 in BG block
RMP08Mr143502133
[2011-03-10T13:24:09] error: slurmd error 4008 running JobId=82001 on
node(s)=bp[110x111]: Job credential revoked
[2011-03-10T13:24:09] completing job 82001
[2011-03-10T13:24:09] job_signal of requeuing job 82001 successful
[2011-03-10T13:24:09] sched: Cancel of JobId=82001 by UID=0, usec=66
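
The end time in the overlapping-block error above (1299821676) looks
like a Unix epoch value.  As a rough sketch for reading it, assuming it
is seconds since the epoch and converting to UTC (the helper name
epoch_to_utc is just illustrative, not anything from slurm):

```python
from datetime import datetime, timezone


def epoch_to_utc(epoch_seconds: int) -> str:
    """Format seconds since the Unix epoch as an ISO-style UTC timestamp."""
    # Assumption: the value in the log is plain epoch seconds.
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%S")


# End time reported for job 81997 on the overlapping block.
print(epoch_to_utc(1299821676))  # prints 2011-03-11T05:34:36
```

Note the result is in UTC, while the log timestamps are presumably in
the local time of the service node.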

In our slurmd.log I see messages like these:
[2011-03-10T13:24:06] debug:  [job 82001] attempting to run prolog
[/bgl/local/slurm/etc/prolog.sh]
[2011-03-10T13:24:06] debug2: got this type of message 6011
[2011-03-10T13:24:06] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2011-03-10T13:24:06] debug:  _rpc_terminate_job, uid = 188
[2011-03-10T13:24:06] debug:  task_slurmd_release_resources: 82001
[2011-03-10T13:24:06] debug:  credential for job 82001 revoked
[2011-03-10T13:24:06] debug2: No steps in jobid 82001 to send signal 18
[2011-03-10T13:24:06] debug2: No steps in jobid 82001 to send signal 15
[2011-03-10T13:24:06] debug2: set revoke expiration for jobid 82001 to
110310134406
[2011-03-10T13:24:06] debug:  Waiting for job 82001's prolog to complete
[2011-03-10T13:24:09] debug:  Finished wait for job 82001's prolog to complete
[2011-03-10T13:24:09] debug:  [job 82001] attempting to run epilog
[/bgl/local/slurm/etc/epilog.sh]
[2011-03-10T13:24:09] Job 82001 already killed, do not launch batch job
[2011-03-10T13:24:09] debug:  completed epilog for jobid 82001
[2011-03-10T13:24:09] debug:  Job 82001: sent epilog complete msg: rc = 0

I've attached our bluegene.conf file -- you'll see it specifies
OVERLAP LayoutMode, which according to the release notes received some
attention in this update.  I suspect something isn't quite right here.
In any event, after reverting to 2.2.1, we seem to have happy users
again...

/Lindsay
--
R. Lindsay Todd, PhD                 email: [email protected]
Senior Systems Programmer            phone: 518-276-2605
Rensselaer Polytechnic Institute     fax:   518-276-2809
Troy, NY 12180-3590                  WWW:   http://www.rpi.edu/~toddr

The views, opinions, and judgments expressed in this message are
solely those of the author. The message contents have not been
reviewed or approved by Rensselaer.

Attachment: bluegene.conf
Description: Binary data