Folks: A few days ago we moved our large Blue Gene/L from slurm 2.2.1 (which worked well) to slurm 2.2.3. Since that time, we have had problems with jobs failing very quickly.
From the user's perspective, jobs are queued. Then instead of going to the R state, they go to the CG state, linger a few seconds, and are gone. After a few times, the user sees:

"sbatch: error: Batch job submission failed: Job violates accounting policy (job submit limit, user's size and/or time limits)".

In slurmctld.log I see messages like these:

[2011-03-10T13:24:06] Queue start of job 82001 in BG block RMP08Mr143502133
[2011-03-10T13:24:06] error: Trying to start job 82001 on block RMP08Mr143502133, but there is a job 81997 running on an overlapping block RMP08Mr143503361 it will not end until 1299821676. This should never happen.
[2011-03-10T13:24:06] sched: Allocate JobId=82001 BPList=bp[110x111]
[2011-03-10T13:24:06] Queue termination of job 82001 in BG block RMP08Mr143502133
[2011-03-10T13:24:09] error: slurmd error 4008 running JobId=82001 on node(s)=bp[110x111]: Job credential revoked
[2011-03-10T13:24:09] completing job 82001
[2011-03-10T13:24:09] job_signal of requeuing job 82001 successful
[2011-03-10T13:24:09] sched: Cancel of JobId=82001 by UID=0, usec=66

In our slurmd.log I see messages like these:

[2011-03-10T13:24:06] debug: [job 82001] attempting to run prolog [/bgl/local/slurm/etc/prolog.sh]
[2011-03-10T13:24:06] debug2: got this type of message 6011
[2011-03-10T13:24:06] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2011-03-10T13:24:06] debug: _rpc_terminate_job, uid = 188
[2011-03-10T13:24:06] debug: task_slurmd_release_resources: 82001
[2011-03-10T13:24:06] debug: credential for job 82001 revoked
[2011-03-10T13:24:06] debug2: No steps in jobid 82001 to send signal 18
[2011-03-10T13:24:06] debug2: No steps in jobid 82001 to send signal 15
[2011-03-10T13:24:06] debug2: set revoke expiration for jobid 82001 to 110310134406
[2011-03-10T13:24:06] debug: Waiting for job 82001's prolog to complete
[2011-03-10T13:24:09] debug: Finished wait for job 82001's prolog to complete
[2011-03-10T13:24:09] debug: [job 82001] attempting to run epilog [/bgl/local/slurm/etc/epilog.sh]
[2011-03-10T13:24:09] Job 82001 already killed, do not launch batch job
[2011-03-10T13:24:09] debug: completed epilog for jobid 82001
[2011-03-10T13:24:09] debug: Job 82001: sent epilog complete msg: rc = 0

I've attached our bluegene.conf file -- you'll see it specifies OVERLAP LayoutMode, which according to the release notes received some attention in this update. I suspect something isn't quite right here.

In any event, after reverting to 2.2.1, we seem to have happy users again...

/Lindsay

--
R. Lindsay Todd, PhD                email: [email protected]
Senior Systems Programmer           phone: 518-276-2605
Rensselaer Polytechnic Institute    fax:   518-276-2809
Troy, NY 12180-3590                 WWW:   http://www.rpi.edu/~toddr

The views, opinions, and judgments expressed in this message are solely those of the author. The message contents have not been reviewed or approved by Rensselaer.
bluegene.conf
Description: Binary data
