For debugging purposes, I wanted to add that this issue has now expanded to occur on production jobs in our "grid" and "background" partitions. We had a backfill issue, and when jobs began getting backfilled, 8-CPU nodes were getting 8 CPUs allocated from background and from grid at the same time (both preemptable partitions with the same priority).
- Trey

=============================
Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: [email protected]
Jabber: [email protected]

On Thu, Nov 13, 2014 at 3:51 PM, Trey Dockendorf <[email protected]> wrote:
> I have noticed in production that our "long" partition has been sharing
> nodes with jobs from our "background" partition, and the NumCPUs of each
> job sum to be greater than the CPU count of the node they get scheduled
> onto. This issue has been seen in production on 14.03.10 and also in our
> test environment, also on 14.03.10. The jobs in production MAY have been
> scheduled before we updated to 14.03.10, which was done earlier this
> week. Previously we were on 14.03.6.
>
> Node:
>
> NodeName=c0218 NodeAddr=192.168.200.68 CPUs=8 Sockets=2 CoresPerSocket=4
>   ThreadsPerCore=1 RealMemory=32200 TmpDisk=16000
>   Feature=core8,mem32gb,gig,harpertown State=UNKNOWN
>
> Partitions:
>
> PartitionName=DEFAULT Nodes=c0218,c0[931-932]n[1-2] DefMemPerCPU=1900
>   MaxMemPerCPU=2000
> PartitionName=serial-long Nodes=c0218 Priority=10 PreemptMode=OFF
>   MaxNodes=1 MaxTime=720:00:00 DefMemPerCPU=3900 MaxMemPerCPU=4000 State=UP
> PartitionName=background Priority=10 MaxNodes=1 MaxTime=96:00:00 State=UP
>
> slurm.conf items:
>
> PreemptMode = GANG,SUSPEND
> PreemptType = preempt/partition_prio
> SelectType = select/cons_res
> SelectTypeParameters = CR_CPU,CR_CORE_DEFAULT_DIST_BLOCK
>
> Using a simple batch script that just calls the "stress" program to
> generate load, I scheduled two jobs. SLURM correctly adjusted each job's
> NumCPUs based on the MaxMemPerCPU values:
>
> sbatch --mem=14400 -p background batches/stress.slrm
> # jobID 11323
> # NumCPUs=8
>
> sbatch --mem=15360 -p serial-long batches/stress.slrm
> # jobID 11324
> # NumCPUs=4
>
> Below is the debug output from slurmctld.log regarding scheduling of the
> second job. I'm wondering if this is a bug or a configuration issue.
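For anyone following along, the NumCPUs adjustment in the quoted report is just per-partition memory arithmetic: Slurm raises a job's minimum CPU count until --mem divided by the CPU count fits under the partition's effective MaxMemPerCPU (background inherits MaxMemPerCPU=2000 from the DEFAULT partition line; serial-long sets 4000). A minimal sketch of that calculation, assuming this is the whole story for these two single-node jobs:

```python
import math

def cpus_for_mem(mem_mb, max_mem_per_cpu_mb):
    # Slurm raises pn_min_cpus so that mem / cpus <= MaxMemPerCPU,
    # i.e. the smallest integer CPU count satisfying the limit.
    return math.ceil(mem_mb / max_mem_per_cpu_mb)

# Job 11323: --mem=14400 in background (MaxMemPerCPU=2000 via DEFAULT)
print(cpus_for_mem(14400, 2000))  # -> 8, matching NumCPUs=8

# Job 11324: --mem=15360 in serial-long (MaxMemPerCPU=4000)
print(cpus_for_mem(15360, 4000))  # -> 4, matching NumCPUs=4
```

So both adjustments were correct in isolation; the problem is that 8 + 4 = 12 CPUs ended up co-allocated on an 8-CPU node.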
> It seems like a bug because more CPUs were scheduled than available, and
> all partitions have Shared=NO by default.
>
> [2014-11-13T15:38:25.278] debug:  Setting job's pn_min_cpus to 4 due to memory limit
> [2014-11-13T15:38:25.278] debug3: acct_policy_validate: MPN: job_memory set to 15360
> [2014-11-13T15:38:25.278] debug3: before alteration asking for nodes 1-4294967294 cpus 1-4294967294
> [2014-11-13T15:38:25.278] debug3: after alteration asking for nodes 1-4294967294 cpus 1-4294967294
> [2014-11-13T15:38:25.279] debug2: initial priority for job 11324 is 30668109
> [2014-11-13T15:38:25.279] debug2: found 1 usable nodes from config containing c0218
> [2014-11-13T15:38:25.279] debug3: _pick_best_nodes: job 11324 idle_nodes 4 share_nodes 5
> [2014-11-13T15:38:25.279] debug2: select_p_job_test for job 11324
> [2014-11-13T15:38:25.279] debug3: acct_policy_job_runnable_post_select: job 11324: MPN: job_memory set to 15360
> [2014-11-13T15:38:25.279] debug2: _adjust_limit_usage: job 11324: MPN: job_memory set to 0
> [2014-11-13T15:38:25.279] debug2: sched: JobId=11324 allocated resources: NodeList=(null)
> [2014-11-13T15:38:25.279] _slurm_rpc_submit_batch_job JobId=11324 usec=1582
> [2014-11-13T15:38:25.281] debug3: Writing job id 11324 to header record of job_state file
> [2014-11-13T15:38:26.187] debug:  sched: Running job scheduler
> [2014-11-13T15:38:26.187] debug2: found 1 usable nodes from config containing c0218
> [2014-11-13T15:38:26.187] debug3: _pick_best_nodes: job 11324 idle_nodes 4 share_nodes 5
> [2014-11-13T15:38:26.187] debug2: select_p_job_test for job 11324
> [2014-11-13T15:38:26.188] debug3: cons_res: best_fit: node[0]: required cpus: 4, min req boards: 1,
> [2014-11-13T15:38:26.188] debug3: cons_res: best_fit: node[0]: min req sockets: 1, min avail cores: 8
> [2014-11-13T15:38:26.188] debug3: cons_res: best_fit: using node[0]: board[0]: socket[1]: 4 cores available
> [2014-11-13T15:38:26.188] debug3: acct_policy_job_runnable_post_select: job 11324: MPN: job_memory set to 15360
> [2014-11-13T15:38:26.188] debug3: cons_res: _add_job_to_res: job 11324 act 0
> [2014-11-13T15:38:26.188] debug3: cons_res: adding job 11324 to part serial-long row 0
> [2014-11-13T15:38:26.188] debug2: _adjust_limit_usage: job 11324: MPN: job_memory set to 15360
> [2014-11-13T15:38:26.188] debug3: sched: JobId=11324 initiated
> [2014-11-13T15:38:26.188] sched: Allocate JobId=11324 NodeList=c0218 #CPUs=4
>
> Thanks,
> - Trey
>
> =============================
>
> Trey Dockendorf
> Systems Analyst I
> Texas A&M University
> Academy for Advanced Telecommunications and Learning Technologies
> Phone: (979)458-2396
> Email: [email protected]
> Jabber: [email protected]
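P.S. Since this is now hitting production on multiple partitions, here is the sanity check we can run to spot affected nodes: sum the allocated CPUs of running jobs per node and flag any node where the total exceeds its CPU count. A sketch, assuming it is fed from something like "squeue -h -t R -o '%N %C'" and that jobs are single-node (MaxNodes=1 here), since it does not expand hostlist ranges like c0[931-932]:

```python
from collections import defaultdict

def oversubscribed_nodes(running_jobs, node_cpus):
    """running_jobs: iterable of (node_name, allocated_cpus) pairs
    for RUNNING jobs; node_cpus: dict of node name -> physical CPUs.
    Returns {node: total_allocated} for nodes allocated beyond capacity."""
    allocated = defaultdict(int)
    for node, ncpus in running_jobs:
        allocated[node] += ncpus
    return {n: total for n, total in allocated.items()
            if total > node_cpus.get(n, 0)}

# The case from this thread: jobs 11323 (8 CPUs) and 11324 (4 CPUs)
# both landed on c0218, an 8-CPU node.
jobs = [("c0218", 8), ("c0218", 4)]
print(oversubscribed_nodes(jobs, {"c0218": 8}))  # -> {'c0218': 12}
```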
