For debugging purposes, I wanted to add that this issue has now expanded to occur on production jobs in our "grid" and "background" partitions. We had a backfill issue, and when jobs began getting backfilled, 8-CPU nodes were getting 8 CPUs allocated from background and from grid at the same time (both preemptable partitions with the same priority).
- Trey

=============================
Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: [email protected]
Jabber: [email protected]

On Thu, Nov 13, 2014 at 3:51 PM, Trey Dockendorf <[email protected]> wrote:
> I have noticed in production that our "long" partition has been sharing
> nodes with jobs from our "background" partition, and the NumCPUs of each
> job sum to be greater than the CPU count of the node they get scheduled
> onto. This issue has been seen in production on 14.03.10 and also in our
> test environment, also on 14.03.10. The jobs in production MAY have been
> scheduled before we updated to 14.03.10, which was done earlier this
> week. Previously we were on 14.03.6.
>
> Node:
>
> NodeName=c0218 NodeAddr=192.168.200.68 CPUs=8 Sockets=2 CoresPerSocket=4
>   ThreadsPerCore=1 RealMemory=32200 TmpDisk=16000
>   Feature=core8,mem32gb,gig,harpertown State=UNKNOWN
>
> Partitions:
>
> PartitionName=DEFAULT Nodes=c0218,c0[931-932]n[1-2] DefMemPerCPU=1900
>   MaxMemPerCPU=2000
> PartitionName=serial-long Nodes=c0218 Priority=10 PreemptMode=OFF
>   MaxNodes=1 MaxTime=720:00:00 DefMemPerCPU=3900 MaxMemPerCPU=4000 State=UP
> PartitionName=background Priority=10 MaxNodes=1 MaxTime=96:00:00 State=UP
>
> slurm.conf items:
>
> PreemptMode = GANG,SUSPEND
> PreemptType = preempt/partition_prio
> SelectType = select/cons_res
> SelectTypeParameters = CR_CPU,CR_CORE_DEFAULT_DIST_BLOCK
>
> Using a simple batch script that just calls the "stress" program to
> generate load, I scheduled two jobs. SLURM correctly adjusted each job's
> NumCPUs based on the MaxMemPerCPU values:
>
> sbatch --mem=14400 -p background batches/stress.slrm
> # jobID 11323
> # NumCPUs=8
>
> sbatch --mem=15360 -p serial-long batches/stress.slrm
> # jobID 11324
> # NumCPUs=4
>
> Below is the debug output from slurmctld.log regarding scheduling of the
> second job. I'm wondering if this is a bug or a configuration issue.
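For anyone following along, the NumCPUs adjustment in the quoted report is just per-partition memory arithmetic: Slurm raises a job's minimum CPU count until --mem divided by the CPU count fits under the partition's effective MaxMemPerCPU (background inherits MaxMemPerCPU=2000 from the DEFAULT partition line; serial-long sets 4000). A minimal sketch of that calculation, assuming this is the whole story for these two single-node jobs:

```python
import math

def cpus_for_mem(mem_mb, max_mem_per_cpu_mb):
    # Slurm raises pn_min_cpus so that mem / cpus <= MaxMemPerCPU,
    # i.e. the smallest integer CPU count satisfying the limit.
    return math.ceil(mem_mb / max_mem_per_cpu_mb)

# Job 11323: --mem=14400 in background (MaxMemPerCPU=2000 via DEFAULT)
print(cpus_for_mem(14400, 2000))  # -> 8, matching NumCPUs=8

# Job 11324: --mem=15360 in serial-long (MaxMemPerCPU=4000)
print(cpus_for_mem(15360, 4000))  # -> 4, matching NumCPUs=4
```

So both adjustments were correct in isolation; the problem is that 8 + 4 = 12 CPUs ended up co-allocated on an 8-CPU node.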
> It seems like a bug because more CPUs were scheduled than available, and
> all partitions have Shared=NO by default.
>
> [2014-11-13T15:38:25.278] debug:  Setting job's pn_min_cpus to 4 due to memory limit
> [2014-11-13T15:38:25.278] debug3: acct_policy_validate: MPN: job_memory set to 15360
> [2014-11-13T15:38:25.278] debug3: before alteration asking for nodes 1-4294967294 cpus 1-4294967294
> [2014-11-13T15:38:25.278] debug3: after alteration asking for nodes 1-4294967294 cpus 1-4294967294
> [2014-11-13T15:38:25.279] debug2: initial priority for job 11324 is 30668109
> [2014-11-13T15:38:25.279] debug2: found 1 usable nodes from config containing c0218
> [2014-11-13T15:38:25.279] debug3: _pick_best_nodes: job 11324 idle_nodes 4 share_nodes 5
> [2014-11-13T15:38:25.279] debug2: select_p_job_test for job 11324
> [2014-11-13T15:38:25.279] debug3: acct_policy_job_runnable_post_select: job 11324: MPN: job_memory set to 15360
> [2014-11-13T15:38:25.279] debug2: _adjust_limit_usage: job 11324: MPN: job_memory set to 0
> [2014-11-13T15:38:25.279] debug2: sched: JobId=11324 allocated resources: NodeList=(null)
> [2014-11-13T15:38:25.279] _slurm_rpc_submit_batch_job JobId=11324 usec=1582
> [2014-11-13T15:38:25.281] debug3: Writing job id 11324 to header record of job_state file
> [2014-11-13T15:38:26.187] debug:  sched: Running job scheduler
> [2014-11-13T15:38:26.187] debug2: found 1 usable nodes from config containing c0218
> [2014-11-13T15:38:26.187] debug3: _pick_best_nodes: job 11324 idle_nodes 4 share_nodes 5
> [2014-11-13T15:38:26.187] debug2: select_p_job_test for job 11324
> [2014-11-13T15:38:26.188] debug3: cons_res: best_fit: node[0]: required cpus: 4, min req boards: 1,
> [2014-11-13T15:38:26.188] debug3: cons_res: best_fit: node[0]: min req sockets: 1, min avail cores: 8
> [2014-11-13T15:38:26.188] debug3: cons_res: best_fit: using node[0]: board[0]: socket[1]: 4 cores available
> [2014-11-13T15:38:26.188] debug3: acct_policy_job_runnable_post_select: job 11324: MPN: job_memory set to 15360
> [2014-11-13T15:38:26.188] debug3: cons_res: _add_job_to_res: job 11324 act 0
> [2014-11-13T15:38:26.188] debug3: cons_res: adding job 11324 to part serial-long row 0
> [2014-11-13T15:38:26.188] debug2: _adjust_limit_usage: job 11324: MPN: job_memory set to 15360
> [2014-11-13T15:38:26.188] debug3: sched: JobId=11324 initiated
> [2014-11-13T15:38:26.188] sched: Allocate JobId=11324 NodeList=c0218 #CPUs=4
>
> Thanks,
> - Trey
>
> =============================
>
> Trey Dockendorf
> Systems Analyst I
> Texas A&M University
> Academy for Advanced Telecommunications and Learning Technologies
> Phone: (979)458-2396
> Email: [email protected]
> Jabber: [email protected]
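P.S. Since this is now hitting production on multiple partitions, here is the sanity check we can run to spot affected nodes: sum the allocated CPUs of running jobs per node and flag any node where the total exceeds its CPU count. A sketch, assuming it is fed from something like "squeue -h -t R -o '%N %C'" and that jobs are single-node (MaxNodes=1 here), since it does not expand hostlist ranges like c0[931-932]:

```python
from collections import defaultdict

def oversubscribed_nodes(running_jobs, node_cpus):
    """running_jobs: iterable of (node_name, allocated_cpus) pairs
    for RUNNING jobs; node_cpus: dict of node name -> physical CPUs.
    Returns {node: total_allocated} for nodes allocated beyond capacity."""
    allocated = defaultdict(int)
    for node, ncpus in running_jobs:
        allocated[node] += ncpus
    return {n: total for n, total in allocated.items()
            if total > node_cpus.get(n, 0)}

# The case from this thread: jobs 11323 (8 CPUs) and 11324 (4 CPUs)
# both landed on c0218, an 8-CPU node.
jobs = [("c0218", 8), ("c0218", 4)]
print(oversubscribed_nodes(jobs, {"c0218": 8}))  # -> {'c0218': 12}
```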
