We ran into this swap issue when using SelectTypeParameters=CR_CPU_Memory. We are still on 14.03.10 and have a very ugly hack that adds a SchedulerParameters option, "assume_swap", which basically forces SLURM to ignore the memory allocations of swapped jobs. The patch was very rushed, so we likely ended up just making SLURM behave as if it were configured with CR_CPU instead of CR_CPU_Memory. When we upgrade to 15.08.x we will use CR_CPU without our patch, since we define MaxMemPerCPU on all partitions. So far in testing, CR_CPU combined with MaxMemPerCPU results in behavior where a 64GB node can hold 64GB worth of suspended jobs and still run 64GB worth of active jobs. If a user requests 1 CPU and 64GB with MaxMemPerCPU=2000, they end up with 32 CPUs, and that CPU count is what we use for QOS resource limits and accounting.
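To make the 32-CPU figure concrete, here is a rough sketch of the arithmetic. This is not SLURM code: the function name is made up for illustration, it assumes the 64GB request is expressed as 64000 MB, and it assumes the adjustment amounts to a simple ceiling division of requested memory by the per-CPU limit. The real scheduler logic may differ in detail.

```python
def cpus_charged(mem_mb: int, max_mem_per_cpu_mb: int, cpus_requested: int = 1) -> int:
    """Approximate the CPU count a job ends up with under a per-CPU memory cap.

    Hypothetical helper for illustration only; it is not part of SLURM.
    Assumption: the CPU count is raised until mem_mb / cpus fits under the cap.
    """
    if mem_mb <= 0 or max_mem_per_cpu_mb <= 0 or cpus_requested <= 0:
        raise ValueError("all arguments must be positive")
    # Integer ceiling division: CPUs needed to cover the memory request.
    cpus_for_memory = (mem_mb + max_mem_per_cpu_mb - 1) // max_mem_per_cpu_mb
    # Never go below what the user asked for.
    return max(cpus_requested, cpus_for_memory)


if __name__ == "__main__":
    # 1 CPU and 64000 MB requested, with a 2000 MB per-CPU limit.
    print(cpus_charged(64000, 2000, cpus_requested=1))
```

With those assumptions the request is charged as 64000 / 2000 = 32 CPUs, i.e. a full node on 32-core, 64GB hardware.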
Attached are the patches. They likely only work on 14.03.x releases. I wouldn't recommend using the patches, but they may give an idea of how to implement a proper solution that is worthy of being submitted for inclusion in SLURM.

- Trey

=============================

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: [email protected]
Jabber: [email protected]

On Mon, Oct 26, 2015 at 9:06 AM, Daniel Letai <[email protected]> wrote:

> It would be easy if there was a way to force TRES
> allocation/reconfiguration, e.g.
> Add the swap as GRES/swap, and on suspend transfer the allocation from
> TRES=mem=64,GRES/swap=0 to TRES=mem=0,GRES/swap=64. Then you could
> start the new job which requires available mem.
>
> Will it be possible to add such a mechanism to scontrol update?
>
> On 10/23/2015 03:14 AM, Bill Broadley wrote:
>
>> I've been using the example documented at:
>> http://slurm.schedmd.com/preempt.html
>>
>> Specifically:
>>
>> Excerpt from slurm.conf
>> PartitionName=low Nodes=linux Default=YES Shared=NO Priority=10 PreemptMode=requeue
>> PartitionName=med Nodes=linux Default=NO Shared=FORCE:1 Priority=20 PreemptMode=suspend
>> PartitionName=high Nodes=linux Default=NO Shared=FORCE:1 Priority=30 PreemptMode=off
>>
>> All my compute nodes have at least as much swap as ram. This works
>> quite well, so high priority jobs can suspend medium priority jobs
>> and if there's memory pressure on the node suspended jobs can be
>> pushed to swap. I enforce the memory limits so jobs using more ram
>> than they ask for get killed. With the slurm 2.6.5 to 14.11 upgrade
>> slurm added the ability to manage memory limits as well as CPU.
>>
>> So I started adding GrpMemory to users so if they purchase 4 nodes
>> they can allocate a total of 4 nodes of CPUs or 4 nodes of memory in
>> the high priority queue. So I have entries like:
>>
>> User-'test':Partition='high':DefaultAccount='testgrp':GrpCPUs=128:GrpMemory=256000
>>
>> So I set DefMemPerCPU=2000, so that users who do not ask for a
>> specific memory allocation get 2GB per CPU. My nodes have 64GB ram
>> and 32 CPUs. This works quite well, but it broke preemption.
>>
>> So now if I'm running 32 2GB jobs in the medium queue, no high
>> priority jobs can run because all ram is allocated. That seems quite
>> weird to me, if a job is SIGSTOP'd to suspend any memory pressure
>> should force suspended memory pages into swap. Given that the
>> suspended job isn't running that shouldn't cause too much I/O since
>> each page is written just once, no churning.
>>
>> Is there any way to get slurm to not count suspended jobs memory
>> allocation towards the node's memory used total?
>>
>> Any suggestions on how to get the old behavior back where high
>> priority jobs can be suspended?
>>

0001-select_cons_res.c.patch
Description: Binary data
0002-add-assume_swap-config-option.patch
Description: Binary data
