We ran into this swap issue when using SelectTypeParameters=CR_CPU_Memory.
We are still on 14.03.10 and carry a very ugly hack that adds a
SchedulerParameters option, "assume_swap", which basically forces SLURM to
ignore the memory allocations of suspended (swapped-out) jobs.  The patch
was very rushed, so we likely ended up just making SLURM behave as if it
were configured with CR_CPU instead of CR_CPU_Memory.  When we upgrade to
15.08.x we will use CR_CPU without our patch, since we define MaxMemPerCPU
on all partitions.  So far in testing, CR_CPU plus MaxMemPerCPU results in
behavior where a 64GB node can hold 64GB worth of suspended jobs and still
run 64GB worth of active jobs.  If a user requests 1 CPU and 64GB with
MaxMemPerCPU=2000, they end up with 32 CPUs, which is what we use for QOS
resource limits and accounting.
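
For anyone who wants to check the CPU arithmetic, here is a rough Python
sketch of how I understand the per-CPU memory cap to work.  It is only an
illustration, not SLURM code: the function name is made up, and it assumes
the 64GB request is expressed as 64000 MB.

```python
import math


def cpus_charged(requested_cpus: int, requested_mem_mb: int,
                 max_mem_per_cpu_mb: int) -> int:
    """Hypothetical helper: CPU count after applying a per-CPU memory cap.

    Assumption: the scheduler raises the CPU count until the memory per
    CPU no longer exceeds the cap, and never lowers the requested count.
    """
    needed_for_memory = math.ceil(requested_mem_mb / max_mem_per_cpu_mb)
    return max(requested_cpus, needed_for_memory)


# 1 CPU and 64GB (taken as 64000 MB) with a 2000 MB per-CPU cap.
print(cpus_charged(1, 64000, 2000))  # 32
```

Note that if the request were 65536 MB instead, the same arithmetic would
round up to 33 CPUs, so the exact figure depends on how the 64GB is
specified.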

Attached are the patches.  They likely only work on 14.03.x releases.  I
wouldn't recommend using the patches, but they may give an idea of how to
implement a proper solution that is worthy of being submitted for inclusion
in SLURM.

- Trey

=============================

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: [email protected]
Jabber: [email protected]

On Mon, Oct 26, 2015 at 9:06 AM, Daniel Letai <[email protected]> wrote:

>
> It would be easy if there were a way to force TRES
> allocation/reconfiguration, e.g. add the swap as GRES/swap, and on suspend
> transfer the allocation from TRES=mem=64,GRES/swap=0 to
> TRES=mem=0,GRES/swap=64.  Then you could start the new job, which requires
> available mem.
>
> Would it be possible to add such a mechanism to scontrol update?
>
> On 10/23/2015 03:14 AM, Bill Broadley wrote:
>
>> I've been using the example documented at:
>>    http://slurm.schedmd.com/preempt.html
>>
>> Specifically, this excerpt from slurm.conf:
>> PartitionName=low Nodes=linux Default=YES Shared=NO      Priority=10 PreemptMode=requeue
>> PartitionName=med Nodes=linux Default=NO  Shared=FORCE:1 Priority=20 PreemptMode=suspend
>> PartitionName=high  Nodes=linux Default=NO  Shared=FORCE:1 Priority=30 PreemptMode=off
>>
>> All my compute nodes have at least as much swap as RAM.  This works quite
>> well, so high priority jobs can suspend medium priority jobs, and if
>> there's memory pressure on the node, suspended jobs can be pushed to swap.
>> I enforce the memory limits, so jobs using more RAM than they ask for get
>> killed.  With the upgrade from slurm 2.6.5 to 14.11, slurm added the
>> ability to manage memory limits as well as CPU.
>>
>> So I started adding GrpMemory to users so if they purchase 4 nodes they
>> can allocate a total of 4 nodes of CPUs or 4 nodes of memory in the high
>> priority queue.  So I have entries like:
>>
>> User-'test':Partition='high':DefaultAccount='testgrp':GrpCPUs=128:GrpMemory=256000
>>
>> I also set DefMemPerCPU=2000, so that users who do not ask for a specific
>> memory allocation get 2GB per CPU.  My nodes have 64GB RAM and 32 CPUs.
>> This works quite well, but it broke preemption.
>>
>> So now if I'm running 32 2GB jobs in the medium queue, no high priority
>> jobs can run because all RAM is allocated.  That seems quite weird to me:
>> if a job is SIGSTOP'd to suspend it, any memory pressure should force the
>> suspended job's pages into swap.  Given that the suspended job isn't
>> running, that shouldn't cause too much I/O, since each page is written
>> just once, with no churning.
>>
>> Is there any way to get slurm to not count a suspended job's memory
>> allocation towards the node's memory used total?
>>
>> Any suggestions on how to get the old behavior back, where high priority
>> jobs can suspend medium priority jobs?
>>
>
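
To make the accounting question in the quoted message concrete, here is a
toy Python model of a single 64GB node.  It is a sketch only and does not
reflect SLURM's actual select plugin logic: the Job class, the can_start
function, and the count_suspended flag are all hypothetical, and 64GB is
taken as 64000 MB.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Job:
    name: str
    mem_mb: int
    suspended: bool = False


def allocated_mem_mb(jobs: List[Job], count_suspended: bool = True) -> int:
    """Sum the memory charged to the node, optionally skipping suspended jobs."""
    return sum(j.mem_mb for j in jobs if count_suspended or not j.suspended)


def can_start(node_mem_mb: int, jobs: List[Job], new_job_mem_mb: int,
              count_suspended: bool = True) -> bool:
    """Return True if the new job's memory request still fits on the node."""
    used = allocated_mem_mb(jobs, count_suspended)
    return used + new_job_mem_mb <= node_mem_mb


NODE_MEM_MB = 64000
# 32 medium-priority jobs of 2000 MB each, all suspended by preemption.
medium = [Job(f"med{i}", 2000, suspended=True) for i in range(32)]

# Suspended jobs still charged: the node looks full.
print(can_start(NODE_MEM_MB, medium, 2000, count_suspended=True))   # False
# Suspended jobs not charged: the high-priority job fits.
print(can_start(NODE_MEM_MB, medium, 2000, count_suspended=False))  # True
```

The second call corresponds to the behavior being asked for, where the
suspended jobs' memory is assumed to be backed by swap and is no longer
charged against the node's RAM.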

Attachment: 0001-select_cons_res.c.patch
Description: Binary data

Attachment: 0002-add-assume_swap-config-option.patch
Description: Binary data
