After considering this solution further, I don't think it's ideal: a job
will only ever use one of the partitions, but many of our jobs require
the entire cluster to run.
I think I liked the old behavior (automatic memory over-subscription)
better.
For now, I'll probably have to define a lowest-common-denominator default
(the suggested 50 sounds about right) and force users to specify --mem.
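As a rough sketch of what users would then have to do, a job script with an explicit memory request might look like the one below. The task count and the 4000 MB figure are made-up values for illustration, not a recommendation; --mem is taken as megabytes per node by default.

```shell
#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --mem=4000    # explicit per-node memory request in MB (illustrative value)

# Slurm exports the per-node request to the job environment when --mem is given.
# The fallback keeps the script runnable outside of a Slurm allocation.
mem_request="${SLURM_MEM_PER_NODE:-unset}"
echo "Memory requested per node (MB): ${mem_request}"
```

Submitted with sbatch, the job would be sized by the --mem value instead of the small cluster-wide default.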
On 06/17/2015 08:38 PM, Trey Dockendorf wrote:
Re: [slurm-dev] setting DefMemPerCPU in a heterogeneous cluster
We also have a heterogeneous environment, with basically two classes
of nodes in terms of memory per CPU: 2GB/CPU and 4GB/CPU. We
use "background" partitions which access the entire cluster and allow
for opportunistic utilization of otherwise idle CPUs. We found we had
to create one of these partitions for each class of memory/CPU.
PartitionName=DEFAULT Nodes=<long nodelist> DefMemPerCPU=1900 MaxMemPerCPU=2000
PartitionName=background Nodes=<long nodelist> Priority=10 AllowQOS=background MaxNodes=1 MaxTime=96:00:00 State=UP
PartitionName=background-4g Nodes=<long nodelist> Priority=10 AllowQOS=background MaxNodes=1 DefMemPerCPU=3900 MaxMemPerCPU=4000 MaxTime=96:00:00 State=UP
The background partition contains all 2GB/CPU nodes and background-4g
contains all 4GB/CPU. A user can submit to either by doing something
like "sbatch --partition=background,background-4g --qos=background".
There may be a better and/or more clever way of handling such
partitions in a heterogeneous environment, but the above method has
served us well.
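For illustration, the same submission could be written as a batch script along these lines. This is only a sketch: the partition and QOS names are the ones from the configuration above, while the one-hour time limit and the echo line are assumptions added for the example.

```shell
#!/bin/bash
#SBATCH --partition=background,background-4g   # job runs in whichever partition can start it first
#SBATCH --qos=background
#SBATCH --nodes=1                              # both partitions set MaxNodes=1
#SBATCH --time=01:00:00                        # illustrative; must stay within MaxTime=96:00:00

# Slurm sets SLURM_JOB_PARTITION to the partition the job actually landed in.
# The fallback keeps the script runnable outside of a Slurm allocation.
partition="${SLURM_JOB_PARTITION:-unknown}"
echo "Running in partition: ${partition}"
```

The job picks up the DefMemPerCPU of whichever partition it starts in, so no --mem option is needed for the default case.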
- Trey
=============================
Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: [email protected]
Jabber: [email protected]
On Wed, Jun 17, 2015 at 9:22 AM, Daniel Letai <[email protected]> wrote:
Currently I have 2 types of nodes:
old = 2 sockets, 4 cores per socket, 64GB mem
new = 2 sockets, 6 cores per socket, 128GB mem
Since I'm using select/cons_res with CR_CPU_Memory, I thought
I'd assign as the default the relative amount of memory per core:
old - DefMemPerCPU = 8000
new - DefMemPerCPU = 20000
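A quick way to sanity-check such per-core figures is sketched below. It assumes the full node memory is available to jobs and takes 1 GB as 1024 MB; the helper name mem_per_core_mb is hypothetical.

```shell
#!/bin/bash
# Even per-core share of node memory in MB, using integer division.
# Arguments: sockets, cores per socket, node memory in GB.
mem_per_core_mb() {
    local sockets=$1 cores_per_socket=$2 mem_gb=$3
    echo $(( mem_gb * 1024 / (sockets * cores_per_socket) ))
}

old_share=$(mem_per_core_mb 2 4 64)    # old nodes: 8 cores, 64 GB
new_share=$(mem_per_core_mb 2 6 128)   # new nodes: 12 cores, 128 GB
echo "old: ${old_share} MB per core"
echo "new: ${new_share} MB per core"
```

Under these assumptions the even share is 8192 MB per core on the old nodes and 10922 MB on the new ones, so a default of 20000 on all 12 cores of a new node would add up to more than its 128 GB.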
However, those values are part of the partition, not node, definition.
How can I assign those defaults to the cluster, yet define a
single global partition to allow jobs to utilize the entire cluster?
Assume tux[001-100]=old, tux[101-200]=new
I assume something like
PartitionName=Default Nodes=tux[001-100] DefMemPerCPU=8000
PartitionName=Default Nodes=tux[101-200] DefMemPerCPU=20000
PartitionName=compute Nodes=tux[001-200] Default=yes State=up
will not work.
What is the correct way to represent/use this cluster?
The other option I could think of was to set DefMemPerCPU=1 for the
entire cluster and force users to always use --mem, but I'm
hoping to avoid this kind of solution.