After considering this solution further, it's not ideal - a given job will only ever run in one of the partitions, but many jobs require the entire cluster to run.

I think I liked the old behavior (automatic memory over-subscription) better. For now, I'll probably have to define a lowest common denominator (the suggested 50 sounds about right) and force users to specify --mem.
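
As a rough sketch only, that fallback might look like the following in slurm.conf. It assumes the node names from the quoted original message (tux[001-200]), reuses the partition name "compute" from it, and assumes the suggested 50 is meant in MB, the unit DefMemPerCPU uses:

```
# Assumption: one partition spanning both node types, with a deliberately
# small default so that jobs are expected to request memory explicitly.
PartitionName=compute Nodes=tux[001-200] Default=YES DefMemPerCPU=50 State=UP
```

Users would then request memory per job, for example "sbatch --mem=16000 job.sh", where job.sh and the 16000 MB figure are placeholders.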

On 06/17/2015 08:38 PM, Trey Dockendorf wrote:
Re: [slurm-dev] setting DefMemPerCPU in a heterogeneous cluster
We also have a heterogeneous environment, with basically two classes of nodes in terms of the memory/CPU. We have 2GB/CPU and 4GB/CPU. We use "background" partitions which access the entire cluster and allow for opportunistic utilization of otherwise idle CPUs. We found we had to create one of these partitions for each class of memory/CPU.

PartitionName=DEFAULT Nodes=<long nodelist> DefMemPerCPU=1900 MaxMemPerCPU=2000

PartitionName=background Nodes=<long nodelist> Priority=10 AllowQOS=background MaxNodes=1 MaxTime=96:00:00 State=UP
PartitionName=background-4g Nodes=<long nodelist> Priority=10 AllowQOS=background MaxNodes=1 DefMemPerCPU=3900 MaxMemPerCPU=4000 MaxTime=96:00:00 State=UP

The background partition contains all 2GB/CPU nodes and background-4g contains all 4GB/CPU. A user can submit to either by doing something like "sbatch --partition=background,background-4g --qos=background".

There may be a better or more clever way of handling such partitions in a heterogeneous environment, but the above method has served us well.

- Trey

=============================

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: [email protected]
Jabber: [email protected]

On Wed, Jun 17, 2015 at 9:22 AM, Daniel Letai <[email protected]> wrote:


    Currently I have 2 types of nodes:
    old = 2 sockets, 4 cores per socket, 64GB mem
    new = 2 sockets, 6 cores per socket, 128GB mem

    Since I'm using select/cons_res with CR_CPU_Memory, I thought
    I'd assign as default the relative amount of memory per core,
    old - DefMemPerCPU = 8000
    new - DefMemPerCPU = 20000

    However, those values are part of the partition, not node, definition.

    How can I assign those defaults to the cluster, yet define a
    single global partition to allow jobs to utilize the entire cluster?
    Assume tux[001-100]=old, tux[101-200]=new

    I assume something like
    PartitionName=Default Nodes=tux[001-100] DefMemPerCPU=8000
    PartitionName=Default Nodes=tux[101-200] DefMemPerCPU=20000
    PartitionName=compute Nodes=tux[101-200] Default=yes State=up

    will not work.

    What is the correct way to represent/use this cluster?
    The other option I could think of was to set DefMemPerCPU=1 for
    the entire cluster and force users to always use --mem, but I'm
    hoping to avoid this kind of solution.


