I wrote a little blog post on this topic a few years back: https://www.rc.fas.harvard.edu/blog/cluster-fragmentation/

It's a vexing problem, but as the other responders have noted, it depends on your cluster policy and your jobs' performance needs. Well-written MPI code should scale well even when given a non-optimal topology.


You might also look at Node Weights (https://slurm.schedmd.com/slurm.conf.html#OPT_Weight). We use them on our mosaic partitions (mixed hardware generations) so that the latest hardware is left available for larger jobs that need more performance.  You can also use weights to push jobs toward one end of the partition, though the scheduler generally does this on its own.
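
For example, a minimal slurm.conf sketch (the node names, CPU counts and
weight values here are made up; Slurm allocates the lowest-weight nodes
first):

    # Older hardware: low weight, so it fills up first
    NodeName=node[01-32] CPUs=128 Weight=10
    # Newest hardware: high weight, so it is left free for large jobs
    NodeName=node[33-64] CPUs=128 Weight=100
    PartitionName=mosaic Nodes=node[01-64]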


-Paul Edmon-


On 4/9/24 6:45 AM, Cutts, Tim via slurm-users wrote:
Agree with that.  Plus, of course, even if the jobs run a bit slower because they don't have all their cores on a single node, they will be scheduled sooner, so the overall turnaround time for the user will be better, and ultimately that's what they care about.  I've always been of the view, for any scheduler, that the less you try to constrain it, the better.  It really depends on what you're trying to optimise for, but generally speaking I try to optimise for maximum utilisation and throughput, unless I have a specific business case that needs to prioritise particular workloads, in which case I'll compromise on throughput to get the urgent workload through sooner.

Tim
------------------------------------------------------------------------
*From:* Loris Bennett via slurm-users <slurm-users@lists.schedmd.com>
*Sent:* 09 April 2024 06:51
*To:* slurm-users@lists.schedmd.com <slurm-users@lists.schedmd.com>
*Cc:* Gerhard Strangar <g...@arcor.de>
*Subject:* [slurm-users] Re: Avoiding fragmentation
Hi Gerhard,

Gerhard Strangar via slurm-users <slurm-users@lists.schedmd.com> writes:

> Hi,
>
> I'm trying to figure out how to deal with a mix of few- and many-CPU
> jobs. By that I mean most jobs use 128 CPUs, but sometimes there are
> jobs that use only 16. As soon as such a 16-CPU job is running, the
> scheduler splits the next 128-CPU jobs into 96+16 each, instead of
> assigning a full 128-CPU node to them. Is there a way for the
> administrator to make the scheduler prefer full nodes?
> The existence of pack_serial_at_end makes me believe there is not,
> because it is basically what I need, except that my serial jobs use
> 16 CPUs instead of 1.
>
> Gerhard
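
(For reference, the pack_serial_at_end mentioned above is a
SchedulerParameters option in slurm.conf; a minimal sketch, assuming the
select/cons_tres plugin:

    # Place serial (single-CPU) jobs on the nodes at the end of the
    # available node list rather than using the best-fit algorithm
    SelectType=select/cons_tres
    SchedulerParameters=pack_serial_at_end

As noted, it only applies to 1-CPU jobs, not to 16-CPU ones.)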

This may well not be relevant for your case, but we actively discourage
the use of full nodes for the following reasons:

  - When the cluster is full, which is most of the time, MPI jobs will
    in general start much sooner if they don't specify the number of
    nodes and certainly don't request full nodes (see the sketch after
    this list).  The overhead of the job being scattered across nodes
    is often much lower than the additional waiting time incurred by
    requesting whole nodes.

  - When all the cores of a node are requested, all the memory of the
    node becomes unavailable to other jobs, regardless of how much
    memory is requested or indeed how much is actually used.  This holds
    up jobs with low CPU but high memory requirements and thus reduces
    the total throughput of the system.
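
As an illustration, a minimal sketch of the two request styles (the
script name mpi_job.sh, the task count and the memory figure are made
up):

    # Flexible: 128 tasks wherever they fit, with memory per CPU
    sbatch --ntasks=128 --mem-per-cpu=2G mpi_job.sh

    # Rigid: one whole node, exclusively; this also ties up all of
    # that node's memory, whatever the job actually uses
    sbatch --nodes=1 --ntasks-per-node=128 --exclusive mpi_job.sh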

These factors are important for us because we have a large number of
single core jobs and almost all the users, whether doing MPI or not,
significantly overestimate the memory requirements of their jobs.

Cheers,

Loris

--
Dr. Loris Bennett (Herr/Mr)
FUB-IT (ex-ZEDAT), Freie Universität Berlin

------------------------------------------------------------------------
