Hello,

I'm looking for information on Slurm architectural constraints as we are
considering a switch to Slurm.

We are currently running a heavily modified version of Torque with a custom
scheduler.

Our system (~13k CPU cores) is heavily heterogeneous (~40 clusters) with
complex operational constraints. It typically has 10k-50k enqueued jobs
at any given time.

We currently schedule CPU cores, memory, GPU cards, scratch space (local
disk, SSD, and NFS, with different machines having access to different
combinations of these), and software licenses.

Machines are described by a set of physical and software properties, which
users can request, and by their speed (users can request a range of machine
performance).

Jobs carry complex requests. Each job can request several sets of machines,
where each set carries a different specification. A set is described by the
amount of resources requested, by machine properties (negative
specification is supported, for selecting nodes that do not have a given
property), and by the number of nodes matching that specification. Nodes
can be allocated exclusively (in which case the specification describes a
minimum) or shared, with each resource still allocated exclusively (jobs
cannot overlap in cores, memory, GPU cards, and so on).
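From the documentation, Slurm's heterogeneous jobs look like the closest
match for per-set specifications. A sketch of what such a submission might
look like follows (component sizes, features, and program names are made
up, and I am not sure how our negative property specification would be
expressed, since --constraint appears to match features positively):

```shell
#!/bin/bash
# Component 0: two exclusive nodes with a hypothetical "ssd" feature
#SBATCH --nodes=2 --exclusive --constraint=ssd --mem=64G
#SBATCH hetjob
# Component 1: one shared node with two GPUs and one license token
#SBATCH --nodes=1 --gres=gpu:2 --licenses=matlab:1

# Launch a step in each component of the heterogeneous allocation
srun --het-group=0 ./cpu_part : --het-group=1 ./gpu_part
```

Whether this scales to the number of distinct sets our jobs request, and
whether shared allocations stay strictly non-overlapping in every resource,
is exactly what I would like to find out.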

We rely on Kerberos for both identification and authentication. Each
running task inside a job has a nanny process that periodically renews the
Kerberos ticket for that particular task.

For scalability reasons, our scheduler relies on the server to keep an
up-to-date and complete view of the system: the server tracks the current
allocation state of every resource on every node.

Please let me know if this sounds like something Slurm could handle, or if
there are any limitations in Slurm that would make this impossible to
support.

Sincerely,
Simon Toth
