Hello, I'm looking for information on Slurm's architectural constraints, as we are considering a switch to Slurm.
We are currently running a heavily modified version of Torque with a custom scheduler. Our system (~13k CPU cores) is heavily heterogeneous (~40 clusters) and operates under complex operational constraints. It generally holds 10k-50k enqueued jobs.

We currently schedule CPU cores, memory, GPU cards, scratch space (local, SSD, and NFS, with different machines having access to different combinations of these), and software licenses. Machines are described by a set of physical and software properties (which users can request) and by their speed (users can request ranges of machine performance).

Jobs carry complex requests. Each job can request several sets of machines, where each set carries a different specification: the amount of resources requested, machine properties (negative specification is supported, for selecting nodes that do not have a given property), and the number of nodes matching that specification. Nodes can be allocated either exclusively (in which case the specification describes the minimum amounts) or shared, with each resource still allocated exclusively (jobs can never overlap in cores, memory, GPU cards, and so on).

We rely on Kerberos for both identification and authentication. Each running task inside a job has a nanny process that periodically refreshes the Kerberos ticket for that particular process.

For scalability reasons, our scheduler relies on the server keeping an up-to-date and complete view of the system: the server tracks the current allocation state of every resource on every node.

Please let me know whether this sounds like something Slurm could handle, or whether there are limitations in Slurm that would make it impossible to support.

Sincerely,
Simon Toth
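P.S. For concreteness, here is a heavily simplified sketch (hypothetical Python, not our actual code) of the request model and server-side accounting described above: each set carries positive and negative property specifications plus a node count, resources are allocated exclusively, and the server keeps per-node allocation counters.

```python
from dataclasses import dataclass, field

# Hypothetical simplification of our request model; only cores are
# tracked here, but memory, GPUs, scratch, etc. work the same way.

@dataclass
class Node:
    name: str
    properties: set        # physical and software properties
    cores: int             # total cores on the node
    cores_used: int = 0    # server-side allocation counter

    def matches(self, required, forbidden):
        # Positive specification: all required properties present.
        # Negative specification: no forbidden property present.
        return required <= self.properties and not (forbidden & self.properties)

@dataclass
class NodeSet:
    count: int             # number of nodes with this specification
    cores_per_node: int    # resources requested per node
    required: set = field(default_factory=set)
    forbidden: set = field(default_factory=set)   # negative specification

def allocate(job_sets, nodes):
    """Greedy sketch: pick nodes for each set of the job. Resources are
    exclusive, so jobs may never overlap in cores (or memory, GPUs, ...)."""
    chosen = []
    for s in job_sets:
        picked = [n for n in nodes
                  if n not in chosen
                  and n.matches(s.required, s.forbidden)
                  and n.cores - n.cores_used >= s.cores_per_node][:s.count]
        if len(picked) < s.count:
            return None                           # request cannot be satisfied
        for n in picked:
            n.cores_used += s.cores_per_node      # server keeps full state
        chosen += picked
    return chosen
```

The real matching is of course far more involved (speed ranges, shared vs. exclusive nodes, licenses), but this captures the shape of a multi-set request with negative properties.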
