Hi,

We're setting up a new cluster in our faculty and we want to use SLURM together with BLCR. We couldn't figure out how BLCR checkpointing is best used to improve fair-share scheduling.
This is a general scenario that is relevant to many academic research institutes: each research group should get a certain percentage of the nodes; when some groups are not running jobs, their resources should be distributed among the others; and when they suddenly start running jobs again, they should get their share back. The problem with classical scheduling policies without checkpointing is that very long jobs run by one group can monopolize the share of other groups for a very long time (days), and we cannot split long jobs into several shorter ones.

This is how I would imagine the ideal solution: all resources are distributed among the groups that are running jobs at any given moment; when another group submits jobs, its share is immediately freed by automatically checkpointing and evicting the excess jobs of the running groups; and when nodes become available again, the checkpointed jobs are automatically resumed.

I know some places implement a policy where each group owns a high-priority queue for its share of the nodes, and a public low-priority queue allows anyone to run on the unused nodes. The public queue is checkpointed so that jobs can be evicted by the node owner and resumed later (I have sketched this in the P.S. below). However, this solution requires each user to manually split their jobs between the private and public queues, monitor their progress, and redistribute jobs between the queues if one of them turns out to be slower than expected.

It would be desirable for SLURM to manage this automatically, without the complication of two queues. That is, everybody submits to a single queue and the scheduler ensures that every group gets its fair share at any moment, instantaneously, thanks to checkpointing. Is such a solution possible?

Many thanks,
Eyal
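P.S. To make the two-queue workaround concrete, this is roughly how I imagine it would look in slurm.conf, using partition-priority preemption with the BLCR checkpoint plugin. The partition names, node ranges and checkpoint directory below are made up, and I have not tested any of this:

    # Checkpoint/restart via BLCR; checkpoint images go to a shared directory
    CheckpointType=checkpoint/blcr
    JobCheckpointDir=/shared/slurm/checkpoints

    # Preempt based on partition priority; preempted jobs are checkpointed
    PreemptType=preempt/partition_prio
    PreemptMode=CHECKPOINT

    # Each group owns a high-priority partition over its own nodes
    PartitionName=groupA Nodes=node[001-016] Priority=10 PreemptMode=OFF
    PartitionName=groupB Nodes=node[017-032] Priority=10 PreemptMode=OFF

    # A low-priority public partition spans all nodes; its jobs get
    # checkpointed and evicted when an owning group needs its nodes back
    PartitionName=public Nodes=node[001-032] Priority=1 PreemptMode=CHECKPOINT Default=YES

If I understand the documentation correctly, jobs in the "public" partition would be checkpointed when a job in an owner partition needs those nodes, which is exactly the behaviour I would like to get without the users having to choose a partition themselves.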
