I'm trying to understand how QOS limits work: in what order limits are applied and checked, and how one limit may override another. For example, suppose we have:
partition QOS:
    max_jobs_pu = 50
    max_jobs_pa = 75

user foo, account bar association limit:
    max_jobs = 10

Which limits apply? If the user submits to the partition with the above QOS, can the user have up to 50 jobs running as long as only 10 of them are in account bar? If they're the only one submitting to account bar in this QOS, does that override max_jobs_pu and allow them 75 jobs? Or, since the association limit is the most restrictive, do they only get 10 jobs?

The documentation wasn't clear to me on this. In the code, max_jobs_pu/max_jobs_pa don't exist in an association, and max_jobs from the association doesn't seem to exist in a QOS, so it wasn't clear how one would override the other. Any help to clarify this?

Also, the documentation suggests that the partition QOS takes priority over the job QOS, and that if a limit is not set in one, it can be set at the next level. How does SLURM distinguish between an unset parameter and a parameter which is set to unlimited? For example, if I leave the partition QOS MaxJobs unset so that it can be set by the job QOS, how is that different from setting MaxJobs in the job QOS and using a partition QOS MaxJobs of unlimited to override it at the partition level?

Also, if I have a MaxJobs of 4 on a partition and a MaxJobs of 3 for another QOS, and the user gets 4 jobs running under the partition QOS, will they be prevented from running other jobs in another partition using the QOS that allows 3 jobs?

-----
Gary Skouson
