We are facing more or less the same problem. We have historically defined a GRES "localtmp" with the number of GB initially available on the local disk, and jobs then request --gres=localtmp:50 or similar.
That prevents Slurm from scheduling jobs on a node if they ask for more disk than is currently "free" -- in the sense of "not handed out to another job". But it doesn't prevent jobs from using more than they asked for, so the disk may have less real free space than Slurm thinks.

As far as I can see, cgroups does not support limiting used disk space, only the amount of I/O per second and similar. We are therefore considering file system quotas for enforcement. Our localtmp disk is a separate XFS partition, and the idea is to have the prolog set up a "project" disk quota for the job on the localtmp file system, and the epilog remove it again. I'm not 100% sure we will make it work, but I'm hopeful. Fingers crossed! :)

--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
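[Editor's note: a minimal sketch of the prolog/epilog idea described above, using XFS project quotas. The mount point, project-ID scheme, and the way the localtmp GRES value is obtained are all assumptions, not the author's actual scripts; xfs_quota must run as root on a filesystem mounted with the pquota option.]

```shell
#!/bin/bash
# Hypothetical Slurm prolog sketch: cap a job's scratch usage with an
# XFS project quota. Assumes /localtmp is a separate XFS partition
# mounted with the "pquota" mount option.
LOCALTMP_FS=/localtmp
JOBDIR="$LOCALTMP_FS/job_$SLURM_JOB_ID"
PROJID=$((100000 + SLURM_JOB_ID))   # assumed scheme for a unique project ID

# Assumed way to recover the requested GB from the job's GRES string,
# e.g. "localtmp:50" -> 50 (in practice this might come from
# scontrol show job or a site-specific mechanism).
GRES_REQ="localtmp:50"
LIMIT_GB="${GRES_REQ##*:}"

mkdir -p "$JOBDIR"
# Tag the job directory as an XFS "project" and set a hard block limit.
xfs_quota -x -c "project -s -p $JOBDIR $PROJID" "$LOCALTMP_FS"
xfs_quota -x -c "limit -p bhard=${LIMIT_GB}g $PROJID" "$LOCALTMP_FS"

# The matching epilog would clear the limit and delete the directory:
#   xfs_quota -x -c "limit -p bhard=0 $PROJID" "$LOCALTMP_FS"
#   rm -rf "$JOBDIR"
```

With this approach a job that writes past its requested localtmp size gets ENOSPC inside its own directory instead of silently eating space that Slurm has promised to other jobs.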