We are facing more or less the same problem.  We have historically
defined a GRES "localtmp" with the number of GB initially available
on local disk, and jobs then ask for --gres=localtmp:50 or similar.
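A minimal sketch of such a setup (the node name and the 900 GB count
are just examples, adjust for your site):

```shell
# slurm.conf: declare the GRES type and the per-node count
GresTypes=localtmp
NodeName=node[01-10] Gres=localtmp:900 ...

# gres.conf on each node: a plain countable GRES, no File= needed
Name=localtmp Count=900
```

Jobs then request it with e.g. `sbatch --gres=localtmp:50 job.sh`.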

That prevents Slurm from allocating jobs on the cluster if they ask for
more disk than is currently "free" -- in the sense of "not handed out to
a job".  But it doesn't prevent jobs from using more than they have
asked for, so the disk might have less real free space than Slurm
thinks.

As far as I can see, cgroups does not support limiting used disk space,
only I/O bandwidth, IOPS, and the like.

We are currently considering using file system quotas to enforce
this.  Our localtmp disk is a separate xfs partition, and the idea is to
have the prolog set up a "project" disk quota for the job on the
localtmp file system, and the epilog remove it again.
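Roughly what we have in mind -- a sketch, not a tested recipe.  It
assumes /localtmp is a separate xfs partition mounted with the
prjquota option, that each job gets a directory named after its job
id, and that the requested size in GB has already been extracted into
$GB (e.g. by parsing the localtmp GRES out of `scontrol show job -d
$SLURM_JOB_ID`):

```shell
#!/bin/bash
# Prolog sketch: create the job's scratch dir, mark it as an xfs
# "project" (reusing the numeric job id as the project id), and set
# a hard block limit equal to what the job asked for.
mkdir -p "/localtmp/$SLURM_JOB_ID"
xfs_quota -x -c "project -s -p /localtmp/$SLURM_JOB_ID $SLURM_JOB_ID" /localtmp
xfs_quota -x -c "limit -p bhard=${GB}g $SLURM_JOB_ID" /localtmp
```

```shell
#!/bin/bash
# Epilog sketch: drop the limit again and clean up the directory.
xfs_quota -x -c "limit -p bhard=0 $SLURM_JOB_ID" /localtmp
rm -rf "/localtmp/$SLURM_JOB_ID"
```

Both scripts need root and an xfs file system, so this only runs on
the compute nodes themselves.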

I'm not 100% sure we will make it work, but I'm hopeful.  Fingers
crossed! :)

-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
