Hello all,
I want to control the usage of the local disk of our execution nodes. As
far as I have found, the only related option offered by SGE is the
h_fsize limit. But that will not work, because it only limits the maximum
size of each file a job creates on any filesystem, be it the local disk
or the NFS shared volume.
What I came up with is:
1- Create a load sensor for the usage percentage of the local disk of
each host.
2- Add that sensor to the Suspend Threshold of all queues.
3- Create a consumable attribute "local_disk", with default value = 0KB
(most jobs won't make any use of it)
4- Set the value of "local_disk" in each host
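For step 1, here is a rough sketch of what the load sensor could look like
in Python. As far as I understand the load sensor protocol, the execution
daemon writes a line to the sensor's stdin whenever it wants a report, the
sensor answers with a "begin" / "host:name:value" / "end" block, and it
exits when it reads "quit". The attribute name "local_disk_used" and the
path "/tmp" are my own placeholders, and the host name may need to match
whatever name SGE uses for the node.

```python
import shutil
import socket


def disk_used_percent(path="/tmp"):
    """Return the used share of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def format_report(host, name, value):
    """Build one load report in the begin / host:name:value / end framing."""
    return "begin\n{}:{}:{:.1f}\nend\n".format(host, name, value)


def run(inp, out, path="/tmp", name="local_disk_used"):
    """Answer each input line with a report until "quit" or end of input."""
    host = socket.gethostname()  # assumption: matches the SGE host name
    for line in inp:
        if line.strip() == "quit":
            break
        out.write(format_report(host, name, disk_used_percent(path)))
        out.flush()  # the daemon reads from a pipe, so do not buffer
```

To use it as an actual sensor, the script would end with a call such as
run(sys.stdin, sys.stdout) (plus "import sys"), and "local_disk_used"
would have to be defined as a complex attribute before it can appear in
the Suspend Threshold of step 2.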
That way, whenever a job is sent, if it requests no disk space, nothing
happens. If the job explicitly requests disk space, the job will be
scheduled to a host with enough free space. If that job exceeds the
requested disk space, "usually" nothing will happen. But if the job
exceeds its disk space in a node with several other jobs using that
disk, instead of filling the disk and crashing the jobs for lack of
space, all jobs will be suspended until the problem is manually fixed.
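To make the intended behaviour concrete, here is a toy model in Python.
It is not SGE code, only the bookkeeping I expect from a consumable plus
a suspend threshold. The class, the function names and the 90% threshold
are made up for illustration, and the model ignores details such as how
many jobs get suspended per interval.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    local_disk_kb: int          # consumable capacity set on the host
    suspend_pct: float = 90.0   # hypothetical threshold on disk usage
    reserved_kb: int = 0        # sum of requests of the running jobs
    jobs: list = field(default_factory=list)

    def can_run(self, request_kb):
        """A job fits if its request does not exceed what is left."""
        return self.reserved_kb + request_kb <= self.local_disk_kb

    def start(self, job, request_kb=0):
        """Book the requested space and remember the job."""
        self.reserved_kb += request_kb
        self.jobs.append(job)

    def suspended(self, used_pct):
        """Jobs suspended when the measured usage crosses the threshold."""
        return list(self.jobs) if used_pct >= self.suspend_pct else []


def pick_host(hosts, request_kb=0):
    """Return the first host with enough unreserved space, or None."""
    for host in hosts:
        if host.can_run(request_kb):
            return host
    return None
```

Note that the reservation only tracks what jobs asked for, while the
threshold looks at what is really on the disk. A job that requested
nothing but writes a lot is only caught by the threshold.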
I understand that this is not a true resource limit as with h_vmem, and
it requires manual conflict resolution.
Does anyone have a better idea?
Thanks in advance,
Txema
PS: Another possible option I thought about would be a prolog script
(and the epilog cleanup equivalent) that, before the job starts:
1- Creates a group for the jobid, and assigns the group to the user.
2- Creates a group quota for the local disk with the requested
local_disk value
But that would be much more complicated and could add some unwanted
complexity to the whole system.
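A dry-run sketch of what such a prolog and epilog might execute, written
in Python. It only builds the command lines and does not run anything.
The group naming scheme "sgejob<jobid>" and the filesystem "/scratch" are
my assumptions, and I am assuming setquota takes its block limits in 1 KB
blocks. One open issue: a group quota is charged by the group ownership
of the files, so the job would have to create its files with that group
(for example via a setgid scratch directory).

```python
def quota_commands(job_id, user, local_disk_kb, filesystem="/scratch"):
    """Commands a prolog could run before the job starts (dry run)."""
    group = "sgejob{}".format(job_id)   # hypothetical naming scheme
    limit = str(int(local_disk_kb))     # soft and hard block limit
    return [
        ["groupadd", group],
        ["usermod", "-a", "-G", group, user],
        # setquota -g <group> <block-soft> <block-hard> <inode-soft>
        #          <inode-hard> <filesystem>; 0 means no limit
        ["setquota", "-g", group, limit, limit, "0", "0", filesystem],
    ]


def cleanup_commands(job_id, filesystem="/scratch"):
    """Commands the matching epilog could run after the job ends."""
    group = "sgejob{}".format(job_id)
    return [
        ["setquota", "-g", group, "0", "0", "0", "0", filesystem],
        ["groupdel", group],
    ]
```

Each list could be passed to subprocess.run(cmd, check=True) by a prolog
running as root, but I have not tested that.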