Hello all,

I want to control usage of the local disk on our execution nodes. As far as I can tell, the only related option SGE offers is the h_fsize limit. That will not work, though, because it only caps the size of each individual file a job creates, on any filesystem, be it the local disk or the NFS shared volume.

What I came up with is:
1- Create a load sensor for the usage percentage of the local disk of each host.
2- Add that sensor to the Suspend Threshold of all queues.
3- Create a consumable attribute "local_disk", with default value = 0KB (most jobs won't make any use of it)
4- Set the value of "local_disk" on each host
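
As an illustration of step 1, here is a minimal sketch of such a load sensor in Python. It follows the usual load sensor protocol: execd writes a line to the sensor's stdin at each load interval, the sensor answers with a "begin" / "host:name:value" / "end" report, and it exits when it reads "quit". The complex name "local_disk_used_pct" and the path "/tmp" are assumptions made for the example, not names from any existing setup, and the host name reported must match the one Grid Engine uses for the node.

```python
#!/usr/bin/env python3
"""Sketch of a load sensor that reports local-disk usage in percent."""
import shutil
import socket
import sys

# Assumptions: both values are placeholders to adapt to the local setup.
LOCAL_DISK_PATH = "/tmp"
COMPLEX_NAME = "local_disk_used_pct"


def used_percent(path):
    """Return the used share (0-100) of the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def format_report(host, name, value):
    """Build one load report in the begin / host:name:value / end framing."""
    return "begin\n{}:{}:{:.1f}\nend\n".format(host, name, value)


def run_sensor(inp=sys.stdin, out=sys.stdout, path=LOCAL_DISK_PATH, host=None):
    """Answer every input line with a report until "quit" is read."""
    host = host or socket.gethostname()
    for line in inp:
        if line.strip() == "quit":
            break
        out.write(format_report(host, COMPLEX_NAME, used_percent(path)))
        out.flush()  # execd reads the report immediately, so do not buffer
```

A deployed script would end with a call to run_sensor(). The complex would also have to be defined and then referenced in the suspend_thresh of each queue (step 2).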

That way, a job that requests no disk space is scheduled as usual. A job that explicitly requests disk space is scheduled to a host with enough free space. If that job exceeds its requested disk space, usually nothing happens. But if it does so on a node where several other jobs are also using the disk, then instead of the disk filling up and the jobs crashing for lack of space, all jobs on that node are suspended until the problem is fixed manually. I understand that this is not a true resource limit like h_vmem, and that it requires manual conflict resolution.
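
To make the bookkeeping concrete, here is a toy model in Python of how a consumable like "local_disk" would gate scheduling. It is not SGE code: the function name, host names, and numbers are invented for the illustration. It also shows why this is not a hard limit: only the requested amounts are counted, never what the jobs actually write.

```python
def eligible_hosts(capacity_kb, running, request_kb):
    """Return the hosts whose unreserved local_disk covers the request.

    capacity_kb: {host: local_disk value configured on that host, in KB}
    running:     list of (host, requested_kb) for jobs already placed
    request_kb:  local_disk requested by the new job (0 if not requested)
    """
    remaining = dict(capacity_kb)
    for host, reserved_kb in running:
        remaining[host] -= reserved_kb  # each job debits what it requested
    return sorted(h for h, free in remaining.items() if free >= request_kb)


# Hypothetical cluster: node01 already has 80000 KB reserved by one job.
capacity = {"node01": 100000, "node02": 50000}
running = [("node01", 80000)]

print(eligible_hosts(capacity, running, 30000))  # ['node02']
print(eligible_hosts(capacity, running, 0))      # ['node01', 'node02']
```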

Does anyone have a better idea?

Thanks in advance,

Txema

PS: Another option I thought about would be a prolog script (with a matching epilog for cleanup) that, before the job starts:
1- Creates a group for the jobid, and assigns the group to the user.
2- Creates a group quota for the local disk with the requested local_disk value

But that would be much more complicated and could add unwanted complexity to the whole system.
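
For reference, a rough sketch in Python of the commands such a prolog and epilog might run. It only builds the command lines and does not execute them. The group naming scheme "sgejob<jobid>" and the mount point in the usage line are assumptions for the example. The commands themselves (groupadd, usermod, gpasswd, groupdel, setquota) need root, and a group quota only counts files owned by that group, so the job's files would also have to be created under it, which this sketch does not handle.

```python
def job_group(job_id):
    """Hypothetical naming scheme for the per-job group."""
    return "sgejob{}".format(job_id)


def prolog_commands(job_id, user, request_kb, filesystem):
    """Commands to create the job group and cap its disk usage."""
    group = job_group(job_id)
    blocks = str(int(request_kb))  # setquota block limits are in 1 KiB blocks
    return [
        ["groupadd", group],
        ["usermod", "-a", "-G", group, user],
        # Arguments: soft and hard block limits, then soft and hard inode
        # limits (0 means no inode limit).
        ["setquota", "-g", group, blocks, blocks, "0", "0", filesystem],
    ]


def epilog_commands(job_id, user, filesystem):
    """Commands to undo what the prolog did once the job has finished."""
    group = job_group(job_id)
    return [
        ["setquota", "-g", group, "0", "0", "0", "0", filesystem],
        ["gpasswd", "-d", user, group],
        ["groupdel", group],
    ]


# Example: print the prolog commands for a made-up job.
for cmd in prolog_commands(1234, "txema", 500000, "/scratch"):
    print(" ".join(cmd))
```

Each list could be passed to subprocess.run(cmd, check=True) from the actual prolog or epilog.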
