Hi,

On 29.02.2012 at 18:07, Txema Heredia Genestar wrote:

> I want to control the usage of the local disk of our execution nodes. As far 
> as I have found, the only related option offered by SGE is the h_fsize limit. 
> But that will not work because it only limits the maximum size of any file 
> created on any filesystem, be it the local disk or the NFS shared 
> volume.
> 
> What I came up with is:
> 1- Create a load sensor for the usage percentage of the local disk of each 
> host.
> 2- Add that sensor to the Suspend Threshold of all queues.
> 3- Create a consumable attribute "local_disk", with default value = 0KB (most 
> jobs won't make any use of it)
> 4- Set the value of "local_disk" in each host
> 
> That way, whenever a job is sent, if it requests no disk space, nothing 
> happens. If the job explicitly requests disk space, the job will be scheduled 
> to a host with enough free space. If that job exceeds the requested disk 
> space, "usually" nothing will happen. But if the job exceeds its disk space 
> in a node with several other jobs using that disk, instead of filling the 
> disk and crashing the jobs due to lack of space, all jobs will be suspended 
> until the problem is manually fixed.
> I understand that this is not a true resource limit as with h_vmem, and it 
> requires manual conflict resolution.
> 
> Does anyone have a better idea?

A load sensor is covered in:

http://arc.liv.ac.uk/SGE/howto/loadsensor.html

I use it as a load_threshold, so that a queue instance stops accepting new jobs 
once tmpfree falls below 1 GB in /tmp.

In addition you can make tmpfree consumable and attach an initial value to each 
exechost which can be requested. 
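
A minimal sketch of such a load sensor, written in Python here rather than as 
the shell script shown in the howto. It follows the protocol described there: 
a line on stdin triggers one report, `quit` ends the sensor, and each report 
is wrapped in `begin` and `end` lines with one `host:name:value` line in 
between. The name `tmpfree`, the path `/tmp`, reporting the value in bytes and 
using `socket.gethostname()` as the host name are assumptions for 
illustration; the host name has to match the one SGE uses for the node.

```python
import shutil
import socket
import sys


def free_bytes(path):
    """Return the free space of the filesystem holding `path`, in bytes."""
    return shutil.disk_usage(path).free


def format_report(host, name, value):
    """Build one load report: begin / host:name:value / end."""
    return "begin\n{}:{}:{}\nend\n".format(host, name, value)


def reports(lines, path="/tmp", name="tmpfree", host=None):
    """Yield one report per input line until a line containing `quit` is read."""
    host = host or socket.gethostname()
    for line in lines:
        if line.strip() == "quit":
            return
        yield format_report(host, name, free_bytes(path))


if __name__ == "__main__":
    # readline() returns "" at EOF, which ends the loop if the execd goes away.
    for report in reports(iter(sys.stdin.readline, "")):
        sys.stdout.write(report)
        sys.stdout.flush()  # the execd waits for the complete report
```

The reported name must also exist as a complex before it can be used in a 
threshold or requested by a job.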


> Thanks in advance,
> 
> Txema
> 
> PS: Another possible option I thought about would be a prolog script (and the 
> epilog cleanup equivalent) that, before the job starts:
> 1- Creates a group for the jobid, and assigns the group to the user.
> 2- Creates a group quota for the local disk with the requested local_disk 
> value

And then terminates the job?


> But that would be much more complicated and could add some unwanted 
> complexity to the whole system.

Do your users stay in $TMPDIR? Then I think it would be easier to run a 
`du -s *.all.q` and check whether any job is above its request.
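
A rough sketch of that check, assuming the per-job scratch directories sit 
side by side under one root and are named like `1234.1.all.q`. The function 
names and the `requested_bytes` mapping are hypothetical; the requested values 
would have to be collected from the scheduler separately. It sums apparent 
file sizes, so the numbers can differ slightly from the block usage that `du` 
reports.

```python
import os


def dir_usage_bytes(path):
    """Sum the apparent sizes of all files below `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # the file vanished while the job was still running
    return total


def over_request(scratch_root, requested_bytes, queue="all.q"):
    """Return {job_dir: usage} for job directories above their request.

    `requested_bytes` maps a directory name such as "1234.1.all.q" to the
    number of bytes the job asked for (hypothetical input). Jobs without an
    entry are treated as having requested 0 bytes.
    """
    offenders = {}
    for entry in sorted(os.listdir(scratch_root)):
        full = os.path.join(scratch_root, entry)
        if not entry.endswith("." + queue) or not os.path.isdir(full):
            continue
        usage = dir_usage_bytes(full)
        if usage > requested_bytes.get(entry, 0):
            offenders[entry] = usage
    return offenders
```

What to do with the offenders (mail the user, suspend or delete the job) is 
left open here.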

NB: There is a suspend_threshold for queues, but unfortunately not for each 
individual job on its own.

===

Another approach, if the jobs stay on one node:

- in the job prolog create a file with the requested space
- format and mount it on $TMPDIR as a loop device
- in the epilog it can be removed again

Well, creating and formatting will take some time, but the jobs can never 
exceed the requested space and it's guaranteed to be available.
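
The prolog and epilog steps could look roughly like the sketch below. It only 
builds the command lines; `run_all` executes them and needs root privileges, 
so the prolog and epilog would have to be configured to run as root. The 
image path, the choice of ext4 and the helper names are assumptions for 
illustration.

```python
import subprocess


def prolog_commands(image, mountpoint, size_mb, owner):
    """Commands that create, format and mount a fixed-size scratch filesystem."""
    return [
        # Write real zeros instead of a sparse file so the space is reserved now.
        ["dd", "if=/dev/zero", "of=" + image, "bs=1M", "count=" + str(size_mb)],
        # -F lets mkfs work on a regular file, -q suppresses its output.
        ["mkfs.ext4", "-q", "-F", image],
        ["mount", "-o", "loop", image, mountpoint],
        # The root of the new filesystem belongs to root; hand it to the job owner.
        ["chown", owner, mountpoint],
    ]


def epilog_commands(image, mountpoint):
    """Commands that unmount the scratch filesystem and delete its image."""
    return [["umount", mountpoint], ["rm", "-f", image]]


def run_all(commands):
    """Run each command in order and stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Filesystem metadata takes up part of the image, so the space usable by the job 
is somewhat smaller than the requested size.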

-- Reuti

> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

