Hi,

On 22.04.2014 at 01:17, Ilya M wrote:

>>> I have been using h_vmem as a consumable resource to limit the amount of 
>>> memory users can request and to make sure jobs don't use more than they 
>>> requested. It all has been working fine until we added nodes with GPU 
>>> modules.
>>> 
>>> The memory model of CUDA applications is such that the virtual address 
>>> space is expanded to provide a single unified address space for regular 
>>> host memory and GPU memory (and apparently something more, because I saw 
>>> reported memory usage twice the size of the virtual memory).
>>> 
>>> This results in jobs getting killed for exceeding h_vmem, whereas in 
>>> fact the actual memory usage was really low. This effectively renders 
>>> h_vmem useless, because it either kills jobs that should not be killed 
>>> or needs to be set very high (2-3 times the size of the virtual memory) 
>>> for the host and jobs in order to allow jobs to run.
>>> 
>>> I was wondering if there were good solutions or practices on controlling 
>>> memory usage on GPU-equipped nodes. I am using SGE 6.2u5.
>> Do you need to control the jobs' behavior by using h_vmem? An alternative 
>> could be to define virtual_free as a consumable with a value of 2 or 3 
>> times the physical memory per exechost, so that it helps SGE schedule jobs 
>> to the best-suited node. Jobs won't get killed when they exceed the 
>> requested resource, though.
>> 
> I had created a custom consumable to track available/used memory on a node 
> without the possibility of killing a job. This should help SGE correctly 
> dispatch jobs. I think this is the same thing you're suggesting. I was hoping 
> there was something else I overlooked.

Using a custom consumable is almost the same. The small difference is that for 
virtual_free the measured load value is still taken into account, and the 
tighter constraint will be used. In `qhost -F` you can see this from the prefix 
of the value, which is either hl: or hc: depending on the actual values.
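To make the suggestion concrete, a sketch of the setup (the host name node01 and the 96G capacity are placeholders; the complex entry for virtual_free normally already exists in a stock SGE installation, it only needs its consumable flag enabled):

```shell
# Open the complex configuration and mark virtual_free as consumable;
# the relevant line should end up looking like:
#   virtual_free  vf  MEMORY  <=  YES  YES  0  0
qconf -mc

# On each GPU exechost, set a capacity of 2-3x the physical RAM, e.g.:
#   complex_values  virtual_free=96G
qconf -me node01

# Check which bound is currently in effect for the host:
# an "hl:" prefix means the measured load value is the tighter limit,
# "hc:" means the remaining consumable capacity is.
qhost -F virtual_free
```

Jobs then request the resource at submission time, e.g. `qsub -l virtual_free=8G ...`, and SGE decrements the host's capacity accordingly without ever killing a job that exceeds its request.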

-- Reuti


