On 08/29/2013 01:02 PM, Guillermo Marco Puche wrote:

> On 08/29/2013 12:40 PM, Reuti wrote:
>> On 08/29/2013 12:18 PM, Guillermo Marco Puche wrote:
>> 
>> 
>>> Hello,
>>> 
>>> I'm having a lot of trouble working out the ideal SGE queue 
>>> configuration for my cluster.
>>> 
>>> The cluster layout is the following:
>>> 
>>> submit_host:
>>>     • frontend
>>> execution hosts:
>>>     • compute-0-0 (8 cpus - 120GB+ RAM)
>>>     • compute-0-1 (8 cpus - 32GB RAM)
>>>     • compute-0-2 (8 cpus - 32GB RAM)
>>>     • compute-0-3 (8 cpus - 32GB RAM)
>>>     • compute-0-4 (8 cpus - 32GB RAM)
>>>     • compute-0-5 (8 cpus - 32GB RAM)
>>> 
>>> My idea for the configuration was:
>>> 
>>> medium_priority.q:
>>>  - All pipeline jobs will run by default on this queue.
>>>  - All hosts are available for this queue.
>>> 
>>> high_priority.q:
>>>  - All hosts are available for this queue.
>>>  - It has the authority to suspend jobs in medium_priority.q.
>>> 
>>> non_suspendable.q:
>>>  - This queue is used for specific jobs which run into trouble if they're 
>>> suspended.
>>>  - Problem: if a lot of non-suspendable jobs run in the same queue, they can 
>>> exceed the memory thresholds.
>>> 
>> Why are so many jobs running at the same time that they exceed the available 
>> memory - are they requesting their estimated amount of memory?
> OK, the problem is that the memory load of a process varies depending on the 
> input data. I really cannot predict the amount of memory a process is going 
> to use. Those processes "eat" memory slowly: they start with very low memory 
> usage and take more and more memory as they run.
> 
> I think the ideal solution would then be to set the maximum memory a job can 
> use without being suspended.

You can limit h_vmem (either per job or at the queue level), but once it's 
exceeded the job will be killed. I think it's not possible to set up something 
like: suspend only jobs consuming more than 8 GB. And it wouldn't help anyway: 
the memory would already be gone, as suspended jobs keep their allocation.
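
E.g. a minimal sketch (the queue name and the 8G value are just examples):

    # optionally make h_vmem a consumable in the complex ("qconf -mc"),
    # so the scheduler accounts for it per host

    # queue-level hard limit, excerpt of "qconf -mq medium_priority.q":
    h_vmem    8G

    # or request it per job at submission time:
    qsub -l h_vmem=8G job.sh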


>>>  - If they're non-suspendable, I guess both high_priority.q and 
>>> medium_priority.q must be subordinate to this queue (see the sketch below).
>>> 
>>> memory.q:
>>>  - A specific queue with just the slots of compute-0-0, to submit jobs to 
>>> the big-memory node.
>>> 
>>> The problem is that I have to set up this schema while enabling memory 
>>> thresholds so the compute nodes don't crash from excessive memory load.
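>>> 
>>> For the subordination I have something like this in mind (just a sketch):
>>> 
>>>     # excerpt of "qconf -mq high_priority.q": suspend medium_priority.q
>>>     # on a host as soon as one high_priority.q slot there is in use
>>>     subordinate_list    medium_priority.q=1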
>>> 
>> If they really use 12 GB of swap they are already oversubscribing the memory 
>> by far. Nowadays having a swap of 2 GB is common. How large is your swap 
>> partition?
>> 
>> Jobs which are already in the system will still consume the resources they 
>> have allocated. Nothing is freed.
>> 
> 16 GB of Swap per node.

I think that's too much for today's systems. Years ago the rule of thumb was 
even twice the size of the real memory.

Are these serial or parallel jobs? Maybe it would help to allow only 4 (serial) 
ones on an 8-core machine, as memory is the limiting factor.
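
E.g. a sketch (the queue name and the value 4 are just examples):

    # excerpt of "qconf -mq medium_priority.q":
    slots    4

Or, as several queues overlap on the same hosts, a resource quota set 
("qconf -arqs") could cap the total across all queues per host:

    {
       name         max_slots_per_host
       enabled      TRUE
       limit        hosts {*} to slots=4
    }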

-- Reuti 


>> 
>> - Putting a queue into an alarm state via a load_threshold will prevent new 
>> jobs from being dispatched to this node. All already running ones will 
>> continue to run as usual.
>> 
>> - Setting a suspend_threshold will only prevent the suspended jobs from 
>> consuming more.
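>> 
>> E.g. in the queue definition (a sketch, the values are only examples; for 
>> mem_free the alarm triggers when the free memory drops below the threshold):
>> 
>>     load_thresholds     np_load_avg=1.75,mem_free=2G
>>     suspend_thresholds  mem_free=1G
>>     nsuspend            1
>>     suspend_interval    00:05:00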
>> 
>> -- Reuti
>> 
>> 
>> 
>>> I'm also confused about oversubscription and thresholds. How do I combine 
>>> them?
>>> 
>>> I've read the basic Oracle SGE manual, but I still feel unsure. Are there 
>>> any example configurations to test? Do you think this configuration is 
>>> viable? Any suggestions?
>>> 
>>> Thank you very much.
>>> 
>>> Best regards,
>>> Guillermo.
> 
> 
> -- 
> Guillermo Marco Puche
> 
> Bioinformatician, Computer Science Engineer.
> Sistemas Genómicos S.L.
> Phone: +34 902 364 669
> Fax: +34 902 364 670
> www.sistemasgenomicos.com
> 
> 


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
