On 29.08.2013, at 13:43, Guillermo Marco Puche wrote:

> Hello Reuti,
>
> I just thought that maybe the solution for the memory issue, since I cannot
> predict a job's memory consumption, would be setting policies in
> "qconf -srqs". How can I set an RQS policy with a rule like "a job cannot be
> scheduled to a node unless at least 2 GB of virtual memory is free"? I guess
> virtual memory is the correct resource for this.
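For reference, a resource quota rule set as listed by `qconf -srqs` has roughly the following shape (the rule set name and the 2G value are only illustrative, and this assumes virtual_free is configured as a consumable complex; an RQS caps what jobs *request*, it does not measure the actual free memory on a host):

```
{
   name         illustrative_vmem_limit
   description  "Cap the total virtual_free consumable requested per host"
   enabled      TRUE
   limit        hosts {*} to virtual_free=2G
}
```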
This would go into the load_thresholds setting of the queue. Something like:

    load_thresholds    virtual_free=2G

The "alarm state" of a queue instance just means that it became disabled
because a load_threshold was exceeded. Maybe "disable_threshold" would have
been a clearer name. (The system load I would even leave out of the
thresholds. IMO it is useful for large SMP machines where you allow e.g. 72
slots on a 64-core system - as not all parallel jobs scale well, it can be
possible to run more processes than installed cores without any penalty.)

-- Reuti

> Regards,
>
> Guillermo.
>
> On 08/29/2013 01:02 PM, Guillermo Marco Puche wrote:
>> On 08/29/2013 12:40 PM, Reuti wrote:
>>> On 29.08.2013, at 12:18, Guillermo Marco Puche wrote:
>>>
>>>> Hello,
>>>>
>>>> I'm having a lot of trouble trying to work out the ideal SGE queue
>>>> configuration for my cluster.
>>>>
>>>> The cluster layout is the following:
>>>>
>>>> submit host:
>>>> • frontend
>>>>
>>>> execution hosts:
>>>> • compute-0-0 (8 CPUs - 120GB+ RAM)
>>>> • compute-0-1 (8 CPUs - 32GB RAM)
>>>> • compute-0-2 (8 CPUs - 32GB RAM)
>>>> • compute-0-3 (8 CPUs - 32GB RAM)
>>>> • compute-0-4 (8 CPUs - 32GB RAM)
>>>> • compute-0-5 (8 CPUs - 32GB RAM)
>>>>
>>>> My idea was to configure it like this:
>>>>
>>>> medium_priority.q:
>>>> - All pipeline jobs run on this queue by default.
>>>> - All hosts are available for this queue.
>>>>
>>>> high_priority.q:
>>>> - All hosts are available for this queue.
>>>> - It has the authority to suspend jobs in medium_priority.q.
>>>>
>>>> non_suspendable.q:
>>>> - This queue is used for specific jobs which run into trouble if
>>>> they're suspended.
>>>> - Problem: if a lot of non-suspendable jobs run in the same queue,
>>>> they may exceed the memory thresholds.
>>>>
>>> Why are so many jobs running at the same time when they exceed the
>>> available memory - are they requesting the estimated amount of memory?
>> Ok, the problem is that the memory footprint of a process varies
>> depending on the input data.
>> I really cannot predict the amount of memory a process is going to use.
>> Those processes start to "eat" memory slowly: they begin with very low
>> memory usage and take more and more as they run.
>>
>> I think the ideal solution would then be to set the maximum memory a job
>> can use without being suspended.
>>>
>>>> - If they're non-suspendable, I guess both high_priority.q and
>>>> medium_priority.q must be subordinate to this queue.
>>>>
>>>> memory.q:
>>>> - Specific queue with just the compute-0-0 slots, for submitting jobs
>>>> to the big-memory node.
>>>>
>>>> The problem is that I have to set up this schema while enabling
>>>> memory-usage thresholds so the compute nodes don't crash under
>>>> excessive memory load.
>>>>
>>> If they really use 12 GB of swap, they are already oversubscribing the
>>> memory by far. Nowadays having a swap of 2 GB is common. How large is
>>> your swap partition?
>>>
>>> Jobs which are already in the system will still consume the resources
>>> they have allocated. Nothing is freed.
>>>
>> 16 GB of swap per node.
>>> - Putting a queue into alarm state via a load_threshold will prevent new
>>> jobs from going to that node. All running ones will continue to run as
>>> usual.
>>>
>>> - Setting a suspend_threshold will only prevent a suspended job from
>>> consuming more.
>>>
>>> -- Reuti
>>>
>>>> I'm also confused about oversubscription and thresholds. How do I
>>>> combine them?
>>>>
>>>> I've read the basic Oracle SGE manual but I still feel unsure. Are
>>>> there any example configurations to test? Do you think this
>>>> configuration is viable? Any suggestions?
>>>>
>>>> Thank you very much.
>>>>
>>>> Best regards,
>>>> Guillermo.
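A minimal sketch of how the subordination and threshold pieces of the schema above could look in the queue configurations (each queue is edited with `qconf -mq <queue>`; the 2G threshold value is an assumption for illustration):

```
# In non_suspendable.q: suspend jobs in the lower-priority queues on the
# same host while this queue has running jobs
subordinate_list    medium_priority.q high_priority.q

# In medium_priority.q / high_priority.q: stop dispatching new jobs to a
# host once its free virtual memory falls below 2 GB, and suspend running
# jobs if it falls below that as well
load_thresholds     virtual_free=2G
suspend_thresholds  virtual_free=2G
```

As noted in the discussion, a suspend_threshold only stops a suspended job from consuming more; memory the job has already allocated is not freed.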
>>>> _______________________________________________
>>>> users mailing list
>>>> [email protected]
>>>> https://gridengine.org/mailman/listinfo/users
