Hello,
It seems that virtual_free is working very well for the queues!
Thank you very much for your support. I love this mailing list; I always end
up solving all my problems.
Best regards,
Guillermo.
On 08/30/2013 09:17 AM, Guillermo Marco Puche wrote:
On 08/29/2013 05:52 PM, Reuti wrote:
On 29.08.2013 at 13:43, Guillermo Marco Puche wrote:
Hello Reuti,
Since I cannot predict the job memory consumption, I thought the solution for memory
usage might be setting policies via "qconf -srqs".
How can I set an RQS policy with a rule like "a job cannot be scheduled on a node
unless at least 2 GB of virtual memory is free"? I guess virtual memory is the right
resource to use for this.
This would go into the load_thresholds setting of the queue. Something like:
load_thresholds virtual_free=2G
"alarm state" of a queue instance just means, that it became disabled as a load_threshold
was bypassed. Maybe "disable_threshold" would be clearer.
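For example, on the command line this could look like the following (the queue name
and the 2G value are just placeholders for your setup):

    # set/replace the load threshold on an existing queue
    qconf -mattr queue load_thresholds virtual_free=2G medium_priority.q
    # verify
    qconf -sq medium_priority.q | grep load_thresholds

With that in place, a queue instance stops receiving new jobs once the reported
virtual_free of its host drops below 2 GB; already running jobs are not affected.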
(The system load I would even leave out. IMO this is useful for large SMP
machines where you allow e.g. 72 slots on a 64-core system - since not all parallel
jobs scale well, it may be possible to start more processes than installed cores
without any penalty.)
-- Reuti
Thank you for the clarification, Reuti. I'm going to try the
virtual_free threshold and remove the load threshold, as you suggested in
the other e-mail.
Guillermo.
Regards,
Guillermo.
On 08/29/2013 01:02 PM, Guillermo Marco Puche wrote:
On 08/29/2013 12:40 PM, Reuti wrote:
On 29.08.2013 at 12:18, Guillermo Marco Puche wrote:
Hello,
I'm having a lot of trouble trying to work out the ideal SGE queue configuration
for my cluster.
The cluster layout is the following:
submit_host:
. frontend
execution hosts:
. compute-0-0 (8 cpus - 120GB+ RAM)
. compute-0-1 (8 cpus - 32GB RAM)
. compute-0-2 (8 cpus - 32GB RAM)
. compute-0-3 (8 cpus - 32GB RAM)
. compute-0-4 (8 cpus - 32GB RAM)
. compute-0-5 (8 cpus - 32GB RAM)
My idea for configuring it was:
medium_priority.q:
- All pipeline jobs will run by default on this queue.
- All hosts are available for this queue.
high_priority.q:
- All hosts are available for this queue.
- It has the authority to suspend jobs in medium_priority.q
non_suspendable.q:
- This queue is used for specific jobs that run into trouble if they're
suspended.
- Problem: if a lot of non-suspendable jobs run in the same queue, they can
exceed the memory thresholds.
Why are so many jobs running at the same time that they exceed the available
memory - are they requesting their estimated amount of memory?
OK, the problem is that the memory load of a process varies depending on the input
data. I really cannot predict the amount of memory a process is going to use. Those
processes start to "eat" memory slowly: they begin with very low memory usage and,
as they run, take more and more memory.
I think the ideal solution would then be to set the maximum memory a job can use
without being suspended.
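A possible sketch for capping per-job memory (untested here, and note that a hard
h_vmem limit kills a job that exceeds it instead of suspending it, so it is only an
alternative to suspension) would be to make h_vmem consumable and request it per job:

    # in "qconf -mc", mark h_vmem as consumable, e.g.:
    #   h_vmem   h_vmem   MEMORY   <=   YES   YES   0   0
    # then give every exec host its real memory as the capacity:
    qconf -mattr exechost complex_values h_vmem=32G compute-0-1
    # and request memory at submission time (my_job.sh is just a placeholder):
    qsub -l h_vmem=4G my_job.sh

The scheduler then only places a job on a host that still has enough h_vmem left.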
- If they're non-suspendable, I guess both high_priority.q and
medium_priority.q must be subordinates of this queue (see the sketch below).
memory.q:
- Specific queue with slots only on compute-0-0, to submit jobs to the
big-memory node.
The problem is that I have to set up this schema while also enabling memory
thresholds, so that the compute nodes don't crash from excessive memory load.
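A rough sketch of how the subordination and the memory.q host restriction could be
configured (queue names as above; the =1 slot counts are just illustrative):

    # suspend jobs of the two lower queues on a host as soon as
    # one slot of non_suspendable.q is in use on that host
    qconf -mattr queue subordinate_list "medium_priority.q=1,high_priority.q=1" non_suspendable.q
    # restrict memory.q to the big-memory node
    qconf -mattr queue hostlist compute-0-0 memory.q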
If they really use 12 GB of swap they are already oversubscribing the memory by
far. Nowadays having a swap of 2 GB is common. How large is your swap partition?
Jobs that are already in the system will still consume the resources they have
already allocated. Nothing is freed.
16 GB of Swap per node.
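For reference, the per-node swap size can be checked with standard Linux tools, e.g.:

    free -g      # memory and swap totals in GiB
    swapon -s    # per-device swap summary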
- Putting a queue into the alarm state via a load_threshold will prevent new jobs
from being scheduled to that node. All running ones will continue to run as usual.
- Setting a suspend_threshold will only prevent the suspended jobs from consuming
more.
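Put together, the relevant part of a queue configuration ("qconf -sq <queue>") might
then look like this - the values are placeholders, not recommendations:

    load_thresholds       virtual_free=2G    # alarm: no new jobs scheduled here
    suspend_thresholds    virtual_free=1G    # start suspending jobs when crossed
    nsuspend              1                  # jobs suspended per suspend_interval
    suspend_interval      00:05:00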
-- Reuti
I'm also confused about oversubscription and thresholds. How do I combine
them?
I've read the basic Oracle SGE manual but I still feel unsure. Are there any
example configurations to test? Do you think this configuration is viable? Any
suggestions?
Thank you very much.
Best regards,
Guillermo.
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users