[gridengine users] array job / node allocation / 'spread' question

2014-04-02 Thread Tina Friedrich
Hello, I'm sure this has been asked time and time before, only I can't find it (search foo failing, somehow). What's the best way to run an array job so that each task ends up on a different node (but the tasks run concurrently)? I don't mind other jobs running on the nodes at the time, but

Re: [gridengine users] array job / node allocation / 'spread' question

2014-04-02 Thread Skylar Thompson
An exclusive host consumable is the right way to approach the problem. If the task elements might be part of a parallel environment, then you'll want to set the scaling to JOB as well.
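For reference, a host consumable of the kind Skylar describes might be set up roughly like this. The complex name "spread", the host name, and the job script are illustrative, not from the thread; this is a sketch, not a tested recipe:

```shell
# Define a per-host consumable (add this line to the complex list via
# "qconf -mc"); consumable=JOB means it is charged once per job rather
# than once per slot, which matters for PE jobs:
#
#   name    shortcut  type  relop  requestable  consumable  default  urgency
#   spread  spread    INT   <=     YES          JOB         0        0

# Give each execution host exactly one unit of the consumable:
qconf -aattr exechost complex_values spread=1 node01

# Submit the array so each task claims a host's single unit, spreading
# tasks across nodes without blocking the rest of each node:
qsub -t 1-100 -l spread=1 myjob.sh
```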

Re: [gridengine users] array job / node allocation / 'spread' question

2014-04-02 Thread MacMullan, Hugh
As an alternative, you could create a simple queue (onejobpernode) with 'slots 1'. -Hugh
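Hugh's queue-based alternative would look roughly like this; a sketch only, and every attribute value beyond 'slots 1' is illustrative:

```shell
# Create the queue and give it a single slot per host, so at most one
# task of a job submitted to it can land on each node:
qconf -aq onejobpernode
# In the editor that opens, set (illustrative values):
#   qname      onejobpernode
#   hostlist   @allhosts
#   slots      1

# Then route the array job through that queue at submit time:
qsub -t 1-100 -q onejobpernode myjob.sh
```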

Re: [gridengine users] array job / node allocation / 'spread' question

2014-04-02 Thread Tina Friedrich
If I call them with '-l exclusive' they will be the only task on that node, though. I don't want that - I simply want that, for my job array, no more than one slot on a node is used. (No need to block the rest of the node for other jobs.) But I'm thinking some sort of consumable

Re: [gridengine users] array job / node allocation / 'spread' question

2014-04-02 Thread Skylar Thompson
You can still get the consumable to work if you don't have it set by default. Only people who need to run this type of job would request it. When it's requested, the host's consumable would decrement, making the host ineligible to run another instance of the job until the current one is done.
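The decrement behaviour described here can be observed directly; a sketch, again assuming a host consumable named "spread" with a default of 0 (the name and job script are illustrative):

```shell
# Because the default is 0, jobs that never request the consumable
# leave it untouched; only explicit '-l spread=1' requests decrement it:
qsub -t 1-10 -l spread=1 array_job.sh

# Watch per-host availability: a host showing "hc:spread=0" is already
# running one task of the array and won't receive another until it ends.
qstat -F spread
```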

Re: [gridengine users] array job / node allocation / 'spread' question

2014-04-02 Thread Tina Friedrich
...yes, I could. I'm trying to achieve this without messing with the cluster config too much. Still, it is an option. Now I've got three and no idea which one would be best :) Tina

[gridengine users] sge_qmaster uses too much memory and becomes unresponsive

2014-04-02 Thread Peskin, Eric
All, We are running SGE 2011.11 on a CentOS 6.3 cluster. Twice now we've had the following experience: -- Any new jobs submitted sit in the qw state, even though there are plenty of nodes available that could satisfy the requirements of the jobs. -- top reveals that sge_qmaster is eating way too

Re: [gridengine users] sge_qmaster uses too much memory and becomes unresponsive

2014-04-02 Thread Skylar Thompson
We encountered a similar problem with GE 6.2u5, and it turned out to be a bug in schedd_job_info in sched_conf. Disabling it made our problems go away. We don't depend heavily on schedd_job_info; most of the time using -w v with qalter or qsub is sufficient.
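A sketch of the workaround Skylar describes; schedd_job_info is the real sched_conf attribute, while the job script and job id are made up:

```shell
# 1) Disable the expensive "scheduling info" collection in the
#    scheduler configuration:
qconf -msconf          # in the editor, set: schedd_job_info false

# 2) When you do need to know why a job won't be scheduled, ask for
#    verification on demand instead:
qsub -w v myjob.sh     # verify-only: reports schedulability, does not submit
qalter -w v 12345      # same check against an already-queued job
```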

Re: [gridengine users] sge_qmaster uses too much memory and becomes unresponsive

2014-04-02 Thread Chris Dagdigian
Same symptoms seen at one of my clients just yesterday. Programmatic scripts that send a small number of jobs into qsub that all use a threaded PE or similar. Our cluster routinely runs much larger workloads all the time. Our sge_qmaster ran the master node out of memory and was killed hard by

Re: [gridengine users] sge_qmaster uses too much memory and becomes unresponsive

2014-04-02 Thread Joshua Baker-LePain
On Wed, 2 Apr 2014 at 11:27am, Peskin, Eric wrote: "The generated scripts themselves look like reasonable job scripts. The only twist is using our threaded parallel environment and asking for a range of slots. An example job is:" I hit the same thing -- see

Re: [gridengine users] sge_qmaster uses too much memory and becomes unresponsive

2014-04-02 Thread Skylar Thompson
Yep, we use slot ranges for both PE jobs and array jobs. The more complex the job request, the more likely it was that sge_qmaster would explode.
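An illustrative submission of the kind being discussed; "threaded" is the PE name mentioned in the thread, while the task count, slot range, and script name are made up:

```shell
# Array job + threaded PE + slot *range*: for every task the scheduler
# must weigh many (host x slot-count) combinations, which is the kind
# of request that stressed sge_qmaster in these reports.
qsub -t 1-50 -pe threaded 4-16 myjob.sh
```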

[gridengine users] SGE starter_method breaks OpenMPI

2014-04-02 Thread bergman
We're bringing up SoGE 8.1.6 and I've run into a problem with the use of a 'starter_method' that's affecting OpenMPI jobs. Following previous discussions on the list[1], we're using the 'environment modules' package, and using a starter_method to initialize the user's environment as if it was a

Re: [gridengine users] SGE starter_method breaks OpenMPI

2014-04-02 Thread Adam Tygart
We use a starter method here, and it looks something like this:

function execute_normal_job() {
    # Start the job normally with proper login shell handling
    if [ $SGE_STARTER_USE_LOGIN_SHELL == true ]
    then
        exec -l $SGE_STARTER_SHELL_PATH -c ${@}
    else
        exec $SGE_STARTER_SHELL_PATH -c
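Reconstructed and lightly corrected, the quoted starter method is a sketch like this. The variable names (SGE_STARTER_USE_LOGIN_SHELL, SGE_STARTER_SHELL_PATH) come from the snippet; the defaults are assumptions, and "$*" replaces the unquoted ${@} because -c expects a single string:

```shell
#!/bin/bash
# Hedged reconstruction of the starter method quoted above; defaults
# and quoting are assumptions, not the poster's exact script.
execute_normal_job() {
    # Run the job command under the configured shell; "-l" makes it a
    # login shell so /etc/profile (and environment modules) get sourced.
    if [ "${SGE_STARTER_USE_LOGIN_SHELL:-false}" = "true" ]; then
        exec -l "${SGE_STARTER_SHELL_PATH:-/bin/bash}" -c "$*"
    else
        exec "${SGE_STARTER_SHELL_PATH:-/bin/bash}" -c "$*"
    fi
}

# In the real starter_method script this would be followed by:
#   execute_normal_job "$@"
```

Joining the arguments with "$*" sidesteps the bug in the original `-c ${@}`: with more than one argument, -c would take only the first word as the command and treat the rest as positional parameters.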

Re: [gridengine users] SGE starter_method breaks OpenMPI

2014-04-02 Thread Reuti
Hi, On 02.04.2014 at 22:19, berg...@merctech.com wrote: We're bringing up SoGE 8.1.6 and I've run into a problem with the use of a 'starter_method' that's affecting OpenMPI jobs. Following previous discussions on the list[1], we're using the 'environment modules' package, and using a