[gridengine users] Grid Engine or Grid Scheduler?

2012-04-24 Thread Joseph Farran
Hi Grid folks. As per my earlier post, I am a newbie to OGE, but not to schedulers in general. To keep things clear, what is the official name for this software? Is it Open Grid Engine (OGE), or Open Grid Scheduler

[gridengine users] Grid Engine or Grid Scheduler?

2012-04-24 Thread Joseph Farran
Resending in text format: Hi Grid folks. As per my earlier post, I am a newbie to OGE, but not to schedulers in general. To keep things clear, what is the official name for this software? Is it Open Grid Engine (OGE), or Open Grid Scheduler (OGS)? I downloaded the tarball executable and

[gridengine users] Installing OGE on Rocks Login Node

2012-05-09 Thread Joseph Farran
Hello. I have a cluster running Rocks 5.4.3 that I originally set up with Torque/Maui. I am testing Open Grid Scheduler using the ge2011.11.tar distribution. I set up OGE on the master head node and was also able to set up 6 compute nodes using start_gui_installer on the head node. All 6

Re: [gridengine users] Installing OGE on Rocks Login Node

2012-05-10 Thread Joseph Farran
-Original Message- From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On Behalf Of Joseph Farran Sent: Wednesday, May 09, 2012 4:10 PM To: users@gridengine.org Subject: [gridengine users] Installing OGE on Rocks Login Node Hello. I have a cluster running Rocks 5.4.3

Re: [gridengine users] Default Submit Directory Shell

2012-05-31 Thread Joseph Farran
Adding these lines at the end of oge-dir/default/common/sge_request: -cwd -S /bin/bash works and does what I was looking for. Nice! Thanks. On 05/31/2012 05:40 PM, Joseph Farran wrote: Cool! A lot easier than I thought. So a default shell can also be specified so that the batch script
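For reference, a minimal sketch of the file in question (path as given above; one option per line, # starts a comment):

    # $SGE_ROOT/default/common/sge_request -- cluster-wide submit defaults
    -cwd            # run jobs from the submit directory
    -S /bin/bash    # use bash as the job shell

Individual jobs can still override either option on the qsub command line.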

[gridengine users] OGE Spooling Directory

2012-06-04 Thread Joseph Farran
Hi All. When installing OGE with respect to the Spooling Configuration, one can select: Qmaster spool directory Global execd spool directory I installed OGE from the head node on a shared NFS directory ( /data/oge ) and would like the spooling to be on the head node's /var file system

Re: [gridengine users] OGE Spooling Directory

2012-06-04 Thread Joseph Farran
Thanks Reuti. On 06/04/2012 02:51 PM, Reuti wrote: Hi, Am 04.06.2012 um 22:59 schrieb Joseph Farran: When installing OGE with respect to the Spooling Configuration, one can select: Qmaster spool directory Global execd spool directory I installed OGE from the head node on a shared

[gridengine users] Parallel Environment

2012-06-04 Thread Joseph Farran
Hi All. I am trying to understand the OGE parallel environment. I am coming from Torque/PBS, where one simply asks for the number of nodes and cores (ppn) when running a parallel program like an MPI job. With OGE, it appears that a parallel environment must first be set up for each parallel

Re: [gridengine users] OGE Spooling Directory

2012-06-05 Thread Joseph Farran
/default/spool Spooling: classic Using the NFS share directory for Global execd, then everything works just fine - compute nodes are set up correctly. What am I doing wrong? Joseph On 06/04/2012 02:51 PM, Reuti wrote: Hi, Am 04.06.2012 um 22:59 schrieb Joseph Farran: When installing OGE

Re: [gridengine users] OGE Spooling Directory

2012-06-05 Thread Joseph Farran
OGE is owned by ogeadmin, so /var/spool/oge on the compute node needs to exist *and* be owned by ogeadmin. Joseph On 06/05/2012 08:53 AM, Reuti wrote: Am 05.06.2012 um 17:47 schrieb Joseph Farran: My OGE software resides on a shared NFS directory /data/hpc/oge. When I run

Re: [gridengine users] Parallel Environment

2012-06-05 Thread Joseph Farran
On 06/04/2012 07:29 PM, Rayson Ho wrote: On Mon, Jun 4, 2012 at 9:08 PM, Joseph Farran jfar...@uci.edu wrote: You can do something similar by defining a generic PE and you can then use a generic name. If I create a generic PE, say parallel, can the parallel name be the default if no PE name

[gridengine users] Linux Groups

2012-06-08 Thread Joseph Farran
Greetings. How does one give more than one Linux group access to an OGE queue? So if I have Linux groups staff, bio, and chem, how do I make my test queue accessible only by these 3 groups? What kind of Q type do I set up?

Re: [gridengine users] Linux Groups

2012-06-08 Thread Joseph Farran
On 06/08/2012 10:25 AM, Reuti wrote: You can make one ACL containing these three Unix groups: $ qconf -au @staff foobar $ qconf -au @bio foobar $ qconf -au @chem foobar $ qconf -mattr queue user_lists foobar test -- Reuti Perfect! Exactly what I was looking for. Will test it soon.
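A cleaned-up sketch of Reuti's recipe (ACL name foobar and queue name test are from the snippet; the leading @ marks a Unix group rather than a user):

    qconf -au @staff foobar   # add group staff to access list foobar
    qconf -au @bio   foobar
    qconf -au @chem  foobar
    qconf -mattr queue user_lists foobar test   # attach the ACL to queue test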

Re: [gridengine users] Linux Groups

2012-06-08 Thread Joseph Farran
On 06/08/2012 11:19 AM, Rayson Ho wrote: but if Joseph is OK with using a cron job to sync. membership then I can leave it aside for now - I will need to work on a few more urgent things but will have more time later this month. Rayson Hi Rayson. If you are asking of the compute nodes

[gridengine users] Collecting of scheduler job information is turned off

2012-06-08 Thread Joseph Farran
Me again :-) The Queue access list by Linux groups ( /etc/group ) is working perfectly! I submitted a test job to the bio queue from an account that has bio group membership and the job runs. When I submit a test job to the bio queue from an account that does *not* belong to the bio Linux

[gridengine users] Understanding Parallel Enviroment ( whole nodes )

2012-06-08 Thread Joseph Farran
Greetings. I am trying to set up my MPI Parallel Environment so that whole nodes are filled before going to the next node when looking for cores. Our nodes have 64 cores. What I would like is that if I ask for 128 cores (slots), one compute node is selected with 64 cores, and then the next one with 64
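A hedged sketch of a PE that fills a host before moving to the next one (the PE name mpi is hypothetical; allocation_rule is the key line):

    qconf -sp mpi
    pe_name            mpi
    slots              9999
    allocation_rule    $fill_up   # pack slots onto one host before using the next
    control_slaves     TRUE
    job_is_first_task  FALSE

Note that $fill_up packs but does not guarantee whole nodes on a busy cluster; a fixed numeric allocation_rule does (see the later thread on allocation_rule 48).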

[gridengine users] PE Job Suspend / Resume

2012-06-11 Thread Joseph Farran
Hi. With the help of this group, I've been able to make good progress on setting up OGE 2011.11 on our cluster. I am testing the Suspend / Resume features: it works great for serial jobs, but I am not able to get parallel jobs suspended. I created a simple Parallel Environment (PE) called mpi

Re: [gridengine users] PE Job Suspend / Resume

2012-06-11 Thread Joseph Farran
Thanks for the clarification. This is a NAMD run, so I am launching it via charmrun and not mpirun. If the OGE code suspends via rank 0, I would think that charmrun and/or any other parallel job would suspend as well, no? I will try an mpirun job next to see if it behaves differently and

Re: [gridengine users] PE Job Suspend / Resume

2012-06-12 Thread Joseph Farran
Well, for our needs, we *REALLY* need parallel job suspension. It's not even a choice for us. If Torque/Maui can do it, I am sure OGE can do it without issues. Can someone please tell me what patch I need to install to un-break / turn on parallel job suspension? If you guys are that

[gridengine users] Suspending Job Arrays with qmon

2012-06-13 Thread Joseph Farran
Hi. I am able to successfully suspend a job array with: qmod -sj job-id. But how does one do this using qmon, the graphical tool? How does one select the entire job array and not just the single job-array entries that show up in qmon? Joseph
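On the command line the whole array is addressed by one job id (12345 is hypothetical):

    qmod -sj  12345   # suspend every task of array job 12345
    qmod -usj 12345   # unsuspend it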

[gridengine users] ge2011.11

2012-06-14 Thread Joseph Farran
Greetings. At http://gridscheduler.sourceforge.net under the download Grid Engine, there is a: Grid Engine 2011.11 binary for x64 ( now with the GUI installer ) Is this the latest available release to date? Also, how easy or hard is it to upgrade OGE once it is installed and in

[gridengine users] Subordinate Queues

2012-06-15 Thread Joseph Farran
Greetings. I am playing with OGE subordinate queues and I can't seem to get it right. All my nodes have 64 cores and I set all my nodes to pack jobs with: qconf -rattr exechost complex_values slots=64 node1 ( repeat for all other nodes ) The scheduler is then set with Load Formula

Re: [gridengine users] Subordinate Queues

2012-06-15 Thread Joseph Farran
On 06/15/2012 09:48 AM, Rayson Ho wrote: On Fri, Jun 15, 2012 at 12:29 PM, Reuti re...@staff.uni-marburg.de wrote: And just want to add that if no new jobs are sent to the super-ordinate queue, then the sub-ordination process would never kick in. Which is why Reuti mentioned the queue vs. host

[gridengine users] $SGE_ROOT/$SGE_CELL/common/sge_qstat

2012-06-18 Thread Joseph Farran
Hello. I would like to make qstat run qstat -u * by default, to see all user jobs. I added: -u * to our /oge-path/default/common/sge_qstat and it does not seem to work. qstat does not show all users' jobs, but it does if I say qstat -u * Also, adding the above command to

Re: [gridengine users] $SGE_ROOT/$SGE_CELL/common/sge_qstat

2012-06-18 Thread Joseph Farran
Thanks, that did the trick! After making this the system-wide default, how can an individual user change it back to just show their own jobs? In one account, I created: $ cat ~/.sge_qstat -u $USER Trying to switch it back to only listing this one user's qstat jobs, but I still get the system

Re: [gridengine users] $SGE_ROOT/$SGE_CELL/common/sge_qstat

2012-06-18 Thread Joseph Farran
Yes, that makes sense. I wanted to have a global default but then let savvy users *undo* the global setup. So what I am going to do is create a local ~/.sge_qstat with -u * and let the users who want to change the default simply remove this file, or change it to their liking. Best, Joseph
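A sketch of the resulting setup (file name from the thread):

    $ cat ~/.sge_qstat   # skeleton placed in each home directory; delete or edit to taste
    -u *

As Joseph found, options from the global sge_qstat file and the per-user file appear to accumulate rather than override each other, hence carrying the default in the per-user file instead.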

[gridengine users] Possible OGE Bug with Subordinate Field when using OGE GUI

2012-06-21 Thread Joseph Farran
I may have found a bug in OGE ( 2011.11 ). I have a queue called owner that I modified using: qconf -mq owner To set the subordinate_list field with slots=8(free:4:sr). I confirm the new change with: # qconf -sq owner | grep subordinate subordinate_list slots=8(free:4:sr)
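For reference, my reading of the slotwise subordination syntax used here (values from the snippet):

    subordinate_list  slots=8(free:4:sr)
    # 8    -- per-host slot threshold that triggers suspension
    # free -- name of the subordinate queue
    # 4    -- sequence number among subordinate queues
    # sr   -- action: suspend the shortest-running task first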

Re: [gridengine users] Possible OGE Bug with Subordinate Field when using OGE GUI

2012-06-21 Thread Joseph Farran
On 06/21/2012 08:19 AM, Reuti wrote: Am 21.06.2012 um 16:25 schrieb Joseph Farran: If this is normal, what happened to the entry slots=8(free:4:sr)? Or is this a bug? I would say so, it's already in 6.2u5. To make sure we are on the same page: when you say I would say so, are you saying

Re: [gridengine users] Possible OGE Bug with Subordinate Field when using OGE GUI

2012-06-21 Thread Joseph Farran
Well, this is a horrible bug, especially for anyone starting out with OGE. Imagine if this were a software program where every time a user ran emacs (editor) on the code *without* making any changes, the editor would automatically make changes to the code. If I had any control over this software

[gridengine users] 8 slot PE Job only suspends 1 slot and not 8 slots

2012-06-21 Thread Joseph Farran
Hi. I am playing with subordinate queues. I have defined an owner queue and a free queue. The owner queue has: # qconf -sq owner | grep subordinate subordinate_list slots=8(free:0:sr) If I submit a 1-core job to the owner queue, OGE suspends a 1-core (slot) job from the free queue. If I submit

Re: [gridengine users] 8 slot PE Job only suspends 1 slot and not 8 slots

2012-06-25 Thread Joseph Farran
Thanks Dave. The list helps to know what bugs I am dealing with in my particular OGE setup. With respect to the following bugs: - 6953013 slotwise preemption does not take amount of slots of pe tasks into account when unsuspend/suspend - 6932534 slotwise suspend on subordinate with parallel

[gridengine users] Consumable for Packing Jobs on Nodes

2012-06-25 Thread Joseph Farran
Howdy. Our environment uses a mixture of parallel jobs, job arrays and serial jobs, parallel jobs being the biggest, followed by job arrays. For parallel jobs, one of the worst things a scheduler can do is to spread 1-core jobs evenly across all nodes based on node load, because you get severe

Re: [gridengine users] Packing jobs on nodes v2

2012-06-28 Thread Joseph Farran
Hi. This approach for load_formula seems ideal for our needs, but I cannot get it to work with ge2011.11. Has anyone tried this with ge2011.11, and does it work? Joseph On 05/14/2012 11:49 AM, Stuart Barkley wrote: On our systems we use: % qconf -ssconf ... queue_sort_method

[gridengine users] New version of GE2011 ?

2012-07-02 Thread Joseph Farran
Hello. I remember reading that an updated version of GE2011.11 was going to be released by the end of June. When I go to http://gridscheduler.sourceforge.net and click on the download link, I don't see any new version. The only version I see is GE2011.11 dated 2011-11-04. Did I read wrong

Re: [gridengine users] New version of GE2011 ?

2012-07-02 Thread Joseph Farran
When is GE 2011.11 update 1 with cgroups planned to be released? On 07/02/2012 01:56 PM, Rayson Ho wrote: GE 2011.11 patch 1 was released a while ago (back in April or May I believe): http://dl.dropbox.com/u/47200624/GE2011.11p1/GE2011.11p1.tar.gz But we did not release certified binaries

Re: [gridengine users] Default Shell bash not always found

2012-07-10 Thread Joseph Farran
On 07/10/2012 11:38 AM, Rayson Ho wrote: On Tue, Jul 10, 2012 at 1:48 PM, Joseph Farran jfar...@uci.edu wrote: I was using the same identical script, so it's still a mystery why the script ran on some nodes while it failed on others, but now with this change it works on all nodes and that is

Re: [gridengine users] Default Shell bash not always found

2012-07-11 Thread Joseph Farran
On 07/10/2012 01:24 PM, Rayson Ho wrote: On Tue, Jul 10, 2012 at 3:51 PM, Reuti re...@staff.uni-marburg.de wrote: Joseph, To debug the difference in behavior: 1) make sure that you can always reproduce the job failure. 2) then submit jobs to a node that fails the job to a node that does

Re: [gridengine users] Saving OGE Configuration

2012-07-16 Thread Joseph Farran
On 07/16/2012 01:51 PM, Reuti wrote: There is a script in $SGE_ROOT/util/upgrade_modules/save_sge_config.sh -- Reuti Running the command where the sge_qmaster is running (the admin host), it says: # mkdir /root/oge-backup-test # ./util/upgrade_modules/save_sge_config.sh

Re: [gridengine users] .sge_request

2012-07-17 Thread Joseph Farran
On 07/17/2012 03:22 PM, Joseph Farran wrote: $ od -c OUT.o5223.1 000 T e s t \n 033 [ H 033 [ J 013 Do you know where this is coming from? False alarm - it was coming from a call to another script from the job.

[gridengine users] Reservation default for Array Jobs

2012-07-25 Thread Joseph Farran
Hello. I would like to set the default reservation option for array jobs to be No: #-R n Is this possible via the sge_request global file, and if so, what is the syntax to do this (only for job arrays)? Or is this something I need to do in the prolog.sh startup part, and if so, what is the

Re: [gridengine users] Reservation default for Array Jobs

2012-07-25 Thread Joseph Farran
Ok, I ~think~ this is the default for all jobs with OGE ( no reservation ). So let me turn this around: how can I set job reservation for parallel jobs to be on by default? #-R y Only for PE jobs? On 07/25/2012 11:57 AM, Joseph Farran wrote: Hello. I would like to set the default
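A global default can only be applied to every job, e.g. (sketch; same file as in the earlier sge_request thread):

    # $SGE_ROOT/default/common/sge_request
    -R y

Restricting it to PE jobs would need something job-aware, such as a server-side JSV that enables reservation only when a pe_name is requested; sge_request itself cannot distinguish job types.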

Re: [gridengine users] $SGE_STDOUT_PATH $SGE_STDERR_PATH but for PE Environments?

2012-08-01 Thread Joseph Farran
Hi Rayson / Reuti. I have an epilog set up to clear out empty .o and .e files, so I was using the GE environment variables to check for said files and act accordingly. Since they are not defined in GE, I am manually checking for and removing them with: $JOB_NAME.pe$JOB_ID $JOB_NAME.po$JOB_ID

[gridengine users] start_gui_installer ( re-adding nodes )

2012-08-01 Thread Joseph Farran
Hi. I originally ran start_gui_installer, which is a great and easy GUI tool for adding compute nodes. What is the proper way to re-add nodes, but from the command line? I am running Rocks 5.4.3, and when a node is re-imaged all is gone, so is there an easy way to re-add the node via command

Re: [gridengine users] start_gui_installer ( re-adding nodes )

2012-08-01 Thread Joseph Farran
Thanks Simon and Rayson. That was pretty much what I was doing. On my new freshly installed node, I placed a copy of my sgeexecd.HPC in /etc/init.d, ran chkconfig to make sure it starts up on the next boot, created /var/spool/oge, and starting oge would then create the compute-x-x directory and

[gridengine users] Subordinate Queue Job Packing

2012-08-03 Thread Joseph Farran
Hi. I pack jobs onto nodes using the following GE setup: # qconf -ssconf | egrep queue|load queue_sort_method seqno job_load_adjustments NONE load_adjustment_decay_time 0 load_formula slots I also set my nodes with

Re: [gridengine users] Subordinate Queue Job Packing

2012-08-03 Thread Joseph Farran
On 08/03/2012 09:18 AM, Reuti wrote: Am 03.08.2012 um 18:04 schrieb Joseph Farran: I pack jobs onto nodes using the following GE setup: # qconf -ssconf | egrep queue|load queue_sort_method seqno job_load_adjustments NONE load_adjustment_decay_time

Re: [gridengine users] Subordinate Queue Job Packing

2012-08-03 Thread Joseph Farran
On 08/03/2012 09:57 AM, Reuti wrote: Am 03.08.2012 um 18:50 schrieb Joseph Farran: On 08/03/2012 09:18 AM, Reuti wrote: Am 03.08.2012 um 18:04 schrieb Joseph Farran: I pack jobs onto nodes using the following GE setup: # qconf -ssconf | egrep queue|load queue_sort_method

Re: [gridengine users] Subordinate Queue Job Packing

2012-08-03 Thread Joseph Farran
ONE core jobs suspend 16 single core jobs. Nasty and wasteful! On 08/03/2012 10:10 AM, Joseph Farran wrote: On 08/03/2012 09:57 AM, Reuti wrote: Am 03.08.2012 um 18:50 schrieb Joseph Farran: On 08/03/2012 09:18 AM, Reuti wrote: Am 03.08.2012 um 18:04 schrieb Joseph Farran: I pack jobs

Re: [gridengine users] Subordinate Queue Job Packing

2012-08-03 Thread Joseph Farran
suspending another 7 cores. If job-packing with subordinate queues were available, job #8585 would have started on compute-3-2 since it has cores available. Two single ONE core jobs suspend 16 single core jobs. Nasty and wasteful! On 08/03/2012 10:10 AM, Joseph Farran wrote: On 08/03/2012 09

Re: [gridengine users] Subordinate Queue Job Packing

2012-08-03 Thread Joseph Farran
Ok, it's not that difficult to set up a load sensor in GE, and I ~think~ I figured out how to tell the cores in use by a node. Best, Joseph On 08/03/2012 01:03 PM, Joseph Farran wrote: Great! Will it work for both parallel and single core jobs? If yes, is there such a load sensor available

Re: [gridengine users] Subordinate Queue Job Packing

2012-08-03 Thread Joseph Farran
. If job-packing with subordinate queues were available, job #8585 would have started on compute-3-2 since it has cores available. Two single ONE core jobs suspend 16 single core jobs. Nasty and wasteful! On 08/03/2012 10:10 AM, Joseph Farran wrote: On 08/03/2012 09:57 AM, Reuti wrote: Am

Re: [gridengine users] Subordinate Queue Job Packing

2012-08-03 Thread Joseph Farran
with subordinate queues were available, job #8585 would have started on compute-3-2 since it has cores available. Two single ONE core jobs suspend 16 single core jobs. Nasty and wasteful! On 08/03/2012 10:10 AM, Joseph Farran wrote: On 08/03/2012 09:57 AM, Reuti wrote: Am 03.08.2012 um 18:50 schrieb Joseph

Re: [gridengine users] Subordinate Queue Job Packing

2012-08-03 Thread Joseph Farran
Found the issue. If I start with the count being the number of cores and count down, then it works. On 8/3/2012 4:29 PM, Joseph Farran wrote: I created a load sensor and it is reporting accordingly. Not sure if I got the sensor options correct? # qconf -sc | egrep cores_in_use cores_in_use
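A minimal load-sensor sketch following the protocol execd expects (the complex name cores_in_use is from the thread; the metric shown is only a placeholder):

    #!/bin/sh
    # report a value for the host-level complex cores_in_use
    host=$(hostname)
    while read -r line; do
        [ "$line" = quit ] && exit 0
        # placeholder metric: count processes currently using >50% of a CPU
        used=$(ps -e -o pcpu= | awk '$1 > 50 {n++} END {print n+0}')
        echo begin
        echo "$host:cores_in_use:$used"
        echo end
    done

execd pokes the sensor once per load interval and reads everything between begin and end; the sensor must keep looping until it reads quit.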

[gridengine users] qrsh modules environment

2012-08-04 Thread Joseph Farran
Howdy. In reading about GE qrsh, it looks like qrsh sets a minimal path: $ qrsh echo '$PATH' /scratch/1125.1.user:/usr/local/bin:/bin:/usr/bin To add additional paths, one can do: qrsh -v PATH=/extra:/usr/local/bin:/bin:/usr/bin:/new-path echo '$PATH'
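A common variant is to forward the caller's own value rather than hard-coding the list (sketch):

    qrsh -v PATH="$PATH" echo '$PATH'

-v exports a single named variable into the job; GE2011.11's qrsh lacked qsub's -V (export everything), as a later message in this archive shows.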

[gridengine users] load_formula and PE jobs

2012-08-09 Thread Joseph Farran
Howdy. I am using GE2011.11. I am successfully using the GE load_formula to place jobs by core count via my own load_sensor script. All works as expected with single-core jobs; however, for PE jobs, it seems as if GE does not abide by the load_formula. Does the scheduler use a different load

Re: [gridengine users] load_formula and PE jobs

2012-08-11 Thread Joseph Farran
pe_list qname bio pe_list make mpi openmp slots 64 Thanks for taking a look at this! On 8/11/2012 4:32 AM, Reuti wrote: Am 11.08.2012 um 02:57 schrieb Joseph Farran jfar...@uci.edu: Reuti, Are you sure this works in GE2011.11? I have defined my own

Re: [gridengine users] load_formula and PE jobs

2012-08-11 Thread Joseph Farran
Clarification: In the example I just posted, I updated my scheduler queue_sort_method from seq_no to load to make sure the scheduler sort method was not using the queue sequence number. On 8/11/2012 11:30 AM, Joseph Farran wrote: Yes, all my queues have the same 0 for seq_no

Re: [gridengine users] load_formula and PE jobs

2012-08-12 Thread Joseph Farran
On 8/11/2012 1:51 PM, Reuti wrote: Am 11.08.2012 um 20:30 schrieb Joseph Farran: Yes, all my queues have the same 0 for seq_no. Here is my scheduler load formula: qconf -ssconf algorithm default schedule_interval 0:0:15 maxujobs

Re: [gridengine users] load_formula and PE jobs

2012-08-12 Thread Joseph Farran
a look at this! On 8/11/2012 4:32 AM, Reuti wrote: Am 11.08.2012 um 02:57 schrieb Joseph Farran jfar...@uci.edu: Reuti, Are you sure this works in GE2011.11? I have defined my own complex called cores_in_use which counts both single cores and PE cores correctly. It works great for single core

Re: [gridengine users] load_formula and PE jobs

2012-08-13 Thread Joseph Farran
the load_formula for PE jobs $pe_slots? 2) Will this be brought back to GE in a future release? Joseph On 08/13/2012 08:22 AM, Reuti wrote: Am 12.08.2012 um 19:55 schrieb Joseph Farran: Hi Rayson. Here is one particular entry: http://gridengine.org/pipermail/users/2012-May/003495.html I

[gridengine users] PE Job Starvation and Job Reservation

2012-08-13 Thread Joseph Farran
Hi. We are having the classic job-starvation problem with PE jobs. I followed the instructions listed at http://www.gridengine.info/2006/05/31/resource-reservation-prevents-parallel-job-starvation # qconf -ssconf | egrep reservation max_reservation 64 # qconf -sconf | grep
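With max_reservation set, reservation must still be requested per job (sketch; the PE name mpi and slot count are hypothetical):

    qsub -R y -pe mpi 128 job.sh   # reserve slots so the large PE job cannot starve

Without -R y the scheduler keeps backfilling small jobs ahead of the pending parallel job.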

Re: [gridengine users] PE Job Starvation and Job Reservation

2012-08-13 Thread Joseph Farran
I checked the PE job and the job arrays and they had none. I originally set up all my queues with a large h_rt value, thinking that jobs would inherit it: # qconf -sq bio | fgrep h_rt h_rt :00:00 But I think I read wrong how that worked. What is the proper / recommend
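The queue h_rt is a limit, not a request, so jobs do not inherit it for scheduling. A common pattern is a default request in the global sge_request (sketch; the 96-hour value is only an example):

    # $SGE_ROOT/default/common/sge_request
    -l h_rt=96:00:00

Jobs may still request a different value, up to whatever the target queue allows; a known h_rt per job is also what lets resource reservation plan ahead.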

Re: [gridengine users] qrsh modules environment

2012-08-13 Thread Joseph Farran
The issue is with qrsh. With regular qsub I am able to load modules just fine. With qsub, it finds the module environment correctly. With qrsh, it's a different story: $ qrsh -V error: Unknown option -V On 08/13/2012 03:36 PM, Dave Love wrote: Joseph Farranjfar...@uci.edu writes: For

Re: [gridengine users] load_formula and PE jobs

2012-08-14 Thread Joseph Farran
On 08/14/2012 02:31 AM, Reuti wrote: Am 14.08.2012 um 00:27 schrieb Joseph Farran: Hi Alex. Thanks for the info, but the issue is more complex. The issue is that slots cannot be used with Subordinate queues. Why not? Reason is here: http://gridengine.org/pipermail/users/2012-August

[gridengine users] USE_QSUB_GID Son Of Grid Engine sge-8.1.1

2012-08-14 Thread Joseph Farran
Hi. Do Unix groups work under Son of Grid Engine 8.1.1? I have mine set to: # qconf -sconf | grep execd_params execd_params USE_QSUB_GID=TRUE And I have my queues' user_lists set with the Linux group, but qsub won't let me submit the job.

Re: [gridengine users] load_formula and PE jobs

2012-08-15 Thread Joseph Farran
Dave, Thank you for pointing me to Son of Grid Engine sge-8.1.1. SoGE solved the BUG that GE2011.11 had with respect to load_formula not counting $pe_slots correctly, and SoGE also solved a few other bugs I was fighting with in GE2011.11. Best, Joseph On 08/15/2012 06:09 AM, Dave Love

Re: [gridengine users] qrsh modules environment

2012-08-15 Thread Joseph Farran
This was another issue that Son of Grid Engine sge_8.1.1 solved and that GE2011.11 had problems with. Qrsh works seamlessly with modules in sge_8.1.1. Best, Joseph On 08/15/2012 06:11 AM, Dave Love wrote: Joseph Farran jfar...@uci.edu writes: The issue is with qrsh. With regular qsub I am able

Re: [gridengine users] USE_QSUB_GID Son Of Grid Engine sge-8.1.1

2012-08-15 Thread Joseph Farran
On 08/15/2012 02:17 PM, Reuti wrote: Am 15.08.2012 um 21:46 schrieb Joseph Farran: This was working under GE2011.11 which I cannot get it to work with sge-8.1.1 or maybe I have it set wrong? Just for curiosity: did you define a user_lists anywhere else by accident? Global or exechost level

Re: [gridengine users] USE_QSUB_GID Son Of Grid Engine sge-8.1.1

2012-08-15 Thread Joseph Farran
Aha! Thanks Reuti. I think that's probably it - I had some leftover stuff from a previous GE installation. I will correct it, test it later, and post my findings. Best, Joseph On 08/15/2012 02:51 PM, Reuti wrote: Am 15.08.2012 um 23:48 schrieb Joseph Farran: On 08/15/2012 02:17 PM

Re: [gridengine users] USE_QSUB_GID Son Of Grid Engine sge-8.1.1

2012-08-16 Thread Joseph Farran
Thanks Dave. This is helpful as I was not sure of the step sequence. On 08/16/2012 10:21 AM, Dave Love wrote: In case it's not clear, the upgrade procedure should be: stop the execds; stop the qmaster; install the new binaries; restart the master; restart the execds.

[gridengine users] job_load_adjustments load_adjustment_decay_time

2012-08-19 Thread Joseph Farran
Howdy. I am using my own load formula cores_in_use with the following scheduler settings: # qconf -ssconf algorithm default schedule_interval 0:0:15 maxujobs 0 queue_sort_method seqno job_load_adjustments

Re: [gridengine users] SGE 6.2u5 - submitting to whole nodes

2012-08-22 Thread Joseph Farran
The link Rayson shows below works great. I am converting from Torque/Maui to Grid Engine, and the one thing that I missed the most in Torque was the ability to easily request whole nodes. I should add that the link below works *BUT* you will not be able to use it with subordinate queues. What

Re: [gridengine users] SGE 6.2u5 - submitting to whole nodes

2012-08-22 Thread Joseph Farran
On 08/22/2012 01:19 PM, Reuti wrote: Correct. There is also something like the allocation_rule 48 to get multiples of 48 slots. -- Reuti So if one has a mixture of nodes that have say 16, 48 and 64 cores, then we need to create a PE for each? So for mpi, something like: mpi16
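A fixed allocation_rule is set per PE, hence one PE per node size (sketch using Joseph's suggested naming):

    qconf -sp mpi64 | egrep 'pe_name|allocation'
    pe_name          mpi64
    allocation_rule  64    # grant slots only in whole chunks of 64 per host

A job then picks the PE matching the node size, e.g. qsub -pe mpi64 128 for two whole 64-core nodes.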

Re: [gridengine users] GPU node with pe and complex

2012-08-23 Thread Joseph Farran
Thanks William. Setting the consumable to JOB did the trick! Best, Joseph On 08/23/2012 12:32 AM, William Hay wrote: On 22 August 2012 23:53, Joseph Farran jfar...@uci.edu wrote: You have consumable set to YES, which means the request is multiplied by the number of slots you request (64), so you

[gridengine users] sge 8.1.1 and sge_shepherd running at 100%

2012-08-23 Thread Joseph Farran
Hi Dave. Any updates on when the bug that causes sge_shepherd to run at 100% when one uses qrsh is going to be fixed for sge 8.1.1? I just tested it using qrsh and the bug is there. Joseph

[gridengine users] Do not suspend job, kill instead

2012-08-23 Thread Joseph Farran
Howdy. Is there a flag one can set on a job so that it will be killed instead of being suspended by a subordinate queue? That is, if a job is running in a subordinate queue and the scheduler would suspend it, can the job be killed instead? Joseph

Re: [gridengine users] Son of Grid Engine 8.1.2 available

2012-08-28 Thread Joseph Farran
Thanks Dave. We just discovered that we cannot request nodes with -l mem_free=xxx. We are on 8.1.1. Does this new release fix this? Joseph On 08/28/2012 09:57 AM, Dave Love wrote: SGE 8.1.2 is available from http://arc.liv.ac.uk/downloads/SGE/releases/8.1.2/. It is a large superset of the

Re: [gridengine users] Son of Grid Engine 8.1.2 available

2012-08-28 Thread Joseph Farran
I don't use it myself, but one of our users used it successfully before we moved to GE 8.1.1. # qstat -q bio -F mem_free | fgrep mem hl:mem_free=498.198G hl:mem_free=498.528G hl:mem_free=499.143G hl:mem_free=498.959G hl:mem_free=499.198G $ qrsh -q bio

Re: [gridengine users] Son of Grid Engine 8.1.2 available

2012-08-28 Thread Joseph Farran
Hi Reuti. Here it is with the additional info: $ qrsh -w v -q bio -l mem_free=190G Job 1637 (-l h_rt=604800,mem_free=190G) cannot run in queue bio@compute-2-7.local because job requests unknown resource (mem_free) Job 1637 (-l h_rt=604800,mem_free=190G) cannot run in queue

Re: [gridengine users] Son of Grid Engine 8.1.2 available

2012-08-30 Thread Joseph Farran
On 08/28/2012 07:37 PM, Joseph Farran wrote: Hi Reuti. Here it is with the additional info: $ qrsh -w v -q bio -l mem_free=190G Job 1637 (-l h_rt=604800,mem_free=190G) cannot run in queue bio@compute-2-7.local because job requests unknown resource (mem_free) Job 1637 (-l h_rt=604800,mem_free=190G

[gridengine users] Requesting mem_free

2012-08-30 Thread Joseph Farran
Hi. I am trying to request nodes with a certain mem_free value, and I am not sure what is missing in my configuration that this does not work. My test nodes in my space1 queue have: $ qstat -F -q space1 | grep mem_free hl:mem_free=6.447G hl:mem_free=7.237G
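For comparison, the stock definition of mem_free in the complex list looks like this (from a default installation; worth verifying with qconf -sc):

    qconf -sc | grep mem_free
    mem_free   mf   MEMORY   <=   YES   NO   0   0

The requestable column must be YES for -l mem_free=... to be accepted, and the <= relation is checked against the load value each execd reports.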

Re: [gridengine users] Requesting mem_free

2012-08-30 Thread Joseph Farran
Hi Mazouzi. I still get the same issue. With no mem_free request, all works ok: $ qrsh -q space1 -l mem_free=1G error: no suitable queues $ qrsh -q space1 Last login: Wed Aug 29 14:31:07 2012 from login-1-1.local Rocks Compute Node Rocks 5.4.3 (Viper) Profile built 14:11 07-May-2012

Re: [gridengine users] Son of Grid Engine 8.1.2 available

2012-08-30 Thread Joseph Farran
On 08/30/2012 02:22 PM, Dave Love wrote: That doesn't actually demonstrate that it's on the relevant nodes (e.g. qconf -se), though I'll believe it is. The -w v messages suggest that there's no load report from those nodes. What OS is this, and what load values are actually reported by one of

Re: [gridengine users] Son of Grid Engine 8.1.2 available

2012-08-31 Thread Joseph Farran
On 8/31/2012 6:58 AM, Dave Love wrote: In the absence of any knowledge about that cluster, that doesn't confirm that it's reported for the specific hosts that the scheduler complained about, just that it's reported for some. Look explicitly at the load parameters from one of the hosts in

Re: [gridengine users] [PATCH] Simple memory cgroup functionality

2012-09-11 Thread Joseph Farran
Mark, Thanks! I just upgraded to 8.1.2. Will these patches work with 8.1.2, or were they intended only for 8.1.1? Joseph On 09/10/2012 07:45 AM, Mark Dixon wrote: Hi, Way back in May I promised this list a simple integration of gridengine with the cgroup functionality found in

[gridengine users] s_rt / h_rt Limits with Informative Messages?

2012-09-11 Thread Joseph Farran
Hi All. Is there a way ( hopefully an easy way ) to have Grid Engine give an informative message when a job has gone past a limit and been killed, like when a job goes over the wall time limit? When I get an email from Grid Engine where a job has gone past its wall time limit, it is not very

Re: [gridengine users] s_rt / h_rt Limits with Informative Messages?

2012-09-11 Thread Joseph Farran
Thanks Reuti. I think this sends an additional email, correct? Any easy way to append to or check for -m bea in case a user does not want the email? Joseph On 09/11/2012 11:21 AM, Reuti wrote: Hi, Am 11.09.2012 um 19:10 schrieb Joseph Farran: Is there a way ( hopefully easy way ) to have
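A low-cost alternative is letting the job announce its own demise (sketch; works only for jobs submitted with -notify, and the payload script name is hypothetical):

    #!/bin/bash
    #$ -notify
    trap 'echo "job $JOB_ID hit a limit, shutting down" >&2' USR1 USR2
    ./long_running_task

With -notify, Grid Engine sends SIGUSR1 before a suspend and SIGUSR2 shortly before the final SIGKILL, so the message lands in the job's stderr file instead of a separate email.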

Re: [gridengine users] Son of Grid Engine 8.1.2 available

2012-09-13 Thread Joseph Farran
Hi Brian. Cool, and thank you for pointing this out and the fix. Being so new to GE, and after 20+ posts on this issue, I thought it was something wrong in my GE configuration! Glad to hear it was not me :-) Best, Joseph

Re: [gridengine users] failing mem_free request

2012-09-17 Thread Joseph Farran
Dave, I am having the same/similar issues as Brian's, but with 8.1.2. But for me, it's even worse: there are only two resources I can request, which are mem_total and swap_total. All others fail. $ qrsh -l mem_total=1M Last login: Mon Sep 10 22:02:39 2012 from login-1-1.local

Re: [gridengine users] Cleaning up Run-away jobs on nodes

2012-09-20 Thread Joseph Farran
Thanks William, Reuti and Dave. I will try the pointers made here. Joseph On 09/20/2012 02:13 AM, Reuti wrote: Am 20.09.2012 um 02:08 schrieb Joseph Farran: What is the recommended way and/or do scripts exists for cleaning up once a job completes/dies/crashes on a node? I would prefer

Re: [gridengine users] Functional Fair Share on a Queue?

2012-10-15 Thread Joseph Farran
this, I think: http://moo.nac.uci.edu/~hjm/BDUC_Pay_For_Priority.html If it is inaccurate, please let me know and I'll correct it. hjm On Sunday, October 14, 2012 01:42:38 AM Joseph Farran wrote: Hi All. I have a queue on our cluster with 1,000 cores that all users can use. I would like to keep

Re: [gridengine users] Functional Fair Share on a Queue?

2012-10-15 Thread Joseph Farran
Syntax question on the limit. In order to place a limit of say 333 cores per user on queue free, is the syntax: limit users * queues free to slots=333 Correct? On 10/15/2012 01:32 PM, Joseph Farran wrote: Hi Harry. Thanks. I understand the general fair share methods available
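The braces matter here. A hedged sketch of the full resource quota set (the set name per_user_free is hypothetical):

    qconf -srqs per_user_free
    {
       name     per_user_free
       enabled  TRUE
       limit    users {*} queues free to slots=333
    }

users {*} applies the 333-slot cap to each user individually, whereas users * would be a single cap shared by everyone.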

[gridengine users] How to Tell the running Wall Clock of a Job?

2012-10-26 Thread Joseph Farran
Howdy. One of my queues has a hard wall-time limit of 4 days ( 96 hours ): # qconf -sq queue | grep h_rt h_rt 96:00:00 There is a job which has been running much longer than 4 days, and I am not sure how to get the hours the job has been
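A quick way to get the elapsed hours from the shell, using the start time qstat prints for running jobs (sketch with GNU date; the timestamp is hypothetical):

    start='10/24/2012 18:59:12'   # copied from the job's qstat line
    echo $(( ($(date +%s) - $(date -d "$start" +%s)) / 3600 )) hours

After the job finishes, qacct -j <jobid> reports the accumulated wallclock directly.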

Re: [gridengine users] How to Tell the running Wall Clock of a Job?

2012-10-29 Thread Joseph Farran
08:55 schrieb Daniel Gruber: Am 26.10.2012 um 07:58 schrieb Joseph Farran: Howdy. One of my queues has a wall time hard limit of 4 days ( 96 hours ): # qconf -sq queue | grep h_rt h_rt 96:00:00 There is a job which has been running much longer than 4 days and I am not sure

Re: [gridengine users] How to Tell the running Wall Clock of a Job?

2012-10-29 Thread Joseph Farran
Ah, I missed that. Yes, we have awk version 3.1.5 and the readme says 3.1.6 or higher. We will be upgrading the OS from SL 5.7 to 6.3 soon, so that should fix this. Thanks, Joseph On 10/29/2012 11:11 AM, Reuti wrote: Am 29.10.2012 um 19:08 schrieb Joseph Farran: Thanks Reuti, but it does not work

[gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Joseph Farran
Hi all. I googled this issue but did not see much help on the subject. I have several queues with hard wall-clock limits like this one: # qconf -sq queue | grep h_rt h_rt 96:00:00 I am running Son of Grid Engine 8.1.2, and many jobs run past the hard wall-clock limit and

Re: [gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Joseph Farran
killed when they go past their wall-clock limit. How can I investigate this further? On 10/30/2012 11:44 AM, Reuti wrote: Hi, Am 30.10.2012 um 19:31 schrieb Joseph Farran: I googled this issue but did not see much help on the subject. I have several queues with hard wall clock limits like

Re: [gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Joseph Farran
On 10/30/2012 12:07 PM, Reuti wrote: Am 30.10.2012 um 20:02 schrieb Joseph Farran: Hi Reuti. Yes, I had that already set: qconf -sconf | fgrep execd_params execd_params ENABLE_ADDGRP_KILL=TRUE What is strange is that 1 out of 10 jobs or so do get killed just fine when they go

Re: [gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Joseph Farran
for the h_rt and nothing either. On 10/30/2012 01:49 PM, Reuti wrote: Am 30.10.2012 um 20:18 schrieb Joseph Farran: Here is one case: qstat | egrep '12959|12960' 12959 0.50500 dna.pmf_17 amentes r 10/24/2012 18:59:12 free2@compute-12-22.local 1 12960 0.50500 dna.pmf_17 amentes
