Hi Grid
folks.
As per my earlier post, I am a newbie to OGE, but not to
schedulers in general.
To keep things clear, what is the official name for this
software? Is it Open Grid Engine (OGE) or Open Grid
Scheduler
Resending in text format:
Hi Grid folks.
As per my earlier post, I am a newbie to OGE, but not to schedulers in general.
To keep things clear, what is the official name for this software? Is it
Open Grid Engine (OGE), or
Open Grid Scheduler (OGS)?
I downloaded the tarball executable and
Hello.
I have a cluster running Rocks 5.4.3 that I originally set up with Torque/Maui.
I am testing Open Grid Scheduler using the ge2011.11.tar distribution.
I set up OGE on the master head node and was also able to set up 6 compute nodes using
start_gui_installer on the head node. All 6
-Original Message-
From: users-boun...@gridengine.org [mailto:users-boun...@gridengine.org] On
Behalf Of Joseph Farran
Sent: Wednesday, May 09, 2012 4:10 PM
To: users@gridengine.org Users
Subject: [gridengine users] Installing OGE on Rocks Login Node
Hello.
I have a cluster running Rocks 5.4.3
Adding these lines at the end of oge-dir/default/common/sge_request
-cwd
-S /bin/bash
Works and does what I was looking for. Nice!
Thanks.
On 05/31/2012 05:40 PM, Joseph Farran wrote:
Cool! A lot easier than I thought.
So a default shell can also be specified so that the batch script
Hi All.
When installing OGE with respect to the Spooling Configuration, one can select:
Qmaster spool directory
Global execd spool directory
I installed OGE from the head node on a shared NFS directory ( /data/oge ) and would
like the spooling to be on the head node's /var file system
Thanks Reuti.
On 06/04/2012 02:51 PM, Reuti wrote:
Hi,
Am 04.06.2012 um 22:59 schrieb Joseph Farran:
When installing OGE with respect to the Spooling Configuration, one can select:
Qmaster spool directory
Global execd spool directory
I installed OGE from the head node on a shared
Hi All.
I am trying to understand the OGE parallel environment.
I am coming from Torque/PBS where one simply asks for the number of nodes and
cores (ppn) when running with a parallel program like an MPI job.
With OGE, it appears that a parallel environment must first be set up for each
parallel
/default/spool
Spooling: classic
Using the NFS shared directory for the global execd spool, everything works just
fine - compute nodes are set up correctly.
What am I doing wrong?
Joseph
On 06/04/2012 02:51 PM, Reuti wrote:
Hi,
Am 04.06.2012 um 22:59 schrieb Joseph Farran:
When installing OGE
OGE is owned by ogeadmin, so
/var/spool/oge on the compute node needs to exist *and* be owned by ogeadmin.
Joseph
On 06/05/2012 08:53 AM, Reuti wrote:
Am 05.06.2012 um 17:47 schrieb Joseph Farran:
My OGE software resides on a shared NFS directory /data/hpc/oge.
When I run
On 06/04/2012 07:29 PM, Rayson Ho wrote:
On Mon, Jun 4, 2012 at 9:08 PM, Joseph Farran jfar...@uci.edu wrote:
You can do something similar by defining a generic PE and you can then
use a generic name.
If I create a generic PE, say parallel, can the parallel name be the
default if no PE name
Greetings.
How does one give more than one Linux group access to an OGE queue?
So if I have Linux groups staff, bio, and chem, how do I make my test queue
accessible only by these 3 groups?
What kind of queue type do I set up?
On 06/08/2012 10:25 AM, Reuti wrote:
You can make one ACL containing these three Unix groups:
$ qconf -au @staff foobar
$ qconf -au @bio foobar
$ qconf -au @chem foobar
$ qconf -mattr queue user_lists foobar test
-- Reuti
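For completeness, the resulting access list and the queue setting can be checked afterwards (a quick verification sketch, using the ACL and queue names from above):
$ qconf -su foobar
$ qconf -sq test | grep user_lists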
Perfect! Exactly what I was looking for. Will test it soon.
On 06/08/2012 11:19 AM, Rayson Ho wrote:
but if Joseph is OK with using a cron
job to sync membership, then I can leave it aside for now - I will
need to work on a few more urgent things but will have more time later
this month.
Rayson
Hi Rayson.
If you are asking about the compute nodes
Me again :-)
The Queue access list by Linux groups ( /etc/group ) is working perfectly!
I submitted a test job to the bio queue from an account that has bio group
ownership and the job runs. When I submit a test job to the bio queue from
an account that does *not* belong to the bio Linux
Greetings.
I am trying to set up my MPI Parallel Environment so that whole nodes are used
before going to the next node when looking for cores.
Our nodes have 64 cores. What I would like is that if I ask for 128 cores (slots),
one compute node is selected with 64 cores, and then the next one with 64
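In GE terms this packing is usually driven by the PE's allocation_rule; a minimal sketch of a fill-the-node-first PE (the name mpi_fill and the slot total are placeholders) would be:
$ qconf -sp mpi_fill
pe_name            mpi_fill
slots              9999
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
A request such as qsub -pe mpi_fill 128 job.sh should then be packed host by host rather than spread one slot per node, subject to what is actually free.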
Hi.
With the help of this group, I've been able to make good progress on setting up
OGE 2011.11 with our cluster.
I am testing the Suspend/Resume features and they work great for serial jobs,
but I am not able to get parallel jobs suspended.
I created a simple Parallel Environment (PE) called mpi
Thanks for the clarification.
This is NAMD run, so I am launching it via charmrun and not mpirun.
If the OGE code suspends via rank 0, I would think that charmrun and/or any
other parallel job would suspend as well, no?
I will try an mpirun job next to see if it behaves differently and
Well, for our needs, we *REALLY* need parallel job suspension. It's not even
a choice for us.
If Torque/Maui can do it, I am sure OGE can do it without issues.
Can someone please tell me what patch I need to install to un-break / turn-on
Parallel job suspension?
If you guys are that
Hi.
I am able to successfully suspend a job array with:
qmod -sj job-id
But how does one do this using qmon - the graphical tool? How does one
select the entire job array and not just single job array entries that show up
in qmon?
Joseph
Greetings.
At http://gridscheduler.sourceforge.net under the download Grid Engine, there
is a:
Grid Engine 2011.11 binary for x64 ( now with the GUI installer )
Is this the latest available release as of date?
Also, how easy or hard is it to upgrade OGE once it is installed and in
Greetings.
I am playing with OGE subordinate Queues and I can't seem to get it right.
All my nodes have 64 cores and I set all my nodes to pack jobs with:
qconf -rattr exechost complex_values slots=64 node1 ( repeat for all
other nodes )
The scheduler is then set with Load Formula
On 06/15/2012 09:48 AM, Rayson Ho wrote:
On Fri, Jun 15, 2012 at 12:29 PM, Reuti re...@staff.uni-marburg.de wrote:
And just want to add that if no new jobs are sent to the super-ordinate queue, then the sub-ordination process would never kick in. Which is why Reuti mentioned the queue vs. host
Hello.
I'd like to make qstat behave as qstat -u * by default, to see all users' jobs. I
added:
-u *
to our /oge-path/default/common/sge_qstat
and it does not seem to work. qstat does not show all users' jobs, but it does if
I say qstat -u *
Also, adding the above command to
Thanks, that did the trick!
After making this the system-wide default, how can an individual user change
it back to show just their own jobs?
In one account, I created:
$ cat ~/.sge_qstat
-u $USER
Trying to switch it back to listing only this one user's jobs in qstat, but I still
get the system
Yes, that makes sense.
I wanted to have a global default but then let savvy users *undo* the global
setup.
So what I am going to do is create a local ~/.sge_qstat with -u * and let
the users who want to change the default simply remove this file, or change it to their liking.
Best,
Joseph
I may have found a bug in OGE ( 2011.11 )?
I have a queue called owner that I modified using:
qconf -mq owner
To set the subordinate_list field with slots=8(free:4:sr). I confirm the
new change with:
# qconf -sq owner | grep subordinate
subordinate_list slots=8(free:4:sr)
On 06/21/2012 08:19 AM, Reuti wrote:
Am 21.06.2012 um 16:25 schrieb Joseph Farran:
If this is normal, what happens to the entry slots=8(free:4:sr)? Or is this
a bug?
I would say so, it's already in 6.2u5.
To make sure we are on the same page:
When you say "I would say so", are you saying
Well, this is a horrible bug, especially for anyone starting out with OGE.
Imagine a software program where every time a user ran emacs
(the editor) on the code *without* making any changes, the editor would
automatically make changes to the code.
If I had any control over this software
Hi.
I am playing with subordinate queues. I have defined owner queue and free
queue.
The owner queue has:
# qconf -sq owner | grep subordinate
subordinate_list slots=8(free:0:sr)
If I submit a 1-core job to the owner queue, OGE suspends a 1-core (slot) job from
the free queue. If I submit
Thanks Dave.
The list helps me know what bugs I am dealing with in my particular OGE setup.
With respect to the following bugs:
- 6953013 slotwise preemption does not take amount of slots of pe tasks into
account when unsuspend/suspend
- 6932534 slotwise suspend on subordinate with parallel
Howdy.
Our environment uses a mixture of parallel jobs, job arrays, and serial jobs,
with parallel jobs being the biggest, followed by job arrays.
For parallel jobs, one of the worst things a scheduler can do is spread
1-core jobs evenly across all nodes based on node load, because you get severe
Hi.
This approach for load_formula seems ideal for our needs, but I cannot get it
to work with ge2011.11.
Has anyone tried this with ge2011.11 and does it work?
Joseph
On 05/14/2012 11:49 AM, Stuart Barkley wrote:
On our systems we use:
% qconf -ssconf
...
queue_sort_method
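For reference, the field being discussed is part of the scheduler configuration, which is inspected and edited with:
$ qconf -ssconf | grep queue_sort_method
$ qconf -msconf
queue_sort_method takes either seqno (sort queue instances by their seq_no) or load (sort by the value of load_formula).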
Hello.
I remember reading that an updated version of GE2011.11 was going to be
released by the end of June.
When I go to http://gridscheduler.sourceforge.net and click on the download
link, I don't see any new version.
The only version I see is GE2011.11 dated 2011-11-04.
Did I read wrong
When is GE 2011.11 update 1 with cgroups planned on being released?
On 07/02/2012 01:56 PM, Rayson Ho wrote:
GE 2011.11 patch 1 was released a while ago (back in April or May I believe):
http://dl.dropbox.com/u/47200624/GE2011.11p1/GE2011.11p1.tar.gz
But we did not release certified binaries
On 07/10/2012 11:38 AM, Rayson Ho wrote:
On Tue, Jul 10, 2012 at 1:48 PM, Joseph Farran jfar...@uci.edu wrote:
I was using the identical script, so it's still a mystery why the
script ran on some nodes while it failed on others, but now with this change
it works on all nodes and that is
On 07/10/2012 01:24 PM, Rayson Ho wrote:
On Tue, Jul 10, 2012 at 3:51 PM, Reuti re...@staff.uni-marburg.de wrote:
Joseph,
To debug the difference in behavior:
1) make sure that you can always reproduce the job failure.
2) then submit jobs to a node that fails the job and to a node that does
On 07/16/2012 01:51 PM, Reuti wrote:
There is a script in $SGE_ROOT/util/upgrade_modules/save_sge_config.sh
-- Reuti
Running the command where the sge_qmaster is running (the admin host), it says:
# mkdir /root/oge-backup-test
# ./util/upgrade_modules/save_sge_config.sh
On 07/17/2012 03:22 PM, Joseph Farran wrote:
$ od -c OUT.o5223.1
000 T e s t \n 033 [ H 033 [ J
013
Do you know where this is coming from?
False alarm - it was coming from a call to another script from the job.
Hello.
I'd like to set the default reservation option for array jobs to be No:
#-R n
Is this possible via the sge_request global file and if so, what is the
syntax to do this (only for job arrays)?
Or is this something I need to do in the prolog.sh startup part and if so, what
is the
Ok, I ~think~ this is the default for all jobs with OGE ( No reservation ).
So let me turn this around. How can I set Job reservation for Parallel jobs to be
On by default?
#-R y
Only for PE jobs?
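As far as I know there is no per-job-type default in sge_request itself, so one sketch of an approach is a server-side JSV that turns on reservation only when a PE was requested (the script name and messages are mine; the shell JSV library is the one shipped in $SGE_ROOT/util/resources/jsv):
#!/bin/sh
# jsv_pe_reservation.sh - sketch: force -R y for jobs that request a PE
jsv_on_start()
{
   return
}
jsv_on_verify()
{
   if [ "$(jsv_get_param pe_name)" != "" ]; then
      jsv_set_param R y
      jsv_correct "reservation enabled for PE job"
      return
   fi
   jsv_accept "job accepted"
   return
}
. ${SGE_ROOT}/util/resources/jsv/jsv_include.sh
jsv_main
It would then be wired in through the jsv_url parameter of qconf -mconf (or per user with qsub -jsv).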
On 07/25/2012 11:57 AM, Joseph Farran wrote:
Hello.
I like to set the default
Hi Rayson / Reuti.
I have an epilog set up to clear out empty .o and .e files, so I was using the
GE environment variables to check for said files and act accordingly.
Since they are not defined in GE, I am manually checking for and removing them
with:
$JOB_NAME.pe$JOB_ID
$JOB_NAME.po$JOB_ID
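A sketch of what such an epilog check can look like (the PE-file paths are rebuilt by hand as above; SGE_STDOUT_PATH, SGE_STDERR_PATH, SGE_O_WORKDIR, JOB_NAME and JOB_ID are set by GE for the epilog):
#!/bin/bash
# epilog sketch: delete the job's output files if they exist but are empty
for f in "$SGE_STDOUT_PATH" "$SGE_STDERR_PATH" \
         "$SGE_O_WORKDIR/$JOB_NAME.po$JOB_ID" \
         "$SGE_O_WORKDIR/$JOB_NAME.pe$JOB_ID"
do
    [ -f "$f" ] && [ ! -s "$f" ] && rm -f "$f"
done
exit 0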
Hi.
I originally ran start_gui_installer, which is a great and easy GUI tool to
add compute nodes.
What is the proper way to re-add nodes from the command line? I am
running Rocks 5.4.3, and when a node is re-imaged all is gone, so is there an
easy way to re-add the node via command
Thanks Simon and Rayson.
That was pretty much what I was doing. On my newly installed node, I placed a
copy of my sgeexecd.HPC in /etc/init.d, ran chkconfig to make sure it starts up on the next boot,
created /var/spool/oge, and starting OGE would then create the compute-x-x
directory and
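Spelled out as commands, that manual sequence on a re-imaged node is roughly (a sketch; the NFS $SGE_ROOT path and the HPC cell/cluster name are this cluster's, adjust as needed):
# cp /data/hpc/oge/default/common/sgeexecd.HPC /etc/init.d/
# chkconfig --add sgeexecd.HPC
# mkdir -p /var/spool/oge
# chown ogeadmin /var/spool/oge
# /etc/init.d/sgeexecd.HPC start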
Hi.
I pack jobs onto nodes using the following GE setup:
# qconf -ssconf | egrep "queue|load"
queue_sort_method seqno
job_load_adjustments NONE
load_adjustment_decay_time 0
load_formula slots
I also set my nodes with
On 08/03/2012 09:18 AM, Reuti wrote:
Am 03.08.2012 um 18:04 schrieb Joseph Farran:
I pack jobs onto nodes using the following GE setup:
# qconf -ssconf | egrep "queue|load"
queue_sort_method seqno
job_load_adjustments NONE
load_adjustment_decay_time
On 08/03/2012 09:57 AM, Reuti wrote:
Am 03.08.2012 um 18:50 schrieb Joseph Farran:
On 08/03/2012 09:18 AM, Reuti wrote:
Am 03.08.2012 um 18:04 schrieb Joseph Farran:
I pack jobs onto nodes using the following GE setup:
# qconf -ssconf | egrep "queue|load"
queue_sort_method
ONE core jobs suspend 16 single core jobs. Nasty and wasteful!
On 08/03/2012 10:10 AM, Joseph Farran wrote:
On 08/03/2012 09:57 AM, Reuti wrote:
Am 03.08.2012 um 18:50 schrieb Joseph Farran:
On 08/03/2012 09:18 AM, Reuti wrote:
Am 03.08.2012 um 18:04 schrieb Joseph Farran:
I pack jobs
suspending another 7 cores.
If job-packing with subordinate queues were available, job #8585 would have
started on compute-3-2 since it has cores available.
Two single ONE core jobs suspend 16 single core jobs. Nasty and wasteful!
On 08/03/2012 10:10 AM, Joseph Farran wrote:
On 08/03/2012 09
Ok, it's not that difficult to set up a load sensor in GE, and I ~think~ I
figured out how to tell the cores in use by a node.
Best,
Joseph
On 08/03/2012 01:03 PM, Joseph Farran wrote:
Great! Will it work for both parallel and single core jobs?
If yes, is there such a load sensor available
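A load sensor is just a script that loops on stdin and reports complex values; a bare-bones sketch reporting a cores_in_use complex could look like the following (the counting command is only an illustration, and the complex has to be added with qconf -mc first):
#!/bin/bash
# load sensor sketch: report slots in use on this host as cores_in_use
HOST=$(hostname)
while true; do
    read -r line
    [ "$line" = "quit" ] && exit 0
    # illustration only: sum the slots column of running jobs on this host
    used=$(qstat -u '*' -s r -l hostname="$HOST" 2>/dev/null | awk 'NR>2 {n+=$9} END {print n+0}')
    echo "begin"
    echo "$HOST:cores_in_use:$used"
    echo "end"
done
It gets registered per host or globally via the load_sensor parameter of qconf -mconf.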
If job-packing with subordinate queues were available, job #8585 would have
started on compute-3-2 since it has cores available.
Two single ONE core jobs suspend 16 single core jobs. Nasty and wasteful!
On 08/03/2012 10:10 AM, Joseph Farran wrote:
On 08/03/2012 09:57 AM, Reuti wrote:
Am
with subordinate queues were available, job #8585 would have
started on compute-3-2 since it has cores available.
Two single ONE core jobs suspend 16 single core jobs. Nasty and wasteful!
On 08/03/2012 10:10 AM, Joseph Farran wrote:
On 08/03/2012 09:57 AM, Reuti wrote:
Am 03.08.2012 um 18:50 schrieb Joseph
Found the issue. If I start with the count being the number of cores counting
down, then it works.
On 8/3/2012 4:29 PM, Joseph Farran wrote:
I created a load sensor and it is reporting accordingly. Not sure if I got the
sensor options correct?
# qconf -sc | egrep cores_in_use
cores_in_use
Howdy.
In reading about GE qrsh, it looks like qrsh sets a minimal path:
$ qrsh echo '$PATH'
/scratch/1125.1.user:/usr/local/bin:/bin:/usr/bin
To add additional paths, one can do:
qrsh -v PATH=/extra:/usr/local/bin:/bin:/usr/bin:/new-path echo '$PATH'
Howdy.
I am using GE2011.11.
I am successfully using GE load_formula to load jobs by core count using my own
load_sensor script.
All works as expected with single-core jobs; however, for PE jobs, it seems as if GE does
not abide by the load_formula.
Does the scheduler use a different load
|pe_list
qname bio
pe_list make mpi openmp
slots 64
Thanks for taking a look at this!
On 8/11/2012 4:32 AM, Reuti wrote:
Am 11.08.2012 um 02:57 schrieb Joseph Farran jfar...@uci.edu:
Reuti,
Are you sure this works in GE2011.11?
I have defined my own
Clarification: In the example I just posted, I updated my scheduler queue_sort_method from
seq_no to load to make sure the scheduler sort method was not using the queue
sequence number.
On 8/11/2012 11:30 AM, Joseph Farran wrote:
Yes, all my queues have the same 0 for seq_no
On 8/11/2012 1:51 PM, Reuti wrote:
Am 11.08.2012 um 20:30 schrieb Joseph Farran:
Yes, all my queues have the same 0 for seq_no.
Here is my scheduler load formula:
qconf -ssconf
algorithm default
schedule_interval 0:0:15
maxujobs
a look at this!
On 8/11/2012 4:32 AM, Reuti wrote:
Am 11.08.2012 um 02:57 schrieb Joseph Farran jfar...@uci.edu:
Reuti,
Are you sure this works in GE2011.11?
I have defined my own complex called cores_in_use which counts both
single cores and PE cores correctly.
It works great for single core
the load_formula for PE jobs
$pe_slots?
2) Will this be brought back to GE in a future release?
Joseph
On 08/13/2012 08:22 AM, Reuti wrote:
Am 12.08.2012 um 19:55 schrieb Joseph Farran:
Hi Rayson.
Here is one particular entry:
http://gridengine.org/pipermail/users/2012-May/003495.html
I
Hi.
We are having the classic job starvation problem with PE jobs.
I followed the instructions listed at
http://www.gridengine.info/2006/05/31/resource-reservation-prevents-parallel-job-starvation
# qconf -ssconf | egrep reservation
max_reservation 64
# qconf -sconf | grep
I checked the PE-job and the job-arrays and they had none.
I originally set up all my queues with a large h_rt value, thinking that jobs
would inherit that:
# qconf -sq bio | fgrep h_rt
h_rt :00:00
But I think I misread how that works.
What is the proper / recommended
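One common approach (a sketch, independent of whatever was recommended further down this thread) is to give every job a default h_rt request through the global sge_request file, so the value is there unless a job explicitly requests its own:
# appended to /data/hpc/oge/default/common/sge_request (path is this cluster's)
-l h_rt=96:00:00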
The issue is with qrsh.
With regular qsub I am able to load modules just fine. With qsub, it finds
the module environment correctly.
With qrsh, it's a different story:
$ qrsh -V
error: Unknown option -V
On 08/13/2012 03:36 PM, Dave Love wrote:
Joseph Farran jfar...@uci.edu writes:
For
On 08/14/2012 02:31 AM, Reuti wrote:
Am 14.08.2012 um 00:27 schrieb Joseph Farran:
Hi Alex.
Thanks for the info, but the issue is more complex.
The issue is that slots cannot be used with Subordinate queues.
Why not? The reason is here:
http://gridengine.org/pipermail/users/2012-August
Hi.
Do Unix groups work under Son of Grid Engine 8.1.1?
I have mine set to:
# qconf -sconf | grep execd_params
execd_params USE_QSUB_GID=TRUE
And I have my queues' user_lists set with the Linux group, but qsub won't let
me submit the job.
Dave,
Thank you for pointing me to Son of Grid Engine sge-8.1.1.
SoGE solved the bug that GE2011.11 had with respect to load_formula not
counting $pe_slots correctly, and SoGE also solved a few other bugs I was
fighting with in GE2011.11.
Best,
Joseph
On 08/15/2012 06:09 AM, Dave Love
This was another issue that Son of Grid Engine sge_8.1.1 solved that GE2011.11
had problems with.
Qrsh works seamlessly with modules in sge_8.1.1.
Best,
Joseph
On 08/15/2012 06:11 AM, Dave Love wrote:
Joseph Farran jfar...@uci.edu writes:
The issue is with qrsh.
With regular qsub I am able
On 08/15/2012 02:17 PM, Reuti wrote:
Am 15.08.2012 um 21:46 schrieb Joseph Farran:
This was working under GE2011.11, but I cannot get it to work with sge-8.1.1,
or maybe I have it set wrong?
Just for curiosity: did you define a user_lists anywhere else by accident?
Global or exechost level
Aha!
Thanks Reuti. I think that's probably it - I had some leftover stuff from a
previous GE installation.
I will correct and test it later and will post my findings.
Best,
Joseph
On 08/15/2012 02:51 PM, Reuti wrote:
Am 15.08.2012 um 23:48 schrieb Joseph Farran:
On 08/15/2012 02:17 PM
Thanks Dave.
This is helpful as I was not sure of the step sequence.
On 08/16/2012 10:21 AM, Dave Love wrote:
In case it's not clear, the upgrade procedure should be: stop the execds; stop the qmaster; install the new binaries; restart the master; restart the execds.
Howdy.
I am using my own load formula cores_in_use with the following scheduler
settings:
# qconf -ssconf
algorithm default
schedule_interval 0:0:15
maxujobs 0
queue_sort_method seqno
job_load_adjustments
The link Rayson shows below works great.
I am converting from Torque/Maui to Grid Engine, and the one thing I missed the most
from Torque was the ability to easily request whole nodes.
I should add that the link below works *BUT* you will not be able to use it with
subordinate queues.
What
On 08/22/2012 01:19 PM, Reuti wrote:
Correct.
There is also something like the allocation_rule 48 to get multiples of 48
slots.
-- Reuti
So if one has a mixture of nodes that have, say, 16, 48 and 64 cores, then we
need to create a PE for each?
So for mpi, something like:
mpi16
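A sketch of what one of those per-core-count PEs could look like (mpi64 is just the naming scheme floated above):
$ qconf -sp mpi64
pe_name            mpi64
slots              99999
allocation_rule    64
control_slaves     TRUE
job_is_first_task  FALSE
A wildcard request such as qsub -pe mpi* 128 then lets the scheduler pick whichever of the mpi16/mpi48/mpi64 variants fits.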
Thanks William.
Setting the consumable to JOB did the trick!
Best,
Joseph
On 08/23/2012 12:32 AM, William Hay wrote:
On 22 August 2012 23:53, Joseph Farran jfar...@uci.edu wrote:
You have consumable set to YES, which means the request is multiplied
by the number of slots you request (64), so you
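The difference is in the consumable column of the complex definition (qconf -mc); a sketch with a made-up complex name:
#name           shortcut  type    relop  requestable  consumable  default  urgency
scratch_space   ss        MEMORY  <=     YES          JOB         0        0
With consumable JOB the requested amount is charged once per job; with YES it is multiplied by the number of granted slots, which is what was happening above.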
Hi Dave.
Any updates on when the bug that causes sge_shepherd to run at 100% when one uses
qrsh is going to be fixed in sge 8.1.1?
I just tested it using qrsh and the bug is there.
Joseph
Howdy.
Is there a flag one can set on a job so that it will be killed instead of being
suspended by a subordinate queue?
So if a job is running on a subordinate queue and the scheduler suspends it, can
the job be killed instead?
Joseph
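As far as I know there is no single flag for that; one workaround sketch is to submit such jobs with -notify so GE sends SIGUSR1 shortly before the SIGSTOP, and have the job exit on that signal instead of sitting suspended:
#!/bin/bash
#$ -notify
# exit (instead of being suspended) when GE signals an imminent suspend
trap 'echo "suspend requested - exiting"; exit 1' USR1
# ... actual work here ...
The warning delay is the queue's notify parameter.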
Thanks Dave.
We just discovered that we cannot request nodes with -l mem_free=xxx.
We are on 8.1.1. Does this new release fix this?
Joseph
On 08/28/2012 09:57 AM, Dave Love wrote:
SGE 8.1.2 is available from
http://arc.liv.ac.uk/downloads/SGE/releases/8.1.2/. It is a large
superset of the
I don't use it, but one of our users used it successfully before we
moved to GE 8.1.1.
# qstat -q bio -F mem_free | fgrep mem
hl:mem_free=498.198G
hl:mem_free=498.528G
hl:mem_free=499.143G
hl:mem_free=498.959G
hl:mem_free=499.198G
$ qrsh -q bio
Hi Reuti.
Here it is with the additional info:
$ qrsh -w v -q bio -l mem_free=190G
Job 1637 (-l h_rt=604800,mem_free=190G) cannot run in queue
bio@compute-2-7.local because job requests unknown resource (mem_free)
Job 1637 (-l h_rt=604800,mem_free=190G) cannot run in queue
On 08/28/2012 07:37 PM, Joseph Farran wrote:
Hi Reuti.
Here it is with the additional info:
$ qrsh -w v -q bio -l mem_free=190G
Job 1637 (-l h_rt=604800,mem_free=190G) cannot run in queue
bio@compute-2-7.local because job requests unknown resource (mem_free)
Job 1637 (-l h_rt=604800,mem_free=190G
Hi.
I am trying to request nodes with a certain mem_free value, and I am not sure
what is missing in my configuration, since this does not work.
My test nodes in my space1 queue have:
$ qstat -F -q space1 | grep mem_free
hl:mem_free=6.447G
hl:mem_free=7.237G
Hi Mazouzi.
I still get the same issue. With no mem_free request, all works ok:
$ qrsh -q space1 -l mem_free=1G
error: no suitable queues
$ qrsh -q space1
Last login: Wed Aug 29 14:31:07 2012 from login-1-1.local
Rocks Compute Node
Rocks 5.4.3 (Viper)
Profile built 14:11 07-May-2012
On 08/30/2012 02:22 PM, Dave Love wrote:
That doesn't actually demonstrate that it's on the relevant nodes (e.g. qconf
-se), though I'll believe it is. The -w v messages suggest that there's no load
report from those nodes. What OS is this, and what load values are actually
reported by one of
On 8/31/2012 6:58 AM, Dave Love wrote:
In the absence of any knowledge about that cluster, that doesn't confirm that it's reported for the specific hosts that the scheduler complained about, just that it's reported for some. Look
explicitly at the load parameters from one of the hosts in
Mark,
Thanks! I just upgraded to 8.1.2. Will these patches work with 8.1.2 or
were they intended only for 8.1.1?
Joseph
On 09/10/2012 07:45 AM, Mark Dixon wrote:
Hi,
Way back in May I promised this list a simple integration of gridengine with
the cgroup functionality found in
Hi All.
Is there a way ( hopefully an easy way ) to have Grid Engine give an
informative message when a job has gone past a limit and is killed, like when a
job goes over the wall time limit?
When I get an email from Grid Engine where a job has gone past its wall time
limit, it is not very
Thanks Reuti.
I think this sends an additional email, correct? Any easy way to append to or check for
-m bea in case a user does not want the email?
Joseph
On 09/11/2012 11:21 AM, Reuti wrote:
Hi,
Am 11.09.2012 um 19:10 schrieb Joseph Farran:
Is there a way ( hopefully easy way ) to have
Hi Brian.
Cool and thank you for pointing this out and the fix.
Being so new to GE, and after 20+ posts on this issue, I thought it
was something wrong in my GE configuration! Glad to hear it was
not me :-)
Best,
Joseph
Dave,
I am having the same/similar issues as Brian, but with 8.1.2. But for me,
it's even worse.
There are only two resources I can request, which are mem_total and
swap_total. All others fail.
$ qrsh -l mem_total=1M
Last login: Mon Sep 10 22:02:39 2012 from login-1-1.local
Thanks William, Reuti and Dave.
I will try the pointers made here.
Joseph
On 09/20/2012 02:13 AM, Reuti wrote:
Am 20.09.2012 um 02:08 schrieb Joseph Farran:
What is the recommended way and/or do scripts exist for cleaning up once a job
completes/dies/crashes on a node?
I would prefer
this, I think:
http://moo.nac.uci.edu/~hjm/BDUC_Pay_For_Priority.html
If it is inaccurate, please let me know and I'll correct it.
hjm
On Sunday, October 14, 2012 01:42:38 AM Joseph Farran wrote:
Hi All.
I have a queue on our cluster with 1,000 cores that all users can use.
I'd like to keep
Syntax question on the limit.
In order to place a limit of, say, 333 cores per user on queue free, is the
syntax:
limit users * queues free to slots=333
correct?
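For reference, a full resource quota set along those lines would look like this sketch (note that users {*} with braces makes the limit per user; a bare * would be one combined limit for everybody):
$ qconf -srqs free_limit
{
   name         free_limit
   description  per-user slot cap on the free queue
   enabled      TRUE
   limit        users {*} queues free to slots=333
}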
On 10/15/2012 01:32 PM, Joseph Farran wrote:
Hi Harry.
Thanks. I understand the general fair share methods available
Howdy.
One of my queues has a wall time hard limit of 4 days ( 96 hours ):
# qconf -sq queue | grep h_rt
h_rt 96:00:00
There is a job which has been running much longer than 4 days and
I am not sure how to get the hours the job has been
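For a running job, the start date/time is in the plain qstat output, so one way to get the elapsed hours is a small shell calculation (a sketch; JOBID and the timestamp are placeholders for the values qstat prints):
$ qstat -s r | awk -v id=JOBID '$1 == id {print $6, $7}'
$ echo $(( ( $(date +%s) - $(date -d "MM/DD/YYYY HH:MM:SS" +%s) ) / 3600 )) hours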
08:55 schrieb Daniel Gruber:
Am 26.10.2012 um 07:58 schrieb Joseph Farran:
Howdy.
One of my queues has a wall time hard limit of 4 days ( 96 hours ):
# qconf -sq queue | grep h_rt
h_rt 96:00:00
There is a job which has been running much longer than 4 days and I am not sure
Ah, I missed that.
Yes, we have awk version 3.1.5 and the README says 3.1.6 or higher.
We will be upgrading the OS from SL 5.7 to 6.3 soon, so that should fix this.
Thanks,
Joseph
On 10/29/2012 11:11 AM, Reuti wrote:
Am 29.10.2012 um 19:08 schrieb Joseph Farran:
Thanks Reuti, but it does not work
Hi all.
I googled this issue but did not see much help on the subject.
I have several queues with hard wall clock limits like this one:
# qconf -sq queue | grep h_rt
h_rt 96:00:00
I am running Son of Grid Engine 8.1.2 and many jobs run past the hard wall
clock limit and
killed when they go past their
wall clock time.
How can I investigate this further?
On 10/30/2012 11:44 AM, Reuti wrote:
Hi,
Am 30.10.2012 um 19:31 schrieb Joseph Farran:
I googled this issue but did not see much help on the subject.
I have several queues with hard wall clock limits like
On 10/30/2012 12:07 PM, Reuti wrote:
Am 30.10.2012 um 20:02 schrieb Joseph Farran:
Hi Reuti.
Yes, I had that already set:
qconf -sconf | fgrep execd_params
execd_params ENABLE_ADDGRP_KILL=TRUE
What is strange is that 1 out of 10 jobs or so do get killed just fine when
they go
for the h_rt and nothing either.
On 10/30/2012 01:49 PM, Reuti wrote:
Am 30.10.2012 um 20:18 schrieb Joseph Farran:
Here is one case:
qstat | egrep "12959|12960"
12959 0.50500 dna.pmf_17 amentes r 10/24/2012 18:59:12
free2@compute-12-22.local 1
12960 0.50500 dna.pmf_17 amentes