Thanks Joshua!
That did the 1-Trillion dollar trick!
Best,
Joseph
On 8/7/2019 10:50 PM, Joshua Baker-LePain wrote:
On Wed, 7 Aug 2019 at 4:40pm, Joseph Farran wrote:
A user accidentally submitted a 1.4
Correction. 1 TRILLION :-)
On 8/7/2019 4:40 PM, Joseph Farran wrote:
Howdy.
A user accidentally submitted a 1.4 BILLION job array on our HPC
cluster. How can I remove it?
I cannot qdel the job nor can I qhold the job because it crashes SGE.
I can restart SGE just fine but the job remains.
I removed the SGE job script itself from
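One workable approach, sketched only (the job id 12345 is made up): delete the array in bounded task-range chunks so qmaster never has to expand the whole 1.4-billion-task range at once.
JOB=12345                  # hypothetical job id of the runaway array
STEP=100000                # tasks deleted per qdel call
for ((lo = 1; lo <= 1400000000; lo += STEP)); do
    hi=$((lo + STEP - 1))
    qdel -f "$JOB" -t "${lo}-${hi}" || break   # -f forces deletion
done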
11:26 AM, Daniel Povey wrote:
It may depend on specific features of those large job arrays. You could try
deleting them and see if the problem disappears.
On Sat, Jan 26, 2019 at 2:23 PM Joseph Farran <jfar...@uci.edu> wrote:
Hi Daniel.
Yes I do have large job-arrays around 7k tasks BUT I have had larger job
arrays of 500k without seeing this kind of slowdown.
Joseph
On 1/26/2019 10:16 AM, Daniel Povey wrote:
> Check if there are any huge
of jobs, can make it slow.
On Sat, Jan 26, 2019 at 7:05 AM Reuti <re...@staff.uni-marburg.de> wrote:
Hi,
> On 26.01.2019 at 10:20, Joseph Farran <jfar...@uci.edu> wrote:
>
> Hi.
> Our Grid Engine is running very sluggish all of a sudden. S
Hi Reuti.
Yes - several times
with no success.
Joseph
On 1/26/2019 4:03 AM, Reuti wrote:
Hi,
On 26.01.2019 at 10:20, Joseph Farran wrote:
Hi.
Our Grid Engine is running very sluggish all of a sudden. sge_qmaster stays at 100
Hi.
Our Grid Engine is running very sluggish all of a sudden. sge_qmaster stays at
100% all the time, where it used to be at 100% for a few seconds every 30 seconds
or so.
I ran the qping command but I am not sure how to read the output. Any helpful
insight is much appreciated.
qping -i 5 -info hpc-s 6444
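For reference, qping(1) expects a daemon name and id after host and port, so the full form is presumably something like the following (hpc-s and 6444 are from the post; qmaster and 1 are assumptions):
qping -i 5 -info hpc-s 6444 qmaster 1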
Glad you were able to fix it Dan.
I looked at Univa Grid Engine a while ago and it was super
expensive.
I was able to ask lots of questions of a potential candidate
for a position we had who was using Univa GE. His sentiments
were that it was
, Nov 9, 2018 at 12:12 AM Joseph Farran <jfar...@uci.edu> wrote:
Hi Dan.
Thank you for the suggestion. Here is what I have:
# qconf -sconf | grep gid_range
gid_range 200-
the range of possible userids.
:33 PM Joseph Farran <jfar...@uci.edu> wrote:
Greetings.
I am running SGE 8.1.9 on a cluster with some 10k
cores, CentOS 6.9.
I am seeing job failures on
nodes where the node's sge_execd
unexpectedly dies.
I ran strace on the node's sge_execd and it's not of much help.
It always
Cool!
Thank you Gabriel!
Best,
Joseph
On 02/16/2016 01:39 AM, RDlab wrote:
Hello,
S-GAE is a free GNU web application designed to display accounting information
generated by the Grid Engine family. This data is stored in a database in order
to display eye-candy charts grouped by user,
: for -q free64,bio, what does GE do to choose an available
queue for a job? Will it sort and use alphabetical order?
On Fri, May 29, 2015 at 8:12 AM, William Hay w@ucl.ac.uk wrote:
On Thu, 28 May 2015 19:27:07 +
Joseph Farran jfar...@uci.edu wrote:
Hi all.
I am not sure if this is a bug
Not sure if this answers your question?
Joseph
On 05/29/2015 05:12 AM, William Hay wrote:
On Thu, 28 May 2015 19:27:07 +
Joseph Farran jfar...@uci.edu wrote:
Hi all.
I am not sure if this is a bug or the way Grid Engine works.
We have several queues our users submit jobs to. One
Ok, Reuti wrote one which is available at:
* /opt/gridengine/bin/qstatus
Joseph
On 05/27/2015 02:27 PM, Joseph Farran wrote:
Hi All.
Running SGE 8.1.8 on CentOS 6.6.
Does someone have a qstat type of script that will show how much time
is left on a running job that was submitted
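A rough sketch of such a helper (not Reuti's qstatus, which is referenced above): take the start date/time that plain qstat prints for a running job and subtract the elapsed time from the queue's h_rt. GNU date and the hard-coded 96-hour limit are assumptions here.
#!/bin/sh
job=$1
set -- $(qstat | awk -v j="$job" '$1 == j {print $6, $7; exit}')
start=$(date -d "$1 $2" +%s) || exit 1          # start date/time columns
left=$(( 96*3600 - ( $(date +%s) - start ) ))   # replace 96h with the real h_rt
printf '%s: %02d:%02d:%02d left\n' "$job" \
    $((left/3600)) $((left%3600/60)) $((left%60))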
Hi all.
I am not sure if this is a bug or the way Grid Engine works.
We have several queues our users submit jobs to. One of the queues,
free64, has a 3-day wall-clock limit:
$ qconf -sq free64 | grep _rt
s_rt 72:00:00
h_rt 72:05:00
While another queue, bio
Hi All.
Is there a way using qconf and/or qhost to tell if a queue or
queue-instance is a suspend-able queue?
I've been checking the manual pages and cannot find how.
Joseph
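One indirect check, assuming standard qconf output: a queue instance is suspendable when some other queue lists it in subordinate_list, so scanning every queue for a non-NONE subordinate_list reveals the suspendable ones.
for q in $(qconf -sql); do
    qconf -sq "$q" | awk -v q="$q" \
        '$1 == "subordinate_list" && $2 != "NONE" {print q " suspends: " $2}'
done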
A little late, but I am running 8.1.7 and suspend only worked part of the time.
I had to write my own suspend script to make it work, especially with
MATLAB jobs which try to trap signals.
Joseph
On 12/19/2014 04:54 AM, berg...@merctech.com wrote:
On December 19, 2014 6:19:58 AM EST, Reuti
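A minimal sketch of such a suspend_method (the script path is hypothetical; $job_pid is the pseudo-variable from queue_conf(5), configured as e.g. suspend_method /opt/scripts/suspend.sh $job_pid): SIGSTOP cannot be caught, so MATLAB's signal traps cannot swallow it.
#!/bin/sh
# $1 = $job_pid; the leading minus stops the whole process group
kill -STOP -- "-$1"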
Hi All.
We are using Son of GE 8.1.7 with checkpoint BLCR. All works great.
Even though everything works just fine, SGE log message shows the following
when a job is migrated:
2/05/2014 22:18:08|worker|hpc-s|W|job 3029146.1 failed on host compute-7-5.local
migrating because: unknown
Hi All.
I am using Son of Grid Engine 8.1.6.
We have an issue that occurs once in a while in which Grid Engine will
suspend a job ( subordinate queue ) and while Grid Engine thinks the job
is suspended ( qstat shows S for job state ), the process on the node
keeps running and not really
Thanks Reuti.
I'll give that a try. Do I need to set up an un-suspend method / script as
well?
Joseph
On 8/7/2014 2:33 PM, Reuti wrote:
Hi,
On 07.08.2014 at 21:14, Joseph Farran wrote:
I am using Son of Grid Engine 8.1.6.
We have an issue that occurs once in a while in which Grid
Howdy.
We are running Son of GE 8.1.6 on CentOS 6.5 with core binding turned on
for our 64-core nodes.
$ qconf -sconf | grep BINDING
ENABLE_BINDING=TRUE
When I submit an OpenMP job with:
#!/bin/bash
#$ -N TESTING
#$ -q q64
#$ -pe openmp 16
#$ -binding linear:
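For comparison, a complete version of such a script might look like the following (the :16 amount and the payload are assumptions chosen to match the 16-slot PE request):
#!/bin/bash
#$ -N TESTING
#$ -q q64
#$ -pe openmp 16
#$ -binding linear:16
export OMP_NUM_THREADS=$NSLOTS   # match threads to granted slots
./my_openmp_binary               # hypothetical payload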
Thank you all for the helpful suggestions.
Mark, your scripts are exactly what I was looking for! Thanks.
Joseph
Howdy.
I am able to disable/enable a queue @ a compute node with:
$ qmod -d bio@compute-1-1
me@sys changed state of bio@compute-1-1.local (disabled)
$ qmod -e bio@compute-1-1
me@sys changed state of bio@compute-1-1.local (enabled)
But how can I query the state of a queue @ a node? In
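One answer, for what it's worth: the rightmost column of qstat -f for that instance is its state (empty = enabled, d = disabled, s = suspended), e.g.:
qstat -f -q bio@compute-1-1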
Allison,
I love Grid Engine but this is the one feature I truly miss from Torque:
-l nodes=x:ppn=[count]
Reuti,
We have a complex setup trying to accomplish this same thing and it kind of
works, but we have an issue with jobs not starting when jobs are running on a
subordinate queue.
First,
Cheers Dave!
On 11/4/2013 3:44 PM, Dave Love wrote:
SGE 8.1.6 is available from
http://arc.liv.ac.uk/downloads/SGE/releases/8.1.6/, fixing various bugs.
Please report bugs, patches and suggestions for enhancement
https://arc.liv.ac.uk/trac/SGE#mail.
Release notes:
* Bug fixes
* Man and
Hi Reuti.
Yes, after going through the logs, the subsequent restarts are messed up.
I've played with it more and there is no easy way to do this inside the job
submission script, so I will have to resort ( as you indicated ) to using an outside
script to run periodically and do a qsub -sj job /
Greetings.
We have a queue defined with soft and hard wall-clock limits of:
qconf -sq free64 | egrep "_rt|notify"
notify 00:05:00
s_rt 48:00:00
h_rt 48:05:00
And jobs get killed correctly after 2 days of wall-clock run time. We now have
Grid Engine is reached and the job receives the SIGUSR1 signal, it suspends
the job via qmod.
Joseph
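A sketch of the job-script side of this approach (the payload name is made up): when s_rt expires SGE sends SIGUSR1, which the script traps to suspend itself via qmod before h_rt kills it.
#!/bin/bash
#$ -q free64
trap 'qmod -sj "$JOB_ID"' USR1   # park the job instead of dying
./long_running_binary &          # hypothetical payload, backgrounded
wait                             # so the trap can fire while waiting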
Thank you Reuti.
On 10/29/2013 11:47 AM, Reuti wrote:
I came up with this:
#!/bin/sh
case "$SGE_STARTER_SHELL_START_MODE" in
    unix_behavior)
        exec "$@" ;;
    #
    # Although posix_compliant and script_from_stdin are the same, the behavior is different:
    # posix_compliant = $1 is the
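Since Reuti's script is cut off above, here is a guess at a complete minimal starter_method covering the modes from sge_conf(5); SGE_STARTER_SHELL_PATH being available in the method's environment is an assumption.
#!/bin/sh
case "$SGE_STARTER_SHELL_START_MODE" in
    unix_behavior)
        exec "$@" ;;                                        # job runs itself
    posix_compliant|script_from_stdin)
        exec "${SGE_STARTER_SHELL_PATH:-/bin/sh}" "$@" ;;   # $1 = job script
    *)
        exec "$@" ;;
esac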
Thanks Reuti as always.
If you have a *default* starter_method script please post it as it will help
many since it's
tricky to get everything right for those of us who don't know GE inside-out.
Best,
Joseph
On 10/28/2013 12:12 AM, Reuti wrote:
Hi,
On 28.10.2013 at 01:21, Joseph
Greetings.
We have setup BLCR ( Berkeley Lab Checkpoint/Restart ) on our cluster with
Grid Engine ckpt scripts to process the checkpoints and restart methods.
In an effort to make things as easy as possible for our user base, I am using
Grid Engine starter_method to run our blcr_submit script
On 10/25/2013 03:43 AM, Fritz Ferstl wrote:
Here is an account of the history of the technology
http://blogs.gridengine.com/content/history-sun-grid-engine and my team's 20+
years of involvement.
Very impressive Fritz. I have been using GE for only a year or so and prior
to that
Yes and I did not mean to skip and forget all of the other folks who contributed
to what we know today as Grid Engine.
If you dig far back enough and before it was CODINE, I am sure it started
with someone writing some home grown code.
The main point remains however. Adaptive Computing is an
Are you kidding me? NO?
Have you seen what Adaptive Computing did with Moab? They took Maui,
added to/improved it,
and are now charging a fortune for Moab.
If a company wants to start from scratch with a product fine, but to take a
product contributed
by the community for free and then
Greetings.
Reading the man page for checkpoint, it sounds like ckpt_command and
migr_command can
be called multiple times, and Grid Engine will not wait for the previous call to
end before calling it again?
So if GE calls ckpt_command and the previous ckpt_command has not yet exited,
will GE
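If GE does overlap the calls, a guard inside ckpt_command can make that harmless. A sketch, assuming the checkpoint object passes $job_id and $job_pid as arguments:
#!/bin/sh
job_id=$1 job_pid=$2
exec 9> "/tmp/ckpt.$job_id.lock"
flock -n 9 || exit 0              # previous checkpoint still running: skip
cr_checkpoint -f "ckpt.$job_id.blcr" "$job_pid"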
Howdy.
We have several users using large job-arrays using 1-core per job array element.
We have a shared queue pointing to several 64-core nodes.
With the above setup, each of our 64-core nodes ends up with 64 individual jobs
from
various users. This is normal and expected behavior.
Is
Howdy.
I am setting up GE 8.1.4 with blcr using the GE scripts from
BLCR-GridEngine-Integration-master.zip
One question which I don't see an answer to, is how does one setup an X-minutes
checkpoint interval with GE?
So how can I tell Grid Engine to do a checkpoint say every 30 minutes
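Per qsub(1), the -c option takes either an occasion specifier or a time interval, so a 30-minute checkpoint cadence should be requestable at submission (blcr being the assumed name of the configured checkpoint object):
qsub -ckpt blcr -c 00:30:00 job.sh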
:47 PM, Orion Poplawski wrote:
On 10/2/2013 9:54 PM, Joseph Farran wrote:
Thanks Dave and yes, I accidentally sent it non-ascii - I hate it when
that happens.
I want to tackle single jobs first so I'll try DMTCP.
What SGE scripts do you recommend? I found this but am not sure if there
are better
/dmtcp_starter
Joseph
On 10/2/2013 4:02 PM, Dave Love wrote:
Joseph Farran <jfar...@uci.edu> writes:
[Please don't post content-type: text/html.]
Hi all.
We have Grid Engine 8.1.4 running on a cluster with CentOS 6.4,
using kernel 2.6.32-358.18.1. We are just getting started on
setting up job checkpoint.
We got BLCR compiled and are currently testing it. Before we
go much further and
Thanks Dave and yes, there was something wrong.
The new version now works correctly.
Best,
Joseph
On 09/19/2013 09:50 AM, Dave Love wrote:
I wrote:
I can't reproduce that, at least with the version I have installed.
I should have waited until I could test the distributed version. There
Howdy.
We are running Son of Grid Engine 8.1.3.
I compiled 8.1.4 and downloaded and un-tarred the gui_installer-8.1.4.tar into the
compiled directory.
When I run ./start_gui_installer all is well and 8.1.4 GUI starts up just fine,
but 3 screens later, it bombs with:
Exception in thread
Howdy.
Using GE 8.1.2. I have two jobs which suspended correctly via a Grid Engine
subordinate queue.
I am however trying to force the scheduler to resume ( un-suspend ) the
suspended jobs with no success:
$ qstat | grep compute-14-18
288279 0.5 MakeSummar juser S 04/02/2013
On 3/17/2013 1:42 PM, Reuti wrote:
On 17.03.2013 at 19:15, Joseph Farran wrote:
On 3/17/2013 2:14 AM, Reuti wrote:
On 17.03.2013 at 07:22, Joseph Farran wrote:
On 1/4/2013 10:37 AM, Reuti wrote:
On 02.01.2013 at 05:08, Joseph Farran wrote:
Hello Reuti.
Yes, the job(s
On 1/4/2013 10:37 AM, Reuti wrote:
On 02.01.2013 at 05:08, Joseph Farran wrote:
Hello Reuti.
Yes, the job(s) are not suspending (S) as they normally do. So it's not the
queue, but the jobs.
But is the queue in suspended state (qstat -f)?
Sorry Reuti, missed your question.
Yes
On 2/19/2013 3:22 PM, Reuti wrote:
Did you change this value in the past and it could have been copied with a
different value to the user entry?
I made so many changes I forget, but the more I understand how it works, yes I
think that's what happened.
To answer one of my questions, here is
Joseph Farran <jfar...@uci.edu>:
Hi.
I searched and did not find a way, so I am checking here.
Is there a way to reset ( zero out ) the Grid Engine usage accounting data (
qacct ) for one user only?
Joseph
Using ssh -vvv when the node refuses a connection from the user gives the clue of it
being no-more-sessions@openssh.com
debug1: Requesting no-more-sessions@openssh.com
debug1: Entering interactive session.
debug3: Wrote 192 bytes for a total of 2581
debug1: channel 0: free:
Hi Reuti.
Ah, I thought it was a binary file, it's text based.
Thanks,
Joseph
On 2/14/2013 5:27 PM, Reuti wrote:
Removing the relevant lines from the accounting file should do it (for 'qacct').
-- Reuti
On 15.02.2013 at 01:54, Joseph Farran <jfar...@uci.edu> wrote:
Hi.
I searched and did
Hi All.
To expand a bit on what is going on. We are using Grid Engine 8.1.2 using
Rocks 6.1 for the clustering software.
We have a program that is not behaving nicely with the amount of cores being
requested, so the node easily becomes overloaded.
To keep the node load from going through the
and functional policies.
And thus it's good practice to *not* use projects for anything other than these
policies, or else you might be in trouble if turning on those policies and
requiring projects for them.
Cheers,
Fritz
On 07.02.2013 at 07:39, Joseph Farran wrote:
Hi.
I am using Grid Engine
Hi.
I am using Grid Engine 8.1.2 setup with some 20 queues.
Most queues point to a set of private nodes. A few queues point to a pool of
shared nodes.
All queues are FIFO order. I'd like to convert a couple of the shared queues to
use FairShare instead of FIFO order.
I asked this question
Hi Reuti.
Yes, I am creating a script to be run by cron that will re-adjust the number of
slots allowed per user based on the wait.
In the process of creating the script, I thought of checking first to see if this
already existed with dynamic quotas, so as not to re-invent the wheel.
Thanks,
Joseph
Hi All.
I am using Grid Engine 8.1.2. I am reading up on dynamic resource quotas.
One example I see to allow 5 slots per CPU on all linux hosts is:
limit hosts {@linux_hosts} to slots=$num_proc*5
I'd like to set up the following dynamic resource quota but am not sure if it can be
done?
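For reference, the general shape of such a rule set (the names and the *2 factor are placeholders; the braces around {*} make the limit apply per user):
# qconf -arqs
{
   name         dyn_per_user
   description  per-user dynamic slot cap
   enabled      TRUE
   limit        users {*} hosts {@linux_hosts} to slots=$num_proc*2
}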
Hi Reuti.
Here are my limits for a node and for Grid Engine:
cat /etc/security/limits.conf
* soft memlock unlimited
* hard memlock unlimited
* soft nofile 4096
* hard nofile 10240
# qconf -sconf
execd_params ENABLE_ADDGRP_KILL=TRUE,S_DESCRIPTORS=4096, \
Howdy.
We have a cluster running Rocks 6.1 with Grid Engine 8.1.2.
Every once in a while, we get jobs that fail not being able to set the user id
( setuid fails ).
The nodes have the correct /etc/passwd entry, as many jobs from the same user work while a few fail every once in a while. The
Hello Reuti.
Yes, the job(s) are not suspending (S) as they normally do. So it's not the
queue, but the jobs.
Normally as soon as 1 or more core jobs enters the node through the queue, the subordinate jobs suspend immediately. Once in a while, the jobs that go in through the subordinate
Hi All.
I am running GE 8.1.2 and I have a situation where once in a while ( 2x a week
), Grid Engine forgets about one of the Subordinate queues.
Everything works as expected where my subordinate queue goes to S suspend-mode when a job enters the queue it is subordinate to. However, once in
it so that these types of jobs
can never be queued? Some other kind of verification process?
On 12/24/2012 8:02 AM, Reuti wrote:
Hi,
On 24.12.2012 at 09:08, Joseph Farran wrote:
maybe it's by design. From `man qsub` for the -w option: It should also be noted that load values are not taken
On 12/16/2012 10:15 AM, Dave Love wrote:
I think the answer is not to do that. Why restart it?
Since restarting GE server is not harmful and because Murphy always shows up on
a Friday night on the eve of a long 3 day weekend, sometimes restarting
services (which are safe to restart) is a
Howdy.
This is a minor issue, but one I'd like to see if there is a fix for.
I re-start Grid Engine 8.1.2 every day via a cron job.
I noticed that the qstat listing changes the display order when GE
is restarted.
Before the restart,
Hi Dave.
That's exactly what I am looking for.
Would you be willing to share your script and/or method for populating the
fields? I am assuming this is automated via a script?
Joseph
On 12/13/2012 9:28 AM, Dave Love wrote:
I have a complex cputype with (6!) values like interlagos and
Greetings.
How do I request the CPU type in qrsh / qsub with SGE 8.1.2?
Googling this question shows some answers of the type qrsh -l arch=xxx.
However, all my nodes in my qhost shows the same type of arch:
# qhost -F | grep arch
hl:arch=lx-amd64
hl:arch=lx-amd64
hl:arch=lx-amd64
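Since arch is identical everywhere, the usual alternative (and what Dave describes above with his cputype complex) is a custom host complex; a sketch with placeholder names and values:
# qconf -mc    -- add a line:  cputype  cputype  RESTRING  ==  YES  NO  NONE  0
# qconf -me compute-1-1  -- tag the host:  complex_values cputype=interlagos
qrsh -l cputype=interlagos hostname    # then request it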
Hi All.
I increased our ulimits on our compute nodes and I can request the new limits
if I ssh to the compute nodes:
[root@compute-2-3 security]# tail -5 /etc/security/limits.conf
# End of file
* hard nofile 10240
* soft nofile 4096
* hard nofile 10240
* soft nofile 4096
[user@compute-2-3
Thanks Rayson!
That did the trick.
Best,
Joseph
On 11/14/2012 10:55 AM, Rayson Ho wrote:
Joseph,
You need to set S_DESCRIPTORS, H_DESCRIPTORS with the execd_params
option in sge_conf:
http://gridscheduler.sourceforge.net/htmlman/htmlman5/sge_conf.html
Example: H_DESCRIPTORS=1
Rayson
Thanks Reuti.
That was the mystery why it looked like some queues were using less cores.
Best,
Joseph
On 11/07/2012 11:38 PM, Reuti wrote:
This shows the location of the master queue for this job only, not its
allocation inside the cluster, which depends on the defined allocation_rule in
Hi.
I am using SGE 8.1.2 with several queues and recently, several of
my 64-slots queues are not scheduling the full 64-cores.
So if I submit 64 1-core jobs, only 57 or so are scheduled per node
instead of 64. If I submit 4 16-core pe jobs, only 3 of
this?
On 11/7/2012 9:25 PM, Joseph Farran wrote:
Hi.
I am using SGE 8.1.2 with several queues and recently, several of my 64-slots
queues are not scheduling the full 64-cores.
So if I submit 64 1-core jobs, only 57 or so are scheduled per node instead of
64. If I submit 4 16-core pe jobs
Hi all.
I googled this issue but did not see much help on the subject.
I have several queues with hard wall clock limits like this one:
# qconf -sq queue | grep h_rt
h_rt 96:00:00
I am running Son of Grid engine 8.1.2 and many jobs run past the hard wall
clock limit and
killed when they go past their
wall clock time.
How can I investigate this further?
On 10/30/2012 11:44 AM, Reuti wrote:
Hi,
On 30.10.2012 at 19:31, Joseph Farran wrote:
I googled this issue but did not see much help on the subject.
I have several queues with hard wall clock limits like
On 10/30/2012 12:07 PM, Reuti wrote:
On 30.10.2012 at 20:02, Joseph Farran wrote:
Hi Reuti.
Yes, I had that already set:
qconf -sconf|fgrep execd_params
execd_params ENABLE_ADDGRP_KILL=TRUE
What is strange is that 1 out of 10 jobs or so do get killed just fine when
they go
for the h_rt and nothing either.
On 10/30/2012 01:49 PM, Reuti wrote:
On 30.10.2012 at 20:18, Joseph Farran wrote:
Here is one case:
qstat | egrep "12959|12960"
12959 0.50500 dna.pmf_17 amentes r 10/24/2012 18:59:12
free2@compute-12-22.local 1
12960 0.50500 dna.pmf_17 amentes
Joseph Farran:
Did not have loglevel set to log_info, so I updated it, restarted GE on the
master, and did a softstop and start on the compute node.
I got a lot more log information now, but still no cigar:
# cat /var/spool/ge/compute-12-22/messages | fgrep h_rt
#
Checked a few other compute nodes
No:
# qconf -sq free2 | fgrep terminate
terminate_method NONE
On 10/30/2012 03:07 PM, Reuti wrote:
Mmh, was the terminate method redefined in the queue configuration of the queue
in question?
On 30.10.2012 at 23:04, Joseph Farran wrote:
No, still no cigar.
# cat /var/spool/ge
correctly.
Oh well, thanks Reuti. I will keep playing with this...
On 10/30/2012 03:53 PM, Reuti wrote:
On 30.10.2012 at 23:45, Joseph Farran wrote:
No:
# qconf -sq free2 | fgrep terminate
terminate_method NONE
Is the process still doing something serious or hanging somewhere in a loop
08:55, Daniel Gruber wrote:
On 26.10.2012 at 07:58, Joseph Farran wrote:
Howdy.
One of my queues has a wall time hard limit of 4 days ( 96 hours ):
# qconf -sq queue | grep h_rt
h_rt 96:00:00
There is a job which has been running much longer than 4 days and I am not sure
Ah I missed that.
Yes we have awk version 3.1.5 and the readme says 3.1.6 or higher.
We will be upgrading OS from SL 5.7 to 6.3 soon so that should fix this.
Thanks,
Joseph
On 10/29/2012 11:11 AM, Reuti wrote:
On 29.10.2012 at 19:08, Joseph Farran wrote:
Thanks Reuti, but it does not work
Howdy.
One of my queues has a wall time hard limit of 4 days ( 96 hours
):
# qconf -sq queue | grep h_rt
h_rt 96:00:00
There is a job which has been running much longer than 4 days and
I am not sure how to get the hours the job has been
this, I think:
http://moo.nac.uci.edu/~hjm/BDUC_Pay_For_Priority.html
If it is inaccurate, please let me know and I'll correct it.
hjm
On Sunday, October 14, 2012 01:42:38 AM Joseph Farran wrote:
Hi All.
I have a queue on our cluster with 1,000 cores that all users can use.
I like to keep
Syntax question on the limit.
In order to place a limit of say 333 cores per user on queue free, is the
syntax:
limit users * queues free to slots=333
Correct?
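Worth noting the brace semantics from sge_resource_quota(5): without braces the 333 slots are shared by all users combined; with braces each user gets their own 333-slot cap, which is what the per-user goal above suggests:
limit users * queues free to slots=333     # one combined cap for everyone
limit users {*} queues free to slots=333   # a separate 333-slot cap per user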
On 10/15/2012 01:32 PM, Joseph Farran wrote:
Hi Harry.
Thanks. I understand the general fair share methods available
Thanks William, Reuti and Dave.
I will try the pointers made here.
Joseph
On 09/20/2012 02:13 AM, Reuti wrote:
On 20.09.2012 at 02:08, Joseph Farran wrote:
What is the recommended way and/or do scripts exists for cleaning up once a job
completes/dies/crashes on a node?
I would prefer
Dave,
I am having the same/similar issues as Brian's but with 8.1.2. But for me,
it's even worse.
There are only two resources I can request which are mem_total and
swap_total. All others fail.
$ qrsh -l mem_total=1M
Last login: Mon Sep 10 22:02:39 2012 from login-1-1.local
Hi Brian.
Cool and thank you for pointing this out and the fix.
Being so new to GE and after 20+ posts on this issue, I thought it
was something wrong in my GE configuration! Glad to hear it was
not me :-)
Best,
Joseph
Mark,
Thanks! I just upgraded to 8.1.2. Will these patches work with 8.1.2 or
were they intended only for 8.1.1?
Joseph
On 09/10/2012 07:45 AM, Mark Dixon wrote:
Hi,
Way back in May I promised this list a simple integration of gridengine with
the cgroup functionality found in
Hi All.
Is there a way ( hopefully an easy way ) to have Grid Engine give an
informative message when a job has gone past a limit and been killed, like when a
job goes over the wall time limit.
When I get an email from Grid Engine where a job has gone past its wall time
limit, it is not very
Thanks Reuti.
I think this sends an additional email, correct? Any easy way to append or check for
-m bea in case the user does not want the email?
Joseph
On 09/11/2012 11:21 AM, Reuti wrote:
Hi,
On 11.09.2012 at 19:10, Joseph Farran wrote:
Is there a way ( hopefully easy way ) to have
On 8/31/2012 6:58 AM, Dave Love wrote:
In the absence of any knowledge about that cluster, that doesn't confirm that it's reported for the specific hosts that the scheduler complained about, just that it's reported for some. Look
explicitly at the load parameters from one of the hosts in
Hi.
I am trying to request nodes with a certain mem_free value and I am not sure
what is missing in my configuration that this does not work.
My test nodes in my space1 queue has:
$ qstat -F -q space1 | grep mem_free
hl:mem_free=6.447G
hl:mem_free=7.237G
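One quick sanity check, in case the complex definition is the problem: mem_free must be flagged requestable (the fifth column, YES) in the complex:
qconf -sc | awk '$1 == "mem_free"'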
Hi Mazouzi.
I still get the same issue. With no mem_free request, all works OK:
$ qrsh -q space1 -l mem_free=1G
error: no suitable queues
$ qrsh -q space1
Last login: Wed Aug 29 14:31:07 2012 from login-1-1.local
Rocks Compute Node
Rocks 5.4.3 (Viper)
Profile built 14:11 07-May-2012
On 08/30/2012 02:22 PM, Dave Love wrote:
That doesn't actually demonstrate that it's on the relevant nodes (e.g. qconf
-se), though I'll believe it is. The -w v messages suggest that there's no load
report from those nodes. What OS is this, and what load values are actually
reported by one of
Thanks Dave.
We just discovered that we cannot request nodes with -l mem_free=xxx.
We are on 8.1.1. Does this new release fix this?
Joseph
On 08/28/2012 09:57 AM, Dave Love wrote:
SGE 8.1.2 is available from
http://arc.liv.ac.uk/downloads/SGE/releases/8.1.2/. It is a large
superset of the
I don't use it, but one of our users used it successfully before we
moved to GE 8.1.1.
# qstat -q bio -F mem_free | fgrep mem
hl:mem_free=498.198G
hl:mem_free=498.528G
hl:mem_free=499.143G
hl:mem_free=498.959G
hl:mem_free=499.198G
$ qrsh -q bio
Hi Reuti.
Here it is with the additional info:
$ qrsh -w v -q bio -l mem_free=190G
Job 1637 (-l h_rt=604800,mem_free=190G) cannot run in queue
bio@compute-2-7.local because job requests unknown resource (mem_free)
Job 1637 (-l h_rt=604800,mem_free=190G) cannot run in queue
Thanks William.
Setting the consumable to JOB did the trick!
Best,
Joseph
On 08/23/2012 12:32 AM, William Hay wrote:
On 22 August 2012 23:53, Joseph Farran <jfar...@uci.edu> wrote:
You have consumable set to YES which means the request is multiplied
by the number of slots you request 64 so you
Hi Dave.
Any updates when the bug that causes sge_shepherd to run at 100% when one uses
qrsh is going to be fixed for sge 8.1.1?
I just tested it using qrsh and the bug is there.
Joseph