Re: [gridengine users] Removing 1.4 BILLION tasks job array

2019-08-08 Thread Joseph Farran
Thanks Joshua! That did the 1-Trillion dollar trick! Best, Joseph On 8/7/2019 10:50 PM, Joshua Baker-LePain wrote: On Wed, 7 Aug 2019 at 4:40pm, Joseph Farran wrote A user accidentally submitted a 1.4

Re: [gridengine users] Removing 1.4 BILLION tasks job array

2019-08-07 Thread Joseph Farran
Correction.   1 TRILLION   :-) On 8/7/2019 4:40 PM, Joseph Farran wrote: Howdy. A user accidentally submitted a 1.4 BILLION job array on our HPC cluster.    How can I remove it? I cannot qdel the job nor can I

[gridengine users] Removing 1.4 BILLION tasks job array

2019-08-07 Thread Joseph Farran
Howdy. A user accidentally submitted a 1.4 BILLION job array on our HPC cluster.    How can I remove it? I cannot qdel the job nor can I qhold the job because it crashes SGE.   I can restart SGE just fine but the job remains. I removed the SGE job script itself from
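One approach often suggested for a job that neither qdel nor qhold can handle is a forced delete, which skips the normal cleanup handshake with the execds (a sketch only; the thread does not show which fix finally worked, and the job id below is made up):

    qdel -f 1234567                  # force-delete the whole array
    qdel -f 1234567 -t 1-1000000     # or force-delete a range of tasks at a time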

Re: [gridengine users] Grid Engine Sluggish

2019-01-26 Thread Joseph Farran
11:26 AM, Daniel Povey wrote: It may depend on specific features of those large job arrays.  You could try deleting them and see if the problem disappears. On Sat, Jan 26, 2019 at 2:23 PM Joseph Farran mailto:jfar...@uci.edu>> wrote: Hi Daniel. Yes I do have large job-arrays aro

Re: [gridengine users] Grid Engine Sluggish

2019-01-26 Thread Joseph Farran
Joseph Farran mailto:jfar...@uci.edu> wrote: Hi Daniel. Yes I do have large job-arrays around 7k tasks BUT I have had larger job arrays of 500k without seeing this kind of slowdown. Joseph On 1/26/2019 10:16 AM, Daniel Povey wrote: > Check if there are any hug

Re: [gridengine users] Grid Engine Sluggish

2019-01-26 Thread Joseph Farran
of jobs, can make it slow. On Sat, Jan 26, 2019 at 7:05 AM Reuti mailto:re...@staff.uni-marburg.de>> wrote: Hi, > Am 26.01.2019 um 10:20 schrieb Joseph Farran mailto:jfar...@uci.edu>>: > > Hi. > Our Grid Engine is running very sluggish all of a sudden. S

Re: [gridengine users] Grid Engine Sluggish

2019-01-26 Thread Joseph Farran
Hi Reuti. Yes - several times with no success. Joseph On 1/26/2019 4:03 AM, Reuti wrote: Hi, Am 26.01.2019 um 10:20 schrieb Joseph Farran : Hi. Our Grid Engine is running very sluggish all of a sudden. sge_qmaster stays at 100

[gridengine users] Grid Engine Sluggish

2019-01-26 Thread Joseph Farran
Hi. Our Grid Engine is running very sluggish all of a sudden. sge_qmaster stays at 100% all the time, where it used to be at 100% for only a few seconds every 30 seconds or so. I ran the qping command but am not sure how to read it. Any helpful insight much appreciated: qping -i 5 -info hpc-s 6444
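For reference, qping expects the daemon name and its id after the host and port, so a qmaster check on the default port 6444 looks something like this (a sketch, not necessarily the exact invocation used in the thread):

    qping -info hpc-s 6444 qmaster 1    # one-shot status/info dump from the qmaster
    qping -i 5 hpc-s 6444 qmaster 1     # keep pinging every 5 seconds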

Re: [gridengine users] C|!!!!!!!!!! got NULL element for EH_name !!!!!!!!!!

2018-11-10 Thread Joseph Farran
Glad you were able to fix it Dan. I looked at Univa Grid Engine a while ago and it was super expensive. I was able to ask lots of questions of a potential candidate for a position we had who was using Univa GE. His sentiments were that it was

Re: [gridengine users] sge_execd dies

2018-11-09 Thread Joseph Farran
, Nov 9, 2018 at 12:12 AM Joseph Farran <jfar...@uci.edu> wrote: Hi Dan. Thank you for the suggestion.   Here is what I have: # qconf -sconf | grep gid_range gid_range    200-

Re: [gridengine users] sge_execd dies

2018-11-08 Thread Joseph Farran
the range of possible userids. On Fri, Nov 9, 2018 at 12:12 AM Joseph Farran <jfar...@uci.edu> wrote: Hi Dan. Thank you for the suggestion.   Here is what

Re: [gridengine users] sge_execd dies

2018-11-08 Thread Joseph Farran
:33 PM Joseph Farran <jfar...@uci.edu> wrote: Greetings. I am running SGE 8.1.9 on a cluster with some 10k cores, CentOS 6.9. I am seeing job failures on nodes where the

[gridengine users] sge_execd dies

2018-11-08 Thread Joseph Farran
Greetings. I am running SGE 8.1.9 on a cluster with some 10k cores, CentOS 6.9. I am seeing job failures on nodes where the node's sge_execd unexpectedly dies. I ran strace on the node's sge_execd and it's not much help. It always
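The advice in this thread boils down to making sure gid_range in the global configuration does not overlap real user or group ids; a hedged sketch of checking and moving it (the 20000 range is only an example):

    qconf -sconf | grep gid_range    # see the current range
    qconf -mconf                     # then set, e.g.:  gid_range  20000-20100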

Re: [gridengine users] S-GAE v2.0 is available!

2016-02-16 Thread Joseph Farran
Cool! Thank you Gabriel! Best, Joseph On 02/16/2016 01:39 AM, RDlab wrote: Hello, S-GAE is a free GNU web application designed to display accounting information generated by the Grid Engine family. This data is stored in a database in order to display eye-candy charts grouped by user,

Re: [gridengine users] Using multiple queues inherits s_rt h_rt

2015-06-02 Thread Joseph Farran
: for -q free64,bio, what does GE do to choose an available queue for a job? Will it sort and use alphabetical order? On Fri, May 29, 2015 at 8:12 AM, William Hay w@ucl.ac.uk wrote: On Thu, 28 May 2015 19:27:07 + Joseph Farran jfar...@uci.edu wrote: Hi all. I am not sure if this is a bug

Re: [gridengine users] Using multiple queues inherits s_rt h_rt

2015-06-02 Thread Joseph Farran
. Not sure if this answers your question? Joseph On 05/29/2015 05:12 AM, William Hay wrote: On Thu, 28 May 2015 19:27:07 + Joseph Farran jfar...@uci.edu wrote: Hi all. I am not sure if this is a bug or the way Grid Engine works. We have several queues our users submit jobs to. One

Re: [gridengine users] qstat to show wall-clock time left for a job?

2015-05-28 Thread Joseph Farran
Ok, Reuti wrote one which is available at: * /opt/gridengine/bin/qstatus Joseph On 05/27/2015 02:27 PM, Joseph Farran wrote: Hi All. Running SGE 8.1.8 on CentOS 6.6. Does someone have a qstat type of script that will show how much time is left on a running job that was submitted

[gridengine users] Using multiple queues inherits s_rt h_rt

2015-05-28 Thread Joseph Farran
Hi all. I am not sure if this is a bug or the way Grid Engine works. We have several queues our users submit jobs to. One of the queues, free64, has a 3-day wall-clock limit: $ qconf -sq free64 | grep _rt s_rt 72:00:00 h_rt 72:05:00 While another queue, bio

[gridengine users] How to tell if queue is suspend-able ?

2015-01-22 Thread Joseph Farran
Hi All. Is there a way using qconf and/or qhost to tell if a queue or queue-instance is a suspend-able queue? I've been checking the manual pages and cannot find how. Joseph

Re: [gridengine users] suspended jobs continue to run

2015-01-22 Thread Joseph Farran
A little late but I am running 8.1.7 and suspend only worked part of the time. I had to write my own suspend script to make it work, especially with MATLAB jobs which try to trap signals. Joseph On 12/19/2014 04:54 AM, berg...@merctech.com wrote: On December 19, 2014 6:19:58 AM EST, Reuti

[gridengine users] Checkpoint CKPT failed migrating because: unknown reason

2014-12-06 Thread Joseph Farran
Hi All. We are using Son of GE 8.1.7 with checkpoint BLCR. All works great. Even though everything works just fine, SGE log message shows the following when a job is migrated: 2/05/2014 22:18:08|worker|hpc-s|W|job 3029146.1 failed on host compute-7-5.local migrating because: unknown

[gridengine users] How to tell GE to RE-Suspend a job?

2014-08-07 Thread Joseph Farran
Hi All. I am using Son of Grid Engine 8.1.6. We have an issue that occurs once in a while in which Grid Engine will suspend a job ( subordinate queue ) and while Grid Engine thinks the job is suspended ( qstat shows S for job state ), the process on the node keeps running and not really

Re: [gridengine users] How to tell GE to RE-Suspend a job?

2014-08-07 Thread Joseph Farran
Thanks Reuti. I'll give that a try. Do I need to setup an un-suspend method / script as well? Joseph On 8/7/2014 2:33 PM, Reuti wrote: Hi, Am 07.08.2014 um 21:14 schrieb Joseph Farran: I am using Son of Grid Engine 8.1.6. We have an issue that occurs once in a while in which Grid

[gridengine users] Core Binding and Node Load

2014-04-29 Thread Joseph Farran
Howdy. We are running Son of GE 8.1.6 on CentOS 6.5 with core binding turned on for our 64-core nodes. $ qconf -sconf | grep BINDING ENABLE_BINDING=TRUE When I submit an OpenMP job with: #!/bin/bash #$ -N TESTING #$ -q q64 #$ -pe openmp 16 #$ -binding linear:

Re: [gridengine users] The state of a queue

2014-04-22 Thread Joseph Farran
Thank you all for the helpful suggestions. Mark, your scripts are exactly what I was looking for! Thanks. Joseph

[gridengine users] The state of a queue

2014-04-17 Thread Joseph Farran
Howdy. I am able to disable / enable a queue @ a compute node with: $ qmod -d bio@compute-1-1 me@sys changed state of bio@compute-1-1.local (disabled) $ qmod -e bio@compute-1-1 me@sys changed state of bio@compute-1-1.local (enabled) But how can I query the state of a queue @ a node? In
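One way to answer this without extra scripts is the states column of qstat -f, which is empty for a normal queue instance and shows flags such as d (disabled) or s (suspended) otherwise; a sketch:

    qstat -f -q bio@compute-1-1    # the rightmost column is the queue instance state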

Re: [gridengine users] Requesting multiple cores on one node

2014-01-15 Thread Joseph Farran
Allison, I love Grid Engine but this is the one feature I truly miss from Torque: -l nodes=x:ppn=[count] Reuti, We have a complex setup trying to accomplish this same thing and it kind of works, but we have an issue with jobs not starting when jobs are running on a subordinate queue. First,

Re: [gridengine users] Son of Grid Engine 8.1.6 available

2013-11-04 Thread Joseph Farran
Cheers Dave! On 11/4/2013 3:44 PM, Dave Love wrote: SGE 8.1.6 is available from http://arc.liv.ac.uk/downloads/SGE/releases/8.1.6/, fixing various bugs. Please report bugs, patches and suggestions for enhancement https://arc.liv.ac.uk/trac/SGE#mail. Release notes: * Bug fixes * Man and

Re: [gridengine users] Queue limit s_rt / h_rt and CheckPoint

2013-11-01 Thread Joseph Farran
Hi Reuti. Yes, after going through the logs, the subsequent restarts are messed up. I've played with it more and there is no easy way to do this inside the job submission script, so I will have to resort ( as you indicated ) to using an outside script to run periodically and do a qsub -sj job /

[gridengine users] Queue limit s_rt / h_rt and CheckPoint

2013-10-31 Thread Joseph Farran
Greetings. We have a queue defined with a soft / hard wall-clock limit of: qconf -sq free64 | egrep "_rt|notify" notify 00:05:00 s_rt 48:00:00 h_rt 48:05:00 And jobs get killed correctly after 2 days of wall-clock run time. We now have Grid Engine

Re: [gridengine users] Queue limit s_rt / h_rt and CheckPoint

2013-10-31 Thread Joseph Farran
is reached and the job receives the SIGUSR1 signal, it suspends the job via qmod. Joseph On 10/31/2013 11:48 AM, Joseph Farran wrote: Greetings. We have a queue defined with a soft / hard wall-clock limit of: qconf -sq free64 | egrep "_rt|notify" notify 00:05:00 s_rt 48:00

Re: [gridengine users] BLCR starter_method

2013-10-29 Thread Joseph Farran
Thank you Reuti. On 10/29/2013 11:47 AM, Reuti wrote: I came up with this: #!/bin/sh case $SGE_STARTER_SHELL_START_MODE in unix_behavior) exec $@ ;; # # Although posix_compliant and script_from_stdin are the same, the behavior is different: # posix_compliant = $1 is the

Re: [gridengine users] BLCR starter_method

2013-10-28 Thread Joseph Farran
Thanks Reuti as always. If you have a *default* starter_method script please post it as it will help many since it's tricky to get everything right for those of us who don't know GE inside-out. Best, Joseph On 10/28/2013 12:12 AM, Reuti wrote: Hi, Am 28.10.2013 um 01:21 schrieb Joseph

[gridengine users] BLCR starter_method

2013-10-27 Thread Joseph Farran
Greetings. We have setup BLCR ( Berkeley Lab Checkpoint/Restart ) on our cluster with Grid Engine ckpt scripts to process the checkpoints and restart methods. In an effort to make things as easy as possible for our user base, I am using Grid Engine starter_method to run our blcr_submit script

Re: [gridengine users] Welcome Home Grid Engine!

2013-10-25 Thread Joseph Farran
On 10/25/2013 03:43 AM, Fritz Ferstl wrote: Here is an account of the history of the technology http://blogs.gridengine.com/content/history-sun-grid-engine and my team's 20+ years of involvement. Very impressive Fritz. I have been using GE for only a year or so and prior to that

Re: [gridengine users] Welcome Home Grid Engine!

2013-10-24 Thread Joseph Farran
Yes and I did not mean to skip and forget all of the other folks who contributed to what we know today as Grid Engine. If you dig far back enough and before it was CODINE, I am sure it started with someone writing some home grown code. The main point remains, however. Adaptive Computing is an

Re: [gridengine users] Welcome Home Grid Engine!

2013-10-23 Thread Joseph Farran
Are you kidding me? NO? Have you seen what Adaptive Computing did with Moab? They took Maui, added/improved it, and are now charging a fortune for Moab. If a company wants to start from scratch with a product fine, but to take a product contributed by the community for free and then

[gridengine users] Grid Engine ckpt_command running multiple times?

2013-10-15 Thread Joseph Farran
Greetings. Reading the man page for checkpoint, it sounds like ckpt_command / migr_command can be called multiple times and Grid Engine will not wait for the previous call to end before calling it again? So if GE calls ckpt_command and the previous ckpt_command has not yet exited, will GE

[gridengine users] Requesting Exclusive node

2013-10-14 Thread Joseph Farran
Howdy. We have several users running large job-arrays using 1 core per job-array element. We have a shared queue pointing to several 64-core nodes. With the above setup, each of our 64-core nodes ends up with 64 individual jobs from various users. This is normal and expected behavior. Is
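A common answer to this request is an EXCL consumable complex that a job can ask for to get the whole node; a hedged sketch assuming no such complex exists yet (attribute names are illustrative):

    # qconf -mc  -- add a line such as:
    #   exclusive   excl   BOOL   EXCL   YES   YES   0   1000
    # then add exclusive=true to complex_values on the hosts, and submit with:
    qsub -l exclusive=true job.sh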

[gridengine users] Grid Engine BLCR checkpoint interval ( x-minutes )?

2013-10-06 Thread Joseph Farran
Howdy. I am setting up GE 8.1.4 with BLCR using the GE scripts from BLCR-GridEngine-Integration-master.zip One question which I don't see an answer to is how does one set up an X-minute checkpoint interval with GE? So how can I tell Grid Engine to do a checkpoint say every 30 minutes
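The checkpoint interval is normally requested per job with qsub -c, which accepts a time specification as well as the letter occasion specifiers; a hedged example (exact behaviour also depends on the ckpt object's settings):

    qsub -ckpt blcr -c 00:30:00 job.sh    # ask for a checkpoint roughly every 30 minutes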

Re: [gridengine users] Job Checkpoint: BLCR or DMTCP ?

2013-10-03 Thread Joseph Farran
:47 PM, Orion Poplawski wrote: On 10/2/2013 9:54 PM, Joseph Farran wrote: Thanks Dave and yes, I accidentally sent it non-ascii - I hate it when that happens. I want to tackle single jobs first so I'll try DMTCP. What SGE scripts do you recommend? I found this but not sure if there are better

Re: [gridengine users] Job Checkpoint: BLCR or DMTCP ?

2013-10-02 Thread Joseph Farran
/dmtcp_starter Joseph On 10/2/2013 4:02 PM, Dave Love wrote: Joseph Farran jfar...@uci.edu writes: [Please don't post content-type: text/html.] Hi all. We have Grid Engine 8.1.4 running on a cluster with CentOS 6.4, using kernel 2.6.32-358.18.1. We are just getting started on setting up job

[gridengine users] Job Checkpoint: BLCR or DMTCP ?

2013-09-29 Thread Joseph Farran
Hi all. We have Grid Engine 8.1.4 running on a cluster with CentOS 6.4, using kernel 2.6.32-358.18.1. We are just getting started on setting up job checkpoint. We got BLCR compiled and are currently testing it. Before we go much further and

Re: [gridengine users] start_gui_installer SGE 8.1.4

2013-09-24 Thread Joseph Farran
Thanks Dave and yes, there was something wrong. The new version now works correctly. Best, Joseph On 09/19/2013 09:50 AM, Dave Love wrote: I wrote: I can't reproduce that, at least with the version I have installed. I should have waited until I could test the distributed version. There

[gridengine users] start_gui_installer SGE 8.1.4

2013-09-17 Thread Joseph Farran
Howdy. We are running Son of Grid Engine 8.1.3. I compiled 8.1.4 and downloaded and un-tarred the gui_installer-8.1.4.tar into the compiled directory. When I run ./start_gui_installer all is well and the 8.1.4 GUI starts up just fine, but 3 screens later, it bombs with: Exception in thread

[gridengine users] Cannot UN-suspend suspended job

2013-04-03 Thread Joseph Farran
Howdy. Using GE 8.1.2. I have two jobs which suspended correctly via a Grid Engine subordinate queue. I am however trying to force the scheduler to resume ( un-suspend ) the suspended jobs with no success: $ qstat | grep compute-14-18 288279 0.5 MakeSummar juser S 04/02/2013
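For reference, un-suspending a job by hand is qmod -usj (as opposed to -us for a queue instance); note that if the subordination condition still holds, the scheduler may simply suspend the job again:

    qmod -usj 288279                     # clear the suspend state on the job
    qmod -us 'free64@compute-14-18'      # or un-suspend the queue instance (queue name illustrative)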

Re: [gridengine users] Forgetting the Subordinate Queue

2013-03-18 Thread Joseph Farran
On 3/17/2013 1:42 PM, Reuti wrote: Am 17.03.2013 um 19:15 schrieb Joseph Farran: On 3/17/2013 2:14 AM, Reuti wrote: Am 17.03.2013 um 07:22 schrieb Joseph Farran: On 1/4/2013 10:37 AM, Reuti wrote: Am 02.01.2013 um 05:08 schrieb Joseph Farran: Hello Reuti. Yes, the job(s

Re: [gridengine users] Forgetting the Subordinate Queue

2013-03-17 Thread Joseph Farran
On 1/4/2013 10:37 AM, Reuti wrote: Am 02.01.2013 um 05:08 schrieb Joseph Farran: Hello Reuti. Yes, the job(s) are not suspending (S) as they normally do. So it's not the queue, but the jobs. But is the queue in suspended state (qstat -f)? Sorry Reuti, missed your question. Yes

Re: [gridengine users] Functional Shares and how to see distribution?

2013-02-19 Thread Joseph Farran
On 2/19/2013 3:22 PM, Reuti wrote: Did you change this value in the past and it could have been copied with a different value to the user entry? I made so many changes I forget, but the more I understand how it works, yes, I think that's what happened. To answer one of my questions, here is

Re: [gridengine users] Reset user accounting data?

2013-02-15 Thread Joseph Farran
Joseph Farran jfar...@uci.edu: Hi. I searched and did not find a way, so I am checking here. Is there a way to reset ( zero out ) the Grid Engine usage accounting data ( qacct ) for one user only? Joseph

Re: [gridengine users] Fwd: Unable to ssh into node

2013-02-14 Thread Joseph Farran
Using ssh -vvv when the node refuses a connection from the user gives the clue of it being no-more-sessi...@openssh.com debug1: Requesting no-more-sessi...@openssh.com debug1: Entering interactive session. debug3: Wrote 192 bytes for a total of 2581 debug1: channel 0: free:

Re: [gridengine users] Reset user accounting data?

2013-02-14 Thread Joseph Farran
Hi Reuti. Ah, I thought it was a binary file, it's text based. Thanks, Joseph On 2/14/2013 5:27 PM, Reuti wrote: Removing the relevant lines from the accounting file should do it (for 'qacct'). -- Reuti Am 15.02.2013 um 01:54 schrieb Joseph Farran jfar...@uci.edu: Hi. I searched and did

Re: [gridengine users] Fwd: Unable to ssh into node

2013-02-13 Thread Joseph Farran
Hi All. To expand a bit on what is going on. We are using Grid Engine 8.1.2 with Rocks 6.1 for the clustering software. We have a program that is not behaving nicely with the amount of cores being requested, so the node easily gets overloaded. To keep the node load from going through the

Re: [gridengine users] FairShare on a group of users?

2013-02-07 Thread Joseph Farran
and functional policies. And thus it's good practice to *not* use projects for anything else than these policies or else you might be in trouble if turning on those policies and requiring projects for them. Cheers, Fritz Am 07.02.2013 um 07:39 schrieb Joseph Farran: Hi. I am using Grid Engine

[gridengine users] FairShare on a group of users?

2013-02-06 Thread Joseph Farran
Hi. I am using Grid Engine 8.1.2 set up with some 20 queues. Most queues point to a set of private nodes. A few queues point to a pool of shared nodes. All queues are in FIFO order. I'd like to convert a couple of the shared queues to use FairShare instead of FIFO order. I asked this question

Re: [gridengine users] Dynamic Resource Quotas

2013-02-05 Thread Joseph Farran
Hi Reuti. Yes, I am creating a script to be run by cron that will re-adjust the number of slots allowed per user based on the wait. In the process of creating the script, I thought of checking first to see if this already existed with dynamic quotas so as to not re-invent the wheel. Thanks, Joseph

[gridengine users] Dynamic Resource Quotas

2013-02-04 Thread Joseph Farran
Hi All. I am using Grid Engine 8.1.2. I am reading up on dynamic resource quotas. One example I see, to allow 5 slots per CPU on all linux hosts, is: limit hosts {@linux_hosts} to slots=$num_proc*5 I'd like to set up the following dynamic resource quota but am not sure if it can be done?
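For reference, the quoted example lives in a resource quota set, whose full form looks roughly like this (set name and description are illustrative; created/edited with qconf -arqs / -mrqs):

    {
       name         linux_host_slots
       description  "5 slots per CPU on linux hosts"
       enabled      TRUE
       limit        hosts {@linux_hosts} to slots=$num_proc*5
    }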

Re: [gridengine users] SETUID Failed

2013-01-14 Thread Joseph Farran
Hi Reuti. Here are my limits for a node and for Grid Engine: cat /etc/security/limits.conf * soft memlock unlimited * hard memlock unlimited * soft nofile 4096 * hard nofile 10240 # qconf -sconf execd_params ENABLE_ADDGRP_KILL=TRUE,S_DESCRIPTORS=4096, \

[gridengine users] SETUID Failed

2013-01-13 Thread Joseph Farran
Howdy. We have a cluster running Rocks 6.1 with Grid Engine 8.1.2. Every once in a while, we get jobs that fail not being able to set the user id ( setuid fails ). The nodes have the correct /etc/passwd entry as many jobs from the same user work while a few fail every once in a while. The

Re: [gridengine users] Forgetting the Subordinate Queue

2013-01-01 Thread Joseph Farran
Hello Reuti. Yes, the job(s) are not suspending (S) as they normally do. So it's not the queue, but the jobs. Normally as soon as 1 or more core jobs enter the node through the queue, the subordinate jobs suspend immediately. Once in a while, the jobs that go in through the subordinate

[gridengine users] Forgetting the Subordinate Queue

2012-12-31 Thread Joseph Farran
Hi All. I am running GE 8.1.2 and I have a situation where once in a while ( 2x a week ), Grid Engine forgets about one of the subordinate queues. Everything works as expected where my subordinate queue goes to S suspend-mode when a job enters the queue it is subordinate to. However, once in

Re: [gridengine users] Broken -w e in .sge_request ?

2012-12-25 Thread Joseph Farran
it so that these types of jobs can never be queued? Some other kind of verification process? On 12/24/2012 8:02 AM, Reuti wrote: Hi, Am 24.12.2012 um 09:08 schrieb Joseph Farran: maybe it's by design. From `man qsub` for the -w option: It should also be noted that load values are not taken

Re: [gridengine users] Restarting Grid Engine makes qstat forget display order

2012-12-17 Thread Joseph Farran
On 12/16/2012 10:15 AM, Dave Love wrote: I think the answer is not to do that. Why restart it? Since restarting the GE server is not harmful and because Murphy always shows up on a Friday night on the eve of a long 3-day weekend, sometimes restarting services (which are safe to restart) is a

[gridengine users] Restarting Grid Engine makes qstat forget display order

2012-12-16 Thread Joseph Farran
Howdy. This is a minor issue, but one I'd like to see if there is a fix for. I re-start Grid Engine 8.1.2 every day via a cron job. I noticed that the qstat listing changes the display order when GE is restarted. Before the restart,

Re: [gridengine users] Requesting CPU Type with qsub / qrsh ?

2012-12-14 Thread Joseph Farran
Hi Dave. That's exactly what I am looking for. Would you be willing to share your script and/or method for populating the fields? I am assuming this is automated via a script? Joseph On 12/13/2012 9:28 AM, Dave Love wrote: I have a complex cputype with (6!) values like interlagos and

[gridengine users] Requesting CPU Type with qsub / qrsh ?

2012-12-11 Thread Joseph Farran
Greetings. How do I request the CPU type in qrsh / qsub with SGE 8.1.2? Googling this question shows some answers of the type qrsh -l arch=xxx. However, all the nodes in my qhost output show the same arch: # qhost -F | grep arch hl:arch=lx-amd64 hl:arch=lx-amd64 hl:arch=lx-amd64
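The reply above from Dave describes the usual answer: a host-level string complex that jobs can request. A hedged sketch of that kind of setup (attribute and value names are illustrative, not his actual configuration):

    # qconf -mc  -- add:
    #   cputype   cputype   RESTRING   ==   YES   NO   NONE   0
    # qconf -me compute-1-1  -- set:  complex_values cputype=interlagos
    qrsh -l cputype=interlagos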

[gridengine users] ulimits not taking in GE

2012-11-14 Thread Joseph Farran
Hi All. I increased our ulimits on our compute nodes and I can request the new limits if I ssh to the compute nodes: [root@compute-2-3 security]# tail -5 /etc/security/limits.conf # End of file * hard nofile 10240 * soft nofile 4096 * hard nofile 10240 * soft nofile 4096 [user@compute-2-3

Re: [gridengine users] ulimits not taking in GE

2012-11-14 Thread Joseph Farran
Thanks Rayson! That did the trick. Best, Joseph On 11/14/2012 10:55 AM, Rayson Ho wrote: Joseph, You need to set S_DESCRIPTORS, H_DESCRIPTORS with the execd_params option in sge_conf: http://gridscheduler.sourceforge.net/htmlman/htmlman5/sge_conf.html Example: H_DESCRIPTORS=1 Rayson
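The execd_params change Rayson points to goes into the global configuration; the values below simply mirror the limits.conf numbers quoted elsewhere in these threads, as an example:

    qconf -mconf
    # execd_params  ENABLE_ADDGRP_KILL=TRUE,S_DESCRIPTORS=4096,H_DESCRIPTORS=10240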

Re: [gridengine users] Queues using less than slots=

2012-11-08 Thread Joseph Farran
Thanks Reuti. That was the mystery of why it looked like some queues were using fewer cores. Best, Joseph On 11/07/2012 11:38 PM, Reuti wrote: This shows the location of the master queue for this job only, not its allocation inside the cluster, which depends on the defined allocation_rule in

[gridengine users] Queues using less than slots=

2012-11-07 Thread Joseph Farran
Hi. I am using SGE 8.1.2 with several queues and recently, several of my 64-slot queues are not scheduling the full 64 cores. So if I submit 64 1-core jobs, only 57 or so are scheduled per node instead of 64. If I submit 4 16-core pe jobs, only 3 of

Re: [gridengine users] Queues using less than slots=

2012-11-07 Thread Joseph Farran
this? On 11/7/2012 9:25 PM, Joseph Farran wrote: Hi. I am using SGE 8.1.2 with several queues and recently, several of my 64-slot queues are not scheduling the full 64 cores. So if I submit 64 1-core jobs, only 57 or so are scheduled per node instead of 64. If I submit 4 16-core pe jobs

[gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Joseph Farran
Hi all. I googled this issue but did not see much help on the subject. I have several queues with hard wall clock limits like this one: # qconf -sq queue | grep h_rt h_rt 96:00:00 I am running Son of Grid Engine 8.1.2 and many jobs run past the hard wall clock limit and

Re: [gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Joseph Farran
killed when they go past their wall clock time. How can I investigate this further? On 10/30/2012 11:44 AM, Reuti wrote: Hi, Am 30.10.2012 um 19:31 schrieb Joseph Farran: I googled this issue but did not see much help on the subject. I have several queues with hard wall clock limits like

Re: [gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Joseph Farran
On 10/30/2012 12:07 PM, Reuti wrote: Am 30.10.2012 um 20:02 schrieb Joseph Farran: Hi Reuti. Yes, I had that already set: qconf -sconf|fgrep execd_params execd_params ENABLE_ADDGRP_KILL=TRUE What is strange is that 1 out of 10 jobs or so do get killed just fine when they go

Re: [gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Joseph Farran
for the h_rt and nothing either. On 10/30/2012 01:49 PM, Reuti wrote: Am 30.10.2012 um 20:18 schrieb Joseph Farran: Here is one case: qstat| egrep 12959|12960 12959 0.50500 dna.pmf_17 amentes r 10/24/2012 18:59:12 free2@compute-12-22.local 1 12960 0.50500 dna.pmf_17 amentes

Re: [gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Joseph Farran
Joseph Farran: Did not have loglevel set to log_info, so I updated it, restarted GE on the master and softstop and start on the compute node. I got a lot more log information now, but still no cigar: # cat /var/spool/ge/compute-12-22/messages | fgrep h_rt # Checked a few other compute nodes

Re: [gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Joseph Farran
No: # qconf -sq free2 | fgrep terminate terminate_method NONE On 10/30/2012 03:07 PM, Reuti wrote: Mmh, was the terminate method redefined in the queue configuration of the queue in question? Am 30.10.2012 um 23:04 schrieb Joseph Farran: No, still no cigar. # cat /var/spool/ge

Re: [gridengine users] Jobs are not being Terminated ( Job should have finished since )

2012-10-30 Thread Joseph Farran
correctly. Oh well, thanks Reuti. I will keep playing with this... On 10/30/2012 03:53 PM, Reuti wrote: Am 30.10.2012 um 23:45 schrieb Joseph Farran: No: # qconf -sq free2 | fgrep terminate terminate_method NONE Is the process still doing something serious or hanging somewhere in a loop

Re: [gridengine users] How to Tell the running Wall Clock of a Job?

2012-10-29 Thread Joseph Farran
08:55 schrieb Daniel Gruber: Am 26.10.2012 um 07:58 schrieb Joseph Farran: Howdy. One of my queues has a wall time hard limit of 4 days ( 96 hours ): # qconf -sq queue | grep h_rt h_rt 96:00:00 There is a job which has been running much longer than 4 days and I am not sure

Re: [gridengine users] How to Tell the running Wall Clock of a Job?

2012-10-29 Thread Joseph Farran
Ah I missed that. Yes we have awk version 3.1.5 and the readme says 3.1.6 or higher. We will be upgrading OS from SL 5.7 to 6.3 soon so that should fix this. Thanks, Joseph On 10/29/2012 11:11 AM, Reuti wrote: Am 29.10.2012 um 19:08 schrieb Joseph Farran: Thanks Reuti, but it does not work

[gridengine users] How to Tell the running Wall Clock of a Job?

2012-10-26 Thread Joseph Farran
Howdy. One of my queues has a wall time hard limit of 4 days ( 96 hours ): # qconf -sq queue | grep h_rt h_rt 96:00:00 There is a job which has been running much longer than 4 days and I am not sure how to get the hours the job has been

Re: [gridengine users] Functional Fair Share on a Queue?

2012-10-15 Thread Joseph Farran
this, I think: http://moo.nac.uci.edu/~hjm/BDUC_Pay_For_Priority.html If it is inaccurate, please let me know and I'll correct it. hjm On Sunday, October 14, 2012 01:42:38 AM Joseph Farran wrote: Hi All. I have a queue on our cluster with 1,000 cores that all users can use. I like to keep

Re: [gridengine users] Functional Fair Share on a Queue?

2012-10-15 Thread Joseph Farran
Syntax question on the limit. In order to place a limit of, say, 333 cores per user on queue free, is the syntax: limit users * queues free to slots=333 Correct? On 10/15/2012 01:32 PM, Joseph Farran wrote: Hi Harry. Thanks. I understand the general fair share methods available
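One detail worth noting about that syntax: without braces the 333 slots would be shared by all matching users together, whereas a per-user limit is normally written with braces, i.e. something like:

    limit users {*} queues free to slots=333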

Re: [gridengine users] Cleaning up Run-away jobs on nodes

2012-09-20 Thread Joseph Farran
Thanks William, Reuti and Dave. I will try the pointers made here. Joseph On 09/20/2012 02:13 AM, Reuti wrote: Am 20.09.2012 um 02:08 schrieb Joseph Farran: What is the recommended way and/or do scripts exists for cleaning up once a job completes/dies/crashes on a node? I would prefer

Re: [gridengine users] failing mem_free request

2012-09-17 Thread Joseph Farran
Dave, I am having the same/similar issues as Brian's, but with 8.1.2. But for me, it's even worse. There are only two resources I can request, which are mem_total and swap_total. All others fail. $ qrsh -l mem_total=1M Last login: Mon Sep 10 22:02:39 2012 from login-1-1.local

Re: [gridengine users] Son of Grid Engine 8.1.2 available

2012-09-13 Thread Joseph Farran
Hi Brian. Cool and thank you for pointing this out and the fix. Being so new to GE and after 20+ posts on this issue, I thought it was something wrong in my GE configuration! Glad to hear it was not me :-) Best, Joseph

Re: [gridengine users] [PATCH] Simple memory cgroup functionality

2012-09-11 Thread Joseph Farran
Mark, Thanks! I just upgraded to 8.1.2. Will these patches work with 8.1.2 or were they intended only for 8.1.1? Joseph On 09/10/2012 07:45 AM, Mark Dixon wrote: Hi, Way back in May I promised this list a simple integration of gridengine with the cgroup functionality found in

[gridengine users] s_rt / h_rt Limits with Informative Messages?

2012-09-11 Thread Joseph Farran
Hi All. Is there a way ( hopefully an easy way ) to have Grid Engine give an informative message when a job has gone past a limit and is killed, like when a job goes over the wall time limit. When I get an email from Grid Engine where a job has gone past its wall time limit, it is not very

Re: [gridengine users] s_rt / h_rt Limits with Informative Messages?

2012-09-11 Thread Joseph Farran
Thanks Reuti. I think this sends an additional email, correct? Any easy way to append or check for -m bea in case the user does not want the email? Joseph On 09/11/2012 11:21 AM, Reuti wrote: Hi, Am 11.09.2012 um 19:10 schrieb Joseph Farran: Is there a way ( hopefully easy way ) to have

Re: [gridengine users] Son of Grid Engine 8.1.2 available

2012-08-31 Thread Joseph Farran
On 8/31/2012 6:58 AM, Dave Love wrote: In the absence of any knowledge about that cluster, that doesn't confirm that it's reported for the specific hosts that the scheduler complained about, just that it's reported for some. Look explicitly at the load parameters from one of the hosts in

Re: [gridengine users] Son of Grid Engine 8.1.2 available

2012-08-30 Thread Joseph Farran
On 08/28/2012 07:37 PM, Joseph Farran wrote: Hi Reuti. Here it is with the additional info: $ qrsh -w v -q bio -l mem_free=190G Job 1637 (-l h_rt=604800,mem_free=190G) cannot run in queue bio@compute-2-7.local because job requests unknown resource (mem_free) Job 1637 (-l h_rt=604800,mem_free=190G

[gridengine users] Requesting mem_free

2012-08-30 Thread Joseph Farran
Hi. I am trying to request nodes with a certain mem_free value and I am not sure what is missing in my configuration, since this does not work. My test nodes in my space1 queue have: $ qstat -F -q space1 | grep mem_free hl:mem_free=6.447G hl:mem_free=7.237G
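A first thing to check when a -l request is rejected is whether the complex is marked requestable and whether the host actually reports a value (a sketch; later messages in these threads suggest the real cause here was in 8.1.1 itself rather than the site configuration):

    qconf -sc | grep mem_free            # the "requestable" column should be YES
    qhost -F mem_free -h compute-2-7     # host name illustrative; confirms a load value is reported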

Re: [gridengine users] Requesting mem_free

2012-08-30 Thread Joseph Farran
Hi Mazouzi. I still get the same issue. With no mem_free request, all works ok: $ qrsh -q space1 -l mem_free=1G error: no suitable queues $ qrsh -q space1 Last login: Wed Aug 29 14:31:07 2012 from login-1-1.local Rocks Compute Node Rocks 5.4.3 (Viper) Profile built 14:11 07-May-2012

Re: [gridengine users] Son of Grid Engine 8.1.2 available

2012-08-30 Thread Joseph Farran
On 08/30/2012 02:22 PM, Dave Love wrote: That doesn't actually demonstrate that it's on the relevant nodes (e.g. qconf -se), though I'll believe it is. The -w v messages suggest that there's no load report from those nodes. What OS is this, and what load values are actually reported by one of

Re: [gridengine users] Son of Grid Engine 8.1.2 available

2012-08-28 Thread Joseph Farran
Thanks Dave. We just discovered that we cannot request nodes with -l mem_free=xxx. We are on 8.1.1. Does this new release fix this? Joseph On 08/28/2012 09:57 AM, Dave Love wrote: SGE 8.1.2 is available from http://arc.liv.ac.uk/downloads/SGE/releases/8.1.2/. It is a large superset of the

Re: [gridengine users] Son of Grid Engine 8.1.2 available

2012-08-28 Thread Joseph Farran
I don't use it, but one of our users used it successfully before we moved to GE 8.1.1. # qstat -q bio -F mem_free|fgrep mem hl:mem_free=498.198G hl:mem_free=498.528G hl:mem_free=499.143G hl:mem_free=498.959G hl:mem_free=499.198G $ qrsh -q bio

Re: [gridengine users] Son of Grid Engine 8.1.2 available

2012-08-28 Thread Joseph Farran
Hi Reuti. Here it is with the additional info: $ qrsh -w v -q bio -l mem_free=190G Job 1637 (-l h_rt=604800,mem_free=190G) cannot run in queue bio@compute-2-7.local because job requests unknown resource (mem_free) Job 1637 (-l h_rt=604800,mem_free=190G) cannot run in queue

Re: [gridengine users] GPU node with pe and complex

2012-08-23 Thread Joseph Farran
Thanks William. Setting the consumable to JOB did the trick! Best, Joseph On 08/23/2012 12:32 AM, William Hay wrote: On 22 August 2012 23:53, Joseph Farran jfar...@uci.edu wrote: You have consumable set to YES, which means the request is multiplied by the number of slots you request (64), so you
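The fix William describes corresponds to the JOB value in the consumable column of the complex definition, so the request is counted once per job instead of once per slot; a hedged sketch with an illustrative GPU attribute:

    # qconf -mc  -- e.g. for a GPU count that should not be multiplied by the slot count:
    #   gpu   gpu   INT   <=   YES   JOB   0   0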

[gridengine users] sge 8.1.1 and sge_shepherd running at 100%

2012-08-23 Thread Joseph Farran
Hi Dave. Any updates on when the bug that causes sge_shepherd to run at 100% when one uses qrsh is going to be fixed for sge 8.1.1? I just tested it using qrsh and the bug is there. Joseph
