Re: [gridengine users] SoGE file descriptor limit and MAX_DYN_EC

2020-02-20 Thread Daniel Povey
That's a GridEngine bug whereby the event client ids or whatever they are called don't properly get cleaned up. The workaround is to restart the qmaster when that happens. Be careful, sometimes restarting the service doesn't work and you may need to kill the process. At the cluster I used to
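The workaround described above (restart the qmaster, and kill the process if the service restart does not take effect) could be scripted roughly as follows. This is a sketch under assumptions, not a tested procedure: the unit name `gridengine-qmaster`, the process name `sge_qmaster`, the 5-second pause and the helper name `restart_qmaster` are all illustrative. Setting `DRY_RUN=1` only prints the commands.

```shell
# restart_qmaster [UNIT]: stop the service, signal the daemon directly if it
# is still alive, then start the service again.
# Assumptions: unit "gridengine-qmaster", process "sge_qmaster", 5 s pause.
restart_qmaster() {
    unit="${1:-gridengine-qmaster}"
    proc="sge_qmaster"
    run=""
    if [ -n "${DRY_RUN:-}" ]; then run="echo"; fi

    $run systemctl stop "$unit"
    $run sleep 5
    # If the daemon survived the stop request, send it SIGTERM directly.
    if pgrep -x "$proc" >/dev/null 2>&1; then
        $run pkill -x "$proc"
    fi
    $run systemctl start "$unit"
}
```

Running `DRY_RUN=1 restart_qmaster` first shows what would be executed before anything is touched.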

Re: [gridengine users] qsh not working

2019-11-20 Thread Daniel Povey
The accidental cc! On Wed, Nov 20, 2019 at 9:22 PM Friedrich Ferstl wrote: > This person was at the Smithsonian until very recently and shows up in > our support records but he is now CFA/Harvard, apparently, using open > source. > > > On 19.11.2019 at 23:41, Korzennik, Sylvain < >

Re: [gridengine users] issue compiling SoGE on Debian 10.1

2019-10-30 Thread Daniel Povey
That looks like it's by design... he was signing the builds using his secret key. You'd have to figure out where he configured that and either insert your own details or turn off signing (if that's allowed). On Wed, Oct 30, 2019 at 10:35 AM Jerome wrote: > Dear all > > I'm trying to compile

Re: [gridengine users] What is the easiest/best way to update our servers' domain name?

2019-10-28 Thread Daniel Povey
ober 25, 2019 5:42 PM > *To:* dpo...@gmail.com > *Cc:* Skylar Thompson ; users@gridengine.org > *Subject:* RE: [gridengine users] What is the easiest/best way to update > our servers' domain name? > > > > Hi Daniel, > > > > Thank you for your reply. > > >

Re: [gridengine users] What is the easiest/best way to update our servers' domain name?

2019-10-25 Thread Daniel Povey
that is an adequate solution > or one that will cause problems for me. I'm also not sure if that is the > best approach to take for this task. > > Thanks, > > -- > Mun > > > > > > On Fri, Oct 25, 2019 at 04:12:11PM -0700, Daniel Povey wrote: > > >

Re: [gridengine users] What is the easiest/best way to update our servers' domain name?

2019-10-25 Thread Daniel Povey
IIRC, GridEngine is very picky about machines having a consistent hostname, e.g. that what hostname they think they have matches with how they were addressed. I think this is because of SunRPC. I think it may be hard to do what you want without an interruption of some kind. But I may be wrong.

Re: [gridengine users] SGE rolling over MAX_SEQNUM, peculiar things happened

2019-10-18 Thread Daniel Povey
Normally restarting the qmaster (e.g. systemctl restart gridengine-qmaster) should be a very routine and harmless operation that should be invisible to users except for a temporary inaccessibility of `qstat`. On Fri, Oct 18, 2019 at 8:35 AM WALLIS Michael wrote: > Hi folks, > > Our instance of

Re: [gridengine users] limit CPU/slot resource to the number of reserved slots

2019-08-26 Thread Daniel Povey
I don't think it's supported in Son of GridEngine. Ondrej Valousek (cc'd) described in the first thread here http://arc.liv.ac.uk/pipermail/sge-discuss/2019-August/thread.html how he was able to implement it, but it required code changes, i.e. you would need to figure out how to build and install

Re: [gridengine users] Different ulimit settings given by different compute nodes with the exactly same /etc/security/limits.conf

2019-07-02 Thread Daniel Povey
Could it relate to when the daemons were started on those nodes? I'm not sure exactly at what point those limits are applied, and how they are inherited by child processes. If you changed those files recently it might not have taken effect. On Tue, Jul 2, 2019 at 10:36 PM Derrick Lin wrote: >
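One way to test that guess on Linux is to read the limits a running daemon actually has from `/proc/<pid>/limits` and compare them with a fresh shell. The helper below is a minimal sketch; the name `soft_limit` is hypothetical, and the usage lines assume the execution daemon runs as `sge_execd`.

```shell
# soft_limit FILE NAME: print the soft value of the limit called NAME from a
# file in the /proc/<pid>/limits format (Linux only). Hypothetical helper.
soft_limit() {
    # Strip the limit name and its padding; the first remaining field is
    # the soft limit, the second the hard limit.
    sed -n "s/^$2  *//p" "$1" | awk '{ print $1; exit }'
}

# Usage, assuming the daemon process is named sge_execd:
#   soft_limit "/proc/$(pgrep -o -x sge_execd)/limits" "Max open files"
#   ulimit -Sn    # the same limit in the current shell, for comparison
```

If the two values differ, the daemon was probably started before the limits file was changed, or through a path that does not apply it.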

Re: [gridengine users] scripting help with per user job submit restrictions

2019-06-13 Thread Daniel Povey
Sorry, but I think this conversation shouldn't continue. This list is for system administrators, not for users with basic questions about bash. People will unsubscribe if it goes on much longer. On Thu, Jun 13, 2019 at 2:49 PM VG wrote: > Hi Feng, > I did something like this > > for i in * >

Re: [gridengine users] scripting help with per user job submit restrictions

2019-06-13 Thread Daniel Povey
Daniel Povey wrote: > > for i in *tar.gz; > do > while true; do > if [ $(qstat -u $USER | wc -l) -lt 900 ]; then break; fi; > sleep 60; >done >qsub -l h_vmem=4G -cwd -j y -b y -N tar -R y -q all.q,gpu.q "tar -xzf > $i" > done > > On

Re: [gridengine users] scripting help with per user job submit restrictions

2019-06-13 Thread Daniel Povey
for i in *tar.gz; do while true; do if [ $(qstat -u $USER | wc -l) -lt 900 ]; then break; fi; sleep 60; done; qsub -l h_vmem=4G -cwd -j y -b y -N tar -R y -q all.q,gpu.q "tar -xzf $i"; done On Thu, Jun 13, 2019 at 12:39 PM Skylar Thompson wrote: > We've used resource quota
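The same throttling loop can be split into two small functions, which makes the job limit a parameter and keeps the header lines out of the count. This is a sketch, not the poster's script: the names `job_count` and `submit_throttled` are hypothetical, the `qsub` options are copied from the message above, and it assumes that plain `qstat` output starts with two header lines.

```shell
# job_count: number of jobs qstat lists for the current user.
# Assumes the listing starts with two header lines, which are skipped.
job_count() {
    qstat -u "${USER:-$(id -un)}" | awk 'NR > 2 { n++ } END { print n + 0 }'
}

# submit_throttled MAX FILE...: submit one extraction job per archive,
# sleeping while MAX or more jobs are already listed.
submit_throttled() {
    max="$1"; shift
    for f in "$@"; do
        while [ "$(job_count)" -ge "$max" ]; do
            sleep 60
        done
        qsub -l h_vmem=4G -cwd -j y -b y -N tar -R y -q all.q,gpu.q "tar -xzf $f"
    done
}

# Usage:
#   submit_throttled 900 *tar.gz
```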

Re: [gridengine users] jobs randomly die

2019-05-14 Thread Daniel Povey
I have observed apparently random failures when users had gid's in the range `gid_range` (see below; gid_range should be out of the range where users have gid's). But usually this kind of thing would be due to OOM. qconf -sconf | grep gid_range gid_range5-51000 On Tue,

Re: [gridengine users] Limiting each user's slots across all nodes

2019-03-12 Thread Daniel Povey
When I see weird things like this (and it happens), my reaction is usually, "It's probably a bug somewhere deep in the code. Just change something about your setup to make it go away". In future I hope to switch to slurm. It doesn't have great architecture but I think it's better maintained, and

Re: [gridengine users] Different GDI version between client and qmaster

2019-02-26 Thread Daniel Povey
presumably a combination of newer and older packages. On Tue, Feb 26, 2019 at 10:02 PM Radhouane Aniba wrote: > Hi everyone > > I am trying to run a python code on SGE but I am running through this issue > > *drmaa.errors.DrmCommunicationException: code 2: denied: client (xxx) uses > old GDI

Re: [gridengine users] Grid Engine Sluggish

2019-01-26 Thread Daniel Povey
500k without seeing this kind of slowdown. > > Joseph > > > On 1/26/2019 10:16 AM, Daniel Povey wrote: > > Check if there are any huge jobs in the queue. Sometimes very large task > ranges, or large numbers of jobs, can make it slow. > > > > On Sat, Jan 26

Re: [gridengine users] Grid Engine Sluggish

2019-01-26 Thread Daniel Povey
Check if there are any huge jobs in the queue. Sometimes very large task ranges, or large numbers of jobs, can make it slow. On Sat, Jan 26, 2019 at 7:05 AM Reuti wrote: > Hi, > > > On 26.01.2019 at 10:20, Joseph Farran wrote: > > > > Hi. > > Our Grid Engine is running very sluggish all of a

Re: [gridengine users] Dilemma with exec node reponsiveness degrading

2019-01-17 Thread Daniel Povey
It seems to me that this likely isn't closely related to GridEngine itself, but more about something going on on that node. You'd have to debug by looking at 'top' output, system logs, iostat, ifstat, to see if it's about heavy usage by some existing job, or some kind of kernel hang. But it's

Re: [gridengine users] Processes not exiting

2018-11-15 Thread Daniel Povey
Make sure the gid_range is set to a range in which none of your system's users have group-ids. Otherwise it will kill the wrong things. On Thu, Nov 15, 2018 at 6:10 PM wrote: > Hay, William wrote on 11/14/18 04:21: > > Do you have ENABLE_ADDGRP_KILL set? Can be helpful in killing processes >

[gridengine users] Alternatives to Son of GridEngine

2018-11-12 Thread Daniel Povey
Everyone, I'm trying to understand the landscape of alternatives to Son of GridEngine, since the maintenance situation isn't great right now and I'm not sure that it has a long term future. If you guys were to switch to something in the same universe of products, what would it be to? Univa

Re: [gridengine users] C|!!!!!!!!!! got NULL element for EH_name !!!!!!!!!!

2018-11-10 Thread Daniel Povey
Sorry, what I wrote was confusing due to an errant paste. Edited below. On Sat, Nov 10, 2018 at 5:03 PM Daniel Povey wrote: > I was able to fix it, although I suspect that my fix may have been > disruptive to the jobs. > > Firstly, I believe the problem was that gridengine doe

Re: [gridengine users] C|!!!!!!!!!! got NULL element for EH_name !!!!!!!!!!

2018-11-10 Thread Daniel Povey
ing restart to try to see at which point it is dying > (e.g., > loading a config file) > 2) look for any reference to the name of the host you deleted in the spool > area and do some cleanup > 3) clean out the jobs spool area > > HTH, > John > > On Sat, 2018-11-10 at 16:23

[gridengine users] C|!!!!!!!!!! got NULL element for EH_name !!!!!!!!!!

2018-11-10 Thread Daniel Povey
Has anyone found this error, and managed to fix it? I am in a very difficult situation. I deleted a host (qconf -de hostname) thinking that the machine no longer existed, but it did exist, and there was a job in 'dr' state there. After I attempted to force-delete that job (qdel -f job-id), the

Re: [gridengine users] sge_execd dies

2018-11-08 Thread Daniel Povey
Joseph Farran wrote: > Hi Dan. > > Thank you for the suggestion. Here is what I have: > > # qconf -sconf | grep gid_range > gid_range200-70 > > The highest gid is 3135. > Best, > Joseph > > On 11/8/2018 8:58 PM, Daniel Povey wrote: >

Re: [gridengine users] sge_execd dies

2018-11-08 Thread Daniel Povey
Do qconf -sconf | grep gid_range and check whether any of your users have group id's in that range. That can lead to things being killed. Dan On Thu, Nov 8, 2018 at 10:33 PM Joseph Farran wrote: > Greetings. > > I am running SGE 8.1.9 on a cluster with some 10k cores, CentOS 6.9. > > I am
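The check suggested above can be done mechanically by filtering the group database against the configured range. A minimal sketch, assuming group(5)-format input (`name:passwd:gid:members`); the helper name `gids_in_range` and the example range are illustrative only.

```shell
# gids_in_range LOW HIGH: read group(5)-format lines on stdin and print
# "name:gid" for every group whose gid lies within LOW..HIGH inclusive.
gids_in_range() {
    awk -F: -v lo="$1" -v hi="$2" \
        '($3 + 0) >= (lo + 0) && ($3 + 0) <= (hi + 0) { print $1 ":" $3 }'
}

# Usage (20000-20100 is only an example; take the real values from
# `qconf -sconf | grep gid_range`):
#   getent group | gids_in_range 20000 20100
```

Any output means an existing group overlaps the range, so either the group or `gid_range` would need to move.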

Re: [gridengine users] Dave Love repository issue

2018-10-12 Thread Daniel Povey
There is an issue tracker here https://arc.liv.ac.uk/trac but it's not clear whether Dave Love still has access to it (he moved to Manchester and for a while at least he did not have access; and he doesn't seem to have been working on GridEngine lately anyway). Also I couldn't figure out where in

Re: [gridengine users] Execution node no host information

2018-09-17 Thread Daniel Povey
Setting the hostname may depend slightly on the linux flavor... I'd start with editing /etc/hostname and /etc/hosts, and then if systemctl is installed, use `hostnamectl set-hostname your_hostname`. On Mon, Sep 17, 2018 at 9:24 AM linux khbuex wrote: > Also > on client, gethostname returns

Re: [gridengine users] cpu usage calculation

2018-08-31 Thread Daniel Povey
This gets back to the issue of who is going to maintain GridEngine. Dave Love briefly resurfaced (enough to dissuade me from forming a group to maintain it, we were going to make this its home https://github.com/son-of-gridengine/sge) but seems to have gone under again. And actually I'm not sure

Re: [gridengine users] different results on terminal and on submission via qsub

2018-07-09 Thread Daniel Povey
he other double quotes > are the submission quotes to the qsub command. > > Regards > Varun > > > On Mon, Jul 9, 2018 at 1:30 PM, Daniel Povey wrote: > >> You should put that in a bash script and invoke it by name. >> >> It's really a miracle that it gave you

Re: [gridengine users] different results on terminal and on submission via qsub

2018-07-09 Thread Daniel Povey
You should put that in a bash script and invoke it by name. It's really a miracle that it gave you anything even remotely close to what you intended when run like that, because of how bash interprets double-quotes and variables. Inside double quotes it will expand variables with their values at
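The quoting behaviour is easy to see in isolation. In this small illustration (the variable names are made up), the double-quoted string has its variable replaced when the line is read, while the single-quoted string keeps the text literally for whichever shell runs it later.

```shell
name="submit-host"

# Double quotes: $name is replaced by its value immediately.
expanded="echo $name"

# Single quotes: the characters are kept as-is, so $name would only be
# looked up later, by the shell that finally executes the string.
literal='echo $name'

printf '%s\n' "$expanded"
printf '%s\n' "$literal"
```

Putting the commands in a script file and submitting the file by name avoids the issue, because nothing inside the file is expanded at submission time.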

Re: [gridengine users] Possible opportunity for development work

2018-05-14 Thread Daniel Povey
<j...@salilab.org> wrote: > On Sun, 13 May 2018 at 8:49pm, Daniel Povey wrote > >> Can you show the full output from when you do `qstat -j ` for >> the job that's pending? > > > Unfortunately I had to change our setup so that GPU jobs would actually flow > through the queues -

Re: [gridengine users] Possible opportunity for development work

2018-05-13 Thread Daniel Povey
oops my bad, looks like it means 'job granted'. sorry for the spam. On Mon, May 14, 2018 at 12:17 AM, Daniel Povey <dpo...@gmail.com> wrote: > And an interesting tidbit from the house of horrors that is SGE code: > > A bunch of variable names in sge_follow.c, have JG in them, e.

Re: [gridengine users] Possible opportunity for development work

2018-05-13 Thread Daniel Povey
get it past me. But I guess we don't get to choose when it's legacy code. Dan On Mon, May 14, 2018 at 12:04 AM, Daniel Povey <dpo...@gmail.com> wrote: > Can you please create an issue for this at the new github location? > https://github.com/son-of-gridengine/sge/issues >

Re: [gridengine users] Possible opportunity for development work

2018-05-13 Thread Daniel Povey
time understanding the code. It seems to have been written before the days when documentation was expected in projects like this. On Sun, May 13, 2018 at 11:49 PM, Daniel Povey <dpo...@gmail.com> wrote: > Can you show the full output from when you do `qstat -j ` for > the job th

Re: [gridengine users] Possible opportunity for development work

2018-05-13 Thread Daniel Povey
Can you show the full output from when you do `qstat -j ` for the job that's pending? On Sun, May 13, 2018 at 11:39 PM, Joshua Baker-LePain wrote: > As I've mentioned on this list a few times, we are running SoGE 8.1.9 on a > small (but growing) cluster here. With the

Re: [gridengine users] Son of GridEngine succession?

2018-05-13 Thread Daniel Povey
, please let me know your github userid, and please 'watch' the project. My own experience of GridEngine is mainly as a user, not a maintainer, so we need qualified people! Dan On Sat, May 12, 2018 at 6:28 PM, Daniel Povey <dpo...@gmail.com> wrote: > Thanks for your responses to the

Re: [gridengine users] Son of GridEngine succession?

2018-05-12 Thread Daniel Povey
is anyone in that old team that would be upset if the name is re-used. Dan On Sat, May 12, 2018 at 5:01 PM, Daniel Povey <dpo...@gmail.com> wrote: > I've created a Doodle poll with the options named so far. > > https://doodle.com/poll/vcr2nasruzkg4r4g > > This is just s

Re: [gridengine users] Son of GridEngine succession?

2018-05-12 Thread Daniel Povey
, Daniel Povey <dpo...@gmail.com> wrote: > Thanks for the naming ideas! > > Since many of these ideas are family-related I had a look for synonyms > for 'offspring' that begin with s (so that we have no difficulty > explaining why the binaries still have 'sge' in the name). >

Re: [gridengine users] Son of GridEngine succession?

2018-05-11 Thread Daniel Povey
him. However, if most others want to keep the same name, I'd be OK with it. Dan On Fri, May 11, 2018 at 7:08 PM, Jerome <jer...@ibt.unam.mx> wrote: > On 11/05/2018 at 18:02, Christopher Heiny wrote: >> On Fri, 2018-05-11 at 18:49 -0400, Daniel Povey wrote: >>

[gridengine users] Son of GridEngine succession?

2018-05-11 Thread Daniel Povey
Everyone, I want to start a discussion about how to replace Son of GridEngine. As far as I can tell, Dave Love has had no online activity for a year, is not responding to emails, and my attempts to contact him indirectly via his workplace have come to nothing. Even if he is still alive, I think

[gridengine users] Futex leap-second bug for GridEngine?

2012-07-13 Thread Daniel Povey
Has anyone noticed their sge_execd processes suddenly taking up a lot of CPU, possibly since around July 2nd this year? I think it might be to do with the Linux leap second bug, which affects processes that use futexes. It doesn't happen to all nodes on a queue, just some. The only way I know to

[gridengine users] GridEngine and soft user limits not being respected.

2012-06-23 Thread Daniel Povey
We have a problem in our queue that GridEngine is not respecting user limits specified in /etc/security/limits.conf We have in that file * softas 2000 and nothing else, and when we log into a machine (Debian Linux) using ssh or qrsh the limit will appear, so if

[gridengine users] error with qsub -sync y

2011-10-08 Thread Daniel Povey
I have been getting occasional errors when using qsub -sync y. It prints out the error message: Unable to initialize environment because of error: range_list containes no elements Exiting. This is not reproducible, but seems to occur in batches. This is with GE 6.2R5. Looking for this online