That's a GridEngine bug whereby the event client IDs (or whatever they are
called) don't get cleaned up properly.
The workaround is to restart the qmaster when that happens.
Be careful, sometimes restarting the service doesn't work and you may need
to kill the process.
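A minimal recovery sketch (the unit name gridengine-qmaster and process name sge_qmaster are typical but may differ on your install):

```shell
# Sketch only: adjust unit/process names for your installation.
restart_qmaster() {
  # Try a clean restart first.
  systemctl restart gridengine-qmaster && return 0
  # If that fails or hangs, kill the process and start fresh.
  pkill -TERM -x sge_qmaster
  sleep 10
  pkill -KILL -x sge_qmaster   # last resort
  systemctl start gridengine-qmaster
}
```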
At the cluster I used to
The accidental cc!
On Wed, Nov 20, 2019 at 9:22 PM Friedrich Ferstl wrote:
> This person was at the Smithsonian until very recently and shows up in
> our support records but he is now CFA/Harvard, apparently, using open
> source.
>
>
> Am 19.11.2019 um 23:41 schrieb Korzennik, Sylvain <
>
That looks like it's by design... he was signing the builds using his
secret key. You'd have to figure out where he configured that and either
insert your own details or turn off signing (if that's allowed).
On Wed, Oct 30, 2019 at 10:35 AM Jerome wrote:
> Dear all
>
> I've trying to compile
> October 25, 2019 5:42 PM
> *To:* dpo...@gmail.com
> *Cc:* Skylar Thompson ; users@gridengine.org
> *Subject:* RE: [gridengine users] What is the easiest/best way to update
> our servers' domain name?
>
>
>
> Hi Daniel,
>
>
>
> Thank you for your reply.
>
>
>
that is an adequate solution
> or one that will cause problems for me. I'm also not sure if that is the
> best approach to take for this task.
>
> Thanks,
>
> --
> Mun
>
>
> >
> > On Fri, Oct 25, 2019 at 04:12:11PM -0700, Daniel Povey wrote:
> > >
IIRC, GridEngine is very picky about machines having a consistent hostname,
i.e. the hostname a machine thinks it has must match how it is addressed.
I think this is because of SunRPC. It may be hard to do what you want
without an interruption of some kind, but I may be wrong.
Normally restarting the qmaster (e.g. systemctl restart gridengine-qmaster)
should be a very routine and harmless operation that should be invisible to
users except for a temporary inaccessibility of `qstat`.
On Fri, Oct 18, 2019 at 8:35 AM WALLIS Michael wrote:
> Hi folks,
>
> Our instance of
I don't think it's supported in Son of GridEngine. Ondrej Valousek (cc'd)
described in the first thread here
http://arc.liv.ac.uk/pipermail/sge-discuss/2019-August/thread.html
how he was able to implement it, but it required code changes, i.e. you
would need to figure out how to build and install
Could it relate to when the daemons were started on those nodes? I'm not
sure exactly at what point those limits are applied, and how they are
inherited by child processes. If you changed those files recently it might
not have taken effect.
On Tue, Jul 2, 2019 at 10:36 PM Derrick Lin wrote:
>
Sorry, but I think this conversation shouldn't continue.
This list is for system administrators, not for users with basic questions
about bash. People will unsubscribe if it goes on much longer.
On Thu, Jun 13, 2019 at 2:49 PM VG wrote:
> Hi Feng,
> I did something like this
>
> for i in *
>
Daniel Povey wrote:
>
> for i in *tar.gz; do
>   while true; do
>     if [ $(qstat -u $USER | wc -l) -lt 900 ]; then break; fi
>     sleep 60
>   done
>   qsub -l h_vmem=4G -cwd -j y -b y -N tar -R y -q all.q,gpu.q "tar -xzf $i"
> done
>
> On
for i in *tar.gz; do
  # Wait until qstat shows fewer than 900 lines for this user.
  while true; do
    if [ "$(qstat -u "$USER" | wc -l)" -lt 900 ]; then break; fi
    sleep 60
  done
  qsub -l h_vmem=4G -cwd -j y -b y -N tar -R y -q all.q,gpu.q "tar -xzf $i"
done
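One caveat with the loop above: `qstat -u $USER | wc -l` also counts qstat's two header lines, so the real job count sits slightly below the threshold. A sketch that counts only job lines:

```shell
# Count only job lines, skipping qstat's two-line header.
count_jobs() {
  qstat -u "$USER" 2>/dev/null | tail -n +3 | wc -l
}
```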
On Thu, Jun 13, 2019 at 12:39 PM Skylar Thompson wrote:
> We've used resource quota
I have observed apparently random failures when users had GIDs inside the
range `gid_range` (see below; gid_range should lie outside the range in
which users have GIDs).
But usually this kind of thing would be due to OOM.
qconf -sconf | grep gid_range
gid_range                    5-51000
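A quick sketch for auditing this: scan the passwd database for GIDs that fall inside gid_range (the bounds are arguments here, since yours will come from `qconf -sconf`):

```shell
# Print any user whose primary GID falls inside [LO, HI].
# Usage: check_gid_range LO HI < /etc/passwd
check_gid_range() {
  local lo=$1 hi=$2 user _ gid rest
  while IFS=: read -r user _ _ gid rest; do
    if [ "$gid" -ge "$lo" ] && [ "$gid" -le "$hi" ]; then
      echo "$user has gid $gid inside gid_range"
    fi
  done
}
```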
On Tue,
When I see weird things like this (and it happens), my reaction is usually,
"It's probably a bug somewhere deep in the code. Just change something
about your setup to make it go away".
In future I hope to switch to slurm. It doesn't have great architecture
but I think it's better maintained, and
presumably a combination of newer and older packages.
On Tue, Feb 26, 2019 at 10:02 PM Radhouane Aniba wrote:
> Hi everyone
>
> I am trying to run a python code on SGE but I am running through this issue
>
> *drmaa.errors.DrmCommunicationException: code 2: denied: client (xxx) uses
> old GDI
500k without seeing this kind of slowdown.
>
> Joseph
>
>
> On 1/26/2019 10:16 AM, Daniel Povey wrote:
> > Check if there are any huge jobs in the queue. Sometimes very large task
> ranges, or large numbers of jobs, can make it slow.
> >
> > On Sat, Jan 26
Check if there are any huge jobs in the queue. Sometimes very large task
ranges, or large numbers of jobs, can make it slow.
On Sat, Jan 26, 2019 at 7:05 AM Reuti wrote:
> Hi,
>
> > Am 26.01.2019 um 10:20 schrieb Joseph Farran :
> >
> > Hi.
> > Our Grid Engine is running very sluggish all of a
It seems to me that this likely isn't closely related to GridEngine itself,
but more about something going on on that node. You'd have to debug by
looking at 'top' output, system logs, iostat, ifstat, to see if it's about
heavy usage by some existing job, or some kind of kernel hang. But it's
Make sure the gid_range is set to a range in which none of your system's
users have group-ids. Otherwise it will kill the wrong things.
On Thu, Nov 15, 2018 at 6:10 PM wrote:
> Hay, William wrote on 11/14/18 04:21:
> > Do you have ENABLE_ADDGRP_KILL set? Can be helpful in killing processes
>
Everyone,
I'm trying to understand the landscape of alternatives to Son of
GridEngine, since the maintenance situation isn't great right now and I'm
not sure that it has a long term future.
If you guys were to switch to something in the same universe of products,
what would it be to? Univa
Sorry, what I wrote was confusing due to an errant paste. Edited below.
On Sat, Nov 10, 2018 at 5:03 PM Daniel Povey wrote:
> I was able to fix it, although I suspect that my fix may have been
> disruptive to the jobs.
>
> Firstly, I believe the problem was that gridengine doe
ing restart to try to see at which point it is dying
> (e.g.,
> loading a config file)
> 2) look for any reference to the name of the host you deleted in the spool
> area and do some cleanup
> 3) clean out the jobs spool area
>
> HTH,
> John
>
> On Sat, 2018-11-10 at 16:23
Has anyone found this error, and managed to fix it?
I am in a very difficult situation.
I deleted a host (qconf -de hostname) thinking that the machine no longer
existed, but it did exist, and there was a job in 'dr' state there.
After I attempted to force-delete that job (qdel -f job-id), the
Joseph Farran wrote:
> Hi Dan.
>
> Thank you for the suggestion. Here is what I have:
>
> # qconf -sconf | grep gid_range
> gid_range                    200-70
>
> The highest gid is 3135.
> Best,
> Joseph
>
> On 11/8/2018 8:58 PM, Daniel Povey wrote:
>
Do
qconf -sconf | grep gid_range
and check whether any of your users have group id's in that range. That
can lead to things being killed.
Dan
On Thu, Nov 8, 2018 at 10:33 PM Joseph Farran wrote:
> Greetings.
>
> I am running SGE 8.1.9 on a cluster with some 10k cores, CentOS 6.9.
>
> I am
There is an issue tracker here
https://arc.liv.ac.uk/trac
but it's not clear whether Dave Love still has access to it (he moved to
Manchester and for a while at least he did not have access; and he doesn't
seem to have been working on GridEngine lately anyway). Also I couldn't
figure out where in
Setting the hostname may depend slightly on the Linux flavor... I'd start
with editing /etc/hostname and /etc/hosts, and then, if the system runs
systemd, use `hostnamectl set-hostname your_hostname`.
On Mon, Sep 17, 2018 at 9:24 AM linux khbuex wrote:
> Also
> on client, gethostname returns
This gets back to the issue of who is going to maintain GridEngine.
Dave Love briefly resurfaced (enough to dissuade me from forming a
group to maintain it, we were going to make this its home
https://github.com/son-of-gridengine/sge) but seems to have gone under
again. And actually I'm not sure
> The other double quotes
> are the submission quotes to the qsub command.
>
> Regards
> Varun
>
>
> On Mon, Jul 9, 2018 at 1:30 PM, Daniel Povey wrote:
>
>> You should put that in a bash script and invoke it by name.
>>
>> It's really a miracle that it gave you
You should put that in a bash script and invoke it by name.
It's really a miracle that it gave you anything even remotely close to what
you intended when run like that, because of how bash interprets
double-quotes and variables. Inside double quotes it will expand variables
with their values at
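A small demonstration of the expansion-timing issue (plain bash, nothing SGE-specific):

```shell
# Inside double quotes, variables are expanded when the string is built,
# not when the resulting command eventually runs.
msg="first"
cmd="echo $msg"   # $msg expanded here, at assignment time
msg="second"
eval "$cmd"       # still prints "first"
```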
<j...@salilab.org> wrote:
> On Sun, 13 May 2018 at 8:49pm, Daniel Povey wrote
>
>> Can you show the full output from when you do `qstat -j ` for
>> the job that's pending?
>
>
> Unfortunately I had to change our setup so that GPU jobs would actually flow
> through the queues -
oops my bad, looks like it means 'job granted'. sorry for the spam.
On Mon, May 14, 2018 at 12:17 AM, Daniel Povey <dpo...@gmail.com> wrote:
> And an interesting tidbit from the house of horrors that is SGE code:
>
> A bunch of variable names in sge_follow.c, have JG in them, e.
get it past me. But I guess we don't get to choose when
it's legacy code.
Dan
On Mon, May 14, 2018 at 12:04 AM, Daniel Povey <dpo...@gmail.com> wrote:
> Can you please create an issue for this at the new github location?
> https://github.com/son-of-gridengine/sge/issues
>
time understanding the code. It seems to have been
written before the days when documentation was expected in projects
like this.
On Sun, May 13, 2018 at 11:49 PM, Daniel Povey <dpo...@gmail.com> wrote:
> Can you show the full output from when you do `qstat -j ` for
> the job th
Can you show the full output from when you do `qstat -j ` for
the job that's pending?
On Sun, May 13, 2018 at 11:39 PM, Joshua Baker-LePain wrote:
> As I've mentioned on this list a few times, we are running SoGE 8.1.9 on a
> small (but growing) cluster here. With the
, please let me know your github userid, and
please 'watch' the project. My own experience of GridEngine is mainly
as a user, not a maintainer, so we need qualified people!
Dan
On Sat, May 12, 2018 at 6:28 PM, Daniel Povey <dpo...@gmail.com> wrote:
> Thanks for your responses to the
is anyone in that old team
that would be upset if the name is re-used.
Dan
On Sat, May 12, 2018 at 5:01 PM, Daniel Povey <dpo...@gmail.com> wrote:
> I've created a Doodle poll with the options named so far.
>
> https://doodle.com/poll/vcr2nasruzkg4r4g
>
> This is just s
, Daniel Povey <dpo...@gmail.com> wrote:
> Thanks for the naming ideas!
>
> Since many of these ideas are family-related I had a look for synonyms
> for 'offspring' that begin with s (so that we have no difficulty
> explaining why the binaries still have 'sge' in the name).
>
him. However,
if most others want to keep the same name, I'd be OK with it.
Dan
On Fri, May 11, 2018 at 7:08 PM, Jerome <jer...@ibt.unam.mx> wrote:
> Le 11/05/2018 à 18:02, Christopher Heiny a écrit :
>> On Fri, 2018-05-11 at 18:49 -0400, Daniel Povey wrote:
>>
Everyone,
I want to start a discussion about how to replace Son of GridEngine.
As far as I can tell, Dave Love has had no online activity for a year,
is not responding to emails, and my attempts to contact him indirectly
via his workplace have come to nothing. Even if he is still alive, I
think
Has anyone noticed their sge_execd processes suddenly taking up a lot of
CPU, possibly since around July 2nd this year?
I think it might be to do with the Linux leap second bug, which affects
processes that use futexes. It doesn't happen to all nodes on a queue,
just some.
The only way I know to
We have a problem in our queue: GridEngine is not respecting the user
limits specified in /etc/security/limits.conf
We have in that file
* soft as 2000
and nothing else,
and when we log into a machine (Debian Linux) using ssh or qrsh the limit
will appear, so if
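For debugging this kind of thing, it may help to print the soft address-space limit under both login paths (the `as` entry in limits.conf is in KB; one plausible cause, not confirmed here, is that sge_execd starts jobs outside a PAM session, so pam_limits never applies the file):

```shell
# Print the current soft address-space limit in KB ("unlimited" if unset).
# Run this under an ssh login and again under qrsh, and compare the values.
ulimit -S -v
```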
I have been getting occasional errors when using qsub -sync y. It prints
out the error message:
Unable to initialize environment because of error: range_list containes no
elements
Exiting.
This is not reproducible, but seems to occur in batches. This is with GE
6.2R5.
Looking for this online