e sort. Do you have a RQS of
> > > any kind (qconf -srqs)? We see this for job-requested, or system
> > > set RAM exhaustion (OOM killer, as mentioned 'dmesg -T' on
> > > compute nodes often useful), as well as time limits reached. What
> > > is the whole output fr
I have observed apparently random failures when users had GIDs inside the
range configured as `gid_range` (see below; gid_range should lie
outside the range of GIDs that are assigned to users).
But usually this kind of thing would be due to OOM.
qconf -sconf | grep gid_range
gid_range 5-51000
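
To make the gid_range check above concrete, here is a minimal sketch in Python. It assumes gid_range is a single LOW-HIGH expression as printed by qconf, and that the local group database (read through the standard grp module) is representative of the groups users actually have; the function names are my own and not part of SGE.

```python
import grp


def parse_gid_range(text):
    """Parse a single 'LOW-HIGH' expression into an inclusive (low, high) tuple."""
    low, high = text.strip().split("-", 1)
    return int(low), int(high)


def groups_in_range(groups, gid_range):
    """Return the (name, gid) pairs whose gid lies inside gid_range (inclusive)."""
    low, high = gid_range
    return [(name, gid) for name, gid in groups if low <= gid <= high]


def local_groups():
    """Collect (name, gid) pairs from the local group database."""
    return [(g.gr_name, g.gr_gid) for g in grp.getgrall()]


if __name__ == "__main__":
    # The range string is the value reported by 'qconf -sconf | grep gid_range'.
    configured = parse_gid_range("5-51000")
    for name, gid in groups_in_range(local_groups(), configured):
        print(f"group {name} (gid {gid}) falls inside gid_range")
```

Any group it prints is a candidate for the clash described above. Depending on how the name service is set up, grp.getgrall() may not list every directory-service group, so an empty result is not a guarantee.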
On Tue,
AFAICS the kill sent by SGE happens after a task has already returned with an
error. In this case SGE would use the kill signal to make sure all child
processes are killed. Hence the question would be: what was the initial command
in the job script, and what output/error did it generate?
-- Reuti
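
As a rough aid for telling "the command failed on its own" apart from "the job was killed by a signal", here is a small sketch in Python that reads the exit_status field from saved 'qacct -j' output and decodes it. It assumes the usual shell convention that a status above 128 means 128 plus the signal number; the function names and the sample text in the comment are made up for illustration.

```python
import signal


def exit_status_from_qacct(text):
    """Return the integer after the 'exit_status' field name, or None if absent."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "exit_status":
            return int(fields[1])
    return None


def describe_exit_status(status):
    """Decode a status using the shell convention: above 128 means 128 + signal."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with code {status}"


# Hypothetical usage with text saved from 'qacct -j JOBID':
#   status = exit_status_from_qacct(open("qacct.txt").read())
#   if status is not None:
#       print(describe_exit_status(status))
```

A plain non-zero code points at the job script's own command, which is where its stdout/stderr files become interesting; a decoded signal points at something outside the command doing the killing.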
>
Message-
From: users-boun...@gridengine.org On Behalf Of
hiller
Sent: Tuesday, May 14, 2019 9:52 AM
To: users@gridengine.org
Subject: Re: [gridengine users] jobs randomly die
~> qconf -srqs
No resource quota set found
'dmesg -T' does not show any OOM or other weird messages.
'free -h' lo
is the whole output from 'qacct -j JOBID'?
> >
> > Cheers,
> > -Hugh
> >
> > -Original Message-
> > From: users-boun...@gridengine.org On Behalf
> > Of hiller
> > Sent: Tuesday, May 14, 2019 9:02 AM
> > To: users@gridengine.org
> > S
'?
Cheers,
-Hugh
-Original Message-
From: users-boun...@gridengine.org On Behalf Of
hiller
Sent: Tuesday, May 14, 2019 9:02 AM
To: users@gridengine.org
Subject: Re: [gridengine users] jobs randomly die
Hi,
nope, there are no oom messages in the journal.
Regards, ulrich
On 5/14/19 12:49 PM, Arnau wrote:
> Hi,
>
> _maybe_ the OOM killer killed the job? A look at messages will give you an
> answer (I've seen this on my cluster).
>
> HTH,
> Arnau
>
> El mar., 14 may. 2019 a las 12:37, hiller