I have observed apparently random failures when users had GIDs inside the
configured `gid_range` (see below; `gid_range` should lie outside the range
of GIDs assigned to users).
But usually this kind of thing would be due to OOM.
qconf -sconf | grep gid_range
gid_range5-51000
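The overlap check described above can be sketched in shell. This is only an illustration: the helper names `gid_in_range` and `user_gids_in_range` are made up here, and the sketch assumes `gid_range` holds a single LOW-HIGH range and that `qconf -sconf` separates the key and the value with whitespace.

```shell
#!/bin/sh
# Hypothetical helper: report whether a numeric GID lies inside a range
# written as LOW-HIGH (assumed to be a single range, not a list).
gid_in_range() {
    gid=$1
    low=${2%-*}     # text before the dash
    high=${2#*-}    # text after the dash
    if [ "$gid" -ge "$low" ] && [ "$gid" -le "$high" ]; then
        echo yes
    else
        echo no
    fi
}

# Hypothetical helper: print every GID of a user that falls inside the range.
user_gids_in_range() {
    user=$1
    range=$2
    for g in $(id -G "$user"); do
        if [ "$(gid_in_range "$g" "$range")" = yes ]; then
            echo "$g"
        fi
    done
    return 0
}

# Possible usage on a host with gridengine installed (not run here):
#   range=$(qconf -sconf | awk '$1 == "gid_range" { print $2 }')
#   user_gids_in_range "$USER" "$range"
```

Any GID printed by `user_gids_in_range` would be a candidate for the clash described above; no output means the user's groups are outside the range.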
On Tue,
Dear all,
I have a problem: jobs submitted to gridengine die at random.
The gridengine version is 8.1.9.
The OS is openSUSE 15.0.
The gridengine messages file says:
05/13/2019 18:31:45|worker|karun|E|master task of job 635659.1 failed - killing job
05/13/2019 18:31:46|worker|karun|W|job 635659.1
Hi,
Nope, there are no OOM messages in the journal.
Regards, ulrich
On 5/14/19 12:49 PM, Arnau wrote:
> Hi,
>
> _maybe_ the OOM killer killed the job? A look at the messages will give you an
> answer (I've seen this in my cluster).
>
> HTH,
> Arnau
>
> On Tue, 14 May 2019 at 12:37, hiller
It's a limit of some sort being reached. Do you have an RQS of any kind
(qconf -srqs)? We see this with RAM exhaustion, whether job-requested or
system-set (OOM killer; as mentioned, 'dmesg -T' on the compute nodes is often
useful), as well as with time limits being reached. What is the whole output
of 'qacct -j JOBID'?
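A minimal sketch of the kernel-log check mentioned above. The function name `oom_lines` is hypothetical, and the search patterns are an assumption about how OOM-killer events are usually worded in the kernel log; they may need adjusting for a given kernel version.

```shell
#!/bin/sh
# Hypothetical filter: keep only log lines that look like OOM-killer activity.
# Reads the log on stdin; returns success even when nothing matches.
oom_lines() {
    grep -iE 'out of memory|oom-killer|killed process' || true
}

# Possible usage on a compute node (not run here):
#   dmesg -T | oom_lines
#   journalctl -k | oom_lines
```

Empty output only means that none of these patterns matched in the log that was searched, not that memory pressure can be ruled out.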
Looks like your job used a lot of RAM:
mem 7.463TBs
io 70.435GB
iow 0.000s
maxvmem 532.004MB
Do you use cgroups to limit the resources of jobs?
Best,
Feng
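When reading figures like the ones above, it may help to keep in mind that in gridengine accounting `mem` is the integral memory usage accumulated over the run, while `maxvmem` is the peak virtual memory size. A small sketch for pulling the fields most relevant to a post-mortem out of a `qacct -j` record follows; the function name `qacct_summary` and the choice of fields are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical filter: reduce a `qacct -j JOBID` record (read on stdin)
# to the lines for failure state, exit status, wallclock and memory usage.
qacct_summary() {
    awk '$1 ~ /^(failed|exit_status|ru_wallclock|mem|maxvmem)$/ { print }'
}

# Possible usage on a host with access to the accounting file (not run here):
#   qacct -j 635659 | qacct_summary
```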
On Tue, May 14, 2019 at 9:53 AM hiller wrote:
>
> ~> qconf -srqs
> No resource quota set found
>
> 'dmesg -T'
Huh. Yeah, nothing particularly useful there (was hoping for the submit_cmd ...
but maybe that's just UGE?). What's in the job script (options), and how
exactly was it submitted (command)? And do you have any default limits in the
$SGE_ROOT/$SGE_CELL/common/sge_request file?
-Hugh
AFAICS the kill sent by SGE happens after a task has already returned with an
error. In that case SGE uses the kill signal to make sure all child
processes are killed. Hence the question is: what was the initial command in the
job script, and what output/error did it generate?
-- Reuti
>