Re: [gridengine users] jobs randomly die

2019-05-14 Thread Daniel Povey
I have observed apparently random failures when users had GIDs inside the `gid_range` (see below; gid_range should lie outside the range where users have GIDs). But usually this kind of thing would be due to OOM. qconf -sconf | grep gid_range gid_range5-51000 On Tue,
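The overlap described above can be checked mechanically. The following is a minimal sketch, assuming `gid_range` is written as one or more comma-separated `low-high` pairs; the helper names `parse_gid_range` and `conflicting_gids` and the numeric values are hypothetical and not taken from this thread.

```python
import re


def parse_gid_range(spec):
    """Parse 'low-high[,low-high...]' into a list of (low, high) tuples.

    Assumption: this is the format of the gid_range value printed by
    `qconf -sconf`; adjust the pattern if your installation differs.
    """
    ranges = []
    for part in spec.split(","):
        match = re.fullmatch(r"(\d+)-(\d+)", part.strip())
        if match is None:
            raise ValueError(f"unrecognised range expression: {part!r}")
        low, high = int(match.group(1)), int(match.group(2))
        if low > high:
            raise ValueError(f"range is reversed: {part!r}")
        ranges.append((low, high))
    return ranges


def conflicting_gids(spec, gids):
    """Return the sorted, de-duplicated GIDs that fall inside any range."""
    ranges = parse_gid_range(spec)
    return sorted(
        gid for gid in set(gids)
        if any(low <= gid <= high for low, high in ranges)
    )


# Hypothetical values for illustration only.
print(conflicting_gids("20000-20100", [100, 1000, 20050]))
```

On a real node the GID list could be collected with Python's standard `grp.getgrall()`, which only sees groups known to the local name service configuration. An empty result means no group ID collides with the configured range.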

[gridengine users] jobs randomly die

2019-05-14 Thread hiller
Dear all, I have a problem: jobs sent to gridengine randomly die. The gridengine version is 8.1.9. The OS is openSUSE 15.0. The gridengine messages file says: 05/13/2019 18:31:45|worker|karun|E|master task of job 635659.1 failed - killing job 05/13/2019 18:31:46|worker|karun|W|job 635659.1

Re: [gridengine users] jobs randomly die

2019-05-14 Thread hiller
Hi, nope, there are no OOM messages in the journal. Regards, ulrich On 5/14/19 12:49 PM, Arnau wrote: > Hi, > > _maybe_ the OOM killer killed the job? A look at the messages will give you an > answer (I've seen this in my cluster). > > HTH, > Arnau > > On Tue, 14 May 2019 at 12:37, hiller

Re: [gridengine users] jobs randomly die

2019-05-14 Thread MacMullan IV, Hugh
It's a limit being reached, of some sort. Do you have an RQS of any kind (qconf -srqs)? We see this for job-requested or system-set RAM exhaustion (the OOM killer, as mentioned; 'dmesg -T' on the compute nodes is often useful), as well as when time limits are reached. What is the whole output from 'qacct -j JOBID'?
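When many jobs are affected, the `qacct -j` output asked for above can be scanned programmatically. This is a minimal sketch, assuming the usual layout of one `key value` pair per line with records separated by a line of `=` characters. The sample text is made up for illustration and is not output from the job in this thread; `parse_qacct` and `looks_failed` are hypothetical helper names.

```python
def parse_qacct(text):
    """Split qacct-style text into a list of {field: value} dicts."""
    records, current = [], {}
    for line in text.splitlines():
        if line.startswith("====="):      # record separator
            if current:
                records.append(current)
                current = {}
            continue
        parts = line.split(None, 1)       # key, then the rest of the line
        if not parts:
            continue                      # skip blank lines
        current[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    if current:
        records.append(current)
    return records


def first_int(value, default=0):
    """Return the leading integer of a field such as '100 : reason'."""
    tokens = value.split()
    try:
        return int(tokens[0])
    except (IndexError, ValueError):
        return default


def looks_failed(record):
    """True if either 'failed' or 'exit_status' is non-zero."""
    return (first_int(record.get("failed", "0")) != 0
            or first_int(record.get("exit_status", "0")) != 0)


# Made-up sample, for illustration only.
SAMPLE = """\
==============================================================
qname        all.q
jobnumber    12345
failed       100 : assumedly after job
exit_status  137
"""

for rec in parse_qacct(SAMPLE):
    print(rec["jobnumber"], looks_failed(rec))
```

The text to parse could come from `subprocess.run(["qacct", "-j", job_id], capture_output=True, text=True).stdout` on a host where `qacct` is available.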

Re: [gridengine users] jobs randomly die

2019-05-14 Thread Feng Zhang
Looks like your job used a lot of RAM: mem 7.463TBs io 70.435GB iow 0.000s maxvmem 532.004MB. Do you use cgroups to limit the resources of jobs? Best, Feng On Tue, May 14, 2019 at 9:53 AM hiller wrote: > > ~> qconf -srqs > No resource quota set found > > 'dmesg -T'
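A note on reading these figures: in Grid Engine accounting the `mem` field is an integral of memory use over CPU time (GB times seconds), while `maxvmem` is the peak virtual memory size. A large `mem` value can therefore come from a long run rather than a large footprint. The sketch below turns the integral into an average, assuming the `TBs` suffix is a decimal multiple of GB-seconds. The helper names and the CPU time used in the example are hypothetical; the thread does not show the job's `cpu` value.

```python
import re

# Assumption: decimal scaling between the GBs and TBs suffixes.
UNIT_TO_GB_SECONDS = {"GBs": 1.0, "TBs": 1000.0}


def parse_mem_integral(field):
    """Convert a field such as '7.463TBs' to GB-seconds."""
    match = re.fullmatch(r"([0-9.]+)\s*(GBs|TBs)", field.strip())
    if match is None:
        raise ValueError(f"unrecognised mem field: {field!r}")
    return float(match.group(1)) * UNIT_TO_GB_SECONDS[match.group(2)]


def average_memory_gb(mem_field, cpu_seconds):
    """Average memory in GB over the job's CPU time."""
    if cpu_seconds <= 0:
        raise ValueError("cpu_seconds must be positive")
    return parse_mem_integral(mem_field) / cpu_seconds


# 'mem' value quoted in the thread; the CPU time here is hypothetical.
print(average_memory_gb("7.463TBs", cpu_seconds=20000.0))
```

Comparing this average with `maxvmem` gives a rough sense of whether the job was ever close to a memory limit.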

Re: [gridengine users] jobs randomly die

2019-05-14 Thread MacMullan IV, Hugh
Huh. Yeah, nothing particularly useful there (was hoping for the submit_cmd ... but maybe that's just UGE?). What's in the job script (options), and how exactly was it submitted (command)? And do you have any default limits in the $SGE_ROOT/$SGE_CELL/common/sge_request file? -Hugh -Original

Re: [gridengine users] jobs randomly die

2019-05-14 Thread Reuti
AFAICS the kill sent by SGE happens after a task has already returned with an error. In this case SGE would use the kill signal to make sure all child processes are killed. Hence the question would be: what was the initial command in the job script, and what output/error did it generate? -- Reuti >