Re: [gridengine users] jobs randomly die

2019-05-17 Thread Hay, William
e sort. Do you have a RQS of
> > > any kind (qconf -srqs)? We see this for job-requested, or system
> > > set RAM exhaustion (OOM killer, as mentioned 'dmesg -T' on
> > > compute nodes often useful), as well as time limits reached. What
> > > is the whole output fr

Re: [gridengine users] jobs randomly die

2019-05-14 Thread Daniel Povey
I have observed apparently random failures when users had gid's in the range `gid_range` (see below; gid_range should be out of the range where users have gid's). But usually this kind of thing would be due to OOM.
qconf -sconf | grep gid_range
gid_range 5-51000
On Tue,
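The check described above (no user gid should fall inside `gid_range`) can be sketched in a few lines of Python. This is only an illustration, not part of SGE: the helper names `parse_gid_range` and `gids_in_range` are made up here, the example value `20000-20100` is hypothetical, and the parser assumes the value is a comma-separated list of `n-m` ranges.

```python
def parse_gid_range(spec):
    """Parse a gid_range value such as '20000-20100' into (low, high) tuples.

    Assumption: the value is a comma-separated list of 'n-m' ranges;
    a bare number is treated as a one-element range.
    """
    ranges = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        low, sep, high = part.partition("-")
        ranges.append((int(low), int(high) if sep else int(low)))
    return ranges


def gids_in_range(spec, gids):
    """Return the sorted gids that fall inside any range of the gid_range spec."""
    ranges = parse_gid_range(spec)
    return sorted(g for g in set(gids) if any(lo <= g <= hi for lo, hi in ranges))


# Hypothetical values: one of the three gids collides with the range.
conflicts = gids_in_range("20000-20100", [100, 1000, 20050])
print(conflicts)
```

On a real host the gids could be collected with `[g.gr_gid for g in grp.getgrall()]` from the standard `grp` module, though on sites using NIS or LDAP that list may be incomplete. An empty result means no listed gid overlaps the configured range.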

Re: [gridengine users] jobs randomly die

2019-05-14 Thread Reuti
AFAICS the kill sent by SGE happens after a task has already returned with an error. In this case SGE would use the kill signal to make sure all child processes are killed. Hence the question would be: what was the initial command in the job script, and what output/error did it generate?
-- Reuti
>

Re: [gridengine users] jobs randomly die

2019-05-14 Thread MacMullan IV, Hugh
Message-
From: users-boun...@gridengine.org On Behalf Of hiller
Sent: Tuesday, May 14, 2019 9:52 AM
To: users@gridengine.org
Subject: Re: [gridengine users] jobs randomly die
~> qconf -srqs
No resource quota set found
'dmesg -T' does not give an oom or other weird messages. 'free -h' lo

Re: [gridengine users] jobs randomly die

2019-05-14 Thread Feng Zhang
is the whole output from 'qacct -j JOBID'?
> >
> > Cheers,
> > -Hugh
> >
> > -----Original Message-----
> > From: users-boun...@gridengine.org On Behalf
> > Of hiller
> > Sent: Tuesday, May 14, 2019 9:02 AM
> > To: users@gridengine.org
> > S

Re: [gridengine users] jobs randomly die

2019-05-14 Thread MacMullan IV, Hugh
'?
Cheers,
-Hugh
-----Original Message-----
From: users-boun...@gridengine.org On Behalf Of hiller
Sent: Tuesday, May 14, 2019 9:02 AM
To: users@gridengine.org
Subject: Re: [gridengine users] jobs randomly die
Hi, nope, there are no oom messages in the journal.
Regards, ulrich
On 5/14/19 12:49

Re: [gridengine users] jobs randomly die

2019-05-14 Thread hiller
Hi, nope, there are no oom messages in the journal.
Regards, ulrich
On 5/14/19 12:49 PM, Arnau wrote:
> Hi,
>
> _maybe_ the OOM killer killed the job? A look at the messages will give you an
> answer (I've seen this in my cluster).
>
> HTH,
> Arnau
>
> On Tue, 14 May 2019 at 12:37, hiller
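For anyone repeating the OOM check discussed in this thread, a rough Python sketch of filtering kernel log text for OOM killer activity follows. It is an assumption-laden illustration: `find_oom_lines` is a made-up helper, the sample log lines are invented, and the patterns only cover the common kernel phrasings ("Out of memory", "oom-killer", "oom-kill"), so an empty result does not prove that no OOM kill happened.

```python
import re

# Common phrasings the Linux kernel uses when the OOM killer acts.
OOM_PATTERN = re.compile(r"out of memory|oom-killer|oom-kill", re.IGNORECASE)


def find_oom_lines(log_text):
    """Return the lines of a dmesg/journal dump that look like OOM killer messages."""
    return [line for line in log_text.splitlines() if OOM_PATTERN.search(line)]


# Invented sample input, standing in for saved 'dmesg -T' output.
sample = "\n".join([
    "[Tue May 14 12:30:01 2019] eth0: link up",
    "[Tue May 14 12:31:07 2019] python invoked oom-killer: gfp_mask=0x0, order=0",
    "[Tue May 14 12:31:07 2019] Out of memory: Killed process 1234 (python)",
])
hits = find_oom_lines(sample)
print(len(hits))
```

The text passed in could be the saved output of 'dmesg -T' or of the journal from the compute node where the job died.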