Looks like your job used a lot of RAM:

mem          7.463TBs
io           70.435GB
iow          0.000s
maxvmem      532.004MB

Do you have cgroups configured to limit the resources of jobs?

Best,
Feng

On Tue, May 14, 2019 at 9:53 AM hiller <hil...@mpia-hd.mpg.de> wrote:
>
> ~> qconf -srqs
> No resource quota set found
>
> 'dmesg -T' does not give an oom or other weird messages.
>
> 'free -h' looks good and also looked good at 'kill time':
>
> ~> free -h
>               total        used        free      shared  buff/cache   available
> Mem:           188G        1.0G        185G        2.6M        2.0G        186G
> Swap:           49G          0B         49G
>
> Full output of qacct:
> ~> qacct -j 635659
> ==============================================================
> qname        all.q
> hostname     karun10
> group        users
> owner        calj
> project      NONE
> department   defaultdepartment
> jobname      dsc_gdr2
> jobnumber    635659
> taskid       undefined
> account      sge
> priority     0
> qsub_time    Mon May 13 13:06:58 2019
> start_time   Mon May 13 13:06:56 2019
> end_time     Mon May 13 18:31:42 2019
> granted_pe   make
> slots        1
> failed       100 : assumedly after job
> exit_status  137 (Killed)
> ru_wallclock 19486s
> ru_utime     0.048s
> ru_stime     0.006s
> ru_maxrss    11.566KB
> ru_ixrss     0.000B
> ru_ismrss    0.000B
> ru_idrss     0.000B
> ru_isrss     0.000B
> ru_minflt    7885
> ru_majflt    0
> ru_nswap     0
> ru_inblock   0
> ru_oublock   8
> ru_msgsnd    0
> ru_msgrcv    0
> ru_nsignals  0
> ru_nvcsw     142
> ru_nivcsw    3
> cpu          19305.760s
> mem          7.463TBs
> io           70.435GB
> iow          0.000s
> maxvmem      532.004MB
> arid         undefined
> ar_sub_time  undefined
> category     -l hostname=karun10 -pe make 1
>
>
> Thanks, ulrich
>
>
> On 5/14/19 3:28 PM, MacMullan IV, Hugh wrote:
> > It's a limit being reached, of some sort. Do you have a RQS of any kind
> > (qconf -srqs)? We see this for job-requested or system-set RAM exhaustion
> > (OOM killer; as mentioned, 'dmesg -T' on compute nodes is often useful), as
> > well as time limits reached. What is the whole output from 'qacct -j JOBID'?
> >
> > Cheers,
> > -Hugh
> >
> > -----Original Message-----
> > From: users-boun...@gridengine.org <users-boun...@gridengine.org> On Behalf Of hiller
> > Sent: Tuesday, May 14, 2019 9:02 AM
> > To: users@gridengine.org
> > Subject: Re: [gridengine users] jobs randomly die
> >
> > Hi,
> > nope, there are no oom messages in the journal.
> > Regards, ulrich
> >
> >
> > On 5/14/19 12:49 PM, Arnau wrote:
> >> Hi,
> >>
> >> _maybe_ the OOM killer killed the job? A look at the messages will give you
> >> an answer (I've seen this in my cluster).
> >>
> >> HTH,
> >> Arnau
> >>
> >> On Tue, May 14, 2019 at 12:37, hiller (<hil...@mpia-hd.mpg.de
> >> <mailto:hil...@mpia-hd.mpg.de>>) wrote:
> >>
> >> Dear all,
> >> I have a problem: jobs sent to gridengine randomly die.
> >> The gridengine version is 8.1.9
> >> The OS is opensuse 15.0
> >> The gridengine messages file says:
> >> 05/13/2019 18:31:45|worker|karun|E|master task of job 635659.1 failed - killing job
> >> 05/13/2019 18:31:46|worker|karun|W|job 635659.1 failed on host karun10 assumedly after job because: job 635659.1 died through signal KILL (9)
> >>
> >> qacct -j 635659 says:
> >> failed       100 : assumedly after job
> >> exit_status  137 (Killed)
> >>
> >>
> >> There was no kill triggered by the user. Also, there are no other
> >> limits set, neither via ulimit nor in the gridengine queue.
> >> The 'qconf -sq all.q' command gives:
> >> s_rt     INFINITY
> >> h_rt     INFINITY
> >> s_cpu    INFINITY
> >> h_cpu    INFINITY
> >> s_fsize  INFINITY
> >> h_fsize  INFINITY
> >> s_data   INFINITY
> >> h_data   INFINITY
> >> s_stack  INFINITY
> >> h_stack  INFINITY
> >> s_core   INFINITY
> >> h_core   INFINITY
> >> s_rss    INFINITY
> >> h_rss    INFINITY
> >> s_vmem   INFINITY
> >> h_vmem   INFINITY
> >>
> >> Years ago there were some threads about the same issue, but I did not
> >> find a solution.
> >>
> >> Does somebody have a hint what I can do or check/debug?
> >>
> >> With kind regards and many thanks for any help, ulrich
> >> _______________________________________________
> >> users mailing list
> >> users@gridengine.org <mailto:users@gridengine.org>
> >> https://gridengine.org/mailman/listinfo/users
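
A rough sanity check on the qacct figures quoted in this thread, offered as a sketch and not as a diagnosis. It assumes that `mem` is the integral memory usage (memory multiplied by CPU seconds, which is how the qacct man page describes the field) and that the printed units are binary (1 TB = 1024 GB). Under those assumptions the average footprint is roughly mem / cpu, which can be compared with maxvmem. The exit status is decoded with the usual shell convention of 128 + signal number. Both helper names are made up for illustration.

```python
# Sketch only: helper names are hypothetical, and the unit handling assumes
# binary prefixes (1 TB = 1024 GB, 1 GB = 1024 MB).

def average_memory_mb(mem_tb_seconds: float, cpu_seconds: float) -> float:
    """Average memory in MB, assuming `mem` is memory integrated over CPU seconds."""
    gb_seconds = mem_tb_seconds * 1024.0
    return gb_seconds / cpu_seconds * 1024.0


def signal_from_exit_status(exit_status: int):
    """Return the signal number if the status follows the 128 + N convention, else None."""
    return exit_status - 128 if exit_status > 128 else None


if __name__ == "__main__":
    # mem, cpu and exit_status are taken from 'qacct -j 635659' above.
    avg = average_memory_mb(7.463, 19305.760)
    print(f"average memory: {avg:.0f} MB (maxvmem reported: 532.004 MB)")
    print(f"signal number: {signal_from_exit_status(137)}")
```

With the values from this job, the average comes out at roughly 400 MB, the same order as the reported maxvmem. If the assumptions hold, a large `mem` value may reflect a long run time more than a large footprint, and exit status 137 corresponds to signal 9 (KILL), which matches the messages file.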