Huh. Yeah, nothing particularly useful there (was hoping for the submit_cmd ... but maybe that's just UGE?). What's in the job script (options), and how exactly was it submitted (command)? And do you have any default limits in the $SGE_ROOT/$SGE_CELL/common/sge_request file?
-Hugh

-----Original Message-----
From: users-boun...@gridengine.org <users-boun...@gridengine.org> On Behalf Of hiller
Sent: Tuesday, May 14, 2019 9:52 AM
To: users@gridengine.org
Subject: Re: [gridengine users] jobs randomly die

~> qconf -srqs
No resource quota set found

'dmesg -T' does not give an oom or other weird messages.

'free -h' looks good and also looked good at 'kill time':
~> free -h
              total        used        free      shared  buff/cache   available
Mem:           188G        1.0G        185G        2.6M        2.0G        186G
Swap:           49G          0B         49G

Full output of qacct:
~> qacct -j 635659
==============================================================
qname        all.q
hostname     karun10
group        users
owner        calj
project      NONE
department   defaultdepartment
jobname      dsc_gdr2
jobnumber    635659
taskid       undefined
account      sge
priority     0
qsub_time    Mon May 13 13:06:58 2019
start_time   Mon May 13 13:06:56 2019
end_time     Mon May 13 18:31:42 2019
granted_pe   make
slots        1
failed       100 : assumedly after job
exit_status  137 (Killed)
ru_wallclock 19486s
ru_utime     0.048s
ru_stime     0.006s
ru_maxrss    11.566KB
ru_ixrss     0.000B
ru_ismrss    0.000B
ru_idrss     0.000B
ru_isrss     0.000B
ru_minflt    7885
ru_majflt    0
ru_nswap     0
ru_inblock   0
ru_oublock   8
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     142
ru_nivcsw    3
cpu          19305.760s
mem          7.463TBs
io           70.435GB
iow          0.000s
maxvmem      532.004MB
arid         undefined
ar_sub_time  undefined
category     -l hostname=karun10 -pe make 1

Thanks, ulrich

On 5/14/19 3:28 PM, MacMullan IV, Hugh wrote:
> It's a limit being reached, of some sort. Do you have a RQS of any kind
> (qconf -srqs)? We see this for job-requested, or system set RAM exhaustion
> (OOM killer, as mentioned 'dmesg -T' on compute nodes often useful), as well
> as time limits reached. What is the whole output from 'qacct -j JOBID'?
>
> Cheers,
> -Hugh
>
> -----Original Message-----
> From: users-boun...@gridengine.org <users-boun...@gridengine.org> On Behalf Of hiller
> Sent: Tuesday, May 14, 2019 9:02 AM
> To: users@gridengine.org
> Subject: Re: [gridengine users] jobs randomly die
>
> Hi,
> nope, there are no oom messages in the journal.
> Regards, ulrich
>
>
> On 5/14/19 12:49 PM, Arnau wrote:
>> Hi,
>>
>> _maybe_ the OOM killer killed the job ? a look to messages will give you an
>> answer (I've seen this in my cluster).
>>
>> HTH,
>> Arnau
>>
>> On Tue, 14 May 2019 at 12:37, hiller (<hil...@mpia-hd.mpg.de
>> <mailto:hil...@mpia-hd.mpg.de>>) wrote:
>>
>> Dear all,
>> i have a problem that jobs sent to gridengine randomly die.
>> The gridengine version is 8.1.9
>> The OS is opensuse 15.0
>> The gridengine messages file says:
>> 05/13/2019 18:31:45|worker|karun|E|master task of job 635659.1 failed - killing job
>> 05/13/2019 18:31:46|worker|karun|W|job 635659.1 failed on host karun10 assumedly after job because: job 635659.1 died through signal KILL (9)
>>
>> qacct -j 635659 says:
>> failed       100 : assumedly after job
>> exit_status  137 (Killed)
>>
>>
>> There was no kill triggered by the user. Also there are no other
>> limitations, neither ulimit nor in the gridengine queue
>> The 'qconf -sq all.q' command gives:
>> s_rt     INFINITY
>> h_rt     INFINITY
>> s_cpu    INFINITY
>> h_cpu    INFINITY
>> s_fsize  INFINITY
>> h_fsize  INFINITY
>> s_data   INFINITY
>> h_data   INFINITY
>> s_stack  INFINITY
>> h_stack  INFINITY
>> s_core   INFINITY
>> h_core   INFINITY
>> s_rss    INFINITY
>> h_rss    INFINITY
>> s_vmem   INFINITY
>> h_vmem   INFINITY
>>
>> Years ago there were some threads about the same issue, but i did not
>> find a solution.
>>
>> Does somebody have a hint what i can do or check/debug?
>>
>> With kind regards and many thanks for any help, ulrich
>> _______________________________________________
>> users mailing list
>> users@gridengine.org <mailto:users@gridengine.org>
>> https://gridengine.org/mailman/listinfo/users
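
A side note on the numbers quoted in this thread: the exit_status of 137 reported by qacct follows the usual shell convention of 128 plus the signal number, so 137 - 128 = 9, which agrees with the "died through signal KILL (9)" line in the messages file. Below is a minimal sketch of that arithmetic, assuming a POSIX shell whose `kill -l` accepts a signal number. The function name `decode_exit_status` is made up for illustration and is not part of gridengine.

```shell
#!/bin/sh
# Hypothetical helper (not a gridengine tool): interpret an exit_status
# value such as the one printed by 'qacct -j JOBID'.
decode_exit_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        # Shell convention: 128 + N means "terminated by signal N".
        sig=$((status - 128))
        echo "signal $sig ($(kill -l "$sig"))"
    else
        # Values up to 128 are treated as an ordinary exit code.
        echo "exit code $status"
    fi
}

decode_exit_status 137
```

This only confirms that the job received SIGKILL. It says nothing about who sent the signal, which is the open question in the thread.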