We are running Grid Engine 6.2u5 on CentOS 6.4. Jobs are being killed at random, and the killed jobs finish with an exit status of 127 or 137. I checked /var/log/messages on the nodes and didn't see anything out of the ordinary. The qmaster log entries and the qacct output for one of the affected jobs are below.
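For what it's worth, 137 would be consistent with the KILL signal reported in the log if exit_status follows the usual shell convention of 128 plus the signal number, and 127 would then be the shell's "command not found" code. That is an assumption on my part, not something I have confirmed for this Grid Engine version. A minimal Python sketch of that decoding (the function name describe_exit_status is just illustrative):

```python
import signal


def describe_exit_status(status: int) -> str:
    """Interpret an exit status, ASSUMING the common shell convention:
    127 = command not found, 128 + N = terminated by signal N."""
    if status == 0:
        return "success"
    if status == 127:
        return "command not found (shell convention)"
    if status > 128:
        signum = status - 128
        try:
            # Look up the symbolic name on this platform, e.g. SIGKILL for 9.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown"
        return f"terminated by signal {signum} ({name})"
    return f"exited with code {status}"


if __name__ == "__main__":
    for code in (127, 137):
        print(code, "->", describe_exit_status(code))
```

Under that assumption, 137 decodes to signal 9, which matches the "died through signal KILL (9)" message; it does not say who sent the signal.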
03/31/2014 09:55:30|worker|kepler|W|job 33393.1 failed on host research029.cm.cluster assumedly after job because: job 33393.1 died through signal KILL (9)
03/31/2014 09:55:34|worker|kepler|W|job 33394.1 failed on host research026.cm.cluster assumedly after job because: job 33394.1 died through signal KILL (9)

qacct -j 33394
qname        std
hostname     research026.cm.cluster
group        justinchem
owner        justinchem
project      NONE
department   defaultdepartment
jobname      runCHO-C6H5-Cs_opt.24081
jobnumber    33394
taskid       undefined
account      sge
priority     0
qsub_time    Mon Mar 31 09:54:53 2014
start_time   Mon Mar 31 09:55:10 2014
end_time     Mon Mar 31 09:55:33 2014
granted_pe   gauss
slots        4
failed       100 : assumedly after job
exit_status  137
ru_wallclock 23
ru_utime     0.003
ru_stime     0.008
ru_maxrss    1380
ru_ixrss     0
ru_ismrss    0
ru_idrss     0
ru_isrss     0
ru_minflt    1957
ru_majflt    5
ru_nswap     0
ru_inblock   584
ru_oublock   40
ru_msgsnd    0
ru_msgrcv    0
ru_nsignals  0
ru_nvcsw     58
ru_nivcsw    6
cpu          82.570
mem          452.669
io           0.084
iow          0.000
maxvmem      5.710G
arid         undefined

Thanks,
Eric

--
Eric Kaufmann | Application Support Analyst - Advanced Technology Group | Saint Louis University | 314-977-2257 | [email protected]
