On 17.10.2011 at 19:26, Peskin, Eric wrote:

> No, there is nothing in /var/log/messages about the oom-killer (or anything
> getting killed). There is stuff about DHCPDISCOVER not finding leases. That
> is strange, because we are using static IP addresses, so I am not sure why
> anything is looking for DHCP. There are also complaints that the various
> compute nodes cannot find a suitable server for ntpd. Both of these are
> chronic messages that we always have. They do not seem to be limited to the
> times when jobs fail. There are messages about directories being nfs mounted
> and unmounted. Finally, the mail log shows the message about the job failure
> being sent to the user.
>
> As for the messages file on the exec host. For the job mentioned in my
> original email, no I do not see anything. We do have another similar failure
> (more recent), where I do see some messages at the time the job failed.
> However, they refer to different job numbers. In this, more recent case, the
> message sent to the end user was:
> ==============================================
> Job 332115 (qmake) Aborted
> Exit Status = 137
> Signal = KILL
> User = tangz01
> Queue = [email protected]
> Host = compute-2-13.local
> Start Time = 10/13/2011 10:02:04
> End Time = 10/13/2011 10:09:34
> CPU = 00:00:09
> Max vmem = 1.322G
> failed assumedly after job because:
> job 332115.1 died through signal KILL (9)
> ==============================================
>
> At that time on compute-2-13 itself, the file
> $SGE_ROOT/$SGE_CELL/spool/compute-2-13/messages has the following:
>
> 10/13/2011 10:09:37| main|compute-2-13|W|reaping job "332836" ptf complains: Job does not exist
> 10/13/2011 10:09:37| main|compute-2-13|E|can't open file active_jobs/332836.1/error: No such file or directory
> 10/13/2011 10:09:37| main|compute-2-13|W|reaping job "332842" ptf complains: Job does not exist
> 10/13/2011 10:09:37| main|compute-2-13|E|can't open file active_jobs/332842.1/error: No such file or directory

I vaguely remember this issue from the list. But IIRC we never found a solution; the problem simply vanished again at some point. Jobs were killed randomly without any apparent reason. I can't find the thread right now, though.

-- Reuti

> These messages are actually three seconds after the failure, and they refer
> to different job numbers. But I list them because they are so close in time.
>
> Thanks,
> Eric
>
>
> On Oct 7, 2011, at 10:40 AM, Reuti wrote:
>
>> Hi,
>>
>> On 04.10.2011 at 00:47, Peskin, Eric wrote:
>>
>>>> Is the hard run time limit (h_rt) getting reached some times but
>>>> not others?
>>>
>>> No, we do not have any limits set:
>>>
>>> [root@fen1 ~]# qconf -sq `qconf -sql` | grep [hs]_|sort -u
>>> h_core INFINITY
>>> h_cpu INFINITY
>>> h_data INFINITY
>>> h_fsize INFINITY
>>> h_rss INFINITY
>>> h_rt INFINITY
>>> h_stack INFINITY
>>> h_vmem INFINITY
>>> s_core INFINITY
>>> s_cpu INFINITY
>>> s_data INFINITY
>>> s_fsize INFINITY
>>> s_rss INFINITY
>>> s_rt INFINITY
>>> s_stack INFINITY
>>> s_vmem INFINITY
>>> [root@fen1 ~]#
>>
>> is there anything in /var/log/messages about the oom-killer? Or the
>> SGE messages files on the exechost's spool directory?
>>
>> -- Reuti
>>
>>
>>>
>>> On Oct 3, 2011, at 1:41 PM, Mike Hanby wrote:
>>>
>>>> Is the hard run time limit (h_rt) getting reached some times but
>>>> not others?
>>>>
>>>>> -----Original Message-----
>>>>> From: [email protected] [mailto:users-
>>>>> [email protected]] On Behalf Of Peskin, Eric
>>>>> Sent: Monday, October 03, 2011 11:14 AM
>>>>> To: [email protected]
>>>>> Subject: [gridengine users] jobs getting killed (failed assumedly after
>>>>> job because: job 311263.1 died through signal KILL (9))
>>>>>
>>>>> All,
>>>>>
>>>>> I have a user running qmake jobs. Intermittently, the job fails and
>>>>> SGE says it was killed with signal 9. The user did not kill it. We
>>>>> (the sysadmins) did not kill it. How can I figure out what is going
>>>>> on? The worst part is that this problem is intermittent. Exactly the
>>>>> same command works sometimes but fails sometimes. I have appended the
>>>>> message from SGE below. Any suggestions would be greatly appreciated.
>>>>>
>>>>> Thanks,
>>>>> Eric Peskin
>>>>>
>>>>> From: root [root@local]
>>>>> Sent: Saturday, September 24, 2011 9:04 PM
>>>>> To: Tang, Zuojian
>>>>> Subject: Job 311263 (qmake) Aborted
>>>>>
>>>>> Job 311263 (qmake) Aborted
>>>>> Exit Status = 137
>>>>> Signal = KILL
>>>>> User = tangz01
>>>>> Queue = [email protected]
>>>>> Host = compute-0-13.local
>>>>> Start Time = 09/24/2011 19:03:31
>>>>> End Time = 09/24/2011 21:04:10
>>>>> CPU = 00:00:29
>>>>> Max vmem = 2.579G
>>>>> failed assumedly after job because:
>>>>> job 311263.1 died through signal KILL (9)
>>>>>
>>>>>
>>>>> ------------------------------------------------------------
>>>>> This email message, including any attachments, is for the sole use of
>>>>> the intended recipient(s) and may contain information that is
>>>>> proprietary, confidential, and exempt from disclosure under applicable
>>>>> law. Any unauthorized review, use, disclosure, or distribution is
>>>>> prohibited. If you have received this email in error please notify the
>>>>> sender by return email and delete the original message. Please note,
>>>>> the recipient should check this email and any attachments for the
>>>>> presence of viruses. The organization accepts no liability for any
>>>>> damage caused by any virus transmitted by this email.
>>>>> =================================
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> [email protected]
>>>>> https://gridengine.org/mailman/listinfo/users
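
A note on the numbers in the job reports quoted in this thread: an exit status of 137 is what a shell conventionally reports for a process terminated by signal 9 (128 + 9), so the "Exit Status = 137" and "Signal = KILL" lines describe the same event. The sketch below, assuming a POSIX shell on the exec host, decodes such a status and searches a syslog file for kernel OOM-killer entries, which is the check Reuti asks about. The function names and the grep patterns are illustrative assumptions and not part of SGE; the exact kernel wording differs between versions, so the patterns may need adjusting.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not part of SGE.

# Decode a shell-style exit status. Values above 128 conventionally mean
# "terminated by signal (status - 128)", so 137 maps to signal 9 (KILL).
decode_exit_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # kill -l <number> prints the signal name without the SIG prefix
        echo "terminated by signal $sig ($(kill -l "$sig"))"
    else
        echo "exited normally with code $status"
    fi
}

# Print lines in a log file that look like kernel OOM-killer activity.
# The patterns are assumptions about typical Linux kernel messages.
# "|| true" keeps the function from failing when nothing matches.
find_oom_events() {
    logfile=$1
    grep -iE 'oom-killer|out of memory|killed process' "$logfile" || true
}

# Example: decode the exit status from the job reports in this thread.
decode_exit_status 137
```

Running `find_oom_events /var/log/messages` on the affected exec host would print any matching lines. An empty result is consistent with Eric's report that /var/log/messages contains nothing about the oom-killer, in which case the KILL signal came from somewhere other than the kernel's OOM handling.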
