Hi Eric, I am wondering how your exechost spool directory is configured. Is it on a local disk or on an NFS-mounted filesystem?
If it's on an NFS filesystem, how many exechosts are in your cluster? If that is the case, I am wondering whether there may be an issue with the NFS server not being able to serve that many clients.

- Chansup

On Wed, Oct 19, 2011 at 11:36 AM, Reuti <[email protected]> wrote:

> On 17.10.2011 at 19:26, Peskin, Eric wrote:
>
> > No, there is nothing in /var/log/messages about the oom-killer (or
> > anything getting killed). There is stuff about DHCPDISCOVER not finding
> > leases. That is strange, because we are using static IP addresses, so I
> > am not sure why anything is looking for DHCP. There are also complaints
> > that the various compute nodes cannot find a suitable server for ntpd.
> > Both of these are chronic messages that we always have. They do not seem
> > to be limited to the times when jobs fail. There are messages about
> > directories being NFS mounted and unmounted. Finally, the mail log shows
> > the message about the job failure being sent to the user.
> >
> > As for the messages file on the exec host: for the job mentioned in my
> > original email, no, I do not see anything. We do have another similar
> > failure (more recent), where I do see some messages at the time the job
> > failed. However, they refer to different job numbers.
> > In this more recent case, the message sent to the end user was:
> > ==============================================
> > Job 332115 (qmake) Aborted
> > Exit Status = 137
> > Signal = KILL
> > User = tangz01
> > Queue = [email protected]
> > Host = compute-2-13.local
> > Start Time = 10/13/2011 10:02:04
> > End Time = 10/13/2011 10:09:34
> > CPU = 00:00:09
> > Max vmem = 1.322G
> > failed assumedly after job because:
> > job 332115.1 died through signal KILL (9)
> > ==============================================
> >
> > At that time on compute-2-13 itself, the file
> > $SGE_ROOT/$SGE_CELL/spool/compute-2-13/messages has the following:
> >
> > 10/13/2011 10:09:37| main|compute-2-13|W|reaping job "332836" ptf complains: Job does not exist
> > 10/13/2011 10:09:37| main|compute-2-13|E|can't open file active_jobs/332836.1/error: No such file or directory
> > 10/13/2011 10:09:37| main|compute-2-13|W|reaping job "332842" ptf complains: Job does not exist
> > 10/13/2011 10:09:37| main|compute-2-13|E|can't open file active_jobs/332842.1/error: No such file or directory
>
> Somehow I remember this issue on the list. But IIRC we never found a
> solution; the problem just vanished again at some point.
>
> They were killed randomly without any reason. I can't find the thread
> right now though.
>
> -- Reuti
>
> > These messages are actually three seconds after the failure, and they
> > refer to different job numbers. But I list them because they are so
> > close in time.
> >
> > Thanks,
> > Eric
> >
> > On Oct 7, 2011, at 10:40 AM, Reuti wrote:
> >
> >> Hi,
> >>
> >> On 04.10.2011 at 00:47, Peskin, Eric wrote:
> >>
> >>>> Is the hard run time limit (h_rt) getting reached some times but
> >>>> not others?
> >>>
> >>> No, we do not have any limits set:
> >>>
> >>> [root@fen1 ~]# qconf -sq `qconf -sql` | grep [hs]_|sort -u
> >>> h_core INFINITY
> >>> h_cpu INFINITY
> >>> h_data INFINITY
> >>> h_fsize INFINITY
> >>> h_rss INFINITY
> >>> h_rt INFINITY
> >>> h_stack INFINITY
> >>> h_vmem INFINITY
> >>> s_core INFINITY
> >>> s_cpu INFINITY
> >>> s_data INFINITY
> >>> s_fsize INFINITY
> >>> s_rss INFINITY
> >>> s_rt INFINITY
> >>> s_stack INFINITY
> >>> s_vmem INFINITY
> >>> [root@fen1 ~]#
> >>
> >> is there anything in /var/log/messages about the oom-killer? Or the
> >> SGE messages files on the exechost's spool directory?
> >>
> >> -- Reuti
> >>
> >>>
> >>> On Oct 3, 2011, at 1:41 PM, Mike Hanby wrote:
> >>>
> >>>> Is the hard run time limit (h_rt) getting reached some times but
> >>>> not others?
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: [email protected] [mailto:users-[email protected]] On Behalf Of Peskin, Eric
> >>>>> Sent: Monday, October 03, 2011 11:14 AM
> >>>>> To: [email protected]
> >>>>> Subject: [gridengine users] jobs getting killed (failed assumedly
> >>>>> after job because: job 311263.1 died through signal KILL (9))
> >>>>>
> >>>>> All,
> >>>>>
> >>>>> I have a user running qmake jobs. Intermittently, the job fails and
> >>>>> SGE says it was killed with signal 9. The user did not kill it. We
> >>>>> (the sysadmins) did not kill it. How can I figure out what is going
> >>>>> on? The worst part is that this problem is intermittent. Exactly the
> >>>>> same command works sometimes but fails sometimes. I have appended the
> >>>>> message from SGE below. Any suggestions would be greatly appreciated.
> >>>>>
> >>>>> Thanks,
> >>>>> Eric Peskin
> >>>>>
> >>>>> From: root [root@local]
> >>>>> Sent: Saturday, September 24, 2011 9:04 PM
> >>>>> To: Tang, Zuojian
> >>>>> Subject: Job 311263 (qmake) Aborted
> >>>>>
> >>>>> Job 311263 (qmake) Aborted
> >>>>> Exit Status = 137
> >>>>> Signal = KILL
> >>>>> User = tangz01
> >>>>> Queue = [email protected]
> >>>>> Host = compute-0-13.local
> >>>>> Start Time = 09/24/2011 19:03:31
> >>>>> End Time = 09/24/2011 21:04:10
> >>>>> CPU = 00:00:29
> >>>>> Max vmem = 2.579G
> >>>>> failed assumedly after job because:
> >>>>> job 311263.1 died through signal KILL (9)
> >>>>>
> >>>>> ------------------------------------------------------------
> >>>>> This email message, including any attachments, is for the sole use
> >>>>> of the intended recipient(s) and may contain information that is
> >>>>> proprietary, confidential, and exempt from disclosure under
> >>>>> applicable law. Any unauthorized review, use, disclosure, or
> >>>>> distribution is prohibited. If you have received this email in error
> >>>>> please notify the sender by return email and delete the original
> >>>>> message. Please note, the recipient should check this email and any
> >>>>> attachments for the presence of viruses. The organization accepts no
> >>>>> liability for any damage caused by any virus transmitted by this
> >>>>> email.
> >>>>> =================================
> >>>>>
> >>>>> _______________________________________________
> >>>>> users mailing list
> >>>>> [email protected]
> >>>>> https://gridengine.org/mailman/listinfo/users
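A side note on the "Exit Status = 137" in the job reports quoted in this thread: by the usual shell convention, a status above 128 means the process died from signal (status - 128), so 137 corresponds to signal 9 (KILL), which matches the "Signal = KILL" line. The sketch below only decodes that number; the function name signal_from_exit_status is made up for illustration and is not part of SGE, and it says nothing about who sent the signal.

```shell
# Hypothetical helper (not an SGE tool): map a shell-style exit status
# to a signal name, assuming the 128 + N convention for death by signal N.
signal_from_exit_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        # kill -l N prints the name of signal number N, without the SIG prefix
        kill -l "$((status - 128))"
    else
        # Status 128 or below: not a signal death under this convention
        return 1
    fi
}

signal_from_exit_status 137   # 137 - 128 = 9
```

This can be pasted into any POSIX shell on the exec host when reading a job's failure mail or accounting record.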
