Hi Eric,

I am wondering how your exechost spool directory is configured.
Is your exechost spool directory on a local disk or on an NFS-mounted
filesystem?

If it's on an NFS filesystem, how many exechosts are in your cluster?
If that's the case, I'm wondering whether the NFS server may be having
trouble serving that many clients.
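
Just as a sketch of how to check (adjust the host name and use whatever
path qconf actually reports):

  # where is the execd spool directory configured?
  qconf -sconf | grep execd_spool_dir
  qconf -sconf compute-2-13 | grep execd_spool_dir   # per-host override, if any

  # then, on the exechost itself, see whether that directory is local or NFS
  df -hT /path/reported/by/qconf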

- Chansup


On Wed, Oct 19, 2011 at 11:36 AM, Reuti <[email protected]> wrote:

> On 17.10.2011, at 19:26, Peskin, Eric wrote:
>
> > No, there is nothing in /var/log/messages about the oom-killer (or
> anything getting killed).  There is stuff about DHCPDISCOVER not finding
> leases.  That is strange, because we are using static IP addresses, so I am
> not sure why anything is looking for DHCP.  There are also complaints that
> the various compute nodes cannot find a suitable server for ntpd.  Both of
> these are chronic messages that we always have.  They do not seem to be
> limited to the times when jobs fail.  There are messages about directories
> being NFS-mounted and unmounted.  Finally, the mail log shows the message
> about the job failure being sent to the user.
> >
> > > As for the messages file on the exec host: for the job mentioned in my
> > original email, no, I do not see anything.  We do have another, similar
> > failure (more recent) where I do see some messages at the time the job
> > failed.  However, they refer to different job numbers.  In this more recent
> > case, the message sent to the end user was:
> > ==============================================
> > Job 332115 (qmake) Aborted
> > Exit Status      = 137
> > Signal           = KILL
> > User             = tangz01
> > Queue            = [email protected]
> > Host             = compute-2-13.local
> > Start Time       = 10/13/2011 10:02:04
> > End Time         = 10/13/2011 10:09:34
> > CPU              = 00:00:09
> > Max vmem         = 1.322G
> > failed assumedly after job because:
> > job 332115.1 died through signal KILL (9)
> > ==============================================
> >
> > At that time on compute-2-13 itself, the file
> $SGE_ROOT/$SGE_CELL/spool/compute-2-13/messages has the following:
> >
> > 10/13/2011 10:09:37|  main|compute-2-13|W|reaping job "332836" ptf
> complains: Job does not exist
> > 10/13/2011 10:09:37|  main|compute-2-13|E|can't open file
> active_jobs/332836.1/error: No such file or directory
> > 10/13/2011 10:09:37|  main|compute-2-13|W|reaping job "332842" ptf
> complains: Job does not exist
> > 10/13/2011 10:09:37|  main|compute-2-13|E|can't open file
> active_jobs/332842.1/error: No such file or directory
>
> Somehow I remember this issue on the list, but IIRC we never found a
> solution; the problem just vanished again at some point.
>
> The jobs were killed randomly without any apparent reason. I can't find the
> thread right now, though.
>
> -- Reuti
>
>
> > These messages are actually three seconds after the failure, and they
> refer to different job numbers.  But I list them because they are so close
> in time.
> >
> > Thanks,
> >       Eric
> >
> >
> > On Oct 7, 2011, at 10:40 AM, Reuti wrote:
> >
> >> Hi,
> >>
> >> On 04.10.2011, at 00:47, Peskin, Eric wrote:
> >>
> >>>> Is the hard run time limit (h_rt) getting reached some times but
> >>>> not others?
> >>>
> >>> No, we do not have any limits set:
> >>>
> >>> [root@fen1 ~]# qconf -sq `qconf -sql` | grep [hs]_|sort -u
> >>> h_core                INFINITY
> >>> h_cpu                 INFINITY
> >>> h_data                INFINITY
> >>> h_fsize               INFINITY
> >>> h_rss                 INFINITY
> >>> h_rt                  INFINITY
> >>> h_stack               INFINITY
> >>> h_vmem                INFINITY
> >>> s_core                INFINITY
> >>> s_cpu                 INFINITY
> >>> s_data                INFINITY
> >>> s_fsize               INFINITY
> >>> s_rss                 INFINITY
> >>> s_rt                  INFINITY
> >>> s_stack               INFINITY
> >>> s_vmem                INFINITY
> >>> [root@fen1 ~]#
> >>
> >> Is there anything in /var/log/messages about the oom-killer? Or in the
> >> SGE messages file in the exechost's spool directory?
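> >>
> >> Just as a sketch of how to check both (adjust the exechost name; this
> >> assumes a standard syslog location):
> >>
> >>   # on the exechost where the job died, look for OOM killer activity
> >>   grep -iE 'oom|out of memory|killed process' /var/log/messages
> >>
> >>   # SGE's own log for that exechost
> >>   less $SGE_ROOT/$SGE_CELL/spool/<exechost>/messages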
> >>
> >> -- Reuti
> >>
> >>
> >>>
> >>> On Oct 3, 2011, at 1:41 PM, Mike Hanby wrote:
> >>>
> >>>> Is the hard run time limit (h_rt) getting reached some times but
> >>>> not others?
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: [email protected] [mailto:users-
> >>>>> [email protected]] On Behalf Of Peskin, Eric
> >>>>> Sent: Monday, October 03, 2011 11:14 AM
> >>>>> To: [email protected]
> >>>>> Subject: [gridengine users] jobs getting killed (failed assumedly after
> >>>>> job because: job 311263.1 died through signal KILL (9))
> >>>>>
> >>>>> All,
> >>>>>
> >>>>> I have a user running qmake jobs.  Intermittently, the job fails and
> >>>>> SGE says it was killed with signal 9.  The user did not kill it.  We
> >>>>> (the sysadmins) did not kill it.  How can I figure out what is going
> >>>>> on?  The worst part is that this problem is intermittent: exactly the
> >>>>> same command sometimes works and sometimes fails.  I have appended the
> >>>>> message from SGE below.  Any suggestions would be greatly appreciated.
> >>>>>
> >>>>> Thanks,
> >>>>>   Eric Peskin
> >>>>>
> >>>>> From: root [root@local]
> >>>>> Sent: Saturday, September 24, 2011 9:04 PM
> >>>>> To: Tang, Zuojian
> >>>>> Subject: Job 311263 (qmake) Aborted
> >>>>>
> >>>>> Job 311263 (qmake) Aborted
> >>>>> Exit Status      = 137
> >>>>> Signal           = KILL
> >>>>> User             = tangz01
> >>>>> Queue            = [email protected]
> >>>>> Host             = compute-0-13.local
> >>>>> Start Time       = 09/24/2011 19:03:31
> >>>>> End Time         = 09/24/2011 21:04:10
> >>>>> CPU              = 00:00:29
> >>>>> Max vmem         = 2.579G
> >>>>> failed assumedly after job because:
> >>>>> job 311263.1 died through signal KILL (9)
> >>>>>
> >>>>>
> >>>
> >>>
> >>
> >
> >
> >
> >
>
>
>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
