Chansup,

No, the spool is on a local disk:

[ep599@compute-2-13 ~]$ cd $SGE_ROOT/$SGE_CELL/spool/
[ep599@compute-2-13 spool]$ pwd
/opt/gridengine/default/spool
[ep599@compute-2-13 spool]$ df -h .
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              16G   11G  3.6G  76% /
[ep599@compute-2-13 spool]$ 
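
To run the same check on other exec hosts, a minimal sketch follows. It assumes GNU coreutils stat; the function names fs_type and is_network_fs and the list of network filesystem types are illustrative assumptions, not part of SGE.

```shell
#!/bin/sh
# Print the filesystem type of a directory (GNU coreutils stat assumed).
fs_type() {
    stat -f -c %T "$1"
}

# Return 0 if the given type name looks like a network filesystem.
# The list of names is an illustrative assumption, not exhaustive.
is_network_fs() {
    case "$1" in
        nfs*|cifs|smb*|lustre|gpfs|afs) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical usage: check the SGE spool directory, or fall back to the
# current directory when SGE_ROOT is not set.
dir="${SGE_ROOT:+$SGE_ROOT/${SGE_CELL:-default}/spool}"
dir="${dir:-.}"
fstype=$(fs_type "$dir" 2>/dev/null) || fstype=unknown
if is_network_fs "$fstype"; then
    echo "$dir: $fstype (network filesystem)"
else
    echo "$dir: $fstype (treated as local)"
fi
```

On a local ext3 root such as the one shown above, GNU stat reports the type as ext2/ext3, which the sketch treats as local.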

However, home directories and some of the applications are stored on 
NFS-mounted disks.  The home directories are on an Isilon storage system.  Some 
of the applications are on a NAS system that is integrated into the cluster and 
sits on the cluster's private network.  In both cases, the mounts are NFS over 
1 Gb/s Ethernet.  We have been seeing other intermittent performance issues 
that we suspect are related to either the network or the file systems.
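
To see exactly which NFS mounts a node has and with which options, something like the following sketch could be run on an exec host. It assumes a Linux /proc/mounts; the function name list_nfs_mounts is made up for illustration.

```shell
#!/bin/sh
# Print mount point, source, and mount options for every NFS mount in a
# mounts table. The default table, /proc/mounts, is Linux-specific.
# Fields in that file: device, mount point, fs type, options, dump, pass.
list_nfs_mounts() {
    awk '$3 ~ /^nfs4?$/ { print $2, $1, $4 }' "${1:-/proc/mounts}"
}

# Only run against the live table where it exists.
if [ -r /proc/mounts ]; then
    list_nfs_mounts
fi
```

Comparing the options column (protocol version, rsize/wsize, tcp or udp) between the Isilon and NAS mounts may help narrow down whether one of them behaves differently.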

There are 56 exechosts in our cluster.

Thanks,
        Eric

On Oct 19, 2011, at 2:42 PM, CB wrote:

> Hi Eric,
> 
> I am wondering how your exechost spool directory is configured.
> Is it on a local disk or on an NFS-mounted filesystem?
> 
> If it's on an NFS filesystem, how many exechosts are in your cluster?
> If that is the case, the NFS server may be unable to serve that many 
> clients.
> 
> - Chansup
> 
> 
> On Wed, Oct 19, 2011 at 11:36 AM, Reuti <[email protected]> wrote:
> On 17.10.2011 at 19:26, Peskin, Eric wrote:
> 
> > No, there is nothing in /var/log/messages about the oom-killer (or anything 
> > getting killed).  There is stuff about DHCPDISCOVER not finding leases.  
> > That is strange, because we are using static IP addresses, so I am not sure 
> > why anything is looking for DHCP.  There are also complaints that the 
> > various compute nodes cannot find a suitable server for ntpd.  Both of 
> > these are chronic messages that we always have.  They do not seem to be 
> > limited to the times when jobs fail.  There are messages about directories 
> > being nfs mounted and unmounted.  Finally, the mail log shows the message 
> > about the job failure being sent to the user.
> >
> > As for the messages file on the exec host: for the job mentioned in my 
> > original email, no, I do not see anything.  We do have another similar 
> > failure (more recent), where I do see some messages at the time the job 
> > failed.  However, they refer to different job numbers.  In this more 
> > recent case, the message sent to the end user was:
> > ==============================================
> > Job 332115 (qmake) Aborted
> > Exit Status      = 137
> > Signal           = KILL
> > User             = tangz01
> > Queue            = [email protected]
> > Host             = compute-2-13.local
> > Start Time       = 10/13/2011 10:02:04
> > End Time         = 10/13/2011 10:09:34
> > CPU              = 00:00:09
> > Max vmem         = 1.322G
> > failed assumedly after job because:
> > job 332115.1 died through signal KILL (9)
> > ==============================================
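
The exit status and the signal in this report describe the same event: by shell convention, a status above 128 means the process was terminated by signal (status - 128), so 137 - 128 = 9, which is KILL. A small sketch of that decoding follows; the function name decode_exit_status is made up for illustration.

```shell
#!/bin/sh
# Map a shell-style exit status to a signal name, if it encodes one.
# Statuses above 128 conventionally mean "terminated by signal
# (status - 128)"; anything else is an ordinary exit code.
decode_exit_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        signum=$((status - 128))
        # kill -l <number> prints the signal name without the SIG prefix.
        echo "signal $signum ($(kill -l "$signum"))"
    else
        echo "exit code $status"
    fi
}

decode_exit_status 137
```
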
> >
> > At that time on compute-2-13 itself, the file 
> > $SGE_ROOT/$SGE_CELL/spool/compute-2-13/messages has the following:
> >
> > 10/13/2011 10:09:37|  main|compute-2-13|W|reaping job "332836" ptf 
> > complains: Job does not exist
> > 10/13/2011 10:09:37|  main|compute-2-13|E|can't open file 
> > active_jobs/332836.1/error: No such file or directory
> > 10/13/2011 10:09:37|  main|compute-2-13|W|reaping job "332842" ptf 
> > complains: Job does not exist
> > 10/13/2011 10:09:37|  main|compute-2-13|E|can't open file 
> > active_jobs/332842.1/error: No such file or directory
> 
> I vaguely remember this issue coming up on the list before. IIRC we never 
> found a solution, but the problem vanished again at some point.
> 
> The jobs were killed randomly, without any apparent reason. I can't find the 
> thread right now, though.
> 
> -- Reuti
> 
> 
> > These messages are actually three seconds after the failure, and they refer 
> > to different job numbers.  But I list them because they are so close in 
> > time.
> >
> > Thanks,
> >       Eric
> >
> >
> > On Oct 7, 2011, at 10:40 AM, Reuti wrote:
> >
> >> Hi,
> >>
> >> On 04.10.2011 at 00:47, Peskin, Eric wrote:
> >>
> >>>> Is the hard run time limit (h_rt) getting reached sometimes but
> >>>> not others?
> >>>
> >>> No, we do not have any limits set:
> >>>
> >>> [root@fen1 ~]# qconf -sq $(qconf -sql) | grep '[hs]_' | sort -u
> >>> h_core                INFINITY
> >>> h_cpu                 INFINITY
> >>> h_data                INFINITY
> >>> h_fsize               INFINITY
> >>> h_rss                 INFINITY
> >>> h_rt                  INFINITY
> >>> h_stack               INFINITY
> >>> h_vmem                INFINITY
> >>> s_core                INFINITY
> >>> s_cpu                 INFINITY
> >>> s_data                INFINITY
> >>> s_fsize               INFINITY
> >>> s_rss                 INFINITY
> >>> s_rt                  INFINITY
> >>> s_stack               INFINITY
> >>> s_vmem                INFINITY
> >>> [root@fen1 ~]#
> >>
> >> Is there anything in /var/log/messages about the oom-killer? Or in the
> >> SGE messages file in the exechost's spool directory?
> >>
> >> -- Reuti
> >>
> >>
> >>>
> >>> On Oct 3, 2011, at 1:41 PM, Mike Hanby wrote:
> >>>
> >>>> Is the hard run time limit (h_rt) getting reached sometimes but
> >>>> not others?
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: [email protected] [mailto:users-
> >>>>> [email protected]] On Behalf Of Peskin, Eric
> >>>>> Sent: Monday, October 03, 2011 11:14 AM
> >>>>> To: [email protected]
> >>>>> Subject: [gridengine users] jobs getting killed (failed assumedly
> >>>>> after job because: job 311263.1 died through signal KILL (9))
> >>>>>
> >>>>> All,
> >>>>>
> >>>>> I have a user running qmake jobs.  Intermittently, the job fails and
> >>>>> SGE says it was killed with signal 9.  The user did not kill it.  We
> >>>>> (the sysadmins) did not kill it.  How can I figure out what is going
> >>>>> on?  The worst part is that this problem is intermittent.  Exactly
> >>>>> the same command works sometimes but fails sometimes.  I have
> >>>>> appended the message from SGE below.  Any suggestions would be
> >>>>> greatly appreciated.
> >>>>>
> >>>>> Thanks,
> >>>>>   Eric Peskin
> >>>>>
> >>>>> From: root [root@local]
> >>>>> Sent: Saturday, September 24, 2011 9:04 PM
> >>>>> To: Tang, Zuojian
> >>>>> Subject: Job 311263 (qmake) Aborted
> >>>>>
> >>>>> Job 311263 (qmake) Aborted
> >>>>> Exit Status      = 137
> >>>>> Signal           = KILL
> >>>>> User             = tangz01
> >>>>> Queue            = [email protected]
> >>>>> Host             = compute-0-13.local
> >>>>> Start Time       = 09/24/2011 19:03:31
> >>>>> End Time         = 09/24/2011 21:04:10
> >>>>> CPU              = 00:00:29
> >>>>> Max vmem         = 2.579G
> >>>>> failed assumedly after job because:
> >>>>> job 311263.1 died through signal KILL (9)
> >>>>>
> >>>>>
> >>>>> ------------------------------------------------------------
> >>>>> This email message, including any attachments, is for the sole use
> >>>>> of the intended recipient(s) and may contain information that is
> >>>>> proprietary, confidential, and exempt from disclosure under
> >>>>> applicable law. Any unauthorized review, use, disclosure, or
> >>>>> distribution is prohibited. If you have received this email in
> >>>>> error please notify the sender by return email and delete the
> >>>>> original message. Please note, the recipient should check this
> >>>>> email and any attachments for the presence of viruses. The
> >>>>> organization accepts no liability for any damage caused by any
> >>>>> virus transmitted by this email.
> >>>>> =================================
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> users mailing list
> >>>>> [email protected]
> >>>>> https://gridengine.org/mailman/listinfo/users
> >>>
> >>>
> >>
> >
> >
> >
> >
> 
> 
> 

