On 17.10.2011 at 19:26, Peskin, Eric wrote:

> No, there is nothing in /var/log/messages about the oom-killer (or anything 
> getting killed).  There is stuff about DHCPDISCOVER not finding leases.  That 
> is strange, because we are using static IP addresses, so I am not sure why 
> anything is looking for DHCP.  There are also complaints that the various 
> compute nodes cannot find a suitable server for ntpd.  Both of these are 
> chronic messages that we always have.  They do not seem to be limited to the 
> times when jobs fail.  There are messages about directories being nfs mounted 
> and unmounted.  Finally, the mail log shows the message about the job failure 
> being sent to the user.
> 
> As for the messages file on the exec host: for the job mentioned in my 
> original email, no, I do not see anything.  We do have another similar failure 
> (more recent), where I do see some messages at the time the job failed.  
> However, they refer to different job numbers.  In this more recent case, the 
> message sent to the end user was:
> ==============================================
> Job 332115 (qmake) Aborted
> Exit Status      = 137
> Signal           = KILL
> User             = tangz01
> Queue            = [email protected]
> Host             = compute-2-13.local
> Start Time       = 10/13/2011 10:02:04
> End Time         = 10/13/2011 10:09:34
> CPU              = 00:00:09
> Max vmem         = 1.322G
> failed assumedly after job because:
> job 332115.1 died through signal KILL (9)
> ==============================================
> 
> At that time on compute-2-13 itself, the file 
> $SGE_ROOT/$SGE_CELL/spool/compute-2-13/messages has the following:
> 
> 10/13/2011 10:09:37|  main|compute-2-13|W|reaping job "332836" ptf complains: Job does not exist
> 10/13/2011 10:09:37|  main|compute-2-13|E|can't open file active_jobs/332836.1/error: No such file or directory
> 10/13/2011 10:09:37|  main|compute-2-13|W|reaping job "332842" ptf complains: Job does not exist
> 10/13/2011 10:09:37|  main|compute-2-13|E|can't open file active_jobs/332842.1/error: No such file or directory
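
Since the entries quoted here name jobs other than the one that failed, it may help to filter the execd messages file by job number. A minimal sketch, assuming the pipe-separated layout of the quoted log lines; the function name job_messages is hypothetical, and the path in the usage comment is the one given in the quoted text:

```shell
# Hypothetical helper: print the lines of an execd messages file that
# mention a given job number.
# Assumption: the job number appears either in double quotes ("332836")
# or as a path component such as active_jobs/332836.1/, as in the
# quoted log excerpt.
job_messages() {
    file=$1
    job=$2
    grep -E "(\"$job\"|/$job\.[0-9]+/)" "$file"
}

# Usage (adjust the host name to the exec host in question):
# job_messages "$SGE_ROOT/$SGE_CELL/spool/compute-2-13/messages" 332115
```

An empty result for the failed job's number would match what is described in the quoted message: nothing was logged for that job on the exec host.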

I vaguely remember this issue from the list. IIRC we never found a solution, 
but the problem vanished again at some point.

The jobs were killed at random, with no apparent reason. I can't find the 
thread right now, though.
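
For reference, the Exit Status of 137 in the report is consistent with the KILL signal: by shell convention, a status above 128 encodes 128 plus the signal number, and 137 - 128 = 9. A minimal sketch of that arithmetic; the function name decode_exit_status is made up for illustration:

```shell
# Hypothetical helper: interpret a shell-style exit status.
# Values above 128 are read as "terminated by signal (status - 128)".
decode_exit_status() {
    status=$1
    if [ "$status" -gt 128 ]; then
        sig=$((status - 128))
        # kill -l <number> prints the signal name without the SIG prefix
        echo "killed by signal $sig ($(kill -l "$sig"))"
    else
        echo "exited with code $status"
    fi
}

decode_exit_status 137
```

This only confirms that the job received SIGKILL; it does not say which process sent it.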

-- Reuti


> These messages are actually three seconds after the failure, and they refer 
> to different job numbers.  But I list them because they are so close in time.
> 
> Thanks,
>       Eric
> 
> 
> On Oct 7, 2011, at 10:40 AM, Reuti wrote:
> 
>> Hi,
>> 
>> On 04.10.2011 at 00:47, Peskin, Eric wrote:
>> 
>>>> Is the hard run time limit (h_rt) getting reached sometimes but
>>>> not others?
>>> 
>>> No, we do not have any limits set:
>>> 
>>> [root@fen1 ~]# qconf -sq `qconf -sql` | grep [hs]_|sort -u
>>> h_core                INFINITY
>>> h_cpu                 INFINITY
>>> h_data                INFINITY
>>> h_fsize               INFINITY
>>> h_rss                 INFINITY
>>> h_rt                  INFINITY
>>> h_stack               INFINITY
>>> h_vmem                INFINITY
>>> s_core                INFINITY
>>> s_cpu                 INFINITY
>>> s_data                INFINITY
>>> s_fsize               INFINITY
>>> s_rss                 INFINITY
>>> s_rt                  INFINITY
>>> s_stack               INFINITY
>>> s_vmem                INFINITY
>>> [root@fen1 ~]#
>> 
>> Is there anything in /var/log/messages about the oom-killer? Or in the
>> SGE messages files in the exechost's spool directory?
>> 
>> -- Reuti
>> 
>> 
>>> 
>>> On Oct 3, 2011, at 1:41 PM, Mike Hanby wrote:
>>> 
>>>> Is the hard run time limit (h_rt) getting reached sometimes but
>>>> not others?
>>>> 
>>>>> -----Original Message-----
>>>>> From: [email protected] [mailto:users-
>>>>> [email protected]] On Behalf Of Peskin, Eric
>>>>> Sent: Monday, October 03, 2011 11:14 AM
>>>>> To: [email protected]
>>>>> Subject: [gridengine users] jobs getting killed (failed assumedly after
>>>>> job because: job 311263.1 died through signal KILL (9))
>>>>> 
>>>>> All,
>>>>> 
>>>>> I have a user running qmake jobs.  Intermittently, the job fails and
>>>>> SGE says it was killed with signal 9.  The user did not kill it.  We
>>>>> (the sysadmins) did not kill it.  How can I figure out what is going
>>>>> on?  The worst part is that this problem is intermittent.  Exactly the
>>>>> same command works sometimes but fails sometimes.  I have appended the
>>>>> message from SGE below.  Any suggestions would be greatly appreciated.
>>>>> 
>>>>> Thanks,
>>>>>   Eric Peskin
>>>>> 
>>>>> From: root [root@local]
>>>>> Sent: Saturday, September 24, 2011 9:04 PM
>>>>> To: Tang, Zuojian
>>>>> Subject: Job 311263 (qmake) Aborted
>>>>> 
>>>>> Job 311263 (qmake) Aborted
>>>>> Exit Status      = 137
>>>>> Signal           = KILL
>>>>> User             = tangz01
>>>>> Queue            = [email protected]
>>>>> Host             = compute-0-13.local
>>>>> Start Time       = 09/24/2011 19:03:31
>>>>> End Time         = 09/24/2011 21:04:10
>>>>> CPU              = 00:00:29
>>>>> Max vmem         = 2.579G
>>>>> failed assumedly after job because:
>>>>> job 311263.1 died through signal KILL (9)
>>>>> 
>>>>> 
>>>>> ------------------------------------------------------------
>>>>> This email message, including any attachments, is for the sole use of
>>>>> the intended recipient(s) and may contain information that is
>>>>> proprietary, confidential, and exempt from disclosure under applicable
>>>>> law. Any unauthorized review, use, disclosure, or distribution is
>>>>> prohibited. If you have received this email in error please notify the
>>>>> sender by return email and delete the original message. Please note,
>>>>> the recipient should check this email and any attachments for the
>>>>> presence of viruses. The organization accepts no liability for any
>>>>> damage caused by any virus transmitted by this email.
>>>>> =================================
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> users mailing list
>>>>> [email protected]
>>>>> https://gridengine.org/mailman/listinfo/users
>>> 
>>> 
>> 
> 
> 
> 
> 

