Ian,

Sorry for the delayed reply. I have appended the SGE logs below. The very
first line actually belongs to a different user's job (117352); it matches only
because the same number happens to appear in another field. I am not sure what
the best way to grep for these is:

[root@fen1 ~]# grep --recursive :311263: $SGE_ROOT 
/opt/gridengine/default/common/accounting:regular.q:compute-0-5.local:liz06:liz06:oh15_4.U2_AUC.krr_poly_optimize_c:117352:sge:0:1309754468:1309936571:1309936730:0:0:159:101.494570:145.296911:0.000000:0:0:0:0:311263:0:0:0.000000:0:0:0:0:12983:215341259:NONE:defaultdepartment:NONE:1:10:246.791481:183.908423:0.015248:NONE:0.000000:NONE:914309120.000000:0:0
/opt/gridengine/default/common/accounting:regular.q:compute-0-13.local:tangz01:tangz01:qmake:311263:sge:0:1316905413:1316905411:1316912650:100:137:7239:17.696309:11.818203:0.000000:0:0:0:0:238144:23:0:0.000000:0:0:0:0:1268313:1568:NONE:defaultdepartment:NONE:1:0:29.514512:1.094530:0.359842:-U casavausers -q regular.q -l arch=lx26-amd64 -I y:0.000000:NONE:2768855040.000000:0:0
/opt/gridengine/default/common/reporting:1309936731:acct:regular.q:compute-0-5.local:liz06:liz06:oh15_4.U2_AUC.krr_poly_optimize_c:117352:sge:0:1309754468:1309936571:1309936730:0:0:159:101.494570:145.296911:0.000000:0:0:0:0:311263:0:0:0.000000:0:0:0:0:12983:215341259:NONE:defaultdepartment:NONE:1:10:246.791481:183.908423:0.015248:NONE:0.000000:NONE:914309120.000000:0:0
/opt/gridengine/default/common/reporting:1316905413:new_job:1316905413:311263:-1:NONE:qmake:tangz01:tangz01::defaultdepartment:sge:1024
/opt/gridengine/default/common/reporting:1316905413:job_log:1316905413:pending:311263:-1:NONE::tangz01:fen1.local:0:1024:1316905413:qmake:tangz01:tangz01::defaultdepartment:sge:new job
/opt/gridengine/default/common/reporting:1316905413:job_log:1316905413:sent:311263:0:NONE:t:master:fen1.local:0:1024:1316905413:qmake:tangz01:tangz01::defaultdepartment:sge:sent to execd
/opt/gridengine/default/common/reporting:1316905413:job_log:1316905413:delivered:311263:0:NONE:r:master:fen1.local:0:1024:1316905413:qmake:tangz01:tangz01::defaultdepartment:sge:job received by execd
/opt/gridengine/default/common/reporting:1316912652:acct:regular.q:compute-0-13.local:tangz01:tangz01:qmake:311263:sge:0:1316905413:1316905411:1316912650:100:137:7239:17.696309:11.818203:0.000000:0:0:0:0:238144:23:0:0.000000:0:0:0:0:1268313:1568:NONE:defaultdepartment:NONE:1:0:29.514512:1.094530:0.359842:-U casavausers -q regular.q -l arch=lx26-amd64 -I y:0.000000:NONE:2768855040.000000:0:0
/opt/gridengine/default/common/reporting:1316912652:job_log:1316912652:finished:311263:0:NONE:r:execution daemon:compute-0-13.local:0:1024:1316905413:qmake:tangz01:tangz01::defaultdepartment:sge:job exited
/opt/gridengine/default/common/reporting:1316912652:job_log:1316912652:finished:311263:0:NONE:r:master:fen1.local:0:1024:1316905413:qmake:tangz01:tangz01::defaultdepartment:sge:job waits for schedds deletion
/opt/gridengine/default/common/reporting:1316912658:job_log:1316912658:deleted:311263:0:NONE:T:scheduler:fen1.local:0:1024:1316905413:qmake:tangz01:tangz01::defaultdepartment:sge:job deleted by schedd
[root@fen1 ~]# 
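
One way to avoid the false match is to compare the job-number field itself instead of grepping for the string anywhere in the line. The following is only a minimal sketch. It assumes the accounting(5) layout, in which job_number is the sixth colon-separated field; that agrees with the records above, where 311263 directly follows the job name qmake. The function name acct_by_job is made up for illustration.

```shell
# Hypothetical helper: print only accounting records whose job_number
# (field 6, per the accounting(5) layout assumed here) equals $1.
# Records are read from the files named after $1, or from stdin if none.
acct_by_job() {
    job="$1"
    shift
    awk -F: -v job="$job" '$6 == job' "$@"
}

# Example (path as in the grep output above):
#   acct_by_job 311263 /opt/gridengine/default/common/accounting
```

Running `qacct -j 311263` should give a similar per-job view from the same accounting file, although it may list more than one record if job numbers have wrapped around.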


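For reference, the numbers in the qmake accounting record are consistent with the SGE abort message: by the usual shell convention, an exit status above 128 encodes 128 plus the signal number, so 137 corresponds to signal 9 (KILL), and end time minus start time gives the elapsed seconds. The sketch below assumes the accounting(5) field positions (6 job_number, 10 start_time, 11 end_time, 12 failed, 13 exit_status); acct_summary is a made-up name.

```shell
# Hypothetical helper: summarize accounting records from stdin or files.
# Field positions are assumed from the accounting(5) layout.
acct_summary() {
    awk -F: '
        /^#/ { next }                            # skip comment lines
        {
            # 128 + N convention: status above 128 means killed by signal N
            sig = ($13 > 128) ? $13 - 128 : 0    # 0 means no signal inferred
            printf "job=%s failed=%s exit=%s signal=%d elapsed=%ds\n",
                   $6, $12, $13, sig, $11 - $10
        }' "$@"
}

# Example with the start of the qmake record shown above:
printf '%s\n' 'regular.q:compute-0-13.local:tangz01:tangz01:qmake:311263:sge:0:1316905413:1316905411:1316912650:100:137:7239' | acct_summary
# prints: job=311263 failed=100 exit=137 signal=9 elapsed=7239s
```

The elapsed value of 7239 seconds matches the wallclock field in the record and the roughly two-hour gap between the start and end times in the abort message.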

On Oct 5, 2011, at 12:24 PM, Ian Kaufman wrote:

> Is there anything in the SGE logs? If it is a limit issue of some
> sort, the spool logs should indicate what limit was reached.
> 
> Ian
> 
> On Mon, Oct 3, 2011 at 3:47 PM, Peskin, Eric <[email protected]> wrote:
>>> Is the hard run time limit (h_rt) being reached sometimes but not others?
>> 
>> No, we do not have any limits set:
>> 
>> [root@fen1 ~]# qconf -sq `qconf -sql` | grep '[hs]_' | sort -u
>> h_core                INFINITY
>> h_cpu                 INFINITY
>> h_data                INFINITY
>> h_fsize               INFINITY
>> h_rss                 INFINITY
>> h_rt                  INFINITY
>> h_stack               INFINITY
>> h_vmem                INFINITY
>> s_core                INFINITY
>> s_cpu                 INFINITY
>> s_data                INFINITY
>> s_fsize               INFINITY
>> s_rss                 INFINITY
>> s_rt                  INFINITY
>> s_stack               INFINITY
>> s_vmem                INFINITY
>> [root@fen1 ~]#
>> 
>> 
>> On Oct 3, 2011, at 1:41 PM, Mike Hanby wrote:
>> 
>>> Is the hard run time limit (h_rt) being reached sometimes but not others?
>>> 
>>>> -----Original Message-----
>>>> From: [email protected] [mailto:users-
>>>> [email protected]] On Behalf Of Peskin, Eric
>>>> Sent: Monday, October 03, 2011 11:14 AM
>>>> To: [email protected]
>>>> Subject: [gridengine users] jobs getting killed (failed assumedly after
>>>> job because: job 311263.1 died through signal KILL (9))
>>>> 
>>>> All,
>>>> 
>>>> I have a user running qmake jobs.  Intermittently, a job fails and
>>>> SGE reports that it was killed with signal 9.  The user did not kill
>>>> it, and neither did we (the sysadmins).  How can I figure out what is
>>>> going on?  The worst part is that the problem is intermittent: exactly
>>>> the same command sometimes works and sometimes fails.  I have appended
>>>> the message from SGE below.  Any suggestions would be greatly appreciated.
>>>> 
>>>> Thanks,
>>>>      Eric Peskin
>>>> 
>>>> From: root [root@local]
>>>> Sent: Saturday, September 24, 2011 9:04 PM
>>>> To: Tang, Zuojian
>>>> Subject: Job 311263 (qmake) Aborted
>>>> 
>>>> Job 311263 (qmake) Aborted
>>>> Exit Status      = 137
>>>> Signal           = KILL
>>>> User             = tangz01
>>>> Queue            = [email protected]
>>>> Host             = compute-0-13.local
>>>> Start Time       = 09/24/2011 19:03:31
>>>> End Time         = 09/24/2011 21:04:10
>>>> CPU              = 00:00:29
>>>> Max vmem         = 2.579G
>>>> failed assumedly after job because:
>>>> job 311263.1 died through signal KILL (9)
>>>> 
>>>> 
>>>> ------------------------------------------------------------
>>>> This email message, including any attachments, is for the sole use of
>>>> the intended recipient(s) and may contain information that is
>>>> proprietary, confidential, and exempt from disclosure under applicable
>>>> law. Any unauthorized review, use, disclosure, or distribution is
>>>> prohibited. If you have received this email in error please notify the
>>>> sender by return email and delete the original message. Please note,
>>>> the recipient should check this email and any attachments for the
>>>> presence of viruses. The organization accepts no liability for any
>>>> damage caused by any virus transmitted by this email.
>>>> =================================
>>>> 
>>>> 
>>>> _______________________________________________
>>>> users mailing list
>>>> [email protected]
>>>> https://gridengine.org/mailman/listinfo/users
>> 
>> 
> 
> 
> 
> -- 
> Ian Kaufman
> Research Systems Administrator
> UC San Diego, Jacobs School of Engineering ikaufman AT ucsd DOT edu

