Hi,

On 13.10.2011 at 18:33, Laurent Duchesne wrote:

> On Thu, Oct 13, 2011 at 12:26 PM, Reuti <[email protected]> wrote:
>> On 13.10.2011 at 18:10, Laurent Duchesne wrote:
>> 
>>> I'd like to have your input on a problem we are facing right now:
>>> 
>>> We have a small script which parses the SGE (6.2u5) accounting file
>>> and writes information in a SQL database. We just found out about what
>>> seems to be a problem in the accounting file. From man 5 accounting:
>>> 
>>> ru_wallclock
>>>        Difference between end_time and start_time (see above).
>>> 
>>> We use that particular field to gather statistics for our users. What
>>> we found out was that when the "failed" field is 37, the ru_wallclock
>>> field is always 0, even if the job did run. We don't know exactly
>>> under which circumstances this happens yet.
>>> 
>>> Here's one such entry from the accounting file:
>>> 
>>> med:r104-n7:nne-790-01:sboisver12:SRA024407-Ray-1.4.0-k31-group1:2903640:sge:0:1306781385:1307195150:1307470755:37:0:0:1023454.939168:617405.204111:0.000000:0:0:0:0:134261699:23127:0:0.000000:0:0:0:0:23568146:18934035:nne-790-ab:defaultdepartment:default:512:0:0.000000:0.000000:0.000000:-l h_rt=86400 -pe default 512:0.000000:NONE:0.000000:0:0
>>> 
>>> And its qacct output:
>>> 
>>> ==============================================================
>>> qname        med
>>> hostname     r104-n7
>>> group        nne-790-01
>>> owner        sboisver12
>>> project      nne-790-ab
>>> department   defaultdepartment
>>> jobname      SRA024407-Ray-1.4.0-k31-group1
>>> jobnumber    2903640
>>> taskid       undefined
>>> account      sge
>>> priority     0
>>> qsub_time    Mon May 30 14:49:45 2011
>>> start_time   Sat Jun  4 09:45:50 2011
>>> end_time     Tue Jun  7 14:19:15 2011
>>> granted_pe   default
>>> slots        512
>> 
>> What is your definition of the PE? Normally you get one entry per `qrsh` 
>> call unless the PE is configured to sum them up. Or are all 512 slots 
>> allocated on one and the same machine?
>> 
>> -- Reuti
>> 
> 
> Here's our pe definition:
> 
> pe_name            default
> slots              9999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    8
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary TRUE
> 
> We have only 1 entry per job/task because of the accounting_summary setting.

ok.


>>> 
>>> failed       37  : qmaster enforced h_rt limit

As I understand failure code 37, the exechost was in an unheard state. Is that 
the case here? If so, the accounting record may have been written not from the 
exechost's report but from the last state the qmaster heard about.

The value should still be correct, but perhaps it is wrong in exactly these 
cases because the computation inside the qmaster is wrong.

(man sge_conf / section "qmaster_params ENABLE_ENFORCE_MASTER_LIMIT")
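
Until the cause is known, a parsing script could fall back on the man page definition quoted above (ru_wallclock = end_time - start_time) whenever the stored value is 0. The following is only a minimal sketch, not a fix for the accounting file itself. The field positions are read off the sample record in this thread, the function name is made up for illustration, and it assumes that no field contains a literal ":".

```python
# 0-based positions of the colon-separated fields, as seen in the sample
# record from this thread (start_time, end_time, ru_wallclock).
START_TIME, END_TIME, RU_WALLCLOCK = 9, 10, 13


def wallclock_seconds(record: str) -> int:
    """Return ru_wallclock, or end_time - start_time when ru_wallclock is 0.

    Hypothetical helper: assumes no field contains a ':' character.
    """
    fields = record.rstrip("\n").split(":")
    start = int(fields[START_TIME])
    end = int(fields[END_TIME])
    reported = int(float(fields[RU_WALLCLOCK]))
    # Recompute only when the stored value is 0 and both timestamps are usable.
    if reported == 0 and start > 0 and end > start:
        return end - start
    return reported


# The accounting record quoted earlier in this thread (failed = 37).
sample = (
    "med:r104-n7:nne-790-01:sboisver12:SRA024407-Ray-1.4.0-k31-group1:"
    "2903640:sge:0:1306781385:1307195150:1307470755:37:0:0:"
    "1023454.939168:617405.204111:0.000000:0:0:0:0:134261699:23127:0:"
    "0.000000:0:0:0:0:23568146:18934035:nne-790-ab:defaultdepartment:"
    "default:512:0:0.000000:0.000000:0.000000:"
    "-l h_rt=86400 -pe default 512:0.000000:NONE:0.000000:0:0"
)

print(wallclock_seconds(sample))  # 275605
```

For this record the fallback gives 275605 seconds, which matches the start_time and end_time shown by qacct (3 days, 4 hours, 33 minutes and 25 seconds).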

-- Reuti


>>> exit_status  0
>>> ru_wallclock 0
>>> ru_utime     1023454.939
>>> ru_stime     617405.204
>>> ru_maxrss    0
>>> ru_ixrss     0
>>> ru_ismrss    0
>>> ru_idrss     0
>>> ru_isrss     0
>>> ru_minflt    134261699
>>> ru_majflt    23127
>>> ru_nswap     0
>>> ru_inblock   0
>>> ru_oublock   0
>>> ru_msgsnd    0
>>> ru_msgrcv    0
>>> ru_nsignals  0
>>> ru_nvcsw     23568146
>>> ru_nivcsw    18934035
>>> cpu          0.000
>>> mem          0.000
>>> io           0.000
>>> iow          0.000
>>> maxvmem      0.000
>>> arid         undefined
>>> 
>>> Has anyone experienced this before? Is this a known "bug/feature"?
>>> 
>>> Thanks,
>>> 
>>> --
>>> Laurent Duchesne
>>> CLUMEQ, Université Laval
>>> 
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
>>> 
>> 
>> 
> 
> Thanks,
> 
> -- 
> Laurent Duchesne
> CLUMEQ, Université Laval
> 


