Hi,

On 13.10.2011 at 18:33, Laurent Duchesne wrote:

> On Thu, Oct 13, 2011 at 12:26 PM, Reuti <[email protected]> wrote:
>> On 13.10.2011 at 18:10, Laurent Duchesne wrote:
>>
>>> I'd like to have your input on a problem we are facing right now:
>>>
>>> We have a small script which parses the SGE (6.2u5) accounting file
>>> and writes information in a SQL database. We just found out about what
>>> seems to be a problem in the accounting file. From man 5 accounting:
>>>
>>>    ru_wallclock
>>>        Difference between end_time and start_time (see above).
>>>
>>> We use that particular field to gather statistics for our users. What
>>> we found out was that when the "failed" field is 37, the ru_wallclock
>>> field is always 0, even if the job did run. We don't know exactly
>>> under which circumstances this happens yet.
>>>
>>> Here's one such entry from the accounting file:
>>>
>>> med:r104-n7:nne-790-01:sboisver12:SRA024407-Ray-1.4.0-k31-group1:2903640:sge:0:1306781385:1307195150:1307470755:37:0:0:1023454.939168:617405.204111:0.000000:0:0:0:0:134261699:23127:0:0.000000:0:0:0:0:23568146:18934035:nne-790-ab:defaultdepartment:default:512:0:0.000000:0.000000:0.000000:-l h_rt=86400 -pe default 512:0.000000:NONE:0.000000:0:0
>>>
>>> And its qacct output:
>>>
>>> ==============================================================
>>> qname        med
>>> hostname     r104-n7
>>> group        nne-790-01
>>> owner        sboisver12
>>> project      nne-790-ab
>>> department   defaultdepartment
>>> jobname      SRA024407-Ray-1.4.0-k31-group1
>>> jobnumber    2903640
>>> taskid       undefined
>>> account      sge
>>> priority     0
>>> qsub_time    Mon May 30 14:49:45 2011
>>> start_time   Sat Jun  4 09:45:50 2011
>>> end_time     Tue Jun  7 14:19:15 2011
>>> granted_pe   default
>>> slots        512
>>
>> What is your definition of the PE? Normally you have one entry per `qrsh`
>> call, or are all 512 slots allocated on one and the same machine, unless you
>> specify in the PE to sum it up.
>>
>> -- Reuti
>
> Here's our pe definition:
>
> pe_name            default
> slots              9999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    /bin/true
> stop_proc_args     /bin/true
> allocation_rule    8
> control_slaves     TRUE
> job_is_first_task  FALSE
> urgency_slots      min
> accounting_summary TRUE
>
> We have only 1 entry per job/task because of the accounting_summary setting.

ok.

>>> failed       37  : qmaster enforced h_rt limit

In the case of a 37, I understand it this way: the exechost was in an unheard state. Is this true? The accounting record might then not be written from the report of the exechost, but from the last state the qmaster heard about. It should nevertheless be correct, but maybe it is wrong in exactly these cases because the computation inside the qmaster is wrong.

(man sge_conf / section "qmaster_params ENABLE_ENFORCE_MASTER_LIMIT")

-- Reuti

>>> exit_status  0
>>> ru_wallclock 0
>>> ru_utime     1023454.939
>>> ru_stime     617405.204
>>> ru_maxrss    0
>>> ru_ixrss     0
>>> ru_ismrss    0
>>> ru_idrss     0
>>> ru_isrss     0
>>> ru_minflt    134261699
>>> ru_majflt    23127
>>> ru_nswap     0
>>> ru_inblock   0
>>> ru_oublock   0
>>> ru_msgsnd    0
>>> ru_msgrcv    0
>>> ru_nsignals  0
>>> ru_nvcsw     23568146
>>> ru_nivcsw    18934035
>>> cpu          0.000
>>> mem          0.000
>>> io           0.000
>>> iow          0.000
>>> maxvmem      0.000
>>> arid         undefined
>>>
>>> Has anyone experienced this before? Is this a known "bug/feature"?
>>>
>>> Thanks,
>>>
>>> --
>>> Laurent Duchesne
>>> CLUMEQ, Université Laval
>>>
>>> _______________________________________________
>>> users mailing list
>>> [email protected]
>>> https://gridengine.org/mailman/listinfo/users
>
> Thanks,
>
> --
> Laurent Duchesne
> CLUMEQ, Université Laval
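
P.S. Until the cause is clear, the statistics script could fall back to the definition quoted from man 5 accounting (end_time minus start_time) whenever ru_wallclock is reported as 0. What follows is only a rough sketch, not a tested fix: the function name effective_wallclock is made up for illustration, and the field positions (start_time at index 9, end_time at index 10, ru_wallclock at index 13, counting from 0) are an assumption read off the sample record and qacct output in this thread, so please check them against your own accounting file first.

```python
def effective_wallclock(record: str) -> int:
    """Return ru_wallclock, or end_time - start_time if ru_wallclock is 0.

    Assumption: colon-separated accounting record with start_time,
    end_time and ru_wallclock at 0-based indices 9, 10 and 13, as in
    the sample entry from this thread.
    """
    fields = record.rstrip("\n").split(":")
    start_time = int(fields[9])
    end_time = int(fields[10])
    # ru_wallclock may be written as an integer or a float, depending on version
    ru_wallclock = int(float(fields[13]))

    # Only fall back when the job really has a start and an end timestamp.
    if ru_wallclock == 0 and start_time > 0 and end_time > start_time:
        return end_time - start_time
    return ru_wallclock


# First 16 fields of the record posted above (the rest is not needed here).
record = (
    "med:r104-n7:nne-790-01:sboisver12:SRA024407-Ray-1.4.0-k31-group1:"
    "2903640:sge:0:1306781385:1307195150:1307470755:37:0:0:"
    "1023454.939168:617405.204111"
)

print(effective_wallclock(record))  # 275605 seconds, about 3.2 days
```

For the job above this gives 275605 seconds, which matches the start_time and end_time shown by qacct (Sat Jun 4 09:45:50 to Tue Jun 7 14:19:15). Whether the timestamps themselves can be trusted for failed = 37 records is exactly the open question, so treat the result as an estimate.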
