On 23 Jan 2012, at 16:30, Gerard Henry wrote:
> On 01/20/12 07:55 PM, Reuti wrote:
>
>> Well, the CPU time spent right now by a process isn't reported per
>> process, at least I've never heard of it. The kernel schedules the processes
>> outside of SGE, and the nice values are only relative. So it could only report
>> over short time intervals, like: in the last 2 seconds I spent 50% on this
>> process and 10% on each of the following processes. But a second later it
>> might already change, as one process is waiting for I/O. I would expect that `top`
>> does it this way, as the values are more constant when you specify a
>> longer refresh interval.
>>
>> As in the outline for the memory above, it should be possible to compute the
>> difference of used-up CPU time over the last 1/5/15 minutes for each process
>> and in this way compute the amount of CPU spent on the jobs in a particular
>> queue.
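[Editor's note] The per-interval computation described above can be sketched in Python. This is a minimal, hypothetical sketch: it assumes two snapshots of cumulative CPU ticks per PID (as would be read from `/proc/<pid>/stat`, fields utime + stime) and a 100 Hz USER_HZ clock tick; the function name and sampling mechanism are not from the original mail.

```python
# Hypothetical sketch: per-process CPU share over an interval, from two
# snapshots of cumulative CPU ticks. On Linux the ticks would come from
# /proc/<pid>/stat (utime + stime); HZ is assumed to be the usual 100
# (check with `getconf CLK_TCK`).

HZ = 100  # assumed USER_HZ

def cpu_share(before, after, interval_s):
    """Return {pid: percent of one core used} between two tick snapshots."""
    share = {}
    for pid, ticks_after in after.items():
        ticks_before = before.get(pid, 0)
        seconds = (ticks_after - ticks_before) / HZ
        share[pid] = 100.0 * seconds / interval_s
    return share

# Example: over 2 s, PID 1600 accrued 100 ticks (1 s CPU) -> 50 %,
# PID 1234 accrued 20 ticks (0.2 s CPU) -> 10 %.
print(cpu_share({1600: 500, 1234: 40}, {1600: 600, 1234: 60}, 2.0))
```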
>>
>> ==
>>
>> Another idea: build a load sensor (i.e. one complex needs to be defined per
>> queue in question and attached to each exechost) and feed:
>>
>> top -b -n 2 -d 120 -p 1600,1234 | tail -n 3
>>
>> to it, where 1600 and 1234 are the PIDs of the jobs whose additional group was
>> found beforehand in $SGE_SPOOL_DIR/active_jobs. I suggest -n 2, as the first
>> output doesn't show anything meaningful regarding CPU time because the interval
>> is too short. This means reading one cycle ahead to avoid a delay in replying
>> to SGE. The values can then be sorted by queue and assigned to the appropriate
>> complexes for each queue on each host.
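[Editor's note] The parsing half of such a load sensor can be sketched as follows. This is an assumption-laden sketch, not the original load sensor: the SGE load-sensor protocol answers between "begin"/"end" markers with "host:complex:value" lines, and the queue-to-PID mapping would really come from the add_grp_id files under $SGE_SPOOL_DIR/active_jobs; the column layout of `top -b` (PID first, %CPU as 9th field) is the usual default but may differ per installation.

```python
# Hypothetical sketch: extract per-PID %CPU from `top -b` process lines so the
# values can be summed per queue and reported as a complex value. Assumes the
# default top column order: PID is the 1st field, %CPU the 9th.

def parse_top_cpu(lines):
    """Extract {pid: %CPU} from `top -b` process lines."""
    usage = {}
    for line in lines:
        fields = line.split()
        if fields and fields[0].isdigit():
            usage[int(fields[0])] = float(fields[8])
    return usage

sample = [
    " 1600 alice 20 0 10g 2g 1g R 50.0 3.1 12:34.56 solver",
    " 1234 bob   20 0  1g 100m 50m S 10.0 0.2  0:12.34 postproc",
]
per_pid = parse_top_cpu(sample)
# A load sensor would then sum the shares of all PIDs belonging to one queue
# and print it as e.g. "myhost:cpu_dev:<value>" between begin/end markers
# (complex name "cpu_dev" is an assumption).
print(per_pid[1600] + per_pid[1234])
```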
>
>
> thanks for your reply. I'm not sure I understand, probably because my
> question was so unclear.
> Here is a solution I tried. Given that all the jobs have ended, and I want
> to report the CPU and memory consumed by jobs during the last year, I dug
> into /local/export/sge/default/common/accounting and extracted, with the
> attached Python script, the following results:
> user iusti (51) cpu: 634.871534559 days mem: 0.805913076979 Go
> user irphe (135) cpu: 414.912775315 days mem: 1.4525422628 Go
> user l3m (252) cpu: 567.227461918 days mem: 1.70951371079 Go
> user lma (45) cpu: 1139.86254098 days mem: 1.71787710938 Go
> user latp (106) cpu: 127.5595829 days mem: 1.41270344795 Go
>
> I'm just summing the cpu and mem values for each user on a queue defined on one
> host.
> The only problem is that the mem record is wrong due to the "4go" bug in SGE
> 6.2u5.
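[Editor's note] The summation Gerard describes (his actual script is attached to the original mail and not shown here) can be sketched like this. Field positions follow the accounting(5) layout, where the owner is column 4 and cpu (in seconds) is column 37; the function name and the synthetic sample line are assumptions for illustration.

```python
# Hypothetical sketch: sum the cpu field (column 37, seconds) per owner
# (column 4) from the colon-separated SGE accounting file and report it in
# days. Lines starting with '#' are comments in the accounting file.

from collections import defaultdict

def cpu_days_per_user(lines):
    totals = defaultdict(float)
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) > 37:
            totals[fields[3]] += float(fields[36])
    return {user: secs / 86400.0 for user, secs in totals.items()}

# Synthetic accounting line: only owner (col 4) and cpu (col 37) are filled in.
fields = [""] * 45
fields[3] = "user1"
fields[36] = "86400.0"  # one full day of CPU time
print(cpu_days_per_user([":".join(fields)]))
```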
Yes, after the job has finished you get the integrated value over a certain time
frame. But it's harder to make a statement about the actual CPU consumption of
each task right now, while it's running. Imagine you have only a single-core
CPU: over a tiny timeframe each process gets 100% and is the only one running,
but this is not the output you are looking for. So it's necessary to define a
timeframe: the share of CPU in the last 5 minutes or so for each queue. That's
why I used -d 120, i.e. two minutes, in the `top` command above (it seems the
very first output always comes after 3 seconds and can't be avoided [unless you
rewrite `top`]).
> Another point that's unclear to me: can I detect whether a job is an OpenMP or
> a serial job by looking at the accounting file?
An OpenMP job should have a higher cpu time than wallclock time.
> for instance, I can run the following command:
> cat /local/export/sge/default/common/accounting | awk -F':' 'BEGIN
> {printf("owner group job_number granted_pe cpu mem category\n")} /<hostname>/
> && /<queuename>/ {printf("%s %s %s %s %s %s %s\n", $4, $3, $6, $34, $37, $38,
> $40)}' | less
> owner group job_number granted_pe cpu mem category
> user1 iusti 13496 impi 2572922.350000 5628845.960567 -q dev -pe impi 10
> user1 iusti 13925 impi 15400293.770000 15728826.955261 -q dev -pe impi 25
> user2 iusti 13926 NONE 602366.910000 768713.172452 -q dev
> user2 iusti 14088 NONE 606897.320000 779858.199665 -q dev
> user1 iusti 14169 NONE 0.382940 0.003174 -U arusers -ar 2002
>
> does "NONE" in granted_pe indicate an OpenMP job?
Not per se; in fact that would be an abuse.
In my opinion an OpenMP job is also a parallel job and should request a PE
(perhaps named "smp" with "allocation_rule $pe_slots"). Otherwise a node might
get oversubscribed.
But you can compare the wallclock time and CPU time consumed for the job.
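[Editor's note] Reuti's heuristic, comparing wallclock and CPU time from the accounting file, can be sketched as below. The field positions follow accounting(5) (ru_wallclock is column 14, cpu column 37); the function name and the 1.5 ratio threshold are arbitrary assumptions, not an SGE feature.

```python
# Hypothetical heuristic: if a job's accumulated CPU time is clearly larger
# than its wallclock time, it likely ran multithreaded (e.g. OpenMP).
# ru_wallclock is column 14 and cpu column 37 of an accounting line; the
# threshold of 1.5 is an assumed cut-off.

def looks_multithreaded(accounting_line, threshold=1.5):
    fields = accounting_line.split(":")
    wallclock = float(fields[13])
    cpu = float(fields[36])
    return wallclock > 0 and cpu / wallclock > threshold

# Synthetic line: 1 h wallclock but 4 h CPU -> roughly 4 threads kept busy.
fields = [""] * 45
fields[13] = "3600"
fields[36] = "14400"
print(looks_multithreaded(":".join(fields)))
```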
-- Reuti
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users