Scott Yockel <syoc...@g.harvard.edu> writes:

> Slurm-Dev,
>
> Is there anything in the works to add the capacity of TotalCPU to also
> track the child process user and system time?  I see that currently
> TotalCPU is defined: "provides a measure of the task’s parent process
> and does not include CPU time of child processes.”

In my experience, that description might not be accurate.  It seems also
child processes are included, as long as the job doesn't time out.  Here
is an email I wrote about it last year:


From: Bjørn-Helge Mevik <b.h.me...@usit.uio.no>
Subject: [slurm-dev] UserCPU etc. for subprocesses not registered when a job 
times out.
To: slurm-dev <slurm-dev@schedmd.com>
Date: Fri, 12 Sep 2014 06:25:16 -0700
Reply-To: slurm-dev <slurm-dev@schedmd.com>


We would like to use the UserCPU, SystemCPU and TotalCPU values from
sacct to assess the efficiency of jobs.  When a job exits normally,
these values are reported for the batch script step and includes the
time spent by sub-processes.

However, if the job times out, these values only includ the CPU time
spent by the batch script process itself, not its sub-processes.  Se
below for an illustration.

Is this intended behaviour?  If so, is there any other way to gather CPU
times from jobs, even when they don't exit normally?


Illustration:

407 (1) $ cat shell-loop.sh
#!/bin/bash

echo Starting loop

## Loop that only uses shell builtins:
while true; do echo -n ;  done

echo This is the end...

408 (1) $ cat timeout-in-subprocess.sm
#!/bin/bash
#SBATCH --account=staff
#SBATCH --time=0:2:0 --mem-per-cpu=500
#SBATCH --output=out/timeout-in-subprocess-%j.out

## Execute shell-loop.sh in subprocess:
./shell-loop.sh

409 (1) $ cat timeout-in-shell.sm
#!/bin/bash
#SBATCH --account=staff
#SBATCH --time=0:2:0 --mem-per-cpu=500
#SBATCH --output=out/timeout-in-shell-%j.out

## Run shell-loop.sh in this shell:
source shell-loop.sh

410 (1) $ sbatch timeout-in-subprocess.sm
Submitted batch job 40
411 (1) $ sbatch timeout-in-shell.sm
Submitted batch job 41

[... after a couple of minutes ...]

412 (1) $ sacct -o jobid,state,elapsed,usercpu,systemcpu,totalcpu -j 40,41
       JobID      State    Elapsed    UserCPU  SystemCPU   TotalCPU 
------------ ---------- ---------- ---------- ---------- ---------- 
40              TIMEOUT   00:02:27  00:00.001  00:00.001  00:00.003 
40.batch      CANCELLED   00:02:27  00:00.001  00:00.001  00:00.003 
41              TIMEOUT   00:02:24  02:23.415   00:00:00  02:23.416 
41.batch      CANCELLED   00:02:24  02:23.415   00:00:00  02:23.416 

i.e., time spent in subprocesses is not reported.


Doing the same thing, but now with loops that terminate so the jobs
don't time out, we get:

416 (1) $ sbatch work-in-subprocess.sm
Submitted batch job 42
417 (1) $ sbatch work-in-shell.sm
Submitted batch job 43

[... after a couple of minutes ...]

418 (1) $ sacct -o jobid,state,elapsed,usercpu,systemcpu,totalcpu -j 42,43
       JobID      State    Elapsed    UserCPU  SystemCPU   TotalCPU 
------------ ---------- ---------- ---------- ---------- ---------- 
42            COMPLETED   00:01:07  01:03.980  00:02.207  01:06.187 
42.batch      COMPLETED   00:01:07  01:03.980  00:02.207  01:06.187 
43            COMPLETED   00:01:08  01:05.230  00:02.173  01:07.403 
43.batch      COMPLETED   00:01:08  01:05.230  00:02.173  01:07.403 

i.e., time spent in subprocesses is reported.


-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo


-- 
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo

Reply via email to