We would like to use the UserCPU, SystemCPU and TotalCPU values from
sacct to assess the efficiency of jobs.  When a job exits normally,
these values are reported for the batch script step and includes the
time spent by sub-processes.

However, if the job times out, these values only includ the CPU time
spent by the batch script process itself, not its sub-processes.  Se
below for an illustration.

Is this intended behaviour?  If so, is there any other way to gather CPU
times from jobs, even when they don't exit normally?


Illustration:

407 (1) $ cat shell-loop.sh
#!/bin/bash

echo Starting loop

## Loop that only uses shell builtins:
while true; do echo -n ;  done

echo This is the end...

408 (1) $ cat timeout-in-subprocess.sm
#!/bin/bash
#SBATCH --account=staff
#SBATCH --time=0:2:0 --mem-per-cpu=500
#SBATCH --output=out/timeout-in-subprocess-%j.out

## Execute shell-loop.sh in subprocess:
./shell-loop.sh

409 (1) $ cat timeout-in-shell.sm
#!/bin/bash
#SBATCH --account=staff
#SBATCH --time=0:2:0 --mem-per-cpu=500
#SBATCH --output=out/timeout-in-shell-%j.out

## Run shell-loop.sh in this shell:
source shell-loop.sh

410 (1) $ sbatch timeout-in-subprocess.sm
Submitted batch job 40
411 (1) $ sbatch timeout-in-shell.sm
Submitted batch job 41

[... after a couple of minutes ...]

412 (1) $ sacct -o jobid,state,elapsed,usercpu,systemcpu,totalcpu -j 40,41
       JobID      State    Elapsed    UserCPU  SystemCPU   TotalCPU 
------------ ---------- ---------- ---------- ---------- ---------- 
40              TIMEOUT   00:02:27  00:00.001  00:00.001  00:00.003 
40.batch      CANCELLED   00:02:27  00:00.001  00:00.001  00:00.003 
41              TIMEOUT   00:02:24  02:23.415   00:00:00  02:23.416 
41.batch      CANCELLED   00:02:24  02:23.415   00:00:00  02:23.416 

i.e., time spent in subprocesses is not reported.


Doing the same thing, but now with loops that terminate so the jobs
don't time out, we get:

416 (1) $ sbatch work-in-subprocess.sm
Submitted batch job 42
417 (1) $ sbatch work-in-shell.sm
Submitted batch job 43

[... after a couple of minutes ...]

418 (1) $ sacct -o jobid,state,elapsed,usercpu,systemcpu,totalcpu -j 42,43
       JobID      State    Elapsed    UserCPU  SystemCPU   TotalCPU 
------------ ---------- ---------- ---------- ---------- ---------- 
42            COMPLETED   00:01:07  01:03.980  00:02.207  01:06.187 
42.batch      COMPLETED   00:01:07  01:03.980  00:02.207  01:06.187 
43            COMPLETED   00:01:08  01:05.230  00:02.173  01:07.403 
43.batch      COMPLETED   00:01:08  01:05.230  00:02.173  01:07.403 

i.e., time spent in subprocesses is reported.


-- 
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo

Reply via email to