We would like to use the UserCPU, SystemCPU and TotalCPU values from
sacct to assess the efficiency of jobs. When a job exits normally,
these values are reported for the batch script step and includes the
time spent by sub-processes.
However, if the job times out, these values only includ the CPU time
spent by the batch script process itself, not its sub-processes. Se
below for an illustration.
Is this intended behaviour? If so, is there any other way to gather CPU
times from jobs, even when they don't exit normally?
Illustration:
407 (1) $ cat shell-loop.sh
#!/bin/bash
echo Starting loop
## Loop that only uses shell builtins:
while true; do echo -n ; done
echo This is the end...
408 (1) $ cat timeout-in-subprocess.sm
#!/bin/bash
#SBATCH --account=staff
#SBATCH --time=0:2:0 --mem-per-cpu=500
#SBATCH --output=out/timeout-in-subprocess-%j.out
## Execute shell-loop.sh in subprocess:
./shell-loop.sh
409 (1) $ cat timeout-in-shell.sm
#!/bin/bash
#SBATCH --account=staff
#SBATCH --time=0:2:0 --mem-per-cpu=500
#SBATCH --output=out/timeout-in-shell-%j.out
## Run shell-loop.sh in this shell:
source shell-loop.sh
410 (1) $ sbatch timeout-in-subprocess.sm
Submitted batch job 40
411 (1) $ sbatch timeout-in-shell.sm
Submitted batch job 41
[... after a couple of minutes ...]
412 (1) $ sacct -o jobid,state,elapsed,usercpu,systemcpu,totalcpu -j 40,41
JobID State Elapsed UserCPU SystemCPU TotalCPU
------------ ---------- ---------- ---------- ---------- ----------
40 TIMEOUT 00:02:27 00:00.001 00:00.001 00:00.003
40.batch CANCELLED 00:02:27 00:00.001 00:00.001 00:00.003
41 TIMEOUT 00:02:24 02:23.415 00:00:00 02:23.416
41.batch CANCELLED 00:02:24 02:23.415 00:00:00 02:23.416
i.e., time spent in subprocesses is not reported.
Doing the same thing, but now with loops that terminate so the jobs
don't time out, we get:
416 (1) $ sbatch work-in-subprocess.sm
Submitted batch job 42
417 (1) $ sbatch work-in-shell.sm
Submitted batch job 43
[... after a couple of minutes ...]
418 (1) $ sacct -o jobid,state,elapsed,usercpu,systemcpu,totalcpu -j 42,43
JobID State Elapsed UserCPU SystemCPU TotalCPU
------------ ---------- ---------- ---------- ---------- ----------
42 COMPLETED 00:01:07 01:03.980 00:02.207 01:06.187
42.batch COMPLETED 00:01:07 01:03.980 00:02.207 01:06.187
43 COMPLETED 00:01:08 01:05.230 00:02.173 01:07.403
43.batch COMPLETED 00:01:08 01:05.230 00:02.173 01:07.403
i.e., time spent in subprocesses is reported.
--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo