I have a bug report of an anomaly in the job state display of sacct, as
illustrated below. Jobs 7,8, 11 and 12 are all simple jobs submitted via
"sbatch". Job 7 was submitted and then cancelled by the submitting user.
Job 7 was submitted by the user and then cancelled by root. Job 11 was
pre-empted by job 12 from a higher priority partition, but was requeued
and completed after job 12 terminated. Following is the output of
"sacct", showing jobid, state and start/stop times, and including
duplicates ("-D") option:
[sulu] (dalbert) dalbert> sacct --format=jobid%8,state%-18,submit,end -D
JobID State Submit End
-------- ------------------ ------------------- -------------------
7 CANCELLED by 605 2011-09-09T10:24:41 2011-09-09T10:24:50
7.batch FAILED 2011-09-09T10:24:41 2011-09-09T10:24:50
7.0 CANCELLED by -1 2011-09-09T10:24:42 2011-09-09T10:24:50
8 CANCELLED by 0 2011-09-09T10:24:56 2011-09-09T10:25:12
8.batch CANCELLED by -1 2011-09-09T10:24:56 2011-09-09T10:25:12
8.0 CANCELLED by -1 2011-09-09T10:24:56 2011-09-09T10:25:12
11 CANCELLED by -1 2011-09-09T10:29:42 2011-09-09T10:29:51
11.0 CANCELLED by -1 2011-09-09T10:29:43 2011-09-09T10:29:51
11 COMPLETED 2011-09-09T10:29:51 2011-09-09T10:30:52
11.batch COMPLETED 2011-09-09T10:29:42 2011-09-09T10:30:52
11.1 COMPLETED 2011-09-09T10:30:22 2011-09-09T10:30:52
12 COMPLETED 2011-09-09T10:29:51 2011-09-09T10:30:22
12.batch COMPLETED 2011-09-09T10:29:51 2011-09-09T10:30:22
12.0 COMPLETED 2011-09-09T10:29:51 2011-09-09T10:30:21
Job 7shows a state of "CANCELLED by 605", which is the "uid" of the user
that submitted the job and the "scancel", but the batch and step records
for the job display as "CANCELLED by -1". Job 8 shows a state of
"CANCELLED by 0", which is the "uid" of the root user, but again the
batch and step records for the job display as "CANCELLED by -1". For job
11, both the job and step records for the pre-emption show as "CANCELLED
by -1".
Looking into the code, I saw that the "by xxx" portion of the display
depends on the content of a field called "requid" in the job, step and
batch record structures. This field is intialized to "-1" in various
places. Other values in the records are initalized to "NO_VAL", but
"requid" is explicitedly set to "-1" instead of "NO_VAL".
Unfortunately, the test in module "src/sacct/print.c" for whether to
print the "by xxx" is checking a variable "tmp_int2", (which is set from
the value of "requid"), for not equal to "NO_VAL", not "-1". So in
cases where "requid" has not been set to a uid, it is being printed
anyway, as the string "by -1".
The patch below simply changes the test from not equal to "NO_VAL" to not
equal to "-1". With the patch, the jobs show up as:
[sulu] (dalbert) dalbert> sacct --format=jobid%8,state%-18,submit,end -D
JobID State Submit End
-------- ------------------ ------------------- -------------------
13 CANCELLED by 605 2011-09-09T10:47:40 2011-09-09T10:47:48
13.batch FAILED 2011-09-09T10:47:40 2011-09-09T10:47:48
13.0 CANCELLED 2011-09-09T10:47:41 2011-09-09T10:47:48
14 CANCELLED by 0 2011-09-09T10:47:54 2011-09-09T10:48:09
14.batch FAILED 2011-09-09T10:47:54 2011-09-09T10:48:09
14.0 CANCELLED 2011-09-09T10:47:54 2011-09-09T10:48:09
15 CANCELLED 2011-09-09T10:48:30 2011-09-09T10:48:36
15.0 CANCELLED 2011-09-09T10:48:30 2011-09-09T10:48:36
15 COMPLETED 2011-09-09T10:48:36 2011-09-09T10:49:37
15.batch COMPLETED 2011-09-09T10:48:30 2011-09-09T10:49:37
15.1 COMPLETED 2011-09-09T10:49:07 2011-09-09T10:49:37
16 COMPLETED 2011-09-09T10:48:36 2011-09-09T10:49:07
16.batch COMPLETED 2011-09-09T10:48:36 2011-09-09T10:49:07
16.0 COMPLETED 2011-09-09T10:48:36 2011-09-09T10:49:06
where job 13 was cancelled by the submitter (uid 605), and job 14 was
cancelled by root (uid 0). Job 15 was pre-empted by job 16 and then
requeued and completed later.
-Don Albert-
A patch against SLURM 2.3.0-rc2 follows:
[stag] (dalbert) s230rc2> cvs diff -u slurm/src/sacct/print.c
Index: slurm/src/sacct/print.c
===================================================================
RCS file: /cvsroot/slurm/slurm/src/sacct/print.c,v
retrieving revision 1.1.1.35.2.1
diff -u -r1.1.1.35.2.1 print.c
--- slurm/src/sacct/print.c 29 Aug 2011 17:45:00 -0000 1.1.1.35.2.1
+++ slurm/src/sacct/print.c 9 Sep 2011 18:03:43 -0000
@@ -1128,7 +1128,7 @@
}
if (((tmp_int & JOB_STATE_BASE) == JOB_CANCELLED)
&&
- (tmp_int2 != NO_VAL))
+ (tmp_int2 != -1))
snprintf(outbuf, FORMAT_STRING_SIZE,
"%s by %d",
job_state_string(tmp_int),