I have a bug report of an anomaly in the job state display of sacct, as
illustrated below. Jobs 7, 8, 11 and 12 are all simple jobs submitted via
"sbatch". Job 7 was submitted and then cancelled by the submitting user.
Job 8 was submitted by the user and then cancelled by root. Job 11 was
pre-empted by job 12 from a higher priority partition, but was requeued
and completed after job 12 terminated. Following is the output of
"sacct", showing jobid, state and submit/end times, and including the
duplicates ("-D") option:

[sulu] (dalbert) dalbert> sacct --format=jobid%8,state%-18,submit,end -D
   JobID              State              Submit                 End
-------- ------------------ ------------------- -------------------
       7 CANCELLED by 605   2011-09-09T10:24:41 2011-09-09T10:24:50
 7.batch FAILED             2011-09-09T10:24:41 2011-09-09T10:24:50
     7.0 CANCELLED by -1    2011-09-09T10:24:42 2011-09-09T10:24:50
       8 CANCELLED by 0     2011-09-09T10:24:56 2011-09-09T10:25:12
 8.batch CANCELLED by -1    2011-09-09T10:24:56 2011-09-09T10:25:12
     8.0 CANCELLED by -1    2011-09-09T10:24:56 2011-09-09T10:25:12
      11 CANCELLED by -1    2011-09-09T10:29:42 2011-09-09T10:29:51
    11.0 CANCELLED by -1    2011-09-09T10:29:43 2011-09-09T10:29:51
      11 COMPLETED          2011-09-09T10:29:51 2011-09-09T10:30:52
11.batch COMPLETED          2011-09-09T10:29:42 2011-09-09T10:30:52
    11.1 COMPLETED          2011-09-09T10:30:22 2011-09-09T10:30:52
      12 COMPLETED          2011-09-09T10:29:51 2011-09-09T10:30:22
12.batch COMPLETED          2011-09-09T10:29:51 2011-09-09T10:30:22
    12.0 COMPLETED          2011-09-09T10:29:51 2011-09-09T10:30:21

Job 7 shows a state of "CANCELLED by 605", which is the "uid" of the user
that submitted the job and ran the "scancel", but the step record for the
job displays as "CANCELLED by -1" (the batch record displays as "FAILED").
Job 8 shows a state of "CANCELLED by 0", which is the "uid" of the root
user, but both the batch and step records for the job display as
"CANCELLED by -1". For job 11, both the job and step records for the
pre-emption show as "CANCELLED by -1".

Looking into the code, I saw that the "by xxx" portion of the display
depends on the content of a field called "requid" in the job, step and
batch record structures. This field is initialized to "-1" in various
places. Other values in the records are initialized to "NO_VAL", but
"requid" is explicitly set to "-1" instead of "NO_VAL".

Unfortunately, the test in module "src/sacct/print.c" that decides whether
to print the "by xxx" compares a variable "tmp_int2" (which is set from
the value of "requid") against "NO_VAL", not against "-1". So in cases
where "requid" has not been set to a uid, it still differs from "NO_VAL"
and is printed anyway, as the string "by -1".

The patch below simply changes the test from not equal to "NO_VAL" to not 
equal to "-1".  With the patch,  the jobs show up as:

[sulu] (dalbert) dalbert> sacct --format=jobid%8,state%-18,submit,end -D
   JobID              State              Submit                 End
-------- ------------------ ------------------- -------------------
      13 CANCELLED by 605   2011-09-09T10:47:40 2011-09-09T10:47:48
13.batch FAILED             2011-09-09T10:47:40 2011-09-09T10:47:48
    13.0 CANCELLED          2011-09-09T10:47:41 2011-09-09T10:47:48
      14 CANCELLED by 0     2011-09-09T10:47:54 2011-09-09T10:48:09
14.batch FAILED             2011-09-09T10:47:54 2011-09-09T10:48:09
    14.0 CANCELLED          2011-09-09T10:47:54 2011-09-09T10:48:09
      15 CANCELLED          2011-09-09T10:48:30 2011-09-09T10:48:36
    15.0 CANCELLED          2011-09-09T10:48:30 2011-09-09T10:48:36
      15 COMPLETED          2011-09-09T10:48:36 2011-09-09T10:49:37
15.batch COMPLETED          2011-09-09T10:48:30 2011-09-09T10:49:37
    15.1 COMPLETED          2011-09-09T10:49:07 2011-09-09T10:49:37
      16 COMPLETED          2011-09-09T10:48:36 2011-09-09T10:49:07
16.batch COMPLETED          2011-09-09T10:48:36 2011-09-09T10:49:07
    16.0 COMPLETED          2011-09-09T10:48:36 2011-09-09T10:49:06

where job 13 was cancelled by the submitter (uid 605),  and job 14 was 
cancelled by root (uid 0).   Job 15 was pre-empted by job 16 and then 
requeued and completed later.

  -Don Albert-

A patch against SLURM 2.3.0-rc2 follows:

[stag] (dalbert) s230rc2> cvs diff -u slurm/src/sacct/print.c
Index: slurm/src/sacct/print.c
===================================================================
RCS file: /cvsroot/slurm/slurm/src/sacct/print.c,v
retrieving revision 1.1.1.35.2.1
diff -u -r1.1.1.35.2.1 print.c
--- slurm/src/sacct/print.c     29 Aug 2011 17:45:00 -0000 1.1.1.35.2.1
+++ slurm/src/sacct/print.c     9 Sep 2011 18:03:43 -0000
@@ -1128,7 +1128,7 @@
                        }

                        if (((tmp_int & JOB_STATE_BASE) == JOB_CANCELLED) &&
-                           (tmp_int2 != NO_VAL))
+                           (tmp_int2 != -1))
                                snprintf(outbuf, FORMAT_STRING_SIZE,
                                         "%s by %d",
                                         job_state_string(tmp_int),




