Don,

I just checked in both of your patches and they will be in version 2.3.0. Your patch to sacct definitely fixes the problem that you reported, but I'm really not comfortable with the current SLURM code managing the "requid" field because it is a uint32_t and treated as an integer in some places (like where it is initialized to a negative number). I don't want to change that right before releasing version 2.3, but I'll note that as something that should be addressed in version 2.4.

Thanks for the patches.
Moe


Quoting [email protected]:

I have a bug report of an anomaly in the job state display of sacct,  as
illustrated below.   Jobs 7,8, 11 and 12 are all simple jobs submitted via
"sbatch".   Job 7 was submitted and then cancelled by the submitting user.
  Job 7 was submitted by the user and then cancelled by root.  Job 11 was
pre-empted by job 12 from a higher priority partition,  but was requeued
and completed after job 12 terminated.   Following is the output of
"sacct", showing jobid, state and start/stop times, and including
duplicates ("-D") option:

[sulu] (dalbert) dalbert> sacct --format=jobid%8,state%-18,submit,end -D
   JobID              State              Submit                 End
-------- ------------------ ------------------- -------------------
       7 CANCELLED by 605   2011-09-09T10:24:41 2011-09-09T10:24:50
 7.batch FAILED             2011-09-09T10:24:41 2011-09-09T10:24:50
     7.0 CANCELLED by -1    2011-09-09T10:24:42 2011-09-09T10:24:50
       8 CANCELLED by 0     2011-09-09T10:24:56 2011-09-09T10:25:12
 8.batch CANCELLED by -1    2011-09-09T10:24:56 2011-09-09T10:25:12
     8.0 CANCELLED by -1    2011-09-09T10:24:56 2011-09-09T10:25:12
      11 CANCELLED by -1    2011-09-09T10:29:42 2011-09-09T10:29:51
    11.0 CANCELLED by -1    2011-09-09T10:29:43 2011-09-09T10:29:51
      11 COMPLETED          2011-09-09T10:29:51 2011-09-09T10:30:52
11.batch COMPLETED          2011-09-09T10:29:42 2011-09-09T10:30:52
    11.1 COMPLETED          2011-09-09T10:30:22 2011-09-09T10:30:52
      12 COMPLETED          2011-09-09T10:29:51 2011-09-09T10:30:22
12.batch COMPLETED          2011-09-09T10:29:51 2011-09-09T10:30:22
    12.0 COMPLETED          2011-09-09T10:29:51 2011-09-09T10:30:21

Job 7shows a state of "CANCELLED by 605",  which is the "uid" of the user
that submitted the job and the "scancel",  but the batch and step records
for the job display as "CANCELLED by  -1".    Job 8 shows a state of
"CANCELLED by 0",  which is the "uid" of the root user,  but again the
batch and step records for the job display as "CANCELLED by  -1".  For job
11,  both the job and step records for the pre-emption show as "CANCELLED
by  -1".

Looking into the code, I saw that the "by xxx" portion of the display
depends on the content of a field called "requid" in the job, step and
batch record structures.  This field is intialized to "-1" in various
places.  Other values in the records are initalized to "NO_VAL",  but
"requid" is explicitedly set to "-1" instead of "NO_VAL".

Unfortunately,  the test in module "src/sacct/print.c" for whether to
print the "by xxx" is checking a variable "tmp_int2", (which is set from
the value of "requid"),  for not equal to "NO_VAL", not "-1".   So in
cases where "requid" has not been set to a uid,  it is being printed
anyway, as the string "by -1".

The patch below simply changes the test from not equal to "NO_VAL" to not
equal to "-1".  With the patch,  the jobs show up as:

[sulu] (dalbert) dalbert> sacct --format=jobid%8,state%-18,submit,end -D
   JobID              State              Submit                 End
-------- ------------------ ------------------- -------------------
      13 CANCELLED by 605   2011-09-09T10:47:40 2011-09-09T10:47:48
13.batch FAILED             2011-09-09T10:47:40 2011-09-09T10:47:48
    13.0 CANCELLED          2011-09-09T10:47:41 2011-09-09T10:47:48
      14 CANCELLED by 0     2011-09-09T10:47:54 2011-09-09T10:48:09
14.batch FAILED             2011-09-09T10:47:54 2011-09-09T10:48:09
    14.0 CANCELLED          2011-09-09T10:47:54 2011-09-09T10:48:09
      15 CANCELLED          2011-09-09T10:48:30 2011-09-09T10:48:36
    15.0 CANCELLED          2011-09-09T10:48:30 2011-09-09T10:48:36
      15 COMPLETED          2011-09-09T10:48:36 2011-09-09T10:49:37
15.batch COMPLETED          2011-09-09T10:48:30 2011-09-09T10:49:37
    15.1 COMPLETED          2011-09-09T10:49:07 2011-09-09T10:49:37
      16 COMPLETED          2011-09-09T10:48:36 2011-09-09T10:49:07
16.batch COMPLETED          2011-09-09T10:48:36 2011-09-09T10:49:07
    16.0 COMPLETED          2011-09-09T10:48:36 2011-09-09T10:49:06

where job 13 was cancelled by the submitter (uid 605),  and job 14 was
cancelled by root (uid 0).   Job 15 was pre-empted by job 16 and then
requeued and completed later.

  -Don Albert-

A patch against SLURM 2.3.0-rc2 follows:

[stag] (dalbert) s230rc2> cvs diff -u slurm/src/sacct/print.c
Index: slurm/src/sacct/print.c
===================================================================
RCS file: /cvsroot/slurm/slurm/src/sacct/print.c,v
retrieving revision 1.1.1.35.2.1
diff -u -r1.1.1.35.2.1 print.c
--- slurm/src/sacct/print.c     29 Aug 2011 17:45:00 -0000 1.1.1.35.2.1
+++ slurm/src/sacct/print.c     9 Sep 2011 18:03:43 -0000
@@ -1128,7 +1128,7 @@
                        }

                        if (((tmp_int & JOB_STATE_BASE) == JOB_CANCELLED)
&&
-                           (tmp_int2 != NO_VAL))
+                           (tmp_int2 != -1))
                                snprintf(outbuf, FORMAT_STRING_SIZE,
                                         "%s by %d",
                                         job_state_string(tmp_int),









Reply via email to