I'm pretty sure this will fix the problem:

Index: src/plugins/jobcomp/script/jobcomp_script.c
===================================================================
--- src/plugins/jobcomp/script/jobcomp_script.c (revision 22817)
+++ src/plugins/jobcomp/script/jobcomp_script.c (working copy)
@@ -221,7 +221,11 @@
                j->jobstate = xstrdup (job_state_string (state));
                if (job->resize_time)
                        j->start = job->resize_time;
-               else
+               else if (job->start_time > job->end_time) {
+                       /* Job cancelled while pending and
+                        * expected start time is in the future. */
+                       j->start = 0;
+               } else
                        j->start = job->start_time;
                j->end = job->end_time;
        }
________________________________________
From: [email protected] [[email protected]] On Behalf 
Of [email protected] [[email protected]]
Sent: Monday, March 21, 2011 7:13 AM
To: [email protected]; Michael Di Domenico
Subject: Re: [slurm-dev] job completion script bug?

The job's start time is set by the backfill scheduler to be the
expected start time of the job. I'd guess someone is cancelling
a pending job after that time is set. Since the job never actually
started, SLURM should avoid passing the start time in the environment
variable to the job completion script. That's probably a trivial patch.

Quoting Michael Di Domenico <[email protected]>:

> We're using the job completion script to insert tracking records into
> a database for the jobs run on our machine. Every once in a while I
> see an incorrect start time listed for cancelled jobs.
>
> Here are some examples:
>
> jobid=11721
> jobstate=cancelled
> submit=1300487648
> start=1331416200
> end=1300487677
>
> jobid=11722
> jobstate=cancelled
> submit=1300550901
> start=1332074346
> end=1300550922
>
> jobid=11723
> jobstate=cancelled
> submit=1300582908
> start=1332112169
> end=1300582920
>
> I haven't been able to work out what sequence of events reproduces
> this behavior, though.  We have a longish slurmctld-prolog script, and
> I think someone is srun'ing the job and cancelling it before/during
> the prolog.  The jobs don't show an assigned node in the output, so
> perhaps our users are srun'ing and cancelling the job with rather
> speedy fingers, before slurm has a chance to do anything.
>
> Can anyone confirm this as a bug?
>




