Hi,

When all of the following conditions are met:

- a script is submitted with sbatch
- the allocation is made on nodes in power save mode
- the backfill scheduler is in use
- no PrologSlurmctld program is configured

then the routine 'launch_job' (job_scheduler.c) is never called, so the job
ends up being completed by '_purge_missing_jobs' (job_mgr.c) with the
following log messages:

[2017-02-08T16:00:36.272] Batch JobId=214 missing from node 0 (not found BatchStartTime after startup)
[2017-02-08T16:00:36.272] job_complete: JobID=214 State=0x1 NodeCnt=1 WTERMSIG 126
[2017-02-08T16:00:36.272] job_complete: JobID=214 State=0x1 NodeCnt=1 cancelled by node failure

Before being cancelled, the job state appears in squeue as:
- 'Configuring' during the boot process of nodes being resumed from power save
- 'Running' once the nodes are up (but the script is never started)

I have done some work to track down the bug:

The routine 'launch_job' is called by several functions in slurmctld:

(1) _start_job          (backfill.c)      if job's CONFIGURING flag is false
(2) _schedule           (job_scheduler.c) if job's CONFIGURING flag is false
(3) prolog_running_decr (job_scheduler.c) in case a PrologSlurmctld program is run
(4) job_time_limit      (job_mgr.c)       if the nodes are coming from REBOOT

It seems that function (1) or (2) may be called at job submission, but at that
point the job's CONFIGURING flag is true because the job is started on
allocated nodes that are still in power save mode => launch_job is not called.
Later, functions (1) and (2) are called periodically, but as they only deal
with PENDING jobs, our RUNNING job is skipped => launch_job is still not called.

Function (3) is only called when a PrologSlurmctld program is defined: I don't
have one => launch_job cannot be called. Note that when a PrologSlurmctld
program is defined, the problem does not occur.

Finally, the issue can be fixed in the 'job_time_limit' function (4), which is
called periodically for RUNNING jobs. I am not sure whether this breaks the
logic for the NODE_REBOOT case, but it is working fine:

diff --git a/src/slurmctld/job_mgr.c b/src/slurmctld/job_mgr.c
index 1d961ab..d6463cc 100644
--- a/src/slurmctld/job_mgr.c
+++ b/src/slurmctld/job_mgr.c
@@ -7583,9 +7583,10 @@ void job_time_limit(void)
                        if (job_ptr->bit_flags & NODE_REBOOT) {
                                job_ptr->bit_flags &= (~NODE_REBOOT);
                                job_validate_mem(job_ptr);
-                               if (job_ptr->batch_flag)
-                                       launch_job(job_ptr);
-                       }
+                        }
+                       if (job_ptr->batch_flag){
+                               launch_job(job_ptr);
+                        }
                }
 #endif
                /* This needs to be near the top of the loop, checks every
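
As a rough illustration only, the four call paths and the effect of the patch
can be modelled in a few lines of C. This is a simplified sketch, not slurmctld
code: struct job_model, its fields and the helper functions are hypothetical
names, and the real conditions in job_time_limit involve more state than shown
here.

```c
#include <stdbool.h>

/* Hypothetical, simplified model of the job state discussed in this mail.
 * These are NOT the real slurmctld structures. */
struct job_model {
    bool batch_flag;   /* job was submitted with sbatch */
    bool configuring;  /* CONFIGURING was set when the job was started */
    bool node_reboot;  /* NODE_REBOOT bit is set in bit_flags */
    bool has_prolog;   /* a PrologSlurmctld program is configured */
};

/* Paths (1) and (2): _start_job / _schedule launch the job only if it is
 * not CONFIGURING when it is started. Later passes skip RUNNING jobs. */
static bool launched_by_scheduler(const struct job_model *job)
{
    return !job->configuring;
}

/* Path (3): prolog_running_decr is only reached if a prolog is run. */
static bool launched_by_prolog(const struct job_model *job)
{
    return job->has_prolog;
}

/* Path (4): job_time_limit. The original code launches a batch job only
 * in the NODE_REBOOT case; the patched code launches any batch job. */
static bool launched_by_time_limit(const struct job_model *job, bool patched)
{
    if (!job->batch_flag)
        return false;
    return patched || job->node_reboot;
}

/* True if at least one path reaches launch_job in this model. */
static bool job_is_launched(const struct job_model *job, bool patched)
{
    return launched_by_scheduler(job) ||
           launched_by_prolog(job) ||
           launched_by_time_limit(job, patched);
}
```

With batch_flag and configuring set and the other two fields clear (the case
described in this mail), job_is_launched() returns false for the original
logic and true for the patched one.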

What do you think?

Best regards,

Didier
