Config details:
- Slurm v17.11.8
- QOS-based preemption
- Backfill scheduler (default parameters)
- QOS:
  - "normal" = PreemptMode=CANCEL, GraceTime=5 minutes
  - Per-stakeholder = Preempt=normal, GrpTRES=<limits>
- Partitions:
  - "standard" (default) = QOS=normal
  - Per-stakeholder = QOS=<stakeholder-qos>
When users need priority access to purchased hardware, they submit to a
stakeholder partition; jobs in stakeholder partitions can preempt jobs in the
opportunistic "standard" partition.
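For reference, the setup above maps onto something like the following sketch (the "stakeholder1" QOS/partition name and the GrpTRES value are placeholders, not our actual limits):

```shell
# slurm.conf (excerpt):
#   PreemptType=preempt/qos
#   PreemptMode=CANCEL
#   SchedulerType=sched/backfill
#   PartitionName=standard Default=YES QOS=normal ...
#   PartitionName=stakeholder1 QOS=stakeholder1 ...

# QOS definitions; GraceTime is in seconds:
sacctmgr add qos normal set PreemptMode=cancel GraceTime=300
sacctmgr add qos stakeholder1 set Preempt=normal GrpTRES=cpu=72
```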
Problem 1: Preemption not logged
=================================
We've had a number of users report jobs that fail before they reach their
TimeLimit. There is no mention of preemption in the slurmd/slurmctld logs, but
the EndTime has been altered from its original value and the job step(s) end up
with FAILED status:
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
237755 ewb0.55_b+ standard XXXXXXXXXX 72 FAILED 77:0
237755.batch batch XXXXXXXXXX 36 FAILED 77:0
237755.exte+ extern XXXXXXXXXX 72 COMPLETED 0:0
237755.0 orted XXXXXXXXXX 1 FAILED 1:0
The slurm-237755.out file shows that the MPI runtime lost contact with the
remote orted daemon (step 237755.0) and failed. I think I've traced this
through the Slurm source far enough to determine that, with a non-zero
GraceTime, the EndTime is altered and the SIGCONT-SIGTERM pair is sent directly
by slurmctld with no SIG_PREEMPTED (see slurm_job_check_grace() and
_preempt_signal() in slurmctld/preempt.c); outside of debug levels, slurmctld
logs nothing about the preemption at this point. The signal is not caught by
orted, so the 237755.0 step dies and slurmd logs:
[237755.0] error: *** STEP 237755.0 ON r00n23 CANCELLED AT 2019-01-15T11:20:42
***
because the signal in question was SIGTERM (see slurmd/slurmstepd/req.c). This
causes the MPI runtime to exit, and the batch script returns in error. Since
GraceTime hasn't expired, slurmctld logs a failure instead of a preemption. The
only indicators of preemption are the altered EndTime (relative to its starting
value) and a job from a per-stakeholder partition (which can preempt
"standard") starting on the nodes _immediately_ following the death of 237755:
[2019-01-15T05:16:12.894] _slurm_rpc_submit_batch_job: JobId=237755
InitPrio=2275 usec=5960
[2019-01-15T09:11:19.367] sched: Allocate JobID=237755 NodeList=r00n[21,23]
#CPUs=72 Partition=standard
[2019-01-15T11:20:41.430] _slurm_rpc_submit_batch_job: JobId=238396
InitPrio=22402 usec=6505
[2019-01-15T11:20:43.594] _job_complete: JobID=237755 State=0x1 NodeCnt=2
WEXITSTATUS 205
[2019-01-15T11:20:43.594] email msg to XXXXXXXXXXXXXX: SLURM Job_id=237755
Name=ewb0.55_b96sc00_a Failed, Run time 02:09:24, FAILED, ExitCode 205
[2019-01-15T11:20:43.595] _job_complete: JobID=237755 State=0x8005 NodeCnt=2
done
[2019-01-15T11:20:44.606] email msg to XXXXXXXXXXXXXX: SLURM Job_id=238396
Name=Pt111_sol Began, Queued time 00:00:03
[2019-01-15T11:20:44.606] sched: Allocate JobID=238396 NodeList=r00n[21,23]
#CPUs=72 Partition=ccei_biomass
[2019-01-15T14:13:55.232] _job_complete: JobID=238396 State=0x1 NodeCnt=2
WEXITSTATUS 0
[2019-01-15T14:13:55.237] _job_complete: JobID=238396 State=0x8003 NodeCnt=2
done
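Since the altered EndTime is the only breadcrumb left behind, one after-the-fact check is sacct, comparing End against Start plus the submitted Timelimit (237755 is just the job above; field names are from sacct's --helpformat output):

```shell
# A job that "FAILED" well short of Start + Timelimit, immediately before a
# stakeholder-partition job started on the same NodeList, was most likely
# preempted rather than having failed on its own.
sacct -j 237755 --format=JobID,State,ExitCode,Submit,Start,End,Timelimit,NodeList
```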
In contrast, jobs that ignore or catch the SIGCONT-SIGTERM pair and keep
running through the 5-minute GraceTime are then killed, and their overall state
is logged as PREEMPTED:
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
412845 TTC100um1 standard XXXXXXXXXX 36 PREEMPTED 0:0
412845.batch batch XXXXXXXXXX 36 CANCELLED 0:15
412845.exte+ extern XXXXXXXXXX 36 COMPLETED 0:0
So any job that catches SIGCONT-SIGTERM, gracefully ends its work, and exits
before GraceTime expires would be logged as COMPLETED? Is this behavior in line
with anyone else's experience?
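For completeness, here's a minimal sketch of the kind of batch script that lands in the second case (the payload is a stand-in `sleep`; a real script would background its MPI launch the same way). Note that if it exits 0 inside GraceTime, I'd expect it to hit the COMPLETED case described above:

```shell
#!/bin/bash
# Sketch: catch the SIGTERM that slurmctld delivers at the start of
# GraceTime, forward it to the payload, and exit cleanly before the
# grace period expires.
term_handler() {
    echo "SIGTERM caught; cleaning up before GraceTime expires" >&2
    kill -TERM "$child" 2>/dev/null
    wait "$child" 2>/dev/null
    exit 0                  # exiting 0 within GraceTime -> COMPLETED(?)
}
trap term_handler TERM
sleep 300 &                 # stand-in for the real workload
child=$!
wait "$child"               # interrupted when the trapped signal arrives
```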
Problem 2: Backfill of preempted node(s)
=========================================
This one was especially confounding:
[2019-03-07T23:08:43.769] _slurm_rpc_submit_batch_job: JobId=410289
InitPrio=2997 usec=5694
[2019-03-07T23:08:46.215] udhpc: setting job account to safrono (1040)
[2019-03-07T23:08:46.215] udhpc: time_min is empty, setting to time_limit
:
[2019-03-08T00:28:04.364] backfill: Started JobID=410289 in standard on r00n29
[2019-03-08T00:32:43.490] job_time_limit: Preemption GraceTime reached
JobId=409950
[2019-03-08T00:32:43.501] job_time_limit: Preemption GraceTime reached
JobId=409951
[2019-03-08T00:32:43.503] job_time_limit: Preemption GraceTime reached
JobId=409952
[2019-03-08T00:32:43.505] job_time_limit: Preemption GraceTime reached
JobId=409953
[2019-03-08T00:32:43.507] job_time_limit: Preemption GraceTime reached
JobId=409954
[2019-03-08T00:32:43.509] job_time_limit: Preemption GraceTime reached
JobId=409955
[2019-03-08T00:32:43.511] job_time_limit: Preemption GraceTime reached
JobId=409956
[2019-03-08T00:32:43.513] job_time_limit: Preemption GraceTime reached
JobId=409957
[2019-03-08T00:32:43.515] job_time_limit: Preemption GraceTime reached
JobId=409958
[2019-03-08T00:32:43.517] job_time_limit: Preemption GraceTime reached
JobId=409959
[2019-03-08T00:32:43.519] job_time_limit: Preemption GraceTime reached
JobId=409960
[2019-03-08T00:32:43.520] job_time_limit: Preemption GraceTime reached
JobId=409961
[2019-03-08T00:32:43.522] job_time_limit: Preemption GraceTime reached
JobId=409962
[2019-03-08T00:32:43.524] job_time_limit: Preemption GraceTime reached
JobId=409963
[2019-03-08T00:32:43.526] job_time_limit: Preemption GraceTime reached
JobId=409964
[2019-03-08T00:32:43.527] job_time_limit: Preemption GraceTime reached
JobId=409965
[2019-03-08T00:32:43.529] job_time_limit: Preemption GraceTime reached
JobId=409966
[2019-03-08T00:32:43.531] job_time_limit: Preemption GraceTime reached
JobId=409967
[2019-03-08T00:32:43.532] job_time_limit: Preemption GraceTime reached
JobId=409968
[2019-03-08T00:32:43.534] job_time_limit: Preemption GraceTime reached
JobId=409969
[2019-03-08T00:32:43.535] job_time_limit: Preemption GraceTime reached
JobId=409970
[2019-03-08T00:32:43.537] job_time_limit: Preemption GraceTime reached
JobId=409971
[2019-03-08T00:32:43.538] job_time_limit: Preemption GraceTime reached
JobId=410003
[2019-03-08T00:32:46.226] _job_complete: JobID=410289 State=0x1 NodeCnt=1
WEXITSTATUS 1
[2019-03-08T00:32:46.230] _job_complete: JobID=410289 State=0x8005 NodeCnt=1
done
[2019-03-08T00:32:46.310] email msg to [email protected]: SLURM Job_id=410300
Name=mesh_12x12x4_T3 Began, Queued time 00:05:17
[2019-03-08T00:32:46.350] sched: Allocate JobID=410300 NodeList=r00n29 #CPUs=36
Partition=ccei_biomass
Job 410300 was in a per-stakeholder partition, thus preempting jobs in the
"standard" partition. With a GraceTime of 5 minutes, the preemption must have
occurred ca. 00:27:43. So how did the backfill scheduler manage to start a new
job (410289) on that node at 00:28:04, _after_ the preemption state should have
been noted on r00n29? The submitted TimeLimit on 410289 was 7 days, yet again
its EndTime was altered to the exact time it died, and it died the same way as
in Problem 1: the initial SIGCONT-SIGTERM delivery killed step 0, etc. With the
EndTime reset to 00:32:46, the preemption would have had to happen ca.
00:27:46. This looks like a race condition.
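For anyone checking my arithmetic, the timeline works out as follows (timestamps copied from the log excerpt above):

```shell
# Timestamps from the slurmctld log (GNU date):
backfill_start=$(date -d '2019-03-08 00:28:04' +%s)   # backfill: Started JobID=410289
altered_end=$(date -d '2019-03-08 00:32:46' +%s)      # 410289's _job_complete
grace=300                                             # 5-minute GraceTime, in seconds

# If EndTime was reset to (preemption time + GraceTime), the preemption
# decision on r00n29 happened here:
implied_preempt=$((altered_end - grace))
date -d "@$implied_preempt" '+%Y-%m-%d %H:%M:%S'      # prints 2019-03-08 00:27:46
[ "$backfill_start" -gt "$implied_preempt" ] && echo "backfill started after the implied preemption"
```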
The release notes don't seem to indicate that these issues were known and have
been addressed, and I didn't find anything pertinent on bugs.schedmd.com
<http://bugs.schedmd.com/>. To anyone who has encountered similar behavior: how
did you mitigate it?
::::::::::::::::::::::::::::::::::::::::::::::::::::::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / HPC Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE 19716
Office: (302) 831-6034 Mobile: (302) 419-4976
::::::::::::::::::::::::::::::::::::::::::::::::::::::