Should the PrologSlurmctld script for a job only run after the prior job's EpilogSlurmctld script finishes?
Below you can see Job A run and complete. While the EpilogSlurmctld script for Job A (Node A) is still executing on the Slurm controller, the PrologSlurmctld script for the next job, Job B (Node A), is also running on the controller. This breaks our workflow: we expect the next PrologSlurmctld script to run only once the prior job is 100% complete (its initial PrologSlurmctld has finished AND the job itself has finished AND its EpilogSlurmctld has finished).

PrologSlurmctld log (here the prolog is starting for Job B, ID 812):

*2021-09-27 15:42:58,746* | INFO | Starting...

EpilogSlurmctld log (here the epilog is starting and ending for Job A, ID 811):

*2021-09-27 15:42:56,694* | INFO | Starting...
*2021-09-27 15:43:01,756* | INFO | Exiting 0 after main

slurmctld log:

[2021-09-27T15:42:50.224] debug: sched/backfill: _attempt_backfill: beginning
[2021-09-27T15:42:50.224] debug: sched/backfill: _attempt_backfill: 1 jobs to backfill
[2021-09-27T15:42:56.653] _job_complete: JobId=811 WEXITSTATUS 0
[2021-09-27T15:42:56.653] debug: email msg to root: Slurm Job_id=811 Name=JobA_BIOS_fixedfreq_1067mclk_nps1.ini Ended, Run time 00:22:36, COMPLETED, ExitCode 0
*[2021-09-27T15:42:56.657]* _job_complete: JobId=811 done
[2021-09-27T15:42:58.703] debug: sched: Running job scheduler for full queue.
*[2021-09-27T15:42:58.704]* debug: email msg to root: Slurm Job_id=812 Name=JobA_BIOS_fixedfreq_1600mclk_nps1.ini Began, Queued time 00:22:38
[2021-09-27T15:42:58.704] sched: Allocate JobId=812 NodeList=xxx #CPUs=256 Partition=xxx
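In case slurmctld does not serialize these scripts itself, one possible workaround is to enforce the ordering inside the scripts. Below is a minimal sketch, not a tested production script: the PrologSlurmctld polls for a marker file left by the previous job and only proceeds once that job's EpilogSlurmctld has removed it. The marker path and the timeout are assumptions chosen for illustration.

```shell
#!/bin/bash
# Hypothetical serialization sketch for PrologSlurmctld/EpilogSlurmctld.
# The marker path below is an assumption, not a Slurm default.
LOCKFILE=/tmp/slurmctld_job_active.lock

prolog_main() {
    # Block until the previous job's EpilogSlurmctld removed the marker.
    local waited=0
    while [ -e "$LOCKFILE" ]; do
        sleep 1
        waited=$((waited + 1))
        # Give up after 5 minutes so a stale marker cannot wedge scheduling.
        [ "$waited" -ge 300 ] && return 1
    done
    touch "$LOCKFILE"   # claim the slot for this job
    # ... actual prolog work goes here ...
}

epilog_main() {
    # ... actual epilog work goes here ...
    rm -f "$LOCKFILE"   # allow the next job's prolog to proceed
}
```

A stale marker (e.g. after a crashed epilog) would stall the next prolog until the timeout, so a real version would likely key the marker on the job ID and clean it up on controller restart.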