[slurm-dev] Prolog for requeued jobs not run on all nodes

Pär Lindfors Thu, 26 Feb 2015 15:54:51 -0800

Hi,

We have just discovered a problem with Slurm 14.11.4 configured to run
the Prolog at job allocation (PrologFlags=Alloc).


When a batch job starts the first time, the Prolog is executed on all
nodes as expected.

If the job is then requeued and restarted, the Prolog is only run on the
first node, which runs the batch script, and on any node in the new
allocation that was not allocated to the job the first time it ran.

This is caused by cached job credentials. If the slurmds are restarted
with "-c" before the job is restarted, then the Prolog runs on all nodes.
If more than 20 minutes (1200 seconds, DEFAULT_EXPIRATION_WINDOW in
slurm_cred.c) pass before the job is restarted, the Prolog also runs on
all nodes.

When slurmd gets the REQUEST_LAUNCH_PROLOG RPC it runs _rpc_prolog() (in
slurmd/req.c), which has the following code to make sure the Prolog only
runs once:

        first_job_run = !slurm_cred_jobid_cached(conf->vctx, req->job_id);
        if (first_job_run) {
                ...Prolog is run in here...
        }

Unfortunately, slurm_cred_jobid_cached() also returns true when the
cached credential from the last time the job ran on the same node has
not yet been purged.
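
To make the failure mode concrete, here is a small toy model of the
guard. It is only a sketch, not Slurm code: every toy_* name is
hypothetical, and it assumes a cache entry simply expires a fixed window
after it was inserted (when slurmd really starts that clock is not
modelled). Time is passed in explicitly so the behaviour is easy to
follow.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define TOY_EXPIRATION_WINDOW 1200 /* seconds, the value quoted above */
#define TOY_MAX_JOBS 16

/* Hypothetical per-node cache entry; not Slurm's real data structure. */
struct toy_job_state {
        uint32_t job_id;
        time_t   expires; /* entry is purged once now >= expires */
        bool     in_use;
};

struct toy_cred_ctx {
        struct toy_job_state jobs[TOY_MAX_JOBS];
};

/* Drop every entry whose expiration time has passed. */
static void toy_purge_expired(struct toy_cred_ctx *ctx, time_t now)
{
        for (size_t i = 0; i < TOY_MAX_JOBS; i++)
                if (ctx->jobs[i].in_use && now >= ctx->jobs[i].expires)
                        ctx->jobs[i].in_use = false;
}

/* True if job_id is still present in the cache at time 'now'. */
static bool toy_jobid_cached(struct toy_cred_ctx *ctx, uint32_t job_id,
                             time_t now)
{
        toy_purge_expired(ctx, now);
        for (size_t i = 0; i < TOY_MAX_JOBS; i++)
                if (ctx->jobs[i].in_use && ctx->jobs[i].job_id == job_id)
                        return true;
        return false;
}

/* Remember job_id; returns false only if the table is full. */
static bool toy_insert_jobid(struct toy_cred_ctx *ctx, uint32_t job_id,
                             time_t now)
{
        for (size_t i = 0; i < TOY_MAX_JOBS; i++) {
                if (!ctx->jobs[i].in_use) {
                        ctx->jobs[i].job_id  = job_id;
                        ctx->jobs[i].expires = now + TOY_EXPIRATION_WINDOW;
                        ctx->jobs[i].in_use  = true;
                        return true;
                }
        }
        return false;
}

/* Same shape as the guard in _rpc_prolog(): returns true when the
 * Prolog would be run for this request. */
static bool toy_rpc_prolog(struct toy_cred_ctx *ctx, uint32_t job_id,
                           time_t now)
{
        bool first_job_run = !toy_jobid_cached(ctx, job_id, now);

        if (first_job_run)
                toy_insert_jobid(ctx, job_id, now);
        return first_job_run;
}
```

With a zero-initialised context, toy_rpc_prolog(&ctx, 42, 0) runs the
Prolog, the same call at now = 600 (a requeue inside the window) skips
it, and at now = 1200 the stale entry has been purged so the Prolog runs
again. That matches what we see on the real nodes.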

As a quick test I simply removed the check:

======================================================================
diff --git a/src/slurmd/slurmd/req.c b/src/slurmd/slurmd/req.c
index 76e9f4e..a53a91f 100644
--- a/src/slurmd/slurmd/req.c
+++ b/src/slurmd/slurmd/req.c
@@ -1502,5 +1502,5 @@ static void _rpc_prolog(slurm_msg_t *msg)
        first_job_run = !slurm_cred_jobid_cached(conf->vctx, req->job_id);
 
-       if (first_job_run) {
+       if (1) {
                slurm_cred_insert_jobid(conf->vctx, req->job_id);
                _add_job_running_prolog(req->job_id);
======================================================================

This appears to work fine and runs the Prolog on all nodes when it
should. However, I guess that check is needed for some reason so this
could introduce other issues.


As I mentioned, the Prolog always gets run on the job's first node, which
runs the batch script. REQUEST_LAUNCH_PROLOG fails on this node as well,
but the Prolog gets run later on when _rpc_batch_job() handles the
REQUEST_BATCH_JOB_LAUNCH RPC. This function has an identical
first_job_run test, but I believe it works since
slurm_cred_handle_reissue() is called before that and gets rid of the
old credential.
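
The difference between the two paths can be sketched like this. Again
this is a toy model with hypothetical sketch_* names, not the real
functions. In particular, the condition under which the reissue step
drops the old entry (the incoming credential being newer than the time
the old one was revoked) is my assumption about how it behaves.

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical single-job cache slot; not Slurm's real job state. */
struct sketch_job_state {
        bool     present;
        uint32_t job_id;
        time_t   revoked; /* 0 while running, else when the job ended */
};

/* Hypothetical credential carrying only the fields used here. */
struct sketch_cred {
        uint32_t job_id;
        time_t   ctime; /* creation time of this credential */
};

static bool sketch_jobid_cached(const struct sketch_job_state *s,
                                uint32_t job_id)
{
        return s->present && s->job_id == job_id;
}

/* Assumed behaviour of the reissue step: forget the old entry when a
 * newer credential arrives for a job that was already revoked. */
static void sketch_handle_reissue(struct sketch_job_state *s,
                                  const struct sketch_cred *cred)
{
        if (sketch_jobid_cached(s, cred->job_id) &&
            s->revoked != 0 && cred->ctime > s->revoked)
                s->present = false;
}

static void sketch_insert(struct sketch_job_state *s, uint32_t job_id)
{
        s->present = true;
        s->job_id  = job_id;
        s->revoked = 0;
}

/* Prolog-launch path: guard only. Returns true if the Prolog runs. */
static bool sketch_prolog_launch(struct sketch_job_state *s,
                                 uint32_t job_id)
{
        bool first_job_run = !sketch_jobid_cached(s, job_id);

        if (first_job_run)
                sketch_insert(s, job_id);
        return first_job_run;
}

/* Batch-launch path: reissue handling first, then the same guard. */
static bool sketch_batch_launch(struct sketch_job_state *s,
                                const struct sketch_cred *cred)
{
        bool first_job_run;

        sketch_handle_reissue(s, cred);
        first_job_run = !sketch_jobid_cached(s, cred->job_id);
        if (first_job_run)
                sketch_insert(s, cred->job_id);
        return first_job_run;
}
```

Starting from a stale entry for job 42 that was revoked at time 100,
sketch_prolog_launch() skips the Prolog, while sketch_batch_launch()
with a credential created at time 200 clears the entry first and so runs
it. If something similar were done before the guard in _rpc_prolog(),
the check could perhaps stay in place, but I have not tried that.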

Regards,
Pär Lindfors, NSC
