Hi Andriy,
This one-line patch seems to fix the problem and has been applied to
the version 2.4 code.
diff --git a/src/srun/srun.c b/src/srun/srun.c
index cadbca1..ae31f92 100644
--- a/src/srun/srun.c
+++ b/src/srun/srun.c
@@ -1142,7 +1142,6 @@ _terminate_job_step(slurm_step_ctx_t *step_ctx)
 	slurm_step_ctx_get(step_ctx, SLURM_STEP_CTX_JOBID, &job_id);
 	slurm_step_ctx_get(step_ctx, SLURM_STEP_CTX_STEPID, &step_id);
 	info("Terminating job step %u.%u", job_id, step_id);
-	update_job_state(job, SRUN_JOB_CANCELLED);
 	slurm_kill_job_step(job_id, step_id, SIGKILL);
 }
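
To check the result after rebuilding srun, it may help to re-run the reproduction from the report below and compare only the two relevant fields. The following is a minimal sketch, not part of SLURM: `job_summary` is a hypothetical helper name, and it assumes the `Key=Value` layout of `scontrol show jobs` shown in the quoted output.

```shell
#!/bin/sh
# job_summary (hypothetical helper, not shipped with SLURM):
# reads `scontrol show jobs <id>` output on stdin and prints only the
# JobState= and ExitCode= fields, one per line, in the order they appear.
job_summary() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^(JobState|ExitCode)=/)
                print $i
    }'
}

# Intended use on a cluster (job id taken from the report, adjust as needed):
#   scontrol show jobs 2602 | job_summary
```

If the patch behaves as intended, a job run with KillOnBadExit=1 should presumably report the same two fields as the KillOnBadExit=0 case quoted below, rather than JobState=CANCELLED with ExitCode=0:0.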
Quoting "Andrej N. Gritsenko" <and...@rep.kiev.ua>:
Hello!
Our customers are complaining about strange behavior when a job started
via srun ends with a non-zero exit code:
======================================================================
$ srun sh -c "exit 2"
srun: error: node4-131-23: task 0: Exited with exit code 2
srun: Terminating job step 2602.0
slurmd[node4-131-23]: *** STEP 2602.0 KILLED AT 2011-06-20T18:41:48 WITH SIGNAL 9 ***
$ scontrol show jobs 2602
JobId=2602 Name=sh
UserId=user1-1(510) GroupId=user1-1(510)
Priority=100983 Account=group1 QOS=
JobState=CANCELLED Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=0 ExitCode=0:0
======================================================================
It looks as if the user had simply cancelled the job, and this happens
only when slurm.conf contains the statement KillOnBadExit=1. With
KillOnBadExit=0 the job is reported as expected:
======================================================================
JobId=2604 Name=sh
UserId=user1-1(510) GroupId=user1-1(510)
Priority=983 Account=group1 QOS=
JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=0 ExitCode=2:0
======================================================================
When a job is started via sbatch, it is reported correctly with either
value of the KillOnBadExit parameter. Is this a bug or just an
undocumented feature?
With best wishes.
Andriy.