Hello!

Our customers are complaining about strange behavior when a job started via srun ends with a non-zero exit code:
======================================================================
$ srun sh -c "exit 2"
srun: error: node4-131-23: task 0: Exited with exit code 2
srun: Terminating job step 2602.0
slurmd[node4-131-23]: *** STEP 2602.0 KILLED AT 2011-06-20T18:41:48 WITH SIGNAL 9 ***
$ scontrol show jobs 2602
JobId=2602 Name=sh
   UserId=user1-1(510) GroupId=user1-1(510)
   Priority=100983 Account=group1 QOS=
   JobState=CANCELLED Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 ExitCode=0:0
======================================================================

It looks as if the user had simply cancelled the job: the state is CANCELLED and the exit code is reported as 0:0 instead of 2:0. This happens only when slurm.conf contains KillOnBadExit=1. With KillOnBadExit=0 everything is as expected:

======================================================================
JobId=2604 Name=sh
   UserId=user1-1(510) GroupId=user1-1(510)
   Priority=983 Account=group1 QOS=
   JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 ExitCode=2:0
======================================================================

When a job is started via sbatch, everything is as expected with either value of the KillOnBadExit parameter.

Is this a bug or just an undocumented feature?

With best wishes,
Andriy.