Hello!

    Our customers are complaining about strange behavior when a job started via
srun ends with a non-zero exit code:

======================================================================
$ srun sh -c "exit 2"
srun: error: node4-131-23: task 0: Exited with exit code 2
srun: Terminating job step 2602.0
slurmd[node4-131-23]: *** STEP 2602.0 KILLED AT 2011-06-20T18:41:48 WITH SIGNAL 9 ***

$ scontrol show jobs 2602
JobId=2602 Name=sh
   UserId=user1-1(510) GroupId=user1-1(510)
   Priority=100983 Account=group1 QOS=
   JobState=CANCELLED Reason=None Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 ExitCode=0:0
======================================================================

This looks as if the user had simply cancelled the job, and it happens only when
slurm.conf contains the statement KillOnBadExit=1. With KillOnBadExit=0
the job state and exit code are reported as expected:

======================================================================
JobId=2604 Name=sh
   UserId=user1-1(510) GroupId=user1-1(510)
   Priority=983 Account=group1 QOS=
   JobState=FAILED Reason=NonZeroExitCode Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=0 ExitCode=2:0
======================================================================

When a job is started via sbatch, everything works as expected with any value
of the KillOnBadExit parameter. Is this a bug or just an undocumented feature?
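
To compare the two cases quickly, a small helper along these lines could pull
just the relevant fields out of the scontrol output. This is only a sketch: the
function name job_field is hypothetical (not part of Slurm), and it assumes the
space-separated KEY=value layout shown in the output above.

```shell
# Hypothetical helper (not a Slurm command): print the value of one
# KEY=value field from "scontrol show jobs" output read on stdin.
job_field() {
    # $1 is the field name, e.g. JobState or ExitCode.
    # Put each KEY=value pair on its own line, then print the value
    # of the first pair whose key matches.
    tr ' ' '\n' | sed -n "s/^$1=//p" | head -n 1
}

# Usage (assumes job 2602 is still known to the controller):
#   scontrol show jobs 2602 | job_field JobState
#   scontrol show jobs 2602 | job_field ExitCode
```

With KillOnBadExit=1 the first call would be expected to report CANCELLED
instead of FAILED for the srun case described above.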

    With best wishes.
    Andriy.
