Hi,

After upgrading from 14.03 to 14.11, our users started complaining that jobs fail at random with the following reason:

slurmstepd: error: get_exit_code task 0 died by signal

The culprit is a change in the squeue command, which the example epilog script calls with --node=localhost:

https://github.com/SchedMD/slurm/blob/master/etc/slurm.epilog.clean

squeue --noheader --format=%A --user=991 --node=localhost
squeue: error: Invalid node name localhost

A possible workaround is to pass the node's real hostname instead of localhost:

job_host=`hostname`
job_list=`${SLURM_BIN}squeue --noheader --format=%A --user=$SLURM_UID --node=$job_host`
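
A slightly more defensive variant of the same idea, as a rough sketch only: it prefers SLURMD_NODENAME when that variable is set (I am assuming slurmd exports it in the epilog environment on your version) and otherwise falls back to the short hostname, since the full output of hostname may not match the node name in slurm.conf. The helper names resolve_node_name and list_user_jobs are made up for illustration and are not part of Slurm.

```shell
#!/bin/sh
# Hypothetical helper: pick the node name to hand to squeue.
# Assumption: SLURMD_NODENAME, when set, holds the Slurm node name.
resolve_node_name() {
    if [ -n "${SLURMD_NODENAME:-}" ]; then
        printf '%s\n' "$SLURMD_NODENAME"
    else
        # Fall back to the short hostname (assumes it matches slurm.conf).
        hostname -s
    fi
}

# Hypothetical helper: list the job IDs of $SLURM_UID on this node,
# using the same squeue options as the workaround above.
list_user_jobs() {
    "${SLURM_BIN}squeue" --noheader --format=%A \
        --user="$SLURM_UID" --node="$(resolve_node_name)"
}
```

In the epilog script this would be used as job_list=`list_user_jobs` in place of the original squeue line.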

Regards,
Tommi
