Marti, I suspect the job failed for some reason and the epilog for each node of the job was invoked. The epilog that was run is this one:
scontrol show conf | grep Epilog Epilog = /etc/slurm/epilog and if you cat /etc/slurm/epilog, you will probably find a condition whereby it returns a non-zero exit code. This is what was recorded in your message below. My guess is that the epilog failure is the canary in the mine shaft. It indicated a problem with the system that caused at least one job step of the job to fail. Memory exhaustion perhaps? Don From: Hill, Marti T [mailto:[email protected]] Sent: Monday, January 06, 2014 7:24 AM To: slurm-dev Subject: [slurm-dev] slurmstepd spank epilog Has anyone seen this error? On the nodes listed below, their /var/log/slurmd.log logs all have the following entries: [2014-01-03T10:11:08] Calling /usr/sbin/slurmstepd spank epilog [2014-01-03T10:11:08] error: [job 299463] epilog failed status=25:0 [2014-01-03T10:11:10] [340941.0] done with job I'm not sure why slurmstepd spank epilog would run in the middle of a job. But all of them failed at the same time with the same status. Thanks for any help. Marti
<<inline: image002.jpg>>
<<inline: image003.jpg>>
