Marti,

I suspect the job failed for some reason, so the epilog was invoked on each 
node of the job.  You can see which epilog was run with:

scontrol show conf | grep Epilog
Epilog                  = /etc/slurm/epilog

If you cat /etc/slurm/epilog, you will probably find a condition under which it 
returns a non-zero exit code.  That non-zero status is what was recorded in the 
log entries in your message below.
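
To locate that condition quickly, something like the following might help. 
It is only a sketch: list_exit_paths is a helper name I made up, and it 
assumes the epilog is a shell script that ends with explicit exit or return 
statements. If the 25 in status=25:0 is the script's exit code (my reading of 
that log format, with the second number being the terminating signal), an 
"exit 25" line would be the first thing to look for.

```shell
# Hypothetical helper (not part of Slurm): print every line of a script
# that contains the word "exit" or "return", prefixed with its line number.
# Comments that mention either word are listed too, so review the output by eye.
list_exit_paths() {
    grep -nE '(^|[^[:alnum:]_])(exit|return)([^[:alnum:]_]|$)' "$1"
}

# Example call, assuming the epilog path reported by scontrol above:
#   list_exit_paths /etc/slurm/epilog
```

If the epilog calls other scripts, the same helper can be pointed at each of 
them in turn.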

My guess is that the epilog failure is the canary in the coal mine.  It 
indicates a problem with the system that caused at least one job step of the 
job to fail.  Memory exhaustion perhaps?
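
To check whether the same status really shows up on every node, a rough 
sketch for pulling the epilog failures out of a slurmd log is below. 
epilog_failures is a hypothetical helper name, and the pattern assumes the 
exact line format shown in the quoted log entries; other Slurm versions may 
word the message differently.

```shell
# Hypothetical helper (not part of Slurm): for every "epilog failed" line in
# a slurmd log, print "timestamp job_id first_status second_status".
# Assumed line format, taken from the quoted log:
#   [TIME] error: [job ID] epilog failed status=N:M
epilog_failures() {
    sed -nE 's/^\[([^]]+)\] error: \[job ([0-9]+)\] epilog failed status=([0-9]+):([0-9]+).*$/\1 \2 \3 \4/p' "$1"
}

# Example call on one node:
#   epilog_failures /var/log/slurmd.log
```

Running it on each affected node and comparing the output should show at a 
glance whether the timestamps and status values match.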

Don

From: Hill, Marti T [mailto:[email protected]]
Sent: Monday, January 06, 2014 7:24 AM
To: slurm-dev
Subject: [slurm-dev] slurmstepd spank epilog


Has anyone seen this error?



On the nodes listed below, their /var/log/slurmd.log logs all have the 
following entries:



[2014-01-03T10:11:08] Calling /usr/sbin/slurmstepd spank epilog 
[2014-01-03T10:11:08] error: [job 299463] epilog failed status=25:0 
[2014-01-03T10:11:10] [340941.0] done with job



I'm not sure why slurmstepd spank epilog would run in the middle of a job, but 
all of them failed at the same time with the same status.



Thanks for any help.



Marti


