Hi All,

Thanks for all of the help. Changing my script to block on wait instead of on the sleep command allowed me to successfully trap the signals.

So my understanding now is that SLURM generally sends SIGTERM, except when memory limits are exceeded, in which case cgroups will kill the job with the OOM killer, which I think sends SIGKILL, which is uncatchable. Is that correct?
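
For anyone reading this in the archive, here is a minimal sketch of the wait-based approach. It is an illustration under stated assumptions, not my actual job script: the temporary "job" file, the on_term handler and the sleep 30 payload are hypothetical stand-ins, and the outer script plays the role of SLURM by sending SIGTERM itself, so it runs without a cluster. The point is that bash defers a trap until the current foreground command finishes, whereas the wait builtin returns as soon as a trapped signal arrives.

```shell
#!/bin/bash
# Write a stand-in "batch script" to a temporary file (hypothetical example).
job=$(mktemp)
cat > "$job" <<'EOF'
#!/bin/bash
on_term() {
    echo "caught SIGTERM"        # a checkpoint command would go here
    kill "$child" 2>/dev/null    # stop the payload ourselves
    exit 143                     # 128 + 15, the conventional status for SIGTERM
}
trap on_term TERM

sleep 30 &                       # payload runs in the background...
child=$!
wait "$child"                    # ...so the shell blocks in wait, which a trapped signal interrupts
EOF

# Launch the stand-in job and signal it, as the scheduler would.
bash "$job" > "$job.out" 2>/dev/null &
job_pid=$!
sleep 1                          # give the job time to install its trap
kill -TERM "$job_pid"

job_status=0
wait "$job_pid" || job_status=$?
job_output=$(cat "$job.out")
rm -f "$job" "$job.out"

echo "status=$job_status output=$job_output"
```

If the payload is run in the foreground instead (plain sleep 30 with no & and no wait), the handler only runs after the sleep has finished, which is what I was seeing before.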
Based on this, I think it should be pretty easy to write code to auto-checkpoint a job on SIGTERM, which will mean the job is checkpointed if it is killed with scancel or by timeout, so the only uncatchable kill would be the memory overuse one. Is that correct?

Thanks all,

Mike

On Fri, Feb 26, 2016 at 12:44 AM, Fitzpatrick, Ben <ben.fitzpatr...@metoffice.gov.uk> wrote:

> As far as I know, Slurm sends SIGTERM for timeout, and we use cgroups for
> memory enforcement which will cause an OOM abort (ERR) of the process
> outside Slurm when the process insists on exceeding the memory limit. When
> a task’s timing out, you should only get a SIGKILL on timeout + KillWait
> seconds.
>
> scancel documentation claims:
>
>     To cancel a job, invoke scancel without --signal option. This will
>     send first a SIGCONT to all steps to eventually wake them up followed
>     by a SIGTERM, then wait the KillWait duration defined in the slurm.conf
>     file and finally if they have not terminated send a SIGKILL. This
>     gives time for the running job/step(s) to clean up.
>
> Another interesting pointer to the way things are working is in the --full
> option to scancel:
>
>     -f, --full
>         Signal all steps associated with the job including any batch
>         step (the shell script plus all of its child processes). By
>         default, signals other than SIGKILL are not sent to the batch
>         step. Also see the -b, --batch option
>
> To get your traps to work - basically, try to avoid trapping SIGTERM, even
> though it is the exact signal you’re looking for.
>
> If you trap SIGTERM in the batch script, and you’re not launching children
> via srun, you won’t end up killing the child processes - they’ll continue
> to run, then get SIGKILL-ed later.
>
> We encountered this problem within cylc (https://github.com/cylc/cylc)
> which likes to have tasks communicate back before they finish - we always
> want to successfully trap failures if possible.
>
> https://github.com/cylc/cylc/pull/1287 has a bit of explanation about the
> problem and how we fixed it, but it boils down to - don’t trap SIGTERM!
>
> We ended up having traps like this in our cylc batch scripts (skip the
> cylc/CYLC lines, obviously, XCPU is unnecessary (but the best signal for
> timeouts!) and VACATION_SIGNALS is unset for Slurm):
>
>     set -u # Fail when using an undefined variable
>     FAIL_SIGNALS='EXIT ERR XCPU'
>     TRAP_FAIL_SIGNAL() {
>         typeset SIGNAL=$1
>         echo "Received signal $SIGNAL" >&2
>         typeset S=
>         for S in ${VACATION_SIGNALS:-} $FAIL_SIGNALS; do
>             trap "" $S
>         done
>         if [[ -n "${CYLC_TASK_MESSAGE_STARTED_PID:-}" ]]; then
>             wait "${CYLC_TASK_MESSAGE_STARTED_PID}" 2>/dev/null || true
>         fi
>         cylc task message -p 'CRITICAL' "Task job script received signal $SIGNAL" 'failed'
>         exit 1
>     }
>     for S in $FAIL_SIGNALS; do
>         trap "TRAP_FAIL_SIGNAL $S" $S
>     done
>     unset S
>
> We then run the process and then have:
>
>     trap '' EXIT
>
> so that we can actually quit the script successfully at the end!
>
> Uncaught signals filter down to the running child processes, which then
> causes them to abort and trigger the above trap - so it works whatever you
> do.
>
> I think the sleep command has a peculiar interaction with signals, so it
> may not be the best command to try.
>
> Cheers,
>
> Ben
>
> *From:* Mike Dacre [mailto:mike.da...@gmail.com]
> *Sent:* 26 February 2016 01:36
> *To:* slurm-dev
> *Subject:* [slurm-dev] Kill Signals Sent By SLURM
>
> Hi All,
>
> I am trying to incorporate checkpointing using DMTCP into my SLURM jobs,
> specifically, to allow the checkpointing of a job when it is killed by
> SLURM on timeout or memory overuse (or anything else), to allow
> resubmission from the checkpoint later. I have been talking with the DMTCP
> devs about this here: https://github.com/dmtcp/dmtcp/issues/324 but I
> have run into some trouble.
>
> Even using the --signal command to sbatch, I cannot capture the kill
> signal sent to the job by SLURM. The script I am using is here:
> https://gist.github.com/MikeDacre/10ae23dcd3986793c3fd. Irrespective of
> whether I specify --signal with or without the B:, if I allow the job to
> timeout or kill it with scancel, my trap command is unable to trap the
> signal.
>
> Do any of you know a better way of trapping exit signals with a slurm
> script? Do you by any chance know what signal SLURM sends to jobs when they
> are killed by scancel or for time or memory use reasons?
>
> Thanks so much,
>
> Mike
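
A footnote on the quoted advice to trap EXIT and ERR rather than SIGTERM: the sketch below is a simplified illustration of that pattern and is not the cylc code. The report_failure function and the temporary "job" file are hypothetical names, and the payload terminates itself with SIGTERM to stand in for SLURM signalling the child process, so the example runs without a cluster. The batch shell never traps SIGTERM; it only notices that its child exited with a non-zero status.

```shell
#!/bin/bash
# Write a stand-in "batch script" to a temporary file (hypothetical example).
job=$(mktemp)
cat > "$job" <<'EOF'
#!/bin/bash
report_failure() {
    trap '' EXIT ERR             # disarm both traps so the handler runs only once
    echo "task failed via $1"    # a "job failed" message or checkpoint call would go here
    exit 1
}
trap 'report_failure ERR' ERR
trap 'report_failure EXIT' EXIT

# Payload: a child that dies from SIGTERM, standing in for a signalled job step.
bash -c 'kill -TERM $$'

trap '' EXIT                     # reached only on success, so the script can exit cleanly
echo "task succeeded"
EOF

job_status=0
job_output=$(bash "$job" 2>/dev/null) || job_status=$?
rm -f "$job"

echo "status=$job_status output=$job_output"
```

Because the child's death surfaces as an ordinary non-zero exit status, the ERR trap fires in the parent script without the parent having to catch the signal itself.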