Hi all,

While developing a Slurm plugin I've run into a funny problem. I guess it is
not strictly related to Slurm but rather to system administration, but maybe
someone can point me in the right direction.

I have two machines, one running CentOS 7 and one running BullX (based on
CentOS 6). When I send a signal to terminate a running task, the two
machines behave differently.

It can be seen with two nested scripts, based on slurm_trap.sh by Mike Dacre
(https://gist.github.com/MikeDacre/10ae23dcd3986793c3fd ). The code is at
the bottom of the mail. As can be seen, both father and son trap SIGTERM,
among other signals. The execution consists of "father" calling "son", and
"son" waiting forever until it is killed.


As you can see in the execution results (bottom of the mail), one of the
machines executes the functions stated in "trap", but the other does not.
Moreover, this second machine does execute the functions in trap when only
a single script is executed, not two nested ones.

Have you got an explanation for this? Is it possible to ensure that the
"trap" command will always be executed?
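
For reference, the nested structure can also be exercised without Slurm. The
following is only a minimal sketch, assuming the signal reaches just the
outer shell's PID (whether Slurm does that on either machine is exactly what
I don't know). The names run_case, fg and bg are made up for this test; "fg"
mimics father.sh running "sh son.sh" in the foreground, and "bg" runs the
child in the background followed by "wait".

```shell
#!/bin/bash
# Local sketch (no Slurm involved): signal ONLY the outer shell's PID and
# check when its TERM trap runs. All names here are hypothetical.

run_case() {
    local mode="$1" log pid early=no final=no
    log=$(mktemp)

    # Outer shell: traps TERM, then runs a 3 s child in the given mode.
    bash -c '
        trap "echo outer: trapped TERM; exit 0" TERM
        if [ "$1" = fg ]; then
            sleep 3          # foreground child, like "sh son.sh"
        else
            sleep 3 &        # background child ...
            wait             # ... and wait, which a trapped signal interrupts
        fi
        echo "outer: finished without a signal"
    ' _ "$mode" > "$log" 2>&1 &
    pid=$!

    sleep 1                  # give the outer shell time to install its trap
    kill -TERM "$pid"        # signal the outer shell only, not the child
    sleep 1
    # Did the trap run within 1 s of the signal?
    if grep -q "trapped TERM" "$log"; then early=yes; fi

    wait "$pid"              # returns once the outer shell has exited
    if grep -q "trapped TERM" "$log"; then final=yes; fi
    rm -f "$log"
    echo "$mode: trap ran within 1s=$early, trap ran eventually=$final"
}

fg_result=$(run_case fg)
bg_result=$(run_case bg)
printf '%s\n' "$fg_result" "$bg_result"
```

The bash manual says that a trap is not run while bash is waiting for a
foreground command to complete, but that the "wait" builtin returns
immediately when a trapped signal arrives. So I would expect the "fg" case to
report the trap only after the child exits and the "bg" case to report it
right away. This does not by itself explain why the BullX run prints nothing
at all.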

Thanks for your help,

Manuel

-----
-----
-bash-4.2$ more father.sh

#!/bin/bash

trap_with_arg() {
    func="$1" ; shift
    for sig ; do
        trap "$func $sig" "$sig"
    done
}

func_trap() {
    echo father: trapped $1
}

trap_with_arg func_trap 0 1 USR1 EXIT HUP INT QUIT PIPE TERM

cat /dev/zero > /dev/null &

sh son.sh
-bash-4.2$ more son.sh
#!/bin/bash


trap_with_arg() {
    func="$1" ; shift
    for sig ; do
        trap "$func $sig" "$sig"
    done
}

func_trap() {
    echo son: trapped $1
}

trap_with_arg func_trap 0 1 USR1 EXIT HUP INT QUIT PIPE TERM

cat /dev/zero > /dev/null &
wait
-----
-----


Output in CentOS7:
-bash-4.2$ sbatch  father.sh
Submitted batch job 1563
-bash-4.2$ scancel 1563
-bash-4.2$ more slurm-1563.out
slurmstepd: error: *** JOB 1563 ON acme12 CANCELLED AT 2017-07-04T15:39:00 ***
son: trapped TERM
son: trapped EXIT
father: trapped TERM
father: trapped EXIT

Output in BullX:
~/signalTests> sbatch  father.sh
Submitted batch job 233
~/signalTests> scancel 233
~/signalTests> more slurm-233.out
slurmstepd: error: *** JOB 233 ON taurusi5089 CANCELLED AT 2017-07-04T15:43:54 ***

Output in BullX, just son:
~/signalTests> sbatch -- son.sh
Submitted batch job 235
~/signalTests> scancel 235
~/signalTests> more slurm-235.out
slurmstepd: error: *** JOB 235 ON taurusi4061 CANCELLED AT 2017-07-04T15:48:29 ***
son: trapped TERM
son: trapped EXIT
