Dear community,

I am trying to design a batch script that launches a parallel job with mpirun (the Intel MPI version I'm using does not have a PMI interface, so I can't launch via srun). The application I'm using offers a feature to stop the calculation smoothly with proper checkpointing: basically, I have to write an empty file in the working directory; the application detects it and then takes the proper action to abort cleanly. I thought of designing a script that traps a signal sent via scancel, for example, scancel --batch -s TERM JOBID, but unfortunately, in my particular case, Slurm sends the signal to the child processes too. So all the child MPI processes seem to be killed or stopped externally by Slurm instead of my job script being allowed to stop them. Can anybody point me to the right track?
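Just to make the idea concrete, this is the stripped-down mechanism I am trying to get working. The background/wait idiom here is only my guess at how to let the trap fire while mpiexec is still running; stop_scf, EXEC_BIN and SLURM_NPROCS come from the full script further below:

    early()
    {
        touch stop_scf      # parsec detects this file and stops with a checkpoint
    }
    trap 'early' SIGTERM

    mpiexec -np ${SLURM_NPROCS} ./${EXEC_BIN} &   # background, so the trap can
    MPI_PID=$!                                    # run while we sit in "wait"
    wait ${MPI_PID}
    # if "wait" was interrupted by the trap, wait once more so mpiexec can
    # finish writing the checkpoint before the script goes on to stageout
    [ $? -gt 128 ] && wait ${MPI_PID}

(In the real script I simply sleep for 120 s after touching the file instead, as shown below.)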
The version of Slurm I am using was packaged by BULL, v.2.0.5, and below I include a sketch of my job script.

Thanks,
Domingos

--------------------------------------------------------------
#!/bin/bash
#
#SBATCH -o Si_liquid-%N-%j.out
#SBATCH -J Si_liquid
#SBATCH --ntasks=8
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1

source /opt/intel/Compiler/11.1/069/bin/iccvars.sh intel64
source /opt/intel/Compiler/11.1/069/bin/ifortvars.sh intel64
source /opt/intel/impi/4.0.0.028/intel64/bin/mpivars.sh

export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
export I_MPI_FABRICS=shm:dapl

ulimit -s unlimited
ulimit -a
...
...

stagein()
{
...
...
}

stageout()
{
...
...
}

early()
{
    echo ' '
    echo ' ############ WARNING: EARLY TERMINATION #############'
    echo ' '
    touch stop_scf
    sleep 120    # and so parsec does a clean kill
    ...
}

trap 'early; stageout' SIGTERM

stagein

#-------
HOSTFILE=/tmp/hosts.$SLURM_JOB_ID
srun hostname -s | sort -u > ${HOSTFILE}

mpdboot -n ${SLURM_NNODES} -f ${HOSTFILE} -r ssh
mpdtrace -l

mpiexec -np ${SLURM_NPROCS} ./${EXEC_BIN}

mpdallexit
#-------

stageout

exit
--------------------------------------------------------------