This is from the scancel man page:

-b, --batch
       Signal the batch job shell and its child processes.

You have a few options to do what you want:
* Pick a signal that does not cause problems for any of the child processes (perhaps SIGUSR1 or SIGUSR2).
* Write a checkpoint/intel_mpi plugin that creates your empty file and integrates with SLURM's checkpoint logic.
* Hack the SLURM code so that, under specific conditions, it only signals the parent process. This could break various other functions, so proceed with caution.
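
The first option could look roughly like the sketch below. It is only an illustration, not a tested SLURM recipe: the names `STOP_FILE`, `request_stop` and `run_with_graceful_stop` are made up for this example, and it assumes that the MPI launcher and the application survive the chosen signal (SIGUSR1 terminates a process by default unless the process handles or ignores it, so check this first). The launcher is started in the background and the shell sits in `wait`, because bash only runs a trap handler once the current foreground command has returned.

```shell
#!/bin/bash
# Sketch only: trap SIGUSR1 in the batch shell and ask the application
# to stop by creating an empty file. All names here are hypothetical.

STOP_FILE=${STOP_FILE:-stop_scf}

request_stop() {
    echo "WARNING: early termination requested" >&2
    # The application is assumed to poll for this file and shut down cleanly.
    touch "$STOP_FILE"
}

run_with_graceful_stop() {
    local pid rc
    trap request_stop USR1

    # Run the command in the background so the shell stays free to
    # handle the signal while the command is still running.
    "$@" &
    pid=$!

    # A trapped signal makes `wait` return early with a status above 128,
    # so keep waiting until the child has really exited.
    while :; do
        rc=0
        wait "$pid" || rc=$?
        kill -0 "$pid" 2>/dev/null || break
    done

    trap - USR1
    # Best effort: this is the status reported by the last `wait`.
    return "$rc"
}
```

In a job script this would replace the plain launcher line, for example `run_with_graceful_stop mpiexec -np ${SLURM_NPROCS} ./${EXEC_BIN}`, and the stop would be requested with `scancel --batch -s USR1 JOBID`.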


Quoting Domingos <ddc...@gmail.com>:

Dear community,

I am trying to design a batch script that launches a parallel job with
mpirun (the Intel MPI version I'm using does not have a PMI interface,
so I can't launch via srun). The application I'm using offers a feature
to stop the calculation smoothly with proper checkpointing. Basically,
I have to write an empty file in the working directory so that it will
be detected by the application, which then takes the proper abortive
action. I thought of designing a script that traps a signal sent via
scancel, for example, scancel --batch -s TERM JOBID, but unfortunately,
in my particular case, slurm sends the signal to the child processes
too. So all the child MPI processes seem to be killed or stopped
externally by slurm instead of letting my job script do it.
Can anybody point me to the right track?

The version of slurm I am using was packaged by BULL, v.2.0.5, and
below I include a sketch of my job script.

Thanks,
Domingos

--------------------------------------------------------------
    #!/bin/bash
    #
    #SBATCH -o Si_liquid-%N-%j.out
    #SBATCH -J Si_liquid
    #SBATCH --ntasks=8
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=1

    source /opt/intel/Compiler/11.1/069/bin/iccvars.sh intel64
    source /opt/intel/Compiler/11.1/069/bin/ifortvars.sh intel64
    source /opt/intel/impi/4.0.0.028/intel64/bin/mpivars.sh
    export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
    export I_MPI_FABRICS=shm:dapl

    ulimit -s unlimited
    ulimit -a

    ...
    ...

    stagein()
    {
      ...
      ...
    }

    stageout()
    {
      ...
      ...
    }

    early()
    {
        echo ' '
        echo ' ############ WARNING:  EARLY TERMINATION #############'
        echo ' '

        touch stop_scf
        sleep 120
        # and so parsec does a clean kill ...
    }
    trap 'early; stageout' SIGTERM

    stagein
    #-------
    HOSTFILE=/tmp/hosts.$SLURM_JOB_ID
    srun hostname -s | sort -u > ${HOSTFILE}

    mpdboot -n ${SLURM_NNODES} -f ${HOSTFILE} -r ssh
    mpdtrace -l
    mpiexec -np ${SLURM_NPROCS} ./${EXEC_BIN}
    mpdallexit
    #-------
    stageout

    exit
--------------------------------------------------------------



