Slurm signals the tasks that it launches. I suspect that you are not launching with srun but with mpirun, or some other tool that does not use srun for task launch, so those processes are not managed by Slurm.
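
Once the tasks are started through srun, the application side could look roughly like the following. This is only a sketch, assuming a POSIX system and the trailing-underscore Fortran name mangling used in the csigfun.c quoted below; install_stop_handler_ and stop_requested_ are hypothetical names, not existing routines. The handler does nothing but set a flag, because Fortran I/O such as print or open is not async-signal-safe inside a handler; the main program polls the flag and creates stop_scf itself.

```c
/* Sketch only: install_stop_handler_ and stop_requested_ are hypothetical
   names chosen for illustration, not part of the original csigfun.c. */
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler, read by the main program. */
static volatile sig_atomic_t stop_flag = 0;

/* Async-signal-safe: the handler only stores to a sig_atomic_t. */
static void on_stop_signal(int signum)
{
    (void)signum;
    stop_flag = 1;
}

/* Fortran-callable: call install_stop_handler(SIGUSR1, ierr)
   ierr is 0 on success and -1 if sigaction() fails. */
void install_stop_handler_(const int *signum, int *ierr)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_stop_signal;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* restart interrupted system calls */
    *ierr = sigaction(*signum, &sa, NULL);
}

/* Fortran-callable: returns 1 once the signal has arrived, else 0. */
int stop_requested_(void)
{
    return stop_flag != 0;
}
```

From Fortran this would be used as "call install_stop_handler(SIGUSR1, ierr)" on each rank, followed by a check of stop_requested() (declared "integer, external") at a convenient point, such as the end of each SCF iteration, where the master can then write stop_scf with ordinary Fortran I/O.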

Domingos Rodrigues <ddc...@gmail.com> wrote:

Hello,

Here I am again! Well, it looks like I went down the wrong path again. I thought that I could do some signal handling in my application by trapping the signals sent by the command "scancel --batch -s INT JOBID". My application is in Fortran 90, so signal handling is a bit tricky, but I wrote an auxiliary C routine. Here follows a sketch of my code (I hope it's self-explanatory):

--------------------------BEGIN CODE_____________________________________________
program my_program
  ! declarations
  ...
  !_____________________________________________
  ! Define parallel environment.
  !_____________________________________________
  call create_parallel_data (parallel)
  ...
  !_____________________________________________
  ! Initialize external signal handling
  !_____________________________________________
  call sigclear(SIGUSR1)  ! Intel needs this, otherwise its
                          ! runtime library catches the signal
                          ! before this process
  if (parallel%iammaster) then
     call trap_signal(SIGUSR1, checkpt_sigusr1)
  else
     call trap_signal(SIGUSR1, ignore_sigusr1)
  endif
  ....
contains
  subroutine checkpt_sigusr1
    print *, 'SLURM sent a SIGUSR1 signal: aborting SCF'
    open(99,file='stop_scf',status='unknown',form='formatted')
    return
  end subroutine checkpt_sigusr1

  subroutine ignore_sigusr1
    print *, 'SLURM sent a SIGUSR1 signal'
    return
  end subroutine ignore_sigusr1
end program my_program
--------------------------END CODE_____________________________________________

The function trap_signal() is defined in C:

--------------------------BEGIN CODE_____________________________________________
/* in "csigfun.c" */
/* #include <config.h> */
#include <stdlib.h>
#include <signal.h>   /* standard C header, included unconditionally
                         because config.h is commented out above */

typedef void (*sighandler_t)(int);

/* Restore the default disposition for the signal. */
void sigclear_(int *signum)
{
  signal(*signum, SIG_DFL);
}

/* Install a handler; Fortran passes the signal number by reference. */
void trap_signal_(int* signum, sighandler_t handler)
{
  signal(*signum, handler);
}
--------------------------END CODE_____________________________________________

This works well if I send a signal through kill (kill -s USR1 pid): the file stop_scf is created and everything follows normally. It does not work in the SLURM environment, though. I naively thought that at some point SLURM would use the signal() function, but apparently it does not.

What is the right way to trap the signals sent by scancel? Actually, I am asking this only out of academic curiosity, because from a pragmatic point of view I can always log into the node where the job is running and create the file stop_scf manually myself :-)

Thanks a lot,

Domingos

_____________________________________________
Domingos Rodrigues, PhD
Laboratório de Computação Científica
ICeX, Sala 2040
Universidade Federal de Minas Gerais,
Av. Antônio Carlos, 6627 - Pampulha
31270-901 - Belo Horizonte - MG - Brasil
Tel +55 31 3409-4909
VOIP: +55 31 3409-3333 / 10811803
Fax +55 313409-5390
http://www.cenapad.ufmg.br
Email: ddcr(at)lcc.ufmg.br, ddcr(at)ufmg.br
_____________________________________________

On Tue, Oct 18, 2011 at 12:48 PM, David N. Lombard <dnlom...@ichips.intel.com> wrote:
>
> On Sat, Oct 15, 2011 at 11:23:59AM -0600, Domingos wrote:
> > Dear community,
> >
> > I am trying to design a batch script that launches a parallel job with
> > mpirun (the Intel MPI version I'm using does not have a PMI interface,
> > so I can't launch via srun).
>
> Intel MPI does offer a PMI interface. Here's a quick example:
>
> $ export I_MPI_PMI_LIBRARY=/full/path/to/slurm/libpmi.so
> $ export I_MPI_FABRICS=shm:ofa
> $ srun -n 2 ./hello_world
>
> You can also find more info at
> http://software.intel.com/en-us/articles/how-to-use-slurm-pmi-with-the-intel-mpi-library-for-linux
>
> I have been told that you should use 4.0.3 due to some fixes.
>
> --
> David N. Lombard, Intel, Irvine, CA
> I do not speak for Intel Corporation; all comments are strictly my own.