Slurm signals only the tasks that it launches itself. I suspect that you are
not launching with srun but with mpirun or some other tool that does not use
srun for task launch, so those processes are not managed by Slurm.
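
To illustrate the point: a signal reaches only the process it is sent to, so
any launcher sitting between Slurm and the MPI ranks has to pass it on. The
following is a minimal sketch in C of a wrapper that forwards a signal to one
child process. The names forward_signal_to_child and wait_for_child are
hypothetical, and whether mpirun then relays the signal to its ranks is an
assumption that would need to be checked for the MPI version in use.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>   /* fork(), execvp() for the caller */

/* PID that receives forwarded signals (hypothetical helper state). */
static volatile sig_atomic_t forward_target = 0;

/* Handler: pass the signal on. kill() is async-signal-safe. */
static void forward_handler(int signum)
{
    int saved_errno = errno;
    if (forward_target > 0)
        kill((pid_t)forward_target, signum);
    errno = saved_errno;
}

/* Forward every future `signum` received by this process to `child`.
   Returns 0 on success, -1 on error (as sigaction() does). */
int forward_signal_to_child(int signum, pid_t child)
{
    struct sigaction sa;

    forward_target = (sig_atomic_t)child;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = forward_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(signum, &sa, NULL);
}

/* Wait for `child`, retrying if a signal interrupts the wait.
   Returns the child's exit code, or -1 if it did not exit normally. */
int wait_for_child(pid_t child)
{
    int status;

    while (waitpid(child, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A wrapper built on this would fork(), exec the launcher in the child, call
forward_signal_to_child(SIGUSR1, pid) in the parent, and then block in
wait_for_child(pid). Launching the tasks with srun avoids the need for any of
this.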
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

Domingos Rodrigues <ddc...@gmail.com> wrote:

Hello,
Here I am again! Well, it looks like I went down the wrong path again.
I thought that I could do some signal handling in my application
by trapping the signals sent by the command "scancel --batch --s INT JOBID".
My application is in Fortran 90, so signal handling is a bit tricky, but I
wrote an auxiliary C routine. Here follows a sketch of my code (I hope it's
self-explanatory):

--------------------------BEGIN CODE_____________________________________________

program my_program
  ! declarations
   ...

  !_____________________________________________

  ! Define parallel environment.
  !_____________________________________________

  call create_parallel_data (parallel)
  ...
  !_____________________________________________

  ! Initialize external signal handling
  !_____________________________________________

  ! Intel needs this; otherwise its runtime library catches the
  ! signal before this process does.
  call sigclear(SIGUSR1)

  if (parallel%iammaster) then
     call trap_signal(SIGUSR1, checkpt_sigusr1)
  else
     call trap_signal(SIGUSR1, ignore_sigusr1)
  endif


  ....
contains

  subroutine checkpt_sigusr1
    print *, 'SLURM sent a SIGUSR1 signal: aborting SCF'
    ! Create the marker file, then release the unit again.
    open(99,file='stop_scf',status='unknown',form='formatted')
    close(99)
    return
  end subroutine checkpt_sigusr1

  subroutine ignore_sigusr1
    print *, 'SLURM sent a SIGUSR1 signal'
    return
  end subroutine ignore_sigusr1

end program my_program
--------------------------END CODE_____________________________________________



The function trap_signal() is defined in C:

--------------------------BEGIN CODE_____________________________________________

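/* in "csigfun.c" */
/* #include <config.h> */
/* Included unconditionally: with config.h commented out, HAVE_SIGNAL_H
   may be undefined and signal() would then be undeclared. */
#include <signal.h>

typedef void (*sighandler_t)(int);

/* Restore the default action for *signum (called from Fortran). */
void sigclear_(int *signum)
{
  signal(*signum, SIG_DFL);
}

/* Install handler for *signum (called from Fortran). */
void trap_signal_(int *signum, sighandler_t handler)
{
  signal(*signum, handler);
}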
--------------------------END CODE_____________________________________________


This works well if I send the signal with kill (kill -s USR1 pid): the file
stop_scf is created and everything proceeds normally. It does not work in
the Slurm environment, though.
I naively thought that at some point SLURM would use the signal() function,
but apparently it does not. What is the right way to trap the signals sent
by scancel?
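
For illustration only, here is a minimal C sketch of a more defensive way to
trap the signal, whatever ends up delivering it. The handler is installed with
sigaction() and does nothing but set a flag; the marker file is created later
from ordinary code, because file I/O is not safe inside a signal handler. The
names install_stop_handler, poll_stop_request and stop_requested are
hypothetical and not part of the code above, and a Fortran caller would still
need trailing-underscore wrappers in the style of trap_signal_.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <string.h>

/* Set by the handler, polled by the main loop. */
static volatile sig_atomic_t stop_requested = 0;

/* Handler: record the request and return; no I/O here. */
static void on_sigusr1(int signum)
{
    (void)signum;
    stop_requested = 1;
}

/* Install the SIGUSR1 handler. Returns 0 on success, -1 on error. */
int install_stop_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;   /* restart interrupted system calls */
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Call from the main loop, e.g. once per SCF iteration.
   Creates the marker file if SIGUSR1 has arrived.
   Returns 1 if a stop was requested, 0 otherwise. */
int poll_stop_request(const char *marker_path)
{
    FILE *f;

    if (!stop_requested)
        return 0;
    f = fopen(marker_path, "w");
    if (f != NULL)
        fclose(f);
    return 1;
}
```

Only the master process would call poll_stop_request("stop_scf"); the other
ranks can install the same handler and simply never poll.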

Actually, I am asking this only out of academic curiosity, because from a
pragmatic point of view I can always log into the node where the job is running
and manually create the file stop_scf myself :-)

Thanks a lot,
Domingos
_____________________________________________

Domingos Rodrigues, PhD
Laboratório de Computação Científica
ICeX, Sala 2040
Universidade Federal de Minas Gerais,
Av. Antônio Carlos, 6627 - Pampulha
31270-901 - Belo Horizonte - MG - Brasil
Tel +55 31 3409-4909
VOIP: +55 31 3409-3333 / 10811803
Fax +55 31 3409-5390
http://www.cenapad.ufmg.br
Email: ddcr(at)lcc.ufmg.br, ddcr(at)ufmg.br
_____________________________________________



On Tue, Oct 18, 2011 at 12:48 PM, David N. Lombard
<dnlom...@ichips.intel.com> wrote:
>
> On Sat, Oct 15, 2011 at 11:23:59AM -0600, Domingos wrote:
> > Dear community,
> >
> > I am trying to design a batch script that launches a parallel job with
> > mpirun (the Intel MPI version I'm using does not have a PMI interface,
> > so I can't launch via srun).
>
> Intel MPI does offer a PMI interface. Here's a quick example:
>
>  $ export I_MPI_PMI_LIBRARY=/full/path/to/slurm/libpmi.so
>  $ export I_MPI_FABRICS=shm:ofa
>  $ srun -n 2 ./hello_world
>
> You can also find more info at
> http://software.intel.com/en-us/articles/how-to-use-slurm-pmi-with-the-intel-mpi-library-for-linux
>
> I have been told that you should use Intel MPI 4.0.3, which includes some fixes.
>
> --
> David N. Lombard, Intel, Irvine, CA
> I do not speak for Intel Corporation; all comments are strictly my own.
