Hello,

Try the manual checkpoint/restart instructions in the MVAPICH2
documentation first, outside of SLURM if possible. I want to make sure
your MPI is working with BLCR first.

There are instructions in the MVAPICH user guide for how to do this.
http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.0-userguide.html
I have now tested manual checkpointing with MVAPICH2; previously I had only tested it in combination with Slurm.

First of all, I should mention that we currently do not use MVAPICH2 on our production system.
So I installed a minimal configuration of MVAPICH2 as follows:

1. Configuration and installation:
  ../configure --prefix=/scratch/hpcsupport/checkpoint_test/neu/mvapich2 --enable-ckpt --with-blcr=/opt/blcr
  make -j 8 && make install
2. Export environment variables:
  export PATH=/scratch/hpcsupport/checkpoint_test/neu/mvapich2/bin:$PATH
  export LD_LIBRARY_PATH=/scratch/hpcsupport/checkpoint_test/neu/mvapich2/lib:$LD_LIBRARY_PATH
3. Compile a small MPI example:
  testcode_mpi.c:

   #include <stdio.h>
   #include <mpi.h>
   #include <unistd.h>

   int main(int argc, char **argv) {

       int rank, ntasks;

       MPI_Init(&argc,&argv);
       MPI_Comm_rank(MPI_COMM_WORLD,&rank);
       MPI_Comm_size(MPI_COMM_WORLD,&ntasks);

       char nodename[12] = "";
       gethostname(nodename, 12);
       unsigned long long i, j;
       double d=0.0;
       for(i=0; i<1e10; i++) {

          d += 0.001;
          for(j=0; j<1e1; j++);

          if((i % 100000000) == 0) {
             printf("node: %s, rank: %d, step: %.2lf\n", nodename, rank, d);
             fflush(stdout);
          }

       }

       MPI_Finalize();

       return 0;
   }

  mpicc testcode_mpi.c -o testcode_mpi
4. Run testcode_mpi in one terminal:
  mpiexec -np 2 ./testcode_mpi
5. Make a checkpoint in a second terminal:
  mv2_checkpoint

   PID USER     TT       COMMAND     %CPU    VSZ  START CMD
   24082 rotscher pts/0    mpiexec      0.0  24324  06:37 mpiexec -np 2 ./testcode_mpi

   Enter PID to checkpoint or Control-C to exit: 24082
   Checkpointing PID 24082
   Checkpoint file: context.24082

6. Try to restart the process, in the first shell as well as in the second shell; both attempts run into an error:
  cr_restart context.24082

   [mpiexec@taurusi4005] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP & ~POLLERR)) failed
   [mpiexec@taurusi4005] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
   [mpiexec@taurusi4005] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion
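
A side note on the test program in step 3, unrelated to the restart error: the 12-byte nodename buffer leaves room for exactly the 11 characters of "taurusi4005" plus the terminator, and gethostname() is not guaranteed to null-terminate a name that had to be truncated. A minimal sketch of a more defensive lookup could look like the following. The helper name get_nodename and the 256-byte buffer size are assumptions for illustration only; they were not part of the run described above.

```c
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper, not part of the original test program:
 * copy the host name into buf and guarantee a terminating '\0'.
 * Returns 0 on success and -1 on error; on error buf holds an
 * empty string (if a buffer was supplied at all). */
static int get_nodename(char *buf, size_t len) {
    if (buf == NULL || len == 0) {
        return -1;
    }
    memset(buf, 0, len);
    /* Pass len - 1 so that the last byte always stays '\0'. */
    if (gethostname(buf, len - 1) != 0) {
        buf[0] = '\0';
        return -1;
    }
    return 0;
}
```

In testcode_mpi.c this would replace the two nodename lines with something like "char nodename[256]; get_nodename(nodename, sizeof(nodename));". A generously sized buffer is the simpler choice here, since some C libraries report an error rather than silently truncating when the name does not fit.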


Thank you again for helping us to solve the problem!

Kind regards,
Danny

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Danny Rotscher
HPC-Support

Technische Universität Dresden
Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
01062 Dresden
Tel.: +49 351 463-35853
Fax : +49 351 463-37773
E-Mail: [email protected]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
