Hello,
I have now checked manual checkpointing with MVAPICH2; until now I had only tested it in combination with Slurm.

> Try the manual checkpoint/restart instructions in the MVAPICH2 documentation first, outside of SLURM if possible. I want to make sure your MPI is working with BLCR first. There are instructions in the MVAPICH user guide for how to do this:
> http://mvapich.cse.ohio-state.edu/static/media/mvapich/mvapich2-2.0-userguide.html

First of all, I should mention that we do not currently use MVAPICH2 on our production system.
So I installed a minimal configuration of MVAPICH2 as follows:
1. Configuration and installation:
../configure --prefix=/scratch/hpcsupport/checkpoint_test/neu/mvapich2 --enable-ckpt --with-blcr=/opt/blcr
make -j 8 && make install
2. Export environment variables:
export PATH=/scratch/hpcsupport/checkpoint_test/neu/mvapich2/bin:$PATH
export LD_LIBRARY_PATH=/scratch/hpcsupport/checkpoint_test/neu/mvapich2/lib:$LD_LIBRARY_PATH
3. Compile a small MPI example:
testcode_mpi.c:
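#include <stdio.h>
#include <mpi.h>
#include <unistd.h>

int main(int argc, char **argv) {
    int rank, ntasks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    /* Zero-initialized buffer; passing size - 1 keeps the last byte NUL. */
    char nodename[64] = "";
    gethostname(nodename, sizeof(nodename) - 1);

    unsigned long long i, j;
    double d = 0.0;
    /* Long-running loop so there is time to take a checkpoint. */
    for (i = 0; i < 1e10; i++) {
        d += 0.001;
        for (j = 0; j < 1e1; j++);  /* short busy-wait */
        /* Report progress every 100 million iterations. */
        if ((i % 100000000) == 0) {
            printf("node: %s, rank: %d, step: %.2lf\n", nodename, rank, d);
            fflush(stdout);
        }
    }

    MPI_Finalize();
    return 0;
}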
mpicc testcode_mpi.c -o testcode_mpi
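As a side note on the test program: it stores the hostname in a fixed-size buffer, and POSIX leaves it unspecified whether gethostname() NUL-terminates a name that had to be truncated. A minimal sketch of a more defensive variant is shown here; get_nodename is a hypothetical helper name chosen for illustration and is not part of the original test code.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper (not part of the original test program):
 * copy the hostname into buf and guarantee NUL termination even
 * if the name was truncated.
 * Returns 0 on success, -1 on invalid arguments or if gethostname()
 * reported an error (buf is still NUL-terminated in that case). */
int get_nodename(char *buf, size_t len)
{
    if (buf == NULL || len == 0)
        return -1;

    buf[0] = '\0';
    int rc = gethostname(buf, len);

    /* gethostname() may leave buf unterminated when the name does
     * not fit, so terminate explicitly. */
    buf[len - 1] = '\0';

    return rc == 0 ? 0 : -1;
}
```

In the test program this could replace the direct gethostname() call, for example get_nodename(nodename, sizeof nodename). It does not affect the checkpoint/restart behaviour itself.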
4. Run testcode_mpi in one terminal:
mpiexec -np 2 ./testcode_mpi
5. Make a checkpoint in a second terminal:
mv2_checkpoint
  PID USER     TT    COMMAND %CPU   VSZ START CMD
24082 rotscher pts/0 mpiexec  0.0 24324 06:37 mpiexec -np 2 ./testcode_mpi
Enter PID to checkpoint or Control-C to exit: 24082
Checkpointing PID 24082
Checkpoint file: context.24082
6. Try to restart the process, both in the first shell and in the second shell; in both cases it fails with an error:
cr_restart context.24082
[mpiexec@taurusi4005] HYDT_dmxu_poll_wait_for_event (../../../../src/pm/hydra/tools/demux/demux_poll.c:70): assert (!(pollfds[i].revents & ~POLLIN & ~POLLOUT & ~POLLHUP & ~POLLERR)) failed
[mpiexec@taurusi4005] HYD_pmci_wait_for_completion (../../../../src/pm/hydra/pm/pmiserv/pmiserv_pmci.c:198): error waiting for event
[mpiexec@taurusi4005] main (../../../../src/pm/hydra/ui/mpich/mpiexec.c:344): process manager error waiting for completion

Thank you again for helping us to solve the problem!

Kind regards,
Danny

--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Danny Rotscher
HPC-Support
Technische Universität Dresden
Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH)
01062 Dresden
Tel.: +49 351 463-35853
Fax : +49 351 463-37773
E-Mail: [email protected]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
