I was so careless. The BLCR Admin Guide says to load the kernel modules, as root, in this order:

# /sbin/insmod /usr/local/lib/blcr/2.6.12-1.234/blcr_imports.ko
# /sbin/insmod /usr/local/lib/blcr/2.6.12-1.234/blcr.ko

In my last email I had loaded the modules in the wrong order. Once I followed the order above, it succeeded. lol

I really thank you for your advice, Josh. Many thanks.
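For reference, the same load-and-verify sequence written against the running kernel (a sketch: it assumes the modules were built for the kernel reported by `uname -r` and installed under the same prefix as above):

  # /sbin/insmod /usr/local/lib/blcr/`uname -r`/blcr_imports.ko
  # /sbin/insmod /usr/local/lib/blcr/`uname -r`/blcr.ko
  # /sbin/lsmod | grep blcr

If both modules loaded in the right order, the last command should list blcr and blcr_imports.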
I really thank you for your advice, Josh. As you suggested, when I check 'lsmod | grep blcr' on blade02, nothing shows, which means no blcr module is loaded on blade02. I think that is the main reason why I can't C/R MPI programs across these two nodes.

But here is the problem: I installed BLCR under /opt/blcr on blade01. Our blades use NFS, and the /opt/ and /home/ directories are shared, so commands like 'cr_run' and 'cr_restart' can be found on blade02 as well. But I can't insert the blcr module on blade02. It shows:

insmod: error inserting '/opt/blcr/lib/blcr/2.6.16.60-0.21-smp/blcr.ko': -1 Unknown symbol in module

Does this mean that I have to install BLCR on blade02 too? If so, where should I install it? Just overwrite /opt/blcr, or somewhere else? Please give me some advice. Thank you.

On Aug 24, 2010, at 10:27 AM, whchen wrote:

> Dear OMPI users,
>
> I configured and installed OpenMPI-1.4.2 and BLCR-0.8.2 (blade01 -
> blade10, NFS). The BLCR configure script:
>
>   ./configure --prefix=/opt/blcr --enable-static
>
> After the installation, I can see the 'blcr' module loaded correctly
> (lsmod | grep blcr), and I can also run 'cr_run', 'cr_checkpoint', and
> 'cr_restart' to C/R the examples correctly under /blcr/examples/.
>
> Then the OMPI configure script is:
>
>   ./configure --prefix=/opt/ompi --with-ft=cr --with-blcr=/opt/blcr
>               --enable-ft-thread --enable-mpi-threads --enable-static
>
> The installation is okay too.
>
> Then here comes the problem.
> On one node:
>
>   mpirun -np 2 ./hello_c.c
>   mpirun -np 2 -am ft-enable-cr ./hello_c.c
>
> are both okay.
> On two nodes (blade01, blade02):
>
>   mpirun -np 2 -machinefile mf ./hello_c.c                   is OK, but
>   mpirun -np 2 -machinefile mf -am ft-enable-cr ./hello_c.c  fails with
>
> the error listed below:
>
> *** An error occurred in MPI_Init
> *** before MPI was initialized
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
> [blade02:28896] Abort before MPI_INIT completed successfully; not able
> to guarantee that all other processes were killed!
> --------------------------------------------------------------------------
> It looks like opal_init failed for some reason; your parallel process
> is likely to abort. There are many reasons that a parallel process can
> fail during opal_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   opal_cr_init() failed failed
>   --> Returned value -1 instead of OPAL_SUCCESS
> --------------------------------------------------------------------------
> [blade02:28896] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file
> runtime/orte_init.c at line 77
> --------------------------------------------------------------------------
> It looks like MPI_INIT failed for some reason; your parallel process
> is likely to abort. There are many reasons that a parallel process can
> fail during MPI_INIT; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   ompi_mpi_init: orte_init failed
>   --> Returned "Error" (-1) instead of "Success" (0)
> --------------------------------------------------------------------------
>
> I have no idea about the error. Our blades use NFS; does that matter?
> Can anyone help me solve the problem? I really appreciate it. Thank you.
>
> btw, a similar error:
>
> "Oops, cr_init() failed (the initialization call to the BLCR
> checkpointing system).
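For what it's worth, the 'Unknown symbol in module' error usually means either that blcr_imports.ko was not loaded before blcr.ko, or that the module was built against a kernel other than the one running. A quick diagnostic sketch (assumes root on blade02; the module path is the one from the error above):

  # uname -r
    (compare the output with the 2.6.16.60-0.21-smp in the module path)
  # /sbin/insmod /opt/blcr/lib/blcr/2.6.16.60-0.21-smp/blcr_imports.ko
  # /sbin/insmod /opt/blcr/lib/blcr/2.6.16.60-0.21-smp/blcr.ko
  # dmesg | tail
    (if insmod still fails, dmesg names the exact unresolved symbols)

If blade02 runs a different kernel than blade01, the modules have to be built for blade02's kernel, even though the NFS-shared /opt/blcr already provides the user-space tools.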
> Abort in despair.
> The crmpi SSI subsystem failed to initialized modules successfully
> during MPI_INIT. This is a fatal error; I must abort."
>
> occurs when I use LAM/MPI + BLCR.

This seems to indicate that BLCR is not working correctly on one of the compute nodes. Did you try some of the BLCR example programs on both of the compute nodes? If BLCR's cr_init() fails, then there is not much the MPI library can do for you.

I would check the installation of BLCR on all of the compute nodes (blade01 and blade02). Make sure the modules are loaded and that the BLCR single-process examples work on all nodes. I suspect that one of the nodes is having trouble initializing the BLCR library. You may also want to check that prelinking is turned off on all nodes as well:
  https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink

If that doesn't work, then I would suggest trying the current Open MPI trunk.

There should not be any problem with using NFS: since this failure occurs in MPI_Init, it is well before we ever try to use the file system. I also test with NFS and local staging on a fairly regular basis, so NFS shouldn't be a problem even when checkpointing/restarting.

-- Josh

> Regards
>
> whchen

------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://www.cs.indiana.edu/~jjhursey
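A sketch of the per-node checks described above (the node names are from this thread; the prelink configuration file path is distro-specific, and /etc/sysconfig/prelink is an assumption that holds on RHEL/Fedora-style systems):

  for node in blade01 blade02; do
      ssh $node 'lsmod | grep blcr'                        # are the modules loaded?
      ssh $node 'grep PRELINKING /etc/sysconfig/prelink'   # should report PRELINKING=no
  done

Per the BLCR FAQ linked above, prelinking should be disabled (PRELINKING=no on such systems), since prelink rewrites the load addresses of shared libraries and can prevent a checkpointed process from being restored.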