Re: [OMPI users] OpenMPI with BLCR runtime problem

2010-08-25 Thread 陈文浩
I was so careless. The BLCR Admin Guide says: as root, load the kernel
modules in this order:
# /sbin/insmod /usr/local/lib/blcr/2.6.12-1.234/blcr_imports.ko
# /sbin/insmod /usr/local/lib/blcr/2.6.12-1.234/blcr.ko
In my last email I had loaded the kernel modules in the wrong order. Once I
followed the order above, it succeeded. lol
Thank you very much for your advice, Josh.
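
For anyone hitting the same "Unknown symbol in module" error, the load order
above could be scripted roughly as follows. This is only a sketch under some
assumptions: BLCR_PREFIX and INSMOD are illustrative variable names (not BLCR
settings), and the module directory layout <prefix>/lib/blcr/<kernel release>
is taken from the paths quoted in this thread.

```shell
#!/bin/sh
# Sketch: load the BLCR kernel modules in the order the Admin Guide gives.
# BLCR_PREFIX and INSMOD are illustrative names, not part of BLCR itself.
BLCR_PREFIX="${BLCR_PREFIX:-/opt/blcr}"
INSMOD="${INSMOD:-/sbin/insmod}"

load_blcr_modules() {
    # The modules sit in a directory named after the kernel they were built for.
    kernel="${1:-$(uname -r)}"
    moddir="$BLCR_PREFIX/lib/blcr/$kernel"

    # blcr.ko depends on symbols from blcr_imports.ko, so the latter goes first.
    for mod in blcr_imports blcr; do
        if [ ! -f "$moddir/$mod.ko" ]; then
            echo "missing $moddir/$mod.ko (built for another kernel?)" >&2
            return 1
        fi
        "$INSMOD" "$moddir/$mod.ko" || return 1
    done
}
```

Run as root on each node, e.g. `load_blcr_modules` with no argument to use the
running kernel's release. The existence check is there because a module
directory built on one node only helps another node if both run the same
kernel.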

Re: [OMPI users] OpenMPI with BLCR runtime problem

2010-08-25 Thread 陈文浩
Thank you very much for your advice, Josh. As you suggested, I ran 'lsmod |
grep blcr' on blade02, and nothing shows. That means no blcr module is loaded
on blade02. I think that's the main reason why I can't C/R MPI programs
across these two nodes.
But here is the problem:
I installed BLCR under /opt/blcr on blade01. Our blades use NFS, and the
/opt/ and /home/ directories are shared, so commands such as 'cr_run' and
'cr_restart' can be found on blade02. But I can't insert the blcr module on
blade02. It shows:
insmod: error inserting '/opt/blcr/lib/blcr/2.6.16.60-0.21-smp/blcr.ko': -1
Unknown symbol in module
Does this mean that I have to install BLCR on blade02 as well? If so, where
should I install it? Should I overwrite /opt/blcr or put it somewhere else?
Please give me some advice. Thank you.





Re: [OMPI users] OpenMPI with BLCR runtime problem

2010-08-24 Thread Joshua Hursey

On Aug 24, 2010, at 10:27 AM, 陈文浩 wrote:

> Dear OMPI users,
>  
> I configured and installed OpenMPI-1.4.2 and BLCR-0.8.2. (blade01 - blade10, 
> nfs)
> BLCR configure script: ./configure --prefix=/opt/blcr --enable-static
> After the installation, I can see the ‘blcr’ module loaded correctly (lsmod | 
> grep blcr). And I can also run ‘cr_run’, ‘cr_checkpoint’, ‘cr_restart’ to C/R 
> the examples correctly under /blcr/examples/.
> Then, OMPI configure script is: ./configure --prefix=/opt/ompi --with-ft=cr 
> --with-blcr=/opt/blcr --enable-ft-thread --enable-mpi-threads --enable-static
> The installation is okay too.
>  
> Then here comes the problem.
> On one node:
>  mpirun -np 2 ./hello_c.c
>  mpirun -np 2 -am ft-enable-cr ./hello_c.c
>  are both okay.
> On two nodes(blade01, blade02):
>  mpirun -np 2 -machinefile mf ./hello_c.c  OK.
> mpirun -np 2 -machinefile mf -am ft-enable-cr ./hello_c.c ERROR. Listed 
> below:
>  
> *** An error occurred in MPI_Init 
> *** before MPI was initialized 
> *** MPI_ERRORS_ARE_FATAL (your MPI job will now abort) 
> [blade02:28896] Abort before MPI_INIT completed successfully; not able to 
> guarantee that all other processes were killed! 
> -- 
> It looks like opal_init failed for some reason; your parallel process is 
> likely to abort. There are many reasons that a parallel process can 
> fail during opal_init; some of which are due to configuration or 
> environment problems. This failure appears to be an internal failure; 
> here's some additional information (which may only be relevant to an 
> Open MPI developer):
>   opal_cr_init() failed failed 
>   --> Returned value -1 instead of OPAL_SUCCESS 
> -- 
> [blade02:28896] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file 
> runtime/orte_init.c at line 77 
> -- 
> It looks like MPI_INIT failed for some reason; your parallel process is 
> likely to abort. There are many reasons that a parallel process can 
> fail during MPI_INIT; some of which are due to configuration or environment 
> problems. This failure appears to be an internal failure; here's some 
> additional information (which may only be relevant to an Open MPI 
> developer):
>   ompi_mpi_init: orte_init failed 
>   --> Returned "Error" (-1) instead of "Success" (0) 
> --
>  
> I have no idea about the error. Our blades use nfs, does it matter? Can 
> anyone help me solve the problem? I really appreciate it. Thank you.
>  
> btw, similar error like:
> “Oops, cr_init() failed (the initialization call to the BLCR checkpointing 
> system). Abort in despair.
> The crmpi SSI subsystem failed to initialized modules successfully during 
> MPI_INIT. This is a fatal error; I must abort.” occurs when I use LAM/MPI + 
> BLCR.

This seems to indicate that BLCR is not working correctly on one of the compute 
nodes. Did you try some of the BLCR example programs on both of the compute 
nodes? If BLCR's cr_init() fails, then there is not much the MPI library can do 
for you.

I would check the installation of BLCR on all of the compute nodes (blade01 and 
blade02). Make sure the modules are loaded and that the BLCR single process 
examples work on all nodes. I suspect that one of the nodes is having trouble 
initializing the BLCR library.
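
A rough way to run that check across nodes is sketched below. It is only an
illustration: REMOTE, blcr_loaded and check_nodes are made-up names, it assumes
password-less ssh to each node and that lsmod lives at /sbin/lsmod, and it
reads hostnames from the same kind of machinefile passed to mpirun.

```shell
#!/bin/sh
# Sketch: report which nodes in a machinefile lack the BLCR kernel modules.
# REMOTE is an illustrative hook for the remote shell command (default: ssh).
REMOTE="${REMOTE:-ssh}"

# Read `lsmod` output on stdin; succeed only if both BLCR modules are listed.
blcr_loaded() {
    awk '$1 == "blcr" { a = 1 } $1 == "blcr_imports" { b = 1 }
         END { exit !(a && b) }'
}

check_nodes() {
    # One hostname per line; anything after the hostname is ignored.
    while read -r node rest; do
        case "$node" in
            ''|\#*) continue ;;   # skip blank lines and comments
        esac
        # </dev/null keeps the remote command from consuming the machinefile.
        if $REMOTE "$node" /sbin/lsmod </dev/null | blcr_loaded; then
            echo "$node: BLCR modules loaded"
        else
            echo "$node: BLCR modules NOT loaded"
        fi
    done < "$1"
}
```

For example, `check_nodes mf` prints one line per node. A node reported as NOT
loaded is the first place to look before rerunning mpirun with
`-am ft-enable-cr`.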

You may also want to check to make sure prelinking is turned off on all nodes 
as well:
  https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink

If that doesn't work then I would suggest trying the current Open MPI trunk. 
There should not be any problem with using NFS: the failure occurs in 
MPI_Init, which is well before we ever try to use the file system. I also test 
with NFS and local staging on a fairly regular basis, so it shouldn't be a 
problem even when checkpointing/restarting.

-- Josh

>  
> Regards
>  
> whchen
>  
> 


Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://www.cs.indiana.edu/~jjhursey







[OMPI users] OpenMPI with BLCR runtime problem

2010-08-24 Thread 陈文浩
Dear OMPI users,

 

I configured and installed OpenMPI-1.4.2 and BLCR-0.8.2. (blade01 -
blade10, nfs)

BLCR configure script: ./configure --prefix=/opt/blcr --enable-static

After the installation, I can see the ‘blcr’ module loaded correctly
(lsmod | grep blcr). And I can also run ‘cr_run’, ‘cr_checkpoint’,
‘cr_restart’ to C/R the examples correctly under /blcr/examples/.

Then, OMPI configure script is: ./configure --prefix=/opt/ompi --with-ft=cr
--with-blcr=/opt/blcr --enable-ft-thread --enable-mpi-threads
--enable-static

The installation is okay too.

 

Then here comes the problem.

On one node:

 mpirun -np 2 ./hello_c.c

 mpirun -np 2 -am ft-enable-cr ./hello_c.c

 are both okay.

On two nodes(blade01, blade02):

 mpirun -np 2 -machinefile mf ./hello_c.c  OK.

mpirun -np 2 -machinefile mf -am ft-enable-cr ./hello_c.c ERROR. Listed
below:

 

*** An error occurred in MPI_Init 
*** before MPI was initialized 
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort) 
[blade02:28896] Abort before MPI_INIT completed successfully; not able to
guarantee that all other processes were killed! 
-- 
It looks like opal_init failed for some reason; your parallel process is 
likely to abort. There are many reasons that a parallel process can 
fail during opal_init; some of which are due to configuration or 
environment problems. This failure appears to be an internal failure; 
here's some additional information (which may only be relevant to an 
Open MPI developer): 

  opal_cr_init() failed failed 
  --> Returned value -1 instead of OPAL_SUCCESS 
-- 
[blade02:28896] [[INVALID],INVALID] ORTE_ERROR_LOG: Error in file
runtime/orte_init.c at line 77 
-- 
It looks like MPI_INIT failed for some reason; your parallel process is 
likely to abort. There are many reasons that a parallel process can 
fail during MPI_INIT; some of which are due to configuration or environment 
problems. This failure appears to be an internal failure; here's some 
additional information (which may only be relevant to an Open MPI 
developer): 

  ompi_mpi_init: orte_init failed 
  --> Returned "Error" (-1) instead of "Success" (0) 
-- 

 

I have no idea what is causing the error. Our blades use NFS; does that
matter? Can anyone help me solve the problem? I would really appreciate it.
Thank you.

 

btw, a similar error occurs when I use LAM/MPI + BLCR:

“Oops, cr_init() failed (the initialization call to the BLCR checkpointing
system). Abort in despair.

The crmpi SSI subsystem failed to initialized modules successfully during
MPI_INIT. This is a fatal error; I must abort.”

 

Regards

 

whchen