[OMPI users] fault tolerance support via apt-get

2009-10-06 Thread Hui Jin

Hi,
I was trying to install Open MPI with fault tolerance support (BLCR) on 
my cluster.

The OS is Ubuntu 9.04 server version (64-bit).
I was able to install Open MPI via apt-get:

# apt-get install libopenmpi-dev libopenmpi1 openmpi-bin openmpi-common openmpi-doc


However, it seems that the checkpointing functionality is not included 
in these packages by default.
Could you please let me know whether there is any way to install Open MPI 
with checkpointing support via apt-get?

Or do I have to build and install from source via make?
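
(In case a source build turns out to be the answer: from the documentation
I gather that checkpoint/restart support is enabled at configure time,
roughly as below. The BLCR install path here is only a guess for my system.

$ ./configure --with-ft=cr --with-blcr=/usr/local/blcr \
    --enable-ft-thread --enable-mpi-threads
$ make all install
)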

Thanks,
Hui Jin



Re: [OMPI users] Program hangs when run in the remote host ...

2009-10-06 Thread Ashley Pittman
On Tue, 2009-10-06 at 12:22 +0530, souvik bhattacherjee wrote:

> This implies that one has to copy the executable to the remote host
> each time one wants to run a program that is different from the
> previous one. 

This is correct; only the name of the executable is passed to each node,
and that executable is then executed locally.

> Is the implication correct, or is there some way around it?

Typically some kind of shared filesystem would be used, NFS for
example.
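
A minimal sketch, assuming one machine exports /home over NFS; the
hostnames and export options here are only illustrative. On the exporting
host, add a line to /etc/exports and re-export:

/home  ict1(rw,sync,no_subtree_check)

# exportfs -ra

Then on the other host:

# mount -t nfs ict2:/home /home

After that the same path resolves to the same binary on both machines, so
nothing needs to be copied around.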

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



Re: [OMPI users] Program hangs when run in the remote host ...

2009-10-06 Thread souvik bhattacherjee
Finally, it seems I'm able to run my program on a remote host.

The problem was due to firewall settings. Resetting the firewall to the
default ACCEPT policy, as shown below, did the trick.

# /etc/init.d/ip6tables stop
Resetting built-in chains to the default ACCEPT policy: [  OK  ]
# /etc/init.d/iptables stop
Resetting built-in chains to the default ACCEPT policy: [  OK  ]
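
(A less drastic alternative, which I believe would also work, is to leave
the firewall running and just accept traffic from the other host, e.g.:

# iptables -I INPUT -s ict1 -j ACCEPT

The hostname here is illustrative. As far as I know, Open MPI picks its
TCP ports dynamically, so a host-based rule is simpler than per-port ones.)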

Another related query:

Let me mention once again that I had installed openmpi-1.3.3 separately on
two of my machines, ict1 and ict2. Now when I issue the following command:

$ mpirun --prefix /usr/local/openmpi-1.3.3/ -np 4 --host ict2,ict1 hello_c
--------------------------------------------------------------------------
mpirun was unable to launch the specified application as it could not find
an executable:

Executable: hello_c
Node: ict1

while attempting to start process rank 1.
--------------------------------------------------------------------------

So, I did a *make* in the examples directory on ict1 to generate the
executable. (One can also copy the executable from ict2 to ict1 into the
same directory, as shown below.)
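
For instance, something along these lines, where the target path is the
examples directory of the build tree on ict1:

$ scp hello_c ict1:/home/souvik/software/openmpi-1.3.3/examples/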

Now, it seems to run fine.

$ mpirun --prefix /usr/local/openmpi-1.3.3/ -np 4 --host ict2,ict1 hello_c
Hello, world, I am 0 of 8
Hello, world, I am 2 of 8
Hello, world, I am 4 of 8
Hello, world, I am 6 of 8
Hello, world, I am 5 of 8
Hello, world, I am 3 of 8
Hello, world, I am 7 of 8
Hello, world, I am 1 of 8
$

This implies that one has to copy the executable to the remote host each
time one wants to run a program that is different from the previous one.

Is the implication correct, or is there some way around it?

Thanks,


On Mon, Sep 21, 2009 at 1:54 PM, souvik bhattacherjee wrote:

> As Ralph suggested, I *reversed the order of my PATH settings*:
>
> This is what it shows:
>
> $ echo $PATH
>
> /usr/local/openmpi-1.3.3/bin/:/usr/bin:/bin:/usr/local/bin:/usr/X11R6/bin/:/usr/games:/usr/lib/qt4/bin:/usr/bin:/opt/kde3/bin
>
> $ echo $LD_LIBRARY_PATH
> /usr/local/openmpi-1.3.3/lib/
>
> Moreover, I checked that there were *NO* system-supplied versions of OMPI
> previously installed. (I did install MPICH2 earlier, but I had removed the
> binaries and the related files.) This is because
>
> $ locate mpicc
>
> /home/souvik/software/openmpi-1.3.3/build/ompi/contrib/vt/wrappers/mpicc-vt-wrapper-data.txt
>
> /home/souvik/software/openmpi-1.3.3/build/ompi/tools/wrappers/mpicc-wrapper-data.txt
> /home/souvik/software/openmpi-1.3.3/build/ompi/tools/wrappers/mpicc.1
>
> /home/souvik/software/openmpi-1.3.3/contrib/platform/win32/ConfigFiles/mpicc-wrapper-data.txt.cmake
>
> /home/souvik/software/openmpi-1.3.3/ompi/contrib/vt/wrappers/mpicc-vt-wrapper-data.txt
> /home/souvik/software/openmpi-1.3.3/ompi/contrib/vt/wrappers/mpicc-vt-wrapper-data.txt.in
>
> /home/souvik/software/openmpi-1.3.3/ompi/tools/wrappers/mpicc-wrapper-data.txt
> /home/souvik/software/openmpi-1.3.3/ompi/tools/wrappers/mpicc-wrapper-data.txt.in
> /usr/local/openmpi-1.3.3/bin/mpicc
> /usr/local/openmpi-1.3.3/bin/mpicc-vt
> /usr/local/openmpi-1.3.3/share/man/man1/mpicc.1
> /usr/local/openmpi-1.3.3/share/openmpi/mpicc-vt-wrapper-data.txt
> /usr/local/openmpi-1.3.3/share/openmpi/mpicc-wrapper-data.txt
>
> does not show the occurrence of mpicc in any directory related to MPICH2.
>
> The results are the same with mpirun:
>
> $ locate mpirun
> /home/souvik/software/openmpi-1.3.3/build/ompi/tools/ortetools/mpirun.1
> /home/souvik/software/openmpi-1.3.3/ompi/runtime/mpiruntime.h
> /usr/local/openmpi-1.3.3/bin/mpirun
> /usr/local/openmpi-1.3.3/share/man/man1/mpirun.1
>
> *These tests were done both on ict1 and ict2*.
>
> I performed another test to check whether the required files are found
> on the remote host. The program was run from ict2.
>
> $ cd /home/souvik/software/openmpi-1.3.3/examples/
>
> $ mpirun -np 4 --host ict2,ict1 hello_c
> bash: orted: command not found
> --------------------------------------------------------------------------
> A daemon (pid 28023) died unexpectedly with status 127 while attempting
> to launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> mpirun: clean termination accomplished
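>
> (The "orted: command not found" error suggests that non-interactive
> logins on the remote node do not have the install directory on their
> PATH. My understanding is that adding the paths to the shell startup
> file on each remote node avoids the need for --prefix, along these
> lines, placed in ~/.bashrc on ict1:
>
> export PATH=/usr/local/openmpi-1.3.3/bin:$PATH
> export LD_LIBRARY_PATH=/usr/local/openmpi-1.3.3/lib:$LD_LIBRARY_PATH
> )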
>
> $ mpirun --prefix /usr/local/openmpi-1.3.3/ -np 4 --host ict2,ict1 hello_c
>
> *This command-line statement as usual does not produce any output. On
> pressing Ctrl+C, the