On Mar 3, 2006, at 9:07 AM, Jose Pedro Garcia Mahedero wrote:
cluster master machine:
eth0, mpihosts_out --> for outside use (gets its own IP via DHCP)
eth1, mpihosts_cluster --> for cluster use (serves IPs to the
cluster nodes)
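(For context, the two hostfiles presumably look something like the
sketch below. The hostnames are hypothetical; the actual files were
not posted.)

    # mpihosts_cluster -- private names/addresses served over eth1
    node01
    node02

    # mpihosts_out -- externally visible names reached over eth0
    node01.example.com
    node02.example.com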
------- TESTS 1, 2: openmpi-1.0.2a9 -------
1. cd openmpi-1.0.1
2. make clean
3. cd openmpi-1.0.2a9
4. ./configure
5. make all install
No parameters set in /usr/local/etc/openmpi-mca-params.conf.
mpirun -np 2 --hostfile mpihosts_cluster ping_pong
mpirun -np 2 --hostfile mpihosts_out ping_pong
Both commands give the same result:
Signal:11 info.si_errno:0 (Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x6
*** End of error message ***
[0] func:/usr/local/lib/libopal.so.0 [0x40101cb2]
[1] func:[0xffffe440]
[2] func:/usr/local/lib/openmpi/mca_btl_tcp.so [0x404541d6]
[3] func:/usr/local/lib/openmpi/mca_btl_tcp.so(mca_btl_tcp_add_procs+0x149) [0x404502f9]
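(An aside on that empty MCA params file: with nothing set, the TCP BTL
will consider all interfaces on the dual-NIC master. If one wanted to
restrict it to the cluster-facing interface, the file could contain
something like the example below. The btl_tcp_if_include parameter is
a real Open MPI MCA parameter; the interface value is just an
assumption based on the setup described above.)

    # /usr/local/etc/openmpi-mca-params.conf
    # example only: limit the TCP BTL to the cluster-facing NIC
    btl_tcp_if_include = eth1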
Yoinks -- whatever we do, we should not be seg faulting. :-( It is
apparently dying in the mca_btl_tcp_add_procs() function, which is
where we create the MPI network mappings from one TCP peer to
another.
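Since add_procs runs during startup, even a trivial point-to-point
program should exercise the same code path. The actual ping_pong
source was not posted; a minimal sketch of such a test, using only
standard MPI calls, might look like:

    /* ping_pong.c -- minimal sketch of a two-rank ping-pong test */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, buf = 0;
        MPI_Status status;

        /* In Open MPI, the BTL add_procs calls happen during init */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (0 == rank) {
            /* ping: send to rank 1, wait for the reply */
            MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);
            printf("ping-pong complete\n");
        } else if (1 == rank) {
            /* pong: receive from rank 0, echo it back */
            MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

Compile with "mpicc ping_pong.c -o ping_pong" and run it with the same
mpirun command lines shown above.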
I am unable to reproduce this error (I tried it on a cluster with a
setup similar to yours). Can you recompile the TCP BTL with
debugging symbols so that we can get a little more information?
Do the following:
cd top_of_your_open_mpi_source_tree
cd ompi/mca/btl/tcp
make CFLAGS=-g clean all install
Then run the test again (you shouldn't need to recompile your
application; this just recompiles and re-installs the TCP BTL
plugin). The stack trace for the mca_btl_tcp frames should now
include line numbers and tell us exactly where it is dying. If you
get a corefile, can you load it up in gdb and send the output of
"bt full"?
--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/