Hi all,
I've got a problem sending messages from one of my machines. It
shows up with MPI_Send/MPI_Recv as well as with MPI_Bcast. The simplest
case I've found is two processes: rank 0 sends a simple message and
rank 1 receives it. I launch both processes with mpirun -np 2.
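
In case it helps, the whole test boils down to something like this (a
slightly simplified sketch from memory; the message contents and the
hostname lookup are reconstructed, but the Send/Recv calls are the real
ones):

----------
#include <mpi.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char** argv)
{
    MPI::Init(argc, argv);
    int rank = MPI::COMM_WORLD.Get_rank();
    int size = MPI::COMM_WORLD.Get_size();

    // my debug line: "mpirank of comm_size hostname"
    char hostname[256];
    gethostname(hostname, sizeof(hostname));
    std::printf("%02d of %d %s\n", rank, size, hostname);

    char message[5] = "ping";                                  // 5 bytes incl. '\0'
    if (rank == 0)
        MPI::COMM_WORLD.Send(message, 5, MPI::CHAR, 1, 13);    // blocks in the bad case
    else if (rank == 1)
        MPI::COMM_WORLD.Recv(message, 5, MPI::CHAR, 0, 13);

    MPI::Finalize();
    return 0;
}
----------

I run it as e.g. 'mpirun -np 2 -host host,client01 ./test' (the exact
host list varies with the cases below).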
- when both processes run on the host machine, it works fine;
- when both processes run on client machines (either the same one or
two different ones), it works fine;
- when the sender runs on one of the client machines and the receiver
on the host machine, it works fine;
- when the sender runs on the host machine and the receiver on a
client machine, it blocks.

This last case is my problem. When I add '--mca btl_base_verbose 30'
to the mpirun command line, I get:

----------
[host:28186] mca: base: components_open: Looking for btl components
[host:28186] mca: base: components_open: opening btl components
[host:28186] mca: base: components_open: found loaded component self
[host:28186] mca: base: components_open: component self has no register function
[host:28186] mca: base: components_open: component self open function successful
[host:28186] mca: base: components_open: found loaded component sm
[host:28186] mca: base: components_open: component sm has no register function
[host:28186] mca: base: components_open: component sm open function successful
[host:28186] mca: base: components_open: found loaded component tcp
[host:28186] mca: base: components_open: component tcp has no register function
[host:28186] mca: base: components_open: component tcp open function successful
[host:28186] select: initializing btl component self
[host:28186] select: init of component self returned success
[host:28186] select: initializing btl component sm
[host:28186] select: init of component sm returned success
[host:28186] select: initializing btl component tcp
[host:28186] select: init of component tcp returned success
[client01:19803] mca: base: components_open: Looking for btl components
[client01:19803] mca: base: components_open: opening btl components
[client01:19803] mca: base: components_open: found loaded component self
[client01:19803] mca: base: components_open: component self has no register function
[client01:19803] mca: base: components_open: component self open function successful
[client01:19803] mca: base: components_open: found loaded component sm
[client01:19803] mca: base: components_open: component sm has no register function
[client01:19803] mca: base: components_open: component sm open function successful
[client01:19803] mca: base: components_open: found loaded component tcp
[client01:19803] mca: base: components_open: component tcp has no register function
[client01:19803] mca: base: components_open: component tcp open function successful
[client01:19803] select: initializing btl component self
[client01:19803] select: init of component self returned success
[client01:19803] select: initializing btl component sm
[client01:19803] select: init of component sm returned success
[client01:19803] select: initializing btl component tcp
[client01:19803] select: init of component tcp returned success
00 of 2 host
[host:28186] btl: tcp: attempting to connect() to address 10.0.7.97 on port 53255
01 of 2 client01
----------

The lines "00 of 2 host" and "01 of 2 client01" are just my debug
output, printing "mpirank of comm_size hostname". The second-to-last
line appears during the call to Send:
MPI::COMM_WORLD.Send(message, 5, MPI::CHAR, 1, 13);

When I run the sender on the host under strace, I get:

----------
...
connect(10, {sa_family=AF_INET, sin_port=htons(1024), sin_addr=inet_addr("10.0.7.97")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLOUT}], 7, 0) = 0 (Timeout)
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLOUT}], 7, 0) = 1 ([{fd=10, revents=POLLOUT}])
getsockopt(10, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
send(10, "D\227\0\1\0\0\0\0", 8, 0)     = 8
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}], 7, 0) = 0 (Timeout)
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}], 7, 0) = 1 ([{fd=10, revents=POLLIN}])
recv(10, "", 8, 0)                      = 0
close(10)                               = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 6, 0) = 0 (Timeout)
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 6, 0) = 0 (Timeout)
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 6, 0) = 0 (Timeout)
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 6, 0) = 0 (Timeout)
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 6, 0) = 0 (Timeout)
...
(forever)
...
----------

To me it looks like the connect() above is what establishes the
connection, and it appears to succeed (SO_ERROR comes back 0 and the
8-byte send() completes), but I'm afraid I don't fully understand what
all those poll() calls are supposed to do.
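
From what I've read, this looks like the standard non-blocking connect
pattern: put the socket in O_NONBLOCK mode, get EINPROGRESS back from
connect(), poll() for POLLOUT, then check SO_ERROR before using the
socket. This is only my reconstruction of the pattern for reference,
not Open MPI's actual code:

----------
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>
#include <cerrno>

// Non-blocking connect, as I understand the strace above.
int connect_nonblocking(const char* ip, unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

    sockaddr_in addr = {};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    addr.sin_addr.s_addr = inet_addr(ip);

    // In non-blocking mode connect() returns immediately with EINPROGRESS.
    if (connect(fd, (sockaddr*)&addr, sizeof(addr)) < 0 && errno != EINPROGRESS) {
        close(fd);
        return -1;
    }

    // The socket becomes writable (POLLOUT) once the TCP handshake
    // completes, whether it succeeded or failed.
    pollfd pfd = { fd, POLLOUT, 0 };
    poll(&pfd, 1, -1);

    // SO_ERROR reports whether the connect actually succeeded.
    int err = 0;
    socklen_t len = sizeof(err);
    getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
    if (err != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
----------

If that reading is right, the surprising part is not the connect but
the recv() that returns 0: client01 apparently closes the connection
right after receiving the 8-byte handshake, after which my sender just
keeps polling the remaining descriptors forever.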

Attaching gdb to the sender gives me:

----------
(gdb) bt
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x0064993b in poll () from /lib/libc.so.6
#2  0xf7df07b5 in poll_dispatch () from /home/gmaj/openmpi/lib/libopen-pal.so.0
#3  0xf7def8c3 in opal_event_base_loop () from /home/gmaj/openmpi/lib/libopen-pal.so.0
#4  0xf7defbe7 in opal_event_loop () from /home/gmaj/openmpi/lib/libopen-pal.so.0
#5  0xf7de323b in opal_progress () from /home/gmaj/openmpi/lib/libopen-pal.so.0
#6  0xf7c51455 in mca_pml_ob1_send () from /home/gmaj/openmpi/lib/openmpi/mca_pml_ob1.so
#7  0xf7ed9c60 in PMPI_Send () from /home/gmaj/openmpi/lib/libmpi.so.0
#8  0x0804e900 in main ()
----------

If anybody knows what might cause this problem, or what I could do to
track down the reason, any help would be appreciated.

I'm running Open MPI version 1.4.1.


Regards,
Grzegorz Maj
