Hi all,

I've got a problem with sending messages from one of my machines. It appears during MPI_Send/MPI_Recv as well as MPI_Bcast. The simplest case I've found is two processes: rank 0 sends a simple message and rank 1 receives it. I execute these processes using mpirun with -np 2.

- When both processes are executed on the host machine, it works fine.
- When both processes are executed on client machines (either the same one or two different ones), it works fine.
- When the sender is executed on one of the client machines and the receiver on the host machine, it works fine.
- When the sender is executed on the host machine and the receiver on a client machine, it blocks.

This last case is my problem.
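In case it helps, the test program boils down to the following sketch. I've trimmed it down, so the message contents and variable names here are illustrative, but the Send/Recv arguments (count 5, MPI::CHAR, tag 13) and the debug line are the ones I actually use:

----------
#include <mpi.h>
#include <unistd.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI::Init(argc, argv);
    int rank = MPI::COMM_WORLD.Get_rank();
    int size = MPI::COMM_WORLD.Get_size();

    // Debug line printed by every process: "mpirank of comm_size hostname"
    char hostname[64];
    gethostname(hostname, sizeof(hostname));
    std::printf("%02d of %d %s\n", rank, size, hostname);
    std::fflush(stdout);

    char message[5] = "hey";  // illustrative contents; 5 chars are sent
    if (rank == 0)
        MPI::COMM_WORLD.Send(message, 5, MPI::CHAR, 1, 13);  // this is the call that blocks
    else if (rank == 1)
        MPI::COMM_WORLD.Recv(message, 5, MPI::CHAR, 0, 13);

    MPI::Finalize();
    return 0;
}
----------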
When I add the option '--mca btl_base_verbose 30' to mpirun, I get:

----------
[host:28186] mca: base: components_open: Looking for btl components
[host:28186] mca: base: components_open: opening btl components
[host:28186] mca: base: components_open: found loaded component self
[host:28186] mca: base: components_open: component self has no register function
[host:28186] mca: base: components_open: component self open function successful
[host:28186] mca: base: components_open: found loaded component sm
[host:28186] mca: base: components_open: component sm has no register function
[host:28186] mca: base: components_open: component sm open function successful
[host:28186] mca: base: components_open: found loaded component tcp
[host:28186] mca: base: components_open: component tcp has no register function
[host:28186] mca: base: components_open: component tcp open function successful
[host:28186] select: initializing btl component self
[host:28186] select: init of component self returned success
[host:28186] select: initializing btl component sm
[host:28186] select: init of component sm returned success
[host:28186] select: initializing btl component tcp
[host:28186] select: init of component tcp returned success
[client01:19803] mca: base: components_open: Looking for btl components
[client01:19803] mca: base: components_open: opening btl components
[client01:19803] mca: base: components_open: found loaded component self
[client01:19803] mca: base: components_open: component self has no register function
[client01:19803] mca: base: components_open: component self open function successful
[client01:19803] mca: base: components_open: found loaded component sm
[client01:19803] mca: base: components_open: component sm has no register function
[client01:19803] mca: base: components_open: component sm open function successful
[client01:19803] mca: base: components_open: found loaded component tcp
[client01:19803] mca: base: components_open: component tcp has no register function
[client01:19803] mca: base: components_open: component tcp open function successful
[client01:19803] select: initializing btl component self
[client01:19803] select: init of component self returned success
[client01:19803] select: initializing btl component sm
[client01:19803] select: init of component sm returned success
[client01:19803] select: initializing btl component tcp
[client01:19803] select: init of component tcp returned success
00 of 2 host
[host:28186] btl: tcp: attempting to connect() to address 10.0.7.97 on port 53255
01 of 2 client01
----------

The lines "00 of 2 host" and "01 of 2 client01" are just my debug output, printing "mpirank of comm_size hostname". The second-to-last line appears during the call to Send:

MPI::COMM_WORLD.Send(message, 5, MPI::CHAR, 1, 13);

When I run the sender on the host under strace, I get:

----------
...
connect(10, {sa_family=AF_INET, sin_port=htons(1024), sin_addr=inet_addr("10.0.7.97")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLOUT}], 7, 0) = 0 (Timeout)
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLOUT}], 7, 0) = 1 ([{fd=10, revents=POLLOUT}])
getsockopt(10, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
send(10, "D\227\0\1\0\0\0\0", 8, 0) = 8
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}], 7, 0) = 0 (Timeout)
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=10, events=POLLIN}], 7, 0) = 1 ([{fd=10, revents=POLLIN}])
recv(10, "", 8, 0) = 0
close(10) = 0
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 6, 0) = 0 (Timeout)
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 6, 0) = 0 (Timeout)
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 6, 0) = 0 (Timeout)
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 6, 0) = 0 (Timeout)
poll([{fd=4, events=POLLIN}, {fd=5, events=POLLIN}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}], 6, 0) = 0 (Timeout)
... (forever) ...
----------

To me it looks like the connect() above is responsible for establishing the connection, but I'm not sure I understand what all those repeated poll() calls are supposed to do.
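The beginning at least seems to follow the usual non-blocking connect pattern: connect() returns EINPROGRESS, poll() is retried until the socket becomes writable, and getsockopt(SO_ERROR) then reports whether the connection actually succeeded. A minimal sketch of what I think that part of the trace corresponds to (my own reconstruction for illustration, not Open MPI's actual code):

----------
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <sys/socket.h>
#include <cerrno>

int connect_nonblocking(const char* ip, unsigned short port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    fcntl(fd, F_SETFL, O_NONBLOCK);

    sockaddr_in sin = {};
    sin.sin_family = AF_INET;
    sin.sin_port = htons(port);
    sin.sin_addr.s_addr = inet_addr(ip);

    if (connect(fd, (sockaddr*)&sin, sizeof(sin)) < 0 && errno != EINPROGRESS)
        return -1;                     // immediate failure

    pollfd pfd = {fd, POLLOUT, 0};
    while (poll(&pfd, 1, 0) == 0)      // 0 ms timeout: check and come back,
        ;                              // like the repeated polls in the trace

    int err = 0;
    socklen_t len = sizeof(err);
    getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
    return err == 0 ? fd : -1;         // err == 0 means connect succeeded
}
----------

What happens next in my trace is the part that worries me: the sender writes 8 bytes (apparently some kind of handshake header), then recv() returns 0, which as far as I know means the peer closed the connection; the socket is closed and the process just keeps polling the remaining descriptors forever.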
Attaching gdb to the sender gives me:

----------
(gdb) bt
#0  0xffffe410 in __kernel_vsyscall ()
#1  0x0064993b in poll () from /lib/libc.so.6
#2  0xf7df07b5 in poll_dispatch () from /home/gmaj/openmpi/lib/libopen-pal.so.0
#3  0xf7def8c3 in opal_event_base_loop () from /home/gmaj/openmpi/lib/libopen-pal.so.0
#4  0xf7defbe7 in opal_event_loop () from /home/gmaj/openmpi/lib/libopen-pal.so.0
#5  0xf7de323b in opal_progress () from /home/gmaj/openmpi/lib/libopen-pal.so.0
#6  0xf7c51455 in mca_pml_ob1_send () from /home/gmaj/openmpi/lib/openmpi/mca_pml_ob1.so
#7  0xf7ed9c60 in PMPI_Send () from /home/gmaj/openmpi/lib/libmpi.so.0
#8  0x0804e900 in main ()
----------

If anybody knows what might cause this problem, or what I could do to track down the reason, any help would be appreciated. My Open MPI version is 1.4.1.

Regards,
Grzegorz Maj