(This time with attachments...) Hi there,
I've had a look through the FAQ and searched the list archives and can't find any similar problems to this one. I'm running OpenMPI 1.2.2 on 10 Intel iMacs (Intel Core2 Duo CPU). I am specifying two slots per machine and starting my job with:

/Network/Guanine/csr201/local-i386/opt/openmpi/bin/mpirun -np 20 --hostfile bhost.jobControl nice -19 /Network/Guanine/csr201/jobControl/run_torus.pl /Network/Guanine/csr201/models-gap/torus/torus.ompiosx-intel

The config.log and output of 'ompi_info --all' are attached. Also attached is a small patch that I wrote to work around some firewall limitations on the nodes (I don't know if there's a better way to do this - suggestions are welcome). The patch may or may not be relevant, but I'm not ruling out network issues, and a bit of peer review never goes amiss in case I've done something very silly.

The programme that I'm trying to run is fairly hefty, so I'm afraid that I can't provide you with a simple test case to highlight the problem. The best I can do is provide you with a description of where I'm at and then ask for some advice/suggestions.

The code itself has run in the past with various versions of MPI/LAM and OpenMPI and hasn't, to my knowledge, undergone any significant changes recently. I have noticed delays before, both on this system and on others, when MPI_BARRIER is called, but they don't always result in a permanent 'spinning' of the process. The 20-process job (10 machines, 2 processes each) that I'm running right now is using 90-100% of every CPU, but hasn't made any progress for around 14 hours. I've used GDB to attach to each of these processes and verified that every single one of them is sitting inside a call to MPI_BARRIER. My understanding is that once every process hits the barrier, they should then move on to the next part of the code.

Here's an example of what I see when I attach to one of these processes:

------------------------------------------------------------------------------
Attaching to program: `/private/var/automount/Network/Guanine/csr201/models-gap/torus/torus.ompiosx-intel', process 29578.
Reading symbols for shared libraries ..+++++.................................................................... done
0x9000121c in sigprocmask ()
(gdb) where
#0  0x9000121c in sigprocmask ()
#1  0x01c46f96 in opal_evsignal_recalc ()
#2  0x01c458c2 in opal_event_base_loop ()
#3  0x01c45d32 in opal_event_loop ()
#4  0x01c3e6f2 in opal_progress ()
#5  0x01b6083e in ompi_request_wait_all ()
#6  0x01ec68d8 in ompi_coll_tuned_sendrecv_actual ()
#7  0x01ecbf64 in ompi_coll_tuned_barrier_intra_bruck ()
#8  0x01b75590 in MPI_Barrier ()
#9  0x01aec47a in mpi_barrier__ ()
#10 0x0011c66c in MAIN_ ()
#11 0x002870f9 in main (argc=1, argv=0xbfffe6ec)
(gdb)
------------------------------------------------------------------------------

Does anyone have any suggestions as to what might be happening here? Is there any way to 'tickle' the processes and get them to move on? What if some packets went missing on the network? Surely TCP should take care of this and resend? As implied by my line of questioning, my current thoughts are that some messages between nodes have somehow gone missing. Could this happen? What could cause this? All machines are on the same subnet.

I'm sorry my question is so open, but I don't know much about the internals of OpenMPI and how it passes messages, and I'm looking for some ideas on where to start searching!

Thanks in advance for any help or suggestions that you can offer,

Chris
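P.S. Two bits of extra context that might help. First, the hostfile: it just uses the standard OpenMPI 'slots' syntax, one line per machine. The real hostnames aren't interesting, so treat the names below as placeholders; the shape of bhost.jobControl is:

------------------------------------------------------------------------------
# bhost.jobControl - 10 machines x 2 slots = 20 MPI processes
imac01.example.net slots=2
imac02.example.net slots=2
# ... and so on, one line per machine ...
imac10.example.net slots=2
------------------------------------------------------------------------------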
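P.P.S. Second, the barrier we're stuck in is nothing exotic - just the ordinary blocking collective called from Fortran (hence the mpi_barrier__ frame in the backtrace). To be clear about the semantics I'm relying on, the C equivalent of the pattern is below. This is a minimal sketch of barrier semantics in general, not a cut-down version of the real code, which as I said I can't reduce to a test case:

------------------------------------------------------------------------------
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* ... each rank does its work ... */

    /* no rank can leave the barrier until every rank has entered it */
    MPI_Barrier(MPI_COMM_WORLD);

    printf("rank %d of %d is past the barrier\n", rank, size);
    MPI_Finalize();
    return 0;
}
------------------------------------------------------------------------------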
Attachments:
  ompi_config.log.gz (binary data)
  ompi_info.out.gz (binary data)
diff -ru openmpi-1.2.orig/ompi/mca/btl/tcp/btl_tcp_component.c openmpi-1.2/ompi/mca/btl/tcp/btl_tcp_component.c
--- openmpi-1.2.orig/ompi/mca/btl/tcp/btl_tcp_component.c	2007-01-14 02:39:42.000000000 +0000
+++ openmpi-1.2/ompi/mca/btl/tcp/btl_tcp_component.c	2007-04-20 17:08:17.000000000 +0100
@@ -393,6 +393,9 @@
     int flags;
     struct sockaddr_in inaddr;
     opal_socklen_t addrlen;
+    int min_port = 15000;
+    int max_port = 15999;
+    int tmp_port;
 
     /* create a listen socket for incoming connections */
     mca_btl_tcp_component.tcp_listen_sd = socket(AF_INET, SOCK_STREAM, 0);
@@ -406,13 +409,30 @@
     memset(&inaddr, 0, sizeof(inaddr));
     inaddr.sin_family = AF_INET;
     inaddr.sin_addr.s_addr = INADDR_ANY;
+/*
     inaddr.sin_port = 0;
 
     if(bind(mca_btl_tcp_component.tcp_listen_sd, (struct sockaddr*)&inaddr, sizeof(inaddr)) < 0) {
         BTL_ERROR(("bind() failed with errno=%d", opal_socket_errno));
         return OMPI_ERROR;
     }
-
+*/
+
+    tmp_port = min_port;
+    while (1) {
+        inaddr.sin_port = htons((unsigned short) tmp_port);
+        if(bind(mca_btl_tcp_component.tcp_listen_sd, (struct sockaddr *) &inaddr, sizeof(inaddr)) < 0) {
+            if (tmp_port == max_port) {
+                BTL_ERROR(("bind() failed with errno=%d", opal_socket_errno));
+                return OMPI_ERROR;
+            } else {
+                ++tmp_port;
+            }
+        } else {
+            break;
+        }
+    }
+
     /* resolve system assignend port */
     addrlen = sizeof(struct sockaddr_in);
     if(getsockname(mca_btl_tcp_component.tcp_listen_sd, (struct sockaddr*)&inaddr, &addrlen) < 0) {
diff -ru openmpi-1.2.orig/orte/mca/oob/tcp/oob_tcp.c openmpi-1.2/orte/mca/oob/tcp/oob_tcp.c
--- openmpi-1.2.orig/orte/mca/oob/tcp/oob_tcp.c	2007-01-24 18:16:10.000000000 +0000
+++ openmpi-1.2/orte/mca/oob/tcp/oob_tcp.c	2007-04-20 16:50:08.000000000 +0100
@@ -378,6 +378,9 @@
     int flags;
     struct sockaddr_in inaddr;
     opal_socklen_t addrlen;
+    int min_port = 15000;
+    int max_port = 15999;
+    int tmp_port;
 
     /* create a listen socket for incoming connections */
     mca_oob_tcp_component.tcp_listen_sd = socket(AF_INET, SOCK_STREAM, 0);
@@ -394,6 +397,7 @@
     memset(&inaddr, 0, sizeof(inaddr));
     inaddr.sin_family = AF_INET;
     inaddr.sin_addr.s_addr = INADDR_ANY;
+/*
     inaddr.sin_port = 0;
 
     if(bind(mca_oob_tcp_component.tcp_listen_sd, (struct sockaddr*)&inaddr, sizeof(inaddr)) < 0) {
@@ -401,6 +405,23 @@
         strerror(opal_socket_errno), opal_socket_errno);
         return ORTE_ERROR;
     }
+*/
+
+    tmp_port = min_port;
+    while (1) {
+        inaddr.sin_port = htons((unsigned short) tmp_port);
+        if(bind(mca_oob_tcp_component.tcp_listen_sd, (struct sockaddr *) &inaddr, sizeof(inaddr)) < 0) {
+            if (tmp_port == max_port) {
+                opal_output(0,"mca_oob_tcp_create_listen: bind() failed: %s (%d)",
+                            strerror(opal_socket_errno), opal_socket_errno);
+                return ORTE_ERROR;
+            } else {
+                ++tmp_port;
+            }
+        } else {
+            break;
+        }
+    }
 
     /* resolve system assigned port */
     addrlen = sizeof(struct sockaddr_in);
@@ -589,6 +610,9 @@
     struct sockaddr_in inaddr;
     opal_socklen_t addrlen;
     int flags;
+    int min_port = 15000;
+    int max_port = 15999;
+    int tmp_port;
 
     /* create a listen socket for incoming connections */
     mca_oob_tcp_component.tcp_listen_sd = socket(AF_INET, SOCK_STREAM, 0);
@@ -605,6 +629,7 @@
     memset(&inaddr, 0, sizeof(inaddr));
     inaddr.sin_family = AF_INET;
     inaddr.sin_addr.s_addr = INADDR_ANY;
+/*
     inaddr.sin_port = 0;
 
     if(bind(mca_oob_tcp_component.tcp_listen_sd, (struct sockaddr*)&inaddr, sizeof(inaddr)) < 0) {
@@ -612,6 +637,23 @@
         strerror(opal_socket_errno), opal_socket_errno);
         return ORTE_ERROR;
     }
+*/
+
+    tmp_port = min_port;
+    while (1) {
+        inaddr.sin_port = htons((unsigned short) tmp_port);
+        if(bind(mca_oob_tcp_component.tcp_listen_sd, (struct sockaddr *) &inaddr, sizeof(inaddr)) < 0) {
+            if (tmp_port == max_port) {
+                opal_output(0,"mca_oob_tcp_create_listen: bind() failed: %s (%d)",
+                            strerror(opal_socket_errno), opal_socket_errno);
+                return ORTE_ERROR;
+            } else {
+                ++tmp_port;
+            }
+        } else {
+            break;
+        }
+    }
 
     /* resolve system assigned port */
     addrlen = sizeof(struct sockaddr_in);