Hi, I am in the process of moving a parallel program from our old 32-bit (Xeon @ 2.8 GHz) Linux cluster to a new EM64T-based (Intel Xeon 5160 @ 3.00 GHz) Linux cluster.
The OS on the old cluster is Red Hat 9; the new cluster runs Fedora 7. I have installed the Intel Fortran compiler version 10.0 and openmpi-1.2.3. I configured Open MPI with --prefix=/opt/openmpi F77=ifort FC=ifort. config.log and the output from "ompi_info --all" are in the attached files. /opt/ is mounted on all nodes in the cluster.

The program causing me problems solves two large interrelated systems of equations (more than 200 million equations) using PCG iteration. The program iterates on the first system until a certain degree of convergence is reached; then the master node executes a shell script that starts the parallel solver for the second system. Again the iteration continues until a certain degree of convergence is reached, and some parameters from solving the second system are stored in files. After the second system has been solved, the stored parameters are used in the solver for the first system. Both before and after the master node makes the system call, the nodes are synchronized via calls to MPI_BARRIER. This setup has worked fine on the old cluster, but on the new cluster the system call does not start the parallel solver for the second system.

The solver program is very complex, so I have made some small Fortran programs and shell scripts that illustrate the problem. The setup is as follows: mpi_main starts MPI on a number of nodes and checks that the nodes are alive. The master then executes the shell script serial.sh via a system call; this script starts a serial Fortran program (serial_subprog). After returning from the system call, the master executes the shell script mpi.sh, which tries to start mpi_subprog via mpirun. I have used mpif90 to compile the MPI programs and ifort to compile the serial program. mpi_main starts as expected, and the call of serial.sh starts the serial program as expected. However, the system call that executes mpi.sh does not start mpi_subprog. The Fortran programs and scripts are in the attached file test.tar.gz.

When I run the setup via

    mpirun -np 4 -hostfile nodelist ./mpi_main

I get the following:

    MPI_INIT return code: 0
    MPI_INIT return code: 0
    MPI_COMM_RANK return code: 0
    MPI_COMM_SIZE return code: 0
    Process 1 of 2 is alive - Hostname= c01b04  1 : 19
    MPI_COMM_RANK return code: 0
    MPI_COMM_SIZE return code: 0
    Process 0 of 2 is alive - Hostname= c01b05  0 : 19
    MYID: 1  MPI_REDUCE 1  red_chk_sum= 0   rc= 0
    MYID: 0  MPI_REDUCE 1  red_chk_sum= 2   rc= 0
    MYID: 1  MPI_BARRIER 1  RC= 0
    MYID: 0  MPI_BARRIER 1  RC= 0
    Master will now execute the shell script serial.sh
    This is from serial.sh
    We are now in the serial subprogram
    Master back from the shell script serial.sh  IERR= 0
    Master will now execute the shell script mpi.sh
    This is from mpi.sh
    /nav/denmark/navper19/mpi_test
    [c01b05.ctrl.ghpc.dk:25337] OOB: Connection to HNP lost
    Master back from the shell script mpi.sh  IERR= 0
    MYID: 0  MPI_BARRIER 2  RC= 0
    MYID: 0  MPI_REDUCE 2  red_chk_sum= 20  rc= 0
    MYID: 1  MPI_BARRIER 2  RC= 0
    MYID: 1  MPI_REDUCE 2  red_chk_sum= 0   rc= 0

As you can see, the execution of the serial program works, while the MPI program is not started. I have checked that mpirun is in the PATH in the shell started by the system call, and I have checked that the mpi.sh script works if it is executed from the command prompt. Output from a run with the mpirun options -v -d is in the attached file test.tar.gz. Has anyone out there tried to do something similar?
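For readers who do not open the attachment, here is a minimal sketch of the kind of test harness described above. It is my reconstruction from the description, not the code in test.tar.gz: the use of the IFPORT SYSTEM function, the variable names, and the messages are assumptions, and mpi.sh is assumed to do little more than echo a message, print the working directory, and call mpirun ./mpi_subprog.

    ! mpi_main.f90 - sketch of the test harness (reconstruction; may differ from test.tar.gz)
    ! Build:  mpif90 -o mpi_main mpi_main.f90
    program mpi_main
       use ifport                     ! Intel Fortran portability library, provides SYSTEM() (assumed)
       implicit none
       include 'mpif.h'
       integer :: ierr, myid, nprocs, ista

       call MPI_INIT(ierr)
       call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
       call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
       write(*,*) 'Process', myid, 'of', nprocs, 'is alive'

       call MPI_BARRIER(MPI_COMM_WORLD, ierr)

       if (myid == 0) then
          ! This works: serial.sh starts the serial program.
          write(*,*) 'Master will now execute the shell script serial.sh'
          ista = system('./serial.sh')
          write(*,*) 'Master back from the shell script serial.sh IERR=', ista

          ! This is the problem: mpi.sh runs, but the mpirun inside it
          ! never starts mpi_subprog, and the return code is still 0.
          write(*,*) 'Master will now execute the shell script mpi.sh'
          ista = system('./mpi.sh')
          write(*,*) 'Master back from the shell script mpi.sh IERR=', ista
       end if

       call MPI_BARRIER(MPI_COMM_WORLD, ierr)
       call MPI_FINALIZE(ierr)
    end program mpi_main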
Regards
Per Madsen
Senior scientist

AARHUS UNIVERSITET / UNIVERSITY OF AARHUS
Det Jordbrugsvidenskabelige Fakultet / Faculty of Agricultural Sciences
Forskningscenter Foulum / Research Centre Foulum
Genetik og Bioteknologi / Dept. of Genetics and Biotechnology
Blichers Allé 20, P.O. BOX 50
DK-8830 Tjele
Attachment: config.log.gz
ifconfig output:

    eth0      Link encap:Ethernet  HWaddr 00:14:5E:C2:BB:E4
              inet addr:10.55.55.65  Bcast:10.55.55.255  Mask:255.255.255.0
              inet6 addr: fe80::214:5eff:fec2:bbe4/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:140268254 errors:0 dropped:0 overruns:0 frame:0
              TX packets:166380187 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:138443717024 (128.9 GiB)  TX bytes:201070313859 (187.2 GiB)
              Interrupt:17 Memory:da000000-da012100

    eth1      Link encap:Ethernet  HWaddr 00:14:5E:C2:BB:E6
              inet addr:10.55.56.65  Bcast:10.55.56.255  Mask:255.255.255.0
              inet6 addr: fe80::214:5eff:fec2:bbe6/64 Scope:Link
              UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
              RX packets:639993727 errors:0 dropped:0 overruns:0 frame:0
              TX packets:518028570 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:1000
              RX bytes:845939849040 (787.8 GiB)  TX bytes:311070822710 (289.7 GiB)
              Interrupt:19 Memory:d8000000-d8012100

    lo        Link encap:Local Loopback
              inet addr:127.0.0.1  Mask:255.0.0.0
              inet6 addr: ::1/128 Scope:Host
              UP LOOPBACK RUNNING  MTU:16436  Metric:1
              RX packets:143166 errors:0 dropped:0 overruns:0 frame:0
              TX packets:143166 errors:0 dropped:0 overruns:0 carrier:0
              collisions:0 txqueuelen:0
              RX bytes:31459709 (30.0 MiB)  TX bytes:31459709 (30.0 MiB)
Attachment: ompi_info.log.gz
Attachment: test.tar.gz