Hi,
      We are running the OFED 1.2rc4 distribution, which includes openmpi-1.2.2, 
on a Red Hat Enterprise Linux 4 Update 4 system with Scyld Clusterware 4.1. The 
hardware consists of a Dell 2950 as the head node and three Dell 1950 blades as 
compute nodes, with Cisco TopSpin InfiniBand HCAs and switches for the interconnect.
 
       When we use 'mpirun' from the OFED/Open MPI distribution to start 
processes on the compute nodes, everything works correctly. However, when we 
try to start processes on the head node, the processes appear to run correctly 
but 'mpirun' hangs and does not terminate until killed. The attached 'run1.tgz' 
file contains detailed information from running the following command:
 
      mpirun --hostfile hostfile1 --np 1 --byslot --debug-daemons -d hostname
 
where 'hostfile1' contains the following:
 
-1 slots=2 max_slots=2
 
'run.log' is the output of the mpirun command above. 'strace.out.0' is the 
output of 'strace -f' on the mpirun process (and on the 'hostname' child 
process, since mpirun simply forks local processes). The child process (pid 
23415 in this case) runs to completion and exits successfully. The parent 
process (mpirun) does not appear to recognize that the child has completed and 
hangs until killed (with ^C).
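 
To make clear what we expected from the parent, here is a minimal sketch of the 
launch-and-reap pattern in plain shell. It is only an illustration, not Open MPI 
code: the variable names are ours, and 'uname -n' stands in for 'hostname' as the 
short-lived child.

```shell
#!/bin/sh
# Illustration only: a parent starts a child, then waits for it to exit.
# This is NOT how mpirun is implemented; it shows the behaviour we expected.

# Start a short-lived child in the background ('uname -n' prints the node
# name, standing in for the 'hostname' process in our run).
uname -n > /dev/null &
child_pid=$!

# Block until that child exits, then collect its exit status.
wait "$child_pid"
child_status=$?

echo "child $child_pid exited with status $child_status"
```

Here the parent returns as soon as the child exits. In our run, mpirun stays 
blocked even though the child has already exited.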
 
Additionally, when we run a set of processes that spans the head node and the 
compute nodes, the processes on the head node complete successfully, but the 
processes on the compute nodes do not appear to start. mpirun again appears to 
hang.
 
Is this a configuration error on our side, or have we run into a problem in the 
software? Thank you in advance for any assistance or suggestions.
 
Sean
 
------
Sean M. Kelley
sean.kel...@solers.com
 
 

Attachment: run1.tgz


