The 1.6 code always expects to find the default hostfile, even if it is empty. We always install it by default, so I don't know why yours isn't there. In future releases, we just ignore it if it isn't found.
You have two options:

1. Create that file and leave it empty.

2. Work around it by adding --default-hostfile none to your cmd line, or adding OMPI_MCA_orte_default_hostfile=none to your environment. If you want to do this for everyone on the system, then add "orte_default_hostfile=none" to your default MCA param file. (A short command sketch of both options follows the quoted message below.)

HTH
Ralph

On Aug 23, 2012, at 4:03 PM, Jim Kusznir <jkusz...@gmail.com> wrote:

> Hi all:
>
> I recently rebuilt my cluster from Rocks 5 to Rocks 6 (which is based
> on CentOS 6.2) using the official spec file and my build options as
> before. It all built successfully and all appeared good. That is,
> until someone tried to use it. This is built with Torque integration,
> and it's run through Torque. When a user's job runs, this ends up in
> the error file and the program does not run successfully:
>
> --------------------------------------------------------------------------
> Open RTE was unable to open the hostfile:
> /opt/openmpi-gcc/1.6/etc/openmpi-default-hostfile
> Check to make sure the path and filename are correct.
> --------------------------------------------------------------------------
> [compute-0-2.local:13834] [[12466,0],0] ORTE_ERROR_LOG: Not found in file base/rmaps_base_support_fns.c at line 88
> [compute-0-2.local:13834] [[12466,0],0] ORTE_ERROR_LOG: Not found in file rmaps_rr.c at line 82
> [compute-0-2.local:13834] [[12466,0],0] ORTE_ERROR_LOG: Not found in file base/rmaps_base_map_job.c at line 88
> [compute-0-2.local:13834] [[12466,0],0] ORTE_ERROR_LOG: Not found in file base/plm_base_launch_support.c at line 105
> [compute-0-2.local:13834] [[12466,0],0] ORTE_ERROR_LOG: Not found in file plm_tm_module.c at line 194
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
>
> This has been confirmed with several different node assignments. Any
> ideas on cause or fixes?
>
> I built it with this command:
> rpmbuild -bb --define 'install_in_opt 1' --define 'install_modulefile
> 1' --define 'modules_rpm_name environment-modules' --define
> 'build_all_in_one_rpm 0' --define 'configure_options
> --with-tm=/opt/torque' --define '_name openmpi-gcc' --define 'makeopts
> -J8' openmpi.spec
>
> (and the PGI version was built with:
> CC=pgcc CXX=pgCC F77=pgf77 FC=pgf90 rpmbuild -bb --define
> 'install_in_opt 1' --define 'install_modulefile 1' --define
> 'modules_rpm_name environment-modules' --define 'build_all_in_one_rpm
> 0' --define 'configure_options --with-tm=/opt/torque' --define '_name
> openmpi-pgi' --define 'use_default_rpm_opt_flags 0' openmpi.spec
> )
>
> --Jim
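[Editor's note] A minimal command sketch of the two workarounds described above, assuming the install prefix shown in the error output (/opt/openmpi-gcc/1.6), a placeholder application name (./my_app), and the usual location of the default MCA param file ($prefix/etc/openmpi-mca-params.conf); adjust these for your installation:

    # Option 1: create an empty default hostfile where mpirun expects it
    touch /opt/openmpi-gcc/1.6/etc/openmpi-default-hostfile

    # Option 2: skip the default hostfile for a single run
    mpirun --default-hostfile none -np 4 ./my_app

    # ...or set it in the environment (e.g. in the job script)
    export OMPI_MCA_orte_default_hostfile=none

    # ...or system-wide, in the default MCA param file
    echo "orte_default_hostfile=none" >> /opt/openmpi-gcc/1.6/etc/openmpi-mca-params.conf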