I'm having trouble getting Open MPI to execute jobs when submitting through Torque. Everything works fine if I use the included mpirun scripts directly, but that is obviously not a good solution for the general users on the cluster.
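For reference, users are submitting ordinary batch jobs; a minimal script along these lines (the resource request is a placeholder) should be enough to hit the problem when handed to qsub:

#!/bin/sh
#PBS -l nodes=1
/usr/local/ompi-xl/bin/mpirun -np 1 /bin/hostname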
I'm running under OS X 10.4, Darwin 8.6.0. I configured Open MPI with:

export CC=/opt/ibmcmp/vac/6.0/bin/xlc
export CXX=/opt/ibmcmp/vacpp/6.0/bin/xlc++
export FC=/opt/ibmcmp/xlf/8.1/bin/xlf90_r
export F77=/opt/ibmcmp/xlf/8.1/bin/xlf_r
export LDFLAGS=-lSystemStubs
export LIBTOOL=glibtool

PREFIX=/usr/local/ompi-xl
./configure \
  --prefix=$PREFIX \
  --with-tm=/usr/local/pbs/ \
  --with-gm=/opt/gm \
  --enable-static \
  --disable-cxx

I also had to apply the fix listed in:
http://www.open-mpi.org/community/lists/users/2006/04/1007.php

I've attached the output of ompi_info, run from within an interactive job. Looking through it, I can at least save a bit of trouble by noting what does work: anything outside of Torque seems fine, and from within an interactive job pbsdsh works, so the earlier problems with poll are fixed.

Here is the error reported when I attempt to run hostname on one processor:

node96:/usr/src/openmpi-1.1 jbronder$ /usr/local/ompi-xl/bin/mpirun -np 1 -mca pls_tm_debug 1 /bin/hostname
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: final top-level argv:
[node96.meldrew.clusters.umaine.edu:00850] pls:tm:     orted --no-daemonize --bootproxy 1 --name --num_procs 2 --vpid_start 0 --nodename --universe jbron...@node96.meldrew.clusters.umaine.edu:default-universe --nsreplica "0.0.0;tcp://10.0.1.96:49395" --gprreplica "0.0.0;tcp://10.0.1.96:49395"
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: Set prefix:/usr/local/ompi-xl
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: launching on node localhost
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: resetting PATH: /usr/local/ompi-xl/bin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/pbs/bin:/usr/local/mpiexec/bin:/opt/ibmcmp/xlf/8.1/bin:/opt/ibmcmp/vac/6.0/bin:/opt/ibmcmp/vacpp/6.0/bin:/opt/gm/bin:/opt/fms/bin
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: found /usr/local/ompi-xl/bin/orted
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename localhost --universe jbron...@node96.meldrew.clusters.umaine.edu:default-universe --nsreplica "0.0.0;tcp://10.0.1.96:49395" --gprreplica "0.0.0;tcp://10.0.1.96:49395"
[node96.meldrew.clusters.umaine.edu:00850] pls:tm: start_procs returned error -13
[node96.meldrew.clusters.umaine.edu:00850] [0,0,0] ORTE_ERROR_LOG: Not found in file rmgr_urm.c at line 184
[node96.meldrew.clusters.umaine.edu:00850] [0,0,0] ORTE_ERROR_LOG: Not found in file rmgr_urm.c at line 432
[node96.meldrew.clusters.umaine.edu:00850] mpirun: spawn failed with errno=-13
node96:/usr/src/openmpi-1.1 jbronder$
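In case it helps narrow things down, the tm_spawn path that pls:tm uses can be exercised outside of Open MPI with a small TM test program along these lines (a sketch; the file name tmtest.c, the include/library paths under /usr/local/pbs, and -lpbs are my assumptions based on the install above):

#include <stdio.h>
#include <tm.h>

extern char **environ;

int main(void)
{
    struct tm_roots roots;
    tm_node_id *nodes;
    int nnodes, terr;
    tm_task_id tid;
    tm_event_t ev, rev;
    char *av[] = { "/bin/hostname", NULL };

    /* attach to the MOM for the job we are running inside of */
    if (tm_init(NULL, &roots) != TM_SUCCESS) {
        fprintf(stderr, "tm_init failed\n");
        return 1;
    }
    if (tm_nodeinfo(&nodes, &nnodes) != TM_SUCCESS) {
        fprintf(stderr, "tm_nodeinfo failed\n");
        return 1;
    }
    printf("job has %d node(s)\n", nnodes);

    /* ask the MOM to start /bin/hostname on the first node of the job */
    if (tm_spawn(1, av, environ, nodes[0], &tid, &ev) != TM_SUCCESS) {
        fprintf(stderr, "tm_spawn failed\n");
        return 1;
    }
    /* block until the spawn event completes and check its error code */
    if (tm_poll(TM_NULL_EVENT, &rev, 1, &terr) != TM_SUCCESS || terr != TM_SUCCESS) {
        fprintf(stderr, "tm_poll reported error %d\n", terr);
        return 1;
    }
    tm_finalize();
    return 0;
}

Compiled with something like cc -I/usr/local/pbs/include -L/usr/local/pbs/lib tmtest.c -o tmtest -lpbs and run from inside an interactive job. Given that pbsdsh (which itself launches via tm_spawn) already works, I would expect this to pass, which would point at the orted launch rather than the TM side.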
My thanks for any help in advance,

Justin Bronder.

Attachment: ompi_info.log.gz