Hi,

I have been trying to set up OpenMPI 1.0.3a1r10374 on our cluster and was only partly successful. Installation worked, and compiling a simple example and running it through the rsh pls also worked. However, I'm the only user with rsh access to the nodes; all other users must go through Torque and launch MPI apps using Torque's TM subsystem. That's where my problem starts: I have not been able to launch apps through TM. The TM pls seems to be configured okay, and I can see it making connections to the Torque mom in the mom's logfile; however, the app never gets run. Even if I request only one processor, mpiexec spawns several orted in a row.

Here is my session log (I kill mpiexec with CTRL-C because it would otherwise run forever):

schaffoe@node16:~/tmp/mpitest> mpiexec -np 1 --mca pls_tm_debug 1 --mca pls tm `pwd`/openmpitest
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm: orted --no-daemonize --bootproxy 1 --name --num_procs 2 --vpid_start 0 --nodename --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: found /opt/openmpi/bin/orted
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename node16 --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm: orted --no-daemonize --bootproxy 1 --name --num_procs 3 --vpid_start 0 --nodename --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename node16 --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm: orted --no-daemonize --bootproxy 1 --name --num_procs 4 --vpid_start 0 --nodename --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 0.0.3 --num_procs 4 --vpid_start 0 --nodename node16 --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
mpiexec: killing job...
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm: orted --no-daemonize --bootproxy 1 --name --num_procs 5 --vpid_start 0 --nodename --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 0.0.4 --num_procs 5 --vpid_start 0 --nodename node16 --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm: orted --no-daemonize --bootproxy 1 --name --num_procs 6 --vpid_start 0 --nodename --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 0.0.5 --num_procs 6 --vpid_start 0 --nodename node16 --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
--------------------------------------------------------------------------
WARNING: mpiexec encountered an abnormal exit.

This means that mpiexec exited before it received notification that all
started processes had terminated. You should double check and ensure
that there are no runaway processes still executing.
--------------------------------------------------------------------------

I read in the README that the TM pls is working, whereas the LaTeX user's guide says that only rsh and bproc are supported. I am confused... Can anybody shed some light on this?

Regards,
-- 
Martin Schafföner

Cognitive Systems Group, Institute of Electronics, Signal Processing and
Communication Technologies, Department of Electrical Engineering,
Otto-von-Guericke University Magdeburg
Phone: +49 391 6720063