Hi,

I have been trying to set up Open MPI 1.0.3a1r10374 on our cluster and was 
only partly successful: the installation worked, and compiling a simple 
example and running it through the rsh pls also worked. However, I am the 
only user who has rsh access to the nodes; all other users must go through 
Torque and launch MPI apps using Torque's TM subsystem (see the sample 
submission script after the session log below). That is where my problem 
starts: I have not been able to launch apps through TM. The TM pls is 
configured okay, and I can see it making connections to the Torque mom in 
mom's logfile; however, the app never gets run. Even if I request only one 
processor, mpiexec spawns several orted processes in a row. Here is my 
session log (I kill mpiexec with Ctrl-C because it would otherwise run 
forever):

schaffoe@node16:~/tmp/mpitest> mpiexec -np 1 --mca pls_tm_debug 1 --mca pls tm 
`pwd`/openmpitest
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm:     orted --no-daemonize --bootproxy 1 --name  
--num_procs 2 --vpid_start 0 --nodename  --universe 
schaffoe@node16:default-universe-3113 --nsreplica 
"0.0.0;tcp://192.168.1.16:60601" --gprreplica 
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: found /opt/openmpi/bin/orted
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 
0.0.1 --num_procs 2 --vpid_start 0 --nodename node16 --universe 
schaffoe@node16:default-universe-3113 --nsreplica 
"0.0.0;tcp://192.168.1.16:60601" --gprreplica 
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm:     orted --no-daemonize --bootproxy 1 --name  
--num_procs 3 --vpid_start 0 --nodename  --universe 
schaffoe@node16:default-universe-3113 --nsreplica 
"0.0.0;tcp://192.168.1.16:60601" --gprreplica 
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 
0.0.2 --num_procs 3 --vpid_start 0 --nodename node16 --universe 
schaffoe@node16:default-universe-3113 --nsreplica 
"0.0.0;tcp://192.168.1.16:60601" --gprreplica 
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm:     orted --no-daemonize --bootproxy 1 --name  
--num_procs 4 --vpid_start 0 --nodename  --universe 
schaffoe@node16:default-universe-3113 --nsreplica 
"0.0.0;tcp://192.168.1.16:60601" --gprreplica 
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 
0.0.3 --num_procs 4 --vpid_start 0 --nodename node16 --universe 
schaffoe@node16:default-universe-3113 --nsreplica 
"0.0.0;tcp://192.168.1.16:60601" --gprreplica 
"0.0.0;tcp://192.168.1.16:60601"
mpiexec: killing job...
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm:     orted --no-daemonize --bootproxy 1 --name  
--num_procs 5 --vpid_start 0 --nodename  --universe 
schaffoe@node16:default-universe-3113 --nsreplica 
"0.0.0;tcp://192.168.1.16:60601" --gprreplica 
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 
0.0.4 --num_procs 5 --vpid_start 0 --nodename node16 --universe 
schaffoe@node16:default-universe-3113 --nsreplica 
"0.0.0;tcp://192.168.1.16:60601" --gprreplica 
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm:     orted --no-daemonize --bootproxy 1 --name  
--num_procs 6 --vpid_start 0 --nodename  --universe 
schaffoe@node16:default-universe-3113 --nsreplica 
"0.0.0;tcp://192.168.1.16:60601" --gprreplica 
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 
0.0.5 --num_procs 6 --vpid_start 0 --nodename node16 --universe 
schaffoe@node16:default-universe-3113 --nsreplica 
"0.0.0;tcp://192.168.1.16:60601" --gprreplica 
"0.0.0;tcp://192.168.1.16:60601"
--------------------------------------------------------------------------
WARNING: mpiexec encountered an abnormal exit.

This means that mpiexec exited before it received notification that all
started processes had terminated.  You should double check and ensure
that there are no runaway processes still executing.
--------------------------------------------------------------------------
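
For reference, this is roughly how jobs are submitted on our cluster (a 
minimal sketch; the script name and the resource request are only examples):

#!/bin/sh
#PBS -N mpitest
#PBS -l nodes=1:ppn=1
# Torque sets PBS_O_WORKDIR to the directory qsub was invoked from
cd $PBS_O_WORKDIR
# with the tm pls selected, mpiexec should launch orted through the TM API
mpiexec -np 1 --mca pls tm ./openmpitest

submitted with "qsub mpitest.sh".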


I read in the README that the TM pls is working, whereas the LaTeX user's 
guide says that only rsh and bproc are supported. I am confused...
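
For what it's worth, the tm components do seem to have been built; checking 
with

    ompi_info | grep tm

(assuming ompi_info from the Open MPI install is in the PATH) does list the 
tm pls component on my installation.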

Can anybody shed some light on this?

Regards,
-- 
Martin Schafföner

Cognitive Systems Group, Institute of Electronics, Signal Processing and 
Communication Technologies, Department of Electrical Engineering, 
Otto-von-Guericke University Magdeburg
Phone: +49 391 6720063
