On Friday 16 June 2006 15:00, Jeff Squyres (jsquyres) wrote:
> Try two things:
>
> 1. Use the pbsdsh command to try to launch a trivial non-MPI application
>    (e.g., hostname):
>
>    (inside a PBS job)
>    pbsdsh -<N> -v hostname
>
>    where <N> is the number of vcpu's in your job.
>
> 2. If that works, try mpirun'ing a trivial non-MPI application (e.g.,
>    hostname):
>
>    (inside a PBS job)
>    mpirun -np <N> -d --mca pls_tm_debug 1 hostname
>
> If #1 fails, then there is something wrong with your Torque installation
> (pbsdsh uses the same PBS API that Open MPI does), and Open MPI's failure
> is a symptom of the underlying problem. If #1 succeeds and #2 fails, send
> back the details and let's go from there.
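For reference, the two checks above can be combined into one small PBS batch script. This is only a sketch: the resource request and process count of 4 are example values, and it assumes Torque's pbsdsh and Open MPI's mpirun are both on the PATH inside the job.

```shell
#!/bin/sh
#PBS -l nodes=2:ppn=2    # example request: 4 vcpus total
#PBS -N ompi-tm-check

# Step 1: launch a trivial non-MPI program through the PBS TM API.
# With no -c/-n option, pbsdsh runs one copy on every allocated vcpu.
# If this step fails, the Torque installation itself is broken.
pbsdsh -v hostname || exit 1

# Step 2: launch the same trivial program through Open MPI's mpirun,
# with debugging of the TM launcher component enabled.
mpirun -np 4 -d --mca pls_tm_debug 1 hostname
```

If step 1 prints one hostname per vcpu but step 2 hangs or aborts, the problem is on the Open MPI side of the TM interface, which is exactly the case isolated below.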
So, #1 works (I know because we're constantly using pbsdsh and OSC's
mpiexec for mpich-type implementations). #2 doesn't work; I'll repeat the
session log from my first post, this time (hopefully!!!) with linebreaks
(could it be that the mailing list has problems with utf8 posts?):

schaffoe@node16:~/tmp/mpitest> mpiexec -np 1 --mca pls_tm_debug 1 --mca pls tm `pwd`/openmpitest
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm:     orted --no-daemonize --bootproxy 1 --name --num_procs 2 --vpid_start 0 --nodename --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: found /opt/openmpi/bin/orted
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 0.0.1 --num_procs 2 --vpid_start 0 --nodename node16 --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm:     orted --no-daemonize --bootproxy 1 --name --num_procs 3 --vpid_start 0 --nodename --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 0.0.2 --num_procs 3 --vpid_start 0 --nodename node16 --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm:     orted --no-daemonize --bootproxy 1 --name --num_procs 4 --vpid_start 0 --nodename --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 0.0.3 --num_procs 4 --vpid_start 0 --nodename node16 --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
mpiexec: killing job...
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm:     orted --no-daemonize --bootproxy 1 --name --num_procs 5 --vpid_start 0 --nodename --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 0.0.4 --num_procs 5 --vpid_start 0 --nodename node16 --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm:     orted --no-daemonize --bootproxy 1 --name --num_procs 6 --vpid_start 0 --nodename --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 0.0.5 --num_procs 6 --vpid_start 0 --nodename node16 --universe schaffoe@node16:default-universe-3113 --nsreplica "0.0.0;tcp://192.168.1.16:60601" --gprreplica "0.0.0;tcp://192.168.1.16:60601"
--------------------------------------------------------------------------
WARNING: mpiexec encountered an abnormal exit.

This means that mpiexec exited before it received notification that all
started processes had terminated. You should double check and ensure that
there are no runaway processes still executing.
--------------------------------------------------------------------------

CU,
--
Martin Schafföner
Cognitive Systems Group, Institute of Electronics, Signal Processing and
Communication Technologies, Department of Electrical Engineering,
Otto-von-Guericke University Magdeburg
Phone: +49 391 6720063