On Friday 16 June 2006 15:00, Jeff Squyres (jsquyres) wrote:
> Try two things:
>
> 1. Use the pbsdsh command to try to launch a trivial non-MPI application
> (e.g., hostname):
>
> (inside a PBS job)
> pbsdsh -<N> -v hostname
>
> where <N> is the number of vcpu's in your job.
>
> 2. If that works, try mpirun'ing a trivial non-MPI application (e.g.,
> hostname):
>
> (inside a PBS job)
> mpirun -np <N> -d --mca pls_tm_debug 1 hostname
>
> If #1 fails, then there is something wrong with your Torque installation
> (pbsdsh uses the same PBS API that Open MPI does), and Open MPI's failure
> is a symptom of the underlying problem.  If #1 succeeds and #2 fails, send
> back the details and let's go from there.
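
For reference, the two checks inside a Torque job look roughly like this (just a 
sketch; the resource request and process count below are only examples, not our 
actual job setup):

    qsub -I -l nodes=1:ppn=4                         # interactive PBS/Torque job with 4 vcpus
    pbsdsh -v hostname                               # check #1: one copy per vcpu via the TM API
    mpirun -np 4 -d --mca pls_tm_debug 1 hostname    # check #2: the same launch through Open MPI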

So, #1 works (I know because we constantly use pbsdsh and OSC's mpiexec for 
MPICH-type implementations). #2 does not work; I'll repeat the session log 
from my first post, this time (hopefully!) with line breaks intact (could it 
be that the mailing list has problems with UTF-8 posts?):

schaffoe@node16:~/tmp/mpitest> mpiexec -np 1 --mca pls_tm_debug 1 --mca pls tm 
`pwd`/openmpitest
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm:     orted --no-daemonize --bootproxy 1 --name  
--num_procs 2 --vpid_start 0 --nodename  --universe 
schaffoe@node16:default-universe-3113 --nsreplica 
"0.0.0;tcp://192.168.1.16:60601" --gprreplica 
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: found /opt/openmpi/bin/orted
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 
0.0.1 --num_procs 2 --vpid_start 0 --nodename node16 --universe 
schaffoe@node16:default-universe-3113 --nsreplica 
"0.0.0;tcp://192.168.1.16:60601" --gprreplica 
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm:     orted --no-daemonize --bootproxy 1 --name  
--num_procs 3 --vpid_start 0 --nodename  --universe 
schaffoe@node16:default-universe-3113 --nsreplica 
"0.0.0;tcp://192.168.1.16:60601" --gprreplica 
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 
0.0.2 --num_procs 3 --vpid_start 0 --nodename node16 --universe 
schaffoe@node16:default-universe-3113 --nsreplica 
"0.0.0;tcp://192.168.1.16:60601" --gprreplica 
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm:     orted --no-daemonize --bootproxy 1 --name  
--num_procs 4 --vpid_start 0 --nodename  --universe 
schaffoe@node16:default-universe-3113 --nsreplica 
"0.0.0;tcp://192.168.1.16:60601" --gprreplica 
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 
0.0.3 --num_procs 4 --vpid_start 0 --nodename node16 --universe 
schaffoe@node16:default-universe-3113 --nsreplica 
"0.0.0;tcp://192.168.1.16:60601" --gprreplica 
"0.0.0;tcp://192.168.1.16:60601"
mpiexec: killing job...
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm:     orted --no-daemonize --bootproxy 1 --name  
--num_procs 5 --vpid_start 0 --nodename  --universe 
schaffoe@node16:default-universe-3113 --nsreplica 
"0.0.0;tcp://192.168.1.16:60601" --gprreplica 
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 
0.0.4 --num_procs 5 --vpid_start 0 --nodename node16 --universe 
schaffoe@node16:default-universe-3113 --nsreplica 
"0.0.0;tcp://192.168.1.16:60601" --gprreplica 
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: final top-level argv:
[node16:03113] pls:tm:     orted --no-daemonize --bootproxy 1 --name  
--num_procs 6 --vpid_start 0 --nodename  --universe 
schaffoe@node16:default-universe-3113 --nsreplica 
"0.0.0;tcp://192.168.1.16:60601" --gprreplica 
"0.0.0;tcp://192.168.1.16:60601"
[node16:03113] pls:tm: launching on node node16
[node16:03113] pls:tm: not oversubscribed -- setting mpi_yield_when_idle to 0
[node16:03113] pls:tm: executing: orted --no-daemonize --bootproxy 1 --name 
0.0.5 --num_procs 6 --vpid_start 0 --nodename node16 --universe 
schaffoe@node16:default-universe-3113 --nsreplica 
"0.0.0;tcp://192.168.1.16:60601" --gprreplica 
"0.0.0;tcp://192.168.1.16:60601"
--------------------------------------------------------------------------
WARNING: mpiexec encountered an abnormal exit.

This means that mpiexec exited before it received notification that all
started processes had terminated.  You should double check and ensure
that there are no runaway processes still executing.
--------------------------------------------------------------------------
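
For what it's worth, checking for leftover daemons after such an abnormal exit 
can be done along these lines (just a sketch; pkill being available on the 
compute nodes is an assumption on my part):

    ps -ef | grep [o]rted      # look for orted daemons left behind by the failed run
    pkill -u $USER orted       # clean up any that are still hanging around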

CU,
-- 
Martin Schafföner

Cognitive Systems Group, Institute of Electronics, Signal Processing and 
Communication Technologies, Department of Electrical Engineering, 
Otto-von-Guericke University Magdeburg
Phone: +49 391 6720063
