Hi guys,

  I seem to have encountered an error while trying to run an MPMD executable
through Open MPI's '-app' option, and I'm wondering if anyone else has seen
this or can verify it.

Backing up to a simple example, a "hello world" executable (hwc.exe) runs
fine in an interactive PBS session (requested with -l nodes=2:ppn=4) when
launched as:
 mpiexec -v -d -machinefile $PBS_NODEFILE -mca oob_tcp_if_exclude eth0 -mca pls_rsh_agent ssh -np 8 ./hwc.exe

But when I run what should be the same thing via an '--app' file (or the
equivalent colon-separated command line, like the following), it fails:
 mpiexec -v -d -machinefile $PBS_NODEFILE -mca oob_tcp_if_exclude eth0 -mca pls_rsh_agent ssh -np 6 ./hwc.exe : -np 2 ./hwc.exe
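
For reference, the appfile form of this command could be sketched as below.
This assumes the usual Open MPI appfile layout of one application context per
line; the file name hwc.appfile is made up for illustration. The mpirun man
page describes --app as ignoring other command-line options, so the
-machinefile and -mca settings may not carry over and might have to be
supplied some other way.

```shell
# Sketch only: hwc.appfile is a hypothetical file name, not from the post.
# Each line is one application context, mirroring "-np 6 ... : -np 2 ...".
cat > hwc.appfile <<'EOF'
-np 6 ./hwc.exe
-np 2 ./hwc.exe
EOF

# Assumed launch step (left commented out, since it needs a working MPI setup):
# mpiexec --app hwc.appfile
```

The two lines of the file correspond to the two halves of the colon-separated
command line, so both forms should describe the same 6 + 2 process layout.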

  My understanding is that these are equivalent, no?  But the latter example
fails with multiple "Software caused connection abort (103)" errors, such as
the following:
[xxx:13978] [0,0,2]-[0,0,0] mca_oob_tcp_peer_try_connect: connect to
xx.x.2.81:34103 failed: Software caused connection abort (103)

  Any thoughts?  I haven't dug around the source yet since this could be a
weird problem with the system I'm using.  For the record, this is with
OpenMPI 1.2.4 compiled with PGI 7.1-2.

  As an aside, the '-app' syntax DOES work fine when all copies are running
on the same node. For example, having requested 4 CPUs per node, if I run
"-np 2 ./hwc.exe : -np 2 ./hwc.exe", it works fine.  I also tried
duplicating the mca parameters after the colon, since I figured they might
not propagate and it was perhaps trying to use the wrong interface, but
that didn't help either.

  Thanks very much,
  - Brian


Brian Dobbins
Yale University HPC
