You are touching on a difficult area in Open MPI here:
- Name publishing across independent jobs unfortunately does not work
right now. (It does work if all processes have been started by the same
mpirun, or if they have been spawned by a parent process using
MPI_Comm_spawn.) Your approach of passing the port as a command-line
option should work, however.
- However, you have to start the orted daemon *before* starting both
jobs, using the flags
' orted --seed --persistent --scope public'
These flags are currently only lightly tested, since a brand new
runtime environment with much better support for these operations is
under development.
- Regarding the 'Pack data mismatch': do both machines you are using
have the same data representation? I ask because this looks like a
data type mismatch error, and Open MPI currently has some restrictions
regarding different data formats and endianness...
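If you want a quick way to compare the data representation of the two
machines, a small helper along these lines could be compiled and called
on each host. This is only a sketch: the names host_is_little_endian and
describe_host are made up for illustration and are not part of MPI or
Open MPI, and matching output does not by itself rule out other causes
of the error.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 on a little-endian host, 0 otherwise. */
int host_is_little_endian(void)
{
    const uint32_t probe = 0x01020304u;
    unsigned char bytes[sizeof probe];

    /* Copy the integer into a byte array to inspect its memory layout. */
    memcpy(bytes, &probe, sizeof probe);

    /* On a little-endian host the least significant byte comes first. */
    return bytes[0] == 0x04;
}

/* Hypothetical helper: writes a one-line summary of byte order and basic
 * type sizes into buf. Returns the snprintf result (number of characters
 * that the full summary needs, excluding the terminating NUL). */
int describe_host(char *buf, size_t len)
{
    return snprintf(buf, len, "%s-endian int=%zu long=%zu ptr=%zu",
                    host_is_little_endian() ? "little" : "big",
                    sizeof(int), sizeof(long), sizeof(void *));
}
```

Printing the string from describe_host on both machines and comparing
the two lines shows whether the byte order or the basic type sizes
differ between them.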
Thanks
Edgar
Robert Latham wrote:
Hello
In playing around with process management routines, I found another
issue. This one might very well be operator error, or something
implementation specific.
I've got two processes (A and B), linked with Open MPI, but started
independently (no mpiexec).
- A starts up and calls MPI_Init
- A calls MPI_Open_port, prints out the port name to stdout, then
calls MPI_Comm_accept and blocks.
- B takes as a command-line argument the port
name printed out by A. It calls MPI_Init and then passes that
port name to MPI_Comm_connect
- B gets the following error:
[leela.mcs.anl.gov:04177] [0,0,0] ORTE_ERROR_LOG: Pack data mismatch
in file ../../../orte/dps/dps_unpack.c at line 121
[leela.mcs.anl.gov:04177] [0,0,0] ORTE_ERROR_LOG: Pack data mismatch
in file ../../../orte/dps/dps_unpack.c at line 95
[leela.mcs.anl.gov:04177] *** An error occurred in MPI_Comm_connect
[leela.mcs.anl.gov:04177] *** on communicator MPI_COMM_WORLD
[leela.mcs.anl.gov:04177] *** MPI_ERR_UNKNOWN: unknown error
[leela.mcs.anl.gov:04177] *** MPI_ERRORS_ARE_FATAL (goodbye)
[leela.mcs.anl.gov:04177] [0,0,0] ORTE_ERROR_LOG: Not found in file
../../../../../orte/mca/pls/base/pls_base_proxy.c at line 183
- A is still waiting for someone to connect to it.
Did I pass MPI port strings between programs the correct way, or is
MPI_Publish_name/MPI_Lookup_name the preferred way to pass around this
information?
Thanks
==rob
--
Edgar Gabriel
Assistant Professor
Department of Computer Science email:gabr...@cs.uh.edu
University of Houston http://www.cs.uh.edu/~gabriel
Philip G. Hoffman Hall, Room 524 Tel: +1 (713) 743-3857
Houston, TX-77204, USA Fax: +1 (713) 743-3335