You are touching on a difficult area in Open MPI here:

- Name publishing across independent jobs unfortunately does not work right now. (It does work if all processes have been started by the same mpirun, or if they have been spawned by a parent process using MPI_Comm_spawn.) Your approach of passing the port as a command-line option should work, however.

- However, you have to start the orted daemon *before* starting both jobs, using the flags
'orted --seed --persistent --scope public'
These flags are currently only lightly tested, since a brand-new runtime environment with much better support for these operations is under development.

- Regarding the 'pack data mismatch': do both machines you are using have the same data representation? I ask because this looks like a data type mismatch error, and Open MPI currently has some restrictions regarding different data formats and endianness.
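
One quick way to compare the data representation of the two machines is to compile and run a small check on each and compare the output. The following is only a minimal sketch in plain C, independent of Open MPI; the function names host_is_little_endian and print_data_representation are made up for this illustration, and matching output does not by itself rule out other causes of the error.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the host stores the least
   significant byte first (little-endian), 0 otherwise. */
int host_is_little_endian(void)
{
    const uint32_t probe = 1;
    unsigned char first_byte;

    /* Copy the byte at the lowest address of the 32-bit value. */
    memcpy(&first_byte, &probe, 1);
    return first_byte == 1;
}

/* Hypothetical helper: prints the byte order and the sizes of a few
   basic C types so the output of two hosts can be compared by eye. */
void print_data_representation(void)
{
    printf("byte order: %s-endian\n",
           host_is_little_endian() ? "little" : "big");
    printf("sizeof(int)=%zu sizeof(long)=%zu sizeof(void *)=%zu\n",
           sizeof(int), sizeof(long), sizeof(void *));
}
```

Calling print_data_representation() from main on both hosts and comparing the two lines of output would show whether the byte order or the basic type sizes differ.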

Thanks
Edgar

Robert Latham wrote:

Hello
In playing around with process management routines, I found another
issue.  This one might very well be operator error, or something
implementation specific.

I've got two processes (a and b), linked with openmpi, but started
independently (no mpiexec).
- A starts up and calls MPI_Init
- A calls MPI_Open_port, prints out the port name to stdout, then
  calls MPI_Comm_accept and blocks.
- B takes as a command line argument the port
  name printed out by A.  It calls MPI_Init and then passes that
  port name to MPI_Comm_connect
- B gets the following error:

[leela.mcs.anl.gov:04177] [0,0,0] ORTE_ERROR_LOG: Pack data mismatch
in file ../../../orte/dps/dps_unpack.c at line 121
[leela.mcs.anl.gov:04177] [0,0,0] ORTE_ERROR_LOG: Pack data mismatch
in file ../../../orte/dps/dps_unpack.c at line 95
[leela.mcs.anl.gov:04177] *** An error occurred in MPI_Comm_connect
[leela.mcs.anl.gov:04177] *** on communicator MPI_COMM_WORLD
[leela.mcs.anl.gov:04177] *** MPI_ERR_UNKNOWN: unknown error
[leela.mcs.anl.gov:04177] *** MPI_ERRORS_ARE_FATAL (goodbye)
[leela.mcs.anl.gov:04177] [0,0,0] ORTE_ERROR_LOG: Not found in file
../../../../../orte/mca/pls/base/pls_base_proxy.c at line 183

- A is still waiting for someone to connect to it.

Did I pass MPI port strings between programs the correct way, or is
MPI_Publish_name/MPI_Lookup_name the preferred way to pass around this
information?

Thanks
==rob


--
Edgar Gabriel
Assistant Professor
Department of Computer Science          email:gabr...@cs.uh.edu
University of Houston                   http://www.cs.uh.edu/~gabriel
Philip G. Hoffman Hall, Room 524        Tel: +1 (713) 743-3857
Houston, TX-77204, USA                  Fax: +1 (713) 743-3335
