Hi, 

I am trying to set up some machines with OpenMPI, connected over Ethernet, to 
expand a batch system we already have in use. 

These machines are already controlled by Slurm, and we are able to get a basic 
MPI program running across two of them, but when I compile and run something 
that actually performs communication, it fails. 

Slurm was not configured with PMI/PMI2 support, so we have to launch programs 
with mpirun. 
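
For reference, we launch from inside a Slurm allocation roughly like this (the 
allocation flags and binary name are just illustrative):

    $ salloc -N 2 --ntasks-per-node=4
    $ mpirun -np 8 ./hello_world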

OpenMPI is installed in my home directory, which is accessible on all of the 
nodes we are trying to run on.

My hello world application gets the world size, rank, and hostname and prints 
them; it launches and runs successfully.
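
The program is essentially the following (a minimal sketch; the real code 
differs only cosmetically):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        /* Query the size of MPI_COMM_WORLD and this process's rank in it. */
        int world_size, world_rank;
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        /* Get the hostname of the node this rank is running on. */
        char name[MPI_MAX_PROCESSOR_NAME];
        int name_len;
        MPI_Get_processor_name(name, &name_len);

        printf("Hello world from processor %s, rank %d out of %d processors\n",
               name, world_rank, world_size);

        MPI_Finalize();
        return 0;
    }

Running it across the two nodes gives: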

Hello world from processor viper-03, rank 0 out of 8 processors
Hello world from processor viper-03, rank 1 out of 8 processors
Hello world from processor viper-03, rank 2 out of 8 processors
Hello world from processor viper-03, rank 3 out of 8 processors
Hello world from processor viper-04, rank 4 out of 8 processors
Hello world from processor viper-04, rank 5 out of 8 processors
Hello world from processor viper-04, rank 6 out of 8 processors
Hello world from processor viper-04, rank 7 out of 8 processors 

I then tried to run the OSU micro-benchmarks, but they fail. I get the 
following output: 

# OSU MPI Latency Test v5.6.3
# Size          Latency (us)
[viper-01:25885] [[21336,0],0] ORTE_ERROR_LOG: Data unpack would read past end 
of buffer in file util/show_help.c at line 507
--------------------------------------------------------------------------
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process
entry for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

  Local host: viper-02
  PID:        20406
--------------------------------------------------------------------------

The machines are firewalled, but ports 9000-9060 are open. I have set the 
following MCA parameters to match the open ports:  

btl_tcp_port_min_v4=9000
btl_tcp_port_range_v4=60
oob_tcp_dynamic_ipv4_ports=9020
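
For completeness, this is equivalent to passing the same parameters on the 
command line (with osu_latency standing in for whichever benchmark binary is 
run):

    $ mpirun --mca btl_tcp_port_min_v4 9000 \
             --mca btl_tcp_port_range_v4 60 \
             --mca oob_tcp_dynamic_ipv4_ports 9020 \
             -np 8 ./osu_latency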

OpenMPI 4.0.5 was built with GCC 4.8.5; the only configure option set was the 
installation prefix, $HOME/local/ompi.
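
That is, it was configured and installed roughly as:

    $ ./configure --prefix=$HOME/local/ompi
    $ make all install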

What else could be going wrong? 

Kind Regards, 

Dean 
