Hi, I am trying to set up some machines with OpenMPI, connected over Ethernet, to expand a batch system we already have in use.
The machines are already controlled by Slurm, and we are able to get a basic MPI program running across two of them, but when I compile and run something that actually performs communication, it fails. Slurm was not configured with PMI/PMI2 support, so we have to launch programs with mpirun. OpenMPI is installed in my home directory, which is accessible on all of the nodes we are trying to run on.

My hello world application gets the world size, rank and hostname and prints them. This launches and runs successfully:

Hello world from processor viper-03, rank 0 out of 8 processors
Hello world from processor viper-03, rank 1 out of 8 processors
Hello world from processor viper-03, rank 2 out of 8 processors
Hello world from processor viper-03, rank 3 out of 8 processors
Hello world from processor viper-04, rank 4 out of 8 processors
Hello world from processor viper-04, rank 5 out of 8 processors
Hello world from processor viper-04, rank 6 out of 8 processors
Hello world from processor viper-04, rank 7 out of 8 processors

I then tried to run the OSU micro-benchmarks, but these fail. I get the following output:

# OSU MPI Latency Test v5.6.3
# Size          Latency (us)
[viper-01:25885] [[21336,0],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file util/show_help.c at line 507
--------------------------------------------------------------------------
WARNING: Open MPI accepted a TCP connection from what appears to be a
another Open MPI process but cannot find a corresponding process entry
for that peer.

This attempted connection will be ignored; your MPI job may or may not
continue properly.

  Local host: viper-02
  PID:        20406
--------------------------------------------------------------------------

The machines are firewalled, but ports 9000-9060 are open. I have set the following MCA parameters to match the open ports:

btl_tcp_port_min_v4=9000
btl_tcp_port_range_v4=60
oob_tcp_dynamic_ipv4_ports=9020

OpenMPI 4.0.5 was built with GCC 4.8.5; the only configure option set was the installation prefix, $HOME/local/ompi.

What else could be going wrong?

Kind Regards,
Dean
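
P.S. For completeness, the hello world is just the textbook example, roughly the following (reproduced from memory, so minor details may differ):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int world_size, world_rank, name_len;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    /* World size, rank and host name, as shown in the output above. */
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    MPI_Finalize();
    return 0;
}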
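
P.P.S. The latency benchmark is launched with a command of roughly this form (the host/slot arguments here are illustrative; in practice they come from the Slurm allocation):

mpirun -np 2 -H viper-01:1,viper-02:1 \
    --mca btl_tcp_port_min_v4 9000 \
    --mca btl_tcp_port_range_v4 60 \
    --mca oob_tcp_dynamic_ipv4_ports 9020 \
    ./osu_latency

The three --mca options are the same parameters listed above; putting them in $HOME/.openmpi/mca-params.conf should be equivalent.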