Thanks for reporting the issue.

First, you can work around the issue by using:

    mpirun --mca oob tcp ...

This selects a different out-of-band (OOB) plugin (TCP) instead of the one based on verbs unreliable datagrams.
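In case it helps, here is a minimal sketch of the workaround applied to your exact test case (hostnames taken from your report below; untested on my end):

    mpirun --mca oob tcp -host nic114,nic151 hostname

If that works and you want to make it the default instead of typing it on every command line, the usual Open MPI MCA mechanisms apply, for example:

    # per-shell, via the environment:
    export OMPI_MCA_oob=tcp

    # or persistently, by adding this line to $HOME/.openmpi/mca-params.conf:
    oob = tcp

You can also sanity-check which OOB components your build actually contains with:

    ompi_info | grep oob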
Second, I just filed a fix for our current release branches (v2.1.x, v3.0.x, and v3.1.x):

    https://github.com/open-mpi/ompi/issues/5672

Could you try it out and let me know if it works for you?

Thanks!

> On Sep 10, 2018, at 5:36 PM, Balazs HAJGATO <balazs.hajg...@vub.be> wrote:
>
> Dear list readers,
>
> I have some problems with Open MPI 3.1.1. With some node combinations, I get
> this error:
>
>     libibverbs: GRH is mandatory For RoCE address handle
>     *** Error in `/apps/brussel/CO7/ivybridge-ib/software/OpenMPI/3.1.1-GCC-7.3.0-2.30/bin/orted':
>     double free or corruption (out): 0x00002aaab4001680 ***
>
> (see details in the attached file 114_151.out.bz2), even with the simplest
> possible run, such as:
>
>     mpirun -host nic114,nic151 hostname
>
> The file 114_151.out.bz2 shows the output when I run the command from
> nic114. If I run the same command from nic151, it simply prints the
> hostnames, without any errors.
>
> I have also attached the "ompi_info --all --parsable" output from nic114
> (nic151 is identical; see ompi.nic114.bz2). I no longer have the config.log
> file, but I still have the configure output (see config.out.bz2). The nodes
> have identical operating systems (we use the same image), and Open MPI is
> loaded from a central directory shared among the nodes. We have an
> InfiniBand network (with IP over IB) and an Ethernet network. (Intel MPI
> works without a problem, and I confirmed that the network used is IB when I
> run Intel MPI.) It is not clear whether the orted error is a consequence of
> the libibverbs error, nor is it clear why Open MPI wants to use RoCE at all.
> (ibv_devinfo output is also attached; we have a somewhat creative InfiniBand
> topology based on a fat tree, but changing the topology did not solve the
> problem.) The /tmp directory is writable and not full. As a matter of fact,
> I get the same error with Open MPI 2.0.2 and 2.1.1, and I do not get this
> error with Open MPI 1.10.2 and 1.10.3.
>
> Does anyone have any thoughts about this issue?
>
> Regards,
>
> Balazs Hajgato
>
> <ibv_dev.nic114> <ibv_dev.nic151> <114_151.out.bz2> <config.out.bz2> <ompi.nic114.bz2>

--
Jeff Squyres
jsquy...@cisco.com