Thanks for reporting the issue.

First, you can workaround the issue by using:

    mpirun --mca oob tcp ...

This uses the TCP out-of-band (OOB) plugin instead of the verbs unreliable 
datagram (ud) plugin, which appears to be what is emitting the libibverbs error.
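
For example, applied to the command from your report (adjust the hosts for 
your setup):

    mpirun --mca oob tcp -host nic114,nic151 hostname

Setting the same MCA parameter through the environment should also work, so 
you don't have to add the flag to every mpirun invocation:

    # environment-variable form of the same MCA parameter
    export OMPI_MCA_oob=tcp
    mpirun -host nic114,nic151 hostname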

Second, I just filed a fix for our current release branches (v2.1.x, v3.0.x, 
and v3.1.x); you can track it here:

    https://github.com/open-mpi/ompi/issues/5672

Could you try it out and let me know if it works for you?

Thanks!


> On Sep 10, 2018, at 5:36 PM, Balazs HAJGATO <balazs.hajg...@vub.be> wrote:
> 
> Dear list readers,
> 
> I have some problems with OpenMPI 3.1.1. In some node combos, I got the error 
> (libibverbs: GRH is mandatory For RoCE address handle; *** Error in 
> `/apps/brussel/CO7/ivybridge-ib/software/OpenMPI/3.1.1-GCC-7.3.0-2.30/bin/orted':
>  double free or corruption (out): 0x00002aaab4001680 ***); see details in 
> file 114_151.out.bz2. This happens even with the simplest run, like
> mpirun -host nic114,nic151 hostname
> In the file 114_151.out.bz2, you can see the output if I run the command from 
> nic114. If I run the same command from nic151, it simply spits out the 
> hostnames, without any errors. 
> 
> I also enclosed the ompi_info --all --parsable output from nic114 (nic151 is 
> identical, see ompi.nic114.bz2). I do not have the config.log file, although 
> I still have the config output (see config.out.bz2). The nodes have 
> identical operating systems (as we use the same image), and OpenMPI is also 
> loaded from a central directory shared amongst the nodes. We have an 
> InfiniBand network (with IP over IB) and an Ethernet network. (Intel MPI 
> works without a problem, and I confirmed that the network is IB when I use 
> Intel MPI.) It is not clear whether the orted error is a consequence of the 
> libibverbs error, but it is also not clear why OpenMPI wants to use RoCE at 
> all. (ibv_devinfo is also attached; we do have a somewhat creative 
> InfiniBand topology, based on fat-tree, but changing the topology did not 
> solve the problem.) The /tmp directory is writable and not full. As a matter 
> of fact, I get the same error with OpenMPI 2.0.2 and 2.1.1, and I do not get 
> this error with OpenMPI 1.10.2 and 1.10.3. Does anyone have any thoughts 
> about this issue?
> 
> Regards,
> 
> Balazs Hajgato
> <ibv_dev.nic114><ibv_dev.nic151><114_151.out.bz2><config.out.bz2><ompi.nic114.bz2>


-- 
Jeff Squyres
jsquy...@cisco.com
