Hi,

Thanks Jeff for your reply, and sorry for this late follow-up...
On Sun, Aug 11, 2019 at 02:27:53PM -0700, Jeff Hammond wrote:
> > openmpi-4.0.1 gives essentially the same results (similar files
> > attached), but with various doubts on my part as to whether I've run
> > this check correctly. Here are my doubts:
> > - whether or not I should have a ucx build for an omnipath cluster
> >   (IIUC https://github.com/openucx/ucx/issues/750 is now fixed?),
> >
> UCX is not optimized for Omni Path. Don't use it.

Good. Does that mean that the information conveyed by this message (quoted
below) is incomplete? It is easy to misconstrue it as an invitation to
enable ucx.

--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA
parameter to true.

  Local host:    node0
  Local adapter: hfi1_0
  Local port:    1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   node0
  Local device: hfi1_0
--------------------------------------------------------------------------

> > - which btl I should use (I understand that openib is headed for
> >   deprecation, and it complains unless I do --mca btl openib --mca
> >   btl_openib_allow_ib true ; fine. But then, which non-openib,
> >   non-tcp btl should I use instead?)
> >
> OFI->PS2 and PSM2 are the right conduits for Omni Path.

I assume you meant ofi->psm2 and psm2. I understand that --mca mtl ofi
should be right in that case, and that --mca mtl psm2 should be as well.
Unfortunately, that does not tell me much about pml and btl selection, if
these happen to matter (pml certainly does, based on my initial report).

> It sounds like Open-MPI doesn't properly support the maximum transfer
> size of PSM2. One way to work around this is to wrap your MPI collective
> calls and do <4G chunking yourself.

I'm afraid that is not a very satisfactory answer. Now that I have spent
the time diagnosing the issue, sure, I could resort to that sort of kludge
(a rough sketch of what I have in mind is in the P.S. below). But the path
to discovering the issue was long-winded. I would have been *MUCH* better
off if openmpi had spat out a big, loud error message (like it does with
the psm2 mtl). The fact that it silently omits copying some of my data
with the ofi mtl is extremely annoying.

Best,

E.
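P.S. For the record, below is roughly the kind of chunking wrapper I
understand you are suggesting. It is only a sketch and untested; the 1 GiB
chunk size, the bcast_chunked name, and the restriction to contiguous
MPI_BYTE buffers are assumptions of mine, not anything prescribed by Open
MPI or PSM2.

/* Sketch of a chunked broadcast: split one large transfer into pieces
 * well under 4 GiB so that no single MPI call runs into the PSM2
 * maximum transfer size. Untested. */
#include <mpi.h>
#include <stddef.h>

#define CHUNK_BYTES (1UL << 30)  /* 1 GiB per call, comfortably below 4 GiB */

static int bcast_chunked(void *buf, size_t bytes, int root, MPI_Comm comm)
{
    char *p = buf;
    while (bytes > 0) {
        size_t n = bytes < CHUNK_BYTES ? bytes : CHUNK_BYTES;
        int rc = MPI_Bcast(p, (int) n, MPI_BYTE, root, comm);
        if (rc != MPI_SUCCESS)
            return rc;
        p += n;
        bytes -= n;
    }
    return MPI_SUCCESS;
}

Having to interpose this sort of thing for every collective we use, and
for non-contiguous datatypes as well, is exactly the kind of boilerplate I
would rather not maintain, which is why a loud error from the library
would have been so much more helpful.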