Hi,

Thanks, Jeff, for your reply, and sorry for this late follow-up...

On Sun, Aug 11, 2019 at 02:27:53PM -0700, Jeff Hammond wrote:
> > openmpi-4.0.1 gives essentially the same results (similar files
> > attached), but with various doubts on my part as to whether I've run this
> > check correctly. Here are my doubts:
> >     - whether I should or not have an ucx build for an omnipath cluster
> >       (IIUC https://github.com/openucx/ucx/issues/750 is now fixed ?),
> >
> 
> UCX is not optimized for Omni Path.  Don't use it.

good.

Does that mean that the information conveyed by this message is
incomplete? It's easy to misconstrue it as an invitation to enable UCX.

    --------------------------------------------------------------------------
    By default, for Open MPI 4.0 and later, infiniband ports on a device
    are not used by default.  The intent is to use UCX for these devices.
    You can override this policy by setting the btl_openib_allow_ib MCA
    parameter to true.

      Local host:              node0
      Local adapter:           hfi1_0
      Local port:              1

    --------------------------------------------------------------------------
    --------------------------------------------------------------------------
    WARNING: There was an error initializing an OpenFabrics device.

      Local host:   node0
      Local device: hfi1_0
    --------------------------------------------------------------------------
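
(For reference, the only ways I see to quiet that first warning are the
one it suggests, or excluding openib altogether; ./a.out is just a
placeholder for the actual test program:

    mpirun --mca btl openib --mca btl_openib_allow_ib true ./a.out
    mpirun --mca btl ^openib ./a.out

Neither of which tells me what to use *instead* on Omni Path.)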

> >     - which btl I should use (I understand that openib goes to
> >       deprecation and it complains unless I do --mca btl openib --mca
> >       btl_openib_allow_ib true ; fine. But then, which non-openib non-tcp
> >       btl should I use instead ?)
> >
> 
> OFI->PS2 and PSM2 are the right conduits for Omni Path.

I assume you meant ofi->psm2 and psm2. I understand that --mca mtl ofi
should be the right choice in that case, and that --mca mtl psm2 should
be as well. Unfortunately, that doesn't tell me much about pml and btl
selection, if those happen to matter (pml certainly does, based on my
initial report).
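
For concreteness, the kind of invocation I have in mind is (a sketch, on
the assumption that the cm pml is the one meant to sit on top of an mtl,
and with ./a.out standing in for the actual test program):

    mpirun --mca pml cm --mca mtl psm2 ./a.out
    mpirun --mca pml cm --mca mtl ofi  ./a.out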

> It sounds like Open-MPI doesn't properly support the maximum transfer size
> of PSM2.  One way to work around this is to wrap your MPI collective calls
> and do <4G chunking yourself.

I'm afraid that's not a very satisfactory answer. Once I've spent the
time diagnosing the issue, sure, I could resort to that sort of kludge.
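Presumably something along these lines would do, at least for MPI_Bcast
(a minimal sketch; the name bcast_chunked and the 1 GiB chunk size are
mine, and I haven't checked where the actual limit sits):

    /* Chunked replacement for MPI_Bcast, splitting large counts into
     * pieces whose byte size stays well below 4 GiB. */
    #include <mpi.h>
    #include <stddef.h>

    #define CHUNK_BYTES ((size_t)1 << 30)   /* 1 GiB per underlying call */

    static int bcast_chunked(void *buf, size_t count, MPI_Datatype type,
                             int root, MPI_Comm comm)
    {
        int type_size;
        MPI_Type_size(type, &type_size);

        size_t max_count = CHUNK_BYTES / (size_t)type_size;
        char *p = (char *)buf;

        while (count > 0) {
            size_t n = (count < max_count) ? count : max_count;
            int rc = MPI_Bcast(p, (int)n, type, root, comm);
            if (rc != MPI_SUCCESS)
                return rc;
            p += n * (size_t)type_size;
            count -= n;
        }
        return MPI_SUCCESS;
    }

i.e. call that wherever the element count can exceed what a single
transfer apparently handles, and do the same for the other collectives.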

But the path to discovering the issue is long-winded. I'd have been
*MUCH* better off if Open MPI had spat out a big, loud error message (as
it does for psm2). The fact that the ofi mtl silently omits copying some
of my data is extremely annoying.

Best,

E.