Re: [OMPI users] silent failure for large allgather

2019-09-25 Thread Heinz, Michael William via users
Emmanuel Thomé,

Thanks for bringing this to our attention. It turns out this issue affects all 
OFI providers in Open MPI. We've applied a fix to the 3.0.x and later branches 
of open-mpi/ompi on GitHub. However, you should be aware that this fix simply 
adds the appropriate error message; it does not allow OFI to support message 
sizes larger than what the OFI provider actually supports. That will require a 
more significant effort, which we are evaluating now.
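
For reference, the limit in question is the maximum message size the provider
advertises through libfabric (ep_attr->max_msg_size). A minimal standalone
libfabric sketch that prints this limit looks roughly like the following
(illustrative only, not Open MPI code; the provider name "psm2" and the API
version are assumptions):

    #include <stdio.h>
    #include <string.h>
    #include <rdma/fabric.h>

    int main(void)
    {
        /* Hints: restrict the query to the psm2 provider (illustrative). */
        struct fi_info *hints = fi_allocinfo();
        hints->fabric_attr->prov_name = strdup("psm2");

        struct fi_info *info = NULL;
        int ret = fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, &info);
        if (ret != 0) {
            fprintf(stderr, "fi_getinfo failed: %d\n", ret);
            return 1;
        }

        /* Each fi_info entry describes one endpoint configuration; its
         * ep_attr->max_msg_size is the largest single message it accepts. */
        for (struct fi_info *cur = info; cur != NULL; cur = cur->next)
            printf("%s: max_msg_size = %zu\n",
                   cur->fabric_attr->prov_name, cur->ep_attr->max_msg_size);

        fi_freeinfo(info);
        fi_freeinfo(hints);
        return 0;
    }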

---
Mike Heinz
Networking Fabric Software Engineer
Intel Corporation


Re: [OMPI users] silent failure for large allgather

2019-09-13 Thread Jeff Squyres (jsquyres) via users
Emmanuel --

Looks like the right people missed this when you posted; sorry about that!

We're tracking it now: https://github.com/open-mpi/ompi/issues/6976


--
Jeff Squyres
jsquy...@cisco.com


Re: [OMPI users] silent failure for large allgather

2019-09-13 Thread Emmanuel Thomé via users
Hi,

Thanks Jeff for your reply, and sorry for this late follow-up...

On Sun, Aug 11, 2019 at 02:27:53PM -0700, Jeff Hammond wrote:
> > openmpi-4.0.1 gives essentially the same results (similar files
> > attached), but with various doubts on my part as to whether I've run this
> > check correctly. Here are my doubts:
> > - whether or not I should have a UCX build for an Omni-Path cluster
> >   (IIUC https://github.com/openucx/ucx/issues/750 is now fixed?),
> >
> 
> UCX is not optimized for Omni Path.  Don't use it.

good.

Does that mean the information conveyed by this message is incomplete?
It's easy to misconstrue it as an invitation to enable UCX.

--
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:  node0
  Local adapter:   hfi1_0
  Local port:  1

--
--
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   node0
  Local device: hfi1_0
--

> > - which btl I should use (I understand that openib is being deprecated
> >   and it complains unless I do --mca btl openib --mca
> >   btl_openib_allow_ib true; fine. But then, which non-openib, non-tcp
> >   btl should I use instead?)
> >
> 
> OFI->PS2 and PSM2 are the right conduits for Omni Path.

I assume you meant ofi->psm2 and psm2. I understand that --mca mtl ofi
should be the right choice in that case, and that --mca mtl psm2 should be
as well. Unfortunately, that doesn't tell me much about pml and btl
selection, if those happen to matter (pml certainly does, based on my
initial report).
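
For concreteness, the invocations I have in mind (same hostfile and test
binary as in my original report; illustrative only) are:

    mpiexec -machinefile /tmp/hosts --map-by node -n 2 --mca mtl psm2 ./a.out
    mpiexec -machinefile /tmp/hosts --map-by node -n 2 --mca mtl ofi ./a.out
    mpiexec -machinefile /tmp/hosts --map-by node -n 2 --mca pml ob1 ./a.out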

> It sounds like Open-MPI doesn't properly support the maximum transfer size
> of PSM2.  One way to work around this is to wrap your MPI collective calls
> and do <4G chunking yourself.

I'm afraid that's not a very satisfactory answer. Now that I've spent some
time diagnosing the issue, sure, I could do that sort of kludge.

But the path to discovering the issue was long-winded. I'd have been
*MUCH* better off if Open MPI had spat out a big, loud error message (like
it does for psm2). The fact that it silently omits copying some of my data
with the ofi mtl is extremely annoying.

Best,

E.


Re: [OMPI users] silent failure for large allgather

2019-08-11 Thread Jeff Hammond via users
On Tue, Aug 6, 2019 at 9:54 AM Emmanuel Thomé via users
<users@lists.open-mpi.org> wrote:

> Hi,
>
> In the attached program, the MPI_Allgather() call fails to communicate
> all data (the amount it communicates wraps around at 4G...).  I'm running
> on an Omni-Path cluster (2018 hardware), Open MPI 3.1.3 or 4.0.1 (tested
> both).
>
> With the OFI mtl, the failure is silent, with no error message reported.
> This is very annoying.
>
> With the PSM2 mtl, we have at least some info printed that 4G is a limit.
>
> I have tested it with various combinations of mca parameters. It seems
> that the one config bit that makes the test pass is the selection of the
> ob1 pml. However I have to select it explicitly, because otherwise cm is
> selected instead (priority 40 vs 20, it seems), and the program fails. I
> don't know to which extent the cm pml is the root cause, or whether I'm
> witnessing a side-effect of something.
>
> openmpi-3.1.3 (debian10 package openmpi-bin-3.1.3-11):
>
> node0 ~ $ mpiexec -machinefile /tmp/hosts --map-by node  -n 2 ./a.out
> MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 *
> 0x100010000 bytes: ...
> Message size 4295032832 bigger than supported by PSM2 API. Max =
> 4294967296
> MPI error returned:
> MPI_ERR_OTHER: known error not in list
> MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 *
> 0x100010000 bytes: NOK
> [node0.localdomain:14592] 1 more process has sent help message
> help-mtl-psm2.txt / message too big
> [node0.localdomain:14592] Set MCA parameter "orte_base_help_aggregate"
> to 0 to see all help / error messages
>
> node0 ~ $ mpiexec -machinefile /tmp/hosts --map-by node  -n 2 --mca
> mtl ofi ./a.out
> MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 *
> 0x100010000 bytes: ...
> MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 *
> 0x100010000 bytes: NOK
> node 0 failed_offset = 0x10002
> node 1 failed_offset = 0x1
>
> I attached the corresponding outputs with some mca verbose
> parameters on, plus ompi_info, as well as variations of the pml layer
> (ob1 works).
>
> openmpi-4.0.1 gives essentially the same results (similar files
> attached), but with various doubts on my part as to whether I've run this
> check correctly. Here are my doubts:
> - whether or not I should have a UCX build for an Omni-Path cluster
>   (IIUC https://github.com/openucx/ucx/issues/750 is now fixed?),
>

UCX is not optimized for Omni Path.  Don't use it.


> - which btl I should use (I understand that openib is being deprecated
>   and it complains unless I do --mca btl openib --mca
>   btl_openib_allow_ib true; fine. But then, which non-openib, non-tcp
>   btl should I use instead?)
>

OFI->PS2 and PSM2 are the right conduits for Omni Path.


> - which layers matter, which ones matter less... I tinkered with btl,
>   pml, and mtl.  It's fine if there are multiple choices, but if some
>   combinations lead to silent data corruption, that's not really
>   cool.
>

It sounds like Open-MPI doesn't properly support the maximum transfer size
of PSM2.  One way to work around this is to wrap your MPI collective calls
and do <4G chunking yourself.
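
A rough sketch of what I mean (untested; assumes contiguous byte buffers,
ignores MPI_IN_PLACE, and the helper name is made up):

    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical wrapper: allgather "bytes" bytes from each rank while
     * keeping every individual MPI call well under the 4 GiB limit.
     * recvbuf must hold nprocs * bytes bytes, laid out as usual
     * (rank r's data starting at offset r * bytes). */
    static int allgather_chunked(const void *sendbuf, void *recvbuf,
                                 size_t bytes, MPI_Comm comm)
    {
        const size_t chunk = (size_t)1 << 30;   /* 1 GiB per piece */
        int nprocs, err;
        MPI_Comm_size(comm, &nprocs);

        /* Scratch space for one chunk from every rank. */
        size_t scratch = bytes < chunk ? bytes : chunk;
        char *tmp = malloc((size_t)nprocs * scratch);
        if (tmp == NULL)
            return MPI_ERR_NO_MEM;

        for (size_t off = 0; off < bytes; off += chunk) {
            size_t len = bytes - off < chunk ? bytes - off : chunk;

            /* len <= 1 GiB, so the int count argument cannot overflow. */
            err = MPI_Allgather((const char *)sendbuf + off, (int)len, MPI_BYTE,
                                tmp, (int)len, MPI_BYTE, comm);
            if (err != MPI_SUCCESS) {
                free(tmp);
                return err;
            }

            /* Copy each rank's slice to its final position in recvbuf. */
            for (int r = 0; r < nprocs; r++)
                memcpy((char *)recvbuf + (size_t)r * bytes + off,
                       tmp + (size_t)r * len, len);
        }

        free(tmp);
        return MPI_SUCCESS;
    }

The same pattern works for the other large collectives until Open MPI / the
OFI MTL handles the provider's limit natively.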

Jeff


> Could the error reporting in this case somehow be improved?
>
> I'd be glad to provide more feedback if needed.
>
> E.



-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/