On Tue, Aug 6, 2019 at 9:54 AM Emmanuel Thomé via users <users@lists.open-mpi.org> wrote:
> Hi,
>
> In the attached program, the MPI_Allgather() call fails to communicate
> all data (the amount it communicates wraps around at 4G...). I'm running
> on an omnipath cluster (2018 hardware), openmpi 3.1.3 or 4.0.1 (tested
> both).
>
> With the OFI mtl, the failure is silent, with no error message reported.
> This is very annoying.
>
> With the PSM2 mtl, we have at least some info printed that 4G is a limit.
>
> I have tested it with various combinations of mca parameters. It seems
> that the one config bit that makes the test pass is the selection of the
> ob1 pml. However I have to select it explicitly, because otherwise cm is
> selected instead (priority 40 vs 20, it seems), and the program fails. I
> don't know to which extent the cm pml is the root cause, or whether I'm
> witnessing a side-effect of something.
>
> openmpi-3.1.3 (debian10 package openmpi-bin-3.1.3-11):
>
> node0 ~ $ mpiexec -machinefile /tmp/hosts --map-by node -n 2 ./a.out
> MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 *
> 0x100010000 bytes: ...
> Message size 4295032832 bigger than supported by PSM2 API. Max =
> 4294967296
> MPI error returned:
> MPI_ERR_OTHER: known error not in list
> MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 *
> 0x100010000 bytes: NOK
> [node0.localdomain:14592] 1 more process has sent help message
> help-mtl-psm2.txt / message too big
> [node0.localdomain:14592] Set MCA parameter "orte_base_help_aggregate"
> to 0 to see all help / error messages
>
> node0 ~ $ mpiexec -machinefile /tmp/hosts --map-by node -n 2 --mca
> mtl ofi ./a.out
> MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 *
> 0x100010000 bytes: ...
> MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 *
> 0x100010000 bytes: NOK
> node 0 failed_offset = 0x100020000
> node 1 failed_offset = 0x10000
>
> I attached the corresponding outputs with some mca verbose
> parameters on, plus ompi_info, as well as variations of the pml layer
> (ob1 works).
>
> openmpi-4.0.1 gives essentially the same results (similar files
> attached), but with various doubts on my part as to whether I've run this
> check correctly. Here are my doubts:
>
> - whether or not I should have a UCX build for an omnipath cluster
>   (IIUC https://github.com/openucx/ucx/issues/750 is now fixed ?),

UCX is not optimized for Omni Path. Don't use it.

> - which btl I should use (I understand that openib goes to
>   deprecation and it complains unless I do --mca btl openib --mca
>   btl_openib_allow_ib true ; fine. But then, which non-openib non-tcp
>   btl should I use instead ?)

OFI->PSM2 and PSM2 are the right conduits for Omni Path.

> - which layers matter, which ones matter less... I tinkered with btl
>   pml mtl. It's fine if there are multiple choices, but if some
>   combinations lead to silent data corruption, that's not really
>   cool.

It sounds like Open-MPI doesn't properly support the maximum transfer size of PSM2. One way to work around this is to wrap your MPI collective calls and do <4G chunking yourself (a sketch of this follows below).

Jeff

> Could the error reporting in this case be somehow improved ?
>
> I'd be glad to provide more feedback if needed.
>
> E.
>
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/
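For reference, here is a minimal sketch of the <4G chunking workaround described above, assuming the payload is exchanged as raw MPI_BYTE data. The helper name chunked_allgather_bytes, its max_chunk parameter, and the temporary-buffer copy are illustrative choices, not part of any Open MPI API.

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allgather `count` bytes from each rank while keeping
 * every underlying MPI message at or below `max_chunk` bytes.  Choose
 * max_chunk well under 4 GiB (and <= INT_MAX, so the int count argument of
 * MPI_Allgather stays valid).  recvbuf must hold nprocs * count bytes. */
static int chunked_allgather_bytes(const char *sendbuf, char *recvbuf,
                                   size_t count, size_t max_chunk,
                                   MPI_Comm comm)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    char *tmp = malloc((size_t) nprocs * max_chunk);
    if (tmp == NULL)
        return MPI_ERR_NO_MEM;

    for (size_t offset = 0; offset < count; offset += max_chunk) {
        size_t len = (count - offset < max_chunk) ? count - offset : max_chunk;

        /* Each per-rank contribution is now at most max_chunk bytes. */
        int rc = MPI_Allgather(sendbuf + offset, (int) len, MPI_BYTE,
                               tmp, (int) len, MPI_BYTE, comm);
        if (rc != MPI_SUCCESS) {
            free(tmp);
            return rc;
        }

        /* Copy the contiguous per-rank chunks back into the strided layout
         * that a single full-size MPI_Allgather would have produced. */
        for (int i = 0; i < nprocs; i++)
            memcpy(recvbuf + (size_t) i * count + offset,
                   tmp + (size_t) i * len, len);
    }

    free(tmp);
    return MPI_SUCCESS;
}
```

With the 2-node case from the report (each rank contributing 0x100010000 bytes, i.e. 4 GiB + 64 KiB), a max_chunk of, say, 1 GiB would split the operation into five MPI_Allgather calls, each well below the 4 GiB PSM2 limit.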