On Tue, Aug 6, 2019 at 9:54 AM Emmanuel Thomé via users <users@lists.open-mpi.org> wrote:
> Hi,
>
> In the attached program, the MPI_Allgather() call fails to communicate
> all data (the amount it communicates wraps around at 4G...). I'm running
> on an omnipath cluster (2018 hardware), openmpi 3.1.3 or 4.0.1 (tested
> both).
>
> With the OFI mtl, the failure is silent, with no error message reported.
> This is very annoying.
>
> With the PSM2 mtl, we have at least some info printed that 4G is a limit.
>
> I have tested it with various combinations of mca parameters. It seems
> that the one config bit that makes the test pass is the selection of the
> ob1 pml. However I have to select it explicitly, because otherwise cm is
> selected instead (priority 40 vs 20, it seems), and the program fails. I
> don't know to which extent the cm pml is the root cause, or whether I'm
> witnessing a side-effect of something.
>
> openmpi-3.1.3 (debian10 package openmpi-bin-3.1.3-11):
>
> node0 ~ $ mpiexec -machinefile /tmp/hosts --map-by node -n 2 ./a.out
> MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 *
> 0x100010000 bytes: ...
> Message size 4295032832 bigger than supported by PSM2 API. Max =
> 4294967296
> MPI error returned:
> MPI_ERR_OTHER: known error not in list
> MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 *
> 0x100010000 bytes: NOK
> [node0.localdomain:14592] 1 more process has sent help message
> help-mtl-psm2.txt / message too big
> [node0.localdomain:14592] Set MCA parameter "orte_base_help_aggregate"
> to 0 to see all help / error messages
>
> node0 ~ $ mpiexec -machinefile /tmp/hosts --map-by node -n 2 --mca
> mtl ofi ./a.out
> MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 *
> 0x100010000 bytes: ...
> MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 *
> 0x100010000 bytes: NOK
> node 0 failed_offset = 0x100020000
> node 1 failed_offset = 0x10000
>
> I attached the corresponding outputs with some mca verbose
> parameters on, plus ompi_info, as well as variations of the pml layer
> (ob1 works).
>
> openmpi-4.0.1 gives essentially the same results (similar files
> attached), but with various doubts on my part as to whether I've run this
> check correctly. Here are my doubts:
>
> - whether or not I should have a UCX build for an omnipath cluster
>   (IIUC https://github.com/openucx/ucx/issues/750 is now fixed ?),

UCX is not optimized for Omni Path. Don't use it.

> - which btl I should use (I understand that openib goes to
>   deprecation and it complains unless I do --mca btl openib --mca
>   btl_openib_allow_ib true ; fine. But then, which non-openib non-tcp
>   btl should I use instead ?)

OFI->PSM2 and PSM2 are the right conduits for Omni Path.

> - which layers matter, which ones matter less... I tinkered with btl
>   pml mtl. It's fine if there are multiple choices, but if some
>   combinations lead to silent data corruption, that's not really
>   cool.

It sounds like Open-MPI doesn't properly support the maximum transfer size of PSM2. One way to work around this is to wrap your MPI collective calls and do <4G chunking yourself (a sketch of this follows below).

Jeff

> Could the error reporting in this case be somehow improved ?
>
> I'd be glad to provide more feedback if needed.
>
> E.
>
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users

--
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/
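For reference, here is a minimal sketch of the <4G chunking workaround described above, assuming the payload is exchanged as raw MPI_BYTE data. The helper name chunked_allgather_bytes, its max_chunk parameter, and the temporary-buffer copy are illustrative choices, not part of any Open MPI API.

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: allgather `count` bytes from each rank while keeping
 * every underlying MPI message at or below `max_chunk` bytes.  Choose
 * max_chunk well under 4 GiB (and <= INT_MAX, so the int count argument of
 * MPI_Allgather stays valid).  recvbuf must hold nprocs * count bytes. */
static int chunked_allgather_bytes(const char *sendbuf, char *recvbuf,
                                   size_t count, size_t max_chunk,
                                   MPI_Comm comm)
{
    int nprocs;
    MPI_Comm_size(comm, &nprocs);

    char *tmp = malloc((size_t) nprocs * max_chunk);
    if (tmp == NULL)
        return MPI_ERR_NO_MEM;

    for (size_t offset = 0; offset < count; offset += max_chunk) {
        size_t len = (count - offset < max_chunk) ? count - offset : max_chunk;

        /* Each per-rank contribution is now at most max_chunk bytes. */
        int rc = MPI_Allgather(sendbuf + offset, (int) len, MPI_BYTE,
                               tmp, (int) len, MPI_BYTE, comm);
        if (rc != MPI_SUCCESS) {
            free(tmp);
            return rc;
        }

        /* Copy the contiguous per-rank chunks back into the strided layout
         * that a single full-size MPI_Allgather would have produced. */
        for (int i = 0; i < nprocs; i++)
            memcpy(recvbuf + (size_t) i * count + offset,
                   tmp + (size_t) i * len, len);
    }

    free(tmp);
    return MPI_SUCCESS;
}
```

With the 2-node case from the report (each rank contributing 0x100010000 bytes, i.e. 4 GiB + 64 KiB), a max_chunk of, say, 1 GiB would split the operation into five MPI_Allgather calls, each well below the 4 GiB PSM2 limit.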