Re: [OMPI users] silent failure for large allgather
Hi,

Thanks Jeff for your reply, and sorry for this late follow-up...

On Sun, Aug 11, 2019 at 02:27:53PM -0700, Jeff Hammond wrote:
> > openmpi-4.0.1 gives essentially the same results (similar files
> > attached), but with various doubts on my part as to whether I've run
> > this check correctly. Here are my doubts:
> > - whether I should or not have an ucx build for an omnipath cluster
> >   (IIUC https://github.com/openucx/ucx/issues/750 is now fixed ?),
>
> UCX is not optimized for Omni Path. Don't use it.

Good. Does that mean that the information conveyed by this message is
incomplete? It's easy to misconstrue it as an invitation to enable UCX:

--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device are
not used by default. The intent is to use UCX for these devices. You can
override this policy by setting the btl_openib_allow_ib MCA parameter to
true.

  Local host:    node0
  Local adapter: hfi1_0
  Local port:    1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   node0
  Local device: hfi1_0
--------------------------------------------------------------------------

> > - which btl I should use (I understand that openib goes to
> >   deprecation and it complains unless I do --mca btl openib --mca
> >   btl_openib_allow_ib true ; fine. But then, which non-openib non-tcp
> >   btl should I use instead ?)
>
> OFI->PS2 and PSM2 are the right conduits for Omni Path.

I assume you meant ofi->psm2 and psm2. I understand that --mca mtl ofi
should be right in that case, and that --mca mtl psm2 should be as well.
Unfortunately that doesn't tell me much about pml and btl selection, if
these happen to matter (pml certainly does, based on my initial report).

> It sounds like Open-MPI doesn't properly support the maximum transfer
> size of PSM2. One way to work around this is to wrap your MPI collective
> calls and do <4G chunking yourself.

I'm afraid that's not a very satisfactory answer. Once I've spent some
time diagnosing the issue, sure, I could do that sort of kludge.
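For reference, the "<4G chunking" kludge could be sketched along these lines. All names here (chunked_allgather, MAX_CHUNK) are mine, not anything from Open MPI; this is only an illustration under the assumption that each individual MPI_Allgather stays under the PSM2 limit, at the cost of a staging copy per round:

```c
/* Sketch of the suggested workaround: perform one large byte-count
 * allgather as several MPI_Allgather calls, each moving at most
 * MAX_CHUNK bytes in total. A temporary buffer stages each round,
 * which is then copied out to the final recvbuf layout. */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define MAX_CHUNK ((size_t)1 << 30)   /* well under the 4 GiB PSM2 limit */

/* Gather `count` bytes from each rank; on return, recvbuf holds
 * size * count bytes, with rank r's data at offset r * count. */
static int chunked_allgather(const void *sendbuf, size_t count,
                             void *recvbuf, MPI_Comm comm)
{
    int size, err = MPI_SUCCESS;
    MPI_Comm_size(comm, &size);
    size_t step = MAX_CHUNK / (size_t)size;  /* per-rank bytes per round */
    if (step == 0) step = 1;
    if (step > count) step = count;
    char *tmp = malloc((size_t)size * step);
    if (tmp == NULL) return MPI_ERR_NO_MEM;
    for (size_t off = 0; off < count && err == MPI_SUCCESS; off += step) {
        size_t n = count - off < step ? count - off : step;
        err = MPI_Allgather((const char *)sendbuf + off, (int)n, MPI_BYTE,
                            tmp, (int)n, MPI_BYTE, comm);
        /* scatter this round's data to its final positions */
        for (int r = 0; err == MPI_SUCCESS && r < size; r++)
            memcpy((char *)recvbuf + (size_t)r * count + off,
                   tmp + (size_t)r * n, n);
    }
    free(tmp);
    return err;
}
```

As a side effect, each per-call count also stays below INT_MAX, which sidesteps the int-count limitation of the MPI API itself.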
But the path to discovering the issue is long-winded. I'd have been *MUCH*
better off if openmpi had spat out a big loud error message at me (like it
does for psm2). The fact that it silently omits copying some of my data
with the ofi mtl is extremely annoying.

Best,

E.

___
users mailing list
users@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/users
[OMPI users] silent failure for large allgather
Hi,

In the attached program, the MPI_Allgather() call fails to communicate all
data (the amount it communicates wraps around at 4G...). I'm running on an
omnipath cluster (2018 hardware), with openmpi 3.1.3 or 4.0.1 (tested
both).

With the OFI mtl, the failure is silent, with no error message reported.
This is very annoying. With the PSM2 mtl, we have at least some info
printed that 4G is a limit.

I have tested various combinations of mca parameters. It seems that the
one config bit that makes the test pass is the selection of the ob1 pml.
However I have to select it explicitly, because otherwise cm is selected
instead (priority 40 vs 20, it seems), and the program fails. I don't know
to what extent the cm pml is the root cause, or whether I'm witnessing a
side effect of something else.

openmpi-3.1.3 (debian10 package openmpi-bin-3.1.3-11):

node0 ~ $ mpiexec -machinefile /tmp/hosts --map-by node -n 2 ./a.out
MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 * 0x100010000 bytes: ...
Message size 4295032832 bigger than supported by PSM2 API. Max = 4294967296
MPI error returned:
MPI_ERR_OTHER: known error not in list
MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 * 0x100010000 bytes: NOK
[node0.localdomain:14592] 1 more process has sent help message help-mtl-psm2.txt / message too big
[node0.localdomain:14592] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

node0 ~ $ mpiexec -machinefile /tmp/hosts --map-by node -n 2 --mca mtl ofi ./a.out
MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 * 0x100010000 bytes: ...
MPI_Allgather, 2 nodes, 0x10001 chunks of 0x10000 bytes, total 2 * 0x100010000 bytes: NOK
node 0 failed_offset = 0x100020000
node 1 failed_offset = 0x10000

I attached the corresponding outputs with some mca verbose parameters on,
plus ompi_info, as well as variations of the pml layer (ob1 works).
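The numbers in the failing run are consistent with a 32-bit truncation of the byte count: 0x10001 chunks of 0x10000 bytes is 4 GiB + 64 KiB per rank, which a 32-bit length field reduces to just 64 KiB. A minimal self-contained check of that arithmetic (wrapped_len is a hypothetical name for the truncation, not an Open MPI symbol):

```c
/* The per-rank payload in the failing test is
 * 0x10001 chunks * 0x10000 bytes = 0x100010000 bytes (4 GiB + 64 KiB). */
#include <stdint.h>
#include <stddef.h>

/* What a hypothetical 32-bit length field would retain of nbytes. */
static uint32_t wrapped_len(size_t nbytes)
{
    return (uint32_t)nbytes;   /* silently drops everything above 4 GiB */
}
```

Only the low 32 bits survive, so of the 4 GiB + 64 KiB payload, a mere 64 KiB would actually move, which matches the tiny failed_offset values above.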
openmpi-4.0.1 gives essentially the same results (similar files attached),
but with various doubts on my part as to whether I've run this check
correctly. Here are my doubts:

- whether or not I should have a ucx build for an omnipath cluster
  (IIUC https://github.com/openucx/ucx/issues/750 is now fixed ?),
- which btl I should use (I understand that openib is headed for
  deprecation and it complains unless I do --mca btl openib --mca
  btl_openib_allow_ib true ; fine. But then, which non-openib, non-tcp
  btl should I use instead ?),
- which layers matter, and which matter less... I tinkered with btl, pml,
  and mtl. It's fine if there are multiple choices, but if some
  combinations lead to silent data corruption, that's not really cool.

Could the error reporting in this case be somehow improved? I'd be glad to
provide more feedback if needed.

E.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

long failed_offset = 0;

size_t chunk_size = 1 << 16;
size_t nchunks = (1 << 16) + 1;

int main(int argc, char * argv[])
{
    if (argc >= 2) chunk_size = atol(argv[1]);
    if (argc >= 3) nchunks = atol(argv[2]);

    MPI_Init(&argc, &argv);
    /*
     * This program returns:
     *  0 on success.
     *  a non-zero MPI Error code if MPI_Allgather returned one.
     *  -1 if no MPI Error code was returned, but the result of Allgather
     *     was wrong.
     *  -2 if memory allocation failed.
     *
     * (note that the MPI document guarantees that MPI error codes are
     * positive integers)
     */
    int size, rank;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int err;

    char * check_text;
    int rc = asprintf(&check_text,
            "MPI_Allgather, %d nodes, 0x%zx chunks of 0x%zx bytes,"
            " total %d * 0x%zx bytes",
            size, nchunks, chunk_size, size, chunk_size * nchunks);
    if (rc < 0) abort();

    if (!rank) printf("%s: ...\n", check_text);

    MPI_Datatype mpi_ft;
    MPI_Type_contiguous(chunk_size, MPI_BYTE, &mpi_ft);
    MPI_Type_commit(&mpi_ft);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    void * data = malloc(nchunks * size * chunk_size);
    int alloc_ok = data != NULL;
    MPI_Allreduce(MPI_IN_PLACE, &alloc_ok, 1, MPI_INT, MPI_MIN,
            MPI_COMM_WORLD);
    if (alloc_ok) {
        memset(data, 0, nchunks * size * chunk_size);
        memset(((char*)data) + nchunks * chunk_size * rank, 0x42,
                nchunks * chunk_size);
        err = MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                data, nchunks, mpi_ft, MPI_COMM_WORLD);
        if (err == 0) {
            void * p = memchr(data, 0, nchunks * size * chunk_size);
            if (p != NULL) {
                /* We found a zero, we shouldn't ! */
                err = -1;
                failed_offset = ((char*)p)-(char*)data;
            }
        }
    } else {
        err = -2;
    }
    if (data) free(data);
    MPI_Type_free(&mpi_ft);
    if (!rank) {
        printf("%s: %s\n", check_text, err == 0 ? "ok" : "NOK");
    }
    if (err == -1)
        printf("node %d failed_offset = 0x%lx\n", rank, failed_offset);
    free(check_text);
    MPI_Finalize();
    return err;
}
[OMPI users] pml ^ucx + mtl ofi (nonsensical ?) --> segfault at large sizes
Hi,

I came across this. openmpi-4.0.1 compiled with:

../openmpi-4.0.1/configure --disable-mpi-fortran --without-cuda --disable-opencl --with-ucx=/path/to/ucx-1.5.1

The execution of the attached program (a simple MPI_Send / MPI_Recv pair)
gives a segfault when the message size exceeds 2^30. I'm seeing the
failure on debian10 nodes connected with 1G ethernet and mellanox IB FDR
(ConnectX-3). On another cluster with omnipath interconnect, the test
passes fine. Both have ipoib configured.

node0 ~ $ mpiexec -machinefile /tmp/hosts -n 2 --mca btl tcp,self --mca mtl ofi --mca pml ^ucx ./a.out 12

Maybe this btl/pml/mtl combination is nonsensical, I don't know. What
annoys me is that the following failure:
 1 - occurs only for large messages, not for smaller test runs
 2 - is not recoverable via MPI_ERRORS_RETURN

Output:

[node0:9791 :0:9791] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
 0 /path/to/ucx-1.5.1/lib/libucs.so.0(+0x1dee0) [0x7f21e2b01ee0]
 1 /path/to/ucx-1.5.1/lib/libucs.so.0(+0x1e188) [0x7f21e2b02188]
===================
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned a non-zero exit
code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 1 with PID 0 on node node0 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Running this under gdb, it seems that the backtrace just points to the ucs
signal handler, and that the cause of the segv is here
(ompi/mca/mtl/ofi/mtl_ofi.h:107):

    } else if (OPAL_UNLIKELY(ret == -FI_EAVAIL)) {
        /**
         * An error occured and is being reported via the CQ.
         * Read the error and forward it to the upper layer.
         */
        [...]
        ret = ofi_req->error_callback(&error, ofi_req);

with ofi_req->error_callback being, unfortunately, NULL.

Is it really just me doing something absolutely silly, or is it something
that ought to be fixed?

Best,

E.
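The crash itself would be avoidable with an ordinary guard on the function pointer before dispatching. A self-contained illustration of the pattern (the toy_req struct and names are mine, not the actual mtl_ofi request type):

```c
#include <stddef.h>

/* Toy stand-in for a request carrying an optional error callback. */
struct toy_req {
    int (*error_callback)(int err, struct toy_req *r);
};

/* Dispatch the error to the callback if one is set; otherwise return a
 * generic failure code instead of calling through a NULL pointer, which
 * is what the mtl_ofi error path above ends up doing. */
static int toy_report_error(struct toy_req *r, int err)
{
    if (r->error_callback != NULL)
        return r->error_callback(err, r);
    return -1;   /* generic failure, recoverable by the caller */
}
```

With such a guard, the error would surface as a return code that MPI_ERRORS_RETURN could hand back to the application, rather than as an unrecoverable segfault.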
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char * argv[])
{
    size_t chunk = 3 << 29;
    if (argc > 1) chunk = atol(argv[1]);
    int rank;
    int size;
    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    void * data = malloc(chunk);
    memset(data, 0x42, chunk);
    if (rank == 0) {
        MPI_Send(data, chunk, MPI_BYTE, 1, 0xbeef, MPI_COMM_WORLD);
    } else {
        MPI_Recv(data, chunk, MPI_BYTE, 0, 0xbeef, MPI_COMM_WORLD,
                MPI_STATUS_IGNORE);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    printf("ok\n");
    MPI_Finalize();
}