Mark Dixon via users <users@lists.open-mpi.org> writes:

> Surely I cannot be the only one who cares about using a recent openmpi
> with hdf5 on lustre?

I generally have similar concerns.  I dug out the romio tests, assuming
something more basic is useful.  I ran them with ompi 4.0.5+ucx on
Mark's lustre system (similar to a few nodes of Summit, apart from the
filesystem, but with quad-rail IB which doesn't give the bandwidth I
expected).

The perf test says romio performs a bit better.  Also -- from overall
time -- it's faster on IMB-IO (which I haven't looked at in detail, and
ran with suboptimal striping).

  Test: perf

  romio321
    Access size per process = 4194304 bytes, ntimes = 5
    Write bandwidth without file sync = 19317.372354 Mbytes/sec
    Read bandwidth without prior file sync = 35033.325451 Mbytes/sec
    Write bandwidth including file sync = 1081.096713 Mbytes/sec
    Read bandwidth after file sync = 47135.349155 Mbytes/sec

  ompio
    Access size per process = 4194304 bytes, ntimes = 5
    Write bandwidth without file sync = 18442.698536 Mbytes/sec
    Read bandwidth without prior file sync = 31958.198676 Mbytes/sec
    Write bandwidth including file sync = 1081.058583 Mbytes/sec
    Read bandwidth after file sync = 31506.854710 Mbytes/sec

However, romio coll_perf fails as follows, and ompio runs.  Isn't there
mpi-io regression testing?

[gpu025:89063:0:89063] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1fffbc000010)
==== backtrace (tid: 89063) ====
 0 0x000000000005453c ucs_debug_print_backtrace() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucs/debug/debug.c:656
 1 0x0000000000041b04 ucp_rndv_pack_data() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1335
 2 0x000000000001c814 uct_self_ep_am_bcopy() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:278
 3 0x000000000003f7ac uct_ep_am_bcopy() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/api/uct.h:2561
 4 0x000000000003f7ac ucp_do_am_bcopy_multi() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/proto/proto_am.inl:79
 5 0x000000000003f7ac ucp_rndv_progress_am_bcopy() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1352
 6 0x0000000000041cb8 ucp_request_try_send() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:223
 7 0x0000000000041cb8 ucp_request_send() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:258
 8 0x0000000000041cb8 ucp_rndv_rtr_handler() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1754
 9 0x000000000001c984 uct_iface_invoke_am() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/base/uct_iface.h:635
10 0x000000000001c984 uct_self_iface_sendrecv_am() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:149
11 0x000000000001c984 uct_self_ep_am_short() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:262
12 0x000000000002ee30 uct_ep_am_short() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/api/uct.h:2549
13 0x000000000002ee30 ucp_do_am_single() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/proto/proto_am.c:68
14 0x0000000000042908 ucp_proto_progress_rndv_rtr() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:172
15 0x000000000003f4c4 ucp_request_try_send() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:223
16 0x000000000003f4c4 ucp_request_send() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:258
17 0x000000000003f4c4 ucp_rndv_req_send_rtr() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:423
18 0x0000000000045214 ucp_rndv_matched() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1262
19 0x0000000000046158 ucp_rndv_process_rts() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1280
20 0x0000000000046268 ucp_rndv_rts_handler() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1304
21 0x000000000001c984 uct_iface_invoke_am() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/base/uct_iface.h:635
22 0x000000000001c984 uct_self_iface_sendrecv_am() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:149
23 0x000000000001c984 uct_self_ep_am_short() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:262
24 0x000000000002ee30 uct_ep_am_short() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/api/uct.h:2549
25 0x000000000002ee30 ucp_do_am_single() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/proto/proto_am.c:68
26 0x000000000003f430 ucp_proto_progress_rndv_rts() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:125
27 0x0000000000049df4 ucp_request_try_send() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:223
28 0x0000000000049df4 ucp_request_send() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:258
29 0x0000000000049df4 ucp_tag_send_req() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/tag_send.c:124
30 0x0000000000049df4 ucp_tag_send_nbx() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/tag_send.c:289
31 0x0000000000007170 mca_pml_ucx_isend() ???:0
32 0x00000000000adf94 MPI_Isend() ???:0
33 0x000000000001cfbc ADIOI_LUSTRE_WriteStridedColl() ???:0
34 0x0000000000018318 MPIOI_File_write_all() ???:0
35 0x00000000000184c8 mca_io_romio_dist_MPI_File_write_all() ???:0
36 0x000000000000f1fc mca_io_romio321_file_write_all() ???:0
37 0x00000000000a0e7c MPI_File_write_all() ???:0
38 0x00000000100017a0 main() /users/***/lustre/openmpi-4.0.5/ompi/mca/io/romio321/romio/test/coll_perf.c:97
39 0x0000000000025200 generic_start_main.isra.0() libc-start.c:0
40 0x00000000000253f4 __libc_start_main() ???:0
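
To put a rough number on "a bit better" for the perf figures quoted earlier, here is a minimal Python sketch that computes how much faster romio321 was than ompio on each of the four measurements. The bandwidth values are copied from the perf output in this message; the dictionary keys and the `percent_faster` helper are just illustrative names, and a single run with ntimes = 5 is not much to draw conclusions from.

```python
# Bandwidth figures (Mbytes/sec) copied from the perf output in this message.
# The key names are illustrative labels, not part of the perf test itself.
romio = {
    "write, no sync": 19317.372354,
    "read, no prior sync": 35033.325451,
    "write, with sync": 1081.096713,
    "read, after sync": 47135.349155,
}
ompio = {
    "write, no sync": 18442.698536,
    "read, no prior sync": 31958.198676,
    "write, with sync": 1081.058583,
    "read, after sync": 31506.854710,
}


def percent_faster(a, b):
    """Return how much faster a is than b, as a percentage of b."""
    return (a / b - 1.0) * 100.0


# romio321's advantage over ompio for each measurement.
advantage = {name: percent_faster(romio[name], ompio[name]) for name in romio}

for name, pct in advantage.items():
    print(f"{name:20s} romio321 vs ompio: {pct:+.2f}%")
```

On these figures romio321 comes out about 5% ahead on unsynced writes, about 10% on reads without a prior sync, and about 50% on reads after a sync, while the synced-write numbers are essentially identical.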