Hi all, I opened a new issue to track the coll_perf failure in case it's not related to the HDF5 problem reported earlier.
https://github.com/open-mpi/ompi/issues/8246

Howard

On Mon, 23 Nov 2020 at 12:14, Dave Love via users <users@lists.open-mpi.org> wrote:

> Mark Dixon via users <users@lists.open-mpi.org> writes:
>
> > Surely I cannot be the only one who cares about using a recent openmpi
> > with hdf5 on lustre?
>
> I generally have similar concerns. I dug out the romio tests, assuming
> something more basic is useful. I ran them with ompi 4.0.5+ucx on
> Mark's lustre system (similar to a few nodes of Summit, apart from the
> filesystem, but with quad-rail IB which doesn't give the bandwidth I
> expected).
>
> The perf test says romio performs a bit better. Also -- from overall
> time -- it's faster on IMB-IO (which I haven't looked at in detail, and
> ran with suboptimal striping).
>
> Test: perf
> romio321
> Access size per process = 4194304 bytes, ntimes = 5
> Write bandwidth without file sync = 19317.372354 Mbytes/sec
> Read bandwidth without prior file sync = 35033.325451 Mbytes/sec
> Write bandwidth including file sync = 1081.096713 Mbytes/sec
> Read bandwidth after file sync = 47135.349155 Mbytes/sec
> ompio
> Access size per process = 4194304 bytes, ntimes = 5
> Write bandwidth without file sync = 18442.698536 Mbytes/sec
> Read bandwidth without prior file sync = 31958.198676 Mbytes/sec
> Write bandwidth including file sync = 1081.058583 Mbytes/sec
> Read bandwidth after file sync = 31506.854710 Mbytes/sec
>
> However, romio coll_perf fails as follows, and ompio runs. Isn't there
> mpi-io regression testing?
>
> [gpu025:89063:0:89063] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x1fffbc000010)
> ==== backtrace (tid: 89063) ====
> 0 0x000000000005453c ucs_debug_print_backtrace() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucs/debug/debug.c:656
> 1 0x0000000000041b04 ucp_rndv_pack_data() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1335
> 2 0x000000000001c814 uct_self_ep_am_bcopy() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:278
> 3 0x000000000003f7ac uct_ep_am_bcopy() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/api/uct.h:2561
> 4 0x000000000003f7ac ucp_do_am_bcopy_multi() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/proto/proto_am.inl:79
> 5 0x000000000003f7ac ucp_rndv_progress_am_bcopy() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1352
> 6 0x0000000000041cb8 ucp_request_try_send() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:223
> 7 0x0000000000041cb8 ucp_request_send() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:258
> 8 0x0000000000041cb8 ucp_rndv_rtr_handler() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1754
> 9 0x000000000001c984 uct_iface_invoke_am() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/base/uct_iface.h:635
> 10 0x000000000001c984 uct_self_iface_sendrecv_am() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:149
> 11 0x000000000001c984 uct_self_ep_am_short() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:262
> 12 0x000000000002ee30 uct_ep_am_short() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/api/uct.h:2549
> 13 0x000000000002ee30 ucp_do_am_single() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/proto/proto_am.c:68
> 14 0x0000000000042908 ucp_proto_progress_rndv_rtr() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:172
> 15 0x000000000003f4c4 ucp_request_try_send() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:223
> 16 0x000000000003f4c4 ucp_request_send() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:258
> 17 0x000000000003f4c4 ucp_rndv_req_send_rtr() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:423
> 18 0x0000000000045214 ucp_rndv_matched() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1262
> 19 0x0000000000046158 ucp_rndv_process_rts() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1280
> 20 0x0000000000046268 ucp_rndv_rts_handler() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1304
> 21 0x000000000001c984 uct_iface_invoke_am() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/base/uct_iface.h:635
> 22 0x000000000001c984 uct_self_iface_sendrecv_am() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:149
> 23 0x000000000001c984 uct_self_ep_am_short() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:262
> 24 0x000000000002ee30 uct_ep_am_short() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/api/uct.h:2549
> 25 0x000000000002ee30 ucp_do_am_single() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/proto/proto_am.c:68
> 26 0x000000000003f430 ucp_proto_progress_rndv_rts() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:125
> 27 0x0000000000049df4 ucp_request_try_send() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:223
> 28 0x0000000000049df4 ucp_request_send() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:258
> 29 0x0000000000049df4 ucp_tag_send_req() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/tag_send.c:124
> 30 0x0000000000049df4 ucp_tag_send_nbx() /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/tag_send.c:289
> 31 0x0000000000007170 mca_pml_ucx_isend() ???:0
> 32 0x00000000000adf94 MPI_Isend() ???:0
> 33 0x000000000001cfbc ADIOI_LUSTRE_WriteStridedColl() ???:0
> 34 0x0000000000018318 MPIOI_File_write_all() ???:0
> 35 0x00000000000184c8 mca_io_romio_dist_MPI_File_write_all() ???:0
> 36 0x000000000000f1fc mca_io_romio321_file_write_all() ???:0
> 37 0x00000000000a0e7c MPI_File_write_all() ???:0
> 38 0x00000000100017a0 main() /users/***/lustre/openmpi-4.0.5/ompi/mca/io/romio321/romio/test/coll_perf.c:97
> 39 0x0000000000025200 generic_start_main.isra.0() libc-start.c:0
> 40 0x00000000000253f4 __libc_start_main() ???:0
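
For anyone who wants to compare the quoted perf figures side by side, here is a minimal sketch in Python. The numbers are copied from Dave's output above. The dictionary layout, the measurement labels, and the `ratios` helper are hypothetical names chosen for illustration; none of this is part of the romio test suite.

```python
# Bandwidth figures (Mbytes/sec) copied from the perf output quoted above.
# The layout, the short labels and the helper name are illustrative assumptions.
PERF = {
    "romio321": {
        "write, no sync": 19317.372354,
        "read, no prior sync": 35033.325451,
        "write, with sync": 1081.096713,
        "read, after sync": 47135.349155,
    },
    "ompio": {
        "write, no sync": 18442.698536,
        "read, no prior sync": 31958.198676,
        "write, with sync": 1081.058583,
        "read, after sync": 31506.854710,
    },
}


def ratios(perf, num="romio321", den="ompio"):
    """Return the num/den bandwidth ratio for each measurement."""
    return {name: perf[num][name] / perf[den][name] for name in perf[num]}


if __name__ == "__main__":
    for name, ratio in ratios(PERF).items():
        print(f"{name:22s} romio321/ompio = {ratio:.3f}")
```

On these figures the two components are within about 5% of each other for writes, and the largest gap is the read after file sync, where romio321 is roughly 1.5 times ompio.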