Hi All,

I opened a new issue to track the coll_perf failure in case it's not related
to the HDF5 problem reported earlier.

https://github.com/open-mpi/ompi/issues/8246

Howard


On Mon, 23 Nov 2020 at 12:14, Dave Love via users <
users@lists.open-mpi.org> wrote:

> Mark Dixon via users <users@lists.open-mpi.org> writes:
>
> > Surely I cannot be the only one who cares about using a recent openmpi
> > with hdf5 on lustre?
>
> I generally have similar concerns.  I dug out the romio tests, assuming
> something more basic is useful.  I ran them with ompi 4.0.5+ucx on
> Mark's lustre system (similar to a few nodes of Summit, apart from the
> filesystem, but with quad-rail IB which doesn't give the bandwidth I
> expected).
>
> The perf test says romio performs a bit better.  Also -- from overall
> time -- it's faster on IMB-IO (which I haven't looked at in detail, and
> ran with suboptimal striping).
>
>   Test: perf
>   romio321
>   Access size per process = 4194304 bytes, ntimes = 5
>   Write bandwidth without file sync = 19317.372354 Mbytes/sec
>   Read bandwidth without prior file sync = 35033.325451 Mbytes/sec
>   Write bandwidth including file sync = 1081.096713 Mbytes/sec
>   Read bandwidth after file sync = 47135.349155 Mbytes/sec
>   ompio
>   Access size per process = 4194304 bytes, ntimes = 5
>   Write bandwidth without file sync = 18442.698536 Mbytes/sec
>   Read bandwidth without prior file sync = 31958.198676 Mbytes/sec
>   Write bandwidth including file sync = 1081.058583 Mbytes/sec
>   Read bandwidth after file sync = 31506.854710 Mbytes/sec
>
> However, romio coll_perf fails as follows, and ompio runs.  Isn't there
> mpi-io regression testing?
>
>   [gpu025:89063:0:89063] Caught signal 11 (Segmentation fault: address not
> mapped to object at address 0x1fffbc000010)
>   ==== backtrace (tid:  89063) ====
>    0 0x000000000005453c ucs_debug_print_backtrace()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucs/debug/debug.c:656
>    1 0x0000000000041b04 ucp_rndv_pack_data()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1335
>    2 0x000000000001c814 uct_self_ep_am_bcopy()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:278
>    3 0x000000000003f7ac uct_ep_am_bcopy()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/api/uct.h:2561
>    4 0x000000000003f7ac ucp_do_am_bcopy_multi()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/proto/proto_am.inl:79
>    5 0x000000000003f7ac ucp_rndv_progress_am_bcopy()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1352
>    6 0x0000000000041cb8 ucp_request_try_send()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:223
>    7 0x0000000000041cb8 ucp_request_send()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:258
>    8 0x0000000000041cb8 ucp_rndv_rtr_handler()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1754
>    9 0x000000000001c984 uct_iface_invoke_am()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/base/uct_iface.h:635
>   10 0x000000000001c984 uct_self_iface_sendrecv_am()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:149
>   11 0x000000000001c984 uct_self_ep_am_short()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:262
>   12 0x000000000002ee30 uct_ep_am_short()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/api/uct.h:2549
>   13 0x000000000002ee30 ucp_do_am_single()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/proto/proto_am.c:68
>   14 0x0000000000042908 ucp_proto_progress_rndv_rtr()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:172
>   15 0x000000000003f4c4 ucp_request_try_send()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:223
>   16 0x000000000003f4c4 ucp_request_send()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:258
>   17 0x000000000003f4c4 ucp_rndv_req_send_rtr()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:423
>   18 0x0000000000045214 ucp_rndv_matched()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1262
>   19 0x0000000000046158 ucp_rndv_process_rts()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1280
>   20 0x0000000000046268 ucp_rndv_rts_handler()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:1304
>   21 0x000000000001c984 uct_iface_invoke_am()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/base/uct_iface.h:635
>   22 0x000000000001c984 uct_self_iface_sendrecv_am()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:149
>   23 0x000000000001c984 uct_self_ep_am_short()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/sm/self/self.c:262
>   24 0x000000000002ee30 uct_ep_am_short()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/uct/api/uct.h:2549
>   25 0x000000000002ee30 ucp_do_am_single()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/proto/proto_am.c:68
>   26 0x000000000003f430 ucp_proto_progress_rndv_rts()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/rndv.c:125
>   27 0x0000000000049df4 ucp_request_try_send()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:223
>   28 0x0000000000049df4 ucp_request_send()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/core/ucp_request.inl:258
>   29 0x0000000000049df4 ucp_tag_send_req()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/tag_send.c:124
>   30 0x0000000000049df4 ucp_tag_send_nbx()
> /tmp/***/spack-stage/spack-stage-ucx-1.9.0-wqtizxmjw66cklwpuq3zcrae2g33b6el/spack-src/src/ucp/tag/tag_send.c:289
>   31 0x0000000000007170 mca_pml_ucx_isend()  ???:0
>   32 0x00000000000adf94 MPI_Isend()  ???:0
>   33 0x000000000001cfbc ADIOI_LUSTRE_WriteStridedColl()  ???:0
>   34 0x0000000000018318 MPIOI_File_write_all()  ???:0
>   35 0x00000000000184c8 mca_io_romio_dist_MPI_File_write_all()  ???:0
>   36 0x000000000000f1fc mca_io_romio321_file_write_all()  ???:0
>   37 0x00000000000a0e7c MPI_File_write_all()  ???:0
>   38 0x00000000100017a0 main()
> /users/***/lustre/openmpi-4.0.5/ompi/mca/io/romio321/romio/test/coll_perf.c:97
>   39 0x0000000000025200 generic_start_main.isra.0()  libc-start.c:0
>   40 0x00000000000253f4 __libc_start_main()  ???:0
>
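
To put a number on "a bit better" in the quoted perf results, the two sets of figures can be compared directly. The following is only a sketch: the bandwidth values are copied from the quoted output, while the dictionary keys and the helper name `bandwidth_ratios` are illustrative labels, not part of the romio test suite.

```python
# Bandwidth figures (Mbytes/sec) copied from the quoted "Test: perf" output.
# The key names are illustrative labels chosen for this sketch.
romio321 = {
    "write, no sync": 19317.372354,
    "read, no prior sync": 35033.325451,
    "write, with sync": 1081.096713,
    "read, after sync": 47135.349155,
}
ompio = {
    "write, no sync": 18442.698536,
    "read, no prior sync": 31958.198676,
    "write, with sync": 1081.058583,
    "read, after sync": 31506.854710,
}


def bandwidth_ratios(a, b):
    """Return a/b for every measurement present in both result sets."""
    return {name: a[name] / b[name] for name in a if name in b}


ratios = bandwidth_ratios(romio321, ompio)
for name, ratio in ratios.items():
    # A ratio above 1.0 means romio321 reported the higher bandwidth.
    print(f"{name:20s} romio321/ompio = {ratio:.3f}")
```

A ratio close to 1.0 means the two components are effectively level on that measurement; the further above 1.0, the larger romio321's reported advantage.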
