Hi Noam,

Can you try your original command line with the following addition:
mpirun --mca pml ucx --mca btl ^vader,tcp,openib --mca osc ucx

I think we're seeing some conflict between the UCX PML and the UCT OSC.

Josh

On Wed, Jun 19, 2019 at 4:36 PM Noam Bernstein via users <users@lists.open-mpi.org> wrote:

> On Jun 19, 2019, at 2:44 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> To completely disable UCX you need to disable the UCX MTL and not only the
> BTL. I would use "--mca pml ob1 --mca btl ^ucx --mca btl_openib_allow_ib 1".
>
> Thanks for the pointer. Disabling UCX this way _does_ seem to fix the
> memory issue. That's a very helpful workaround, if nothing else.
>
> Using UCX 1.5.1 downloaded from the UCX web site at runtime (just by
> inserting it into LD_LIBRARY_PATH, without recompiling Open MPI) doesn't
> seem to fix the problem.
>
> As you have a gdb session on the processes you can try to break on some of
> the memory allocation functions (malloc, realloc, calloc).
>
> Good idea. I set breakpoints on all 3 of those, then did "c" 3 times.
> Does this mean anything to anyone? I'm investigating the upstream calls
> (not included below) that generate these calls to mpi_bcast, but given that
> it works on other types of nodes, I doubt those are problematic.
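For anyone reproducing this, the breakpoint procedure described above amounts to roughly the following gdb invocation; the PID is illustrative (attach to one of the running MPI ranks), and this is a sketch of the session rather than the exact commands used:

```shell
# Attach to one running MPI rank (PID 12345 is illustrative).
# Break on the three allocation entry points, then alternate
# 'continue' and 'bt' to see who is allocating.
gdb -p 12345 -batch \
    -ex 'break malloc' -ex 'break calloc' -ex 'break realloc' \
    -ex 'continue' -ex 'bt' \
    -ex 'continue' -ex 'bt' \
    -ex 'continue' -ex 'bt'
```

Each `continue` runs until the next allocation call; `bt` then prints a backtrace like the ones below.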
> #0  0x00002b9e5303e160 in malloc () from /lib64/libc.so.6
> #1  0x00002b9e651f358a in ucs_rcache_create_region (region_p=0x7fff82806da0, arg=0x7fff82806d9c, prot=3, length=131072, address=0x2b9e76102070, rcache=0xb341a50) at sys/rcache.c:500
> #2  ucs_rcache_get (rcache=0xb341a50, address=0x2b9e76102070, length=131072, prot=prot@entry=3, arg=arg@entry=0x7fff82806d9c, region_p=region_p@entry=0x7fff82806da0) at sys/rcache.c:612
> #3  0x00002b9e64f7a3d4 in uct_ib_mem_rcache_reg (uct_md=<optimized out>, address=<optimized out>, length=<optimized out>, flags=96, memh_p=0xbc409b0) at ib/base/ib_md.c:990
> #4  0x00002b9e64d245e2 in ucp_mem_rereg_mds (context=<optimized out>, reg_md_map=4, address=address@entry=0x2b9e76102070, length=<optimized out>, uct_flags=uct_flags@entry=96, alloc_md=alloc_md@entry=0x0, mem_type=mem_type@entry=UCT_MD_MEM_TYPE_HOST, alloc_md_memh_p=alloc_md_memh_p@entry=0x0, uct_memh=uct_memh@entry=0xbc409b0, md_map_p=md_map_p@entry=0xbc409a8) at core/ucp_mm.c:100
> #5  0x00002b9e64d260f0 in ucp_request_memory_reg (context=0xb340800, md_map=4, buffer=0x2b9e76102070, length=131072, datatype=128, state=state@entry=0xbc409a0, mem_type=UCT_MD_MEM_TYPE_HOST, req_dbg=req_dbg@entry=0xbc40940, uct_flags=<optimized out>, uct_flags@entry=0) at core/ucp_request.c:218
> #6  0x00002b9e64d3716b in ucp_request_send_buffer_reg (md_map=<optimized out>, req=0xbc40940) at /home_tin/bernadm/configuration/330_OFED/ucx-1.5.1/src/ucp/core/ucp_request.inl:343
> #7  ucp_tag_send_start_rndv (sreq=sreq@entry=0xbc40940) at tag/rndv.c:153
> #8  0x00002b9e64d3abb9 in ucp_tag_send_req (enable_zcopy=1, proto=0x2b9e64f569c0 <ucp_tag_eager_proto>, cb=0x2b9e64467350 <mca_pml_ucx_send_completion>, rndv_am_thresh=<optimized out>, rndv_rma_thresh=<optimized out>, msg_config=0xb3ea278, dt_count=8192, req=<optimized out>) at tag/tag_send.c:78
> #9  ucp_tag_send_nb (ep=<optimized out>, buffer=<optimized out>, count=8192, datatype=<optimized out>, tag=<optimized out>, cb=0x2b9e64467350 <mca_pml_ucx_send_completion>) at tag/tag_send.c:203
> #10 0x00002b9e64465fa6 in mca_pml_ucx_isend () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_pml_ucx.so
> #11 0x00002b9e52211900 in ompi_coll_base_bcast_intra_generic () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
> #12 0x00002b9e52211d4b in ompi_coll_base_bcast_intra_pipeline () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
> #13 0x00002b9e673bc384 in ompi_coll_tuned_bcast_intra_dec_fixed () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_coll_tuned.so
> #14 0x00002b9e521dbb79 in PMPI_Bcast () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
> #15 0x00002b9e51f623df in pmpi_bcast__ () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi_mpifh.so.40
>
> #0  0x00002b9e5303e160 in malloc () from /lib64/libc.so.6
> #1  0x00002b9e651ed684 in ucs_pgt_dir_alloc (pgtable=0xb341ab8) at datastruct/pgtable.c:69
> #2  ucs_pgtable_insert_page (region=0xc6919d0, order=12, address=47959585718272, pgtable=0xb341ab8) at datastruct/pgtable.c:299
> #3  ucs_pgtable_insert (pgtable=pgtable@entry=0xb341ab8, region=region@entry=0xc6919d0) at datastruct/pgtable.c:403
> #4  0x00002b9e651f35bc in ucs_rcache_create_region (region_p=0x7fff82806da0, arg=0x7fff82806d9c, prot=3, length=131072, address=0x2b9e76102070, rcache=0xb341a50) at sys/rcache.c:511
> #5  ucs_rcache_get (rcache=0xb341a50, address=0x2b9e76102070, length=131072, prot=prot@entry=3, arg=arg@entry=0x7fff82806d9c, region_p=region_p@entry=0x7fff82806da0) at sys/rcache.c:612
> #6  0x00002b9e64f7a3d4 in uct_ib_mem_rcache_reg (uct_md=<optimized out>, address=<optimized out>, length=<optimized out>, flags=96, memh_p=0xbc409b0) at ib/base/ib_md.c:990
> #7  0x00002b9e64d245e2 in ucp_mem_rereg_mds (context=<optimized out>, reg_md_map=4, address=address@entry=0x2b9e76102070, length=<optimized out>, uct_flags=uct_flags@entry=96, alloc_md=alloc_md@entry=0x0, mem_type=mem_type@entry=UCT_MD_MEM_TYPE_HOST, alloc_md_memh_p=alloc_md_memh_p@entry=0x0, uct_memh=uct_memh@entry=0xbc409b0, md_map_p=md_map_p@entry=0xbc409a8) at core/ucp_mm.c:100
> #8  0x00002b9e64d260f0 in ucp_request_memory_reg (context=0xb340800, md_map=4, buffer=0x2b9e76102070, length=131072, datatype=128, state=state@entry=0xbc409a0, mem_type=UCT_MD_MEM_TYPE_HOST, req_dbg=req_dbg@entry=0xbc40940, uct_flags=<optimized out>, uct_flags@entry=0) at core/ucp_request.c:218
> #9  0x00002b9e64d3716b in ucp_request_send_buffer_reg (md_map=<optimized out>, req=0xbc40940) at /home_tin/bernadm/configuration/330_OFED/ucx-1.5.1/src/ucp/core/ucp_request.inl:343
> #10 ucp_tag_send_start_rndv (sreq=sreq@entry=0xbc40940) at tag/rndv.c:153
> #11 0x00002b9e64d3abb9 in ucp_tag_send_req (enable_zcopy=1, proto=0x2b9e64f569c0 <ucp_tag_eager_proto>, cb=0x2b9e64467350 <mca_pml_ucx_send_completion>, rndv_am_thresh=<optimized out>, rndv_rma_thresh=<optimized out>, msg_config=0xb3ea278, dt_count=8192, req=<optimized out>) at tag/tag_send.c:78
> #12 ucp_tag_send_nb (ep=<optimized out>, buffer=<optimized out>, count=8192, datatype=<optimized out>, tag=<optimized out>, cb=0x2b9e64467350 <mca_pml_ucx_send_completion>) at tag/tag_send.c:203
> #13 0x00002b9e64465fa6 in mca_pml_ucx_isend () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_pml_ucx.so
> #14 0x00002b9e52211900 in ompi_coll_base_bcast_intra_generic () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
> #15 0x00002b9e52211d4b in ompi_coll_base_bcast_intra_pipeline () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
> #16 0x00002b9e673bc384 in ompi_coll_tuned_bcast_intra_dec_fixed () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_coll_tuned.so
> #17 0x00002b9e521dbb79 in PMPI_Bcast () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
> #18 0x00002b9e51f623df in pmpi_bcast__ () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi_mpifh.so.40
>
> #0  0x00002b9e5303e160 in malloc () from /lib64/libc.so.6
> #1  0x00002b9e651ed684 in ucs_pgt_dir_alloc (pgtable=0xb341ab8) at datastruct/pgtable.c:69
> #2  ucs_pgtable_insert_page (region=0xc6919d0, order=4, address=47959585726464, pgtable=0xb341ab8) at datastruct/pgtable.c:299
> #3  ucs_pgtable_insert (pgtable=pgtable@entry=0xb341ab8, region=region@entry=0xc6919d0) at datastruct/pgtable.c:403
> #4  0x00002b9e651f35bc in ucs_rcache_create_region (region_p=0x7fff82806da0, arg=0x7fff82806d9c, prot=3, length=131072, address=0x2b9e76102070, rcache=0xb341a50) at sys/rcache.c:511
> #5  ucs_rcache_get (rcache=0xb341a50, address=0x2b9e76102070, length=131072, prot=prot@entry=3, arg=arg@entry=0x7fff82806d9c, region_p=region_p@entry=0x7fff82806da0) at sys/rcache.c:612
> #6  0x00002b9e64f7a3d4 in uct_ib_mem_rcache_reg (uct_md=<optimized out>, address=<optimized out>, length=<optimized out>, flags=96, memh_p=0xbc409b0) at ib/base/ib_md.c:990
> #7  0x00002b9e64d245e2 in ucp_mem_rereg_mds (context=<optimized out>, reg_md_map=4, address=address@entry=0x2b9e76102070, length=<optimized out>, uct_flags=uct_flags@entry=96, alloc_md=alloc_md@entry=0x0, mem_type=mem_type@entry=UCT_MD_MEM_TYPE_HOST, alloc_md_memh_p=alloc_md_memh_p@entry=0x0, uct_memh=uct_memh@entry=0xbc409b0, md_map_p=md_map_p@entry=0xbc409a8) at core/ucp_mm.c:100
> #8  0x00002b9e64d260f0 in ucp_request_memory_reg (context=0xb340800, md_map=4, buffer=0x2b9e76102070, length=131072, datatype=128, state=state@entry=0xbc409a0, mem_type=UCT_MD_MEM_TYPE_HOST, req_dbg=req_dbg@entry=0xbc40940, uct_flags=<optimized out>, uct_flags@entry=0) at core/ucp_request.c:218
> #9  0x00002b9e64d3716b in ucp_request_send_buffer_reg (md_map=<optimized out>, req=0xbc40940) at /home_tin/bernadm/configuration/330_OFED/ucx-1.5.1/src/ucp/core/ucp_request.inl:343
> #10 ucp_tag_send_start_rndv (sreq=sreq@entry=0xbc40940) at tag/rndv.c:153
> #11 0x00002b9e64d3abb9 in ucp_tag_send_req (enable_zcopy=1, proto=0x2b9e64f569c0 <ucp_tag_eager_proto>, cb=0x2b9e64467350 <mca_pml_ucx_send_completion>, rndv_am_thresh=<optimized out>, rndv_rma_thresh=<optimized out>, msg_config=0xb3ea278, dt_count=8192, req=<optimized out>) at tag/tag_send.c:78
> #12 ucp_tag_send_nb (ep=<optimized out>, buffer=<optimized out>, count=8192, datatype=<optimized out>, tag=<optimized out>, cb=0x2b9e64467350 <mca_pml_ucx_send_completion>) at tag/tag_send.c:203
> #13 0x00002b9e64465fa6 in mca_pml_ucx_isend () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_pml_ucx.so
> #14 0x00002b9e52211900 in ompi_coll_base_bcast_intra_generic () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
> #15 0x00002b9e52211d4b in ompi_coll_base_bcast_intra_pipeline () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
> #16 0x00002b9e673bc384 in ompi_coll_tuned_bcast_intra_dec_fixed () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_coll_tuned.so
> #17 0x00002b9e521dbb79 in PMPI_Bcast () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
> #18 0x00002b9e51f623df in pmpi_bcast__ () from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi_mpifh.so.40
> #19 0x000000000040d442 in m_bcast_z_from (comm=..., vec=..., n=55826, inode=2) at mpi.F:1781
>
> Noam
>
> Noam Bernstein, Ph.D.
> Center for Materials Physics and Technology
> U.S. Naval Research Laboratory
> T +1 202 404 8628  F +1 202 404 7546
> https://www.nrl.navy.mil
>
> _______________________________________________
> users mailing list
> users@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/users