Hi, Noam

Can you try your original command line with the following addition:

mpirun --mca pml ucx --mca btl ^vader,tcp,openib --mca osc ucx

I think we're seeing some conflict between UCX PML and UCT OSC.

Josh

On Wed, Jun 19, 2019 at 4:36 PM Noam Bernstein via users <users@lists.open-mpi.org> wrote:

> On Jun 19, 2019, at 2:44 PM, George Bosilca <bosi...@icl.utk.edu> wrote:
>
> To completely disable UCX you need to disable the UCX PML and not only the
> BTL. I would use "--mca pml ob1 --mca btl ^ucx --mca btl_openib_allow_ib 1".
>
>
> Thanks for the pointer.  Disabling ucx this way _does_ seem to fix the
> memory issue.  That’s a very helpful workaround, if nothing else.
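>
> For the record, the full working invocation was along these lines (a
> sketch; the process count and executable name are placeholders):
>
>   # -np 16 and ./my_app below are placeholders, not the actual job
>   mpirun -np 16 --mca pml ob1 --mca btl ^ucx --mca btl_openib_allow_ib 1 ./my_app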
>
> Using ucx 1.5.1 downloaded from the ucx web site (just by inserting it into
> LD_LIBRARY_PATH at runtime, without recompiling openmpi) doesn’t seem to
> fix the problem.
>
>
> As you have a gdb session on the processes, you can try to break on some of
> the memory allocation functions (malloc, realloc, calloc).
>
>
> Good idea.  I set breakpoints on all 3 of those, then did “c” 3 times.
> Does this mean anything to anyone?  I’m investigating the upstream calls
> (not included below) that generate these calls to mpi_bcast, but given that
> it works on other types of nodes, I doubt those are problematic.
>
> #0  0x00002b9e5303e160 in malloc () from /lib64/libc.so.6
> #1  0x00002b9e651f358a in ucs_rcache_create_region
> (region_p=0x7fff82806da0, arg=0x7fff82806d9c, prot=3, length=131072,
> address=0x2b9e76102070, rcache=0xb341a50) at sys/rcache.c:500
> #2  ucs_rcache_get (rcache=0xb341a50, address=0x2b9e76102070,
> length=131072, prot=prot@entry=3, arg=arg@entry=0x7fff82806d9c,
> region_p=region_p@entry=0x7fff82806da0) at sys/rcache.c:612
> #3  0x00002b9e64f7a3d4 in uct_ib_mem_rcache_reg (uct_md=<optimized out>,
> address=<optimized out>, length=<optimized out>, flags=96,
> memh_p=0xbc409b0) at ib/base/ib_md.c:990
> #4  0x00002b9e64d245e2 in ucp_mem_rereg_mds (context=<optimized out>,
> reg_md_map=4, address=address@entry=0x2b9e76102070, length=<optimized
> out>, uct_flags=uct_flags@entry=96,
>     alloc_md=alloc_md@entry=0x0, mem_type=mem_type@entry
> =UCT_MD_MEM_TYPE_HOST, alloc_md_memh_p=alloc_md_memh_p@entry=0x0,
> uct_memh=uct_memh@entry=0xbc409b0, md_map_p=md_map_p@entry=0xbc409a8)
>     at core/ucp_mm.c:100
> #5  0x00002b9e64d260f0 in ucp_request_memory_reg (context=0xb340800,
> md_map=4, buffer=0x2b9e76102070, length=131072, datatype=128,
> state=state@entry=0xbc409a0, mem_type=UCT_MD_MEM_TYPE_HOST,
>     req_dbg=req_dbg@entry=0xbc40940, uct_flags=<optimized out>,
> uct_flags@entry=0) at core/ucp_request.c:218
> #6  0x00002b9e64d3716b in ucp_request_send_buffer_reg (md_map=<optimized
> out>, req=0xbc40940)
> at 
> /home_tin/bernadm/configuration/330_OFED/ucx-1.5.1/src/ucp/core/ucp_request.inl:343
> #7  ucp_tag_send_start_rndv (sreq=sreq@entry=0xbc40940) at tag/rndv.c:153
> #8  0x00002b9e64d3abb9 in ucp_tag_send_req (enable_zcopy=1,
> proto=0x2b9e64f569c0 <ucp_tag_eager_proto>, cb=0x2b9e64467350
> <mca_pml_ucx_send_completion>, rndv_am_thresh=<optimized out>,
>     rndv_rma_thresh=<optimized out>, msg_config=0xb3ea278, dt_count=8192,
> req=<optimized out>) at tag/tag_send.c:78
> #9  ucp_tag_send_nb (ep=<optimized out>, buffer=<optimized out>,
> count=8192, datatype=<optimized out>, tag=<optimized out>,
> cb=0x2b9e64467350 <mca_pml_ucx_send_completion>) at tag/tag_send.c:203
> #10 0x00002b9e64465fa6 in mca_pml_ucx_isend ()
> from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_pml_ucx.so
> #11 0x00002b9e52211900 in ompi_coll_base_bcast_intra_generic ()
> from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
> #12 0x00002b9e52211d4b in ompi_coll_base_bcast_intra_pipeline ()
> from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
> #13 0x00002b9e673bc384 in ompi_coll_tuned_bcast_intra_dec_fixed ()
> from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_coll_tuned.so
> #14 0x00002b9e521dbb79 in PMPI_Bcast () from
> /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
> #15 0x00002b9e51f623df in pmpi_bcast__ () from
> /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi_mpifh.so.40
>
>
> #0  0x00002b9e5303e160 in malloc () from /lib64/libc.so.6
> #1  0x00002b9e651ed684 in ucs_pgt_dir_alloc (pgtable=0xb341ab8) at
> datastruct/pgtable.c:69
> #2  ucs_pgtable_insert_page (region=0xc6919d0, order=12,
> address=47959585718272, pgtable=0xb341ab8) at datastruct/pgtable.c:299
> #3  ucs_pgtable_insert (pgtable=pgtable@entry=0xb341ab8,
> region=region@entry=0xc6919d0) at datastruct/pgtable.c:403
> #4  0x00002b9e651f35bc in ucs_rcache_create_region
> (region_p=0x7fff82806da0, arg=0x7fff82806d9c, prot=3, length=131072,
> address=0x2b9e76102070, rcache=0xb341a50) at sys/rcache.c:511
> #5  ucs_rcache_get (rcache=0xb341a50, address=0x2b9e76102070,
> length=131072, prot=prot@entry=3, arg=arg@entry=0x7fff82806d9c,
> region_p=region_p@entry=0x7fff82806da0) at sys/rcache.c:612
> #6  0x00002b9e64f7a3d4 in uct_ib_mem_rcache_reg (uct_md=<optimized out>,
> address=<optimized out>, length=<optimized out>, flags=96,
> memh_p=0xbc409b0) at ib/base/ib_md.c:990
> #7  0x00002b9e64d245e2 in ucp_mem_rereg_mds (context=<optimized out>,
> reg_md_map=4, address=address@entry=0x2b9e76102070, length=<optimized
> out>, uct_flags=uct_flags@entry=96,
>     alloc_md=alloc_md@entry=0x0, mem_type=mem_type@entry
> =UCT_MD_MEM_TYPE_HOST, alloc_md_memh_p=alloc_md_memh_p@entry=0x0,
> uct_memh=uct_memh@entry=0xbc409b0, md_map_p=md_map_p@entry=0xbc409a8)
>     at core/ucp_mm.c:100
> #8  0x00002b9e64d260f0 in ucp_request_memory_reg (context=0xb340800,
> md_map=4, buffer=0x2b9e76102070, length=131072, datatype=128,
> state=state@entry=0xbc409a0, mem_type=UCT_MD_MEM_TYPE_HOST,
>     req_dbg=req_dbg@entry=0xbc40940, uct_flags=<optimized out>,
> uct_flags@entry=0) at core/ucp_request.c:218
> #9  0x00002b9e64d3716b in ucp_request_send_buffer_reg (md_map=<optimized
> out>, req=0xbc40940)
> at 
> /home_tin/bernadm/configuration/330_OFED/ucx-1.5.1/src/ucp/core/ucp_request.inl:343
> #10 ucp_tag_send_start_rndv (sreq=sreq@entry=0xbc40940) at tag/rndv.c:153
> #11 0x00002b9e64d3abb9 in ucp_tag_send_req (enable_zcopy=1,
> proto=0x2b9e64f569c0 <ucp_tag_eager_proto>, cb=0x2b9e64467350
> <mca_pml_ucx_send_completion>, rndv_am_thresh=<optimized out>,
>     rndv_rma_thresh=<optimized out>, msg_config=0xb3ea278, dt_count=8192,
> req=<optimized out>) at tag/tag_send.c:78
> #12 ucp_tag_send_nb (ep=<optimized out>, buffer=<optimized out>,
> count=8192, datatype=<optimized out>, tag=<optimized out>,
> cb=0x2b9e64467350 <mca_pml_ucx_send_completion>) at tag/tag_send.c:203
> #13 0x00002b9e64465fa6 in mca_pml_ucx_isend ()
> from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_pml_ucx.so
> #14 0x00002b9e52211900 in ompi_coll_base_bcast_intra_generic ()
> from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
> #15 0x00002b9e52211d4b in ompi_coll_base_bcast_intra_pipeline ()
> from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
> #16 0x00002b9e673bc384 in ompi_coll_tuned_bcast_intra_dec_fixed ()
> from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_coll_tuned.so
> #17 0x00002b9e521dbb79 in PMPI_Bcast () from
> /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
> #18 0x00002b9e51f623df in pmpi_bcast__ () from
> /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi_mpifh.so.40
>
> #0  0x00002b9e5303e160 in malloc () from /lib64/libc.so.6
> #1  0x00002b9e651ed684 in ucs_pgt_dir_alloc (pgtable=0xb341ab8) at
> datastruct/pgtable.c:69
> #2  ucs_pgtable_insert_page (region=0xc6919d0, order=4,
> address=47959585726464, pgtable=0xb341ab8) at datastruct/pgtable.c:299
> #3  ucs_pgtable_insert (pgtable=pgtable@entry=0xb341ab8,
> region=region@entry=0xc6919d0) at datastruct/pgtable.c:403
> #4  0x00002b9e651f35bc in ucs_rcache_create_region
> (region_p=0x7fff82806da0, arg=0x7fff82806d9c, prot=3, length=131072,
> address=0x2b9e76102070, rcache=0xb341a50) at sys/rcache.c:511
> #5  ucs_rcache_get (rcache=0xb341a50, address=0x2b9e76102070,
> length=131072, prot=prot@entry=3, arg=arg@entry=0x7fff82806d9c,
> region_p=region_p@entry=0x7fff82806da0) at sys/rcache.c:612
> #6  0x00002b9e64f7a3d4 in uct_ib_mem_rcache_reg (uct_md=<optimized out>,
> address=<optimized out>, length=<optimized out>, flags=96,
> memh_p=0xbc409b0) at ib/base/ib_md.c:990
> #7  0x00002b9e64d245e2 in ucp_mem_rereg_mds (context=<optimized out>,
> reg_md_map=4, address=address@entry=0x2b9e76102070, length=<optimized
> out>, uct_flags=uct_flags@entry=96,
>     alloc_md=alloc_md@entry=0x0, mem_type=mem_type@entry
> =UCT_MD_MEM_TYPE_HOST, alloc_md_memh_p=alloc_md_memh_p@entry=0x0,
> uct_memh=uct_memh@entry=0xbc409b0, md_map_p=md_map_p@entry=0xbc409a8)
>     at core/ucp_mm.c:100
> #8  0x00002b9e64d260f0 in ucp_request_memory_reg (context=0xb340800,
> md_map=4, buffer=0x2b9e76102070, length=131072, datatype=128,
> state=state@entry=0xbc409a0, mem_type=UCT_MD_MEM_TYPE_HOST,
>     req_dbg=req_dbg@entry=0xbc40940, uct_flags=<optimized out>,
> uct_flags@entry=0) at core/ucp_request.c:218
> #9  0x00002b9e64d3716b in ucp_request_send_buffer_reg (md_map=<optimized
> out>, req=0xbc40940)
> at 
> /home_tin/bernadm/configuration/330_OFED/ucx-1.5.1/src/ucp/core/ucp_request.inl:343
> #10 ucp_tag_send_start_rndv (sreq=sreq@entry=0xbc40940) at tag/rndv.c:153
> #11 0x00002b9e64d3abb9 in ucp_tag_send_req (enable_zcopy=1,
> proto=0x2b9e64f569c0 <ucp_tag_eager_proto>, cb=0x2b9e64467350
> <mca_pml_ucx_send_completion>, rndv_am_thresh=<optimized out>,
>     rndv_rma_thresh=<optimized out>, msg_config=0xb3ea278, dt_count=8192,
> req=<optimized out>) at tag/tag_send.c:78
> #12 ucp_tag_send_nb (ep=<optimized out>, buffer=<optimized out>,
> count=8192, datatype=<optimized out>, tag=<optimized out>,
> cb=0x2b9e64467350 <mca_pml_ucx_send_completion>) at tag/tag_send.c:203
> #13 0x00002b9e64465fa6 in mca_pml_ucx_isend ()
> from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_pml_ucx.so
> #14 0x00002b9e52211900 in ompi_coll_base_bcast_intra_generic ()
> from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
> #15 0x00002b9e52211d4b in ompi_coll_base_bcast_intra_pipeline ()
> from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
> #16 0x00002b9e673bc384 in ompi_coll_tuned_bcast_intra_dec_fixed ()
> from /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/openmpi/mca_coll_tuned.so
> #17 0x00002b9e521dbb79 in PMPI_Bcast () from
> /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi.so.40
> #18 0x00002b9e51f623df in pmpi_bcast__ () from
> /share/apps/mpi/openmpi/4.0.1/ib/gnu/lib/libmpi_mpifh.so.40
> #19 0x000000000040d442 in m_bcast_z_from (comm=..., vec=..., n=55826,
> inode=2) at mpi.F:1781
>
>
>
> Noam
>
> Noam Bernstein, Ph.D.
> Center for Materials Physics and Technology
> U.S. Naval Research Laboratory
> T +1 202 404 8628  F +1 202 404 7546
> https://www.nrl.navy.mil
>