*Bump* There doesn't seem to have been any progress on this. Can you at least add an error message saying that Open MPI one-sided does not work with datatypes, instead of silently causing wanton corruption and deadlock?
On Thu, Dec 22, 2011 at 4:17 PM, Jed Brown <j...@59a2.org> wrote:
> [Forgot the attachment.]
>
> On Thu, Dec 22, 2011 at 15:16, Jed Brown <j...@59a2.org> wrote:
>> I wrote a new communication layer that we are evaluating for use in mesh
>> management and PDE solvers. It is based on MPI-2 one-sided operations
>> (and will eventually benefit from some of the MPI-3 one-sided proposals,
>> especially MPI_Fetch_and_op() and dynamic windows). All the basic
>> functionality works well with MPICH2, but I have run into some Open MPI
>> bugs regarding one-sided operations with composite datatypes. This email
>> provides a reduced test case for two such bugs. I see that there are also
>> some existing serious-looking bug reports regarding one-sided operations,
>> but they are getting pretty old now and haven't seen action in a while:
>>
>> https://svn.open-mpi.org/trac/ompi/ticket/2656
>> https://svn.open-mpi.org/trac/ompi/ticket/1905
>>
>> Is there a plan for resolving these in the near future?
>>
>> Is anyone using Open MPI for serious work with one-sided operations?
>>
>> Bugs I am reporting:
>>
>> *1.* If an MPI_Win is used with an MPI_Datatype, I get an invalid free
>> when MPI_Type_free() is called before MPI_Win_free(), even if the MPI_Win
>> operation has completed. Since MPI_Type_free() is only supposed to mark
>> the datatype for deletion, the implementation should properly manage
>> reference counting.
>> If you run the attached code with
>>
>> $ mpiexec -n 2 ./a.out 1
>>
>> (which only does part of the communication described for the second bug,
>> below), you can see the invalid free on rank 1, with the stack still in
>> MPI_Win_fence():
>>
>> (gdb) bt
>> #0 0x00007ffff7288905 in raise () from /lib/libc.so.6
>> #1 0x00007ffff7289d7b in abort () from /lib/libc.so.6
>> #2 0x00007ffff72c147e in __libc_message () from /lib/libc.so.6
>> #3 0x00007ffff72c7396 in malloc_printerr () from /lib/libc.so.6
>> #4 0x00007ffff72cb26c in free () from /lib/libc.so.6
>> #5 0x00007ffff7a5aaa8 in ompi_datatype_release_args (pData=0x845010) at ompi_datatype_args.c:414
>> #6 0x00007ffff7a5b0ea in __ompi_datatype_release (datatype=0x845010) at ompi_datatype_create.c:47
>> #7 0x00007ffff218e772 in opal_obj_run_destructors (object=0x845010) at ../../../../opal/class/opal_object.h:448
>> #8 ompi_osc_rdma_replyreq_free (replyreq=0x680a80) at osc_rdma_replyreq.h:136
>> #9 ompi_osc_rdma_replyreq_send_cb (btl=0x7ffff3680ce0, endpoint=<optimized out>, descriptor=0x837b00, status=<optimized out>) at osc_rdma_data_move.c:691
>> #10 0x00007ffff347f38f in mca_btl_sm_component_progress () at btl_sm_component.c:645
>> #11 0x00007ffff7b1f80a in opal_progress () at runtime/opal_progress.c:207
>> #12 0x00007ffff21977c5 in opal_condition_wait (m=<optimized out>, c=0x842ee0) at ../../../../opal/threads/condition.h:99
>> #13 ompi_osc_rdma_module_fence (assert=0, win=0x842270) at osc_rdma_sync.c:207
>> #14 0x00007ffff7a89db5 in PMPI_Win_fence (assert=0, win=0x842270) at pwin_fence.c:60
>> #15 0x00000000004010d8 in main (argc=2, argv=0x7fffffffd508) at win.c:60
>>
>> Meanwhile, rank 0 has already freed the datatype and is waiting in
>> MPI_Win_free():
>> (gdb) bt
>> #0 0x00007ffff7312107 in sched_yield () from /lib/libc.so.6
>> #1 0x00007ffff7b1f82b in opal_progress () at runtime/opal_progress.c:220
>> #2 0x00007ffff7a53fe4 in opal_condition_wait (m=<optimized out>, c=<optimized out>) at ../opal/threads/condition.h:99
>> #3 ompi_request_default_wait_all (count=2, requests=0x7fffffffd210, statuses=0x7fffffffd1e0) at request/req_wait.c:263
>> #4 0x00007ffff25b8d71 in ompi_coll_tuned_sendrecv_actual (sendbuf=0x0, scount=0, sdatatype=0x7ffff7dba840, dest=1, stag=-16, recvbuf=<optimized out>, rcount=0, rdatatype=0x7ffff7dba840, source=1, rtag=-16, comm=0x8431a0, status=0x0) at coll_tuned_util.c:54
>> #5 0x00007ffff25c2de2 in ompi_coll_tuned_barrier_intra_two_procs (comm=<optimized out>, module=<optimized out>) at coll_tuned_barrier.c:256
>> #6 0x00007ffff25b92ab in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x8431a0, module=0x844980) at coll_tuned_decision_fixed.c:190
>> #7 0x00007ffff2186248 in ompi_osc_rdma_module_free (win=0x842170) at osc_rdma.c:46
>> #8 0x00007ffff7a58a44 in ompi_win_free (win=0x842170) at win/win.c:150
>> #9 0x00007ffff7a8a0dd in PMPI_Win_free (win=0x7fffffffd408) at pwin_free.c:56
>> #10 0x0000000000401195 in main (argc=2, argv=0x7fffffffd508) at win.c:69
>>
>> *2.* This appears to be more fundamental and perhaps much harder to fix.
>> The attached code sets up the following graph:
>>
>> rank 0:
>> 0 -> (1,0)
>> 1 -> nothing
>> 2 -> (1,1)
>>
>> rank 1:
>> 0 -> (0,0)
>> 1 -> (0,2)
>> 2 -> (0,1)
>>
>> We pull over this graph using two calls to MPI_Get(), each with composite
>> datatypes defining what to pull into the first two slots and what to put
>> into the third slot.
>> It is Valgrind-clean with MPICH2 and produces the following:
>>
>> $ mpiexec.hydra -n 2 ./a.out 2
>> [0] provided [100,101,102] got [200, -2,201]
>> [1] provided [200,201,202] got [100,102,101]
>>
>> With Open MPI, I see
>>
>> a.out: malloc.c:3096: sYSMALLOc: Assertion `(old_top == (((mbinptr) (((char *) &((av)->bins[((1) - 1) * 2])) - __builtin_offsetof (struct malloc_chunk, fd)))) && old_size == 0) || ((unsigned long) (old_size) >= (unsigned long)((((__builtin_offsetof (struct malloc_chunk, fd_nextsize))+((2 * (sizeof(size_t))) - 1)) & ~((2 * (sizeof(size_t))) - 1))) && ((old_top)->size & 0x1) && ((unsigned long)old_end & pagemask) == 0)' failed.
>>
>> on both ranks, with rank 0 at
>>
>> (gdb) bt
>> #0 0x00007ffff7288905 in raise () from /lib/libc.so.6
>> #1 0x00007ffff7289d7b in abort () from /lib/libc.so.6
>> #2 0x00007ffff72c675d in __malloc_assert () from /lib/libc.so.6
>> #3 0x00007ffff72c96d3 in _int_malloc () from /lib/libc.so.6
>> #4 0x00007ffff72cad5d in malloc () from /lib/libc.so.6
>> #5 0x00007ffff7b46c46 in opal_free_list_grow (flist=0x7ffff239f150, num_elements=1) at class/opal_free_list.c:93
>> #6 0x00007ffff2196152 in ompi_osc_rdma_replyreq_alloc (replyreq=0x7fffffffd0f8, origin_rank=1, module=0x842d10) at osc_rdma_replyreq.h:82
>> #7 ompi_osc_rdma_replyreq_alloc_init (module=0x842d10, origin=1, origin_request=..., target_displacement=0, target_count=1, datatype=0x8455b0, replyreq=0x7fffffffd0f8) at osc_rdma_replyreq.c:40
>> #8 0x00007ffff218c051 in component_fragment_cb (btl=0x7ffff3680ce0, tag=<optimized out>, descriptor=<optimized out>, cbdata=<optimized out>) at osc_rdma_component.c:633
>> #9 0x00007ffff347f25f in mca_btl_sm_component_progress () at btl_sm_component.c:623
>> #10 0x00007ffff7b1f80a in opal_progress () at runtime/opal_progress.c:207
>> #11 0x00007ffff21977c5 in opal_condition_wait (m=<optimized out>, c=0x842de0) at ../../../../opal/threads/condition.h:99
>> #12 ompi_osc_rdma_module_fence (assert=0, win=0x842170) at osc_rdma_sync.c:207
>> #13 0x00007ffff7a89db5 in PMPI_Win_fence (assert=0, win=0x842170) at pwin_fence.c:60
>> #14 0x00000000004010d8 in main (argc=2, argv=0x7fffffffd508) at win.c:60
>>
>> and rank 1 at
>>
>> (gdb) bt
>> #0 0x00007ffff7288905 in raise () from /lib/libc.so.6
>> #1 0x00007ffff7289d7b in abort () from /lib/libc.so.6
>> #2 0x00007ffff72c675d in __malloc_assert () from /lib/libc.so.6
>> #3 0x00007ffff72c96d3 in _int_malloc () from /lib/libc.so.6
>> #4 0x00007ffff72cad5d in malloc () from /lib/libc.so.6
>> #5 0x00007ffff7a5b3ce in opal_obj_new (cls=0x7ffff7db2060) at ../../opal/class/opal_object.h:469
>> #6 opal_obj_new_debug (line=71, file=0x7ffff7b60323 "ompi_datatype_create.c", type=0x7ffff7db2060) at ../../opal/class/opal_object.h:251
>> #7 ompi_datatype_create (expectedSize=3) at ompi_datatype_create.c:71
>> #8 0x00007ffff7a5b7e9 in ompi_datatype_create_indexed_block (count=1, bLength=1, pDisp=0x7fffee18a834, oldType=0x7ffff7db3640, newType=0x7fffffffd070) at ompi_datatype_create_indexed.c:124
>> #9 0x00007ffff7a5a349 in __ompi_datatype_create_from_args (type=9, d=0x844f40, a=0x7fffee18a828, i=0x7fffee18a82c) at ompi_datatype_args.c:691
>> #10 __ompi_datatype_create_from_packed_description (packed_buffer=0x7fffffffd108, remote_processor=0x652b90) at ompi_datatype_args.c:626
>> #11 0x00007ffff7a5b045 in ompi_datatype_create_from_packed_description (packed_buffer=<optimized out>, remote_processor=<optimized out>) at ompi_datatype_args.c:779
>> #12 0x00007ffff218bf60 in ompi_osc_base_datatype_create (payload=0x7fffffffd108, remote_proc=<optimized out>) at ../../../../ompi/mca/osc/base/osc_base_obj_convert.h:52
>> #13 component_fragment_cb (btl=0x7ffff3680ce0, tag=<optimized out>, descriptor=<optimized out>, cbdata=<optimized out>) at osc_rdma_component.c:624
>> #14 0x00007ffff347f25f in mca_btl_sm_component_progress () at btl_sm_component.c:623
>> #15 0x00007ffff7b1f80a in opal_progress () at runtime/opal_progress.c:207
>> #16 0x00007ffff21977c5 in opal_condition_wait (m=<optimized out>, c=0x842ee0) at ../../../../opal/threads/condition.h:99
>> #17 ompi_osc_rdma_module_fence (assert=0, win=0x842270) at osc_rdma_sync.c:207
>> #18 0x00007ffff7a89db5 in PMPI_Win_fence (assert=0, win=0x842270) at pwin_fence.c:60
>> #19 0x00000000004010d8 in main (argc=2, argv=0x7fffffffd508) at win.c:60
>>
>> This looks like memory corruption, but Open MPI internals are too noisy
>> under Valgrind for it to be obvious where to look. This is with Open MPI
>> 1.5.4, but I observed the same thing with trunk. If I run with three
>> processes, the graph is slightly different and only ranks 1 and 2 error
>> (rank 0 hangs).