At least with 1.1.4, I'm having a heck of a time enabling multi-threading. Configuring with --with-threads=posix --enable-mpi-threads --enable-progress-threads leads to mpirun just hanging, even when not launching an MPI app (e.g. mpirun -np 1 hostname), and I can't Ctrl-C to kill it; I have to kill -9 it. Removing progress thread support (--enable-progress-threads) results in the same behavior. Removing --enable-mpi-threads gets mpirun working again, but without the thread protection I need.

What is the status of multi-thread support? From my reading of the mailing lists, it looks like it's still largely untested. We actually have an application that would be much easier to deal with if two threads in a process could both use MPI. Funneling everything through a single thread creates a locking nightmare, and generally means we will be forced to spin, checking an MPI_Irecv and the status of a data structure, instead of having one thread happily sitting on a blocking receive while the other watches the data structure. That basically means pissing away a processor that we could be using to do something useful. (We are basically doing a simplified version of DSM and we need to respond to remote data requests.)

At the moment, when running without threading support enabled, things seem mostly happy as long as we only post a receive on a single thread, except when one thread sends to another thread in the same process that has posted a receive. Under TCP, the send fails with:

*** An error occurred in MPI_Send
*** on communicator MPI_COMM_WORLD
*** MPI_ERR_INTERN: internal error
*** MPI_ERRORS_ARE_FATAL (goodbye)
[0,0,0]-[0,1,0] mca_oob_tcp_msg_recv: readv failed with errno=104
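The errno=104 in the last line is, on Linux, ECONNRESET (connection reset by peer), which usually means the other end of the socket closed or lost the connection; here it is plausibly a consequence of the abort rather than its cause. A minimal sketch for decoding such a value, assuming a Linux/glibc system (errno numbers differ across platforms) and using hypothetical helper names:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Symbolic name for a few socket-related errno values; not exhaustive. */
static const char *errno_symbol(int err)
{
    if (err == ECONNRESET) return "ECONNRESET";
    if (err == EPIPE)      return "EPIPE";
    if (err == ETIMEDOUT)  return "ETIMEDOUT";
    return "(other)";
}

/* Print the number, symbolic name, and system message for an errno value. */
static void describe_errno(int err)
{
    printf("errno=%d %s: %s\n", err, errno_symbol(err), strerror(err));
}
```

Calling describe_errno(104) prints the symbolic name together with the system's own message for that value.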

Under SM, the results are undefined.

Obviously I'm playing fast and loose, which is why I'm attempting to get threading support to work, to see if it solves the headaches. If you really want to have some fun, have a posted MPI_Recv on one thread and issue an MPI_Barrier on the other (with SM):

Signal:11 info.si_errno:0(Success) si_code:1(SEGV_MAPERR)
Failing at addr:0x1c
[0] func:/usr/lib/libopal.so.0 [0xc030f4]
[1] func:/lib/tls/libpthread.so.0 [0x46f93890]
[2] func:/usr/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_match+0xb08) [0x14ec38]
[3] func:/usr/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_recv_frag_callback+0x2f9) [0x14f7e9]
[4] func:/usr/lib/openmpi/mca_btl_sm.so(mca_btl_sm_component_progress+0xa87) [0x806c07]
[5] func:/usr/lib/openmpi/mca_bml_r2.so(mca_bml_r2_progress+0x39) [0x510c69]
[6] func:/usr/lib/libopal.so.0(opal_progress+0x69) [0xbecc39]
[7] func:/usr/lib/openmpi/mca_pml_ob1.so(mca_pml_ob1_send+0x785) [0x14d675]
[8] func:/usr/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_sendrecv_actual_localcompleted+0x8c) [0x5cc3fc]
[9] func:/usr/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_two_procs+0x76) [0x5ceef6]
[10] func:/usr/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_dec_fixed+0x38) [0x5cc638]
[11] func:/usr/lib/libmpi.so.0(PMPI_Barrier+0xe9) [0x29a1b9]

-Mike
