> On Thu, 2 Aug 2012 10:25:53 -0400
> Jeff Squyres <jsquy...@cisco.com> wrote:
>
> > On Aug 1, 2012, at 9:44 AM, Christopher Yeoh wrote:
> >
> > (gdb) bt
> > #0  0x0000008039720d6c in .pthread_cond_wait ()
> >     from /lib64/power6/libpthread.so.0
> > #1  0x00000400001299d8 in opal_condition_wait (c=0x400004763f8,
> >     m=0x40000476460)
> >     at ../../ompi-trunk.chris2/opal/threads/condition.h:79
> > #2  0x000004000012a08c in ompi_request_default_wait_all (count=2,
> >     requests=0xfffffa9db20, statuses=0x0)
> >     at ../../ompi-trunk.chris2/ompi/request/req_wait.c:281
> > #3  0x000004000012f56c in ompi_init_preconnect_mpi ()
> >     at ../../ompi-trunk.chris2/ompi/runtime/ompi_mpi_preconnect.c:72
> > #4  0x000004000012c738 in ompi_mpi_init (argc=1, argv=0xfffffa9f278,
> >     requested=3, provided=0xfffffa9edd8)
> >     at ../../ompi-trunk.chris2/ompi/runtime/ompi_mpi_init.c:800
> > #5  0x000004000017a064 in PMPI_Init_thread (argc=0xfffffa9ee20,
> >     argv=0xfffffa9ee28, required=3, provided=0xfffffa9edd8)
> >     at pinit_thread.c:84
> > #6  0x0000000010000ae4 in main (argc=1, argv=0xfffffa9f278)
> >     at test2.c:15
> >
> > (neither of the requests are received so presumably messages are
> > getting lost).
> >
> > In contrast if you run against the exact same build of OMPI with
> > pretty much the same test program but do "MPI_Init(&argc, &argv)"
> > then it works fine.
I think I've worked out what is going on. The difference between 1.6 and trunk is in the #ifdefs in opal/threads/condition.h and how they are set by configure. It's a bit complicated because some renaming was done between 1.6 and trunk.

An excerpt from opal/threads/condition.h in 1.6 (inside opal_condition_wait):

    if (opal_using_threads()) {
    #if OPAL_HAVE_POSIX_THREADS && OPAL_ENABLE_PROGRESS_THREADS
        rc = pthread_cond_wait(&c->c_cond, &m->m_lock_pthread);
    #elif OPAL_HAVE_SOLARIS_THREADS && OPAL_ENABLE_PROGRESS_THREADS
        rc = cond_wait(&c->c_cond, &m->m_lock_solaris);
    #else
        if (c->c_signaled) {
            c->c_waiting--;
            opal_mutex_unlock(m);
            opal_progress();

and the same spot in trunk:

    if (opal_using_threads()) {
    #if OPAL_HAVE_POSIX_THREADS && OPAL_ENABLE_MULTI_THREADS
        rc = pthread_cond_wait(&c->c_cond, &m->m_lock_pthread);
    #elif OPAL_HAVE_SOLARIS_THREADS && OPAL_ENABLE_MULTI_THREADS
        rc = cond_wait(&c->c_cond, &m->m_lock_solaris);
    #else
        if (c->c_signaled) {
            c->c_waiting--;
            opal_mutex_unlock(m);
            opal_progress();

In 1.6, OPAL_ENABLE_PROGRESS_THREADS is hardcoded off by configure. So even with MPI threads enabled, when we are in ompi_request_default_wait_all and call opal_condition_wait, we still end up calling opal_progress. In trunk, OPAL_ENABLE_MULTI_THREADS is set to 1 if MPI threads are enabled. Note that OPAL_ENABLE_MULTI_THREADS also exists in 1.6 and is likewise set to 1 if MPI threads are enabled, but as can be seen above it is not used to control how opal_condition_wait behaves.

So in trunk, when MPI_THREAD_MULTIPLE is requested at init time, the pthread_cond_wait path is taken. MPI programs get stuck because the main thread sits in pthread_cond_wait and there appears to be no one around to call opal_progress. I've looked around in the OMPI code for where a thread should be spawned to service opal_progress, but I haven't been able to find it.

Between 1.6 and trunk, OPAL_ENABLE_PROGRESS_THREADS seems to have disappeared and OMPI_ENABLE_PROGRESS_THREADS has appeared.
OMPI_ENABLE_PROGRESS_THREADS is hardcoded to be off. I tried to compile with it set, but there are compile errors (presumably why it's turned off). So I'm wondering whether, in opal_condition_wait and a few other places, OPAL_ENABLE_MULTI_THREADS should in fact be OMPI_ENABLE_PROGRESS_THREADS. If I change a few of those OPAL_ENABLE_MULTI_THREADS to OMPI_ENABLE_PROGRESS_THREADS (I don't know if I changed all that need to be changed), then I can start running threaded MPI programs again.

Regards,

Chris
-- 
cy...@ozlabs.org
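P.S. For concreteness, the kind of change I mean in opal/threads/condition.h looks roughly like this (a sketch of my local experiment, not a reviewed patch, and there may well be more sites that need the same treatment):

```diff
     if (opal_using_threads()) {
-#if OPAL_HAVE_POSIX_THREADS && OPAL_ENABLE_MULTI_THREADS
+#if OPAL_HAVE_POSIX_THREADS && OMPI_ENABLE_PROGRESS_THREADS
         rc = pthread_cond_wait(&c->c_cond, &m->m_lock_pthread);
-#elif OPAL_HAVE_SOLARIS_THREADS && OPAL_ENABLE_MULTI_THREADS
+#elif OPAL_HAVE_SOLARIS_THREADS && OMPI_ENABLE_PROGRESS_THREADS
         rc = cond_wait(&c->c_cond, &m->m_lock_solaris);
 #else
```

With OMPI_ENABLE_PROGRESS_THREADS hardcoded off, this makes threaded builds fall through to the opal_progress polling branch, which matches the 1.6 behaviour.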