Siegmar, could you please try the attached patch? /* and keep in mind this is just a workaround that happens to work */
Cheers,

Gilles

On 2014/12/22 22:48, Siegmar Gross wrote:
> Hi,
>
> today I installed openmpi-dev-602-g82c02b4 on my machines (Solaris 10 Sparc,
> Solaris 10 x86_64, and openSUSE Linux 12.1 x86_64) with gcc-4.9.2 and the
> new Solaris Studio 12.4 compilers. All build processes finished without
> errors, but I have a problem running a very small program. It works for
> three processes but hangs for six processes. I see the same behaviour
> with both compilers.
>
> tyr small_prog 139 time; mpiexec -np 3 --host sunpc1,linpc1,tyr init_finalize; time
> 827.161u 210.126s 30:51.08 56.0% 0+0k 4151+20io 2898pf+0w
> Hello!
> Hello!
> Hello!
> 827.886u 210.335s 30:54.68 55.9% 0+0k 4151+20io 2898pf+0w
> tyr small_prog 140 time; mpiexec -np 6 --host sunpc1,linpc1,tyr init_finalize; time
> 827.946u 210.370s 31:15.02 55.3% 0+0k 4151+20io 2898pf+0w
> ^CKilled by signal 2.
> Killed by signal 2.
> 869.242u 221.644s 33:40.54 53.9% 0+0k 4151+20io 2898pf+0w
> tyr small_prog 141
>
> tyr small_prog 145 ompi_info | grep -e "Open MPI repo revision:" -e "C compiler:"
>   Open MPI repo revision: dev-602-g82c02b4
>   C compiler: cc
> tyr small_prog 146
>
>
> tyr small_prog 146 /usr/local/gdb-7.6.1_64_gcc/bin/gdb mpiexec
> GNU gdb (GDB) 7.6.1
> ...
> (gdb) run -np 3 --host sunpc1,linpc1,tyr init_finalize
> Starting program: /usr/local/openmpi-1.9.0_64_cc/bin/mpiexec -np 3 --host sunpc1,linpc1,tyr init_finalize
> [Thread debugging using libthread_db enabled]
> [New Thread 1 (LWP 1)]
> [New LWP 2]
> Hello!
> Hello!
> Hello!
> [LWP 2 exited]
> [New Thread 2]
> [Switching to Thread 1 (LWP 1)]
> sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to satisfy query
> (gdb) run -np 6 --host sunpc1,linpc1,tyr init_finalize
> The program being debugged has been started already.
> Start it from the beginning? (y or n) y
>
> Starting program: /usr/local/openmpi-1.9.0_64_cc/bin/mpiexec -np 6 --host sunpc1,linpc1,tyr init_finalize
> [Thread debugging using libthread_db enabled]
> [New Thread 1 (LWP 1)]
> [New LWP 2]
> ^CKilled by signal 2.
> Killed by signal 2.
>
> Program received signal SIGINT, Interrupt.
> [Switching to Thread 1 (LWP 1)]
> 0xffffffff7d1dc6b0 in __pollsys () from /lib/sparcv9/libc.so.1
> (gdb) bt
> #0  0xffffffff7d1dc6b0 in __pollsys () from /lib/sparcv9/libc.so.1
> #1  0xffffffff7d1cb468 in _pollsys () from /lib/sparcv9/libc.so.1
> #2  0xffffffff7d170ed8 in poll () from /lib/sparcv9/libc.so.1
> #3  0xffffffff7e69a630 in poll_dispatch ()
>    from /usr/local/openmpi-1.9.0_64_cc/lib64/libopen-pal.so.0
> #4  0xffffffff7e6894ec in opal_libevent2021_event_base_loop ()
>    from /usr/local/openmpi-1.9.0_64_cc/lib64/libopen-pal.so.0
> #5  0x000000010000eb14 in orterun (argc=1757447168, argv=0xffffff7ed8550cff)
>     at ../../../../openmpi-dev-602-g82c02b4/orte/tools/orterun/orterun.c:1090
> #6  0x0000000100004e2c in main (argc=256, argv=0xffffff7ed8af5c00)
>     at ../../../../openmpi-dev-602-g82c02b4/orte/tools/orterun/main.c:13
> (gdb)
>
> Any ideas? Unfortunately I'm leaving for vacation, so I cannot test any
> patches until the end of the year. Nevertheless I wanted to report the
> problem. At the moment I cannot test whether I see the same behaviour in a
> homogeneous environment with three machines, because the new version won't
> be available on the other machines until tomorrow. I used the following
> configure command.
>
> ../openmpi-dev-602-g82c02b4/configure --prefix=/usr/local/openmpi-1.9.0_64_cc \
>   --libdir=/usr/local/openmpi-1.9.0_64_cc/lib64 \
>   --with-jdk-bindir=/usr/local/jdk1.8.0/bin \
>   --with-jdk-headers=/usr/local/jdk1.8.0/include \
>   JAVA_HOME=/usr/local/jdk1.8.0 \
>   LDFLAGS="-m64 -mt" \
>   CC="cc" CXX="CC" FC="f95" \
>   CFLAGS="-m64 -mt" CXXFLAGS="-m64 -library=stlport4" FCFLAGS="-m64" \
>   CPP="cpp" CXXCPP="cpp" \
>   CPPFLAGS="" CXXCPPFLAGS="" \
>   --enable-mpi-cxx \
>   --enable-cxx-exceptions \
>   --enable-mpi-java \
>   --enable-heterogeneous \
>   --enable-mpi-thread-multiple \
>   --with-threads=posix \
>   --with-hwloc=internal \
>   --without-verbs \
>   --with-wrapper-cflags="-m64 -mt" \
>   --with-wrapper-cxxflags="-m64 -library=stlport4" \
>   --with-wrapper-ldflags="-mt" \
>   --enable-debug \
>   |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>
> Furthermore I used the following test program.
>
> #include <stdio.h>
> #include <stdlib.h>
> #include "mpi.h"
>
> int main (int argc, char *argv[])
> {
>   MPI_Init (&argc, &argv);
>   printf ("Hello!\n");
>   MPI_Finalize ();
>   return EXIT_SUCCESS;
> }
>
>
> Kind regards
>
> Siegmar
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2014/12/26052.php
diff --git a/orte/orted/pmix/pmix_server.c b/orte/orted/pmix/pmix_server.c
index 4f0493c..0f4c816 100644
--- a/orte/orted/pmix/pmix_server.c
+++ b/orte/orted/pmix/pmix_server.c
@@ -1241,9 +1241,9 @@ static void pmix_server_dmdx_resp(int status, orte_process_name_t* sender,
             /* pass across any returned blobs */
             opal_dss.copy_payload(reply, buffer);
             stored = true;
-            OBJ_RETAIN(reply);
-            PMIX_SERVER_QUEUE_SEND(req->peer, req->tag, reply);
         }
+        OBJ_RETAIN(reply);
+        PMIX_SERVER_QUEUE_SEND(req->peer, req->tag, reply);
     } else {
         /* If peer has an access to shared memory dstore, check
          * if we already stored data for the target process.
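To make the failure mode concrete: as far as I can tell from the hunk, the
pre-patch code only queued the dmdx reply back to the requesting peer inside
the branch where a blob had been stored, so when that branch was not taken the
peer never received an answer and blocked in poll() — which matches the
backtrace above. The patch moves the retain/send outside the conditional so a
reply always goes out. Below is a minimal, self-contained model of that
control flow (hypothetical names; it simplifies away the OBJ_RETAIN
refcounting and is not the actual ORTE code):

#include <stdbool.h>
#include <stdio.h>

/* stand-in for PMIX_SERVER_QUEUE_SEND (hypothetical) */
static void queue_send (const char *reply)
{
  printf ("reply queued for peer: %s\n", reply);
}

static void dmdx_resp_model (bool have_blob, bool patched)
{
  bool stored = false;

  if (have_blob) {
    /* pass across any returned blobs */
    stored = true;
    if (!patched) {
      queue_send ("blob");          /* pre-patch: only send path */
    }
  }
  if (patched) {
    queue_send (stored ? "blob" : "empty");  /* patched: always reply */
  } else if (!have_blob) {
    printf ("no reply queued -- peer blocks in poll()\n");
  }
}

int main (void)
{
  puts ("pre-patch, no blob:");
  dmdx_resp_model (false, false);   /* models the observed hang */
  puts ("patched, no blob:");
  dmdx_resp_model (false, true);    /* peer always gets an answer */
  return 0;
}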