Siegmar,

could you please give the attached patch a try?
/* and keep in mind this is just a workaround that happens to work */
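
If I read pmix_server.c correctly, the dmdx response handler queued a reply to
the requesting peer only inside the branch where a data blob had actually been
found and stored; when the lookup came back empty, no answer was sent at all,
and the peer blocked forever waiting for one. That could explain why your job
completes with 3 processes but hangs with 6. The patch below simply moves the
OBJ_RETAIN / PMIX_SERVER_QUEUE_SEND pair out of that branch so a reply, possibly
empty, is always sent back.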

Cheers,

Gilles

On 2014/12/22 22:48, Siegmar Gross wrote:
> Hi,
>
> Today I installed openmpi-dev-602-g82c02b4 on my machines (Solaris 10 Sparc,
> Solaris 10 x86_64, and openSUSE Linux 12.1 x86_64) with gcc-4.9.2 and the
> new Solaris Studio 12.4 compilers. All build processes finished without
> errors, but I have a problem running a very small program: it works for
> three processes but hangs for six. I see the same behaviour with both
> compilers.
>
> tyr small_prog 139 time; mpiexec -np 3 --host sunpc1,linpc1,tyr init_finalize; time
> 827.161u 210.126s 30:51.08 56.0%        0+0k 4151+20io 2898pf+0w
> Hello!
> Hello!
> Hello!
> 827.886u 210.335s 30:54.68 55.9%        0+0k 4151+20io 2898pf+0w
> tyr small_prog 140 time; mpiexec -np 6 --host sunpc1,linpc1,tyr init_finalize; time
> 827.946u 210.370s 31:15.02 55.3%        0+0k 4151+20io 2898pf+0w
> ^CKilled by signal 2.
> Killed by signal 2.
> 869.242u 221.644s 33:40.54 53.9%        0+0k 4151+20io 2898pf+0w
> tyr small_prog 141 
>
> tyr small_prog 145 ompi_info | grep -e "Open MPI repo revision:" -e "C compiler:"
>   Open MPI repo revision: dev-602-g82c02b4
>               C compiler: cc
> tyr small_prog 146 
>
>
> tyr small_prog 146 /usr/local/gdb-7.6.1_64_gcc/bin/gdb mpiexec
> GNU gdb (GDB) 7.6.1
> ...
> (gdb) run -np 3 --host sunpc1,linpc1,tyr init_finalize
> Starting program: /usr/local/openmpi-1.9.0_64_cc/bin/mpiexec -np 3 --host sunpc1,linpc1,tyr init_finalize
> [Thread debugging using libthread_db enabled]
> [New Thread 1 (LWP 1)]
> [New LWP    2        ]
> Hello!
> Hello!
> Hello!
> [LWP    2         exited]
> [New Thread 2        ]
> [Switching to Thread 1 (LWP 1)]
> sol_thread_fetch_registers: td_ta_map_id2thr: no thread can be found to satisfy query
> (gdb) run -np 6 --host sunpc1,linpc1,tyr init_finalize
> The program being debugged has been started already.
> Start it from the beginning? (y or n) y
>
> Starting program: /usr/local/openmpi-1.9.0_64_cc/bin/mpiexec -np 6 --host sunpc1,linpc1,tyr init_finalize
> [Thread debugging using libthread_db enabled]
> [New Thread 1 (LWP 1)]
> [New LWP    2        ]
> ^CKilled by signal 2.
> Killed by signal 2.
>
> Program received signal SIGINT, Interrupt.
> [Switching to Thread 1 (LWP 1)]
> 0xffffffff7d1dc6b0 in __pollsys () from /lib/sparcv9/libc.so.1
> (gdb) bt
> #0  0xffffffff7d1dc6b0 in __pollsys () from /lib/sparcv9/libc.so.1
> #1  0xffffffff7d1cb468 in _pollsys () from /lib/sparcv9/libc.so.1
> #2  0xffffffff7d170ed8 in poll () from /lib/sparcv9/libc.so.1
> #3  0xffffffff7e69a630 in poll_dispatch ()
>    from /usr/local/openmpi-1.9.0_64_cc/lib64/libopen-pal.so.0
> #4  0xffffffff7e6894ec in opal_libevent2021_event_base_loop ()
>    from /usr/local/openmpi-1.9.0_64_cc/lib64/libopen-pal.so.0
> #5  0x000000010000eb14 in orterun (argc=1757447168, argv=0xffffff7ed8550cff)
>     at ../../../../openmpi-dev-602-g82c02b4/orte/tools/orterun/orterun.c:1090
> #6  0x0000000100004e2c in main (argc=256, argv=0xffffff7ed8af5c00)
>     at ../../../../openmpi-dev-602-g82c02b4/orte/tools/orterun/main.c:13
> (gdb) 
>
> Any ideas? Unfortunately I'm leaving for vacation, so I cannot test any
> patches until the end of the year. Nevertheless, I wanted to report the
> problem. At the moment I cannot check whether I get the same behaviour in a
> homogeneous environment with three machines, because the new version won't
> be available on the other machines before tomorrow. I used the following
> configure command.
>
> ../openmpi-dev-602-g82c02b4/configure --prefix=/usr/local/openmpi-1.9.0_64_cc \
>   --libdir=/usr/local/openmpi-1.9.0_64_cc/lib64 \
>   --with-jdk-bindir=/usr/local/jdk1.8.0/bin \
>   --with-jdk-headers=/usr/local/jdk1.8.0/include \
>   JAVA_HOME=/usr/local/jdk1.8.0 \
>   LDFLAGS="-m64 -mt" \
>   CC="cc" CXX="CC" FC="f95" \
>   CFLAGS="-m64 -mt" CXXFLAGS="-m64 -library=stlport4" FCFLAGS="-m64" \
>   CPP="cpp" CXXCPP="cpp" \
>   CPPFLAGS="" CXXCPPFLAGS="" \
>   --enable-mpi-cxx \
>   --enable-cxx-exceptions \
>   --enable-mpi-java \
>   --enable-heterogeneous \
>   --enable-mpi-thread-multiple \
>   --with-threads=posix \
>   --with-hwloc=internal \
>   --without-verbs \
>   --with-wrapper-cflags="-m64 -mt" \
>   --with-wrapper-cxxflags="-m64 -library=stlport4" \
>   --with-wrapper-ldflags="-mt" \
>   --enable-debug \
>   |& tee log.configure.$SYSTEM_ENV.$MACHINE_ENV.64_cc
>
> Furthermore, I used the following test program:
>
> #include <stdio.h>
> #include <stdlib.h>
> #include "mpi.h"
>
> int main (int argc, char *argv[])
> {
>   MPI_Init (&argc, &argv);
>   printf ("Hello!\n");
>   MPI_Finalize ();
>   return EXIT_SUCCESS;
> }
>
>
>
> Kind regards
>
> Siegmar
>

diff --git a/orte/orted/pmix/pmix_server.c b/orte/orted/pmix/pmix_server.c
index 4f0493c..0f4c816 100644
--- a/orte/orted/pmix/pmix_server.c
+++ b/orte/orted/pmix/pmix_server.c
@@ -1241,9 +1241,9 @@ static void pmix_server_dmdx_resp(int status, orte_process_name_t* sender,
                     /* pass across any returned blobs */
                     opal_dss.copy_payload(reply, buffer);
                     stored = true;
-                    OBJ_RETAIN(reply);
-                    PMIX_SERVER_QUEUE_SEND(req->peer, req->tag, reply);
                 }
+                OBJ_RETAIN(reply);
+                PMIX_SERVER_QUEUE_SEND(req->peer, req->tag, reply);
             } else {
                 /* If peer has an access to shared memory dstore, check
                  * if we already stored data for the target process.
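
For reference, this is how the patched block reads once applied -- a sketch
only, with the enclosing condition paraphrased in a comment because the diff
does not show it:

    if (/* a blob was returned for the target process */) {
        /* pass across any returned blobs */
        opal_dss.copy_payload(reply, buffer);
        stored = true;
    }
    /* always answer the requesting peer, even when nothing was found,
     * so it does not block forever waiting for the dmdx response */
    OBJ_RETAIN(reply);
    PMIX_SERVER_QUEUE_SEND(req->peer, req->tag, reply);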
