Hi,

An LLNL TotalView user on a Mac reported that their MPI job was hanging inside 
MPI_Init() when started under the control of TotalView. They were using Open 
MPI 4.0.1, and TotalView was using the MPIR Interface (sorry, we don't support 
the PMIx debugging hooks yet).

I was able to reproduce the hang on my own Linux system with my own build of 
Open MPI 4.0.1, which I built with debug symbols. As far as I can tell, there 
is some sort of race inside Open MPI 4.0.1, because if I placed breakpoints 
at certain points in the Open MPI code, and thus changed the timing slightly, 
that was enough to avoid the hang.

When the code hangs, it appears that one or more MPI processes are waiting 
inside ompi_mpi_init() at ompi_mpi_init.c#904 for a fence to be released. 
In one of the runs, rank 0 was the only one that was hanging there (though I 
have seen runs where two ranks were hung there).

Here's a backtrace of the first thread in the rank 0 process in the case where 
one rank was hung:

d1.<> f 10.1 w
>  0 __nanosleep_nocancel PC=0x7ffff74e2efd, FP=0x7fffffffd1e0 
> [/lib64/libc.so.6]
   1 usleep           PC=0x7ffff7513b2f, FP=0x7fffffffd200 [/lib64/libc.so.6]
   2 ompi_mpi_init    PC=0x7ffff7a64009, FP=0x7fffffffd350 
[/home/jdelsign/src/tools-external/openmpi-4.0.1/ompi/runtime/ompi_mpi_init.c#904]
   3 PMPI_Init        PC=0x7ffff7ab0be4, FP=0x7fffffffd390 
[/home/jdelsign/src/tools-external/openmpi-4.0.1-lid/ompi/mpi/c/profile/pinit.c#67]
   4 main             PC=0x00400c5e, FP=0x7fffffffd550 [/home/jdelsign/cpi.c#27]
   5 __libc_start_main PC=0x7ffff7446b13, FP=0x7fffffffd610 [/lib64/libc.so.6]
   6 _start           PC=0x00400b04, FP=0x7fffffffd618 [/amd/home/jdelsign/cpi]

Here's the block of code where the thread is hung:

    /* if we executed the above fence in the background, then
     * we have to wait here for it to complete. However, there
     * is no reason to do two barriers! */
    if (background_fence) {
        OMPI_LAZY_WAIT_FOR_COMPLETION(active);
    } else if (!ompi_async_mpi_init) {
        /* wait for everyone to reach this point - this is a hard
         * barrier requirement at this time, though we hope to relax
         * it at a later point */
        if (NULL != opal_pmix.fence_nb) {
            active = true;
            OPAL_POST_OBJECT(&active);
            if (OMPI_SUCCESS != (ret = opal_pmix.fence_nb(NULL, false,
                               fence_release, (void*)&active))) {
                error = "opal_pmix.fence_nb() failed";
                goto error;
            }
            OMPI_LAZY_WAIT_FOR_COMPLETION(active);   <<<<----- STUCK HERE WAITING FOR THE FENCE TO BE RELEASED
        } else {
            if (OMPI_SUCCESS != (ret = opal_pmix.fence(NULL, false))) {
                error = "opal_pmix.fence() failed";
                goto error;
            }
        }
    }
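
For what it's worth, the usleep()/__nanosleep frames in the backtrace are 
consistent with my reading of OMPI_LAZY_WAIT_FOR_COMPLETION: it simply polls 
the 'active' flag, driving the progress engine and sleeping briefly each pass, 
and the fence_release callback passed to opal_pmix.fence_nb() is what is 
supposed to clear that flag. Paraphrasing from memory (the verbose-output call 
and exact sleep interval are elided and may differ from the 4.0.1 source):

    /* Paraphrased sketch, not the exact 4.0.1 source.  The macro spins
     * until the flag is cleared, calling opal_progress() and sleeping
     * each pass -- hence the usleep()/__nanosleep frames above. */
    #define OMPI_LAZY_WAIT_FOR_COMPLETION(flg)  \
        do {                                    \
            while ((flg)) {                     \
                opal_progress();                \
                usleep(100);                    \
            }                                   \
        } while (0)

    /* The completion callback handed to opal_pmix.fence_nb().  The hang
     * means this never ran for rank 0, so 'active' stayed true. */
    static void fence_release(int status, void *cbdata)
    {
        volatile bool *act = (volatile bool *) cbdata;
        OPAL_ACQUIRE_OBJECT(act);
        *act = false;
        OPAL_POST_OBJECT(act);
    }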

And here is an aggregated backtrace of all of the processes and threads in the 
job:

d1.<> f g w -g f+l
+/
 +__clone : 5:12[0-3.2-3, p1.2-5]
 |+start_thread
 | +listen_thread@oob_tcp_listener.c#705 : 1:1[p1.5]
 | |+__select_nocancel
 | +listen_thread@ptl_base_listener.c#214 : 1:1[p1.3]
 | |+__select_nocancel
 | +progress_engine@opal_progress_threads.c#105 : 5:5[0-3.2, p1.4]
 | |+opal_libevent2022_event_base_loop@event.c#1632
 | | +poll_dispatch@poll.c#167
 | |  +__poll_nocancel
 | +progress_engine@pmix_progress_threads.c#108 : 5:5[0-3.3, p1.2]
 |  +opal_libevent2022_event_base_loop@event.c#1632
 |   +epoll_dispatch@epoll.c#409
 |    +__epoll_wait_nocancel
 +_start : 5:5[0-3.1, p1.1]
  +__libc_start_main
   +main@cpi.c#27 : 4:4[0-3.1]
   |+PMPI_Init@pinit.c#67
   | +ompi_mpi_init@ompi_mpi_init.c#890 : 3:3[1-3.1]  <<<<---- THE 3 OTHER MPI PROCS MADE IT PAST FENCE
   | |+ompi_rte_wait_for_debugger@rte_orte_module.c#196
   | | +opal_progress@opal_progress.c#251
   | |  +opal_progress_events@opal_progress.c#191
   | |   +opal_libevent2022_event_base_loop@event.c#1632
   | |    +poll_dispatch@poll.c#167
   | |     +__poll_nocancel
   | +ompi_mpi_init@ompi_mpi_init.c#904 : 1:1[0.1] <<<<---- THE THREAD THAT IS STUCK
   |  +usleep
   |   +__nanosleep_nocancel
   +main@main.c#14 : 1:1[p1.1]
    +orterun@orterun.c#200
     +opal_libevent2022_event_base_loop@event.c#1632
      +poll_dispatch@poll.c#167
       +__poll_nocancel

d1.<>
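
For completeness: the three ranks parked at ompi_mpi_init.c#890 look like the 
normal MPIR hold. As far as I can tell from the 4.0.1 source, 
ompi_rte_wait_for_debugger() registers a callback for the debugger-release 
event and then spins the progress engine on a flag, which matches the 
opal_progress() frames under it in the aggregated backtrace. Something along 
these lines (a sketch only; the flag and callback names here are mine, not 
Open MPI's):

    /* Sketch of the hold-in-MPI_Init loop; names are hypothetical. */
    static volatile bool debugger_release_pending = true;

    /* runs when the RTE delivers the debugger-release event */
    static void release_cb(void)
    {
        debugger_release_pending = false;
    }

    static void wait_for_release(void)
    {
        while (debugger_release_pending) {
            opal_progress();    /* keep the RTE/PMIx event loops moving */
        }
    }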

I have tested Open MPI 4.0.2 dozens of times, and the hang does not seem to 
happen. My concern is that if the problem is indeed a race, then it's possible 
(but perhaps not likely) that the same race exists in Open MPI 4.0.2, but the 
timing could be slightly different such that it doesn't hang using my simple 
test setup. In other words, maybe I've just been "lucky" with my testing of 
Open MPI 4.0.2 and have failed to provoke the hang yet.

My question is: Was this a known problem in Open MPI 4.0.1 that was fixed in 
Open MPI 4.0.2?

Thanks, John D.
