Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-13 Thread Ralph Castain via devel
https://github.com/openpmix/openpmix/issues/1189

I think this lines up - 4.0.2 should be good with a fix.


From: John DelSignore
To: Open MPI Developers
Cc: Austen W Lauria, devel 

Date: 11/12/2019 02:25 PM
Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside 
MPI_Init() when debugged with TotalView





Hi Austen,

Thanks for the reply. What I am seeing is consistent with your thought, in that 
when I see the hang, one or more processes did not have a flag updated. I don't 
understand how the Open MPI code works well enough to say if it is a memory 
barrier problem or not. It almost looks like an event delivery or dropped event 
problem to me. 

The place in the MPI_Init() code where the MPI processes hang and the number of 
"hung" processes seem to vary from run to run. In some cases the processes are 
waiting for an event or waiting for a fence (whatever that is).

I did the following run today, which shows that it can hang waiting for an 
event that apparently was not generated or was dropped:

1. Started TV on mpirun: totalview -args mpirun -np 4 ./cpi
2. Ran the mpirun process until it hit the MPIR_Breakpoint() event.
3. TV attached to all four of the MPI processes and left all five processes 
stopped.
4. Continued all of the processes/threads and let them run freely for about 60 
seconds. They should have run to completion in that amount of time.
5. Halted all of the processes. I included an aggregated backtrace of all of 
the processes below.

In this particular run, all four MPI processes were waiting in 
ompi_rte_wait_for_debugger() in rte_orte_module.c at line 196, which is:

        /* let the MPI progress engine run while we wait for debugger release */
       OMPI_WAIT_FOR_COMPLETION(debugger_event_active);

I don't know how that is supposed to work, but I can clearly see that 
debugger_event_active was true in all of the processes, even though TV set 
MPIR_debug_gate to 1:

d1.<> f {2.1 3.1 4.1 5.1} p debugger_event_active
Thread 2.1:
debugger_event_active = true (1)
Thread 3.1:
debugger_event_active = true (1)
Thread 4.1:
debugger_event_active = true (1)
Thread 5.1:
debugger_event_active = true (1)
d1.<> f {2.1 3.1 4.1 5.1} p MPIR_debug_gate
Thread 2.1:
MPIR_debug_gate = 0x0001 (1)
Thread 3.1:
MPIR_debug_gate = 0x0001 (1)
Thread 4.1:
MPIR_debug_gate = 0x0001 (1)
Thread 5.1:
MPIR_debug_gate = 0x0001 (1)
d1.<> 
I think the _release_fn() function in rte_orte_module.c is supposed to set 
debugger_event_active to false, but that apparently did not happen in this 
case. So, AFAICT, the reason debugger_event_active would not be set to false is 
that the event was never delivered, so the _release_fn() function was never 
called. If that's the case, then the lack of a memory barrier is probably a 
moot point, and the problem is likely related to event generation or dropped 
events.
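
To make that failure mode concrete, here is a minimal standalone sketch of the pattern as I understand it (my own illustration, not Open MPI code; the thread function, the deliver_event switch, and the file name are invented for the example):

/*
 * Minimal illustration (not Open MPI source): one thread spins on a volatile
 * flag that only an event callback ever clears. If the release event is never
 * delivered, the callback never runs and the waiter spins forever, which
 * matches what I see in the hung processes.
 * Build with: gcc -pthread wait_demo.c
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static volatile bool debugger_event_active = true;  /* same role as in rte_orte_module.c */

/* Stand-in for _release_fn(): runs only if the release event is delivered. */
static void *release_event(void *arg)
{
    bool deliver_event = *(bool *)arg;
    if (deliver_event) {
        debugger_event_active = false;   /* releases the waiter */
    }                                    /* else: "dropped event" -> hang */
    return NULL;
}

int main(void)
{
    bool deliver_event = true;           /* flip to false to reproduce the hang */
    pthread_t t;
    pthread_create(&t, NULL, release_event, &deliver_event);

    /* Stand-in for OMPI_WAIT_FOR_COMPLETION(debugger_event_active). */
    while (debugger_event_active) {
        usleep(100);                     /* the real macro drives opal_progress() here */
    }

    pthread_join(t, NULL);
    printf("released\n");
    return 0;
}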

Cheers, John D. 

FWIW: Here's the aggregated backtrace after the whole job was allowed to run 
freely for about 60 seconds, and then stopped:

d1.<> f g w -g f+l

+/
+__clone : 5:12[0-3.2-3, p1.2-5]
|+start_thread
| +listen_thread : 1:2[p1.3, p1.5]
| |+__select_nocancel
| +progress_engine@opal_progress_threads.c#105 : 4:4[0-3.2]
| |+opal_libevent2022_event_base_loop@event.c#1630
| | +poll_dispatch@poll.c#165
| |  +__poll_nocancel
| +progress_engine@pmix_progress_threads.c#109 : 4:4[0-3.3]
| |+opal_libevent2022_event_base_loop@event.c#1630
| | +epoll_dispatch@epoll.c#407
| |  +__epoll_wait_nocancel
| +progress_engine : 1:2[p1.2, p1.4]
|  +opal_libevent2022_event_base_loop@event.c#1630
|   +epoll_dispatch@epoll.c#407 : 1:1[p1.2]
|   |+__epoll_wait_nocancel
|   +poll_dispatch@poll.c#165 : 1:1[p1.4]
|    +__poll_nocancel
+_start : 5:5[0-3.1, p1.1]
 +__libc_start_main
  +main@cpi.c#27 : 4:4[0-3.1]
  |+PMPI_Init@pinit.c#67
  | +ompi_mpi_init@ompi_mpi_init.c#890
  |  +ompi_rte_wait_for_debugger@rte_orte_module.c#196
  |   +opal_progress@opal_progress.c#245 : 1:1[0.1]
  |   |+opal_progress_events@opal_progress.c#191
  |   | +opal_libevent2022_event_base_loop@event.c

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-13 Thread Ralph Castain via devel
16009869231249, Error is 0.0818
wall clock time = 0.000133
mic:/amd/home/jdelsign>

With 4.0.1 and ORTE_TEST_DEBUGGER_ATTACH not set, all of the runs complete and 
look like this:

mic:/amd/home/jdelsign>env OMPI_MPIR_DO_NOT_WARN=1 mpirun -np 4 ./cpi401
Process 2 on microway1
Process 0 on microway1
Process 3 on microway1
Process 1 on microway1
pi is approximately 3.1416009869231249, Error is 0.0818
wall clock time = 0.000153
mic:/amd/home/jdelsign>

As you can see in this last test, if ORTE_TEST_DEBUGGER_ATTACH is not set, the 
code in ompi_rte_wait_for_debugger() is not executed.

Honestly, I don't know if this is a valid test or not, but it strongly suggests 
that there is a problem in that code in 4.0.1, and it cannot be the debugger's 
fault, because there is no debugger in the picture. The GitHub issues Austen 
pointed at seem to accurately describe what I have seen, and the conclusion 
there was that it was a bug in PMIx. I have no basis to believe otherwise.

Finally, I'd like to reply to your statement, "What I suspect is actually 
happening is that the debugger is interfering with the OMPI internals that are 
involved in a way that creates a potential loss of the release event." In my 
experience, a debugger can most commonly affect application execution in the 
following ways:

*   It can change execution timing of target processes and threads, so if 
there's already a race in the application code, the debugger can either provoke 
it or cover it up.
*   It can cause data caches to be flushed back to memory when the process 
is stopped.
*   It can trip up code that is not EINTR safe. Linux is pretty good these 
days about avoiding EINTR problems in the code, but it can still happen if the 
debugger happens to stop the process while it is in an interruptible system 
call. If the code does not handle EINTR correctly, then it can fail (see the 
sketch just below).
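
As an aside on that last point, the usual defensive pattern looks something like the sketch below (my own illustration, not code from Open MPI):

#include <errno.h>
#include <unistd.h>

/*
 * Sketch of an EINTR-safe wrapper around an interruptible system call. If a
 * debugger stops and resumes the process in the middle of read(), the call
 * can return -1 with errno == EINTR; code that treats that as a hard error
 * will fail, while a retry loop like this simply tries again.
 */
ssize_t read_retry(int fd, void *buf, size_t count)
{
    ssize_t n;
    do {
        n = read(fd, buf, count);
    } while (n == -1 && errno == EINTR);   /* retry only on interruption */
    return n;                              /* other results are returned as usual */
}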

I think that unless there's already a problem in the code, the debugger should 
not be able to interfere at all.

Cheers, John D.



On 11/12/19 6:51 PM, Ralph Castain via devel wrote:
Again, John, I'm not convinced your last statement is true. However, I think it 
is "good enough" for now as it seems to work for you and it isn't seen outside 
of a debugger scenario.


On Nov 12, 2019, at 3:13 PM, John DelSignore via devel <devel@lists.open-mpi.org> wrote:

Hi Austen,

Thanks very much, the issues you show below do indeed describe what I am seeing.

Using printfs and breakpoints I inserted into the _release_fn() function, I was 
able to see that with OMPI 4.0.1, at most one of the MPI processes called the 
function. Most of the time rank 0 would be the only one to execute the 
function, but sometimes none of the MPI processes would execute it. However, 
with OMPI 4.0.2, all of the MPI processes execute the function reliably.

I'm glad to know that the problem was actually fixed in OMPI 4.0.2, and not 
just accidentally working for my test cases.

Cheers, John D.

On 11/12/19 3:41 PM, Austen W Lauria wrote:
I think you are hitting this issue here in 4.0.1:

https://github.com/open-mpi/ompi/issues/6613

MPIR was broken in 4.0.1 due to a race condition in PMIx. It was patched, it 
looks to me, for 4.0.2. Here is the openpmix issue: https://github.com/openpmix/openpmix/issues/1189

I think this lines up - 4.0.2 should be good with a fix.


From: John DelSignore
To: Open MPI Developers
Cc: Austen W Lauria, devel 

Date: 11/12/2019 02:25 PM
Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside 
MPI_Init() when debugged with TotalView





Hi Austen,

Thanks for the reply. What I am seeing is consistent with your thought, in that 
when I see the hang, one or more processes did not have a flag updated. I don't 
understand how the Open MPI code works well enough to say if it is a memory 
barrier problem or not. It almost looks like an event delivery or dropped event 
problem to me. 

The place in the MPI_Init() code where the MPI processes hang and the number of 
"hung" processes seem to vary from run to run. In some cases the processes are 
waiting for an event or waiting for a fence (whatever that is).

I did the following run today, which shows that it can hang waiting for an 
event that apparently was not generated or was dropped:

1. Started TV on mpirun: totalview -args mpirun -np 4 ./cpi
2. Ran the mpirun process until it hit the MPIR_Breakpoint() event.
3. TV attached to all four of the MPI processes and left all five processes 
stopped.
4. Continued all of the processes/threads and let them run freely for about 60 
seconds. They should have run to completion in that amount of time.
5. Halted all of the processes. I included an aggregated backtrace of all of 
the processes below.

In this particular run, all f

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-13 Thread John DelSignore via devel
ening is that the debugger is interfering with the OMPI internals that are 
involved in a way that creates a potential loss of the release event." In my 
experience, a debugger can most commonly affect application execution in the 
following ways:

  *   It can change execution timing of target processes and threads, so if 
there's already a race in the application code, the debugger can either provoke 
it or cover it up.
  *   It can cause data caches to be flushed back to memory when the process is 
stopped.
  *   It can trip up code that is not EINTR safe. Linux is pretty good these days 
about avoiding EINTR problems in the code, but it can still happen if the 
debugger happens to stop the process while it is in an interruptible system call. 
If the code does not handle EINTR correctly, then it can fail.

I think that unless there's already a problem in the code, the debugger should 
not be able to interfere at all.

Cheers, John D.


On 11/12/19 6:51 PM, Ralph Castain via devel wrote:
Again, John, I'm not convinced your last statement is true. However, I think it 
is "good enough" for now as it seems to work for you and it isn't seen outside 
of a debugger scenario.


On Nov 12, 2019, at 3:13 PM, John DelSignore via devel <devel@lists.open-mpi.org> wrote:


Hi Austen,

Thanks very much, the issues you show below do indeed describe what I am seeing.

Using printfs and breakpoints I inserted into the _release_fn() function, I was 
able to see that with OMPI 4.0.1, at most one of the MPI processes called the 
function. Most of the time rank 0 would be the only one to execute the 
function, but sometimes none of the MPI processes would execute it. However, 
with OMPI 4.0.2, all of the MPI processes execute the function reliably.

I'm glad to know that the problem was actually fixed in OMPI 4.0.2, and not 
just accidentally working for my test cases.

Cheers, John D.

On 11/12/19 3:41 PM, Austen W Lauria wrote:

I think you are hitting this issue here in 4.0.1:

https://github.com/open-mpi/ompi/issues/6613

MPIR was broken in 4.0.1 due to a race condition in PMIx. It was patched, it 
looks to me, for 4.0.2. Here is the openpmix issue: 
https://github.com/openpmix/openpmix/issues/1189

I think this lines up - 4.0.2 should be good with a fix.


From: John DelSignore <jdelsign...@perforce.com>
To: Open MPI Developers <devel@lists.open-mpi.org>
Cc: Austen W Lauria <awlau...@us.ibm.com>, devel <devel-boun...@lists.open-mpi.org>
Date: 11/12/2019 02:25 PM
Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside 
MPI_Init() when debugged with TotalView





Hi Austen,

Thanks for the reply. What I am seeing is consistent with your thought, in that 
when I see the hang, one or more processes did not have a flag updated. I don't 
understand how the Open MPI code works well enough to say if it is a memory 
barrier problem or not. It almost looks like an event delivery or dropped event 
problem to me.

The place in the MPI_Init() code where the MPI processes hang and the number of 
"hung" processes seem to vary from run to run. In some cases the processes are 
waiting for an event or waiting for a fence (whatever that is).

I did the following run today, which shows that it can hang waiting for an 
event that apparently was not generated or was dropped:

1. Started TV on mpirun: totalview -args mpirun -np 4 ./cpi
2. Ran the mpirun process until it hit the MPIR_Breakpoint() event.
3. TV attached to all four of the MPI processes and left all five processes 
stopped.
4. Continued all of the processes/threads and let them run freely for about 60 
seconds. They should have run to completion in that amount of time.
5. Halted all of the processes. I included an aggregated backtrace of all of 
the processes below.

In this particular run, all four MPI processes were waiting in 
ompi_rte_wait_for_debugger() in rte_orte_module.c at line 196, which is:

/* let the MPI progress engine run while we wait for debugger release */
   OMPI_WAIT_FOR_COMPLETION(debugger_event_active);

I don't know how that is supposed to work, but I can clearly see that 
debugger_event_active was true in all of the processes, even though TV set 
MPIR_debug_gate to 1:

d1.<> f {2.1 3.1 4.1 5.1} p debugger_event_active
Thread 2.1:
debugger_event_active = true (1)
Thread 3.1:
debugger_event_active = true (1)
Thread 4.1:
debugger_event_active = true (1)
Thread 5.1:
debugger_event_active = true (1)
d1.<> f {2.1 3.1 4.1 5.1} p MPIR_debug_gate
Thread 2.1:
MPIR_debug_gate = 0x0001 (1)
Thread 3.1:
MPIR_debug_gate = 0x0001 (1)
Thread 4.1:
MPIR_debug_gate = 0x0001 (1)
Thread 5.1:
MPIR_debug_gate = 0x0001 (1)
d1.<>

I think the _release_fn

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread Ralph Castain via devel
Again, John, I'm not convinced your last statement is true. However, I think it 
is "good enough" for now as it seems to work for you and it isn't seen outside 
of a debugger scenario.


On Nov 12, 2019, at 3:13 PM, John DelSignore via devel <devel@lists.open-mpi.org> wrote:

Hi Austen,

Thanks very much, the issues you show below do indeed describe what I am seeing.

Using printfs and breakpoints I inserted into the _release_fn() function, I was 
able to see that with OMPI 4.0.1, at most one of the MPI processes called the 
function. Most of the time rank 0 would be the only one to execute the 
function, but sometimes none of the MPI processes would execute it. However, 
with OMPI 4.0.2, all of the MPI processes execute the function reliably.

I'm glad to know that the problem was actually fixed in OMPI 4.0.2, and not 
just accidentally working for my test cases.

Cheers, John D.

On 11/12/19 3:41 PM, Austen W Lauria wrote:
I think you are hitting this issue here in 4.0.1:

https://github.com/open-mpi/ompi/issues/6613

MPIR was broken in 4.0.1 due to a race condition in PMIx. It was patched, it 
looks to me, for 4.0.2. Here is the openpmix issue: https://github.com/openpmix/openpmix/issues/1189

I think this lines up - 4.0.2 should be good with a fix.


From: John DelSignore 
To: Open MPI Developers 
Cc: Austen W Lauria , devel 

Date: 11/12/2019 02:25 PM
Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside 
MPI_Init() when debugged with TotalView





Hi Austen,

Thanks for the reply. What I am seeing is consistent with your thought, in that 
when I see the hang, one or more processes did not have a flag updated. I don't 
understand how the Open MPI code works well enough to say if it is a memory 
barrier problem or not. It almost looks like an event delivery or dropped event 
problem to me. 

The place in the MPI_Init() code where the MPI processes hang and the number of 
"hung" processes seem to vary from run to run. In some cases the processes are 
waiting for an event or waiting for a fence (whatever that is).

I did the following run today, which shows that it can hang waiting for an 
event that apparently was not generated or was dropped:

1. Started TV on mpirun: totalview -args mpirun -np 4 ./cpi
2. Ran the mpirun process until it hit the MPIR_Breakpoint() event.
3. TV attached to all four of the MPI processes and left all five processes 
stopped.
4. Continued all of the processes/threads and let them run freely for about 60 
seconds. They should have run to completion in that amount of time.
5. Halted all of the processes. I included an aggregated backtrace of all of 
the processes below.

In this particular run, all four MPI processes were waiting in 
ompi_rte_wait_for_debugger() in rte_orte_module.c at line 196, which is:

        /* let the MPI progress engine run while we wait for debugger release */
       OMPI_WAIT_FOR_COMPLETION(debugger_event_active);

I don't know how that is supposed to work, but I can clearly see that 
debugger_event_active was true in all of the processes, even though TV set 
MPIR_debug_gate to 1: 

d1.<> f {2.1 3.1 4.1 5.1} p debugger_event_active
Thread 2.1:
debugger_event_active = true (1)
Thread 3.1:
debugger_event_active = true (1)
Thread 4.1:
debugger_event_active = true (1)
Thread 5.1:
debugger_event_active = true (1)
d1.<> f {2.1 3.1 4.1 5.1} p MPIR_debug_gate
Thread 2.1:
MPIR_debug_gate = 0x0001 (1)
Thread 3.1:
MPIR_debug_gate = 0x0001 (1)
Thread 4.1:
MPIR_debug_gate = 0x0001 (1)
Thread 5.1:
MPIR_debug_gate = 0x0001 (1)
d1.<> 
I think the _release_fn() function in rte_orte_module.c is supposed to set 
debugger_event_active to false, but that apparently did not happen in this 
case. So, AFAICT, the reason debugger_event_active would not be set to false is 
that the event was never delivered, so the _release_fn() function was never 
called. If that's the case, then the lack of a memory barrier is probably a 
moot point, and the problem is likely related to event generation or dropped 
events.

Cheers, John D. 

FWIW: Here's the aggregated backtrace after the whole job was allowed to run 
freely for about 60 seconds, and then stopped:

d1.<> f g w -g f+l

+/
+__clone : 5:12[0-3.2-3, p1.2-5]
|+start_thread
| +listen_thread : 1:2[p1.3, p1.5]
| |+__select_nocancel
| +progress_engine@opal_progress_threads.c#105 : 4:4[0-3.2]
| |+opal_libevent2022_event_base_loop@event.c#1630
| | +poll_dispatch@poll.c#165
| |  +__poll_nocancel
| +progress_engine@pmix_progress_threads.c

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread John DelSignore via devel
Hi Austen,

Thanks very much, the issues you show below do indeed describe what I am seeing.

Using printfs and breakpoints I inserted into the _release_fn() function, I was 
able to see that with OMPI 4.0.1, at most one of the MPI processes called the 
function. Most of the time rank 0 would be the only one to execute the 
function, but sometimes none of the MPI processes would execute it. However, 
with OMPI 4.0.2, all of the MPI processes execute the function reliably.

I'm glad to know that the problem was actually fixed in OMPI 4.0.2, and not 
just accidentally working for my test cases.

Cheers, John D.

On 11/12/19 3:41 PM, Austen W Lauria wrote:

I think you are hitting this issue here in 4.0.1:

https://github.com/open-mpi/ompi/issues/6613

MPIR was broken in 4.0.1 due to a race condition in PMIx. It was patched, it 
looks to me, for 4.0.2. Here is the openpmix issue: 
https://github.com/openpmix/openpmix/issues/1189

I think this lines up - 4.0.2 should be good with a fix.


From: John DelSignore <jdelsign...@perforce.com>
To: Open MPI Developers <devel@lists.open-mpi.org>
Cc: Austen W Lauria <awlau...@us.ibm.com>, devel <devel-boun...@lists.open-mpi.org>
Date: 11/12/2019 02:25 PM
Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside 
MPI_Init() when debugged with TotalView





Hi Austen,

Thanks for the reply. What I am seeing is consistent with your thought, in that 
when I see the hang, one or more processes did not have a flag updated. I don't 
understand how the Open MPI code works well enough to say if it is a memory 
barrier problem or not. It almost looks like an event delivery or dropped event 
problem to me.

The place in the MPI_Init() code where the MPI processes hang and the number of 
"hung" processes seem to vary from run to run. In some cases the processes are 
waiting for an event or waiting for a fence (whatever that is).

I did the following run today, which shows that it can hang waiting for an 
event that apparently was not generated or was dropped:

1. Started TV on mpirun: totalview -args mpirun -np 4 ./cpi
2. Ran the mpirun process until it hit the MPIR_Breakpoint() event.
3. TV attached to all four of the MPI processes and left all five processes 
stopped.
4. Continued all of the processes/threads and let them run freely for about 60 
seconds. They should have run to completion in that amount of time.
5. Halted all of the processes. I included an aggregated backtrace of all of 
the processes below.

In this particular run, all four MPI processes were waiting in 
ompi_rte_wait_for_debugger() in rte_orte_module.c at line 196, which is:

/* let the MPI progress engine run while we wait for debugger release */
   OMPI_WAIT_FOR_COMPLETION(debugger_event_active);

I don't know how that is supposed to work, but I can clearly see that 
debugger_event_active was true in all of the processes, even though TV set 
MPIR_debug_gate to 1:

d1.<> f {2.1 3.1 4.1 5.1} p debugger_event_active
Thread 2.1:
debugger_event_active = true (1)
Thread 3.1:
debugger_event_active = true (1)
Thread 4.1:
debugger_event_active = true (1)
Thread 5.1:
debugger_event_active = true (1)
d1.<> f {2.1 3.1 4.1 5.1} p MPIR_debug_gate
Thread 2.1:
MPIR_debug_gate = 0x0001 (1)
Thread 3.1:
MPIR_debug_gate = 0x0001 (1)
Thread 4.1:
MPIR_debug_gate = 0x0001 (1)
Thread 5.1:
MPIR_debug_gate = 0x0001 (1)
d1.<>

I think the _release_fn() function in rte_orte_module.c is supposed to set 
debugger_event_active to false, but that apparently did not happen in this 
case. So, AFAICT, the reason debugger_event_active would not be set to false is 
that the event was never delivered, so the _release_fn() function was never 
called. If that's the case, then the lack of a memory barrier is probably a 
moot point, and the problem is likely related to event generation or dropped 
events.

Cheers, John D.

FWIW: Here's the aggregated backtrace after the whole job was allowed to run 
freely for about 60 seconds, and then stopped:

d1.<> f g w -g f+l

+/
+__clone : 5:12[0-3.2-3, p1.2-5]
|+start_thread
| +listen_thread : 1:2[p1.3, p1.5]
| |+__select_nocancel
| +progress_engine@opal_progress_threads.c#105 : 4:4[0-3.2]
| |+opal_libevent2022_event_base_loop@event.c#1630
| | +poll_dispatch@poll.c#165
| |  +__poll_nocancel
| +progress_engine@pmix_progress_threads.c#109 : 4:4[0-3.3]
| |+opal_libevent2022_event_base_loop@event.c#1630
| | +epoll_dispatch@epoll.c#407
| |  +__epoll_wait_nocancel
| +progress_engine : 1:2[p1.2, p1.4]
|  +opal_libevent2022_event_base_lo

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread Ralph Castain via devel
progress@opal_progress.c#245 : 1:1[0.1]
   |   |+opal_progress_events@opal_progress.c#191
   |   | +opal_libevent2022_event_base_loop@event.c#1630
   |   |  +poll_dispatch@poll.c#165
   |   |   +__poll_nocancel
   |   +opal_progress@opal_progress.c#247 : 3:3[1-3.1]
   |    +opal_progress_events@opal_progress.c#191
   |     +opal_libevent2022_event_base_loop@event.c#1630
   |      +poll_dispatch@poll.c#165
   |       +__poll_nocancel
   +orterun : 1:1[p1.1]
    +opal_libevent2022_event_base_loop@event.c#1630
     +poll_dispatch@poll.c#165
      +__poll_nocancel

d1.<> 

On 11/12/19 9:47 AM, Austen W Lauria via devel wrote:

Could it be that some processes are not seeing the flag get updated? I don't 
think just using a simple while loop with a volatile variable is sufficient in 
all cases in a multi-threaded environment. It's my understanding that the 
volatile keyword just tells the compiler to not optimize or do anything funky 
with it - because it can change at any time. However, this doesn't provide any 
memory barrier - so it's possible that the thread polling on this variable is 
never seeing the update.

Looking at the code - I see:

#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg) \
do { \
opal_output_verbose(1, ompi_rte_base_framework.framework_output, \
"%s lazy waiting on RTE event at %s:%d", \
OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
__FILE__, __LINE__); \
while ((flg)) { \
opal_progress(); \
usleep(100); \
} \
}while(0);

I think replacing that with:

#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg, cond, lock) \
do { \
    opal_output_verbose(1, ompi_rte_base_framework.framework_output, \
                        "%s lazy waiting on RTE event at %s:%d", \
                        OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
                        __FILE__, __LINE__); \
    pthread_mutex_lock(&(lock)); \
    while ((flg)) { \
        /* releases the lock while waiting for a signal from another thread */ \
        pthread_cond_wait(&(cond), &(lock)); \
    } \
    pthread_mutex_unlock(&(lock)); \
} while(0);

Is much more standard when dealing with threads updating a shared variable - 
and might lead to a more expected result in this case.

On the other end, this would require the thread updating this variable to:

pthread_mutex_lock(&lock);
flg = new_val;
pthread_cond_signal(&cond);
pthread_mutex_unlock(&lock);

This provides the memory barrier for the thread polling on the flag to see the 
update - something the volatile keyword doesn't do on its own. I think it's 
also much cleaner as it eliminates an arbitrary sleep from the code - which I 
see as a good thing as well.
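
To see that handshake end to end, here is a self-contained sketch of the same idea (a standalone example of the pattern, not a patch against the Open MPI tree; the waiter/releaser names are invented):

/*
 * Standalone demo of the mutex + condition-variable handshake: the waiter
 * blocks in pthread_cond_wait() (which releases the lock while it sleeps),
 * and the releaser flips the flag and signals under the same lock, which
 * also provides the memory barrier. Build with: gcc -pthread condwait.c
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool event_active = true;          /* plays the role of debugger_event_active */

static void *releaser(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    event_active = false;                 /* update the shared flag under the lock */
    pthread_cond_signal(&cond);           /* wake the waiter */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, releaser, NULL);

    pthread_mutex_lock(&lock);
    while (event_active) {                /* loop guards against spurious wakeups */
        pthread_cond_wait(&cond, &lock);  /* atomically unlocks, sleeps, relocks */
    }
    pthread_mutex_unlock(&lock);

    pthread_join(t, NULL);
    printf("released\n");
    return 0;
}

(As noted elsewhere in this thread, this does not drop straight into OMPI_LAZY_WAIT_FOR_COMPLETION, because the waiting thread also has to keep calling opal_progress.)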


From: "Ralph Castain via devel" <devel@lists.open-mpi.org>
To: "OpenMPI Devel" <devel@lists.open-mpi.org>
Cc: "Ralph Castain" <r...@open-mpi.org>
Date: 11/12/2019 09:24 AM
Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView
Sent by: "devel" <devel-boun...@lists.open-mpi.org>







> On Nov 11, 2019, at 4:53 PM, Gilles Gouaillardet via devel <devel@lists.open-mpi.org> wrote:
> 
> John,
> 
> OMPI_LAZY_WAIT_FOR_COMPLETION(active)
> 
> 
> is a simple loop that periodically checks the (volatile) "active" condition, 
> that is expected to be updated by an other thread.
> So if you set your breakpoint too early, and **all** threads are stopped when 
> this breakpoint is hit, you might experience
> what looks like a race condition.
> I guess a similar scenario can occur if the breakpoint is set in mpirun/orted 
> too early, and prevents the pmix (or oob/tcp) thread
> from sending the message to all MPI tasks)
> 
> 
> 
> Ralph,
> 
> does the v4.0.x branch still need the oob/tcp progress thread running inside 
> the MPI app?
> or are we missing some commits (since all interactions with mpirun/orted are 
> handled by PMIx, at least in the master branch) ?

IIRC, that progress thread only runs if explicitly asked to do so by MCA param. 
We don't need that code any more as PMIx takes care of it.

> 
> Cheers,
> 
> Gilles
> 
> On 11/12/2019 9:27 AM, Ralph Castain via devel wrote:
>> Hi John
>> 
>> Sorry to say, but there is no way to really answer your question as the OMPI 
>> community doesn't actively test MPIR sup

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread George Bosilca via devel
>  | +ompi_mpi_init@ompi_mpi_init.c#890
>  |  +ompi_rte_wait_for_debugger@rte_orte_module.c#196
>  |   +opal_progress@opal_progress.c#245 : 1:1[0.1]
>  |   |+opal_progress_events@opal_progress.c#191
>  |   | +opal_libevent2022_event_base_loop@event.c#1630
>  |   |  +poll_dispatch@poll.c#165
>  |   |   +__poll_nocancel
>  |   +opal_progress@opal_progress.c#247 : 3:3[1-3.1]
>  |    +opal_progress_events@opal_progress.c#191
>  |     +opal_libevent2022_event_base_loop@event.c#1630
>  |      +poll_dispatch@poll.c#165
>  |       +__poll_nocancel
>  +orterun : 1:1[p1.1]
>   +opal_libevent2022_event_base_loop@event.c#1630
>    +poll_dispatch@poll.c#165
>     +__poll_nocancel
>
>d1.<>
>
>  On 11/12/19 9:47 AM, Austen W Lauria via devel wrote:
>  Could it be that some processes are not seeing the flag get
>updated? I don't think just using a simple while loop with a 
> volatile
>variable is sufficient in all cases in a multi-threaded 
> environment. It's
>my understanding that the volatile keyword just tells the 
> compiler to not
>optimize or do anything funky with it - because it can change 
> at any time.
>However, this doesn't provide any memory barrier - so it's 
> possible that
>the thread polling on this variable is never seeing the update.
>
>Looking at the code - I see:
>
>#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg) \
>do { \
>opal_output_verbose(1,
>ompi_rte_base_framework.framework_output, \
>"%s lazy waiting on RTE event at %s:%d", \
>OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
>__FILE__, __LINE__); \
>while ((flg)) { \
>opal_progress(); \
>usleep(100); \
>} \
>}while(0);
>
>I think replacing that with:
>
   >#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg, cond, lock) \
   >do { \
   >    opal_output_verbose(1, ompi_rte_base_framework.framework_output, \
   >                        "%s lazy waiting on RTE event at %s:%d", \
   >                        OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
   >                        __FILE__, __LINE__); \
   >    pthread_mutex_lock(&(lock)); \
   >    while ((flg)) { \
   >        /* releases the lock while waiting for a signal from another thread */ \
   >        pthread_cond_wait(&(cond), &(lock)); \
   >    } \
   >    pthread_mutex_unlock(&(lock)); \
   >} while(0);
>
>Is much more standard when dealing with threads updating a
>shared variable - and might lead to a more expected result in 
> this case.
>
>On the other end, this would require the thread updating
>this variable to:
>
   >pthread_mutex_lock(&lock);
   >flg = new_val;
   >pthread_cond_signal(&cond);
   >pthread_mutex_unlock(&lock);
>
>This provides the memory barrier for the thread polling on
>the flag to see the update - something the volatile keyword 
> doesn't do on
>its own. I think it's also much cleaner as it eliminates an 
> arbitrary sleep
>from the code - which I see as a good thing as well.
>
>
>
   >From: "Ralph Castain via devel" <devel@lists.open-mpi.org>
   >To: "OpenMPI Devel" <devel@lists.open-mpi.org>
   >Cc: "Ralph Castain" <r...@open-mpi.org>
   >Date: 11/12/2019 09:24 AM
   >Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView
   >Sent by: "devel" <devel-boun...@lists.open-mpi.org>
>
>
>--
>
>
>
>
>
>> On Nov 11, 2019, at 4:53 PM, Gilles Gouaillardet via
>devel **
> wrot

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread Austen W Lauria via devel
 I think replacing that with:

  #define OMPI_LAZY_WAIT_FOR_COMPLETION(flg, cond, lock) \
  do { \
      opal_output_verbose(1, ompi_rte_base_framework.framework_output, \
                          "%s lazy waiting on RTE event at %s:%d", \
                          OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
                          __FILE__, __LINE__); \
      pthread_mutex_lock(&(lock)); \
      while ((flg)) { \
          /* releases the lock while waiting for a signal from another thread */ \
          pthread_cond_wait(&(cond), &(lock)); \
      } \
      pthread_mutex_unlock(&(lock)); \
  } while(0);

  Is much more standard when dealing with threads updating a
  shared variable - and might lead to a more expected result in
  this case.

  On the other end, this would require the thread updating this
  variable to:

  pthread_mutex_lock(&lock);
  flg = new_val;
  pthread_cond_signal(&cond);
  pthread_mutex_unlock(&lock);

  This provides the memory barrier for the thread polling on
  the flag to see the update - something the volatile keyword
  doesn't do on its own. I think it's also much cleaner as it
  eliminates an arbitrary sleep from the code - which I see as
  a good thing as well.


  From: "Ralph Castain via devel" 
  To: "OpenMPI Devel" 
  Cc: "Ralph Castain" 
  Date: 11/12/2019 09:24 AM
  Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView
  Sent by: "devel" 







  > On Nov 11, 2019, at 4:53 PM, Gilles Gouaillardet via devel
   wrote:
  >
  > John,
  >
  > OMPI_LAZY_WAIT_FOR_COMPLETION(active)
  >
  >
  > is a simple loop that periodically checks the (volatile)
  "active" condition, that is expected to be updated by an
  other thread.
  > So if you set your breakpoint too early, and **all**
  threads are stopped when this breakpoint is hit, you might
  experience
  > what looks like a race condition.
  > I guess a similar scenario can occur if the breakpoint is
  set in mpirun/orted too early, and prevents the pmix (or
  oob/tcp) thread
  > from sending the message to all MPI tasks)
  >
  >
  >
  > Ralph,
  >
  > does the v4.0.x branch still need the oob/tcp progress
  thread running inside the MPI app?
  > or are we missing some commits (since all interactions with
  mpirun/orted are handled by PMIx, at least in the master
  branch) ?

  IIRC, that progress thread only runs if explicitly asked to
  do so by MCA param. We don't need that code any more as PMIx
  takes care of it.

  >
  > Cheers,
  >
  > Gilles
  >
  > On 11/12/2019 9:27 AM, Ralph Castain via devel wrote:
  >> Hi John
  >>
  >> Sorry to say, but there is no way to really answer your
  question as the OMPI community doesn't actively test MPIR
  support. I haven't seen any reports of hangs during MPI_Init
  from any release series, including 4.x. My guess is that it
  may have something to do with the debugger interactions as
  opposed to being a true race condition.
  >>
  >> Ralph
  >>
  >>
  >>> On Nov 8, 2019, at 11:27 AM, John DelSignore via devel <devel@lists.open-mpi.org> wrote:
  >>>
  >>> Hi,
  >>>
  >>> An LLNL TotalView user on a Mac reported that their MPI
  job was hanging inside MPI_Init() when started under the
  control of TotalView. They were using Open MPI 4.0.1, and
  TotalView was using the MPIR Interface (sorry, we don't
  support the PMIx debugging hooks yet).
  >>>
  >>> I was able to reproduce the hang on my own Linux system
  with my own

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread George Bosilca via devel
-3.1]
>|+opal_progress_events@opal_progress.c#191
>| +opal_libevent2022_event_base_loop@event.c#1630
>|  +poll_dispatch@poll.c#165
>|   +__poll_nocancel
>+orterun : 1:1[p1.1]
> +opal_libevent2022_event_base_loop@event.c#1630
>  +poll_dispatch@poll.c#165
>   +__poll_nocancel
>
> d1.<>
>
>
> On 11/12/19 9:47 AM, Austen W Lauria via devel wrote:
>
> Could it be that some processes are not seeing the flag get updated? I
> don't think just using a simple while loop with a volatile variable is
> sufficient in all cases in a multi-threaded environment. It's my
> understanding that the volatile keyword just tells the compiler to not
> optimize or do anything funky with it - because it can change at any time.
> However, this doesn't provide any memory barrier - so it's possible that
> the thread polling on this variable is never seeing the update.
>
> Looking at the code - I see:
>
> #define OMPI_LAZY_WAIT_FOR_COMPLETION(flg) \
> do { \
> opal_output_verbose(1, ompi_rte_base_framework.framework_output, \
> "%s lazy waiting on RTE event at %s:%d", \
> OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
> __FILE__, __LINE__); \
> while ((flg)) { \
> opal_progress(); \
> usleep(100); \
> } \
> }while(0);
>
> I think replacing that with:
>
> #define OMPI_LAZY_WAIT_FOR_COMPLETION(flg, cond, lock) \
> do { \
>     opal_output_verbose(1, ompi_rte_base_framework.framework_output, \
>                         "%s lazy waiting on RTE event at %s:%d", \
>                         OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
>                         __FILE__, __LINE__); \
>     pthread_mutex_lock(&(lock)); \
>     while ((flg)) { \
>         /* releases the lock while waiting for a signal from another thread */ \
>         pthread_cond_wait(&(cond), &(lock)); \
>     } \
>     pthread_mutex_unlock(&(lock)); \
> } while(0);
>
> Is much more standard when dealing with threads updating a shared variable
> - and might lead to a more expected result in this case.
>
> On the other end, this would require the thread updating this variable to:
>
> pthread_mutex_lock(&lock);
> flg = new_val;
> pthread_cond_signal(&cond);
> pthread_mutex_unlock(&lock);
>
> This provides the memory barrier for the thread polling on the flag to see
> the update - something the volatile keyword doesn't do on its own. I think
> it's also much cleaner as it eliminates an arbitrary sleep from the code -
> which I see as a good thing as well.
>
>
>
> From: "Ralph Castain via devel" 
> 
> To: "OpenMPI Devel"  
> Cc: "Ralph Castain"  
> Date: 11/12/2019 09:24 AM
> Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging
> inside MPI_Init() when debugged with TotalView
> Sent by: "devel" 
> 
> --
>
>
>
>
>
> > On Nov 11, 2019, at 4:53 PM, Gilles Gouaillardet via devel
>   wrote:
> >
> > John,
> >
> > OMPI_LAZY_WAIT_FOR_COMPLETION(active)
> >
> >
> > is a simple loop that periodically checks the (volatile) "active"
> condition, that is expected to be updated by an other thread.
> > So if you set your breakpoint too early, and **all** threads are stopped
> when this breakpoint is hit, you might experience
> > what looks like a race condition.
> > I guess a similar scenario can occur if the breakpoint is set in
> mpirun/orted too early, and prevents the pmix (or oob/tcp) thread
> > from sending the message to all MPI tasks)
> >
> >
> >
> > Ralph,
> >
> > does the v4.0.x branch still need the oob/tcp progress thread running
> inside the MPI app?
> > or are we missing some commits (since all interactions with mpirun/orted
> are handled by PMIx, at least in the master branch) ?
>
> IIRC, that progress thread only runs if explicitly asked to do so by MCA
> param. We don't need that code any more as PMIx takes care of it.
>
> >
> > Cheers,
> >
> > Gilles
> >
> > On 11/12/2019 9:27 AM, Ralph Castain via devel wrote:
> >> Hi John
> >>
> >> Sorry to say, but there is no way to really answer your question as the
> OMPI community doesn't actively test MPIR support. I haven't seen any
> reports of hangs during MPI_Init from any release series, including 4.x. My
> guess is that it may have something to do with the debugger interactions as
> opposed to being a true race condition.
> >>
> 

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread Austen W Lauria via devel

I think you are hitting this issue here in 4.0.1:

https://github.com/open-mpi/ompi/issues/6613

MPIR was broken in 4.0.1 due to a race condition in PMIx. It was patched,
it looks to me, for 4.0.2. Here is the openpmix issue:
https://github.com/openpmix/openpmix/issues/1189

I think this lines up - 4.0.2 should be good with a fix.



From:   John DelSignore 
To: Open MPI Developers 
Cc: Austen W Lauria , devel

Date:   11/12/2019 02:25 PM
Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView



Hi Austen,


Thanks for the reply. What I am seeing is consistent with your thought, in
that when I see the hang, one or more processes did not have a flag
updated. I don't understand how the Open MPI code works well enough to say
if it is a memory barrier problem or not. It almost looks like an event
delivery or dropped event problem to me.


The place in the MPI_Init() code where the MPI processes hang and the
number of "hung" processes seem to vary from run to run. In some cases the
processes are waiting for an event or waiting for a fence (whatever that
is).


I did the following run today, which shows that it can hang waiting for an
event that apparently was not generated or was dropped:
   1.   Started TV on mpirun: totalview -args mpirun -np 4 ./cpi
   2.   Ran the mpirun process until it hit the MPIR_Breakpoint() event.
   3.   TV attached to all four of the MPI processes and left all five
  processes stopped.
   4.   Continued all of the processes/threads and let them run freely for
  about 60 seconds. They should have run to completion in that amount
  of time.
   5.   Halted all of the processes. I included an aggregated backtrace of
  all of the processes below.


In this particular run, all four MPI processes were waiting in
ompi_rte_wait_for_debugger() in rte_orte_module.c at line 196, which is:


/* let the MPI progress engine run while we wait for debugger
release */
OMPI_WAIT_FOR_COMPLETION(debugger_event_active);


I don't know how that is supposed to work, but I can clearly see that
debugger_event_active was true in all of the processes, even though TV set
MPIR_debug_gate to 1:


  d1.<> f {2.1 3.1 4.1 5.1} p debugger_event_active
  Thread 2.1:
   debugger_event_active = true (1)
  Thread 3.1:
   debugger_event_active = true (1)
  Thread 4.1:
   debugger_event_active = true (1)
  Thread 5.1:
   debugger_event_active = true (1)
  d1.<> f {2.1 3.1 4.1 5.1} p MPIR_debug_gate
  Thread 2.1:
   MPIR_debug_gate = 0x0001 (1)
  Thread 3.1:
   MPIR_debug_gate = 0x0001 (1)
  Thread 4.1:
   MPIR_debug_gate = 0x0001 (1)
  Thread 5.1:
   MPIR_debug_gate = 0x0001 (1)
  d1.<>


I think the _release_fn() function in rte_orte_module.c is supposed to set
debugger_event_active to false, but that apparently did not happen in this
case. So, AFAICT, the reason debugger_event_active would not be set to
false is that the event was never delivered, so the _release_fn() function
was never called. If that's the case, then the lack of a memory barrier is
probably a moot point, and the problem is likely related to event
generation or dropped events.


Cheers, John D.


FWIW: Here's the aggregated backtrace after the whole job was allowed to
run freely for about 60 seconds, and then stopped:


  d1.<> f g w -g f+l


  +/
   +__clone : 5:12[0-3.2-3, p1.2-5]
   |+start_thread
   | +listen_thread : 1:2[p1.3, p1.5]
   | |+__select_nocancel
   | +progress_engine@opal_progress_threads.c#105 : 4:4[0-3.2]
   | |+opal_libevent2022_event_base_loop@event.c#1630
   | | +poll_dispatch@poll.c#165
   | |  +__poll_nocancel
   | +progress_engine@pmix_progress_threads.c#109 : 4:4[0-3.3]
   | |+opal_libevent2022_event_base_loop@event.c#1630
   | | +epoll_dispatch@epoll.c#407
   | |  +__epoll_wait_nocancel
   | +progress_engine : 1:2[p1.2, p1.4]
   |  +opal_libevent2022_event_base_loop@event.c#1630
   |   +epoll_dispatch@epoll.c#407 : 1:1[p1.2]
   |   |+__epoll_wait_nocancel
   |   +poll_dispatch@poll.c#165 : 1:1[p1.4]
   |+__poll_nocancel
   +_start : 5:5[0-3.1, p1.1]
+__libc_start_main
 +main@cpi.c#27 : 4:4[0-3.1]
 |+PMPI_Init@pinit.c#67
 | +ompi_mpi_init@ompi_mpi_init.c#890
 |  +ompi_rte_wait_for_debugger@rte_orte_module.c#196
 |   +opal_progress@opal_progress.c#245 : 1:1[0.1]
 |   |+opal_progress_events@opal_progress.c#191
 |   | +opal_libevent2022_event_base_loop@event.c#1630
 |   |  +poll_dispatch@poll.c#165
 |   |   +__poll_nocancel
 |   +opal_progress@opal_progress.c#247 : 3:3[1-3.1]
 |+opal_progress_events@opal_progress.c#191
 | +opal_libevent2022_event_base_loop@event.c#1630
   

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread Austen W Lauria via devel

Yes that was an omission on my part.

Regarding volatile being sufficient - I don't think that is the case in all
situations. It might work under most conditions - but it can lead to the
"it works on my machine..." type of bugs. In particular it doesn't
guarantee that the waiting thread will ever see the updated value.

This is easy enough to see - can a print be added, or a breakpoint set, on the
line where it is updated, to confirm that it is in fact being set for all ranks
under the user's conditions?
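
If the concern is specifically visibility (rather than getting rid of the polling loop), one middle ground would be a C11 atomic flag. A rough sketch of that shape, keeping the opal_progress/usleep loop that Ralph points out we need, is below (my own illustration, not a proposed patch; release_event() and lazy_wait_for_completion() are invented names):

#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Sketch: the same lazy-polling structure as OMPI_LAZY_WAIT_FOR_COMPLETION,
 * but with a C11 atomic instead of a plain volatile, so the store performed
 * in the release callback is guaranteed to become visible to the polling
 * thread.
 */
static atomic_bool event_active = true;   /* stands in for debugger_event_active */

/* Called from the event callback (the _release_fn() role). */
void release_event(void)
{
    atomic_store_explicit(&event_active, false, memory_order_release);
}

/* Called from the thread waiting in MPI_Init(). */
void lazy_wait_for_completion(void (*progress)(void))
{
    while (atomic_load_explicit(&event_active, memory_order_acquire)) {
        progress();       /* keep driving opal_progress() so the event can arrive */
        usleep(100);      /* avoid burning 100% CPU, as in the existing macro */
    }
}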



From:   "Ralph Castain via devel" 
To: "OpenMPI Devel" 
Cc: "Ralph Castain" 
Date:   11/12/2019 01:28 PM
Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView
Sent by: "devel" 



Just to be clear as well: you cannot use the pthread method you propose
because you must loop over opal_progress - the "usleep" is in there simply
to avoid consuming 100% cpu while we wait.


  On Nov 12, 2019, at 8:52 AM, George Bosilca via devel <
  devel@lists.open-mpi.org> wrote:

  I don't think there is any need for protection around that variable. It
  will change value only once (in a callback triggered from
  opal_progress), and the volatile guarantees that loads will be issued
  for every access, so the waiting thread will eventually notice the
  change.

   George.


  On Tue, Nov 12, 2019 at 9:48 AM Austen W Lauria via devel <
  devel@lists.open-mpi.org> wrote:
Could it be that some processes are not seeing the flag get
updated? I don't think just using a simple while loop with a
volatile variable is sufficient in all cases in a multi-threaded
environment. It's my understanding that the volatile keyword just
tells the compiler to not optimize or do anything funky with it -
because it can change at any time. However, this doesn't provide
any memory barrier - so it's possible that the thread polling on
this variable is never seeing the update.

Looking at the code - I see:

#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg) \
do { \
opal_output_verbose(1, ompi_rte_base_framework.framework_output, \
"%s lazy waiting on RTE event at %s:%d", \
OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
__FILE__, __LINE__); \
while ((flg)) { \
opal_progress(); \
usleep(100); \
} \
}while(0);

I think replacing that with:

#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg, cond, lock) \
do { \
    opal_output_verbose(1, ompi_rte_base_framework.framework_output, \
                        "%s lazy waiting on RTE event at %s:%d", \
                        OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
                        __FILE__, __LINE__); \
    pthread_mutex_lock(&(lock)); \
    while ((flg)) { \
        /* releases the lock while waiting for a signal from another thread */ \
        pthread_cond_wait(&(cond), &(lock)); \
    } \
    pthread_mutex_unlock(&(lock)); \
} while(0);

Is much more standard when dealing with threads updating a shared
variable - and might lead to a more expected result in this case.

On the other end, this would require the thread updating this
variable to:

pthread_mutex_lock(&lock);
flg = new_val;
pthread_cond_signal(&cond);
pthread_mutex_unlock(&lock);

This provides the memory barrier for the thread polling on the flag
to see the update - something the volatile keyword doesn't do on
its own. I think it's also much cleaner as it eliminates an
arbitrary sleep from the code - which I see as a good thing as
well.


From: "Ralph Castain via devel" 
    To: "OpenMPI Devel" 
Cc: "Ralph Castain" 
    Date: 11/12/2019 09:24 AM
Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is
hanging inside MPI_Init() when debugged with TotalView
Sent by: "devel" 





> On Nov 11, 2019, at 4:53 PM, Gilles Gouaillardet via devel <
devel@lists.open-mpi.org> wrote:
>
> John,
>
> OMPI_LAZY_WAIT_FOR_COMPLETION(active)
>
>
> is a simple loop that periodically checks the (volatile) "active"
condition, that is expected to be updated by an other thread.
> So if you set your breakpoint too early, and **all** threads are
stopped when this breakpoint is hit, you might experience
> what looks like a race condition.
> I guess a similar scenario can occur if the breakpoint is se

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread John DelSignore via devel
e(1, ompi_rte_base_framework.framework_output, \
"%s lazy waiting on RTE event at %s:%d", \
OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
__FILE__, __LINE__); \
while ((flg)) { \
opal_progress(); \
usleep(100); \
} \
}while(0);

I think replacing that with:

#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg, cond, lock) \
do { \
    opal_output_verbose(1, ompi_rte_base_framework.framework_output, \
                        "%s lazy waiting on RTE event at %s:%d", \
                        OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
                        __FILE__, __LINE__); \
    pthread_mutex_lock(&(lock)); \
    while ((flg)) { \
        /* releases the lock while waiting for a signal from another thread */ \
        pthread_cond_wait(&(cond), &(lock)); \
    } \
    pthread_mutex_unlock(&(lock)); \
} while(0);

Is much more standard when dealing with threads updating a shared variable - 
and might lead to a more expected result in this case.

On the other end, this would require the thread updating this variable to:

pthread_mutex_lock(&lock);
flg = new_val;
pthread_cond_signal(&cond);
pthread_mutex_unlock(&lock);

This provides the memory barrier for the thread polling on the flag to see the 
update - something the volatile keyword doesn't do on its own. I think it's 
also much cleaner as it eliminates an arbitrary sleep from the code - which I 
see as a good thing as well.



From: "Ralph Castain via devel" <devel@lists.open-mpi.org>
To: "OpenMPI Devel" <devel@lists.open-mpi.org>
Cc: "Ralph Castain" <r...@open-mpi.org>
Date: 11/12/2019 09:24 AM
Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView
Sent by: "devel" <devel-boun...@lists.open-mpi.org>







> On Nov 11, 2019, at 4:53 PM, Gilles Gouaillardet via devel <devel@lists.open-mpi.org> wrote:
>
> John,
>
> OMPI_LAZY_WAIT_FOR_COMPLETION(active)
>
>
> is a simple loop that periodically checks the (volatile) "active" condition, 
> that is expected to be updated by an other thread.
> So if you set your breakpoint too early, and **all** threads are stopped when 
> this breakpoint is hit, you might experience
> what looks like a race condition.
> I guess a similar scenario can occur if the breakpoint is set in mpirun/orted 
> too early, and prevents the pmix (or oob/tcp) thread
> from sending the message to all MPI tasks)
>
>
>
> Ralph,
>
> does the v4.0.x branch still need the oob/tcp progress thread running inside 
> the MPI app?
> or are we missing some commits (since all interactions with mpirun/orted are 
> handled by PMIx, at least in the master branch) ?

IIRC, that progress thread only runs if explicitly asked to do so by MCA param. 
We don't need that code any more as PMIx takes care of it.

>
> Cheers,
>
> Gilles
>
> On 11/12/2019 9:27 AM, Ralph Castain via devel wrote:
>> Hi John
>>
>> Sorry to say, but there is no way to really answer your question as the OMPI 
>> community doesn't actively test MPIR support. I haven't seen any reports of 
>> hangs during MPI_Init from any release series, including 4.x. My guess is 
>> that it may have something to do with the debugger interactions as opposed 
>> to being a true race condition.
>>
>> Ralph
>>
>>
>>> On Nov 8, 2019, at 11:27 AM, John DelSignore via devel <devel@lists.open-mpi.org> wrote:
>>>
>>> Hi,
>>>
>>> An LLNL TotalView user on a Mac reported that their MPI job was hanging 
>>> inside MPI_Init() when started under the control of TotalView. They were 
>>> using Open MPI 4.0.1, and TotalView was using the MPIR Interface (sorry, we 
>>> don't support the PMIx debugging hooks yet).
>>>
>>> I was able to reproduce the hang on my own Linux system with my own build 
>>> of Open MPI 4.0.1, which I built with debug symbols. As far as I can tell, 
>>> there is some sort of race inside of Open MPI 4.0.1, because if I placed 
>>> breakpoints at certain points in the Open MPI code, and thus change the 
>>> timing slightly, that was enough to avoid the hang.
>>>
>>> When the code hangs, it appeared as if one or more MPI processes are 
>>> waiting inside ompi_mpi_init() at line ompi_mpi_init.c#904 for a fence to 
>>> be released. In one of the runs, rank 0 was the only one the was hanging 

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread Ralph Castain via devel
Just to be clear as well: you cannot use the pthread method you propose because 
you must loop over opal_progress - the "usleep" is in there simply to avoid 
consuming 100% cpu while we wait.
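
If someone really wanted the condition-variable form, the only shape I can see that satisfies both constraints (keep driving opal_progress, don't spin at 100% cpu) is a timed wait, roughly like the sketch below - just an illustration of the idea, not something I'm proposing for the tree; wait_for_release() and the progress callback parameter are invented for the example:

#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/*
 * Sketch: a condition-variable wait that still drives progress. The waiter
 * sleeps at most ~1 ms at a time; after every wakeup (signal or timeout) it
 * drops the lock and calls the progress function, so the release callback
 * that flips the flag can actually run and take the lock itself.
 */
void wait_for_release(bool *event_active, pthread_mutex_t *lock,
                      pthread_cond_t *cond, void (*progress)(void))
{
    pthread_mutex_lock(lock);
    while (*event_active) {
        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_nsec += 1000000;              /* wake up within about 1 ms */
        if (deadline.tv_nsec >= 1000000000) {
            deadline.tv_sec += 1;
            deadline.tv_nsec -= 1000000000;
        }
        pthread_cond_timedwait(cond, lock, &deadline);

        pthread_mutex_unlock(lock);
        progress();                               /* e.g. opal_progress() */
        pthread_mutex_lock(lock);
    }
    pthread_mutex_unlock(lock);
}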


On Nov 12, 2019, at 8:52 AM, George Bosilca via devel <devel@lists.open-mpi.org> wrote:

I don't think there is any need for protection around that variable. It will 
change value only once (in a callback triggered from opal_progress), and the 
volatile guarantees that loads will be issued for every access, so the waiting 
thread will eventually notice the change.

 George.


On Tue, Nov 12, 2019 at 9:48 AM Austen W Lauria via devel <devel@lists.open-mpi.org> wrote:
Could it be that some processes are not seeing the flag get updated? I don't 
think just using a simple while loop with a volatile variable is sufficient in 
all cases in a multi-threaded environment. It's my understanding that the 
volatile keyword just tells the compiler to not optimize or do anything funky 
with it - because it can change at any time. However, this doesn't provide any 
memory barrier - so it's possible that the thread polling on this variable is 
never seeing the update.

Looking at the code - I see:

#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg) \
 do { \
 opal_output_verbose(1, ompi_rte_base_framework.framework_output, \
 "%s lazy waiting on RTE event at %s:%d", \
 OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
 __FILE__, __LINE__); \
 while ((flg)) { \
 opal_progress(); \
 usleep(100); \
 } \
 }while(0);

I think replacing that with:

#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg, cond, lock) \
do { \
    opal_output_verbose(1, ompi_rte_base_framework.framework_output, \
                        "%s lazy waiting on RTE event at %s:%d", \
                        OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
                        __FILE__, __LINE__); \
    pthread_mutex_lock(&(lock)); \
    while ((flg)) { \
        /* releases the lock while waiting for a signal from another thread */ \
        pthread_cond_wait(&(cond), &(lock)); \
    } \
    pthread_mutex_unlock(&(lock)); \
} while(0);

This is much more standard when dealing with threads updating a shared variable, 
and might lead to a more predictable result in this case.

On the other end, this would require the thread updating this variable to:

pthread_mutex_lock(&lock);
flg = new_val;
pthread_cond_signal(&cond);
pthread_mutex_unlock(&lock);

This provides the memory barrier for the thread polling on the flag to see the 
update - something the volatile keyword doesn't do on its own. I think it's 
also much cleaner as it eliminates an arbitrary sleep from the code - which I 
see as a good thing as well.
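
If the concern is specifically the lack of ordering guarantees rather than the wake-up mechanism, a lock-free variant is also possible: a C11 atomic flag provides the acquire/release semantics that volatile alone does not, while preserving the existing opal_progress()/usleep() loop shape the runtime requires. A hedged sketch, not Open MPI code; event_active and do_progress are illustrative names.

#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

static atomic_bool event_active = true;       /* illustrative stand-in for the flag */

/* Waiter: same shape as OMPI_LAZY_WAIT_FOR_COMPLETION, but the load carries
 * acquire ordering, so the store made by the callback is guaranteed visible. */
static void lazy_wait_for_completion(void (*do_progress)(void))
{
    while (atomic_load_explicit(&event_active, memory_order_acquire)) {
        do_progress();                        /* stands in for opal_progress() */
        usleep(100);
    }
}

/* Release side, e.g. an event callback run from the progress engine. */
static void release_event(void)
{
    atomic_store_explicit(&event_active, false, memory_order_release);
}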


"Ralph Castain via devel" ---11/12/2019 09:24:23 AM---> On Nov 11, 
2019, at 4:53 PM, Gilles Gouaillardet via devel mailto:devel@lists.open-mpi.org> > wrote: >

From: "Ralph Castain via devel" mailto:devel@lists.open-mpi.org> >
To: "OpenMPI Devel" mailto:devel@lists.open-mpi.org> 
>
Cc: "Ralph Castain" mailto:r...@open-mpi.org> >
Date: 11/12/2019 09:24 AM
Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside 
MPI_Init() when debugged with TotalView
Sent by: "devel" mailto:devel-boun...@lists.open-mpi.org> >







> On Nov 11, 2019, at 4:53 PM, Gilles Gouaillardet via devel 
> <devel@lists.open-mpi.org> wrote:
> 
> John,
> 
> OMPI_LAZY_WAIT_FOR_COMPLETION(active)
> 
> 
> is a simple loop that periodically checks the (volatile) "active" condition, 
> that is expected to be updated by another thread.
> So if you set your breakpoint too early, and **all** threads are stopped when 
> this breakpoint is hit, you might experience
> what looks like a race condition.
> I guess a similar scenario can occur if the breakpoint is set in mpirun/orted 
> too early, and prevents the pmix (or oob/tcp) thread
> from sending the message to all MPI tasks.
> 
> 
> 
> Ralph,
> 
> does the v4.0.x branch still need the oob/tcp progress thread running inside 
> the MPI app?
> or are we missing some commits (since all interactions with mpirun/orted are 
> handled by PMIx, at least in the master branch) ?

IIRC, that progress thread only runs if explicitly asked to do so by MCA param. 
We don't need that code any more as PMIx takes care of it.

> 
> Cheers,
> 
> Gilles
> 
> On 11/12/2019 9:27 AM, Ralph Castain via devel wrote:
>> Hi John
>> 
>> Sorry to say, but there is no way to really answer your question as the OMPI 
>> community doesn't actively test MPIR support. I haven't seen any reports of 
>> hangs during MPI_Init from any release series, including 4.x. My guess is 
>> that it may have something to do with the debugger interactions as opposed 
>> to being a true race condition.
>> 
>> Ralph
>> 
>> 
>>> On Nov 8, 2019, at 11:27 AM, John DelSignore via devel 
>

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread George Bosilca via devel
I don't think there is a need for any protection around that variable. It will
change value only once (in a callback triggered from opal_progress), and
the volatile guarantees that loads will be issued for every access, so the
waiting thread will eventually notice the change.

 George.


On Tue, Nov 12, 2019 at 9:48 AM Austen W Lauria via devel <
devel@lists.open-mpi.org> wrote:

> Could it be that some processes are not seeing the flag get updated? I
> don't think just using a simple while loop with a volatile variable is
> sufficient in all cases in a multi-threaded environment. It's my
> understanding that the volatile keyword just tells the compiler to not
> optimize or do anything funky with it - because it can change at any time.
> However, this doesn't provide any memory barrier - so it's possible that
> the thread polling on this variable is never seeing the update.
>
> Looking at the code - I see:
>
> #define OMPI_LAZY_WAIT_FOR_COMPLETION(flg) \
> do { \
> opal_output_verbose(1, ompi_rte_base_framework.framework_output, \
> "%s lazy waiting on RTE event at %s:%d", \
> OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
> __FILE__, __LINE__); \
> while ((flg)) { \
> opal_progress(); \
> usleep(100); \
> } \
> }while(0);
>
> I think replacing that with:
>
> #define OMPI_LAZY_WAIT_FOR_COMPLETION(flg, cond, lock) \
> do { \
> opal_output_verbose(1, ompi_rte_base_framework.framework_output, \
> "%s lazy waiting on RTE event at %s:%d", \
> OMPI_NAME_PRINT(OMPI_PROC_MY_NAME), \
> __FILE__, __LINE__); \
>
> pthread_mutex_lock(&lock); \
> while ((flg)) { \
>     /* releases the lock while waiting for a signal from another thread to wake up */ \
>     pthread_cond_wait(&cond, &lock); \
> } \
> pthread_mutex_unlock(&lock); \
>
> }while(0);
>
> This is much more standard when dealing with threads updating a shared variable,
> and might lead to a more predictable result in this case.
>
> On the other end, this would require the thread updating this variable to:
>
> pthread_mutex_lock(&lock);
> flg = new_val;
> pthread_cond_signal(&cond);
> pthread_mutex_unlock(&lock);
>
> This provides the memory barrier for the thread polling on the flag to see
> the update - something the volatile keyword doesn't do on its own. I think
> it's also much cleaner as it eliminates an arbitrary sleep from the code -
> which I see as a good thing as well.
>
>
>
> From: "Ralph Castain via devel" 
> To: "OpenMPI Devel" 
> Cc: "Ralph Castain" 
> Date: 11/12/2019 09:24 AM
> Subject: [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging
> inside MPI_Init() when debugged with TotalView
> Sent by: "devel" 
> --
>
>
>
>
>
> > On Nov 11, 2019, at 4:53 PM, Gilles Gouaillardet via devel <
> devel@lists.open-mpi.org> wrote:
> >
> > John,
> >
> > OMPI_LAZY_WAIT_FOR_COMPLETION(active)
> >
> >
> > is a simple loop that periodically checks the (volatile) "active"
> condition, that is expected to be updated by another thread.
> > So if you set your breakpoint too early, and **all** threads are stopped
> when this breakpoint is hit, you might experience
> > what looks like a race condition.
> > I guess a similar scenario can occur if the breakpoint is set in
> mpirun/orted too early, and prevents the pmix (or oob/tcp) thread
> > from sending the message to all MPI tasks.
> >
> >
> >
> > Ralph,
> >
> > does the v4.0.x branch still need the oob/tcp progress thread running
> inside the MPI app?
> > or are we missing some commits (since all interactions with mpirun/orted
> are handled by PMIx, at least in the master branch) ?
>
> IIRC, that progress thread only runs if explicitly asked to do so by MCA
> param. We don't need that code any more as PMIx takes care of it.
>
> >
> > Cheers,
> >
> > Gilles
> >
> > On 11/12/2019 9:27 AM, Ralph Castain via devel wrote:
> >> Hi John
> >>
> >> Sorry to say, but there is no way to really answer your question as the
> OMPI community doesn't actively test MPIR support. I haven't seen any
> reports of hangs during MPI_Init from any release series, including 4.x. My
> guess is that it may have something to do with the debugger interactions as
> opposed to being a true race condition.
> >>
> >> Ralph
> >>
> >>

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread Austen W Lauria via devel

Could it be that some processes are not seeing the flag get updated? I
don't think just using a simple while loop with a volatile variable is
sufficient in all cases in a multi-threaded environment. It's my
understanding that the volatile keyword just tells the compiler to not
optimize or do anything funky with it - because it can change at any time.
However, this doesn't provide any memory barrier - so it's possible that
the thread polling on this variable is never seeing the update.

Looking at the code - I see:

#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg)                                \
    do {                                                                  \
        opal_output_verbose(1, ompi_rte_base_framework.framework_output,  \
                            "%s lazy waiting on RTE event at %s:%d",      \
                            OMPI_NAME_PRINT(OMPI_PROC_MY_NAME),           \
                            __FILE__, __LINE__);                          \
        while ((flg)) {                                                   \
            opal_progress();                                              \
            usleep(100);                                                  \
        }                                                                 \
    }while(0);

I think replacing that with:

#define OMPI_LAZY_WAIT_FOR_COMPLETION(flg, cond, lock)                    \
    do {                                                                  \
        opal_output_verbose(1, ompi_rte_base_framework.framework_output,  \
                            "%s lazy waiting on RTE event at %s:%d",      \
                            OMPI_NAME_PRINT(OMPI_PROC_MY_NAME),           \
                            __FILE__, __LINE__);                          \
        pthread_mutex_lock(&lock);                                        \
        while ((flg)) {                                                   \
            /* releases the lock while waiting for a signal from          \
             * another thread to wake up */                               \
            pthread_cond_wait(&cond, &lock);                              \
        }                                                                 \
        pthread_mutex_unlock(&lock);                                      \
    }while(0);

This is much more standard when dealing with threads updating a shared variable,
and might lead to a more predictable result in this case.

On the other end, this would require the thread updating this variable to:

pthread_mutex_lock(&lock);
flg = new_val;
pthread_cond_signal(&cond);
pthread_mutex_unlock(&lock);

This provides the memory barrier for the thread polling on the flag to see
the update - something the volatile keyword doesn't do on its own. I think
it's also much cleaner as it eliminates an arbitrary sleep from the code -
which I see as a good thing as well.
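
For completeness, a minimal self-contained program (not Open MPI code) demonstrating the pattern described above: one thread waits on the flag under a mutex and condition variable, the other clears the flag and signals. The file name flag_demo.c is arbitrary; compile with: cc flag_demo.c -o flag_demo -lpthread

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool flg = true;                   /* "event still pending" */

static void *releaser(void *arg)
{
    (void) arg;
    sleep(1);                             /* stand-in for the RTE/PMIx callback firing */
    pthread_mutex_lock(&lock);
    flg = false;                          /* update the shared flag under the lock */
    pthread_cond_signal(&cond);           /* wake the waiter; the lock/unlock pair
                                             also provides the memory barrier */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, releaser, NULL);

    pthread_mutex_lock(&lock);
    while (flg) {
        /* releases the lock while waiting; reacquires it before returning */
        pthread_cond_wait(&cond, &lock);
    }
    pthread_mutex_unlock(&lock);

    printf("flag cleared, waiter released\n");
    pthread_join(t, NULL);
    return 0;
}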




From:   "Ralph Castain via devel" 
To: "OpenMPI Devel" 
Cc: "Ralph Castain" 
Date:   11/12/2019 09:24 AM
Subject:    [EXTERNAL] Re: [OMPI devel] Open MPI v4.0.1: Process is hanging
inside MPI_Init() when debugged with TotalView
Sent by:"devel" 





> On Nov 11, 2019, at 4:53 PM, Gilles Gouaillardet via devel
 wrote:
>
> John,
>
> OMPI_LAZY_WAIT_FOR_COMPLETION(active)
>
>
> is a simple loop that periodically checks the (volatile) "active"
condition, that is expected to be updated by another thread.
> So if you set your breakpoint too early, and **all** threads are stopped
when this breakpoint is hit, you might experience
> what looks like a race condition.
> I guess a similar scenario can occur if the breakpoint is set in
mpirun/orted too early, and prevents the pmix (or oob/tcp) thread
> from sending the message to all MPI tasks.
>
>
>
> Ralph,
>
> does the v4.0.x branch still need the oob/tcp progress thread running
inside the MPI app?
> or are we missing some commits (since all interactions with mpirun/orted
are handled by PMIx, at least in the master branch) ?

IIRC, that progress thread only runs if explicitly asked to do so by MCA
param. We don't need that code any more as PMIx takes care of it.

>
> Cheers,
>
> Gilles
>
> On 11/12/2019 9:27 AM, Ralph Castain via devel wrote:
>> Hi John
>>
>> Sorry to say, but there is no way to really answer your question as the
OMPI community doesn't actively test MPIR support. I haven't seen any
reports of hangs during MPI_Init from any release series, including 4.x. My
guess is that it may have something to do with the debugger interactions as
opposed to being a true race condition.
>>
>> Ralph
>>
>>
>>> On Nov 8, 2019, at 11:27 AM, John DelSignore via devel
<devel@lists.open-mpi.org> wrote:
>>>
>>> Hi,
>>>
>>> An LLNL TotalView user on a Mac reported that their MPI job was hanging
inside MPI_Init() when started under the control of TotalView. They were
using Open MPI 4.0.1, and TotalView was using the MPIR Interface (sorry, we
don't support the PMIx debugging hooks yet).
>>>
>>> I was able to reproduce the hang on my own Linux system with my own
build of Open MPI 4.0.1, which I built with debug symbols. As far as I can
tell, there is some sort of race inside of Open MPI 4.0.1, because if I
placed breakpoints at certain points in the Open MPI code, and thus change
the timing slightly, that was enough to avoid the hang.
>>>
>>>

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread Ralph Castain via devel


> On Nov 11, 2019, at 4:53 PM, Gilles Gouaillardet via devel 
>  wrote:
> 
> John,
> 
> OMPI_LAZY_WAIT_FOR_COMPLETION(active)
> 
> 
> is a simple loop that periodically checks the (volatile) "active" condition, 
> that is expected to be updated by another thread.
> So if you set your breakpoint too early, and **all** threads are stopped when 
> this breakpoint is hit, you might experience
> what looks like a race condition.
> I guess a similar scenario can occur if the breakpoint is set in mpirun/orted 
> too early, and prevents the pmix (or oob/tcp) thread
> from sending the message to all MPI tasks.
> 
> 
> 
> Ralph,
> 
> does the v4.0.x branch still need the oob/tcp progress thread running inside 
> the MPI app?
> or are we missing some commits (since all interactions with mpirun/orted are 
> handled by PMIx, at least in the master branch) ?

IIRC, that progress thread only runs if explicitly asked to do so by MCA param. 
We don't need that code any more as PMIx takes care of it.

> 
> Cheers,
> 
> Gilles
> 
> On 11/12/2019 9:27 AM, Ralph Castain via devel wrote:
>> Hi John
>> 
>> Sorry to say, but there is no way to really answer your question as the OMPI 
>> community doesn't actively test MPIR support. I haven't seen any reports of 
>> hangs during MPI_Init from any release series, including 4.x. My guess is 
>> that it may have something to do with the debugger interactions as opposed 
>> to being a true race condition.
>> 
>> Ralph
>> 
>> 
>>> On Nov 8, 2019, at 11:27 AM, John DelSignore via devel 
>>> <devel@lists.open-mpi.org> wrote:
>>> 
>>> Hi,
>>> 
>>> An LLNL TotalView user on a Mac reported that their MPI job was hanging 
>>> inside MPI_Init() when started under the control of TotalView. They were 
>>> using Open MPI 4.0.1, and TotalView was using the MPIR Interface (sorry, we 
>>> don't support the PMIx debugging hooks yet).
>>> 
>>> I was able to reproduce the hang on my own Linux system with my own build 
>>> of Open MPI 4.0.1, which I built with debug symbols. As far as I can tell, 
>>> there is some sort of race inside of Open MPI 4.0.1, because if I placed 
>>> breakpoints at certain points in the Open MPI code, and thus change the 
>>> timing slightly, that was enough to avoid the hang.
>>> 
>>> When the code hangs, it appeared as if one or more MPI processes are 
>>> waiting inside ompi_mpi_init() at line ompi_mpi_init.c#904 for a fence to 
>>> be released. In one of the runs, rank 0 was the only one that was hanging 
>>> there (though I have seen runs where two ranks were hung there).
>>> 
>>> Here's a backtrace of the first thread in the rank 0 process in the case 
>>> where one rank was hung:
>>> 
>>> d1.<> f 10.1 w
>>> >  0 __nanosleep_nocancel PC=0x774e2efd, FP=0x7fffd1e0 
>>> > [/lib64/libc.so.6]
>>>1 usleep PC=0x77513b2f, FP=0x7fffd200 [/lib64/libc.so.6]
>>>2 ompi_mpi_init PC=0x77a64009, FP=0x7fffd350 
>>> [/home/jdelsign/src/tools-external/openmpi-4.0.1/ompi/runtime/ompi_mpi_init.c#904]
>>>3 PMPI_Init PC=0x77ab0be4, FP=0x7fffd390 
>>> [/home/jdelsign/src/tools-external/openmpi-4.0.1-lid/ompi/mpi/c/profile/pinit.c#67]
>>>4 main PC=0x00400c5e, FP=0x7fffd550 
>>> [/home/jdelsign/cpi.c#27]
>>>5 __libc_start_main PC=0x77446b13, FP=0x7fffd610 
>>> [/lib64/libc.so.6]
>>>6 _start   PC=0x00400b04, FP=0x7fffd618 
>>> [/amd/home/jdelsign/cpi]
>>> 
>>> Here's the block of code where the thread is hung:
>>> 
>>> /* if we executed the above fence in the background, then
>>>  * we have to wait here for it to complete. However, there
>>>  * is no reason to do two barriers! */
>>> if (background_fence) {
>>> OMPI_LAZY_WAIT_FOR_COMPLETION(active);
>>> } else if (!ompi_async_mpi_init) {
>>> /* wait for everyone to reach this point - this is a hard
>>>  * barrier requirement at this time, though we hope to relax
>>>  * it at a later point */
>>> if (NULL != opal_pmix.fence_nb) {
>>> active = true;
>>> OPAL_POST_OBJECT();
>>> if (OMPI_SUCCESS != (ret = opal_pmix.fence_nb(NULL, false,
>>> fence_release, (void*)&active))) {
>>> error = "opal_pmix.fence_nb() failed";
>>> goto error;
>>> }
>>> OMPI_LAZY_WAIT_FOR_COMPLETION(active); *- STUCK HERE WAITING FOR 
>>> THE FENCE TO BE RELEASED*
>>> } else {
>>> if (OMPI_SUCCESS != (ret = opal_pmix.fence(NULL, false))) {
>>> error = "opal_pmix.fence() failed";
>>> goto error;
>>> }
>>> }
>>> }
>>> 
>>> And here is an aggregated backtrace of all of the processes and threads in 
>>> the job:
>>> 
>>> d1.<> f g w -g f+l
>>> +/
>>>  +__clone : 5:12[0-3.2-3, p1.2-5]
>>>  |+start_thread
>>>  | +listen_thread@oob_tcp_listener.c 
>>> #705 : 1:1[p1.5]
>>>  | |+__select_nocancel
>>>  | +listen_thread@ptl_base_listener.c 
>>> 

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-11 Thread Gilles Gouaillardet via devel

John,

OMPI_LAZY_WAIT_FOR_COMPLETION(active)


is a simple loop that periodically checks the (volatile) "active" 
condition, that is expected to be updated by another thread.
So if you set your breakpoint too early, and **all** threads are stopped 
when this breakpoint is hit, you might experience

what looks like a race condition.
I guess a similar scenario can occur if the breakpoint is set in 
mpirun/orted too early, and prevents the pmix (or oob/tcp) thread

from sending the message to all MPI tasks.



Ralph,

does the v4.0.x branch still need the oob/tcp progress thread running 
inside the MPI app?
or are we missing some commits (since all interactions with mpirun/orted 
are handled by PMIx, at least in the master branch) ?


Cheers,

Gilles

On 11/12/2019 9:27 AM, Ralph Castain via devel wrote:

Hi John

Sorry to say, but there is no way to really answer your question as 
the OMPI community doesn't actively test MPIR support. I haven't seen 
any reports of hangs during MPI_Init from any release series, 
including 4.x. My guess is that it may have something to do with the 
debugger interactions as opposed to being a true race condition.


Ralph


On Nov 8, 2019, at 11:27 AM, John DelSignore via devel 
<devel@lists.open-mpi.org> wrote:


Hi,

An LLNL TotalView user on a Mac reported that their MPI job was 
hanging inside MPI_Init() when started under the control of 
TotalView. They were using Open MPI 4.0.1, and TotalView was using 
the MPIR Interface (sorry, we don't support the PMIx debugging hooks 
yet).


I was able to reproduce the hang on my own Linux system with my own 
build of Open MPI 4.0.1, which I built with debug symbols. As far as 
I can tell, there is some sort of race inside of Open MPI 4.0.1, 
because if I placed breakpoints at certain points in the Open MPI 
code, and thus change the timing slightly, that was enough to avoid 
the hang.


When the code hangs, it appeared as if one or more MPI processes are 
waiting inside ompi_mpi_init() at line ompi_mpi_init.c#904 for a 
fence to be released. In one of the runs, rank 0 was the only one that 
was hanging there (though I have seen runs where two ranks were hung 
there).


Here's a backtrace of the first thread in the rank 0 process in the 
case where one rank was hung:


d1.<> f 10.1 w
>  0 __nanosleep_nocancel PC=0x774e2efd, FP=0x7fffd1e0 
[/lib64/libc.so.6]

   1 usleep PC=0x77513b2f, FP=0x7fffd200 [/lib64/libc.so.6]
   2 ompi_mpi_init PC=0x77a64009, FP=0x7fffd350 
[/home/jdelsign/src/tools-external/openmpi-4.0.1/ompi/runtime/ompi_mpi_init.c#904]
   3 PMPI_Init PC=0x77ab0be4, FP=0x7fffd390 
[/home/jdelsign/src/tools-external/openmpi-4.0.1-lid/ompi/mpi/c/profile/pinit.c#67]
   4 main PC=0x00400c5e, FP=0x7fffd550 
[/home/jdelsign/cpi.c#27]
   5 __libc_start_main PC=0x77446b13, FP=0x7fffd610 
[/lib64/libc.so.6]
   6 _start   PC=0x00400b04, FP=0x7fffd618 
[/amd/home/jdelsign/cpi]


Here's the block of code where the thread is hung:

    /* if we executed the above fence in the background, then
 * we have to wait here for it to complete. However, there
 * is no reason to do two barriers! */
    if (background_fence) {
OMPI_LAZY_WAIT_FOR_COMPLETION(active);
    } else if (!ompi_async_mpi_init) {
    /* wait for everyone to reach this point - this is a hard
 * barrier requirement at this time, though we hope to relax
 * it at a later point */
    if (NULL != opal_pmix.fence_nb) {
    active = true;
OPAL_POST_OBJECT();
    if (OMPI_SUCCESS != (ret = opal_pmix.fence_nb(NULL, false,
fence_release, (void*)&active))) {
    error = "opal_pmix.fence_nb() failed";
    goto error;
    }
OMPI_LAZY_WAIT_FOR_COMPLETION(active); *- STUCK HERE WAITING 
FOR THE FENCE TO BE RELEASED*

    } else {
    if (OMPI_SUCCESS != (ret = opal_pmix.fence(NULL, false))) {
    error = "opal_pmix.fence() failed";
    goto error;
    }
    }
    }

And here is an aggregated backtrace of all of the processes and 
threads in the job:


d1.<> f g w -g f+l
+/
 +__clone : 5:12[0-3.2-3, p1.2-5]
 |+start_thread
 | +listen_thread@oob_tcp_listener.c 
#705 : 1:1[p1.5]

 | |+__select_nocancel
 | +listen_thread@ptl_base_listener.c 
#214 : 1:1[p1.3]

 | |+__select_nocancel
 | +progress_engine@opal_progress_threads.c 
#105 : 5:5[0-3.2, p1.4]
 | |+opal_libevent2022_event_base_loop@event.c 
#1632

 | | +poll_dispatch@poll.c #167
 | |  +__poll_nocancel
 | +progress_engine@pmix_progress_threads.c 
#108 : 5:5[0-3.3, p1.2]
 |  +opal_libevent2022_event_base_loop@event.c 
#1632

 |   

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-11 Thread Ralph Castain via devel
Hi John

Sorry to say, but there is no way to really answer your question as the OMPI 
community doesn't actively test MPIR support. I haven't seen any reports of 
hangs during MPI_Init from any release series, including 4.x. My guess is that 
it may have something to do with the debugger interactions as opposed to being 
a true race condition.

Ralph


On Nov 8, 2019, at 11:27 AM, John DelSignore via devel 
<devel@lists.open-mpi.org> wrote:

Hi,

An LLNL TotalView user on a Mac reported that their MPI job was hanging inside 
MPI_Init() when started under the control of TotalView. They were using Open 
MPI 4.0.1, and TotalView was using the MPIR Interface (sorry, we don't support 
the PMIx debugging hooks yet).

I was able to reproduce the hang on my own Linux system with my own build of 
Open MPI 4.0.1, which I built with debug symbols. As far as I can tell, there 
is some sort of race inside of Open MPI 4.0.1, because if I placed breakpoints 
at certain points in the Open MPI code, and thus change the timing slightly, 
that was enough to avoid the hang.

When the code hangs, it appeared as if one or more MPI processes are waiting 
inside ompi_mpi_init() at line ompi_mpi_init.c#904 for a fence to be released. 
In one of the runs, rank 0 was the only one that was hanging there (though I 
have seen runs where two ranks were hung there).

Here's a backtrace of the first thread in the rank 0 process in the case where 
one rank was hung:

d1.<> f 10.1 w
>  0 __nanosleep_nocancel PC=0x774e2efd, FP=0x7fffd1e0 
>[/lib64/libc.so.6]
   1 usleep   PC=0x77513b2f, FP=0x7fffd200 [/lib64/libc.so.6]
   2 ompi_mpi_init    PC=0x77a64009, FP=0x7fffd350 
[/home/jdelsign/src/tools-external/openmpi-4.0.1/ompi/runtime/ompi_mpi_init.c#904]
   3 PMPI_Init    PC=0x77ab0be4, FP=0x7fffd390 
[/home/jdelsign/src/tools-external/openmpi-4.0.1-lid/ompi/mpi/c/profile/pinit.c#67]
   4 main PC=0x00400c5e, FP=0x7fffd550 [/home/jdelsign/cpi.c#27]
   5 __libc_start_main PC=0x77446b13, FP=0x7fffd610 [/lib64/libc.so.6]
   6 _start   PC=0x00400b04, FP=0x7fffd618 [/amd/home/jdelsign/cpi]

Here's the block of code where the thread is hung:

    /* if we executed the above fence in the background, then
 * we have to wait here for it to complete. However, there
 * is no reason to do two barriers! */
    if (background_fence) {
    OMPI_LAZY_WAIT_FOR_COMPLETION(active);
    } else if (!ompi_async_mpi_init) {
    /* wait for everyone to reach this point - this is a hard
 * barrier requirement at this time, though we hope to relax
 * it at a later point */
    if (NULL != opal_pmix.fence_nb) {
    active = true;
    OPAL_POST_OBJECT();
    if (OMPI_SUCCESS != (ret = opal_pmix.fence_nb(NULL, false,
   fence_release, (void*)&active))) {
    error = "opal_pmix.fence_nb() failed";
    goto error;
    }
    OMPI_LAZY_WAIT_FOR_COMPLETION(active);   - STUCK HERE 
WAITING FOR THE FENCE TO BE RELEASED
    } else {
    if (OMPI_SUCCESS != (ret = opal_pmix.fence(NULL, false))) {
    error = "opal_pmix.fence() failed";
    goto error;
    }
    }
    }

And here is an aggregated backtrace of all of the processes and threads in the 
job:

d1.<> f g w -g f+l
+/
 +__clone : 5:12[0-3.2-3, p1.2-5]
 |+start_thread
 | +listen_thread@oob_tcp_listener.c  
#705 : 1:1[p1.5]
 | |+__select_nocancel
 | +listen_thread@ptl_base_listener.c 
 #214 : 1:1[p1.3]
 | |+__select_nocancel
 | +progress_engine@opal_progress_threads.c 
 #105 : 5:5[0-3.2, p1.4]
 | |+opal_libevent2022_event_base_loop@event.c 
 #1632
 | | +poll_dispatch@poll.c  #167
 | |  +__poll_nocancel
 | +progress_engine@pmix_progress_threads.c 
 #108 : 5:5[0-3.3, p1.2]
 |  +opal_libevent2022_event_base_loop@event.c 
 #1632
 |   +epoll_dispatch@epoll.c  #409
 |    +__epoll_wait_nocancel
 +_start : 5:5[0-3.1, p1.1]
  +__libc_start_main
   +main@cpi.c  #27 : 4:4[0-3.1]
   |+PMPI_Init@pinit.c  #67
   | +ompi_mpi_init@ompi_mpi_init.c#890 : 3:3[1-3.1]   THE 3 OTHER MPI 
PROCS MADE IT PAST FENCE
   | |+ompi_rte_wait_for_debugger@rte_orte_module.c 
 #196
   | | +opal_progress@opal_progress.c  
#251
   | |  +opal_progress_events@opal_progress.c 
 #191
   | |   +opal_libevent2022_event_base_loop@event.c 
 

[OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-08 Thread John DelSignore via devel
Hi,

An LLNL TotalView user on a Mac reported that their MPI job was hanging inside 
MPI_Init() when started under the control of TotalView. They were using Open 
MPI 4.0.1, and TotalView was using the MPIR Interface (sorry, we don't support 
the PMIx debugging hooks yet).

I was able to reproduce the hang on my own Linux system with my own build of 
Open MPI 4.0.1, which I built with debug symbols. As far as I can tell, there 
is some sort of race inside of Open MPI 4.0.1, because if I placed breakpoints 
at certain points in the Open MPI code, and thus change the timing slightly, 
that was enough to avoid the hang.

When the code hangs, it appeared as if one or more MPI processes are waiting 
inside ompi_mpi_init() at line ompi_mpi_init.c#904 for a fence to be released. 
In one of the runs, rank 0 was the only one that was hanging there (though I 
have seen runs where two ranks were hung there).

Here's a backtrace of the first thread in the rank 0 process in the case where 
one rank was hung:

d1.<> f 10.1 w
>  0 __nanosleep_nocancel PC=0x774e2efd, FP=0x7fffd1e0 
> [/lib64/libc.so.6]
   1 usleep   PC=0x77513b2f, FP=0x7fffd200 [/lib64/libc.so.6]
   2 ompi_mpi_initPC=0x77a64009, FP=0x7fffd350 
[/home/jdelsign/src/tools-external/openmpi-4.0.1/ompi/runtime/ompi_mpi_init.c#904]
   3 PMPI_InitPC=0x77ab0be4, FP=0x7fffd390 
[/home/jdelsign/src/tools-external/openmpi-4.0.1-lid/ompi/mpi/c/profile/pinit.c#67]
   4 main PC=0x00400c5e, FP=0x7fffd550 [/home/jdelsign/cpi.c#27]
   5 __libc_start_main PC=0x77446b13, FP=0x7fffd610 [/lib64/libc.so.6]
   6 _start   PC=0x00400b04, FP=0x7fffd618 [/amd/home/jdelsign/cpi]

Here's the block of code where the thread is hung:

/* if we executed the above fence in the background, then
 * we have to wait here for it to complete. However, there
 * is no reason to do two barriers! */
if (background_fence) {
OMPI_LAZY_WAIT_FOR_COMPLETION(active);
} else if (!ompi_async_mpi_init) {
/* wait for everyone to reach this point - this is a hard
 * barrier requirement at this time, though we hope to relax
 * it at a later point */
if (NULL != opal_pmix.fence_nb) {
active = true;
OPAL_POST_OBJECT();
if (OMPI_SUCCESS != (ret = opal_pmix.fence_nb(NULL, false,
   fence_release, (void*)&active))) {
error = "opal_pmix.fence_nb() failed";
goto error;
}
OMPI_LAZY_WAIT_FOR_COMPLETION(active);   - STUCK HERE 
WAITING FOR THE FENCE TO BE RELEASED
} else {
if (OMPI_SUCCESS != (ret = opal_pmix.fence(NULL, false))) {
error = "opal_pmix.fence() failed";
goto error;
}
}
}
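
For reference, the completion callback handed to opal_pmix.fence_nb() above is expected to clear that flag roughly as in the following sketch (an approximation inferred from the code above, not a verbatim copy of the Open MPI 4.0.1 source):

/* When the PMIx fence completes, the progress engine invokes the callback with
 * the cbdata pointer, i.e. (void*)&active, and the flag is cleared so that
 * OMPI_LAZY_WAIT_FOR_COMPLETION(active) can fall out of its polling loop. */
static void fence_release(int status, void *cbdata)
{
    volatile bool *flag = (volatile bool *) cbdata;
    (void) status;
    *flag = false;
}

If that callback never runs - for example because the completion event is lost or never delivered - the rank stays stuck in the polling loop exactly as shown in the backtrace below.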

And here is an aggregated backtrace of all of the processes and threads in the 
job:

d1.<> f g w -g f+l
+/
 +__clone : 5:12[0-3.2-3, p1.2-5]
 |+start_thread
 | +listen_thread@oob_tcp_listener.c#705 : 1:1[p1.5]
 | |+__select_nocancel
 | +listen_thread@ptl_base_listener.c#214 : 1:1[p1.3]
 | |+__select_nocancel
 | +progress_engine@opal_progress_threads.c#105 : 5:5[0-3.2, p1.4]
 | |+opal_libevent2022_event_base_loop@event.c#1632
 | | +poll_dispatch@poll.c#167
 | |  +__poll_nocancel
 | +progress_engine@pmix_progress_threads.c#108 : 5:5[0-3.3, p1.2]
 |  +opal_libevent2022_event_base_loop@event.c#1632
 |   +epoll_dispatch@epoll.c#409
 |+__epoll_wait_nocancel
 +_start : 5:5[0-3.1, p1.1]
  +__libc_start_main
   +main@cpi.c#27 : 4:4[0-3.1]
   |+PMPI_Init@pinit.c#67
   | 
+ompi_mpi_init@ompi_mpi_init.c#890 : 
3:3[1-3.1]   THE 3 OTHER MPI PROCS MADE IT PAST FENCE
   | |+ompi_rte_wait_for_debugger@rte_orte_module.c#196
   | | +opal_progress@opal_progress.c#251
   | |  +opal_progress_events@opal_progress.c#191
   | |   +opal_libevent2022_event_base_loop@event.c#1632
   | |+poll_dispatch@poll.c#167
   | | +__poll_nocancel
   | 
+ompi_mpi_init@ompi_mpi_init.c#904 : 
1:1[0.1]  THE THREAD THAT IS STUCK
   |  +usleep
   |   +__nanosleep_nocancel
   +main@main.c#14 : 1:1[p1.1]
+orterun@orterun.c#200
 +opal_libevent2022_event_base_loop@event.c#1632
  +poll_dispatch@poll.c#167
   +__poll_nocancel

d1.<>

I have tested Open MPI 4.0.2 dozens of times, and the hang does not seem to 
happen. My concern is that if the problem is indeed a race, then it's possible 
(but perhaps not likely) that the same race exists in Open MPI 4.0.2, but the 
timing could be slightly different such that it doesn't hang using my simple 
test setup. In other words, maybe I've just been "lucky" with my testing of 
Open MPI 4.0.2 and have failed to provoke the hang yet.

My question is: Was this a known problem in Open MPI 4.0.1 that was fixed in 
Open MPI 4.0.2?

Thanks, John D.