Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-13 Thread Ralph Castain via devel
Just want to clarify my remarks to avoid any misunderstanding. I'm not in any 
way saying that MPIR or the debugger is at fault here, nor was I trying to imply 
that PMIx-based tools are somehow "superior" to MPIR-based ones.

My point was solely focused on the question of reliability. MPIR-based tools 
activate a code path in OMPI that is used only when such tools are involved - 
an "exception" path that is not exercised during normal operations. Thus, all 
the nightly regression testing and normal daily uses of OMPI leave that code 
path effectively untested.

In contrast, PMIx-based tools utilize code paths that are active during normal 
operations. Those paths are exercised by every nightly regression test, and 
tens of thousands of times a day when users run OMPI-based applications. There 
is a much higher probability of detecting a race condition in the PMIx code 
path, and a correspondingly higher confidence level that the code is working 
correctly.

We are not hearing from our user base of any "hangs" such as the one described 
in this thread. That makes it unlikely that a similar race condition resides in 
the "normal" code paths shared by PMIx-based tools, so the root cause of the 
trouble is most likely something in the MPIR-based code paths.

The uniqueness of the MPIR-based code paths and the corresponding lack of 
testing of those paths is why we are moving to PMIx-based tool support in OMPI 
v5.

HTH
Ralph


Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-13 Thread Ralph Castain via devel
Agreed and understood. My point was only that I'm not convinced the problem was 
"fixed": it is entirely consistent with your findings for the race condition to 
still exist, but be biased so strongly that it now "normally" passes. Without 
determining the precise code that causes things to hang vs. complete, there is 
no way to say that the code path is truly "fixed".

The fact that this appears to happen only IF the debugger_attach flag is set 
indicates it has something to do with debugger-related code. It could be 
something in PMIx, or the change in PMIx may simply have shifted the race 
condition. It could be something in the OMPI debugger code, it could be in the 
abstraction layer between PMIx and OMPI, etc.

I don't have an immediate plan for digging deeper into the possible root cause - 
and as I said, I'm not all that motivated to do so, as PMIx-based tools are not 
displaying the same behavior :-)

Ralph


Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-13 Thread John DelSignore via devel
Hi Ralph,

I assume you are referring to your previous email, where you wrote:

Personally, I have never been entirely comfortable with the claim that the PMIx 
modification was the solution to the problem being discussed here. We have 
never seen a report of an application hanging in that spot outside of a 
debugger. Not one report. Yet that code has been "in the wild" now for several 
years.

What I suspect is actually happening is that the debugger is interfering with 
the OMPI internals that are involved in a way that creates a potential loss of 
the release event. The modified timing of the PMIx update biases that race 
sufficiently to make it happen "virtually never", which only means that it 
doesn't trigger when you run it a few times in quick succession. I don't know 
how to further debug it, nor am I particularly motivated to do so as the 
PMIx-based tools work within (not alongside) the release mechanism and are 
unlikely to evince the same behavior.

For now, it appears 4.0.2 is "good enough".

I'm not an OMPI/PMIx expert here, so I can only tell you what I observe, which 
is that even without a debugger in the picture, I can reliably make OMPI 4.0.1 
hang in that code by setting ORTE_TEST_DEBUGGER_ATTACH=1 in the environment. 
OMPI 4.0.2, however, has not hung once in over 1,000 runs of the same test.

Here's what I did:

  *   I added two fprintfs to rte_orte_module.c in both 4.0.1 and 4.0.2 (a 
reconstruction of the instrumentation follows this list):
     *   One inside _release_fn().
     *   One inside ompi_rte_wait_for_debugger(), at the start of the block 
that calls "OMPI_WAIT_FOR_COMPLETION(debugger_event_active);".
  *   Ran w/ 4.0.1: env OMPI_MPIR_DO_NOT_WARN=1 ORTE_TEST_DEBUGGER_ATTACH=1 
mpirun -np 4 ./cpi401
  *   Ran w/ 4.0.2: env OMPI_MPIR_DO_NOT_WARN=1 ORTE_TEST_DEBUGGER_ATTACH=1 
mpirun -np 4 ./cpi402
  *   Ran w/ 4.0.1: env OMPI_MPIR_DO_NOT_WARN=1 mpirun -np 4 ./cpi401
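
For reference, the two fprintfs were of roughly the following shape. This is a 
reconstruction from the output shown below, not the literal patch, so the exact 
format strings and placement are approximations:

    /* In ompi_rte_wait_for_debugger(), just before the wait: */
    fprintf(stderr, "Called ompi_rte_wait_for_debugger(), %s:%d\n",
            __FILE__, __LINE__);
    OMPI_WAIT_FOR_COMPLETION(debugger_event_active);

    /* In _release_fn(), at entry; status and source are the callback's
     * existing parameters: */
    fprintf(stderr, "Called _release_fn(), %s:%d: status==%d, "
            "source->jobid=0x%x, source->vpid=%u\n",
            __FILE__, __LINE__, status,
            (unsigned int) source->jobid,
            (unsigned int) source->vpid);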

With 4.0.1 and ORTE_TEST_DEBUGGER_ATTACH=1, all of the runs hang and look like 
this:

mic:/amd/home/jdelsign>env OMPI_MPIR_DO_NOT_WARN=1 ORTE_TEST_DEBUGGER_ATTACH=1 
mpirun -np 4 ./cpi401
Called ompi_rte_wait_for_debugger(), 
../../../../../openmpi-4.0.1/ompi/mca/rte/orte/rte_orte_module.c:182
Called ompi_rte_wait_for_debugger(), 
../../../../../openmpi-4.0.1/ompi/mca/rte/orte/rte_orte_module.c:182
Called ompi_rte_wait_for_debugger(), 
../../../../../openmpi-4.0.1/ompi/mca/rte/orte/rte_orte_module.c:182
Called ompi_rte_wait_for_debugger(), 
../../../../../openmpi-4.0.1/ompi/mca/rte/orte/rte_orte_module.c:182
...HANG...

With 4.0.2 and ORTE_TEST_DEBUGGER_ATTACH=1, all of the runs complete and look 
like this:

mic:/amd/home/jdelsign>env OMPI_MPIR_DO_NOT_WARN=1 ORTE_TEST_DEBUGGER_ATTACH=1 
mpirun -np 4 ./cpi402
Called ompi_rte_wait_for_debugger(), 
../../../../../openmpi-4.0.2/ompi/mca/rte/orte/rte_orte_module.c:182
Called ompi_rte_wait_for_debugger(), 
../../../../../openmpi-4.0.2/ompi/mca/rte/orte/rte_orte_module.c:182
Called ompi_rte_wait_for_debugger(), 
../../../../../openmpi-4.0.2/ompi/mca/rte/orte/rte_orte_module.c:182
Called ompi_rte_wait_for_debugger(), 
../../../../../openmpi-4.0.2/ompi/mca/rte/orte/rte_orte_module.c:182
Called _release_fn(), 
../../../../../openmpi-4.0.2/ompi/mca/rte/orte/rte_orte_module.c:115: 
status==-54, source->jobid=0xea22, source->vpid=0
Called _release_fn(), 
../../../../../openmpi-4.0.2/ompi/mca/rte/orte/rte_orte_module.c:115: 
status==-54, source->jobid=0xea22, source->vpid=0
Called _release_fn(), 
../../../../../openmpi-4.0.2/ompi/mca/rte/orte/rte_orte_module.c:115: 
status==-54, source->jobid=0xea22, source->vpid=0
Called _release_fn(), 
../../../../../openmpi-4.0.2/ompi/mca/rte/orte/rte_orte_module.c:115: 
status==-54, source->jobid=0xea22, source->vpid=0
Process 1 on microway1
Process 2 on microway1
Process 3 on microway1
Process 0 on microway1
pi is approximately 3.1416009869231249, Error is 0.0818
wall clock time = 0.000133
mic:/amd/home/jdelsign>

With 4.0.1 and ORTE_TEST_DEBUGGER_ATTACH not set, all of the runs complete and 
look like this:

mic:/amd/home/jdelsign>env OMPI_MPIR_DO_NOT_WARN=1 mpirun -np 4 ./cpi401
Process 2 on microway1
Process 0 on microway1
Process 3 on microway1
Process 1 on microway1
pi is approximately 3.1416009869231249, Error is 0.0818
wall clock time = 0.000153
mic:/amd/home/jdelsign>

As you can see in this last test, if ORTE_TEST_DEBUGGER_ATTACH is not set, the 
code in ompi_rte_wait_for_debugger() is not executed.
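
To make the failure mode concrete, here is a toy model of the wait/release 
pattern as I understand it - plain C, not OMPI code, with only the names 
borrowed from the trace. MPI_Init() spins on a flag that a release callback is 
supposed to clear, so a lost release event leaves the process spinning forever:

    #include <stdio.h>
    #include <stdbool.h>

    /* Flag modeled on debugger_event_active: set before the wait,
     * cleared by the release callback. */
    static volatile bool debugger_event_active = true;

    /* Stands in for _release_fn(): runs when the "debugger released"
     * event is delivered. If it never runs, the wait never ends. */
    static void release_fn(void)
    {
        debugger_event_active = false;
    }

    /* Stands in for OMPI_WAIT_FOR_COMPLETION(debugger_event_active);
     * in OMPI the loop also drives the progress engine. */
    static void wait_for_debugger(void)
    {
        while (debugger_event_active) {
            /* spin */
        }
    }

    int main(void)
    {
        release_fn();     /* comment this out to model the lost event:
                           * the program then hangs in the loop below,
                           * just as the job hangs inside MPI_Init() */
        wait_for_debugger();
        printf("released - init would now proceed\n");
        return 0;
    }

The 4.0.1 trace above matches the "release never delivered" branch of this 
model: four waiters print, no _release_fn() lines ever appear, and the job 
hangs.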

Honestly, I don't know whether this is a valid test, but it strongly suggests 
that there is a problem in that code in 4.0.1, and it cannot be the debugger's 
fault because there is no debugger in the picture. The GitHub issues Austen 
pointed at seem to accurately describe what I have seen, and the conclusion 
there was that it was a bug in PMIx. I have no basis to believe otherwise.

Finally, I'd like to reply to your statement, "What I suspect is actually 
happening is that the debugger is interfering with the OMPI internals that are 
involved in a way that creates a potential loss of the release event."