Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView
Just want to clarify my remarks to avoid any misunderstanding. I'm not in any way saying MPIR or the debugger is at fault here, nor was I trying to imply that PMIx-based tools are somehow "superior" to MPIR-based ones. My point was solely about reliability.

MPIR-based tools activate a code path in OMPI that is used only when MPIR-based tools are involved - it is an "exception" code path and therefore is not exercised during normal operations. As a result, the nightly regression testing and normal daily use of OMPI never activate that code path, leaving it effectively untested. In contrast, PMIx-based tools utilize code paths that are active during normal operations, so those paths are exercised by every nightly regression test and tens of thousands of times a day when users run OMPI-based applications. There is a much higher probability of detecting a race condition in the PMIx code path, and a correspondingly higher confidence that the code is working correctly.

We are not hearing of any "hangs" such as the one described in this thread from our user base, which means it is unlikely that a similar race condition resides in the "normal" code paths shared by PMIx-based tools. It is therefore most likely something in the MPIR-based code paths that is the root cause of the trouble. The uniqueness of the MPIR-based code paths, and the corresponding lack of testing of those paths, is why we are moving to PMIx-based tool support in OMPI v5.

HTH
Ralph
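[Editorial note: to make "a code path only used when MPIR-based tools are involved" concrete, the standalone C sketch below paraphrases the launcher-side handshake from the MPIR Process Acquisition Interface specification. It is not OMPI's implementation; the symbol names follow the MPIR specification, while the publish_proctable() helper and main() harness are invented for illustration.]

    /*
     * Minimal sketch of the launcher-side MPIR handshake, paraphrased from
     * the MPIR Process Acquisition Interface specification. NOT OMPI code;
     * publish_proctable() and main() are invented for illustration.
     */
    #include <stdio.h>

    typedef struct {
        char *host_name;        /* node where the MPI process runs */
        char *executable_name;  /* image being executed */
        int   pid;              /* process id of the MPI process */
    } MPIR_PROCDESC;

    #define MPIR_DEBUG_SPAWNED 1

    /* Symbols a tool such as TotalView reads/writes in the launcher. */
    volatile int   MPIR_being_debugged = 0;    /* set to 1 by the tool */
    volatile int   MPIR_debug_state    = 0;
    MPIR_PROCDESC *MPIR_proctable      = NULL;
    int            MPIR_proctable_size = 0;

    /* The tool plants a breakpoint here; calling it notifies the tool. */
    void MPIR_Breakpoint(void) { }

    /* Invented helper: publish the proctable, then notify any tool. */
    static void publish_proctable(MPIR_PROCDESC *procs, int nprocs)
    {
        MPIR_proctable      = procs;
        MPIR_proctable_size = nprocs;
        MPIR_debug_state    = MPIR_DEBUG_SPAWNED;
        if (MPIR_being_debugged) {
            /* This branch runs only under an MPIR tool - the "exception"
             * path that normal (untooled) launches never exercise. */
            MPIR_Breakpoint();
        }
    }

    int main(void)
    {
        MPIR_PROCDESC procs[1] = { { "localhost", "./a.out", 12345 } };
        publish_proctable(procs, 1);   /* untooled run: branch not taken */
        printf("proctable published; being_debugged=%d\n", MPIR_being_debugged);
        return 0;
    }

Every untooled launch skips the MPIR_Breakpoint() branch entirely, which is why routine testing never covers it.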
Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView
Agreed and understood. My point was only that I'm not convinced the problem was "fixed", as it is entirely consistent with your findings for the race condition to still exist but be biased so strongly that it now "normally" passes. Without determining the precise code that causes things to hang vs. complete, there is no way to say that the code path is truly "fixed".

The fact that this only appears to happen if the debugger_attach flag is set indicates it has something to do with debugger-related code. It could be something in PMIx, or it could be that the change in PMIx merely shifted the race condition. It could be something in the OMPI debugger code, in the abstraction layer between PMIx and OMPI, etc.

I don't have an immediate plan for digging deeper into the possible root cause - and, as I said, I'm not all that motivated to do so, as PMIx-based tools are not displaying the same behavior :-)

Ralph
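[Editorial note: the toy program below illustrates the "biased but not fixed" failure mode. It is not OMPI code; handler_registered, debugger_event_active, and deliver_release are invented names. A one-shot release event is dropped if it fires before the waiter has registered for it: on most schedules the waiter registers first and the loop exits, but on an unlucky interleaving the event is lost and the spin wait hangs forever - the "virtually never" behavior described above. A timing change elsewhere in the stack can make the bad interleaving rarer without removing it.]

    /* Standalone toy (NOT OMPI code) showing a lost one-shot release event.
     * Build with: cc -std=c11 -pthread race_toy.c */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int handler_registered   = 0;
    static atomic_int debugger_event_active = 1;  /* waiter spins while nonzero */

    /* Models the runtime delivering the one-shot "debugger release" event.
     * If nobody is registered yet, the event is simply dropped. */
    static void *deliver_release(void *arg)
    {
        (void)arg;
        if (atomic_load(&handler_registered)) {
            atomic_store(&debugger_event_active, 0);  /* release callback analogue */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, deliver_release, NULL);

        /* Waiter side: register, then spin - analogous to registering the
         * release callback and then waiting on a completion flag. If the
         * event was delivered (and dropped) before this store, the loop
         * below never exits: the demo itself hangs, by design. */
        atomic_store(&handler_registered, 1);
        while (atomic_load(&debugger_event_active)) {
            /* busy wait: stands in for a progress-loop style spin */
        }

        pthread_join(t, NULL);
        puts("released");
        return 0;
    }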
Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView
Hi Ralph,

I assume you are referring to your previous email, where you wrote:

> Personally, I have never been entirely comfortable with the claim that the PMIx modification was the solution to the problem being discussed here. We have never seen a report of an application hanging in that spot outside of a debugger. Not one report. Yet that code has been "in the wild" now for several years. What I suspect is actually happening is that the debugger is interfering with the OMPI internals that are involved in a way that creates a potential loss of the release event. The modified timing of the PMIx update biases that race sufficiently to make it happen "virtually never", which only means that it doesn't trigger when you run it a few times in quick succession. I don't know how to further debug it, nor am I particularly motivated to do so as the PMIx-based tools work within (not alongside) the release mechanism and are unlikely to evince the same behavior. For now, it appears 4.0.2 is "good enough".

I'm not an OMPI/PMIx expert, so I can only tell you what I observe: even without a debugger in the picture, I can reliably make OMPI 4.0.1 hang in that code by setting ORTE_TEST_DEBUGGER_ATTACH=1 in the environment. OMPI 4.0.2, however, has not hung once after running the same test over 1,000 times. Here's what I did:

* I added two fprintfs to the rte_orte_module.c file in both 4.0.1 and 4.0.2 (sketched below):
  * One inside _release_fn().
  * One inside ompi_rte_wait_for_debugger(), at the start of the block that calls "OMPI_WAIT_FOR_COMPLETION(debugger_event_active);".
* Ran w/ 4.0.1: env OMPI_MPIR_DO_NOT_WARN=1 ORTE_TEST_DEBUGGER_ATTACH=1 mpirun -np 4 ./cpi401
* Ran w/ 4.0.2: env OMPI_MPIR_DO_NOT_WARN=1 ORTE_TEST_DEBUGGER_ATTACH=1 mpirun -np 4 ./cpi402
* Ran w/ 4.0.1: env OMPI_MPIR_DO_NOT_WARN=1 mpirun -np 4 ./cpi401
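[Editorial note: a reconstruction of those two prints, inferred from the transcript output below. The toy harness and the paraphrased orte_process_name_t layout are assumptions; only the message formats and the sample values (-54, 0xea22) come from the actual runs, which show the real prints at lines 115 and 182 of rte_orte_module.c.]

    /* Reconstruction of the two debug prints from the transcripts.
     * NOT the real rte_orte_module.c: the struct layout is paraphrased
     * and the stub functions plus main() are a stand-in harness. */
    #include <stdio.h>
    #include <stdint.h>

    typedef struct { uint32_t jobid; uint32_t vpid; } orte_process_name_t;

    /* Print #1: at the top of the release callback. */
    static void _release_fn(int status, const orte_process_name_t *source)
    {
        fprintf(stderr,
                "Called %s(), %s:%d: status==%d, source->jobid=0x%x, source->vpid=%u\n",
                __func__, __FILE__, __LINE__, status,
                (unsigned)source->jobid, (unsigned)source->vpid);
    }

    /* Print #2: just before OMPI_WAIT_FOR_COMPLETION(debugger_event_active). */
    static void wait_for_debugger_marker(void)
    {
        fprintf(stderr, "Called ompi_rte_wait_for_debugger(), %s:%d\n",
                __FILE__, __LINE__);
    }

    int main(void)
    {
        orte_process_name_t src = { 0xea22, 0 };  /* values from the 4.0.2 run */
        wait_for_debugger_marker();
        _release_fn(-54, &src);
        return 0;
    }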
With 4.0.1 and ORTE_TEST_DEBUGGER_ATTACH=1, all of the runs hang and look like this:

mic:/amd/home/jdelsign>env OMPI_MPIR_DO_NOT_WARN=1 ORTE_TEST_DEBUGGER_ATTACH=1 mpirun -np 4 ./cpi401
Called ompi_rte_wait_for_debugger(), ../../../../../openmpi-4.0.1/ompi/mca/rte/orte/rte_orte_module.c:182
Called ompi_rte_wait_for_debugger(), ../../../../../openmpi-4.0.1/ompi/mca/rte/orte/rte_orte_module.c:182
Called ompi_rte_wait_for_debugger(), ../../../../../openmpi-4.0.1/ompi/mca/rte/orte/rte_orte_module.c:182
Called ompi_rte_wait_for_debugger(), ../../../../../openmpi-4.0.1/ompi/mca/rte/orte/rte_orte_module.c:182
...HANG...

With 4.0.2 and ORTE_TEST_DEBUGGER_ATTACH=1, all of the runs complete and look like this:

mic:/amd/home/jdelsign>env OMPI_MPIR_DO_NOT_WARN=1 ORTE_TEST_DEBUGGER_ATTACH=1 mpirun -np 4 ./cpi402
Called ompi_rte_wait_for_debugger(), ../../../../../openmpi-4.0.2/ompi/mca/rte/orte/rte_orte_module.c:182
Called ompi_rte_wait_for_debugger(), ../../../../../openmpi-4.0.2/ompi/mca/rte/orte/rte_orte_module.c:182
Called ompi_rte_wait_for_debugger(), ../../../../../openmpi-4.0.2/ompi/mca/rte/orte/rte_orte_module.c:182
Called ompi_rte_wait_for_debugger(), ../../../../../openmpi-4.0.2/ompi/mca/rte/orte/rte_orte_module.c:182
Called _release_fn(), ../../../../../openmpi-4.0.2/ompi/mca/rte/orte/rte_orte_module.c:115: status==-54, source->jobid=0xea22, source->vpid=0
Called _release_fn(), ../../../../../openmpi-4.0.2/ompi/mca/rte/orte/rte_orte_module.c:115: status==-54, source->jobid=0xea22, source->vpid=0
Called _release_fn(), ../../../../../openmpi-4.0.2/ompi/mca/rte/orte/rte_orte_module.c:115: status==-54, source->jobid=0xea22, source->vpid=0
Called _release_fn(), ../../../../../openmpi-4.0.2/ompi/mca/rte/orte/rte_orte_module.c:115: status==-54, source->jobid=0xea22, source->vpid=0
Process 1 on microway1
Process 2 on microway1
Process 3 on microway1
Process 0 on microway1
pi is approximately 3.1416009869231249, Error is 0.0818
wall clock time = 0.000133
mic:/amd/home/jdelsign>

With 4.0.1 and ORTE_TEST_DEBUGGER_ATTACH not set, all of the runs complete and look like this:

mic:/amd/home/jdelsign>env OMPI_MPIR_DO_NOT_WARN=1 mpirun -np 4 ./cpi401
Process 2 on microway1
Process 0 on microway1
Process 3 on microway1
Process 1 on microway1
pi is approximately 3.1416009869231249, Error is 0.0818
wall clock time = 0.000153
mic:/amd/home/jdelsign>

As you can see in this last test, if ORTE_TEST_DEBUGGER_ATTACH is not set, the code in ompi_rte_wait_for_debugger() is not executed. Honestly, I don't know whether this is a valid test, but it strongly suggests that there is a problem in that code in 4.0.1, and it cannot be the debugger's fault, because there is no debugger in the picture. The GitHub issues Austen pointed at seem to accurately describe what I have seen, and the conclusion there was that it was a bug in PMIx. I have no basis to believe otherwise.

Finally, I'd like to reply to your statement, "What I suspect is actually happening is that the debugger is interfering with the OMPI internals that are involved in a way that creates a potential loss of the release event."