Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread Ralph Castain via devel
Again, John, I'm not convinced your last statement is true. However, I think it is "good enough" for now as it seems to work for you and it isn't seen outside of a debugger scenario. On Nov 12, 2019, at 3:13 PM, John DelSignore via devel <devel@lists.open-mpi.org> wrote: Hi Austen,

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread John DelSignore via devel
Hi Austen, Thanks very much; the issues you show below do indeed describe what I am seeing. Using printfs and breakpoints I inserted into the _release_fn() function, I was able to see that with OMPI 4.0.1, at most one of the MPI processes called the function. Most of the time rank 0 would be

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread Ralph Castain via devel
George beat me to the response - I agree entirely with his statement. Let's not go down a dead end here. Personally, I have never been entirely comfortable with the claim that the PMIx modification was the solution to the problem being discussed here. We have never seen a report of an

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread George Bosilca via devel
As indicated by this discussion, the proper usage of volatile is certainly misunderstood. However, the usage of the volatile we are doing in this particular instance is correct and valid even in multi-threaded cases. We are using it for a __single__ trigger, __one way__ synchronization similar
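
For context, a minimal sketch of the single-trigger, one-way pattern George describes: a callback driven by the progress engine writes the flag exactly once, and the waiting thread polls it. The flag name debugger_event_active appears elsewhere in this thread; the surrounding function names are illustrative, not the exact Open MPI symbols.

    /* Sketch of the single-trigger, one-way synchronization under discussion.
     * Helper names are hypothetical; only debugger_event_active is taken
     * from the thread itself. */
    #include <unistd.h>

    static volatile int debugger_event_active = 1;

    /* Fired once from the progress engine (e.g. a PMIx event callback). */
    static void release_fn(void)
    {
        debugger_event_active = 0;   /* written exactly once: the single trigger */
    }

    static void wait_for_debugger_release(void)
    {
        /* volatile forces a fresh load on every iteration, so the store above
         * is eventually observed; no other ordering is relied upon */
        while (debugger_event_active) {
            usleep(100);
        }
    }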

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread Austen W Lauria via devel
I agree that the use of volatile is insufficient if we want to adhere to proper multi-threaded programming standards: "Note that volatile variables are not suitable for communication between threads; they do not offer atomicity, synchronization, or memory ordering. A read from a volatile
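
For comparison only, here is a sketch of what the quoted guidance points to: if stronger guarantees were wanted, a C11 atomic flag with explicit release/acquire ordering would be the textbook form. This illustrates the alternative, not what the Open MPI code in question actually does.

    /* Hypothetical alternative using C11 atomics instead of volatile.
     * The release/acquire pair provides the atomicity and memory ordering
     * the quoted note says volatile does not. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <unistd.h>

    static atomic_bool released;

    static void release_fn(void)
    {
        atomic_store_explicit(&released, true, memory_order_release);
    }

    static void wait_for_release(void)
    {
        while (!atomic_load_explicit(&released, memory_order_acquire)) {
            usleep(100);   /* still polling, but the load/store are now ordered */
        }
    }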

Re: [OMPI devel] [EXTERNAL] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread Larry Baker via devel
"allowing us to weakly synchronize two threads" concerns me if the synchronization is important or must be reliable. I do not understand how volatile alone provides reliable synchronization without a mechanism to order visible changes to memory. If the flag(s) in question are suppposed to

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread George Bosilca via devel
If the issue was some kind of memory consistency issue between threads, then printing that variable in the context of the debugger would show the value of debugger_event_active being false. volatile is not a memory barrier; it simply forces a load for each access of the data, allowing us to weakly

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread Austen W Lauria via devel
I think you are hitting this issue here in 4.0.1: https://github.com/open-mpi/ompi/issues/6613 MPIR was broken in 4.0.1 due to a race condition in PMIx. It was patched, it looks to me, for 4.0.2. Here is the openpmix issue: https://github.com/openpmix/openpmix/issues/1189 I think this lines up

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread Austen W Lauria via devel
Yes, that was an omission on my part. Regarding volatile being sufficient - I don't think that is the case in all situations. It might work under most conditions, but it can lead to the "it works on my machine..." type of bug. In particular, it doesn't guarantee that the waiting thread will ever
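
The pthread-based alternative discussed in this thread would presumably look something like the sketch below (names are illustrative). Note Ralph's objection elsewhere in the thread: the MPI process cannot simply block in a condition-variable wait here, because the waiting thread must keep driving opal_progress.

    /* Sketch of a mutex/condition-variable wait, the kind of "pthread method"
     * debated in this thread. Illustrative only. */
    #include <pthread.h>
    #include <stdbool.h>

    static pthread_mutex_t release_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  release_cond = PTHREAD_COND_INITIALIZER;
    static bool released = false;

    static void release_fn(void)
    {
        pthread_mutex_lock(&release_lock);
        released = true;
        pthread_cond_signal(&release_cond);
        pthread_mutex_unlock(&release_lock);
    }

    static void wait_for_release(void)
    {
        pthread_mutex_lock(&release_lock);
        while (!released) {
            /* guaranteed to wake when release_fn() signals; no polling */
            pthread_cond_wait(&release_cond, &release_lock);
        }
        pthread_mutex_unlock(&release_lock);
    }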

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread John DelSignore via devel
Hi Austen, Thanks for the reply. What I am seeing is consistent with your thought, in that when I see the hang, one or more processes did not have a flag updated. I don't understand the Open MPI code well enough to say whether it is a memory barrier problem or not. It almost looks like a

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread Ralph Castain via devel
Just to be clear as well: you cannot use the pthread method you propose because you must loop over opal_progress - the "usleep" is in there simply to avoid consuming 100% cpu while we wait. On Nov 12, 2019, at 8:52 AM, George Bosilca via devel <devel@lists.open-mpi.org> wrote: I don't

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread George Bosilca via devel
I don't think there is a need for any protection around that variable. It will change value only once (in a callback triggered from opal_progress), and the volatile guarantees that loads will be issued for every access, so the waiting thread will eventually notice the change. George. On Tue, Nov

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread Austen W Lauria via devel
Could it be that some processes are not seeing the flag get updated? I don't think just using a simple while loop with a volatile variable is sufficient in all cases in a multi-threaded environment. It's my understanding that the volatile keyword just tells the compiler to not optimize or do

Re: [OMPI devel] Open MPI v4.0.1: Process is hanging inside MPI_Init() when debugged with TotalView

2019-11-12 Thread Ralph Castain via devel
> On Nov 11, 2019, at 4:53 PM, Gilles Gouaillardet via devel wrote: > John, OMPI_LAZY_WAIT_FOR_COMPLETION(active) is a simple loop that periodically checks the (volatile) "active" condition that is expected to be updated by another thread. So if you set your breakpoint
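
Based on the description above (and Ralph's note about usleep and opal_progress earlier in the thread), the macro's behavior is roughly as follows. This is a paraphrase of what the thread describes, not the literal definition from the Open MPI sources.

    /* Approximate shape of OMPI_LAZY_WAIT_FOR_COMPLETION as described in this
     * thread: spin on the (volatile) flag, drive the progress engine on each
     * iteration, and sleep briefly so the wait does not consume 100% CPU. */
    #define OMPI_LAZY_WAIT_FOR_COMPLETION(flag)   \
        do {                                      \
            while ((flag)) {                      \
                opal_progress();                  \
                usleep(100);                      \
            }                                     \
        } while (0)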