Re: [OMPI devel] Deadlock in sync_wait_mt(): Proposed patch

2016-09-21 Thread George Bosilca
Nice catch. Keeping the first check only works because the signaling field
prevent us from releasing the condition too early. I added some comments
around the code (131fe42d).

  George.

On Wed, Sep 21, 2016 at 5:33 AM, Nathan Hjelm  wrote:

> Yeah, that looks like a bug to me. We need to keep the check before the
> lock but otherwise this is fine and should be fixed in 2.0.2.
>
> -Nathan
>
> > On Sep 21, 2016, at 3:16 AM, DEVEZE, PASCAL 
> wrote:
> >
> > I encountered a deadlock in sync_wait_mt().
> >
> > After investigations, it appears that a first thread executing
> wait_sync_update() decrements sync->count just after a second thread in
> sync_wait_mt() made the test :
> >
> > if(sync->count <= 0)
> > return (0 == sync->status) ? OPAL_SUCCESS : OPAL_ERROR;
> >
> > After that, there is a narrow window in which the first thread may call
> pthread_cond_signal() before the second thread calls pthread_cond_wait().
> >
> > If I protect this test by the sync->lock, this window is closed and the
> problem does not reproduce.
> >
> > To easy reproduce the problem, just add a call to usleep(100) before the
> call to pthread_mutex(&sync->lock);
> >
> > So my proposed patch is:
> >
> > diff --git a/opal/threads/wait_sync.c b/opal/threads/wait_sync.c
> > index c9b9137..2f90965 100644
> > --- a/opal/threads/wait_sync.c
> > +++ b/opal/threads/wait_sync.c
> > @@ -25,12 +25,14 @@ static ompi_wait_sync_t* wait_sync_list = NULL;
> >
> > int sync_wait_mt(ompi_wait_sync_t *sync)
> > {
> > -if(sync->count <= 0)
> > -return (0 == sync->status) ? OPAL_SUCCESS : OPAL_ERROR;
> > -
> >  /* lock so nobody can signal us during the list updating */
> >  pthread_mutex_lock(&sync->lock);
> >
> > +if(sync->count <= 0) {
> > +pthread_mutex_unlock(&sync->lock);
> > +return (0 == sync->status) ? OPAL_SUCCESS : OPAL_ERROR;
> > +}
> > +
> >  /* Insert sync on the list of pending synchronization constructs */
> >  OPAL_THREAD_LOCK(&wait_sync_lock);
> >  if( NULL == wait_sync_list ) {
> >
> > For performance reasons, it is also possible to leave the first test
> call. So if the request is terminated, we do not spend time to take and
> free the lock.
> >
> >
> >
> > ___
> > devel mailing list
> > devel@lists.open-mpi.org
> > https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] Deadlock in sync_wait_mt(): Proposed patch

2016-09-21 Thread Nathan Hjelm
Yeah, that looks like a bug to me. We need to keep the check before the lock 
but otherwise this is fine and should be fixed in 2.0.2.

-Nathan

> On Sep 21, 2016, at 3:16 AM, DEVEZE, PASCAL  wrote:
> 
> I encountered a deadlock in sync_wait_mt().
>  
> After investigations, it appears that a first thread executing 
> wait_sync_update() decrements sync->count just after a second thread in 
> sync_wait_mt() made the test :
>  
> if(sync->count <= 0)
> return (0 == sync->status) ? OPAL_SUCCESS : OPAL_ERROR;
>  
> After that, there is a narrow window in which the first thread may call 
> pthread_cond_signal() before the second thread calls pthread_cond_wait().
>  
> If I protect this test by the sync->lock, this window is closed and the 
> problem does not reproduce.
>  
> To easy reproduce the problem, just add a call to usleep(100) before the call 
> to pthread_mutex(&sync->lock);
>  
> So my proposed patch is:
>  
> diff --git a/opal/threads/wait_sync.c b/opal/threads/wait_sync.c
> index c9b9137..2f90965 100644
> --- a/opal/threads/wait_sync.c
> +++ b/opal/threads/wait_sync.c
> @@ -25,12 +25,14 @@ static ompi_wait_sync_t* wait_sync_list = NULL;
>  
> int sync_wait_mt(ompi_wait_sync_t *sync)
> {
> -if(sync->count <= 0)
> -return (0 == sync->status) ? OPAL_SUCCESS : OPAL_ERROR;
> -
>  /* lock so nobody can signal us during the list updating */
>  pthread_mutex_lock(&sync->lock);
>  
> +if(sync->count <= 0) {
> +pthread_mutex_unlock(&sync->lock);
> +return (0 == sync->status) ? OPAL_SUCCESS : OPAL_ERROR;
> +}
> +
>  /* Insert sync on the list of pending synchronization constructs */
>  OPAL_THREAD_LOCK(&wait_sync_lock);
>  if( NULL == wait_sync_list ) {
>  
> For performance reasons, it is also possible to leave the first test call. So 
> if the request is terminated, we do not spend time to take and free the lock.
>  
>  
>  
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel