Re: [OMPI devel] Ticket #1267 - thread locks in ompi_proc_t code

2008-07-07 Thread Brian W. Barrett

Responding to both of Ralph's e-mails in one, just to confuse people :).

First, the issue of the recursive locks...  Back in the day, ompi_proc_t 
instances could be created as a side effect of other operations. 
Therefore, to maintain sanity, the procs were implicitly added to the 
master proc list when the proc was created.


When I reworked the modex last year, I also changed the proc code so that 
procs are never implicitly created.  There's only two ways to get a new 
proc to come into existence -- ompi_proc_init() and ompi_proc_unpack(). 
Therefore, there's no need for the proc construct to implicitly add the 
proc to the procs list -- there's only two places in the proc code that 
need to change if it isn't (solving the recusive locking problem).


Ralph already committed code to the trunk to deal with this case, although 
there are a couple of changes that Jeff and Ralph are working out.  For 
removal, it ended up making more sense to remove the procs from the list 
during the destructor, as it will allow us to remove procs from the system 
in comm_disconnect situations in the future (although we're not actually 
going to do that until after v1.3).


As for thread safety with the return from ompi_proc_all and 
ompi_proc_world.  Today, the ompi_proc code has one ref count held for it, 
which is set in ompi_proc_init or ompi_proc_unpack and not released until 
ompi_proc_finalize.  So there's no way to remove a proc from the system 
before MPI_FINALIZE, unless something bad has already happened. 
Therefore, the lists are (today) thread safe.  ompi_proc_world and 
ompi_proc_self only return procs created by ompi_proc_init, and there's no 
way a correct MPI program could ever cause the procs in that list to reach 
a ref count of zero before MPI_FINALIZE, so they are always thread safe in 
usage.


There is a problem which I've outlined to Jeff and Ralph with the thread 
safety of ompi_proc_all() if the accept/connect code is changed such that 
procs can be destroyed before ompi_proc_finalize().  This can't happen 
today, so there is no thread safety issue with ompi_proc_all() today.  In 
the future, there might be, although the solution is fairly simple (and 
Jeff and Ralph know what I propsed and will implement it before implentng 
the release procs before finalize thing, which has other implications to 
the PML/BML/BTL that I won't go into here.


Brian



On Mon, 7 Jul 2008, Ralph H Castain wrote:


I will have to correct something here. From what I can see, it appears the
MPI code may not be creating ompi_proc_t structures, but rather creating
arrays of ompi_proc_t* pointers that are then filled with values equal to
the pointers in the ompi_proc_list held inside of ompi/proc/proc.c.

It appears, though, that this may be done in a non-thread-safe manner. When
the arrays are filled by calling ompi_proc_world or ompi_proc_all, the
objects themselves are never OBJ_RETAIN'd. As a result, when the first
thread in the code calls OBJ_RELEASE, the object is removed from the
ompi_proc_list and free'd - but the other threads that called
ompi_proc_world/all still retain pointers that now reference invalid memory.

Perhaps the best path forward is to devise a thread-safe design for this
code area and present it for people's review. I'll see what I can do.

Again, comments are welcomed
Ralph



On 7/7/08 8:22 AM, "Ralph H Castain"  wrote:


I am seeking input before making a change to the ompi/proc/proc.c code to
resolve the referenced ticket. The change could potentially impact how the
ompi_proc_t struct is used in the rest of the MPI code base. If this doesn't
impact you, please ignore the remainder of this note.


I was asked last week to take a look at ticket #1267, filed by Tim Prins
several months ago. This ticket references a report on the devel list about
thread locks when calling comm_spawn and using MPI_Init_thread. The thread
lock is caused by the constructor/destructor in the ompi_proc_t class
explicitly removing the referenced ompi_proc_t from the static local global
ompi_proc_list, and calling OPAL_THREAD_LOCK/OPAL_THREAD_UNLOCK around that
list operation.

As far as I can see, Tim correctly resolved the constructor conflict by
simply removing the thread lock/unlock and list append operation from the
constructor. A scan of the code shows that OBJ_NEW is -only- called from
within the ompi/proc/proc.c code, so this won't be an issue.

However, I noted several issues surrounding the creation and release of
ompi_proc_t objects that -may- cause problems in making a similar change to
the destructor to fix the rest of the threading problem. These may have been
created in response to the list modification code currently existing in the
ompi_proc_t object destructor - or they may be due to other factors.

Specifically, the MPI code base outside of ompi/proc/proc.c:

1. -never- treats the ompi_proc_t as an opal object. Instead, the code
specifically calls calloc to create space for the structures, and then

Re: [OMPI devel] Ticket #1267 - thread locks in ompi_proc_t code

2008-07-07 Thread Ralph H Castain
I will have to correct something here. From what I can see, it appears the
MPI code may not be creating ompi_proc_t structures, but rather creating
arrays of ompi_proc_t* pointers that are then filled with values equal to
the pointers in the ompi_proc_list held inside of ompi/proc/proc.c.

It appears, though, that this may be done in a non-thread-safe manner. When
the arrays are filled by calling ompi_proc_world or ompi_proc_all, the
objects themselves are never OBJ_RETAIN'd. As a result, when the first
thread in the code calls OBJ_RELEASE, the object is removed from the
ompi_proc_list and free'd - but the other threads that called
ompi_proc_world/all still retain pointers that now reference invalid memory.

Perhaps the best path forward is to devise a thread-safe design for this
code area and present it for people's review. I'll see what I can do.

Again, comments are welcomed
Ralph



On 7/7/08 8:22 AM, "Ralph H Castain"  wrote:

> I am seeking input before making a change to the ompi/proc/proc.c code to
> resolve the referenced ticket. The change could potentially impact how the
> ompi_proc_t struct is used in the rest of the MPI code base. If this doesn't
> impact you, please ignore the remainder of this note.
> 
> 
> I was asked last week to take a look at ticket #1267, filed by Tim Prins
> several months ago. This ticket references a report on the devel list about
> thread locks when calling comm_spawn and using MPI_Init_thread. The thread
> lock is caused by the constructor/destructor in the ompi_proc_t class
> explicitly removing the referenced ompi_proc_t from the static local global
> ompi_proc_list, and calling OPAL_THREAD_LOCK/OPAL_THREAD_UNLOCK around that
> list operation.
> 
> As far as I can see, Tim correctly resolved the constructor conflict by
> simply removing the thread lock/unlock and list append operation from the
> constructor. A scan of the code shows that OBJ_NEW is -only- called from
> within the ompi/proc/proc.c code, so this won't be an issue.
> 
> However, I noted several issues surrounding the creation and release of
> ompi_proc_t objects that -may- cause problems in making a similar change to
> the destructor to fix the rest of the threading problem. These may have been
> created in response to the list modification code currently existing in the
> ompi_proc_t object destructor - or they may be due to other factors.
> 
> Specifically, the MPI code base outside of ompi/proc/proc.c:
> 
> 1. -never- treats the ompi_proc_t as an opal object. Instead, the code
> specifically calls calloc to create space for the structures, and then
> manually constructs them.
> 
> 2. consistently calls OBJ_RELEASE on the resulting structures, even though
> they were never created as opal objects via OBJ_NEW.
> 
> I confess to being puzzled here as the destructor will attempt to remove the
> referenced ompi_proc_t* pointer from the ompi_proc_list in ompi/proc/proc.c,
> but (since OBJ_NEW wasn't called) that object won't be on the list. Looking
> at the code itself, it appears we rely on the list function to realize that
> this object isn't on the list and ignore the request to remove it. We don't
> check the return code from the opal_list_remove_item, and so ignore the
> returned error.
> 
> My point here is to seek comment about the proposed fix for the problem
> referenced in the ticket. My proposal is to remove the thread lock/unlock
> and list manipulation from the ompi_proc_t destructor. From what I can see
> (as described above), this should not impact the rest of the code base. I
> will then add thread lock/unlock protection explicitly to ompi_proc_finalize
> to protect its list operations.
> 
> It appears to me that this change would also open the way to allowing the
> remainder of the code base to treat ompi_proc_t as an object, using OBJ_NEW
> to correctly and consistently initialize those objects. I note that any
> change to ompi_proc_t today (which has occurred in the not-too-distant
> past!) can create a problem throughout the current code base due to the
> manual construction of this object. This is why we have objects in the first
> place - I suspect people didn't use OBJ_NEW because of the automatic change
> it induced in the ompi_proc_list in ompi/proc/proc.c.
> 
> Any comments would be welcome.
> Ralph
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




[OMPI devel] Ticket #1267 - thread locks in ompi_proc_t code

2008-07-07 Thread Ralph H Castain
I am seeking input before making a change to the ompi/proc/proc.c code to
resolve the referenced ticket. The change could potentially impact how the
ompi_proc_t struct is used in the rest of the MPI code base. If this doesn't
impact you, please ignore the remainder of this note.


I was asked last week to take a look at ticket #1267, filed by Tim Prins
several months ago. This ticket references a report on the devel list about
thread locks when calling comm_spawn and using MPI_Init_thread. The thread
lock is caused by the constructor/destructor in the ompi_proc_t class
explicitly removing the referenced ompi_proc_t from the static local global
ompi_proc_list, and calling OPAL_THREAD_LOCK/OPAL_THREAD_UNLOCK around that
list operation.

As far as I can see, Tim correctly resolved the constructor conflict by
simply removing the thread lock/unlock and list append operation from the
constructor. A scan of the code shows that OBJ_NEW is -only- called from
within the ompi/proc/proc.c code, so this won't be an issue.

However, I noted several issues surrounding the creation and release of
ompi_proc_t objects that -may- cause problems in making a similar change to
the destructor to fix the rest of the threading problem. These may have been
created in response to the list modification code currently existing in the
ompi_proc_t object destructor - or they may be due to other factors.

Specifically, the MPI code base outside of ompi/proc/proc.c:

1. -never- treats the ompi_proc_t as an opal object. Instead, the code
specifically calls calloc to create space for the structures, and then
manually constructs them.

2. consistently calls OBJ_RELEASE on the resulting structures, even though
they were never created as opal objects via OBJ_NEW.

I confess to being puzzled here as the destructor will attempt to remove the
referenced ompi_proc_t* pointer from the ompi_proc_list in ompi/proc/proc.c,
but (since OBJ_NEW wasn't called) that object won't be on the list. Looking
at the code itself, it appears we rely on the list function to realize that
this object isn't on the list and ignore the request to remove it. We don't
check the return code from the opal_list_remove_item, and so ignore the
returned error.

My point here is to seek comment about the proposed fix for the problem
referenced in the ticket. My proposal is to remove the thread lock/unlock
and list manipulation from the ompi_proc_t destructor. From what I can see
(as described above), this should not impact the rest of the code base. I
will then add thread lock/unlock protection explicitly to ompi_proc_finalize
to protect its list operations.

It appears to me that this change would also open the way to allowing the
remainder of the code base to treat ompi_proc_t as an object, using OBJ_NEW
to correctly and consistently initialize those objects. I note that any
change to ompi_proc_t today (which has occurred in the not-too-distant
past!) can create a problem throughout the current code base due to the
manual construction of this object. This is why we have objects in the first
place - I suspect people didn't use OBJ_NEW because of the automatic change
it induced in the ompi_proc_list in ompi/proc/proc.c.

Any comments would be welcome.
Ralph