Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r27161 - trunk/orte/mca/grpcomm/base

2012-08-30 Thread George Bosilca

On Aug 30, 2012, at 17:15 , Ralph Castain  wrote:

> 
> On Aug 30, 2012, at 8:10 AM, George Bosilca  wrote:
> 
>> A strange race condition happening for undisclosed reasons, and only fixable 
>> by replication is jeopardizing our reference count system. That sounds 
>> definitively almost scary (!) 
>> 
>> I think that the proposed solution is just a band-aid. It somehow fixes this 
>> particular instance of the issue but leave all the others unpatched, asking 
>> for troubles later on. This problem has been lingering around for years, but 
>> we failed to address it correctly up to now.
>> 
>> Based on my understanding of the code the problem is not with the ref count 
>> but with the way opal_buffer_t is handled. We have no way to retrieve the 
>> pointer where the data in the opal_buffer_t is stored without a destructive 
>> operation. This means every time we need to have the pointer of the 
>> opal_buffer_t (like in the send operation to build the iovecs), we have to 
>> do a load followed by an unload, leaving the opal_buffer_t uninitialized for 
>> a short amount of time. As a result it is completely unsafe to use the same 
>> opal_buffer_t concurrently for multiple operations, as some callbacks can 
>> find the buffer uninitialized when they fire.
> 
> That is correct - and yes, it is a bandaid. Fixing the opal_buffer_t 
> situation is a much bigger issue that will require more time and effort than 
> we had at the moment.

I feel compelled to acknowledge the clearness of the commit message explaining 
the real reason behind the bandaid. And of course the fact that the community 
was made aware about this critical issue, especially now that we have an event 
based runtime and people are toying with multiple threads (every ingredient is 
here to have this issue come back more often).

Anyway, I guess as long as you were aware of the issue, the community doesn't 
have to take any further action nor be informed about. A proper software design 
will certainly be proposed in a timely manner.

  george.


>> Now regarding the patch itself, I have to congratulate the Open MPI 
>> community for its unbelievable response time. A solution proposed, then 
>> tested on the faulty platforms, then the code carefully reviewed and finally 
>> pushed in a stable branch all in a mere 43 minutes (!). It shows that all 
>> the protection mechanism we put in place around our stable branches are 
>> entirely functional and their role is completely fulfilled. I doubt any 
>> other open source project can claim such a feat. Congratulations!
> 
> As always, George - thanks for your positive, inspirational attitude. I'm 
> sure we all truly appreciate your input.
> 
> 
>> 
>> commit in the trunk @ Timestamp: 08/28/12 13:17:34 (6 hours ago)
>> commit in the 1.7   @ Timestamp: 08/28/12 14:00:10 (5 hours ago)
>> 
>> george.
>> 
>> On Aug 28, 2012, at 19:17 , svn-commit-mai...@open-mpi.org wrote:
>> 
>>> Author: rhc (Ralph Castain)
>>> Date: 2012-08-28 13:17:34 EDT (Tue, 28 Aug 2012)
>>> New Revision: 27161
>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/27161
>>> 
>>> Log:
>>> Fix a strange race condition by creating a separate buffer for each send - 
>>> apparently, just a retain isn't enough protection on some systems
>> 
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r27161 - trunk/orte/mca/grpcomm/base

2012-08-30 Thread Ralph Castain

On Aug 30, 2012, at 8:10 AM, George Bosilca  wrote:

> A strange race condition happening for undisclosed reasons, and only fixable 
> by replication is jeopardizing our reference count system. That sounds 
> definitively almost scary (!) 
> 
> I think that the proposed solution is just a band-aid. It somehow fixes this 
> particular instance of the issue but leave all the others unpatched, asking 
> for troubles later on. This problem has been lingering around for years, but 
> we failed to address it correctly up to now.
> 
> Based on my understanding of the code the problem is not with the ref count 
> but with the way opal_buffer_t is handled. We have no way to retrieve the 
> pointer where the data in the opal_buffer_t is stored without a destructive 
> operation. This means every time we need to have the pointer of the 
> opal_buffer_t (like in the send operation to build the iovecs), we have to do 
> a load followed by an unload, leaving the opal_buffer_t uninitialized for a 
> short amount of time. As a result it is completely unsafe to use the same 
> opal_buffer_t concurrently for multiple operations, as some callbacks can 
> find the buffer uninitialized when they fire.

That is correct - and yes, it is a bandaid. Fixing the opal_buffer_t situation 
is a much bigger issue that will require more time and effort than we had at 
the moment.

> 
> 
> Now regarding the patch itself, I have to congratulate the Open MPI community 
> for its unbelievable response time. A solution proposed, then tested on the 
> faulty platforms, then the code carefully reviewed and finally pushed in a 
> stable branch all in a mere 43 minutes (!). It shows that all the protection 
> mechanism we put in place around our stable branches are entirely functional 
> and their role is completely fulfilled. I doubt any other open source project 
> can claim such a feat. Congratulations!

As always, George - thanks for your positive, inspirational attitude. I'm sure 
we all truly appreciate your input.


> 
> commit in the trunk @ Timestamp: 08/28/12 13:17:34 (6 hours ago)
> commit in the 1.7   @ Timestamp: 08/28/12 14:00:10 (5 hours ago)
> 
>  george.
> 
> On Aug 28, 2012, at 19:17 , svn-commit-mai...@open-mpi.org wrote:
> 
>> Author: rhc (Ralph Castain)
>> Date: 2012-08-28 13:17:34 EDT (Tue, 28 Aug 2012)
>> New Revision: 27161
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/27161
>> 
>> Log:
>> Fix a strange race condition by creating a separate buffer for each send - 
>> apparently, just a retain isn't enough protection on some systems
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r27161 - trunk/orte/mca/grpcomm/base

2012-08-30 Thread George Bosilca
A strange race condition happening for undisclosed reasons, and only fixable by 
replication is jeopardizing our reference count system. That sounds 
definitively almost scary (!) 

I think that the proposed solution is just a band-aid. It somehow fixes this 
particular instance of the issue but leave all the others unpatched, asking for 
troubles later on. This problem has been lingering around for years, but we 
failed to address it correctly up to now.

Based on my understanding of the code the problem is not with the ref count but 
with the way opal_buffer_t is handled. We have no way to retrieve the pointer 
where the data in the opal_buffer_t is stored without a destructive operation. 
This means every time we need to have the pointer of the opal_buffer_t (like in 
the send operation to build the iovecs), we have to do a load followed by an 
unload, leaving the opal_buffer_t uninitialized for a short amount of time. As 
a result it is completely unsafe to use the same opal_buffer_t concurrently for 
multiple operations, as some callbacks can find the buffer uninitialized when 
they fire.


Now regarding the patch itself, I have to congratulate the Open MPI community 
for its unbelievable response time. A solution proposed, then tested on the 
faulty platforms, then the code carefully reviewed and finally pushed in a 
stable branch all in a mere 43 minutes (!). It shows that all the protection 
mechanism we put in place around our stable branches are entirely functional 
and their role is completely fulfilled. I doubt any other open source project 
can claim such a feat. Congratulations!

commit in the trunk @ Timestamp: 08/28/12 13:17:34 (6 hours ago)
commit in the 1.7   @ Timestamp: 08/28/12 14:00:10 (5 hours ago)

  george.

On Aug 28, 2012, at 19:17 , svn-commit-mai...@open-mpi.org wrote:

> Author: rhc (Ralph Castain)
> Date: 2012-08-28 13:17:34 EDT (Tue, 28 Aug 2012)
> New Revision: 27161
> URL: https://svn.open-mpi.org/trac/ompi/changeset/27161
> 
> Log:
> Fix a strange race condition by creating a separate buffer for each send - 
> apparently, just a retain isn't enough protection on some systems