Re: [OMPI devel] matching code rewrite in OB1

2007-12-18 Thread Gleb Natapov
On Mon, Dec 17, 2007 at 08:08:02PM -0500, Richard Graham wrote: > Needless to say (for the nth time :-) ) that changing this bit of code > makes me > nervous. I've noticed it already :) > However, it occurred to me that there is a much better way to > test > this code than setting

Re: [OMPI devel] matching code rewrite in OB1

2007-12-17 Thread Gleb Natapov
On Thu, Dec 13, 2007 at 08:04:21PM -0500, Richard Graham wrote: > Yes, should be a bit more clear. Need an independent way to verify that > data is matched > in the correct order - sending this information as payload is one way to do > this. So, > sending unique data in every message, and

Re: [OMPI devel] matching code rewrite in OB1

2007-12-14 Thread Gleb Natapov
On Fri, Dec 14, 2007 at 06:53:55AM -0500, Richard Graham wrote: > If you have positive confirmation that such things have happened, this will > go a long way. I instrumented the code to log all kinds of info about fragment reordering while I chased a bug in openib that caused matching logic to

Re: [OMPI devel] matching code rewrite in OB1

2007-12-14 Thread Richard Graham
If you have positive confirmation that such things have happened, this will go a long way. I will not trust the code until this has also been done with multiple independent network paths. I very rarely express such strong opinions, even if I don't agree with what is being done, but this is the

Re: [OMPI devel] matching code rewrite in OB1

2007-12-13 Thread Richard Graham
Yes, should be a bit more clear. Need an independent way to verify that data is matched in the correct order - sending this information as payload is one way to do this. So, sending unique data in every message, and making sure that it arrives in the user buffers in the expected order is a

Re: [OMPI devel] matching code rewrite in OB1

2007-12-13 Thread Richard Graham
The situation that needs to be triggered, just as George has mentioned, is where we have a lot of unexpected messages, to make sure that when one that we can match against comes in, all the unexpected messages that can be matched with pre-posted receives are matched. Since we attempt to match only

Re: [OMPI devel] matching code rewrite in OB1

2007-12-13 Thread George Bosilca
Rich was referring to the fact that the reordering of fragments other than the matching ones is irrelevant to Gleb's change. In order to trigger the changes we need to force a lot of small unexpected messages over multiple networks. The testing environment should have multiple similar

Re: [OMPI devel] matching code rewrite in OB1

2007-12-13 Thread Gleb Natapov
On Wed, Dec 12, 2007 at 03:10:10PM -0600, Brian W. Barrett wrote: > On Wed, 12 Dec 2007, Gleb Natapov wrote: > > > On Wed, Dec 12, 2007 at 03:46:10PM -0500, Richard Graham wrote: > >> This is better than nothing, but really not very helpful for looking at the > >> specific issues that can arise

Re: [OMPI devel] matching code rewrite in OB1

2007-12-12 Thread Jeff Squyres
[mailto:gl...@voltaire.com] Sent: Wednesday, December 12, 2007 03:54 PM Eastern Standard Time To: Open MPI Developers Subject: Re: [OMPI devel] matching code rewrite in OB1 On Wed, Dec 12, 2007 at 03:52:17PM -0500, Jeff Squyres wrote: > On Dec 12, 2007, at 3:20 PM, Gleb Natapov wr

Re: [OMPI devel] matching code rewrite in OB1

2007-12-12 Thread Jeff Squyres
Was Rich referring to ensuring that the test codes checked that their payloads were correct (and not re-assembled in the wrong order)? On Dec 12, 2007, at 4:10 PM, Brian W. Barrett wrote: On Wed, 12 Dec 2007, Gleb Natapov wrote: On Wed, Dec 12, 2007 at 03:46:10PM -0500, Richard Graham

Re: [OMPI devel] matching code rewrite in OB1

2007-12-12 Thread Brian W. Barrett
On Wed, 12 Dec 2007, Gleb Natapov wrote: On Wed, Dec 12, 2007 at 03:46:10PM -0500, Richard Graham wrote: This is better than nothing, but really not very helpful for looking at the specific issues that can arise with this, unless these systems have several parallel networks, with tests that

Re: [OMPI devel] matching code rewrite in OB1

2007-12-12 Thread Jeff Squyres (jsquyres)
On Wed, Dec 12, 2007 at 03:52:17PM -0500, Jeff Squyres wrote: > On Dec 12, 2007, at 3:20 PM, Gleb Natapov wrote: > > >> How about making a tarball with this patch in it that can be thrown > >> at > >> everyone's MTT? (w

Re: [OMPI devel] matching code rewrite in OB1

2007-12-12 Thread Gleb Natapov
On Wed, Dec 12, 2007 at 03:52:17PM -0500, Jeff Squyres wrote: > On Dec 12, 2007, at 3:20 PM, Gleb Natapov wrote: > > >> How about making a tarball with this patch in it that can be thrown > >> at > >> everyone's MTT? (we can put the tarball on www.open-mpi.org > >> somewhere) > > I don't have

Re: [OMPI devel] matching code rewrite in OB1

2007-12-12 Thread Jeff Squyres
On Dec 12, 2007, at 3:20 PM, Gleb Natapov wrote: How about making a tarball with this patch in it that can be thrown at everyone's MTT? (we can put the tarball on www.open-mpi.org somewhere) I don't have access to www.open-mpi.org, but I can send you the patch. I can send you a tarball too,

Re: [OMPI devel] matching code rewrite in OB1

2007-12-12 Thread Gleb Natapov
On Wed, Dec 12, 2007 at 11:57:11AM -0500, Jeff Squyres wrote: > Gleb -- > > How about making a tarball with this patch in it that can be thrown at > everyone's MTT? (we can put the tarball on www.open-mpi.org somewhere) I don't have access to www.open-mpi.org, but I can send you the patch. I

Re: [OMPI devel] matching code rewrite in OB1

2007-12-12 Thread Jeff Squyres
Gleb -- How about making a tarball with this patch in it that can be thrown at everyone's MTT? (we can put the tarball on www.open-mpi.org somewhere) On Dec 11, 2007, at 4:14 PM, Richard Graham wrote: I will re-iterate my concern. The code that is there now is mostly nine years old

Re: [OMPI devel] matching code rewrite in OB1

2007-12-11 Thread Richard Graham
I will re-iterate my concern. The code that is there now is mostly nine years old (with some mods made when it was brought over to Open MPI). It took about 2 months of testing on systems with 5-13 way network parallelism to track down all KNOWN race conditions. This code is at the center of MPI

Re: [OMPI devel] matching code rewrite in OB1

2007-12-11 Thread Gleb Natapov
On Tue, Dec 11, 2007 at 08:03:52AM -0800, Andrew Friedley wrote: > Try UD, frags are reordered at a very high rate so should be a good test. mpi-ping works fine with UD BTL and the patch. > > Andrew > > Richard Graham wrote: > > Gleb, > > I would suggest that before this is checked in this be

Re: [OMPI devel] matching code rewrite in OB1

2007-12-11 Thread Andrew Friedley
Possibly, though I have results from a benchmark I've written indicating the reordering happens at the sender. I believe I found it was due to the QP striping trick I use to get more bandwidth -- if you back down to one QP (there's a define in the code you can change), the reordering rate

Re: [OMPI devel] matching code rewrite in OB1

2007-12-11 Thread Andrew Friedley
Try UD, frags are reordered at a very high rate so should be a good test. Andrew Richard Graham wrote: Gleb, I would suggest that before this is checked in this be tested on a system that has N-way network parallelism, where N is as large as you can find. This is a key bit of code for MPI

Re: [OMPI devel] matching code rewrite in OB1

2007-12-11 Thread Richard Graham
Gleb, I would suggest that before this is checked in this be tested on a system that has N-way network parallelism, where N is as large as you can find. This is a key bit of code for MPI correctness, and out-of-order operations will break it, so you want to maximize the chance for such

Re: [OMPI devel] matching code rewrite in OB1

2007-12-11 Thread Brian W. Barrett
On Tue, 11 Dec 2007, Gleb Natapov wrote: I did a rewrite of matching code in OB1. I made it much simpler and 2 times smaller (which is good: less code, fewer bugs). I also got rid of huge macros - very helpful if you need to debug something. There is no performance degradation, actually I

[OMPI devel] matching code rewrite in OB1

2007-12-11 Thread Gleb Natapov
Hi, I did a rewrite of matching code in OB1. I made it much simpler and 2 times smaller (which is good: less code, fewer bugs). I also got rid of huge macros - very helpful if you need to debug something. There is no performance degradation, actually I even see very small performance