Re: [OMPI devel] SM BTL hang issue
Scott Atchley wrote:

> Terry, Are you testing on Linux? If so, which kernel?

No, I am running into issues on Solaris, but Ollie's run of the test code on Linux seems to work fine.

--td

> See the patch to iperf to handle kernel 2.6.21 and the issue that they
> had with usleep(0):
>
> http://dast.nlanr.net/Projects/Iperf2.0/patch-iperf-linux-2.6.21.txt
>
> Scott
>
> On Aug 31, 2007, at 1:36 PM, Terry D. Dontje wrote:
>
>> Ok, I have an update on this issue. I believe there is an implementation
>> difference in sched_yield between Linux and Solaris. If I change the
>> sched_yield in opal_progress to a usleep(500), then my program completes
>> quite quickly. I have sent a few questions to a Solaris engineer and
>> hopefully will get some useful information.
>>
>> That being said, CT-6's implementation also used yield calls (note this
>> is actually what sched_yield reduces down to on Solaris), and we did not
>> see the same degradation issue as with Open MPI. I believe the reason is
>> that CT-6's SM implementation does not call CT-6's version of progress
>> recursively, forcing all the unexpected messages to be read in before
>> continuing. CT-6 also has natural flow control in its implementation
>> (i.e., it has a fixed-size FIFO for eager messages). I believe both of
>> these characteristics keep CT-6 from being completely killed by the
>> yield differences.
>>
>> --td

The earlier exchange between Terry D. Dontje and Li-Ta Lo (Ollie):

Terry: hmmm, interesting, since my version doesn't abort at all.

Ollie: Some problem with the Fortran compiler/language binding? My C translation doesn't have any problem.

    [ollie@exponential ~]$ mpirun -np 4 a.out 10
    Target duration (seconds): 10.00, #of msgs: 50331, usec per msg: 198.684707

Terry: Did you oversubscribe? I found np=10 on an 8-core system clogged things up sufficiently.

Ollie: Yea, I used np 10 on a 2-processor, 2-hyper-thread system (4 threads total).

Terry: Is this using Linux?

Ollie: Yes.

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [MTT devel] [MTT users] Database submit error
Sounds good. Cleaning up now.

Cheers,
Josh

On Aug 31, 2007, at 1:38 PM, Jeff Squyres wrote:

> No objections. If the data is junk, just ditch it.

On Aug 31, 2007, at 12:47 PM, Josh Hursey wrote:

> I was looking at the data from Monday, Aug 27, 8 am to Tuesday, Aug 28,
> noonish, when this problem was occurring, and the data is mostly
> invalid. We have test_builds pointing at the wrong test_suites. Since
> this brings all of this data into suspicion, I'm going through and
> flagging it all as 'trial'.
>
> If you don't have any conflict, then I'd like to remove this data
> altogether from the database so the normalization tables can be cleaned
> up. Any objections to removing the set of data in the time range
> Monday, Aug 27, 8 am to Tuesday, Aug 28, noonish? It's about 8,000
> test_runs; since most of the test runs were getting rejected during
> that time period, we are not losing any good data.
>
> -- Josh

On Aug 28, 2007, at 10:27 AM, Josh Hursey wrote:

> Short version: I just finished the fix, and the submit script is back
> up and running. This was a bug that arose in testing but somehow did
> not get propagated to the production database.
>
> Long version: The new database uses partition tables to archive test
> results. As part of this, there are some complex rules to mask the
> partition-table complexity from the users of the db. There was a bug in
> the insert rule in which the 'id' of the submitted result (mpi_install,
> test_build, and test_run) was a different value than expected, since
> the 'id' was not translated properly to the partition-table setup. The
> fix was to drop all the rules and replace them with the correct
> versions.
>
> The submit errors you saw below were caused by integrity checks in the
> submit script that keep data from being submitted without a proper
> lineage (e.g., you cannot submit a test_run without having submitted a
> test_build and an mpi_install result). The bug caused the client and
> the server to become confused about what the proper 'id' should be, and
> when the submit script attempted to 'guess' the correct run, it was
> unsuccessful and errored out.
>
> So sorry this bug lived this long, but it should be fixed now.
>
> -- Josh

On Aug 28, 2007, at 10:16 AM, Jeff Squyres wrote:

> Josh found the problem and is in the process of fixing it. DB submits
> are currently disabled while Josh is working on the fix. More specific
> details coming soon.
>
> Unfortunately, it looks like all data from last night will be junk. :-(
> You might as well kill any MTT scripts that are still running from last
> night.

On Aug 28, 2007, at 9:14 AM, Jeff Squyres wrote:

> Josh and I are investigating -- the total runs in the db in the summary
> report from this morning is far too low. :-(

On Aug 28, 2007, at 9:13 AM, Tim Prins wrote:

> It installed, and the tests built and made it into the database:
> http://www.open-mpi.org/mtt/reporter.php?do_redir=293
>
> Tim

Jeff Squyres wrote:

> Did you get a correct MPI install section for mpich2?

On Aug 28, 2007, at 9:05 AM, Tim Prins wrote:

> Hi all,
>
> I am working with the jms branch, and when trying to use mpich2, I get
> the following submit error:
>
>     *** WARNING: MTTDatabase server notice: mpi_install_section_name is not in mtt database.
>     MTTDatabase server notice: number_of_results is not in mtt database.
>     MTTDatabase server notice: phase is not in mtt database.
>     MTTDatabase server notice: test_type is not in mtt database.
>     MTTDatabase server notice: test_build_section_name is not in mtt database.
>     MTTDatabase server notice: variant is not in mtt database.
>     MTTDatabase server notice: command is not in mtt database.
>     MTTDatabase server notice: fields is not in mtt database.
>     MTTDatabase server notice: resource_manager is not in mtt database.
>     MTT submission for test run
>     MTTDatabase server notice: Invalid test_build_id (47368) given.
>     Guessing that it should be -1
>     MTTDatabase server error: ERROR: Unable to find a test_build to associate with this test_run.
>     MTTDatabase abort: (Tried to send HTTP error) 400
>     MTTDatabase abort: No test_build associated with this test_run
>     *** WARNING: MTTDatabase did not get a serial; phases will be isolated from each other in the reports
>     Reported to MTTDatabase: 1 successful submit, 0 failed submits (total of 12 results)
>
> This happens for each test run section.
>
> Thanks,
> Tim

___
mtt-users mailing list
mtt-us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users
Re: [OMPI devel] Public tmp branches
Jeff Squyres wrote:

> That's fine, too. I don't really care -- /public already exists. We can
> simply rename it to /tmp-public.

Let's do that. It should (more or less) address all the concerns that have been voiced.

Tim

On Aug 31, 2007, at 8:52 AM, Ralph Castain wrote:

>> Why not make /tmp-public and /tmp-private? Leave /tmp alone. Have all
>> new branches made in one of the two new directories, and as /tmp
>> branches are slowly whacked, we can (eventually) get rid of /tmp.
>
> I'm fine with that. If no one else objects, let's bring this up on
> Tuesday to make sure everyone is aware, and then pick a date to rename
> everything (this requires a global sync, since it will affect anyone
> who has a current /tmp checkout).
>
> Or, to make life really simple, just leave /tmp alone and private, and
> create a tmp-public for branches that are not private. That way, those
> of us with private tmp branches are unaffected, no global syncs are
> required, etc. Or perhaps that is -too- simple ;-)
>
> Ralph
[MTT devel] Testbake results
From last night -- it ain't perfect yet, but we're getting darn close:

http://www.open-mpi.org/mtt/index.php?do_redir=309

(you may need "show trial" turned on to see these?)

I'll be digging into these results today to chase down some final issues. I know of a few problems left:

- looks like the MPICH2 test runs didn't fire properly
- timeouts won't be good for large np values
- need a way to specify (for each MPI) by node/slot between netpipe+osu and imb+skampi
- sometimes the "pass" count does not equal the "perf" count (I suspect client problems, not server problems)

-- Jeff Squyres
Cisco Systems
Re: [OMPI devel] Public tmp branches
Why not make /tmp-public and /tmp-private? Leave /tmp alone. Have all new branches made in one of the two new directories, and as /tmp branches are slowly whacked, we can (eventually) get rid of /tmp.

Tim

Jeff Squyres (jsquyres) wrote:

I thought about both of those (/tmp/private and /tmp/public), but couldn't think of a way to make them work.

1. If we do /tmp/private, we have to svn mv all the existing trees there, which will annoy the developers (but is not a deal-breaker), and make /tmp publicly readable. But that makes the history of all the private branches public.

2. If we do /tmp/public, I'm not quite sure how to set up the perms in SH to do that -- if we set up /tmp to be 'no read access' for * and /tmp/public to have 'read access' for *, will a non-authenticated user be able to reach /tmp/private?

-jms

-----Original Message-----
From: Brian Barrett [mailto:bbarr...@lanl.gov]
Sent: Friday, August 17, 2007 11:51 AM Eastern Standard Time
To: Open MPI Developers
Subject: Re: [OMPI devel] Public tmp branches

ugh, sorry, I've been busy this week and didn't see a timeout, so a response got delayed. I really don't like this format. "public" doesn't have any meaning to it (tmp suggests, well, it's temporary). I'd rather have /tmp/ and /tmp/private or something like that. Or /tmp/ and /tmp/public/. Either way :/.

Brian

On Aug 17, 2007, at 6:21 AM, Jeff Squyres wrote:

> I didn't really put this in RFC format with a timeout, but no one
> objected, so I have created:
>
> http://svn.open-mpi.org/svn/ompi/public
>
> Developers should feel free to use this tree for public temporary
> branches. Specifically:
>
> - use /tmp if your branch is intended to be private
> - use /public if your branch is intended to be public
>
> Enjoy.
>
> On Aug 10, 2007, at 9:50 AM, Jeff Squyres wrote:
>
>> Right now all branches under /tmp are private to the OMPI core group
>> (e.g., to allow unpublished academic work). However, there are
>> definitely cases where it would be useful to allow public branches
>> when there is development work that is public but not yet ready for
>> the trunk. Periodically, we go and assign individual permissions to
>> /tmp branches (like I just did to /tmp/vt-integration), but it would
>> be easier if we had a separate tree for public "tmp" branches.
>>
>> Would anyone have an objection if I added /public (or any better name
>> that someone can think of) for tmp-style branches, but that are open
>> for read-only access to the public?
>>
>> -- Jeff Squyres
>> Cisco Systems
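The per-path permissions question Jeff raises is the kind of thing Subversion's path-based authorization (an authz file) is meant to express. A sketch follows; the group and user names are hypothetical, and this is not the actual open-mpi.org configuration:

```
[groups]
# hypothetical group of core developers
ompi-core = jsquyres, rhc, bbarrett

[/tmp-private]
# no access for anonymous users; branch history stays private
* =
@ompi-core = rw

[/tmp-public]
# world-readable; core developers can commit
* = r
@ompi-core = rw
```

With separate top-level directories like these, each subtree gets its own rule, which sidesteps the question of whether a readable child under an unreadable /tmp leaks anything.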