Re: [OMPI devel] SM BTL hang issue

2007-08-31 Thread Terry D. Dontje

Scott Atchley wrote:

Terry,

Are you testing on Linux? If so, which kernel?

No, I am running into issues on Solaris, but Ollie's run of the test
code on Linux seems to work fine.

--td

See the patch to iperf to handle kernel 2.6.21 and the issue that they
had with usleep(0):

http://dast.nlanr.net/Projects/Iperf2.0/patch-iperf-linux-2.6.21.txt

Scott

On Aug 31, 2007, at 1:36 PM, Terry D. Dontje wrote:

Ok, I have an update to this issue.  I believe there is an
implementation difference of sched_yield between Linux and Solaris.  If
I change the sched_yield in opal_progress to be a usleep(500) then my
program completes quite quickly.  I have sent a few questions to a
Solaris engineer and hopefully will get some useful information.

That being said, CT-6's implementation also used yield calls (note this
actually is what sched_yield reduces down to in Solaris) and we did not
see the same degradation issue as with Open MPI.  I believe the reason
is that CT-6's SM implementation is not calling CT-6's version of
progress recursively and forcing all the unexpected messages to be read
in before continuing.  CT-6 also has natural flow control in its
implementation (i.e., it has a fixed-size FIFO for eager messages).

I believe both of these characteristics keep CT-6 from being
completely killed by the yield differences.

--td


Li-Ta Lo wrote:

On Thu, 2007-08-30 at 12:45 -0400, terry.don...@sun.com wrote:

Li-Ta Lo wrote:

On Thu, 2007-08-30 at 12:25 -0400, terry.don...@sun.com wrote:

Li-Ta Lo wrote:

On Wed, 2007-08-29 at 14:06 -0400, Terry D. Dontje wrote:

hmmm, interesting since my version doesn't abort at all.

Some problem with the Fortran compiler/language binding? My C
translation doesn't have any problem.

[ollie@exponential ~]$ mpirun -np 4 a.out 10
Target duration (seconds): 10.00, #of msgs: 50331, usec per msg:
198.684707

Did you oversubscribe?  I found np=10 on an 8-core system clogged
things up sufficiently.

Yea, I used np 10 on a 2 proc, 2 hyper-thread system (total 4 threads).

Is this using Linux?

Yes.

Ollie




Re: [OMPI devel] SM BTL hang issue

2007-08-31 Thread Scott Atchley

Terry,

Are you testing on Linux? If so, which kernel?

See the patch to iperf to handle kernel 2.6.21 and the issue that they
had with usleep(0):


http://dast.nlanr.net/Projects/Iperf2.0/patch-iperf-linux-2.6.21.txt
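
In outline, the change there swaps the primitive used to idle in a
polling loop.  A minimal sketch of the two styles -- the kernel
behavior in the comments reflects the 2.6.21 high-resolution-timer
change as I understand it, and wait_for_flag() is illustrative rather
than iperf's actual code:

    #include <sched.h>     /* sched_yield() */
    #include <unistd.h>    /* usleep() */

    /* Before Linux 2.6.21, usleep(0) was rounded up to a timer tick, so
     * a polling loop built on it really gave up the CPU.  With the
     * high-resolution timers introduced in 2.6.21, usleep(0) can return
     * immediately and the loop degenerates into a busy spin; an explicit
     * sched_yield() (or a non-zero sleep) restores the pause. */
    static void wait_for_flag(volatile int *done, int use_yield)
    {
        while (!*done) {
            if (use_yield)
                sched_yield();   /* cede the CPU to any runnable thread */
            else
                usleep(0);       /* pre-2.6.21: sleeps a tick; later: may spin */
        }
    }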

Scott

On Aug 31, 2007, at 1:36 PM, Terry D. Dontje wrote:


Ok, I have an update to this issue.  I believe there is an
implementation difference of sched_yield between Linux and Solaris.  If
I change the sched_yield in opal_progress to be a usleep(500) then my
program completes quite quickly.  I have sent a few questions to a
Solaris engineer and hopefully will get some useful information.

That being said, CT-6's implementation also used yield calls (note this
actually is what sched_yield reduces down to in Solaris) and we did not
see the same degradation issue as with Open MPI.  I believe the reason
is that CT-6's SM implementation is not calling CT-6's version of
progress recursively and forcing all the unexpected messages to be read
in before continuing.  CT-6 also has natural flow control in its
implementation (i.e., it has a fixed-size FIFO for eager messages).

I believe both of these characteristics keep CT-6 from being
completely killed by the yield differences.

--td


Li-Ta Lo wrote:

On Thu, 2007-08-30 at 12:45 -0400, terry.don...@sun.com wrote:

Li-Ta Lo wrote:

On Thu, 2007-08-30 at 12:25 -0400, terry.don...@sun.com wrote:

Li-Ta Lo wrote:

On Wed, 2007-08-29 at 14:06 -0400, Terry D. Dontje wrote:

hmmm, interesting since my version doesn't abort at all.

Some problem with the Fortran compiler/language binding? My C
translation doesn't have any problem.

[ollie@exponential ~]$ mpirun -np 4 a.out 10
Target duration (seconds): 10.00, #of msgs: 50331, usec per msg:
198.684707

Did you oversubscribe?  I found np=10 on an 8-core system clogged
things up sufficiently.

Yea, I used np 10 on a 2 proc, 2 hyper-thread system (total 4 threads).

Is this using Linux?

Yes.

Ollie




Re: [MTT devel] [MTT users] Database submit error

2007-08-31 Thread Josh Hursey

Sounds good. Cleaning up now.

Cheers,
Josh

On Aug 31, 2007, at 1:38 PM, Jeff Squyres wrote:


No objections.  If the data is junk, just ditch it.

On Aug 31, 2007, at 12:47 PM, Josh Hursey wrote:


I was looking at the data from Monday Aug 27, 8 am to Tuesday Aug 28,
noonish, when this problem was occurring, and the data is mostly
invalid. We have test_builds pointing at the wrong test_suites. Since
this brings all of this data into suspicion, I'm going through and
flagging them all as 'trial'.

If you don't have any conflict, then I'd like to remove this data
altogether from the database so the normalization tables can be
cleaned up.

Any objections to removing the set of data in the time range Monday
Aug 27, 8 am to Tuesday Aug 28, noonish? It's about 8,000 test_runs;
since most of the test runs were getting rejected during that time
period, we are not losing any good data.

-- Josh


On Aug 28, 2007, at 10:27 AM, Josh Hursey wrote:


Short Version:
--
I just finished the fix, and the submit script is back up and
running.

This was a bug that arose in testing, but somehow did not get
propagated to the production database.

Long Version:
-
The new database uses partition tables to archive test results. As
part of this there are some complex rules to mask the partition table
complexity from the users of the db. There was a bug in the insert
rule in which the 'id' of the submitted result (mpi_install,
test_build, and test_run) was a different value than expected, since
the 'id' was not translated properly to the partition table setup.

The fix was to drop all rules and replace them with the correct
versions. The submit errors you saw below were caused by integrity
checks in the submit script that keep data from being submitted when
it does not have a proper lineage (e.g., you cannot submit a test_run
without having submitted a test_build and an mpi_install result). The
bug caused the client and the server to become confused about what the
proper 'id' should be, and when the submit script attempted to 'guess'
the correct run it was unsuccessful and errored out.

So sorry this bug lived this long, but it should be fixed now.

-- Josh

On Aug 28, 2007, at 10:16 AM, Jeff Squyres wrote:


Josh found the problem and is in the process of fixing it.  DB
submits are currently disabled while Josh is working on the fix.
More specific details coming soon.

Unfortunately, it looks like all data from last night will be
junk.  :-(  You might as well kill any MTT scripts that are still
running from last night.


On Aug 28, 2007, at 9:14 AM, Jeff Squyres wrote:


Josh and I are investigating -- the total runs in the db in the
summary report from this morning is far too low.  :-(


On Aug 28, 2007, at 9:13 AM, Tim Prins wrote:


It installed and the tests built and made it into the database:
http://www.open-mpi.org/mtt/reporter.php?do_redir=293

Tim

Jeff Squyres wrote:

Did you get a correct MPI install section for mpich2?

On Aug 28, 2007, at 9:05 AM, Tim Prins wrote:


Hi all,

I am working with the jms branch, and when trying to use mpich2, I get
the following submit error:

*** WARNING: MTTDatabase server notice: mpi_install_section_name is not
      in mtt database.
    MTTDatabase server notice: number_of_results is not in mtt database.
    MTTDatabase server notice: phase is not in mtt database.
    MTTDatabase server notice: test_type is not in mtt database.
    MTTDatabase server notice: test_build_section_name is not in mtt
      database.
    MTTDatabase server notice: variant is not in mtt database.
    MTTDatabase server notice: command is not in mtt database.
    MTTDatabase server notice: fields is not in mtt database.
    MTTDatabase server notice: resource_manager is not in mtt database.

    MTT submission for test run
    MTTDatabase server notice: Invalid test_build_id (47368) given.
      Guessing that it should be -1
    MTTDatabase server error: ERROR: Unable to find a test_build to
      associate with this test_run.

    MTTDatabase abort: (Tried to send HTTP error) 400
    MTTDatabase abort: No test_build associated with this test_run
*** WARNING: MTTDatabase did not get a serial; phases will be isolated
      from each other in the reports

Reported to MTTDatabase: 1 successful submit, 0 failed submits
    (total of 12 results)

This happens for each test run section.

Thanks,

Tim



--
Jeff Squyres
Cisco Systems




--
Jeff Squyres
Cisco Systems


Re: [MTT devel] [MTT users] Database submit error

2007-08-31 Thread Jeff Squyres

No objections.  If the data is junk, just ditch it.

On Aug 31, 2007, at 12:47 PM, Josh Hursey wrote:


I was looking at the data from Monday Aug 27, 8 am to Tuesday Aug 28,
noonish, when this problem was occurring, and the data is mostly
invalid. We have test_builds pointing at the wrong test_suites. Since
this brings all of this data into suspicion, I'm going through and
flagging them all as 'trial'.

If you don't have any conflict, then I'd like to remove this data
altogether from the database so the normalization tables can be
cleaned up.

Any objections to removing the set of data in the time range Monday
Aug 27, 8 am to Tuesday Aug 28, noonish? It's about 8,000 test_runs;
since most of the test runs were getting rejected during that time
period, we are not losing any good data.

-- Josh


On Aug 28, 2007, at 10:27 AM, Josh Hursey wrote:


Short Version:
--
I just finished the fix, and the submit script is back up and running.


This was a bug that arose in testing, but somehow did not get
propagated to the production database.

Long Version:
-
The new database uses partition tables to archive test results. As
part of this there are some complex rules to mask the partition table
complexity from the users of the db. There was a bug in the insert
rule in which the 'id' of the submitted result (mpi_install,
test_build, and test_run) was a different value than expected, since
the 'id' was not translated properly to the partition table setup.

The fix was to drop all rules and replace them with the correct
versions. The submit errors you saw below were caused by integrity
checks in the submit script that keep data from being submitted when
it does not have a proper lineage (e.g., you cannot submit a test_run
without having submitted a test_build and an mpi_install result). The
bug caused the client and the server to become confused about what the
proper 'id' should be, and when the submit script attempted to 'guess'
the correct run it was unsuccessful and errored out.

So sorry this bug lived this long, but it should be fixed now.

-- Josh

On Aug 28, 2007, at 10:16 AM, Jeff Squyres wrote:


Josh found the problem and is in the process of fixing it.  DB
submits are currently disabled while Josh is working on the fix.
More specific details coming soon.

Unfortunately, it looks like all data from last night will be
junk.  :-(  You might as well kill any MTT scripts that are still
running from last night.


On Aug 28, 2007, at 9:14 AM, Jeff Squyres wrote:


Josh and I are investigating -- the total runs in the db in the
summary report from this morning is far too low.  :-(


On Aug 28, 2007, at 9:13 AM, Tim Prins wrote:


It installed and the tests built and made it into the database:
http://www.open-mpi.org/mtt/reporter.php?do_redir=293

Tim

Jeff Squyres wrote:

Did you get a correct MPI install section for mpich2?

On Aug 28, 2007, at 9:05 AM, Tim Prins wrote:


Hi all,

I am working with the jms branch, and when trying to use mpich2, I get
the following submit error:

*** WARNING: MTTDatabase server notice: mpi_install_section_name is not
      in mtt database.
    MTTDatabase server notice: number_of_results is not in mtt database.
    MTTDatabase server notice: phase is not in mtt database.
    MTTDatabase server notice: test_type is not in mtt database.
    MTTDatabase server notice: test_build_section_name is not in mtt
      database.
    MTTDatabase server notice: variant is not in mtt database.
    MTTDatabase server notice: command is not in mtt database.
    MTTDatabase server notice: fields is not in mtt database.
    MTTDatabase server notice: resource_manager is not in mtt database.

    MTT submission for test run
    MTTDatabase server notice: Invalid test_build_id (47368) given.
      Guessing that it should be -1
    MTTDatabase server error: ERROR: Unable to find a test_build to
      associate with this test_run.

    MTTDatabase abort: (Tried to send HTTP error) 400
    MTTDatabase abort: No test_build associated with this test_run
*** WARNING: MTTDatabase did not get a serial; phases will be isolated
      from each other in the reports

Reported to MTTDatabase: 1 successful submit, 0 failed submits
    (total of 12 results)

This happens for each test run section.

Thanks,

Tim



--
Jeff Squyres
Cisco Systems




--
Jeff Squyres
Cisco Systems




Re: [OMPI devel] SM BTL hang issue

2007-08-31 Thread Terry D. Dontje

Ok, I have an update to this issue.  I believe there is an
implementation difference of sched_yield between Linux and Solaris.  If
I change the sched_yield in opal_progress to be a usleep(500) then my
program completes quite quickly.  I have sent a few questions to a
Solaris engineer and hopefully will get some useful information.
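
As a rough sketch of what that experiment looks like (a stripped-down
progress loop, not the real opal_progress() in
opal/runtime/opal_progress.c; poll_all_btls() is a hypothetical
stand-in for the actual event polling):

    #include <sched.h>    /* sched_yield() */
    #include <unistd.h>   /* usleep() */

    static int poll_all_btls(void) { return 0; }  /* hypothetical stand-in */

    /* Idle path of a progress loop: when nothing completed, give up the
     * CPU.  The observation above is that Linux and Solaris differ in
     * how a yielding thread is rescheduled (on Solaris sched_yield()
     * reduces to yield()); usleep(500) forces a real pause instead. */
    static int progress_sketch(void)
    {
        int events = poll_all_btls();   /* drain any pending messages */

        if (events == 0) {
    #ifdef YIELD_WHEN_IDLE
            sched_yield();              /* original behavior */
    #else
            usleep(500);                /* the experiment: sleep 500 usec */
    #endif
        }
        return events;
    }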


That being said, CT-6's implementation also used yield calls (note this
actually is what sched_yield reduces down to in Solaris) and we did not
see the same degradation issue as with Open MPI.  I believe the reason
is that CT-6's SM implementation is not calling CT-6's version of
progress recursively and forcing all the unexpected messages to be read
in before continuing.  CT-6 also has natural flow control in its
implementation (i.e., it has a fixed-size FIFO for eager messages).
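
To illustrate the flow-control point: with a fixed number of eager
slots per peer, a sender that outruns the receiver stalls (or falls
back to another protocol) instead of piling up unbounded unexpected
messages.  A toy single-producer/single-consumer sketch of the idea --
not CT-6's or Open MPI's actual structures, and ignoring the memory
barriers a real shared-memory FIFO would need:

    #include <stdbool.h>
    #include <string.h>

    #define EAGER_SLOTS 8u      /* fixed number of in-flight eager messages */
    #define EAGER_SIZE  256     /* payload bytes per slot */

    struct eager_fifo {
        char     slots[EAGER_SLOTS][EAGER_SIZE];
        unsigned head;          /* next slot to write (sender) */
        unsigned tail;          /* next slot to read (receiver) */
    };

    /* Returns false when the FIFO is full: the sender must retry later
     * instead of queueing more -- this is the back-pressure point. */
    static bool eager_push(struct eager_fifo *f, const void *msg, size_t len)
    {
        if (f->head - f->tail == EAGER_SLOTS || len > EAGER_SIZE)
            return false;
        memcpy(f->slots[f->head % EAGER_SLOTS], msg, len);
        f->head++;
        return true;
    }

    /* Returns false when there is nothing to read. */
    static bool eager_pop(struct eager_fifo *f, void *buf, size_t len)
    {
        if (f->tail == f->head)
            return false;
        memcpy(buf, f->slots[f->tail % EAGER_SLOTS], len);
        f->tail++;
        return true;
    }

The bounded slot count is what supplies the natural flow control: the
sender can never run ahead of the receiver by more than EAGER_SLOTS
messages.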


I believe both of these characteristics keep CT-6 from being
completely killed by the yield differences.


--td


Li-Ta Lo wrote:

On Thu, 2007-08-30 at 12:45 -0400, terry.don...@sun.com wrote:

Li-Ta Lo wrote:

On Thu, 2007-08-30 at 12:25 -0400, terry.don...@sun.com wrote:

Li-Ta Lo wrote:

On Wed, 2007-08-29 at 14:06 -0400, Terry D. Dontje wrote:

hmmm, interesting since my version doesn't abort at all.

Some problem with the Fortran compiler/language binding? My C
translation doesn't have any problem.

[ollie@exponential ~]$ mpirun -np 4 a.out 10
Target duration (seconds): 10.00, #of msgs: 50331, usec per msg:
198.684707

Did you oversubscribe?  I found np=10 on an 8-core system clogged
things up sufficiently.

Yea, I used np 10 on a 2 proc, 2 hyper-thread system (total 4 threads).

Is this using Linux?

Yes.

Ollie




Re: [OMPI devel] Public tmp branches

2007-08-31 Thread Tim Prins

Jeff Squyres wrote:
That's fine, too.  I don't really care -- /public already exists.  We  
can simply rename it to /tmp-public.


Let's do that. It should (more or less) address all concerns that have 
been voiced.


Tim




On Aug 31, 2007, at 8:52 AM, Ralph Castain wrote:


Why not make /tmp-public and /tmp-private?

Leave /tmp alone. Have all new branches made in one of the two new
directories, and as /tmp branches are slowly whacked, we can
(eventually) get rid of /tmp.

I'm fine with that.  If no one else objects, let's bring this up on
Tuesday to make sure everyone is aware and then pick a date to rename
everything (requires a global sync since it will affect anyone who
has a current /tmp checkout).

Or, to make life really simple, just leave /tmp alone and private.  Just
create a tmp-public for branches that are not private.  That way, those
of us with private tmp branches are unaffected, no global syncs are
required, etc.

Or perhaps that is -too- simple ;-)

Ralph





Re: [OMPI devel] Public tmp branches

2007-08-31 Thread Jeff Squyres
That's fine, too.  I don't really care -- /public already exists.  We  
can simply rename it to /tmp-public.



On Aug 31, 2007, at 8:52 AM, Ralph Castain wrote:


Why not make /tmp-public and /tmp-private?

Leave /tmp alone. Have all new branches made in one of the two new
directories, and as /tmp branches are slowly whacked, we can
(eventually) get rid of /tmp.


I'm fine with that.  If no one else objects, let's bring this up on
Tuesday to make sure everyone is aware and then pick a date to rename
everything (requires a global sync since it will affect anyone who
has a current /tmp checkout).


Or, to make life really simple, just leave /tmp alone and private.  Just
create a tmp-public for branches that are not private.  That way, those
of us with private tmp branches are unaffected, no global syncs are
required, etc.

Or perhaps that is -too- simple ;-)

Ralph





--
Jeff Squyres
Cisco Systems



[MTT devel] Testbake results

2007-08-31 Thread Jeff Squyres

From last night -- it ain't perfect yet, but we're getting darn close:

http://www.open-mpi.org/mtt/index.php?do_redir=309

(you may need "show trial" on to see these?)

I'll be digging into these results today to chase down some final
issues.  I know of a few problems left:

- looks like the MPICH2 test runs didn't fire properly
- timeouts won't be good for large np values
- need a way to specify (for each MPI) by node/slot between netpipe+osu
  and imb+skampi
- sometimes the "pass" count does not equal the "perf" count (I suspect
  client problems, not server problems)


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Public tmp branches

2007-08-31 Thread Tim Prins

Why not make /tmp-public and /tmp-private?

Leave /tmp alone. Have all new branches made in one of the two new 
directories, and as /tmp branches are slowly whacked, we can 
(eventually) get rid of /tmp.


Tim

Jeff Squyres (jsquyres) wrote:

I thought about both of those (/tmp/private and /tmp/public), but
couldn't think of a way to make them work.

1. If we do /tmp/private, we have to svn mv all the existing trees
there, which will annoy the developers (but is not a deal-breaker), and
make /tmp publicly readable.  But that makes the history of all the
private branches public.

2. If we do /tmp/public, I'm not quite sure how to set up the perms in
SVN to do that - if we set up /tmp to be 'no read access' for * and
/tmp/public to have 'read access' for *, will a non-authenticated user
be able to reach /tmp/private?

-jms

 -Original Message-
From:   Brian Barrett [mailto:bbarr...@lanl.gov]
Sent:   Friday, August 17, 2007 11:51 AM Eastern Standard Time
To: Open MPI Developers
Subject:Re: [OMPI devel] Public tmp branches

Ugh, sorry, I've been busy this week and didn't see a timeout, so a
response got delayed.

I really don't like this format.  "public" doesn't have any meaning to
it (tmp suggests, well, it's temporary).  I'd rather have /tmp/ and
/tmp/private or something like that.  Or /tmp/ and /tmp/public/.
Either way :/.

Brian


On Aug 17, 2007, at 6:21 AM, Jeff Squyres wrote:

 > I didn't really put this in RFC format with a timeout, but no one
 > objected, so I have created:
 >
 >   http://svn.open-mpi.org/svn/ompi/public
 >
 > Developers should feel free to use this tree for public temporary
 > branches.  Specifically:
 >
 > - use /tmp if your branch is intended to be private
 > - use /public if your branch is intended to be public
 >
 > Enjoy.
 >
 >
 > On Aug 10, 2007, at 9:50 AM, Jeff Squyres wrote:
 >
 >> Right now all branches under /tmp are private to the OMPI core group
 >> (e.g., to allow unpublished academic work).  However, there are
 >> definitely cases where it would be useful to allow public branches
 >> when there's development work that is public but not yet ready for
 >> the trunk.  Periodically, we go and assign individual permissions to
 >> /tmp branches (like I just did to /tmp/vt-integration), but it would
 >> be easier if we had a separate tree for public "tmp" branches.
 >>
 >> Would anyone have an objection if I added /public (or any better name
 >> that someone can think of) for tmp-style branches, but that are open
 >> for read-only access to the public?
 >>
 >> --
 >> Jeff Squyres
 >> Cisco Systems
 >>
 >
 >
 > --
 > Jeff Squyres
 > Cisco Systems
 >




