[MTT users] Status

2006-08-30 Thread Jeff Squyres
Josh noticed that Test Run data is not currently being recorded.  I actually
had already filed ticket #42 about this -- just to let you all know, we're
aware of the problem and Ethan is working on it.

Also, I just brought over the CSH script fix that Josh identified earlier
(i.e., the generated sourceable script now contains "$?" properly, not
"0").  And I added a new variable to the generated scripts: $MPI_ROOT.
The intent is that you can source these scripts and then run with OMPI as
such:

mpirun --prefix $MPI_ROOT -np 8 a.out

This is for systems that need --prefix (e.g., rsh/ssh environments); SLURM/PBS
users need not worry.
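
For example, a rough sketch of how the generated script might be used (the
path to the file is hypothetical; it lives somewhere under your MTT scratch
tree):

    # Source the generated csh env file, then launch using the prefix it sets
    source /path/to/mtt-scratch/installs/.../mpi_installed_vars.csh
    mpirun --prefix $MPI_ROOT -np 8 a.out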

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


Re: [MTT users] Tests timing out

2006-08-30 Thread Josh Hursey
This fixes the hanging and gets me running (and passing) some/most of  
the tests [Trivial and ibm]. Yay!


I have a 16 processor job running on Odin at the moment that seems to  
be going well so far.


Thanks for your help.

Want me to file a bug about the tcsh problem below?

-- Josh

On Aug 30, 2006, at 2:30 PM, Jeff Squyres wrote:


Bah!

This is the result of perl expanding $? to 0 -- it seems that I need to
escape $? so that it's not output as 0.

Sorry about that!

So is this just for the sourcing files, or for your overall (hanging)
problems?


On 8/30/06 2:28 PM, "Josh Hursey"  wrote:


So here are the results of my exploration. I have things running now.
The problem was that the user that I am running under does not set
the LD_LIBRARY_PATH variable at any point. So when MTT tries to
export the variable it does:
if (0LD_LIBRARY_PATH == 0) then
  setenv LD_LIBRARY_PATH /san//install/lib
else
  setenv LD_LIBRARY_PATH /san//install/lib:$LD_LIBRARY_PATH
endif

This causes tcsh to emit an error that LD_LIBRARY_PATH is not defined, so
the variable is never set.

I fixed this by always setting it to "" in the .cshrc file. However, MTT
could do a sanity check to see whether the variable is defined before
checking its value. Something like:


if ($?LD_LIBRARY_PATH) then
else
   setenv LD_LIBRARY_PATH ""
endif

if (0LD_LIBRARY_PATH == 0) then
  setenv LD_LIBRARY_PATH /san//install/lib
else
  setenv LD_LIBRARY_PATH /san//install/lib:$LD_LIBRARY_PATH
endif


or something of the sort.
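
For reference, here is a minimal sketch of what the generated csh block could
look like once the $? escaping is fixed and the defined-check is folded in
(the install path is purely illustrative):

    if (! $?LD_LIBRARY_PATH) then
        setenv LD_LIBRARY_PATH /path/to/ompi/install/lib
    else
        setenv LD_LIBRARY_PATH /path/to/ompi/install/lib:$LD_LIBRARY_PATH
    endif

That would avoid both the undefined-variable error and the need to pre-seed
LD_LIBRARY_PATH in .cshrc.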

As another note, could we start a "How to debug MTT" Wiki page with
some of the information that Jeff sent in this message regarding the
dumping of env vars? I think that would be helpful when getting
things started.

Thanks for all your help, I'm sure I'll have more questions in the
near future.

Cheers,
Josh


On Aug 30, 2006, at 12:31 PM, Jeff Squyres wrote:


On 8/30/06 12:10 PM, "Josh Hursey"  wrote:

MTT directly sets environment variables in its own environment (via
$ENV{whatever} = "foo") before using fork/exec to launch compiles and runs.
Hence, the forked children inherit the environment variables that we set
(e.g., PATH and LD_LIBRARY_PATH).

So if you source the env vars files that MTT drops, that should be
sufficient.
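
A minimal shell illustration of that inheritance (this is not MTT's actual
Perl code, and the install prefix is made up):

    # Variables exported by the parent are inherited by anything it launches afterwards.
    export PATH=/path/to/ompi/install/bin:$PATH
    export LD_LIBRARY_PATH=/path/to/ompi/install/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
    # A child process (compile, mpirun, etc.) sees both values:
    sh -c 'env | grep -E "^(PATH|LD_LIBRARY_PATH)="'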


Does it drop them to a file, or is it printed in the debugging output
anywhere? I'm having a bit of trouble finding these strings in the output.


It does not put these in the -debug output.

The files that it drops are in the scratch dir.  You'll need to go into
scratch/installs, and then it depends on what your INI file section names
are.  You'll go to:

/installs

And there should be files named "mpi_installed_vars.[csh|sh]" that you can
source, depending on your shell.  It should set PATH and LD_LIBRARY_PATH.

The intent of these files is for exactly this purpose -- for a human to test
borked MPI installs inside the MTT scratch tree.
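
For example, a hypothetical session (the directory names below are
placeholders; the real ones depend on your INI section names and scratch
location):

    # Adjust the section/version directories to match your own scratch tree
    cd /path/to/mtt-scratch/installs/<mpi-install-section>/<mpi-version>
    source mpi_installed_vars.csh     # or mpi_installed_vars.sh for sh/bash users
    which mpirun                      # should now resolve to the MTT-built install
    mpirun -np 2 ./a.out              # ./a.out stands in for whatever test you want to re-run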



As for setting the values on *remote* nodes, we do it solely via the
--prefix option.  I wonder if --prefix is broken under SLURM...?  That might
be something to check -- you might be inadvertently mixing installations of
OMPI...?


Yep I'll check it out.

Cheers,
Josh




On 8/30/06 10:36 AM, "Josh Hursey"  wrote:


I'm trying to replicate the MTT environment as much as possible, and
have a couple of questions.

Assume there is no mpirun in my PATH/LD_LIBRARY_PATH when I start MTT.
After MTT builds Open MPI, how does it export these variables so that it
can build the tests? How does it export these when it runs those tests
(solely via --prefix)?

Cheers,
josh

On Aug 30, 2006, at 10:25 AM, Josh Hursey wrote:

I already tried that. However I'm trying it in a couple different ways and
getting some mixed results. Let me formulate the error cases and get back
to you.

Cheers,
Josh

On Aug 30, 2006, at 10:17 AM, Ralph H Castain wrote:

Well, why don't you try first separating this from MTT? Just run the
command manually in batch mode and see if it works. If that works, then the
problem is with MTT. Otherwise, we have a problem with notification.

Or are you saying that you have already done this?
Ralph


On 8/30/06 8:03 AM, "Josh Hursey"  wrote:



Yet another point (sorry for the spam). This may not be an MTT issue but a
broken ORTE on the trunk :(

When I try to run in an allocation (srun -N 16 -A) things run fine. But if
I try to run in batch mode (srun -N 16 -b myscript.sh) then I see the same
hang as in MTT. It seems that mpirun is not getting properly notified of
the completion of the job. :(

I'll try to investigate a bit further today. Any thoughts on what might be
causing this?

Cheers,
Josh

On Aug 30, 2006, at 9:54 AM, Josh Hursey wrote:


Forgot this bit in my mail. With the mpirun just hanging out there I
attached GDB and got the following stack trace:
(gdb) bt
#0  0x003d1b9bd1af in poll () from /lib64/tls/libc.so.6

Re: [MTT users] Tests timing out

2006-08-30 Thread Josh Hursey


On Aug 30, 2006, at 11:36 AM, Jeff Squyres wrote:


(sorry -- been afk much of this morning)

MTT directly sets environment variables in its own environment (via
$ENV{whatever} = "foo") before using fork/exec to launch compiles and runs.
Hence, the forked children inherit the environment variables that we set
(e.g., PATH and LD_LIBRARY_PATH).

So if you source the env vars files that MTT drops, that should be
sufficient.


Does it drop them to a file, or is it printed in the debugging output
anywhere? I'm having a bit of trouble finding these strings in the output.




As for setting the values on *remote* nodes, we do it solely via the
--prefix option.  I wonder if --prefix is broken under SLURM...?  That might
be something to check -- you might be inadvertently mixing installations of
OMPI...?


Yep I'll check it out.

Cheers,
Josh




On 8/30/06 10:36 AM, "Josh Hursey"  wrote:


I'm trying to replicate the MTT environment as much as possible, and
have a couple of questions.

Assume there is no mpirun in my PATH/LD_LIBRARY_PATH when I start
MTT. After MTT builds Open MPI, how does it export these variables so
that it can build the tests? How does it export these when it runs
those tests (solely via --prefix)?

Cheers,
josh

On Aug 30, 2006, at 10:25 AM, Josh Hursey wrote:


I already tried that. However I'm trying it in a couple different ways and
getting some mixed results. Let me formulate the error cases and get back
to you.

Cheers,
Josh

On Aug 30, 2006, at 10:17 AM, Ralph H Castain wrote:


Well, why don't you try first separating this from MTT? Just run the
command manually in batch mode and see if it works. If that works, then the
problem is with MTT. Otherwise, we have a problem with notification.

Or are you saying that you have already done this?
Ralph


On 8/30/06 8:03 AM, "Josh Hursey"  wrote:

Yet another point (sorry for the spam). This may not be an MTT issue but a
broken ORTE on the trunk :(

When I try to run in an allocation (srun -N 16 -A) things run fine. But if
I try to run in batch mode (srun -N 16 -b myscript.sh) then I see the same
hang as in MTT. It seems that mpirun is not getting properly notified of
the completion of the job. :(

I'll try to investigate a bit further today. Any thoughts on what might be
causing this?

Cheers,
Josh

On Aug 30, 2006, at 9:54 AM, Josh Hursey wrote:


Forgot this bit in my mail. With the mpirun just hanging out there I
attached GDB and got the following stack trace:
(gdb) bt
#0  0x003d1b9bd1af in poll () from /lib64/tls/libc.so.6
#1  0x002a956e6389 in opal_poll_dispatch (base=0x5136d0, arg=0x513730, tv=0x7fbfffee70) at poll.c:191
#2  0x002a956e28b6 in opal_event_base_loop (base=0x5136d0, flags=5) at event.c:584
#3  0x002a956e26b7 in opal_event_loop (flags=5) at event.c:514
#4  0x002a956db7c7 in opal_progress () at runtime/opal_progress.c:259
#5  0x0040334c in opal_condition_wait (c=0x509650, m=0x509600) at ../../../opal/threads/condition.h:81
#6  0x00402f52 in orterun (argc=9, argv=0x7fb0b8) at orterun.c:444
#7  0x004028a3 in main (argc=9, argv=0x7fb0b8) at main.c:13

Seems that mpirun is waiting for things to complete :/

On Aug 30, 2006, at 9:53 AM, Josh Hursey wrote:



On Aug 30, 2006, at 7:19 AM, Jeff Squyres wrote:

On 8/29/06 8:57 PM, "Josh Hursey"  wrote:


Does this apply to *all* tests, or only some of the tests (like
allgather)?


All of the tests: Trivial and ibm. They all timeout :(


Blah.  The trivial tests are simply "hello world", so they should take just
about no time at all.

Is this running under SLURM?  I put the code in there to know how many
procs to use in SLURM but have not tested it in eons.  I doubt that's the
problem, but that's one thing to check.



Yep, it is in SLURM, and it seems that the 'number of procs' code is
working fine as it changes with different allocations.

Can you set a super-long timeout (e.g., a few minutes), and while one of
the trivial tests is running, do some ps's on the relevant nodes and see
what, if anything, is running?  E.g., mpirun, the test executable on the
nodes, etc.


Without setting a long timeout, it seems that mpirun is running, but
nothing else, and only on the launching node.

When a test starts you see the mpirun launching properly:
$ ps aux | grep ...
USER   PID %CPU %MEM   VSZ  RSS TTY  STAT START   TIME COMMAND
mpiteam  15117  0.5  0.8 113024 33680 ?  S    09:32   0:06 perl ./client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --file /u/mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
mpiteam  15294  0.0  0.0 0 0 ?  Z    09:32   0:00 [sh]
mpiteam  28453  0.2  0.0 38072 3536 ?  S    09:50   0:00 mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install 

Re: [MTT users] Tests timing out

2006-08-30 Thread Jeff Squyres
FWIW, I am pretty sure that "srun -b myscript" *used* to work.

But there must be something different about the environment between the two
(-A and -b)...?  For one thing, mpirun is running on the first node of the
allocation with -b (vs. the head node for -A), but I wouldn't think that
that would make a difference.  :-\

I assume you're kicking off MTT runs with srun -b, and that's why you think
that this may be the problem?
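
For what it's worth, a rough sketch of the manual reproduction being
discussed (the srun flags are the ones from this thread; ./hello and
myscript.sh are placeholders):

    # Interactive allocation -- reported to work:
    srun -N 16 -A
    mpirun -np 16 ./hello

    # Batch mode -- where mpirun appears to hang:
    cat > myscript.sh << 'EOF'
    #!/bin/sh
    mpirun -np 16 ./hello
    EOF
    chmod +x myscript.sh
    srun -N 16 -b myscript.sh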


On 8/30/06 10:03 AM, "Josh Hursey"  wrote:

> yet another point (sorry for the spam). This may not be an MTT issue
> but a broken ORTE on the trunk :(
> 
> When I try to run in a allocation (srun -N 16 -A) things run fine.
> But if I try to run in batch mode (srun -N 16 -b myscript.sh) then I
> see the same hang as in MTT. seems that mpirun is not getting
> properly notified of the completion of the job. :(
> 
> I'll try to investigate a bit further today. Any thoughts on what
> might be causing this?
> 
> Cheers,
> Josh
> 
> On Aug 30, 2006, at 9:54 AM, Josh Hursey wrote:
> 
>> forgot this bit in my mail. With the mpirun just hanging out there I
>> attached GDB and got the following stack trace:
>> (gdb) bt
>> #0  0x003d1b9bd1af in poll () from /lib64/tls/libc.so.6
>> #1  0x002a956e6389 in opal_poll_dispatch (base=0x5136d0,
>> arg=0x513730, tv=0x7fbfffee70) at poll.c:191
>> #2  0x002a956e28b6 in opal_event_base_loop (base=0x5136d0,
>> flags=5) at event.c:584
>> #3  0x002a956e26b7 in opal_event_loop (flags=5) at event.c:514
>> #4  0x002a956db7c7 in opal_progress () at runtime/opal_progress.c:
>> 259
>> #5  0x0040334c in opal_condition_wait (c=0x509650,
>> m=0x509600) at ../../../opal/threads/condition.h:81
>> #6  0x00402f52 in orterun (argc=9, argv=0x7fb0b8) at
>> orterun.c:444
>> #7  0x004028a3 in main (argc=9, argv=0x7fb0b8) at
>> main.c:13
>> 
>> Seems that mpirun is waiting for things to complete :/
>> 
>> On Aug 30, 2006, at 9:53 AM, Josh Hursey wrote:
>> 
>>> 
>>> On Aug 30, 2006, at 7:19 AM, Jeff Squyres wrote:
>>> 
 On 8/29/06 8:57 PM, "Josh Hursey"  wrote:
 
>> Does this apply to *all* tests, or only some of the tests (like
>> allgather)?
> 
> All of the tests: Trivial and ibm. They all timeout :(
 
 Blah.  The trivial tests are simply "hello world", so they should
 take just
 about no time at all.
 
 Is this running under SLURM?  I put the code in there to know how
 many procs
 to use in SLURM but have not tested it in eons.  I doubt that's the
 problem,
 but that's one thing to check.
 
>>> 
>>> Yep it is in SLURM. and it seems that the 'number of procs' code is
>>> working fine as it changes with different allocations.
>>> 
 Can you set a super-long timeout (e.g., a few minutes), and while
 one of the
 trivial tests is running, do some ps's on the relevant nodes and
 see what,
 if anything, is running?  E.g., mpirun, the test executable on the
 nodes,
 etc.
>>> 
>>> Without setting a long timeout. It seems that mpirun is running, but
>>> nothing else and only on the launching node.
>>> 
>>> When a test starts you see the mpirun launching properly:
>>> $ ps aux | grep ...
>>> USER   PID %CPU %MEM   VSZ  RSS TTY  STAT START   TIME COMMAND
>>> mpiteam  15117  0.5  0.8 113024 33680 ?  S    09:32   0:06 perl ./client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --file /u/mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
>>> mpiteam  15294  0.0  0.0 0 0 ?  Z    09:32   0:00 [sh]
>>> mpiteam  28453  0.2  0.0 38072 3536 ?  S    09:50   0:00 mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install collective/allgather_in_place
>>> mpiteam  28454  0.0  0.0 41716 2040 ?  Sl   09:50   0:00 srun --nodes=16 --ntasks=16 --nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007 orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpit...@odin007.cs.indiana.edu:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
>>> mpiteam  28455  0.0  0.0 23212 1804 ?  Ssl  09:50   0:00 srun --nodes=16 --ntasks=16 --nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007 orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpit...@odin007.cs.indiana.edu:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
>>> mpiteam  28472  0.0  0.0 36956 

Re: [MTT users] Tests timing out

2006-08-30 Thread Josh Hursey
I'm trying to replicate the MTT environment as much as possible, and  
have a couple of questions.


Assume there is no mpirun in my PATH/LD_LIBRARY_PATH when I start  
MTT. After MTT builds Open MPI, how does it export these variables so  
that it can build the tests? How does it export these when it runs  
those tests (solely via --prefix)?


Cheers,
josh

On Aug 30, 2006, at 10:25 AM, Josh Hursey wrote:


I already tried that. However I'm trying it in a couple different
ways and getting some mixed results. Let me formulate the error cases
and get back to you.

Cheers,
Josh

On Aug 30, 2006, at 10:17 AM, Ralph H Castain wrote:


Well, why don't you try first separating this from MTT? Just run the
command manually in batch mode and see if it works. If that works, then the
problem is with MTT. Otherwise, we have a problem with notification.

Or are you saying that you have already done this?
Ralph


On 8/30/06 8:03 AM, "Josh Hursey"  wrote:


Yet another point (sorry for the spam). This may not be an MTT issue but a
broken ORTE on the trunk :(

When I try to run in an allocation (srun -N 16 -A) things run fine. But if
I try to run in batch mode (srun -N 16 -b myscript.sh) then I see the same
hang as in MTT. It seems that mpirun is not getting properly notified of
the completion of the job. :(

I'll try to investigate a bit further today. Any thoughts on what might be
causing this?

Cheers,
Josh

On Aug 30, 2006, at 9:54 AM, Josh Hursey wrote:

Forgot this bit in my mail. With the mpirun just hanging out there I
attached GDB and got the following stack trace:
(gdb) bt
#0  0x003d1b9bd1af in poll () from /lib64/tls/libc.so.6
#1  0x002a956e6389 in opal_poll_dispatch (base=0x5136d0, arg=0x513730, tv=0x7fbfffee70) at poll.c:191
#2  0x002a956e28b6 in opal_event_base_loop (base=0x5136d0, flags=5) at event.c:584
#3  0x002a956e26b7 in opal_event_loop (flags=5) at event.c:514
#4  0x002a956db7c7 in opal_progress () at runtime/opal_progress.c:259
#5  0x0040334c in opal_condition_wait (c=0x509650, m=0x509600) at ../../../opal/threads/condition.h:81
#6  0x00402f52 in orterun (argc=9, argv=0x7fb0b8) at orterun.c:444
#7  0x004028a3 in main (argc=9, argv=0x7fb0b8) at main.c:13

Seems that mpirun is waiting for things to complete :/

On Aug 30, 2006, at 9:53 AM, Josh Hursey wrote:



On Aug 30, 2006, at 7:19 AM, Jeff Squyres wrote:


On 8/29/06 8:57 PM, "Josh Hursey"  wrote:


Does this apply to *all* tests, or only some of the tests (like
allgather)?


All of the tests: Trivial and ibm. They all timeout :(


Blah.  The trivial tests are simply "hello world", so they should take just
about no time at all.

Is this running under SLURM?  I put the code in there to know how many
procs to use in SLURM but have not tested it in eons.  I doubt that's the
problem, but that's one thing to check.



Yep, it is in SLURM, and it seems that the 'number of procs' code is
working fine as it changes with different allocations.


Can you set a super-long timeout (e.g., a few minutes), and while one of
the trivial tests is running, do some ps's on the relevant nodes and see
what, if anything, is running?  E.g., mpirun, the test executable on the
nodes, etc.


Without setting a long timeout, it seems that mpirun is running, but
nothing else, and only on the launching node.

When a test starts you see the mpirun launching properly:
$ ps aux | grep ...
USER   PID %CPU %MEM   VSZ  RSS TTY  STAT START   TIME COMMAND
mpiteam  15117  0.5  0.8 113024 33680 ?  S    09:32   0:06 perl ./client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --file /u/mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
mpiteam  15294  0.0  0.0 0 0 ?  Z    09:32   0:00 [sh]
mpiteam  28453  0.2  0.0 38072 3536 ?  S    09:50   0:00 mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install collective/allgather_in_place
mpiteam  28454  0.0  0.0 41716 2040 ?  Sl   09:50   0:00 srun --nodes=16 --ntasks=16 --nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007 orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpit...@odin007.cs.indiana.edu:default-universe-28453 --nsreplica "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://129.79.240.107:40904"
mpiteam  28455  0.0  0.0 23212 1804 ?  Ssl  09:50   0:00 srun --nodes=16 --ntasks=16 --nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007 orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --num_procs 16 --vpid_start 0 --universe mpit...@odin007.cs.indiana.edu:default-universe-28453 --nsreplica
[MTT users] OMPI snapshot tarball generation

2006-08-30 Thread Jeff Squyres
FYI -- see:



This means that MTT will potentially have less to test.  More
specifically, MTT will only have a tarball to test when there is actually
something new to test.  Hence, this can significantly decrease the
probability of there being 1.1 and 1.0 tarballs to test, and therefore lower
the amount of resources required for sites to test 1.1 and 1.0.

We used to generate new tarballs for all branches whenever there was even
one commit on any development/release branch.

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems



Re: [MTT users] Tests timing out

2006-08-30 Thread Jeff Squyres
On 8/29/06 8:57 PM, "Josh Hursey"  wrote:

>> Does this apply to *all* tests, or only some of the tests (like
>> allgather)?
> 
> All of the tests: Trivial and ibm. They all timeout :(

Blah.  The trivial tests are simply "hello world", so they should take just
about no time at all.

Is this running under SLURM?  I put the code in there to know how many procs
to use in SLURM but have not tested it in eons.  I doubt that's the problem,
but that's one thing to check.

Can you set a super-long timeout (e.g., a few minutes), and while one of the
trivial tests is running, do some ps's on the relevant nodes and see what,
if anything, is running?  E.g., mpirun, the test executable on the nodes,
etc.
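
Something along these lines should do (the node names and grep patterns are
just examples):

    # On the launch node:
    ps auxww | grep -E 'mpirun|orted'

    # On a couple of the allocated nodes:
    for n in odin007 odin008; do
        ssh $n "ps auxww | grep -E 'orted|hello'"
    done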

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


[MTT users] Update your checkouts

2006-08-30 Thread Jeff Squyres
We moved a few fixes and improvements over to the MTT release branch
yesterday; you probably want to run "svn up" in your MTT checkouts.

I also added a "tips and tricks" section to the wiki on the OMPI Testing
page for some of the gotchas that have occurred so far.

Indeed, we'll be carefully monitoring what gets put out on the release
branch (remember: WE DO NOT WANT YOU RUNNING FROM THE TRUNK!); it might be
best to simply have a "svn up" at the beginning of your script that launches
MTT for your daily runs.
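
For example, a wrapper of roughly this shape (the paths and INI file name
are placeholders for your own setup):

    #!/bin/sh
    # Refresh the MTT checkout, then launch the client for the nightly run.
    cd /path/to/mtt-checkout && svn up
    perl ./client/mtt --file /path/to/your-site.ini --scratch /path/to/mtt-scratch --verbose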

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems