Re: [MTT users] Tests timing out

2006-08-30 Thread Josh Hursey
This fixes the hanging and gets me running (and passing) some/most of  
the tests [Trivial and ibm]. Yay!


I have a 16 processor job running on Odin at the moment that seems to  
be going well so far.


Thanks for your help.

Want me to file a bug about the tcsh problem below?

-- Josh

On Aug 30, 2006, at 2:30 PM, Jeff Squyres wrote:


Bah!

This is the result of perl expanding $? to 0 -- it seems that I need to
escape $? so that it's not output as 0.

Sorry about that!
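For illustration (these one-liners are just a sketch of the interpolation
behavior, not the actual MTT code): in a double-quoted Perl string, $?
expands to the last child exit status, while an escaped \$? survives into
the generated csh:

  $ perl -e 'print "if ($?LD_LIBRARY_PATH == 0) then\n";'
  if (0LD_LIBRARY_PATH == 0) then
  $ perl -e 'print "if (\$?LD_LIBRARY_PATH == 0) then\n";'
  if ($?LD_LIBRARY_PATH == 0) then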

So is this just for the sourcing files, or for your overall (hanging)
problems?


On 8/30/06 2:28 PM, "Josh Hursey"  wrote:


So here are the results of my exploration. I have things running now.
The problem was that the user that I am running under does not set
the LD_LIBRARY_PATH variable at any point. So when MTT tries to
export the variable it does:
if (0LD_LIBRARY_PATH == 0) then
 setenv LD_LIBRARY_PATH /san//install/lib
else
 setenv LD_LIBRARY_PATH /san//install/lib:$LD_LIBRARY_PATH
endif

So this causes tcsh to emit an error that LD_LIBRARY_PATH is not
defined, and because of the error the variable never gets set.

I fixed this by always setting it to "" in the .cshrc file. However,
MTT could do a sanity check to see whether the variable is defined before
testing its value. Something like:


if ($?LD_LIBRARY_PATH) then
else
   setenv LD_LIBRARY_PATH ""
endif

if (0LD_LIBRARY_PATH == 0) then
 setenv LD_LIBRARY_PATH /san//install/lib
else
 setenv LD_LIBRARY_PATH /san//install/lib:$LD_LIBRARY_PATH
endif


or something of the sort.
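Once the \$? escaping is fixed on the MTT side, the emitted snippet could
also collapse into a single guard -- just a sketch, with /san//install/lib
standing in for the real install path:

if ($?LD_LIBRARY_PATH) then
 setenv LD_LIBRARY_PATH /san//install/lib:${LD_LIBRARY_PATH}
else
 setenv LD_LIBRARY_PATH /san//install/lib
endif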

As another note, could we start a "How to debug MTT" Wiki page with
some of the information that Jeff sent in this message regarding the
dumping of env vars? I think that would be helpful when getting
things started.

Thanks for all your help, I'm sure I'll have more questions in the
near future.

Cheers,
Josh


On Aug 30, 2006, at 12:31 PM, Jeff Squyres wrote:


On 8/30/06 12:10 PM, "Josh Hursey"  wrote:

MTT directly sets environment variables in its own environment (via
$ENV{whatever} = "foo") before using fork/exec to launch compiles and runs.
Hence, the forked children inherit the environment variables that we set
(e.g., PATH and LD_LIBRARY_PATH).

So if you source the env vars files that MTT drops, that should be
sufficient.


Does it drop them to a file, or are they printed in the debugging output
anywhere? I'm having a bit of trouble finding these strings in the
output.


It does not put these in the -debug output.

The files that it drops are in the scratch dir.  You'll need to go
into
scratch/installs, and then it depends on what your INI file section
names
are.  You'll go to:

/installs

And there should be files named "mpi_installed_vars.[csh|sh]" that
you can
source, depending on your shell.  It should set PATH and
LD_LIBRARY_PATH.

The intent of these files is for exactly this purpose -- for a
human to test
borked MPI installs inside the MTT scratch tree.
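For example, something along these lines (the exact directory names depend
on your INI section names and the nightly version, so treat the path as a
sketch):

  % cd /u/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497
  % source mpi_installed_vars.csh
  % which mpirun
  % echo $LD_LIBRARY_PATH

(Use ". ./mpi_installed_vars.sh" instead if you're in a Bourne-type shell.)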



As for setting the values on *remote* nodes, we do it solely via the
--prefix option.  I wonder if --prefix is broken under SLURM...?  That might
be something to check -- you might be inadvertently mixing installations of
OMPI...?


Yep I'll check it out.

Cheers,
Josh




On 8/30/06 10:36 AM, "Josh Hursey"  wrote:


I'm trying to replicate the MTT environment as much as possible, and
have a couple of questions.

Assume there is no mpirun in my PATH/LD_LIBRARY_PATH when I start
MTT. After MTT builds Open MPI, how does it export these variables so
that it can build the tests? How does it export these when it runs
those tests (solely via --prefix)?

Cheers,
josh

On Aug 30, 2006, at 10:25 AM, Josh Hursey wrote:

I already tried that. However I'm trying it in a couple different
ways and getting some mixed results. Let me formulate the error cases
and get back to you.

Cheers,
Josh

On Aug 30, 2006, at 10:17 AM, Ralph H Castain wrote:

Well, why don't you try first separating this from MTT? Just run
the command manually in batch mode and see if it works. If that works,
then the problem is with MTT. Otherwise, we have a problem with notification.

Or are you saying that you have already done this?
Ralph


On 8/30/06 8:03 AM, "Josh Hursey"   
wrote:



yet another point (sorry for the spam). This may not be an MTT issue
but a broken ORTE on the trunk :(

When I try to run in an allocation (srun -N 16 -A) things run fine.
But if I try to run in batch mode (srun -N 16 -b myscript.sh) then I
see the same hang as in MTT. Seems that mpirun is not getting
properly notified of the completion of the job. :(

I'll try to investigate a bit further today. Any thoughts on what
might be causing this?

Cheers,
Josh

On Aug 30, 2006, at 9:54 AM, Josh Hursey wrote:


forgot this bit in my mail. With the mpirun just hanging out
there I
attached GDB and got the following stack trace:
(gdb) bt
#0  0x003d1b9bd1af in poll () from /lib64/tls/libc.so.6

Re: [MTT users] Tests timing out

2006-08-30 Thread Josh Hursey


On Aug 30, 2006, at 11:36 AM, Jeff Squyres wrote:


(sorry -- been afk much of this morning)

MTT directly sets environment variables in its own environment (via
$ENV{whatever} = "foo") before using fork/exec to launch compiles and runs.
Hence, the forked children inherit the environment variables that we set
(e.g., PATH and LD_LIBRARY_PATH).
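A contrived illustration of that inheritance (not MTT code): a child forked
from Perl sees whatever was stuffed into %ENV beforehand --

  $ perl -e '$ENV{LD_LIBRARY_PATH} = "/tmp/fake/lib"; system("printenv LD_LIBRARY_PATH");'
  /tmp/fake/lib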

So if you source the env vars files that MTT drops, that should be
sufficient.


Does it drop them to a file, or are they printed in the debugging output
anywhere? I'm having a bit of trouble finding these strings in the
output.




As for setting the values on *remote* nodes, we do it solely via the
--prefix option.  I wonder if --prefix is broken under SLURM...?  That might
be something to check -- you might be inadvertently mixing installations of
OMPI...?
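One quick sanity check (hypothetical command line -- substitute the install
dir MTT actually built) is to have mpirun launch printenv on the remote
nodes and see which paths --prefix really propagates:

  $ mpirun --prefix /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install \
      -np 2 printenv PATH LD_LIBRARY_PATH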


Yep I'll check it out.

Cheers,
Josh




On 8/30/06 10:36 AM, "Josh Hursey"  wrote:


I'm trying to replicate the MTT environment as much as possible, and
have a couple of questions.

Assume there is no mpirun in my PATH/LD_LIBRARY_PATH when I start
MTT. After MTT builds Open MPI, how does it export these variables so
that it can build the tests? How does it export these when it runs
those tests (solely via --prefix)?

Cheers,
josh

On Aug 30, 2006, at 10:25 AM, Josh Hursey wrote:


I already tried that. However I'm trying it in a couple different
ways and getting some mixed results. Let me formulate the error cases
and get back to you.

Cheers,
Josh

On Aug 30, 2006, at 10:17 AM, Ralph H Castain wrote:


Well, why don't you try first separating this from MTT? Just run
the command
manually in batch mode and see if it works. If that works, then the
problem
is with MTT. Otherwise, we have a problem with notification.

Or are you saying that you have already done this?
Ralph


On 8/30/06 8:03 AM, "Josh Hursey"  wrote:

yet another point (sorry for the spam). This may not be an MTT issue
but a broken ORTE on the trunk :(

When I try to run in an allocation (srun -N 16 -A) things run fine.
But if I try to run in batch mode (srun -N 16 -b myscript.sh) then I
see the same hang as in MTT. Seems that mpirun is not getting
properly notified of the completion of the job. :(

I'll try to investigate a bit further today. Any thoughts on what
might be causing this?

Cheers,
Josh

On Aug 30, 2006, at 9:54 AM, Josh Hursey wrote:


forgot this bit in my mail. With the mpirun just hanging out
there I
attached GDB and got the following stack trace:
(gdb) bt
#0  0x003d1b9bd1af in poll () from /lib64/tls/libc.so.6
#1  0x002a956e6389 in opal_poll_dispatch (base=0x5136d0,
arg=0x513730, tv=0x7fbfffee70) at poll.c:191
#2  0x002a956e28b6 in opal_event_base_loop (base=0x5136d0,
flags=5) at event.c:584
#3  0x002a956e26b7 in opal_event_loop (flags=5) at event.c:514
#4  0x002a956db7c7 in opal_progress () at runtime/opal_progress.c:259
#5  0x0040334c in opal_condition_wait (c=0x509650,
m=0x509600) at ../../../opal/threads/condition.h:81
#6  0x00402f52 in orterun (argc=9, argv=0x7fb0b8) at
orterun.c:444
#7  0x004028a3 in main (argc=9, argv=0x7fb0b8) at
main.c:13

Seems that mpirun is waiting for things to complete :/
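For reference, attaching to the stuck mpirun is just something along these
lines (the PID and binary path here are examples, not the real ones):

  $ ps aux | grep mpirun
  $ gdb /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/install/bin/mpirun 28453
  (gdb) bt
  (gdb) detach
  (gdb) quit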

On Aug 30, 2006, at 9:53 AM, Josh Hursey wrote:



On Aug 30, 2006, at 7:19 AM, Jeff Squyres wrote:

On 8/29/06 8:57 PM, "Josh Hursey"   
wrote:


Does this apply to *all* tests, or only some of the tests (like
allgather)?


All of the tests: Trivial and ibm. They all timeout :(


Blah.  The trivial tests are simply "hello world", so they should take just
about no time at all.

Is this running under SLURM?  I put the code in there to know how many procs
to use in SLURM but have not tested it in eons.  I doubt that's the problem,
but that's one thing to check.



Yep it is in SLURM. and it seems that the 'number of procs'
code is
working fine as it changes with different allocations.

Can you set a super-long timeout (e.g., a few minutes), and while one of the
trivial tests is running, do some ps's on the relevant nodes and see what,
if anything, is running?  E.g., mpirun, the test executable on the nodes,
etc.


Without setting a long timeout. It seems that mpirun is running,
but
nothing else and only on the launching node.

When a test starts you see the mpirun launching properly:
$ ps aux | grep ...
USER   PID %CPU %MEM   VSZ  RSS TTY  STAT START   TIME
COMMAND
mpiteam  15117  0.5  0.8 113024 33680 ?  S09:32   0:06
perl ./
client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --file /u/
mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
mpiteam  15294  0.0  0.0 00 ?Z09:32   0:00 [sh]

mpiteam  28453  0.2  0.0 38072 3536 ?S09:50   0:00 mpirun
-mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-
scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/
install 

Re: [MTT users] Tests timing out

2006-08-30 Thread Jeff Squyres
FWIW, I am pretty sure that "srun -b myscript" *used* to work.

But there must be something different about the environment between the two
(-A and -b)...?  For one thing, mpirun is running on the first node of the
allocation with -b (vs. the head node for -A), but I wouldn't think that
that would make a difference.  :-\

I assume you're kicking off MTT runs with srun -b, and that's why you think
that this may be the problem?
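Something like the following would take MTT out of the picture entirely
(here "hello" is a stand-in for any MPI executable; adjust -np and --prefix
to match your setup):

  # interactive allocation -- reported to work
  $ srun -N 16 -A
  $ mpirun -mca btl tcp,self -np 32 --prefix <install dir> ./hello

  # batch mode -- reported to hang
  $ cat myscript.sh
  #!/bin/sh
  mpirun -mca btl tcp,self -np 32 --prefix <install dir> ./hello
  $ srun -N 16 -b ./myscript.sh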


On 8/30/06 10:03 AM, "Josh Hursey"  wrote:

> yet another point (sorry for the spam). This may not be an MTT issue
> but a broken ORTE on the trunk :(
> 
> When I try to run in an allocation (srun -N 16 -A) things run fine.
> But if I try to run in batch mode (srun -N 16 -b myscript.sh) then I
> see the same hang as in MTT. Seems that mpirun is not getting
> properly notified of the completion of the job. :(
> 
> I'll try to investigate a bit further today. Any thoughts on what
> might be causing this?
> 
> Cheers,
> Josh
> 
> On Aug 30, 2006, at 9:54 AM, Josh Hursey wrote:
> 
>> forgot this bit in my mail. With the mpirun just hanging out there I
>> attached GDB and got the following stack trace:
>> (gdb) bt
>> #0  0x003d1b9bd1af in poll () from /lib64/tls/libc.so.6
>> #1  0x002a956e6389 in opal_poll_dispatch (base=0x5136d0,
>> arg=0x513730, tv=0x7fbfffee70) at poll.c:191
>> #2  0x002a956e28b6 in opal_event_base_loop (base=0x5136d0,
>> flags=5) at event.c:584
>> #3  0x002a956e26b7 in opal_event_loop (flags=5) at event.c:514
>> #4  0x002a956db7c7 in opal_progress () at runtime/opal_progress.c:
>> 259
>> #5  0x0040334c in opal_condition_wait (c=0x509650,
>> m=0x509600) at ../../../opal/threads/condition.h:81
>> #6  0x00402f52 in orterun (argc=9, argv=0x7fb0b8) at
>> orterun.c:444
>> #7  0x004028a3 in main (argc=9, argv=0x7fb0b8) at
>> main.c:13
>> 
>> Seems that mpirun is waiting for things to complete :/
>> 
>> On Aug 30, 2006, at 9:53 AM, Josh Hursey wrote:
>> 
>>> 
>>> On Aug 30, 2006, at 7:19 AM, Jeff Squyres wrote:
>>> 
 On 8/29/06 8:57 PM, "Josh Hursey"  wrote:
 
>> Does this apply to *all* tests, or only some of the tests (like
>> allgather)?
> 
> All of the tests: Trivial and ibm. They all timeout :(
 
 Blah.  The trivial tests are simply "hello world", so they should
 take just
 about no time at all.
 
 Is this running under SLURM?  I put the code in there to know how
 many procs
 to use in SLURM but have not tested it in eons.  I doubt that's the
 problem,
 but that's one thing to check.
 
>>> 
>>> Yep it is in SLURM. and it seems that the 'number of procs' code is
>>> working fine as it changes with different allocations.
>>> 
 Can you set a super-long timeout (e.g., a few minutes), and while
 one of the
 trivial tests is running, do some ps's on the relevant nodes and
 see what,
 if anything, is running?  E.g., mpirun, the test executable on the
 nodes,
 etc.
>>> 
>>> Without setting a long timeout. It seems that mpirun is running, but
>>> nothing else and only on the launching node.
>>> 
>>> When a test starts you see the mpirun launching properly:
>>> $ ps aux | grep ...
>>> USER   PID %CPU %MEM   VSZ  RSS TTY  STAT START   TIME
>>> COMMAND
>>> mpiteam  15117  0.5  0.8 113024 33680 ?  S09:32   0:06
>>> perl ./
>>> client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --file /u/
>>> mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
>>> mpiteam  15294  0.0  0.0 00 ?Z09:32   0:00 [sh]
>>> 
>>> mpiteam  28453  0.2  0.0 38072 3536 ?S09:50   0:00 mpirun
>>> -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-
>>> scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/
>>> install collective/allgather_in_place
>>> mpiteam  28454  0.0  0.0 41716 2040 ?Sl   09:50   0:00
>>> srun --
>>> nodes=16 --ntasks=16 --
>>> nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,
>>> odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007
>>> orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --
>>> num_procs 16 --vpid_start 0 --universe
>>> mpit...@odin007.cs.indiana.edu:default-universe-28453 --nsreplica
>>> "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://
>>> 129.79.240.107:40904"
>>> mpiteam  28455  0.0  0.0 23212 1804 ?Ssl  09:50   0:00
>>> srun --
>>> nodes=16 --ntasks=16 --
>>> nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,
>>> odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007
>>> orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --
>>> num_procs 16 --vpid_start 0 --universe
>>> mpit...@odin007.cs.indiana.edu:default-universe-28453 --nsreplica
>>> "0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://
>>> 129.79.240.107:40904"
>>> mpiteam  28472  0.0  0.0 36956 

Re: [MTT users] Tests timing out

2006-08-30 Thread Josh Hursey
I'm trying to replicate the MTT environment as much as possible, and  
have a couple of questions.


Assume there is no mpirun in my PATH/LD_LIBRARY_PATH when I start  
MTT. After MTT builds Open MPI, how does it export these variables so  
that it can build the tests? How does it export these when it runs  
those tests (solely via --prefix)?


Cheers,
josh

On Aug 30, 2006, at 10:25 AM, Josh Hursey wrote:


I already tried that. However I'm trying it in a couple different
ways and getting some mixed results. Let me formulate the error cases
and get back to you.

Cheers,
Josh

On Aug 30, 2006, at 10:17 AM, Ralph H Castain wrote:


Well, why don't you try first separating this from MTT? Just run
the command
manually in batch mode and see if it works. If that works, then the
problem
is with MTT. Otherwise, we have a problem with notification.

Or are you saying that you have already done this?
Ralph


On 8/30/06 8:03 AM, "Josh Hursey"  wrote:


yet another point (sorry for the spam). This may not be an MTT issue
but a broken ORTE on the trunk :(

When I try to run in an allocation (srun -N 16 -A) things run fine.
But if I try to run in batch mode (srun -N 16 -b myscript.sh) then I
see the same hang as in MTT. Seems that mpirun is not getting
properly notified of the completion of the job. :(

I'll try to investigate a bit further today. Any thoughts on what
might be causing this?

Cheers,
Josh

On Aug 30, 2006, at 9:54 AM, Josh Hursey wrote:

forgot this bit in my mail. With the mpirun just hanging out there I
attached GDB and got the following stack trace:
(gdb) bt
#0  0x003d1b9bd1af in poll () from /lib64/tls/libc.so.6
#1  0x002a956e6389 in opal_poll_dispatch (base=0x5136d0,
arg=0x513730, tv=0x7fbfffee70) at poll.c:191
#2  0x002a956e28b6 in opal_event_base_loop (base=0x5136d0,
flags=5) at event.c:584
#3  0x002a956e26b7 in opal_event_loop (flags=5) at event.c:514
#4  0x002a956db7c7 in opal_progress () at runtime/opal_progress.c:259
#5  0x0040334c in opal_condition_wait (c=0x509650,
m=0x509600) at ../../../opal/threads/condition.h:81
#6  0x00402f52 in orterun (argc=9, argv=0x7fb0b8) at
orterun.c:444
#7  0x004028a3 in main (argc=9, argv=0x7fb0b8) at
main.c:13

Seems that mpirun is waiting for things to complete :/

On Aug 30, 2006, at 9:53 AM, Josh Hursey wrote:



On Aug 30, 2006, at 7:19 AM, Jeff Squyres wrote:


On 8/29/06 8:57 PM, "Josh Hursey"  wrote:


Does this apply to *all* tests, or only some of the tests (like
allgather)?


All of the tests: Trivial and ibm. They all timeout :(


Blah.  The trivial tests are simply "hello world", so they should
take just
about no time at all.

Is this running under SLURM?  I put the code in there to know how
many procs
to use in SLURM but have not tested it in eons.  I doubt that's
the
problem,
but that's one thing to check.



Yep it is in SLURM. and it seems that the 'number of procs' code is
working fine as it changes with different allocations.


Can you set a super-long timeout (e.g., a few minutes), and while one of the
trivial tests is running, do some ps's on the relevant nodes and see what,
if anything, is running?  E.g., mpirun, the test executable on the nodes,
etc.


Without setting a long timeout. It seems that mpirun is running,
but
nothing else and only on the launching node.

When a test starts you see the mpirun launching properly:
$ ps aux | grep ...
USER   PID %CPU %MEM   VSZ  RSS TTY  STAT START   TIME
COMMAND
mpiteam  15117  0.5  0.8 113024 33680 ?  S09:32   0:06
perl ./
client/mtt --debug --scratch /u/mpiteam/tmp/mtt-scratch --file /u/
mpiteam/local/etc/ompi-iu-odin-core.ini --verbose --print-time
mpiteam  15294  0.0  0.0 00 ?Z09:32   0:00  
[sh]


mpiteam  28453  0.2  0.0 38072 3536 ?S09:50   0:00
mpirun
-mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/tmp/mtt-
scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/1.3a1r11497/
install collective/allgather_in_place
mpiteam  28454  0.0  0.0 41716 2040 ?Sl   09:50   0:00
srun --
nodes=16 --ntasks=16 --
nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,
odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007
orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --
num_procs 16 --vpid_start 0 --universe
mpit...@odin007.cs.indiana.edu:default-universe-28453 --nsreplica
"0.0.0;tcp://129.79.240.107:40904" --gprreplica "0.0.0;tcp://
129.79.240.107:40904"
mpiteam  28455  0.0  0.0 23212 1804 ?Ssl  09:50   0:00
srun --
nodes=16 --ntasks=16 --
nodelist=odin022,odin021,odin020,odin019,odin018,odin017,odin016,odin015,
odin014,odin013,odin012,odin011,odin010,odin009,odin008,odin007
orted --no-daemonize --bootproxy 1 --ns-nds slurm --name 0.0.1 --
num_procs 16 --vpid_start 0 --universe
mpit...@odin007.cs.indiana.edu:default-universe-28453 --nsreplica

Re: [MTT users] Tests timing out

2006-08-30 Thread Jeff Squyres
On 8/29/06 8:57 PM, "Josh Hursey"  wrote:

>> Does this apply to *all* tests, or only some of the tests (like
>> allgather)?
> 
> All of the tests: Trivial and ibm. They all timeout :(

Blah.  The trivial tests are simply "hello world", so they should take just
about no time at all.

Is this running under SLURM?  I put the code in there to know how many procs
to use in SLURM but have not tested it in eons.  I doubt that's the problem,
but that's one thing to check.

Can you set a super-long timeout (e.g., a few minutes), and while one of the
trivial tests is running, do some ps's on the relevant nodes and see what,
if anything, is running?  E.g., mpirun, the test executable on the nodes,
etc.

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


Re: [MTT users] Tests timing out

2006-08-29 Thread Josh Hursey


On Aug 29, 2006, at 6:57 PM, Jeff Squyres wrote:


On 8/29/06 1:55 PM, "Josh Hursey"  wrote:


So I'm having trouble getting tests to complete without timing out in
MTT. It seems that the tests timeout and hang in MTT, but complete
normally outside of MTT.


Does this apply to *all* tests, or only some of the tests (like  
allgather)?


All of the tests: Trivial and ibm. They all timeout :(




Here are some details:
Build:
   Open MPI Trunk (1.3a1r11481)

Tests:
   Trivial
   ibm

BTL:
   tcp
   self

Nodes/processes:
   16 nodes (32 processors) on the Odin Cluster at IU


In MTT all of the tests timeout:

Running command: mpirun  -mca btl tcp,self -np 32 --prefix
/san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/
odin_gcc_warnings/1.3a1r11481/install collective/allgather
Timeout: 1 - 1156872348 (vs. now: 1156872028)
Past timeout! 1156872348 < 1156872349
Past timeout! 1156872348 < 1156872349

[snipped]

: returning 0
String now: 0
*** WARNING: Test: allgather, np=32, variant=1: TIMED OUT (failed)


Outside of MTT using the same build the test runs and completes
normally:
  $ cd ~/tmp/mtt-scratch/installs/ompi-nightly-trunk/
odin_gcc_warnings/1.3a1r11481/tests/ibm/ibm/
  $ mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/
tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/
1.3a1r11481/install collective/allgather


Where is mpirun in your path?

MTT actually drops sourceable files in the top-level install dir (i.e., the
1.3a1r11481) that you can source in your shell and set the
PATH/LD_LIBRARY_PATH for that install.  Can you source it and try to run
again?


Yep I exported the PATH/LD_LIBRARY_PATH to the one cited in the
--prefix argument before running manually.





How long does it take to run manually -- just a few seconds, or a long time
(that could potentially timeout)?


Just a few seconds (say 5 or so).



--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems



Josh Hursey
jjhur...@open-mpi.org
http://www.open-mpi.org/



Re: [MTT users] Tests timing out

2006-08-29 Thread Jeff Squyres
On 8/29/06 1:55 PM, "Josh Hursey"  wrote:

> So I'm having trouble getting tests to complete without timing out in
> MTT. It seems that the tests timeout and hang in MTT, but complete
> normally outside of MTT.

Does this apply to *all* tests, or only some of the tests (like allgather)?

> Here are some details:
> Build:
>Open MPI Trunk (1.3a1r11481)
> 
> Tests:
>Trivial
>ibm
> 
> BTL:
>tcp
>self
> 
> Nodes/processes:
>16 nodes (32 processors) on the Odin Cluster at IU
> 
> 
> In MTT all of the tests timeout:
> 
> Running command: mpirun  -mca btl tcp,self -np 32 --prefix
> /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/
> odin_g
> cc_warnings/1.3a1r11481/install collective/allgather
> Timeout: 1 - 1156872348 (vs. now: 1156872028)
> Past timeout! 1156872348 < 1156872349
> Past timeout! 1156872348 < 1156872349
[snipped]
> : returning 0
> String now: 0
> *** WARNING: Test: allgather, np=32, variant=1: TIMED OUT (failed)
> 
> 
> Outside of MTT using the same build the test runs and completes
> normally:
>   $ cd ~/tmp/mtt-scratch/installs/ompi-nightly-trunk/
> odin_gcc_warnings/1.3a1r11481/tests/ibm/ibm/
>   $ mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/
> tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/
> 1.3a1r11481/install collective/allgather

Where is mpirun in your path?

MTT actually drops sourceable files in the top-level install dir (i.e., the
1.3a1r11481) that you can source in your shell and set the
PATH/LD_LIBRARY_PATH for that install.  Can you source it and try to run
again?

How long does it take to run manually -- just a few seconds, or a long time
(that could potentially timeout)?

-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems


[MTT users] Tests timing out

2006-08-29 Thread Josh Hursey

Hey all,

So I'm having trouble getting tests to complete without timing out in  
MTT. It seems that the tests timeout and hang in MTT, but complete  
normally outside of MTT.


Here are some details:
Build:
  Open MPI Trunk (1.3a1r11481)

Tests:
  Trivial
  ibm

BTL:
  tcp
  self

Nodes/processes:
  16 nodes (32 processors) on the Odin Cluster at IU


In MTT all of the tests timeout:

Running command: mpirun  -mca btl tcp,self -np 32 --prefix
   /san/homedirs/mpiteam/tmp/mtt-scratch/installs/ompi-nightly-trunk/
   odin_gcc_warnings/1.3a1r11481/install collective/allgather
Timeout: 1 - 1156872348 (vs. now: 1156872028)
Past timeout! 1156872348 < 1156872349
Past timeout! 1156872348 < 1156872349
Command complete, exit status: 72057594037927935
Evaluating: ((_exit_status(), 0), (_exit_status(), 77))

Got name: test_exit_status
Got args:
_do: $ret = MTT::Values::Functions::test_exit_status()
_exit_status returning: 72057594037927935
String now: ((72057594037927935, 0), (_exit_status(), 77))
Got name: eq
Got args: 72057594037927935, 0
_do: $ret = MTT::Values::Functions::eq(72057594037927935, 0)
 got: 72057594037927935 0
: returning 0
String now: (0, (_exit_status(), 77))
Got name: test_exit_status
Got args:
_do: $ret = MTT::Values::Functions::test_exit_status()
_exit_status returning: 72057594037927935
String now: (0, (72057594037927935, 77))
Got name: eq
Got args: 72057594037927935, 77
_do: $ret = MTT::Values::Functions::eq(72057594037927935, 77)
 got: 72057594037927935 77
: returning 0
String now: (0, 0)
Got name: or
Got args: 0, 0
_do: $ret = MTT::Values::Functions::or(0, 0)
 got: 0 0
: returning 0
String now: 0
*** WARNING: Test: allgather, np=32, variant=1: TIMED OUT (failed)
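(Reading the evaluation trace above: the pass criterion boils down to
"exit status is 0 or 77", i.e. an INI condition along the lines of
&or(&eq(&test_exit_status(), 0), &eq(&test_exit_status(), 77)), and the
72057594037927935 status reported for the timed-out command matches
neither, so the test is scored as failed.)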


Outside of MTT using the same build the test runs and completes  
normally:
 $ cd ~/tmp/mtt-scratch/installs/ompi-nightly-trunk/ 
odin_gcc_warnings/1.3a1r11481/tests/ibm/ibm/
 $ mpirun -mca btl tcp,self -np 32 --prefix /san/homedirs/mpiteam/ 
tmp/mtt-scratch/installs/ompi-nightly-trunk/odin_gcc_warnings/ 
1.3a1r11481/install collective/allgather

 $

Any thoughts on why this might be happening in MTT but not outside of  
it?


Cheers,
Josh