Re: [OMPI users] Multiple mpiexec's within a job (schedule within a scheduled machinefile/job allocation)

2009-07-29 Thread Ralph Castain

Oh my - that does take me back a long way! :-)

Do you need these processes to be mapped byslot (i.e., do you care if  
the process ranks are sharing nodes)? If not, why not add "-bynode" to  
your cmd line?


Alternatively, given the mapping you want, just do

mpirun -npernode 1 application.exe

This would launch one copy on each of your N nodes. So if you fork M  
times, you'll wind up with the exact pattern you wanted. And, as each  
one exits, you could immediately launch a replacement without worrying  
about oversubscription.
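
A minimal sketch of the "fork M times" pattern, assuming a POSIX shell driver in place of the serial program described in the question. The helper name run_tiled is hypothetical and is not part of Open MPI; it only starts M copies of a command in the background and waits for all of them.

```sh
#!/bin/sh
# run_tiled M CMD [ARGS...]
# Hypothetical helper: start M copies of CMD concurrently, wait for all of
# them, and report failure if any copy exited non-zero.
# The body runs in a subshell so its variables do not leak to the caller.
run_tiled() (
    m=$1
    shift
    pids=""
    rc=0
    i=0
    while [ "$i" -lt "$m" ]; do
        "$@" &                  # launch one copy in the background
        pids="$pids $!"         # remember its process id
        i=$((i + 1))
    done
    for pid in $pids; do
        wait "$pid" || rc=1     # collect each exit status
    done
    exit "$rc"                  # leaves the subshell, i.e. the function
)
```

For example, "run_tiled 4 mpirun -npernode 1 application.exe" would start four of the launches above side by side. Where the ranks actually land is decided by Open MPI's mapper, which this helper does not control.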


Does that help?
Ralph

PS. we dropped that "persistent" operation - caused way too many  
problems with cleanup and other things. :-)



Re: [OMPI users] Multiple mpiexec's within a job (schedule within a scheduled machinefile/job allocation)

2009-07-29 Thread Adams, Brian M
Hi Ralph (all),

I'm resurrecting this 2006 thread for a status check.  The new 1.3.x 
machinefile behavior is great (thanks!) -- I can use machinefiles to manage 
multiple simultaneous mpiruns within a single torque allocation (where the 
hosts are a subset of $PBS_NODEFILE).  However, this requires some careful 
management of machinefiles.
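
The machinefile bookkeeping mentioned above could be sketched roughly as follows. This is an illustration only, assuming a Torque-style nodefile with one line per slot; make_machinefiles and the machinefile.<i> naming are hypothetical, not something Open MPI or Torque provides.

```sh
#!/bin/sh
# make_machinefiles NODEFILE M N OUTDIR
# Hypothetical helper: cut NODEFILE (assumed: one line per slot) into M
# consecutive chunks of N lines, written to OUTDIR/machinefile.0 .. M-1.
# The body runs in a subshell so its variables stay local.
make_machinefiles() (
    nodefile=$1 m=$2 n=$3 outdir=$4
    # Arithmetic expansion strips any padding that wc may print.
    total=$(( $(wc -l < "$nodefile") ))
    if [ "$total" -lt $((m * n)) ]; then
        echo "need $((m * n)) slots, but $nodefile lists $total" >&2
        exit 1
    fi
    mkdir -p "$outdir" || exit 1
    i=0
    while [ "$i" -lt "$m" ]; do
        start=$((i * n + 1))        # first line of chunk i (1-based)
        end=$(( (i + 1) * n ))      # last line of chunk i
        sed -n "${start},${end}p" "$nodefile" > "$outdir/machinefile.$i"
        i=$((i + 1))
    done
)
```

Each forked process i could then run something like "mpirun -np N -machinefile OUTDIR/machinefile.i application.exe". Reusing a chunk after its mpirun exits is left to the caller.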

I'm curious if OpenMPI now directly supports the behavior I need, described in 
general in the quote below.  Specifically, given a single PBS/Torque allocation 
of M*N processors, I will run a serial program that will fork M times.  Each of 
the M forked processes calls 'mpirun -np N application.exe' and blocks until 
completion.  This seems akin to the case you described of "mpiruns executed in 
separate windows/prompts."

What I'd like to see is the M processes "tiled" across the available slots, so 
all M*N processors are used.  What I see instead appears at face value to be 
the first N resources being oversubscribed M times.  

Also, when one of the forked processes returns, I'd like to be able to spawn 
another and have its mpirun schedule on the resources freed by the previous one 
that exited.  Is any of this possible?

I tried starting an orted (1.3.3, roughly as you suggested below), but got this 
error:

> orted --daemonize
[gy8:25871] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
runtime/orte_init.c at line 125
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--
[gy8:25871] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
orted/orted_main.c at line 323

I spared the debugging info as I'm not even sure this is a correct invocation...

Thanks for any suggestions you can offer!
Brian
--
Brian M. Adams, PhD (bria...@sandia.gov)
Optimization and Uncertainty Quantification
Sandia National Laboratories, Albuquerque, NM
http://www.sandia.gov/~briadam


> From: Ralph Castain (rhc_at_[hidden])
> Date: 2006-12-12 00:46:59
> 
> Hi Chris
> 
> 
> Some of this is doable with today's code and one of these 
> behaviors is not. :-(
> 
> 
> Open MPI/OpenRTE can be run in "persistent" mode - this 
> allows multiple jobs to share the same allocation. This works 
> much as you describe (syntax is slightly different, of 
> course!) - the first mpirun will map using whatever mode was 
> requested, then the next mpirun will map starting from where 
> the first one left off.
> 
> 
> I *believe* you can run each mpirun in the background. 
> However, I don't know if this has really been tested enough 
> to support such a claim. All testing that I know about 
> to-date has executed mpirun in the foreground - thus, your 
> example would execute sequentially instead of in parallel.
> 
> 
> I know people have tested multiple mpirun's operating in 
> parallel within a single allocation (i.e., persistent mode) 
> where the mpiruns are executed in separate windows/prompts. 
> So I suspect you could do something like you describe - just 
> haven't personally verified it.
> 
> 
> Where we definitely differ is that Open MPI/RTE will *not* 
> block until resources are freed up from the prior mpiruns. 
> Instead, we will attempt to execute each mpirun immediately - 
> and will error out the one(s) that try to execute without 
> sufficient resources. I imagine we could provide the kind of 
> "flow control" you describe, but I'm not sure when that might happen.
> 
> 
> I am (in my copious free time...haha) working on an 
> "orteboot" program that will startup a virtual machine to 
> make the persistent mode of operation a little easier. For 
> now, though, you can do it by:
> 
> 
> 1. starting up the "server" using the following command:
> orted --seed --persistent --scope public [--universe foo]
> 
> 
> 2. do your mpirun commands. They will automagically find the 
> "server" and connect to it. If you specified a universe name 
> when starting the server, then you must specify the same 
> universe name on your mpirun commands.
> 
> 
> When you are done, you will have to (unfortunately) manually 
> "kill" the server and remove its session directory. I have a 
> program called "ortehalt"
> in the trunk that will do this cleanly for you, but it isn't 
> yet in the release distributions. You are welcome to use it, 
> though, if you are working with the trunk - I can't promise 
> it is bulletproof yet, but it seems to be working.
> 
> 
> Ralph
> 
> 
> On 12/11/06 8:07 PM, "Maestas, Christopher Daniel" 
> 
> 

Re: [OMPI users] Test works with 3 computers, but not 4?

2009-07-29 Thread Ralph Castain
Ah, so there is a firewall involved? That is always a problem. I  
gather that node 126 has clear access to all other nodes, but nodes  
122, 123, and 125 do not all have access to each other?


See if your admin is willing to open at least one port on each node  
that can reach all other nodes. It is easiest if it is the same port  
for every node, but not required. Then you can try setting the mca  
params oob_tcp_port_minv4 and oob_tcp_port_rangev4. This should allow  
the daemons to communicate.


Check ompi_info --param oob tcp for info on those (and other) params.
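
As a rough sketch, the two parameters could be passed on the mpirun command line with -mca. The parameter names below are copied as spelled in this message, so check the exact spelling with ompi_info first. The helper name oob_port_args is hypothetical, and the sanity check assumes the range counts ports upward from the minimum.

```sh
#!/bin/sh
# oob_port_args MIN RANGE
# Hypothetical helper: print the -mca options that would restrict the
# daemons' out-of-band TCP ports. Parameter names are taken verbatim from
# the message above; verify them with "ompi_info --param oob tcp".
oob_port_args() {
    # Assumption: the window is MIN .. MIN+RANGE-1 and must fit below 65536.
    if [ "$1" -lt 1 ] || [ "$2" -lt 1 ] || [ $(( $1 + $2 - 1 )) -gt 65535 ]; then
        echo "invalid port window: min=$1 range=$2" >&2
        return 1
    fi
    printf '%s\n' "-mca oob_tcp_port_minv4 $1 -mca oob_tcp_port_rangev4 $2"
}
```

With a port window agreed with the admin (10000 and 100 here are made-up values), the call would look like: mpirun $(oob_port_args 10000 100) -H <hosts> hello-mpi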

Ralph

___
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] Test works with 3 computers, but not 4?

2009-07-29 Thread David Doria
On Wed, Jul 29, 2009 at 4:15 PM, Ralph Castain  wrote:

> Using direct can cause scaling issues as every process will open a socket
> to every other process in the job. You would at least have to ensure you
> have enough file descriptors available on every node.
> The most likely cause is either (a) a different OMPI version getting picked
> up on one of the nodes, or (b) something blocking communication between at
> least one of your other nodes. I would suspect the latter - perhaps a
> firewall or something?
>
> I'm disturbed by your not seeing any error output - that seems strange.
> Try adding --debug-daemons to the cmd line. That should definitely generate
> output from every daemon (at the least, they report they are alive).
>
> Ralph
>

Nifty, I used MPI_Get_processor_name - as you said, this is much more
helpful output. I also checked all the versions and they seem to be fine -
'mpirun -V' says 1.3.3 on all 4 machines.

The output with '-mca routed direct' is now (correctly):
[doriad@daviddoria MPITest]$ mpirun -H 10.1.2.126,10.1.2.122,10.1.2.123,10.1.2.125 -mca routed direct hello-mpi
Process 0 on daviddoria out of 4
Process 1 on cloud3 out of 4
Process 2 on cloud4 out of 4
Process 3 on cloud6 out of 4

Here is the output with --debug-daemons.

Is there a particular port / set of ports I can have my system admin unblock
on the firewall to see if that fixes it?

[doriad@daviddoria MPITest]$ mpirun -H 10.1.2.126,10.1.2.122,10.1.2.123,10.1.2.125 --leave-session-attached --debug-daemons -np 4 hello-mpi


Daemon was launched on cloud3 - beginning to initialize
Daemon [[9461,0],1] checking in as pid 14707 on host cloud3
Daemon [[9461,0],1] not using static ports
[cloud3:14707] [[9461,0],1] orted: up and running - waiting for commands!
Daemon was launched on cloud4 - beginning to initialize
Daemon [[9461,0],2] checking in as pid 5987 on host cloud4
Daemon [[9461,0],2] not using static ports
[cloud4:05987] [[9461,0],2] orted: up and running - waiting for commands!
Daemon was launched on cloud6 - beginning to initialize
Daemon [[9461,0],3] checking in as pid 1037 on host cloud6
Daemon [[9461,0],3] not using static ports
[daviddoria:11061] [[9461,0],0] node[0].name daviddoria daemon 0 arch ffca0200
[daviddoria:11061] [[9461,0],0] node[1].name 10 daemon 1 arch ffca0200
[daviddoria:11061] [[9461,0],0] node[2].name 10 daemon 2 arch ffca0200
[daviddoria:11061] [[9461,0],0] node[3].name 10 daemon 3 arch ffca0200
[daviddoria:11061] [[9461,0],0] orted_cmd: received add_local_procs
[cloud6:01037] [[9461,0],3] orted: up and running - waiting for commands!
[cloud3:14707] [[9461,0],1] node[0].name daviddoria daemon 0 arch ffca0200
[cloud3:14707] [[9461,0],1] node[1].name 10 daemon 1 arch ffca0200
[cloud3:14707] [[9461,0],1] node[2].name 10 daemon 2 arch ffca0200
[cloud3:14707] [[9461,0],1] node[3].name 10 daemon 3 arch ffca0200
[cloud4:05987] [[9461,0],2] node[0].name daviddoria daemon 0 arch ffca0200
[cloud4:05987] [[9461,0],2] node[1].name 10 daemon 1 arch ffca0200
[cloud4:05987] [[9461,0],2] node[2].name 10 daemon 2 arch ffca0200
[cloud4:05987] [[9461,0],2] node[3].name 10 daemon 3 arch ffca0200
[cloud4:05987] [[9461,0],2] orted_cmd: received add_local_procs
[cloud3:14707] [[9461,0],1] orted_cmd: received add_local_procs
[daviddoria:11061] [[9461,0],0] orted_recv: received sync+nidmap from local proc [[9461,1],0]
[daviddoria:11061] [[9461,0],0] orted_cmd: received collective data cmd
[cloud4:05987] [[9461,0],2] orted_recv: received sync+nidmap from local proc [[9461,1],2]
[daviddoria:11061] [[9461,0],0] orted_cmd: received collective data cmd
[cloud4:05987] [[9461,0],2] orted_cmd: received collective data cmd

Any more thoughts?

Thanks,

David


Re: [OMPI users] Test works with 3 computers, but not 4?

2009-07-29 Thread Nifty Tom Mitchell
On Wed, Jul 29, 2009 at 01:42:39PM -0600, Ralph Castain wrote:
> 
> It sounds like perhaps IOF messages aren't getting relayed along the  
> daemons. Note that the daemon on each node does have to be able to send 
> TCP messages to all other nodes, not just mpirun.
>
> Couple of things you can do to check:
>
> 1. -mca routed direct - this will send all messages direct instead of  
> across the daemons
>
> 2. --leave-session-attached - will allow you to see any errors reported 
> by the daemons, including those from attempting to relay messages
>
> Ralph
>
> On Jul 29, 2009, at 1:19 PM, David Doria wrote:
>
>> I wrote a simple program to display "hello world" from each process.
>>
>> When I run this (126 - my machine, 122, and 123), everything works.
>> However, when I run this (126 - my machine, 122, 123, AND 125), I get 
>> no output at all.
>>
>> Is there any way to check what is going on / does anyone know what  

All of the above good stuff and:

Since the set of hosts all work in most of the possible permutations for
the case of three but not four it is possible that your simple program
has an issue in the way it exit(s).

Please post your simple program.  I am looking for the omission of
MPI_Finalize() or a funny return/exit status.


http://www.mcs.anl.gov/research/projects/mpi/mpi-standard/mpi-report-2.0/node32.htm

Also, try adding a sleep(1) after the printf(... "hello world" ...)
and/or after MPI_Finalize() on the chance that there is a race on exit.

Try the "hello world" example in the source package for Open MPI or at: 

http://www.dartmouth.edu/~rc/classes/intro_mpi/hello_world_ex.html

You can also add gethostbyname() or environment variable checks etc.
to make sure that each host is involved as you expect, in contrast to
a nearly anonymous rank number.  Also double-check which mpirun
you are using: the alternatives on your system may be "interesting",
since some distros ship several versions of MPI, so your $PATH/$path
may be important.
$ file /usr/bin/mpirun
/usr/bin/mpirun: symbolic link to `/etc/alternatives/mpi-run'
$ locate bin/mpirun
/usr/bin/mpirun
/usr/bin/mpirun.py
$ rpm -qf /usr/bin/mpirun.py
mpich2-1.1-1.fc10.x86_64
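
The same check can be sketched as a small shell helper. It uses command -v and readlink -f rather than file/locate/rpm as in the transcript above, and it assumes a readlink that supports -f (GNU coreutils, for instance). The name resolve_tool is made up for illustration.

```sh
#!/bin/sh
# resolve_tool NAME
# Hypothetical helper: print the PATH entry found for NAME and the file it
# finally resolves to after following symlinks (e.g. /etc/alternatives).
# Assumes readlink supports -f. The body runs in a subshell so its
# variables stay local.
resolve_tool() (
    found=$(command -v "$1") || found=""
    case $found in
        /*) ;;    # an executable file on disk: continue
        *)  echo "$1: no executable file found in PATH" >&2
            exit 1 ;;
    esac
    printf '%s -> %s\n' "$found" "$(readlink -f "$found")"
)
```

Running "resolve_tool mpirun" on every node and comparing the lines is one way to spot a host that picks up a different MPI installation.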





-- 
T o m  M i t c h e l l 
Found me a new hat, now what?



Re: [OMPI users] Test works with 3 computers, but not 4?

2009-07-29 Thread Ralph Castain
Using direct can cause scaling issues as every process will open a  
socket to every other process in the job. You would at least have to  
ensure you have enough file descriptors available on every node.
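
To put rough numbers on that: if every process opens a socket to every other one, a job of P processes needs P-1 peer sockets per process and P*(P-1)/2 connections in total. A back-of-the-envelope sketch follows; the helper names are hypothetical, and the reserve of 16 descriptors for stdio and other files is an arbitrary assumption.

```sh
#!/bin/sh
# Hypothetical helpers for estimating socket use with direct routing.

# sockets_per_proc P: peer sockets one process opens (one per other process).
sockets_per_proc() { echo $(( $1 - 1 )); }

# total_connections P: number of distinct process pairs in the job.
total_connections() { echo $(( $1 * ($1 - 1) / 2 )); }

# fd_limit_ok P [LIMIT]: succeed if LIMIT (default: this shell's "ulimit -n")
# covers the peer sockets plus an assumed reserve for other descriptors.
fd_limit_ok() (
    reserve=16                      # assumption, not an Open MPI figure
    limit=${2:-$(ulimit -n)}
    [ "$limit" = unlimited ] && exit 0
    [ $(( $1 - 1 + reserve )) -le "$limit" ]
)
```

For the 4-process test in this thread that is 3 sockets per process and 6 connections overall, so descriptors only become a concern at much larger scale.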


The most likely cause is either (a) a different OMPI version getting  
picked up on one of the nodes, or (b) something blocking communication  
between at least one of your other nodes. I would suspect the latter -  
perhaps a firewall or something?


I'm disturbed by your not seeing any error output - that seems  
strange. Try adding --debug-daemons to the cmd line. That should  
definitely generate output from every daemon (at the least, they  
report they are alive).


Ralph





Re: [OMPI users] Test works with 3 computers, but not 4?

2009-07-29 Thread David Doria
On Wed, Jul 29, 2009 at 3:42 PM, Ralph Castain  wrote:

> It sounds like perhaps IOF messages aren't getting relayed along the
> daemons. Note that the daemon on each node does have to be able to send TCP
> messages to all other nodes, not just mpirun.
>
> Couple of things you can do to check:
>
> 1. -mca routed direct - this will send all messages direct instead of
> across the daemons
>
> 2. --leave-session-attached - will allow you to see any errors reported by
> the daemons, including those from attempting to relay messages
>
> Ralph
>
>
Ralph, thanks for the quick response.

With
-mca routed direct
it works correctly.

With this:
mpirun -H 10.1.2.126,10.1.2.122,10.1.2.123,10.1.2.125 --leave-session-attached -np 4 /home/doriad/MPITest/hello-mpi

I still get no output nor errors from the daemons.

Is there a downside to using '-mca routed direct'? Or should I fix whatever
is causing this daemon issue? Do you have any other tests for me to try to see
what's wrong?

Thanks,

David


Re: [OMPI users] Test works with 3 computers, but not 4?

2009-07-29 Thread Ralph Castain
It sounds like perhaps IOF messages aren't getting relayed along the  
daemons. Note that the daemon on each node does have to be able to  
send TCP messages to all other nodes, not just mpirun.


Couple of things you can do to check:

1. -mca routed direct - this will send all messages direct instead of  
across the daemons


2. --leave-session-attached - will allow you to see any errors  
reported by the daemons, including those from attempting to relay  
messages


Ralph





[OMPI users] Test works with 3 computers, but not 4?

2009-07-29 Thread David Doria
I wrote a simple program to display "hello world" from each process.

When I run this (126 - my machine, 122, and 123), everything works fine:
[doriad@daviddoria MPITest]$ mpirun -H 10.1.2.126,10.1.2.122,10.1.2.123 hello-mpi
From process 1 out of 3, Hello World!
From process 2 out of 3, Hello World!
From process 3 out of 3, Hello World!

When I run this (126 - my machine, 122, and 125), everything works fine:
[doriad@daviddoria MPITest]$ mpirun -H 10.1.2.126,10.1.2.122,10.1.2.125 hello-mpi
From process 2 out of 3, Hello World!
From process 1 out of 3, Hello World!
From process 3 out of 3, Hello World!

When I run this (126 - my machine, 123, and 125), everything works fine:
[doriad@daviddoria MPITest]$ mpirun -H 10.1.2.126,10.1.2.123,10.1.2.125 hello-mpi
From process 2 out of 3, Hello World!
From process 1 out of 3, Hello World!
From process 3 out of 3, Hello World!


However, when I run this (126 - my machine, 122, 123, AND 125), I get no
output at all.

Is there any way to check what is going on / does anyone know why that
would happen? I'm using OpenMPI 1.3.3

Thanks,

David


Re: [OMPI users] strange IMB runs

2009-07-29 Thread Dorian Krause

Hi,

--mca mpi_leave_pinned 1

might help. Take a look at the FAQ for various tuning parameters.



  




[OMPI users] strange IMB runs

2009-07-29 Thread Michael Di Domenico
I'm not sure I understand what's actually happened here.  I'm running
IMB on an HP superdome, just comparing the PingPong benchmark

HP-MPI v2.3
Max ~ 700-800MB/sec

OpenMPI v1.3
-mca btl self,sm - Max ~ 125-150MB/sec
-mca btl self,tcp - Max ~ 500-550MB/sec

Is this behavior expected?  Are there any tunables to get the OpenMPI
sockets up near HP-MPI?


Re: [OMPI users] users Digest, Vol 1302, Issue 1

2009-07-29 Thread Ricardo Fonseca
Yes, I am using the right one. I've installed the freshly compiled  
openmpi into /opt/openmpi/1.3.3-g95-32. If I edit the mpif.h file by  
hand and put "error!" in the first line I get:


zamblap:sandbox zamb$ edit /opt/openmpi/1.3.3-g95-32/include/mpif.h
zamblap:sandbox zamb$ mpif77 inplace_test.f90
In file mpif.h:1

Included at inplace_test.f90:7

error!
1
Error: Unclassifiable statement at (1)

(btw, if I use the F90 bindings instead I get a similar problem,  
except the address for the MPI_IN_PLACE fortran constant is slightly  
different from the F77 binding, i.e. instead of 0x50920 I get 0x508e0)


Thanks for your help,

Ricardo



[OMPI users] Jeffrey M Ceason is out of the office.

2009-07-29 Thread Jeffrey M Ceason

I will be out of the office starting  07/28/2009 and will not return until
08/03/2009.

I will respond to your message when I return.



Re: [OMPI users] OMPI users] MPI_IN_PLACE in Fortran with MPI_REDUCE / MPI_ALLREDUCE

2009-07-29 Thread Jeff Squyres

Can you confirm that you're using the right mpif.h?

Keep in mind that each MPI implementation's mpif.h is different --  
it's a common mistake to assume that the mpif.h from one MPI  
implementation should work with another implementation (e.g., someone  
copied mpif.h from one MPI to your software's source tree, so the  
compiler always finds that one instead of the MPI-implementation- 
provided mpif.h.).



On Jul 28, 2009, at 1:17 PM, Ricardo Fonseca wrote:


Hi George

I did some extra digging and found that (for some reason) the  
MPI_IN_PLACE parameter is not being recognized as such by  
mpi_reduce_f (reduce_f.c:61). I added a couple of printfs:


printf(" sendbuf = %p \n", sendbuf );
printf(" MPI_FORTRAN_IN_PLACE = %p \n", &MPI_FORTRAN_IN_PLACE );
printf(" mpi_fortran_in_place = %p \n", &mpi_fortran_in_place );
printf(" mpi_fortran_in_place_ = %p \n", &mpi_fortran_in_place_ );
printf(" mpi_fortran_in_place__ = %p \n", &mpi_fortran_in_place__ );


And this is what I get on node 0:

 sendbuf = 0x50920
 MPI_FORTRAN_IN_PLACE = 0x17cd30
 mpi_fortran_in_place = 0x17cd34
 mpi_fortran_in_place_ = 0x17cd38
 mpi_fortran_in_place__ = 0x17cd3c

This makes OMPI_F2C_IN_PLACE(sendbuf) fail. If I replace the line:

sendbuf = OMPI_F2C_IN_PLACE(sendbuf);

with:

if ( sendbuf == 0x50920 ) {
  printf("sendbuf is MPI_IN_PLACE!\n");
  sendbuf = MPI_IN_PLACE;
}

Then the code works and gives the correct result:

sendbuf is MPI_IN_PLACE!
 Result:
 3. 3. 3. 3.

So my guess is that somehow the MPI_IN_PLACE constant for fortran is  
getting the wrong address. Could this be related to the fortran  
compilers I'm using (ifort / g95)?


Ricardo

---
Prof. Ricardo Fonseca

GoLP - Grupo de Lasers e Plasmas
Instituto de Plasmas e Fusão Nuclear
Instituto Superior Técnico
Av. Rovisco Pais
1049-001 Lisboa
Portugal

tel: +351 21 8419202
fax: +351 21 8464455
web: http://cfp.ist.utl.pt/golp/

On Jul 28, 2009, at 17:00 , users-requ...@open-mpi.org wrote:


Message: 1
Date: Tue, 28 Jul 2009 11:16:34 -0400
From: George Bosilca 
Subject: Re: [OMPI users] OMPI users] MPI_IN_PLACE in Fortran with
MPI_REDUCE / MPI_ALLREDUCE
To: Open MPI Users 



--
Jeff Squyres
jsquy...@cisco.com




Re: [OMPI users] users Digest, Vol 1296, Issue 6

2009-07-29 Thread Josh Hursey
This mailing list supports the Open MPI implementation of the MPI  
standard. If you have concerns about Intel MPI you should contact  
their support group.


The ompi_checkpoint/ompi_restart routines are designed to work with  
Open MPI, and will certainly fail when used with other MPI  
implementations due to library and protocol disparities.


If you are interested in how checkpoint/restart is supported in Open  
MPI, I suggest looking at the User's Guide posted on the wiki page  
below:

  https://svn.open-mpi.org/trac/ompi/wiki/ProcessFT_CR

-- Josh





Re: [OMPI users] users Digest, Vol 1296, Issue 6

2009-07-29 Thread Mallikarjuna Shastry

Dear Sir/Madam,

Kindly tell me the commands for checkpointing and restarting MPI programs
using Intel MPI.

I tried the following commands, but they did not work:

ompi_checkpoint 
ompi_restart file name of global snap shot

with regards

mallikarjuna shastry