Re: [OMPI users] Multiple mpiexec's within a job (schedule within a scheduled machinefile/job allocation)

2009-07-31 Thread Ralph Castain


On Jul 30, 2009, at 4:36 PM, Adams, Brian M wrote:

I found the manual pages for mpirun and orte_hosts, which have a  
pretty thorough description of these features.  Let me know if  
there's anything else I should check out.


My quick impression is that this will meet at least 90% of user needs out
of the box, as most (all?) users will run with a number of job processors
that divides into PPN or is a multiple of PPN.


The docs imply that the relative indexing is relative to nodes as
opposed to slots. If so, it's easy to cover my cases 1 and 2 with
modulo arithmetic right on the command line, but the edge case 3
might require creating host files or a map file specifying particular
nodes and slots. I'm still trying to fully wrap my head around the way
the scheduling works and the nodes/slots distinction, so I have to
think on this. For example, in a PBS/Torque environment, I believe
each processor gets an entry as a "node" in the $PBS_NODEFILE, so
maybe this won't be an issue for many of our machines.


Just as an FYI: the PBS_NODEFILE contains one entry for each -slot-
on a node. So if you have 4 slots on a node, the node's name appears
in the file 4 times (repeated sequentially). When we read it to
compute the number of slots available, we collapse that info. If you
give it to us as the sequential mapper file, we will map ranks n
through n+3 on that node.
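
You can see the collapsed view yourself with standard shell tools; for
illustration (node names hypothetical), on a 2-node, 4-slot-per-node
allocation:

  % sort $PBS_NODEFILE | uniq -c
        4 nodeA
        4 nodeB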




Another potential gotcha: the scheduler driving my M processes ideally
runs asynchronously (since not all compute jobs take equal time as
parameters vary), so initially jobs 0 through M-1 could land on
resource blocks 0 through M-1 in order, but if job 2 finishes first,
I'd want to put job M on block 2. So I might end up needing some glue
code anyway for the dynamic scheduling case. What's in ORTE today
will likely work great in the static scheduling case.


I'll give some more thought to the dynamic scheduling case. Like I
said, this has come up here at LANL too, so it is a common problem
between us. Something else you might want to consider in the interim:
there is a "tools" interface in the ORTE library that provides simple
commands to (a) find out all the mpirun instances currently running,
and (b) see which nodes and processes each is using. It isn't quite as
clean for what you are trying to do, since it was intended mostly for
determining status, but you could parse the info to get what you need.


Might make that glue easier, at the least.

I'll get back to you with some ideas on how ORTE could do this  
natively. Part of the issue, of course, is figuring out a cmd syntax  
that isn't too burdensome :-)
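
In the meantime, the glue might be as simple as something like this
untested bash sketch. It assumes a hypothetical jobs.txt (one app
command line per job) and carves 4 nodes x 8 ppn into 16 two-slot
blocks using the relative indexing discussed above:

  NBLOCKS=16
  exec 3< jobs.txt                    # hypothetical job list
  while :; do
    for b in $(seq 0 $((NBLOCKS - 1))); do
      # skip any block whose mpirun is still running
      if [ -n "${pid[b]:-}" ] && kill -0 "${pid[b]}" 2>/dev/null; then
        continue
      fi
      # refill the freed (or never-used) block with the next job
      if ! read -r -u 3 job; then
        eof=1
        continue
      fi
      mpirun -n 2 -host +n$((b / 4)):2 $job &   # $job split on purpose
      pid[b]=$!
    done
    [ -n "${eof:-}" ] && break        # job list drained; stop refilling
    sleep 5
  done
  wait                                # let the last mpiruns finish

Block b owns 2 slots on node b/4, so four blocks share each 8-ppn node.
The four mpiruns sharing a node don't coordinate with each other, but in
aggregate they total exactly 8 processes, so nothing is oversubscribed.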





Re: [OMPI users] Multiple mpiexec's within a job (schedule within a scheduled machinefile/job allocation)

2009-07-30 Thread Adams, Brian M
I found the manual pages for mpirun and orte_hosts, which have a pretty 
thorough description of these features.  Let me know if there's anything else I 
should check out.

My quick impression is that this will meet at least 90% of user needs out
of the box, as most (all?) users will run with a number of job processors
that divides into PPN or is a multiple of PPN.

The docs imply that the relative indexing is relative to nodes as opposed
to slots. If so, it's easy to cover my cases 1 and 2 with modulo arithmetic
right on the command line, but the edge case 3 might require creating host
files or a map file specifying particular nodes and slots. I'm still trying
to fully wrap my head around the way the scheduling works and the
nodes/slots distinction, so I have to think on this. For example, in a
PBS/Torque environment, I believe each processor gets an entry as a "node"
in the $PBS_NODEFILE, so maybe this won't be an issue for many of our machines.
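
For instance, for my case 2 (M=16 two-processor jobs on 4 nodes x 8 ppn),
I'd guess job j could be placed with nothing more than something like
this (untested):

  j=5                                  # hypothetical job index, 0..15
  mpirun -n 2 -host +n$((j % 4)):2 app

wrapping the jobs around the four relative-indexed nodes with no hostfile
at all. The four jobs assigned to one node wouldn't coordinate, but
together they would exactly fill its 8 slots.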

Another potential gotcha: the scheduler driving my M processes ideally runs
asynchronously (since not all compute jobs take equal time as parameters
vary), so initially jobs 0 through M-1 could land on resource blocks 0
through M-1 in order, but if job 2 finishes first, I'd want to put job M on
block 2. So I might end up needing some glue code anyway for the dynamic
scheduling case. What's in ORTE today will likely work great in the static
scheduling case.

Brian




Re: [OMPI users] Multiple mpiexec's within a job (schedule within a scheduled machinefile/job allocation)

2009-07-30 Thread Ralph Castain
Let me know how it goes, if you don't mind. It would be nice to know  
if we actually met your needs, or if a tweak might help make it easier.


Thanks
Ralph


Re: [OMPI users] Multiple mpiexec's within a job (schedule within a scheduled machinefile/job allocation)

2009-07-30 Thread Adams, Brian M
Thanks Ralph, I wasn't aware of the relative indexing or sequential mapper 
capabilities.  I will check those out and report back if I still have a feature 
request. -- Brian



Re: [OMPI users] Multiple mpiexec's within a job (schedule within a scheduled machinefile/job allocation)

2009-07-30 Thread Ralph Castain


On Jul 30, 2009, at 11:49 AM, Adams, Brian M wrote:

Apologies if I'm being confusing; I'm probably trying to get at
atypical use cases. M and N need not correspond to the number of
nodes/ppn nor ppn/nodes available. By-node vs. by-slot doesn't much
matter, as long as in the end I don't oversubscribe any node. By-slot
might be good for efficiency in some apps, but I can't make a general
case for it.


I think what you proposed offers some help in the case where N is an
integer multiple of the number of available nodes, but perhaps not in
other cases. I must be missing something here, so instead of being
fully general, perhaps consider a specific case. Suppose we have 4
nodes, 8 ppn (32 slots, I think, in OMPI language). I might want to
schedule, for example:


1. M=2 simultaneous N=16-processor jobs: here I believe what you
suggested will work, since N is a multiple of the available number of
nodes. I could use either -npernode 4 or just -bynode and, I think,
get the same result: an even distribution of tasks. (Similar applies
to, e.g., 8x4 and 4x8.)


Yes, agreed



2. M=16 simultaneous N=2-processor jobs: it seems that if I use
-bynode or -npernode, I would end up with 16 processes on each of the
first two nodes (similar applies to, e.g., 32x1 or 10x3). Scheduling
many small jobs is a common problem for us.




3. M=3 simultaneous N=10-processor jobs: I think we'd end up with
this distribution (where A-D are nodes and 0-2 are jobs):


A 0 0 0 1 1 1 2 2 2
B 0 0 0 1 1 1 2 2 2
C 0 0   1 1   2 2
D 0 0   1 1   2 2

where A and B are oversubscribed, and there are more unused slots than
the two I'd expect in the whole allocation.


Again, I can manage all these via a script that partitions the
machine files; I'm just wondering which scenarios Open MPI can manage.




Have you looked at the relative indexing in 1.3.3? You could specify
any of these in relative-index terms, and have one "hostfile" that
would support 16x2 operations. This would then work for any allocation.


Your launch script could even just do it, something like this:

mpirun -n 2 -host +n0:1,+n1:1 app
mpirun -n 2 -host +n0:2,+n1:2 app

etc. Obviously, you could compute the relative indexing and just stick  
it in as required.


Likewise, you could use the new "seq" (sequential) mapper to achieve  
any desired layout, again utilizing relative indexing to avoid having  
to create a special hostfile for each run.
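
For example (a sketch; the exact file syntax is in the orte_hosts man
page), a file "seqfile" like

  +n0
  +n0
  +n1

used with

  mpirun -np 3 -mca rmaps seq -hostfile seqfile app

should put ranks 0 and 1 on the first node of the allocation and rank 2
on the second, whatever nodes those happen to be.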


Note that in all cases, you can specify -n N, which tells OMPI to
execute only N processes, regardless of what is in the sequential
mapper file or -host.


If none of those work well, please let me know. I'm happy to create  
the required capability as I'm sure LANL will use it too (know of  
several similar cases here, but the current options seem okay for them).





Re: [OMPI users] Multiple mpiexec's within a job (schedule within a scheduled machinefile/job allocation)

2009-07-30 Thread Adams, Brian M
Apologies if I'm being confusing; I'm probably trying to get at atypical
use cases. M and N need not correspond to the number of nodes/ppn nor
ppn/nodes available. By-node vs. by-slot doesn't much matter, as long as
in the end I don't oversubscribe any node. By-slot might be good for
efficiency in some apps, but I can't make a general case for it.

I think what you proposed offers some help in the case where N is an
integer multiple of the number of available nodes, but perhaps not in
other cases. I must be missing something here, so instead of being fully
general, perhaps consider a specific case. Suppose we have 4 nodes, 8 ppn
(32 slots, I think, in OMPI language). I might want to schedule, for example:

1. M=2 simultaneous N=16-processor jobs: here I believe what you suggested
will work, since N is a multiple of the available number of nodes. I could
use either -npernode 4 or just -bynode and, I think, get the same result:
an even distribution of tasks. (Similar applies to, e.g., 8x4 and 4x8.)

2. M=16 simultaneous N=2-processor jobs: it seems that if I use -bynode or
-npernode, I would end up with 16 processes on each of the first two nodes
(similar applies to, e.g., 32x1 or 10x3). Scheduling many small jobs is a
common problem for us.

3. M=3 simultaneous N=10-processor jobs: I think we'd end up with this
distribution (where A-D are nodes and 0-2 are jobs):

A 0 0 0 1 1 1 2 2 2
B 0 0 0 1 1 1 2 2 2
C 0 0   1 1   2 2
D 0 0   1 1   2 2

where A and B are oversubscribed, and there are more unused slots than the
two I'd expect in the whole allocation.

Again, I can manage all these via a script that partitions the machine
files; I'm just wondering which scenarios Open MPI can manage.
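
(For the record, that script amounts to something like this untested
sketch, relying on $PBS_NODEFILE having one line per slot and assuming
GNU split for the -d numeric suffixes:

  split -l 2 -d $PBS_NODEFILE block.   # M two-slot machinefiles
  for f in block.*; do
    mpirun -np 2 -machinefile $f app &
  done
  wait

with one mpirun pointed at each block file.)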

Thanks!
Brian


Re: [OMPI users] Multiple mpiexec's within a job (schedule within a scheduled machinefile/job allocation)

2009-07-29 Thread Ralph Castain

Oh my - that does take me back a long way! :-)

Do you need these processes to be mapped byslot (i.e., do you care if  
the process ranks are sharing nodes)? If not, why not add "-bynode" to  
your cmd line?


Alternatively, given the mapping you want, just do

mpirun -npernode 1 application.exe

This would launch one copy on each of your N nodes. So if you fork M  
times, you'll wind up with the exact pattern you wanted. And, as each  
one exits, you could immediately launch a replacement without worrying  
about oversubscription.
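
Concretely, the forking driver could be as trivial as this sketch (with
M=8 on 8-ppn nodes, every slot ends up filled exactly once):

  M=8
  for j in $(seq 1 $M); do
    mpirun -npernode 1 application.exe &
  done
  wait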


Does that help?
Ralph

PS. we dropped that "persistent" operation - caused way too many  
problems with cleanup and other things. :-)



Re: [OMPI users] Multiple mpiexec's within a job (schedule within a scheduled machinefile/job allocation)

2009-07-29 Thread Adams, Brian M
Hi Ralph (all),

I'm resurrecting this 2006 thread for a status check.  The new 1.3.x
machinefile behavior is great (thanks!) -- I can use machinefiles to manage
multiple simultaneous mpiruns within a single Torque allocation (where the
hosts are a subset of $PBS_NODEFILE).  However, this requires some careful
management of machinefiles.

I'm curious whether Open MPI now directly supports the behavior I need,
described in general in the quote below.  Specifically, given a single
PBS/Torque allocation of M*N processors, I will run a serial program that
forks M times.  Each of the M forked processes calls 'mpirun -np N
application.exe' and blocks until completion.  This seems akin to the case
you described of "mpiruns executed in separate windows/prompts."

What I'd like to see is the M processes "tiled" across the available slots, so 
all M*N processors are used.  What I see instead appears at face value to be 
the first N resources being oversubscribed M times.  

Also, when one of the forked processes returns, I'd like to be able to spawn 
another and have its mpirun schedule on the resources freed by the previous one 
that exited.  Is any of this possible?

I tried starting an orted (1.3.3, roughly as you suggested below), but got this 
error:

> orted --daemonize
[gy8:25871] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
runtime/orte_init.c at line 125
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[gy8:25871] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file 
orted/orted_main.c at line 323

I spared the debugging info as I'm not even sure this is a correct invocation...

Thanks for any suggestions you can offer!
Brian
--
Brian M. Adams, PhD (bria...@sandia.gov)
Optimization and Uncertainty Quantification
Sandia National Laboratories, Albuquerque, NM
http://www.sandia.gov/~briadam



Re: [OMPI users] Multiple mpiexec's within a job (schedule within a scheduled machinefile/job allocation)

2006-12-12 Thread Ralph Castain
Hi Chris

Some of this is doable with today's code... and one of these behaviors is
not. :-(

Open MPI/OpenRTE can be run in "persistent" mode - this allows multiple jobs
to share the same allocation. This works much as you describe (syntax is
slightly different, of course!) - the first mpirun will map using whatever
mode was requested, then the next mpirun will map starting from where the
first one left off.

I *believe* you can run each mpirun in the background. However, I don't know
if this has really been tested enough to support such a claim. All testing
that I know about to-date has executed mpirun in the foreground - thus, your
example would execute sequentially instead of in parallel.

I know people have tested multiple mpirun's operating in parallel within a
single allocation (i.e., persistent mode) where the mpiruns are executed in
separate windows/prompts. So I suspect you could do something like you
describe - just haven't personally verified it.

Where we definitely differ is that Open MPI/RTE will *not* block until
resources are freed up from the prior mpiruns. Instead, we will attempt to
execute each mpirun immediately - and will error out the one(s) that try to
execute without sufficient resources. I imagine we could provide the kind of
"flow control" you describe, but I'm not sure when that might happen.

I am (in my copious free time...haha) working on an "orteboot" program that
will startup a virtual machine to make the persistent mode of operation a
little easier. For now, though, you can do it by:

1. starting up the "server" using the following command:
orted --seed --persistent --scope public [--universe foo]

2. do your mpirun commands. They will automagically find the "server" and
connect to it. If you specified a universe name when starting the server,
then you must specify the same universe name on your mpirun commands.
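
Putting the two steps together, a session might look like this (an
untested sketch; "foo" is just the example universe name from above, and
I'm assuming mpirun accepts the same --universe option):

  orted --seed --persistent --scope public --universe foo &
  mpirun --universe foo -np 2 myprog &
  mpirun --universe foo -np 2 myprog2 &
  wait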

When you are done, you will have to (unfortunately) manually "kill" the
server and remove its session directory. I have a program called "ortehalt"
in the trunk that will do this cleanly for you, but it isn't yet in the
release distributions. You are welcome to use it, though, if you are working
with the trunk - I can't promise it is bulletproof yet, but it seems to be
working.

Ralph







[OMPI users] Multiple mpiexec's within a job (schedule within a scheduled machinefile/job allocation)

2006-12-11 Thread Maestas, Christopher Daniel
Hello,

Sometimes we have users who like to do the following from within a single
job (think: scheduling within a job scheduler allocation):
"mpiexec -n X myprog"
"mpiexec -n Y myprog2"
Does mpiexec within Open MPI keep track of the node list it is using if
it binds to a particular scheduler?
For example with 4 nodes (2ppn SMP):
"mpiexec -n 2 myprog"
"mpiexec -n 2 myprog2"
"mpiexec -n 1 myprog3"
Assuming by-slot allocation, we would have the following:
node1 - processor1 - myprog
      - processor2 - myprog
node2 - processor1 - myprog2
      - processor2 - myprog2
And for a by-node allocation:
node1 - processor1 - myprog
      - processor2 - myprog2
node2 - processor1 - myprog
      - processor2 - myprog2
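
(I believe those two placements correspond to Open MPI's mapping options,
e.g.:

mpiexec -n 2 -byslot myprog
mpiexec -n 2 -bynode myprog

but whether successive mpiexec's coordinate is exactly my question.)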

I think this is possible using ssh, because it shouldn't really matter how
many times it spawns, but with something like Torque it would get
restricted to a max process launch of 4.  We would want the third mpiexec
to block and eventually run on the first available node allocation that
frees up from myprog or myprog2.

For Torque, for example, we had to add the following to OSC mpiexec:
---
   Finally, since only one mpiexec can be the master at a time, if your
   code setup requires that mpiexec exit to get a result, you can start
   a "dummy" mpiexec first in your batch job:

       mpiexec -server

   It runs no tasks itself but handles the connections of other
   transient mpiexec clients.  It will shut down cleanly when the batch
   job exits, or you may kill the server explicitly.  If the server is
   killed with SIGTERM (or HUP or INT), it will exit with a status of
   zero if there were no clients connected at the time.  If there were
   still clients using the server, the server will kill all their
   tasks, disconnect from the clients, and exit with status 1.
---

So a user ran:
mpiexec -server
mpiexec -n 2 myprog
mpiexec -n 2 myprog2
And the server kept track of the allocation ... I would think that the
orted could do this? 

Sorry if this sounds confusing ... But I'm sure it will clear up with
any further responses I make. :-)
-cdm